SonicVisionLM is a novel framework for generating sound effects for silent videos by leveraging vision-language models (VLMs). Rather than generating sound directly from visual representations, which is difficult, it first uses a VLM to identify the events in the video and recommend suitable sounds. This reframes the hard problem of aligning visual and audio representations as two simpler sub-problems: image-to-text alignment via the VLM and text-to-audio generation via diffusion models. The system is further strengthened by a large dataset mapping text descriptions to specific sound effects and by temporally controlled audio adapters, which improve the quality of the recommended audio. As a result, SonicVisionLM outperforms current video-to-audio methods, yielding better synchronization and alignment between the audio and video.
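
The two-stage idea can be illustrated with off-the-shelf stand-ins. The sketch below is not the paper's implementation: BLIP substitutes for SonicVisionLM's VLM, AudioLDM substitutes for its text-to-audio diffusion model with temporally controlled adapters, and the file names (`keyframe.jpg`, `sound_effect.wav`) are hypothetical.

```python
# Minimal sketch of the decoupled pipeline: video frame -> text (VLM),
# then text -> audio (diffusion). Stand-in models; assumptions noted above.
import torch
import scipy.io.wavfile as wavfile
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import AudioLDMPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: describe the visual event with a vision-language model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

frame = Image.open("keyframe.jpg")  # a representative frame from the silent video
inputs = processor(frame, return_tensors="pt").to(device)
caption = processor.decode(
    captioner.generate(**inputs, max_new_tokens=30)[0],
    skip_special_tokens=True,
)
print("Detected event:", caption)

# Stage 2: synthesize a matching sound effect with a text-to-audio
# diffusion model, conditioned on the caption from stage 1.
tta = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2").to(device)
audio = tta(caption, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

# AudioLDM generates waveforms at 16 kHz.
wavfile.write("sound_effect.wav", rate=16000, data=audio)
```

Decoupling the stages this way means each half can lean on strong pretrained models, which is the core argument of the paper; what the sketch omits is the temporal control that lets SonicVisionLM place each sound at the right moment in the video.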

Publication date: 9 Jan 2024
Project Page: https://yusiissy.github.io/SonicVisionLM.github.io/
Paper: https://arxiv.org/pdf/2401.04394