This paper presents Video2Music, a novel AI framework for generating music that matches the emotion of a provided video. The authors first curated a collection of music videos and extracted semantic, scene offset, motion, and emotion features from them, releasing the result as a new dataset, MuVi-Sync. These features serve as input to an Affective Multimodal Transformer (AMT) model, which generates music while enforcing affective similarity between the video and the generated output. A post-processing step then renders the music with dynamically varying rhythm and volume. A user study confirmed that the framework generates music matching the emotional content of the video.
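As a rough illustration of the feature-extraction step described above, the sketch below approximates motion as the mean absolute difference between consecutive grayscale frames and derives a per-frame scene-offset counter from a simple cut detector. This is not the authors' code: the function name, the threshold value, and the use of OpenCV are illustrative assumptions, and the paper's semantic and emotion features (which come from pretrained models) are not reproduced here.

```python
import cv2
import numpy as np


def extract_video_features(path: str, cut_threshold: float = 30.0):
    """Return simplified per-frame motion and scene-offset features."""
    cap = cv2.VideoCapture(path)
    prev_gray = None
    motion, scene_offset = [], []
    frames_since_cut = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Motion proxy: mean absolute difference to the previous frame.
        diff = 0.0 if prev_gray is None else float(np.mean(cv2.absdiff(gray, prev_gray)))
        # Crude scene-change detector: a large jump in frame difference
        # resets the "frames since last cut" counter (the scene offset).
        if diff > cut_threshold:
            frames_since_cut = 0
        motion.append(diff)
        scene_offset.append(frames_since_cut)
        frames_since_cut += 1
        prev_gray = gray
    cap.release()
    return np.array(motion), np.array(scene_offset)
```

In the full pipeline these per-frame features would be aligned with the music's time grid and concatenated with the semantic and emotion embeddings before being fed to the Transformer.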
Publication date: 3 Nov 2023
arXiv abstract: https://arxiv.org/abs/2311.00968v1
Paper: https://arxiv.org/pdf/2311.00968