This paper presents Video2Music, a novel generative AI framework capable of producing music that matches the emotional content of a given video. The researchers first collected a unique set of music videos and extracted video features such as semantics, motion, and emotion, which serve as guiding input to the music generation model. The accompanying audio was transcribed into MIDI and chords, and features such as note density and loudness were extracted. The result is a rich multimodal dataset, MuVi-Sync, used to train the Affective Multimodal Transformer (AMT) model. AMT enforces affective similarity between the video and the generated music, and a post-processing step renders the generated chords dynamically with varying rhythm and volume. A user study confirmed that the generated music matches the accompanying video well.
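
To make the video-conditioned generation step concrete, below is a minimal PyTorch-style sketch of a Transformer decoder that produces chord tokens while attending to per-frame video features. All module names, dimensions, and the chord vocabulary size are illustrative assumptions for this summary, not the authors' released implementation of AMT.

```python
# Minimal sketch: chord-token generation conditioned on video features.
# Dimensions (d_model, video_feat_dim, chord vocabulary) are hypothetical.
import torch
import torch.nn as nn

class VideoConditionedChordModel(nn.Module):
    def __init__(self, n_chords=64, d_model=256, video_feat_dim=768,
                 n_heads=4, n_layers=3):
        super().__init__()
        self.chord_emb = nn.Embedding(n_chords, d_model)      # chord token embeddings
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # project video features
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_chords)               # logits over chord vocab

    def forward(self, chord_tokens, video_feats):
        # chord_tokens: (B, T_music) integer chord ids, generated autoregressively
        # video_feats:  (B, T_video, video_feat_dim) per-frame visual features
        tgt = self.chord_emb(chord_tokens)
        memory = self.video_proj(video_feats)
        # Causal mask so each chord step only attends to previous chords.
        causal = nn.Transformer.generate_square_subsequent_mask(
            chord_tokens.size(1)).to(chord_tokens.device)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(h)

# Toy usage: 8 chord steps conditioned on 16 video frames.
model = VideoConditionedChordModel()
logits = model(torch.randint(0, 64, (2, 8)), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 8, 64])
```

The cross-attention over projected video features mirrors the idea described above: the emotional and visual content of the video guides which chords the model generates at each step, while the dynamic rendering (rhythm and volume) would be handled separately in post-processing.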

 

Publication date: 3 Nov 2023
arXiv page: https://arxiv.org/abs/2311.00968
Paper (PDF): https://arxiv.org/pdf/2311.00968