The researchers developed Video2Music, a generative music AI framework that produces music matched to a provided video. They assembled a unique collection of music videos, from which they extracted semantic, scene offset, motion, and emotion features; these features serve as guiding input to the music generation model. They also transcribed the audio tracks into MIDI and chords, and extracted additional features such as note density and loudness. The result is a rich multimodal dataset called MuVi-Sync, which was used to train an Affective Multimodal Transformer (AMT) model. The model includes a novel mechanism that enforces affective similarity between the video and the generated music. A user study confirmed that the generated music matches the video content in terms of emotion.
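To make the idea of conditioning music generation on video features with an affect-matching objective more concrete, here is a minimal PyTorch sketch. It is not the authors' AMT architecture; the class names, dimensions, the valence/arousal affect head, and the loss weighting are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): a Transformer decoder
# generates music tokens conditioned on per-frame video features, and an
# auxiliary affect head lets a loss term pull the music's predicted emotion
# toward the video's emotion (e.g. valence/arousal targets).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionedMusicDecoder(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, video_feat_dim=64,
                 n_heads=4, n_layers=3):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)
        # Hypothetical head predicting (valence, arousal) from music states.
        self.affect_head = nn.Linear(d_model, 2)

    def forward(self, music_tokens, video_feats):
        # music_tokens: (B, T_music) int64; video_feats: (B, T_video, feat_dim)
        tgt = self.token_emb(music_tokens)
        memory = self.video_proj(video_feats)          # video features as memory
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(h), self.affect_head(h.mean(dim=1))


def training_losses(model, music_tokens, video_feats, video_affect):
    """Next-token loss plus an affect-matching penalty (weight is arbitrary)."""
    logits, pred_affect = model(music_tokens[:, :-1], video_feats)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         music_tokens[:, 1:].reshape(-1))
    affect = F.mse_loss(pred_affect, video_affect)
    return ce + 0.1 * affect


# Toy usage with random tensors, just to show the shapes involved.
model = ConditionedMusicDecoder()
tokens = torch.randint(0, 512, (2, 33))   # music token sequences
feats = torch.randn(2, 40, 64)            # per-frame video features
affect = torch.rand(2, 2)                 # video valence/arousal targets
loss = training_losses(model, tokens, feats, affect)
loss.backward()
```

The key design idea this sketch is meant to convey is simply that the video features enter as the decoder's memory (cross-attention input), while the affect term gives the model an explicit reason to track the video's emotion rather than only predicting plausible music tokens.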
Publication date: 8 Nov 2023
Project Page: https://arxiv.org/abs/2311.00968
Paper: https://arxiv.org/pdf/2311.00968