The research focuses on linking sheet music images to audio recordings, a key challenge for building efficient cross-modal music retrieval systems. It explores a method that learns a cross-modal embedding space via deep neural networks to connect short snippets of audio and sheet music. The scarcity of annotated real musical data limits how well such methods generalize to real retrieval scenarios. The paper therefore investigates whether this limitation can be mitigated with self-supervised contrastive learning, by exposing the network to a large amount of real music data as a pre-training step. The results show that pre-trained models retrieve snippets with higher precision across all scenarios and pre-training configurations.
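To make the idea of a contrastive cross-modal embedding concrete, below is a minimal, illustrative sketch of an InfoNCE-style objective over paired audio and sheet-music snippet embeddings. The function and parameter names (`contrastive_retrieval_loss`, `temperature`) are hypothetical and the actual loss, architecture, and pre-training setup in the paper may differ; this only shows the general pattern of pulling matching audio/sheet pairs together while treating other pairs in the batch as negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(audio_emb, sheet_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss for cross-modal retrieval.

    audio_emb: (B, D) embeddings of short audio snippets
    sheet_emb: (B, D) embeddings of the corresponding sheet-music snippets
    Matching pairs share the same batch index; all other pairs act as negatives.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    sheet_emb = F.normalize(sheet_emb, dim=-1)

    # Pairwise cosine similarities scaled by a temperature.
    logits = audio_emb @ sheet_emb.t() / temperature  # (B, B)
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric loss: audio-to-sheet and sheet-to-audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```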


Publication date: 25 Sep 2023
Project Page: https://github.com/luisfvc/ucasr
Paper: https://arxiv.org/pdf/2309.12134