The paper presents a framework for training singer identity encoders that extract representations suited to singing-related tasks. Several self-supervised learning techniques are explored on a large collection of isolated vocal tracks. The resulting representations are evaluated on singer similarity and singer identification tasks across multiple datasets. The proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice, while operating at a 44.1 kHz sampling rate.
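
To make the evaluation idea concrete, here is a minimal sketch of how singer similarity and identification could be scored once embeddings exist. It assumes cosine similarity as the comparison metric and nearest-reference matching for identification; the summary does not state the paper's exact protocol. The function names, singer labels, and toy 3-dimensional vectors are all hypothetical, and real encoder outputs would be far higher-dimensional.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_singer(query, references):
    """Return the label of the reference embedding most similar to the query."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))


# Toy embeddings invented for illustration; they stand in for encoder outputs.
references = {
    "singer_a": [0.9, 0.1, 0.0],
    "singer_b": [0.0, 0.2, 0.95],
}
query = [0.8, 0.2, 0.1]

# The query points mostly along the same direction as singer_a's embedding.
best_match = identify_singer(query, references)
```

In this sketch, an encoder that separates singers well would give same-singer pairs a higher similarity than different-singer pairs, which is the property a singer similarity task measures.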

Publication date: 11 Jan 2024
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2401.05064