The paper introduces ESPnet-SPK, a toolkit for training speaker embedding extractors. It gives the speaker recognition community an open-source platform for building models, ranging from x-vector to the more recent SKA-TDNN. The paper also covers the integration of diverse self-supervised learning features and provides a reproducible recipe that achieves a 0.39% equal error rate (EER) on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN. The toolkit also aims to connect the developed models with other domains, so that the broader research community can readily incorporate state-of-the-art embedding extractors.
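
For readers unfamiliar with the equal error rate quoted above, the following is a minimal sketch of how such a figure can be computed from verification trial scores. It is not ESPnet-SPK code: the function names `cosine_similarity` and `equal_error_rate` are hypothetical, scoring trials by cosine similarity between embeddings is an assumption, and the toy scores are illustrative rather than taken from the paper.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the scores.

    scores: one similarity score per trial (higher means "same speaker").
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    A trial is accepted when its score is >= the threshold.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # False acceptance: a non-target trial scored at or above the threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        # False rejection: a target trial scored below the threshold.
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        # The EER is the operating point where the two error rates meet;
        # with finitely many trials we take the closest crossing.
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (illustrative values only, not from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {toy_eer:.2%}")
```

On a real protocol such as Vox1-O, each score would come from comparing the embeddings of the two utterances in a trial, and a lower EER indicates better separation between same-speaker and different-speaker pairs.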

Publication date: 31 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.17230