The article presents an extension of SURT (Streaming Unmixing and Recognition Transducer) that performs speaker-attributed transcription in multi-talker speech recognition. The authors add an auxiliary speaker branch to SURT and propose methods that handle both short mixtures and long recordings. The updated model keeps relative speaker labels consistent across the different utterance groups in a recording. The approach was evaluated in experiments on synthetic LibriSpeech mixtures and demonstrated on the AMI corpus.

Publication date: 31 Jan 2024
Project Page: Unavailable
Paper: https://arxiv.org/pdf/2401.15676