The paper presents SLIDAR, a framework for joint speaker diarization and automatic speech recognition. SLIDAR can process inputs of arbitrary length and handle an arbitrary number of speakers. It takes a sliding-window approach built around an end-to-end diarization-augmented speech transcription model, which produces transcripts, diarization, and speaker embeddings locally for each window. The speaker embeddings are then clustered across windows to reconcile speaker identities and assemble the final, recording-level result. The authors evaluated the method on monaural recordings and report that it is effective in both close-talk and far-field speech scenarios.
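
The window-then-cluster idea described above can be illustrated with a short sketch. This is not the authors' implementation: the helper names (`sliding_windows`, `cluster_embeddings`, `merge_windows`), the greedy cosine-similarity clustering, the 0.8 threshold, and the toy two-dimensional embeddings are all assumptions chosen for illustration. The summary does not say which clustering algorithm or window settings SLIDAR uses, and the per-window model is replaced here by hand-written outputs.

```python
import math


def sliding_windows(num_samples, window, hop):
    """Yield (start, end) sample ranges that cover an input of any length."""
    start = 0
    while True:
        end = min(start + window, num_samples)
        yield start, end
        if end == num_samples:
            break
        start += hop


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering (an assumed stand-in, not the paper's method).

    Each embedding joins the most similar existing centroid if the cosine
    similarity reaches the threshold; otherwise it starts a new cluster,
    so the number of speakers does not need to be known in advance.
    """
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            # Update the running mean of the matched cluster.
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels


def merge_windows(window_outputs, threshold=0.8):
    """Map window-local speakers to global speaker labels via clustering."""
    flat = [(w, seg) for w, segs in enumerate(window_outputs) for seg in segs]
    labels = cluster_embeddings([seg["embedding"] for _, seg in flat], threshold)
    return [
        {"window": w, "speaker": f"spk{label}", "text": seg["text"]}
        for (w, seg), label in zip(flat, labels)
    ]


# Hypothetical per-window outputs standing in for the local model.
window_outputs = [
    [{"embedding": [1.0, 0.0], "text": "hello"},
     {"embedding": [0.0, 1.0], "text": "hi"}],
    [{"embedding": [0.1, 0.9], "text": "how are you"},
     {"embedding": [0.95, 0.05], "text": "fine"}],
]
merged = merge_windows(window_outputs)
windows = list(sliding_windows(10, 4, 2))
```

In this toy input, the first speaker of window 0 and the second speaker of window 1 have nearly parallel embeddings, so they receive the same global label even though their window-local order differs. That reconciliation of speaker identities across windows is the step the summary attributes to clustering.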

Publication date: 4 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.01688