This article discusses NTT Corporation’s speaker diarization system, designed for multi-domain, multi-microphone casual conversations. The system applies weighted prediction error (WPE)-based dereverberation, runs end-to-end neural diarization with vector clustering (EEND-VC) on each channel separately, and integrates the per-channel results using diarization output voting error reduction plus overlap (DOVER-Lap). The system was part of NTT’s submission to the CHiME-7 challenge, where it achieved significant improvements over the baseline diarization system. The article also reviews the challenges of speaker diarization and the main approaches to it, including vector clustering and end-to-end neural diarization.
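
The channel-integration step can be illustrated with a toy example. The sketch below is not DOVER-Lap itself and is not taken from NTT's system: it assumes each channel's diarization output is already a binary frames-by-speakers matrix with the same number of speakers, aligns speaker labels to the first channel by brute-force permutation, and then takes an unweighted majority vote per frame and speaker. The function names `align_to_reference` and `vote_across_channels` are hypothetical, and the rank-based weighting and overlap-aware speaker counting of the real algorithm are omitted.

```python
from itertools import permutations

import numpy as np


def align_to_reference(ref, hyp):
    """Permute the speaker columns of `hyp` to best match `ref`.

    Both inputs are binary arrays of shape (frames, speakers).
    Brute force over permutations, so only suitable for a few speakers.
    """
    n_spk = ref.shape[1]
    best = max(
        permutations(range(n_spk)),
        # Score a permutation by the number of frames where both agree on activity.
        key=lambda p: int((ref * hyp[:, list(p)]).sum()),
    )
    return hyp[:, list(best)]


def vote_across_channels(channel_outputs):
    """Combine per-channel outputs by strict majority vote (simplified assumption)."""
    ref = channel_outputs[0]
    aligned = [ref] + [align_to_reference(ref, h) for h in channel_outputs[1:]]
    votes = np.sum(aligned, axis=0)  # per-frame, per-speaker vote counts
    return (votes * 2 > len(aligned)).astype(int)


# Three hypothetical channels: the second has swapped speaker labels,
# the third misses the first speaker in one frame.
ch1 = np.array([[1, 0], [1, 0], [1, 1], [0, 1]])
ch2 = ch1[:, ::-1].copy()
ch3 = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
combined = vote_across_channels([ch1, ch2, ch3])
```

In this toy case the vote recovers the labels of the first channel even though one channel uses a different speaker ordering and another drops a frame, which is the intuition behind combining per-channel outputs rather than trusting any single microphone.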

Publication date: 25 Sep 2023
Project Page: https://www.ntt-review.jp/archive/ntttechnical.php?contents=ntr202302fa2.html
Paper: https://arxiv.org/pdf/2309.12656