The paper investigates modulation spectrum features (MSF) and mel-frequency cepstral coefficients (MFCC) as inputs for joint speaker diarization and identification (JSID) with machine learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Models that use both MSF and MFCC achieve significantly lower diarization error rates (DERs) than models that use either feature set alone. The study also examines model uncertainty and finds that the models usefully indicate where their predictions are uncertain. It concludes that although individual models may struggle with overlapping speakers, model ensembles perform better in such cases.
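For readers unfamiliar with the metric, the following is a minimal sketch of how a diarization error rate can be computed at the frame level. It uses the standard definition (missed speech, false alarms, and speaker confusion divided by total reference speech) and is not taken from the paper. The function name `frame_level_der`, the one-label-per-frame representation, and the use of `None` for non-speech are all assumptions made for illustration. Because JSID assigns absolute speaker identities, the sketch compares labels directly rather than searching for an optimal speaker mapping, and it does not handle overlapping speakers or forgiveness collars.

```python
from typing import Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Simplified frame-level DER (illustrative, hypothetical helper).

    Each sequence holds one speaker label per frame; None marks non-speech.
    Assumes at most one active speaker per frame (no overlap).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech frame predicted as non-speech
            elif hyp != ref:
                confusion += 1     # speech frame assigned to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # non-speech frame predicted as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


if __name__ == "__main__":
    ref = ["A", "A", "B", "B", None, None]
    hyp = ["A", "B", "B", None, "A", None]
    # One confusion, one miss, one false alarm over four speech frames.
    print(frame_level_der(ref, hyp))
```

Lower values are better, and because false alarms count toward the numerator while only reference speech counts toward the denominator, the value can exceed 1.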

 

Publication date: 5 May 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2312.16763