The article explores the role of deep neural networks (DNNs) in automatic speaker recognition, particularly in modeling supra-segmental temporal information (SST). Despite their success, understanding of what contributes to their results remains limited. The authors present a test to quantify the extent to which DNNs model SST and introduce several means to encourage the networks to focus more on it. The findings suggest that even when forced, various CNN- and RNN-based neural network architectures for speaker recognition do not model SST to a sufficient degree. These insights contribute to the explainability of deep learning for speech technologies and to better exploitation of the full speech signal.
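
The summary mentions a test that quantifies how much a network relies on SST, but does not spell out the protocol. The sketch below is one plausible way such a probe could look, not the authors' actual method: it compares speaker-verification scores on original versus temporally shuffled inputs, on the assumption that a model insensitive to frame order is making little use of SST. The names `embed_fn`, `sst_sensitivity`, and `shuffle_frames` are illustrative; `embed_fn` stands in for any fixed-dimensional speaker-embedding extractor (e.g., an x-vector- or ECAPA-style model).

```python
# Hypothetical probe (not the paper's exact test): measure how much a
# speaker-embedding model depends on supra-segmental temporal information (SST)
# by destroying temporal order and checking how verification scores change.
import numpy as np


def cosine(a, b):
    # Standard cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def shuffle_frames(feats, rng):
    # Permute frames along the time axis; spectral content per frame is kept,
    # but any supra-segmental temporal structure is destroyed.
    idx = rng.permutation(feats.shape[0])
    return feats[idx]


def sst_sensitivity(embed_fn, enroll_feats, test_feats, seed=0):
    # embed_fn: assumed black box mapping a (frames, features) array to a
    # fixed-size speaker embedding. Returns (score_original, score_shuffled);
    # a small gap suggests the model barely uses temporal order, i.e. little SST.
    rng = np.random.default_rng(seed)
    e = embed_fn(enroll_feats)
    score_original = cosine(e, embed_fn(test_feats))
    score_shuffled = cosine(e, embed_fn(shuffle_frames(test_feats, rng)))
    return score_original, score_shuffled
```

Averaged over many trials, a negligible difference between the two scores would be one indication of the paper's reported finding that these architectures do not model SST to a sufficient degree.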
Publication date: 2 Nov 2023
Project Page: https://arxiv.org/abs/2311.00489v2
Paper: https://arxiv.org/pdf/2311.00489