The study presents a novel test that measures how much of a neural network's speaker recognition performance can be attributed to modeling supra-segmental temporal features (SST). It finds that various CNN- and RNN-based architectures do not sufficiently model SST, even when forced to do so. The findings provide a basis for further research into better exploiting the full speech signal and offer insight into how these networks work, improving the explainability of deep learning for speech technologies.
Publication date: 2 Nov 2023
Project Page: https://arxiv.org/abs/2311.00489v2
Paper: https://arxiv.org/pdf/2311.00489