The article introduces VOT, a new model for speaker verification. It proposes a Memory-Attention framework that integrates a deep feedforward sequential memory network (DFSMN) into a self-attention mechanism, capturing long-term context while strengthening the modeling of local dependencies. The VOT model combines a parallel structure with variable-weight summation and an attention-based statistical pooling layer. The authors also propose a new loss function, AM-Softmax-Focal, to address hard sample mining. On the VoxCeleb1 dataset, VOT shows significant improvement and outperforms most mainstream models.
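The summary does not spell out the exact formulation of AM-Softmax-Focal, but the name suggests additive-margin softmax logits combined with focal down-weighting of easy samples. Below is a minimal PyTorch sketch of one plausible combination; the scale `s`, margin `m`, and focusing parameter `gamma` are assumed hyperparameters, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxFocalLoss(nn.Module):
    """Sketch of an AM-Softmax loss with focal modulation.

    Assumptions (not confirmed by the paper): the margin is applied
    additively to the target-class cosine, and the focal term
    (1 - p_t)^gamma rescales the resulting cross-entropy.
    """
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.2, gamma=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m, self.gamma = s, m, gamma

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the additive margin m from the target-class cosine only.
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (cosine - self.m * one_hot)
        # Focal modulation: (1 - p_t)^gamma down-weights easy samples so
        # training focuses on hard ones.
        log_p = F.log_softmax(logits, dim=1)
        log_p_t = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
        p_t = log_p_t.exp()
        return (-((1.0 - p_t) ** self.gamma) * log_p_t).mean()

# Usage with hypothetical dimensions (192-d embeddings, 1211 speakers,
# the size of the VoxCeleb1 dev set):
loss_fn = AMSoftmaxFocalLoss(embed_dim=192, num_classes=1211)
loss = loss_fn(torch.randn(8, 192), torch.randint(0, 1211, (8,)))
```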
Publication date: 29 Dec 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2312.16826