Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

The paper introduces a Cross-Speaker Encoding (CSE) network to improve multi-talker speech recognition. Existing end-to-end approaches fall into two families, single-input multiple-output (SIMO) models and single-input single-output (SISO) models, and each has limitations. The CSE network addresses the limitations of SIMO models by aggregating cross-speaker representations. When the CSE model is further integrated with serialized output training (SOT), the combined CSE-SOT model leverages the advantages of both SIMO and SISO while mitigating their drawbacks. The authors present this as a novel attempt to integrate SIMO and SISO for multi-talker speech recognition. In experiments on the two-speaker LibrispeechMix dataset, the CSE model lowers the word error rate (WER) by 8% compared with the SIMO baseline, and the CSE-SOT model lowers WER by 10% overall and by 16% on high-overlap speech compared with the SOT model.
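
For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and this is not the paper's evaluation code. Multi-talker evaluation typically also has to decide which hypothesis is scored against which speaker's reference, which this sketch does not cover.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), not the paper's scoring tool.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Note that the summary above does not state whether the reported 8%, 10%, and 16% reductions are relative or absolute; the paper itself should be consulted before comparing them with other results.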

Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04152
