The paper introduces a Cross-Speaker Encoding (CSE) network to improve multi-talker speech recognition. Current methods rely on either single-input multiple-output (SIMO) or single-input single-output (SISO) models, and each has limitations. The CSE network addresses these by aggregating cross-speaker representations. When integrated with serialized output training (SOT), the CSE model leverages the advantages of both SIMO and SISO while mitigating their drawbacks. The authors present this as a novel attempt to integrate SIMO and SISO for multi-talker speech recognition. Experiments on the two-speaker LibrispeechMix dataset show that the CSE model lowers the word error rate (WER) by 8% over the SIMO baseline. Compared to the SOT model, the CSE-SOT model reduces WER by 10% overall and by 16% on high-overlap speech.
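The summary does not say whether the reported reductions are absolute or relative, so it may help to see how the two differ. The sketch below computes word error rate as word-level edit distance divided by reference length, then compares two hypothetical systems. The function names and the example transcripts are assumptions made for illustration; they are not taken from the paper, and the paper's own scoring for overlapped speech may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts, for illustration only (not from the paper).
reference = "the cat sat on the mat"
baseline = word_error_rate(reference, "the cat sat in a mat")  # 2 errors / 6 words
improved = word_error_rate(reference, "the cat sat on a mat")  # 1 error / 6 words

print(f"baseline WER = {baseline:.3f}, improved WER = {improved:.3f}")
print(f"absolute reduction = {baseline - improved:.3f}")
print(f"relative reduction = {relative_reduction(baseline, improved):.3f}")
```

In this toy case the same improvement reads as roughly 17 points absolute or 50% relative, which is why a bare percentage is easier to interpret alongside the underlying WER values.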
Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04152