The study introduces synchronous bilingual Connectionist Temporal Classification (CTC), a framework that addresses both the modality gap and the language gap in the speech translation (ST) task. The model uses the transcript and the translation as concurrent CTC objectives, bridging the gap between audio and text as well as the gap between the source and target languages. An enhanced variant, BiL-CTC+, achieves state-of-the-art performance on the MuST-C ST benchmarks. The method also significantly improves speech recognition performance, which demonstrates the effect of cross-lingual learning on transcription and the method's wide applicability.
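
As a rough illustration of what using the transcript and the translation as concurrent CTC objectives could look like, the following minimal sketch computes a CTC negative log-likelihood with the standard forward algorithm and sums two such terms, one per language. It is not the authors' implementation and does not reproduce BiL-CTC+. The function names (`ctc_neg_log_likelihood`, `bilingual_ctc_loss`), the interpolation weights, and the assumption that the encoder emits two separate per-frame distributions (one over a source vocabulary, one over a target vocabulary) are all hypothetical choices made for this example.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC forward algorithm in log space.

    log_probs: list of T frames, each a list of log-probabilities over the vocabulary.
    labels:    list of label ids without blanks.
    Returns -log p(labels | frames).
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_states = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total = log_add(total, alpha[s - 1])  # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total = log_add(total, alpha[s - 2])
            new_alpha[s] = total + log_probs[t][ext[s]]
        alpha = new_alpha

    # A valid path ends in the last label or the trailing blank.
    log_likelihood = alpha[-1]
    if num_states > 1:
        log_likelihood = log_add(log_likelihood, alpha[-2])
    return -log_likelihood


def bilingual_ctc_loss(src_log_probs, tgt_log_probs, transcript, translation,
                       src_weight=0.5, tgt_weight=0.5):
    """Hypothetical joint objective: weighted sum of two CTC losses.

    Both distributions are assumed to come from the same speech frames;
    the weights are illustrative and not taken from the paper.
    """
    src_loss = ctc_neg_log_likelihood(src_log_probs, transcript)
    tgt_loss = ctc_neg_log_likelihood(tgt_log_probs, translation)
    return src_weight * src_loss + tgt_weight * tgt_loss
```

In this sketch both loss terms are computed over the same number of frames, so the transcript and the translation are each aligned to the same audio. In practice a batched, differentiable CTC implementation from a deep learning framework would replace the pure-Python loop.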

Publication date: 21 Sep 2023
Project Page: https://github.com/xuchennlp/S2T
Paper: https://arxiv.org/pdf/2309.12234