The article presents a novel approach to real-time spoken-language transcription and translation based on a streaming Transformer-Transducer (T-T) model. The T-T model jointly produces many-to-one and one-to-many transcription and translation using a single decoder. This is more efficient than traditional pipelines, which rely on separate systems for automatic speech recognition (ASR) and speech translation (ST) and therefore waste computational resources and add synchronization complexity. The new method uses timestamp information during training so that the model emits ASR and ST outputs jointly in a streaming setting, and experiments demonstrate the approach's effectiveness.
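
The core idea of using timestamps to let a single decoder interleave ASR and ST outputs can be illustrated with a minimal sketch. This is not the paper's code: the function name, stream tags, and token/timestamp values are all hypothetical, and it only shows one plausible way to serialize two timestamped token streams into a single target sequence for joint training.

```python
# Hypothetical sketch: serialize timestamped ASR and ST tokens into one
# target sequence, so a single decoder can be trained to emit both
# streams in time order. Tags like <ASR>/<ST> are illustrative only.

def interleave_by_timestamp(asr_tokens, st_tokens):
    """Merge (timestamp, token) pairs from the ASR and ST streams into
    one sequence, tagging each token with its stream so the decoder's
    output can later be split back into transcript and translation."""
    tagged = [(t, "ASR", tok) for t, tok in asr_tokens]
    tagged += [(t, "ST", tok) for t, tok in st_tokens]
    # Stable sort by timestamp; on ties, ASR tokens keep insertion
    # order and precede ST tokens.
    tagged.sort(key=lambda item: item[0])
    return [f"<{stream}>{tok}" for _, stream, tok in tagged]

if __name__ == "__main__":
    asr = [(0.2, "hello"), (0.8, "world")]  # illustrative timestamps (s)
    st = [(0.5, "hallo"), (1.1, "welt")]
    print(interleave_by_timestamp(asr, st))
    # ['<ASR>hello', '<ST>hallo', '<ASR>world', '<ST>welt']
```

Ordering by timestamp is what keeps the joint output usable in a streaming setting: each token can be emitted as soon as the corresponding audio has been seen, rather than after the full utterance.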

Publication date: 23 Oct 2023
Project Page: https://arxiv.org/abs/2310.14806v1
Paper: https://arxiv.org/pdf/2310.14806