This research addresses Speech-to-Text Translation (S2TT), comparing traditional cascade systems, which transcribe speech before translating the transcript, with direct systems that translate from the audio itself. The authors argue that direct S2TT systems are better placed to use non-verbal information such as prosody, and they test this claim by evaluating Korean-English translation systems on utterances containing wh-phrases, where prosody helps determine the intended meaning. The results show that direct translation systems outperform cascade models, with significant improvements in overall accuracy and F1 scores. The work thereby provides quantitative evidence that direct S2TT models can leverage prosody.
Publication date: 2 Feb 2024
Project Page: https://github.com/GiulioZhou/contrastive_prosody
Paper: https://arxiv.org/pdf/2402.00632