The paper presents a new approach to Disfluency Correction (DC) for four major Indo-European languages: English, Hindi, German, and French. DC removes disfluent elements such as fillers, repetitions, and corrections from spoken utterances to make the text more readable and interpretable. It is a crucial step in processing Automatic Speech Recognition (ASR) outputs. The research provides an extensive analysis of state-of-the-art DC models across these languages and demonstrates the benefits of DC for downstream tasks such as Machine Translation (MT).
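
To make the task concrete, here is a minimal rule-based sketch of what removing fillers and immediate repetitions looks like on a single English utterance. It is only an illustration of the input/output behaviour of DC, not the method used in the paper: the `FILLERS` set and the function name `remove_simple_disfluencies` are hypothetical, and the heuristic does not handle corrections (for example "to Boston, I mean Denver"), which generally need a learned model.

```python
# Hypothetical filler list, chosen for illustration only.
FILLERS = {"uh", "um", "uhm", "er", "hmm"}

PUNCTUATION = ",.?!"


def remove_simple_disfluencies(utterance: str) -> str:
    """Drop filler words and immediately repeated words from an utterance.

    This is a toy heuristic: it does not detect corrections or
    multi-word repetitions.
    """
    cleaned = []
    for token in utterance.split():
        # Normalise the token so "Um," and "um" compare equal.
        key = token.lower().strip(PUNCTUATION)
        if key in FILLERS:
            continue  # filler, e.g. "uh" or "um"
        if key and cleaned and cleaned[-1].lower().strip(PUNCTUATION) == key:
            continue  # immediate repetition, e.g. "I I"
        cleaned.append(token)
    return " ".join(cleaned)


if __name__ == "__main__":
    print(remove_simple_disfluencies("Well, um, I I want a flight to uh Boston"))
```

A fluent transcript of this kind is what a downstream system such as an MT model would then receive instead of the raw ASR output.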

Publication date: 25 Oct 2023
Project Page: https://github.com/vineet2104/DISCO
Paper: https://arxiv.org/pdf/2310.16749