The article introduces DISCO, a large-scale, human-annotated corpus for disfluency correction in four Indo-European languages: English, Hindi, German, and French. Disfluency correction removes disfluent elements such as fillers, repetitions, and corrections from spoken utterances, making them more readable and interpretable. The corpus is intended to aid language understanding tasks and to improve Automatic Speech Recognition outputs. The researchers also demonstrate the positive impact of disfluency correction on Machine Translation systems.
Publication date: 25 Oct 2023
Project Page: https://github.com/vineet2104/DISCO
Paper: https://arxiv.org/pdf/2310.16749
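
To make the notion of disfluency correction described above concrete, here is a minimal rule-based sketch in Python. It is not the method used in the paper and does not use the DISCO data; the filler list and the function name `remove_disfluencies` are assumptions introduced purely for illustration. It handles only two of the disfluency types mentioned (fillers and immediate word repetitions) and does not attempt speaker corrections, which generally need a learned model.

```python
import re

# Hypothetical English filler list, for illustration only (not taken from DISCO).
FILLERS = {"uh", "um", "erm", "hmm"}


def _normalize(token: str) -> str:
    """Lowercase a token and strip punctuation so "Um," compares equal to "um"."""
    return re.sub(r"[^\w']", "", token).lower()


def remove_disfluencies(utterance: str) -> str:
    """Toy pass: drop filler words and collapse immediately repeated words."""
    cleaned = []
    for token in utterance.split():
        key = _normalize(token)
        if key in FILLERS:
            continue  # skip fillers such as "uh" or "um"
        if key and cleaned and _normalize(cleaned[-1]) == key:
            continue  # skip an immediate repetition of the previous kept word
        cleaned.append(token)
    return " ".join(cleaned)


if __name__ == "__main__":
    print(remove_disfluencies("um I I want to uh book a a flight"))
    # prints: I want to book a flight
```

A sketch like this also removes legitimate repetitions (for example "that that" in some sentences), which is one reason annotated corpora and trained models are preferred over fixed rules.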