The study proposes a data augmentation framework based on deepfake audio for training robust speech-to-text transcription models. The need for such a framework arises from the difficulty of acquiring diverse, labeled datasets, especially for languages less widely represented than English. Traditional data augmentation techniques fail to produce models that maintain transcription quality across varying accents. The proposed technique uses a voice cloner together with a dataset of English recordings by Indian speakers, ensuring that a single accent is present in the dataset. The results show that this method can be a viable solution to the challenges of training transcription models.
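The augmentation idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: `clone_voice` is a hypothetical stand-in for a real voice-cloning model, and the `Sample` type and accent labels are assumptions made for the example.

```python
# Hypothetical sketch of deepfake-audio augmentation: expand a
# single-accent labeled corpus by re-synthesizing each utterance in
# other accents while keeping the transcript (the label) unchanged.

from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    audio: List[float]   # raw waveform (placeholder)
    transcript: str      # ground-truth label, reused for the clone
    accent: str


def clone_voice(sample: Sample, target_accent: str) -> Sample:
    """Stub for a voice cloner: in the real pipeline this would be a
    cloning/TTS model that re-synthesizes the utterance in the target
    accent; the transcript carries over untouched."""
    return Sample(audio=sample.audio,
                  transcript=sample.transcript,
                  accent=target_accent)


def augment(dataset: List[Sample], target_accents: List[str]) -> List[Sample]:
    """Return the original samples plus one cloned copy per new accent."""
    augmented = list(dataset)
    for sample in dataset:
        for accent in target_accents:
            if accent != sample.accent:
                augmented.append(clone_voice(sample, accent))
    return augmented


# Example: a single-accent (Indian English) corpus expanded to two more accents.
corpus = [Sample(audio=[0.0], transcript="hello world", accent="en-IN")]
expanded = augment(corpus, ["en-US", "en-GB"])
print(len(expanded))  # 3: the original plus two cloned-accent copies
```

The key point the sketch captures is that cloning yields new accent variants without any new human labeling, since each synthetic utterance inherits its source transcript.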
Publication date: 25 Sep 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2309.12802