Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

The study proposes a data augmentation framework that uses deepfake audio to train robust speech-to-text transcription models. The framework is motivated by the difficulty of acquiring diverse, labeled datasets, especially for languages less widely used than English. Traditional data augmentation techniques fail to produce models that maintain transcription quality across varying accents. The proposed technique uses a voice cloner and a dataset of English speech recorded by Indian speakers, which ensures that a single accent is present in the dataset. The results show that this method can be a viable solution to the challenges of training transcription models.
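
The general idea of cloning-based augmentation can be illustrated with a minimal sketch. It is not the paper's implementation: the `Utterance` type, the `augment_with_clones` function, and the `dummy_cloner` stand-in are all hypothetical names introduced here, and a real pipeline would replace `dummy_cloner` with an actual voice-cloning model. The sketch assumes the cloner takes a reference recording and a text, and returns synthesized audio of that text in the reference speaker's voice, so the text itself serves as the label.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Utterance:
    audio: Sequence[float]   # raw waveform samples
    transcript: str          # ground-truth text label
    speaker: str
    synthetic: bool = False  # True for cloned (deepfake) audio


# Assumed interface: (reference_audio, text) -> synthesized waveform.
VoiceCloner = Callable[[Sequence[float], str], Sequence[float]]


def augment_with_clones(dataset: List[Utterance],
                        cloner: VoiceCloner) -> List[Utterance]:
    """Return the original data plus cloned utterances.

    Each speaker's recording is used as a voice reference to synthesize
    every transcript in the dataset that the recording does not already say.
    """
    transcripts = sorted({u.transcript for u in dataset})
    augmented = list(dataset)
    for ref in dataset:
        for text in transcripts:
            if text == ref.transcript:
                continue  # the real recording already covers this pair
            augmented.append(
                Utterance(
                    audio=cloner(ref.audio, text),
                    transcript=text,  # the label comes for free
                    speaker=ref.speaker,
                    synthetic=True,
                )
            )
    return augmented


def dummy_cloner(reference_audio: Sequence[float], text: str) -> Sequence[float]:
    # Placeholder only: emits silence, one sample per character.
    # A real voice cloner would condition on reference_audio.
    return [0.0] * len(text)


sample = [
    Utterance(audio=[0.1, 0.2], transcript="hello world", speaker="spk1"),
    Utterance(audio=[0.3], transcript="good morning", speaker="spk2"),
]
result = augment_with_clones(sample, dummy_cloner)
```

Because every synthesized clip is generated from a known text, the augmented set stays fully labeled without extra annotation, which is the property that makes this kind of augmentation attractive when labeled data is scarce.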

Publication date: 25 Sep 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2309.12802
