This paper proposes a method for building a phonetically rich corpus for low-resource languages. The researchers developed a sentence selection algorithm based on triphone distribution, together with a new phonemic classification that reflects acoustic-articulatory speech features. They applied the methodology to Brazilian Portuguese, a language with limited resources despite its broad user base. The authors' approach achieved a 55.8% higher percentage of distinct triphones than other available phonetically rich corpora, improving the representation of language-specific speech features.
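
To make the idea of triphone-based sentence selection concrete, here is a minimal sketch of one common way to do it: a greedy loop that repeatedly picks the sentence adding the most triphones not yet covered. This is an illustration under stated assumptions, not the authors' algorithm. The function names, the `#` boundary symbol, and the toy phone sequences are all hypothetical, and the sketch assumes each candidate sentence has already been transcribed into a list of phone symbols.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the distinct 3-phone windows of a phone sequence.

    The sequence is padded with "#" (an assumed boundary symbol) so that
    sentence-initial and sentence-final contexts are also counted.
    """
    padded = ["#"] + phones + ["#"]
    return {tuple(padded[i:i + 3]) for i in range(len(padded) - 2)}


def select_sentences(candidates: Dict[str, List[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` sentence ids that maximize triphone coverage.

    `candidates` maps a sentence id to its phone sequence (hypothetical input
    format). Selection stops early once no remaining sentence adds a new triphone.
    """
    covered: Set[Triphone] = set()
    chosen: List[str] = []
    remaining = dict(candidates)

    while remaining and len(chosen) < budget:
        # Gain = number of triphones this sentence would newly contribute.
        best = max(remaining, key=lambda sid: len(triphones(remaining[sid]) - covered))
        gain = triphones(remaining[best]) - covered
        if not gain:
            break  # nothing left adds coverage
        covered |= gain
        chosen.append(best)
        del remaining[best]

    return chosen


if __name__ == "__main__":
    # Toy phone sequences, made up for illustration only.
    toy = {
        "s1": ["k", "a", "z", "a"],
        "s2": ["k", "a"],
        "s3": ["m", "a", "r"],
    }
    print(select_sentences(toy, budget=2))
```

A greedy pass like this favors raw coverage of distinct triphones. Matching a target triphone distribution, as the summary describes, would need a different scoring function, which is left out of this sketch.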

Publication date: 8 Feb 2024
Project Page: https://arxiv.org/abs/2402.05794v1
Paper: https://arxiv.org/pdf/2402.05794