This paper provides a comprehensive overview of text-to-speech (TTS) systems and their applications in media. It examines why TTS systems are complex to design: they typically require a text frontend, a predictive model, and a signal-processing vocoder. It also discusses the shift from conventional concatenative and statistical parametric approaches to neural network-based TTS, which produces higher-quality speech. The paper then covers the use of TTS in media applications, highlighting its potential for cost savings and efficiency, and concludes with a comparison of recently released TTS systems.

Publication date: 25 Oct 2023
Project Page: ?
Paper: https://arxiv.org/pdf/2310.14301