This paper examines how the advances of Large Language Models (LLMs) in natural language processing are being extended to other modalities such as speech and vision. It presents an empirical study on equipping LLMs with the ability to generate speech, combining the pre-trained LLMs LLaMA and OPT with the text-to-speech synthesis model VALL-E. The study compares three methods of integrating LLMs with speech synthesis models. The findings show that the coupled methods, which use the LLM as the text encoder, perform best, significantly improving the quality of generated speech as measured by speaker similarity and word error rate.

Publication date: 30 Dec 2023
Project Page: https://arxiv.org/abs/2401.00246v1
Paper: https://arxiv.org/pdf/2401.00246