The article presents SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. The model builds on a pretrained text language model and extends it to the speech modality by continually training it on text and speech units interleaved at the word level. It comes in two versions: a BASE version that uses only speech semantic units, and an EXPRESSIVE version that adds pitch and style units alongside the semantic units. SPIRIT-LM displays the semantic abilities of text models together with the expressive abilities of speech models, and it can learn new tasks in a few-shot fashion across modalities, such as ASR, TTS, and speech classification.
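
To make the interleaving idea concrete, below is a minimal Python sketch of how one word-level interleaved token stream might be assembled. Everything named here is an illustrative assumption: the `[TEXT]`/`[SPEECH]` marker spellings, the `speech_units_expressive` placeholder, and the unit-token format are stand-ins for the paper's actual tokenizers, not the released implementation.

```python
from typing import List

# Modality marker tokens; spellings are assumed for illustration.
TEXT_TOKEN = "[TEXT]"
SPEECH_TOKEN = "[SPEECH]"


def speech_units_expressive(word_audio) -> List[str]:
    # Placeholder tokenizer: the EXPRESSIVE version encodes a segment
    # into semantic (HuBERT), pitch, and style units. Dummy output here
    # so the sketch runs end to end; a real system would call the
    # pretrained unit extractors instead.
    return ["[Hu12]", "[Pi3]", "[Hu7]", "[St5]"]


def interleave(words: List[str], audio_spans: list,
               speak_mask: List[bool]) -> List[str]:
    """Build one training stream, switching modality at word boundaries.

    words       -- transcript, one entry per word
    audio_spans -- word-aligned audio segments (same length as words)
    speak_mask  -- True where a word should appear as speech units
    """
    stream: List[str] = []
    modality = None  # emit a marker token only when the modality changes
    for word, audio, as_speech in zip(words, audio_spans, speak_mask):
        if as_speech:
            if modality != "speech":
                stream.append(SPEECH_TOKEN)
                modality = "speech"
            stream.extend(speech_units_expressive(audio))
        else:
            if modality != "text":
                stream.append(TEXT_TOKEN)
                modality = "text"
            stream.append(word)
    return stream


if __name__ == "__main__":
    words = ["the", "cat", "sat", "down"]
    audio = [None, b"...", b"...", None]  # stand-in audio segments
    mask = [False, True, True, False]
    print(interleave(words, audio, mask))
    # ['[TEXT]', 'the', '[SPEECH]', <units for "cat">, <units for "sat">,
    #  '[TEXT]', 'down']
```

The same mixed vocabulary is what enables few-shot prompting across modalities at inference time: an ASR prompt, for instance, can be built as several (speech units, text transcript) demonstration pairs followed by the speech units to transcribe.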


Publication date: 9 Feb 2024
Project Page: https://speechbot.github.io/spiritlm2022
Paper: https://arxiv.org/pdf/2402.05755