The paper introduces the Unified Spoken Dialog Model (USDM), a framework that enables large language models (LLMs) to understand and synthesize speech. Instead of chaining separate automatic speech recognition (ASR) and text-to-speech (TTS) modules, USDM uses a multi-step speech-text inference scheme to generate coherent spoken responses that are relevant to the input speech. The authors also propose a speech-text pretraining scheme that helps the model capture cross-modal semantics. In their evaluations, the approach generates natural-sounding spoken responses, outperforming both prior models and cascaded baselines.
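As a rough illustration of the multi-step inference described above, the chain can be pictured as: input speech → discrete units → transcript → response text → response units → synthesized speech. The sketch below is a minimal, hypothetical rendering of that flow; the function names (`speech_to_units`, `llm_generate`, `units_to_speech`) and prompt tags are illustrative stand-ins, not the authors' actual API.

```python
# Hypothetical sketch of USDM-style multi-step speech-text inference.
# All functions below are placeholder stubs, not the paper's implementation.

def speech_to_units(waveform: list[float]) -> list[int]:
    """Stand-in: quantize input speech into discrete acoustic units."""
    return [hash(round(x, 2)) % 100 for x in waveform]  # placeholder logic

def llm_generate(prompt: str) -> str:
    """Stand-in: one decoding pass of the unified speech-text LLM."""
    return f"<continuation of: {prompt[:40]}...>"  # placeholder logic

def units_to_speech(units: str) -> bytes:
    """Stand-in: unit-to-waveform decoder (e.g. a neural vocoder)."""
    return units.encode()  # placeholder logic

def usdm_inference(waveform: list[float]) -> bytes:
    # Step 1: discretize the input speech into unit tokens.
    unit_str = " ".join(map(str, speech_to_units(waveform)))

    # Step 2: decode a transcript of the input speech (ASR-like step).
    transcript = llm_generate(f"<speech> {unit_str} <transcript>")

    # Step 3: decode the text of the spoken response.
    response_text = llm_generate(f"{transcript} <response_text>")

    # Step 4: decode response speech units conditioned on that text.
    response_units = llm_generate(f"{response_text} <response_speech>")

    # Step 5: synthesize a waveform from the generated units.
    return units_to_speech(response_units)

if __name__ == "__main__":
    audio = usdm_inference([0.0, 0.1, -0.05, 0.2])
    print(f"synthesized {len(audio)} bytes of audio")
```

The key point the sketch conveys is that every intermediate step runs inside the single unified model, rather than handing off between a separate ASR system, text LLM, and TTS system.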

Publication date: 9 Feb 2024
Project Page: https://unifiedsdm.github.io
Paper: https://arxiv.org/pdf/2402.05706