This paper presents a conversational Automatic Speech Recognition (ASR) system that extends the Conformer encoder-decoder model with cross-modal conversational representations. The approach combines pre-trained speech and text models through a specialized encoder and a modal-level mask on the input, which lets the model extract richer context from historical speech without explicitly propagating recognition errors from earlier utterances. The model also incorporates conditional latent variational modules to learn conversation-level attributes such as role preference and topic coherence. The authors report significant accuracy improvements on Mandarin conversation datasets.
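
The modal-level mask mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `modal_level_mask`, the probabilities `p_speech` and `p_text`, and the choice to zero out an entire modality are assumptions made for illustration. The idea shown is that, during training, one whole modality of the historical context is sometimes hidden, so the model learns to work when that modality is unavailable or unreliable.

```python
import random
from typing import List, Optional, Tuple

# One feature vector per speech frame or text token (hypothetical layout).
Features = List[List[float]]


def modal_level_mask(
    speech: Features,
    text: Features,
    p_speech: float = 0.25,  # assumed probability of hiding the speech modality
    p_text: float = 0.25,    # assumed probability of hiding the text modality
    rng: Optional[random.Random] = None,
) -> Tuple[Features, Features, str]:
    """Zero out at most one whole modality of the historical context.

    Returns the (possibly masked) speech and text features and a label
    naming which modality was masked: "speech", "text", or "none".
    """
    rng = rng or random.Random()
    draw = rng.random()  # uniform in [0, 1)

    if draw < p_speech:
        # Hide every speech frame; text is left untouched.
        return [[0.0] * len(v) for v in speech], text, "speech"
    if draw < p_speech + p_text:
        # Hide every text token; speech is left untouched.
        return speech, [[0.0] * len(v) for v in text], "text"
    return speech, text, "none"


if __name__ == "__main__":
    history_speech = [[0.3, 0.7], [0.1, 0.9]]
    history_text = [[1.0, 2.0, 3.0]]
    s, t, which = modal_level_mask(history_speech, history_text, rng=random.Random(0))
    print(which, s, t)
```

Masking at the level of a whole modality, rather than individual frames or tokens, is the design choice this sketch tries to capture; the actual masking schedule and mask values used in the paper may differ.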

Publication date: 25 Oct 2023
Project Page: N/A
Paper: https://arxiv.org/pdf/2310.14278