The authors introduce SLM, a multitask, multilingual, dual-modal Speech and Language Model. SLM combines pretrained foundational speech and language models, preserving their capabilities while training only a simple adapter that amounts to roughly 1% of the foundational models' parameters. The model performs strongly on tasks such as automatic speech recognition and speech translation, and can follow zero-shot instructions across a variety of tasks. SLM demonstrates that the gap between pretrained speech and language models can be bridged with a lightweight adaptation mechanism.
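The core idea can be sketched as follows: a frozen speech encoder's output is mapped into the frozen language model's embedding space by a small trainable projection. This is a minimal illustrative sketch, not the paper's implementation; the embedding dimensions and the foundational-model parameter count below are assumed values chosen only to show why the adapter is a tiny fraction of the total.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed embedding sizes (illustrative, not from the paper)
speech_dim, lm_dim = 512, 1024

# Stand-in for frozen speech-encoder outputs: 20 frames of embeddings
speech_frames = rng.standard_normal((20, speech_dim))

# Trainable adapter: a single linear projection (weights + bias)
W = rng.standard_normal((speech_dim, lm_dim)) * 0.02
b = np.zeros(lm_dim)

# Project speech embeddings into the LM's input space
adapted = speech_frames @ W + b
print(adapted.shape)  # (20, 1024)

# The adapter is tiny relative to the frozen backbones
adapter_params = W.size + b.size
foundation_params = 100_000_000  # hypothetical frozen-model size
print(f"adapter fraction: {adapter_params / foundation_params:.4%}")
```

In practice the adapter in such designs can be more than a single linear layer, but the point stands: only the small projection is trained, while both foundational models stay frozen.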
Publication date: 4 Oct 2023
Project Page: Not Provided
Paper: https://arxiv.org/pdf/2310.00230