This paper introduces LLM-ST, a speech translation model built on a pre-trained large language model (LLM). By integrating a speech encoder with the LLM and applying multi-task instruction tuning, LLM-ST produces accurate, timestamped transcriptions and translations, even for long-form audio inputs. Chain-of-Thought (CoT) prompting further improves its outputs. Evaluated on English and Chinese speech translation datasets, the model achieves state-of-the-art results.
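To make the CoT idea concrete, here is a minimal sketch of how a multi-task instruction might ask the model to transcribe with timestamps before translating, and how a timestamped output line could be parsed. The prompt template, the `<speech_embeddings>` placeholder, and the `[start-end]` output format are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of CoT-style prompting for timestamped speech
# translation. Prompt wording and output format are assumptions.

def build_cot_prompt(lang_src: str, lang_tgt: str) -> str:
    """Compose a multi-task instruction that asks for a timestamped
    transcription first, then a translation (Chain-of-Thought ordering)."""
    return (
        "<speech_embeddings>\n"  # placeholder for encoded audio features
        f"Step 1: Transcribe the {lang_src} audio with timestamps.\n"
        f"Step 2: Translate the transcription into {lang_tgt}."
    )

def parse_timestamped_line(line: str) -> tuple[float, float, str]:
    """Parse one output line of the assumed form '[start-end] text'."""
    span, text = line.split("] ", 1)
    start, end = span.strip("[").split("-")
    return float(start), float(end), text

# Example: parsing one hypothetical model output line
print(parse_timestamped_line("[0.00-2.35] Hello, everyone."))
# → (0.0, 2.35, 'Hello, everyone.')
```

Ordering the tasks this way lets the translation step condition on the model's own intermediate transcription, which is the intuition behind applying CoT to speech translation.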
Publication date: 21 Dec 2023
Project Page: https://speechtranslation.github.io/llm-st/
Paper: https://arxiv.org/pdf/2312.13585