The article presents a novel method for efficient Conformer-based end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information. However, the computational cost of self-attention grows quadratically with the length of the input sequence. To reduce this cost, the authors propose a key frame-based self-attention mechanism. The model comprises two encoders: an intermediate CTC loss applied after the first encoder assigns a label to each frame, which makes it possible to separate key frames from blank frames. With this approach, more than 60% of frames can be discarded as uninformative during both training and inference, which significantly speeds up inference.
Publication date: 25 Oct 2023
Project Page: https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer
Paper: https://arxiv.org/pdf/2310.14954
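The frame-dropping idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a key frame is any frame whose most likely intermediate CTC label is not the blank symbol, that the blank symbol has index 0, and that self-attention cost scales with the square of the sequence length. All function names are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank label


def argmax(row: Sequence[float]) -> int:
    """Index of the largest score in one frame's label distribution."""
    return max(range(len(row)), key=lambda i: row[i])


def key_frame_indices(ctc_scores: Sequence[Sequence[float]],
                      blank_id: int = BLANK_ID) -> List[int]:
    """Indices of frames whose top intermediate-CTC label is not blank."""
    return [t for t, row in enumerate(ctc_scores) if argmax(row) != blank_id]


def drop_blank_frames(features: Sequence, ctc_scores: Sequence[Sequence[float]],
                      blank_id: int = BLANK_ID) -> List:
    """Keep only the acoustic features that sit on key frames."""
    return [features[t] for t in key_frame_indices(ctc_scores, blank_id)]


def attention_cost_ratio(kept: int, total: int) -> float:
    """Relative self-attention cost, assuming cost proportional to length squared."""
    return (kept / total) ** 2


# Toy example: 5 frames, 3 labels (label 0 is blank).
scores = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # label 1 -> key frame
    [0.70, 0.20, 0.10],  # blank
    [0.60, 0.30, 0.10],  # blank
    [0.20, 0.10, 0.70],  # label 2 -> key frame
]
features = ["f0", "f1", "f2", "f3", "f4"]  # stand-ins for encoder output vectors

kept = drop_blank_frames(features, scores)
ratio = attention_cost_ratio(len(kept), len(features))
```

In this toy input, three of the five frames are blank, so only `f1` and `f4` reach the second stage. Under the quadratic-cost assumption, keeping 40% of the frames leaves roughly 16% of the original self-attention cost, which is why discarding a large share of frames can speed up inference so much.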