The article presents Distil-Whisper, a distilled variant of the Whisper speech recognition model. To address the challenges of running large models in low-latency or resource-constrained environments, the researchers used pseudo-labelling to assemble a large-scale open-source speech dataset, applying a simple word error rate (WER) heuristic to keep only the highest-quality pseudo-labels for training. The resulting Distil-Whisper model is 5.8 times faster with 51% fewer parameters than the original Whisper model, while performing to within 1% WER of Whisper on out-of-distribution test data in a zero-shot setting. It retains Whisper's robustness to difficult acoustic conditions, is less prone to hallucination errors on long-form audio, and is designed to be paired with Whisper as an assistant model for speculative decoding, yielding a further speed-up while producing the same outputs as the original model.
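The WER heuristic can be sketched in a few lines: normalise the teacher's pseudo-label and the ground-truth transcript, compute the WER between them, and discard any training example above a threshold. The sketch below uses the `jiwer` library; the normaliser and the 10% threshold are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of the WER-based pseudo-label filter. The normaliser and the 10%
# threshold are illustrative assumptions, not the paper's exact settings.
import jiwer

WER_THRESHOLD = 0.10  # assumed cut-off: discard pairs with more than 10% WER


def normalise(text: str) -> str:
    """Lightweight stand-in for a proper text normaliser."""
    return " ".join(text.lower().split())


def filter_pseudo_labels(examples):
    """Keep (ground-truth, pseudo-label) pairs whose WER is below the threshold.

    `examples` is an iterable of dicts with hypothetical 'text' (ground truth)
    and 'pseudo_label' (Whisper transcription) keys.
    """
    kept = []
    for ex in examples:
        reference = normalise(ex["text"])
        hypothesis = normalise(ex["pseudo_label"])
        if reference and jiwer.wer(reference, hypothesis) <= WER_THRESHOLD:
            kept.append(ex)
    return kept
```

For speculative decoding, Distil-Whisper acts as the draft (assistant) model and Whisper as the verifier, so the transcript is identical to running Whisper alone. Below is a minimal sketch using Hugging Face transformers' assisted-generation API; the checkpoint names and the dummy audio sample are assumptions for illustration and should be checked against the current Distil-Whisper documentation.

```python
# Sketch of speculative decoding: Distil-Whisper drafts tokens, Whisper
# verifies them. Checkpoint names and the dummy dataset are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# A short LibriSpeech clip, used here purely for demonstration.
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features.to(device)

# The assistant proposes candidate tokens; the larger model accepts or rejects
# them, so the final transcript matches what Whisper alone would produce.
generated_ids = model.generate(inputs, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```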
Publication date: 1 Nov 2023
Project Page: https://arxiv.org/abs/2311.00430v1
Paper: https://arxiv.org/pdf/2311.00430