The article ‘How Much Context Does My Attention-Based ASR System Need?’ by Robert Flynn and Anton Ragni of the University of Sheffield, UK, investigates how much acoustic context attention-based speech recognition systems benefit from during training. The authors train on a dataset of about 100,000 pseudo-labelled Spotify podcasts, varying the acoustic context length from 5 seconds to 1 hour. They find a benefit from training with around 80 seconds of acoustic context, yielding a relative improvement of up to 14.9% over a limited-context baseline. They also combine their system with long-context transformer language models, producing a fully long-context ASR system that is competitive with the current state of the art.
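
To make the notion of "acoustic context length" concrete, the sketch below shows one plausible way long-format audio could be split into fixed-length training windows. This is an illustrative assumption, not the authors' code; the function name, the 16 kHz sample rate, and the 80-second window are examples only.

```python
# Illustrative sketch (not the paper's implementation): chunk a long recording
# into consecutive fixed-length acoustic context windows for ASR training.
import numpy as np

def chunk_waveform(waveform: np.ndarray, context_seconds: float = 80.0,
                   sample_rate: int = 16_000) -> list[np.ndarray]:
    """Split a long waveform into consecutive windows of `context_seconds`."""
    window = int(context_seconds * sample_rate)  # window length in samples
    return [waveform[i:i + window] for i in range(0, len(waveform), window)]

# Example: a 1-hour episode (placeholder audio) becomes 45 segments of 80 s each.
episode = np.zeros(3_600 * 16_000, dtype=np.float32)
segments = chunk_waveform(episode, context_seconds=80.0)
print(len(segments), "segments")
```

Longer context lengths would simply correspond to larger windows (up to the full 1-hour episode in the paper's most extreme setting), at the cost of longer attention sequences during training.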


Publication date: 25 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.15672