This paper presents a technique for improving the accuracy of Automatic Speech Recognition (ASR) systems. The proposed method queries a correction database with embeddings derived from the utterance audio rather than from the textual hypothesis, which avoids the retrieval errors that arise when textual hypotheses are phonetically dissimilar to the ground-truth transcript. The study reports a 6% reduction in word error rate (WER) for utterances whose transcripts appear in the candidate set, without increasing WER on general utterances. The approach combines multimodal speech-text embedding networks with nearest-neighbor search to improve recall and precision.
Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04235
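
The retrieval step described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the three-dimensional vectors are hand-written stand-ins for the output of a speech-text embedding network, and the names `cosine_distance`, `nearest_correction`, and the `max_distance` threshold are hypothetical choices made for illustration. It assumes audio and candidate-text embeddings live in one shared space and that a candidate is accepted only if it is close enough to the audio embedding.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_correction(audio_embedding, candidates, max_distance=0.2):
    """Brute-force nearest-neighbor search over candidate corrections.

    candidates maps candidate text to its (assumed precomputed) embedding.
    Returns (text, distance) for the closest candidate, or None when even
    the closest one is farther than max_distance (a hypothetical threshold).
    """
    best_text, best_dist = None, float("inf")
    for text, embedding in candidates.items():
        dist = cosine_distance(audio_embedding, embedding)
        if dist < best_dist:
            best_text, best_dist = text, dist
    if best_dist <= max_distance:
        return best_text, best_dist
    return None


# Toy correction database; real embeddings would come from a trained network.
candidates = {
    "play taylor swift": [0.9, 0.1, 0.0],
    "call tailor shop": [0.1, 0.9, 0.1],
    "weather in zurich": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of an utterance's audio.
query = [0.85, 0.15, 0.05]
match = nearest_correction(query, candidates)

# An embedding that is not close to any candidate yields no correction.
far = nearest_correction([1.0, 1.0, 1.0], candidates)
```

Rejecting distant matches is one plausible way to keep general utterances unaffected, since a query with no close candidate leaves the original hypothesis untouched. A production system would likely replace the linear scan with an approximate nearest-neighbor index.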