The authors compare two popular end-to-end automatic speech recognition (ASR) models, Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries. Both models use the neural architecture of Google’s Universal Speech Model (USM), with additional funnel pooling layers that reduce the frame rate and speed up training and inference. The authors find that a 900M-parameter RNN-T outperforms a 1.8B-parameter CTC model and is more tolerant of severe time reduction.
Publication date: 22 Sep 2023
Project Page: https://arxiv.org/abs/2309.12963v1
Paper: https://arxiv.org/pdf/2309.12963
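
To illustrate what reducing the frame rate means in the summary above, here is a minimal sketch in Python. It is not the paper's funnel pooling layer, whose details this summary does not give. It uses plain average pooling over time as a stand-in, and the function name avg_pool_time and its stride parameter are hypothetical names chosen for illustration.

```python
from typing import List

Frame = List[float]  # one feature vector per time step


def avg_pool_time(frames: List[Frame], stride: int) -> List[Frame]:
    """Average each group of `stride` consecutive frames along the time axis.

    Illustrative stand-in for a time-reduction layer (assumption: simple
    average pooling, not the funnel pooling used in the paper).
    A final group with fewer than `stride` frames is averaged over the
    frames it actually contains.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled: List[Frame] = []
    for start in range(0, len(frames), stride):
        window = frames[start:start + stride]
        dim = len(window[0])
        # Mean of each feature dimension across the frames in this window.
        pooled.append([sum(f[i] for f in window) / len(window) for i in range(dim)])
    return pooled


# Eight toy frames with two features each.
frames = [[float(t), float(t) * 10.0] for t in range(8)]

# Two stacked stride-2 stages shorten the sequence by a factor of 4 overall.
stage1 = avg_pool_time(frames, stride=2)
stage2 = avg_pool_time(stage1, stride=2)
print(len(frames), len(stage1), len(stage2))
```

Each pooling stage shortens the sequence the later layers must process, which is where the training and inference speed-up comes from. The cost is that fewer output frames remain per utterance, and that is the "severe time reduction" regime in which the summary reports RNN-T holding up better than CTC.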