The paper examines the inference efficiency of self-supervised pre-trained audio models. It argues that a simple pre-trained model can match the inference efficiency of more complex models built on speech transformer encoders, which interleave convolutional modules with self-attention modules. The study finds that comparable efficiency is achievable with advanced self-attention alone, and that this simpler design is especially beneficial when combined with low-bit quantization of the network's weights.
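For context, low-bit weight quantization replaces each floating-point weight with a small integer code plus a shared scale, shrinking model size and memory traffic at inference time. The summary does not specify the paper's exact quantization scheme, so the following is only a minimal sketch of symmetric per-tensor weight quantization; the helper names `quantize_weights` and `dequantize_weights` are illustrative, not from the paper.

```python
import numpy as np

def quantize_weights(w: np.ndarray, num_bits: int = 4):
    """Symmetric per-tensor quantization of a weight array to `num_bits` bits.

    Returns the integer codes and the scale needed to dequantize.
    """
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 7 for signed 4-bit codes
    scale = np.abs(w).max() / qmax              # map the largest |weight| to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the low-bit codes."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the error introduced.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)
q, scale = quantize_weights(w, num_bits=4)
w_hat = dequantize_weights(q, scale)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

At 4 bits, each weight is stored as one of only 16 levels, so the storage cost per weight drops roughly 8x relative to float32 at the price of a small reconstruction error, which is the efficiency trade-off the paper's quantization component exploits.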
Publication date: 5 Nov 2023
Abstract: https://arxiv.org/abs/2311.02772v1
Paper: https://arxiv.org/pdf/2311.02772