The paper introduces DATAR, a Deformable Audio Transformer for audio recognition. The authors argue that the quadratic complexity of self-attention in existing transformer models limits their use in low-resource settings. To overcome this, they propose a deformable attention mechanism that reduces computational complexity by attending to a small, input-dependent set of sampled locations rather than all token pairs, making the attention pattern adaptive to each individual input. The proposed architecture combines a pyramid transformer backbone built on this deformable attention with a learnable input adaptor. The paper reports that these components are effective on prediction tasks such as audio event classification, where DATAR is claimed to achieve state-of-the-art performance.
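To make the mechanism concrete, below is a minimal PyTorch sketch of a single-head deformable attention layer over a 2D spectrogram feature map. It illustrates the general idea: queries attend only to keys and values sampled at a small grid of learned, offset-shifted locations, so cost grows with the number of sampled points rather than quadratically in sequence length. The class name `DeformableAttention2D`, the offset network, and all hyperparameters are illustrative assumptions, not the authors' implementation; DATAR's actual design, including its input adaptor, differs.

```python
# Hypothetical sketch of deformable attention for a spectrogram feature map.
# Not the DATAR implementation; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention2D(nn.Module):
    def __init__(self, dim: int, n_ref: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        # Small network predicting an (x, y) offset for each reference point.
        self.offset_net = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2)
        )
        self.n_ref = n_ref  # sqrt of the number of sampled key/value points

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map, e.g. time x frequency patch embeddings.
        B, C, H, W = x.shape
        # Uniform grid of reference points in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, self.n_ref, device=x.device)
        xs = torch.linspace(-1, 1, self.n_ref, device=x.device)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        grid = grid.flip(-1)                 # grid_sample expects (x, y) order
        grid = grid.expand(B, -1, -1, -1)    # (B, n_ref, n_ref, 2)

        # Predict data-dependent offsets from features at the reference points.
        ref_feat = F.grid_sample(x, grid, align_corners=True)      # (B, C, n, n)
        offsets = self.offset_net(ref_feat.permute(0, 2, 3, 1))    # (B, n, n, 2)
        deformed = (grid + torch.tanh(offsets)).clamp(-1, 1)

        # Sample keys/values only at the deformed points: the attention matrix
        # is HW x n^2 instead of HW x HW, which is the complexity reduction.
        kv = F.grid_sample(x, deformed, align_corners=True)        # (B, C, n, n)
        kv = kv.flatten(2).transpose(1, 2)                         # (B, n^2, C)
        q = self.q(x.flatten(2).transpose(1, 2))                   # (B, HW, C)
        k, v = self.k(kv), self.v(kv)

        attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        out = self.proj(attn @ v)                                  # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)
```

A full model would stack layers like this in a pyramid, downsampling the feature map between stages, which is broadly the role the paper's pyramid transformer backbone plays.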

Publication date: 24 Dec 2023
Project Page: https://arxiv.org/abs/2312.16228v1
Paper: https://arxiv.org/pdf/2312.16228