The article introduces the Audio-Video Transformer (AVT), an audio-video recognition approach that exploits effective spatio-temporal representations to improve action recognition. AVT reduces cross-modality fusion complexity through an audio-video bottleneck Transformer and integrates self-supervised objectives into training, mapping diverse audio and video representations into a common multimodal representation space. Experiments demonstrate that AVT outperforms previous methods in both accuracy and efficiency, and that combining video and audio inputs yields a more comprehensive understanding of actions than prior multimodal video Transformers.
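To make the bottleneck-fusion idea concrete, the sketch below shows how a small set of shared bottleneck tokens can mediate audio-video information exchange so that the two token streams never attend to each other directly, which is what keeps cross-modality attention cost low. This is a generic illustration in PyTorch, not the authors' released code: the module names, token counts, dimensions, and the choice to average the two bottleneck updates are all assumptions made for the example.

```python
# Minimal sketch of attention-bottleneck fusion for audio and video tokens.
# Illustrates the general idea only; all names and dimensions are assumptions.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """One fusion layer: audio and video exchange information only through
    a small set of shared bottleneck tokens, which keeps cross-modal
    attention cost low compared to full joint attention."""

    def __init__(self, dim=256, heads=4, mlp_ratio=4):
        super().__init__()
        self.audio_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * mlp_ratio,
            batch_first=True, norm_first=True)
        self.video_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * mlp_ratio,
            batch_first=True, norm_first=True)

    def forward(self, audio, video, bottleneck):
        # audio: (B, Na, D), video: (B, Nv, D), bottleneck: (B, Nb, D)
        na, nv = audio.shape[1], video.shape[1]
        # Audio tokens attend jointly with the bottleneck tokens.
        a_out = self.audio_block(torch.cat([audio, bottleneck], dim=1))
        audio, z_a = a_out[:, :na], a_out[:, na:]
        # Video tokens attend jointly with the bottleneck tokens.
        v_out = self.video_block(torch.cat([video, bottleneck], dim=1))
        video, z_v = v_out[:, :nv], v_out[:, nv:]
        # Average the two bottleneck updates so both modalities contribute.
        return audio, video, 0.5 * (z_a + z_v)


class BottleneckFusion(nn.Module):
    """Stacks fusion layers and pools the bottleneck into a shared
    multimodal embedding used for classification."""

    def __init__(self, dim=256, depth=2, num_bottleneck=4, num_classes=400):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim) * 0.02)
        self.layers = nn.ModuleList(
            [BottleneckFusionLayer(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_tokens, video_tokens):
        z = self.bottleneck.expand(audio_tokens.shape[0], -1, -1)
        for layer in self.layers:
            audio_tokens, video_tokens, z = layer(audio_tokens, video_tokens, z)
        return self.head(z.mean(dim=1))  # logits from the shared bottleneck


# Example with random features standing in for spectrogram/patch tokens.
audio = torch.randn(2, 32, 256)   # (batch, audio tokens, dim)
video = torch.randn(2, 196, 256)  # (batch, video tokens, dim)
logits = BottleneckFusion()(audio, video)
print(logits.shape)  # torch.Size([2, 400])
```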

Publication date: 11 Jan 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2401.04154