Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
The article introduces the Audio-Video Transformer (AVT), a novel audio-video recognition approach that uses an effective spatio-temporal representation to improve action recognition. The research reduces cross-modality complexity via an audio-video…