
audio-video recognition

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification


The article introduces a novel audio-video recognition approach, the Audio-Video Transformer (AVT), which uses an effective spatio-temporal representation to improve action recognition. The research reduces cross-modality complexity via an audio-video…
