Computer Vision and Pattern Recognition

Concerned with enabling computers to interpret and understand visual inputs such as images and videos.

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

The article introduces a novel audio-video recognition approach, the Audio-Video Transformer (AVT), which uses an effective spatio-temporal representation to improve action recognition. The research reduces cross-modality complexity via an audio-video…