Distilling Vision-Language Models on Millions of Videos
The research aims to replicate the success of image-text data for video-language models. The researchers fine-tuned a…
The study presents the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This competition, which follows…
The paper discusses a new approach to the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information…
The paper proposes a Deep Joint Cascade Model (DJCM) for singing voice separation and vocal pitch estimation tasks in…
This 2005 paper by Laurent Millot, Gerard Pel and Mohammed Elliq presents a method for audio scene…
The article introduces a novel audio-video recognition approach called the Audio-Video Transformer (AVT) that uses effective spatio-temporal…
The article introduces ‘FunnyNet-W’, a model that relies on cross- and self-attention for visual, audio, and text…
This paper discusses a technique to enhance the accuracy of Automatic Speech Recognition (ASR) systems. The proposed…
The paper proposes two novel models, DI-AEC and FADI-AEC, for Acoustic Echo Cancellation (AEC). These models pioneer…
SonicVisionLM, a novel framework, is designed to generate sound effects for silent videos by leveraging vision…