root, Author at BytesArchive

January 12, 2024

Distilling Vision-Language Models on Millions of Videos

The research aims to replicate the success of image-text data for video-language models. The researchers fine-tuned a…

January 11, 2024

The study presents the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This competition, which follows…

January 11, 2024

The paper discusses a new approach to the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information…

January 11, 2024

The paper proposes a Deep Joint Cascade Model (DJCM) for singing voice separation and vocal pitch estimation tasks in…

January 11, 2024

This 2005 paper by Laurent Millot, Gérard Pelé, and Mohammed Elliq presents a method for audio scene…

January 11, 2024

The article introduces a novel audio-video recognition approach called the Audio-Video Transformer (AVT) that uses effective spatio-temporal…

January 11, 2024

The article introduces ‘FunnyNet-W’, a model that relies on cross- and self-attention for visual, audio, and text…

January 11, 2024

This paper discusses a technique to enhance the accuracy of Automatic Speech Recognition (ASR) systems. The proposed…

January 11, 2024

The paper proposes two novel models, DI-AEC and FADI-AEC, for Acoustic Echo Cancellation (AEC). These models pioneer…

January 11, 2024

SonicVisionLM, a novel framework, is designed to generate sound effects for silent videos by leveraging vision…