This article introduces RTFS-Net, a new method for audio-visual speech separation. The method operates in the time-frequency domain and models the time and frequency dimensions of the audio independently, using multi-layered RNNs along each dimension. The authors also introduce an attention-based fusion technique for integrating audio and visual information, and a new mask separation approach that exploits the intrinsic spectral nature of the acoustic features for clearer separation. RTFS-Net outperforms the previous state-of-the-art method while using only 10% of the parameters and 18% of the MACs, and it is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
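To make the dual-path idea concrete, below is a minimal PyTorch-style sketch of a block that runs one RNN along the frequency axis and a second RNN along the time axis of a time-frequency feature map. The class name `DualPathRNNBlock`, the layer sizes, and the residual layout are illustrative assumptions for this summary, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualPathRNNBlock(nn.Module):
    """Illustrative dual-path block: one RNN scans the frequency axis,
    a second RNN scans the time axis of a time-frequency feature map."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time frames, freq bins) from an STFT-based encoder
        b, c, t, f = x.shape

        # Frequency path: treat every time frame as a sequence over frequency bins.
        z = x.permute(0, 2, 3, 1).reshape(b * t, f, c)        # (b*t, freq, channels)
        z, _ = self.freq_rnn(z)
        z = self.freq_proj(z).reshape(b, t, f, c).permute(0, 3, 1, 2)
        x = x + z                                             # residual connection

        # Time path: treat every frequency bin as a sequence over time frames.
        z = x.permute(0, 3, 2, 1).reshape(b * f, t, c)        # (b*f, time, channels)
        z, _ = self.time_rnn(z)
        z = self.time_proj(z).reshape(b, f, t, c).permute(0, 3, 2, 1)
        return x + z                                          # residual connection


if __name__ == "__main__":
    block = DualPathRNNBlock(channels=64, hidden=32)
    feats = torch.randn(2, 64, 100, 129)    # (batch, channels, time frames, freq bins)
    print(block(feats).shape)               # torch.Size([2, 64, 100, 129])
```

In the actual model these blocks operate on features derived from the complex STFT of the mixture, with visual cues fused in via the attention-based mechanism before the mask is estimated.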

Publication date: 29 Sep 2023
Project Page: https://arxiv.org/abs/2309.17189v1
Paper: https://arxiv.org/pdf/2309.17189