This article introduces RTFS-Net, a new method for audio-visual speech separation. The method operates in the time-frequency domain and uses multi-layered RNNs to model the time and frequency dimensions of the audio independently. The authors also introduce an attention-based fusion technique for integrating audio and visual information, and a new mask separation approach that exploits the intrinsic spectral nature of the acoustic features to achieve cleaner separation. RTFS-Net outperforms previous state-of-the-art methods while using only 10% of the parameters and 18% of the MACs, and it is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
Publication date: 29 Sep 2023
Project Page: https://arxiv.org/abs/2309.17189v1
Paper: https://arxiv.org/pdf/2309.17189
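To make the core idea concrete, below is a minimal sketch (not the authors' code) of modelling the time and frequency axes of a time-frequency representation with separate RNNs, in the spirit of RTFS-Net. The layer sizes, the use of plain GRUs, and the residual connections are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: dual-dimension RNN processing of a (batch, channels, time, freq) tensor.
# Hyperparameters and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn

class DualDimRNNBlock(nn.Module):
    """Models the frequency axis and the time axis with separate bidirectional RNNs."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.freq_rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        # Frequency pass: treat each time frame as an independent sequence over frequency bins.
        y = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        y = self.freq_proj(self.freq_rnn(y)[0])
        x = x + y.reshape(b, t, f, c).permute(0, 3, 1, 2)  # residual connection
        # Time pass: treat each frequency bin as an independent sequence over time frames.
        y = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        y = self.time_proj(self.time_rnn(y)[0])
        x = x + y.reshape(b, f, t, c).permute(0, 3, 2, 1)  # residual connection
        return x

if __name__ == "__main__":
    block = DualDimRNNBlock(channels=32)
    spec_features = torch.randn(2, 32, 100, 129)  # (batch, channels, time, freq)
    print(block(spec_features).shape)             # torch.Size([2, 32, 100, 129])
```

Processing the two axes with separate, smaller recurrent stages rather than one large network over the flattened representation is what allows this style of model to keep parameter counts and MACs low.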