This article introduces TransFace, a model that translates audio-visual speech directly into other languages, avoiding the latency and cascading errors of existing cascaded pipelines. TransFace pairs a speech-to-unit translation component with Unit2Lip, a unit-based audio-visual speech synthesizer. It also introduces a Bounded Duration Predictor that enforces isometric talking-head translation (the translated video keeps the source clip's duration) and avoids duplicated reference frames. The model shows notable gains in lip synchronization and inference speed while achieving strong BLEU scores.
Publication date: 23 Dec 2023
Project Page: https://transface-demo.github.io/
Paper: https://arxiv.org/pdf/2312.15197
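To make the isometric-translation idea concrete, here is a minimal sketch of what a "bounded" duration predictor has to do at inference time: rescale the per-unit frame durations so they sum to the source clip's frame count, while keeping a lower bound per unit so no unit collapses and no reference frame needs duplicating as padding. The function name and details are illustrative assumptions, not the TransFace implementation.

```python
def bound_durations(raw_durations, target_frames, min_frames=1):
    """Illustrative sketch (not the TransFace code): rescale per-unit
    frame durations so they sum to target_frames, with each unit kept
    at least min_frames long. Assumes target_frames is large enough
    to give every unit its minimum."""
    assert target_frames >= len(raw_durations) * min_frames
    total = sum(raw_durations)
    # Proportionally rescale each duration, enforcing the lower bound.
    scaled = [max(min_frames, round(d * target_frames / total))
              for d in raw_durations]
    # Fix rounding drift by nudging the longest units first.
    drift = sum(scaled) - target_frames
    order = sorted(range(len(scaled)), key=lambda i: -scaled[i])
    i = 0
    while drift != 0:
        idx = order[i % len(order)]
        step = -1 if drift > 0 else 1
        if scaled[idx] + step >= min_frames:
            scaled[idx] += step
            drift += step
        i += 1
    return scaled
```

With this constraint satisfied, the synthesizer can emit exactly one video frame per predicted duration slot, so the translated talking head stays time-aligned with the original footage.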