This article introduces DiffSpeaker, a new model for speech-driven 3D facial animation. Prior approaches rely on either diffusion models or Transformer architectures, and combining the two directly is held back by the shortage of paired audio-4D training data. DiffSpeaker tackles this by equipping a Transformer-based diffusion model with novel biased conditional attention modules, which bias attention toward both the task-specific and the diffusion-related conditions. The model achieves state-of-the-art performance with fast inference, making it well suited to multimedia applications such as virtual assistants, video games, and film production.
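
To make the idea of "biased conditional attention" more concrete, here is a minimal sketch of how such a layer could look. This is not the authors' implementation; it assumes the conditions (a speaker/style embedding and a diffusion-timestep embedding) are prepended as extra key/value tokens, and that a learnable bias added to the attention logits lets each head favor both nearby motion frames and the condition tokens. All module and parameter names are hypothetical.

```python
# Hedged sketch of a biased conditional attention layer (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasedConditionalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, max_len: int = 600):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv_proj = nn.Linear(dim, 3 * dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable bias over relative temporal offsets (hypothetical design choice).
        self.temporal_bias = nn.Parameter(torch.zeros(num_heads, 2 * max_len - 1))
        # Learnable bias toward the two condition tokens (style, diffusion step).
        self.condition_bias = nn.Parameter(torch.zeros(num_heads, 2))

    def forward(self, x, style_emb, t_emb):
        # x: (B, T, D) noisy motion features; style_emb, t_emb: (B, D) conditions.
        B, T, D = x.shape
        tokens = torch.cat([style_emb.unsqueeze(1), t_emb.unsqueeze(1), x], dim=1)
        q, k, v = self.qkv_proj(tokens).chunk(3, dim=-1)

        def split(t):  # (B, L, D) -> (B, H, L, d)
            return t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, H, L, L)

        # Build the bias: columns for the condition tokens get condition_bias,
        # motion-to-motion entries get a relative-position bias.
        L = T + 2
        bias = torch.zeros(self.num_heads, L, L, device=x.device)
        bias[:, 2:, :2] = self.condition_bias[:, None, :]
        offsets = torch.arange(T, device=x.device)
        rel = offsets[:, None] - offsets[None, :] + (self.temporal_bias.shape[1] // 2)
        bias[:, 2:, 2:] = self.temporal_bias[:, rel]
        logits = logits + bias.unsqueeze(0)

        out = F.softmax(logits, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, L, D)
        return self.out_proj(out)[:, 2:]  # return only the motion frames
```

In a denoising step, such a layer would take the noised facial-motion sequence together with the style and timestep embeddings and produce updated per-frame features; refer to the paper and project repository below for the actual architecture.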

Publication date: 9 Feb 2024
Project Page: https://github.com/theEricMa/DiffSpeaker
Paper: https://arxiv.org/pdf/2402.05712