DiffSHEG is an approach to speech-driven holistic 3D expression and gesture generation. Unlike prior work that generates expressions or gestures in isolation, DiffSHEG generates them jointly, better capturing the joint distribution of expressions and gestures. The system is built on a diffusion-based co-speech motion generation Transformer and introduces an outpainting-based sampling strategy for long sequence generation. The method is efficient and flexible, producing high-quality expressions and gestures synchronized with the driving speech. Evaluated on two public datasets, it outperforms previous methods, indicating its potential for building digital humans and embodied agents.
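
The outpainting-based sampling strategy is only described at a high level above. The sketch below is a minimal illustration of how such chunk-wise outpainting can work in general, not DiffSHEG's actual implementation: a diffusion denoiser generates the sequence chunk by chunk, and at every denoising step the overlapping prefix of the current chunk is pinned to a re-noised copy of the previous chunk's tail, so the model fills in a seamless continuation. All names (`sample_long_sequence`, `denoiser`), the noise schedule, and the dimensions are illustrative placeholders.

```python
import torch

def noise_like(x0, alpha_bar):
    """Forward diffusion q(x_t | x_0): noise a clean sample to level alpha_bar."""
    return alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * torch.randn_like(x0)

@torch.no_grad()
def sample_long_sequence(denoiser, audio_chunks, chunk_len, overlap, dim, steps=50):
    """Generate a long motion sequence chunk by chunk with outpainting.

    `denoiser(x_t, t, audio)` is assumed to predict the clean chunk x_0.
    During denoising, the first `overlap` frames of each chunk are repeatedly
    replaced by a re-noised copy of the previous chunk's tail, so the model
    outpaints the remaining frames as a smooth continuation of that prefix.
    """
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bars = torch.cumprod(1 - betas, dim=0)

    generated = []
    prev_tail = None  # clean last `overlap` frames of the previous chunk

    for audio in audio_chunks:
        x = torch.randn(chunk_len, dim)  # start each chunk from pure noise
        for t in reversed(range(steps)):
            if prev_tail is not None:
                # Outpainting constraint: pin the overlapping prefix to the
                # previous chunk's tail, noised to the current level.
                x[:overlap] = noise_like(prev_tail, alpha_bars[t])
            x0_pred = denoiser(x, t, audio)  # predicted clean chunk
            if t > 0:
                # Simplified update: re-noise the x_0 prediction to level t-1
                # (stands in for a proper DDPM/DDIM reverse step).
                x = noise_like(x0_pred, alpha_bars[t - 1])
            else:
                x = x0_pred
        if prev_tail is not None:
            x = x[overlap:]  # drop the prefix duplicated from the last chunk
        generated.append(x)
        prev_tail = x[-overlap:].clone()

    return torch.cat(generated, dim=0)

# Toy usage with a dummy denoiser that ignores the audio conditioning.
dummy = lambda x, t, audio: x * 0.0
motion = sample_long_sequence(dummy, audio_chunks=[None] * 4,
                              chunk_len=80, overlap=10, dim=142)
print(motion.shape)  # torch.Size([290, 142]): 80 + 3 * 70 frames
```

Because each chunk is conditioned only on a short overlap with its predecessor, this style of sampling can extend generation to sequences of arbitrary length without retraining on longer clips.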

Publication date: 11 Jan 2024
Project Page: https://jeremycjm.github.io/proj/DiffSHEG
Paper: https://arxiv.org/pdf/2401.04747