The article presents HumanTOMATO, a framework for generating whole-body motion from text descriptions. Earlier text-to-motion models often neglect fine-grained control of the hands and face, which realistic motion requires, and they struggle to keep the generated motion aligned with the text. HumanTOMATO addresses both problems: a Holistic Hierarchical VQ-VAE and a Hierarchical-GPT handle detailed body and hand motion reconstruction, and a pre-trained text-motion-alignment model improves the match between the text and the generated motion. The result is more realistic, better text-aligned motion.

Publication date: 19 Oct 2023
Project Page: https://lhchen.top/HumanTOMATO
Paper: https://arxiv.org/pdf/2310.12978