GestureGPT is a zero-shot gesture understanding and grounding framework that leverages large language models (LLMs). It formulates natural-language gesture descriptions from hand landmark coordinates extracted from gesture videos and feeds them into a dual-agent dialogue system: a gesture agent deciphers the descriptions and asks questions about the interaction context, which a context agent organizes and provides. From this dialogue, the gesture agent discerns the user's intent and grounds it to an interactive function. The system showed high grounding accuracy in two real-world settings: video streaming and smart home IoT control.
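Below is a minimal Python sketch of what such a dual-agent dialogue loop could look like, not the paper's implementation: the `llm` callable, the agent classes, and the `stub_llm` backend are hypothetical placeholders standing in for whatever chat model, prompts, and context sources GestureGPT actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List

# `LLM` stands in for any chat-completion backend: it takes a prompt string
# and returns the model's reply as a string (assumed interface).
LLM = Callable[[str], str]


@dataclass
class GestureAgent:
    llm: LLM

    def interpret(self, gesture_description: str, ask_context: Callable[[str], str],
                  functions: List[str], max_turns: int = 3) -> str:
        """Iteratively query the context agent, then ground intent to one function."""
        dialogue = f"Gesture description: {gesture_description}\n"
        for _ in range(max_turns):
            question = self.llm(
                dialogue + "If more interaction context is needed, ask ONE question; "
                           "otherwise reply DONE."
            )
            if question.strip() == "DONE":
                break
            # Send the question to the context agent and record its answer.
            dialogue += f"Q: {question}\nA: {ask_context(question)}\n"
        return self.llm(
            dialogue + "Ground the user's intent to exactly one of these functions: "
            + ", ".join(functions)
        )


@dataclass
class ContextAgent:
    llm: LLM
    context: str  # e.g. UI state, gaze target, device list

    def answer(self, question: str) -> str:
        return self.llm(f"Context:\n{self.context}\nQuestion: {question}\nAnswer briefly.")


def stub_llm(prompt: str) -> str:
    """Placeholder backend so the sketch runs without an API key."""
    if "ask ONE question" in prompt:
        return "DONE"
    return "pause_playback"


if __name__ == "__main__":
    context_agent = ContextAgent(llm=stub_llm,
                                 context="Video player is playing; cursor over timeline.")
    gesture_agent = GestureAgent(llm=stub_llm)
    choice = gesture_agent.interpret(
        "Open palm facing the screen, fingers spread, held still for about one second.",
        ask_context=context_agent.answer,
        functions=["pause_playback", "volume_up", "next_video"],
    )
    print(choice)  # -> pause_playback (with the stub backend)
```

Passing the context agent in as a plain callable keeps the gesture agent agnostic to where the interaction context comes from, which mirrors the separation of roles described above.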


Publication date: 19 Oct 2023
Project Page: https://arxiv.org/abs/2310.12821v1
Paper: https://arxiv.org/pdf/2310.12821