The paper introduces a pipeline that enhances a Vision Language Model, GPT-4V(ision), with observations of human actions to facilitate robotic manipulation. The system analyzes videos of humans performing tasks and produces executable robot programs. The pipeline first uses GPT-4V to convert environmental and action details in the video into text, which is then passed to a GPT-4-empowered task planner. Object names are grounded with an open-vocabulary object detector, while attention to the hand-object relation detects the moments of grasping and releasing. This spatiotemporal grounding lets the vision system additionally gather affordance data. The method achieves real-robot operation from human demonstrations in a zero-shot manner.
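
For orientation, here is a minimal Python sketch of how the stages described above might be wired together. All names here (`describe_video_with_gpt4v`, `plan_tasks_with_gpt4`, `ground_objects`, `detect_grasp_release`, `build_robot_program`, `GraspEvent`) are hypothetical placeholders, not the authors' released code; the actual prompts and pipeline are available via the project page linked below.

```python
# Hypothetical sketch of the pipeline described above; function names and
# signatures are illustrative placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GraspEvent:
    frame_idx: int      # video frame at which the hand-object relation changes
    kind: str           # "grasp" or "release"
    object_name: str    # object involved in the event


def describe_video_with_gpt4v(frames: List[bytes]) -> str:
    """Send sampled video frames to GPT-4V and return a textual summary of the
    environment and the observed human actions (placeholder)."""
    raise NotImplementedError


def plan_tasks_with_gpt4(description: str) -> List[str]:
    """Feed the textual summary to a GPT-4 task planner and return an ordered
    list of robot task steps (placeholder)."""
    raise NotImplementedError


def ground_objects(frames: List[bytes], object_names: List[str]) -> Dict[str, list]:
    """Localize the named objects in the frames with an open-vocabulary
    object detector (placeholder)."""
    raise NotImplementedError


def detect_grasp_release(frames: List[bytes],
                         detections: Dict[str, list]) -> List[GraspEvent]:
    """Track the hand-object relation to find the moments of grasping and
    releasing (placeholder)."""
    raise NotImplementedError


def build_robot_program(steps: List[str], detections: Dict[str, list],
                        events: List[GraspEvent]) -> str:
    """Combine the task steps with the spatiotemporal grounding (object
    locations, grasp/release timing, affordance data) into an executable
    robot program (placeholder)."""
    raise NotImplementedError


def video_to_robot_program(frames: List[bytes], object_names: List[str]) -> str:
    """End-to-end flow: video -> text -> task plan -> grounding -> program."""
    description = describe_video_with_gpt4v(frames)    # GPT-4V: video to text
    steps = plan_tasks_with_gpt4(description)          # GPT-4: text to task plan
    detections = ground_objects(frames, object_names)  # spatial grounding
    events = detect_grasp_release(frames, detections)  # temporal grounding
    return build_robot_program(steps, detections, events)
```

In the paper, the object names to ground come out of the GPT-4V/GPT-4 text analysis rather than being supplied externally; they are passed as an explicit argument here only to keep the sketch self-contained.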

Publication date: 21 Nov 2023
Project Page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
Paper: https://arxiv.org/pdf/2311.12015