The paper proposes SuSIE, a method that improves robotic manipulation in unstructured environments by using an image-editing diffusion model to propose intermediate subgoals that a low-level controller can reach. The high-level model is InstructPix2Pix, finetuned on video data so that, given the robot's current observation and a language command, it outputs a hypothetical future observation to serve as a subgoal. Because the subgoal generator builds on a pretrained image-editing model, this approach enables the robot to recognize and reason about novel objects and scenarios that are not present in the robot's own training data. The paper demonstrates that SuSIE outperforms conventional language-conditioned policies, achieving state-of-the-art results on the CALVIN benchmark.
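The high-level/low-level decomposition can be illustrated with a minimal control-loop sketch. This is a hypothetical illustration, not the paper's released code: `edit_model`, `policy`, `env`, and all method names are assumed stand-ins for a finetuned InstructPix2Pix-style subgoal generator and a goal-conditioned low-level controller.

```python
def susie_control_loop(env, edit_model, policy, instruction,
                       subgoal_horizon=20, max_steps=200):
    """Alternate between proposing image subgoals and reaching them.

    Hypothetical sketch: `edit_model` stands in for a finetuned
    InstructPix2Pix-style diffusion model, `policy` for a
    goal-conditioned low-level controller.
    """
    obs = env.reset()
    subgoal = None
    for step in range(max_steps):
        # Periodically sample a fresh subgoal image by "editing" the
        # current observation according to the language instruction.
        if step % subgoal_horizon == 0:
            subgoal = edit_model.generate(image=obs, prompt=instruction)
        # The low-level policy conditions on the current observation and
        # the subgoal image, not on the language command itself.
        action = policy.predict(obs, goal_image=subgoal)
        obs, done = env.step(action)
        if done:
            break
    return obs
```

Note the division of labor this sketch highlights: language grounding lives entirely in the image-editing model, while the controller only needs to reach a nearby visual goal.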


Publication date: 16 Oct 2023
Project Page: http://rail-berkeley.github.io/susie
Paper: https://arxiv.org/pdf/2310.10639