The paper introduces a new task, multimodal planning problem specification, which aims to generate a problem description (PD): a machine-readable file that a planner uses to find a plan. The authors propose the Vision-Language Interpreter (ViLaIn), a framework that generates PDs with state-of-the-art large language models and vision-language models, then refines them using error-message feedback from the symbolic planner. ViLaIn is evaluated on the ProDG dataset with four new evaluation metrics. The results show that it generates syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy.
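The refinement step, in which planner error messages are fed back into PD generation, can be pictured as a simple retry loop. The following is a minimal sketch only, not ViLaIn's actual implementation: `refine_pd`, `toy_generate`, `toy_plan`, and the `max_rounds` parameter are hypothetical names introduced for illustration, and the toy planner merely checks for balanced parentheses instead of calling a real symbolic planner.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not ViLaIn's API):
#   Generator: takes the previous planner error (None on the first attempt)
#              and returns the text of a problem description (PD).
#   Planner:   takes a PD and returns (plan, error); exactly one is None.
Generator = Callable[[Optional[str]], str]
Planner = Callable[[str], Tuple[Optional[List[str]], Optional[str]]]


def refine_pd(
    generate: Generator, plan: Planner, max_rounds: int = 3
) -> Tuple[str, Optional[List[str]]]:
    """Regenerate the PD until the planner finds a plan or rounds run out."""
    feedback: Optional[str] = None
    pd = ""
    for _ in range(max_rounds):
        pd = generate(feedback)
        steps, error = plan(pd)
        if steps is not None:
            return pd, steps  # the planner accepted the PD
        feedback = error  # pass the error message to the next attempt
    return pd, None  # no valid plan within max_rounds


def toy_generate(feedback: Optional[str]) -> str:
    """Stand-in for the model: the first draft lacks a closing parenthesis."""
    draft = "(define (problem p1) (:domain blocks) (:init (on a b)) (:goal (on b a))"
    return draft + ")" if feedback else draft


def toy_plan(pd: str) -> Tuple[Optional[List[str]], Optional[str]]:
    """Stand-in for the planner: rejects PDs with unbalanced parentheses."""
    if pd.count("(") != pd.count(")"):
        return None, "syntax error: unbalanced parentheses"
    return ["(unstack a b)", "(put-down a)", "(pick-up b)", "(stack b a)"], None


if __name__ == "__main__":
    final_pd, steps = refine_pd(toy_generate, toy_plan)
    print(final_pd)
    print(steps)
```

In this toy run the first draft is rejected, the error message becomes the feedback for the second attempt, and the corrected PD is accepted. Capping the loop with `max_rounds` is a design choice of the sketch that keeps a persistently failing generator from looping indefinitely.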

Publication date: 3 Nov 2023
Project Page: https://github.com/omron-sinicx/ViLaIn
Paper: https://arxiv.org/pdf/2311.00967