The paper presents MM-Navigator, a GPT-4V-based agent that interacts with a smartphone screen and decides the next action to take from a given instruction. In human assessments on iOS screens, the system reached 91% accuracy in generating reasonable action descriptions and 75% accuracy in executing the correct action for single-step instructions. On a subset of an Android screen navigation dataset, it outperformed previous GUI navigators in a zero-shot setting. The authors aim to lay solid groundwork for future research on the GUI navigation task.

Publication date: 14 Nov 2023
Project Page: https://github.com/zzxslp/MM-Navigator
Paper: https://arxiv.org/pdf/2311.07562