The paper examines the challenges Large Language Models (LLMs) face when applied to general-purpose software systems such as operating systems: maintaining an up-to-date understanding of a vast and dynamic action space, coordinating across applications, and aligning with user constraints. The authors propose AndroidArena, an environment and benchmark for evaluating LLM agents on a modern operating system. Their findings show that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. The study identifies insufficient understanding, reasoning, exploration, and reflection as the primary causes of failure, and proposes an exploration strategy that improves the success rate by 27%.
Publication date: 12 Feb 2024
Project Page: https://github.com/AndroidArenaAgent/AndroidArena
Paper: https://arxiv.org/pdf/2402.06596