The article introduces TaskBench, a benchmark for evaluating the capabilities of large language models (LLMs) in task automation. Task automation, which decomposes complex user requests into sub-tasks and invokes external tools to execute them, plays a central role in autonomous agents, yet a systematic and standardized benchmark for it has been lacking. TaskBench is designed to fill this gap, evaluating LLMs on task decomposition, tool invocation, and parameter prediction. The authors also introduce the Tool Graph, which represents decomposed tasks as tools (nodes) connected by their dependencies (edges), and a Back-Instruct method that samples subgraphs of the Tool Graph to simulate user instructions and annotations. Experimental results show that TaskBench effectively reflects the capability of LLMs in task automation.
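
To make the Tool Graph idea more concrete, below is a minimal Python sketch: tools are nodes, directed edges indicate that one tool's output can feed another's input, and a small sub-structure is sampled in the spirit of the Back-Instruct step (the subsequent LLM call that writes the matching instruction is omitted). The tool names, fields, and the `sample_subgraph` helper are illustrative assumptions, not TaskBench's actual schema or code.

```python
from dataclasses import dataclass, field
import random

# Illustrative tool node: a name plus the parameters it accepts.
@dataclass
class Tool:
    name: str
    parameters: list[str] = field(default_factory=list)

# Hypothetical tool graph: nodes are tools, a directed edge (a, b) means
# "tool a's output can be consumed as an input of tool b".
tools = {
    "image_caption": Tool("image_caption", ["image"]),
    "text_translation": Tool("text_translation", ["text", "target_lang"]),
    "text_to_speech": Tool("text_to_speech", ["text"]),
}
edges = [
    ("image_caption", "text_translation"),
    ("text_translation", "text_to_speech"),
]

def sample_subgraph(edges, num_edges=2, seed=0):
    """Sample a contiguous chain of edges as a candidate decomposed task.

    This mimics the spirit of Back-Instruct: pick a sub-structure of the
    tool graph first, then ask an LLM to write a user instruction that
    requires exactly these tools. Assumes `edges` is a chain for simplicity.
    """
    rng = random.Random(seed)
    num_edges = min(num_edges, len(edges))
    start = rng.randrange(len(edges) - num_edges + 1)
    return edges[start:start + num_edges]

if __name__ == "__main__":
    sub = sample_subgraph(edges)
    chain = [sub[0][0]] + [dst for _, dst in sub]
    print("Sampled tool chain:", " -> ".join(chain))
```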


Publication date: 30 Nov 2023
Project Page: https://github.com/microsoft/JARVIS
Paper: https://arxiv.org/pdf/2311.18760