The article introduces TaskBench, a benchmark for evaluating the capabilities of large language models (LLMs) in task automation. Task automation, which decomposes complex user requests into sub-tasks and invokes external tools to execute them, plays a central role in autonomous agents, yet a systematic and standardized benchmark for it has been lacking. TaskBench is designed to fill this gap, evaluating LLMs on task decomposition, tool invocation, and parameter prediction. The authors also introduce the Tool Graph, which represents decomposed tasks as tools (nodes) connected by their dependencies (edges), and a Back-Instruct method that samples subgraphs of the Tool Graph to simulate user instructions and annotations. Experimental results show that TaskBench effectively reflects the capability of LLMs in task automation.
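
To make the Tool Graph idea more concrete, below is a minimal Python sketch: tools are nodes, directed edges indicate that one tool's output can feed another's input, and a small sub-structure is sampled in the spirit of the Back-Instruct step (the subsequent LLM call that writes the matching instruction is omitted). The tool names, fields, and the `sample_subgraph` helper are illustrative assumptions, not TaskBench's actual schema or code.

```python
from dataclasses import dataclass, field
import random

# Illustrative tool node: a name plus the parameters it accepts.
@dataclass
class Tool:
    name: str
    parameters: list[str] = field(default_factory=list)

# Hypothetical tool graph: nodes are tools, a directed edge (a, b) means
# "tool a's output can be consumed as an input of tool b".
tools = {
    "image_caption": Tool("image_caption", ["image"]),
    "text_translation": Tool("text_translation", ["text", "target_lang"]),
    "text_to_speech": Tool("text_to_speech", ["text"]),
}
edges = [
    ("image_caption", "text_translation"),
    ("text_translation", "text_to_speech"),
]

def sample_subgraph(edges, num_edges=2, seed=0):
    """Sample a contiguous chain of edges as a candidate decomposed task.

    This mimics the spirit of Back-Instruct: pick a sub-structure of the
    tool graph first, then ask an LLM to write a user instruction that
    requires exactly these tools. Assumes `edges` is a chain for simplicity.
    """
    rng = random.Random(seed)
    num_edges = min(num_edges, len(edges))
    start = rng.randrange(len(edges) - num_edges + 1)
    return edges[start:start + num_edges]

if __name__ == "__main__":
    sub = sample_subgraph(edges)
    chain = [sub[0][0]] + [dst for _, dst in sub]
    print("Sampled tool chain:", " -> ".join(chain))
```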


Publication date: 30 Nov 2023
Project Page: https://github.com/microsoft/JARVIS
Paper: https://arxiv.org/pdf/2311.18760