This paper presents T-Eval, a new benchmark for evaluating the tool-utilization capability of Large Language Models (LLMs). Unlike previous benchmarks that score tool use end-to-end, T-Eval decomposes the evaluation into sub-tasks along the tool-calling process: planning, reasoning, retrieval, understanding, instruction following, and review. This decomposition yields a more fine-grained and fair assessment of an LLM's competencies and offers a new perspective on evaluating tool-utilization ability. A hypothetical sketch of such per-dimension scoring follows below.
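
To make the decomposition idea concrete, here is a minimal, hypothetical Python sketch of scoring a model separately on each capability and averaging into an overall result. The scorer names, data layout, and exact-match metric are illustrative assumptions, not the actual T-Eval implementation or API.

```python
# Hypothetical sketch in the spirit of T-Eval: each capability (plan, reason,
# retrieve, understand, instruct, review) is scored separately, then averaged
# into an overall tool-use score. Names and scoring logic are illustrative only.
from statistics import mean
from typing import Callable, Dict, List

# One scorer per decomposed capability; each returns a score in [0, 1].
Scorer = Callable[[str, str], float]

def exact_match(prediction: str, reference: str) -> float:
    """Toy scorer: 1.0 if the model output matches the reference exactly."""
    return float(prediction.strip() == reference.strip())

SUBTASK_SCORERS: Dict[str, Scorer] = {
    "plan": exact_match,
    "reason": exact_match,
    "retrieve": exact_match,
    "understand": exact_match,
    "instruct": exact_match,
    "review": exact_match,
}

def evaluate(samples: List[Dict[str, str]]) -> Dict[str, float]:
    """Score each sample on its sub-task; report per-dimension and overall means."""
    per_dim: Dict[str, List[float]] = {name: [] for name in SUBTASK_SCORERS}
    for sample in samples:
        dim = sample["subtask"]
        score = SUBTASK_SCORERS[dim](sample["prediction"], sample["reference"])
        per_dim[dim].append(score)
    report = {dim: mean(scores) for dim, scores in per_dim.items() if scores}
    report["overall"] = mean(report.values())
    return report

if __name__ == "__main__":
    demo = [
        {"subtask": "plan", "prediction": "search -> summarize",
         "reference": "search -> summarize"},
        {"subtask": "review", "prediction": "tool call is valid",
         "reference": "tool call is invalid"},
    ]
    print(evaluate(demo))
```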


Publication date: 22 Dec 2023
Project Page: https://github.com/open-compass/T-Eval
Paper: https://arxiv.org/pdf/2312.14033