The article introduces AlignBench, a comprehensive benchmark for evaluating the alignment of Chinese large language models (LLMs). The benchmark is built with a human-in-the-loop data curation pipeline and uses automatic evaluation tailored for alignment: model answers are scored by an LLM-as-Judge against human-curated references. A dedicated judge model, CritiqueLLM, recovers 95% of GPT-4's evaluation ability and is available to researchers via public APIs. The benchmark focuses on real-world user queries, open-ended answers, and challenging tasks to reflect authentic LLM usage. All evaluation code, data, and LLM generations are available on the project's GitHub page.
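
The reference-based LLM-as-Judge pattern the benchmark relies on can be sketched as follows. This is a minimal illustration, not AlignBench's actual pipeline: the judge model name, prompt wording, 1-10 scale, and score-parsing regex are all assumptions for demonstration; see the project repo for the real templates.

```python
# Minimal sketch of reference-based LLM-as-Judge scoring.
# Prompt wording, model name, and regex are illustrative assumptions,
# not AlignBench's exact templates.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a user question, a reference
answer, and a model answer, rate the model answer on a 1-10 scale, where 10 means
it fully matches the reference in correctness and helpfulness.

Question: {question}
Reference answer: {reference}
Model answer: {answer}

Reply with your reasoning, then a final line formatted as: Score: <number>"""

def judge(question: str, reference: str, answer: str, model: str = "gpt-4") -> int:
    """Ask a judge model to score one answer; returns the parsed 1-10 score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,  # deterministic judging
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*(\d+)", text)
    if match is None:
        raise ValueError(f"Judge reply had no parsable score: {text!r}")
    return int(match.group(1))
```

In practice, a judge like this is run over every benchmark query and the per-answer scores are aggregated per task category; AlignBench's contribution is calibrating such judgments with explicit rules and references so a cheaper judge (CritiqueLLM) can stand in for GPT-4.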

Publication date: 1 Dec 2023
Project Page: https://github.com/THUDM/AlignBench
Paper: https://arxiv.org/pdf/2311.18743