The article introduces GeoEval, a comprehensive benchmark of geometry math problems designed to evaluate how proficiently Large Language Models (LLMs) and Multi-Modal Models (MMs) solve them. The benchmark comprises several subsets (a main set, a backward-reasoning set, an augmented set, and a hard set), each probing a different aspect of geometric problem-solving. The study found that the WizardMath model performed best overall, yet its accuracy dropped sharply on the hard subset, highlighting the need to test models on data they have not been pre-trained on. Additionally, GPT-series models performed better on problems they had rephrased themselves, suggesting self-rephrasing as a potential method for enhancing model performance.
Publication date: 15 Feb 2024
Project Page: https://github.com/GeoEval/GeoEval
Paper: https://arxiv.org/pdf/2402.10104
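
To illustrate the rephrasing finding, here is a minimal sketch (not from the GeoEval repository) of a two-pass "rephrase, then solve" prompting loop using the OpenAI Python client; the model name, prompts, and example problem are assumptions chosen for illustration only.

```python
# Hypothetical two-pass prompting sketch: ask the model to rephrase a geometry
# problem in its own words, then solve the rephrased version. Model name and
# prompt wording are assumptions, not taken from the GeoEval paper or repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rephrase_then_solve(problem: str, model: str = "gpt-4") -> str:
    # Pass 1: restate the problem clearly and completely.
    rephrased = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Rephrase this geometry problem clearly and completely:\n{problem}",
        }],
    ).choices[0].message.content

    # Pass 2: solve the rephrased version instead of the original wording.
    answer = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Solve this geometry problem step by step:\n{rephrased}",
        }],
    ).choices[0].message.content
    return answer


if __name__ == "__main__":
    print(rephrase_then_solve(
        "In triangle ABC, AB = AC and angle A = 40 degrees. Find angle B."
    ))
```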