The paper investigates how well large language models (LLMs) can evaluate tutors' responses to students' math errors. Analyzing 50 real-life tutoring dialogues, the study finds that models such as GPT-3.5-Turbo and GPT-4 are proficient at assessing tutors' reactions to student errors. These models also have limitations, however, such as over-identifying student errors. Future work will focus on a larger dataset and on evaluating learning transfer in real-life scenarios.
Publication date: 9 Jan 2024
Abstract: https://arxiv.org/abs/2401.03238v1
Paper: https://arxiv.org/pdf/2401.03238