This research evaluates how well GPT-4's evaluations align with those of human clinician experts when assessing responses to ophthalmology-related patient queries generated by fine-tuned large language model (LLM) chatbots. A dataset of 400 general ophthalmology questions with paired answers was created and divided into fine-tuning and testing sets. Five different LLMs were fine-tuned, and GPT-4's evaluations of their responses were compared against rankings from five clinicians to assess clinical alignment. The study found significant agreement between the GPT-4 evaluation and the human clinician rankings. However, the GPT-4 evaluation also identified clinical inaccuracies in the LLM-generated responses.
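The summary does not specify which agreement statistic the authors used. As a minimal sketch of how such rank agreement could be quantified, the snippet below computes Kendall's tau between a GPT-4 ranking and a clinician ranking of the five models' answers to a single question; all values and variable names are hypothetical placeholders, not data from the paper.

```python
# Hypothetical sketch: quantifying agreement between GPT-4 rankings and
# clinician rankings using Kendall's tau. The rankings below are invented
# placeholders; the paper's actual statistic and data are not given here.
from scipy.stats import kendalltau

# Rank assigned to each of the five fine-tuned LLMs' answers for one
# question (1 = best). Illustrative values only.
gpt4_ranks = [1, 2, 3, 4, 5]
clinician_ranks = [1, 3, 2, 4, 5]  # e.g., one clinician's ranking

# tau near 1 indicates strong agreement; the p-value tests the null
# hypothesis of no association between the two rankings.
tau, p_value = kendalltau(gpt4_ranks, clinician_ranks)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```

In practice one would aggregate such per-question statistics across the full test set and across all five clinicians before drawing conclusions about alignment.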
Publication date: 16 Feb 2024
Project Page: Not provided
Paper: https://arxiv.org/pdf/2402.10083