Recent developments in Large Language Models (LLMs) have opened new opportunities in the healthcare sector. These LLMs are not only capable of modeling language; they can also act as intelligent agents that interact with stakeholders in open-ended conversations and influence clinical decision-making. The paper argues that these LLM agents should be assessed on their performance in real-world clinical tasks, rather than through benchmarks that measure a model's ability to process clinical data or answer standardized test questions. To this end, the authors propose a new evaluation framework called Artificial-intelligence Structured Clinical Examinations (AI-SCI).
Publication date: 19 Sep 2023
Abstract: https://arxiv.org/abs/2309.10895v1
Paper: https://arxiv.org/pdf/2309.10895