Investigating Error Types in GPT-4

L3S Best Paper of the Quarter (Q2/2024)  
Category: IR, Generative AI

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions 

Authors: Soumyadeep Roy, Aparup Khatua, Fatemeh Ghoochani, Uwe Hadler, Wolfgang Nejdl, Niloy Ganguly 

Published in the Q1 journal “Computer Networks”. Preprint: https://arxiv.org/abs/2404.13307

The paper in a nutshell: 

The study investigates errors made by GPT-4, a leading AI model, when answering complex medical questions from the United States Medical Licensing Examination (USMLE). The critical contribution of this work is to establish that, for evaluating medical question-answering AI models, the model must generate a rationale (a natural-language explanation) along with its answer. The researchers developed a detailed error taxonomy with the help of medical professionals. In a large-scale annotation study, 44 medical experts annotated GPT-4 responses at the sentence level according to this taxonomy. The researchers find that GPT-4 rarely makes factual errors. However, it is highly prone to reasoning errors, which are even more challenging for humans and existing moderation systems to detect. This research thus provides valuable insights into the strengths and limitations of GPT-4 in medical question-answering tasks.

Which problem do you solve with your research? 

This research addresses the challenge of understanding why and how advanced AI models such as GPT-4 make errors when answering complex medical questions. By developing a detailed error taxonomy and analyzing GPT-4’s responses against it, the study provides a deeper understanding of the model’s reasoning process and its limitations in the medical domain.

What is the potential impact of your findings? 

The findings from this study can help improve AI models for medical applications by identifying the specific areas where they struggle. This could lead to more reliable AI-assisted medical decision-making tools, potentially enhancing patient care and medical education. In addition, the research provides valuable resources, including the annotated dataset, for further studies of AI performance on complex medical tasks.

What is new about your research? 

This study introduces a novel, domain-specific error taxonomy for AI responses to medical questions, developed in collaboration with medical professionals. It also presents a new dataset of GPT-4’s detailed responses to USMLE questions, including the model’s explanations for its answer choices. This approach enables a more comprehensive understanding of the AI’s reasoning process than simple accuracy measurements allow.