The picture shows a sample test question from the United States Medical Licensing Examination (USMLE). Source: https://www.usmle.org/exam-resources/step-1-materials/step-1-sample-test-questions

AI in Medicine

How GPT-4 Fails in Medical Tests – and Yet Still Convinces

GPT-4 is considered one of the most advanced language models. But while it is capable of answering complex medical questions, it is not without flaws. In a recent study, scientists from the L3S Research Centre, IIT Kharagpur and the University of Michigan investigated the types of errors GPT-4 makes on medical exam questions, and why some of them are even considered ‘reasonable’ by experts. The study shows that GPT-4 impresses not only with its correct answers, but also with how convincing its errors are – and that there is still plenty of room for improvement.

There is a lot of hype around GPT-4 in the world of artificial intelligence (AI), and the model shows impressive performance, especially when answering medical questions. On the MedQA-USMLE dataset, which contains questions from the US Medical Licensing Examination (USMLE), GPT-4 achieves a remarkable accuracy of 86.7 per cent. However, even this success rate leaves 13.3 per cent of answers incorrect – no small matter when it comes to medical diagnoses.

“We wanted to understand why GPT-4 is wrong in these cases,” explains Soumyadeep Roy, a PhD student at IIT Kharagpur and lead author of the study. The team created an error taxonomy that classified GPT-4 responses into seven error categories. In particular, they focused on the model’s reasoning – its thought processes and conclusions.

Plausible but wrong

In a complex process, the research team had 44 medical experts analyse a total of 300 incorrect answers from GPT-4. Interestingly, the majority of errors were categorised as ‘reasonable responses by GPT-4’. This shows that even in the case of incorrect diagnoses, GPT-4’s reasoning sounds plausible – a major problem for medical professionals who want to use this technology as an aid.
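As a rough illustration of this kind of analysis, the Python sketch below aggregates per-answer expert labels into a majority category and tallies the overall distribution. It is not the study’s actual pipeline, and the category names used here are placeholders rather than the seven categories defined in the paper:

from collections import Counter

def majority_label(expert_labels):
    """Most frequently assigned error category for one incorrect answer."""
    return Counter(expert_labels).most_common(1)[0][0]

def category_distribution(annotations):
    """Distribution of majority labels across all annotated answers."""
    return Counter(majority_label(labels) for labels in annotations.values())

# Toy example: two incorrect GPT-4 answers, each labelled by three experts.
# The labels are illustrative placeholders, not the paper's taxonomy.
annotations = {
    "question_017": ["reasonable response", "reasonable response", "factual error"],
    "question_142": ["sticks to wrong diagnosis", "sticks to wrong diagnosis",
                     "reasonable response"],
}
print(category_distribution(annotations))
# Counter({'reasonable response': 1, 'sticks to wrong diagnosis': 1})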

“It is frightening to see that GPT-4 often argues so convincingly that even experts do not immediately recognise the errors,” says co-author Uwe Hadler of L3S. A common mistake: GPT-4 recognises the symptoms and interprets them correctly, but still makes an incorrect diagnosis because it sticks to the wrong decision.

AI defends mistakes

One of the biggest challenges is that GPT-4 often tries to justify its initial decision, rather than acting properly on the information given. This leads to errors that the AI stubbornly defends. “Once GPT-4 has made a decision, there is no going back,” the study says.

Despite these weaknesses, GPT-4 is still seen as a potentially valuable tool in the medical field, mainly because of its ability to summarise medical information and suggest diagnoses. However, the researchers point out that a detailed understanding of the sources of error is crucial to further improve the technology and make it safer.

Performance can improve over time

Another aspect of the study was the so-called ‘drift behaviour’ of GPT-4, meaning that the model’s performance can change over time. “It is fascinating to see how quickly GPT-4 improves over time,” says Hadler. An analysis of GPT-4’s responses taken several months apart showed that in 23.3 per cent of cases the model still made the same mistakes it had made before. “This shows that there is still a lot of room for improvement.”
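A minimal sketch of how such a repeated-error rate could be computed is shown below; the data layout and the assumption that a ‘same mistake’ means the identical incorrect answer option are illustrative choices, not taken from the study:

def repeated_error_rate(gold, answers_t1, answers_t2):
    """Fraction of questions answered incorrectly in the first run that
    receive the same incorrect answer again in the later run."""
    wrong_t1 = [q for q, a in answers_t1.items() if a != gold[q]]
    if not wrong_t1:
        return 0.0
    repeated = [q for q in wrong_t1 if answers_t2.get(q) == answers_t1[q]]
    return len(repeated) / len(wrong_t1)

# Toy example with three multiple-choice questions (options A to D):
gold      = {"q1": "B", "q2": "C", "q3": "A"}
run_early = {"q1": "A", "q2": "C", "q3": "D"}   # q1 and q3 are wrong
run_later = {"q1": "A", "q2": "C", "q3": "A"}   # only q1 repeats the same mistake
print(repeated_error_rate(gold, run_early, run_later))  # 0.5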

The results of the study are a double-edged sword: on the one hand, GPT-4 shows impressive capabilities in answering medical questions; on the other, the types of errors need to be understood and addressed before such systems can be widely used in medical practice. Until then, artificial intelligence remains a tool to be used with caution.

“The fact that GPT-4 sounds so convincing even when the answers are wrong is an indication of how difficult it can be, even for experts, to recognise the limitations of AI-based systems and to find the right balance between trust and critical questioning,” say the authors.

Soumyadeep Roy, Aparup Khatua, Fatemeh Ghoochani, Uwe Hadler, Wolfgang Nejdl, Niloy Ganguly: Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. SIGIR 2024, pp. 1073–1082. https://dl.acm.org/doi/10.1145/3626772.3657882

Contact

Soumyadeep Roy

Soumyadeep is a PhD student at IIT Kharagpur and was a research assistant at the Leibniz AI Lab at L3S for 2.5 years. His research interests are Natural Language Processing and AI in Medicine. His PhD thesis focuses on the development of domain adaptation techniques for various medical NLP applications.

Uwe Hadler, M. Sc.

Uwe Hadler is a research associate at L3S and supports companies in the introduction of AI systems through the Mittelstand-Digital Zentrum Hannover. His research focuses on language models and on methods to improve their reliability and trustworthiness.