ChatGPT underperforms on clinical kidney disease questions


Research published in Kidney International Reports has found that the ChatGPT artificial intelligence (AI) language model (OpenAI) fails to consistently provide accurate answers to clinical questions in the treatment of glomerular disease. Though it recently achieved an “impressive” score on the United States Medical Licensing Examination, it still has “a long path to go” before it is a useful tool for nephrologists, say a team of authors led by Jing Miao (Mayo Clinic, Rochester, USA).

The authors focused on ChatGPT’s answers to questions about glomerular disease because, they say, its effective management demands “a deep understanding of renal anatomy, physiology, pathology, and various treatment options” across a wide range of disciplines. They suggest that testing the model on this more specific topic within nephrology is the best way to assess its capacity to provide useful clinical information.

The authors took questions on glomerular disease from the question bank of the Nephrology Self-Assessment Program (NephSAP), which “is a review of the literature over the last several years in different domains of nephrology”, and from the Kidney Self-Assessment Program (KSAP) of the American Society of Nephrology (ASN), which is more “geared for preparation for the American Board of Internal Medicine (ABIM) nephrology board exam”. They presented these to the free version of ChatGPT, GPT 3.5, first in its 14 March iteration and then, in a second run of questions, in its 23 March version. Accuracy was compared between the two versions, and the concordance of their answers was recorded.

Miao et al detail that, on the five NephSAP glomerular disease question banks with a total of 150 questions, ChatGPT achieved a score of 45% on its first run and 41% on its second. Concordance between the two sets of answers was 73%. Next, the authors presented ChatGPT with 33 questions on glomerular disease from the KSAP question banks. On these, the model achieved 42% and 39% on the first and second runs respectively, with a 76% concordance rate. 

In total, 183 questions on glomerular disease were presented. ChatGPT answered 45% correctly on its first run and 41% on its second, with 74% concordance between its two sets of responses. The authors make the point that concordance was higher for incorrect answers (82% compared with 67%). The accuracy rates, the authors explain, fall far short of the 75% passing rate for NephSAP and the 76% rate for KSAP. The 74% concordance rate, they say, suggests “a limited repeatability”, though they add that the more recent paid version of the model, ChatGPT 4.0, may achieve greater consistency.

Miao et al voice concerns about the technology’s potential use in the American Board of Internal Medicine (ABIM) longitudinal knowledge assessment (LKA), which is taken at home as part of the renewal of the Nephrology Board Certificate. Steps should be taken to ensure ChatGPT is not utilised for this examination, they say. The model is also limited by an inability to engage with medical images, which comprise a significant portion of examinations, and the authors conclude that “further [research] and training are needed to improve its accuracy and repeatability in real-world clinical situations, particularly in processing medical images.”

