Large Language Models Perform Poorly for Differential Diagnosis

Differential diagnosis was less accurate than diagnostic testing, but final diagnosis and management were more accurate


Published on: 16 Apr 2026, 3:29 pm

THURSDAY, April 16, 2026 (HealthDay News) -- Large language models (LLMs) achieve high accuracy on final diagnosis but have poorer performance for generating differential diagnoses, according to a study published online April 13 in JAMA Network Open.

Arya S. Rao, from Harvard Medical School in Boston, and colleagues conducted a cross-sectional study to examine the longitudinal clinical reasoning ability of state-of-the-art LLMs and to introduce a multidimensional, clinically meaningful benchmark for clinical-grade artificial intelligence. The primary outcome was the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score, defined as the normalized polygonal area representing balanced accuracy across five domains of clinical reasoning: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning questions. Twenty-one off-the-shelf LLMs were evaluated.
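For readers curious how a "normalized polygonal area" can reward balanced accuracy, here is a minimal Python sketch. It is not the study's implementation. It assumes the five domain accuracies are plotted on equally spaced radar-chart axes and that the resulting polygon's area is divided by the area obtained when every domain scores 1.0. The function name and the example values are hypothetical and are not taken from the paper.

```python
import math


def normalized_polygon_area(scores):
    """Area of a radar-chart polygon divided by its maximum possible area.

    Assumption (not from the study): axes are equally spaced and each
    score lies between 0 and 1, so a model scoring 1.0 everywhere gets 1.0.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    if any(s < 0 or s > 1 for s in scores):
        raise ValueError("scores must lie between 0 and 1")

    # Each pair of neighboring axes spans a triangle with area
    # 0.5 * r_i * r_next * sin(angle between the axes).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))
    max_area = wedge * n  # every score equal to 1.0
    return area / max_area


# Illustrative values only: one weak domain pulls the score down sharply.
balanced = normalized_polygon_area([0.8, 0.8, 0.8, 0.8, 0.8])
lopsided = normalized_polygon_area([0.2, 0.9, 0.9, 0.9, 0.9])
print(round(balanced, 3), round(lopsided, 3))
```

Under these assumptions, a single weak domain shrinks the area more than it would shrink a simple average, which is the sense in which such a score favors balance. The result also depends on the order in which the domains are placed around the chart.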

The LLMs were tested on 29 clinical vignettes, yielding 16,254 responses. The researchers found that PrIME-LLM scores ranged from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4, with reasoning-optimized models outperforming nonreasoning models and GPT models scoring highest overall. Compared with diagnostic testing, differential diagnosis was less accurate, while final diagnosis, management, and miscellaneous reasoning were more accurate. Failure rates exceeded 0.80 in all models for differential diagnosis but were less than 0.40 for final diagnosis. Multimodal performance was robust, and image inputs improved accuracy in most LLMs.

"By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor," Rao said in a statement. "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information."

One author disclosed ties to Abbott.

Abstract/Full Text

Editorial
