In Harvard study, AI offered more accurate emergency room diagnoses than two human doctors
By Jakub Antkiewicz
2026-05-04T10:13:32Z
A new study led by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center indicates that a large language model from OpenAI can outperform human doctors in diagnostic accuracy in an emergency room setting. Published in the journal *Science*, the research compared the performance of OpenAI's o1 and 4o models against two internal medicine attending physicians using real, unprocessed electronic medical records from 76 ER cases. The findings are notable because the AI's superior performance was most pronounced during initial triage, the critical, time-sensitive phase when patient information is most limited.
Methodology and Performance Metrics
The research team presented the AI models with the same text-based information available to physicians at each point in the diagnostic process. Two other attending physicians, blinded to whether a diagnosis came from a human or the AI, then assessed the accuracy of each conclusion. The o1 model consistently performed on par with or better than both the human physicians and the 4o model, with the key performance differences emerging at the first clinical encounter.
- OpenAI o1 Model: Achieved an 'exact or very close diagnosis' in 67% of initial triage cases.
- Human Physician 1: Reached an 'exact or very close diagnosis' in 55% of the same cases.
- Human Physician 2: Reached an 'exact or very close diagnosis' in 50% of the cases.
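To make the text-only setup concrete, here is a minimal, hypothetical sketch, not the study's actual code or prompts. It assumes the OpenAI Python SDK and an invented, de-identified triage note, and simply shows how unstructured ER text might be passed to a model to elicit a ranked differential diagnosis.

```python
# Illustrative sketch only: NOT the study's code. Assumes the OpenAI Python
# SDK and a hypothetical, de-identified triage note. Shows how unstructured
# ER text could be passed to a model for a ranked differential diagnosis.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical, de-identified triage note (placeholder content).
triage_note = (
    "58M presents with acute substernal chest pain radiating to the left arm, "
    "diaphoresis, onset 40 min ago. Hx: HTN, smoker. Vitals: BP 158/94, HR 102."
)

response = client.chat.completions.create(
    model="o1",  # model name as reported in the article
    messages=[
        {
            "role": "user",
            "content": (
                "You are assisting with emergency triage. Based only on the "
                "note below, list the three most likely diagnoses, most "
                "likely first, with one line of reasoning each.\n\n"
                + triage_note
            ),
        }
    ],
)

print(response.choices[0].message.content)
```

In the study itself, outputs like these were scored by blinded attending physicians rather than automated metrics, which is what produced the accuracy figures above.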
Implications and Expert Caveats
While the results are compelling, the study's authors and external experts caution against over-interpreting them. The researchers emphasized an 'urgent need for prospective trials' and noted that the models were tested only on text-based data, not multi-modal inputs such as medical imaging. Dr. Kristen Panthagani, an emergency physician, pointed out a critical nuance: the study used internal medicine physicians as its human baseline, not ER physicians, whose primary skill set centers on identifying and stabilizing immediate life threats rather than reaching a final diagnosis. This, along with the absence of a formal accountability framework for AI in medicine, suggests that these models are best viewed as potential diagnostic support tools rather than physician replacements in the near term.
This study's core insight is not that AI is a 'better doctor,' but that frontier models can synthesize unstructured text from electronic health records with high accuracy, especially in information-poor scenarios. It validates their potential as powerful augmentation tools for clinicians, but also highlights the significant gap that remains between text-based reasoning and the multi-modal, high-stakes reality of clinical practice.