Expert-curated benchmark for evaluating AI systems on real-world medical questions with consensus-validated answers
OpenAI introduced HealthBench, a benchmark for evaluating AI performance on real-world health questions. Unlike medical exam-style benchmarks (MedQA, USMLE), HealthBench uses real patient questions curated and validated by medical professionals, with consensus-based scoring. It tests both clinical accuracy and safety-critical behaviors, such as knowing when to defer to a physician. GPT-5.2 achieved 78% accuracy but showed a concerning pattern: high performance on common conditions alongside significant failures on rare diseases and multi-comorbidity cases.
HealthBench uses real health questions submitted by patients, curated and validated by panels of 3-5 medical professionals per question. Unlike exam-style benchmarks with single correct answers, HealthBench includes consensus scoring that captures the range of medically acceptable responses. This better reflects real clinical practice where multiple valid approaches exist.
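To make the scoring mechanics concrete, here is a minimal sketch of how panel judgments could be aggregated into a consensus score. The class, function names, and majority threshold are illustrative assumptions, not the published HealthBench implementation:

```python
from dataclasses import dataclass

@dataclass
class PanelJudgment:
    """One medical professional's verdict on a model response (hypothetical schema)."""
    reviewer_id: str
    acceptable: bool  # is the response within the range of medically acceptable answers?

def consensus_score(judgments: list[PanelJudgment]) -> float:
    """Score a response by the fraction of panelists who deem it acceptable.

    Rather than a binary right/wrong, the score reflects how much of the
    panel (3-5 professionals per question) accepted the response, so answers
    in the gray zone between valid approaches can receive partial credit.
    """
    if not judgments:
        raise ValueError("need at least one panel judgment")
    return sum(j.acceptable for j in judgments) / len(judgments)

def is_passing(judgments: list[PanelJudgment], threshold: float = 0.5) -> bool:
    """A response passes if at least a majority of the panel accepts it (assumed rule)."""
    return consensus_score(judgments) >= threshold

# Example: 3-person panel, two of three accept the response.
panel = [
    PanelJudgment("md_01", acceptable=True),
    PanelJudgment("md_02", acceptable=True),
    PanelJudgment("md_03", acceptable=False),
]
print(consensus_score(panel))  # 0.666...
print(is_passing(panel))       # True
```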
Beyond accuracy, HealthBench evaluates whether models appropriately recommend professional medical consultation for serious symptoms, correctly identify emergency situations, and avoid giving definitive diagnoses for ambiguous presentations. This "when not to answer" evaluation is unique among medical benchmarks.
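A simplified sketch of how these safety behaviors might be checked is below. The `SafetyCase` fields and keyword lists are hypothetical, and keyword matching stands in for what would realistically be a rubric- or model-based grader:

```python
from dataclasses import dataclass

@dataclass
class SafetyCase:
    """A question annotated with the safety behavior it should trigger (illustrative fields)."""
    question: str
    requires_referral: bool  # serious symptoms: should recommend professional consultation
    is_emergency: bool       # should direct the user to emergency care
    is_ambiguous: bool       # should NOT give a single definitive diagnosis

REFERRAL_PHRASES = ("see a doctor", "consult a physician", "healthcare provider")
EMERGENCY_PHRASES = ("call 911", "emergency room", "seek emergency care")

def safety_behaviors_met(case: SafetyCase, response: str) -> bool:
    """Check the 'when not to answer' behaviors for one response."""
    text = response.lower()
    if case.is_emergency and not any(p in text for p in EMERGENCY_PHRASES):
        return False  # failed to identify an emergency
    if case.requires_referral and not any(p in text for p in REFERRAL_PHRASES):
        return False  # failed to recommend professional consultation
    if case.is_ambiguous and "the diagnosis is" in text:
        return False  # gave a definitive diagnosis for an ambiguous presentation
    return True

# Example: an emergency case where the model correctly escalates and refers.
case = SafetyCase(
    question="Crushing chest pain and shortness of breath for 20 minutes",
    requires_referral=True, is_emergency=True, is_ambiguous=False,
)
print(safety_behaviors_met(
    case, "Call 911 now; afterwards, follow up with a healthcare provider."
))  # True
```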
GPT-5.2 showed strong performance on common conditions (above 85%) but dropped to 45% on rare diseases and 52% on multi-comorbidity cases, where multiple conditions interact. This performance cliff has direct safety implications: the cases where AI is least reliable are precisely the cases where patients may lack other resources.
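The cliff only becomes visible when results are stratified by category rather than averaged into a single headline number. A small sketch, with invented helper names and illustrative (not real) data:

```python
from collections import defaultdict

def stratified_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate pass/fail results per condition category.

    `results` pairs a category label ("common", "rare", "multi-comorbidity")
    with whether the response passed consensus scoring. Reporting only the
    overall average (e.g. 78%) would hide the gap between categories.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: passed / n for cat, (passed, n) in totals.items()}

# Illustrative counts echoing the reported pattern, not the actual dataset.
results = (
    [("common", True)] * 85 + [("common", False)] * 15
    + [("rare", True)] * 45 + [("rare", False)] * 55
)
print(stratified_accuracy(results))  # {'common': 0.85, 'rare': 0.45}
```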