
© 2026 Silvia Seceleanu

Safety & Alignment·OpenAI·Dec 2025

45. FrontierScience: Expert-Level Scientific Capability Benchmark

700+ scientific problems written by 42 Olympiad medalists and 45 PhD scientists across two difficulty tracks

Research Paper
Summary

The paper introduces FrontierScience, a two-track benchmark for expert-level scientific reasoning. The Olympiad track contains competition-style problems; the Research track contains open-ended research questions. The problems were written by 42 international Olympiad medalists and 45 PhD scientists across physics, chemistry, and biology. GPT-5.2 scores 77% on the Olympiad track and 25% on the Research track, a large gap between solving known problems and tackling novel research questions.

Key Concepts

Two-track design: Olympiad track (competition-style) vs. Research track (open-ended)

The Olympiad track contains competition-style problems that have known solutions but require expert-level problem-solving. The Research track contains open-ended research questions from active scientists, representing the frontier of unsolved problems. This design captures the ability both to solve hard known problems and to approach novel research challenges.

Expert authorship by 42 Olympiad medalists and 45 PhD scientists

Unlike crowdsourced benchmarks, FrontierScience was written by domain experts: 42 international Olympiad medalists and 45 PhD scientists across physics, chemistry, and biology. Expert authorship is meant to keep the problems genuinely difficult and representative of real scientific work rather than artificial test cases.

The Olympiad-Research gap: 77% vs. 25% reveals fundamental capability limits

GPT-5.2 achieved 77% on the Olympiad track but only 25% on the Research track, a gap of 52 percentage points. The difference suggests that models solve hard problems with known solutions far better than they approach novel research questions, and that generalization to genuinely new problems remains limited.
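
As a quick check on the arithmetic, the gap can be expressed both as an absolute difference in percentage points and as a ratio between the two tracks. The sketch below uses only the two scores reported in this summary; the variable and function names are illustrative and do not come from the paper.

```python
# Scores as reported in this summary (percent of problems solved per track).
scores = {"Olympiad": 77.0, "Research": 25.0}


def point_gap(higher: float, lower: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return higher - lower


def relative_ratio(higher: float, lower: float) -> float:
    """How many times larger the higher score is than the lower one."""
    return higher / lower


gap = point_gap(scores["Olympiad"], scores["Research"])
ratio = relative_ratio(scores["Olympiad"], scores["Research"])

print(f"Gap: {gap:.0f} percentage points")   # Gap: 52 percentage points
print(f"Ratio: {ratio:.2f}x")                # Ratio: 3.08x
```

The percentage-point gap (52) is the figure quoted in the text; the ratio (about 3x) is simply another view of the same two numbers and is not a result claimed by the paper.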