700+ scientific problems written by 42 Olympiad medalists and 45 PhD scientists across two difficulty tracks
Introduces FrontierScience, a two-track benchmark for expert-level scientific reasoning. The Olympiad track contains competition-style problems; the Research track contains open-ended research questions. The problems were written by 42 international Olympiad medalists and 45 PhD scientists across physics, chemistry, and biology. GPT-5.2 scores 77% on the Olympiad track but only 25% on the Research track, revealing a large gap between solving known problems and tackling novel research questions.
The Olympiad track contains competition-style problems that have known solutions but require expert-level problem-solving. The Research track contains open-ended questions posed by active scientists, representing the frontier of unsolved problems. This design captures both the ability to solve hard known problems and the ability to approach novel research challenges.
Unlike crowdsourced benchmarks, FrontierScience was written by domain experts: 42 international Olympiad medalists and 45 PhD scientists across physics, chemistry, and biology. This ensures the problems are genuinely difficult and representative of real scientific work rather than artificial test cases.
GPT-5.2 achieved 77% on the Olympiad track but only 25% on the Research track, a 52-point gap. This difference reveals a fundamental limitation: models can solve hard problems with known solutions far better than they can approach novel research questions, suggesting that generalization to genuinely new problems remains constrained.
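As a minimal illustration of how such a two-track comparison reduces to simple arithmetic, the sketch below computes per-track accuracy and the gap from graded results. This is a hypothetical scoring sketch, not FrontierScience's actual grading pipeline; the function name and data layout are assumptions for illustration.

```python
from collections import defaultdict

def track_accuracies(results):
    """Compute per-track accuracy from (track, is_correct) records.

    `results` is a list of (track_name, bool) tuples, one entry per
    graded problem. Returns {track_name: accuracy_percent}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for track, ok in results:
        totals[track] += 1
        correct[track] += ok
    return {t: 100.0 * correct[t] / totals[t] for t in totals}

# Hypothetical grading records mirroring the reported headline numbers:
# 77 of 100 Olympiad problems correct, 25 of 100 Research problems correct.
results = [("olympiad", i < 77) for i in range(100)]
results += [("research", i < 25) for i in range(100)]

acc = track_accuracies(results)
gap = acc["olympiad"] - acc["research"]
print(acc)                      # {'olympiad': 77.0, 'research': 25.0}
print(f"{gap:.0f}-point gap")   # 52-point gap
```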