© 2026 Silvia Seceleanu
Safety & Alignment · OpenAI · May 2023

19. Let's Verify Step by Step

Process supervision for reasoning

Research Paper
Summary

Demonstrated that process-based reward models (verifying each step of reasoning) dramatically outperform outcome-based models (judging only the final answer) for mathematical reasoning, laying groundwork for the o1 reasoning paradigm.

Key Concepts

Reward each reasoning step (PRM) vs. only judging the final answer (ORM)

Process reward models (PRMs) are trained to evaluate each individual step in a chain-of-thought solution. Outcome reward models (ORMs) only judge the final answer.
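The contrast can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the scoring functions stand in for learned models, and the solution-level PRM score follows the paper's convention of taking the product of per-step correctness probabilities.

```python
# Toy contrast between outcome and process reward. The scoring
# functions are stand-ins for learned models (illustrative only).

def orm_score(final_answer, correct_answer):
    """Outcome reward model: one scalar for the whole solution,
    based only on whether the final answer matches."""
    return 1.0 if final_answer == correct_answer else 0.0

def prm_score(step_scores):
    """Process reward model: score a solution as the product of
    per-step correctness probabilities, so one bad step sinks it."""
    p = 1.0
    for s in step_scores:
        p *= s
    return p

# A toy two-step solution: "2 + 3 = 5", then "5 * 4 = 20".
print(orm_score("20", "20"))      # 1.0: only the answer matters
print(prm_score([0.98, 0.95]))    # ~0.93: every step must hold up
```

Note how a solution with a lucky final answer but a flawed middle step gets full ORM reward yet a low PRM score.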

800K human-annotated step-level labels for mathematical solutions

OpenAI created a dataset of 800,000 step-level labels for mathematical solutions, with human annotators marking each step as correct or incorrect.
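A hypothetical record shape, modeled loosely on the released PRM800K data, helps make the annotation scheme concrete. The field names here are illustrative, not the exact schema; in the released data each step carries a positive, neutral, or negative rating.

```python
# Hypothetical shape of one step-annotated solution (illustrative
# field names, loosely modeled on the released PRM800K data).

example = {
    "problem": "What is 2 + 3 * 4?",
    "steps": [
        {"text": "By order of operations, compute 3 * 4 = 12 first.",
         "label": "positive"},
        {"text": "Then 2 + 12 = 14.", "label": "positive"},
        {"text": "So the answer is 14.", "label": "positive"},
    ],
}

# A single "negative" step marks the whole trace as flawed,
# even if the final answer happens to be right.
first_error = next((i for i, s in enumerate(example["steps"])
                    if s["label"] == "negative"), None)
print(first_error)  # None: no erroneous step in this example
```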

Process supervision solved 78% vs. 72% for outcome supervision on MATH benchmark

Process supervision significantly outperformed outcome supervision on the MATH benchmark: used to rerank sampled solutions, the PRM solved 78% of problems vs. 72% for the ORM.
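These percentages come from a best-of-N setup: sample many candidate solutions per problem and keep the one the reward model scores highest. A minimal sketch, with made-up scores standing in for reward-model outputs:

```python
# Sketch of best-of-N reranking: sample candidate solutions, keep
# the one the reward model scores highest. Scores are made up.

def best_of_n(candidates):
    """candidates: list of (solution_text, score) pairs.
    Returns the solution with the highest reward-model score."""
    return max(candidates, key=lambda c: c[1])[0]

candidates = [
    ("... therefore x = 7", 0.41),   # plausible but shaky reasoning
    ("... therefore x = 5", 0.88),   # high step-by-step score
    ("... therefore x = 5", 0.63),
]
print(best_of_n(candidates))  # "... therefore x = 5"
```

Under this selection rule, a PRM that reliably flags flawed intermediate steps surfaces better solutions than an ORM judging answers alone.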

Rewarding correct reasoning discourages shortcuts — inherently more aligned

Process supervision is inherently more aligned — it rewards models for correct reasoning, not just correct answers. This discourages the model from learning shortcuts or getting right answers for wrong reasons.

Connections

Influenced by: 18. GPT-4 Technical Report (Mar 2023)

Influences: 28. Learning to Reason with LLMs (o1) (Sep 2024)