Process supervision for reasoning
Research Paper
Demonstrated that process-based reward models (verifying each step of reasoning) dramatically outperform outcome-based reward models (judging only the final answer) on mathematical reasoning, laying groundwork for the o1 reasoning paradigm.
Process reward models (PRMs) are trained to evaluate each individual step of a chain-of-thought solution; outcome reward models (ORMs) judge only the final answer.
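The distinction can be sketched as follows. All names here are hypothetical; in practice the per-step scores would come from a trained PRM, and the product aggregation is one common way to collapse step scores into a solution score:

```python
def orm_score(final_answer_score: float) -> float:
    # An outcome reward model emits a single scalar for the whole
    # solution, based only on the final answer.
    return final_answer_score

def prm_score(step_scores: list[float]) -> float:
    # A process reward model scores every step. Taking the product of
    # per-step correctness probabilities means one bad step sinks the
    # whole solution, even if the final answer happens to be right.
    score = 1.0
    for p in step_scores:
        score *= p
    return score

clean = prm_score([0.95, 0.90, 0.92])   # all steps look sound
flawed = prm_score([0.95, 0.10, 0.92])  # one suspect middle step
assert flawed < clean
```

Note that an ORM would be blind to the flawed middle step as long as the final answer checked out; the PRM penalizes it directly.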
OpenAI released PRM800K, a dataset of 800,000 step-level labels for solutions to mathematical problems, with human annotators judging the correctness of each individual step.
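A step-labeled training record might look roughly like this. The field names are illustrative, not the actual dataset schema:

```python
# Hypothetical example of a step-level annotation record: the problem,
# the model's solution split into steps, and a human label per step.
example = {
    "problem": "Solve 2x + 3 = 11.",
    "steps": [
        {"text": "Subtract 3 from both sides: 2x = 8.", "label": "correct"},
        {"text": "Divide both sides by 2: x = 5.", "label": "incorrect"},
    ],
}

# The PRM's training target is each step's label, so it learns to
# flag the exact step where the reasoning goes wrong (here, 8/2 = 4).
labels = [step["label"] for step in example["steps"]]
assert labels == ["correct", "incorrect"]
```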
Process supervision significantly outperformed outcome supervision on the MATH benchmark: when used to rerank sampled candidate solutions, the PRM solved 78% of test problems versus 72% for the ORM.
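The reranking setup can be sketched as best-of-N selection: sample N candidate solutions, score each with the reward model, and keep the top one. The candidates and `prm_score` function below are toy stand-ins:

```python
import math

def best_of_n(candidates, score_fn):
    # Generic best-of-N selection: return the highest-scoring candidate.
    return max(candidates, key=score_fn)

# Toy candidates: (solution text, per-step scores from a PRM).
candidates = [
    ("solution A", [0.90, 0.20, 0.80]),  # shaky middle step
    ("solution B", [0.85, 0.90, 0.88]),  # consistently sound steps
]

def prm_score(candidate):
    # Aggregate step scores into a solution score via their product.
    _, step_scores = candidate
    return math.prod(step_scores)

best = best_of_n(candidates, prm_score)
assert best[0] == "solution B"
```

The key empirical finding is that a PRM-based `score_fn` selects correct solutions more reliably than an ORM-based one as N grows.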
Process supervision is also inherently more aligned: it rewards the model for correct reasoning, not just correct answers, which discourages learning shortcuts or reaching right answers for wrong reasons.