Process supervision for reasoning
Research Paper
Demonstrated that process-based reward models (verifying each step of reasoning) dramatically outperform outcome-based reward models (judging only the final answer) on mathematical reasoning, laying groundwork for the o1 reasoning paradigm.
Process reward models (PRMs) are trained to evaluate each individual step of a chain-of-thought solution; outcome reward models (ORMs) judge only the final answer.
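The distinction can be sketched as follows. All names here are hypothetical; in practice the per-step scores would come from a trained PRM, and the product aggregation is one common way to collapse step scores into a solution score:

```python
def orm_score(final_answer_score: float) -> float:
    # An outcome reward model emits a single scalar for the whole
    # solution, based only on the final answer.
    return final_answer_score

def prm_score(step_scores: list[float]) -> float:
    # A process reward model scores every step. Taking the product of
    # per-step correctness probabilities means one bad step sinks the
    # whole solution, even if the final answer happens to be right.
    score = 1.0
    for p in step_scores:
        score *= p
    return score

clean = prm_score([0.95, 0.90, 0.92])   # all steps look sound
flawed = prm_score([0.95, 0.10, 0.92])  # one suspect middle step
assert flawed < clean
```

Note that an ORM would be blind to the flawed middle step as long as the final answer checked out; the PRM penalizes it directly.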
OpenAI released PRM800K, a dataset of 800,000 step-level labels for solutions to mathematical problems, with human annotators judging the correctness of each individual step.
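A step-labeled training record might look roughly like this. The field names are illustrative, not the actual dataset schema:

```python
# Hypothetical example of a step-level annotation record: the problem,
# the model's solution split into steps, and a human label per step.
example = {
    "problem": "Solve 2x + 3 = 11.",
    "steps": [
        {"text": "Subtract 3 from both sides: 2x = 8.", "label": "correct"},
        {"text": "Divide both sides by 2: x = 5.", "label": "incorrect"},
    ],
}

# The PRM's training target is each step's label, so it learns to
# flag the exact step where the reasoning goes wrong (here, 8/2 = 4).
labels = [step["label"] for step in example["steps"]]
assert labels == ["correct", "incorrect"]
```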
Process supervision significantly outperformed outcome supervision on the MATH benchmark: when used to rerank sampled candidate solutions, the PRM solved 78% of test problems versus 72% for the ORM.
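The reranking setup can be sketched as best-of-N selection: sample N candidate solutions, score each with the reward model, and keep the top one. The candidates and `prm_score` function below are toy stand-ins:

```python
import math

def best_of_n(candidates, score_fn):
    # Generic best-of-N selection: return the highest-scoring candidate.
    return max(candidates, key=score_fn)

# Toy candidates: (solution text, per-step scores from a PRM).
candidates = [
    ("solution A", [0.90, 0.20, 0.80]),  # shaky middle step
    ("solution B", [0.85, 0.90, 0.88]),  # consistently sound steps
]

def prm_score(candidate):
    # Aggregate step scores into a solution score via their product.
    _, step_scores = candidate
    return math.prod(step_scores)

best = best_of_n(candidates, prm_score)
assert best[0] == "solution B"
```

The key empirical finding is that a PRM-based `score_fn` selects correct solutions more reliably than an ORM-based one as N grows.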
Process supervision is also inherently more aligned: it rewards the model for correct reasoning, not just correct answers, which discourages learning shortcuts or reaching right answers for wrong reasons.