The model that thinks before it speaks
Product Announcement: Launched o1 (codenamed 'Strawberry'), a model trained with reinforcement learning to reason through problems step by step before answering. It achieves breakthrough performance on math, coding, and science benchmarks through 'test-time compute scaling'.
Unlike GPT-4 (which can be prompted to think step-by-step), o1 is trained with RL to produce extended chains of thought. The model spends more time "thinking" before responding.
A new scaling paradigm. Instead of making the model bigger (parameter scaling), you let the model think longer (inference compute scaling). More inference-time compute generally yields better answers on hard reasoning problems.
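One simple way to see why spending more inference compute can help is self-consistency voting: sample several independent reasoning attempts and take the majority answer. This is a toy illustration of the scaling idea, not o1's actual training or inference method (which OpenAI has not published); the simulated solver and its 0.6 per-sample accuracy are assumptions for demonstration.

```python
import random
from collections import Counter

def sample_answer(rng, p_correct=0.6):
    # Simulates one independent reasoning attempt: returns the correct
    # answer "42" with probability p_correct, otherwise a random wrong answer.
    return "42" if rng.random() < p_correct else str(rng.randint(0, 41))

def majority_vote(rng, k):
    # Spend k samples of inference compute, then take the plurality answer.
    votes = Counter(sample_answer(rng) for _ in range(k))
    return votes.most_common(1)[0][0]

def accuracy(k, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(majority_vote(rng, k) == "42" for _ in range(trials)) / trials

# More samples (more inference compute) -> higher accuracy,
# even though the underlying "model" is unchanged.
print(f"1 sample:   {accuracy(1):.2f}")
print(f"15 samples: {accuracy(15):.2f}")
```

Because wrong answers scatter across many values while the correct answer concentrates its votes, accuracy climbs well above the per-sample rate as k grows; this is the same intuition behind letting a model "think longer" before committing to an answer.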
The model's chain-of-thought is hidden from users (they see a summary). This prevents users from extracting the reasoning strategy and prevents model distillation.
o1-preview scored 83% on a qualifying exam for the International Mathematical Olympiad (vs. GPT-4's 13%), reached the 89th percentile on Codeforces, and showed PhD-level performance on physics, biology, and chemistry benchmarks.
o1 reasons about safety policies in its chain-of-thought, leading to more nuanced safety behavior than hardcoded refusals.