The model that thinks before it speaks
Product Announcement: Launched o1 (codenamed 'Strawberry'), a model trained with reinforcement learning to reason through problems step by step before answering. It achieves breakthrough performance on math, coding, and science benchmarks through 'test-time compute scaling'.
Unlike GPT-4 (which can be prompted to think step-by-step), o1 is trained with RL to produce extended chains of thought. The model spends more time "thinking" before responding.
A new scaling paradigm. Instead of making the model bigger (parameter scaling), you let the model think longer (inference compute scaling). More inference-time compute generally yields better answers on hard reasoning problems.
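One simple way to see why spending more inference compute can help is self-consistency voting: sample several independent reasoning attempts and take the majority answer. This is a toy illustration of the scaling idea, not o1's actual training or inference method (which OpenAI has not published); the simulated solver and its 0.6 per-sample accuracy are assumptions for demonstration.

```python
import random
from collections import Counter

def sample_answer(rng, p_correct=0.6):
    # Simulates one independent reasoning attempt: returns the correct
    # answer "42" with probability p_correct, otherwise a random wrong answer.
    return "42" if rng.random() < p_correct else str(rng.randint(0, 41))

def majority_vote(rng, k):
    # Spend k samples of inference compute, then take the plurality answer.
    votes = Counter(sample_answer(rng) for _ in range(k))
    return votes.most_common(1)[0][0]

def accuracy(k, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(majority_vote(rng, k) == "42" for _ in range(trials)) / trials

# More samples (more inference compute) -> higher accuracy,
# even though the underlying "model" is unchanged.
print(f"1 sample:   {accuracy(1):.2f}")
print(f"15 samples: {accuracy(15):.2f}")
```

Because wrong answers scatter across many values while the correct answer concentrates its votes, accuracy climbs well above the per-sample rate as k grows; this is the same intuition behind letting a model "think longer" before committing to an answer.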
The model's chain-of-thought is hidden from users (they see a summary). This prevents users from extracting the reasoning strategy and prevents model distillation.
o1-preview scored 83% on a qualifying exam for the International Mathematical Olympiad (vs. GPT-4's 13%), reached the 89th percentile on Codeforces, and showed PhD-level performance on physics, biology, and chemistry benchmarks.
o1 reasons about safety policies in its chain-of-thought, leading to more nuanced safety behavior than hardcoded refusals.