Safety & Alignment · OpenAI · Mar 2022

13. Training Language Models to Follow Instructions (InstructGPT)

The paper that made ChatGPT possible

Research Paper
Summary

Applied RLHF at scale to GPT-3, producing InstructGPT, a model that followed human instructions far more reliably. Human evaluators preferred outputs from the 1.3B InstructGPT over those of the 175B GPT-3, showing that alignment could substitute for raw scale.

Key Concepts

SFT → Reward Model → PPO: the production-scale RLHF pipeline that became the industry standard (a minimal sketch follows this list)

At production scale:

A 1.3B-parameter InstructGPT model was preferred by human evaluators over the 175B GPT-3 model, beating raw scale by more than 100x.

InstructGPT produced 25% fewer toxic outputs than GPT-3 and was much more likely to follow instructions accurately.

Connections

Influenced by: 9. Learning to Summarize from Human Feedback (Sep 2020)
Influences: 16. ChatGPT: Optimizing Language Models for Dialogue (Nov 2022)