Safety & Alignment · OpenAI · Mar 2022

13. Training Language Models to Follow Instructions (InstructGPT)

The paper that made ChatGPT possible

Research Paper
Summary

Applied RLHF at scale to GPT-3, producing InstructGPT, a model that followed human instructions far more reliably. Human evaluators preferred outputs from the 1.3B InstructGPT over those of the 175B GPT-3, showing that alignment could substitute for raw scale.

Key Concepts

SFT → Reward Model → PPO: the production-scale RLHF pipeline that became the industry standard (a minimal sketch follows this list)

At production scale:

A 1.3B-parameter InstructGPT model was preferred by human evaluators over the 175B GPT-3 model, beating raw scale by more than 100x.

InstructGPT produced 25% fewer toxic outputs than GPT-3 and was much more likely to follow instructions accurately.

Connections

Influenced by: 9. Learning to Summarize from Human Feedback (Sep 2020)
Influences: 16. ChatGPT: Optimizing Language Models for Dialogue (Nov 2022)