The paper that made ChatGPT possible
Research Paper: Applied RLHF at scale to GPT-3, producing InstructGPT, a model that followed human instructions far more reliably. Human evaluators preferred the 1.3B InstructGPT over the 175B GPT-3, showing that alignment could substitute for raw scale.
At production scale:
A 1.3B-parameter InstructGPT model was preferred by human evaluators over the 175B-parameter GPT-3, despite being over 100x smaller.
InstructGPT produced about 25% fewer toxic outputs than GPT-3 when prompted to be respectful, and followed instructions far more reliably.
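At the heart of the RLHF pipeline the paper applies is a reward model trained on human preference pairs: given two responses to the same prompt, the model should assign a higher score to the one humans preferred. A minimal sketch of that pairwise (Bradley-Terry style) loss, with illustrative scalar rewards standing in for a real model's outputs:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the human-preferred response scores
    higher, and grows as the ranking is violated."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A wider margin in favour of the chosen response gives a lower loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
# With no margin, the loss is exactly log(2).
print(abs(preference_loss(1.0, 1.0) - math.log(2.0)) < 1e-12)  # True
```

The trained reward model then supplies the scalar signal that the policy (the language model) is optimized against with PPO; the function name and scalar inputs here are illustrative, not the paper's actual implementation.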