The prototype for RLHF on language models
Research Paper

Applied RLHF to text summarization: training a reward model from human preference comparisons, then using PPO to optimize the summarization policy against it. This exact pipeline (comparison data → reward model → PPO) became the standard RLHF recipe behind InstructGPT, ChatGPT, Claude, and Gemini.
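The reward-model step of this recipe is typically trained with a pairwise (Bradley-Terry) loss on the human comparisons: the model should assign a higher score to the summary the labeler preferred. A minimal sketch of that loss in NumPy, with made-up reward scores standing in for the model's outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log P(chosen summary beats rejected one).

    Minimized when the reward model scores the human-preferred summary
    higher than the rejected one by a wide margin.
    """
    return -np.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward-model scores for three comparison pairs.
r_chosen = np.array([1.2, 0.3, 2.0])
r_rejected = np.array([0.4, 0.9, -1.0])

loss = preference_loss(r_chosen, r_rejected).mean()
```

The trained reward model then serves as the scalar objective that PPO maximizes, usually with a KL penalty keeping the policy close to the supervised starting point.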
Models trained with human feedback significantly outperformed both supervised baselines and much larger models. A 1.3B RLHF model produced summaries preferred by humans over a 12B supervised model.