Safety & Alignment · OpenAI · Sep 2020

9. Learning to Summarize from Human Feedback

The prototype for RLHF on language models

Research Paper
Summary

Applied RLHF to text summarization, training a reward model from human preferences and using PPO to optimize against it — establishing the methodology that would produce InstructGPT and ChatGPT.

Key Concepts

Comparison data → reward model → PPO became the standard RLHF recipe

This exact pipeline — comparison data → reward model → PPO — became the standard RLHF recipe used by InstructGPT, ChatGPT, Claude, and Gemini.
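The two learned stages of that recipe can be sketched in a few lines. Below is a minimal NumPy illustration, not the paper's implementation: the reward model is fit with a pairwise comparison loss so the human-preferred summary scores higher, and PPO then optimizes the reward-model score minus a KL penalty that keeps the policy near the supervised baseline. Function names and the `beta` value are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise comparison loss: push the score of the summary humans
    # preferred above the score of the one they rejected,
    # via -log sigmoid(r_chosen - r_rejected), averaged over comparisons.
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(-np.log(sigmoid(diff))))

def ppo_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    # Reward signal handed to PPO: reward-model score minus a per-sample
    # KL penalty (log-prob gap to the supervised reference policy),
    # which discourages drifting into degenerate high-reward text.
    return rm_score - beta * (logp_policy - logp_ref)
```

If the reward model assigns equal scores to both summaries in every pair, the loss sits at log 2 (a coin flip); it falls toward zero as the preferred summaries pull ahead.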

A 1.3B RLHF model beat a 12B supervised model — human feedback outweighs raw scale

Models trained with human feedback significantly outperformed both supervised baselines and much larger models. A 1.3B RLHF model produced summaries preferred by humans over a 12B supervised model.

Connections

Influenced by
5. Proximal Policy Optimization (PPO)
Aug 2017
8. Language Models are Few-Shot Learners (GPT-3)
May 2020
Influences
13. Training Language Models to Follow Instructions (InstructGPT)
Mar 2022