The prototype for RLHF on language models
Research Paper

Applied RLHF to text summarization: training a reward model from human preference comparisons, then using PPO to optimize the summarization policy against it. This exact pipeline (comparison data → reward model → PPO) became the standard RLHF recipe behind InstructGPT, ChatGPT, Claude, and Gemini.
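The reward-model step of this recipe is typically trained with a pairwise (Bradley-Terry) loss on the human comparisons: the model should assign a higher score to the summary the labeler preferred. A minimal sketch of that loss in NumPy, with made-up reward scores standing in for the model's outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log P(chosen summary beats rejected one).

    Minimized when the reward model scores the human-preferred summary
    higher than the rejected one by a wide margin.
    """
    return -np.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward-model scores for three comparison pairs.
r_chosen = np.array([1.2, 0.3, 2.0])
r_rejected = np.array([0.4, 0.9, -1.0])

loss = preference_loss(r_chosen, r_rejected).mean()
```

The trained reward model then serves as the scalar objective that PPO maximizes, usually with a KL penalty keeping the policy close to the supervised starting point.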
Models trained with human feedback significantly outperformed both supervised baselines and much larger models. A 1.3B RLHF model produced summaries preferred by humans over a 12B supervised model.