Safety & Alignment·OpenAI·Aug 2017

5. Proximal Policy Optimization (PPO)

The RL algorithm that would power RLHF

Research Paper
Summary

Introduced PPO, a simple and stable reinforcement learning algorithm that became the workhorse behind RLHF, InstructGPT, ChatGPT, and most subsequent language model alignment methods.

Key Concepts

Clipped policy gradient prevents catastrophic updates while matching TRPO stability

PPO clips the policy gradient objective to prevent excessively large updates. Instead of constraining the KL divergence between old and new policies (as in TRPO), PPO simply clips the probability ratio between old and new policies to a small range (typically [0.8, 1.2]). This is computationally simple — it's just a clipped loss function — but empirically as stable as the much more complex TRPO.

PPO-Clip (the dominant variant) and PPO-Penalty are two approaches to constraining policy updates: the first clips the probability ratio, while the second adds an adaptive KL-divergence penalty to the objective.

Connections

Influenced by
2. OpenAI Gym
Apr 2016
Influences
9. Learning to Summarize from Human Feedback
Sep 2020