The RL algorithm that would power RLHF
Research Paper

Introduced PPO, a simple and stable reinforcement learning algorithm that became the workhorse behind RLHF, InstructGPT, ChatGPT, and most subsequent language model alignment methods.
PPO clips the policy-gradient objective to prevent excessively large updates. Instead of enforcing an explicit constraint on the KL divergence between old and new policies (as TRPO does with a second-order trust-region method), PPO simply clips the probability ratio between the new and old policies to a small range around 1, typically [0.8, 1.2] (i.e., a clip parameter of 0.2). The objective takes the minimum of the clipped and unclipped surrogate terms, so the policy gains nothing by moving the ratio outside the trusted range. This is computationally simple, just a clipped first-order loss function, yet empirically as stable as the much more complex TRPO.
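The clipped surrogate objective described above can be sketched in a few lines. This is a minimal illustration, not the full PPO algorithm (no value function, entropy bonus, or minibatching); the function name and use of NumPy arrays are choices made here for clarity.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss: the core of PPO's policy update.

    logp_new, logp_old: per-action log-probabilities under the new
    and old policies; advantages: advantage estimates (e.g. from GAE).
    """
    # Probability ratio r = pi_new(a|s) / pi_old(a|s)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps], i.e. [0.8, 1.2] for eps = 0.2
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic bound: take the minimum, then negate because we minimize
    return -np.mean(np.minimum(unclipped, clipped))
```

Because of the `min`, the gradient vanishes once the ratio leaves the clip range in the direction the advantage favors, which is what keeps individual updates small without an explicit KL constraint.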