The RL algorithm that would power RLHF
Research Paper

Introduced PPO, a simple and stable reinforcement learning algorithm that became the workhorse behind RLHF, InstructGPT, ChatGPT, and most subsequent language model alignment methods.
PPO clips the policy-gradient objective to prevent excessively large updates. Instead of enforcing an explicit constraint on the KL divergence between old and new policies (as TRPO does with a second-order trust-region method), PPO simply clips the probability ratio between the new and old policies to a small range around 1, typically [0.8, 1.2] (i.e., a clip parameter of 0.2). The objective takes the minimum of the clipped and unclipped surrogate terms, so the policy gains nothing by moving the ratio outside the trusted range. This is computationally simple, just a clipped first-order loss function, yet empirically as stable as the much more complex TRPO.
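The clipped surrogate objective described above can be sketched in a few lines. This is a minimal illustration, not the full PPO algorithm (no value function, entropy bonus, or minibatching); the function name and use of NumPy arrays are choices made here for clarity.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss: the core of PPO's policy update.

    logp_new, logp_old: per-action log-probabilities under the new
    and old policies; advantages: advantage estimates (e.g. from GAE).
    """
    # Probability ratio r = pi_new(a|s) / pi_old(a|s)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps], i.e. [0.8, 1.2] for eps = 0.2
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic bound: take the minimum, then negate because we minimize
    return -np.mean(np.minimum(unclipped, clipped))
```

Because of the `min`, the gradient vanishes once the ratio leaves the clip range in the direction the advantage favors, which is what keeps individual updates small without an explicit KL constraint.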