Proximal Policy Optimization (PPO)

The original algorithm behind RLHF.

The model generates outputs, a reward model scores them, and PPO updates the parameters to favor higher-scoring outputs. The key constraint: a clipping mechanism caps how much the model can change in a single step. Without it, a big reward spike can push the model into a region it can’t recover from.

PPO also trains a separate value function (critic) that estimates how much reward to expect, which makes training more stable. That critic roughly doubles the memory cost, which motivated simpler alternatives like GRPO.

References

Proximal policy optimization algorithms Schulman et al., 2017