Group Relative Policy Optimization (GRPO)

Generate a group. Reward what beats the average.

For each prompt, GRPO samples a group of outputs and scores them all. The group mean becomes the baseline: above-average outputs get reinforced, below-average ones get pushed down. That’s the core idea; in practice the group-relative advantage feeds a PPO-style clipped policy update.
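The group-relative baseline can be sketched in a few lines. This is a minimal illustration, not a full trainer: the function name `group_relative_advantages` and the `eps` parameter are hypothetical, and dividing by the group standard deviation is an assumption taken from the commonly published formulation rather than from the text above.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each output relative to its own group (hypothetical helper).

    rewards: one scalar reward per sampled output for a single prompt.
    Returns one advantage per output: positive means above the group mean.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    baseline = fmean(rewards)   # the group mean is the baseline
    spread = pstdev(rewards)    # assumed normalization by the group std
    # eps keeps the division finite when every output gets the same reward
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct outputs get about +1, wrong ones about -1
```

When every output in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.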

The group serves as both the training signal and a local estimate of what “good” looks like on this prompt. No separate value network (critic) to train, so it’s cheap on memory and simple to implement.

The algorithm behind DeepSeek-R1. Its precursor, DeepSeek-R1-Zero, developed strong reasoning through pure RL on a pretrained base model, with no SFT warmup.
