Generate a group. Reward what beats the average.
For each prompt, GRPO samples a group of outputs and scores them all. The group mean becomes the baseline: above-average outputs get reinforced, below-average ones get pushed down. That’s the core idea; the update itself is a PPO-style clipped objective.
The group serves as both the training signal and a local estimate of what “good” looks like on this prompt. With no separate value network to train, it is cheap on memory and simple to implement.
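To make the group baseline concrete, here is a minimal sketch of the advantage computation for one prompt. The function name `group_advantages`, the `eps` constant, and the toy rewards are illustrative assumptions, not taken from any particular implementation. Dividing by the group's standard deviation is a common normalization step, included here as an option.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8, normalize=True):
    """Score each output relative to its own group (hypothetical helper)."""
    baseline = mean(rewards)                    # group mean is the baseline
    centered = [r - baseline for r in rewards]  # positive means above average
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps               # eps guards against zero spread
    return [c / scale for c in centered]

# Toy group: four sampled outputs for one prompt, scored 1.0 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

Outputs with a positive advantage would have their token probabilities pushed up, the rest pushed down. If every output in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.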
Behind DeepSeek-R1, and R1-Zero before it, where pure RL with no SFT warmup turned a pretrained base model into a strong reasoner.