Replace the reward model with a rule-based check.
Some tasks have a right answer you can check. Math problems have a final number. Code passes or fails its tests. Formatted outputs match a schema or don’t. RLVR uses a rule-based check (1 for correct, 0 for incorrect) as the reward.
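A verifiable reward can be as simple as a string match against a known answer. The sketch below is a minimal example, not any particular framework's API; it assumes the math convention of wrapping the final answer in `\boxed{...}`:

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the final boxed answer matches the
    gold answer, 0.0 otherwise. No reward model involved."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable answer counts as incorrect
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(verifiable_reward(r"So the total is \boxed{42}.", "42"))  # 1.0
print(verifiable_reward("So the total is 42.", "42"))           # 0.0
```

For code tasks the check would instead run the unit tests; for structured outputs, a schema validator. The principle is the same: the reward is computed, not learned.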
No learned reward model means no reward model to hack. But verifiable rewards only apply where correctness can be checked mechanically; open-ended tasks still need an AI judge or a learned reward model.
DeepSeek-R1 was trained this way: rule-based rewards on math and code, optimized with GRPO.