Replace the reward model with a rule-based check.
Some tasks have a right answer you can check. Math problems have a final number. Code passes or fails its tests. Formatted outputs match a schema or don’t. RLVR uses a rule-based check (1 for correct, 0 for incorrect) as the reward.
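A verifiable reward can be as simple as a string match against a known answer. The sketch below is a minimal example, not any particular framework's API; it assumes the math convention of wrapping the final answer in `\boxed{...}`:

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the final boxed answer matches the
    gold answer, 0.0 otherwise. No reward model involved."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable answer counts as incorrect
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(verifiable_reward(r"So the total is \boxed{42}.", "42"))  # 1.0
print(verifiable_reward("So the total is 42.", "42"))           # 0.0
```

For code tasks the check would instead run the unit tests; for structured outputs, a schema validator. The principle is the same: the reward is computed, not learned.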
No learned reward model means no reward model to hack. But verifiable rewards only apply where correctness can be checked mechanically; open-ended tasks still need an AI judge or a learned reward model.
DeepSeek-R1 was trained this way: rule-based rewards on math and code, optimized with GRPO.