How you tell the model what “good” means.
RL needs a reward to improve from. Some tasks have right answers: math, code, factual recall. Most don’t. “Write a helpful response” has no answer to check against. So you train a model to score outputs the way a human would, and use that as the reward.
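Concretely, a reward model is usually trained on human comparisons: given two responses to the same prompt, it should score the one humans preferred higher. A minimal sketch of the standard Bradley–Terry pairwise objective, with toy scores standing in for a real reward model's outputs:

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry loss: push the reward model to score the
    human-preferred response above the rejected one."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), computed stably via log(1 + exp(-margin))
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy reward-model scores over three comparison pairs (illustrative numbers).
chosen = np.array([1.2, 0.3, 2.0])
rejected = np.array([0.4, 0.9, -1.0])
loss = preference_loss(chosen, rejected)
```

Minimizing this loss widens the score gap between preferred and rejected responses; the trained scorer then replaces the missing "right answer" as the RL reward signal.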