
Reinforcement learning (RL)

Learning from rewards instead of examples.

Supervised fine-tuning (SFT) needs a correct answer for every input. But for open-ended tasks, there's no single right answer. RL takes a different approach: let the model generate outputs, score them, and adjust toward higher scores. No labeled examples, just a signal for what's better and what's worse. Defining that signal is the problem reward design addresses.
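The generate-score-adjust loop can be sketched in a few lines. This is a toy illustration, not a real language-model setup: the "model" is just a categorical distribution over three candidate outputs, the reward function is a made-up lookup table, and the update is a basic REINFORCE-style policy gradient with a constant baseline. All names here are hypothetical.

```python
import math
import random

random.seed(0)

# Toy "model": a categorical policy over three candidate outputs.
outputs = ["bad", "okay", "good"]
logits = [0.0, 0.0, 0.0]

def reward(text):
    # Hypothetical reward signal: in practice this could be a learned
    # reward model or a programmatic check, not a lookup table.
    return {"bad": 0.0, "okay": 0.5, "good": 1.0}[text]

def sample():
    # Softmax over logits, then sample one output index.
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

lr = 0.5
for step in range(200):
    i, probs = sample()                  # generate an output
    advantage = reward(outputs[i]) - 0.5 # score it (minus a constant baseline)
    # REINFORCE update: grad of log pi(i) w.r.t. logits is one_hot(i) - probs,
    # so high-reward samples become more likely, low-reward ones less likely.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad

best = outputs[max(range(len(outputs)), key=lambda j: logits[j])]
```

After a couple hundred updates the policy concentrates on the highest-reward output, despite never seeing a labeled example; the only supervision is the scalar score.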
