RL from human feedback (RLHF)

Humans rank outputs. A model learns their preferences. RL optimizes against that learned preference model.

Open-ended tasks have no correct answer to train on. RLHF turns human judgment into a training signal instead.

Annotators compare pairs of model outputs and pick the better one. Those comparisons become preference data: a dataset of (prompt, chosen response, rejected response) triples.
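A minimal sketch of what one such triple might look like, assuming a simple dict-based layout (the field names and example text are illustrative, not a specific dataset's schema):

```python
# One preference record: a prompt plus the response the annotator
# chose and the one they rejected. Real datasets hold many of these.
preference_data = [
    {
        "prompt": "Explain photosynthesis to a child.",
        "chosen": "Plants use sunlight to turn air and water into food.",
        "rejected": "Photosynthesis fixes CO2 via the Calvin cycle.",
    },
]
```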

A reward model trains on this data to predict which output a human would prefer. Then RL uses the reward model's scores as the training signal to improve the language model. The human is no longer in the loop for every training step, just for providing the initial comparisons.