RL from human feedback (RLHF)

Humans rank outputs. A model learns their preferences. RL optimizes against that learned preference model.

Open-ended tasks have no correct answer to train on. RLHF turns human judgment into a training signal instead.

Annotators compare pairs of model outputs and pick the better one. Those comparisons become preference data: a dataset of (prompt, chosen response, rejected response) triples.
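A minimal sketch of what one such triple might look like, assuming a simple dict-based layout (the field names and example text are illustrative, not a specific dataset's schema):

```python
# One preference record: a prompt plus the response the annotator
# chose and the one they rejected. Real datasets hold many of these.
preference_data = [
    {
        "prompt": "Explain photosynthesis to a child.",
        "chosen": "Plants use sunlight to turn air and water into food.",
        "rejected": "Photosynthesis fixes CO2 via the Calvin cycle.",
    },
]
```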

A reward model trains on this data to predict which output a human would prefer. Then RL uses the reward model's scores as the training signal to improve the language model. The human is no longer in the loop for every training step, just for providing the initial comparisons.