Reinforcement learning. Instead of showing a model the right answer, you let it try things and score the results. The model learns to produce outputs that score higher. Deciding what counts as "higher" is the reward problem. The glossary covers RL for LLMs specifically; start with the RL entry.
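The try-score-update loop can be sketched in a few lines. This is a toy, not a real training stack: the "model" is a weighted choice over three canned outputs, the reward is a stand-in scorer that prefers longer text, and the update nudges weight toward outputs that score above a baseline.

```python
import random

outputs = ["ok", "a longer answer", "a much more detailed answer"]
weights = [1.0, 1.0, 1.0]  # the "policy": probability mass over outputs

def reward(text):
    return len(text)  # stand-in scorer: longer scores higher

def sample():
    return random.choices(range(len(outputs)), weights=weights)[0]

random.seed(0)
baseline = sum(reward(o) for o in outputs) / len(outputs)
for _ in range(500):
    i = sample()                   # try something
    r = reward(outputs[i])         # score it
    # update: shift weight toward above-baseline outputs, floor at 0.01
    weights[i] = max(0.01, weights[i] + 0.1 * (r - baseline))

best = max(range(len(outputs)), key=lambda i: weights[i])
print(outputs[best])
```

The model never sees a "right answer", only scores, which is the core difference from SFT.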
Two ways to teach a model. Show it the right answer (SFT), or let it try things and tell it what’s better (RL).
A strong model generates training data for a weaker one. Start with real examples, generate variations, filter for quality.
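The seed-generate-filter pipeline, sketched with stand-in functions (in practice `generate_variations` would prompt the strong model, and `quality_filter` would dedupe, length-check, or score with a judge):

```python
seeds = ["Translate 'hello' to French.", "Summarize this paragraph."]

def generate_variations(example):
    # Stand-in for prompting a strong model for rephrasings.
    return [example, example.replace(".", ", briefly."), example.lower()]

def quality_filter(example):
    # Stand-in quality check: non-trivial length, complete sentence.
    return len(example) > 10 and example.endswith(".")

dataset = []
for seed in seeds:
    for candidate in generate_variations(seed):
        if quality_filter(candidate):
            dataset.append(candidate)

print(len(dataset))
```

The filter step is what keeps quality up: generation is cheap, so you can over-generate and keep only what passes.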
Depends on the task. Style or format changes can work with under 100 examples. RL can need fewer examples than SFT because it learns from scores, not labeled answers.
Both score model outputs. A reward model is trained to score automatically. An AI judge is prompted to evaluate, no training needed.
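Either way, the interface is the same: output text in, score out. A minimal sketch with stand-in scorers (neither is a real model call):

```python
def reward_model_score(text):
    # Stand-in for a trained reward model's forward pass.
    return min(len(text) / 50, 1.0)

def ai_judge_score(text):
    # Stand-in for prompting an LLM judge with a rubric; no training step.
    # Toy rubric: reward answers that give a reason.
    return 1.0 if "because" in text.lower() else 0.5

answer = "Paris, because it is the capital of France."
print(reward_model_score(answer), ai_judge_score(answer))
```

The difference is where the scoring behavior comes from: a reward model bakes it in via training on preference data, a judge gets it from the prompt at evaluation time.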