How the model actually improves from reward signals.
You have a score for each output. Now what? The model needs to shift its parameters so higher-scoring outputs become more likely. Push too hard and the model collapses onto a narrow set of degenerate outputs; push too gently and it never learns. Each algorithm here is a different answer to that tradeoff.
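The core mechanic behind all of these algorithms is the same: scale the gradient of each output's log-probability by its (baseline-subtracted) reward, so that above-average outputs get pushed up and below-average ones get pushed down. Here is a minimal sketch on a toy softmax "policy" over three candidate outputs; the logits, rewards, learning rate, and step count are all illustrative assumptions, not anyone's actual training setup.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy policy: 3 candidate outputs, parameterized by logits.
logits = np.zeros(3)
rewards = np.array([1.0, 0.0, -1.0])  # assumed scores: output 0 is best
lr = 0.5  # push too hard (huge lr) and probs collapse in a few steps

for _ in range(50):
    probs = softmax(logits)
    # Subtract the expected reward as a baseline: only
    # better-than-average outputs get their probability raised.
    advantage = rewards - (probs * rewards).sum()
    # Exact expected policy gradient for a softmax policy:
    # sum_i p_i * adv_i * d(log p_i)/d(logits) = probs * advantage
    grad = probs * advantage
    logits += lr * grad  # gradient ascent on expected reward

probs = softmax(logits)
```

After training, the probability mass concentrates on the highest-reward output. Shrinking the learning rate (or capping per-step movement, as PPO-style clipping does) is what keeps this update from overshooting.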