How the model actually improves from reward signals.
You have a score for each output. Now what? The model needs to shift its parameters so higher-scoring outputs become more likely. Push too hard and the model collapses onto a narrow set of degenerate outputs; push too gently and it never learns. Each algorithm here is a different answer to that tradeoff.
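The core mechanic behind all of these algorithms is the same: scale the gradient of each output's log-probability by its (baseline-subtracted) reward, so that above-average outputs get pushed up and below-average ones get pushed down. Here is a minimal sketch on a toy softmax "policy" over three candidate outputs; the logits, rewards, learning rate, and step count are all illustrative assumptions, not anyone's actual training setup.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy policy: 3 candidate outputs, parameterized by logits.
logits = np.zeros(3)
rewards = np.array([1.0, 0.0, -1.0])  # assumed scores: output 0 is best
lr = 0.5  # push too hard (huge lr) and probs collapse in a few steps

for _ in range(50):
    probs = softmax(logits)
    # Subtract the expected reward as a baseline: only
    # better-than-average outputs get their probability raised.
    advantage = rewards - (probs * rewards).sum()
    # Exact expected policy gradient for a softmax policy:
    # sum_i p_i * adv_i * d(log p_i)/d(logits) = probs * advantage
    grad = probs * advantage
    logits += lr * grad  # gradient ascent on expected reward

probs = softmax(logits)
```

After training, the probability mass concentrates on the highest-reward output. Shrinking the learning rate (or capping per-step movement, as PPO-style clipping does) is what keeps this update from overshooting.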