Group Sequence Policy Optimization (GSPO)

Same idea as GRPO. Optimize the sequence, not the token.

GRPO measures and constrains how much the model can shift its probability for each individual token. That’s noisy, especially in mixture-of-experts models where different experts activate for different tokens. GSPO moves the unit of optimization up to the full sequence: one probability ratio per output, one constraint. Cleaner signal, more stable training.

Behind Qwen3, where it stabilized MoE training that GRPO struggled with.

References

Group Sequence Policy Optimization Zheng et al., 2025