Data generated by models, not collected from humans.
Human annotation is slow and expensive. Synthetic data sidesteps this: a strong model generates instruction data that a weaker model then trains on. Stanford's Alpaca, for example, used OpenAI's text-davinci-003 (a GPT-3.5-series model) to generate 52,000 instruction-response pairs, then fine-tuned the 7B-parameter LLaMA model on them. Models can also generate their own practice problems or reasoning chains. This scales data production well past what human labelers can reach.
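The generation loop can be sketched in a few lines. This is a minimal, offline toy, not Alpaca's actual pipeline: `teacher_generate` is a hypothetical stand-in for a real API call to the strong model, and the exact-match dedup is a crude simplification of Alpaca's similarity filtering.

```python
import random

def teacher_generate(prompt: str) -> list[tuple[str, str]]:
    # Hypothetical stand-in for querying the strong "teacher" model. A real
    # pipeline would send `prompt` to an API and parse the completion; we
    # return canned pairs so the sketch runs offline.
    return [
        ("Summarize the text below in one sentence.", "A one-sentence summary."),
        ("Translate 'hello' into French.", "Bonjour."),
    ]

def build_synthetic_dataset(seed_instructions, rounds=3, rng=None):
    """Self-instruct-style loop: prompt the teacher with sampled seed
    instructions, collect generated pairs, and drop exact duplicates."""
    rng = rng or random.Random(0)
    dataset, seen = [], set()
    for _ in range(rounds):
        examples = rng.sample(seed_instructions, k=min(3, len(seed_instructions)))
        prompt = ("Write new instruction-response pairs in the style of:\n"
                  + "\n".join(examples))
        for instruction, response in teacher_generate(prompt):
            if instruction not in seen:  # crude dedup; Alpaca used a similarity filter
                seen.add(instruction)
                dataset.append({"instruction": instruction, "response": response})
    return dataset

pairs = build_synthetic_dataset(
    ["Name three primes.", "Explain recursion.", "List two fruits."])
# Each entry is a dict ready for supervised fine-tuning of the student model.
```

The dedup step matters in practice: without a filter, a teacher model prompted repeatedly produces near-identical pairs, and the dataset's apparent size overstates its diversity.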
The risk is compounding error, sometimes called model collapse. Synthetic data inherits the generating model's biases and blind spots, and when each generation trains on the previous one's output, those mistakes amplify. The tails of the original distribution vanish first; outputs converge toward a narrow, degraded mean. Mixing real data into the training mix mitigates this, while purely synthetic pipelines tend to degrade across generations.