Data generated by models, not collected from humans.
Human annotation is slow and expensive. Synthetic data sidesteps this: a strong model generates instruction data that a weaker model then trains on. Stanford's Alpaca, for example, used OpenAI's text-davinci-003 (a GPT-3.5-series model) to generate 52,000 instruction-response pairs, then fine-tuned the 7B-parameter LLaMA model on them. Models can also generate their own practice problems or reasoning chains. This scales data production well past what human labelers can reach.
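The generation loop can be sketched in a few lines. This is a minimal, offline toy, not Alpaca's actual pipeline: `teacher_generate` is a hypothetical stand-in for a real API call to the strong model, and the exact-match dedup is a crude simplification of Alpaca's similarity filtering.

```python
import random

def teacher_generate(prompt: str) -> list[tuple[str, str]]:
    # Hypothetical stand-in for querying the strong "teacher" model. A real
    # pipeline would send `prompt` to an API and parse the completion; we
    # return canned pairs so the sketch runs offline.
    return [
        ("Summarize the text below in one sentence.", "A one-sentence summary."),
        ("Translate 'hello' into French.", "Bonjour."),
    ]

def build_synthetic_dataset(seed_instructions, rounds=3, rng=None):
    """Self-instruct-style loop: prompt the teacher with sampled seed
    instructions, collect generated pairs, and drop exact duplicates."""
    rng = rng or random.Random(0)
    dataset, seen = [], set()
    for _ in range(rounds):
        examples = rng.sample(seed_instructions, k=min(3, len(seed_instructions)))
        prompt = ("Write new instruction-response pairs in the style of:\n"
                  + "\n".join(examples))
        for instruction, response in teacher_generate(prompt):
            if instruction not in seen:  # crude dedup; Alpaca used a similarity filter
                seen.add(instruction)
                dataset.append({"instruction": instruction, "response": response})
    return dataset

pairs = build_synthetic_dataset(
    ["Name three primes.", "Explain recursion.", "List two fruits."])
# Each entry is a dict ready for supervised fine-tuning of the student model.
```

The dedup step matters in practice: without a filter, a teacher model prompted repeatedly produces near-identical pairs, and the dataset's apparent size overstates its diversity.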
The risk is compounding error, sometimes called model collapse. Synthetic data inherits the generating model's biases and blind spots, and when each generation trains on the previous one's output, those mistakes amplify. The tails of the original distribution vanish first; outputs converge toward a narrow, degraded mean. Mixing real data into the training mix mitigates this, while purely synthetic pipelines tend to degrade across generations.