Trillion token pressure

Pretraining needs more data than exists.

A compute-optimal 1T-parameter model needs roughly 20 trillion tokens (about 20 tokens per parameter, per Chinchilla-style scaling laws). That’s about seven times the token count of the deduplicated Common Crawl. Usable public text is finite, and frontier models are approaching the limits of what’s available.
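To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. It assumes the Chinchilla ratio of roughly 20 training tokens per parameter (Hoffmann et al., 2022); the ~3-trillion-token figure for the deduplicated Common Crawl is the rough estimate implied above, used for illustration rather than as a measured value.

```python
# Back-of-the-envelope: compute-optimal token budgets at the
# Chinchilla ratio of ~20 training tokens per parameter
# (Hoffmann et al., 2022). The deduplicated Common Crawl size
# (~3T tokens) is the rough figure implied above, not a measurement.

TOKENS_PER_PARAM = 20
DEDUP_COMMON_CRAWL_TOKENS = 3e12  # illustrative estimate

def optimal_tokens(n_params: float) -> float:
    """Compute-optimal training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

for label, n_params in [("70B", 70e9), ("400B", 400e9), ("1T", 1e12)]:
    tokens = optimal_tokens(n_params)
    print(f"{label:>4} params -> {tokens / 1e12:.1f}T tokens "
          f"({tokens / DEDUP_COMMON_CRAWL_TOKENS:.1f}x dedup Common Crawl)")
```

At 1T parameters this reproduces the roughly seven-fold gap above; note that a 70B-parameter model, by contrast, still fits comfortably inside the available data.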

Frontier runs already cost hundreds of millions of dollars across thousands of GPUs. Only a handful of labs can play this game. For everyone else, the path is post-training: take an open-source base model and specialize it with supervised fine-tuning (SFT) and reinforcement learning (RL). The economics of pretraining make fine-tuning not just convenient but necessary.

References
  1. Villalobos et al. “Will we run out of data? Limits of LLM scaling based on human-generated data.” Epoch AI, 2022.