The raw text that pretraining runs on.
Petabytes of text, scraped from the web, cleaned, and assembled into a corpus. The quality ceiling of any model is set here: no amount of architecture or compute recovers what the data lacks. The pipeline has four stages: web crawl → filtering → deduplication → mixing. Each stage shapes what the model learns.
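The last three stages can be sketched roughly as follows. This is a toy illustration, not any production pipeline: real systems use quality classifiers and language ID for filtering, fuzzy (MinHash-style) deduplication alongside exact hashing, and carefully tuned mixture weights. All function names and heuristics here are illustrative.

```python
import hashlib
import random

def filter_docs(docs, min_len=20):
    # Filtering: drop documents below a length threshold.
    # (A stand-in for real heuristics: language ID, perplexity
    # filters, quality classifiers.)
    return [d for d in docs if len(d) >= min_len]

def dedup(docs):
    # Exact deduplication via content hashing. Production pipelines
    # add near-duplicate detection, e.g. MinHash over shingles.
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def mix(sources, weights, n, seed=0):
    # Mixing: sample n documents across sources according to
    # mixture weights (e.g. upweighting books or code vs. raw web).
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    picks = rng.choices(names, weights=probs, k=n)
    return [rng.choice(sources[name]) for name in picks]

# Toy run: filter and dedup each source, then mix.
web = dedup(filter_docs(["spam", "a long web document " * 3] * 2))
books = dedup(filter_docs(["a long book excerpt " * 3]))
corpus = mix({"web": web, "books": books},
             {"web": 0.8, "books": 0.2}, n=10)
```

Each stage is a plain list-to-list transform here; at petabyte scale the same logic runs as distributed map/shuffle jobs, but the order of operations (filter before dedup before mix) is the part that matters.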