Training data

The raw text that pretraining runs on.

Petabytes of text, scraped from the internet, cleaned, assembled. The quality ceiling of any model is set here. The pipeline has four stages: web crawl → filtering → deduplication → mixing. Each stage shapes what the model learns.
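
The four stages can be sketched as a toy pipeline. This is a minimal illustration, not how any production system works. The function names (`quality_filter`, `deduplicate`, `mix`), the word-count heuristic, exact-hash deduplication, and the source weights are all assumptions chosen for brevity. Real pipelines typically use learned quality classifiers and near-duplicate detection.

```python
import hashlib
import random
from collections import defaultdict


def quality_filter(docs, min_words=5):
    """Keep documents with at least `min_words` whitespace-separated words.

    A hypothetical stand-in for real quality filtering.
    """
    return [d for d in docs if len(d["text"].split()) >= min_words]


def deduplicate(docs):
    """Drop exact duplicates after lowercasing and collapsing whitespace."""
    seen = set()
    unique = []
    for d in docs:
        normalized = " ".join(d["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def mix(docs, weights, n_samples, seed=0):
    """Sample `n_samples` documents with replacement.

    Each draw first picks a source in proportion to `weights`,
    then picks a document uniformly from that source.
    """
    by_source = defaultdict(list)
    for d in docs:
        by_source[d["source"]].append(d)
    # Ignore weighted sources that have no surviving documents.
    sources = [s for s in weights if by_source[s]]
    if not sources:
        return []
    rng = random.Random(seed)
    picked = rng.choices(sources, weights=[weights[s] for s in sources], k=n_samples)
    return [rng.choice(by_source[s]) for s in picked]


# Stage 1: web crawl (a hard-coded stand-in for scraped pages).
crawled = [
    {"source": "web", "text": "The quick brown fox jumps over the lazy dog."},
    {"source": "web", "text": "the quick  brown fox jumps over the lazy dog."},
    {"source": "web", "text": "Click here"},
    {"source": "code", "text": "def add(a, b): return a + b  # adds two numbers"},
]

filtered = quality_filter(crawled)   # Stage 2: drops the two-word page
unique = deduplicate(filtered)       # Stage 3: drops the repeated sentence
mixture = mix(unique, weights={"web": 0.7, "code": 0.3}, n_samples=10)  # Stage 4
```

The order matters in this sketch: filtering first keeps the deduplication pass smaller, and mixing last means the weights apply to what actually survived cleaning.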
