The raw text that pretraining runs on.
Petabytes of text, scraped from the web, cleaned, and assembled into a corpus. The quality ceiling of any model is set here: no amount of architecture or compute recovers what the data lacks. The pipeline has four stages: web crawl → filtering → deduplication → mixing. Each stage shapes what the model learns.
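The last three stages can be sketched roughly as follows. This is a toy illustration, not any production pipeline: real systems use quality classifiers and language ID for filtering, fuzzy (MinHash-style) deduplication alongside exact hashing, and carefully tuned mixture weights. All function names and heuristics here are illustrative.

```python
import hashlib
import random

def filter_docs(docs, min_len=20):
    # Filtering: drop documents below a length threshold.
    # (A stand-in for real heuristics: language ID, perplexity
    # filters, quality classifiers.)
    return [d for d in docs if len(d) >= min_len]

def dedup(docs):
    # Exact deduplication via content hashing. Production pipelines
    # add near-duplicate detection, e.g. MinHash over shingles.
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def mix(sources, weights, n, seed=0):
    # Mixing: sample n documents across sources according to
    # mixture weights (e.g. upweighting books or code vs. raw web).
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    picks = rng.choices(names, weights=probs, k=n)
    return [rng.choice(sources[name]) for name in picks]

# Toy run: filter and dedup each source, then mix.
web = dedup(filter_docs(["spam", "a long web document " * 3] * 2))
books = dedup(filter_docs(["a long book excerpt " * 3]))
corpus = mix({"web": web, "books": books},
             {"web": 0.8, "books": 0.2}, n=10)
```

Each stage is a plain list-to-list transform here; at petabyte scale the same logic runs as distributed map/shuffle jobs, but the order of operations (filter before dedup before mix) is the part that matters.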