Web crawl

Where most training data starts.

Nearly every major LLM trains on some version of Common Crawl, a nonprofit archive that snapshots the public web. The raw archive runs to billions of pages and keeps growing. But most of it is noise: spam, boilerplate, navigation bars and cookie banners repeated across millions of pages. The crawl isn’t a dataset. It’s the ore. Everything downstream (filtering, deduplication) exists because the raw crawl is overwhelmingly noise.
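A minimal sketch of what that downstream filtering and deduplication can look like. The heuristics below (a short list of known chrome strings, a minimum line length, exact hash-based dedup) are illustrative assumptions, not any lab’s actual pipeline; real pipelines use learned quality classifiers and fuzzy dedup like MinHash.

```python
import hashlib

# Illustrative list of page chrome to drop; real pipelines use far richer heuristics.
BOILERPLATE = {"accept all cookies", "home", "contact", "log in"}

def clean(page: str) -> str:
    # Keep only lines that look like prose: long enough, and not known chrome.
    lines = [ln.strip() for ln in page.splitlines()]
    return "\n".join(
        ln for ln in lines
        if len(ln.split()) >= 5 and ln.lower() not in BOILERPLATE
    )

def dedup(pages: list[str]) -> list[str]:
    # Exact deduplication: hash each cleaned page, keep the first copy of each hash.
    seen, kept = set(), []
    for page in pages:
        text = clean(page)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

Even this toy version shows the shape of the problem: most of the work is deciding what to throw away, and identical pages crawled from many URLs collapse to a single copy.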

References
  1. Common Crawl overview, Common Crawl Foundation.