Removing duplicate text so the model generalizes instead of memorizing.
Web crawls are full of repeated content: syndicated articles, templated pages, copy-pasted boilerplate. If duplicates stay in, the model sees them disproportionately often and starts memorizing specific passages instead of learning patterns.
Exact deduplication catches identical documents, usually by hashing each one and keeping only the first copy. Fast, but most real-world duplication isn't exact.
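A minimal sketch of the exact case, assuming documents arrive as plain strings: hash a lightly normalized version of each document and keep the first occurrence. The function and variable names are illustrative, not from any particular pipeline.

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each exactly identical document."""
    seen = set()
    unique = []
    for doc in documents:
        # Collapse whitespace so trivial formatting noise doesn't hide exact copies.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "Breaking news: markets rise.",
    "Breaking  news: markets rise.",   # same text, extra space
    "An unrelated article.",
]
print(exact_dedup(docs))  # the second copy is dropped
```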
Near (fuzzy) deduplication finds documents that are similar but not identical. Techniques like MinHash compress each document into a compact signature, then group documents with similar signatures as likely duplicates. This catches paraphrased copies, slightly reformatted articles, and templated pages that exact matching would miss.
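To make the MinHash idea concrete, here is a small from-scratch sketch: each document is turned into a set of word shingles, each of several seeded hash functions keeps its minimum value over those shingles, and the fraction of matching signature slots approximates Jaccard similarity. The shingle size, number of permutations, and hash choice are illustrative assumptions; production systems typically add locality-sensitive hashing so they never compare all pairs.

```python
import hashlib
import re

NUM_PERM = 64  # number of seeded hash functions; more gives a finer similarity estimate

def shingles(text, n=3):
    """Word n-grams ("shingles") used as the document's feature set."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

a = "The central bank raised interest rates by a quarter point on Tuesday."
b = "On Tuesday the central bank raised interest rates by a quarter point."
c = "A completely different article about gardening tips for the spring."
print(estimated_similarity(minhash_signature(a), minhash_signature(b)))  # high: likely duplicates
print(estimated_similarity(minhash_signature(a), minhash_signature(c)))  # low: unrelated
```

Documents whose estimated similarity exceeds a chosen threshold (often around 0.8) would then be clustered and reduced to a single representative copy.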