Splitting text into the pieces a model actually reads.
Models can’t process raw text. Tokenization breaks it into tokens: subword chunks. Whole words would mean a huge vocabulary; individual characters would mean very long sequences. Subwords split the difference. The standard algorithm, BPE (byte pair encoding), builds a vocabulary by starting from individual characters and repeatedly merging the most frequent adjacent pair of symbols into a new symbol. When we say an LLM predicts the “next word,” it’s really predicting the next token.
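The merge loop at the heart of BPE can be sketched in a few lines. This is a toy version, assuming a whitespace-split corpus and ignoring the byte-level handling and end-of-word markers that production tokenizers use; the function names are mine, not from any library.

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a word (as a tuple of symbols) to its frequency.
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the winning pair with one merged symbol.
    a, b = pair
    merged = Counter()
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from characters; each round, merge the most frequent adjacent pair.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, vocab = train_bpe("low low low lower lowest", num_merges=2)
# merges → [('l', 'o'), ('lo', 'w')]: frequent fragments become single tokens.
```

The learned merge list is the tokenizer: to encode new text, the same merges are replayed in order, so common strings collapse into few tokens while rare ones stay as many small pieces.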