Tokenization

Splitting text into the pieces a model actually reads.

Models can’t process raw text. Tokenization breaks it into tokens: subword chunks. A vocabulary of whole words would be enormous; a vocabulary of individual characters would produce very long sequences. Subwords split the difference. The standard algorithm, BPE (byte pair encoding), builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols, starting from characters. When we say an LLM predicts the “next word,” it’s really predicting the next token.
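The merge loop can be sketched in a few lines. This is a minimal, simplified illustration of BPE training (not a production tokenizer): words are represented as space-separated symbols with an assumed `</w>` end-of-word marker, and each round merges the single most frequent adjacent pair. The toy corpus and frequencies are hypothetical.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, joining occurrences of the chosen pair."""
    merged = " ".join(pair)   # e.g. "e s"
    joined = "".join(pair)    # e.g. "es"
    return {word.replace(merged, joined): freq for word, freq in vocab.items()}

# Toy corpus: word -> frequency, symbols separated by spaces.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair this round
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each learned merge becomes a vocabulary entry, so frequent fragments like `est` end up as single tokens while rare words fall back to smaller pieces. A real implementation also handles substring collisions in the merge step and applies the learned merge list, in order, to tokenize new text.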

References
  1. Sennrich et al., 2015. Neural Machine Translation of Rare Words with Subword Units.