Tokenization
Breaking text into tokens — the bridge between raw text and model input.
What Is Tokenization?
Tokenization converts raw text into a sequence of discrete units (tokens) that a model can process. It's the first step in any NLP pipeline and directly affects model performance, vocabulary size, and sequence length.
"Tokenization is fundamental"
→ ["Token", "ization", " is", " fundamental"]
→ [15642, 2065, 374, 16188]
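The text → tokens → ids pipeline can be sketched in a few lines. The vocabulary and ids below are made up for illustration; a real tokenizer learns its vocabulary from data using the algorithms described later.

```python
# Toy vocabulary mapping token strings to integer ids.
# These tokens and ids are illustrative only, not from a real tokenizer.
vocab = {"Token": 0, "ization": 1, " is": 2, " fundamental": 3}

def tokens_to_ids(tokens):
    """Look up each token string in the vocabulary."""
    return [vocab[t] for t in tokens]

tokens = ["Token", "ization", " is", " fundamental"]
print(tokens_to_ids(tokens))  # [0, 1, 2, 3]
```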
Tokenization Strategies
Word-Level
Split on whitespace and punctuation. Each unique word is a token.
Pros: intuitive, preserves word boundaries. Cons: huge vocabulary (100k+ words), can't handle misspellings or new words (out-of-vocabulary problem).
Character-Level
Each character is a token. Vocabulary is tiny (hundreds).
Pros: handles any text, no OOV problem. Cons: sequences become very long, harder for models to learn meaningful patterns from individual characters.
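Both strategies can be sketched in a few lines of Python. The regex here is a simplified word-level split (real word tokenizers handle many more edge cases):

```python
import re

def word_tokenize(text):
    # Runs of word characters, or single punctuation marks;
    # whitespace is discarded entirely.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

text = "Don't stop!"
print(word_tokenize(text))  # ['Don', "'", 't', 'stop', '!']
print(char_tokenize(text))  # ['D', 'o', 'n', "'", 't', ' ', 's', 't', 'o', 'p', '!']
```

Note how the character-level sequence is roughly five times longer for the same input, which is exactly the trade-off described above.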
Subword Tokenization
The modern standard. Splits text into sub-word units: common words stay whole, rare words are decomposed into known pieces.
"unhappiness" → ["un", "happiness"]
"transformers" → ["transform", "ers"]
"ChatGPT" → ["Chat", "G", "PT"]
Byte-Pair Encoding (BPE)
The most widely used subword algorithm (GPT-2, GPT-3, GPT-4, LLaMA).
Training Algorithm
- Start with individual bytes/characters as the initial vocabulary
- Count all adjacent pair frequencies in the training corpus
- Merge the most frequent pair into a new token
- Repeat for a fixed number of merges (determines vocab size)
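The training loop above can be sketched as a minimal BPE trainer (a simplified version of the classic reference implementation; ties between equally frequent pairs are broken arbitrarily here, so late merges may differ from one run of the walkthrough to another):

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds ensure we only match whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(replacement, w): f for w, f in words.items()}

def train_bpe(corpus, num_merges):
    # Represent each word as space-separated characters.
    words = Counter(" ".join(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(best, words)
        merges.append(best)
    return merges

corpus = "low low low low lowest lowest newer newer wider"
merges = train_bpe(corpus, 4)
print(merges)
```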
Example
Corpus: "low low low low lowest lowest newer newer wider"
Initial: l, o, w, e, s, t, n, r, i, d
Merge 1: l + o → lo (most frequent pair)
Merge 2: lo + w → low
Merge 3: e + r → er
Merge 4: e + w → ew (several pairs tie in frequency here; ties are broken arbitrarily)
...
The merge table is saved and applied at inference time to tokenize new text.
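Applying the merge table at inference time can be sketched as follows. The merge list matches the toy example above; a real table contains tens of thousands of entries:

```python
def bpe_encode(word, merges):
    """Tokenize one word by applying learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            # Merge this pair wherever it appears adjacently.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merge table assumed learned from the toy corpus above (illustrative).
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # ['low', 'er']
```

Note that "lower" never appears in the training corpus, yet it tokenizes into two known pieces: this is how subword tokenization avoids the out-of-vocabulary problem.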
WordPiece
Used by BERT. Similar to BPE but chooses merges based on likelihood improvement rather than raw frequency:
score = freq(ab) / (freq(a) × freq(b))
This favors merges where the pair appears together more often than expected by chance. Tokens are prefixed with ## when they continue a word: ["token", "##ization"].
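The scoring rule can be illustrated with made-up counts (the numbers below are hypothetical, chosen to show how a strongly co-occurring pair outranks a merely frequent one):

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    """Likelihood-based merge score: high when a and b co-occur
    more often than their individual frequencies would predict."""
    return pair_freq / (freq_a * freq_b)

# Hypothetical counts: a 'q'/'u'-like pair that almost always co-occurs...
print(wordpiece_score(pair_freq=90, freq_a=100, freq_b=95))       # ≈ 0.00947
# ...beats a pair that is frequent in absolute terms but whose parts
# mostly occur apart.
print(wordpiece_score(pair_freq=500, freq_a=10000, freq_b=8000))  # 6.25e-06
```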
SentencePiece
A language-agnostic tokenizer (used by T5, LLaMA, Mistral) that:
- Treats the input as a raw stream of Unicode characters (no pre-tokenization step)
- Uses BPE or Unigram algorithm internally
- Handles any language without language-specific preprocessing
- Represents spaces as ▁ (visible in token outputs)
Vocabulary Size Trade-offs
| Vocab Size | Sequence Length | Embedding Size | Coverage |
|---|---|---|---|
| Small (8k) | Longer | Smaller | May split common words |
| Medium (32k) | Balanced | Moderate | Good balance |
| Large (100k+) | Shorter | Larger | Most words are single tokens |
Typical sizes: GPT-2 (50,257), LLaMA (32,000), GPT-4 (100,277).
Tokenization Pitfalls
Inconsistent Splitting
The same word might tokenize differently depending on context (capitalization, preceding space). This can confuse models.
Multilingual Bias
Tokenizers trained mostly on English text fragment non-English text into many more tokens, making the model less efficient and effective for other languages.
Arithmetic Difficulty
Numbers are often split into individual digits or arbitrary chunks, making arithmetic tasks harder for models. Some newer tokenizers handle digits specially.
Whitespace Sensitivity
A leading space often changes tokenization: "hello" vs " hello" produce different tokens. Models learn to handle this but it's a source of subtle bugs.
Special Tokens
Most tokenizers include special tokens:
- <bos> / <s> — beginning of sequence
- <eos> / </s> — end of sequence
- <pad> — padding for batch processing
- <unk> — unknown token (rare in subword tokenizers)
- <mask> — masked position (BERT-style pre-training)