Tokenization
Breaking text into tokens — the bridge between raw text and model input.
What Is Tokenization?
Tokenization converts raw text into a sequence of discrete units (tokens) that a model can process. It's the first step in any NLP pipeline and directly affects model performance, vocabulary size, and sequence length.
"Tokenization is fundamental"
→ ["Token", "ization", " is", " fundamental"]
→ [15642, 2065, 374, 16188]
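The text → tokens → ids pipeline can be sketched in a few lines. The vocabulary and ids below are made up for illustration; a real tokenizer learns its vocabulary from data using the algorithms described later.

```python
# Toy vocabulary mapping token strings to integer ids.
# These tokens and ids are illustrative only, not from a real tokenizer.
vocab = {"Token": 0, "ization": 1, " is": 2, " fundamental": 3}

def tokens_to_ids(tokens):
    """Look up each token string in the vocabulary."""
    return [vocab[t] for t in tokens]

tokens = ["Token", "ization", " is", " fundamental"]
print(tokens_to_ids(tokens))  # [0, 1, 2, 3]
```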
Tokenization Strategies
Word-Level
Split on whitespace and punctuation. Each unique word is a token.
Pros: intuitive, preserves word boundaries. Cons: huge vocabulary (100k+ words), can't handle misspellings or new words (out-of-vocabulary problem).
Character-Level
Each character is a token. Vocabulary is tiny (hundreds).
Pros: handles any text, no OOV problem. Cons: sequences become very long, harder for models to learn meaningful patterns from individual characters.
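Both strategies can be sketched in a few lines of Python. The regex here is a simplified word-level split (real word tokenizers handle many more edge cases):

```python
import re

def word_tokenize(text):
    # Runs of word characters, or single punctuation marks;
    # whitespace is discarded entirely.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

text = "Don't stop!"
print(word_tokenize(text))  # ['Don', "'", 't', 'stop', '!']
print(char_tokenize(text))  # ['D', 'o', 'n', "'", 't', ' ', 's', 't', 'o', 'p', '!']
```

Note how the character-level sequence is roughly five times longer for the same input, which is exactly the trade-off described above.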
Subword Tokenization
The modern standard. Splits text into sub-word units: common words stay whole, rare words are decomposed into known pieces.
"unhappiness" → ["un", "happiness"]
"transformers" → ["transform", "ers"]
"ChatGPT" → ["Chat", "G", "PT"]
Byte-Pair Encoding (BPE)
The most widely used subword algorithm (GPT-2, GPT-3, GPT-4, LLaMA).
Training Algorithm
- Start with individual bytes/characters as the initial vocabulary
- Count all adjacent pair frequencies in the training corpus
- Merge the most frequent pair into a new token
- Repeat for a fixed number of merges (determines vocab size)
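The training loop above can be sketched as a minimal BPE trainer (a simplified version of the classic reference implementation; ties between equally frequent pairs are broken arbitrarily here, so late merges may differ from one run of the walkthrough to another):

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds ensure we only match whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(replacement, w): f for w, f in words.items()}

def train_bpe(corpus, num_merges):
    # Represent each word as space-separated characters.
    words = Counter(" ".join(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(best, words)
        merges.append(best)
    return merges

corpus = "low low low low lowest lowest newer newer wider"
merges = train_bpe(corpus, 4)
print(merges)
```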
Example
Corpus: "low low low low lowest lowest newer newer wider"
Initial: l, o, w, e, s, t, n, r, i, d
Merge 1: l + o → lo (most frequent pair)
Merge 2: lo + w → low
Merge 3: e + r → er
Merge 4: e + w → ew (several pairs tie in frequency here; ties are broken arbitrarily)
...
The merge table is saved and applied at inference time to tokenize new text.
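Applying the merge table at inference time can be sketched as follows. The merge list matches the toy example above; a real table contains tens of thousands of entries:

```python
def bpe_encode(word, merges):
    """Tokenize one word by applying learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            # Merge this pair wherever it appears adjacently.
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merge table assumed learned from the toy corpus above (illustrative).
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # ['low', 'er']
```

Note that "lower" never appears in the training corpus, yet it tokenizes into two known pieces: this is how subword tokenization avoids the out-of-vocabulary problem.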
WordPiece
Used by BERT. Similar to BPE but chooses merges based on likelihood improvement rather than raw frequency:
score = freq(ab) / (freq(a) × freq(b))
This favors merges where the pair appears together more often than expected by chance. Tokens are prefixed with ## when they continue a word: ["token", "##ization"].
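The scoring rule can be illustrated with made-up counts (the numbers below are hypothetical, chosen to show how a strongly co-occurring pair outranks a merely frequent one):

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    """Likelihood-based merge score: high when a and b co-occur
    more often than their individual frequencies would predict."""
    return pair_freq / (freq_a * freq_b)

# Hypothetical counts: a 'q'/'u'-like pair that almost always co-occurs...
print(wordpiece_score(pair_freq=90, freq_a=100, freq_b=95))       # ≈ 0.00947
# ...beats a pair that is frequent in absolute terms but whose parts
# mostly occur apart.
print(wordpiece_score(pair_freq=500, freq_a=10000, freq_b=8000))  # 6.25e-06
```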
SentencePiece
A language-agnostic tokenizer (used by T5, LLaMA, Mistral) that:
- Treats the input as a raw stream of Unicode characters (no pre-tokenization step)
- Uses BPE or Unigram algorithm internally
- Handles any language without language-specific preprocessing
- Represents spaces as ▁ (visible in token outputs)
Vocabulary Size Trade-offs
| Vocab Size | Sequence Length | Embedding Size | Coverage |
|---|---|---|---|
| Small (8k) | Longer | Smaller | May split common words |
| Medium (32k) | Balanced | Moderate | Good balance |
| Large (100k+) | Shorter | Larger | Most words are single tokens |
Typical sizes: GPT-2 (50,257), LLaMA (32,000), GPT-4 (100,277).
Tokenization Pitfalls
Inconsistent Splitting
The same word might tokenize differently depending on context (capitalization, preceding space). This can confuse models.
Multilingual Bias
Tokenizers trained mostly on English text fragment non-English text into many more tokens, making the model less efficient and effective for other languages.
Arithmetic Difficulty
Numbers are often split into individual digits or arbitrary chunks, making arithmetic tasks harder for models. Some newer tokenizers handle digits specially.
Whitespace Sensitivity
A leading space often changes tokenization: "hello" vs " hello" produce different tokens. Models learn to handle this but it's a source of subtle bugs.
Special Tokens
Most tokenizers include special tokens:
- <bos> / <s> — beginning of sequence
- <eos> / </s> — end of sequence
- <pad> — padding for batch processing
- <unk> — unknown token (rare in subword tokenizers)
- <mask> — masked position (BERT-style pre-training)