Transformers
The architecture behind modern language models — attention, encoders, and decoders.
The Transformer Architecture
Introduced in "Attention Is All You Need" (Vaswani et al., 2017), the transformer replaced recurrent architectures with pure attention mechanisms. It processes all positions in parallel, enabling massive speedups and better handling of long-range dependencies.
Self-Attention
The core mechanism. For each token in the sequence, self-attention computes how much to "attend to" every other token.
Three learned projections per token:
- Query (Q) — what am I looking for?
- Key (K) — what do I contain?
- Value (V) — what information do I provide?
Scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The division by √d_k prevents dot products from growing too large in high dimensions, which would push softmax into regions with tiny gradients.
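The formula above can be sketched in a few lines of NumPy — a toy, single-head version with no masking or batching:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) query-key similarities
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of value vectors

# Toy example: 3 tokens, d_k = 2.
Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = Q.copy()
V = np.array([[1., 2.], [3., 4.], [5., 6.]])
out = attention(Q, K, V)
```

Because each row of the softmax output sums to 1, every output vector is a convex combination of the value vectors.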
Multi-Head Attention
Instead of one attention function, run h attention heads in parallel:
- Project Q, K, V into h different subspaces (different learned linear projections)
- Compute attention independently in each head
- Concatenate outputs and project back
Each head can learn to attend to different types of relationships — syntactic structure, semantic similarity, positional patterns, coreference.
Typical configurations: GPT-3 uses 96 heads, LLaMA-2 70B uses 64 heads.
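The three steps above can be sketched in NumPy. Random matrices stand in for the learned projections; slicing one (d_model × d_model) projection into h chunks is equivalent to giving each head its own (d_model × d_k) projection:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h):
    """X: (seq, d_model). Split into h heads, attend per head, concat, project."""
    seq, d_model = X.shape
    d_k = d_model // h
    # Learned projections (random stand-ins here).
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)        # this head's subspace
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])  # attention within the head
    return np.concatenate(heads, axis=-1) @ Wo    # concat + output projection

X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multi_head_attention(X, h=4)
```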
Positional Encoding
Self-attention is permutation-invariant — it treats the input as a set, losing sequence order. Positional encodings restore position information.
Sinusoidal (Original)
Fixed functions: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Because PE(pos + k) is a linear function of PE(pos), the different frequencies let the model attend by relative position.
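A direct NumPy implementation of the sinusoidal formula — even dimensions get the sine, odd dimensions the cosine:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d))        # one frequency per dimension pair
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 8)
```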
Learned Embeddings
A learned embedding vector for each position. Simple and effective but limited to the maximum training sequence length.
Rotary Position Embeddings (RoPE)
Encodes position by rotating the query and key vectors by position-dependent angles, so attention scores depend on relative offsets. Attention naturally decays with distance. Used by LLaMA, Mistral, and most modern LLMs; extrapolates to longer sequences better than absolute position embeddings.
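A minimal sketch of the rotation: each (even, odd) pair of dimensions is treated as 2-D coordinates and rotated by an angle proportional to the token's position, with a different frequency per pair. The key property — dot products depend only on the relative offset — is what makes RoPE a relative encoding:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, d), d even.

    Row m is rotated by angles m * freq_j, one frequency per dimension pair.
    """
    seq, d = x.shape
    pos = np.arange(seq)[:, None]              # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    theta = pos * freqs                        # (seq, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin         # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# With identical vectors at every position, the dot product of two rotated
# vectors depends only on their distance apart.
x = np.ones((6, 4))
rotated = rope(x)
```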
Encoder vs Decoder
Encoder-Only (BERT)
Bidirectional attention — each token attends to all other tokens. Good for understanding tasks: classification, NER, sentence similarity. Pre-trained with masked language modeling.
Decoder-Only (GPT)
Causal (left-to-right) attention — each token only attends to previous tokens. Good for generation tasks. Pre-trained with next-token prediction. This is the dominant architecture for modern LLMs.
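The difference between bidirectional and causal attention is just a mask on the score matrix. A minimal sketch with uniform scores, so the effect of the mask is easy to see:

```python
import numpy as np

def causal_mask(seq):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq, seq), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)  # blocked positions get -inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores, for illustration
w = masked_softmax(scores, causal_mask(4))
```

With uniform scores, token 0 attends only to itself, while token 3 spreads its attention evenly over all four positions.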
Encoder-Decoder (T5, BART)
Encoder processes the input bidirectionally; decoder generates output autoregressively while cross-attending to encoder outputs. Good for sequence-to-sequence tasks: translation, summarization.
Feed-Forward Networks
After attention, each position passes through a position-wise feed-forward network (same weights for every position):
FFN(x) = GELU(xW₁ + b₁)W₂ + b₂ (the original paper used ReLU; GPT-style models popularized GELU)
This is where much of the model's "knowledge" is stored — the attention layers route information, the FFN layers process it.
Modern variants use SwiGLU activation (LLaMA) or GeGLU for better performance.
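The position-wise FFN with GELU activation, as a NumPy sketch. Weights are random stand-ins, and the GELU here is the common tanh approximation:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: the same weights are applied to every position."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # the hidden width d_ff is typically ~4x d_model
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.standard_normal((5, d_model))  # 5 token positions
y = ffn(x, W1, b1, W2, b2)
```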
Layer Normalization
Applied before (Pre-LN) or after (Post-LN) each sub-layer. Pre-LN is more stable during training and is the standard in modern architectures.
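A sketch of the two orderings, using an identity sub-layer so the data flow is visible. In Pre-LN the residual stream passes through unnormalized, which is one reason gradients stay better behaved:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the input, apply the sub-layer, add the residual."""
    return x + sublayer(layer_norm(x))

def post_ln_block(x, sublayer):
    """Post-LN (original paper): add the residual first, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
pre = pre_ln_block(x, lambda t: t)   # identity sub-layer for illustration
post = post_ln_block(x, lambda t: t)
```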
Scaling Laws
Transformer performance scales predictably with three factors:
- Parameters (model size)
- Data (training tokens)
- Compute (FLOPs)
Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established that these follow power laws, enabling prediction of model performance before training. Chinchilla further showed that for a fixed compute budget, parameters and training tokens should be scaled roughly in proportion.
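As an illustration, the Chinchilla parametric loss fit L(N, D) = E + A/N^α + B/D^β can be evaluated directly. The coefficients below are the approximate fitted values reported by Hoffmann et al.:

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss fit from Hoffmann et al. (2022):
    L(N, D) = E + A / N**alpha + B / D**beta
    N = parameters, D = training tokens. E is the irreducible loss term;
    constants are the paper's approximate reported fit."""
    return E + A / N**alpha + B / D**beta

# Doubling training data at fixed model size lowers the predicted loss:
l1 = chinchilla_loss(N=70e9, D=1.4e12)
l2 = chinchilla_loss(N=70e9, D=2.8e12)
```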