Transformers
The architecture behind modern language models — attention, encoders, and decoders.
The Transformer Architecture
Introduced in "Attention Is All You Need" (Vaswani et al., 2017), the transformer replaced recurrent architectures with pure attention mechanisms. It processes all positions in parallel, enabling massive speedups and better handling of long-range dependencies.
Self-Attention
The core mechanism. For each token in the sequence, self-attention computes how much to "attend to" every other token.
Three learned projections per token:
- Query (Q) — what am I looking for?
- Key (K) — what do I contain?
- Value (V) — what information do I provide?
Scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The division by √d_k prevents dot products from growing too large in high dimensions, which would push softmax into regions with tiny gradients.
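The formula above can be sketched in a few lines of NumPy — a toy, single-head version with no masking or batching:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) query-key similarities
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of value vectors

# Toy example: 3 tokens, d_k = 2.
Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = Q.copy()
V = np.array([[1., 2.], [3., 4.], [5., 6.]])
out = attention(Q, K, V)
```

Because each row of the softmax output sums to 1, every output vector is a convex combination of the value vectors.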
Multi-Head Attention
Instead of one attention function, run h attention heads in parallel:
- Project Q, K, V into h different subspaces (different learned linear projections)
- Compute attention independently in each head
- Concatenate outputs and project back
Each head can learn to attend to different types of relationships — syntactic structure, semantic similarity, positional patterns, coreference.
Typical configurations: GPT-3 uses 96 heads, LLaMA-2 70B uses 64 heads.
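The three steps above can be sketched in NumPy. Random matrices stand in for the learned projections; slicing one (d_model × d_model) projection into h chunks is equivalent to giving each head its own (d_model × d_k) projection:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h):
    """X: (seq, d_model). Split into h heads, attend per head, concat, project."""
    seq, d_model = X.shape
    d_k = d_model // h
    # Learned projections (random stand-ins here).
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)        # this head's subspace
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])  # attention within the head
    return np.concatenate(heads, axis=-1) @ Wo    # concat + output projection

X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multi_head_attention(X, h=4)
```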
Positional Encoding
Self-attention is permutation-invariant — it treats the input as a set, losing sequence order. Positional encodings restore position information.
Sinusoidal (Original)
Fixed functions: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Because PE(pos + k) is a linear function of PE(pos), the different frequencies let the model attend by relative position.
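A direct NumPy implementation of the sinusoidal formula — even dimensions get the sine, odd dimensions the cosine:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d))        # one frequency per dimension pair
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 8)
```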
Learned Embeddings
A learned embedding vector for each position. Simple and effective but limited to the maximum training sequence length.
Rotary Position Embeddings (RoPE)
Encodes position by rotating the query and key vectors by position-dependent angles, so attention scores depend on relative offsets. Attention naturally decays with distance. Used by LLaMA, Mistral, and most modern LLMs; extrapolates to longer sequences better than absolute position embeddings.
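A minimal sketch of the rotation: each (even, odd) pair of dimensions is treated as 2-D coordinates and rotated by an angle proportional to the token's position, with a different frequency per pair. The key property — dot products depend only on the relative offset — is what makes RoPE a relative encoding:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, d), d even.

    Row m is rotated by angles m * freq_j, one frequency per dimension pair.
    """
    seq, d = x.shape
    pos = np.arange(seq)[:, None]              # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    theta = pos * freqs                        # (seq, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin         # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# With identical vectors at every position, the dot product of two rotated
# vectors depends only on their distance apart.
x = np.ones((6, 4))
rotated = rope(x)
```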
Encoder vs Decoder
Encoder-Only (BERT)
Bidirectional attention — each token attends to all other tokens. Good for understanding tasks: classification, NER, sentence similarity. Pre-trained with masked language modeling.
Decoder-Only (GPT)
Causal (left-to-right) attention — each token only attends to previous tokens. Good for generation tasks. Pre-trained with next-token prediction. This is the dominant architecture for modern LLMs.
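The difference between bidirectional and causal attention is just a mask on the score matrix. A minimal sketch with uniform scores, so the effect of the mask is easy to see:

```python
import numpy as np

def causal_mask(seq):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq, seq), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)  # blocked positions get -inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores, for illustration
w = masked_softmax(scores, causal_mask(4))
```

With uniform scores, token 0 attends only to itself, while token 3 spreads its attention evenly over all four positions.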
Encoder-Decoder (T5, BART)
Encoder processes the input bidirectionally; decoder generates output autoregressively while cross-attending to encoder outputs. Good for sequence-to-sequence tasks: translation, summarization.
Feed-Forward Networks
After attention, each position passes through a position-wise feed-forward network (same weights for every position):
FFN(x) = GELU(xW₁ + b₁)W₂ + b₂ (the original paper used ReLU; GPT-style models popularized GELU)
This is where much of the model's "knowledge" is stored — the attention layers route information, the FFN layers process it.
Modern variants use SwiGLU activation (LLaMA) or GeGLU for better performance.
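The position-wise FFN with GELU activation, as a NumPy sketch. Weights are random stand-ins, and the GELU here is the common tanh approximation:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: the same weights are applied to every position."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # the hidden width d_ff is typically ~4x d_model
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.standard_normal((5, d_model))  # 5 token positions
y = ffn(x, W1, b1, W2, b2)
```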
Layer Normalization
Applied before (Pre-LN) or after (Post-LN) each sub-layer. Pre-LN is more stable during training and is the standard in modern architectures.
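A sketch of the two orderings, using an identity sub-layer so the data flow is visible. In Pre-LN the residual stream passes through unnormalized, which is one reason gradients stay better behaved:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the input, apply the sub-layer, add the residual."""
    return x + sublayer(layer_norm(x))

def post_ln_block(x, sublayer):
    """Post-LN (original paper): add the residual first, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
pre = pre_ln_block(x, lambda t: t)   # identity sub-layer for illustration
post = post_ln_block(x, lambda t: t)
```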
Scaling Laws
Transformer performance scales predictably with three factors:
- Parameters (model size)
- Data (training tokens)
- Compute (FLOPs)
Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) established that these follow power laws, enabling prediction of model performance before training. Chinchilla further showed that for a fixed compute budget, parameters and training tokens should be scaled roughly in proportion.
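As an illustration, the Chinchilla parametric loss fit L(N, D) = E + A/N^α + B/D^β can be evaluated directly. The coefficients below are the approximate fitted values reported by Hoffmann et al.:

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss fit from Hoffmann et al. (2022):
    L(N, D) = E + A / N**alpha + B / D**beta
    N = parameters, D = training tokens. E is the irreducible loss term;
    constants are the paper's approximate reported fit."""
    return E + A / N**alpha + B / D**beta

# Doubling training data at fixed model size lowers the predicted loss:
l1 = chinchilla_loss(N=70e9, D=1.4e12)
l2 = chinchilla_loss(N=70e9, D=2.8e12)
```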