Vectorized Operations
Vectorized operations replace slow Python loops with fast matrix math. In neural networks, a single matrix multiplication processes an entire batch of inputs simultaneously — this is what makes training on GPUs possible and practical.
Intuition First
Imagine you have 1,000 students and you want to compute each student's final grade from their three exam scores.
Slow way: Write a loop. Grab student 1, multiply their scores by the weights, add up, store. Repeat 999 more times. Python runs each step sequentially.
Fast way: Stack all 1,000 student scores into a matrix (1000×3). Stack the exam weights into a column vector (3×1). Do one matrix multiplication: (1000×3) × (3×1) = (1000×1). All 1,000 grades computed simultaneously, at hardware speed.
This is vectorization: express loops as matrix operations, then let the hardware run it in parallel.
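The grade computation above can be sketched directly in NumPy (the scores and exam weights here are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0, 100, size=(1000, 3))  # 1,000 students, 3 exam scores each
weights = np.array([[0.25], [0.25], [0.5]])   # exam weights as a (3, 1) column vector

# Slow way: one student at a time
grades_loop = np.zeros((1000, 1))
for i in range(1000):
    grades_loop[i] = scores[i] @ weights

# Fast way: one (1000x3) @ (3x1) matrix multiply
grades = scores @ weights                     # shape (1000, 1)

print(grades.shape)                           # (1000, 1)
print(np.allclose(grades_loop, grades))       # True, same numbers either way
```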
What's Actually Happening
Python loops are slow because:
- Each iteration has interpreter overhead
- Memory access patterns are irregular
- CPUs and GPUs have no chance to parallelize
Matrix operations are fast because:
- Entire arrays are passed to optimized BLAS/cuBLAS routines
- Hardware can process many elements simultaneously (SIMD, GPU cores)
- Memory access is contiguous and cache-friendly
A matrix multiply on a GPU can perform billions of multiply-add operations per second. The same computation in a Python loop takes thousands of times longer.
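A rough micro-benchmark sketch makes the gap visible even on a CPU (exact speedups depend on your machine and BLAS build, so no specific factor is claimed here):

```python
import time
import numpy as np

X = np.random.randn(1000, 512)
W = np.random.randn(512, 512)

# Loop: one matrix-vector product per sample, with interpreter overhead each time
t0 = time.perf_counter()
out_loop = np.stack([X[i] @ W for i in range(X.shape[0])])
t_loop = time.perf_counter() - t0

# Vectorized: one call into optimized BLAS
t0 = time.perf_counter()
out_vec = X @ W
t_vec = time.perf_counter() - t0

print(np.allclose(out_loop, out_vec))  # True
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```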
Build the Idea Step-by-Step
Formal Explanation
From Loop to Matrix Multiply
A single linear layer computes:
yᵢ = xᵢ @ W + b for each sample i (with xᵢ a row vector)
That is n separate vector-matrix products for a batch of n samples.
Vectorized: stack all xᵢ into rows of matrix X:
X ∈ ℝ^(n × d_in) (n samples, d_in features each)
W ∈ ℝ^(d_in × d_out) (weight matrix)
b ∈ ℝ^(d_out) (bias — broadcast across all n samples)
Y = X @ W + b → Y ∈ ℝ^(n × d_out)
One matrix multiply processes the entire batch. No loop.
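A minimal shape check of this, with arbitrary sizes n=5, d_in=3, d_out=2:

```python
import numpy as np

n, d_in, d_out = 5, 3, 2
X = np.random.randn(n, d_in)      # batch of samples, one per row
W = np.random.randn(d_in, d_out)  # weight matrix
b = np.random.randn(d_out)        # bias, broadcast across all rows

Y = X @ W + b                     # (5, 3) @ (3, 2) + (2,) -> (5, 2)
print(Y.shape)                    # (5, 2)

# Same result with an explicit per-sample loop, to confirm the equivalence
Y_loop = np.stack([X[i] @ W + b for i in range(n)])
print(np.allclose(Y, Y_loop))     # True
```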
Broadcasting
Broadcasting is how NumPy/PyTorch handles operations between arrays of different shapes by automatically expanding dimensions:
X: shape (100, 4) — 100 samples, 4 features
b: shape (4,) — 1 bias per output feature
X + b — b expands to (100, 4) automatically
Rules:
- If arrays have different numbers of dimensions, prepend 1s to the smaller shape
- Dimensions of size 1 expand to match the other array
- Dimensions must either match or be 1
(100, 1) + (4,) → (100, 1) + (1, 4) → (100, 4) ✓
(100, 4) + (100,) → error: dimensions don't align ✗
(100, 4) + (1, 4) → (100, 4) ✓
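The three cases above, checked directly in NumPy:

```python
import numpy as np

a = np.ones((100, 1)) + np.ones((4,))    # (4,) -> (1, 4); both size-1 dims expand
print(a.shape)   # (100, 4)

c = np.ones((100, 4)) + np.ones((1, 4))  # the size-1 dim expands to 100
print(c.shape)   # (100, 4)

try:
    np.ones((100, 4)) + np.ones((100,))  # (100,) aligns with the LAST dim (4): mismatch
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)  # True
```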
Elementwise Operations
Activation functions apply independently to every element:
ReLU(X) = max(0, X) ← applied to every element of X simultaneously
sigmoid(X) = 1/(1+e⁻ˣ) ← e^X is elementwise exponentiation, / is elementwise division
No loops. The operation is dispatched once to hardware.
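Both activations written as elementwise array operations, with no Python loop:

```python
import numpy as np

X = np.array([[-2.0, -0.5],
              [0.0, 3.0]])

relu = np.maximum(0, X)         # max(0, x) applied to every element at once
sigmoid = 1 / (1 + np.exp(-X))  # exp and division are elementwise too

print(relu)           # [[0. 0.] [0. 3.]]
print(sigmoid.shape)  # (2, 2), same shape as X
```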
Key Properties / Rules
| Concept | Detail |
|---|---|
| Matrix multiply shape rule | (m×k) @ (k×n) = (m×n) — inner dims must match |
| Batch dimension first | Convention: shape (batch, features) — first dim is samples |
| Broadcasting direction | Expand dims of size 1 to match; prepend 1s if needed |
| Contiguous memory | tensor.contiguous() ensures fast element access — matters for reshaping |
| .T vs .transpose() | For 2D: same. For higher dims, specify which axes to swap |
| einsum | Flexible notation for any dot product / contraction — more general than @ |
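The `.T` vs `.transpose()` row, demonstrated with NumPy (the same distinction applies to PyTorch tensors):

```python
import numpy as np

x = np.random.randn(2, 3, 4)
print(x.T.shape)                   # (4, 3, 2): .T reverses ALL dims
print(x.transpose(0, 2, 1).shape)  # (2, 4, 3): swaps only the last two axes

m = np.random.randn(3, 5)
print(np.array_equal(m.T, m.transpose()))  # True, identical for 2D
```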
Why It Matters
This is why GPUs are necessary. A GPU has thousands of small cores. Matrix multiplication maps perfectly: each core computes one element of the output. A 1000×1000 × 1000×1000 multiply has 10⁶ output elements — all computed in parallel.
A neural network forward pass is almost entirely matrix ops:
Layer 1: X₁ = ReLU(X @ W₁ + b₁)
Layer 2: X₂ = ReLU(X₁ @ W₂ + b₂)
Output: logits = X₂ @ W₃ + b₃
Three matrix multiplies, three bias additions, two elementwise ReLUs. If your batch size is 64, you're computing outputs for 64 inputs simultaneously.
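The three-layer forward pass above as a NumPy sketch (the layer sizes here are arbitrary illustration values):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

batch, d_in, h1, h2, n_classes = 64, 10, 32, 16, 4
X = np.random.randn(batch, d_in)
W1, b1 = np.random.randn(d_in, h1), np.zeros(h1)
W2, b2 = np.random.randn(h1, h2), np.zeros(h2)
W3, b3 = np.random.randn(h2, n_classes), np.zeros(n_classes)

X1 = relu(X @ W1 + b1)    # (64, 32)
X2 = relu(X1 @ W2 + b2)   # (64, 16)
logits = X2 @ W3 + b3     # (64, 4): outputs for all 64 samples at once

print(logits.shape)       # (64, 4)
```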
Attention in transformers is vectorized:
Attention(Q, K, V) = softmax(Q @ Kᵀ / √d_k) @ V
Q @ Kᵀ computes similarity scores for all query-key pairs in one matrix multiply. For a sequence of 512 tokens, that's 512×512 = 262,144 pairs computed simultaneously.
Common Pitfalls
- Transposing wrong dimensions. `W.T` transposes correctly for 2D. For a 3D tensor `(batch, seq, features)`, `.T` reverses all dims — usually not what you want. Use `.transpose(1, 2)` or `.permute(0, 2, 1)` to specify which axes to swap.
- Shape confusion with batched matmul. For `(batch, m, k) @ (batch, k, n)`, use `torch.bmm`. Or just `@` in PyTorch — it broadcasts over batch dims automatically.
- In-place operations breaking autograd. `x += 1` modifies in-place and can corrupt PyTorch's computation graph. Prefer `x = x + 1` in training code.
- Using Python loops in the model forward pass. Looping over batch items is almost always wrong — vectorize with a matrix op instead. If you find yourself writing `for i in range(batch_size)`, rethink the operation.
- Ignoring memory layout. Calling `.view()` on a non-contiguous tensor fails. Call `.contiguous().view(...)` if you get an error, or prefer `.reshape()`, which handles both cases.
Examples
```python
import numpy as np

# --- Loop vs vectorized: same result, very different speed ---
W = np.random.randn(4, 8)        # weight matrix: 4 inputs -> 8 outputs
b = np.random.randn(8)
X = np.random.randn(1000, 4)     # batch of 1000 samples

# Slow: loop over each sample
output_loop = np.zeros((1000, 8))
for i in range(1000):
    output_loop[i] = X[i] @ W + b    # one dot product at a time

# Fast: single matrix multiply
output_vectorized = X @ W + b        # one call to optimized BLAS

print(np.allclose(output_loop, output_vectorized))  # True — same result
```

```python
# Broadcasting — adding a bias to a batch
import numpy as np

X = np.random.randn(32, 64)   # 32 samples, 64 features
b = np.random.randn(64)       # 1 bias per feature (not per sample)

result = X + b                # b broadcasts from (64,) to (32, 64)
print(result.shape)           # (32, 64)

# More explicit broadcasting example
A = np.ones((4, 1))           # column vector
B = np.ones((1, 5))           # row vector
print((A + B).shape)          # (4, 5) — outer sum
```

```python
# Vectorized attention (simplified)
import numpy as np

d_k = 64       # key dimension
seq_len = 10   # sequence length
batch = 2

Q = np.random.randn(batch, seq_len, d_k)   # queries
K = np.random.randn(batch, seq_len, d_k)   # keys
V = np.random.randn(batch, seq_len, d_k)   # values

# Compute all seq_len^2 attention scores at once
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq_len, seq_len)

# Softmax over key dimension (axis=-1)
scores_max = scores.max(axis=-1, keepdims=True)
exp_scores = np.exp(scores - scores_max)
attn_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

output = attn_weights @ V    # (batch, seq_len, d_k)
print(output.shape)          # (2, 10, 64) — attention output for all tokens at once
```

```python
# PyTorch: torch.einsum for readable, flexible matrix operations
import torch

batch, seq, d = 2, 10, 64
Q = torch.randn(batch, seq, d)
K = torch.randn(batch, seq, d)

# Two ways to compute Q @ Kᵀ:
method1 = Q @ K.transpose(-2, -1)              # explicit transpose
method2 = torch.einsum("bqd,bkd->bqk", Q, K)   # einsum: dot over d for each (q, k) pair

print(torch.allclose(method1, method2))        # True
# einsum is often clearer for 3D+ operations —
# "bqd,bkd->bqk" reads: "for each batch b, dot product over d between query q and key k"
```