Mnemosyne

Vectorized Operations

Vectorized operations replace slow Python loops with fast matrix math. In neural networks, a single matrix multiplication processes an entire batch of inputs simultaneously — this is what makes training on GPUs possible and practical.

Intuition First

Imagine you have 1,000 students and you want to compute each student's final grade from their three exam scores.

Slow way: Write a loop. Grab student 1, multiply their scores by the weights, add up, store. Repeat 999 more times. Python runs each step sequentially.

Fast way: Stack all 1,000 student scores into a matrix (1000×3). Stack the exam weights into a column vector (3×1). Do one matrix multiplication: (1000×3) × (3×1) = (1000×1). All 1,000 grades computed simultaneously, at hardware speed.

This is vectorization: express loops as matrix operations, then let the hardware run it in parallel.
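As a concrete sketch of the grades example (the weights and scores below are made up):

import numpy as np

scores = np.random.rand(1000, 3) * 100     # 1,000 students × 3 exam scores
weights = np.array([[0.3], [0.3], [0.4]])  # hypothetical exam weights, shape (3, 1)

# Slow way: one student at a time
grades_loop = np.array([scores[i] @ weights for i in range(1000)])

# Fast way: one matrix multiply, (1000×3) @ (3×1) = (1000×1)
grades_vec = scores @ weights

print(np.allclose(grades_loop, grades_vec))  # True: identical results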


What's Actually Happening

Python loops are slow because:

  • Each iteration has interpreter overhead
  • Memory access patterns are irregular
  • CPUs and GPUs have no chance to parallelize

Matrix operations are fast because:

  • Entire arrays are passed to optimized BLAS/cuBLAS routines
  • Hardware can process many elements simultaneously (SIMD, GPU cores)
  • Memory access is contiguous and cache-friendly

A matrix multiply on a GPU can perform trillions of multiply-add operations per second. The same computation written as a Python loop can take thousands of times longer.
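A rough way to see the gap on your own machine (a sketch, not a rigorous benchmark; the exact ratio depends on hardware, NumPy build, and array sizes):

import time
import numpy as np

X = np.random.randn(100_000, 64)
W = np.random.randn(64, 64)

t0 = time.perf_counter()
out_loop = np.stack([X[i] @ W for i in range(X.shape[0])])  # 100,000 separate products
t1 = time.perf_counter()
out_vec = X @ W                                             # one BLAS call
t2 = time.perf_counter()

print(np.allclose(out_loop, out_vec))                # True
print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.3f}s")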


Build the Idea Step-by-Step

  1. One sample: y = Wx + b — a dot product
  2. Batch of samples: Y = XW + b — matrix multiplication
  3. GPU: runs thousands of dot products in parallel
  4. Broadcasting: b (1D) automatically expands to match the batch dimension
  5. Elementwise ops: ReLU, sigmoid — apply a function to every element at once
  6. Result: the entire forward pass of a neural network = a few matrix ops

Formal Explanation

From Loop to Matrix Multiply

A single linear layer computes:

yᵢ = W · xᵢ + b    for each sample i

This is n dot products for a batch of n samples.

Vectorized: stack all xᵢ as rows of a matrix X. (With samples stacked as rows, the weights multiply on the right, so this W is the transpose of the W in the per-sample form above.)

X ∈ ℝ^(n × d_in)      (n samples, d_in features each)
W ∈ ℝ^(d_in × d_out)  (weight matrix)
b ∈ ℝ^(d_out)         (bias — broadcast across all n samples)

Y = X @ W + b   →   Y ∈ ℝ^(n × d_out)

One matrix multiply processes the entire batch. No loop.
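This is exactly what torch.nn.Linear computes, with one detail worth knowing: PyTorch stores the weight with shape (d_out, d_in) and applies X @ W.T + b internally, so it matches the formula above up to a transpose. A quick check with arbitrary sizes:

import torch

n, d_in, d_out = 5, 4, 8
X = torch.randn(n, d_in)

layer = torch.nn.Linear(d_in, d_out)       # weight: (d_out, d_in), bias: (d_out,)
manual = X @ layer.weight.T + layer.bias   # the same computation, written out

print(torch.allclose(layer(X), manual))    # True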

Broadcasting

Broadcasting is how NumPy and PyTorch handle operations between arrays of different shapes, by automatically expanding dimensions:

X: shape (100, 4)   — 100 samples, 4 features
b: shape (4,)       — 1 bias per output feature

X + b               — b expands to (100, 4) automatically

Rules:

  1. If arrays have different numbers of dimensions, prepend 1s to the smaller shape
  2. Dimensions of size 1 expand to match the other array
  3. Dimensions must either match or be 1

(100, 1) + (4,)    →  (100, 1) + (1, 4)  →  (100, 4)   ✓
(100, 4) + (100,)  →  error: dimensions don't align    ✗
(100, 4) + (1, 4)  →  (100, 4)                         ✓
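These three cases can be checked directly; np.broadcast_shapes (NumPy ≥ 1.20) applies the rules without allocating anything:

import numpy as np

print(np.broadcast_shapes((100, 1), (4,)))    # (100, 4): prepend 1s, then expand
print(np.broadcast_shapes((100, 4), (1, 4)))  # (100, 4)

try:
    np.broadcast_shapes((100, 4), (100,))     # (100,) becomes (1, 100): 4 vs 100 mismatch
except ValueError as e:
    print("error:", e)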

Elementwise Operations

Activation functions apply independently to every element:

ReLU(X) = max(0, X)     ← applied to every element of X simultaneously
sigmoid(X) = 1/(1 + e^(−X))  ← e^(−X) is elementwise exponentiation, / is elementwise division

No loops. The operation is dispatched once to hardware.
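Both are one vectorized call in NumPy (a minimal sketch):

import numpy as np

X = np.random.randn(32, 64)

relu = np.maximum(0, X)          # elementwise max with 0 (the scalar broadcasts to X's shape)
sigmoid = 1 / (1 + np.exp(-X))   # exp and division are elementwise too

print(relu.shape, sigmoid.shape)  # (32, 64) (32, 64)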


Key Properties / Rules

Concept                     Detail
Matrix multiply shape rule  (m×k) @ (k×n) = (m×n) — inner dims must match
Batch dimension first       Convention: shape (batch, features) — first dim is samples
Broadcasting direction      Expand dims of size 1 to match; prepend 1s if needed
Contiguous memory           tensor.contiguous() ensures fast element access — matters for reshaping
.T vs .transpose()          For 2D: same. For higher dims, specify which axes to swap
einsum                      Flexible notation for any dot product / contraction — more general than @
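A quick PyTorch check of the transpose rows above (sizes chosen arbitrarily):

import torch

t2 = torch.randn(3, 5)
print(t2.T.shape)                # torch.Size([5, 3]): for 2D, .T is the ordinary transpose

t3 = torch.randn(2, 10, 64)      # (batch, seq, features)
print(t3.transpose(1, 2).shape)  # torch.Size([2, 64, 10]): swaps exactly the axes you name
print(t3.permute(0, 2, 1).shape) # torch.Size([2, 64, 10]): equivalent here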

Why It Matters

This is why GPUs are so effective for deep learning. A GPU has thousands of small cores, and matrix multiplication maps onto them perfectly: each core can compute one element of the output. A (1000×1000) @ (1000×1000) multiply has 10⁶ output elements — all computed in parallel.

A neural network forward pass is almost entirely matrix ops:

Layer 1: X₁ = ReLU(X @ W₁ + b₁)
Layer 2: X₂ = ReLU(X₁ @ W₂ + b₂)
Output:  logits = X₂ @ W₃ + b₃

Three matrix multiplies, three bias additions, two elementwise ReLUs. If your batch size is 64, you're computing outputs for 64 inputs simultaneously.
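The same three-layer pass written out in NumPy, with made-up layer sizes:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

batch, d_in, h1, h2, n_out = 64, 10, 32, 16, 3   # hypothetical sizes
X = np.random.randn(batch, d_in)
W1, b1 = np.random.randn(d_in, h1), np.zeros(h1)
W2, b2 = np.random.randn(h1, h2), np.zeros(h2)
W3, b3 = np.random.randn(h2, n_out), np.zeros(n_out)

X1 = relu(X @ W1 + b1)    # (64, 32)
X2 = relu(X1 @ W2 + b2)   # (64, 16)
logits = X2 @ W3 + b3     # (64, 3): outputs for all 64 samples at once

print(logits.shape)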

Attention in transformers is vectorized:

Attention(Q, K, V) = softmax(Q @ Kᵀ / √d_k) @ V

Q @ Kᵀ computes similarity scores for all query-key pairs in one matrix multiply. For a sequence of 512 tokens, that's 512×512 = 262,144 pairs computed simultaneously.


Common Pitfalls

  • Transposing wrong dimensions. W.T transposes correctly for 2D. For a 3D tensor (batch, seq, features), .T reverses all dims — usually not what you want. Use .transpose(1, 2) or .permute(0, 2, 1) to specify which axes to swap.
  • Shape confusion with batched matmul. For (batch, m, k) @ (batch, k, n) use torch.bmm. Or just @ in PyTorch — it broadcasts over batch dims automatically.
  • In-place operations breaking autograd. x += 1 modifies in-place and can corrupt PyTorch's computation graph. Prefer x = x + 1 in training code.
  • Using Python loops in the model forward pass. Looping over batch items is almost always wrong — instead, vectorize with a matrix op. If you find yourself doing for i in range(batch_size), rethink the operation.
  • Ignoring memory layout. Calling .view() on a non-contiguous tensor fails. Call .contiguous().view(...) if you get an error, or prefer .reshape() which handles both cases.
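Two of these pitfalls reproduced in PyTorch (a sketch; exact error messages vary by version):

import torch

# In-place op breaking autograd: exp's backward pass needs its own output,
# so modifying that output in place raises a RuntimeError.
x = torch.randn(3, requires_grad=True)
y = x.exp()
y += 1
try:
    y.sum().backward()
except RuntimeError as e:
    print("autograd error:", e)

# .view() on a non-contiguous tensor fails; .reshape() handles both cases.
t = torch.randn(4, 6).t()             # transpose produces a non-contiguous view
try:
    t.view(24)
except RuntimeError as e:
    print("view error:", e)
print(t.reshape(24).shape)            # torch.Size([24])
print(t.contiguous().view(24).shape)  # torch.Size([24])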

Examples

import numpy as np

# --- Loop vs vectorized: same result, very different speed ---

W = np.random.randn(4, 8)   # weight matrix: 4 inputs -> 8 outputs
b = np.random.randn(8)
X = np.random.randn(1000, 4) # batch of 1000 samples

# Slow: loop over each sample
output_loop = np.zeros((1000, 8))
for i in range(1000):
    output_loop[i] = X[i] @ W + b   # one dot product at a time

# Fast: single matrix multiply
output_vectorized = X @ W + b        # one call to optimized BLAS

print(np.allclose(output_loop, output_vectorized))  # True — same result

# --- Broadcasting: adding a bias to a batch ---
import numpy as np

X = np.random.randn(32, 64)   # 32 samples, 64 features
b = np.random.randn(64)       # 1 bias per feature (not per sample)

result = X + b   # b broadcasts from (64,) to (32, 64)
print(result.shape)  # (32, 64)

# More explicit broadcasting example
A = np.ones((4, 1))   # column vector
B = np.ones((1, 5))   # row vector
print((A + B).shape)  # (4, 5) — outer sum

# --- Vectorized attention (simplified) ---
import numpy as np

d_k = 64      # key dimension
seq_len = 10  # sequence length
batch = 2

Q = np.random.randn(batch, seq_len, d_k)   # queries
K = np.random.randn(batch, seq_len, d_k)   # keys
V = np.random.randn(batch, seq_len, d_k)   # values

# Compute all seq_len^2 attention scores at once
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq_len, seq_len)

# Softmax over key dimension (axis=-1)
scores_max = scores.max(axis=-1, keepdims=True)
exp_scores = np.exp(scores - scores_max)
attn_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

output = attn_weights @ V   # (batch, seq_len, d_k)
print(output.shape)          # (2, 10, 64) — attention output for all tokens at once

# --- PyTorch: torch.einsum for readable, flexible matrix operations ---
import torch

batch, seq, d = 2, 10, 64
Q = torch.randn(batch, seq, d)
K = torch.randn(batch, seq, d)

# Two ways to compute Q @ Kᵀ:
method1 = Q @ K.transpose(-2, -1)                    # explicit transpose
method2 = torch.einsum("bqd,bkd->bqk", Q, K)        # einsum: q-d dot k-d for each batch

print(torch.allclose(method1, method2))  # True
# einsum is often clearer for 3D+ operations:
# "bqd,bkd->bqk" reads "for each batch b, dot product over d between query q and key k"

Review Questions