Matrix-Vector Multiplication
Multiplying a matrix by a vector produces a new vector. The matrix is a transformation — it rotates, scales, or projects the input into a new space. Every linear layer in a neural network is built on this operation.
Intuition First
Imagine you have a machine that takes a 3D point and outputs a 2D point. You feed it [x, y, z], it applies some rule, and out comes [a, b]. That machine is a matrix.
More concretely: every linear neural network layer is exactly this. You have a weight matrix W. You feed it an input vector x. Out comes an output vector y. The layer is transforming the input — reshaping, rotating, and scaling it into a new representation.
This is matrix-vector multiplication: y = W @ x.
What's Actually Happening
You can think about y = Ax in two complementary ways — they give the same result, but each reveals different structure.
Row view: each output entry y[i] is a dot product between row i of A and the full input x. It asks: "how much does x align with this row direction?"
Column view: Ax is a weighted sum of A's columns, where the weights are the entries of x:
Ax = x[0]·(col 0) + x[1]·(col 1) + ... + x[n-1]·(col n-1)
The column view is key: the output always lives in the column space of A — the set of all linear combinations of A's columns. A can never produce a result outside this space, no matter what you put in.
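To make this concrete, here is a minimal numpy sketch (the matrix values are arbitrary): the two columns of A below are collinear, so its column space is a single line, and every output lands on that line no matter what input you choose.

import numpy as np

# Columns [1, 2] and [2, 4] are collinear, so the column space is just the line "second coordinate = 2 x first"
A = np.array([[1., 2.],
              [2., 4.]])

for x in (np.array([3., -1.]), np.array([0.5, 7.]), np.array([-2., 0.25])):
    y = A @ x
    print(y, "second entry is twice the first:", np.isclose(y[1], 2 * y[0]))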
Formal Explanation
For A (m×n) and x (n-dim):
y = Ax
y[i] = Σⱼ A[i,j] · x[j] (row view: dot product with row i)
Ax = Σⱼ x[j] · A[:, j] (column view: linear combination of columns)
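A quick numeric check of both formulas against numpy's @ operator (the matrix and vector here are arbitrary choices):

import numpy as np

A = np.array([[1., 0., -1.],
              [0., 1., 2.]])   # 2x3: maps 3D -> 2D
x = np.array([3., 1., 2.])

y = A @ x

# Row view: each output entry is a dot product with one row of A
row_view = np.array([A[i] @ x for i in range(A.shape[0])])

# Column view: a weighted sum of A's columns, weights given by x
col_view = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(y, row_view, col_view)  # all three agree
assert np.allclose(y, row_view) and np.allclose(y, col_view)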
Common geometric effects of different matrices:
| Matrix type | Effect on vectors |
|---|---|
| Identity I | No change |
| Diagonal (s₁, s₂) | Scale axis 1 by s₁, axis 2 by s₂ |
| Rotation by θ | Rotate all vectors by θ |
| Projection matrix | Squash onto a line or plane |
| Singular matrix | Collapses some directions to zero |
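Each row of the table can be verified directly. Here is a small sketch where the angle, scale factors, and test vector are arbitrary choices:

import numpy as np

v = np.array([1., 1.])

I = np.eye(2)                                   # identity: no change
D = np.diag([2., 0.5])                          # diagonal: scale each axis separately
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],  # rotation by 90 degrees
              [np.sin(theta),  np.cos(theta)]])
P = np.array([[1., 0.],                         # projection onto the x-axis
              [0., 0.]])
S = np.array([[1., 2.],                         # singular: columns are collinear
              [2., 4.]])

for name, M in [("identity", I), ("diagonal", D), ("rotation", R),
                ("projection", P), ("singular", S)]:
    print(f"{name:10s} {M @ v}")

print("singular collapses [2, -1]:", S @ np.array([2., -1.]))  # -> [0., 0.]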
Key Properties / Rules
| Property | Description |
|---|---|
| Shape | (m×n) @ (n,) → (m,) |
| Linearity | A(u + v) = Au + Av |
| Linearity | A(cu) = c·Au |
| Origin always fixed | A·0 = 0 — linear maps can't translate |
| Composition | B(Ax) = (BA)x — order reverses in the matrix |
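All four rules are easy to confirm numerically. A brief sketch with randomly chosen matrices and vectors:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # maps 3D -> 4D
B = rng.standard_normal((2, 4))   # maps 4D -> 2D
u, v = rng.standard_normal(3), rng.standard_normal(3)
c = 2.5

assert (A @ u).shape == (4,)                        # (m×n) @ (n,) -> (m,)
assert np.allclose(A @ (u + v), A @ u + A @ v)      # additivity
assert np.allclose(A @ (c * u), c * (A @ u))        # homogeneity
assert np.allclose(A @ np.zeros(3), np.zeros(4))    # origin is fixed
assert np.allclose(B @ (A @ u), (B @ A) @ u)        # composition: (BA)x
print("all properties hold")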
Why It Matters
The forward pass of a neural network is a sequence of matrix-vector multiplications:
h₁ = σ(W₁x)
h₂ = σ(W₂h₁)
output = W₃h₂
Each Wᵢ reshapes the representation, and σ (ReLU, GELU) adds non-linearity between them.
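Here is a minimal sketch of that forward pass, assuming ReLU for σ and arbitrary layer widths:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))    # 4D input  -> 8D hidden
W2 = rng.standard_normal((8, 8))    # 8D hidden -> 8D hidden
W3 = rng.standard_normal((2, 8))    # 8D hidden -> 2D output
x = rng.standard_normal(4)

relu = lambda z: np.maximum(z, 0.)  # the non-linearity sigma

h1 = relu(W1 @ x)
h2 = relu(W2 @ h1)
output = W3 @ h2
print(output.shape)  # (2,)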
Without non-linearity: composing two linear maps W₂(W₁x) = (W₂W₁)x is still just one linear map. Any chain of linear layers can be collapsed to a single one. Depth only helps because of the activation function.
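You can verify the collapse directly, and see that inserting a non-linearity breaks it (random weights, purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((4, 5))
x = rng.standard_normal(3)

# Two linear layers collapse into the single matrix W2 @ W1
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# With a ReLU in between, no single matrix reproduces the map in general
relu = lambda z: np.maximum(z, 0.)
print(W2 @ relu(W1 @ x))   # generally differs from the line below
print((W2 @ W1) @ x)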
Common Pitfalls
- A linear transformation always fixes the origin. A·0 = 0 always. This is why layers add a bias: y = Ax + b. Without b, the model can only represent hyperplanes through the origin, which severely limits expressivity.
- The output dimension is the number of rows. A 5×3 matrix maps 3D → 5D. Input size = columns, output size = rows. This is the most common shape confusion.
- Composition reverses order. "Apply A first, then B" = B(Ax) = (BA)x. The combined matrix is BA, not AB. The order flips when you write it as a product (see the sketch after this list).
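A short sketch of the ordering pitfall: applying A first and then B matches the single matrix BA, not AB (random square matrices so that both products are defined).

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

step_by_step = B @ (A @ x)      # apply A first, then B
combined_BA  = (B @ A) @ x      # the same map as one matrix
combined_AB  = (A @ B) @ x      # wrong order

print(np.allclose(step_by_step, combined_BA))  # True
print(np.allclose(step_by_step, combined_AB))  # almost always False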
Examples
import numpy as np
# A (2×3) matrix maps 3D → 2D
A = np.array([[1., 0., -1.],
              [0., 1., 2.]])
x = np.array([3., 1., 2.])
y = A @ x # [3-2, 1+4] = [1., 5.]
print(f"Input: {x} → Output: {y}")
# Column view gives the same result
col_view = x[0]*A[:,0] + x[1]*A[:,1] + x[2]*A[:,2]
print(f"Column view: {col_view}") # identical to y
# Batch: 32 inputs, each 3-dim → 2-dim output
W = np.random.randn(2, 3) # weight matrix
X = np.random.randn(32, 3) # 32 input vectors
Y = X @ W.T # (32, 3) @ (3, 2) → (32, 2)
print(f"Batch output shape: {Y.shape}") # (32, 2)
# Why bias matters: A @ 0 is always 0, so without b the layer is pinned to the origin
b = np.array([1., -3.])
y_with_bias = A @ x + b  # an affine map: linear transform plus a shift
print(f"With bias: {y_with_bias}")