
Matrices

A matrix is a rectangular grid of numbers. Matrix multiplication composes transformations, and the transpose flips rows into columns. These two operations are the foundation of every neural network layer.

Intuition First

A matrix is just a 2D grid of numbers — rows and columns. The moment you have a spreadsheet of data (samples × features), you have a matrix. The moment you write down the weights of a neural network layer, you have a matrix.

Two operations matter most:

  • Multiplication — combining two transformations into one
  • Transpose — flipping rows into columns (and vice versa)

If you can reason about matrix shapes and understand why AB ≠ BA, you have the core skill for debugging deep learning code.


What's Actually Happening

Multiplication

When you multiply matrix A by matrix B, each entry of the result is a dot product: row i of A dotted with column j of B.

C[i, j] = (row i of A) · (column j of B)

Think of it as composing transformations: when C = AB acts on a vector x, B applies first and then A applies on top, since (AB)x = A(Bx). The result C is both transformations baked into a single matrix.

The critical constraint: inner dimensions must match. (m×n) @ (n×p) → (m×p). The n must be the same.
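
A quick check of both points, assuming NumPy. The specific matrices here are arbitrary; only their shapes matter.

import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])          # (2×2)
B = np.array([[0., 1.],
              [1., 0.]])          # (2×2), swaps coordinates
x = np.array([1., 5.])

# C = A @ B is the single matrix that does "B first, then A"
C = A @ B
print(np.allclose(C @ x, A @ (B @ x)))  # True

# Shape rule: (2×3) @ (3×4) → (2×4); the inner 3 must match
M = np.random.randn(2, 3)
N = np.random.randn(3, 4)
print((M @ N).shape)                    # (2, 4)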

Transpose

The transpose swaps rows and columns. A[i, j] becomes Aᵀ[j, i].

If A is (m×n), then Aᵀ is (n×m).

The key rule for transposing products: (AB)ᵀ = BᵀAᵀ — order reverses.
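
Both rules are easy to verify numerically; a small sketch, assuming NumPy, with random matrices standing in for anything specific:

import numpy as np

A = np.random.randn(3, 5)   # (3×5)
B = np.random.randn(5, 2)   # (5×2)

print(A.T.shape)                          # (5, 3): rows and columns swap
print(np.allclose((A @ B).T, B.T @ A.T))  # True: order reverses
# A.T @ B.T would be (5×3) @ (2×5) and raises a shape error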


Build the Idea Step-by-Step

  • Start with A of shape (m×n) and B of shape (n×p).
  • Check the inner dimension: the two n's must match.
  • Compute C = A @ B: each entry C[i,j] is the dot product of row i of A with column j of B.
  • The result C has shape (m×p). The loop sketch below spells this out.
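
The same walkthrough written as explicit loops. This is a sketch for intuition, not how you'd compute products in practice; matmul_loops is a made-up name, and only NumPy arrays are assumed.

import numpy as np

def matmul_loops(A, B):
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(m):                 # row i of A
        for j in range(p):             # column j of B
            C[i, j] = A[i] @ B[:, j]   # one dot product per entry
    return C

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)
print(np.allclose(matmul_loops(A, B), A @ B))  # True, result shape (4, 5)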

Formal Explanation

Matrix product C = AB where A is m×n, B is n×p:

C[i, j] = Σₖ A[i,k] · B[k,j]

Example:

A = [[1, 2],    B = [[5, 6],
     [3, 4]]         [7, 8]]

C[0,0] = 1·5 + 2·7 = 19
C[0,1] = 1·6 + 2·8 = 22
C[1,0] = 3·5 + 4·7 = 43
C[1,1] = 3·6 + 4·8 = 50

Transpose:

(Aᵀ)[i, j] = A[j, i]
(AB)ᵀ = BᵀAᵀ   ← order reverses!

Key Properties / Rules

Property              Formula                  Notes
Shape rule            (m×n) @ (n×p) → (m×p)    Inner dims must match
Non-commutative       AB ≠ BA in general       Order always matters
Associative           (AB)C = A(BC)            Can group freely
Distributive          A(B+C) = AB + AC         Works like algebra
Transpose of product  (AB)ᵀ = BᵀAᵀ             Order reverses
Symmetric matrix      A = Aᵀ                   Covariance matrices, Hessians
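
Every row of this table can be checked numerically. A quick sketch, assuming NumPy; the shapes are arbitrary as long as the products are defined.

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.random.randn(4, 5)
D = np.random.randn(5, 2)

print(np.allclose((A @ B) @ D, A @ (B @ D)))    # associative
print(np.allclose(A @ (B + C), A @ B + A @ C))  # distributive
print(np.allclose((A @ B).T, B.T @ A.T))        # transpose of product

X = np.random.randn(10, 4)
S = X.T @ X                  # covariance-style matrix
print(np.allclose(S, S.T))   # symmetric: S = Sᵀ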

Why It Matters

A linear layer in a neural network is: output = W @ input + b. When processing a batch of inputs at once, you multiply W by an entire input matrix. The forward pass of a neural network is a chain of matrix multiplications.

Shape reasoning is the most practically useful linear algebra skill. Shape errors are the most common bug in deep learning. Track (batch, features) through every operation.
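
A minimal sketch of such a layer, assuming NumPy; the sizes are invented for illustration, and the bias b broadcasts across the batch.

import numpy as np

batch, in_features, out_features = 8, 3, 4

W = np.random.randn(out_features, in_features)  # (4×3) weight matrix
b = np.random.randn(out_features)               # (4,) bias
X = np.random.randn(batch, in_features)         # (8×3): (batch, features)

Y = X @ W.T + b              # (8,3) @ (3,4) → (8,4); b broadcasts over rows
print(Y.shape)               # (8, 4): still (batch, features)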


Common Pitfalls

  • * vs @ in NumPy/PyTorch. A * B is element-wise multiplication (shapes must match or be broadcast-compatible), while A @ B is matrix multiplication. Both are valid syntax, so when the shapes line up you get no error, just silently wrong results (see the demo after this list).
  • BA doesn't exist just because AB does. (3×4) @ (4×2) is valid, but (4×2) @ (3×4) is not. Always verify shapes before multiplying.
  • Transpose rule order. (AB)ᵀ = BᵀAᵀ, not AᵀBᵀ. Forgetting the reversal causes shape bugs in backpropagation derivations.
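
The first pitfall in action, assuming NumPy. Same operands, same output shape, completely different numbers:

import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

print(A * B)   # element-wise: [[ 5. 12.] [21. 32.]]
print(A @ B)   # matrix product: [[19. 22.] [43. 50.]]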

Examples

import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

C = A @ B       # [[19,22],[43,50]]
At = A.T        # [[1,3],[2,4]] — rows and columns swapped

print(f"A shape: {A.shape}")
print(f"C = A@B:\n{C}")
print(f"Aᵀ:\n{At}")
print(f"AB ≠ BA: {not np.allclose(A @ B, B @ A)}")  # True

# Neural network layer: W maps 3-dim input to 4-dim output
W = np.random.randn(4, 3)   # (out_features × in_features)
x = np.random.randn(3)      # single input
y = W @ x                   # shape (4,)

# Batch: process 8 inputs simultaneously
X = np.random.randn(8, 3)   # (batch × in_features)
Y = X @ W.T                 # (8,3) @ (3,4) → shape (8, 4)
print(f"Batch output shape: {Y.shape}")  # (8, 4)

Review Questions