Rank of a Matrix
Rank measures how many truly independent directions a matrix spans — its real information content. A low-rank matrix is secretly simpler than it looks, and low-rank structure is the foundation of compression, LoRA fine-tuning, and understanding when systems have solutions.
Intuition First
Imagine a 1000×1000 spreadsheet, but every row is just a multiple of the same row. All that data, but really only one independent piece of information. That spreadsheet has rank 1 — it looks big but contains almost nothing new.
Rank is the count of truly independent rows (or columns) in a matrix. It tells you how many distinct "dimensions of information" the matrix actually contains.
A matrix that maps 100-dimensional inputs to 100-dimensional outputs might really only move things around in a 3-dimensional subspace. That's a rank-3 matrix disguised as a 100×100 grid.
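A minimal sketch of that picture (the sizes and random values are just for illustration): build a large matrix as the product of two thin ones, so it can only reach a 3-dimensional subspace no matter how big it looks.
import numpy as np
rng = np.random.default_rng(0)
U = rng.standard_normal((100, 3))    # 3 independent directions
V = rng.standard_normal((3, 100))
M = U @ V                            # a 100×100 grid of numbers...
print(M.shape)                       # (100, 100)
print(np.linalg.matrix_rank(M))      # ...but rank 3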
What's Actually Happening
The column space of a matrix is the set of all vectors you can produce by multiplying the matrix by some input. Rank is the dimension of that column space — how many independent directions the matrix can reach.
Equivalently, rank = number of linearly independent columns = number of linearly independent rows. These two counts are always equal (a non-obvious fact).
Full rank means rank = min(rows, columns) — all rows and columns are independent. The matrix is using every dimension it has.
Rank-deficient means some rows/columns are linear combinations of others — the matrix is collapsing some directions to zero.
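As a tiny illustration (the numbers are made up): a 4×3 matrix whose third column is the sum of the first two can only reach a 2-dimensional column space, so it is rank-deficient.
import numpy as np
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 1., 3.],
              [1., 2., 3.]])         # column 2 = column 0 + column 1
print(np.linalg.matrix_rank(A))      # 2, short of min(4, 3) = 3: rank-deficient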
Build the Idea Step-by-Step
Formal Explanation
Definition: rank(A) = dimension of the column space of A = number of linearly independent columns.
Equivalent definitions (all equal):
- Number of non-zero rows after row-reduction (Gaussian elimination)
- Number of non-zero singular values (from SVD)
- Dimension of the row space
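A quick numerical check that these counts agree, on a made-up 3×3 matrix with one redundant row (the 1e-10 cutoff is an arbitrary choice for this example):
import numpy as np
A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [0., 1., 1.]])              # row 1 = 2 × row 0
s = np.linalg.svd(A, compute_uv=False)    # singular values only
print((s > 1e-10).sum())                  # 2 non-zero singular values
print(np.linalg.matrix_rank(A))           # 2
print(np.linalg.matrix_rank(A.T))         # 2: the row space has the same dimension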
Rank-nullity theorem:
rank(A) + nullity(A) = n
Where n = number of columns, and nullity = dimension of the null space (vectors A maps to zero). If rank goes up, null space shrinks — they trade off.
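A sketch of that trade-off in NumPy (the matrix is chosen arbitrarily), using the SVD to pull out a null-space basis from the rows of Vᵀ paired with zero singular values:
import numpy as np
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])               # 2×3
n = A.shape[1]
_, s, Vt = np.linalg.svd(A)
rank = int((s > 1e-10).sum())              # 2
nullity = n - rank                         # 1
null_basis = Vt[rank:]                     # directions A sends to zero
print(rank + nullity == n)                 # True: rank + nullity = n
print(np.allclose(A @ null_basis.T, 0))    # True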
For an m×n matrix:
- rank(A) ≤ min(m, n)
- Full rank: rank(A) = min(m, n)
- Rank-deficient: rank(A) < min(m, n)
Invertibility (square matrices only):
- A is invertible ⟺ rank(A) = n ⟺ det(A) ≠ 0 ⟺ nullity(A) = 0 (the only vector mapped to zero is the zero vector)
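A quick check of that chain of equivalences on two small matrices, one invertible and one singular (a sketch for intuition, not how you would test invertibility in practice):
import numpy as np
A = np.array([[2., 1.],
              [1., 3.]])                   # independent rows: rank 2
B = np.array([[1., 2.],
              [2., 4.]])                   # row 1 = 2 × row 0: rank 1
for M in (A, B):
    r = np.linalg.matrix_rank(M)
    print(r, abs(np.linalg.det(M)) > 1e-12, r == M.shape[0])   # rank, det ≠ 0, invertible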
Key Properties / Rules
| Property | Meaning |
|---|---|
| rank(A) = rank(Aᵀ) | Row rank = column rank |
| rank(A) ≤ min(m, n) | Bounded by smaller dimension |
| rank = count of non-zero singular values | SVD gives the cleanest rank computation |
| rank(AB) ≤ min(rank(A), rank(B)) | Multiplication can only reduce rank |
| rank(A) = n (square) | A is invertible |
| rank = n − nullity | Rank-nullity theorem |
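Two of these properties checked numerically on random matrices (the shapes are arbitrary): transposing never changes rank, and multiplying by a low-rank matrix can only lose rank.
import numpy as np
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))                                # rank 4 (almost surely)
B = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 5))  # rank ≤ 2 by construction
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T))    # 4 4: rank(A) = rank(Aᵀ)
print(np.linalg.matrix_rank(A @ B))                            # ≤ min(4, 2) = 2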
Why It Matters
LoRA (Low-Rank Adaptation): fine-tuning LLMs by adding a low-rank update ΔW = BA where B is m×r and A is r×n, with r ≪ min(m,n). Instead of updating all m×n parameters, you only train the two thin matrices. The assumption: the meaningful update lives in a low-rank subspace. Rank r = 4, 8, or 16 often works as well as full fine-tuning. This is only justified because weight updates are empirically low-rank.
Solving linear systems Ax = b: if rank(A) < n (rank-deficient), the system has either no solution or infinitely many. If A is square with rank(A) = n, there is exactly one solution. In neural network terms: an under-determined system has many weight configs that fit the training data — rank structure determines solution uniqueness.
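A sketch of those cases on tiny systems (the numbers are made up): a full-rank square system has one solution, while with a rank-deficient A the outcome depends on whether b lies in the column space.
import numpy as np
A_full = np.array([[1., 1.],
                   [1., -1.]])                       # rank 2
print(np.linalg.solve(A_full, np.array([3., 1.])))   # unique solution [2. 1.]
A_def = np.array([[1., 2.],
                  [2., 4.]])                         # rank 1: deficient
for b in (np.array([3., 6.]), np.array([3., 5.])):   # in / not in the column space
    x, _, rank, _ = np.linalg.lstsq(A_def, b, rcond=None)
    print(rank, np.allclose(A_def @ x, b))           # exact fit only when b is reachable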
Numerical rank: in practice, matrices are rarely exactly rank-deficient due to floating-point noise. The effective rank is the number of singular values above some threshold ε. Libraries use this to compute "numerically stable" ranks.
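A sketch of choosing that cutoff yourself (the noise scale and ε below are arbitrary choices for this example): an exactly rank-2 matrix plus tiny noise technically has 50 non-zero singular values, but only 2 that matter.
import numpy as np
rng = np.random.default_rng(2)
exact = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 50))   # exactly rank 2
noisy = exact + 1e-8 * rng.standard_normal((50, 50))                  # simulated float noise
s = np.linalg.svd(noisy, compute_uv=False)
eps = 1e-6                                        # picked above the noise floor
print((s > eps).sum())                            # effective rank: 2
print(np.linalg.matrix_rank(noisy))               # likely 50: default tolerance counts the noise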
Attention and representation collapse: if a model's attention weight matrix or representation matrix becomes low-rank (few independent directions), the model has collapsed — different inputs are being mapped to the same representation. Tracking the effective rank of intermediate layer activations is a diagnostic tool.
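One way to implement that diagnostic (the collapse below is simulated by construction, and the 90%-energy criterion is just one common convention for measuring effective rank):
import numpy as np
def effective_rank(X, energy=0.90):
    # smallest k such that the top-k singular values carry the requested share of the total
    s = np.linalg.svd(X, compute_uv=False)
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, energy) + 1)
rng = np.random.default_rng(4)
healthy = rng.standard_normal((128, 64))                                   # batch × hidden
collapsed = rng.standard_normal((128, 2)) @ rng.standard_normal((2, 64))   # only 2 directions
print(effective_rank(healthy))     # large: activations spread over many directions
print(effective_rank(collapsed))   # 1 or 2: representations have collapsed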
Common Pitfalls
- Rank is not the same as number of non-zero entries. A matrix can be dense but low-rank (every entry non-zero, but all rows are combinations of a few).
- Computing rank with floating point is fragile. Never use np.linalg.matrix_rank blindly for continuous-valued matrices — it uses a threshold. Better: look at the singular values yourself and choose a meaningful cutoff.
- rank(A + B) ≠ rank(A) + rank(B). Rank doesn't add linearly. The correct bounds: rank(A + B) ≤ rank(A) + rank(B) and rank(A + B) ≥ |rank(A) − rank(B)| (see the sketch below).
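A small demonstration of the first and last pitfalls (the matrices are arbitrary): a fully dense matrix with rank 1, and a sum whose rank lands between the bounds rather than at rank(A) + rank(B).
import numpy as np
dense = np.outer(np.arange(1., 6.), np.arange(1., 6.))   # 5×5, every entry non-zero
print(np.linalg.matrix_rank(dense))                      # 1: each row is a multiple of [1..5]
A = np.diag([1., 1., 0.])                                # rank 2
B = np.diag([0., -1., -1.])                              # rank 2
print(np.linalg.matrix_rank(A + B))                      # 2, not 4: between |2 − 2| = 0 and 2 + 2 = 4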
Examples
import numpy as np
# --- Rank-1 matrix: outer product ---
u = np.array([[1.], [2.], [3.]])
v = np.array([[4., 5., 6.]])
A = u @ v # 3×3 but rank 1 — every row is a multiple of [4,5,6]
print(np.linalg.matrix_rank(A)) # 1
# --- Full-rank matrix ---
B = np.array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]]) # identity
print(np.linalg.matrix_rank(B)) # 3
# --- Rank-deficient: third row = first + second ---
C = np.array([[1., 2., 3.],
[4., 5., 6.],
[5., 7., 9.]]) # row 2 = row 0 + row 1
print(np.linalg.matrix_rank(C)) # 2
# --- Rank via singular values (more robust) ---
_, singular_values, _ = np.linalg.svd(C)
print(singular_values.round(4)) # last value is ~0, confirming rank 2
# --- LoRA-style low-rank update ---
r = 4 # rank of the update
m, n = 768, 768 # typical transformer weight shape
B_mat = np.random.randn(m, r) * 0.01 # m×r
A_mat = np.random.randn(r, n) * 0.01 # r×n
delta_W = B_mat @ A_mat # 768×768 but rank ≤ 4
original_params = m * n # 589,824
lora_params = m * r + r * n # 2×768×4 = 6,144
print(f"Parameter reduction: {original_params / lora_params:.1f}×") # ~96×
print(np.linalg.matrix_rank(delta_W)) # 4