
Rank of a Matrix

Rank measures how many truly independent directions a matrix spans — its real information content. A low-rank matrix is secretly simpler than it looks, and low-rank structure is the foundation of compression, LoRA fine-tuning, and understanding when systems have solutions.

Intuition First

Imagine a 1000×1000 spreadsheet, but every row is just a multiple of the same row. All that data, but really only one independent piece of information. That spreadsheet has rank 1 — it looks big but contains almost nothing new.

Rank is the count of truly independent rows (or columns) in a matrix. It tells you how many distinct "dimensions of information" the matrix actually contains.

A matrix that maps 100-dimensional inputs to 100-dimensional outputs might really only move things around in a 3-dimensional subspace. That's a rank-3 matrix disguised as a 100×100 grid.
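
A sketch of that claim at a smaller scale: summing three rank-1 outer products fills a 100×100 grid, yet the result spans only three directions.

import numpy as np

rng = np.random.default_rng(0)
# Sum of three rank-1 outer products: rank at most 3
A = sum(np.outer(rng.standard_normal(100), rng.standard_normal(100)) for _ in range(3))
print(A.shape)                       # (100, 100): looks big
print(np.linalg.matrix_rank(A))      # 3: contains three directions of information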


What's Actually Happening

The column space of a matrix is the set of all vectors you can produce by multiplying the matrix by some input. Rank is the dimension of that column space — how many independent directions the matrix can reach.

Equivalently, rank = number of linearly independent columns = number of linearly independent rows. These two counts are always equal (a non-obvious fact).
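
A quick numerical spot-check of that equality on a rectangular matrix:

import numpy as np

A = np.array([[1., 2., 0., 1.],
              [0., 1., 1., 0.],
              [1., 3., 1., 1.]])     # 3×4; row 2 = row 0 + row 1
print(np.linalg.matrix_rank(A))      # 2
print(np.linalg.matrix_rank(A.T))    # 2: row rank equals column rank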

Full rank means rank = min(rows, columns) — all rows and columns are independent. The matrix is using every dimension it has.

Rank-deficient means some rows/columns are linear combinations of others — the matrix is collapsing some directions to zero.
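
To make the collapse concrete, the sketch below shows two different inputs that differ by a null-space vector landing on exactly the same output:

import numpy as np

C = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])           # rank 2: row 2 = row 0 + row 1
x = np.array([1., 1., 1.])
z = np.array([1., -2., 1.])            # a null-space vector: C @ z = 0
print(C @ z)                           # [0. 0. 0.]
print(np.allclose(C @ x, C @ (x + z))) # True: the direction z is collapsed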


Build the Idea Step-by-Step

  1. Count the independent columns, ignoring duplicates and linear combinations of others.
  2. That count is the rank.
  3. Rank = dimension of the column space (which outputs are reachable).
  4. Full rank → invertible (square) or a unique solution when one exists (tall); see the sketch after this list.
  5. Rank-deficient → some information is lost, and the matrix is not invertible.
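
A minimal check of step 4 for the square case: np.linalg.inv succeeds on a full-rank matrix and raises LinAlgError on a rank-deficient one.

import numpy as np

full = np.array([[2., 0.], [1., 1.]])       # rank 2: invertible
deficient = np.array([[1., 2.], [2., 4.]])  # rank 1: row 1 = 2 * row 0

print(np.linalg.inv(full))                  # works
try:
    np.linalg.inv(deficient)
except np.linalg.LinAlgError as e:
    print("not invertible:", e)             # singular matrix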

Formal Explanation

Definition: rank(A) = dimension of the column space of A = number of linearly independent columns.

Equivalent definitions (all equal; the first two are checked in the sketch after this list):

  • Number of non-zero rows after row-reduction (Gaussian elimination)
  • Number of non-zero singular values (from SVD)
  • Dimension of the row space
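
A quick consistency check of the first two definitions, assuming SymPy is available for exact row reduction (floating-point elimination makes "non-zero row" ambiguous):

import numpy as np
from sympy import Matrix

C = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])

# Non-zero rows after exact row reduction (SymPy works in exact arithmetic)
_, pivot_cols = Matrix(C.tolist()).rref()
print(len(pivot_cols))                       # 2 pivots, so rank 2

# Non-zero singular values
s = np.linalg.svd(C.astype(float), compute_uv=False)
print(int(np.sum(s > 1e-10)))                # 2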

Rank-nullity theorem:

rank(A) + nullity(A) = n

Where n = number of columns, and nullity = dimension of the null space (vectors A maps to zero). If rank goes up, null space shrinks — they trade off.
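
A numerical check of the trade-off, assuming SciPy is available (scipy.linalg.null_space returns an orthonormal basis for the null space):

import numpy as np
from scipy.linalg import null_space

C = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])            # n = 3 columns

rank = np.linalg.matrix_rank(C)
nullity = null_space(C).shape[1]        # number of null-space basis vectors
print(rank, nullity, rank + nullity)    # 2 1 3, matching n = 3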

For an m×n matrix (illustrated right after this list):

  • rank(A) ≤ min(m, n)
  • Full rank: rank(A) = min(m, n)
  • Rank-deficient: rank(A) < min(m, n)
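
A quick illustration of the bound: a 2×5 matrix can never exceed rank 2, no matter how it is filled.

import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((2, 5))             # m = 2, n = 5
print(np.linalg.matrix_rank(M))             # 2 = min(m, n): full rank
print(np.linalg.matrix_rank(np.vstack([M[0], 2 * M[0]])))  # duplicated row: rank 1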

Invertibility (square matrices only):

  • A is invertible ⟺ rank(A) = n ⟺ det(A) ≠ 0 ⟺ null space is 0
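
The sketch below exercises these equivalences on one singular matrix:

import numpy as np

D = np.array([[2., 4.],
              [3., 6.]])                # row 1 = 1.5 * row 0, so rank 1
print(np.linalg.matrix_rank(D))         # 1 < n = 2, so not invertible
print(np.linalg.det(D))                 # 0.0 (up to floating-point rounding)
print(D @ np.array([2., -1.]))          # [0. 0.]: the null space is non-trivial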

Key Properties / Rules

Property                                    Meaning
rank(A) = rank(Aᵀ)                          Row rank = column rank
rank(A) ≤ min(m, n)                         Bounded by the smaller dimension
rank = count of non-zero singular values    SVD gives the cleanest rank computation
rank(AB) ≤ min(rank(A), rank(B))            Multiplication can only reduce rank
rank(A) = n (square)                        A is invertible
rank = n − nullity                          Rank-nullity theorem
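
Most rows in this table are one-line checks; the multiplication bound is the least obvious, so here is a quick demonstration that a product of thin matrices stays low-rank:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))         # rank 2 (almost surely)
B = rng.standard_normal((2, 5))         # rank 2 (almost surely)
print(np.linalg.matrix_rank(A @ B))     # 2: the 5×5 product cannot exceed min(2, 2)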

Why It Matters

LoRA (Low-Rank Adaptation): fine-tuning LLMs by adding a low-rank update ΔW = BA, where B is m×r and A is r×n, with r ≪ min(m, n). Instead of updating all m×n parameters, you train only the two thin matrices. The assumption: the meaningful update lives in a low-rank subspace. Rank r = 4, 8, or 16 often works about as well as full fine-tuning. This is justified because fine-tuning weight updates are empirically close to low-rank.
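
The Examples section below constructs the update itself. One practical consequence worth a minimal sketch here (shapes are arbitrary): the full m×n matrix ΔW never needs to be materialized, because (W + BA)x can be computed as Wx + B(Ax) using only the thin factors.

import numpy as np

rng = np.random.default_rng(3)
m, n, r = 768, 768, 8
W = rng.standard_normal((m, n))
B_mat = rng.standard_normal((m, r)) * 0.01
A_mat = rng.standard_normal((r, n)) * 0.01
x = rng.standard_normal(n)

y_dense = (W + B_mat @ A_mat) @ x           # materializes the full 768×768 update
y_lazy = W @ x + B_mat @ (A_mat @ x)        # two thin multiplies, no m×n intermediate
print(np.allclose(y_dense, y_lazy))         # True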

Solving linear systems Ax = b: if rank(A) < n (rank-deficient), the system has either no solution or infinitely many. If rank(A) = n (full column rank), a solution is unique whenever one exists; for square A this means exactly one solution. In neural network terms: an under-determined system has many weight configurations that fit the training data, and rank structure determines whether the solution is unique.
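
A minimal illustration of the rank-deficient case, using np.linalg.lstsq for one particular solution and a null-space vector to generate the rest:

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])            # rank 2 < n = 3
b = A @ np.array([1., 1., 1.])          # consistent by construction

x0, *_ = np.linalg.lstsq(A, b, rcond=None)  # one particular solution
z = np.array([1., -2., 1.])                 # null-space direction: A @ z = 0
print(np.allclose(A @ (x0 + 5 * z), b))     # True: a whole line of solutions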

Numerical rank: in practice, matrices are rarely exactly rank-deficient due to floating-point noise. The effective rank is the number of singular values above some threshold ε. Libraries use this to compute "numerically stable" ranks.
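
A sketch of that threshold idea: an exactly rank-2 matrix plus tiny noise looks full-rank to a naive non-zero count, but thresholding the singular values recovers 2.

import numpy as np

rng = np.random.default_rng(4)
u1, v1 = rng.standard_normal(50), rng.standard_normal(50)
u2, v2 = rng.standard_normal(50), rng.standard_normal(50)
low_rank = np.outer(u1, v1) + np.outer(u2, v2)        # exact rank 2
noisy = low_rank + 1e-10 * rng.standard_normal((50, 50))

s = np.linalg.svd(noisy, compute_uv=False)
print(int(np.sum(s > 0)))               # 50: every singular value is technically non-zero
print(int(np.sum(s > 1e-6)))            # 2: effective rank with a sensible threshold ε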

Attention and representation collapse: if a model's attention weight matrix or representation matrix becomes low-rank (few independent directions), the model has collapsed — different inputs are being mapped to the same representation. Tracking the effective rank of intermediate layer activations is a diagnostic tool.
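
A toy version of that diagnostic, using a thresholded singular-value count as the effective rank (the helper below is illustrative, not a standard API):

import numpy as np

def effective_rank(X, eps=1e-2):
    """Count singular values above a relative threshold (hypothetical helper)."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

rng = np.random.default_rng(5)
healthy = rng.standard_normal((128, 64))                     # batch × features
collapsed = np.outer(rng.standard_normal(128), np.ones(64))  # every input, one direction
collapsed += 1e-4 * rng.standard_normal((128, 64))           # plus a little noise

print(effective_rank(healthy))      # 64: energy spread across many directions
print(effective_rank(collapsed))    # 1: the representation has collapsed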


Common Pitfalls

  • Rank is not the same as number of non-zero entries. A matrix can be dense but low-rank (every entry non-zero, but all rows are combinations of a few).
  • Computing rank with floating point is fragile. Never use np.linalg.matrix_rank blindly for continuous-valued matrices: it applies a default tolerance to the singular values. Better: inspect the singular values yourself and choose a cutoff that is meaningful for your data.
  • rank(A + B) ≠ rank(A) + rank(B). Rank doesn't add linearly. The correct bounds (checked below): rank(A + B) ≤ rank(A) + rank(B) and rank(A + B) ≥ |rank(A) − rank(B)|.
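
A quick check of those bounds with diagonal matrices, where ranks are easy to read off:

import numpy as np

A = np.diag([1., 0., 0.])               # rank 1
B = np.diag([0., 1., 0.])               # rank 1
print(np.linalg.matrix_rank(A + B))     # 2 = rank(A) + rank(B): upper bound is tight

C = np.diag([1., 1., 0.])               # rank 2
print(np.linalg.matrix_rank(C - A))     # 1 = |rank(C) − rank(A)|: lower bound is tight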

Examples

import numpy as np

# --- Rank-1 matrix: outer product ---
u = np.array([[1.], [2.], [3.]])
v = np.array([[4., 5., 6.]])
A = u @ v          # 3×3 but rank 1 — every row is a multiple of [4,5,6]
print(np.linalg.matrix_rank(A))  # 1

# --- Full-rank matrix ---
B = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.]])   # identity
print(np.linalg.matrix_rank(B))  # 3

# --- Rank-deficient: third row = first + second ---
C = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])   # row 2 = row 0 + row 1
print(np.linalg.matrix_rank(C))  # 2

# --- Rank via singular values (more robust) ---
_, singular_values, _ = np.linalg.svd(C)
print(singular_values.round(4))   # last value is ~0, confirming rank 2

# --- LoRA-style low-rank update ---
r = 4                       # rank of the update
m, n = 768, 768             # typical transformer weight shape

B_mat = np.random.randn(m, r) * 0.01   # m×r
A_mat = np.random.randn(r, n) * 0.01   # r×n
delta_W = B_mat @ A_mat                 # 768×768 but rank ≤ 4

original_params = m * n                  # 589,824
lora_params = m * r + r * n             # 2×768×4 = 6,144
print(f"Parameter reduction: {original_params / lora_params:.1f}×")  # ~96×

print(np.linalg.matrix_rank(delta_W))   # 4

Review Questions