Mnemosyne

Matrix Decomposition

Matrix decomposition breaks a matrix into simpler structured factors. SVD expresses any matrix as a rotation, a scaling, and another rotation, revealing how much information each direction carries. This unlocks PCA, compression, and LoRA fine-tuning.

Intuition First

Imagine you have a complex transformation that rotates, scales, and reshapes space. Can you break it apart into simpler pieces?

Yes — that's what matrix decomposition does. The most powerful version (SVD) says:

Any matrix = rotate input → scale along axes → rotate into output space

Three steps. Every linear transformation, no matter how complicated, can be understood as these three simpler operations in sequence. This is the Singular Value Decomposition (SVD).

The scales in the middle — the singular values — tell you how much "energy" or "importance" each direction carries. Large singular value = important direction. Near-zero = basically irrelevant. That asymmetry is what enables compression.


What's Actually Happening

SVD writes any m×n matrix A as:

A = U · Σ · Vᵀ
  • Vᵀ — rotate the input to align with A's natural directions
  • Σ — scale each direction by its singular value (σ₁ ≥ σ₂ ≥ ... ≥ 0)
  • U — rotate into the output space

Matrices arising in practice typically have a few large singular values and many small ones. The large ones carry most of the information, so throwing away the small ones loses very little.

This is the foundation of low-rank approximation: keep only the top-k singular values and reconstruct an approximate version of A. It's the best possible rank-k approximation (by the Eckart-Young theorem).
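The Eckart-Young statement can be checked numerically: in the Frobenius norm, the error of the rank-k truncation equals the root-sum-square of the discarded singular values. A minimal sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation

# Eckart-Young: the Frobenius error of the rank-k truncation
# equals the norm of the discarded singular values
err = np.linalg.norm(A - A_k)
predicted = np.sqrt((s[k:] ** 2).sum())
print(f"error = {err:.6f}, sqrt(sum of discarded sigma^2) = {predicted:.6f}")
```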


Build the Idea Step-by-Step

Matrix A
SVD: A = UΣVᵀ
Σ diagonal: singular values σ₁ ≥ σ₂ ≥ ...
Keep top k: rank-k approximation Aₖ
Discard small σᵢ → compress, denoise, fine-tune

Formal Explanation

SVD: every m×n matrix A decomposes as A = UΣVᵀ where:

  • U (m×m): orthogonal — columns are orthonormal vectors in output space
  • Σ (m×n): diagonal entries σ₁ ≥ σ₂ ≥ ... ≥ 0 are the singular values
  • Vᵀ (n×n): orthogonal — rows are orthonormal vectors in input space

Low-rank approximation:

Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ... + σₖuₖvₖᵀ

This is the closest rank-k matrix to A in the Frobenius norm sense.
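The rank-1 sum above is exactly the matrix form of the truncated SVD; a quick numerical check of that equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Sum of k rank-1 terms sigma_i * u_i * v_i^T ...
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
# ... equals the truncated matrix product form
A_k_mat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(A_k_sum, A_k_mat))  # True
```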

Other decompositions (brief):

  • LU: A = L·U (solving Ax = b)
  • QR: A = Q·R (least squares)
  • Cholesky: A = L·Lᵀ (symmetric positive definite matrices)
  • Eigendecomposition: A = QΛQᵀ (symmetric matrices only)
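All four are a few lines in NumPy/SciPy. A sketch, assuming SciPy is installed (LU lives in scipy.linalg; the rest are in numpy.linalg):

```python
import numpy as np
from scipy.linalg import lu  # LU is in SciPy, not NumPy

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

P, L, Uf = lu(A)                     # LU with partial pivoting: A = P @ L @ Uf
print(np.allclose(A, P @ L @ Uf))    # True

Q, R = np.linalg.qr(A)               # QR: Q orthogonal, R upper triangular
print(np.allclose(A, Q @ R))         # True

S = A @ A.T + 4 * np.eye(4)          # build a symmetric positive definite matrix
C = np.linalg.cholesky(S)            # lower-triangular factor: S = C @ C.T
print(np.allclose(S, C @ C.T))       # True

w, Qe = np.linalg.eigh(S)            # eigendecomposition of a symmetric matrix
print(np.allclose(S, Qe @ np.diag(w) @ Qe.T))  # True
```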

Key Properties / Rules

  • Singular values: always σᵢ ≥ 0, sorted in descending order
  • Rank of A: the number of non-zero singular values
  • Existence: the SVD exists for any matrix, of any shape and any rank
  • Frobenius norm: ‖A‖_F = √(Σσᵢ²)
  • Best rank-k approximation: keep the first k singular values (Eckart-Young)
  • Condition number: σ_max / σ_min, a measure of numerical sensitivity
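These properties can all be verified numerically from the singular values alone; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)   # singular values only

# Rank = number of non-zero singular values
print(np.sum(s > 1e-10) == np.linalg.matrix_rank(A))                # True
# Frobenius norm = sqrt of the sum of squared singular values
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((s**2).sum())))  # True
# Condition number = sigma_max / sigma_min (2-norm, NumPy's default)
print(np.isclose(np.linalg.cond(A), s[0] / s[-1]))                  # True
```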

Why It Matters

LoRA fine-tuning is SVD intuition applied directly. Instead of updating a full d×d weight matrix (d² parameters), LoRA parameterizes the update as ΔW = B·A where B is d×r and A is r×d. Only 2dr parameters. The assumption: the optimal update has low rank — only a few important directions need adjustment during fine-tuning. SVD justifies why this works.

PCA is SVD of the data matrix (or eigendecomposition of the covariance). The first right singular vectors are the principal components; the singular values tell you how much variance each direction holds.
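Both routes give the same answer, which a short sketch can confirm: SVD of the centered data matrix versus eigendecomposition of its covariance (the data here is synthetic, just to have correlated features):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated features
Xc = X - X.mean(axis=0)                  # PCA requires centered data

# PCA via SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                          # rows = principal directions
explained_var = s**2 / (len(X) - 1)      # variance captured by each component

# Same variances via eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(X) - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
print(np.allclose(explained_var, eigvals))  # True
```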

Numerical stability: the condition number σ_max / σ_min tells you how sensitive a linear system is to errors. High condition number = numerically unstable. Never explicitly compute A⁻¹ — use np.linalg.solve(A, b) which handles conditioning internally via LU decomposition.
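A classic demonstration uses a Hilbert matrix, which is notoriously ill-conditioned even at small sizes. This sketch constructs a system whose exact solution is all ones, then solves it the recommended way:

```python
import numpy as np

# Hilbert matrix H[i, j] = 1 / (i + j + 1): a classic ill-conditioned example
n = 8
H = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)
b = H @ np.ones(n)                       # constructed so the exact solution is all ones

print(f"condition number: {np.linalg.cond(H):.2e}")   # around 1e10
x = np.linalg.solve(H, b)                # LU-based solve, no explicit inverse
print(f"max error via solve: {np.abs(x - 1).max():.2e}")
x_bad = np.linalg.inv(H) @ b             # explicit inverse: typically less accurate
print(f"max error via inv:   {np.abs(x_bad - 1).max():.2e}")
```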


Common Pitfalls

  • SVD ≠ eigendecomposition. Singular values are always ≥ 0. Eigenvalues can be negative or complex. For symmetric positive-semidefinite matrices they happen to match, but not in general.
  • Use thin SVD in practice. For a (100×10) matrix, the full U is 100×100 but only the first 10 columns are useful. np.linalg.svd(full_matrices=False) returns the thin version — much cheaper.
  • Floating-point "zero" singular values. Mathematically zero σᵢ show up as values around 1e-15 due to floating-point roundoff. Use a threshold (σᵢ > 1e-10) to determine effective rank rather than counting exact zeros.
  • Full SVD on large matrices is expensive. O(min(m,n)·m·n). For large weight matrices where you only need the top-k singular values, use randomized SVD (sklearn.utils.extmath.randomized_svd) — it's much faster.
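The first and third pitfalls are easy to see directly; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

# Thin vs full SVD on a tall matrix
A = rng.standard_normal((100, 10))
U_full, _, _ = np.linalg.svd(A)                        # full: U is (100, 100)
U_thin, _, _ = np.linalg.svd(A, full_matrices=False)   # thin: U is (100, 10)
print(U_full.shape, U_thin.shape)   # (100, 100) (100, 10)

# Effective rank: threshold instead of counting exact zeros
B = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 10))  # rank-3 by construction
sb = np.linalg.svd(B, compute_uv=False)
print(np.sum(sb > 1e-10))   # 3 (the trailing sigmas are ~1e-15, not exactly 0)
```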

Examples

import numpy as np

# Full SVD of a small matrix
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# U: (4,3), s: (3,), Vt: (3,3)  ← thin SVD

print(f"Singular values: {s.round(3)}")
print(f"Rank: {np.sum(s > 1e-10)}")

# Reconstruct A from SVD
A_reconstructed = U @ np.diag(s) @ Vt
print(f"Reconstruction error: {np.linalg.norm(A - A_reconstructed):.2e}")  # ≈ 0

# Low-rank approximation: keep top-2 singular values
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
fraction_kept = (s[:k]**2).sum() / (s**2).sum()
print(f"Rank-2 captures {fraction_kept:.1%} of total energy")

# LoRA-style low-rank weight update
d, r = 512, 8           # d = model dim, r = LoRA rank
B = np.zeros((d, r))                # B starts at 0 so ΔW starts at 0 (standard LoRA init)
A_lora = np.random.randn(r, d) * 0.01   # A gets a small random init
delta_W = B @ A_lora                # (d×d) update, rank ≤ r
params_full = d * d                 # 262,144
params_lora = d * r + r * d         # 8,192 — 32× fewer
print(f"LoRA param reduction: {params_full}  →  {params_lora}")

Review Questions