Matrix Decomposition
Matrix decomposition breaks a matrix into simpler, structured factors. SVD expresses any matrix as a rotation, then a scaling, then another rotation, revealing how much information each direction carries. This unlocks PCA, compression, and LoRA fine-tuning.
Intuition First
Imagine you have a complex transformation that rotates, scales, and reshapes space. Can you break it apart into simpler pieces?
Yes — that's what matrix decomposition does. The most powerful version (SVD) says:
Any matrix = rotate input → scale along axes → rotate into output space
Three steps. Every linear transformation, no matter how complicated, can be understood as these three simpler operations in sequence. This is the Singular Value Decomposition (SVD).
The scales in the middle — the singular values — tell you how much "energy" or "importance" each direction carries. Large singular value = important direction. Near-zero = basically irrelevant. That asymmetry is what enables compression.
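A minimal numpy sketch of this picture (the 2×2 matrix and the test vector are arbitrary, chosen only for illustration):

```python
import numpy as np

# An arbitrary 2x2 transformation, just for illustration
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)

# U and Vt are orthogonal: pure rotations/reflections, no stretching
print(np.allclose(U.T @ U, np.eye(2)))    # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # True

# Applying the three steps in sequence reproduces A acting on a vector
x = np.array([1.0, -2.0])
step1 = Vt @ x       # rotate the input
step2 = s * step1    # scale each axis by its singular value
step3 = U @ step2    # rotate into the output space
print(np.allclose(step3, A @ x))  # True
```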
What's Actually Happening
SVD writes any m×n matrix A as:
A = U · Σ · Vᵀ
- Vᵀ — rotate the input to align with A's natural directions
- Σ — scale each direction by its singular value (σ₁ ≥ σ₂ ≥ ... ≥ 0)
- U — rotate into the output space
Most matrices have a few large singular values and many small ones. The large ones carry most of the information. Throwing away the small ones loses very little.
This is the foundation of low-rank approximation: keep only the top-k singular values and reconstruct an approximate version of A. It's the best possible rank-k approximation (by the Eckart-Young theorem).
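A quick sketch of that "few large, many small" pattern, using a synthetic matrix built as low-rank structure plus small noise (the sizes, rank, and noise level here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic matrix: rank-5 structure plus a little noise
m, n, true_rank = 100, 80, 5
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
A += 0.01 * rng.standard_normal((m, n))

s = np.linalg.svd(A, compute_uv=False)
print(s[:8].round(2))                    # a few large values, then tiny ones

# Energy captured by the top-5 directions: very close to 1
print((s[:5]**2).sum() / (s**2).sum())
```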
Build the Idea Step-by-Step
Formal Explanation
SVD: every m×n matrix A decomposes as A = UΣVᵀ where:
- U (m×m): orthogonal — columns are orthonormal vectors in output space
- Σ (m×n): diagonal entries σ₁ ≥ σ₂ ≥ ... ≥ 0 are the singular values
- Vᵀ (n×n): orthogonal — rows are orthonormal vectors in input space
Low-rank approximation:
Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ... + σₖuₖvₖᵀ
This is the closest rank-k matrix to A in the Frobenius norm sense.
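A small sketch, assuming nothing beyond numpy: build Aₖ as the explicit sum of rank-1 terms, confirm it matches the sliced-matrix form, and check that the Frobenius error equals the square root of the discarded σᵢ²:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# A_k as an explicit sum of rank-1 terms sigma_i * u_i * v_i^T
A_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

# Same thing via matrix slicing
A_k_slice = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(A_k, A_k_slice))  # True

# Frobenius error of the best rank-k approximation = sqrt of discarded sigma_i^2
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.sqrt((s[k:]**2).sum())))  # True
```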
Other decompositions (brief):
| Decomposition | Formula | When used |
|---|---|---|
| LU | A = L·U | Solving Ax = b |
| QR | A = Q·R | Least squares |
| Cholesky | A = LLᵀ | Symmetric positive definite |
| Eigendecomposition | A = QΛQᵀ | Symmetric matrices only |
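For orientation, here is one way each factorization is typically computed with numpy/scipy (a sketch; the matrices are arbitrary, and `S` is constructed to be symmetric positive definite):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
S = A @ A.T + 4 * np.eye(4)     # symmetric positive definite, for Cholesky/eigh

P, L, U_lu = lu(A)              # LU with row pivoting: A = P @ L @ U
Q, R = np.linalg.qr(A)          # QR: A = Q @ R
C = np.linalg.cholesky(S)       # Cholesky: S = C @ C.T, C lower triangular
w, V = np.linalg.eigh(S)        # symmetric eigendecomposition: S = V @ diag(w) @ V.T

print(np.allclose(P @ L @ U_lu, A))
print(np.allclose(Q @ R, A))
print(np.allclose(C @ C.T, S))
print(np.allclose(V @ np.diag(w) @ V.T, S))
```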
Key Properties / Rules
| Property | Meaning |
|---|---|
| Singular values | Always σᵢ ≥ 0; sorted descending |
| Rank of A | Number of non-zero singular values |
| SVD exists for any matrix | Any shape, any rank |
| Frobenius norm | ‖A‖_F = √(Σσᵢ²) |
| Best rank-k approximation | Keep first k singular values (Eckart-Young) |
| Condition number | σ_max / σ_min — measures numerical stability |
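Each row of the table can be verified numerically; a sketch on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)

print(np.all(s >= 0) and np.all(np.diff(s) <= 0))            # non-negative, sorted descending
print(np.sum(s > 1e-10) == np.linalg.matrix_rank(A))         # rank = number of non-zero sigmas
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((s**2).sum())))  # Frobenius norm
print(np.isclose(np.linalg.cond(A), s[0] / s[-1]))           # condition number = sigma_max / sigma_min
```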
Why It Matters
LoRA fine-tuning is SVD intuition applied directly. Instead of updating a full d×d weight matrix (d² parameters), LoRA parameterizes the update as ΔW = B·A where B is d×r and A is r×d. Only 2dr parameters. The assumption: the optimal update has low rank — only a few important directions need adjustment during fine-tuning. SVD justifies why this works.
PCA is SVD of the mean-centered data matrix (equivalently, eigendecomposition of its covariance matrix). The top right singular vectors are the principal components; the squared singular values are proportional to the variance along each component.
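A minimal PCA-via-SVD sketch on synthetic 2-D data (the data and mixing matrix are made up for illustration), cross-checked against the covariance eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic correlated 2-D data (arbitrary mixing matrix)
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
Xc = X - X.mean(axis=0)                  # center the data first

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                          # rows = principal components, most variance first
explained_var = s**2 / (len(Xc) - 1)     # variance captured by each component
print(components.round(3), explained_var.round(3))

# Cross-check against eigendecomposition of the covariance matrix
w, _ = np.linalg.eigh(np.cov(Xc, rowvar=False))
print(np.allclose(np.sort(w)[::-1], explained_var))  # True
```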
Numerical stability: the condition number σ_max / σ_min tells you how sensitive a linear system is to small errors in the data. High condition number = numerically unstable. Never explicitly compute A⁻¹ to solve Ax = b — use np.linalg.solve(A, b), which factors A (LU with partial pivoting) instead of forming the inverse, and is both cheaper and more accurate.
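A sketch of why this matters, using a Hilbert matrix as a standard example of severe ill-conditioning (the size is arbitrary):

```python
import numpy as np

# Hilbert matrices are a classic example of ill-conditioning
n = 10
H = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n))
b = H @ np.ones(n)                        # exact solution is all ones

print(f"condition number: {np.linalg.cond(H):.1e}")  # huge (~1e13)

x_solve = np.linalg.solve(H, b)           # LU-based solve
x_inv = np.linalg.inv(H) @ b              # explicit inverse: avoid this
print(np.linalg.norm(x_solve - 1))        # small error
print(np.linalg.norm(x_inv - 1))          # usually larger, and ~3x the work
```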
Common Pitfalls
- SVD ≠ eigendecomposition. Singular values are always ≥ 0. Eigenvalues can be negative or complex. For symmetric positive-semidefinite matrices they happen to match, but not in general.
- Use thin SVD in practice. For a (100×10) matrix, the full U is 100×100 but only the first 10 columns are useful. `np.linalg.svd(full_matrices=False)` returns the thin version — much cheaper.
- Floating-point "zero" singular values. Truly zero σᵢ show up as `1e-15` due to floating point. Use a threshold (σᵢ > 1e-10) to determine effective rank rather than counting exact zeros.
- Full SVD on large matrices is expensive: O(min(m,n)·m·n). For large weight matrices where you only need the top-k singular values, use randomized SVD (`sklearn.utils.extmath.randomized_svd`) — it's much faster (see the sketch after this list).
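A sketch comparing randomized SVD against the top of a full SVD; it assumes scikit-learn is installed, and the test matrix (low-rank structure plus noise) is made up for illustration:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(5)
# Approximately low-rank test matrix: structure plus small noise
A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 500))
A += 0.01 * rng.standard_normal((2000, 500))

# Top-10 singular triplets without ever forming the full SVD
U, s_rand, Vt = randomized_svd(A, n_components=10, random_state=0)

s_full = np.linalg.svd(A, compute_uv=False)[:10]
rel_err = np.abs(s_rand - s_full).max() / s_full[0]
print(rel_err)  # small: the leading singular values agree closely
```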
Examples
```python
import numpy as np
# Full SVD of a small matrix
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# U: (4,3), s: (3,), Vt: (3,3) ← thin SVD
print(f"Singular values: {s.round(3)}")
print(f"Rank: {np.sum(s > 1e-10)}")
# Reconstruct A from SVD
A_reconstructed = U @ np.diag(s) @ Vt
print(f"Reconstruction error: {np.linalg.norm(A - A_reconstructed):.2e}") # ≈ 0
# Low-rank approximation: keep top-2 singular values
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
fraction_kept = (s[:k]**2).sum() / (s**2).sum()
print(f"Rank-2 captures {fraction_kept:.1%} of total energy")
# LoRA-style low-rank weight update
d, r = 512, 8 # d = model dim, r = LoRA rank
B = np.random.randn(d, r) * 0.01
A_lora = np.zeros((r, d)) # initialized to 0 so ΔW starts at 0
delta_W = B @ A_lora # (d×d) update, rank ≤ r
params_full = d * d # 262,144
params_lora = d * r + r * d # 8,192 — 32× fewer
print(f"LoRA param reduction: {params_full} → {params_lora}")