Singular Value Decomposition (SVD)
SVD is the universal matrix factorization — it breaks any matrix into rotate → scale → rotate. It reveals the hidden structure of transformations and is the mathematical engine behind PCA, recommendation systems, image compression, and why low-rank approximations work.
Intuition First
Every matrix is a transformation: it takes vectors and moves them somewhere. SVD says that no matter how complicated this transformation looks, it can always be broken into three simple steps:
- Rotate the input (without distorting it)
- Scale along clean axes (stretch some directions, compress others)
- Rotate the output (without distorting it)
That's it. Any matrix, however large or messy, does something equivalent to: align → scale → realign.
The scaling amounts are the singular values. They tell you which directions matter most — the large ones carry the most information, the small ones are nearly irrelevant.
What's Actually Happening
The decomposition is:
A = U Σ Vᵀ
- A is any m×n matrix (doesn't need to be square)
- U is an m×m orthogonal matrix — the "output rotation"
- Σ is an m×n diagonal matrix with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ 0 on the diagonal — the singular values
- Vᵀ is an n×n orthogonal matrix — the "input rotation"
Reading the transformation: Vᵀ rotates the input into a new coordinate system → Σ stretches each axis independently → U rotates into the final output coordinate system.
The singular values σᵢ measure how much the matrix stretches in each direction. Large σᵢ = important direction. σᵢ = 0 = that direction is annihilated (rank-deficient).
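A quick numerical check of this reading (the 2×2 matrix below is arbitrary): U and V are pure rotations/reflections, and each right singular vector vᵢ is sent to σᵢ·uᵢ.
import numpy as np
A = np.array([[2., 0.],
              [1., 1.]])
U, s, Vt = np.linalg.svd(A)
# U and V are orthogonal: they rotate/reflect without distorting
print(np.allclose(U.T @ U, np.eye(2)))    # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # True
# A maps each right singular vector vᵢ to σᵢ·uᵢ
for i in range(2):
    print(np.allclose(A @ Vt[i], s[i] * U[:, i]))  # True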
Build the Idea Step-by-Step
Formal Explanation
Full SVD: A = UΣVᵀ where U (m×m), Σ (m×n), V (n×n) are all full-size.
Economy (thin) SVD: if m > n, U is m×n, Σ is n×n, Vᵀ is n×n. Drops the extra zeros in Σ. This is what np.linalg.svd(A, full_matrices=False) returns.
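A quick shape check on a random tall matrix (illustration only):
import numpy as np
A = np.random.randn(5, 3)  # m=5 > n=3
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, s_full.shape, Vt_full.shape)   # (5, 5) (3,) (3, 3)
U_thin, s_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)
print(U_thin.shape, s_thin.shape, Vt_thin.shape)   # (5, 3) (3,) (3, 3)
# The thin version drops the columns of U that would only multiply
# the zero rows of Σ, so the product is unchanged
print(np.allclose(U_thin @ np.diag(s_thin) @ Vt_thin, A))  # True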
Relationship to eigendecomposition:
AᵀA = V Σᵀ Σ Vᵀ (eigendecomposition of AᵀA)
AAᵀ = U Σ Σᵀ Uᵀ (eigendecomposition of AAᵀ)
The singular values σᵢ are the square roots of the eigenvalues of AᵀA. The columns of V are the eigenvectors of AᵀA (right singular vectors). The columns of U are the eigenvectors of AAᵀ (left singular vectors).
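This relationship is easy to confirm numerically:
import numpy as np
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Eigenvalues of AᵀA, sorted descending, equal the squared singular values
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigvalsh returns ascending order
print(np.allclose(eigvals, s**2))            # True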
Low-rank approximation (Eckart-Young theorem): the best rank-k approximation to A (in Frobenius norm) is:
Aₖ = U[:, :k] · Σ[:k, :k] · Vᵀ[:k, :]
— just keep the top k singular values and their corresponding vectors. This is provably optimal — no other rank-k matrix is closer to A.
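A small sketch of what "provably optimal" means here: the truncation error equals the energy in the dropped singular values, and no other rank-k matrix (a random one, in this sketch) does better.
import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 15))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_svd = np.linalg.norm(A - A_k, 'fro')
# Frobenius error of the SVD truncation = √(sum of dropped σᵢ²)
print(np.isclose(err_svd, np.sqrt((s[k:]**2).sum())))   # True
# Any other rank-k matrix (here a random one) is at least as far from A
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 15))
print(err_svd <= np.linalg.norm(A - B, 'fro'))           # True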
Condition number: σ_max / σ_min. Large condition number = the matrix stretches some directions enormously while barely touching others. This makes linear systems involving A numerically unstable (ill-conditioned).
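For example, with a deliberately near-singular matrix:
import numpy as np
A = np.array([[1., 1.],
              [1., 1.0001]])   # rows are almost identical
s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])                                  # ≈ 4e4: badly conditioned
print(np.isclose(np.linalg.cond(A), s[0] / s[-1]))   # np.linalg.cond uses the same ratio (2-norm)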
Key Properties / Rules
| Property | Meaning |
|---|---|
| A = UΣVᵀ | Core decomposition |
| σᵢ ≥ 0, sorted descending | Singular values are non-negative and ordered |
| rank(A) = count of non-zero σᵢ | Singular values reveal rank |
| U, V are orthogonal | Uᵀ = U⁻¹, Vᵀ = V⁻¹ |
| ‖A‖₂ = σ₁ | Spectral norm = largest singular value |
| ‖A‖_F = √(σ₁² + σ₂² + ⋯) | Frobenius norm from singular values |
| Aₖ (top-k) = best rank-k approximation | Eckart-Young theorem |
Why It Matters
PCA via SVD: center your data matrix X (subtract mean). Compute X = UΣVᵀ. The columns of V are the principal components — the directions of maximum variance. The columns of U (scaled by σ) are the projections. SVD is numerically more stable for PCA than eigendecomposing the covariance matrix directly.
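A sketch showing the two routes agree (up to sign flips) on random data; the covariance-eigendecomposition path is included only for comparison:
import numpy as np
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
X -= X.mean(axis=0)                       # center first
# Route 1: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Route 2: eigendecomposition of the covariance matrix
cov = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
# Same per-component variances ...
print(np.allclose(s**2 / (X.shape[0] - 1), eigvals))   # True
# ... and the same principal directions, up to sign
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))      # True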
Image compression: a grayscale image is a matrix of pixel values. Most images are approximately low-rank — nearby pixels are correlated. Keep the top-k singular values, discard the rest. Even k=50 from a 1000×1000 image often gives a visually passable reconstruction. The ratio of retained σᵢ² / total σᵢ² tells you what fraction of "image information" you kept.
Recommendation systems: a user-item matrix (rows=users, cols=movies, entries=ratings) is sparse but approximately low-rank — users cluster into a few taste groups, movies cluster into genres. SVD factorization finds these latent factors; the top Netflix Prize entries leaned heavily on SVD-style matrix factorization.
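A toy sketch with an invented, fully observed rating matrix (real recommenders must handle missing entries, which plain SVD does not):
import numpy as np
# rows = users, cols = movies; two "taste groups" baked in
R = np.array([[5., 5., 4., 1., 1.],
              [4., 5., 5., 1., 2.],
              [1., 1., 2., 5., 4.],
              [2., 1., 1., 4., 5.]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)
print(s.round(2))   # two dominant values, the rest small: ~rank-2 structure
# Each user and each movie is described by k=2 latent factors
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(R_k.round(1))   # close to R, using far fewer numbers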
Latent Semantic Analysis (LSA): apply SVD to a term-document matrix. The low-rank approximation captures semantic relationships — words that appear in similar contexts cluster together in the reduced space, even if they never co-appear directly.
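A minimal sketch with an invented term-document count matrix:
import numpy as np
# rows = terms, cols = documents (counts invented for illustration)
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([[2., 1., 0., 0.],   # cat
              [1., 2., 0., 0.],   # dog
              [1., 1., 0., 0.],   # pet
              [0., 0., 2., 1.],   # stock
              [0., 0., 1., 2.]])  # market
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# 2-D "semantic" coordinates for each term: rows of U scaled by σ
term_vecs = U[:, :2] * s[:2]
for term, vec in zip(terms, term_vecs):
    print(f"{term:7s} {vec.round(2)}")
# cat/dog/pet land together on one axis, stock/market on the other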
Spectral norm in neural networks: regularizing the spectral norm (= σ_max) of weight matrices keeps the Lipschitz constant bounded, stabilizing GAN training (Spectral Normalization) and improving robustness.
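A minimal sketch of the idea, assuming you just want a weight matrix rescaled to spectral norm 1 (the actual technique estimates σ_max with a few power-iteration steps rather than a full SVD):
import numpy as np
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))        # a stand-in for a layer's weight matrix
sigma_max = np.linalg.svd(W, compute_uv=False)[0]   # spectral norm = largest σ
W_sn = W / sigma_max                                # spectrally normalized weights
print(np.linalg.norm(W_sn, 2))                      # ≈ 1.0, so the layer is 1-Lipschitz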
Common Pitfalls
- SVD is not eigendecomposition. Singular values ≠ eigenvalues in general. For symmetric positive-semidefinite matrices they're equal, but that's a special case. Don't conflate them.
- Sign ambiguity. U and V are not unique — you can flip the sign of any matched pair (column of U, column of V) and get a valid SVD. Libraries pick a convention, but don't rely on the sign (demonstrated in the snippet after this list).
- Computational cost. Full SVD of an m×n matrix costs O(min(m,n) · m · n). For huge matrices (embeddings, activations), use truncated SVD (sklearn.utils.extmath.randomized_svd), which approximates the top-k components much faster; a usage sketch follows this list.
- Confusion about shapes. Always check whether you're getting full SVD or economy SVD — the shape of U and Σ differs, and multiplying back to recover A requires the right shapes.
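The sign ambiguity is easy to demonstrate; the truncated-SVD call is left commented out because it needs scikit-learn installed:
import numpy as np
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Flip the sign of one matched pair (column i of U, row i of Vt):
# the product is unchanged, so this is an equally valid SVD
i = 0
U2, Vt2 = U.copy(), Vt.copy()
U2[:, i] *= -1
Vt2[i, :] *= -1
print(np.allclose(U2 @ np.diag(s) @ Vt2, A))   # True
# Truncated SVD of a large matrix (top-k components only):
# from sklearn.utils.extmath import randomized_svd
# U_k, s_k, Vt_k = randomized_svd(A, n_components=2)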
Examples
import numpy as np
# --- Basic SVD ---
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])  # 2×3 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# s is a 1D array of singular values, not a matrix
print(f"U shape: {U.shape}") # (2, 2)
print(f"s shape: {s.shape}") # (2,) — [σ₁, σ₂]
print(f"Vt shape: {Vt.shape}") # (2, 3)
# Reconstruct A
A_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, A_reconstructed)) # True
# --- Low-rank approximation ---
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(f"\nOriginal A:\n{A}")
print(f"\nRank-1 approximation:\n{A_k.round(3)}")
# Fraction of variance explained by rank-1
variance_retained = (s[:k]**2).sum() / (s**2).sum()
print(f"Variance retained: {variance_retained:.1%}") # ~99%
# --- Image compression ---
# (Simulated with random matrix)
np.random.seed(42)
image = np.random.randn(100, 100)
U_img, s_img, Vt_img = np.linalg.svd(image, full_matrices=False)
for k in [1, 5, 10, 50]:
    img_k = U_img[:, :k] @ np.diag(s_img[:k]) @ Vt_img[:k, :]
    retained = (s_img[:k]**2).sum() / (s_img**2).sum()
    params = 100*k + k + k*100  # store U[:, :k], the k singular values, and Vt[:k, :]
    print(f"k={k:2d}: {retained:.1%} variance, {params} params (vs {100*100} original)")
# --- PCA via SVD (numerically stable) ---
data = np.random.randn(500, 10) # 500 samples, 10 features
data -= data.mean(axis=0) # center
U_pca, s_pca, Vt_pca = np.linalg.svd(data, full_matrices=False)
# Principal components are rows of Vt_pca
# Projections onto first 2 PCs:
projections = data @ Vt_pca[:2].T # (500, 2)
print(f"\nProjected shape: {projections.shape}")
variance_per_pc = s_pca**2 / (s_pca**2).sum()
print(f"Variance explained by PC1, PC2: {variance_per_pc[:2].round(3)}")