Singular Value Decomposition (SVD)
SVD is the universal matrix factorization — it breaks any matrix into rotate → scale → rotate. It reveals the hidden structure of transformations and is the mathematical engine behind PCA, recommendation systems, image compression, and why low-rank approximations work.
Intuition First
Every matrix is a transformation: it takes vectors and moves them somewhere. SVD says that no matter how complicated this transformation looks, it can always be broken into three simple steps:
- Rotate the input (without distorting it)
- Scale along clean axes (stretch some directions, compress others)
- Rotate the output (without distorting it)
That's it. Any matrix, however large or messy, does something equivalent to: align → scale → realign.
The scaling amounts are the singular values. They tell you which directions matter most — the large ones carry the most information, the small ones are nearly irrelevant.
What's Actually Happening
The decomposition is:
A = U Σ Vᵀ
- A is any m×n matrix (doesn't need to be square)
- U is an m×m orthogonal matrix — the "output rotation"
- Σ is an m×n diagonal matrix with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ 0 on the diagonal — the singular values
- Vᵀ is an n×n orthogonal matrix — the "input rotation"
Reading the transformation: Vᵀ rotates the input into a new coordinate system → Σ stretches each axis independently → U rotates into the final output coordinate system.
The singular values σᵢ measure how much the matrix stretches in each direction. Large σᵢ = important direction. σᵢ = 0 = that direction is annihilated (rank-deficient).
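A quick numerical check of this reading (the 2×2 matrix below is arbitrary): U and V are pure rotations/reflections, and each right singular vector vᵢ is sent to σᵢ·uᵢ.
import numpy as np
A = np.array([[2., 0.],
              [1., 1.]])
U, s, Vt = np.linalg.svd(A)
# U and V are orthogonal: they rotate/reflect without distorting
print(np.allclose(U.T @ U, np.eye(2)))    # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # True
# A maps each right singular vector vᵢ to σᵢ·uᵢ
for i in range(2):
    print(np.allclose(A @ Vt[i], s[i] * U[:, i]))  # True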
Build the Idea Step-by-Step
Formal Explanation
Full SVD: A = UΣVᵀ where U (m×m), Σ (m×n), V (n×n) are all full-size.
Economy (thin) SVD: if m > n, U is m×n, Σ is n×n, Vᵀ is n×n. Drops the extra zeros in Σ. This is what np.linalg.svd(A, full_matrices=False) returns.
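A quick shape check on a random tall matrix (illustration only):
import numpy as np
A = np.random.randn(5, 3)  # m=5 > n=3
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, s_full.shape, Vt_full.shape)   # (5, 5) (3,) (3, 3)
U_thin, s_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)
print(U_thin.shape, s_thin.shape, Vt_thin.shape)   # (5, 3) (3,) (3, 3)
# The thin version drops the columns of U that would only multiply
# the zero rows of Σ, so the product is unchanged
print(np.allclose(U_thin @ np.diag(s_thin) @ Vt_thin, A))  # True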
Relationship to eigendecomposition:
AᵀA = V Σᵀ Σ Vᵀ (eigendecomposition of AᵀA)
AAᵀ = U Σ Σᵀ Uᵀ (eigendecomposition of AAᵀ)
The singular values σᵢ are the square roots of the eigenvalues of AᵀA. The columns of V are the eigenvectors of AᵀA (right singular vectors). The columns of U are the eigenvectors of AAᵀ (left singular vectors).
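This relationship is easy to confirm numerically:
import numpy as np
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Eigenvalues of AᵀA, sorted descending, equal the squared singular values
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigvalsh returns ascending order
print(np.allclose(eigvals, s**2))            # True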
Low-rank approximation (Eckart-Young theorem): the best rank-k approximation to A (in Frobenius norm) is:
Aₖ = U[:, :k] · Σ[:k, :k] · Vᵀ[:k, :]
— just keep the top k singular values and their corresponding vectors. This is provably optimal — no other rank-k matrix is closer to A.
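A small sketch of what "provably optimal" means here: the truncation error equals the energy in the dropped singular values, and no other rank-k matrix (a random one, in this sketch) does better.
import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 15))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_svd = np.linalg.norm(A - A_k, 'fro')
# Frobenius error of the SVD truncation = √(sum of dropped σᵢ²)
print(np.isclose(err_svd, np.sqrt((s[k:]**2).sum())))   # True
# Any other rank-k matrix (here a random one) is at least as far from A
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 15))
print(err_svd <= np.linalg.norm(A - B, 'fro'))           # True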
Condition number: σ_max / σ_min. Large condition number = the matrix stretches some directions enormously while barely touching others. This makes linear systems involving A numerically unstable (ill-conditioned).
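For example, with a deliberately near-singular matrix:
import numpy as np
A = np.array([[1., 1.],
              [1., 1.0001]])   # rows are almost identical
s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])                                  # ≈ 4e4: badly conditioned
print(np.isclose(np.linalg.cond(A), s[0] / s[-1]))   # np.linalg.cond uses the same ratio (2-norm)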
Key Properties / Rules
| Property | Meaning |
|---|---|
| A = UΣVᵀ | Core decomposition |
| σᵢ ≥ 0, sorted descending | Singular values are non-negative and ordered |
| rank(A) = count of non-zero σᵢ | Singular values reveal rank |
| U, V are orthogonal | Uᵀ = U⁻¹, Vᵀ = V⁻¹ |
| ‖A‖₂ = σ₁ | Spectral norm = largest singular value |
| ‖A‖_F = √(σ₁² + σ₂² + ⋯) | Frobenius norm from singular values |
| Aₖ (top-k) = best rank-k approximation | Eckart-Young theorem |
Why It Matters
PCA via SVD: center your data matrix X (subtract mean). Compute X = UΣVᵀ. The columns of V are the principal components — the directions of maximum variance. The columns of U (scaled by σ) are the projections. SVD is numerically more stable for PCA than eigendecomposing the covariance matrix directly.
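A sketch showing the two routes agree (up to sign flips) on random data; the covariance-eigendecomposition path is included only for comparison:
import numpy as np
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
X -= X.mean(axis=0)                       # center first
# Route 1: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Route 2: eigendecomposition of the covariance matrix
cov = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
# Same per-component variances ...
print(np.allclose(s**2 / (X.shape[0] - 1), eigvals))   # True
# ... and the same principal directions, up to sign
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))      # True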
Image compression: a grayscale image is a matrix of pixel values. Most images are approximately low-rank — nearby pixels are correlated. Keep the top-k singular values, discard the rest. Even k=50 from a 1000×1000 image often gives a visually passable reconstruction. The ratio of retained σᵢ² / total σᵢ² tells you what fraction of "image information" you kept.
Recommendation systems: a user-item matrix (rows=users, cols=movies, entries=ratings) is sparse but approximately low-rank — users cluster into a few taste groups, movies cluster into genres. SVD factorization finds these latent factors; the top Netflix Prize entries leaned heavily on SVD-style matrix factorization.
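A toy sketch with an invented, fully observed rating matrix (real recommenders must handle missing entries, which plain SVD does not):
import numpy as np
# rows = users, cols = movies; two "taste groups" baked in
R = np.array([[5., 5., 4., 1., 1.],
              [4., 5., 5., 1., 2.],
              [1., 1., 2., 5., 4.],
              [2., 1., 1., 4., 5.]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)
print(s.round(2))   # two dominant values, the rest small: ~rank-2 structure
# Each user and each movie is described by k=2 latent factors
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(R_k.round(1))   # close to R, using far fewer numbers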
Latent Semantic Analysis (LSA): apply SVD to a term-document matrix. The low-rank approximation captures semantic relationships — words that appear in similar contexts cluster together in the reduced space, even if they never co-appear directly.
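A minimal sketch with an invented term-document count matrix:
import numpy as np
# rows = terms, cols = documents (counts invented for illustration)
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([[2., 1., 0., 0.],   # cat
              [1., 2., 0., 0.],   # dog
              [1., 1., 0., 0.],   # pet
              [0., 0., 2., 1.],   # stock
              [0., 0., 1., 2.]])  # market
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# 2-D "semantic" coordinates for each term: rows of U scaled by σ
term_vecs = U[:, :2] * s[:2]
for term, vec in zip(terms, term_vecs):
    print(f"{term:7s} {vec.round(2)}")
# cat/dog/pet land together on one axis, stock/market on the other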
Spectral norm in neural networks: regularizing the spectral norm (= σ_max) of weight matrices keeps the Lipschitz constant bounded, stabilizing GAN training (Spectral Normalization) and improving robustness.
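A minimal sketch of the idea, assuming you just want a weight matrix rescaled to spectral norm 1 (the actual technique estimates σ_max with a few power-iteration steps rather than a full SVD):
import numpy as np
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))        # a stand-in for a layer's weight matrix
sigma_max = np.linalg.svd(W, compute_uv=False)[0]   # spectral norm = largest σ
W_sn = W / sigma_max                                # spectrally normalized weights
print(np.linalg.norm(W_sn, 2))                      # ≈ 1.0, so the layer is 1-Lipschitz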
Common Pitfalls
- SVD is not eigendecomposition. Singular values ≠ eigenvalues in general. For symmetric positive-semidefinite matrices they're equal, but that's a special case. Don't conflate them.
- Sign ambiguity. U and V are not unique — you can flip the sign of any matched pair (column of U, column of V) and get a valid SVD. Libraries pick a convention, but don't rely on the sign (demonstrated in the snippet after this list).
- Computational cost. Full SVD of an m×n matrix costs O(min(m,n) · m · n). For huge matrices (embeddings, activations), use truncated SVD (sklearn.utils.extmath.randomized_svd), which approximates the top-k components much faster; a usage sketch follows this list.
- Confusion about shapes. Always check whether you're getting full SVD or economy SVD — the shape of U and Σ differs, and multiplying back to recover A requires the right shapes.
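The sign ambiguity is easy to demonstrate; the truncated-SVD call is left commented out because it needs scikit-learn installed:
import numpy as np
A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Flip the sign of one matched pair (column i of U, row i of Vt):
# the product is unchanged, so this is an equally valid SVD
i = 0
U2, Vt2 = U.copy(), Vt.copy()
U2[:, i] *= -1
Vt2[i, :] *= -1
print(np.allclose(U2 @ np.diag(s) @ Vt2, A))   # True
# Truncated SVD of a large matrix (top-k components only):
# from sklearn.utils.extmath import randomized_svd
# U_k, s_k, Vt_k = randomized_svd(A, n_components=2)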
Examples
import numpy as np
# --- Basic SVD ---
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])  # 2×3 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# s is a 1D array of singular values, not a matrix
print(f"U shape: {U.shape}") # (2, 2)
print(f"s shape: {s.shape}") # (2,) — [σ₁, σ₂]
print(f"Vt shape: {Vt.shape}") # (2, 3)
# Reconstruct A
A_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, A_reconstructed)) # True
# --- Low-rank approximation ---
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(f"\nOriginal A:\n{A}")
print(f"\nRank-1 approximation:\n{A_k.round(3)}")
# Fraction of variance explained by rank-1
variance_retained = (s[:k]**2).sum() / (s**2).sum()
print(f"Variance retained: {variance_retained:.1%}") # ~99%
# --- Image compression ---
# (Simulated with random matrix)
np.random.seed(42)
image = np.random.randn(100, 100)
U_img, s_img, Vt_img = np.linalg.svd(image, full_matrices=False)
for k in [1, 5, 10, 50]:
    img_k = U_img[:, :k] @ np.diag(s_img[:k]) @ Vt_img[:k, :]
    retained = (s_img[:k]**2).sum() / (s_img**2).sum()
    params = 100*k + k + k*100  # store U[:, :k], the k singular values, and Vt[:k, :]
    print(f"k={k:2d}: {retained:.1%} variance, {params} params (vs {100*100} original)")
# --- PCA via SVD (numerically stable) ---
data = np.random.randn(500, 10) # 500 samples, 10 features
data -= data.mean(axis=0) # center
U_pca, s_pca, Vt_pca = np.linalg.svd(data, full_matrices=False)
# Principal components are rows of Vt_pca
# Projections onto first 2 PCs:
projections = data @ Vt_pca[:2].T # (500, 2)
print(f"\nProjected shape: {projections.shape}")
variance_per_pc = s_pca**2 / (s_pca**2).sum()
print(f"Variance explained by PC1, PC2: {variance_per_pc[:2].round(3)}")