Mnemosyne

Orthogonality

Orthogonal vectors are perfectly independent — they share no information. Orthogonal matrices are the "clean" transformations that rotate without distorting. This concept underlies stable training, attention scoring, and why certain initializations work.

Intuition First

Imagine two arrows pointing in completely different directions — one straight up, one straight right. They share nothing. Knowing how far right you've gone tells you nothing about how far up you've gone. That's orthogonality.

Two vectors are orthogonal when their dot product is zero. Geometrically, they form a right angle. Informationally, they're completely independent — knowing one tells you nothing about the other.

This makes orthogonal vectors the ideal building blocks for a coordinate system.


What's Actually Happening

Dot product = 0 is the test:

u · v = 0  ⟺  u and v are orthogonal

When you have a whole set of vectors that are pairwise orthogonal and each has length 1, you have an orthonormal basis — the cleanest possible coordinate system.

Orthogonal matrices are matrices whose columns form an orthonormal set. They have a remarkable property: QᵀQ = I, meaning the transpose equals the inverse. Multiplying by Q rotates and/or reflects space, but it never distorts it — lengths and angles are preserved.
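
The "free inversion" property is easy to verify numerically. A minimal sketch, assuming an arbitrary rotation angle and right-hand side:

import numpy as np

# Any 2-D rotation matrix is orthogonal.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

b = np.array([2.0, -1.0])

# Solving Qx = b normally needs a linear solve...
x_solve = np.linalg.solve(Q, b)
# ...but for orthogonal Q, the transpose is the inverse.
x_transpose = Q.T @ b
print(np.allclose(x_solve, x_transpose))  # True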


Build the Idea Step-by-Step

Dot product = 0 → right angle
Set of pairwise-orthogonal unit vectors → orthonormal basis
Orthonormal columns in matrix Q → orthogonal matrix
QᵀQ = I → Q⁻¹ = Qᵀ (free inversion)
Multiply by Q → rotate/reflect, never distort

Formal Explanation

Orthogonal vectors: u · v = u₁v₁ + u₂v₂ + ... + uₙvₙ = 0

Orthonormal set: vectors that are pairwise orthogonal and all have unit length (‖v‖ = 1).

Orthogonal matrix Q:

  • Columns are orthonormal: qᵢ · qⱼ = 0 for i ≠ j, qᵢ · qᵢ = 1
  • Key identity: QᵀQ = I
  • Therefore: Q⁻¹ = Qᵀ — inversion is free (just transpose)
  • Preserves lengths: ‖Qx‖ = ‖x‖ for all x
  • Preserves angles: the angle between Qx and Qy equals the angle between x and y

Orthogonal complement: the set of all vectors perpendicular to a given subspace V is called V⊥ (V-perp). If V is the column space of a matrix A, then V⊥ is the null space of Aᵀ.
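
A minimal sketch of this relationship, assuming a small example matrix A of full column rank; the null space of Aᵀ is read off from the SVD:

import numpy as np

# The column space of A is a line in R^3.
A = np.array([[1.0],
              [2.0],
              [2.0]])

# Rows of Vt beyond the rank span the null space of Aᵀ, i.e. V⊥.
_, s, Vt = np.linalg.svd(A.T)
perp = Vt[len(s):]   # here: two vectors spanning a plane in R^3

# Every vector in V⊥ is orthogonal to every column of A.
print(np.allclose(perp @ A, 0))  # True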


Key Properties / Rules

Property        Meaning
u · v = 0       u and v are orthogonal
QᵀQ = I         Q is an orthogonal matrix
Q⁻¹ = Qᵀ        Inversion is a free transpose
‖Qx‖ = ‖x‖      Orthogonal matrices preserve vector length
det(Q) = ±1     Orthogonal matrices only rotate or reflect (no scaling)
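
The determinant row can be checked directly; a quick sketch with one arbitrary rotation and one arbitrary reflection:

import numpy as np

rotation = np.array([[0.0, -1.0],
                     [1.0,  0.0]])    # 90-degree rotation
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])  # flip across the x-axis

print(np.linalg.det(rotation))    #  1.0: pure rotation
print(np.linalg.det(reflection))  # -1.0: reflection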

Why It Matters

Attention scores: the raw query-key dot product q · k measures alignment. When a query and a key are orthogonal (dot product ≈ 0), that key's raw score is zero, so after the softmax it receives comparatively little weight: the token is effectively ignored. High alignment means high attention. Orthogonality literally defines "irrelevance" in transformer attention.
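
A minimal sketch of this effect with toy vectors (not real learned embeddings), assuming plain dot-product scores followed by a softmax:

import numpy as np

query = np.array([1.0, 0.0, 0.0])
keys = np.array([[0.9, 0.1, 0.0],    # well aligned with the query
                 [0.0, 1.0, 0.0],    # orthogonal to the query
                 [-0.8, 0.0, 0.2]])  # points away from the query

scores = keys @ query                            # raw alignment scores
weights = np.exp(scores) / np.exp(scores).sum()  # softmax
print(weights)  # the orthogonal key earns far less weight than the aligned one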

Weight initialization (Orthogonal Init): initializing weight matrices as orthogonal matrices (torch.nn.init.orthogonal_) prevents gradients from exploding or vanishing at the start of training. Because orthogonal transformations preserve vector length, both the forward signal and the backward gradients keep the same scale layer after layer.
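
A sketch of why scale preservation matters, comparing a stack of orthogonal layers against deliberately unscaled Gaussian layers (the depth and width here are arbitrary choices):

import torch
import torch.nn as nn

depth, width = 10, 256
x = torch.randn(width)

x_orth, x_gauss = x.clone(), x.clone()
for _ in range(depth):
    W_orth = nn.init.orthogonal_(torch.empty(width, width))
    W_gauss = torch.randn(width, width)   # unscaled, for contrast
    x_orth = W_orth @ x_orth
    x_gauss = W_gauss @ x_gauss

print(x.norm())        # ~16 (sqrt of 256)
print(x_orth.norm())   # still ~16: orthogonal layers preserve the scale
print(x_gauss.norm())  # grows by ~sqrt(width) per layer: the stack explodes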

QR decomposition: any matrix A can be written as A = QR, where Q is orthogonal and R is upper-triangular. Since Ax = Q(Rx), this splits a transformation into a triangular scaling/shearing step (R) followed by a clean rotation (Q). It is used in numerical solvers for linear systems and least-squares fitting.
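
A quick check with NumPy's built-in QR, assuming an arbitrary 3×2 example matrix:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 1.0]])

Q, R = np.linalg.qr(A)   # "reduced" QR: Q is 3x2 with orthonormal columns
print(np.allclose(Q.T @ Q, np.eye(2)))  # True: columns are orthonormal
print(np.allclose(Q @ R, A))            # True: the factorization reproduces A
print(np.allclose(np.triu(R), R))       # True: R is upper-triangular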

Gram-Schmidt process: a procedure that takes any set of linearly independent vectors and constructs an orthonormal basis from them. It is the conceptual engine behind QR decomposition, though production implementations typically use the more numerically stable Householder reflections.


Common Pitfalls

  • Orthogonal ≠ orthonormal. Orthogonal just means u · v = 0. Orthonormal additionally requires ‖v‖ = 1. An orthogonal matrix's columns are orthonormal, not just orthogonal.
  • Square vs. non-square "orthogonal" matrices. For tall rectangular matrices Q (m×n, m > n), QᵀQ = I still holds (columns orthonormal), but QQᵀ ≠ I. These are called semi-orthogonal. The full orthogonal matrix (invertible, det ±1) is always square. See the sketch after this list.
  • Near-orthogonality matters in practice. As training proceeds, weight matrices can become highly non-orthogonal (correlated columns), which correlates with gradient instability. Techniques like spectral normalization constrain the singular values of weight matrices (capping the largest at 1), keeping them closer to the well-behaved, near-orthogonal regime.
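
A sketch of the square vs. tall distinction from the list above, using a reduced QR factorization of a random matrix to obtain orthonormal columns (the 5×3 shape is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # tall: 5x3, orthonormal columns

print(np.allclose(Q.T @ Q, np.eye(3)))  # True:  QᵀQ = I still holds
print(np.allclose(Q @ Q.T, np.eye(5)))  # False: QQᵀ ≠ I when Q is not square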

Examples

import numpy as np

# --- Checking orthogonality ---
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
print(u @ v)  # 0.0 — orthogonal

# --- Building an orthonormal basis via Gram-Schmidt ---
def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        w = v.copy().astype(float)
        for b in basis:
            w -= (w @ b) * b        # subtract projection onto each existing basis vector
        norm = np.linalg.norm(w)
        if norm > 1e-10:            # skip zero vectors (linearly dependent)
            basis.append(w / norm)
    return np.array(basis)

vecs = np.array([[1., 1.], [1., 0.]])
Q = gram_schmidt(vecs)
print(Q @ Q.T)  # should be identity (2×2)

# --- Orthogonal matrix: preserves length ---
theta = np.pi / 4   # 45-degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([3., 4.])
print(np.linalg.norm(x))      # 5.0
print(np.linalg.norm(R @ x))  # 5.0 — unchanged

# Confirm Rᵀ R = I
print(np.allclose(R.T @ R, np.eye(2)))  # True

# --- PyTorch orthogonal initialization ---
import torch
import torch.nn as nn

layer = nn.Linear(128, 128, bias=False)
nn.init.orthogonal_(layer.weight)
W = layer.weight.detach().numpy()
print(np.allclose(W @ W.T, np.eye(128), atol=1e-5))  # True

Review Questions