SVD (Singular Value Decomposition) is hailed as the "crown jewel" of
linear algebra — it can decompose any matrix, not just square or
symmetric ones. From image compression to Netflix recommendation
algorithms, from face recognition to gene analysis, SVD is everywhere.
Understanding SVD means mastering one of the most powerful mathematical
tools in data science.
Introduction: Why is SVD So Important?
In the previous chapter, we learned the spectral decomposition for symmetric matrices: $A = Q\Lambda Q^T$. But this has a limitation: the matrix must be symmetric.
In the real world, most matrices are not symmetric:
- Image matrices ($m \times n$, usually $m \neq n$)
- User-item rating matrices (recommender systems)
- Document-term matrices (natural language processing)
- Gene expression data matrices (bioinformatics)
Singular Value Decomposition (SVD) can decompose any matrix:

$$A = U \Sigma V^T$$

This is one of the most powerful and useful decompositions in linear algebra.
A Life Analogy: Understanding the Essence of SVD
Imagine you're a photographer trying to understand the "essence" of a
photo. A photo can be viewed as a matrix — each pixel is a number. SVD
tells you:
- Any photo can be decomposed into a sum of several "basic layers"
- These layers are ordered by importance: the first layer captures the main structure, the second captures secondary details...
- Keeping only the first few layers recovers most of the information
It's like a band's recording being decomposed into different instrument tracks: lead vocals, guitar, bass, drums... Some tracks (like lead vocals) are more "important": removing background harmonies has little effect, but removing lead vocals changes the whole song.
Learning Objectives for This Chapter
- Definition and Geometric Meaning of SVD: Understanding $A = U\Sigma V^T$
- Computation Methods: How to find singular values and singular vectors
- Relationship with Eigendecomposition: How SVD generalizes the spectral theorem
- Low-Rank Approximation: The optimal rank-$k$ approximation theorem
- Pseudoinverse: A general method for handling non-invertible matrices
Singular Value Decomposition Theorem: Any $m \times n$ matrix $A$ can be decomposed as:

$$A = U \Sigma V^T$$

Where:
- $U$: $m \times m$ orthogonal matrix (left singular vectors)
- $\Sigma$: $m \times n$ diagonal matrix (singular values), with diagonal elements $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, where $r = \operatorname{rank}(A)$
- $V$: $n \times n$ orthogonal matrix (right singular vectors)

Key Properties:
- Singular values $\sigma_i$ are always non-negative real numbers (unlike eigenvalues, which can be negative or complex)
- Singular values are in descending order: $\sigma_1 \geq \sigma_2 \geq \cdots$
- SVD exists for any matrix (this is why it is more powerful than eigendecomposition)
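As a quick sanity check, the theorem can be verified numerically. A minimal NumPy sketch (the matrix entries here are arbitrary) confirming that a non-square matrix decomposes exactly as stated:

```python
import numpy as np

# A 3x2 matrix: SVD exists even though A is neither square nor symmetric
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 3x3, Vt is 2x2
Sigma = np.zeros(A.shape)              # Sigma must be 3x2, like A
np.fill_diagonal(Sigma, s)

print(np.round(s, 4))                  # singular values, descending, non-negative
print(np.allclose(U @ Sigma @ Vt, A))  # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3))) # True: U is orthogonal
```

Note that NumPy returns $V^T$ (as `Vt`) rather than $V$, and the singular values as a 1-D array rather than a full $\Sigma$ matrix.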
Geometric Meaning: Dissecting a Transformation
SVD tells us: any linear transformation $A = U\Sigma V^T$ can be decomposed into three steps:
1. Rotation/Reflection ($V^T$): Rotate the coordinate system in input space
2. Stretching ($\Sigma$): Stretch (or compress) along the new coordinate axes
3. Rotation/Reflection ($U$): Rotate the coordinate system in output space
Life Analogy: Imagine you're kneading dough.
- Step 1 ($V^T$): Rotate the dough to a "convenient" angle
- Step 2 ($\Sigma$): Use a rolling pin to flatten and stretch the dough; this is the step that actually changes the shape
- Step 3 ($U$): Rotate the flattened dough to its final orientation
Visual Understanding:
- Columns of $V$: An orthonormal basis in input space, the "most natural directions" for the transformation to act on
- Columns of $U$: An orthonormal basis in output space, the "most natural directions" for transformation results
- $\sigma_i$: Stretching factor in the $i$-th direction

A unit circle, after transformation by a matrix $A$, becomes an ellipse. SVD tells us:
- The ellipse's principal axis directions are determined by the columns of $U$
- The ellipse's semi-axis lengths are the singular values $\sigma_i$
- The "special directions" in input space are given by the columns of $V$
Outer Product Form
SVD also has an important equivalent expression, the outer product expansion:

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$

Each $u_i v_i^T$ is a rank-1 matrix. So:

Any matrix is a weighted sum of rank-1 matrices, with weights being the singular values.

This perspective is crucial for low-rank approximation.
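The expansion can be verified directly by summing the rank-1 pieces; a short sketch on an arbitrary example matrix:

```python
import numpy as np

A = np.array([[4.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a weighted sum of rank-1 outer products sigma_i * u_i v_i^T
B = np.zeros_like(A)
for i in range(len(s)):
    B += s[i] * np.outer(U[:, i], Vt[i, :])

print(np.allclose(A, B))  # True: the outer product expansion recovers A
```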
Example
Consider a $2 \times 2$ matrix:

$$A = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$$

SVD decomposition (computation details follow):

$$A = U \Sigma V^T, \quad U = V = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 5 & 0 \\ 0 & 1 \end{pmatrix}$$

Singular values: $\sigma_1 = 5$, $\sigma_2 = 1$.

Geometric meaning: Unit circle → rotate 45° → stretch 5x along the x-axis, 1x along the y-axis → rotate to the final position.
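One matrix consistent with this geometry (singular values 5 and 1, singular directions at ±45°) is `[[3, 2], [2, 3]]`; a quick numerical check, taking that matrix as an assumed instance of the example:

```python
import numpy as np

# Assumed example matrix: singular values 5 and 1,
# singular vectors along the +/-45 degree directions
A = np.array([[3.0, 2.0],
              [2.0, 3.0]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 4))              # [5. 1.]
print(np.round(np.abs(Vt[0]), 4))  # [0.7071 0.7071]: the 45-degree direction
```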
Computing SVD
Relationship with $A^TA$ and $AA^T$

SVD is closely related to the spectral decomposition of symmetric matrices.

Key Observation:
- $A^TA$ and $AA^T$ are both symmetric positive semidefinite matrices
- They have the same nonzero eigenvalues
- Their eigenvectors are the columns of $V$ and $U$ respectively
Detailed Derivation:

From $A = U\Sigma V^T$:

$$A^TA = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$$

since $U^TU = I$. Note that $\Sigma^T\Sigma$ is an $n \times n$ diagonal matrix with diagonal elements $\sigma_1^2, \sigma_2^2, \ldots$. This is exactly the spectral decomposition of $A^TA$.

Conclusion:
- Columns of $V$ are eigenvectors of $A^TA$ (right singular vectors)
- $\sigma_i^2$ are eigenvalues of $A^TA$
- $\sigma_i = \sqrt{\lambda_i}$ are the singular values
Intuitive Understanding: Why use $A^TA$? It can be understood as the effect of "acting twice": first transform with $A$, then transform back with $A^T$. This round trip amplifies $A$'s "main action directions": directions with large singular values get amplified more (by a factor of $\sigma_i^2$), and directions with small singular values get suppressed.
Computation Steps
Given an $m \times n$ matrix $A$ (assume $m \geq n$):

Step 1: Compute eigenvalues and eigenvectors of $A^TA$
- Eigenvalues: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$
- Eigenvectors: $v_1, v_2, \ldots, v_n$ (orthonormalized)
- These $v_i$ form the matrix $V$

Step 2: Compute singular values
- $\sigma_i = \sqrt{\lambda_i}$

Step 3: Compute left singular vectors
- $u_i = \frac{1}{\sigma_i} A v_i$ (for $\sigma_i > 0$)
- If $m > n$ (or some $\sigma_i = 0$), extend the $u_i$ to a complete orthonormal basis

Step 4: Construct $A = U \Sigma V^T$

Why $u_i = A v_i / \sigma_i$? From $A = U\Sigma V^T$ we can derive $AV = U\Sigma$. Taking the $i$-th column: $A v_i = \sigma_i u_i$, so $u_i = A v_i / \sigma_i$ (assuming $\sigma_i > 0$).

Meaning:
- The orthonormal basis $\{v_i\}$ of the row space maps to the orthonormal basis $\{u_i\}$ of the column space
- Each $v_i$ is stretched by the factor $\sigma_i$
- The null space maps to zero
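The four steps can be sketched in NumPy and checked against the library routine (the example matrix is chosen arbitrarily and has full column rank, so no basis extension is needed):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Step 1: eigendecomposition of A^T A (symmetric, so eigh applies),
# sorted so eigenvalues are in descending order
lam, V = np.linalg.eigh(A.T @ A)
idx = np.argsort(lam)[::-1]
lam, V = lam[idx], V[:, idx]

# Step 2: singular values are the square roots of the eigenvalues
s = np.sqrt(lam)

# Step 3: left singular vectors u_i = A v_i / sigma_i (all sigma_i > 0 here)
U = A @ V / s

# Step 4: check the reconstruction, and compare singular values with the library
print(np.allclose((U * s) @ V.T, A))        # True
print(np.allclose(s, np.linalg.svd(A)[1]))  # True
```

Individual columns of `U` and `V` may differ from `np.linalg.svd`'s output by a sign, but the product $U\Sigma V^T$ is the same.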
Low-Rank Approximation
One of SVD's most important applications is optimal low-rank
approximation. This is the theoretical foundation for data
compression, denoising, and recommender systems.
Eckart-Young Theorem
Theorem: Let $A = U\Sigma V^T$ be the SVD of a rank-$r$ matrix $A$. Define the rank-$k$ truncation:

$$A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

where $U_k$ and $V_k$ are the first $k$ columns of $U$ and $V$, and $\Sigma_k$ is the diagonal matrix of the first $k$ singular values.

Then $A_k$ is the closest matrix to $A$ among all rank-$k$ matrices (in Frobenius norm):

$$\|A - A_k\|_F = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$$

Significance:
- Keep the first $k$ singular values/vectors, discard the rest
- This is the optimal rank-$k$ approximation: no other rank-$k$ matrix can be closer to $A$
- The error is determined entirely by the discarded singular values
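The error formula can be checked numerically; a minimal sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k truncation

# Frobenius error of the truncation equals the sqrt of the discarded sigma_i^2
err = np.linalg.norm(A - A_k, 'fro')
print(np.allclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```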
Life Analogy: This is like music compression. MP3
format discards high-frequency components human ears aren't sensitive
to, keeping the most important frequencies. SVD works the same way —
keeping the "highest energy" components.
Energy Perspective
The sum of squared singular values equals the squared Frobenius norm of the matrix:

$$\|A\|_F^2 = \sum_{i=1}^{r} \sigma_i^2$$

Energy captured by the first $k$ singular values:

$$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$

Practical Observation: Most matrices have rapidly decaying singular values. For example, a natural image's first 50 singular values might capture 95% of the energy.
Application: Image Compression
Images can be viewed as matrices (grayscale values). Compress using
low-rank approximation:
```python
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Read image and convert to grayscale
img = np.array(Image.open('photo.jpg').convert('L'), dtype=float)

# SVD decomposition
U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Keep first k singular values
def compress(U, s, Vt, k):
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Different compression levels
for k in [5, 20, 50, 100]:
    compressed = compress(U, s, Vt, k)
    energy = sum(s[:k]**2) / sum(s**2) * 100
    print(f"k={k}: Energy retained {energy:.1f}%")
```
Pseudoinverse
Motivation: When the Inverse Doesn't Exist
For the equation $Ax = b$:
- If $A$ is invertible, the solution is $x = A^{-1}b$
- What if $A$ is not invertible, or not square?
The pseudoinverse (also called Moore-Penrose
inverse) provides a "best alternative."
Definition
For a matrix $A = U\Sigma V^T$, the pseudoinverse $A^+$ is defined as:

$$A^+ = V \Sigma^+ U^T$$

where $\Sigma^+$ is the pseudoinverse of $\Sigma$: the $n \times m$ matrix with diagonal entries $1/\sigma_1, \ldots, 1/\sigma_r$ and zeros elsewhere. That is: take reciprocals of the nonzero diagonal elements, keep zeros as zeros, then transpose (to adjust dimensions).
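A sketch of this definition in NumPy, checked against the library routine `np.linalg.pinv` (the tolerance `1e-10` for deciding which singular values count as nonzero is an arbitrary choice):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],   # second row is a multiple of the first: rank 1
              [0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: reciprocals of the nonzero singular values, zeros stay zero
tol = 1e-10
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
A_plus = Vt.T @ np.diag(s_plus) @ U.T

print(np.allclose(A_plus, np.linalg.pinv(A)))  # True
```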
Properties
The pseudoinverse satisfies the four Moore-Penrose conditions:

$$AA^+A = A, \quad A^+AA^+ = A^+, \quad (AA^+)^T = AA^+, \quad (A^+A)^T = A^+A$$

Least squares solution (minimizing $\|Ax - b\|$):

$$\hat{x} = A^+ b$$

If there are multiple least squares solutions, $A^+b$ is the one with minimum norm.
Geometric Interpretation: $A^+b$ does two things:
1. Projects $b$ onto the column space of $A$ (finds the closest reachable point)
2. Among all $x$ that map to the projected point, chooses the one with minimum norm
Application: Overdetermined and Underdetermined Systems
Overdetermined system ($m > n$, more equations than unknowns):
- Usually no exact solution
- $x = A^+b$ gives the least squares solution

Underdetermined system ($m < n$, more unknowns than equations):
- Usually infinitely many solutions
- $x = A^+b$ gives the minimum norm solution
Practical Applications: Linear regression in machine learning, data fitting, and system identification all make heavy use of the pseudoinverse.
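A short illustration of both cases with `np.linalg.pinv` (the systems here are made up for illustration):

```python
import numpy as np

# Overdetermined: 4 equations, 2 unknowns -> least squares solution
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])
x = np.linalg.pinv(A) @ b
# Matches the dedicated least-squares routine
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True

# Underdetermined: 1 equation, 2 unknowns -> minimum norm solution
C = np.array([[1.0, 1.0]])
d = np.array([2.0])
y = np.linalg.pinv(C) @ d
print(np.round(y, 4))  # [1. 1.]: the smallest-norm point on x1 + x2 = 2
```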
Principal Component Analysis (PCA)
Relationship Between PCA and SVD
Principal Component Analysis is a classic
dimensionality reduction method, and its core is SVD.
Given a data matrix $X \in \mathbb{R}^{n \times p}$ ($n$ samples, $p$ features):

Step 1: Center the data: $\tilde{X} = X - \bar{X}$ (subtract the mean from each column)

Step 2: Perform SVD on $\tilde{X}$: $\tilde{X} = U \Sigma V^T$

Then:
- Principal component directions: Columns of $V$ (right singular vectors)
- Principal component scores: $\tilde{X} V = U \Sigma$
- Variance explained: $\sigma_i^2 / (n-1)$ is the variance of the $i$-th principal component
PCA looks for directions of maximum variance in the data:
- First principal component $v_1$: Direction of maximum variance
- Second principal component $v_2$: Orthogonal to $v_1$, with the next largest variance

Conclusion: These are exactly the right singular vectors from SVD.
Life Analogy: Imagine you have a "cloud" of data
points. PCA finds the "flattest" and "longest" directions of the cloud.
The longest direction (maximum variance) is the first principal
component — it captures the most variation in the data.
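The two steps above can be sketched directly (synthetic data with deliberately unequal spread along three axes):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: standard deviations roughly 3, 1, 0.1 along the three axes
X = rng.standard_normal((100, 3)) * np.array([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                  # Step 1: center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

directions = Vt.T                        # principal component directions
scores = Xc @ directions                 # principal component scores
var_explained = s**2 / (len(X) - 1)      # variance of each component

print(np.allclose(scores, U * s))        # True: XV = U Sigma
print(var_explained / var_explained.sum())  # most variance on the first PC
```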
Application: Data Visualization
High-dimensional data (e.g., 100 dimensions) → Project onto first 2
or 3 principal components → Visualize
Example: Handwritten digit recognition. Each $28 \times 28$ image is a 784-dimensional vector. Using PCA to project to 2D, different digits form different clusters.
SVD in Recommender Systems
Problem Background
Netflix, Amazon, Taobao and other platforms face a core problem:
How to predict user preferences for unrated items?
User-item rating matrix $R$ ($m$ users × $n$ items):
- Known portion: Items users have rated
- Goal: Predict the missing ratings
Matrix Factorization Idea
Core Assumption: Ratings are determined by a small
number of "latent factors."
For example, movie ratings might be determined by latent factors such as:
- Action level
- Romance level
- Humor level
- Depth/artistry
- ...
Each user has different preferences for these factors, and each movie
scores differently on these factors.
SVD Approach:
- Rows of $U_k$: Each user's "latent factor vector"
- Rows of $V_k$: Each item's "latent factor vector"
- Predicted rating ≈ User factors · Item factors
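A toy sketch of the idea, assuming hypothetical ratings and crude mean-filling for the missing entries (real recommender systems factorize only the observed entries, typically with gradient methods, rather than filling and truncating like this):

```python
import numpy as np

# Hypothetical user-item rating matrix; 0 marks a missing rating
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 4.0, 4.0]])

# Crude stand-in: fill missing entries with the mean of the known ratings
mask = R > 0
filled = np.where(mask, R, R[mask].mean())

# Rank-2 truncated SVD: rows of U_k are user factors, rows of V_k item factors
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # predicted ratings, all entries

print(np.round(pred, 2))
```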
Netflix Prize
Netflix held a million-dollar competition from 2006 to 2009:
- Task: Predict user movie ratings
- Data: 100 million rating records
- Goal: Be 10% more accurate than Netflix's own algorithm
The core of the winning approach: matrix factorization (SVD variants), handling sparse data and implicit feedback.
Other Applications
Latent Semantic Analysis (LSA)
In natural language processing, the document-term matrix $X$:
- Rows: Documents
- Columns: Vocabulary terms
- Elements: Term frequency (or TF-IDF weights)
LSA: Perform SVD on $X$ and keep the first $k$ singular vectors:
- Captures "latent semantics" (topics)
- Dimensionality reduction: From tens of thousands of dimensions to a few hundred
- Applications: Document similarity, information retrieval, synonym discovery
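A toy LSA sketch on a made-up 4×5 count matrix (two documents per "topic"; the counts are invented for illustration):

```python
import numpy as np

# Hypothetical document-term count matrix: 4 documents x 5 terms.
# Docs 0-1 share terms 0-2; docs 2-3 share terms 2-4 (two rough "topics").
X = np.array([[2.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]  # each document as a point in k-dim topic space

# Documents about the same topic land close together in this space
print(np.round(doc_topics, 2))
```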
Signal Denoising
Model: Observed signal = True signal + Noise
If the true signal is "low-rank" (structured) while the noise is "full-rank" (random):
- Decompose the observed signal with SVD
- Keep the large singular values (signal)
- Discard the small singular values (noise)
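A minimal sketch of this recipe on synthetic data (the rank, sizes, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Low-rank "true signal": a rank-2 50x40 matrix
signal = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
noisy = signal + 0.05 * rng.standard_normal((50, 40))  # add small noise

# Keep only the dominant singular values to denoise
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 2
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

before = np.linalg.norm(noisy - signal)
after = np.linalg.norm(denoised - signal)
print(after < before)  # True: truncation removed most of the noise
```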
Face Recognition (Eigenfaces)
Perform PCA on a collection of face images:
- Principal components = "eigenfaces"
- Each face ≈ a linear combination of eigenfaces
- Recognition: Compare distances between coefficient vectors
Summary and Outlook
Key Takeaways
SVD Definition: $A = U \Sigma V^T$
- $U$: Left singular vectors (output space orthonormal basis)
- $\Sigma$: Singular values (stretching factors, always non-negative)
- $V$: Right singular vectors (input space orthonormal basis)
Geometric Meaning: Any linear transformation =
Rotation + Stretching + Rotation
Computation Method:
- $V$: Eigenvectors of $A^TA$
- $U$: Eigenvectors of $AA^T$
- $\sigma_i = \sqrt{\lambda_i}$

Exercises

Text Analysis: Construct a small document-term matrix (5 documents, 10 words):
- Compute its SVD
- Interpret the "semantics" of the first two singular vectors
- Calculate similarities between documents
Challenge Questions
SVD and 2-Norm: Prove that the matrix 2-norm (operator norm) equals the largest singular value: $\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \sigma_1$.
Singular Values of Random Matrices: Generate a large square random matrix (elements i.i.d., e.g. standard normal) and plot the singular value distribution. Compare with the theoretical prediction (Marchenko-Pastur distribution).
Error Bound for Truncated SVD: Prove $\|A - A_k\|_2 = \sigma_{k+1}$ (error in the 2-norm sense).
Condition Number and SVD: The matrix condition number is defined as $\kappa(A) = \sigma_1 / \sigma_n$. Explain why a large condition number means the matrix is "ill-conditioned" and solving $Ax = b$ is sensitive to errors.