SVD (Singular Value Decomposition) is hailed as the "crown jewel" of
linear algebra — it can decompose any matrix, not just square or
symmetric ones. From image compression to Netflix recommendation
algorithms, from face recognition to gene analysis, SVD is everywhere.
Understanding SVD means mastering one of the most powerful mathematical
tools in data science.
Introduction: Why is SVD So Important?
In the previous chapter, we learned the spectral decomposition for symmetric matrices: $A = Q\Lambda Q^T$. But this has a limitation: the matrix must be symmetric.
In the real world, most matrices are not symmetric:
- Image matrices ($m \times n$, usually $m \neq n$)
- User-item rating matrices (recommender systems)
- Document-term matrices (natural language processing)
- Gene expression data matrices (bioinformatics)
Singular Value Decomposition (SVD) can decompose any matrix:

$$A = U \Sigma V^T$$

This is one of the most powerful and useful decompositions in linear algebra.
A Life Analogy: Understanding the Essence of SVD
Imagine you're a photographer trying to understand the "essence" of a
photo. A photo can be viewed as a matrix — each pixel is a number. SVD
tells you:
- Any photo can be decomposed into a sum of several "basic layers"
- These layers are ordered by importance: the first layer captures the main structure, the second captures secondary details...
- Keeping only the first few layers recovers most of the information
It's like a band's recording being decomposed into different instrument tracks: lead vocals, guitar, bass, drums... Some tracks (like lead vocals) are more "important": removing background harmonies has little effect, but removing lead vocals changes the whole song.
Learning Objectives for This Chapter
- Definition and Geometric Meaning of SVD: Understanding $A = U\Sigma V^T$
- Computation Methods: How to find singular values and singular vectors
- Relationship with Eigendecomposition: How SVD generalizes the spectral theorem
- Low-Rank Approximation: The optimal rank-$k$ approximation theorem
- Pseudoinverse: A general method for handling non-invertible matrices
Singular Value Decomposition Theorem: Any $m \times n$ matrix $A$ can be decomposed as:

$$A = U \Sigma V^T$$

Where:
- $U$: $m \times m$ orthogonal matrix (left singular vectors)
- $\Sigma$: $m \times n$ diagonal matrix (singular values), with diagonal elements $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, where $r = \operatorname{rank}(A)$
- $V$: $n \times n$ orthogonal matrix (right singular vectors)

Key Properties:
- Singular values $\sigma_i$ are always non-negative real numbers (unlike eigenvalues, which can be negative or complex)
- Singular values are in descending order: $\sigma_1 \geq \sigma_2 \geq \cdots$
- SVD exists for any matrix (this is why it is more powerful than eigendecomposition)
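As a quick sanity check, the theorem can be verified numerically. A minimal NumPy sketch (the matrix entries here are arbitrary) confirming that a non-square matrix decomposes exactly as stated:

```python
import numpy as np

# A 3x2 matrix: SVD exists even though A is neither square nor symmetric
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 3x3, Vt is 2x2
Sigma = np.zeros(A.shape)              # Sigma must be 3x2, like A
np.fill_diagonal(Sigma, s)

print(np.round(s, 4))                  # singular values, descending, non-negative
print(np.allclose(U @ Sigma @ Vt, A))  # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3))) # True: U is orthogonal
```

Note that NumPy returns $V^T$ (as `Vt`) rather than $V$, and the singular values as a 1-D array rather than a full $\Sigma$ matrix.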
Geometric Meaning: Dissecting a Transformation
SVD tells us: any linear transformation $A = U\Sigma V^T$ can be decomposed into three steps:
1. Rotation/Reflection ($V^T$): Rotate the coordinate system in input space
2. Stretching ($\Sigma$): Stretch (or compress) along the new coordinate axes
3. Rotation/Reflection ($U$): Rotate the coordinate system in output space
Life Analogy: Imagine you're kneading dough.
- Step 1 ($V^T$): Rotate the dough to a "convenient" angle
- Step 2 ($\Sigma$): Use a rolling pin to flatten and stretch the dough; this is the step that actually changes the shape
- Step 3 ($U$): Rotate the flattened dough to its final orientation
Visual Understanding:
- Columns of $V$: An orthonormal basis in input space, the "most natural directions" for the transformation to act on
- Columns of $U$: An orthonormal basis in output space, the "most natural directions" for transformation results
- $\sigma_i$: Stretching factor in the $i$-th direction

A unit circle, after transformation by a matrix $A$, becomes an ellipse. SVD tells us:
- The ellipse's principal axis directions are determined by the columns of $U$
- The ellipse's semi-axis lengths are the singular values $\sigma_i$
- The "special directions" in input space are given by the columns of $V$
Outer Product Form
SVD also has an important equivalent expression, the outer product expansion:

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$

Each $u_i v_i^T$ is a rank-1 matrix. So:

Any matrix is a weighted sum of rank-1 matrices, with weights being the singular values.

This perspective is crucial for low-rank approximation.
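The expansion can be verified directly by summing the rank-1 pieces; a short sketch on an arbitrary example matrix:

```python
import numpy as np

A = np.array([[4.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a weighted sum of rank-1 outer products sigma_i * u_i v_i^T
B = np.zeros_like(A)
for i in range(len(s)):
    B += s[i] * np.outer(U[:, i], Vt[i, :])

print(np.allclose(A, B))  # True: the outer product expansion recovers A
```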
Example
Consider a $2 \times 2$ matrix:

$$A = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$$

SVD decomposition (computation details follow):

$$A = U \Sigma V^T, \quad U = V = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 5 & 0 \\ 0 & 1 \end{pmatrix}$$

Singular values: $\sigma_1 = 5$, $\sigma_2 = 1$.

Geometric meaning: Unit circle → rotate 45° → stretch 5x along the x-axis, 1x along the y-axis → rotate to the final position.
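One matrix consistent with this geometry (singular values 5 and 1, singular directions at ±45°) is `[[3, 2], [2, 3]]`; a quick numerical check, taking that matrix as an assumed instance of the example:

```python
import numpy as np

# Assumed example matrix: singular values 5 and 1,
# singular vectors along the +/-45 degree directions
A = np.array([[3.0, 2.0],
              [2.0, 3.0]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 4))              # [5. 1.]
print(np.round(np.abs(Vt[0]), 4))  # [0.7071 0.7071]: the 45-degree direction
```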
Computing SVD
Relationship with $A^TA$ and $AA^T$

SVD is closely related to the spectral decomposition of symmetric matrices.

Key Observation:
- $A^TA$ and $AA^T$ are both symmetric positive semidefinite matrices
- They have the same nonzero eigenvalues
- Their eigenvectors are the columns of $V$ and $U$ respectively
Detailed Derivation:

From $A = U\Sigma V^T$:

$$A^TA = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$$

since $U^TU = I$. Note that $\Sigma^T\Sigma$ is an $n \times n$ diagonal matrix with diagonal elements $\sigma_1^2, \sigma_2^2, \ldots$. This is exactly the spectral decomposition of $A^TA$.

Conclusion:
- Columns of $V$ are eigenvectors of $A^TA$ (right singular vectors)
- $\sigma_i^2$ are eigenvalues of $A^TA$
- $\sigma_i = \sqrt{\lambda_i}$ are the singular values
Intuitive Understanding: Why use $A^TA$? It can be understood as the effect of "acting twice": first transform with $A$, then transform back with $A^T$. This round trip amplifies $A$'s "main action directions": directions with large singular values get amplified more (by a factor of $\sigma_i^2$), and directions with small singular values get suppressed.
Computation Steps
Given an $m \times n$ matrix $A$ (assume $m \geq n$):

Step 1: Compute eigenvalues and eigenvectors of $A^TA$
- Eigenvalues: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$
- Eigenvectors: $v_1, v_2, \ldots, v_n$ (orthonormalized)
- These $v_i$ form the matrix $V$

Step 2: Compute singular values
- $\sigma_i = \sqrt{\lambda_i}$

Step 3: Compute left singular vectors
- $u_i = \frac{1}{\sigma_i} A v_i$ (for $\sigma_i > 0$)
- If $m > n$ (or some $\sigma_i = 0$), extend the $u_i$ to a complete orthonormal basis

Step 4: Construct $A = U \Sigma V^T$

Why $u_i = A v_i / \sigma_i$? From $A = U\Sigma V^T$ we can derive $AV = U\Sigma$. Taking the $i$-th column: $A v_i = \sigma_i u_i$, so $u_i = A v_i / \sigma_i$ (assuming $\sigma_i > 0$).

Meaning:
- The orthonormal basis $\{v_i\}$ of the row space maps to the orthonormal basis $\{u_i\}$ of the column space
- Each $v_i$ is stretched by the factor $\sigma_i$
- The null space maps to zero
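The four steps can be sketched in NumPy and checked against the library routine (the example matrix is chosen arbitrarily and has full column rank, so no basis extension is needed):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Step 1: eigendecomposition of A^T A (symmetric, so eigh applies),
# sorted so eigenvalues are in descending order
lam, V = np.linalg.eigh(A.T @ A)
idx = np.argsort(lam)[::-1]
lam, V = lam[idx], V[:, idx]

# Step 2: singular values are the square roots of the eigenvalues
s = np.sqrt(lam)

# Step 3: left singular vectors u_i = A v_i / sigma_i (all sigma_i > 0 here)
U = A @ V / s

# Step 4: check the reconstruction, and compare singular values with the library
print(np.allclose((U * s) @ V.T, A))        # True
print(np.allclose(s, np.linalg.svd(A)[1]))  # True
```

Individual columns of `U` and `V` may differ from `np.linalg.svd`'s output by a sign, but the product $U\Sigma V^T$ is the same.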
Low-Rank Approximation
One of SVD's most important applications is optimal low-rank
approximation. This is the theoretical foundation for data
compression, denoising, and recommender systems.
Eckart-Young Theorem
Theorem: Let $A = U\Sigma V^T$ be the SVD of a rank-$r$ matrix $A$. Define the rank-$k$ truncation:

$$A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

where $U_k$ and $V_k$ are the first $k$ columns of $U$ and $V$, and $\Sigma_k$ is the diagonal matrix of the first $k$ singular values.

Then $A_k$ is the closest matrix to $A$ among all rank-$k$ matrices (in Frobenius norm):

$$\|A - A_k\|_F = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$$

Significance:
- Keep the first $k$ singular values/vectors, discard the rest
- This is the optimal rank-$k$ approximation: no other rank-$k$ matrix can be closer to $A$
- The error is determined entirely by the discarded singular values
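The error formula can be checked numerically; a minimal sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k truncation

# Frobenius error of the truncation equals the sqrt of the discarded sigma_i^2
err = np.linalg.norm(A - A_k, 'fro')
print(np.allclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```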
Life Analogy: This is like music compression. MP3
format discards high-frequency components human ears aren't sensitive
to, keeping the most important frequencies. SVD works the same way —
keeping the "highest energy" components.
Energy Perspective
The sum of squared singular values equals the squared Frobenius norm of the matrix:

$$\|A\|_F^2 = \sum_{i=1}^{r} \sigma_i^2$$

Energy captured by the first $k$ singular values:

$$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$

Practical Observation: Most matrices have rapidly decaying singular values. For example, a natural image's first 50 singular values might capture 95% of the energy.
Application: Image Compression
Images can be viewed as matrices (grayscale values). Compress using
low-rank approximation:
```python
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Read image and convert to grayscale
img = np.array(Image.open('photo.jpg').convert('L'), dtype=float)

# SVD decomposition
U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Keep first k singular values
def compress(U, s, Vt, k):
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Different compression levels
for k in [5, 20, 50, 100]:
    compressed = compress(U, s, Vt, k)
    energy = sum(s[:k]**2) / sum(s**2) * 100
    print(f"k={k}: Energy retained {energy:.1f}%")
```
Pseudoinverse
Motivation: When the Inverse Doesn't Exist
For the equation $Ax = b$:
- If $A$ is invertible, the solution is $x = A^{-1}b$
- What if $A$ is not invertible, or not square?
The pseudoinverse (also called Moore-Penrose
inverse) provides a "best alternative."
Definition
For a matrix $A = U\Sigma V^T$, the pseudoinverse $A^+$ is defined as:

$$A^+ = V \Sigma^+ U^T$$

where $\Sigma^+$ is the pseudoinverse of $\Sigma$: the $n \times m$ matrix with diagonal entries $1/\sigma_1, \ldots, 1/\sigma_r$ and zeros elsewhere. That is: take reciprocals of the nonzero diagonal elements, keep zeros as zeros, then transpose (to adjust dimensions).
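A sketch of this definition in NumPy, checked against the library routine `np.linalg.pinv` (the tolerance `1e-10` for deciding which singular values count as nonzero is an arbitrary choice):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],   # second row is a multiple of the first: rank 1
              [0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: reciprocals of the nonzero singular values, zeros stay zero
tol = 1e-10
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
A_plus = Vt.T @ np.diag(s_plus) @ U.T

print(np.allclose(A_plus, np.linalg.pinv(A)))  # True
```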
Properties
The pseudoinverse satisfies the four Moore-Penrose conditions:

$$AA^+A = A, \quad A^+AA^+ = A^+, \quad (AA^+)^T = AA^+, \quad (A^+A)^T = A^+A$$

Least squares solution (minimizing $\|Ax - b\|$):

$$\hat{x} = A^+ b$$

If there are multiple least squares solutions, $A^+b$ is the one with minimum norm.
Geometric Interpretation: $A^+b$ does two things:
1. Projects $b$ onto the column space of $A$ (finds the closest reachable point)
2. Among all $x$ that map to the projected point, chooses the one with minimum norm
Application: Overdetermined and Underdetermined Systems
Overdetermined system ($m > n$, more equations than unknowns):
- Usually no exact solution
- $x = A^+b$ gives the least squares solution

Underdetermined system ($m < n$, more unknowns than equations):
- Usually infinitely many solutions
- $x = A^+b$ gives the minimum norm solution
Practical Applications: Linear regression in machine learning, data fitting, and system identification all make heavy use of the pseudoinverse.
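A short illustration of both cases with `np.linalg.pinv` (the systems here are made up for illustration):

```python
import numpy as np

# Overdetermined: 4 equations, 2 unknowns -> least squares solution
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])
x = np.linalg.pinv(A) @ b
# Matches the dedicated least-squares routine
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True

# Underdetermined: 1 equation, 2 unknowns -> minimum norm solution
C = np.array([[1.0, 1.0]])
d = np.array([2.0])
y = np.linalg.pinv(C) @ d
print(np.round(y, 4))  # [1. 1.]: the smallest-norm point on x1 + x2 = 2
```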
Principal Component Analysis (PCA)
Relationship Between PCA and SVD
Principal Component Analysis is a classic
dimensionality reduction method, and its core is SVD.
Given a data matrix $X \in \mathbb{R}^{n \times p}$ ($n$ samples, $p$ features):

Step 1: Center the data: $\tilde{X} = X - \bar{X}$ (subtract the mean from each column)

Step 2: Perform SVD on $\tilde{X}$: $\tilde{X} = U \Sigma V^T$

Then:
- Principal component directions: Columns of $V$ (right singular vectors)
- Principal component scores: $\tilde{X} V = U \Sigma$
- Variance explained: $\sigma_i^2 / (n-1)$ is the variance of the $i$-th principal component
PCA looks for directions of maximum variance in the data:
- First principal component $v_1$: Direction of maximum variance
- Second principal component $v_2$: Orthogonal to $v_1$, with the next largest variance

Conclusion: These are exactly the right singular vectors from SVD.
Life Analogy: Imagine you have a "cloud" of data
points. PCA finds the "flattest" and "longest" directions of the cloud.
The longest direction (maximum variance) is the first principal
component — it captures the most variation in the data.
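The two steps above can be sketched directly (synthetic data with deliberately unequal spread along three axes):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: standard deviations roughly 3, 1, 0.1 along the three axes
X = rng.standard_normal((100, 3)) * np.array([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                  # Step 1: center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

directions = Vt.T                        # principal component directions
scores = Xc @ directions                 # principal component scores
var_explained = s**2 / (len(X) - 1)      # variance of each component

print(np.allclose(scores, U * s))        # True: XV = U Sigma
print(var_explained / var_explained.sum())  # most variance on the first PC
```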
Application: Data Visualization
High-dimensional data (e.g., 100 dimensions) → Project onto first 2
or 3 principal components → Visualize
Example: Handwritten digit recognition. Each $28 \times 28$ image is a 784-dimensional vector. Using PCA to project to 2D, different digits form different clusters.
SVD in Recommender Systems
Problem Background
Netflix, Amazon, Taobao and other platforms face a core problem:
How to predict user preferences for unrated items?
User-item rating matrix $R$ ($m$ users × $n$ items):
- Known portion: Items users have rated
- Goal: Predict the missing ratings
Matrix Factorization Idea
Core Assumption: Ratings are determined by a small
number of "latent factors."
For example, movie ratings might be determined by latent factors such as:
- Action level
- Romance level
- Humor level
- Depth/artistry
- ...
Each user has different preferences for these factors, and each movie
scores differently on these factors.
SVD Approach:
- Rows of $U_k$: Each user's "latent factor vector"
- Rows of $V_k$: Each item's "latent factor vector"
- Predicted rating ≈ User factors · Item factors
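A toy sketch of the idea, assuming hypothetical ratings and crude mean-filling for the missing entries (real recommender systems factorize only the observed entries, typically with gradient methods, rather than filling and truncating like this):

```python
import numpy as np

# Hypothetical user-item rating matrix; 0 marks a missing rating
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 4.0, 4.0]])

# Crude stand-in: fill missing entries with the mean of the known ratings
mask = R > 0
filled = np.where(mask, R, R[mask].mean())

# Rank-2 truncated SVD: rows of U_k are user factors, rows of V_k item factors
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # predicted ratings, all entries

print(np.round(pred, 2))
```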
Netflix Prize
Netflix held a million-dollar competition from 2006 to 2009:
- Task: Predict user movie ratings
- Data: 100 million rating records
- Goal: Be 10% more accurate than Netflix's own algorithm
The core of the winning approach: matrix factorization (SVD variants), handling sparse data and implicit feedback.
Other Applications
Latent Semantic Analysis (LSA)
In natural language processing, the document-term matrix $X$:
- Rows: Documents
- Columns: Vocabulary terms
- Elements: Term frequency (or TF-IDF weights)
LSA: Perform SVD on $X$ and keep the first $k$ singular vectors:
- Captures "latent semantics" (topics)
- Dimensionality reduction: From tens of thousands of dimensions to a few hundred
- Applications: Document similarity, information retrieval, synonym discovery
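A toy LSA sketch on a made-up 4×5 count matrix (two documents per "topic"; the counts are invented for illustration):

```python
import numpy as np

# Hypothetical document-term count matrix: 4 documents x 5 terms.
# Docs 0-1 share terms 0-2; docs 2-3 share terms 2-4 (two rough "topics").
X = np.array([[2.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]  # each document as a point in k-dim topic space

# Documents about the same topic land close together in this space
print(np.round(doc_topics, 2))
```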
Signal Denoising
Model: Observed signal = True signal + Noise
If the true signal is "low-rank" (structured) while the noise is "full-rank" (random):
- Decompose the observed signal with SVD
- Keep the large singular values (signal)
- Discard the small singular values (noise)
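A minimal sketch of this recipe on synthetic data (the rank, sizes, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Low-rank "true signal": a rank-2 50x40 matrix
signal = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
noisy = signal + 0.05 * rng.standard_normal((50, 40))  # add small noise

# Keep only the dominant singular values to denoise
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 2
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

before = np.linalg.norm(noisy - signal)
after = np.linalg.norm(denoised - signal)
print(after < before)  # True: truncation removed most of the noise
```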
Face Recognition (Eigenfaces)
Perform PCA on a collection of face images:
- Principal components = "eigenfaces"
- Each face ≈ a linear combination of eigenfaces
- Recognition: Compare distances between coefficient vectors
Summary and Outlook
Key Takeaways
SVD Definition: $A = U \Sigma V^T$
- $U$: Left singular vectors (output space orthonormal basis)
- $\Sigma$: Singular values (stretching factors, always non-negative)
- $V$: Right singular vectors (input space orthonormal basis)
Geometric Meaning: Any linear transformation =
Rotation + Stretching + Rotation
Computation Method:
- $V$: Eigenvectors of $A^TA$
- $U$: Eigenvectors of $AA^T$
- $\sigma_i = \sqrt{\lambda_i}$

Exercises

Text Analysis: Construct a small document-term matrix (5 documents, 10 words):
- Compute its SVD
- Interpret the "semantics" of the first two singular vectors
- Calculate similarities between documents
Challenge Questions
SVD and 2-Norm: Prove that the matrix 2-norm (operator norm) equals the largest singular value: $\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \sigma_1$.
Singular Values of Random Matrices: Generate a large square random matrix (elements i.i.d., e.g. standard normal) and plot the singular value distribution. Compare with the theoretical prediction (Marchenko-Pastur distribution).
Error Bound for Truncated SVD: Prove $\|A - A_k\|_2 = \sigma_{k+1}$ (error in the 2-norm sense).
Condition Number and SVD: The matrix condition number is defined as $\kappa(A) = \sigma_1 / \sigma_n$. Explain why a large condition number means the matrix is "ill-conditioned" and solving $Ax = b$ is sensitive to errors.