Essence of Linear Algebra (12): Sparse Matrices and Compressed Sensing
Chen Kai

Sparsity is a ubiquitous feature in nature and data. Compressed sensing exploits this property to achieve the "less is more" miracle.

Introduction: From JPEG to MRI — Sparsity is Everywhere

Have you ever wondered why a 10MB raw photo can be compressed into a few hundred KB JPEG file with almost no visible difference? Why can MRI scan time be reduced from 30 minutes to 5 minutes?

The answer is sparsity: signals, when represented in an appropriate basis, have most coefficients equal to zero or close to zero.

Sparsity in Daily Life:

  • Images: Natural images are sparse in the wavelet basis (most wavelet coefficients are small)
  • Audio: Speech signals are sparse in the Fourier basis (only a few frequencies carry energy)
  • Text: An article uses only a small fraction of the vocabulary
  • Social Networks: A person knows only a tiny fraction of all people

The Revolutionary Idea of Compressed Sensing: Since signals are sparse, can we recover them using fewer measurements than traditional methods require?

The answer is yes! This is the core content of this chapter.

Learning Objectives

  1. Sparse Representation: Understanding signal sparse decomposition in dictionaries
  2. L0 and L1 Norms: Why does L1 promote sparsity?
  3. Compressed Sensing Theory: RIP conditions, measurement matrix design
  4. Algorithm Implementation: LASSO, OMP, ISTA
  5. Applications: MRI imaging, single-pixel cameras, image compression

Mathematical Foundations of Sparse Representation

What is Sparsity?

Definition: A vector $x \in \mathbb{R}^n$ is k-sparse if it has at most $k$ nonzero elements.

Mathematical Notation: $\|x\|_0 := \#\{i : x_i \neq 0\} \le k$. Here $\|x\|_0$ is called the "L0 norm" (though strictly speaking it's not a true norm, as it doesn't satisfy homogeneity).

Examples:

  • $x = (3, 0, 0, -1, 0, 2)$ is 3-sparse
  • $x = (0, 7, 0, 0, -4)$ has exactly 2 nonzeros, so it is strictly 2-sparse

Approximately Sparse: In practice, signals are usually approximately sparse: most coefficients are small but not exactly zero.

Analogy: Imagine a class's grade distribution. If only a few students got high scores and most got zero or low scores, this grade vector is sparse.
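These definitions are easy to check numerically; a minimal sketch (the example vector is arbitrary):

```python
import numpy as np

x = np.array([0.0, 5.0, 0.0, -2.0, 0.0, 1.0])

l0 = np.count_nonzero(x)        # "L0 norm": number of nonzero entries
l1 = np.sum(np.abs(x))          # L1 norm: sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))    # L2 norm: Euclidean length

print(l0, l1, l2)  # l0 = 3, so x is 3-sparse
```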

Sparse Representation and Dictionaries

Most signals are not sparse themselves, but may be sparse in some transform domain or dictionary.

Mathematical Expression: Given a signal $y \in \mathbb{R}^n$ and a dictionary $D \in \mathbb{R}^{n \times p}$, find sparse coefficients $x \in \mathbb{R}^p$ such that

$$y \approx Dx, \qquad \|x\|_0 \ll p$$

The columns $d_1, \dots, d_p$ of the dictionary are called atoms.

Common Dictionaries:

| Dictionary Type | Suitable Signals | Characteristics |
| --- | --- | --- |
| Fourier basis | Periodic signals, audio | Frequency-domain sparse |
| Wavelet basis | Images, non-stationary signals | Time-frequency localization |
| DCT basis | Images (JPEG) | Similar to Fourier, real-valued |
| Learned dictionary | Domain-specific data | Data-driven |

Overcomplete Dictionary: When $p > n$, the dictionary is overcomplete: there are more atoms to choose from, offering more flexible representation, but also introducing uniqueness issues.

The Sparse Coding Problem

Problem: Given a signal $y$ and dictionary $D$, find the sparsest representation:

$$\min_x \|x\|_0 \quad \text{s.t.} \quad y = Dx$$

Difficulty: This is an NP-hard problem!

Why NP-hard?

  • We must search over all possible combinations of nonzero element positions
  • For an $n$-dimensional vector with sparsity $k$, there are $\binom{n}{k}$ possible sparsity patterns
  • No known polynomial-time algorithm exists

Analogy: It's like finding a book in a library when you only know the general content, and the book could be in any location. You'd need to check every possible combination of locations, which is infeasible when there are many books.

Uniqueness of Sparse Representation

Question: Even if we find a sparse solution, is it unique?

Theorem (Uniqueness Condition): If $y = Dx$ and $\|x\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, then $x$ is the unique sparsest representation.

Here $\mu(D)$ is the mutual coherence of the dictionary:

$$\mu(D) = \max_{i \neq j} \frac{|\langle d_i, d_j \rangle|}{\|d_i\|_2 \, \|d_j\|_2}$$

Intuition: Mutual coherence measures the maximum similarity between atoms in the dictionary. If atoms are very different from each other (low mutual coherence), sparse solutions are more likely to be unique.
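Mutual coherence is cheap to compute directly from the definition; a small sketch (the toy dictionary below is arbitrary):

```python
import numpy as np

def mutual_coherence(D):
    """Maximum absolute inner product between distinct normalized columns (atoms)."""
    Dn = D / np.linalg.norm(D, axis=0)   # normalize each atom to unit length
    G = np.abs(Dn.T @ Dn)                # absolute correlations between atoms
    np.fill_diagonal(G, 0)               # ignore self-correlation
    return G.max()

# An orthonormal basis has coherence 0; appending an extra atom raises it
I = np.eye(4)
print(mutual_coherence(I))                    # 0.0
D = np.hstack([I, np.ones((4, 1)) / 2.0])     # overcomplete dictionary, 5 atoms
print(mutual_coherence(D))                    # 0.5
```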

L1 Regularization: Convex Relaxation of Sparsity

From L0 to L1

Since L0 optimization is NP-hard, we need a computable alternative.

Key Observation: The L1 norm is the best convex relaxation of the L0 norm.

Convex Relaxation Problem:

$$\min_x \|x\|_1 \quad \text{s.t.} \quad y = Dx$$

This can be recast as a linear program and solved in polynomial time!
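The linear-programming reformulation splits $x = x^+ - x^-$ with $x^\pm \ge 0$ and minimizes $\sum_i (x^+_i + x^-_i)$. A sketch using `scipy.optimize.linprog` (the problem sizes and support below are arbitrary):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """min ||x||_1 s.t. Ax = y, as a linear program over (x_pos, x_neg) >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                    # objective: sum(x_pos) + sum(x_neg)
    A_eq = np.hstack([A, -A])             # A @ (x_pos - x_neg) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 40)) / np.sqrt(20)   # underdetermined system
x_true = np.zeros(40)
x_true[[3, 17, 29]] = [2.0, -1.5, 1.0]            # 3-sparse ground truth

x_hat = basis_pursuit(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))  # near zero when L1 recovery succeeds
```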

Why Does L1 Promote Sparsity? — Geometric Explanation

This is one of the most important intuitions in this chapter.

Shape of the L1 Ball: In 2D, the L1 unit ball $\{x : |x_1| + |x_2| \le 1\}$ is a diamond (a square rotated 45 degrees).

Shape of the L2 Ball: In 2D, the L2 unit ball $\{x : x_1^2 + x_2^2 \le 1\}$ is a circle.

Key Difference: The L1 ball has "sharp corners" on the coordinate axes, while the L2 ball is smooth.

Geometric Intuition:

  1. The constraint $y = Dx$ defines an affine subspace (a line, plane, etc.)
  2. Minimizing $\|x\|_1$ is equivalent to finding where this subspace first touches an expanding L1 ball
  3. The L1 ball's corners lie on the coordinate axes, so the first contact is likely at a corner (a sparse solution)
  4. The L2 ball is smooth, so the contact point is generally not on a coordinate axis (a non-sparse solution)

Analogy: Imagine you're in a room and need to touch a wall (constraint). If you're holding a diamond-shaped balloon (L1 ball) that's slowly inflating, the balloon will most likely first touch the wall at a corner. But if you're holding a round balloon (L2 ball), it could touch the wall anywhere.

Comparison of L1 and L2 Regularization

| Property | L1 Regularization | L2 Regularization |
| --- | --- | --- |
| Unit ball shape | Diamond (has corners) | Sphere (smooth) |
| Solution characteristics | Sparse (many exact zeros) | Small but nonzero |
| Geometric intuition | Tends to lie on coordinate axes | Tends to shrink uniformly |
| Optimization | Non-smooth, needs special algorithms | Smooth, gradient descent works |
| Applications | Feature selection, compressed sensing | Overfitting prevention, ridge regression |
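The sparsity contrast can be observed directly with scikit-learn; a small sketch (the synthetic data and regularization strengths are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 80, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -3.0, 2.0]                 # only 3 informative features
y = X @ beta + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso exact zeros:", np.sum(lasso.coef_ == 0))   # many
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))   # typically none
```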

Elastic Net: Combining L1 and L2

Elastic Net combines the advantages of both:

$$\min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$

Advantages:

  • Inherits L1's sparsity
  • The L2 component provides stability, especially when features are highly correlated
  • Can select groups of features (grouping effect)

LASSO Regression

Definition of LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) is the most important sparsity method in statistics:

$$\hat\beta = \arg\min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

Equivalent Form (constrained form):

$$\min_\beta \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t$$

The LASSO Solution

Subgradient Condition: The optimal solution $\hat\beta$ of LASSO satisfies

$$X^T(y - X\hat\beta) = \lambda s,$$

where $s$ is a subgradient of $\|\beta\|_1$ at $\hat\beta$:

$$s_j = \operatorname{sign}(\hat\beta_j) \ \text{if } \hat\beta_j \neq 0, \qquad s_j \in [-1, 1] \ \text{if } \hat\beta_j = 0$$

Soft Thresholding Operation

For the simple case of an orthogonal design ($X^T X = I$), LASSO has a closed-form solution:

$$\hat\beta_j = S_\lambda(\beta_j^{\text{OLS}}) = \operatorname{sign}(\beta_j^{\text{OLS}})\left(|\beta_j^{\text{OLS}}| - \lambda\right)_+$$

This is called the soft thresholding operation.

Comparison with Hard Thresholding:

  • Soft thresholding: smoothly shrinks all coefficients toward zero; small ones become exactly zero
  • Hard thresholding: directly sets small coefficients to zero and leaves large ones unchanged

Visualization: The soft thresholding function acts like a "squeeze" operation, pushing all coefficients toward zero, with small coefficients becoming exactly zero.
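The two thresholding rules are one-liners; a minimal sketch comparing them on an arbitrary example vector with threshold $\tau = 1$:

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink toward zero by tau; values within [-tau, tau] become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def hard_threshold(x, tau):
    """Keep values with magnitude above tau unchanged; zero out the rest."""
    return np.where(np.abs(x) > tau, x, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(soft_threshold(x, 1.0))  # large entries shrunk by 1, small ones zeroed
print(hard_threshold(x, 1.0))  # large entries untouched, small ones zeroed
```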

Properties of LASSO

Sparsity: LASSO automatically sets unimportant feature coefficients to exactly zero.

Feature Selection: Nonzero coefficients correspond to selected features — this is automatic feature selection.

Bias-Variance Trade-off:

  • Large $\lambda$ → more sparse, high bias, low variance
  • Small $\lambda$ → more dense, low bias, high variance

Choosing $\lambda$: Cross-validation is typically used.
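In scikit-learn, cross-validated selection of the regularization strength is automated by `LassoCV`; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 50))
beta = np.zeros(50)
beta[:5] = 3.0                               # 5 informative features
y = X @ beta + 0.5 * rng.standard_normal(100)

# 5-fold cross-validation over an automatically chosen grid of alpha values
model = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", model.alpha_)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```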

The LASSO Path

The LASSO path is the trajectory of $\hat\beta(\lambda)$ as $\lambda$ varies.

Characteristics:

  • When $\lambda$ is very large, all coefficients are zero
  • As $\lambda$ decreases, coefficients gradually become nonzero
  • The path is piecewise linear (so it can be computed efficiently)

The LARS algorithm (Least Angle Regression) can efficiently compute the entire LASSO path.
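scikit-learn exposes this through `lars_path`; a small sketch that computes the breakpoints of the piecewise-linear path (data is synthetic):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 10))
beta = np.zeros(10)
beta[[1, 4]] = [5.0, -3.0]
y = X @ beta + 0.1 * rng.standard_normal(60)

# alphas: lambda values at the path's breakpoints (decreasing);
# coefs[:, i]: coefficient vector at breakpoint i
alphas, _, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)      # (n_features, number of breakpoints)
print(coefs[:, 0])      # at the largest lambda, all coefficients are zero
```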

Compressed Sensing Theory

Core Idea

Traditional Sampling: The Shannon-Nyquist theorem requires sampling rate at least twice the signal's highest frequency.

Compressed Sensing: If the signal is sparse, we can accurately recover it using far fewer measurements than the Nyquist rate!

Revolutionary Implications: - MRI scan time can be dramatically reduced - Sensors can be simpler and cheaper - Data storage and transmission become more efficient

Measurement Model

$$y = \Phi x + \varepsilon$$

where:

  • $x \in \mathbb{R}^n$: original signal (assumed k-sparse, or k-sparse in some basis)
  • $\Phi \in \mathbb{R}^{m \times n}$: measurement matrix ($m \ll n$)
  • $y \in \mathbb{R}^m$: measurements
  • $\varepsilon$: noise

Problem: Recover the $n$-dimensional signal $x$ from $m \ll n$ measurements (an underdetermined system with infinitely many solutions!)

Key Insight: Sparsity provides additional constraints, making recovery possible.

Restricted Isometry Property (RIP)

Definition: A matrix $\Phi$ satisfies the k-order Restricted Isometry Property (RIP) if there exists $\delta_k \in (0, 1)$ such that for all k-sparse vectors $x$:

$$(1 - \delta_k)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_k)\|x\|_2^2$$

Intuition: $\Phi$ approximately preserves the length of sparse vectors.

Geometric Meaning: When $\Phi$ maps sparse vectors from $\mathbb{R}^n$ to $\mathbb{R}^m$, it doesn't excessively compress or stretch them.

Analogy: Imagine projecting a high-dimensional object onto lower dimensions. RIP ensures the projection doesn't map different sparse vectors to the same point (otherwise they couldn't be distinguished).

Recovery Theorem

Theorem (Candès-Tao): If $\Phi$ satisfies $\delta_{2k} < \sqrt{2} - 1$, then any k-sparse signal $x$ can be exactly recovered from $y = \Phi x$ via L1 minimization:

$$\min_x \|x\|_1 \quad \text{s.t.} \quad y = \Phi x$$

Number of Measurements Required: Random matrices satisfying RIP exist, requiring only $m = O(k \log(n/k))$ measurements.

Significance:

  • Traditional methods need $n$ measurements
  • Compressed sensing needs only $O(k \log(n/k))$ measurements
  • When $k \ll n$, this is a huge savings!

Measurement Matrix Design

Random Matrices (most common):

  • Gaussian random matrix: $\Phi_{ij} \sim N(0, 1/m)$
  • Bernoulli random matrix: $\Phi_{ij} = \pm 1/\sqrt{m}$ with equal probability
  • Partial Fourier matrix: randomly selected rows of the Fourier matrix

Advantages of Random Matrices:

  • Satisfy RIP with high probability
  • Signal-independent (universality)
  • Easy to generate
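Certifying RIP exactly is intractable, but the near-isometry on sparse vectors is easy to observe empirically; a sketch with arbitrary dimensions (n = 1000, m = 100, k = 5):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1000, 100, 5

Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian measurement matrix

# Check that ||Phi x|| / ||x|| stays near 1 for many random k-sparse x
ratios = []
for _ in range(200):
    x = np.zeros(n)
    support = rng.choice(n, k, replace=False)
    x[support] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(Phi @ x) / np.linalg.norm(x))

print(min(ratios), max(ratios))  # both concentrate around 1
```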

Deterministic Matrices: Exist but construction is complex, commonly used for specific applications.

Noisy Case

Basis Pursuit Denoising (BPDN):

$$\min_x \|x\|_1 \quad \text{s.t.} \quad \|y - \Phi x\|_2 \le \varepsilon$$

Recovery Guarantee: If $\Phi$ satisfies RIP and the noise is bounded by $\varepsilon$, the BPDN solution's error from the true signal is of order $O(\varepsilon)$.

Algorithms

Iterative Shrinkage-Thresholding Algorithm (ISTA)

ISTA is a classic algorithm for solving LASSO problems.

Update Rule:

$$x^{(k+1)} = S_{\lambda/L}\!\left( x^{(k)} - \frac{1}{L} \Phi^T (\Phi x^{(k)} - y) \right)$$

where:

  • $L$ is the largest eigenvalue of $\Phi^T \Phi$ (the Lipschitz constant of the gradient)
  • $S_\tau$ is the soft thresholding function

Algorithm Interpretation:

  1. Gradient descent step: $x^{(k)} - \frac{1}{L}\Phi^T(\Phi x^{(k)} - y)$ decreases the data-fit term
  2. Soft thresholding step: promotes sparsity

Convergence Rate: $O(1/k)$ for the objective value.

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)

FISTA uses Nesterov acceleration: the gradient/thresholding step is taken at an extrapolated point $z^{(k)}$ rather than at $x^{(k)}$:

$$x^{(k)} = S_{\lambda/L}\!\left( z^{(k)} - \tfrac{1}{L}\Phi^T(\Phi z^{(k)} - y) \right), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad z^{(k+1)} = x^{(k)} + \frac{t_k - 1}{t_{k+1}}\left(x^{(k)} - x^{(k-1)}\right)$$

Convergence Rate: $O(1/k^2)$, a quadratic speedup!

Orthogonal Matching Pursuit (OMP)

OMP is a greedy algorithm:

Algorithm Steps:

  1. Initialize: residual $r = y$, support set $S = \emptyset$
  2. Select the atom most correlated with the residual: $j^* = \arg\max_j |\langle \phi_j, r \rangle|$
  3. Update support: $S \leftarrow S \cup \{j^*\}$
  4. Least squares update: $x_S = \arg\min_z \|y - \Phi_S z\|_2$
  5. Update residual: $r = y - \Phi_S x_S$; repeat from step 2 until k atoms are selected

Advantages: simple, fast
Disadvantages: the greedy approach may miss the global optimum

Coordinate Descent

Idea: Optimize one coordinate at a time.

For LASSO, minimizing over coordinate $j$ with all other coordinates fixed has the closed form

$$\beta_j \leftarrow \frac{S_\lambda\!\left( X_j^T (y - X_{-j}\beta_{-j}) \right)}{\|X_j\|_2^2},$$

where $X_{-j}\beta_{-j}$ is the fit using all coordinates except $j$.

Advantages: Very efficient for large-scale sparse problems.
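The coordinate update is a few lines of NumPy; a sketch with a running-residual trick so each sweep costs $O(np)$ (the data and `lam` are arbitrary):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for min 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)     # ||X_j||^2, precomputed once
    r = y.copy()                        # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]         # remove coordinate j's contribution
            rho = X[:, j] @ r           # correlation with partial residual
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]         # add the updated contribution back
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
beta = np.zeros(20)
beta[:2] = [3.0, -2.0]
y = X @ beta + 0.1 * rng.standard_normal(50)

b = lasso_cd(X, y, lam=1.0)
print(np.nonzero(b)[0])  # indices of the selected features
```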

Applications

Medical Imaging: Compressed Sensing MRI

Background: MRI scanning requires acquiring large amounts of k-space data, which is time-consuming.

Compressed Sensing MRI:

  • Acquire only partial k-space data (random undersampling)
  • Exploit image sparsity in the wavelet domain
  • Reconstruct the image via L1 optimization

Result: Scan time reduced by 2-8x, especially useful for motion-sensitive imaging.

Practical Systems: Since 2017, the FDA has approved multiple compressed sensing MRI systems.

Single-Pixel Camera

Principle: Capture images with a single photodetector.

How It Works:

  1. Use a digital micromirror device (DMD) to generate random patterns
  2. Each pattern produces one measurement (a weighted sum of all pixels)
  3. $m = O(k \log n)$ measurements suffice to reconstruct an $n$-pixel image ($m \ll n$)

Applications:

  • Infrared imaging (where detectors are expensive)
  • Terahertz imaging
  • High-speed imaging

Image Compression and Recovery

Image Inpainting:

  • Recover a complete image from a subset of known pixels
  • Exploit sparse priors of images (wavelet, DCT, learned dictionary)

Super-Resolution:

  • Recover a high-resolution image from low-resolution observations
  • Assume the high-resolution image is sparsely representable

Genomics

Problem: Identify disease-related genes.

Data: $p$ genes, $n$ samples ($n \ll p$)

Method: LASSO regression

  • Sparsity assumption: only a few genes are related to the disease
  • Automatically selects the relevant genes

Finance

Sparse Portfolio:

  • Invest in only a few assets (reducing transaction costs)
  • L1 regularization promotes sparse holdings

High-Frequency Trading Signals:

  • Select the key indicators from many candidate features

Sparse Matrix Storage and Computation

Sparse Matrix Formats

COO Format (Coordinate): Store triples $(i, j, v)$ representing the nonzero elements.

CSR Format (Compressed Sparse Row):

  • data: nonzero element values
  • indices: column indices
  • indptr: row pointers

CSC Format (Compressed Sparse Column): Similar to CSR, but stored by column.
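These formats are available in `scipy.sparse`; a small sketch building a matrix from COO triples and inspecting its CSR arrays (the triples are arbitrary):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Build a 3x3 sparse matrix from (row, col, value) triples
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([4.0, 5.0, 7.0, 9.0])
A = coo_matrix((vals, (rows, cols)), shape=(3, 3))

A_csr = A.tocsr()   # row-oriented: fast row slicing and matrix-vector products
A_csc = A.tocsc()   # column-oriented: fast column slicing

print(A_csr.data)     # [4. 5. 7. 9.]
print(A_csr.indices)  # [0 2 1 2]: column index of each stored value
print(A_csr.indptr)   # [0 1 2 4]: row i owns data[indptr[i]:indptr[i+1]]
```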

Sparse Matrix Operations

Matrix-Vector Multiplication:

  • Dense: $O(n^2)$ operations
  • Sparse with nnz nonzeros: $O(\text{nnz})$ operations

Sparse Matrix Solving:

  • Direct methods: sparse LU factorization
  • Iterative methods: conjugate gradient, GMRES
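Both solver families are available in `scipy.sparse.linalg`; a sketch on a 1D Laplacian (a standard tridiagonal test matrix, chosen arbitrarily here):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, spsolve

# 1D Laplacian: tridiagonal, so only ~3n of the n^2 entries are stored
n = 100
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

x_direct = spsolve(A, b)             # direct method: sparse LU factorization
x_cg, info = cg(A, b, maxiter=1000)  # iterative method: CG (A is SPD)

print(info)                               # 0 means CG converged
print(np.linalg.norm(A @ x_direct - b))   # direct-solve residual, ~0
```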

Deep Understanding: Why Does Compressed Sensing Work?

Counter-Intuitive High-Dimensional Geometry

In high-dimensional space, our intuition often fails:

Volume of High-Dimensional Spheres:

  • Most of the volume concentrates near the surface (not at the center)
  • The volume of the unit ball tends to zero as the dimension grows

Structure of Sparse Vectors:

  • k-sparse vectors live in a union of $\binom{n}{k}$ k-dimensional subspaces
  • The effective "dimension" of this union is much smaller than $n$

Johnson-Lindenstrauss Lemma

JL Lemma: Any $N$ points can be randomly projected into an $O(\log N / \epsilon^2)$-dimensional space while preserving all pairwise distances up to relative error $\epsilon$.

Connection to Compressed Sensing: RIP is a strengthened version of the JL lemma for sparse vectors.
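The distance-preservation claim is easy to see empirically; a sketch projecting from 10,000 to 500 dimensions (all sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, dim, target_dim = 50, 10000, 500

X = rng.standard_normal((n_points, dim))
P = rng.standard_normal((target_dim, dim)) / np.sqrt(target_dim)  # random projection
Y = X @ P.T

# Ratio of projected distance to original distance for a few point pairs
ratios = []
for i, j in [(0, 1), (2, 3), (4, 5)]:
    ratios.append(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]))

print([round(r, 3) for r in ratios])  # each ratio close to 1
```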

Information-Theoretic Perspective

Information Content of Sparse Signals: a k-sparse vector has about $k$ degrees of freedom, plus roughly $k \log(n/k)$ bits to specify its support.

Lower Bound on Measurements: $m = \Omega(k \log(n/k))$ is necessary.

Conclusion: The number of measurements in compressed sensing is information-theoretically optimal (up to constants)!

Python Implementation

LASSO Regression

import numpy as np
from sklearn.linear_model import Lasso

# Generate sparse data
np.random.seed(42)
n, p, k = 100, 500, 10 # samples, features, sparsity

# True sparse coefficients
beta_true = np.zeros(p)
beta_true[:k] = np.random.randn(k) * 5

# Design matrix and response
X = np.random.randn(n, p) / np.sqrt(n)
y = X @ beta_true + np.random.randn(n) * 0.5

# LASSO regression
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Compare
print(f"True nonzero coefficients: {np.sum(beta_true != 0)}")
print(f"Estimated nonzero coefficients: {np.sum(np.abs(lasso.coef_) > 0.01)}")
print(f"Recovery error: {np.linalg.norm(lasso.coef_ - beta_true):.4f}")

OMP Algorithm

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit algorithm"""
    n = Phi.shape[1]
    r = y.copy()  # residual
    S = []        # support set
    x = np.zeros(n)

    for _ in range(k):
        # Select most correlated atom
        correlations = np.abs(Phi.T @ r)
        j = np.argmax(correlations)
        S.append(j)

        # Least squares update
        Phi_S = Phi[:, S]
        x_S = np.linalg.lstsq(Phi_S, y, rcond=None)[0]

        # Update residual
        r = y - Phi_S @ x_S

    # Construct sparse solution
    x[S] = x_S
    return x

ISTA Algorithm

def soft_threshold(x, tau):
    """Soft thresholding function"""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0)

def ista(Phi, y, lambda_param, max_iter=1000, tol=1e-6):
    """Iterative Shrinkage-Thresholding Algorithm"""
    m, n = Phi.shape
    x = np.zeros(n)

    # Lipschitz constant
    L = np.linalg.norm(Phi.T @ Phi, 2)

    for i in range(max_iter):
        x_old = x.copy()

        # Gradient descent step
        grad = Phi.T @ (Phi @ x - y)
        x = x - (1/L) * grad

        # Soft thresholding step
        x = soft_threshold(x, lambda_param / L)

        # Convergence check
        if np.linalg.norm(x - x_old) < tol:
            break

    return x

FISTA Algorithm

def fista(Phi, y, lambda_param, max_iter=1000, tol=1e-6):
    """Fast Iterative Shrinkage-Thresholding Algorithm"""
    m, n = Phi.shape
    x = np.zeros(n)
    z = x.copy()
    t = 1

    # Lipschitz constant
    L = np.linalg.norm(Phi.T @ Phi, 2)

    for i in range(max_iter):
        x_old = x.copy()

        # Gradient step on z
        grad = Phi.T @ (Phi @ z - y)
        x = soft_threshold(z - (1/L) * grad, lambda_param / L)

        # Update momentum
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        z = x + ((t - 1) / t_new) * (x - x_old)
        t = t_new

        # Convergence check
        if np.linalg.norm(x - x_old) < tol:
            break

    return x

Compressed Sensing Experiment

import matplotlib.pyplot as plt

def compressed_sensing_demo():
    """Demonstrate compressed sensing recovery"""
    # Generate sparse signal
    n = 256  # signal dimension
    k = 10   # sparsity
    m = 64   # number of measurements

    # True sparse signal
    x_true = np.zeros(n)
    support = np.random.choice(n, k, replace=False)
    x_true[support] = np.random.randn(k)

    # Random measurement matrix
    Phi = np.random.randn(m, n) / np.sqrt(m)

    # Measurements
    y = Phi @ x_true

    # Recovery via ISTA
    x_recovered = ista(Phi, y, lambda_param=0.1)

    # Visualize
    fig, axes = plt.subplots(2, 1, figsize=(12, 6))

    axes[0].stem(x_true, linefmt='b-', markerfmt='bo', basefmt='k-')
    axes[0].set_title(f'True Signal (k={k} sparse, n={n})')
    axes[0].set_xlabel('Index')

    axes[1].stem(x_recovered, linefmt='r-', markerfmt='ro', basefmt='k-')
    axes[1].set_title(f'Recovered Signal (m={m} measurements)')
    axes[1].set_xlabel('Index')

    plt.tight_layout()
    plt.show()

    # Report error
    error = np.linalg.norm(x_true - x_recovered) / np.linalg.norm(x_true)
    print(f"Relative recovery error: {error:.4f}")

compressed_sensing_demo()

Summary and Outlook

Key Points of This Chapter

  1. Sparsity is a universal feature of natural signals

  2. L1 norm is the convex relaxation of L0, geometrically promoting sparsity

  3. LASSO simultaneously achieves sparse estimation and feature selection

  4. Compressed Sensing: Recover k-sparse signals from $m = O(k \log(n/k))$ measurements

  5. RIP condition guarantees recovery correctness

  6. Algorithms: ISTA, FISTA, OMP, coordinate descent

Connections to Other Chapters

  • Chapter 9 (SVD): Low-rank is another form of "sparsity"
  • Chapter 10 (Norms): Properties of L1/L2 norms
  • Chapter 11 (Optimization): LASSO is a convex optimization problem
  • Chapter 15 (Machine Learning): LASSO used for feature selection

Further Learning

  1. Structured Sparsity: Group sparsity, tree sparsity
  2. Dictionary Learning: Data-driven dictionary design
  3. Matrix Completion: Recovery of low-rank matrices
  4. Deep Compressed Sensing: Neural networks + sparse recovery

Preview of Next Chapter

"Tensors and Multilinear Algebra"

  • Tensors are generalizations of vectors and matrices
  • CP decomposition, Tucker decomposition
  • Applications: Recommender systems, chemometrics

Exercises

Basic Problems

  1. Prove that the L1 ball has corners on the coordinate axes.

  2. Compute the L0, L1, and L2 norms of the vector $x = (1, 0, -3, 0, 2)$.

  3. Explain why L0 optimization is NP-hard.

  4. Derive the formula for the soft thresholding operation.

  5. Write out the KKT conditions for LASSO and derive the subgradient conditions.

Advanced Problems

  1. Prove: If $\Phi$ satisfies RIP with $\delta_{2k} < 1$, then different k-sparse vectors are mapped to different measurements.

  2. Implement FISTA and compare convergence speed with ISTA on a synthetic problem.

  3. Prove that the condition number of $\Phi^T\Phi$ is the square of the condition number of $\Phi$, and explain why this makes normal equations less stable for sparse recovery.

  4. Derive the update formula for coordinate descent applied to LASSO.

  5. Prove that OMP correctly recovers any k-sparse signal when $\Phi$ satisfies certain coherence conditions.

Programming Problems

  1. Use LASSO for feature selection on the MNIST dataset and visualize selected features.

  2. Implement compressed sensing image reconstruction:

    • Take random Fourier measurements of an image
    • Reconstruct using total variation minimization
    • Compare reconstruction quality vs. number of measurements
  3. Compare recovery performance of OMP and LASSO:

    • Generate signals with varying sparsity levels
    • Plot success rate vs. sparsity
    • Analyze computational time trade-offs
  4. Implement a simple dictionary learning algorithm:

    • Alternate between sparse coding (fix dictionary, update coefficients)
    • And dictionary update (fix coefficients, update dictionary)

Application Problems

  1. MRI Reconstruction Simulation:
    • Generate a phantom image
    • Create random k-space undersampling mask
    • Reconstruct using L1-minimization in wavelet domain
    • Compare with zero-filled reconstruction
  2. Time Series Anomaly Detection:
    • Model normal behavior as sparse in some basis
    • Detect anomalies as deviations from sparse representation
    • Apply to real sensor data

Theoretical Problems

  1. Prove the relationship between mutual coherence and spark of a dictionary.

  2. Show that Gaussian random matrices satisfy RIP with high probability when $m = O(k \log(n/k))$.

  3. Derive the dual problem of LASSO and interpret the dual variables geometrically.

  4. Prove that the LASSO path is piecewise linear and derive the conditions under which a coefficient becomes nonzero.

References

  1. Candès, E. & Wakin, M. (2008). "An Introduction to Compressive Sampling". IEEE Signal Processing Magazine.
    • Accessible introduction to compressed sensing theory
  2. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". JRSS-B.
    • Original LASSO paper
  3. Elad, M. (2010). Sparse and Redundant Representations. Springer.
    • Comprehensive treatment of sparse representation
  4. Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity. CRC Press.
    • Modern treatment of sparsity in statistics and machine learning
  5. Foucart, S. & Rauhut, H. (2013). A Mathematical Introduction to Compressive Sensing. Springer.
    • Rigorous mathematical foundations
  6. Beck, A. & Teboulle, M. (2009). "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems". SIAM J. Imaging Sciences.
    • Original FISTA paper

This is Chapter 12 of the 18-part "Essence of Linear Algebra" series.

  • Create time:2019-03-04 11:30:00
  • Post link:https://www.chenk.top/chapter-12-sparse-matrices-and-compressed-sensing/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.