Machine Learning Mathematical Derivations (17): Dimensionality Reduction and PCA
Chen Kai

Dimensionality reduction is a core technique in machine learning for handling high-dimensional data: when feature dimensions reach thousands or even millions, the curse of dimensionality makes learning difficult. Dimensionality reduction projects data to a lower-dimensional space while preserving its main structure. From PCA's eigenvalue decomposition to LDA's between-class separation, from kernel tricks for nonlinear mappings to t-SNE for manifold learning, dimensionality reduction algorithms provide powerful tools for data visualization, feature extraction, and preprocessing. This chapter derives in detail the two perspectives of PCA, the implicit mapping of kernel PCA, Fisher's criterion for LDA, and the probability-distribution matching of t-SNE.


Curse of Dimensionality and the Need for Dimensionality Reduction

Counter-intuitive Phenomena in High-Dimensional Space

Curse of Dimensionality: As data dimensionality increases, samples become sparse, and algorithm performance degrades sharply

Phenomenon 1: Volume Concentration

Volume of a $d$-dimensional hypersphere of radius $r$:

$$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$$

As $d \to \infty$, almost all volume concentrates in a thin shell near the surface! The proportion occupied by a ball of radius 0.99 inside the unit ball is

$$\frac{V_d(0.99)}{V_d(1)} = 0.99^d$$

which shrinks exponentially: for large $d$, this proportion is vanishingly small.

Phenomenon 2: Distance Concentration

In high-dimensional space, distances between random points tend to become equal:

$$\frac{\max_{i,j} \|\mathbf{x}_i - \mathbf{x}_j\| - \min_{i,j} \|\mathbf{x}_i - \mathbf{x}_j\|}{\min_{i,j} \|\mathbf{x}_i - \mathbf{x}_j\|} \xrightarrow{d \to \infty} 0$$

This causes nearest-neighbor and clustering algorithms to fail
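The distance-concentration effect can be checked empirically. Below is a minimal NumPy sketch (the function name, sample size, and dimensions are my own choices, not from the original post):

```python
# Empirical check of distance concentration: as dimension d grows,
# the relative gap between the farthest and nearest pair of random
# points shrinks toward zero.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=200):
    """(max - min) / min over pairwise Euclidean distances of n random points."""
    X = rng.standard_normal((n, d))
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences
    dist = np.sqrt((diff ** 2).sum(-1))
    dist = dist[np.triu_indices(n, k=1)]          # distinct pairs only
    return (dist.max() - dist.min()) / dist.min()

for d in [2, 10, 100, 1000]:
    print(f"d={d:5d}  relative contrast={relative_contrast(d):.3f}")
```

The printed contrast drops sharply with dimension, which is exactly why nearest-neighbor queries lose discriminative power in high dimensions.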

Two Major Goals of Dimensionality Reduction

1. Feature Extraction: Remove redundant features, preserve important information

  • Noise filtering
  • Improved computational efficiency
  • Reduced storage space

2. Data Visualization: Project to 2D/3D space

  • Understand data distribution
  • Discover cluster structures
  • Detect anomalies

PCA: Maximum Variance Perspective

Problem Formalization

Data: $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, $\mathbf{x}_i \in \mathbb{R}^D$

Centering: $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$, where $\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i$

Objective: Find a projection direction $\mathbf{w}$ ($\|\mathbf{w}\| = 1$) that maximizes the projected variance

Projected values: $z_i = \mathbf{w}^\top \tilde{\mathbf{x}}_i$. Variance: $\mathrm{Var}(z) = \frac{1}{N}\sum_{i=1}^N z_i^2$

Note that $\bar{z} = \frac{1}{N}\sum_i z_i = 0$ after centering

First Principal Component Derivation

Objective function:

$$\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^N (\mathbf{w}^\top \tilde{\mathbf{x}}_i)^2 \quad \text{s.t. } \mathbf{w}^\top \mathbf{w} = 1$$

Rewriting in matrix form:

$$\max_{\mathbf{w}} \mathbf{w}^\top \mathbf{S} \mathbf{w}$$

where the covariance matrix is:

$$\mathbf{S} = \frac{1}{N}\sum_{i=1}^N \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\top$$

Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^\top \mathbf{S} \mathbf{w} - \lambda(\mathbf{w}^\top \mathbf{w} - 1)$$

Differentiation:

$$\frac{\partial L}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0}$$

Eigenvalue equation:

$$\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$$

Conclusion: $\mathbf{w}$ is an eigenvector of the covariance matrix, and the projected variance is $\mathbf{w}^\top \mathbf{S}\mathbf{w} = \lambda$

Optimal solution: Choose the eigenvector corresponding to the largest eigenvalue $\lambda_1$

Multiple Principal Components

General conclusion: The first $k$ principal components are the eigenvectors corresponding to the $k$ largest eigenvalues of the covariance matrix

Projection matrix: $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_k] \in \mathbb{R}^{D \times k}$

Dimensionality reduction: $\mathbf{z}_i = \mathbf{W}^\top \tilde{\mathbf{x}}_i \in \mathbb{R}^k$

Variance Retention Ratio

Total variance: $\sum_{j=1}^{D} \lambda_j$

Retained variance ratio:

$$\eta(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{D} \lambda_j}$$

Practice: Usually choose $k$ such that the retained variance $\eta(k) \geq 90\%$ or 95%
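The derivation above translates directly into code. Here is a minimal PCA sketch via eigendecomposition of the covariance matrix; the function name `pca_fit` and the toy data are my own choices for illustration:

```python
# PCA via eigendecomposition of the covariance matrix, following the
# maximum-variance derivation: center, build S, take top-k eigenvectors.
import numpy as np

def pca_fit(X, k):
    """Return (mean, W, eigenvalues): W holds the top-k eigenvectors (D x k)."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # centering
    S = Xc.T @ Xc / X.shape[0]             # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: ascending order, symmetric S
    order = np.argsort(eigvals)[::-1]      # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mean, eigvecs[:, :k], eigvals

rng = np.random.default_rng(0)
# Toy data: 5 features with very different variances
X = rng.standard_normal((500, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

mean, W, eigvals = pca_fit(X, k=2)
Z = (X - mean) @ W                         # reduced representation, (500, 2)
retained = eigvals[:2].sum() / eigvals.sum()
print(f"variance retained by top 2 PCs: {retained:.3f}")
```

Since nearly all the variance sits in the first two features, the top two components retain well over 90% of the total variance here.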

PCA: Minimum Reconstruction Error Perspective

Reconstruction Objective

Projection: $\mathbf{z}_i = \mathbf{W}^\top \tilde{\mathbf{x}}_i$. Reconstruction: $\hat{\mathbf{x}}_i = \mathbf{W}\mathbf{z}_i = \mathbf{W}\mathbf{W}^\top \tilde{\mathbf{x}}_i$. Reconstruction error:

$$J(\mathbf{W}) = \frac{1}{N}\sum_{i=1}^N \|\tilde{\mathbf{x}}_i - \mathbf{W}\mathbf{W}^\top \tilde{\mathbf{x}}_i\|^2$$

It can be proven that minimizing the reconstruction error leads to the same eigenvalue equation, i.e., it is equivalent to the maximum variance perspective!
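The equivalence can also be checked numerically: the minimal mean reconstruction error of a $k$-dimensional projection equals the sum of the discarded eigenvalues. A small sketch (toy data of my own):

```python
# Numerical check: mean squared reconstruction error of the top-k PCA
# projection equals the sum of the discarded covariance eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 4)) @ rng.standard_normal((4, 4))
Xc = X - X.mean(axis=0)

S = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

k = 2
Wk = W[:, :k]
recon_err = ((Xc - Xc @ Wk @ Wk.T) ** 2).sum(axis=1).mean()
print(recon_err, eigvals[k:].sum())   # the two numbers agree
```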

Kernel PCA: Nonlinear Dimensionality Reduction

Kernel Trick Review

Idea: Data linearly inseparable in original space may be linearly separable in high-dimensional feature space

Explicit mapping: $\phi: \mathbb{R}^D \to \mathcal{F}$ (where $\dim \mathcal{F}$ is possibly infinite)

Kernel function: $k(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^\top \phi(\mathbf{y})$

Common kernels:

  • Polynomial kernel: $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y} + c)^p$
  • RBF kernel (Gaussian): $k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)$

Kernel PCA Derivation

Key observation: Principal components $\mathbf{v}$ lie in the span of $\{\phi(\mathbf{x}_i)\}$, expressible as $\mathbf{v} = \sum_{i=1}^N \alpha_i \phi(\mathbf{x}_i)$

Solving steps:

  1. Compute the kernel matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
  2. Center it: $\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N\mathbf{K}\mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times N$ matrix with all entries $1/N$
  3. Solve the eigenproblem $\tilde{\mathbf{K}}\boldsymbol{\alpha}_k = N\lambda_k \boldsymbol{\alpha}_k$
  4. Normalize $\boldsymbol{\alpha}_k \leftarrow \boldsymbol{\alpha}_k / \sqrt{N\lambda_k}$ so that $\mathbf{v}_k^\top \mathbf{v}_k = 1$

New sample projection: $z_k(\mathbf{x}) = \mathbf{v}_k^\top \phi(\mathbf{x}) = \sum_{i=1}^N \alpha_{k,i}\, k(\mathbf{x}_i, \mathbf{x})$
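The steps above can be sketched in NumPy. This is a minimal kernel PCA with an RBF kernel, projecting only the training points; the function names, kernel width, and toy ring-shaped data are assumptions of mine:

```python
# Kernel PCA sketch: build the RBF kernel matrix, center it in feature
# space, solve the eigenproblem, and normalize the coefficient vectors.
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca(X, k, sigma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one   # centering in feature space
    eigvals, alphas = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1]
    eigvals, alphas = eigvals[order], alphas[:, order]
    # Normalize so the feature-space eigenvectors have unit norm
    alphas = alphas[:, :k] / np.sqrt(eigvals[:k])
    return Kc @ alphas                            # projections of training points

rng = np.random.default_rng(0)
# Toy data: points on two concentric circles (not linearly separable)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] * rng.choice([1.0, 3.0], 100)[:, None]
Z = kernel_pca(X, k=2, sigma=2.0)
print(Z.shape)   # (100, 2)
```

Note that `np.linalg.eigh` solves $\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$ with $\mu = N\lambda$, so dividing by $\sqrt{\mu}$ implements the normalization of step 4.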

LDA: Linear Discriminant Analysis

Fisher's Discriminant Criterion

Supervised dimensionality reduction: Utilizes class labels

Data: $C$ classes, class $c$ has $N_c$ samples $\{\mathbf{x}_i^{(c)}\}$

Objective: Find a projection direction $\mathbf{w}$ that:

  • Maximizes inter-class distance (means of different classes are far apart)
  • Minimizes intra-class distance (samples within same class are close)

Binary LDA Derivation

Between-class scatter: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top$

Within-class scatter: $\mathbf{S}_W = \sum_{c=1}^{2}\sum_{i=1}^{N_c} (\mathbf{x}_i^{(c)} - \boldsymbol{\mu}_c)(\mathbf{x}_i^{(c)} - \boldsymbol{\mu}_c)^\top$

Fisher's criterion:

$$J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}}$$

Solution: $\mathbf{w}^* \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$

Multi-class LDA

Solved by generalized eigendecomposition:

$$\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}$$

Select the eigenvectors of the $k$ largest eigenvalues ($k \leq C - 1$, since $\mathrm{rank}(\mathbf{S}_B) \leq C - 1$)
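For the binary case, the closed-form solution $\mathbf{w}^* \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is a few lines of NumPy. A minimal sketch with made-up toy data (the function name is mine):

```python
# Binary Fisher LDA: w is proportional to S_W^{-1} (mu1 - mu2).
import numpy as np

def fisher_lda(X1, X2):
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products around each class mean
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(Sw, mu1 - mu2)   # solves Sw w = (mu1 - mu2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, 2)) + np.array([2.0, 0.0])   # class 1
X2 = rng.standard_normal((100, 2)) + np.array([-2.0, 0.0])  # class 2

w = fisher_lda(X1, X2)
# The projected class means should be well separated along w
print((X1 @ w).mean(), (X2 @ w).mean())
```

Using `np.linalg.solve` instead of explicitly inverting $\mathbf{S}_W$ is the standard numerically stable choice.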

LDA vs PCA

| Aspect | PCA | LDA |
| --- | --- | --- |
| Type | Unsupervised | Supervised |
| Objective | Maximum variance | Maximum between-class / minimum within-class scatter |
| Output dimension | Any $k \leq D$ | At most $C - 1$ |
| Application | Visualization, preprocessing | Classification tasks |

t-SNE: Manifold Learning and Visualization

From SNE to t-SNE

Objective: Map high-dimensional data to low-dimensional (usually 2D/3D), preserving local structure

SNE idea: Define probability of similarity between point pairs in high-dimensional space, also define probability in low-dimensional space, minimize KL divergence between two distributions

High-Dimensional Space Probability

Conditional probability (similarity of $\mathbf{x}_j$ relative to $\mathbf{x}_i$):

$$p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}$$

Symmetrization:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

Perplexity: $\mathrm{Perp}(P_i) = 2^{H(P_i)}$ controls $\sigma_i$

Usually set the perplexity to a fixed value, then find each $\sigma_i$ via binary search

Low-Dimensional Space Probability

t-distribution (Student-t with one degree of freedom): a heavy-tailed distribution that avoids the crowding problem

$$q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}}$$

Why t-distribution?

  • Gaussian distribution decays too fast, distant points get crowded together
  • t-distribution's heavy tail gives distant points more space

KL Divergence and Gradient

Objective: Minimize the KL divergence

$$C = \mathrm{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Gradient:

$$\frac{\partial C}{\partial \mathbf{y}_i} = 4 \sum_{j} (p_{ij} - q_{ij})(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}(\mathbf{y}_i - \mathbf{y}_j)$$

Physical meaning:

  • $p_{ij} > q_{ij}$: similar in high dimensions but too far apart in low dimensions, producing attraction
  • $p_{ij} < q_{ij}$: dissimilar in high dimensions but too close in low dimensions, producing repulsion
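The gradient above can be vectorized directly. Below is a bare-bones sketch of the t-SNE update, assuming $P$ has already been computed; the random $P$, learning rate, and iteration count are illustrative choices of mine, not a full t-SNE implementation (which would add perplexity calibration, momentum, and early exaggeration):

```python
# One vectorized t-SNE gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij)
# * (1 + ||y_i - y_j||^2)^-1 * (y_i - y_j), plus a plain descent loop.
import numpy as np

def tsne_grad(P, Y):
    """Gradient of KL(P||Q) with Student-t low-dimensional similarities."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    inv = 1.0 / (1.0 + d2)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                  # q_ij, normalized over all pairs
    PQ = (P - Q) * inv                   # (p_ij - q_ij)(1 + d_ij^2)^-1
    return 4.0 * (PQ[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)

rng = np.random.default_rng(0)
n = 50
P = rng.random((n, n))                   # stand-in for real p_ij values
P = (P + P.T) / 2
np.fill_diagonal(P, 0.0)
P /= P.sum()

Y = rng.standard_normal((n, 2)) * 1e-2   # small random initialization
for _ in range(100):                     # plain gradient descent
    Y -= 10.0 * tsne_grad(P, Y)
```

The sign of `PQ` realizes the attraction/repulsion picture: positive entries pull $\mathbf{y}_i$ toward $\mathbf{y}_j$, negative entries push it away.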

Q&A

Q1: Why are the two perspectives of PCA (maximum variance/minimum reconstruction error) equivalent?

A: Mathematically, both derive the same eigenvalue equation. Intuitively:

  • Directions with large variance contain more information
  • Small reconstruction error means the projection loses little information

These are fundamentally the same: preserving the main directions of data variation.

Q2: Is PCA sensitive to data scaling?

A: Very sensitive! If features have different units (e.g., height in cm vs. weight in kg), features with large variance will dominate the principal components. Solutions:

  • Standardization: $x' = (x - \mu) / \sigma$
  • Normalization: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$

In practice, usually standardize before PCA.

Q3: How to choose perplexity in t-SNE?

A: Perplexity controls the effective neighbor count. Rules of thumb:

  • Small datasets (<1000): Perp = 5–30
  • Medium datasets (1000–10000): Perp = 30–50
  • Large datasets: Perp = 50–100

Too small: local structure is over-emphasized, global structure is lost. Too large: global structure dominates local structure.

Q4: Why does t-SNE use t-distribution instead of Gaussian?

A: It solves the "Crowding Problem": moderate-distance points in high-dimensional space have insufficient room when projected to low dimensions. The t-distribution's heavy-tail property allows:

  • Nearby points to remain close in low dimensions
  • Distant points to have more space to spread out

Q5: Is t-SNE suitable for clustering?

A: Not really. t-SNE's problems:

  1. Does not preserve distances: it only preserves local probability structure, not global distances
  2. Randomness: different random initializations give different results
  3. Not interpretable: the low-dimensional coordinates have no clear meaning

Recommendation: visualize with t-SNE, then cluster with K-means etc. in the original space.

✏️ Exercises and Solutions

Exercise 1: PCA Variance

Problem: Given covariance eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D$, keep the top 2 PCs; what is the variance retention? Solution: $\eta(2) = (\lambda_1 + \lambda_2) / \sum_{j=1}^{D} \lambda_j$.

Exercise 2: Centering

Problem: Why center data before PCA? Solution: The covariance matrix assumes zero mean. Without centering, the first principal component points toward the data mean rather than the direction of maximum variance.

Exercise 3: t-SNE vs PCA

Problem: Compare t-SNE and PCA. Solution: PCA: linear, global, preserves variance. t-SNE: nonlinear, local, preserves neighborhoods.

Exercise 4: Kernel PCA

Problem: How does kernel PCA handle nonlinear data? Solution: Kernel implicitly maps to high-dimensional space, do PCA there.

Exercise 5: Whitening

Problem: What is PCA whitening? Solution: $\mathbf{z}_i = \boldsymbol{\Lambda}^{-1/2}\mathbf{W}^\top \tilde{\mathbf{x}}_i$ scales each principal component by $1/\sqrt{\lambda_j}$, making the covariance the identity (decorrelated, unit-variance features).


References

[1] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559-572.

[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.

[3] Scholkopf, B., Smola, A., & Muller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.

[4] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.

[5] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

  • Create time:2021-11-29 09:30:00
  • Post link:https://www.chenk.top/Machine-Learning-Mathematical-Derivations-17-Dimensionality-Reduction-and-PCA/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.