Dimensionality reduction is a core technique in machine learning for handling high-dimensional data: when feature dimensions reach thousands or even millions, the curse of dimensionality makes learning difficult. Dimensionality reduction projects data into a lower-dimensional space while preserving its main structure. From PCA's eigenvalue decomposition to LDA's inter-class separation, from kernel tricks for nonlinear mapping to t-SNE for manifold learning, dimensionality reduction algorithms provide powerful tools for data visualization, feature extraction, and preprocessing. This chapter derives in detail the two perspectives of PCA, the implicit mapping of kernel PCA, Fisher's criterion in LDA, and the probability-distribution matching of t-SNE.
Curse of Dimensionality and the Need for Dimensionality Reduction
Counter-intuitive Phenomena in High-Dimensional Space
Curse of Dimensionality: As data dimensionality increases, samples become sparse, and algorithm performance degrades sharply
Phenomenon 1: Volume Concentration
Volume of a $d$-dimensional hypersphere of radius $r$: $V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$
As $d \to \infty$, the fraction of a hypercube's volume occupied by its inscribed hypersphere tends to 0: almost all of the volume lies in the corners.
For a thin shell of relative thickness $\epsilon$ near the surface, the volume fraction $1 - (1 - \epsilon)^d \to 1$: volume concentrates near the boundary.
Phenomenon 2: Distance Concentration
In high-dimensional space, distances between random points tend to become equal: $\frac{\max_{i,j} \|x_i - x_j\| - \min_{i,j} \|x_i - x_j\|}{\min_{i,j} \|x_i - x_j\|} \to 0$
This causes nearest-neighbor and clustering algorithms to fail, since "near" and "far" lose their meaning
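The distance-concentration effect is easy to check empirically. Below is a minimal numpy sketch (the helper name and sample sizes are my own illustrative choices) that measures the relative spread of pairwise distances as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(d, n=200):
    """Relative spread (max - min) / min of pairwise distances
    between n points drawn uniformly from the unit hypercube in R^d."""
    X = rng.random((n, d))
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances via Gram trick
    d2 = np.maximum(d2, 0.0)                       # clip tiny negative round-off
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])    # unique pairs only
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(d), 3))
```

The printed ratio shrinks sharply with dimension: in 2D the nearest pair is far closer than the farthest, while in 1000D all pairs are nearly equidistant.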
Two Major Goals of Dimensionality Reduction
1. Feature Extraction: Remove redundant features, preserve important information
- Noise filtering
- Improved computational efficiency
- Reduced storage space
2. Data Visualization: Project to 2D/3D space
- Understand data distribution
- Discover cluster structures
- Detect anomalies
PCA: Maximum Variance Perspective
Problem Formalization
Data: $X = \{x_1, \dots, x_N\}$, $x_i \in \mathbb{R}^D$
Centering: $\tilde{x}_i = x_i - \bar{x}$, where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
Objective: Find a projection direction $w \in \mathbb{R}^D$ with $\|w\| = 1$ that maximizes the variance of the projected data
Projected values: $z_i = w^T \tilde{x}_i$
Note that the projected values have zero mean, since the data are centered: $\frac{1}{N} \sum_i z_i = w^T \left( \frac{1}{N} \sum_i \tilde{x}_i \right) = 0$
First Principal Component Derivation
Objective function: $\max_{w} \frac{1}{N} \sum_{i=1}^{N} (w^T \tilde{x}_i)^2$ subject to $\|w\| = 1$
Rewriting in matrix form: $\max_{w} w^T S w$
where the covariance matrix is $S = \frac{1}{N} \sum_{i=1}^{N} \tilde{x}_i \tilde{x}_i^T$
Lagrangian: $L(w, \lambda) = w^T S w - \lambda (w^T w - 1)$
Differentiation: $\frac{\partial L}{\partial w} = 2Sw - 2\lambda w = 0$
Eigenvalue equation: $Sw = \lambda w$
Conclusion: the projected variance equals the eigenvalue, $w^T S w = \lambda w^T w = \lambda$
Optimal solution: Choose the eigenvector $w_1$ corresponding to the largest eigenvalue $\lambda_1$
Multiple Principal Components
General conclusion: The first $k$ principal components are the eigenvectors $w_1, \dots, w_k$ of $S$ corresponding to the $k$ largest eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k$
Projection matrix: $W_k = [w_1, \dots, w_k] \in \mathbb{R}^{D \times k}$
Dimensionality reduction: $z_i = W_k^T \tilde{x}_i \in \mathbb{R}^k$
Variance Retention Ratio
Total variance: $\sum_{j=1}^{D} \lambda_j$
Retained variance ratio: $\eta(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{D} \lambda_j}$
Practice: Usually choose the smallest $k$ such that $\eta(k) \ge 0.95$ (or $0.99$)
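The whole pipeline above (centering, eigendecomposition, retaining enough eigenvalues) fits in a few lines of numpy. The `pca` helper and the synthetic data below are my own illustrative choices, not from the text:

```python
import numpy as np

def pca(X, var_threshold=0.95):
    """PCA via eigendecomposition of the covariance matrix.
    Returns projected data, number of components k, and cumulative variance ratio."""
    Xc = X - X.mean(axis=0)                   # centering
    S = Xc.T @ Xc / len(X)                    # covariance matrix
    vals, vecs = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]    # sort descending
    ratio = np.cumsum(vals) / vals.sum()      # variance retention eta(k)
    k = int(np.searchsorted(ratio, var_threshold) + 1)  # smallest k with eta >= threshold
    return Xc @ vecs[:, :k], k, ratio

rng = np.random.default_rng(0)
# 3 latent directions embedded in 20 dimensions, plus small noise
Z = rng.normal(size=(500, 3))
X = Z @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(500, 20))
Xk, k, ratio = pca(X)
print(k, ratio[:4].round(4))   # almost all variance lies in the first 3 directions
```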
PCA: Minimum Reconstruction Error Perspective
Reconstruction Objective
Projection and reconstruction: $\hat{x}_i = W_k W_k^T \tilde{x}_i$, with reconstruction error $J = \frac{1}{N} \sum_{i=1}^{N} \|\tilde{x}_i - W_k W_k^T \tilde{x}_i\|^2 = \sum_{j=k+1}^{D} \lambda_j$
It can be proven that minimizing the reconstruction error leads to the same eigenvalue equation: the two perspectives are equivalent, and the error equals the sum of the discarded eigenvalues.
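The identity "reconstruction error = sum of discarded eigenvalues" can be verified numerically; a small self-contained sketch with made-up correlated data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))   # correlated features
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)
vals, vecs = np.linalg.eigh(S)
vals, vecs = vals[::-1], vecs[:, ::-1]       # eigenvalues in descending order

k = 3
W = vecs[:, :k]                              # top-k eigenvectors
X_hat = Xc @ W @ W.T                         # project, then reconstruct
recon_error = ((Xc - X_hat) ** 2).sum(axis=1).mean()
print(recon_error, vals[k:].sum())           # the two quantities coincide
```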
Kernel PCA: Nonlinear Dimensionality Reduction
Kernel Trick Review
Idea: Data linearly inseparable in original space may be linearly separable in high-dimensional feature space
Explicit mapping: $\phi: \mathbb{R}^D \to \mathcal{F}$, usually very high- or infinite-dimensional and expensive to compute explicitly
Kernel function: $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ computes inner products in $\mathcal{F}$ without the explicit mapping
Common kernels:
- Polynomial kernel: $k(x, y) = (x^T y + c)^p$
- RBF kernel (Gaussian): $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
Kernel PCA Derivation
Key observation: Principal components in feature space lie in the span of the mapped samples: $v = \sum_{i=1}^{N} \alpha_i \phi(x_i)$
Solving steps:
- Compute the kernel matrix $K_{ij} = k(x_i, x_j)$
- Center the kernel matrix: $\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times N$ matrix with every entry $1/N$
- Solve the eigenproblem $\tilde{K} \alpha_k = \lambda_k \alpha_k$
- Normalize so that $\lambda_k \alpha_k^T \alpha_k = 1$, i.e. $\alpha_k \leftarrow \alpha_k / \sqrt{\lambda_k}$ for unit-norm eigenvectors

New sample projection: $z_k(x) = v_k^T \phi(x) = \sum_{i=1}^{N} \alpha_{k,i} \, \tilde{k}(x_i, x)$, where $\tilde{k}$ denotes the centered kernel evaluation
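The solving steps can be sketched with numpy alone. The `kernel_pca` helper and the concentric-rings data below are my own illustrative choices:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca(X, k=2, sigma=1.0):
    N = len(X)
    K = rbf_kernel(X, sigma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one     # double centering
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]         # descending eigenvalues
    alphas = vecs[:, :k] / np.sqrt(vals[:k])       # enforce lambda * a^T a = 1
    return Kc @ alphas                             # projections of training data

rng = np.random.default_rng(0)
# two concentric rings: not linearly separable in the original 2D space
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
X += 0.05 * rng.normal(size=X.shape)
Z = kernel_pca(X, k=2, sigma=1.0)
```

With an RBF kernel the two rings, entangled in the original plane, become separable along the leading kernel principal components.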
LDA: Linear Discriminant Analysis
Fisher's Discriminant Criterion
Supervised dimensionality reduction: Utilizes class labels
Data: $\{(x_i, y_i)\}_{i=1}^{N}$ with class labels $y_i$. Goal: find a projection direction $w$ that simultaneously:
- Maximizes inter-class distance (means of different classes are far apart)
- Minimizes intra-class distance (samples within same class are close)
Binary LDA Derivation
Between-class scatter: $S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T$, where $\mu_c$ is the mean of class $c$
Within-class scatter: $S_W = \sum_{c \in \{0, 1\}} \sum_{i: y_i = c} (x_i - \mu_c)(x_i - \mu_c)^T$
Fisher's criterion: $J(w) = \frac{w^T S_B w}{w^T S_W w}$
Solution: $w \propto S_W^{-1} (\mu_1 - \mu_0)$
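The closed-form binary solution can be sketched in a few lines of numpy (the helper name and the toy two-Gaussian data are mine):

```python
import numpy as np

def lda_direction(X, y):
    """Fisher direction w ∝ S_W^{-1} (mu1 - mu0) for binary labels y in {0, 1}."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = sum((X[y == c] - m).T @ (X[y == c] - m) for c, m in ((0, mu0), (1, mu1)))
    w = np.linalg.solve(Sw, mu1 - mu0)     # solve instead of explicit inverse
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(100, 2))    # class 0
X1 = rng.normal([4, 4], 1.0, size=(100, 2))    # class 1
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)
w = lda_direction(X, y)
z0, z1 = X0 @ w, X1 @ w                        # 1D projections of each class
print(z0.mean(), z1.mean())                    # well-separated class means
```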
Multi-class LDA
Solved by the generalized eigendecomposition $S_B w = \lambda S_W w$, i.e. the eigenvectors of $S_W^{-1} S_B$
Select the eigenvectors corresponding to the largest eigenvalues; since $\mathrm{rank}(S_B) \le C - 1$ for $C$ classes, at most $C - 1$ useful directions exist
LDA vs PCA
| Aspect | PCA | LDA |
|---|---|---|
| Type | Unsupervised | Supervised |
| Objective | Maximum variance | Maximum between-class / minimum within-class |
| Output dimension | Any $k \le D$ | At most $C - 1$ |
| Application | Visualization, preprocessing | Classification tasks |
t-SNE: Manifold Learning and Visualization
From SNE to t-SNE
Objective: Map high-dimensional data to low-dimensional (usually 2D/3D), preserving local structure
SNE idea: Define probability of similarity between point pairs in high-dimensional space, also define probability in low-dimensional space, minimize KL divergence between two distributions
High-Dimensional Space Probability
Conditional probability (similarity of $x_j$ to $x_i$): $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$
Symmetrization: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$
Perplexity: Controls $\sigma_i$ via $\mathrm{Perp}(P_i) = 2^{H(P_i)}$, where $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$ is the Shannon entropy; each $\sigma_i$ is tuned so that $P_i$ hits a user-specified perplexity
Usually set perplexity between 5 and 50
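Tuning each $\sigma_i$ to a target perplexity is typically done by binary search, since perplexity increases monotonically with $\sigma_i$. A minimal numpy sketch under that assumption (function names and data are mine):

```python
import numpy as np

def cond_probs(d2_i, sigma):
    """p_{j|i} for one point, given squared distances d2_i to all other points."""
    p = np.exp(-d2_i / (2 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(d2_i, target=30.0, iters=50):
    """Binary-search sigma_i so that 2^H(P_i) matches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = (lo + hi) / 2
        p = cond_probs(d2_i, sigma)
        H = -(p * np.log2(p + 1e-12)).sum()   # Shannon entropy in bits
        if 2 ** H > target:                   # too many effective neighbors
            hi = sigma                        # -> shrink sigma
        else:
            lo = sigma                        # -> grow sigma
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
i = 0
d2 = ((X - X[i]) ** 2).sum(axis=1)
d2 = np.delete(d2, i)                         # exclude the point itself
sigma = sigma_for_perplexity(d2, target=30.0)
p = cond_probs(d2, sigma)
print(2 ** (-(p * np.log2(p)).sum()))         # perplexity close to the target 30
```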
Low-Dimensional Space Probability
t-distribution (Student-t with one degree of freedom): heavy-tailed, $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \ne l} (1 + \|y_k - y_l\|^2)^{-1}}$, which avoids the crowding problem
Why t-distribution?
- Gaussian distribution decays too fast, distant points get crowded together
- t-distribution's heavy tail gives distant points more space
KL Divergence and Gradient
Objective: Minimize the KL divergence $C = \mathrm{KL}(P \| Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$
Gradient: $\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (1 + \|y_i - y_j\|^2)^{-1} (y_i - y_j)$
Physical meaning:
- Each pair of points exerts a spring-like force: when $p_{ij} > q_{ij}$ the pair attracts, when $p_{ij} < q_{ij}$ it repels, and the factor $(1 + \|y_i - y_j\|^2)^{-1}$ weakens forces between distant points
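The gradient formula above can be implemented directly. The sketch below runs plain gradient descent on a toy symmetric $P$; real t-SNE adds momentum and early exaggeration, which are omitted here, and all names and data are illustrative:

```python
import numpy as np

def tsne_q(Y):
    """Low-dimensional similarities q_ij under the Student-t kernel."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    inv = 1.0 / (1.0 + d2)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv

def tsne_grad(Y, P):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij)(1 + ||y_i - y_j||^2)^{-1}(y_i - y_j)"""
    Q, inv = tsne_q(Y)
    diff = Y[:, None, :] - Y[None, :, :]
    return 4 * (((P - Q) * inv)[:, :, None] * diff).sum(axis=1)

def tsne_cost(Y, P):
    """KL(P || Q), summed over pairs with p_ij > 0."""
    Q, _ = tsne_q(Y)
    mask = P > 0
    return (P[mask] * np.log(P[mask] / Q[mask])).sum()

# toy symmetric P and a few plain gradient-descent steps
rng = np.random.default_rng(0)
n = 50
A = rng.random((n, n))
P = A + A.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = 1e-2 * rng.normal(size=(n, 2))   # small random initialization
c0 = tsne_cost(Y, P)
for _ in range(200):
    Y -= 1.0 * tsne_grad(Y, P)
c1 = tsne_cost(Y, P)
print(c0, c1)                        # c1 should fall below c0
```

A sanity check on the physics: because the pairwise forces come in equal and opposite pairs, the gradient sums to zero over all points, so the embedding's center of mass never drifts.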
Q&A
Q1: Why are the two perspectives of PCA (maximum variance/minimum reconstruction error) equivalent?
A: Mathematically, both derive the same eigenvalue equation. Intuitively:
- Directions with large variance contain more information
- Small reconstruction error means projection loses little information
These are fundamentally the same: preserving the main directions of data variation.
Q2: Is PCA sensitive to data scaling?
A: Very sensitive. If features have different units (e.g., height in cm vs weight in kg), features with large variance will dominate the principal components. Solution: standardize each feature to zero mean and unit variance before PCA: $x_{ij} \leftarrow \frac{x_{ij} - \mu_j}{\sigma_j}$
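The standardization step is a one-liner in numpy (the height/weight toy data are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# height in cm (std ~10) and weight in kg (std ~5): mixed units
X = np.c_[rng.normal(170, 10, 500), rng.normal(70, 5, 500)]
Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # z-score standardization
print(Xs.mean(axis=0).round(6), Xs.std(axis=0))  # ~[0 0] and [1 1]
```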
Q3: How to choose perplexity in t-SNE?
A: Perplexity controls the effective neighbor count. Rules of thumb:
- Small datasets (<1000): Perp = 5-30
- Medium datasets (1000-10000): Perp = 30-50
- Large datasets: Perp = 50-100
Too small: local structure is over-emphasized and global structure is lost. Too large: global structure dominates local structure.
Q4: Why does t-SNE use t-distribution instead of Gaussian?
A: Solves the "Crowding Problem". Moderate-distance points in high-dimensional space have insufficient space when projected to low dimensions. The t-distribution's heavy-tail property allows:
- Nearby points to remain close in low dimensions
- Distant points to have more space to spread out
Q5: Is t-SNE suitable for clustering?
A: Not really. t-SNE problems:
1. Does not preserve distances: only preserves local probability structure, not global distances
2. Randomness: different random initializations give different results
3. Not interpretable: low-dimensional coordinates have no clear meaning
Recommendation: Visualize with t-SNE, then cluster with K-means etc. in original space.
✏️ Exercises and Solutions
Exercise 1: PCA Variance
Problem: Given the eigenvalues $\lambda_1 \ge \dots \ge \lambda_D$ of the covariance matrix, what fraction of the variance do the first $k$ principal components retain? Solution: $\eta(k) = \sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{D} \lambda_j$.
Exercise 2: Centering
Problem: Why center data before PCA? Solution: Covariance requires zero mean. Without centering, PC1 points to mean direction.
Exercise 3: t-SNE vs PCA
Problem: Compare t-SNE and PCA. Solution: PCA: linear, global, preserves variance. t-SNE: nonlinear, local, preserves neighborhoods.
Exercise 4: Kernel PCA
Problem: How does kernel PCA handle nonlinear data?
Solution: The kernel function implicitly maps the data into a high-dimensional feature space where nonlinear structure becomes linear; PCA is then performed on the centered kernel matrix, so the mapping is never computed explicitly.
Exercise 5: Whitening
Problem: What is PCA whitening?
Solution: After projecting onto the principal components, rescale each component by $1 / \sqrt{\lambda_j}$, i.e. $z = \Lambda^{-1/2} W^T \tilde{x}$, so that all components have unit variance and are decorrelated.
References
[1] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559-572.
[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
[3] Scholkopf, B., Smola, A., & Muller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319.
[4] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.
[5] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- Post title: Machine Learning Mathematical Derivations (17): Dimensionality Reduction and PCA
- Post author: Chen Kai
- Create time: 2021-11-29 09:30:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-17-Dimensionality-Reduction-and-PCA/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.