A kernel is a function $k(x, x') = \langle \phi(x), \phi(x') \rangle$ that computes an inner product in a (possibly implicit, even infinite-dimensional) feature space, letting linear algorithms learn non-linear structure.
## Why Kernel Methods Matter

### The linear limitation
Many ML algorithms (linear regression, PCA, linear SVM) only work well when data is linearly separable or has linear structure.
Problem: Most real-world data is non-linear.
Naive solution: Manually engineer features
Issues:
- Tedious: Requires domain knowledge
- Combinatorial explosion: High-degree polynomials → exponential feature count
- Computational cost: Storing and computing high-dimensional feature vectors $\phi(x)$ is expensive
### The kernel trick: implicit feature mapping
Key idea: If an algorithm only uses inner products $\langle \phi(x), \phi(x') \rangle$, we can replace each of them with a kernel evaluation $k(x, x')$ and never compute $\phi(x)$ explicitly.
Example: Polynomial kernel. In 2D, $k(x, y) = (x^\top y)^2$ equals $\langle \phi(x), \phi(y) \rangle$ with $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, so the 3-dimensional feature map is evaluated at the cost of a single dot product.
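To make this concrete, here is a small sketch (with illustrative helper names `phi` and `poly_kernel`, not from any library) verifying that the degree-2 polynomial kernel in 2D matches its explicit feature map:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without constructing phi."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print(np.dot(phi(x), phi(y)))  # 121.0
print(poly_kernel(x, y))       # 121.0
```

Both routes give the same number, but the kernel route never builds the feature vectors.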
## Mathematical Foundation

### Positive definite kernels
A kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if it is symmetric and, for any finite set of points $x_1, \dots, x_n$, the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semi-definite ($c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$).
Why it matters: Positive definite kernels correspond to valid inner products in some feature space.
### Reproducing Kernel Hilbert Space (RKHS)
A Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ is an RKHS with kernel $k$ if $k(x, \cdot) \in \mathcal{H}$ for every $x$ and the reproducing property holds: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.
Moore-Aronszajn theorem: Every positive definite kernel uniquely defines an RKHS.
Practical implication: Kernel methods (SVM, Gaussian Processes) optimize functions in an RKHS, with the kernel controlling the "smoothness" and structure.
### Mercer's theorem
If $k$ is a continuous positive definite kernel on a compact domain, it admits an eigendecomposition $k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x')$ with eigenvalues $\lambda_i \ge 0$ and orthonormal eigenfunctions $\phi_i$.
Intuition: The kernel can be decomposed into a (possibly infinite) sum of basis functions.
Example: The RBF kernel has an infinite-dimensional feature space (all Mercer eigenvalues are strictly positive).
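The finite-sample analogue of Mercer's decomposition is the eigendecomposition of a Gram matrix. A minimal sketch, assuming an RBF kernel with $\gamma = 0.5$ on random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# RBF Gram matrix with gamma = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# Finite-sample analogue of Mercer: K = sum_i lam_i q_i q_i^T
lam, Q = np.linalg.eigh(K)
K_rebuilt = (Q * lam) @ Q.T

print(np.allclose(K, K_rebuilt))  # True
print(lam.min())                  # (numerically) nonnegative
```

The nonnegative eigenvalues are exactly the positive-definiteness condition from the previous section.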
## The Kernel Trick in Action

### Example: Kernel SVM
Linear SVM (primal): $\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$.

Dual form (only needs dot products): $\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$ subject to $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$.

Kernel trick: Replace every $x_i^\top x_j$ with $k(x_i, x_j)$ to learn a non-linear decision boundary without ever leaving the dual.
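A quick sketch using scikit-learn's `SVC` (which solves this dual) on a toy non-linear dataset, comparing linear and RBF kernels:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_acc = SVC(kernel='linear').fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)
print(f"linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}")
```

On this data the RBF kernel should recover the curved decision boundary that the linear kernel cannot.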
### Example: Kernel PCA
Standard PCA: Eigendecomposition of the covariance matrix $C = \frac{1}{n} \sum_i x_i x_i^\top$.

Kernel PCA: Eigendecomposition of the (centered) kernel matrix $K_{ij} = k(x_i, x_j)$.

Result: Extract non-linear principal components without explicitly constructing $\phi(x)$.
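A brief sketch with scikit-learn's `KernelPCA`, assuming concentric-circles toy data where linear PCA cannot separate the two rings:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: no linear projection separates the rings
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA extracts non-linear components from the kernel matrix
k_pca = KernelPCA(n_components=2, kernel='rbf', gamma=2.0).fit_transform(X)
print(k_pca.shape)  # (300, 2)
```

In the kernel PCA coordinates the two rings typically become (close to) linearly separable; `gamma=2.0` here is an illustrative choice, not a tuned value.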
## Common Kernels: Theory and Practice
### 1. RBF (Gaussian) Kernel

$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
Properties:
- Infinite-dimensional feature space (all Mercer eigenvalues > 0)
- Smooth: Infinitely differentiable
- Universal: Can approximate any continuous function (given enough data)
When to use:
- Default choice for SVM, Gaussian Processes
- Data has smooth, local structure
Hyperparameter:
- $\sigma$ (bandwidth), or equivalently $\gamma = 1/(2\sigma^2)$

Tuning tips:
- Cross-validation over $\gamma$ (log scale)
- Heuristic: set $\sigma$ to the median pairwise distance between training points (the "median heuristic")
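The median heuristic for the bandwidth can be sketched in a few lines (assuming NumPy/SciPy and a toy data matrix `X`):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data matrix

# Median heuristic: sigma = median pairwise distance, gamma = 1 / (2 sigma^2)
sigma = np.median(pdist(X))
gamma = 1.0 / (2.0 * sigma**2)
print(sigma, gamma)
```

This gives a sensible starting point for the cross-validation grid rather than a final value.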
### 2. Polynomial Kernel

$$k(x, x') = (\gamma\, x^\top x' + c)^d$$
Properties:
- Finite-dimensional feature space (all monomials of degree $\le d$)
- Non-smooth at the origin (if $c = 0$)
When to use:
- Data has polynomial relationships
- Text classification (sparse, high-dim data)
Hyperparameters:
- $d$ (degree), $\gamma$ (scale), $c$ (offset)
### 3. Linear Kernel

$$k(x, x') = x^\top x'$$
Properties:
- Simplest kernel (no feature mapping)
- Fast: $O(d)$ per kernel evaluation ($d$ = feature dimension)
When to use:
- Data is already linearly separable
- High-dimensional sparse data (text, genes)
- Baseline comparison
Hyperparameter: None (only SVM regularization $C$)
### 4. Sigmoid Kernel

$$k(x, x') = \tanh(\gamma\, x^\top x' + c)$$
Properties:
- Mimics neural network activation
- Not always PSD (only for certain $\gamma$ and $c$)
When to use: Rarely in modern practice (neural networks are better for non-PSD cases).
Pitfall: Can produce invalid Gram matrices.
### 5. Matérn Kernel (Gaussian Processes)

$$k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right), \quad r = \|x - x'\|$$
Properties:
- Tunable smoothness: $\nu$ controls differentiability
- $\nu = 1/2$ → exponential kernel (not differentiable)
- $\nu \to \infty$ → RBF kernel (infinitely smooth)
- Common choices in GPs: $\nu = 3/2$ and $\nu = 5/2$
When to use:
- Gaussian Process regression
- Need to control function smoothness (e.g., robotics, geostatistics)

Hyperparameters:
- $\nu$ (smoothness), $\ell$ (length scale)
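A small check of the $\nu = 1/2$ special case, using scikit-learn's `Matern` kernel: it coincides exactly with the exponential kernel $\exp(-|x - x'| / \ell)$:

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern

X = np.linspace(0.0, 3.0, 10).reshape(-1, 1)

# nu = 1/2 reduces the Matern kernel to the exponential kernel exp(-|x - x'| / l)
K_matern = Matern(length_scale=1.0, nu=0.5)(X)
K_exp = np.exp(-np.abs(X - X.T))
print(np.allclose(K_matern, K_exp))  # True
```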
### 6. Periodic Kernel (Time Series)

$$k(x, x') = \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right)$$
Properties:
- Captures periodic patterns (period $p$)
- Stationary: depends only on $|x - x'|$
When to use:
- Time series with seasonality (e.g., temperature, sales)
- Audio signals (pitch detection)

Hyperparameters:
- $p$ (period), $\ell$ (length scale)
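A minimal sketch with scikit-learn's `ExpSineSquared` (its periodic kernel): two points exactly one period apart have kernel value 1, i.e. the kernel treats them as identical:

```python
import numpy as np
from sklearn.gaussian_process.kernels import ExpSineSquared

kernel = ExpSineSquared(length_scale=1.0, periodicity=2.0)  # period p = 2

# One full period apart: sin(pi * |x - x'| / p) = 0, so k = 1
x0 = np.array([[0.0]])
x_shift = np.array([[2.0]])
print(kernel(x0, x_shift))  # [[1.]]
```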
## Hyperparameter Tuning

### Cross-validation (most common)
1. Split data into $k$ folds
2. For each hyperparameter combination (e.g., $(\gamma, C)$):
   - Train on $k - 1$ folds, validate on the remaining fold
   - Repeat so each fold is used for validation once, and average the scores
3. Pick the hyperparameters with the best average validation score
Grid search (exhaustive):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],  # gamma = 1/(2*sigma^2) in sklearn
    'kernel': ['rbf'],
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
```
Random search (faster for high-dimensional hyperparameter spaces):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_dist = {
    'C': loguniform(0.01, 100),
    'gamma': loguniform(0.0001, 1),
}
random_search = RandomizedSearchCV(SVC(kernel='rbf'), param_dist, n_iter=50, cv=5)
random_search.fit(X_train, y_train)
```
### Maximum likelihood (Gaussian Processes)
For GP regression, kernel hyperparameters $\theta$ are often tuned by maximizing the log marginal likelihood:

$$\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top (K_\theta + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log \lvert K_\theta + \sigma_n^2 I \rvert - \tfrac{n}{2} \log 2\pi$$

where $K_\theta$ is the kernel matrix and $\sigma_n^2$ the observation noise.
Advantage: Principled, no need for validation set.
Disadvantage: Can overfit if data is noisy (add priors on hyperparameters).
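A short sketch with scikit-learn's `GaussianProcessRegressor`, whose `fit` maximizes the log marginal likelihood over the kernel hyperparameters (toy sine data with an assumed noise level of 0.05):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=40)

# fit() maximizes the log marginal likelihood over the kernel hyperparameters
gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),
                              alpha=0.05**2, random_state=0)
gp.fit(X, y)
print(gp.kernel_)  # optimized hyperparameters
print(gp.log_marginal_likelihood(gp.kernel_.theta))
```

No validation set is needed: the length scale and signal variance are chosen by the marginal likelihood alone.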
## Troubleshooting Guide

### Problem 1: Overfitting (high training accuracy, poor test accuracy)
Symptoms:
- Training accuracy → 100%
- Test accuracy much lower
- Kernel matrix nearly diagonal (RBF with very small $\sigma$)
Causes:
- Kernel too complex (e.g., RBF with small $\sigma$, polynomial with high $d$)
- Regularization too weak (SVM: $C$ too large)

Solutions:
1. Increase the kernel bandwidth (larger $\sigma$, i.e. smaller $\gamma$)
2. Strengthen regularization (decrease $C$)
3. Use a simpler kernel (e.g., lower polynomial degree)
### Problem 2: Underfitting (low training and test accuracy)
Symptoms:
- Both training and test accuracy low
- Model output almost constant (e.g., predicts same class for all inputs)
Causes:
- Kernel too simple (e.g., linear kernel on non-linear data)
- Kernel bandwidth too large (RBF acts like constant)
- Regularization too strong (SVM: $C$ too small)
Solutions:
1. Use a more expressive kernel (linear → polynomial → RBF)
2. Decrease the kernel bandwidth (smaller $\sigma$, i.e. larger $\gamma$)
3. Weaken regularization (increase $C$)
### Problem 3: Numerical instability (matrix inversion fails)
Symptoms:
- GP training crashes with "singular matrix" error
- Kernel PCA eigenvalues include negative values
- SVM optimization does not converge
Causes:
- Kernel matrix not PSD (numerical errors, invalid kernel)
- Duplicate or near-duplicate data points (kernel matrix rank-deficient)
Solutions:
1. Add jitter to the diagonal:

```python
K = K + 1e-6 * np.eye(n)  # small ridge restores positive definiteness
```

2. Remove duplicate or near-duplicate data points
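A minimal demonstration of why jitter is needed, assuming an RBF Gram matrix built from toy data with exact duplicate points (which makes the matrix singular):

```python
import numpy as np

# Rows 2 and 3 are exact duplicates, so the RBF Gram matrix is singular
X = np.array([[0.0], [1.0], [1.0]])
K = np.exp(-(X - X.T) ** 2)

try:
    np.linalg.cholesky(K)
except np.linalg.LinAlgError:
    print("Cholesky failed: K is singular")

# A tiny diagonal jitter restores (numerical) positive definiteness
L = np.linalg.cholesky(K + 1e-6 * np.eye(3))
print("Cholesky with jitter succeeded")
```

The jitter acts like a small amount of observation noise, so it barely changes predictions while fixing the factorization.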
### Problem 4: Slow training (large datasets)
Symptoms:
- SVM takes hours on 10k+ samples
- GP training/prediction infeasible beyond 5k samples
Causes:
- Kernel methods scale as $O(n^2)$ memory and $O(n^3)$ time (kernel matrix construction and inversion)
Solutions:
1. Use a linear kernel (reduces to a linear model trainable in roughly $O(nd)$ time)
2. Approximate the kernel:
   - Nyström approximation (low-rank approximation of the kernel matrix)
   - Random Fourier features (approximate the RBF kernel with an explicit random feature map)
   - Sparse GPs (inducing points)
3. Switch to deep learning (scales better for large $n$)
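A sketch of the approximation route using scikit-learn's `Nystroem` transformer: map the data into an approximate RBF feature space, then train a linear SVM (via `SGDClassifier`) on top at linear cost per sample:

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=2000, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Approximate RBF features (rank-100 Nystroem map) + linear SVM trained by SGD
model = make_pipeline(
    Nystroem(kernel='rbf', gamma=1.0, n_components=100, random_state=0),
    SGDClassifier(loss='hinge', random_state=0),
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```

The `gamma` and `n_components` values here are illustrative; `n_components` trades approximation quality for speed.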
## Kernel Selection Flowchart

1. Start: Is the data linearly separable, or high-dimensional and sparse (text, genes)? → Linear kernel
2. Clear periodicity (seasonality)? → Periodic kernel
3. Gaussian Process with explicit smoothness control? → Matérn kernel
4. Suspected polynomial relationships? → Polynomial kernel
5. Otherwise (smooth, local structure) → RBF kernel
## Practical Tips

### 1. Always normalize features
Why: Distance-based kernels (RBF, Matérn) are scale-sensitive. Features with large magnitudes dominate.
```python
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
```
### 2. Start with RBF, then experiment
Rule of thumb:
- First try: RBF kernel with grid search over $\gamma$ and $C$
- If it underfits: Try polynomial (degree 2 or 3)
- If too slow: Try linear
### 3. Visualize kernel matrix
```python
import matplotlib.pyplot as plt
plt.imshow(K); plt.colorbar(); plt.show()  # K is the n x n kernel matrix
```
What to look for:
- Diagonal pattern: Kernel too localized (small $\sigma$), risk of overfitting
- Uniform block: Kernel too global (large $\sigma$), underfitting
- Structured patterns: Good sign (capturing data relationships)
### 4. Check kernel validity
For custom kernels, ensure they are positive definite:
```python
import numpy as np
assert np.allclose(K, K.T)                  # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8  # (numerically) positive semi-definite
```
## Comparison: Kernel Methods vs Deep Learning
| Aspect | Kernel Methods | Deep Learning |
|---|---|---|
| Training data | Works well with small data (<10k) | Needs large data (>10k) |
| Interpretability | High (RKHS functions, kernel weights) | Low (black box) |
| Hyperparameters | Few (kernel params + regularization) | Many (architecture, LR, batch, etc.) |
| Scalability | Poor ($O(n^2)$ memory, $O(n^3)$ time) | Good (mini-batch, GPU) |
| Theory | Strong (RKHS, Mercer's theorem) | Weaker (empirical success) |
When to use kernel methods:
- Small to mid-size datasets (<10k samples)
- Need probabilistic predictions (Gaussian Processes)
- Want interpretability (SVMs have support vectors)
When to use deep learning:
- Large datasets (>100k samples)
- Complex structures (images, text, graphs)
- Scalability matters
## Summary: Kernel Methods in 5 Steps
1. Check data: Linearly separable? → Linear kernel. Otherwise → RBF.
2. Normalize features (crucial for RBF/Matérn/periodic kernels)
3. Tune hyperparameters: Cross-validation for $\gamma$, $C$, $d$, etc.
4. Diagnose issues: Overfitting → increase $\sigma$ (decrease $\gamma$). Underfitting → decrease $\sigma$ or use a more expressive kernel.
5. Scale up: If the data is too large → use Nyström/random features or switch to deep learning.
Key hyperparameters:
- RBF: $\sigma$ (bandwidth), $C$ (regularization)
- Polynomial: $d$ (degree), $\gamma$ (scale), $c$ (offset)
- Matérn: $\nu$ (smoothness), $\ell$ (length scale)
Common pitfalls:
- Forgetting to normalize features (RBF fails silently)
- Using the sigmoid kernel (not always PSD)
- Overfitting with small $\sigma$ or high polynomial degree
- Post title: Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)
- Post author: Chen Kai
- Create time: 2024-02-28 00:00:00
- Post link: https://www.chenk.top/en/kernel-methods/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.