Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)
Chen Kai

A kernel lets you use linear methods on non-linear problems by implicitly mapping data into a (possibly very high-dimensional) feature space. This note builds the intuition (the kernel trick), the math foundation (positive definite kernels, RKHS, Mercer's theorem), and the practical side (how to choose kernels and tune hyperparameters). We cover common kernels (RBF, polynomial, Matérn, periodic), troubleshooting (overfitting, underfitting, numerical issues), and a decision flowchart for kernel selection in SVM, Gaussian Processes, and Kernel PCA.

Why Kernel Methods Matter

The linear limitation

Many ML algorithms (linear regression, PCA, linear SVM) only work well when data is linearly separable or has linear structure.

Problem: Most real-world data is non-linear.

Naive solution: Manually engineer features (e.g., polynomials, interactions).

Issues:

  • Tedious: Requires domain knowledge
  • Combinatorial explosion: High-degree polynomials → exponential feature count
  • Computational cost: Storing and computing high-dimensional features is expensive
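To make the combinatorial explosion concrete, here is a small illustrative calculation (not from the original post): the number of monomials of degree at most $d$ in $p$ variables is $\binom{p+d}{d}$.

```python
from math import comb

def n_poly_features(p, d):
    """Number of monomials of degree <= d in p variables: C(p + d, d)."""
    return comb(p + d, d)

# Feature counts explode with input dimension at a fixed degree d = 3:
for p in (10, 100, 1000):
    print(p, n_poly_features(p, 3))  # 286, 176851, 167668501
```

Already at a thousand input features and degree 3, explicitly materializing the feature map would need over 10^8 columns.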

The kernel trick: implicit feature mapping

Key idea: If an algorithm only needs inner products $\langle \phi(x), \phi(y) \rangle$, and we can compute

$$k(x, y) = \langle \phi(x), \phi(y) \rangle$$

without explicitly computing $\phi(x)$, then we work in the high-dimensional space implicitly.

Example: The polynomial kernel $k(x, y) = (x^\top y + c)^d$ corresponds to all polynomial features up to degree $d$, but we compute only one dot product instead of storing $\binom{p+d}{d}$ features (for $p$ input dimensions).
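As a quick numerical sanity check (illustrative, not from the original post), the degree-2 kernel $(x^\top y + 1)^2$ in two dimensions equals the inner product of an explicit 6-dimensional feature map:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = phi(x) @ phi(y)          # inner product in 6-D feature space
implicit = (x @ y + 1.0) ** 2       # one dot product in 2-D
print(np.isclose(explicit, implicit))  # True
```

The kernel evaluation touches only the original 2-D vectors; the 6-D space is never constructed in practice.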


Mathematical Foundation

Positive definite kernels

A kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if for any finite set $\{x_1, \dots, x_n\} \subset \mathcal{X}$ and any real vector $c \in \mathbb{R}^n$:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \ge 0$$

Equivalently, the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semi-definite (PSD).

Why it matters: Positive definite kernels correspond to valid inner products in some feature space.

Reproducing Kernel Hilbert Space (RKHS)

A Hilbert space $\mathcal{H}$ is an RKHS if:

1. Elements are functions $f: \mathcal{X} \to \mathbb{R}$
2. Point evaluation $f \mapsto f(x)$ is continuous
3. There exists a reproducing kernel $k$ such that:

$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}$$

Moore-Aronszajn theorem: Every positive definite kernel uniquely defines an RKHS.

Practical implication: Kernel methods (SVM, Gaussian Processes) optimize functions in an RKHS, with the kernel controlling the "smoothness" and structure.

Mercer's theorem

If $k$ is continuous, symmetric, and positive definite on a compact domain, then:

$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

where $\lambda_i \ge 0$ are non-negative eigenvalues and $\phi_i$ are orthonormal eigenfunctions.

Intuition: The kernel can be decomposed into a (possibly infinite) sum of basis functions.

Example: The RBF kernel has an infinite-dimensional feature space (all Mercer eigenvalues $\lambda_i > 0$).
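A finite-sample analogue of Mercer's decomposition can be checked numerically: the eigendecomposition of an RBF Gram matrix has non-negative eigenvalues and reconstructs the matrix exactly (a sketch with arbitrary synthetic data):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, gamma=0.5)

# Empirical analogue of Mercer's theorem: K = V diag(lam) V^T
lam, V = np.linalg.eigh(K)
print(lam.min() >= -1e-10)                      # eigenvalues are (numerically) non-negative
print(np.allclose(V @ np.diag(lam) @ V.T, K))   # K is recovered from the expansion
```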


The Kernel Trick in Action

Example: Kernel SVM

Linear SVM (primal):

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0$$

Dual form (only needs dot products):

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \ \sum_i \alpha_i y_i = 0$$

Kernel trick: Replace $x_i^\top x_j$ with $k(x_i, x_j)$. Non-linear SVM at the same computational cost!
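A minimal sketch of this in scikit-learn, on the synthetic two-moons dataset (an illustrative choice, not from the original post): the same SVC estimator switches from linear to non-linear behavior purely by swapping the kernel.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Only the `kernel` argument changes; the dual optimization is identical
for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr).score(X_te, y_te)
    print(kernel, round(acc, 3))
```

On this dataset the RBF kernel should clearly outperform the linear kernel, since the two moons are not linearly separable.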

Example: Kernel PCA

Standard PCA: Eigendecomposition of covariance matrix.

Kernel PCA: Eigendecomposition of the (centered) kernel matrix $K$.

Result: Extract non-linear principal components without explicitly constructing $\phi(x)$.
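For example, with scikit-learn's KernelPCA on concentric circles (an illustrative dataset where linear PCA fails), the RBF kernel makes the two rings linearly separable in the projected space:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA: eigendecomposition of the centered kernel matrix
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)
print(Z.shape)  # (400, 2)
```

In the transformed coordinates `Z`, a plain linear classifier can separate the inner and outer circles.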


Common Kernels: Theory and Practice

1. RBF (Gaussian) Kernel

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$

Properties:

  • Infinite-dimensional feature space (all Mercer eigenvalues > 0)
  • Smooth: Infinitely differentiable
  • Universal: Can approximate any continuous function (given enough data)

When to use:

  • Default choice for SVM, Gaussian Processes
  • Data has smooth, local structure

Hyperparameter:

  • $\sigma$ (bandwidth): Controls the "influence radius"
    - Small $\sigma$: High curvature, risk of overfitting (memorizes training data)
    - Large $\sigma$: Low curvature, risk of underfitting (acts like a linear kernel)

Tuning tips:

  • Cross-validation over $\sigma$ (log scale)
  • Heuristic: set $\sigma$ to the median pairwise distance between training points
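One common form of this bandwidth heuristic is the "median heuristic" (an assumption on my part, since the original formula did not survive extraction): set $\sigma$ to the median pairwise Euclidean distance, then convert to sklearn's gamma parameterization.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_sigma(X):
    """Set the RBF bandwidth sigma to the median pairwise Euclidean distance."""
    return np.median(pdist(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sigma = median_heuristic_sigma(X)
gamma = 1.0 / (2.0 * sigma**2)   # sklearn's RBF is exp(-gamma * ||x - y||^2)
print(sigma, gamma)
```

This gives a sensible starting point for the cross-validation grid rather than a final answer.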

2. Polynomial Kernel

$$k(x, y) = (\gamma\, x^\top y + c)^d$$

Properties:

  • Finite-dimensional feature space (all monomials up to degree $d$)
  • Reduces to the homogeneous polynomial kernel (only degree-$d$ terms) when $c = 0$

When to use:

  • Data has polynomial relationships
  • Text classification (sparse, high-dim data)

Hyperparameters:

  • $d$ (degree): Typically 2 or 3 (higher risks overfitting)
  • $\gamma$ (scale): Affects sensitivity to feature magnitudes
  • $c$ (offset): Often 0 or 1

Pitfall: High-degree polynomials often overfit.
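The effect of the degree can be sketched on a synthetic dataset (two moons here is an illustrative choice); comparing train and test accuracy across $d$ shows how the train/test gap tends to widen at high degree:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

results = {}
for d in (2, 3, 10):
    clf = SVC(kernel="poly", degree=d, coef0=1.0, C=1.0).fit(X_tr, y_tr)
    results[d] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"d={d}: train={results[d][0]:.3f}, test={results[d][1]:.3f}")
```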

3. Linear Kernel

$$k(x, y) = x^\top y$$

Properties:

  • Simplest kernel (no feature mapping)
  • Fast: $O(p)$ per kernel evaluation ($p$ = feature dimension)

When to use:

  • Data is already linearly separable
  • High-dimensional sparse data (text, genes)
  • Baseline comparison

Hyperparameter: None (only the SVM regularization parameter $C$).

4. Sigmoid Kernel

$$k(x, y) = \tanh(\gamma\, x^\top y + c)$$

Properties:

  • Mimics neural network activation
  • Not always PSD (only for certain values of $\gamma$ and $c$)

When to use: Rarely in modern practice (neural networks are better for non-PSD cases).

Pitfall: Can produce invalid Gram matrices.

5. Matérn Kernel (Gaussian Processes)

Properties:

  • Tunable smoothness: $\nu$ controls differentiability
    - $\nu = 1/2$ → Exponential kernel (not differentiable)
    - $\nu \to \infty$ → RBF kernel (infinitely smooth)
  • Common in GPs: $\nu = 3/2$ and $\nu = 5/2$

When to use:

  • Gaussian Process regression

  • Need to control function smoothness (e.g., robotics, geostatistics)

Hyperparameters:

  • $\nu$ (smoothness): Typically fixed (1/2, 3/2, or 5/2)
  • $\ell$ (length scale): Tune via maximum likelihood or cross-validation
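A minimal GP regression sketch with scikit-learn's Matern kernel (synthetic data; $\nu$ fixed at 2.5, length scale optimized by marginal likelihood during fit):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

# nu stays fixed; the length scale is fitted by maximizing the marginal likelihood
kernel = Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01).fit(X, y)
print(gp.kernel_)   # kernel with the fitted length scale
```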

6. Periodic Kernel (Time Series)

$$k(x, x') = \exp\!\left(-\frac{2 \sin^2\!\big(\pi |x - x'| / p\big)}{\ell^2}\right)$$

Properties:

  • Captures periodic patterns (period $p$)

  • Stationary: Depends only on $|x - x'|$

When to use:

  • Time series with seasonality (e.g., temperature, sales)

  • Audio signals (pitch detection)

Hyperparameters:

  • $p$ (period): Should match the data's periodicity (tune, or set from domain knowledge)
  • $\ell$ (length scale): Controls smoothness within each period
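A sketch with scikit-learn's ExpSineSquared (its periodic kernel), fitting a noisy sine with true period $2\pi$; the initial periodicity guess below is an illustrative choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared

# Noisy periodic signal with known period 2*pi
rng = np.random.default_rng(0)
X = np.linspace(0, 20, 80).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

kernel = ExpSineSquared(length_scale=1.0, periodicity=6.0)  # initial guess near 2*pi
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01).fit(X, y)
print(gp.kernel_)  # fitted kernel; periodicity is optimized from the initial guess
```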


Hyperparameter Tuning

Cross-validation (most common)

  1. Split data into $k$ folds

  2. For each hyperparameter combination (e.g., $(C, \gamma)$):

    • Train on $k - 1$ folds, validate on the remaining fold
  3. Pick the hyperparameters with the best average validation score

Grid search (exhaustive):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],  # gamma = 1/(2*sigma^2) in sklearn
    'kernel': ['rbf']
}

grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
```

Random search (faster for high-dim hyperparameter spaces):

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

param_dist = {
    'C': loguniform(0.01, 100),
    'gamma': loguniform(0.0001, 1),
}

random_search = RandomizedSearchCV(SVC(kernel='rbf'), param_dist, n_iter=50, cv=5)
random_search.fit(X_train, y_train)
```

Maximum likelihood (Gaussian Processes)

For GP regression, kernel hyperparameters $\theta$ are often tuned by maximizing the log marginal likelihood:

$$\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K_\theta^{-1} y - \tfrac{1}{2} \log |K_\theta| - \tfrac{n}{2} \log 2\pi$$

where $K_\theta$ is the kernel matrix (including the noise variance on the diagonal).

Advantage: Principled, no need for validation set.

Disadvantage: Can overfit if data is noisy (add priors on hyperparameters).
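A sketch with scikit-learn's GaussianProcessRegressor, which maximizes the log marginal likelihood during `fit` (the RBF + WhiteKernel combination below is an illustrative choice):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=50)

# Length scale and noise level are tuned by maximizing the log marginal likelihood
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
print(gp.kernel_)
print(round(gp.log_marginal_likelihood(gp.kernel_.theta), 2))
```

No validation split is needed here; the noise level learned by WhiteKernel acts as the model's own estimate of the label noise.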


Troubleshooting Guide

Problem 1: Overfitting (high training accuracy, poor test accuracy)

Symptoms:

  • Training accuracy → 100%
  • Test accuracy much lower
  • Kernel matrix nearly diagonal (RBF with very small $\sigma$)

Causes:

  • Kernel too complex (e.g., RBF with small $\sigma$, polynomial with high $d$)
  • Regularization too weak (SVM: $C$ too large)

Solutions:

  1. Increase the kernel bandwidth ($\sigma$ for RBF); lower $d$ for polynomial
  2. Increase regularization (lower $C$ in SVM, increase noise variance in GP)
  3. Use a simpler kernel (linear instead of RBF)
  4. Add more training data

Problem 2: Underfitting (low training and test accuracy)

Symptoms:

  • Both training and test accuracy low
  • Model output almost constant (e.g., predicts same class for all inputs)

Causes:

  • Kernel too simple (e.g., linear kernel on non-linear data)
  • Kernel bandwidth too large (RBF acts like a constant kernel)
  • Regularization too strong (small $C$ in SVM)

Solutions:

  1. Use a more expressive kernel (linear → polynomial → RBF)
  2. Decrease the kernel bandwidth (smaller $\sigma$ for RBF)
  3. Decrease regularization (increase $C$ in SVM)

Problem 3: Numerical instability (matrix inversion fails)

Symptoms:

  • GP training crashes with "singular matrix" error
  • Kernel PCA eigenvalues include negative values
  • SVM optimization does not converge

Causes:

  • Kernel matrix not PSD (numerical errors, invalid kernel)
  • Duplicate or near-duplicate data points (kernel matrix rank-deficient)

Solutions:

  1. Add jitter to the diagonal:

```python
K = K + 1e-6 * np.eye(n)
```

  2. Normalize features (crucial for distance-based kernels like RBF)
  3. Remove duplicates from the training data
  4. Use a more numerically stable kernel (avoid sigmoid; prefer RBF)

Problem 4: Slow training (large datasets)

Symptoms:

  • SVM takes hours on 10k+ samples
  • GP training/prediction infeasible beyond 5k samples

Causes:

  • Kernel methods scale as $O(n^2)$ in memory and $O(n^3)$ in time (matrix inversion)

Solutions:

  1. Use a linear kernel (reduces to $O(p)$ per sample)
  2. Use approximate methods:

    • Nyström approximation (low-rank approximation of the kernel matrix)
    • Random Fourier features (approximate the RBF kernel with explicit random feature maps)
    • Sparse GPs (inducing points)
  3. Switch to deep learning (scales better for large $n$)
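The first two approximations can be sketched with scikit-learn's kernel_approximation module: map inputs through an explicit approximate feature map, then train a fast linear classifier on top (dataset and hyperparameters below are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Approximate the RBF kernel with an explicit low-dimensional feature map,
# then train a linear model on the mapped features (linear in time and memory)
scores = {}
for approx in (RBFSampler(gamma=1.0, n_components=100, random_state=0),
               Nystroem(gamma=1.0, n_components=100, random_state=0)):
    clf = make_pipeline(approx, SGDClassifier(random_state=0)).fit(X_tr, y_tr)
    scores[type(approx).__name__] = clf.score(X_te, y_te)
    print(type(approx).__name__, round(scores[type(approx).__name__], 3))
```

Both variants trade a small amount of accuracy for training costs that grow linearly in the number of samples.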

Kernel Selection Flowchart

```
Start
  |
  +--> Data linearly separable?
  |      Yes --> Use LINEAR kernel
  |      No  --> Continue
  |
  +--> Need smoothness control (GP)?
  |      Yes --> Use MATÉRN kernel (tune nu, length scale)
  |      No  --> Continue
  |
  +--> Time series with seasonality?
  |      Yes --> Use PERIODIC kernel
  |      No  --> Continue
  |
  +--> Sparse high-dim data (text)?
  |      Yes --> Try LINEAR or POLYNOMIAL (d=2)
  |      No  --> Continue
  |
  +--> Default choice --> RBF kernel (tune sigma, C)
```

Practical Tips

1. Always normalize features

Why: Distance-based kernels (RBF, Matérn) are scale-sensitive. Features with large magnitudes dominate.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

2. Start with RBF, then experiment

Rule of thumb:

  • First try: RBF kernel with grid search over $C$ and $\gamma$
  • If it underfits: Try polynomial ($d = 2$ or $3$)
  • If it is slow: Try linear

3. Visualize kernel matrix

```python
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import rbf_kernel

K = rbf_kernel(X_train, gamma=0.1)
plt.imshow(K, cmap='viridis')
plt.colorbar()
plt.title('Kernel Matrix (RBF)')
plt.show()
```

What to look for:

  • Diagonal pattern: Kernel too localized (small $\sigma$), risk of overfitting
  • Uniform block: Kernel too global (large $\sigma$), underfitting
  • Structured patterns: Good sign (capturing data relationships)

4. Check kernel validity

For custom kernels, ensure they are positive definite:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def is_psd(K):
    eigvals = np.linalg.eigvalsh(K)
    return np.all(eigvals >= -1e-8)  # allow small numerical errors

K = rbf_kernel(X_train)
print(f"Kernel is PSD: {is_psd(K)}")
```

Comparison: Kernel Methods vs Deep Learning

Aspect           | Kernel Methods                        | Deep Learning
-----------------|---------------------------------------|--------------------------------------
Training data    | Works well with small data (<10k)     | Needs large data (>10k)
Interpretability | High (RKHS functions, kernel weights) | Low (black box)
Hyperparameters  | Few (kernel params + regularization)  | Many (architecture, LR, batch, etc.)
Scalability      | Poor ($O(n^2)$ memory, $O(n^3)$ time) | Good (mini-batch, GPU)
Theory           | Strong (RKHS, Mercer's theorem)       | Weaker (empirical success)

When to use kernel methods:

  • Small to mid-size datasets (<10k samples)
  • Need probabilistic predictions (Gaussian Processes)
  • Want interpretability (SVMs have support vectors)

When to use deep learning:

  • Large datasets (>100k samples)
  • Complex structures (images, text, graphs)
  • Scalability matters

Summary: Kernel Methods in 5 Steps

  1. Check data: Linearly separable? → Linear kernel. Otherwise → RBF.
  2. Normalize features (crucial for RBF/Matérn/periodic kernels)
  3. Tune hyperparameters: Cross-validation for $\sigma$, $C$, $d$, etc.
  4. Diagnose issues: Overfitting → increase $\sigma$ or regularization. Underfitting → decrease $\sigma$ or use a more expressive kernel.
  5. Scale up: If the data is too large → use Nyström/random features or switch to deep learning.

Key hyperparameters:

  • RBF: $\sigma$ (bandwidth), $C$ (regularization)
  • Polynomial: $d$ (degree), $\gamma$ (scale), $c$ (offset)
  • Matérn: $\nu$ (smoothness), $\ell$ (length scale)

Common pitfalls:

  • Forgetting to normalize features (RBF fails silently)
  • Using the sigmoid kernel (not always PSD)
  • Overfitting with small $\sigma$ or high polynomial degree

  • Post title: Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)
  • Post author: Chen Kai
  • Create time: 2024-02-28 00:00:00
  • Post link: https://www.chenk.top/en/kernel-methods/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.