Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)
Chen Kai

A kernel lets you use linear methods on non-linear problems by implicitly mapping data into a (possibly very high-dimensional) feature space. This note builds the intuition (the kernel trick), the math foundation (positive definite kernels, RKHS, Mercer's theorem), and the practical side (how to choose kernels and tune hyperparameters). We cover common kernels (RBF, polynomial, Matérn, periodic), troubleshooting (overfitting, underfitting, numerical issues), and a decision flowchart for kernel selection in SVM, Gaussian Processes, and Kernel PCA.

Why Kernel Methods Matter

The linear limitation

Many ML algorithms (linear regression, PCA, linear SVM) only work well when data is linearly separable or has linear structure.

Problem: Most real-world data is non-linear.

Naive solution: Manually engineer features (e.g., polynomials, interactions).

Issues:

  • Tedious: Requires domain knowledge
  • Combinatorial explosion: High-degree polynomials → exponential feature count
  • Computational cost: Storing and computing high-dimensional features is expensive
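To make the combinatorial explosion concrete, here is a small illustrative calculation (not from the original post): the number of monomials of degree at most $d$ in $p$ variables is $\binom{p+d}{d}$.

```python
from math import comb

def n_poly_features(p, d):
    """Number of monomials of degree <= d in p variables: C(p + d, d)."""
    return comb(p + d, d)

# Feature counts explode with input dimension at a fixed degree d = 3:
for p in (10, 100, 1000):
    print(p, n_poly_features(p, 3))  # 286, 176851, 167668501
```

Already at a thousand input features and degree 3, explicitly materializing the feature map would need over 10^8 columns.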

The kernel trick: implicit feature mapping

Key idea: If an algorithm only needs inner products $\langle \phi(x), \phi(y) \rangle$, and we can compute

$$k(x, y) = \langle \phi(x), \phi(y) \rangle$$

without explicitly computing $\phi(x)$, then we work in the high-dimensional space implicitly.

Example: The polynomial kernel $k(x, y) = (x^\top y + c)^d$ corresponds to all polynomial features up to degree $d$, but we compute only one dot product instead of storing $\binom{p+d}{d}$ features (for $p$ input dimensions).
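As a quick numerical sanity check (illustrative, not from the original post), the degree-2 kernel $(x^\top y + 1)^2$ in two dimensions equals the inner product of an explicit 6-dimensional feature map:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = phi(x) @ phi(y)          # inner product in 6-D feature space
implicit = (x @ y + 1.0) ** 2       # one dot product in 2-D
print(np.isclose(explicit, implicit))  # True
```

The kernel evaluation touches only the original 2-D vectors; the 6-D space is never constructed in practice.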


Mathematical Foundation

Positive definite kernels

A kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if for any finite set $\{x_1, \dots, x_n\} \subset \mathcal{X}$ and any real vector $c \in \mathbb{R}^n$:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \ge 0$$

Equivalently, the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semi-definite (PSD).

Why it matters: Positive definite kernels correspond to valid inner products in some feature space.

Reproducing Kernel Hilbert Space (RKHS)

A Hilbert space $\mathcal{H}$ is an RKHS if:

1. Elements are functions $f: \mathcal{X} \to \mathbb{R}$
2. Point evaluation $f \mapsto f(x)$ is continuous
3. There exists a reproducing kernel $k$ such that:

$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}$$

Moore-Aronszajn theorem: Every positive definite kernel uniquely defines an RKHS.

Practical implication: Kernel methods (SVM, Gaussian Processes) optimize functions in an RKHS, with the kernel controlling the "smoothness" and structure.

Mercer's theorem

If $k$ is continuous, symmetric, and positive definite on a compact domain, then:

$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

where $\lambda_i \ge 0$ are non-negative eigenvalues and $\phi_i$ are orthonormal eigenfunctions.

Intuition: The kernel can be decomposed into a (possibly infinite) sum of basis functions.

Example: The RBF kernel has an infinite-dimensional feature space (all Mercer eigenvalues $\lambda_i > 0$).
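A finite-sample analogue of Mercer's decomposition can be checked numerically: the eigendecomposition of an RBF Gram matrix has non-negative eigenvalues and reconstructs the matrix exactly (a sketch with arbitrary synthetic data):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, gamma=0.5)

# Empirical analogue of Mercer's theorem: K = V diag(lam) V^T
lam, V = np.linalg.eigh(K)
print(lam.min() >= -1e-10)                      # eigenvalues are (numerically) non-negative
print(np.allclose(V @ np.diag(lam) @ V.T, K))   # K is recovered from the expansion
```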


The Kernel Trick in Action

Example: Kernel SVM

Linear SVM (primal):

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0$$

Dual form (only needs dot products):

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \ \sum_i \alpha_i y_i = 0$$

Kernel trick: Replace $x_i^\top x_j$ with $k(x_i, x_j)$. Non-linear SVM at the same computational cost!
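A minimal sketch of this in scikit-learn, on the synthetic two-moons dataset (an illustrative choice, not from the original post): the same SVC estimator switches from linear to non-linear behavior purely by swapping the kernel.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Only the `kernel` argument changes; the dual optimization is identical
for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr).score(X_te, y_te)
    print(kernel, round(acc, 3))
```

On this dataset the RBF kernel should clearly outperform the linear kernel, since the two moons are not linearly separable.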

Example: Kernel PCA

Standard PCA: Eigendecomposition of covariance matrix.

Kernel PCA: Eigendecomposition of the (centered) kernel matrix $K$.

Result: Extract non-linear principal components without explicitly constructing $\phi(x)$.
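For example, with scikit-learn's KernelPCA on concentric circles (an illustrative dataset where linear PCA fails), the RBF kernel makes the two rings linearly separable in the projected space:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA: eigendecomposition of the centered kernel matrix
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)
print(Z.shape)  # (400, 2)
```

In the transformed coordinates `Z`, a plain linear classifier can separate the inner and outer circles.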


Common Kernels: Theory and Practice

1. RBF (Gaussian) Kernel

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$

Properties:

  • Infinite-dimensional feature space (all Mercer eigenvalues > 0)
  • Smooth: Infinitely differentiable
  • Universal: Can approximate any continuous function (given enough data)

When to use:

  • Default choice for SVM, Gaussian Processes
  • Data has smooth, local structure

Hyperparameter:

  • $\sigma$ (bandwidth): Controls the "influence radius"
    - Small $\sigma$: High curvature, risk of overfitting (memorizes training data)
    - Large $\sigma$: Low curvature, risk of underfitting (acts like a linear kernel)

Tuning tips:

  • Cross-validation over $\sigma$ (log scale)
  • Heuristic: set $\sigma$ to the median pairwise distance between training points
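One common form of this bandwidth heuristic is the "median heuristic" (an assumption on my part, since the original formula did not survive extraction): set $\sigma$ to the median pairwise Euclidean distance, then convert to sklearn's gamma parameterization.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_sigma(X):
    """Set the RBF bandwidth sigma to the median pairwise Euclidean distance."""
    return np.median(pdist(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sigma = median_heuristic_sigma(X)
gamma = 1.0 / (2.0 * sigma**2)   # sklearn's RBF is exp(-gamma * ||x - y||^2)
print(sigma, gamma)
```

This gives a sensible starting point for the cross-validation grid rather than a final answer.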

2. Polynomial Kernel

$$k(x, y) = (\gamma\, x^\top y + c)^d$$

Properties:

  • Finite-dimensional feature space (all monomials up to degree $d$)
  • Reduces to the homogeneous polynomial kernel (only degree-$d$ terms) when $c = 0$

When to use:

  • Data has polynomial relationships
  • Text classification (sparse, high-dim data)

Hyperparameters:

  • $d$ (degree): Typically 2 or 3 (higher risks overfitting)
  • $\gamma$ (scale): Affects sensitivity to feature magnitudes
  • $c$ (offset): Often 0 or 1

Pitfall: High-degree polynomials often overfit.
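The effect of the degree can be sketched on a synthetic dataset (two moons here is an illustrative choice); comparing train and test accuracy across $d$ shows how the train/test gap tends to widen at high degree:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

results = {}
for d in (2, 3, 10):
    clf = SVC(kernel="poly", degree=d, coef0=1.0, C=1.0).fit(X_tr, y_tr)
    results[d] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"d={d}: train={results[d][0]:.3f}, test={results[d][1]:.3f}")
```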

3. Linear Kernel

$$k(x, y) = x^\top y$$

Properties:

  • Simplest kernel (no feature mapping)
  • Fast: $O(p)$ per kernel evaluation ($p$ = feature dimension)

When to use:

  • Data is already linearly separable
  • High-dimensional sparse data (text, genes)
  • Baseline comparison

Hyperparameter: None (only the SVM regularization parameter $C$).

4. Sigmoid Kernel

$$k(x, y) = \tanh(\gamma\, x^\top y + c)$$

Properties:

  • Mimics neural network activation
  • Not always PSD (only for certain values of $\gamma$ and $c$)

When to use: Rarely in modern practice (neural networks are better for non-PSD cases).

Pitfall: Can produce invalid Gram matrices.

5. Matérn Kernel (Gaussian Processes)

Properties:

  • Tunable smoothness: $\nu$ controls differentiability
    - $\nu = 1/2$ → Exponential kernel (not differentiable)
    - $\nu \to \infty$ → RBF kernel (infinitely smooth)
  • Common in GPs: $\nu = 3/2$ and $\nu = 5/2$

When to use:

  • Gaussian Process regression

  • Need to control function smoothness (e.g., robotics, geostatistics)

Hyperparameters:

  • $\nu$ (smoothness): Typically fixed (1/2, 3/2, or 5/2)
  • $\ell$ (length scale): Tune via maximum likelihood or cross-validation
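A minimal GP regression sketch with scikit-learn's Matern kernel (synthetic data; $\nu$ fixed at 2.5, length scale optimized by marginal likelihood during fit):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

# nu stays fixed; the length scale is fitted by maximizing the marginal likelihood
kernel = Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01).fit(X, y)
print(gp.kernel_)   # kernel with the fitted length scale
```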

6. Periodic Kernel (Time Series)

$$k(x, x') = \exp\!\left(-\frac{2 \sin^2\!\big(\pi |x - x'| / p\big)}{\ell^2}\right)$$

Properties:

  • Captures periodic patterns (period $p$)

  • Stationary: Depends only on $|x - x'|$

When to use:

  • Time series with seasonality (e.g., temperature, sales)

  • Audio signals (pitch detection)

Hyperparameters:

  • $p$ (period): Should match the data's periodicity (tune, or set from domain knowledge)
  • $\ell$ (length scale): Controls smoothness within each period
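A sketch with scikit-learn's ExpSineSquared (its periodic kernel), fitting a noisy sine with true period $2\pi$; the initial periodicity guess below is an illustrative choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared

# Noisy periodic signal with known period 2*pi
rng = np.random.default_rng(0)
X = np.linspace(0, 20, 80).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

kernel = ExpSineSquared(length_scale=1.0, periodicity=6.0)  # initial guess near 2*pi
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01).fit(X, y)
print(gp.kernel_)  # fitted kernel; periodicity is optimized from the initial guess
```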


Hyperparameter Tuning

Cross-validation (most common)

  1. Split data into $k$ folds

  2. For each hyperparameter combination (e.g., $(C, \gamma)$):

    • Train on $k - 1$ folds, validate on the remaining fold
  3. Pick the hyperparameters with the best average validation score

Grid search (exhaustive):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],  # gamma = 1/(2*sigma^2) in sklearn
    'kernel': ['rbf']
}

grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
```

Random search (faster for high-dim hyperparameter spaces):

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

param_dist = {
    'C': loguniform(0.01, 100),
    'gamma': loguniform(0.0001, 1),
}

random_search = RandomizedSearchCV(SVC(kernel='rbf'), param_dist, n_iter=50, cv=5)
random_search.fit(X_train, y_train)
```

Maximum likelihood (Gaussian Processes)

For GP regression, kernel hyperparameters $\theta$ are often tuned by maximizing the log marginal likelihood:

$$\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K_\theta^{-1} y - \tfrac{1}{2} \log |K_\theta| - \tfrac{n}{2} \log 2\pi$$

where $K_\theta$ is the kernel matrix (including the noise variance on the diagonal).

Advantage: Principled, no need for validation set.

Disadvantage: Can overfit if data is noisy (add priors on hyperparameters).
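A sketch with scikit-learn's GaussianProcessRegressor, which maximizes the log marginal likelihood during `fit` (the RBF + WhiteKernel combination below is an illustrative choice):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=50)

# Length scale and noise level are tuned by maximizing the log marginal likelihood
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
print(gp.kernel_)
print(round(gp.log_marginal_likelihood(gp.kernel_.theta), 2))
```

No validation split is needed here; the noise level learned by WhiteKernel acts as the model's own estimate of the label noise.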


Troubleshooting Guide

Problem 1: Overfitting (high training accuracy, poor test accuracy)

Symptoms:

  • Training accuracy → 100%
  • Test accuracy much lower
  • Kernel matrix nearly diagonal (RBF with very small $\sigma$)

Causes:

  • Kernel too complex (e.g., RBF with small $\sigma$, polynomial with high $d$)
  • Regularization too weak (SVM: $C$ too large)

Solutions:

  1. Increase the kernel bandwidth ($\sigma$ for RBF); lower $d$ for polynomial
  2. Increase regularization (lower $C$ in SVM, increase noise variance in GP)
  3. Use a simpler kernel (linear instead of RBF)
  4. Add more training data

Problem 2: Underfitting (low training and test accuracy)

Symptoms:

  • Both training and test accuracy low
  • Model output almost constant (e.g., predicts same class for all inputs)

Causes:

  • Kernel too simple (e.g., linear kernel on non-linear data)
  • Kernel bandwidth too large (RBF acts like a constant kernel)
  • Regularization too strong (small $C$ in SVM)

Solutions:

  1. Use a more expressive kernel (linear → polynomial → RBF)
  2. Decrease the kernel bandwidth (smaller $\sigma$ for RBF)
  3. Decrease regularization (increase $C$ in SVM)

Problem 3: Numerical instability (matrix inversion fails)

Symptoms:

  • GP training crashes with "singular matrix" error
  • Kernel PCA eigenvalues include negative values
  • SVM optimization does not converge

Causes:

  • Kernel matrix not PSD (numerical errors, invalid kernel)
  • Duplicate or near-duplicate data points (kernel matrix rank-deficient)

Solutions:

  1. Add jitter to the diagonal:

```python
K = K + 1e-6 * np.eye(n)
```

  2. Normalize features (crucial for distance-based kernels like RBF)
  3. Remove duplicates from the training data
  4. Use a more numerically stable kernel (avoid sigmoid; prefer RBF)

Problem 4: Slow training (large datasets)

Symptoms:

  • SVM takes hours on 10k+ samples
  • GP training/prediction infeasible beyond 5k samples

Causes:

  • Kernel methods scale as $O(n^2)$ in memory and $O(n^3)$ in time (matrix inversion)

Solutions:

  1. Use a linear kernel (reduces to $O(p)$ per sample)
  2. Use approximate methods:

    • Nyström approximation (low-rank approximation of the kernel matrix)
    • Random Fourier features (approximate the RBF kernel with explicit random feature maps)
    • Sparse GPs (inducing points)
  3. Switch to deep learning (scales better for large $n$)
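The first two approximations can be sketched with scikit-learn's kernel_approximation module: map inputs through an explicit approximate feature map, then train a fast linear classifier on top (dataset and hyperparameters below are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Approximate the RBF kernel with an explicit low-dimensional feature map,
# then train a linear model on the mapped features (linear in time and memory)
scores = {}
for approx in (RBFSampler(gamma=1.0, n_components=100, random_state=0),
               Nystroem(gamma=1.0, n_components=100, random_state=0)):
    clf = make_pipeline(approx, SGDClassifier(random_state=0)).fit(X_tr, y_tr)
    scores[type(approx).__name__] = clf.score(X_te, y_te)
    print(type(approx).__name__, round(scores[type(approx).__name__], 3))
```

Both variants trade a small amount of accuracy for training costs that grow linearly in the number of samples.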

Kernel Selection Flowchart

```
Start
  |
  +--> Data linearly separable?
  |      Yes --> Use LINEAR kernel
  |      No  --> Continue
  |
  +--> Need smoothness control (GP)?
  |      Yes --> Use MATÉRN kernel (tune nu, length scale)
  |      No  --> Continue
  |
  +--> Time series with seasonality?
  |      Yes --> Use PERIODIC kernel
  |      No  --> Continue
  |
  +--> Sparse high-dim data (text)?
  |      Yes --> Try LINEAR or POLYNOMIAL (d=2)
  |      No  --> Continue
  |
  +--> Default choice --> RBF kernel (tune sigma, C)
```

Practical Tips

1. Always normalize features

Why: Distance-based kernels (RBF, Matérn) are scale-sensitive. Features with large magnitudes dominate.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

2. Start with RBF, then experiment

Rule of thumb:

  • First try: RBF kernel with grid search over $C$ and $\gamma$
  • If it underfits: Try polynomial ($d = 2$ or $3$)
  • If it is slow: Try linear

3. Visualize kernel matrix

```python
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import rbf_kernel

K = rbf_kernel(X_train, gamma=0.1)
plt.imshow(K, cmap='viridis')
plt.colorbar()
plt.title('Kernel Matrix (RBF)')
plt.show()
```

What to look for:

  • Diagonal pattern: Kernel too localized (small $\sigma$), risk of overfitting
  • Uniform block: Kernel too global (large $\sigma$), underfitting
  • Structured patterns: Good sign (capturing data relationships)

4. Check kernel validity

For custom kernels, ensure they are positive definite:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def is_psd(K):
    eigvals = np.linalg.eigvalsh(K)
    return np.all(eigvals >= -1e-8)  # allow small numerical errors

K = rbf_kernel(X_train)
print(f"Kernel is PSD: {is_psd(K)}")
```

Comparison: Kernel Methods vs Deep Learning

Aspect           | Kernel Methods                        | Deep Learning
-----------------|---------------------------------------|--------------------------------------
Training data    | Works well with small data (<10k)     | Needs large data (>10k)
Interpretability | High (RKHS functions, kernel weights) | Low (black box)
Hyperparameters  | Few (kernel params + regularization)  | Many (architecture, LR, batch, etc.)
Scalability      | Poor ($O(n^2)$ memory, $O(n^3)$ time) | Good (mini-batch, GPU)
Theory           | Strong (RKHS, Mercer's theorem)       | Weaker (empirical success)

When to use kernel methods:

  • Small to mid-size datasets (<10k samples)
  • Need probabilistic predictions (Gaussian Processes)
  • Want interpretability (SVMs have support vectors)

When to use deep learning:

  • Large datasets (>100k samples)
  • Complex structures (images, text, graphs)
  • Scalability matters

Summary: Kernel Methods in 5 Steps

  1. Check data: Linearly separable? → Linear kernel. Otherwise → RBF.
  2. Normalize features (crucial for RBF/Matérn/periodic kernels)
  3. Tune hyperparameters: Cross-validation for $\sigma$, $C$, $d$, etc.
  4. Diagnose issues: Overfitting → increase $\sigma$ or regularization. Underfitting → decrease $\sigma$ or use a more expressive kernel.
  5. Scale up: If the data is too large → use Nyström/random features or switch to deep learning.

Key hyperparameters:

  • RBF: $\sigma$ (bandwidth), $C$ (regularization)
  • Polynomial: $d$ (degree), $\gamma$ (scale), $c$ (offset)
  • Matérn: $\nu$ (smoothness), $\ell$ (length scale)

Common pitfalls:

  • Forgetting to normalize features (RBF fails silently)
  • Using the sigmoid kernel (not always PSD)
  • Overfitting with small $\sigma$ or high polynomial degree

  • Post title: Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)
  • Post author: Chen Kai
  • Create time: 2024-02-28 00:00:00
  • Post link: https://www.chenk.top/en/kernel-methods/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.