Regularization is a core technique in machine learning for controlling model complexity and preventing overfitting: when training data is limited, models tend to memorize noise rather than true patterns. From the mathematical forms of L1/L2 regularization to their Bayesian prior interpretation, from Dropout's random deactivation to early stopping's implicit regularization, from cross-validation for model selection to VC dimension generalization bounds, regularization theory provides mathematical guarantees for balancing underfitting and overfitting. This chapter derives in detail the optimization forms, Bayesian interpretation, bias-variance decomposition, and learning theory foundations of regularization.
Overfitting and Generalization
Mathematical Characterization of Overfitting
Training error (empirical risk):
$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i)\big)$$
Generalization error (expected risk):
$$R(f) = \mathbb{E}_{(x,y)\sim P}\big[L\big(y, f(x)\big)\big]$$
Generalization gap:
$$R(f) - \hat{R}_n(f)$$
Overfitting: $\hat{R}_n(f)$ is small but $R(f)$ is large (large gap)
Underfitting: both $\hat{R}_n(f)$ and $R(f)$ are large
Bias-Variance Tradeoff
Expected generalization error decomposition for regression, assuming the true model is $y = f(x) + \varepsilon$ with noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$
Physical interpretation:
- Bias: Difference between average prediction and true value, measures underfitting
- Variance: Prediction fluctuation due to different training sets, measures overfitting
- Irreducible error: Noise in data itself
Model Complexity and Generalization Tradeoff
Simple model (low complexity):
- High bias (underfitting)
- Low variance (stable)
Complex model (high complexity):
- Low bias (fits training data)
- High variance (sensitive to training data, overfitting)
Optimal complexity: Balances bias and variance, minimizes total error
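The tradeoff above can be checked numerically. A minimal sketch (the sine target, the evaluation point, and the function name are illustrative choices, not from the text): fit polynomials of low and high degree to many resampled noisy training sets, then estimate bias and variance of the prediction at a fixed point.

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_train=30, noise=0.3, seed=0):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit
    at a fixed test point, averaged over many resampled training sets."""
    rng = np.random.default_rng(seed)
    true_f = lambda x: np.sin(2 * np.pi * x)      # assumed true function
    x0, preds = 0.25, []
    for _ in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise, n_train)  # fresh noisy training set
        coefs = np.polyfit(x, y, degree)               # fit one model per trial
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x0)) ** 2      # (avg prediction - truth)^2
    var = preds.var()                             # spread across training sets
    return bias2, var

# Degree-1 fits underfit (high bias); degree-9 fits track noise (high variance).
b1, v1 = bias_variance(degree=1)
b9, v9 = bias_variance(degree=9)
```

The simple model's bias dominates while the complex model's variance dominates, matching the decomposition above.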

L2 Regularization: Ridge Regression
Optimization Form
Objective function:
$$J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i; \mathbf{w})\big) + \lambda \|\mathbf{w}\|_2^2$$
where $\lambda > 0$ controls the regularization strength.
Linear regression with L2 penalty (ridge regression):
$$\hat{\mathbf{w}}_{\text{ridge}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$$
Closed-form solution:
$$\hat{\mathbf{w}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$
Regularization effects:
- Prevents $X^\top X$ from being singular (non-invertible)
- Shrinks weights toward 0 (weight decay)
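A minimal numpy sketch of the closed-form solution (the function name and random data are illustrative): even when $n < d$ makes $X^\top X$ singular, $\lambda > 0$ keeps the linear system solvable, and larger $\lambda$ shrinks the weights.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))    # n=5 < d=10, so X^T X is singular
y = rng.normal(size=5)
w_small = ridge_fit(X, y, lam=0.1)
w_big = ridge_fit(X, y, lam=100.0)
```

Increasing `lam` from 0.1 to 100 visibly shrinks the weight norm, illustrating weight decay.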
Bayesian Interpretation
Maximum A Posteriori (MAP) estimation:
$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathcal{D}) = \arg\max_{\mathbf{w}} p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$$
Taking the logarithm:
$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \big[\log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w})\big]$$
Assumptions:
- Likelihood: $y_i \mid x_i, \mathbf{w} \sim \mathcal{N}(\mathbf{w}^\top x_i, \sigma^2)$
- Prior: $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$ (Gaussian prior)
Conclusion: L2 regularization is equivalent to MAP estimation under a Gaussian prior, with $\lambda = \sigma^2 / \tau^2$.
Weight Decay in Gradient Descent
Gradient update:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\big(\nabla L(\mathbf{w}_t) + 2\lambda \mathbf{w}_t\big)$$
Rewriting:
$$\mathbf{w}_{t+1} = (1 - 2\eta\lambda)\,\mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$$
Physical meaning: before each update, the weights first decay by the factor $(1 - 2\eta\lambda) < 1$.
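The update above is easy to sketch directly (the function name and constants are illustrative); with a zero loss gradient, the weights simply decay geometrically toward 0.

```python
import numpy as np

def sgd_step(w, grad, eta=0.1, lam=0.01):
    """One gradient step with L2 penalty: w <- (1 - 2*eta*lam)*w - eta*grad."""
    return (1 - 2 * eta * lam) * w - eta * grad

# With grad = 0, each step multiplies w by (1 - 2*0.1*0.01) = 0.998.
w = np.ones(3)
for _ in range(100):
    w = sgd_step(w, grad=np.zeros(3))
```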
L1 Regularization: Lasso
Optimization Form
Objective function:
$$J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i; \mathbf{w})\big) + \lambda \|\mathbf{w}\|_1$$
where $\|\mathbf{w}\|_1 = \sum_{j} |w_j|$.
Sparsity Induction
Geometric intuition of L1 vs L2:
Consider the constrained optimization form:
$$\min_{\mathbf{w}} L(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_p \le t$$
Tangent point of the loss contour lines and the constraint region:
- L2: the region is a ball (smooth boundary); the tangent point is typically not on the coordinate axes, so the solution is non-sparse
- L1: the region is a diamond (with corners); the tangent point tends to land on the coordinate axes (at corners), so some components are exactly 0: sparse!

Soft-thresholding operator (the proximal operator of $\lambda\|\cdot\|_1$, applied coordinate-wise):
$$S_\lambda(w) = \operatorname{sign}(w)\,\max(|w| - \lambda,\, 0)$$
Conclusion: L1 regularization drives small weights directly to 0, achieving feature selection
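The soft-thresholding operator is a one-liner in numpy (a minimal sketch; the function name is illustrative), and running it shows the exact zeros that make L1 a feature selector:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam*|.|: sign(w) * max(|w| - lam, 0)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
w_l1 = soft_threshold(w, lam=1.0)   # entries with |w| <= lam become exactly 0
```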
Bayesian Interpretation
Laplace prior:
$$p(w_j) = \frac{1}{2b}\exp\left(-\frac{|w_j|}{b}\right)$$
Conclusion: L1 regularization is equivalent to MAP estimation under Laplace prior
Elastic Net: L1 + L2
Combined regularization:
$$J(\mathbf{w}) = L(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$
Advantages:
- Preserves L1's sparsity
- Avoids L1's instability with highly correlated features
- L2's grouping effect: Correlated features tend to be selected or dropped together
Dropout: Random Deactivation Regularization
Dropout Mechanism
During training: each neuron is randomly "turned off" with probability $p$ (kept with probability $1-p$). Equivalently, introduce a binary mask $m_i \sim \text{Bernoulli}(1-p)$ and set
$$\tilde{h}_i = \frac{m_i\, h_i}{1-p}$$
Dividing by $1-p$ keeps the expected activation unchanged: $\mathbb{E}[\tilde{h}_i] = h_i$.
During testing: use all neurons, no Dropout (or equivalently, use the expectation).
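The training-time mechanism can be sketched in a few lines of numpy (inverted dropout; the function name is illustrative), and the empirical mean confirms the expectation is preserved:

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Zero each activation with probability p, and divide survivors by
    (1 - p) so that the expected output equals the input activation."""
    mask = rng.random(h.shape) >= p     # keep with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(100_000)
out = inverted_dropout(h, p=0.5, rng=rng)
# Each output is 0 or 2, but the average stays close to the original 1.0.
```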
Ensemble Learning Perspective
Idea: Dropout trains exponentially many sub-networks
Assuming the network has $n$ droppable neurons, Dropout implicitly defines up to $2^n$ sub-networks.
Training: Each mini-batch samples one sub-network
Testing: approximates the average of all sub-networks (via weight sharing)
Analogy to Bagging: Train multiple models and average
Mathematical Analysis of Regularization Effect
Approximate L2 regularization: consider a linear model
$$\hat{y} = \frac{1}{1-p}\,\mathbf{w}^\top (\mathbf{m} \odot \mathbf{x}), \qquad m_i \sim \text{Bernoulli}(1-p)$$
where $\odot$ is the element-wise product. The dropout noise contributes a variance term $\frac{p}{1-p}\sum_i w_i^2 x_i^2$ to the expected squared error; minimizing this variance is equivalent to penalizing $\sum_i w_i^2 x_i^2$, a data-dependent L2 penalty.
Deep networks: Dropout prevents neuron co-adaptation, forcing learning of robust features
Early Stopping and Implicit Regularization
Early Stopping Strategy
Algorithm:
- Monitor validation set error during training
- Stop when the validation error has not decreased for $P$ consecutive epochs (the patience)
- Return the model with the minimum validation error
Effect: Prevents overfitting, similar to regularization
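The patience rule above is a small pure-Python function (a minimal sketch; the function name is illustrative). Given a sequence of validation errors, it reports which epoch's model to keep:

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch index to roll back to: the epoch with the lowest
    validation error, halting once it fails to improve for `patience`
    consecutive epochs."""
    best_epoch, best_err, bad = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, bad = epoch, err, 0   # new best: reset patience
        else:
            bad += 1
            if bad >= patience:                          # patience exhausted
                break
    return best_epoch

# Validation error drops, then rises as the model starts fitting noise;
# training halts shortly after the minimum at epoch 3.
errs = [1.0, 0.6, 0.4, 0.35, 0.37, 0.40, 0.45, 0.50]
```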
Mathematical Explanation of Early Stopping
Gradient descent trajectory:
Initialize $\mathbf{w}_0 = 0$ and update $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$.
Early stopping = limiting the iteration count $T$, which constrains how far $\mathbf{w}$ can travel from its initialization.
For convex quadratic loss (like linear regression), early stopping is equivalent to L2 regularization
Intuition:
- Early training: Learn main patterns (low-frequency components)
- Late training: Fit noise (high-frequency components)
Early stopping stops after learning main patterns, avoiding fitting noise
Relationship between Early Stopping and L2 Regularization
Linear regression example:
It can be proven (via the SVD $X = U \Sigma V^\top$) that after $T$ gradient steps from $\mathbf{w}_0 = 0$:
$$\mathbf{w}_T = V\, \mathrm{diag}\!\left(\frac{1 - (1 - \eta \sigma_i^2)^T}{\sigma_i}\right) U^\top \mathbf{y}$$
where the $\sigma_i$ are the singular values of $X$. Comparing with the ridge solution $\hat{\mathbf{w}}_{\text{ridge}} = V\, \mathrm{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^\top \mathbf{y}$ shows that early stopping is equivalent to an adaptive ridge regression with effective regularization strength $\lambda \approx \frac{1}{\eta T}$.
Model Selection and Cross-Validation
Cross-Validation Strategies

K-fold cross-validation:
- Split the data into $K$ folds (usually $K = 5$ or 10)
- For each fold $k = 1, \dots, K$:
  - Train on the remaining $K - 1$ folds
  - Validate on fold $k$
- Collect the $K$ validation errors
Estimated generalization error:
$$\hat{R}_{\text{CV}} = \frac{1}{K}\sum_{k=1}^{K} \hat{R}_k$$
Leave-One-Out Cross-Validation (LOO-CV): the special case $K = n$, where each validation set is a single example:
$$\hat{R}_{\text{LOO}} = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat{f}^{(-i)}(x_i)\big)$$
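The K-fold procedure can be sketched with numpy (function names, the ridge model inside, and the synthetic data are illustrative choices): split the indices into folds, train on the rest, validate on each held-out fold, and average.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cv_error(X, y, lam, k=5):
    """Average held-out MSE of closed-form ridge regression over k folds."""
    errs = []
    for fold in kfold_indices(len(y), k):
        train = np.ones(len(y), dtype=bool)
        train[fold] = False                      # held-out validation rows
        Xtr, ytr, Xva, yva = X[train], y[train], X[~train], y[~train]
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
        errs.append(np.mean((Xva @ w - yva) ** 2))
    return float(np.mean(errs))                  # estimated generalization error

# Synthetic linear data with small noise: CV error should be small but nonzero.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.1, size=50)
err = cv_error(X, y, lam=0.1)
```

In practice one evaluates `cv_error` over a grid of `lam` values and keeps the minimizer.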
Information Criteria
AIC (Akaike Information Criterion):
$$\text{AIC} = -2 \ln \hat{L} + 2k$$
where $\hat{L}$ is the maximized likelihood and $k$ is the number of parameters.
BIC (Bayesian Information Criterion):
$$\text{BIC} = -2 \ln \hat{L} + k \ln n$$
where $n$ is the sample size.
Principle: Balance fit and complexity
- First term: Negative log-likelihood (smaller is better)
- Second term: Complexity penalty (simpler is better)
AIC vs BIC:
- BIC penalizes complexity more heavily ($\ln n > 2$ when $n \ge 8$)
- BIC tends to select simpler models
- AIC is asymptotically optimal (from the prediction perspective)
- BIC is consistent (as $n \to \infty$, it recovers the true model)
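Both criteria are trivial to compute from the maximized log-likelihood (a minimal sketch; the function names are illustrative), and the code makes the $n \ge 8$ crossover concrete:

```python
import math

def aic(log_lik, k):
    """AIC = -2 ln L + 2k (fit term + complexity penalty)."""
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    """BIC = -2 ln L + k ln n (penalty grows with sample size)."""
    return -2 * log_lik + k * math.log(n)

# For the same fit (log_lik = -100, k = 5 parameters), BIC exceeds AIC
# once ln(n) > 2, i.e. n >= 8, so BIC favors the simpler model.
```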
Generalization Theory: VC Dimension and PAC Learning
VC Dimension Definition
Shattering: a hypothesis space $\mathcal{H}$ shatters a set of $n$ points if, for every one of the $2^n$ possible binary labelings of those points, some $h \in \mathcal{H}$ realizes it. The VC dimension $d_{\text{VC}}(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter.
Examples:
- 1D linear classifier (threshold): VC dimension = 2
- 2D linear classifier: VC dimension = 3
- $d$-dimensional linear classifier: VC dimension = $d + 1$
VC Dimension and Generalization Bound
Theorem (Vapnik-Chervonenkis):
With probability at least $1 - \delta$, for all $h \in \mathcal{H}$:
$$R(h) \le \hat{R}_n(h) + \sqrt{\frac{d_{\text{VC}}\left(\ln\frac{2n}{d_{\text{VC}}} + 1\right) + \ln\frac{4}{\delta}}{n}}$$
Meaning:
- Larger VC dimension, looser generalization bound (need more data)
- Larger sample size $n$: tighter generalization bound
Tradeoff:
- Complex model (high VC dimension): Small training error, loose generalization bound
- Simple model (low VC dimension): Large training error, tight generalization bound
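Plugging numbers into the bound makes the tradeoff tangible (a minimal sketch of the bound's right-hand side as stated above; the function name is illustrative):

```python
import math

def vc_bound(train_err, d_vc, n, delta=0.05):
    """Upper bound on generalization error:
    train_err + sqrt((d_vc*(ln(2n/d_vc) + 1) + ln(4/delta)) / n)."""
    gap = math.sqrt(
        (d_vc * (math.log(2 * n / d_vc) + 1) + math.log(4 / delta)) / n
    )
    return train_err + gap

# More data tightens the bound; higher VC dimension loosens it.
loose = vc_bound(train_err=0.1, d_vc=10, n=1_000)
tight = vc_bound(train_err=0.1, d_vc=10, n=10_000)
complex_model = vc_bound(train_err=0.1, d_vc=100, n=10_000)
```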
PAC Learning Framework
PAC (Probably Approximately Correct):
Given:
- Error threshold $\epsilon > 0$
- Confidence level $1 - \delta$
If there exists an algorithm that, with probability at least $1 - \delta$, outputs a hypothesis $h$ with generalization error $R(h) \le \epsilon$, using a number of samples polynomial in $1/\epsilon$ and $1/\delta$, then $\mathcal{H}$ is PAC-learnable.
Sample complexity: to achieve the $(\epsilon, \delta)$ guarantee, it suffices that
$$n = O\!\left(\frac{d_{\text{VC}} + \ln\frac{1}{\delta}}{\epsilon^2}\right)$$
Conclusion: finite VC dimension $\iff$ PAC-learnable.
The Generalization Mystery of Deep Learning
Phenomenon: deep neural networks have parameters far outnumbering the training samples, yet still generalize well.
Classical theory fails:
- VC dimension scales with the parameter count (huge)
- Rademacher complexity is also high
- But test error is still small!
Modern explanations:
- Implicit regularization: SGD tends to converge to flat minima (low Hessian eigenvalues)
- Norm constraints: Actual learned weight norms are much smaller than theoretical upper bounds
- Compression bounds: New bounds based on PAC-Bayes, information theory
- Double descent curve: New phenomenon in modern over-parameterized regime
Q&A
Q1: Why does L1 regularization produce sparse solutions while L2 does not?
A: From geometric and optimization perspectives:
Geometric: feasible region of the constrained optimization
- L1: diamond (sharp corners); contour lines tend to touch it at corners
- L2: circle (smooth); contour lines touch the smooth boundary
Optimization: subgradients
- The L1 subgradient at 0 is the whole interval $[-\lambda, \lambda]$, so the optimum can sit exactly at $w_j = 0$
- The L2 gradient $2\lambda w_j$ vanishes at 0, so it shrinks weights but never forces them exactly to 0
Q2: Why does Dropout need to divide by (1-p) during training?
A: To keep expectation unchanged!
During training, with keep probability $1 - p$:
$$\mathbb{E}[\tilde{h}_i] = \frac{(1-p)\cdot h_i + p \cdot 0}{1-p} = h_i$$
This scaling during training with no change at test time is called Inverted Dropout (PyTorch uses this); the original formulation instead leaves training activations unscaled and multiplies them by $1 - p$ at test time.
Q3: Why is early stopping similar to L2 regularization?
A: For convex quadratic loss, the gradient descent trajectory can be decomposed via the SVD: each singular direction $i$ converges at rate $1 - (1 - \eta \sigma_i^2)^T$. When $T$ is finite, directions with small $\sigma_i$ (typically noise directions) have barely moved from 0, which is exactly the shrinkage that ridge regression applies, with effective $\lambda \approx \frac{1}{\eta T}$.
Q4: Can VC dimension explain the generalization of deep learning?
A: No! Deep neural network VC dimension is huge (on the order of the parameter count or larger), so the classical bounds are vacuous.
Modern theory:
- PAC-Bayes bounds: based on weight norms
- Compression bounds: based on information theory
- Implicit regularization: SGD prefers flat minima
VC dimension still provides guidance for classical machine learning (SVM, decision trees).
Q5: Dropout, L2 regularization, data augmentation — can they be used simultaneously?
A: Yes! They act on different aspects:
- L2 regularization: Limits weight magnitude
- Dropout: Prevents neuron co-adaptation
- Data augmentation: Introduces invariance priors
Common combinations in practice:
- CNN: data augmentation + L2 regularization + Dropout
- Transformer: Dropout + gradient clipping + label smoothing
Note: Excessive regularization leads to underfitting; need cross-validation tuning.
✏️ Exercises and Solutions
Exercise 1: L2 Regularization
Problem:
Exercise 2: L1 vs L2
Problem: Compare L1 and L2 regularization. Solution: L1 produces sparsity (feature selection), L2 produces small weights (smooth).
Exercise 3: Early Stopping
Problem: How does early stopping prevent overfitting? Solution: Monitor validation error, stop when error stops decreasing, acts as implicit regularization.
Exercise 4: Cross Validation
Problem: How does k-fold CV select hyperparameters? Solution: Split data into k folds, train on k-1, validate on 1, repeat k times, average. Choose best hyperparameters.
Exercise 5: AIC vs BIC
Problem: Compare AIC and BIC.
Solution: AIC $= -2\ln\hat{L} + 2k$; BIC $= -2\ln\hat{L} + k\ln n$. BIC penalizes complexity more heavily once $n \ge 8$ and is consistent (recovers the true model as $n \to \infty$); AIC is asymptotically optimal for prediction.
References
[1] Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 5, 1035-1038.
[2] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
[3] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.
[4] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
[5] Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2), 264-280.
[6] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR.
[7] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2020). Deep double descent: Where bigger models and more data hurt. ICLR.
- Post title:Machine Learning Mathematical Derivations (20): Regularization and Model Selection
- Post author:Chen Kai
- Create time:2021-12-17 16:00:00
- Post link:https://www.chenk.top/Machine-Learning-Mathematical-Derivations-20-Regularization-and-Model-Selection/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.