Machine Learning Mathematical Derivations (20): Regularization and Model Selection
Chen Kai

Regularization is a core technique in machine learning for controlling model complexity and preventing overfitting — when training data is limited, models tend to memorize noise rather than true patterns. From the mathematical forms of L1/L2 regularization to their Bayesian prior interpretation, from Dropout's random deactivation to early stopping's implicit regularization, from cross-validation for model selection to VC dimension generalization bounds, regularization theory provides mathematical guarantees for balancing underfitting and overfitting. This chapter deeply derives the optimization forms, Bayesian interpretation, bias-variance decomposition, and learning theory foundations of regularization.

Overfitting and Generalization

Mathematical Characterization of Overfitting

Training error (empirical risk):

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(\mathbf{x}_i), y_i)$$

Generalization error (expected risk):

$$R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\left[\ell(h(\mathbf{x}), y)\right]$$

Generalization gap:

$$R(h) - \hat{R}(h)$$

Overfitting: $\hat{R}(h)$ is small but $R(h)$ is large, i.e. a large generalization gap

Underfitting: $\hat{R}(h)$ is large (the model cannot even fit the training data)

Bias-Variance Tradeoff

Expected generalization error decomposition for regression:

Assume the true model is $y = f(\mathbf{x}) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. A learning algorithm trained on dataset $D$ yields model $\hat{f}_D$. Final decomposition:

$$\mathbb{E}_D\left[(y - \hat{f}_D(\mathbf{x}))^2\right] = \underbrace{\left(\mathbb{E}_D[\hat{f}_D(\mathbf{x})] - f(\mathbf{x})\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\left[\left(\hat{f}_D(\mathbf{x}) - \mathbb{E}_D[\hat{f}_D(\mathbf{x})]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$

Physical interpretation:

  • Bias: Difference between average prediction and true value, measures underfitting
  • Variance: Prediction fluctuation due to different training sets, measures overfitting
  • Irreducible error: Noise in data itself
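This decomposition can be checked numerically. Below is a minimal NumPy sketch (the sine target, polynomial fits, and all constants are illustrative choices, not from the original text): many noisy training sets are drawn, a polynomial of fixed degree is fit to each, and the bias² and variance of the prediction at one test point are estimated:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def fit_poly(degree, n_train=20, n_datasets=200, x_test=0.25, noise=0.3):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit
    at x_test, averaged over many independent noisy training sets."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise, n_train)
        coefs = np.polyfit(x, y, degree)            # least-squares polynomial fit
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x_test)) ** 2  # (avg prediction - truth)^2
    variance = preds.var()                          # fluctuation across datasets
    return bias_sq, variance

for d in (1, 3, 9):
    b2, v = fit_poly(d)
    print(f"degree={d}: bias^2={b2:.4f}  variance={v:.4f}")
```

Low-degree fits show large bias and small variance; high-degree fits show the reverse, matching the tradeoff described above.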

Model Complexity and Generalization Tradeoff

Simple model (low complexity):

  • High bias (underfitting)
  • Low variance (stable)

Complex model (high complexity):

  • Low bias (fits training data)
  • High variance (sensitive to training data, overfitting)

Optimal complexity: Balances bias and variance, minimizes total error

L2 Regularization: Ridge Regression

Optimization Form

Objective function:

$$\min_{\mathbf{w}} \; L(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$$

where $\lambda > 0$ is the regularization strength

Linear regression ridge regression:

$$\min_{\mathbf{w}} \; \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$$

Closed-form solution:

$$\hat{\mathbf{w}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$

Regularization effects:

  • Prevents $X^\top X$ from being singular (non-invertible)
  • Shrinks weights toward 0 (weight decay)
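A minimal sketch of the closed-form solution (the collinear toy design is illustrative): with two identical feature columns, $X^\top X$ is singular and plain least squares has no unique solution, but the $\lambda I$ term restores invertibility:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1])              # duplicated column: X^T X is singular
y = 3 * x1 + rng.normal(0, 0.1, size=50)

w = ridge_closed_form(X, y, lam=1e-2)
print(w)   # the total weight of ~3 is split evenly across the identical features
```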

Bayesian Interpretation

Maximum A Posteriori estimation (MAP):

$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid D) = \arg\max_{\mathbf{w}} p(D \mid \mathbf{w})\, p(\mathbf{w})$$

Taking logarithm:

$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(D \mid \mathbf{w}) + \log p(\mathbf{w}) \right]$$

Assumptions:

  • Likelihood: $p(D \mid \mathbf{w}) = \prod_i \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)$
  • Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \tau^2 I)$ (Gaussian prior)

Conclusion: L2 regularization is equivalent to MAP estimation under a Gaussian prior, with $\lambda = \sigma^2 / \tau^2$

Weight Decay in Gradient Descent

Gradient update:

Rewriting:

Physical meaning: Before each update, weights first decay totimes
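The two update forms are algebraically identical, which is easy to confirm on a toy loss (the quadratic example and constants are illustrative):

```python
import numpy as np

def step_regularized(w, grad_loss, lam, eta):
    """One gradient step on L(w) + (lam/2) * ||w||^2."""
    return w - eta * (grad_loss(w) + lam * w)

def step_weight_decay(w, grad_loss, lam, eta):
    """Same step rewritten: decay w by (1 - eta*lam), then a plain step."""
    return (1 - eta * lam) * w - eta * grad_loss(w)

grad = lambda w: 2 * (w - 1.0)     # gradient of toy loss (w - 1)^2
w0 = np.array([5.0])
a = step_regularized(w0, grad, lam=0.1, eta=0.01)
b = step_weight_decay(w0, grad, lam=0.1, eta=0.01)
print(a, b)   # identical up to floating point
```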

L1 Regularization: Lasso

Optimization Form

Objective function:

$$\min_{\mathbf{w}} \; L(\mathbf{w}) + \lambda \|\mathbf{w}\|_1$$

where $\|\mathbf{w}\|_1 = \sum_j |w_j|$

Characteristic: Non-differentiable at 0, requires special optimization algorithms (subgradient methods, proximal gradient, coordinate descent)

Sparsity Induction

Geometric intuition of L1 vs L2:

Consider the constrained optimization form:

  • L2 constraint: $\|\mathbf{w}\|_2^2 \le C$ (sphere)
  • L1 constraint: $\|\mathbf{w}\|_1 \le C$ (diamond, sharp corners on coordinate axes)

Tangent point of loss contour lines and the constraint region:

  • L2: Tangent point typically not on coordinate axes, no component is exactly 0, non-sparse
  • L1: Tangent point tends to land on coordinate axes (at corners), some components are exactly 0, sparse!

Soft-thresholding operator (the proximal operator of $\lambda\|\cdot\|_1$):

$$\mathrm{prox}_{\lambda}(w) = \operatorname{sign}(w)\,\max(|w| - \lambda,\, 0)$$

Conclusion: L1 regularization drives small weights directly to 0, achieving feature selection
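The operator itself is a one-liner; a minimal sketch:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding: sign(x) * max(|x| - t, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

w = np.array([-2.0, -0.3, 0.0, 0.4, 3.0])
print(soft_threshold(w, 0.5))   # small weights -0.3 and 0.4 are set exactly to 0
```

This update is the core step of proximal-gradient (ISTA) and coordinate-descent Lasso solvers.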

Bayesian Interpretation

Laplace prior:

$$p(w_j) = \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right)$$

so that $-\log p(\mathbf{w}) = \frac{1}{b}\|\mathbf{w}\|_1 + \text{const}$

Conclusion: L1 regularization is equivalent to MAP estimation under a Laplace prior

Elastic Net: L1 + L2

Combined regularization:

$$\min_{\mathbf{w}} \; L(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$

Advantages:

  • Preserves L1's sparsity
  • Avoids L1's instability with highly correlated features
  • L2's grouping effect: Correlated features tend to be selected or dropped together

Dropout: Random Deactivation Regularization

Dropout Mechanism

During training:

Each neuron is randomly "turned off" with probability $p$ (usually $p = 0.5$)

Or equivalently, introducing a binary mask $m_i \sim \mathrm{Bernoulli}(1-p)$:

$$\tilde{a}_i = \frac{m_i\, a_i}{1 - p}$$

Dividing by $1-p$ ensures the expectation is unchanged: $\mathbb{E}[\tilde{a}_i] = \frac{(1-p)\,a_i}{1-p} = a_i$

During testing:

Use all neurons, no Dropout (or equivalently use the expectation)
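A sketch of this (inverted) scheme, where the division by $1-p$ happens at training time so test time is a no-op (the array size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, p, train=True):
    """Inverted dropout: drop each unit with prob p, scale survivors by 1/(1-p)."""
    if not train:
        return a                       # test time: use all neurons unchanged
    mask = rng.random(a.shape) >= p    # keep each unit with probability 1-p
    return a * mask / (1 - p)

a = np.ones(100_000)
out = dropout_forward(a, p=0.5)
print(out.mean())   # ≈ 1.0: the scaling preserves the expectation
```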

Ensemble Learning Perspective

Idea: Dropout trains exponentially many sub-networks

Assuming the network has $n$ hidden neurons, each of which can be dropped, there are $2^n$ possible sub-networks

Training: Each mini-batch samples one sub-network

Testing: Approximates average of all sub-networks (weight sharing)

Analogy to Bagging: Train multiple models and average

Mathematical Analysis of Regularization Effect

Approximate L2 regularization:

Consider a linear model $y = \mathbf{w}^\top \mathbf{x}$. Dropout is equivalent to adding multiplicative noise:

$$y = \frac{1}{1-p}\,\mathbf{w}^\top (\mathbf{m} \odot \mathbf{x}), \quad m_j \sim \mathrm{Bernoulli}(1-p)$$

Expectation and variance:

$$\mathbb{E}[y] = \mathbf{w}^\top \mathbf{x}, \qquad \mathrm{Var}(y) = \frac{p}{1-p} \sum_j w_j^2 x_j^2$$

Minimizing the variance is equivalent to penalizing $\sum_j w_j^2 x_j^2$, similar to adaptive L2 regularization

Deep networks: Dropout prevents neuron co-adaptation, forcing learning of robust features

Early Stopping and Implicit Regularization

Early Stopping Strategy

Algorithm:

  1. Monitor validation set error during training
  2. Stop when validation error doesn't decrease for $p$ consecutive epochs (the patience)
  3. Return model with minimum validation error

Effect: Prevents overfitting, acting as an implicit form of L2 regularization
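The three steps above can be sketched as a generic loop (the toy model dict, `train_step`, and the U-shaped validation curve are illustrative stand-ins for a real trainer):

```python
import copy

def train_with_early_stopping(model, train_step, val_error, patience=5, max_epochs=1000):
    """Stop once validation error fails to improve for `patience` epochs;
    return the snapshot with the lowest validation error seen."""
    best_err, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for _ in range(max_epochs):
        train_step(model)
        err = val_error(model)
        if err < best_err:
            best_err, best_model, bad_epochs = err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_model, best_err

model = {"w": 0.0}
def train_step(m): m["w"] += 0.1              # training keeps pushing w up
def val_error(m): return (m["w"] - 1.5) ** 2  # validation loss turns up past w = 1.5

best, err = train_with_early_stopping(model, train_step, val_error)
print(best["w"])   # ≈ 1.5, the validation optimum, not the final over-trained w
```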

Mathematical Explanation of Early Stopping

Gradient descent trajectory:

Initialize $\mathbf{w}_0 = \mathbf{0}$; after $t$ iterations:

$$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \nabla L(\mathbf{w}_{t-1})$$

Early stopping = limiting the iteration count $t$

For convex quadratic loss (like linear regression), early stopping is equivalent to L2 regularization

Intuition:

  • Early training: Learn main patterns (low-frequency components)
  • Late training: Fit noise (high-frequency components)

Early stopping stops after learning main patterns, avoiding fitting noise

Relationship between Early Stopping and L2 Regularization

Linear regression example:

$$L(\mathbf{w}) = \frac{1}{2}\|\mathbf{y} - X\mathbf{w}\|^2, \qquad X = U \Sigma V^\top \ \text{(SVD)}$$

It can be proven (via the SVD) that gradient descent from $\mathbf{w}_0 = \mathbf{0}$ gives:

$$\mathbf{w}_t = V \,\mathrm{diag}\!\left(\frac{1 - (1 - \eta\sigma_i^2)^t}{\sigma_i}\right) U^\top \mathbf{y}$$

where $\sigma_i$ are the singular values of $X$. The factor $1 - (1 - \eta\sigma_i^2)^t$ plays the same role as the ridge shrinkage factor $\frac{\sigma_i^2}{\sigma_i^2 + \lambda}$, with effective $\lambda \approx \frac{1}{\eta t}$.

Conclusion: Early stopping is equivalent to adaptive ridge regression whose effective $\lambda$ decreases with the iteration count
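The SVD formula can be verified numerically: run plain gradient descent for $t$ steps and compare against the closed-form iterate (random data, step size, and iteration count are illustrative; $\eta$ must satisfy $\eta\sigma_{\max}^2 < 2$ for convergence):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
eta, T = 0.01, 200

# Plain gradient descent on L(w) = 1/2 ||y - Xw||^2, starting from w = 0.
w = np.zeros(5)
for _ in range(T):
    w -= eta * X.T @ (X @ w - y)

# Closed form of the same iterate: per-mode shrinkage 1 - (1 - eta*s_i^2)^T
# applied to the least-squares solution in the SVD basis.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = 1 - (1 - eta * s**2) ** T
w_svd = Vt.T @ (shrink / s * (U.T @ y))

print(np.allclose(w, w_svd))   # True: the trajectory matches the shrinkage form
```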

Model Selection and Cross-Validation

Cross-Validation Strategies

K-fold cross-validation:

  1. Split data into $K$ folds (usually $K = 5$ or 10)
  2. For each fold $k = 1, \dots, K$:
    • Train on the remaining $K - 1$ folds
    • Validate on fold $k$, recording validation error $E_k$

Estimated generalization error:

$$\hat{E}_{\text{CV}} = \frac{1}{K}\sum_{k=1}^{K} E_k$$

Leave-One-Out Cross-Validation (LOO-CV):

  • Nearly unbiased estimate
  • Large computation: $n$ training runs
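A from-scratch sketch of K-fold selection of the ridge strength $\lambda$ (the synthetic data and candidate grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def kfold_cv_error(X, y, lam, k=5):
    """Average validation MSE of closed-form ridge regression over k folds."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))

# Noisy data with many irrelevant features: some shrinkage should help.
n, d = 60, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(0, 1.0, size=n)

scores = {lam: kfold_cv_error(X, y, lam) for lam in (1e-3, 1e-1, 1.0, 10.0)}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```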

Information Criteria

AIC (Akaike Information Criterion):

$$\text{AIC} = -2 \log \hat{L} + 2k$$

where $k$ is the number of parameters

BIC (Bayesian Information Criterion):

$$\text{BIC} = -2 \log \hat{L} + k \log n$$

where $n$ is the sample size

Principle: Balance fit and complexity

  • First term: Negative log-likelihood (smaller is better)
  • Second term: Complexity penalty (simpler is better)
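The two criteria differ only in the penalty term; a quick sketch (the log-likelihood value and parameter counts are made-up numbers for illustration):

```python
import numpy as np

def aic(log_lik, k):
    """AIC = -2 log L + 2k."""
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    """BIC = -2 log L + k log n."""
    return -2 * log_lik + k * np.log(n)

# Two models with the same fit quality on n = 100 samples.
n, log_lik, k_small, k_big = 100, -120.0, 3, 10
print(aic(log_lik, k_small), bic(log_lik, k_small, n))   # smaller model
print(aic(log_lik, k_big), bic(log_lik, k_big, n))       # larger model
```

Since $\log 100 \approx 4.6 > 2$, BIC charges the seven extra parameters more than AIC does.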

AIC vs BIC:

  • BIC penalizes complexity more heavily (when $\log n > 2$, i.e. $n \ge 8$)
  • BIC tends to select simpler models
  • AIC is asymptotically optimal (prediction perspective)
  • BIC has consistency (recovers true model)

Generalization Theory: VC Dimension and PAC Learning

VC Dimension Definition

Shattering: Hypothesis space $\mathcal{H}$ can realize all $2^m$ label combinations for dataset $\{\mathbf{x}_1, \dots, \mathbf{x}_m\}$

VC Dimension: Maximum sample size $\mathcal{H}$ can shatter:

$$\text{VC}(\mathcal{H}) = \max \{m : \mathcal{H} \text{ can shatter some set of size } m\}$$

Examples:

  • 1D linear classifier: VC dimension = 2
  • 2D linear classifier: VC dimension = 3
  • $d$-dimensional linear classifier: VC dimension = $d + 1$
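Shattering can be checked by brute force in small cases. A sketch for 1D linear classifiers $\operatorname{sign}(wx + b)$, enumerated as thresholds in both orientations (the finite threshold grid is an illustrative discretization):

```python
def shatters(points, hypotheses):
    """True if the hypotheses realize all 2^m labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# 1D linear classifiers sign(w*x + b) = thresholds in either orientation.
thresholds = [t / 10 for t in range(-20, 21)]
hypotheses = [lambda x, t=t, s=s: int(s * (x - t) > 0)
              for t in thresholds for s in (+1, -1)]

print(shatters([0.3], hypotheses))             # True: 1 point is shattered
print(shatters([0.3, 0.7], hypotheses))        # True: 2 points are shattered
print(shatters([0.1, 0.5, 0.9], hypotheses))   # False: labeling (1, 0, 1) is impossible
```

Two points can always be shattered but three never can, so the VC dimension of this class is 2, as stated.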

VC Dimension and Generalization Bound

Theorem (Vapnik-Chervonenkis):

With probability at least $1 - \delta$, for all $h \in \mathcal{H}$:

$$R(h) \le \hat{R}(h) + \sqrt{\frac{d_{\text{VC}}\left(\log\frac{2m}{d_{\text{VC}}} + 1\right) + \log\frac{4}{\delta}}{m}}$$

where $m$ is the sample size

Meaning:

  • Larger VC dimension, looser generalization bound (need more data)
  • Larger sample size, tighter generalization bound

Tradeoff:

  • Complex model (high VC dimension): Small training error, loose generalization bound
  • Simple model (low VC dimension): Large training error, tight generalization bound

PAC Learning Framework

PAC (Probably Approximately Correct):

Given:

  • Error threshold $\epsilon > 0$
  • Confidence $1 - \delta$

If there exists an algorithm that with probability at least $1 - \delta$ outputs a hypothesis $h$ satisfying:

$$R(h) \le R(h^*) + \epsilon$$

where $h^*$ is the optimal hypothesis, then $\mathcal{H}$ is PAC learnable

Sample complexity:

To achieve $(\epsilon, \delta)$-PAC learning, the required sample size is:

$$m = O\left(\frac{d_{\text{VC}} + \log\frac{1}{\delta}}{\epsilon^2}\right)$$

Conclusion: Finite VC dimension $\Rightarrow$ PAC learnable

The Generalization Mystery of Deep Learning

Phenomenon: Deep neural networks have far more parameters than training samples and should severely overfit according to traditional theory, yet they actually generalize well

Classical theory fails:

  • VC dimension on the order of the parameter count (huge)
  • Rademacher complexity also high
  • But test error is still small!

Modern explanations:

  1. Implicit regularization: SGD tends to converge to flat minima (low Hessian eigenvalues)
  2. Norm constraints: Actual learned weight norms are much smaller than theoretical upper bounds
  3. Compression bounds: New bounds based on PAC-Bayes, information theory
  4. Double descent curve: New phenomenon in modern over-parameterized regime

Q&A

Q1: Why does L1 regularization produce sparse solutions while L2 does not?

A: From geometric and optimization perspectives:

Geometric: Feasible region of the constrained optimization

  • L1: Diamond (sharp corners), contour lines tend to touch at the corners
  • L2: Circle (smooth), contour lines touch the smooth boundary

Optimization: Subgradient

  • L1: the subdifferential of $\lambda|w|$ at 0 is $[-\lambda, \lambda]$; whenever the data gradient falls inside this interval, $w = 0$ is exactly optimal
  • L2: the gradient of $\lambda w^2$ at 0 is 0, so weights can only approach 0 asymptotically, never reaching it

Q2: Why does Dropout need to divide by (1-p) during training?

A: To keep the expectation unchanged!

During training: $\mathbb{E}[\tilde{a}] = \frac{(1-p)\,a}{1-p} = a$. During testing: $a$ is used as-is. Without the scaling, test-time activations would be $\frac{1}{1-p}$ times their training-time expectation, causing an output distribution mismatch.

Note on terminology: scaling by $\frac{1}{1-p}$ during training and doing nothing at test time is Inverted Dropout (what PyTorch implements); the original Dropout formulation instead multiplies activations by $(1-p)$ at test time.

Q3: Why is early stopping similar to L2 regularization?

A: For convex quadratic loss, the gradient descent trajectory can be decomposed via the SVD:

$$\mathbf{w}_t = V \,\mathrm{diag}\!\left(\frac{1 - (1 - \eta\sigma_i^2)^t}{\sigma_i}\right) U^\top \mathbf{y}$$

When $\eta\sigma_i^2$ is small, $1 - (1 - \eta\sigma_i^2)^t \approx \frac{\sigma_i^2}{\sigma_i^2 + 1/(\eta t)}$, which is exactly the shrinkage factor of ridge regression with $\lambda \approx \frac{1}{\eta t}$.

Q4: Can VC dimension predict deep learning generalization?

A: No! Deep neural network VC dimension is huge (on the order of the parameter count), requiring astronomical sample sizes by VC theory, yet actual generalization is good.

Modern theory:

  • PAC-Bayes bounds: Based on weight norms
  • Compression bounds: Based on information theory
  • Implicit regularization: SGD prefers flat minima

VC dimension still provides guidance for classical machine learning (SVM, decision trees).

Q5: Dropout, L2 regularization, data augmentation — can they be used simultaneously?

A: Yes! They act on different aspects:

  • L2 regularization: Limits weight magnitude
  • Dropout: Prevents neuron co-adaptation
  • Data augmentation: Introduces invariance priors

Common combinations in practice:

  • CNN: Data augmentation + L2 regularization + Dropout
  • Transformer: Dropout + gradient clipping + label smoothing

Note: Excessive regularization leads to underfitting; need cross-validation tuning.

✏️ Exercises and Solutions

Exercise 1: L2 Regularization

Problem: $J(\mathbf{w}) = L(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$. Find the gradient. Solution: $\nabla J = \nabla L(\mathbf{w}) + \lambda\mathbf{w}$.

Exercise 2: L1 vs L2

Problem: Compare L1 and L2 regularization. Solution: L1 produces sparsity (feature selection), L2 produces small weights (smooth).

Exercise 3: Early Stopping

Problem: How does early stopping prevent overfitting? Solution: Monitor validation error, stop when error stops decreasing, acts as implicit regularization.

Exercise 4: Cross Validation

Problem: How does k-fold CV select hyperparameters? Solution: Split data into k folds, train on k-1, validate on 1, repeat k times, average. Choose best hyperparameters.

Exercise 5: AIC vs BIC

Problem: Compare AIC and BIC. Solution: AIC $= -2\log\hat{L} + 2k$, BIC $= -2\log\hat{L} + k\log n$. BIC penalizes complexity more (when $\log n > 2$).

References

[1] Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 5, 1035-1038.

[2] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.

[3] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.

[4] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.

[5] Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2), 264-280.

[6] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR.

[7] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2020). Deep double descent: Where bigger models and more data hurt. ICLR.

  • Post title: Machine Learning Mathematical Derivations (20): Regularization and Model Selection
  • Post author: Chen Kai
  • Create time: 2021-12-17 16:00:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-20-Regularization-and-Model-Selection/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stating additionally.