In 2005, Google Research published a paper claiming that their simple statistical models outperformed carefully designed expert systems in machine translation tasks. This sparked a profound question: Why can simple models learn effective patterns from data? The answer lies in the mathematical theory of machine learning.
The core question of machine learning is: Given finite training samples, how can we guarantee that the learned model performs well on unseen data? This is not an engineering problem but a mathematical one — it involves deep structures from probability theory, functional analysis, and optimization theory. This series rigorously derives the theoretical foundations of machine learning from mathematical first principles.
Basic Concepts of Machine Learning
Mathematical Formalization of the Learning Problem
Assume there exists an unknown data-generating distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the label space. The training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ consists of $m$ samples drawn i.i.d. from $\mathcal{D}$, and the goal of learning is to find a hypothesis $h: \mathcal{X} \to \mathcal{Y}$ that predicts well on new samples from $\mathcal{D}$.
Definition 1 (Loss Function): A loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ measures the discrepancy between a prediction $\hat{y}$ and the true label $y$. Common examples:
- 0-1 Loss (Classification): $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$
- Squared Loss (Regression): $\ell(\hat{y}, y) = (\hat{y} - y)^2$
- Hinge Loss (Support Vector Machines): $\ell(\hat{y}, y) = \max(0, 1 - y\hat{y})$
Definition 2 (Generalization Error): The generalization error of hypothesis $h$ is its expected loss on the true distribution: $L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$.
Definition 3 (Empirical Error): Given a training set $S = \{(x_i, y_i)\}_{i=1}^{m}$, the empirical error of $h$ is its average loss on the sample: $L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)$.
The Core Paradox of Machine Learning: We can only observe the empirical error $L_S(h)$, but what we actually care about is the generalization error $L_{\mathcal{D}}(h)$. The entire theory of statistical learning revolves around when, and why, the former is a reliable proxy for the latter.
Three Elements of Supervised Learning
Supervised learning can be completely described by a triple $(\mathcal{H}, \ell, A)$: a hypothesis space, a loss function, and a learning algorithm.
Hypothesis Space
A hypothesis space $\mathcal{H}$ is the set of candidate functions the learner may output. Examples:
- Linear hypothesis space: $\mathcal{H}_{\text{lin}} = \{x \mapsto \langle w, x \rangle + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$
- Polynomial hypothesis space: all polynomials in $x$ up to a fixed degree $k$
The choice of hypothesis space determines the model's expressive power. Larger spaces have stronger expressive power but are more difficult to learn.
Loss Function
Learning Algorithm
Common learning algorithms include:
- Empirical Risk Minimization (ERM): $\hat{h} = \arg\min_{h \in \mathcal{H}} L_S(h)$
- Structural Risk Minimization (SRM): $\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ L_S(h) + \lambda \, \Omega(h) \right]$, where $\Omega(h)$ is a complexity penalty term
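As a concrete sketch of ERM versus SRM over a finite hypothesis class (the data, the threshold class, and the penalty $\Omega$ are all invented here for illustration), both reduce to a direct search:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: label is 1 when x > 0.3, with a little label noise
x = rng.uniform(0, 1, 200)
y = (x > 0.3).astype(int)
flip = rng.random(200) < 0.05
y[flip] = 1 - y[flip]

# Finite hypothesis class: threshold classifiers h_t(x) = 1[x > t]
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    """0-1 empirical error L_S(h_t)."""
    return np.mean((x > t).astype(int) != y)

# ERM: minimize the empirical risk alone
erm_t = min(thresholds, key=empirical_risk)

# SRM: empirical risk plus a complexity penalty Omega(h_t)
# (a made-up penalty favoring thresholds near 0.5, purely for illustration)
lam = 0.1
srm_t = min(thresholds, key=lambda t: empirical_risk(t) + lam * abs(t - 0.5))

print(f"ERM threshold: {erm_t:.2f}, SRM threshold: {srm_t:.2f}")
```

With noiseless labels ERM would recover the true threshold exactly; the 5% label noise illustrates why the empirical minimizer only approximates it.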
PAC Learning Framework: Mathematical Characterization of Learnability
Definition of PAC Learning
The Probably Approximately Correct (PAC) learning framework, proposed by Valiant in 1984, provides a mathematical answer to the fundamental question "What kinds of problems are learnable?"
Definition 4 (PAC Learnability): A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ and a learning algorithm such that for every $\epsilon, \delta \in (0,1)$ and every distribution $\mathcal{D}$ realizable by $\mathcal{H}$, when the algorithm is run on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples, it returns a hypothesis $h$ satisfying $L_{\mathcal{D}}(h) \leq \epsilon$ with probability at least $1 - \delta$. The name unpacks as follows:
- Probably (with high probability): the event occurs with probability at least $1 - \delta$
- Approximately (approximately optimal): the generalization error differs from that of the optimal hypothesis by at most $\epsilon$
- Correct: the error is measured on the true distribution $\mathcal{D}$, not merely on the training set
Sample Complexity is the core quantity in PAC learning: it tells us how many training samples are needed to guarantee successful learning. Lower sample complexity means easier learning.
PAC Learnability of Finite Hypothesis Spaces
We first consider the simplest case: the hypothesis space $\mathcal{H}$ is finite ($|\mathcal{H}| < \infty$), and the realizability assumption holds: there exists $h^{\star} \in \mathcal{H}$ with $L_{\mathcal{D}}(h^{\star}) = 0$.
Theorem 1 (Sample Complexity for Finite Hypothesis Spaces): Let $\mathcal{H}$ be a finite hypothesis class and assume realizability. If the number of samples satisfies $m \geq \frac{1}{\epsilon} \ln \frac{|\mathcal{H}|}{\delta}$, then with probability at least $1 - \delta$, every ERM hypothesis $\hat{h}$ (i.e., every $h$ with $L_S(h) = 0$) satisfies $L_{\mathcal{D}}(\hat{h}) \leq \epsilon$.
Proof:
Step 1: Define the set of bad hypotheses
$\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D}}(h) > \epsilon\}$.
Our goal is to prove: with high probability, the ERM algorithm will not select a bad hypothesis, because every bad hypothesis makes at least one mistake on the training set.
Step 2: Analyze the probability of a single bad hypothesis
For a fixed bad hypothesis $h \in \mathcal{H}_B$, each sample is classified correctly with probability at most $1 - \epsilon$. Because the samples are i.i.d., the probability that $h$ is consistent with the entire training set is
$$\Pr[L_S(h) = 0] \leq (1 - \epsilon)^m.$$
Step 3: Use the Union Bound
Define the event $B = \{\exists\, h \in \mathcal{H}_B : L_S(h) = 0\}$. By the union bound,
$$\Pr[B] \leq \sum_{h \in \mathcal{H}_B} \Pr[L_S(h) = 0] \leq |\mathcal{H}| (1 - \epsilon)^m.$$
Step 4: Use the inequality $1 - \epsilon \leq e^{-\epsilon}$
This is a standard exponential inequality. Therefore:
$$\Pr[B] \leq |\mathcal{H}| e^{-\epsilon m}.$$
Setting the right-hand side to at most $\delta$ and solving for $m$ gives $m \geq \frac{1}{\epsilon} \ln \frac{|\mathcal{H}|}{\delta}$.
Step 5: Conclusion
When the number of samples satisfies the above condition, with probability at least $1 - \delta$ no bad hypothesis is consistent with the training set, so any ERM hypothesis $\hat{h}$ satisfies $L_{\mathcal{D}}(\hat{h}) \leq \epsilon$. $\blacksquare$
Intuition behind the theorem:
- $m$ grows only logarithmically in $|\mathcal{H}|$: squaring the size of the hypothesis space merely doubles the sample requirement
- $m$ grows as $1/\epsilon$: higher accuracy costs proportionally more samples
- $m$ grows as $\ln(1/\delta)$: higher confidence is comparatively cheap
Example (Boolean Function Learning):
Consider learning conjunctions over $n$ Boolean variables. Each variable may appear positively, appear negatively, or not appear at all, so (counting the always-false hypothesis) $|\mathcal{H}| = 3^n + 1$ and $\ln |\mathcal{H}| = O(n)$. Theorem 1 then gives sample complexity $m = O\left(\frac{1}{\epsilon}\left(n + \ln \frac{1}{\delta}\right)\right)$, polynomial in $n$: Boolean conjunctions are efficiently PAC learnable.
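Plugging illustrative numbers into the bound of Theorem 1 (the choices $n = 10$, $\epsilon = 0.1$, $\delta = 0.05$ are ours, for concreteness):

```python
import math

def pac_sample_bound(H_size, eps, delta):
    """Theorem 1: m >= (1/eps) * ln(|H| / delta) samples suffice."""
    return math.ceil(math.log(H_size / delta) / eps)

n = 10
H_size = 3**n + 1  # conjunctions over n Boolean variables
m = pac_sample_bound(H_size, eps=0.1, delta=0.05)
print(m)  # 140 -- grows only linearly in n, since ln(3^n) = n * ln(3)
```

Even though the class contains tens of thousands of hypotheses, only about 140 samples are needed, because the bound depends on $\ln|\mathcal{H}|$, not $|\mathcal{H}|$.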
Agnostic PAC Learning
The previous analysis assumed the existence of a perfect hypothesis (realizable hypothesis), but this often does not exist in reality. Agnostic PAC Learning relaxes this assumption.
Definition 5 (Agnostic PAC Learnability): A hypothesis class $\mathcal{H}$ is agnostically PAC learnable if there exist a function $m_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ and a learning algorithm such that for every $\epsilon, \delta \in (0,1)$ and every distribution $\mathcal{D}$ (no realizability assumed), when run on $m \geq m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples the algorithm returns $h$ satisfying, with probability at least $1 - \delta$, $L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$.
Theorem 2 (Sample Complexity in the Agnostic Case):
For a finite hypothesis space $\mathcal{H}$, $m \geq \frac{2}{\epsilon^2} \ln \frac{2|\mathcal{H}|}{\delta}$ samples suffice for agnostic PAC learning via ERM (the proof uses Hoeffding's inequality plus a union bound). Comparing the two regimes:
- Realizable case: $m = O\left(\frac{1}{\epsilon} \ln \frac{|\mathcal{H}|}{\delta}\right)$
- Agnostic case: $m = O\left(\frac{1}{\epsilon^2} \ln \frac{|\mathcal{H}|}{\delta}\right)$
Halving the target error $\epsilon$ quadruples the sample requirement in the agnostic case; this quadratic dependence is a fundamental law of statistical learning.
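The $1/\epsilon$ versus $1/\epsilon^2$ scaling is easy to make concrete (a sketch using the bounds above; $|\mathcal{H}| = 1000$ and $\delta = 0.05$ are illustrative):

```python
import math

H_size, delta = 1000, 0.05

def m_realizable(eps):
    """Realizable bound: (1/eps) * ln(|H| / delta)."""
    return math.log(H_size / delta) / eps

def m_agnostic(eps):
    """Agnostic bound: (2/eps^2) * ln(2|H| / delta)."""
    return 2 * math.log(2 * H_size / delta) / eps**2

for eps in [0.1, 0.05]:
    print(f"eps={eps}: realizable ~{m_realizable(eps):.0f}, "
          f"agnostic ~{m_agnostic(eps):.0f}")

# Halving eps doubles the realizable bound but quadruples the agnostic one
assert math.isclose(m_realizable(0.05) / m_realizable(0.1), 2.0)
assert math.isclose(m_agnostic(0.05) / m_agnostic(0.1), 4.0)
```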
VC Dimension Theory: Complexity Measure for Infinite Hypothesis Spaces
From Finite to Infinite: Why We Need VC Dimension
Previous results relied on $|\mathcal{H}|$ being finite, but many practically important hypothesis spaces are infinite:
- Linear classifiers: $\mathcal{H} = \{x \mapsto \operatorname{sign}(\langle w, x \rangle + b) : w \in \mathbb{R}^d, b \in \mathbb{R}\}$, containing infinitely many hypotheses
- Polynomial classifiers, neural networks, etc., are all infinite hypothesis spaces
For infinite hypothesis spaces, $\ln |\mathcal{H}| = \infty$ and the finite-class bounds become vacuous. We need a new measure of complexity: the VC dimension.
The Shattering Concept
Definition 6 (Shattering): For a sample set $C = \{x_1, \ldots, x_m\}$, we say the hypothesis space $\mathcal{H}$ shatters $C$ if $\mathcal{H}$ can realize every possible labeling of $C$.
Mathematical representation: $|\mathcal{H}_C| = 2^m$, where $\mathcal{H}_C = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}$ is the set of labelings $\mathcal{H}$ induces on $C$.
Intuition: Shattering means the hypothesis space's expressive power on this particular sample set reaches maximum — it can produce all possible labeling patterns.
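Shattering can be checked by brute force for a tiny class. A sketch with 1-D threshold classifiers $h_t(x) = \mathbb{1}[x > t]$ (this class and the grid of thresholds are our illustrative choices): they shatter any single point but no two-point set, so their VC dimension is 1.

```python
def labelings(points, thresholds):
    """All distinct labelings the threshold class realizes on `points`."""
    return {tuple(int(x > t) for x in points) for t in thresholds}

thresholds = [i / 100 for i in range(-100, 201)]  # dense grid of hypotheses

print(labelings([0.5], thresholds))       # both labelings: shattered
print(labelings([0.2, 0.8], thresholds))  # only 3 of 4: (1, 0) is impossible
```

For two points $x_1 < x_2$, any threshold labeling $x_1$ positive must also label $x_2$ positive, so the pattern $(1, 0)$ can never appear: $|\mathcal{H}_C| = 3 < 2^2$.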
Definition of VC Dimension
Definition 7 (VC Dimension): The VC dimension of a hypothesis space $\mathcal{H}$ is the size of the largest sample set it can shatter: $\operatorname{VCdim}(\mathcal{H}) = \max\{m : \exists\, C, |C| = m, \mathcal{H} \text{ shatters } C\}$ (and $\infty$ if $\mathcal{H}$ shatters arbitrarily large sets).
Key properties:
- VC dimension is combinatorial: it does not depend on the data distribution, only on the structure of $\mathcal{H}$
- The definition is existential: if $\mathcal{H}$ fails to shatter some particular set of size $m$, it may still shatter a different set of size $m$; a single shattered set of each size suffices
Example: VC Dimension of Linear Classifiers
Theorem 3: The VC dimension of linear classifiers (halfspaces $x \mapsto \operatorname{sign}(\langle w, x \rangle + b)$) in $\mathbb{R}^d$ is $d + 1$.
Proof:
Part 1: $\operatorname{VCdim} \geq d + 1$
Consider the $d + 1$ points $x_0 = 0, x_1 = e_1, \ldots, x_d = e_d$ (the origin and the standard basis vectors).
Given any labeling $(y_0, y_1, \ldots, y_d) \in \{-1, +1\}^{d+1}$:
- Set $b = \frac{y_0}{2}$, so that $\operatorname{sign}(\langle w, x_0 \rangle + b) = \operatorname{sign}(b) = y_0$
- Set $w_i = y_i - b$ for each $i$, so that $\operatorname{sign}(\langle w, e_i \rangle + b) = \operatorname{sign}(y_i) = y_i$
By appropriately choosing $w$ and $b$, we can realize all labelings, so this set is shattered and $\operatorname{VCdim} \geq d + 1$.
Part 2: $\operatorname{VCdim} < d + 2$
Suppose there exist $d + 2$ points $x_1, \ldots, x_{d+2} \in \mathbb{R}^d$ that are shattered. Augmenting each point with a constant coordinate gives $d + 2$ vectors $(x_i, 1)$ in a $(d+1)$-dimensional space, which must be linearly dependent: there exist coefficients $a_i$, not all zero, with $\sum_i a_i (x_i, 1) = 0$. Let $P = \{i : a_i > 0\}$ and $N = \{i : a_i < 0\}$; since $\sum_i a_i = 0$, both are nonempty.
Taking the weighted sum with any candidate $(w, b)$ shows the labeling "$+1$ on $P$, $-1$ on $N$" is unrealizable: if $\langle w, x_i \rangle + b > 0$ for all $i \in P$ and $< 0$ for all $i \in N$, then $0 = \sum_i a_i (\langle w, x_i \rangle + b)$ would be a sum of strictly positive terms, a contradiction. Hence no set of $d + 2$ points is shattered, and combining both parts, $\operatorname{VCdim} = d + 1$. $\blacksquare$
Intuition: A hyperplane in $\mathbb{R}^d$ has $d + 1$ free parameters (the $d$ components of $w$ plus the offset $b$), matching the VC dimension $d + 1$.
VC Dimension and Sample Complexity
Theorem 4 (Vapnik-Chervonenkis Bound): Let $\operatorname{VCdim}(\mathcal{H}) = d < \infty$. Then $\mathcal{H}$ is agnostically PAC learnable with sample complexity $m = O\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)$.
Proof sketch (incomplete):
1. Introduce the growth function $\tau_{\mathcal{H}}(m)$: the maximum number of distinct labelings that $\mathcal{H}$ can realize on a sample set of size $m$.
2. Sauer-Shelah Lemma: If $\operatorname{VCdim}(\mathcal{H}) = d$, then $\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d$. This is much smaller than $2^m$ (polynomial vs. exponential).
3. Use the uniform convergence theorem and the symmetrization technique to convert the growth-function bound into a generalization bound.
The complete proof requires more technical tools, which we will expand in subsequent chapters.
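The polynomial-versus-exponential gap in the Sauer-Shelah lemma is easy to see numerically (a quick sketch for $d = 10$; the values of $m$ are illustrative):

```python
from math import comb

def growth_bound(m, d):
    """Sauer-Shelah: tau_H(m) <= sum_{i=0}^{d} C(m, i) when VCdim(H) = d."""
    return sum(comb(m, i) for i in range(d + 1))

d = 10
for m in [10, 20, 50, 100]:
    print(f"m={m:4d}  Sauer bound={growth_bound(m, d):.3e}  2^m={2**m:.3e}")
```

At $m = d$ the bound equals $2^m$ (any set that small may still be shattered); beyond that it falls polynomially behind, e.g. at $m = 100$ the bound is about $10^{13}$ while $2^{100} \approx 10^{30}$.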
Application example:
For linear classifiers in $\mathbb{R}^d$:
- VC dimension: $d + 1$
- To achieve accuracy $\epsilon$ with confidence $1 - \delta$, we need $m = O\left(\frac{(d + 1) + \ln(1/\delta)}{\epsilon^2}\right)$ samples
This sample complexity is linear in the dimension $d$, which is very practical for high-dimensional problems.
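A quick sketch of how the VC bound scales with dimension (constants dropped; the choices $\epsilon = 0.1$ and $\delta = 0.05$ are ours, for illustration):

```python
import math

def vc_sample_bound(d, eps, delta):
    """Theorem 4 up to constants: m = O((d + ln(1/delta)) / eps^2)."""
    return math.ceil((d + math.log(1 / delta)) / eps**2)

for d in [2, 10, 100, 1000]:
    # linear classifiers in R^d have VC dimension d + 1
    print(d, vc_sample_bound(d + 1, eps=0.1, delta=0.05))
```

Doubling the input dimension roughly doubles the required sample size; the confidence term $\ln(1/\delta)$ quickly becomes negligible next to $d$.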
Bias-Variance Decomposition: Anatomy of Generalization Error
Bias-Variance Tradeoff in Regression
Bias-variance decomposition is another important perspective for understanding model generalization ability. It decomposes generalization error into three components, revealing the relationship between model complexity and generalization ability.
Setup:
- True function: $f(x)$ (unknown)
- Observed data: $y = f(x) + \varepsilon$, where $\varepsilon$ is zero-mean noise: $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2$
- Learning algorithm: learns $\hat{f}_S$ from the training set $S$
- Loss function: squared loss $(\hat{y} - y)^2$

Question: For a fixed test point $x_0$, if we draw multiple different training sets $S$ from the distribution, each time training to get a different $\hat{f}_S$, what is the expected error $\mathbb{E}_{S, \varepsilon}\left[(y_0 - \hat{f}_S(x_0))^2\right]$, where $y_0 = f(x_0) + \varepsilon$?
Theorem 5 (Bias-Variance Decomposition): For a fixed test point $x_0$, write $\bar{f}(x_0) = \mathbb{E}_S[\hat{f}_S(x_0)]$ and define:
- Bias: $\operatorname{Bias}(x_0) = \bar{f}(x_0) - f(x_0)$
- Variance: $\operatorname{Var}(x_0) = \mathbb{E}_S\left[(\hat{f}_S(x_0) - \bar{f}(x_0))^2\right]$
- Noise: $\sigma^2 = \operatorname{Var}(\varepsilon)$
Then the expected generalization error satisfies:
$$\mathbb{E}_{S, \varepsilon}\left[(y_0 - \hat{f}_S(x_0))^2\right] = \operatorname{Bias}(x_0)^2 + \operatorname{Var}(x_0) + \sigma^2$$
Proof:
For fixed $x_0$, substitute $y_0 = f(x_0) + \varepsilon$ and expand in two steps:
1. Separate the noise: since $\varepsilon$ is independent of $S$ and $\mathbb{E}[\varepsilon] = 0$, the cross term vanishes and
$$\mathbb{E}\left[(y_0 - \hat{f}_S(x_0))^2\right] = \sigma^2 + \mathbb{E}_S\left[(f(x_0) - \hat{f}_S(x_0))^2\right]$$
2. Insert the average prediction $\bar{f}(x_0)$: the cross term $\mathbb{E}_S\left[(f(x_0) - \bar{f}(x_0))(\bar{f}(x_0) - \hat{f}_S(x_0))\right]$ vanishes by the definition of $\bar{f}(x_0)$, so
$$\mathbb{E}_S\left[(f(x_0) - \hat{f}_S(x_0))^2\right] = (f(x_0) - \bar{f}(x_0))^2 + \mathbb{E}_S\left[(\bar{f}(x_0) - \hat{f}_S(x_0))^2\right] = \operatorname{Bias}(x_0)^2 + \operatorname{Var}(x_0)$$
Therefore:
$$\mathbb{E}_{S, \varepsilon}\left[(y_0 - \hat{f}_S(x_0))^2\right] = \operatorname{Bias}(x_0)^2 + \operatorname{Var}(x_0) + \sigma^2 \qquad \blacksquare$$
Interpretation of the three components:
- Bias: The gap between the model's average
prediction and the true value, reflecting the model's fitting
ability
- High bias: Model is too simple, cannot capture the true pattern in data (underfitting)
- Example: Using a straight line to fit a quadratic curve
- Variance: The degree of fluctuation in model
predictions across different training sets, reflecting the model's
stability
- High variance: Model is very sensitive to small changes in the training set (overfitting)
- Example: Using a high-degree polynomial to fit few data points
- Noise: Randomness inherent in the data, which is the lower bound of error and cannot be eliminated by improving the model
Numerical Example of Bias-Variance Tradeoff
Problem setup:
- True function: a sine curve, observed with additive noise
- Low complexity model: first-degree polynomial (line)
- Medium complexity model: third-degree polynomial
- High complexity model: ninth-degree polynomial

Experiment:
- Draw 100 different training sets from the distribution
- For each training set, fit the model using least squares
- Calculate bias and variance at a fixed test point
Results (approximate values; each total additionally includes a noise term of about 0.01):
| Model | Bias² | Variance | Total Error |
|---|---|---|---|
| 1st degree polynomial | 0.42 | 0.002 | 0.43 |
| 3rd degree polynomial | 0.05 | 0.02 | 0.08 |
| 9th degree polynomial | 0.01 | 0.35 | 0.37 |
Analysis:
- 1st degree polynomial: high bias (0.42), low variance (0.002); the model is too simple to capture the sine function's curvature
- 3rd degree polynomial: low bias (0.05), low variance (0.02); medium complexity, best performance
- 9th degree polynomial: very low bias (0.01), high variance (0.35); the model is too complex and sensitive to training-data noise
Tradeoff curve: As model complexity increases:
- Bias ↓ (fitting ability increases)
- Variance ↑ (stability decreases)
- There exists optimal complexity minimizing total error
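The tradeoff table above can be reproduced with a short simulation (a sketch: the exact sine function, noise level, sample size, and test point are assumed here, so the numbers will differ somewhat from the table):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                   # assumed true function: a sine curve
    return np.sin(2 * np.pi * x)

sigma, m, trials, x0 = 0.1, 15, 100, 0.35   # assumed noise / sample / test point

results = {}
for degree in [1, 3, 9]:
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(0, 1, m)
        y = f(x) + sigma * rng.normal(size=m)
        coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
        preds[t] = np.polyval(coeffs, x0)
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    results[degree] = (bias2, var)
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

The qualitative pattern matches the analysis: degree 1 has the largest bias, degree 9 the largest variance, and degree 3 balances the two.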
Mathematical Characterization of Overfitting and Underfitting
Definition 8 (Overfitting): A hypothesis $h$ overfits if there exists a model $h' \in \mathcal{H}$ with higher training error but lower generalization error: $L_S(h) < L_S(h')$ yet $L_{\mathcal{D}}(h) > L_{\mathcal{D}}(h')$. The model has fit accidental patterns of the sample rather than the distribution.
Definition 9 (Underfitting): A hypothesis $h$ underfits if there exists a model $h' \in \mathcal{H}$ with both lower training error and lower generalization error: $L_S(h') < L_S(h)$ and $L_{\mathcal{D}}(h') < L_{\mathcal{D}}(h)$. The model is too simple to capture the learnable structure.
Detection method:
| Training Error | Validation Error | Diagnosis |
|---|---|---|
| Low | Low | Good fit |
| High | High | Underfitting |
| Low | High | Overfitting |
| High | Low | Data issues |
Mitigation strategies:
- Overfitting:
- Increase training data
- Reduce model complexity (fewer parameters)
- Regularization (L1/L2, detailed in subsequent chapters)
- Early stopping
- Dropout (for neural networks)
- Underfitting:
- Increase model complexity (more parameters/layers)
- Increase training iterations
- Feature engineering (add interaction features, polynomial features)
- Reduce regularization strength
No Free Lunch Theorem: Fundamental Limits of Learning
Theorem Statement
The No Free Lunch (NFL) theorem (Wolpert & Macready, 1997) states: Averaged over all possible problems (data distributions), all learning algorithms have the same performance.
Theorem 6 (NFL theorem, simplified version):
Consider all possible functions $f: \mathcal{X} \to \{0, 1\}$ over a finite domain $\mathcal{X}$. Averaged uniformly over all $2^{|\mathcal{X}|}$ target functions, every learning algorithm has the same expected error on points outside the training set: exactly $1/2$.
Proof sketch:
Let $|\mathcal{X}| = N$, so there are $2^N$ possible target functions.
For a fixed training set $S$, the values of $f$ on points outside $S$ are completely unconstrained: among the target functions consistent with $S$, each unseen point is labeled $0$ by exactly half of them and $1$ by the other half.
Learning algorithm $A$ makes some prediction at each unseen point; whatever that prediction is:
- Half of the functions make the prediction correct
- Half of the functions make the prediction incorrect
Therefore, the average error rate of any algorithm on unseen points is $1/2$, no better than random guessing. $\blacksquare$
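The counting argument can be verified exhaustively on a tiny domain (a sketch with $|\mathcal{X}| = 4$, a two-point training set, and an arbitrary fixed predictor, all chosen for illustration):

```python
import itertools

X = [0, 1, 2, 3]          # finite domain
train = {0: 1, 1: 0}      # fixed training set: f(0) = 1, f(1) = 0
test_points = [2, 3]      # off-training-set points

# All 2^4 target functions, restricted to those consistent with the training set
targets = [f for f in itertools.product([0, 1], repeat=4)
           if all(f[x] == y for x, y in train.items())]

def predict(x):
    return 1              # any fixed predictor works; "always 1" is one choice

errors = [sum(predict(x) != f[x] for x in test_points) / len(test_points)
          for f in targets]
print(sum(errors) / len(errors))  # 0.5
```

Swapping in any other predictor (always 0, majority vote on the training labels, etc.) leaves the average at exactly $1/2$, which is the content of the theorem.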
Implications of the theorem:
- No universal algorithm: There is no algorithm that performs best on all problems
- Importance of prior assumptions: Algorithm effectiveness depends on prior assumptions about the problem (inductive bias)
- Specific problems need specific algorithms: Algorithm design should target problem characteristics
Inductive Bias
Definition 10 (Inductive Bias): When a learning algorithm faces multiple hypotheses that all agree with the training data, its tendency to choose one among them is called inductive bias.
Common inductive biases:
- Smoothness assumption: Nearby inputs should produce
nearby outputs
- Applicable to continuous function learning
- Examples: k-NN, kernel methods
- Linearity assumption: Output is a linear
combination of inputs
- Applicable to linearly separable problems
- Examples: Linear regression, logistic regression
- Sparsity assumption: Only a few features are
important
- Applicable to high-dimensional low-rank problems
- Examples: L1 regularization, feature selection
- Hierarchical structure assumption: Concepts can be
composed from simple concepts
- Applicable to complex pattern recognition
- Examples: Deep learning
- Spatiotemporal locality assumption: Related
features are spatiotemporally local
- Applicable to image, sequence data
- Examples: Convolutional neural networks, recurrent neural networks
Example (Occam's Razor):
Among multiple hypotheses consistent with the data, choose the simplest one. This is a common inductive bias, corresponding to regularization methods.
Mathematical representation: $\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ L_S(h) + \lambda \cdot \operatorname{complexity}(h) \right]$, where $\operatorname{complexity}(h)$ encodes a preference for simpler hypotheses.
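Occam's razor as regularization can be sketched with ridge-penalized polynomial regression (the data and the degree-9 basis are illustrative choices): increasing $\lambda$ shrinks the coefficient vector, pulling the learner toward a simpler hypothesis.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 30)
y = x + 0.1 * rng.normal(size=30)       # true relationship is a simple line

def ridge_poly_fit(x, y, degree, lam):
    """Minimize ||Phi w - y||^2 + lam * ||w||^2 over a polynomial basis."""
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

for lam in [0.0, 0.1, 10.0]:
    w = ridge_poly_fit(x, y, degree=9, lam=lam)
    print(f"lambda={lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```

The penalty does not change the hypothesis space; it changes which consistent hypothesis the algorithm prefers, which is exactly an inductive bias.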
Code Implementation: PAC Analysis of Simple Linear Regression
We verify theoretical predictions through code: observe the relationship between sample complexity and generalization error.
```python
import numpy as np

rng = np.random.default_rng(0)

# Setup (illustrative constants): the true function is linear, so a linear
# model has essentially zero bias; noise variance sigma^2 = 0.25.
def f_true(x):
    return 2.0 * x + 1.0

sigma = 0.5        # noise standard deviation -> irreducible error 0.25
x0 = 0.5           # fixed test point
trials = 200       # independent training sets per sample size

for m in [5, 10, 50, 100, 1000]:
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(-1, 1, m)
        y = f_true(x) + sigma * rng.normal(size=m)
        w = np.polyfit(x, y, 1)            # least-squares linear fit
        preds[t] = np.polyval(w, x0)
    bias2 = (preds.mean() - f_true(x0)) ** 2
    var = preds.var()
    total = bias2 + var + sigma ** 2       # expected test error at x0
    print(f"m={m:5d}  bias^2={bias2:.4f}  variance={var:.4f}  "
          f"noise={sigma**2:.2f}  total={total:.4f}")
```
Code explanation:
- Experiment design:
  - The true function is linear
  - A linear regression model is used for fitting, so the bias is small (the model class matches the true function)
  - The training set size $m$ is varied to observe changes in generalization error
- PAC learning verification:
  - As $m$ increases, the test error (generalization error) monotonically decreases
  - This verifies Theorem 1: more samples, better learning
- Bias-variance decomposition verification:
  - Bias² ≈ 0.001 (very small, because the linear model is correct)
  - Variance decreases from about 0.05 (m = 5) to about 0.001 (m = 1000): more samples, more stable model
  - Noise ≈ 0.25 (fixed, cannot be eliminated)
  - Total error = Bias² + Variance + Noise ≈ test error, verifying the decomposition formula
- Conclusion:
  - Theory and experiment match closely
  - For simple (linear) problems, a small number of samples (~100) is sufficient for good learning
  - Noise is the irreducible lower bound of error
Summary: Core Points of Machine Learning Foundations
Key Formulas:
Generalization error decomposition: $\mathbb{E}[\text{error}] = \operatorname{Bias}^2 + \operatorname{Variance} + \sigma^2$
PAC sample complexity (finite hypothesis space): $m \geq \frac{1}{\epsilon} \ln \frac{|\mathcal{H}|}{\delta}$
VC dimension sample complexity: $m = O\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)$
Practical Checklist:
References
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2), 264-280.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 929-965.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67-82.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR.
Bartlett, P. L., Maiorov, V., & Meir, R. (1998). Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10(8), 2159-2173.
Next chapter preview: Chapter 2 will delve into linear algebra and matrix theory, laying a solid mathematical foundation for subsequent learning algorithm derivations. We will rigorously derive core tools such as matrix calculus, eigenvalue decomposition, and singular value decomposition.
- Post title: Mathematical Derivations in Machine Learning (1): Introduction and Mathematical Foundations
- Post author: Chen Kai
- Create time: 2025-01-15 00:00:00
- Post link: https://www.chenk.top/mathematical-derivations-in-machine-learning-01-introduction-foundations/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.