Ensemble learning is one of the most powerful tools in machine learning: it combines multiple weak learners into a strong learner with excellent performance. From Kaggle competitions to industrial applications, ensemble methods are ubiquitous. Why are many heads better than one, and what is the mathematical mechanism behind it? This chapter systematically derives the theoretical foundations of ensemble learning, including bias-variance decomposition, Boosting's additive models, Random Forest's randomization strategies, and Gradient Boosting's function-space optimization, revealing the core ideas of ensemble learning.
Theoretical Foundations of Ensemble Learning
Why Ensembles Work
Consider a regression problem with true function $f(\boldsymbol{x})$. Train $T$ base learners $h_1, \dots, h_T$ and average them:

$$H(\boldsymbol{x}) = \frac{1}{T} \sum_{t=1}^T h_t(\boldsymbol{x})$$

Let each base learner's error be $\epsilon_t(\boldsymbol{x}) = h_t(\boldsymbol{x}) - f(\boldsymbol{x})$, with $\mathbb{E}[\epsilon_t] = 0$ and $\mathbb{E}[\epsilon_t^2] = \sigma^2$. The ensemble's squared error decomposes as:

$$\mathbb{E}\big[(H(\boldsymbol{x}) - f(\boldsymbol{x}))^2\big] = \frac{1}{T^2}\, \mathbb{E}\Big[\Big(\sum_{t=1}^T \epsilon_t(\boldsymbol{x})\Big)^2\Big]$$

Assuming the model errors are independent and identically distributed, the cross terms vanish and

$$\mathbb{E}\big[(H(\boldsymbol{x}) - f(\boldsymbol{x}))^2\big] = \frac{\sigma^2}{T}$$

Key insight: ensembling does not change the bias, but it reduces the variance to $\frac{\sigma^2}{T}$ (under the independence assumption).
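The variance-reduction argument can be checked with a quick simulation (an illustrative sketch, not from the original post): average $T$ unbiased predictors with i.i.d. noise and compare the variances.

```python
import numpy as np

# Simulate T unbiased estimators with i.i.d. noise of variance sigma^2
# and check that averaging reduces the variance roughly to sigma^2 / T.
rng = np.random.default_rng(0)
sigma, T, n_trials = 2.0, 10, 200_000

# Each row is one trial; each column is one base learner's error epsilon_t.
errors = rng.normal(0.0, sigma, size=(n_trials, T))
ensemble_error = errors.mean(axis=1)          # error of the averaged ensemble

print(f"single-model variance : {errors[:, 0].var():.3f}")   # ~ 4.0 = sigma^2
print(f"ensemble variance     : {ensemble_error.var():.3f}") # ~ 0.4 = sigma^2 / T
```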
Bias-Variance Decomposition
For squared loss, the generalization error decomposes as:

$$\mathbb{E}\big[(y - \hat{f}(\boldsymbol{x}))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(\boldsymbol{x})] - f(\boldsymbol{x})\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(\boldsymbol{x}) - \mathbb{E}[\hat{f}(\boldsymbol{x})])^2\big]}_{\text{Variance}} + \underbrace{\sigma_\epsilon^2}_{\text{Noise}}$$

where:
- Bias: the gap between the model's average prediction and the true value; measures underfitting
- Variance: the fluctuation of predictions across different training sets; measures overfitting
- Noise: the irreducible error in the data itself

Bias-variance dilemma:
- Complex models: low bias, high variance
- Simple models: high bias, low variance

Ensemble strategies:
- Bagging: reduces variance (suits high-variance models such as decision trees)
- Boosting: reduces bias (progressively improves on weak learners)
Diversity and Accuracy
For binary classification with $T$ independent base classifiers, each with error rate $\epsilon < 0.5$, the majority vote errs only when more than half of the members err. By Hoeffding's inequality, the ensemble error is bounded by

$$P(H(\boldsymbol{x}) \neq y) \le \exp\!\left(-\frac{1}{2} T (1 - 2\epsilon)^2\right)$$

and the exact majority-vote error is

$$P_{\text{ensemble}} = \sum_{k > T/2} \binom{T}{k}\, \epsilon^k (1 - \epsilon)^{T - k}$$

Example: with $\epsilon = 0.3$ and $T = 5$,

$$P_{\text{ensemble}} = \binom{5}{3} 0.3^3\, 0.7^2 + \binom{5}{4} 0.3^4\, 0.7 + 0.3^5 \approx 0.163$$

Conclusion: classifiers that are individually only slightly better than random guessing can, when ensembled independently, approach perfect accuracy as $T \to \infty$.
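The majority-vote error formula is easy to evaluate directly; a small illustrative script:

```python
from math import comb

def majority_vote_error(T: int, eps: float) -> float:
    """Error of a majority vote of T independent classifiers,
    each with individual error rate eps (T odd)."""
    # The ensemble errs exactly when more than half of the members err.
    return sum(comb(T, k) * eps**k * (1 - eps)**(T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (1, 5, 11, 21):
    print(T, round(majority_vote_error(T, 0.3), 4))
# Error shrinks toward 0 as T grows: 0.3 -> ~0.163 -> ~0.078 -> ~0.026
```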
The following figure shows bias-variance tradeoff and ensemble variance reduction:
Bagging and Random Forest
The figure below compares Bagging and Boosting workflows:

Bootstrap Aggregating (Bagging)
Algorithm (Breiman, 1996):

1. Bootstrap sampling: from training set $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$, sample with replacement $T$ subsets $D_1, \dots, D_T$, each of size $N$
2. Train base learners: train $h_t$ on $D_t$
3. Ensemble prediction:
   - Regression: $H(\boldsymbol{x}) = \frac{1}{T} \sum_{t=1}^T h_t(\boldsymbol{x})$
   - Classification: $H(\boldsymbol{x}) = \arg\max_{y} \sum_{t=1}^T \mathbb{I}[h_t(\boldsymbol{x}) = y]$ (majority vote)

Bootstrap sampling properties:
- Probability of a sample being selected at least once: $1 - \left(1 - \frac{1}{N}\right)^N \to 1 - e^{-1} \approx 63.2\%$ as $N \to \infty$

OOB error estimate: for each sample $(\boldsymbol{x}_i, y_i)$, roughly $36.8\%$ of the base learners never saw it during training; predict it using only those learners and average the resulting errors over all samples.

The OOB error is an approximately unbiased estimate of the generalization error, so no separate validation set is needed.
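The $63.2\%$ inclusion probability can be verified empirically (an illustrative sketch):

```python
import numpy as np

# Empirically check the bootstrap inclusion probability 1 - (1 - 1/N)^N -> 63.2%.
rng = np.random.default_rng(0)
N, n_rounds = 1000, 2000

included = np.zeros(n_rounds)
for r in range(n_rounds):
    boot = rng.integers(0, N, size=N)           # sample N indices with replacement
    included[r] = np.unique(boot).size / N      # fraction of distinct samples drawn

print(f"empirical : {included.mean():.4f}")     # ~ 0.632
print(f"theory    : {1 - (1 - 1/N)**N:.4f}")
# The ~36.8% left out each round are the OOB samples used for error estimation.
```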
Random Forest

Innovation (Breiman, 2001): on top of Bagging, introduce random feature selection:

- Bootstrap sampling: same as Bagging
- Random feature subset: at each node split, randomly select $m$ of the $d$ features (usually $m = \sqrt{d}$ for classification or $m = d/3$ for regression) and choose the best split among them
- Fully grown: decision trees are not pruned; they grow until nodes are pure or reach a minimum sample count
Dual randomization mechanism:
- Sample randomization: Bootstrap sampling (reduces sample variance)
- Feature randomization: random feature selection (reduces correlation between trees, increases diversity)
Theoretical analysis: Breiman (2001) bounds the generalization error by

$$\text{PE}^* \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}$$

where:
- $\bar{\rho}$: mean correlation between the trees
- $s$: strength (average accuracy margin) of the individual trees

Methods to reduce the error:
1. Increase the strength $s$ (e.g., grow sufficiently deep trees)
2. Decrease the correlation $\bar{\rho}$ (e.g., smaller feature subsets $m$, more aggressive randomization)
Feature Importance
Based on impurity decrease: feature $j$'s importance is the total impurity (e.g., Gini) decrease over all splits on $j$, averaged across the trees of the forest.

Based on OOB permutation: randomly shuffle feature $j$'s values in the OOB samples and measure the resulting drop in accuracy; the larger the drop, the more important the feature.

Permutation importance is generally more reliable (impurity-based importance is biased toward high-cardinality features), but it is slower to compute.
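Permutation importance is simple enough to implement by hand; here is a toy sketch on a fixed rule-based model standing in for a trained forest (the function names are illustrative, not a real library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only feature 0 is informative, features 1 and 2 are noise.
n = 5000
X = rng.normal(size=(n, 3))
y = (X[:, 0] > 0).astype(int)

def model_predict(X):
    """A stand-in for a trained model: thresholds feature 0."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, n_repeats=5):
    """Importance of feature j = accuracy drop when column j is shuffled."""
    base_acc = (predict(X) == y).mean()
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j's link to y
            drops.append(base_acc - (predict(Xp) == y).mean())
        importances.append(np.mean(drops))
    return np.array(importances)

imp = permutation_importance(model_predict, X, y)
print(imp.round(3))   # feature 0 large (~0.5), features 1 and 2 near 0
```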
The following figure shows Random Forest feature importance and tree diversity:
Boosting Algorithm Family
AdaBoost: Adaptive Boosting
The figures below show AdaBoost training error convergence and sample weight evolution:
The animation demonstrates how Boosting iteratively reweights samples to improve decision boundary:

Core idea: train weak learners iteratively; each round increases the weights of misclassified samples, and the final model is a weighted combination of all rounds' learners.
Algorithm (Freund & Schapire, 1997):

Input: training set $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$ with $y_i \in \{-1, +1\}$; initialize weights $w_1(i) = \frac{1}{N}$.

For $t = 1, \dots, T$:

- Train base learner $h_t$ on the weighted training set
- Compute the weighted error rate: $\epsilon_t = \sum_{i=1}^N w_t(i)\, \mathbb{I}[h_t(\boldsymbol{x}_i) \neq y_i]$
- Compute the base learner weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
- Update sample weights:

$$w_{t+1}(i) = \frac{w_t(i) \exp\!\big(-\alpha_t\, y_i\, h_t(\boldsymbol{x}_i)\big)}{Z_t}$$

where $Z_t = \sum_{i=1}^N w_t(i) \exp\!\big(-\alpha_t\, y_i\, h_t(\boldsymbol{x}_i)\big)$ is the normalization factor.

Final model:

$$H(\boldsymbol{x}) = \operatorname{sign}\!\left( \sum_{t=1}^T \alpha_t\, h_t(\boldsymbol{x}) \right)$$
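The algorithm above can be sketched in NumPy with decision stumps as weak learners (a minimal illustration, not a production implementation; the interval-shaped toy dataset is an assumption for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y, w):
    """Exhaustively search the best weighted decision stump."""
    best = (0, 0.0, 1, np.inf)          # (feature, threshold, sign, weighted error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)             # w_1(i) = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        j, thr, sign, eps = fit_stump(X, y, w)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)       # base learner weight
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)           # up-weight mistakes
        w = w / w.sum()                             # normalize (divide by Z_t)
        stumps.append((j, thr, sign))
        alphas.append(alpha)

    def predict(Xq):
        F = sum(a * s * np.where(Xq[:, j] <= t, 1, -1)
                for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(F)
    return predict

# 1-D "interval" data: no single stump separates it, but boosted stumps do.
X = rng.uniform(-1, 1, size=(300, 1))
y = np.where(np.abs(X[:, 0]) < 0.5, 1, -1)
predict = adaboost(X, y)
print("train accuracy:", (predict(X) == y).mean())
```

Each round the weight update concentrates mass on the samples the current combination still gets wrong, exactly as in the derivation above.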
Theoretical Guarantee of AdaBoost
Training error upper bound:

$$\frac{1}{N} \sum_{i=1}^N \mathbb{I}[H(\boldsymbol{x}_i) \neq y_i] \le \prod_{t=1}^T Z_t = \prod_{t=1}^T 2\sqrt{\epsilon_t (1 - \epsilon_t)}$$

If each round satisfies $\epsilon_t \le \frac{1}{2} - \gamma$ for some margin $\gamma > 0$, then

$$\prod_{t=1}^T Z_t \le \exp(-2\gamma^2 T)$$

Conclusion: the training error decreases exponentially in $T$!
Exponential Loss and Additive Models
AdaBoost is equivalent to minimizing the exponential loss $L(y, F) = \exp(-y F(\boldsymbol{x}))$ over the additive model $F(\boldsymbol{x}) = \sum_{t=1}^T \alpha_t h_t(\boldsymbol{x})$.

Forward stagewise additive modeling: each round solves

$$(\alpha_t, h_t) = \arg\min_{\alpha, h} \sum_{i=1}^N \exp\!\Big(-y_i \big(F_{t-1}(\boldsymbol{x}_i) + \alpha\, h(\boldsymbol{x}_i)\big)\Big)$$

Let $w_t(i) \propto \exp\!\big(-y_i F_{t-1}(\boldsymbol{x}_i)\big)$. The optimal base learner is

$$h_t = \arg\min_h \sum_{i=1}^N w_t(i)\, \mathbb{I}[h(\boldsymbol{x}_i) \neq y_i]$$

and the optimal weight is $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
This is exactly AdaBoost's update formula!
Insight: AdaBoost is coordinate descent in function space, each round adding a new basis function.
Gradient Boosting Decision Trees (GBDT)
Gradient Boosting Framework
Idea: View Boosting as gradient descent optimization in function space.
Objective: minimize the total loss

$$\min_F \sum_{i=1}^N L\big(y_i, F(\boldsymbol{x}_i)\big)$$

where $F(\boldsymbol{x}) = \sum_t \gamma_t h_t(\boldsymbol{x})$ is an additive model and $L$ is any differentiable loss function.
Gradient Boosting Algorithm (Friedman, 2001):
1. Initialize: $F_0(\boldsymbol{x}) = \arg\min_{c} \sum_{i=1}^N L(y_i, c)$
2. For $t = 1, \dots, T$:
   - Compute the negative gradient (pseudo-residual):

   $$r_{ti} = -\left[ \frac{\partial L\big(y_i, F(\boldsymbol{x}_i)\big)}{\partial F(\boldsymbol{x}_i)} \right]_{F = F_{t-1}}$$

   - Fit base learner: train $h_t$ on $\{(\boldsymbol{x}_i, r_{ti})\}_{i=1}^N$
   - Line search: $\gamma_t = \arg\min_{\gamma} \sum_{i=1}^N L\big(y_i, F_{t-1}(\boldsymbol{x}_i) + \gamma\, h_t(\boldsymbol{x}_i)\big)$
   - Update model: $F_t(\boldsymbol{x}) = F_{t-1}(\boldsymbol{x}) + \gamma_t\, h_t(\boldsymbol{x})$
3. Final model: $F_T(\boldsymbol{x})$
Different Loss Functions
Squared Loss (Regression)
$$L(y, F) = \frac{1}{2}(y - F)^2, \qquad r_i = y_i - F_{t-1}(\boldsymbol{x}_i)$$

With squared loss, gradient boosting reduces to fitting the residuals, just as in traditional iterative regression.
Absolute Loss (Robust Regression)
$$L(y, F) = |y - F|, \qquad r_i = \operatorname{sign}\!\big(y_i - F_{t-1}(\boldsymbol{x}_i)\big)$$

The pseudo-residuals are bounded, making the method robust to outliers.
Logistic Loss (Binary Classification)
$$L(y, F) = \ln\!\big(1 + \exp(-yF)\big), \qquad r_i = \frac{y_i}{1 + \exp\!\big(y_i F_{t-1}(\boldsymbol{x}_i)\big)}$$

The model output is the log-odds $F(\boldsymbol{x}) = \ln \frac{P(y = 1 \mid \boldsymbol{x})}{P(y = -1 \mid \boldsymbol{x})}$, so probabilities are recovered via $P(y = 1 \mid \boldsymbol{x}) = \frac{1}{1 + \exp(-F(\boldsymbol{x}))}$.
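The three pseudo-residual formulas can be sanity-checked against numerical derivatives (an illustrative check, not from the original post):

```python
import numpy as np

def num_neg_grad(loss, y, F, h=1e-6):
    """Negative gradient of loss w.r.t. F via central finite differences."""
    return -(loss(y, F + h) - loss(y, F - h)) / (2 * h)

y, F = 1.0, 0.3   # one example: label and current model score

sq  = lambda y, F: 0.5 * (y - F) ** 2          # squared loss
ab  = lambda y, F: abs(y - F)                  # absolute loss
log = lambda y, F: np.log(1 + np.exp(-y * F))  # logistic loss

print(num_neg_grad(sq, y, F),  y - F)                       # matches y - F
print(num_neg_grad(ab, y, F),  np.sign(y - F))              # matches sign(y - F)
print(num_neg_grad(log, y, F), y / (1 + np.exp(y * F)))     # matches y / (1 + e^{yF})
```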
Regularization and Shrinkage
Learning rate shrinkage:

$$F_t(\boldsymbol{x}) = F_{t-1}(\boldsymbol{x}) + \nu\, h_t(\boldsymbol{x}), \quad \nu \in (0, 1]$$

A small learning rate $\nu$ requires more boosting rounds but usually generalizes better.

Subsampling (Stochastic Gradient Boosting): each round, train the base learner on a random subsample of the training data (typically 50% to 80%), which adds randomness and reduces overfitting.

Tree complexity penalty: limit the tree depth (e.g., 3 to 8) or the number of leaves, keeping each base learner weak.
Complete Implementation
```python
import numpy as np
```
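As a stand-in for the full listing, here is a minimal NumPy gradient-boosting regressor in the spirit of this chapter: squared loss, regression stumps as base learners, and learning-rate shrinkage (a sketch, not the author's original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_reg_stump(X, r):
    """Least-squares stump: split one feature, predict the mean in each half."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:        # keep both sides non-empty
            left = X[:, j] <= thr
            lm, rm = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, thr, lm, rm)
    j, thr, lm, rm = best
    return lambda Xq: np.where(Xq[:, j] <= thr, lm, rm)

def gbdt_fit(X, y, T=100, lr=0.1):
    F0 = y.mean()                       # F_0: best constant under squared loss
    trees, F = [], np.full(len(y), F0)
    for _ in range(T):
        r = y - F                       # negative gradient of 1/2 (y - F)^2
        tree = fit_reg_stump(X, r)      # fit base learner to pseudo-residuals
        F = F + lr * tree(X)            # shrunken update F_t = F_{t-1} + nu h_t
        trees.append(tree)
    return lambda Xq: F0 + lr * sum(t(Xq) for t in trees)

# Toy regression: noisy sine wave.
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)
predict = gbdt_fit(X, y)
print("train MSE:", ((predict(X) - y) ** 2).mean())   # far below y.var()
```

With squared loss the line-search step is absorbed into the leaf means, so only the shrinkage factor `lr` scales each tree's contribution.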
Q&A Highlights
Q1: Why is Boosting stronger than Bagging?
A: Bagging trains in parallel and reduces variance; Boosting trains sequentially and reduces both bias and variance. Key difference:
- Bagging: independent training, suits high-variance models (like deep trees)
- Boosting: adaptive training, progressively learns complex patterns from simple models
But Boosting is more prone to overfitting and noise sensitivity.
Q2: Why does AdaBoost use exponential weight updates?
A: The exponential update $w_{t+1}(i) \propto w_t(i) \exp(-\alpha_t y_i h_t(\boldsymbol{x}_i))$ falls directly out of minimizing the exponential loss by forward stagewise optimization: it multiplicatively up-weights misclassified samples and down-weights correct ones, and it yields the closed-form learner weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.
Q3: Why is GBDT's learning rate important?
A: The learning rate $\nu$ shrinks each tree's contribution and acts as regularization: a small $\nu$ requires more trees to fit the data but usually generalizes better, because each step makes only a small, correctable move in function space.
Q4: How to choose Random Forest's feature count $m$?

A: Rules of thumb:
- Classification: $m = \sqrt{d}$
- Regression: $m = d/3$

Smaller $m$ increases tree diversity but weakens individual trees; in practice, tune $m$ by monitoring the OOB error.
Q5: Can Boosting be parallelized?
A: Standard Boosting is sequential (each round depends on the previous one). But parallel variants exist:
- Parallel Bagging + Boosting: Bagging first, then Boosting on each subset
- Data parallelism: within a single round, split samples across machines (as in distributed XGBoost)
- Feature parallelism: when finding the best split, search over features in parallel
Q6: Why does GBDT use decision trees instead of other models?
A: Decision trees naturally support:
- Nonlinearity: automatically learn feature interactions
- Piecewise-constant fits: each leaf's value is optimized independently
- Efficient splitting: fast threshold-based split search
Theoretically any model works (like linear models), but trees are most practical.
Q7: Can ensemble learning prevent overfitting?
A: It depends on the method:
- Bagging/RF: rarely overfits (more trees generally help)
- Boosting: can overfit (training error → 0 while test error may rise)

Preventing Boosting overfitting:
- Early stopping (monitor a validation set)
- Learning rate shrinkage
- Subsampling
- Tree complexity constraints
✏️ Exercises and Solutions
Exercise 1: Bias-Variance Decomposition Calculation
Problem: three independent regression models each have squared bias $b^2$ and variance $\sigma^2$. Compare the MSE of a single model with that of their average.

Solution:

Single model MSE: $b^2 + \sigma^2$ (ignoring irreducible noise).

Ensemble MSE (assuming independent errors):
- Averaging does not change the bias: the ensemble's squared bias is still $b^2$
- Averaging divides the variance by 3: the ensemble's variance is $\frac{\sigma^2}{3}$

$$\text{MSE}_{\text{ensemble}} = b^2 + \frac{\sigma^2}{3}$$

The ensemble reduces the error by $\frac{2}{3}\sigma^2$; the entire gain comes from the variance term.
Exercise 2: Majority Voting Error Probability
Problem: 21 independent binary classifiers each have error rate $\epsilon < 0.5$. Express the majority-vote error rate and evaluate it for a concrete $\epsilon$.

Solution:

Majority voting fails when at least 11 of the 21 classifiers err:

$$P_{\text{ensemble}} = \sum_{k=11}^{21} \binom{21}{k}\, \epsilon^k (1 - \epsilon)^{21 - k}$$

Using the binomial cumulative distribution (or a short program), with $\epsilon = 0.3$ for instance, this gives $P_{\text{ensemble}} \approx 0.026$.

The ensemble thus reduces the error rate from $0.3$ to roughly $0.026$, an order-of-magnitude improvement.
Exercise 3: AdaBoost Weight Update
Problem: after round $t$, the weighted error rate is $\epsilon_t < \frac{1}{2}$. Derive the classifier weight $\alpha_t$ and describe how the sample weights change.

Solution:

(1) Classifier weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} > 0$ since $\epsilon_t < \frac{1}{2}$.

(2) Sample weight update (before normalization):

$$w_{t+1}(i) \propto w_t(i) \cdot \begin{cases} e^{-\alpha_t} & \text{if } h_t(\boldsymbol{x}_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(\boldsymbol{x}_i) \neq y_i \end{cases}$$

Since $\alpha_t > 0$, we have $e^{-\alpha_t} < 1 < e^{\alpha_t}$: correctly classified samples get lower weight and misclassified samples get higher weight, so the next round focuses on the hard examples.
Exercise 4: Random Forest vs Single Deep Tree
Problem: a Random Forest uses 100 trees of depth 10, each trained on a Bootstrap sample, with each split randomly selecting $m = \sqrt{d}$ features. Explain why it generalizes better than a single fully grown deep tree.
Solution:
Single deep tree: - Trains on complete dataset, memorizes all training sample feature combinations - Extremely high variance, poor generalization
Random Forest's variance reduction mechanisms:
- Bootstrap sampling: Each tree sees different training subset (~63.2% original samples), reduces correlation between trees
- Feature random selection: each split only considers $m = \sqrt{d}$ of the $d$ features, forcing different trees to use different feature combinations, which further decorrelates them
- Averaging effect: averaging the 100 trees' predictions cancels out individual trees' overfitting noise

By the variance-reduction formula, if the trees were fully independent, the variance would drop to $\frac{\sigma^2}{100}$; with average inter-tree correlation $\rho$, it drops to $\rho \sigma^2 + \frac{1 - \rho}{100} \sigma^2$, which is why the decorrelation mechanisms matter so much.
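The correlation-aware variance formula $\rho\sigma^2 + \frac{1-\rho}{T}\sigma^2$ can be verified by simulation (an illustrative sketch using an equicorrelated error model):

```python
import numpy as np

# Variance of an average of T equally correlated predictors:
# Var(mean) = rho * sigma^2 + (1 - rho) * sigma^2 / T.
rng = np.random.default_rng(0)
T, rho, sigma2, n_trials = 100, 0.3, 1.0, 100_000

# Build equicorrelated errors: a shared component plus independent components.
shared = rng.normal(0, np.sqrt(rho * sigma2), size=(n_trials, 1))
indep  = rng.normal(0, np.sqrt((1 - rho) * sigma2), size=(n_trials, T))
errors = shared + indep            # each column has variance sigma2,
ens    = errors.mean(axis=1)       # and pairwise correlation rho

print(f"empirical: {ens.var():.4f}")
print(f"theory   : {rho * sigma2 + (1 - rho) * sigma2 / T:.4f}")   # 0.307
```

Averaging kills the independent part of the error but the shared (correlated) part $\rho\sigma^2$ remains, which is exactly why Random Forest works so hard to decorrelate its trees.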
Exercise 5: GBDT vs AdaBoost
Problem: Both GBDT and AdaBoost are Boosting methods, but GBDT uses gradient descent to fit residuals while AdaBoost uses exponential loss to reweight samples. Compare their pros/cons.
Solution:
| Feature | AdaBoost | GBDT |
|---|---|---|
| Loss function | Exponential (fixed) | Any differentiable (flexible) |
| Base learner | Usually decision stumps (weak) | Decision trees (CART) |
| Update method | Sample reweighting | Fit negative gradient (residual) |
| Noise sensitivity | Sensitive (exponential heavily penalizes outliers) | More robust (can use Huber etc.) |
| Regression tasks | Not applicable | Directly applicable |
| Interpretability | Good (explicit weights) | Good (tree structure) |
Summary: - AdaBoost: Simple, efficient for binary classification, but noise-sensitive - GBDT: More general (regression+classification), flexible loss functions, widely used in industry (XGBoost/LightGBM foundation)
In practice, GBDT variants (XGBoost, LightGBM, CatBoost) dominate due to flexibility and performance advantages.
References
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
- Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
- Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 10: Boosting and Additive Trees]
Ensemble learning, with the wisdom of "many heads are better than one," combines simple models into powerful systems. From Bagging's parallel aggregation to Boosting's sequential improvement, from Random Forest's randomization strategies to GBDT's gradient optimization, ensemble methods showcase the engineering beauty of machine learning. Understanding ensemble learning is not only mastering practical tools, but grasping the systems thinking of "the whole is greater than the sum of its parts"— this idea permeates modern machine learning, from deep learning's Dropout to reinforcement learning's Actor-Critic, all embodying the essence of ensemble.
- Post title:Machine Learning Mathematical Derivations (11): Ensemble Learning
- Post author:Chen Kai
- Create time:2021-10-24 10:30:00
- Post link:https://www.chenk.top/Machine-Learning-Mathematical-Derivations-11-Ensemble-Learning/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.