Machine Learning Mathematical Derivations (11): Ensemble Learning
Chen Kai

Ensemble Learning is one of the most powerful weapons in machine learning — combining multiple weak learners to build strong learners with excellent performance. From Kaggle competitions to industrial applications, ensemble methods are ubiquitous. Why does "three cobblers beat a shoemaker"? What's the mathematical mechanism? This chapter systematically derives the theoretical foundations of ensemble learning, including bias-variance decomposition, Boosting's additive models, Random Forest's randomization strategies, and Gradient Boosting's function space optimization, revealing the core wisdom of ensemble learning.

Theoretical Foundations of Ensemble Learning

Why Ensembles Work

Consider a regression problem with true function $f(\mathbf{x})$. A single model predicts $h_t(\mathbf{x}) = f(\mathbf{x}) + \epsilon_t(\mathbf{x})$, and the ensemble averages $T$ such models:

$$H(\mathbf{x}) = \frac{1}{T} \sum_{t=1}^T h_t(\mathbf{x})$$

Squared error decomposition:

Assuming the model errors are independent and identically distributed with $\mathbb{E}[\epsilon_t] = b$ and $\mathrm{Var}(\epsilon_t) = \sigma^2$:

$$\mathbb{E}\big[(H(\mathbf{x}) - f(\mathbf{x}))^2\big] = b^2 + \frac{\sigma^2}{T}$$

Key insight: Ensembling doesn't change the bias, but reduces the variance to $\sigma^2 / T$. Conditions: 1. Base learner errors are independent (achieved through randomization) 2. Base learners aren't too weak (bias not too large)
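A quick standard-library simulation (hypothetical helper name) makes the $\sigma^2/T$ variance reduction concrete: averaging $T$ independent zero-mean predictions shrinks the empirical variance by roughly a factor of $T$.

```python
import random

def ensemble_variance(T, sigma=1.0, trials=20000, seed=0):
    """Empirical variance of the average of T independent noise terms."""
    rng = random.Random(seed)
    preds = [sum(rng.gauss(0.0, sigma) for _ in range(T)) / T
             for _ in range(trials)]
    mean = sum(preds) / trials
    return sum((p - mean) ** 2 for p in preds) / trials

v1 = ensemble_variance(1)    # close to sigma^2 = 1.0
v10 = ensemble_variance(10)  # close to sigma^2 / 10 = 0.1
```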

Bias-Variance Decomposition

For squared loss, the generalization error decomposes as:

$$\mathbb{E}\big[(y - h(\mathbf{x}))^2\big] = \underbrace{\big(\mathbb{E}[h(\mathbf{x})] - f(\mathbf{x})\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(h(\mathbf{x}) - \mathbb{E}[h(\mathbf{x})])^2\big]}_{\text{Variance}} + \underbrace{\sigma_\epsilon^2}_{\text{Noise}}$$

where: - Bias: Gap between the model's average prediction and the true value, measures underfitting - Variance: Prediction fluctuation across different training sets, measures overfitting - Noise: Irreducible error in the data itself

Bias-Variance Dilemma: - Complex models: Low bias, high variance - Simple models: High bias, low variance

Ensemble Strategies: - Bagging: Reduces variance (for high-variance models like decision trees) - Boosting: Reduces bias (progressively improves from weak learners)

Diversity and Accuracy

If $T$ binary classifiers are independent, each with error rate $\epsilon < 0.5$, the majority-vote error rate is:

$$P_{\text{ensemble}} = \sum_{k > T/2} \binom{T}{k} \epsilon^k (1 - \epsilon)^{T-k}$$

By Hoeffding's inequality, this tail probability decays exponentially in $T$:

$$P_{\text{ensemble}} \le \exp\big(-2T(0.5 - \epsilon)^2\big)$$

Example: with $\epsilon = 0.3$ and $T = 21$,

$$P_{\text{ensemble}} \approx 0.026$$

Conclusion: Independent classifiers that are only slightly better than random can, when ensembled, approach perfect accuracy.
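The binomial formula is easy to evaluate directly; this sketch (a hypothetical `majority_vote_error` helper, odd $T$) shows how quickly the majority-vote error collapses as $T$ grows for $\epsilon = 0.3$.

```python
from math import comb

def majority_vote_error(T, eps):
    """P(majority of T independent classifiers is wrong), each with error rate eps (T odd)."""
    return sum(comb(T, k) * eps**k * (1 - eps)**(T - k)
               for k in range(T // 2 + 1, T + 1))

# A single classifier errs 30% of the time,
# but 21 independent voters err only about 2.6% of the time.
print(majority_vote_error(1, 0.3))   # 0.3
print(majority_vote_error(21, 0.3))  # ~0.026
```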

The following figure shows bias-variance tradeoff and ensemble variance reduction:

Bias-Variance Tradeoff

Bagging and Random Forest

The figure below compares Bagging and Boosting workflows:

Bagging vs Boosting

Bootstrap Aggregating (Bagging)

Algorithm (Breiman, 1996):

  1. Bootstrap sampling: From a training set $\mathcal{D}$ of size $N$, sample with replacement $T$ subsets $\mathcal{D}_1, \dots, \mathcal{D}_T$, each of size $N$

  2. Train base learners: Train $h_t$ on $\mathcal{D}_t$

  3. Ensemble prediction:

    • Regression: $H(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^T h_t(\mathbf{x})$ (average)
    • Classification: $H(\mathbf{x}) = \mathrm{mode}\{h_1(\mathbf{x}), \dots, h_T(\mathbf{x})\}$ (majority vote)

Bootstrap sampling properties: - Probability of a sample being selected at least once: $1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 63.2\%$ - About 36.8% of samples are not sampled (Out-of-Bag, OOB)

OOB error estimate: For each sample, predict using only the models whose bootstrap sets don't contain it, then compute the average error:

$$\text{OOB Error} = \frac{1}{N} \sum_{i=1}^N L\left(y_i, \frac{1}{|\{t: (\mathbf{x}_i, y_i) \notin \mathcal{D}_t\}|} \sum_{t: (\mathbf{x}_i, y_i) \notin \mathcal{D}_t} h_t(\mathbf{x}_i)\right)$$

OOB error is unbiased estimate of generalization error, no need for separate validation set.
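The 36.8% OOB fraction can be checked empirically; this standard-library sketch draws bootstrap samples and counts how many indices never appear.

```python
import random

def oob_fraction(N, trials=2000, seed=0):
    """Average fraction of indices missing from a bootstrap sample of size N."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        drawn = {rng.randrange(N) for _ in range(N)}  # indices drawn with replacement
        total += 1.0 - len(drawn) / N
    return total / trials

# (1 - 1/N)^N -> e^{-1} ~ 0.368 as N grows
print(oob_fraction(1000))  # ~0.368
```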

Random Forest

Innovation (Breiman, 2001): On top of Bagging, introduce random feature selection:

  1. Bootstrap sampling: Same as Bagging
  2. Random feature subset: At each node split, randomly select $m$ features from the $d$ total features (usually $m = \sqrt{d}$ for classification or $m = d/3$ for regression)
  3. Fully grown: Decision trees are not pruned; they grow until nodes are pure or reach a minimum sample count

Dual randomization mechanism: - Sample randomization: Bootstrap (reduces sample variance) - Feature randomization: Random feature selection (reduces feature correlation, increases diversity)

Theoretical analysis: Generalization error upper bound (Breiman, 2001):

$$PE^* \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}$$

where: - $s$ is the strength (expected margin) of the individual trees - $\bar{\rho}$ is the average correlation between trees

Methods to reduce error: 1. Increase $s$: Use stronger base learners 2. Decrease $\bar{\rho}$: Increase randomness (e.g., decrease $m$)

Feature Importance

Based on impurity decrease: Feature $j$'s importance is the sum of impurity decreases over all nodes that split on $j$:

$$\text{Imp}(j) = \sum_{t=1}^T \sum_{\text{nodes } n \in h_t \text{ splitting on } j} \Delta i(n)$$

Based on OOB permutation: Randomly shuffle feature $j$'s values in the OOB samples and compute the accuracy drop:

$$\text{Imp}(j) = \text{Acc}_{\text{OOB}} - \text{Acc}_{\text{OOB}}^{\text{permuted } j}$$

Permutation importance is more accurate (not affected by feature correlation), but slower.
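As a sketch of permutation importance (here on a held-in toy set with a plain least-squares linear model rather than a forest, and with hypothetical helper names), shuffle one feature column at a time and record the drop in $R^2$:

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Drop in R^2 after shuffling each feature column in turn."""
    rng = np.random.default_rng(seed)
    def r2(pred):
        return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    base = r2(predict(X))
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        perm = rng.permutation(len(Xp))
        Xp[:, j] = Xp[perm, j]          # destroy feature j's association with y
        drops.append(base - r2(predict(Xp)))
    return np.array(drops)

# Toy data: y depends only on feature 0, so shuffling it should hurt most.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares linear fit
imp = permutation_importance(lambda A: A @ w, X, y)
```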

The following figure shows Random Forest feature importance and tree diversity:

Random Forest Feature Importance

Boosting Algorithm Family

AdaBoost: Adaptive Boosting

The figures below show AdaBoost training error convergence and sample weight evolution:

AdaBoost Convergence

The animation demonstrates how Boosting iteratively reweights samples to improve decision boundary:

Boosting Animation

Core idea: Iteratively train weak learners, each round increase weights of misclassified samples, finally weighted combination.

Algorithm (Freund & Schapire, 1997):

Input: Training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$ with $y_i \in \{-1, +1\}$; base learner; number of rounds $T$. Initialize: $w_1(i) = 1/N$. Loop ($t = 1, \dots, T$):

  1. Train base learner $h_t$ on $\mathcal{D}$ with weights $w_t$

  2. Compute weighted error rate: $\epsilon_t = \sum_{i=1}^N w_t(i) \, \mathbb{1}[h_t(\mathbf{x}_i) \neq y_i]$

  3. Compute base learner weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$

  4. Update sample weights:

$$w_{t+1}(i) = \frac{w_t(i) \exp\big(-\alpha_t y_i h_t(\mathbf{x}_i)\big)}{Z_t}$$

where $Z_t$ is the normalization factor.

Final model:

$$H(\mathbf{x}) = \mathrm{sign}\left( \sum_{t=1}^T \alpha_t h_t(\mathbf{x}) \right)$$

Theoretical Guarantee of AdaBoost

Training error upper bound:

$$\frac{1}{N} \sum_{i=1}^N \mathbb{1}[H(\mathbf{x}_i) \neq y_i] \le \prod_{t=1}^T Z_t$$

If each round $\epsilon_t \le \frac{1}{2} - \gamma$ for some $\gamma > 0$ (weak learning condition), then:

$$\prod_{t=1}^T Z_t \le (1 - 4\gamma^2)^{T/2} \le \exp(-2\gamma^2 T)$$

Conclusion: Training error decreases exponentially!
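A quick numeric check (standard library only) confirms the inequality $(1-4\gamma^2)^{T/2} \le \exp(-2\gamma^2 T)$, which follows from $1 - x \le e^{-x}$ with $x = 4\gamma^2$:

```python
from math import exp

# (1 - 4*gamma^2)^(T/2) <= exp(-2*gamma^2*T) because 1 - x <= e^{-x}
for gamma in (0.05, 0.1, 0.2):
    for T in (10, 100, 1000):
        assert (1 - 4 * gamma**2) ** (T / 2) <= exp(-2 * gamma**2 * T)

# e.g. gamma = 0.1, T = 100: 0.96^50 ~ 0.130 <= e^{-2} ~ 0.135
print((1 - 4 * 0.1**2) ** 50, exp(-2 * 0.1**2 * 100))
```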

Exponential Loss and Additive Models

AdaBoost is equivalent to forward stagewise minimization of the exponential loss in an additive model:

$$L(y, F(\mathbf{x})) = \exp(-y F(\mathbf{x})), \qquad F(\mathbf{x}) = \sum_{t=1}^T \alpha_t h_t(\mathbf{x})$$

Forward Stagewise Additive Modeling:

Each round optimizes:

$$(\alpha_t, h_t) = \arg\min_{\alpha, h} \sum_{i=1}^N \exp\big(-y_i (F_{t-1}(\mathbf{x}_i) + \alpha h(\mathbf{x}_i))\big)$$

Let $w_t(i) = \exp(-y_i F_{t-1}(\mathbf{x}_i))$, then:

$$h_t = \arg\min_h \sum_i w_t(i) \, \mathbb{1}[h(\mathbf{x}_i) \neq y_i]$$

This is exactly AdaBoost's update formula!

Insight: AdaBoost is coordinate descent in function space, each round adding a new basis function.

Gradient Boosting Decision Trees (GBDT)

Gradient Boosting Framework

Idea: View Boosting as gradient descent optimization in function space.

Objective: Minimize the loss

$$\min_F \sum_{i=1}^N L(y_i, F(\mathbf{x}_i))$$

where $F(\mathbf{x}) = \sum_{t=0}^T h_t(\mathbf{x})$ is an additive model.

Gradient Boosting Algorithm (Friedman, 2001):

Initialize: $F_0(\mathbf{x}) = \arg\min_c \sum_{i=1}^N L(y_i, c)$. Loop ($t = 1, \dots, T$):

  1. Compute negative gradient (pseudo-residual):

$$r_{ti} = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{t-1}}$$

  2. Fit base learner: train $h_t$ on $\{(\mathbf{x}_i, r_{ti})\}_{i=1}^N$

  3. Line search: $\gamma_t = \arg\min_\gamma \sum_{i=1}^N L\big(y_i, F_{t-1}(\mathbf{x}_i) + \gamma h_t(\mathbf{x}_i)\big)$

  4. Update model: $F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \gamma_t h_t(\mathbf{x})$. Final model: $F_T(\mathbf{x})$

Different Loss Functions

Squared Loss (Regression)

$$L(y, F) = \frac{1}{2}(y - F)^2$$

$$r_i = y_i - F_{t-1}(\mathbf{x}_i)$$

With the $\frac{1}{2}$ factor, the negative gradient is exactly the residual: gradient boosting degenerates to iteratively fitting residuals, similar to traditional regression on the errors.

Absolute Loss (Robust Regression)

$$L(y, F) = |y - F|$$

$$r_i = \mathrm{sign}\big(y_i - F_{t-1}(\mathbf{x}_i)\big)$$

Robust to outliers.

Logistic Loss (Binary Classification)

For $y \in \{-1, +1\}$:

$$L(y, F) = \ln\big(1 + \exp(-yF)\big)$$

$$r_i = \frac{y_i}{1 + \exp\big(y_i F_{t-1}(\mathbf{x}_i)\big)}$$

The optimal output is the log-odds: $F^*(\mathbf{x}) = \ln \frac{P(y=+1 \mid \mathbf{x})}{P(y=-1 \mid \mathbf{x})}$
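The logistic pseudo-residual behaves like a soft misclassification indicator, as this small numpy sketch (hypothetical function name) shows:

```python
import numpy as np

def logistic_pseudo_residuals(y, F):
    """Negative gradient of log(1 + exp(-y*F)) w.r.t. F, for y in {-1, +1}."""
    return y / (1.0 + np.exp(y * F))

y = np.array([+1.0, +1.0, -1.0])
F = np.array([+5.0, -5.0, +5.0])   # confidently correct, wrong, wrong
r = logistic_pseudo_residuals(y, F)
# confidently correct -> residual near 0; confidently wrong -> residual near y
print(r)
```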

Regularization and Shrinkage

Learning rate shrinkage:

$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \nu \, h_t(\mathbf{x}), \quad \nu \in (0, 1]$$

A small learning rate (e.g., $\nu = 0.1$) combined with a large number of trees $T$ (e.g., $T = 1000$) usually performs better.

Subsampling (Stochastic Gradient Boosting): Each round, randomly sample a fraction $\eta$ of the training samples ($\eta \in (0, 1)$, e.g., $\eta = 0.5$) to fit the tree, similar to SGD.

Tree complexity penalty: Limit tree depth (e.g., depth $\le 5$), leaf count, and minimum samples per leaf.

Complete Implementation

import numpy as np

class DecisionStump:
    """Decision stump (single-layer decision tree)"""
    def __init__(self):
        self.feature_idx = None
        self.threshold = None
        self.left_value = None
        self.right_value = None

    def fit(self, X, y, weights):
        N, d = X.shape
        best_error = float('inf')

        for j in range(d):
            thresholds = np.unique(X[:, j])
            for threshold in thresholds:
                left_mask = X[:, j] <= threshold
                right_mask = ~left_mask

                if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                    continue

                # Weighted majority vote
                left_value = np.sign(np.sum(weights[left_mask] * y[left_mask]))
                right_value = np.sign(np.sum(weights[right_mask] * y[right_mask]))

                # Weighted error
                pred = np.where(left_mask, left_value, right_value)
                error = np.sum(weights[pred != y])

                if error < best_error:
                    best_error = error
                    self.feature_idx = j
                    self.threshold = threshold
                    self.left_value = left_value
                    self.right_value = right_value

        return self

    def predict(self, X):
        left_mask = X[:, self.feature_idx] <= self.threshold
        return np.where(left_mask, self.left_value, self.right_value)

class AdaBoost:
    """AdaBoost classifier"""
    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.estimators = []
        self.alphas = []

    def fit(self, X, y):
        N = X.shape[0]
        weights = np.ones(N) / N

        for t in range(self.n_estimators):
            # Train weak learner
            stump = DecisionStump()
            stump.fit(X, y, weights)
            pred = stump.predict(X)

            # Compute weighted error
            epsilon = np.sum(weights[pred != y])
            epsilon = np.clip(epsilon, 1e-10, 1 - 1e-10)

            # Compute learner weight
            alpha = 0.5 * np.log((1 - epsilon) / epsilon)

            # Update sample weights
            weights *= np.exp(-alpha * y * pred)
            weights /= np.sum(weights)

            self.estimators.append(stump)
            self.alphas.append(alpha)

        return self

    def predict(self, X):
        predictions = np.array([alpha * estimator.predict(X)
                                for alpha, estimator in zip(self.alphas, self.estimators)])
        return np.sign(np.sum(predictions, axis=0))

    def score(self, X, y):
        return np.mean(self.predict(X) == y)

class GradientBoostingRegressor:
    """Gradient Boosting Regression Trees"""
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.init_prediction = None

    def fit(self, X, y):
        # Initialize: predict mean
        self.init_prediction = np.mean(y)
        F = np.full(len(y), self.init_prediction)

        for t in range(self.n_estimators):
            # Compute negative gradient (residuals for squared loss)
            residuals = y - F

            # Fit residuals
            tree = self._build_tree(X, residuals, depth=0)
            self.trees.append(tree)

            # Update predictions
            F += self.learning_rate * self._predict_tree(tree, X)

        return self

    def _build_tree(self, X, y, depth):
        """Recursively build regression tree"""
        N = len(y)

        # Stopping conditions
        if depth >= self.max_depth or N < 2:
            return {'value': np.mean(y)}

        # Find best split
        best_mse = float('inf')
        best_split = None

        for j in range(X.shape[1]):
            thresholds = np.unique(X[:, j])
            for threshold in thresholds:
                left_mask = X[:, j] <= threshold
                right_mask = ~left_mask

                if np.sum(left_mask) < 1 or np.sum(right_mask) < 1:
                    continue

                mse = (np.var(y[left_mask]) * np.sum(left_mask)
                       + np.var(y[right_mask]) * np.sum(right_mask))

                if mse < best_mse:
                    best_mse = mse
                    best_split = (j, threshold, left_mask, right_mask)

        if best_split is None:
            return {'value': np.mean(y)}

        j, threshold, left_mask, right_mask = best_split

        return {
            'feature': j,
            'threshold': threshold,
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1)
        }

    def _predict_tree(self, tree, X):
        """Single tree prediction"""
        if 'value' in tree:
            return np.full(len(X), tree['value'])

        left_mask = X[:, tree['feature']] <= tree['threshold']
        predictions = np.zeros(len(X))
        predictions[left_mask] = self._predict_tree(tree['left'], X[left_mask])
        predictions[~left_mask] = self._predict_tree(tree['right'], X[~left_mask])
        return predictions

    def predict(self, X):
        F = np.full(len(X), self.init_prediction)
        for tree in self.trees:
            F += self.learning_rate * self._predict_tree(tree, X)
        return F

    def score(self, X, y):
        pred = self.predict(X)
        return 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

Q&A Highlights

Q1: Why is Boosting stronger than Bagging?

A: Bagging trains in parallel, reduces variance; Boosting trains sequentially, reduces both bias and variance. Key difference: - Bagging: Independent training, suits high-variance models (like deep trees) - Boosting: Adaptive training, progressively learns complex patterns from simple models

But Boosting is more prone to overfitting and noise sensitivity.


Q2: Why does AdaBoost use exponential weight updates?

A: The exponential update $w_{t+1}(i) \propto w_t(i) \exp(-\alpha_t y_i h_t(\mathbf{x}_i))$ ensures misclassified samples' weights grow rapidly while correctly classified samples' weights decay. This is the natural result of minimizing exponential loss (see the forward stagewise derivation).


Q3: Why is GBDT's learning rate important?

A: The learning rate $\nu$ controls each tree's contribution. A small $\nu$ (e.g., 0.1) needs more trees, but generalizes better (similar to a small step size in SGD). Reason: shrinkage regularization prevents any single tree from overfitting.


Q4: How to choose Random Forest's feature count?

A: Rules of thumb: - Classification: $m = \sqrt{d}$ - Regression: $m = d/3$. A smaller $m$ means more diversity but weaker individual trees; a larger $m$ means stronger trees but higher correlation. Tune via cross-validation.


Q5: Can Boosting be parallelized?

A: Standard Boosting is sequential (each round depends on previous). But parallel variants exist: - Parallel Bagging + Boosting: Bagging first, Boosting on each subset - Data parallelism: Within single round, split samples across machines (like XGBoost distributed) - Feature parallelism: When finding best split, search features in parallel


Q6: Why does GBDT use decision trees instead of other models?

A: Decision trees naturally support: - Nonlinearity: Automatically learn feature interactions - Piecewise constant: Each leaf optimized independently - Efficient splitting: Fast threshold-based search

Theoretically any model works (like linear models), but trees are most practical.


Q7: Can ensemble learning prevent overfitting?

A: Depends on the method: - Bagging/RF: Rarely overfits (more trees is generally better) - Boosting: Can overfit (training error → 0 while test error may rise)

Preventing Boosting overfitting: - Early stopping (monitor validation set) - Learning rate shrinkage - Subsampling - Tree complexity constraints


✏️ Exercises and Solutions

Exercise 1: Bias-Variance Decomposition Calculation

Problem: Three independent regression models each have squared bias $b^2$ and variance $\sigma^2$. Calculate the expected MSE for a single model and for the ensemble model (simple average), ignoring noise.

Solution:

Single model MSE:

$$\text{MSE}_{\text{single}} = b^2 + \sigma^2$$

Ensemble MSE (assuming independent errors): - Ensembling doesn't change the bias: $b^2$ - Variance reduces to $\sigma^2 / 3$:

$$\text{MSE}_{\text{ensemble}} = b^2 + \frac{\sigma^2}{3}$$

Ensembling reduces the error by $\frac{2}{3}\sigma^2$.


Exercise 2: Majority Voting Error Probability

Problem: 21 independent binary classifiers, each with error rate $\epsilon = 0.3$. Using majority voting, calculate an approximation of the ensemble error rate.

Solution:

Majority voting fails when at least 11 classifiers are wrong:

$$P_{\text{ensemble}} = \sum_{k=11}^{21} \binom{21}{k} (0.3)^k (0.7)^{21-k}$$

Using the binomial cumulative distribution (or a short program):

$$P_{\text{ensemble}} \approx 0.026$$

The ensemble reduces the error rate from $0.3$ to about $0.026$, a significant improvement!


Exercise 3: AdaBoost Weight Update

Problem: After round $t$, the weak classifier has weighted error rate $\epsilon_t < 0.5$. Sample $i$ is correctly classified ($y_i h_t(\mathbf{x}_i) = +1$) with current weight $w_t(i)$. Calculate: (1) the classifier weight $\alpha_t$; (2) the sample's next weight (before normalization).

Solution:

(1) Classifier weight:

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$$

(2) Sample weight update (before normalization):

$$w_{t+1}(i) = w_t(i) \exp\big(-\alpha_t y_i h_t(\mathbf{x}_i)\big)$$

Since $y_i h_t(\mathbf{x}_i) = +1$ (correct):

$$w_{t+1}(i) = w_t(i) \, e^{-\alpha_t} = w_t(i) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} < w_t(i)$$

Correctly classified samples get lower weight; misclassified samples get higher weight (multiplied by $e^{\alpha_t}$), forcing the next round to focus on hard samples.
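Plugging in assumed example values ($\epsilon_t = 0.25$ and $w_t(i) = 0.1$; not from the original text) gives a quick numeric check:

```python
from math import log, exp, sqrt

def adaboost_correct_update(eps_t, w):
    """Alpha and the pre-normalization weight of a correctly classified sample."""
    alpha = 0.5 * log((1 - eps_t) / eps_t)
    return alpha, w * exp(-alpha)

# Assumed values for illustration: eps_t = 0.25, w_t(i) = 0.1
alpha, w_new = adaboost_correct_update(0.25, 0.1)
# alpha = 0.5*ln(3) ~ 0.549; w_new = 0.1*sqrt(1/3) ~ 0.0577 < 0.1
```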


Exercise 4: Random Forest vs Single Deep Tree

Problem: A Random Forest uses 100 trees of depth 10, each trained on a Bootstrap sample, and each split randomly selects $\sqrt{d}$ features ($d$ is the total number of features). Explain why the Random Forest is less prone to overfitting than a single deep tree.

Solution:

Single deep tree: - Trains on complete dataset, memorizes all training sample feature combinations - Extremely high variance, poor generalization

Random Forest's variance reduction mechanisms:

  1. Bootstrap sampling: Each tree sees different training subset (~63.2% original samples), reduces correlation between trees
  2. Feature random selection: Each split only considers $\sqrt{d}$ features, forcing different trees to use different feature combinations, which further decorrelates them
  3. Averaging effect: 100 trees' predictions average out, canceling individual tree overfitting noise

By the variance reduction formula, if the trees were fully independent, the variance would drop to $\sigma^2 / T = \sigma^2 / 100$; even with partial correlation, the variance reduction is still significant.


Exercise 5: GBDT vs AdaBoost

Problem: Both GBDT and AdaBoost are Boosting methods, but GBDT uses gradient descent to fit residuals while AdaBoost uses exponential loss to reweight samples. Compare their pros/cons.

Solution:

| Feature | AdaBoost | GBDT |
| --- | --- | --- |
| Loss function | Exponential (fixed) | Any differentiable (flexible) |
| Base learner | Usually decision stumps (weak) | Decision trees (CART) |
| Update method | Sample reweighting | Fit negative gradient (residual) |
| Noise sensitivity | Sensitive (exponential heavily penalizes outliers) | More robust (can use Huber etc.) |
| Regression tasks | Not applicable | Directly applicable |
| Interpretability | Good (explicit weights) | Good (tree structure) |

Summary: - AdaBoost: Simple, efficient for binary classification, but noise-sensitive - GBDT: More general (regression+classification), flexible loss functions, widely used in industry (XGBoost/LightGBM foundation)

In practice, GBDT variants (XGBoost, LightGBM, CatBoost) dominate due to flexibility and performance advantages.

References

  1. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
  2. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
  3. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
  4. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
  5. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 10: Boosting and Additive Trees]

Ensemble learning, with the wisdom of "many heads are better than one," combines simple models into powerful systems. From Bagging's parallel aggregation to Boosting's sequential improvement, from Random Forest's randomization strategies to GBDT's gradient optimization, ensemble methods showcase the engineering beauty of machine learning. Understanding ensemble learning is not only mastering practical tools, but grasping the systems thinking of "the whole is greater than the sum of its parts"— this idea permeates modern machine learning, from deep learning's Dropout to reinforcement learning's Actor-Critic, all embodying the essence of ensemble.

  • Post title:Machine Learning Mathematical Derivations (11): Ensemble Learning
  • Post author:Chen Kai
  • Create time:2021-10-24 10:30:00
  • Post link:https://www.chenk.top/Machine-Learning-Mathematical-Derivations-11-Ensemble-Learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.