Machine Learning Mathematical Derivations (11): Ensemble Learning
Chen Kai

Ensemble Learning is one of the most powerful weapons in machine learning — combining multiple weak learners to build strong learners with excellent performance. From Kaggle competitions to industrial applications, ensemble methods are ubiquitous. Why does "three cobblers beat a shoemaker"? What's the mathematical mechanism? This chapter systematically derives the theoretical foundations of ensemble learning, including bias-variance decomposition, Boosting's additive models, Random Forest's randomization strategies, and Gradient Boosting's function space optimization, revealing the core wisdom of ensemble learning.

Theoretical Foundations of Ensemble Learning

Why Ensembles Work

Consider a regression problem with true function $f(\mathbf{x})$. A single model predicts $h_t(\mathbf{x}) = f(\mathbf{x}) + \epsilon_t(\mathbf{x})$, and the ensemble averages $T$ such models:

$$H(\mathbf{x}) = \frac{1}{T} \sum_{t=1}^T h_t(\mathbf{x})$$

Squared error decomposition:

Assuming the model errors are independent and identically distributed with $\mathbb{E}[\epsilon_t] = b$ and $\mathrm{Var}(\epsilon_t) = \sigma^2$:

$$\mathbb{E}\big[(H(\mathbf{x}) - f(\mathbf{x}))^2\big] = b^2 + \frac{\sigma^2}{T}$$

Key insight: Ensembling doesn't change the bias, but reduces the variance to $\sigma^2 / T$. Conditions: 1. Base learner errors are independent (achieved through randomization) 2. Base learners aren't too weak (bias not too large)
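A quick standard-library simulation (hypothetical helper name) makes the $\sigma^2/T$ variance reduction concrete: averaging $T$ independent zero-mean predictions shrinks the empirical variance by roughly a factor of $T$.

```python
import random

def ensemble_variance(T, sigma=1.0, trials=20000, seed=0):
    """Empirical variance of the average of T independent noise terms."""
    rng = random.Random(seed)
    preds = [sum(rng.gauss(0.0, sigma) for _ in range(T)) / T
             for _ in range(trials)]
    mean = sum(preds) / trials
    return sum((p - mean) ** 2 for p in preds) / trials

v1 = ensemble_variance(1)    # close to sigma^2 = 1.0
v10 = ensemble_variance(10)  # close to sigma^2 / 10 = 0.1
```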

Bias-Variance Decomposition

For squared loss, the generalization error decomposes as:

$$\mathbb{E}\big[(y - h(\mathbf{x}))^2\big] = \underbrace{\big(\mathbb{E}[h(\mathbf{x})] - f(\mathbf{x})\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(h(\mathbf{x}) - \mathbb{E}[h(\mathbf{x})])^2\big]}_{\text{Variance}} + \underbrace{\sigma_\epsilon^2}_{\text{Noise}}$$

where: - Bias: Gap between the model's average prediction and the true value, measures underfitting - Variance: Prediction fluctuation across different training sets, measures overfitting - Noise: Irreducible error in the data itself

Bias-Variance Dilemma: - Complex models: Low bias, high variance - Simple models: High bias, low variance

Ensemble Strategies: - Bagging: Reduces variance (for high-variance models like decision trees) - Boosting: Reduces bias (progressively improves from weak learners)

Diversity and Accuracy

If $T$ binary classifiers are independent, each with error rate $\epsilon < 0.5$, the majority-vote error rate is:

$$P_{\text{ensemble}} = \sum_{k > T/2} \binom{T}{k} \epsilon^k (1 - \epsilon)^{T-k}$$

By Hoeffding's inequality, this tail probability decays exponentially in $T$:

$$P_{\text{ensemble}} \le \exp\big(-2T(0.5 - \epsilon)^2\big)$$

Example: with $\epsilon = 0.3$ and $T = 21$,

$$P_{\text{ensemble}} \approx 0.026$$

Conclusion: Independent classifiers that are only slightly better than random can, when ensembled, approach perfect accuracy.
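The binomial formula is easy to evaluate directly; this sketch (a hypothetical `majority_vote_error` helper, odd $T$) shows how quickly the majority-vote error collapses as $T$ grows for $\epsilon = 0.3$.

```python
from math import comb

def majority_vote_error(T, eps):
    """P(majority of T independent classifiers is wrong), each with error rate eps (T odd)."""
    return sum(comb(T, k) * eps**k * (1 - eps)**(T - k)
               for k in range(T // 2 + 1, T + 1))

# A single classifier errs 30% of the time,
# but 21 independent voters err only about 2.6% of the time.
print(majority_vote_error(1, 0.3))   # 0.3
print(majority_vote_error(21, 0.3))  # ~0.026
```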

The following figure shows bias-variance tradeoff and ensemble variance reduction:

Bias-Variance Tradeoff

Bagging and Random Forest

The figure below compares Bagging and Boosting workflows:

Bagging vs Boosting

Bootstrap Aggregating (Bagging)

Algorithm (Breiman, 1996):

  1. Bootstrap sampling: From a training set $\mathcal{D}$ of size $N$, sample with replacement $T$ subsets $\mathcal{D}_1, \dots, \mathcal{D}_T$, each of size $N$

  2. Train base learners: Train $h_t$ on $\mathcal{D}_t$

  3. Ensemble prediction:

    • Regression: $H(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^T h_t(\mathbf{x})$ (average)
    • Classification: $H(\mathbf{x}) = \mathrm{mode}\{h_1(\mathbf{x}), \dots, h_T(\mathbf{x})\}$ (majority vote)

Bootstrap sampling properties: - Probability of a sample being selected at least once: $1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 63.2\%$ - About 36.8% of samples are not sampled (Out-of-Bag, OOB)

OOB error estimate: For each sample, predict using only the models whose bootstrap sets don't contain it, then compute the average error:

$$\text{OOB Error} = \frac{1}{N} \sum_{i=1}^N L\left(y_i, \frac{1}{|\{t: (\mathbf{x}_i, y_i) \notin \mathcal{D}_t\}|} \sum_{t: (\mathbf{x}_i, y_i) \notin \mathcal{D}_t} h_t(\mathbf{x}_i)\right)$$

OOB error is unbiased estimate of generalization error, no need for separate validation set.
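The 36.8% OOB fraction can be checked empirically; this standard-library sketch draws bootstrap samples and counts how many indices never appear.

```python
import random

def oob_fraction(N, trials=2000, seed=0):
    """Average fraction of indices missing from a bootstrap sample of size N."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        drawn = {rng.randrange(N) for _ in range(N)}  # indices drawn with replacement
        total += 1.0 - len(drawn) / N
    return total / trials

# (1 - 1/N)^N -> e^{-1} ~ 0.368 as N grows
print(oob_fraction(1000))  # ~0.368
```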

Random Forest

Innovation (Breiman, 2001): On top of Bagging, introduce random feature selection:

  1. Bootstrap sampling: Same as Bagging
  2. Random feature subset: At each node split, randomly select $m$ features from the $d$ total features (usually $m = \sqrt{d}$ for classification or $m = d/3$ for regression)
  3. Fully grown: Decision trees are not pruned; they grow until nodes are pure or reach a minimum sample count

Dual randomization mechanism: - Sample randomization: Bootstrap (reduces sample variance) - Feature randomization: Random feature selection (reduces feature correlation, increases diversity)

Theoretical analysis: Generalization error upper bound (Breiman, 2001):

$$PE^* \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}$$

where: - $s$ is the strength (expected margin) of the individual trees - $\bar{\rho}$ is the average correlation between trees

Methods to reduce error: 1. Increase $s$: Use stronger base learners 2. Decrease $\bar{\rho}$: Increase randomness (e.g., decrease $m$)

Feature Importance

Based on impurity decrease: Feature $j$'s importance is the sum of impurity decreases over all nodes that split on $j$:

$$\text{Imp}(j) = \sum_{t=1}^T \sum_{\text{nodes } n \in h_t \text{ splitting on } j} \Delta i(n)$$

Based on OOB permutation: Randomly shuffle feature $j$'s values in the OOB samples and compute the accuracy drop:

$$\text{Imp}(j) = \text{Acc}_{\text{OOB}} - \text{Acc}_{\text{OOB}}^{\text{permuted } j}$$

Permutation importance is more accurate (not affected by feature correlation), but slower.
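As a sketch of permutation importance (here on a held-in toy set with a plain least-squares linear model rather than a forest, and with hypothetical helper names), shuffle one feature column at a time and record the drop in $R^2$:

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Drop in R^2 after shuffling each feature column in turn."""
    rng = np.random.default_rng(seed)
    def r2(pred):
        return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    base = r2(predict(X))
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        perm = rng.permutation(len(Xp))
        Xp[:, j] = Xp[perm, j]          # destroy feature j's association with y
        drops.append(base - r2(predict(Xp)))
    return np.array(drops)

# Toy data: y depends only on feature 0, so shuffling it should hurt most.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares linear fit
imp = permutation_importance(lambda A: A @ w, X, y)
```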

The following figure shows Random Forest feature importance and tree diversity:

Random Forest Feature Importance

Boosting Algorithm Family

AdaBoost: Adaptive Boosting

The figures below show AdaBoost training error convergence and sample weight evolution:

AdaBoost Convergence

The animation demonstrates how Boosting iteratively reweights samples to improve decision boundary:

Boosting Animation

Core idea: Iteratively train weak learners, each round increase weights of misclassified samples, finally weighted combination.

Algorithm (Freund & Schapire, 1997):

Input: Training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$ with $y_i \in \{-1, +1\}$; base learner; number of rounds $T$. Initialize: $w_1(i) = 1/N$. Loop ($t = 1, \dots, T$):

  1. Train base learner $h_t$ on $\mathcal{D}$ with weights $w_t$

  2. Compute weighted error rate: $\epsilon_t = \sum_{i=1}^N w_t(i) \, \mathbb{1}[h_t(\mathbf{x}_i) \neq y_i]$

  3. Compute base learner weight: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$

  4. Update sample weights:

$$w_{t+1}(i) = \frac{w_t(i) \exp\big(-\alpha_t y_i h_t(\mathbf{x}_i)\big)}{Z_t}$$

where $Z_t$ is the normalization factor.

Final model:

$$H(\mathbf{x}) = \mathrm{sign}\left( \sum_{t=1}^T \alpha_t h_t(\mathbf{x}) \right)$$

Theoretical Guarantee of AdaBoost

Training error upper bound:

$$\frac{1}{N} \sum_{i=1}^N \mathbb{1}[H(\mathbf{x}_i) \neq y_i] \le \prod_{t=1}^T Z_t$$

If each round $\epsilon_t \le \frac{1}{2} - \gamma$ for some $\gamma > 0$ (weak learning condition), then:

$$\prod_{t=1}^T Z_t \le (1 - 4\gamma^2)^{T/2} \le \exp(-2\gamma^2 T)$$

Conclusion: Training error decreases exponentially!
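A quick numeric check (standard library only) confirms the inequality $(1-4\gamma^2)^{T/2} \le \exp(-2\gamma^2 T)$, which follows from $1 - x \le e^{-x}$ with $x = 4\gamma^2$:

```python
from math import exp

# (1 - 4*gamma^2)^(T/2) <= exp(-2*gamma^2*T) because 1 - x <= e^{-x}
for gamma in (0.05, 0.1, 0.2):
    for T in (10, 100, 1000):
        assert (1 - 4 * gamma**2) ** (T / 2) <= exp(-2 * gamma**2 * T)

# e.g. gamma = 0.1, T = 100: 0.96^50 ~ 0.130 <= e^{-2} ~ 0.135
print((1 - 4 * 0.1**2) ** 50, exp(-2 * 0.1**2 * 100))
```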

Exponential Loss and Additive Models

AdaBoost is equivalent to forward stagewise minimization of the exponential loss in an additive model:

$$L(y, F(\mathbf{x})) = \exp(-y F(\mathbf{x})), \qquad F(\mathbf{x}) = \sum_{t=1}^T \alpha_t h_t(\mathbf{x})$$

Forward Stagewise Additive Modeling:

Each round optimizes:

$$(\alpha_t, h_t) = \arg\min_{\alpha, h} \sum_{i=1}^N \exp\big(-y_i (F_{t-1}(\mathbf{x}_i) + \alpha h(\mathbf{x}_i))\big)$$

Let $w_t(i) = \exp(-y_i F_{t-1}(\mathbf{x}_i))$, then:

$$h_t = \arg\min_h \sum_i w_t(i) \, \mathbb{1}[h(\mathbf{x}_i) \neq y_i]$$

This is exactly AdaBoost's update formula!

Insight: AdaBoost is coordinate descent in function space, each round adding a new basis function.

Gradient Boosting Decision Trees (GBDT)

Gradient Boosting Framework

Idea: View Boosting as gradient descent optimization in function space.

Objective: Minimize the loss

$$\min_F \sum_{i=1}^N L(y_i, F(\mathbf{x}_i))$$

where $F(\mathbf{x}) = \sum_{t=0}^T h_t(\mathbf{x})$ is an additive model.

Gradient Boosting Algorithm (Friedman, 2001):

Initialize: $F_0(\mathbf{x}) = \arg\min_c \sum_{i=1}^N L(y_i, c)$. Loop ($t = 1, \dots, T$):

  1. Compute negative gradient (pseudo-residual):

$$r_{ti} = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{t-1}}$$

  2. Fit base learner: train $h_t$ on $\{(\mathbf{x}_i, r_{ti})\}_{i=1}^N$

  3. Line search: $\gamma_t = \arg\min_\gamma \sum_{i=1}^N L\big(y_i, F_{t-1}(\mathbf{x}_i) + \gamma h_t(\mathbf{x}_i)\big)$

  4. Update model: $F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \gamma_t h_t(\mathbf{x})$. Final model: $F_T(\mathbf{x})$

Different Loss Functions

Squared Loss (Regression)

$$L(y, F) = \frac{1}{2}(y - F)^2$$

$$r_i = y_i - F_{t-1}(\mathbf{x}_i)$$

With the $\frac{1}{2}$ factor, the negative gradient is exactly the residual: gradient boosting degenerates to iteratively fitting residuals, similar to traditional regression on the errors.

Absolute Loss (Robust Regression)

$$L(y, F) = |y - F|$$

$$r_i = \mathrm{sign}\big(y_i - F_{t-1}(\mathbf{x}_i)\big)$$

Robust to outliers.

Logistic Loss (Binary Classification)

For $y \in \{-1, +1\}$:

$$L(y, F) = \ln\big(1 + \exp(-yF)\big)$$

$$r_i = \frac{y_i}{1 + \exp\big(y_i F_{t-1}(\mathbf{x}_i)\big)}$$

The optimal output is the log-odds: $F^*(\mathbf{x}) = \ln \frac{P(y=+1 \mid \mathbf{x})}{P(y=-1 \mid \mathbf{x})}$
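The logistic pseudo-residual behaves like a soft misclassification indicator, as this small numpy sketch (hypothetical function name) shows:

```python
import numpy as np

def logistic_pseudo_residuals(y, F):
    """Negative gradient of log(1 + exp(-y*F)) w.r.t. F, for y in {-1, +1}."""
    return y / (1.0 + np.exp(y * F))

y = np.array([+1.0, +1.0, -1.0])
F = np.array([+5.0, -5.0, +5.0])   # confidently correct, wrong, wrong
r = logistic_pseudo_residuals(y, F)
# confidently correct -> residual near 0; confidently wrong -> residual near y
print(r)
```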

Regularization and Shrinkage

Learning rate shrinkage:

$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \nu \, h_t(\mathbf{x}), \quad \nu \in (0, 1]$$

A small learning rate (e.g., $\nu = 0.1$) combined with a large number of trees $T$ (e.g., $T = 1000$) usually performs better.

Subsampling (Stochastic Gradient Boosting): Each round, randomly sample a fraction $\eta$ of the training samples ($\eta \in (0, 1)$, e.g., $\eta = 0.5$) to fit the tree, similar to SGD.

Tree complexity penalty: Limit tree depth (e.g., depth $\le 5$), leaf count, and minimum samples per leaf.

Complete Implementation

import numpy as np

class DecisionStump:
    """Decision stump (single-layer decision tree)"""
    def __init__(self):
        self.feature_idx = None
        self.threshold = None
        self.left_value = None
        self.right_value = None

    def fit(self, X, y, weights):
        N, d = X.shape
        best_error = float('inf')

        for j in range(d):
            thresholds = np.unique(X[:, j])
            for threshold in thresholds:
                left_mask = X[:, j] <= threshold
                right_mask = ~left_mask

                if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                    continue

                # Weighted majority vote
                left_value = np.sign(np.sum(weights[left_mask] * y[left_mask]))
                right_value = np.sign(np.sum(weights[right_mask] * y[right_mask]))

                # Weighted error
                pred = np.where(left_mask, left_value, right_value)
                error = np.sum(weights[pred != y])

                if error < best_error:
                    best_error = error
                    self.feature_idx = j
                    self.threshold = threshold
                    self.left_value = left_value
                    self.right_value = right_value

        return self

    def predict(self, X):
        left_mask = X[:, self.feature_idx] <= self.threshold
        return np.where(left_mask, self.left_value, self.right_value)

class AdaBoost:
    """AdaBoost classifier"""
    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.estimators = []
        self.alphas = []

    def fit(self, X, y):
        N = X.shape[0]
        weights = np.ones(N) / N

        for t in range(self.n_estimators):
            # Train weak learner
            stump = DecisionStump()
            stump.fit(X, y, weights)
            pred = stump.predict(X)

            # Compute weighted error
            epsilon = np.sum(weights[pred != y])
            epsilon = np.clip(epsilon, 1e-10, 1 - 1e-10)

            # Compute learner weight
            alpha = 0.5 * np.log((1 - epsilon) / epsilon)

            # Update sample weights
            weights *= np.exp(-alpha * y * pred)
            weights /= np.sum(weights)

            self.estimators.append(stump)
            self.alphas.append(alpha)

        return self

    def predict(self, X):
        predictions = np.array([alpha * estimator.predict(X)
                                for alpha, estimator in zip(self.alphas, self.estimators)])
        return np.sign(np.sum(predictions, axis=0))

    def score(self, X, y):
        return np.mean(self.predict(X) == y)

class GradientBoostingRegressor:
    """Gradient Boosting Regression Trees"""
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.init_prediction = None

    def fit(self, X, y):
        # Initialize: predict mean
        self.init_prediction = np.mean(y)
        F = np.full(len(y), self.init_prediction)

        for t in range(self.n_estimators):
            # Compute negative gradient (residuals for squared loss)
            residuals = y - F

            # Fit residuals
            tree = self._build_tree(X, residuals, depth=0)
            self.trees.append(tree)

            # Update predictions
            F += self.learning_rate * self._predict_tree(tree, X)

        return self

    def _build_tree(self, X, y, depth):
        """Recursively build regression tree"""
        N = len(y)

        # Stopping conditions
        if depth >= self.max_depth or N < 2:
            return {'value': np.mean(y)}

        # Find best split
        best_mse = float('inf')
        best_split = None

        for j in range(X.shape[1]):
            thresholds = np.unique(X[:, j])
            for threshold in thresholds:
                left_mask = X[:, j] <= threshold
                right_mask = ~left_mask

                if np.sum(left_mask) < 1 or np.sum(right_mask) < 1:
                    continue

                mse = (np.var(y[left_mask]) * np.sum(left_mask)
                       + np.var(y[right_mask]) * np.sum(right_mask))

                if mse < best_mse:
                    best_mse = mse
                    best_split = (j, threshold, left_mask, right_mask)

        if best_split is None:
            return {'value': np.mean(y)}

        j, threshold, left_mask, right_mask = best_split

        return {
            'feature': j,
            'threshold': threshold,
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1)
        }

    def _predict_tree(self, tree, X):
        """Single tree prediction"""
        if 'value' in tree:
            return np.full(len(X), tree['value'])

        left_mask = X[:, tree['feature']] <= tree['threshold']
        predictions = np.zeros(len(X))
        predictions[left_mask] = self._predict_tree(tree['left'], X[left_mask])
        predictions[~left_mask] = self._predict_tree(tree['right'], X[~left_mask])
        return predictions

    def predict(self, X):
        F = np.full(len(X), self.init_prediction)
        for tree in self.trees:
            F += self.learning_rate * self._predict_tree(tree, X)
        return F

    def score(self, X, y):
        pred = self.predict(X)
        return 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

Q&A Highlights

Q1: Why is Boosting stronger than Bagging?

A: Bagging trains in parallel, reduces variance; Boosting trains sequentially, reduces both bias and variance. Key difference: - Bagging: Independent training, suits high-variance models (like deep trees) - Boosting: Adaptive training, progressively learns complex patterns from simple models

But Boosting is more prone to overfitting and noise sensitivity.


Q2: Why does AdaBoost use exponential weight updates?

A: The exponential update $w_{t+1}(i) \propto w_t(i) \exp(-\alpha_t y_i h_t(\mathbf{x}_i))$ ensures misclassified samples' weights grow rapidly while correctly classified samples' weights decay. This is the natural result of minimizing exponential loss (see the forward stagewise derivation).


Q3: Why is GBDT's learning rate important?

A: The learning rate $\nu$ controls each tree's contribution. A small $\nu$ (e.g., 0.1) needs more trees, but generalizes better (similar to a small step size in SGD). Reason: shrinkage regularization prevents any single tree from overfitting.


Q4: How to choose Random Forest's feature count?

A: Rules of thumb: - Classification: $m = \sqrt{d}$ - Regression: $m = d/3$. A smaller $m$ means more diversity but weaker individual trees; a larger $m$ means stronger trees but higher correlation. Tune via cross-validation.


Q5: Can Boosting be parallelized?

A: Standard Boosting is sequential (each round depends on previous). But parallel variants exist: - Parallel Bagging + Boosting: Bagging first, Boosting on each subset - Data parallelism: Within single round, split samples across machines (like XGBoost distributed) - Feature parallelism: When finding best split, search features in parallel


Q6: Why does GBDT use decision trees instead of other models?

A: Decision trees naturally support: - Nonlinearity: Automatically learn feature interactions - Piecewise constant: Each leaf optimized independently - Efficient splitting: Fast threshold-based search

Theoretically any model works (like linear models), but trees are most practical.


Q7: Can ensemble learning prevent overfitting?

A: Depends on the method: - Bagging/RF: Rarely overfits (more trees is generally better) - Boosting: Can overfit (training error → 0 while test error may rise)

Preventing Boosting overfitting: - Early stopping (monitor validation set) - Learning rate shrinkage - Subsampling - Tree complexity constraints


✏️ Exercises and Solutions

Exercise 1: Bias-Variance Decomposition Calculation

Problem: Three independent regression models each have squared bias $b^2$ and variance $\sigma^2$. Calculate the expected MSE for a single model and for the ensemble model (simple average), ignoring noise.

Solution:

Single model MSE:

$$\text{MSE}_{\text{single}} = b^2 + \sigma^2$$

Ensemble MSE (assuming independent errors): - Ensembling doesn't change the bias: $b^2$ - Variance reduces to $\sigma^2 / 3$:

$$\text{MSE}_{\text{ensemble}} = b^2 + \frac{\sigma^2}{3}$$

Ensembling reduces the error by $\frac{2}{3}\sigma^2$.


Exercise 2: Majority Voting Error Probability

Problem: 21 independent binary classifiers, each with error rate $\epsilon = 0.3$. Using majority voting, calculate an approximation of the ensemble error rate.

Solution:

Majority voting fails when at least 11 classifiers are wrong:

$$P_{\text{ensemble}} = \sum_{k=11}^{21} \binom{21}{k} (0.3)^k (0.7)^{21-k}$$

Using the binomial cumulative distribution (or a short program):

$$P_{\text{ensemble}} \approx 0.026$$

The ensemble reduces the error rate from $0.3$ to about $0.026$, a significant improvement!


Exercise 3: AdaBoost Weight Update

Problem: After round $t$, the weak classifier has weighted error rate $\epsilon_t < 0.5$. Sample $i$ is correctly classified ($y_i h_t(\mathbf{x}_i) = +1$) with current weight $w_t(i)$. Calculate: (1) the classifier weight $\alpha_t$; (2) the sample's next weight (before normalization).

Solution:

(1) Classifier weight:

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$$

(2) Sample weight update (before normalization):

$$w_{t+1}(i) = w_t(i) \exp\big(-\alpha_t y_i h_t(\mathbf{x}_i)\big)$$

Since $y_i h_t(\mathbf{x}_i) = +1$ (correct):

$$w_{t+1}(i) = w_t(i) \, e^{-\alpha_t} = w_t(i) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} < w_t(i)$$

Correctly classified samples get lower weight; misclassified samples get higher weight (multiplied by $e^{\alpha_t}$), forcing the next round to focus on hard samples.
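Plugging in assumed example values ($\epsilon_t = 0.25$ and $w_t(i) = 0.1$; not from the original text) gives a quick numeric check:

```python
from math import log, exp, sqrt

def adaboost_correct_update(eps_t, w):
    """Alpha and the pre-normalization weight of a correctly classified sample."""
    alpha = 0.5 * log((1 - eps_t) / eps_t)
    return alpha, w * exp(-alpha)

# Assumed values for illustration: eps_t = 0.25, w_t(i) = 0.1
alpha, w_new = adaboost_correct_update(0.25, 0.1)
# alpha = 0.5*ln(3) ~ 0.549; w_new = 0.1*sqrt(1/3) ~ 0.0577 < 0.1
```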


Exercise 4: Random Forest vs Single Deep Tree

Problem: A Random Forest uses 100 trees of depth 10, each trained on a Bootstrap sample, and each split randomly selects $\sqrt{d}$ features ($d$ is the total number of features). Explain why the Random Forest is less prone to overfitting than a single deep tree.

Solution:

Single deep tree: - Trains on complete dataset, memorizes all training sample feature combinations - Extremely high variance, poor generalization

Random Forest's variance reduction mechanisms:

  1. Bootstrap sampling: Each tree sees different training subset (~63.2% original samples), reduces correlation between trees
  2. Feature random selection: Each split only considers $\sqrt{d}$ features, forcing different trees to use different feature combinations, which further decorrelates them
  3. Averaging effect: 100 trees' predictions average out, canceling individual tree overfitting noise

By the variance reduction formula, if the trees were fully independent, the variance would drop to $\sigma^2 / T = \sigma^2 / 100$; even with partial correlation, the variance reduction is still significant.


Exercise 5: GBDT vs AdaBoost

Problem: Both GBDT and AdaBoost are Boosting methods, but GBDT uses gradient descent to fit residuals while AdaBoost uses exponential loss to reweight samples. Compare their pros/cons.

Solution:

| Feature | AdaBoost | GBDT |
| --- | --- | --- |
| Loss function | Exponential (fixed) | Any differentiable (flexible) |
| Base learner | Usually decision stumps (weak) | Decision trees (CART) |
| Update method | Sample reweighting | Fit negative gradient (residual) |
| Noise sensitivity | Sensitive (exponential heavily penalizes outliers) | More robust (can use Huber etc.) |
| Regression tasks | Not applicable | Directly applicable |
| Interpretability | Good (explicit weights) | Good (tree structure) |

Summary: - AdaBoost: Simple, efficient for binary classification, but noise-sensitive - GBDT: More general (regression+classification), flexible loss functions, widely used in industry (XGBoost/LightGBM foundation)

In practice, GBDT variants (XGBoost, LightGBM, CatBoost) dominate due to flexibility and performance advantages.

References

  1. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
  2. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
  3. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
  4. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
  5. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 10: Boosting and Additive Trees]

Ensemble learning, with the wisdom of "many heads are better than one," combines simple models into powerful systems. From Bagging's parallel aggregation to Boosting's sequential improvement, from Random Forest's randomization strategies to GBDT's gradient optimization, ensemble methods showcase the engineering beauty of machine learning. Understanding ensemble learning is not only mastering practical tools, but grasping the systems thinking of "the whole is greater than the sum of its parts"— this idea permeates modern machine learning, from deep learning's Dropout to reinforcement learning's Actor-Critic, all embodying the essence of ensemble.

  • Post title:Machine Learning Mathematical Derivations (11): Ensemble Learning
  • Post author:Chen Kai
  • Create time:2021-10-24 10:30:00
  • Post link:https://www.chenk.top/Machine-Learning-Mathematical-Derivations-11-Ensemble-Learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.