Machine Learning Mathematical Derivations (12): XGBoost and LightGBM
Chen Kai

XGBoost and LightGBM are the most popular gradient boosting frameworks in industry — building on GBDT with sophisticated mathematical optimizations and engineering innovations, achieving perfect balance between accuracy and efficiency. From XGBoost's second-order optimization to LightGBM's histogram acceleration, from regularization penalties to gradient-based one-side sampling, these techniques embody deep mathematical insights. This chapter systematically derives the mathematical principles, algorithm details, and practical techniques of XGBoost and LightGBM.

XGBoost: Extreme Gradient Boosting

Regularized Objective Function

(Figure: XGBoost regularization)

Standard GBDT minimizes the loss

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

without an explicit penalty on tree complexity.

(Figure: XGBoost tree structure)

XGBoost objective: add a tree-complexity regularization term to the loss at round $t$:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\big) + \Omega(f_t)$$

where:
- $\hat{y}_i^{(t-1)}$ is the prediction from the first $t-1$ rounds
- $f_t$ is the $t$-th tree
- $\Omega(f_t)$ is the regularization term

Tree regularization:

$$\Omega(f) = \gamma T + \frac{\lambda}{2} \sum_{j=1}^{T} w_j^2$$

where:
- $T$ is the number of leaf nodes
- $w_j$ is the weight (prediction value) of leaf $j$
- $\gamma$ penalizes leaf count (like pruning)
- $\lambda$ penalizes leaf weights (L2 regularization)

Second-Order Taylor Expansion

Taylor-expand the loss function around $\hat{y}_i^{(t-1)}$:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)$$

where:
- $g_i = \partial_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is the first-order gradient
- $h_i = \partial^2_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is the second-order gradient (Hessian)

Simplified objective function (omitting the constant term $l(y_i, \hat{y}_i^{(t-1)})$):

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)$$
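To see how well the quadratic approximation tracks the true loss, here is a quick numeric check for the logistic loss (a sketch; the label, score, and step values are arbitrary illustrative choices):

```python
import numpy as np

def logistic_loss(y, F):
    # Log loss written in terms of the raw score F, for a label y in {0, 1}
    return np.log1p(np.exp(-F)) + (1 - y) * F

y, F0, delta = 1.0, 0.4, 0.3
p = 1 / (1 + np.exp(-F0))
g, h = p - y, p * (1 - p)              # first- and second-order gradients at F0

exact = logistic_loss(y, F0 + delta)
taylor = logistic_loss(y, F0) + g * delta + 0.5 * h * delta ** 2
print(exact, taylor)                   # nearly identical for this small step
```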

Objective Function Under Tree Structure

(Figure: tree regularization)

Tree representation: define tree $f_t$ as

$$f_t(\mathbf{x}) = w_{q(\mathbf{x})}$$

where $q: \mathbb{R}^d \to \{1, \dots, T\}$ maps a sample to a leaf node, and $w_j$ is the weight of leaf $j$.

Reorganize the objective: group samples by leaf node

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

where $I_j = \{ i \mid q(\mathbf{x}_i) = j \}$ is the sample set in leaf $j$.

Optimal leaf weight: differentiate with respect to $w_j$ and set to zero:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$.

Substitute the optimal weights:

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

This is the optimal loss given the tree structure $q$.
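A minimal numeric sketch of these closed forms (the per-sample gradients and Hessians below are made up for illustration):

```python
import numpy as np

# Hypothetical gradients/Hessians of the samples falling in one leaf
g = np.array([-0.5, -0.3, 0.2])
h = np.array([0.25, 0.21, 0.16])
lam = 1.0

G, H = g.sum(), h.sum()
w_star = -G / (H + lam)             # w_j* = -G_j / (H_j + lambda)
score = -0.5 * G ** 2 / (H + lam)   # this leaf's term in the optimal loss
print(w_star, score)
```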

Split Gain Computation

Problem: How to choose optimal split point?

For leaf node $j$, consider splitting on feature $k$ at threshold $\theta$:
- Left child: $I_L = \{ i \in I_j \mid x_i^{(k)} \leq \theta \}$
- Right child: $I_R = \{ i \in I_j \mid x_i^{(k)} > \theta \}$

Loss before split:

$$-\frac{1}{2} \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$

Loss after split:

$$-\frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right] + 2\gamma$$

Split gain:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Split criterion:
- If $\text{Gain} > 0$, perform the split
- Select the feature and threshold with maximum $\text{Gain}$
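The gain formula can be checked directly in code (a sketch with made-up child statistics):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                  - G ** 2 / (H + lam)) - gamma

gain = split_gain(-4.0, 3.0, 5.0, 4.0, lam=1.0, gamma=0.5)
print(gain)  # 3.9375
```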

Gradients for Common Loss Functions

Squared Loss (Regression)

$$l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2, \quad g_i = \hat{y}_i - y_i, \quad h_i = 1$$

Logistic Loss (Binary Classification)

Let $p_i = \sigma(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}$, then:

$$g_i = p_i - y_i, \quad h_i = p_i (1 - p_i)$$

Softmax Loss (Multi-class)

For class $c$:

$$g_{i,c} = p_{i,c} - \mathbb{1}[y_i = c], \quad h_{i,c} = p_{i,c}(1 - p_{i,c})$$

where $p_{i,c} = \frac{e^{\hat{y}_{i,c}}}{\sum_{c'} e^{\hat{y}_{i,c'}}}$.
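For the multi-class case, the per-class gradients can be sketched as follows (the raw scores and true class index are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # stabilized softmax
    return e / e.sum()

F = np.array([0.2, 1.0, -0.5])     # raw scores for 3 classes
y = 1                              # true class index

p = softmax(F)
g = p - np.eye(3)[y]               # g_c = p_c - 1[y = c]
h = p * (1 - p)                    # diagonal Hessian approximation
print(g, h)
```

Note that the gradients over classes sum to zero, since the probabilities sum to one.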

Splitting Algorithms

Exact Greedy Algorithm

Input: sample set $I$ at the current node, feature set $\{1, \dots, d\}$

Algorithm:
1. Initialize: $G = \sum_{i \in I} g_i$, $H = \sum_{i \in I} h_i$
2. Enumerate features: for each feature $k$:
   - Sort the samples in $I$ by feature $k$
   - Enumerate thresholds: for each possible split point $\theta$:
     - Accumulate $G_L \mathrel{+}= g_i$, $H_L \mathrel{+}= h_i$
     - Compute $G_R = G - G_L$, $H_R = H - H_L$
     - If $\text{Gain}$ exceeds the current best, update the best split
3. Return: the optimal split $(k, \theta)$

Complexity: $O(n d \log n)$, where $n$ is the number of samples (candidate split points) and $d$ is the feature count.

Approximate Algorithm (Quantile Sketch)

The exact algorithm is infeasible for big data. Approximate algorithm: use quantiles instead of all possible thresholds.

Weighted quantile: treat the second-order gradient $h_i$ as the sample weight. Define the rank function for feature $k$:

$$r_k(z) = \frac{\sum_{(x, h) \in \mathcal{D}_k,\, x < z} h}{\sum_{(x, h) \in \mathcal{D}_k} h}$$

Find candidate split points $\{\theta_{k1}, \dots, \theta_{kl}\}$ such that

$$|r_k(\theta_{k,j}) - r_k(\theta_{k,j+1})| < \epsilon$$

so each bucket carries roughly an $\epsilon$ fraction of the total Hessian mass.
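A rough sketch of the quantile idea, selecting candidates so consecutive thresholds are about $\epsilon$ apart in weighted rank (the function name and $\epsilon$ value are illustrative, not the paper's sketch data structure):

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.25):
    # Choose thresholds so consecutive candidates differ by ~eps in weighted rank
    order = np.argsort(x)
    x_s = x[order]
    cum = np.cumsum(h[order]) / h.sum()    # weighted rank r_k
    candidates, next_rank = [], eps
    for xi, r in zip(x_s, cum):
        if r >= next_rank:
            candidates.append(float(xi))
            next_rank += eps
    return candidates

cands = weighted_quantile_candidates(np.arange(8.0), np.ones(8), eps=0.25)
print(cands)  # uniform weights give evenly spaced candidates
```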

Sparsity-Aware Algorithm

Problem: Real data often has missing values or sparse features (like one-hot encoding).

Default direction: Learn default split direction (left or right) for each node.

Algorithm:
1. Try assigning the non-missing samples to the left and right children as usual
2. Assign the missing samples to a candidate default direction (left, then right)
3. Select the split and default direction with maximum gain

Implementation: only non-missing values are traversed, so the cost is $O(\|\mathbf{x}\|_0)$, proportional to the number of non-missing entries.
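The default-direction choice can be sketched by scoring both routings of the missing-value gradient mass (all statistics below are made up; the helper name is hypothetical):

```python
def best_default_direction(G_L, H_L, G_R, H_R, G_miss, H_miss, lam=1.0):
    # Score a candidate structure by the sum of G^2/(H + lambda) over both children
    def score(gl, hl, gr, hr):
        return gl ** 2 / (hl + lam) + gr ** 2 / (hr + lam)

    left = score(G_L + G_miss, H_L + H_miss, G_R, H_R)    # missing samples go left
    right = score(G_L, H_L, G_R + G_miss, H_R + H_miss)   # missing samples go right
    return ("left", left) if left >= right else ("right", right)

direction, s = best_default_direction(-3.0, 2.0, 1.0, 1.0, G_miss=-1.0, H_miss=0.5)
print(direction)  # 'left'
```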

LightGBM: Efficient Gradient Boosting

Histogram Algorithm

Core idea: discretize continuous features into $k$ bins, then use histograms to accumulate gradient statistics.

The following figure illustrates histogram binning:

(Figure: histogram binning)

Advantages:
- Memory: store histograms instead of raw sorted data, $O(k)$ per feature instead of $O(n)$
- Computation: traverse $k$ bins instead of $n$ samples, $O(k \times d)$ instead of $O(n \times d)$ per split search
- Communication: in distributed training, transfer histograms instead of data

Algorithm:
1. Feature discretization: map feature $x^{(j)}$ to bins $\{0, 1, \dots, k-1\}$
2. Build histogram: for each bin $b$, accumulate gradients and Hessians: $G_b = \sum_{i:\, \text{bin}(x_i^{(j)}) = b} g_i$, $H_b = \sum_{i:\, \text{bin}(x_i^{(j)}) = b} h_i$
3. Traverse bins: compute the split gain at each bin boundary ($O(k)$ instead of $O(n)$)

Histogram subtraction: for sibling nodes, compute only one child's histogram and obtain the other from the parent by subtraction:

$$\text{Hist}(\text{right}) = \text{Hist}(\text{parent}) - \text{Hist}(\text{left})$$

This roughly halves the histogram-building cost.
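A sketch of histogram construction and the subtraction trick (the bin count and random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 16
bins = rng.integers(0, k, size=n)      # feature values already discretized to k bins
g = rng.normal(size=n)                 # per-sample gradients

# Parent histogram: accumulated gradient per bin
hist_parent = np.bincount(bins, weights=g, minlength=k)

left = rng.random(n) < 0.5             # samples routed to the left child
hist_left = np.bincount(bins[left], weights=g[left], minlength=k)

# Sibling histogram via subtraction instead of a second pass over the data
hist_right = hist_parent - hist_left
direct = np.bincount(bins[~left], weights=g[~left], minlength=k)
print(np.allclose(hist_right, direct))  # True (up to float rounding)
```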

Leaf-wise Growth Strategy

Level-wise (XGBoost default): each round splits all leaf nodes at the same depth.

Leaf-wise (LightGBM): each round splits only the single leaf with maximum gain.

Advantages: - Higher accuracy (prioritize most valuable nodes) - Faster convergence (don't waste computation on low-gain nodes)

Risks: - Prone to overfitting (deeper, more unbalanced trees) - Needs max_depth constraint

GOSS: Gradient-based One-Side Sampling

Motivation: Different samples contribute differently to gradients; large-gradient samples are more important.

Algorithm:
1. Sort: rank samples by absolute gradient $|g_i|$ in descending order
2. Sample:
   - Keep the top $a \times 100\%$ large-gradient samples (e.g., $a = 0.2$)
   - Randomly sample $b \times 100\%$ of the full dataset from the remainder (e.g., $b = 0.1$)
3. Reweight: multiply the small-gradient samples' gradients by $\frac{1-a}{b}$ to compensate

Split gain estimation:

$$\tilde{V}_j(\theta) = \frac{1}{n} \left[ \frac{\left( \sum_{i \in A_L} g_i + \frac{1-a}{b} \sum_{i \in B_L} g_i \right)^2}{n_L^j(\theta)} + \frac{\left( \sum_{i \in A_R} g_i + \frac{1-a}{b} \sum_{i \in B_R} g_i \right)^2}{n_R^j(\theta)} \right]$$

where $A$ is the large-gradient sample set and $B$ is the sampled small-gradient set.

Theoretical guarantee: GOSS estimation error is bounded (proof in LightGBM paper).
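The sampling step can be sketched as follows (the helper name is hypothetical; $a = 0.2$, $b = 0.1$ follow the convention above, with $b$ a fraction of the full dataset):

```python
import numpy as np

def goss_sample(g, a=0.2, b=0.1, seed=0):
    # Keep the top a*n rows by |gradient|; sample b*n of the rest and reweight
    rng = np.random.default_rng(seed)
    n = len(g)
    order = np.argsort(-np.abs(g))
    top, rest = order[: int(a * n)], order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[len(top):] = (1 - a) / b   # compensate the discarded small-gradient mass
    return idx, weights

g = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(g)
print(len(idx), w.max())  # 300 rows kept; small-gradient rows reweighted by (1-a)/b = 8
```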

EFB: Exclusive Feature Bundling

Motivation: Sparse features are mutually exclusive (like one-hot encoding), can be bundled into single feature.

Problem formulation: partition the $d$ features into bundles, minimizing the bundle count while keeping the features within each bundle as mutually exclusive as possible.

This is graph coloring problem (NP-hard).

Approximate algorithm:
1. Construct a conflict graph: nodes are features, edge weights are the conflict degree (count of rows where both features are non-zero)
2. Greedy coloring: in order of decreasing total conflict, assign each feature to the bundle with minimum conflict (or open a new bundle)

Feature bundling: merge the features in a bundle, avoiding collisions via bin offsets:

$$x^{\text{bundle}} = x^{(j)} + \text{offset}_j, \quad \text{offset}_j = \sum_{j' < j} k_{j'}$$

where $\text{offset}_j$ is the sum of the bin counts of the first $j - 1$ features in the bundle.
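A toy illustration of the offset trick for two exclusive features (the values and bin counts are invented):

```python
import numpy as np

# Two mutually exclusive sparse features, discretized to bins 1..k (0 = absent)
x1 = np.array([2, 0, 0, 1, 0])   # k1 = 3 bins
x2 = np.array([0, 3, 1, 0, 0])   # k2 = 4 bins
k1 = 3

# Shift feature 2's bins by k1 so the two value ranges never collide in the bundle
bundle = np.where(x1 > 0, x1, np.where(x2 > 0, x2 + k1, 0))
print(bundle)  # [2 6 4 1 0]: values 1-3 come from x1, 4-7 from x2
```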

XGBoost vs LightGBM Comparison

| Dimension | XGBoost | LightGBM |
|-----------|---------|----------|
| Split strategy | Level-wise | Leaf-wise |
| Base algorithm | Pre-sorting | Histogram |
| Sampling | Row/column sampling | GOSS + EFB |
| Memory | High (stores sorted data) | Low (histograms) |
| Speed | Medium | Fast |
| Accuracy | High | High |
| Overfitting risk | Low | High (needs tuning) |
| Sparse features | Sparsity-aware | EFB optimized |
| Distributed | Supported | Supported (more efficient) |

Complete Implementation (Simplified)

import numpy as np

class XGBoostTree:
    """XGBoost single tree (simplified)"""
    def __init__(self, max_depth=6, min_child_weight=1, gamma=0, lambda_=1):
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.lambda_ = lambda_
        self.tree = None

    def fit(self, X, g, h):
        """
        Parameters:
            X: Feature matrix
            g: First-order gradients
            h: Second-order gradients
        """
        self.tree = self._build_tree(X, g, h, depth=0)
        return self

    def _calc_gain(self, G_L, H_L, G_R, H_R, G, H):
        """Calculate split gain"""
        gain = 0.5 * (
            G_L ** 2 / (H_L + self.lambda_) +
            G_R ** 2 / (H_R + self.lambda_) -
            G ** 2 / (H + self.lambda_)
        ) - self.gamma
        return gain

    def _find_best_split(self, X, g, h):
        """Find optimal split"""
        N, d = X.shape
        best_gain = 0
        best_split = None

        G = np.sum(g)
        H = np.sum(h)

        for j in range(d):
            # Sort by feature j
            sorted_idx = np.argsort(X[:, j])
            X_sorted = X[sorted_idx, j]
            g_sorted = g[sorted_idx]
            h_sorted = h[sorted_idx]

            G_L, H_L = 0, 0
            G_R, H_R = G, H

            for i in range(N - 1):
                G_L += g_sorted[i]
                H_L += h_sorted[i]
                G_R -= g_sorted[i]
                H_R -= h_sorted[i]

                # Skip identical values
                if X_sorted[i] == X_sorted[i + 1]:
                    continue

                # Check minimum Hessian constraint
                if H_L < self.min_child_weight or H_R < self.min_child_weight:
                    continue

                gain = self._calc_gain(G_L, H_L, G_R, H_R, G, H)

                if gain > best_gain:
                    best_gain = gain
                    threshold = (X_sorted[i] + X_sorted[i + 1]) / 2
                    best_split = (j, threshold, G_L, H_L, G_R, H_R)

        return best_split, best_gain

    def _build_tree(self, X, g, h, depth):
        """Recursively build tree"""
        G = np.sum(g)
        H = np.sum(h)

        # Leaf weight: w* = -G / (H + lambda)
        weight = -G / (H + self.lambda_)

        # Stopping conditions
        if depth >= self.max_depth or len(X) < 2:
            return {'weight': weight}

        # Find best split
        split, gain = self._find_best_split(X, g, h)

        if split is None or gain <= 0:
            return {'weight': weight}

        j, threshold, G_L, H_L, G_R, H_R = split

        # Split samples
        left_mask = X[:, j] <= threshold
        right_mask = ~left_mask

        return {
            'feature': j,
            'threshold': threshold,
            'left': self._build_tree(X[left_mask], g[left_mask], h[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], g[right_mask], h[right_mask], depth + 1)
        }

    def predict(self, X):
        """Predict"""
        return np.array([self._predict_tree(x, self.tree) for x in X])

    def _predict_tree(self, x, node):
        """Single-sample prediction"""
        if 'weight' in node:
            return node['weight']

        if x[node['feature']] <= node['threshold']:
            return self._predict_tree(x, node['left'])
        else:
            return self._predict_tree(x, node['right'])

class XGBoost:
    """XGBoost classifier/regressor"""
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=6,
                 min_child_weight=1, gamma=0, lambda_=1, objective='reg:squarederror'):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.lambda_ = lambda_
        self.objective = objective
        self.trees = []
        self.base_score = None

    def _gradient_hessian(self, y, pred):
        """Compute gradient and Hessian"""
        if self.objective == 'reg:squarederror':
            g = pred - y
            h = np.ones_like(y)
        elif self.objective == 'binary:logistic':
            p = 1 / (1 + np.exp(-pred))
            g = p - y
            h = p * (1 - p)
        else:
            raise ValueError(f"Unknown objective: {self.objective}")
        return g, h

    def fit(self, X, y):
        """Train"""
        # Initialize base score
        if self.objective == 'reg:squarederror':
            self.base_score = np.mean(y)
        elif self.objective == 'binary:logistic':
            self.base_score = np.log(np.mean(y) / (1 - np.mean(y)))

        pred = np.full(len(y), self.base_score)

        # Iteratively train trees
        for t in range(self.n_estimators):
            g, h = self._gradient_hessian(y, pred)

            tree = XGBoostTree(
                max_depth=self.max_depth,
                min_child_weight=self.min_child_weight,
                gamma=self.gamma,
                lambda_=self.lambda_
            )
            tree.fit(X, g, h)
            self.trees.append(tree)

            # Update predictions
            pred += self.learning_rate * tree.predict(X)

        return self

    def predict(self, X):
        """Predict"""
        pred = np.full(len(X), self.base_score)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)

        if self.objective == 'binary:logistic':
            return (1 / (1 + np.exp(-pred)) > 0.5).astype(int)
        else:
            return pred

    def score(self, X, y):
        """Accuracy (classification) or R^2 (regression)"""
        pred = self.predict(X)
        if self.objective == 'binary:logistic':
            return np.mean(pred == y)
        else:
            return 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

Q&A Highlights

Q1: Why does XGBoost use second-order information?

A: The second-order Taylor expansion provides curvature information, much as Newton's method improves on gradient descent:
- More accurate: the quadratic approximation is closer to the true loss
- Faster convergence: the Hessian gives an adaptive, per-sample step size
- Unified framework: different loss functions (squared, logistic, Huber) are handled uniformly through $(g_i, h_i)$


Q2: Why is Leaf-wise faster than Level-wise?

A: Leaf-wise splits only one node per round, less computation. Level-wise splits all same-level nodes, may include low-gain nodes (wasted computation).

But: Leaf-wise prone to overfitting, needs max_depth and min_data_in_leaf constraints.


Q3: Why does GOSS work?

A: Large-gradient samples contribute more to loss, keeping them ensures accuracy; small-gradient samples also contain information, through sampling+reweighting for approximation. Theoretically, GOSS variance increase is bounded, while computation significantly reduced.


Q4: How to choose hyperparameters?

A: Rules of thumb:
- learning_rate: 0.01-0.3 (smaller, paired with a larger n_estimators, is usually better)
- max_depth: 3-10 (LightGBM can go deeper)
- min_child_weight: 1-10 (prevents overfitting)
- gamma: 0-5 (pruning threshold)
- lambda: 0-10 (L2 regularization)

Tuning strategy: First coarse-tune learning_rate and n_estimators, then fine-tune tree structure parameters.


Q5: Can XGBoost/LightGBM handle categorical features?

A:
- XGBoost: needs manual encoding (one-hot, label encoding)
- LightGBM: native categorical feature support (optimal split algorithm, no encoding needed)

LightGBM's categorical handling is more efficient (avoids one-hot dimension explosion).


Q6: How to prevent overfitting?

A:
1. Tree complexity: decrease max_depth, increase min_child_weight
2. Regularization: increase gamma and lambda
3. Learning rate: decrease learning_rate (with increased n_estimators)
4. Sampling: row subsampling (subsample < 1) and column sampling (colsample_bytree < 1)
5. Early stopping: monitor a validation set and set early_stopping_rounds


Q7: XGBoost/LightGBM vs Deep Learning?

A: | Dimension | GBDT | Deep Learning | |-----------|------|---------------| | Tabular data | Excellent | Average | | Image/Text | Poor | Excellent | | Feature engineering | Needed | Automatic | | Interpretability | High | Low | | Training speed | Fast | Slow | | Tuning difficulty | Medium | High |

Tabular data (common in Kaggle): GBDT usually better; unstructured data: deep learning dominates.


Q8: How does XGBoost handle missing values?

A: Learn an optimal default direction:
1. Try assigning the missing samples to the left and right subtrees respectively
2. Select the direction with maximum gain as the default
3. During prediction, missing values automatically follow the default direction

This is equivalent to automatically learning optimal missing value imputation strategy.


References

  1. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD, 785-794.
  2. Ke, G., Meng, Q., Finley, T., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. NeurIPS, 3146-3154.
  3. Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). CatBoost: Unbiased boosting with categorical features. NeurIPS, 6638-6648.
  4. Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337-407.

XGBoost and LightGBM represent the engineering pinnacle of gradient boosting algorithms — from mathematical optimization (second-order information, regularization) to algorithmic innovation (histograms, GOSS, EFB), from system optimization (parallelization, caching) to practical techniques (hyperparameter tuning, interpretability). Their industrial success proves: solid mathematical theory combined with sophisticated engineering implementation creates both efficient and powerful machine learning systems.

✏️ Exercises and Solutions

Exercise 1: XGBoost Second-Order Approximation

Problem: For regression with squared loss $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$, compute the first-order gradient $g_i$ and the second-order gradient $h_i$. If the current prediction is smaller than the true value, what is the sign of $g_i$?

Solution:

First-order gradient:

$$g_i = \frac{\partial l}{\partial \hat{y}_i} = \hat{y}_i - y_i$$

Second-order gradient:

$$h_i = \frac{\partial^2 l}{\partial \hat{y}_i^2} = 1$$

When $\hat{y}_i < y_i$, we have $g_i < 0$: a negative gradient means the prediction is too small and should be increased.


Exercise 2: Split Gain Calculation

Problem: A leaf node has 10 samples with gradient sums $G$ and $H$. After a split, the left child (6 samples) has statistics $(G_L, H_L)$ and the right child (4 samples) has $(G_R, H_R)$, with $G = G_L + G_R$ and $H = H_L + H_R$. With regularization parameters $\lambda$ and $\gamma$, write the split gain.

Solution:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma$$

A positive gain means the split is worthwhile. A larger $\gamma$ reduces the gain (a pruning effect).


Exercise 3: GOSS Sampling

Problem: LightGBM uses GOSS, keeping all large-gradient samples (top 20%) and randomly sampling small-gradient samples (10% of remaining 80%). For dataset with 1000 samples, compute: (1) retained samples; (2) weight factor for small-gradient samples.

Solution:

  1. Large-gradient: $1000 \times 20\% = 200$; small-gradient: $(1000 - 200) \times 10\% = 80$; total: $200 + 80 = 280$ (28% of the data)

  2. The small-gradient proportion drops from 80% to 8%, so the sampled small-gradient samples need an amplification factor of $80\% / 8\% = 10$ to keep the gradient statistics unbiased.


Exercise 4: Leaf-wise vs Level-wise

Problem: A dataset is used to train a decision tree. Level-wise growth reaches depth 3 (15 nodes); Leaf-wise growth also produces 15 nodes but reaches a maximum depth of 5. Training errors are equal. Which generalizes better?

Solution:

Level-wise (XGBoost default): Balanced structure, depth 3, better regularization

Leaf-wise (LightGBM default): Unbalanced, depth 5, may overfit on small data

Conclusion: On small data, Level-wise is more robust. On large data, Leaf-wise is more efficient. LightGBM's Leaf-wise needs max_depth and min_data_in_leaf limits.


Exercise 5: XGBoost vs LightGBM Selection

Problem: Choose XGBoost or LightGBM for: (1) 1M samples, 1000 features; (2) 10K samples, 50 features; (3) recommender system with many categorical features.

Solution:

  1. LightGBM: Histogram algorithm excels on large scale, GOSS further accelerates

  2. XGBoost: Small data, Level-wise more robust, less overfitting

  3. LightGBM/CatBoost: Native categorical feature support, no one-hot explosion

Summary: Large data→LightGBM, Small data→XGBoost, Categorical features→LightGBM/CatBoost.

  • Post title: Machine Learning Mathematical Derivations (12): XGBoost and LightGBM
  • Post author: Chen Kai
  • Create time: 2021-10-30 15:15:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-12-XGBoost-and-LightGBM/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.