XGBoost and LightGBM are the most popular gradient boosting frameworks in industry. Building on GBDT with sophisticated mathematical optimizations and engineering innovations, they strike an excellent balance between accuracy and efficiency. From XGBoost's second-order optimization to LightGBM's histogram acceleration, from regularization penalties to gradient-based one-side sampling, these techniques embody deep mathematical insight. This chapter systematically derives the mathematical principles, algorithmic details, and practical techniques of XGBoost and LightGBM.
XGBoost: Extreme Gradient Boosting
Regularized Objective Function
Standard GBDT minimizes the loss:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

XGBoost objective: Add tree complexity regularization:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where:
- $l$ is a differentiable convex loss function
- $f_k$ is the $k$-th tree
- $\Omega$ penalizes tree complexity

Tree regularization:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where:
- $T$ is the number of leaves
- $w_j$ is the weight of leaf $j$
- $\gamma$ penalizes the number of leaves and $\lambda$ is the L2 coefficient on leaf weights
Second-Order Taylor Expansion
Taylor expand the loss function at $\hat{y}_i^{(t-1)}$ to second order:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where:
- $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ is the first-order gradient
- $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ is the second-order gradient (Hessian)

Simplified objective function (dropping the constant term $l(y_i, \hat{y}_i^{(t-1)})$):

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
Objective Function Under Tree Structure
Tree representation: Define the tree as $f_t(x) = w_{q(x)}$,

where $q: \mathbb{R}^d \to \{1, \dots, T\}$ maps a sample to a leaf index and $w \in \mathbb{R}^T$ is the vector of leaf weights.

Reorganize objective: Group samples by leaf node. Let $I_j = \{ i \mid q(x_i) = j \}$ be the set of samples falling into leaf $j$:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$.

Optimal leaf weight: Differentiate with respect to $w_j$ and set to zero:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

Substitute optimal weights:

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

This is the optimal loss given the tree structure $q$.
Split Gain Computation
Problem: How do we choose the optimal split point?

For a leaf node with sample set $I$, a candidate split divides it into $I_L$ and $I_R$. Loss before the split (leaf term only):

$$-\frac{1}{2} \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$

Loss after split:

$$-\frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right] + 2\gamma$$

Split Gain (loss before minus loss after):

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Split criterion:
- If $\text{Gain} > 0$, the split reduces the objective; otherwise the node is left unsplit.
- $\gamma$ acts as a minimum-gain (pruning) threshold.
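The gain formula can be checked numerically with a small NumPy helper (a sketch; `lam` and `gamma` correspond to $\lambda$ and $\gamma$, and the function names are illustrative):

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost split gain: children's structure scores minus the parent's."""
    def score(G, H):
        return G * G / (H + lam)
    GL, HL = np.sum(g_left), np.sum(h_left)
    GR, HR = np.sum(g_right), np.sum(h_right)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# With lam=0, gamma=0 and gradients perfectly separated by sign,
# the parent score is 0 and the gain is strictly positive.
g = np.array([-1.0, -1.0, 1.0, 1.0])
h = np.ones(4)
gain = split_gain(g[:2], h[:2], g[2:], h[2:], lam=0.0, gamma=0.0)  # 2.0
```

Setting `gamma > 0` directly implements the pruning criterion: any split whose raw gain falls below $\gamma$ comes out negative.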
Gradients for Common Loss Functions
Squared Loss (Regression)

$$l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2, \quad g_i = \hat{y}_i - y_i, \quad h_i = 1$$

Logistic Loss (Binary Classification)

Let $p_i = \sigma(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}$. Then

$$l(y_i, \hat{y}_i) = -\left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right], \quad g_i = p_i - y_i, \quad h_i = p_i (1 - p_i)$$

Softmax Loss (Multi-class)

For class $k$:

$$g_{ik} = p_{ik} - \mathbb{1}[y_i = k], \quad h_{ik} = p_{ik}(1 - p_{ik})$$

where $p_{ik} = \frac{e^{\hat{y}_{ik}}}{\sum_{c} e^{\hat{y}_{ic}}}$ is the softmax probability of sample $i$ for class $k$.
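As a quick sketch, the logistic-loss statistics above are two vectorized lines (labels in $\{0, 1\}$, `y_hat` the raw margin before the sigmoid):

```python
import numpy as np

def logistic_grad_hess(y, y_hat):
    """g_i = p_i - y_i, h_i = p_i (1 - p_i) for logistic loss on raw scores."""
    p = 1.0 / (1.0 + np.exp(-y_hat))  # sigmoid of the current margin
    return p - y, p * (1.0 - p)

# At y_hat = 0 the model is maximally uncertain: p = 0.5,
# so g = [0.5, -0.5] for labels [0, 1] and h = [0.25, 0.25].
g, h = logistic_grad_hess(np.array([0.0, 1.0]), np.array([0.0, 0.0]))
```

Note that $h_i = p_i(1-p_i)$ peaks at $p_i = 0.5$: uncertain samples get the largest Hessian weight.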
Splitting Algorithms
Exact Greedy Algorithm
Input: sample set $I$ of the current node, feature dimension $d$.

1. Enumerate features: For each feature $k = 1, \dots, d$:
   - Sort the samples in $I$ by feature $k$.
   - Enumerate thresholds: For each possible split point, incrementally accumulate $G_L \leftarrow G_L + g_i$, $H_L \leftarrow H_L + h_i$, and set $G_R = G - G_L$, $H_R = H - H_L$.
   - Compute the gain; if it exceeds the current best, update the best split.
2. Return: the optimal split (feature, threshold).

Complexity: $O(d \cdot n \log n)$ per node (sorting dominates), where $n$ is the number of samples (candidate split points) and $d$ is the feature count.
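The single-feature inner loop can be sketched as follows (illustrative names; sorting once, then a linear sweep that accumulates the left-child statistics):

```python
import numpy as np

def best_split_1d(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split finding on one feature:
    sort once, then sweep thresholds accumulating (G_L, H_L)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    best_gain, best_thr = 0.0, None
    GL = HL = 0.0
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]
        if x[i] == x[i + 1]:          # cannot split between equal feature values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr
```

Running this over all $d$ features and keeping the overall best split gives the exact greedy algorithm.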
Approximate Algorithm (Quantile Sketch)
Exact algorithm infeasible for big data. Approximate algorithm: Use quantiles instead of all possible thresholds.
Weighted quantile: Use the second-order gradient $h_i$ as the sample weight. Rewriting the objective as

$$\sum_{i=1}^{n} \frac{1}{2} h_i \left( f_t(x_i) - \left( -\frac{g_i}{h_i} \right) \right)^2 + \Omega(f_t) + \text{const}$$

shows it is a weighted squared loss with labels $-g_i / h_i$ and weights $h_i$.

Find candidate split points $\{s_{k1}, \dots, s_{kl}\}$ such that adjacent candidates differ by at most $\varepsilon$ in $h$-weighted rank, giving roughly $1/\varepsilon$ candidates per feature.
Sparsity-Aware Algorithm
Problem: Real data often has missing values or sparse features (like one-hot encoding).
Default direction: Learn default split direction (left or right) for each node.
Algorithm:
1. Try assigning the non-missing samples to the left and right children as usual.
2. Assign all missing samples to one default direction, trying left and right in turn.
3. Select the split and default direction with maximum gain.

Implementation: Only traverse non-missing values, so the cost depends on the number of non-missing entries rather than the total sample count.
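A minimal sketch of step 2, assuming the non-missing statistics $(G_L, H_L, G_R, H_R)$ and the aggregated missing-value statistics are already computed (function and argument names are illustrative):

```python
def _score(GL, HL, GR, HR, lam):
    """Gain relative to the unsplit parent node."""
    G, H = GL + GR, HL + HR
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam))

def choose_default_direction(GL, HL, GR, HR, G_miss, H_miss, lam=1.0):
    """Try sending the aggregated missing-value statistics left or right;
    keep whichever direction yields the larger gain."""
    gain_left = _score(GL + G_miss, HL + H_miss, GR, HR, lam)
    gain_right = _score(GL, HL, GR + G_miss, HR + H_miss, lam)
    if gain_left >= gain_right:
        return 'left', gain_left
    return 'right', gain_right
```

Since the missing samples enter only as two scalars $(G_{\text{miss}}, H_{\text{miss}})$, trying both directions costs $O(1)$ per candidate split.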
LightGBM: Efficient Gradient Boosting
Histogram Algorithm
Core idea: Discretize continuous features into $k$ bins (e.g., $k = 255$) and evaluate split gains over bin boundaries instead of raw values.

(Figure omitted: illustration of histogram binning.)

Advantages:
- Memory: Store bin indices and per-bin histograms instead of pre-sorted raw values; a bin index fits in 8 bits versus a 32-bit float.
- Computation: Split finding costs $O(k)$ per feature instead of $O(n)$.
Algorithm:
1. Feature discretization: Map each feature value to a bin index $b \in \{1, \dots, k\}$.
2. Build histogram: For each bin $b$, accumulate gradients and Hessians: $G_b = \sum_{i:\, \text{bin}(x_i) = b} g_i$, $H_b = \sum_{i:\, \text{bin}(x_i) = b} h_i$.
3. Traverse bins: Compute the split gain at each bin boundary ($O(k)$ instead of $O(n)$ per feature).

Histogram subtraction: For sibling nodes, only build the histogram of the smaller child and obtain the other by subtracting from the parent:

$$\text{Hist}_{\text{sibling}} = \text{Hist}_{\text{parent}} - \text{Hist}_{\text{child}}$$

This reduces histogram construction cost by at least half.
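Histogram accumulation and sibling subtraction can be sketched with `np.bincount` (the bin indices and the left-child mask are illustrative stand-ins for a real tree's state):

```python
import numpy as np

def build_histogram(bins, g, h, n_bins):
    """Accumulate (G_b, H_b) per bin; `bins` are precomputed integer bin indices."""
    G = np.bincount(bins, weights=g, minlength=n_bins)
    H = np.bincount(bins, weights=h, minlength=n_bins)
    return G, H

rng = np.random.default_rng(0)
bins = rng.integers(0, 8, size=100)          # 8 bins for one feature
g, h = rng.normal(size=100), np.ones(100)
left = bins < 4                              # pretend mask: which samples went left

G_parent, H_parent = build_histogram(bins, g, h, 8)
G_left, H_left = build_histogram(bins[left], g[left], h[left], 8)
# Sibling via subtraction: no second pass over the right child's data.
G_right, H_right = G_parent - G_left, H_parent - H_left
```

In a real implementation the histogram is built only for the child with fewer samples, which is what makes the subtraction trick pay off.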
Leaf-wise Growth Strategy
Level-wise (XGBoost default): In each round, split all leaves at the same depth.

Leaf-wise (LightGBM): In each round, split only the single leaf with maximum gain.
Advantages:
- Higher accuracy (prioritizes the most valuable nodes)
- Faster convergence (no computation wasted on low-gain nodes)

Risks:
- Prone to overfitting (deeper, more unbalanced trees)
- Needs a max_depth constraint
GOSS: Gradient-based One-Side Sampling
Motivation: Different samples contribute differently to gradients; large-gradient samples are more important.
Algorithm:
1. Sort: Rank samples by absolute gradient $|g_i|$ in descending order.
2. Keep the top $a \times 100\%$ large-gradient samples (set $A$).
3. Randomly sample $b \times 100\%$ of all samples from the remainder (set $B$).
4. When accumulating statistics, multiply the gradients of samples in $B$ by $\frac{1-a}{b}$ to keep the estimate unbiased.

Split gain estimation:

$$\tilde{V}_j(d) = \frac{1}{n} \left[ \frac{\left( \sum_{i \in A_l} g_i + \frac{1-a}{b} \sum_{i \in B_l} g_i \right)^2}{n_l^j(d)} + \frac{\left( \sum_{i \in A_r} g_i + \frac{1-a}{b} \sum_{i \in B_r} g_i \right)^2}{n_r^j(d)} \right]$$

where $A_l, B_l$ (resp. $A_r, B_r$) are the retained and sampled instances falling into the left (resp. right) child of split point $d$ on feature $j$.
Theoretical guarantee: GOSS estimation error is bounded (proof in LightGBM paper).
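The sampling step can be sketched as follows, using the paper's convention that $a$ and $b$ are both fractions of the full dataset (names are illustrative):

```python
import numpy as np

def goss_sample(g, a=0.2, b=0.1, rng=None):
    """Keep the top a-fraction of samples by |g|; uniformly sample a
    b-fraction (of the whole dataset) from the rest, reweighting the
    sampled small-gradient part by (1 - a) / b."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(g)
    order = np.argsort(-np.abs(g))       # descending |gradient|
    top_k = int(a * n)
    large = order[:top_k]                # always kept
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([large, sampled])
    weights = np.ones(len(idx))
    weights[top_k:] = (1.0 - a) / b      # amplify small-gradient samples
    return idx, weights
```

With $a = 0.2$, $b = 0.1$ only 30% of the data participates in split finding, while the reweighting keeps the gradient sums approximately unbiased.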
EFB: Exclusive Feature Bundling
Motivation: Sparse features are often mutually exclusive (e.g., one-hot encoded columns), so they can be bundled into a single feature.
Problem formulation: Partition the features into the fewest bundles such that features in the same bundle rarely conflict (i.e., are rarely non-zero simultaneously).

This is equivalent to graph coloring (NP-hard).

Approximate algorithm:
1. Construct conflict graph: nodes are features; edge weights count how often two features are non-zero simultaneously.
2. Greedy coloring: In order of decreasing total conflict, assign each feature to the bundle with which it conflicts least (or open a new bundle if every conflict exceeds a tolerance).

Feature bundling: Merge the features in a bundle into one feature, avoiding collisions via bin offsets:

where, if feature $A$ uses bins $[0, b_A)$ and feature $B$ uses bins $[0, b_B)$, $B$'s bins are shifted by $b_A$ so the bundled feature ranges over $[0, b_A + b_B)$ and the original values remain distinguishable.
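The offset trick for two mutually exclusive binned features can be sketched in a few lines (a toy illustration; real EFB handles near-exclusive features and whole bundles):

```python
import numpy as np

def bundle_two(f1, f2, n_bins1):
    """Merge two (nearly) mutually exclusive binned features into one.
    Feature 1 keeps its bin values; feature 2's nonzero bins are shifted
    by n_bins1 so the two ranges never collide (0 = both features zero)."""
    bundled = np.where(f1 > 0, f1, 0)
    bundled = np.where(f2 > 0, f2 + n_bins1, bundled)
    return bundled

# Two one-hot-style columns that are never nonzero together:
f1 = np.array([1, 0, 0])
f2 = np.array([0, 2, 0])
merged = bundle_two(f1, f2, n_bins1=3)   # -> [1, 5, 0]
```

Histograms built on the bundled feature recover exactly the per-feature bin statistics, since each original bin maps to a unique offset bin.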
XGBoost vs LightGBM Comparison
| Dimension | XGBoost | LightGBM |
|---|---|---|
| Split strategy | Level-wise | Leaf-wise |
| Base algorithm | Pre-sorting | Histogram |
| Sampling | Row/column sampling | GOSS + EFB |
| Memory | High (store sorted data) | Low (histograms) |
| Speed | Medium | Fast |
| Accuracy | High | High |
| Overfitting risk | Low | High (needs tuning) |
| Sparse features | Sparsity-aware | EFB optimized |
| Distributed | Supported | Supported (more efficient) |
Complete Implementation (Simplified)
```python
import numpy as np
```
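A minimal, self-contained sketch in this spirit: second-order boosting on depth-1 trees (stumps) with squared loss. All class and parameter names are illustrative, not XGBoost's actual API.

```python
import numpy as np

class Stump:
    """Depth-1 regression tree fit to (g, h) with XGBoost leaf weights."""
    def fit(self, X, g, h, lam=1.0):
        best = (0.0, None, None)               # (gain, feature, threshold)
        G, H = g.sum(), h.sum()
        for j in range(X.shape[1]):
            order = np.argsort(X[:, j])
            xs, gs, hs = X[order, j], g[order], h[order]
            GL = HL = 0.0
            for i in range(len(xs) - 1):
                GL += gs[i]; HL += hs[i]
                if xs[i] == xs[i + 1]:
                    continue
                GR, HR = G - GL, H - HL
                gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                              - G**2 / (H + lam))
                if gain > best[0]:
                    best = (gain, j, 0.5 * (xs[i] + xs[i + 1]))
        self.gain, self.feature, self.threshold = best
        if self.feature is None:               # no useful split: single leaf
            self.w_left = self.w_right = -G / (H + lam)
        else:
            left = X[:, self.feature] < self.threshold
            self.w_left = -g[left].sum() / (h[left].sum() + lam)
            self.w_right = -g[~left].sum() / (h[~left].sum() + lam)
        return self

    def predict(self, X):
        if self.feature is None:
            return np.full(len(X), self.w_left)
        left = X[:, self.feature] < self.threshold
        return np.where(left, self.w_left, self.w_right)

class SimpleXGB:
    """Gradient boosting driven by second-order (g, h) statistics, squared loss."""
    def __init__(self, n_estimators=50, learning_rate=0.3, lam=1.0):
        self.n_estimators, self.lr, self.lam = n_estimators, learning_rate, lam

    def fit(self, X, y):
        self.trees, pred = [], np.zeros(len(y))
        for _ in range(self.n_estimators):
            g, h = pred - y, np.ones(len(y))   # squared loss: g = y_hat - y, h = 1
            tree = Stump().fit(X, g, h, self.lam)
            self.trees.append(tree)
            pred += self.lr * tree.predict(X)
        return self

    def predict(self, X):
        return self.lr * sum(t.predict(X) for t in self.trees)
```

This omits everything that makes real XGBoost fast (deeper trees, approximate splits, shrinkage schedules, sparsity handling), but the leaf weights $-G/(H+\lambda)$ and the gain formula are the ones derived above.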
Q&A Highlights
Q1: Why does XGBoost use second-order information?
A: Second-order Taylor expansion provides curvature information, analogous to Newton's method vs. gradient descent:
- More accurate: the quadratic approximation is closer to the true loss
- Faster convergence: the Hessian adapts the step size per sample
- Unified framework: different loss functions (squared, logistic, Huber) are handled uniformly through $(g_i, h_i)$
Q2: Why is Leaf-wise faster than Level-wise?
A: Leaf-wise splits only one node per round, less computation. Level-wise splits all same-level nodes, may include low-gain nodes (wasted computation).
But: Leaf-wise prone to overfitting, needs max_depth and min_data_in_leaf constraints.
Q3: Why does GOSS work?
A: Large-gradient samples contribute more to the loss, so keeping all of them preserves accuracy; small-gradient samples still carry information, which is approximated through sampling plus reweighting. Theoretically, the variance increase from GOSS is bounded, while computation is significantly reduced.
Q4: How to choose hyperparameters?
A: Rules of thumb:
- learning_rate: 0.01-0.3 (smaller, with larger n_estimators, is generally better)
- max_depth: 3-10 (LightGBM can be deeper)
- min_child_weight: 1-10 (prevents overfitting)
- gamma: 0-5 (pruning threshold)
- lambda: 0-10 (L2 regularization)
Tuning strategy: First coarse-tune learning_rate and n_estimators, then fine-tune tree structure parameters.
Q5: Can XGBoost/LightGBM handle categorical features?
A:
- XGBoost: needs manual encoding (one-hot, label encoding)
- LightGBM: native categorical feature support (optimal split algorithm, no encoding needed)
LightGBM's categorical handling is more efficient (avoids one-hot dimension explosion).
Q6: How to prevent overfitting?
A:
1. Tree complexity: decrease max_depth, increase min_child_weight
2. Regularization: increase gamma, lambda
3. Learning rate: decrease learning_rate (with increased n_estimators)
4. Sampling: row subsampling (subsample < 1), column sampling (colsample_bytree < 1)
5. Early stopping: monitor a validation set, set early_stopping_rounds
Q7: XGBoost/LightGBM vs Deep Learning?
A:

| Dimension | GBDT | Deep Learning |
|---|---|---|
| Tabular data | Excellent | Average |
| Image/Text | Poor | Excellent |
| Feature engineering | Needed | Automatic |
| Interpretability | High | Low |
| Training speed | Fast | Slow |
| Tuning difficulty | Medium | High |
Tabular data (common in Kaggle): GBDT usually better; unstructured data: deep learning dominates.
Q8: How does XGBoost handle missing values?
A: Learn the optimal default direction:
1. Try assigning missing samples to the left and right subtrees respectively
2. Select the direction with maximum gain as the default
3. During prediction, missing values automatically follow the default direction
This is equivalent to automatically learning optimal missing value imputation strategy.
References
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD, 785-794.
- Ke, G., Meng, Q., Finley, T., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. NeurIPS, 3146-3154.
- Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). CatBoost: Unbiased boosting with categorical features. NeurIPS, 6638-6648.
- Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337-407.
XGBoost and LightGBM represent the engineering pinnacle of gradient boosting algorithms — from mathematical optimization (second-order information, regularization) to algorithmic innovation (histograms, GOSS, EFB), from system optimization (parallelization, caching) to practical techniques (hyperparameter tuning, interpretability). Their industrial success proves: solid mathematical theory combined with sophisticated engineering implementation creates both efficient and powerful machine learning systems.
✏️ Exercises and Solutions
Exercise 1: XGBoost Second-Order Approximation
Problem: For regression with squared loss $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$, derive the first- and second-order gradients $g_i$ and $h_i$.

Solution:

First-order gradient: $g_i = \frac{\partial l}{\partial \hat{y}_i} = \hat{y}_i - y_i$

Second-order gradient: $h_i = \frac{\partial^2 l}{\partial \hat{y}_i^2} = 1$

A negative gradient ($\hat{y}_i < y_i$) means the prediction is too small and should be increased.
Exercise 2: Split Gain Calculation
Problem: A leaf node has 10 samples,
Solution:
Positive gain means the split is worthwhile; the larger the gain, the more the split reduces the objective.
Exercise 3: GOSS Sampling
Problem: LightGBM uses GOSS, keeping all large-gradient samples (top 20%) and randomly sampling small-gradient samples (10% of remaining 80%). For dataset with 1000 samples, compute: (1) retained samples; (2) weight factor for small-gradient samples.
Solution:
Large-gradient: $1000 \times 20\% = 200$ samples; Small-gradient: $(1000 - 200) \times 10\% = 80$ samples; Total: $280$ (28% of the data).

The small-gradient proportion of the full dataset drops from 80% to 8%, so the sampled small-gradient statistics need an amplification factor of $\frac{800}{80} = 10$ to keep the gradient statistics unbiased.
Exercise 4: Leaf-wise vs Level-wise
Problem: A dataset trains decision tree. Level-wise grows to depth 3 (15 nodes), Leaf-wise grows 15 nodes but max depth 5. Training errors are equal. Which generalizes better?
Solution:
Level-wise (XGBoost default): Balanced structure, depth 3, better regularization
Leaf-wise (LightGBM default): Unbalanced, depth 5, may overfit on small data
Conclusion: On small data, Level-wise is more robust; on large data, Leaf-wise is more efficient. LightGBM's Leaf-wise therefore needs max_depth and min_data_in_leaf limits.
Exercise 5: XGBoost vs LightGBM Selection
Problem: Choose XGBoost or LightGBM for: (1) 1M samples, 1000 features; (2) 10K samples, 50 features; (3) recommender system with many categorical features.
Solution:
(1) LightGBM: the histogram algorithm excels at large scale, and GOSS accelerates it further.

(2) XGBoost: on small data, Level-wise is more robust and less prone to overfitting.

(3) LightGBM/CatBoost: native categorical feature support, no one-hot explosion.
Summary: Large data→LightGBM, Small data→XGBoost, Categorical features→LightGBM/CatBoost.
- Post title: Machine Learning Mathematical Derivations (12): XGBoost and LightGBM
- Post author: Chen Kai
- Create time: 2021-10-30 15:15:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-12-XGBoost-and-LightGBM/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.