Machine Learning Mathematical Derivations (12): XGBoost and LightGBM
Chen Kai

XGBoost and LightGBM are the most popular gradient boosting frameworks in industry — building on GBDT with sophisticated mathematical optimizations and engineering innovations, achieving perfect balance between accuracy and efficiency. From XGBoost's second-order optimization to LightGBM's histogram acceleration, from regularization penalties to gradient-based one-side sampling, these techniques embody deep mathematical insights. This chapter systematically derives the mathematical principles, algorithm details, and practical techniques of XGBoost and LightGBM.

XGBoost: Extreme Gradient Boosting

Regularized Objective Function

(Figure: XGBoost regularization)

Standard GBDT minimizes the loss

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

without an explicit penalty on tree complexity.

(Figure: XGBoost tree structure)

XGBoost objective: add a tree-complexity regularization term to the loss at round $t$:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\big) + \Omega(f_t)$$

where:
- $\hat{y}_i^{(t-1)}$ is the prediction from the first $t-1$ rounds
- $f_t$ is the $t$-th tree
- $\Omega(f_t)$ is the regularization term

Tree regularization:

$$\Omega(f) = \gamma T + \frac{\lambda}{2} \sum_{j=1}^{T} w_j^2$$

where:
- $T$ is the number of leaf nodes
- $w_j$ is the weight (prediction value) of leaf $j$
- $\gamma$ penalizes leaf count (like pruning)
- $\lambda$ penalizes leaf weights (L2 regularization)

Second-Order Taylor Expansion

Taylor-expand the loss function around $\hat{y}_i^{(t-1)}$:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)$$

where:
- $g_i = \partial_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is the first-order gradient
- $h_i = \partial^2_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is the second-order gradient (Hessian)

Simplified objective function (omitting the constant term $l(y_i, \hat{y}_i^{(t-1)})$):

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)$$
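To see how well the quadratic approximation tracks the true loss, here is a quick numeric check for the logistic loss (a sketch; the label, score, and step values are arbitrary illustrative choices):

```python
import numpy as np

def logistic_loss(y, F):
    # Log loss written in terms of the raw score F, for a label y in {0, 1}
    return np.log1p(np.exp(-F)) + (1 - y) * F

y, F0, delta = 1.0, 0.4, 0.3
p = 1 / (1 + np.exp(-F0))
g, h = p - y, p * (1 - p)              # first- and second-order gradients at F0

exact = logistic_loss(y, F0 + delta)
taylor = logistic_loss(y, F0) + g * delta + 0.5 * h * delta ** 2
print(exact, taylor)                   # nearly identical for this small step
```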

Objective Function Under Tree Structure

(Figure: tree regularization)

Tree representation: define tree $f_t$ as

$$f_t(\mathbf{x}) = w_{q(\mathbf{x})}$$

where $q: \mathbb{R}^d \to \{1, \dots, T\}$ maps a sample to a leaf node, and $w_j$ is the weight of leaf $j$.

Reorganize the objective: group samples by leaf node

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

where $I_j = \{ i \mid q(\mathbf{x}_i) = j \}$ is the sample set in leaf $j$.

Optimal leaf weight: differentiate with respect to $w_j$ and set to zero:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$.

Substitute the optimal weights:

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

This is the optimal loss given the tree structure $q$.
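A minimal numeric sketch of these closed forms (the per-sample gradients and Hessians below are made up for illustration):

```python
import numpy as np

# Hypothetical gradients/Hessians of the samples falling in one leaf
g = np.array([-0.5, -0.3, 0.2])
h = np.array([0.25, 0.21, 0.16])
lam = 1.0

G, H = g.sum(), h.sum()
w_star = -G / (H + lam)             # w_j* = -G_j / (H_j + lambda)
score = -0.5 * G ** 2 / (H + lam)   # this leaf's term in the optimal loss
print(w_star, score)
```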

Split Gain Computation

Problem: How to choose optimal split point?

For leaf node $j$, consider splitting on feature $k$ at threshold $\theta$:
- Left child: $I_L = \{ i \in I_j \mid x_i^{(k)} \leq \theta \}$
- Right child: $I_R = \{ i \in I_j \mid x_i^{(k)} > \theta \}$

Loss before split:

$$-\frac{1}{2} \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$

Loss after split:

$$-\frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right] + 2\gamma$$

Split gain:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Split criterion:
- If $\text{Gain} > 0$, perform the split
- Select the feature and threshold with maximum $\text{Gain}$
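The gain formula can be checked directly in code (a sketch with made-up child statistics):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                  - G ** 2 / (H + lam)) - gamma

gain = split_gain(-4.0, 3.0, 5.0, 4.0, lam=1.0, gamma=0.5)
print(gain)  # 3.9375
```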

Gradients for Common Loss Functions

Squared Loss (Regression)

$$l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2, \quad g_i = \hat{y}_i - y_i, \quad h_i = 1$$

Logistic Loss (Binary Classification)

Let $p_i = \sigma(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}$, then:

$$g_i = p_i - y_i, \quad h_i = p_i (1 - p_i)$$

Softmax Loss (Multi-class)

For class $c$:

$$g_{i,c} = p_{i,c} - \mathbb{1}[y_i = c], \quad h_{i,c} = p_{i,c}(1 - p_{i,c})$$

where $p_{i,c} = \frac{e^{\hat{y}_{i,c}}}{\sum_{c'} e^{\hat{y}_{i,c'}}}$.
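For the multi-class case, the per-class gradients can be sketched as follows (the raw scores and true class index are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # stabilized softmax
    return e / e.sum()

F = np.array([0.2, 1.0, -0.5])     # raw scores for 3 classes
y = 1                              # true class index

p = softmax(F)
g = p - np.eye(3)[y]               # g_c = p_c - 1[y = c]
h = p * (1 - p)                    # diagonal Hessian approximation
print(g, h)
```

Note that the gradients over classes sum to zero, since the probabilities sum to one.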

Splitting Algorithms

Exact Greedy Algorithm

Input: sample set $I$ at the current node, feature set $\{1, \dots, d\}$

Algorithm:
1. Initialize: $G = \sum_{i \in I} g_i$, $H = \sum_{i \in I} h_i$
2. Enumerate features: for each feature $k$:
   - Sort the samples in $I$ by feature $k$
   - Enumerate thresholds: for each possible split point $\theta$:
     - Accumulate $G_L \mathrel{+}= g_i$, $H_L \mathrel{+}= h_i$
     - Compute $G_R = G - G_L$, $H_R = H - H_L$
     - If $\text{Gain}$ exceeds the current best, update the best split
3. Return: the optimal split $(k, \theta)$

Complexity: $O(n d \log n)$, where $n$ is the number of samples (candidate split points) and $d$ is the feature count.

Approximate Algorithm (Quantile Sketch)

The exact algorithm is infeasible for big data. Approximate algorithm: use quantiles instead of all possible thresholds.

Weighted quantile: treat the second-order gradient $h_i$ as the sample weight. Define the rank function for feature $k$:

$$r_k(z) = \frac{\sum_{(x, h) \in \mathcal{D}_k,\, x < z} h}{\sum_{(x, h) \in \mathcal{D}_k} h}$$

Find candidate split points $\{\theta_{k1}, \dots, \theta_{kl}\}$ such that

$$|r_k(\theta_{k,j}) - r_k(\theta_{k,j+1})| < \epsilon$$

so each bucket carries roughly an $\epsilon$ fraction of the total Hessian mass.
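A rough sketch of the quantile idea, selecting candidates so consecutive thresholds are about $\epsilon$ apart in weighted rank (the function name and $\epsilon$ value are illustrative, not the paper's sketch data structure):

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.25):
    # Choose thresholds so consecutive candidates differ by ~eps in weighted rank
    order = np.argsort(x)
    x_s = x[order]
    cum = np.cumsum(h[order]) / h.sum()    # weighted rank r_k
    candidates, next_rank = [], eps
    for xi, r in zip(x_s, cum):
        if r >= next_rank:
            candidates.append(float(xi))
            next_rank += eps
    return candidates

cands = weighted_quantile_candidates(np.arange(8.0), np.ones(8), eps=0.25)
print(cands)  # uniform weights give evenly spaced candidates
```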

Sparsity-Aware Algorithm

Problem: Real data often has missing values or sparse features (like one-hot encoding).

Default direction: Learn default split direction (left or right) for each node.

Algorithm:
1. Try assigning the non-missing samples to the left and right children as usual
2. Assign the missing samples to a candidate default direction (left, then right)
3. Select the split and default direction with maximum gain

Implementation: only non-missing values are traversed, so the cost is $O(\|\mathbf{x}\|_0)$, proportional to the number of non-missing entries.
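The default-direction choice can be sketched by scoring both routings of the missing-value gradient mass (all statistics below are made up; the helper name is hypothetical):

```python
def best_default_direction(G_L, H_L, G_R, H_R, G_miss, H_miss, lam=1.0):
    # Score a candidate structure by the sum of G^2/(H + lambda) over both children
    def score(gl, hl, gr, hr):
        return gl ** 2 / (hl + lam) + gr ** 2 / (hr + lam)

    left = score(G_L + G_miss, H_L + H_miss, G_R, H_R)    # missing samples go left
    right = score(G_L, H_L, G_R + G_miss, H_R + H_miss)   # missing samples go right
    return ("left", left) if left >= right else ("right", right)

direction, s = best_default_direction(-3.0, 2.0, 1.0, 1.0, G_miss=-1.0, H_miss=0.5)
print(direction)  # 'left'
```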

LightGBM: Efficient Gradient Boosting

Histogram Algorithm

Core idea: discretize continuous features into $k$ bins, then use histograms to accumulate gradient statistics.

The following figure illustrates histogram binning:

(Figure: histogram binning)

Advantages:
- Memory: store histograms instead of raw sorted data, $O(k)$ per feature instead of $O(n)$
- Computation: traverse $k$ bins instead of $n$ samples, $O(k \times d)$ instead of $O(n \times d)$ per split search
- Communication: in distributed training, transfer histograms instead of data

Algorithm:
1. Feature discretization: map feature $x^{(j)}$ to bins $\{0, 1, \dots, k-1\}$
2. Build histogram: for each bin $b$, accumulate gradients and Hessians: $G_b = \sum_{i:\, \text{bin}(x_i^{(j)}) = b} g_i$, $H_b = \sum_{i:\, \text{bin}(x_i^{(j)}) = b} h_i$
3. Traverse bins: compute the split gain at each bin boundary ($O(k)$ instead of $O(n)$)

Histogram subtraction: for sibling nodes, compute only one child's histogram and obtain the other from the parent by subtraction:

$$\text{Hist}(\text{right}) = \text{Hist}(\text{parent}) - \text{Hist}(\text{left})$$

This roughly halves the histogram-building cost.
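A sketch of histogram construction and the subtraction trick (the bin count and random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 16
bins = rng.integers(0, k, size=n)      # feature values already discretized to k bins
g = rng.normal(size=n)                 # per-sample gradients

# Parent histogram: accumulated gradient per bin
hist_parent = np.bincount(bins, weights=g, minlength=k)

left = rng.random(n) < 0.5             # samples routed to the left child
hist_left = np.bincount(bins[left], weights=g[left], minlength=k)

# Sibling histogram via subtraction instead of a second pass over the data
hist_right = hist_parent - hist_left
direct = np.bincount(bins[~left], weights=g[~left], minlength=k)
print(np.allclose(hist_right, direct))  # True (up to float rounding)
```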

Leaf-wise Growth Strategy

Level-wise (XGBoost default): each round splits all leaf nodes at the same depth.

Leaf-wise (LightGBM): each round splits only the single leaf with maximum gain.

Advantages: - Higher accuracy (prioritize most valuable nodes) - Faster convergence (don't waste computation on low-gain nodes)

Risks: - Prone to overfitting (deeper, more unbalanced trees) - Needs max_depth constraint

GOSS: Gradient-based One-Side Sampling

Motivation: Different samples contribute differently to gradients; large-gradient samples are more important.

Algorithm:
1. Sort: rank samples by absolute gradient $|g_i|$ in descending order
2. Sample:
   - Keep the top $a \times 100\%$ large-gradient samples (e.g., $a = 0.2$)
   - Randomly sample $b \times 100\%$ of the full dataset from the remainder (e.g., $b = 0.1$)
3. Reweight: multiply the small-gradient samples' gradients by $\frac{1-a}{b}$ to compensate

Split gain estimation:

$$\tilde{V}_j(\theta) = \frac{1}{n} \left[ \frac{\left( \sum_{i \in A_L} g_i + \frac{1-a}{b} \sum_{i \in B_L} g_i \right)^2}{n_L^j(\theta)} + \frac{\left( \sum_{i \in A_R} g_i + \frac{1-a}{b} \sum_{i \in B_R} g_i \right)^2}{n_R^j(\theta)} \right]$$

where $A$ is the large-gradient sample set and $B$ is the sampled small-gradient set.

Theoretical guarantee: GOSS estimation error is bounded (proof in LightGBM paper).
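The sampling step can be sketched as follows (the helper name is hypothetical; $a = 0.2$, $b = 0.1$ follow the convention above, with $b$ a fraction of the full dataset):

```python
import numpy as np

def goss_sample(g, a=0.2, b=0.1, seed=0):
    # Keep the top a*n rows by |gradient|; sample b*n of the rest and reweight
    rng = np.random.default_rng(seed)
    n = len(g)
    order = np.argsort(-np.abs(g))
    top, rest = order[: int(a * n)], order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[len(top):] = (1 - a) / b   # compensate the discarded small-gradient mass
    return idx, weights

g = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(g)
print(len(idx), w.max())  # 300 rows kept; small-gradient rows reweighted by (1-a)/b = 8
```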

EFB: Exclusive Feature Bundling

Motivation: Sparse features are mutually exclusive (like one-hot encoding), can be bundled into single feature.

Problem formulation: partition the $d$ features into bundles, minimizing the bundle count while keeping the features within each bundle as mutually exclusive as possible.

This is graph coloring problem (NP-hard).

Approximate algorithm:
1. Construct a conflict graph: nodes are features, edge weights are the conflict degree (count of rows where both features are non-zero)
2. Greedy coloring: in order of decreasing total conflict, assign each feature to the bundle with minimum conflict (or open a new bundle)

Feature bundling: merge the features in a bundle, avoiding collisions via bin offsets:

$$x^{\text{bundle}} = x^{(j)} + \text{offset}_j, \quad \text{offset}_j = \sum_{j' < j} k_{j'}$$

where $\text{offset}_j$ is the sum of the bin counts of the first $j - 1$ features in the bundle.
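A toy illustration of the offset trick for two exclusive features (the values and bin counts are invented):

```python
import numpy as np

# Two mutually exclusive sparse features, discretized to bins 1..k (0 = absent)
x1 = np.array([2, 0, 0, 1, 0])   # k1 = 3 bins
x2 = np.array([0, 3, 1, 0, 0])   # k2 = 4 bins
k1 = 3

# Shift feature 2's bins by k1 so the two value ranges never collide in the bundle
bundle = np.where(x1 > 0, x1, np.where(x2 > 0, x2 + k1, 0))
print(bundle)  # [2 6 4 1 0]: values 1-3 come from x1, 4-7 from x2
```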

XGBoost vs LightGBM Comparison

| Dimension | XGBoost | LightGBM |
|-----------|---------|----------|
| Split strategy | Level-wise | Leaf-wise |
| Base algorithm | Pre-sorting | Histogram |
| Sampling | Row/column sampling | GOSS + EFB |
| Memory | High (stores sorted data) | Low (histograms) |
| Speed | Medium | Fast |
| Accuracy | High | High |
| Overfitting risk | Low | High (needs tuning) |
| Sparse features | Sparsity-aware | EFB optimized |
| Distributed | Supported | Supported (more efficient) |

Complete Implementation (Simplified)

import numpy as np

class XGBoostTree:
    """XGBoost single tree (simplified)"""
    def __init__(self, max_depth=6, min_child_weight=1, gamma=0, lambda_=1):
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.lambda_ = lambda_
        self.tree = None

    def fit(self, X, g, h):
        """
        Parameters:
            X: Feature matrix
            g: First-order gradients
            h: Second-order gradients
        """
        self.tree = self._build_tree(X, g, h, depth=0)
        return self

    def _calc_gain(self, G_L, H_L, G_R, H_R, G, H):
        """Calculate split gain"""
        gain = 0.5 * (
            G_L ** 2 / (H_L + self.lambda_) +
            G_R ** 2 / (H_R + self.lambda_) -
            G ** 2 / (H + self.lambda_)
        ) - self.gamma
        return gain

    def _find_best_split(self, X, g, h):
        """Find optimal split"""
        N, d = X.shape
        best_gain = 0
        best_split = None

        G = np.sum(g)
        H = np.sum(h)

        for j in range(d):
            # Sort by feature j
            sorted_idx = np.argsort(X[:, j])
            X_sorted = X[sorted_idx, j]
            g_sorted = g[sorted_idx]
            h_sorted = h[sorted_idx]

            G_L, H_L = 0, 0
            G_R, H_R = G, H

            for i in range(N - 1):
                G_L += g_sorted[i]
                H_L += h_sorted[i]
                G_R -= g_sorted[i]
                H_R -= h_sorted[i]

                # Skip identical values
                if X_sorted[i] == X_sorted[i + 1]:
                    continue

                # Check minimum Hessian constraint
                if H_L < self.min_child_weight or H_R < self.min_child_weight:
                    continue

                gain = self._calc_gain(G_L, H_L, G_R, H_R, G, H)

                if gain > best_gain:
                    best_gain = gain
                    threshold = (X_sorted[i] + X_sorted[i + 1]) / 2
                    best_split = (j, threshold, G_L, H_L, G_R, H_R)

        return best_split, best_gain

    def _build_tree(self, X, g, h, depth):
        """Recursively build tree"""
        G = np.sum(g)
        H = np.sum(h)

        # Leaf weight: w* = -G / (H + lambda)
        weight = -G / (H + self.lambda_)

        # Stopping conditions
        if depth >= self.max_depth or len(X) < 2:
            return {'weight': weight}

        # Find best split
        split, gain = self._find_best_split(X, g, h)

        if split is None or gain <= 0:
            return {'weight': weight}

        j, threshold, G_L, H_L, G_R, H_R = split

        # Split samples
        left_mask = X[:, j] <= threshold
        right_mask = ~left_mask

        return {
            'feature': j,
            'threshold': threshold,
            'left': self._build_tree(X[left_mask], g[left_mask], h[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], g[right_mask], h[right_mask], depth + 1)
        }

    def predict(self, X):
        """Predict"""
        return np.array([self._predict_tree(x, self.tree) for x in X])

    def _predict_tree(self, x, node):
        """Single-sample prediction"""
        if 'weight' in node:
            return node['weight']

        if x[node['feature']] <= node['threshold']:
            return self._predict_tree(x, node['left'])
        else:
            return self._predict_tree(x, node['right'])

class XGBoost:
    """XGBoost classifier/regressor"""
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=6,
                 min_child_weight=1, gamma=0, lambda_=1, objective='reg:squarederror'):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.lambda_ = lambda_
        self.objective = objective
        self.trees = []
        self.base_score = None

    def _gradient_hessian(self, y, pred):
        """Compute gradient and Hessian"""
        if self.objective == 'reg:squarederror':
            g = pred - y
            h = np.ones_like(y)
        elif self.objective == 'binary:logistic':
            p = 1 / (1 + np.exp(-pred))
            g = p - y
            h = p * (1 - p)
        else:
            raise ValueError(f"Unknown objective: {self.objective}")
        return g, h

    def fit(self, X, y):
        """Train"""
        # Initialize base score
        if self.objective == 'reg:squarederror':
            self.base_score = np.mean(y)
        elif self.objective == 'binary:logistic':
            self.base_score = np.log(np.mean(y) / (1 - np.mean(y)))

        pred = np.full(len(y), self.base_score)

        # Iteratively train trees
        for t in range(self.n_estimators):
            g, h = self._gradient_hessian(y, pred)

            tree = XGBoostTree(
                max_depth=self.max_depth,
                min_child_weight=self.min_child_weight,
                gamma=self.gamma,
                lambda_=self.lambda_
            )
            tree.fit(X, g, h)
            self.trees.append(tree)

            # Update predictions
            pred += self.learning_rate * tree.predict(X)

        return self

    def predict(self, X):
        """Predict"""
        pred = np.full(len(X), self.base_score)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)

        if self.objective == 'binary:logistic':
            return (1 / (1 + np.exp(-pred)) > 0.5).astype(int)
        else:
            return pred

    def score(self, X, y):
        """Accuracy (classification) or R^2 (regression)"""
        pred = self.predict(X)
        if self.objective == 'binary:logistic':
            return np.mean(pred == y)
        else:
            return 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

Q&A Highlights

Q1: Why does XGBoost use second-order information?

A: The second-order Taylor expansion provides curvature information, much as Newton's method improves on gradient descent:
- More accurate: the quadratic approximation is closer to the true loss
- Faster convergence: the Hessian gives an adaptive, per-sample step size
- Unified framework: different loss functions (squared, logistic, Huber) are handled uniformly through $(g_i, h_i)$


Q2: Why is Leaf-wise faster than Level-wise?

A: Leaf-wise splits only one node per round, less computation. Level-wise splits all same-level nodes, may include low-gain nodes (wasted computation).

But: Leaf-wise prone to overfitting, needs max_depth and min_data_in_leaf constraints.


Q3: Why does GOSS work?

A: Large-gradient samples contribute more to loss, keeping them ensures accuracy; small-gradient samples also contain information, through sampling+reweighting for approximation. Theoretically, GOSS variance increase is bounded, while computation significantly reduced.


Q4: How to choose hyperparameters?

A: Rules of thumb:
- learning_rate: 0.01-0.3 (smaller, paired with a larger n_estimators, is usually better)
- max_depth: 3-10 (LightGBM can go deeper)
- min_child_weight: 1-10 (prevents overfitting)
- gamma: 0-5 (pruning threshold)
- lambda: 0-10 (L2 regularization)

Tuning strategy: First coarse-tune learning_rate and n_estimators, then fine-tune tree structure parameters.


Q5: Can XGBoost/LightGBM handle categorical features?

A:
- XGBoost: needs manual encoding (one-hot, label encoding)
- LightGBM: native categorical feature support (optimal split algorithm, no encoding needed)

LightGBM's categorical handling is more efficient (avoids one-hot dimension explosion).


Q6: How to prevent overfitting?

A:
1. Tree complexity: decrease max_depth, increase min_child_weight
2. Regularization: increase gamma and lambda
3. Learning rate: decrease learning_rate (with increased n_estimators)
4. Sampling: row subsampling (subsample < 1) and column sampling (colsample_bytree < 1)
5. Early stopping: monitor a validation set and set early_stopping_rounds


Q7: XGBoost/LightGBM vs Deep Learning?

A: | Dimension | GBDT | Deep Learning | |-----------|------|---------------| | Tabular data | Excellent | Average | | Image/Text | Poor | Excellent | | Feature engineering | Needed | Automatic | | Interpretability | High | Low | | Training speed | Fast | Slow | | Tuning difficulty | Medium | High |

Tabular data (common in Kaggle): GBDT usually better; unstructured data: deep learning dominates.


Q8: How does XGBoost handle missing values?

A: Learn an optimal default direction:
1. Try assigning the missing samples to the left and right subtrees respectively
2. Select the direction with maximum gain as the default
3. During prediction, missing values automatically follow the default direction

This is equivalent to automatically learning optimal missing value imputation strategy.


References

  1. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD, 785-794.
  2. Ke, G., Meng, Q., Finley, T., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. NeurIPS, 3146-3154.
  3. Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). CatBoost: Unbiased boosting with categorical features. NeurIPS, 6638-6648.
  4. Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337-407.

XGBoost and LightGBM represent the engineering pinnacle of gradient boosting algorithms — from mathematical optimization (second-order information, regularization) to algorithmic innovation (histograms, GOSS, EFB), from system optimization (parallelization, caching) to practical techniques (hyperparameter tuning, interpretability). Their industrial success proves: solid mathematical theory combined with sophisticated engineering implementation creates both efficient and powerful machine learning systems.

✏️ Exercises and Solutions

Exercise 1: XGBoost Second-Order Approximation

Problem: For regression with squared loss $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$, compute the first-order gradient $g_i$ and the second-order gradient $h_i$. If the current prediction is smaller than the true value, what is the sign of $g_i$?

Solution:

First-order gradient:

$$g_i = \frac{\partial l}{\partial \hat{y}_i} = \hat{y}_i - y_i$$

Second-order gradient:

$$h_i = \frac{\partial^2 l}{\partial \hat{y}_i^2} = 1$$

When $\hat{y}_i < y_i$, we have $g_i < 0$: a negative gradient means the prediction is too small and should be increased.


Exercise 2: Split Gain Calculation

Problem: A leaf node has 10 samples with gradient sums $G$ and $H$. After a split, the left child (6 samples) has statistics $(G_L, H_L)$ and the right child (4 samples) has $(G_R, H_R)$, with $G = G_L + G_R$ and $H = H_L + H_R$. With regularization parameters $\lambda$ and $\gamma$, write the split gain.

Solution:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma$$

A positive gain means the split is worthwhile. A larger $\gamma$ reduces the gain (a pruning effect).


Exercise 3: GOSS Sampling

Problem: LightGBM uses GOSS, keeping all large-gradient samples (top 20%) and randomly sampling small-gradient samples (10% of remaining 80%). For dataset with 1000 samples, compute: (1) retained samples; (2) weight factor for small-gradient samples.

Solution:

  1. Large-gradient: $1000 \times 20\% = 200$; small-gradient: $(1000 - 200) \times 10\% = 80$; total: $200 + 80 = 280$ (28% of the data)

  2. The small-gradient proportion drops from 80% to 8%, so the sampled small-gradient samples need an amplification factor of $80\% / 8\% = 10$ to keep the gradient statistics unbiased.


Exercise 4: Leaf-wise vs Level-wise

Problem: A dataset is used to train a decision tree. Level-wise growth reaches depth 3 (15 nodes); Leaf-wise growth also produces 15 nodes but reaches a maximum depth of 5. Training errors are equal. Which generalizes better?

Solution:

Level-wise (XGBoost default): Balanced structure, depth 3, better regularization

Leaf-wise (LightGBM default): Unbalanced, depth 5, may overfit on small data

Conclusion: On small data, Level-wise is more robust. On large data, Leaf-wise is more efficient. LightGBM's Leaf-wise needs max_depth and min_data_in_leaf limits.


Exercise 5: XGBoost vs LightGBM Selection

Problem: Choose XGBoost or LightGBM for: (1) 1M samples, 1000 features; (2) 10K samples, 50 features; (3) recommender system with many categorical features.

Solution:

  1. LightGBM: Histogram algorithm excels on large scale, GOSS further accelerates

  2. XGBoost: Small data, Level-wise more robust, less overfitting

  3. LightGBM/CatBoost: Native categorical feature support, no one-hot explosion

Summary: Large data→LightGBM, Small data→XGBoost, Categorical features→LightGBM/CatBoost.

  • Post title: Machine Learning Mathematical Derivations (12): XGBoost and LightGBM
  • Post author: Chen Kai
  • Create time: 2021-10-30 15:15:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-12-XGBoost-and-LightGBM/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.