• Machine Learning Mathematical Derivations (20): Regularization and Model Selection

    Regularization is a core technique in machine learning for controlling model complexity and preventing overfitting — when training data is limited, models tend to memorize noise rather than true patterns. From the mathematical forms of L1/L2 regularization to their Bayesian prior interpretation, from Dropout's random deactivation to early stopping's implicit regularization, from cross-validation for model selection to VC dimension generalization bounds, regularization theory provides mathematical guarantees for balancing underfitting and overfitting. This chapter deeply derives the optimization forms, Bayesian interpretation, bias-variance decomposition, and learning theory foundations of regularization.
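    As a minimal sketch of the L2 case the chapter derives, here is ridge regression in closed form on a hypothetical random dataset (`lam` stands for the regularization strength λ; all names are illustrative):

```python
import numpy as np

# Hypothetical synthetic data: 3 features, known true weights, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form L2-regularized least squares: (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)
# A larger lam shrinks the weight vector toward zero (smaller L2 norm),
# trading a little bias for lower variance.
```

    The L1 analogue replaces the squared-norm penalty with λ‖w‖₁; it has no closed form but drives individual weights exactly to zero, which matches the Laplace-prior side of the Bayesian interpretation.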

  • Machine Learning Mathematical Derivations (19): Neural Networks and Backpropagation

    Neural Networks are the cornerstone of deep learning — from biological neuron inspiration to multilayer nonlinear transformations, neural networks achieve end-to-end learning through the backpropagation algorithm. From the perceptron convergence theorem to the universal approximation theorem, from vanishing gradient problems to He initialization, from Sigmoid to ReLU, the mathematical principles of neural networks provide a solid foundation for understanding deep models. This chapter deeply derives the matrix form of forward propagation, chain rule of backpropagation, mathematical analysis of vanishing/exploding gradients, and weight initialization strategies.
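    The chain rule the chapter derives can be sketched by hand for a hypothetical one-hidden-layer ReLU network with He initialization, checking one backpropagated gradient entry against a finite difference (all shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4,))                        # input vector
W1 = rng.normal(size=(5, 4)) * np.sqrt(2 / 4)    # He init scales by sqrt(2/fan_in)
W2 = rng.normal(size=(1, 5)) * np.sqrt(2 / 5)
y = np.array([1.0])

def forward(W1, W2, x):
    z1 = W1 @ x
    a1 = np.maximum(z1, 0.0)                     # ReLU
    out = W2 @ a1
    loss = 0.5 * np.sum((out - y) ** 2)          # squared error
    return loss, (z1, a1, out)

loss, (z1, a1, out) = forward(W1, W2, x)
# Backward pass: apply the chain rule layer by layer.
d_out = out - y                                  # dL/d(out)
dW2 = np.outer(d_out, a1)
d_a1 = W2.T @ d_out
d_z1 = d_a1 * (z1 > 0)                           # ReLU gate kills negative pre-activations
dW1 = np.outer(d_z1, x)

# Numerical check on one entry of W1 via a forward difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(W1p, W2, x)[0] - loss) / eps
```

    The same `(z1 > 0)` gate is where vanishing gradients enter the analysis: a saturating activation like Sigmoid multiplies each layer's gradient by at most 0.25, while ReLU passes it through unchanged on the active side.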

  • Machine Learning Mathematical Derivations (18): Clustering Algorithms

    Clustering is a core task in unsupervised learning — automatically discovering group structures based on data similarity without labels. From the elegant simplicity of K-means to the density-adaptive DBSCAN, from hierarchical clustering's tree structure to spectral clustering's graph-theoretic foundation, clustering algorithms provide powerful tools for exploratory data analysis, customer segmentation, image segmentation, and anomaly detection. This chapter deeply derives Lloyd's algorithm and the EM interpretation of K-means, linkage criteria in hierarchical clustering, density reachability in DBSCAN, and the graph Laplacian and NCut objective in spectral clustering.
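    Lloyd's algorithm for K-means, the first derivation in the chapter, fits in a few lines: alternate a nearest-centroid assignment step with a centroid-mean update step (the EM-style view). A sketch on hypothetical two-blob data:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init from data points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

# Two well-separated blobs near (0,0) and (10,10); centroids should land on them.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(10, 0.5, (30, 2))])
centers, labels = kmeans(X, k=2)
```

    DBSCAN and spectral clustering need no centroids at all, which is exactly why they handle the non-convex cluster shapes where this objective fails.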

  • Machine Learning Mathematical Derivations (17): Dimensionality Reduction and PCA

    Dimensionality Reduction is a core technique in machine learning for handling high-dimensional data — when feature dimensions reach thousands or even millions, the curse of dimensionality makes learning difficult. Dimensionality reduction preserves the main structure of data while projecting it to a lower-dimensional space. From PCA's eigenvalue decomposition to LDA's inter-class separation, from kernel tricks for nonlinear mapping to t-SNE for manifold learning, dimensionality reduction algorithms provide powerful tools for data visualization, feature extraction, and preprocessing. This chapter deeply derives the two perspectives of PCA, implicit mapping of kernel PCA, Fisher's criterion for LDA, and probability distribution matching in t-SNE.
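    The maximum-variance view of PCA reduces to an eigendecomposition of the sample covariance matrix. A sketch on hypothetical correlated 2-D data (variable names are illustrative):

```python
import numpy as np

# Data stretched along the (1, 1) direction plus small isotropic noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, latent]) + 0.1 * rng.normal(size=(200, 2))

Xc = X - X.mean(0)                          # center the data
C = Xc.T @ Xc / (len(X) - 1)                # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
order = eigvals.argsort()[::-1]
components = eigvecs[:, order]              # principal axes, largest variance first
Z = Xc @ components[:, :1]                  # project onto the first PC

# Variance explained by the first principal component.
explained = eigvals[order][0] / eigvals.sum()
```

    The minimum-reconstruction-error perspective yields the same eigenvectors; kernel PCA replaces `C` with a centered kernel matrix so the mapping never has to be computed explicitly.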

  • Machine Learning Mathematical Derivations (16): Conditional Random Fields

    Conditional Random Fields (CRF) are discriminative models for sequence labeling — unlike HMM, CRF directly models the conditional probability rather than the joint probability, thereby avoiding the observation independence assumption and allowing flexible use of overlapping features. From named entity recognition to part-of-speech tagging, from information extraction to image segmentation, CRF achieves optimal performance in sequence modeling through clever undirected graph structures and feature engineering. This chapter delves into the potential functions, normalization factor, forward-backward algorithm, gradient computation, and L-BFGS optimization for linear-chain CRF.
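    The normalization factor Z(x) is the expensive part of a linear-chain CRF; the forward algorithm computes it in O(T·K²) instead of summing over K^T label paths. A log-space sketch, with hypothetical per-position and pairwise scores standing in for the weighted feature sums, checked against brute-force enumeration:

```python
import numpy as np

def log_partition(emissions, transitions):
    """Forward algorithm in log space for log Z(x) of a linear-chain CRF.

    emissions: (T, K) per-position label scores; transitions: (K, K) pairwise scores.
    """
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        # alpha_t(j) = logsumexp_i [alpha_{t-1}(i) + trans(i, j)] + emit_t(j)
        m = alpha[:, None] + transitions
        alpha = emissions[t] + np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0)
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# Brute-force check on a tiny chain: enumerate all K^T paths.
T, K = 3, 2
rng = np.random.default_rng(4)
em, tr = rng.normal(size=(T, K)), rng.normal(size=(K, K))
paths = [(a, b, c) for a in range(K) for b in range(K) for c in range(K)]
brute = np.log(sum(np.exp(em[0, p[0]] + tr[p[0], p[1]] + em[1, p[1]]
                          + tr[p[1], p[2]] + em[2, p[2]]) for p in paths))
```

    The same forward (and backward) quantities also give the marginal expectations needed for the gradient of the log-likelihood.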

  • Machine Learning Mathematical Derivations (15): Hidden Markov Models

    The Hidden Markov Model (HMM) is a classical tool for sequence modeling — when we observe a series of visible outputs, how do we infer the underlying hidden state sequence? From speech recognition to part-of-speech tagging, from bioinformatics to financial time series, HMM solves the three fundamental problems of probability computation, learning, and prediction through clever dynamic programming algorithms. This chapter delves into the mathematical principles of forward-backward algorithms, optimal path finding with Viterbi, and the EM framework implementation of Baum-Welch.
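    The Viterbi recursion for the prediction problem is short enough to sketch directly. Here is a log-space version on a hypothetical sticky two-state HMM where state 0 mostly emits symbol 0 and state 1 mostly emits symbol 1 (all parameters are illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state path via dynamic programming (log space)."""
    T, K = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])      # delta_1(i)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # delta_t(j) = max_i [delta_{t-1}(i) + log A(i,j)] + log B(j, o_t)
        scores = logd[:, None] + np.log(A)
        back[t] = scores.argmax(0)
        logd = scores.max(0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):                 # trace the argmax pointers back
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])            # sticky transitions
B = np.array([[0.9, 0.1], [0.1, 0.9]])            # emissions favor matching symbol
path = viterbi(pi, A, B, obs=[0, 0, 1, 1, 1])     # expect the states to track the symbols
```

    Replacing the max with a sum turns this recursion into the forward algorithm for the probability-computation problem; Baum-Welch then wraps forward-backward in an EM loop for learning.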

  • Machine Learning Mathematical Derivations (14): Variational Inference and Variational EM

    Variational Inference transforms Bayesian inference into an optimization problem — when the posterior distribution is difficult to compute exactly, variational inference optimizes over a tractable family of distributions to approximate the true posterior, converting an integration problem into an optimization problem. From variational EM to variational autoencoders, from topic models to deep generative models, variational inference has become a core technique in modern machine learning. This chapter systematically derives the mathematical principles of variational inference, mean-field approximation, coordinate ascent algorithms, and black-box variational inference.
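    The identity at the heart of the chapter can be stated compactly: the log evidence splits into the ELBO plus a KL term, so maximizing the ELBO over a tractable family q both tightens the bound and pulls q toward the true posterior:

```latex
\log p(x) \;=\;
\underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)}
\;+\; \mathrm{KL}\!\left(q(z)\,\middle\|\,p(z \mid x)\right)
```

    Under the mean-field factorization \(q(z) = \prod_j q_j(z_j)\), coordinate ascent updates each factor as \(\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}\!\left[\log p(x, z)\right] + \mathrm{const}\), holding the other factors fixed.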

  • Machine Learning Mathematical Derivations (13): EM Algorithm and GMM

    The EM (Expectation-Maximization) algorithm is a general framework for handling latent variable models — when data contains unobserved latent variables, direct likelihood maximization becomes difficult. EM iterates between "expectation" and "maximization" steps, guaranteeing monotonic likelihood increase until convergence. From parameter estimation in Gaussian mixture models to image segmentation and speech recognition, the EM algorithm demonstrates both theoretical elegance and practical value. This chapter systematically derives the mathematical principles, convergence theory, Gaussian mixture models, and their variants.
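    The E-step/M-step alternation is easiest to see for a two-component 1-D Gaussian mixture. A sketch on hypothetical data drawn from components at −3 and +3 (the crude min/max initialization is just for determinism):

```python
import numpy as np

def em_gmm(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])         # crude but deterministic init
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_{ik} = posterior of component k given x_i.
        # (The 1/sqrt(2*pi) factor cancels in the normalization below.)
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        gamma = dens / dens.sum(1, keepdims=True)
        # M-step: responsibility-weighted MLE updates.
        nk = gamma.sum(0)
        w = nk / len(x)
        mu = (gamma * x[:, None]).sum(0) / nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(0) / nk)
    return w, mu, sigma

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
w, mu, sigma = em_gmm(x)
```

    Each iteration provably does not decrease the observed-data log-likelihood, which is the monotonicity result the chapter derives via the evidence lower bound.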

  • Machine Learning Mathematical Derivations (12): XGBoost and LightGBM

    XGBoost and LightGBM are the most popular gradient boosting frameworks in industry — building on GBDT with sophisticated mathematical optimizations and engineering innovations, they strike a strong balance between accuracy and efficiency. From XGBoost's second-order optimization to LightGBM's histogram acceleration, from regularization penalties to gradient-based one-side sampling, these techniques embody deep mathematical insights. This chapter systematically derives the mathematical principles, algorithm details, and practical techniques of XGBoost and LightGBM.
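    The payoff of XGBoost's second-order expansion is that each leaf gets a closed-form optimal weight and each candidate split gets a closed-form gain. A sketch with hypothetical gradient/Hessian sums (for squared loss, g_i = pred − y and h_i = 1):

```python
# For a leaf with gradient sum G and Hessian sum H, the second-order objective
#   sum_i [g_i * w + 0.5 * h_i * w^2] + 0.5 * lam * w^2
# is minimized at w* = -G / (H + lam), with objective value -0.5 * G^2 / (H + lam).
def leaf_weight(G, H, lam):
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting a node into left/right children; gamma is the
    per-leaf complexity penalty that must be overcome for the split to pay off."""
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Hypothetical sums for one node and one candidate split:
w = leaf_weight(G=-4.0, H=8.0, lam=1.0)
gain = split_gain(GL=-3.0, HL=4.0, GR=-1.0, HR=4.0, lam=1.0, gamma=0.0)
```

    LightGBM evaluates the same gain formula, but only over histogram bin boundaries, which is where its speedup comes from.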

  • Machine Learning Mathematical Derivations (11): Ensemble Learning

    Ensemble Learning is one of the most powerful weapons in machine learning — combining multiple weak learners to build strong learners with excellent performance. From Kaggle competitions to industrial applications, ensemble methods are ubiquitous. Why can many weak learners together outperform a single strong one, and what is the mathematical mechanism behind it? This chapter systematically derives the theoretical foundations of ensemble learning, including bias-variance decomposition, Boosting's additive models, Random Forest's randomization strategies, and Gradient Boosting's function space optimization, revealing the core wisdom of ensemble learning.
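    The variance-reduction mechanism can be checked numerically: averaging m independent predictors with equal variance σ² gives an ensemble variance of σ²/m (correlation ρ between members caps the reduction at ρσ² + (1−ρ)σ²/m). A simulation sketch with hypothetical independent unit-variance errors:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 25, 100_000
# Each column is one predictor's error on n hypothetical examples.
preds = rng.normal(0.0, 1.0, size=(n, m))
ensemble = preds.mean(1)                  # simple averaging, as in bagging

var_single = preds[:, 0].var()            # ~ 1 for one predictor
var_ensemble = ensemble.var()             # ~ 1/m for the average of m predictors
```

    Random Forest's feature subsampling exists precisely to lower the correlation ρ between trees, pushing the ensemble toward the independent-predictor ideal simulated here.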