The EM (Expectation-Maximization) algorithm is a general framework for handling latent variable models — when data contains unobserved latent variables, direct likelihood maximization becomes difficult. EM iterates between "expectation" and "maximization" steps, guaranteeing a monotonic increase of the likelihood until convergence. From parameter estimation in Gaussian mixture models to image segmentation and speech recognition, the EM algorithm demonstrates both theoretical elegance and practical value. This chapter systematically develops the algorithm's mathematical principles and convergence theory, then applies it to Gaussian mixture models and their variants.
Latent Variables and Incomplete Data
Problem Background
Complete data: Observed data $X = \{x_1, \dots, x_N\}$ together with the latent variables $Z = \{z_1, \dots, z_N\}$.
Incomplete data: Only $X$ is observed; $Z$ is hidden.
Objective: Maximize the incomplete-data log-likelihood
$$\log p(X \mid \theta) = \sum_{i=1}^{N} \log \sum_{z_i} p(x_i, z_i \mid \theta)$$
Difficulty: The sum is inside the logarithm, making direct optimization challenging.
Example: Gaussian Mixture Model - Observed: Data points $x_i \in \mathbb{R}^d$ - Latent: Component assignments $z_i \in \{1, \dots, K\}$
Jensen's Inequality and Log-Likelihood Lower Bound
Jensen's Inequality: For a concave function $f$, $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$. For $f = \log$ (which is concave):
$$\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$$
Application to log-likelihood: Introduce an arbitrary distribution $q(z)$ over the latent variable and apply Jensen's inequality:
$$\log p(x \mid \theta) = \log \sum_{z} q(z) \frac{p(x, z \mid \theta)}{q(z)} \ge \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \equiv \mathcal{L}(q, \theta)$$
ELBO (Evidence Lower Bound): $\mathcal{L}(q, \theta)$ lower-bounds the log-likelihood for every choice of $q$.
KL Divergence and Tight Bound Condition
Rewriting the ELBO:
$$\log p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)$$
where KL divergence: $\mathrm{KL}(q \,\|\, p) = \sum_{z} q(z) \log \frac{q(z)}{p(z \mid x, \theta)} \ge 0$.
Tight bound condition: When $q(z) = p(z \mid x, \theta)$, the KL term vanishes and the bound is tight: $\mathcal{L}(q, \theta) = \log p(x \mid \theta)$.
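The decomposition above can be checked numerically on a toy discrete model (the joint probabilities below are made-up illustrative numbers):

```python
import numpy as np

# Joint probabilities p(x, z) for one fixed observation x and K = 3 latent states
# (illustrative numbers; any positive values summing to at most 1 work).
p_xz = np.array([0.10, 0.25, 0.05])
log_px = np.log(p_xz.sum())          # incomplete-data log-likelihood log p(x)

q = np.array([0.5, 0.3, 0.2])        # an arbitrary distribution over z
elbo = np.sum(q * np.log(p_xz / q))  # ELBO(q) = E_q[log p(x,z) - log q(z)]

posterior = p_xz / p_xz.sum()        # p(z | x)
kl = np.sum(q * np.log(q / posterior))

# The decomposition log p(x) = ELBO(q) + KL(q || p(z|x)) holds exactly:
assert np.isclose(log_px, elbo + kl)
# And the bound is tight when q equals the posterior:
elbo_tight = np.sum(posterior * np.log(p_xz / posterior))
assert np.isclose(elbo_tight, log_px)
```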
EM Algorithm Derivation
Algorithm Framework

Initialization: Parameters $\theta^{(0)}$.
Iteration (iteration $t$):
E-step (Expectation): Fix $\theta^{(t)}$ and set $q(z) = p(z \mid x, \theta^{(t)})$. Compute the Q-function:
$$Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\big[\log p(X, Z \mid \theta)\big]$$
M-step (Maximization): Fix $q$ and update $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$.
Termination: Convergence (change in log-likelihood or parameters below threshold).
Meaning of the Q-function
Interpretation: The Q-function is the expected complete data log-likelihood under the posterior distribution of latent variables.
Convergence Proof

Theorem: The EM algorithm guarantees monotonic increase of the log-likelihood:
$$\log p(X \mid \theta^{(t+1)}) \ge \log p(X \mid \theta^{(t)})$$
Proof: After the E-step, $q(z) = p(z \mid x, \theta^{(t)})$ makes the bound tight: $\mathcal{L}(q, \theta^{(t)}) = \log p(X \mid \theta^{(t)})$. The M-step maximizes the ELBO over $\theta$, so $\mathcal{L}(q, \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t)})$. Since the ELBO lower-bounds the log-likelihood for any $q$, we obtain
$$\log p(X \mid \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t)}) = \log p(X \mid \theta^{(t)}). \quad \square$$
Corollary: The EM algorithm converges to a (local) maximum or saddle point of the log-likelihood.
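The monotonicity guarantee can be observed on a minimal 1D sketch, assuming a two-component mixture with scalar variances; the log-likelihood is recorded before each EM iteration and checked to be non-decreasing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

# Initial guesses (illustrative)
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_lik(x, pi, mu, var):
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

lls = []
for _ in range(20):
    lls.append(log_lik(x, pi, mu, var))
    # E-step: responsibilities
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

# Monotonic increase of the log-likelihood (up to floating-point noise)
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```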
Gaussian Mixture Model (GMM)

Model Definition
Generative process: 1. Choose a component $z_i \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K)$. 2. Sample $x_i \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$.
Marginal distribution (observed-data likelihood):
$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Parameters: $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, with $\sum_{k} \pi_k = 1$.
EM Algorithm for GMM
E-step: Compute responsibilities:
$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
Physical meaning: $\gamma_{ik}$ is the posterior probability that sample $x_i$ was generated by component $k$.
M-step: Update parameters.
Q-function: $Q(\theta, \theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \big[\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\big]$
Update $\pi_k$: $\pi_k = \frac{N_k}{N}$, where $N_k = \sum_{i=1}^{N} \gamma_{ik}$ is the effective number of samples assigned to component $k$.
Update $\mu_k$: $\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i$ — a weighted average in which samples contribute in proportion to their responsibilities.
Update $\Sigma_k$: $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top$ — a responsibility-weighted covariance.
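The E-step and M-step formulas above translate directly into NumPy. This sketch runs one full iteration on synthetic 2D data (all variable names are illustrative):

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(expo) / norm

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
K, (N, d) = 2, X.shape
pi = np.full(K, 1 / K)
mu = X[rng.choice(N, K, replace=False)]          # random points as initial means
Sigma = np.stack([np.eye(d)] * K)

# E-step: responsibilities gamma_ik
dens = np.stack([pi[k] * gauss_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
gamma = dens / dens.sum(axis=1, keepdims=True)

# M-step: closed-form updates
Nk = gamma.sum(axis=0)
pi = Nk / N
mu = (gamma.T @ X) / Nk[:, None]
for k in range(K):
    diff = X - mu[k]
    Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

assert np.isclose(pi.sum(), 1.0) and np.allclose(gamma.sum(axis=1), 1.0)
```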
Geometric Interpretation of GMM
K-means vs GMM:
- K-means: Hard assignment ($\gamma_{ik} \in \{0, 1\}$), each point belongs to exactly one cluster
- GMM: Soft assignment ($\gamma_{ik} \in [0, 1]$, summing to 1 over $k$)
Advantages of GMM:
- Allows overlapping clusters
- Different shapes for different clusters (via the covariance)
- Provides uncertainty estimates
Initialization Strategies
The EM algorithm is sensitive to initialization. Common strategies:
- K-means initialization: Use K-means results to initialize $\mu_k$, $\Sigma_k$, $\pi_k$
- Random initialization: Randomly select $K$ points from the data as $\mu_k$
- Multiple restarts: Run EM multiple times, select the result with the highest log-likelihood
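The multiple-restarts strategy can be sketched as follows, assuming a simple 1D two-component EM routine (`run_em` is a hypothetical helper, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 80), rng.normal(2, 1, 80)])

def run_em(x, rng, iters=30):
    """One EM run on a 1D two-component GMM from a random initialization."""
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, 2, replace=False)   # random data points as initial means
    var = np.full(2, x.var())
    for _ in range(iters):
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        Nk = gamma.sum(axis=0)
        pi, mu = Nk / len(x), (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-8
    ll = np.log((pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                 / np.sqrt(2 * np.pi * var)).sum(axis=1)).sum()
    return ll, (pi, mu, var)

# Multiple restarts: keep the run with the highest final log-likelihood
runs = [run_em(x, rng) for _ in range(5)]
best_ll, best_params = max(runs, key=lambda r: r[0])
```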
GMM Applications
Density Estimation
Objective: Learn the data distribution $p(x)$ as a mixture of Gaussians.
Applications: Anomaly detection, sample generation, probabilistic prediction.
Cluster Analysis
Hard clustering: Assign each sample to its most responsible component, $\hat{z}_i = \arg\max_k \gamma_{ik}$.
BIC model selection: Choose the number of components $K$ that minimizes
$$\mathrm{BIC} = -2 \log \hat{L} + d \log N$$
where $\hat{L}$ is the maximized likelihood and $d$ is the number of free parameters.
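A BIC sweep over candidate $K$ can be sketched as follows, using a simple 1D EM fit (the helper `fit_gmm_ll` is illustrative; in 1D a $K$-component GMM has $d = 3K - 1$ free parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
# Three well-separated 1D clusters, so a small K should score well
x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
N = len(x)

def fit_gmm_ll(x, K, iters=50):
    """Fit a 1D K-component GMM with EM and return the final log-likelihood."""
    pi = np.full(K, 1 / K)
    mu = np.sort(rng.choice(x, K, replace=False))   # random data points as means
    var = np.full(K, x.var())
    for _ in range(iters):
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        Nk = gamma.sum(axis=0)
        pi, mu = Nk / len(x), (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

# BIC = -2 log L + d log N; choose the K with the smallest BIC
bics = {K: -2 * fit_gmm_ll(x, K) + (3 * K - 1) * np.log(N) for K in range(1, 6)}
best_K = min(bics, key=bics.get)
```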
Image Segmentation
Problem: Cluster image pixels into $K$ regions.
Features: RGB color, position coordinates, texture features.
GMM model: Each component corresponds to a region.
Segmentation result: Assign pixel $i$ to region $\arg\max_k \gamma_{ik}$.
Complete Implementation
```python
import numpy as np
```
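A compact, self-contained implementation consistent with the update formulas above might look like this (the class name `GMM` and its interface are illustrative, not from the original post); it uses a log-sum-exp E-step for numerical stability and a small ridge on the covariances:

```python
import numpy as np

class GMM:
    """Gaussian mixture fitted by EM (illustrative sketch)."""

    def __init__(self, K, n_iter=100, tol=1e-6, reg=1e-6, seed=0):
        self.K, self.n_iter, self.tol, self.reg = K, n_iter, tol, reg
        self.rng = np.random.default_rng(seed)

    def _log_gauss(self, X):
        # log N(x | mu_k, Sigma_k) for every sample and component
        N, d = X.shape
        out = np.empty((N, self.K))
        for k in range(self.K):
            diff = X - self.mu[k]
            _, logdet = np.linalg.slogdet(self.Sigma[k])
            sol = np.linalg.solve(self.Sigma[k], diff.T).T
            out[:, k] = -0.5 * (d * np.log(2 * np.pi) + logdet
                                + np.einsum('ni,ni->n', diff, sol))
        return out

    def fit(self, X):
        N, d = X.shape
        self.pi = np.full(self.K, 1 / self.K)
        self.mu = X[self.rng.choice(N, self.K, replace=False)].copy()
        self.Sigma = np.stack([np.cov(X.T) + self.reg * np.eye(d)] * self.K)
        prev = -np.inf
        for _ in range(self.n_iter):
            # E-step: responsibilities via log-sum-exp for numerical stability
            log_w = self._log_gauss(X) + np.log(self.pi)
            m = log_w.max(axis=1, keepdims=True)
            log_norm = m + np.log(np.exp(log_w - m).sum(axis=1, keepdims=True))
            gamma = np.exp(log_w - log_norm)
            ll = log_norm.sum()
            if ll - prev < self.tol:
                prev = ll
                break
            prev = ll
            # M-step: closed-form updates
            Nk = gamma.sum(axis=0)
            self.pi = Nk / N
            self.mu = (gamma.T @ X) / Nk[:, None]
            for k in range(self.K):
                diff = X - self.mu[k]
                self.Sigma[k] = ((gamma[:, k, None] * diff).T @ diff / Nk[k]
                                 + self.reg * np.eye(d))
        self.log_likelihood_ = prev
        return self

    def predict(self, X):
        log_w = self._log_gauss(X) + np.log(self.pi)
        return log_w.argmax(axis=1)

# Usage on two well-separated synthetic clusters
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])
model = GMM(K=2, seed=1).fit(X)
labels = model.predict(X)
```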
Q&A
Q1: Why is the EM algorithm called Expectation-Maximization?
A: The E-step computes the Q-function (the expectation of complete data log-likelihood), and the M-step maximizes the Q-function. The name directly describes the algorithm steps.
Q2: Does the EM algorithm converge to the global optimum?
A: Not guaranteed. EM guarantees monotonic increase of the log-likelihood but may converge to local optima or saddle points. Solutions:
- Multiple random initializations
- Better initialization (e.g., K-means)
- Combine with other optimization algorithms (e.g., gradient descent)
Q3: How to choose the number of components K in GMM?
A: Model selection criteria:
- BIC: $\mathrm{BIC} = -2 \log \hat{L} + d \log N$, where $d$ is the number of free parameters
Typically choose the $K$ that minimizes the criterion.
Q4: What is the relationship between GMM and K-means?
A: K-means is a special case of GMM:
- All covariances are equal and isotropic, $\Sigma_k = \sigma^2 I$
- In the limit $\sigma^2 \to 0$, responsibilities become hard assignments
GMM is a soft, generalized version of K-means.
Q5: Can the EM algorithm be parallelized?
A: The E-step is naturally parallel (each sample computes its responsibilities $\gamma_{ik}$ independently); the M-step updates are sums over samples, which can be computed as parallel reductions.
Q6: What to do when the covariance matrix is singular?
A: Common causes: high dimensionality, few samples assigned to some components. Solutions:
- Regularization: $\Sigma_k \leftarrow \Sigma_k + \epsilon I$
- Constrain the covariance structure (e.g., diagonal)
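A quick illustration of the failure mode and the regularization fix (the numbers are illustrative):

```python
import numpy as np

# A component fitted to points lying exactly on a line has a singular covariance
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
Sigma = np.cov(X.T, bias=True)
assert np.isclose(np.linalg.det(Sigma), 0.0)    # singular: cannot be inverted

# Regularization Sigma <- Sigma + eps * I restores invertibility
eps = 1e-6
Sigma_reg = Sigma + eps * np.eye(2)
assert np.linalg.det(Sigma_reg) > 0
inv = np.linalg.inv(Sigma_reg)                  # now succeeds
```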
Q7: What are variants of the EM algorithm?
A:
- Incremental EM: Online learning, batch updates
- Stochastic EM: Random subset sampling at each iteration
- Variational EM: E-step uses variational inference to approximate the posterior
- Generalized EM: M-step doesn't fully optimize (e.g., a few gradient ascent steps)
Q8: Can GMM handle high-dimensional data?
A: High-dimensional difficulties (curse of dimensionality):
- Covariance parameters number $O(Kd^2)$ and quickly exceed the sample size
Solutions:
- Diagonal covariance (assume feature independence)
- Factor analysis (low-rank covariance)
- Feature selection / dimensionality reduction
Q9: How to visualize high-dimensional GMM?
A:
- PCA projection: Project to the first 2-3 principal components
- t-SNE: Nonlinear dimensionality reduction
- Responsibility heatmap: Sample × component matrix of $\gamma_{ik}$
Q10: What are other applications of the EM algorithm?
A:
- Hidden Markov Models: Baum-Welch algorithm
- Topic models: Variational EM for LDA
- Missing data imputation: EM iteratively estimates missing values
- Mixture of experts: Gating network + expert networks
EM is a general framework for handling latent variables.
✏️ Exercises and Solutions
Exercise 1: E-step Calculation
Problem: GMM with known parameters; compute the E-step responsibilities for a given data point. Solution: Evaluate $\gamma_{ik} = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \big/ \sum_j \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$ for each component $k$.
Exercise 2: M-step Update
Problem: Samples with given responsibilities; derive the M-step updates. Solution: With $N_k = \sum_i \gamma_{ik}$: $\pi_k = N_k / N$, $\mu_k = \frac{1}{N_k}\sum_i \gamma_{ik} x_i$, $\Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top$.
Exercise 3: EM Convergence
Problem: What does EM converge to? Solution: Local optimum (or saddle point). EM guarantees likelihood increase but not global optimum.
Exercise 4: GMM vs K-means
Problem: Compare GMM and K-means. Solution: K-means: hard assignment, spherical clusters. GMM: soft assignment (probabilistic), elliptical clusters.
Exercise 5: Missing Data
Problem: Data contain missing entries; how can EM estimate the parameters? Solution: Treat the missing values as latent variables: the E-step fills them in with their conditional expectations given the observed entries and the current parameters, and the M-step re-estimates the parameters from the completed data.
References
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1-22.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [Chapter 9: Mixture Models and EM]
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 11: Mixture Models and EM]
- McLachlan, G., & Krishnan, T. (2007). The EM Algorithm and Extensions (2nd ed.). Wiley.
- Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 355-368.
The EM algorithm, with its elegant mathematical structure and broad practical value, has become one of the cornerstones of modern machine learning. From parameter estimation in Gaussian mixture models to training hidden Markov models, from missing data handling to semi-supervised learning, the EM algorithm demonstrates the wisdom of "divide and conquer"— decomposing a difficult incomplete data optimization problem into simple "expectation" and "maximization" steps. Understanding the EM algorithm is not only mastering a classical algorithm but also appreciating the deep connection between probabilistic inference and optimization — a theme that runs through variational inference, Monte Carlo methods, and other modern Bayesian machine learning techniques.
- Post title: Machine Learning Mathematical Derivations (13): EM Algorithm and GMM
- Post author: Chen Kai
- Create time: 2021-11-05 09:45:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-13-EM-Algorithm-and-GMM/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.