The Hidden Markov Model (HMM) is a classical tool for sequence modeling — when we observe a series of visible outputs, how do we infer the underlying hidden state sequence? From speech recognition to part-of-speech tagging, from bioinformatics to financial time series, HMM solves the three fundamental problems of probability computation, learning, and prediction through clever dynamic programming algorithms. This chapter delves into the mathematical principles of forward-backward algorithms, optimal path finding with Viterbi, and the EM framework implementation of Baum-Welch.
Markov Chains: From Memoryless to Finite Memory
First-Order Markov Assumption
Classical Markov chain: a state sequence $X_1, X_2, \ldots, X_t, \ldots$ satisfying the first-order Markov property

$$P(X_{t+1} \mid X_t, X_{t-1}, \ldots, X_1) = P(X_{t+1} \mid X_t)$$

Physical interpretation: The future depends only on the present, not on the history.

Homogeneity assumption: Transition probabilities are time-independent,

$$P(X_{t+1} = j \mid X_t = i) = a_{ij} \quad \text{for all } t$$

Transition matrix: $A = [a_{ij}]_{N \times N}$

Constraints: $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$ for every row $i$
Weather Prediction Example

State space: $\{\text{Sunny}, \text{Rainy}\}$, with transition matrix

$$A = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix}$$

Interpretation: The probability of sunny after sunny is 80%; the probability of sunny after rainy is 40%.

Initial distribution: $\pi = (\pi_{\text{Sunny}}, \pi_{\text{Rainy}})$, any probability vector over the two states.

Stationary distribution: When $\pi^* A = \pi^*$ with $\sum_i \pi^*_i = 1$, the distribution is unchanged by the dynamics. For the weather example above, $\pi^*_1 = 0.8\pi^*_1 + 0.4\pi^*_2$ gives $\pi^*_1 = 2\pi^*_2$, so the stationary distribution is $\pi^* = (2/3,\, 1/3)$.
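As a quick numerical check, the stationary distribution can be approximated by power iteration, $\pi \leftarrow \pi A$. This is a minimal sketch: the matrix `A` is the weather example above, and `stationary` is an illustrative helper name.

```python
# Weather chain from the example: row i gives P(next state | current state i),
# with state 0 = Sunny and state 1 = Rainy.
A = [[0.8, 0.2],
     [0.4, 0.6]]

def stationary(A, iters=200):
    """Approximate the stationary distribution by power iteration: pi <- pi A."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi

pi_star = stationary(A)  # converges to (2/3, 1/3) for this chain
```

Because the chain is ergodic (second eigenvalue $0.8 + 0.6 - 1 = 0.4$), the iteration converges geometrically from any starting distribution.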
Three Elements and Basic Architecture of HMM
Core Components
State sequence (unobservable): $I = (i_1, i_2, \ldots, i_T)$, where each $i_t$ takes values in the state space $Q = \{q_1, \ldots, q_N\}$

Observation sequence (observable): $O = (o_1, o_2, \ldots, o_T)$, where each $o_t$ takes values in the observation space $V = \{v_1, \ldots, v_M\}$

Initial state probability: $\pi = (\pi_i)$, where $\pi_i = P(i_1 = q_i)$

State transition probability matrix: $A = [a_{ij}]_{N \times N}$, where $a_{ij} = P(i_{t+1} = q_j \mid i_t = q_i)$

Emission probability matrix (observation probability): $B = [b_j(k)]_{N \times M}$, where $b_j(k) = P(o_t = v_k \mid i_t = q_j)$

Compact notation: $\lambda = (A, B, \pi)$
Two Fundamental Assumptions
Homogeneous Markov assumption: The hidden state at time $t$ depends only on the state at time $t-1$,

$$P(i_t \mid i_{t-1}, o_{t-1}, \ldots, i_1, o_1) = P(i_t \mid i_{t-1})$$

Observation independence assumption: The observation at time $t$ depends only on the hidden state at time $t$,

$$P(o_t \mid i_T, o_T, \ldots, i_t, \ldots, i_1, o_1) = P(o_t \mid i_t)$$

Joint probability decomposition:

$$P(O, I \mid \lambda) = \pi_{i_1} b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t}\, b_{i_t}(o_t)$$
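The decomposition above translates directly into code. Below is a minimal sketch (the function name is illustrative; it works in log space to avoid underflow and assumes all involved probabilities are strictly positive):

```python
import math

def joint_log_prob(states, obs, pi, A, B):
    """log P(O, I | lambda) via the factorization
    pi_{i_1} b_{i_1}(o_1) * prod_{t=2}^T a_{i_{t-1} i_t} b_{i_t}(o_t)."""
    lp = math.log(pi[states[0]]) + math.log(B[states[0]][obs[0]])
    for t in range(1, len(obs)):
        lp += math.log(A[states[t - 1]][states[t]])  # transition term
        lp += math.log(B[states[t]][obs[t]])         # emission term
    return lp
```

For instance, with $\pi = (0.6, 0.4)$, the path $(q_1, q_2)$ and observations $(v_1, v_2)$ multiply out to $\pi_1 b_1(v_1)\, a_{12}\, b_2(v_2)$.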
Part-of-Speech Tagging Example
Observation sequence: ["I", "love", "natural", "language", "processing"]
Hidden states: [Pronoun, Verb, Adjective, Noun, Noun]
State space: $Q = \{\text{Pronoun}, \text{Verb}, \text{Adjective}, \text{Noun}, \ldots\}$, the set of part-of-speech tags

Observation space: $V$ = the vocabulary (all possible words)

Emission probability: e.g. $b_{\text{Verb}}(\text{"love"}) = P(o_t = \text{"love"} \mid i_t = \text{Verb})$, the probability that the tag Verb emits the word "love"
Three Fundamental Problems of HMM
Problem 1: Probability Computation
Input: Model $\lambda = (A, B, \pi)$ and observation sequence $O = (o_1, \ldots, o_T)$; Output: $P(O \mid \lambda)$

Direct computation (brute force): enumerate all $N^T$ state sequences,

$$P(O \mid \lambda) = \sum_{I} P(O, I \mid \lambda) = \sum_{i_1, \ldots, i_T} \pi_{i_1} b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t}\, b_{i_t}(o_t)$$

Complexity: $O(T N^T)$, which is infeasible even for modest $T$; the forward algorithm reduces this to $O(N^2 T)$
Problem 2: Learning Problem
Input: Observation sequence $O$ (the state sequence $I$ is unobserved)

Output: Model parameters $\lambda^* = \arg\max_{\lambda} P(O \mid \lambda)$
Algorithm: Baum-Welch algorithm (EM algorithm for HMM)
Problem 3: Prediction Problem (Decoding)
Input: Model $\lambda$ and observation sequence $O$; Output: the state sequence $I^* = \arg\max_{I} P(I \mid O, \lambda)$
Algorithm: Viterbi algorithm (dynamic programming)
Forward Algorithm: Left-to-Right Probability Accumulation
Forward Variable Definition
Forward variable: $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, i_t = q_i \mid \lambda)$ — the joint probability of the observations up to time $t$ and being in state $q_i$ at time $t$

Terminal objective: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
Recursion Formula Derivation
Initialization ($t = 1$): $\alpha_1(i) = \pi_i b_i(o_1)$

Recursion ($t = 1, \ldots, T-1$): expand $\alpha_{t+1}(j)$ by conditioning on the state at time $t$,

$$\alpha_{t+1}(j) = \sum_{i=1}^{N} \alpha_t(i)\, P(i_{t+1} = q_j \mid i_t = q_i)\, P(o_{t+1} \mid i_{t+1} = q_j)$$

Applying HMM assumptions:
- Observation independence: $P(o_{t+1} \mid i_{t+1} = q_j, \ldots) = b_j(o_{t+1})$
- Markov property: $P(i_{t+1} = q_j \mid i_t = q_i, \ldots) = a_{ij}$

Result:

$$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$$

Physical meaning: Sum over all possible previous states $q_i$ that could transition into $q_j$, then account for emitting $o_{t+1}$.
Complete Forward Algorithm
Input: $\lambda = (A, B, \pi)$ and observation sequence $O = (o_1, \ldots, o_T)$

Step 1 (Initialization): $\alpha_1(i) = \pi_i b_i(o_1)$ for $i = 1, \ldots, N$

Step 2 (Recursion): For $t = 1, \ldots, T-1$: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$ for $j = 1, \ldots, N$

Step 3 (Termination): $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Complexity: $O(N^2 T)$ — each of the $T$ steps updates $N$ variables, each requiring a sum over $N$ predecessors
Numerical Stability: Log-Space Computation
Problem: Sequential multiplication of probabilities smaller than 1 causes floating-point underflow for long sequences.

Solution: Work with $\log \alpha_t(i)$ instead of $\alpha_t(i)$, replacing products by sums of logarithms.

Using the log-sum-exp trick for the sum inside the recursion (with $x_i = \log \alpha_t(i) + \log a_{ij}$):

$$\log \sum_{i} e^{x_i} = x_{\max} + \log \sum_{i} e^{x_i - x_{\max}}$$
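Putting the log-space recursion together, here is a minimal sketch of the forward algorithm (assuming strictly positive parameters so every `log` is defined; function names are illustrative):

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log(obs, pi, A, B):
    """Log-space forward algorithm; returns log P(O | lambda).
    pi[i], A[i][j], B[i][k] are the usual HMM parameters (all > 0)."""
    N = len(pi)
    # Initialization: log alpha_1(i) = log pi_i + log b_i(o_1)
    log_a = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(N)]
    # Recursion: log alpha_{t+1}(j) = LSE_i(log alpha_t(i) + log a_ij) + log b_j(o_{t+1})
    for o in obs[1:]:
        log_a = [logsumexp([log_a[i] + math.log(A[i][j]) for i in range(N)])
                 + math.log(B[j][o]) for j in range(N)]
    # Termination: log P(O | lambda) = LSE_i(log alpha_T(i))
    return logsumexp(log_a)
```

Exponentiating the result recovers $P(O \mid \lambda)$ exactly for short sequences, while the log-space form remains stable for sequences of thousands of steps.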
Backward Algorithm: Right-to-Left Probability Backtracking
Backward Variable Definition
Backward variable: $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid i_t = q_i, \lambda)$ — the probability of the future observations given state $q_i$ at time $t$

Relationship with forward: for any $t$, $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$; in particular, $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i b_i(o_1)\, \beta_1(i)$
Recursion Formula Derivation
Initialization ($t = T$): $\beta_T(i) = 1$ for all $i$

Convention: Probability is 1 when there are no future observations left to explain.

Recursion ($t = T-1, T-2, \ldots, 1$):

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

Applying HMM assumptions: conditioning on the state at time $t+1$, the transition $a_{ij}$, the emission $b_j(o_{t+1})$, and the remaining future $\beta_{t+1}(j)$ factorize.

Physical meaning: Sum over all possible next states $q_j$ the chain could move into.
Forward-Backward Synergy
Joint probability decomposition: $P(O, i_t = q_i \mid \lambda) = \alpha_t(i)\, \beta_t(i)$

State occupancy probability (smoothing):

$$\gamma_t(i) = P(i_t = q_i \mid O, \lambda) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$$

State transition probability:

$$\xi_t(i, j) = P(i_t = q_i, i_{t+1} = q_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i'=1}^{N} \sum_{j'=1}^{N} \alpha_t(i')\, a_{i'j'}\, b_{j'}(o_{t+1})\, \beta_{t+1}(j')}$$

These probabilities are core to the Baum-Welch algorithm!
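The quantities above can be sketched as follows (plain probability space, adequate for short sequences; for long sequences the log-space trick from the forward section applies; function names are illustrative):

```python
def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t][i] = P(o_1..o_t, i_t = q_i | lambda)."""
    N, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(obs, pi, A, B):
    """Backward algorithm: beta[t][i] = P(o_{t+1}..o_T | i_t = q_i, lambda)."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N for _ in range(T)]      # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):            # recursion, right to left
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta

def gamma_probs(alpha, beta):
    """gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)."""
    out = []
    for a_t, b_t in zip(alpha, beta):
        w = [a * b for a, b in zip(a_t, b_t)]
        z = sum(w)
        out.append([x / z for x in w])
    return out
```

A useful sanity check: $\sum_i \alpha_t(i)\beta_t(i)$ equals the same value $P(O \mid \lambda)$ at every $t$, and each row of $\gamma$ sums to 1.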
Viterbi Algorithm: Dynamic Programming for Optimal Path
Problem Formalization
Objective: Find the most likely state sequence $I^* = \arg\max_{I} P(I \mid O, \lambda)$

Difference from forward algorithm: the forward algorithm sums over all paths to obtain a total probability; Viterbi takes the maximum over paths to recover the single best one.
Viterbi Variable Definition

$$\delta_t(i) = \max_{i_1, \ldots, i_{t-1}} P(i_1, \ldots, i_{t-1}, i_t = q_i, o_1, \ldots, o_t \mid \lambda)$$

Recursion Formula

Initialization ($t = 1$): $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

Recursion ($t = 2, \ldots, T$):

$$\delta_t(j) = \max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(o_t)$$

Backtracking pointer:

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{ij}\right]$$

Physical meaning: Among all paths reaching state $q_j$ at time $t$, keep only the most probable one and remember which predecessor it came from.
Complete Viterbi Algorithm
Input: $\lambda = (A, B, \pi)$ and observation sequence $O$

Step 1 (Initialization): $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

Step 2 (Recursion): For $t = 2, \ldots, T$: $\delta_t(j) = \max_i [\delta_{t-1}(i)\, a_{ij}]\, b_j(o_t)$ and $\psi_t(j) = \arg\max_i [\delta_{t-1}(i)\, a_{ij}]$

Step 3 (Termination): $P^* = \max_i \delta_T(i)$, $i_T^* = \arg\max_i \delta_T(i)$

Step 4 (Backtracking): For $t = T-1, T-2, \ldots, 1$: $i_t^* = \psi_{t+1}(i_{t+1}^*)$

Output: Optimal state sequence $I^* = (i_1^*, i_2^*, \ldots, i_T^*)$
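The four steps can be sketched as follows (an illustrative implementation; ties in the max are broken by the lowest state index):

```python
def viterbi(obs, pi, A, B):
    """Viterbi decoding: returns (probability of best path, best state path)."""
    N, T = len(pi), len(obs)
    # Step 1: initialization
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    psi = [[0] * N for _ in range(T)]               # backtracking pointers
    # Step 2: recursion over t = 2..T
    for t in range(1, T):
        new = [0.0] * N
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            psi[t][j] = best_i
            new[j] = delta[best_i] * A[best_i][j] * B[j][obs[t]]
        delta = new
    # Step 3: termination
    last = max(range(N), key=lambda i: delta[i])
    # Step 4: backtracking
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return delta[last], path[::-1]
```

For long sequences, replace the products with sums of logs; since max commutes with the monotone log, no log-sum-exp is needed here.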
Baum-Welch Algorithm: EM Framework for HMM
Learning Problem Formalization
Observed data: $O = (o_1, \ldots, o_T)$; latent data: the state sequence $I$; goal: $\max_\lambda P(O \mid \lambda)$

Difficulty: The log-likelihood contains a summation over latent variables inside the logarithm,

$$\log P(O \mid \lambda) = \log \sum_{I} P(O, I \mid \lambda)$$

Cannot optimize directly by differentiation!
EM Algorithm Application
E-step: Compute the Q-function under the current estimate $\bar\lambda$,

$$Q(\lambda, \bar\lambda) = \sum_{I} P(I \mid O, \bar\lambda) \log P(O, I \mid \lambda)$$

M-step: Maximize the Q-function, $\bar\lambda \leftarrow \arg\max_{\lambda} Q(\lambda, \bar\lambda)$
E-step Detailed Derivation
Complete data log-likelihood:

$$\log P(O, I \mid \lambda) = \log \pi_{i_1} + \sum_{t=1}^{T-1} \log a_{i_t i_{t+1}} + \sum_{t=1}^{T} \log b_{i_t}(o_t)$$

Q-function expansion: taking the expectation of each term under $P(I \mid O, \bar\lambda)$ turns the sums over whole state sequences into sums over marginal posteriors of single states and state pairs.

Introducing expected statistics: $\gamma_t(i) = P(i_t = q_i \mid O, \bar\lambda)$ and $\xi_t(i, j) = P(i_t = q_i, i_{t+1} = q_j \mid O, \bar\lambda)$

These are exactly what the forward-backward algorithm computes!

Q-function simplifies to:

$$Q(\lambda, \bar\lambda) = \sum_{i=1}^{N} \gamma_1(i) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} \xi_t(i, j) \log a_{ij} + \sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_t(j) \log b_j(o_t)$$
M-step Update Formulas
Maximizing each term of the Q-function under its normalization constraint (via Lagrange multipliers) yields closed-form updates.

Optimal $\pi$: $\pi_i = \gamma_1(i)$

Optimal $A$:

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

Physical meaning: Expected number of transitions from state $q_i$ to state $q_j$, divided by the expected number of visits to state $q_i$ (before time $T$).

Optimal $B$:

$$b_j(k) = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

Physical meaning: Expected number of emissions of symbol $v_k$ from state $q_j$, divided by the expected number of visits to state $q_j$.
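One full E-step plus M-step can be sketched for a single observation sequence (an illustrative single-sequence version; production implementations average the expected counts over many sequences and work in log space):

```python
def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch (EM) iteration: forward-backward E-step, closed-form M-step."""
    N, T, M = len(pi), len(obs), len(B[0])
    # Forward pass: alpha[t][i] = P(o_1..o_t, i_t = q_i | lambda)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # Backward pass: beta[t][i] = P(o_{t+1}..o_T | i_t = q_i, lambda)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    prob_O = sum(alpha[T - 1][i] for i in range(N))   # P(O | lambda)
    # E-step: posterior state occupancies and transitions
    gamma = [[alpha[t][i] * beta[t][i] / prob_O for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob_O
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: expected-count ratios
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B
```

By construction $\sum_j \xi_t(i,j) = \gamma_t(i)$, so every row of the re-estimated $A$ and $B$ is automatically a probability distribution.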
Q&A
Q1: What is the essential difference between the forward algorithm and Viterbi algorithm?
A: Both are dynamic programming over the same trellis, but the forward algorithm sums over all possible paths (computing the total probability $P(O \mid \lambda)$), while Viterbi replaces the sum with a max (finding the single most probable path). Exchanging $\sum$ for $\max$ is the only structural difference.
Q2: Why does Baum-Welch guarantee monotonic likelihood increase?
A: This is a theoretical guarantee of the EM algorithm. Each EM iteration guarantees

$$\log P(O \mid \lambda^{(k+1)}) \ge \log P(O \mid \lambda^{(k)})$$

The key is that the Q-function constructed in the E-step gives a lower bound (ELBO) on the log-likelihood that is tight at the current parameters, and maximizing this lower bound in the M-step pushes up the true likelihood.
Q3: What are the limitations of HMM?
A: Three major limitations: 1. Strong Markov assumption: Future depends only on current state, ignoring long-range dependencies 2. Observation independence assumption: Current observation depends only on current state; in reality, observations may be interdependent 3. Fixed number of states: Must be specified in advance, difficult to determine automatically (can be addressed with Bayesian nonparametric methods like iHMM)
Q4: What are the advantages of CRF over HMM?
A: CRF is a discriminative model that directly models the conditional distribution $P(I \mid O)$ with globally normalized feature functions. It drops the observation independence assumption, so features may depend on the entire observation sequence, and global normalization avoids the label bias problem of locally normalized models.
Disadvantage: Higher training complexity (requires computing the partition function).
✏️ Exercises and Solutions
Exercise 1: Forward Algorithm
Problem: An HMM with 2 states (Sunny, Rainy) and a given observation sequence; compute $P(O \mid \lambda)$ with the forward algorithm.

Solution: Initialize $\alpha_1(i) = \pi_i b_i(o_1)$, apply the recursion $\alpha_{t+1}(j) = \left[\sum_i \alpha_t(i) a_{ij}\right] b_j(o_{t+1})$ step by step, then sum $\alpha_T(i)$ over the two states.
Exercise 2: Viterbi Decoding
Problem: Use parameters from Exercise 1, find the most likely state sequence.

Solution: Replace each sum in the forward recursion with a max, record the argmax pointers $\psi_t(j)$, and backtrack from $\arg\max_i \delta_T(i)$.
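Since the numeric parameters for these two exercises are not reproduced here, the following sketch uses hypothetical values for the Sunny/Rainy model and cross-checks the forward algorithm against brute-force enumeration; the same enumeration also yields the Exercise 2 answer directly:

```python
from itertools import product

# Hypothetical parameters (illustrative only): state 0 = Sunny, 1 = Rainy;
# observation 0 = Walk, 1 = Umbrella.
pi = [0.6, 0.4]
A = [[0.8, 0.2], [0.4, 0.6]]
B = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 0]

def path_prob(states):
    """P(O, I | lambda) for one concrete state path."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p

def forward_prob(obs, pi, A, B):
    """P(O | lambda) via the forward algorithm."""
    N = len(pi)
    a = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        a = [sum(a[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
    return sum(a)

paths = list(product(range(2), repeat=len(obs)))
total = sum(path_prob(s) for s in paths)   # brute-force P(O | lambda) (Exercise 1)
best = max(paths, key=path_prob)           # most likely path (Exercise 2)
```

The forward result must match the brute-force sum exactly, which makes this a handy unit test for any HMM implementation.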
Exercise 3: Complexity
Problem: Compare the time complexity of brute-force evaluation, the forward algorithm, and the Viterbi algorithm.

Solution: Forward: $O(N^2 T)$; Viterbi: $O(N^2 T)$; brute-force enumeration of all state sequences: $O(T N^T)$.
Exercise 4: Baum-Welch Update
Problem: Derive a Baum-Welch update formula; we take $a_{ij}$ as the example.

Solution: The $A$-dependent term of the Q-function is $\sum_{t=1}^{T-1} \sum_{i,j} \xi_t(i,j) \log a_{ij}$ subject to $\sum_j a_{ij} = 1$. Adding Lagrange multipliers $\mu_i$ and setting the derivative to zero gives $a_{ij} \propto \sum_{t=1}^{T-1} \xi_t(i,j)$; normalizing with $\sum_j \xi_t(i,j) = \gamma_t(i)$ yields $a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$.
Exercise 5: HMM vs Naive Bayes
Problem: Compare differences.
Solution: HMM models sequence dependencies (state transitions), while Naive Bayes assumes samples are independent. HMM inference uses dynamic programming (forward/Viterbi) over the whole sequence, whereas NB classifies each sample in isolation; an HMM with $T = 1$ reduces to a Naive Bayes-style generative classifier.
- Post title: Machine Learning Mathematical Derivations (15): Hidden Markov Models
- Post author: Chen Kai
- Create time: 2021-11-17 10:00:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-15-Hidden-Markov-Models/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.