Hidden Markov Models (HMMs) are classical tools for sequence modeling: when we observe a series of visible outputs, how can we infer the underlying hidden state sequence? From speech recognition to part-of-speech tagging, from bioinformatics to financial time series, HMMs solve three fundamental problems (probability computation, learning, and prediction) through elegant dynamic programming algorithms. This chapter systematically derives the mathematical principles of the forward-backward algorithm, optimal path solving with the Viterbi algorithm, and the EM-based Baum-Welch algorithm.
## Markov Chains: From Memorylessness to Finite Memory

### First-Order Markov Assumption
**Classical Markov Chain**: A state sequence $(s_1, s_2, \dots)$ satisfies

$$P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_1) = P(s_{t+1} \mid s_t)$$

**Physical Interpretation**: The future depends only on the present, independent of history.

**Homogeneity Assumption**: Transition probabilities are time-invariant: $P(s_{t+1} = j \mid s_t = i) = a_{ij}$ for all $t$.

**Transition Matrix**: $A = [a_{ij}] \in \mathbb{R}^{N \times N}$, where $N$ is the number of states.

**Constraint Conditions**: $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$ for every $i$.
### Weather Prediction Example

**State Space**: $\{\text{Sunny}, \text{Rainy}\}$

**Transition Matrix**:

$$A = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix}$$

**Interpretation**: 80% probability of sunny following sunny, 40% probability of clearing after rain.

**Initial Distribution**: $\pi_0 = (P(\text{Sunny}), P(\text{Rainy}))$

**State Distribution After $T$ Steps**: $\pi_T = \pi_0 A^T$

**Stationary Distribution**: When the chain satisfies ergodicity, $\pi^* = \lim_{T \to \infty} \pi_T$ exists and satisfies $\pi^* = \pi^* A$. For the weather example above, the stationary distribution is $\pi^* = (2/3, 1/3)$.
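As a quick check, the $T$-step distribution and the stationary distribution of the weather chain can be computed in a few lines of NumPy (the starting distribution below is an illustrative choice, not from the text):

```python
import numpy as np

# Transition matrix from the weather example:
# rows = current state (Sunny, Rainy), columns = next state.
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

pi0 = np.array([1.0, 0.0])  # assume we start from a sunny day

# Distribution after T steps: pi_T = pi_0 @ A^T
pi_T = pi0 @ np.linalg.matrix_power(A, 50)

# Stationary distribution: left eigenvector of A for eigenvalue 1,
# normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stat = stat / stat.sum()

print(pi_T)   # both converge to approximately (2/3, 1/3)
print(stat)
```

After 50 steps the distribution is already indistinguishable from $\pi^* = (2/3, 1/3)$, since the second eigenvalue of $A$ is $0.4$ and the error decays like $0.4^T$.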
## Three Components and Basic Architecture of HMM

### Core Components

**State Sequence** (unobservable): $S = (s_1, s_2, \dots, s_T)$, $s_t \in \{1, \dots, N\}$

**Observation Sequence** (observable): $O = (o_1, o_2, \dots, o_T)$, $o_t \in \{v_1, \dots, v_M\}$

**Three Parameters**: $\lambda = (\pi, A, B)$

- **Initial State Probabilities**: $\pi = (\pi_1, \dots, \pi_N)$, $\pi_i = P(s_1 = i)$, where $\sum_{i=1}^{N} \pi_i = 1$
- **State Transition Probability Matrix**: $A = [a_{ij}]$, $a_{ij} = P(s_{t+1} = j \mid s_t = i)$
- **Emission Probability Matrix** (observation probabilities): $B = [b_j(k)]$, $b_j(k) = P(o_t = v_k \mid s_t = j)$
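Concretely, the triple $\lambda = (\pi, A, B)$ is just three stochastic arrays. Below is a minimal sketch with $N = 2$ hidden states and $M = 3$ observation symbols; the numbers and symbol names are my own illustrative choices, not values from the text:

```python
import numpy as np

# Toy HMM: hidden states (Sunny, Rainy), observations (walk, shop, clean).
pi = np.array([0.6, 0.4])               # pi_i = P(s_1 = i)
A  = np.array([[0.8, 0.2],              # a_ij = P(s_{t+1} = j | s_t = i)
               [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1],         # b_j(k) = P(o_t = v_k | s_t = j)
               [0.1, 0.4, 0.5]])

# Every parameter must be a proper probability distribution:
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)  # each row of A sums to 1
assert np.allclose(B.sum(axis=1), 1.0)  # each row of B sums to 1
```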
## Forward Algorithm: Left-to-Right Probability Accumulation

### Forward Probability Definition

**Forward Variable**: $\alpha_t(i) = P(o_1, o_2, \dots, o_t, s_t = i \mid \lambda)$ represents the probability of the partial observation sequence being $o_1, \dots, o_t$ up to time $t$ and the state at time $t$ being $i$.

**Termination Objective**: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
### Derivation of Recursive Formula

**Initialization** ($t = 1$):

$$\alpha_1(i) = \pi_i \, b_i(o_1)$$

**Recursion** ($t = 1, \dots, T-1$):

$$\alpha_{t+1}(j) = \sum_{i=1}^{N} P(o_1, \dots, o_t, s_t = i, s_{t+1} = j, o_{t+1} \mid \lambda)$$

**Applying HMM Assumptions**:

- Observation independence: $P(o_{t+1} \mid s_{t+1} = j, o_1, \dots, o_t) = P(o_{t+1} \mid s_{t+1} = j) = b_j(o_{t+1})$
- Markov property: $P(s_{t+1} = j \mid s_t = i, o_1, \dots, o_t) = P(s_{t+1} = j \mid s_t = i) = a_{ij}$

We obtain:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1})$$

**Physical Meaning**: Transition from all possible previous states $i$ to current state $j$, multiplied by the probability of emitting observation $o_{t+1}$.
### Complete Forward Algorithm Procedure

**Input**: $\lambda = (\pi, A, B)$, observation sequence $O = (o_1, \dots, o_T)$

**Step 1 (Initialization)**: For $i = 1, \dots, N$: $\alpha_1(i) = \pi_i b_i(o_1)$

**Step 2 (Recursion)**: For $t = 1, \dots, T-1$ and $j = 1, \dots, N$:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$$

**Step 3 (Termination)**: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

**Complexity**: $O(N^2 T)$ — reduced from the $O(T N^T)$ of brute-force enumeration to polynomial time!
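The three steps above can be sketched directly in NumPy; the toy parameters and observation sequence are illustrative choices of mine, not from the text:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # Step 1: initialization
    for t in range(T - 1):
        # Step 2: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha, alpha[-1].sum()                # Step 3: termination

# Toy parameters (2 states, 3 observation symbols).
pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])

alpha, likelihood = forward(pi, A, B, obs=[0, 1, 2])
print(likelihood)
```

Each recursion step is one matrix-vector product plus an elementwise scale, which is exactly the $O(N^2)$ per-step cost from the complexity analysis.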
### Numerical Stability: Log-Space Computation

**Problem**: Successive multiplication causes underflow (probabilities too small).

**Solution**: Use $\tilde\alpha_t(i) = \log \alpha_t(i)$.

**Recursion Becomes**:

$$\tilde\alpha_{t+1}(j) = \log \sum_{i=1}^{N} e^{\tilde\alpha_t(i) + \log a_{ij}} + \log b_j(o_{t+1})$$

Using the log-sum-exp trick: $\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}$, where $m = \max_i x_i$.
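A log-space version of the same forward pass, using the log-sum-exp trick by hand (toy parameters as before, chosen by me for illustration):

```python
import numpy as np

def log_forward(pi, A, B, obs):
    """Forward algorithm in log space; avoids underflow for long sequences."""
    log_A, log_B = np.log(A), np.log(B)
    log_alpha = np.log(pi) + log_B[:, obs[0]]
    for o in obs[1:]:
        # x[i, j] = log alpha_t(i) + log a_ij; subtract the max before exp
        x = log_alpha[:, None] + log_A
        m = x.max(axis=0)
        log_alpha = m + np.log(np.exp(x - m).sum(axis=0)) + log_B[:, o]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())   # log P(O | lambda)

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])

# Works even for a 999-step sequence, where plain probabilities underflow.
ll = log_forward(pi, A, B, obs=[0, 1, 2] * 333)
```

On short sequences this agrees with the probability-space result to machine precision; on long sequences only the log-space version stays finite.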
## Backward Algorithm: Right-to-Left Probability Backtracking

### Backward Probability Definition

**Backward Variable**: $\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid s_t = i, \lambda)$ represents the probability of the partial observation sequence being $o_{t+1}, \dots, o_T$ from time $t+1$ to $T$, given that the state at time $t$ is $i$.

**Relationship with Forward**: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i) \beta_t(i)$ for any $t$.
### Derivation of Recursive Formula

**Initialization** ($t = T$): $\beta_T(i) = 1$

**Convention**: the probability is 1 when there are no future observations.

**Recursion** ($t = T-1, \dots, 1$):

**Applying HMM Assumptions**:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$$

**Physical Meaning**: Transition from current state $i$ to all possible next states $j$, multiplied by the probability of emitting $o_{t+1}$ and the subsequent backward probabilities.
### Complete Backward Algorithm Procedure

**Input**: $\lambda = (\pi, A, B)$, observation sequence $O = (o_1, \dots, o_T)$

**Step 1 (Initialization)**: For $i = 1, \dots, N$: $\beta_T(i) = 1$

**Step 2 (Recursion)**: For $t = T-1, \dots, 1$ and $i = 1, \dots, N$:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$$

**Step 3 (Verification)**: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i b_i(o_1) \beta_1(i)$ — should be consistent with the forward algorithm result.
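A matching NumPy sketch of the backward pass, including the verification step (same illustrative toy parameters as in the forward example):

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm: returns beta (T x N)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                 # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
obs = [0, 1, 2]

beta = backward(pi, A, B, obs)
# Verification: P(O) = sum_i pi_i * b_i(o_1) * beta_1(i)
likelihood = (pi * B[:, obs[0]] * beta[0]).sum()
print(likelihood)   # matches the forward-algorithm result
```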
### Synergy of Forward-Backward

**Joint Probability Decomposition**: $P(O, s_t = i \mid \lambda) = \alpha_t(i) \beta_t(i)$

**State Occupation Probability (Smoothing)**:

$$\gamma_t(i) = P(s_t = i \mid O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \beta_t(j)}$$

**State Transition Probability**:

$$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}$$

These probabilities are the core of the Baum-Welch algorithm!
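Combining the two passes gives $\gamma$ and $\xi$ directly. A sketch on the same toy HMM (parameters are my illustrative choices):

```python
import numpy as np

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
obs = [0, 1, 2]
T, N = len(obs), len(pi)

# Forward and backward passes.
alpha = np.zeros((T, N)); beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
p_obs = alpha[-1].sum()

# gamma_t(i) = alpha_t(i) beta_t(i) / P(O)
gamma = alpha * beta / p_obs
# xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O)
xi = np.array([
    alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs
    for t in range(T - 1)
])
```

Two sanity checks follow from the definitions: each $\gamma_t$ sums to 1 over states, and marginalizing $\xi_t(i, j)$ over $j$ recovers $\gamma_t(i)$.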
## Viterbi Algorithm: Dynamic Programming for Optimal Path

### Problem Formalization

**Objective**: Find the most likely state sequence

$$S^* = \arg\max_{S} P(S \mid O, \lambda)$$

**Difference from Forward Algorithm**:

- Forward algorithm: **sum** over all paths
- Viterbi algorithm: **maximize** over all paths

### Viterbi Variable Definition

$\delta_t(i)$: Maximum probability among all single paths ending at state $i$ at time $t$:

$$\delta_t(i) = \max_{s_1, \dots, s_{t-1}} P(s_1, \dots, s_{t-1}, s_t = i, o_1, \dots, o_t \mid \lambda)$$

$\psi_t(i)$: Optimal previous state when the state at time $t$ is $i$.
### Derivation of Recursive Formula

**Initialization** ($t = 1$): $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

**Recursion** ($t = 2, \dots, T$):

$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$$

**Backtracking Pointer**:

$$\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij}$$

**Physical Meaning**: Among all paths reaching state $j$, select the previous state with maximum probability.
### Complete Viterbi Algorithm Procedure

**Input**: $\lambda = (\pi, A, B)$, observation sequence $O = (o_1, \dots, o_T)$

**Step 1 (Initialization)**: For $i = 1, \dots, N$: $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

**Step 2 (Recursion)**: For $t = 2, \dots, T$ and $j = 1, \dots, N$:

$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) a_{ij} \right] b_j(o_t), \qquad \psi_t(j) = \arg\max_i \delta_{t-1}(i) a_{ij}$$

**Step 3 (Termination)**: $P^* = \max_i \delta_T(i)$, $s_T^* = \arg\max_i \delta_T(i)$

**Step 4 (Backtracking)**: For $t = T-1, \dots, 1$: $s_t^* = \psi_{t+1}(s_{t+1}^*)$

**Output**: Optimal state sequence $S^* = (s_1^*, \dots, s_T^*)$ and probability $P^*$
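The four steps can be sketched as follows (toy parameters and observations are illustrative choices of mine):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: returns the most likely state path and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # Step 1: initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)           # Step 2: backtracking pointers
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # Step 3: termination
    for t in range(T - 1, 0, -1):                # Step 4: follow pointers backwards
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])

path, p_star = viterbi(pi, A, B, obs=[0, 0, 2])
print(path, p_star)
```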
### Log-Space Viterbi Algorithm

**Numerically Stable Version**:

$$\tilde\delta_t(j) = \max_i \left[ \tilde\delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t)$$

Avoids underflow caused by successive multiplication.
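In log space the Viterbi recursion needs no log-sum-exp at all, since $\max$ commutes with $\log$: products simply become sums. A sketch with the same illustrative toy parameters:

```python
import numpy as np

def log_viterbi(pi, A, B, obs):
    """Viterbi in log space: products become sums, so no underflow."""
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # additions instead of products
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()           # path and log P*

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])

# Decodes a 300-step sequence without any underflow.
path, log_p = log_viterbi(pi, A, B, obs=[0, 1, 2] * 100)
```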
### Viterbi vs Forward Algorithm

**Forward Algorithm**: $\alpha_{t+1}(j) = \left[ \sum_i \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$

**Viterbi Algorithm**: $\delta_{t+1}(j) = \left[ \max_i \delta_t(i) a_{ij} \right] b_j(o_{t+1})$

Simply replacing $\sum$ with $\max$ — both are dynamic programming.
## Baum-Welch Algorithm: Parameter Learning via EM

### Optimizing $A$

**Update Formula**:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

**Physical Meaning**: Expected number of transitions from state $i$ to state $j$, divided by the total expected number of transitions out of state $i$.
### Optimizing $B$

**Discrete Observation Case**: For each state $j$ and observation symbol $v_k$:

$$\hat{b}_j(k) = \frac{\sum_{t=1}^{T} \gamma_t(j) \, \mathbb{1}[o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(j)}$$

**Physical Meaning**: Expected number of emissions of symbol $v_k$ in state $j$, divided by the total expected time in state $j$.

**Continuous Observation Case** (Gaussian distribution): $b_j(o) = \mathcal{N}(o; \mu_j, \Sigma_j)$. Update formulas:

$$\hat{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \, o_t}{\sum_{t=1}^{T} \gamma_t(j)}, \qquad \hat{\Sigma}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \, (o_t - \hat{\mu}_j)(o_t - \hat{\mu}_j)^\top}{\sum_{t=1}^{T} \gamma_t(j)}$$
### Complete Baum-Welch Algorithm Procedure

**Input**: Observation sequence $O = (o_1, \dots, o_T)$, number of states $N$, number of observation symbols $M$

**Initialization**: Randomly initialize $\lambda^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$

**Iteration** (until convergence):

- **E-Step**: Using current parameters $\lambda^{(k)}$, run the forward-backward algorithm to compute $\gamma_t(i)$ and $\xi_t(i, j)$
- **M-Step**: Re-estimate $\pi$, $A$, and $B$ from the expected counts computed in the E-step
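One full EM iteration on a single observation sequence can be sketched compactly; the toy parameters and sequence below are illustrative choices of mine:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch (EM) iteration on a single observation sequence."""
    T, N = len(obs), len(pi)
    # E-step: forward-backward under the current parameters.
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    gamma = alpha * beta / p_obs
    xi = np.array([alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1]) / p_obs
                   for t in range(T - 1)])
    # M-step: re-estimate parameters from expected counts.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, p_obs   # p_obs: likelihood under input params

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
obs = [0, 0, 1, 2, 2, 1, 0]

for _ in range(5):                       # the likelihood is non-decreasing
    pi, A, B, ll = baum_welch_step(pi, A, B, obs)
```

For clarity this sketch works in probability space; a production implementation would use the log-space recursions from the earlier sections.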
## FAQ

**Q1: What is the essential difference between the forward algorithm and the Viterbi algorithm?**

A: Both are dynamic programming, but the forward algorithm **sums over all possible paths**, computing the total probability of the observation sequence; the Viterbi algorithm **maximizes over all possible paths**, finding the optimal state sequence. Mathematically:

$$P(O \mid \lambda) = \sum_{S} P(O, S \mid \lambda) \quad \text{vs.} \quad P^* = \max_{S} P(O, S \mid \lambda)$$

Just the difference between summation and maximization leads the former to give marginal probabilities and the latter to give MAP estimation.
**Q2: Why does Baum-Welch guarantee monotonic increase in likelihood?**

A: This is guaranteed by EM algorithm theory. Each iteration of the EM algorithm ensures:

$$\log P(O \mid \lambda^{(k+1)}) \ge \log P(O \mid \lambda^{(k)})$$

The key is that the Q function constructed in the E-step is a lower bound (ELBO) of the log-likelihood, and maximizing this lower bound in the M-step pushes up the true likelihood. The specific proof relies on Jensen's inequality and the non-negativity of KL divergence.
**Q3: What are the limitations of HMM?**

A: Three major limitations:

1. **Markov assumption too strong**: The future depends only on the current state, ignoring long-range dependencies.
2. **Observation independence assumption**: The current observation depends only on the current state; in reality, observations may depend on each other.
3. **Fixed number of states**: $N$ must be specified in advance and is difficult to determine automatically (this can be addressed with Bayesian nonparametric methods like the iHMM).
**Q4: What advantages does CRF have compared to HMM?**

A: CRF is a discriminative model that directly models $P(S \mid O)$ rather than HMM's generative $P(O, S)$. Advantages:

1. No need for the observation independence assumption; can use overlapping features.
2. More flexible parameter learning; can introduce global features.
3. Usually better performance on sequence labeling tasks (like NER) than HMM.

The disadvantage is higher training complexity (the partition function must be computed).
**Q5: How to choose the number of states in HMM?**

A: Three methods:

1. **Cross-validation**: Test performance with different $N$ on a validation set.
2. **Information criteria**: Use AIC/BIC to balance fit and complexity, e.g. $\text{BIC} = -2 \log P(O \mid \lambda) + k \log T$, where $k$ is the number of parameters.
3. **Bayesian nonparametrics**: Use the iHMM (infinite HMM) to automatically infer the number of states.
**Q6: How to analyze the time complexity of the forward-backward algorithm?**

A: At each time $t$, for each state $j$, the forward algorithm iterates over all $N$ previous states $i$ for the summation, with complexity $O(N)$; over all $N$ states this is $O(N^2)$ per time step, so for $T$ time steps the total complexity is $O(N^2 T)$. Compared to brute-force enumeration's $O(T N^T)$, dynamic programming avoids redundant computation. Space complexity is $O(NT)$ (storing the $\alpha$ and $\beta$ matrices).
**Q7: Why doesn't the Viterbi algorithm directly use $P(S \mid O)$?**

A: Because computing $P(S \mid O)$ requires normalization:

$$P(S \mid O) = \frac{P(O, S \mid \lambda)}{P(O \mid \lambda)}$$

But $P(O \mid \lambda)$ is the same for all state sequences, so maximizing $P(S \mid O)$ is equivalent to maximizing $P(O, S \mid \lambda)$, and Viterbi directly optimizes the joint probability.
**Q8: How to handle continuous observations (like speech signals)?**

A: Use a continuous-observation HMM (CHMM), where emission probabilities become continuous distributions:

$$b_j(o) = \mathcal{N}(o; \mu_j, \Sigma_j)$$

Or Gaussian mixture models (GMM):

$$b_j(o) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(o; \mu_{jm}, \Sigma_{jm})$$

The M-step of Baum-Welch is modified accordingly to use the EM algorithm for updating the Gaussian parameters.
**Q9: Can HMM handle variable-length sequences?**

A: Yes. For sequences of different lengths $\{O^{(1)}, \dots, O^{(K)}\}$, the Baum-Welch algorithm performs the E-step separately for each sequence, and accumulates the expected statistics from all sequences in the M-step.
**Q10: Can the forward algorithm be used for real-time decoding?**

A: Yes! The forward algorithm naturally supports online computation. Each time a new observation $o_{t+1}$ arrives, simply compute:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$$

No future information is needed. But Viterbi requires backtracking and must wait for the sequence to end. For real-time applications, use **online Viterbi** or **beam search** approximations.
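A minimal sketch of this online (filtering) loop: each arriving observation triggers one $O(N^2)$ forward update, and the normalized $\alpha$ immediately gives $P(s_t = i \mid o_1, \dots, o_t)$. Parameters and the observation stream are illustrative choices of mine:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])

alpha = None
for o in [0, 1, 2, 0]:                  # observations arriving one at a time
    if alpha is None:
        alpha = pi * B[:, o]            # first observation: initialization
    else:
        alpha = (alpha @ A) * B[:, o]   # one O(N^2) update per observation
    filt = alpha / alpha.sum()          # P(s_t = i | o_1..o_t), ready now
    print(filt)
```

In a long-running stream one would normalize (or work in log space) at every step to prevent underflow; the normalizer itself accumulates the log-likelihood.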
**Q11: What is the relationship between HMM and RNN?**

A: HMM can be seen as a discrete probabilistic version of RNN. Both model sequences, but:

- HMM: Discrete hidden states, fixed transition probabilities, inference uses dynamic programming.
- RNN: Continuous hidden states, parameterized transition function (neural network), inference uses forward propagation.

In modern deep learning, RNN/LSTM have largely replaced HMM, but HMM-style inference algorithms (like the forward-backward dynamic programming in CTC) are still used in speech recognition.
**Q12: How to evaluate HMM model quality?**

A: Three ways:

1. **Log-likelihood**: Higher $\log P(O \mid \lambda)$ is better (but may overfit).
2. **Decoding accuracy**: Match between the Viterbi-predicted state sequence and the true labels.
3. **Perplexity**: $\text{PPL} = P(O \mid \lambda)^{-1/T}$ — lower is better.
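The likelihood-based metrics follow directly from the forward algorithm. A sketch computing both on the usual illustrative toy HMM:

```python
import numpy as np

# Perplexity from the forward-algorithm likelihood: PPL = P(O)^(-1/T).
pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
obs = [0, 1, 2]

alpha = pi * B[:, obs[0]]               # forward pass
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

log_likelihood = np.log(alpha.sum())    # log P(O | lambda)
perplexity = np.exp(-log_likelihood / len(obs))
print(log_likelihood, perplexity)
```

Perplexity is a per-step geometric mean of inverse probability, which makes models comparable across sequences of different lengths.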
## References
[1] Rabiner, L. R. (1989). A tutorial on hidden Markov models and
selected applications in speech recognition. Proceedings of the
IEEE, 77(2), 257-286.
[2] Baum, L. E., & Petrie, T. (1966). Statistical inference for
probabilistic functions of finite state Markov chains. The Annals of
Mathematical Statistics, 37(6), 1554-1563.
[3] Viterbi, A. (1967). Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm. IEEE Transactions on
Information Theory, 13(2), 260-269.
[4] Jelinek, F. (1997). Statistical methods for speech
recognition. MIT Press.
[5] Eddy, S. R. (1998). Profile hidden Markov models.
Bioinformatics, 14(9), 755-763.
[6] Lafferty, J., McCallum, A., & Pereira, F. C. (2001).
Conditional random fields: Probabilistic models for segmenting and
labeling sequence data. ICML.
[7] Murphy, K. P. (2012). Machine Learning: A Probabilistic
Perspective. MIT Press. (Chapter 17: Hidden Markov Models)
[8] Bishop, C. M. (2006). Pattern Recognition and Machine
Learning. Springer. (Chapter 13: Sequential Data)
[9] Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden
Markov models. Machine Learning, 29(2-3), 245-273.
[10] Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M.
(2006). Hierarchical Dirichlet processes. Journal of the American
Statistical Association, 101(476), 1566-1581.
Post title: Machine Learning Math Derivations (15): Hidden Markov Models
Post author: Chen Kai
Create time: 2025-08-15 00:00:00
Post link: https://www.chenk.top/ml-math-15-hidden-markov-model/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.