Machine Learning Mathematical Derivations (15): Hidden Markov Models
Chen Kai

The Hidden Markov Model (HMM) is a classical tool for sequence modeling — when we observe a series of visible outputs, how do we infer the underlying hidden state sequence? From speech recognition to part-of-speech tagging, from bioinformatics to financial time series, HMM solves the three fundamental problems of probability computation, learning, and prediction through clever dynamic programming algorithms. This chapter delves into the mathematical principles of forward-backward algorithms, optimal path finding with Viterbi, and the EM framework implementation of Baum-Welch.

Markov Chains: From Memoryless to Finite Memory

First-Order Markov Assumption

Classical Markov chain: A state sequence $\{S_1, S_2, \dots, S_T\}$ satisfies

$P(S_{t+1} \mid S_t, S_{t-1}, \dots, S_1) = P(S_{t+1} \mid S_t)$

Physical interpretation: The future depends only on the present, not on the history

Homogeneity assumption: Transition probabilities are time-independent: $P(S_{t+1} = j \mid S_t = i) = a_{ij}$ for all $t$

Transition matrix: $A = [a_{ij}]_{N \times N}$, where $N$ is the number of states

Constraints: $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$ for each $i$ (each row of $A$ is a probability distribution)

Weather Prediction Example

State space: $\{\text{Sunny}, \text{Rainy}\}$. Transition matrix:

$A = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix}$

Interpretation: The probability of sunny after sunny is 80%; the probability of sunny after rain is 40%

Initial distribution: $\boldsymbol{\pi}_0$. State distribution after $T$ steps: $\boldsymbol{\pi}_T = \boldsymbol{\pi}_0 A^T$

Stationary distribution: When $A$ satisfies ergodicity, $\boldsymbol{\pi}^* = \lim_{t \to \infty} \boldsymbol{\pi}_0 A^t$ exists and satisfies $\boldsymbol{\pi}^* A = \boldsymbol{\pi}^*$

For the weather example above, solving $\boldsymbol{\pi}^* A = \boldsymbol{\pi}^*$ gives the stationary distribution $\boldsymbol{\pi}^* = (2/3, 1/3)$: in the long run, two thirds of the days are sunny.
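The stationary distribution can be checked numerically by power iteration — repeatedly applying the transition matrix to any initial distribution. A minimal sketch using the weather example's matrix (state 0 = Sunny, state 1 = Rainy):

```python
import numpy as np

# Transition matrix from the weather example: rows are current state, columns next state.
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# Power iteration: repeatedly apply A to an arbitrary initial distribution.
pi = np.array([1.0, 0.0])
for _ in range(100):
    pi = pi @ A

print(pi)  # converges to the stationary distribution (2/3, 1/3)
```

Convergence is fast here because the second eigenvalue of $A$ is 0.4, so the distance to $\boldsymbol{\pi}^*$ shrinks by a factor of 0.4 each step.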

Three Elements and Basic Architecture of HMM

Core Components

HMM Architecture

State sequence (unobservable): $\mathbf{I} = \{i_1, i_2, \dots, i_T\}$, $i_t \in \{1, 2, \dots, N\}$

Observation sequence (observable): $\mathbf{O} = \{o_1, o_2, \dots, o_T\}$, $o_t \in \{v_1, v_2, \dots, v_M\}$

Three parameters:

  1. Initial state probability: $\boldsymbol{\pi} = (\pi_1, \dots, \pi_N)$, where $\pi_i = P(i_1 = i)$

  2. State transition probability matrix: $A = [a_{ij}]_{N \times N}$, $a_{ij} = P(i_{t+1} = j \mid i_t = i)$

  3. Emission probability matrix (observation probability): $B = [b_j(k)]_{N \times M}$, $b_j(k) = P(o_t = v_k \mid i_t = j)$

Compact notation: $\lambda = (A, B, \boldsymbol{\pi})$

Two Fundamental Assumptions

Homogeneous Markov assumption: $P(i_t \mid i_{t-1}, o_{t-1}, \dots, i_1, o_1) = P(i_t \mid i_{t-1})$

Observation independence assumption: $P(o_t \mid i_T, o_T, \dots, i_1, o_1) = P(o_t \mid i_t)$

Joint probability decomposition: $P(\mathbf{O}, \mathbf{I} \mid \lambda) = \pi_{i_1} b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t} b_{i_t}(o_t)$
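The decomposition translates directly into a product over time steps. A minimal sketch with a hypothetical 2-state, 2-symbol HMM (all parameter values below are illustrative, not taken from the text):

```python
import numpy as np

# Illustrative toy HMM parameters.
pi = np.array([0.6, 0.4])                  # pi_i = P(i_1 = i)
A  = np.array([[0.7, 0.3], [0.4, 0.6]])    # a_ij = P(i_{t+1} = j | i_t = i)
B  = np.array([[0.9, 0.1], [0.2, 0.8]])    # b_j(k) = P(o_t = v_k | i_t = j)

def joint_prob(O, I):
    """P(O, I | lambda) = pi_{i1} b_{i1}(o1) * prod_t a_{i_{t-1} i_t} b_{i_t}(o_t)."""
    p = pi[I[0]] * B[I[0], O[0]]
    for t in range(1, len(O)):
        p *= A[I[t-1], I[t]] * B[I[t], O[t]]
    return p

print(joint_prob([0, 1, 0], [0, 0, 1]))  # probability of one specific (O, I) pair
```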

Part-of-Speech Tagging Example

Observation sequence: ["I", "love", "natural", "language", "processing"]

Hidden states: [Pronoun, Verb, Adjective, Noun, Noun]

State space: $N$ = number of POS tags (e.g., 45)

Observation space: $M$ = vocabulary size (e.g., 50,000)

Emission probability: $b_{\text{Noun}}(\text{"processing"})$ represents the probability of emitting "processing" in the noun state

Three Fundamental Problems of HMM

Problem 1: Probability Computation

Input: Model $\lambda = (A, B, \boldsymbol{\pi})$ and observation sequence $\mathbf{O} = \{o_1, \dots, o_T\}$

Output: Observation sequence probability $P(\mathbf{O} \mid \lambda)$

Application: Model evaluation, sequence classification

Direct computation (brute force): $P(\mathbf{O} \mid \lambda) = \sum_{\mathbf{I}} \pi_{i_1} b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t} b_{i_t}(o_t)$, summing over all $N^T$ state sequences

Complexity: $O(T N^T)$ — for $N = 10$, $T = 100$, on the order of $10^{100}$ computations, infeasible!
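The brute-force sum is easy to write down and is useful as a correctness reference for tiny sequences — a sketch using the same illustrative toy parameters as before (values are not from the text):

```python
import itertools
import numpy as np

# Illustrative toy HMM parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def brute_force_prob(O, N=2):
    """P(O | lambda) by enumerating all N^T state paths -- only feasible for tiny T."""
    total = 0.0
    for I in itertools.product(range(N), repeat=len(O)):
        p = pi[I[0]] * B[I[0], O[0]]
        for t in range(1, len(O)):
            p *= A[I[t-1], I[t]] * B[I[t], O[t]]
        total += p
    return total

print(brute_force_prob([0, 1, 0]))
```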

Problem 2: Learning Problem

Input: Observation sequence $\mathbf{O} = \{o_1, \dots, o_T\}$ (possibly multiple sequences)

Output: Model parameters $\hat{\lambda} = \arg\max_{\lambda} P(\mathbf{O} \mid \lambda)$

Application: Learning the model from data

Algorithm: Baum-Welch algorithm (EM algorithm for HMM)

Problem 3: Prediction Problem (Decoding)

Input: Model $\lambda$ and observation sequence $\mathbf{O}$

Output: Most likely state sequence $\mathbf{I}^* = \arg\max_{\mathbf{I}} P(\mathbf{I} \mid \mathbf{O}, \lambda)$

Application: POS tagging, speech recognition

Algorithm: Viterbi algorithm (dynamic programming)

Forward Algorithm

Forward Algorithm: Left-to-Right Probability Accumulation

Forward Variable Definition

Forward variable: $\alpha_t(i) = P(o_1, o_2, \dots, o_t, i_t = i \mid \lambda)$ represents the probability of the partial observation sequence $o_1, \dots, o_t$ up to time $t$ with state $i$ at time $t$

Terminal objective: $P(\mathbf{O} \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$


Recursion Formula Derivation

Initialization ($t = 1$): $\alpha_1(i) = \pi_i b_i(o_1), \quad i = 1, \dots, N$

Recursion ($t = 1, \dots, T-1$):

$\alpha_{t+1}(j) = \sum_{i=1}^{N} P(o_1, \dots, o_t, o_{t+1}, i_t = i, i_{t+1} = j \mid \lambda)$

Applying HMM assumptions:

  • Observation independence: $P(o_{t+1} \mid i_{t+1} = j, \cdot) = b_j(o_{t+1})$
  • Markov property: $P(i_{t+1} = j \mid i_t = i, \cdot) = a_{ij}$

Result: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$

Physical meaning: Sum over all possible previous states $i$ transitioning to the current state $j$, multiplied by the emission probability for $o_{t+1}$

Complete Forward Algorithm

Input: $\lambda = (A, B, \boldsymbol{\pi})$ and observation sequence $\mathbf{O}$

Step 1 (Initialization): For $i = 1, \dots, N$: $\alpha_1(i) = \pi_i b_i(o_1)$

Step 2 (Recursion): For $t = 1, \dots, T-1$ and $j = 1, \dots, N$: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$

Step 3 (Termination): $P(\mathbf{O} \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Complexity: $O(N^2 T)$ — reduced from $O(T N^T)$ to polynomial time!
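The three steps above fit in a few lines of vectorized code. A minimal sketch, reusing the same illustrative toy parameters (not from the text); note it returns the same probability as brute-force enumeration:

```python
import numpy as np

# Illustrative toy HMM parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def forward(O):
    alpha = pi * B[:, O[0]]            # Step 1: alpha_1(i) = pi_i b_i(o_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]  # Step 2: O(N^2) per time step
    return alpha.sum()                 # Step 3: P(O|lambda) = sum_i alpha_T(i)

print(forward([0, 1, 0]))
```

Each recursion step is a matrix-vector product followed by an elementwise multiply, which is exactly where the $O(N^2 T)$ total cost comes from.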

Numerical Stability: Log-Space Computation

Problem: Sequential multiplication of probabilities smaller than 1 causes floating-point underflow for long sequences

Solution: Work with $\log \alpha_t(i)$. The recursion becomes: $\log \alpha_{t+1}(j) = \log b_j(o_{t+1}) + \log \sum_{i=1}^{N} \exp\left( \log \alpha_t(i) + \log a_{ij} \right)$

Using the $\operatorname{logsumexp}$ trick: $\log \sum_i e^{x_i} = x^* + \log \sum_i e^{x_i - x^*}$, where $x^* = \max_i x_i$, so every exponent is at most 0 and cannot overflow
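A log-space forward pass is a small change to the plain version. The sketch below implements $\operatorname{logsumexp}$ by hand and reuses the same illustrative toy parameters (not from the text); exponentiating the result recovers the ordinary forward probability:

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))) computed stably: shift by the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# Illustrative toy HMM parameters, moved to log space once up front.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

def log_forward(O):
    log_alpha = log_pi + log_B[:, O[0]]
    for o in O[1:]:
        # log alpha_{t+1}(j) = logsumexp_i(log alpha_t(i) + log a_ij) + log b_j(o)
        log_alpha = np.array([logsumexp(log_alpha + log_A[:, j])
                              for j in range(len(pi))]) + log_B[:, o]
    return logsumexp(log_alpha)

print(np.exp(log_forward([0, 1, 0])))  # matches the plain forward probability
```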

Backward Algorithm: Right-to-Left Probability Backtracking

Backward Variable Definition

Backward variable: $\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid i_t = i, \lambda)$ represents the probability of the partial observation sequence from $t+1$ to $T$ given state $i$ at time $t$

Relationship with forward: $P(\mathbf{O} \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i) \beta_t(i)$ for any $t$

Recursion Formula Derivation

Initialization ($t = T$): $\beta_T(i) = 1, \quad i = 1, \dots, N$

Convention: Probability is 1 when no future observations

Recursion ($t = T-1, T-2, \dots, 1$):

Applying HMM assumptions: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$

Physical meaning: Sum over all possible next states $j$, multiplied by the emission probability for $o_{t+1}$ and the subsequent backward probability $\beta_{t+1}(j)$
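The backward recursion mirrors the forward one, just run right to left. A sketch with the same illustrative toy parameters (not from the text); the sanity check at the end uses $P(\mathbf{O} \mid \lambda) = \sum_i \pi_i b_i(o_1) \beta_1(i)$, which agrees with the forward result:

```python
import numpy as np

# Illustrative toy HMM parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def backward(O):
    beta = np.ones(len(pi))           # beta_T(i) = 1
    for o in reversed(O[1:]):
        beta = A @ (B[:, o] * beta)   # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return beta                       # beta_1

O = [0, 1, 0]
print(np.sum(pi * B[:, O[0]] * backward(O)))  # equals P(O|lambda) from the forward pass
```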

Forward-Backward Synergy

Joint probability decomposition: $P(\mathbf{O}, i_t = i \mid \lambda) = \alpha_t(i) \beta_t(i)$

State occupancy probability (smoothing): $\gamma_t(i) = P(i_t = i \mid \mathbf{O}, \lambda) = \dfrac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \beta_t(j)}$

State transition probability: $\xi_t(i, j) = P(i_t = i, i_{t+1} = j \mid \mathbf{O}, \lambda) = \dfrac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{\sum_{i'} \sum_{j'} \alpha_t(i') a_{i'j'} b_{j'}(o_{t+1}) \beta_{t+1}(j')}$

These probabilities are core to the Baum-Welch algorithm!
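Combining both passes gives the smoothed posteriors $\gamma_t(i)$. A sketch with the same illustrative toy parameters (not from the text):

```python
import numpy as np

# Illustrative toy HMM parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def forward_backward(O):
    """Return alpha, beta, and the smoothed posteriors gamma_t(i)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize: each row sums to 1
    return alpha, beta, gamma

alpha, beta, gamma = forward_backward([0, 1, 0])
print(gamma)  # row t is the posterior P(i_t = . | O, lambda)
```

The quantity $\xi_t(i,j)$ is built from the same arrays ($\alpha_t$, $a_{ij}$, $b_j(o_{t+1})$, $\beta_{t+1}$) and appears in the Baum-Welch sketch later.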

Viterbi Algorithm: Dynamic Programming for Optimal Path

Viterbi Path

Problem Formalization

Objective: Find the most likely state sequence $\mathbf{I}^* = \arg\max_{\mathbf{I}} P(\mathbf{I} \mid \mathbf{O}, \lambda)$

Difference from the forward algorithm:

  • Forward algorithm: Sums over all paths ($\sum$)
  • Viterbi algorithm: Maximizes over all paths ($\max$)

Viterbi Algorithm Structure

Viterbi Variable Definition

$\delta_t(i)$: Maximum probability among all single paths ending at state $i$ at time $t$: $\delta_t(i) = \max_{i_1, \dots, i_{t-1}} P(i_1, \dots, i_{t-1}, i_t = i, o_1, \dots, o_t \mid \lambda)$

$\psi_t(i)$: Optimal previous state when at state $i$ at time $t$

Recursion Formula

Initialization ($t = 1$): $\delta_1(i) = \pi_i b_i(o_1), \quad \psi_1(i) = 0$

Recursion ($t = 2, \dots, T$): $\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) a_{ij} \right] b_j(o_t)$

Backtracking pointer: $\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i) a_{ij} \right]$

Physical meaning: Among all paths reaching state $j$, select the previous state $i$ with maximum probability

Complete Viterbi Algorithm

Input: $\lambda$ and observation sequence $\mathbf{O}$

Step 1 (Initialization): For $i = 1, \dots, N$: $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

Step 2 (Recursion): For $t = 2, \dots, T$ and $j = 1, \dots, N$: $\delta_t(j) = \max_i \left[ \delta_{t-1}(i) a_{ij} \right] b_j(o_t)$, $\psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i) a_{ij} \right]$

Step 3 (Termination): $P^* = \max_{i} \delta_T(i)$, $i_T^* = \arg\max_{i} \delta_T(i)$

Step 4 (Backtracking): For $t = T-1, T-2, \dots, 1$: $i_t^* = \psi_{t+1}(i_{t+1}^*)$

Output: Optimal state sequence $\mathbf{I}^* = (i_1^*, \dots, i_T^*)$ and probability $P^*$

Complexity: $O(N^2 T)$
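The four steps above can be sketched directly; only the $\sum$ in the forward recursion becomes a $\max$ plus a backpointer. Same illustrative toy parameters as before (not from the text):

```python
import numpy as np

# Illustrative toy HMM parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def viterbi(O):
    T, N = len(O), len(pi)
    delta = np.zeros((T, N)); psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                   # Step 1
    for t in range(1, T):                        # Step 2
        trans = delta[t-1][:, None] * A          # trans[i, j] = delta_{t-1}(i) a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, O[t]]
    path = [int(delta[-1].argmax())]             # Step 3: best final state
    for t in range(T - 1, 0, -1):                # Step 4: follow backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

path, p_star = viterbi([0, 1, 0])
print(path, p_star)
```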

Baum-Welch Algorithm: EM Framework for HMM

Learning Problem Formalization

Observed data: $\mathbf{O} = \{o_1, \dots, o_T\}$

Latent variables: $\mathbf{I} = \{i_1, \dots, i_T\}$

Parameters: $\lambda = (A, B, \boldsymbol{\pi})$

Objective: Maximize the likelihood $\log P(\mathbf{O} \mid \lambda)$

Difficulty: The log-likelihood contains a summation over latent variables: $\log P(\mathbf{O} \mid \lambda) = \log \sum_{\mathbf{I}} P(\mathbf{O}, \mathbf{I} \mid \lambda)$

Cannot optimize directly by differentiation!

EM Algorithm Application

E-step: Compute the Q-function $Q(\lambda, \lambda^{(k)}) = \sum_{\mathbf{I}} P(\mathbf{I} \mid \mathbf{O}, \lambda^{(k)}) \log P(\mathbf{O}, \mathbf{I} \mid \lambda)$

M-step: Maximize the Q-function: $\lambda^{(k+1)} = \arg\max_{\lambda} Q(\lambda, \lambda^{(k)})$

E-step Detailed Derivation

Complete data log-likelihood: $\log P(\mathbf{O}, \mathbf{I} \mid \lambda) = \log \pi_{i_1} + \sum_{t=1}^{T-1} \log a_{i_t i_{t+1}} + \sum_{t=1}^{T} \log b_{i_t}(o_t)$

Q-function expansion: Taking the expectation of the three terms under $P(\mathbf{I} \mid \mathbf{O}, \lambda^{(k)})$ splits $Q$ into separate terms for $\boldsymbol{\pi}$, $A$, and $B$

Introducing expected statistics: $\gamma_t(i) = P(i_t = i \mid \mathbf{O}, \lambda^{(k)})$ and $\xi_t(i, j) = P(i_t = i, i_{t+1} = j \mid \mathbf{O}, \lambda^{(k)})$

These are exactly what the forward-backward algorithm computes!

The Q-function simplifies to: $Q = \sum_{i=1}^{N} \gamma_1(i) \log \pi_i + \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \xi_t(i, j) \log a_{ij} + \sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_t(j) \log b_j(o_t)$

M-step Update Formulas

Optimal $\boldsymbol{\pi}$: $\hat{\pi}_i = \gamma_1(i)$

Optimal $A$: $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

Physical meaning: Expected number of transitions from state $i$ to state $j$ divided by the expected total number of transitions from state $i$

Optimal $B$ (discrete observations): $\hat{b}_j(k) = \dfrac{\sum_{t=1}^{T} \gamma_t(j) \, \mathbb{1}[o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(j)}$

Physical meaning: Expected number of emissions of symbol $v_k$ in state $j$ divided by the expected total time spent in state $j$
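One full Baum-Welch iteration (E-step via forward-backward, M-step via the expected-count ratios above) can be sketched on a single observation sequence. The initial parameter values below are illustrative, not from the text:

```python
import numpy as np

# Illustrative initial HMM parameters for one EM iteration.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def baum_welch_step(O, pi, A, B):
    T, N, M = len(O), A.shape[0], B.shape[1]
    # E-step: forward and backward variables.
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])
    prob = alpha[-1].sum()                        # P(O | lambda)
    gamma = alpha * beta / prob                   # gamma[t, i]
    xi = np.array([np.outer(alpha[t], B[:, O[t+1]] * beta[t+1]) * A / prob
                   for t in range(T - 1)])        # xi[t, i, j]
    # M-step: expected-count ratios.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        mask = np.array(O) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step([0, 1, 0, 0, 1], pi, A, B)
print(new_A.sum(axis=1), new_B.sum(axis=1))  # rows remain valid distributions
```

In practice this step is repeated until the likelihood stops improving; long sequences require the log-space version of the E-step.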

Q&A

Q1: What is the essential difference between the forward algorithm and Viterbi algorithm?

A: Both are dynamic programming, but the forward algorithm sums over all possible paths, computing the total probability of the observation sequence; the Viterbi algorithm maximizes over all possible paths, finding the optimal state sequence.

Q2: Why does Baum-Welch guarantee monotonic likelihood increase?

A: This is a theoretical guarantee of the EM algorithm. Each EM iteration guarantees: $\log P(\mathbf{O} \mid \lambda^{(k+1)}) \ge \log P(\mathbf{O} \mid \lambda^{(k)})$

The key is that the Q-function in the E-step is a lower bound (ELBO) of the log-likelihood, and maximizing this lower bound in the M-step pushes up the true likelihood.

Q3: What are the limitations of HMM?

A: Three major limitations:

  1. Strong Markov assumption: The future depends only on the current state, ignoring long-range dependencies
  2. Observation independence assumption: The current observation depends only on the current state; in reality, observations may be interdependent
  3. Fixed number of states: $N$ must be specified in advance and is difficult to determine automatically (can be addressed with Bayesian nonparametric methods such as the iHMM)

Q4: What are the advantages of CRF over HMM?

A: CRF is a discriminative model that directly models $P(\mathbf{I} \mid \mathbf{O})$ rather than the generative $P(\mathbf{O}, \mathbf{I})$ of HMM. Advantages:

  1. No observation independence assumption; can use overlapping features
  2. More flexible parameter learning; can introduce global features
  3. Generally better performance on sequence labeling tasks (e.g., NER)

Disadvantage: Higher training complexity (requires computing partition function).

✏️ Exercises and Solutions

Exercise 1: Forward Algorithm

Problem: An HMM with 2 states (Sunny, Rainy); the parameters $\boldsymbol{\pi}$, $A$, $B$ and an observation sequence are given. Compute $P(\mathbf{O} \mid \lambda)$.

Solution: Apply the forward algorithm: initialize $\alpha_1(i) = \pi_i b_i(o_1)$, recurse $\alpha_{t+1}(j) = \left[\sum_i \alpha_t(i) a_{ij}\right] b_j(o_{t+1})$, and terminate with $P(\mathbf{O} \mid \lambda) = \sum_i \alpha_T(i)$.


Exercise 2: Viterbi Decoding

Problem: Using the parameters from Exercise 1, find the most likely state sequence.

Solution: Run the Viterbi recursion to fill in $\delta_t(i)$ and $\psi_t(i)$, take the state with maximum $\delta_T(i)$, and backtrack through $\psi$ to obtain the optimal sequence $\mathbf{I}^*$.


Exercise 3: Complexity

Problem: An HMM with $N$ states and an observation sequence of length $T$. Compare the forward algorithm with brute-force enumeration.

Solution: Forward: $O(N^2 T)$; brute force: $O(T N^T)$. Dynamic programming reduces exponential to polynomial time!


Exercise 4: Baum-Welch Update

Problem: Derive the Baum-Welch update formula for $\hat{a}_{ij}$.

Solution: Maximizing $Q$ under the constraint $\sum_j a_{ij} = 1$ with a Lagrange multiplier gives $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, where $\xi_t(i, j) = \dfrac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(\mathbf{O} \mid \lambda)}$.


Exercise 5: HMM vs Naive Bayes

Problem: Compare differences.

Solution: HMM models sequence dependencies (state transitions); Naive Bayes assumes sample independence. HMM inference uses dynamic programming in $O(N^2 T)$; NB uses the Bayes formula $P(y \mid \mathbf{x}) \propto P(y) \prod_i P(x_i \mid y)$.

References

[1] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.

[2] Baum, L. E., & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6), 1554-1563.

[3] Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260-269.

[4] Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press.

[5] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. (Chapter 17: Hidden Markov Models)

  • Post title: Machine Learning Mathematical Derivations (15): Hidden Markov Models
  • Post author: Chen Kai
  • Create time: 2021-11-17 10:00:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-15-Hidden-Markov-Models/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.