Machine Learning Math Derivations (15): Hidden Markov Models
Chen Kai

Hidden Markov Models (HMM) are classical tools for sequence modeling - when we observe a series of visible outputs, how can we infer the underlying hidden state sequence? From speech recognition to part-of-speech tagging, from bioinformatics to financial time series, HMMs solve three fundamental problems (probability computation, learning, and prediction) through elegant dynamic programming algorithms. This chapter systematically derives the mathematical principles of forward-backward algorithms, optimal path solving with the Viterbi algorithm, and the EM framework implementation of the Baum-Welch algorithm.

Markov Chains: From Memorylessness to Finite Memory

First-Order Markov Assumption

Classical Markov Chain: A state sequence $s_1, s_2, \ldots$ satisfies

$$P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_1) = P(s_{t+1} \mid s_t)$$

Physical Interpretation: The future depends only on the present, independent of history.

Homogeneity Assumption: Transition probabilities are time-invariant: $a_{ij} = P(s_{t+1} = j \mid s_t = i)$ for all $t$.

Transition Matrix: $A = [a_{ij}] \in \mathbb{R}^{N \times N}$, where $N$ is the number of states.

Constraint Conditions: $a_{ij} \geq 0$ and $\sum_{j=1}^{N} a_{ij} = 1$ for every row $i$.

Weather Prediction Example

State Space: $\{\text{Sunny}, \text{Rainy}\}$

Transition Matrix:

$$A = \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{pmatrix}$$

Interpretation: 80% probability of sunny following sunny, 40% probability of clearing after rain.

Initial Distribution: $\boldsymbol{\pi}_0$

State Distribution After $T$ Steps: $\boldsymbol{\pi}_T = \boldsymbol{\pi}_0 A^T$

Stationary Distribution: When $A$ is ergodic, $\boldsymbol{\pi}^* = \lim_{T \to \infty} \boldsymbol{\pi}_0 A^T$ exists and satisfies $\boldsymbol{\pi}^* = \boldsymbol{\pi}^* A$. For the weather example above, the stationary distribution is $\boldsymbol{\pi}^* = (2/3, 1/3)$.
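The stationary distribution can be verified numerically; a minimal sketch using the transition matrix of the weather example above:

```python
import numpy as np

# Weather transition matrix: rows index the current state (Sunny, Rainy)
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# Power iteration: pi_0 A^T converges to the stationary distribution
pi = np.array([1.0, 0.0])   # start deterministically in "Sunny"
for _ in range(100):
    pi = pi @ A

print(pi)                   # converges toward (2/3, 1/3)

# Fixed-point check: pi* = pi* A
assert np.allclose(pi, pi @ A)
```

Convergence is fast here because the second eigenvalue of $A$ is 0.4, so the error shrinks by a factor of 0.4 per step.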

Three Components and Basic Architecture of HMM

Core Components

State Sequence (unobservable): $\mathbf{S} = (s_1, s_2, \ldots, s_T)$, each $s_t \in \{q_1, \ldots, q_N\}$

Observation Sequence (observable): $\mathbf{O} = (o_1, o_2, \ldots, o_T)$, each $o_t \in \{v_1, \ldots, v_M\}$

Three Parameters:

  1. Initial State Probabilities: $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)$, where $\pi_i = P(s_1 = q_i)$

  2. State Transition Probability Matrix: $A = [a_{ij}]$, where $a_{ij} = P(s_{t+1} = q_j \mid s_t = q_i)$

  3. Emission Probability Matrix (observation probabilities): $B = [b_j(k)]$, where $b_j(k) = P(o_t = v_k \mid s_t = q_j)$

Compact Representation: $\boldsymbol{\lambda} = (\boldsymbol{\pi}, A, B)$

Two Fundamental Assumptions

Homogeneous Markov Assumption:

$$P(s_t \mid s_{t-1}, \ldots, s_1, o_{t-1}, \ldots, o_1) = P(s_t \mid s_{t-1})$$

Observation Independence Assumption:

$$P(o_t \mid s_T, \ldots, s_1, o_T, \ldots, o_{t+1}, o_{t-1}, \ldots, o_1) = P(o_t \mid s_t)$$

Joint Probability Decomposition:

$$P(\mathbf{O}, \mathbf{S} \mid \boldsymbol{\lambda}) = \pi_{s_1} b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)$$

Part-of-Speech Tagging Example

Observation Sequence: ["I", "love", "natural", "language", "processing"]

Hidden States: [Pronoun, Verb, Adjective, Noun, Noun]

State Space: $N$ = number of POS tags (e.g., 45 types)

Observation Space: $M$ = vocabulary size (e.g., 50,000)

Emission Probability: $b_{\text{Noun}}(\text{"processing"})$ represents the probability of emitting "processing" in the noun state

Three Fundamental Problems of HMM

Problem 1: Probability Computation

Input: Model $\boldsymbol{\lambda}$, observation sequence $\mathbf{O}$

Output: Observation sequence probability $P(\mathbf{O} \mid \boldsymbol{\lambda})$

Application: Model evaluation, sequence classification

Direct Computation (brute force enumeration):

$$P(\mathbf{O} \mid \boldsymbol{\lambda}) = \sum_{\mathbf{S}} \pi_{s_1} b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)$$

Complexity: $O(T N^T)$ - even for modest $N$ and $T$ (e.g., $N = 5$, $T = 100$) this is an astronomical number of computations - infeasible!
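For a toy model the brute-force sum is still computable, which makes the exponential blow-up concrete; a minimal sketch (the parameter values are illustrative):

```python
import numpy as np
from itertools import product

def brute_force_likelihood(pi, A, B, obs):
    """P(O|lambda) by enumerating all N^T state sequences.
    Each path contributes pi_{s1} b_{s1}(o1) * prod_t a_{s_{t-1}s_t} b_{s_t}(o_t).
    Feasible only for tiny N and T."""
    N, T = len(pi), len(obs)
    total = 0.0
    for states in product(range(N), repeat=T):
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, T):
            p *= A[states[t-1], states[t]] * B[states[t], obs[t]]
        total += p
    return total

# Illustrative 2-state model: 2^3 = 8 paths are enumerated
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(brute_force_likelihood(pi, A, B, [0, 1, 0]))
```

Already at $N = 5$, $T = 100$ the loop would run $5^{100}$ times, which is why the forward algorithm below matters.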

Problem 2: Learning Problem

Input: Observation sequence $\mathbf{O}$ (possibly multiple sequences)

Output: Model parameters $\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} P(\mathbf{O} \mid \boldsymbol{\lambda})$

Application: Learning model from data

Algorithm: Baum-Welch algorithm (EM algorithm for HMM)

Problem 3: Prediction Problem (Decoding)

Input: Model $\boldsymbol{\lambda}$, observation sequence $\mathbf{O}$

Output: Most likely state sequence $\mathbf{S}^* = \arg\max_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda})$

Application: POS tagging, speech recognition

Algorithm: Viterbi algorithm (dynamic programming)

Forward Algorithm: Left-to-Right Probability Accumulation

Forward Probability Definition

Forward Variable: $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, s_t = q_i \mid \boldsymbol{\lambda})$ represents the probability of observing the partial sequence $o_1, \ldots, o_t$ up to time $t$ and being in state $q_i$ at time $t$.

Termination Objective: $P(\mathbf{O} \mid \boldsymbol{\lambda}) = \sum_{i=1}^{N} \alpha_T(i)$

Derivation of Recursive Formula

Initialization ($t = 1$):

$$\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$

Recursion ($t = 1, \ldots, T-1$):

$$\alpha_{t+1}(j) = P(o_1, \ldots, o_{t+1}, s_{t+1} = q_j \mid \boldsymbol{\lambda}) = \sum_{i=1}^{N} P(o_1, \ldots, o_t, s_t = q_i, s_{t+1} = q_j, o_{t+1} \mid \boldsymbol{\lambda})$$

Applying HMM Assumptions:

  • Observation independence: $P(o_{t+1} \mid s_{t+1} = q_j, \cdot) = b_j(o_{t+1})$
  • Markov property: $P(s_{t+1} = q_j \mid s_t = q_i, \cdot) = a_{ij}$

We obtain:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$$

Physical Meaning: Transition from all possible previous states $q_i$ to current state $q_j$, multiplied by the probability of emitting observation $o_{t+1}$.

Complete Forward Algorithm Procedure

Input: $\boldsymbol{\lambda} = (\boldsymbol{\pi}, A, B)$, observation sequence $\mathbf{O}$

Step 1 (Initialization): For $1 \le i \le N$: $\alpha_1(i) = \pi_i b_i(o_1)$

Step 2 (Recursion): For $t = 1, \ldots, T-1$ and $1 \le j \le N$:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$$

Step 3 (Termination): $P(\mathbf{O} \mid \boldsymbol{\lambda}) = \sum_{i=1}^{N} \alpha_T(i)$

Complexity: $O(N^2 T)$ - reduced from $O(T N^T)$ to polynomial time!

Numerical Stability: Log-Space Computation

Problem: Successive multiplication causes underflow (probabilities become too small)

Solution: Work with $\tilde{\alpha}_t(i) = \log \alpha_t(i)$. The recursion becomes

$$\tilde{\alpha}_{t+1}(j) = \log \sum_{i=1}^{N} \exp\left( \tilde{\alpha}_t(i) + \log a_{ij} \right) + \log b_j(o_{t+1})$$

Using the log-sum-exp trick: $\log \sum_i \exp(x_i) = x_{\max} + \log \sum_i \exp(x_i - x_{\max})$
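The trick can be demonstrated directly; a minimal sketch showing where naive summation fails:

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))) computed stably by factoring out the maximum."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# 200 log-probabilities of -800*ln(10) each: exp() underflows to 0 in float64
log_p = np.full(200, -800 * np.log(10))

naive = np.log(np.sum(np.exp(log_p)))   # -inf: all exp() terms underflow
stable = logsumexp(log_p)               # correct: log(200) - 800*ln(10)
print(naive, stable)
```

The stable version only ever exponentiates differences from the maximum, which are bounded above by zero.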

Backward Algorithm: Right-to-Left Probability Backtracking

Backward Probability Definition

Backward Variable: $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid s_t = q_i, \boldsymbol{\lambda})$ represents the probability of the partial observation sequence from time $t+1$ to $T$, given that the state at time $t$ is $q_i$.

Relationship with Forward: $P(\mathbf{O} \mid \boldsymbol{\lambda}) = \sum_{i=1}^{N} \alpha_t(i) \beta_t(i)$ for any $t$

Derivation of Recursive Formula

Initialization ($t = T$): $\beta_T(i) = 1$ for $1 \le i \le N$. Convention: the probability is 1 when there are no future observations.

Recursion ($t = T-1, \ldots, 1$):

$$\beta_t(i) = P(o_{t+1}, \ldots, o_T \mid s_t = q_i, \boldsymbol{\lambda})$$

Applying HMM Assumptions:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$$

Physical Meaning: Transition from current state $q_i$ to all possible next states $q_j$, multiplied by the probability of emitting $o_{t+1}$ and the subsequent backward probabilities.

Complete Backward Algorithm Procedure

Input: $\boldsymbol{\lambda} = (\boldsymbol{\pi}, A, B)$, observation sequence $\mathbf{O}$

Step 1 (Initialization): For $1 \le i \le N$: $\beta_T(i) = 1$

Step 2 (Recursion): For $t = T-1, \ldots, 1$ and $1 \le i \le N$:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$$

Step 3 (Verification): $P(\mathbf{O} \mid \boldsymbol{\lambda}) = \sum_{i=1}^{N} \pi_i b_i(o_1) \beta_1(i)$, which should be consistent with the forward algorithm result.

Synergy of Forward-Backward

Joint Probability Decomposition:

$$P(\mathbf{O}, s_t = q_i \mid \boldsymbol{\lambda}) = \alpha_t(i) \beta_t(i)$$

State Occupation Probability (Smoothing):

$$\gamma_t(i) = P(s_t = q_i \mid \mathbf{O}, \boldsymbol{\lambda}) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \beta_t(j)}$$

State Transition Probability:

$$\xi_t(i, j) = P(s_t = q_i, s_{t+1} = q_j \mid \mathbf{O}, \boldsymbol{\lambda}) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(\mathbf{O} \mid \boldsymbol{\lambda})}$$

These probabilities are the core of the Baum-Welch algorithm!

Viterbi Algorithm: Dynamic Programming for Optimal Path

Problem Formalization

Objective: Find the most likely state sequence

$$\mathbf{S}^* = \arg\max_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda})$$

Difference from Forward Algorithm:
  • Forward algorithm: sum over all paths
  • Viterbi algorithm: maximize over all paths

Viterbi Variable Definition

$\delta_t(i)$: Maximum probability among all single paths ending in state $q_i$ at time $t$:

$$\delta_t(i) = \max_{s_1, \ldots, s_{t-1}} P(s_1, \ldots, s_{t-1}, s_t = q_i, o_1, \ldots, o_t \mid \boldsymbol{\lambda})$$

$\psi_t(i)$: Optimal previous state when the state at time $t$ is $q_i$

Derivation of Recursive Formula

Initialization ($t = 1$): $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

Recursion ($t = 2, \ldots, T$):

$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) a_{ij} \right] b_j(o_t)$$

Backtracking Pointer:

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i) a_{ij} \right]$$

Physical Meaning: Among all paths reaching state $q_j$, select the previous state with maximum probability.

Complete Viterbi Algorithm Procedure

Input: $\boldsymbol{\lambda} = (\boldsymbol{\pi}, A, B)$, observation sequence $\mathbf{O}$

Step 1 (Initialization): For $1 \le i \le N$: $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$

Step 2 (Recursion): For $t = 2, \ldots, T$ and $1 \le j \le N$:

$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) a_{ij} \right] b_j(o_t), \quad \psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i) a_{ij} \right]$$

Step 3 (Termination): $P^* = \max_i \delta_T(i)$, $s_T^* = \arg\max_i \delta_T(i)$

Step 4 (Backtracking): For $t = T-1, \ldots, 1$: $s_t^* = \psi_{t+1}(s_{t+1}^*)$

Output: Optimal state sequence $\mathbf{S}^* = (s_1^*, \ldots, s_T^*)$ and probability $P^*$

Log-Space Viterbi Algorithm

Numerically Stable Version:

$$\log \delta_t(j) = \max_i \left[ \log \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t)$$

This avoids the underflow caused by successive multiplication.

Viterbi vs Forward Algorithm

Forward Algorithm:

$$\alpha_{t+1}(j) = \left[ \sum_{i} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$$

Viterbi Algorithm:

$$\delta_{t+1}(j) = \left[ \max_{i} \delta_t(i) a_{ij} \right] b_j(o_{t+1})$$

Simply replacing $\sum$ with $\max$ - both are dynamic programming.

Baum-Welch Algorithm: EM Framework for HMM

Formalization of Learning Problem

Observed Data: $\mathbf{O} = (o_1, \ldots, o_T)$

Hidden Variables: $\mathbf{S} = (s_1, \ldots, s_T)$

Parameters: $\boldsymbol{\lambda} = (\boldsymbol{\pi}, A, B)$

Objective: Maximize the likelihood $P(\mathbf{O} \mid \boldsymbol{\lambda})$

Difficulty: The log-likelihood contains a summation over the hidden variables:

$$\log P(\mathbf{O} \mid \boldsymbol{\lambda}) = \log \sum_{\mathbf{S}} P(\mathbf{O}, \mathbf{S} \mid \boldsymbol{\lambda})$$

We cannot directly differentiate this for optimization!

Application of EM Algorithm Framework

E-Step: Compute the Q function

$$Q(\boldsymbol{\lambda}, \boldsymbol{\lambda}^{(k)}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda}^{(k)}) \log P(\mathbf{O}, \mathbf{S} \mid \boldsymbol{\lambda})$$

M-Step: Maximize the Q function

$$\boldsymbol{\lambda}^{(k+1)} = \arg\max_{\boldsymbol{\lambda}} Q(\boldsymbol{\lambda}, \boldsymbol{\lambda}^{(k)})$$

Detailed E-Step Derivation

Complete Data Log-Likelihood:

$$\log P(\mathbf{O}, \mathbf{S} \mid \boldsymbol{\lambda}) = \log \pi_{s_1} + \sum_{t=2}^{T} \log a_{s_{t-1} s_t} + \sum_{t=1}^{T} \log b_{s_t}(o_t)$$

Q Function Expansion:

$$Q(\boldsymbol{\lambda}, \boldsymbol{\lambda}^{(k)}) = \mathbb{E}_{\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda}^{(k)}} \left[ \log \pi_{s_1} + \sum_{t=2}^{T} \log a_{s_{t-1} s_t} + \sum_{t=1}^{T} \log b_{s_t}(o_t) \right]$$

Introducing Expected Statistics:

$$\gamma_t(i) = P(s_t = q_i \mid \mathbf{O}, \boldsymbol{\lambda}^{(k)}), \quad \xi_t(i, j) = P(s_t = q_i, s_{t+1} = q_j \mid \mathbf{O}, \boldsymbol{\lambda}^{(k)})$$

These are exactly what the forward-backward algorithm computes!

Q Function Simplifies to:

$$Q = \sum_{i=1}^{N} \gamma_1(i) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j} \xi_t(i,j) \log a_{ij} + \sum_{t=1}^{T} \sum_{j} \gamma_t(j) \log b_j(o_t)$$

Detailed M-Step Derivation

Independent Parameter Optimization: Q function can be decomposed into three parts

Optimizing $\boldsymbol{\pi}$

Objective: $\max_{\boldsymbol{\pi}} \sum_{i=1}^{N} \gamma_1(i) \log \pi_i$ subject to $\sum_i \pi_i = 1$

Lagrangian:

$$\mathcal{L} = \sum_{i=1}^{N} \gamma_1(i) \log \pi_i + \mu \left( \sum_{i=1}^{N} \pi_i - 1 \right)$$

Differentiation: $\dfrac{\partial \mathcal{L}}{\partial \pi_i} = \dfrac{\gamma_1(i)}{\pi_i} + \mu = 0 \;\Rightarrow\; \pi_i = -\dfrac{\gamma_1(i)}{\mu}$

Constraint Condition: $\sum_i \pi_i = 1 \;\Rightarrow\; \mu = -\sum_i \gamma_1(i) = -1$

Optimal Solution: $\pi_i^{\text{new}} = \gamma_1(i)$, because $\sum_i \gamma_1(i) = 1$ (probability normalization)

Optimizing $A$

Objective: Optimize the $i$-th row for each $i$: $\max \sum_{t=1}^{T-1} \sum_{j} \xi_t(i,j) \log a_{ij}$ subject to $\sum_j a_{ij} = 1$

Lagrangian:

$$\mathcal{L} = \sum_{t=1}^{T-1} \sum_{j} \xi_t(i,j) \log a_{ij} + \mu \left( \sum_{j} a_{ij} - 1 \right)$$

Differentiation: $\dfrac{\partial \mathcal{L}}{\partial a_{ij}} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{a_{ij}} + \mu = 0$

Optimal Solution:

$$a_{ij}^{\text{new}} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

Physical Meaning: Expected number of transitions from state $q_i$ to state $q_j$, divided by the total expected number of transitions out of state $q_i$.

Optimizing $B$

Discrete Observation Case: For each state $q_j$ and observation symbol $v_k$:

$$b_j(k)^{\text{new}} = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

Physical Meaning: Expected number of emissions of symbol $v_k$ in state $q_j$, divided by the total expected time spent in state $q_j$.

Continuous Observation Case (Gaussian distribution): $b_j(o) = \mathcal{N}(o; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$. Update formulas:

$$\boldsymbol{\mu}_j^{\text{new}} = \frac{\sum_t \gamma_t(j)\, o_t}{\sum_t \gamma_t(j)}, \quad \boldsymbol{\Sigma}_j^{\text{new}} = \frac{\sum_t \gamma_t(j) (o_t - \boldsymbol{\mu}_j)(o_t - \boldsymbol{\mu}_j)^\top}{\sum_t \gamma_t(j)}$$

Complete Baum-Welch Algorithm Procedure

Input: Observation sequence $\mathbf{O}$, number of states $N$, number of observation symbols $M$

Initialization: Randomly initialize $\boldsymbol{\lambda}^{(0)} = (\boldsymbol{\pi}, A, B)$

Iteration (until convergence):

E-Step: Using the current parameters $\boldsymbol{\lambda}^{(k)}$, run the forward-backward algorithm to compute $\gamma_t(i)$ and $\xi_t(i,j)$

M-Step: Update the parameters

$$\pi_i = \gamma_1(i), \quad a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \quad b_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

Convergence Criterion: $\left| \log P(\mathbf{O} \mid \boldsymbol{\lambda}^{(k+1)}) - \log P(\mathbf{O} \mid \boldsymbol{\lambda}^{(k)}) \right| < \epsilon$

Code Implementation: Complete HMM Toolkit

Forward-Backward Algorithm Implementation

```python
import numpy as np

class HMM:
    def __init__(self, n_states, n_observations):
        """
        Initialize HMM model

        Args:
            n_states: Number of hidden states N
            n_observations: Number of observation symbols M
        """
        self.n_states = n_states
        self.n_observations = n_observations

        # Random initialization of parameters
        self.pi = np.random.rand(n_states)
        self.pi /= self.pi.sum()

        self.A = np.random.rand(n_states, n_states)
        self.A /= self.A.sum(axis=1, keepdims=True)

        self.B = np.random.rand(n_states, n_observations)
        self.B /= self.B.sum(axis=1, keepdims=True)

    def forward(self, observations):
        """
        Forward algorithm to compute alpha

        Args:
            observations: Observation sequence [o1, o2, ..., oT], values 0 to M-1

        Returns:
            alpha: shape (T, N)
            log_prob: Log-likelihood
        """
        T = len(observations)
        alpha = np.zeros((T, self.n_states))

        # Initialization: alpha_1(i) = pi_i * b_i(o_1)
        alpha[0] = self.pi * self.B[:, observations[0]]

        # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
        for t in range(1, T):
            for j in range(self.n_states):
                alpha[t, j] = np.sum(alpha[t-1] * self.A[:, j]) * self.B[j, observations[t]]

        log_prob = np.log(alpha[-1].sum())
        return alpha, log_prob

    def backward(self, observations):
        """
        Backward algorithm to compute beta

        Args:
            observations: Observation sequence [o1, o2, ..., oT]

        Returns:
            beta: shape (T, N)
        """
        T = len(observations)
        beta = np.zeros((T, self.n_states))

        # Initialization: beta_T(i) = 1
        beta[-1] = 1.0

        # Recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        for t in range(T-2, -1, -1):
            for i in range(self.n_states):
                beta[t, i] = np.sum(self.A[i] * self.B[:, observations[t+1]] * beta[t+1])

        return beta

    def forward_log(self, observations):
        """
        Numerically stable log-space forward algorithm
        """
        T = len(observations)
        log_alpha = np.zeros((T, self.n_states))

        # Initialization
        log_alpha[0] = np.log(self.pi) + np.log(self.B[:, observations[0]])

        # Recursion in log space using log-sum-exp
        for t in range(1, T):
            for j in range(self.n_states):
                log_alpha[t, j] = self._logsumexp(
                    log_alpha[t-1] + np.log(self.A[:, j])
                ) + np.log(self.B[j, observations[t]])

        log_prob = self._logsumexp(log_alpha[-1])
        return log_alpha, log_prob

    def _logsumexp(self, x):
        """Numerically stable log(sum(exp(x)))"""
        max_x = np.max(x)
        return max_x + np.log(np.sum(np.exp(x - max_x)))
```

Viterbi Algorithm Implementation

```python
# Method of the HMM class defined above
def viterbi(self, observations):
    """
    Viterbi algorithm for optimal state sequence

    Args:
        observations: Observation sequence [o1, o2, ..., oT]

    Returns:
        states: Optimal state sequence
        log_prob: Log probability of optimal path
    """
    T = len(observations)

    # Use log probabilities to avoid underflow
    log_delta = np.zeros((T, self.n_states))
    psi = np.zeros((T, self.n_states), dtype=int)

    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    log_delta[0] = np.log(self.pi) + np.log(self.B[:, observations[0]])
    psi[0] = 0

    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) a_ij] * b_j(o_t)
    for t in range(1, T):
        for j in range(self.n_states):
            temp = log_delta[t-1] + np.log(self.A[:, j])
            psi[t, j] = np.argmax(temp)
            log_delta[t, j] = temp[psi[t, j]] + np.log(self.B[j, observations[t]])

    # Termination: pick the best final state
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(log_delta[-1])
    log_prob = log_delta[-1, states[-1]]

    # Backtracking: s*_t = psi_{t+1}(s*_{t+1})
    for t in range(T-2, -1, -1):
        states[t] = psi[t+1, states[t+1]]

    return states, log_prob

HMM.viterbi = viterbi  # attach to the HMM class
```

Baum-Welch Algorithm Implementation

```python
# Method of the HMM class defined above
def baum_welch(self, observations, max_iter=100, tol=1e-6):
    """
    Baum-Welch algorithm for learning HMM parameters

    Args:
        observations: Observation sequence [o1, o2, ..., oT]
        max_iter: Maximum number of iterations
        tol: Convergence threshold

    Returns:
        log_probs: Log-likelihood at each iteration
    """
    observations = np.asarray(observations)  # needed for boolean masking below
    T = len(observations)
    log_probs = []

    for iteration in range(max_iter):
        # E-step: Forward-backward algorithm
        alpha, log_prob = self.forward(observations)
        beta = self.backward(observations)
        log_probs.append(log_prob)

        # Compute gamma (state occupation) and xi (transition) posteriors
        gamma = np.zeros((T, self.n_states))
        xi = np.zeros((T-1, self.n_states, self.n_states))

        for t in range(T):
            denominator = np.sum(alpha[t] * beta[t])
            gamma[t] = (alpha[t] * beta[t]) / denominator

        for t in range(T-1):
            denominator = 0.0
            for i in range(self.n_states):
                for j in range(self.n_states):
                    xi[t, i, j] = alpha[t, i] * self.A[i, j] * \
                                  self.B[j, observations[t+1]] * beta[t+1, j]
                    denominator += xi[t, i, j]
            xi[t] /= denominator

        # M-step: Update parameters
        # Update pi: pi_i = gamma_1(i)
        self.pi = gamma[0]

        # Update A: a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i)
        for i in range(self.n_states):
            denominator = np.sum(gamma[:-1, i])
            for j in range(self.n_states):
                self.A[i, j] = np.sum(xi[:, i, j]) / denominator

        # Update B: b_j(k) = sum_{t: o_t=k} gamma_t(j) / sum_t gamma_t(j)
        for j in range(self.n_states):
            denominator = np.sum(gamma[:, j])
            for k in range(self.n_observations):
                mask = (observations == k)
                self.B[j, k] = np.sum(gamma[mask, j]) / denominator

        # Check convergence
        if iteration > 0 and abs(log_probs[-1] - log_probs[-2]) < tol:
            print(f"Converged after {iteration+1} iterations")
            break

    return log_probs

HMM.baum_welch = baum_welch  # attach to the HMM class
```

Part-of-Speech Tagging Application Example

```python
def pos_tagging_example():
    """
    Part-of-speech tagging example
    """
    # Define states and observations
    states = ['N', 'V', 'ADJ']  # Noun, Verb, Adjective
    observations_vocab = ['I', 'love', 'natural', 'language', 'processing']

    # Create mappings
    state_to_idx = {s: i for i, s in enumerate(states)}
    obs_to_idx = {o: i for i, o in enumerate(observations_vocab)}

    # Initialize HMM
    hmm = HMM(n_states=len(states), n_observations=len(observations_vocab))

    # Manually set parameters (in real applications, learn from corpus)
    hmm.pi = np.array([0.6, 0.3, 0.1])  # High probability of starting with N

    hmm.A = np.array([
        [0.3, 0.5, 0.2],  # N -> N/V/ADJ
        [0.6, 0.1, 0.3],  # V -> N/V/ADJ
        [0.7, 0.2, 0.1]   # ADJ -> N/V/ADJ
    ])

    hmm.B = np.array([
        [0.1, 0.05, 0.05, 0.6, 0.2],   # N emission probabilities
        [0.05, 0.7, 0.05, 0.1, 0.1],   # V emission probabilities
        [0.05, 0.05, 0.8, 0.05, 0.05]  # ADJ emission probabilities
    ])

    # Observation sequence
    sentence = ['I', 'love', 'natural', 'language', 'processing']
    obs_seq = [obs_to_idx[w] for w in sentence]

    # Viterbi decoding
    state_seq, log_prob = hmm.viterbi(obs_seq)
    predicted_tags = [states[s] for s in state_seq]

    print("Sentence:", sentence)
    print("Predicted POS tags:", predicted_tags)
    print("Log probability:", log_prob)

    # Forward algorithm to compute probability
    alpha, log_prob_forward = hmm.forward(obs_seq)
    print("Forward log probability:", log_prob_forward)
```

Model Training Example

```python
def train_hmm_example():
    """
    Training HMM model example
    """
    # Generate synthetic data
    np.random.seed(42)
    true_hmm = HMM(n_states=3, n_observations=5)

    # Generate observation sequence
    observations = generate_sequence(true_hmm, length=1000)

    # Train new model
    learned_hmm = HMM(n_states=3, n_observations=5)
    log_probs = learned_hmm.baum_welch(observations, max_iter=50)

    # Visualize convergence
    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 5))
    plt.plot(log_probs, marker='o')
    plt.xlabel('Iteration')
    plt.ylabel('Log Likelihood')
    plt.title('Baum-Welch Convergence')
    plt.grid(True)
    plt.show()

    return learned_hmm

def generate_sequence(hmm, length):
    """
    Generate observation sequence from HMM
    """
    observations = []
    state = np.random.choice(hmm.n_states, p=hmm.pi)

    for _ in range(length):
        obs = np.random.choice(hmm.n_observations, p=hmm.B[state])
        observations.append(obs)
        state = np.random.choice(hmm.n_states, p=hmm.A[state])

    return np.array(observations)
```

Practical Issues and Optimization Techniques

Initialization Strategies

Problem: Baum-Welch only finds a local optimum, so initialization is crucial

Solutions:

  1. Uniform Initialization: All parameters equal
  2. Random Initialization + Multiple Restarts: Select the one with highest likelihood
  3. K-means Pre-training: Initialize states with clustering results
  4. Expert Knowledge: Use linguistic rules to initialize POS tagging

Smoothing Techniques

Zero Probability Problem: Unseen transitions/emissions have probability 0

Laplace Smoothing:

$$\hat{a}_{ij} = \frac{\text{count}(i \to j) + \epsilon}{\sum_{j'} \text{count}(i \to j') + N\epsilon}$$

Typically take $\epsilon = 1$ (add-one smoothing) or a small constant.
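Applied to transition counts, smoothing keeps unseen transitions strictly positive while leaving each row a valid distribution; a minimal sketch with made-up counts:

```python
import numpy as np

# Hypothetical transition counts; the transition 0 -> 2 was never observed
counts = np.array([[10., 5., 0.],
                   [ 2., 8., 4.],
                   [ 1., 1., 6.]])
N = counts.shape[1]
eps = 1.0  # add-one smoothing

A_mle    = counts / counts.sum(axis=1, keepdims=True)
A_smooth = (counts + eps) / (counts.sum(axis=1, keepdims=True) + N * eps)

print(A_mle[0, 2])     # 0.0 -- would zero out any path using this transition
print(A_smooth[0, 2])  # 1/18, strictly positive after smoothing
```

Without smoothing, a single zero entry makes every path through that transition have probability exactly zero, regardless of the rest of the evidence.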

Scaling Techniques

Problem: $\alpha_t(i)$ decays exponentially with $t$

Scaled Forward Algorithm:

Define the scaling factor: $c_t = \dfrac{1}{\sum_{i=1}^{N} \alpha_t(i)}$

Scaled forward variable: $\hat{\alpha}_t(i) = c_t \, \alpha_t(i)$, so that $\sum_i \hat{\alpha}_t(i) = 1$

Log-likelihood: $\log P(\mathbf{O} \mid \boldsymbol{\lambda}) = -\sum_{t=1}^{T} \log c_t$
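A sketch of the scaled forward pass (here the per-step normalizer is taken as the column sum of $\alpha$, so the log-likelihood accumulates $+\log c_t$; with the reciprocal convention the sign flips):

```python
import numpy as np

def forward_scaled(pi, A, B, obs):
    """Forward pass that renormalizes alpha at every step and
    recovers log P(O|lambda) from the accumulated scaling factors."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    log_prob = 0.0
    alpha = pi * B[:, obs[0]]
    for t in range(T):
        if t > 0:
            alpha = (alpha_hat[t-1] @ A) * B[:, obs[t]]
        c = alpha.sum()              # scaling factor at time t
        alpha_hat[t] = alpha / c     # normalized so each row sums to 1
        log_prob += np.log(c)        # sum_t log c_t = log P(O|lambda)
    return alpha_hat, log_prob

# Illustrative 2-state model; agrees with an unscaled forward pass
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
_, lp = forward_scaled(pi, A, B, [0, 1, 0])
print(lp)
```

Because every stored $\hat{\alpha}_t$ row sums to 1, the recursion never underflows even for sequences of millions of steps.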

Higher-Order HMMs

Second-Order HMM: $P(s_t \mid s_{t-1}, s_{t-2})$

State Explosion: A second-order model requires $O(N^3)$ transition parameters

Simplified Solution: Interpolation smoothing, e.g. $\hat{P}(s_t \mid s_{t-1}, s_{t-2}) = \lambda_2 P(s_t \mid s_{t-1}, s_{t-2}) + \lambda_1 P(s_t \mid s_{t-1}) + \lambda_0 P(s_t)$

In-Depth Q&A

Q1: What is the essential difference between the forward algorithm and Viterbi algorithm?

A: Both are dynamic programming, but the forward algorithm sums over all possible paths, computing the total probability of the observation sequence, while the Viterbi algorithm maximizes over all possible paths, finding the optimal state sequence. Mathematically:

$$\alpha_{t+1}(j) = \left[ \sum_i \alpha_t(i) a_{ij} \right] b_j(o_{t+1}) \quad \text{vs.} \quad \delta_{t+1}(j) = \left[ \max_i \delta_t(i) a_{ij} \right] b_j(o_{t+1})$$

The mere difference between summation and maximization means the former gives marginal probabilities and the latter gives the MAP estimate.

Q2: Why does Baum-Welch guarantee monotonic increase in likelihood?

A: This is guaranteed by EM algorithm theory. Each iteration of the EM algorithm ensures

$$\log P(\mathbf{O} \mid \boldsymbol{\lambda}^{(k+1)}) \ge \log P(\mathbf{O} \mid \boldsymbol{\lambda}^{(k)})$$

The key is that the Q function constructed in the E-step is a lower bound (ELBO) of the log-likelihood, and maximizing this lower bound in the M-step pushes up the true likelihood. The proof relies on Jensen's inequality and the non-negativity of KL divergence.

Q3: What are the limitations of HMM?

A: Three major limitations:

  1. Markov assumption too strong: The future depends only on the current state, ignoring long-range dependencies.
  2. Observation independence assumption: The current observation depends only on the current state; in reality, observations may depend on each other.
  3. Fixed number of states: $N$ must be specified in advance and is difficult to determine automatically (this can be addressed with Bayesian nonparametric methods such as the iHMM).

Q4: What advantages does CRF have compared to HMM?

A: CRF is a discriminative model that directly models $P(\mathbf{S} \mid \mathbf{O})$ rather than HMM's generative $P(\mathbf{O}, \mathbf{S})$. Advantages:

  1. No need for the observation independence assumption; can use overlapping features.
  2. More flexible parameter learning; can introduce global features.
  3. Usually better performance than HMM on sequence labeling tasks (such as NER).

Its disadvantage is higher training complexity (the partition function must be computed).

Q5: How to choose the number of states in HMM?

A: Three methods:

  1. Cross-validation: Test performance with different $N$ on a validation set.
  2. Information criteria: Use AIC/BIC to balance fit and complexity, e.g. $\text{BIC} = -2 \log P(\mathbf{O} \mid \hat{\boldsymbol{\lambda}}) + k \log T$, where $k$ is the number of free parameters.
  3. Bayesian nonparametrics: Use the iHMM (infinite HMM) to automatically infer the number of states.
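The parameter count of a discrete HMM, and hence BIC, can be computed directly; a minimal sketch (the log-likelihood values below are placeholders, not real fits):

```python
import numpy as np

def hmm_num_params(N, M):
    """Free parameters of a discrete HMM: pi has N-1 (it sums to 1),
    each of the N rows of A has N-1, each of the N rows of B has M-1."""
    return (N - 1) + N * (N - 1) + N * (M - 1)

def bic(log_likelihood, N, M, T):
    k = hmm_num_params(N, M)
    return -2 * log_likelihood + k * np.log(T)

# Hypothetical fits to the same T=1000 sequence: the larger model fits
# slightly better, but BIC penalizes its extra parameters
print(bic(-1210.0, N=2, M=5, T=1000))   # 11 free parameters
print(bic(-1195.0, N=3, M=5, T=1000))   # 20 free parameters
```

The model with the lower BIC is preferred; with these placeholder likelihoods the 15-nat improvement does not pay for the 9 extra parameters.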

Q6: How to analyze the time complexity of the forward-backward algorithm?

A: At each time step, the forward algorithm needs, for each of the $N$ states, a summation over all $N$ previous states, giving $O(N^2)$ per step; over $T$ time steps, the total complexity is $O(N^2 T)$. Compared to brute force enumeration's $O(T N^T)$, dynamic programming avoids redundant computation. Space complexity is $O(NT)$ (storing the $\alpha$ and $\beta$ matrices).

Q7: Why doesn't the Viterbi algorithm directly maximize $P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda})$?

A: Because computing $P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda})$ requires normalization:

$$P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda}) = \frac{P(\mathbf{O}, \mathbf{S} \mid \boldsymbol{\lambda})}{P(\mathbf{O} \mid \boldsymbol{\lambda})}$$

But $P(\mathbf{O} \mid \boldsymbol{\lambda})$ is the same for all state sequences, so maximizing $P(\mathbf{S} \mid \mathbf{O}, \boldsymbol{\lambda})$ is equivalent to maximizing $P(\mathbf{O}, \mathbf{S} \mid \boldsymbol{\lambda})$; Viterbi therefore directly optimizes the joint probability.

Q8: How to handle continuous observations (like speech signals)?

A: Use a continuous-observation HMM (CHMM), where the emission probabilities become continuous densities:

$$b_j(o) = \mathcal{N}(o; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

or Gaussian mixture models (GMM):

$$b_j(o) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(o; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})$$

The M-step of Baum-Welch is modified accordingly to update the Gaussian parameters.

Q9: Can HMM handle variable-length sequences?

A: Yes. For $K$ sequences of different lengths $\{\mathbf{O}^{(1)}, \ldots, \mathbf{O}^{(K)}\}$, assuming independence,

$$P(\{\mathbf{O}^{(k)}\} \mid \boldsymbol{\lambda}) = \prod_{k=1}^{K} P(\mathbf{O}^{(k)} \mid \boldsymbol{\lambda})$$

The Baum-Welch algorithm performs the E-step separately for each sequence and accumulates the expected statistics from all sequences in the M-step.

Q10: Can the forward algorithm be used for real-time decoding?

A: Yes! The forward algorithm naturally supports online computation. Each time a new observation $o_{t+1}$ arrives, simply compute

$$\alpha_{t+1}(j) = \left[ \sum_i \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$$

No future information is needed. Viterbi, however, requires backtracking and must wait for the sequence to end; for real-time applications, use online Viterbi or beam search approximations.

Q11: What is the relationship between HMM and RNN?

A: HMM can be seen as a discrete probabilistic version of RNN. Both model sequences, but:

  • HMM: Discrete hidden states, fixed transition probabilities, inference via dynamic programming
  • RNN: Continuous hidden states, parameterized transition function (a neural network), inference via forward propagation

In modern deep learning, RNN/LSTM have largely replaced HMM, but HMM-style dynamic-programming inference (e.g., the forward algorithm underlying CTC) is still used in speech recognition.

Q12: How to evaluate HMM model quality?

A: Three ways:

  1. Log-likelihood: Higher $\log P(\mathbf{O} \mid \boldsymbol{\lambda})$ is better (but may overfit).
  2. Decoding accuracy: Match between the Viterbi-predicted state sequence and the true labels.
  3. Perplexity: $\text{PPL} = \exp\left( -\frac{1}{T} \log P(\mathbf{O} \mid \boldsymbol{\lambda}) \right)$, lower is better.


  • Post title: Machine Learning Math Derivations (15): Hidden Markov Models
  • Post author: Chen Kai
  • Create time: 2025-08-15 00:00:00
  • Post link: https://www.chenk.top/ml-math-15-hidden-markov-model/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.