Machine Learning Mathematical Derivations (16): Conditional Random Fields
Chen Kai

Conditional Random Fields (CRF) are discriminative models for sequence labeling. Unlike HMM, a CRF directly models the conditional probability $P(\mathbf{Y} \mid \mathbf{X})$ rather than the joint probability $P(\mathbf{X}, \mathbf{Y})$, thereby avoiding the observation independence assumption and allowing flexible use of overlapping features. From named entity recognition to part-of-speech tagging, from information extraction to image segmentation, CRF achieves strong performance in sequence modeling through its undirected graph structure and flexible feature engineering. This chapter delves into the potential functions, normalization factor, forward-backward algorithm, gradient computation, and L-BFGS optimization for linear-chain CRF.

From HMM to CRF: The Leap from Generative to Discriminative

Review of HMM Limitations

HMM modeling: Joint probability

$$P(\mathbf{X}, \mathbf{Y}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)$$

Using Bayes' rule for prediction:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{P(\mathbf{X}, \mathbf{Y})}{P(\mathbf{X})} \propto P(\mathbf{X}, \mathbf{Y})$$

Core assumptions (restrictive):

  1. Observation independence: $P(x_t \mid y_{1:T}, x_{1:T \setminus t}) = P(x_t \mid y_t)$
  2. Markov property: $P(y_t \mid y_{1:t-1}, x_{1:t-1}) = P(y_t \mid y_{t-1})$

Practical problem: In POS tagging, the current word's tag may depend on the previous word, the next word, word suffixes, and other features; the observation independence assumption is unreasonable!

Advantages of Discriminative Models

CRF idea: Directly model $P(\mathbf{Y} \mid \mathbf{X})$. Advantages:

  1. No observation independence assumption: Features can overlap arbitrarily
  2. Flexible feature engineering: Can introduce global features, dictionary features, morphological features, etc.
  3. Strong empirical performance: Discriminative models typically outperform generative models on labeling tasks

Cost: Higher training complexity (requires computing the normalization factor $Z(\mathbf{X})$)

From HMM to MEMM to CRF

(Figures: linear-chain CRF architecture; comparison of CRF vs HMM graph structures)

Maximum Entropy Markov Model (MEMM):

$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}, \mathbf{X}) = \prod_{t=1}^{T} \frac{\exp\big(\mathbf{w}^\top f(y_{t-1}, y_t, \mathbf{X}, t)\big)}{Z_t(y_{t-1}, \mathbf{X})}$$

Problem: Label bias. Local (per-position) normalization causes the model to favor states with fewer outgoing transition options.

CRF solution: Global normalization over the whole sequence

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \exp\Big(\sum_{t=1}^{T} \mathbf{w}^\top f(y_{t-1}, y_t, \mathbf{X}, t)\Big)$$

Mathematical Framework of Linear-Chain CRF

Feature Functions

Basic Definitions

Input: Observation sequence $\mathbf{X} = (x_1, x_2, \dots, x_T)$

Output: Label sequence $\mathbf{Y} = (y_1, y_2, \dots, y_T)$, with $y_t \in \mathcal{Y} = \{1, 2, \dots, L\}$

Conditional probability:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, \mathbf{X})$$

where:

  • Potential function: $\psi_t(y_{t-1}, y_t, \mathbf{X}) > 0$ represents the (unnormalized) score at position $t$
  • Partition function (normalization factor): $Z(\mathbf{X}) = \sum_{\mathbf{Y}} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, \mathbf{X})$
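As a sanity check on these definitions, the partition function can be computed by brute-force enumeration on a toy problem. This is a hedged sketch: the sizes `T`, `L` and the random `psi` tables below are illustrative assumptions, not values from the text.

```python
import itertools
import numpy as np

# Toy setup: T = 3 positions, L = 2 labels.
# psi[t][y_prev, y] holds the potential psi_t(y_{t-1}, y_t, X);
# at t = 0 the "previous label" axis is a single START state.
rng = np.random.default_rng(0)
T, L = 3, 2
psi = [rng.uniform(0.5, 2.0, size=(1, L))] + \
      [rng.uniform(0.5, 2.0, size=(L, L)) for _ in range(T - 1)]

def path_score(y_seq):
    """Product of potentials along one label sequence."""
    score = psi[0][0, y_seq[0]]
    for t in range(1, T):
        score *= psi[t][y_seq[t - 1], y_seq[t]]
    return score

# Partition function: sum over all L^T label sequences.
Z = sum(path_score(ys) for ys in itertools.product(range(L), repeat=T))

# Conditional probability of one particular sequence.
y_example = (0, 1, 0)
p = path_score(y_example) / Z
print(p)
```

Dividing each path score by $Z$ yields a proper distribution over all $L^T$ sequences; the exponential cost of this enumeration is exactly what the forward algorithm later removes.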

Feature Function Decomposition

Transition features: $t_k(y_{t-1}, y_t, \mathbf{X}, t)$ depend on the previous label and the current label

Example: $t_k(y_{t-1}, y_t, \mathbf{X}, t) = \mathbb{1}[y_{t-1} = \text{B}] \cdot \mathbb{1}[y_t = \text{I}]$

State features (emission features): $s_l(y_t, \mathbf{X}, t)$ depend on the current label and the observations

Example: $s_l(y_t, \mathbf{X}, t) = \mathbb{1}[y_t = \text{I}] \cdot \mathbb{1}[x_t \text{ ends with "-tion"}]$

Potential function expansion:

$$\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\Big(\sum_k \lambda_k \, t_k(y_{t-1}, y_t, \mathbf{X}, t) + \sum_l \mu_l \, s_l(y_t, \mathbf{X}, t)\Big)$$
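The decomposition can be illustrated with a tiny sketch: indicator feature functions plus weights produce a potential value. The tag set, feature pair, and weights below are invented for illustration only.

```python
import math

labels = ["B", "I", "O"]                              # assumed toy tag set
weights = {"trans_B_I": 1.2, "state_I_suffix": 0.7}   # assumed toy weights

def t_feat(y_prev, y):
    """Transition feature: fires when the transition is B -> I."""
    return 1.0 if (y_prev, y) == ("B", "I") else 0.0

def s_feat(y, x_t):
    """State feature: fires when the label is I and the word ends in '-tion'."""
    return 1.0 if y == "I" and x_t.endswith("tion") else 0.0

def potential(y_prev, y, x_t):
    """psi_t = exp(lambda * t_k + mu * s_l) for this toy feature set."""
    score = (weights["trans_B_I"] * t_feat(y_prev, y)
             + weights["state_I_suffix"] * s_feat(y, x_t))
    return math.exp(score)

print(potential("B", "I", "segmentation"))  # both features fire: exp(1.2 + 0.7)
```

When no feature fires the potential is $\exp(0) = 1$, so inactive features leave the score untouched.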

Parameterized Form

Unified representation: Combine all feature functions into one family $f_k(y_{t-1}, y_t, \mathbf{X}, t)$ with weights $w_k$ (state features simply ignore $y_{t-1}$)

Simplified potential function:

$$\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\Big(\sum_k w_k f_k(y_{t-1}, y_t, \mathbf{X}, t)\Big)$$

Global score:

$$s(\mathbf{X}, \mathbf{Y}) = \mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$$

where $F(\mathbf{X}, \mathbf{Y}) = \sum_{t=1}^{T} f(y_{t-1}, y_t, \mathbf{X}, t)$ is the global feature vector

Complete CRF form:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{\exp\big(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\big)}{Z(\mathbf{X})}, \qquad Z(\mathbf{X}) = \sum_{\mathbf{Y}'} \exp\big(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y}')\big)$$

Matrix Form Representation

Define: $M_t(y', y \mid \mathbf{X}) = \exp\big(\sum_k w_k f_k(y', y, \mathbf{X}, t)\big)$, the score matrix for transitioning from label $y'$ to label $y$ at position $t$

Sequence score as matrix product:

$$\exp\big(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\big) = \prod_{t=1}^{T} M_t(y_{t-1}, y_t \mid \mathbf{X})$$

Partition function:

$$Z(\mathbf{X}) = \boldsymbol{\alpha}_0^\top \Big(\prod_{t=1}^{T} M_t\Big) \mathbf{1}$$

where $\mathbf{1}$ is an all-ones vector, $\boldsymbol{\alpha}_0$ encodes the START state, and each $M_t$ is an $L \times L$ matrix
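A small sketch can verify that the matrix-product form and brute-force enumeration give the same $Z(\mathbf{X})$. The matrices below are random toy values with assumed shapes, not from the text.

```python
import itertools
import numpy as np

# M0 is the START row (1 x L); Ms[t] is the L x L score matrix at position t+2.
rng = np.random.default_rng(1)
T, L = 4, 3
M0 = rng.uniform(0.5, 2.0, size=(1, L))
Ms = [rng.uniform(0.5, 2.0, size=(L, L)) for _ in range(T - 1)]

# Z(X) = start_row @ M_2 @ ... @ M_T @ 1  (all-ones vector)
v = M0
for M in Ms:
    v = v @ M
Z_matrix = float(v @ np.ones(L))

# Cross-check against explicit enumeration over all L^T sequences.
Z_brute = 0.0
for ys in itertools.product(range(L), repeat=T):
    s = M0[0, ys[0]]
    for t in range(1, T):
        s *= Ms[t - 1][ys[t - 1], ys[t]]
    Z_brute += s
print(np.isclose(Z_matrix, Z_brute))
```

The matrix product sums over each intermediate label index automatically, which is why the two computations agree.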

Forward-Backward Algorithm: CRF Version

Forward Algorithm Derivation

Forward variable: $\alpha_t(y)$ represents the unnormalized probability (total score) of all paths ending at label $y$ at position $t$

Initialization ($t = 1$):

$$\alpha_1(y) = \psi_1(y_0, y, \mathbf{X})$$

where $y_0$ is the start symbol (usually a special START tag)

Recursion ($t = 2, \dots, T$):

$$\alpha_t(y) = \sum_{y' \in \mathcal{Y}} \alpha_{t-1}(y') \, \psi_t(y', y, \mathbf{X})$$

Termination:

$$Z(\mathbf{X}) = \sum_{y \in \mathcal{Y}} \alpha_T(y)$$

Physical meaning: $\alpha_t(y)$ accumulates the scores of all partial sequences of length $t$ ending with label $y$
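The three steps above can be sketched directly with NumPy; on toy random potentials the result matches explicit path enumeration. All sizes and values here are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
T, L = 5, 3
psi0 = rng.uniform(0.5, 2.0, size=L)               # psi_1(START, y, X)
psi = rng.uniform(0.5, 2.0, size=(T - 1, L, L))    # psi_t(y', y, X) for t >= 2

# Initialization: alpha_1(y) = psi_1(START, y, X)
alpha = psi0.copy()
# Recursion: alpha_t(y) = sum_{y'} alpha_{t-1}(y') * psi_t(y', y, X)
for t in range(T - 1):
    alpha = alpha @ psi[t]
# Termination: Z(X) = sum_y alpha_T(y)
Z = float(alpha.sum())

# Sanity check against enumeration of all L^T paths.
Z_brute = 0.0
for ys in itertools.product(range(L), repeat=T):
    s = psi0[ys[0]]
    for t in range(1, T):
        s *= psi[t - 1][ys[t - 1], ys[t]]
    Z_brute += s
print(np.isclose(Z, Z_brute))
```

In practice the recursion is done in log-space to avoid overflow, but the plain form above shows the structure most clearly.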

Marginal Probability Computation

Single label marginal probability (using the backward variable $\beta_t(y)$, defined symmetrically as the total score of all suffix paths starting from label $y$ at position $t$):

$$P(y_t = y \mid \mathbf{X}) = \frac{\alpha_t(y) \, \beta_t(y)}{Z(\mathbf{X})}$$

Adjacent label marginal probability:

$$P(y_{t-1} = y', y_t = y \mid \mathbf{X}) = \frac{\alpha_{t-1}(y') \, \psi_t(y', y, \mathbf{X}) \, \beta_t(y)}{Z(\mathbf{X})}$$

These marginal probabilities are core to parameter learning!
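A minimal forward-backward sketch (toy random potentials, assumed setup) shows that the single-label marginals sum to one at every position, as they must:

```python
import numpy as np

rng = np.random.default_rng(3)
T, L = 4, 2
psi0 = rng.uniform(0.5, 2.0, size=L)
psi = rng.uniform(0.5, 2.0, size=(T - 1, L, L))

# Forward: alpha[t, y] = total score of prefixes ending in y at position t.
alpha = np.zeros((T, L))
alpha[0] = psi0
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ psi[t - 1]

# Backward: beta[t, y] = total score of suffixes starting from y at position t.
beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = psi[t] @ beta[t + 1]

Z = alpha[-1].sum()

# Single-label marginal: P(y_t = y | X) = alpha_t(y) * beta_t(y) / Z
marginals = alpha * beta / Z
print(np.allclose(marginals.sum(axis=1), 1.0))
```

The identity $\sum_y \alpha_t(y)\beta_t(y) = Z(\mathbf{X})$ holds at every position $t$, which is what the final check exercises.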

Parameter Learning: Maximum Likelihood Estimation

Objective Function

Training data: $\mathcal{D} = \{(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)})\}_{n=1}^{N}$

Log-likelihood:

$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \log P(\mathbf{Y}^{(n)} \mid \mathbf{X}^{(n)})$$

Expanding:

$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \Big[\mathbf{w}^\top F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \log Z(\mathbf{X}^{(n)})\Big]$$

Adding L2 regularization:

$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = \mathcal{L}(\mathbf{w}) - \frac{\lambda}{2} \|\mathbf{w}\|^2$$


Gradient Computation

Final gradient form:

$$\frac{\partial \mathcal{L}_{\text{reg}}}{\partial w_k} = \sum_{n=1}^{N} \Big[F_k(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \mathbb{E}_{\mathbf{Y} \sim P(\cdot \mid \mathbf{X}^{(n)})}\big[F_k(\mathbf{X}^{(n)}, \mathbf{Y})\big]\Big] - \lambda w_k$$

Physical interpretation: Empirical feature expectation minus model feature expectation

Efficient Computation of Expectation

Problem: Directly computing $\mathbb{E}_{\mathbf{Y}}\big[F_k(\mathbf{X}, \mathbf{Y})\big]$ requires enumerating $L^T$ sequences!

Solution: Exploit the linear chain structure and the marginal probabilities

Using the forward-backward algorithm:

$$\mathbb{E}_{\mathbf{Y}}\big[F_k(\mathbf{X}, \mathbf{Y})\big] = \sum_{t=1}^{T} \sum_{y', y} P(y_{t-1} = y', y_t = y \mid \mathbf{X}) \, f_k(y', y, \mathbf{X}, t)$$

Complexity: $O(T L^2)$, same as the forward-backward algorithm!
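The decomposition can be sketched as follows: build the pairwise marginals from the forward and backward tables, then sum the entries where a given transition feature fires. The potentials are toy random values, and the feature $\mathbb{1}[y_{t-1} = 0, y_t = 1]$ is an invented example.

```python
import numpy as np

rng = np.random.default_rng(4)
T, L = 4, 2
psi0 = rng.uniform(0.5, 2.0, size=L)
psi = rng.uniform(0.5, 2.0, size=(T - 1, L, L))

# Forward and backward tables as before.
alpha = np.zeros((T, L))
alpha[0] = psi0
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ psi[t - 1]
beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = psi[t] @ beta[t + 1]
Z = alpha[-1].sum()

# Pairwise marginals xi[t-1, y', y] = alpha_{t-1}(y') psi_t(y', y) beta_t(y) / Z
xi = np.array([np.outer(alpha[t - 1], beta[t]) * psi[t - 1] / Z
               for t in range(1, T)])

# Expected count of the transition feature 1[y_{t-1}=0, y_t=1]:
# sum the matching marginal entries over all positions.
expected_f = xi[:, 0, 1].sum()
print(expected_f)
```

Each $L \times L$ marginal slice sums to one, so the expected count of any transition feature lies between 0 and $T - 1$.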

Viterbi Decoding: CRF Version

Optimal Sequence Prediction

Objective:

$$\mathbf{Y}^* = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X})$$

Equivalent to (since $Z(\mathbf{X})$ does not depend on $\mathbf{Y}$):

$$\mathbf{Y}^* = \arg\max_{\mathbf{Y}} \mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$$

Viterbi Algorithm Derivation

Dynamic programming variable: $\delta_t(y)$ represents the optimal path score ending at label $y$ at position $t$

Recursion formula:

$$\delta_t(y) = \max_{y'} \Big[\delta_{t-1}(y') + \mathbf{w}^\top f(y', y, \mathbf{X}, t)\Big]$$

Backtracking pointer:

$$\Psi_t(y) = \arg\max_{y'} \Big[\delta_{t-1}(y') + \mathbf{w}^\top f(y', y, \mathbf{X}, t)\Big]$$

Complexity: $O(T L^2)$
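The recursion and backtracking can be sketched as follows; on a toy random score table the decoded sequence matches exhaustive search. Sizes and scores are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
T, L = 5, 3
score0 = rng.normal(size=L)              # w . f(START, y, X, 1)
score = rng.normal(size=(T - 1, L, L))   # w . f(y', y, X, t) for t >= 2

# delta[t, y]: best path score ending in y at t; backptr stores the argmax.
delta = np.zeros((T, L))
delta[0] = score0
backptr = np.zeros((T, L), dtype=int)
for t in range(1, T):
    cand = delta[t - 1][:, None] + score[t - 1]   # cand[y', y]
    backptr[t] = cand.argmax(axis=0)
    delta[t] = cand.max(axis=0)

# Backtrack from the best final label.
y = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    y.append(int(backptr[t, y[-1]]))
y.reverse()

# Cross-check against brute-force search over all L^T sequences.
def seq_score(ys):
    return score0[ys[0]] + sum(score[t - 1][ys[t - 1], ys[t]]
                               for t in range(1, T))

best = max(itertools.product(range(L), repeat=T), key=seq_score)
print(tuple(y) == best)
```

Note the additive (log-space) scores: Viterbi over a CRF maximizes $\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$ directly, so no exponentiation is needed.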

Q&A

Q1: What is the core advantage of CRF over HMM?

A: CRF is a discriminative model that directly models $P(\mathbf{Y} \mid \mathbf{X})$, while HMM is a generative model that models $P(\mathbf{X}, \mathbf{Y})$. Specific advantages:

  1. No observation independence assumption: HMM assumes $P(x_t \mid \cdot) = P(x_t \mid y_t)$; CRF allows arbitrary feature overlap
  2. Flexible feature engineering: CRF can use global features and overlapping features (previous/next words, dictionary matches)
  3. Better sequence labeling performance: CRF typically outperforms HMM by 5-10 percentage points on NER and POS tagging tasks

Q2: Why does CRF need the forward-backward algorithm for gradient computation?

A: The gradient contains the expectation term $\mathbb{E}_{\mathbf{Y} \sim P(\cdot \mid \mathbf{X})}\big[F_k(\mathbf{X}, \mathbf{Y})\big]$, and direct computation requires enumerating $L^T$ sequences. The forward-backward algorithm exploits the linear chain structure to decompose the expectation into a sum of pairwise marginal probabilities, reducing the complexity to $O(T L^2)$.

Q3: Why is the partition function Z(X) difficult to compute?

A: The partition function is defined as

$$Z(\mathbf{X}) = \sum_{\mathbf{Y}} \exp\big(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\big)$$

requiring summation over all $L^T$ possible label sequences. Although the forward algorithm can efficiently compute $Z(\mathbf{X})$ in $O(T L^2)$, it must be recomputed for each parameter update, making training computationally expensive. This is the main cost of CRF compared to the structured perceptron.

Q4: Why is L2 regularization important for CRF?

A: Two reasons:

  1. Prevents overfitting: The feature count is typically large (tens of thousands to millions); the L2 penalty $\frac{\lambda}{2}\|\mathbf{w}\|^2$ prevents weight explosion
  2. Ensures strong convexity: The original (negative log-likelihood) objective is convex; adding L2 makes it strongly convex, guaranteeing L-BFGS convergence to the global optimum

Typically, $\lambda$ is chosen by cross-validation.

Q5: How does CRF combine with deep learning?

A: Modern sequence labeling typically uses BiLSTM-CRF:

  • BiLSTM: Learns contextual representations
  • CRF: Models label dependencies and constrains label-sequence validity

Structure: The BiLSTM output serves as the CRF emission scores; the CRF layer learns the transition matrix. Training jointly optimizes both; inference still uses Viterbi.

✏️ Exercises and Solutions

Exercise 1: CRF vs HMM

Problem: Compare CRF and HMM in POS tagging.

Solution: CRF is discriminative (models $P(\mathbf{Y} \mid \mathbf{X})$): no observation independence assumption, flexible overlapping features, global normalization. HMM is generative (models $P(\mathbf{X}, \mathbf{Y})$): requires observation independence and allows local features only.


Exercise 2: Feature Function Design

Problem: Design 3 feature functions for NER.

Solution:

  1. Transition: $f_1 = \mathbb{1}[y_{t-1} = \text{B-PER}] \cdot \mathbb{1}[y_t = \text{I-PER}]$
  2. State: $f_2 = \mathbb{1}[y_t = \text{B-PER}] \cdot \mathbb{1}[x_t \text{ is capitalized}]$
  3. Global: $f_3 = \mathbb{1}[y_t = \text{B-ORG}] \cdot \mathbb{1}[x_t \text{ appears in an organization dictionary}]$


Exercise 3: Forward Algorithm

Problem: A CRF has sequence length $T$ and $L$ labels. Write the forward algorithm steps to compute $Z(\mathbf{X})$.

Solution: Init: $\alpha_1(y) = \psi_1(y_0, y, \mathbf{X})$; Recur: $\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \, \psi_t(y', y, \mathbf{X})$; Terminate: $Z(\mathbf{X}) = \sum_y \alpha_T(y)$. Complexity: $O(T L^2)$.


Exercise 4: Gradient

Problem: Derive the gradient of $-\log P(\mathbf{Y} \mid \mathbf{X})$ with respect to $w_k$.

Solution: $\frac{\partial}{\partial w_k}\big[-\log P(\mathbf{Y} \mid \mathbf{X})\big] = \mathbb{E}_{\mathbf{Y}' \sim P(\cdot \mid \mathbf{X})}\big[F_k(\mathbf{X}, \mathbf{Y}')\big] - F_k(\mathbf{X}, \mathbf{Y})$. Gradient = model expected feature count minus observed feature count (the maximum entropy principle).


Exercise 5: Viterbi

Problem: Difference between CRF and HMM Viterbi.

Solution: Both use dynamic programming to find the optimal label sequence. Difference: HMM Viterbi uses local emission probabilities $P(x_t \mid y_t)$; CRF Viterbi scores transitions with arbitrary features of the whole observation $\mathbf{X}$.


  • Post title: Machine Learning Mathematical Derivations (16): Conditional Random Fields
  • Post author: Chen Kai
  • Create time: 2021-11-23 15:45:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-16-Conditional-Random-Fields/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.