Conditional Random Fields (CRF) are discriminative models for sequence labeling: unlike an HMM, a CRF directly models the conditional probability $P(Y \mid X)$ of the label sequence given the observations.
From HMM to CRF: The Leap from Generative to Discriminative
Review of HMM Limitations
HMM modeling: Joint probability
$$P(X, Y) = \prod_{t=1}^{n} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$
Using Bayes' rule for prediction:
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)} \propto P(X, Y)$$
Core assumptions (restrictive):
- Observation independence: $P(x_t \mid y_{1:n}, x_{1:t-1}, x_{t+1:n}) = P(x_t \mid y_t)$
- Markov property: $P(y_t \mid y_{1:t-1}) = P(y_t \mid y_{t-1})$
Practical problem: In POS tagging, the current word's tag may depend on the previous word, next word, word suffixes, and other features — the observation independence assumption is unreasonable!
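To make the factorization concrete, here is a minimal sketch that evaluates the HMM joint probability for a toy 2-state, 2-symbol model; all parameter values are hypothetical placeholders, not from any real tagger:

```python
import numpy as np

# Toy HMM: P(X, Y) = P(y_1) P(x_1|y_1) * prod_t P(y_t|y_{t-1}) P(x_t|y_t).
pi = np.array([0.6, 0.4])                 # initial distribution P(y_1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # transitions P(y_t | y_{t-1})
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # emissions  P(x_t | y_t)

y = [0, 0, 1]   # hidden labels
x = [0, 1, 1]   # observations

p = pi[y[0]] * B[y[0], x[0]]
for t in range(1, len(y)):
    p *= A[y[t - 1], y[t]] * B[y[t], x[t]]

# Each emission factor B[y_t, x_t] depends only on the local label:
# this is exactly the observation independence assumption criticized above.
print(p)  # 0.6 * 0.9 * 0.7 * 0.1 * 0.3 * 0.8 = 0.009072
```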
Advantages of Discriminative Models
CRF idea: Directly model the conditional probability $P(Y \mid X)$, without modeling how the observations $X$ are generated.
- No observation independence assumption: Features can overlap arbitrarily
- Flexible feature engineering: Can introduce global features, dictionary features, morphological features, etc.
- Theoretical performance guarantee: Discriminative models typically outperform generative models on labeling tasks
Cost: Higher training complexity (requires computing the normalization factor $Z(X)$)
From HMM to MEMM to CRF
Maximum Entropy Markov Model (MEMM):
$$P(Y \mid X) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}, X)$$
Problem: Label bias. Normalizing locally at each step causes the model to favor states with fewer outgoing transitions, regardless of the observations.
CRF solution: Global normalization with a single partition function $Z(X)$ over the whole sequence.
Mathematical Framework of Linear-Chain CRF
Basic Definitions
Input: Observation sequence $X = (x_1, x_2, \ldots, x_n)$
Output: Label sequence $Y = (y_1, y_2, \ldots, y_n)$, $y_t \in \mathcal{Y}$, $|\mathcal{Y}| = m$
Linear-chain CRF:
$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{n} \psi_t(y_{t-1}, y_t, X)$$
where:
- Potential function: $\psi_t(y_{t-1}, y_t, X) = \exp\left(\sum_k w_k f_k(y_{t-1}, y_t, X, t)\right)$ represents the score at position $t$
- Partition function (normalization factor): $Z(X) = \sum_{Y} \prod_{t=1}^{n} \psi_t(y_{t-1}, y_t, X)$
Feature Function Decomposition
The feature functions $f_k(y_{t-1}, y_t, X, t)$ split into two families:
Transition features $t_k(y_{t-1}, y_t, X, t)$: Depend on the previous label and the current label.
Example: $t_1(y_{t-1}, y_t, X, t) = \mathbb{1}[y_{t-1} = \text{B-PER}] \cdot \mathbb{1}[y_t = \text{I-PER}]$
State features (Emission features) $s_l(y_t, X, t)$: Depend on the current label and the observations.
Example: $s_1(y_t, X, t) = \mathbb{1}[y_t = \text{B-PER}] \cdot \mathbb{1}[x_t \text{ is capitalized}]$
Parameterized Form
Unified representation: Combine all transition and state features into a single family $\{f_k\}_{k=1}^{K}$ with weights $\{w_k\}$.
Simplified potential function:
$$\psi_t(y_{t-1}, y_t, X) = \exp\left(\sum_{k=1}^{K} w_k f_k(y_{t-1}, y_t, X, t)\right)$$
Global score:
$$s(X, Y) = \sum_{t=1}^{n} \sum_{k=1}^{K} w_k f_k(y_{t-1}, y_t, X, t)$$
where $w_k$ is the weight of feature $f_k$.
Complete CRF form:
$$P(Y \mid X) = \frac{\exp(s(X, Y))}{Z(X)}, \qquad Z(X) = \sum_{Y'} \exp(s(X, Y'))$$
Matrix Form Representation
Define the $m \times m$ matrix at each position:
$$M_t(X)[y', y] = \psi_t(y', y, X) = \exp\left(\sum_k w_k f_k(y', y, X, t)\right)$$
Sequence score as matrix product:
$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{n} M_t(X)[y_{t-1}, y_t]$$
Partition function:
$$Z(X) = \left[\, M_1(X)\, M_2(X) \cdots M_n(X) \,\right]_{\text{start},\, \text{stop}}$$
where start and stop are special boundary states added to the label set.
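The matrix form can be checked numerically. The sketch below uses random placeholder potentials (not learned weights) and verifies that the chain of matrix products equals the brute-force sum over all $m^n$ label sequences:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, m = 4, 3  # sequence length and number of labels (illustrative sizes)

# Hypothetical potentials: M[t][i, j] plays the role of psi_{t+1}(i, j, X).
# Position 1 uses a start vector M0 instead of a full matrix.
M0 = np.exp(rng.normal(size=m))             # psi_1(start, y_1, X)
M = np.exp(rng.normal(size=(n - 1, m, m)))  # psi_t for t = 2..n

# Z(X) as a chain of vector-matrix products (the matrix form above).
v = M0.copy()
for Mt in M:
    v = v @ Mt
Z_matrix = v.sum()

# Brute-force check: sum the unnormalized scores of all m**n sequences.
Z_brute = 0.0
for y in product(range(m), repeat=n):
    score = M0[y[0]]
    for t in range(1, n):
        score *= M[t - 1][y[t - 1], y[t]]
    Z_brute += score

assert np.isclose(Z_matrix, Z_brute)
```

The brute-force loop costs $O(m^n)$, while the matrix-product chain costs $O(n m^2)$, which is the whole point of the matrix form.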
Forward-Backward Algorithm: CRF Version
Forward Algorithm Derivation
Forward variable:
$$\alpha_t(y) = \sum_{y_1, \ldots, y_{t-1}} \prod_{s=1}^{t} \psi_s(y_{s-1}, y_s, X), \quad \text{with } y_t = y$$
Initialization ($t = 1$):
$$\alpha_1(y) = \psi_1(y_0, y, X)$$
where $y_0 = \text{start}$ is the boundary state.
Recursion ($t = 2, \ldots, n$):
$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y')\, \psi_t(y', y, X)$$
Termination:
$$Z(X) = \sum_{y} \alpha_n(y)$$
Physical meaning: $\alpha_t(y)$ is the total unnormalized score of all partial label paths ending with $y_t = y$. Symmetrically, the backward variable satisfies $\beta_n(y) = 1$ and $\beta_t(y) = \sum_{y'} \psi_{t+1}(y, y', X)\, \beta_{t+1}(y')$.
Marginal Probability Computation
Single label marginal probability:
$$P(y_t = y \mid X) = \frac{\alpha_t(y)\, \beta_t(y)}{Z(X)}$$
Adjacent label marginal probability:
$$P(y_{t-1} = y',\, y_t = y \mid X) = \frac{\alpha_{t-1}(y')\, \psi_t(y', y, X)\, \beta_t(y)}{Z(X)}$$
These marginal probabilities are core to parameter learning!
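A compact sketch of the forward-backward pass, again with random placeholder potentials; it verifies that the recovered marginals are properly normalized:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
M0 = np.exp(rng.normal(size=m))             # start potentials psi_1(start, y)
M = np.exp(rng.normal(size=(n - 1, m, m)))  # psi_t(y_{t-1}, y_t), t = 2..n

# Forward pass: alpha[t, y] = total score of all prefixes ending in label y.
alpha = np.zeros((n, m))
alpha[0] = M0
for t in range(1, n):
    alpha[t] = alpha[t - 1] @ M[t - 1]
Z = alpha[-1].sum()

# Backward pass: beta[t, y] = total score of all suffixes leaving label y.
beta = np.ones((n, m))
for t in range(n - 2, -1, -1):
    beta[t] = M[t] @ beta[t + 1]

# Single-label marginals P(y_t = y | X) = alpha_t(y) beta_t(y) / Z.
marginals = alpha * beta / Z
assert np.allclose(marginals.sum(axis=1), 1.0)

# Adjacent-label marginals P(y_1 = y', y_2 = y | X) at the first transition.
pair = alpha[0][:, None] * M[0] * beta[1][None, :] / Z
assert np.isclose(pair.sum(), 1.0)
```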
Parameter Learning: Maximum Likelihood Estimation
Objective Function
Training data: $\{(X^{(i)}, Y^{(i)})\}_{i=1}^{N}$. Maximize the conditional log-likelihood:
$$L(w) = \sum_{i=1}^{N} \log P(Y^{(i)} \mid X^{(i)}; w)$$
Expanding:
$$L(w) = \sum_{i=1}^{N} \left[ \sum_{t,k} w_k f_k(y_{t-1}^{(i)}, y_t^{(i)}, X^{(i)}, t) - \log Z(X^{(i)}) \right]$$
Adding L2 regularization:
$$L_{\text{reg}}(w) = L(w) - \frac{\lambda}{2} \|w\|_2^2$$
Gradient Computation
Final gradient form:
$$\frac{\partial L_{\text{reg}}}{\partial w_k} = \sum_{i=1}^{N} \left[ \sum_t f_k(y_{t-1}^{(i)}, y_t^{(i)}, X^{(i)}, t) - \mathbb{E}_{P(Y \mid X^{(i)})}\!\left[ \sum_t f_k(y_{t-1}, y_t, X^{(i)}, t) \right] \right] - \lambda w_k$$
Physical interpretation: Empirical feature expectation minus model feature expectation; at the (unregularized) optimum the two match exactly.
Efficient Computation of Expectation
Problem: Directly computing $\mathbb{E}_{P(Y \mid X)}[\sum_t f_k]$ requires summing over all $m^n$ label sequences.
Solution: Exploit the linear chain structure; since each $f_k$ touches only adjacent labels, the expectation decomposes over adjacent-label marginals.
Using the forward-backward algorithm:
$$\mathbb{E}_{P(Y \mid X)}\!\left[\sum_t f_k\right] = \sum_{t=1}^{n} \sum_{y', y} P(y_{t-1} = y',\, y_t = y \mid X)\, f_k(y', y, X, t)$$
Complexity: $O(n m^2)$ per sequence for the forward-backward pass, versus $O(m^n)$ for brute force.
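This decomposition can be verified on a tiny example. The sketch below (random placeholder potentials) computes the model expectation of one transition feature $f(y_{t-1}, y_t) = \mathbb{1}[y_{t-1} = 0, y_t = 1]$ via pair marginals and checks it against the brute-force sum:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n, m = 4, 2
M0 = np.exp(rng.normal(size=m))
M = np.exp(rng.normal(size=(n - 1, m, m)))

# Forward-backward pass (same recursions as in the text).
alpha = np.zeros((n, m)); alpha[0] = M0
for t in range(1, n):
    alpha[t] = alpha[t - 1] @ M[t - 1]
Z = alpha[-1].sum()
beta = np.ones((n, m))
for t in range(n - 2, -1, -1):
    beta[t] = M[t] @ beta[t + 1]

# Expectation of f = 1[y_{t-1}=0, y_t=1] via adjacent-label marginals.
E_fb = sum(alpha[t][0] * M[t][0, 1] * beta[t + 1][1] / Z
           for t in range(n - 1))

# Brute-force expectation over all m**n label sequences.
E_brute = 0.0
for y in product(range(m), repeat=n):
    score = M0[y[0]]
    for t in range(1, n):
        score *= M[t - 1][y[t - 1], y[t]]
    count = sum(1 for t in range(1, n) if (y[t - 1], y[t]) == (0, 1))
    E_brute += score * count
E_brute /= Z

assert np.isclose(E_fb, E_brute)
```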
Viterbi Decoding: CRF Version
Optimal Sequence Prediction
Objective:
$$Y^* = \arg\max_{Y} P(Y \mid X)$$
Equivalent to (since $Z(X)$ does not depend on $Y$):
$$Y^* = \arg\max_{Y} s(X, Y) = \arg\max_{Y} \sum_{t,k} w_k f_k(y_{t-1}, y_t, X, t)$$
Viterbi Algorithm Derivation
Dynamic programming variable:
$$\delta_t(y) = \max_{y_1, \ldots, y_{t-1}} \sum_{s=1}^{t} \sum_k w_k f_k(y_{s-1}, y_s, X, s), \quad \text{with } y_t = y$$
Recursion formula:
$$\delta_t(y) = \max_{y'} \left[ \delta_{t-1}(y') + \sum_k w_k f_k(y', y, X, t) \right]$$
Backtracking pointer:
$$\Psi_t(y) = \arg\max_{y'} \left[ \delta_{t-1}(y') + \sum_k w_k f_k(y', y, X, t) \right]$$
Complexity: $O(n m^2)$, the same as HMM Viterbi.
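A log-space Viterbi sketch with random placeholder scores, validated against exhaustive search over all $m^n$ sequences:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n, m = 5, 3
logM0 = rng.normal(size=m)              # log psi_1(start, y_1)
logM = rng.normal(size=(n - 1, m, m))   # log psi_t(y_{t-1}, y_t)

# Viterbi: delta[t, y] = best score of any label prefix ending in y.
delta = np.zeros((n, m)); delta[0] = logM0
back = np.zeros((n, m), dtype=int)      # backtracking pointers Psi_t(y)
for t in range(1, n):
    cand = delta[t - 1][:, None] + logM[t - 1]   # cand[prev, cur]
    back[t] = cand.argmax(axis=0)
    delta[t] = cand.max(axis=0)

# Backtrace the optimal label sequence from the final argmax.
path = [int(delta[-1].argmax())]
for t in range(n - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
path.reverse()

# Brute-force check over all m**n label sequences.
def seq_score(y):
    return logM0[y[0]] + sum(logM[t - 1][y[t - 1], y[t]] for t in range(1, n))

best = max(product(range(m), repeat=n), key=seq_score)
assert tuple(path) == best
```

Note that decoding never touches $Z(X)$: the argmax over unnormalized scores is the same as the argmax over probabilities.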
Q&A
Q1: What is the core advantage of CRF over HMM?
A: CRF is a discriminative model that directly models $P(Y \mid X)$, so it needs no observation independence assumption and can use arbitrary, overlapping features of the entire observation sequence; global normalization also avoids MEMM's label bias.
Q2: Why does CRF need the forward-backward algorithm for gradient computation?
A: The gradient contains the model expectation term $\mathbb{E}_{P(Y \mid X)}[\sum_t f_k]$, a sum over exponentially many label sequences. The forward-backward algorithm reduces it to adjacent-label marginals, computable in $O(n m^2)$.
Q3: Why is the partition function Z(X) difficult to compute?
A: The partition function is defined as $Z(X) = \sum_{Y} \exp(s(X, Y))$, requiring summation over all $m^n$ possible label sequences, which is exponential in the sequence length. The forward algorithm exploits the chain structure to compute it in $O(n m^2)$.
Q4: Why is L2 regularization important for CRF?
A: Two reasons: 1. Prevents overfitting: the feature count is typically large (tens of thousands to millions), and unregularized weights on rare features can grow without bound; the L2 penalty $\frac{\lambda}{2}\|w\|^2$ shrinks them toward zero. 2. Makes the objective strictly concave, giving a unique optimum and better-conditioned optimization. Typically $\lambda$ is tuned on a held-out development set.
Q5: How does CRF combine with deep learning?
A: Modern sequence labeling typically uses BiLSTM-CRF:
- BiLSTM: learns contextual representations of each token
- CRF: models label dependencies and constrains output validity (e.g. forbidding I-PER after B-LOC)
Structure: tokens → embeddings → BiLSTM → per-position emission scores → CRF layer → Viterbi-decoded label sequence.
✏️ Exercises and Solutions
Exercise 1: CRF vs HMM
Problem: Compare CRF and HMM in POS tagging.
Solution: CRF is discriminative, modeling $P(Y \mid X)$; HMM is generative, modeling $P(X, Y)$. CRF allows overlapping features such as suffixes, capitalization, and surrounding words, which HMM's observation independence assumption forbids, and it typically achieves higher tagging accuracy; HMM in turn is cheaper to train (closed-form counts rather than iterative optimization).
Exercise 2: Feature Function Design
Problem: Design 3 feature functions for NER.
Solution: 1. Transition: $f_1 = \mathbb{1}[y_{t-1} = \text{B-PER}] \cdot \mathbb{1}[y_t = \text{I-PER}]$. 2. State: $f_2 = \mathbb{1}[y_t = \text{B-PER}] \cdot \mathbb{1}[x_t \text{ is capitalized}]$. 3. Dictionary: $f_3 = \mathbb{1}[y_t = \text{B-LOC}] \cdot \mathbb{1}[x_t \text{ appears in a gazetteer of place names}]$.
Exercise 3: Forward Algorithm
Problem: For a CRF with sequence length $n$ and $m$ labels, write out the forward algorithm.
Solution: Init: $\alpha_1(y) = \psi_1(y_0, y, X)$. Recursion: $\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y')\, \psi_t(y', y, X)$. Termination: $Z(X) = \sum_y \alpha_n(y)$. Complexity: $O(n m^2)$.
Exercise 4: Gradient
Problem: Derive the gradient of the log-likelihood $\log P(Y \mid X; w)$ with respect to $w_k$.
Solution: $\frac{\partial}{\partial w_k} \log P(Y \mid X; w) = \sum_t f_k(y_{t-1}, y_t, X, t) - \mathbb{E}_{P(Y' \mid X)}[\sum_t f_k(y'_{t-1}, y'_t, X, t)]$, i.e. empirical feature count minus model-expected feature count; the $\log Z(X)$ term differentiates to the model expectation.
Exercise 5: Viterbi
Problem: Difference between CRF and HMM Viterbi.
Solution: Both use DP to find the optimal label sequence in $O(n m^2)$. Difference: HMM Viterbi maximizes a product of local probabilities $P(y_t \mid y_{t-1}) P(x_t \mid y_t)$, while CRF Viterbi maximizes the unnormalized global feature score $\sum_{t,k} w_k f_k$; CRF scores need not be probabilities, and decoding never computes $Z(X)$.
References
[1] Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.
[2] Sutton, C., & McCallum, A. (2012). An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4), 267-373.
[3] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
[4] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. NAACL-HLT.
- Post title: Machine Learning Mathematical Derivations (16): Conditional Random Fields
- Post author: Chen Kai
- Create time: 2021-11-23 15:45:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-16-Conditional-Random-Fields/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.