Conditional Random Fields (CRFs) are discriminative models for
sequence labeling. Unlike HMMs, CRFs directly model the conditional
probability $P(Y \mid X)$ rather than the joint probability $P(X, Y)$,
thereby avoiding the observation independence assumption and allowing
flexible use of overlapping features. From named entity recognition to
part-of-speech tagging, from information extraction to image
segmentation, CRFs achieve strong performance in sequence modeling
through undirected graph structures and feature engineering. This
chapter systematically derives the potential functions of linear-chain CRFs,
the partition function, the forward-backward algorithm, gradient computation
for parameter learning, and LBFGS optimization.
From HMM to CRF: The Leap from Generative to Discriminative
Review of HMM Limitations
HMM Modeling: Joint probability
$$P(X, Y) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)$$
Using Bayes' Formula for Prediction:
$$P(Y \mid X) = \frac{P(X, Y)}{\sum_{Y'} P(X, Y')}$$
Core Assumptions (restrictive):
Observation Independence: $P(x_t \mid Y, X \setminus x_t) = P(x_t \mid y_t)$, i.e., each observation depends only on its own label.
Markov Property: $P(y_t \mid y_1, \dots, y_{t-1}) = P(y_t \mid y_{t-1})$.
Practical Issue: In POS tagging, the label of the current word may depend on the previous word, the next word, word suffixes, and many other features - the observation independence assumption is unreasonable!
Advantages of Discriminative Models
CRF Idea: Directly model $P(Y \mid X)$.
Advantages:
No observation independence assumption: features can overlap arbitrarily.
Flexible feature engineering: global features, dictionary features, morphological features, etc. can all be introduced.
Strong labeling performance: discriminative models usually outperform generative models in labeling tasks.
Cost: higher training complexity (the partition function must be computed).
From HMM to MEMM to CRF
Maximum Entropy Markov Model (MEMM):
$$P(Y \mid X) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}, x_t)$$
Problem: Label Bias - local normalization causes the model to prefer states with fewer outgoing transition options.
Linear-Chain CRF: Potential Functions and Features
Potential Function: $\psi_t(y_{t-1}, y_t, X) = \exp\big(\sum_k \lambda_k f_k(y_{t-1}, y_t, X, t)\big)$ represents the score at position $t$.
Partition Function (normalization factor):
$$Z(X) = \sum_{Y} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, X)$$
Feature Function Decomposition
Transition Features: depend on the previous label and the current label. Example: $f(y_{t-1}, y_t, X, t) = \mathbb{1}[y_{t-1} = \text{B-PER} \wedge y_t = \text{I-PER}]$.
State Features (Emission Features): depend on the current label and the observation. Example: $f(y_{t-1}, y_t, X, t) = \mathbb{1}[x_t = \text{Obama} \wedge y_t = \text{I-PER}]$.
Potential Function Expansion:
$$\psi_t(y_{t-1}, y_t, X) = \exp\Big(\sum_k \lambda_k t_k(y_{t-1}, y_t, X, t) + \sum_l \mu_l s_l(y_t, X, t)\Big)$$
Parameterized Form
Unified Representation: merge all feature functions (transition and state) into a single family $f_k(y_{t-1}, y_t, X, t)$ with weights $w_k$.
Simplified Potential Function:
$$\psi_t(y_{t-1}, y_t, X) = \exp\big(w^\top f(y_{t-1}, y_t, X, t)\big)$$
Global Score: $\mathrm{score}(Y, X) = w^\top F(Y, X)$, where $F(Y, X) = \sum_{t=1}^{T} f(y_{t-1}, y_t, X, t)$ is the global feature vector.
Complete CRF Form:
$$P(Y \mid X) = \frac{\exp\big(w^\top F(Y, X)\big)}{Z(X)}, \qquad Z(X) = \sum_{Y'} \exp\big(w^\top F(Y', X)\big)$$
Matrix Form Representation
Define: the score matrix for moving from label $y'$ to label $y$ at position $t$:
$$M_t(y', y \mid X) = \exp\big(w^\top f(y', y, X, t)\big)$$
Matrix Product for Sequence Score:
$$\exp\big(\mathrm{score}(Y, X)\big) = \prod_{t=1}^{T} M_t(y_{t-1}, y_t \mid X)$$
Partition Function:
$$Z(X) = \mathbf{1}^\top M_1 M_2 \cdots M_T \mathbf{1}$$
where $\mathbf{1}$ is the all-ones vector and each $M_t$ is an $N \times N$ matrix ($N$ = number of labels).
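The matrix identity for $Z(X)$ is easy to check numerically. Below is a minimal sketch (random score matrices stand in for learned $M_t$, and the initial label is summed over freely rather than fixed to a START tag) comparing the matrix-product value against brute-force enumeration:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
T, K = 4, 3                      # sequence length, number of labels
# M[t][i, j]: exp-score of moving from label i to label j at step t+1
M = [np.exp(rng.normal(size=(K, K))) for _ in range(T)]

# Z(X) as a chain of matrix products: 1^T M_1 M_2 ... M_T 1
row = np.ones(K)
for Mt in M:
    row = row @ Mt
Z_matrix = row @ np.ones(K)

# brute force: sum the path score over every possible label sequence
Z_brute = 0.0
for path in product(range(K), repeat=T + 1):
    score = 1.0
    for t in range(T):
        score *= M[t][path[t], path[t + 1]]
    Z_brute += score

assert np.isclose(Z_matrix, Z_brute)
```

The matrix form costs $O(TK^2)$ while the enumeration costs $O(K^{T+1})$, which is exactly the gap the forward algorithm exploits.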
Forward-Backward Algorithm: CRF Version
Forward Algorithm Derivation
Forward Variable: $\alpha_t(y)$ represents the unnormalized probability of all paths ending at label $y$ at position $t$.
Initialization ($t = 1$):
$$\alpha_1(y) = M_1(y_0, y \mid X)$$
where $y_0$ is the start symbol (usually set to a special START tag).
Recursion ($t = 2, \dots, T$):
$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \, M_t(y', y \mid X)$$
Termination:
$$Z(X) = \sum_{y} \alpha_T(y)$$
Physical Meaning: $\alpha_t(y)$ accumulates the sum of scores of all partial sequences of length $t$ ending with label $y$.
Backward Algorithm Derivation
Backward Variable: $\beta_t(y)$ represents the unnormalized probability of all paths from position $t$ with label $y$ to the end of the sequence.
Initialization ($t = T$):
$$\beta_T(y) = 1$$
(equivalently, a unit-score transition into the end symbol $y_{T+1}$, an END tag).
Recursion ($t = T-1, \dots, 1$):
$$\beta_t(y) = \sum_{y'} M_{t+1}(y, y' \mid X) \, \beta_{t+1}(y')$$
Verification: for every position $t$,
$$Z(X) = \sum_{y} \alpha_t(y) \, \beta_t(y)$$
Marginal Probability Computation
Marginal Probability of a Single Label:
$$P(y_t = y \mid X) = \frac{\alpha_t(y) \, \beta_t(y)}{Z(X)}$$
Marginal Probability of Adjacent Labels:
$$P(y_{t-1} = y', y_t = y \mid X) = \frac{\alpha_{t-1}(y') \, M_t(y', y \mid X) \, \beta_t(y)}{Z(X)}$$
These marginal probabilities are the core of parameter learning!
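These identities can be verified numerically. A minimal sketch, assuming uniform (unit) start and stop scores rather than explicit START/END tags; `forward_backward` and the matrices `M` are illustrative names, not code from this chapter:

```python
import numpy as np

def forward_backward(M):
    """alpha/beta recursions over K x K score matrices M[t]; returns alpha, beta, Z."""
    T, K = len(M), M[0].shape[0]
    alpha = np.zeros((T + 1, K))
    beta = np.zeros((T + 1, K))
    alpha[0] = 1.0                       # unit start scores
    for t in range(T):
        alpha[t + 1] = alpha[t] @ M[t]   # alpha recursion (0-indexed)
    beta[T] = 1.0                        # unit stop scores
    for t in range(T - 1, -1, -1):
        beta[t] = M[t] @ beta[t + 1]     # beta recursion (0-indexed)
    return alpha, beta, alpha[T].sum()

rng = np.random.default_rng(1)
M = [np.exp(rng.normal(size=(4, 4))) for _ in range(5)]
alpha, beta, Z = forward_backward(M)

# the verification identity Z = sum_y alpha_t(y) beta_t(y) holds at every cut t
for t in range(len(M) + 1):
    assert np.isclose(alpha[t] @ beta[t], Z)

# single-label marginals at any position form a probability distribution
marginal = alpha[3] * beta[3] / Z
assert np.isclose(marginal.sum(), 1.0)
```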
Log-Space Computation
Numerical Stability: work with $\log \alpha_t(y)$ and $\log \beta_t(y)$. Using the log-sum-exp trick:
$$\log \sum_i e^{a_i} = \max_j a_j + \log \sum_i e^{a_i - \max_j a_j}$$
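A small sketch of the trick; `logsumexp` here is a hand-rolled illustrative helper (SciPy ships an equivalent `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(a, axis=None):
    """Stable log(sum(exp(a))): subtract the max before exponentiating."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out.item() if axis is None else np.squeeze(out, axis=axis)

# naive np.log(np.exp(scores).sum()) overflows here; the shifted form is finite
scores = np.array([1000.0, 1001.0, 999.0])
assert np.isfinite(logsumexp(scores))

# one log-space forward step: log alpha_t = logsumexp over the previous-label axis
log_alpha_prev = np.log(np.full(3, 1.0 / 3.0))
log_M = np.random.default_rng(2).normal(size=(3, 3))   # log M_t(y', y)
log_alpha = logsumexp(log_alpha_prev[:, None] + log_M, axis=0)
```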
Parameter Learning: Maximum Likelihood Estimation
Objective Function
Training Data: $\{(X^{(i)}, Y^{(i)})\}_{i=1}^{N}$
Log-Likelihood:
$$\mathcal{L}(w) = \sum_{i=1}^{N} \log P(Y^{(i)} \mid X^{(i)}; w)$$
Expanded:
$$\mathcal{L}(w) = \sum_{i=1}^{N} \Big[ w^\top F(Y^{(i)}, X^{(i)}) - \log Z(X^{(i)}) \Big]$$
Adding L2 Regularization:
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) - \frac{\lambda}{2} \|w\|^2$$
Optimization Objective: $w^* = \arg\max_w \mathcal{L}_{\text{reg}}(w)$
Detailed Gradient Computation
Gradient of the Log-Likelihood:
$$\frac{\partial \mathcal{L}}{\partial w_k} = \sum_{i=1}^{N} \frac{\partial}{\partial w_k} \Big[ w^\top F(Y^{(i)}, X^{(i)}) - \log Z(X^{(i)}) \Big]$$
First Term (simple):
$$\frac{\partial}{\partial w_k} \, w^\top F(Y^{(i)}, X^{(i)}) = F_k(Y^{(i)}, X^{(i)})$$
Second Term (requires derivation):
$$\frac{\partial \log Z(X)}{\partial w_k} = \frac{1}{Z(X)} \sum_{Y} \exp\big(w^\top F(Y, X)\big) \, F_k(Y, X) = \sum_{Y} P(Y \mid X) \, F_k(Y, X)$$
Therefore:
$$\frac{\partial \log Z(X)}{\partial w_k} = \mathbb{E}_{P(Y \mid X)}\big[F_k(Y, X)\big]$$
Final Gradient Form:
$$\frac{\partial \mathcal{L}}{\partial w_k} = \sum_{i=1}^{N} \Big[ F_k(Y^{(i)}, X^{(i)}) - \mathbb{E}_{P(Y \mid X^{(i)})}\big[F_k(Y, X^{(i)})\big] \Big]$$
Physical Interpretation: empirical feature expectation minus model feature expectation.
Efficient Computation of the Expectation
Problem: direct computation of $\mathbb{E}_{P(Y \mid X)}[F_k(Y, X)]$ requires enumerating $|\mathcal{Y}|^T$ sequences!
Solution: exploit the linear-chain structure and marginal probabilities.
Using the Forward-Backward Algorithm:
$$\mathbb{E}_{P(Y \mid X)}\big[F_k(Y, X)\big] = \sum_{t=1}^{T} \sum_{y', y} P(y_{t-1} = y', y_t = y \mid X) \, f_k(y', y, X, t)$$
Complexity: $O(T |\mathcal{Y}|^2)$ - the same as the forward-backward algorithm!
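This decomposition can be checked on a toy model with pure transition features (one indicator per label pair): the expected counts from forward-backward marginals must match brute-force enumeration. A sketch, assuming unit start/stop scores:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
T, K = 4, 3
M = [np.exp(rng.normal(size=(K, K))) for _ in range(T)]   # random score matrices

# forward-backward with unit start/stop scores
alpha = [np.ones(K)]
for t in range(T):
    alpha.append(alpha[-1] @ M[t])
beta = [np.ones(K)]
for t in range(T - 1, -1, -1):
    beta.insert(0, M[t] @ beta[0])
Z = alpha[-1].sum()

# expected count of the transition feature (i -> j), summed over positions:
# E[f_ij] = sum_t alpha_t(i) * M_t(i, j) * beta_{t+1}(j) / Z
E = sum(np.outer(alpha[t], beta[t + 1]) * M[t] for t in range(T)) / Z

# brute force: accumulate the same counts over every label path
E_brute = np.zeros((K, K))
Z_brute = 0.0
for path in product(range(K), repeat=T + 1):
    s = np.prod([M[t][path[t], path[t + 1]] for t in range(T)])
    Z_brute += s
    for t in range(T):
        E_brute[path[t], path[t + 1]] += s
E_brute /= Z_brute

assert np.allclose(E, E_brute)
assert np.isclose(E.sum(), T)   # exactly one transition fires at each of the T steps
```

The gradient coordinate for feature $(i, j)$ is then the empirical count in the training pair minus this expectation.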
Gradient Descent and LBFGS Optimization
Gradient Descent:
$$w \leftarrow w + \eta \, \nabla_w \mathcal{L}_{\text{reg}}(w)$$
Problem: slow convergence, and the learning rate $\eta$ must be tuned. LBFGS instead builds a low-memory approximation of the inverse Hessian from recent gradients, requires no learning-rate tuning, and typically converges in far fewer iterations on the convex CRF objective.
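In practice the regularized objective is usually handed to an off-the-shelf LBFGS routine. A sketch using `scipy.optimize.minimize`, with a simple quadratic standing in for the (also convex) negative CRF log-likelihood; `A`, `b`, and `lam` are illustrative stand-ins, not quantities from this chapter:

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 2.0, 3.0])     # stand-in curvature
b = np.array([1.0, -2.0, 0.5])   # stand-in empirical feature counts
lam = 0.1                        # L2 strength

def objective(w):
    """Return (value, gradient) of the surrogate -log L(w) + (lam/2)||w||^2."""
    value = 0.5 * w @ A @ w - b @ w + 0.5 * lam * w @ w
    grad = A @ w - b + lam * w
    return value, grad

res = minimize(objective, np.zeros(3), jac=True, method='L-BFGS-B')

# for this quadratic the optimum is available in closed form: (A + lam*I) w = b
w_star = np.linalg.solve(A + lam * np.eye(3), b)
assert res.success and np.allclose(res.x, w_star, atol=1e-5)
```

For a real CRF, `objective` would run forward-backward on every training sequence to produce the value and gradient derived above.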
```python
def ner_example():
    """Named Entity Recognition example."""
    # Training data (BIO tagging)
    X_train = [
        ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii'],
        ['Apple', 'Inc.', 'is', 'located', 'in', 'Cupertino'],
        ['The', 'company', 'was', 'founded', 'in', '1976']
    ]
    Y_train = [
        ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC'],
        ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC'],
        ['O', 'O', 'O', 'O', 'O', 'B-DATE']
    ]
    # Test data
    X_test = [
        ['Steve', 'Jobs', 'founded', 'Apple'],
        ['Google', 'is', 'in', 'California']
    ]
    # Train CRF
    labels = set(label for seq in Y_train for label in seq)
    crf = LinearChainCRF(labels, l2_lambda=0.1)
    crf.fit(X_train, Y_train, max_iter=50)
    # Predict
    Y_pred = crf.predict(X_test)
    for X, Y in zip(X_test, Y_pred):
        print(f"Sentence: {' '.join(X)}")
        print(f"Predicted labels: {' '.join(Y)}")
        print()
```
Advanced Topics and
Optimization
Feature Template Design
Window Features:
```python
# Window of 2 words before and after
for offset in [-2, -1, 0, 1, 2]:
    if 0 <= t + offset < T:
        features.append(f"word[{offset}]={X[t+offset]},label={y_t}")
```
Dictionary Features:
```python
if word in person_dict:
    features.append(f"in_person_dict=True,label={y_t}")
```
Regex Features:
```python
import re

if re.match(r'\d{4}-\d{2}-\d{2}', word):  # Date format
    features.append(f"date_pattern=True,label={y_t}")
```
Applications: Chinese word segmentation, named
entity recognition
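The three template families above can be combined into one extractor. A minimal self-contained sketch; `person_dict` and `extract_features` are illustrative names, not part of any library:

```python
import re

person_dict = {"Obama", "Jobs"}   # toy gazetteer

def extract_features(X, t, y_t):
    """Window, dictionary, and regex templates for position t of sentence X."""
    feats = []
    T = len(X)
    for offset in [-2, -1, 0, 1, 2]:          # window features
        if 0 <= t + offset < T:
            feats.append(f"word[{offset}]={X[t + offset]},label={y_t}")
    if X[t] in person_dict:                   # dictionary feature
        feats.append(f"in_person_dict=True,label={y_t}")
    if re.match(r"\d{4}-\d{2}-\d{2}", X[t]):  # date-pattern feature
        feats.append(f"date_pattern=True,label={y_t}")
    return feats

feats = extract_features(["Barack", "Obama", "born", "1961-08-04"], 1, "I-PER")
assert "in_person_dict=True,label=I-PER" in feats
```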
Structured Perceptron
Online learning version: does not need to compute the partition function.
Update Rule: decode $\hat{Y} = \arg\max_Y w^\top F(Y, X)$ with Viterbi; if $\hat{Y} \neq Y$, update
$$w \leftarrow w + F(Y, X) - F(\hat{Y}, X)$$
Advantage: fast training, but it does not produce probabilities the way a CRF does.
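A minimal sketch of the update on precomputed global feature vectors; `feat_gold` and `feat_pred` stand for $F(X, Y)$ and $F(X, \hat{Y})$, with $\hat{Y}$ produced by Viterbi decoding under the current weights (omitted here):

```python
import numpy as np

def perceptron_update(w, feat_gold, feat_pred):
    """w <- w + F(X, Y) - F(X, Y_hat); no partition function is ever computed."""
    return w + feat_gold - feat_pred

w = np.zeros(4)
feat_gold = np.array([1.0, 0.0, 2.0, 0.0])   # features fired by the gold labels
feat_pred = np.array([0.0, 1.0, 1.0, 1.0])   # features fired by the wrong decode
w = perceptron_update(w, feat_gold, feat_pred)

# after one update the gold structure already scores strictly higher
assert w @ feat_gold > w @ feat_pred
```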
In-Depth Q&A
Q1: What are the core advantages of CRF compared to
HMM?
A: CRF is a discriminative model that directly models $P(Y \mid X)$, while HMM is a generative model that models $P(X, Y)$. Specific advantages:
1. No observation independence assumption: HMM assumes $P(x_t \mid Y, X \setminus x_t) = P(x_t \mid y_t)$; CRF allows arbitrary overlapping features.
2. Flexible feature engineering: CRF can use global features and overlapping features (like surrounding words and dictionary matching).
3. Better sequence labeling performance: on NER and POS tagging tasks, CRF typically outperforms HMM by 5-10 percentage points.
Q2: Why does CRF need forward-backward algorithm to compute
gradients?
A: The gradient contains the expectation term $\mathbb{E}_{P(Y \mid X)}[F_k(Y, X)]$; direct computation requires enumerating $|\mathcal{Y}|^T$ sequences. The forward-backward algorithm exploits the linear-chain structure to decompose the expectation into a sum over pairwise marginal probabilities:
$$\mathbb{E}\big[F_k\big] = \sum_{t=1}^{T} \sum_{y', y} P(y_{t-1} = y', y_t = y \mid X) \, f_k(y', y, X, t)$$
Each marginal $P(y_{t-1}, y_t \mid X)$ can be computed efficiently from $\alpha$ and $\beta$, reducing the complexity to $O(T |\mathcal{Y}|^2)$.
Q3: Why is the partition function Z(X) hard to
compute?
A: The partition function is defined as
$$Z(X) = \sum_{Y} \exp\big(w^\top F(Y, X)\big)$$
It requires summing over all $|\mathcal{Y}|^T$ possible label sequences. Although the forward algorithm can compute it efficiently ($O(T |\mathcal{Y}|^2)$), it must be recomputed for every parameter update, leading to high training complexity. This is the main cost of CRF compared to the perceptron.
Q4: Why is L2 regularization important for CRF?
A: Two reasons:
1. Prevent overfitting: the number of features is usually very large (tens of thousands to millions); the L2 penalty $\frac{\lambda}{2}\|w\|^2$ prevents weight explosion.
2. Preserve convexity: the original objective is already convex; adding L2 makes it strongly convex (as a minimization problem), ensuring LBFGS converges to the global optimum.
$\lambda$ is usually selected through cross-validation.
Q5: What's the difference between Viterbi decoding and
marginal decoding?
A: Two decoding strategies:
- Viterbi decoding: find the single sequence with maximum joint probability, $\hat{Y} = \arg\max_Y P(Y \mid X)$.
- Marginal decoding: independently select the label with maximum marginal probability at each position, $\hat{y}_t = \arg\max_y P(y_t = y \mid X)$.
Viterbi guarantees global consistency (e.g., it can respect BIO tagging constraints); marginal decoding may produce illegal sequences. Viterbi is more commonly used in practice.
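A compact Viterbi sketch over the score matrices $M_t$ (illustrative code, assuming uniform start scores), checked against brute-force search:

```python
import numpy as np
from itertools import product

def viterbi(M):
    """Highest-scoring label path y_0..y_T under K x K score matrices M[t]."""
    T, K = len(M), M[0].shape[0]
    delta = np.zeros((T + 1, K))             # best log-score ending in each label
    back = np.zeros((T + 1, K), dtype=int)   # argmax back-pointers
    logM = [np.log(Mt) for Mt in M]          # work in log-space for stability
    for t in range(T):
        scores = delta[t][:, None] + logM[t] # scores[prev, cur]
        back[t + 1] = scores.argmax(axis=0)
        delta[t + 1] = scores.max(axis=0)
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):                # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(4)
M = [np.exp(rng.normal(size=(3, 3))) for _ in range(4)]
best = max(product(range(3), repeat=5),
           key=lambda p: np.prod([M[t][p[t], p[t + 1]] for t in range(4)]))
assert viterbi(M) == list(best)
```

Marginal decoding would instead take the argmax of $\alpha_t(y)\beta_t(y)$ independently at each $t$, which can output adjacent labels that never score well jointly.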
Q6: How to handle large label sets?
A: When the number of labels $|\mathcal{Y}|$ is large (in character tagging for Chinese word segmentation the label set is small, e.g. the BMES scheme, but word-level tagging may have thousands of labels), the $O(T |\mathcal{Y}|^2)$ complexity becomes unacceptable. Solutions:
1. Beam search approximation: keep only the top-K candidate states.
2. Hierarchical labeling: coarse-grained labeling first, then fine-grained.
3. Neural CRF: use neural networks to parameterize transition scores, reducing parameters.
Q7: Can CRF handle non-sequential structures?
A: Yes! General CRFs can be defined on arbitrary graph structures:
- Tree CRF: dependency parsing
- Grid CRF: image segmentation (2D CRF)
- Fully-connected CRF: globally normalized models with pairwise potentials between all node pairs
But non-chain structures have higher inference complexity, usually requiring approximate algorithms (like loopy belief propagation).
Q8: Why does MEMM have the label bias problem?
A: MEMM uses local normalization:
$$P(Y \mid X) = \prod_{t=1}^{T} \frac{\exp\big(s(y_{t-1}, y_t, x_t)\big)}{\sum_{y'} \exp\big(s(y_{t-1}, y', x_t)\big)}$$
Problem: states with fewer outgoing transition options (like deterministic states) have small normalization denominators, so their transition probabilities tend to 1 and the model prefers these states regardless of the observations. CRF solves this through global normalization:
$$P(Y \mid X) = \frac{\exp\big(\mathrm{score}(Y, X)\big)}{Z(X)}$$
All paths compete over the same denominator $Z(X)$, avoiding the local bias.
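The effect is visible in a two-line numeric sketch: under local normalization, a state with a single outgoing transition assigns it probability 1 no matter how weak its score is (toy numbers, not from the text):

```python
import numpy as np

def local_normalize(scores):
    """MEMM-style per-state softmax over outgoing transition scores."""
    e = np.exp(scores)
    return e / e.sum()

scores_from_A = np.array([2.0, 1.0, 0.5])   # state A: three competing successors
scores_from_B = np.array([0.1])             # state B: one weakly supported successor

# B's lone transition gets probability 1.0 regardless of its low score, so paths
# through B never "pay" for weak evidence, while paths through A always do
assert np.isclose(local_normalize(scores_from_B)[0], 1.0)
assert local_normalize(scores_from_A).max() < 1.0
```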
Q9: How does the number of features affect CRF
performance?
A: The relationship between the number of features and performance:
- Too few: underfitting, complex patterns cannot be captured
- Too many: overfitting, slow training, poor generalization
Rule of thumb:
- Basic features (words, POS, affixes): thousands to tens of thousands
- Adding dictionary features: tens of thousands to hundreds of thousands
- Use regularization plus feature selection (like L1) to control complexity
Q10: How to combine CRF with deep learning?
A: Modern sequence labeling typically uses BiLSTM-CRF:
- BiLSTM: learns contextual representations
- CRF: models label dependencies and enforces legal label sequences
Structure: input embeddings feed a BiLSTM, whose output serves as the emission scores for the CRF; the CRF layer learns the transition matrix. The two parts are jointly optimized during training, and Viterbi is used for inference.
Q11: How to evaluate CRF models?
A: Evaluation metrics for sequence labeling tasks:
1. Token-level accuracy: correctness of individual labels
2. Span-level F1: entity-level precision, recall, and F1 (more important)
 - Precision = correct entities / predicted entities
 - Recall = correct entities / true entities
3. Confusion matrix: analyze which labels are easily confused
Q12: How to optimize CRF training time?
A: Optimization strategies:
1. Feature hashing: use hash functions to reduce feature dimensionality
2. Mini-batch gradients: don't compute gradients over the full dataset at every step
3. Parallelization: the forward-backward passes for different sequences can run in parallel
4. Early stopping: monitor validation-set performance to avoid overfitting
5. Pre-training initialization: quickly pre-train weights with the perceptron
References
[1] Lafferty, J., McCallum, A., & Pereira, F. C. (2001).
Conditional random fields: Probabilistic models for segmenting and
labeling sequence data. ICML.
[2] Sutton, C., & McCallum, A. (2012). An introduction to
conditional random fields. Foundations and Trends in Machine
Learning, 4(4), 267-373.
[3] Sha, F., & Pereira, F. (2003). Shallow parsing with
conditional random fields. NAACL-HLT.
[4] Wallach, H. M. (2004). Conditional random fields: An
introduction. Technical Report MS-CIS-04-21, University of
Pennsylvania.
[5] Sarawagi, S., & Cohen, W. W. (2005). Semi-Markov conditional
random fields for information extraction. NIPS.
[6] Collins, M. (2002). Discriminative training methods for hidden
Markov models: Theory and experiments with perceptron algorithms.
EMNLP.
[7] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF
models for sequence tagging. arXiv preprint
arXiv:1508.01991.
[8] Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via
bi-directional LSTM-CNNs-CRF. ACL.
[9] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., &
Dyer, C. (2016). Neural architectures for named entity recognition.
NAACL-HLT.
[10] Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS.
Post title: Machine Learning Math Derivations (16): Conditional Random Fields
Post author: Chen Kai
Create time: 2025-05-05 00:00:00
Post link: https://www.chenk.top/ml-math-16-conditional-random-field/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.