Machine Learning Math Derivations (16): Conditional Random Fields
Chen Kai

Conditional Random Fields (CRF) are discriminative models for sequence labeling - unlike HMMs, CRFs directly model the conditional probability $P(\mathbf{Y} \mid \mathbf{X})$ rather than the joint probability $P(\mathbf{X}, \mathbf{Y})$, thereby avoiding the observation independence assumption and allowing flexible use of overlapping features. From named entity recognition to part-of-speech tagging, from information extraction to image segmentation, CRFs achieve strong performance in sequence modeling through clever undirected graph structures and feature engineering. This chapter systematically derives the potential functions of linear-chain CRFs, the partition function, the forward-backward algorithm, gradient computation for parameter learning, and LBFGS optimization.

From HMM to CRF: The Leap from Generative to Discriminative

Review of HMM Limitations

HMM Modeling: Joint probability

$$P(\mathbf{X}, \mathbf{Y}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$

Using Bayes' Formula for Prediction:

$$\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X}) = \arg\max_{\mathbf{Y}} \frac{P(\mathbf{X}, \mathbf{Y})}{P(\mathbf{X})} = \arg\max_{\mathbf{Y}} P(\mathbf{X}, \mathbf{Y})$$

Core Assumptions (restrictive):

  1. Observation Independence: $P(x_t \mid \mathbf{Y}, \mathbf{X}_{\setminus t}) = P(x_t \mid y_t)$

  2. Markov Property: $P(y_t \mid y_{1:t-1}) = P(y_t \mid y_{t-1})$

Practical Issue: In POS tagging, the label of the current word may depend on the previous word, the next word, the word's suffix, and many other features - the observation independence assumption is unreasonable!

Advantages of Discriminative Models

CRF Idea: Directly model $P(\mathbf{Y} \mid \mathbf{X})$. Advantages:

  1. No observation independence assumption: Features can overlap arbitrarily
  2. Flexible feature engineering: Can introduce global features, dictionary features, morphological features, etc.
  3. Strong labeling performance: Discriminative models usually outperform generative models in labeling tasks

Cost: Higher training complexity (need to compute partition function)

From HMM to MEMM to CRF

Maximum Entropy Markov Model (MEMM): each step is a locally normalized classifier,

$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}, x_t)$$

Problem: Label Bias - local normalization causes the model to prefer states with fewer transition options
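The bias can be seen with a toy numerical example (the scores below are invented for illustration, not from the article): under per-state softmax normalization, a state with only one legal successor hands that successor probability 1 no matter how weak its score is.

```python
import numpy as np

def local_softmax(scores):
    """Normalize each row of transition scores independently (MEMM-style)."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy transition scores out of states {0, 1}; state 1 has a single legal
# successor (the other transition is forbidden, score -inf).
scores = np.array([
    [1.0, 2.0],        # state 0: two competing successors
    [-np.inf, 0.5],    # state 1: only one legal successor
])
P = local_softmax(scores)
print(P[1])  # state 1 transitions with probability 1 regardless of its score
```

However small the raw score 0.5 is made, row 1 still normalizes to probability 1, which is exactly the preference for low-branching states described above.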

CRF Solution: Global normalization - normalize once over entire label sequences with a single partition function $Z(\mathbf{X})$, so that all paths compete against the same denominator.

Mathematical Framework of Linear-Chain CRF

Basic Definitions

Input: observation sequence $\mathbf{X} = (x_1, \dots, x_T)$. Output: label sequence $\mathbf{Y} = (y_1, \dots, y_T)$.

Conditional Probability:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, \mathbf{X})$$

Where:

  • Potential Function: $\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\left(\sum_k w_k f_k(y_{t-1}, y_t, \mathbf{X}, t)\right)$ represents the score at position $t$
  • Partition Function (normalization factor): $Z(\mathbf{X}) = \sum_{\mathbf{Y}} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, \mathbf{X})$

Feature Function Decomposition

Transition Features: depend on the previous label and the current label. Example:

$$t_k(y_{t-1}, y_t, \mathbf{X}, t) = \mathbb{1}[y_{t-1} = \text{B-PER},\ y_t = \text{I-PER}]$$

State Features (Emission Features): depend on the current label and the observation. Example:

$$s_l(y_t, \mathbf{X}, t) = \mathbb{1}[y_t = \text{I-PER},\ x_t = \text{Obama}]$$

Potential Function Expansion:

$$\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\left(\sum_k \lambda_k\, t_k(y_{t-1}, y_t, \mathbf{X}, t) + \sum_l \mu_l\, s_l(y_t, \mathbf{X}, t)\right)$$
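Both kinds of feature functions are just indicator functions of the clique $(y_{t-1}, y_t, \mathbf{X}, t)$. A minimal sketch (the label names and sentence are illustrative):

```python
def t_per_chain(y_prev, y_curr, X, t):
    """Transition feature: fires when B-PER is followed by I-PER."""
    return 1.0 if (y_prev, y_curr) == ('B-PER', 'I-PER') else 0.0

def s_obama_per(y_curr, X, t):
    """State (emission) feature: fires when the current word is 'Obama' labeled I-PER."""
    return 1.0 if y_curr == 'I-PER' and X[t] == 'Obama' else 0.0

X = ['Barack', 'Obama', 'was']
print(t_per_chain('B-PER', 'I-PER', X, 1))  # 1.0
print(s_obama_per('I-PER', X, 1))           # 1.0
print(s_obama_per('O', X, 2))               # 0.0
```

A real tagger generates thousands of such indicators automatically from templates, as the feature-extractor code later in the post does.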

Parameterized Form

Unified Representation: merge transition and state features into a single family $f_k(y_{t-1}, y_t, \mathbf{X}, t)$ with weights $w_k$.

Simplified Potential Function:

$$\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\left(\sum_k w_k f_k(y_{t-1}, y_t, \mathbf{X}, t)\right)$$

Global Score:

$$s(\mathbf{X}, \mathbf{Y}) = \mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$$

Where $F(\mathbf{X}, \mathbf{Y}) = \sum_{t=1}^{T} f(y_{t-1}, y_t, \mathbf{X}, t)$ is the global feature vector.

Complete CRF Form:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{\exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\right)}{\sum_{\mathbf{Y}'} \exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y}')\right)}$$

Matrix Form Representation

Define: $M_t(i, j) = \psi_t(y_{t-1} = i, y_t = j, \mathbf{X})$, the score matrix from label $i$ to label $j$ at position $t$.

Matrix Product for Sequence Score:

$$\text{score}(\mathbf{Y}) = \prod_{t=1}^{T} M_t(y_{t-1}, y_t)$$

Partition Function:

$$Z(\mathbf{X}) = \mathbf{1}^\top \left(\prod_{t=1}^{T} M_t\right) \mathbf{1}$$

Where $\mathbf{1}$ is the all-ones vector and each $M_t$ is an $L \times L$ matrix ($L$ = number of labels). With explicit START/END symbols, $Z(\mathbf{X})$ is instead the (START, END) entry of the matrix product.
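The matrix identity can be checked numerically on a toy chain (random positive score matrices; the START boundary is collapsed into a start vector here, a simplification of the article's bookkeeping):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
L, T = 3, 4
# Random positive score matrices; 'start' plays the role of M_1(START, j).
start = rng.uniform(0.5, 1.5, size=L)            # scores START -> j
M = rng.uniform(0.5, 1.5, size=(T - 1, L, L))    # M_t(i, j) for t = 2..T

# Matrix-product form: Z = start^T M_2 ... M_T 1
v = start.copy()
for t in range(T - 1):
    v = v @ M[t]
Z_matrix = v.sum()

# Brute force over all L^T label sequences
Z_brute = 0.0
for y in product(range(L), repeat=T):
    score = start[y[0]]
    for t in range(1, T):
        score *= M[t - 1][y[t - 1], y[t]]
    Z_brute += score

print(Z_matrix, Z_brute)  # the two values agree
```

The matrix product touches $O(T L^2)$ entries, while the brute-force sum enumerates $L^T = 81$ sequences; the agreement is exactly the distributive law the forward algorithm exploits.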

Forward-Backward Algorithm: CRF Version

Forward Algorithm Derivation

Forward Variable: $\alpha_t(j)$ represents the unnormalized probability of all paths ending with label $j$ at position $t$:

$$\alpha_t(j) = \sum_{y_1, \dots, y_{t-1}} \prod_{s=1}^{t} M_s(y_{s-1}, y_s), \qquad y_t = j$$

Initialization ($t = 1$): $\alpha_1(j) = M_1(y_0, j)$, where $y_0$ is the start symbol (usually set to a special START tag).

Recursion ($t = 2, \dots, T$):

$$\alpha_t(j) = \sum_{i} \alpha_{t-1}(i)\, M_t(i, j)$$

Termination:

$$Z(\mathbf{X}) = \sum_{j} \alpha_T(j)$$

Physical Meaning: $\alpha_t(j)$ accumulates the sum of scores of all partial sequences of length $t$ ending with label $j$.

Backward Algorithm Derivation

Backward Variable: $\beta_t(i)$ represents the unnormalized probability of all paths from position $t$ with label $i$ to the end of the sequence.

Initialization ($t = T$): $\beta_T(i) = 1$ for all $i$; with an explicit end symbol (END tag), $\beta_T(i) = M_{T+1}(i, \text{END})$.

Recursion ($t = T-1, \dots, 1$):

$$\beta_t(i) = \sum_{j} M_{t+1}(i, j)\, \beta_{t+1}(j)$$

Verification: for any position $t$,

$$Z(\mathbf{X}) = \sum_{j} \alpha_t(j)\, \beta_t(j)$$
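The verification identity can be exercised on a toy chain (non-log space for clarity, START collapsed into a start vector, no explicit END tag - all simplifications of the article's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
L, T = 3, 5
start = rng.uniform(0.5, 1.5, size=L)          # alpha_1(j) = score(START -> j)
M = rng.uniform(0.5, 1.5, size=(T - 1, L, L))  # M_t(i, j) for t = 2..T

# Forward: alpha_t(j) = sum_i alpha_{t-1}(i) M_t(i, j)
alpha = np.zeros((T, L))
alpha[0] = start
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ M[t - 1]

# Backward: beta_T(i) = 1; beta_t(i) = sum_j M_{t+1}(i, j) beta_{t+1}(j)
beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = M[t] @ beta[t + 1]

Z = alpha[-1].sum()
# Consistency check: sum_j alpha_t(j) beta_t(j) equals Z at every position t
for t in range(T):
    print(t, (alpha[t] * beta[t]).sum())
```

Every printed value equals the same $Z(\mathbf{X})$: each path is counted exactly once whichever position the chain is split at.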

Marginal Probability Computation

Marginal Probability of a Single Label:

$$P(y_t = j \mid \mathbf{X}) = \frac{\alpha_t(j)\, \beta_t(j)}{Z(\mathbf{X})}$$

Marginal Probability of Adjacent Labels:

$$P(y_{t-1} = i, y_t = j \mid \mathbf{X}) = \frac{\alpha_{t-1}(i)\, M_t(i, j)\, \beta_t(j)}{Z(\mathbf{X})}$$

These marginal probabilities are the core of parameter learning!
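Both formulas can be checked on the same kind of toy chain as above (start vector in place of the START tag, non-log space - illustrative simplifications): node marginals should sum to 1 at each position, pairwise marginals should sum to 1 over all $(i, j)$, and summing a pairwise marginal over $j$ should recover the node marginal.

```python
import numpy as np

rng = np.random.default_rng(2)
L, T = 3, 4
start = rng.uniform(0.5, 1.5, size=L)
M = rng.uniform(0.5, 1.5, size=(T - 1, L, L))

alpha = np.zeros((T, L))
alpha[0] = start
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ M[t - 1]
beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = M[t] @ beta[t + 1]
Z = alpha[-1].sum()

# Single-label marginal P(y_t = j | X) = alpha_t(j) beta_t(j) / Z
node = alpha * beta / Z                    # shape (T, L); each row sums to 1
# Pairwise marginal P(y_1 = i, y_2 = j | X) = alpha_1(i) M_2(i, j) beta_2(j) / Z
pair = alpha[0][:, None] * M[0] * beta[1][None, :] / Z
print(node.sum(axis=1))   # all ones
print(pair.sum())         # 1.0
```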

Log-Space Computation

Numerical Stability: work with $\log \alpha_t(j)$ and $\log \beta_t(i)$ instead of raw scores, using the log-sum-exp trick:

$$\log \sum_i e^{a_i} = a^* + \log \sum_i e^{a_i - a^*}, \qquad a^* = \max_i a_i$$
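A minimal implementation of the trick (in practice `scipy.special.logsumexp`, used later in the post, does the same thing):

```python
import numpy as np

def logsumexp_stable(a):
    """log(sum(exp(a))) computed without overflow by factoring out max(a)."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    if not np.isfinite(m):   # all -inf: the sum is 0, so the log is -inf
        return m
    return m + np.log(np.exp(a - m).sum())

big = np.array([1000.0, 1000.5, 999.0])
print(np.log(np.exp(big).sum()))   # naive: exp overflows, result is inf
print(logsumexp_stable(big))       # stable, finite answer
```

Forward scores grow multiplicatively with sequence length, so the naive form overflows almost immediately on real sentences; the subtracted maximum keeps every exponent at or below zero.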

Parameter Learning: Maximum Likelihood Estimation

Objective Function

Training Data: $\mathcal{D} = \{(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)})\}_{n=1}^{N}$

Log-Likelihood:

$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \log P(\mathbf{Y}^{(n)} \mid \mathbf{X}^{(n)})$$

Expanded:

$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \left[ \mathbf{w}^\top F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \log Z(\mathbf{X}^{(n)}) \right]$$

Adding L2 Regularization:

$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = \mathcal{L}(\mathbf{w}) - \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Optimization Objective: $\mathbf{w}^* = \arg\max_{\mathbf{w}} \mathcal{L}_{\text{reg}}(\mathbf{w})$

Detailed Gradient Computation

Gradient of Log-Likelihood:

$$\nabla_{\mathbf{w}} \mathcal{L} = \sum_{n=1}^{N} \left[ F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \nabla_{\mathbf{w}} \log Z(\mathbf{X}^{(n)}) \right]$$

First Term (simple): the global feature vector of the observed sequence, $F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)})$.

Second Term (requires derivation):

$$\nabla_{\mathbf{w}} \log Z(\mathbf{X}) = \frac{1}{Z(\mathbf{X})} \sum_{\mathbf{Y}} \exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\right) F(\mathbf{X}, \mathbf{Y}) = \mathbb{E}_{\mathbf{Y} \sim P(\mathbf{Y} \mid \mathbf{X})}\left[ F(\mathbf{X}, \mathbf{Y}) \right]$$

Therefore:

Final Gradient Form:

$$\nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}} = \sum_{n=1}^{N} \left[ F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X}^{(n)})}\left[ F(\mathbf{X}^{(n)}, \mathbf{Y}) \right] \right] - \lambda \mathbf{w}$$

Physical Interpretation: empirical feature expectation minus model feature expectation.
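The identity "gradient = empirical features minus expected features" can be verified by finite differences on a tiny CRF. This sketch uses invented random feature tensors and brute-force enumeration over all $L^T$ sequences (feasible only because the toy is small), purely to check the math:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
L, T, K = 3, 4, 5
feat = rng.normal(size=(T - 1, L, L, K))   # toy f(y_{t-1}, y_t) at steps 2..T
feat0 = rng.normal(size=(L, K))            # toy start features f(START, y_1)
w = rng.normal(size=K)
y_obs = (0, 2, 1, 1)                       # an observed "true" label sequence

def global_feats(y):
    F = feat0[y[0]].copy()
    for t in range(1, T):
        F += feat[t - 1, y[t - 1], y[t]]
    return F

def log_likelihood(w):
    scores = [w @ global_feats(y) for y in product(range(L), repeat=T)]
    m = max(scores)
    log_Z = m + np.log(sum(np.exp(s - m) for s in scores))
    return w @ global_feats(y_obs) - log_Z

def gradient(w):
    """Analytic gradient: F(y_obs) minus the model expectation of F."""
    ys = list(product(range(L), repeat=T))
    s = np.array([w @ global_feats(y) for y in ys])
    p = np.exp(s - s.max())
    p /= p.sum()
    E = sum(pi * global_feats(y) for pi, y in zip(p, ys))
    return global_feats(y_obs) - E

g = gradient(w)
eps = 1e-6
g_num = np.array([(log_likelihood(w + eps * np.eye(K)[k]) -
                   log_likelihood(w - eps * np.eye(K)[k])) / (2 * eps)
                  for k in range(K)])
print(np.max(np.abs(g - g_num)))  # agreement up to finite-difference error
```

In the real trainer the expectation is of course computed from forward-backward marginals rather than enumeration; this check only confirms the gradient formula itself.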

Efficient Computation of Expectation

Problem: direct computation of $\mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X})}[F(\mathbf{X}, \mathbf{Y})]$ requires enumerating $L^T$ sequences!

Solution: exploit the linear-chain structure and marginal probabilities.

Using the Forward-Backward Algorithm:

$$\mathbb{E}\left[ F_k(\mathbf{X}, \mathbf{Y}) \right] = \sum_{t=1}^{T} \sum_{i, j} P(y_{t-1} = i, y_t = j \mid \mathbf{X})\, f_k(i, j, \mathbf{X}, t)$$

Complexity: $O(T L^2)$ per sequence - same as the forward-backward algorithm!

Gradient Descent and LBFGS Optimization

Gradient Descent: $\mathbf{w} \leftarrow \mathbf{w} + \eta\, \nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}}(\mathbf{w})$

Problem: Slow convergence, need to tune learning rate

LBFGS Algorithm (Limited-memory BFGS):

  • Second-order optimization method, uses Hessian matrix approximation
  • Limited memory version, suitable for high-dimensional parameters
  • Standard choice for CRF training

Pseudocode:

```
Initialize w = 0
while not converged:
    Compute g = ∇L(w)  (via forward-backward algorithm)
    Update w using an LBFGS step
    if |L(w_new) - L(w_old)| < ε:
        break
```

Engineering Implementation: Use scipy.optimize.fmin_l_bfgs_b
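The calling convention matters: `fmin_l_bfgs_b` expects the objective to return the function value and its gradient together, exactly the contract the CRF training objective follows. A minimal usage sketch on a stand-in quadratic $\|x - 3\|^2$:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def objective(x):
    """Return (value, gradient), as fmin_l_bfgs_b requires by default."""
    value = np.sum((x - 3.0) ** 2)
    grad = 2.0 * (x - 3.0)
    return value, grad

x0 = np.zeros(4)
x_opt, f_min, info = fmin_l_bfgs_b(objective, x0)
print(x_opt)             # close to [3, 3, 3, 3]
print(info['warnflag'])  # 0 means converged
```

For CRF training, `x` becomes the weight vector and the value/gradient pair comes from the forward-backward computation above.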

Viterbi Decoding: CRF Version

Optimal Sequence Prediction

Objective: $\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X})$. Equivalent to (since $Z(\mathbf{X})$ does not depend on $\mathbf{Y}$):

$$\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} \mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$$

Viterbi Algorithm Derivation

Dynamic Programming Variable: $\delta_t(j)$ represents the score of the best path reaching position $t$ with label $j$.

Recursive Formula:

$$\delta_t(j) = \max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right]$$

Backtracking Pointer:

$$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right]$$

Algorithm Procedure

Initialization ($t = 1$): $\delta_1(j) = \log M_1(\text{START}, j)$

Recursion ($t = 2, \dots, T$):

$$\delta_t(j) = \max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right], \qquad \psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right]$$

Termination: $y_T^* = \arg\max_{j} \delta_T(j)$

Backtracking ($t = T-1, \dots, 1$): $y_t^* = \psi_{t+1}(y_{t+1}^*)$

Complexity: $O(T L^2)$ time, $O(T L)$ space.
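The recursion can be checked against brute-force enumeration on a toy instance (random log-scores, START collapsed into a start vector - illustrative assumptions):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
L, T = 3, 5
start = rng.normal(size=L)                # log-score START -> j
logM = rng.normal(size=(T - 1, L, L))     # log M_t(i, j) for t = 2..T

# Viterbi in log space
delta = np.zeros((T, L))
psi = np.zeros((T, L), dtype=int)
delta[0] = start
for t in range(1, T):
    scores = delta[t - 1][:, None] + logM[t - 1]   # (L, L): row i -> col j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)
path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
path.reverse()

# Brute-force check over all L^T sequences
def score(y):
    return start[y[0]] + sum(logM[t - 1][y[t - 1], y[t]] for t in range(1, T))

best = max(product(range(L), repeat=T), key=score)
print(path, list(best))  # same sequence
```

Viterbi visits $O(T L^2)$ cells while the brute force scores $L^T$ sequences; with continuous random scores the argmax is unique and the two agree.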

Code Implementation: Complete CRF Framework

Feature Engineering

```python
import numpy as np
from collections import defaultdict

class CRFFeatures:
    """CRF Feature Extractor"""

    def __init__(self, labels):
        """
        Args:
            labels: Set of labels (excluding START/END)
        """
        self.labels = ['<START>'] + list(labels) + ['<END>']
        self.label_to_idx = {l: i for i, l in enumerate(self.labels)}
        self.L = len(self.labels)

        # Feature templates
        self.feature_templates = [
            self._word_identity,
            self._word_suffix,
            self._word_prefix,
            self._word_is_capitalized,
            self._word_is_numeric,
            self._prev_word,
            self._next_word,
        ]

        self.feature_to_idx = {}
        self.next_feature_idx = 0

    def _get_feature_id(self, feature_name):
        """Get feature ID, create if it doesn't exist"""
        if feature_name not in self.feature_to_idx:
            self.feature_to_idx[feature_name] = self.next_feature_idx
            self.next_feature_idx += 1
        return self.feature_to_idx[feature_name]

    def _word_identity(self, X, t, prev_label, curr_label):
        """Current word feature"""
        return [(f"word={X[t]},label={curr_label}", 1.0)]

    def _word_suffix(self, X, t, prev_label, curr_label):
        """Word suffix feature"""
        word = X[t]
        if len(word) >= 3:
            return [(f"suffix={word[-3:]},label={curr_label}", 1.0)]
        return []

    def _word_prefix(self, X, t, prev_label, curr_label):
        """Word prefix feature"""
        word = X[t]
        if len(word) >= 3:
            return [(f"prefix={word[:3]},label={curr_label}", 1.0)]
        return []

    def _word_is_capitalized(self, X, t, prev_label, curr_label):
        """Capitalization feature"""
        if X[t][0].isupper():
            return [(f"capitalized=True,label={curr_label}", 1.0)]
        return []

    def _word_is_numeric(self, X, t, prev_label, curr_label):
        """Numeric feature"""
        if X[t].isdigit():
            return [(f"numeric=True,label={curr_label}", 1.0)]
        return []

    def _prev_word(self, X, t, prev_label, curr_label):
        """Previous word feature"""
        if t > 0:
            return [(f"prev_word={X[t-1]},label={curr_label}", 1.0)]
        return []

    def _next_word(self, X, t, prev_label, curr_label):
        """Next word feature"""
        if t < len(X) - 1:
            return [(f"next_word={X[t+1]},label={curr_label}", 1.0)]
        return []

    def extract_features(self, X, t, prev_label, curr_label):
        """
        Extract all features at position t

        Returns:
            feature_vector: Sparse feature vector {feature_id: value}
        """
        features = {}

        # Transition features
        trans_feat = f"transition={prev_label}->{curr_label}"
        features[self._get_feature_id(trans_feat)] = 1.0

        # State features
        for template in self.feature_templates:
            for feat_name, feat_val in template(X, t, prev_label, curr_label):
                feat_id = self._get_feature_id(feat_name)
                features[feat_id] = feat_val

        return features

    def extract_global_features(self, X, Y):
        """
        Extract global feature vector F(X, Y)

        Returns:
            Sparse vector {feature_id: count}
        """
        features = defaultdict(float)
        T = len(X)

        # Add START/END boundary tags
        Y_with_boundary = ['<START>'] + list(Y) + ['<END>']

        for t in range(T):
            local_feats = self.extract_features(
                X, t, Y_with_boundary[t], Y_with_boundary[t + 1]
            )
            for feat_id, val in local_feats.items():
                features[feat_id] += val

        return dict(features)
```

CRF Model Implementation

```python
from scipy.special import logsumexp
from scipy.optimize import fmin_l_bfgs_b

class LinearChainCRF:
    """Linear-Chain Conditional Random Field"""

    def __init__(self, labels, l2_lambda=0.1):
        """
        Args:
            labels: Set of labels
            l2_lambda: L2 regularization coefficient
        """
        self.feature_extractor = CRFFeatures(labels)
        self.labels = self.feature_extractor.labels
        self.L = self.feature_extractor.L
        self.l2_lambda = l2_lambda
        self.weights = None

    def _compute_scores(self, X, Y=None):
        """
        Compute score matrices M_t(i, j), or the score of a given sequence

        Returns:
            If Y is None: list of (L, L) matrices
            If Y given: scalar score
        """
        T = len(X)

        if Y is None:
            # Compute all possible transition scores
            M = []
            for t in range(T):
                M_t = np.zeros((self.L, self.L))
                for i in range(self.L):
                    for j in range(self.L):
                        if t == 0 and self.labels[i] != '<START>':
                            M_t[i, j] = -np.inf
                            continue

                        features = self.feature_extractor.extract_features(
                            X, t, self.labels[i], self.labels[j]
                        )
                        score = sum(self.weights[f_id] * val
                                    for f_id, val in features.items()
                                    if f_id < len(self.weights))
                        M_t[i, j] = score
                M.append(M_t)
            return M
        else:
            # Compute score for the given sequence
            global_features = self.feature_extractor.extract_global_features(X, Y)
            score = sum(self.weights[f_id] * count
                        for f_id, count in global_features.items()
                        if f_id < len(self.weights))
            return score

    def _forward(self, X):
        """
        Forward algorithm to compute log alpha and log Z

        Returns:
            log_alpha: (T, L) array
            log_Z: Log partition function
        """
        T = len(X)
        M = self._compute_scores(X)

        # Log-space computation
        log_alpha = np.full((T, self.L), -np.inf)

        # Initialization: transitions out of START
        start_idx = self.labels.index('<START>')
        log_alpha[0, :] = M[0][start_idx, :]

        # Recursion
        for t in range(1, T):
            for j in range(self.L):
                log_alpha[t, j] = logsumexp(log_alpha[t - 1] + M[t][:, j])

        # Termination
        log_Z = logsumexp(log_alpha[-1])

        return log_alpha, log_Z

    def _backward(self, X):
        """
        Backward algorithm to compute log beta

        Returns:
            log_beta: (T, L) array
        """
        T = len(X)
        M = self._compute_scores(X)

        log_beta = np.full((T, self.L), -np.inf)

        # Initialization
        log_beta[-1, :] = 0.0

        # Recursion
        for t in range(T - 2, -1, -1):
            for i in range(self.L):
                log_beta[t, i] = logsumexp(M[t + 1][i, :] + log_beta[t + 1])

        return log_beta

    def _compute_gradient(self, X_list, Y_list):
        """
        Compute (negative) gradient and negative regularized log-likelihood

        Returns:
            neg_gradient: Negative gradient vector (for minimization)
            neg_log_likelihood: Negative regularized log-likelihood
        """
        num_features = len(self.weights)
        empirical_counts = np.zeros(num_features)
        expected_counts = np.zeros(num_features)
        log_likelihood = 0.0

        for X, Y in zip(X_list, Y_list):
            # Empirical feature counts
            global_features = self.feature_extractor.extract_global_features(X, Y)
            for f_id, count in global_features.items():
                if f_id < num_features:
                    empirical_counts[f_id] += count

            # True sequence score
            score = self._compute_scores(X, Y)

            # Forward-backward algorithm
            log_alpha, log_Z = self._forward(X)
            log_beta = self._backward(X)

            log_likelihood += score - log_Z

            # Expected feature counts
            T = len(X)
            M = self._compute_scores(X)

            for t in range(T):
                for i in range(self.L):
                    for j in range(self.L):
                        # Marginal probability P(y_{t-1}=i, y_t=j | X)
                        if t == 0:
                            log_marginal = M[t][i, j] + log_beta[t, j] - log_Z
                        else:
                            log_marginal = (log_alpha[t - 1, i] + M[t][i, j] +
                                            log_beta[t, j] - log_Z)

                        if log_marginal == -np.inf:   # impossible transition
                            continue
                        marginal = np.exp(log_marginal)

                        # Feature expectation
                        features = self.feature_extractor.extract_features(
                            X, t, self.labels[i], self.labels[j]
                        )
                        for f_id, val in features.items():
                            if f_id < num_features:
                                expected_counts[f_id] += marginal * val

        # Gradient = empirical features - expected features - L2 regularization term
        gradient = empirical_counts - expected_counts - self.l2_lambda * self.weights

        # Objective = negative log-likelihood + L2 penalty
        neg_log_likelihood = (-log_likelihood +
                              0.5 * self.l2_lambda * np.sum(self.weights ** 2))

        return -gradient, neg_log_likelihood  # negate because LBFGS minimizes

    def fit(self, X_list, Y_list, max_iter=100):
        """
        Train CRF model

        Args:
            X_list: List of observation sequences
            Y_list: List of label sequences
            max_iter: Maximum number of iterations
        """
        # First pass: build the feature dictionary from the training data.
        # Feature IDs are created lazily, so the weight vector must be
        # sized only after this pass.
        for X, Y in zip(X_list, Y_list):
            self.feature_extractor.extract_global_features(X, Y)

        num_features = self.feature_extractor.next_feature_idx
        self.weights = np.zeros(num_features)

        # Define objective function
        def objective(w):
            self.weights = w
            neg_grad, neg_ll = self._compute_gradient(X_list, Y_list)
            return neg_ll, neg_grad

        # LBFGS optimization
        print("Training CRF with LBFGS...")
        self.weights, f_min, info = fmin_l_bfgs_b(
            objective,
            self.weights,
            maxiter=max_iter,
            disp=1
        )

        print(f"Training finished. Final negative log-likelihood: {f_min:.4f}")
        return self

    def viterbi(self, X):
        """
        Viterbi decoding

        Returns:
            best_labels: Optimal label sequence
        """
        T = len(X)
        M = self._compute_scores(X)

        # Dynamic programming
        delta = np.full((T, self.L), -np.inf)
        psi = np.zeros((T, self.L), dtype=int)

        # Initialization
        start_idx = self.labels.index('<START>')
        delta[0, :] = M[0][start_idx, :]

        # Recursion
        for t in range(1, T):
            for j in range(self.L):
                scores = delta[t - 1] + M[t][:, j]
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]]

        # Backtracking
        best_path = np.zeros(T, dtype=int)
        best_path[-1] = np.argmax(delta[-1])

        for t in range(T - 2, -1, -1):
            best_path[t] = psi[t + 1, best_path[t + 1]]

        # Convert to labels
        best_labels = [self.labels[idx] for idx in best_path
                       if self.labels[idx] not in ['<START>', '<END>']]

        return best_labels

    def predict(self, X_list):
        """Batch prediction"""
        return [self.viterbi(X) for X in X_list]
```

Named Entity Recognition Application

```python
def ner_example():
    """
    Named Entity Recognition Example
    """
    # Training data (BIO tagging)
    X_train = [
        ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii'],
        ['Apple', 'Inc.', 'is', 'located', 'in', 'Cupertino'],
        ['The', 'company', 'was', 'founded', 'in', '1976']
    ]

    Y_train = [
        ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC'],
        ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC'],
        ['O', 'O', 'O', 'O', 'O', 'B-DATE']
    ]

    # Test data
    X_test = [
        ['Steve', 'Jobs', 'founded', 'Apple'],
        ['Google', 'is', 'in', 'California']
    ]

    # Train CRF
    labels = set(label for seq in Y_train for label in seq)
    crf = LinearChainCRF(labels, l2_lambda=0.1)
    crf.fit(X_train, Y_train, max_iter=50)

    # Predict
    Y_pred = crf.predict(X_test)

    for X, Y in zip(X_test, Y_pred):
        print(f"Sentence: {' '.join(X)}")
        print(f"Predicted labels: {' '.join(Y)}")
        print()
```

Advanced Topics and Optimization

Feature Template Design

Window Features:

```python
# Window of 2 words before and after
for offset in [-2, -1, 0, 1, 2]:
    if 0 <= t + offset < T:
        features.append(f"word[{offset}]={X[t+offset]},label={y_t}")
```

Dictionary Features:

```python
if word in person_dict:
    features.append(f"in_person_dict=True,label={y_t}")
```

Regex Features:

```python
import re
if re.match(r'\d{4}-\d{2}-\d{2}', word):  # Date format
    features.append(f"date_pattern=True,label={y_t}")
```

Semi-Markov CRF

Allows segment-level labeling: Labels span multiple positions

Applications: Chinese word segmentation, named entity recognition

Structured Perceptron

Online learning version: Doesn't need to compute partition function

Update Rule: $\mathbf{w} \leftarrow \mathbf{w} + F(\mathbf{X}, \mathbf{Y}^{\text{true}}) - F(\mathbf{X}, \hat{\mathbf{Y}})$, where $\hat{\mathbf{Y}}$ is the current Viterbi prediction

Advantage: fast training, but it yields only scores, not the calibrated probabilities a CRF provides
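One structured-perceptron step, sketched with toy global feature vectors (in a real tagger `F_true` and `F_pred` would come from the feature extractor and from Viterbi decoding under the current weights - here they are invented numbers):

```python
import numpy as np

def perceptron_update(w, F_true, F_pred, lr=1.0):
    """w <- w + lr * (F(X, Y_true) - F(X, Y_hat)); no partition function needed."""
    return w + lr * (F_true - F_pred)

w = np.zeros(3)
F_true = np.array([1.0, 0.0, 2.0])   # features of the gold sequence
F_pred = np.array([0.0, 1.0, 2.0])   # features of the current Viterbi output
w = perceptron_update(w, F_true, F_pred)
print(w)  # [ 1. -1.  0.]
```

When the prediction is already correct the two feature vectors coincide and the update is zero, which is why the algorithm needs only decoding, never $Z(\mathbf{X})$.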

In-Depth Q&A

Q1: What are the core advantages of CRF compared to HMM?

A: CRF is a discriminative model that directly models $P(\mathbf{Y} \mid \mathbf{X})$, while HMM is a generative model that models $P(\mathbf{X}, \mathbf{Y})$. Specific advantages:

  1. No observation independence assumption: HMM assumes $P(x_t \mid \mathbf{Y}, \mathbf{X}_{\setminus t}) = P(x_t \mid y_t)$; CRF allows arbitrary overlapping features
  2. Flexible feature engineering: CRF can use global features and overlapping features (like surrounding words, dictionary matching)
  3. Better sequence labeling performance: On NER and POS tagging tasks, CRF typically outperforms HMM by 5-10 percentage points

Q2: Why does CRF need forward-backward algorithm to compute gradients?

A: The gradient contains the expectation term $\mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X})}[F(\mathbf{X}, \mathbf{Y})]$; direct computation requires enumerating $L^T$ sequences. The forward-backward algorithm exploits the linear-chain structure to decompose the expectation into a sum over marginal probabilities:

$$\mathbb{E}[F_k] = \sum_{t} \sum_{i, j} P(y_{t-1} = i, y_t = j \mid \mathbf{X})\, f_k(i, j, \mathbf{X}, t)$$

The marginal probability $P(y_{t-1} = i, y_t = j \mid \mathbf{X})$ can be efficiently computed from $\alpha$ and $\beta$, reducing the complexity to $O(T L^2)$.

Q3: Why is the partition function Z(X) hard to compute?

A: The partition function is defined as

$$Z(\mathbf{X}) = \sum_{\mathbf{Y}} \exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\right)$$

It requires summing over all $L^T$ possible label sequences. Although the forward algorithm can compute it efficiently ($O(T L^2)$), it must be recomputed for every parameter update, leading to high training cost. This is the main cost of CRF compared to the perceptron.

Q4: Why is L2 regularization important for CRF?

A: Two reasons:

  1. Prevent overfitting: The number of features is usually very large (tens of thousands to millions); the L2 penalty $\frac{\lambda}{2}\|\mathbf{w}\|^2$ prevents weight explosion
  2. Ensure convexity: The original objective function is convex; adding L2 makes it strongly convex, ensuring LBFGS converges to the global optimum

$\lambda$ is usually selected through cross-validation.

Q5: What's the difference between Viterbi decoding and marginal decoding?

A: Two decoding strategies:

  • Viterbi decoding: find the sequence with maximum joint probability, $\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X})$
  • Marginal decoding: independently select the label with maximum marginal probability at each position, $\hat{y}_t = \arg\max_{j} P(y_t = j \mid \mathbf{X})$

Viterbi guarantees global consistency (e.g., BIO tagging constraints); marginal decoding may produce illegal sequences. Viterbi is more commonly used in practice.

Q6: How to handle large label sets?

A: When the number of labels $L$ is large, the $O(T L^2)$ complexity becomes unacceptable (character tagging for Chinese word segmentation uses only a handful of labels, but word-level tagging may have thousands). Solutions:

  1. Beam Search approximation: keep only the top-K candidate states
  2. Hierarchical labeling: first coarse-grained labeling, then fine-grained
  3. Neural CRF: use neural networks to learn transition matrices, reducing parameters

Q7: Can CRF handle non-sequential structures?

A: Yes! General CRFs can be defined on arbitrary graph structures: - Tree CRF: dependency parsing - Grid CRF: image segmentation (2D CRF) - Fully-connected CRF: dense pairwise potentials over all node pairs (e.g., refining image segmentation)

But non-chain structures have higher inference complexity, usually requiring approximation algorithms (like loopy belief propagation).

Q8: Why does MEMM have the label bias problem?

A: MEMM uses local normalization:

$$P(y_t \mid y_{t-1}, x_t) = \frac{\exp\left(s(y_{t-1}, y_t, x_t)\right)}{\sum_{y'} \exp\left(s(y_{t-1}, y', x_t)\right)}$$

Problem: states with fewer outgoing transitions have smaller normalization denominators, so their transition probabilities tend toward 1 and the model prefers these states regardless of the observations. CRF solves this through global normalization:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{\exp\left(\sum_t s(y_{t-1}, y_t, \mathbf{X}, t)\right)}{Z(\mathbf{X})}$$

All paths compete against the same denominator, avoiding the local bias.

Q9: How does the number of features affect CRF performance?

A: Relationship between number of features and performance:

  • Too few: underfitting, can't capture complex patterns
  • Too many: overfitting, slow training, poor generalization

Rule of thumb:

  • Basic features (words, POS, affixes): thousands to tens of thousands
  • Adding dictionary features: tens of thousands to hundreds of thousands
  • Use regularization + feature selection (like L1) to control complexity

Q10: How to combine CRF with deep learning?

A: Modern sequence labeling typically uses BiLSTM-CRF:

  • BiLSTM: learns contextual representations
  • CRF: models label dependencies and enforces label legality

Structure: the BiLSTM outputs serve as emission scores for the CRF, and the CRF layer learns the transition matrix. The two are optimized jointly during training, with Viterbi used at inference.

Q11: How to evaluate CRF models?

A: Evaluation metrics for sequence labeling tasks:

  1. Token-level accuracy: correctness of individual labels
  2. Span-level F1 (more important): entity-level precision, recall, and F1
     - Precision = correct entities / predicted entities
     - Recall = correct entities / true entities
  3. Confusion matrix: analyze which labels are easily confused

Q12: How to optimize CRF training time?

A: Optimization strategies:

  1. Feature hashing: use hash functions to reduce feature dimensionality
  2. Mini-batch gradients: don't compute gradients on the full data each step
  3. Parallelization: forward-backward over multiple sequences can run in parallel
  4. Early stopping: monitor validation performance to avoid overfitting
  5. Pre-training initialization: quickly pre-train weights with a perceptron

References

[1] Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.

[2] Sutton, C., & McCallum, A. (2012). An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4), 267-373.

[3] Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. NAACL-HLT.

[4] Wallach, H. M. (2004). Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania.

[5] Sarawagi, S., & Cohen, W. W. (2005). Semi-Markov conditional random fields for information extraction. NIPS.

[6] Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.

[7] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

[8] Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. ACL.

[9] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. NAACL-HLT.

[10] Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS.

  • Create time: 2025-05-05
  • Post link: https://www.chenk.top/ml-math-16-conditional-random-field/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.