Machine Learning Math Derivations (16): Conditional Random Fields
Chen Kai

Conditional Random Fields (CRF) are discriminative models for sequence labeling - unlike HMMs, CRFs directly model the conditional probability $P(\mathbf{Y} \mid \mathbf{X})$ rather than the joint probability $P(\mathbf{X}, \mathbf{Y})$, thereby avoiding the observation independence assumption and allowing flexible use of overlapping features. From named entity recognition to part-of-speech tagging, from information extraction to image segmentation, CRFs achieve strong performance in sequence modeling through clever undirected graph structures and feature engineering. This chapter systematically derives the potential functions of linear-chain CRFs, the partition function, the forward-backward algorithm, gradient computation for parameter learning, and LBFGS optimization.

From HMM to CRF: The Leap from Generative to Discriminative

Review of HMM Limitations

HMM Modeling: Joint probability

$$P(\mathbf{X}, \mathbf{Y}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$

Using Bayes' Formula for Prediction:

$$\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X}) = \arg\max_{\mathbf{Y}} \frac{P(\mathbf{X}, \mathbf{Y})}{P(\mathbf{X})} = \arg\max_{\mathbf{Y}} P(\mathbf{X}, \mathbf{Y})$$

Core Assumptions (restrictive):

  1. Observation Independence: $P(x_t \mid \mathbf{Y}, \mathbf{X}_{\setminus t}) = P(x_t \mid y_t)$

  2. Markov Property: $P(y_t \mid y_{1:t-1}) = P(y_t \mid y_{t-1})$

Practical Issue: In POS tagging, the label of the current word may depend on the previous word, the next word, the word's suffix, and many other features - the observation independence assumption is unreasonable!

Advantages of Discriminative Models

CRF Idea: Directly model $P(\mathbf{Y} \mid \mathbf{X})$. Advantages:

  1. No observation independence assumption: Features can overlap arbitrarily
  2. Flexible feature engineering: Can introduce global features, dictionary features, morphological features, etc.
  3. Strong labeling performance: Discriminative models usually outperform generative models in labeling tasks

Cost: Higher training complexity (need to compute partition function)

From HMM to MEMM to CRF

Maximum Entropy Markov Model (MEMM): each step is a locally normalized classifier,

$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}, x_t)$$

Problem: Label Bias - local normalization causes the model to prefer states with fewer transition options
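The bias can be seen with a toy numerical example (the scores below are invented for illustration, not from the article): under per-state softmax normalization, a state with only one legal successor hands that successor probability 1 no matter how weak its score is.

```python
import numpy as np

def local_softmax(scores):
    """Normalize each row of transition scores independently (MEMM-style)."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy transition scores out of states {0, 1}; state 1 has a single legal
# successor (the other transition is forbidden, score -inf).
scores = np.array([
    [1.0, 2.0],        # state 0: two competing successors
    [-np.inf, 0.5],    # state 1: only one legal successor
])
P = local_softmax(scores)
print(P[1])  # state 1 transitions with probability 1 regardless of its score
```

However small the raw score 0.5 is made, row 1 still normalizes to probability 1, which is exactly the preference for low-branching states described above.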

CRF Solution: Global normalization - normalize once over entire label sequences with a single partition function $Z(\mathbf{X})$, so that all paths compete against the same denominator.

Mathematical Framework of Linear-Chain CRF

Basic Definitions

Input: observation sequence $\mathbf{X} = (x_1, \dots, x_T)$. Output: label sequence $\mathbf{Y} = (y_1, \dots, y_T)$.

Conditional Probability:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, \mathbf{X})$$

Where:

  • Potential Function: $\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\left(\sum_k w_k f_k(y_{t-1}, y_t, \mathbf{X}, t)\right)$ represents the score at position $t$
  • Partition Function (normalization factor): $Z(\mathbf{X}) = \sum_{\mathbf{Y}} \prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, \mathbf{X})$

Feature Function Decomposition

Transition Features: depend on the previous label and the current label. Example:

$$t_k(y_{t-1}, y_t, \mathbf{X}, t) = \mathbb{1}[y_{t-1} = \text{B-PER},\ y_t = \text{I-PER}]$$

State Features (Emission Features): depend on the current label and the observation. Example:

$$s_l(y_t, \mathbf{X}, t) = \mathbb{1}[y_t = \text{I-PER},\ x_t = \text{Obama}]$$

Potential Function Expansion:

$$\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\left(\sum_k \lambda_k\, t_k(y_{t-1}, y_t, \mathbf{X}, t) + \sum_l \mu_l\, s_l(y_t, \mathbf{X}, t)\right)$$
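Both kinds of feature functions are just indicator functions of the clique $(y_{t-1}, y_t, \mathbf{X}, t)$. A minimal sketch (the label names and sentence are illustrative):

```python
def t_per_chain(y_prev, y_curr, X, t):
    """Transition feature: fires when B-PER is followed by I-PER."""
    return 1.0 if (y_prev, y_curr) == ('B-PER', 'I-PER') else 0.0

def s_obama_per(y_curr, X, t):
    """State (emission) feature: fires when the current word is 'Obama' labeled I-PER."""
    return 1.0 if y_curr == 'I-PER' and X[t] == 'Obama' else 0.0

X = ['Barack', 'Obama', 'was']
print(t_per_chain('B-PER', 'I-PER', X, 1))  # 1.0
print(s_obama_per('I-PER', X, 1))           # 1.0
print(s_obama_per('O', X, 2))               # 0.0
```

A real tagger generates thousands of such indicators automatically from templates, as the feature-extractor code later in the post does.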

Parameterized Form

Unified Representation: merge transition and state features into a single family $f_k(y_{t-1}, y_t, \mathbf{X}, t)$ with weights $w_k$.

Simplified Potential Function:

$$\psi_t(y_{t-1}, y_t, \mathbf{X}) = \exp\left(\sum_k w_k f_k(y_{t-1}, y_t, \mathbf{X}, t)\right)$$

Global Score:

$$s(\mathbf{X}, \mathbf{Y}) = \mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$$

Where $F(\mathbf{X}, \mathbf{Y}) = \sum_{t=1}^{T} f(y_{t-1}, y_t, \mathbf{X}, t)$ is the global feature vector.

Complete CRF Form:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{\exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\right)}{\sum_{\mathbf{Y}'} \exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y}')\right)}$$

Matrix Form Representation

Define: $M_t(i, j) = \psi_t(y_{t-1} = i, y_t = j, \mathbf{X})$, the score matrix from label $i$ to label $j$ at position $t$.

Matrix Product for Sequence Score:

$$\text{score}(\mathbf{Y}) = \prod_{t=1}^{T} M_t(y_{t-1}, y_t)$$

Partition Function:

$$Z(\mathbf{X}) = \mathbf{1}^\top \left(\prod_{t=1}^{T} M_t\right) \mathbf{1}$$

Where $\mathbf{1}$ is the all-ones vector and each $M_t$ is an $L \times L$ matrix ($L$ = number of labels). With explicit START/END symbols, $Z(\mathbf{X})$ is instead the (START, END) entry of the matrix product.
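The matrix identity can be checked numerically on a toy chain (random positive score matrices; the START boundary is collapsed into a start vector here, a simplification of the article's bookkeeping):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
L, T = 3, 4
# Random positive score matrices; 'start' plays the role of M_1(START, j).
start = rng.uniform(0.5, 1.5, size=L)            # scores START -> j
M = rng.uniform(0.5, 1.5, size=(T - 1, L, L))    # M_t(i, j) for t = 2..T

# Matrix-product form: Z = start^T M_2 ... M_T 1
v = start.copy()
for t in range(T - 1):
    v = v @ M[t]
Z_matrix = v.sum()

# Brute force over all L^T label sequences
Z_brute = 0.0
for y in product(range(L), repeat=T):
    score = start[y[0]]
    for t in range(1, T):
        score *= M[t - 1][y[t - 1], y[t]]
    Z_brute += score

print(Z_matrix, Z_brute)  # the two values agree
```

The matrix product touches $O(T L^2)$ entries, while the brute-force sum enumerates $L^T = 81$ sequences; the agreement is exactly the distributive law the forward algorithm exploits.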

Forward-Backward Algorithm: CRF Version

Forward Algorithm Derivation

Forward Variable: $\alpha_t(j)$ represents the unnormalized probability of all paths ending with label $j$ at position $t$:

$$\alpha_t(j) = \sum_{y_1, \dots, y_{t-1}} \prod_{s=1}^{t} M_s(y_{s-1}, y_s), \qquad y_t = j$$

Initialization ($t = 1$): $\alpha_1(j) = M_1(y_0, j)$, where $y_0$ is the start symbol (usually set to a special START tag).

Recursion ($t = 2, \dots, T$):

$$\alpha_t(j) = \sum_{i} \alpha_{t-1}(i)\, M_t(i, j)$$

Termination:

$$Z(\mathbf{X}) = \sum_{j} \alpha_T(j)$$

Physical Meaning: $\alpha_t(j)$ accumulates the sum of scores of all partial sequences of length $t$ ending with label $j$.

Backward Algorithm Derivation

Backward Variable: $\beta_t(i)$ represents the unnormalized probability of all paths from position $t$ with label $i$ to the end of the sequence.

Initialization ($t = T$): $\beta_T(i) = 1$ for all $i$; with an explicit end symbol (END tag), $\beta_T(i) = M_{T+1}(i, \text{END})$.

Recursion ($t = T-1, \dots, 1$):

$$\beta_t(i) = \sum_{j} M_{t+1}(i, j)\, \beta_{t+1}(j)$$

Verification: for any position $t$,

$$Z(\mathbf{X}) = \sum_{j} \alpha_t(j)\, \beta_t(j)$$
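The verification identity can be exercised on a toy chain (non-log space for clarity, START collapsed into a start vector, no explicit END tag - all simplifications of the article's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
L, T = 3, 5
start = rng.uniform(0.5, 1.5, size=L)          # alpha_1(j) = score(START -> j)
M = rng.uniform(0.5, 1.5, size=(T - 1, L, L))  # M_t(i, j) for t = 2..T

# Forward: alpha_t(j) = sum_i alpha_{t-1}(i) M_t(i, j)
alpha = np.zeros((T, L))
alpha[0] = start
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ M[t - 1]

# Backward: beta_T(i) = 1; beta_t(i) = sum_j M_{t+1}(i, j) beta_{t+1}(j)
beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = M[t] @ beta[t + 1]

Z = alpha[-1].sum()
# Consistency check: sum_j alpha_t(j) beta_t(j) equals Z at every position t
for t in range(T):
    print(t, (alpha[t] * beta[t]).sum())
```

Every printed value equals the same $Z(\mathbf{X})$: each path is counted exactly once whichever position the chain is split at.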

Marginal Probability Computation

Marginal Probability of a Single Label:

$$P(y_t = j \mid \mathbf{X}) = \frac{\alpha_t(j)\, \beta_t(j)}{Z(\mathbf{X})}$$

Marginal Probability of Adjacent Labels:

$$P(y_{t-1} = i, y_t = j \mid \mathbf{X}) = \frac{\alpha_{t-1}(i)\, M_t(i, j)\, \beta_t(j)}{Z(\mathbf{X})}$$

These marginal probabilities are the core of parameter learning!
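Both formulas can be checked on the same kind of toy chain as above (start vector in place of the START tag, non-log space - illustrative simplifications): node marginals should sum to 1 at each position, pairwise marginals should sum to 1 over all $(i, j)$, and summing a pairwise marginal over $j$ should recover the node marginal.

```python
import numpy as np

rng = np.random.default_rng(2)
L, T = 3, 4
start = rng.uniform(0.5, 1.5, size=L)
M = rng.uniform(0.5, 1.5, size=(T - 1, L, L))

alpha = np.zeros((T, L))
alpha[0] = start
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ M[t - 1]
beta = np.ones((T, L))
for t in range(T - 2, -1, -1):
    beta[t] = M[t] @ beta[t + 1]
Z = alpha[-1].sum()

# Single-label marginal P(y_t = j | X) = alpha_t(j) beta_t(j) / Z
node = alpha * beta / Z                    # shape (T, L); each row sums to 1
# Pairwise marginal P(y_1 = i, y_2 = j | X) = alpha_1(i) M_2(i, j) beta_2(j) / Z
pair = alpha[0][:, None] * M[0] * beta[1][None, :] / Z
print(node.sum(axis=1))   # all ones
print(pair.sum())         # 1.0
```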

Log-Space Computation

Numerical Stability: work with $\log \alpha_t(j)$ and $\log \beta_t(i)$ instead of raw scores, using the log-sum-exp trick:

$$\log \sum_i e^{a_i} = a^* + \log \sum_i e^{a_i - a^*}, \qquad a^* = \max_i a_i$$
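A minimal implementation of the trick (in practice `scipy.special.logsumexp`, used later in the post, does the same thing):

```python
import numpy as np

def logsumexp_stable(a):
    """log(sum(exp(a))) computed without overflow by factoring out max(a)."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    if not np.isfinite(m):   # all -inf: the sum is 0, so the log is -inf
        return m
    return m + np.log(np.exp(a - m).sum())

big = np.array([1000.0, 1000.5, 999.0])
print(np.log(np.exp(big).sum()))   # naive: exp overflows, result is inf
print(logsumexp_stable(big))       # stable, finite answer
```

Forward scores grow multiplicatively with sequence length, so the naive form overflows almost immediately on real sentences; the subtracted maximum keeps every exponent at or below zero.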

Parameter Learning: Maximum Likelihood Estimation

Objective Function

Training Data: $\mathcal{D} = \{(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)})\}_{n=1}^{N}$

Log-Likelihood:

$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \log P(\mathbf{Y}^{(n)} \mid \mathbf{X}^{(n)})$$

Expanded:

$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \left[ \mathbf{w}^\top F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \log Z(\mathbf{X}^{(n)}) \right]$$

Adding L2 Regularization:

$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = \mathcal{L}(\mathbf{w}) - \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Optimization Objective: $\mathbf{w}^* = \arg\max_{\mathbf{w}} \mathcal{L}_{\text{reg}}(\mathbf{w})$

Detailed Gradient Computation

Gradient of Log-Likelihood:

$$\nabla_{\mathbf{w}} \mathcal{L} = \sum_{n=1}^{N} \left[ F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \nabla_{\mathbf{w}} \log Z(\mathbf{X}^{(n)}) \right]$$

First Term (simple): the global feature vector of the observed sequence, $F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)})$.

Second Term (requires derivation):

$$\nabla_{\mathbf{w}} \log Z(\mathbf{X}) = \frac{1}{Z(\mathbf{X})} \sum_{\mathbf{Y}} \exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\right) F(\mathbf{X}, \mathbf{Y}) = \mathbb{E}_{\mathbf{Y} \sim P(\mathbf{Y} \mid \mathbf{X})}\left[ F(\mathbf{X}, \mathbf{Y}) \right]$$

Therefore:

Final Gradient Form:

$$\nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}} = \sum_{n=1}^{N} \left[ F(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) - \mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X}^{(n)})}\left[ F(\mathbf{X}^{(n)}, \mathbf{Y}) \right] \right] - \lambda \mathbf{w}$$

Physical Interpretation: empirical feature expectation minus model feature expectation.
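The identity "gradient = empirical features minus expected features" can be verified by finite differences on a tiny CRF. This sketch uses invented random feature tensors and brute-force enumeration over all $L^T$ sequences (feasible only because the toy is small), purely to check the math:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
L, T, K = 3, 4, 5
feat = rng.normal(size=(T - 1, L, L, K))   # toy f(y_{t-1}, y_t) at steps 2..T
feat0 = rng.normal(size=(L, K))            # toy start features f(START, y_1)
w = rng.normal(size=K)
y_obs = (0, 2, 1, 1)                       # an observed "true" label sequence

def global_feats(y):
    F = feat0[y[0]].copy()
    for t in range(1, T):
        F += feat[t - 1, y[t - 1], y[t]]
    return F

def log_likelihood(w):
    scores = [w @ global_feats(y) for y in product(range(L), repeat=T)]
    m = max(scores)
    log_Z = m + np.log(sum(np.exp(s - m) for s in scores))
    return w @ global_feats(y_obs) - log_Z

def gradient(w):
    """Analytic gradient: F(y_obs) minus the model expectation of F."""
    ys = list(product(range(L), repeat=T))
    s = np.array([w @ global_feats(y) for y in ys])
    p = np.exp(s - s.max())
    p /= p.sum()
    E = sum(pi * global_feats(y) for pi, y in zip(p, ys))
    return global_feats(y_obs) - E

g = gradient(w)
eps = 1e-6
g_num = np.array([(log_likelihood(w + eps * np.eye(K)[k]) -
                   log_likelihood(w - eps * np.eye(K)[k])) / (2 * eps)
                  for k in range(K)])
print(np.max(np.abs(g - g_num)))  # agreement up to finite-difference error
```

In the real trainer the expectation is of course computed from forward-backward marginals rather than enumeration; this check only confirms the gradient formula itself.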

Efficient Computation of Expectation

Problem: direct computation of $\mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X})}[F(\mathbf{X}, \mathbf{Y})]$ requires enumerating $L^T$ sequences!

Solution: exploit the linear-chain structure and marginal probabilities.

Using the Forward-Backward Algorithm:

$$\mathbb{E}\left[ F_k(\mathbf{X}, \mathbf{Y}) \right] = \sum_{t=1}^{T} \sum_{i, j} P(y_{t-1} = i, y_t = j \mid \mathbf{X})\, f_k(i, j, \mathbf{X}, t)$$

Complexity: $O(T L^2)$ per sequence - same as the forward-backward algorithm!

Gradient Descent and LBFGS Optimization

Gradient Descent: $\mathbf{w} \leftarrow \mathbf{w} + \eta\, \nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}}(\mathbf{w})$

Problem: Slow convergence, need to tune learning rate

LBFGS Algorithm (Limited-memory BFGS):

  • Second-order optimization method, uses Hessian matrix approximation
  • Limited memory version, suitable for high-dimensional parameters
  • Standard choice for CRF training

Pseudocode:

```
Initialize w = 0
while not converged:
    Compute g = ∇L(w)  (via forward-backward algorithm)
    Update w using an LBFGS step
    if |L(w_new) - L(w_old)| < ε:
        break
```

Engineering Implementation: Use scipy.optimize.fmin_l_bfgs_b
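The calling convention matters: `fmin_l_bfgs_b` expects the objective to return the function value and its gradient together, exactly the contract the CRF training objective follows. A minimal usage sketch on a stand-in quadratic $\|x - 3\|^2$:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def objective(x):
    """Return (value, gradient), as fmin_l_bfgs_b requires by default."""
    value = np.sum((x - 3.0) ** 2)
    grad = 2.0 * (x - 3.0)
    return value, grad

x0 = np.zeros(4)
x_opt, f_min, info = fmin_l_bfgs_b(objective, x0)
print(x_opt)             # close to [3, 3, 3, 3]
print(info['warnflag'])  # 0 means converged
```

For CRF training, `x` becomes the weight vector and the value/gradient pair comes from the forward-backward computation above.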

Viterbi Decoding: CRF Version

Optimal Sequence Prediction

Objective: $\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X})$. Equivalent to (since $Z(\mathbf{X})$ does not depend on $\mathbf{Y}$):

$$\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} \mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})$$

Viterbi Algorithm Derivation

Dynamic Programming Variable: $\delta_t(j)$ represents the score of the best path reaching position $t$ with label $j$.

Recursive Formula:

$$\delta_t(j) = \max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right]$$

Backtracking Pointer:

$$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right]$$

Algorithm Procedure

Initialization ($t = 1$): $\delta_1(j) = \log M_1(\text{START}, j)$

Recursion ($t = 2, \dots, T$):

$$\delta_t(j) = \max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right], \qquad \psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) + \log M_t(i, j) \right]$$

Termination: $y_T^* = \arg\max_{j} \delta_T(j)$

Backtracking ($t = T-1, \dots, 1$): $y_t^* = \psi_{t+1}(y_{t+1}^*)$

Complexity: $O(T L^2)$ time, $O(T L)$ space.
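The recursion can be checked against brute-force enumeration on a toy instance (random log-scores, START collapsed into a start vector - illustrative assumptions):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
L, T = 3, 5
start = rng.normal(size=L)                # log-score START -> j
logM = rng.normal(size=(T - 1, L, L))     # log M_t(i, j) for t = 2..T

# Viterbi in log space
delta = np.zeros((T, L))
psi = np.zeros((T, L), dtype=int)
delta[0] = start
for t in range(1, T):
    scores = delta[t - 1][:, None] + logM[t - 1]   # (L, L): row i -> col j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)
path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
path.reverse()

# Brute-force check over all L^T sequences
def score(y):
    return start[y[0]] + sum(logM[t - 1][y[t - 1], y[t]] for t in range(1, T))

best = max(product(range(L), repeat=T), key=score)
print(path, list(best))  # same sequence
```

Viterbi visits $O(T L^2)$ cells while the brute force scores $L^T$ sequences; with continuous random scores the argmax is unique and the two agree.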

Code Implementation: Complete CRF Framework

Feature Engineering

```python
import numpy as np
from collections import defaultdict

class CRFFeatures:
    """CRF Feature Extractor"""

    def __init__(self, labels):
        """
        Args:
            labels: Set of labels (excluding START/END)
        """
        self.labels = ['<START>'] + list(labels) + ['<END>']
        self.label_to_idx = {l: i for i, l in enumerate(self.labels)}
        self.L = len(self.labels)

        # Feature templates
        self.feature_templates = [
            self._word_identity,
            self._word_suffix,
            self._word_prefix,
            self._word_is_capitalized,
            self._word_is_numeric,
            self._prev_word,
            self._next_word,
        ]

        self.feature_to_idx = {}
        self.next_feature_idx = 0

    def _get_feature_id(self, feature_name):
        """Get feature ID, create if it doesn't exist"""
        if feature_name not in self.feature_to_idx:
            self.feature_to_idx[feature_name] = self.next_feature_idx
            self.next_feature_idx += 1
        return self.feature_to_idx[feature_name]

    def _word_identity(self, X, t, prev_label, curr_label):
        """Current word feature"""
        return [(f"word={X[t]},label={curr_label}", 1.0)]

    def _word_suffix(self, X, t, prev_label, curr_label):
        """Word suffix feature"""
        word = X[t]
        if len(word) >= 3:
            return [(f"suffix={word[-3:]},label={curr_label}", 1.0)]
        return []

    def _word_prefix(self, X, t, prev_label, curr_label):
        """Word prefix feature"""
        word = X[t]
        if len(word) >= 3:
            return [(f"prefix={word[:3]},label={curr_label}", 1.0)]
        return []

    def _word_is_capitalized(self, X, t, prev_label, curr_label):
        """Capitalization feature"""
        if X[t][0].isupper():
            return [(f"capitalized=True,label={curr_label}", 1.0)]
        return []

    def _word_is_numeric(self, X, t, prev_label, curr_label):
        """Numeric feature"""
        if X[t].isdigit():
            return [(f"numeric=True,label={curr_label}", 1.0)]
        return []

    def _prev_word(self, X, t, prev_label, curr_label):
        """Previous word feature"""
        if t > 0:
            return [(f"prev_word={X[t-1]},label={curr_label}", 1.0)]
        return []

    def _next_word(self, X, t, prev_label, curr_label):
        """Next word feature"""
        if t < len(X) - 1:
            return [(f"next_word={X[t+1]},label={curr_label}", 1.0)]
        return []

    def extract_features(self, X, t, prev_label, curr_label):
        """
        Extract all features at position t

        Returns:
            feature_vector: Sparse feature vector {feature_id: value}
        """
        features = {}

        # Transition features
        trans_feat = f"transition={prev_label}->{curr_label}"
        features[self._get_feature_id(trans_feat)] = 1.0

        # State features
        for template in self.feature_templates:
            for feat_name, feat_val in template(X, t, prev_label, curr_label):
                feat_id = self._get_feature_id(feat_name)
                features[feat_id] = feat_val

        return features

    def extract_global_features(self, X, Y):
        """
        Extract global feature vector F(X, Y)

        Returns:
            Sparse vector {feature_id: count}
        """
        features = defaultdict(float)
        T = len(X)

        # Add START/END boundary tags
        Y_with_boundary = ['<START>'] + list(Y) + ['<END>']

        for t in range(T):
            local_feats = self.extract_features(
                X, t, Y_with_boundary[t], Y_with_boundary[t + 1]
            )
            for feat_id, val in local_feats.items():
                features[feat_id] += val

        return dict(features)
```

CRF Model Implementation

```python
from scipy.special import logsumexp
from scipy.optimize import fmin_l_bfgs_b

class LinearChainCRF:
    """Linear-Chain Conditional Random Field"""

    def __init__(self, labels, l2_lambda=0.1):
        """
        Args:
            labels: Set of labels
            l2_lambda: L2 regularization coefficient
        """
        self.feature_extractor = CRFFeatures(labels)
        self.labels = self.feature_extractor.labels
        self.L = self.feature_extractor.L
        self.l2_lambda = l2_lambda
        self.weights = None

    def _compute_scores(self, X, Y=None):
        """
        Compute score matrices M_t(i, j), or the score of a given sequence

        Returns:
            If Y is None: list of (L, L) matrices
            If Y given: scalar score
        """
        T = len(X)

        if Y is None:
            # Compute all possible transition scores
            M = []
            for t in range(T):
                M_t = np.zeros((self.L, self.L))
                for i in range(self.L):
                    for j in range(self.L):
                        if t == 0 and self.labels[i] != '<START>':
                            M_t[i, j] = -np.inf
                            continue

                        features = self.feature_extractor.extract_features(
                            X, t, self.labels[i], self.labels[j]
                        )
                        score = sum(self.weights[f_id] * val
                                    for f_id, val in features.items()
                                    if f_id < len(self.weights))
                        M_t[i, j] = score
                M.append(M_t)
            return M
        else:
            # Compute score for the given sequence
            global_features = self.feature_extractor.extract_global_features(X, Y)
            score = sum(self.weights[f_id] * count
                        for f_id, count in global_features.items()
                        if f_id < len(self.weights))
            return score

    def _forward(self, X):
        """
        Forward algorithm to compute log alpha and log Z

        Returns:
            log_alpha: (T, L) array
            log_Z: Log partition function
        """
        T = len(X)
        M = self._compute_scores(X)

        # Log-space computation
        log_alpha = np.full((T, self.L), -np.inf)

        # Initialization: transitions out of START
        start_idx = self.labels.index('<START>')
        log_alpha[0, :] = M[0][start_idx, :]

        # Recursion
        for t in range(1, T):
            for j in range(self.L):
                log_alpha[t, j] = logsumexp(log_alpha[t - 1] + M[t][:, j])

        # Termination
        log_Z = logsumexp(log_alpha[-1])

        return log_alpha, log_Z

    def _backward(self, X):
        """
        Backward algorithm to compute log beta

        Returns:
            log_beta: (T, L) array
        """
        T = len(X)
        M = self._compute_scores(X)

        log_beta = np.full((T, self.L), -np.inf)

        # Initialization
        log_beta[-1, :] = 0.0

        # Recursion
        for t in range(T - 2, -1, -1):
            for i in range(self.L):
                log_beta[t, i] = logsumexp(M[t + 1][i, :] + log_beta[t + 1])

        return log_beta

    def _compute_gradient(self, X_list, Y_list):
        """
        Compute (negative) gradient and negative regularized log-likelihood

        Returns:
            neg_gradient: Negative gradient vector (for minimization)
            neg_log_likelihood: Negative regularized log-likelihood
        """
        num_features = len(self.weights)
        empirical_counts = np.zeros(num_features)
        expected_counts = np.zeros(num_features)
        log_likelihood = 0.0

        for X, Y in zip(X_list, Y_list):
            # Empirical feature counts
            global_features = self.feature_extractor.extract_global_features(X, Y)
            for f_id, count in global_features.items():
                if f_id < num_features:
                    empirical_counts[f_id] += count

            # True sequence score
            score = self._compute_scores(X, Y)

            # Forward-backward algorithm
            log_alpha, log_Z = self._forward(X)
            log_beta = self._backward(X)

            log_likelihood += score - log_Z

            # Expected feature counts
            T = len(X)
            M = self._compute_scores(X)

            for t in range(T):
                for i in range(self.L):
                    for j in range(self.L):
                        # Marginal probability P(y_{t-1}=i, y_t=j | X)
                        if t == 0:
                            log_marginal = M[t][i, j] + log_beta[t, j] - log_Z
                        else:
                            log_marginal = (log_alpha[t - 1, i] + M[t][i, j] +
                                            log_beta[t, j] - log_Z)

                        if log_marginal == -np.inf:   # impossible transition
                            continue
                        marginal = np.exp(log_marginal)

                        # Feature expectation
                        features = self.feature_extractor.extract_features(
                            X, t, self.labels[i], self.labels[j]
                        )
                        for f_id, val in features.items():
                            if f_id < num_features:
                                expected_counts[f_id] += marginal * val

        # Gradient = empirical features - expected features - L2 regularization term
        gradient = empirical_counts - expected_counts - self.l2_lambda * self.weights

        # Objective = negative log-likelihood + L2 penalty
        neg_log_likelihood = (-log_likelihood +
                              0.5 * self.l2_lambda * np.sum(self.weights ** 2))

        return -gradient, neg_log_likelihood  # negate because LBFGS minimizes

    def fit(self, X_list, Y_list, max_iter=100):
        """
        Train CRF model

        Args:
            X_list: List of observation sequences
            Y_list: List of label sequences
            max_iter: Maximum number of iterations
        """
        # First pass: build the feature dictionary from the training data.
        # Feature IDs are created lazily, so the weight vector must be
        # sized only after this pass.
        for X, Y in zip(X_list, Y_list):
            self.feature_extractor.extract_global_features(X, Y)

        num_features = self.feature_extractor.next_feature_idx
        self.weights = np.zeros(num_features)

        # Define objective function
        def objective(w):
            self.weights = w
            neg_grad, neg_ll = self._compute_gradient(X_list, Y_list)
            return neg_ll, neg_grad

        # LBFGS optimization
        print("Training CRF with LBFGS...")
        self.weights, f_min, info = fmin_l_bfgs_b(
            objective,
            self.weights,
            maxiter=max_iter,
            disp=1
        )

        print(f"Training finished. Final negative log-likelihood: {f_min:.4f}")
        return self

    def viterbi(self, X):
        """
        Viterbi decoding

        Returns:
            best_labels: Optimal label sequence
        """
        T = len(X)
        M = self._compute_scores(X)

        # Dynamic programming
        delta = np.full((T, self.L), -np.inf)
        psi = np.zeros((T, self.L), dtype=int)

        # Initialization
        start_idx = self.labels.index('<START>')
        delta[0, :] = M[0][start_idx, :]

        # Recursion
        for t in range(1, T):
            for j in range(self.L):
                scores = delta[t - 1] + M[t][:, j]
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]]

        # Backtracking
        best_path = np.zeros(T, dtype=int)
        best_path[-1] = np.argmax(delta[-1])

        for t in range(T - 2, -1, -1):
            best_path[t] = psi[t + 1, best_path[t + 1]]

        # Convert to labels
        best_labels = [self.labels[idx] for idx in best_path
                       if self.labels[idx] not in ['<START>', '<END>']]

        return best_labels

    def predict(self, X_list):
        """Batch prediction"""
        return [self.viterbi(X) for X in X_list]
```

Named Entity Recognition Application

```python
def ner_example():
    """
    Named Entity Recognition Example
    """
    # Training data (BIO tagging)
    X_train = [
        ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii'],
        ['Apple', 'Inc.', 'is', 'located', 'in', 'Cupertino'],
        ['The', 'company', 'was', 'founded', 'in', '1976']
    ]

    Y_train = [
        ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC'],
        ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC'],
        ['O', 'O', 'O', 'O', 'O', 'B-DATE']
    ]

    # Test data
    X_test = [
        ['Steve', 'Jobs', 'founded', 'Apple'],
        ['Google', 'is', 'in', 'California']
    ]

    # Train CRF
    labels = set(label for seq in Y_train for label in seq)
    crf = LinearChainCRF(labels, l2_lambda=0.1)
    crf.fit(X_train, Y_train, max_iter=50)

    # Predict
    Y_pred = crf.predict(X_test)

    for X, Y in zip(X_test, Y_pred):
        print(f"Sentence: {' '.join(X)}")
        print(f"Predicted labels: {' '.join(Y)}")
        print()
```

Advanced Topics and Optimization

Feature Template Design

Window Features:

```python
# Window of 2 words before and after
for offset in [-2, -1, 0, 1, 2]:
    if 0 <= t + offset < T:
        features.append(f"word[{offset}]={X[t+offset]},label={y_t}")
```

Dictionary Features:

```python
if word in person_dict:
    features.append(f"in_person_dict=True,label={y_t}")
```

Regex Features:

```python
import re
if re.match(r'\d{4}-\d{2}-\d{2}', word):  # Date format
    features.append(f"date_pattern=True,label={y_t}")
```

Semi-Markov CRF

Allows segment-level labeling: Labels span multiple positions

Applications: Chinese word segmentation, named entity recognition

Structured Perceptron

Online learning version: Doesn't need to compute partition function

Update Rule: $\mathbf{w} \leftarrow \mathbf{w} + F(\mathbf{X}, \mathbf{Y}^{\text{true}}) - F(\mathbf{X}, \hat{\mathbf{Y}})$, where $\hat{\mathbf{Y}}$ is the current Viterbi prediction

Advantage: fast training, but it yields only scores, not the calibrated probabilities a CRF provides
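One structured-perceptron step, sketched with toy global feature vectors (in a real tagger `F_true` and `F_pred` would come from the feature extractor and from Viterbi decoding under the current weights - here they are invented numbers):

```python
import numpy as np

def perceptron_update(w, F_true, F_pred, lr=1.0):
    """w <- w + lr * (F(X, Y_true) - F(X, Y_hat)); no partition function needed."""
    return w + lr * (F_true - F_pred)

w = np.zeros(3)
F_true = np.array([1.0, 0.0, 2.0])   # features of the gold sequence
F_pred = np.array([0.0, 1.0, 2.0])   # features of the current Viterbi output
w = perceptron_update(w, F_true, F_pred)
print(w)  # [ 1. -1.  0.]
```

When the prediction is already correct the two feature vectors coincide and the update is zero, which is why the algorithm needs only decoding, never $Z(\mathbf{X})$.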

In-Depth Q&A

Q1: What are the core advantages of CRF compared to HMM?

A: CRF is a discriminative model that directly models $P(\mathbf{Y} \mid \mathbf{X})$, while HMM is a generative model that models $P(\mathbf{X}, \mathbf{Y})$. Specific advantages:

  1. No observation independence assumption: HMM assumes $P(x_t \mid \mathbf{Y}, \mathbf{X}_{\setminus t}) = P(x_t \mid y_t)$; CRF allows arbitrary overlapping features
  2. Flexible feature engineering: CRF can use global features and overlapping features (like surrounding words, dictionary matching)
  3. Better sequence labeling performance: On NER and POS tagging tasks, CRF typically outperforms HMM by 5-10 percentage points

Q2: Why does CRF need forward-backward algorithm to compute gradients?

A: The gradient contains the expectation term $\mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X})}[F(\mathbf{X}, \mathbf{Y})]$; direct computation requires enumerating $L^T$ sequences. The forward-backward algorithm exploits the linear-chain structure to decompose the expectation into a sum over marginal probabilities:

$$\mathbb{E}[F_k] = \sum_{t} \sum_{i, j} P(y_{t-1} = i, y_t = j \mid \mathbf{X})\, f_k(i, j, \mathbf{X}, t)$$

The marginal probability $P(y_{t-1} = i, y_t = j \mid \mathbf{X})$ can be efficiently computed from $\alpha$ and $\beta$, reducing the complexity to $O(T L^2)$.

Q3: Why is the partition function Z(X) hard to compute?

A: The partition function is defined as

$$Z(\mathbf{X}) = \sum_{\mathbf{Y}} \exp\left(\mathbf{w}^\top F(\mathbf{X}, \mathbf{Y})\right)$$

It requires summing over all $L^T$ possible label sequences. Although the forward algorithm can compute it efficiently ($O(T L^2)$), it must be recomputed for every parameter update, leading to high training cost. This is the main cost of CRF compared to the perceptron.

Q4: Why is L2 regularization important for CRF?

A: Two reasons:

  1. Prevent overfitting: The number of features is usually very large (tens of thousands to millions); the L2 penalty $\frac{\lambda}{2}\|\mathbf{w}\|^2$ prevents weight explosion
  2. Ensure convexity: The original objective function is convex; adding L2 makes it strongly convex, ensuring LBFGS converges to the global optimum

$\lambda$ is usually selected through cross-validation.

Q5: What's the difference between Viterbi decoding and marginal decoding?

A: Two decoding strategies:

  • Viterbi decoding: find the sequence with maximum joint probability, $\hat{\mathbf{Y}} = \arg\max_{\mathbf{Y}} P(\mathbf{Y} \mid \mathbf{X})$
  • Marginal decoding: independently select the label with maximum marginal probability at each position, $\hat{y}_t = \arg\max_{j} P(y_t = j \mid \mathbf{X})$

Viterbi guarantees global consistency (e.g., BIO tagging constraints); marginal decoding may produce illegal sequences. Viterbi is more commonly used in practice.

Q6: How to handle large label sets?

A: When the number of labels $L$ is large, the $O(T L^2)$ complexity becomes unacceptable (character tagging for Chinese word segmentation uses only a handful of labels, but word-level tagging may have thousands). Solutions:

  1. Beam Search approximation: keep only the top-K candidate states
  2. Hierarchical labeling: first coarse-grained labeling, then fine-grained
  3. Neural CRF: use neural networks to learn transition matrices, reducing parameters

Q7: Can CRF handle non-sequential structures?

A: Yes! General CRFs can be defined on arbitrary graph structures: - Tree CRF: dependency parsing - Grid CRF: image segmentation (2D CRF) - Fully-connected CRF: dense pairwise potentials over all node pairs (e.g., refining image segmentation)

But non-chain structures have higher inference complexity, usually requiring approximation algorithms (like loopy belief propagation).

Q8: Why does MEMM have the label bias problem?

A: MEMM uses local normalization:

$$P(y_t \mid y_{t-1}, x_t) = \frac{\exp\left(s(y_{t-1}, y_t, x_t)\right)}{\sum_{y'} \exp\left(s(y_{t-1}, y', x_t)\right)}$$

Problem: states with fewer outgoing transitions have smaller normalization denominators, so their transition probabilities tend toward 1 and the model prefers these states regardless of the observations. CRF solves this through global normalization:

$$P(\mathbf{Y} \mid \mathbf{X}) = \frac{\exp\left(\sum_t s(y_{t-1}, y_t, \mathbf{X}, t)\right)}{Z(\mathbf{X})}$$

All paths compete against the same denominator, avoiding the local bias.

Q9: How does the number of features affect CRF performance?

A: Relationship between number of features and performance:

  • Too few: underfitting, can't capture complex patterns
  • Too many: overfitting, slow training, poor generalization

Rule of thumb:

  • Basic features (words, POS, affixes): thousands to tens of thousands
  • Adding dictionary features: tens of thousands to hundreds of thousands
  • Use regularization + feature selection (like L1) to control complexity

Q10: How to combine CRF with deep learning?

A: Modern sequence labeling typically uses BiLSTM-CRF:

  • BiLSTM: learns contextual representations
  • CRF: models label dependencies and enforces label legality

Structure: the BiLSTM outputs serve as emission scores for the CRF, and the CRF layer learns the transition matrix. The two are optimized jointly during training, with Viterbi used at inference.

Q11: How to evaluate CRF models?

A: Evaluation metrics for sequence labeling tasks:

  1. Token-level accuracy: correctness of individual labels
  2. Span-level F1 (more important): entity-level precision, recall, and F1
     - Precision = correct entities / predicted entities
     - Recall = correct entities / true entities
  3. Confusion matrix: analyze which labels are easily confused

Q12: How to optimize CRF training time?

A: Optimization strategies:

  1. Feature hashing: use hash functions to reduce feature dimensionality
  2. Mini-batch gradients: don't compute gradients on the full data each step
  3. Parallelization: forward-backward over multiple sequences can run in parallel
  4. Early stopping: monitor validation performance to avoid overfitting
  5. Pre-training initialization: quickly pre-train weights with a perceptron

References

[1] Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.

[2] Sutton, C., & McCallum, A. (2012). An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4), 267-373.

[3] Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. NAACL-HLT.

[4] Wallach, H. M. (2004). Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania.

[5] Sarawagi, S., & Cohen, W. W. (2005). Semi-Markov conditional random fields for information extraction. NIPS.

[6] Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.

[7] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

[8] Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. ACL.

[9] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. NAACL-HLT.

[10] Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS.

  • Create time: 2025-05-05
  • Post link: https://www.chenk.top/ml-math-16-conditional-random-field/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.