Unit 4: Conditional Random Fields
Let's dive into a comprehensive explanation of Conditional Random Fields (CRFs),
focusing on the following key components:
1. Linear Chain CRF
2. Partition Function
3. Markov Network
4. Belief Propagation
5. Training CRFs
6. Hidden Markov Model (HMM)
7. Entropy
We’ll cover each with:
✅ Definitions
🧠 Procedures / Algorithms
💡 Applications
📌 Examples
1. Conditional Random Fields (CRFs)
✅ Definition:
Conditional Random Fields (CRFs) are a class of statistical models used for structured
prediction. Unlike classification models that predict single labels, CRFs model sequences
or structures and are particularly useful when outputs are interdependent.
They are discriminative probabilistic models, meaning they directly model P(Y∣X),
where:
Y : Output sequence (e.g., part-of-speech tags)
X : Input sequence (e.g., words in a sentence)
2. Linear Chain CRF
✅ Definition:
A linear chain CRF is a special case of CRFs designed for sequential data, such as
sentences or time series. It assumes that each output label depends only on the current
input and the previous label — similar to a first-order Markov model.
This structure is widely used in Natural Language Processing (NLP) tasks like Named
Entity Recognition (NER) and Part-of-Speech (POS) tagging.
🧠 Structure:
Each node represents a label at position t, connected to the previous label and the
corresponding input feature vector.
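The chain-structured scoring can be sketched in a few lines of Python. The two feature functions and their weights below are made-up illustrations (a real CRF learns the weights from data):

```python
# A minimal sketch of linear-chain CRF scoring with illustrative,
# hand-picked feature functions and weights (assumptions, not learned).

def f_transition(y_prev, y, x):
    # Fires when a determiner is followed by a noun.
    return 1.0 if (y_prev == "DET" and y == "NOUN") else 0.0

def f_emission(y_prev, y, x):
    # Fires when the word "the" is tagged as a determiner.
    return 1.0 if (x == "the" and y == "DET") else 0.0

FEATURES = [f_transition, f_emission]
WEIGHTS = [1.5, 2.0]  # the lambda_k, normally learned from data

def score(labels, words):
    """Unnormalized score: sum over positions t and features k of
    lambda_k * f_k(y_t, y_prev, x_t)."""
    total = 0.0
    y_prev = "<START>"
    for y, x in zip(labels, words):
        for lam, f in zip(WEIGHTS, FEATURES):
            total += lam * f(y_prev, y, x)
        y_prev = y
    return total

print(score(["DET", "NOUN"], ["the", "dog"]))  # 1.5 + 2.0 = 3.5
```

Each label depends only on its predecessor and the current input, which is exactly the first-order Markov assumption described above.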
3. Partition Function
✅ Definition:
The partition function Z(X) is the normalization term that turns the unnormalized
scores of a CRF into a valid probability distribution. It sums the exponentiated
score of every possible output sequence:
Z(X) = Σ_Y exp( Σ_t Σ_k λk fk(yt, yt−1, xt) )
🧮 Why Important?
It ensures that the total probability over all possible output sequences equals 1.
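For a tiny label set, Z(X) can be computed by brute-force enumeration of all label sequences, which makes the normalization property easy to check. The two features in the toy score below are illustrative assumptions:

```python
import math
from itertools import product

LABELS = ["DET", "NOUN"]

def score(labels, words):
    # Toy unnormalized score; the two features are illustrative assumptions.
    s = 0.0
    y_prev = "<START>"
    for y, x in zip(labels, words):
        if y_prev == "DET" and y == "NOUN":
            s += 1.0                      # transition feature
        if x == "the" and y == "DET":
            s += 1.0                      # emission feature
        y_prev = y
    return s

def partition(words):
    """Z(X): sum of exp(score) over every possible label sequence."""
    return sum(math.exp(score(ys, words))
               for ys in product(LABELS, repeat=len(words)))

words = ["the", "dog"]
Z = partition(words)
probs = [math.exp(score(ys, words)) / Z
         for ys in product(LABELS, repeat=len(words))]
print(abs(sum(probs) - 1.0) < 1e-12)     # True: Z makes P(Y|X) sum to 1
```

Enumeration is exponential in the sequence length; in practice Z(X) is computed with the forward algorithm mentioned later.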
4. Markov Network (Markov Random Field)
✅ Definition:
A Markov network (also called a Markov Random Field) is an undirected graphical
model representing a set of random variables having a Markov property described by an
undirected graph.
CRFs can be seen as a type of conditional Markov network, where we condition on
observed inputs.
🧠 Key Properties:
Undirected edges between nodes
No directionality; dependencies are mutual
Uses cliques and potential functions to define joint distributions
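A minimal sketch of how clique potentials define a joint distribution, using a three-node chain A–B–C whose cliques are {A, B} and {B, C}. The potential tables are made-up numbers for illustration:

```python
from itertools import product

# Potentials for a 3-node chain A - B - C (cliques: {A,B} and {B,C}).
# The entries are arbitrary nonnegative numbers, not probabilities.
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def joint(a, b, c):
    """Unnormalized joint: product of the clique potentials."""
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Normalizing over all assignments yields a valid joint distribution.
Z = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
p = {(a, b, c): joint(a, b, c) / Z for a, b, c in product([0, 1], repeat=3)}
print(round(sum(p.values()), 10))  # 1.0
```

Note there is no directionality: phi_AB scores the pair (A, B) symmetrically rather than saying A "causes" B.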
5. Belief Propagation
✅ Definition:
Belief propagation (also known as sum-product message passing) is an algorithm used
to perform inference in graphical models like Markov networks and Bayesian networks.
In linear-chain CRFs, belief propagation is used to compute:
Marginal probabilities of individual labels
Joint probabilities of sequences
Most probable output sequence (via Viterbi-style decoding)
🔁 Algorithm Steps:
1. Pass messages forward and backward along the chain.
2. Combine messages to compute marginals or best path.
💡 Application:
Used during both training and inference in CRFs.
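The forward/backward message passing described above can be sketched on a small chain. The unary (phi) and pairwise (psi) potentials below are arbitrary illustrative numbers; each position's marginal comes from combining its forward and backward messages:

```python
# Sum-product message passing on a 3-node chain with binary labels.
# phi[t][y] is a unary potential, psi[y_prev][y] a pairwise potential;
# all numbers are illustrative assumptions.

phi = [[2.0, 1.0], [1.0, 1.0], [1.0, 3.0]]   # phi[t][y]
psi = [[2.0, 1.0], [1.0, 2.0]]               # psi[y_prev][y]

T, K = len(phi), 2
alpha = [[0.0] * K for _ in range(T)]        # forward messages
beta = [[0.0] * K for _ in range(T)]         # backward messages

# Forward pass: accumulate evidence from the left.
alpha[0] = phi[0][:]
for t in range(1, T):
    for y in range(K):
        alpha[t][y] = phi[t][y] * sum(alpha[t-1][yp] * psi[yp][y]
                                      for yp in range(K))

# Backward pass: accumulate evidence from the right.
beta[T-1] = [1.0, 1.0]
for t in range(T - 2, -1, -1):
    for y in range(K):
        beta[t][y] = sum(psi[y][yn] * phi[t+1][yn] * beta[t+1][yn]
                         for yn in range(K))

# Combine messages: marginal at t is alpha * beta, normalized by Z.
Z = sum(alpha[T-1])
marginals = [[alpha[t][y] * beta[t][y] / Z for y in range(K)]
             for t in range(T)]
for t in range(T):
    print(t, [round(m, 3) for m in marginals[t]])
```

Replacing the sums with maxes turns this into Viterbi-style max-product decoding, which finds the single most probable sequence instead of marginals.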
6. Training CRFs
✅ Definition:
CRF training involves learning the optimal weights λk that maximize the conditional
likelihood of the correct output sequence given the input.
🧠 Training Procedure:
Step-by-step:
1. Feature Extraction: define feature functions fk(yt, yt−1, xt)
2. Forward-Backward Algorithm: used to compute gradients efficiently
3. Optimization: use L-BFGS, SGD, or Adam to maximize the conditional log-likelihood
💡 Applications:
Named Entity Recognition
POS Tagging
Handwriting recognition
Bioinformatics (sequence labeling)
📌 Example:
Train a CRF to tag each word in a sentence with its part-of-speech (e.g., noun, verb,
adjective).
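The observed-minus-expected gradient at the heart of CRF training can be demonstrated by brute force on a toy two-word example. Real implementations use the forward-backward algorithm instead of enumerating sequences, and the two features here are illustrative assumptions:

```python
import math
from itertools import product

LABELS = ["DET", "NOUN"]
words = ["the", "dog"]
gold = ["DET", "NOUN"]           # the correct tag sequence for this example

def features(labels):
    # f1: count of DET -> NOUN transitions; f2: count of "the" tagged DET.
    f1 = sum(1.0 for a, b in zip(["<START>"] + list(labels), labels)
             if a == "DET" and b == "NOUN")
    f2 = sum(1.0 for x, y in zip(words, labels) if x == "the" and y == "DET")
    return [f1, f2]

def log_likelihood_and_grad(w):
    seqs = list(product(LABELS, repeat=len(words)))
    scores = [sum(wi * fi for wi, fi in zip(w, features(ys))) for ys in seqs]
    Z = sum(math.exp(s) for s in scores)
    expected = [0.0, 0.0]        # E_{P(Y|X)}[f_k], by enumeration
    for ys, s in zip(seqs, scores):
        p = math.exp(s) / Z
        for k, fk in enumerate(features(ys)):
            expected[k] += p * fk
    gold_f = features(gold)
    ll = sum(wi * fi for wi, fi in zip(w, gold_f)) - math.log(Z)
    grad = [o - e for o, e in zip(gold_f, expected)]  # observed - expected
    return ll, grad

w = [0.0, 0.0]
for _ in range(50):              # plain gradient ascent
    ll, grad = log_likelihood_and_grad(w)
    w = [wi + 0.5 * gi for wi, gi in zip(w, grad)]
print(ll > math.log(0.25))       # True: improved over the uniform start
```

The conditional log-likelihood is concave in the weights, which is why L-BFGS and SGD work well here.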
7. Hidden Markov Model (HMM)
✅ Definition:
An HMM is a generative probabilistic model for sequences. It assumes that there’s a
hidden (unobserved) sequence of states that generates the observed sequence.
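The generative factorization P(X, Y) = π(y1) · Π a(y(t−1), yt) · Π b(yt, xt) can be computed directly for a toy HMM. All probability tables below are made up for illustration:

```python
# A toy HMM over two hidden states; pi is the initial distribution,
# a the transition probabilities, b the emission probabilities.
# All numbers are illustrative assumptions.

pi = {"DET": 0.6, "NOUN": 0.4}
a = {("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
     ("NOUN", "DET"): 0.5, ("NOUN", "NOUN"): 0.5}
b = {("DET", "the"): 0.7, ("DET", "dog"): 0.3,
     ("NOUN", "the"): 0.1, ("NOUN", "dog"): 0.9}

def joint(states, observations):
    """Generative joint probability P(X, Y)."""
    p = pi[states[0]] * b[(states[0], observations[0])]
    for t in range(1, len(states)):
        p *= a[(states[t-1], states[t])] * b[(states[t], observations[t])]
    return p

print(joint(["DET", "NOUN"], ["the", "dog"]))  # 0.6*0.7*0.9*0.9 ≈ 0.3402
```

Because an HMM models P(X, Y) rather than P(Y∣X), each observation can depend only on its own hidden state, which is the restriction the table below contrasts with CRFs.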
🆚 HMM vs CRF:

| Feature | HMM | CRF |
| --- | --- | --- |
| Type | Generative | Discriminative |
| Modeling | P(X, Y) | P(Y∣X) |
| Dependencies | Assumes X depends only on the current state | Can use arbitrary features of the entire input |
| Label Bias Problem | ❌ Suffers from it | ✅ Avoids it |
8. Entropy
✅ Definition:
Entropy measures uncertainty in a probability distribution. In CRFs, entropy can be used
to:
Evaluate confidence in predictions
Regularize models during training
For a discrete distribution p = [p1, ..., pn], entropy is:
H(p) = −Σi pi log pi
💡 Application:
High entropy → uncertain prediction
Low entropy → confident prediction
Used in semi-supervised learning and active learning strategies.
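A quick sketch of using entropy as a confidence measure over a model's predicted label distribution; the two distributions below are illustrative:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p_i * log p_i (natural log).
    Zero-probability entries contribute nothing, so they are skipped."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: model is sure
uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy: model is unsure
print(entropy(confident) < entropy(uncertain))  # True
```

In an active-learning loop, the examples whose predicted distributions have the highest entropy are the ones selected for human labeling.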
📊 Summary Table

| Concept | Description | Use Case | Example |
| --- | --- | --- | --- |
| Linear Chain CRF | Sequential CRF with first-order Markov assumption | NER, POS tagging | Labeling "Apple" as "Organization" |
| Partition Function | Normalizes probabilities | Ensures valid distribution | Z(X) in P(Y∣X) |
| Markov Network | Undirected graphical model | General structured modeling | CRFs are conditional MRFs |
| Belief Propagation | Inference algorithm | Compute marginals or most likely sequence | Decoding in CRF |
| Training CRFs | Maximize conditional likelihood | Sequence labeling | POS tagging using L-BFGS |
| HMM | Generative sequence model | Speech recognition | Part-of-speech tagging |
| Entropy | Measure of uncertainty | Confidence estimation | Active learning in CRFs |
📝 Final Notes:
CRFs outperform HMMs in many NLP tasks because they avoid the label bias
problem and allow richer feature representations.
Linear-chain CRFs are the most commonly used variant due to their efficiency
and applicability to real-world problems.
Belief propagation and dynamic programming (like Viterbi) play crucial roles in
CRF inference and decoding.