NLP and the Web – WS 2024/2025
Lecture 2
Foundations of Text Classification
Dr. Thomas Arnold
Hovhannes Tamoyan
Kexin Wang
Ubiquitous Knowledge Processing Lab
Technische Universität Darmstadt
Syllabus (tentative)
Nr. Lecture
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation, Data Collection
05 IR – Re-Ranking Methods
06 IR – Language Domain Shifts, Dense / Sparse Retrieval
07 LLM – Language Modeling Foundations
08 LLM – Neural LLM, Tokenization
09 LLM – Transformers, Self-Attention
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam
WS24/25 | Computer Science Department | UKP - Prof. Dr. Iryna Gurevych | 2
Last Lecture – Linguistic Analysis Levels
Phonetics and Phonology
Segmentation
Morphology
Syntax
Semantics
Pragmatics and Discourse
Today's lecture
Text Classification: Introduction
Algorithms
Naive Bayes
Hidden Markov Models
What is Text Classification?
Input Text → Classification Model → Output tags / classes
Examples – Spam Detection
From: Danke (archdigest@news.condenast.com)
Examples – Sentiment Analysis
Positive or negative movie review?
▪ This is it. This is the one. This is the worst movie ever made. Ever. It
beats everything. I have never seen worse.
▪ Expertly scripted and perfectly delivered, this searing parody leaves you
literally rolling with laughter.
▪ While watching this film I started to come up with things I would rather be
doing, including drinking bleach, rubbing sand in my eyes, and tax
returns.
▪ Just finished watching this movie for maybe the 7th or 8th time
More Examples
▪ Topic Labeling
▪ Age/Gender Identification
▪ Authorship Identification
▪ Language Identification
Example text: "I've seen George Foreman shadow boxing and the shadow won."
Approaches
Rule-based
Handcrafted linguistic rules
Human comprehensible
Pro: Precision can be high
Con: Very expensive to build and maintain
IF "basketball" THEN
return top_sports
ELSEIF…
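As a minimal sketch, a rule of this kind could look as follows in Python. The keyword lists and label names are hypothetical; real rule-based systems maintain far larger, hand-crafted rule sets:

```python
# Sketch of a rule-based topic classifier with hand-written keyword rules.
# Keyword lists and labels are made up for illustration.
def classify_topic(text: str) -> str:
    text = text.lower()
    if "basketball" in text or "football" in text:
        return "top_sports"
    elif "parliament" in text or "election" in text:
        return "top_politics"
    return "top_other"

print(classify_topic("The basketball season starts today"))  # top_sports
```

High precision is possible because each rule is hand-checked, but every new topic or phrasing requires a new rule, which is why maintenance is expensive.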
Approaches
Supervised Machine Learning
Step 1: Training
(Diagram: Input Text + Tags → Feature Extractor → Features → train Classification Model)
Approaches
Supervised Machine Learning
Step 2: Prediction
(Diagram: Input Text → Feature Extractor → Features → Classification Model → Tags)
Approaches
Supervised Machine Learning
▪ Classification model based on training data
▪ Human comprehensible? Dependent on the model!
▪ Pro: Easier to maintain, usually more accurate
▪ Con: Needs training data
Comparison to Neural Models
Kuzman T, Mozetič I, Ljubešić N. Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of
Classification Methods in the Era of Large Language Models. Machine Learning and Knowledge Extraction. 2023; 5(3):1149-1175
Today's lecture
Text Classification: Introduction
Algorithms
Naive Bayes
Hidden Markov Models
Naïve Bayes Classifier: Background
▪ Before starting, let's review related concepts:
▪ Conditional Probability
▪ Bayes' Rule.
Conditional Probability
▪ Conditional probability in plain English:
▪ What is the probability that something will happen, given that something else has already happened?
▪ Assume we have some Outcome O and some Evidence E.
▪ P(O, E): the probability of having both the Outcome O and Evidence E is the product of two probabilities:
▪ P(O): probability of O occurring
▪ P(E|O): probability of E given that O happened
P(O, E) = P(O) × P(E|O)
Conditional Probability: Example
▪ Let's say we have a group of students
▪ Students could be either awake or asleep
▪ Students are also either bachelor or master students
▪ If we select one student randomly: what is the probability that this person
is a sleeping bachelor student?
▪ Conditional Probability can help us answer that:
▪ P(Asleep, Bachelor) = P(Asleep) × P(Bachelor | Asleep)
▪ We could compute the exact same thing, the reverse way
▪ P(Asleep, Bachelor) = P(Bachelor) × P(Asleep | Bachelor)
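A quick numeric check of this identity, using hypothetical counts for the student example (the numbers are made up for illustration):

```python
# Toy counts for the student example (hypothetical numbers), checking that
# P(Asleep, Bachelor) = P(Asleep) * P(Bachelor|Asleep)
#                     = P(Bachelor) * P(Asleep|Bachelor).
counts = {("asleep", "bachelor"): 10, ("asleep", "master"): 5,
          ("awake", "bachelor"): 50, ("awake", "master"): 35}
total = sum(counts.values())  # 100 students

n_ab = counts[("asleep", "bachelor")]
n_asleep = n_ab + counts[("asleep", "master")]
n_bachelor = n_ab + counts[("awake", "bachelor")]

p_joint = n_ab / total                       # P(Asleep, Bachelor) = 0.1
p_asleep, p_bachelor = n_asleep / total, n_bachelor / total
p_b_given_a = n_ab / n_asleep                # P(Bachelor | Asleep)
p_a_given_b = n_ab / n_bachelor              # P(Asleep | Bachelor)

# Both factorizations give the same joint probability:
assert abs(p_joint - p_asleep * p_b_given_a) < 1e-12
assert abs(p_joint - p_bachelor * p_a_given_b) < 1e-12
```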
Bayes Rule
▪ Conceptually, this is a way to go from P(Evidence | Outcome) to P(Outcome | Evidence)
▪ Let's see how to do that
▪ We have:
▪ P(O, E) = P(O) × P(E|O) and P(O, E) = P(E) × P(O|E)
▪ → P(O) × P(E|O) = P(E) × P(O|E)
P(O|E) = P(E|O) × P(O) / P(E)
Bayes' Rule
Getting to Naïve Bayes
▪ So far, we have talked only about one piece of evidence.
▪ In reality, we have to predict an outcome given multiple pieces of evidence
▪ the math gets very complicated :(
▪ Naive Bayes' solution:
▪ treat each piece of evidence as independent -> simpler math :)
▪ This assumption is why this is called naive Bayes
▪ Suppose we have multiple pieces of evidence E1, …, En and an outcome O
P(O | E1, …, En) = P(E1|O) × P(E2|O) × … × P(En|O) × P(O) / P(E1, E2, …, En)
Getting to Naïve Bayes
P(O | E1, …, En) = P(E1|O) × P(E2|O) × … × P(En|O) × P(O) / P(E1, E2, …, En)
Many people choose to remember this as:
P(outcome | evidence) = P(Likelihood of Evidence) × Prior probability of outcome / P(Evidence)
▪ Notes:
• If P(evidence|outcome) is 1, then we are just multiplying by 1.
• If P(some particular evidence|outcome) is 0, then the whole probability becomes 0.
• Since we divide everything by P(Evidence), we can even get away without calculating it.
• The intuition behind multiplying by the prior is to give high probability to more common outcomes and low probability to unlikely outcomes.
• These are also called base rates; they are a way to scale our predicted probabilities.
Naïve Bayes: Example
▪ Let's say that we have data on 1000 pieces of fruit: Banana, Orange or
some Other Fruit
▪ We know 3 characteristics about each fruit
▪ Whether it is Long
▪ Whether it is Sweet and
▪ If its color is Yellow
▪ Our training set
Type Long Not Long Sweet Not Sweet Yellow Not Yellow Total
Banana 400 100 350 150 450 50 500
Orange 0 300 150 150 300 0 300
Other Fruit 100 100 150 50 50 150 200
Total 500 500 650 350 800 200 1000
Naïve Bayes: Example cont.
▪ “Prior” probabilities
▪ P(Banana) = 500/1000 = 0.5, P(Orange) = 0.3, P(Other Fruit) = 0.2
▪ Probability of "Evidence"
▪ P(Long) = 500/1000 = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8
▪ Probability of "Likelihood"
▪ P(Long|Banana) = 0.8, P(Long|Orange) = 0
▪ P(Yellow|Other Fruit) = 50/200 = 0.25, P(Not Yellow|Other Fruit) = 0.75
Naïve Bayes: Example cont.
▪ "Prior" probabilities
▪ P(Banana) = 0.5 (500/1000), P(Orange) = 0.3, P(Other Fruit) = 0.2
▪ Probability of "Evidence"
▪ P(Long) = 500/1000 = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8
▪ Probability of "Likelihood"
▪ P(Long|Banana) = 0.8, P(Long|Orange) = 0
▪ P(Yellow|Other Fruit) = 50/200 = 0.25, P(Not Yellow|Other Fruit) = 0.75
Given an unknown fruit which is long, sweet and yellow, is it Banana, Orange or Other Fruit?
▪ P(Banana | Long, Sweet, Yellow) = P(Long|Banana) × P(Sweet|Banana) × P(Yellow|Banana) × P(Banana) / (P(Long) × P(Sweet) × P(Yellow))
= 0.8 × 0.7 × 0.9 × 0.5 / P(evidence) = 0.252 / P(evidence)
▪ P(Orange | Long, Sweet, Yellow) = 0, why?
▪ P(Other Fruit | Long, Sweet, Yellow) = P(Long|Other Fruit) × P(Sweet|Other Fruit) × P(Yellow|Other Fruit) × P(Other Fruit) / (P(Long) × P(Sweet) × P(Yellow))
= (100/200 × 150/200 × 50/200 × 200/1000) / P(evidence) = 0.01875 / P(evidence)
0.252 >> 0.01875 => the unknown fruit is most likely a banana
Is it clear now why we don't need to calculate P(evidence)?
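The whole fruit example can be reproduced in a few lines. Since P(evidence) is identical for every class, the code only compares the unnormalized scores P(E1|O) × … × P(En|O) × P(O):

```python
# Naive Bayes over the fruit table: compare unnormalized scores
# P(Long|fruit) * P(Sweet|fruit) * P(Yellow|fruit) * P(fruit);
# the shared denominator P(evidence) is dropped.
counts = {  # from the training-set table
    "Banana":      {"Long": 400, "Sweet": 350, "Yellow": 450, "Total": 500},
    "Orange":      {"Long": 0,   "Sweet": 150, "Yellow": 300, "Total": 300},
    "Other Fruit": {"Long": 100, "Sweet": 150, "Yellow": 50,  "Total": 200},
}
n_total = 1000

def score(fruit, features=("Long", "Sweet", "Yellow")):
    c = counts[fruit]
    prior = c["Total"] / n_total
    likelihood = 1.0
    for f in features:
        likelihood *= c[f] / c["Total"]
    return likelihood * prior

for fruit in counts:
    print(fruit, score(fruit))
# Banana ≈ 0.252, Orange = 0.0, Other Fruit ≈ 0.01875
best = max(counts, key=score)  # "Banana"
```

Note how a single zero count (no long oranges) drives the Orange score to exactly 0, as discussed in the notes above.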
Naïve Bayes for Text Classification based on
Words as Features
▪ Multinomial Naïve Bayes Model: The probability of a document d belonging to a class c is proportional to the product of the probabilities of its terms t_k given the class c, and to the class prior P(c):
P(c|d) ∝ P(c) · ∏(1 ≤ k ≤ n_d) P(t_k | c)
▪ The best class c for a document d is found by selecting the class for which the maximum a posteriori (MAP) probability is maximal:
c_map = argmax_c P(c) · ∏(1 ≤ k ≤ n_d) P(t_k | c)
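A minimal sketch of a multinomial Naïve Bayes text classifier, assuming a tiny made-up training set. Two standard tricks not shown on the slide: log probabilities avoid numeric underflow, and add-one (Laplace) smoothing avoids zero term probabilities:

```python
import math
from collections import Counter

# Tiny hypothetical training set: (document, class) pairs.
train = [("good great fun", "pos"), ("great acting", "pos"),
         ("boring bad plot", "neg"), ("bad bad acting", "neg")]

class_docs = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_docs}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def predict(text):
    scores = {}
    for c in class_docs:
        # log P(c) + sum over tokens of log P(t_k | c), with add-one smoothing
        log_p = math.log(class_docs[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            log_p += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(predict("great fun"))   # pos
print(predict("bad plot"))    # neg
```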
Summary on Naïve Bayes
▪ Bayesian methods provide the basis for probabilistic learning methods
▪ Bayesian methods can be used to determine the most probable
hypothesis given the data
▪ No training, just probability calculation
▪ Binary, numeric and nominal features can be mixed
▪ Naïve Bayes fails if the independence assumption is violated too much
▪ Especially identical or highly overlapping features pose a problem that has to be
addressed with proper feature selection
Today's lecture
Text Classification: Introduction
Algorithms
Naive Bayes
Hidden Markov Models
Limitations of Standard classification
▪ Standard classification problem assumes
▪ individual cases are disconnected and independent
▪ Many NLP problems do not satisfy this assumption
▪ involve making many connected decisions, each resolving a different ambiguity,
but which are mutually dependent
▪ More sophisticated learning and inference techniques are needed
Sequence Labeling
▪ Many NLP problems can be viewed as sequence labeling
▪ Each token in a sequence is assigned a label
▪ Labels of tokens are dependent on the labels of other tokens in the
sequence, particularly their neighbors
▪ Examples:
▪ Part Of Speech Tagging
In: John saw the saw and decided to take it to the table.
Out: PN V Det N Conj V Part V Pro Prep Det N
▪ Named entity recognition: people organizations places
Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas.
Use case: Part of Speech (POS) Tagging
▪ Given a sentence X, predict its part of speech sequence Y
▪ A type of “structured” prediction
▪ How can we do this?
Use case: Part of Speech (POS) Tagging
Possible Answers
▪ Sequence labeling as classification:
▪ Pointwise prediction: predict each word individually with a classifier
▪ Generative sequence models: e.g. Hidden Markov Model (HMM)
▪ Later in the lecture: Neural Sequence Models (RNN / LSTM)
Sequence Labeling as Classification
▪ Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
John saw the saw and decided to take it to the table.
classifier
PN
Slides from: Raymond J. Mooney
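A sketch of how a sliding window turns each token into one classification instance. The feature names are made up for illustration; each resulting dictionary would be fed to any standard classifier:

```python
# Sliding-window feature extraction for pointwise sequence labeling:
# each token becomes one instance described by itself and its neighbors.
def window_features(tokens, i, size=1):
    feats = {"word": tokens[i].lower()}
    for d in range(1, size + 1):
        # sentence-boundary placeholders when the window leaves the sentence
        feats[f"word-{d}"] = tokens[i - d].lower() if i - d >= 0 else "<s>"
        feats[f"word+{d}"] = tokens[i + d].lower() if i + d < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw".split()
print(window_features(tokens, 3))
# {'word': 'saw', 'word-1': 'the', 'word+1': '</s>'}
```

The classifier sees "the" to the left of the second "saw", which is exactly the kind of evidence that disambiguates it as a noun.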
Sequence Labeling as Classification
Using Outputs as Inputs
▪ Better input features are usually the categories of the surrounding tokens, but these are not available yet.
▪ We can use the category of either the preceding or the succeeding tokens by going forward or backward through the sequence and reusing the previous outputs.
Forward Classification
John saw the saw and decided to take it to the table.
classifier
Forward Classification
PN V Det N Conj V Part V Pro Prep Det
John saw the saw and decided to take it to the table.
classifier
Backward Classification
▪ Disambiguating “to” in this case would be even easier backward.
John saw the saw and decided to take it to the table.
classifier
Backward Classification
▪ Disambiguating “to” in this case would be even easier backward.
▪ Preposition? ("I am heading to the store")
▪ Particle? (in front of infinitive verbs, like "I want to eat. I am going to leave.")
V Pro Prep Det N
John saw the saw and decided to take it to the table.
classifier
Part
Backward Classification
▪ Disambiguating “to” in this case would be even easier backward.
V Det V Conj V Part V Pro Prep Det N
John saw the saw and decided to take it to the table.
classifier
PN
Sequence Labeling as Classification
Problems
▪ Not easy to integrate information from category of tokens on both sides
▪ Difficult to propagate uncertainty between decisions and “collectively”
determine the most likely joint assignment of categories to all of the
tokens in a sequence.
Probabilistic Sequence Models
▪ Probabilistic sequence models allow
▪ integrating uncertainty over multiple, interdependent classifications
▪ and collectively determine the most likely global assignment.
▪ Generative sequence models: e.g. Hidden Markov Model (HMM)
▪ Later in the lecture: Neural Sequence Models (RNN / LSTM)
Intuition of HMM decoding: PoS Tagging
▪ Choose the tag sequence t_1 … t_n that is most probable given the observation sequence of n words w_1 … w_n:
t̂_(1:n) = argmax_(t_1…t_n) P(t_1…t_n | w_1…w_n)
▪ Bayes' Rule:
t̂_(1:n) = argmax_(t_1…t_n) P(w_1…w_n | t_1…t_n) · P(t_1…t_n) / P(w_1…w_n)
▪ Drop the denominator:
t̂_(1:n) = argmax_(t_1…t_n) P(w_1…w_n | t_1…t_n) · P(t_1…t_n)
▪ Assumption 1: a word's appearance depends only on its own tag: P(w_1…w_n | t_1…t_n) ≈ ∏_i P(w_i | t_i)
▪ Assumption 2: the probability of a tag depends only on the previous tag: P(t_1…t_n) ≈ ∏_i P(t_i | t_(i−1))
Hidden Markov Model
A Hidden Markov Model is a statistical model of hidden, stochastic state transitions
with observable, stochastic output. Key features:
▪ A fixed set of states
▪ At each "time", the hidden Markov model is in exactly one of these states
▪ State transition probabilities
▪ The starting state can be fixed or probabilistic
▪ A fixed set of possible outputs
▪ For each state: a distribution of probabilities for every possible output
▪ Also called emission probabilities
Task:
For an observed output sequence, what is the (hidden) state sequence that has the highest probability of producing this output?
Hidden Markov Model - Example
Every day, Darth Vader is in one of three moods: Good, Neutral or Bad
But, because he wears his mask, we cannot observe it!
(image from pixabay.com, CC0 Creative Commons licence)
Hidden Markov Model - Example
Somehow, you know the odds how his mood changes from day to day:
(Diagram: transition probabilities between the three moods)
Hidden Markov Model - Example
You also know the chances for his mood on day 1:
(Diagram: probabilities for the mood on day 1, plus the mood-transition probabilities)
Hidden Markov Model - Example
What we CAN observe is if Darth Vader destroys a planet or not, which depends on
his mood!
(Diagram: the transition graph extended with per-mood emission probabilities for destroying / not destroying a planet)
Hidden Markov Model - Example
We observe that he does not destroy a planet on the first day, but destroys a planet on each of the second and third days:
Question:
What is the most probable sequence of his mood on these three days?
Hidden Markov Model - Example
What is the probability for this mood
sequence and this observation:
Hidden Markov Model - Example
What is the probability for this mood
sequence and this observation:
First day:
Transition probability (from start state S): 0.3
Emission probability: 0.1
Joint probability: 0.3 * 0.1 = 0.03
Hidden Markov Model - Example
What is the probability for this mood
sequence and this observation:
Second day:
Transition probability: 0.5
Emission probability: 0.7
Joint probability: 0.5 * 0.7 = 0.35
Probability for sequence: 0.03 (day one) * 0.35 (day two) = 0.0105
Hidden Markov Model - Example
What is the probability for this mood
sequence and this observation:
Third day:
Transition probability: 0.4
Emission probability: 0.9
Joint probability: 0.4 * 0.9 = 0.36
Probability for sequence: 0.03 * 0.35 * 0.36 = 0.00378
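The three-day computation can be checked in code: the joint probability of a state sequence and an observation sequence is the product of per-step transition × emission probabilities. The mood labels below are an assumption consistent with the slide's numbers (start 0.3, emissions 0.1 / 0.7 / 0.9, transitions 0.5 and 0.4):

```python
# Joint probability of a state sequence and an observation sequence
# in an HMM: product over steps of transition * emission probability.
def joint_probability(states, observations, start_p, trans_p, emit_p):
    p = start_p[states[0]] * emit_p[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= trans_p[prev][cur] * emit_p[cur][obs]
    return p

# Per-step numbers from the slides; the mood names are an assumption.
start_p = {"Bad": 0.3}
trans_p = {"Bad": {"Neutral": 0.5}, "Neutral": {"Bad": 0.4}}
emit_p = {"Bad": {"no destroy": 0.1, "destroy": 0.9},
          "Neutral": {"destroy": 0.7}}

p = joint_probability(["Bad", "Neutral", "Bad"],
                      ["no destroy", "destroy", "destroy"],
                      start_p, trans_p, emit_p)
print(p)  # ≈ 0.00378, matching 0.03 * 0.35 * 0.36
```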
Hidden Markov Model – Application to NLP
In our POS tagging example, we know the sequence of words, and we want to
know the sequence of POS tags!
▪ (hidden) States: POS tags t1, t2, t3, t4
▪ (observable) Outputs: tokens w1, w2, w3, w4
▪ We also need
▪ Transition probabilities between states
▪ Also: initial probabilities = probabilities for the tag of the first token of a sentence
▪ Emission probabilities for every state
How would our graph from the example look like?
Hidden Markov Model – Application to NLP
(Diagram: an HMM with states N, V, DT, transition probabilities between them, and per-state emission probabilities for the words "Tom", "saw", "the")
How do we get to the transition and emission probabilities?
Hidden Markov Model – Application to NLP:
Estimating the Probabilities
▪ The probabilities are estimated just by counting on a tagged training
corpus
▪ Transition probability: how often the first tag is followed by the second, divided by the number of times the first tag was seen in the labeled corpus
▪ Emission probability: the number of times the word was associated with the tag in the labeled corpus, divided by the number of times the tag was seen in the labeled corpus
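These counts are straightforward to compute; a sketch on a tiny hypothetical tagged corpus:

```python
from collections import Counter

# Estimating HMM probabilities by counting on a tagged corpus:
#   transition P(t2 | t1) = count(t1 t2) / count(t1)
#   emission   P(w | t)   = count(w tagged t) / count(t)
# The one-sentence corpus below is made up for illustration.
corpus = [[("the", "DT"), ("man", "N"), ("saw", "V"), ("the", "DT"), ("saw", "N")]]

tag_count = Counter()
bigram_count = Counter()
emit_count = Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    tag_count.update(tags)
    bigram_count.update(zip(tags, tags[1:]))   # adjacent tag pairs
    emit_count.update(sent)                    # (word, tag) pairs

def transition(t1, t2):
    return bigram_count[(t1, t2)] / tag_count[t1]

def emission(word, tag):
    return emit_count[(word, tag)] / tag_count[tag]

print(transition("N", "V"))   # N followed by V once, N seen twice -> 0.5
print(emission("saw", "N"))   # "saw" tagged N once, N seen twice -> 0.5
```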
Example
Input sentence: Janet will back the bill
Correct PoS tags: Janet/NNP will/MD back/VB the/DT bill/NN
Transition probabilities based on WSJ corpus
Rows are labeled with the conditioning event; thus P(VB|MD) is 0.7968.
Given the observation (output) likelihoods
Hidden Markov Model – Complexity
Question: What is the most likely state sequence given an output sequence?
▪ Naïve solution:
▪ brute force search by enumerating all possible sequences of states
▪ Complexity O(s^m)
▪ where m is the length of the input and s is the number of states in the model.
▪ Better solution: Dynamic Programming!
▪ Standard procedure is called the Viterbi algorithm
▪ Running time is O(m·s²),
▪ where m is the length of the input and s is the number of states in the model.
Viterbi algorithm: Motivation
▪ Let o_1, o_2, …, o_T be some input sentence
▪ Let s_1, s_2, …, s_T be a sequence of tags
▪ Goal: the best sequence of tags given the input sequence
▪ argmax over s_1, s_2, …, s_T of P(o_1, …, o_T, s_1, s_2, …, s_T)
▪ Example
▪ Input sentence: The, man, saw, the, saw
▪ Tag set = {Det, N, V}
▪ Possible tag sequences:
Det Det Det Det Det -> 0.000003
Det Det Det Det N -> 0.000008
Det Det Det Det V -> 0.000009
…
Det N V Det N -> 0.012
we have 3^5 (= 243) sequences: sentence length = 5, #tags = 3
Viterbi algorithm: Basic idea
Let us say we have only two possible states, A and B, and some observation o
What is the best possible state sequence of length 5 for this observation o?
It is either:
▪ the best possible sequence of length 4 that ends with A, followed by A
▪ the best possible sequence of length 4 that ends with A, followed by B
▪ the best possible sequence of length 4 that ends with B, followed by A
▪ the best possible sequence of length 4 that ends with B, followed by B
This is only true because the next state only depends on the state directly before!
So, what is the best possible sequence of length 4 that ends with A? It is either:
▪ the best possible sequence of length 3 that ends with A, followed by A
▪ the best possible sequence of length 3 that ends with B, followed by A
▪ …
Viterbi algorithm: Example
Emission probabilities Transition probabilities
P (the | DT) = 0.5 P (DT | DT)= 0.1 P (DT | N) = 0.2
P (man | V) = 0.1 P (V | DT) = 0.3 P (V | N) = 0.6
P (man | N) = 0.3 P (N | DT) = 0.6 P (N | N) = 0.2
P (saw | V) = 0.2 P (DT | V) = 0.5 P (DT | q0) = 0.6
P (saw | N) = 0.2 P (V | V) = 0.2 P (V | q0) = 0.3
P (N | V) = 0.3 P (N | q0) = 0.1
The man saw the saw $$
DT
Viterbi algorithm: Example
The man saw the saw $$
DT 0.6 * 0.5 = 0.3
Viterbi algorithm: Example
The man saw the saw $$
DT 0.3
V 0
N 0
Viterbi algorithm: Example
The man
DT 0.3
V 0 → ? = max {
P(man | V) * P(V | DT) * 0.3,
P(man | V) * P(V | V) * 0,
P(man | V) * P(V | N) * 0
}
N 0
Substituting the probabilities:
V = max {
    0.1 * 0.3 * 0.3,
    0.1 * 0.2 * 0,
    0.1 * 0.6 * 0
} = 0.009
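The recursion cell above can be computed directly. A sketch with hypothetical variable names, using the numbers from the slide:

```python
# One recursion cell of Viterbi: the value of state V at token "man",
#   delta_2(V) = max over s' of P(man | V) * P(V | s') * delta_1(s')
p_man_given_V = 0.1                              # P(man | V)
p_V_given = {"DT": 0.3, "V": 0.2, "N": 0.6}      # P(V | s')
delta_prev = {"DT": 0.3, "V": 0.0, "N": 0.0}     # trellis column for "The"

candidates = {s_prev: p_man_given_V * p_V_given[s_prev] * delta_prev[s_prev]
              for s_prev in delta_prev}
best_prev = max(candidates, key=candidates.get)  # becomes the backpointer
delta_man_V = candidates[best_prev]
print(best_prev)    # DT
print(delta_man_V)  # ≈ 0.009
```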
The maximum comes from predecessor DT, so the cell stores the value together with a backpointer:
V = 0.009, backpointer DT (predecessor of the most probable path)
The complete trellis column for "man":
DT = 0    V = 0.009 (backpointer DT)    N = 0.054 (backpointer DT)
Recursion for "saw", state V:
V = max {
    P(saw | V) * P(V | DT) * 0,
    P(saw | V) * P(V | V) * 0.009,
    P(saw | V) * P(V | N) * 0.054
}
Substituting the probabilities:
V = max {
    0.2 * 0.3 * 0        ( = 0),
    0.2 * 0.2 * 0.009    ( = 0.00036),
    0.2 * 0.6 * 0.054    ( = 0.00648)
} = 0.00648 = 6.48 * 10^-3, backpointer N
Filling in the remaining columns gives the completed trellis (value, backpointer in parentheses):

       The    man           saw               the                saw
DT     0.3    0             0                 1.62 * 10^-3 (V)   0
V      0      0.009 (DT)    6.48 * 10^-3 (N)  0                  9.72 * 10^-5 (DT)
N      0      0.054 (DT)    2.16 * 10^-3 (N)  0                  1.94 * 10^-4 (DT)
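The whole trellis can be reproduced with a compact implementation. This is a sketch (function and variable names are my own); the probabilities are the emission and transition tables from the slides:

```python
# Viterbi with backpointers, reproducing the trellis of the example.
def viterbi(words, states, start_p, trans_p, emit_p):
    # Initialization: delta_1(s) = P(s | q0) * P(w1 | s)
    trellis = [{s: (start_p[s] * emit_p.get((s, words[0]), 0.0), None)
                for s in states}]
    # Recursion: delta_t(s) = max_s' P(w_t | s) * P(s | s') * delta_{t-1}(s')
    for word in words[1:]:
        prev, column = trellis[-1], {}
        for s in states:
            best_prev, best_val = None, 0.0
            for s_prev in states:
                val = (emit_p.get((s, word), 0.0)
                       * trans_p.get((s_prev, s), 0.0)
                       * prev[s_prev][0])
                if val > best_val:
                    best_prev, best_val = s_prev, val
            column[s] = (best_val, best_prev)
        trellis.append(column)
    # Termination: best final state, then follow backpointers to the start
    last = max(states, key=lambda s: trellis[-1][s][0])
    path = [last]
    for column in reversed(trellis[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path)), trellis[-1][last][0]

emit_p = {("DT", "the"): 0.5, ("V", "man"): 0.1, ("N", "man"): 0.3,
          ("V", "saw"): 0.2, ("N", "saw"): 0.2}
trans_p = {("DT", "DT"): 0.1, ("DT", "V"): 0.3, ("DT", "N"): 0.6,
           ("V", "DT"): 0.5,  ("V", "V"): 0.2,  ("V", "N"): 0.3,
           ("N", "DT"): 0.2,  ("N", "V"): 0.6,  ("N", "N"): 0.2}
start_p = {"DT": 0.6, "V": 0.3, "N": 0.1}

path, prob = viterbi(["the", "man", "saw", "the", "saw"],
                     ["DT", "V", "N"], start_p, trans_p, emit_p)
print(path)  # ['DT', 'N', 'V', 'DT', 'N']
print(prob)  # ≈ 1.94 * 10^-4
```

Note that real implementations sum log probabilities instead of multiplying raw probabilities, since the products underflow quickly on longer sequences.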
Backtracing starts at the best cell in the final column: state N with 1.94 * 10^-4, so the last "saw" is tagged N.
That cell's backpointer is DT, so "the" is tagged DT.
The DT cell for "the" points back to V, so the first "saw" is tagged V.
Following the backpointers to the start yields the most probable tag sequence:
The/DT man/N saw/V the/DT saw/N
Remember the complexity of the Viterbi algorithm?
Running time is O(m * s^2),
▪ where m is the length of the input and s is the number of states in the model.
Now we see why: for every token (m) we have to evaluate every POS tag (s) in combination with every possible predecessor POS tag (s), so we have m * s * s = m * s^2 operations.
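The operation count can be checked by instrumenting the loops. A sketch: the recursion scores s * s (state, predecessor) pairs for each of the m - 1 tokens after the first (initialization adds another s operations), which is O(m * s^2) overall:

```python
# Counting the core scoring operations of the Viterbi recursion.
m, s = 5, 3                            # 5 tokens, 3 POS states (DT, V, N)
operations = 0
for token in range(1, m):              # recursion over tokens 2..m
    for state in range(s):             # every candidate state ...
        for prev_state in range(s):    # ... against every predecessor
            operations += 1
print(operations)  # (m - 1) * s**2 = 36
```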
Summary
▪ Sequence Labeling:
▪ Input and output are sequences of signals
▪ Signals are not classified individually; instead, a joint classification minimizes an overall cost
▪ Hidden Markov Models
▪ Emissions can be observed
▪ States are hidden
▪ Goal: Find most probable state sequence for a given emission sequence
▪ Solve via Viterbi (dynamic programming)
Next Lecture
Information Retrieval
Introduction