Module 3


20CB913 ML Module 3

Hidden Markov Models (HMM): HMM with forward-backward and Viterbi algorithms, Sequence classification
using HMM, Conditional random fields, Applications of sequence classification such as part-of-speech tagging.
Regression: Multi-variable regression, Model evaluation, Least squares regression, Regularization, LASSO,
Applications of regression. Association rule mining algorithms including apriori, Expectation-Maximization
(EM) algorithm for unsupervised learning, Clustering: average linkage, Ward’s algorithm, Minimum spanning
tree clustering, K-nearest neighbors clustering, BIRCH, CURE, DBSCAN, Anomaly and outlier detection
methods.

Sequence classification using HMM

Hidden Markov Models (HMMs) can be used for sequence classification tasks, where the goal is to assign a
label or class to a sequence of observations based on the underlying hidden structure represented by the HMM.
This is commonly referred to as "sequence labeling" or "sequence classification." Here's how you can use
HMMs for sequence classification:

1. Data Preparation:

- Collect and preprocess your labeled training data. Each sequence in your training data should be associated
with a known label or class. For example, this could be part-of-speech tags for words in a sentence, phonemes
in a speech signal, or any other relevant sequence data.

2. Model Definition:

- Define your HMM model with the following components:

- Hidden States (e.g., states corresponding to different classes or labels).

- Emission Probabilities: Define the emission probabilities \(B\) where \(B[i][j]\) represents the probability
of observing the j-th symbol when the model is in state \(S_i\).

- Transition Probabilities: Define the transition probabilities \(A\) where \(A[i][j]\) represents the probability
of transitioning from state \(S_i\) to state \(S_j\).

3. Training:

- Train your HMM model using your labeled training data. This typically involves estimating the parameters
(transition and emission probabilities) of the HMM. Since the training sequences are labeled, the parameters can
be estimated directly from counts (maximum likelihood); the Baum-Welch algorithm is used when the state
sequences are unobserved.

4. Classification:

- Given an unlabeled sequence that you want to classify, you can use the trained HMM model to classify it
into one of the predefined classes or labels.
- To classify a sequence:

- Calculate the likelihood of the sequence for each class or label using the Forward algorithm.

- Assign the sequence to the class or label with the highest likelihood.

5. Testing and Evaluation:

- Evaluate the performance of your sequence classification model on a separate test dataset to assess its
accuracy, precision, recall, F1-score, or any other relevant evaluation metric.

Here's a simplified example of using an HMM for sequence classification, specifically for part-of-speech
tagging in natural language processing:

Data: You have a corpus of sentences with each word labeled with its part-of-speech tag (e.g., noun, verb,
adjective).

Model: Define an HMM where each hidden state corresponds to a part-of-speech tag, and the emission
probabilities represent the likelihood of observing a word given its part-of-speech tag. Transition probabilities
capture the likelihood of transitioning from one part-of-speech tag to another.

Training: Train the HMM on the labeled data to estimate the emission and transition probabilities.

Classification: Given a new sentence with words but without part-of-speech tags, use the trained HMM to
assign part-of-speech tags to each word in the sentence based on the sequence of observations (words).

HMMs can be adapted to various sequence classification tasks by appropriately defining the states, emission
probabilities, and transition probabilities to capture the underlying structure of the data. They are particularly
useful when dealing with sequential data where the order of observations matters.
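
To make the classification step above concrete, here is a minimal sketch of scoring an unlabeled sequence under per-class HMMs with the forward algorithm and picking the class with the highest likelihood. The two class models, their parameters, and the observation sequence are invented for illustration; in practice each HMM would first be trained on sequences from its own class.

import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: total probability of obs under the HMM (pi, A, B)."""
    alpha = pi * B[:, obs[0]]                 # initialize with the first observation
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]    # propagate states, weight by emission
    return alpha.sum()

# Two hypothetical class-specific HMMs, each with 2 hidden states and
# 3 observation symbols (all numbers below are illustrative only).
class_models = {
    "class_A": (np.array([0.6, 0.4]),
                np.array([[0.7, 0.3], [0.4, 0.6]]),
                np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])),
    "class_B": (np.array([0.5, 0.5]),
                np.array([[0.9, 0.1], [0.2, 0.8]]),
                np.array([[0.2, 0.2, 0.6], [0.6, 0.3, 0.1]])),
}

sequence = [0, 1, 2, 2, 1]   # an unlabeled observation sequence (symbol indices)

scores = {c: forward_likelihood(pi, A, B, sequence)
          for c, (pi, A, B) in class_models.items()}
print(scores)
print("predicted class:", max(scores, key=scores.get))

In practice, log probabilities are used instead of raw products to avoid numerical underflow on long sequences.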

Conditional random fields

Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are both probabilistic graphical
models used for sequence labeling tasks, but they have different modeling assumptions and applications. Here,
I'll provide an example of how each model can be applied to sequence labeling, and then discuss their
differences.

Hidden Markov Model (HMM) Example

Suppose you have a simple weather forecasting problem. You want to predict whether the weather is "Sunny"
or "Rainy" on a daily basis, given only the observations of whether your friend carries an umbrella
("Umbrella") or not. In this case, you have:

- Hidden States: {Sunny, Rainy}

- Observations: {Umbrella, No Umbrella}

- Transition Probabilities (HMM):

- P(Sunny to Sunny) = 0.7


- P(Sunny to Rainy) = 0.3

- P(Rainy to Rainy) = 0.6

- P(Rainy to Sunny) = 0.4

- Emission Probabilities (HMM):

- P(Umbrella | Sunny) = 0.2

- P(No Umbrella | Sunny) = 0.8

- P(Umbrella | Rainy) = 0.9

- P(No Umbrella | Rainy) = 0.1

You can use the Viterbi algorithm to find the most likely sequence of hidden states (weather conditions) given a
sequence of observations (whether your friend carries an umbrella), or the Forward-Backward algorithm to
compute the posterior probability of each state at each time step.
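
As a small illustration (not a definitive implementation), the sketch below runs the Viterbi algorithm on this weather model using the probabilities listed above. An initial state distribution of 0.5/0.5 is assumed, since the example does not specify one.

import numpy as np

states = ["Sunny", "Rainy"]
symbol_index = {"Umbrella": 0, "No Umbrella": 1}

pi = np.array([0.5, 0.5])              # assumed uniform start probabilities
A = np.array([[0.7, 0.3],              # transitions: from Sunny, from Rainy
              [0.4, 0.6]])
B = np.array([[0.2, 0.8],              # emissions: P(Umbrella|state), P(No Umbrella|state)
              [0.9, 0.1]])

observations = ["Umbrella", "Umbrella", "No Umbrella"]
obs = [symbol_index[o] for o in observations]

# Viterbi: track the best path probability to each state, plus backpointers.
delta = pi * B[:, obs[0]]
backpointers = []
for symbol in obs[1:]:
    scores = delta[:, None] * A            # scores[i, j] = delta[i] * A[i, j]
    backpointers.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) * B[:, symbol]

# Trace back from the most probable final state.
path = [int(delta.argmax())]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
path.reverse()
print([states[s] for s in path])           # ['Rainy', 'Rainy', 'Sunny'] for these observations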

Conditional Random Fields (CRFs) Example:

Let's consider a more complex named entity recognition (NER) problem in natural language processing. Given
a sentence, you want to label each word as "Person," "Location," or "Other." In this case, you have:

Features: For each word in the sentence, you extract various features such as the word itself, its prefix, suffix,
part-of-speech tag, etc.

Your CRF model might look like this:

Label Set: {Person, Location, Other}

Feature Set: {Word, Prefix, Suffix, POS Tag, ...}

Feature Functions (CRF): These functions capture the relationship between features and labels. For example:

Φ(Word=John, Person) = 1

Φ(POS Tag=NNP, Person) = 1

Φ(Word=New York, Location) = 1

CRFs model the conditional probability P(Labels | Features), taking into account the entire sequence of
observations (sentence) and the dependencies between labels.
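
As a rough sketch of how such a CRF could be set up in practice, the snippet below uses the third-party sklearn-crfsuite package (assumed to be installed); the sentences, feature functions, and labels are invented for illustration and far too small to train a useful model.

import sklearn_crfsuite   # assumes: pip install sklearn-crfsuite

def word_features(sentence, i):
    """Feature dictionary for the i-th word, mirroring the feature set above."""
    word = sentence[i]
    return {
        "word": word,
        "prefix": word[:3],
        "suffix": word[-3:],
        "is_capitalized": str(word[0].isupper()),
        "prev_word": sentence[i - 1] if i > 0 else "<S>",
    }

# Tiny invented training set: parallel lists of tokens and labels.
sentences = [["John", "visited", "New", "York"],
             ["Mary", "lives", "in", "Paris"]]
labels = [["Person", "Other", "Location", "Location"],
          ["Person", "Other", "Other", "Location"]]

X_train = [[word_features(s, i) for i in range(len(s))] for s in sentences]
y_train = labels

# Linear-chain CRF trained with L-BFGS and L1/L2 regularization.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

test = [["John", "moved", "to", "Paris"]]
X_test = [[word_features(s, i) for i in range(len(s))] for s in test]
print(crf.predict(X_test))   # predicted label sequence for the test sentence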

Differences:

1. Model Assumptions:

HMMs assume that the observations are generated by a hidden Markov process with certain transition and
emission probabilities.

CRFs model the conditional probability of labels given observations directly, considering dependencies
between neighboring labels.

2. Dependency Structure:

HMMs have a first-order Markov property, meaning that the probability of a state depends only on the
previous state.

CRFs can model more complex dependencies between labels, considering the entire observation sequence.

3. Applications:

HMMs are often used in sequential data modeling, such as speech recognition, where the focus is on hidden
states and their transitions.

CRFs are commonly used in NER, part-of-speech tagging, and other sequence labeling tasks that require
capturing complex label dependencies.

In summary, HMMs and CRFs are both valuable for sequence modeling, but CRFs are more flexible in
modeling label dependencies, making them particularly suitable for structured prediction tasks like named
entity recognition. HMMs, on the other hand, are better suited for tasks where the focus is primarily on
modeling hidden states and transitions.

Applications of sequence classification such as part-of-speech tagging

Sequence classification, such as part-of-speech tagging, is a fundamental task in natural language processing
and other fields involving sequential data. Hidden Markov Models (HMMs) have been widely used for such
tasks due to their ability to model sequences and their inherent probabilistic nature. Here are some applications
of sequence classification using HMMs:

1. Part-of-Speech Tagging (POS Tagging):

In POS tagging, each word in a sentence is assigned a part-of-speech category (e.g., noun, verb, adjective)
based on its context and position in the sentence. HMMs are used to capture the sequential dependencies
between words and their corresponding part-of-speech tags.

2. Speech Recognition:

HMMs have been a cornerstone in automatic speech recognition (ASR) systems. They model the phonetic
and acoustic properties of speech and are used to transcribe spoken language into text. Each phoneme or
subword unit is treated as a hidden state, and the observed acoustic features are the emissions.

3. Gene Prediction in Bioinformatics:

In genomics, HMMs are applied to predict gene structures in DNA sequences. Hidden states represent various
parts of a gene, such as exons, introns, and coding regions, while emissions correspond to nucleotide sequences.

4. Part-of-Speech Tagging in Speech Synthesis:


In text-to-speech (TTS) systems, HMMs are used to predict the prosody and phonetic characteristics of each
word in a sentence, helping generate natural-sounding speech.

5. Gesture Recognition:

In gesture recognition applications, HMMs can be used to classify sequences of hand or body movements into
specific gestures or actions.

6. Named Entity Recognition (NER):

NER involves identifying and classifying entities (e.g., names of persons, organizations, locations) in text.
HMMs can be used to model the sequence of words in a sentence and assign labels to identified entities.

7. Handwriting Recognition:

HMMs have been applied to recognize handwritten text or characters in the context of optical character
recognition (OCR) systems.

8. Anomaly Detection in Time Series Data:

HMMs can be used for anomaly detection in time series data. Hidden states represent different system states,
and emissions represent observed measurements. Anomalies are detected when the observed sequence deviates
from the expected state transitions and emissions.

9. Sentiment Analysis:

In sentiment analysis of text, HMMs can be used to model the sequence of words and classify whether the
sentiment of a text is positive, negative, or neutral.

10. Gesture-Based Human-Computer Interaction:

HMMs are used to recognize and interpret sequences of hand or body movements as input commands for
human-computer interaction in virtual reality or gaming applications.

11. Segmentation and Labeling in Natural Language Processing:

HMMs are used for tasks like word segmentation in languages without explicit word boundaries and for
labeling spans of text with semantic tags.

These are just a few examples of the many applications of sequence classification using Hidden Markov
Models. HMMs are versatile models that can be adapted to various tasks involving sequential data, making
them a valuable tool in fields ranging from linguistics to bioinformatics to human-computer interaction.

Part of Speech (POS) tagging with Hidden Markov Model

What is Part of Speech (POS) tagging?
Back in elementary school, we learned the differences between the various parts of speech, such as nouns, verbs,
adjectives, and adverbs. Associating each word in a sentence with a proper POS (part of speech) tag is known as POS
tagging or POS annotation. POS tags are also known as word classes, morphological classes, or lexical tags.

In the past, POS annotation was done manually by human annotators, but because it is such a laborious task, today we
have automatic tools capable of tagging each word with an appropriate POS tag within its context.

POS tags give a large amount of information about a word and its neighbors. They are used in various tasks such as
information retrieval, parsing, Text-to-Speech (TTS) applications, information extraction, and corpus-based linguistic
research. They also serve as an intermediate step for higher-level NLP tasks such as parsing, semantic analysis, and
translation, which makes POS tagging a necessary function for advanced NLP applications.
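
As one illustration of such automatic tools, the short sketch below uses NLTK's off-the-shelf tagger. It assumes the nltk package is installed and that the tokenizer and tagger resources can be downloaded (resource names can vary slightly between NLTK versions); the output shown in the comment is only indicative.

import nltk

# One-time downloads of the tokenizer and tagger models (requires a network connection).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Ted will spot Will")
print(nltk.pos_tag(tokens))
# expected output along the lines of: [('Ted', 'NNP'), ('will', 'MD'), ('spot', 'VB'), ('Will', 'NNP')]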

Techniques for POS tagging


1. Rule-based POS tagging: The rule-based POS tagging models apply a set of handwritten rules and use
contextual information to assign POS tags to words. These rules are often known as context frame rules. One
such rule might be: “If an ambiguous/unknown word ends with the suffix ‘ing’ and is preceded by a Verb, label it
as a Verb”.

2. Transformation Based Tagging: The transformation-based approaches use a pre-defined set of handcrafted
rules as well as automatically induced rules that are generated during training.

3. Deep learning models: Various Deep learning models have been used for POS tagging such as Meta-BiLSTM
which have shown an impressive accuracy of around 97 percent.

4. Stochastic (Probabilistic) tagging: A stochastic approach uses frequency, probability, or statistics. The
simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training
data and uses this information to tag that word in unannotated text (a minimal sketch of this baseline follows
this list). However, this approach sometimes produces tag sequences for sentences that are not acceptable
according to the grammar rules of the language. A better approach is to calculate the probabilities of the various
tag sequences that are possible for a sentence and assign the POS tags from the sequence with the highest
probability. Hidden Markov Models (HMMs) are probabilistic models used to assign POS tags in this way.
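
As promised in point 4, here is a minimal sketch of the most-frequent-tag baseline. The annotated corpus, words, and tags below are invented for illustration.

from collections import Counter, defaultdict

# Tiny invented annotated corpus of (word, tag) pairs (N = noun, M = modal, V = verb).
tagged_corpus = [("Mary", "N"), ("will", "M"), ("see", "V"), ("Spot", "N"),
                 ("Will", "N"), ("can", "M"), ("see", "V"), ("Mary", "N"),
                 ("will", "M"), ("spot", "V"), ("Mary", "N")]

# Count how often each (lower-cased) word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word.lower()][tag] += 1

def most_frequent_tag(word, default="N"):
    """Tag a word with its most frequent training tag (default for unseen words)."""
    counts = tag_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

sentence = "Mary will see Spot".split()
print([(w, most_frequent_tag(w)) for w in sentence])

Because each word is tagged in isolation, this baseline can still produce tag sequences that are ungrammatical as a whole, which is exactly the weakness the HMM approach below addresses.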

POS tagging with Hidden Markov Model


HMM (Hidden Markov Model) is a Stochastic technique for POS tagging. Hidden Markov models are known for their
applications to reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition,
musical score following, partial discharges, and bioinformatics.

Let us consider an example proposed by Dr. Luis Serrano and find out how an HMM selects an appropriate tag sequence
for a sentence.

In this example, we consider only 3 POS tags: noun, modal, and verb. Let the sentence “Ted will spot Will” be tagged as
noun, modal, verb, and noun. To calculate the probability associated with this particular sequence of tags we require
their transition probabilities and emission probabilities.

The transition probability is the likelihood of a particular tag sequence, for example, how likely it is that a noun is
followed by a modal, a modal by a verb, and a verb by a noun. This probability should be high for a particular sequence
to be correct.

Now, what is the probability that the word Ted is a noun, will is a modal, spot is a verb, and Will is a noun? These are
the emission probabilities, and they should also be high for our tagging to be likely.

Let us calculate the above two sets of probabilities for the set of sentences below:

Mary Jane can see Will

Spot will see Mary

Will Jane spot Mary?

Mary will pat Spot


Note that Mary, Jane, Spot, and Will are all names.

In the above sentences, the word Mary appears four times as a noun. To calculate the emission probabilities, let us first
create a counting table of how many times each word appears with each tag.

Words Noun Modal Verb

Mary 4 0 0

Jane 2 0 0

Will 1 3 0

Spot 2 0 1

Can 0 1 0

See 0 0 2

pat 0 0 1

Now let us divide each column by the total number of appearances of that tag. For example, ‘noun’ appears nine times
in the above sentences, so we divide each term in the noun column by 9. We get the following table after this operation.

Words Noun Modal Verb

Mary 4/9 0 0

Jane 2/9 0 0

Will 1/9 3/4 0

Spot 2/9 0 1/4

Can 0 1/4 0

See 0 0 2/4

pat 0 0 1/4

From the above table, we infer that

The probability that Mary is Noun = 4/9

The probability that Mary is Modal = 0

The probability that Will is Noun = 1/9

The probability that Will is Modal = 3/4

In a similar manner, you can figure out the rest of the probabilities. These are the emission probabilities.
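
The same counting and normalization can be done programmatically. The sketch below recomputes the emission probabilities from the four tagged sentences (tags abbreviated N, M, V; different capitalizations of the same word are merged, as in the tables above).

from collections import Counter, defaultdict

# The four tagged training sentences (N = noun, M = modal, V = verb).
tagged_sentences = [
    [("Mary", "N"), ("Jane", "N"), ("can", "M"), ("see", "V"), ("Will", "N")],
    [("Spot", "N"), ("will", "M"), ("see", "V"), ("Mary", "N")],
    [("Will", "M"), ("Jane", "N"), ("spot", "V"), ("Mary", "N")],
    [("Mary", "N"), ("will", "M"), ("pat", "V"), ("Spot", "N")],
]

# Count how often each tag emits each (lower-cased) word, and how often each tag occurs.
emission_counts = defaultdict(Counter)
tag_totals = Counter()
for sentence in tagged_sentences:
    for word, tag in sentence:
        emission_counts[tag][word.lower()] += 1
        tag_totals[tag] += 1

# Emission probability P(word | tag) = count(word, tag) / count(tag).
emission_probs = {tag: {w: c / tag_totals[tag] for w, c in words.items()}
                  for tag, words in emission_counts.items()}

print(emission_probs["N"]["mary"])   # 4/9, approx. 0.444
print(emission_probs["M"]["will"])   # 3/4 = 0.75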
Next, we have to calculate the transition probabilities, so we define two more tags, <S> and <E>. <S> is placed at
the beginning of each sentence and <E> at the end.

Let us again create a table and fill it with the co-occurrence counts of the tags.

N M V <E>

<S> 3 1 0 0

N 1 3 1 4

M 1 0 3 0

V 4 0 0 0

In the above table, we can see that the <S> tag is followed by the N tag three times, thus the first entry is 3. The
modal tag follows <S> just once, thus the second entry is 1. In a similar manner, the rest of the table is filled.

Next, we divide each term in a row of the table by the total number of co-occurrences of the tag in
consideration. For example, the modal (M) tag is followed by some other tag four times in total, thus we
divide each element in the M row by four.

N M V <E>

<S> 3/4 1/4 0 0

N 1/9 3/9 1/9 4/9

M 1/4 0 3/4 0

V 4/4 0 0 0

These are the respective transition probabilities for the above four sentences. Now how does the HMM
determine the appropriate sequence of tags for a particular sentence from the above tables? Let us find it out.

Take a new sentence and tag it with the wrong tags. Let the sentence “Will can spot Mary” be tagged as:

- Will as a modal
- Can as a verb
- Spot as a noun
- Mary as a noun
Now calculate the probability of this sequence being correct in the following manner.

The probability that the tag Modal (M) comes after the tag <S> is 1/4, as seen in the table. Also, the probability that
the word Will is a Modal is 3/4. In the same manner, we calculate each and every probability in the graph. The
product of these probabilities is the likelihood that this sequence is correct. Since the tags are not correct, the
product is zero:

1/4 * 3/4 * 3/4 * 0 * 1 * 2/9 * 1/9 * 4/9 * 4/9 = 0

When these words are correctly tagged, we get a probability greater than zero. Calculating the product of these
terms we get:

3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164
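
To verify this arithmetic, the sketch below encodes the two probability tables as dictionaries (only the entries needed for this sentence are included) and multiplies out the transition and emission terms for any proposed tagging of “Will can spot Mary”.

from fractions import Fraction as F

# Transition probabilities P(next tag | current tag) from the table above.
trans = {
    "<S>": {"N": F(3, 4), "M": F(1, 4), "V": F(0)},
    "N":   {"N": F(1, 9), "M": F(3, 9), "V": F(1, 9), "<E>": F(4, 9)},
    "M":   {"N": F(1, 4), "M": F(0),    "V": F(3, 4), "<E>": F(0)},
    "V":   {"N": F(1),    "M": F(0),    "V": F(0),    "<E>": F(0)},
}

# Emission probabilities P(word | tag) from the table above.
emit = {
    "N": {"will": F(1, 9), "can": F(0), "spot": F(2, 9), "mary": F(4, 9)},
    "M": {"will": F(3, 4), "can": F(1, 4), "spot": F(0), "mary": F(0)},
    "V": {"will": F(0), "can": F(0), "spot": F(1, 4), "mary": F(0)},
}

def sequence_probability(words, tags):
    """Product of transition and emission probabilities for a proposed tagging."""
    prob, prev = F(1), "<S>"
    for word, tag in zip(words, tags):
        prob *= trans[prev][tag] * emit[tag][word.lower()]
        prev = tag
    return prob * trans[prev]["<E>"]

words = ["Will", "can", "spot", "Mary"]
print(float(sequence_probability(words, ["M", "V", "N", "N"])))  # 0.0 (the wrong tagging)
print(float(sequence_probability(words, ["N", "M", "V", "N"])))  # approx. 0.000257 (the correct tagging)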

For our example, considering only the three POS tags we have mentioned, 81 different combinations of tags can
be formed for the four-word sentence. In this case, calculating the probabilities of all 81 combinations seems
achievable. But when the task is to tag a longer sentence and all the POS tags in the Penn Treebank project are
taken into consideration, the number of possible combinations grows exponentially and exhaustive enumeration
becomes impractical. Now let us visualize these 81 combinations as paths in a trellis and, using the transition
and emission probabilities, mark each vertex and edge.

The next step is to delete all the vertices and edges with probability zero; the vertices which do not lead to
the endpoint are also removed.

Now there are only two paths that lead to the end. Let us calculate the probability associated with each path:

<S>→N→M→N→N→<E> = 3/4*1/9*3/9*1/4*1/4*2/9*1/9*4/9*4/9 = 0.00000846754

<S>→N→M→N→V→<E> = 3/4*1/9*3/9*1/4*3/4*1/4*1*4/9*4/9 = 0.00025720164

Clearly, the probability of the second sequence is much higher and hence the HMM is going to tag each word in
the sentence according to this sequence.
In the previous section, we optimized the HMM and brought our calculations down from 81 paths to just two. Now
we are going to optimize further by using the Viterbi algorithm. Let us use the same example as before and apply
the Viterbi algorithm to it.

Consider a vertex in this trellis with two incoming mini-paths, each with its own probability. We keep only the
mini-path with the higher probability and discard the other. The same procedure is done for all the states in the
graph.

Once the probabilities of all paths leading to a node are calculated, we remove the edge (path) with the lower
probability. You may also notice some nodes with a probability of zero; such nodes have no edges attached to
them, because all paths through them have zero probability.

To get the optimal path, we start from the end and trace backward. Since each remaining state has only one
incoming edge, this gives us a single path.

As you may have noticed, this algorithm returns only one path, as compared to the previous method which
retained two paths. Thus, by using the Viterbi algorithm, we save a great deal of computation.
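
For completeness, here is a sketch of the Viterbi recursion itself, reusing the same probability tables as the previous snippet. For each word and tag it keeps only the highest-probability incoming path, then backtracks to recover the best tag sequence.

from fractions import Fraction as F

TAGS = ["N", "M", "V"]

# Transition and emission tables from above (only entries needed for this sentence).
trans = {
    "<S>": {"N": F(3, 4), "M": F(1, 4), "V": F(0)},
    "N":   {"N": F(1, 9), "M": F(3, 9), "V": F(1, 9), "<E>": F(4, 9)},
    "M":   {"N": F(1, 4), "M": F(0),    "V": F(3, 4), "<E>": F(0)},
    "V":   {"N": F(1),    "M": F(0),    "V": F(0),    "<E>": F(0)},
}
emit = {
    "N": {"will": F(1, 9), "can": F(0), "spot": F(2, 9), "mary": F(4, 9)},
    "M": {"will": F(3, 4), "can": F(1, 4), "spot": F(0), "mary": F(0)},
    "V": {"will": F(0), "can": F(0), "spot": F(1, 4), "mary": F(0)},
}

def viterbi(words):
    """Return the most probable tag sequence and its probability."""
    words = [w.lower() for w in words]
    # best[t][tag] = (probability of the best partial path ending in tag, backpointer)
    best = [{tag: (trans["<S>"][tag] * emit[tag][words[0]], None) for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            # Keep only the highest-probability incoming path for this tag.
            prev_tag, score = max(((p, best[-1][p][0] * trans[p][tag]) for p in TAGS),
                                  key=lambda x: x[1])
            column[tag] = (score * emit[tag][word], prev_tag)
        best.append(column)
    # Close the sequence with the <E> transition, then backtrack.
    last = max(TAGS, key=lambda t: best[-1][t][0] * trans[t]["<E>"])
    prob = best[-1][last][0] * trans[last]["<E>"]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob

tags, prob = viterbi(["Will", "can", "spot", "Mary"])
print(tags, float(prob))   # ['N', 'M', 'V', 'N'], approx. 0.000257 (same answer as the enumeration above)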

After applying the Viterbi algorithm, the model tags the sentence as follows:

- Will as a noun
- Can as a modal
- Spot as a verb
- Mary as a noun

These are the right tags, so we conclude that the model can successfully tag the words with their appropriate
POS tags.
