Natural Language Processing

Unit 5 Semantic Relations


UNL, Towards Dependency Parsing, Universal Networking Language, Semantic Role Extraction, Baum-Welch Algorithm, HMM and Speech Recognition, HMM Training.

5.1 Universal Networking Language (UNL)

Universal Networking Language (UNL) is a declarative formal language specifically designed to represent semantic data extracted from natural language texts. It can be used as a pivot language in interlingual machine translation systems or as a knowledge representation language in information retrieval applications.
UNL is designed to establish a simple foundation for representing the most central aspects of
information and meaning in a machine- and human-language-independent form. As a language-
independent formalism, UNL aims to code, store, disseminate and retrieve information
independently of the original language in which it was expressed. In this sense, UNL seeks to
provide tools for overcoming the language barrier in a systematic way.
At first glance, UNL seems to be a kind of interlingua, into which source texts are converted before
being translated into target languages. It can, in fact, be used for this purpose, and very efficiently,
too. However, its real strength is knowledge representation and its primary objective is to provide
an infrastructure for handling knowledge that already exists or can exist in any given language.
Nevertheless, it is important to note that at present it would be foolish to claim to represent the
“full” meaning of any word, sentence, or text for any language. Subtleties of intention and
interpretation make the “full meaning,” however we might conceive it, too variable and subjective
for any systematic treatment. Thus UNL avoids the pitfalls of trying to represent the “full meaning”
of sentences or texts, targeting instead the “core” or “consensual” meaning most often attributed
to them. In this sense, much of the subtlety of poetry, metaphor, figurative language, innuendo,
and other complex, indirect communicative behaviors is beyond the current scope and goals of
UNL. Instead, UNL targets direct communicative behavior and literal meaning as a tangible,
concrete basis for most human communication in practical, day-to-day settings.
In the UNL approach, information conveyed by natural language is represented sentence by
sentence as a hypergraph composed of a set of directed binary labeled links (referred to
as relations) between nodes or hypernodes (the Universal Words, or simply UWs), which stand
for concepts. UWs can also be annotated with attributes representing context information.
As an example, the English sentence ‘The sky was blue?!’ can be represented in UNL as
follows:
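A common textual rendering of this UNL graph (built from the UWs, the relation, and the attributes discussed in the next paragraph; the exact notation may differ slightly from the original figure) is:

    aoj(blue(icl>color).@entry.@past.@exclamation.@interrogative, sky(icl>natural world).@def)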


In the example above, sky(icl>natural world) and blue(icl>color), which represent individual concepts, are UWs; "aoj" (= attribute of an object) is a directed binary semantic relation linking the two UWs; and "@def", "@interrogative", "@past", "@exclamation" and "@entry" are attributes modifying UWs.
UWs are intended to represent universal concepts, but are expressed in English words or in any
other natural language in order to be humanly readable. They consist of a "headword" (the UW
root) and a "constraint list" (the UW suffix between parentheses), where the constraints are used
to disambiguate the general concept conveyed by the headword. The set of UWs is organized in
the UNL Ontology, in which high-level concepts are related to lower-level ones through the
relations "icl" (= is a kind of), "iof" (= is an instance of) and "equ" (= is equal to).
Relations are intended to represent semantic links between words in every existing language. They
can be ontological (such as "icl" and "iof," referred to above), logical (such as "and" and "or"), and
thematic (such as "agt" = agent, "ins" = instrument, "tim" = time, "plc" = place, etc.). There are
currently 46 relations in the UNL Specs. They jointly define the UNL syntax.
Attributes represent information that cannot be conveyed by UWs and relations. Normally, they
represent information concerning time ("@past", "@future", etc.), reference ("@def", "@indef",
etc.), modality ("@can", "@must", etc.), focus ("@topic", "@focus", etc.), and so on.
Within the UNL Program, the process of representing natural language sentences in UNL graphs
is called UNLization, and the process of generating natural language sentences out of UNL graphs
is called NLization. UNLization, which involves natural language analysis and understanding, is
intended to be carried out semi-automatically (i.e., by humans with computer aids); and NLization
is intended to be carried out fully automatically.
For more detailed information, refer to:
http://www.undl.org/publications/UNL-beyond%20MT.html

5.2 Towards Dependency Parsing

Dependency Parsing is the process of analyzing the grammatical structure of a sentence to find related words as well as the type of relationship between them.
Each relationship:
1. Has one head and a dependent that modifies the head.
2. Is labeled according to the nature of the dependency between the head and the dependent.
These labels can be found at Universal Dependency Relations.


Simple dependency relation between two words


In the above diagram, there exists a relationship between car and black because black modifies the
meaning of car. Here, car acts as the head and black is a dependent of the head. The nature of the
relationship here is amod which stands for “Adjectival Modifier”. It is an adjective or an adjective
phrase that modifies a noun.

The term Dependency Parsing (DP) refers to the process of examining the dependencies between the words of a sentence in order to determine its grammatical structure. Based on these dependencies, a sentence is divided into several units. The process rests on the assumption that there is a direct relationship between the linguistic units of a sentence. These links are called dependencies.
Consider the following statement: “I prefer the morning flight through Denver.”

In a written dependency structure, the relationships between the linguistic units, or words, of the sentence are expressed by directed arcs. The root of the tree, "prefer", forms the head of the whole sentence, as labelled in the illustration.
A dependency tag indicates the relationship between two words. For example, the word "Denver" modifies the meaning of the noun "flight". As a result, we can identify a dependency from flight -> Denver, where flight is the head and Denver is the child or dependent. It is labelled nmod, which stands for nominal modifier.


Exercise:


Dependency Tag   Description
acl              clausal modifier of a noun (adnominal clause)
acl:relcl        relative clause modifier
advcl            adverbial clause modifier
amod             adjectival modifier
appos            appositional modifier
aux              auxiliary
aux:pass         passive auxiliary
case             case marking
cc               coordinating conjunction
cc:preconj       preconjunct
ccomp            clausal complement
clf              classifier
compound         compound
compound:prt     phrasal verb particle
conj             conjunct
cop              copula
csubj            clausal subject
dep              unspecified dependency
det              determiner
det:nummod       pronominal quantifier agreeing in case with the noun
det:poss         possessive determiner
discourse        discourse element
expl             expletive
expl:pass        reflexive pronoun used in reflexive passive
expl:pv          reflexive clitic with an inherently reflexive verb
fixed            fixed multiword expression
goeswith         goes with
iobj             indirect object
list             list
mark             marker
nmod             nominal modifier
nmod:poss        possessive nominal modifier
nmod:tmod        temporal modifier
nsubj            nominal subject
nsubj:pass       passive nominal subject
nummod           numeric modifier
obj              object
obl:arg          oblique argument
obl:lmod         locative modifier
obl:tmod         temporal modifier
parataxis        parataxis
punct            punctuation
root             root
vocative         vocative
xcomp            open clausal complement

Implementation: There are different ways to implement dependency parsing in Python. Three common options are:
1. Using spaCy
2. Using NLTK with Stanford CoreNLP
3. Using Stanza
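As a minimal sketch of the first option (assuming the small English model has already been downloaded with: python -m spacy download en_core_web_sm), the dependency relations of the example sentence can be printed like this:

    import spacy

    # Load a small English pipeline (downloaded beforehand with:
    #   python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I prefer the morning flight through Denver.")

    # Print each token, its dependency label, and the head it attaches to
    for token in doc:
        print(f"{token.text:<10} {token.dep_:<10} head: {token.head.text}")

In a typical run, "prefer" comes out as the root, "flight" as its object, and "Denver" attaches (through the preposition) to "flight", matching the structure described above.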
Applications of Dependency Parsing
Dependency parsing is a process of analyzing the grammatical structure of a sentence by
identifying the dependencies between the words in a sentence and representing them as a directed
graph.
The following are some of the applications of dependency parsing:
1. Named Entity Recognition (NER) – It helps in identifying and classifying named entities
in a text such as people, places, and organizations.
2. Part-of-Speech (POS) Tagging – It helps in identifying the parts of speech of each word in
a sentence and classifying them as nouns, verbs, adjectives, etc.
3. Sentiment Analysis – It helps in determining the sentiment of a sentence by analyzing the
dependencies between the words and the sentiment associated with each word.

4. Machine Translation – It helps in translating sentences from one language to another by analyzing the dependencies between the words and generating the corresponding dependencies in the target language.
5. Text Generation – It helps in generating text by analyzing the dependencies between the
words and generating new words that fit into the existing structure.
6. Question Answering – It helps in answering questions by analyzing the dependencies
between the words in a question and finding relevant information in a corpus.

Key points:

• Dependency parsing focuses on identifying the grammatical relationships between words in a sentence, such as subject-verb relationships.
• Dependency parsing uses dependency grammar, which represents the relationships between words as labeled directed arcs.
• Dependency parsing is based on a bottom-up approach, where the parse tree is built from the leaves up to the root.
• Dependency parsing represents a sentence as a directed graph, where words are represented as nodes and grammatical relationships are represented as edges.
• Dependency parsing is more suitable for natural language generation tasks and dependency-based machine learning models.
• Dependency parsing is simpler and more efficient, but may not capture as much syntactic information as constituency parsing.
• Dependency parsing is more appropriate for languages with less morphological inflection, like English and Chinese.
• Dependency parsing is used for more advanced NLP tasks like Machine Translation, Language Modeling, and Text Summarization.
• Dependency parsing is more suitable for languages with less complex syntactic structures.


5.3 Semantic Role Extraction

In natural language processing, semantic role labeling (also called thematic role labelling) is the process of assigning labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.
It serves to find the meaning of the sentence. To do this, it detects the arguments associated with the predicate or verb of a sentence and determines how they are classified into their specific roles.
Example 1 –
"Mary sold the book to John."
The agent is "Mary"
The predicate is "sold" (or rather, "to sell,")
The theme is "the book,"
The recipient is "John."
Example 2-
"the book belongs to me"
would need two labels such as "possessed" and "possessor" and "the book was sold to John" would
need two other labels such as theme and recipient, despite these two clauses being similar to
"subject" and "object" functions.
Semantic role labeling is mostly used to help machines understand the roles of words within sentences. This benefits applications such as Natural Language Processing programs that need to understand not just the words of a language, but how they are used in varying sentences. A better understanding of semantic role labeling could lead to advancements in question answering, information extraction, automatic text summarization, text data mining, and speech recognition.
Semantic role labeling is concerned with identification, classification, and establishing distinct identities. In some instances, semantic role labeling may not be effective through parse trees alone. Sometimes SRL is then applied via pruning and chunking. Re-ranking is also applied, through which multiple labels are aligned to every instance or argument and the context is then globally extracted from the final labels.
Various lexical resources are used for semantic role labelling, such as FrameNet, PropBank, VerbNet and WordNet. Semantic role extraction can be done using various machine learning algorithms for automatic semantic role labelling.
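As an illustration of the kind of output such a labeller produces (a hand-written sketch in PropBank-style notation, not the output of any particular tool), the first example above could be annotated as:

    [ARG0 Mary] [V sold] [ARG1 the book] [ARG2 to John]

    ARG0 = agent (the seller), V = predicate, ARG1 = theme (the thing sold), ARG2 = recipient (the buyer)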


Fig. Semantic role Extraction and labelling

5.4 Baum-Welch Algorithm

Also known as the forward-backward algorithm, the Baum-Welch algorithm is a dynamic programming approach and a special case of the expectation-maximization (EM) algorithm. Its purpose is to tune the parameters of the HMM, namely the state transition matrix A, the emission matrix B, and the initial state distribution π₀, so that the model fits the observed data as well as possible.
There are a few phases for this algorithm, including the initial phase, the forward phase, the
backward phase, and the update phase. The forward and the backward phase form the E-step of the
EM algorithm, while the update phase itself is the M-step.

Initial phase
In the initial phase, the parameter matrices A, B and π₀ are initialized; this can be done randomly if there is no prior knowledge about them.

Forward phase
In the forward phase, the following recursive alpha function is calculated. (Its full derivation is not reproduced here; the points below summarize what each part of the formula means.)
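Written out in the bracket notation used for the Viterbi algorithm later in this unit (a[j,i] for transitions, b[i,k] for the emission of the k-th observation from state i), the standard form of this recursion is:

    alpha[k,i] = P(x_1, ..., x_k, state_k = i)
               = sum over j of ( alpha[k-1,j] * a[j,i] * b[i,k] )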

There are a few points to make here:


1. The alpha function is defined as the joint probability of the observed data up to time k and the
state at time k
2. It is a recursive function because the alpha function appears in the first term of the right hand
side (R.H.S.) of the equation, meaning that the previous alpha is reused in the calculation of
the next. This is also why it is called the forward phase.
3. The second term of the R.H.S. is the state transition probability from A, while the last term is
the emission probability from B.
4. The R.H.S. is summed over all possible states at time k -1.

It should be pointed out that, each alpha contains the information from the observed data up to
time k, and to get the next alpha, we only need to reuse the current alpha, and add information about
the transition to the next state and the next observed variable. This recursive behavior saves
computations of getting the next alpha by freeing us from looking through the past observed data
every time.
By the way, we need the following starting alpha to begin the recursion.
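In the same notation, the standard starting value is:

    alpha[1,i] = pi_0[i] * b[i,1]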

The starting alpha is the product of probabilities of the emission and the initial state

Backward phase

Similar points could be made here:


1. The beta function is defined as the conditional probability of the observed data from time k+1 onwards, given the state at time k.
2. It is a recursive function because the beta function appears in the first term of the right hand side of the equation, meaning that the next beta is reused in the calculation of the current one. This is also why it is called the backward phase.
3. The second term of the R.H.S. is the state transition probability from A, while the last term is the emission probability from B.
4. The R.H.S. is summed over all possible states at time k+1.
Again, we need the ending beta to start the recursion.
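In the same notation, the standard backward recursion and its starting (ending-time) value are:

    beta[k,i] = P(x_{k+1}, ..., x_T | state_k = i)
              = sum over j of ( beta[k+1,j] * a[i,j] * b[j,k+1] )

    beta[T,i] = 1   for every state i (T is the time of the last observation)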

Firstly, as mentioned, the alpha and the beta functions are both recursive, which means that we can reuse the previous answer as the input for the next one. This is what dynamic programming is about: saving time by reusing old results.
Secondly, the formula in the forward phase is useful on its own. Suppose you have a set of well-trained transition and emission parameters, and your problem is to infer, in real time, the hidden state from observed data. When you receive one data point (data point p), you can put it into the formula, which gives you the probability distribution of the associated hidden state, from which you can pick the most probable one as your answer. The story does not stop there: when the next data point (data point q) arrives and you put it into the formula again, it gives you another probability distribution from which to pick the best choice, and this distribution is based not only on data point q and the transition and emission parameters, but also on data point p. Such use of the formula is called filtering.
Thirdly, suppose you have already collected many data points. The earlier a data point is, the less observed data its state estimate was based on, so you would like to improve those earlier estimates by somehow 'injecting' information from the later data into them. This is where the backward formula comes into play. Such use of the formula is called smoothing.
Fourthly, the last two ideas can be combined. With the help of the alpha and the beta formulas, one can determine the probability distribution of the state variable at any time k given the whole sequence of observed data. This can also be expressed mathematically.
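A standard way of writing this distribution is:

    P(state_k = i | x_1, ..., x_T) = alpha[k,i] * beta[k,i] / sum over j of ( alpha[k,j] * beta[k,j] )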

The denominator term is a normalization constant and is often dropped, because it does not depend on the state and is therefore not important when comparing the probabilities of different states at any time k.
Lastly, the results of the alpha and the beta functions are used in the update phase.

Update phase

The formulas used in the update phase can be derived in much the same way as the forward and backward formulas; the derivations are not repeated here.
The first formula gives the probability distribution of a state at time k given all the observed data we have. The second formula tells us something slightly different: the joint probability of two consecutive states given the data. Both make use of the alpha function, the beta function, and the transition and emission parameters that are already available. These two quantities are then used to finally do the update.
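In standard form, these two quantities (commonly written gamma and xi) and the update rules derived from them are:

    gamma[k,i] = alpha[k,i] * beta[k,i] / sum over j of ( alpha[k,j] * beta[k,j] )
    xi[k,i,j]  = alpha[k,i] * a[i,j] * b[j,k+1] * beta[k+1,j] / (sum of this numerator over all i, j)

    pi_0[i] <- gamma[1,i]
    a[i,j]  <- ( sum of xi[k,i,j] over k = 1..T-1 ) / ( sum of gamma[k,i] over k = 1..T-1 )
    b[i,v]  <- ( sum of gamma[k,i] over the k where x_k = v ) / ( sum of gamma[k,i] over k = 1..T )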

The derivation steps are not shown here, because the mathematics is not the focus; showing the formulas themselves is enough to see how they are reused through the steps.
As mentioned, the Baum-Welch algorithm is a special case of the EM algorithm. The alpha and beta functions form the E-step, because they predict the expected hidden states given the observed data and the parameter matrices A, B and π₀. The update phase is the M-step, because the three update formulas are derived so that the parameters on their left-hand sides best fit the expected hidden states given the observed data.
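To make the phases concrete, here is a minimal NumPy sketch for a single observation sequence with discrete emissions. It is illustrative only (names and structure are my own) and omits the rescaling or log-space tricks needed to avoid underflow on long sequences:

    import numpy as np

    def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
        """Single-sequence Baum-Welch for a discrete-emission HMM.

        obs       : sequence of observation symbols (integers in [0, n_symbols)).
        Returns (A, B, pi0): transition matrix (N x N, row = from-state),
        emission matrix (N x M) and initial state distribution (N,).
        """
        obs = np.asarray(obs)
        T = len(obs)
        rng = np.random.default_rng(seed)

        # Initial phase: random, row-normalised parameter guesses.
        A = rng.random((n_states, n_states))
        A /= A.sum(axis=1, keepdims=True)
        B = rng.random((n_states, n_symbols))
        B /= B.sum(axis=1, keepdims=True)
        pi0 = np.full(n_states, 1.0 / n_states)

        for _ in range(n_iter):
            # Forward phase: alpha[k, i] = P(x_1..x_k, state_k = i)
            alpha = np.zeros((T, n_states))
            alpha[0] = pi0 * B[:, obs[0]]
            for k in range(1, T):
                alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]

            # Backward phase: beta[k, i] = P(x_{k+1}..x_T | state_k = i)
            beta = np.zeros((T, n_states))
            beta[-1] = 1.0
            for k in range(T - 2, -1, -1):
                beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])

            # E-step: gamma[k, i] and xi[k, i, j] from alpha and beta.
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True)
            xi = np.zeros((T - 1, n_states, n_states))
            for k in range(T - 1):
                num = alpha[k][:, None] * A * (B[:, obs[k + 1]] * beta[k + 1])[None, :]
                xi[k] = num / num.sum()

            # M-step (update phase): re-estimate pi0, A and B from gamma and xi.
            pi0 = gamma[0]
            A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
            B = np.zeros((n_states, n_symbols))
            for v in range(n_symbols):
                B[:, v] = gamma[obs == v].sum(axis=0)
            B /= gamma.sum(axis=0)[:, None]

        return A, B, pi0

Each iteration runs the forward and backward phases (the E-step) and then re-estimates π₀, A and B (the M-step), exactly as described above.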

5.5 HMM and Speech Recognition

Speech recognition is the process of converting a speech signal into a sequence of words. Various approaches have been used for speech recognition, including dynamic programming and neural networks. HMMs are very rich in mathematical structure and hence can form the theoretical basis for


use in a wide range of applications. HMM-based models, when applied properly, work well in practice for several important applications.
Speech recognition consists of two main modules: feature extraction and feature matching. The purpose of the feature extraction module is to convert the speech waveform into some type of representation for further analysis and processing; this extracted information is known as a feature vector. The conversion of the voice signal into feature vectors is done by the signal-processing front-end module: its input is a (noise-free) voice sample and its output is a sequence of feature vectors. In feature matching, the feature vector extracted from an unknown voice sample is scored against the acoustic models; the model with the maximum score wins and its output is taken as the recognized word. The following are a few methods for implementing the front end (for extracting feature vectors):
 MFCC (Mel-Frequency Cepstrum Coefficient)
 LPC (Linear Predictive Coding)
Once the feature vectors are obtained, we build the acoustic model. The acoustic model is used to score the unknown voice sample. As shown in the block diagram, the output of the front end is given as input to the acoustic model. Different types of acoustic model are:
• VQ-Code Book
• GMM-Gaussian Mixture Model
Acoustic Model Representation: In speech recognition, the basic unit of sound is the phoneme. A phoneme is a minimal unit that serves to distinguish between the meanings of words. For example, the phoneme sequence for "CAT" is K, A and T. In the English language there are roughly 46 phonemes, and any word in an English dictionary can be constructed by a proper concatenation of these phonemes. In order to recognize a given word, we should therefore extract phonemes from the voice sample. Because of the slowly time-varying nature of the speech signal, short-term spectral analysis is the most common way to characterize it. When examined over a sufficiently short period of time (between 10 and 25 msec), its characteristics are fairly stationary; over longer periods, the signal characteristics change to reflect the different speech sounds being spoken. Using this observation, a feature vector extracted over 10 to 25 msec corresponds to a single phoneme. For speech recognition with HMMs, we assign each basic unit (phoneme) a unique HMM. Studies show that each phoneme HMM can be represented by three states, i.e. a begin, a middle and an end state.
To understand this, take the example of an isolated word recognizer, where each word in the vocabulary has a distinct HMM. When an unknown word arrives, it is scored against all HMMs, and the HMM with the maximum score is taken as the recognized word. As shown in the block diagram, the output of the acoustic model is a sequence of phonemes. By looking up the dictionary in reverse (phoneme to word) we can find the corresponding word. In general, however, there may be several words with the same phoneme sequence (for example, "car key" and "khaki" share a phoneme sequence). In such cases the language structure comes into the picture: it uses context information to narrow down the recognized word so that it fits the given grammatical construct. The model described so far is known as a mono-phone or context-independent model; the following are different types of HMM:

1. Context-Independent Phoneme HMM

• Number of states: a d-state HMM for each phoneme (d is normally equal to 3)
• Accuracy: not accurate in continuous speech recognition
• Compact: yes, d-state phoneme HMMs lead to fewer parameters to be calculated
• General: yes, we can build an HMM for a new word using the existing phoneme HMMs

2. Context-Dependent Triphone HMM

• Number of states: a d-state HMM for each context-dependent phoneme (triphone)
• Accuracy: accurate, as it models each phoneme together with its left and right neighbours
• Compact: no, since each phoneme is modelled with its immediate left-right context, many more parameters need to be calculated
• General: yes

3. Whole-Word HMM

• Number of states: no phoneme modelling; a number of states is assigned to model the word as a whole
• Accuracy: yes, but it requires large amounts of training data and works only for small vocabularies
• Compact: no, too many states are needed as the vocabulary increases
• General: no, new words cannot be built from this representation

In practice, the triphone model is widely used in speech recognition with Hidden Markov Models. We will not discuss here how feature vectors are extracted from the voice signal; we assume that, using either MFCC or LPC, we obtain the required feature vectors (in the case of MFCC, a 39-dimensional vector). The question is how to map feature vectors to HMM states; for this we use a technique called Vector Quantization:
• create a training set of feature vectors
• cluster them into a small number of classes
• represent each class by a symbol
• for each class Vk, compute the probability that it is generated by a given HMM state

In Vector Quantization, we define a codebook which contains an entry for each symbol, also known as a prototype vector or codeword. For example, if we have 256 classes (i.e. 8-bit VQ), we make 256 entries in the codebook. An incoming vector is compared with each prototype vector; the one with the minimum distance is chosen, and its index value is assigned to the input vector.
Generate the codebook (K-means clustering):
• choose M vectors from the L training vectors, where M = 2^B, as the initial code words (choose the M vectors with the maximum distance between them)
• for each training vector, find the closest code word (minimum distance) and assign the training vector to that cell
• for each cell, compute the centroid of that cell; the new code word is the centroid
• repeat the last two steps until the average distance falls below a threshold
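A rough NumPy sketch of this codebook generation and of the quantization step (illustrative only: it uses random initialisation and a fixed number of iterations instead of the maximum-distance initialisation and the average-distance threshold described above):

    import numpy as np

    def build_codebook(train_vecs, n_codewords=256, n_iter=20, seed=0):
        """K-means style codebook generation for vector quantization.

        train_vecs : (L, D) array of training feature vectors (e.g. 39-dim MFCCs).
        Returns an (M, D) array of code words (prototype vectors).
        """
        train_vecs = np.asarray(train_vecs, dtype=float)
        rng = np.random.default_rng(seed)
        # Initialise the M code words with M randomly chosen training vectors.
        idx = rng.choice(len(train_vecs), n_codewords, replace=False)
        codebook = train_vecs[idx].copy()

        for _ in range(n_iter):
            # Assign every training vector to its closest code word (Euclidean distance).
            dists = np.linalg.norm(train_vecs[:, None, :] - codebook[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Recompute each code word as the centroid of the vectors assigned to it.
            for m in range(n_codewords):
                members = train_vecs[labels == m]
                if len(members):
                    codebook[m] = members.mean(axis=0)
        return codebook

    def quantize(vec, codebook):
        """Return the index (symbol) of the code word closest to an incoming feature vector."""
        return int(np.linalg.norm(codebook - np.asarray(vec, dtype=float), axis=1).argmin())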

In this way, speech signal decoding and enhancement can be done with the help of the HMM Viterbi algorithm.

5.6 HMM training

An HMM can be trained using a variety of algorithms, including the Baum-Welch algorithm and
the Viterbi algorithm.
The Baum-Welch algorithm is an unsupervised learning algorithm that iteratively adjusts the
probabilities of events occurring in each state to fit the data better.


The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of
hidden states given a sequence of observable events.

Viterbi Algorithm
The Viterbi algorithm is a dynamic programming algorithm used to determine the most probable
sequence of hidden states in a Hidden Markov Model (HMM) based on a sequence of observations.
It is a widely used algorithm in speech recognition, natural language processing, and other areas
that involve sequential data.
The algorithm works by recursively computing the probability of the most likely sequence of
hidden states that ends in each state for each observation.
At each time step, the algorithm computes the probability of being in each state and emitting the current observation, based on the probabilities of being in the previous states and making a transition to the current state.
Assuming we have an HMM with N hidden states and T observations, the Viterbi algorithm can
be summarized as follows:
1. Initialization: At time t=1, we set the probability of the most likely path ending in state i, for each state i, to the product of the initial state probability pi[i] and the emission probability of the first observation given state i. This is denoted by: delta[1,i] = pi[i] * b[i,1].
2. Recursion: For each time step t from 2 to T, and for each state i, we compute the
probability of the most likely path ending in state i at time t by considering all possible
paths that could have led to state i. This probability is given by:
delta[t,i] = max_j(delta[t-1,j] * a[j,i] * b[i,t])
Here, a[j,i] is the probability of transitioning from state j to state i, and b[i,t] is the probability of observing the t-th observation given state i.
We also keep track of the most likely previous state that led to the current state i, which is given
by:
psi[t,i] = argmax_j(delta[t-1,j] * a[j,i])
3. Termination: The probability of the most likely path overall is given by the maximum of
the probabilities of the most likely paths ending in each state at time T. That is, P* =
max_i(delta[T,i]).
4. Backtracking: Starting from the state i* that gave the maximum probability at time T, we
recursively follow the psi values back to time t=1 to obtain the most likely path of hidden
states.
The Viterbi algorithm is an efficient and powerful tool that can handle long sequences of
observations using dynamic programming.
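A compact NumPy sketch of the four steps above, using the same delta, psi, a and b names (indices here are 0-based rather than starting at t = 1):

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely hidden-state sequence for a discrete-emission HMM.

        obs : length-T sequence of observation symbols (integers).
        pi  : (N,)   initial state probabilities.
        A   : (N, N) transition probabilities, A[j, i] = P(state i | state j).
        B   : (N, M) emission probabilities,  B[i, o] = P(observation o | state i).
        Returns (best_path, best_prob).
        """
        T, N = len(obs), len(pi)
        delta = np.zeros((T, N))           # delta[t, i]: best path probability ending in state i at t
        psi = np.zeros((T, N), dtype=int)  # psi[t, i]:   best previous state

        # 1. Initialization
        delta[0] = pi * B[:, obs[0]]

        # 2. Recursion
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A   # scores[j, i] = delta[t-1, j] * a[j, i]
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]

        # 3. Termination
        best_last = int(delta[-1].argmax())
        best_prob = float(delta[-1].max())

        # 4. Backtracking
        path = [best_last]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t][path[-1]]))
        path.reverse()
        return path, best_prob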

Baum–Welch algorithm

The Baum-Welch algorithm is a special case of the expectation-maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step.
The standard algorithm for HMM training is the forward-backward, or Baum-Welch, algorithm (Baum, 1972), a special case of the Expectation-Maximization (EM) algorithm. The algorithm will let us train both the transition probabilities A and the emission probabilities B of the HMM. EM is an iterative algorithm, computing an initial estimate for the probabilities, then using those estimates to compute a better estimate, and so on, iteratively

improving the probabilities that it learns. Let us begin by considering the much simpler case of
training a fully visible Markov model, where we know both the temperature and the ice cream
count for every day. That is, imagine we see the following set of input observations and magically
knew the aligned hidden state sequences:
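For instance, suppose (purely as an illustration) the observed ice-cream counts were 3 2 1 1 2 3 and the known hidden temperatures were hot hot cold cold cold hot. Training the fully visible model is then just normalized counting: P(cold | hot) = C(hot -> cold) / C(hot) = 1/2, P(cold | cold) = 2/3, P(3 | hot) = 2/3, P(1 | cold) = 2/3, and so on.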

For a real HMM, we cannot compute these counts directly from an observation sequence since we
don’t know which path of states was taken through the machine for a given input. For example,
suppose I didn’t tell you the temperature on day 2, and you had to guess it, but you (magically)
had the above probabilities, and the temperatures on the other days. You could do some Bayesian
arithmetic with all the other probabilities to get estimates of the likely temperature on that missing
day, and use those to get expected counts for the temperatures for day 2. But the real problem is
even harder: we don’t know the counts of being in any of the hidden states!! The Baum-Welch
algorithm solves this by iteratively estimating the counts. We will start with an estimate for the
transition and observation probabilities and then use these estimated probabilities to derive better
and better probabilities. And we’re going to do this by computing the forward probability for an
observation and then dividing that probability mass among all the different paths that contributed
to this forward probability. To understand the algorithm, we need to define a useful probability
related to the forward probability and called the backward probability. The backward probability
β is the probability of seeing the observations from time t +1 to the end, given that we are in state
i at time t (and given the automaton λ):
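In the usual textbook notation, with o_t the observation and q_t the state at time t, this backward probability is:

    beta_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ)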


The parameters of an HMM are the A transition probability matrix and the B observation likelihood
matrix. Both can be trained with the Baum-Welch or forward-backward algorithm.

***********************************