NLP Unit 5
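For reference, a UNL expression for a sentence such as "The sky was blue?!" would look roughly like the following sketch, built from the UWs, relation, and attributes discussed in the next paragraph (a reconstruction, not an official UNL example):

aoj(blue(icl>color).@entry.@past.@exclamation.@interrogative, sky(icl>natural world).@def)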
In the example above, sky(icl>natural world) and blue(icl>color), which represent individual
concepts, are UWs; "aoj" (= attribute of an object) is a directed binary semantic relation linking
the two UWs; and "@def", "@interrogative", "@past", "@exclamation" and "@entry" are
attributes modifying UWs.
UWs are intended to represent universal concepts, but are expressed in English words or in any
other natural language in order to be humanly readable. They consist of a "headword" (the UW
root) and a "constraint list" (the UW suffix between parentheses), where the constraints are used
to disambiguate the general concept conveyed by the headword. The set of UWs is organized in
the UNL Ontology, in which high-level concepts are related to lower-level ones through the
relations "icl" (= is a kind of), "iof" (= is an instance of) and "equ" (= is equal to).
Relations are intended to represent semantic links between words in every existing language. They
can be ontological (such as "icl" and "iof," referred to above), logical (such as "and" and "or"), and
thematic (such as "agt" = agent, "ins" = instrument, "tim" = time, "plc" = place, etc.). There are
currently 46 relations in the UNL Specs. They jointly define the UNL syntax.
Attributes represent information that cannot be conveyed by UWs and relations. Normally, they
represent information concerning time ("@past", "@future", etc.), reference ("@def", "@indef",
etc.), modality ("@can", "@must", etc.), focus ("@topic", "@focus", etc.), and so on.
Within the UNL Program, the process of representing natural language sentences in UNL graphs
is called UNLization, and the process of generating natural language sentences out of UNL graphs
is called NLization. UNLization, which involves natural language analysis and understanding, is
intended to be carried out semi-automatically (i.e., by humans with computer aids); and NLization
is intended to be carried out fully automatically.
For more detailed information, refer to:
http://www.undl.org/publications/UNL-beyond%20MT.html
Dependency parsing is the process of analyzing the grammatical structure of a sentence and finding
the related words as well as the type of relationship between them.
Each relationship:
1. Has one head and a dependent that modifies the head.
2. Is labeled according to the nature of the dependency between the head and the dependent.
These labels can be found at Universal Dependency Relations.
The term Dependency Parsing (DP) refers to the process of examining the dependencies between
the words and phrases of a sentence in order to determine its grammatical structure. A sentence is
divided into parts based largely on these dependencies. The process is based on the assumption that
there is a direct relationship between each pair of related linguistic units in a sentence. These links
are called dependencies.
Consider the following statement: “I prefer the morning flight through Denver.”
In a written dependency structure, the relationships between the linguistic units, or words, in the
sentence are expressed by directed arcs. The root of the tree is "prefer", which acts as the head of
the whole sentence, as labelled in the illustration.
A dependency tag indicates the relationship between two words. For example, the word "Denver"
modifies the meaning of the noun "flight". As a result, there is a dependency from
flight -> Denver, where "flight" is the head and "Denver" is the child, or dependent. It is labelled
nmod, which stands for nominal modifier.
Exercise:
Implementation: There are different ways to implement dependency parsing in Python. In this
article, we will look at three ways (a sketch of option 1 follows this list):
1. Using spaCy
2. Using NLTK with Stanford CoreNLP
3. Using Stanza
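A minimal sketch of option 1, using spaCy. This assumes the small English model has been installed with "python -m spacy download en_core_web_sm"; note that spaCy's dependency labels differ slightly from the Universal Dependencies labels used above (it uses labels such as dobj and pobj).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I prefer the morning flight through Denver.")

# Print each word, its dependency label, and the head it depends on
for token in doc:
    print(f"{token.text:10} {token.dep_:10} head: {token.head.text}")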
Applications of Dependency Parsing
Dependency parsing is a process of analyzing the grammatical structure of a sentence by
identifying the dependencies between the words in a sentence and representing them as a directed
graph.
The following are some of the applications of dependency parsing:
1. Named Entity Recognition (NER) – It helps in identifying and classifying named entities
in a text such as people, places, and organizations.
2. Part-of-Speech (POS) Tagging – It helps in identifying the parts of speech of each word in
a sentence and classifying them as nouns, verbs, adjectives, etc.
3. Sentiment Analysis – It helps in determining the sentiment of a sentence by analyzing the
dependencies between the words and the sentiment associated with each word.
Key points:
• Dependency parsing uses dependency grammar, which represents the relationships between words as labeled directed arcs.
• Dependency parsing is based on a bottom-up approach, where the parse tree is built from the leaves up to the root.
• Dependency parsing represents a sentence as a directed graph, where words are represented as nodes and grammatical relationships are represented as edges.
• Dependency parsing is more suitable for natural language generation tasks and dependency-based machine learning models.
• Dependency parsing is simpler and more efficient, but may not capture as much syntactic information as constituency parsing.
• Dependency parsing is more appropriate for languages with less morphological inflection, like English and Chinese.
• Dependency parsing is used for more advanced NLP tasks like Machine Translation, Language Modeling, and Text Summarization.
• Dependency parsing is more suitable for languages with less complex syntactic structures.
In natural language processing, semantic role labeling (also called semantic role labelling or
thematic role labelling) is the process of assigning labels to words or phrases in a sentence that
indicate their semantic role in the sentence, such as that of an agent, goal, or result.
It serves to find the meaning of the sentence. To do this, it detects the arguments associated with
the predicate or verb of a sentence and how they are classified into their specific roles.
Example 1 –
"Mary sold the book to John."
The agent is "Mary"
The predicate is "sold" (or rather, "to sell,")
The theme is "the book,"
The recipient is "John."
Example 2 –
"The book belongs to me" would need two labels, such as "possessed" and "possessor", while "the
book was sold to John" would need two other labels, such as theme and recipient, even though the
two clauses are similar in their "subject" and "object" functions.
Semantic role labeling is mostly used to help machines understand the roles of words within
sentences. This benefits Natural Language Processing applications that need to understand not just
the words of a language, but how they can be used in varying sentences. A better understanding of
semantic role labeling could lead to advancements in question answering, information extraction,
automatic text summarization, text data mining, and speech recognition.
Semantic role labeling is concerned with identification, classification, and establishing distinct
identities. In some instances, semantic role labeling may not be effective through parse trees alone.
Sometimes SRL is then applied via pruning and chunking. Re-ranking is also applied, through
which multiple labels are aligned to every instance or argument and the context is then globally
extracted from the final labels.
Various lexical resources are used for semantic role labelling, such as FrameNet, PropBank,
VerbNet and WordNet. Semantic role extraction can be done using various machine learning
algorithms for automatic semantic role labelling.
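As an illustration of one such resource, the sketch below browses PropBank-style role annotations through NLTK's corpus reader. The choice of NLTK/PropBank here is an assumption (the notes only list the resources), and it requires the sample data to be fetched first with nltk.download('propbank') and nltk.download('treebank').

from nltk.corpus import propbank

# First annotated predicate instance in the PropBank sample
inst = propbank.instances()[0]

print(inst.roleset)                   # frame identifier of the predicate (a sense such as 'sell.01')
print(inst.predicate)                 # pointer to the predicate's position in the parse tree
for argloc, argid in inst.arguments:  # argument locations and their role labels (ARG0, ARG1, ARGM-*, ...)
    print(argid, argloc)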
Initial phase
In the initial phase, the contents of the parameter matrices A, B and π₀ are initialized; this can be
done randomly if there is no prior knowledge about them.
Forward phase
In the forward phase, the following recursive alpha function is calculated. For the derivation of the
function, I would strongly recommend this YouTube video, as the speaker presents it clearly and
explains it very well.
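Sketched in the bracket notation used later for the Viterbi algorithm (a reconstruction, so treat it as a reference rather than the original figure), the recursion is:

alpha[k+1, j] = ( sum over i of alpha[k, i] * a[i, j] ) * b[j, o(k+1)]

where a[i, j] is the probability of transitioning from state i to state j, b[j, o] is the probability of emitting observation o in state j, and o(k) is the observed variable at time k.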
It should be pointed out that each alpha contains the information from the observed data up to
time k, and to get the next alpha we only need to reuse the current alpha and add information about
the transition to the next state and the next observed variable. This recursive behaviour saves
computation when getting the next alpha by freeing us from looking through all the past observed
data every time.
By the way, we need the following starting alpha to begin the recursion.
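In the same notation (again a reconstruction):

alpha[1, i] = π₀[i] * b[i, o(1)]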
The starting alpha is the product of the emission probability and the initial state probability.
Backward phase
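The backward (beta) recursion, sketched in the same notation (a reconstruction), runs from the end of the sequence back towards the beginning:

beta[k, i] = sum over j of a[i, j] * b[j, o(k+1)] * beta[k+1, j],   with beta[T, i] = 1 for every state i.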
Firstly, as mentioned, the alpha and the beta functions are both recursive, which means that we can
reuse a previous answer as the input for the next one. This is what dynamic programming is about:
you save time by reusing old results!
Secondly, the formula in the forward phase is very useful. Suppose you have a set of well-trained
transition and emission parameters, and your problem is to find out, in real time, the mysterious
hidden truth behind the observed data. Then you can actually do it like this: when you get one data
point (data point p), you put it into the formula, which gives you the probability distribution of the
associated hidden state, and from that distribution you pick the most probable state as your answer.
The story does not stop here: when you get the next data point (data point q) and put it into the
formula again, it gives you another probability distribution from which to pick the best choice, and
this one is based not only on data point q and the transition and emission parameters, but also on
data point p. Such use of the formula is called filtering.
Thirdly, suppose that you have already collected many data points, and you know that the earlier a
data point is, the less observed data its estimate was based on. You would therefore like to improve
the earlier estimates by somehow 'injecting' information from the later data into them. This is
where the backward formula comes into play. Such use of the formula is called smoothing.
Fourthly, this is about the combination of the last two paragraphs. With the help of the alpha and
the beta formulas, one can determine the probability distribution of the state variable at any
time k given the whole sequence of observed data. This can also be expressed mathematically, as
sketched below. The denominator term is a normalization constant and is usually dropped, because
it does not depend on the state and is therefore not important when comparing the probabilities of
different states at any time k.
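In the same notation, the quantity discussed in the last paragraph is (a reconstruction):

P(X_k = i | o(1), ..., o(T)) = alpha[k, i] * beta[k, i] / ( sum over j of alpha[k, j] * beta[k, j] )

Dropping the state-independent denominator leaves the proportional form alpha[k, i] * beta[k, i].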
Lastly, the results from the alpha and the beta functions are used in the update phase.
Update phase
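The two update-phase formulas discussed below, sketched in the same notation (a reconstruction), are:

gamma[k, i] = P(X_k = i | o(1), ..., o(T))
            = alpha[k, i] * beta[k, i] / ( sum over j of alpha[k, j] * beta[k, j] )

xi[k, i, j] = P(X_k = i, X_{k+1} = j | o(1), ..., o(T))
            = alpha[k, i] * a[i, j] * b[j, o(k+1)] * beta[k+1, j] / ( sum over i', j' of alpha[k, i'] * a[i', j'] * b[j', o(k+1)] * beta[k+1, j'] )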
For the derivation of the above formulas: if you have watched the YouTube videos suggested for
the forward and backward formulas, and you understand them, then you will probably have no
problem deriving these two yourself.
The first formula gives the probability distribution of a state at time k given all the observed data
we have. The second formula tells us something slightly different: the joint probability of two
consecutive states given the data. Both make use of the alpha function, the beta function, and the
transition and emission parameters that are already available. These two formulas are then used to
finally perform the update.
The derivation steps are not shown here, because mathematics is not the intention of this article, but
showing the formulas themselves is useful for seeing how they are re-used through the steps.
It was mentioned that the Baum-Welch algorithm is a case of the EM algorithm. The alpha and beta
functions form the E-step, because they compute the expected hidden states given the observed data
and the parameter matrices A, B, π₀. The update phase is the M-step, because the last three update
formulas (sketched below) are derived so that the left-hand-side parameters best fit the expected
hidden states given the observed data.
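The three update formulas, sketched in the same notation (a reconstruction), re-estimate the parameters as:

π₀[i] <- gamma[1, i]

a[i, j] <- ( sum over k = 1..T-1 of xi[k, i, j] ) / ( sum over k = 1..T-1 of gamma[k, i] )

b[j, v] <- ( sum over k with o(k) = v of gamma[k, j] ) / ( sum over k = 1..T of gamma[k, j] )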
HMMs are used in a wide range of applications and, when applied properly, work well in practice
for several important applications.
Speech recognition consists of two main modules: feature extraction and feature matching. The
purpose of the feature extraction module is to convert the speech waveform into some type of
representation for further analysis and processing; this extracted information is known as a feature
vector. The process of converting the voice signal into a feature vector is done by the signal-processing
front-end module. The input to the front end is a noise-free voice sample, and its output is a feature
vector. In feature matching, the feature vector extracted from an unknown voice sample is scored
against each acoustic model; the model with the maximum score wins and its output is considered
the recognized word. The following are a few methods for implementing the front end (i.e., for
extracting the feature vector):
MFCC (Mel-Frequency Cepstrum Coefficient)
LPC (Linear Predictive Coding)
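As a concrete illustration of the MFCC front end, the sketch below computes 13 MFCCs per frame plus their first- and second-order deltas, giving the 39-dimensional feature vectors mentioned later. The use of librosa and the file name sample.wav are assumptions; the notes do not prescribe a particular tool.

import numpy as np
import librosa

# Load a (hypothetical) noise-free voice sample at 16 kHz
y, sr = librosa.load("sample.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients per frame
delta = librosa.feature.delta(mfcc)                  # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)        # second-order differences

features = np.vstack([mfcc, delta, delta2])          # 39-dimensional feature vectors
print(features.shape)                                # (39, number_of_frames)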
Once the feature vectors are obtained, we build the acoustic model. The acoustic model is used to
score the unknown voice sample. As shown in the block diagram, the output of the front end is
given as input to the acoustic model. Different types of acoustic models are:
• VQ (Vector Quantization) codebook
• GMM (Gaussian Mixture Model)
Acoustic Model Representation: In speech recognition, the basic unit of sound is the phoneme. A
phoneme is a minimal unit that serves to distinguish between the meanings of words. For example,
the phoneme sequence for "CAT" is K, A and T. In the English language there are around 46
phonemes, and we can construct any word in the English dictionary by a proper concatenation of
these phonemes. In order to recognize a given word, we should extract the phonemes from the voice
sample. Due to the slowly time-varying nature of the speech signal, short-term spectral analysis is
the most common way to characterize it. When examined over a sufficiently short period of time
(between 10 and 25 msec), its characteristics are fairly stationary; however, over longer periods the
signal characteristics change to reflect the different speech sounds being spoken. Using this
observation, we find that a feature vector extracted over 10 to 25 msec corresponds to a single
phoneme. For speech recognition with HMMs, we assign each basic unit (phoneme) a unique HMM.
Studies show that each phoneme HMM can be represented by three states, i.e., a begin, a middle
and an end state.
To understand this, take the example of an isolated word recognizer, where each word in the
vocabulary has a distinct HMM. When an unknown word comes in, it is scored against all HMM
models, and the HMM with the maximum score is considered the recognized word. As shown in
the block diagram, the output of the acoustic model is a sequence of phonemes. By looking up the
dictionary in reverse (phoneme -> word) we can find the corresponding word. But in general there
are many words with the same phoneme sequence; for example, "car key" and "khaki" have the
same phoneme sequence. In such cases the language structure comes into the picture: it uses context
information to narrow down the recognized word so that it matches the given grammar constructs.
This type of model is known as a mono-phone or context-independent model. The following are
different types of HMM:
3. Whole-Word HMM
• Number of states: no phoneme-level modelling; a number of states is assigned to model the word as a whole
• Accurate: yes, but it requires large training data and works only for a small vocabulary
• Compact: no, it needs too many states as the vocabulary increases
• General: no, new words cannot be built from this representation
In practice, the triphone model is widely used in speech recognition with Hidden Markov Models.
We are not going to discuss here how feature vectors are extracted from the voice signal; we assume
that, using either MFCC or LPC, we get the required feature vectors (in the case of MFCC, a
39-dimensional vector). The question is how to map feature vectors to HMM states; to do so, we use
a technique called Vector Quantization:
• create a training set of feature vectors
• cluster them into a small number of classes
• represent each class by a symbol
• for each class Vk, compute the probability that it is generated by a given HMM state.
In Vector Quantization, we define a codebook which contains an entry for each symbol, also known
as a prototype vector or code word. For example, if we have 256 classes (i.e., 8-bit VQ), we make
256 entries in the codebook. As shown in Fig. 8, the incoming vector is compared with each
prototype vector, the one with the minimum distance is chosen, and its index value is assigned to
the input vector.
Generate the codebook (K-means clustering); a minimal sketch of this clustering step follows the list:
• choose M vectors from the L training vectors, where M = 2^B, as the initial code words (choose the M vectors with the maximum distance between them)
• for each training vector, find the closest code word (minimum distance) and assign the training vector to that cell
• for each cell, compute the centroid of that cell; the new code word is the centroid
• repeat the last two steps until the average distance falls below a threshold
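The sketch below illustrates codebook generation with scikit-learn's KMeans. This is only one possible tooling choice (an assumption, since the notes describe the procedure without naming a library), and its default initialization differs from the maximum-distance heuristic above; the training vectors are random stand-ins for real MFCC vectors.

import numpy as np
from sklearn.cluster import KMeans

train_vectors = np.random.rand(1000, 39)   # stand-in for L real 39-dimensional training vectors
M = 256                                    # codebook size, M = 2^B with B = 8 (8-bit VQ)

kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(train_vectors)
codebook = kmeans.cluster_centers_         # one prototype vector (code word) per class

# Quantize an incoming vector: the symbol is the index of the nearest code word
incoming = np.random.rand(1, 39)
symbol = kmeans.predict(incoming)[0]
print(symbol)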
In this way, speech signal decoding and enhancement can be done with the help of the HMM Viterbi
algorithm.
HMMs can be trained using a variety of algorithms, including the Baum-Welch algorithm and
the Viterbi algorithm.
The Baum-Welch algorithm is an unsupervised learning algorithm that iteratively adjusts the
probabilities of events occurring in each state to fit the data better.
The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of
hidden states given a sequence of observable events.
Viterbi Algorithm
The Viterbi algorithm is a dynamic programming algorithm used to determine the most probable
sequence of hidden states in a Hidden Markov Model (HMM) based on a sequence of observations.
It is a widely used algorithm in speech recognition, natural language processing, and other areas
that involve sequential data.
The algorithm works by recursively computing the probability of the most likely sequence of
hidden states that ends in each state for each observation.
At each time step, the algorithm computes the probability of being in each state and emitting the
current observation, based on the probabilities of being in the previous states and making a
transition to the current state.
Assuming we have an HMM with N hidden states and T observations, the Viterbi algorithm can
be summarized as follows:
1. Initialization: At time t=1, we set the probability of the most likely path ending in state i,
for each state i, to the product of the initial state probability pi[i] and the emission probability
of the first observation given state i. This is denoted by: delta[1,i] = pi[i] * b[i,1].
2. Recursion: For each time step t from 2 to T, and for each state i, we compute the
probability of the most likely path ending in state i at time t by considering all possible
paths that could have led to state i. This probability is given by:
delta[t,i] = max_j(delta[t-1,j] * a[j,i] * b[i,t])
Here, a[j,i] is the probability of transitioning from state j to state i, and b[i,t] is the probability of
observing the t-th observation given state i.
We also keep track of the most likely previous state that led to the current state i, which is given
by:
psi[t,i] = argmax_j(delta[t-1,j] * a[j,i])
3. Termination: The probability of the most likely path overall is given by the maximum of
the probabilities of the most likely paths ending in each state at time T. That is, P* =
max_i(delta[T,i]).
4. Backtracking: Starting from the state i* that gave the maximum probability at time T, we
recursively follow the psi values back to time t=1 to obtain the most likely path of hidden
states.
The Viterbi algorithm is an efficient and powerful tool that can handle long sequences of
observations using dynamic programming.
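The sketch below is a minimal NumPy illustration of the four steps above. It is an illustration, not a library implementation; a production version would work with log probabilities to avoid numerical underflow, and the toy parameters at the bottom are hypothetical.

import numpy as np

def viterbi(pi, A, B, obs):
    # pi[i]   : initial probability of state i
    # A[i, j] : probability of transitioning from state i to state j
    # B[i, o] : probability of emitting observation symbol o in state i
    # obs     : list of observation symbol indices o(1), ..., o(T)
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, i]: best path probability ending in state i at time t
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: best previous state

    delta[0] = pi * B[:, obs[0]]                      # 1. initialization
    for t in range(1, T):                             # 2. recursion
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]

    best_last = int(np.argmax(delta[T - 1]))          # 3. termination
    path = [best_last]
    for t in range(T - 1, 0, -1):                     # 4. backtracking
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[T - 1].max())

# Toy usage with hypothetical 2-state, 3-symbol parameters
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2]))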
Baum–Welch algorithm
The Baum-Welch algorithm trains the model iteratively, improving the probabilities that it learns.
Let us begin by considering the much simpler case of training a fully visible Markov model, where
we know both the temperature and the ice cream count for every day; that is, imagine we could see
a set of input observations and magically knew the aligned hidden state sequences.
For a real HMM, we cannot compute these counts directly from an observation sequence since we
don’t know which path of states was taken through the machine for a given input. For example,
suppose I didn’t tell you the temperature on day 2, and you had to guess it, but you (magically)
had the above probabilities, and the temperatures on the other days. You could do some Bayesian
arithmetic with all the other probabilities to get estimates of the likely temperature on that missing
day, and use those to get expected counts for the temperatures for day 2. But the real problem is
even harder: we don’t know the counts of being in any of the hidden states!! The Baum-Welch
algorithm solves this by iteratively estimating the counts. We will start with an estimate for the
transition and observation probabilities and then use these estimated probabilities to derive better
and better probabilities. And we’re going to do this by computing the forward probability for an
observation and then dividing that probability mass among all the different paths that contributed
to this forward probability. To understand the algorithm, we need to define a useful probability
related to the forward probability and called the backward probability. The backward probability
β is the probability of seeing the observations from time t +1 to the end, given that we are in state
i at time t (and given the automaton λ):
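Written out (a reconstruction of the defining equation in the notation of this paragraph):

beta[t, i] = P( o(t+1), o(t+2), ..., o(T) | q_t = i, λ )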
The parameters of an HMM are the A transition probability matrix and the B observation likelihood
matrix. Both can be trained with the Baum-Welch or forward-backward algorithm.
***********************************