NLP Course File Notes
NEHRU INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Autonomous)
T. M. Palayam, Coimbatore – 641 105, Tamil Nadu, India
(Approved by AICTE, New Delhi and Affiliated to Anna University, Chennai)
Re-Accredited by NAAC, Recognized by UGC under Sections 2(f) and 12(B)
NBA Accredited UG Programmes – Aero & CSE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
To make interactions between computers and humans possible, computers need to understand the natural languages used by humans.
Natural language processing (NLP) is all about making computers learn, understand, analyse, manipulate, and interpret natural (human) languages.
NLP stands for Natural Language Processing, a field at the intersection of Computer Science, Human Language (Linguistics), and Artificial Intelligence.
Processing of natural language is required when you want an intelligent system, such as a robot, to perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical expert system, and so on.
The ability of machines to interpret human language is now at the core of many applications that we use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social media language translators.
The input and output of an NLP system can be speech or written text.
Components of NLP
Natural Language Understanding (NLU) helps the machine understand and analyse human language by extracting from text elements such as keywords, emotions, relations, and semantics.
Natural Language Generation (NLG) is the complementary component: it produces meaningful phrases and sentences in natural language from an internal representation.
• NLP is essentially the way humans communicate with a machine to allow it to perform the required task.
• Language processing is not only for computers but is used for other machines as well.
• Around 1950, the Turing machine, developed by Alan Turing, became famous.
• The Turing machine uses logic and various other key aspects to understand the logic behind each message and crack it.
• In the 1960s, Chomsky and other researchers worked to develop a universal syntax that could be used by people around the world to communicate with machines.
• Later, NLP was combined with ML and AI to develop probabilistic models that relied heavily on data.
• A lot of this data came from speech recognition, which allowed people to transfer audio files and store them in the form of data on the internet.
CHALLENGES OF NLP
• 1. Text summarization: a challenge for readers who need to extract information while going through a lot of data in a short time. AI uses NLP to interpret the text and structure a summary containing all of the important points.
• 2. Data: one of the main challenges of NLP is finding and collecting enough high-quality data to train and test your models. Therefore, you need to ensure that you have a clear data strategy.
• 3. Language variation: language has many variations, such as dialects, accents, slang, idioms, jargon, and sarcasm. Therefore, we need to ensure that the models can handle and adapt to different domains and scenarios, and that they can capture the meaning and sentiment behind the words.
• 4. Model choice: NLP models can be (1) rule-based, (2) statistical, or (3) neural. We need to consider the trade-offs and criteria of each model, such as accuracy, speed, scalability, interpretability, and robustness.
• 5. Integration and deployment: integrating and deploying your models into your existing systems and workflows. You need to ensure that your models are compatible and interoperable with your systems, that they can handle the input and output formats and channels, that they can scale and perform well under different loads and conditions, and that they can be updated and maintained easily.
Language modelling
• A language model in NLP is a probabilistic statistical model that determines the probability of a given sequence of words occurring in a sentence based on the previous words.
• In other words, grammatical inferences given to the machine let it make an appropriate prediction based on the probability that the sentence has occurred in a similar context before.
• To construct a sentence such as "This glass ...", we need to evaluate the possible combinations based on the rule given below:
• 1. P(This) = P(w1)
• 2. P(glass | This) = P(w2 | w1)
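The chain-rule probabilities above can be estimated from counts in a corpus. Below is a minimal bigram model sketch; the toy corpus and function names are illustrative assumptions, not part of the notes:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()   # <s> marks the sentence start
        for w1, w2 in zip(tokens, tokens[1:]):
            uni[w1] += 1
            bi[(w1, w2)] += 1
    return uni, bi

def bigram_prob(uni, bi, w1, w2):
    """P(w2 | w1) estimated by relative frequency."""
    return bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0

corpus = ["this glass is full", "this glass is empty"]
uni, bi = train_bigram(corpus)
print(bigram_prob(uni, bi, "this", "glass"))  # 1.0: "glass" always follows "this"
```

Multiplying such conditional probabilities along a sentence gives the probability of the whole word sequence.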
This section deals with words, their structure, and their models.
1.1.1 Tokens
1.1.2 Lexemes
1.1.3 Morphemes
1.1.4 Typology
1.2.1 Irregularity
1.2.2 Ambiguity
1.2.3 Productivity
This chapter mainly deals with sentence and topic detection or segmentation.
2.1 Introduction
2.2 Methods
This section deals with classical statistical approaches (generative and discriminative approaches).
Segmentation
Morphological Models
Introduction
Methods
NLP Terminology
Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
Steps in NLP
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis –
This phase scans the input text as a stream of characters and converts it into meaningful lexemes; it divides the whole text into paragraphs, sentences, and words.
Syntactic Analysis –
It is used to check grammar and word arrangements, and it shows the relationships among the words. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.
Semantic Analysis –
It draws the exact or dictionary meaning from the text and checks the text for meaningfulness.
Discourse Integration –
The meaning of any sentence depends upon the sentences that precede it, and it also invokes the meaning of the sentences that follow it.
Pragmatic Analysis –
It involves deriving those aspects of language which require real-world knowledge.
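The lexical-analysis phase above can be illustrated with a minimal tokenizer. This is only a sketch of phase 1; the regular expressions and example text are assumptions, not part of the notes:

```python
import re

def lexical_analysis(text):
    """Phase 1: split raw text into sentences, then into word tokens."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Within each sentence, keep words and punctuation as separate tokens.
    return [re.findall(r"[A-Za-z']+|[.,!?]", s) for s in sentences]

tokens = lexical_analysis("The boy goes to school. He reads books.")
print(tokens[0])  # ['The', 'boy', 'goes', 'to', 'school', '.']
```

The later phases (parsing, semantics, discourse, pragmatics) would then operate on these token lists.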
We use language to express our thoughts, and through language, we receive information and infer its meaning.
Linguists have developed whole disciplines that look at language from different perspectives.
The point of morphology, for instance, is to study the variable forms and functions of words.
Syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
The meaning of a linguistic expression is its semantics, and etymology and lexicology cover
especially the evolution of words and explain the semantic, morphological, and other links
among them.
Words are perhaps the most intuitive units of language, yet they are in general tricky to define.
Knowing how to work with them allows, in particular, the development of syntactic and semantic models of language.
Here, we first explore how to identify words of distinct types in human languages, and how the internal structure of words can be modelled in connection with their grammatical properties.
In English, word boundaries are usually marked by whitespace and punctuation, but in many other languages, the writing system leaves it up to the reader to tell words apart.
Words are defined in most languages as the smallest linguistic units that can form a complete utterance by themselves.
The minimal parts of words that deliver aspects of meaning to them are called
morphemes.
Tokens
Suppose, for a moment, that words in English are delimited only by whitespace and punctuation.
Example: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from syntax, we notice a problem here: the whitespace-delimited unit won’t actually corresponds to two syntactic words.
For reasons of generality, linguists prefer to analyze won’t as two syntactic words, or tokens, each of which has its independent role and can be reverted to its normalized form (will and not).
In English, this kind of tokenization and normalization may apply to just a limited set of cases, but in other languages, these phenomena have to be treated in a less trivial manner.
In Arabic or Hebrew, certain tokens are concatenated in writing with the preceding or the following ones.
The underlying lexical or syntactic units are thereby blurred into one compact string of letters.
Tokens behaving in this way can be found in various languages and are often called clitics.
In the writing systems of Chinese, Japanese, and Thai, whitespace is not used to separate words at all.
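The treatment of won’t described above can be sketched as a small rule-based tokenizer. The contraction table and regular expression below are illustrative assumptions, not a standard:

```python
import re

# Irregular contractions listed explicitly; regular n't clitics are split by rule.
CONTRACTIONS = {"won't": ["will", "n't"], "can't": ["can", "n't"]}

def tokenize(sentence):
    """Whitespace/punctuation tokenization plus normalization of English clitics."""
    tokens = []
    for raw in re.findall(r"[\w']+|[.,!?]", sentence):
        if raw.lower() in CONTRACTIONS:
            tokens.extend(CONTRACTIONS[raw.lower()])
        elif raw.lower().endswith("n't"):          # don't -> do + n't
            tokens.extend([raw[:-3], "n't"])
        else:
            tokens.append(raw)
    return tokens

print(tokenize("I won't read it."))  # ['I', 'will', "n't", 'read', 'it', '.']
```

Each resulting token has its independent syntactic role and can be mapped back to a normalized form.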
Lexemes
By the term word, we often denote not just the one linguistic form in the given context but also the concept behind the form and the set of alternative forms that can express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a language.
Lexemes can be divided by their behaviour into lexical categories, such as verbs, nouns, adjectives, and adverbs.
The citation form of a lexeme, by which it is commonly identified, is also called its lemma.
When we convert a word into its other forms, such as turning the singular mouse into the plural mice, we say we inflect the lexeme.
When we transform a lexeme into another one that is morphologically related, regardless of its lexical category, we say we derive the lexeme: for instance, the nouns receiver and reception are derived from the verb to receive.
Example: Did you see him? I didn’t see him. I didn’t see anyone.
• The example presents the problem of tokenization of didn’t and the investigation of the internal structure of anyone.
• In the paraphrase I saw no one, the lexeme to see would be inflected into the form saw to reflect its grammatical function of expressing positive past tense.
• Likewise, him is the oblique case form of he, or even of a more abstract lexeme, and anyone contrasts with nobody.
The difficulty with the definition of what counts as a word need not pose a problem in practice, as long as we adopt a consistent operational definition.
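The mapping from word forms to lexemes (lemmas) can be sketched as a simple lookup table; the entries below are illustrative:

```python
# Toy lexicon mapping inflected forms to (lemma, features); entries are illustrative.
LEXICON = {
    "mouse": ("mouse", {"num": "sg"}),
    "mice":  ("mouse", {"num": "pl"}),
    "saw":   ("see",   {"tense": "past"}),
    "him":   ("he",    {"case": "oblique"}),
}

def lemmatize(form):
    """Return the lemma (citation form) of a word form, or the form itself."""
    entry = LEXICON.get(form.lower())
    return entry[0] if entry else form.lower()

print(lemmatize("mice"))  # mouse
print(lemmatize("saw"))   # see
```

All forms that share a lemma belong to the same lexeme, so lemmatization collapses inflectional variation.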
Morphemes
Morphological theories differ on whether and how to associate the properties of word forms with their structural components.
These components are usually called segments or morphs; the morphs that by themselves represent some aspect of the meaning of a word are called morphemes.
• Human languages employ a variety of devices by which morphs and morphemes are combined into word forms.
Morphology
played = play-ed
cats = cat-s
unfriendly = un-friend-ly
play = play
replayed = re-play-ed
computerized = comput-er-ize-d
Inflectional morphology: inflected forms are constructed from base forms and inflectional
affixes.
Derivational morphology: words are constructed from roots (or stems) and derivational
affixes:
inter+national = international
international+ize = internationalize
internationalize+ation = internationalization
Consider agree-ment-s, where agree is a free lexical morpheme and the other elements are bound grammatical morphemes contributing some partial meaning to the whole word.
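The segmentations above (play-ed, un-friend-ly, re-play-ed) can be approximated by naive affix stripping. The affix lists below are tiny illustrative assumptions; real morphological analyzers are far more careful:

```python
PREFIXES = ["un", "re", "inter"]
SUFFIXES = ["ization", "ation", "ize", "ly", "ment", "ed", "s"]

def segment(word):
    """Greedily strip one known prefix and any known suffixes (a rough sketch)."""
    morphs, stem = [], word
    for p in PREFIXES:                       # at most one prefix
        if stem.startswith(p) and len(stem) > len(p) + 2:
            morphs.append(p)
            stem = stem[len(p):]
            break
    suffixes, changed = [], True
    while changed:                           # repeatedly peel suffixes
        changed = False
        for s in SUFFIXES:
            if stem.endswith(s) and len(stem) > len(s) + 2:
                suffixes.append(s)
                stem = stem[:-len(s)]
                changed = True
                break
    return morphs + [stem] + suffixes[::-1]

print(segment("unfriendly"))  # ['un', 'friend', 'ly']
print(segment("replayed"))    # ['re', 'play', 'ed']
```

The length guards prevent stripping an affix that would leave an implausibly short stem.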
In a more complex scheme, morphs can interact with each other, and their forms may undergo additional changes; such alternations are called morphophonemic.
Typology
Linguistic typology can consider various criteria, and during the history of linguistics, different classifications of languages have been proposed.
Let us outline the typology that is based on quantitative relations between words, their morphemes, and their features.
Isolating, or analytic, languages include no or relatively few words that would comprise more than one morpheme (typical members are Chinese, Vietnamese, and Thai; analytic tendencies are also found in English).
Synthetic languages can combine more morphemes in one word and are further divided into agglutinative and fusional languages.
Fusional languages are defined by their feature-per-morpheme ratio being higher than one.
In accordance with the notions about word formation processes mentioned earlier, we can also distinguish the ways in which languages combine their morphs.
Morphological Typology
Morphological typology is a way of classifying the languages of the world that groups languages according to their common morphological structures.
The field organizes languages on the basis of how those languages form words by combining morphemes.
Morphological typology classifies languages into two broad classes: synthetic languages and analytic languages.
The synthetic class is then further subclassified as either agglutinative languages or fusional languages.
• Analytic languages contain very little inflection, instead relying on features like word order and auxiliary words to convey meaning.
• Synthetic languages, ones that are not analytic, are divided into two types:
• Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and infixes) for inflection.
• Fusional languages "fuse" inflectional categories together, often allowing one word ending to contain several categories, such that the original root can be difficult to extract.
Ambiguity: word forms can be understood in multiple ways out of the context of their discourse.
Morphological parsing tries to eliminate the variability of word forms to provide higher-level linguistic units whose lexical and morphological properties are explicit and well defined.
By irregularity, we mean the existence of forms and structures that are not described appropriately by a general model.
Some irregularities can be understood by redesigning the model and improving its rules; others have to be listed explicitly in the lexicon of the language.
Morphological modelling also faces the problem of productivity and creativity in language, by which unconventional but perfectly meaningful new words or new senses are coined.
Irregularity
Morphological parsing is motivated by the quest for generalization and abstraction in the world of words.
Immediate descriptions of given linguistic data may not be the ultimate ones; better formulations of the model may be needed.
The design principles of the morphological model are therefore very important.
In Arabic, a deeper study of the morphological processes that are in effect during inflection and derivation, even for the so-called irregular words, is essential for mastering the whole system.
With the proper abstractions made, irregular morphology can be seen as merely enforcing some extended rules, the nature of which is phonological, over the underlying or prototypical regular word forms.
In the model proposed by Smrž, and by Smrž and Bielický, word forms are analyzed into immediate (I) and morphophonemic (M) templates, and morphophonemic merge rules operate on patterns and generic affixes without any context-dependent variation of the affixes or ad hoc exceptions.
(The accompanying table, not reproduced here, contrasts this model with a naive model of word structure in Arabic: its outer columns correspond to perfective (P) and imperfective (I) stems declared in the lexicon, and its inner columns treat active verb forms.)
The merge rules, very concise in nature, then ensure that such structured representations can be converted into exactly the right surface forms, both orthographic and phonological.
Applying the merge rules is independent of any grammatical parameters of the word.
Ambiguity
Word forms that look the same but have distinct functions or meanings are called homonyms.
Homonymy complicates morphological parsing and language processing at large.
The table (not reproduced here) arranges homonyms on the basis of their behaviour with different endings.
Because Arabic script usually does not encode short vowels and omits some other diacritical marks that would record the phonological form exactly, the degree of its morphological ambiguity is considerably increased.
When inflected syntactic words are combined in an utterance, additional ambiguity can arise: for instance, distinct grammatical cases can be expressed by the same written word form, as with ‘my study’ and ‘my teachers’.
Productivity
Is the inventory of words in a language finite? This question leads to two fundamental views of language, summarized in the distinction between langue and parole, or in the competence versus performance duality.
In one view, language can be seen as simply a collection of utterances (parole) actually pronounced or written.
This ideal data set can in practice be approximated by linguistic corpora, which are finite collections of linguistic data that are studied with empirical methods and can be used for comparison when language models are developed.
In the other view, language is a system (langue) with the potential to produce an unlimited number of utterances.
This general potential holds for morphological processes as well and is called
morphological productivity.
We denote the set of word forms found in a corpus of a language as its vocabulary.
The members of this set are word types, whereas every individual instance of a word form is a word token.
The distribution of words and other elements of language follows the “80/20 rule,” also known as the law of Pareto.
It says that most of the word tokens in a given corpus can be identified with just a couple of word types in its vocabulary, and words from the rest of the vocabulary occur much less commonly.
Furthermore, new, unexpected words will always appear as the collection of linguistic data is enlarged.
In Czech, for example, negation is highly productive: adjectives and adverbs can be prefixed with ne- to define the complementary lexical concept.
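The type/token distinction and the 80/20 intuition can be checked on a toy corpus; the helper below is a rough sketch (the 20%-of-types cut-off is a simplification):

```python
from collections import Counter

def vocabulary_stats(corpus_tokens):
    """Count word types vs. tokens; the vocabulary is the set of distinct forms."""
    counts = Counter(corpus_tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Share of tokens covered by the most frequent 20% of types (80/20 intuition).
    top = sorted(counts.values(), reverse=True)[: max(1, n_types // 5)]
    coverage = sum(top) / n_tokens
    return n_types, n_tokens, coverage

tokens = "the cat sat on the mat the cat ran".split()
types_, toks, cov = vocabulary_stats(tokens)
print(types_, toks)  # 6 types, 9 tokens
```

On realistic corpora, the coverage of the top few types grows dramatically, illustrating the skewed distribution.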
Morphological Models
There are many possible approaches to designing and implementing morphological models.
Over time, computational linguistics has developed formalisms and frameworks, in particular grammars of different kinds and expressive power, with which to address whole classes of problems in processing natural as well as formal languages.
Let us now look at the most prominent types of computational approaches to morphology.
Dictionary Lookup
Morphological parsing is a process by which word forms of a language are associated with the corresponding linguistic descriptions.
Morphological systems that specify these associations by merely enumerating them (stating them one after another), case by case, offer the simplest solution.
Likewise for systems in which analyzing a word form is reduced to looking it up verbatim in word lists, dictionaries, or databases, unless these are constructed by and kept in sync with more sophisticated models of the language.
The data structure can be optimized for efficient lookup, and the results can be shared between applications.
Dictionaries can be implemented, for instance, as lists, binary search trees, tries, or hash tables.
Because the set of associations between word forms and their desired descriptions is declared by plain enumeration, the coverage of the model is finite and the generative potential of the language is not exploited.
Despite all that, an enumerative model is often sufficient for the given purpose and deals easily with exceptions.
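An enumerative, dictionary-lookup model can be sketched as a plain hash table from word forms to analyses; the entries and tag notation below are illustrative:

```python
# Enumerative morphological model: every form is listed with its analysis.
ANALYSES = {
    "children": "child +Noun +Pl",
    "child":    "child +Noun +Sg",
    "went":     "go +Verb +Past",
}

def analyze(form):
    """Pure dictionary lookup: coverage is exactly the enumerated forms."""
    return ANALYSES.get(form)  # None for out-of-vocabulary forms

print(analyze("children"))  # child +Noun +Pl
print(analyze("dogs"))      # None: the model cannot generalize
```

The `None` result for *dogs* shows the limitation discussed above: plain enumeration has no generative capacity.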
Finite-State Morphology
The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.
Finite-state transducers are computational devices extending the power of finite-state automata.
They consist of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols.
In such a network or graph, nodes are also called states, while edges are called arcs.
Traversing the network from the set of initial states to the set of final states along the arcs is equivalent to reading the sequences of encountered input symbols and writing the sequences of corresponding output symbols.
The set of possible sequences accepted by the transducer defines the input language; the set of possible sequences emitted by the transducer defines the output language.
For example, a transducer could map the surface string merging to the lexical string merge +V +PRES-PART.
For example, a finite-state transducer could translate the infinite regular language consisting of the words vnuk, pravnuk, prapravnuk, ... to the matching words in the infinite regular language defined by grandson, great-grandson, great-great-grandson, ...
In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to the output descriptions as lexical strings, if the transducer is used for morphological analysis, or vice versa, if it is used for morphological generation.
Relations on languages can also be viewed as functions. Let us have a relation R, and let us denote by [Σ] the set of all sequences over some set of symbols Σ, so that the domain and the range of R are subsets of [Σ].
We can then consider R as a function mapping an input string into a set of output strings, formally denoted as R : [Σ] → 2^[Σ].
Finite-state technology can be applied to the morphological modeling of isolating and agglutinative
languages in a
quite straightforward manner. Korean finite-state models are discussed by Kim, Lee and Rim, and Han, to
mention
a few.
• In English, a finite-state transducer could analyze the surface string children into the lexical string child [+plural], for instance, or generate women from woman [+plural].
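The idea of states and arcs labeled with input/output symbol pairs can be sketched with a toy transducer that analyzes cat and cats; the state numbering, epsilon handling, and tag strings below are illustrative assumptions:

```python
# A tiny transducer: arcs map (state, input symbol) to (next state, output string).
ARCS = {
    (0, "c"): (1, "c"), (1, "a"): (2, "a"), (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+Pl"),   # plural suffix consumed, features emitted
    (3, ""):  (5, "+N+Sg"),   # epsilon arc: no suffix, singular features
}
FINAL = {4, 5}

def transduce(surface):
    """Read input symbols along arcs, concatenating the emitted output symbols."""
    state, out = 0, ""
    for ch in surface:
        if (state, ch) not in ARCS:
            return None                       # input rejected
        state, emitted = ARCS[(state, ch)]
        out += emitted
    if state in FINAL:
        return out
    if (state, "") in ARCS:                   # try a final epsilon transition
        state, emitted = ARCS[(state, "")]
        return out + emitted if state in FINAL else None
    return None

print(transduce("cats"))  # cat+N+Pl
print(transduce("cat"))   # cat+N+Sg
```

Here the surface strings are the input language and the lexical strings with +N tags are the output language; reversing the arcs would turn the analyzer into a generator.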
Unification-Based Morphology
The concepts and methods of these formalisms are often closely connected to those of logic programming.
Whereas in finite-state morphological models surface and lexical forms are by themselves unstructured strings, here linguistic information is expressed by feature structures, data structures that can include complex values or can be recursively nested if needed.
Erjavec argues that for morphological modelling, word forms are best captured by regular expressions, while the linguistic content is best described through typed feature structures.
Nodes are associated with types, and atomic values are represented by attributeless nodes identified by symbolic expressions.
Unification is the key operation by which feature structures can be merged into a more informative feature structure.
Unification of feature structures can also fail, which means that the information in them is mutually incompatible.
Morphological models of this kind are typically formulated as logic programs, and unification is used to solve the systems of constraints imposed by the model.
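Unification, including the possibility of failure, can be sketched over nested Python dicts (a simplification of typed feature structures):

```python
def unify(f, g):
    """Merge two feature structures; return None if any values clash."""
    result = dict(f)
    for key, value in g.items():
        if key not in result:
            result[key] = value                    # new information is added
        elif isinstance(result[key], dict) and isinstance(value, dict):
            nested = unify(result[key], value)     # recurse into substructures
            if nested is None:
                return None
            result[key] = nested
        elif result[key] != value:
            return None                            # unification failure
    return result

noun = {"cat": "noun", "agr": {"num": "pl"}}
verb_agr = {"agr": {"num": "pl", "per": 3}}
print(unify(noun, verb_agr))  # merged: agreement features are combined
print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))  # None (clash)
```

The failure case is exactly the "mutually incompatible information" situation described above.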
Functional Morphology
Functional morphology defines its models using principles of functional programming and type theory. It organizes the linguistic as well as abstract elements of a model into distinct types of values and function signatures.
Linguistic notions like paradigms, rules and exceptions, grammatical categories, and lexemes are all represented by such types and functions.
Morphological parsing is just one usage of the system, the others being morphological generation, lexicon browsing, and so on.
Functional morphology implementations are commonly embedded in Haskell; with this approach, morphologies of Latin, Swedish, Spanish, Urdu, and other languages have been implemented.
In Haskell, in particular, developers can take advantage of its syntactic flexibility and design their own notation for the functional constructs that model the given problem.
The notation then constitutes a so-called domain-specific embedded language, which makes programming even more fun.
Morphological grammars in Grammatical Framework can be extended with descriptions of the syntax and semantics of a language.
Grammatical Framework itself supports multilinguality, and models of more than a dozen languages are available in it as open-source software.
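A paradigm in functional morphology is essentially a function from a stem to its table of inflected forms. The sketch below is in Python rather than Haskell, and the Latin-style endings are illustrative:

```python
# A paradigm function: given a stem, produce its inflection table.
def first_declension(stem):
    """Latin-style first-declension paradigm (endings are illustrative)."""
    return {
        ("nom", "sg"): stem + "a",
        ("acc", "sg"): stem + "am",
        ("nom", "pl"): stem + "ae",
    }

forms = first_declension("puell")
print(forms[("acc", "sg")])  # puellam
```

Generation is a direct application of the paradigm function; parsing can be implemented by inverting the generated tables.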
2.1 Introduction
In human language, words and sentences do not appear randomly but usually have structure.
For example, combinations of words form sentences, which are meaningful grammatical units.
Automatic extraction of the structure of documents helps subsequent NLP tasks: for example, parsing, machine translation, and semantic role labelling use sentences as the basic processing unit.
Sentence boundary detection (sentence segmentation) is the task of deciding where sentences start and end, given a sequence of characters (made up of words and typographical cues).
Topic segmentation is the task of determining when a topic starts and ends in a sequence of sentences.
The statistical classification approaches described here try to detect the presence of sentence and topic boundaries.
These methods base their predictions on features of the input: local characteristics that give evidence for or against a boundary.
Features are the core of classification approaches and require careful design and selection in order to be successful.
Most statistical approaches described here are language independent, but every language is a challenge in itself.
For example, for processing of Chinese documents, the processor may need to first segment the character sequences into words, as the words usually are not separated by whitespace.
Similarly, for morphologically rich languages, the word structure may need to be analyzed before token boundaries can be determined.
Tokens can be words or sub-word units, depending on the task and language.
In written text in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the end of a sentence is explicitly marked with a period, question mark, or exclamation point.
In addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns, periods are used in abbreviations, and numbers and punctuation marks appear inside sentences.
The period at the end of an abbreviation can mark a sentence boundary at the same time.
For example, in "We saw Dr. Smith today," the abbreviation Dr. does not end a sentence, while in "She has a Ph.D." the abbreviation's period also ends the sentence.
Quoted sentences are especially problematic, as the speakers may have uttered multiple sentences, and sentence boundaries inside the quotes are also marked with punctuation marks.
An automatic method that marks sentence endings solely according to the presence of such punctuation marks would cut some sentences incorrectly.
Ambiguous abbreviations and capitalizations are not the only problems of sentence segmentation in written text.
Spontaneously written texts, such as short message service (SMS) texts or instant messages, are often ungrammatical and poorly punctuated.
Similarly, if the text input to be segmented into sentences comes from an automatic system, such as optical character recognition (OCR) or automatic speech recognition (ASR), that aims to translate images of
handwritten, typewritten, or printed text, or spoken utterances, into machine-editable text, then the finding of sentence boundaries must deal with the errors of those systems as well.
On the other hand, for conversational speech or text, such as multiparty meetings with ungrammatical sentences and disfluencies, in most cases it is not clear where the boundaries are.
Code switching, that is, the use of words, phrases, or sentences from multiple languages by multilingual speakers, is another problem that can affect the characteristics of sentences.
For example, when switching to a different language, the writer can either keep the punctuation rules of the first language or resort to the conventions of the second language.
Rule-based segmentation systems rely on punctuation marks to identify potential ends of sentences and on lists of abbreviations for disambiguating them.
For example, if the word before the boundary candidate is a known abbreviation, such as "Mr." or "Gov.," the text is not segmented at that position, even though some such periods do end sentences.
Sentence segmentation can also be posed as a machine learning problem.
Given training data where all sentence boundaries are marked, we can train a classifier to recognize them.
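The abbreviation-aware rule described above can be sketched as follows; the abbreviation list and regular expression are minimal illustrative assumptions:

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "gov.", "e.g.", "u.s."}

def segment_sentences(text):
    """Split after . ? ! unless the preceding token is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]\s+", text):
        end = match.start() + 1
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue            # the period belongs to an abbreviation, keep going
        sentences.append(text[start:end].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(segment_sentences("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', 'He sat down.']
```

A learned classifier would replace the fixed abbreviation list with features extracted around each candidate boundary.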
That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find the boundaries where topics change.
Topic segmentation is an important task for various language understanding applications, such as information retrieval and summarization.
For example, in information retrieval, if a long document can be segmented into shorter, topically coherent segments, then only the segment that is about the user's query could be retrieved.
During the late 1990s, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated the Topic Detection and Tracking (TDT) program to further the state of the art in finding and following new topics in streams of broadcast news.
One of the tasks in the TDT effort was segmenting a news stream into individual stories.
2.2 Methods
Sentence segmentation and topic segmentation can both be seen as a boundary classification problem.
Given a boundary candidate (between two word tokens for sentence segmentation and between two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual boundary.
Formally, let x ∈ X be the vector of features (the observation) associated with a candidate and y ∈ Y be the label predicted for that candidate.
The classification problem: given a set of training examples (x, y), find a function that assigns the most accurate possible label y to unseen examples x.
As an alternative to the binary classification problem, it is possible to model boundary types using finer-grained categories.
For segmentation in text be framed as a three-class problem: sentence boundary ba, without
Similarly spoken language, a three way classification can be made between non-boundaries
• For sentence or topic segmentation, the problem is defined as finding the most probable
• The natural unit of sentence segmentation is words and of topic segmentation is sentence, as
we can assume that topics typically do not change in the middle of a sentences.
The words or sentences are then grouped into stretches, each belonging to one sentence or topic; that is, word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.
The classification can be done at each potential boundary i (local modelling); then, the aim is to estimate the most probable boundary type ŷ_i for each candidate x_i.
Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show possible categories.
In purely local modelling, each decision is made locally.
However, consecutive boundary types can be related to each other. For example, in broadcast news speech, two consecutive sentence boundaries that form a single-word sentence are very infrequent.
In local modelling, features can be extracted from the context surrounding the candidate boundary.
ŷ_i = argmax over y_i in Y of P(y_i | x_i)
• It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types with the highest probability given the whole observation sequence.
Another categorization of methods is done according to the type of machine learning algorithm: generative versus discriminative.
Generative sequence models estimate the joint distribution P(X, Y) of the observations (words, punctuation) and the labels (sentence boundary, topic boundary).
Discriminative sequence models, however, focus on features that characterize the differences between the possible labellings of the examples.
Ŷ = argmax over Y of P(Y | X)    (2.2)
The most commonly used generative sequence classification method for topic and sentence segmentation is the hidden Markov model (HMM).
The probability in equation 2.2 is rewritten as the following, using the Bayes rule:
Ŷ = argmax over Y of P(Y | X) = argmax over Y of P(X | Y) P(Y) / P(X)
Here P(Y | X) is the probability of the label sequence Y given the observation sequence X.
P(X) in the denominator is dropped because it is fixed for different Y and hence does not change the result of the maximization.
The most important distinction is that whereas class densities P(x|y) are model assumptions in generative approaches, discriminative approaches such as boosting, maximum entropy, and regression are based on very different machine learning algorithms.
For sentence segmentation, supervised learning methods have primarily been applied to
newspaper articles.
Stamatatos, Fakotakis, and Kokkinakis used transformation-based learning (TBL) to infer rules for sentence boundary detection.
Many classifiers have been tried for the task: regression trees, neural networks, classification
trees, maximum entropy classifiers, support vector machines, and naïve Bayes classifiers.
The well-known TextTiling method of Hearst for topic segmentation uses a lexical cohesion metric in a word vector space.
Figure depicts a typical graph of similarity with respect to consecutive segmentation units.
Originally, two methods for computing the similarity scores were proposed: block comparison and vocabulary introduction.
The first, block comparison, compares adjacent blocks of text to see how similar they are.
Given two blocks, b1 and b2, each having k tokens (sentences or paragraphs), the similarity (topical cohesion) score is computed as the cosine between the word weight vectors of the two blocks.
The weights can be binary or may be computed using other information retrieval metrics, such as TF-IDF.
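The block-comparison score can be sketched as a cosine over word-count vectors (a simplified illustration; Hearst's full TextTiling also smooths the score curve and locates its valleys):

```python
# Cosine similarity between the word-count vectors of two adjacent blocks.
import math
from collections import Counter

def block_similarity(b1, b2):
    """Cosine similarity between the word-count vectors of two token blocks."""
    v1, v2 = Counter(b1), Counter(b2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

left = "the cat sat on the mat".split()
right = "the cat ate the fish".split()
print(round(block_similarity(left, right), 3))
```

A low similarity between adjacent blocks suggests a topic boundary between them.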
The second, the vocabulary introduction method, assigns a score to a token-sequence gap
on the basis of how many new words are seen in the interval in which it is the midpoint.
Similar to the block comparison formulation, given two consecutive blocks b1 and b2, each with w words, the topical cohesion score is computed with the following formula:
score = (NumNewTerms(b1) + NumNewTerms(b2)) / (2w)
where NumNewTerms(b) returns the number of terms in block b seen for the first time in the text.
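Under the formula above, the vocabulary-introduction score can be sketched as follows (the blocks and seen-word set are made-up inputs):

```python
# Sketch of the vocabulary-introduction score with the 2w normalization
# shown above; Hearst's published variant may differ in details.
def num_new_terms(block, seen):
    """Count terms in `block` not seen earlier in the text."""
    return sum(1 for t in block if t not in seen)

def vocab_introduction_score(b1, b2, seen):
    w = len(b1)  # both blocks are assumed to have w words
    return (num_new_terms(b1, seen) + num_new_terms(b2, seen)) / (2 * w)

seen = {"the", "cat", "sat"}          # words seen before this gap
b1 = ["the", "cat", "sat", "down"]    # one new term: "down"
b2 = ["a", "dog", "ran", "by"]        # four new terms
print(vocab_introduction_score(b1, b2, seen))  # (1 + 4) / 8 = 0.625
```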
In segmentation tasks, the sentence or topic decision for a given example(word, sentence,
paragraph) highly depends on the decision for the examples in its vicinity.
Therefore, sequence classification methods are commonly used: either generative models such as HMMs, or discriminative models with additional decoding stages that find the best assignment of labels over the whole sequence.
Examples of the latter include conditional random fields (CRFs), SVM-struct as an extension of SVMs, and maximum-margin Markov networks (M3N) as extensions of HMMs.
Contrary to local classifiers that predict sentence or topic boundaries independently, CRFs can oversee the whole sequence of boundary hypotheses to make their decisions.
In a given context and with a given set of observation features, one approach may be better than another.
These approaches can also be rated in terms of the complexity (time and memory) of their training and prediction algorithms.
The training of discriminative approaches is typically more complex than that of generative ones because they require multiple passes over the training data to adjust the feature weights.
However, generative models such as hidden event language models (HELMs) can handle training sets that are orders of magnitude larger, benefiting, for instance, from decades of newswire transcripts.
On the other hand, they work with only a few features (only words for HELMs) and cannot easily incorporate richer feature sets.
1. List and explain the challenges of morphological models. Mar 2021 [7]
2. Discuss the importance and goals of Natural Language Processing. Mar 2021 [8]
6. Differentiate between surface and deep structure in NLP with suitable examples. Sep 2021 [8]
7. Give some examples for early NLP systems. Sep 2021 [7]
9. With the help of a neat diagram, explain the representation of syntactic structure. Mar 2021 [8]
10. Elaborate on the models for ambiguity resolution in parsing. Mar 2021 [7]
13. Given the grammar S->AB|BB, A->CC|AB|a, B->BB|CA|b, C->BA|AA|b and the word w='aabb', apply top-down parsing to test whether the word belongs to the language.
14. Explain treebanks and their role in parsing. Sep 2021 [7]
Applications of NLP:
•Sentiment Analysis.
•Text Classification.
•Text Extraction.
•Machine Translation.
•Text Summarization.
•Market Intelligence.
•Auto-Correct.
Unit-II
Syntax Analysis:
2.4 Parsing Algorithms,
Parsing in NLP is the process of determining the syntactic structure of a text by analysing its constituent words based on an underlying grammar of the language.
Example grammar (one plausible CFG for the sentence 'Tom ate an apple'):
sentence -> noun_phrase verb_phrase
noun_phrase -> 'Tom' | determiner noun
verb_phrase -> verb noun_phrase
determiner -> 'an'
verb -> 'ate'
noun -> 'apple'
• Then, the outcome of the parsing process would be a parse tree, where sentence is the root,
intermediate nodes such as noun_phrase, verb_phrase etc. have children - hence they are
called non-terminals and finally, the leaves of the tree ‘Tom’, ‘ate’, ‘an’, ‘apple’ are
called terminals.
Parse Tree:
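Since the original figure is not reproduced here, the parse tree can be sketched as nested tuples (a hypothetical rendering using the same non-terminal names):

```python
# Parse tree for "Tom ate an apple" as nested tuples:
# (label, children...) with terminal leaves as plain strings.
tree = ("sentence",
        ("noun_phrase", "Tom"),
        ("verb_phrase",
         ("verb", "ate"),
         ("noun_phrase",
          ("determiner", "an"),
          ("noun", "apple"))))

def leaves(t):
    """Collect terminal leaves left to right."""
    if isinstance(t, str):
        return [t]
    label, *children = t
    return [w for c in children for w in leaves(c)]

print(" ".join(leaves(tree)))  # reading the leaves recovers the sentence
```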
A treebank can be defined as a linguistically annotated corpus that includes some kind of syntactic analysis for each sentence.
A sentence is parsed by relating each word to other words in the sentence which depend on it.
The syntactic parsing of a sentence consists of finding the correct syntactic structure of that sentence in a given grammar formalism.
Dependency grammar (DG) and phrase structure grammar(PSG) are two such formalisms.
PSG breaks sentence into constituents (phrases), which are then broken into smaller
constituents.
DG: syntactic structure consist of lexical items, linked by binary asymmetric relations called
dependencies.
Learning: scoring possible dependency graphs for a given sentence, usually by factoring the graphs into their component arcs.
Parsing: searching for the highest-scoring graph for a given sentence.
Syntax
In NLP, the syntactic analysis of natural language input can vary from being very low-level, such as simply tagging each word in the sentence with a part of speech (POS), to being very high-level, such as recovering a full structural analysis of the sentence.
In syntactic parsing, ambiguity is a particularly difficult problem because the most plausible analysis has to be chosen from an exponentially large number of possible analyses.
From tagging to full parsing, algorithms that can handle such ambiguity have to be carefully chosen.
Here we explore syntactic analysis methods from tagging to full parsing, and the use of such analyses in NLP applications.
There is a natural pause between the words derive and In in the example sentence, reflecting an underlying hidden structure.
Parsing can provide a structural description that identifies such a break in the intonation.
A simpler case: The cat who lives dangerously had nine lives.
In this case, a text-to-speech system needs to know that the first instance of the word lives is a verb and the second instance is a noun before it can produce the natural pronunciation.
This is an instance of the part-of-speech (POS) tagging problem, where each word in the sentence is assigned its grammatical category.
Another motivation for parsing comes from the natural language task of summarization, in
which several documents about the same topic should be condensed down to a small digest
of information.
Such a summary may be in response to a question that is answered in the set of documents.
In this case, a useful subtask is to compress an individual sentence so that only its relevant portions are retained.
For example:
Beyond the basic level, the operations of the three products vary widely.
The elegant way to approach this task is to first parse the sentence to find the various constituents, recursively partitioning the words in the sentence into individual phrases.
The output of the parser for the input sentence is shown in Fig.
In the sentence fragment, the capitalized phrase EUROPEAN COUNTRIES can be replaced with a shorter phrase without changing the essential meaning of the sentence.
• In contemporary NLP, syntactic parsers are routinely used in many applications, including
but not limited to statistical machine translation, information extraction from text collections, and question answering.
This implies that a parser requires some knowledge(syntactic rules) in addition to the input
sentence about the kind of syntactic analysis that should be produced as output.
One method to provide such knowledge to the parser is to write down a grammar of the language, for example a context-free grammar (CFG).
However, natural language is far too complex to simply list all its syntactic rules in terms of a CFG.
This leads to a second knowledge acquisition problem: not only do we need to know the syntactic rules for a particular language, but we also need to know which analysis is the most plausible for a given input sentence.
The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both of these knowledge acquisition bottlenecks in one stroke.
A treebank is simply a collection of sentences (also called a corpus of text), where each sentence is provided a complete syntactic analysis.
The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.
A lot of care is taken during the human annotation process to ensure that a consistent treatment is given to related grammatical phenomena.
There is no set of syntactic rules or linguistic grammar explicitly provided by a treebank; instead, a detailed set of assumptions about the syntax is typically used as an annotation guideline to help the human experts produce the single most plausible syntactic analysis for each sentence.
Treebanks solve the first knowledge acquisition problem of finding the grammar underlying
the syntax analysis because the syntactic analysis is directly given instead of a grammar.
In fact, the parser does not necessarily need any explicit grammar rules as long as it can faithfully produce the type of analysis found in the treebank it was trained on.
Because each sentence in a treebank has been given its most plausible (probable) syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntactic analyses of a sentence.
Two main approaches to syntax analysis are used to construct treebanks: dependency graphs and phrase structure trees.
These two representations are very closely related to each other, and under some assumptions one representation can be converted into the other.
Dependency analysis is typically favoured for languages, such as Czech and Turkish, that have relatively free word order.
Phrase structure analysis is often used to provide additional information about long-distance dependencies for languages, such as English, with relatively fixed word order.
• NLP is the capability of computer software to understand natural language.
• Each language has its own structure (SVO or SOV), called its grammar, which has certain rules; English, for example, is SVO: I eat mango.
• A production rule has the form α → β, where α and β are strings over VN ∪ ∑, and at least one symbol of α belongs to VN.
S -> NP VP
VP -> V NP
V -> hit
NP -> D N
D -> the
N -> John | ball
The main philosophy behind dependency graphs is to connect a word, the head of a phrase, with its dependents.
The notation connects a head with its dependents using directed (asymmetric) connections.
Dependency graphs, just like phrase structure trees, are a representation that is consistent with many different linguistic frameworks.
The words in the input sentence are treated as the only vertices in the graph, which are linked together by directed arcs representing syntactic head-dependent relations.
In dependency-based syntactic parsing, the task is to derive a syntactic structure for an input sentence by identifying the syntactic head of each word in the sentence.
This defines a dependency graph, where the nodes are the words of the input sentence and the arcs are the head-dependent relations between the words.
In dependency tree analyses, each word depends on exactly one parent: either another word or a dummy root symbol.
By convention, in a dependency tree the index 0 is used to indicate the root symbol, and the directed arcs are drawn from the head word to the dependent word.
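This convention can be sketched with a head-index list (an illustrative encoding; the sentence and head assignments below are made up):

```python
# A dependency tree encoded as a head-index list: position 0 is the dummy
# root, and heads[i] gives the head of word i (None for the root itself).
sentence = ["<root>", "John", "hit", "the", "ball"]
heads    = [None,      2,      0,     4,     2]  # hit <- root; John, ball <- hit; the <- ball

def dependents(i):
    """All positions whose head is position i."""
    return [j for j, h in enumerate(heads) if h == i]

print(dependents(2))  # positions headed by "hit"
```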
The figure shows a dependency tree for a Czech sentence taken from the Prague dependency treebank.
Each node in the graph is a word, its part of speech and the position of the word in the
sentence.
For example [fakulte, N3,7] is the seventh word in the sentence with POS tag N3.
The node [#, ZSB,0] is the root node of the dependency tree.
There are many variations of dependency syntactic analysis, but the basic textual format for a dependency tree is simple: each dependent word specifies its head word in the sentence, and exactly one word depends only on the root.
An important related notion is projectivity, a constraint imposed by the linear order of words on the dependencies between words.
A projective dependency tree is one where, if we put the words in a linear order based on the sentence with the root symbol in the first position, the dependency arcs can be drawn above the words without any crossing arcs.
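The no-crossing-arcs condition can be checked directly (a sketch over the head-index encoding introduced above, with index 0 as the dummy root):

```python
# A tree is projective if no two dependency arcs cross when drawn
# above the sentence (root at position 0).
def is_projective(heads):
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if h is not None]
    for (a1, b1) in arcs:
        for (a2, b2) in arcs:
            # two arcs cross if exactly one endpoint of the second
            # lies strictly inside the span of the first
            if a1 < a2 < b1 < b2:
                return False
    return True

print(is_projective([None, 2, 0, 4, 2]))   # no crossings: projective
print(is_projective([None, 3, 4, 0, 3]))   # arcs (1,3) and (2,4) cross
```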
A phrase structure syntax analysis of a sentence derives from the traditional sentence diagrams that partition a sentence into constituents; larger constituents are formed by merging smaller ones.
Phrase structure analysis also typically incorporates ideas from generative grammar (from linguistics) to deal with displaced constituents or apparent long-distance relationships between heads and constituents.
A sentence includes a subject and a predicate: the subject is a noun phrase (NP) and the predicate is a verb phrase (VP).
For example, consider the phrase structure analysis of the sentence Mr. Baker seems especially sensitive, taken from the Penn Treebank.
The subject of the sentence is marked with the SBJ marker, and the predicate of the sentence is marked with the PRD marker.
NNP: proper noun, singular; VBZ: verb, third person singular present.
The same sentence gets the following dependency tree analysis; some of the information from the bracketing labels of the phrase structure analysis gets mapped onto the labelled arcs of the dependency tree.
To explain some details of phrase structure analysis, consider the Penn Treebank, a project that annotated about 40,000 sentences from the Wall Street Journal with phrase structure trees.
• The SBARQ label marks wh-questions, i.e., those that contain a gap and therefore require a trace.
• Wh-moved noun phrases are labelled WHNP and put inside SBARQ. They bear an identity index that matches the reference index on the *T* in the position of the gap.
• However, questions that are missing both subject and auxiliary are labelled SQ.
• The *T* trace marks the site of wh-movement; this empty trace has an index (here it is 1) and is associated with the WHNP node carrying the same index.
Parsing Algorithms
• Treebank parsers do not need to have an explicit grammar, but to discuss the parsing algorithms it is convenient to assume that an explicit CFG is given.
• Consider a simple CFG G that can be used to derive strings such as a and b or c from the start symbol N; one plausible reconstruction of the example grammar is:
N -> N 'and' N
N -> N 'or' N
N -> 'a' | 'b' | 'c'
• For the input string a and b or c, a sequence of rule applications, each rewriting the rightmost nonterminal, produces the input string from N.
• This method is called the rightmost derivation of the input using a CFG.
• This derivation sequence exactly corresponds to the construction of the following parse tree.
• The grammar is ambiguous: another rightmost derivation results in a different parse tree.
• To build a parser, we need an algorithm that can perform the steps of the above rightmost derivation in reverse for any given input.
• Every CFG turns out to have an automaton that is equivalent to it, called a pushdown automaton.
• This yields an algorithm for parsing that is general for any given CFG and input string.
• The algorithm is called shift-reduce parsing; it uses two data structures: a buffer for input symbols and a stack for storing CFG symbols.
One possible shift-reduce action sequence for the input a and b or c (reconstructed from the partially preserved trace):
Step  Stack      Buffer          Action
1                a and b or c    Init
2     a          and b or c      Shift a
3     N          and b or c      Reduce N -> a
4     N and      b or c          Shift and
5     N and b    or c            Shift b
6     N and N    or c            Reduce N -> b
7     N          or c            Reduce N -> N and N
8     N or       c               Shift or
9     N or c                     Shift c
10    N or N                     Reduce N -> c
11    N                          Reduce N -> N or N
12    N                          Accept
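The shift-reduce algorithm can be sketched for this toy grammar as follows (a greedy recognizer specific to these rules, not a general backtracking parser):

```python
# A minimal shift-reduce recognizer for the grammar
#   N -> N 'and' N | N 'or' N | 'a' | 'b' | 'c'
def shift_reduce(tokens):
    stack, buffer = [], list(tokens)
    while buffer or stack != ["N"]:
        # Reduce whenever the top of the stack matches a rule's right side.
        if stack[-3:] == ["N", "and", "N"] or stack[-3:] == ["N", "or", "N"]:
            stack[-3:] = ["N"]
        elif stack and stack[-1] in ("a", "b", "c"):
            stack[-1] = "N"
        elif buffer:
            stack.append(buffer.pop(0))  # shift the next input symbol
        else:
            return False  # stuck: the input is not in the language
    return True

print(shift_reduce("a and b or c".split()))  # True
print(shift_reduce("a and and".split()))     # False
```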
• For ambiguous CFGs, in the worst case such a parser might have to resort to backtracking, which means re-parsing the input; this leads to a running time that is exponential in the grammar size in the worst case.
• Dynamic programming algorithms such as CYK are often used in statistical parsers that attempt to search the space of possible parse trees without the limitation of purely left-to-right parsing.
• One of the earliest recognition parsing algorithms is the CYK (Cocke-Younger-Kasami) algorithm; it requires the grammar to be in Chomsky normal form and runs in time cubic in the length of the input.
CYK example:
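A compact CYK recognizer can be sketched as follows (it assumes a grammar in Chomsky normal form; the toy grammar at the bottom is an illustrative assumption):

```python
# CYK recognition: chart[i][j] holds the nonterminals deriving words[i:j+1].
def cyk(words, term_rules, binary_rules, start="S"):
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                 # fill spans of length 1
        chart[i][i] = {lhs for lhs, t in term_rules if t == w}
    for span in range(2, n + 1):                  # grow to longer spans
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # try every split point
                for lhs, (b, c) in binary_rules:
                    if b in chart[i][k] and c in chart[k + 1][j]:
                        chart[i][j].add(lhs)
    return start in chart[0][n - 1]

# Toy CNF grammar: S -> A B, A -> 'a', B -> 'b'
terms = [("A", "a"), ("B", "b")]
binaries = [("S", ("A", "B"))]
print(cyk(["a", "b"], terms, binaries))   # True
print(cyk(["b", "a"], terms, binaries))   # False
```

The same chart, filled with rule probabilities instead of sets, underlies probabilistic CYK parsing.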
• Here we discuss the modelling aspects of parsing: how to design features and how to resolve ambiguity in parsing.
• Here we want to provide a model that matches the intuition that the second tree above is preferred over the first.
• The parses can be thought of as ambiguous (leftmost to rightmost) derivations of the following CFG:
• We assign scores or probabilities to the rules in the CFG in order to provide a score or probability for each derivation, and hence for each parse tree.
• From these rule probabilities, the only deciding factor for choosing between the two parses for John bought a shirt with pockets is the difference between the two rules NP -> NP PP and VP -> VP PP.
• The rule probabilities can be derived from a treebank. Consider a treebank with three trees t1, t2, and t3.
• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times, and t3 occurred 50 times, then the PCFG we obtain from this treebank assigns each rule a probability equal to its relative frequency among rules with the same left-hand side.
• For the input a a a there are two parses using the above PCFG; multiplying the probabilities of the rules used in each derivation, the probability P1 of the first parse (beginning 0.125 × 0.334) works out lower than the probability P2 of the second parse.
• The parse tree p2 is therefore the most likely tree for that input.
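The relative-frequency estimation can be sketched as follows; the rule multisets standing in for t1, t2, and t3 are hypothetical, chosen so that one rule probability works out to 0.125:

```python
# PCFG estimation by relative frequency: count each rule, weighted by how
# often its tree occurs, then normalize per left-hand side.
from collections import Counter, defaultdict

tree_rules = {  # hypothetical rule multisets for the three treebank trees
    "t1": [("S", ("A", "A")), ("A", ("a",)), ("A", ("a",))],
    "t2": [("S", ("A",)), ("A", ("a",))],
    "t3": [("S", ("A",)), ("A", ("a",))],
}
tree_counts = {"t1": 10, "t2": 20, "t3": 50}

counts = Counter()
for t, rules in tree_rules.items():
    for r in rules:
        counts[r] += tree_counts[t]

lhs_totals = defaultdict(int)
for (lhs, rhs), c in counts.items():
    lhs_totals[lhs] += c

pcfg = {r: c / lhs_totals[r[0]] for r, c in counts.items()}
print(pcfg[("S", ("A", "A"))])  # 10 / (10 + 20 + 50) = 0.125
```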
Generative models
• To find the most plausible parse tree, the parser has to choose among all the possible derivations.
• Let each derivation be D = d1, d2, ..., dn, the sequence of decisions used to build the parse tree.
• Then, for input sentence x, the output parse tree y is defined by the sequence of steps in the derivation, and its probability is the product of the decision probabilities, Π_i P(di | d1, ..., di-1).
• The conditioning context in the probability P(di | d1, ..., di-1) is called the history and corresponds to a partially built parse tree (as defined by the derivation sequence).
• We make a simplifying assumption that keeps the conditioning context to a finite set by mapping each history to an equivalence class (for example, by retaining only a bounded portion of the partially built tree).
• Collins created a simple notation and framework that describes various discriminative approaches.
• Let X be the set of inputs and Y be the set of possible outputs, which can be a sequence of POS tags or a parse tree.
• Each pair x ∈ X and y ∈ Y is mapped to a d-dimensional feature vector ø(x, y), with each dimension being a real-valued feature.
• A weight parameter vector w ∈ R^d assigns a weight to each feature in ø(x, y), representing the importance of that feature.
• The value of ø(x, y) · w is the score of (x, y). The higher the score, the more plausible it is that y is the output for x.
• The function GEN(x) generates the set of possible outputs y for a given x.
• Having ø(x, y) · w and GEN(x) specified, we choose the highest-scoring candidate as the output: y* = argmax over y in GEN(x) of ø(x, y) · w.
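Collins's linear framework can be sketched with a toy tagging example (the feature template, weights, and the candidate generator GEN are all illustrative assumptions):

```python
# Score each candidate output y with a dot product phi(x, y) . w and
# pick the argmax over GEN(x).
def phi(x, y):
    """Sparse feature vector: counts of (word, tag) pairs."""
    feats = {}
    for word, tag in zip(x, y):
        feats[(word.lower(), tag)] = feats.get((word.lower(), tag), 0) + 1
    return feats

w = {("the", "DT"): 2.0, ("dog", "NN"): 2.0, ("dog", "VB"): -1.0}

def score(x, y):
    return sum(v * w.get(f, 0.0) for f, v in phi(x, y).items())

def GEN(x):
    # Hypothetical candidate generator: a fixed list of tag sequences.
    return [["DT", "NN"], ["DT", "VB"]]

x = ["The", "dog"]
best = max(GEN(x), key=lambda y: score(x, y))
print(best)  # ["DT", "NN"] scores 4.0, ["DT", "VB"] scores 1.0
```

In a real system GEN(x) would enumerate (or search) a much larger candidate space, and w would be learned, e.g. with the perceptron algorithm.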
• A conditional random field (CRF) defines the conditional probability as a globally normalized linear score for each candidate: P(y | x) = exp(ø(x, y) · w) / Z(x), where Z(x) sums exp(ø(x, y') · w) over all y' in GEN(x).
• There are two general approaches to parsing: 1. top-down parsing (start with the start symbol and expand toward the words) and 2. bottom-up parsing (start with the words and build up toward the start symbol).
Unit-III
• A meaning representation should be specific enough to allow a reasoning system to make deductions (deduction: the process of reaching a decision or answer by thinking about the known facts).
• But, at the same time, it should be general enough that it can be used across many domains with little to no adaptation.
• It is not clear whether a final, low-level, detailed semantic representation covering various applications and domains can ever be constructed.
• It is also unclear whether an ontology (a branch of metaphysics concerned with the nature and relations of being) can be created that captures the various granularities and aspects of meaning embodied in a variety of domains.
• Since neither of these has yet been achieved, two compromise approaches have emerged in NLP research.
• In the first approach, a specific, rich meaning representation is created for a limited domain, for use by applications that are restricted to that domain, such as travel reservations or football game simulations.
• In the second approach, a set of related, intermediate representations is created, going from a low-level analysis to a middle analysis, and the bigger understanding task
is divided into multiple, smaller pieces that are more manageable, such as word sense disambiguation and predicate-argument structure recognition.
• The task of producing the output of the first type is often called deep semantic parsing, and
the task of producing the output of the second type is often called shallow semantic parsing.
• The first approach is so specific that porting it to every new domain can require a substantial amount of fresh effort.
• In other words, the reusability of the representation across domains is very limited.
• The problem with the second approach is that it is extremely difficult to construct a general-purpose ontology and create symbols that are shallow enough to be learnable but detailed enough to be useful across domains.
• Therefore, an application-specific translation layer between the more general representation and the application is often needed.
2.Semantic Interpretation
• Semantic parsing can be considered part of semantic interpretation, which involves various components that together define a representation of text that can be fed into a computer to allow further computational manipulation and search, which are prerequisites for any language understanding system or application. Here we start with the structure of semantic theory.
A semantic theory should be able to:
1. Explain sentences having ambiguous meaning: The bill is large is ambiguous in the sense that bill can refer to a banknote or to the beak of a bird.
2. Resolve the ambiguities of words in context: in The bill is large but need not be paid, the context makes clear that the monetary sense of bill is intended.
3. Identify meaningless but syntactically well-formed sentences: Colorless green ideas sleep furiously.
2.1.Structural ambiguity
• Resolving the syntactic structure means transforming a sentence into its underlying syntactic representation.
2.2.Word Sense
• In any given language, the same word type is used in different contexts and with different meanings.
• For example, we use the word nail to represent a part of the human anatomy and also to represent a metallic object used to fasten things.
2.3.Entity and Event Resolution
• The next level of semantic structure is the identification of the various entities that are sprinkled across the discourse using the same or different phrases.
• Two predominant tasks have become popular over the years: named entity recognition and
coreference resolution.
2.4.Predicate-Argument Structure
• Once we have word senses, entities, and events identified, another level of semantic structure comes into play: identifying the participation of the entities in these events.
• Resolving the argument structure of the predicates in a sentence is where we identify which entities play what roles in which events.
• A word which functions as the verb does here is called a predicate, and words which function as the nouns do are called arguments. Here are some other predicates and arguments:
• Selena slept (argument: Selena; predicate: slept)
• Tom is tall (argument: Tom; predicate: is tall)
• “Sanchon serves vegetarian food” can be described in FOPC as: Serves(Sanchon, VegetarianFood)
• Generally, this process can be defined as the identification of who did what to whom, when, where, why, and how.
2.5.Meaning Representation
• The final process of semantic interpretation is to build a semantic representation, or meaning representation, that can then be manipulated by algorithms to serve various application ends.
3.System Paradigms
• It is important to get a perspective on the various primary dimensions along which the problem of semantic interpretation has been tackled.
• The approaches generally fall into the following three categories: 1. System architecture, 2. Scope, 3. Coverage.
1.System Architectures
a.Knowledge based: These systems use a predefined set of rules or a knowledge base to obtain a solution to a new problem.
b.Unsupervised: These systems tend to require minimal human intervention to be functional, using existing resources that can be bootstrapped for a particular application or problem domain.
c.Supervised: These systems involve the manual annotation of some phenomena in a sufficient quantity of data so that machine learning models can be trained on it.
d.Semi-Supervised: Manual annotation is usually very expensive and does not yield enough data to completely capture a phenomenon. In such instances, researchers can automatically expand the data set on which their models are trained, either by employing machine-generated output directly or by having humans correct a model's output.
2.Scope:
a.Domain Dependent: These systems are specific to certain domains, such as air travel reservations.
b.Domain Independent: These systems are general enough that the techniques can be applicable to multiple domains with little or no change.
3.Coverage:
a.Shallow: These systems tend to produce an intermediate representation that can then be converted into a form that a machine can act upon.
b. Deep: These systems usually create a terminal representation that is directly consumed by a
machine or application.
4.Word Sense
• Word sense disambiguation is an important task in NLP by which the meaning of a word is determined according to the context in which it is used.
• In a compositional approach to semantics, where the meaning of the whole is composed on the
meaning of parts, the smallest parts under consideration in textual discourse are typically the
words themselves: either tokens as they appear in the text or their lemmatized forms.
• Word sense has been examined and studied for a very long time.
• Attempts to solve this problem range from rule-based and knowledge-based approaches to completely unsupervised, supervised, and semi-supervised learning methods.
• Very early systems were predominantly rule based or knowledge based and used dictionary definitions of the senses of words.
• Unsupervised word sense induction and disambiguation techniques try to induce the senses of a word from unannotated text.
• These systems perform either a hard or soft clustering of words and tend to allow the tuning of the granularity of the sense clusters.
• Word sense ambiguities can be of three principal types: i.homonymy ii.polysemy iii.categorial
ambiguity.
• Homonymy is defined as words having the same spelling or form but different and unrelated meanings. For example, the word bat is a homonym because it can refer to a flying mammal or to an implement used to hit a ball.
• Polysemy is a Greek word meaning "many signs". A polysemous word has the same spelling but several different, related meanings.
• Both polysemous and homonymous words have the same spelling. The main difference between them is that in polysemy the meanings of the words are related, whereas in homonymy they are not.
• Polysemy: financial bank, bank of clouds, and book bank all indicate a collection of things.
• Categorial ambiguity: the word book can mean a volume that contains chapters, or the act of entering charges against someone in a police register.
• In the first use, book belongs to the grammatical category of noun; in the second, it is a verb.
• Distinguishing between these two categories effectively helps disambiguate these two senses.
• Therefore, categorial ambiguity can often be resolved with syntactic information (part of speech) alone.
• Traditionally, in English, word senses have been annotated for each part of speech separately,
whereas in Chinese, the sense annotation has been done per lemma.
Resources
• As with any language understanding task, the availability of resources is a key factor in the advancement of the field.
• Early work on word sense disambiguation used machine-readable dictionaries or thesauruses as knowledge sources.
• Two prominent sources were the Longman dictionary of contemporary English (LDOCE) and
Roget’s Thesaurus.
• The biggest sense-annotated corpus, OntoNotes, was released through the Linguistic Data Consortium (LDC).
Systems
• Researchers have explored various system architectures to address the sense disambiguation
problem.
• We can classify these systems into four main categories: i. rule based or knowledge based, ii. supervised, iii. unsupervised, and iv. semi-supervised.
Rule Based:
• The first generation of word sense disambiguation systems was primarily based on dictionary sense definitions.
• Much of this information is historical and cannot readily be translated and made available for building systems today, but some of the techniques and algorithms are still in use.
• The simplest and oldest dictionary-based sense disambiguation algorithm was introduced by Lesk.
• The core of the algorithm is that the dictionary sense whose definition terms most closely overlap with the terms in the context of the target word is chosen as the correct sense.
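A simplified version of Lesk's idea can be sketched as follows (the sense glosses are abbreviated, hypothetical dictionary entries):

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with the
# context words of the target word.
def simplified_lesk(context_words, senses):
    """senses: mapping sense-name -> gloss string; returns the best sense."""
    context = set(w.lower() for w in context_words)
    def overlap(gloss):
        return len(context & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))

senses_of_bank = {
    "bank/finance": "an institution that accepts deposits of money and lends it",
    "bank/river": "the sloping land beside a body of water such as a river",
}
context = "the bank accepts deposits and lends money".split()
print(simplified_lesk(context, senses_of_bank))  # "bank/finance"
```

Real implementations typically remove stopwords and may extend glosses with the glosses of related senses.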
• One influential study used Roget's Thesaurus categories and classified unseen words into one of these 1042 categories, based on a statistical analysis of 100-word concordances for each member of each category.
• The first step of this method collects contexts for each category; the second step computes weights for each of the salient words.
• P(w|Rcat) is the probability of a word w occurring in the context of a Roget's Thesaurus category Rcat.
• Each salient word is weighted by P(w|Rcat) / P(w), the probability of the word appearing in the context of a Roget category divided by its overall probability in the corpus.
• Finally, in the third step, the unseen words in the test set are classified into the category that maximizes the sum of these weights over the context.
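The weighting and classification steps can be sketched with toy probabilities (all numbers below are assumptions for illustration):

```python
# Score each category by summing log(P(w|Rcat)/P(w)) over context words.
import math

p_w = {"money": 0.01, "water": 0.01, "the": 0.1}   # corpus probabilities
p_w_given_cat = {                                   # per-category probabilities
    "FINANCE": {"money": 0.05, "water": 0.001, "the": 0.1},
    "RIVERS":  {"money": 0.001, "water": 0.06, "the": 0.1},
}

def category_score(context, cat):
    return sum(math.log(p_w_given_cat[cat].get(w, 1e-6) / p_w[w])
               for w in context if w in p_w)

context = ["the", "money", "money"]
best = max(p_w_given_cat, key=lambda c: category_score(context, c))
print(best)  # FINANCE
```

Words that are equally likely everywhere (like "the") contribute a weight near zero, so only the salient words decide the category.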
Supervised
• The simpler form of word sense disambiguation system is the supervised approach, which
tends to transfer all the complexity to the machine learning machinery while still requiring
hand annotation; it tends to be superior to unsupervised approaches and performs best when
tested on annotated data.
• These systems typically consist of a machine learning classifier trained on features extracted
for words that have been manually disambiguated in a given corpus, and the application of the
resulting models to disambiguate words in unseen test sets.
• A useful property of these systems is that the user can incorporate rules and knowledge in the
form of features.
Classifier:
• Probably the most common and best-performing classifiers are support vector machines (SVMs).
Features: Here we discuss a commonly used subset of features that have proved useful in word sense
disambiguation.
• Lexical context: this feature comprises the words, and lemmas of words, occurring in a window
surrounding the target word, or in the entire sentence or paragraph.
• Part of speech: this feature comprises the POS tags for words in the context window.
• Bag of words context: this feature comprises using an unordered set of words in the context
window.
• Local collocations: local collocations are an ordered sequence of phrases near the target word
that provide semantic context for disambiguation. Usually a very small window of about
three tokens on each side of the target word, most often in contiguous pairs or triplets, is used.
• Syntactic relations: if the parse of the sentence containing the target word is available, then
syntactic features such as the head word and its dependency relations can be used.
• Topic features: the broad topic, or domain, of the article that the word belongs to is also a good
indicator of its likely sense.
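A minimal sketch of how the contextual features above might be extracted for a target word; the window size, feature-naming scheme, and example sentence are assumptions for illustration.

```python
def extract_features(tokens, target_index, window=3):
    """Collect bag-of-words and local-collocation features
    around the word at target_index."""
    features = {}
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    # Bag-of-words context: unordered words in the window
    for tok in tokens[lo:hi]:
        if tok != tokens[target_index]:
            features["bow=" + tok.lower()] = 1
    # Local collocations: words at fixed offsets from the target
    for offset in (-2, -1, 1, 2):
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            features["coll[%d]=%s" % (offset, tokens[pos].lower())] = 1
    return features

# Target word "bank" at index 5:
feats = extract_features("He sat on the river bank fishing".split(), 5)
print(sorted(feats))
```

Feature dictionaries of this kind would then be fed to a classifier such as an SVM, one training instance per annotated occurrence of the target word.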
• Semantic analysis starts with lexical semantics, which studies individual words' meanings.
• Semantic analysis then examines relationships between individual words and analyzes the
meaning of the sentence as a whole.
https://www.youtube.com/watch?v=eEjU8oY_7DE
https://www.youtube.com/watch?v=W7QdqCrX_mY
https://www.youtube.com/watch?v=XLvv_5meRNM
https://www.geeksforgeeks.org/understanding-semantic-analysis-nlp/#:~:text=Semantic%20Analysis%20is%20a%20subfield,process%20to%20us%20as%20humans.
In deep parsing, the search strategy gives a complete syntactic structure to a sentence; it is
suitable for complex NLP applications. In shallow parsing, the task is to parse a limited part of
the syntactic information from the given text; it can be used for less complex NLP applications.
Once an expression has been fully parsed and its syntactic ambiguities resolved, its meaning should be
uniquely determined.
Semantic translation is the process of using semantic information to aid in the translation of data in one
representation or data model to another representation or data model.
• Semantic meaning can be studied at several different levels within linguistics. The three major types
of semantics are formal, lexical, and conceptual semantics.
Categories of Semantics
Nick Riemer, author of Introducing Semantics, goes into detail about the two categories of semantics:
"Based on the distinction between the meanings of words and the meanings of sentences, we can
recognize two main divisions in the study of semantics: lexical semantics and phrasal semantics.
Lexical semantics is the study of word meaning, whereas phrasal semantics is the study of the
principles which govern the construction of the meaning of phrases and of sentences."
For example, lexical semantics deals with word meanings such as: a bird's bill, also called a beak,
refers to the bird's horny projecting jaws.
Unit-IV
• Shallow semantic parsing, or semantic role labelling, is the process of identifying the various
semantic arguments of a predicate in a sentence and classifying them into their roles.
• In linguistics, the predicate refers to the main verb in the sentence. A predicate takes arguments.
• The role of Semantic Role Labelling (SRL) is to determine how these arguments are
semantically related to the predicate.
• Consider the sentence "Mary loaded the truck with hay at the depot on Friday".
• 'Loaded' is the predicate. Mary, truck and hay have the respective semantic roles of loader, bearer
and cargo.
• The job of SRL is to identify these roles so that NLP tasks can "understand" the sentence.
• Often an idea can be expressed in multiple ways. Consider these sentences that all mean the
same thing: "Yesterday, Kristina hit Scott with a baseball"; "Scott was hit by Kristina yesterday
with a baseball"; "With a baseball, Kristina hit Scott yesterday"; "Kristina hit Scott with a
baseball yesterday".
• Either constituent or dependency parsing will analyze these sentences syntactically, but their
surface structures differ; SRL assigns the same semantic roles in every case.
• SRL is useful in any NLP application that requires semantic understanding: machine translation,
information extraction, question answering, and summarization, among others.
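The point that different surface orders share one semantic analysis can be illustrated with a small sketch; the role labels follow the PropBank-style ARGN convention, and the hand-written assignments are illustrative only.

```python
# Hand-annotated, PropBank-style analyses for two surface
# realizations of the same event (assignments are illustrative).
analyses = {
    "Yesterday, Kristina hit Scott with a baseball":
        {"predicate": "hit", "ARG0": "Kristina", "ARG1": "Scott",
         "ARG2": "a baseball", "ARGM-TMP": "yesterday"},
    "Scott was hit by Kristina yesterday with a baseball":
        {"predicate": "hit", "ARG0": "Kristina", "ARG1": "Scott",
         "ARG2": "a baseball", "ARGM-TMP": "yesterday"},
}
# Despite different word orders, the role assignments coincide:
roles = list(analyses.values())
print(all(r == roles[0] for r in roles))  # → True
```

This normalization over surface variation is exactly what makes SRL output useful to downstream applications.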
Resources:
• The late 1990s saw the emergence of two important corpora that are semantically tagged: one is
FrameNet and the other is PropBank.
• These resources have begun a transition from a long tradition of predominantly rule-based
approaches to data-driven, machine learning approaches.
• These approaches focus on transforming linguistic insights into features rather than into rules,
letting a machine learning framework use those features to learn a model that helps automatically tag
new text.
• FrameNet is based on the theory of frame semantics, where a given predicate invokes a semantic
frame, thus instantiating some or all of the possible semantic roles belonging to that frame.
• PropBank, on the other hand, is based on Dowty's prototype theory and takes a more
linguistically neutral view, in which each predicate has a set of core arguments that are
predicate-dependent, while all predicates share a set of non-core, or adjunctive, arguments.
FrameNet
• The process of FrameNet annotation consists of identifying specific semantic frames and
defining a set of frame elements (roles) for each frame.
• Then, a set of predicates that instantiate the semantic frame, irrespective of their
grammatical category, are identified, and a variety of sentences are labelled for those
predicates.
• The labelling process entails identifying the frame that an instance of the predicate lemma
invokes, then identifying the semantic arguments for that instance and tagging them with one of
the frame elements defined for that frame.
• The combination of the predicate lemma and the frame that its instance invokes is called a
lexical unit.
• The arguments are tagged as either core arguments, with labels of the type ARGN, where N
takes values from 0 to 5, or adjunctive arguments (listed in the table), with labels of the type
ARGM-X, where X can take values such as TMP for temporal, LOC for locative, and so on.
• Adjunctive arguments share the same meaning across all predicates, whereas the meaning of
core arguments is predicate-specific: ARG0 is the PROTO-AGENT (usually the subject of a
transitive verb) and ARG1 is the PROTO-PATIENT (usually its direct object).
• Table 4-1 shows a list of core arguments for the predicates operate and author.
• Note that some core arguments, such as ARG2 and ARG3, do not occur with author.
• This is explained by the fact that not all core arguments can be instantiated by all senses of all
predicates.
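To make the label inventory concrete, the earlier example sentence can be annotated with PropBank-style labels as sketched below; the span boundaries and role assignments are illustrative, not quoted from the actual PropBank annotation.

```python
# PropBank-style annotation of "Mary loaded the truck with hay
# at the depot on Friday" (labels are illustrative).
annotation = {
    "predicate": "loaded",
    "ARG0": "Mary",              # PROTO-AGENT: the loader
    "ARG1": "the truck",         # PROTO-PATIENT: the bearer
    "ARG2": "with hay",          # the cargo
    "ARGM-LOC": "at the depot",  # locative adjunct
    "ARGM-TMP": "on Friday",     # temporal adjunct
}
# Core arguments are the ARGN labels without an ARGM- prefix:
core = [k for k in annotation if k.startswith("ARG") and "-" not in k]
print(sorted(core))  # → ['ARG0', 'ARG1', 'ARG2']
```

Note how the adjunctive ARGM-LOC and ARGM-TMP labels would carry the same meaning for any predicate, while ARG0 through ARG2 are specific to this sense of load.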
• A list of the core arguments that can occur with a particular sense of the predicate, along with their
real-world meaning, is kept in a file called the frames file. One frames file is associated with each
predicate lemma.
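The structure of such a frames file can be sketched as a simple mapping from predicate senses to their core arguments; the role descriptions below are paraphrased examples, not quoted from the actual PropBank frames files.

```python
# Sketch of frames-file entries for two predicate senses
# (role descriptions are illustrative).
frames = {
    "operate.01": {"ARG0": "operator, agent",
                   "ARG1": "thing operated"},
    "author.01":  {"ARG0": "writer",
                   "ARG1": "thing written"},
}

def core_arguments(predicate_sense):
    """Return the core argument labels licensed by a predicate sense."""
    return sorted(frames[predicate_sense])

print(core_arguments("author.01"))  # → ['ARG0', 'ARG1']
```

Consistent with the discussion above, a sense licenses only the core arguments listed in its frames-file entry; labels such as ARG2 or ARG3 are simply absent for author.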