NLP UNIT 2 Part 2
LEXICAL RESOURCES
Introduction:-
A lexical resource is a collection of words and/or phrases along with associated information, such as part-of-speech and sense definitions. Lexical resources are secondary to texts and are usually created and enriched with the help of texts. They typically cover four lexical categories: nouns, verbs, adjectives and adverbs.
WordNet:-
WordNet is a large lexical database of English words. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called "synsets", each expressing a distinct concept. Synsets are interlinked by conceptual-semantic and lexical relations such as hyponymy and antonymy.
Each distinct meaning of a word is called a sense. For example, WordNet lists the following senses for the word "read":
Noun:
1. read (something that is read) "the article was a very good read"
Verb:
1. read (interpret something that is written or printed)
"Read the advertisement"; "Have you read Salman Rushdie?"
2. read (have or contain a certain wording or form)
"The passage reads as follows"; "What does the law say?"
3. read, scan
"This dictionary can be read by the computer"
4. learn, study, read, take
"She is reading for the bar exam"
5. read
"I read you loud and clear!"
The word "read" has one sense as a noun and five senses as a verb. The figure below shows some of the relationships that hold between nouns, verbs, adjectives and adverbs.
Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy
relation, whereas adjectives are organized into clusters based on antonym pairs.
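The synset and relation structure described above can be explored with NLTK's WordNet interface. The sketch below is a minimal illustration, assuming NLTK is installed and the WordNet corpus has been downloaded (nltk.download('wordnet')).

```python
from nltk.corpus import wordnet as wn

# List the synsets (senses) of the word "read" with their definitions.
for syn in wn.synsets('read'):
    print(syn.name(), '-', syn.definition())

# Follow conceptual-semantic relations such as hypernymy/hyponymy.
dog = wn.synset('dog.n.01')
print(dog.hypernyms())      # more general concepts, e.g. canine, domestic animal
print(dog.hyponyms()[:3])   # more specific kinds of dog
```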
Applications of WordNet:-
WordNet has found numerous applications in problems related to IR and NLP.
Document Summarisation:-
WordNet has found useful applications in text summarization. Lexical-chain-based summarizers use information from WordNet to compute lexical chains, i.e. sequences of semantically related words that indicate the main topics of a document.
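The core idea behind lexical chains can be sketched briefly: words are grouped into the same chain when WordNet links them through a shared synset, a direct hypernym link, or a common hypernym. The snippet below is only a rough illustration of that relatedness test, not a full summarizer, and the word list is made up.

```python
from nltk.corpus import wordnet as wn

def related(w1, w2):
    # Crude relatedness test: shared synset, direct hypernymy,
    # or a shared direct hypernym (sibling concepts).
    s1 = set(wn.synsets(w1, pos=wn.NOUN))
    s2 = set(wn.synsets(w2, pos=wn.NOUN))
    h1 = {h for s in s1 for h in s.hypernyms()}
    h2 = {h for s in s2 for h in s.hypernyms()}
    return bool(s1 & s2 or s1 & h2 or s2 & h1 or h1 & h2)

words = ['car', 'automobile', 'truck', 'weather', 'storm']  # hypothetical document nouns
chains = []
for w in words:
    for chain in chains:
        if any(related(w, c) for c in chain):
            chain.append(w)
            break
    else:
        chains.append([w])
print(chains)  # e.g. [['car', 'automobile', 'truck'], ['weather', 'storm']]
```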
● FrameNet:
FrameNet is a large database of semantically annotated English sentences. It is based on the principles of frame semantics. It defines a tagset of semantic roles called frame elements.
FrameNet captures situations through a case-frame representation of words (verbs, adjectives and nouns). The word that invokes a frame is called the target word or predicate, and the participating entities are described using semantic roles, which are called frame elements.
Each frame contains a main lexical item as the predicate and associated frame-specific semantic roles, such as AUTHORITIES, TIME and SUSPECT in the ARREST frame; these roles are called the frame elements.
Example:
[The police]AUTHORITIES nabbed [the suspect]SUSPECT.
The above sentence is annotated with the semantic roles AUTHORITIES and SUSPECT. The target word in the sentence is 'nab', which is a verb in the ARREST frame.
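FrameNet can be queried through NLTK's FrameNet corpus reader. A minimal sketch, assuming the framenet_v17 corpus has been downloaded, looks up the ARREST frame and lists its frame elements and the lexical units (such as nab.v) that evoke it.

```python
from nltk.corpus import framenet as fn   # requires nltk.download('framenet_v17')

arrest = fn.frame('Arrest')              # look up the frame by name
print(sorted(arrest.FE.keys()))          # frame elements, e.g. Authorities, Suspect, Time, ...
print(sorted(arrest.lexUnit.keys()))     # lexical units that evoke the frame, e.g. 'nab.v', 'arrest.v'
```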
FrameNet Applications:-
The shallow semantic role obtained from FrameNET can play an important role in information
extraction.
For example, semantic roles make it possible to identify that the theme role played by 'match' is the same in sentences (1) and (2), even though its grammatical role is different.
(1) The Empire stopped the match.
(2) The match stopped due to bad weather.
In sentence (1) the word ’match’ is the object, while it is the subject in sentence (2).
STEMMERS:-
Stemming is an NLP technique that is used to reduce words to their base form, also known as the
root form. The process of stemming is used to normalize text and make it easier to process. It is
an important step in text pre-processing, and it is commonly used in information retrieval and
text mining applications.
A stemming algorithm reduces the words "chocolates", "chocolatey" and "choco" to the root word "chocolate", and reduces "retrieval" and "retrieved" to the stem "retrieve".
There are several different algorithms for stemming, including the Porter stemmer, the Snowball stemmer and the Lancaster stemmer.
The Porter stemmer is used to remove common suffixes from the words.
The Snowball stemmer is more advanced and is based on the Porter stemmer, but it also supports several other languages in addition to English.
The Lancaster stemmer is more aggressive and often less accurate than the Porter and Snowball stemmers.
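All three stemmers are available in NLTK; the following minimal sketch compares them on the example words used above.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')   # Snowball also supports other languages
lancaster = LancasterStemmer()

for word in ['chocolates', 'chocolatey', 'retrieval', 'retrieved', 'likely']:
    print(word, '->', porter.stem(word), snowball.stem(word), lancaster.stem(word))
```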
Stemming can be useful for several NLP tasks such as text classification, information
retrieval and text summarisation.
Disadvantages:-
● It can reduce the readability of the text.
● It may not always produce the correct root form of a word.
Note: Stemming is different from Lemmatization.
Ex: Words that stem to the root word "like" include "likes", "liked", "likely" and "liking".
Errors in Stemming:-
There are mainly two errors,
1. Over-stemming
2. Under- stemming
Over-stemming occurs when two words with different stems (and unrelated meanings) are reduced to the same root; it is also known as a false positive.
Under-stemming occurs when two words that should be reduced to the same root are not; it is also known as a false negative.
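Both error types can be reproduced with NLTK's Porter stemmer: unrelated words such as "universe" and "university" collapse to the same stem (over-stemming), while related forms such as "alumnus" and "alumni" keep different stems (under-stemming). A small sketch:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: words with different meanings reduced to one stem.
print([porter.stem(w) for w in ['universe', 'university', 'universal']])

# Under-stemming: related words that do not reach a common stem.
print([porter.stem(w) for w in ['alumnus', 'alumni', 'datum', 'data']])
```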
Applications of Stemming:-
1. Stemming is used in Information retrieval systems like search engines.
2. It is used to determine domain vocabularies in domain analysis.
3. Stemming is used in document (text) clustering, a method of group analysis applied to textual materials. It includes subject extraction, automatic document structuring and quick information retrieval.
PART-OF-SPEECH TAGGER:-
Part-of-speech tagging is used at an early stage of text processing in many NLP
applications such as speech synthesis, machine translation, IR and information extraction.
In IR, part of speech tagging can be used in indexing, extracting phrases and for
disambiguating word senses. The rest of this section presents a number of part-of-speech
taggers that are already in use.
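As a quick illustration, NLTK's pre-trained default tagger can be applied to a tokenized sentence, assuming the punkt and averaged_perceptron_tagger resources have been downloaded.

```python
import nltk

tokens = nltk.word_tokenize("The police nabbed the suspect yesterday.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('police', 'NN'), ('nabbed', 'VBD'), ...]
```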
TnT Tagger:-
Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech tagger. The tagger is based on Hidden Markov Models (HMMs) and uses optimisation techniques for smoothing and for handling unknown words.
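NLTK includes an implementation of TnT; the minimal sketch below trains it on a slice of the Penn Treebank sample corpus (assumes nltk.download('treebank')).

```python
from nltk.corpus import treebank
from nltk.tag import tnt

train_sents = treebank.tagged_sents()[:3000]   # tagged training sentences
tnt_tagger = tnt.TnT()
tnt_tagger.train(train_sents)
print(tnt_tagger.tag(['The', 'passage', 'reads', 'as', 'follows']))
```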
Brill Tagger:-
Brill described a trainable rule-based tagger that obtained performance comparable to that of a stochastic tagger. It uses transformation-based learning to automatically induce rules.
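NLTK also provides a Brill tagger trainer. The sketch below is an illustrative setup rather than Brill's original experiment: it starts from a unigram baseline tagger and induces transformation rules from treebank data.

```python
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = treebank.tagged_sents()[:3000]
baseline = UnigramTagger(train_sents)               # initial tagger to be corrected
trainer = BrillTaggerTrainer(baseline, brill24())   # standard transformation templates
brill_tagger = trainer.train(train_sents, max_rules=50)
print(brill_tagger.tag(['Read', 'the', 'advertisement']))
```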
Tree Tagger:-
It is a probabilistic tagging method that uses decision trees to estimate transition probabilities. This avoids the problems faced by Markov model methods when estimating transition probabilities from sparse data.