
NLP UNIT 2 Part 2

The document discusses lexical resources, focusing on WordNet and FrameNet, which are databases that organize words and their meanings for applications in natural language processing (NLP). It also covers stemming as a technique to reduce words to their root forms, along with its advantages, disadvantages, and applications in information retrieval. Additionally, the document outlines various part-of-speech taggers used in NLP, emphasizing their methodologies and functionalities.

Uploaded by

pavani20891
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views6 pages

NLP UNIT 2 Part 2

The document discusses lexical resources, focusing on Wordnet and FrameNET, which are databases that organize words and their meanings for applications in natural language processing (NLP). It also covers stemming as a technique to reduce words to their root forms, along with its advantages, disadvantages, and applications in information retrieval. Additionally, the document outlines various part-of-speech taggers used in NLP, emphasizing their methodologies and functionalities.

Uploaded by

pavani20891
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 6

UNIT 2 (PART 2)

LEXICAL RESOURCES

Introduction:-
A lexical resource is a collection of words and/or phrases along with associated information,
such as part-of-speech and sense definitions. Lexical resources are secondary to texts, and are
usually created and enriched with the help of texts. There are four lexical categories: nouns,
verbs, adjectives and adverbs.

WordNet:-
WordNet is a large lexical database of English words. Nouns, verbs, adjectives and adverbs are
grouped into sets of cognitive synonyms called "synsets", each expressing a distinct concept.
Synsets are interlinked through conceptual, semantic and lexical relations such as hyponymy
and antonymy.
The meaning of a word is called a sense.
Noun:
1. read (something that is read): "The article was a very good read."
Verb:
1. read (interpret something that is written or printed):
"Read the advertisement."; "Have you read Salman Rushdie?"
2. read (have or contain a certain wording or form):
"The passage reads as follows."; "What does the law say?"
3. read, scan:
"This dictionary can be read by the computer."
4. learn, study, read, take:
"She is reading for the bar exam."
5. read: "I read you loud and clear!"
The word 'read' therefore has one sense as a noun and five senses as a verb. The figure below
shows some of the relationships that hold between nouns, verbs, adjectives and adverbs.
Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy
relation, whereas adjectives are organized into clusters based on antonym pairs.
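The noun hierarchy can be pictured as a chain of hypernym ("is-a") links. The sketch below walks such a chain using a small hand-built table; the entries are illustrative only (real WordNet links synsets, not individual words, and is far larger):

```python
# Toy hypernym ("is-a") table in the style of WordNet's noun hierarchy.
# Illustrative data only; real WordNet links synsets, not bare words.
HYPERNYMS = {
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "oak": "tree",
    "tree": "plant",
    "plant": "organism",
}

def hypernym_chain(word):
    """Follow hypernym links from a word up to the top of the hierarchy."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

print(hypernym_chain("dog"))  # ['dog', 'canine', 'mammal', 'animal', 'organism']
```

Chains like this are what make the hierarchy useful: two words are related if their chains meet at a common ancestor.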
Applications of WordNet:-
WordNet has found numerous applications in problems related to IR and NLP.

Concept Identification in Natural Language:-


WordNet can be used to identify the concepts pertaining to a term, so as to capture the full
semantic richness and complexity of a given information need.

Word Sense Disambiguation:-


WordNet combines features of a number of other resources commonly used in
disambiguation work. It offers sense definitions of words, identifies synsets of synonyms,
defines a number of semantic relations, and is freely available.
Automatic Query Expansion:-
WordNet's semantic relations can be used to expand queries so that the search for a document
is not confined to the pattern-matching of query terms, but also covers synonyms.
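A minimal sketch of this idea, using a hand-coded synonym table in place of real WordNet synset lookups (the table entries are assumptions for illustration):

```python
# Hand-coded stand-in for WordNet synonym lookup (illustrative data only).
SYNONYMS = {
    "car": {"automobile", "auto"},
    "fast": {"quick", "rapid"},
}

def expand_query(terms):
    """Return the query terms plus any known synonyms, so retrieval is not
    confined to exact pattern-matching of the original terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

print(sorted(expand_query(["car", "fast"])))
# ['auto', 'automobile', 'car', 'fast', 'quick', 'rapid']
```

A real system would draw the expansion terms from the synsets (and possibly hypernyms) of each query term.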

Document Structuring and Categorization:-


The semantic information extracted from WordNet, and WordNet's conceptual representation of
knowledge, have been used for text categorization.

Document Summarisation:-
WordNet has found useful applications in text summarization, where systems utilize information
from WordNet to compute lexical chains.
● FrameNet:
FrameNet is a large database of semantically annotated English sentences. It is based
on the principles of frame semantics. It defines a tagset of semantic roles called frame
elements.
FrameNet captures situations through case-frame representations of words
(verbs, adjectives and nouns). The word that evokes a frame is called the target word or
predicate, and the participating entities are described using semantic roles, which are called
frame elements.
Each frame contains a main lexical item as predicate and associated frame-
specific semantic roles, called the frame elements, such as AUTHORITIES, TIME and
SUSPECT in the ARREST frame.
Example:

[AUTHORITIES The police] nabbed [SUSPECT the snatcher].

The above sentence is annotated with the semantic roles AUTHORITIES and SUSPECT.
The target word in the sentence is 'nab', which is a verb in the ARREST frame.
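One way to picture such an annotation is as a small data structure pairing the target predicate with its frame elements. The sketch below is illustrative only, not FrameNet's actual data format:

```python
from dataclasses import dataclass, field

# Sketch of a FrameNet-style annotation: a frame, its target predicate, and
# the text spans filling each frame element (illustrative structure only).
@dataclass
class FrameAnnotation:
    frame: str        # e.g. "ARREST"
    target: str       # the predicate evoking the frame
    elements: dict = field(default_factory=dict)  # frame element -> text span

arrest = FrameAnnotation(
    frame="ARREST",
    target="nab",
    elements={"AUTHORITIES": "the police", "SUSPECT": "the snatcher"},
)
print(arrest.elements["AUTHORITIES"])  # the police
```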

FrameNet Applications:-
The shallow semantic roles obtained from FrameNet can play an important role in information
extraction.
For example, semantic roles make it possible to identify that the theme role played
by 'match' is the same in sentences (1) and (2), even though its grammatical role differs.
(1) The umpire stopped the match.
(2) The match stopped due to bad weather.
In sentence (1) the word 'match' is the object, while it is the subject in sentence (2).

STEMMERS:-
Stemming is an NLP technique used to reduce words to their base form, also known as the
root form. Stemming is used to normalize text and make it easier to process. It is
an important step in text pre-processing, and it is commonly used in information retrieval and
text mining applications.

A stemming algorithm reduces the words "chocolates", "chocolatey" and "choco" to the root
word "chocolate", and "retrieval" and "retrieved" to the stem "retrieve".
There are several different algorithms for stemming, including the Porter stemmer,
Snowball stemmer and Lancaster stemmer.
The Porter stemmer removes common suffixes from words.
The Snowball stemmer is more advanced and is based on the Porter stemmer, but it also
supports several other languages in addition to English.
The Lancaster stemmer is less accurate than the Porter stemmer and Snowball stemmer.
Stemming can be useful for several NLP tasks, such as text classification, information
retrieval and text summarisation.
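The core of any such stemmer is ordered suffix-stripping. The sketch below uses a deliberately tiny rule set of my own (far simpler than the Porter algorithm, which adds many more rules plus conditions on the stem):

```python
# A deliberately tiny suffix-stripping stemmer (illustrative rules only; the
# Porter algorithm uses many more rules plus measure conditions on the stem).
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of at least two letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word

print(simple_stem("chocolates"))  # chocolate
print(simple_stem("liking"))     # lik
```

Note how "liking" becomes "lik" rather than "like": a stem need not be a dictionary word, which is exactly where stemming differs from lemmatization.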

Disadvantages:-
● It may reduce the readability of the text.
● It may not always produce the correct root form of a word.
Note: Stemming is different from lemmatization.
Ex: Words stemmed to the root "like" include
"likes", "liked", "likely" and "liking".

Errors in Stemming:-
There are mainly two errors,
1. Over-stemming
2. Under- stemming

Over-stemming occurs when two words with different stems are reduced to the same
root (a false positive).
Under-stemming occurs when two words that should share the same root are not reduced
to it (a false negative).
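Both error types can be seen with a deliberately crude stemmer that keeps only the first five characters of a word (a hypothetical rule, chosen only to make the failures obvious):

```python
# Deliberately crude stemmer: keep the first five characters. This is a
# hypothetical rule used only to demonstrate the two stemming error types.
def crude_stem(word):
    return word[:5]

# Over-stemming: words with different stems collapse to the same root.
print(crude_stem("universe"), crude_stem("university"))  # unive unive

# Under-stemming: forms that share a stem fail to collapse.
print(crude_stem("datum"), crude_stem("data"))  # datum data
```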

Applications of Stemming:-
1. Stemming is used in Information retrieval systems like search engines.
2. It is used to determine domain vocabularies in domain analysis.
3. It is used in document (text) clustering, a method of group analysis applied to textual
materials. This includes topic extraction, automatic document structuring and quick
information retrieval.
PART-OF-SPEECH TAGGER:-
Part-of-speech tagging is used at an early stage of text processing in many NLP
applications such as speech synthesis, machine translation, IR and information extraction.
In IR, part of speech tagging can be used in indexing, extracting phrases and for
disambiguating word senses. The rest of this section presents a number of part-of-speech
taggers that are already in place.
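Before looking at individual taggers, it helps to see the simplest statistical approach: tag each word with the tag it carries most often in a training corpus, a common baseline that the taggers below improve upon. The corpus here is a toy example of my own:

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline trained on a toy hand-tagged corpus
# (illustrative data; real taggers train on corpora such as the Penn Treebank).
TRAIN = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
         ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]

tag_counts = defaultdict(Counter)
for word, tag in TRAIN:
    tag_counts[word][tag] += 1

def baseline_tag(sentence):
    """Assign each word its most frequent training tag; unknowns default to NN."""
    return [(w, tag_counts[w].most_common(1)[0][0] if w in tag_counts else "NN")
            for w in sentence]

print(baseline_tag(["the", "cat", "barks"]))
# [('the', 'DT'), ('cat', 'NN'), ('barks', 'VBZ')]
```

This baseline ignores context entirely; the taggers described next add contextual information in different ways.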

Stanford Log-linear part-of-speech (POS) Tagger:-


1. It makes explicit use of both the preceding and following tag context via a
dependency network representation.
2. It uses a broad range of lexical features.
3. Utilizes priors in conditional log linear models.

A part-of-speech Tagger for English:-


This tagger uses a bi-directional inference algorithm for part-of-speech tagging. It is
based on Maximum Entropy Markov Models (MEMM).
The algorithm can enumerate all possible decomposition structures and find the
highest probability sequence together with the corresponding decomposition structure in
polynomial time.

TNT Taggers:-
Trigrams 'n' Tags (TnT) is an efficient statistical part-of-speech tagger. This tagger is
based on Hidden Markov Models (HMM) and uses optimisation techniques for
smoothing and handling unknown words.
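The HMM idea behind TnT can be sketched as a Viterbi search for the most probable tag sequence. The version below is reduced to a bigram model with hand-made probabilities (TnT itself uses trigrams, smoothing and suffix-based unknown-word handling; every number here is an assumption for illustration):

```python
import math

# Bigram HMM tagger with hand-made probabilities (illustrative only; TnT
# uses trigrams plus smoothing and suffix analysis for unknown words).
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.8, "NN": 0.15, "VB": 0.05}          # P(tag | sentence start)
TRANS = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},  # P(tag_i | tag_{i-1})
         "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
         "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
EMIT = {"DT": {"the": 0.9},                          # P(word | tag)
        "NN": {"dog": 0.4, "walk": 0.1},
        "VB": {"dog": 0.05, "walk": 0.4}}

def lp(p):
    """Log probability; zero maps to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words):
    """Most probable tag sequence under the toy HMM (computed in log space)."""
    scores = [{t: lp(START[t]) + lp(EMIT[t].get(words[0], 1e-6)) for t in TAGS}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: scores[-1][p] + lp(TRANS[p][t]))
            col[t] = scores[-1][prev] + lp(TRANS[prev][t]) + lp(EMIT[t].get(w, 1e-6))
            ptr[t] = prev
        scores.append(col)
        back.append(ptr)
    tags = [max(TAGS, key=lambda t: scores[-1][t])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]

print(viterbi(["the", "dog", "walk"]))  # ['DT', 'NN', 'VB']
```

A trigram model extends this by making each state a pair of tags, which is what TnT does.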

Brill Tagger:-
Brill described a trainable rule-based tagger that obtained performance comparable to
that of a stochastic tagger. It uses transformation-based learning to automatically induce
rules.
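Applying one such learned transformation can be sketched as follows; the rule "change NN to VB when the previous tag is TO" is a classic illustrative example, not a claim about Brill's actual learned rule list:

```python
# Apply one Brill-style contextual transformation rule to an initial tagging.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous token's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("to", "TO"), ("run", "NN")]          # naive initial tagging
fixed = apply_rule(initial, "NN", "VB", "TO")
print(fixed)  # [('to', 'TO'), ('run', 'VB')]
```

Training consists of repeatedly picking the rule that most reduces tagging errors on the annotated corpus and adding it to the rule list.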

CLAWS part-of-speech Tagger for English:-


Constituent Likelihood Automatic Word-tagging System (CLAWS) is one of the earliest
probabilistic taggers for English. It can be easily adapted to different types of text in
different input formats.

Tree Tagger:-
It is a probabilistic tagging method. It avoids the problems faced by Markov model
methods when estimating transition probabilities from sparse data.

ACoPOST: A Collection of POS Taggers:-


ACoPOST is a set of freely available POS taggers. The taggers in the set are based on
different frameworks. It consists of four taggers.
Maximum Entropy Tagger (MET):-
It uses an iterative procedure to successively improve parameters for a set of features
that help to distinguish between relevant contexts.

Trigram Tagger:- (T3)


It is based on HMM. The states in the model are tag pairs that emit words.

Error-driven Transformation-Based Tagger (TBT):-


It uses annotated corpora to learn transformation rules, which are then used to change
the assigned tags using contextual information.

Example-based Tagger (ET):-


Example-based models, also called memory-based or instance-based models, rest on the
assumption that words occurring in similar contexts will take the same part-of-speech tag.

POS Tagger for Indian Languages:-


POS tagging of Indian languages adds many more dimensions to the problem, as most of
these languages are agglutinative, morphologically very rich, highly inflected and
sometimes diglossic.
● RESEARCH CORPORA:-
A corpus is a searchable database of language samples for linguistic research. A
corpus may be based on written or spoken language. Some corpora are tagged
or annotated with part-of-speech information; other corpora are plain text.

● Monolingual Corpus:- It consists of texts within a single language.


● Multilingual Corpus:- It contains text written in multiple languages.
● Specialized Corpus:- It is focused on a specific domain or subject area.

A corpus can be made up of everything from newspapers, novels, recipes and radio
broadcasts to television shows, movies and tweets.
In NLP, a Corpus contains text and speech data that can be used to train AI and
machine learning systems.
