Module 5 - Information Retrieval and Lexical Resources

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 80

Application of WORDNET

• Concept identifications in Natural language


• WordNet can be used to identify concepts pertaining to a term, to suit them to the full semantic
richness.
• Word sense disambiguation
• It offers
• sense definitions of words
• identifies synsets of synonyms
• Defines a number of semantic relations
• Automatic Query Expansion
• WordNet semantic relations can be used to expand queries so that the search for a document is not
confined to the pattern-matching of query terms, but also covers synonyms.
• Document structuring and categorization
• The semantic information extracted from WordNet have been used for text categorization.
• Document summarization
• The approach presented by Barzilay and Elhadad uses information from WordNet to compute
lexical chains.
MODULE 5

INFORMATION RETRIEVAL AND LEXICAL RESOURCES


Introduction

● Information retrieval (IR) deals with the organization, storage, retrieval and evaluation
of information relevant to a user’s query.
● The retrieval system responds by retrieving the document that seems relevant to the query.
● An information retrieval system does not inform the user on the subject of the query. Rather,
it merely informs on the existence and whereabouts of documents relating to the query.
● The IR system differs from question answering system and data retrieval system.
● A question answering system provides users with answers to specific questions.
● Data retrieval systems retrieve precise data, usually organized in a well defined structure.
Design Features of Information Retrieval Systems
Give Design feature of IR system. (6 M, Jan 2019)
Explain design feature of IR systems with a neat diagram (10 M, Feb 2021)
● The process of IR begins with the user’s information need.
● Based on the need the user formulates a query.
● The IR system returns documents that seem relevant to the query.
● The retrieval is performed by matching the query representation
with document representation.
● The actual text of the document is not used in the retrieval
process. Instead documents in a collection are frequently
represented through a set of index terms or keywords.
Design Features of Information Retrieval Systems
● Representation of keywords provides a logical view of the document. The process of transforming
document text to some representation of it is known as indexing.

● There are different types of index structures. The one commonly used is inverted index.

● An inverted index is a list of keywords, with each keyword carrying pointers to the documents
containing that keywords.

● Note:

○ A forward index (or just index) is the list of documents, and which words appear in them.

○ The inverted index is the list of words, and the documents in which they appear.

○ They are both indexes - it's just a question of which direction you're going.

○ Forward is from documents->to->words, inverted is from words->to->documents.


Design Features of Information Retrieval Systems
● The computational cost involved in adopting a full text logical view is high. Hence, some text
operations are performed to reduce the set of representative keywords. The two most
commonly used text operations are stop word elimination and stemming.
● Stop word elimination removes grammatical or functional words.
● Stemming reduces words to their common grammatical roots.
● Zipf’s law is applied to further reduce the size of index set.
● All terms in a document may not be equally relevant. Attempts are made to quantify the
significance of index terms to a document by assigning them numerical values called weights.
● The choice of index terms and weights is theoretically and practically difficult and hence, a
number of term-weighting schemes are used to cope with it.
Indexing
Define indexing. List and explain methods used to reduce the set of representative
keywords during indexing. (10 Marks, June/July 2019)
● The process of transforming a collection of raw documents into an easily accessible
representation is known as indexing.
● Indexing technique involves identifying good document descriptors such as keywords or
terms, which help in describing the content of the document and as well as discriminating it
from other documents in the collection.
● Luhn, proposed that the discrimination power of index terms is a function of the rank order of
the frequency of their occurrence and that middle frequency terms have the highest
discrimination power. This model was proposed for the extraction of salient terms from a
document. These extracted terms are then used to represent text.
Eliminating Stop Words
Stop word elimination may be harmful - justify (Jan 2020, 4 Marks)

Explain the benefits of eliminating stop words. Give examples in which stop word elimination
may be harmful. (4 M, Feb 2021)
● The lexical processing of index terms involves elimination of stop words.
● Stop words are high frequency words which have little semantic weight and are thus, unlikely to
help in retrieval.
● Stop words play important grammatical role in language but do not contribute to the semantic
content of a document in a keyword based representation.
● Keywords are commonly used in documents regardless of topics and have no topical specificity.
● Typical example of stop words are articles and prepositions. Eliminating them considerably
reduces the number of index terms.
Advantages and Disadvantages of Eliminating Stop words

Advantages:
● Eliminating stop words can result in considerable reduction in number of index terms
without losing any significant information.
Disadvantages:
● Sometimes it can result in elimination of useful index terms.
● For example,
○ ‘A’ in ‘Vitamin A’.
○ Some phrases like ‘to be or not to be’ consist entirely of stop words.
● Eliminating stop words in above cases make it impossible to correctly search a
document.
Sample Stop Words in English
Stemming
How stemming affects the performance of IR system? (4M Jan 2020)
● Stemming normalizes morphological variants by removing affixes from the words to reduce them to their
stem.
● Eg.
○ The words compute, computing, computers and computer will be reduced to it stem comput.
● Keywords or terms which are used to represent text are stems and not the actual words.
● Eg., the stemmed representation of the text -

‘Design features of information retrieval systems’ is

{design, featur, inform, retriev, system}

The stop words have been removed and the remaining terms are in lower case.
Advantages and Disadvantages of Stemming

Advantages:
● Stemming is useful to help combine similar terms. This results in increased recall.
● Eg., The words compute, computing, computers and computer will be reduced to its
stem comput.

Disadvantages:
● Stemming may throw away useful distinctions.
● In some cases it may be resulting in reduced precision. Eg., when documents
containing the term computation are returned in response to the query phrase
‘personal computer’.
Precision and Recall
IR System
Zipf’s Law

State and Explain Zip’s law (4 M, Feb 2021)


● Zipf’s law is associated with the distribution of words in natural languages.

● Zipf’s law states that the frequency of words multiplied by their ranks in a large corpus is more or less constant.

○ Frequency * rank ≈ constant

● If we compute the frequencies of words in a corpus and arrange them in decreasing order of frequency, then the product of
the frequency of a word and its rank is approximately equal to the product of, the frequency and rank of another word.

● Frequency of the word is inversely proportional to its rank.


Application of Zipf’s Law in Stop Word Elimination
● Empirical investigation of Zipf’s law on large corpuses suggest that human languages contain a small number of words that occur with
high frequency and a large number of words that occur with low frequency.

● In between is a middling number of medium frequency terms.

● The high frequency words, being common, have less discriminating power and are not useful for indexing.

● Low frequency words are less likely to be included in the query and are also not useful for indexing. As there are large number of low
frequency words, dropping them considerably reduces the size of list of index terms.

● The remaining medium frequency words are content-bearing terms and can be used for indexing.

● This is implemented by defining thresholds for high and low frequency and dropping words that have frequencies above or below these
thresholds.

● Stop word elimination is an implementation using Zipf’s law.


Information Retrieval Models
● An IR model is a pattern that defines several aspects of retrieval procedure, for eg.,
○ how documents and user’s queries are represented
○ how a system retrieves relevant documents according to users’ queries
○ how retrieved documents are ranked.
● An IR system consists of
○ a model of documents
○ a model of queries
○ a matching function which compares queries to documents.
● The central objective of the model is to retrieve all documents relevant to the query.
● IR models can be classified as:
1. Classical models of IR
2. Non-Classical models of IR
3. Alternative models of IR
Classical models of IR
• Classical models of IR are based on mathematical knowledge that is
easily recognized and well understood.
• Almost all existing commercial systems are based on the mathematical
models of IR.
• They are simple, efficient and easy to implement.
• The three classical IR models are:
1. Boolean model
2. Vector model
3. Probabilistic model
Non-classical Models of IR
• Non-classical IR models perform retrieval based on principles other
than those used by classical models, i.e., similarity, probability and
Boolean operation.
• These are best illustrated by models based on special logic
technique, situation theory or the concept of interaction.
• Non-classical IR models are -
1. Information logic model
2. situation theory model
3. Interaction model.
Alternative Models of IR
• Alternative IR models are enhancements of classical models
making use of specific techniques from other fields.
• Alternative IR models are -
1. Cluster model
2. Fuzzy model
3. Latent semantic indexing (LSI) model
Boolean Model
Explain Boolean and Vector space information retrieval model (10 M, Jan 2019)
With Example explain Boolean model for classical IR (6 M, Feb 2021)
• Boolean model is the oldest of the three classical models. It is based on Boolean
logic and classical set theory.
• In this model documents are represented as a set of keywords, usually stored in
an inverted file.
• An inverted file is a list of keywords, and identifiers of the documents in which
they occur.
• Users are required to express their queries as a boolean expression consisting of
keywords connected with boolean logical operators (AND, OR, NOT).
• Retrieval is performed based on whether or not document contains the query
terms.
Boolean Model

Given a finite set


T = { t1, t2, ..., ti , ...,tm } of index terms,
a finite set
D = { d1, d2 , ...,dj ,..., dn} of documents and
a boolean expression in a normal form - represent a query Q as follows:

The retrieval is performed in two steps:


1. The set Ri of documents are obtained that contain or do not contain the term ti:
Ri = { },

Where
2. Set operations are used to retrieve documents in response to Q:
Boolean Model - Example 1:
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Document collection: A collection of Shakespeare's work.

1 if play contains
word, otherwise 0
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)
• 110100 AND 110111 AND 101111 = 100100
• The plays Antony and cleopatra and Hamlet contain the words Brutus and Caesar but not Calpurnia.

• Reference: https://www.youtube.com/watch?v=zRcwihu67H4
Boolean Model - Example 2
Let the set of original documents be D={D1, D2, D3}
Where
D1 =Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user’s query.
D2 = A user having an information needs to formulate a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving the document that seems relevant to the query.
Let the set of terms used to represent these documents be
T = { information, retrieval, query }
Then, the set D of document will be represented as follows:
D = { d1, d2, d3 }
Where
d1 = { information, retrieval, query }
d2 = { information, query }
d3 = { retrieval, query}
Let the Query Q be, Q = information ^ retrieval
Step 1: The sets R1 and R2 of documents are retrieved in response to Q, where
R1 = { dj | information ∈ dj } = { d1, d2 }
R2 = { dj | retrieval ∈ dj } = { d1, d3 }
Step 2: Then, the following documents are retrieved in response to the query Q
{ dj | dj ∈ R1 ⋂ R2 } = { d1 }
This results in the retrieval of the original document D1 that has the representation d1.
Boolean Model - Advantages and Disadvantages
Advantages:
• They are simple, efficient, easy to implement, perform well in terms of
recall and precision if the query is well formulated.
Disadvantages:
1. The model fails to retrieve documents that are only partly relevant to
user query.
2. Boolean system cannot rank the retrieved documents.
3. It distinguishes between presence and absence of keywords but fails to
assign relevance and importance to keywords in a document.
4. User has to formulate the query in pure boolean expression.
Probabilistic model
• Ranks the document based on the probability of their relevance to a given query.
• Retrieval depends on whether probability of relevance (relative to a query) of a document is
higher than that of non relevance, that is, whether it exceeds a threshold value.
• Given a set of documents D, a query ‘q’, and a cut off value ‘α’ , this model first calculates
the probability of relevance and irrelevance of a document to the query.
• It then ranks documents having probability of relevance at least that of irrelevance in
decreasing order of their relevance.
• Documents are retrieved if the probability of relevance in the ranked list exceeds the cut
off value.
• If P( R / d ) is the probability of relevance of a document d, for query q, P( I / d ) is the
probability of irrelevance. Then the set of the documents retrieved is:
• S ={ dj | P ( R / dj ) >= P( I / dj ) } , P ( R / dj ) > α
Probabilistic Model - Advantages and Disadvantages
Advantages:
• The probabilistic model produce results that partly match the user
query.
Disadvantages:
• It assumes that the terms are independent when estimating
probabilities. This assumption allows for accurate estimation of
parameter values and helps reduce computational complexity of
the model. However, this assumption seems to be inaccurate as
terms in a given domain usually tend to co-occur. For eg., it is
more likely that ‘match point’ will co-occur with ‘tennis’ rather
than ‘cricket’.
Vector space model
• The vector space model is one of the most well studied retrieval models.
• It represents documents and queries as vectors of features representing terms that occur
within them. Each document is characterized by a numerical vector.
• Vectors represented in multidimensional space, in which each dimension corresponds to a
distinct term in the corpus of documents.
• Each feature takes a value of either zero or one, indicating the absence or presence of that
term in a document or query.
• Features are assigned numerical values that are usually a function of the frequency of
terms.
• Ranking algorithms compute the similarity between document and query vectors, to yield
a retrieval score to each document.
• This score is used to produce a ranked list of retrieval documents.
Vector space model
Given a finite set of n documents
D = { d1, d2, d3, …, dn }
and a finite set of m terms
T= { t1, t2, t3, …, tm }
each document is represented by a column vector of weights as follows
(w1j, w2j, w3j ,…, Wij ,..., wmj )t
Where Wij is the weight of the term ti in document dj.
The document collection as a whole is represented by a m x n term - document matrix -
Document space

Term space
w11 w12 … w1j … w1n
w21 w22 … w2j … w2n
w31 w32 … w3j … w3n
wm1 wm2 … wmj … wmn
Example:
• D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant
to user’s query.
• D2 = A user having an information needs to formulate a request in the form of query written in natural languages.
• D3 = The retrieval system responds by retrieving the document that seems relevant to the query.
• T = { information, retrieval, query }
• Let the weights be assigned based on the frequency of the term in the document. Then the associated vectors will be as
shown in matrix 1.
• Convert document vectors to unit length by dividing elements of each column by the length of the column vector given
by √Σiwij2

Term Document Matrix

t1 t2 t3 d1 d2 d3 d1 d2 d3
d1 2 2 1 t1 2 1 0 After Normalization -> t1 0.67 0.71 0
d2 1 0 1 t2 2 0 1 t2 0.67 0 0.71
 
d3 0 1 1 t3 1 1 1 t3 0.33 0.71 0.71
Term Weighting
Define Term Weighting. (Feb 2021)
• Each term that is selected as an indexing feature for a document, acts as a
discriminator between that document and all other documents in the corpus.
• Term weighting is the process to quantify the discriminating power of the terms
by associating the frequency of their occurrence (term frequency) within a
document.
• The following facts must be taken into account:
1. The more a document contains a given word, the more that document is about a
concept represented by that word.
2. The less a term occurs in particular document in a collection, the more
discriminating that term is.
Term Weighting
• The first factor indicates that terms that occur more frequently represent the document’s meaning more strongly
than those occurring less frequently and hence must be given high weights. This weight is the raw frequency of the
term in the document.
• The second factor considers term distribution across the document collection. Terms occurring in a few documents
are useful for distinguishing those documents from the rest of the collection. Similarly, terms that occur more
frequently across the entire collection are less helpful while discriminating among documents.

• The factor n / ni - gives the measure that favours terms appearing in fewer documents, where

n - total number of documents in the collection

ni - the number of documents in which the term i occurs

• This measure assigns the lowest weight 1 to a term that appears in all documents and the highest weight of n to a
term that occurs in only one document.
• As the number of documents in any collection is usually large, the log of this measure is taken, which is called
inverse document frequency (idf) term weight.

idfi = log ( n / ni )

• If a term occurs in all documents in a collection, its idf is 0


Term Weighting
• The combination of term frequency (tf) and inverse document frequency (idf) weights results in a
family of tf x idf weight schemes having the general formula -
Wij = tfij * log (n/ni)

• According to the above scheme, the weight of the term t i in document dj is equal to the product of
document frequency of the term (term frequency) and log of its inverse document frequency
within the collection.
• The tf x idf weighting scheme combines both the ‘local’ and ‘global’ statistics to assign term
weight.
• The tf component is a document specific statistic that measures the importance of the term within
the document.
• The idf is a global statistic that attempts to include distribution of the term across the document
collection.
Term Weighting - Example
Example to explain how term weights can be calculated using the tf-idf weighting scheme.
Consider a document represented by three terms {tornado, swirl, wind} with the raw tf 4,1,1 respectively. In
a collection of 100 documents, 15 documents contain the term tornado, 20 contain swirl and 40 contain
wind. Find the idf and the term weight of the three terms. (6 M, Feb 2021)
idf - tornado -> log(n / ni) = log (100 / 15 ) = 0.824 Weight - tornado -> tf x idf = 4 * 0.824 = 3.296
idf - swirl -> log(n / ni) = log (100 / 20 ) = 0.699 Weight - tornado -> tf x idf = 1 * 0.699 = 0.699
idf - wind -> log(n / ni) = log (100 / 40 ) = 0.398 Weight - tornado -> tf x idf = 1 * 0.398 = 0.398
The following table shows the weights assigned to the three terms using tf x idf weighting scheme

3.296
Weighting Schemes
• The three factors of weighting schemes are -
1. Within - document frequency or term frequence (tf)
2. Collection frequency or inverse document frequency (idf)
3. Document length - If a term appears the same number of times in a short
document and in a long document, then the term appearing in short
document is more valuable compared to the term appearing in long
document.
• Any term weighting schemes can be represented by a triple ABC.
• A represents the way the tf component is handled.
• B indicates the way the idf component is incorporated.
• C represents the length normalization component.
• Possible option for each of the three dimensions of the triple are shown in table.
Weighting Schemes

• Different combinations of options can be used to represent document and query vectors.
• The retrieval model can be represented by a pair of triples like nnn.nnn (doc = ‘nnn’, query =
‘nnn’).
• The first triple corresponds to the weighting strategy used for the documents and the second
triple to the weighing strategy used for the query, for example, ntc-ntc, lnc-ltc, etc.
Simple automatic method for indexing
A simple auto method for obtaining indexed representation of the document is as follows -

Step 1: Tokenization:
This extracts individual term from the document, converts all the letters lower case, and removes
punctuation marks.

Step 2: Stop word elimination:


This removes the words that appear more frequently in the document.

Step 3: Stemming:
This reduces the remaining terms to their linguistic root, to obtain the index terms.

Step 4: Term weighting:


This assigns weight to terms according to their importance in the document or collection of
documents.
Simple automatic method for indexing
Table shows the document vectors obtained after the application of
these steps on sample documents shown in fig. Below.
Non-classical models of IR
• Non-classical IR models are based on principles other than similarity, probability, Boolean operations, etc., on which
classical IR models are based.
• Non-classical IR models are
1. Information logic model
2. situation theory model
3. interaction model.
4. Information Logic Model:
• Information logic model is based on logical imaging.
• Retrieval is performed by making inferences from document to query.
• A measure of uncertainty is associated with this inferences.
• The principle put forward by Van Rijsbergen is used to measure this uncertainty, which says that -
• “Given any two sentences x and y , a measure of uncertainty of y->x relating to a given
data set is determined by the minimal extent to which one has to add information to the
data set in order to establish truth of y->x”
Non-classical models of IR
2. Situation theory model
• Situation theory model is also based on Van Rijsbergen’s principle.
• Retrieval is considered as a flow of information from document to query.
• A structure called infon, denoted by τ (tau) is used to describe the situation and to model
information flow.
• An infon represents a n-ary relation and its polarity.
• The polarity of an infon is either 0 or 1, indicating infon carries either positive or negative
information.
• Ex: Information in the sentence - ‘Adil is serving a dish’, is conveyed by the infon
τ = << Serving Adil, dish; 1 >>
• Polarity of an infon depends on the support. The support of an infon is represented as s ⊨ τ, which
means that the situation s makes the infon τ true.
• The infon, τ = << Serving Adil, dish; 1 >> is made true by a situation
s1 = “ I see Adil serving a dish”
• A document d is relevant to a query q, if d ⊨ q (document supports the query)
Non-classical models of IR
3. Interaction model:
• In interaction model documents are interconnected.
• The query interacts with the inter connected documents.
• Retrieval is considered as a result of this interaction.
• Artificial neural networks can be used to implement this model where each
document is modelled as a neuron, the document set as a whole forms a neural
network.
• The query is also modelled as a neuron and integrated into the network. Because of
this integration new connections are built between the query and the document and
existing connections are changed. This restructuring corresponds to the concept of
interaction . A measure of this interaction is obtained and used for retrieval.
Alternative Models of IR
• Alternative IR models are enhancements of classical models
making use of specific techniques from other fields.
• Alternative IR models are -
1. Cluster model
2. Fuzzy model
3. Latent semantic indexing (LSI) model
Cluster Model

Explain the cluster and fuzzy models of IR system. (8M, Jan 2019)
With a suitable example explain cluster based IR modelling (8 M, July 2019)
• The cluster model attempts to reduce the number of matches during retrieval based on cluster
hypothesis:
“Closely associated documents tend to be relevant to the same clusters.”
• The hypothesis suggests that closely associated documents are likely to be retrieved together. By
forming groups or classes or clusters of related documents, the search time reduces considerably.
• Instead of matching a query with every document in the collection, it is matched with representatives
of the cluster (class), and only documents from a class whose representative is close to query, are
considered for individual match.
• Clustering is applied on terms instead of documents based on co-occurrence. Terms can be grouped to
form classes of co-occurrence terms.
• A number of methods are used to group documents. One of the method is based on similarity matrix.
Cluster Model
Cluster Generation method based on Similarity Matrix

• Let D = { d1,d2,d3,…..dm } be set of documents.

• Let E = (eij)n,n be the similarity matrix.

• The element Ei,j in this matrix, denotes a similarity between document d i and dj.
• Let T be the threshold value.
• Any pair of documents di and dj (i != j) whose similarity measure exceeds threshold (eij >= T) is
grouped to form a cluster.
• The remaining documents form a single cluster.
• The set of clusters thus obtained is
C = { C1, C2, …, Ck, …, Cp }
Cluster Model
• A representative vector of each cluster is constructed by computing the centroid of the document
vectors belonging to that class.
• Representation vector for a cluster C k is rk = { a1k, a2k, …, aik, …, amk }
• An element aik in this vector is computed as

• Where aij is weight of the term ti, of the document dj, in cluster Ck.
• During retrieval, the query is compared with the cluster vectors
(r1, r2, …, rk, …, rp )
• This comparison is carried out by computing the similarity between the query vector q and the
representative vector r, as

• A cluster C, whose similarity s, exceeds a threshold is returned and the search proceeds in that cluster.
Cluster Model - Example
Fuzzy Model
• In the fuzzy model, the document is represented as a fuzzy set of terms, i.e., a set of pairs [ ti ,μ (ti ) ]
Where μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.
• The membership degree expresses the significance of term to the information contained in the
document.
• Significance values (weights) are assigned based on the number of occurrences of the term in the
document and in the entire document collection.
• Each document in the collection -
D = { d1, d2, …, dj, …, dn }

Can thus be represented as a vector of term weights,

(wij, w2j, w3j, …, wij, …, wmj ) t

Where wij is the degree to which term ti belongs to document dj.


Fuzzy Model
• Each term in the document is considered a representative of a subject area
and wij is the membership function of document dj to the subject area
represented by term ti.
• Each term ti is represented by a fuzzy set fi in the domain of documents given
by
fi = { (dj, wij) } | i = 1, …, m; j = 1, …, n
• This weighted representation makes it possible to rank the retrieved
documents in decreasing order of their relevance to the user’s query.
• Queries are Boolean queries.
• For each term that appears in the query, a set of documents is retrieved.
• Fuzzy set operations are then applied to obtain the desired result.
Fuzzy Model
Single-term Query:
• For a single-term query q=tq, those documents from the fuzzy set
fq ={(dj, wiq)},
are retrieved for which wiq exceeds a given threshold. The threshold may also be zero.
AND Query:
• For an AND query q = tq1 ^ tq2, the fuzzy sets fq1 and fq2 are obtained and then, their intersection is
obtained, using the fuzzy intersection operator
fq1 ^ fq2 = min { (dj, wiq1), (dj, wiq) }
The documents in this set are returned.
OR Query:
• For an OR query q = tq1 V tq2, the union of fuzzy sets fq1 and fa2 is computed to retrieve documents as
follows -
fq1 V fq2 = max { (dj, wiq1), (dj, wiq) }
Fuzzy Model - Example
Consider the following 3 documents:
d1 = { information, retrieval, query }
d2 = { retrieval, query, model }
d3 = { information, retrieval }
Where the set of terms used to represent documents is
T = { information, model, query, retrieval }
Fuzzy set for terms
f1 = { (d1, 1/3), (d2, 0), (d3, 1/2) } → t1 = information
f2 = { (d1, 0),(d2, 1/3), ( d 3, 0 ) } → t2 = model
f3 = { (d1, 1/3), (d2, 1/3), (d3, 0) } → t3 = query
f4 = { (d1, 1/3), (d2, 1/3), (d3, 1/2) } → t4 = retrieval
If the query is q = t2 ∧ t4, then document d2 is returned.
Latent Semantic Indexing Model
https://www.youtube.com/watch?v=M1duqgg8-IM
https://www.youtube.com/watch?v=P5mlg91as1c
https://www.youtube.com/watch?v=CwBn0voJDaw
Evaluation of IR System
• The evaluation of IR systems is the process of assessing how well a system meets the information
needs of its users.
• IR Evaluation models can be broadly classified as -
1. system driven - System driven models measure how well a system ranks documents
2. user-centered models - User-centered models measure user satisfaction.
• Following are the six criteria that can be used for evaluation -
1. Coverage of the collection
2. Time lag - The time that elapses between submission of a query and getting back the
response.
3. Presentation format
4. User effort - The effort made by the user to obtain relevant information.
5. Precision - The proportion of retrieved documents that are relevant.
6. Recall - The proportion of relevant documents that are retrieved.
• Precision and recall are most frequently applied in evaluating IR system and both are related to
effectiveness, that is, the ability of a system to retrieve relevant documents in response to user query.
Relevance
• Relevance is subjective in nature, it depends on the individual judgements of users.
• Given a query, the same document may be judged as relevant by one user and non-
relevant by another.
• It is not possible to measure true relevance because no human can read all documents
in a collection and provide a relevance assessment.
• Another issue with relevance is the degree of relevance. Traditionally, relevance is
visualized as a binary concept, that is, a document is either relevant or not relevant;
whereas relevance is actually a continuous function.
• A number of relevance frameworks includes system, communication, psychological
and situational frameworks.
• The most inclusive of these is the situational framework, which is based on a cognitive
view of the information seeking process and considers the importance of situation,
context, multi-dimensionality and time.
Effectiveness Measure
• Effectiveness is a measure of the ability of a system to satisfy the
user in terms of the relevance of documents retrieved.
• Aspect of the effectiveness include -
• Whether the documents returned are relevant to the user
• Whether they are presented in order of relevance
• Whether a significant number of relevant documents in the
collection are returned to the user.
• A number of measures have been proposed to quantify effectiveness.
The most commonly used measures of effectiveness are precision
and recall.
Precision and Recall
Define precision and recall and explain the tradeoff between them in evaluation
of IR systems (8 M, July 2019)
Precision:
• Precision is defined as the proportion of relevant documents in a retrieved
set.
• It is the probability that a relevant document is retrieved.
• It measures the accuracy of a system.
Number of relevant documents retrieved ( NRret )
Precision = ___________________________________________
Recall: Total number of documents retrieved ( Nret )
• Recall is the proportion of relevant documents that are actually been
retrieved.
• Recall measures the exhaustiveness of the system.
Number of relevant documents retrieved ( NRr )
Recall = ________________________________________________________
Total number of relevant documents in the collection ( NR rel )
Precision and Recall
• Precision and recall are based on binary relevance judgement.
• Every retrievable item is recognizably ‘relevant’, or recognizably ‘not relevant’.
• For every search result, all retrievable documents will be either -
• Relevant or non-relevant
• Retrieved or not retrieved
• Hence, each document will fall into one, and only one, of four cells of the matrix as shown in relevant
matrix
Precision and Recall
Refer: https://www.youtube.com/watch?v=mctizdBujk4
Precision and Recall
Trade off between precision and recall
• There exists a tradeoff between precision and recall, though a high value of both at the same time is desirable.
• Precision is high at low recall values.
• As the recall increases, precision decreases.
• The ideal case of perfect retrieval requires that all relevant documents be retrieved before the first non-relevant
document is retrieved.
• This is shown in the figure by the line parallel to x-axis having a precision of 1.0 at all recall points.
• Recall is an additive process. Once the highest recall (1.0) is achieved, it remains 10 for any subsequent document
retrieved.
• We can achieve 100% recall by retrieving all documents in the collection, but this defeats the intent of the IR
system.
• A multi-stage retrieval is used to improve both recall and precision simultaneously. Precision is calculated at a
particular cut-off, typically 5 documents, 10 documents, 15 documents, etc.
• Another measure called, non-interpolated average precision
Non-interpolated Average Precision
• Non-interpolated average precision is the average precision at
observed recall points.
• Observed recall points correspond to points where a relevant
document is retrieved.
• First, compute precision at each point where a relevant document is
found and then compute average of these precision numbers to get
a single number.
Example: To compute Non-interpolated Average Precision
The table shows the ranking of 10 documents for a particular retrieval. The
crossed documents are those that are relevant. Find the non-interpolated average
precision.
Example to compute Non-interpolated Average Precision
Example to compute Non-interpolated Average Precision
• Observed recall points correspond to points where a relevant document is retrieved. The observed
recall points are 0.2, 0.4, 0.6, 0.8 and 1.0
Precision at recall point 0.2: 1/1 = 1.0
Precision at recall point 0.4: 2/2 = 1.0
Precision at recall point 0.6: 3/5 = 0.6
Precision at recall point 0.8: 4/9 = 0.4
Precision at recall point 1.0: 5/10 = 0.5
Non-interpolated average precision = 1.0 + 1.0 + 0.6 + 0.4 + 0.5 = 0.7
5
Example to compute Non-interpolated Average Precision
A user submitted a query to an IR system, out of the first 15 documents returned by the system, those
ranked 1, 2, 5, 8, 12 were relevant. Compute the non-interpolated average precision of this retrieval.
Assume there are 6 relevant documents. (6 M, Jan 2020)
Observed Recall Points are
1/6 = 0.17
2/6 = 0.11
3/6 = 0.5
4/6 = 0.67
5/6 = 0.83
Precision at recall point 0.17 = 1.0
Precision at recall point 0.11 = 1.0
Precision at recall point 0.5 = 0.6
Precision at recall point 0.67 = 0.5
Precision at recall point 0.83 = 0.42

Non-interpolated Average Precision = 1.0 + 1.0 + 0.6 + 0.5 + 0.42 = 3.52

5 5
Non-interpolated Average Precision = 0.704
Lexical Resources
Write short notes on WordNet, FrameNet, Stemmer, POS Tagging (16 M, Feb 2021)
• Various tools and lexical resources are developed to ease the task of researchers working with NLP.
• These resources are open sources and are readily available.
The various resources and tools are
1. Lexical Resources
a. WordNet
b. FrameNet
2. Tools
a. Stemmers
b. Taggers
c. Parsers
d. Freely Available test corpuses for text-processing applications
WORDNET
Explain WordNet. List the applications of WordNet (6 M, Jan 2020)
Explain WordNet and list the applications of WordNet (10M, Feb 2021)
Write a note of WordNet Lexical Resource and its application in NLP (6M, July 2019)
Explain the WordNet for English language and its applications. (8M, Jan 2019)
• WordNet is a large lexical database for the English language.
• Inspired by psycholinguistic theories, it was developed and is being maintained at the Cognitive Science
Laboratory, Princeton University, under the direction of George A. Miller.
• WordNet consists of three databases
• One for nouns
• One for verbs
• One for both adjectives and adverbs
• Information is organized into sets of synonymous words called synsets, each representing one base concept.
• The synsets are linked to each other by means of lexical and semantic relations.
WORDNET

• Lexical relations occur between word-forms (senses) and semantic relations


between word meanings.
• These relations include synonymy, hypernymy / hyponymy, antonymy,
meronymy / holonymy, troponymy, etc.
• A word may appear in more than one synset and in more than one part-of-
speech. The meaning of the word is called sense.
• WordNet lists all senses of a word, each sense belonging to a different synset.
• WordNet’s sense-entries consist of a set synonyms and a gloss.
WORDNET
• A gloss consists of a dictionary-style definition and examples demonstrating the use of a synset in a
sentence, as shown in the figure below.
WORDNET
Write the hypernym chain for ‘RIVER’ extracted from WordNet 2.0 (4M, Jan 2020)
• The figure shows the entries for the word ‘read’. Read has one sense as a noun and 11 senses as a verb.
• Glosses help differentiate meanings.
• Figure show Noun relations in WordNet
• Nouns and verbs are organized into hierarchies based on the hypernymy / hyponymy relation.
WORDNET

Figure shows the verb, adjective and adverb relations in Wordnet


WORDNET
Write the hypernym chain for ‘RIVER’ extracted from WordNet 2.0 (4M, Jan 2020)
• The figure shows the entries for the word ‘read’. Read has one sense as a noun and 11 senses as a verb.
• Glosses help differentiate meanings.
• The below figure shows the hypernym chain for ‘RIVER’ extracted from WordNet.
WORDNET

• WordNet is freely and publicly available for download from


http://wordnet.princeton.edu/obtain
• WordNet for other languages have been developed, Example, EuroWordNet and Hindi
WordNet
• EuroWordNet covers European languages, including English, Dutch, Spanish, Italian,
German, French, and Czech.
• Hindi WordNet has been developed by CFILT (Resource Center for Indian Language
Technology Solutions), IIT Bombay.
• Hindi WordNet can be obtained from the URL http://www.cfilt.iitb.ac.in/wordnet/webhwn/
• CFLIT has also developed a Marathi WordNet
http://www.cfilt.iitb.ac.in/wordnet/webmwn/wn.php
FRAMENET

• FrameNet is a large database of semantically annotated English sentences.


• It is based on principles of frame semantics.
• It defines a tagset of semantic roles called the frame element.
• Sentences from the British National Corpus are tagged with these frame
elements.
• The basic philosophy involved is that each word evokes a particular situation
with particular participants.
• FrameNet aims at capturing these situations through case-frame representation of
words. The word that invokes a frame is called target word or predicate, and the
participant entities are defined using semantic roles, which are called frame
elements.
FRAMENET

• Each frame contains a main lexical item as predicate and associated frame-specific semantic roles, such
as AUTHORITIES, TIME, AND SUSPECT in the ARREST frame, called frame elements.
• Example: The sentence below is annotated with semantic roles AUTHORITIES AND SUSPECT
• [Authorities The police] nabbed [suspect the snatcher].
• The COMMUNICATION frame has the semantic roles ADDRESSEE,
COMMUNICATOR, TOPIC, and MEDIUM.
• A JUDGEMENT frame contains roles such as a JUDGE, EVALUEE, and REASON.
• Example:
• [judge She] [Evaluee blames the police] [ Reason for failing to provide enough protection].
• A frame may inherit roles from another frame. Eg., a STATEMENT frame may inherit
from a COMMUNICATION frame, it contains roles such as SPEAKER, ADDRESSEE,
and MESSAGE.
• Example:
• [Speaker She] told [Addressee me] [Message ‘I’ll return by 7:00 pm today’].
FrameNet APPLICATIONS

• FrameNet data can be used for


• Automatic semantic parsing
• Information extraction
• Question answering system
• Information retrieval
• Machine translation
• Text summarization
• Word sense disambiguation
STEMMERS

• Stemming or conflation is the process of reducing inflected words to their base


or root form.
• The stem need not be identical to the morphological base of the word.
• It is sufficient if related words map to the same stem, even if the stem is not in
itself a valid root.
• Stemming is useful in search engines for query expansion or indexing.
• Stemming programs are commonly referred to as stemmers.
• The most common algorithm for stemming English is Porter’s algorithm, other
stemmers are Lovins stemmer and Paice/Husk stemmer.
STEMMERS

• Stemmers for European Languages:


• Snowball presents stemmers for English, Russian and a number of other European
languages, including French, Spanish, Portuguese, Hungarian, Italian, German,
Dutch, Swedish, Norwegian, Danish and Finnish.
• Stemmers for Indian Languages:
• The Resource Centre of Indian Language Technology (CFILT), IIT Bombay has
developed stemmers for Indian languages, which are available at
http://www.cfilt.iitb.ac.in
• Stemming Application:
• Stemmers are used in
• Web search engines
• Text summarization
• Text categorization
PART-OF-SPEECH TAGGER

• Part-of-Speech tagging is used at an early stage of text processing in many NLP application
such as
• Speech synthesis
• Machine translation
• Information Retrieval
• Information extraction
PART-OF-SPEECH TAGGER

• Various part-of-speech taggers are -


• Stanford Log-linear Part-of Speech (POS) Tagger
• Part-of-Speech Tagger for English
• TnT Tagger
• Brill Tagger
• CLAWS Part-of-Speech Tagger for English
• Tree-Tagger
• ACOPOST - A Collection of POS Taggers
• Maximum Entropy Tagger (MET)
• Trigram Tagger (T3)
• Error-driven Transformation-based Tagger (TBT)
• Example-based Tagger (ET)
RESEARCH CORPORA

• Research carpora is developed for a number of NLP related tasks.


• Following are the available standard document collections -
• IR Test Collection
• Summarization Data
• Word Sense Disambiguation
• Asian Language Corpora

You might also like