Module 5 - Information Retrieval and Lexical Resources
● Information retrieval (IR) deals with the organization, storage, retrieval and evaluation
of information relevant to a user’s query.
● The retrieval system responds by retrieving documents that seem relevant to the query.
● An information retrieval system does not inform the user on the subject of the query. Rather,
it merely informs on the existence and whereabouts of documents relating to the query.
● An IR system differs from question answering systems and data retrieval systems.
● A question answering system provides users with answers to specific questions.
● Data retrieval systems retrieve precise data, usually organized in a well-defined structure.
Design Features of Information Retrieval Systems
Give design features of IR systems. (6 M, Jan 2019)
Explain design features of IR systems with a neat diagram. (10 M, Feb 2021)
● The process of IR begins with the user’s information need.
● Based on the need the user formulates a query.
● The IR system returns documents that seem relevant to the query.
● The retrieval is performed by matching the query representation
with document representation.
● The actual text of the document is not used in the retrieval process. Instead, documents in a collection are frequently represented through a set of index terms or keywords.
● Representation of keywords provides a logical view of the document. The process of transforming
document text to some representation of it is known as indexing.
● There are different types of index structures. The one commonly used is inverted index.
● An inverted index is a list of keywords, with each keyword carrying pointers to the documents containing that keyword; a small sketch follows the note below.
● Note:
○ A forward index (or just index) is the list of documents, and which words appear in them.
○ The inverted index is the list of words, and the documents in which they appear.
○ They are both indexes - it's just a question of which direction you're going.
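A minimal sketch in Python (the document collection and names here are illustrative, not from the original) of an inverted index built as a mapping from each keyword to the set of documents containing it:

from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "information retrieval system",
        2: "database system design"}
index = build_inverted_index(docs)
print(index["system"])     # {1, 2} - word -> documents, the "inverted" direction
print(index["retrieval"])  # {1}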
Explain the benefits of eliminating stop words. Give examples in which stop word elimination
may be harmful. (4 M, Feb 2021)
● The lexical processing of index terms involves elimination of stop words.
● Stop words are high frequency words which have little semantic weight and are thus, unlikely to
help in retrieval.
● Stop words play important grammatical role in language but do not contribute to the semantic
content of a document in a keyword based representation.
● Stop words occur in documents regardless of topic and thus have no topical specificity.
● Typical examples of stop words are articles and prepositions. Eliminating them considerably reduces the number of index terms.
Advantages and Disadvantages of Eliminating Stop words
Advantages:
● Eliminating stop words results in a considerable reduction in the number of index terms without losing any significant information.
Disadvantages:
● Sometimes it can result in elimination of useful index terms.
● For example,
○ ‘A’ in ‘Vitamin A’.
○ Some phrases like ‘to be or not to be’ consist entirely of stop words.
● Eliminating stop words in the above cases makes it impossible to correctly search for a document (see the sketch below).
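A small sketch (the stop word list is a toy one, not a standard list) showing both the reduction and the failure cases mentioned above:

STOP_WORDS = {"a", "an", "the", "to", "be", "or", "not", "of", "in"}

def remove_stop_words(text):
    # Drop tokens that appear in the stop word list
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("the organization of information"))  # ['organization', 'information']
print(remove_stop_words("deficiency of Vitamin A"))  # ['deficiency', 'vitamin'] - the 'A' is lost
print(remove_stop_words("to be or not to be"))       # [] - the whole phrase disappears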
Sample Stop Words in English
Stemming
How does stemming affect the performance of an IR system? (4 M, Jan 2020)
● Stemming normalizes morphological variants by removing affixes from the words to reduce them to their
stem.
● Eg.
○ The words compute, computing, computers and computer will be reduced to the stem comput.
● Keywords or terms which are used to represent text are stems and not the actual words.
● E.g., in a stemmed representation of text, the stop words have been removed and the remaining terms are stemmed and lower-cased; a short sketch follows.
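A short sketch reproducing the reduction above with NLTK's Porter stemmer (assumes the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["compute", "computing", "computers", "computer"]
print([stemmer.stem(w) for w in words])
# ['comput', 'comput', 'comput', 'comput']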
Advantages and Disadvantages of Stemming
Advantages:
● Stemming is useful to help combine similar terms. This results in increased recall.
● Eg., The words compute, computing, computers and computer will be reduced to its
stem comput.
Disadvantages:
● Stemming may throw away useful distinctions.
● In some cases it may result in reduced precision, e.g., when documents containing the term computation are returned in response to the query phrase ‘personal computer’.
Precision and Recall
(Figure: diagram of an IR system, not reproduced.)
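For reference, the standard definitions are:

Precision = number of relevant documents retrieved / total number of documents retrieved
Recall = number of relevant documents retrieved / total number of relevant documents in the collection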
Zipf’s Law
● Zipf’s law states that, in a large corpus, the frequency of a word multiplied by its rank is more or less constant.
● If we compute the frequencies of words in a corpus and arrange them in decreasing order of frequency, then the product of a word’s frequency and its rank is approximately the same for every word, i.e., frequency × rank ≈ constant.
● The high frequency words, being common, have less discriminating power and are not useful for indexing.
● Low frequency words are less likely to be included in a query and are also not useful for indexing. As there are a large number of low frequency words, dropping them considerably reduces the size of the list of index terms.
● The remaining medium frequency words are content-bearing terms and can be used for indexing.
● This is implemented by defining thresholds for high and low frequency and dropping words that have frequencies above or below these
thresholds.
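A minimal sketch (the corpus and thresholds are illustrative; a real corpus would be much larger) of ranking words by frequency and keeping only medium-frequency terms:

from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
freq = Counter(tokens)

# Zipf: frequency x rank is roughly constant in a large corpus
for rank, (word, f) in enumerate(freq.most_common(), start=1):
    print(word, f, rank, f * rank)

# Keep only medium-frequency words as index terms
low, high = 2, 2
index_terms = {w for w, f in freq.items() if low <= f <= high}
print(index_terms)  # {'cat'}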
Boolean Model
• In the Boolean model, documents are represented as sets of keywords, and queries are Boolean expressions over keywords.
• Retrieval then proceeds in two steps:
1. The set of documents containing each query term is obtained.
2. Set operations are used to retrieve documents in response to Q.
Boolean Model - Example 1:
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Document collection: A collection of Shakespeare's work.
(Term–document incidence matrix: the entry is 1 if the play contains the word, otherwise 0.)
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)
• 110100 AND 110111 AND 101111 = 100100
• The plays Antony and Cleopatra and Hamlet contain the words Brutus and Caesar but not Calpurnia.
• Reference: https://www.youtube.com/watch?v=zRcwihu67H4
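The same computation as a short sketch, with the 0/1 incidence vectors packed into integers (the vectors are taken from the example above):

# 0/1 vectors over the six plays, one bit per play
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000   # its 6-bit complement is 101111

result = brutus & caesar & (~calpurnia & 0b111111)
print(format(result, "06b"))  # 100100 -> Antony and Cleopatra, Hamlet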
Boolean Model - Example 2
Let the set of original documents be D={D1, D2, D3}
Where
D1 =Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user’s query.
D2 = A user having an information need formulates a request in the form of a query written in natural language.
D3 = The retrieval system responds by retrieving the document that seems relevant to the query.
Let the set of terms used to represent these documents be
T = { information, retrieval, query }
Then, the set D of documents will be represented as follows:
D = { d1, d2, d3 }
Where
d1 = { information, retrieval, query }
d2 = { information, query }
d3 = { retrieval, query}
Let the query Q be, Q = information ∧ retrieval
Step 1: The sets R1 and R2 of documents are retrieved in response to Q, where
R1 = { dj | information ∈ dj } = { d1, d2 }
R2 = { dj | retrieval ∈ dj } = { d1, d3 }
Step 2: Then, the following documents are retrieved in response to the query Q
{ dj | dj ∈ R1 ⋂ R2 } = { d1 }
This results in the retrieval of the original document D1 that has the representation d1.
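The same two steps expressed as set operations (document representations copied from above):

d = {"d1": {"information", "retrieval", "query"},
     "d2": {"information", "query"},
     "d3": {"retrieval", "query"}}

R1 = {name for name, terms in d.items() if "information" in terms}  # {'d1', 'd2'}
R2 = {name for name, terms in d.items() if "retrieval" in terms}    # {'d1', 'd3'}
print(R1 & R2)  # {'d1'}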
Boolean Model - Advantages and Disadvantages
Advantages:
• They are simple, efficient and easy to implement, and they perform well in terms of recall and precision if the query is well formulated.
Disadvantages:
1. The model fails to retrieve documents that are only partly relevant to the user’s query.
2. A Boolean system cannot rank the retrieved documents.
3. It distinguishes between the presence and absence of keywords, but fails to assign relevance and importance to keywords in a document.
4. The user has to formulate the query as a pure Boolean expression.
Probabilistic model
• Ranks the document based on the probability of their relevance to a given query.
• Retrieval depends on whether the probability of relevance (relative to a query) of a document is higher than the probability of non-relevance, and on whether it exceeds a threshold value.
• Given a set of documents D, a query q, and a cut-off value α, this model first calculates the probability of relevance and irrelevance of each document to the query.
• It then ranks the documents whose probability of relevance is at least that of irrelevance, in decreasing order of relevance.
• Documents are retrieved if their probability of relevance in the ranked list exceeds the cut-off value.
• If P(R|dj) is the probability of relevance of a document dj for query q, and P(I|dj) is the probability of irrelevance, then the set of documents retrieved is:
S = { dj | P(R|dj) ≥ P(I|dj) and P(R|dj) > α }
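A toy sketch of this retrieval rule (the probabilities are assumed inputs here; estimating them is the hard part of the model):

def retrieve(p_rel, p_irr, alpha):
    # Keep documents with P(R|d) >= P(I|d) and P(R|d) > alpha,
    # ranked by decreasing probability of relevance
    kept = [(doc, p) for doc, p in p_rel.items() if p >= p_irr[doc] and p > alpha]
    return sorted(kept, key=lambda x: -x[1])

p_rel = {"d1": 0.8, "d2": 0.4, "d3": 0.6}
p_irr = {"d1": 0.2, "d2": 0.6, "d3": 0.4}
print(retrieve(p_rel, p_irr, alpha=0.5))  # [('d1', 0.8), ('d3', 0.6)]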
Probabilistic Model - Advantages and Disadvantages
Advantages:
• The probabilistic model produces results that partly match the user query.
Disadvantages:
• It assumes that terms are independent when estimating probabilities. This assumption keeps the estimation of parameter values tractable and helps reduce the computational complexity of the model. However, it is often inaccurate, as terms in a given domain tend to co-occur; for example, ‘match point’ is more likely to co-occur with ‘tennis’ than with ‘cricket’.
Vector space model
• The vector space model is one of the most well studied retrieval models.
• It represents documents and queries as vectors of features representing terms that occur
within them. Each document is characterized by a numerical vector.
• Vectors are represented in a multidimensional space, in which each dimension corresponds to a distinct term in the corpus of documents.
• In the simplest (binary) form, each feature takes a value of either zero or one, indicating the absence or presence of that term in the document or query.
• More generally, features are assigned numerical values that are usually a function of the frequency of terms.
• Ranking algorithms compute the similarity between document and query vectors to yield a retrieval score for each document.
• This score is used to produce a ranked list of retrieved documents.
Given a finite set of n documents
D = { d1, d2, d3, …, dn }
and a finite set of m terms
T= { t1, t2, t3, …, tm }
each document dj is represented by a column vector of weights

dj = (w1j, w2j, w3j, …, wij, …, wmj)ᵗ

where wij is the weight of the term ti in document dj.
The document collection as a whole is represented by an m × n term–document matrix, whose rows span the term space and whose columns span the document space:

W = | w11  w12  …  w1j  …  w1n |
    | w21  w22  …  w2j  …  w2n |
    | w31  w32  …  w3j  …  w3n |
    | …                        |
    | wm1  wm2  …  wmj  …  wmn |
Example:
• D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant
to user’s query.
• D2 = A user having an information need formulates a request in the form of a query written in natural language.
• D3 = The retrieval system responds by retrieving the document that seems relevant to the query.
• T = { information, retrieval, query }
• Let the weights be assigned based on the frequency of the term in the document. Then the associated vectors will be as
shown in matrix 1.
• Convert the document vectors to unit length by dividing the elements of each column by the length of the column vector, given by √(Σi wij²).
Matrix 1 (term frequencies, terms × documents):

      d1   d2   d3
t1     2    1    0
t2     2    0    1
t3     1    1    1

After normalization:

      d1     d2     d3
t1   0.67   0.71   0
t2   0.67   0      0.71
t3   0.33   0.71   0.71
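A short sketch reproducing the normalization step (the raw matrix is matrix 1 above):

import math

# rows = terms (information, retrieval, query); columns = documents d1..d3
W = [[2, 1, 0],
     [2, 0, 1],
     [1, 1, 1]]

lengths = [math.sqrt(sum(w * w for w in col)) for col in zip(*W)]
normalized = [[round(W[i][j] / lengths[j], 2) for j in range(3)] for i in range(3)]
for row in normalized:
    print(row)
# [0.67, 0.71, 0.0] / [0.67, 0.0, 0.71] / [0.33, 0.71, 0.71]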
Term Weighting
Define Term Weighting. (Feb 2021)
• Each term that is selected as an indexing feature for a document acts as a discriminator between that document and all the other documents in the corpus.
• Term weighting is the process of quantifying the discriminating power of a term, usually as a function of the frequency of its occurrence (term frequency) within a document.
• The following facts must be taken into account:
1. The more a document contains a given word, the more that document is about a
concept represented by that word.
2. The fewer the documents in a collection in which a term occurs, the more discriminating that term is.
• The first factor indicates that terms that occur more frequently represent the document’s meaning more strongly
than those occurring less frequently and hence must be given high weights. This weight is the raw frequency of the
term in the document.
• The second factor considers term distribution across the document collection. Terms occurring in a few documents are useful for distinguishing those documents from the rest of the collection; conversely, terms that occur frequently across the entire collection are less helpful when discriminating among documents.
• The factor n / ni gives a measure that favours terms appearing in fewer documents, where n is the total number of documents in the collection and ni is the number of documents containing the term ti.
• This measure assigns the lowest weight, 1, to a term that appears in all documents, and the highest weight, n, to a term that occurs in only one document.
• As the number of documents in any collection is usually large, the log of this measure is taken, which is called
inverse document frequency (idf) term weight.
idfi = log ( n / ni )
• According to the above scheme, the weight of the term ti in document dj is the product of its frequency within the document (term frequency) and its inverse document frequency within the collection: wij = tfij × idfi.
• The tf x idf weighting scheme combines both the ‘local’ and ‘global’ statistics to assign term
weight.
• The tf component is a document specific statistic that measures the importance of the term within
the document.
• The idf is a global statistic that captures the distribution of the term across the document collection.
Term Weighting - Example
Example to explain how term weights can be calculated using the tf-idf weighting scheme.
Consider a document represented by three terms {tornado, swirl, wind} with the raw tf 4,1,1 respectively. In
a collection of 100 documents, 15 documents contain the term tornado, 20 contain swirl and 40 contain
wind. Find the idf and the term weight of the three terms. (6 M, Feb 2021)
idf(tornado) = log(n / ni) = log(100 / 15) = 0.824; weight(tornado) = tf × idf = 4 × 0.824 = 3.296
idf(swirl) = log(n / ni) = log(100 / 20) = 0.699; weight(swirl) = tf × idf = 1 × 0.699 = 0.699
idf(wind) = log(n / ni) = log(100 / 40) = 0.398; weight(wind) = tf × idf = 1 × 0.398 = 0.398

The following table shows the weights assigned to the three terms using the tf × idf weighting scheme:

Term       tf    idf     tf × idf
tornado     4   0.824     3.296
swirl       1   0.699     0.699
wind        1   0.398     0.398
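The same computation as a sketch (base-10 logarithm, matching the figures above):

import math

n = 100                                 # documents in the collection
tf = {"tornado": 4, "swirl": 1, "wind": 1}
ni = {"tornado": 15, "swirl": 20, "wind": 40}

for term in tf:
    idf = math.log10(n / ni[term])
    print(term, round(idf, 3), round(tf[term] * idf, 3))
# tornado 0.824 3.296 | swirl 0.699 0.699 | wind 0.398 0.398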
Weighting Schemes
• The three factors of weighting schemes are -
1. Within-document frequency or term frequency (tf)
2. Collection frequency or inverse document frequency (idf)
3. Document length - If a term appears the same number of times in a short
document and in a long document, then the term appearing in short
document is more valuable compared to the term appearing in long
document.
• Any term weighting scheme can be represented by a triple ABC.
• A represents the way the tf component is handled.
• B indicates the way the idf component is incorporated.
• C represents the length normalization component.
• Possible options for each of the three dimensions of the triple are shown in the table.
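Assuming the table follows the widely used SMART notation (an assumption; the table itself did not survive here), typical options for each letter are:
A (tf component): n = natural tf, l = logarithmic tf (1 + log tf), b = boolean (0/1)
B (idf component): n = none, t = idf, i.e., log(n / ni)
C (normalization): n = none, c = cosine (unit-length vectors)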
• Different combinations of options can be used to represent document and query vectors.
• The retrieval model can be represented by a pair of triples like nnn.nnn (doc = ‘nnn’, query =
‘nnn’).
• The first triple corresponds to the weighting strategy used for the documents, and the second triple to the weighting strategy used for the query, for example, ntc-ntc, lnc-ltc, etc.
Simple automatic method for indexing
A simple automatic method for obtaining an indexed representation of a document is as follows (a sketch of the pipeline follows these steps):
Step 1: Tokenization:
This extracts individual terms from the document, converts all letters to lower case, and removes punctuation marks.
Step 2: Stop word elimination:
This removes the stop words from the list of extracted terms.
Step 3: Stemming:
This reduces the remaining terms to their linguistic root, to obtain the index terms.
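A minimal end-to-end sketch of the three steps (the stop word list is a toy one; the Porter stemmer from NLTK is assumed available):

import string
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "is", "of", "and", "to", "a", "with"}
stemmer = PorterStemmer()

def index_terms(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                                 # Step 1: tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # Step 2: stop word elimination
    return [stemmer.stem(t) for t in tokens]              # Step 3: stemming

print(index_terms("Information retrieval is concerned with the organization of information."))
# e.g. ['inform', 'retriev', 'concern', 'organ', 'inform']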
Cluster Model
Explain the cluster and fuzzy models of IR systems. (8 M, Jan 2019)
With a suitable example explain cluster-based IR modelling. (8 M, July 2019)
• The cluster model attempts to reduce the number of matches during retrieval based on the cluster hypothesis:
“Closely associated documents tend to be relevant to the same requests.”
• The hypothesis suggests that closely associated documents are likely to be retrieved together. By forming groups (classes, or clusters) of related documents, the search time is reduced considerably.
• Instead of matching the query with every document in the collection, it is matched with a representative of each cluster (class), and only documents from a cluster whose representative is close to the query are considered for an individual match.
• Clustering can also be applied to terms, instead of documents, based on co-occurrence: terms can be grouped to form classes of co-occurring terms.
• A number of methods are used to group documents. One method is based on a similarity matrix.
Cluster Generation method based on Similarity Matrix
• The element Eij of this matrix denotes the similarity between documents di and dj.
• Let T be the threshold value.
• Any pair of documents di and dj (i ≠ j) whose similarity measure exceeds the threshold (Eij ≥ T) is grouped to form a cluster.
• The remaining documents form a single cluster.
• The set of clusters thus obtained is
C = { C1, C2, …, Ck, …, Cp }
• A representative vector of each cluster is constructed by computing the centroid of the document
vectors belonging to that class.
• Representation vector for a cluster C k is rk = { a1k, a2k, …, aik, …, amk }
• An element aik of this vector is computed as the mean weight of the term over the cluster:

aik = (1 / |Ck|) Σ dj ∈ Ck aij

where aij is the weight of the term ti in document dj, and |Ck| is the number of documents in cluster Ck.
• During retrieval, the query is compared with the cluster vectors
(r1, r2, …, rk, …, rp )
• This comparison is carried out by computing the similarity between the query vector q and each representative vector rk, e.g., using the cosine measure:

s(q, rk) = (q · rk) / (|q| |rk|)

• A cluster Ck whose similarity s exceeds a threshold is returned, and the search proceeds within that cluster (see the sketch below).
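A compact sketch of cluster-based retrieval (toy vectors; cosine similarity is assumed as the similarity measure):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    # Representative vector: component-wise mean of the cluster's document vectors
    return [sum(col) / len(vectors) for col in zip(*vectors)]

clusters = {"C1": [[2, 2, 1], [1, 0, 1]], "C2": [[0, 1, 1]]}
reps = {name: centroid(vecs) for name, vecs in clusters.items()}

q, threshold = [1, 1, 0], 0.5
for name, r in reps.items():
    s = cosine(q, r)
    if s >= threshold:
        print(name, round(s, 2))  # the search proceeds inside this cluster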
Cluster Model - Example
Fuzzy Model
• In the fuzzy model, a document is represented as a fuzzy set of terms, i.e., a set of pairs [ti, μ(ti)], where μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.
• The membership degree expresses the significance of term to the information contained in the
document.
• Significance values (weights) are assigned based on the number of occurrences of the term in the
document and in the entire document collection.
• Each document in the collection
D = { d1, d2, …, dj, …, dn }
is represented as such a fuzzy set of terms.
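A tiny sketch of this representation (membership values are illustrative):

# Each document is a fuzzy set: term -> membership degree in [0, 1]
d1 = {"information": 0.8, "retrieval": 0.7, "query": 0.3}
d2 = {"information": 0.5, "query": 0.6}

# A term absent from the document has membership degree 0
print(d1.get("retrieval", 0.0))  # 0.7
print(d2.get("retrieval", 0.0))  # 0.0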
Lexical Resources
Write short notes on WordNet, FrameNet, Stemmer, POS Tagging (16 M, Feb 2021)
• Various tools and lexical resources have been developed to ease the tasks of researchers working in NLP.
• These resources are open source and readily available.
The various resources and tools are
1. Lexical Resources
a. WordNet
b. FrameNet
2. Tools
a. Stemmers
b. Taggers
c. Parsers
d. Freely available test corpora for text-processing applications
WORDNET
Explain WordNet. List the applications of WordNet (6 M, Jan 2020)
Explain WordNet and list the applications of WordNet (10M, Feb 2021)
Write a note of WordNet Lexical Resource and its application in NLP (6M, July 2019)
Explain the WordNet for English language and its applications. (8M, Jan 2019)
• WordNet is a large lexical database for the English language.
• Inspired by psycholinguistic theories, it was developed and is being maintained at the Cognitive Science
Laboratory, Princeton University, under the direction of George A. Miller.
• WordNet consists of three databases
• One for nouns
• One for verbs
• One for both adjectives and adverbs
• Information is organized into sets of synonymous words called synsets, each representing one base concept.
• The synsets are linked to each other by means of lexical and semantic relations.
FRAMENET
• FrameNet is a lexical resource for English built around semantic frames: schematic representations of situations, each associated with a set of frame-specific semantic roles.
• Each frame contains a main lexical item as predicate and associated frame-specific semantic roles, such as AUTHORITIES, TIME, and SUSPECT in the ARREST frame, called frame elements.
• Example: the sentence below is annotated with the semantic roles AUTHORITIES and SUSPECT:
• [Authorities The police] nabbed [Suspect the snatcher].
• The COMMUNICATION frame has the semantic roles ADDRESSEE,
COMMUNICATOR, TOPIC, and MEDIUM.
• A JUDGEMENT frame contains roles such as a JUDGE, EVALUEE, and REASON.
• Example:
• [Judge She] blames [Evaluee the police] [Reason for failing to provide enough protection].
• A frame may inherit roles from another frame. E.g., a STATEMENT frame may inherit from a COMMUNICATION frame; it contains roles such as SPEAKER, ADDRESSEE, and MESSAGE.
• Example:
• [Speaker She] told [Addressee me] [Message ‘I’ll return by 7:00 pm today’].
PART-OF-SPEECH TAGGER
• Part-of-speech tagging is used at an early stage of text processing in many NLP applications (see the sketch after this list), such as
• Speech synthesis
• Machine translation
• Information Retrieval
• Information extraction
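A quick usage sketch with NLTK's tagger (resource names can vary across NLTK versions):

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The police nabbed the snatcher")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('police', 'NN'), ('nabbed', 'VBD'), ('the', 'DT'), ('snatcher', 'NN')]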