Retrieval Chapter One


Information Retrieval

3/29/2019 1
Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
• Quite effective (at some things)
• Commercially successful (some of them)
• But what goes on behind the scenes?
• How do they work?
• What happens beyond the Web?

Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, AnswerBus): search in (restricted) natural language.
• Other:
  • Cross-language information retrieval
  • Music retrieval

Information Retrieval
• Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
• Information is organized into (a large number of) documents.
• Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
• Example: Web search engines such as Google claim to index over 1 trillion pages.

General Goal of Information Retrieval
• To help users find useful information based on their information needs (with minimum effort) despite:
  • Increasing complexity of information
  • Changing needs of users
• Provide immediate random access to the document collection.
• Retrieval systems such as Google and Yahoo are developed with this aim.

Information Retrieval vs. Data Retrieval
• The emphasis of IR is on the retrieval of information rather than on the retrieval of data.
Data retrieval
• Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user's information need)
• Aims at retrieving all objects that satisfy well-defined semantics
• A single erroneous object among a thousand retrieved objects implies failure
Information retrieval
• Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
• Semantics are frequently loose: the retrieved objects might be inaccurate
• Small errors are tolerated

Information Retrieval vs. Data Retrieval (cont'd)
• An example of a data retrieval system is a relational database.

                     Data Retrieval                        Info Retrieval
Data organization    Structured                            Unstructured
Fields               Clear semantics (ID, name, age, …)    No fields (other than text)
Query language       Artificial (defined, SQL)             Free text (“natural language”), Boolean
Matching             Exact (results are always “correct”)  Partial match, best match
Query specification  Complete                              Incomplete
Items wanted         Matching                              Relevant
Accuracy             100%                                  < 50%
Error response       Sensitive                             Insensitive

Why is IR so hard?
• Traditional information retrieval (IR) systems attempt to find relevant documents in response to a user’s request.
• The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.
• The real problem boils down to matching the language of the query to the language of the document.
• Simply matching on words is a very brittle approach (it has no elasticity). One word can have different meanings. Consider “take”:
  • “take a place at the table”
  • “take money to the bank”
  • “take a picture”

More Problems with IR
• You can’t even tell what part of speech a word has:
  • “I saw her duck”
• A query that searches for “pictures of a duck” will find documents that contain:
  • “I saw her duck away from the ball falling from the sky”
• Proper nouns often reuse ordinary nouns:
  • Consider a document with “a man named Abraham owned a Lincoln”
  • A word-matching query for “Abraham Lincoln” may well find the above document.
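
The false matches described on this slide can be reproduced with a few lines of Python. This is a minimal sketch of naive word matching, not the method of any particular system; the function names `tokenize`, `matches_all_words` and `matches_any_word` are illustrative assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_all_words(query, document):
    """Return True if every query word occurs somewhere in the document."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(query))

def matches_any_word(query, document):
    """Return True if at least one query word occurs in the document."""
    doc_words = set(tokenize(document))
    return any(word in doc_words for word in tokenize(query))

lincoln_doc = "a man named Abraham owned a Lincoln"
duck_doc = "I saw her duck away from the ball falling from the sky"

# Both calls succeed, although neither document is about what the user meant.
print(matches_all_words("Abraham Lincoln", lincoln_doc))   # True
print(matches_any_word("pictures of a duck", duck_doc))    # True
```

The matcher sees only strings: it cannot tell the car from the president, or the verb "duck" from the bird.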

Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of documents

The User Task:
There are two user tasks: retrieval and browsing.

[Figure: the USER interacts with the document DB through two tasks, Retrieval and Browsing]

The User Task
Retrieval
• It is the process of retrieving information whereby the main objective is clearly defined from the onset of the search process.
• The user of a retrieval system has to translate his or her information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
• English-language statement:
  I want a book by J. K. Rowling titled The Chamber of Secrets

Browsing
• It is the process of retrieving information whereby the main objective is not clearly defined from the beginning and may change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile, he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, he might turn his attention to a document providing ‘directions to Addis’, and from this to documents that cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather than searching, since the user may simply be interested in glancing around.

Logical View of Documents
Documents in a collection are frequently represented by a set of index terms or keywords.
Such keywords are mostly extracted directly from the text of the document.
These representative keywords provide a logical view of the document.

[Figure: Docs → Tokenization → stop words → stemming → Indexing; the logical view moves from full text to index terms]

Document representation can be viewed as a continuum, in which the logical view of documents might shift from full text to index terms.
Logical view of documents
• If full text:
  • Each word in the text is a keyword
  • Most complex form
  • Expensive
• If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations.
  • Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to a set of index terms.
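
The reduction from full text to index terms can be sketched as a small pipeline of text operations (tokenization, stop-word removal, stemming). The stop list and the suffix-stripping rule below are toy assumptions chosen for illustration; a real system would use a full stop list and a proper stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop list; real stop lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "and", "to", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full text -> sorted set of index terms (the reduced logical view)."""
    return sorted({crude_stem(t) for t in remove_stop_words(tokenize(text))})

print(index_terms("The indexing of documents is needed for searching the documents"))
# ['document', 'index', 'needed', 'search']
```

Ten words of full text shrink to four index terms, which is the shift in logical view that the slide describes.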

Structure of an IR System
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.

[Figure: User ↔ Black box ↔ Documents]

• The black box is the information retrieval system.
• To be effective in its attempt to satisfy the information needs of users, the IR system must ‘interpret’ the contents of the documents in a collection and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the center of IR.
• The primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
Typical IR Task

❖ Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.
❖ Find:
• A ranked set of documents that are relevant to the query.

Typical IR System Architecture

[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …)]

Web Search System

[Figure: a Web spider collects pages from the Web into the document corpus; the IR system takes a query string and the corpus as inputs and outputs ranked documents (1. Page1, 2. Page2, 3. Page3, …)]

What is Information Retrieval ?
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1):
“Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which he is interested”
• The definition incorporates all the important features of a good information retrieval system:
  • Representation
  • Storage
  • Organization
  • Access
• The focus is on the user's information need.
Overview of the Retrieval process

The Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated.
• This is usually done by the manager of the database and includes specifying the following:
  • The documents to be used
  • The operations to be performed on the text
  • The text model to be used (the text structure and what elements can be retrieved)
• The text operations transform the original documents and the information needs and generate a logical view of them.

Retrieval Process ….
• Once the logical view of the documents is defined, the database module builds an index of the text.
  • An index is a critical data structure.
  • It allows fast searching over large volumes of data.
• Different index structures might be used, but the most popular one is the inverted file.
• Given that the document database is indexed, the retrieval process can be initiated.
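
As a rough illustration of why an inverted file allows fast searching, the sketch below maps each term to the list of documents that contain it, so a query only touches the postings of its own terms instead of scanning every document. The function names and the three sample documents are assumptions made for this example, and tokenization is reduced to whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return the IDs of documents that contain every query term (Boolean AND)."""
    term_postings = [set(index.get(t, ())) for t in query.lower().split()]
    if not term_postings:
        return []
    return sorted(set.intersection(*term_postings))

docs = {
    1: "information retrieval systems",
    2: "data retrieval with databases",
    3: "web search systems",
}
index = build_inverted_index(docs)
print(index["retrieval"])                      # [1, 2]
print(search_all(index, "retrieval systems"))  # [1]
```

The index is built once, offline; each query then costs a few dictionary lookups and a set intersection.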

The Retrieval Process …
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text.
• Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated.
• The query is then processed to obtain the retrieved documents.
  • Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance.
• The user then examines the set of ranked documents in search of useful information. The user has two choices:
  • (i) reformulate the query and run it on the entire collection, or (ii) reformulate the query and run it on the result set.
• At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle.
• In such a cycle, the system uses the documents selected by the user to change the query formulation.
• Hopefully, this modified query is a better representation of the real user need.
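
The slides do not say how the system changes the query during the feedback cycle. One simple possibility, shown here purely as an assumption, is query expansion: add the most frequent new terms from the documents the user selected. The function name `expand_query` and its `extra_terms` parameter are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent new terms of the user-selected documents to the query."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending frequency, then alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]

# Documents the user marked as definitely of interest for the query "jaguar".
selected = ["jaguar car engine", "jaguar car dealer", "car engine repair"]
print(expand_query(["jaguar"], selected))  # ['jaguar', 'car', 'engine']
```

The expanded query leans toward the car sense of "jaguar", which is the kind of improved representation of the user need that the feedback cycle aims for.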
Detail view of the Retrieval Process
[Figure: detailed view of the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Ranking, Index (inverted file), Text Database. Labelled flows: user need, text, logical view, query, user feedback, retrieved docs, ranked docs.]
Issues that arise in IR
• Text representation
  • What makes a “good” representation?
  • How is a representation generated from text?
  • What are retrievable objects and how are they organized?
• Information needs representation
  • What is an appropriate query language?
  • How can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
  • What weighting scheme and similarity measure should be used?
  • What is a “good” model of retrieval?
• Evaluating effectiveness of retrieval
  • What are good metrics?
  • What constitutes a good experimental test bed?
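
To make the question of weighting scheme and similarity measure concrete, here is a minimal sketch of one common pairing: TF-IDF term weights compared with cosine similarity. The slides do not prescribe this choice; it is shown only as an example, and the function names and sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return indices of documents with non-zero similarity, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1])) if score > 0]

docs = ["car racing news", "car manufacturers in addis", "tourism in ethiopia"]
print(rank("car racing", docs))  # [0, 1]
```

Document 0 shares both query terms and ranks first; document 1 shares only the common term "car" and ranks second; document 2 shares nothing and is not returned. This is best-match retrieval rather than exact matching.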

Focus in IR System Design
Our focus during IR system design is on:
• Improving the effectiveness of the system
  • Effectiveness is measured in terms of precision, recall, …
  • Techniques: stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
  • The concern here is storage space usage, access time, searching time, data transfer time, …
  • Space-time tradeoffs are a concern
  • Techniques: compression, data/file structures, etc.
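
Precision and recall, the effectiveness measures named above, can be computed from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned documents 1-4; documents 2, 4, 5, 6 and 7 are actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Half of what was returned is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).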

Subsystems of an IR system
• The two subsystems of an IR system:
  • Searching: an online process of finding relevant documents in the index list according to the user's query
  • Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected:
  • You cannot search what was not first indexed in some manner or other
  • Documents or objects are indexed in order to make them searchable
  • There are many ways to do indexing
• To index, one needs an indexing language
  • There are many indexing languages
  • Even taking every word in a document is an indexing language
• Knowing searching is knowing indexing

Indexing Subsystem

[Figure: indexing pipeline: documents → assign document identifier (document IDs) → tokenize text (tokens) → stop list (non-stoplist tokens) → stemming & normalization (stemmed terms) → term weighting (terms with weights) → index]
Searching Subsystem

[Figure: searching pipeline: query → parse query (query tokens) → stop list (non-stoplist tokens) → stemming & normalization (stemmed terms) → term weighting (query terms) → similarity measure against index terms from the index (relevant document set) → ranking (ranked document set)]

The End
