IR Chapter I
Introduction to Information
Storage and Retrieval
IR and IR systems
Data vs information retrieval
IR and the retrieval process
Basic structure of an IR system
How search engines work
Introduction to ISR
Information retrieval is a process of
– looking for relevant information in a large and
heterogeneous collection of information, and
– returning the retrieved information ranked by relevance.
Based on the kind of collection the retrieval works
on, there are different types of IR, such as
– textual IR,
– graphic IR,
– multimedia IR and others.
November 22 ISR 2
Information Retrieval
Information Retrieval (IR)
is the activity of obtaining relevant documents,
based on user needs,
from a collection of documents.
Information Retrieval
Goal = find documents relevant to an information need
from a large document set
IR in Practice
• Computer scientist – fast and accurate search engine
IR Systems
Document (Web page) retrieval in response
to a query
Quite effective (at some things)
Commercially successful (some of them)
But what goes on behind the scenes?
How do they work?
What happens beyond the Web?
IR Systems
Information is organized into (a large
number of) documents
– Large collections of documents from various
sources:
• news articles,
• research papers,
• books,
• digital libraries,
• Web pages, etc.
IR Systems
A query is issued by the user.
A set of documents deemed
relevant to the query is ranked by
computed similarity to the query and
presented to the user.
Information Retrieval (IR) is devoted to
finding relevant documents, not finding
simple matches to patterns.
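The ranking step described above can be sketched as follows. This is a minimal illustration, assuming documents and queries are represented as raw term-frequency vectors and compared with cosine similarity. The slides do not say which similarity measure is used, so that choice, the function names, and the three sample documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    q_vec = Counter(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        score = cosine_similarity(q_vec, Counter(tokenize(text)))
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample collection, for illustration only.
documents = {
    "d1": "information retrieval systems rank documents",
    "d2": "database systems return exact matches",
    "d3": "cooking recipes for dinner",
}
results = rank("retrieval systems", documents)
```

Here d1 shares two terms with the query and d2 only one, so d1 is ranked first, while d3 shares none and is not returned at all.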
IR Systems
Automated information retrieval (IR)
systems were originally developed to help
manage the huge scientific literature that
has developed since the 1940s.
Many university, corporate, and public
libraries now use IR systems to provide
access to books, journals, and other
documents.
IR Systems
An Information Retrieval System consists
of a software program that helps a user
find the information the user needs.
The system may use standard computer
hardware or specialized hardware to support
the search sub-function and to convert
non-textual sources to a searchable medium (e.g.,
transcription of audio to text).
General Goal of IR
To help users find useful/relevant
information based on their information
needs (with minimum effort), despite
the challenges:
– Increasing complexity of information (overload)
– Changing needs of users
To provide immediate random access to the
document collection.
Retrieval systems, such as Google, Yahoo, …,
are developed with this aim.
Objectives of IR Systems
To minimize the overhead of a user locating
needed information.
Overhead can be expressed as the time a
user spends in all of the steps leading to
reading an item containing the needed
information
(e.g., query generation, query execution,
scanning results of the query to select items to
read, and reading non-relevant items).
Objectives of IR Systems
The success of an IR system is very subjective:
it depends on what information is needed and the
willingness of a user to accept overhead.
In some cases, the needed information is all information
in the system that relates to the user’s need.
In other cases, it is enough to find sufficient information in
the system to complete a task, allowing for missed
data.
Functions of an IRS
The Major Functions of an IRS are:
To identify the sources of information
relevant to the areas of interest of the target
user community.
To analyze the contents of the sources
(documents).
To represent the contents of the analyzed
sources for matching with the users’
queries.
To match the search statement with the
stored database.
IR and the Retrieval Process
The process of IR starts when a user enters
a query into the system through the
graphical interface provided.
These user-defined queries are
statements of the needed information,
for example, the queries users type into search
engines.
IR and the Retrieval Process
In IR, a single query does not match one
correct data object.
Instead, it matches several
data objects in the collection, from which the most relevant
documents are taken into consideration for
further evaluation.
The relevant documents are ranked
to find the document most related to the
given query.
IR and the Retrieval Process
Ranking many candidate matches, rather than returning
one exact match, is the key difference between
database searching and IR.
After being entered, the query is sent to the core of the
system.
This part has access to the content
management module, which is directly
linked with the back-end,
i.e. the large collections of data objects.
IR and the Retrieval Process
Once results are generated by the core
system, they are returned to the user through
a graphical user interface.
The process repeats and the results are refined
until the user finds what they are
actually looking for.
IR and the Retrieval Process
Document Parsing
Document parsing deals with the overall
document structure.
In this phase, the document is broken down
into discrete components.
In the preprocessing phase, unit
documents are created, for example one document
representing an email and another for an
additional specific part.
IR and the Retrieval Process
Lexical Analysis
In lexical analysis, tokenization is the process of
breaking a stream of text into words, phrases, symbols, or
other meaningful elements called tokens.
These tokens are then passed on to
part-of-speech tagging.
Typically, tokenization occurs at the word level.
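Word-level tokenization as described above can be sketched in a few lines. This is a simplified illustration, assuming that tokens are runs of letters, digits, and underscores and that case is folded to lowercase. Real tokenizers handle many more cases (hyphens, apostrophes, languages without spaces), and the function name is hypothetical.

```python
import re


def tokenize(text):
    """Break a text stream into lowercase word-level tokens.

    Punctuation and whitespace act as separators and are discarded.
    """
    return re.findall(r"\w+", text.lower())


tokens = tokenize("Typically, tokenization occurs at a word level.")
```

The resulting token list is what later stages, such as part-of-speech tagging or indexing, would consume.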
IR and the Retrieval Process
Lemmatization
Usually refers to doing these things properly, with
vocabulary and morphological analysis of words.
Aims to remove inflectional endings only.
Requires a dictionary.
Reduces words to their base form:
– Flies → fly
– Mules → mule
– Agreed → agree
– Owned → own
– Traditional → tradition
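Because lemmatization requires a dictionary, the simplest possible sketch is a lookup table. The table below is a hypothetical toy dictionary containing only a few of the inflected forms listed above. A real lemmatizer would rely on a full vocabulary and morphological analysis.

```python
# Hypothetical toy dictionary: inflected form -> base form (lemma).
LEMMA_DICTIONARY = {
    "flies": "fly",
    "mules": "mule",
    "agreed": "agree",
    "owned": "own",
}


def lemmatize(token):
    """Look the token up in the dictionary; unknown words are returned
    unchanged (lowercased)."""
    word = token.lower()
    return LEMMA_DICTIONARY.get(word, word)


lemmas = [lemmatize(t) for t in ["Flies", "Mules", "Agreed", "Owned", "table"]]
```

Words missing from the dictionary, such as "table" here, pass through unchanged, which is why the coverage of the dictionary determines the quality of the result.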
Basic structure of an IR system
The two subsystems of an IR system:
Indexing: an offline process of organizing
documents using keywords extracted from the
collection
Searching: an online process of finding
relevant documents in the index according to the user’s
query
Indexing and searching are unavoidably
connected
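The two subsystems can be sketched with an inverted index, one common way of organizing documents by keyword. This is a minimal sketch under assumptions: every word is treated as a keyword, a query matches only documents containing all of its terms, and the function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Offline step: map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Online step: return the ids of documents containing every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection, for illustration only.
documents = {
    1: "Web crawlers traverse the Web",
    2: "Crawlers send pages to a server",
    3: "The server builds an index",
}
index = build_index(documents)
hits = search(index, "crawlers web")
```

Searching never touches the document text itself, only the index, which illustrates why the two steps cannot be separated.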
Basic structure of an IR system
You cannot search what was not first indexed
in some manner.
Documents are indexed so that they
can be searched
There are many ways to do indexing
To index one needs an indexing language
There are many indexing languages
Every word in a document could serve as a term in an indexing
language.
Issues in IR
Text representation
– what makes a “good” representation?
– how is a representation generated from text?
Information needs representation
– what is an appropriate query language?
Comparing representations
– to identify relevant documents
– what is a “good” model of retrieval?
Evaluating effectiveness of retrieval
– what are good metrics?
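Two widely used measures of retrieval effectiveness are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). The slides do not name specific metrics, so the sketch below is offered only as a common example, with hypothetical document ids.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: ids returned by the system
    relevant:  ids judged relevant to the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 3 relevant overall, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

In this example precision is 2/4 and recall is 2/3: half of what the user had to scan was useful, and one relevant document was missed.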
Why is IR so hard?
Information retrieval problem: locating
relevant documents based on user input,
such as keywords or example documents.
The real problem boils down to matching
the language of the query to the language of
the document.
Why is IR so hard?
Simply matching on words is a very weak
approach.
One word can have several different
meanings. Consider the word “take”:
– “take a place at the table”
– “take money to the bank”
– “take a picture”
In Amharic ገና ……...ትፈራለህ …….
The World Wide Web
The Web is an infrastructure of distributed
information combined with software that uses
networks as a vehicle to exchange that
information.
A web page is a document that contains or
references various kinds of data, such as text,
images, graphics, and programs.
Links are connections between one web page and
another that can be used to “move around” as
desired.
The World Wide Web
A website is a collection of related web
pages.
The Internet makes the communication
possible, but the Web makes that
communication easy, more productive, and
more enjoyable.
Characteristics of the Web
Decentralized content publishing with essentially
no central control of authorship.
This turned out to be the biggest challenge for web
search engines in their quest to index and retrieve
this content.
Web page authors create content in dozens of
(natural) languages and thousands of dialects,
thus demanding many different forms of
stemming and other linguistic operations.
Characteristics of the Web
Huge (1.75 terabytes of text)
Allows people to share information globally and
freely
Hides the details of communication protocols,
machine locations, and operating systems
Data are unstructured
Exponential growth
Increasingly commercial over time (from 1.5% .com in
1993 to 60% .com in 1997)
Search Engine ?
A search engine, such as Yahoo or Google, is a website
that helps you find other websites.
You enter keywords and the search engine
produces a list of links to potentially useful sites.
There are two types of searches:
Keyword searches
Concept-based searches
Search Engine ?
A browser is a software tool that issues the
request for the web page we want and
displays it when it arrives.
We often talk about “visiting” a website, as
if we were going there.
In truth, we actually specify the information
we want, and it is brought to us.
The Browser
A browser is a Web client program that uses
Hypertext Transfer Protocol (HTTP) to make
requests of Web servers throughout the Internet on
behalf of the browser user.
Text-only browsers, such as Lynx
Graphical browsers are graphical software
programs that retrieve
text,
audio, and
video
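To make the HTTP request mentioned above concrete, the sketch below builds the text of a minimal HTTP/1.1 GET request, roughly what a browser sends to a Web server when asking for a page. The host `www.example.com`, the path, and the function name are placeholders chosen for illustration. A real browser sends many more headers and then reads the server's response over the network, which this sketch does not do.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request for the given page."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the Host header is required in HTTP/1.1
        "Connection: close",     # ask the server to close the connection afterwards
        "",                      # empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Placeholder host and path, for illustration only.
request = build_get_request("www.example.com", "/index.html")
```

This matches the point made earlier that we do not "visit" a site: the browser specifies the information it wants, and the server sends it back.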
Challenges of Building a Search Engine
User Problems
Novice users do not know how to start
using a search engine.
Users do not care about advertisements, but without
advertisements there is no funding.
Around 85% of users only look at the first
page of results, so relevant answers on later pages
might be skipped.
Searching Guidelines
Specify the words clearly (+, -)
Use advanced search when necessary
Provide as many particular terms as
possible
If looking for a company, institution, or
organization, try:
Searching Guidelines
Some search engines specialize in particular
areas
For broad queries, try using Web
directories as starting points.
Users should be aware that anyone can
publish data on the Web, so information
obtained from search engines might not
be accurate.
Types of Search Engines
Search by Keywords:
e.g. AltaVista, Excite, Google
Search by categories:
e.g. Yahoo!
Specialize in other languages:
e.g. Chinese Yahoo! and Yahoo! Japan
Interview simulation
e.g. Ask Jeeves!
Search Engine Architecture
Web Crawlers
Software agents that traverse the Web, sending
new or updated pages to a main server where they
are indexed.
Also called robots, spiders, worms, wanderers,
walkers, and knowbots.
The first crawler, Wanderer, was developed in 1993.
They run on a local machine and send requests to
remote Web servers.
Web Crawlers
Pages are traversed in a breadth-first or
depth-first manner
Crawlers must avoid crawling the same pages more than once
Web pages change dynamically
Fastest crawlers are able to traverse up to 10
million pages per day
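The breadth-first traversal and the "avoid crawling the same pages" rule can be sketched together. To keep the example runnable without network access, it assumes a hypothetical in-memory link graph in place of the real Web. A real crawler would download each page and extract its links instead of reading them from a dictionary.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> outgoing links.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["a"],
}


def crawl_breadth_first(seed, links, max_pages=100):
    """Visit pages level by level from the seed, never visiting a page twice."""
    visited = {seed}        # pages already queued or crawled
    queue = deque([seed])   # frontier of pages waiting to be crawled
    order = []              # pages in the order they were crawled
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


order = crawl_breadth_first("a", LINKS)
```

Replacing `queue.popleft()` with `queue.pop()` turns the frontier into a stack and gives a depth-first traversal instead. The `visited` set is what prevents the cycle between "a" and "b" from being crawled forever.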
Thank you!