chapter 1 ir (1)
(ISR)
Chapter One
Overview of ISR
Course Overview
Topic(s): Details
• Overview of IR: IR and the retrieval process; Basic structure of an IR system
• Text/Document Operations: Basic Laws in IR; Document pre-processing (Tokenization, Stop word detection, Stemming); Term extraction (Term weighting and similarity measures)
• Indexing Structures: The need for indexing; Compression; Inverted files; Suffix trees and Suffix arrays; Signature files
• IR Models: A Formal Characterization of IR Models; Boolean model, Vector space model & Probabilistic model
• Retrieval Evaluation: Evaluation of IR systems; Relevance judgement; Retrieval effectiveness (Recall, Precision, F-measure, etc.)
• Query Languages: Types of Query formulation; Keyword-based queries (Boolean queries); Pattern matching; Natural language queries
• Query Operations: Issues in query reformulation; Relevance feedback mechanisms; Query expansion; Statistical co-occurrence analysis; Re-weighting query terms
• Current Research Issues in IR: Research in IR (Multimedia Retrieval, Web Retrieval, Question answering, etc.)
Introduction to IR
Baseline Process:
– Given a collection of documents
– And a user’s query
– Find the most relevant documents
Key Terms Used in IR
• Query
– a representation of what the user is looking for - can be a list of words
or a phrase.
• Document
– an information entity that the user wants to retrieve
• Collection
– a set of documents
• Index
– a representation of information that makes querying easier
• Term
– word or concept that appears in a document or a query
Documents
• Not just printed paper
• Can be records, pages, sites, images, people, movies
• Document encoding (Unicode)
• Document representation
• Document preprocessing (e.g., removing metadata)
Sample Queries
What is coronavirus?
How to get happiness?
Dog
List of Ethiopian prime ministers
How many students are registered?
How to vote
Types of computer
How to install software?
What is Ebola?
Google Search
Yahoo Search
Amazon Search
Examples of Search Engines
• A search engine is a software program that helps people find the information they are looking for online using keywords or phrases.
• Conventional (library catalog)
– Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, Yahoo!)
– Lexis-Nexis is a private legal information retrieval system provided on a computer database.
– Search by keywords. Limited search using queries in natural language.
• Image-based
– Search by shapes, colors, keywords
• Question answering systems (Ask.com)
– Ask.com is a search engine that was started in 1996. It helps people find the web pages they are looking for by typing in the subject they want.
– Search in (restricted) natural language
• Clustering systems (Vivísimo, Clusty)
– Vivísimo was a privately held technology company in Pittsburgh, Pennsylvania. Clusty, its public web search engine, was a metasearch engine with document clustering.
• Research systems (Lemur, Nutch)
What is Information Retrieval?
• Information retrieval is the process of searching a large, unstructured corpus for relevant documents that satisfy a user’s information need.
• IR is simply about finding relevant information.
• It focuses on providing users with easy access to the information that interests them.
[Figure placeholder. Nodes: Information need, Request, Query, Matching, Index, Information item, Collection, The result, Relevance. Edge labels: formulation, is represented by a, uses, process, contains, is based on.]
Fig 1 Information retrieval process
IR Processes
• Information retrieval is the process of matching the query against the indexed information objects
• An index is an optimized data structure that is built on top of the information objects
– allowing faster access for the search process
– The indexer:
• Tokenizes the text (tokenization):
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
Note: in data security, "tokenization" means something different, namely replacing sensitive data with non-sensitive substitutes ("tokens") without altering the type or length of the data. That is not the sense used in IR.
• Removes words with little semantic value (stop-words)
• Unifies word families (stemming)
Stemming is the process of reducing a word to its word stem by stripping affixes such as prefixes and suffixes. A related technique, lemmatization, reduces a word to its dictionary root form, known as a lemma.
Cont’d...
Word stemming is one of the important issues in the field of natural language processing and information retrieval.
– The same is done for the query as well
• The IR system responds by returning the information objects that are relevant to the query
• Information retrieval focuses on finding relevant information rather
than simple pattern matching
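The indexer steps above (tokenization, stop-word removal, stemming) can be sketched in a few lines of Python. This is only an illustrative sketch: the stop-word list and the suffix list are small made-up assumptions, and the `stem` function is a naive suffix stripper, not a real stemming algorithm such as Porter's. The names `tokenize`, `stem`, and `preprocess` are hypothetical.

```python
import re

# Illustrative assumptions: a tiny stop-word list and a naive suffix list.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


doc_terms = preprocess("The indexing of documents")
query_terms = preprocess("index documents")
print(doc_terms, query_terms)
```

Because the same function is applied to both the document and the query, "indexing" in the document and "index" in the query end up as the same term.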
❖ Relevance
• Relevance is an important concept in information retrieval (IR), but it is hard to define
• The goal of IR is to find relevant information for the person who needs it. But:
• What is relevance?
• What kind of information or document is relevant?
• Who evaluates the relevance of a text or a document?
• On what criteria?
Relevance
✔ is a subjective notion (concept)
✔ depends on the task being solved and its context
✔ can change with time (e.g. when new information becomes available)
✔ can change with location (e.g. the most important answer is the closest one)
✔ can change with the device (e.g. the best answer is a short document that is easier to download and visualize)
• Relevance
• Retrieval results, indexing, etc., are evaluated with methods that are based on the
concept of relevance
There is no single agreed definition of relevance; it has been described in terms of:
✔ relatedness
✔ topicality
✔ beneficiality
✔ utility
Topicality vs. user relevance
There are two main directions in relevance definitions:
1. Topical relevance: relevance to a subject (topic); also called topicality or system relevance
In its simplest form, matching words in documents and queries
2. User relevance: a user-oriented view of relevance
Based on the user’s evaluation of the usefulness of the documents
Topicality vs. user relevance
• Basic assumption about topicality: index words (or phrases) can describe the
semantics of a document and a retrieval task sufficiently
• It is commonly believed that a better matching of keywords leads to a better result
• For example, the system may try to infer the meaning of a text with
advanced linguistic methods
• But no system has been shown to be perfect
Cont’d...
• Topical relevance is useful because it is easy to define and to
measure, but it does not contain everything related to relevance
• The main focus in research is now towards user relevance
A more specific classification
A. Algorithmic relevance
Similarity between query and document depending on the
matching method
B. Topicality
Correspondence between topic and text as an interpretation by a
human being
C. Cognitive relevance
The relevance of a document according to the knowledge state of
the user
Cont…
D. Situational relevance
The relevance of the document according to the situation, task or
problem of the user
E. Motivational/emotional relevance
The relevance of the document according to the objectives or
motives of the user, e.g., the entertainment value
Cont’d...
A retrieval strategy (model) is an algorithm and related structures
that takes a query and a set of documents and assigns a similarity
measure between the query and each document
▪ similarity represents relevance to the user query
▪ Documents are ranked on the basis of their similarity to the query
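As a rough illustration of a retrieval strategy, the sketch below assigns each document a similarity score against the query and ranks the documents by that score. It assumes one particular choice, cosine similarity over raw term-frequency vectors, which is only one of many possible measures. The function names and the three sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query, highest similarity first."""
    q = tf_vector(query)
    scored = [(doc_id, cosine(q, tf_vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini-collection.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "cooking recipes for dinner",
    "d3": "retrieval of relevant information from a large collection",
}
ranking = rank("relevant information retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here d1 and d3 both contain all three query terms, but d1 is shorter, so its vector points more closely in the query's direction and it ranks first; d2 shares no terms and scores zero.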
In general, the IR Process
[Figure placeholder: general IR process, with boxes labelled doc, Representation (twice), Retrieved documents, and Evaluation]
Text Collections and IR
• Large collections of documents from various sources: news articles, research papers,
books, digital libraries, Web pages, etc.
Storage of text:
Textual documents:
• Searchable as text
• Words are represented as ASCII/Unicode (ASCII stands for American Standard Code for Information Interchange)
Image documents:
• Scanned images of text documents, which are not searchable as text: text (characters, words, etc.) is represented as patterns of pixels
Cont’d...
Retrieval from Document Images:
Two options:
1. Recognition-based retrieval:
• Optical Character Recognition (OCR) is first required to convert document images to ASCII text (this step may be error-prone)
• Text IR systems are then applied to the recognized documents
2. Recognition-free retrieval:
• Retrieval from document images without explicit recognition
• Search for relevant documents directly in image collections
Directly searching by name, title, place, etc.
IR as a Discipline
• IR deals with the representation, storage, organization of, and
access to information items such as documents, webpages, online
catalogs, structured and semi-structured records, multimedia
objects.
• It can involve a range of content and media
Cont’d...
• The area has grown beyond its early goals
Nowadays research in IR includes:
✔ Modeling
✔ Web search
✔ Text classification
✔ System architecture
✔ User interface
✔ Data filtering
✔ Language
✔ Cross-language retrieval
✔ Audio (speech and music) retrieval
✔ Image retrieval
✔ Video retrieval
✔ Question answering, etc.
Cont’d...
• IR can be studied from two distinct but complementary points of view
A computer-centered view, which consists of:
• Building up efficient indexes
• Processing user queries with high performance
• Developing ranking algorithms to improve results
A human-centered view, which consists of:
• Studying the behavior of the user
• Understanding user’s need
• Determining how understanding user’s need affects the
organization and operation of retrieval system
IR as a Tool
• IR is a tool that finds and selects from a
collection of items a subset that serves
the user’s purpose
Examples of IR systems
• Text-based (Lexis-Nexis, Google, FAST): Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): Search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): Search in (restricted) natural language
• Digital and virtual libraries
• Other:
– Cross-language vs. multilingual information retrieval
– Music retrieval
– Medical search engines
IR Serves as a Bridge
• An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users.
– That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure placeholder: the IR system as a black box between the User and the Documents]
IR System Architecture
Indexing, retrieval and ranking
IR System Architecture
• Document collection
• Document representation
– Text analysis/Operations
– Indexing – executed offline
• Query parsing and expansion
– spelling correction, normalization, stop word removal, etc.
• Retrieval and ranking – IR models
• Evaluation of the quality of the answer
• Relevance feedback – to improve ranking
– e.g., the clicks on the documents
• Formatting – consists of retrieving the titles of the documents and generating snippets (brief extracts) for them
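Two of the stages above, offline indexing and query-time retrieval, can be sketched with an inverted index. This is a minimal sketch under simplifying assumptions: whitespace tokenization, no stop-word removal or stemming, and Boolean AND matching with no ranking. The function names and the three tiny documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Offline step: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Query-time step: ids of documents containing every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini-collection.
docs = {
    1: "how to install software",
    2: "types of computer software",
    3: "how to vote",
}
index = build_inverted_index(docs)
print(search(index, "how to"))
print(search(index, "software"))
```

The index is built once, before any query arrives, so answering a query only requires looking up the posting sets of its terms instead of scanning every document.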
Issues in IR
• Document/Text representation
– what makes a “good” representation?
– how is a representation generated from text?
– what are the retrievable objects and how are they organized?
• Information need representation
– what is an appropriate query language?
– how can interactive query formulation and refinement be
supported?
• Comparing representations
– what is a “good” similarity measure & retrieval model?
– how is uncertainty represented?
• Evaluating effectiveness of retrieval
– what are good metrics?
– what constitutes a good experimental test bed?
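One common answer to the metrics question is the set of effectiveness measures named in the course overview: precision, recall, and F-measure. The sketch below computes them for a single query, assuming the set of truly relevant documents is known from relevance judgements. The function name and the document ids are made up for illustration, and the F-measure shown is the balanced variant (F1).

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical judgements: the system returned 4 documents, 3 are truly relevant.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r, f)
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).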
Information Vs Data Retrieval
• Data retrieval: the task of determining which documents of a
collection contain the keywords in the user query
• Data retrieval system
– Relational database
– Deals with data that has a well defined structure and semantics
Data Vs. Information Retrieval
Features     Data Retrieval     Information Retrieval
Matching     Exact match        Partial or best match
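The difference in matching can be illustrated with a small sketch. It assumes a toy list of records for the data retrieval side and a toy document collection for the IR side; all names and values are made up. The best-match scorer simply counts shared words, which is a deliberately crude stand-in for a real ranking model.

```python
def exact_match(records, field, value):
    """Data retrieval: a record either satisfies the condition or it does not."""
    return [rec for rec in records if rec.get(field) == value]


def best_match(docs, query):
    """Information retrieval: rank documents by how many query words they share."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((doc_id, overlap))
    # Highest overlap first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical structured records and unstructured documents.
students = [{"name": "Abebe", "dept": "CS"}, {"name": "Sara", "dept": "IS"}]
cs_students = exact_match(students, "dept", "CS")

docs = {"d1": "ebola virus outbreak", "d2": "what is ebola", "d3": "how to vote"}
partial = best_match(docs, "what is ebola virus")
print(cs_students)
print(partial)
```

No document contains all four query words, so an exact-match system would return nothing, while the best-match system still returns d2 and d1 in order of overlap.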
IR Research areas
• Much of IR research focuses specifically on text retrieval, but there are many other interesting areas:
– Audio retrieval, which deals with searching for speech or music files
– Cross-language retrieval, which uses a query in one language (say English) and finds documents in other languages (say Amharic and Russian).
– Question-answering IR systems, which retrieve answers from a body of text. For example, the question "Who won the 1997 World Series?" finds a 1997 headline "World Series: Marlins are champions".
– Image retrieval, which finds images on a given topic or images that contain a given shape or color.
– Video retrieval, which searches for the video files that the user is looking for.
Is IR just document retrieval?
• Cross-language information retrieval:
Cross-lingual Information Retrieval (CLIR) is the task of retrieving relevant information when the document collection is written in a different language from the user query.
CLIR becomes essential in the many situations where the information is not available in the user’s native language.
• Information extraction
• Question answering
• Document Summarization
• Text classification
• Multimedia information retrieval
• Multi-database searching
• Document provenance
• Recommender systems
• Text mining
• etc…
Assignment 1
Write an overview of one of the following topics and submit it in hard copy. Your overview should provide an introduction, a typical architecture, techniques and methods, performance achieved so far, future research directions, and the reference materials you have used.
The end of Chapter one!