CompletedUNIT 1 PPT 10.7.17
TECHNIQUES
UNIT I - INTRODUCTION
UNIT II - MODELING
UNIT III - INDEXING
UNIT IV - CLASSIFICATION AND CLUSTERING
UNIT V - SEARCHING AND RANKING
UNIT I - INTRODUCTION
Motivation
Basic Concepts
Practical Issues
Retrieval Process
Architecture
Boolean Retrieval
Retrieval Evaluation
Open Source IR Systems
History of Web Search
Web Characteristics
The impact of the web on IR
IR Versus Web Search
Components of a Search engine
Motivation - Information Retrieval
Source selection
Problem articulation
Engine
OUTPUT
Examination of results
Extraction of information
Practical Issues (Contd.)
prompt dissemination of information
filtering of information
providing the right amount of information at the right time
active switching of information
receiving information in the desired form
browsing
getting information in an economical way
current literature
providing access to other information systems
interpersonal communication
offering personalized help.
The formalized IR process
Real world Anomalous state of knowledge
Matching
Results
Architecture of IR
Supporting the search process
Elements of an information retrieval system
Functions
to identify the information (sources) relevant to the areas of interest of the target user
community; this is a challenging job, especially in the web environment, where virtually
everybody in the world can be a potential user of a web-based information retrieval
system
to analyze the contents of the sources (documents); this is becoming increasingly
challenging as the size, volume and variety of information sources (documents) are
increasing rapidly; on the web, this analysis is carried out automatically by specially
designed programs called spiders
to represent the contents of analyzed sources in a way that matches users' queries; this is
done by automatically creating one or more index files, and is becoming an increasingly
complex task due to the volume and variety of content and increasing user demands
to analyze users' queries and represent them in a form that is suitable for matching against
the database; this is done in a number of ways, through the design of sophisticated search
interfaces, including those that help users select appropriate search terms by means of
dictionaries and thesauri, automatic spell checkers, a predefined set of
search statements and so forth
to match the search statement with the stored database; a number of complex information
retrieval models have been developed over the years that are used to determine the
similarity of the query and stored documents
to retrieve relevant information; a variety of tools and techniques are used to determine
the relevance of retrieved items and their ranking
to make continuous changes in all aspects of the system, keeping in mind the rapid
developments in information and communication technologies (ICTs) and the changing
patterns of society, users and their information needs and expectations.
Making the connections
Stemming
Making sure that simple variations in word form are recognized
as equivalent for the purpose of the search: exercise, exercises,
exercised, for example.
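As a rough illustration of the idea, the sketch below strips a handful of common English suffixes so that the word forms named above collapse to one stem. The rule set and the name naive_stem are assumptions made for this example; it is a toy, not Porter's algorithm or any production stemmer.

```python
def naive_stem(word):
    """Strip one common English suffix (toy rule set, for illustration only)."""
    word = word.lower()
    # Longer suffixes are tried first so "exercises" loses "es", not just "s".
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["exercise", "exercises", "exercised"]
stems = {naive_stem(w) for w in forms}  # all three forms share one stem
```

The stem itself ("exercis") need not be a real word; what matters for search is only that equivalent forms map to the same string.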
Indexing
A keyword or group of selected words
Any word (more general)
How to choose the most relevant terms to use as index elements
for a set of documents.
Build an inverted file for the chosen index terms.
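A minimal sketch of building an inverted file might look like the following. The function name build_inverted_file, the whitespace tokenizer and the three sample documents are all assumptions made for illustration; a real indexer would add stemming, stop-word removal and on-disk storage.

```python
from collections import defaultdict


def build_inverted_file(docs, index_terms=None):
    """Map each index term to the sorted list of document IDs containing it.

    docs: dict of doc_id -> text (hypothetical sample data).
    index_terms: optional set of chosen terms; None means index any word.
    """
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if index_terms is None or token in index_terms:
                inverted[token].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


sample_docs = {
    1: "web search engine",
    2: "search the web",
    3: "library catalogue",
}
full_index = build_inverted_file(sample_docs)
keyword_index = build_inverted_file(sample_docs, index_terms={"web"})
```

Passing index_terms corresponds to indexing on selected keywords, while leaving it as None corresponds to the more general case of indexing any word.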
Anatomy of a web page
Boolean Retrieval Model
The query is a Boolean algebra expression using connectives such as AND, OR and NOT.
The documents retrieved are the documents that completely match the given
query.
Partial matches are not retrieved. Also, the retrieved set of documents is not
ordered. For example, say there are four documents in the system.
For each term in the query, a list of documents that contain the term is created.
Then the lists are merged according to the Boolean operators.
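The two steps just described can be sketched as follows, under the assumption of four small made-up documents (the slide's own four documents are not given here). The helper names build_postings, intersect, union and complement are chosen for this example only.

```python
def build_postings(docs):
    """For each term, list the IDs of the documents that contain it (sorted)."""
    postings = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return postings


def intersect(p1, p2):
    """AND: walk two sorted lists in step, keeping the IDs found in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: every ID that appears in either list."""
    return sorted(set(p1) | set(p2))


def complement(p, all_ids):
    """NOT: every document ID that is absent from the list."""
    present = set(p)
    return [d for d in all_ids if d not in present]


# Four hypothetical documents, invented for this sketch.
docs = {
    1: "information retrieval systems",
    2: "web search engines",
    3: "information on the web",
    4: "database systems",
}
postings = build_postings(docs)
all_ids = sorted(docs)

and_result = intersect(postings["information"], postings["web"])
or_result = union(postings["information"], postings["web"])
and_not_result = intersect(postings["information"],
                           complement(postings["web"], all_ids))
```

Note that each result is a plain set of matching document IDs: a document either satisfies the expression or it does not, and nothing in the merge produces a ranking.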
Boolean Model
Advantages
It is simple, efficient and easy to implement.
It was one of the earliest retrieval methods to be implemented. It
remained the primary retrieval model for at least three decades.
It is very precise in nature. The user gets exactly what is specified.
The Boolean model is still widely used in small-scale searches such as
searching emails, files on local hard drives, or a mid-sized library.
Disadvantages
In the Boolean model, the retrieval strategy is based on binary criteria.
So, partial matches are not retrieved. Only those documents that exactly
match the query are retrieved.
Hence, to retrieve effectively from a large set of documents, users must
have good domain knowledge to form good queries.
The retrieved documents are not ranked.
Retrieval Evaluation
Contingency table of classification of documents
                          Actual Condition
                          Present                Absent
Test result   Positive    tp                     fp (type 1 error)
              Negative    fn (type 2 error)      tn
present = tp + fn
positives = tp + fp
negatives = fn + tn
Total # of cases N = tp + fp + fn + tn
[Figure: Venn diagram of all docs for a query to the search engine, with the Retrieved and Relevant sets; document labels shown: D2, D1, D10, D3, D7]
Precision vs. Recall
Precision = |RelRetrieved| / |Retrieved|
Recall = |RelRetrieved| / |Rel in Collection|
[Figure: Venn diagram of all docs, showing the Retrieved and Relevant sets]
Evaluation of Matching:
Recall and Precision
If information retrieval were perfect ...
Every hit would be relevant to the original query, and every
relevant item in the body of information would be found.
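As a small worked example, the sketch below computes precision and recall from a retrieved set and a relevant set, and also derives the tp, fp, fn and tn counts of the contingency table. The document IDs, the collection size of 10 and both function names are assumptions invented for this illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    rel_retrieved = retrieved & relevant  # relevant documents that were retrieved
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


def contingency_counts(retrieved, relevant, n_docs):
    """Return (tp, fp, fn, tn) for a collection of n_docs documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    tn = n_docs - tp - fp - fn       # neither retrieved nor relevant
    return tp, fp, fn, tn


# Hypothetical query result over a 10-document collection.
retrieved = {"D1", "D2", "D3", "D4"}
relevant = {"D2", "D4", "D5", "D6", "D7"}

precision, recall = precision_recall(retrieved, relevant)      # 2/4 and 2/5
tp, fp, fn, tn = contingency_counts(retrieved, relevant, 10)
```

In the perfect case described above, the retrieved and relevant sets coincide, so fp and fn are both zero and precision and recall are both 1.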
Local Search
Provides access to local search results from
Google Maps.
Video Search
Incorporate a simple search box
Incorporate dynamic, search-powered strips of
video and book thumbnails.
Lucene
Cross-Platform API
Implemented in Java
Ported to C++, C#, Perl and Python
Offers scalable, high-performance indexing
Incremental indexing as fast as batch indexing
Index size roughly 20-30% the size of indexed text
Supports many powerful query types
Lucene: Modules
Analysis
Tokenization, Stop words, Stemming, etc.
Document
Unique ID for each document
Title of document, date modified, content, etc.
Index
Provides access and maintains indexes.
Query Parser
Search / Search Spans
Terrier: Overview (1/2)
Stands for TERabyte RetrIEveR.
Open Source API (Mozilla Public Licence).
Modular platform for the rapid development of large-scale IR
applications.
It is written in Java (and Perl)
Highly compressed disk data structures.
Handling large-scale document collections.
Standard evaluation of TREC ad-hoc and known-item search
retrieval results.
Based on a new parameter-free probabilistic framework for IR
(DFR), allowing adaptable term weighting functionalities.
Terrier: Indexing
Create your own Collection decoder and Document implementation.
Centralized or distributed setting.
The indexer iterates through the collection and creates the following data
structures:
Direct Index
Document Index
Lexicon
Lemur: Overview
Support for XML and structured document
retrieval
Interactive interfaces for Windows, Linux, and
Web
Cross-Platform, fast and modular code written
in C++
Free and open-source software
Lemur: Query Flow
Query
User Query Parser
Scoring
Nodes
runQuery()
The Dragon Toolkit
Drexel University
Dragon: Overview (1/2)
Highly scalable to large data sets
Well-designed programming API and XML-based interface
Various document representations including
words, multiword phrases, ontology-based
concepts, and concept pairs
Various text retrieval models
Text classification, clustering, summarization
and topic modeling
Dragon: Overview (2/2)
Provides built-in support for semantic-based IR and text mining (TM),
unlike Lucene and Lemur.
Integrates a set of NLP tools, which enable the toolkit to
index text collections with various representation
schemes including words, phrases, ontology-based
concepts and relationships.
It is specially designed for large-scale applications.
The toolkit uses sparse matrices to implement text
representations and does not have to load all data into
memory at run time.
Can handle hundreds of thousands of documents with very
limited memory.
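To give a feel for why a sparse representation saves memory, the sketch below stores each document as a row holding only its non-zero term counts. This is written in plain Python purely for illustration and is not the Dragon Toolkit's API; the function name, the sample texts and the in-memory layout are all assumptions, and the sketch does not attempt the on-disk storage that lets a toolkit avoid loading all data at once.

```python
from collections import Counter


def sparse_doc_term_rows(docs):
    """Build a sparse document-term matrix as a list of {column: count} rows.

    Returns (vocab, rows), where vocab maps each term to its column number.
    Only non-zero entries are stored, so a row's size depends on the number
    of distinct terms in that document, not on the vocabulary size.
    """
    vocab = {}
    rows = []
    for text in docs:
        counts = Counter(text.lower().split())
        row = {}
        for term, count in counts.items():
            column = vocab.setdefault(term, len(vocab))  # assign next free column
            row[column] = count
        rows.append(row)
    return vocab, rows


# Two hypothetical documents.
vocab, rows = sparse_doc_term_rows(["web search web", "text mining"])
```

A dense matrix for the same data would need one cell per document per vocabulary term, most of them zero, which is what becomes impractical for large collections.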
History of Web Search
Web characteristics
The web graph
We can view the static Web, consisting of static HTML pages
together with the hyperlinks between them, as a
directed graph in which each web page is a node and
each hyperlink a directed edge.
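The directed-graph view can be sketched with two adjacency maps, one for outgoing and one for incoming links. The page names and the function build_web_graph are hypothetical, chosen only to illustrate the structure.

```python
def build_web_graph(links):
    """Build out-link and in-link maps from (source_page, target_page) pairs."""
    out_links = {}
    in_links = {}
    for src, dst in links:
        # Register both pages as nodes, even if one has no links of its own.
        for page in (src, dst):
            out_links.setdefault(page, set())
            in_links.setdefault(page, set())
        out_links[src].add(dst)  # directed edge src -> dst
        in_links[dst].add(src)
    return out_links, in_links


# Hypothetical pages and hyperlinks.
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
]
out_links, in_links = build_web_graph(links)

out_degree_a = len(out_links["a.html"])  # hyperlinks leaving a.html
in_degree_c = len(in_links["c.html"])    # hyperlinks pointing at c.html
```

Because the edges are directed, a page's in-degree and out-degree generally differ: here c.html is linked to by two pages but links to none.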
Spam
This led to the first generation of spam, which
(in the context of web search) is the manipulation of
web page content for the purpose of appearing high up
in search results for selected keywords.
To avoid irritating users with these repetitions,
sophisticated spammers resorted to such tricks as
rendering these repeated terms in the same color as the
background.
Web crawler
Which Search Engine?
Yahoo
AltaVista
Excite
Google
Northern Light
HotBot
Infoseek
See Handout - The Little Search Engine that Could
Components of a Search engine
[Figure: A Typical Web Search Engine; diagram labels: Users, Interface, Indexer, Crawler, Characteristics, Web]
What's SEO?
SEO = Search Engine Optimization
Refers to the process of optimizing both the on-page and off-page ranking
factors in order to achieve high search engine rankings for targeted search
terms.
Refers to the industry that revolves around obtaining high rankings in the
search engines for desirable keyword search terms as a means of increasing
the relevant traffic to a given website.
Content based
Link spam
Cloaking
Mirror sites
URL redirection
Duplicate detection