0% found this document useful (0 votes)
2 views

Module1PartBInformationRetrievalWebdocuments

Uploaded by

Ayush Tiwari
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Module1PartBInformationRetrievalWebdocuments

Uploaded by

Ayush Tiwari
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 49

Module 1

: Information Retrieval and


Web Search
An introduction
Introduction
 Text mining refers to data mining using text
documents as data.
 Most text mining tasks use Information
Retrieval (IR) methods to pre-process text
documents.
 These methods are quite different from
traditional data pre-processing methods
used for relational tables.
 Web search also has its root in IR.

CS583, Bing Liu, UIC 2


Information Retrieval (IR)
 Conceptually, IR is the study of finding needed
information. I.e., IR helps users find information
that matches their information needs.
 Expressed as queries
 Historically, IR is about document retrieval,
emphasizing document as the basic unit.
 Finding documents relevant to user queries
 Technically, IR studies the acquisition,
organization, storage, retrieval, and distribution of
information.

CS583, Bing Liu, UIC 3


IR architecture

CS583, Bing Liu, UIC 4


IR queries

 Keyword queries
 Boolean queries (using AND, OR, NOT)
 Phrase queries
 Proximity queries
 Full document queries
 Natural language questions

CS583, Bing Liu, UIC 5


Information retrieval models
 An IR model governs how a document and a
query are represented and how the relevance
of a document to a user query is defined.
 Main models:
 Boolean model
 Vector space model
 Statistical language model
 etc

CS583, Bing Liu, UIC 6


Boolean model
 Each document or query is treated as a
“bag” of words or terms. Word sequence is
not considered.
 Given a collection of documents D, let V = {t1,
t2, ..., t|V|} be the set of distinctive words/terms
in the collection. V is called the vocabulary.
 A weight wij > 0 is associated with each term
ti of a document dj ∈ D. For a term that does
not appear in document dj, wij = 0.
dj = (w1j, w2j, ..., w|V|j),

CS583, Bing Liu, UIC 7


Boolean model (contd)
 Query terms are combined logically using the
Boolean operators AND, OR, and NOT.
 E.g., ((data AND mining) AND (NOT text))
 Retrieval
 Given a Boolean query, the system retrieves
every document that makes the query logically
true.
 Called exact match.
 The retrieval results are usually quite poor
because term frequency is not considered.
CS583, Bing Liu, UIC 8
Vector space model
 Documents are also treated as a “bag” of words or
terms.
 Each document is represented as a vector.
 However, the term weights are no longer 0 or 1.
Each term weight is computed based on some
variations of TF or TF-IDF scheme.

 Term Frequency (TF) Scheme: The weight of a term


ti in document dj is the number of times that ti
appears in dj, denoted by fij. Normalization may also
be applied.
CS583, Bing Liu, UIC 9
TF-IDF term weighting scheme
 The most well known
weighting scheme
 TF: still term frequency
 IDF: inverse document
frequency.
N: total number of docs
dfi: the number of docs that ti
appears.
 The final TF-IDF term
weight is:

CS583, Bing Liu, UIC 10


Retrieval in vector space model
 Query q is represented in the same way or slightly
differently.
 Relevance of di to q: Compare the similarity of
query q and document di.
 Cosine similarity (the cosine of the angle between
the two vectors)

 Cosine is also commonly used in text clustering

CS583, Bing Liu, UIC 11


An Example
 A document space is defined by three terms:
 hardware, software, users
 the vocabulary
 A set of documents are defined as:
 A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
 A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
 A7=(1, 1, 1) A8=(1, 0, 1). A9=(0, 1, 1)

 If the Query is “hardware and software”


 what documents should be retrieved?

CS583, Bing Liu, UIC 12


An Example (cont.)
 In Boolean query matching:
 document A4, A7 will be retrieved (“AND”)
 retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
 In similarity matching (cosine):
 q=(1, 1, 0)
 S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
 S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
 S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
 Document retrieved set (with ranking)=
 {A4, A7, A1, A2, A5, A6, A8, A9}

CS583, Bing Liu, UIC 13


Okapi relevance method
 Another way to assess the degree of relevance is to
directly compute a relevance score for each
document to the query.
 The Okapi method and its variations are popular
techniques in this setting.

CS583, Bing Liu, UIC 14


Relevance feedback
 Relevance feedback is one of the techniques for
improving retrieval effectiveness. The steps:
 the user first identifies some relevant (Dr) and irrelevant
documents (Dir) in the initial list of retrieved documents
 the system expands the query q by extracting some
additional terms from the sample relevant and irrelevant
documents to produce qe
 Perform a second round of retrieval.
 Rocchio method (α, β and γ are parameters)

CS583, Bing Liu, UIC 15


Rocchio text classifier
 In fact, a variation of the Rocchio method above,
called the Rocchio classification method, can be
used to improve retrieval effectiveness too
 Rocchio classifier is constructed by producing a
prototype vector ci for each class i (relevant or
irrelevant in this case):

 In classification, cosine is used.

CS583, Bing Liu, UIC 16


Text pre-processing

 Word (term) extraction: easy


 Stopwords removal
 Stemming
 Frequency counts and computing TF-IDF
term weights.

CS583, Bing Liu, UIC 17


Stopwords removal
 Many of the most frequently used words in English are useless
in IR and text mining – these words are called stop words.
 the, of, and, to, ….

 Typically about 400 to 500 such words

 For an application, an additional domain specific stopwords list

may be constructed
 Why do we need to remove stopwords?
 Reduce indexing (or data) file size

stopwords accounts 20-30% of total word counts.


 Improve efficiency and effectiveness


 stopwords are not useful for searching or text mining

 they may also confuse the retrieval system.

CS583, Bing Liu, UIC 18


Stemming
 Techniques used to find out the root/stem of a
word. E.g.,
 user engineering
 users engineered
 used engineer
 using
 stem: use engineer
Usefulness:
 improving effectiveness of IR and text mining
 matching similar words
 Mainly improve recall
 reducing indexing size
 combing words with same roots may reduce indexing
size as much as 40-50%.

CS583, Bing Liu, UIC 19


Basic stemming methods
Using a set of rules. E.g.,
 remove ending
 if a word ends with a consonant other than s,
followed by an s, then delete s.
 if a word ends in es, drop the s.
 if a word ends in ing, delete the ing unless the remaining word
consists only of one letter or of th.
 If a word ends with ed, preceded by a consonant, delete the ed
unless this leaves only a single letter.
 …...
 transform words
 if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”

CS583, Bing Liu, UIC 20


Frequency counts + TF-IDF
 TF-Counts the number of times a word
occurred in a document.
 Using occurrence frequencies to indicate relative
importance of a word in a document.
 if a word appears often in a document, the document
likely “deals with” subjects related to the word.
 IDF-Counts the number of documents in the
collection that contains each word
 TF-IDF can be computed.

CS583, Bing Liu, UIC 21


Evaluation: Precision and Recall

 Given a query:
 Are all retrieved documents relevant?
 Have all the relevant documents been retrieved?
 Measures for system performance:
 The first question is about the precision of the
search
 The second is about the completeness (recall) of
the search.

CS583, Bing Liu, UIC 22


Precision-recall curve

CS583, Bing Liu, UIC 23


Example
Consider a document
collection D with 20
documents. Given a
query q, we know that
eight documents are
relevant to q. A retrieval
algorithm produces the
ranking (of all documents
in D)

CS583, Bing Liu, UIC 24


Rank precision

 Compute the precision values at some


selected rank positions.
 Mainly used in Web search evaluation.
 For a Web search engine, we can compute
precisions for the top 5, 10, 15, 20, 25 and 30
returned pages
 as the user seldom looks at more than 30 pages.
 Recall is not very meaningful in Web search.
 Why?
CS583, Bing Liu, UIC 25
Web Search as a huge IR system

 A Web crawler (robot) crawls the Web to


collect all the pages.
 Servers establish a huge inverted indexing
database and other indexing databases
 At query (search) time, search engines
conduct different types of vector query
matching.

CS583, Bing Liu, UIC 26


Different search engines
 The real differences among different search
engines are
 their index weighting schemes
 Including location of terms, e.g., title, body,
emphasized words, etc.
 their query processing methods (e.g., query
classification, expansion, etc)
 their ranking algorithms
 Few of these are published by any of the search
engine companies. They are tightly guarded
secrets.

CS583, Bing Liu, UIC 27


 Search Engine

Architecture of SE

How do search engines like Google work?

28
 Search Engine

Paid

Search Ads

Algorithmic results.
29
 Search Engine

Architecture
Sponsored Links

CG Appliance Express
Discount Appliances (650) 756-3931

User
Same Day Certified Installation
www.cgappliance.com
San Francisco-Oakland-San Jose,
CA

Miele Vacuum Cleaners


Miele Vacuums- Complete Selection
Free Shipping!
www.vacuums.com

Miele Vacuum Cleaners


Miele-Free Air shipping!
All models. Helpful advice.
www.best-vacuum.com

Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

Miele, Inc -- Anything else is a compromise


At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances.
Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ...

Web spider
www.miele.com/ - 20k - Cached - Similar pages

Miele
Welcome to Miele, the home of the very best appliances and kitchens in the world.
www.miele.co.uk/ - 3k - Cached - Similar pages

Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this


page ]
Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit
...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes.
www.miele.de/ - 10k - Cached - Similar pages

Herzlich willkommen bei Miele Österreich - [ Translate this page ]


Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch
weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ...
www.miele.at/ - 3k - Cached - Similar pages

Search

Indexer

The Web

Indexes Ad indexes


30
 Search Engine

Indexing Process

31
 Search Engine

Indexing Process
 Text acquisition
 identifies and stores documents for indexing

 Text transformation
 transforms documents into index terms or features

 Index creation
 takes index terms and creates data structures
(indexes) to support fast searching

32
Inverted index
 The inverted index of a document collection
is basically a data structure that
 attaches each distinctive term with a list of all
documents that contains the term.
 Thus, in retrieval, it takes constant time to
 find the documents that contains a query term.
 multiple query terms are also easy handle as we
will see soon.

CS583, Bing Liu, UIC 33


An example

CS583, Bing Liu, UIC 34


Index construction- Example

CS583, Bing Liu, UIC 35


Search using inverted index

Given a query q, search has the following steps:


 Step 1 (vocabulary search): find each
term/word in q in the inverted index.
 Step 2 (results merging): Merge results to
find documents that contain all or some of the
words/terms in q.
 Step 3 (Rank score computation): To rank
the resulting documents/pages, using,
 content-based ranking
 link-based ranking

CS583, Bing Liu, UIC 36


 Search Engine

Query Process

37
 Search Engine

Query Process
 User interaction
 supports creation and refinement of query, display of
results

 Ranking
 uses query and indexes to generate ranked list of
documents
 Evaluation
 monitors and measures effectiveness and efficiency
(primarily offline)

38
Indexing Process

Details: Text Acquisition


 Crawler
 Identifies and acquires documents for search engine
 Many types – web, enterprise, desktop
 Web crawlers follow links to find documents
 Must efficiently find huge numbers of web pages (coverage)
and keep them up-to-date (freshness)
 Single site crawlers for site search
 Topical or focused crawlers for vertical search
 Document crawlers for enterprise and desktop search
 Follow links and scan directories

39
Indexing Process

Web Crawler
 Starts with a set of seeds, which are a set of URLs given to it
as parameters
 Seeds are added to a URL request queue
 Crawler starts fetching pages from the request queue
 Downloaded pages are parsed to find link tags that might
contain other useful URLs to fetch
 New URLs added to the crawler’s request queue, or frontier
 Continue until no more new URLs or disk full

40
Indexing Process

Crawling picture

URLs crawled
and parsed
Unseen Web

Seed URLs frontier


pages
Web

41
Indexing Process

Crawling the Web

42
Indexing Process

Text Acquisition
 Feeds
 Real-time streams of documents
 e.g., web feeds for news, blogs, video, radio, tv

 RSS is common standard


 RSS “reader” can provide new XML documents to search
engine
 Conversion
 Convert variety of documents into a consistent text plus
metadata format
 e.g. HTML, XML, Word, PDF, etc. → XML

 Convert text encoding for different languages


 Using a Unicode standard like UTF-8

43
Indexing Process

Text Acquisition
 Document data store
 Stores text, metadata, and other related content for
documents
 Metadata is information about document such as type and
creation date
 Other content includes links, anchor text
 Provides fast access to document contents for search
engine components
 e.g. result list generation
 Could use relational database system
 More typically, a simpler, more efficient storage system is
used due to huge numbers of documents
44
Indexing Process

Text Transformation
 Parser
 Processing the sequence of text tokens in the document to
recognize structural elements
 e.g., titles, links, headings, etc.

 Tokenizer recognizes “words” in the text


 must consider issues like capitalization, hyphens,
apostrophes, non-alpha characters, separators
 Markup languages such as HTML, XML often used to specify
structure
 Tags used to specify document elements
 E.g., <h2> Overview </h2>
 Document parser uses syntax of markup language (or other
formatting) to identify structure

45
Indexing Process

Text Transformation
 Stopping
 Remove common words
 e.g., “and”, “or”, “the”, “in”

 Some impact on efficiency and effectiveness


 Can be a problem for some queries

 Stemming
 Group words derived from a common stem
 e.g., “computer”, “computers”, “computing”, “compute”

 Usually effective, but not for all queries


 Benefits vary for different languages

46
Indexing Process

Text Transformation
 Link Analysis
 Makes use of links and anchor text in web pages

 Link analysis identifies popularity and community


information
 e.g., PageRank
 Anchor text can significantly enhance the
representation of pages pointed to by links
 Significant impact on web search

47
Indexing Process

Text Transformation
 Information Extraction
 Identify classes of index terms that are important for
some applications
 e.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc.
 Classifier
 Identifies class-related metadata for documents
 i.e., assigns labels to documents
 e.g., topics, reading levels, sentiment
 Use depends on application

48
Summary

 Web Document Representation


 Boolean and Vector Space models
 Preprocessing
 Web search engine
 Architecture
 Overview of crawling

CS583, Bing Liu, UIC 49

You might also like