Unit1 Introduction


PANIMALAR ENGINEERING COLLEGE

DEPARTMENT OF CSE
IV YEAR - SEMESTER VIII
CS8080 INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION

Information Retrieval – Early Developments – The IR Problem – The User's
Task – Information versus Data Retrieval – The IR System – The Software
Architecture of the IR System – The Retrieval and Ranking Processes - The
Web – The e-Publishing Era – How the web changed Search – Practical Issues
on the Web – How People Search – Search Interfaces Today – Visualization in
Search Interfaces

1.1. Information Retrieval


IR deals with the representation, storage, organization of, and
access to information items
o Types of information items: documents, Web pages, online
catalogs, structured records, multimedia objects
Early goals of the IR area: indexing text and searching for useful
documents in a collection
Nowadays, research in IR includes:
o Modeling, Web search, text classification, systems
architecture, user interfaces, data visualization, filtering
and languages

1.2. Early Developments


For more than 5,000 years, man has organized information for later
retrieval and searching
o This has been done by compiling, storing, organizing, and
indexing papyrus, hieroglyphics, and books
For holding the various items, special purpose buildings called
libraries, or bibliothekes, are used
o The oldest known library was created in Ebla, in the Fertile
Crescent, between 3,000 and 2,500 BC
o By 300 BC, Ptolemy Soter, a Macedonian general, created the
Great Library at Alexandria
o Nowadays, libraries are everywhere
In 2008, more than 2 billion items were checked out from
libraries in the US—an increase of 10% over the previous
year
Since the volume of information in libraries is always growing, it is
necessary to build specialized data structures for fast search — the
indexes
For centuries indexes have been created manually as sets of categories,
with labels associated with each category
The advent of modern computers has allowed the construction of large
indexes automatically
Early Developments in IR

During the 1950s, research efforts in IR were initiated by pioneers such
as Hans Peter Luhn, Eugene Garfield, Philip Bagley, and Calvin
Mooers, who allegedly coined the term Information Retrieval
In 1962, Cyril Cleverdon published the Cranfield studies on retrieval
evaluation
In 1963, Joseph Becker and Robert Hayes published the first book on
IR
In the late 1960s, key research conducted by Karen Sparck Jones and
Gerard Salton, among others, led to the definition of the TF-IDF term
weighting scheme
o TF-IDF (term frequency-inverse document frequency) is a statistical
measure used to evaluate how important a word is to a document in a
collection or corpus
In 1971, Jardine and van Rijsbergen articulated the cluster
hypothesis
In 1978, the first ACM SIGIR International Conference on Information
Retrieval was held in Rochester
In 1979, van Rijsbergen published a classic book entitled Information
Retrieval, which focused on the Probabilistic Model
In 1983, Salton and McGill published a classic book entitled
Introduction to Modern Information Retrieval, which focused on the
Vector Model
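
The TF-IDF weighting scheme mentioned above can be sketched in a few lines. This is a minimal illustration, assuming one common variant (raw term count multiplied by log(N/df)); the notes do not fix a particular formula, and the function name tf_idf and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (assumed variant: tf * log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection.
docs = ["web search engine", "web page ranking", "library card catalog"]
w = tf_idf(docs)
```

In this toy collection, "search" occurs in one document while "web" occurs in two, so "search" receives the higher weight in the first document: rarer terms are treated as more discriminating.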

Libraries and Digital Libraries


Libraries were among the first institutions to adopt IR systems for
retrieving information
Initially, such systems consisted of an automation of existing
processes such as card catalog searching
Increased search functionality was then added
o Ex: subject headings, keywords, query operators
Nowadays, the focus has been on improved graphical interfaces,
electronic forms, hypertext features
IR at the Center of the Stage
Until recently, IR was an area of interest restricted mainly to
librarians and information experts
A single fact changed these perceptions—the introduction of the Web,
which has become the largest repository of knowledge in human
history
Due to its enormous size, finding useful information on the Web
usually requires running a search
And searching on the Web is all about IR and its technologies
Thus, almost overnight, IR has gained a place with other technologies at
the center of the stage

1.3. The IR Problem


Users of modern IR systems, such as search engine users, have
information needs of varying complexity
An example of complex information need is as follows:
Find all documents that address the role of the Federal
Government in financing the operation of the National Railroad
Transportation Corporation (AMTRAK)
This full description of the user information need is not necessarily
a good query to be submitted to the IR system
Instead, the user might want to first translate this information need
into a query
This translation process yields a set of keywords, or index terms,
which summarize the user information need
Given the user query, the key goal of the IR system is to retrieve
information that is useful or relevant to the user
That is, the IR system must rank the information items according to
a degree of relevance to the user query
The IR Problem
The key goal of an IR system is to retrieve all the items that are
relevant to a user query, while retrieving as few nonrelevant items as
possible
The notion of relevance is of central importance in IR
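
The two halves of this goal are usually quantified with the standard measures recall (how many of the relevant items were retrieved) and precision (how many of the retrieved items are relevant). The notes do not name these measures at this point, so the sketch below is an added illustration; the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compare a retrieved set against the set judged relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical answer set and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).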

1.4. The User’s Task


Consider a user who seeks information on a topic of their
interest
o This user first translates their information need into a query,
which requires specifying the words that compose the query
o In this case, we say that the user is searching or querying for
information of their interest
Consider now a user who has an interest that is either poorly
defined or inherently broad
o For instance, the user has an interest in car racing and
wants to browse documents on Formula 1 and Formula
Indy
o In this case, we say that the user is browsing or
navigating the documents of the collection

1.5. Information versus Data Retrieval


Data retrieval: the task of determining which documents of a
collection contain the keywords in the user query
Data retrieval system
o Ex: relational databases
o Deals with data that has a well defined structure and semantics
o A single erroneous object among a thousand retrieved
objects means total failure
Data retrieval does not solve the problem of retrieving
information about a subject or topic
                 Data retrieval                     IR
Data             Structured                         Unstructured
Fields           Clear semantics (SSN, age)         No fields (other than text)
Queries          Defined (relational algebra,       Free text ("natural language"),
                 SQL)                               Boolean
Recoverability   Critical (concurrency control,     Downplayed (treated as less
                 recovery, atomic operations)       important), though still an issue
Matching         Exact (results are always          Imprecise (need to measure
                 "correct")                         effectiveness)

1.6. The IR System


1.6.1. The Software Architecture of the IR System
Steps:
1. Assemble the document collection and store it in a central repository
(the collection can be private or crawled from the Web)
2. Index the documents in the central repository for fast retrieval
(index structure: inverted index)
3. Initiate the retrieval (searching) process
a) The user gives a query that reflects their information need
b) The user query is converted into a system query by parsing
and expanding
c) The system query is processed against the index to retrieve a
subset of all documents
d) The retrieved documents are ranked and the top-ranked documents
are returned to the user
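
The steps above can be sketched end to end on a toy collection. This is only an illustration under stated assumptions: the collection, the helper names (tokenize, build_index, search), and the scoring rule (count of query-term occurrences) are hypothetical, and query expansion from step 3b is left out.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Parse text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(collection):
    """Step 2: inverted index mapping each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, collection, query, top_k=3):
    """Step 3: parse the query, retrieve candidates from the index, rank them."""
    terms = tokenize(query)
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())  # documents with at least one query term

    def score(doc_id):
        # Toy relevance score: total occurrences of query terms in the document.
        tokens = tokenize(collection[doc_id])
        return sum(tokens.count(t) for t in terms)

    ranked = sorted(candidates, key=lambda d: (-score(d), d))
    return ranked[:top_k]

# Step 1: a hypothetical private collection in a "central repository".
collection = {
    "d1": "Federal funding of AMTRAK rail operations",
    "d2": "AMTRAK timetable and ticket prices",
    "d3": "History of the Great Library at Alexandria",
}
index = build_index(collection)
results = search(index, collection, "AMTRAK funding")
```

Document d1 matches both query terms and is ranked first, d2 matches one term, and d3 is never touched because the index is consulted instead of scanning every document.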

To improve the IR system, it must be evaluated
Evaluation procedure: comparing the set of results produced by the IR
system with the results suggested by human specialists
To improve ranking, collect feedback from the users and use it to
change the results
1.6.2. The Retrieval and Ranking Processes

The process of indexing , retrieval and ranking


User Interface manages interaction with the user:
Query input and document output.
Relevance feedback.
Visualization of results.
Indexing Process - constructs an inverted index of word to document
pointers
Retrieval Process - retrieves documents that contain a given query token
from the inverted index
Ranking Process - assigns a score to all retrieved documents according to
a relevance metric

Indexing Process & Retrieval Process


Text transformation (text operations) forms the index words (tokens)
o Stop-word removal (stop-words are words that are filtered out before or
after processing of natural language data)
o Stemming (removing the suffix from a word to reduce it to its root
word, e.g., in "flying", "ing" is removed and the root word "fly" remains)

Stop-words
Stop-word removal reduces the set of representative keywords of a
large collection
o Ex: "a", "and", "but", "how", "or"
o For example, the query "What is a motherboard?" is reduced to
"motherboard"
Stop-list: contains the stop-words, which are not to be used as index terms
o Prepositions, articles, pronouns
o Some adverbs and adjectives, and some frequent words (e.g., "document")
The removal of stop-words usually improves IR effectiveness

Reason for stemming
Different word forms may bear similar meaning (e.g., search, searching),
so a "standard" representation is created for them
Stemming reduces distinct words to their common grammatical root by
removing some word endings
o Ex: computer, compute, computes, computing, computed, and
computation are all reduced to comput
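
Both text operations can be illustrated with a short sketch. It assumes a tiny hypothetical stop-list (the words "what" and "is" are added to the examples above so that the motherboard query works) and a naive suffix-stripping rule; it is not the Porter stemmer or any other published algorithm.

```python
import re

# Hypothetical stop-list; "what" and "is" are added for this example.
STOP_WORDS = {"a", "and", "but", "how", "or", "what", "is"}
# Suffixes are tried longest first; a naive rule, not the Porter algorithm.
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")

def stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop-words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

family = ["computer", "compute", "computes", "computing", "computed", "computation"]
stems = {stem(w) for w in family}
terms = index_terms("What is a motherboard?")
```

All six word forms collapse to the single stem "comput", and the question is reduced to the single index term "motherboard". A rule this crude over-stems many English words, which is why practical systems use carefully designed stemmers.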

1.7. The Web


1.7.1. A Brief History
At the end of World War II, Vannevar Bush looked for applications of
new technologies to peace times
Bush first produced a report entitled Science, The Endless Frontier
o This report directly influenced the creation of the National
Science Foundation
Following that, he wrote As We May Think, a remarkable paper which
discussed new hardware and software gadgets
In Bush’s words
Whole new forms of encyclopedias will appear, ready-made with a
mesh of associative trails running through them, ready to be dropped
into the memex and there amplified
As We May Think influenced people like Douglas Engelbart, who
invented the computer mouse and introduced the concept of
hyperlinked texts
Ted Nelson, working in his Project Xanadu, pushed the concept
further and coined the term hypertext
A hypertext allows the reader to jump from one electronic document
to another, an important property for the problem that
Tim Berners-Lee faced in 1989
At the time, Berners-Lee worked in Geneva at CERN
(Conseil Européen pour la Recherche Nucléaire)
There, researchers who wanted to share documentation with others
had to reformat their documents to make them compatible with an
internal publishing system
Berners-Lee reasoned that it would be nice if the solution of sharing
documents were decentralized
He saw that a networked hypertext would be a good solution and
started working on its implementation
In 1990, Berners-Lee
o Wrote the HTTP protocol
o Defined the HTML language
o Wrote the first browser, which he called World Wide Web
o Wrote the first Web server
In 1991, he made his browser and server software available on the
Internet
The Web was born!

1.8. The e-Publishing Era


Since its inception, the Web has been a huge success
Well over 20 billion pages are now available and accessible on the Web
More than one fourth of humanity now access the Web on a regular
basis
Why is the Web such a success? What is the single most important
characteristic of the Web that makes it so revolutionary?
In search of an answer, let us delve into the life of a writer who lived at
the end of the 18th century
She finished the first draft of her novel in 1796
The first attempt of publication was refused without a reading
The novel was only published 15 years later!
She got a flat fee of £110, which meant that she was not paid anything
for the many subsequent editions
Further, her authorship was anonymized under the reference “By a
Lady”
We are talking of …
Pride and Prejudice is the second or third best loved novel in the UK
ever, after The Lord of the Rings and Harry Potter
It has been the subject of six TV series and five film versions
The last of these, starring Keira Knightley and Matthew Macfadyen,
grossed over 100 million dollars
Jane Austen published anonymously her entire life
Throughout the 20th century, her novels have never been out of print
Jane Austen was discriminated against because there was no freedom to
publish at the beginning of the 19th century
The Web, unleashed by the inventiveness of Tim Berners-Lee, changed
this once and for all
It did so by universalizing freedom to publish
The Web moved mankind into a new era, into a new time, into The e-
Publishing Era

1.9. How the web changed Search

Web search is today the most prominent application of IR and its
techniques: the ranking and indexing components of any search
engine are fundamentally IR pieces of technology
The first major impact of the Web on search is related to the
characteristics of the document collection itself
o The Web is composed of pages distributed over millions of sites
and connected through hyperlinks
o This requires collecting all documents and storing copies of
them in a central repository, prior to indexing
o This new phase in the IR process, introduced by the Web, is
called crawling
The second major impact of the Web on search is related to:
o The size of the collection
o The volume of user queries submitted on a daily basis
o As a consequence, performance and scalability have become
critical characteristics of the IR system
The third major impact: in a very large collection, predicting
relevance is much harder than before
o Fortunately, the Web also includes new sources of evidence
o Ex: hyperlinks and user clicks in documents in the answer set
The fourth major impact derives from the fact that the Web is also a
medium to do business
o The search problem has been extended beyond the seeking of text
information to also encompass other user needs
o Ex: the price of a book, the phone number of a hotel, the link
for downloading software
The fifth major impact of the Web on search is Web spam
o Web spam: abusive availability of commercial information
disguised in the form of informational content
o This difficulty is so large that today we talk of Adversarial Web
Retrieval
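
The crawling phase described under the first impact can be sketched as a breadth-first traversal of hyperlinks that stores a copy of every page it reaches. To stay runnable without network access, the sketch assumes a hypothetical in-memory "Web" (TOY_WEB) and a fetch function passed in by the caller; a real crawler would also need politeness delays, robots.txt handling, and URL normalization.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("home page", ["a.example/news", "b.example/"]),
    "a.example/news": ("news page", ["a.example/"]),
    "b.example/": ("another site", ["c.example/"]),
    "c.example/": ("leaf page", []),
}

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: text}, playing the role of the central repository."""
    repository = {}
    frontier = deque([seed])
    seen = {seed}  # avoids fetching the same URL twice
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        repository[url] = text  # store a copy prior to indexing
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

repository = crawl("a.example/", lambda url: TOY_WEB[url])
```

Starting from one seed page, the crawler reaches all four pages across three sites by following links, and the resulting repository is what the indexing step would consume.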

1.10. Practical Issues on the Web


Security
o Commercial transactions over the Internet are not yet a
completely safe procedure
Privacy
o Frequently, people are willing to exchange information as long as
it does not become public
Copyright and patent rights
o It is far from clear how the wide spread of data on the Web affects
copyright and patent laws in the various countries
Scanning, optical character recognition (OCR), and cross-language
retrieval

1.11. How People Search


The study of user interfaces for search focuses on
o the human users of search systems
o the search user interface, i.e., the window through which
search systems are seen
The role of the user interface is to aid the searchers'
understanding and expression of their information need
Further, the interface should help users
o formulate their queries
o select among available information sources
o understand search results
o keep track of the progress of their search
User interaction with search interfaces differs depending on
o the type of task
o the domain expertise of the information seeker
o the amount of time and effort available to invest in the process
Marchionini makes a distinction between information lookup and
exploratory search
Information lookup tasks
o are akin to fact retrieval or question answering
o can be satisfied by discrete pieces of information: numbers,
dates, names, or Web sites
o can work well for standard Web search interactions
Exploratory search is divided into learning and investigating tasks
Learning search
o requires more than single query-response pairs
o requires the searcher to spend time
scanning and reading multiple information items
synthesizing content to form new understanding
Investigating refers to a longer-term process which
o involves multiple iterations that take place over perhaps very
long periods of time
o may return results that are critically assessed before being
integrated into personal and professional knowledge bases
o may be concerned with finding a large proportion of the relevant
information available
Information seeking can be seen as being part of a larger process
referred to as sensemaking
Sensemaking is an iterative process of formulating a conceptual
representation from a large collection
Russell et al. observe that most of the effort in sensemaking goes
towards the synthesis of a good representation
Some sensemaking activities interweave search throughout, while
others consist of doing a batch of search followed by a batch of
analysis and synthesis
Examples of deep analysis tasks that require sensemaking (in
addition to search)
o the legal discovery process
o epidemiology (disease tracking)
o studying customer complaints to improve service
o obtaining business intelligence

Classic vs. Dynamic Model


Classic notion of the information seeking process
1. problem identification
2. articulation of information need(s)
3. query formulation
4. results evaluation

More recent models emphasize the dynamic nature of the search
process
o The users learn as they search
o Their information needs adjust as they see retrieval results and
other document surrogates
This dynamic process is sometimes referred to as the berry picking
model of search
The rapid response times of today’s Web search engines allow
searchers:
o to look at the results that come back
o to reformulate their query based on these results
This kind of behavior is a commonly-observed strategy within the
berry-picking approach
Sometimes it is referred to as orienteering
Jansen et al analyzed search logs and found that 52% of users
modified their queries
Some seeking models cast the process in terms of strategies and how
choices for next steps are made
In some cases, these models are meant to reflect conscious planning
behavior by expert searchers
In others, the models are meant to capture the less planned,
potentially more reactive behavior of a typical information seeker
Navigation vs. Search
Navigation: the searcher looks at an information structure and
browses among the available information
This browsing strategy is preferable when the information structure
is well-matched to the user's information need
o It is mentally less taxing to recognize a piece of information than
it is to recall it
o It works well only so long as appropriate links are available
If the links are not available, then the browsing experience might be
frustrating
Spool discusses an example of a user looking for a software driver for
a particular laser printer
Say the user first clicks on printers, then laser printers, then the
following sequence of links:
HP laser printers
HP laser printers model 9750
software for HP laser printers model 9750
software drivers for HP laser printers model 9750
software drivers for HP laser printers model 9750 for the Win98
operating system
This kind of interaction is acceptable when each refinement makes
sense for the task at hand
Search Process
Numerous studies have been made of people engaged in the search
process
The results of these studies can help guide the design of search
interfaces
One common observation is that users often reformulate their queries
with slight modifications
Another is that searchers often search for information that they have
previously accessed
o The users’ search strategies differ when searching over
previously seen materials
Researchers have developed search interfaces that support both query
history and revisitation
Studies also show that it is difficult for people to determine whether or
not a document is relevant to a topic
o The less users know about a topic, the poorer judges they are of
whether a search result is relevant to that topic
Other studies found that searchers tend to look at only the top-
ranked retrieved results
Further, they are biased towards thinking the top one or two results
are better than those beneath them
Studies also show that people are poor at estimating how much of the
relevant material they have found
Other studies have assessed the effects of knowledge of the search
process itself
These studies have observed that experts use different strategies than
novice searchers
For instance, Tabatabai et al found that
o expert searchers were more patient than novices
o this positive attitude led to better search outcomes

1.12. Search Interfaces Today


1.12.1. Introduction:

How does an information seeking session begin in online
information systems?
The most common way is to use a Web search engine
Another method is to select a Web site from a personal collection of
already-visited sites
o These are typically stored in the browser's bookmarks
Online bookmark systems are popular among a smaller segment of
users
Ex: Delicious.com
Web directories are also used as a common starting point, but
have been largely replaced by search engines
1.12.2. Query Specification
The primary methods for a searcher to express their information
need are either
entering words into a search entry form
selecting links from a directory or other information
organization display
For Web search engines, the query is specified in textual form
Typically, Web queries today are very short, consisting of one to
three words
Short queries reflect the standard usage scenario in which the user
tests the waters
If the results do not look relevant, then the user
reformulates their query
If the results are promising, then the user navigates to the
most relevant-looking Web site
This search behavior is a demonstration of the orienteering
strategy of Web search
Before the Web, search systems regularly supported Boolean
operators and command-based syntax
However, these are often difficult for most users to
understand
Jansen et al conducted a study over a Web log with 1.5M queries,
and found that
2.1% of the queries contained Boolean operators
7.6% contained other query syntax, primarily double-
quotation marks for phrases
White et al examined interaction logs of nearly 600,000 users, and
found that
1.1% of the queries contained one or more operators
8.7% of the users used an operator at any time
Web ranking has gone through three major phases
In the first phase, from approximately 1994 to 2000:
o Since the Web was much smaller then, complex queries were
less likely to yield relevant information
o Further, the pages retrieved did not necessarily contain all query
words
Around 1997, Google moved to conjunctive queries only
The other Web search engines followed, and conjunctive
ranking became the norm
Google also added term proximity information and page
importance scoring (PageRank)
As the Web grew, longer queries posed as phrases started to
produce highly relevant results
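
The page importance scoring mentioned above can be illustrated with the commonly published power-iteration formulation of PageRank. This is a sketch under stated assumptions, not a description of any search engine's production ranking: the damping factor 0.85, the iteration count, and the four-page link graph are all hypothetical choices, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a {page: [outlinks]} graph; returns {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
```

Page C, which receives links from three pages, ends up with the highest score, while D, which nobody links to, keeps only the baseline share. This is the sense in which hyperlinks act as a new source of evidence for ranking.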

1.12.3. Query Specification Interfaces

The standard interface for a textual query is a search box entry form
Studies suggest a relationship between query length and the width of
the entry form
o Results found that either small forms discourage long queries or
wide forms encourage longer queries
Some entry forms are followed by a form that filters the query in some
way
For instance, at yelp.com, the user can refine the search by location
using a second form

Notice that the yelp.com form also shows the user’s home location,
if it has been specified previously
Some search forms show hints on what kind of information should
be entered into each form
For instance, in zvents.com search, the first box is labeled "what
are you looking for?"
The previous example also illustrates specialized input types that
some search engines are supporting today
o The zvents.com site recognizes that words like “tomorrow”
are time-sensitive
o It also allows flexibility in the syntax of dates
To illustrate, searching for "comedy on wed" automatically
computes the date for the nearest future Wednesday
o This is an example of how the interface can be designed to
reflect how people think
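
How an interface might resolve such time-sensitive words can be sketched with the standard datetime module. The function name resolve_day and the rule that a weekday always maps to a strictly future date are assumptions made for illustration; the notes do not describe how zvents.com implements this.

```python
from datetime import date, timedelta

# Weekday abbreviations mapped to Python's numbering (Monday is 0).
WEEKDAYS = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4, "sat": 5, "sun": 6}

def resolve_day(word, today):
    """Map 'tomorrow' or a weekday name/abbreviation to a concrete future date."""
    word = word.strip().lower()
    if word == "tomorrow":
        return today + timedelta(days=1)
    target = WEEKDAYS[word[:3]]  # raises KeyError for unknown words
    # 1..7 days ahead, so a query issued on a Wednesday resolves to next week.
    days_ahead = (target - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)

monday = date(2024, 1, 1)  # 1 January 2024 was a Monday
wednesday = resolve_day("wed", monday)
```

From Monday 1 January 2024, "wed" resolves to 3 January 2024, and both "wed" and "Wednesday" are accepted because only the first three letters are used.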
Some interfaces show a list of query suggestions as the user types
the query
o This is referred to as auto-complete, auto-suggest, or
dynamic query suggestions
o Anick et al found that users clicked on dynamic Yahoo
suggestions one third of the time
Often the suggestions shown are those whose prefix matches the
characters typed so far
o However, in some cases, suggestions are shown that only
have interior letters matching
Further, suggestions may be shown that are synonyms of the
words typed so far
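
Prefix and interior matching can be sketched as follows. The candidate list stands in for one of the suggestion sources (here a hypothetical query history), and showing prefix matches before interior matches is an assumed ordering; synonym-based suggestions are not covered.

```python
def suggest(typed, candidates, limit=5):
    """Return prefix matches first, then candidates that match only in the interior."""
    typed = typed.lower()
    prefix = [c for c in candidates if c.lower().startswith(typed)]
    interior = [c for c in candidates
                if typed in c.lower() and not c.lower().startswith(typed)]
    return (prefix + interior)[:limit]

# Hypothetical query history used as the suggestion source.
history = ["star wars", "star trek", "dancing with the stars", "casablanca"]
suggestions = suggest("star", history)
```

Typing "star" suggests the two prefix matches followed by "dancing with the stars", which matches only on interior letters.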
Dynamic query suggestions, from Netflix.com

The dynamic query suggestions can be derived from several
sources, including:
o The user’s own query history
o A set of metadata that a Web site’s designer considers important
o All of the text contained within a Web site
Dynamic query suggestions, grouped by type, from NextBio.com:
1.12.4. Retrieval Results Display

When displaying search results, either
o the documents must be shown in full, or else
o the searcher must be presented with some kind of
representation of the content of those documents
The document surrogate refers to the information that
summarizes the document
o This information is a key part of the success of the search
interface
o The design of document surrogates is an active area of
research and experimentation
o The quality of the surrogate can greatly affect the perceived
relevance of the search results listing
In Web search, the page title is usually shown prominently, along
with the URL and other metadata
In search over information collections, metadata such as date
published and author are often displayed
Text summary (or snippet) containing text extracted from the
document is also critical
Currently, the standard results display is a vertical list of textual
summaries
This list is sometimes referred to as the SERP (Search Engine
Results Page)
In some cases the summaries are excerpts drawn from the full text
that contain the query terms
In other cases, specialized kinds of metadata are shown in addition
to standard textual results
o This technique is known as blended results or universal search
For example, a query on a term like “rainbow” may return sample
images as one entry in the results listing

A query on the name of a sports team might retrieve the latest
game scores and a link to buy tickets
Nielsen notes that in some cases the information need is satisfied
directly in the search results listing
o This makes the search engine an “answer engine”
Displaying the query terms in the context in which they appear in
the document
o Improves the user’s ability to gauge the relevance of the
results
o It is sometimes referred to as KWIC - keywords in context
o It is also known as query-biased summaries, query-oriented
summaries, or user-directed summaries
The visual effect of query term highlighting can also improve
usability of search results listings
o Highlighting can be shown both in document surrogates in
the retrieval results and in the retrieved documents
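
A keywords-in-context summary with highlighting can be sketched as below. The window size, the asterisk markup used for highlighting, and the choice to centre the snippet on the first hit are all assumptions for illustration; production systems choose passages with far more elaborate heuristics.

```python
import re

def kwic_snippet(text, query_terms, window=4):
    """Return the words around the first query-term hit, with hits marked as *word*."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    # Compare on a normalized form so punctuation and case do not block a match.
    normalized = [re.sub(r"\W", "", w).lower() for w in words]
    hits = [i for i, w in enumerate(normalized) if w in terms]
    if not hits:
        return " ".join(words[: 2 * window])  # fall back to the opening words
    start = max(0, hits[0] - window)
    end = min(len(words), hits[0] + window + 1)
    marked = [f"*{w}*" if normalized[i] in terms else w
              for i, w in enumerate(words[start:end], start=start)]
    return " ".join(marked)

text = "The Great Library at Alexandria was created by Ptolemy Soter, a Macedonian general."
snippet = kwic_snippet(text, ["alexandria"], window=2)
```

With a window of two words on each side, the snippet is "Library at *Alexandria* was created", which shows the query term together with enough surrounding text to judge relevance.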
Determining which text to place in the summary, and how much
text to show, is a challenging problem
Often the summaries contain all the query terms in close proximity
to one another
However, there is a trade-off between
o Showing contiguous sentences, to aid in coherence in the
result
o Showing sentences that contain the query terms
Some results suggest that it is better to show full sentences rather
than cut them off
o On the other hand, very long sentences are usually not
desirable in the results listing
Further, the kind of information to display should vary according
to the intent of the query
o Longer results are deemed better than shorter ones for
certain types of information need
o On the other hand, an abbreviated listing is preferable for
navigational queries
o Similarly, requests for factual information can be satisfied
with a concise results display
Other kinds of document information can be usefully shown in the
search results page
The page results below show figures extracted from journal articles
alongside the search results

1.12.5. Query Reformulation


There are tools to help users reformulate their query
One technique consists of showing terms related to the query or
to the documents retrieved in response to the query
A special case of this is spelling corrections or suggestions
Usually only one suggested alternative is shown: clicking on that
alternative re-executes the query
In the past, the search results were shown using the
purportedly incorrect spelling
Microsoft Live’s search results page for the query “IMF”
Term expansion: search interfaces are increasingly employing
related term suggestions
Log studies suggest that term suggestions are a somewhat heavily-
used feature in Web search
Jansen et al made a log study and found that 8% of queries were
generated from term suggestions
Anick et al found that 6% of users who were exposed to term
suggestions chose to click on them
Some query term suggestions are based on the entire search
session of the particular user
Others are based on behavior of other users who have issued the
same or similar queries in the past
o One strategy is to show similar queries by other users
o Another is to extract terms from documents that have been
clicked on in the past by searchers who issued the same
query
Relevance feedback is another method whose goal is to aid in
query reformulation
The main idea is to have the user indicate which documents are
relevant to their query
o In some variations, users also indicate which terms
extracted from those documents are relevant
The system then computes a new query from this information and
shows a new retrieval set
Nonetheless, this method has not been found to be successful from
a usability perspective
o Because of that, it does not appear in standard interfaces today
This stems from several factors
o People are not particularly good at judging document
relevance, especially for topics with which they are
unfamiliar
o The beneficial behavior of relevance feedback is inconsistent
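
The step in which the system computes a new query from the user's judgments can be illustrated with Rocchio's formula, a classic relevance feedback technique that these notes do not name; it is used here only as an example. The weights alpha, beta, and gamma are assumed values, and documents are represented as plain token lists.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted {term: weight} query; all inputs are token lists."""
    new_query = Counter()
    for term, count in Counter(query).items():
        new_query[term] += alpha * count
    # Move toward the average relevant document and away from the nonrelevant ones.
    for docs, sign, weight in ((relevant, 1, beta), (nonrelevant, -1, gamma)):
        for doc in docs:
            for term, count in Counter(doc).items():
                new_query[term] += sign * weight * count / len(docs)
    # Terms pushed to zero or below are dropped from the new query.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical feedback: one document marked relevant, one marked nonrelevant.
new_query = rocchio(
    query=["jaguar"],
    relevant=[["jaguar", "cat", "habitat"]],
    nonrelevant=[["jaguar", "car", "dealer"]],
)
```

The new query keeps "jaguar" with a higher weight and gains "cat" and "habitat", while "car" and "dealer" are excluded. The result depends entirely on the quality of the user's judgments, which ties in with the usability problems listed above.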

1.12.6. Organizing Search Results

Organizing results into meaningful groups can help users
understand the results and decide what to do next
Popular methods for grouping search results: category systems
and clustering
Category system: meaningful labels organized in such a way as to
reflect the concepts relevant to a domain
o Good category systems have the characteristics of being
coherent and relatively complete
o Their structure is predictable and consistent across
search results for an information collection
The most commonly used category structures are flat,
hierarchical, and faceted categories
Flat categories are simply lists of topics or subjects
o They can be used for grouping, filtering (narrowing), and sorting
sets of documents in search interfaces
Most Web sites organize their information into general categories
o Selecting that category narrows the set of information shown
accordingly
Some experimental Web search engines automatically organize
results into flat categories
o Studies using this kind of design have received positive user
responses (Dumais et al, Kules et al)
However, it can be difficult to find the right subset of categories to use
for the vast content of the Web
Rather, category systems seem to work better for more focused
information collections
In the early days of the Web, hierarchical directory systems such
as Yahoo’s were popular

Hierarchy can also be effective in the presentation of search
results over a book or other small collection
The Superbook system was an early search interface based on
this idea
In the Superbook system, the search results were shown in the
context of the table-of-contents hierarchy
The SuperBook interface for showing retrieval results in context
An alternative representation is the faceted metadata
Unlike flat categories, faceted metadata allow the assignment of
multiple categories to a single item
Each category corresponds to a different facet (dimension or
feature type) of the collection of items
The figure below shows an example of faceted navigation

Clustering refers to the grouping of items according to some
measure of similarity
It groups together documents that are similar to one another but
different from the rest of the collection
o Such as all the documents written in Japanese that appear in a
collection of primarily English articles
The greatest advantage of clustering is that it is fully automatable
The disadvantages of clustering include
o an unpredictability in the form and quality of results
o the difficulty of labeling the groups
o the counter-intuitiveness of cluster sub-hierarchies
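The idea of grouping by similarity can be sketched with a very simple single-pass scheme. This is one possible technique, not the method used by Findex or Clusty: the Jaccard measure, the threshold value, and the sample documents are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Similarity of two token sets: size of the overlap over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Single-pass clustering: put each document in the first cluster whose
    seed document is similar enough, otherwise start a new cluster."""
    token_sets = [set(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a list of indexes into docs; first is the seed
    for i, tokens in enumerate(token_sets):
        for members in clusters:
            if jaccard(tokens, token_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical result snippets for an ambiguous query.
DOCS = [
    "senate votes on budget bill",
    "budget bill passes senate vote",
    "roman senate history and politics",
    "history of the roman republic senate",
]
```

The sketch also shows the disadvantages listed above: the groups come back as bare lists of document indexes with no labels, and a different threshold or document order can change the result.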
Output produced using Findex clustering
Cluster output on the query “senate”, from Clusty.com

1.13. Visualization in Search Interfaces


Experimentation with visualization for search has been primarily
applied in the following ways
o Visualizing Boolean syntax
o Visualizing query terms within retrieval results
o Visualizing relationships among words and documents
o Visualization for text mining

1.13.1. Visualizing Boolean syntax


Boolean query syntax is difficult for most users and is rarely used
in Web search
For many years, researchers have experimented with how to
visualize Boolean query specification
A common approach is to show Venn diagrams
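Each region of a two-circle Venn diagram corresponds to one Boolean combination of the two query terms. The sketch below computes those regions with set operations; the tiny inverted index and the document ids are invented for illustration.

```python
# Hypothetical inverted index: term -> set of document ids (invented data).
INDEX = {
    "cat": {1, 2, 3, 5},
    "dog": {2, 3, 4},
}
ALL_DOCS = {1, 2, 3, 4, 5, 6}


def venn_regions(index, all_docs, a, b):
    """Return the four regions of a two-term Venn diagram as doc-id sets."""
    docs_a, docs_b = index[a], index[b]
    return {
        f"{a} AND {b}": docs_a & docs_b,          # the overlap of the circles
        f"{a} AND NOT {b}": docs_a - docs_b,      # left circle only
        f"{b} AND NOT {a}": docs_b - docs_a,      # right circle only
        f"NOT ({a} OR {b})": all_docs - (docs_a | docs_b),  # outside both
    }
```

The four regions are disjoint and together cover the whole collection, which is what lets a user pick a region visually instead of typing the Boolean expression.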
A more flexible version of this idea was seen in the VQuery
system, proposed by Steve Jones
The VQuery interface for Boolean query specification
1.13.2. Visualizing query terms within retrieval results

Understanding the role of the query terms within the retrieved
docs can help relevance assessment
Experimental visualizations have been designed that make this
role more explicit
In the TileBars interface, for instance, documents are shown as
horizontal glyphs
The locations of the query term hits are marked along the glyph
The user is encouraged to break the query into its different facets,
with one concept per line
Then, the lines show the frequency of occurrence of query terms
within each topic
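A rough text-only sketch of the data behind such a glyph, assuming the document is cut into equal-sized word windows rather than true topic segments. The function names, the window count, and the shading characters are illustrative choices, not part of the TileBars design.

```python
def tilebar(text, facets, n_tiles=4):
    """Return one row of hit counts per facet, one column per text segment.

    facets is a list of term lists, one list per query concept (lowercase).
    """
    words = text.lower().split()
    size = max(1, -(-len(words) // n_tiles))  # ceiling division
    tiles = [words[i:i + size] for i in range(0, len(words), size)]
    tiles += [[] for _ in range(n_tiles - len(tiles))]  # pad short documents
    return [
        [sum(w in terms for w in tile) for tile in tiles]
        for terms in facets
    ]


def render(rows, shades=" .:#"):
    """Draw each row as a text glyph; later characters mean more hits."""
    return ["".join(shades[min(c, len(shades) - 1)] for c in row) for row in rows]


# Invented sample text with two query concepts, one per line of the glyph.
TEXT = ("cancer research funding grows while cancer prevention lags "
        "and funding cuts loom")
ROWS = tilebar(TEXT, [["cancer"], ["funding", "cuts"]])
```

Here `ROWS` has one row per concept and one column per segment, so a reader can see at a glance where in the document each concept occurs and where the concepts overlap.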
The TileBars interface

Other approaches include placing the query terms in bar charts,
scatter plots, and tables
A usability study by Reiterer et al compared five views:
o a standard Web search engine-style results listing
o a list view showing titles, document metadata, and a graphic
showing locations of query terms
o a color TileBars-like view
o a color bar chart view like that of Veerasamy & Belkin
o a scatter plot view plotting relevance scores against date of
publication
Field-sortable search results view

Colored TileBars view

When asked for subjective responses, the 40 participants of the
study preferred, on average, in this order:
o Field-sortable view first
o TileBars
o Web-style listing
The bar chart and scatter plot received negative responses
Another variation on the idea of showing query term hits within
documents is to show thumbnails
o Thumbnails are miniaturized rendered versions of the visual
appearance of the document
However, Czerwinski et al found that thumbnails are no better
than blank squares for improving search results
The negative study results may stem from a problem with the size
of the thumbnails
o Woodruff et al show that making the query terms more visible via
highlighting within the thumbnail improves its usability
Textually enhanced thumbnails

1.13.3. Visualizing relationships among words and documents

Numerous works proposed variations on the idea of placing words
and docs on a two-dimensional canvas
In these works, proximity of glyphs represents semantic
relationships among the terms or documents
An early version of this idea is the VIBE interface
o Documents containing combinations of the query terms are placed
midway between the icons representing those terms
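One way to realize this placement is to put each document at the weighted average of the positions of the term icons it matches. The weighting by term frequency is an assumption made here (with equal weights the document lands exactly midway), and the anchor coordinates and names are invented.

```python
def vibe_position(doc_weights, anchors):
    """Place a document at the weighted average of its terms' anchor points.

    doc_weights maps term -> weight (for example a term frequency);
    anchors maps term -> (x, y) position of that term's icon.
    Returns None when the document contains none of the anchored terms.
    """
    matched = [(w, anchors[t]) for t, w in doc_weights.items() if t in anchors]
    total = sum(w for w, _ in matched)
    if total == 0:
        return None
    x = sum(w * pos[0] for w, pos in matched) / total
    y = sum(w * pos[1] for w, pos in matched) / total
    return (x, y)


# Hypothetical icon positions for three query terms.
ANCHORS = {"senate": (0.0, 0.0), "budget": (10.0, 0.0), "rome": (5.0, 10.0)}
```

A document mentioning "senate" and "budget" equally often is drawn halfway between the two icons, while one that mentions "senate" three times as often is pulled toward the "senate" icon.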
The Aduna Autofocus and the Lyberworld projects presented a 3D
version of the ideas behind VIBE
The VIBE display

Another idea is to map docs or words from a very high-dimensional
term space down into a 2D plane
o The docs or words are then positioned within that plane and drawn in 2D or 3D
This variation on clustering can be applied to
o documents retrieved as a result of a query
o a pre-processed set of documents, within which the documents
that match a query are highlighted
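The notes do not say which projection is used. As one possible sketch, the code below maps term-count vectors to 2D with a small principal component analysis computed by power iteration in plain Python. The choice of PCA, the iteration count, the function names, and the sample vectors are all assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leading_direction(rows, iters=200):
    """Power iteration for the direction of greatest variance in centred rows."""
    dim = len(rows[0])
    v = [1.0 / (i + 1) for i in range(dim)]  # fixed start vector (assumption)
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]
        # Multiply by X^T X without building the matrix explicitly.
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def project_2d(vectors):
    """Return one (x, y) point per input vector."""
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    rows = [[x - m for x, m in zip(v, mean)] for v in vectors]
    first = leading_direction(rows)
    # Remove the first direction from the data, then find the next one.
    rest = [[x - dot(r, first) * f for x, f in zip(r, first)] for r in rows]
    second = leading_direction(rest)
    return [(dot(r, first), dot(r, second)) for r in rows]


# Hypothetical term counts: two docs on one topic, two on another.
VECTORS = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
```

Documents with similar term counts end up close together in the plane, which is the property that starfield-style displays rely on.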
InfoSky and xFIND’s VisIslands are two variations on these
starfield displays
InfoSky, from Jonker et al
xFIND’s VisIslands, from Andrews et al

These views are relatively easy to compute and can be visually
striking
However, evaluations that have been conducted so far provide
negative evidence as to their usefulness
o The main problem is that the contents of the documents are not
visible in such views
A more promising application of this kind of idea is in the layout of
thesaurus terms, in a small network graph
o Ex: Visual Wordnet
The Visual Wordnet view of the WordNet lexical thesaurus
1.13.4. Visualization for text mining

Visualization is also used for purposes of analysis and exploration
of textual data
Visualizations such as the Word Tree show a piece of a text
concordance
o It allows the user to view which words and phrases commonly
precede or follow a given word
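The counts behind such a concordance view can be sketched as follows. This only tallies neighbouring words and does not build or draw the tree itself; the sample text and function names are invented for illustration.

```python
from collections import Counter


def following_phrases(text, word, width=2):
    """Count the phrases of up to `width` words that follow `word`."""
    tokens = text.lower().split()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            phrase = " ".join(tokens[i + 1:i + 1 + width])
            if phrase:
                counts[phrase] += 1
    return counts


def preceding_words(text, word):
    """Count the single words that come directly before `word`."""
    tokens = text.lower().split()
    return Counter(tokens[i - 1] for i, tok in enumerate(tokens)
                   if tok == word and i > 0)


# Invented sample text.
TEXT = "i have a dream that one day i have a plan and i have seen it"
```

Sorting the counts with `most_common()` gives the branches in order of frequency, which a tree-style display could then show in proportionally larger type.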
Another example is the NameVoyager, which shows frequencies of
names for U.S. children across time
The Word Tree visualization, on Martin Luther King’s I have a
dream speech, from Wattenberg et al

The popularity of baby names over time (names beginning with
JA), from babynamewizard.com
Visualization is also used in search interfaces intended for
analysts
An example is the TRIST information triage system, from Proulx et
al
In this system, search results are represented as document icons
o Thousands of documents can be viewed in one display
It supports multiple linked dimensions that allow for finding
characteristics and correlations among the docs
Its designers won the IEEE Visual Analytics Science and
Technology (VAST) contest for two years running
The TRIST interface with results for queries related to Avian
Flu
