ISR Lab Manual
LABORATORY MANUAL
Semester – II
VISION
MISSION
• We believe in and work for the holistic development of students and teachers.
• We strive to achieve this by imbibing a unique value system, transparent work culture,
excellent academic and physical environment conducive to learning, creativity and
technology transfer.
II. IT Graduates will function effectively as individuals and team members, growing
into technical and leadership roles.
III. Graduates of the programme will pursue the continuous learning required to adapt and flourish in ever-changing scenarios and to build careers in IT or non-IT professions.
Program Outcomes (PO’s):
POs are statements that describe what students are expected to know and be able to do upon graduating from the program. These relate to the skills, knowledge, analytical ability, attitude and behavior that students acquire through the program.
b) Problem analysis: Graduates will be able to carry out identification and formulation of
the problem statement by requirement engineering and literature survey.
e) Modern tool usage: Graduates will be able to use the techniques, skills, modern IT
engineering tools necessary for engineering practice.
f) The engineer and society: Graduates will be able to apply reasoning and knowledge to
assess global and societal issues
h) Ethics: Graduates will be able to understand professional and ethical responsibility.
i) Individual and team work: Graduates will be able to function effectively as an individual, and as a team member or leader in multi-disciplinary teams.
j) Communication: Graduates will be able to communicate effectively and make effective
documentations and presentations.
k) Project Management and Finance: Graduates will be able to apply and demonstrate
engineering and management principles in project management as a member or leader.
l) Life-long Learning: Graduates will be able to recognize the need for continuous learning
and to engage in life-long learning.
3. An ability to use systems for securely processing, storing, retrieving and transmitting
information
Course Objectives:
Course Outcomes:
By the end of the course, students should be able to,
1. Understand the concept, data structure and preprocessing algorithms of Information retrieval.
2. Deal with storage and retrieval process of text and multimedia data.
3. Evaluate performance of any information retrieval system.
4. Design user interfaces.
5. Understand importance of recommender system (Take decision on design parameters of
recommender system).
6. Understand concept of multimedia and distributed information retrieval.
7. Map the concepts of the subject on recent developments in the Information retrieval field.
1. Gain a solid foundation in the design and development of software applications useful to society.
2. Develop programming skills.
CERTIFICATE
has completed all the practical work in the Information Storage and Retrieval Lab [414464B]
satisfactorily, as prescribed by Savitribai Phule Pune University, Pune in the academic year
Place:
Date:
INDEX

Sr. No   Title of Experiment                                        Date of Performance   Date of Submission   Marks Obtained (10)   Signature of Faculty
1        Implementation of Conflation Algorithm
2        Implementation of Single Pass Algorithm for Clustering
3        Implementation of Inverted File
4        Implementation of feature extraction from 2D image
LAB INNOVATION
4. Calculation of precision and recall for some set of documents and queries
We will be using the Linux command line tool, the Terminal, in order to compile a simple C
program. To open the Terminal, you can use the Ubuntu Dash or the Ctrl+Alt+T shortcut.
In order to compile and execute a C program, you need to have the essential packages installed
on your system. Enter the following command as root in your Linux Terminal:
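On Ubuntu, the compiler and related tools are typically provided by the build-essential package, which can be installed with:

$ sudo apt-get install build-essential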
You will be asked to enter your password; the installation process will begin after that.
Please make sure that you are connected to the internet.
#include<stdio.h>
int main()
{
printf("\nA sample C program\n\n");
return 0;
}
Then save the file with .c extension. In this example, I am naming my C program as
sampleProgram.c
Alternatively, you can write the C program through the Terminal in gedit as follows:
$ gedit sampleProgram.c
This will create a .c file where you can write and save a program.
In your Terminal, enter the following command in order to make an executable version of the
program you have written:
Syntax:
$ gcc [programName].c -o programName
Example:
$ gcc sampleProgram.c -o sampleProgram
Make sure your program is located in your Home folder. Otherwise, you will need to specify
appropriate paths in this command.
The final step is to run the compiled C program. Use the following syntax to do so:
$ ./programName
Example:
$ ./sampleProgram
You can see how the program executes in the above example, printing the text we asked it to display.
Marks: / 10
Objectives:
To study Conflation Algorithm & Document Representative
Outcomes:
Theory:
Document Representative:
Documents in a collection are frequently represented through a set of index terms or keywords.
Such keywords might be extracted directly from the text of the document or might be specified
by a human subject. Modern computers are making it possible to represent a document by its full
set of words. With very large collections, however, even modern computers might have to reduce
the set of representative keywords. This can be accomplished through the elimination of stop
words (such as articles and connectives), the use of stemming (which reduces distinct words to
their common grammatical root), and the identification of noun groups (which eliminates
adjectives, adverbs, and verbs). Further, compression might be employed. These operations are
called text operations (or transformations). The full text is clearly the most complete logical view
of a document but its usage usually implies higher computational costs. A small set of categories
(generated by a human specialist) provides the most concise logical view of a document but its
usage might lead to retrieval of poor quality. Several intermediate logical views (of a document)
might be adopted by an information retrieval system, as illustrated in the accompanying figure.
Besides adopting any of the intermediate representations, the retrieval system might also
recognize the internal structure normally present in a document. This information on the
structure of the document might be quite useful and is required by structured text retrieval
models. As illustrated in the figure, we view the issue of logically representing a document as a continuum in which the logical view of a document might shift (smoothly) from a full-text representation to a higher-level representation specified by a human subject.
The document representative is one consisting simply of a list of class names, each name representing a class of words occurring in the total input text. A document will be indexed by a name if one of its significant words occurs as a member of that class.
A conflation algorithm cannot make all such distinctions perfectly; we put up with a certain proportion of errors and assume (correctly) that they will not degrade retrieval effectiveness too much.
The final output from a conflation algorithm is a set of classes, one for each stem detected. A
class name is assigned to a document if and only if one of its members occurs as a significant
word in the text of the document. A document representative then becomes a list of class names.
These are often referred to as the document's index terms or keywords.
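To make the idea concrete, here is a minimal C sketch of such a conflation pass. It is not the Porter algorithm: it merely lowercases each word, trims trailing punctuation, drops a few assumed stop words and blindly strips some common English suffixes, printing one class name (stem) per line.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical stop-word list; a real system would use a much larger one. */
static const char *stop_words[] = { "the", "a", "an", "and", "of", "is", "to", NULL };

static int is_stop_word(const char *w) {
    for (int i = 0; stop_words[i] != NULL; i++)
        if (strcmp(w, stop_words[i]) == 0)
            return 1;
    return 0;
}

/* Very crude suffix stripping; a real conflation algorithm (e.g. Porter's)
 * applies context-sensitive rules rather than blind truncation. */
static void strip_suffix(char *w) {
    static const char *suffixes[] = { "ing", "ed", "es", "s", NULL };
    size_t len = strlen(w);
    for (int i = 0; suffixes[i] != NULL; i++) {
        size_t slen = strlen(suffixes[i]);
        if (len > slen + 2 && strcmp(w + len - slen, suffixes[i]) == 0) {
            w[len - slen] = '\0';
            return;
        }
    }
}

int main(void) {
    char word[64];
    /* Read whitespace-separated words; print one class name (stem) per line. */
    while (scanf("%63s", word) == 1) {
        size_t n;
        for (char *p = word; *p; p++)
            *p = (char)tolower((unsigned char)*p);
        n = strlen(word);                       /* trim trailing punctuation */
        while (n > 0 && !isalpha((unsigned char)word[n - 1]))
            word[--n] = '\0';
        if (n == 0 || is_stop_word(word))
            continue;
        strip_suffix(word);
        printf("%s\n", word);
    }
    return 0;
}

Run on a document (for example ./conflate < doc1.txt | sort -u, where the file names are chosen arbitrarily), the sorted unique output is a rough document representative.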
B. Viva Questions:
4. Why are normalized versions of the simple matching coefficient used as measures of association?
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO2 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO3 - - - - - - - - - - - -
CO4 - - - - - - - - - - - -
CO5 - - - - - - - - - - - -
CO6 - - - - - - - - - - - -
CO7 - - - - - - - - - - - -
Marks: / 10
Objectives:
Theory: -
Clustering: A basic assumption in retrieval systems is that documents relevant to a request are separated from those which are not relevant, i.e. that the relevant documents are more like one another than they are like non-relevant documents. Whether this is true for a collection can be tested as follows. Compute the association between all pairs of documents:
(a) both of which are relevant to a request, and
(b) one of which is relevant and the other non-relevant.
If the hypothesis holds, the associations in (a) will on average be higher than those in (b).
Cluster Hypothesis: closely associated documents tend to be relevant to the same requests.
Single Pass Algorithm:
1. The object descriptions are processed serially;
2. The first object becomes the cluster representative of the first cluster;
3. Each subsequent object is matched against all cluster representatives existing at
its processing time;
4. A given object is assigned to one cluster (or more if overlap is allowed) according
to some condition on the matching function;
5. When an object is assigned to a cluster the representative for that cluster is
recomputed;
6. If an object fails a certain test it becomes the cluster representative of a new
cluster.
Algorithm:
1. Input minimum five conflated files.
2. Define cluster one and initialize its first object as the cluster representative.
3. Calculate the Dice coefficient between the cluster representative and the next object.
4. If the matching coefficient is greater than the threshold, add the object to the defined cluster; otherwise create a new cluster with that object as its representative, as sketched below.
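A minimal C sketch of steps 2-4 follows. The conflated documents are hard-coded as small term sets, the Dice coefficient 2 * |X intersect Y| / (|X| + |Y|) is used as the matching function, the threshold value is an assumption, and the cluster representative is simply the first object of each cluster (it is not recomputed as in step 5 of the single-pass description).

#include <stdio.h>
#include <string.h>

#define MAX_TERMS 10
#define MAX_DOCS  5

/* Hard-coded "conflated" documents: each is a small set of index terms. */
static const char *docs[MAX_DOCS][MAX_TERMS] = {
    { "inform", "retriev", "index", NULL },
    { "inform", "retriev", "queri", NULL },
    { "imag", "featur", "textur", NULL },
    { "imag", "histogram", "featur", NULL },
    { "cluster", "retriev", "index", NULL }
};

static int count_terms(const char **d) {
    int n = 0;
    while (d[n] != NULL) n++;
    return n;
}

/* Dice coefficient between two term sets: 2 * |intersection| / (|X| + |Y|). */
static double dice(const char **a, const char **b) {
    int common = 0, na = count_terms(a), nb = count_terms(b);
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (strcmp(a[i], b[j]) == 0) { common++; break; }
    return (2.0 * common) / (na + nb);
}

int main(void) {
    const double threshold = 0.4;        /* assumed matching threshold */
    int rep[MAX_DOCS];                   /* index of each cluster's representative */
    int nclusters = 0;

    for (int d = 0; d < MAX_DOCS; d++) {
        int assigned = -1;
        for (int c = 0; c < nclusters; c++) {
            if (dice(docs[d], docs[rep[c]]) >= threshold) { assigned = c; break; }
        }
        if (assigned < 0) {              /* start a new cluster with this object */
            rep[nclusters] = d;
            assigned = nclusters++;
        }
        printf("document %d -> cluster %d\n", d + 1, assigned + 1);
    }
    return 0;
}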
B. Viva Questions:
3. Boolean Search
4. What is the multi-pass clustering technique?
Sr. No   Name of Experiment                                        PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
01       Implementation of Single Pass Algorithm for Clustering    ✔   ✔   ✔   ✔   -   -   -   -   -   -    -    -
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ - - - - - - - -
CO2 ✔ ✔ ✔ ✔ - - - - - - - -
CO3 - - - - - - - - - - - -
CO4 - - - - - - - - - - - -
CO5 - - - - - - - - - - - -
CO6 - - - - - - - - - - - -
CO7 - - - - - - - - - - - -
Marks: / 10
Objective: -
To study Indexing, Inverted Files and searching information with the help of inverted file
Outcomes:
At the end of the assignment the students should have
1. Understood use of indexing in fast retrieval
2. Understood working of inverted index
Theory:
Indexing
The most basic way to answer a query is to scan the text sequentially. Sequential or online text searching involves finding the occurrences of a pattern in a text. Online searching is appropriate when the text is small, and it is the only choice if the text collection is very volatile or the index space overhead cannot be afforded. A second option is to build data structures over the text to speed up the search. It is worthwhile building and maintaining an index when the text collection is large and semi-static. Semi-static collections can be updated at reasonably regular intervals, but they are not deemed to support thousands of insertions of single words per second. This is the case for most real text databases, not only dictionaries or other slow-growing literary works. There are many indexing techniques.
Three of them are inverted files, suffix arrays and signature files.
Inverted Files:
An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the matching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the occurrences. These positions can refer to words or characters. Word positions simplify phrase and proximity queries, while character positions facilitate direct access to the matching text position.
Searching with the help of inverted file:
The search algorithm on an inverted index has three steps.
1. Vocabulary Search
2. Retrieval of occurrence
3. Manipulation of occurrences
Single-word queries can be searched using any suitable data structure to speed up the search, such as hashing, tries, or B-trees. The first two give O(m) search cost. However, simply storing the words in lexicographical order is cheaper in space and very competitive in performance, since a word can be binary searched at O(log n) cost. Prefix and range queries can also be solved with binary search, tries, or B-trees, but not with hashing. If the query is formed by single words, then the process ends by delivering the list of occurrences. Context queries are more difficult to solve with inverted indices. Each element must be searched separately and a list generated for each one. Then, the lists of all elements are traversed in synchronization to find places where all the words appear in sequence (for a phrase) or appear close enough (for proximity). If one list is much shorter than the others, it may be better to binary search its elements in the longer lists instead of performing a linear merge. If block addressing is used, it is necessary to traverse the blocks for these queries, since the position information is needed. It is then better to intersect the lists to obtain the blocks which contain all the searched words and then sequentially search for the context query in those blocks. Some care has to be exercised at block boundaries, since they can split a match.
Example:
Text (character positions are counted from 1; the occurrences below give the position at which each word starts):
This is a text. A text has many words. Words are made from letters.
Inverted Index:
Vocabulary   Occurrences
letters      60 ...
made         50 ...
many         28 ...
text         11, 19 ...
words        33, 40 ...
Algorithm
1. Input the conflated file
2. Build the index file for input file
3. Input the query
4. Print the index file and result of query
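A compact C sketch of this algorithm is given below. It indexes a hard-coded copy of the example text (every word, with no stop-word removal), records 1-based character positions as in the example above, prints the index and then answers a single-word query; the query word "text" is chosen arbitrarily.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAX_WORDS 50
#define MAX_OCC   10

/* One vocabulary entry: a word and the character positions where it starts. */
struct entry {
    char word[32];
    int  pos[MAX_OCC];
    int  count;
};

static struct entry vocab[MAX_WORDS];
static int nvocab = 0;

static void add_occurrence(const char *w, int position) {
    for (int i = 0; i < nvocab; i++) {
        if (strcmp(vocab[i].word, w) == 0) {
            if (vocab[i].count < MAX_OCC)
                vocab[i].pos[vocab[i].count++] = position;
            return;
        }
    }
    strcpy(vocab[nvocab].word, w);
    vocab[nvocab].pos[0] = position;
    vocab[nvocab].count = 1;
    nvocab++;
}

int main(void) {
    const char *text = "This is a text. A text has many words. Words are made from letters.";
    char w[32];
    int i = 0, len = (int)strlen(text);

    /* Build the index: scan the text, record each word with its character offset. */
    while (i < len) {
        while (i < len && !isalpha((unsigned char)text[i])) i++;
        int start = i, k = 0;
        while (i < len && isalpha((unsigned char)text[i]) && k < 31)
            w[k++] = (char)tolower((unsigned char)text[i++]);
        w[k] = '\0';
        if (k > 0)
            add_occurrence(w, start + 1);   /* positions counted from 1, as in the example */
    }

    /* Print the index, then answer a single-word query. */
    for (int j = 0; j < nvocab; j++) {
        printf("%-10s", vocab[j].word);
        for (int k = 0; k < vocab[j].count; k++)
            printf(" %d", vocab[j].pos[k]);
        printf("\n");
    }
    const char *query = "text";
    for (int j = 0; j < nvocab; j++) {
        if (strcmp(vocab[j].word, query) == 0) {
            printf("query '%s' occurs at position(s):", query);
            for (int k = 0; k < vocab[j].count; k++)
                printf(" %d", vocab[j].pos[k]);
            printf("\n");
        }
    }
    return 0;
}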
B. Viva Questions:
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ - - - - - - - -
CO2 ✔ ✔ ✔ ✔ - - - - - - - -
CO3 - - - - - - - - - - - -
CO4 - - - - - - - - - - - -
CO5 - - - - - - - - - - - -
CO6 - - - - - - - - - - - -
CO7 - - - - - - - - - - - -
CO and PSO Mapping:
Marks: / 10
Aim: To implement a program for feature extraction from 2D color images (any features, such as shape, texture, size, owner, type of file, etc.).
Objective:
To study the feature extraction process for 2D color images for information retrieval
Outcomes:
At the end of the assignment the students should have
1. Understood the feature extraction process and its applications
Input:
Image file
Output:
Features of Image file
Theory:
Introduction:
Technology determines the types and amounts of information we can access. Currently, a large
fraction of information originates in silicon. Cheap, fast chips and smart algorithms are helping digital
data processing take over all sorts of information processing. Consequently, the volume of digital data
surrounding us increases continuously. However, an information-centric society has additional
requirements besides the availability and capability to process digital data. We should also be able to
find the pieces of information relevant to a particular problem. Having the answer to a question but not
being able to find it is equivalent to not having it at all. The increased volume of information and the
wide variety of data types make finding information a challenging task. Current searching methods and
algorithms are based on assumptions about technology and goals that seemed reasonable before the
widespread use of computers. However, these assumptions no longer hold in the context of information
retrieval systems. The pattern originated in the information retrieval domain.
However, information retrieval has expanded into other fields like office automation, genome
databases, fingerprint identification, medical imaging, data mining, multimedia, etc. Since the pattern
works with any kind of data, it is applicable in many other domains. You will see examples from text
searching, telecommunications, stock prices, medical imaging and trademark symbols. The key idea of
the pattern is to map from a large, complex problem space into a small, simple feature space. The
mapping represents the creative part of the solution. Every type of application uses a different kind of mapping. Mapping into the feature space is also the hard part of this pattern. Traditional searching
algorithms are not viable for problems typical to the information retrieval domain. Since they were
designed for exact matching, their use for similarity search is cumbersome. In contrast, feature
extraction provides an elegant and efficient alternative. With information retrieval expanding into other
fields, this pattern is applicable in a wide range of applications. Work with an alternative, simpler
representation of data. The representation contains some information that is unique to each data item.
This computation is actually a function. It maps from the problem space into a feature space. For this
reason, it is also called feature extraction process.
Feature Extraction:
Texture is an important feature that identifies the object present in any image. The texture is
defined by the spatial distribution of pixels in the neighborhood of an image. The gray level spatial
dependency is represented by a two-dimensional matrix known as GLCM and it is used for texture
analysis. The GLCM matrix specifies how often pairs of pixels with certain values occur in an
image. The statistical measures are then derived using the GLCM matrix. The textural features represent
the spatial distribution of gray tonal variations within a specified area. In images, the neighboring pixel
is correlated and spatial values are obtained by the redundancy between the neighboring pixel values.
The color features are represented by color histograms in six color spaces namely RGB, HSV, LAB,
CIE, HUE and OPP.
The textural features are considered for classifying the image. These textural features are calculated in
the spatial domain and a set of gray tone spatial dependency matrix was computed. The textural features
are computed using GLCM matrix in four different orientation angles. The textural features are based on
the fact that describes how the gray tone appears in a spatial relationship to another.
GRAY LEVEL CO-OCCURENCE MATRIX (GLCM)
In statistical texture analysis, from the distribution of intensities the texture features are obtained at
specified position relative to one another in an image. The statistics of texture are classified into first
order, second order and higher order statistics. The method of extracting second order statistical texture
features is done using Gray Level Co-occurrence Matrix (GLCM). First order texture measure is not
related to pixel neighbor relationships and it is calculated from the original image. GLCM considers the
relation between two pixels at a time, called reference pixel and a neighbor pixel. A GLCM is defined
by a matrix in which the number of rows and columns are equal to the number of gray levels G in an
image. The matrix element P(i, j | Δx, Δy) is the relative frequency with which two pixels of intensities i and j, separated by the pixel distance (Δx, Δy), occur in the image. The different textural features such as energy,
entropy, contrast, homogeneity, correlation, dissimilarity, inverse difference moment and maximum
probability can be computed using GLCM matrix.
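As a small worked example, the sketch below computes a GLCM for a tiny 4x4 image with four gray levels and the offset (Δx, Δy) = (1, 0), then derives energy, contrast and entropy from the normalized matrix. The pixel values are invented purely for illustration; compile with -lm for the math library (for example, gcc glcm.c -o glcm -lm, with the file name chosen arbitrarily).

#include <stdio.h>
#include <math.h>

#define N 4   /* image is N x N */
#define G 4   /* number of gray levels */

int main(void) {
    /* Tiny sample image with gray levels 0..3 (illustrative values only). */
    int img[N][N] = {
        {0, 0, 1, 1},
        {0, 0, 1, 1},
        {0, 2, 2, 2},
        {2, 2, 3, 3}
    };
    double glcm[G][G] = {{0}};
    int pairs = 0;

    /* Count co-occurrences of (reference, neighbour) for offset dx = 1, dy = 0. */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N - 1; x++) {
            glcm[img[y][x]][img[y][x + 1]] += 1.0;
            pairs++;
        }

    /* Normalize so the entries are relative frequencies P(i, j). */
    for (int i = 0; i < G; i++)
        for (int j = 0; j < G; j++)
            glcm[i][j] /= pairs;

    /* Derive a few of the standard texture features from the matrix. */
    double energy = 0.0, contrast = 0.0, entropy = 0.0;
    for (int i = 0; i < G; i++)
        for (int j = 0; j < G; j++) {
            double p = glcm[i][j];
            energy   += p * p;
            contrast += (i - j) * (i - j) * p;
            if (p > 0.0)
                entropy -= p * log(p);
        }

    printf("energy   = %.4f\n", energy);
    printf("contrast = %.4f\n", contrast);
    printf("entropy  = %.4f\n", entropy);
    return 0;
}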
Significance of Extracted Feature:
1. Color: It supports object identification and extraction from the scene.
2. Brightness: Brightness is one of the most significant pixel characteristics. Brightness should be used only for non-quantitative references to physiological sensations and perceptions of light.
3. Entropy: It characterizes the texture in an image.
4. Contrast: Contrast is the dissimilarity or difference between things.
5. Shape of image
6. Size of image
7. Owner, file name, file type etc.
Algorithm
1. Open colored 2D bitmap file in binary mode.
2. Read the header structure
3. Extract the various feature
4. Print the values of features
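A minimal C sketch of steps 1 and 2 is shown below. It opens a hypothetical file sample.bmp in binary mode and reads a few fields of the standard BMP header (file size, width, height, bits per pixel) from their fixed byte offsets; the remaining features and error handling are left to the assignment.

#include <stdio.h>
#include <stdint.h>

/* Read a little-endian unsigned value of 'bytes' bytes from the header buffer. */
static uint32_t read_le(const unsigned char *buf, int offset, int bytes) {
    uint32_t v = 0;
    for (int i = bytes - 1; i >= 0; i--)
        v = (v << 8) | buf[offset + i];
    return v;
}

int main(void) {
    unsigned char header[54];              /* BITMAPFILEHEADER + BITMAPINFOHEADER */
    FILE *fp = fopen("sample.bmp", "rb");  /* hypothetical input file */
    if (fp == NULL || fread(header, 1, 54, fp) != 54) {
        fprintf(stderr, "could not read BMP header\n");
        return 1;
    }
    fclose(fp);

    if (header[0] != 'B' || header[1] != 'M') {
        fprintf(stderr, "not a BMP file\n");
        return 1;
    }

    /* Standard BMP header fields at their fixed byte offsets. */
    uint32_t file_size = read_le(header, 2, 4);
    uint32_t width     = read_le(header, 18, 4);
    uint32_t height    = read_le(header, 22, 4);
    uint32_t bpp       = read_le(header, 28, 2);

    printf("file size      : %u bytes\n", file_size);
    printf("width x height : %u x %u pixels\n", width, height);
    printf("bits per pixel : %u\n", bpp);
    return 0;
}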
Conclusion:
The implementation demonstrates the fundamentals of feature extraction from an image file.
Viva Questions:
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO2 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO3 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO4 - - - - - - - - - - - -
CO5 - - - - - - - - - - - -
CO6 - - - - - - - - - - - -
CO7 ✔ ✔ ✔ ✔ ✔ - - - - - - -
Marks: / 10
Objective: -
To understand the working of web crawler and implement it
Outcomes:
At the end of the assignment the students should have
1. Understood how web crawler works
Theory:
Search Engines
A program that searches documents for specified keywords and returns a list of the
documents where the keywords were found is a search engine. Although search engine is really
a general class of programs, the term is often used to specifically describe systems like Google,
Alta Vista and Excite that enable users to search for documents on the World Wide Web and
USENET newsgroups.
Typically, a search engine works by sending out a spider to fetch as many documents as
possible. Another program, called an indexer, then reads these documents and creates an
index based on the words contained in each document. Each search engine uses a
proprietary algorithm to create its indices such that, ideally, only meaningful results are
returned for each query. Search engines are special sites on the Web that are designed to help
people find information stored on other sites.
There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet - based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
Fig.1 shows general search engine architecture. Every engine relies on a crawler module
to provide the grist for its operation. Crawlers are small programs that browse the Web on the
search engine's behalf, similar to how a human user would follow links to reach different
pages. The programs are given a starting set of URLs, whose pages they retrieve from the Web.
The crawlers extract URLs appearing in the retrieved pages, and give this information to
the crawler control module. This module determines what links to visit next, and feeds the links
to visit back to the crawlers. (Some of the functionality of the crawler control module may
be implemented by the crawlers themselves.) The crawlers also pass the retrieved pages into a
page repository. Crawlers continue visiting the Web until local resources, such as storage,
are exhausted.
In a more traditional IR system, the documents to be indexed are available locally in a database or file system. WebCrawler's first information retrieval system was based on Salton's vector-space retrieval model.
The first system used a simple vector-space retrieval model. In the vector- space model, the
queries and documents represent vectors in a high-dimensional word space. The
component of the vector in a particular dimension is the significance of the word to the
document. For example, if a particular word is very significant to a document, the component of the
vector along that word's axis would be strong. In this vector space, then, the task of querying
becomes that of determining what document vectors are most similar to the query vector.
Practically speaking, this task amounts to comparing the query vector, component by component, to all
the document vectors that have a word in common with the query vector. WebCrawler
determined a similarity number for each of these comparisons that formed the basis of the
relevance score returned to the user. WebCrawler's first IR system had three pieces: a query
processing module, an inverted full-text index, and a metadata store. The query processing module
parses the searcher's query, looks up the words in the inverted index, forms the result list, looks up
the metadata for each result, and builds the HTML for the result page. The query processing module
used a series of data structures and algorithms to generate results for a given query. First, this module
put the query in a canonical form, and parsed each space- separated word in the query. If necessary,
each word was converted to its singular form using a modified Porter stemming algorithm and all
words were filtered through the stop list to obtain the final list of words. Finally, the query
processor looked up each word in the dictionary, and ordered the list of words for optimal query
execution. WebCrawler's key contribution to distributed systems is to show that a reliable, scalable,
and responsive system can be built using simple techniques for handling distribution, load balancing,
and fault tolerance.
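The idea can be illustrated numerically. The sketch below ranks three hypothetical document vectors against a query vector over a four-word vocabulary using cosine similarity (the dot product divided by the product of the vector lengths); the term weights are invented for demonstration and are not WebCrawler's actual data. Compile with -lm.

#include <stdio.h>
#include <math.h>

#define TERMS 4
#define DOCS  3

/* Cosine similarity between two term-weight vectors. */
static double cosine(const double *a, const double *b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < TERMS; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;
    return dot / (sqrt(na) * sqrt(nb));
}

int main(void) {
    /* Hypothetical weights for the vocabulary {web, crawler, image, index}. */
    double docs[DOCS][TERMS] = {
        {0.8, 0.6, 0.0, 0.2},   /* d1: about web crawlers    */
        {0.1, 0.0, 0.9, 0.3},   /* d2: about image features  */
        {0.5, 0.4, 0.0, 0.7}    /* d3: crawling and indexing */
    };
    double query[TERMS] = {1.0, 1.0, 0.0, 0.0};   /* query: "web crawler" */

    for (int d = 0; d < DOCS; d++)
        printf("similarity(query, d%d) = %.3f\n", d + 1, cosine(docs[d], query));
    return 0;
}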
Robot Exclusion
The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt
protocol, is a convention to prevent co-operating web crawlers and other web robots from
accessing all or a part of a website which is otherwise publicly viewable. Robots are often used by
search engines to categorize and archive web sites, or by webmasters to proofread source code. The
standard is different but can be used in conjunction with sitemaps, a robot inclusion standard for
websites.
A robots.txt file on a website will function as a request that specified robots ignore specified
files or directories in their search. This might be, for example, out of preference for privacy from
search engine results, or the belief that the content of the selected directories might be misleading or
irrelevant to the categorization of the site as a whole, or out of desire that an application only
operates on certain data. A person may not want certain pages indexed. Crawlers should obey the
Robot Exclusion Protocol.
The Robots Exclusion Protocol (REP) is a very simple but powerful mechanism available to
webmasters and SEOs alike. Perhaps it is the simplicity of the file that means it is often
overlooked and often the cause of one or more critical SEO issues. To this end, we have
attempted to pull together tricks, tips and examples to assist with the implementation and management
of your robots.txt file. As many of the non-standard REP declarations supported by Google, Yahoo
and Bing may change, we will be providing updates to this in the future.
The robots.txt file defines the Robots Exclusion Protocol (REP) for a website. The file
defines directives that exclude Web robots from directories or files per website host. The robots.txt
file defines crawling directives, not indexing directives. Good Web robots adhere to directives in
your robots.txt file. Bad Web robots may not. Do not rely on the robots.txt file to protect private
or sensitive data from search engines. The robots.txt file is publicly accessible and so do not include
any files or folders that may include business critical information.
For example: Website analytics folders (/web stats/, /stats/ etc.)
Test or development areas (/test/, /dev/)
XML Sitemap element if your URL structure contains vital taxonomy.
If a URL redirects to a URL that is blocked by a robots.txt file, the first URL will be reported
as being blocked by robots.txt in Google Webmaster Tools. Search engines may cache your
robots.txt file (For example: Google may cache your robots.txt file for 24 hours). When
deploying a new website from a development environment always check the robots.txt file to
ensure no key directories are excluded. Excluding files using robots.txt may not save the crawl
budget from the same crawl session. For example: if Google cannot access a number of files it may
not crawl other files in their place. URLs excluded by REP (Robots Exclusion Protocol) may
still appear in a search engine index.
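As a simple illustration of how a crawler can honour these directives, the C sketch below reads a local robots.txt file and reports whether a given path is disallowed for the wildcard user agent. It handles only plain prefix Disallow rules inside the User-agent: * group, not the full protocol, and the path tested is an arbitrary example.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Case-insensitive test that 'line' begins with 'prefix'. */
static int starts_with(const char *line, const char *prefix) {
    while (*prefix) {
        if (tolower((unsigned char)*line++) != tolower((unsigned char)*prefix++))
            return 0;
    }
    return 1;
}

/* Return 1 if 'path' is disallowed for user-agent '*' by plain Disallow rules. */
static int is_disallowed(const char *filename, const char *path) {
    FILE *fp = fopen(filename, "r");
    char line[256];
    int in_star_group = 0, disallowed = 0;

    if (fp == NULL)
        return 0;                       /* no robots.txt: nothing is excluded */

    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';
        if (starts_with(line, "User-agent:")) {
            const char *agent = line + 11;
            while (*agent == ' ') agent++;
            in_star_group = (strcmp(agent, "*") == 0);
        } else if (in_star_group && starts_with(line, "Disallow:")) {
            const char *rule = line + 9;
            while (*rule == ' ') rule++;
            if (*rule != '\0' && strncmp(path, rule, strlen(rule)) == 0)
                disallowed = 1;
        }
    }
    fclose(fp);
    return disallowed;
}

int main(void) {
    const char *path = "/test/page.html";     /* hypothetical URL path */
    if (is_disallowed("robots.txt", path))
        printf("%s is excluded by robots.txt\n", path);
    else
        printf("%s may be crawled\n", path);
    return 0;
}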
Program Implementation: The program is implemented in Java, with appropriate output.
Algorithm
1. Make User Interface
2. Input the URL of any website
3. Establish HTTP connection
4. Read HTML page source code
5. Extract Hyperlinks of HTML page
6. Display the list of hyperlinks on the same page
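Steps 4 and 5 essentially amount to scanning the page source for href attributes. The lab implementation itself is in Java; purely for illustration, a minimal C sketch of the extraction step is shown below, with the HTML string hard-coded in place of the page that would actually be fetched over the HTTP connection of step 3.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* Hard-coded page source for illustration; a real crawler reads this
     * over the HTTP connection established in steps 3 and 4. */
    const char *html =
        "<html><body>"
        "<a href=\"https://example.com/index.html\">Home</a>"
        "<a href=\"https://example.com/docs/\">Docs</a>"
        "</body></html>";

    const char *p = html;
    /* Step 5: extract every href="..." attribute and print the hyperlink. */
    while ((p = strstr(p, "href=\"")) != NULL) {
        p += 6;                                /* skip past href=" */
        const char *end = strchr(p, '"');
        if (end == NULL)
            break;
        printf("%.*s\n", (int)(end - p), p);
        p = end + 1;
    }
    return 0;
}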
Viva Questions:
1. What is robots.txt?
2. What is the significance of robots.txt?
3. What are the strategies used by a crawler?
5. What is page rank?
6. What is the significance of the damping factor?
Mapping of CO, PO and PSO
Note: enter correlation levels in the box: 1 = Slight (Low); 2 = Moderate (Medium); 3 = Substantial (High). If there is no correlation, put "-".
Sr. No   Name of Experiment               PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
01       Implementation of Web Crawler    ✔   ✔   ✔   ✔   ✔   -   -   -   -   -    -    -
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO2 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO3 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO4 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO5 - - - - - - - - - - - -
CO6 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO7 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO and PSO Mapping:
Date of Performance:      /    /20          Sign with Date:
Aim : To implement a program for feature extraction of input image and to plot histogram for the
features
Objective: -
1. To study the feature extraction process for input images for information retrieval and to plot a histogram
Outcomes:
At the end of the assignment the students should have
1. Plotted the histogram of a color image
Theory:
Introduction:
Technology determines the types and amounts of information we can access. Currently, a large
fraction of information originates in silicon. Cheap, fast chips and smart algorithms are helping digital
data processing take over all sorts of information processing. Consequently, the volume of digital data
surrounding us increases continuously. However, an information-centric society has additional
requirements besides the availability and capability to process digital data. We should also be able to
find the pieces of information relevant to a particular problem. Having the answer to a question but not
being able to find it is equivalent to not having it at all. The increased volume of information and the
wide variety of data types make finding information a challenging task. Current searching methods and
algorithms are based on assumptions about technology and goals that seemed reasonable before the
widespread use of computers. However, these assumptions no longer hold in the context of information
retrieval systems. The pattern originated in the information retrieval domain. However, information
retrieval has expanded into other fields like office automation, genome databases, fingerprint
identification, medical imaging, data mining, multimedia, etc. Since the pattern works with any kind of
data, it is applicable in many other domains. You will see examples from text searching,
telecommunications, stock prices, medical imaging and trademark symbols. The key idea of the pattern
is to map from a large, complex problem space into a small, simple feature space. The mapping
represents the creative part of the solution. Every type of application uses a different kind of mapping.
Mapping into the feature space is also the hard part of this pattern. Traditional searching algorithms are
not viable for problems typical to the information retrieval domain. Since they were designed for exact
matching, their use for similarity search is cumbersome. In contrast, feature extraction provides an
elegant and efficient alternative. With information retrieval expanding into other fields, this pattern is
applicable in a wide range of applications. Work with an alternative, simpler representation of data. The
representation contains some information that is unique to each data item. This computation is actually a
function. It maps from the problem space into a feature space. For this reason, it is also called feature
extraction process.
Feature Extraction:
When the input data to an algorithm is too large to be processed and it is suspected to be
notoriously redundant (much data, but not much information) then the input data will be transformed
into a reduced representation set of features (also named features vector). Transforming the input data
into the set of features is called feature extraction. If the features extracted are carefully chosen it is
expected that the features set will extract the relevant information from the input data in order to
perform the desired task using this reduced representation instead of the full-size input.
Significance of Extracted Feature:
1. Colour: It supports object identification and extraction from the scene.
2. Brightness: Brightness is one of the most significant pixel characteristics. Brightness should be used only for non-quantitative references to physiological sensations and perceptions of light.
Algorithm
1. Open colored 2D bitmap file in binary mode.
2. Read the header structure.
3. Extract the various features.
4. Print the values of the features.
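The histogram itself reduces to counting how many pixels fall into each intensity bin. The sketch below uses a handful of invented gray-level values (in the assignment they would come from the bitmap's pixel data once the header has been read) and "plots" the result as a simple text bar chart.

#include <stdio.h>

#define BINS   8          /* coarse bins over intensities 0..255 */
#define PIXELS 16

int main(void) {
    /* Hypothetical gray-level values; in the assignment these come from the BMP pixel data. */
    unsigned char pixels[PIXELS] = {
        12, 40, 41, 90, 95, 100, 130, 135,
        140, 180, 181, 200, 201, 220, 250, 255
    };
    int hist[BINS] = {0};

    /* Count pixels per bin; each bin covers 256 / BINS = 32 gray levels. */
    for (int i = 0; i < PIXELS; i++)
        hist[pixels[i] / (256 / BINS)]++;

    /* "Plot" the histogram as a simple text bar chart. */
    for (int b = 0; b < BINS; b++) {
        printf("[%3d-%3d] ", b * 32, b * 32 + 31);
        for (int k = 0; k < hist[b]; k++)
            putchar('*');
        putchar('\n');
    }
    return 0;
}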
Conclusion: The implementation demonstrates the fundamentals of feature extraction from an image file and the plotting of its histogram.
Viva Questions:
2. Applications of a histogram
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO2 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO3 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO4 - - - - - - - - - - - -
CO5 - - - - - - - - - - - -
CO6 - - - - - - - - - - - -
CO7 ✔ ✔ ✔ ✔ ✔ - - - - - - -
CO and PSO Mapping:
Marks: / 10
Theory:
Study of a collaborative or content-based recommender system.
Conclusion: Thus, we have studied the collaborative recommender system.
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO2 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO3 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO4 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO5 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO6 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO7 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO and PSO Mapping:
Marks: / 10
Objective: -
To study indexing and inverted files, and to retrieve documents with the help of an inverted file for multiple documents and multiple queries.
Implement the retrieval algorithm for 25 to 30 documents.
Input a query and verify the output.
Outcomes:
At the end of the assignment the students should have
1. Understood document retrieval using an inverted file for multiple documents and multiple queries
Theory:
Indexing
The most basic way to answer a query is to scan the text sequentially. Sequential or online text searching involves finding the occurrences of a pattern in a text. Online searching is appropriate when the text is small, and it is the only choice if the text collection is very volatile or the index space overhead cannot be afforded. A second option is to build data structures over the text to speed up the search. It is worthwhile building and maintaining an index when the text collection is large and semi-static. Semi-static collections can be updated at reasonably regular intervals, but they are not deemed to support thousands of insertions of single words per second. This is the case for most real text databases, not only dictionaries or other slow-growing literary works. There are many indexing techniques.
Three of them are inverted files, suffix arrays and signature files.
Inverted Files
An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the matching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the occurrences. These positions can refer to words or characters. Word positions simplify phrase and proximity queries, while character positions facilitate direct access to the matching text position.
Searching with the help of inverted file:
The search algorithm on an inverted index has three steps.
1. Vocabulary Search
2. Retrieval of occurrence
3. Manipulation of occurrences
Searching with multiple keywords can take place in these five ways:
1. Single keyword
2. ANDing of keywords (k1 && k2)
3. ORing of keywords (k1 || k2)
4. Using NOT
5. Mixed keywords
Example:
Document d1:
Information retrieval (IR) is the activity of obtaining information system resources relevant to an
information need from a collection. Searches can be based on full-text or other content-based indexing.
Document d2:
Information retrieval is the science of searching for information in a document, searching for documents
themselves, and also searching for metadata that describe data, and for databases of texts, images or
sounds
Document d3:
Automated information retrieval systems are used to reduce what has been called information overload.
An IR system is a software that provide access to books, journals and other documents, stores them and
manages the document. Web search engines are the most visible IR applications.
Document d4:
An information retrieval process begins when a user enters a query into the system. Queries are formal
statements of information needs, for example search strings in web search engines.
Document d5:
Information retrieval a query does not uniquely identify a single object in the collection. Instead, several
objects may match the query, perhaps with different degrees of relevancy.
Inverted Index:
Vocabulary Occurrences
Query d4, d5
Information d1, d2, d3, d4, d5
User d4
Document d2, d3
Web d3, d4
Output
List of relevant documents
1. Single keyword
Example:
Query - Web
Output- d3, d4
2. AND operator
Example:
Query: Information AND User
Output- d4
3. OR operator
Example:
Query: User OR Document
Output: d2, d3, d4
4. NOT operator
Example:
Query: NOT Web
Output: d1,d2,d5
5. Mixed operators
Example:
Query: (Document OR Web) NOT User
Output: d2, d3
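These Boolean operations reduce to set operations on the posting lists. The sketch below hard-codes the postings from the example (Document, Web and User over documents d1-d5) and evaluates (Document OR Web) NOT User, printing d2 and d3 as above.

#include <stdio.h>

#define NDOCS 5   /* documents d1..d5 from the example */

/* A posting list is represented as a membership array over the documents. */
static void print_result(const char *label, const int set[NDOCS]) {
    printf("%s:", label);
    for (int d = 0; d < NDOCS; d++)
        if (set[d])
            printf(" d%d", d + 1);
    printf("\n");
}

int main(void) {
    /* Postings taken from the worked example above. */
    int Document[NDOCS] = {0, 1, 1, 0, 0};  /* d2, d3 */
    int Web[NDOCS]      = {0, 0, 1, 1, 0};  /* d3, d4 */
    int User[NDOCS]     = {0, 0, 0, 1, 0};  /* d4     */

    int or_set[NDOCS], result[NDOCS];

    /* OR: union of the two posting lists. */
    for (int d = 0; d < NDOCS; d++)
        or_set[d] = Document[d] || Web[d];

    /* NOT: remove every document that contains the excluded keyword. */
    for (int d = 0; d < NDOCS; d++)
        result[d] = or_set[d] && !User[d];

    print_result("(Document OR Web) NOT User", result);
    return 0;
}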
Conclusion: Implementation is concluded by retrieval of documents using Inverted Files for multiple
documents and multiple input queries.
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ - - - - - - - -
CO2 ✔ ✔ ✔ ✔ - - - - - - - -
CO3 - - - - - - - - - - - -
CO4 - - - - - - - - - - - -
CO5 - - - - - - - - - - - -
CO6 - - - - - - - - - - - -
CO7 - - - - - - - - - - - -
Marks: / 10
Aim : To study image retrieval for ADAS (Advanced Driver Assistance System) using different cases
Objective:
To study Lane Change Assist (LCA), driver drowsiness and inattentiveness detection, Automatic Parking, ACC (Adaptive Cruise Control), etc., as included in image retrieval for ADAS (Advanced Driver Assistance Systems)
Outcomes:
At the end of the assignment the students should have
2. Understood the concept of ADAS
Theory:
Conclusion: Thus we have studied ADAS
CO and PO Mapping:
COs/POs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO2 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO3 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO4 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO5 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO6 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO7 ✔ ✔ ✔ ✔ ✔ - ✔ ✔ ✔ - ✔ ✔
CO and PSO Mapping: