IRT 2 Marks With Answers
UNIT I
PART A ( 2 Marks)
1. Define information retrieval.
Information Retrieval (IR) is finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections
(usually stored on computers).
2. Explain difference between data retrieval and information retrieval.
Data retrieval deals with well-structured data that have well-defined semantics : a query retrieves exactly the objects that match it, and a single erroneous object means total failure. Information retrieval deals with unstructured natural-language text : it retrieves objects that may only partially match the query, ranks them by presumed relevance, and small errors generally go unnoticed.
7. Define stemming.
Stemming is the process for reducing inflected words to their stem, base or root form, generally a written word form. The process of stemming is often called conflation.
8. What is an invisible web ?
Many dynamically generated sites are not indexable by search engines; this phenomenon
is known as the invisible web.
9. Define Zipf’s law.
An empirical rule that describes the frequency of the text words. It states that the i-th most frequent word appears as many times as the most frequent one divided by i^θ, for some θ > 1.
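A minimal sketch (Python, standard library only, toy corpus assumed) that ranks word frequencies and compares them with the Zipf prediction f(i) ≈ f(1) / i^θ :

    from collections import Counter

    text = ("to be or not to be that is the question "
            "whether tis nobler in the mind to suffer")   # toy corpus
    freqs = Counter(text.split()).most_common()
    f1 = freqs[0][1]              # frequency of the most frequent word
    theta = 1.0                   # assumed exponent; large corpora fit theta > 1
    for i, (word, f) in enumerate(freqs[:5], start=1):
        print(i, word, f, round(f1 / i ** theta, 1))      # observed vs. predicted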
10. What is supervised learning ?
In supervised learning, both the inputs and the outputs are provided. The network then
processes the inputs and compares its resulting outputs against the desired outputs. Errors
are then propagated back through the system, causing the system to adjust the weights
which control the network.
11. What is unsupervised learning ?
In unsupervised learning, the network adapts purely in response to its inputs. Such
networks can learn to pick out structure in their inputs.
12. What is text mining ?
Text mining is understood as a process of automatically extracting meaningful, useful,
previously unknown and ultimately comprehensible information from textual document
repositories. Text mining can be visualized as consisting of two phases : Text refining
that transforms free-form text documents into a chosen intermediate form, and knowledge
distillation that deduces patterns or knowledge from the intermediate form.
13. Specify the role of an IR system.
The role of an IR system is to retrieve all the documents which are relevant to a query
while retrieving as few non-relevant documents as possible. IR allows access to whole
documents, whereas search engines do not.
14. Outline the impact of the web on information retrieval.
The Web is a huge, widely distributed, highly heterogeneous and semi-structured information repository. With the rapid growth of the Internet, a huge amount of information has become available on the Web, and Web information retrieval presents additional technical challenges compared to classic information retrieval due to the heterogeneity and size of the Web.
Web information retrieval is unique due to its dynamism, the variety of languages used, duplication, high linkage, ill-formed queries and the wide variance in the nature of users. IR helps users find information that matches their information needs expressed as queries. Historically, IR is about document retrieval, emphasizing the document as the basic unit.
15. Compare information retrieval and web search.
Classic information retrieval works over relatively static, well-controlled collections with known users and comparatively well-formed queries. Web search must cope with a vast, rapidly changing, hyperlinked and spam-prone collection, and with very short, often ill-formed queries from a highly varied user population.
PART B ( 13 Marks)
5. Demonstrate the framework of Open Source Search engine with necessary diagrams. (13)
6. i) Compare in detail Information Retrieval and Web Search with examples. (8)
ii) Analyze the fundamental concepts involved in IR system. (5)
PART-C ( 15 Marks )
1. Create an open source search engine like Google with suitable functionalities. (15)
2. Evaluate the best search engines other than Google and explain any five of them in
detail. (15)
3. Justify how AI impacts Search and Search Engine Optimization. (15)
4. Generalize the Deep Learning and Human Learning capabilities in the future of Search
Engine Optimization. (15)
UNIT II
PART A ( 2 Marks)
1. What is a retrieval model ?
A retrieval model can be a description of either the computational process or the human process of retrieval : the process of choosing documents for retrieval; the process by which information needs are first articulated and then refined.
2. What is cosine similarity ?
This metric is frequently used to determine the similarity between two documents. It measures the cosine of the angle between the documents' term vectors : the more terms two documents share (with similar weights), the higher the score, and the measure is not biased toward long documents the way raw overlap counts are.
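A short illustrative sketch (plain Python, hypothetical term-weight vectors) of the computation cos(d1, d2) = (d1 · d2) / (|d1| |d2|) :

    import math

    d1 = [3, 1, 0]    # hypothetical tf weights over a shared vocabulary
    d2 = [1, 2, 4]

    dot = sum(a * b for a, b in zip(d1, d2))      # inner product d1 . d2
    norm1 = math.sqrt(sum(a * a for a in d1))     # vector length |d1|
    norm2 = math.sqrt(sum(b * b for b in d2))     # vector length |d2|
    print(dot / (norm1 * norm2))                  # 0 = unrelated, 1 = same direction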
3. What is language model based IR ?
A language model is a probabilistic mechanism for generating text. Language models
estimate the probability distribution of various natural language phenomena
4. Define unigram language model.
A unigram (1-gram) language model makes the strong independence assumption that words are generated independently from a multinomial distribution.
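A minimal query-likelihood sketch under this assumption, P(q|d) = Π_t P(t|d); add-one smoothing is an assumed choice here to avoid zero probabilities :

    from collections import Counter

    doc = "the cat sat on the mat".split()
    query = "cat mat".split()

    counts = Counter(doc)
    p = 1.0
    for term in query:
        # multinomial unigram estimate with add-one (Laplace) smoothing
        p *= (counts[term] + 1) / (len(doc) + len(counts))
    print(p)    # P(query | document) under the unigram model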
5. What are the characteristics of relevance feedback ?
Characteristics of relevance feedback :
1. It shields the user from the details of the query reformulation process.
2. It breaks down the whole searching task into a sequence of small steps which
are easier to grasp.
3. It provides a controlled process designed to emphasize some terms (relevant
ones) and de-emphasize others (non-relevant ones).
6. What are the assumptions of vector space model ?
Documents and queries are represented as vectors of term weights in a common high-dimensional space; the index terms are assumed to be mutually independent (the axes of the space are treated as orthogonal); and the relevance of a document to a query is assumed to correlate with the similarity of their vectors (e.g., the cosine of the angle between them).
7. What are the disadvantages of the Boolean model ?
a. It is not simple to translate an information need into a Boolean expression.
b. Exact matching may lead to retrieval of too few or too many documents.
c. The retrieved documents are not ranked.
d. The model does not use term weights.
9. State Luhn's basic idea.
Luhn's basic idea to use various properties of texts, including statistical ones, was critical in opening the handling of input by computers for IR. Automatic input joined the already automated output.
10. What is stemming ? Give example.
Conflation algorithms are used in information retrieval systems for matching the morphological variants of terms for efficient indexing and faster retrieval operations. The conflation process can be done either manually or automatically. The automatic conflation operation is also called stemming. For example, the variants "connected", "connecting" and "connection" are all reduced to the stem "connect".
11. What is Recall ?
Recall is the ratio of the number of relevant documents retrieved to the total number
of relevant documents in the collection.
12. What is precision ?
Precision is the ratio of the number of relevant documents retrieved to the total
number of documents retrieved.
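A small sketch computing both measures from hypothetical sets of retrieved and relevant document ids :

    retrieved = {1, 2, 3, 5, 8}         # documents the system returned
    relevant = {2, 3, 4, 8, 9, 10}      # documents judged relevant

    hits = retrieved & relevant          # relevant documents actually retrieved
    print("precision:", len(hits) / len(retrieved))   # 3/5 = 0.6
    print("recall:", len(hits) / len(relevant))       # 3/6 = 0.5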
13. Explain Latent Semantic Indexing.
Latent Semantic Indexing is a technique that projects queries and documents into a space with "latent" semantic dimensions. It is a statistical method for automatic indexing and retrieval that attempts to solve the major problems of the current technology. It is intended to uncover latent semantic structure in the data that is hidden. It creates a semantic space wherein terms and documents that are associated are placed near one another.
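A minimal LSI-style sketch (assuming NumPy is available) that factors a small term-document matrix with SVD and keeps k latent dimensions :

    import numpy as np

    # hypothetical term-document matrix: rows = terms, columns = documents
    A = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                          # latent dimensions to keep
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
    print(np.round(A_k, 2))   # terms/documents now compared in a 2-D latent space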
14. List the retrieval models.
Retrieval models include the Boolean model and the vector model. The Boolean model is based on set theory and Boolean algebra. The vector model is used in information filtering, information retrieval, indexing and relevancy ranking.
15. Define document preprocessing.
Document pre-processing is the process of incorporating a new document into an
information retrieval system. It is a complex process that leads to the representation
of each document by a select set of index terms.
16. Define an inverted index.
An inverted index is an index into a set of documents of the words in the documents.
The index is accessed by some search method. Each index entry gives the word and a
list of documents, possibly with locations within the documents, where the word
occurs. The inverted index data structure is a central component of a typical search
engine indexing algorithm.
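A minimal sketch of building such an index (plain Python, toy documents, naive whitespace tokenization) :

    from collections import defaultdict

    docs = {1: "new home sales top forecasts",
            2: "home sales rise in july",
            3: "increase in home sales in july"}

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():          # tokenization kept deliberately simple
            index[word].add(doc_id)

    print(sorted(index["home"]))   # [1, 2, 3] -- the postings list for "home"
    print(sorted(index["july"]))   # [2, 3]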
17. What is Zone index ?
A zone is a region of the document that can contain an arbitrary amount of text, e.g.,
Title, Abstract and References. Build inverted indexes on zones as well to permit
querying. Zones are similar to fields, except the contents of a zone can be arbitrary
free text.
18. State Bayes rule.
Bayes' rule relates a conditional probability to its inverse : P(A|B) = P(B|A) P(A) / P(B). In IR it is used, for example, to estimate the probability that a document is relevant given a query from the probability of observing the query given relevance.
PART B ( 13 Marks)
1. i) Express what the Boolean retrieval model is. (4)
ii) Describe the document preprocessing steps in detail. (9)
2. Illustrate the Vector space retrieval model with an example. (13)
3. Describe the basic concepts of Cosine similarity. (13)
4. Develop an example to implement term weighting (min docs = 5). (13)
5. i) Tabulate the common preprocessing steps. (4)
ii) Discuss the Boolean retrieval in detail with a diagram. (9)
6. i) Discuss in detail about term frequency and Inverse Document Frequency. (7)
ii) Compute TF-IDF, given a document containing terms with the frequencies A(3), B(2), C(1). Assume a collection of 10,000 documents in which the document frequencies of these terms are A(50), B(1300), C(250). (6)
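A hedged worked sketch for question 6(ii), assuming the common variant tf-idf = tf × log10(N/df); other weighting variants would give different numbers :

    import math

    N = 10000                             # documents in the collection
    tf = {"A": 3, "B": 2, "C": 1}         # term frequencies in the document
    df = {"A": 50, "B": 1300, "C": 250}   # document frequencies

    for term in tf:
        weight = tf[term] * math.log10(N / df[term])
        print(term, round(weight, 3))     # A ~ 6.903, B ~ 1.772, C ~ 1.602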
7. i)Explain Latent Semantic Indexing and latent semantic space with an illustration. (9)
ii) Analyze the use of LSI in Information Retrieval. What is its need in handling synonymy
and semantic relatedness? (4)
8. i) Examine how to form a binary term-document incidence matrix. (7)
ii) Give an example for the above. (6)
13. i) Explain in detail about the binary independence model for the Probability Ranking
Principle (PRP). (7)
ii) Analyze how the query generation probability for query likelihood model can be
estimated. (6)
14. i) Apply probabilistic approaches to Information Retrieval and explain how they are done. (7)
ii) Illustrate the following
a) Probabilistic relevance feedback. (2)
b) Pseudo relevance feedback. (2)
c) Indirect relevance feedback (2)
PART C ( 15 Marks)
1. Compose the Information Retrieval services of the Internet with a suitable design. (15)
2. Assess the best language model in computational linguistics for investigating the use of
software to translate text or speech from one language to another. (15)
3. Contrast the uses of probabilistic IR in indexing the search in the internet. (15)
4. Create a Relevance feedback mechanism for your college website search in the internet.
(15)
UNIT III
PART A ( 2 Marks)
3. Define decision tree.
A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (a categorical or continuous value). A decision tree or a classification tree is a tree in which each internal node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature.
4. Define information gain.
Entropy measures the impurity of a collection, and information gain is defined in terms of entropy. Information gain tells us how important a given attribute of the feature vectors is : the information gain of attribute A is the reduction in entropy caused by partitioning the set of examples S,

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) × Entropy(S_v)

where Values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v.
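A minimal sketch (plain Python, toy labelled examples) of the entropy and gain computation above :

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum over classes of p * log2(p)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    # toy set S: (class label, value of attribute A)
    S = [("yes", "sunny"), ("no", "sunny"), ("yes", "rain"), ("yes", "rain")]

    gain = entropy([cls for cls, _ in S])
    for v in {a for _, a in S}:
        Sv = [cls for cls, a in S if a == v]
        gain -= len(Sv) / len(S) * entropy(Sv)   # weighted entropy after the split
    print(round(gain, 3))   # 0.311 for this toy data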
5. Define pre pruning and post pruning.
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data, typically using statistical measures to remove the least reliable branches. In prepruning, tree construction is halted early, e.g., by deciding not to split the training examples at a node any further; in postpruning, subtrees are removed from a fully grown tree.
7. What is tree pruning ?
Tree pruning attempts to identify and remove such branches, with the goal of improving
classification accuracy on unseen data.
8. What are Bayesian classifiers ?
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class.
9. What is meant by naive Bayes classifier ?
A naive Bayes classifier is an algorithm that uses Bayes' theorem to classify objects.
Naive Bayes classifiers assume strong, or naive, independence between attributes of data
points.
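A tiny sketch of the naive Bayes decision rule, P(c|x) ∝ P(c) Π_i P(x_i|c), on hypothetical spam-filtering probabilities :

    priors = {"spam": 0.4, "ham": 0.6}                       # P(c)
    likelihood = {"spam": {"offer": 0.5, "meeting": 0.1},    # P(word | c)
                  "ham": {"offer": 0.1, "meeting": 0.4}}

    message = ["offer", "offer", "meeting"]
    scores = {}
    for c in priors:
        score = priors[c]
        for word in message:
            score *= likelihood[c][word]   # the naive independence assumption
        scores[c] = score
    print(max(scores, key=scores.get))     # "spam" wins for this message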
10. What are the characteristics of k-nearest neighbors algorithm ?
Characteristics :
The unknown tuple is assigned the most common class among its k nearest
neighbours.
Nearest-neighbor classifiers use distance-based comparisons that intrinsically
assign equal weight to each attribute.
Nearest-neighbor classifiers can be extremely slow when classifying test tuples.
Distance metric is calculated by using Euclidean distance and Manhattan distance.
It does not use model building.
It relies on local information.
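A minimal k-nearest-neighbour sketch (plain Python, Euclidean distance, toy 2-D points) reflecting the characteristics listed above :

    import math
    from collections import Counter

    train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
             ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
    query, k = (1.1, 1.0), 3

    # no model building: just sort the training tuples by distance to the query
    neighbours = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    label = Counter(cls for _, cls in neighbours).most_common(1)[0][0]
    print(label)   # majority class among the k nearest neighbours: "A"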
11. What is dimensionality reduction ?
In dimensionality reduction, data encoding or transformations are applied so as to obtain
a reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data
reduction is called lossless.
12. Define similarity.
The similarity between two objects is a numerical measure of the degree to which the two
objects are alike. Similarities are usually non-negative and are often between 0 and 1. A small distance indicates a high degree of similarity and a large distance indicates a low degree of similarity.
13. Define an inverted index.
An inverted index is an index into a set of documents of the words in the documents. The
index is accessed by some search method. Each index entry gives the word and a list of
documents, possibly with locations within the documents, where the word occurs. The
inverted index data structure is a central component of a typical search engine indexing
algorithm.
14. What is zone index ?
A zone is a region of the document that can contain an arbitrary amount of text, e.g.,
Title, Abstract and References. Build inverted indexes on zones as well to permit
querying. Zones are similar to fields, except the contents of a zone can be arbitrary
free text.
5. i) Analyze the working of the Nearest Neighbor algorithm along with one representation.
(7)
ii) Analyze the K-Means Clustering method and the problems in it. (6)
PART C ( 15 Marks)
1. (i) Rank the impacts of categorization and clustering of text in text mining with
suitable examples. (8)
(ii) Explain the KNN classifier in detail. (7)
2. Design a plan to overcome the gap in the decision-theoretic approach for evaluation in
text mining. (15)
3. Compare the two types of dimensional index in detail with an example. (15)
4. Estimate the R-Tree index and the R+-Tree index. (15)
UNIT IV
PART A ( 2 Marks)
1. What is a web server ?
A web server is a computer connected to the Internet that runs a program that takes responsibility for storing, retrieving and distributing some of the web files.
2. What is a web browser ?
A web browser is a program used to communicate with web servers on the Internet, which enables it to download and display web pages. Netscape Navigator and Microsoft Internet Explorer are the most popular browser software available in the market.
3. Explain paid submission of search services.
In paid submission, users submit a website for review by a search service for a preset fee
with the expectation that the site will be accepted and included in that company's
search engine, provided it meets the stated guidelines for submission. Yahoo! is the
major search engine that accepts this type of submission. While paid submissions
guarantee a timely review of the submitted site and notice of acceptance or rejection,
you're not guaranteed inclusion or a particular placement order in the listings.
4. Explain paid inclusion programs of search services.
Paid inclusion programs allow you to submit your website for guaranteed inclusion in
a search engine's database of listings for a set period of time. While paid inclusion
guarantees indexing of submitted pages or sites in a search database, you're not
guaranteed that the pages will rank well for particular queries.
5. Define search engine optimization.
Search Engine Optimization (SEO) is the act of modifying a website to increase its
ranking in organic (vs. paid), crawler-based listings of search engines. There are several ways to increase the visibility of a website through the major search engines on the Internet today. The two most common forms of Internet marketing are Paid (Sponsored) Placement and Natural (Organic) Placement.
6. What is the purpose of web crawler ?
A web crawler is a program which browses the World Wide Web in a methodical,
automated manner. Web crawlers are mainly used to create a copy of all the visited
pages for later processing by a search engine that will index the downloaded pages to
provide fast searches.
7. Define focused crawler.
A focused crawler or topical crawler is a web crawler that attempts to download only
web pages that are relevant to a pre-defined topic or set of topics.
8. What is near-duplicate detection ?
Near-duplicate detection identifies documents whose content is identical or nearly identical to that of another document, e.g., pages that differ only in banners, advertisements or timestamps. Search engines detect such pages, typically by comparing compact fingerprints of word sequences (shingles), so that only one copy needs to be indexed and shown in the results.
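A minimal shingling sketch (3-word shingles and Jaccard overlap assumed; production systems compress the shingle sets into MinHash or SimHash fingerprints) :

    def shingles(text, w=3):
        # the set of all w-word shingles occurring in the text
        words = text.split()
        return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

    a = shingles("a rose is a rose is a rose")
    b = shingles("a rose is a flower which is a rose")
    jaccard = len(a & b) / len(a | b)   # close to 1.0 means near-duplicate
    print(round(jaccard, 2))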
10. What are politeness policies used in web crawling ?
Politeness policies prevent a crawler from overloading web servers : the crawler fetches only pages it is permitted to fetch (as stated in the site's robots.txt file) and spaces successive requests to the same server by a suitable time interval.
12. Define PageRank.
PageRank is a method for rating the importance of web pages objectively and mechanically using the link structure of the web.
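A minimal power-iteration sketch of PageRank on a toy link graph (damping factor 0.85 assumed, as in the original formulation) :

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> outgoing links
    pages = list(links)
    d = 0.85
    pr = {p: 1 / len(pages) for p in pages}             # uniform starting ranks

    for _ in range(50):                                 # iterate until roughly stable
        pr = {p: (1 - d) / len(pages)
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    print({p: round(r, 3) for p, r in pr.items()})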
13. Define dangling link.
This occurs when a page contains a link such that the hypertext points to a page with
no outgoing links. Such a link is known as dangling link.
14. Define snippets.
Snippets are short fragments of text extracted from the document content or its metadata. They may be static or query-based. A static snippet always shows, e.g., the first 50 words of the document, or the content of its description metadata, or a description taken from a directory site such as dmoz.org. A query-biased snippet is one selectively extracted on the basis of its relation to the searcher's query.
15. Define hubs.
Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). A good hub points to many good authorities.
16. Define authorities.
Authorities are pages that are recognized as providing significant, trustworthy, and
useful information on a topic. In-degree (the number of pointers to a page) is one simple measure of authority. However, in-degree treats all links as equal.
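A minimal HITS-style sketch (toy link graph assumed) in which hub and authority scores reinforce each other instead of treating all links as equal :

    links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    pages = list(links)
    hub = {p: 1.0 for p in pages}

    for _ in range(20):
        # authority score: sum of hub scores of the pages pointing to it
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub score: sum of authority scores of the pages it points to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    print(auth)   # a1 scores highest: both hubs point to it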
PART-B ( 13 Marks)
8. Recommend the need for Near-Duplication Detection by way of the fingerprint
algorithm. (13)
9. i) Examine the behavior of a web crawler and the outcome of crawling policies. (5)
ii) Illustrate the following (8)
a) Focused Crawling
b) Deep web
c) Distributed crawling
d) Site map
10. i) Explain the overview of Web search. (8)
ii) What is the purpose of Web indexing? (5)
13. (i) Based on the application of search engines, how will you categorize them and
what are the issues faced by them? (9)
(ii) Demonstrate Search Engine Optimization. (4)
14. Describe the following with example.
i) Bag of Words and Shingling (7)
ii) Hashing, Min Hash and Sim Hash (6)
1. Develop a web search structure for searching a newly hosted web domain by the naïve
user with a step-by-step procedure. (15)
2. i) Grade the optimization techniques available for search engines and rank them with your
justification. (9)
ii) Explain Web Crawler Taxonomy in detail (6)
3. Estimate the web crawling methods and illustrate how the various nodes of a
distributed crawler communicate and share URLs. (15)
4. Formulate the application of Near Duplicate Document Detection techniques and
generalize the advantages in plagiarism checking. (15)
UNIT V
PART A ( 2 Marks)
3. What are the main problems of user-based collaborative filtering ?
The two main problems of user-based CF are that the whole user database has to be kept in memory and that expensive similarity computation between the active user and all other users in the database has to be performed.
4. Define user based collaborative filtering.
User-based collaborative filtering algorithms work off the premise that if a user (A) has a
similar profile to another user (B), then A is more likely to prefer things that B prefers
when compared with a user chosen at random.
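A small user-based CF sketch (hypothetical ratings; cosine similarity between user rating vectors over co-rated items) :

    import math

    ratings = {"A": {"i1": 5, "i2": 3, "i3": 4},   # user -> item -> rating
               "B": {"i1": 4, "i2": 3, "i3": 5},
               "C": {"i1": 1, "i2": 5}}

    def sim(u, v):
        common = ratings[u].keys() & ratings[v].keys()   # co-rated items
        if not common:
            return 0.0
        dot = sum(ratings[u][i] * ratings[v][i] for i in common)
        nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
        nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
        return dot / (nu * nv)

    print(round(sim("A", "B"), 2), round(sim("A", "C"), 2))   # A is closer to B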
5. What are recommender systems ?
Recommender Systems are software tools and techniques providing suggestions for items to be of use to a user. The suggestions relate to various decision-making processes, such as what items to buy, what music to listen to, or what online news to read.
8. What is demographic based recommender system ?
This type of recommendation system categorizes users based on a set of demographic
classes. This algorithm requires market research data to fully implement. The main
benefit is that it doesn't need history of user ratings.
9. What is Singular Value Decomposition (SVD) ?
SVD is a matrix factorization technique that is usually used to reduce the number of
features of a data set by reducing space dimensions from N to K where K < N.
10. What is Content-based recommender ?
Content-based recommenders refer to such approaches, that provide recommendations by
comparing representations of content describing an item to representations of content that
interests the user. These approaches are sometimes also referred to as content-based
filtering.
11. What is matrix factorization model ?
Matrix factorization models map both users and items to a joint latent factor space of dimensionality k, and model each user-item interaction (e.g., a rating) as the inner product of the corresponding user and item factor vectors. The factor vectors are learned from the observed ratings, typically by minimizing the regularized squared prediction error.
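A minimal sketch (plain Python, stochastic gradient descent on a toy rating list; the factor count, learning rate and regularization strength are assumed values) :

    import random

    random.seed(0)
    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]  # (user, item, rating)
    k, lr, reg = 2, 0.01, 0.02

    P = [[random.random() for _ in range(k)] for _ in range(2)]   # user factors
    Q = [[random.random() for _ in range(k)] for _ in range(3)]   # item factors

    for _ in range(2000):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))    # prediction error
            for f in range(k):                                    # gradient step
                P[u][f] += lr * (err * Q[i][f] - reg * P[u][f])
                Q[i][f] += lr * (err * P[u][f] - reg * Q[i][f])

    print(round(sum(P[0][f] * Q[0][f] for f in range(k)), 2))     # close to the 5.0 rating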
PART B ( 13 Marks)
7. Illustrate the advantages and disadvantages of content-based and collaborative filtering
recommendation systems. (13)
8. Describe Knowledge based recommendation system in detail (13)
9. i) Detail the rules of HLA. (7)
ii) Differentiate between Hybrid and Collaborative Recommendation. (6)
PART C ( 15 Marks)