SEARCH ENGINE

S.Akhil
17311A19D8

1. Introduction

Searching is one of the most common activities on the Internet, and search engines, as the instruments of searching, are among the most popular and frequently used sites. This is why webmasters and ordinary users alike should have a good understanding of search engines and searching. Webmasters use the major search engines to submit their sites and to search; ordinary users use them primarily for searching, and sometimes to submit their homepages or small sites. Why are search engines so popular? If you are an ordinary user and want to find text or pictures on a certain topic, the first thing you will do is visit a search engine to get some good URLs about that topic, and this happens every time you need information or data on any subject. If you are a webmaster, you will also need information while preparing a site for the Web, and you will use search engines for that too. When you finish, you must submit your URL to many search engines and then check your URL's ranking on each of them. Major search engines also carry news and many other interesting features. All of this briefly explains why search engines are so popular. As a novice user you must learn how to use search engines to search the Internet, and you must know that there are two ways of searching: by entering a query or by browsing categories. If you have keywords in mind, a query is usually the faster route; otherwise, browsing categories can help you narrow the topic.

1.1 History

Internet technology has been revolutionary; there is simply no question about this fact. Internet tools and resources have captivated the world to a great extent, and today people can hardly imagine life without them. Find a sector where internet tools and resources have no application and you would deserve a million dollars for the effort; literally, no field or sector is devoid of these applications. And the Internet is not just about websites: a variety of tools help you browse, find and bring together information of your choice on different platforms. In this regard, the importance of search engines cannot be denied. Search engines first emerged in the mid-1990s, and Google later gained fame as the most standardized and user-friendly search engine in the world. There were search engines operating before Google, but none of them was as easy to use. Google itself actually began as an academic project of two university students; the idea was later expanded and new technological features were added to improve it. Today it stands out as one of the most widely used websites in the world, and its owners earn billions of dollars each year. Following Google, many other search engines have been introduced by different software developers, but none has yet been able to replace it. Nevertheless, all of these search engines serve people a great deal, and further improvements can be expected in the near future.

2. Types of Search Engines

When people mention the term "search engine", it is often used generically to describe both crawler-based search engines and human-powered directories. In fact, these two types of search engines gather their listings in radically different ways and are therefore inherently different.

Crawler-based search engines, such as Google, AllTheWeb and AltaVista, create their listings automatically by using a piece of software to "crawl" or "spider" the web and then index what it finds to build the search base. Web page changes are picked up dynamically by crawler-based search engines and affect how those pages are listed in the search results. Crawler-based search engines are good when you have a specific search topic in mind and can be very efficient at finding relevant information in this situation. However, when the search topic is general, crawler-based engines may return hundreds of thousands of irrelevant responses to a simple search request, including lengthy documents in which the keyword appears only once.

Human-powered directories, such as the Yahoo directory, Open Directory and LookSmart, depend on human editors to create their listings. Typically, webmasters submit a short description of their website to the directory, or editors write one for the sites they review, and these manually edited descriptions form the search base. Changes made to individual web pages therefore have no effect on how those pages are listed in the search results. Human-powered directories are good when you are interested in a general topic; in this situation a directory can guide you, help you narrow your search and give refined results. Search results found in a human-powered directory are therefore usually more relevant to the search topic and more accurate. However, this is not an efficient way to find information when a specific search topic is in mind.

Table 1 summarizes the different types of the major search engines.

Table 1: Different types of the major search engines

  Search Engine     Type
  Google            Crawler-based search engine
  AllTheWeb         Crawler-based search engine
  Teoma             Crawler-based search engine
  Inktomi           Crawler-based search engine
  AltaVista         Crawler-based search engine
  LookSmart         Human-powered directory
  Open Directory    Human-powered directory
  Yahoo             Human-powered directory; also provides crawler-based results powered by Google
  MSN Search        Human-powered directory powered by LookSmart; also provides crawler-based results powered by Inktomi
  AOL Search        Crawler-based results powered by Google
  AskJeeves         Crawler-based results powered by Teoma
  HotBot            Crawler-based results powered by AllTheWeb, Google, Inktomi and Teoma ("4-in-1" search engine)
  Lycos             Crawler-based results powered by AllTheWeb
  Netscape Search   Crawler-based results powered by Google

From the table we can see that some search engines, like Yahoo and MSN Search, provide both crawler-based results and human-powered listings, and therefore become hybrid search engines. A hybrid search engine will still favor one type of listing over the other as its main type of results.

There is another type of search engine called the meta-search engine. Meta-search engines, such as Dogpile, Mamma, and MetaCrawler, transmit user-supplied keywords simultaneously to several individual search engines, which actually carry out the search. The results returned by all the search engines can then be integrated, duplicates can be eliminated, and additional features such as clustering results by subject can be implemented by the meta-search engine.
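
To make the meta-search idea concrete, the following minimal Python sketch fans a query out to several engines and merges the answers. It is illustrative only: the two placeholder engine functions and the URL-based de-duplication are assumptions made for the sketch, not the actual APIs used by Dogpile, Mamma or MetaCrawler.

# Meta-search sketch: send the same query to several engines,
# merge the result lists, and drop duplicate URLs.
def engine_one(query):
    # Placeholder fetcher; a real meta-search engine would call an external search API here.
    return [("Result A", "http://example.com/a"), ("Shared result", "http://example.com/shared")]

def engine_two(query):
    return [("Result B", "http://example.com/b"), ("Shared result", "http://example.com/shared")]

def meta_search(query, engines):
    merged, seen_urls = [], set()
    for engine in engines:
        for title, url in engine(query):
            if url not in seen_urls:          # eliminate duplicates across engines
                seen_urls.add(url)
                merged.append((title, url))
    return merged

if __name__ == "__main__":
    for title, url in meta_search("web crawlers", [engine_one, engine_two]):
        print(title, "->", url)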

3.1 Web Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

3.1.1 Index design factors

Major factors in designing a search engine's architecture include:

Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates with the data collection policy. Search engine index merging is similar in concept to the SQL MERGE command and other merge algorithms.

Storage techniques: How to store the index data, that is, whether information should be compressed or filtered.

Index size: How much computer storage is required to support the index.

Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central concern of computer science.

Maintenance: How the index is maintained over time.

Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning (including schemes such as hash-based or composite partitioning), and replication.
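
To illustrate the inverted index that the lookup-speed factor above refers to, here is a minimal Python sketch. It is a toy model under simple assumptions (whitespace tokenization, no stemming, no compression), not a production indexer.

# Toy inverted index: map each term to the set of document ids containing it,
# so a query term can be looked up without scanning every document.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():     # naive whitespace tokenization
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

if __name__ == "__main__":
    docs = {1: "search engines index the web",
            2: "web crawlers download pages",
            3: "the index makes search fast"}
    idx = build_index(docs)
    print(search(idx, "search index"))   # -> {1, 3}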

3.2 Web crawling

Web crawlers are an essential component of search engines, and running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially when crawling sites spread across multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel. Despite the numerous applications for web crawlers, at the core they are all fundamentally the same. Following is the process by which Web crawlers work:

Download the Web page.
Parse through the downloaded page and retrieve all the links.
For each link retrieved, repeat the process.

The web crawler can be used for crawling through a whole site on the Internet or an intranet. You specify a start URL and the crawler follows all links found in that HTML page. This usually leads to more links, which are followed again, and so on. A site can be seen as a tree structure: the root is the start URL, all links in that root HTML page are direct sons of the root, and subsequent links are in turn sons of the previous pages. A sketch of this basic crawl loop is given at the end of this section.

Besides general-purpose crawlers there are focused crawlers, which try to download only pages that are relevant to a predefined set of topics. The most crucial evaluation of focused crawling is to measure the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered out of the crawl. This harvest ratio must be high; otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it might be better to use an ordinary crawler instead. A focused crawler typically combines a classifier, which evaluates the relevance of a hypertext document with respect to the focus topics in order to decide on link expansion; a distiller, which determines a measure of the centrality of crawled pages in order to assign visit priorities; and a crawler with dynamically reconfigurable priority controls, which is governed by the classifier and distiller.

3.3 Page Rank

Page-Rank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the Page-Rank of E and denoted by PR(E). The name "Page-Rank" is a trademark of Google, and the Page-Rank process has been patented (U.S. Patent 6,285,999). However, the patent is assigned to Stanford University and not to Google; Google has exclusive license rights on the patent from Stanford University. The university received 1.8 million shares of Google in exchange for use of the patent; the shares were sold in 2005 for $336 million.

As an algorithm, Page-Rank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. Page-Rank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided between all documents in the collection at the beginning of the computational process. The Page-Rank computations require several passes through the collection, called "iterations", to adjust the approximate Page-Rank values so that they more closely reflect the theoretical true value. A probability is expressed as a numeric value between 0 and 1; a 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a Page-Rank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 Page-Rank.

3.3.1 The Ranking algorithm simplified

Assume a small universe of four web pages: A, B, C and D. The initial approximation of Page-Rank is evenly divided between these four documents, so each document begins with an estimated Page-Rank of 0.25. In the original form of Page-Rank the initial values were simply 1, which meant that the sum over all pages equalled the total number of pages on the web. Later versions of Page-Rank (see the formulas below) assume a probability distribution between 0 and 1; here we simply use a probability distribution, hence the initial value of 0.25.

If pages B, C, and D each link only to A, they each confer 0.25 Page-Rank to A. All Page-Rank PR(·) in this simplistic system would thus gather at A, because all links point to A:

PR(A) = PR(B) + PR(C) + PR(D) = 0.75

Now suppose instead that page B also has a link to page C, and page D has links to all three pages. The value of the link votes is divided among all the outbound links on a page. Thus page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C, and only one third of D's Page-Rank is counted for A's Page-Rank (approximately 0.083):

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 ≈ 0.458
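
The vote-splitting arithmetic above can be checked with a few lines of Python; the page names, the initial value of 0.25 and the link structure are taken directly from the example.

# One step of the simplified example: every page starts at 0.25.
# B links to A and C, C links to A, and D links to A, B and C,
# so each page's rank is split evenly over its outbound links.
pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
pr_a = pr["B"] / 2 + pr["C"] / 1 + pr["D"] / 3
print(round(pr_a, 3))   # 0.458: B contributes 0.125, C contributes 0.25, D about 0.083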

In other words, the Page-Rank conferred by an outbound link is equal to the document's own Page-Rank score divided by the normalized number of outbound links L(·) (it is assumed that links to specific URLs only count once per document):

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)

In the general case, the Page-Rank value for any page u can be expressed as

PR(u) = \sum_{v \in B_u} PR(v) / L(v)

i.e. the Page-Rank value for a page u depends on the Page-Rank values of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.

3.3.2 Damping factor

The Page-Rank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is the damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor is set to around 0.85. The damping factor is subtracted from 1 (and, in some variations of the algorithm, the result is divided by the number of documents in the collection), and this term is then added to the product of the damping factor and the sum of the incoming Page-Rank scores. That is,

PR(A) = (1 - d) + d ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ... )

or, with N the number of documents in the collection,

PR(A) = (1 - d)/N + d ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ... )

So any page's Page-Rank is derived in large part from the Page-Ranks of the other pages; the damping factor adjusts the derived value downward. Google recalculates Page-Rank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of Page-Rank decreases for all documents.

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The Page-Rank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages and the transitions, which are all equally probable, are the links between pages. If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. However, the solution is quite simple: if the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.
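
As an illustration of the damping-factor formula, the following self-contained Python sketch iterates PR(p) = (1 - d)/N + d * (sum of incoming scores) over the four-page example until the values settle. The tolerance, the iteration cap and the way the sink page A spreads its score over the other pages are assumptions made for the sketch, not Google's implementation.

# Iterative Page-Rank for the running four-page example with d = 0.85.
d = 0.85
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)
N = len(pages)

pr = {p: 1.0 / N for p in pages}            # start from the uniform distribution
for _ in range(100):
    new = {}
    for p in pages:
        # Score from pages linking to p, each divided by its outbound link count.
        incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        # A sink page's score is spread over the other pages (the random jump described above).
        sink_share = sum(pr[q] / (N - 1) for q in pages if not links[q] and q != p)
        new[p] = (1 - d) / N + d * (incoming + sink_share)
    if max(abs(new[p] - pr[p]) for p in pages) < 1e-8:   # converged
        pr = new
        break
    pr = new

print({p: round(v, 3) for p, v in pr.items()})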

When calculating Page-Rank, pages with no outbound links are assumed to link out to all other pages in the collection; their Page-Rank scores are therefore divided evenly among all other pages. In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability usually set to d = 0.85, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature. So the equation is as follows:

PR(p_i) = (1 - d)/N + d \sum_{p_j \in M(p_i)} PR(p_j) / L(p_j)

where p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of pages.

The Page-Rank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes Page-Rank a particularly elegant metric: the eigenvector is

R = [PR(p_1), PR(p_2), ..., PR(p_N)]^T

where R is the solution of the equation

R = [(1 - d)/N, (1 - d)/N, ..., (1 - d)/N]^T + d [\ell(p_i, p_j)] R

and the adjacency function \ell(p_i, p_j) is 0 if page p_j does not link to p_i, and is normalized such that, for each j,

\sum_{i=1}^{N} \ell(p_i, p_j) = 1

i.e. the elements of each column sum to 1. This is a variant of the eigenvector centrality measure commonly used in network analysis. Because of the large eigen-gap of the modified adjacency matrix above, the values of the Page-Rank eigenvector are fast to approximate (only a few iterations are needed). As a result of Markov theory, it can be shown that the Page-Rank of a page is the probability of being at that page after many clicks. This happens to equal 1/t, where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.
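
The matrix form above can be checked numerically. The short NumPy sketch below builds the column-normalized link matrix for the same four-page example (with the sink page A spreading its score over the other pages), verifies that each column sums to 1, and runs the fixed-point iteration R = (1 - d)/N + d * L * R. NumPy and the small iteration count are choices made for the illustration only.

# Power-iteration view of Page-Rank: R = (1 - d)/N * 1 + d * L @ R.
import numpy as np

d, N = 0.85, 4                       # damping factor; pages are A, B, C, D
# L[i, j] is 1/L(j) if page j links to page i, else 0 (columns: A, B, C, D).
L = np.array([
    [0.0, 0.5, 1.0, 1/3],            # links into A from B, C, D
    [0.0, 0.0, 0.0, 1/3],            # link into B from D
    [0.0, 0.5, 0.0, 1/3],            # links into C from B, D
    [0.0, 0.0, 0.0, 0.0],            # nothing links to D
])
L[1:, 0] = 1.0 / (N - 1)             # sink page A spreads its score over B, C, D

assert np.allclose(L.sum(axis=0), 1.0)   # each column sums to 1, as required

R = np.full(N, 1.0 / N)                  # start from the uniform distribution
for _ in range(50):
    R_next = (1 - d) / N + d * (L @ R)
    if np.allclose(R_next, R, atol=1e-10):   # converges in only a few iterations
        R = R_next
        break
    R = R_next

print(dict(zip("ABCD", np.round(R, 3))))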

The Google Directory (itself a derivative of the Open Directory Project) allows users to see results sorted by Page-Rank within categories, and it is the only Google service in which Page-Rank directly determines the display order. The main disadvantage of Page-Rank is that it favors older pages: a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages, such as Wikipedia). Several strategies have been proposed to accelerate the computation of Page-Rank. Various strategies to manipulate Page-Rank have also been employed in concerted efforts to improve search result rankings and monetize advertising links. These strategies have severely impacted the reliability of the Page-Rank concept, which seeks to determine which documents are actually highly valued by the Web community.

4. Conclusion

With the precipitous expansion of the Web, extracting knowledge from the Web is becoming gradually more important and popular. This is due to the Web's convenience and its richness of information. Today search engines can cover more than 60% of the information on the World Wide Web. The future prospects of every aspect of search engines are very bright; Google, for example, is adding embedded intelligence to its search engine. For all their problems, online search engines have come a long way. Sites like Google are pioneering the use of sophisticated techniques to help distinguish content from drivel, and the arms race between search engines and the marketers who want to manipulate them has spurred innovation. But the challenge of finding relevant content online remains. Because of the sheer number of documents available, we can find interesting and relevant results for almost any search query. The problem is that those results are likely to be hidden in a mass of semi-relevant and irrelevant information, with no easy way to distinguish the good from the bad.

5. References

Brin, Sergey and Page, Lawrence. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, April 1998.
Baldi, Pierre. Modeling the Internet and the Web: Probabilistic Methods and Algorithms, 2003.
Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data, 2003.
Jansen, B. J. "The Comparative Effectiveness of Sponsored and Non-sponsored Links for Web E-commerce Queries" (PDF). ACM Transactions on the Web, May 2007.
Del Corso, Gianna M., Gullí, Antonio and Romani, Francesco. "Fast PageRank Computation via a Sparse Linear System (Extended Abstract)". http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.118.5422.
Deeho Search Engine Optimization (SEO) solutions.