Project Report

PROJECT REPORT (Final Year Project 2007-2008) Hybrid Search Engine
Project Supervisor Mrs. Shikha Mehta
INTRODUCTION
Search Engines
Definition: A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or in a personal computer. Examples: Various search engines are available on the internet e.g. Google, Alta Vista, Ask.com, Yahoo, Lycos, Alltheweb, Myspace, etc. The popularity of search engines can be estimated by the fact that approximately 112 * 106 searches are made in a single day from one search engine alone.
How do Search Engines work ?

There are differences in the ways various search engines work, but they all perform three basic tasks: They search the Internet -- or select pieces of the web -- based on important words. [CRAWLER] They keep an index of the words they find, and where they find them. [INDEXER] They allow users to look for words or combinations of words found in that index. [SEARCHER]
www
CRAWLER
INDEXER
SEARCHER
Local Store (W3 copy)
End Users
PROBLEM STATEMENT
On the basis of recent studies made on the structure and dynamics of the web itself, it has been analyzed that the web is still growing at a high pace, and the dynamics of the web is shifting. More and more dynamic and real-time information is made available on the web.
Our aim is to design a search engine that meets the challenges of web growth and update dynamics.
How is our Search Engine Hybrid ?

FAST Crawler
Features included in our search engine:
COBWeb
Freshness algorithm Heterogeneous Crawlers Heterogeneous Updation mechanism
Distributed Architecture Inclusion of Importance Number Content based signatures for Page seen problem
HITS
Link based indexing
HYBRID SEARCH ENGINE
PAGE RANK
Source based indexing focusing on quality of source.
DOMINOS
Mercator
Inclusion of a new module namely, Local Cache which stores the URLs recently visited.
Content seen test using fingerprinting Checkpointing URL saving Checksum technique
Our Proposed Design
Data Flow
Initial links Links fetched
Crawler Module
ContentSeen Tester
Non redundant links
Compressor
Compressed files
Decompressed stream
Data stream
Keyword Matcher
DeCompressor
Database
Matching Links
Indexer
Ordered Links
USER
Snapshots
Crawler
INPUT Initial set of URLs taken for the sample :
OUTPUT As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
HTML Parser Programming Language C# Input After the crawler crawls the web and store the pages in the repository , we need to extract the useful information from the web page like title , no. of forward links etc. of all the web pages. Output The extracted information is then stored in the database .
Output
Compressor-Decompressor Programming Language C# Input After the crawler crawls the web , this module compress the pages and store them in the repository . We need to decompress all the web pages to search a keyword. Output The compressed pages are stored in the database .
Output
Content Seen Tester Programming Language C# Input The content seen tester generates a bit sequence of all the web pages using MD5 algorithm. Output The bit sequence of every web page is stored in the database.
Output
Indexer
Sorts the results found on the basis of a rank distribution algorithm.
Programming Language C# Input The links between all the web pages are fetched from the database. Output The rank of each web page is stored in the database.
Output
Refresher
Updates the local database with fresh copies of web pages.
Programming Language C# Input The cached pages from the database. Output The refreshed pages are stored in the repository.
Output
User Interface Programming Language ASP .NET Input The user enters a keyword or multiple keywords. Output The results are fetched to the user.
Output
Thank You!!
Presented By:
ANKUSH GULATI 040303 Project Id: B103 ANKIT KALRA 040321 Project Id: B119

Project Report

Uploaded by

Copyright:

Available Formats

Project Report

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Project Report

Uploaded by

Copyright:

Available Formats

PROJECT REPORT (Final Year Project 2007-2008) Hybrid Search Engine

Project Supervisor Mrs. Shikha Mehta

How do Search Engines work ?

Local Store (W3 copy)

How is our Search Engine Hybrid ?

Freshness algorithm Heterogeneous Crawlers Heterogeneous Updation mechanism

HYBRID SEARCH ENGINE

Our Proposed Design

Non redundant links

You might also like