COMP S834: Unit 4
Unit 4
Web indexing and search engines
Course team
Developer:
Designer:
Coordinator:
Member:
Contents

Overview
Introduction
Local or remote
Parts of a local search engine
Search gateway programs
Databases
Summary
Feedback to activities
References
Overview
Search engines and directories are the tools most frequently used to
locate and retrieve information on the Web. With their user-friendly,
graphical, point-and-click interfaces, it's no wonder that they are among
the most frequently visited websites.
As Web information providers, it is important for us to understand how
search engines and directories work: how they collect documents, index
these documents, process search terms and generate the results. We also
need to understand the criteria used for ranking search results. We can
use this knowledge to increase the chance that our pages will appear near
the top of the rankings.
As Web users, we are often frustrated by the inaccuracy and irrelevancy
of the results that we get from search engines, despite the high number of
results that are returned to us. We will look at the difficulties that are
inherent in indexing the Web due to its massive size, lack of
cohesiveness and the nature of the content itself.
We will also talk about the so-called 'deep Web', the portion of the
Web that is not covered by search engines. According to some estimates,
as much as half of all Web documents are inaccessible to search engines
(BrightPlanet 2001).
Finally, you will gain hands-on experience in providing search services
on your own website. Most e-commerce sites nowadays offer online
searching of their product catalogue or website content in order to
facilitate navigation by visitors, and we will implement a similar service
for ABC Books' online catalogue.
More specifically, this unit:
This unit should take you about four weeks, or 30–35 hours, to complete.
Please plan your time carefully.
Introduction
When the Web first appeared in early 1990s, it introduced a new and
convenient way of distributing information to a global audience.
Suddenly, anyone who knew how to create and upload pages to a Web
server could become a Web publisher.
However, one consequence of this massive increase in Web publication
activity is that we are overwhelmed with all sorts of information.
Retrieving high-quality, timely and accurate information from the Web is
a challenging task.
Unless users know the exact location or URL of the page they are looking for, they often
rely on a directory or search engine to find the information they want. In
this case, users will go to a search website, submit a query that is
typically a list of keywords, and then receive a list of relevant webpages
that contain the keywords entered. Directories and search engines are
analogous to the cataloguing and indexing services that are available in
physical libraries.
Figure 4.1
Search engines do not search the actual documents on the Web every
time a query is submitted. For the sake of speed and efficiency, they go
through index files that contain stored information about the documents
being searched. Therefore, the performance of a search engine is
determined by the quality and freshness of its index files.
A Web index is like a card catalogue in a physical library. Unlike a
physical library, however, the Web is a decentralized information
resource where content can be added, updated and taken offline by
individual owners at will. One source even states that 'The current state
of search engines can be compared to a phone book which is updated
irregularly, and has most of the pages ripped out' (Lawrence and Giles
1998). On the other hand, although Web index files do not always reflect
the most recent state of every page, they allow queries to be answered
quickly without having to fetch documents from across the Web at search time.
Reading 4.1
webopedia.com, 'How Web search engines work',
http://www.webopedia.com/DidYouKnow/Internet/2003/HowWeb
SearchEnginesWork.asp.
Note: You don't need to click on the 'For more information' links at
the bottom of the page.
From Reading 4.1, you can see that the main distinction between
different types of search engines is whether they use an automated
program (e.g. crawler, spider or robot) to collect pages for indexing or
whether they rely on humans to catalogue and classify the pages that are
submitted to them.
Activity 4.1
Look for relevant materials on the following topic using Yahoo's
directory (http://dir.yahoo.com) and Google (http://www.google.com):
hiking and trekking vacations
Describe the actions you took on both Yahoo and Google in order to
arrive at your results. You can also evaluate the quality and relevance of
the results returned by both search tools.
Note: There is feedback on this activity at the back of this unit.
From the previous activity, you may have noticed that search engines and
search directories also organize and present information differently.
Search directories aim to organize Web information resources in a
hierarchy of categories that can be browsed, while search engines present
a list of relevant URLs after processing the keywords entered. Now do
the following self-test to assess your understanding of how these tools
should be used.
Self-test 4.1
Describe when you think it would be suitable to use a search engine for a
particular search, and when it would be suitable to use a search directory.
Search engines are one of the most popular means of finding websites,
with approximately 85% of users using a search engine to find
information (GVU Center 1998). It's important for information providers
to know how search engines work since a lot of their website traffic will
probably come from a search engine referral. The next section will focus
on the components that make up a typical crawler-based search engine.
Figure 4.2 shows how these different components may be put together in
a typical search engine.
Figure 4.2
Gatherer module
The main task of this module is to fetch webpages for inclusion into a
search index. As mentioned in Reading 4.1, the program responsible for
visiting and collecting different webpages is called a robot, spider or
crawler. Crawlers are small programs that browse the Web on the search
engine's behalf, in the same way that a human user surfs the Web by
following a series of links.
Crawlers are given a starting set of URLs whose contents they should
retrieve (figure 4.2). These URLs may have been submitted to the search
engine, or they can be heavily used servers and popular pages. In this
manner, the crawler quickly begins to travel across the most widely used
portions of the Web.
A crawler usually begins its visit from a website's homepage, and then
selects the next document to traverse by following a hypertext link. The
crawler keeps following successive links until an exit condition occurs.
The exit conditions are usually the time elapsed during the traversal or
the number of levels of hyperlinks that have been visited.
In some cases, crawlers may extract URLs appearing in the retrieved
pages and submit this information to a separate, crawler control module
(figure 4.2). The control module determines what links to visit next and
informs the crawler accordingly.
It's possible for your website to be included in a search engine even if
you have never submitted your pages before. A crawler may arrive at
your site by following links from another page. This is good news for
owners who do want their sites to be listed, but there may be cases where
website owners do not want their pages to be indexed just yet. For
example, there may be pages that:
are under construction and are not yet ready for public viewing;
Activity 4.2
Read the following two articles before you answer the questions in this
activity. These documents describe how you can give specific
instructions to robots that may visit your site:
1
Based on these readings, write the indexing rules that you should specify
to the robots that come to your ABC Books website:
1
Use the robots META tag to restrict spiders from crawling any files
that authors.html links to.
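As an illustration of the syntax only (the paths below are made up and are not the answer to this activity), a robots.txt file placed at the root of a website might look like this:

# Rules below apply to all robots
User-agent: *
# Do not crawl anything under this directory
Disallow: /drafts/
# Do not crawl this single file
Disallow: /news.html

The robots META tag works at the level of an individual page instead. Placing the following line inside a page's <head> element asks compliant robots not to index that page and not to follow any of the links it contains:

<meta name="robots" content="noindex, nofollow">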
When a robot visits a webpage, it may visit all the other pages on the
same server by following links from the first page visited. This action is
called a 'deep crawl'. If a search engine deep crawls your site, then it's
enough for you to submit your site's homepage. Otherwise, you have to
explicitly submit all the pages that you want indexed. As of this writing,
AllTheWeb, Google and Inktomi deep crawl the sites that they visit,
while AltaVista does not (Sullivan 2002).
Robots must also observe a policy for limiting the traversal of a website.
Most search engine companies treat their crawling strategy as a trade
secret and do not normally reveal detailed information about it. However,
there are two traversal strategies taken from graph theory that may be
used: breadth first and depth first. These algorithms are well-suited to the
Web due to its graph-like structure which consists of nodes (i.e.
webpages) and links (i.e. hypertext).
A breadth-first strategy will first traverse all the hypertext links found in
the initial URL, gather up these documents, then examine this group of
gathered documents for further links to follow. In other words, pages are
crawled in the order that they are discovered. Breadth-first gathering
results in a wide and shallow traverse.
Figure 4.3  Breadth-first traversal
The depth-first strategy starts from the initial URL, then keeps following
successive links at ever-increasing depths. Usually there is a limit to the
number of links followed using depth first. The result is that the depth-first gatherer does a narrow but deep traverse.
Figure 4.4  Depth-first traversal
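To make the two strategies concrete, here is a minimal sketch in PHP. It is purely illustrative: real crawlers are far more elaborate, and the page names, link structure and five-page limit below are made up. A breadth-first crawler keeps newly discovered links in a queue, while swapping the queue for a stack turns the same loop into a depth-first crawler:

<?php
// Illustrative link structure: each page maps to the pages it links to.
$links = array(
    'A' => array('B', 'C', 'D'),
    'B' => array('E', 'F'),
    'C' => array('G'),
    'D' => array('H'),
);

$queue   = array('A');     // starting URL
$visited = array();
$limit   = 5;              // exit condition: stop after five pages

while (count($queue) > 0 && count($visited) < $limit) {
    $page = array_shift($queue);     // breadth-first: follow the oldest link first
    // $page = array_pop($queue);    // depth-first: follow the newest link first
    if (in_array($page, $visited)) {
        continue;                    // never fetch the same page twice
    }
    $visited[] = $page;
    if (isset($links[$page])) {
        foreach ($links[$page] as $next) {
            $queue[] = $next;        // record newly discovered links
        }
    }
}

echo implode(' ', $visited);         // breadth-first order: A B C D E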
Activity 4.3
Compare the results of breadth-first and depth-first traversal on ABC
Books' website, given that the traversal must stop after five links have
been explored. The following site diagram shows the links that appear on
the homepage.
Figure 4.5  Site diagram for the ABC Books website, showing the pages index.html, books.html, authors.html, news.html, tarzan.html, kingdoms.html, mansions.html, burroughs.html and guanzhong.html
Which method is more likely to generate a better quality index for the
site?
Note: There is feedback on this activity at the back of the unit.
Given the enormous size of the Web and the frequency of updates of
information on the Web, here are some of the questions that arise when
setting a crawling policy for a search engine (Arasu et al. 2001):
Reading 4.2
Guidelines for robot writers, http://www.robotstxt.org/wc/
guidelines.html.
Self-test 4.2
1
Page repository
The page repository is a storage system for managing large collections of
webpages. The repository performs two functions: (1) it allows the
crawler to store the collected pages; and (2) it allows the indexer to
retrieve pages for indexing.
Pages may be stored in the repository only temporarily during the
crawling and indexing process. The repository may also cache collected
pages so that the search engine can serve out result pages very quickly.
Due to the vast quantities of documents that must be kept in the page
repository, special consideration must be given to scalability and storage
distribution issues during its design and implementation.
Indexer module
The gatherer fetches documents and submits them to an indexer. The
indexer then assigns each document a unique identifier (called the
primary key) and creates a record for it. This record contains the unique
identifier, the URL of the document, and a set of values or related terms
describing the document.
The indexer also extracts words from each page and records the URL
where each word occurs, along with its location within the page. The
result is generally a very large lookup table, also called an 'inverted
index' or 'text index' (see figure 4.2). The text index can provide all the
URLs where a given word occurs.
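As a rough sketch of what such a lookup table contains (the page names and word positions below are invented for illustration), an inverted index maps each word to the pages, and the positions within those pages, where it occurs:

<?php
// Each word points to the pages where it occurs and to the word's
// positions within each page (all values here are illustrative).
$index = array(
    'tarzan' => array(
        'books/tarzan.html'      => array(3, 17, 42),
        'authors/burroughs.html' => array(8),
    ),
    'kingdoms' => array(
        'books/kingdoms.html'    => array(2, 11),
    ),
);

// A one-word query becomes a single lookup instead of a scan of every page.
$word = 'tarzan';
if (isset($index[$word])) {
    foreach ($index[$word] as $url => $positions) {
        echo $url . ' (' . count($positions) . " occurrences)\n";
    }
}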
Some indexers may index every single word on the page (i.e. full-text
indexing), while some may select words that occur in important areas,
such as the title, headings, subheadings, links, and the first few lines of
text on a page. The selection criteria vary from one search engine to
another, which explains why they return different results.
For example, Lycos keeps track of the words in the title, subheadings,
links, and words that appear in the first few lines of the text. It also looks at the words that occur most frequently within the document.
Figure 4.7
Aside from the text index, the indexer module can also build other types
of indexes. These other indexes are used to enhance the quality and
relevance of search results beyond what can be achieved through text-based indexing alone. For example, Google keeps information about the
links between pages in its structure index, because this information may
be used to rank search results later on. Utility indexes may also be used
to provide access to pages of a given length, pages of a certain
importance, or pages with some number of images in them (figure 4.2).
Indexing is the key component of a search engine. An effective indexing
process will yield a high-quality index that accurately represents the
collection of information resources. Searching a high-quality index is
more likely to result in the precise identification and retrieval of the
correct resources.
Due to the transient nature of most Web content, indexes must be
constantly updated in order to maintain the freshness and relevance of search results.
Self-test 4.3
List some of the ways in which the indexes of different search engines
may vary.
In the next activity, you will try refining your searches using Google's
Advanced Search form and observe whether it succeeds in making your
search results more focused and in filtering out irrelevant results.
Activity 4.4
Try using Google's Advanced Search interface at
http://www.google.com/advanced_search?hl=en.
The features available in Google's Advanced Search are also available in
basic search, but you must know how to type the operators into the
search box along with your keywords. A very common example of an
operator is OR.
For example, if you're looking for information on Bangkok or Shanghai
vacation packages, you can type 'vacation packages Bangkok OR
Shanghai' directly into the search box without going through the
Advanced Search form.
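A few other widely documented operators can be combined with your keywords in the same way. Note that OR must be typed in capital letters, double quotes force an exact phrase, a leading minus sign excludes a word, and site: restricts results to a single domain. The queries below are just examples:

vacation packages Bangkok OR Shanghai
"trekking vacations" -Nepal
hiking trails site:edu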
The next two optional readings can provide you with more information
on what other operators are available in Google:
1
Now that we understand how a search engine works, it's time for us to
put this knowledge to practical use in the next section.
Reading 4.3
Sullivan, D (2006) Nielsen NetRatings search engine ratings,
SearchEngineWatch.com, January 24,
http://www.searchenginewatch.com/reports/article.php/2156451.
Reading 4.4
Sullivan, D (2006) comScore Media Metrix search engine ratings,
SearchEngineWatch.com, April 20,
http://www.searchenginewatch.com/reports/article.php/2156431.
These two readings illustrate that there is more than one way to measure
the reach and popularity of a search engine.
Self-test 4.4
Describe at least two situations which may result in over-counting when
popularity is measured by audience reach (i.e. percentage of unique
visitors who use a search engine).
Reading 4.5
Sullivan, D (2004) Submitting to crawlers: Google, Yahoo,
Ask/Teoma & Microsoft MSN, SearchEngineWatch.com, July 5,
http://searchenginewatch.com/webmasters/article.php/2167871.
Reading 4.5 stresses the importance of getting other sites to link to your
site. It's no longer enough to submit your website through the 'Add URL'
form in a search engine. Many search engines now consider the quality
and quantity of the links pointing to your site when they determine how
your pages should be ranked within their results.
Some search engines (such as Inktomi and Teoma) do not even provide
the option to submit websites directly to them anymore. They rely
completely on link analysis to determine if your site will get listed or not.
Another option is to pay a fee in order to get listed on major directories
and search engines. This increases the chances that your site will be
picked up by other crawlers.
When building links to your site, you should concentrate on getting links
from webpages whose content is closely related or similar to yours. For
example, it's a good idea for ABC Books to exchange links with websites
that also deal with classic literary works.
Self-test 4.5
1
Visit the 'Add URL' form for Google and AltaVista. What information
do you need to provide in order to get your website listed?
List some ways to increase the number of links pointing to your site.
It takes several weeks after you've submitted a site before it gets listed, if
at all. Unless you've paid a fee in order to get listed, you cannot expect
search engine companies to tell you when you'll get listed, or why they
have ignored your submission. Information providers must check for
themselves whether their website has been indexed by the search engines
they've submitted to.
Activity 4.5
The following reading shows you the best ways to confirm whether your
webpages have been indexed by the major crawler-based search engines:
Sullivan, D (2001) Checking your listing in search engines,
SearchEngineWatch.com, October 26,
http://www.searchenginewatch.com/webmasters/article.php/2167861.
Use the URLs in the reading above to check whether ABC Books'
competitors are listed in the following search engines. You can also note
how many pages from these competitors' sites got listed.
Table 4.1
Search engines: Google (http://www.google.com), AltaVista (http://www.altavista.com)
Competitors' sites: Paddyfields (http://www.paddyfields.com)
Note: There is feedback on this activity at the back of the unit.
In the end, it's important to remember that search engines are not the
only way that visitors will find your site. It's estimated that search
engines are only able to index 15% of all websites. There are other
effective ways for promoting your website, such as email, advertising
banners and link exchanges. You should never rely entirely on search
engines to direct traffic to your site.
Reading 4.6
Sullivan, D (2004) Submitting to directories: Yahoo & The Open
Directory, SearchEngineWatch.com, July 5,
http://searchenginewatch.com/webmasters/article.php/2167881.
When you submit your site to a directory, you can suggest an appropriate
category for your site to be listed under. However, human editors will
still evaluate your request and ultimately decide whether you will be
listed under your desired category or somewhere else.
Figure 4.8
In the next activity, you will prepare the information that will be
submitted to a search directory for the ABC Books website.
Activity 4.6
Here is a screen shot of Yahoo's website submission form.
Figure 4.9
Aside from the subject category in figure 4.8, what other appropriate
categories could ABC Books be listed under? (Hint: Try locating
'independent booksellers' in Yahoo's directory, since this category
may also be suitable for ABC Books.)
Search engines generally give a higher ranking to pages that:
have the keyword in the title (which may result in a higher ranking
than just having it in the body of the text);
have the keyword in their URL, e.g. when using the keyword mp3,
greater weight would be given to documents with the domain name
http://www.mp3.com;
have more of the keywords occurring close to each other within the
same document (i.e. when searching on multiple keywords);
are often clicked by users when they are returned as search results,
also known as 'click-through popularity' (click-through rates are
generally accepted as a measure of success in getting visitors to a
site, but nowadays, higher traffic does not always translate into
profitability); and
Google makes use of link structure information when ranking its search
results. Using this scheme, pages that have more links pointing to them
are considered more relevant and will therefore appear higher in the
search results. The importance of the links themselves is also ranked, so
results are ranked higher depending not just on quantity but on the
importance of the pages that link to them. For example, a page might be
given more importance if Yahoo points to it rather than if some unknown
page points to it.
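Google's production ranking formula is not public, but the originally published PageRank calculation gives the flavour of how link structure can be turned into a score. In simplified form:

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

where T1 to Tn are the pages that link to page A, C(T) is the number of outgoing links on page T, and d is a damping factor commonly set to around 0.85. A page therefore scores highly either by attracting many links, or by attracting a few links from pages that themselves score highly.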
The location and frequency of keywords on a webpage may also affect its
ranking. A search engine may analyse how often a keyword appears in
relation to other words in a webpage. It may also check if the search
keywords appear in certain areas, such as near the top of a webpage,
within a heading or in the first few paragraphs of text
(SearchEngineWatch.com).
Self-test 4.6
Discuss the benefits and problems of using the following strategies for
ranking search results:
link popularity;
The next table summarizes the effects of the factors listed above on the
quality of library and Web indexes.
Table 4.2
Index of a library    Index of the Web
Some website owners also deliberately manipulate their pages to attract search engine
traffic, even under false pretences. This is another problem that does not
exist in traditional, closed information retrieval systems.
Reading 4.7
University Libraries, University at Albany, 'The deep Web',
http://www.internettutorials.net/deepweb.html.
Activity 4.7
Visit the following deep Web sites and compare the quality of the
information on these sites with what you can get on the surface Web:
1
Self-test 4.7
List four examples of information that belongs to the invisible Web, and
explain why this information is inaccessible to search engines.
The following sections provide information to help you choose from the many search tools available.
Local or remote
The good news is that you rarely have to create your own search engine.
There are many search tools available for almost any platform and Web
server you can imagine. They range from free to very expensive, from
user-friendly, graphical interfaces to compile-it-yourself. No matter
which option you choose, though, you should know that there are two
ways of providing a search service for a website:
1  Local: the search engine runs locally on your Web server and
conducts searches against a Web index stored on your local machine.
2  Remote: a search service running on another server indexes your
pages and carries out the searches for you; you simply add a search
form to your site that sends each query to that remote service.
Note that the remote option (i.e. option 2) is not the same as submitting
local pages to search engines such as Google or AltaVista. When you
submit your pages to these Web search engines, the index entries are
added to their global index, made up of search terms from pages taken
from all over the Web. The remote option we are talking about will confine the
search only to index entries built from the pages on your website.
Activity 4.8
The following URL discusses whether remote or local search should be
used.
http://www.thesitewizard.com/archive/searchengine.shtml
After you have finished reading, answer the following questions:
1
It's better to use this if the data is not open to the public.
Figure 4.10  How a local search service works: the indexer gets words from the site's HTML pages and stores them in the search index file; the search form sends the user's search query to the search engine, which looks in the index, gets the list of matches and returns formatted results, from which the user opens a found page. (Source: http://www.searchtools.com.)
Search index file: created by the search indexer program, this file
stores the data from your site in a special index or database, designed
for very quick access. Depending on the indexing algorithm and size
of your site, this file can become very large. It must be updated often,
or it will become unsynchronized with the pages and provide
obsolete results.
Search forms: the HTML interface to the site search tool, provided
for visitors to enter their search terms and specify their preferences
for the search. Some tools provide pre-built forms (a minimal example
is sketched after this list).
Results page: the formatted list of matching pages that the search
engine returns to the visitor.
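A minimal search form might look like the sketch below. The action URL and the field name depend entirely on the search tool you install, so the values here are illustrative only:

<!-- Sends the visitor's keywords to the site's search program -->
<form method="get" action="/cgi-bin/search.cgi">
  Search this site:
  <input type="text" name="query" size="30">
  <input type="submit" value="Search">
</form>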
Now that we know how a local search service works and what its
different components are, we will implement a remote indexing and
search service for ABC Books in the following activity.
Activity 4.9
1
Upload the website to your Web server at OUHK. The website needs
to be online for the remote search service to access and index its
pages.
FreeFind is the remote search service that we will use. You must first
register for an account at
http://www.freefind.com/
Our goal is to use FreeFind to create a remote search index for ABC
Books, and then to try some searches on it.
4
Another email will be sent to let you know that your site has
already been indexed. Log in to your account at FreeFind and
copy the HTML that is needed to include a search box on your
website. FreeFind offers a number of styles for you to choose
from. Figure 4.11 shows one of the available options.
You can further customize the search results page by adding your
own logo, changing the background and text colour, and
specifying which fields (e.g. title, description and URL) will be
shown. This can be done when you log in to your account as well.
For detailed instructions on how these steps should be done, you can
refer to the following page on the FreeFind site:
http://www.freefind.com/library/tut/pagesearch/
The important sections to read are: 'Setup overview', 'Indexing your
site', 'Adding your panel to your site' and 'Customizing your search
results'.
Note: You can view the model answer in action on the course website.
Activity 4.10
1
SWISH-E is the local search software that we will use. You can
download the UNIX version from:
http://www.swish-e.org/Download
Next, you should unzip the downloaded file and install it from the
directory which is created for you. Here are the instructions to
follow:
http://www.swish-e.org/current/docs/INSTALL.html
#Building_Swish_e
# swish-e configuration file for the ABC Books website

# Index the files found in these directories
IndexDir ./authors
IndexDir ./books
# Only index the .html files
IndexOnly .html
# Show basic info while indexing
IndexReport 1
# Specify words to ignore, called stopwords.
IgnoreWords www http a an the of and or
Now you can build the search index based on the instructions in this
configuration file. Enter this in the command line:
$ swish-e -c swish-e.conf
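You can then run a quick test query from the command line to confirm that the index works. Assuming the index was written to the default index.swish-e file in the current directory (the keyword 'tarzan' is just an example), the -w option supplies the search words and -f names the index file to search:

$ swish-e -w tarzan -f index.swish-e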
Now that you've verified the search engine works, we need to build a
server-side program which accepts keywords from the user and sends
them to SWISH-E for processing. You can download a pre-written
PHP search script from the course website (search_swish.php).
Edit the section within the code titled 'User-defined configuration
variables' and place the values specific to your own installation.
Edit search_swish.php to change the line
$index = "/var/www/..."
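If you are curious about what such a script does internally, here is a minimal sketch of the general idea. It is not the course's search_swish.php: the paths, the parameter name and the output handling below are all assumptions for illustration.

<?php
// Minimal sketch only: run SWISH-E with the visitor's keywords and
// print the raw result lines. The paths below are assumed, not real.
$swish = '/usr/local/bin/swish-e';          // assumed location of the binary
$index = '/var/www/abcbooks/index.swish-e'; // assumed location of your index

$query = isset($_GET['query']) ? trim($_GET['query']) : '';
if ($query !== '') {
    // escapeshellarg() stops the keywords being treated as shell options.
    $cmd = $swish . ' -f ' . escapeshellarg($index) .
                    ' -w ' . escapeshellarg($query);
    exec($cmd, $output);
    foreach ($output as $line) {
        echo htmlspecialchars($line) . "<br>\n";   // show each result line
    }
}
?>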
Note: You can view the model answer in action on the course website.
Databases
The search tools from the two previous activities (4.9 and 4.10) are
applicable to websites that contain only static, pre-built pages. However,
there are now many websites which draw their content from databases in
real-time. In this section, we will discuss how full-text indexing can be
implemented in database-generated pages.
Databases will be covered more extensively in Unit 5, but here's a brief
overview. You can think of databases as a collection of information
organized in such a way that a computer program can quickly select
desired pieces of data (Webopedia.com). Databases can be accessed via
Structured Query Language (SQL), which is basically a standard way of
issuing queries against a database without any need to know what its
underlying structure is.
Right now, ABC Books' website consists of static pages only. This
means that whenever changes are made to their book catalogue (e.g. title,
author or price updates), the same change must also be made to all
webpages where that information is found. This method may be feasible
while their product catalogue consists of only 25 titles, but what if
their collection expands beyond a hundred titles? Creating and updating
all these pages by hand will definitely be a time-consuming and
potentially error-prone exercise.
One alternative is to store ABC Books' catalogue in a database. Here's a
basic list of fields which could make up a record in their book catalogue
table:
1
page title: the title of the webpage (e.g. 'The Art of War by Sun
Tzu', or 'The Life of Jane Austen'); and
page content: the book summary (if this page is for a book) or the
author's biography (if this page is for an author).
Reading 4.8
Using mySQL full-text searching,
http://www.zend.com/zend/tut/tutorial-ferrara1.php.
Note: The section on 'Advanced boolean searching' is optional.
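The reading shows the PHP side in detail; on the SQL side, the essence is a FULLTEXT index and a MATCH ... AGAINST query. Here is a sketch: the table name Book_Catalog appears later in this unit, but the column names and the sample keyword are assumptions for illustration.

CREATE TABLE Book_Catalog (
  id           INT AUTO_INCREMENT PRIMARY KEY,
  page_title   VARCHAR(255),
  page_content TEXT,
  FULLTEXT (page_title, page_content)  -- enables MATCH ... AGAINST searches
);

SELECT page_title
FROM Book_Catalog
WHERE MATCH (page_title, page_content) AGAINST ('tarzan');

By default, MySQL's full-text search ignores very short words and words that occur in more than half of the rows, so searches on a very small test catalogue may return fewer matches than you expect.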
Now that you've seen how full-text searching can be done against a
relational database using a PHP script, let's try this out for ourselves in
the next activity. You will need to have MySQL installed on your local
Web server before you proceed.
Activity 4.11
1
Log in to MySQL:
$ mysql -u root
Verify that the database and table were created successfully. We will
use the SELECT SQL command to view the records loaded into the
table:
$ mysql ABCBooks
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 54 to server version: 3.23.58
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> select * from Book_Catalog;
Test your full-text search and fix any problems you find. You can
view the model answer in action on the course website.
Summary
The World Wide Web has brought about dramatic changes in the way
people access information. Information resources that were formerly
locked away in libraries, CD-ROMs, print journals and proprietary
databases can now be searched conveniently and efficiently by anyone
with a desktop computer, a Web browser and an Internet connection.
In this unit, you learned that creating an index for the Web is much more
difficult than creating an index for a library. The main problem in
indexing the Web is that its contents are open-ended. No one knows
exactly how many webpages there are on the Web or exactly when a
page will cease to exist.
You were introduced to the components of search engines, namely the
robot, the indexer, the query system and the retrieval system. We then
looked at the different characteristics of robots as well as the
mechanisms used by website owners to communicate with robots.
You also studied the so-called deep Web, the portion of the Web that is
not covered by search engines. Examples are audio, movies, framed
pages, password-protected pages, dynamic websites and online
databases. We also discussed ways to make our searches more effective
by including content from both the surface Web and the deep Web.
Finally, we discussed how to implement local search facilities on our
own websites. You learned that there are two ways of doing this: (1)
installing and running your own local search engine; or (2) having your
pages indexed by a remote server. You were presented with the pros and
cons of the two methods. You then implemented a search service for your
ABC Books website using both approaches.
Feedback to activities
Activity 4.1
Topic searched on Yahoo and Google: hiking and trekking vacations
Activity 4.2
1
Activity 4.3
A breadth-first search will gather these five pages: index.html
A breadth-first search will result in all the author names and book titles
getting indexed, because the spider can extract them from the Authors
and Books pages, respectively. A depth-first search will index the book
titles only, and so users who search on author names via search engines
may not find ABC Books website.
Activity 4.5
Here are the pages listed in each search engine:
For hkupress.org, Google has 46 results while AltaVista has 17.
For paddyfields.com, Google has one result while AltaVista has none.
Activity 4.6
1
Books > Bookstores
Activity 4.8
1
It's better to use this if the data is not open to the public: local.
You can have more control over the indexing process: local.
Self-test 4.2
1
Here are just a few of the guidelines that responsible robot designers
should abide by:
make a list of places that should not be visited before starting the
traversal;
remember the places that have already been visited so the same
page is not retrieved multiple times;
scan the URLs and verify that the spider can handle the content
types on these pages before visiting them; and
Self-test 4.3
Size of the index: how many documents are included in the index,
and what percentage of the total Web is it able to search?
Self-test 4.4
Some situations may result in over- or under-counting when popularity is
measured by audience reach.
A unique visitor is actually a single computer used to access a search
engine, which leads to the following situations:
If the same person uses a search engine from two different computers
(e.g. home and work), they are counted twice.
Self-test 4.5
1
add META tags to your pages containing the title, keywords and
description;
A good way to build links to your site is to find other websites which
are complementary to yours. You can look for these websites by
entering your target keywords in major search engines. You can
approach the owners of websites which appear in the top results and
establish a linking partnership with them. You can also establish
reciprocal links by agreeing to link back to the sites that link to yours.
Self-test 4.6
Here are some benefits and problems associated with various strategies
for ranking search results:
Self-test 4.7
Some examples of information on the invisible Web:
Framed pages: not all search engines can handle framed pages,
while some search engines index the contents of each frame within a
page separately. This may result in a framed page being returned out
of the context of its surrounding frameset.
References
Arasu, A, Cho, J, Garcia-Molina, H, Paepcke, A and Raghavan, S (2001)
Searching the Web, ACM Transactions on Internet Technology, 1(1):
2–43.
Barker, J (2003) What makes a search engine good?
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SrchEngCriteri
a.pdf.
Bergman, M K (2001) The deep Web: surfacing hidden value,
BrightPlanet, http://www.brightplanet.com/technology/deepweb.asp.
BrightPlanet (2004) Tutorial guide to effective searching of the
Internet,
http://www.brightplanet.com/deepcontent/tutorials/search/index.asp.
ComputerWire (2002) Fast search claims Google's size crown, June 18,
http://www.theregister.co.uk/content/23/25762.html.
GVU Center, College of Computing Georgia Institute of Technology
(1998) GVU's 10th World Wide Web survey,
http://www.gvu.gatech.edu/user_surveys/survey-1998-10/.
Lawrence, S and Giles, C (1998) How big is the Web? How much of the
Web do search engines index? How up to date are these search engines?,
NEC Research Institute, http://www.neci.nj.nec.com/homepages/
Lawrence/websize.html.
Lawrence, S and Giles, C (1999a) Accessibility and distribution of
information on the Web, Nature, 400: 107–9, http://www.metrics.com/.
Lawrence, S and Giles, C (1999b) Searching the Web: general and
scientific information access, IEEE Communications, 37(1): 116–22,
http://www.neci.nec.com/~lawrence/papers/search-ieee99/.
Netcraft (2003) Netcraft November 2003 Web server survey,
http://news.netcraft.com/archives/2003/11/03/november_2003_web_serv
er_survey.html.
Nielsen, J (1997) Search and you may find, July 15,
http://www.useit.com/alertbox/9707b.html.
Search Tools, http://www.searchtools.com.
Sullivan, D (2002) Search engine features for Webmasters,
http://www.searchenginewatch.com/webmasters/article.php/2167891.