IS'(p) to refer to the estimated importance of page p, which is different from the actual importance IS(p) that is computable only after the entire Web has been crawled. If idf factors are not used, then IS'(p) = IS(p).
2. Backlink Count: The value of I(p) is the number of links to p that appear over the entire
Web. We use IB(p) to refer to this importance metric. Intuitively, a page p that is linked to
by many pages is more important than one that is seldom referenced. This type of citation
count has been used extensively to evaluate the impact of published papers. On the Web,
IB(p) is useful for ranking query results, giving end-users pages that are more likely to be of
general interest.
Note that evaluating IB(p) requires counting backlinks over the entire Web. A crawler may
estimate this value with IB'(p), the number of links to p that have been seen so far by the crawler.

3. PageRank: The IR(p) metric recursively defines the importance of a page to be the weighted sum of the importance of the pages that point to it, so that a link from an important page counts more heavily than a link from an obscure one. As before, we use IR'(p) for the estimated value of IR(p) when we have only a subset of pages available.
More formally, if a page has no outgoing link, we assume that it has outgoing links to every
single Web page. Next, consider a page p that is pointed at by pages t_1, . . . , t_n. Let c_i be the number of links going out of page t_i. Also, let d be a damping factor (whose intuition is given below). Then, the weighted backlink count of page p is given by

    IR(p) = (1 - d) + d [ IR(t_1)/c_1 + ... + IR(t_n)/c_n ]
This leads to one equation per Web page, with an equal number of unknowns. The equations
can be solved for the IR values. They can be solved iteratively, starting with all IR values
equal to 1. At each step, the new IR(p) value is computed from the old IR(t_i) values (using the equation above), until the values converge. This calculation corresponds to computing the principal eigenvector of the link matrices; a short code sketch of this iteration is given after this list.
One intuitive model for PageRank is that we can think of a user surfing the Web, starting
from any page, and randomly selecting from that page a link to follow. When the user reaches
a page with no outlinks, he jumps to a random page. Also, when the user is on a page, there
is some probability, d, that the next visited page will be completely random. This damping
factor d makes sense because users will only continue clicking on one task for a finite amount
of time before they go on to something unrelated. The IR(p) values we computed above give
us the probability that our random surfer is at p at any given time.
4. Forward Link Count: For completeness we may want to consider a metric IF(p) that counts
the number of links that emanate from p. Under this metric, a page with many outgoing links
is very valuable, since it may be a Web directory. This metric can be computed directly from
p, so IF'(p) = IF(p). This kind of metric has been used in conjunction with other factors to reasonably identify index pages [PPR96]. We could also define a weighted forward link metric, analogous to IR(p), but we do not consider it here.
5. Location Metric: The IL(p) importance of page p is a function of its location, not of its
contents. If URL u leads to p, then IL(p) is a function of u. For example, URLs ending with
".com" may be deemed more useful than URLs with other endings, or URLs containing the string "home" may be of more interest than other URLs. Another location metric that is
sometimes used considers URLs with fewer slashes more useful than those with more slashes.
All these examples are local metrics since they can be evaluated simply by looking at the
URL u.
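A minimal sketch of the PageRank iteration described in item 3, written in Python. This is our own illustration, not code from the paper: the function name compute_pagerank, the dictionary-of-outgoing-links representation, and the convergence tolerance are all assumptions.

    def compute_pagerank(outlinks, d=0.9, tol=1e-6, max_iters=100):
        # outlinks: dict mapping each page to the list of pages it links to
        pages = list(outlinks)
        n = len(pages)
        ir = {p: 1.0 for p in pages}          # start with all IR values equal to 1
        for _ in range(max_iters):
            # pages with no outgoing links are treated as linking to every page
            dangling = sum(ir[p] for p in pages if not outlinks[p])
            new_ir = {}
            for p in pages:
                # sum of IR(t_i)/c_i over the pages t_i that point to p
                incoming = sum(ir[t] / len(outlinks[t])
                               for t in pages if outlinks[t] and p in outlinks[t])
                new_ir[p] = (1 - d) + d * (incoming + dangling / n)
            converged = max(abs(new_ir[p] - ir[p]) for p in pages) < tol
            ir = new_ir
            if converged:
                break
        return ir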
As stated earlier, our importance metrics can be combined in various ways. For example, we may define a metric IC(p) = k_1 IS(p, Q) + k_2 IB(p), for some constants k_1, k_2. This combines the similarity metric (under some given query Q) and the backlink metric. Pages that have relevant content and many backlinks would be the highest ranked. (Note that a similar approach was used to improve the effectiveness of a search engine [Mar97].)
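As a small illustration, such a combination could be computed as below, assuming the IS(p, Q) and IB(p) scores have already been normalized to comparable ranges; the weights and names here are hypothetical, not taken from the paper.

    def combined_importance(similarity, backlinks, k1=0.5, k2=0.5):
        # similarity: page -> IS(p, Q); backlinks: page -> normalized IB(p)
        # IC(p) = k1 * IS(p, Q) + k2 * IB(p)
        return {p: k1 * similarity[p] + k2 * backlinks[p] for p in similarity}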
3 Problem definition
Our goal is to design a crawler that, if possible, visits high I(p) pages before lower ranked ones, for some definition of I(p). Of course, the crawler will only have available I'(p)
(2) Backlink count IB'(p):
    foreach u in url_queue
        backlink_count[u] = number of terms (*, u) in links
    sort url_queue by backlink_count[u]

(3) PageRank IR'(p):
    solve the following set of equations:
        IR[u] = (1 - 0.9) + 0.9 Σ_i IR[v_i]/c_i , where
        (v_i, u) ∈ links and c_i is the number of links in the page v_i
    sort url_queue by IR(u)

Figure 2: Description of reorder_queue() of each ordering metric
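For concreteness, the backlink-count version of reorder_queue() in Figure 2 might be rendered in Python roughly as follows. This is our own sketch, not the paper's implementation; it assumes the links seen so far are kept as (source, destination) pairs, the URL queue is a Python list, and the queue should be sorted with the most-referenced URLs first.

    def reorder_queue_backlink(url_queue, links):
        # backlink_count[u] = number of pairs (*, u) in links
        backlink_count = {u: 0 for u in url_queue}
        for _, dst in links:
            if dst in backlink_count:
                backlink_count[dst] += 1
        # sort url_queue by backlink_count[u], most backlinks first (in place)
        url_queue.sort(key=lambda u: backlink_count[u], reverse=True)

The PageRank version of Figure 2 would be analogous, sorting the queue by IR values computed as in the iteration sketched earlier (with d = 0.9).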
Figure 3: Fraction of Stanford Web crawled vs. P_ST, for G = 3, 10, and 100 (experiment and ideal crawlers). I(p) = IB(p); O(u) = IB'(p).
Figure 4: Fraction of Stanford Web crawled vs. P_ST, for the ordering metrics O(u): breadth-first, backlink, PageRank, random, and ideal. I(p) = IB(p); G = 100.
The solid lines in Figure 3 show the results from our experiments. For example, when the crawler in our experiment visited 0.2 (20%) of the Stanford pages, it crawled 0.5 (50%) of the total hot pages for G = 100. The dashed lines in the graph show the expected performance of ideal crawlers. An ideal crawler reaches performance 1 when H pages have been crawled. The dotted line represents the performance of a random crawler, and it increases linearly over time.
The graph shows that as our definition of a hot page becomes more stringent (larger G), the crawler can locate the hot pages faster. This result is to be expected, since pages with many backlinks will be seen quickly after the crawl starts. Figure 3 also shows that even if G is large, finding the last group of hot pages is always difficult. That is, to the right of the 0.8 point on the horizontal axis, the crawler finds hot pages at roughly the same rate as a random crawler.
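For reference, the P_ST measure as used above, the fraction of hot pages (pages with importance at least G) that the crawler has visited so far, can be sketched as follows; the function name and arguments are ours, not the paper's.

    def performance_st(crawled, importance, G):
        # crawled: pages visited so far; importance: page -> I(p); G: hot-page threshold
        hot = {p for p, imp in importance.items() if imp >= G}
        return len(hot & set(crawled)) / len(hot) if hot else 1.0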
In our next experiment we compare three different ordering metrics: 1) breadth-first, 2) backlink-count, and 3) PageRank (corresponding to the three functions of Figure 2). We continue to use the Crawl & Stop with Threshold model, with G = 100 and an IB(p) importance metric. Figure 4 shows
the results of this experiment. The results are rather counterintuitive. Intuitively one would expect that a crawler using the backlink ordering metric IB'(p) would perform best when the importance metric is also IB(p); instead, the PageRank ordering metric IR'(p) outperforms the IB'(p) one. To understand why, we manually traced the crawler's operation. We noticed that often the IB'(p) crawler behaved like a depth-first one, frequently visiting pages in one cluster before moving on to the next. On the other hand, under the IR'(p) ordering (illustrated by the crawling order of Figure 5), page p_2 may have higher rank (because its link comes from a high-ranking page) than the pages in cluster A (which only have pointers from low-ranking pages within the cluster). Therefore, page p_2 is reached faster.
In summary, during the early stages of a crawl, the backlink information is biased by the starting point. If the crawler bases its decisions on this skewed information, it tries to get locally hot pages instead of globally hot pages, and this bias gets worse as the crawl proceeds. On the other hand,
Figure 5: Crawling order (clusters A and B, pages p_1 and p_2)
Figure 6: Fraction of Stanford Web crawled vs. P_CS, for the ordering metrics O(u): breadth-first, backlink, PageRank, random, and ideal. I(p) = IB(p).
the IR'(p) PageRank crawler is not as biased towards locally hot pages, so it gives better results
regardless of the starting point.
Figure 6 shows that this conclusion is not limited to the Crawl & Stop with Threshold model. In the figure we show the performance of the crawlers under the Crawl & Stop model (Section 3). Remember that under the Crawl & Stop model, the definition of hot pages changes over time. That is, the crawler does not have a predefined notion of hot pages; instead, when the crawler has visited, say, 30% of the entire Web, it considers the top 30% of pages as hot pages. Therefore, an ideal crawler would have performance 1 at all times, because it would download pages in the order of their importance. Figure 6 compares 1) breadth-first, 2) backlink, and 3) PageRank ordering metrics for the IB(p) importance metric under this model. The vertical axis represents P_CS, the crawled fraction of hot pages at each point under the varying definition of hot pages. The figure shows that the results of the Crawl & Stop model are analogous to those of the Crawl & Stop with Threshold model: the PageRank ordering metric shows the best performance.
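The P_CS measure under the Crawl & Stop model, as described above, can be sketched similarly: after k pages have been crawled, the k most important pages are taken as the hot set. Again, the function name and arguments are our own, not the paper's.

    def performance_cs(crawled, importance):
        # crawled: pages visited so far; importance: page -> I(p)
        k = len(crawled)
        if k == 0:
            return 1.0
        top_k = sorted(importance, key=importance.get, reverse=True)[:k]
        return len(set(top_k) & set(crawled)) / k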
Returning to the Crawl & Stop with Threshold model, Figure 7 shows the results when we use the IR(p) PageRank importance metric with G = 13. (With G = 13 for the IR(p) metric, the number of hot pages was about 1,700 (1%), which is close to the 1,400 of Figure 4.) Again, the PageRank ordering metric shows
Figure 7: Fraction of Stanford Web crawled vs. P_ST, for the ordering metrics O(u): breadth-first, backlink, PageRank, random, and ideal. I(p) = IR(p); G = 13.
Figure 8: Fraction of DB group Web crawled vs. P_ST, for the ordering metrics O(u): breadth-first, backlink, PageRank, random, and ideal. I(p) = IB(p); G = 5.
the best performance. The backlink and the breadth-first metrics show similar performance. Based
on these results, we recommend using the PageRank ordering metric for both the IB(p) and the
IR(p) importance metrics.
5.3 Small-scale crawl
In some cases the crawler's client may only be interested in small portions of the Web. For instance, the client may be interested in a single site (to create a mirror, say). In this subsection we evaluate
the ordering metrics in such a scenario.
To study the impact of scale on the performance of a crawler we ran experiments similar to
those in Section 5.2, only on the pages of the Stanford Database Group (on the server http://www-db.stanford.edu). This subset of the Stanford pages consists of about 1,100 HTML pages,
which is much smaller than the entire Stanford domain. In most of our experiments for the Database
Group domain, crawling performance was not as good as for the Stanford domain. Figure 8 shows
Figure 9: Histogram of backlink counts within DB group Web (number of pages vs. number of backlinks)
one of the results. In this case, we use the Crawl & Stop with Threshold model with G = 5 and the importance metric IB(p). The graph shows that performance can be even worse than that of
a random crawler at times, for all ordering metrics.
This poor performance is mainly because an importance metric based on backlinks is not a
good measure of importance for a small domain. In a small domain, most pages have only a small number of backlinks, and the number of backlinks is therefore very sensitive to a page creator's
style. For example, Figure 9 shows the histogram for the number of backlinks in the Database
Group domain. The vertical axis shows the number of pages with a given backlink count. We can
see that most pages have fewer than 5 backlinks. In this range, the rank of each page varies greatly
according to the style used by the creator of the page. If the creator generates many cross-links
between his pages, then his pages have a high IB(p) rank; otherwise they do not. Therefore, the
rank is very sensitive and is not a good measure of the importance of the pages.
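A backlink histogram like the one in Figure 9 can be tallied directly from the link structure. The short sketch below is our own illustration, assuming the domain's links are available as (source, destination) pairs; it is not code from the paper.

    from collections import Counter

    def backlink_histogram(pages, links):
        # backlink count per page within the domain
        backlink_count = Counter(dst for _, dst in links if dst in pages)
        # map: number of backlinks -> number of pages with that many backlinks
        return Counter(backlink_count.get(p, 0) for p in pages)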
In Figure 8 we can see the impact of locally dense clusters that have many locally popular
but globally unpopular pages. The performance of the backlink IB