Web Usage Mining On Proxy Servers: A Case Study
January 2001
Koen Vanhoof (Hasselt University) and two co-authors
Abstract
Web Usage Mining is an aspect of data mining that has received a lot of attention in recent
years. Commercial companies as well as academic researchers have developed an extensive
array of tools that run data mining algorithms on log files from web servers in order to
identify user behaviour on a particular web site. This kind of investigation can provide
information that can be used to better accommodate users' needs. An area that has received
much less attention is the investigation of user behaviour on proxy servers. The servers of
Internet Service Providers (ISPs) log traffic from thousands of users to thousands of web
sites. Web server administrators are no doubt interested in comparing the performance of
their own site with that of competitors. Moreover, this kind of research can give a general
overview of user behaviour on the Internet, or of behaviour within a specific sector.
A Belgian ISP showed interest in the subject and provided data from one of its proxy
servers for a thesis. This paper summarizes that thesis, with an emphasis on the results
obtained. The ISP chose to remain anonymous because of privacy issues.
1 Introduction
The Internet is generally said to have become available to a large public around 1994–1995. Since
that time a great number of companies have embraced this new medium. In the beginning many
entrepreneurs saw great new opportunities to make money by setting up an internet company.
Later on, some of the so-called brick-and-mortar companies began to see a need to go online.
Some of those even changed their business so drastically that not much of the original company
was left. By now, most large companies have spent considerable effort and money to develop a
well-established web site. The ones that have not may find themselves faced with strategic
disadvantages in years to come.
In order to have a successful web site (and especially a successful e-commerce site) it is crucial
to know the users of that site. This need has given rise to a whole new field in research, called
Web Usage Mining. It is commonly seen as a subdivision of Web Mining, which implies that
data mining techniques are applied to data from the World Wide Web. When the data under
consideration come from web server log files, we enter the field of web usage mining, which is
therefore the "automatic discovery of user access patterns from Web servers" [8].
Because it is so important to know one's customers in order to better suit their needs, companies
are willing to spend money on the analysis of their log files. As a consequence, apart from tools
developed by academic researchers, a significant number of commercial tools have been developed
to meet these needs. Examples of academic tools include WebSIFT [1] and Web Utilization
Miner [10]; for a more extensive overview, see [3]. An example of a commercial web usage mining
tool is EasyMiner, developed by MINEit Software Ltd. All of these tools are designed to
understand the most common log file formats, so that the process requires very little
preprocessing. Unfortunately, when analyzing a log file from a Web server, one can only analyze
browsing behaviour on a single site.
To perform research on a sector or even on general browsing behaviour, the log file data of a
proxy server are a lot more appropriate because of the many-to-many relationship between sites
and users. This topic will be further elaborated in the next sections. Section 2 will give a short
introduction to data collection, preprocessing and data mining techniques in web usage mining.
Section 3 will introduce the concept of e-metrics. These are metrics applied to web usage data that
attempt to quantify the performance of web sites. Section 4 will introduce the data on which the
research was executed. After that, section 5 will describe the results of that research. Finally,
sections 6 and 7 will present some hints for further research and a conclusion.
2 Web Usage Mining

2.1 Data Collection

Data about the behaviour of a single user on a single site can be collected by means of JavaScript
or Java applets. Both methods require user participation in the sense that the user has to enable
their functionality. An applet has the additional problem that it may take some time to load the
first time. However, it has the advantage that it can capture all clicks, including presses of the
back or reload buttons. A script loads faster, but cannot capture all clicks.
A modified browser is situated in the second segment. It can capture the behaviour of a single
user over all visited web sites. Its advantages over Java applets and JavaScript are that it is much
more versatile and allows data collection about a single user over multiple Web sites [3]. That is
why this kind of data collection is used regularly by market research groups, e.g.
Nielsen//NetRatings, in order to collect information on how certain user groups behave online.
The third way of collecting data is at the Web server level. These servers explicitly log all
user behaviour in a more or less standardized fashion, generating a chronological stream of
requests that come from multiple users visiting a specific site. Since Web servers keep a record
of these requests anyhow, this information is readily available. Sometimes an analyst will use
additional information to better identify users, such as information from cookies or socio-
demographic information about the users that may have been collected. This kind of data collection
also has a number of drawbacks. Like JavaScript, it cannot capture page views that were generated
by pressing the back or reload buttons. Apart from that, it also cannot log page views served from
a cache, either the local cache on the user's computer or the cache of an ISP's proxy server.
The fourth level of data collection logs the behaviour of multiple users visiting multiple Web
sites. This kind of information can be found in log files originating from proxy servers. These
servers are used by ISPs to give customers access to the World Wide Web. They also function as
cache servers: they keep recently requested pages and, if the same request is made by another
user shortly afterwards, they send the cached page to that user instead of requesting it once
more from the Web server where the page is located.
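As an aside, the exact shape of a proxy log depends on the server software. The sketch below assumes a Squid-style access log line (the field layout is an assumption, not something the paper specifies) and extracts the fields needed for usage mining:

```python
# Minimal sketch: parsing one Squid-style proxy access log line.
# The field layout (timestamp, elapsed ms, client IP, status, bytes,
# method, URL, ...) is an assumption; real deployments vary.
from urllib.parse import urlparse

def parse_proxy_line(line):
    """Split a Squid-style log line into the fields used for mining."""
    fields = line.split()
    return {
        "timestamp": float(fields[0]),       # Unix time of the request
        "client_ip": fields[2],              # identifies the (anonymous) user
        "status": fields[3],                 # e.g. TCP_HIT/200 vs TCP_MISS/200
        "url": fields[6],
        "site": urlparse(fields[6]).netloc,  # the Web site being visited
    }

hit = parse_proxy_line(
    "978307200.123 87 10.0.0.5 TCP_HIT/200 4512 GET http://www.tijd.be/markten - NONE/-"
)
print(hit["site"])  # www.tijd.be
```

Note that TCP_HIT lines record exactly the cache hits that never reach the origin Web server, which is why they are invisible in Web server logs.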
2.2 Preprocessing
Preprocessing is an aspect of data mining whose importance should not be underestimated.
If this phase is not performed adequately, the mining algorithms cannot provide reliable results.
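The central preprocessing step for this kind of data is cutting the chronological stream of hits into user sessions. A minimal sketch, assuming each hit is a (client IP, timestamp in seconds, site) tuple and using a 30-minute inactivity timeout (a common heuristic, not a value prescribed by the paper):

```python
def sessionize(hits, timeout_minutes=30):
    """Group (client_ip, timestamp, site) hits into per-user sessions.
    A gap longer than the timeout starts a new session for that client."""
    sessions = []
    last_seen = {}  # client_ip -> (index of open session, last timestamp)
    timeout = timeout_minutes * 60
    for ip, ts, site in sorted(hits, key=lambda h: (h[0], h[1])):
        if ip in last_seen and ts - last_seen[ip][1] <= timeout:
            idx = last_seen[ip][0]
            sessions[idx].append(site)     # continue the open session
        else:
            sessions.append([site])        # start a new session
            idx = len(sessions) - 1
        last_seen[ip] = (idx, ts)
    return sessions

sessions = sessionize([
    ("10.0.0.5", 0, "www.tijd.be"),
    ("10.0.0.5", 600, "finance.yahoo.com"),
    ("10.0.0.5", 3000, "www.vrt.be"),      # 40 minutes later: new session
])
print(sessions)  # [['www.tijd.be', 'finance.yahoo.com'], ['www.vrt.be']]
```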
2.3.4 Clustering
In general, clustering is a process of creating a partition so that all the members of each set
of the partition are similar according to some metric [2]. In web usage mining, we can narrow
the definition to a technique that groups users into clusters based on their common characteristics.
Clustering algorithms learn in an unsupervised way: they discover their own classes and subsets
of related objects in the training set, and then have to find descriptions that characterize each
of these subsets.
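To make the idea concrete: the study itself uses a Kohonen map (see section 5.4), but the grouping principle can be illustrated with a tiny k-means over binary session vectors. All data below are invented; each vector records which of four hypothetical sites a session visited:

```python
def kmeans(vectors, k=2, iters=10):
    """Tiny k-means: repeatedly assign each vector to its nearest centroid,
    then move each centroid to the mean of its members. Illustrative only."""
    centroids = [list(map(float, v)) for v in vectors[:k]]  # naive init
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            assign[i] = dists.index(min(dists))
        for j in range(k):
            members = [v for i, v in enumerate(vectors) if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Columns: [tv1.be, vtm.be, nasdaq.com, finance.yahoo.com] (hypothetical)
labels = kmeans([[1, 1, 0, 0],   # TV-heavy session
                 [0, 0, 1, 1],   # finance-heavy session
                 [1, 0, 0, 0],
                 [0, 1, 1, 1]])
print(labels)  # [0, 1, 0, 1]: TV sessions vs financial sessions
```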
2.3.5 Classification
Contrary to clustering, classification is a supervised way of learning. The database contains one
or more attributes that denote the class of a tuple; these are known as predicted attributes,
whereas the remaining attributes are called predicting attributes. A combination of values of the
predicted attributes defines a class [2]. In the Web domain one is interested in developing a
profile of users belonging to a particular class or category, for example: 45% of users who visit
two or more television station sites in a single session are younger than 21. Algorithms that
perform classification include decision tree classifiers, Bayesian classifiers, k-nearest neighbour
classifiers, etc.
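Of these, the k-nearest-neighbour approach is simple enough to sketch in full. The session vectors (counts of TV and financial sites visited) and the class labels below are invented for illustration:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Label a session by the majority class among its k closest
    training sessions (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Vectors: [# TV sites visited, # financial sites visited] per session.
train = [([2, 0], "tv-viewer"), ([3, 1], "tv-viewer"),
         ([0, 4], "investor"), ([1, 5], "investor"), ([0, 3], "investor")]
print(knn_predict(train, [0, 4]))  # investor
```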
3 E-metrics
E-metrics are based on statistics, which is itself a data mining technique; they can therefore be
considered a web usage mining method like any other. Like the other methods, they try to gain
insight into the browsing behaviour of users and the performance of Web sites.
E-metrics are measures with which Web sites can be evaluated. They can be compared with the
regular metrics and ratios used in traditional industry, such as return on investment, net profit,
market share, profitability, etc. As Web sites gain a more important position in companies, a need
emerges to evaluate these Web sites (which consume more and more money) and quantify their
performance. The intention is to give an indication of how well the Web site performs, in order to
investigate how these measures change over time and how they compare to those of competitors.
Two kinds of e-metrics can be identified, those that can be applied to every Web site and those
that were designed for a specific kind of Web site, very often e-commerce sites.
3.1.1 Stickiness
This is probably one of the most widely used e-metrics. It is a composite metric that indicates
how effectively the content of a page or Web site keeps the attention of the user. In general it
is assumed that sticky sites are better than less sticky ones. A possible formula is as follows:
Stickiness = Frequency ∗ Duration ∗ Total site reach

where, for a given time period T,

Frequency = number of visits during T / number of unique users who visited during T
Duration = total time spent viewing all pages during T / number of visits during T
Total site reach = number of unique users who visited during T / total number of unique users

Multiplying out, the intermediate terms cancel:

Stickiness = total time spent viewing all pages during T / total number of unique users

so that one doesn't need to have all the data for the complete formula to calculate stickiness.
Usually stickiness is expressed in minutes per user.
It is impossible to suggest an ideal value for this metric. For entire sites, the value should
usually be as high as possible. For individual pages, it depends on the nature of that page. A
navigation page should have a low value, which means that users easily find their way, while
content pages should have a higher value.
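The cancellation behind the simplified stickiness computation can be checked numerically; all figures below are hypothetical:

```python
# Hypothetical monthly figures for one site (illustrative numbers only).
visits = 500
unique_users = 200            # unique users who visited in the period
total_users = 800             # total user base
total_viewing_seconds = 60000

frequency = visits / unique_users          # visits per visiting user (2.5)
duration = total_viewing_seconds / visits  # seconds per visit (120.0)
reach = unique_users / total_users         # fraction of the base reached (0.25)

stickiness = frequency * duration * reach  # seconds per user
assert stickiness == total_viewing_seconds / total_users  # the terms cancel
print(stickiness / 60, "minutes per user")  # 1.25 minutes per user
```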
4 The Data
The data used for research purposes in this work were offered to us by a Belgian ISP that chose
to remain anonymous because of privacy issues. The data come from a proxy server that handles
requests from users with a broadband internet connection.
4.2 Preprocessing
Before any actual data mining algorithms can be applied, the data need to be preprocessed so
that they can serve as input to the various algorithms. As mentioned in the introduction, most
tools designed for conventional Web Usage Mining on Web servers perform this preprocessing
automatically. In this case, however, this part of the process had to be executed manually. The
advantage is that we have more control over the way it is done.
5 Research
To illustrate what is possible by examining ISP log files, we decided to examine browsing
behaviour within a certain sector. The sector chosen for this purpose is a collection of several
sub-sectors. Five sub-sectors are distinguished: Newspapers, Banks, Finance, Television and
Radio. By Finance, we mean Web sites on which financial content (about certain stocks and
shares) can be found. Altogether this is a list of fifty Web sites: 7 newspaper sites, 9 bank
sites, 17 financial sites, 13 television sites and 4 radio sites. The Web sites are listed per
sector in figure 2. Only sites that produced a minimum of 120 hits in the original file were
admitted to this list; admitting Web sites with fewer total hits would probably have a negative
influence on the reliability of the results. This sector was chosen firstly because it is diverse
and secondly because it is frequently visited.
Several data mining techniques will be used to discover interesting browsing patterns within this
sector. The actual data mining techniques are association rules, sequence analysis and clustering.
Apart from those we will start with a visualization technique and end with a few e-metrics that
will prove to be highly interesting.
Some of these techniques required additional preprocessing. We therefore created a database in
which each line represents a session. For each of the 50 Web sites, a column was created, so
every cell expresses whether or not a specific Web site was visited in a specific session. The
other file (from which this new file was extracted, and which will be used for sequence analysis)
lists a sequence of hits in which a certain Web site may appear a number of times, with only the
time stamp changing.
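Deriving that session table from session lists takes only a few lines. A sketch, with a three-site list standing in for the fifty sites of the study:

```python
SITES = ["www.tijd.be", "finance.yahoo.com", "www.vtm.be"]  # fifty in the study

def session_matrix(sessions):
    """One row per session, one 0/1 column per site: was the site visited?
    Repeat hits within a session collapse to a single 1."""
    return [[1 if site in set(sess) else 0 for site in SITES]
            for sess in sessions]

rows = session_matrix([["www.tijd.be", "finance.yahoo.com", "www.tijd.be"],
                       ["www.vtm.be"]])
print(rows)  # [[1, 1, 0], [0, 0, 1]]
```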
5.1 Visualization
Several data mining tools offer the possibility to represent data visually, which is a very good
way to get an indication of strong relationships between several Web sites. In this work, we have
made use of the visualization node in SPSS Clementine, the so-called web node. It simply counts
the number of times two Web sites appear together in the sessions and then generates a graphic
that draws links between all the sites that appear together: the more often two sites appear
together, the thicker the line between them. The overall picture looks like figure 3. Not all
links are shown, however; this is merely to lay the emphasis on the strongest links.
A number of these links can be identified as not relevant. Visitors of www.aslk.be and
www.cera.be are automatically redirected to www.fortisbank.com and www.kbc.be respectively,
because the two former banks have merged with the latter ones. The links between several Yahoo!
sites and several Nasdaq sites are also predictable. A way of deleting these links is to replace
one site with the other at the data cleansing stage. However, in the section about sequence
analysis we will see that some users also try to browse in the direction opposite to the redirect,
which can be interesting. For example, some users want to visit CERA after having visited KBC.
[Figure 3: web node graph of co-occurrence links between sites such as KbcBe, CeraBe, AslkBe,
FortisbankCom, DexiaBe, NetbankingDexiaBe, TijdBe, StandaardBe, GvaBe, HbvlBe, Tv1Be, VtmBe,
Vt4Be, VrtBe, QuoteYahooCom, BizYahooCom, FinanceYahooCom, NasdaqCom, QuotesNasdaqCom,
QuotesNasdaqAmexCom and HetBeleggersNet]
Apart from those, some other links are indeed quite interesting. There are quite strong links
between the Web sites of VTM, TV1, VRT and VT4, all of which are television stations. Apparently,
visitors often visit several TV sites together. The same goes for the newspaper sector:
www.gva.be is frequently visited together with www.standaard.be and www.hbvl.be, and
www.standaard.be also shows a link with www.tijd.be. This last newspaper is a special case: it
has more links with financial Web sites than with newspapers. The reason is that this newspaper
(De Financieel-Economische Tijd, comparable to the Financial Times) mainly offers financial news.
Nevertheless, this is quite interesting information for this Web site: it should consider
financial sites such as Nasdaq and Easdaq as its competitors, rather than newspaper sites.
Figure 4 focuses on the links of this individual Web site and makes the observation even clearer.
As a conclusion, it can be stated that the visualization technique is a handy way of quickly
discerning strong links between Web sites.
[Figure 4: links of TijdBe with FinanceYahooCom, QuotesNasdaqCom, NasdaqCom, HetBeleggersNet,
ForumBeleggersNet, BeursBe and AexNl]
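The counting that the web node performs can be reproduced directly: for every pair of sites, count the sessions in which both occur; the largest counts correspond to the thickest links. A minimal sketch with invented sessions:

```python
from itertools import combinations
from collections import Counter

def cooccurrence(sessions):
    """Count, for each unordered pair of sites, the number of sessions
    in which both appear."""
    pairs = Counter()
    for sess in sessions:
        for a, b in combinations(sorted(set(sess)), 2):
            pairs[(a, b)] += 1
    return pairs

counts = cooccurrence([
    ["www.kbc.be", "www.cera.be"],
    ["www.kbc.be", "www.cera.be", "www.tijd.be"],
    ["www.tijd.be", "finance.yahoo.com"],
])
print(counts[("www.cera.be", "www.kbc.be")])  # 2
```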
5.4 Clustering
The fourth technique tries to group sessions into clusters in such a way that the clusters show
as much intra-cluster similarity and as little inter-cluster similarity as possible. To do this,
we made use of the Kohonen algorithm in SPSS Clementine. The best results were generated when
both the X and the Y axis were set to 3, so that the algorithm identified 9 clusters.
The file that was used as an input for the algorithm, is the same file that was used for the
visualization technique and the association rule algorithms. Based on their X and Y coordinates,
the results were divided into nine tables to get an overview of the clusters. These tables were
then transformed into tables that indicate, for each session, how many Web sites of each
subsector were visited. For each cluster, column totals were calculated so that one can easily
see how many times a certain sector was visited in that cluster. This made it possible to compare
the clusters with each other. Because a full list of each cluster would be too extensive,
figure 7 includes only the totals of each cluster.
We will now briefly describe each cluster (not in the same order as in figure 7, though).

Cluster  Financial  Bank  TV   Radio  News  # of sessions
0–0      30         143   25   5      93    157
0–1      1          5     25   18     9     32
0–2      1          3     223  23     8     108
1–0      13         0     1    0      0     7
1–1      23         0     0    0      0     11
1–2      9          2     6    1      3     9
2–0      164        1     9    2      17    80
2–1      38         3     1    0      19    19
2–2      170        14    9    8      100   114

² It is a newspaper, but the Web site has more similarities with financial Web sites.

Cluster 0–2 is quite large and the emphasis is evidently on the TV sector. Hits in other sectors
are merely present because some users visit other sectors apart from this one. We will therefore
call this cluster the TV cluster.
Cluster 0–1 is a lot smaller than the previous cluster but is also very much focused on the TV
sector. However, in contrast to 0–2, it contains a relatively high number of radio hits. This is
remarkable, since the radio sector is by far the smallest in the original file; here, however,
it is almost as important as the TV sector. We will therefore call this cluster the Radio
cluster. The reason for the large number of TV hits in this cluster is that many radio sites are
visited in sessions in which one or more TV stations are also visited.
Cluster 1–0 is very limited in number of sessions and is completely focused on financial sites.
A look at the full table of this cluster shows that it groups sessions in which the emphasis lies
on biz.yahoo.com and quotes.yahoo.com: each session contains at least one of the two, and 4 of
the 7 sessions contain both. Cluster 1–1 is similar and is concentrated on the financial Web
sites forum.beleggers.net and het.beleggers.net; 10 of its 11 sessions contain both Web sites.
Cluster 2–0 is again quite large (80 sessions). It is very clear that this cluster is a cluster of
sessions that were focused on financial Web sites. Apart from visiting several financial sites, some
users also visited a Web site from another sector, which explains the other hits. We can safely call
this cluster the financial cluster.
Cluster 2–1 is very special in the sense that it groups 19 sessions that all contain www.tijd.be
and finance.yahoo.com (with only one exception). Apparently the relationship between those
two sites is so strong that the algorithm found a separate cluster for them.
Cluster 2–2 is related to the previous one because here the Web site of De Tijd once more plays
a special role. Each of the 114 sessions in this cluster contains this Web site, together with
one or more financial sites. Clearly, the Kohonen algorithm has divided the sessions in which De
Tijd is visited with another financial site into two separate clusters: one in which it occurs
together with Yahoo! and one in which it occurs with other financial Web sites.
The reason why cluster 1–2 was created is not very clear. It contains 9 sessions in which all
subsectors are more or less proportionally represented. It is probably not even a real cluster,
but a small group of sessions that the Kohonen algorithm didn't assign to any cluster.
Cluster 0–0 is the largest cluster and contains both a lot of bank-hits and newspaper-hits.
While analyzing the full table of this cluster, we were able to discover that hardly any of these
sessions contain Web sites of both sectors. It is therefore slightly strange that the algorithm hasn’t
created two clusters instead of one since there is hardly a link between the two sectors that are so
strongly represented in this cluster.
Even though the last cluster groups sessions that were focused on newspapers, it doesn’t contain
even once the Web site of De Tijd, which confirms the previously made conclusion that this site
is a financial, rather than a newspaper site.
It is remarkable that each cluster is focused on a specific subsector, with the exception of the
bank sector. Cluster 0–0 does group sessions that contained two or more bank sites, but a look
at the details of these sessions makes apparent that they were either generated by redirects from
one bank site to another, or by a user who wanted to perform some on-line transactions and
entered a specific Web site for that purpose after having visited the general Web site of his
bank. We can conclude that several competing banks are virtually never visited together in a
session. Users stick to their own bank.
5.5 E-metrics
The emphasis in this section will be on the measurement of stickiness and average duration. By
computing both measures for each Web site, we will be able to give an indication of user
behaviour on that particular Web site. We will then also find out whether or not there are
differences in behaviour among the subsectors.
Stickiness was calculated by dividing the total number of seconds that a Web site has been
visited by the total number of users. Average duration was calculated by dividing the total number
of seconds that a Web site has been visited by the total number of page views of that site. After
having computed these measures for each individual Web site, we calculated the average value for
each sector. The results of this can be found in table 1.
Sector Stickiness Average duration
Newspaper 3’ 9” 10.67 sec.
Financial 5’ 16” 6.18 sec.
Bank 6’ 01” 13.01 sec.
Television 2’ 10” 7.91 sec.
Radio 2’ 21” 5.78 sec.
Overall avg. 4’ 10” 8.66 sec.
Table 1: Average Stickiness (in minutes and seconds) and Average duration per sector (in seconds)
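Both measures follow directly from three per-site totals, as described above. A sketch with hypothetical totals (not the study's figures):

```python
def site_metrics(total_seconds, users, page_views):
    """Stickiness = seconds viewed per user;
    average duration = seconds per page view."""
    return total_seconds / users, total_seconds / page_views

# Hypothetical totals for one site:
stickiness, avg_duration = site_metrics(total_seconds=37800,
                                        users=200, page_views=3500)
print(round(stickiness), round(avg_duration, 2))  # 189 10.8
```

Sector averages like those in table 1 are then simply the means of these per-site values over the sites in each sector.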
As explained before, stickiness indicates to what extent a Web site (or a whole sector) is able
to keep the attention of its users: the higher the value, the better. Average duration is the
average amount of time for which a page on that Web site is viewed. Not every page should have a
high average duration, but we can safely assume that the overall average should be as high as
possible as well.
To obtain a better overview of the meaning of these figures, a graphic (figure 8) was constructed
that has Stickiness on its X-axis and Average Duration on its Y-axis.
The bank sector scores best on both e-metrics. One reason for this may be that among these
Web sites there are a few with which clients can perform on-line transactions. Since users always
do this with only one bank and because they take their time in doing so, both stickiness and
average duration are high.
Newspapers have a high average duration but a relatively low stickiness. The results of financial
sites are the opposite: a high stickiness but a low average duration. The differences in average
duration are easy to explain. While a user will take his time to read an article on a newspaper
site, on a financial site he will very often simply request a page to see the changes concerning
certain shares, which takes very little time. On the other hand, a financial site can hold the
attention of its users quite well. Even though they watch each page only for a limited amount of
time, they stick to the site for a relatively long time. This is not so in the newspaper sector. Here,
users watch a page for a long time, but in the end they only watch a few pages and leave early.
Both the TV and radio sectors show low scores on both measures. Users don't stay long on one
site and quickly jump from one to the other. This was also shown by the great number of
association rules in this sector.
It is useless for an individual Web site to compare itself with overall averages. It should take
the averages of its own sector into account when evaluating its performance. Unfortunately, these
averages are not readily available, and Web sites can therefore usually only evaluate themselves
by examining their e-metrics over time.

Figure 8: Stickiness — Average Duration
Knowing that the previous data mining methods showed that the Web site of De Tijd is usually
visited together with financial sites instead of newspapers, it is interesting to ask which
sector its browsing behaviour resembles most. Its stickiness is 3.01 minutes and its average
duration is 8.86 seconds. Remarkably, this much more resembles the averages of newspapers than
those of financial Web sites. The conclusion we can draw for De Tijd is that it is almost
invariably visited together with financial Web sites, but shows the browsing behaviour of a
newspaper Web site.
6 Further research
Even though section 5 showed some of the possibilities of Web Usage Mining on proxy servers and
offered some very interesting conclusions, much more is possible if the necessary data are
available.
First of all, the ISP that owns these data could do this kind of research on a regular basis to
investigate possible changes over time. The data examined here covered a period of only three
hours, so it was impossible to search for any evolution over time.
Proper user identification was very difficult in this work. It could be made much easier if these
data were linked to customer data: each ISP knows which customer used which IP address at any
given time, so users could be identified perfectly. Unfortunately, because of privacy matters,
it was impossible for us to obtain such data.
Most ISPs also have a more or less extensive database with socio-demographic data about their
customers, for example the customer's address, profession, age, number of family members, etc.
This could add a whole new dimension to this research: we could define classes of users based on
one or more of these socio-demographic attributes and analyze differences in behaviour between
those classes. We would also be able to find out which kinds of users visit certain Web sites or
sectors.
Another way of adding a demographic attribute to the data, without requiring access to the user
information, is to use a specific set of IP numbers for certain dial-in numbers. This way,
different areas (for example provinces) can be compared with one another. T-Online in Germany is
an ISP that uses this technique.
7 Conclusion
Very little research has been done about the possibilities of Web Usage Mining on proxy servers
(or cache servers). The intention of this work was to give an indication of what kind of information
can be extracted from the log files of these servers. Using several data mining and other techniques,
we have been able to draw a number of conclusions that could not have been found on another
level of data collection.
Every technique showed that it is useful to make a distinction between several sectors. Users
who visit several Web sites in a single session very often visit Web sites that can be considered
direct competitors. This statement applies especially in the TV, radio, newspaper and financial
sectors, but not in the bank sector: there, users are only interested in the Web site of their
own bank.
Despite the clear distinction between sectors, the sectors that the data mining techniques found
were slightly different from the ones we defined before the actual research. First of all, the
TV and radio sectors are in fact arguably one sector: there is hardly a difference between the
browsing behaviour of users who visit these sites, and the association rule algorithms made no
distinction between them. Only the clustering algorithm made a slight distinction: it created
one cluster that was completely focused on the TV sector, and another that contained a
relatively high number of radio sites. Yet even here the distinction was not very obvious.
Secondly, the Web site of De Financieel-Economische Tijd was at first allocated to the newspaper
sector. All techniques except the e-metrics showed that this site should be considered a
financial rather than a newspaper site. Nonetheless, the e-metrics (stickiness and average
duration) demonstrated that user behaviour on this Web site strongly resembles that of other
newspaper sites.
The combination of association rules and sequence analysis led to a number of interesting
conclusions for specific Web sites. Knowing which sites are usually visited together with your
own (and are therefore competitors), and whether the visit to your site comes first or last, can
be valuable information for the designers of a specific Web site, and it is information that can
only be deduced from this kind of log file.
Also, the comparison of sectors by means of e-metrics proved to be very useful. There are clear
differences in browsing behaviour between several sectors.
We can conclude that proxy servers contain much information that can be of considerable interest
to specific Web sites. Unfortunately, the administrators of these Web sites have no access to
this information. Moreover, ISPs have little personal interest in it and are therefore not very
inclined to perform such (expensive) research. We feel that there must be a way for individual
companies to obtain this information from an ISP without violating the privacy of the customers.
We hope that this paper may serve as an inducement for further discussion of this issue.
References
[1] Robert Cooley, Pang-Ning Tan, and Jaideep Srivastava. Discovery of interesting usage
patterns from web data. Technical Report TR 99-022, University of Minnesota, 1999.
[2] R. Dilly. Data mining: an introduction. Student notes, December 1995.
[3] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web usage mining: Discovery and
applications of usage patterns from web data. SIGKDD Explorations, 1(2):12, January 1999.
[4] MINEit Software Ltd. Capri user guide 1.0. User manual, 2000.
[5] NetGenesis. E-metrics: business metrics for the new economy. White paper, 2000.
www.netgenesis.com.
[6] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from
the web. In Proc. ACM Conf. Human Factors in Computing Systems (CHI), 1996.
[7] L. Catledge and J. Pitkow. Characterizing browsing strategies in the world wide web.
Computer Networks and ISDN Systems, 27(6):1065–1073, April 1995.
[8] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and pattern discovery on
the world wide web. In Proceedings of the 9th IEEE International Conference on Tools with
Artificial Intelligence (ICTAI'97), 1997.
[9] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparation for mining world
wide web browsing patterns. Knowledge and Information Systems, 1(1):5–32, 1999.
[10] Myra Spiliopoulou and Lukas C. Faulstich. WUM: a Web Utilization Miner. In Workshop on the
Web and Data Bases (WebDB98), pages 109–115, 1998.