
ISSN : 2278 – 1021

International Journal of Advanced Research in Computer and Communication Engineering


Vol. 1, Issue 6, August 2012

Advanced Preprocessing using Distinct User Identification in Web Log Usage Data
Sheetal A. Raiyani1, Shailendra Jain2, Ashwin G. Raiyani3
Department of CSE (Software System), Technocrats Institute of Technology, Bhopal, India1
Department of CSE, Technocrats Institute of Technology, Bhopal, India2
Department of CE, RK University, Gujarat, India3

Abstract—Millions of visitors interact daily with web sites around the world. A huge amount of data is generated, and this information can be very valuable to a company seeking to understand customer behavior. This paper presents a complete preprocessing methodology comprising data cleaning and an enhanced user identification technique. User identification, that is, identifying the web users, is a key issue in the preprocessing phase. Traditional user identification is based on the site structure and uses heuristic rules. In most cases the relationships between pages are derived from the site topology, which reduces the efficiency of identification. To solve this problem we propose a technique called DUI (Distinct User Identification), based on the IP address, the agent, and the referred pages within a desired session time. It can be used in counter-terrorism, fraud detection, and detection of unusual access to secure data and, through detection of frequent access behavior, to improve the overall design and performance of future access. Experiments show that the advanced preprocessing technique can enhance the quality of data preprocessing results.

Keywords— Web usage mining, Preprocessing, User identification, Session time, Server log

I. INTRODUCTION
Web mining refers to the use of data mining techniques to automatically retrieve, extract, and analyze information for knowledge discovery from web documents and services. Web usage mining is a heavily researched area in the field of data mining. It supports web site design, personalization services, and other business decision making. To better serve users, web mining applies data mining, artificial intelligence, charting technology, and related methods to web data, traces users' visiting characteristics, and then extracts their usage patterns. It has quickly become one of the most important areas in Computer and Information Sciences because of its direct applications in e-commerce, CRM, Web analytics, information retrieval and filtering, and Web information systems. According to the mining objects, there are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.

Fig-1 Taxonomy of Web mining

II. WEB USAGE MINING
Web usage mining is the application of data mining techniques to discover interesting usage patterns from web data, in order to understand and better serve the needs of web-based applications. It tries to make sense of the data generated by web surfers' sessions and behaviors. While web content and structure mining utilize the primary data on the web, web usage mining mines the secondary data derived from users' interactions with the web. This includes registration data, user sessions, cookies, user queries,

Copyright to IJARCCE www.ijarcce.com 418



mouse clicks, and any other data resulting from interactions. One web usage mining method is based on the data cube: it turns web logs into a structured data cube to which various data mining technologies can be applied [3]. Web usage mining analyzes the results of user interactions with a web server, including web logs, click streams, and database transactions at a web site or a group of related sites. Web usage mining is also known as web log mining. The web usage mining process can be regarded as a three-phase process consisting of:

• Preprocessing/data preparation - web log data are preprocessed in order to clean the data (remove log entries that are not needed for the mining process), integrate data, identify users and sessions, and so on.
• Pattern discovery - statistical methods as well as data mining methods (path analysis, association rules, sequential patterns, and clustering and classification rules) are applied in order to detect interesting patterns.
• Pattern analysis - discovered patterns are analyzed using OLAP tools, a knowledge query management mechanism, and intelligent agents to filter out the uninteresting rules and patterns.

Fig-2: Web Usage Mining

After discovering patterns from usage data, further analysis has to be conducted. The most common ways of analyzing such patterns are either to query them or to load the results into a data cube and then perform OLAP operations [3]. Visualization techniques are then used to interpret the results. The discovered rules and patterns can then be used to improve system performance or to modify the web site. The purpose of web usage mining is to apply statistical and data mining techniques to the preprocessed web log data in order to discover useful patterns. Usage mining tools discover and predict user behavior in order to help the designer improve the web site, attract visitors, or give regular users a personalized and adaptive service. The applications are: extracting statistical information and discovering interesting user patterns; clustering users into groups according to their navigational behavior; discovering potential correlations between web pages and user groups; identifying potential customers for e-commerce; enhancing the quality and delivery of Internet information services to the end user; improving web server system performance and site design; and facilitating personalization.

III. WEB LOG FORMAT
Web usage data includes data from web server logs, proxy server logs, browser logs, and user profiles. The usage data can also be split into three kinds on the basis of where it is collected: the server side (an aggregate picture of the usage of a service by all users), the client side (a complete picture of the usage of all services by a particular client), and the proxy side (somewhere in the middle). Web server logs are plain text (ASCII) files that are independent of the server platform. There are some distinctions between server software, but traditionally there are four types of server logs.

Fig-3 Different types of log

Currently, there are three formats available to record log files: the W3C Extended Log File Format, the Microsoft IIS Log File Format, and the NCSA Common Log File Format. All three are ASCII text formats. The W3C Extended and NCSA formats record logging data in four-digit year format. The Microsoft IIS format uses a two-digit year format for years 1999 and earlier and a four-digit format thereafter. The Microsoft IIS log format is provided for backward compatibility with earlier IIS versions [2]. A web server log file contains requests made to the web server, recorded in chronological order. The most popular log file formats are the Common Log Format (CLF) and the extended CLF. A common log format file is created by the web server to keep track of the requests that occur on a web site. A standard log file has the following format:


<ip_addr><base_url><date><method><file><Protocol><code><bytes><referrer><user_agent>

Fig-4: CLF Log Format

IV. PREPROCESSING TECHNIQUE

The data preparation process is often the most time-consuming and computationally intensive step in the Web usage mining process. The process may involve preprocessing the original data, integrating data from multiple sources, and transforming the integrated data into a form suitable for input into specific data mining operations. This process is known as data preparation [5]. Ideally, the input for the Web usage mining process is a user session file that gives an exact account of who accessed the Web site, what pages were requested and in what order, and how long each page was viewed. A user session is the set of page accesses that occur during a single visit to a Web site. However, for the reasons discussed below, the information contained in a raw Web server log does not reliably represent a user session file before data preprocessing. Generally, data preprocessing consists of data cleaning, user identification, session identification, and path completion.

Fig-5: Preprocessing Technique (stages shown: Raw Web Log, Customization, Data Cleaning, User Identification, Session Identification, Database of Cleaned Log)

A. Data Cleaning
The purpose of data cleaning is to eliminate irrelevant items. Such techniques are important for any type of web log analysis, not only data mining. Depending on the purpose of the mining application [1], irrelevant records in the web access log are eliminated during data cleaning. Since the target of Web usage mining is to obtain the user's travel patterns, the following two kinds of records are unnecessary and should be removed:
Records of graphics, videos, and format information. These records have filename suffixes such as GIF, JPEG, and CSS, which can be found in the URI field of every record.
Records with a failed HTTP status code. By examining the Status field of every record in the web access log, the records with status codes over 299 or under 200 are removed.

B. User Identification
The task of user and session identification is to find the different user sessions in the original web access log. User identification means identifying who accessed the web site and which pages were accessed. The goal of session identification is to divide each user's page accesses into individual sessions. A session is a series of web pages a user browses in a single visit. The difficulties in this step are introduced by proxy servers; for example, different users may have the same IP address in the log. A referrer-based method is proposed in this study to solve these problems. The rules adopted to distinguish user sessions can be described as follows:
Different IP addresses distinguish different users;
If the IP addresses are the same, different browsers and operating systems indicate different users.
In the user identification step the unique users are distinguished, and as a result the different users are identified. This can be done in various ways, such as using IP addresses, cookies, or direct authentication. Because this paper focuses on the analysis of user identification methods, this step is discussed later in detail.

C. Session Identification
A session is understood as a sequence of activities performed by a user while navigating through a given site. Identifying the sessions from the raw data is a complex step because the server logs do not always contain all the information needed. When Web server logs do not contain enough information to reconstruct the user sessions, heuristics (for example time-oriented or structure-oriented) can be used. If the IP address, browser, and operating system are all the same, the referrer information should be taken into account. The Refer URI field is checked, and a new user session is identified if the URL in the Refer URI field has not been accessed previously or, when the Refer URI field is empty, if there is a large interval (usually more than 10 seconds) between the access time of this record and the previous one. The simplest methods are time oriented: one is based on total session time and the other on single page stay time. The set of pages visited by a specific user at a specific time is called page viewing time. It varies from 25.5 minutes to 24 hours [4], while 30 minutes is the default timeout. The second method depends on page stay time, which is calculated as the difference between two timestamps. If it exceeds 10 minutes, the second entry is assumed to start a new session. Time-based methods are not reliable because users may engage in other activities after opening the web page, and factors such as a busy communication line, the loading time of page components, and the content size of web pages are not considered. A third method, based on navigation, uses the web topology in graph format [4].
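The data cleaning rules of Section IV-A and the page-stay heuristic of Section IV-C can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the suffix list (the paper only says "GIF, JPEG, CSS, and so on"), and the representation of a user's requests as a list of timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumed, illustrative suffix list; the paper names GIF, JPEG and CSS "and so on".
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".css")
# Page-stay threshold quoted in Section IV-C.
PAGE_STAY_TIMEOUT = timedelta(minutes=10)


def is_relevant(uri, status):
    """Data cleaning: keep successful (2xx) requests that are not resource files."""
    path = uri.split("?", 1)[0].lower()  # ignore any query string
    return 200 <= status <= 299 and not path.endswith(IRRELEVANT_SUFFIXES)


def split_sessions(timestamps, timeout=PAGE_STAY_TIMEOUT):
    """Split one user's chronologically ordered request times into sessions.

    A new session starts whenever the gap to the previous request
    exceeds the timeout.
    """
    sessions = []
    for ts in timestamps:
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


if __name__ == "__main__":
    t0 = datetime(2012, 1, 1, 9, 0)
    times = [t0, t0 + timedelta(minutes=3), t0 + timedelta(minutes=20)]
    # The 17-minute gap before the last request starts a second session.
    print(len(split_sessions(times)))
```

The same `split_sessions` function could be called with `timeout=timedelta(minutes=30)` to approximate the total-session-time default mentioned above, although a faithful total-time rule would compare each request against the first request of the session rather than the previous one.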


The session identified in this way may contain more than one visit by the same user at different times; the time-oriented heuristic is then used to divide the different visits into different user sessions. After grouping the records in the web logs into user sessions, the path completion algorithm should be used to acquire the complete user access path.
The WUM system presented in this paper is not a full web log mining system. Its aim is to better identify web users and the individuals behind them. In this manner it realizes the first three steps of a web log mining process. The results provided by our system can be used for further processing by any data mining algorithm.

V. RELATED WORK
An important issue in user identification is how exactly the users have to be distinguished. It depends mainly on the task for which the mining process is executed. In certain cases the users are identified only by their IP addresses [6]. This can provide an acceptable result for short time periods (minutes or hours) or when the expected results of the data mining task do not need more precise information about the unique web users, for example when selecting frequently visited pages for server-side caching or preloading the next page of common navigational paths.
In other cases heuristics are used for better identification of the users. In [7][6] the different methods are grouped into two classes: proactive methods and reactive methods. Proactive strategies aim at differentiating the users before or during the page request, while reactive strategies attempt to associate individuals with the log entries after the log is written. Proactive strategies can be simple user authentication with forms, cookies, or dynamic web pages that are associated with the browser invoking them. Reactive strategies work with the recorded log files only, and the different users are distinguished by their navigational patterns, download timing sequence, or other heuristics based on assumptions about their behavior. For example, in [8][6] web users are distinguished by their navigational patterns using clustering methods.

A. Problem at time of User Identification

User identification means identifying who accessed the Web site and which pages were accessed. If users have logged in with their information, it is easy to identify them. In fact, many users do not register their information. Moreover, many users access Web sites through an agent, several users may use the same computer, firewalls exist, one user may use different browsers, and so forth. All of these problems make it complicated and very difficult to identify every unique user accurately. We may use cookies to track users' behavior. But for reasons of personal privacy, many users do not use cookies, so it is necessary to find other methods to solve this problem. How can users who use the same computer or the same agent be identified?
As presented in [9], a heuristic method can be used: if a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, the heuristic assumes that there is another user with the same computer or the same IP address. Ref. [4] presents a method called navigation patterns to identify users automatically. But none of these methods is accurate, because they consider only a few of the aspects that influence the process of user identification.
The success of a web site cannot be measured only by hits and page views. Unfortunately, web site designers and web log analyzers do not usually cooperate. This causes problems in identifying unique users, constructing discrete user sessions, and collecting the essential web pages for analysis. As a result, many web log mining tools have been developed and widely exploited to solve these problems.

B. Proposed Method Distinct User Identification (DUI)

Considering this situation, we present a new algorithm called "DUI (Distinct User Identification)". It analyzes more factors, such as the user's IP address, the Web site's topology, the browser version, the operating system, and the referrer page. This algorithm offers good precision and extensibility. It can identify not only users but also sessions. Session identification will be discussed in the next section. The proposed method does not compare on User_IP alone, since the same User_IP may belong to different web users; it finds the distinct web user based on the path chosen by the user and the access time together with the referrer page.
Definition: given a clean and filtered web log file and the record set of that file,
Records R = {r1, r2, r3, …, rn}, where n > 0
Step 1: Input the log database RUser of N records
Step 2: Initialize the distinct user identification base
Step 3: RUser = P<url, ip_addr, agent, method, operating system, status, session id, time_stamp>
Step 4: RUser = <r1, r2, r3, …, rn> where n != 0, i = 0
Step 5: while (i < n)
Step 6: read Logdatabase RUser
Step 7: if r(i).userip is not part of the distinct user identification base, then it is treated as a new user and userip is copied into the distinct user identification base
Step 8: end if
Step 9: i = i + 1;
Step 10: end loop
Step 11: end
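The steps above can be sketched in Python as a single pass over the cleaned log. This is a minimal illustration under stated assumptions, not the authors' code: `LogRecord`, its field names, and the optional `key` parameter are hypothetical. With the default key the function mirrors the pseudocode, which compares only the IP address; passing a richer key (for example IP address plus agent) is an assumed extension in the spirit of the factors the method description lists.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical subset of the RUser fields listed in Step 3."""
    ip_addr: str
    agent: str
    url: str


def distinct_users(records, key=lambda r: r.ip_addr):
    """Steps 5-10: scan the cleaned log once and collect unseen user keys."""
    base = []     # the "distinct user identification base", in first-seen order
    seen = set()  # same keys, kept in a set for constant-time membership tests
    for record in records:
        k = key(record)
        if k not in seen:  # Step 7: not yet in the base, so treat as a new user
            seen.add(k)
            base.append(k)
    return base


if __name__ == "__main__":
    log = [
        LogRecord("10.0.0.1", "Firefox/Windows", "/index.html"),
        LogRecord("10.0.0.1", "Chrome/Linux", "/catalog.html"),
        LogRecord("10.0.0.2", "Firefox/Windows", "/index.html"),
        LogRecord("10.0.0.1", "Firefox/Windows", "/about.html"),
    ]
    # Comparing on IP address only, as in the pseudocode.
    print(len(distinct_users(log)))
    # Assumed extension: same IP with a different agent counts as another user.
    print(len(distinct_users(log, key=lambda r: (r.ip_addr, r.agent))))
```

Keeping the key function separate from the scan makes it easy to experiment with which log fields are combined to decide that two records belong to the same user.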
VI. RESULT AND ANALYSIS OF EXPERIMENT

To validate the effectiveness and efficiency of the methodology described above, we ran an experiment on the web server log of the library of RK University (rku.ac.in). The initial data source of our experiment covers Jan 1, 2012 to Aug 3, 2012 and is 129 MB in size. Our experiments were performed on a 2.8 GHz Pentium IV CPU with 512 MB of main memory, Windows 2000 Professional, SQL Server 2000, and JDK 1.5. The figure shows the results of our experiment. After data cleaning, the number of requests declined from 747890 to 112783. The figure shows the detailed changes in data cleaning.

Entries in raw web log         47890
Entries after data cleaning    12783
Number of users                6542
Number of Unique users         4366
Number of sessions             6744

Fig-6: Result of Experiment

Fig-7 : Result of Proposed method DUI Experiment

VII. CONCLUSION
In this research we present a Distinct User Identification technique that enhances the preprocessing steps for web log usage data in data mining. We combine two preprocessing techniques within a single preprocessing step: at the time of user identification we find the distinct users based on their attended session time. The proposed DUI algorithm for advanced preprocessing is very efficient compared to other identification techniques, and we get more precise and accurate results. Based on this we can easily personalize websites and improve the design of web pages according to the usage of users on websites. Future work needs to be done to combine the whole process of WUM. A complete methodology covering pattern discovery and pattern analysis will make the identification method more useful.

REFERENCES

[1] Theint Theint Aye, Web Log Cleaning for Mining of Web Usage Patterns, 978-1-61284-840-2/11, 2011 IEEE.
[2] Mohd Helmy Abd, Mohd Norzali, Data Preprocessing on Web Server Log for Generalized Association Rule Mining, World Academy of Science, Engineering and Technology, 48, 2008.
[3] DeMin Dong, Exploring on Web Usage Mining and its Application, 5th World Congress on Intelligent Control and Automation, June 15-19, 2004, China.
[4] V. Chitraa, Dr. Antony Selvadoss Thanamani, A Novel Technique for Sessions Identification in Web Usage Mining Preprocessing, International Journal of Computer Applications (0975 – 8887), Volume 34, No. 9, November 2011.
[5] Mr. Sanjay Bapu Thakare, Prof. Sangram Z. Gawali, A Effective and Complete Preprocessing for Web Usage Mining, (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 03, 2010, 848-851.
[6] Renáta Iváncsy and Sándor Juhász, Analysis of Web User Identification Methods, World Academy of Science, Engineering and Technology, 34, 2007.
[7] M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis, INFORMS Journal on Computing, 15, 2003.
[8] T. Morzy, M. Wojciechowski, and M. Zakrzewicz, Web Users Clustering, International Symposium on Computer and Information Sciences, 2000.
[9] M. Spiliopoulou, B. Mobasher, and B. Berendt, A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis, INFORMS Journal on Computing, Spring 2003.
