9-Advanced Preprocessing Using Distinct User
Abstract—Millions of visitors interact daily with web sites around the world. Huge amounts of data are generated, and
this information can be very valuable to a company seeking to understand its customers' behavior. This paper presents a
complete preprocessing methodology comprising data cleaning and an enhanced user identification technique. User
identification, the task of identifying the web users behind the log entries, is a key issue in the preprocessing phase.
Traditional user identification relies on the site structure and a set of heuristic rules. In most cases the relationships
between pages are derived from the site topology, which reduces the efficiency of identification. To solve this problem we
propose DUI (Distinct User Identification), a technique based on IP address, agent, and referred pages within a desired
session time. It can be used in counter-terrorism, fraud detection, and detection of unusual access to secure data, and,
through detection of frequent access behavior, to improve the overall design and performance of future access. Experiments
show that the advanced data preprocessing technique can enhance the quality of data preprocessing results.
Keywords— Web usage mining, Preprocessing, User identification, Session time, Server log
mouse clicks, and any other data as the results of interactions. Web usage mining method based on data cube: the approach
based on the data cube stresses turning web logs into a structured data cube, to which various data mining technologies can
be applied [3]. Web usage mining analyzes the results of user interactions with a web server, including web logs, click
streams, and database transactions at a web site or a group of related sites. Web usage mining is also known as web log
mining. The web usage mining process can be regarded as a three-phase process consisting of:
Preprocessing / data preparation - web log data are preprocessed in order to clean the data (remove log entries that are
not needed for the mining process), integrate data, identify users and sessions, and so on.
Pattern discovery - statistical methods as well as data mining methods (path analysis, association rules, sequential
patterns, and clustering and classification rules) are applied in order to detect interesting patterns.
Pattern analysis - the discovered patterns are analyzed using OLAP tools, knowledge query management mechanisms, and
intelligent agents to filter out the uninteresting rules/patterns.

user into groups according to their navigational behavior, discover potential correlations between web pages and user
groups, identify potential customers for e-commerce, enhance the quality and delivery of Internet information services to
the end user, improve web server system performance and site design, and facilitate personalization.

III. WEB LOG FORMAT

The web usage data includes the data from web server logs, proxy server logs, browser logs, and user profiles. The usage
data can also be split into three kinds on the basis of the source of its collection: the server side (where there is an
aggregate picture of the usage of a service by all users), the client side (where there is a complete picture of the usage
of all services by a particular client), and the proxy side (which lies somewhere in the middle). Web server logs are plain
text (ASCII) files that are independent of the server platform. There are some distinctions between server software, but
traditionally there are four types of server logs.
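As a hedged illustration of how such a plain-text server log can be read, the following Python sketch parses one log line into most of the fields listed in Fig-4. It assumes the widely used Apache-style combined log layout, which has no separate base_url field; the regular expression, the function name parse_log_line, and the sample line are illustrative assumptions and are not taken from the experiment's data.

```python
import re
from datetime import datetime

# Assumed layout (Apache-style combined log):
# ip ident user [date] "method file protocol" code bytes "referrer" "user_agent"
LOG_PATTERN = re.compile(
    r'(?P<ip_addr>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["code"] = int(record["code"])
    # A "-" in the size field means that no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    # Timestamp layout assumed here, e.g. 01/Jan/2012:10:15:32 +0530
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical sample line, for illustration only.
sample = ('10.0.0.7 - - [01/Jan/2012:10:15:32 +0530] '
          '"GET /library/index.html HTTP/1.1" 200 5120 '
          '"-" "Mozilla/5.0 (Windows NT 5.0)"')
record = parse_log_line(sample)
```

Returning None for lines that do not match lets a caller skip malformed entries instead of stopping the whole import.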
<ip_addr><base_url><date><method><file><protocol><code><bytes><referrer><user_agent>

Fig-4: CLF Log Format

IV. PREPROCESSING TECHNIQUE

The data preparation process is often the most time-consuming and computationally intensive step in the Web usage mining
process. It may involve preprocessing the original data, integrating data from multiple sources, and transforming the
integrated data into a form suitable for input into specific data mining operations. This process is known as data
preparation [5]. Ideally, the input for the Web usage mining process is a user session file that gives an exact account of
who accessed the Web site, what pages were requested and in what order, and how long each page was viewed. A user session
is the set of page accesses that occur during a single visit to a Web site. However, for the reasons discussed in the
following, the information contained in a raw Web server log does not reliably represent a user session file before data
preprocessing. Generally, data preprocessing consists of data cleaning, user identification, session identification, and
path completion.

[Figure: blocks labeled Raw WebLog, Customization, Data Cleaning, Session Identification, User Identification, Database of
Cleaned Log]

Fig-5: Preprocessing Technique

A. Data Cleaning
The purpose of data cleaning is to eliminate irrelevant items; techniques of this kind are important for any type of web
log analysis, not only data mining. Depending on the purpose of the mining application [1], irrelevant records in the web
access log are eliminated during data cleaning. Since the target of Web usage mining is to obtain the user's travel
patterns, the following two kinds of records are unnecessary and should be removed:
Records of graphics, videos, and format information. These records have filename suffixes such as GIF, JPEG, and CSS,
which can be found in the URI field of each record.
Records with a failed HTTP status code. By examining the Status field of every record in the web access log, the records
with status codes over 299 or under 200 are removed.

B. User Identification
The task of user and session identification is to find the different user sessions in the original web access log. User
identification means identifying who accesses the web site and which pages are accessed. The goal of session identification
is to divide the page accesses of each user into individual sessions. A session is the series of web pages a user browses
in a single access. The difficulty of this step comes from proxy servers, e.g. different users may have the same IP address
in the log. A referrer-based method is proposed to solve these problems in this study. The rules adopted to distinguish
user sessions can be described as follows:
Different IP addresses distinguish different users;
If the IP addresses are the same, different browsers and operating systems indicate different users;
User identification. In this step the unique users are distinguished. This can be done in various ways, such as using IP
addresses, cookies, or direct authentication. Because this paper focuses on the analysis of the different user
identification methods, this step is discussed later in detail.

C. Session Identification
A session is understood as a sequence of activities performed by a user while navigating through a given site. Identifying
the sessions from the raw data is a complex step, because the server logs do not always contain all the information needed.
Some Web server logs do not contain enough information to reconstruct the user sessions; in this case heuristics (for
example time-oriented or structure-oriented) can be used as described. If the IP address, browser, and operating system are
all the same, the referrer information should be taken into account. The Refer URI field is checked, and a new user session
is identified if the URL in the Refer URI field has not been accessed previously, or, if the Refer URI field is empty, if
there is a large interval (usually more than 10 seconds) between the access time of this record and the previous one. The
simplest methods are time-oriented: one method is based on total session time and the other on single page stay time. The
time spent on the set of pages visited by a specific user at a specific time is called the page viewing time. It varies
from 25.5 minutes to 24 hours [4], while 30 minutes is the default timeout. The second method depends on page stay time,
which is calculated as the difference between two timestamps. If it exceeds 10 minutes, the second entry is assumed to
start a new session. Time-based methods are not reliable, because users may engage in other activities after opening a web
page, and factors such as a busy communication line, the loading time of page components, and the content size of web pages
are not considered. The third method, based on navigation, uses the web topology in graph format [4].
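The data cleaning rules of Section IV-A and the two time-oriented heuristics of Section IV-C can be sketched together in Python as below. This is a minimal sketch under stated assumptions: the suffix list, the names is_relevant and split_sessions, and the sample requests are hypothetical, and the 30-minute and 10-minute thresholds are the values quoted in the text rather than tuned parameters.

```python
from datetime import datetime, timedelta

# Suffixes treated as graphics/format resources. The text names GIF, JPEG,
# CSS "and so on"; the exact list used here is an assumption.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".css")

SESSION_TIMEOUT = timedelta(minutes=30)    # limit on total session time
PAGE_STAY_TIMEOUT = timedelta(minutes=10)  # limit on the gap between requests


def is_relevant(uri, status):
    """Data cleaning: keep only successful requests for non-resource files."""
    path = uri.split("?", 1)[0].lower()
    if path.endswith(IRRELEVANT_SUFFIXES):
        return False
    return 200 <= status <= 299


def split_sessions(timestamps):
    """Split one user's request times into sessions using both heuristics."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions:
            current = sessions[-1]
            within_stay = ts - current[-1] <= PAGE_STAY_TIMEOUT
            within_total = ts - current[0] <= SESSION_TIMEOUT
            if within_stay and within_total:
                current.append(ts)
                continue
        # First request, or one of the limits was exceeded: open a new session.
        sessions.append([ts])
    return sessions


# Hypothetical requests of a single user: (time, uri, status)
t0 = datetime(2012, 1, 1, 10, 0)
requests = [
    (t0, "/index.html", 200),
    (t0 + timedelta(minutes=1), "/logo.gif", 200),      # removed: graphic
    (t0 + timedelta(minutes=2), "/missing.html", 404),  # removed: failed status
    (t0 + timedelta(minutes=5), "/books.html", 200),
    (t0 + timedelta(minutes=40), "/index.html", 200),   # gap over 10 minutes
]
cleaned = [(t, u) for t, u, s in requests if is_relevant(u, s)]
sessions = split_sessions([t for t, _ in cleaned])
print(len(cleaned), [len(s) for s in sessions])  # prints: 3 [2, 1]
```

Applying both limits at once is a design choice of this sketch; either heuristic can be disabled by setting its timeout to a very large value.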
A session identified in this way may contain more than one visit by the same user at different times; the time-oriented
heuristic is then used to divide the different visits into different user sessions. After grouping the records in the web
logs into user sessions, the path completion algorithm should be used to acquire the complete user access path.
The WUM system presented in this paper is not a full web log mining system. Its aim is to better identify web users and
the individuals behind the users. In this manner it realizes the first three steps of a web log mining process. The results
provided by our system can be used for further processing by any data mining algorithm.

V. RELATED WORK

An important issue in user identification is how exactly the users have to be distinguished. It depends mainly on the task
for which the mining process is executed. In certain cases the users are identified only by their IP addresses [6]. This
can provide an acceptable result for short time periods (minutes or hours) or when the expected results of the data mining
task do not need more precise information about the unique web users, for example when selecting frequently visited pages
for server-side caching or preloading the next page of common navigational paths.
In other cases some heuristics are used for better identification of the users. In [7][6] the different methods are grouped
into two classes: proactive methods and reactive methods. Proactive strategies aim at differentiating the users before or
during the page request, while reactive strategies attempt to associate individuals with the log entries after the log is
written. Proactive strategies can be simple user authentication with forms, cookies, or dynamic web pages that are
associated with the browser invoking them. Reactive strategies work with the recorded log files only, and the different
users are distinguished by their navigational patterns, download timing sequences, or other heuristics based on assumptions
about their behavior. For example, in [8][6] web users are distinguished based on their navigational patterns using
clustering methods.

A. Problem at time of User Identification

User identification means identifying who accesses the Web site and which pages are accessed. If users log in with their
information, it is easy to identify them. In fact, many users do not register their information. What is more, a great
number of users access Web sites through an agent, several users share the same computer, firewalls exist, one user uses
different browsers, and so forth. All of these problems make it greatly complicated and very difficult to identify every
unique user accurately. We may use cookies to track users' behavior, but because of personal privacy concerns many users do
not use cookies, so it is necessary to find other methods to solve this problem. For users who use the same computer or the
same agent, how can they be identified?
As presented in [9], a heuristic method can be used to solve the problem: if a page is requested that is not directly
reachable by a hyperlink from any of the pages visited by the user, the heuristic assumes that there is another user with
the same computer or with the same IP address. Ref. [4] presents a method called navigation patterns to identify users
automatically. But none of these methods is accurate, because they consider only a few of the aspects that influence the
process of user identification.
The success of a web site cannot be measured only by hits and page views. Unfortunately, web site designers and web log
analyzers do not usually cooperate. This causes problems such as the identification of unique users, the construction of
discrete user sessions, and the collection of essential web pages for analysis. As a result, many web log mining tools have
been developed and widely exploited to solve these problems.

B. Proposed Method Distinct User Identification (DUI)

Considering this situation, we present a new algorithm called “DUI (DISTINCT USER IDENTIFICATION)”. It analyzes more
factors, such as the user's IP address, the Web site's topology, the browser's edition, the operating system, and the
referrer page. This algorithm offers better precision and extensibility. It can identify not only users but also sessions.
Session identification will be discussed in the next section. The proposed method does not compare records on User_IP
alone, since in some cases the same User_IP may correspond to different web users; based on the path chosen by a user and
the access time together with the referrer page, we find the distinct web users.
Definition: given a clean and filtered web log file and the record set of the web log file
Records R = {r1, r2, r3, …, rn}, where n > 0
Step 1: input log database RUser of N records
Step 2: distinct user identification base
Step 3: RUser = P<url, ip_addr, agent, method, operating system, status, session id, time_stamp>
Step 4: RUser = <r1, r2, r3, …, rn> where n != 0, i = 0
Step 5: while (i < n)
Step 6: read record r(i) from log database RUser
Step 7: if r(i).userip is not part of the distinct user identification base, then treat it as a new user and copy userip
into the distinct user identification base
Step 8: end if
Step 9: i = i + 1
Step 10: end loop
Step 11: end
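The steps above can be sketched in Python as follows. The listed steps compare only the user IP; because the description also names the agent among the factors considered, this sketch optionally extends the comparison key to the pair (IP address, agent). That extension, the representation of records as dictionaries, and the names distinct_users and use_agent are assumptions made for illustration, not the authors' implementation.

```python
def distinct_users(records, use_agent=False):
    """Return the distinct user identification base in first-seen order.

    records   -- cleaned log records as dicts with 'ip_addr' and 'agent' keys
    use_agent -- if True, the same IP with a different agent is a new user
    """
    base = []     # distinct user identification base (Step 2)
    seen = set()  # fast membership test for the base
    for record in records:  # Steps 5 to 10: scan every record once
        key = record["ip_addr"]
        if use_agent:
            key = (record["ip_addr"], record["agent"])
        if key not in seen:  # Step 7: unseen key, so treat it as a new user
            seen.add(key)
            base.append(key)
    return base


# Hypothetical cleaned records, for illustration only.
log = [
    {"ip_addr": "10.0.0.7", "agent": "Mozilla/5.0"},
    {"ip_addr": "10.0.0.7", "agent": "Opera/9.80"},
    {"ip_addr": "10.0.0.9", "agent": "Mozilla/5.0"},
    {"ip_addr": "10.0.0.7", "agent": "Mozilla/5.0"},
]
print(len(distinct_users(log)))                  # IP only: prints 2
print(len(distinct_users(log, use_agent=True)))  # IP and agent: prints 3
```

Keeping a set next to the list makes each membership check constant time on average, so the scan stays linear in the number of records.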
VI. RESULT AND ANALYSIS OF EXPERIMENT

To validate the effectiveness and efficiency of the methodology described above, we carried out an experiment with the web
server log of the library of RK University (rku.ac.in). The data source of our experiment covers Jan 1, 2012 to Aug 3, 2012,
and its size is 129 MB. Our experiments were performed on a 2.8 GHz Pentium IV CPU with 512 MB of main memory, running
Windows 2000 Professional, SQL Server 2000, and JDK 1.5. The figure shows the results of our experiment. After data
cleaning, the number of requests declined from 747890 to 112783. The figure shows the detailed changes in data cleaning.

Entries in raw web log         47890
Entries after data cleaning    12783
Number of users                6542
Number of unique users         4366
Number of sessions             6744

Fig-6: Result of Experiment

VII. CONCLUSION

In this research we present the Distinct User Identification technique, which enhances the preprocessing steps of web log
usage data in data mining. We combine two preprocessing techniques within one preprocessing step: at the time of user
identification we find the distinct users based on their attended session time. The proposed DUI algorithm for advanced
preprocessing is very efficient compared to other identification techniques, and we obtain more precise and accurate
results. Based on this, web sites can easily be personalized and the design of web pages improved according to the usage of
users on the web sites. Future work is needed to combine the whole process of WUM. A complete methodology that also covers
pattern discovery and pattern analysis will be more useful for the identification method.

REFERENCES

[1] Theint Theint Aye, Web log Cleaning for mining of web usage patterns, 978-1-61284-840-2/11/2011 IEEE.
[2] Mohd Helmy Abd, Mohd Norzali, Data Preprocessing on Web Server Log for Generalized Association Rule Mining, World
Academy of Science, Engineering and Technology, 48, 2008.
[3] DeMin Dong, Exploring on Web Usage Mining and its Application, 5th World Congress on Intelligent Control and Automation,
June 15-19, 2004, China.
[4] V. Chitraa, Dr. Antony Selvadoss Thanamani, A Novel Technique for Sessions Identification in Web Usage Mining
Preprocessing, International Journal of Computer Applications (0975-8887), Volume 34, No. 9, November 2011.
[5] Mr. Sanjay Bapu Thakare, Prof. Sangram Z. Gawali, A Effective and Complete Preprocessing for Web Usage Mining, (IJCSE)
International Journal on Computer Science and Engineering, Vol. 02, No. 03, 2010, 848-851.
[6] Renáta Iváncsy and Sándor Juhász, Analysis of Web User Identification Methods, World Academy of Science, Engineering
and Technology, 34, 2007.
[7] M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, A Framework for the Evaluation of Session Reconstruction
Heuristics in Web Usage Analysis, INFORMS Journal on Computing, 15, 2003.
[8] T. Morzy, M. Wojciechowski, and M. Zakrzewicz, Web users clustering, International Symposium on Computer and
Information Sciences, 2000.
[9] M. Spiliopoulou, B. Mobasher, and B. Berendt, A Framework for the Evaluation of Session Reconstruction Heuristics in Web
Usage Analysis, INFORMS Journal on Computing, Spring 2003.