Web Mining
Presentation 1
Presented By:
Alka Simha 106677801
Avanthi Gupta 106616697
Megha Krishnamurthy 106616749
REFERENCES
• J. Han and M. Kamber, Data Mining: Concepts and Techniques.
• Presentation Slides of Prof. Anita Wasilewska
• http://en.wikipedia.org/wiki/Web_mining
• http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
• http://searchcrm.techtarget.com/sDefinition/0,,sid11_gci789009,00.html
• http://www.cs.rpi.edu/~youssefi/research/VWM/
• http://www.galeas.de/webimining.html
• R. Kosala and H. Blockeel, Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
• R. Cooley, B. Mobasher, and J. Srivastava, Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1):5-32, 1999.
• S. Chakrabarti, Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.
• S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data.
• Y. S. Maarek and I. Z. Ben-Shaul, Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.
OVERVIEW
http://infolab.stanford.edu/~ullman/mining/2008/slides/web_mining_overview.pdf
HOW BIG IS THE WEB
Netcraft Web Server Survey, March 2009: 224,749,695 sites across all domains
http://news.netcraft.com/archives/web_server_survey.html
CHALLENGES IN WEB MINING
• Finding useful and relevant information.
• Creating knowledge from available information.
• Because the coverage of information is so wide and diverse, personalizing the
information is a tedious process.
• Learning customer and individual user patterns.
• Much of the web information is redundant, as the same piece of
information or its variant appears in many pages.
• The web is noisy, i.e., a page typically contains a mixture of many kinds of
information: main content, advertisements, copyright notices, and navigation panels.
• The web is dynamic: information changes constantly, so keeping up with and
monitoring those changes is very important.
• The Web is also about services. Many Web sites and pages enable
people to perform operations with input parameters, i.e., they provide
services.
• The most important challenge is invasion of privacy. Privacy is considered lost
when information concerning an individual is obtained, used, or disseminated
without that individual's knowledge or consent.
http://en.wikipedia.org/wiki/Web_mining
USES OF WEB MINING
• This technology has enabled e-commerce sites to do personalized marketing,
which eventually results in higher trade volumes.
• The predictive capability of mining applications can benefit society by helping
to identify criminal activity.
• The companies can establish better customer relationship by giving them
exactly what they need.
• Companies can understand the needs of the customer better and they can
react to customer needs faster.
• Companies can find, attract, and retain customers, and can save on production
costs by using the insight gained into customer requirements.
• They can increase profitability through targeted pricing based on the profiles
created.
• They can even identify customers who might defect to a competitor; the company
can then try to retain such customers with targeted promotional offers, reducing
the risk of losing them.
http://en.wikipedia.org/wiki/Web_mining
WEB MINING vs DATA MINING
• STRUCTURE
– Data Mining: data is structured, with well-defined tables, columns, rows, keys, and constraints.
– Web Mining: web data is dynamic and rich in features and patterns.
• SPEED
– Web mining often needs to react to evolving usage patterns in real time, e.g., merchandising.
http://www.information-management.com/news/5458-1.html
WEB CRAWLERS
• A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner. Other terms for Web crawlers are ants, automatic
indexers, bots, worms, Web spiders, and Web robots.
• Crawlers can also be used for automating maintenance tasks on a Web site, such as
checking links or validating HTML code.
• Crawlers can be used to gather specific types of information from Web pages, such
as harvesting e-mail addresses (usually for spam); this is why addresses are often
written in obfuscated forms such as anita at cs dot sunysb dot edu or
mueller{remove this}@cs.sunysb.edu.
• A Web crawler is one type of bot, or software agent. In general, it starts with a list
of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all
the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl
frontier
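As a rough illustration of the seed list and crawl frontier described above, here is a minimal Python sketch using only the standard library. It is an assumption-laden example, not code from the cited sources: it omits everything a real crawler needs (robots.txt handling, politeness delays, URL normalization, content storage), and the function and class names are made up for the illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # URLs still to visit (the crawl frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue                 # skip unreachable or malformed URLs
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)   # grow the frontier with new hyperlinks
    return visited
```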
WEB MINING TAXONOMY
[Taxonomy figure: Web mining is commonly divided into Web content mining, Web structure mining, and Web usage mining]
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
WEB STRUCTURE MINING
• The structure of a typical Web graph consists of Web pages as nodes, and
hyperlinks as edges connecting two related pages
• Web Structure Mining is the process of discovering structure information (the hyperlink graph) from the Web
• It retrieves information about the relevance and the quality of a web page
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
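One widely used way to quantify page relevance and quality from the hyperlink graph alone is a PageRank-style score, where a page ranks highly if highly ranked pages link to it. The sketch below is only an illustration over a made-up adjacency list; it is not taken from the cited slides, and the graph and parameter names are assumptions.

```python
def pagerank(graph, eps=0.15, iterations=50):
    """graph: {page: [pages it links to]}. Returns a rank score per page."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new_rank = {p: eps / n for p in graph}      # random-jump term
        for q, out_links in graph.items():
            if not out_links:                       # dangling page: spread its rank evenly
                share = (1 - eps) * rank[q] / n
                for p in graph:
                    new_rank[p] += share
            else:
                share = (1 - eps) * rank[q] / len(out_links)
                for p in out_links:                 # distribute rank along out-links
                    new_rank[p] += share
        rank = new_rank
    return rank

# toy graph with made-up page names, just for illustration
web = {"home": ["about", "news"], "about": ["home"], "news": ["home", "about"]}
print(pagerank(web))
```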
WEB USAGE MINING
• Web usage mining, also known as Web log mining, discovers user access patterns from Web server logs
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
WEB USAGE MINING
• Applications:
– Target potential customers for electronic commerce
– Enhance the quality and delivery of Internet information
services to the end user
– Improve Web server system performance
– Identify potential prime advertisement locations
– Facilitate personalization of sites
– Improve site design
– Fraud/intrusion detection
– Predict user’s actions (allows pre-fetching)
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
Problems with Web Logs
• Typically, a 30-minute inactivity timeout is used to delimit a user session
• Like most data mining tasks, web log mining requires preprocessing
– To identify users
– To match sessions to other data
– To fill in missing data
– Essentially, to reconstruct the click stream
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
Problems with Web Logs
• Identifying users
– Clients may have multiple streams
– Clients may access web from multiple hosts
– Proxy servers: many clients/one address
– Proxy servers: one client/many addresses
• Other issues
– When does a session end?
– Pages may be cached, so some requests never reach the server log
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
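As a concrete illustration of the preprocessing issues above, the following sketch applies one common heuristic: treat each (IP address, user agent) pair as a user and start a new session after 30 minutes of inactivity. The record format and field names are assumptions made for the example; because of proxies and caching, any such reconstruction of the click stream is only approximate.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60   # 30-minute inactivity timeout, in seconds

def sessionize(records):
    """records: iterable of (ip, user_agent, timestamp_seconds, url) tuples,
    assumed sorted by timestamp. Returns a list of sessions, each being the
    ordered list of URLs one (approximate) user requested."""
    last_seen = {}                      # user -> time of previous request
    open_sessions = defaultdict(list)   # user -> current click stream
    sessions = []
    for ip, agent, ts, url in records:
        user = (ip, agent)              # crude user identification heuristic
        if user in last_seen and ts - last_seen[user] > SESSION_TIMEOUT:
            sessions.append(open_sessions.pop(user))   # close the stale session
        last_seen[user] = ts
        open_sessions[user].append(url)
    sessions.extend(open_sessions.values())            # flush sessions still open
    return sessions
```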
Web Log – Data Mining Applications
• Association rules
– Find pages that are often viewed together
• Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
• Classification
– Relate user attributes to patterns
mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
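A hedged sketch of the first application above: count pairs of pages viewed together in the same session and keep those above a minimum support. This simple co-occurrence count stands in for a full association-rule miner; the session format and threshold are assumptions for the illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support=0.05):
    """sessions: list of lists of page URLs. Returns page pairs that appear
    together in at least a min_support fraction of all sessions."""
    pair_counts = Counter()
    for session in sessions:
        # count each pair at most once per session
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    threshold = min_support * len(sessions)
    return {pair: c for pair, c in pair_counts.items() if c >= threshold}
```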
Web Logs
• Web servers have the ability to log all requests
• Web server log formats:
– Most use the Common Log Format (CLF); see the parsing sketch after this slide
– The newer Extended Log Format allows the contents of the log file to be configured
• Design of a Web Log Miner:
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
[Figure: design of a Web Log Miner: Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Data Mining → Knowledge]
Rank of a page p from the link structure (PageRank-style): R(p) = ε/n + (1 − ε) · Σ_{(q,p)∈G} R(q) / outdegree(q)
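The Common Log Format mentioned above has a fixed, space-separated layout (host, identity, user, timestamp, request line, status, bytes), so the filtering step that turns a raw web log into relational records can be sketched with a single regular expression. This is a minimal illustration, not the slides' actual pipeline; the field names are our own.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_clf(line):
    """Turn one CLF line into a dict of named fields, or None if malformed."""
    m = CLF.match(line)
    return m.groupdict() if m else None

example = '192.0.2.1 - alice [21/Apr/2009:10:00:00 -0400] "GET /index.html HTTP/1.0" 200 2326'
print(parse_clf(example))
```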
Web Logs
http://mate.dm.uba.ar/~pfmislej/web%20mining/web%20mining.pdf
WEB MINING APPLICATIONS
• Personalization, Recommendation engines
• Web-commerce applications
• Intelligent web search
• Hypertext classification and Categorization
• Information/trend monitoring
• Analysis of online communities
• Improving the relationship between the website and the user
– Recommendations to modify the web site structure and content
– Web personalization
– Intelligent web sites: systems that, based on the user's behavior, allow
changes to be made to the current web site structure and content
paginas.fe.up.pt/~ec/files_0506/slides/06_WebMining.pdf
Personalization of Webpages
http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
CONCLUSION
Visual Web Mining
• http://www.cs.rpi.edu/~zaki/PS/WWW04.pdf
• http://www.cs.rpi.edu/~youssefi/research/VWM/
• http://www.vtk.org/
• http://www.w3.org/Robot/
• http://www.cs.rpi.edu
Overview
http://www.cs.rpi.edu/~youssefi/research/VWM/
Abstract
GOAL:
- To correlate the outcomes of mining Web Usage Logs and the
extracted Web Structure, by visually superimposing the results.
Introduction
• Information Visualization
Visual representations of abstract data, using computer-supported,
interactive visual interfaces to reinforce human cognition; thus
enabling the viewer to gain knowledge about the internal structure of
the data and relationships in it.
• User Session
Compact sequence of web accesses by a user.
• Visualization in order to:
- Understand the structure of a particular website.
- Understand web surfers’ behavior when visiting that website.
• Due to the large dataset and the structural complexity of the sites,
3D visual representations are used.
• Implemented using an open-source toolkit, the Visualization Toolkit (VTK).
http://www.vtk.org/
Visual Web Mining Architecture
• Input:
- Web pages and Web server log files.
- A web robot (webbot) is used to retrieve the pages of the website.
- The webbot is a very fast Web walker with support for regular
expressions, SQL logging facilities, and many other features. It can be
used to check links, find bad HTML, map out a web site, download
images, etc.
• Outputs:
- Frequent contiguous sequences with a given minimum support.
- These are imported into a database, and non-maximal frequent sequences are
removed.
- Different queries are executed against this data according to some criterion, e.g.
support of each pattern, length of patterns, etc.
- Different URLs which correspond to the same webpage are unified in the final
results.
• The Visualization Stage: Maps the extracted data and attributes into visual images,
realized through VTK extended with support for graphs.
• Result: Interactive 3D/2D visualizations which could be used by analysts to compare
actual web surfing patterns with expected patterns.
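To make the "frequent contiguous sequences with a given minimum support" output concrete, here is a brute-force sketch that enumerates the contiguous subsequences of each session, keeps those meeting an absolute support threshold, and prunes non-maximal ones. It only illustrates the idea; the original work relies on a dedicated sequence-mining algorithm (SPADE, introduced at the end of these slides) rather than brute force, and the input format here is assumed.

```python
from collections import Counter

def frequent_contiguous_sequences(sessions, min_support):
    """sessions: list of lists of page URLs; min_support: absolute session count.
    Returns maximal contiguous subsequences appearing in >= min_support sessions."""
    counts = Counter()
    for session in sessions:
        seen = set()                      # count each subsequence once per session
        for length in range(2, len(session) + 1):
            for start in range(len(session) - length + 1):
                seen.add(tuple(session[start:start + length]))
        counts.update(seen)
    frequent = {s for s, c in counts.items() if c >= min_support}

    def contained(short, long_seq):       # is `short` a contiguous part of `long_seq`?
        k = len(short)
        return any(long_seq[i:i + k] == short for i in range(len(long_seq) - k + 1))

    # remove non-maximal sequences: those contained in a longer frequent one
    return [s for s in frequent
            if not any(s != t and contained(s, t) for t in frequent)]
```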
Visual Representation
Structures :
- Graphs
Extract a spanning tree from the site structure, and use it as the framework
for presenting access-related results through glyphs (elements of writing) and
color mapping (see the sketch after this slide).
- Stream Tubes
Variable-width tubes showing access paths with different traffic are
introduced on top of the web graph structure.
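A minimal sketch of the spanning-tree step: a breadth-first traversal from the site's entry page keeps, for every reachable page, only the first link that discovered it. The adjacency-list representation and function name are assumptions made for illustration, not the paper's implementation.

```python
from collections import deque

def spanning_tree(graph, root):
    """graph: {url: [linked urls]}. Returns {child: parent} edges of a
    breadth-first spanning tree rooted at the site's entry page."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in parent:        # keep only the first (shortest) path found
                parent[child] = node
                queue.append(child)
    return parent
```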
Design and Implementation of Diagrams
This is a visualization of the web graph
of the Computer Science department of
Rensselaer Polytechnic Institute.
Strahler numbers, a numerical measure of branching complexity, are used for
assigning colors to the edges.
http://www.cs.rpi.edu
Adding a third dimension enables visualization of more information and clarifies
user behavior within and between clusters. The center node of the circular
basement is the first page of the web site, from which users scatter to different
clusters of web pages. A color spectrum from red (entry points into clusters) to
blue (exit points) clarifies the behavior of users.
Right: zoomed view of the colored region, with the Web usage layout taken from
the Web graph basement. The basement itself is removed for clarity.
Conclusion
- Using the visualizations, a web analyst can easily identify which parts of the
website are cold (few hits) and which are hot (many hits), and classify them
accordingly.
- For example, links can be added from hot parts of the web site to cold parts,
and the resulting changes in access patterns can then be extracted, visualized,
and interpreted.
SPADE OVERVIEW