Data Mining
— Chapter 1 —
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
Why Data Mining?
◼ The Explosive Growth of Data: from terabytes to petabytes
◼ 1 terabyte (TB) equals 1,000 gigabytes (GB) or 1,000,000 megabytes (MB)
◼ In the same decimal (SI) convention, 1 petabyte (PB) equals 1,000 TB and
1 exabyte (EB) equals 1,000 PB. The binary convention uses a factor of 1,024
at each step instead.
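The unit arithmetic can be checked with a short sketch. It assumes decimal (SI) prefixes by default; the unit table and the helper name `convert` are illustrative and not part of the slides.

```python
# Storage units in ascending order; each step is one factor of `base`.
# The table and the helper name are illustrative assumptions.
UNITS = ["MB", "GB", "TB", "PB", "EB"]

def convert(value, src, dst, base=1000):
    """Convert between storage units; pass base=1024 for the binary convention."""
    steps = UNITS.index(src) - UNITS.index(dst)
    return value * base ** steps

tb_in_gb = convert(1, "TB", "GB")
tb_in_mb = convert(1, "TB", "MB")
pb_in_tb = convert(1, "PB", "TB")
eb_in_pb = convert(1, "EB", "PB")
print(tb_in_gb, tb_in_mb, pb_in_tb, eb_in_pb)
```

Passing `base=1024` gives the binary figures, which is where the number 1,024 comes from.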
Evolution of Sciences
◼ Before 1600: empirical science, which is based on observation and
experience.
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web databases
◼ 2000s:
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data Mining in Business Intelligence
[Figure: pyramid in which the potential to support business decisions increases toward the top. Labels recovered: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting).]
Example: Medical Data Mining
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented,
◼ Knowledge to be mined (or: Data mining functions)
◼ Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
◼ Descriptive data mining vs. predictive data mining
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine
◼ Applications adapted
◼ Retail, telecommunication, banking, fraud analysis,
Data Mining: On What Kinds of Data?
Data Mining Function: (1) Characterization
and Discrimination
◼ Data characterization is a summarization of the
general characteristics or features of a target class of
data.
◼ The data corresponding to the user-specified class are
typically collected by a query.
◼ E.g., summarize the characteristics of customers who spend
more than $5,000 a year at Walmart.
◼ The result is a general profile of these customers, such
as that they are 40 to 50 years old, employed, and have
excellent credit ratings.
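The two steps described above (collect the target class with a query, then summarize it) can be sketched as follows. This is a minimal illustration, not a prescribed method: the customer records, field names, and chosen summary statistics are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; every field and value is made up.
customers = [
    {"age": 45, "employed": True, "credit": "excellent", "annual_spend": 6200},
    {"age": 48, "employed": True, "credit": "excellent", "annual_spend": 7500},
    {"age": 23, "employed": False, "credit": "fair", "annual_spend": 900},
    {"age": 42, "employed": True, "credit": "good", "annual_spend": 5100},
    {"age": 67, "employed": False, "credit": "good", "annual_spend": 1300},
]

# Step 1: collect the target class with a query-like filter.
target = [c for c in customers if c["annual_spend"] > 5000]

# Step 2: summarize the general features of the target class.
ages = [c["age"] for c in target]
profile = {
    "count": len(target),
    "age_range": (min(ages), max(ages)),
    "mean_age": mean(ages),
    "employed_share": sum(c["employed"] for c in target) / len(target),
    "most_common_credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```

In a real system the filter in step 1 would typically be a database query, and the summary would cover many more attributes.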
◼ Data discrimination is a comparison of the general
features of the target class data objects against the
general features of objects from one or multiple
contrasting classes.
◼ The target and contrasting classes can be specified by a user, and the
corresponding data objects can be retrieved through database queries.
◼ For example, a user may want to compare the general features of software
products whose sales increased by 10% last year against those whose
sales decreased by at least 30% during the same period.
◼ As another example, a customer relationship manager at Samsung may want to
compare two groups of customers: those who shop for computer
products regularly (e.g., more than twice a month) and those who rarely
shop for such products (e.g., fewer than three times a year).
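The customer comparison above can be sketched as a side-by-side summary of the target and contrasting classes. The records, field names, and the two summary features below are hypothetical; only the two frequency thresholds come from the example.

```python
from statistics import mean

# Hypothetical customers; all values are made up for illustration.
customers = [
    {"visits_per_year": 30, "age": 27, "has_degree": True},
    {"visits_per_year": 36, "age": 34, "has_degree": True},
    {"visits_per_year": 26, "age": 31, "has_degree": False},
    {"visits_per_year": 1, "age": 68, "has_degree": False},
    {"visits_per_year": 2, "age": 17, "has_degree": False},
    {"visits_per_year": 10, "age": 45, "has_degree": True},  # in neither class
]

# Target class: more than twice a month. Contrasting class: fewer than three times a year.
frequent = [c for c in customers if c["visits_per_year"] > 24]
rare = [c for c in customers if c["visits_per_year"] < 3]

def summarize(group):
    """Summarize the same general features for any class of customers."""
    return {
        "n": len(group),
        "mean_age": mean(c["age"] for c in group),
        "degree_share": sum(c["has_degree"] for c in group) / len(group),
    }

comparison = {"frequent": summarize(frequent), "rare": summarize(rare)}
for name, summary in comparison.items():
    print(name, summary)
```

Using one `summarize` function for both classes keeps the comparison like-for-like, which is the point of discrimination as opposed to characterization.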
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ Frequent patterns are patterns that occur
frequently in data.
◼ What items are frequently purchased together at
Walmart?
◼ Association analysis
◼ Suppose that, as a marketing manager at Walmart,
you want to know which items are frequently
purchased together (i.e., within the same
transaction).
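◼ An example of such a rule, in predicate notation, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%],
where X is a variable representing a customer. A support of 1% means that 1% of
all transactions contain both items; a confidence of 50% means that half of the
customers who buy a computer also buy software.
◼ This association rule involves a single attribute or predicate (i.e., buys) that
repeats. Association rules that contain a single predicate are referred to as
single-dimensional association rules.
◼ Dropping the predicate notation, the rule can be written simply as
“computer→software [1%, 50%].”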
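The two bracketed figures in a rule such as computer→software [1%, 50%] are its support and confidence. The sketch below shows how both could be computed from transaction data. The transactions are hypothetical, so the resulting values differ from the 1% and 50% quoted in the rule.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer -> software [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Here 2 of the 8 transactions contain both items and 4 contain a computer, so the rule has 25% support and 50% confidence on this toy data.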
◼ Suppose, instead, that we are given the Samsung relational database related to
purchases. A data mining system may find association rules like:
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]
◼ The rule indicates that of the Samsung customers under study, 2% are 20
to 29 years old with an income of $40,000 to $49,000 and have purchased
a laptop (computer) at Samsung. There is a 60% probability that a
customer in this age and income group will purchase a laptop.
◼ Note that this is an association involving more than one attribute
or predicate (i.e., age, income, and buys).
◼ The above rule can be referred to as a multidimensional
association rule.
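A multidimensional rule can be evaluated the same way as a single-dimensional one, except that the antecedent tests several attributes of each record. The sketch below uses hypothetical customer records, so its support and confidence differ from the 2% and 60% quoted above.

```python
# Hypothetical customer records; all values are made up.
records = [
    {"age": 24, "income": 45000, "buys": "laptop"},
    {"age": 27, "income": 42000, "buys": "laptop"},
    {"age": 22, "income": 48000, "buys": "phone"},
    {"age": 35, "income": 45000, "buys": "laptop"},
    {"age": 28, "income": 70000, "buys": "tablet"},
]

def antecedent(r):
    """age in 20..29 AND income in 40K..49K (two predicates, two attributes)."""
    return 20 <= r["age"] <= 29 and 40000 <= r["income"] <= 49000

def consequent(r):
    """buys = laptop (a third attribute)."""
    return r["buys"] == "laptop"

n_antecedent = sum(1 for r in records if antecedent(r))
n_both = sum(1 for r in records if antecedent(r) and consequent(r))

rule_support = n_both / len(records)    # share of all customers matching the whole rule
rule_confidence = n_both / n_antecedent # share of the age/income group that bought a laptop
print(n_antecedent, n_both, rule_support, rule_confidence)
```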
Data Mining Function: (3) Classification
◼ Typical methods
◼ Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression.
◼ A decision tree is a flowchart-like tree structure, where each
internal (non-leaf) node denotes a test on an attribute value, each
branch represents an outcome of the test, and tree leaves represent
classes or class distributions.
◼ A neural network, when used for classification, is typically a
collection of neuron-like processing units with weighted
connections between the units.
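The decision tree structure described above can be sketched as nested dictionaries: a dictionary is a test on an attribute, its keys are the outcomes, and a plain string is a leaf holding a class. The tree below is hand-built for illustration; its attributes, outcomes, and class labels are hypothetical and were not learned from data.

```python
# A hand-built tree; attribute names, outcomes, and labels are hypothetical.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit",
            "branches": {"excellent": "buys", "fair": "does_not_buy"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

label_a = classify(tree, {"age": "youth", "student": "yes"})
label_b = classify(tree, {"age": "senior", "credit": "fair"})
label_c = classify(tree, {"age": "middle_aged"})
print(label_a, label_b, label_c)
```

Learning such a tree from training data is a separate step that this sketch does not cover.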
◼ Regression analysis is a statistical methodology that is most
often used for numeric prediction, although other methods
exist as well. Regression also encompasses the identification
of distribution trends based on the available data.
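As a minimal sketch of numeric prediction, the code below fits a straight line by ordinary least squares and uses it to predict a new value. Simple linear regression is only one form of regression analysis, and the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # numeric prediction for an unseen x
print(slope, intercept, prediction)
```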
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data.
◼ Noise or exception? One person’s garbage could be
another person’s treasure.
◼ Methods: a by-product of clustering or regression analysis, …
◼ Useful in fraud detection and rare-event analysis
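One way outliers can fall out of regression analysis is through the residuals: fit a line, then flag points that lie unusually far from it. The sketch below assumes a straight-line fit and an arbitrary cutoff of two standard deviations; both choices and the data are illustrative assumptions.

```python
from statistics import pstdev

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_outliers(xs, ys, k=2.0):
    """Return points whose residual magnitude exceeds k residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * spread]

# Hypothetical data following y = 2x, except for the point at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 30, 12, 14, 16]

outliers = residual_outliers(xs, ys)
print(outliers)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application.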
Data Mining: Confluence of Multiple Disciplines
Applications of Data Mining
◼ Business intelligence (BI) technologies provide
historical, current, and predictive views of business
operations.
◼ Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and
predictive analytics.
◼ Web page analysis: from web page classification and clustering to
the PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster
analysis (microarray data analysis), biological sequence
analysis, biological network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to
invisible data mining
Major Issues in Data Mining (1)
◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
A Brief History of Data Mining Society
Conferences and Journals on Data Mining
Where to Find References? DBLP, CiteSeer, Google
Summary
◼ Data mining is the process of discovering interesting patterns from massive
amounts of data.
◼ A pattern is interesting if it is valid on test data with some degree of certainty,
novel, potentially useful and easily understood by humans.
◼ Data mining can be conducted on any kind of data as long as the data are
meaningful for a target application, such as database data, data warehouse
data, transactional data, and advanced data types.
◼ A data warehouse is a repository for long-term storage of data from multiple
sources, organized so as to facilitate management decision-making.
◼ Data mining functionalities are used to specify the kinds of patterns or
knowledge to be found in data mining tasks.
◼ There are many challenging issues in data mining research. Areas include
mining methodology, user interaction, efficiency and scalability, and dealing
with diverse data types.
Recommended Reference Books
◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann,
2002
◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data
Mining. AAAI/MIT Press, 1996
◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan
Kaufmann, 2001
◼ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
◼ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
◼ https://www.nature.com/subjects/computationalscience#:~:text=Computational%20science%20is%20a%20discipline,a%20scientific%20system%20or%20process.
Activity 1
◼ What is data mining? In your answer, address the following: (5 pts. each)
◼ a) Is it another hype?