Introduction To Data Mining
Grading
60% during the semester:
10% Course activity (course attendance)
20% Midterm exam (multiple-choice questions)
30% Project (project attendance, algorithm
presentation, project delivery)
Road Map
What is data mining
Steps in data mining process
Data mining methods and subdomains
Summary
DMDW-1
Definition ([Wikipedia])
Data mining (the analysis step of the knowledge
discovery in databases process, or KDD) is the process
of discovering new patterns from large data sets
involving methods at the intersection of artificial
intelligence, machine learning, statistics and database
systems.
The goal of data mining is to extract knowledge from a
data set in a human-understandable structure and
involves database and data management, data
preprocessing, model and inference considerations,
interestingness metrics, complexity considerations, postprocessing of found structure, visualization and online
updating.
Conclusions
The data mining process converts data into valuable
knowledge that can be used for decision support
Data mining is a collection of data analysis
methodologies, techniques and algorithms for
discovering new patterns
Data mining is used for large data sets
The data mining process is automated (no need for human
intervention)
Data mining and Knowledge Discovery in Databases
(KDD) are considered by some authors to be the same
thing. Other authors list data mining as the analysis step
in the KDD process - after data cleaning and
transformation and before results visualization /
evaluation.
DM software (1)
In [Mikut, Reischl 11], DM software programs are classified into 9
categories:
Data mining suites (DMS) focus on data mining and include
numerous methods and support feature tables and time series.
Examples:
Commercial: IBM SPSS Modeler, SAS Enterprise Miner, DataEngine,
GhostMiner, Knowledge Studio, NAG Data Mining Components,
STATISTICA
Open source: RapidMiner
DM software (2)
Mathematical packages (MATs) provide a large and extendable set
of algorithms and visualization routines. Examples:
Commercial: MATLAB, R-PLUS
DM software (3)
Data mining libraries (LIBs) implement data mining methods as a
bundle of functions and can be embedded in other software tools
using an Application Programming Interface. Examples: Neurofusion
for C++, WEKA, MLC++, JAVA Data Mining Package, LibSVM
Communities involved
The most important communities involved in
data mining
[Figure: diagram with the labels STATISTICS, DATABASE SYSTEMS, AI, DATA MINING, CLUSTERING, VISUALIZATION]
Method types
Algorithms
Prediction algorithm types:
Classification
Regression
Deviation Detection
Classification
Input:
A set of k classes $C = \{c_1, c_2, \ldots, c_k\}$
A set of n labeled items $D = \{(d_1, c_{i_1}), (d_2, c_{i_2}), \ldots, (d_n, c_{i_n})\}$.
The items are $d_1, \ldots, d_n$, each item $d_j$ being labeled
with class $c_{i_j} \in C$. D is called the training set.
For calibration of some algorithms a validation set is
required. This validation set also contains labeled items
not included in the training set.
Output:
A model or method for classifying new items. The set of
new items that will be classified using the model/method
is called the test set
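The input/output description above can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbour classifier with a Hamming distance, which is only one of many possible classification methods and is not taken from the course; the function names, attribute flags and class labels ("urgent", "routine") are made up for illustration.

```python
# Minimal 1-nearest-neighbour classifier (illustrative sketch; toy data is made up).

def hamming(a, b):
    """Number of positions at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(training_set, item):
    """Return the label of the training item closest to `item`."""
    _, nearest_label = min(
        training_set, key=lambda pair: hamming(pair[0], item)
    )
    return nearest_label

# Training set D: (item, class label) pairs; attributes are yes/no flags.
training_set = [
    ((1, 1, 0, 0), "urgent"),
    ((1, 0, 0, 0), "urgent"),
    ((0, 0, 1, 1), "routine"),
    ((0, 1, 1, 1), "routine"),
]

# Test set: new, unlabeled items to be classified with the model.
test_set = [(1, 1, 1, 0), (0, 0, 0, 1)]
predictions = [classify(training_set, item) for item in test_set]
print(predictions)
```

Here the "model" is simply the stored training set plus the distance function; other algorithms (decision trees, rule lists) produce an explicit model instead.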
Example
Let us consider a medical set of items where each item
is a patient of a hospital emergency unit.
o There are 5 classes, representing maximum waiting time
categories: C0, C10, C30, C60 and C120, where Ck means the
patient waits at most k minutes.
o We may represent these data in tabular format
o The output of a classification algorithm using this training set
may be for example a decision tree or a set of ordered rules.
o The model may be used to classify future patients and assign a
waiting time label to them
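| Name (or ID) | Vital risk? | Danger if waits? | 0 resource needed | 1 resource needed | >1 resource needed | >1 resource needed and vital functions affected | Waiting time (class label) |
|---|---|---|---|---|---|---|---|
| John | Yes | Yes | No | Yes | No | No | C0 |
| Maria | No | Yes | No | No | Yes | No | C10 |
| Nadia | Yes | Yes | Yes | No | No | No | C0 |
| Omar | No | No | No | No | Yes | Yes | C30 |
| Kiril | No | No | No | Yes | No | Yes | C60 |
| Denis | No | No | No | No | Yes | No | C10 |
| Jean | No | No | Yes | Yes | No | No | C120 |
| Patricia | Yes | Yes | No | No | Yes | Yes | C60 |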
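New patient to classify (attribute values in table column order):
Yes, Yes, No, No, No, Yes; class label: ?????
The class label assigned to this patient will be C0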
Regression (1)
Regression originates in statistics.
Meaning: predicting the value of a given
continuous-valued variable based on the values
of other variables, assuming a linear or
nonlinear model of dependency ([Tan,
Steinbach, Kumar 06]).
Used in prediction and forecasting, where its use
overlaps with machine learning.
Regression analysis is also used to understand the
relationship between the independent variables and
the dependent variable and, in restricted
circumstances, can be used to infer causal
relationships between them.
Regression (2)
- There are many types of regression.
For example, Wikipedia lists:
Linear regression model
Simple linear regression
Logistic regression
Nonlinear regression
Nonparametric regression
Robust regression
Stepwise regression
Example
Linear regression example
(from http://en.wikipedia.org/wiki/File:Linear_regression.svg)
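As a complement to the figure, simple linear regression can be sketched in a few lines. The sketch below assumes an ordinary least-squares fit of a line y = slope * x + intercept; the function name `fit_line` and the data points are made up for illustration and are not the data shown in the figure.

```python
# Simple linear regression by ordinary least squares (illustrative sketch).

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations: independent variable xs, dependent variable ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)

# Use the fitted model to predict the dependent variable for a new x.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

The fitted line is the model; prediction for a new value of the independent variable is a single evaluation of that line.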
Deviation detection
Deviation detection, or anomaly detection, means discovering
significant deviations from normal behavior. Outliers are a
significant category of abnormal data.
Deviation detection can be used in many circumstances:
Data mining algorithm running stage: such information is often
important for business decisions and scientific discovery.
Auditing: such information can reveal problems or malpractice.
Fraud detection in a credit card system: fraudulent claims often
carry inconsistent information that can reveal fraud cases.
Intrusion detection in a computer network may rely on abnormal
data.
Data cleaning (part of data preprocessing): such information can
be detected, and possible mistakes may be corrected, at this
stage.
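One simple way to flag outliers in a numeric attribute is a z-score test. The sketch below assumes that approach (it is not prescribed by the course); the threshold of 2 standard deviations, the function name `find_outliers` and the sample values are illustrative assumptions.

```python
# Z-score outlier detection (illustrative sketch; threshold and data are assumptions).
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Made-up measurements with one obviously abnormal value.
values = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = find_outliers(values)
print(outliers)
```

Because the mean and standard deviation are themselves affected by extreme values, this test is crude on small samples; it is meant only to show the idea of measuring deviation from normal behavior.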
Clustering
Input:
A set of n objects $D = \{d_1, d_2, \ldots, d_n\}$ (usually called points).
The objects are not labeled and there is no set of class labels
defined.
A distance function (dissimilarity measure) that can be used to
compute the distance between any two points. A low distance
means near; a high distance means far.
Some algorithms also need a predefined value for the number
of clusters in the produced result.
Output:
A set of object (point) groups called clusters, where points in
the same cluster are near each other and points from
different clusters are far from each other, according to the
distance function.
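The input/output description above can be illustrated with a minimal sketch of k-means, one well-known algorithm that needs the number of clusters in advance. The choice of k-means, the Euclidean distance, the naive initialization (first k points) and the sample points are all assumptions made for illustration.

```python
# Minimal k-means clustering (illustrative sketch, not a production implementation).
import math

def kmeans(points, k, iterations=10):
    """Group `points` (tuples of numbers) into k clusters; returns a list of k lists."""
    centroids = list(points[:k])  # naive initialization: first k points (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Made-up 2-dimensional points forming two visibly separate groups.
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
clusters = kmeans(points, k=2)
print(clusters)
```

The result depends on the initialization; practical implementations usually try several random starts and stop when the assignments no longer change.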
Example
- Given a set of points in a 2-dimensional space, find the
natural clusters formed by these points.
[Figure: two panels, INITIAL and AFTER CLUSTERING]
Association rules
A rule is a construction $X \rightarrow Y$, where X and Y are
itemsets.
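- The support of a rule is the support of the itemset $X \cup Y$:
$\mathrm{support}(X \rightarrow Y) = \mathrm{support}(X \cup Y)$
- The confidence of a rule is the proportion of transactions
containing Y in the set of transactions containing X:
$\mathrm{confidence}(X \rightarrow Y) = \mathrm{support}(X \cup Y) / \mathrm{support}(X)$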
Output:
The set of frequent itemsets in T, having support >= s
The set of rules derived from T
having support >= s and confidence >= c
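Support and confidence can be computed directly from a list of transactions. The sketch below assumes that support is measured as the fraction of transactions containing an itemset (some texts use an absolute count instead); the function names and the transactions are made up for illustration.

```python
# Support and confidence of an association rule X -> Y (illustrative sketch).

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X u Y) / support(X): how often Y appears when X appears."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Made-up transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is reported only if both values reach the chosen thresholds s and c.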
Example
Consider the following set of transactions:
[Table with columns Transaction ID and Items; rows not recovered]
Sequences
The model:
- Itemset: a set of n distinct items
$I = \{i_1, i_2, \ldots, i_n\}$
- Event: a non-empty collection of items;
- we can assume that items are in a given (e.g.
lexicographic) order: $(i_1, i_2, \ldots, i_k)$
Output:
- The set of frequent sequences, i.e. the set of sequences that are
included in at least s sequences from S.
- Sometimes a set of rules can be derived from the set of frequent
sequences, each rule being of the form $S_1 \rightarrow S_2$, where
$S_1$ and $S_2$ are sequences.
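Counting how many sequences include a given pattern can be sketched as follows. The sketch assumes the usual notion of inclusion (each event of the pattern is a subset of a distinct event of the sequence, in the same order) and represents a sequence as a list of events and an event as a set of items; these representations, the function names and the data are assumptions made for illustration.

```python
# Sequence inclusion and support counting (illustrative sketch).

def contains(sequence, pattern):
    """True if every event of `pattern` is a subset of a distinct event of
    `sequence`, with the events matched in the same order."""
    position = 0
    for event in pattern:
        # Advance to the first remaining event that includes this pattern event.
        while position < len(sequence) and not event <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern event must match a later event
    return True

def sequence_support(pattern, sequences):
    """Number of sequences from `sequences` that include `pattern`."""
    return sum(contains(s, pattern) for s in sequences)

# Made-up set S of sequences; each sequence is a list of events (sets of items).
S = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"d"}],
    [{"b"}, {"a"}, {"d"}],
    [{"c"}, {"d"}],
]

support_a_d = sequence_support([{"a"}, {"d"}], S)
support_a_b = sequence_support([{"a"}, {"b"}], S)
print(support_a_d, support_a_b)
```

A pattern is frequent when this count is at least the threshold s.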
Examples
- In a bookstore we can find frequent sequences like:
Summary
This first course presented:
A list of alternative definitions of Data Mining and some examples of
what is Data Mining and what is not Data Mining
A discussion of the research communities involved in Data
Mining and of the fact that Data Mining is a cluster of
subdomains
The steps of the Data Mining process from collecting data located in
existing repositories (data warehouses, archives or operational
systems) to the final evaluation step.
A brief description of the main subdomains of Data Mining with some
examples for each of them.
Next week: Data preprocessing
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents, and
Usage Data, Second Edition, Springer, 1-13.
[Tan, Steinbach, Kumar 06] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, 2006.
Introduction to Data Mining, Addison-Wesley, 1-16.
[Kimball, Ross 02] Ralph Kimball, Margy Ross, 2002. The Data Warehouse Toolkit,
Second Edition, John Wiley and Sons, 1-16, 396
[Mikut, Reischl 11] Ralf Mikut and Markus Reischl, Data mining tools, 2011, Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 1, Issue
5, http://onlinelibrary.wiley.com/doi/10.1002/widm.24/pdf
[Ullman] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009, web page:
http://infolab.stanford.edu/~ullman/mining/mining.html