Data Mining
Data mining:
Find all items that are frequently purchased together with bread.
Construct rules about the lifestyle of residents of a locality that may reduce the risk of diabetes occurring at an early age.
+
Decision support progresses to data mining
+
Knowledge discovery phases
Step 1: Define Business Objectives.
State your objectives. Are you looking to improve
your direct marketing campaigns? Do you want to
detect fraud in credit card usage? Are you looking
for associations between products that sell together?
Step 2: Prepare Data. This step consists of data
selection, pre-processing of data, and data
transformation. Select the data to be extracted from
the data warehouse. Pre-processing is meant to
improve the quality of selected data. When you
select from the data warehouse, it is assumed
that the data is already cleansed.
+
Knowledge discovery phases
Step 3: Perform Data Mining. Obviously, this is the crucial
step. Apply the selected algorithm to the prepared data.
The output from this step is a set of relationships or
patterns.
Predictive attributes: the values of such attributes are more informative with respect to the other attributes.
+
Basic Methodology
Although the internal working of these decision tree building algorithms differs from algorithm to algorithm, all of them operate on the principle of greediness.
In this tutorial, we will explain in detail the working and rationale behind the algorithms used for decision tree construction, and how to select the attribute that best splits the given dataset.
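To make the greedy principle concrete, here is a minimal Python sketch (illustrative names, not from any particular library) that grows a tree by always choosing the attribute with the highest information gain at the current node and never revisiting that choice:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Greedy top-down construction: take the locally best split at each node."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    subtree = {}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        subtree[value] = build_tree([rows[i] for i in keep],
                                    [labels[i] for i in keep],
                                    [a for a in attrs if a != best])
    return (best, subtree)
```

Here each row is a dict mapping attribute names to values, and attrs is the list of attribute names still available for splitting.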
[Figure: a decision node with terminal (leaf) nodes. A nominal attribute such as CarType can be split multi-way into Family, Sports, and Luxury; Size can be split binary into {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}.]
+
Splitting Based on Ordinal Attributes
Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.
[Figure: the grouping Size into {Small, Large} vs {Medium} violates the order property.]
Informally, entropy can be thought of as a measure of how disordered the dataset is.
+
Entropy
Entropy = Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14)
= -0.643 log2(0.643) - 0.357 log2(0.357)
= 0.643 × 0.637 + 0.357 × 1.486
= 0.410 + 0.530
= 0.94
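The same figure can be checked with a couple of lines of Python (assuming, as in the calculation above, 9 examples of one class and 5 of the other):

```python
import math

p_yes, p_no = 9 / 14, 5 / 14
info_d = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(info_d, 2))   # 0.94
```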
+
Example
Which attribute should be tested here?
+
Example
[Figure: tree after splitting on Outlook. Sunny branch {D1, D2, D8, D9, D11}: 2 Yes, 3 No, split further on Humidity; Overcast branch {D3, D7, D12, D13}: 4 Yes, 0 No, labelled Yes; Rain branch {D4, D5, D6, D10, D14}: 3 Yes, 2 No, still undecided.]
Which attribute should be tested here?
+
Example
[Figure: the completed decision tree rooted at Outlook, whose branches end in No/Yes leaf nodes.]
+
Example
Construct a decision tree using information gain as the splitting criterion.
Weekend Weather Parents Money Decision
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay In
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis
+
Solution
Now,
Gain (S,Weather) = 0.70
Gain (S, Parents) = 0.61
Gain (S, Money) = 0.2816
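These gains can be reproduced with a short script (a sketch, with the table above transcribed as a Python literal):

```python
import math
from collections import Counter

# (Weather, Parents, Money, Decision), transcribed from the table above
data = [
    ("Sunny", "Yes", "Rich", "Cinema"),    # W1
    ("Sunny", "No",  "Rich", "Tennis"),    # W2
    ("Windy", "Yes", "Rich", "Cinema"),    # W3
    ("Rainy", "Yes", "Poor", "Cinema"),    # W4
    ("Rainy", "No",  "Rich", "Stay In"),   # W5
    ("Rainy", "Yes", "Poor", "Cinema"),    # W6
    ("Windy", "No",  "Poor", "Cinema"),    # W7
    ("Windy", "No",  "Rich", "Shopping"),  # W8
    ("Windy", "Yes", "Rich", "Cinema"),    # W9
    ("Sunny", "No",  "Rich", "Tennis"),    # W10
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    groups = {}
    for row in data:
        groups.setdefault(row[col], []).append(row[-1])
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for name, col in [("Weather", 0), ("Parents", 1), ("Money", 2)]:
    print(name, round(gain(col), 2))   # Weather 0.7, Parents 0.61, Money 0.28
```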
SplitInfoA(D) = − Σ j=1..v (|Dj| / |D|) × log2(|Dj| / |D|)
+
Extension to Information Gain
GainRatio(A) = Gain(A) / SplitInfoA(D)
For Outlook
Gain: 0.247
Split Info: 1.577 (Sunny: 5 examples, Overcast: 4 examples, Rain: 5 examples)
Gain Ratio: 0.157
+
Example
For Humidity
Split Info: 1.000
Gain Ratio: 0.152
For Temperature
Split Info: 1.557
Gain Ratio: 0.019
For Wind
Split Info: 0.985
Gain Ratio: 0.049
Again, Outlook is selected as it has the highest gain ratio.
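A minimal sketch of the split-info calculation for Outlook, using the 5/4/5 branch sizes and the gain of 0.247 quoted above:

```python
import math

def split_info(branch_sizes):
    """SplitInfoA(D) = -sum((|Dj|/|D|) * log2(|Dj|/|D|))."""
    total = sum(branch_sizes)
    return -sum(s / total * math.log2(s / total) for s in branch_sizes)

si = split_info([5, 4, 5])        # Sunny, Overcast, Rain
print(round(si, 3))               # 1.577
print(round(0.247 / si, 3))       # gain ratio: 0.157
```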
+
Gini Index
It is used as the splitting criterion in Classification and Regression Trees (CART).
The attribute that maximizes the reduction in impurity is chosen as the splitting attribute.
+
Gini Index
Examine the partitions resulting from all possible subsets of {a1, ..., av}.
There are 2^v possible subsets; excluding the full set and the empty set leaves 2^v - 2 candidate subsets.
The subset that gives the minimum Gini index for attribute A is selected as its splitting subset.
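The sketch below (with a hypothetical CarType column, used only for illustration) enumerates these candidate groupings and keeps the binary split with the lowest weighted Gini index:

```python
from itertools import combinations
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every proper, non-empty subset of the attribute's values (2^v - 2 of
    them, each grouping appearing once as a subset and once as its complement)
    and return the split with minimum weighted Gini index."""
    distinct = sorted(set(values))
    n = len(labels)
    best = None
    for r in range(1, len(distinct)):
        for left in combinations(distinct, r):
            left_lbl = [y for v, y in zip(values, labels) if v in left]
            right_lbl = [y for v, y in zip(values, labels) if v not in left]
            weighted = (len(left_lbl) / n) * gini(left_lbl) \
                     + (len(right_lbl) / n) * gini(right_lbl)
            if best is None or weighted < best[0]:
                best = (weighted, set(left))
    return best

# Hypothetical data, for illustration only
car_type = ["Family", "Sports", "Sports", "Luxury", "Family", "Luxury"]
label    = ["No",     "Yes",    "Yes",    "No",     "No",     "No"]
print(best_binary_split(car_type, label))   # (0.0, {'Sports'})
```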
+
Example
Compute the Gini Index for the overall collection of
training examples.
Gini Index is used in CART.
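As a worked illustration of the formula (assuming, for instance, the same class distribution as in the earlier entropy example, with 9 examples of one class and 5 of the other):
Gini(D) = 1 - Σ pi² = 1 - (9/14)² - (5/14)² = 1 - 0.413 - 0.128 = 0.459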
+
Comparing Attribute Selection
Measures
Information Gain
Biased towards multivalued attributes
Gain Ratio
Tends to prefer unbalanced splits in which one
partition is much smaller than the other
Gini Index
Biased towards multivalued attributes
Has difficulties when the number of classes is large
Tends to favor tests that result in equal-sized
partitions and purity in both partitions
+
Exercise
Consider the training examples shown in the table for a binary classification problem.