Data Mining

Download as pdf or txt
Download as pdf or txt
You are on page 1of 68

+

Data Mining

Kusum Lata, Assistant Professor, DTU


+
What is data mining?
 Data mining is about mining data. Just as gold
miners sieve through a lot of material to find the
precious metal, in data mining we sieve through a
lot of data to find useful information.

 Data mining is the art and science of extracting


hidden information from large datasets.

 Due to huge amount of data being created from


normal business operations and competitive
business environment, there is a demand to utilize
this data to make effective decisions.
+
Data query vs. Data mining?
 Data query: is about searching in data when we
know exactly what we are looking for.
 E.g. a list of customers who used master card for a
purchase.

 Data mining:
 Find all the items which are frequently purchased
with bread.
 Construct some rules about the life style of residents
of a locality that may reduce the risk of occurrence of
diabetes at an early age.
+
Decision support progress to data
mining
+
Knowledge discovery phases
+
Knowledge discovery phases
Step 1: Define Business Objectives.
State your objectives. Are you looking to improve
your direct marketing campaigns? Do you want to
detect fraud in credit card usage? Are you looking
for associations between products that sell together?
Step 2: Prepare Data. This step consists of data
selection, pre-processing of data, and data
transformation. Select the data to be extracted from
the data warehouse. Pre-processing is meant to
improve the quality of selected data. When you
select from the data warehouse, it is assumed
that the data is already cleansed.
+
Knowledge discovery phases
Step 3: Perform Data Mining. Obviously, this is the crucial
step. Apply the selected algorithm to the prepared data.
The output from this step is a set of relationships or
patterns.

Step 4: Evaluate results: There are potentially many


patterns or relationships. In this step, you examine all the
resulting patterns. You will apply a filtering mechanism
and select only the promising patterns to be presented
and applied.
+
Knowledge discovery phases
Step 5: Present Discoveries. Presentation of
knowledge discoveries may be in the form of visual
navigation, charts, graphs, or free-form texts.
Presentation also includes storing of interesting
discoveries in the knowledge base for repeated use.

Step 6: Incorporate Usage of Discoveries: This step is


for using the results to create actionable items in the
business. You assemble the results of the discovery
in the best way so that they can be exploited to
improve the business. .
+
Data mining tasks
Descriptive

Descriptive data mining characterize the general


properties of data in the database.

Predictive

Predictive data mining tasks performs inferences on


the current data in order to make predcitions.
+
Classification
 Classification is a the data analysis where a model
or classifier is constructed to predict class
(categorical) labels, such as “safe” or “risky” for
the loan application data; “yes” or “no” for the
marketing data; or “treatment A,” “treatment B,” or
“treatment C for the medical data.
 These categories can be represented by discrete
values, where the ordering among values has no
meaning.
 Predictor: The model constructed to predict a
continuous-valued function, or a numeric value.
+
How does classification work?
 Data classification is a two-step process, consisting of a
learning step (where a classification model is
constructed) and a classification step (where the model is
used to predict class labels for given data).
 In the learning step (or training phase), where a
classification algorithm builds the classifier by analyzing
or “learning from” a training set made up of database
tuples and their associated class labels.
 Because the class label of each training tuple is
provided, this step is also known as supervised
learning (i.e., the learning of the classifier is
supervised” in that it is told to which class each training
tuple belongs).
+
Step 1- Learning
+
Step 2- Classification
+
Classification: Decision Tree
 Decision tree induction is the learning of decision
trees from class-labeled training tuples. A decision
tree is a flowchart-like tree structure, where each
internal node (non leaf node) denotes a test on an
attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a
class label.

 Learned trees can also be re-represented as sets


of if-then rules to improve human readability.

 Decisiontrees classify instances by sorting them


down the tree from the root to some leaf node.
+
Decision Tree Induction

 Decision tree starts with a ‘root of the tree’ which is


the first node and ends with the terminal nodes
which are known as ‘leaves of the tree’.

 The entire tree is traversed during classification


beginning from the root of the tree until a leaf node
is reached.
+
Basic Methodology

 The most critical aspect of decision tree algorithm


is the attribute selection method employed at each
node of the decision tree.

 Thereare some attributes that split the data more


purely than the other attributes.

 This
means that the values of such attributes are
more with respect to the other attributes.
+
Basic Methodology
 Though the working behind these decision tree
building algorithms is different for different
algorithms, but all of them work on the principle of
Greediness.

 In
this tutorial, we will explain in detail the working
and rationale behind algorithms used for the
decision tree construction and how the selection
regarding the attribute that best splits the given
dataset can be made.

 ID3(Iterative Dichotomiser 3) algorithm, invented


by Ross Quinlan is the most widely used algorithm
used for decision tree construction.
+
Basic Methodology
Root node Branch/Sub-Tree

Decision node Decision node (A)


)A

Terminal node Decision node Terminal node (B) Terminal node (C)

Terminal node Terminal node

Figure 1: A Pictorial Representation of a Decision Tree


+
Basic Terminology

 RootNode: It represents entire population or


sample and this further gets divided into two or
more sets.
 Splitting: It
is a process of dividing a node into two
or more sub-nodes.
 Decision
Node: When a sub-node splits into further
sub-nodes, then it is called decision node.
 Leaf/Terminal Node: Nodes that do not split is
called Leaf or Terminal node.
+
Basic Terminology

 Branch / Sub-Tree: A sub section of entire tree is


called branch or sub-tree.

 Parent and Child Node: A node, which is divided


into sub-nodes is called parent node of sub-nodes
whereas sub-nodes are the child of parent node.
For instance, Node A is a parent of nodes B and C.
+
ID3 Algorithm
Examples: Training Examples
Target-attribute: Attribute to be predicted
Attributes: List of predictors
ID3 (Examples , Target-attribute, Attributes)
 Create a root node for the tree.
 If all examples are positive, return the single-node
tree Root, with label = +
 If all examples are negative, return the single-node
tree Root, with label = -
 If Attributes is empty, return the single-node tree
Root with label = most common value of Target-
attribute in Examples.
+
Splitting Based on Nominal
Attributes
 Multi-way split: Use as many partitions as distinct values.

CarType
Family Luxury
Sports

 Binary split: Divides values into two subsets. Need to


find optimal partitioning.

{Sports, CarType OR {Family, CarType


{Family} Luxury} {Sports}
Luxury}
+
Splitting Based on Ordinal
Attributes
 Multi-way split: Use as many partitions as distinct
values.
Size
Small Large
Medium

 Binary split: Divides values into two subset. Need to


find optimal partitioning.

Size Size
{Small, OR {Medium,
Medium} {Large} Large} {Small}
+
Splitting Based on Ordinal
Attributes
 They can be grouped as long as the grouping does
not violate the order property of the attribute values.

 What about this split?

Size
{Small,
Large} {Medium}

Is not allowed as it disturbs the order.


+
Splitting based on Continuous
Attributes
 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Binary Decision: (A < v) or (A  v)
 consider all possible splits and finds the best cut
 can be more compute intensive

Binary Split Multiway Split


+
Which Attribute Is the Best Classifier?

 We would like to select the attribute that is most


useful for classifying examples.

 We will define a statistical property, called


information gain, that measures how well a
given attribute separates the training
examples according to their target classification
+
Entropy

A measure used from Information Theory in the ID3


algorithm and popularly used in decision tree
construction is that of Entropy.

 Entropy of a dataset measures the impurity of the


dataset.

 Informally,
Entropy can be considered to find out
how disordered the dataset is.
+
Entropy

 Ithas been shown that there is a relationship


between entropy and information.
 That is, higher the uncertainty or entropy of some
data, implies more information is required to
completely describe that data.
 In building a decision tree, the aim is to decrease
the entropy of the dataset until we reach leaf nodes
at which point the subset that we are left with is
pure, or has zero entropy and represents instances
all of one class.
+
Entropy
 Suppose D be the data partition, be a training dataset
of class labelled examples.

 Suppose class label has m distinct values.


+
Infogain
+
Example
+
Which attribute to select?
+
Example
+
Example

 There are two classes, play tennis, “yes” and “no”


in the data. Therefore the entropy can be
calculated as below:

9 9 5 5
Entropy  Info( D)   log 2  log 2
14 14 14 14
 0.643 log 2 0.643  0.357 log 2 0.357
 0.643(0.643)  0.357(1.486)
 0.410  0.521  0.93
+
Example

 Now, the next step is to select highly significant


input variable among all the four input variables
(Outlook, Temperature, Humidity, Wind) that will
split the data more purely than the other attributes.

 Forthis, we will calculate the information gain that


would result in over the entire dataset after
splitting the attributes (Outlook, Temperature,
Humidity,Wind).
+
Example
5
Outlook) = Entropy(C) - Outlook = Sunny) -
Infogain(C 5
Infogain(C // Outlook) = Entropy(C) Entropy(C // Outlook
- 14 Entropy(C = Sunny) -
14
44 Entropy(C / Outlook = Overcast) - 55 Entropy(C / Outlook = rain)
14 Entropy(C / Outlook = Overcast) - 14 Entropy(C / Outlook = rain)
14 14
55 22 2 33
2 3
3 4 44
4 4
4
=
= 0.93-
0.93- 14 (- log 22 5 -
(- 5 log log 22 5 )) -
- 5 log - 14 (- log 22 4 )) -
(- 4 log -
14 5 5 5 5 14 4 4
55 (- 33 log 33 - 22 log 22 )
14 (- 5 log 22 5 - 5 log 22 5 )
14 5 5 5 5
= 0.93- 55 (0.529 + 0.442) - 44 (0) - 55 (0.442 + 0.529)
= 0.93- 14 (0.529 + 0.442) - 14 (0) - 14 (0.442 + 0.529)
14 14 14
=
= 0.93- 0.347 -
0.93- 0.347 - 00 -
- 0.347
0.347
= 0.245bits
= 0.245bits
+
Example

 Wewill perform the above calculation for all the


remaining attributes (Temperature, Humidity,
Wind)

 Then select the attribute for the root node which


will result in the highest reduction in entropy.

 Inother words, the attribute which results in


highest information gain will be selected as the
root node.
+
Example
 Performing the above calculation for all the
remaining attributes gives the following values of
information gain:
Infogain (C| Temperature) = 0.028 bits
Infogain (C| Humidity) = 0.152 bits
Infogain (C|Wind) = 0.048 bits
 We can clearly see that the attribute Outlook
results in the highest reduction in entropy or the
highest information gain.
 We would therefore choose this at the root node
splitting the data up into subsets corresponding to
all the different values for the Outlook attribute.
+ Example

Which attribute
should be
tested here?
+ Example

We select Humidity as it has the highest Gain.


+ Example
{D1, D2,…D14}
9 Yes, 5 No

Outlook

Sunny Overcast Rain

{D1, D2, D8, D9, D11} {D3, D7, D12, D13} {D4, D5, D6, D10, D14}
2 Yes, 3 No 4 Yes, 0 No 3 Yes, 2 No
Humidity Yes ?

Which attribute
should be
tested here?
+
Example

We select Wind as it has the highest Gain.


+
Decision Trees

Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes
+
Example
Construct a Decision Tree using Infogain as the
splitting criteria.
Weekend Weather Parents Money Decision
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay In
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis
+
Solution

Entropy(S) = -pcinema log2(pcinema) -ptennis log2(ptennis) -


pshopping log2(pshopping) -pstay_in log2(pstay_in)

= -(6/10) * log2(6/10) -(2/10) * log2(2/10) -(1/10) *


log2(1/10) -(1/10) * log2(1/10)
= -(6/10) * -0.737 -(2/10) * -2.322 -(1/10) * -3.322 -
(1/10) * -3.322
= 0.4422 + 0.4644 + 0.3322 + 0.3322 = 1.571
+
Solution

 Now,
 Gain (S,Weather) = 0.70
 Gain (S, Parents) = 0.61
 Gain (S, Money) = 0.2816

Therefore, the first attribute selected for splitting is


weather.
 Now,
 Gain (Ssunny, Parents) = 0.918
 Gain (Ssunny, Money) = 0

The chosen attribute is Parents.


+
Solution
 The complete tree is as follows:
+
Decision Trees
+
Problem with Information Gain
Approach
 Biased towards tests with many outcomes (attributes
having a large number of values)
 E.g: An attribute acting as unique identifier will
produce a large number of partitions (1 tuple per
partition).
 Each resulting partition D is pure Info(D) =0
 The information gain is maximized
 Eg: In a data with customer attributes (Customer ID,
Gender and Car Type), a split on “Customer ID” is
preferred as compared to other attributes such as
“Gender” and “Car Type”. But “Customer ID” is not a
predictive attribute.
+
Extension to Information Gain
 C4.5 a successor of ID3 uses an extension to
information gain known as gain ratio.

 It overcomes the bias of Information gain as it


applies a kind of normalization to information gain
using a split information value.

 The split information value represents the potential


information generated by splitting the training data
set D into v partitions, corresponding to v outcomes
on attribute A.

𝑣 |𝐷𝑗|
 S𝑝𝑙𝑖𝑡𝐼𝑛𝑓𝑜𝐴 𝐷 = − 𝑗=1 |𝐷| ∗ log( 𝐷𝑗 /|𝐷|
+
Extension to Information Gain

 Gain ratio is defined as

𝑮𝒂𝒊𝒏(𝑨)
𝑮𝒂𝒊𝒏𝑹𝒂𝒕𝒊𝒐 𝑨 =
𝑺𝒑𝒍𝒊𝒕𝑰𝒏𝒇𝒐𝑨(𝑫)

The attribute with the maximum gain ratio is


selected as the splitting attribute.
+
Example

 ForOutlook
 Gain: 0.247
 Split Info:
 Sunny: 5 examples, Overcast: 4 examples, Rain:
5 examples

 Gain Ratio:
+
Example

 ForHumidity
 Split Info: 1.000
 Gain Ratio: 0.152
Again, Outlook is selected as
 For Temperature it has the highest gain ratio.
 Split
Info: 1.557
 Gain Ratio: 0.019

 For Wind
 Split
Info: 0.985
 Gain Ratio: 0.049
+
Gini Index
 It is used as a splitting criteria in Classification and
Regression Tree (CART).

 Measures the impurity of a data partition D

 m is the number of classes and


 pi : the probability that a tuple in D belongs to class
Ci
+
Gini Index
 The Gini Index considers a binary split for each
attribute A, say D1 and D2. The Gini index of D
given that partitioning is:

 The reduction in impurity is given by:

 The
attribute that maximizes the reduction in
impurity is chosen as the splitting attribute
+
Gini Index
 Examine the partitions resulting from all
possible subsets of {a1…,av}
 2v
possible subsets. We exclude the power
set and the empty set, then we have 2v-2
subsets.
 The subset that gives the minimum Gini
index for attribute A is selected as its
splitting subset
+
Example
+
Example
 Compute the Gini Index for the overall collection of
training examples.

 Using attribute income: there are three values: low,


medium and high Choosing the subset {low, medium}
results in two partitions:
D1 (income ∈ {low, medium} ): 10 tuples
D2 (income ∈ {high} ): 4 tuples
+
Example

The Gini Index measures of the remaining partitions are:

The best binary split for attribute income is on {medium,


high} and {low}
+
Summarizing Splitting Criteria

 Information Gain used in


 ID3

 Gain ratio is used in


 C4.5
 J48

 Gini
Index is used in
 CART
+
Comparing Attribute Selection
Measures

 Information Gain
 Biased towards multivalued attributes

 Gain Ratio
 Tends to prefer unbalanced splits in which one
partition is much smaller than the other
 Gini Index
 Biased towards multivalued attributes
 Has difficulties when the number of classes is large
 Tends to favor tests that result in equal-sized
partitions and purity in both partitions
+
Exercise
 Consider the training examples shown in Table for
a binary classification problem.

(a) What is the entropy of this collection of training


examples ?
+
Exercise

(b) What are the information gains of a1 and a2


relative to these training examples?

(c) What is the best split (among a1 and a2)


according to the information gain?

(d) What is the best split (among a1 and a2)


according to the Gini index?
+
Solution

 Whatis the entropy of this collection of training


examples with respect to the positive class?

 There are four positive examples and five negative


examples. Thus, P(+) = 4/9 and P(−) = 5/9. The
entropy of the training examples is −4/9 log2(4/9)
− 5/9 log2(5/9) = 0.9911.
+
Exercise
 What are the information gains of a1 and a2 relative
to these training examples?

 The entropy for a1 is

The information gain is 0.9911-0.7616 = 0.2294.


+
Exercise
 What are the information gains of a1 and a2 relative
to these training examples?

 The entropy for a2 is

The information gain is 0.9911-0.9839 = 0.0072.


+
Solution

 What is the best split (among a1 and a2) according


to the information gain?

 According to information gain a1 produces the best


split.
+
Solution
 What is the best split (among a1 and a2) according
to Gini Index?

 𝐹𝑜𝑟 𝑎𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒 𝑎1, 𝐺𝑖𝑛𝑖 𝑖𝑛𝑑𝑒𝑥 𝑖𝑠

 𝐹𝑜𝑟 𝑎𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒 𝑎2 𝐺𝑖𝑛𝑖 𝑖𝑛𝑑𝑒𝑥 𝑖𝑠

 Since Gini index of a1 is smaller, it is selected for


splitting.

You might also like