Data Science Concepts Lesson04 Decision Tree Concepts
Calculation Table (chi-square test for the Gender split)
Cell        Observed (O)    Expected (E)                (O − E)² / E
Male, D     135             (274 × 300)/1000 = 82.2     33.91
P-value = 3.106E-16
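These figures can be reproduced with scipy's chi2_contingency, using the Gender contingency table shown later in this lesson; setting correction=False disables the Yates continuity correction so the statistic matches the hand calculation.

```python
from scipy.stats import chi2_contingency

# Gender x NPA Status counts: rows = Male, Female; columns = D, ND.
table = [[135, 139],
         [165, 561]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(expected[0][0])   # expected count for (Male, D): 274*300/1000 = 82.2
print(chi2, p)          # p-value on the order of 3.1e-16, as in the table above
```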
\[
i(t) = \mathrm{Gini}(t) = \sum_{j=1}^{k} P(j \mid t)\,\bigl(1 - P(j \mid t)\bigr)
\]
where \(P(j \mid t)\) is the proportion of category \(j\) at node \(t\).
\[
\text{Change in impurity: } \Delta i(t) = i(t) - P_L \, i(t_L) - P_R \, i(t_R)
\]
where \(P_L\) is the proportion of observations in the left branch and \(P_R\) is the proportion of observations in the right branch.
o The variable that maximizes the change in impurity is picked for splitting when building the decision tree (see the sketch after this list).
o For a two-category target, the minimum value of Gini is 0 and the maximum is 0.5 (reached with 50% zeros and 50% ones as the two categories).
o If a variable has more than two classes, the classes are combined into two groups before the Gini index is computed; the number of possible combinations is \(2^{k-1} - 1\), where \(k\) is the number of classes.
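The Gini calculation can be sketched in a few lines of Python (the function names here are illustrative, not from the lesson):

```python
def gini(counts):
    """Gini impurity i(t) = sum_j P(j|t) * (1 - P(j|t)) for the class counts at a node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def delta_impurity(parent, left, right):
    """Change in impurity = i(t) - P_L * i(t_L) - P_R * i(t_R)."""
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n
    return gini(parent) - p_l * gini(left) - p_r * gini(right)
```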
\[
\mathrm{Entropy}(t) = - \sum_{j=1}^{k} P(j \mid t)\,\log_2 P(j \mid t)
\]
where \(P(j \mid t)\) is the proportion of category \(j\) at node \(t\).
o The variable that maximizes the change in impurity (information gain) is picked for splitting when building the decision tree.
o For a two-category target, the minimum value of entropy is 0 and the maximum is 1 (reached with 50% zeros and 50% ones as the two categories).
o If a variable has more than two classes, the classes are combined into two groups before the entropy is computed; the number of possible combinations is \(2^{k-1} - 1\) (see the sketch after this list).
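An analogous sketch for entropy, assuming base-2 logarithms as in the formula above:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j P(j|t) * log2 P(j|t); empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([50, 50]))    # 1.0 -- maximum for two categories (50/50 split)
print(entropy([100, 0]))    # 0.0 -- minimum, a pure node
```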
Node t: D = 300, ND = 700 (N = 1,000)
Split on Gender (X):
  Left node t(L), Male:    D = 135, ND = 139 (total 274)
  Right node t(R), Female: D = 165, ND = 561 (total 726)

Contingency Table
Gender (X)   NPA Status = 1 (D)   NPA Status = 0 (ND)   Total
Male         135                  139                   274
Female       165                  561                   726
Total        300                  700                   1,000
Calculation Table
Node           Proportion of the class          i(t) = Σ P(j|t)·(1 − P(j|t))   Proportion of obs.   i(t)
t (root)       P(D|t)  = 300/1000 = 0.30        0.30·(1 − 0.30) = 0.21
               P(ND|t) = 700/1000 = 0.70        0.70·(1 − 0.70) = 0.21                              0.42
t(L): Male     P(D|t(L))  = 135/274 = 0.49      0.49·(1 − 0.49) = 0.25         274/1000 = 0.274
               P(ND|t(L)) = 139/274 = 0.51      0.51·(1 − 0.51) = 0.25                              0.50
t(R): Female   P(D|t(R))  = 165/726 = 0.23      0.23·(1 − 0.23) = 0.17         726/1000 = 0.726
               P(ND|t(R)) = 561/726 = 0.77      0.77·(1 − 0.77) = 0.17                              0.34

Δi(t) = [0.42 − (0.27 × 0.50) − (0.726 × 0.34)] = 0.038
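Using the gini and delta_impurity helpers sketched earlier, the table can be reproduced; exact arithmetic gives about 0.028, while the 0.038 in the table reflects the intermediate rounding of the node impurities.

```python
parent = [300, 700]   # D, ND at node t
left   = [135, 139]   # t(L): Male
right  = [165, 561]   # t(R): Female

print(round(gini(parent), 2))                        # 0.42
print(round(gini(left), 2))                          # ~0.50
print(round(gini(right), 2))                         # ~0.35 (rounded to 0.34 in the table)
print(round(delta_impurity(parent, left, right), 3)) # 0.028
```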
Classification matrix (predicted vs. observed), observed Class = 0 (Negative) row: f00 = 17 [TN], f01 = 0 [FP]

Specificity = TN / (TN + FP) = 17/17 = 100%
Model accuracy = (TP + TN) / (TP + TN + FP + FN) = 21/24 = 87.5%, which implies TP = 4 and FN = 3.
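A minimal check of the two formulas, assuming the implied counts TP = 4 and FN = 3 (not shown explicitly in the matrix):

```python
# Counts from the classification matrix: FP and TN are given;
# TP and FN are implied by accuracy = 21/24.
tp, tn, fp, fn = 4, 17, 0, 3

specificity = tn / (tn + fp)                 # 17/17 = 1.0  -> 100%
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 21/24 = 0.875 -> 87.5%
print(f"Specificity = {specificity:.0%}, Accuracy = {accuracy:.1%}")
```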
• The receiver operating characteristic (ROC) curve is a useful way to determine the cut-off point that maximizes sensitivity and specificity (see the sketch after this list).
• Sensitivity and specificity are computed for a sequence of cut-off points applied to the model's predicted probabilities to classify observations as Positive or Negative.
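A minimal sketch of cut-off selection from an ROC curve, assuming scikit-learn is available and using made-up labels and scores; Youden's J (sensitivity + specificity − 1) is one common criterion.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical observed labels and model-predicted probabilities.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

# Sensitivity (TPR) and 1 - specificity (FPR) over a sequence of cut-offs.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Pick the cut-off that maximizes sensitivity + specificity.
j = tpr - fpr
print("Best cut-off:", thresholds[j.argmax()])
```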
• Lift and gain charts measure how much better one can expect to do with the model compared to not using a model.
• In contrast to the confusion/classification matrix, which evaluates the model on the whole population, gain and lift charts evaluate model performance on a portion of the population. Cumulative gains and lift charts are graphical representations of the advantage of using a predictive model to choose which customers to contact.
• Gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire data set.
Source: http://www.listendata.com/2014/08/excel-template-gain-and-lift-charts.html
Decision Tree–Lift Chart
• Lift measures how much better one can expect to do with the model compared to not using a model.
• It is the ratio of gain % to the random expectation at a given decile level; the random expectation at the xth decile is x%.
Interpretation:
By contacting only 10% of customers, the model identifies 4.5 times as many responders as random selection would (a lift of 4.5 at the first decile). A sketch for computing decile-wise gain and lift follows.
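A small sketch of the decile-wise gain and lift computation described above; the function name and the example data are illustrative, not from the source.

```python
import numpy as np

def gain_lift_table(y_true, y_score, n_bins=10):
    """Cumulative gain % and lift per decile, sorting observations by model score."""
    order = np.argsort(-np.asarray(y_score))       # highest scores first
    y_sorted = np.asarray(y_true)[order]
    total_events = y_sorted.sum()
    rows = []
    for d in range(1, n_bins + 1):
        cutoff = int(round(d * len(y_sorted) / n_bins))
        events = y_sorted[:cutoff].sum()           # cumulative targets up to decile d
        gain = 100.0 * events / total_events       # gain %
        lift = gain / (100.0 * d / n_bins)         # gain % / random expectation (d*10%)
        rows.append((d, round(gain, 1), round(lift, 2)))
    return rows

# Example with random scores: lift hovers around 1 because this "model" is uninformative.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = rng.random(1000)
print(gain_lift_table(y, scores)[:3])
```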
• Model overfitting:
  o The model has learned too much from the training data and does not generalize.
  o As the number of nodes increases, the training error decreases but the test error may increase.
  o The tree is more complex than needed.
Model underfitting or overfitting leads to a lack of generalizability, so such decision tree models may not classify unseen cases correctly.
Pruning is applied to overcome the underfitting or overfitting issues in the decision tree model:
Pre-pruning (early stopping): stop the algorithm before the tree is fully grown:
  o Stop if the number of instances is less than some user-specified threshold.
  o Stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain) by at least some threshold.
  This is more efficient but less accurate.

Post-pruning: grow the decision tree to its entirety, then trim the nodes of the tree in a bottom-up fashion:
  o If the generalization error improves after trimming, replace the sub-tree by a leaf node.
  o The class label of the leaf node is determined from the majority class of instances in the sub-tree.
  This is more accurate but less efficient.
Misclassification error pruning: decision tree pruning stops when the number of cases in a terminal node becomes less than a threshold (see the sketch below).
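As a sketch of how these ideas map onto scikit-learn's DecisionTreeClassifier (the parameter values here are illustrative, not from the lesson): min_samples_leaf and min_impurity_decrease act as pre-pruning thresholds, while ccp_alpha applies cost-complexity post-pruning to the fully grown tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via user-specified thresholds.
pre_pruned = DecisionTreeClassifier(
    criterion="gini",
    min_samples_leaf=20,           # stop if a leaf would hold fewer cases than this
    min_impurity_decrease=0.001,   # stop if the impurity improvement is below this
    random_state=0,
).fit(X_train, y_train)

# Post-pruning: grow fully, then trim with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```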
b. Specificity is the probability that the predicted class is 1 when the observed class is 0.
c. Specificity is the probability that the predicted class is 0 when the observed class is 0.