Data Science Concepts Lesson04 Decision Tree Concepts
Calculation Table (chi-square test for the Gender split)
Cell        Observed (O)    Expected (E)                (O − E)² / E
Male, D     135             (274 × 300)/1000 = 82.2     33.91
P-value = 3.106E-16
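These figures can be reproduced with scipy's chi2_contingency, using the Gender contingency table shown later in this lesson; setting correction=False disables the Yates continuity correction so the statistic matches the hand calculation.

```python
from scipy.stats import chi2_contingency

# Gender x NPA Status counts: rows = Male, Female; columns = D, ND.
table = [[135, 139],
         [165, 561]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(expected[0][0])   # expected count for (Male, D): 274*300/1000 = 82.2
print(chi2, p)          # p-value on the order of 3.1e-16, as in the table above
```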
\[
i(t) = \mathrm{Gini}(t) = \sum_{j=1}^{k} P(j \mid t)\,\bigl(1 - P(j \mid t)\bigr)
\]
where \(P(j \mid t)\) is the proportion of category \(j\) at node \(t\).
\[
\text{Change in impurity: } \Delta i(t) = i(t) - P_L \, i(t_L) - P_R \, i(t_R)
\]
where \(P_L\) is the proportion of observations in the left branch and \(P_R\) is the proportion of observations in the right branch.
o The variable that maximizes the change in impurity is picked for splitting when building the decision tree (see the sketch after this list).
o For a two-category target, the minimum value of Gini is 0 and the maximum is 0.5 (reached with 50% zeros and 50% ones as the two categories).
o If a variable has more than two classes, the classes are combined into two groups before the Gini index is computed; the number of possible combinations is \(2^{k-1} - 1\), where \(k\) is the number of classes.
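The Gini calculation can be sketched in a few lines of Python (the function names here are illustrative, not from the lesson):

```python
def gini(counts):
    """Gini impurity i(t) = sum_j P(j|t) * (1 - P(j|t)) for the class counts at a node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def delta_impurity(parent, left, right):
    """Change in impurity = i(t) - P_L * i(t_L) - P_R * i(t_R)."""
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n
    return gini(parent) - p_l * gini(left) - p_r * gini(right)
```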
\[
\mathrm{Entropy}(t) = - \sum_{j=1}^{k} P(j \mid t)\,\log_2 P(j \mid t)
\]
where \(P(j \mid t)\) is the proportion of category \(j\) at node \(t\).
o The variable that maximizes the change in impurity (information gain) is picked for splitting when building the decision tree.
o For a two-category target, the minimum value of entropy is 0 and the maximum is 1 (reached with 50% zeros and 50% ones as the two categories).
o If a variable has more than two classes, the classes are combined into two groups before the entropy is computed; the number of possible combinations is \(2^{k-1} - 1\) (see the sketch after this list).
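An analogous sketch for entropy, assuming base-2 logarithms as in the formula above:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j P(j|t) * log2 P(j|t); empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([50, 50]))    # 1.0 -- maximum for two categories (50/50 split)
print(entropy([100, 0]))    # 0.0 -- minimum, a pure node
```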
Node t: D = 300, ND = 700 (N = 1,000)
Split on Gender (X):
  Left node t(L), Male:    D = 135, ND = 139 (total 274)
  Right node t(R), Female: D = 165, ND = 561 (total 726)

Contingency Table
Gender (X)   NPA Status = 1 (D)   NPA Status = 0 (ND)   Total
Male         135                  139                   274
Female       165                  561                   726
Total        300                  700                   1,000
Calculation Table
Node           Proportion of the class          i(t) = Σ P(j|t)·(1 − P(j|t))   Proportion of obs.   i(t)
t (root)       P(D|t)  = 300/1000 = 0.30        0.30·(1 − 0.30) = 0.21
               P(ND|t) = 700/1000 = 0.70        0.70·(1 − 0.70) = 0.21                              0.42
t(L): Male     P(D|t(L))  = 135/274 = 0.49      0.49·(1 − 0.49) = 0.25         274/1000 = 0.274
               P(ND|t(L)) = 139/274 = 0.51      0.51·(1 − 0.51) = 0.25                              0.50
t(R): Female   P(D|t(R))  = 165/726 = 0.23      0.23·(1 − 0.23) = 0.17         726/1000 = 0.726
               P(ND|t(R)) = 561/726 = 0.77      0.77·(1 − 0.77) = 0.17                              0.34

Δi(t) = [0.42 − (0.27 × 0.50) − (0.726 × 0.34)] = 0.038
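Using the gini and delta_impurity helpers sketched earlier, the table can be reproduced; exact arithmetic gives about 0.028, while the 0.038 in the table reflects the intermediate rounding of the node impurities.

```python
parent = [300, 700]   # D, ND at node t
left   = [135, 139]   # t(L): Male
right  = [165, 561]   # t(R): Female

print(round(gini(parent), 2))                        # 0.42
print(round(gini(left), 2))                          # ~0.50
print(round(gini(right), 2))                         # ~0.35 (rounded to 0.34 in the table)
print(round(delta_impurity(parent, left, right), 3)) # 0.028
```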
Classification matrix (predicted vs. observed), observed Class = 0 (Negative) row: f00 = 17 [TN], f01 = 0 [FP]

Specificity = TN / (TN + FP) = 17/17 = 100%
Model accuracy = (TP + TN) / (TP + TN + FP + FN) = 21/24 = 87.5%, which implies TP = 4 and FN = 3.
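A minimal check of the two formulas, assuming the implied counts TP = 4 and FN = 3 (not shown explicitly in the matrix):

```python
# Counts from the classification matrix: FP and TN are given;
# TP and FN are implied by accuracy = 21/24.
tp, tn, fp, fn = 4, 17, 0, 3

specificity = tn / (tn + fp)                 # 17/17 = 1.0  -> 100%
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 21/24 = 0.875 -> 87.5%
print(f"Specificity = {specificity:.0%}, Accuracy = {accuracy:.1%}")
```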
• The receiver operating characteristic (ROC) curve is a useful way to determine the cut-off point that maximizes sensitivity and specificity (see the sketch after this list).
• Sensitivity and specificity are computed for a sequence of cut-off points applied to the model's predicted probabilities to classify observations as Positive or Negative.
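A minimal sketch of cut-off selection from an ROC curve, assuming scikit-learn is available and using made-up labels and scores; Youden's J (sensitivity + specificity − 1) is one common criterion.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical observed labels and model-predicted probabilities.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

# Sensitivity (TPR) and 1 - specificity (FPR) over a sequence of cut-offs.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Pick the cut-off that maximizes sensitivity + specificity.
j = tpr - fpr
print("Best cut-off:", thresholds[j.argmax()])
```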
• Lift and gain charts measure how much better one can expect to do with the model compared to not using a model.
• In contrast to the confusion/classification matrix, which evaluates the model on the whole population, gain and lift charts evaluate model performance on a portion of the population. Cumulative gains and lift charts are graphical representations of the advantage of using a predictive model to choose which customers to contact.
• Gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire data set.
Source: http://www.listendata.com/2014/08/excel-template-gain-and-lift-charts.html
Decision Tree–Lift Chart
• Lift measures how much better one can expect to do with the model compared to not using a model.
• It is the ratio of gain % to the random expectation at a given decile level; the random expectation at the xth decile is x%.
Interpretation:
By contacting only 10% of customers, the model identifies 4.5 times as many responders as random selection would (a lift of 4.5 at the first decile). A sketch for computing decile-wise gain and lift follows.
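A small sketch of the decile-wise gain and lift computation described above; the function name and the example data are illustrative, not from the source.

```python
import numpy as np

def gain_lift_table(y_true, y_score, n_bins=10):
    """Cumulative gain % and lift per decile, sorting observations by model score."""
    order = np.argsort(-np.asarray(y_score))       # highest scores first
    y_sorted = np.asarray(y_true)[order]
    total_events = y_sorted.sum()
    rows = []
    for d in range(1, n_bins + 1):
        cutoff = int(round(d * len(y_sorted) / n_bins))
        events = y_sorted[:cutoff].sum()           # cumulative targets up to decile d
        gain = 100.0 * events / total_events       # gain %
        lift = gain / (100.0 * d / n_bins)         # gain % / random expectation (d*10%)
        rows.append((d, round(gain, 1), round(lift, 2)))
    return rows

# Example with random scores: lift hovers around 1 because this "model" is uninformative.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = rng.random(1000)
print(gain_lift_table(y, scores)[:3])
```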
• Model overfitting:
  o The model has learned too much from the training data and does not generalize.
  o As the number of nodes increases, the training error decreases but the test error may increase.
  o The tree is more complex than needed.
Model underfitting or overfitting leads to a lack of generalizability, so such decision tree models may not classify unseen cases correctly.
Pruning is applied to overcome the underfitting or overfitting issues in the decision tree model:
Pre-pruning (early stopping): stop the algorithm before the tree is fully grown:
  o Stop if the number of instances is less than some user-specified threshold.
  o Stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain) by at least some threshold.
  This is more efficient but less accurate.

Post-pruning: grow the decision tree to its entirety, then trim the nodes of the tree in a bottom-up fashion:
  o If the generalization error improves after trimming, replace the sub-tree by a leaf node.
  o The class label of the leaf node is determined from the majority class of instances in the sub-tree.
  This is more accurate but less efficient.
Misclassification error pruning: decision tree pruning stops when the number of cases in a terminal node becomes less than a threshold (see the sketch below).
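As a sketch of how these ideas map onto scikit-learn's DecisionTreeClassifier (the parameter values here are illustrative, not from the lesson): min_samples_leaf and min_impurity_decrease act as pre-pruning thresholds, while ccp_alpha applies cost-complexity post-pruning to the fully grown tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via user-specified thresholds.
pre_pruned = DecisionTreeClassifier(
    criterion="gini",
    min_samples_leaf=20,           # stop if a leaf would hold fewer cases than this
    min_impurity_decrease=0.001,   # stop if the impurity improvement is below this
    random_state=0,
).fit(X_train, y_train)

# Post-pruning: grow fully, then trim with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```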
b. Specificity is the probability that the predicted class is 1 when the observed class is 0.
c. Specificity is the probability that the predicted class is 0 when the observed class is 0.