Decision Trees
BUSINESS ANALYTICS
2023-2024 Fall
Decision Tree Classification Task
(Figure: applying a fitted decision tree to a test instance, which is classified as "No".)
Tree Induction
• Many algorithms are available (Hunt's algorithm, CART, ID3, C4.5, SLIQ, SPRINT, …)
• We will use CART (Classification and Regression Tree)
• Classification tree if outcome is discrete
• Regression tree if outcome is continuous
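CART's two variants correspond to two separate estimators in scikit-learn, which these slides use later; a quick sketch:

```python
# CART in scikit-learn: one algorithm, two estimators.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier()  # discrete outcome -> classification tree
reg = DecisionTreeRegressor()   # continuous outcome -> regression tree
```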
• All of the impurity measures take the value zero (their minimum) for a pure node, where a single class has probability 1.
• All of the impurity measures take their maximum value when the class distribution in a node is uniform, as the sketch below illustrates.
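A minimal sketch verifying both properties for the two most common impurity measures, Gini and entropy (the helper functions are illustrative, not taken from any library):

```python
import numpy as np

def gini(p):
    # Gini index: 1 - sum_i p_i^2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: -sum_i p_i * log2(p_i), treating 0*log(0) as 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node    -> 0.0 0.0 (minimum)
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # uniform node -> 0.5 1.0 (maximum)
```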
Steps to Generate Classification Trees
1. Start with the complete training data in the root node.
2. Split the node on the independent variable (and split value) that yields the largest reduction in impurity, partitioning the data into subsets.
3. Repeat step 2 for each subset of the data (for each internal node) using the
independent variables until
• All the independent variables are exhausted, or
• The stopping criteria are met. Commonly used stopping criteria are the number of levels of the tree from the root node, the minimum number of observations in a parent/child node (e.g., 10% of the training data), and the minimum reduction in the impurity index (these map onto hyperparameters in the sketch after this list).
4. Generate business rules for the leaf (terminal) nodes of the tree.
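The stopping criteria in step 3 map directly onto constructor arguments of sklearn's DecisionTreeClassifier; a sketch with illustrative (not recommended) values:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,                  # number of levels of the tree from the root
    min_samples_split=0.10,       # minimum share of observations in a parent node
    min_samples_leaf=0.05,        # minimum share of observations in a child (leaf) node
    min_impurity_decrease=0.001,  # minimum reduction in the impurity index
)
```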
Benefits of Decision Trees
• Rules generated are simple and interpretable.
• Trees can be visualized.
• Work well with both numerical and categorical data; do not require the data to be normalized or dummy variables to be created.
• Rules can help create business strategies.
Decision trees in Python - 1
• We will use sklearn.tree.DecisionTreeClassifier
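A minimal end-to-end sketch; the file name credit.csv and the status column are hypothetical placeholders for a German-credit-style dataset whose feature names match the rule walkthroughs below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical credit dataset with a binary "status" outcome (good/bad credit)
df = pd.read_csv("credit.csv")
X = df.drop(columns=["status"])
y = df["status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit(X_train, y_train)

# Each root-to-leaf path in the printout is one business rule.
print(export_text(clf, feature_names=list(X.columns)))
```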
The first rule, checkin_account_A14 < 0.5, tests whether the customer has a checkin_account_A14 account or not. This rule splits the dataset into the two subsets represented by the second-level nodes: the left node contains 425 samples (customers without checkin_account_A14) and the right node contains 275 samples (customers with checkin_account_A14).
If the customer does not have checkin_account_A14, the credit duration is greater than 33, and the customer does not have saving_acc_A65, then there is a high probability of bad credit. There are 70 records in the dataset that satisfy these conditions, and 48 of them have bad credit.
If the customer has checkin_account_A14 and Inst_plans_A143 and age > 23.5, then there is a high probability of good credit.
Finding Optimal Criteria and Max Depth
• sklearn.model_selection provides GridSearchCV, which searches through a set of candidate hyperparameter values and reports the best-performing combination.
• In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.
• GridSearchCV can be used for any machine learning model and can
search through multiple parameters of the model.
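A minimal sketch for the criterion and max_depth search (the grid values are illustrative; X_train and y_train are assumed from the earlier fitting sketch):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate hyperparameter values to search over (illustrative choices)
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5, 6],
}

# 5-fold cross-validated search over all 10 combinations
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)  # best-performing combination
print(grid.best_score_)   # its mean cross-validated accuracy
```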