Decision Trees


MAN 456

BUSINESS ANALYTICS

Decision Trees

2023-2024 Fall

Ankara, Turkey November 26, 2023


Catching Tax Evasion
• Tax-return data 
• Classification problem: Discriminate
between records of different classes
(cheaters vs non-cheaters)

• A new tax return for the next year 


• Is this a cheating tax return?
Classification Tree (Decision Tree Learning)
• Uses a tree-like structure to predict the value of an outcome variable
• Starts with complete data
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent outcome
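A minimal sketch of these three ingredients in Python, assuming the classic tax-return attributes (Refund, Marital Status, Taxable Income) that this example is typically built on; the attribute names, thresholds, and the tree itself are illustrative, not read off the slide's figure:

```python
# Hand-built classification tree for the tax-evasion example:
# internal nodes test an attribute, branches carry the test outcomes,
# and leaves hold the predicted class ("Yes" = cheater, "No" = non-cheater).
def classify(record):
    if record["Refund"] == "Yes":               # internal node: test on Refund
        return "No"                             # leaf
    if record["MaritalStatus"] == "Married":    # internal node on the "No refund" branch
        return "No"                             # leaf
    return "Yes" if record["TaxableIncome"] > 80_000 else "No"   # final test, two leaves

# Applying the model to a new tax return (cf. "Apply Model to Test Data"):
new_return = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 95_000}
print(classify(new_return))   # -> "No": the instance is classified as a non-cheater
```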
Example of a Decision Tree
Another Example of Decision Tree
• There could be more than one tree that fits the same data!
Apply Model to Test Data

Classify the instance as “No”
Decision Tree Classification Task
Tree Induction
• Many algorithms available (Hunt’s algorithm, CART, ID3, C4.5, SLIQ,
SPRINT…)
• We will use CART (Classification and Regression Tree)
• Classification tree if outcome is discrete
• Regression tree if outcome is continuous

• Question: How to split the records?


• Impurity measures
• Gini Impurity Index and Entropy for classification trees
• Sum of squared errors (SSE) for regression trees
Impurity Measures: Gini and Entropy
• Consider a classification problem with c classes.
• p(i|t): fraction of records associated with node t belonging to class i

• All of the impurity measures take their minimum value (zero) for a pure node, where a single class has probability 1.
• All of the impurity measures take maximum value when the class distribution in a
node is uniform.
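For reference, the standard definitions of the two measures at a node t, using the p(i|t) notation above:

```latex
\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^{2},
\qquad
\mathrm{Entropy}(t) = - \sum_{i=1}^{c} p(i \mid t)\,\log_{2} p(i \mid t)
```

Gini ranges from 0 to 1 - 1/c and entropy from 0 to log2(c); both maxima occur when all c classes are equally likely in the node.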
Steps to Generate Classification Trees
1. Start with the complete training data in the root node.

2. Use a measure of impurity (Gini Impurity Index or Entropy). Search for the predictor variable that minimizes the impurity, i.e., results in the maximum reduction in impurity when the parent node is split into child nodes (a short sketch of this search follows the list).

3. Repeat step 2 for each subset of the data (for each internal node) using the
independent variables until
• All the independent variables are exhausted.
• The stopping criteria are met. Common stopping criteria are the number of levels of the tree from the root node, the minimum number of observations in a parent/child node (e.g., 10% of the training data), and the minimum reduction in the impurity index.

4. Generate business rules for the leaf (terminal) nodes of the tree.
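A minimal sketch of the split search in step 2, using the Gini index for a single binary split; the function names and the toy data are illustrative, not taken from the lecture code:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(y_parent, split_mask):
    """Reduction in impurity when a boolean mask splits the parent into two children."""
    y_left, y_right = y_parent[split_mask], y_parent[~split_mask]
    n = len(y_parent)
    weighted_children = (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
    return gini(y_parent) - weighted_children

# Toy example: evaluate the candidate split "x < 0.5"; in step 2 the split with
# the largest reduction over all predictors and thresholds is chosen.
y = np.array([0, 0, 1, 1, 1, 0])
x = np.array([0.1, 0.4, 0.8, 0.9, 0.7, 0.2])
print(impurity_reduction(y, x < 0.5))  # 0.5: this split separates the classes perfectly
```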
Benefits of Decision Trees
• Rules generated are simple and interpretable.
• Trees can be visualized.
• Work well with both numerical and categorical data; do not require data to be normalized or dummy variables to be created.
• Rules can help create business strategies.
Decision trees in Python - 1
• We will use sklearn.tree.DecisionTreeClassifier

• It takes the following key parameters:


• criterion: string – The function to measure the quality of a split. Supported
criteria are “gini” for the Gini impurity and “entropy” for the information
gain. Default is gini.

• max_depth: int – The maximum depth of the tree.

• min_samples_split: int or float – The minimum number of samples required to split an internal node. If int, it is the number of samples; if float, it is the fraction of the total number of samples. Default is 2.

• min_samples_leaf: int or float – The minimum number of samples required to be at a leaf node.
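A short usage sketch of these parameters; the iris data and the chosen values are placeholders, not the course's credit example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="gini",       # or "entropy" for information gain
    max_depth=3,            # limit the depth of the tree
    min_samples_split=0.1,  # float: fraction of all samples needed to split a node
    min_samples_leaf=5,     # int: minimum number of samples in a leaf
    random_state=42,
)
clf.fit(X, y)
print(clf.score(X, y))      # accuracy on the training data
```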
Decision trees in Python - 2
• Create decision trees for credit classification
• Using Gini impurity index
• Using entropy
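A minimal sketch of this step: synthetic data stands in for the credit set prepared in the notebook, and the 70/30 split and max_depth=3 are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded credit data (700 observations, as in the slides).
X, y = make_classification(n_samples=700, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit one tree with the Gini impurity index and one with entropy.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print(criterion, "test accuracy:", tree.score(X_test, y_test))
```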
Decision trees in Python - 3
[Tree diagram: each split sends records down a True (left) or False (right) branch.]

There are 700 observations, of which 491 are good credits and 209 are bad credits. The Gini index of the root node is 0.419.

checkin_account_A14 is the most important feature for splitting good and bad credits when compared to the other features and is hence chosen as the top splitting criterion.

The first rule, checkin_account_A14 < 0.5, tests whether or not the customer has a checkin_account_A14 account. This rule splits the dataset into the two subsets represented by the second-level nodes: the left node holds 425 samples (i.e., not having checkin_account_A14) and the right node holds 275 samples (i.e., having checkin_account_A14).

If the customer does not have checkin_account_A14, credit duration is greater than 33, and the customer does not have saving_acc_A65, then there is a high probability of bad credit. There are 70 records in the dataset that satisfy these conditions, and 48 of them have bad credit.

If the customer has checkin_account_A14, has Inst_plans_A143, and age > 23.5, then there is a high probability of good credit.
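The branch-by-branch reading above can also be generated programmatically with sklearn.tree.export_text, which prints every root-to-leaf path of a fitted tree. A minimal sketch on synthetic stand-in data (the generated feature names are placeholders for the encoded credit columns such as checkin_account_A14):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the encoded credit data used in the notebook.
X, y = make_classification(n_samples=700, n_features=5, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42).fit(X, y)

# Each printed path from the root to a leaf corresponds to one business rule.
print(export_text(tree, feature_names=feature_names))
```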
Finding Optimal Criteria and Max Depth
• sklearn.model_selection provides a feature called GridSearchCV, which searches through a set of possible hyperparameter values and reports the best one.
• In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.

• GridSearchCV can be used for any machine learning model and can
search through multiple parameters of the model.

• We will use it to search for optimal parameters


• Splitting criteria: gini or entropy.
• Maximum depth of decision tree ranging from 2 to 10.

• See the Jupyter notebook.
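A minimal sketch of that search; synthetic data stands in for the credit set, and the grid mirrors the two bullets above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data used in the notebook.
X, y = make_classification(n_samples=700, n_features=10, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy"],     # splitting criteria to try
    "max_depth": list(range(2, 11)),      # maximum depths 2 through 10
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the most accurate criterion/depth combination found
print(search.best_score_)    # its mean cross-validated accuracy
```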
