Decision Trees


MAN 456

BUSINESS ANALYTICS

Decision Trees

2023-2024 Fall

Ankara, Turkey November 26, 2023


Catching Tax Evasion
• Tax-return data 
• Classification problem: Discriminate
between records of different classes
(cheaters vs non-cheaters)

• A new tax return for the next year 


• Is this a cheating tax return?
Classification Tree (Decision Tree Learning)
• Uses a tree-like structure to predict the value of an outcome variable
• Starts with complete data
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent outcome
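A minimal sketch of these three ingredients in Python, assuming the classic tax-return attributes (Refund, Marital Status, Taxable Income) that this example is typically built on; the attribute names, thresholds, and the tree itself are illustrative, not read off the slide's figure:

```python
# Hand-built classification tree for the tax-evasion example:
# internal nodes test an attribute, branches carry the test outcomes,
# and leaves hold the predicted class ("Yes" = cheater, "No" = non-cheater).
def classify(record):
    if record["Refund"] == "Yes":               # internal node: test on Refund
        return "No"                             # leaf
    if record["MaritalStatus"] == "Married":    # internal node on the "No refund" branch
        return "No"                             # leaf
    return "Yes" if record["TaxableIncome"] > 80_000 else "No"   # final test, two leaves

# Applying the model to a new tax return (cf. "Apply Model to Test Data"):
new_return = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 95_000}
print(classify(new_return))   # -> "No": the instance is classified as a non-cheater
```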
Example of a Decision Tree
Another Example of Decision Tree
• There could be more than one tree that fits the same data!
Apply Model to Test Data

Classify the instance as “No”
Decision Tree Classification Task
Tree Induction
• Many algorithms available (Hunt’s algorithm, CART, ID3, C4.5, SLIQ,
SPRINT…)
• We will use CART (Classification and Regression Tree)
• Classification tree if outcome is discrete
• Regression tree if outcome is continuous

• Question: How to split the records?


• Impurity measures
• Gini Impurity Index and Entropy for classification trees
• Sum of squared errors (SSE) for regression trees
Impurity Measures: Gini and Entropy
• Consider a classification problem with c classes.
• p(i|t): fraction of records associated with node t belonging to class i

• All of the impurity measures take their minimum value (zero) for a pure node, where a single class has probability 1.
• All of the impurity measures take maximum value when the class distribution in a
node is uniform.
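For reference, the standard definitions of the two measures at a node t, using the p(i|t) notation above:

```latex
\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^{2},
\qquad
\mathrm{Entropy}(t) = - \sum_{i=1}^{c} p(i \mid t)\,\log_{2} p(i \mid t)
```

Gini ranges from 0 to 1 - 1/c and entropy from 0 to log2(c); both maxima occur when all c classes are equally likely in the node.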
Steps to Generate Classification Trees
1. Start with the complete training data in the root node.

2. Use a measure of impurity (Gini Impurity Index or Entropy). Search for the predictor variable that minimizes the impurity, i.e., results in the maximum reduction in impurity when the parent node is split into child nodes (a short sketch of this search follows the list).

3. Repeat step 2 for each subset of the data (for each internal node) using the
independent variables until
• All the independent variables are exhausted.
• The stopping criteria are met. Common stopping criteria are the number of levels of the tree from the root node, the minimum number of observations in a parent/child node (e.g., 10% of the training data), and the minimum reduction in the impurity index.

4. Generate business rules for the leaf (terminal) nodes of the tree.
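A minimal sketch of the split search in step 2, using the Gini index for a single binary split; the function names and the toy data are illustrative, not taken from the lecture code:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(y_parent, split_mask):
    """Reduction in impurity when a boolean mask splits the parent into two children."""
    y_left, y_right = y_parent[split_mask], y_parent[~split_mask]
    n = len(y_parent)
    weighted_children = (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
    return gini(y_parent) - weighted_children

# Toy example: evaluate the candidate split "x < 0.5"; in step 2 the split with
# the largest reduction over all predictors and thresholds is chosen.
y = np.array([0, 0, 1, 1, 1, 0])
x = np.array([0.1, 0.4, 0.8, 0.9, 0.7, 0.2])
print(impurity_reduction(y, x < 0.5))  # 0.5: this split separates the classes perfectly
```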
Benefits of Decision Trees
• Rules generated are simple and interpretable.
• Trees can be visualized.
• Work well with both numerical and categorical data; do not require data to be normalized or dummy variables to be created.
• Rules can help create business strategies.
Decision trees in Python - 1
• We will use sklearn.tree.DecisionTreeClassifier

• It takes the following key parameters:


• criterion: string – The function to measure the quality of a split. Supported
criteria are “gini” for the Gini impurity and “entropy” for the information
gain. Default is gini.

• max_depth: int – The maximum depth of the tree.

• min_samples_split: int or float – The minimum number of samples required to split an internal node. If int, it is the number of samples; if float, it is the fraction of the total number of samples. Default is 2.

• min_samples_leaf: int or float – The minimum number of samples required to be at a leaf node.
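A short usage sketch of these parameters; the iris data and the chosen values are placeholders, not the course's credit example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="gini",       # or "entropy" for information gain
    max_depth=3,            # limit the depth of the tree
    min_samples_split=0.1,  # float: fraction of all samples needed to split a node
    min_samples_leaf=5,     # int: minimum number of samples in a leaf
    random_state=42,
)
clf.fit(X, y)
print(clf.score(X, y))      # accuracy on the training data
```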
Decision trees in Python - 2
• Create decision trees for credit classification
• Using Gini impurity index
• Using entropy
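A minimal sketch of this step: synthetic data stands in for the credit set prepared in the notebook, and the 70/30 split and max_depth=3 are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded credit data (700 observations, as in the slides).
X, y = make_classification(n_samples=700, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit one tree with the Gini impurity index and one with entropy.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print(criterion, "test accuracy:", tree.score(X_test, y_test))
```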
Decision trees in Python - 3
[Tree diagram: each split sends records down a True (left) or False (right) branch.]

There are 700 observations, of which 491 are good credits and 209 are bad credits. The Gini index of the root node is 0.419.

checkin_account_A14 is the most important feature for splitting good and bad credits when compared to the other features and is hence chosen as the top splitting criterion.

The first rule, checkin_account_A14 < 0.5, tests whether or not the customer has a checkin_account_A14 account. This rule splits the dataset into the two subsets represented by the second-level nodes: the left node holds 425 samples (i.e., not having checkin_account_A14) and the right node holds 275 samples (i.e., having checkin_account_A14).

If the customer does not have checkin_account_A14, credit duration is greater than 33, and the customer does not have saving_acc_A65, then there is a high probability of bad credit. There are 70 records in the dataset that satisfy these conditions, and 48 of them have bad credit.

If the customer has checkin_account_A14, has Inst_plans_A143, and age > 23.5, then there is a high probability of good credit.
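The branch-by-branch reading above can also be generated programmatically with sklearn.tree.export_text, which prints every root-to-leaf path of a fitted tree. A minimal sketch on synthetic stand-in data (the generated feature names are placeholders for the encoded credit columns such as checkin_account_A14):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the encoded credit data used in the notebook.
X, y = make_classification(n_samples=700, n_features=5, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42).fit(X, y)

# Each printed path from the root to a leaf corresponds to one business rule.
print(export_text(tree, feature_names=feature_names))
```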
Finding Optimal Criteria and Max Depth
• sklearn.model_selection provides a feature called GridSearchCV, which searches through a set of possible hyperparameter values and reports the best one.
• In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.

• GridSearchCV can be used for any machine learning model and can
search through multiple parameters of the model.

• We will use it to search for optimal parameters


• Splitting criteria: gini or entropy.
• Maximum depth of decision tree ranging from 2 to 10.

• See the Jupyter notebook.
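A minimal sketch of that search; synthetic data stands in for the credit set, and the grid mirrors the two bullets above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data used in the notebook.
X, y = make_classification(n_samples=700, n_features=10, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy"],     # splitting criteria to try
    "max_depth": list(range(2, 11)),      # maximum depths 2 through 10
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the most accurate criterion/depth combination found
print(search.best_score_)    # its mean cross-validated accuracy
```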
