Decision Tree Induction Algorithm


Decision tree induction algorithm:

A decision tree is a machine learning algorithm that creates a tree-like model of decisions and their
possible consequences. In classification, a decision tree is used to classify input data into one of
several possible classes. Here are the steps in how a decision tree works in classification in data
mining:

For example, in a tree for classifying vertebrates, each leaf node is assigned a class label, while the
root node uses the attribute body temperature to separate warm-blooded from cold-blooded
vertebrates. Starting from the root node, we apply the test condition to a record and follow the
branch corresponding to the outcome of the test until a leaf node is reached.
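As a minimal sketch of this traversal in Python (the nested-tuple tree representation, the attribute names, and the second-level test are assumptions added for illustration, not part of the original text):

# Classify a record by walking a hand-built tree from the root to a leaf.
def classify(record, node):
    # A leaf is represented as a plain class label (string).
    if isinstance(node, str):
        return node
    # An internal node holds a test attribute and one child per outcome.
    attribute, children = node
    return classify(record, children[record[attribute]])

# Root node: body temperature separates warm-blooded from cold-blooded vertebrates;
# the second-level test (gives_birth) is assumed here for illustration only.
tree = ("body_temperature", {
    "cold-blooded": "non-mammal",
    "warm-blooded": ("gives_birth", {"yes": "mammal", "no": "non-mammal"}),
})

print(classify({"body_temperature": "warm-blooded", "gives_birth": "yes"}, tree))  # mammal
print(classify({"body_temperature": "cold-blooded", "gives_birth": "no"}, tree))   # non-mammal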

Data Preparation: The first step is to collect and prepare the data. The data must be cleaned,
pre-processed, and formatted in a way that can be used by the decision tree algorithm.

Tree Construction: The decision tree algorithm starts by selecting the best feature to split the data.
The feature with the highest information gain or the lowest Gini index is selected as the root node.
The data is then split into subsets based on the values of this feature.
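These two steps can be sketched with scikit-learn (an assumed library choice; the tiny data frame and its column names are invented for illustration):

# Data preparation: encode categorical attributes so the algorithm can use them.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "body_temperature": ["warm", "warm", "cold", "cold"],
    "gives_birth":      ["yes",  "no",   "no",   "no"],
    "class":            ["mammal", "non-mammal", "non-mammal", "non-mammal"],
})
X = pd.get_dummies(data[["body_temperature", "gives_birth"]])  # one-hot encode features
y = data["class"]

# Tree construction: the splitting criterion can be information gain ("entropy")
# or the Gini index ("gini"); the best feature ends up at the root of the tree.
model = DecisionTreeClassifier(criterion="entropy")
model.fit(X, y)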

Recursive Partitioning: The algorithm then recursively repeats this process for each subset,
selecting the best feature to split the data and creating new nodes for each feature. This process is
repeated until all the data has been classified into a set of leaf nodes.

Splitting: It is a process of dividing a node into two or more sub-nodes.

Pruning: When we remove sub-nodes of a decision node, this process is called pruning. The decision
tree can be pruned to prevent overfitting, which is when the model performs well on the training
data but poorly on the testing data.
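One concrete way to prune is scikit-learn's cost-complexity pruning via the ccp_alpha parameter; this is an assumed mechanism used only to illustrate the idea, not necessarily the pruning method the text has in mind:

# Compare an unpruned tree with a cost-complexity-pruned tree on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree   = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

# The pruned tree has fewer nodes and often generalizes better to new data.
print(full_tree.tree_.node_count,   full_tree.score(X_test, y_test))
print(pruned_tree.tree_.node_count, pruned_tree.score(X_test, y_test))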

Prediction: Once the decision tree is constructed, it can be used to predict the target variable for
new data by traversing the tree from the root to the appropriate leaf node. At each node, the feature
value of the new data is compared to the value of the node, and the algorithm follows the
appropriate branch of the tree.

Evaluation: The final step is to evaluate the performance of the decision tree on a testing dataset.
This step is crucial to ensure that the model can generalize well to new data and is not overfitting
to the training data.
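A short sketch of prediction and evaluation with scikit-learn (the dataset and hyperparameters are arbitrary choices made for illustration):

# Train on one part of the data, predict on the held-out part, and compare accuracies.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

# Prediction: each test record is routed from the root to a leaf, whose label is returned.
y_pred = tree.predict(X_test)

# Evaluation: a large gap between training and test accuracy suggests overfitting.
print("train accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, y_pred))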
Decision trees are also capable of handling both categorical and continuous data and can handle
missing data. However, decision trees can overfit the training data, leading to poor performance on
new data, and they may not be suitable for complex data with many features.
Advantages of decision trees:
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
Entropy: Entropy is a common way to measure impurity in a decision tree; it measures the impurity
of a data set.
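For reference, in terms of the class fractions p(i|t) at a node t (the standard definition, added here since the text does not spell it out): Entropy(t) = -Σ_i p(i|t) log2 p(i|t), which is 0 for a pure node and largest when the classes are evenly mixed.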

Information gain: It refers to the decline in entropy after the dataset is split on an attribute. It is
also called entropy reduction.
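The sketch below computes entropy, the Gini index, and the information gain of a split from scratch; the helper names and the toy labels are assumptions made for illustration:

# Impurity measures and information gain computed directly from class labels.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Information gain = entropy of the parent minus the weighted
    # entropy of the subsets produced by the split (entropy reduction).
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = ["mammal"] * 5 + ["non-mammal"] * 5
split  = [["mammal"] * 5, ["non-mammal"] * 5]    # a perfect split on some attribute
print(entropy(parent))                  # 1.0  (maximum impurity for two classes)
print(gini(parent))                     # 0.5
print(information_gain(parent, split))  # 1.0  (entropy drops to 0 after the split)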
The skeleton decision tree induction algorithm, also known as TreeGrowth, is shown in Algorithm 3.1,
which presents pseudocode for decision tree induction. The input to this algorithm consists of the
training records E and the attribute set F. The algorithm works by recursively selecting the best
attribute to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and 12) until the
stopping criterion is met (Step 1). The details of this algorithm are explained below.
1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as
node.label.

2. The find_best_split() function determines which attribute should be selected as the test condition
for splitting the training records. The choice of test condition depends on which impurity measure is
used to determine the goodness of a split. Popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf
node t, let p(i|t) denote the fraction of training records from class i associated with the node t. The
leaf node is assigned the class that has the majority of training records:
leaf.label = argmax_i p(i|t), where the argmax operator returns the class i that maximizes p(i|t).
Algorithm 3.1 A skeleton decision tree induction algorithm.

TreeGrowth(E, F)                              # E = training records, F = attribute set
1:  if stopping_cond(E, F) = true then        # terminate the recursion if all records have the same class label or the same attribute values
2:    leaf = createNode()                     # extend the decision tree by creating a new node holding a test condition or a class label
3:    leaf.label = Classify(E)                # determine the class label assigned to the leaf node
4:    return leaf
5:  else
6:    root = createNode()
7:    root.test_cond = find_best_split(E, F)  # select the best attribute to split the data
8:    let V = {v | v is a possible outcome of root.test_cond}
9:    for each v ∈ V do
10:     Ev = {e | root.test_cond(e) = v and e ∈ E}
11:     child = TreeGrowth(Ev, F)             # Steps 11 and 12 expand the nodes of the tree until Step 1 is met
12:     add child as descendent of root and label the edge (root → child) as v
13:   end for
14: end if
15: return root

4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all
the records have the same class label or the same attribute values. After building the decision tree,
a tree-pruning step can be performed to reduce the size of the decision tree.
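For concreteness, the sketch below is one possible Python rendering of the TreeGrowth skeleton, using the Gini index inside find_best_split and a majority vote inside Classify; the record format (dicts with a "class" key) and the removal of a used attribute from F are simplifying assumptions, not part of Algorithm 3.1:

# A possible Python rendering of the TreeGrowth skeleton (Algorithm 3.1).
from collections import Counter

class Node:
    def __init__(self):
        self.test_cond = None   # attribute name used as the test condition
        self.label = None       # class label (leaf nodes only)
        self.children = {}      # outcome value -> child node

def stopping_cond(E, F):
    # Stop when all records share one class label or no attributes remain to split on
    # (a simplification of the "same attribute values" test in the original algorithm).
    return len({e["class"] for e in E}) == 1 or not F

def classify(E):
    # Assign the majority class label among the records reaching this leaf.
    return Counter(e["class"] for e in E).most_common(1)[0][0]

def gini(E):
    n = len(E)
    return 1.0 - sum((c / n) ** 2 for c in Counter(e["class"] for e in E).values())

def find_best_split(E, F):
    # Choose the attribute whose split yields the lowest weighted Gini index.
    def weighted_gini(attr):
        groups = {}
        for e in E:
            groups.setdefault(e[attr], []).append(e)
        return sum(len(g) / len(E) * gini(g) for g in groups.values())
    return min(F, key=weighted_gini)

def tree_growth(E, F):
    if stopping_cond(E, F):
        leaf = Node()
        leaf.label = classify(E)
        return leaf
    root = Node()
    root.test_cond = find_best_split(E, F)
    for v in {e[root.test_cond] for e in E}:          # each possible outcome v of the test
        Ev = [e for e in E if e[root.test_cond] == v]
        child = tree_growth(Ev, [f for f in F if f != root.test_cond])
        root.children[v] = child                      # edge (root -> child) labelled v
    return root

# Tiny illustration with invented vertebrate records.
records = [
    {"body_temperature": "warm", "gives_birth": "yes", "class": "mammal"},
    {"body_temperature": "warm", "gives_birth": "no",  "class": "non-mammal"},
    {"body_temperature": "cold", "gives_birth": "no",  "class": "non-mammal"},
]
tree = tree_growth(records, ["body_temperature", "gives_birth"])
print(tree.test_cond)   # attribute chosen as the root test condition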

Example for decision tree induction algorithm:

The training set, test set, and classifier (classification model) are given below.
