VENKAT RATNAM
191911412
CLASS WORK
1) Describe the attribute selection measures used by the ID3 algorithm to construct a Decision Tree.
A) The decision of where to make splits heavily affects a tree's accuracy, and the splitting criteria differ for classification and regression trees.
Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. Creating sub-nodes increases the homogeneity of the resulting sub-nodes; in other words, the purity of each node increases with respect to the target variable. The decision tree evaluates splits on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The choice of algorithm also depends on the type of the target variable. Let us look at some algorithms used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (Multivariate Adaptive Regression Splines)
The ID3 algorithm builds decision trees using a top-down greedy search approach through the
space of possible branches with no backtracking. A greedy algorithm, as the name suggests,
always makes the choice that seems to be the best at that moment.
Steps in ID3 algorithm:
1. It begins with the original set S as the root node.
2. On each iteration of the algorithm, it iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
3. It then selects the attribute which has the smallest entropy or, equivalently, the largest information gain (a minimal sketch of this selection step follows the list).
4. The set S is then split by the selected attribute to produce subsets of the data.
5. The algorithm continues to recur on each subset, considering only attributes never
selected before.
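For concreteness, here is a minimal Python sketch of the selection step in 2 and 3 above; the weather-style toy data, column names, and helper functions are illustrative assumptions, not part of the ID3 definition itself.

import numpy as np
import pandas as pd

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(data, attribute, target):
    # IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of attribute A
    total_entropy = entropy(data[target])
    weighted_entropy = sum(
        (len(subset) / len(data)) * entropy(subset[target])
        for _, subset in data.groupby(attribute))
    return total_entropy - weighted_entropy

# Toy weather-style dataset (made up purely for illustration)
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "windy":   ["no", "yes", "no", "no", "yes", "yes"],
    "play":    ["no", "no", "yes", "yes", "no", "yes"]})

# ID3 splits on the attribute with the largest information gain
gains = {a: information_gain(df, a, "play") for a in ["outlook", "windy"]}
print(gains, "-> split on", max(gains, key=gains.get))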
Attribute Selection Measures
If the dataset consists of N attributes, then deciding which attribute to place at the root or at the different levels of the tree as internal nodes is a complicated step. Randomly selecting an attribute for the root does not solve the problem; a random approach may give bad results with low accuracy.
To solve this attribute selection problem, researchers devised criteria such as:
Entropy
Information Gain
Gini Index
Gain Ratio
Reduction in Variance
Chi-Square
Each of these criteria calculates a value for every attribute. The values are sorted, and attributes are placed in the tree following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
When using Information Gain as the criterion, attributes are assumed to be categorical, while for the Gini index attributes are assumed to be continuous.
Both the Gini index and information gain are used to select, from the n attributes of the dataset, the attribute to place at the root node or at an internal node.
Gini index
The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly identified.
This means an attribute with a lower Gini index should be preferred.
Sklearn supports the "gini" criterion for the Gini Index and uses "gini" by default.
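As a quick illustration, the Gini impurity of a node is 1 minus the sum of squared class proportions; the label lists below are made-up examples.

import numpy as np

def gini_index(labels):
    # Gini(S) = 1 - sum(p_i ** 2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_index(["yes", "yes", "yes", "yes"]))  # pure node -> 0.0
print(gini_index(["yes", "no", "yes", "no"]))    # evenly mixed binary node -> 0.5

# In sklearn this measure is selected with criterion="gini" (the default):
# from sklearn.tree import DecisionTreeClassifier
# clf = DecisionTreeClassifier(criterion="gini")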
Entropy
Entropy measures the impurity (randomness) of the class labels in a set of samples: Entropy(S) = -Σ p_i · log2(p_i), where p_i is the proportion of samples belonging to class i. A pure node has entropy 0, and ID3 chooses the attribute whose split gives the largest reduction in entropy, i.e., the largest information gain.
Sklearn supports the "entropy" criterion for this measure.
2) Write a Python program to implement decision trees.
Aim:- To write a Python program to implement decision trees.
Algorithm:-
1. Find the best attribute and place it on the root node of the tree.
2. Now, split the training set into subsets such that each subset contains data with the same value for the selected attribute.
3. Find leaf nodes in all branches by repeating steps 1 and 2 on each subset.
Program:-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Function to import the balance-scale dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)
    return balance_data

# Function to split the dataset into features, target, and train/test sets
def splitdataset(balance_data):
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    # 70/30 train/test split (split ratio and random_state chosen as examples)
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

# Function to train a tree using the Gini index criterion
def train_using_gini(X_train, y_train):
    # same hyperparameters as the entropy tree so the two criteria can be compared
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                      max_depth=3, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to train a tree using the entropy (information gain) criterion
def train_using_entropy(X_train, y_train):
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                         max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to report the confusion matrix, accuracy and classification report
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, y_train)
    clf_entropy = train_using_entropy(X_train, y_train)
    # Operational Phase
    print("Results Using Gini Index:")
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)
    print("Results Using Entropy:")
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

if __name__ == "__main__":
    main()