
K. VENKAT RATNAM

191911412

CLASS WORK

1) Describe the attribute selection measures used by the ID3 algorithm to construct a Decision
Tree.

A) The decision of where to make strategic splits heavily affects a tree's accuracy, and the
decision criteria differ for classification and regression trees.
Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words,
the purity of the node increases with respect to the target variable. The decision tree evaluates
splits on all available variables and then selects the split which results in the most
homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let us look at some
algorithms used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when
computing classification trees)
MARS → (multivariate adaptive regression splines)
The ID3 algorithm builds decision trees using a top-down greedy search approach through the
space of possible branches with no backtracking. A greedy algorithm, as the name suggests,
always makes the choice that seems to be the best at that moment.
Steps in ID3 algorithm:
1. It begins with the original set S as the root node.
2. On each iteration of the algorithm, it iterates through every unused attribute of the
set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
3. It then selects the attribute which has the smallest entropy (equivalently, the largest
information gain).
4. The set S is then split on the selected attribute to produce subsets of the data.
5. The algorithm continues to recurse on each subset, considering only attributes never
selected before.
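
As a minimal sketch of steps 2 and 3, the snippet below (which uses a small, hypothetical
weather-style table that is not part of the original answer) computes the information gain of
each candidate attribute and picks the one with the largest gain; the entropy and information
gain measures themselves are described in the next section.

import numpy as np
import pandas as pd

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, attribute, target):
    # IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of attribute A
    total_entropy = entropy(df[target])
    weighted_child_entropy = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return total_entropy - weighted_child_entropy

# Hypothetical toy dataset used only for illustration
data = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Overcast'],
    'Windy':   ['False', 'True', 'False', 'False', 'True', 'True'],
    'Play':    ['No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
})

gains = {a: information_gain(data, a, 'Play') for a in ['Outlook', 'Windy']}
best_attribute = max(gains, key=gains.get)  # attribute chosen for the split
print(gains, best_attribute)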
 
Attribute Selection Measures
 
If the dataset consists of N attributes, then deciding which attribute to place at the root or at
different levels of the tree as internal nodes is a complicated step. Just randomly selecting any
node to be the root does not solve the issue; a random approach may give us bad results with low
accuracy.
To solve this attribute selection problem, researchers devised criteria such as:
Entropy
Information Gain
Gini Index
Gain Ratio
Reduction in Variance
Chi-Square
These criteria calculate a value for every attribute. The values are sorted, and attributes are
placed in the tree in that order, i.e., the attribute with the highest value (in the case of
information gain) is placed at the root.
While using Information Gain as a criterion, we assume attributes to be categorical, and for the
Gini index, attributes are assumed to be continuous.
 Gini index and information gain are both used to select, from the n attributes of the
dataset, which attribute should be placed at the root node or at an internal node.
Gini index

 Gini Index is a metric that measures how often a randomly chosen element would be
incorrectly identified.
 An attribute with a lower Gini index should therefore be preferred.
 Sklearn supports the "gini" criterion for the Gini Index and uses "gini" by default.
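
As an illustration (not part of the original answer), a minimal sketch of the Gini index for a
set of class labels, Gini(S) = 1 - sum(p_i^2):

import numpy as np

def gini_index(labels):
    # Gini(S) = 1 - sum(p_i^2) over the class proportions p_i;
    # 0 means a pure node, larger values mean more class mixing
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

print(gini_index(['A', 'A', 'A', 'A']))  # 0.0  (pure node)
print(gini_index(['A', 'A', 'B', 'B']))  # 0.5  (maximally mixed for two classes)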
Entropy

 Entropy is the measure of uncertainty of a random variable; it characterizes the
impurity of an arbitrary collection of examples. The higher the entropy, the higher the
information content.
Information Gain
 The entropy typically changes when we use a node in a decision tree to partition the
training instances into smaller subsets. Information gain is a measure of this change in
entropy.
 Sklearn supports the "entropy" criterion for Information Gain; if we want to use the
Information Gain method in sklearn, we have to specify it explicitly.
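
For example, a minimal sketch of choosing the criterion in sklearn (other hyperparameters left
at their defaults here):

from sklearn.tree import DecisionTreeClassifier

clf_gini = DecisionTreeClassifier()                        # criterion defaults to "gini"
clf_entropy = DecisionTreeClassifier(criterion="entropy")  # Information Gain must be requested explicitly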
Accuracy score
 Accuracy score is used to calculate the accuracy of the trained classifier.
Confusion Matrix
 Confusion Matrix is used to understand the trained classifier's behavior over the test
dataset or the validation dataset.
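
A minimal sketch of both metrics on hypothetical true and predicted labels:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ['L', 'L', 'R', 'R', 'B']  # hypothetical test labels
y_pred = ['L', 'R', 'R', 'R', 'B']  # hypothetical classifier output

print("Accuracy:", accuracy_score(y_true, y_pred) * 100)  # 80.0
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))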

 
2) Write a Python program to implement decision trees.
Aim:- To write a Python program to implement decision trees.

Algorithm:-

1. Find the best attribute and place it on the root node of the tree.
2. Now, split the training set of the dataset into subsets. While making the subsets, make
sure that each subset of the training dataset has the same value for an attribute.
3. Find leaf nodes in all branches by repeating steps 1 and 2 on each subset.

Program:-

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data

# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                      max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")

    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")

    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling the main function
if __name__ == "__main__":
    main()
