Practical -15
Aim:- Write a program to build a Decision Tree Classifier using the Gini criterion on a dataset.
A Decision Tree Classifier is a type of supervised learning algorithm that uses a tree-like
model to classify data into different categories. The algorithm works by recursively
partitioning the data into smaller subsets based on the values of the input features. Each
internal node in the tree represents a feature or attribute, and each leaf node represents a
class label. The classification process involves traversing the tree from the root node to a leaf
node, with each node providing a decision based on the input features.
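The aim names the Gini criterion, so a small worked sketch of it may help. At each node the tree prefers the split whose children have the lowest size-weighted Gini impurity, where the impurity of a node is 1 minus the sum of squared class proportions. The helper names gini_impurity and weighted_gini and the toy labels below are illustrative assumptions, not part of scikit-learn or of the Iris data.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy labels (hypothetical, for illustration only)
parent = ["a", "a", "a", "b", "b", "b"]
pure_split = (["a", "a", "a"], ["b", "b", "b"])
mixed_split = (["a", "a", "b"], ["a", "b", "b"])

print(gini_impurity(parent))        # impurity before splitting
print(weighted_gini(*pure_split))   # a split that separates the classes fully
print(weighted_gini(*mixed_split))  # a split that leaves both children mixed
```

The pure split brings the weighted impurity down to zero, while the mixed split only lowers it slightly, so a Gini-based tree would choose the pure split.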
Input:-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# load iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# split dataset to training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=99)
# initialize decision tree classifier (Gini impurity is scikit-learn's default criterion)
clf = DecisionTreeClassifier(criterion='gini', random_state=1)
# train the classifier
clf.fit(X_train, y_train)
# predict using classifier
y_pred = clf.predict(X_test)
# calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Output:-
Accuracy: 0.9555555555555556
Input:-
from sklearn.model_selection import GridSearchCV
# Hyperparameters to tune
param_grid = {
    'max_depth': range(1, 10, 1),
    'min_samples_leaf': range(1, 20, 2),
    'min_samples_split': range(2, 20, 2),
    'criterion': ["entropy", "gini"]
}
# Decision tree classifier
tree = DecisionTreeClassifier(random_state=1)
# GridSearchCV
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid,
                           cv=5, verbose=True)
grid_search.fit(X_train, y_train)
# Best score and estimator
print("best accuracy", grid_search.best_score_)
print(grid_search.best_estimator_)
Output:-
Fitting 5 folds for each of 1620 candidates, totalling 8100 fits
best accuracy 0.9714285714285715
DecisionTreeClassifier(criterion='entropy', max_depth=4,
min_samples_leaf=3, random_state=1)
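The candidate and fit counts reported by GridSearchCV follow directly from the size of the parameter grid. As a quick check of that arithmetic, the sketch below multiplies the number of values per hyperparameter and then the number of folds. It uses only the standard library, and the variable names (cv_folds, sizes, n_candidates, n_fits) are chosen here for illustration.

```python
from math import prod

# Same grid as in the search above
param_grid = {
    "max_depth": range(1, 10, 1),
    "min_samples_leaf": range(1, 20, 2),
    "min_samples_split": range(2, 20, 2),
    "criterion": ["entropy", "gini"],
}
cv_folds = 5

# Number of values tried for each hyperparameter
sizes = {name: len(values) for name, values in param_grid.items()}

# Every combination is one candidate; each candidate is fitted once per fold
n_candidates = prod(sizes.values())
n_fits = n_candidates * cv_folds

print(sizes)
print(n_candidates, n_fits)
```

With 9, 10, 9 and 2 values respectively, this gives 1620 candidates and 8100 fits, matching the log line. Note that best_score_ is the mean cross-validated accuracy on the training folds, not the accuracy on X_test.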
Visualizing the Decision Tree Classifier
Input:-
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# best estimator
tree_clf = grid_search.best_estimator_
# plot
plt.figure(figsize=(18, 15))
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()
Output:-
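When a figure is inconvenient, a fitted tree can also be inspected as plain-text rules with scikit-learn's export_text. The sketch below is self-contained and, as an assumption made only to keep it short, fits a fresh depth-2 Gini tree (named demo_tree here) on the full Iris data rather than reusing the tuned estimator from the grid search.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Small illustrative tree (assumption: depth limited to 2 for readability)
demo_tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
demo_tree.fit(iris.data, iris.target)

# One line per node: internal nodes show the threshold test, leaves show the class
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printed rules from the top line down follows the same root-to-leaf path that the classifier takes when it predicts a sample.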
Input:-
import pandas as pd
import matplotlib.pyplot as plt
# load dataset
dataset_link = 'https://media.geeksforgeeks.org/wp-content/uploads/20240620175612/spam_email.csv'
df = pd.read_csv(dataset_link)
# plot the category count
df['Category'].value_counts().plot.bar(color = ["g","r"])
plt.title('Total number of ham and spam in the dataset')
plt.show()
Input:-
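import seaborn as sns
from sklearn.metrics import confusion_matrix
# confusion matrix
# NOTE: y_test and pred are assumed to be the true and predicted labels of a
# spam classifier trained on df; that training step is not shown in this practical
cmat = confusion_matrix(y_test, pred)
# plot heatmap
sns.heatmap(cmat, annot=True, cmap='Paired',
            cbar=False, fmt="d", xticklabels=[
                'Not Spam', 'Spam'], yticklabels=['Not Spam', 'Spam'])
plt.show()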
Output:-
Conclusion:-
In this practical, we built a Decision Tree Classifier with Scikit-Learn on the Iris dataset.
The default tree, which uses the Gini criterion, reached a test accuracy of about 0.956.
Tuning max_depth, min_samples_leaf, min_samples_split and the criterion with GridSearchCV
gave a best cross-validated accuracy of about 0.971, and the selected tree was visualized
with plot_tree. Because each prediction follows a readable path from the root to a leaf,
Decision Tree Classifiers give models that are both accurate and interpretable.