Exercise 5 Solution
Essential Libraries
Let us begin by importing the essential Python libraries.
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
houseData = pd.read_csv('train.csv')
houseData.head()
[5 rows x 81 columns]
houseData['CentralAir'].describe()
count     1460
unique       2
top          Y
freq      1365
Name: CentralAir, dtype: object
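The plotting cell that produced the figure below is not shown; a minimal sketch, assuming a simple count plot of CentralAir via sb.catplot (the choice of plot kind is an assumption):
# Sketch: visualize the Y vs N distribution of CentralAir
# (the original plot call was not preserved; a count plot is one natural choice)
sb.catplot(y='CentralAir', data=houseData, kind='count')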
(Figure: distribution of CentralAir.)
Note that the two levels of CentralAir, namely Y and N, are drastically imbalanced. This is not
a good situation for a classification problem: balanced classes are desirable, and there are
several techniques either to rebalance the classes or to obtain reasonable classification results
despite the imbalance. If you are interested, check out the following article.
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Train the Decision Tree Classifier model dectree using the Train Set.
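The split-and-fit cell itself is not shown; a minimal sketch, assuming SalePrice as the sole predictor and a 3:1 train-test split (the predictor choice, split ratio, and random_state are all assumptions):
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Response and predictor (predictor choice is an assumption)
y = houseData['CentralAir']
X = houseData[['SalePrice']]

# Split into Train and Test sets (split ratio and seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a depth-2 decision tree, matching the output below
dectree = DecisionTreeClassifier(max_depth=2)
dectree.fit(X_train, y_train)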
DecisionTreeClassifier(max_depth=2)
Method 1:
Export the Decision Tree as a dot file using export_graphviz, and visualize it using the
graphviz package. (On a Mac, graphviz may need to be installed separately.)
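A minimal sketch of this route, assuming the depth-2 tree fitted above (the styling options are assumptions; classes are ordered alphabetically by sklearn, so N comes before Y):
from sklearn.tree import export_graphviz
import graphviz

# Export the fitted tree to DOT format and render it inline
treedot = export_graphviz(dectree, feature_names=list(X_train.columns),
                          class_names=['N', 'Y'], out_file=None,
                          filled=True, rounded=True)
graphviz.Source(treedot)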
Method 2:
Some of you may encounter mysterious problems with the graphviz package. The issue may not
be easy to resolve, as it could stem from your OS configuration or other causes. As such, I have
provided an alternative method using plot_tree() from the module sklearn.tree to visualize the
decision tree. Since sklearn comes with Anaconda, you do not need to install additional
packages. The function plot_tree is less flexible, but it suffices for our purposes. You may
choose either method; a sketch using plot_tree follows.
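A minimal sketch of the plotting cell that defines out for the arrow-styling loop below (the figure size and styling options are assumptions):
from sklearn.tree import plot_tree

# Plot the fitted tree; plot_tree returns the node/arrow annotations
f = plt.figure(figsize=(12, 6))
out = plot_tree(dectree, feature_names=list(X_train.columns),
                class_names=['N', 'Y'], filled=True, rounded=True)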
# Thicken and darken the arrows between tree nodes for readability
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()
Goodness of Fit of the Model
Check how good the predictions are on the Train Set.
Metrics: Classification Accuracy and Confusion Matrix.
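A minimal sketch of these checks, assuming the train-test split above (the heatmap styling is an assumption):
from sklearn.metrics import confusion_matrix

# Classification accuracy on the Train Set
print("Train Accuracy :", dectree.score(X_train, y_train))

# Confusion matrix of the Train Set, rendered as a heatmap
y_train_pred = dectree.predict(X_train)
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot=True, fmt=".0f", annot_kws={"size": 18})
The same two calls on X_test and y_test give the Test Set numbers discussed below.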
(Figures: confusion matrix heatmaps for the Train Set and Test Set.)
Important: Note the huge imbalance between the False Positives and False Negatives in the
confusion matrices. On the Train Data, False Positives = 58 whereas False Negatives = 8; on the
Test Data, False Positives = 16 whereas False Negatives = 3. This is not surprising; it is a
direct effect of the huge Y vs N imbalance in the CentralAir variable. As CentralAir = Y
was more likely in the data, False Positives are more likely too.
The same arrow-styling loop as above is applied after each of the remaining plot_tree calls.
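As one sketch of this final step, a tree fitted on all four predictors at once (the tree depth, split, and plot settings are assumptions; the variable names come from the exercise):
# Fit and visualize a tree on all four predictors together
X4 = houseData[['SalePrice', 'GrLivArea', 'LotArea', 'TotalBsmtSF']]
y4 = houseData['CentralAir']
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.25, random_state=42)

dectree4 = DecisionTreeClassifier(max_depth=4)  # depth is an assumption
dectree4.fit(X4_train, y4_train)

f = plt.figure(figsize=(16, 8))
out = plot_tree(dectree4, feature_names=list(X4.columns),
                class_names=list(dectree4.classes_),
                filled=True, rounded=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
plt.show()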
Now that you have obtained the Decision Tree of CentralAir against all four variables
SalePrice, GrLivArea, LotArea, TotalBsmtSF, compare the initial position of each
variable in the tree (the level at which the variable first appears) and the number of times each
variable is used, to determine which variable is the most important for predicting
CentralAir. What do you think?