Exercise3 Solution
Essential Libraries
Let us begin by importing the essential Python Libraries.
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
houseData = pd.read_csv('train.csv')
houseData.head()
[5 rows x 81 columns]
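Beyond head, a few structural checks are often useful straight after loading. The sketch below runs them on a small made-up DataFrame (demo and its values are hypothetical stand-ins, not rows from train.csv); the same calls should work on houseData.

```python
import pandas as pd

# Hypothetical stand-in for houseData: two numeric columns and one text column
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})

n_rows, n_cols = demo.shape                 # (number of rows, number of columns)
dtype_counts = demo.dtypes.value_counts()   # how many columns there are of each dtype
missing = demo.isna().sum()                 # number of missing values per column

print(n_rows, n_cols)
print(dtype_counts)
print(missing)
```

On the real data, demo.shape would correspond to the row and column counts reported under the head output.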
houseNumData.describe()
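# Assumed setup: one row of three plots per variable, one colour per variable
n = len(houseNumData.columns)
f, axes = plt.subplots(n, 3, figsize=(18, 4 * n), squeeze=False)
colors = sb.color_palette(n_colors=n)

# Box plot, histogram and violin plot for each numeric variable
count = 0
for var in houseNumData:
    sb.boxplot(data=houseNumData[var], orient = "h", color = colors[count], ax = axes[count,0])
    sb.histplot(data=houseNumData[var], color = colors[count], ax = axes[count,1])
    sb.violinplot(data=houseNumData[var], orient = "h", color = colors[count], ax = axes[count,2])
    count += 1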
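The box plots above mark points lying more than 1.5 times the interquartile range beyond the quartiles, which is Seaborn's default whisker rule. As a rough numerical companion, the sketch below counts such points per variable and computes the skewness. The demo frame and its values are hypothetical and only stand in for houseNumData.

```python
import pandas as pd

# Hypothetical numeric data standing in for houseNumData
demo = pd.DataFrame({
    "LotArea": [8000, 8500, 9000, 9500, 10000, 50000],
    "SalePrice": [150000, 160000, 170000, 180000, 190000, 200000],
})

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3, column by column
outside = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
outlier_counts = outside.sum()

# Positive skewness indicates a long right tail in the histogram
skewness = demo.skew()

print(outlier_counts)
print(skewness)
```

A variable with many flagged points and a large positive skewness is one whose histogram and violin plot will show a long right tail.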
Check the Relationship amongst Variables
Correlation between the variables, followed by all bi-variate jointplots.
# Correlation Matrix
print(houseNumData.corr())
# Draw pairs of variables against one another
sb.pairplot(data = houseNumData)
Which variables do you think will help us predict SalePrice in this dataset?
Bonus: Attempt a comprehensive analysis with all Numeric variables in the dataset.
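One possible starting point for both questions is to select every numeric column and rank it by the strength of its linear correlation with SalePrice. The sketch below does this on a small made-up frame; the column names other than SalePrice, and all values, are illustrative assumptions. Correlation only captures linear association, so it complements the pairplot and does not replace it.

```python
import pandas as pd

# Hypothetical data: two numeric predictors, one text column, and the target
demo = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "Neighborhood": ["A", "B", "A", "C", "B"],
    "SalePrice": [100000, 155000, 195000, 260000, 290000],
})

# Keep only numeric columns; text columns cannot enter corr()
numeric = demo.select_dtypes(include="number")

# Correlation of each numeric variable with SalePrice, excluding SalePrice itself
corr_with_price = numeric.corr()["SalePrice"].drop("SalePrice")

# Rank by absolute value so strong negative correlations are not overlooked
ranked = corr_with_price.abs().sort_values(ascending=False)

print(ranked)
```

Applied to houseData, the same three steps would give a ranked list of candidate predictors to inspect more closely.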
Fix the data types of the first four variables to convert them to categorical.
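# Assumed definition of houseCatData: the four columns listed by info() below
houseCatData = houseData[['MSSubClass', 'Neighborhood', 'BldgType', 'OverallQual']].copy()

# Convert each variable to the categorical data type
houseCatData['MSSubClass'] = houseCatData['MSSubClass'].astype('category')
houseCatData['Neighborhood'] = houseCatData['Neighborhood'].astype('category')
houseCatData['BldgType'] = houseCatData['BldgType'].astype('category')
houseCatData['OverallQual'] = houseCatData['OverallQual'].astype('category')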
houseCatData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1460 non-null category
1 Neighborhood 1460 non-null category
2 BldgType 1460 non-null category
3 OverallQual 1460 non-null category
dtypes: category(4)
memory usage: 7.8 KB
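The same conversion can be written in a single call by passing a column-to-dtype mapping to astype. The sketch below shows this on a small made-up frame (demo and its values are hypothetical); it is an alternative spelling, not a change to the approach above.

```python
import pandas as pd

# Hypothetical stand-in for houseCatData before conversion
demo = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "BldgType": ["1Fam", "1Fam", "TwnhsE", "1Fam"],
    "OverallQual": [7, 6, 7, 8],
})

cat_cols = ["MSSubClass", "BldgType", "OverallQual"]

# astype accepts a dict mapping column names to target dtypes
demo = demo.astype({col: "category" for col in cat_cols})

print(demo.dtypes)
print(list(demo["OverallQual"].cat.categories))
```

Numeric codes such as MSSubClass and OverallQual keep their values as category labels, but they are no longer treated as quantities.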
Check the Variables Independently
Summary Statistics of houseCatData, followed by Statistical Visualizations on the variables.
houseCatData.describe()
MSSubClass OverallQual
count 1460.000000 1460.000000
mean 56.897260 6.099315
std 42.300571 1.382997
min 20.000000 1.000000
25% 20.000000 5.000000
50% 50.000000 6.000000
75% 70.000000 7.000000
max 190.000000 10.000000
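For columns stored as category, describe reports count, unique, top and freq in place of the mean and quartiles. The sketch below illustrates this on a small made-up frame; demo and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical data standing in for houseCatData
demo = pd.DataFrame({
    "BldgType": ["1Fam", "1Fam", "TwnhsE", "1Fam"],
    "OverallQual": [7, 6, 7, 8],
}).astype("category")

# With only categorical columns, describe() summarises levels, not magnitudes
summary = demo.describe()

print(summary)
```

Here unique is the number of levels, top is the most frequent level, and freq is how often that level occurs.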
sb.catplot(y = 'MSSubClass', data = houseCatData, kind = "count", height = 8)
sb.catplot(y = 'Neighborhood', data = houseCatData, kind = "count", height = 8)
sb.catplot(y = 'BldgType', data = houseCatData, kind = "count", height = 8)
sb.catplot(y = 'OverallQual', data = houseCatData, kind = "count", height = 8)
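The bar lengths in these count plots are the level frequencies, which value_counts returns directly. The sketch below shows the counts and the corresponding proportions on made-up data; the neighbourhood labels and their frequencies are illustrative only.

```python
import pandas as pd

# Hypothetical categorical column standing in for houseCatData['Neighborhood']
demo = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown", "NAmes", "CollgCr"],
}).astype("category")

# Number of rows at each level, most frequent first
counts = demo["Neighborhood"].value_counts()

# The same information as a share of all rows
shares = demo["Neighborhood"].value_counts(normalize=True)

print(counts)
print(shares)
```

Having the numbers alongside the plot helps when several levels have bars of similar length.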
Check the Relationship amongst Variables
Joint heatmaps of some of the important bi-variate relationships in houseCatData.
[Heatmap: BldgType (rows) against MSSubClass (columns)]
# Distribution of OverallQual across MSSubClass
f, axes = plt.subplots(1, 1, figsize=(20, 12))
sb.heatmap(houseCatData.groupby(['OverallQual', 'MSSubClass']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")
# Distribution of OverallQual across Neighborhood
f, axes = plt.subplots(1, 1, figsize=(20, 8))
sb.heatmap(houseCatData.groupby(['OverallQual', 'Neighborhood']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")
# Distribution of OverallQual across BldgType
f, axes = plt.subplots(1, 1, figsize=(20, 20))
sb.heatmap(houseCatData.groupby(['OverallQual', 'BldgType']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")
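Each heatmap above is drawn from a table of counts built with groupby, size and unstack. The sketch below builds such a table on made-up data and compares it with pd.crosstab, which produces the same contingency table in one call. The demo values are hypothetical, and fill_value=0 is added here so that absent combinations show as 0 rather than NaN.

```python
import pandas as pd

# Hypothetical pairs of levels standing in for two columns of houseCatData
demo = pd.DataFrame({
    "OverallQual": [5, 5, 6, 7, 7, 7],
    "BldgType": ["1Fam", "Duplex", "1Fam", "1Fam", "1Fam", "TwnhsE"],
})

# Count rows per (OverallQual, BldgType) pair, then pivot BldgType into columns
by_groupby = demo.groupby(["OverallQual", "BldgType"]).size().unstack(fill_value=0)

# Equivalent contingency table built directly
by_crosstab = pd.crosstab(demo["OverallQual"], demo["BldgType"])

print(by_groupby)
print(by_crosstab)
```

Either table can be passed to sb.heatmap; crosstab may read more naturally when only counts are needed.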
Check the effect of the Variables on SalePrice
Create a joint DataFrame by concatenating SalePrice to houseCatData.
saleprice = pd.DataFrame(houseData['SalePrice'])
houseCatSale = pd.concat([houseCatData, saleprice], axis = 1)
houseCatSale.head()
[Plots: SalePrice (y-axis) against MSSubClass, BldgType and OverallQual (x-axis)]
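To put numbers on how SalePrice varies across the levels of a categorical variable, a grouped summary can sit alongside the plots. The sketch below is one way to do this on made-up data; demo and its values are hypothetical stand-ins for houseCatSale.

```python
import pandas as pd

# Hypothetical stand-in for two columns of houseCatSale
demo = pd.DataFrame({
    "OverallQual": [5, 5, 6, 6, 7, 7],
    "SalePrice": [120000, 130000, 150000, 170000, 200000, 240000],
})

# Per-level count, median and range of SalePrice
summary = (
    demo.groupby("OverallQual")["SalePrice"]
        .agg(["count", "median", "min", "max"])
)

# How far apart the level medians are: a rough gauge of the variable's effect
spread_of_medians = summary["median"].max() - summary["median"].min()

print(summary)
print(spread_of_medians)
```

A variable whose level medians are far apart relative to the within-level range is a stronger candidate for predicting SalePrice.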
Which variables do you think will help us predict SalePrice in this dataset?
Bonus: Attempt a comprehensive analysis with all Categorical variables in the dataset.