Exercise3 Solution

IE0005 Exercise solutions 3

Uploaded by

Derrick


Exercise 3 : Exploratory Analysis

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python

Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset

Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

houseData = pd.read_csv('train.csv')
houseData.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...         0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]
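
Beyond head(), the shape, the data types and the number of missing values per column give a quick structural overview. The sketch below is only an illustration: `demo` is a hypothetical stand-in for houseData, built by hand from a few columns of the five rows shown above, so that the snippet runs without train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for houseData: a few columns of the first five rows
demo = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Alley": [np.nan] * 5,
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1"],
})

print(demo.shape)               # (number of rows, number of columns)
print(demo.dtypes)              # data type of each column
missing = demo.isnull().sum()   # count of missing values per column
print(missing)
```

On the real data, the same three calls on houseData should show which columns (such as Alley or PoolQC above) are mostly NaN.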

Problem 1 : Numeric Variables

Extract the required variables from the dataset, as mentioned in the problem.
LotArea, GrLivArea, TotalBsmtSF, GarageArea, SalePrice

houseNumData = pd.DataFrame(houseData[['LotArea', 'GrLivArea', 'TotalBsmtSF', 'GarageArea', 'SalePrice']])
houseNumData.head()

LotArea GrLivArea TotalBsmtSF GarageArea SalePrice

0 8450 1710 856 548 208500
1 9600 1262 1262 460 181500
2 11250 1786 920 608 223500
3 9550 1717 756 642 140000
4 14260 2198 1145 836 250000

Check the Variables Independently

Summary Statistics of houseNumData, followed by Statistical Visualizations on the variables.

houseNumData.describe()

             LotArea    GrLivArea  TotalBsmtSF   GarageArea      SalePrice
count    1460.000000  1460.000000  1460.000000  1460.000000    1460.000000
mean    10516.828082  1515.463699  1057.429452   472.980137  180921.195890
std      9981.264932   525.480383   438.705324   213.804841   79442.502883
min      1300.000000   334.000000     0.000000     0.000000   34900.000000
25%      7553.500000  1129.500000   795.750000   334.500000  129975.000000
50%      9478.500000  1464.000000   991.500000   480.000000  163000.000000
75%     11601.500000  1776.750000  1298.250000   576.000000  214000.000000
max    215245.000000  5642.000000  6110.000000  1418.000000  755000.000000

# Draw the distributions of all variables
# One row per variable: boxplot, histogram and violin plot side by side
f, axes = plt.subplots(5, 3, figsize=(18, 20))
colors = ["r", "g", "b", "m", "c"]

count = 0
for var in houseNumData:
    sb.boxplot(data = houseNumData[var], orient = "h", color = colors[count], ax = axes[count,0])
    sb.histplot(data = houseNumData[var], color = colors[count], ax = axes[count,1])
    sb.violinplot(data = houseNumData[var], orient = "h", color = colors[count], ax = axes[count,2])
    count += 1

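
The boxplots flag points lying more than 1.5 times the interquartile range beyond the quartiles (the Seaborn default), and the histograms hint at skewness. Both can also be put into numbers. This is a minimal sketch under assumptions: `demo` is a hypothetical stand-in for houseNumData, made of the five rows shown earlier plus a sixth illustrative row built from the column maxima reported by describe().

```python
import pandas as pd

# Hypothetical stand-in for houseNumData (illustrative values only)
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 215245],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 5642],
})

q1 = demo.quantile(0.25)   # first quartile of each column
q3 = demo.quantile(0.75)   # third quartile of each column
iqr = q3 - q1              # interquartile range of each column

# A point is flagged when it lies more than 1.5 * IQR beyond the quartiles
is_outlier = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)

# Skewness: a positive value indicates a long right tail
skewness = demo.skew()
print(skewness)
```

Replacing `demo` with houseNumData should give the outlier count and skewness for each of the five variables.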
Check the Relationship amongst Variables
Correlation between the variables, followed by all bi-variate jointplots.

# Correlation Matrix
print(houseNumData.corr())

# Heatmap of the Correlation Matrix

f, axes = plt.subplots(1, 1, figsize=(10, 10))
sb.heatmap(houseNumData.corr(), vmin = -1, vmax = 1, linewidths = 1,
           annot = True, fmt = ".2f", annot_kws = {"size": 14}, cmap = "RdBu")

LotArea GrLivArea TotalBsmtSF GarageArea SalePrice

LotArea 1.000000 0.263116 0.260833 0.180403 0.263843
GrLivArea 0.263116 1.000000 0.454868 0.468997 0.708624
TotalBsmtSF 0.260833 0.454868 1.000000 0.486665 0.613581
GarageArea 0.180403 0.468997 0.486665 1.000000 0.623431
SalePrice 0.263843 0.708624 0.613581 0.623431 1.000000

<AxesSubplot:>
# Draw pairs of variables against one another
sb.pairplot(data = houseNumData)

<seaborn.axisgrid.PairGrid at 0x213ff077a60>

Which variables do you think will help us predict SalePrice in this dataset?

GrLivArea : Possibly the most important variable : Highest correlation with SalePrice (0.71), Strong Linearity
GarageArea and TotalBsmtSF : Important variables : High correlation with SalePrice (0.62 and 0.61), Strong Linearity
LotArea : Doesn't seem so important as a variable : Low correlation with SalePrice (0.26), Weak Linear Relation

Bonus : Attempt a comprehensive analysis with all Numeric variables in the dataset.
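
One possible starting point for the bonus is to select every numeric column and rank it by the strength of its correlation with SalePrice. The sketch below assumes a hypothetical stand-in `demo` for houseData (values from the first five rows shown earlier); it is an illustration, not the official solution.

```python
import pandas as pd

# Hypothetical stand-in for houseData: three numeric columns and one text column
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "MSZoning": ["RL"] * 5,
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Keep only the numeric columns; text columns such as MSZoning are dropped
numeric = demo.select_dtypes(include="number")

# Correlation of every numeric variable with SalePrice, strongest first
corr_with_price = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_price)
```

On the full dataset, keep in mind that some numeric columns (Id, MSSubClass) are really labels, so their correlation with SalePrice is not meaningful.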

Problem 2 : Categorical Variables

Extract the required variables from the dataset, as mentioned in the problem.
MSSubClass, Neighborhood, BldgType, OverallQual

houseCatData = pd.DataFrame(houseData[['MSSubClass', 'Neighborhood', 'BldgType', 'OverallQual']])
houseCatData.head()

MSSubClass Neighborhood BldgType OverallQual

0 60 CollgCr 1Fam 7
1 20 Veenker 1Fam 6
2 60 CollgCr 1Fam 7
3 70 Crawfor 1Fam 7
4 60 NoRidge 1Fam 8

Fix the data types of all four variables by converting them to categorical.

houseCatData['MSSubClass'] = houseCatData['MSSubClass'].astype('category')
houseCatData['Neighborhood'] = houseCatData['Neighborhood'].astype('category')
houseCatData['BldgType'] = houseCatData['BldgType'].astype('category')
houseCatData['OverallQual'] = houseCatData['OverallQual'].astype('category')

houseCatData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1460 non-null category
1 Neighborhood 1460 non-null category
2 BldgType 1460 non-null category
3 OverallQual 1460 non-null category
dtypes: category(4)
memory usage: 7.8 KB
Check the Variables Independently
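Summary Statistics of houseCatData, followed by Statistical Visualizations on the variables.
Note that the numeric summary shown here treats MSSubClass and OverallQual as integers; for columns of category dtype, describe() reports count, unique, top and freq instead, and the mean or std of such codes has no real meaning.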

houseCatData.describe()

MSSubClass OverallQual
count 1460.000000 1460.000000
mean 56.897260 6.099315
std 42.300571 1.382997
min 20.000000 1.000000
25% 20.000000 5.000000
50% 50.000000 6.000000
75% 70.000000 7.000000
max 190.000000 10.000000
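
For categorical columns, the useful summary is the number of levels and the most frequent level. A minimal sketch, assuming a hypothetical stand-in `demo` for houseCatData built from the five rows shown above:

```python
import pandas as pd

# Hypothetical stand-in for houseCatData, converted to category dtype
demo = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70, 60],
    "BldgType": ["1Fam"] * 5,
}).astype("category")

# For category columns, describe() returns count, unique, top and freq
summary = demo.describe()
print(summary)

# Frequency of each level of a single variable
level_counts = demo["MSSubClass"].value_counts()
print(level_counts)
```

Here `top` is the most frequent level and `freq` is how many times it occurs.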

sb.catplot(y = 'MSSubClass', data = houseCatData, kind = "count", height = 8)

<seaborn.axisgrid.FacetGrid at 0x213fe3f59f0>
sb.catplot(y = 'Neighborhood', data = houseCatData, kind = "count", height = 8)

<seaborn.axisgrid.FacetGrid at 0x213ff151d80>
sb.catplot(y = 'BldgType', data = houseCatData, kind = "count", height = 8)

<seaborn.axisgrid.FacetGrid at 0x213800ff0d0>
sb.catplot(y = 'OverallQual', data = houseCatData, kind = "count", height = 8)

<seaborn.axisgrid.FacetGrid at 0x213fe2b08b0>
Check the Relationship amongst Variables
Joint heatmaps of some of the important bi-variate relationships in houseCatData.

# Distribution of BldgType across MSSubClass

f, axes = plt.subplots(1, 1, figsize=(20, 8))
sb.heatmap(houseCatData.groupby(['BldgType', 'MSSubClass']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<AxesSubplot:xlabel='MSSubClass', ylabel='BldgType'>
# Distribution of OverallQual across MSSubClass
f, axes = plt.subplots(1, 1, figsize=(20, 12))
sb.heatmap(houseCatData.groupby(['OverallQual', 'MSSubClass']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<AxesSubplot:xlabel='MSSubClass', ylabel='OverallQual'>
# Distribution of OverallQual across Neighborhood
f, axes = plt.subplots(1, 1, figsize=(20, 8))
sb.heatmap(houseCatData.groupby(['OverallQual', 'Neighborhood']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<AxesSubplot:xlabel='Neighborhood', ylabel='OverallQual'>
# Distribution of OverallQual across BldgType
f, axes = plt.subplots(1, 1, figsize=(20, 20))
sb.heatmap(houseCatData.groupby(['OverallQual', 'BldgType']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<AxesSubplot:xlabel='BldgType', ylabel='OverallQual'>
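
The count tables behind these heatmaps come from groupby, size and unstack. As a possible alternative, pd.crosstab builds the same kind of two-way count table in one call and fills empty combinations with 0. The sketch assumes a hypothetical stand-in `demo` for houseCatData, using the five rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for houseCatData (first five rows only)
demo = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70, 60],
    "OverallQual": [7, 6, 7, 7, 8],
})

# Rows: levels of OverallQual, columns: levels of MSSubClass, cells: counts
table = pd.crosstab(demo["OverallQual"], demo["MSSubClass"])
print(table)
```

A table like this can be passed to sb.heatmap in the same way as the grouped counts.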
Check the effect of the Variables on SalePrice
Create a joint DataFrame by concatenating SalePrice to houseCatData.

saleprice = pd.DataFrame(houseData['SalePrice'])
houseCatSale = pd.concat([houseCatData, saleprice], axis = 1)
houseCatSale.head()

MSSubClass Neighborhood BldgType OverallQual SalePrice

0 60 CollgCr 1Fam 7 208500
1 20 Veenker 1Fam 6 181500
2 60 CollgCr 1Fam 7 223500
3 70 Crawfor 1Fam 7 140000
4 60 NoRidge 1Fam 8 250000

Check the distribution of SalePrice across different MSSubClass.

f, axes = plt.subplots(1, 1, figsize=(16, 8))

sb.boxplot(x = 'MSSubClass', y = 'SalePrice', data = houseCatSale)

<AxesSubplot:xlabel='MSSubClass', ylabel='SalePrice'>

Check the distribution of SalePrice across different Neighborhood.

f, axes = plt.subplots(1, 1, figsize=(16, 8))

sb.boxplot(x = 'Neighborhood', y = 'SalePrice', data = houseCatSale)
plt.xticks(rotation=90);
Check the distribution of SalePrice across different BldgType.

f, axes = plt.subplots(1, 1, figsize=(16, 8))

sb.boxplot(x = 'BldgType', y = 'SalePrice', data = houseCatSale)

<AxesSubplot:xlabel='BldgType', ylabel='SalePrice'>

Check the distribution of SalePrice across different OverallQual.

f, axes = plt.subplots(1, 1, figsize=(16, 8))
sb.boxplot(x = 'OverallQual', y = 'SalePrice', data = houseCatSale)

<AxesSubplot:xlabel='OverallQual', ylabel='SalePrice'>

Which variables do you think will help us predict SalePrice in this dataset?

OverallQual : Definitely the most important variable : Highest variation in SalePrice across the levels
Neighborhood and MSSubClass : Moderately important variables : Medium variation in SalePrice across levels
BldgType : Not clear if important as a variable at all : Not much variation in SalePrice across the levels

Bonus : Attempt a comprehensive analysis with all Categorical variables in the dataset.
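
To compare many categorical variables without reading every boxplot, one rough measure of "variation in SalePrice across the levels" is the range of the median SalePrice per level. This is only a sketch of one possible measure: `median_range` is a hypothetical helper, and `demo` is a hypothetical stand-in for houseCatSale built from the five rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for houseCatSale (first five rows only)
demo = pd.DataFrame({
    "OverallQual": [7, 6, 7, 7, 8],
    "BldgType": ["1Fam"] * 5,
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

def median_range(df, var, target="SalePrice"):
    """Difference between the highest and lowest per-level median of target."""
    medians = df.groupby(var)[target].median()
    return medians.max() - medians.min()

# A larger range suggests the variable separates SalePrice more strongly
spread = {var: median_range(demo, var) for var in ["OverallQual", "BldgType"]}
print(spread)
```

With only one BldgType level in these five rows the range is 0; on the full dataset the values will differ, and levels with very few houses can make the medians unstable.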
