Exercise4 Solution

IE0005 Exercise solutions 4

Uploaded by

Derrick
Copyright: © All Rights Reserved

Exercise 4 : Linear Regression

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

houseData = pd.read_csv('train.csv')
houseData.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...         0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]

Problem 1 : Predicting SalePrice using GrLivArea


Extract the required variables from the dataset, as mentioned in the problem.

houseGrLivArea = pd.DataFrame(houseData['GrLivArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])

Plot houseSalePrice against houseGrLivArea using standard JointPlot.

trainDF = pd.concat([houseGrLivArea, houseSalePrice], axis = 1).reindex(houseGrLivArea.index)
sb.jointplot(data=trainDF, x='GrLivArea', y='SalePrice', height = 12)

<seaborn.axisgrid.JointGrid at 0x25b36d1db50>
Import the LinearRegression model from sklearn.linear_model.

# Import essential models and functions from sklearn


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a Linear Regression object


linreg = LinearRegression()

Prepare both datasets by splitting them into Train and Test sets.

Train Set with 1100 samples and Test Set with 360 samples.
# Split the dataset into Train and Test
#houseGrLivArea_train = pd.DataFrame(houseGrLivArea[:1100])
#houseGrLivArea_test = pd.DataFrame(houseGrLivArea[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])

# Split the Dataset into Train and Test
(houseGrLivArea_train, houseGrLivArea_test,
 houseSalePrice_train, houseSalePrice_test) = train_test_split(
    houseGrLivArea, houseSalePrice, test_size = 360/1460)

# Check the sample sizes


print("Train Set :", houseGrLivArea_train.shape,
houseSalePrice_train.shape)
print("Test Set :", houseGrLivArea_test.shape,
houseSalePrice_test.shape)

Train Set : (1100, 1) (1100, 1)


Test Set : (360, 1) (360, 1)
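
Because train_test_split shuffles the rows, the coefficients and R^2 values reported in this exercise change from run to run. The following is a minimal sketch, on synthetic stand-in data rather than train.csv, of how an integer test_size and a fixed random_state (both standard train_test_split parameters) give the same 1100/360 split every time. The data values and the seed 42 are arbitrary assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1460-row housing data (values are made up)
rng = np.random.default_rng(0)
X = pd.DataFrame({"GrLivArea": rng.uniform(500, 4000, size=1460)})
y = pd.DataFrame({"SalePrice": 20000 + 100 * X["GrLivArea"]
                  + rng.normal(0, 30000, size=1460)})

# An integer test_size requests exactly that many test samples;
# random_state fixes the shuffle so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=360, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=360, random_state=42)

print("Train Set :", X_train.shape, y_train.shape)
print("Test Set  :", X_test.shape, y_test.shape)
print("Same split on rerun :", X_test.index.equals(X_test2.index))
```

With a fixed seed, differences between two models reflect the predictors rather than the luck of the split.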

Fit Linear Regression model on houseGrLivArea_train and houseSalePrice_train

linreg.fit(houseGrLivArea_train, houseSalePrice_train)

LinearRegression()

Visual Representation of the Linear Regression Model


Check the coefficients of the Linear Regression model you just fit.

print('Intercept \t: b = ', linreg.intercept_)


print('Coefficients \t: a = ', linreg.coef_)

Intercept : b = [16608.46906887]
Coefficients : a = [[108.93750382]]
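
The intercept b and coefficient a define the fitted line SalePrice = b + a * GrLivArea. As a rough sketch on synthetic data (the numbers below are invented, and the 1500 sq ft house is a hypothetical input), the prediction from this formula can be checked against linreg.predict:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (values are made up for illustration)
rng = np.random.default_rng(1)
X = pd.DataFrame({"GrLivArea": rng.uniform(500, 4000, size=200)})
y = pd.DataFrame({"SalePrice": 15000 + 110 * X["GrLivArea"]
                  + rng.normal(0, 20000, size=200)})

linreg = LinearRegression()
linreg.fit(X, y)

# With a one-column DataFrame response, intercept_ has shape (1,)
# and coef_ has shape (1, 1), hence the indexing below
b = linreg.intercept_[0]
a = linreg.coef_[0, 0]

# Prediction for a hypothetical 1500 sq ft house, computed two ways
area = 1500
manual = b + a * area
from_model = linreg.predict(pd.DataFrame({"GrLivArea": [area]}))[0, 0]
print("Manual :", manual, " Model :", from_model)
```

The nested brackets in the printed output above come from these array shapes, not from the model having several coefficients.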

Plot the regression line based on the coefficients-intercept form.

# Formula for the Regression line


# Alternative code for below: regline_x = houseGrLivArea_train.to_numpy()
regline_x = houseGrLivArea_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x

# Plot the Linear Regression line


f = plt.figure(figsize=(16, 8))
plt.scatter(houseGrLivArea_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), "r-", linewidth =
3)
plt.show()

Prediction of Response based on the Predictor


Predict SalePrice given GrLivArea in the Test dataset.

# Predict SalePrice values corresponding to GrLivArea


houseSalePrice_test_pred = linreg.predict(houseGrLivArea_test)

# Plot the Predictions


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(houseGrLivArea_test, houseSalePrice_test, color = "green")
plt.scatter(houseGrLivArea_test, houseSalePrice_test_pred, color =
"red")
plt.show()
Goodness of Fit of the Linear Regression Model
Check how good the predictions are on the Train Set.
Metric : Explained Variance or R^2 on the Train Set.

print("Explained Variance (R^2) \t:",
      linreg.score(houseGrLivArea_train, houseSalePrice_train))

Explained Variance (R^2) : 0.5083530900052877

Check how good the predictions are on the Test Set.


Metric : Explained Variance or R^2 on the Test Set.

print("Explained Variance (R^2) \t:",
      linreg.score(houseGrLivArea_test, houseSalePrice_test))

Explained Variance (R^2) : 0.4778441365777981
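
For a LinearRegression model, score returns R^2 = 1 - SS_res / SS_tot. The sketch below, on synthetic data with invented values, recomputes R^2 by hand and also reports the Mean Squared Error using mean_squared_error, which was imported earlier but is not otherwise used in this solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (values are made up for illustration)
rng = np.random.default_rng(2)
x = rng.uniform(500, 4000, size=300).reshape(-1, 1)
y = 15000 + 110 * x[:, 0] + rng.normal(0, 40000, size=300)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# MSE is the residual sum of squares averaged over the samples
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)

print("R^2 (manual)  :", r2_manual)
print("R^2 (score)   :", model.score(x, y))
print("MSE           :", mse)
print("RMSE          :", rmse)
```

RMSE is in the same units as the response, which can make it easier to read than R^2 when judging how far off a typical prediction is.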

You should also try the following


• Convert SalePrice to log(SalePrice) in the beginning and then use it for Regression
  Code : houseSalePrice = pd.DataFrame(np.log(houseData['SalePrice']))

• Perform a Random Train-Test Split on the dataset before you start with the Regression
  Note : Check the preparation notebook M3 LinearRegression.ipynb for the code
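
A possible sketch of the log(SalePrice) suggestion, using synthetic data in place of train.csv. The column name LogSalePrice, the generated prices, and the 25% test fraction are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up prices with multiplicative noise, so log(price) is roughly linear in area
rng = np.random.default_rng(3)
area = rng.uniform(500, 4000, size=1000)
price = np.exp(11 + 0.0004 * area + rng.normal(0, 0.2, size=1000))

X = pd.DataFrame({"GrLivArea": area})
y_log = pd.DataFrame({"LogSalePrice": np.log(price)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.25, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

# score() is R^2 on the log scale here, so it is not directly
# comparable to an R^2 computed on raw prices
r2_log = linreg.score(X_test, y_test)
print("R^2 on log scale (Test) :", r2_log)

# Predictions come out in log units; np.exp maps them back to prices
pred_price = np.exp(linreg.predict(X_test))
print("First predicted prices  :", pred_price[:3, 0])
```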
Problem 2 : Predicting SalePrice using LotArea
Extract the required variables from the dataset, as mentioned in the problem.

housePredictor = pd.DataFrame(houseData['LotArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])

trainDF = pd.concat([housePredictor, houseSalePrice], axis = 1).reindex(housePredictor.index)
sb.jointplot(data=trainDF, x='LotArea', y='SalePrice', height = 12)

<seaborn.axisgrid.JointGrid at 0x25b374c2e50>
Linear Regression on SalePrice vs Predictor
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test


#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])

# Split the Dataset into Train and Test


(housePredictor_train, housePredictor_test,
 houseSalePrice_train, houseSalePrice_test) = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)

# Create a Linear Regression object


linreg = LinearRegression()

# Train the Linear Regression model


linreg.fit(housePredictor_train, houseSalePrice_train)

LinearRegression()

Visual Representation of the Linear Regression Model


print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line


# Alternative code for below: regline_x = housePredictor_train.to_numpy()
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x

# Plot the Linear Regression line


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth =
3)
plt.show()

Intercept : b = [162100.36069809]
Coefficients : a = [[1.83081318]]
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to LotArea
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color =
"red")
plt.show()

Goodness of Fit of the Linear Regression Model


print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))

Explained Variance (R^2) on Train Set : 0.062378129532616566


Explained Variance (R^2) on Test Set : 0.08979920486099247

Problem 2 : Predicting SalePrice using TotalBsmtSF


Extract the required variables from the dataset, as mentioned in the problem.

housePredictor = pd.DataFrame(houseData['TotalBsmtSF'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
trainDF = pd.concat([housePredictor, houseSalePrice], axis = 1).reindex(housePredictor.index)
sb.jointplot(data=trainDF, x='TotalBsmtSF', y='SalePrice', height = 12)

<seaborn.axisgrid.JointGrid at 0x25b384d3a60>

Linear Regression on SalePrice vs Predictor


# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression
# Split the dataset into Train and Test
#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])

# Split the Dataset into Train and Test


(housePredictor_train, housePredictor_test,
 houseSalePrice_train, houseSalePrice_test) = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)

# Create a Linear Regression object


linreg = LinearRegression()

# Train the Linear Regression model


linreg.fit(housePredictor_train, houseSalePrice_train)

LinearRegression()

Visual Representation of the Linear Regression Model


print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line


# Alternative code for below: regline_x = housePredictor_train.to_numpy()
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x

# Plot the Linear Regression line


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth =
3)
plt.show()

Intercept : b = [66517.88779439]
Coefficients : a = [[107.22623803]]
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to TotalBsmtSF
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color =
"red")
plt.show()
Goodness of Fit of the Linear Regression Model
print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))

Explained Variance (R^2) on Train Set : 0.37367019826026027


Explained Variance (R^2) on Test Set : 0.38123042547864716

Problem 2 : Predicting SalePrice using GarageArea


Extract the required variables from the dataset, as mentioned in the problem.

housePredictor = pd.DataFrame(houseData['GarageArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])

trainDF = pd.concat([housePredictor, houseSalePrice], axis = 1).reindex(housePredictor.index)
sb.jointplot(data=trainDF, x='GarageArea', y='SalePrice', height = 12)

<seaborn.axisgrid.JointGrid at 0x25b39697280>
Linear Regression on SalePrice vs Predictor
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test


#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])

# Split the Dataset into Train and Test


(housePredictor_train, housePredictor_test,
 houseSalePrice_train, houseSalePrice_test) = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)

# Create a Linear Regression object


linreg = LinearRegression()

# Train the Linear Regression model


linreg.fit(housePredictor_train, houseSalePrice_train)

LinearRegression()

Visual Representation of the Linear Regression Model


print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line


# Alternative code for below: regline_x = housePredictor_train.to_numpy()
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * regline_x

# Plot the Linear Regression line


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth =
3)
plt.show()

Intercept : b = [73619.02664658]
Coefficients : a = [[227.27199019]]
Prediction of Response based on the Predictor
# Predict SalePrice values corresponding to GarageArea
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions


f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color =
"red")
plt.show()

Goodness of Fit of the Linear Regression Model


print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))

Explained Variance (R^2) on Train Set : 0.3949854232045428


Explained Variance (R^2) on Test Set : 0.37249611550040873

Extra : Predicting SalePrice using Multiple Variables


Extract the required variables from the dataset, and then perform Multi-Variate Regression.

housePredictor = pd.DataFrame(houseData[['GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])

Linear Regression on SalePrice vs Predictor


# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test


#housePredictor_train = pd.DataFrame(housePredictor[:1100])
#housePredictor_test = pd.DataFrame(housePredictor[-360:])
#houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
#houseSalePrice_test = pd.DataFrame(houseSalePrice[-360:])

# Split the Dataset into Train and Test


(housePredictor_train, housePredictor_test,
 houseSalePrice_train, houseSalePrice_test) = train_test_split(
    housePredictor, houseSalePrice, test_size = 360/1460)

# Create a Linear Regression object


linreg = LinearRegression()

# Train the Linear Regression model


linreg.fit(housePredictor_train, houseSalePrice_train)

LinearRegression()

Coefficients of the Linear Regression Model


Note that you CANNOT visualize the model as a line on a 2D plot, as it is a multi-dimensional
surface.

print('Intercept \t: b = ', linreg.intercept_)


print('Coefficients \t: a = ', linreg.coef_)

Intercept : b = [-18973.0369382]
Coefficients : a = [[70.07255406 0.19864354 43.29137855
96.31552172]]
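
The coefficients are printed in the same order as the predictor columns, which is easy to misread in a bare array. One way to label them, sketched here on synthetic data (the generated values and the name coef_table are assumptions for illustration), is to wrap them in a pandas Series indexed by the column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the four predictors (values are made up)
rng = np.random.default_rng(4)
cols = ["GrLivArea", "LotArea", "TotalBsmtSF", "GarageArea"]
X = pd.DataFrame(rng.uniform(0, 3000, size=(500, 4)), columns=cols)
true_coef = np.array([70.0, 0.2, 45.0, 95.0])
y = pd.DataFrame({"SalePrice": X.to_numpy() @ true_coef
                  + rng.normal(0, 10000, size=500)})

linreg = LinearRegression().fit(X, y)

# coef_ has shape (1, 4) for a one-column response; row 0 holds
# one coefficient per predictor, in column order
coef_table = pd.Series(linreg.coef_[0], index=X.columns, name="Coefficient")
print(coef_table)
```

Each coefficient is the change in predicted SalePrice per unit change in that predictor with the others held fixed, so its size depends on the predictor's units and is not by itself a measure of importance.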

Prediction of Response based on the Predictor


# Predict SalePrice values corresponding to the Predictors
houseSalePrice_train_pred = linreg.predict(housePredictor_train)
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions vs the True values


f, axes = plt.subplots(1, 2, figsize=(24, 12))
axes[0].scatter(houseSalePrice_train, houseSalePrice_train_pred, color = "blue")
axes[0].plot(houseSalePrice_train.to_numpy(), houseSalePrice_train.to_numpy(), 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(houseSalePrice_test, houseSalePrice_test_pred, color = "green")
axes[1].plot(houseSalePrice_test.to_numpy(), houseSalePrice_test.to_numpy(), 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()

Goodness of Fit of the Linear Regression Model


print("Explained Variance (R^2) on Train Set \t:",
linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:",
linreg.score(housePredictor_test, houseSalePrice_test))

Explained Variance (R^2) on Train Set : 0.6503512298667333


Explained Variance (R^2) on Test Set : 0.6970077531306746

Interpretation and Discussion


Now that you have performed Linear Regression of SalePrice against the four variables
GrLivArea, LotArea, TotalBsmtSF and GarageArea, compare and contrast the Explained
Variance (R^2) of the models to determine which one is best for predicting SalePrice. What do
you think?
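
One caveat when comparing: each model above was fit on a different random split, so part of the difference in R^2 is due to the split itself. A minimal sketch of a fairer comparison, on synthetic data with only two invented predictors, fits every model on the same Train and Test rows and collects the scores in one table. The names results and summary, the seed, and the generated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (values are made up for illustration)
rng = np.random.default_rng(5)
n = 1460
data = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 4000, size=n),
    "LotArea": rng.uniform(2000, 20000, size=n),
})
data["SalePrice"] = (20000 + 100 * data["GrLivArea"]
                     + 0.5 * data["LotArea"] + rng.normal(0, 40000, size=n))

# Split once, then reuse the same rows for every model
train, test = train_test_split(data, test_size=360, random_state=0)

results = {}
for predictors in (["GrLivArea"], ["LotArea"], ["GrLivArea", "LotArea"]):
    model = LinearRegression().fit(train[predictors], train["SalePrice"])
    results[" + ".join(predictors)] = {
        "R2_train": model.score(train[predictors], train["SalePrice"]),
        "R2_test": model.score(test[predictors], test["SalePrice"]),
    }

# One row per model, one column per metric
summary = pd.DataFrame(results).T
print(summary)
```

On the Train set, adding predictors can never lower R^2 for ordinary least squares, so the Test column is the more informative one when deciding between models.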
