Linear Regression with Python
This is mostly just code for reference.
Your neighbor is a real estate agent and wants some help predicting housing prices for regions in the USA. It would be great if you could somehow create a model for her that allows her to put in a few features of a house and returns an estimate of what the house would sell for.
She has asked you if you could help her out with your new data science skills. You say yes, and decide that Linear Regression might be a good path to solve this problem!
Your neighbor then gives you some information about a bunch of houses in regions of the United States; it is all in the data set: USA_Housing.csv.
The data contains the following columns:
'Avg. Area Income': Avg. Income of residents of the city house is located in.
'Avg. Area House Age': Avg Age of Houses in same city
'Avg. Area Number of Rooms': Avg Number of Rooms for Houses in same city
'Avg. Area Number of Bedrooms': Avg Number of Bedrooms for Houses in same city
'Area Population': Population of city house is located in
'Price': Price that the house sold at
'Address': Address for the house
Let's get started!
Check out the data
We've been able to get some data from your neighbor for housing prices as a CSV file, so let's get our environment ready with the libraries we'll need and then import the data!
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from google.colab import files
f = files.upload()  # opens a Colab file picker; choose USA_Housing.csv
Saving USA_Housing.csv to USA_Housing.csv
Check out the Data
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price       Address
0      79545.458574             5.682861                   7.009188                          4.09     23086.800503  1.059034e+06  208 Michael …
1      79248.642455             6.002900                   6.730821                          3.09     40173.072174  1.505891e+06  188 Johns …
2      61287.067179             5.865890                   8.512727                          5.13     36882.159400  1.058988e+06  9127 … Stravenue …
USAhousing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
USAhousing.describe()
       Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price
count       5000.000000          5000.000000                5000.000000                   5000.000000      5000.000000  5.000000e+03
mean       68583.108984             5.977222                   6.987792                      3.981330     36163.516039  1.232073e+06
std        10657.991214             0.991456                   1.005833                      1.234137      9925.650114  3.531176e+05
min        17796.631190             2.644304                   3.236194                      2.000000       172.610686  1.593866e+04
25%        61480.562388             5.322283                   6.299250                      3.140000     29403.928702  9.975771e+05
50%        68804.286404             5.970429                   7.002902                      4.050000     36199.406689  1.232669e+06
75%        75783.338666             6.650808                   7.665871                      4.490000     42861.290769  1.471210e+06
USAhousing.columns
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
EDA
Let's create some simple plots to check out the data!
sns.pairplot(USAhousing)
<seaborn.axisgrid.PairGrid at 0x7f3687566978>
sns.distplot(USAhousing['Price'])  # distplot is deprecated in recent seaborn; sns.histplot(USAhousing['Price'], kde=True) is the modern equivalent
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2551: FutureWarning:
warnings.warn(msg, FutureWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0a81e26518>
sns.heatmap(USAhousing.corr(),cmap="coolwarm",annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0a80487e80>
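One caveat worth noting: on recent pandas releases, DataFrame.corr() no longer silently drops non-numeric columns like Address (pandas 2.0 raises an error). A minimal sketch of the same heatmap that works on those versions:

# On pandas >= 1.5, restrict the correlation to numeric columns explicitly
sns.heatmap(USAhousing.corr(numeric_only=True), cmap="coolwarm", annot=True)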
Training a Linear Regression Model
Let's now begin to train our regression model! We will first need to split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. We will toss out the Address column because it only has text info that the linear regression model can't use.
X and y arrays
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
Train Test Split
Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Creating and Training the Model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Model Evaluation
https://colab.research.google.com/drive/1Hfw_WgnV6HjrEFPdL7n6OT35Z5NyJ5Zf#printMode=true 5/10
11/30/2020 01.multiple linear regression.ipynb - Colaboratory
Let's evaluate the model by checking out its coefficients and how we can interpret them.
$$y = mx + c$$
$$y = m_1 x_1 + m_2 x_2 + m_3 x_3 + m_4 x_4 + m_5 x_5 + c$$
# print the intercept
print(lm.intercept_)
-2641372.6673013503
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Coefficient
Avg. Area Income 21.617635
Avg. Area House Age 165221.119872
Avg. Area Number of Rooms 121405.376596
Avg. Area Number of Bedrooms 1318.718783
Area Population 15.225196
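To connect these numbers back to the equation above: a prediction is just the intercept plus the dot product of the coefficients with a feature row. A quick sanity check, using the lm and X_test objects already defined in this notebook:

# Reconstruct the model's prediction for the first test row by hand:
# y_hat = c + m1*x1 + m2*x2 + m3*x3 + m4*x4 + m5*x5
row = X_test.iloc[0].values
manual = lm.intercept_ + np.dot(lm.coef_, row)
print(manual, lm.predict(X_test.iloc[[0]])[0])  # the two values should match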
Interpreting the coefficients:
Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an increase of $21.62.
Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with an increase of $165221.12.
Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with an increase of $121405.38.
Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with an increase of $1318.72.
Holding all other features fixed, a 1 unit increase in Area Population is associated with an increase of $15.23.
Does this make sense? Probably not, because I made up this data. If you want real data to repeat this sort of analysis, check out the Boston dataset:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)  # wrap the raw feature array in a DataFrame
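Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. On a modern install, the California housing data makes a good substitute; a minimal sketch:

from sklearn.datasets import fetch_california_housing

# Downloads the data on first call; as_frame=True returns pandas objects
housing = fetch_california_housing(as_frame=True)
print(housing.DESCR)
housing_df = housing.frame  # features plus the MedHouseVal target column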
Predictions from our Model
Let's grab predictions off our test set and see how well it did!
predictions = lm.predict(X_test)
lm.predict([[79545.458574, 5.682861, 7.009188, 4.09, 23086.800503]])  # predict the price of a single house from its five feature values
array([1224988.39965275])
plt.scatter(y_test,predictions)
<matplotlib.collections.PathCollection at 0x7f0a7a760da0>
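A tight diagonal cloud here means the predictions track the true prices well. An optional touch-up (not in the original notebook): overlaying the ideal predictions-equal-truth line makes over- and under-prediction easier to spot:

# Redraw the scatter with a y = x reference line and axis labels
plt.scatter(y_test, predictions)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('True Price')
plt.ylabel('Predicted Price')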
Residual Histogram
sns.distplot((y_test-predictions),bins=50);
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2551: FutureWarning:
warnings.warn(msg, FutureWarning)
Regression Evaluation Metrics
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Comparing these metrics:
MAE is the easiest to understand, because it's the average error.
MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be
useful in the real world.
RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
All of these are loss functions, because we want to minimize them.
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE: 81257.55795855916
MSE: 10169125565.897552
RMSE: 100842.08231635022
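As a sanity check, the same three numbers can be computed directly from the formulas above with plain NumPy, using the y_test and predictions arrays already defined:

errors = y_test - predictions

# Each line mirrors one of the formulas above
print('MAE: ', np.mean(np.abs(errors)))
print('MSE: ', np.mean(errors**2))
print('RMSE:', np.sqrt(np.mean(errors**2)))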
# Checking the R^2 score on the training and test sets
print('Train Score: ', lm.score(X_train, y_train))
print('Test Score: ', lm.score(X_test, y_test))
Train Score: 0.9181223200568411
Test Score: 0.9176824009649299
Up next is your own Machine Learning Project!
Great Job!
Backward elimination
import statsmodels.api as sm

# Append a column of ones so the OLS model gets an intercept term
x = np.append(arr=np.ones((5000, 1)).astype(int), values=X, axis=1)
x_opt = x[:, [0, 1, 2, 3, 4, 5]]  # all five features plus the constant
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.918
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 1.119e+04
Date: Sat, 28 Nov 2020 Prob (F-statistic): 0.00
Time: 15:48:10 Log-Likelihood: -64714.
No. Observations: 5000 AIC: 1.294e+05
Df Residuals: 4994 BIC: 1.295e+05
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -2.637e+06 1.72e+04 -153.708 0.000 -2.67e+06 -2.6e+06
x1 21.5780 0.134 160.656 0.000 21.315 21.841
x2 1.656e+05 1443.413 114.754 0.000 1.63e+05 1.68e+05
x3 1.207e+05 1605.160 75.170 0.000 1.18e+05 1.24e+05
x4 1651.1391 1308.671 1.262 0.207 -914.431 4216.709
x5 15.2007 0.144 105.393 0.000 14.918 15.483
==============================================================================
Omnibus: 5.580 Durbin-Watson: 2.005
Prob(Omnibus): 0.061 Jarque-Bera (JB): 4.959
Skew: 0.011 Prob(JB): 0.0838
Kurtosis: 2.847 Cond. No. 9.40e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly spec
[2] The condition number is large, 9.4e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
x_opt = x[:, [0, 1, 2, 3, 5]]  # drop column 4 (Avg. Area Number of Bedrooms): its p-value of 0.207 exceeds 0.05
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.918
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 1.398e+04
Date: Sat, 28 Nov 2020 Prob (F-statistic): 0.00
Time: 15:51:11 Log-Likelihood: -64714.
No. Observations: 5000 AIC: 1.294e+05
Df Residuals: 4995 BIC: 1.295e+05
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -2.638e+06 1.72e+04 -153.726 0.000 -2.67e+06 -2.6e+06
x1 21.5827 0.134 160.743 0.000 21.320 21.846
x2 1.657e+05 1443.404 114.769 0.000 1.63e+05 1.68e+05
x3 1.216e+05 1422.608 85.476 0.000 1.19e+05 1.24e+05
x4 15.1961 0.144 105.388 0.000 14.913 15.479
==============================================================================
Omnibus: 5.310 Durbin-Watson: 2.006
Prob(Omnibus): 0.070 Jarque-Bera (JB): 4.742
Skew: 0.011 Prob(JB): 0.0934
Kurtosis: 2.851 Cond. No. 9.40e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly spec
[2] The condition number is large, 9.4e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
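The two cells above do one round of backward elimination by hand. The same procedure can be automated: repeatedly fit, find the predictor with the largest p-value, and drop it while that p-value exceeds the significance level. A hedged sketch (this loop is my own illustration, not from the original notebook; it reuses the x, y, and sm names defined above):

def backward_elimination(x, y, sl=0.05):
    """Repeatedly drop the predictor with the highest p-value above sl."""
    cols = list(range(x.shape[1]))  # column indices still in the model
    while True:
        model = sm.OLS(endog=y, exog=x[:, cols]).fit()
        pvalues = np.asarray(model.pvalues)
        worst = int(pvalues.argmax())
        if pvalues[worst] <= sl:
            return cols, model  # every remaining predictor is significant
        del cols[worst]  # drop the least significant column and refit

kept, final_model = backward_elimination(x, y)
print('Columns kept:', kept)  # expect the bedrooms column (index 4) to be dropped
print(final_model.summary())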