ISLR-Python: Chapter 3 -- Lab: Linear Regression
• Load Datasets
• 3.6.2 Simple Linear Regression
• 3.6.3 Multiple Linear Regression
• 3.6.4 Interaction Terms
• 3.6.5 Non-linear Transformations of the Predictors
• 3.6.6 Qualitative Predictors
Perform imports and set-up
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from matplotlib import pyplot as plt
from sklearn.datasets import load_boston  # the Boston data set is part of sklearn
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.style.use('ggplot') # emulate pretty r-style plots
Load Datasets
# Load Boston housing data set
boston = load_boston()
# Transform the data into a dataframe for analysis
# combine the predictors and responses for a dataframe
predictors = boston.data
response = boston.target
boston_data = np.column_stack([predictors,response])
# now get the column names of the data frame
col_names = np.append(boston.feature_names, 'MEDV')
# create the data frame
boston_df = pd.DataFrame(boston_data, columns = col_names)
boston_df.head()
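Note that load_boston has been removed from recent scikit-learn releases. If it is unavailable, the same dataframe can be built from a local CSV copy of the Boston data; the path below is an assumption, mirroring the Carseats file used later.
# fallback if sklearn.datasets.load_boston is unavailable (removed in scikit-learn 1.2+)
# assumes a CSV copy with the same columns, including MEDV; the path is hypothetical
# boston_df = pd.read_csv('../data/Boston.csv', index_col=0)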
# Load the carseats data set
carseats_df = pd.read_csv('../data/Carseats.csv',index_col = 0)
3.6.2 Simple Linear Regression
We will regress MEDV (the median home value) onto LSTAT (the percentage of the population with lower socioeconomic status). We will do this using two methods -- scipy.stats.linregress and statsmodels -- and look at several diagnostics that describe the quality of the fit.
Regress MEDV onto LSTAT (SciPy)
# Create a figure to plot our data and OLS estimate.
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(boston_df.LSTAT.values, boston_df.MEDV.values,
           facecolors='none', edgecolors='b', label="data");
ax.set_ylabel('MEDV');
ax.set_xlabel('LSTAT');
# call scipy's linregress, which returns the fit coefficients and simple statistics
beta1, beta0, r_value, p_value, stderr = stats.linregress(boston_df.LSTAT.values,
                                                          boston_df.MEDV.values)
# add the estimate to the data plot
ax.plot(boston_df.LSTAT.values, beta0 + beta1*boston_df.LSTAT.values,
        color='r', label="OLS");
ax.legend(loc='best');
# print the regression estimates returned from scipy
print('beta= [',round(beta0,3),',', round(beta1,3),']')
print('R={0:.3f}, p_value={1:.3f}, stderr={2:.3f}'.format(r_value, p_value, stderr))
beta= [ 34.554 , -0.95 ]
R=-0.738, p_value=0.000, stderr=0.039
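linregress does not return confidence intervals directly, but a 95% interval for the slope can be reconstructed from the returned standard error; a quick sketch using the values above:
# reconstruct a 95% confidence interval for the slope from the linregress output
n = len(boston_df)
t_crit = stats.t.ppf(0.975, df=n - 2)   # two-sided 95% critical value
print('95% CI for slope: [{:.3f}, {:.3f}]'.format(beta1 - t_crit*stderr, beta1 + t_crit*stderr))
# this should roughly match the LSTAT interval in the statsmodels summary below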
Regress MEDV onto LSTAT (statsmodels)
# Another method is to use the package statsmodels.
# Create a design matrix
# set the independent variable as the LSTAT
X = boston_df.LSTAT
# We add a constant for the intercept term
X = sm.add_constant(X)
# set the dependent variable
Y = boston_df.MEDV
# create the model instance and fit
linear_model = sm.OLS(Y,X)
linear_results = linear_model.fit()
# data about the model is stored in summary
print(linear_results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     601.6
Date:                Fri, 24 Jun 2016   Prob (F-statistic):           5.08e-88
Time:                        10:05:12   Log-Likelihood:                -1641.5
No. Observations:                 506   AIC:                             3287.
Df Residuals:                     504   BIC:                             3295.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         34.5538      0.563     61.415      0.000        33.448    35.659
LSTAT         -0.9500      0.039    -24.528      0.000        -1.026    -0.874
==============================================================================
Omnibus:                      137.043   Durbin-Watson:                   0.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              291.373
Skew:                           1.453   Prob(JB):                     5.36e-64
Kurtosis:                       5.319   Cond. No.                         29.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# statsmodels includes functions for getting the prediction and confidence intervals.
# Get the prediction interval for each x
prstd, iv_l, iv_u = wls_prediction_std(linear_results)
# statsmodels also includes the confidence intervals for the fitted values (i.e. the mean fits)
# but they are buried in the results of the summary_table function.
from statsmodels.stats.outliers_influence import summary_table
# call the summary table at the (1-alpha)*100% confidence level
simpleTable, data, column_names = summary_table(linear_results, alpha=0.05)
# data contains the confidence intervals we want, but we need to make sure we get the right
# columns, so let's print the names
print('column_names: ', column_names)
# Get the confidence intervals for each x
predicted_mean_ci_low, predicted_mean_ci_high = data[:,4:6].T
column_names:  ['Obs', 'Dep Var\nPopulation', 'Predicted\nValue', 'Std Error\nMean Predict', 'Mean ci\n95% low', 'Mean ci\n95% upp', 'Predict ci\n95% low', 'Predict ci\n95% upp', 'Residual', 'Std Error\nResidual', 'Student\nResidual', "Cook's\nD"]
# Plot the data, the OLS estimate, and the prediction and confidence intervals
fig, ax = plt.subplots(figsize=(8,6))
# get numpy array values from dataframe
x = boston_df.LSTAT.values
y = boston_df.MEDV.values
# Plot the data
ax.scatter(x, y, facecolors='none', edgecolors='b', label="data")
# plot the model's fitted values
ax.plot(x, linear_results.fittedvalues, 'g', label="OLS")
# plot the high and low prediction intervals
ax.plot(x, iv_u, color='0.75',label="Prediction Interval")
ax.plot(x, iv_l, color='0.75')
# plot the high and low mean confidence intervals
ax.plot(x,predicted_mean_ci_low, 'r', label="Predicted Mean CI")
ax.plot(x,predicted_mean_ci_high,'r')
ax.legend(loc='best');
plt.xlabel('LSTAT');
plt.ylabel('MEDV');
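More recent statsmodels releases also expose both interval types directly through the results object's get_prediction method; a minimal sketch, assuming a version that includes it:
# predict MEDV at a few new LSTAT values, with mean and prediction intervals
new_X = sm.add_constant(pd.DataFrame({'LSTAT': [5, 10, 15]}))
pred = linear_results.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05))   # mean_ci_* and obs_ci_* columns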
Diagnostic Plots for Linear Model
# Create plots of residuals
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
# RESIDUALS
# The results contain the residuals
fitted_values = linear_results.fittedvalues.values
residuals = linear_results.resid.values
# Plot the residual for each fitted value
ax1.scatter(fitted_values, residuals, facecolors='none', edgecolors='b');
ax1.set_xlabel('fitted values');
ax1.set_ylabel('residuals');
# The residual plot indicates significant nonlinearity (a U-shaped pattern is clear)
# STUDENTIZED RESIDUALS
# To assess outliers we will look at the studentized residuals. These are in the data array
# returned from summary_table (column index 10)
studentized_residuals = data[:,10]
# Plot the studentized residuals
ax2.scatter(fitted_values, studentized_residuals, facecolors='none', edgecolors='b');
ax2.set_ylabel('Studentized Residuals');
ax2.set_xlabel('fitted values');
# |studentized residual| > 3 are generally considered outliers
# We can also examine the leverage statistics to identify points that may unduly alter the regression line
from statsmodels.stats.outliers_influence import OLSInfluence
leverage = OLSInfluence(linear_results).hat_matrix_diag
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(leverage, studentized_residuals,facecolors='none',
edgecolors='b');
ax.set_xlabel('Leverage');
ax.set_ylabel('Studentized Residuals');
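As in the R version of this lab, we can also pick out the single observation with the largest leverage; a small addition using the array computed above:
# position (0-based) of the observation with the largest leverage
print('max leverage at observation:', np.argmax(leverage))
print('leverage value: {:.4f}'.format(leverage.max()))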
3.6.3 Multiple Linear Regression
Here we will estimate MEDV using multiple linear regression. In the first example we will regress MEDV onto LSTAT and AGE.
# create our design matrix using LSTAT and AGE predictors
X = sm.add_constant(boston_df[['LSTAT','AGE']])
# set the dependent variable
Y = boston_df.MEDV
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d',azim=-60, elev=5)
# create the model instance and estimate
model = sm.OLS(Y,X)
estimate = model.fit()
# data about the model is stored in summary
print('Model parameters:\n', estimate.params[:])
# Plot the data
ax.scatter(X.loc[:,'LSTAT'], X.loc[:,'AGE'], Y.values, facecolors=(0,0,0,0),
           edgecolor='k', depthshade=True);
ax.set_xlabel('LSTAT');
ax.set_ylabel('AGE')
ax.set_zlabel('MEDV')
# Plot the OLS estimate
# create a grid of points
xx1, xx2 = np.meshgrid(np.linspace(X.LSTAT.min(), X.LSTAT.max(), 100),
np.linspace(X.AGE.min(), X.AGE.max(), 100))
# plot the plane by evaluating the parameters over the grid
Z = estimate.params[0] + estimate.params[1] * xx1 + estimate.params[2] * xx2
# plot plane
surf = ax.plot_surface(xx1, xx2, Z, cmap=plt.cm.RdBu_r, alpha=0.75, linewidth=0)
Model parameters:
const 33.222761
LSTAT -1.032069
AGE 0.034544
dtype: float64
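The residual standard error and R-squared of this two-predictor fit can be pulled directly from the results object; a quick check (a small addition to the lab):
# residual standard error (square root of the residual mean square) and R-squared
print('RSE = {:.3f}'.format(np.sqrt(estimate.mse_resid)))
print('R^2 = {:.3f}'.format(estimate.rsquared))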
Now we will perform the regression using all 13 predictors in the Boston housing dataset.
# create our design matrix using all the predictors (the last column is MEDV)
X = sm.add_constant(boston_df.iloc[:,0:-1])
# create the model instance and estimate
model = sm.OLS(Y,X)
est = model.fit()
# data about the model is stored in summary
print(est.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          6.95e-135
Time:                        10:05:13   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.4911      5.104      7.149      0.000        26.462    46.520
CRIM          -0.1072      0.033     -3.276      0.001        -0.171    -0.043
ZN             0.0464      0.014      3.380      0.001         0.019     0.073
INDUS          0.0209      0.061      0.339      0.735        -0.100     0.142
CHAS           2.6886      0.862      3.120      0.002         0.996     4.381
NOX          -17.7958      3.821     -4.658      0.000       -25.302   -10.289
RM             3.8048      0.418      9.102      0.000         2.983     4.626
AGE            0.0008      0.013      0.057      0.955        -0.025     0.027
DIS           -1.4758      0.199     -7.398      0.000        -1.868    -1.084
RAD            0.3057      0.066      4.608      0.000         0.175     0.436
TAX           -0.0123      0.004     -3.278      0.001        -0.020    -0.005
PTRATIO       -0.9535      0.131     -7.287      0.000        -1.211    -0.696
B              0.0094      0.003      3.500      0.001         0.004     0.015
LSTAT         -0.5255      0.051    -10.366      0.000        -0.625    -0.426
==============================================================================
Omnibus:                      178.029   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              782.015
Skew:                           1.521   Prob(JB):                    1.54e-170
Kurtosis:                       8.276   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
# Compute all the variance inflation factors
from statsmodels.stats.outliers_influence import variance_inflation_factor
VIFs = [(predictor, variance_inflation_factor(X.values, i))
        for i, predictor in enumerate(list(X))]  # list(X) returns the column names of the df
print('Variance Inflation Factors')
for tup in VIFs:
print('{:10}'.format(tup[0]), '{:.3f}'.format(tup[1]))
Variance Inflation Factors
const 585.425
CRIM 1.773
ZN 2.299
INDUS 3.991
CHAS 1.074
NOX 4.395
RM 1.934
AGE 3.101
DIS 3.957
RAD 7.481
TAX 9.008
PTRATIO 1.799
B 1.346
LSTAT 2.938
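The ISLR R lab also refits the full model without AGE, which has a very large p-value above. One way to do the same here, a sketch that simply drops the column from the design matrix built earlier:
# refit the full model excluding AGE
X_no_age = X.drop('AGE', axis=1)
est_no_age = sm.OLS(Y, X_no_age).fit()
print(est_no_age.summary())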
3.6.4 Interaction Terms
Statsmodels uses the patsy package to convert formulas to matrices for fitting. This
allows for easy implementation of arbitrary functions of the predictors.
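To see what a formula such as MEDV ~ LSTAT*AGE expands to, we can ask patsy for the design matrix directly; a small illustrative sketch:
# the * operator expands to both main effects plus the interaction term
from patsy import dmatrices
y_dm, X_dm = dmatrices('MEDV ~ LSTAT*AGE', data=boston_df, return_type='dataframe')
print(X_dm.columns.tolist())   # ['Intercept', 'LSTAT', 'AGE', 'LSTAT:AGE']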
# import the statsmodels formula api (which uses patsy)
import statsmodels.formula.api as smf
# Construct model and fit
model = smf.ols(formula='MEDV ~ LSTAT*AGE', data=boston_df)
estimate = model.fit()
print(estimate.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.556
Model:                            OLS   Adj. R-squared:                  0.553
Method:                 Least Squares   F-statistic:                     209.3
Date:                Fri, 24 Jun 2016   Prob (F-statistic):           4.86e-88
Time:                        10:05:13   Log-Likelihood:                -1635.0
No. Observations:                 506   AIC:                             3278.
Df Residuals:                     502   BIC:                             3295.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     36.0885      1.470     24.553      0.000        33.201    38.976
LSTAT         -1.3921      0.167     -8.313      0.000        -1.721    -1.063
AGE           -0.0007      0.020     -0.036      0.971        -0.040     0.038
LSTAT:AGE      0.0042      0.002      2.244      0.025         0.001     0.008
==============================================================================
Omnibus:                      135.601   Durbin-Watson:                   0.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              296.955
Skew:                           1.417   Prob(JB):                     3.29e-65
Kurtosis:                       5.461   Cond. No.                     6.88e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.88e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
3.6.5 Non-linear Transformations of the Predictors
The statsmodels formula API also lets us include non-linear transformations of the predictors via the I() function.
model = smf.ols('MEDV ~ LSTAT + I(LSTAT**2)', data=boston_df)
quadratic_results = model.fit()
print(quadratic_results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     448.5
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          1.56e-112
Time:                        10:05:13   Log-Likelihood:                -1581.3
No. Observations:                 506   AIC:                             3169.
Df Residuals:                     503   BIC:                             3181.
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        42.8620      0.872     49.149      0.000        41.149    44.575
LSTAT            -2.3328      0.124    -18.843      0.000        -2.576    -2.090
I(LSTAT ** 2)     0.0435      0.004     11.628      0.000         0.036     0.051
=================================================================================
Omnibus:                      107.006   Durbin-Watson:                   0.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              228.388
Skew:                           1.128   Prob(JB):                     2.55e-50
Kurtosis:                       5.397   Cond. No.                     1.13e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
The near-zero p-value for the quadratic term suggests an improved model. We will plot the fit and perform some diagnostics.
fig, ax = plt.subplots(figsize=(8,6))
# get numpy array values from dataframe
x = boston_df.LSTAT.values
y = boston_df.MEDV.values
# Plot the data
ax.scatter(x, y, facecolors='none', edgecolors='b', label="data");
# plot the models fitted values
ax.plot(x, quadratic_results.fittedvalues, 'g', marker='o', linestyle='none', label="OLS");
ax.legend(loc='best');
plt.xlabel('LSTAT');
plt.ylabel('MEDV');
Diagnostic tests of quadratic estimate
# import anova function
from statsmodels.stats.api import anova_lm
# perform the hypothesis test (see the regression section of https://en.wikipedia.org/wiki/F-test)
anova_table = anova_lm(linear_results, quadratic_results)
print(anova_table)
   df_resid           ssr  df_diff     ss_diff           F        Pr(>F)
0       504  19472.381418        0         NaN         NaN           NaN
1       503  15347.243158        1  4125.13826  135.199822  7.630116e-28
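The F-statistic in the table can be reproduced by hand from the two residual sums of squares; a quick check using only the values printed above:
# F = ((RSS_linear - RSS_quadratic) / (df_linear - df_quadratic)) / (RSS_quadratic / df_quadratic)
rss1, df1 = 19472.381418, 504   # linear model
rss2, df2 = 15347.243158, 503   # quadratic model
F = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
print('F = {:.1f}'.format(F))   # ~135.2, matching anova_lm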
The F-statistic is 135 with a p-value of ~0, indicating a large difference in the unexplained variance of the two models. This is not too surprising given the plot above. Now consider the residuals.
# Create plots of residuals
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
# RESIDUALS OF LINEAR MODEL
# The results contain the residuals
linear_fit_values = linear_results.fittedvalues.values
residuals = linear_results.resid.values
# Plot the residual for each fitted value for the linear model
ax1.scatter(linear_fit_values, residuals, facecolors='none', edgecolors='b');
ax1.set_xlabel('fitted values');
ax1.set_ylabel('residuals');
ax1.set_title('Linear Model Residuals')
# RESIDUALS OF QUADRATIC MODEL
# The results contain the residuals
quadratic_fit_values = quadratic_results.fittedvalues.values
quadratic_residuals = quadratic_results.resid.values
ax2.scatter(quadratic_fit_values, quadratic_residuals, facecolors='none', edgecolors='b');
ax2.set_title('Quadratic Model Residuals');
We can also try higher order polynomial fits:
formula = 'MEDV ~ 1 + ' + ' + '.join('I(LSTAT**{})'.format(i) for i in range(1, 6))
print(formula)
model = smf.ols(formula, data=boston_df)
order_5_results = model.fit()
print(order_5_results.summary())
MEDV ~ 1 + I(LSTAT**1) + I(LSTAT**2) + I(LSTAT**3) + I(LSTAT**4) + I(LSTAT**5)
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     214.2
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          8.73e-122
Time:                        10:57:12   Log-Likelihood:                -1550.6
No. Observations:                 506   AIC:                             3113.
Df Residuals:                     500   BIC:                             3139.
Df Model:                           5
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          67.6997      3.604     18.783      0.000        60.618    74.781
I(LSTAT ** 1)     -11.9911      1.526     -7.859      0.000       -14.989    -8.994
I(LSTAT ** 2)       1.2728      0.223      5.703      0.000         0.834     1.711
I(LSTAT ** 3)      -0.0683      0.014     -4.747      0.000        -0.097    -0.040
I(LSTAT ** 4)       0.0017      0.000      4.143      0.000         0.001     0.003
I(LSTAT ** 5)   -1.632e-05   4.42e-06     -3.692      0.000      -2.5e-05 -7.63e-06
===================================================================================
Omnibus:                      144.085   Durbin-Watson:                   0.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              494.545
Skew:                           1.292   Prob(JB):                    4.08e-108
Kurtosis:                       7.096   Cond. No.                     1.37e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
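The very large condition number arises because raw powers of LSTAT are strongly correlated. One simple mitigation, not used in the lab itself, is to center and scale LSTAT before raising it to powers; the fitted values are unchanged, only the parameterization differs. A sketch:
# same fifth-order polynomial on a centered and scaled copy of LSTAT
boston_df['LSTAT_z'] = (boston_df.LSTAT - boston_df.LSTAT.mean()) / boston_df.LSTAT.std()
formula_z = 'MEDV ~ 1 + ' + ' + '.join('I(LSTAT_z**{})'.format(i) for i in range(1, 6))
order_5_z = smf.ols(formula_z, data=boston_df).fit()
print('R^2 = {:.3f}'.format(order_5_z.rsquared))   # identical fit, same R-squared as above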
3.6.6 Qualitative Predictors
# Examine the head of the carseats data set
carseats_df.head()
   Sales  CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  Education Urban   US
1   9.50        138      73           11         276    120       Bad   42         17   Yes  Yes
2  11.22        111      48           16         260     83      Good   65         10   Yes  Yes
3  10.06        113      35           10         269     80    Medium   59         12   Yes  Yes
4   7.40        117     100            4         466     97    Medium   55         14   Yes  Yes
5   4.15        141      64            3         340    128       Bad   38         13   Yes   No
# Construct the formula with two interaction terms
formula = 'Sales ~' + '+'.join(list(carseats_df.iloc[:,1:].columns.tolist())
                               + ['Income:Advertising'] + ['Price:Age'])
print(formula)
print()
model = smf.ols(formula, data=carseats_df)
carseat_results = model.fit()
print(carseat_results.summary())
Sales ~CompPrice+Income+Advertising+Population+Price+ShelveLoc+Age+Education+Urban+US+Income:Advertising+Price:Age
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     210.0
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          6.14e-166
Time:                        14:38:28   Log-Likelihood:                -564.67
No. Observations:                 400   AIC:                             1157.
Df Residuals:                     386   BIC:                             1213.
Df Model:                          13
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept               6.5756      1.009      6.519      0.000         4.592     8.559
ShelveLoc[T.Good]       4.8487      0.153     31.724      0.000         4.548     5.149
ShelveLoc[T.Medium]     1.9533      0.126     15.531      0.000         1.706     2.201
Urban[T.Yes]            0.1402      0.112      1.247      0.213        -0.081     0.361
US[T.Yes]              -0.1576      0.149     -1.058      0.291        -0.450     0.135
CompPrice               0.0929      0.004     22.567      0.000         0.085     0.101
Income                  0.0109      0.003      4.183      0.000         0.006     0.016
Advertising             0.0702      0.023      3.107      0.002         0.026     0.115
Population              0.0002      0.000      0.433      0.665        -0.001     0.001
Price                  -0.1008      0.007    -13.549      0.000        -0.115    -0.086
Age                    -0.0579      0.016     -3.633      0.000        -0.089    -0.027
Education              -0.0209      0.020     -1.063      0.288        -0.059     0.018
Income:Advertising      0.0008      0.000      2.698      0.007         0.000     0.001
Price:Age               0.0001      0.000      0.801      0.424        -0.000     0.000
=======================================================================================
Omnibus:                        1.281   Durbin-Watson:                   2.047
Prob(Omnibus):                  0.527   Jarque-Bera (JB):                1.147
Skew:                           0.129   Prob(JB):                        0.564
Kurtosis:                       3.050   Cond. No.                     1.31e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.31e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
# The default coding for categorical variables in patsy is Treatment coding with the first level
# (reference=0) as the baseline; in this case the dummy variables are all 0 for the 'Bad' shelf location.
# More coding schemes can be found in the patsy documentation (e.g. patsy.contrasts.Treatment).
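As a rough analogue of R's contrasts(ShelveLoc), the coding matrix for the default Treatment scheme can be inspected directly; a small sketch assuming the three ShelveLoc levels:
from patsy.contrasts import Treatment
levels = ['Bad', 'Good', 'Medium']
contrast = Treatment(reference=0).code_without_intercept(levels)
print(contrast.matrix)   # rows: Bad, Good, Medium; columns: ShelveLoc[T.Good], ShelveLoc[T.Medium]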