Parallel slopes linear regression
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Maarten Van den Broeck, Content Developer at DataCamp
The previous course
This course assumes knowledge from Introduction to Regression with statsmodels in Python
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
From simple regression to multiple regression
Multiple regression is a regression model with more than one explanatory variable.
More explanatory variables can give more insight and better predictions.
The course contents
Chapter 1: "Parallel slopes" regression
Chapter 2: Interactions, Simpson's Paradox
Chapter 3: More explanatory variables, How linear regression works
Chapter 4: Multiple logistic regression, The logistic distribution, How logistic regression works
The fish dataset
mass_g  length_cm  species
 242.0       23.2  Bream
   5.9        7.5  Perch
 200.0       30.0  Pike
  40.0       12.9  Roach

Each row represents a fish.
mass_g is the response variable.
1 numeric and 1 categorical explanatory variable.
One explanatory variable at a time
from statsmodels.formula.api import ols

mdl_mass_vs_length = ols("mass_g ~ length_cm",
                         data=fish).fit()
print(mdl_mass_vs_length.params)

Intercept    -536.223947
length_cm      34.899245
dtype: float64

1 intercept coefficient
1 slope coefficient

mdl_mass_vs_species = ols("mass_g ~ species + 0",
                          data=fish).fit()
print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000
dtype: float64

1 intercept coefficient for each category
Both variables at the same time
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0",
data=fish).fit()
print(mdl_mass_vs_both.params)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
dtype: float64
1 slope coefficient
1 intercept coefficient for each category
Comparing coefficients
print(mdl_mass_vs_length.params)

Intercept    -536.223947
length_cm      34.899245

print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000

print(mdl_mass_vs_both.params)

species[Bream]    -672.241866
species[Perch]    -713.292859
species[Pike]    -1089.456053
species[Roach]    -726.777799
length_cm           42.568554
Visualization: 1 numeric explanatory variable
import matplotlib.pyplot as plt
import seaborn as sns
sns.regplot(x="length_cm",
y="mass_g",
data=fish,
ci=None)
plt.show()
Visualization: 1 categorical explanatory variable
sns.boxplot(x="species",
y="mass_g",
data=fish,
showmeans=True)
Visualization: both explanatory variables
coeffs = mdl_mass_vs_both.params
print(coeffs)

species[Bream]    -672.241866
species[Perch]    -713.292859
species[Pike]    -1089.456053
species[Roach]    -726.777799
length_cm           42.568554

ic_bream, ic_perch, ic_pike, ic_roach, sl = coeffs

plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="species",
                data=fish)
Let's practice!
Predicting parallel slopes
The prediction workflow
import pandas as pd
import numpy as np

expl_data_length = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(expl_data_length)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60
The prediction workflow
[A, B, C] x [1, 2] ==> [A1, B1, C1, A2, B2, C2]

from itertools import product
product(["A", "B", "C"], [1, 2])

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

expl_data_both = pd.DataFrame(p,
                              columns=["length_cm",
                                       "species"])
print(expl_data_both)

    length_cm species
0           5   Bream
1           5   Roach
2           5   Perch
3           5    Pike
4          10   Bream
5          10   Roach
6          10   Perch
...
41         55   Roach
42         55   Perch
43         55    Pike
44         60   Bream
45         60   Roach
46         60   Perch
47         60    Pike
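The cross join that product() performs can be sketched with plain lists, with no pandas required (the lengths and species here are a small subset for illustration):

```python
from itertools import product

# Cross every length with every species, as on the slide.
lengths = [5, 10]
species = ["Bream", "Roach", "Perch", "Pike"]

pairs = list(product(lengths, species))
print(pairs[:4])   # all species at length 5 come first
print(len(pairs))  # 2 lengths x 4 species = 8 rows
```

Passing the resulting iterator straight to pd.DataFrame, as above, turns each tuple into one row.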
The prediction workflow
Predict mass_g from length_cm only:

prediction_data_length = expl_data_length.assign(
    mass_g = mdl_mass_vs_length.predict(expl_data_length)
)

    length_cm     mass_g
0           5  -361.7277
1          10  -187.2315
2          15   -12.7353
3          20   161.7610
4          25   336.2572
5          30   510.7534
...  # number of rows: 12

Predict mass_g from both explanatory variables:

prediction_data_both = expl_data_both.assign(
    mass_g = mdl_mass_vs_both.predict(expl_data_both)
)

    length_cm species     mass_g
0           5   Bream  -459.3991
1           5   Roach  -513.9350
2           5   Perch  -500.4501
3           5    Pike  -876.6133
4          10   Bream  -246.5563
5          10   Roach  -301.0923
...  # number of rows: 48
Visualizing the predictions
plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")
sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)
sns.scatterplot(x="length_cm",
y="mass_g",
color="black",
data=prediction_data)
Manually calculating predictions for linear regression
coeffs = mdl_mass_vs_length.params
print(coeffs)

Intercept    -536.223947
length_cm      34.899245

intercept, slope = coeffs

explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"]
)
print(prediction_data)

    length_cm       mass_g
0           5  -361.727721
1          10  -187.231494
2          15   -12.735268
3          20   161.760959
4          25   336.257185
5          30   510.753412
...
9          50  1208.738318
10         55  1383.234545
11         60  1557.730771
Manually calculating predictions for multiple regression
coeffs = mdl_mass_vs_both.params
print(coeffs)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
ic_bream, ic_perch, ic_pike, ic_roach, slope = coeffs
np.select()
conditions = [
condition_1,
condition_2,
# ...
condition_n
]
choices = [list_of_choices] # same length as conditions
np.select(conditions, choices)
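A minimal self-contained sketch of np.select: each row gets the choice belonging to the first condition it satisfies (the species and intercept values mirror the fish example, but any arrays work):

```python
import numpy as np

# Pick a value per row based on which condition matches.
species = np.array(["Bream", "Pike", "Bream", "Roach"])
conditions = [species == "Bream",
              species == "Pike",
              species == "Roach"]
choices = [-672.24, -1089.46, -726.78]

intercepts = np.select(conditions, choices)
print(intercepts)  # [-672.24, -1089.46, -672.24, -726.78]
```

Rows matching no condition get np.select's default value, 0.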
Choosing an intercept with np.select()
conditions = [
    explanatory_data["species"] == "Bream",
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach"
]

choices = [ic_bream, ic_perch, ic_pike, ic_roach]

intercept = np.select(conditions, choices)
print(intercept)

[ -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
  ...  # one intercept per row, repeating for each of the 12 lengths
  -672.24  -726.78  -713.29 -1089.46]
The final prediction step
prediction_data = explanatory_data.assign(
    intercept = np.select(conditions, choices),
    mass_g = intercept + slope * explanatory_data["length_cm"])
print(prediction_data)

    length_cm species  intercept     mass_g
0           5   Bream  -672.2419  -459.3991
1           5   Roach  -726.7778  -513.9350
2           5   Perch  -713.2929  -500.4501
3           5    Pike -1089.4561  -876.6133
4          10   Bream  -672.2419  -246.5563
5          10   Roach  -726.7778  -301.0923
6          10   Perch  -713.2929  -287.6073
7          10    Pike -1089.4561  -663.7705
8          15   Bream  -672.2419   -33.7136
...
40         55   Bream  -672.2419  1669.0286
41         55   Roach  -726.7778  1614.4927
42         55   Perch  -713.2929  1627.9776
43         55    Pike -1089.4561  1251.8144
44         60   Bream  -672.2419  1881.8714
45         60   Roach  -726.7778  1827.3354
46         60   Perch  -713.2929  1840.8204
47         60    Pike -1089.4561  1464.6572
Compare to .predict()
mdl_mass_vs_both.predict(explanatory_data)

0     -459.3991
1     -513.9350
2     -500.4501
3     -876.6133
4     -246.5563
5     -301.0923
...
43    1251.8144
44    1881.8714
45    1827.3354
46    1840.8204
47    1464.6572
Let's practice!
Assessing model performance
Model performance metrics
Coefficient of determination (R-squared): how well the linear regression line fits the observed values. Larger is better.

Residual standard error (RSE): the typical size of the residuals. Smaller is better.
Getting the coefficient of determination
print(mdl_mass_vs_length.rsquared)
0.8225689502644215
print(mdl_mass_vs_species.rsquared)
0.25814887709499157
print(mdl_mass_vs_both.rsquared)
0.9200433561156649
Adjusted coefficient of determination
More explanatory variables increase R².
Too many explanatory variables cause overfitting.
The adjusted coefficient of determination penalizes additional explanatory variables:

R̄² = 1 − (1 − R²) × (n_obs − 1) / (n_obs − n_var − 1)

The penalty is noticeable when R² is small, or when n_var is a large fraction of n_obs.
In statsmodels, it's contained in the rsquared_adj attribute.
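The adjusted R² formula can be checked numerically. A minimal numpy-only sketch, using synthetic data and np.polyfit as the fitter (the fish dataset and statsmodels are not assumed here):

```python
import numpy as np

# Fit a simple line to synthetic data, then apply the adjustment formula.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
rsq = 1 - resid.var() / y.var()

n_obs, n_var = len(y), 1
rsq_adj = 1 - (1 - rsq) * (n_obs - 1) / (n_obs - n_var - 1)
print(rsq, rsq_adj)  # rsq_adj is slightly smaller than rsq
```

With one explanatory variable and fifty observations the penalty is tiny; it grows as n_var approaches n_obs.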
Getting the adjusted coefficient of determination
print("rsq_length: ", mdl_mass_vs_length.rsquared)
print("rsq_adj_length: ", mdl_mass_vs_length.rsquared_adj)
rsq_length: 0.8225689502644215
rsq_adj_length: 0.8211607673300121
print("rsq_species: ", mdl_mass_vs_species.rsquared)
print("rsq_adj_species: ", mdl_mass_vs_species.rsquared_adj)
rsq_species: 0.25814887709499157
rsq_adj_species: 0.24020086605696722
print("rsq_both: ", mdl_mass_vs_both.rsquared)
print("rsq_adj_both: ", mdl_mass_vs_both.rsquared_adj)
rsq_both: 0.9200433561156649
rsq_adj_both: 0.9174431400543857
Getting the residual standard error
rse_length = np.sqrt(mdl_mass_vs_length.mse_resid)
print("rse_length: ", rse_length)
rse_length: 152.12092835414788
rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)
rse_species: 313.5501156682592
rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)
rse_both: 103.35563303966488
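The RSE computed above from mse_resid is the square root of the sum of squared residuals divided by the residual degrees of freedom. A numpy-only sketch with synthetic data (np.polyfit stands in for the statsmodels fit):

```python
import numpy as np

# RSE = sqrt(sum of squared residuals / residual degrees of freedom),
# matching what np.sqrt(model.mse_resid) returns in statsmodels.
rng = np.random.default_rng(0)
x = rng.uniform(5, 60, 40)
y = 40 * x - 600 + rng.normal(0, 100, 40)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

n_obs, n_coeffs = len(y), 2  # intercept + slope
rse = np.sqrt((resid ** 2).sum() / (n_obs - n_coeffs))
print(rse)  # close to the true noise level of 100
```

Dividing by n_obs − n_coeffs rather than n_obs corrects for the coefficients estimated from the same data.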
Let's practice!
Models for each category
Four categories
print(fish["species"].unique())
array(['Bream', 'Roach', 'Perch', 'Pike'], dtype=object)
Splitting the dataset
bream = fish[fish["species"] == "Bream"]
perch = fish[fish["species"] == "Perch"]
pike = fish[fish["species"] == "Pike"]
roach = fish[fish["species"] == "Roach"]
Four models
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_bream.params)

Intercept    -1035.3476
length_cm       54.5500

mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
print(mdl_perch.params)

Intercept    -619.1751
length_cm      38.9115

mdl_pike = ols("mass_g ~ length_cm", data=pike).fit()
print(mdl_pike.params)

Intercept    -1540.8243
length_cm       53.1949

mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
print(mdl_roach.params)

Intercept    -329.3762
length_cm      23.3193
Explanatory data
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(explanatory_data)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60
Making predictions
prediction_data_bream = explanatory_data.assign(
    mass_g = mdl_bream.predict(explanatory_data),
    species = "Bream")

prediction_data_perch = explanatory_data.assign(
    mass_g = mdl_perch.predict(explanatory_data),
    species = "Perch")

prediction_data_pike = explanatory_data.assign(
    mass_g = mdl_pike.predict(explanatory_data),
    species = "Pike")

prediction_data_roach = explanatory_data.assign(
    mass_g = mdl_roach.predict(explanatory_data),
    species = "Roach")
Concatenating predictions
prediction_data = pd.concat([prediction_data_bream,
                             prediction_data_roach,
                             prediction_data_perch,
                             prediction_data_pike])

    length_cm       mass_g species
0           5  -762.597660   Bream
1          10  -489.847756   Bream
2          15  -217.097851   Bream
3          20    55.652054   Bream
4          25   328.401958   Bream
5          30   601.151863   Bream
...
3          20  -476.926955    Pike
4          25  -210.952626    Pike
5          30    55.021703    Pike
6          35   320.996032    Pike
7          40   586.970362    Pike
8          45   852.944691    Pike
9          50  1118.919020    Pike
10         55  1384.893349    Pike
11         60  1650.867679    Pike
Visualizing predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
plt.show()
Adding in your predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
sns.scatterplot(x="length_cm",
                y="mass_g",
                data=prediction_data,
                hue="species",
                legend=False)
plt.show()
Coefficient of determination
mdl_fish = ols("mass_g ~ length_cm + species",
               data=fish).fit()
print(mdl_fish.rsquared_adj)

0.917

print(mdl_bream.rsquared_adj)

0.874

print(mdl_perch.rsquared_adj)

0.917

print(mdl_pike.rsquared_adj)

0.941

print(mdl_roach.rsquared_adj)

0.815
Residual standard error
print(np.sqrt(mdl_fish.mse_resid))

103

print(np.sqrt(mdl_bream.mse_resid))

74.2

print(np.sqrt(mdl_perch.mse_resid))

100

print(np.sqrt(mdl_pike.mse_resid))

120

print(np.sqrt(mdl_roach.mse_resid))

38.2
Let's practice!
One model with an interaction
What is an interaction?
In the fish dataset
Different fish species have different mass-to-length ratios.
The effect of length on the expected mass is different for different species.

More generally
The effect of one explanatory variable on the expected response changes depending on the value of another explanatory variable.
Specifying interactions
No interactions
response ~ explntry1 + explntry2
mass_g ~ length_cm + species

With interactions (implicit)
response ~ explntry1 * explntry2
mass_g ~ length_cm * species

With interactions (explicit)
response ~ explntry1 + explntry2 + explntry1:explntry2
mass_g ~ length_cm + species + length_cm:species
Running the model
mdl_mass_vs_both = ols("mass_g ~ length_cm * species", data=fish).fit()
print(mdl_mass_vs_both.params)
Intercept -1035.3476
species[T.Perch] 416.1725
species[T.Pike] -505.4767
species[T.Roach] 705.9714
length_cm 54.5500
length_cm:species[T.Perch] -15.6385
length_cm:species[T.Pike] -1.3551
length_cm:species[T.Roach] -31.2307
Easier to understand coefficients
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit()
print(mdl_mass_vs_both_inter.params)
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
Familiar numbers
print(mdl_mass_vs_both_inter.params)

species[Bream]             -1035.3476
species[Perch]              -619.1751
species[Pike]              -1540.8243
species[Roach]              -329.3762
species[Bream]:length_cm      54.5500
species[Perch]:length_cm      38.9115
species[Pike]:length_cm       53.1949
species[Roach]:length_cm      23.3193

print(mdl_bream.params)

Intercept    -1035.3476
length_cm       54.5500
Let's practice!
Making predictions with interactions
The model with the interaction
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0",
data=fish).fit()
print(mdl_mass_vs_both_inter.params)
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
print(prediction_data)

    length_cm species     mass_g
0           5   Bream  -762.5977
1           5   Roach  -212.7799
2           5   Perch  -424.6178
3           5    Pike -1274.8499
4          10   Bream  -489.8478
5          10   Roach   -96.1836
6          10   Perch  -230.0604
7          10    Pike -1008.8756
8          15   Bream  -217.0979
...
40         55   Bream  1964.9014
41         55   Roach   953.1833
42         55   Perch  1520.9556
43         55    Pike  1384.8933
44         60   Bream  2237.6513
45         60   Roach  1069.7796
46         60   Perch  1715.5129
47         60    Pike  1650.8677
Visualizing the predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species")
plt.show()
Manually calculating the predictions
coeffs = mdl_mass_vs_both_inter.params
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
(ic_bream, ic_perch, ic_pike, ic_roach,
 slope_bream, slope_perch, slope_pike, slope_roach) = coeffs
Manually calculating the predictions
conditions = [
explanatory_data["species"] == "Bream",
explanatory_data["species"] == "Perch",
explanatory_data["species"] == "Pike",
explanatory_data["species"] == "Roach"
]
ic_choices = [ic_bream, ic_perch, ic_pike, ic_roach]
intercept = np.select(conditions, ic_choices)
slope_choices = [slope_bream, slope_perch, slope_pike, slope_roach]
slope = np.select(conditions, slope_choices)
Manually calculating the predictions
Manual calculation:

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"])
print(prediction_data)

Using .predict():

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
print(prediction_data)

Both give the same result:

    length_cm species     mass_g
0           5   Bream  -762.5977
1           5   Roach  -212.7799
2           5   Perch  -424.6178
3           5    Pike -1274.8499
4          10   Bream  -489.8478
5          10   Roach   -96.1836
...
43         55    Pike  1384.8933
44         60   Bream  2237.6513
45         60   Roach  1069.7796
46         60   Perch  1715.5129
47         60    Pike  1650.8677
Let's practice!
Simpson's Paradox
A most ingenious paradox!
Simpson's Paradox occurs when the trend of a model on the whole dataset is very different from the trends shown by models on subsets of the dataset.

trend = slope coefficient
Synthetic Simpson data
5 groups of data, labeled "A" to "E"

       x        y group
62.24344 70.60840     D
52.33499 14.70577     B
56.36795 46.39554     C
66.80395 66.17487     D
66.53605 89.24658     E
62.38129 91.45260     E
1 https://www.rdocumentation.org/packages/datasauRus/topics/simpsons_paradox
Linear regressions
Whole dataset:

mdl_whole = ols("y ~ x",
                data=simpsons_paradox).fit()
print(mdl_whole.params)

Intercept   -38.554
x             1.751

By group:

mdl_by_group = ols("y ~ group + group:x + 0",
                   data=simpsons_paradox).fit()
print(mdl_by_group.params)

groupA     groupB     groupC     groupD     groupE
32.5051    67.3886    99.6333   132.3932   123.8242
groupA:x   groupB:x   groupC:x   groupD:x   groupE:x
-0.6266    -1.0105    -0.9940    -0.9908    -0.5364
Plotting the whole dataset
sns.regplot(x="x",
y="y",
data=simpsons_paradox,
ci=None)
Plotting by group
sns.lmplot(x="x",
y="y",
data=simpsons_paradox,
hue="group",
ci=None)
Reconciling the difference
Good advice
If possible, try to plot the dataset.
Common advice
You can't choose the best model in general – it depends on the dataset and the question you
are trying to answer.
More good advice
Articulate a question before you start modeling.
Test score example
Infectious disease example
Reconciling the difference
Usually (but not always) the grouped model contains more insight.
Are you missing explanatory variables?
Context is important.
Simpson's paradox in real datasets
The paradox is usually less obvious.
You may see a zero slope rather than a complete change in direction.
It may not appear in every group.
Let's practice!
Two numeric explanatory variables
Visualizing three numeric variables
3D scatter plot
2D scatter plot with response as color
Another column for the fish dataset
species mass_g length_cm height_cm
Bream 1000 33.5 18.96
Bream 925 36.2 18.75
Roach 290 24.0 8.88
Roach 390 29.5 9.48
Perch 1100 39.0 12.80
Perch 1000 40.2 12.60
Pike 1250 52.0 10.69
Pike 1650 59.0 10.81
3D scatter plot
2D scatter plot, color for response
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
Modeling with two numeric explanatory variables
mdl_mass_vs_both = ols("mass_g ~ length_cm + height_cm",
data=fish).fit()
print(mdl_mass_vs_both.params)
Intercept -622.150234
length_cm 28.968405
height_cm 26.334804
The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)

p = product(length_cm, height_cm)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both.predict(explanatory_data))
print(prediction_data)

     length_cm  height_cm       mass_g
0            5          2  -424.638603
1            5          4  -371.968995
2            5          6  -319.299387
3            5          8  -266.629780
4            5         10  -213.960172
..         ...        ...          ...
115         60         12  1431.971694
116         60         14  1484.641302
117         60         16  1537.310909
118         60         18  1589.980517
119         60         20  1642.650125

[120 rows x 3 columns]
Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
Including an interaction
mdl_mass_vs_both_inter = ols("mass_g ~ length_cm * height_cm",
data=fish).fit()
print(mdl_mass_vs_both_inter.params)
Intercept 159.107480
length_cm 0.301426
height_cm -78.125178
length_cm:height_cm 3.545435
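A prediction from this model is intercept + coefficient times each variable, plus the interaction coefficient times the product of the two variables. A sketch of the arithmetic using the coefficients printed above (the example length and height are arbitrary):

```python
# Manual prediction from the interaction model's printed coefficients.
intercept = 159.107480
b_length = 0.301426     # length_cm
b_height = -78.125178   # height_cm
b_inter = 3.545435      # length_cm:height_cm

length_cm, height_cm = 30, 10
mass_g = (intercept + b_length * length_cm
          + b_height * height_cm
          + b_inter * length_cm * height_cm)
print(mass_g)  # about 450.5 g
```

The interaction term dominates here: the negative height coefficient on its own would give a negative mass, but length times height pulls the prediction back up.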
The prediction flow with an interaction
length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)
p = product(length_cm, height_cm)
explanatory_data = pd.DataFrame(p,
columns=["length_cm",
"height_cm"])
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
Let's practice!
More than two explanatory variables
From last time
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
Faceting by species
grid = sns.FacetGrid(data=fish,
col="species",
hue="mass_g",
col_wrap=2,
palette="plasma")
grid.map(sns.scatterplot,
"length_cm",
"height_cm")
plt.show()
Faceting by species
It's possible to use more than one categorical variable for faceting.
Beware of faceting overuse: plotting becomes harder with an increasing number of variables.
Different levels of interaction
No interactions

ols("mass_g ~ length_cm + height_cm + species + 0", data=fish).fit()

Two-way interactions between pairs of variables

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

Three-way interaction between all three variables

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()
All the interactions
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()
Only two-way interactions
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ (length_cm + height_cm + species) ** 2 + 0",
    data=fish).fit()
The prediction flow
mdl_mass_vs_all = ols(
    "mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)
species = fish["species"].unique()

p = product(length_cm, height_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_all.predict(explanatory_data))

     length_cm  height_cm species       mass_g
0            5          2   Bream  -570.656437
1            5          2   Roach    31.449145
2            5          2   Perch    43.789984
3            5          2    Pike   271.270093
4            5          4   Bream  -451.127405
..         ...        ...     ...          ...
475         60         18    Pike  2690.346384
476         60         20   Bream  1531.618475
477         60         20   Roach  2621.797668
478         60         20   Perch  3041.931709
479         60         20    Pike  2926.352397

[480 rows x 4 columns]
Let's practice!
How linear regression works
The standard simple linear regression plot
Visualizing residuals
A metric for the best fit
The simplest idea (which doesn't work)
Take the sum of all the residuals.
Some residuals are negative.
The next simplest idea (which does work)
Take the square of each residual, and add up those squares.
This is called the sum of squares.
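The sum of squares metric can be sketched in a few lines; the toy data and the candidate lines here are arbitrary, chosen only to show that a line close to the data scores lower:

```python
import numpy as np

# Sum of squared residuals for a candidate line through toy data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def sum_of_squares(intercept, slope):
    residuals = y - (intercept + slope * x)
    return (residuals ** 2).sum()

# A line close to the data beats one that is far away.
print(sum_of_squares(0.0, 2.0))   # small
print(sum_of_squares(0.0, 10.0))  # much larger
```

Squaring makes every residual contribute positively, so over- and under-predictions cannot cancel out the way they do in a plain sum.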
A detour into numerical optimization
A line plot of a quadratic equation
x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10
xy_data = pd.DataFrame({"x": x,
"y": y})
sns.lineplot(x="x",
y="y",
data=xy_data)
Using calculus to solve the equation
y = x² − x + 10

∂y/∂x = 2x − 1

Set the derivative to zero and solve:
0 = 2x − 1
x = 0.5
y = 0.5² − 0.5 + 10 = 9.75

Not all equations can be solved like this.
You can let Python figure it out.
Don't worry if this doesn't make sense; you won't need it for the exercises.
minimize()
from scipy.optimize import minimize

def calc_quadratic(x):
    y = x ** 2 - x + 10
    return y

minimize(fun=calc_quadratic,
         x0=3)

      fun: 9.75
 hess_inv: array([[0.5]])
      jac: array([0.])
  message: 'Optimization terminated successfully.'
     nfev: 6
      nit: 2
     njev: 3
   status: 0
  success: True
        x: array([0.49999998])
A linear regression algorithm
Define a function to calculate the sum of squares metric.
Call minimize() to find the coefficients that minimize this function.

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    # More calculation!

minimize(
    fun=calc_sum_of_squares,
    x0=[0, 0]
)
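Filling in the "More calculation!" placeholder gives a complete, runnable version of this algorithm; the toy data is made up, and the result should match the closed-form least-squares fit (here checked against np.polyfit):

```python
import numpy as np
from scipy.optimize import minimize

# Fit a line by numerically minimizing the sum of squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    residuals = y - (intercept + slope * x)
    return (residuals ** 2).sum()

res = minimize(fun=calc_sum_of_squares, x0=[0, 0])
print(res.x)  # [intercept, slope], close to the np.polyfit answer
```

statsmodels does not literally call minimize() for OLS (a closed-form solution exists), but this is the general recipe that also powers models without closed forms, such as logistic regression.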
Let's practice!
Multiple logistic regression
Bank churn dataset
has_churned  time_since_first_purchase  time_since_last_purchase
0                            0.3993247                -0.5158691
1                           -0.4297957                 0.6780654
0                            3.7383122                 0.4082544
0                            0.6032289                -0.6990435
...                                ...                       ...
response     length of relationship     recency of activity

1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn
logit()
from statsmodels.formula.api import logit
logit("response ~ explanatory", data=dataset).fit()
logit("response ~ explanatory1 + explanatory2", data=dataset).fit()
logit("response ~ explanatory1 * explanatory2", data=dataset).fit()
The four outcomes
predicted false predicted true
actual false correct false positive
actual true false negative correct
conf_matrix = mdl_logit.pred_table()
print(conf_matrix)
[[102. 98.]
[ 53. 147.]]
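Summary metrics follow directly from this matrix; a numpy sketch of accuracy using the numbers printed above (rows are actual false/true, columns are predicted false/true):

```python
import numpy as np

# Accuracy = correct predictions (the diagonal) / all predictions.
conf_matrix = np.array([[102., 98.],
                        [53., 147.]])

accuracy = np.trace(conf_matrix) / conf_matrix.sum()
print(accuracy)  # (102 + 147) / 400 = 0.6225
```

The off-diagonal cells split the errors into false positives (98) and false negatives (53), which matter differently depending on the business question.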
Prediction flow
from itertools import product
explanatory1 = some_values
explanatory2 = some_values
p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p,
columns=["explanatory1",
"explanatory2"])
prediction_data = explanatory_data.assign(
    has_churned = mdl_logit.predict(explanatory_data))
Visualization
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])
sns.scatterplot(...
data=churn,
hue="has_churned",
...)
sns.scatterplot(...
data=prediction_data,
hue="most_likely_outcome",
...)
Let's practice!
The logistic
distribution
Maarten Van den Broeck
Content Developer at DataCamp
Gaussian probability density function (PDF)
from scipy.stats import norm
x = np.arange(-4, 4.05, 0.05)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x)}
)
sns.lineplot(x="x",
y="gauss_pdf",
data=gauss_dist)
Gaussian cumulative distribution function (CDF)
x = np.arange(-4, 4.05, 0.05)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x),
"gauss_cdf": norm.cdf(x)}
)
sns.lineplot(x="x",
y="gauss_cdf",
data=gauss_dist)
Gaussian inverse CDF
p = np.arange(0.001, 1, 0.001)
gauss_dist_inv = pd.DataFrame({
"p": p,
"gauss_inv_cdf": norm.ppf(p)}
)
sns.lineplot(x="p",
y="gauss_inv_cdf",
data=gauss_dist_inv)
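The inverse CDF (ppf in scipy) literally inverts the CDF: feeding cdf output back into ppf recovers the original values. A quick check:

```python
import numpy as np
from scipy.stats import norm

# ppf (the inverse CDF) undoes cdf: ppf(cdf(x)) == x.
x = np.arange(-2, 2.5, 0.5)
roundtrip = norm.ppf(norm.cdf(x))
print(np.allclose(roundtrip, x))  # True
```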
Logistic PDF
from scipy.stats import logistic
x = np.arange(-4, 4.05, 0.05)
logistic_dist = pd.DataFrame({
"x": x,
"log_pdf": logistic.pdf(x)}
)
sns.lineplot(x="x",
y="log_pdf",
data=logistic_dist)
Logistic distribution
The logistic distribution's CDF is also called the logistic function.

cdf(x) = 1 / (1 + exp(-x))

The logistic distribution's inverse CDF is also called the logit function.

inverse_cdf(p) = log(p / (1 - p))
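Both identities can be verified directly against scipy's implementation:

```python
import numpy as np
from scipy.stats import logistic

# The logistic CDF matches 1 / (1 + exp(-x)).
x = np.arange(-4, 4.05, 0.05)
cdf_matches = np.allclose(logistic.cdf(x), 1 / (1 + np.exp(-x)))

# The inverse CDF (ppf) matches the logit function log(p / (1 - p)).
p = np.arange(0.001, 1, 0.001)
logit_matches = np.allclose(logistic.ppf(p), np.log(p / (1 - p)))

print(cdf_matches, logit_matches)  # True True
```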
Let's practice!
How logistic
regression works
Maarten Van den Broeck
Content Developer at DataCamp
Sum of squares doesn't work
np.sum((y_pred - y_actual) ** 2)
y_actual is always 0 or 1 .
y_pred is between 0 and 1 .
There is a better metric than sum of squares.
Likelihood
The metric builds up in steps, from a single term to a sum over all observations:

y_pred * y_actual

y_pred * y_actual + (1 - y_pred) * (1 - y_actual)

np.sum(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))

When y_actual = 1
y_pred * 1 + (1 - y_pred) * (1 - 1) = y_pred

When y_actual = 0
y_pred * 0 + (1 - y_pred) * (1 - 0) = 1 - y_pred
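A small numeric check shows the metric behaving as intended: predictions close to the actual outcomes score higher than predictions far from them. The values below are made up for illustration:

```python
import numpy as np

y_actual = np.array([1, 0, 1, 1, 0])

# Predictions close to the actual responses vs. far from them.
y_pred_good = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
y_pred_bad = np.array([0.3, 0.9, 0.4, 0.2, 0.8])

def likelihood(y_pred, y_actual):
    # Each term is y_pred when y_actual is 1, and 1 - y_pred when it is 0.
    return np.sum(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))

# Good predictions yield a larger likelihood than bad ones.
print(likelihood(y_pred_good, y_actual))
print(likelihood(y_pred_bad, y_actual))
```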
Log-likelihood
Computing likelihood involves adding many very small numbers, leading to numerical error.
Log-likelihood is easier to compute.
log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
Because y_actual is always 0 or 1, both forms give the same answer for each observation.
Negative log-likelihood
Maximizing log-likelihood is the same as minimizing negative log-likelihood.
-np.sum(log_likelihoods)
Logistic regression algorithm
from scipy.optimize import minimize

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # More calculation!

minimize(
    fun=calc_neg_log_likelihood,
    x0=[0, 0]
)
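Filling in the "More calculation!" step gives a complete, runnable sketch of the algorithm. The toy data below is hypothetical, chosen only so that higher x values tend to have response 1:

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary data (hypothetical values standing in for the churn dataset).
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
y_actual = np.array([0, 0, 0, 1, 0, 1, 1])

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # Predicted probabilities via the logistic function.
    y_pred = 1 / (1 + np.exp(-(intercept + slope * x)))
    # Per-observation log-likelihood, as on the earlier slide.
    log_likelihoods = (np.log(y_pred) * y_actual
                       + np.log(1 - y_pred) * (1 - y_actual))
    return -np.sum(log_likelihoods)

result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
intercept, slope = result.x
```

Since the responses mostly switch from 0 to 1 as x increases, the fitted slope comes out positive, matching what logit() would find on the same data.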
Let's practice!
Congratulations!
Maarten Van den Broeck
Content Developer at DataCamp
You learned things
Chapter 1
Fit/visualize/predict/assess parallel slopes regression

Chapter 2
Interactions between explanatory variables
Simpson's Paradox

Chapter 3
Extend to many explanatory variables
Implement the linear regression algorithm

Chapter 4
Logistic regression with multiple explanatory variables
The logistic distribution
Implement the logistic regression algorithm
There is more to learn
Training and testing sets
Cross validation
P-values and significance
Advanced regression
Generalized Linear Models in Python
Introduction to Predictive Analytics in Python
Linear Classifiers in Python
Machine Learning with Tree-Based Models in Python
Have fun regressing!