Going beyond linear regression
GENERALIZED LINEAR MODELS IN PYTHON
Ita Cirovic Donev
Data Science Consultant
Course objectives
Learn building blocks of GLMs
Train GLMs
Interpret model results
Assess model performance
Compute predictions
Chapter 1: How are GLMs an extension of linear models
Chapter 2: Binomial (logistic) regression
Chapter 3: Poisson regression
Chapter 4: Multivariate logistic regression
Review of linear models
salary ∼ experience
salary = β0 + β1 × experience + ϵ
y = β0 + β1 x1 + ϵ
where:
y - response variable (output)
x1 - explanatory variable (input)
β - model parameters
β0 - intercept
β1 - slope
ϵ - random error
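As a minimal sketch of the model above, the following fits salary ∼ experience on simulated data. The data, the "true" parameter values (25000 and 9000), and the noise level are assumptions made up for illustration; they are not the course dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)

# Simulated data: salary = β0 + β1 × experience + ϵ (hypothetical values)
experience = rng.uniform(0, 20, size=200)
salary = 25000 + 9000 * experience + rng.normal(0, 5000, size=200)
df = pd.DataFrame({"salary": salary, "experience": experience})

# Estimate the intercept (β0) and slope (β1) by ordinary least squares
fit = ols("salary ~ experience", data=df).fit()
print(fit.params)
```

The estimated Intercept and experience coefficients should land close to the values used in the simulation, with the gap reflecting the random error ϵ.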
LINEAR MODEL - ols()
from statsmodels.formula.api import ols
model = ols(formula = 'y ~ X',
            data = my_data).fit()

GENERALIZED LINEAR MODEL - glm()
import statsmodels.api as sm
from statsmodels.formula.api import glm
model = glm(formula = 'y ~ X',
            data = my_data,
            family = sm.families.____).fit()
Assumptions of linear models
Regression function
E[y] = μ = β0 + β1 x1
Assumptions
Linear in parameters
Errors are independent and normally distributed
Constant variance
salary = 25790 + 9449 × experience
What if ... ?
The response is binary or count → NOT continuous
The variance of y is not constant → depends on the mean
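A quick simulation illustrates the second point for count data: for Poisson-distributed counts the variance equals the mean, so it cannot be constant when the mean changes. The means used below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)

# For each assumed mean, draw Poisson counts and compare sample mean and variance
results = {}
for mu in (1.0, 5.0, 20.0):
    counts = rng.poisson(lam=mu, size=100_000)
    results[mu] = (counts.mean(), counts.var())
    print(f"mu={mu}: mean={counts.mean():.2f}, variance={counts.var():.2f}")
```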
Dataset - nesting of horseshoe crabs
Variable Name   Description
sat             Number of satellites residing in the nest
y               There is at least one satellite residing in the nest; 0/1
weight          Weight of the female crab in kg
width           Width of the female crab in cm
color           1 - light medium, 2 - medium, 3 - dark medium, 4 - dark
spine           1 - both good, 2 - one worn or broken, 3 - both worn or broken
1 A. Agresti, An Introduction to Categorical Data Analysis, 2007.
Linear model and binary response
satellite crab ∼ female crab weight
y ~ weight
P (satellite crab is present) = P (y = 1)
From probabilities to classes
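A minimal sketch of this step: predicted probabilities become 0/1 classes by comparing them with a cutoff. The cutoff of 0.5 is an assumption, and the probabilities are the five values printed on the Predictions slide of this deck.

```python
import numpy as np

# Predicted probabilities P(y = 1), copied from the Predictions slide
probabilities = np.array([0.029309, 0.470299, 0.834983, 0.972363, 0.987941])

cutoff = 0.5  # assumed threshold; other cutoffs are possible
classes = (probabilities >= cutoff).astype(int)
print(classes)  # [0 0 1 1 1]
```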
Let's practice!
How to build a GLM?
Components of the GLM
Continuous → Linear Regression
Data type: continuous
Domain: (−∞, ∞)
Examples: house price, salary, person's height
Family: Gaussian()
Link: identity
g(μ) = μ = E(y)
Model = Linear regression
Binary → Logistic regression
Data type: binary
Domain: 0, 1
Examples: True/False
Family: Binomial()
Link: logit
Model = Logistic regression
Count → Poisson regression
Data type: count
Domain: 0, 1, 2, ..., ∞
Examples: number of votes, number of hurricanes
Family: Poisson()
Link: logarithm
Model = Poisson regression
Link functions
Density            Link: η = g(μ)        Default link      glm(family=...)
Normal             η = μ                 identity          Gaussian()
Poisson            η = log(μ)            logarithm         Poisson()
Binomial           η = log[p/(1 − p)]    logit             Binomial()
Gamma              η = 1/μ               inverse           Gamma()
Inverse Gaussian   η = 1/μ²              inverse squared   InverseGaussian()
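The link functions in the table can be written out directly. The sketch below defines each link g(μ) together with its inverse using plain NumPy (these are hand-written helpers for illustration, not the statsmodels link classes) and checks that applying the inverse to η recovers μ.

```python
import numpy as np

# Each entry: (link g(mu) -> eta, inverse link eta -> mu)
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (np.log, np.exp),
    "logit": (lambda p: np.log(p / (1 - p)),
              lambda eta: 1 / (1 + np.exp(-eta))),
    "inverse": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "inverse squared": (lambda mu: 1 / mu**2, lambda eta: 1 / np.sqrt(eta)),
}

# Means in (0, 1) are valid inputs for every link above
mu = np.array([0.1, 0.5, 0.9])

round_trip_ok = {}
for name, (g, g_inverse) in links.items():
    eta = g(mu)
    round_trip_ok[name] = bool(np.allclose(g_inverse(eta), mu))
    print(f"{name:16s} eta = {np.round(eta, 3)}")
```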
Benefits of GLMs
A unified framework for many different data distributions
Exponential family of distributions
Link function
Transforms the expected value of y
Enables linear combinations
Many techniques from linear models apply to GLMs as well
Let's practice
How to fit a GLM in Python?
statsmodels
Importing statsmodels
import statsmodels.api as sm
Support for formulas
import statsmodels.formula.api as smf
Use glm() directly
from statsmodels.formula.api import glm
Process of model fit
1. Describe the model → glm()
2. Fit the model → .fit()
3. Summarize the model → .summary()
4. Make model predictions → .predict()
Describing the model
FORMULA based
from statsmodels.formula.api import glm
model = glm(formula, data, family)

ARRAY based
import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.GLM(y, X, family)
Formula Argument
response ∼ explanatory variable(s)
output ∼ input(s)
formula = 'y ~ x1 + x2'
C(x1) : treat x1 as a categorical variable
-1 : remove the intercept
x1:x2 : an interaction term between x1 and x2
x1*x2 : the individual variables x1 and x2 plus their interaction term
np.log(x1) : apply vectorized functions to model variables
Family Argument
family = sm.families.____()
The family functions:
Gaussian(link = sm.families.links.identity) → the default family
Binomial(link = sm.families.links.logit)
    other available links: probit, cauchy, log, and cloglog
Poisson(link = sm.families.links.log)
    other available links: identity and sqrt
You can review the other distribution families on the statsmodels website.
Summarizing the model
print(model_GLM.summary())
                 Generalized Linear Model Regression Results
=============================================================================
Dep. Variable:                  y   No. Observations:                  173
Model:                        GLM   Df Residuals:                      171
Model Family:            Binomial   Df Model:                            1
Link Function:              logit   Scale:                          1.0000
Method:                      IRLS   Log-Likelihood:                -97.226
Date:            Mon, 21 Jan 2019   Deviance:                       194.45
Time:                    11:30:01   Pearson chi2:                     165.
No. Iterations:                 4   Covariance Type:             nonrobust
=============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------
Intercept    -12.3508      2.629     -4.698      0.000     -17.503      -7.199
width          0.4972      0.102      4.887      0.000       0.298       0.697
=============================================================================
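The columns of the coefficient table are related by simple arithmetic. As a rough check using the rounded values for width from the summary, z is the coefficient divided by its standard error, and the 95% interval is the coefficient plus or minus about 1.96 standard errors. Small differences from the printed table come from the rounding of std err.

```python
from statistics import NormalDist

# Rounded values for width, taken from the summary table
coef, std_err = 0.4972, 0.102

# z statistic: coefficient relative to its standard error
z = coef / std_err

# 95% interval using the standard normal quantile (about 1.96)
q = NormalDist().inv_cdf(0.975)
lower = coef - q * std_err
upper = coef + q * std_err

print(f"z = {z:.3f}, interval = [{lower:.3f}, {upper:.3f}]")
```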
Regression coefficients
.params prints regression coefficients
model_GLM.params
Intercept   -12.350818
width         0.497231
dtype: float64

.conf_int(alpha=0.05, cols=None) prints confidence intervals
model_GLM.conf_int()
                   0         1
Intercept -17.503010 -7.198625
width       0.297833  0.696629
Predictions
Specify all the model variables in test data
.predict(test_data) computes predictions
model_GLM.predict(test_data)
0 0.029309
1 0.470299
2 0.834983
3 0.972363
4 0.987941
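What .predict() computes for this logit model can be sketched by hand: form the linear predictor η from the fitted coefficients, then apply the inverse logit. The coefficients are the .params values shown earlier; the width of 25 cm is a hypothetical input, since the contents of test_data are not shown.

```python
import math

# Fitted coefficients from model_GLM.params
intercept, slope = -12.350818, 0.497231

width = 25.0  # hypothetical crab width in cm

# Linear predictor on the logit scale
eta = intercept + slope * width

# Inverse logit maps eta back to a probability P(y = 1)
probability = 1 / (1 + math.exp(-eta))
print(round(probability, 3))
```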
Let's practice!