Linear Regression


Data-Based Economics, ESCP, 2023-2024

Author: Pablo Winant
Published: January 31, 2024

Descriptive Statistics
A Simple Dataset

Duncan's Occupational Prestige Data: many occupations in 1950, with education, income and prestige measures associated with each occupation.

x: education
Percentage of occupational incumbents in 1950 who were high school graduates
y: income
Percentage of occupational incumbents in the 1950 US Census who earned $3,500 or more per year
z: prestige
Percentage of respondents in a social survey who rated the occupation as "good" or better in prestige

Quick look

Import the data from statsmodels’ dataset:

import statsmodels.api as sm
dataset = sm.datasets.get_rdataset("Duncan", "carData")
df = dataset.data
df.head()

            type  income  education  prestige
rownames
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90
Descriptive Statistics

For any variable $v$ with $N$ observations:

mean: $\bar{v} = \frac{1}{N} \sum_{i=1}^N v_i$
variance: $V(v) = \frac{1}{N} \sum_{i=1}^N (v_i - \bar{v})^2$
standard deviation: $\sigma(v) = \sqrt{V(v)}$

df.describe()

          income   education    prestige
count  45.000000   45.000000   45.000000
mean   41.866667   52.555556   47.688889
std    24.435072   29.760831   31.510332
min     7.000000    7.000000    3.000000
25%    21.000000   26.000000   16.000000
50%    42.000000   45.000000   41.000000
75%    64.000000   84.000000   81.000000
max    81.000000  100.000000   97.000000
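The mean, variance and standard deviation formulas can be checked by hand in plain Python; the five-value series below is made up for illustration (it is not the Duncan data):

```python
# Made-up sample used only to illustrate the formulas above.
v = [7.0, 21.0, 42.0, 64.0, 81.0]

N = len(v)
mean = sum(v) / N                                   # v-bar
variance = sum((vi - mean) ** 2 for vi in v) / N    # V(v)
std = variance ** 0.5                               # sigma(v)
```

Note that `df.describe()` reports the sample standard deviation (dividing by N − 1), which differs slightly from the population formula above.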

Relation between variables

How do we measure relations between two variables (with $N$ observations)?

Covariance: $Cov(x, y) = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})$
Correlation: $Cor(x, y) = \frac{Cov(x, y)}{\sigma(x) \sigma(y)}$
By construction, $Cor(x, y) \in [-1, 1]$
if $Cor(x, y) > 0$, x and y are positively correlated
if $Cor(x, y) < 0$, x and y are negatively correlated

Interpretation:
no interpretation!
correlation is not causality
also: data can be correlated by pure chance (spurious correlation)
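A minimal sketch of the covariance and correlation formulas, using two made-up series (not the Duncan data):

```python
import math

# Two made-up series of N = 5 observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N

# Cov(x, y) = (1/N) sum_i (x_i - xbar)(y_i - ybar)
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N

# Cor(x, y) = Cov(x, y) / (sigma(x) * sigma(y))
sigma_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / N)
sigma_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / N)
cor = cov / (sigma_x * sigma_y)
```

By construction, `cor` always lands in [−1, 1].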

Examples

df[['income','education','prestige']].cov()

              income   education    prestige
income    597.072727  526.871212  645.071212
education 526.871212  885.707071  798.904040
prestige  645.071212  798.904040  992.901010

df[['income','education','prestige']].corr()

            income  education  prestige
income    1.000000   0.724512  0.837801
education 0.724512   1.000000  0.851916
prestige  0.837801   0.851916  1.000000

Can we visualize correlations?

Quick look

from matplotlib import pyplot as plt
plt.plot(df['education'], df['income'], 'o')
plt.grid()
plt.xlabel("Education")
plt.ylabel("Income")
plt.savefig("data_description.png")

from matplotlib import pyplot as plt
plt.plot(df['education'], df['prestige'], 'o')
plt.grid()
plt.xlabel("Education")
plt.ylabel("Prestige")
plt.savefig("data_description.png")

Quick look

Using matplotlib (3d)


Quick look

import seaborn as sns


sns.pairplot(df[['education', 'prestige', 'income']])
The pairplot made with seaborn gives a simple sense of correlations as well as information about
the distribution of each variable.

Fitting the data


A Linear Model

Now we want to build a model to represent the data.

Consider the line:

$y = \alpha + \beta x$

There are several possibilities. Which one do we choose to represent the model?

We need some criterion.
Least Squares Criterion

Compare the model to the data:

$y_i = \alpha + \beta x_i + e_i$

where $e_i$ is the prediction error.

Squared errors:

$e_i^2 = (y_i - \alpha - \beta x_i)^2$

Loss function: sum of squares

$L(\alpha, \beta) = \sum_{i=1}^N (e_i)^2$
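The loss function $L(\alpha, \beta)$ translates directly into code; this is a sketch with made-up points, not part of the lecture's dataset:

```python
def loss(alpha, beta, x, y):
    """Sum of squared prediction errors: sum_i (y_i - alpha - beta*x_i)^2."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

# The line y = 0 + 1*x fits these points exactly, so the loss is zero:
print(loss(0.0, 1.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 0.0
```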

Minimizing Least Squares

Try to choose $\alpha, \beta$ so as to minimize the sum of the squares $L(\alpha, \beta)$.
It is a convex minimization problem: there is a unique solution.
Iterative minimization procedures of this kind are used in machine learning.
Ordinary Least Squares (1)

The mathematical problem $\min_{\alpha,\beta} L(\alpha, \beta)$ has one unique solution¹

The solution is given by the explicit formulas:

$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$

$\hat{\beta} = \frac{Cov(x, y)}{Var(x)} = Cor(x, y) \frac{\sigma(y)}{\sigma(x)}$

$\hat{\alpha}$ and $\hat{\beta}$ are estimators.
Hence the hats.
More on that later.
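The closed-form formulas can be applied directly; the series below are made up so the result is easy to verify by hand:

```python
# Made-up data: y is close to the line y = 0 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.0, 8.1, 9.9]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N

cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N
var_x = sum((xi - xbar) ** 2 for xi in x) / N

beta_hat = cov_xy / var_x            # Cov(x, y) / Var(x)
alpha_hat = ybar - beta_hat * xbar   # y-bar minus beta-hat times x-bar
```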

Concrete Example

In our example we get the result:

income = 10 + 0.59 × education

We can say:

income and education are positively correlated
a unit increase in education is associated with a 0.59 increase in income
a unit increase in education explains a 0.59 increase in income

But:

here "explains" does not mean "cause"

Explained Variance
Predictions

It is possible to make predictions with the model:

How much would an occupation which hires 60% of high schoolers fare salary-wise?

Prediction: the salary measure is 10 + 0.59 × 60 = 45.4

OK, but that seems noisy: how much do I really predict? Can I get a sense of the precision of my prediction?
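The point prediction is just the fitted line evaluated at education = 60, using the coefficients estimated above:

```python
# Fitted line from the example: income = 10 + 0.59 * education
alpha_hat, beta_hat = 10.0, 0.59

education = 60.0   # occupation where 60% of incumbents are high-school graduates
prediction = alpha_hat + beta_hat * education
```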

Look at the residuals

Plot the residuals: any abnormal observation?

Theory requires the residuals to be:
zero-mean
non-correlated
normally distributed

That looks like a normal distribution:
the standard deviation is $\sigma(e_i) = 16.84$
a more honest prediction would be 45.4 ± 16.84
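Zero-mean residuals are a mechanical property of the OLS fit, as a quick sketch with a made-up four-point dataset shows:

```python
# Made-up data (not the Duncan regression).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N
beta_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
alpha_hat = ybar - beta_hat * xbar

residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
sigma_e = (sum(e ** 2 for e in residuals) / N) ** 0.5   # sigma(e_i)
```

Here `sigma_e` plays the role of the 16.84 reported above: it quantifies how noisy the prediction is.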

What could go wrong?

In a well-specified model, residuals must look like white noise (i.i.d.: independent and identically distributed).
When residuals are clearly abnormal, the model must be changed.

Explained Variance
What is the share of the total variance explained by the variance of my prediction?

$R^2 = \frac{Var(\hat{\alpha} + \hat{\beta} x_i)}{Var(y_i)} = \frac{MSS}{TSS} = (Cor(x, y))^2$

$R^2 = 1 - \frac{Var(y_i - \hat{\alpha} - \hat{\beta} x_i)}{Var(y_i)} = 1 - \frac{RSS}{TSS}$

MSS: model sum of squares, explained variance
RSS: residual sum of squares, unexplained variance
TSS: total sum of squares, total variance

The coefficient of determination is a measure of the explanatory power of a regression, but not of the significance of a coefficient.
We'll get back to it when we see multivariate regressions.
In the one-dimensional case, it is possible to have a small $R^2$, yet a very precise regression coefficient.
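The three expressions for $R^2$ (MSS/TSS, 1 − RSS/TSS, and the squared correlation) can be checked to agree on a made-up dataset:

```python
# Made-up data (not the Duncan regression).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N
var_x = sum((xi - xbar) ** 2 for xi in x) / N
var_y = sum((yi - ybar) ** 2 for yi in y) / N
beta_hat = cov / var_x
alpha_hat = ybar - beta_hat * xbar

fitted = [alpha_hat + beta_hat * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

mss = sum((fi - ybar) ** 2 for fi in fitted) / N   # explained variance
rss = sum(e ** 2 for e in resid) / N               # unexplained variance

r2_mss = mss / var_y                 # MSS / TSS
r2_rss = 1 - rss / var_y             # 1 - RSS / TSS
r2_cor = cov ** 2 / (var_x * var_y)  # Cor(x, y)^2
```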

Graphical Representation

Statistical inference
Statistical model
Imagine the true model is:

$y = \alpha + \beta x + \epsilon$

$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

errors are independent...
and normally distributed...
with constant variance (homoscedastic)

Using this data-generation process, I have drawn randomly N data points (a.k.a. gathered the data):

maybe an actual sample (for instance N patients)
an abstract sample otherwise

Then I computed my estimates $\hat{\alpha}, \hat{\beta}$.

How confident am I in these estimates?

I could have gotten completely different ones...
clearly, the bigger N, the more confident I am...

Statistical inference (2)


Assume we have computed $\hat{\alpha}, \hat{\beta}$ from the data. Let's make a thought experiment instead.
Imagine the actual data-generating process was given by $y = \hat{\alpha} + \hat{\beta} x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, Var(e_i))$.
If I draw randomly N points using this D.G.P., I get new estimates.
And if I make many random draws, I get a distribution for my estimate.
I get an estimated $\hat{\sigma}(\hat{\beta})$:
were my initial estimates very likely?
or could they have taken any value with another draw from the data?
in the example, we see that estimates around 0.7 or 0.9 would be compatible with the data

How do we formalize these ideas? Statistical tests.
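The thought experiment can be run literally: fix a data-generating process, draw many samples, and look at the spread of the slope estimates. All parameter values below are made up for illustration:

```python
import random
import statistics

# Assumed (made-up) data-generating process: y = alpha + beta*x + eps.
random.seed(0)
true_alpha, true_beta, sigma = 0.5, 0.8, 1.0
x = [i / 10 for i in range(50)]   # fixed regressor values, N = 50

def ols_slope(x, y):
    """OLS slope estimate Cov(x, y) / Var(x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - xbar) ** 2 for xi in x) / n
    return cov / var_x

# Draw many datasets from the D.G.P. and re-estimate the slope each time.
betas = []
for _ in range(2000):
    y = [true_alpha + true_beta * xi + random.gauss(0.0, sigma) for xi in x]
    betas.append(ols_slope(x, y))

# The estimates scatter around the true slope; their spread is the
# simulated standard deviation of beta-hat.
center = statistics.mean(betas)
spread = statistics.pstdev(betas)
```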

First estimates

Given the true model, all estimators are random variables of the data-generating process.

Given the values $\alpha, \beta, \sigma$ of the true model, we can model the distribution of the estimates.

Some closed forms:

$\hat{\sigma}^2 = Var(y_i - \alpha - \beta x_i)$: estimated variance of the residuals
$mean(\hat{\beta}) = \beta$ (unbiased)
$\sigma(\hat{\beta})^2 = \frac{\sigma^2}{N \, Var(x_i)}$

These statistics, or any function of them, can be computed exactly, given the data.

Their distribution depends on the data-generating process.

Can we produce statistics whose distribution is known given mild assumptions on the data-generating process?

If so, we can assess how likely our observations are.

Fisher Statistic
Test

Hypothesis H0:
$\alpha = \beta = 0$
the model explains nothing, i.e. $R^2 = 0$
Hypothesis H1:
the model explains something, i.e. $R^2 > 0$

Fisher statistic:

$F = \frac{Explained\ Variance}{Unexplained\ Variance}$

The distribution of F is known theoretically:

assuming the model is actually linear and the shocks normal
it depends on the number of degrees of freedom (here $N - 2 = 18$)
not on the actual parameters of the model

In our case, $F_{stat} = 40.48$.

What was the probability it was that big, under the H0 hypothesis?

extremely small: $Prob(F > F_{stat} | H0) = 5.41 \times 10^{-6}$
we can reject H0 with p-value = 5e-6

In social science, the typical required p-value is 5%.

In practice, we abstract from the precise calculation of the Fisher statistic and look only at the p-value.
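For a simple regression, the F statistic can be rewritten in terms of $R^2$ and the number of observations (a standard identity, not stated above):

```python
def f_stat(r2, n):
    """F = (MSS / 1) / (RSS / (n - 2)) = (n - 2) * r2 / (1 - r2)."""
    return (n - 2) * r2 / (1 - r2)

# A made-up example: R^2 = 0.5 with 20 observations.
print(f_stat(0.5, 20))   # 18.0
```

A bigger R² or a larger sample both push F up, which is why the same R² can be significant in a large sample and not in a small one.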

Student test

So our estimate is $y = \underbrace{0.121}_{\tilde{\alpha}} + \underbrace{0.794}_{\tilde{\beta}} x$.

we know $\tilde{\beta}$ is a bit random (it's an estimator)
are we even sure $\tilde{\beta}$ could not have been zero?

Student test:
H0: $\beta = 0$
H1: $\beta \neq 0$
Statistic: $t = \frac{\hat{\beta}}{\sigma(\hat{\beta})}$
intuitively: compare the mean of the estimator to its standard deviation
also a function of degrees of freedom
Significance levels (read in a table or use software):
for 18 degrees of freedom, $P(|t| > t^\star) = 0.05$ with $t^\star = 1.734$
if $t > t^\star$ we are 95% confident the coefficient is significant
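A sketch of the t computation; the standard deviation value below is hypothetical, since it is not reported in the text:

```python
beta_hat = 0.794          # slope estimate from the example above
sigma_beta_hat = 0.125    # hypothetical standard deviation of the estimate

t = beta_hat / sigma_beta_hat
t_star = 1.734            # threshold for 18 degrees of freedom (from the table)
significant = abs(t) > t_star
```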

Student tables

Confidence intervals

The Student test can also be used to construct confidence intervals.

Given the estimate $\hat{\beta}$ with standard deviation $\sigma(\hat{\beta})$:

Given a probability threshold $\alpha$ (for instance $\alpha = 0.05$) we can compute $t^\star$ such that $P(|t| > t^\star) = \alpha$.

We construct the confidence interval:

$I_\alpha = [\hat{\beta} - t^\star \sigma(\hat{\beta}), \hat{\beta} + t^\star \sigma(\hat{\beta})]$

Interpretation:

if the true value were outside of the confidence interval, the probability of obtaining the value that we got would be less than 5%
we can say the true value is within the interval with a 95% confidence level
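Constructing such an interval numerically; only `beta_hat` and `t_star` come from the text, while `sigma_beta_hat` is a hypothetical value:

```python
beta_hat = 0.794
sigma_beta_hat = 0.125    # hypothetical standard deviation of the estimate
t_star = 1.734            # critical value for 18 degrees of freedom

# I_alpha = [beta-hat - t* sigma(beta-hat), beta-hat + t* sigma(beta-hat)]
ci = (beta_hat - t_star * sigma_beta_hat,
      beta_hat + t_star * sigma_beta_hat)
```

With these numbers the interval does not contain zero, so the slope would be deemed significant at this level.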

Footnotes
1. Proof not important here. ↩︎
