Linear Regression


Data-Based Economics, ESCP, 2023-2024

Author: Pablo Winant
Published: January 31, 2024

Descriptive Statistics
A Simple Dataset

Duncan's Occupational Prestige Data: many occupations in 1950, with education, income and prestige measures associated with each occupation.

x: education
Percentage of occupational incumbents in 1950 who were high school graduates
y: income
Percentage of occupational incumbents in the 1950 US Census who earned $3,500 or more per year
z: prestige
Percentage of respondents in a social survey who rated the occupation as "good" or better in prestige

Quick look

Import the data from statsmodels’ dataset:

import statsmodels.api as sm
dataset = sm.datasets.get_rdataset("Duncan", "carData")
df = dataset.data
df.head()

            type  income  education  prestige
rownames
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90
Descriptive Statistics

For any variable $v$ with $N$ observations:

mean: $\bar{v} = \frac{1}{N} \sum_{i=1}^N v_i$
variance: $V(v) = \frac{1}{N} \sum_{i=1}^N (v_i - \bar{v})^2$
standard deviation: $\sigma(v) = \sqrt{V(v)}$

df.describe()

          income   education    prestige
count  45.000000   45.000000   45.000000
mean   41.866667   52.555556   47.688889
std    24.435072   29.760831   31.510332
min     7.000000    7.000000    3.000000
25%    21.000000   26.000000   16.000000
50%    42.000000   45.000000   41.000000
75%    64.000000   84.000000   81.000000
max    81.000000  100.000000   97.000000
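The mean, variance and standard deviation formulas can be checked by hand in plain Python; the five-value series below is made up for illustration (it is not the Duncan data):

```python
# Made-up sample used only to illustrate the formulas above.
v = [7.0, 21.0, 42.0, 64.0, 81.0]

N = len(v)
mean = sum(v) / N                                   # v-bar
variance = sum((vi - mean) ** 2 for vi in v) / N    # V(v)
std = variance ** 0.5                               # sigma(v)
```

Note that `df.describe()` reports the sample standard deviation (dividing by N − 1), which differs slightly from the population formula above.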

Relation between variables

How do we measure relations between two variables (with $N$ observations)?

Covariance: $Cov(x, y) = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})$
Correlation: $Cor(x, y) = \frac{Cov(x, y)}{\sigma(x) \sigma(y)}$
By construction, $Cor(x, y) \in [-1, 1]$
if $Cor(x, y) > 0$, x and y are positively correlated
if $Cor(x, y) < 0$, x and y are negatively correlated

Interpretation:
no interpretation!
correlation is not causality
also: data can be correlated by pure chance (spurious correlation)
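A minimal sketch of the covariance and correlation formulas, using two made-up series (not the Duncan data):

```python
import math

# Two made-up series of N = 5 observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N

# Cov(x, y) = (1/N) sum_i (x_i - xbar)(y_i - ybar)
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N

# Cor(x, y) = Cov(x, y) / (sigma(x) * sigma(y))
sigma_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / N)
sigma_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / N)
cor = cov / (sigma_x * sigma_y)
```

By construction, `cor` always lands in [−1, 1].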

Examples

df[['income','education','prestige']].cov()

              income   education    prestige
income    597.072727  526.871212  645.071212
education 526.871212  885.707071  798.904040
prestige  645.071212  798.904040  992.901010

df[['income','education','prestige']].corr()

            income  education  prestige
income    1.000000   0.724512  0.837801
education 0.724512   1.000000  0.851916
prestige  0.837801   0.851916  1.000000

Can we visualize correlations?

Quick look

from matplotlib import pyplot as plt
plt.plot(df['education'], df['income'], 'o')
plt.grid()
plt.xlabel("Education")
plt.ylabel("Income")
plt.savefig("data_description.png")

from matplotlib import pyplot as plt
plt.plot(df['education'], df['prestige'], 'o')
plt.grid()
plt.xlabel("Education")
plt.ylabel("Prestige")
plt.savefig("data_description.png")

Quick look

Using matplotlib (3d)


Quick look

import seaborn as sns


sns.pairplot(df[['education', 'prestige', 'income']])
The pairplot made with seaborn gives a simple sense of correlations as well as information about
the distribution of each variable.

Fitting the data


A Linear Model

Now we want to build a model to represent the data.

Consider the line:

$y = \alpha + \beta x$

There are several possibilities. Which one do we choose to represent the model?

We need some criterion.
Least Squares Criterion

Compare the model to the data:

$y_i = \alpha + \beta x_i + e_i$

where $e_i$ is the prediction error.

Squared errors:

$e_i^2 = (y_i - \alpha - \beta x_i)^2$

Loss function: sum of squares

$L(\alpha, \beta) = \sum_{i=1}^N (e_i)^2$
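The loss function $L(\alpha, \beta)$ translates directly into code; this is a sketch with made-up points, not part of the lecture's dataset:

```python
def loss(alpha, beta, x, y):
    """Sum of squared prediction errors: sum_i (y_i - alpha - beta*x_i)^2."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

# The line y = 0 + 1*x fits these points exactly, so the loss is zero:
print(loss(0.0, 1.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 0.0
```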

Minimizing Least Squares

Try to choose $\alpha, \beta$ so as to minimize the sum of the squares $L(\alpha, \beta)$.
It is a convex minimization problem: there is a unique solution.
Iterative minimization procedures of this kind are used in machine learning.
Ordinary Least Squares (1)

The mathematical problem $\min_{\alpha,\beta} L(\alpha, \beta)$ has one unique solution¹

The solution is given by the explicit formulas:

$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$

$\hat{\beta} = \frac{Cov(x, y)}{Var(x)} = Cor(x, y) \frac{\sigma(y)}{\sigma(x)}$

$\hat{\alpha}$ and $\hat{\beta}$ are estimators.
Hence the hats.
More on that later.
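The closed-form formulas can be applied directly; the series below are made up so the result is easy to verify by hand:

```python
# Made-up data: y is close to the line y = 0 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.0, 8.1, 9.9]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N

cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N
var_x = sum((xi - xbar) ** 2 for xi in x) / N

beta_hat = cov_xy / var_x            # Cov(x, y) / Var(x)
alpha_hat = ybar - beta_hat * xbar   # y-bar minus beta-hat times x-bar
```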

Concrete Example

In our example we get the result:

income = 10 + 0.59 × education

We can say:

income and education are positively correlated
a unit increase in education is associated with a 0.59 increase in income
a unit increase in education explains a 0.59 increase in income

But:

here "explains" does not mean "cause"

Explained Variance
Predictions

It is possible to make predictions with the model:

How much would an occupation which hires 60% of high schoolers fare salary-wise?

Prediction: the salary measure is 10 + 0.59 × 60 = 45.4

OK, but that seems noisy: how much do I really predict? Can I get a sense of the precision of my prediction?
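The point prediction is just the fitted line evaluated at education = 60, using the coefficients estimated above:

```python
# Fitted line from the example: income = 10 + 0.59 * education
alpha_hat, beta_hat = 10.0, 0.59

education = 60.0   # occupation where 60% of incumbents are high-school graduates
prediction = alpha_hat + beta_hat * education
```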

Look at the residuals

Plot the residuals: any abnormal observation?

Theory requires the residuals to be:
zero-mean
non-correlated
normally distributed

That looks like a normal distribution:
the standard deviation is $\sigma(e_i) = 16.84$
a more honest prediction would be 45.4 ± 16.84
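Zero-mean residuals are a mechanical property of the OLS fit, as a quick sketch with a made-up four-point dataset shows:

```python
# Made-up data (not the Duncan regression).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N
beta_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
alpha_hat = ybar - beta_hat * xbar

residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
sigma_e = (sum(e ** 2 for e in residuals) / N) ** 0.5   # sigma(e_i)
```

Here `sigma_e` plays the role of the 16.84 reported above: it quantifies how noisy the prediction is.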

What could go wrong?

In a well-specified model, residuals must look like white noise (i.i.d.: independent and identically distributed).
When residuals are clearly abnormal, the model must be changed.

Explained Variance
What is the share of the total variance explained by the variance of my prediction?

$R^2 = \frac{Var(\hat{\alpha} + \hat{\beta} x_i)}{Var(y_i)} = \frac{MSS}{TSS} = (Cor(x, y))^2$

$R^2 = 1 - \frac{Var(y_i - \hat{\alpha} - \hat{\beta} x_i)}{Var(y_i)} = 1 - \frac{RSS}{TSS}$

MSS: model sum of squares, explained variance
RSS: residual sum of squares, unexplained variance
TSS: total sum of squares, total variance

The coefficient of determination is a measure of the explanatory power of a regression, but not of the significance of a coefficient.
We'll get back to it when we see multivariate regressions.
In the one-dimensional case, it is possible to have a small $R^2$, yet a very precise regression coefficient.
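The three expressions for $R^2$ (MSS/TSS, 1 − RSS/TSS, and the squared correlation) can be checked to agree on a made-up dataset:

```python
# Made-up data (not the Duncan regression).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N
var_x = sum((xi - xbar) ** 2 for xi in x) / N
var_y = sum((yi - ybar) ** 2 for yi in y) / N
beta_hat = cov / var_x
alpha_hat = ybar - beta_hat * xbar

fitted = [alpha_hat + beta_hat * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

mss = sum((fi - ybar) ** 2 for fi in fitted) / N   # explained variance
rss = sum(e ** 2 for e in resid) / N               # unexplained variance

r2_mss = mss / var_y                 # MSS / TSS
r2_rss = 1 - rss / var_y             # 1 - RSS / TSS
r2_cor = cov ** 2 / (var_x * var_y)  # Cor(x, y)^2
```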

Graphical Representation

Statistical inference
Statistical model
Imagine the true model is:

$y = \alpha + \beta x + \epsilon$

$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

errors are independent...
and normally distributed...
with constant variance (homoscedastic)

Using this data-generation process, I have drawn randomly N data points (a.k.a. gathered the data):

maybe an actual sample (for instance N patients)
an abstract sample otherwise

Then I computed my estimates $\hat{\alpha}, \hat{\beta}$.

How confident am I in these estimates?

I could have gotten completely different ones...
clearly, the bigger N, the more confident I am...

Statistical inference (2)


Assume we have computed $\hat{\alpha}, \hat{\beta}$ from the data. Let's make a thought experiment instead.
Imagine the actual data-generating process was given by $y = \hat{\alpha} + \hat{\beta} x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, Var(e_i))$.
If I draw randomly N points using this D.G.P., I get new estimates.
And if I make many random draws, I get a distribution for my estimate.
I get an estimated $\hat{\sigma}(\hat{\beta})$:
were my initial estimates very likely?
or could they have taken any value with another draw from the data?
in the example, we see that estimates around 0.7 or 0.9 would be compatible with the data

How do we formalize these ideas? Statistical tests.
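The thought experiment can be run literally: fix a data-generating process, draw many samples, and look at the spread of the slope estimates. All parameter values below are made up for illustration:

```python
import random
import statistics

# Assumed (made-up) data-generating process: y = alpha + beta*x + eps.
random.seed(0)
true_alpha, true_beta, sigma = 0.5, 0.8, 1.0
x = [i / 10 for i in range(50)]   # fixed regressor values, N = 50

def ols_slope(x, y):
    """OLS slope estimate Cov(x, y) / Var(x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - xbar) ** 2 for xi in x) / n
    return cov / var_x

# Draw many datasets from the D.G.P. and re-estimate the slope each time.
betas = []
for _ in range(2000):
    y = [true_alpha + true_beta * xi + random.gauss(0.0, sigma) for xi in x]
    betas.append(ols_slope(x, y))

# The estimates scatter around the true slope; their spread is the
# simulated standard deviation of beta-hat.
center = statistics.mean(betas)
spread = statistics.pstdev(betas)
```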

First estimates

Given the true model, all estimators are random variables of the data-generating process.

Given the values $\alpha, \beta, \sigma$ of the true model, we can model the distribution of the estimates.

Some closed forms:

$\hat{\sigma}^2 = Var(y_i - \alpha - \beta x_i)$: estimated variance of the residuals
$mean(\hat{\beta}) = \beta$ (unbiased)
$\sigma(\hat{\beta})^2 = \frac{\sigma^2}{N \, Var(x_i)}$

These statistics, or any function of them, can be computed exactly, given the data.

Their distribution depends on the data-generating process.

Can we produce statistics whose distribution is known given mild assumptions on the data-generating process?

If so, we can assess how likely our observations are.

Fisher Statistic
Test

Hypothesis H0:
$\alpha = \beta = 0$
the model explains nothing, i.e. $R^2 = 0$
Hypothesis H1:
the model explains something, i.e. $R^2 > 0$

Fisher statistic:

$F = \frac{Explained\ Variance}{Unexplained\ Variance}$

The distribution of F is known theoretically:

assuming the model is actually linear and the shocks normal
it depends on the number of degrees of freedom (here $N - 2 = 18$)
not on the actual parameters of the model

In our case, $F_{stat} = 40.48$.

What was the probability it was that big, under the H0 hypothesis?

extremely small: $Prob(F > F_{stat} | H0) = 5.41 \times 10^{-6}$
we can reject H0 with p-value = 5e-6

In social science, the typical required p-value is 5%.

In practice, we abstract from the precise calculation of the Fisher statistic and look only at the p-value.
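For a simple regression, the F statistic can be rewritten in terms of $R^2$ and the number of observations (a standard identity, not stated above):

```python
def f_stat(r2, n):
    """F = (MSS / 1) / (RSS / (n - 2)) = (n - 2) * r2 / (1 - r2)."""
    return (n - 2) * r2 / (1 - r2)

# A made-up example: R^2 = 0.5 with 20 observations.
print(f_stat(0.5, 20))   # 18.0
```

A bigger R² or a larger sample both push F up, which is why the same R² can be significant in a large sample and not in a small one.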

Student test

So our estimate is $y = \underbrace{0.121}_{\tilde{\alpha}} + \underbrace{0.794}_{\tilde{\beta}} x$.

we know $\tilde{\beta}$ is a bit random (it's an estimator)
are we even sure $\tilde{\beta}$ could not have been zero?

Student test:
H0: $\beta = 0$
H1: $\beta \neq 0$
Statistic: $t = \frac{\hat{\beta}}{\sigma(\hat{\beta})}$
intuitively: compare the mean of the estimator to its standard deviation
also a function of degrees of freedom
Significance levels (read in a table or use software):
for 18 degrees of freedom, $P(|t| > t^\star) = 0.05$ with $t^\star = 1.734$
if $t > t^\star$ we are 95% confident the coefficient is significant
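A sketch of the t computation; the standard deviation value below is hypothetical, since it is not reported in the text:

```python
beta_hat = 0.794          # slope estimate from the example above
sigma_beta_hat = 0.125    # hypothetical standard deviation of the estimate

t = beta_hat / sigma_beta_hat
t_star = 1.734            # threshold for 18 degrees of freedom (from the table)
significant = abs(t) > t_star
```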

Student tables

Confidence intervals

The Student test can also be used to construct confidence intervals.

Given the estimate $\hat{\beta}$ with standard deviation $\sigma(\hat{\beta})$:

Given a probability threshold $\alpha$ (for instance $\alpha = 0.05$) we can compute $t^\star$ such that $P(|t| > t^\star) = \alpha$.

We construct the confidence interval:

$I_\alpha = [\hat{\beta} - t^\star \sigma(\hat{\beta}), \hat{\beta} + t^\star \sigma(\hat{\beta})]$

Interpretation:

if the true value were outside of the confidence interval, the probability of obtaining the value that we got would be less than 5%
we can say the true value is within the interval with a 95% confidence level
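Constructing such an interval numerically; only `beta_hat` and `t_star` come from the text, while `sigma_beta_hat` is a hypothetical value:

```python
beta_hat = 0.794
sigma_beta_hat = 0.125    # hypothetical standard deviation of the estimate
t_star = 1.734            # critical value for 18 degrees of freedom

# I_alpha = [beta-hat - t* sigma(beta-hat), beta-hat + t* sigma(beta-hat)]
ci = (beta_hat - t_star * sigma_beta_hat,
      beta_hat + t_star * sigma_beta_hat)
```

With these numbers the interval does not contain zero, so the slope would be deemed significant at this level.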

Footnotes
1. Proof not important here. ↩︎
