Linear Regression
Data-Based Economics, ESCP, 2023-2024
Author: Pablo Winant
Published: January 31, 2024
Descriptive Statistics
A Simple Dataset
Duncan’s Occupational Prestige Data
- Many occupations in 1950.
- Education and prestige associated with each occupation.
x: education
- Percentage of occupational incumbents in 1950 who were high school graduates
y: income
- Percentage of occupational incumbents in the 1950 US Census who earned $3,500 or more per year
z: prestige
- Percentage of respondents in a social survey who rated the occupation as “good” or better in prestige
Quick look
import statsmodels.api as sm
dataset = sm.datasets.get_rdataset("Duncan", "carData")
df = dataset.data
df.head()
rownames    type  income  education  prestige
accountant  prof  62      86         82
pilot       prof  72      76         83
architect   prof  75      92         90
author      prof  55      90         76
chemist     prof  64      86         90
Descriptive Statistics
Examples
df[['income','education','prestige']].cov()
df[['income','education','prestige']].c
...
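As a minimal sketch of these descriptive statistics, the block below rebuilds only the five rows displayed by `df.head()` (assuming the column order type, income, education, prestige) and computes their covariance and correlation matrices. The name `sample` is introduced for illustration; with five rows the numbers are not those of the full dataset.

```python
import pandas as pd

# The five rows shown by df.head() above (not the full Duncan dataset).
sample = pd.DataFrame(
    {
        "income": [62, 72, 75, 55, 64],
        "education": [86, 76, 92, 90, 86],
        "prestige": [82, 83, 90, 76, 90],
    },
    index=["accountant", "pilot", "architect", "author", "chemist"],
)

cov = sample.cov()    # sample covariances (divides by N - 1)
corr = sample.corr()  # Pearson correlations, always in [-1, 1]

# A correlation is a covariance rescaled by the two standard deviations.
manual = cov.loc["income", "education"] / (
    sample["income"].std() * sample["education"].std()
)

print(cov)
print(corr)
print(manual)
```

The last value reproduces the income/education entry of the correlation matrix, which shows why correlations are easier to compare across variables than raw covariances.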
Can we visualize correlations?
Quick
# Income against education
from matplotlib import pyplot as plt
plt.plot(df['education'],df['income'],'o')
plt.grid()
plt.xlabel("Education")
plt.ylabel("Income")
plt.savefig("data_description.png")

# Prestige against education
from matplotlib import pyplot as plt
plt.plot(df['education'],df['prestige'],'o')
plt.grid()
plt.xlabel("Education")
plt.ylabel("Prestige")
plt.savefig("data_description.png")
Quick look
$y = \alpha + \beta x$
$y_i = \alpha + \beta x_i + e_i$, where $e_i$ is the prediction error
Square Errors
$e_i^2 = (y_i - \alpha - \beta x_i)^2$
$L(\alpha, \beta) = \sum_{i=1}^{N} (e_i)^2$
Try to choose α, β so as to minimize the sum of the squares L(α, β).
It is a convex minimization problem: unique solution.
Minimizing the loss directly, by an iterative procedure, is the approach used in machine learning.
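One possible iterative procedure is plain gradient descent on the squared-error loss. The sketch below is an illustration under assumptions: the data are synthetic, and the step size and iteration count are hypothetical values chosen by hand for this example. The result is compared with NumPy's least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration, not the Duncan dataset).
x = rng.uniform(0, 1, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=200)

alpha, beta = 0.0, 0.0  # arbitrary starting point
step = 0.5              # hypothetical learning rate, tuned by hand
for _ in range(5000):
    e = y - alpha - beta * x  # current prediction errors
    # Gradient of the mean squared error, i.e. of L(alpha, beta) / N.
    grad_alpha = -2.0 * e.mean()
    grad_beta = -2.0 * (e * x).mean()
    alpha -= step * grad_alpha
    beta -= step * grad_beta

# Reference solution: polyfit returns (slope, intercept) for degree 1.
beta_ref, alpha_ref = np.polyfit(x, y, 1)
print(alpha, beta)
print(alpha_ref, beta_ref)
```

Because the problem is convex, the iteration ends at the same point whatever the starting values, provided the step size is small enough.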
Ordinary Least Squares (1)
The mathematical problem $\min_{\alpha,\beta} L(\alpha, \beta)$ has one unique solution [1]:
$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$
$\hat{\beta} = \frac{Cov(x, y)}{Var(x)} = Cor(x, y) \frac{\sigma(y)}{\sigma(x)}$
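The closed-form solution can be checked numerically. The sketch below applies both expressions for the slope to the five rows displayed earlier (education as x, income as y, assuming the column order type, income, education, prestige) and compares them with NumPy's line fit. With only five rows, the estimates differ from those of the full dataset.

```python
import numpy as np

# Five rows from df.head() above: education (x) and income (y).
x = np.array([86, 76, 92, 90, 86], dtype=float)
y = np.array([62, 72, 75, 55, 64], dtype=float)

# Closed-form OLS estimates.
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

# Equivalent expression through the correlation coefficient.
beta_from_cor = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)

# Cross-check: polyfit returns (slope, intercept) for degree 1.
beta_ref, alpha_ref = np.polyfit(x, y, 1)

print(alpha_hat, beta_hat)
print(beta_from_cor)
```

The same `ddof` must be used in the covariance and in the variance; the ratio is then identical whether one divides by N or by N - 1.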
Concrete Example
$\underbrace{y}_{\text{income}} = 10 + 0.59 \underbrace{x}_{\text{education}}$
We can say:
...
But:
Explained Variance
Predictions
How would an occupation in which 60% of incumbents are high school graduates fare salary-wise?
...
Prediction: salary measure is 45.4
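The prediction is a direct application of the fitted line. A short worked version, with variable names chosen for illustration:

```python
alpha_hat = 10.0   # intercept of the fitted line
beta_hat = 0.59    # slope of the fitted line
education = 60.0   # 60% of incumbents are high school graduates

predicted_income = alpha_hat + beta_hat * education
print(round(predicted_income, 1))  # 45.4
```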
...
OK, but that seems noisy: how much do I really predict? Can I get a sense of the precision of my prediction?
- In a well-specified model, residuals must look like white noise (i.i.d.: independent and identically distributed).
- When residuals are clearly abnormal, the model must be changed.
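One informal way to spot abnormal residuals is to check whether they still carry a pattern in x. The sketch below uses synthetic data (an assumption for illustration) and a hypothetical helper `residuals_of_line_fit`: it fits a line to a truly linear relation and to a curved one, then correlates each set of residuals with x². This is a simple diagnostic, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=500)
noise = rng.normal(0, 0.1, size=500)

def residuals_of_line_fit(x, y):
    """Residuals of the least-squares line fitted to (x, y)."""
    beta_hat, alpha_hat = np.polyfit(x, y, 1)
    return y - alpha_hat - beta_hat * x

y_linear = 1 + 2 * x + noise              # a line is well specified here
y_curved = 1 + 2 * x + 3 * x**2 + noise   # a line misses the curvature

res_good = residuals_of_line_fit(x, y_linear)
res_bad = residuals_of_line_fit(x, y_curved)

# Leftover structure: correlation between residuals and x squared.
corr_good = np.corrcoef(res_good, x**2)[0, 1]
corr_bad = np.corrcoef(res_bad, x**2)[0, 1]
print(corr_good, corr_bad)
```

In the well-specified case the correlation is close to zero; in the curved case it is close to one, which signals that the model should include a nonlinear term.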
Explained Variance
What is the share of the total variance explained by the variance of my prediction?
$R^2 = \frac{\overbrace{Var(\hat{\alpha} + \hat{\beta} x_i)}^{MSS}}{\underbrace{Var(y_i)}_{TSS}} = \frac{MSS}{TSS} = (Cor(x, y))^2$
$R^2 = 1 - \frac{\overbrace{Var(y_i - \hat{\alpha} - \hat{\beta} x_i)}^{RSS}}{\underbrace{Var(y_i)}_{TSS}} = 1 - \frac{RSS}{TSS}$
- MSS: model sum of squares, explained variance
- RSS: residual sum of squares, unexplained variance
- TSS: total sum of squares, total variance
Coefficient of determination is a measure of the explanatory power of a regression
- but not of the significance of a coefficient
- we’ll get back to it when we see multivariate regressions
In the one-dimensional case, it is possible to have a small R2, yet a very precise regression coefficient.
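The three expressions for R² can be verified to coincide for a simple regression with an intercept. The sketch below uses synthetic data; the parameter values and noise level are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data (illustrative values, not the Duncan dataset).
x = rng.uniform(0, 100, size=100)
y = 10 + 0.59 * x + rng.normal(0, 20, size=100)

beta_hat, alpha_hat = np.polyfit(x, y, 1)
fitted = alpha_hat + beta_hat * x
residuals = y - fitted

tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
mss = np.sum((fitted - fitted.mean()) ** 2)  # model sum of squares
rss = np.sum(residuals ** 2)                 # residual sum of squares

r2_explained = mss / tss
r2_residual = 1 - rss / tss
r2_correlation = np.corrcoef(x, y)[0, 1] ** 2
print(r2_explained, r2_residual, r2_correlation)
```

The equality relies on the decomposition TSS = MSS + RSS, which holds for least squares with an intercept.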
Graphical Representation
Statistical inference
Statistical model
Imagine the true model is:
$y = \alpha + \beta x + \epsilon$
$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
Then I computed my estimates $\hat{\alpha}, \hat{\beta}$.
First estimates
Given the true model, all estimators are random variables of the data generating process
Given the values α, β, σ of the true model, we can model the distribution of the estimates.
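The idea that an estimator is a random variable can be made concrete by simulation. The sketch below assumes normal errors and hypothetical true values for α, β, σ, draws many datasets from that model, estimates the slope on each, and compares the spread of the estimates with the textbook formula σ / sqrt(Σ(x_i − x̄)²).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical true model (values chosen for illustration).
alpha, beta, sigma = 10.0, 0.59, 15.0
n, n_draws = 50, 2000

x = rng.uniform(0, 100, size=n)     # regressors held fixed across draws
x_centered = x - x.mean()
sxx = np.sum(x_centered ** 2)

# Each row of Y is one dataset generated by the true model.
eps = rng.normal(0, sigma, size=(n_draws, n))
Y = alpha + beta * x + eps

# OLS slope for every dataset: Cov(x, y) / Var(x), computed row by row.
beta_hats = (Y @ x_centered) / sxx

theoretical_sd = sigma / np.sqrt(sxx)
print(beta_hats.mean(), beta_hats.std(ddof=1), theoretical_sd)
```

The estimates are centered on the true β, and their standard deviation matches the theoretical value closely.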
These statistics, or any function of them, can be computed exactly, given the data.
Can we produce statistics whose distribution is known given mild assumptions on the data-generating process?
Fisher-Statistic
Test
Hypothesis H0:
- β = 0
- model explains nothing, i.e. $R^2 = 0$
Hypothesis H1:
- model explains something, i.e. $R^2 > 0$
Fisher Statistics:
$F = \frac{\text{Explained Variance}}{\text{Unexplained Variance}}$
The probability of observing an F this large under H0 is extremely small:
$Prob(F > F_{stat} | H0) = 5.41e-6$
We can reject H0 with p-value = 5e-6.
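For a regression with one explanatory variable, the Fisher statistic can be computed by hand as F = (MSS / 1) / (RSS / (N − 2)), with a p-value taken from the F distribution with (1, N − 2) degrees of freedom. The sketch below uses synthetic data (illustrative values, so the numbers differ from those quoted above) and cross-checks against `scipy.stats.linregress`, whose slope test is equivalent in this one-variable case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50
# Synthetic data (an assumption for illustration).
x = rng.uniform(0, 100, size=n)
y = 10 + 0.5 * x + rng.normal(0, 20, size=n)

beta_hat, alpha_hat = np.polyfit(x, y, 1)
fitted = alpha_hat + beta_hat * x
mss = np.sum((fitted - y.mean()) ** 2)  # explained sum of squares
rss = np.sum((y - fitted) ** 2)         # unexplained sum of squares

# Each sum of squares is divided by its degrees of freedom.
f_stat = (mss / 1) / (rss / (n - 2))
p_value = stats.f.sf(f_stat, 1, n - 2)  # Prob(F > f_stat | H0)

check = stats.linregress(x, y)          # t-test on the slope
print(f_stat, p_value, check.pvalue)
```

In this setting F equals the square of the Student statistic of the slope, so both tests give the same p-value.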
Student test
Student tables
Confidence intervals
Given a probability threshold α (for instance α = 0.05) we can compute $t^\star$ such that $P(|t| > t^\star) = \alpha$
We construct the confidence interval:
$I_\alpha = [\hat{\beta} - t^\star \sigma(\hat{\beta}), \hat{\beta} + t^\star \sigma(\hat{\beta})]$
...
Interpretation:
- If the true value were outside the confidence interval, the probability of obtaining the value that we got would be less than 5%.
- We can say the true value is within the interval with a 95% confidence level.
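A confidence interval for the slope can be built step by step. The sketch below uses synthetic data with normal errors (assumptions for illustration), estimates the standard error of the slope from the residuals, and takes the critical value from a Student distribution with N − 2 degrees of freedom. The name `threshold` stands for the probability threshold α, to avoid confusion with the intercept.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 50
# Synthetic data (illustrative values, not the Duncan dataset).
x = rng.uniform(0, 100, size=n)
y = 10 + 0.59 * x + rng.normal(0, 15, size=n)

beta_hat, alpha_hat = np.polyfit(x, y, 1)
residuals = y - alpha_hat - beta_hat * x

# Estimated noise variance and standard error of the slope.
sigma2_hat = np.sum(residuals ** 2) / (n - 2)
se_beta = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

threshold = 0.05
# Two-sided critical value: P(|t| > t_star) = threshold.
t_star = stats.t.ppf(1 - threshold / 2, df=n - 2)

lower = beta_hat - t_star * se_beta
upper = beta_hat + t_star * se_beta
print(lower, upper)
```

With 48 degrees of freedom the critical value is close to 2, which is the origin of the common rule of thumb "estimate plus or minus two standard errors".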
Footnotes
1. Proof not important here.