Linear Regression and Tire Correlation


AMS 572 Presentation

CH 10 Simple Linear Regression


Introduction
Example:

David Beckham: 1.83m   Brad Pitt: 1.83m   George Bush: 1.81m
Victoria Beckham: 1.68m   Angelina Jolie: 1.70m   Laura Bush: ?

● Goal: to predict the height of the wife in a couple, based on the husband's height.

Response (outcome or dependent) variable (Y): height of the wife
Predictor (explanatory or independent) variable (X): height of the husband
Regression analysis:
● Regression analysis is a statistical methodology to estimate the relationship of a response variable to a set of predictor variables.
● When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
● When it is not clear which variable represents a response and which a predictor, correlation analysis is used to study the strength of the relationship.

History:
● The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805 and by Gauss in 1809.
● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon.
● This work was extended by Karl Pearson and Udny Yule to a more general statistical context around the turn of the 20th century.


A Probabilistic Model

Specific settings of the predictor variable: $x_1, x_2, \ldots, x_n$
Corresponding values of the response variable: $y_1, y_2, \ldots, y_n$

ASSUME: $y_i$ is the observed value of a random variable $Y_i$ that depends on $x_i$:

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad (i = 1, 2, \ldots, n) \tag{10.1}$$

where $\epsilon_i$ is a random error with $E(\epsilon_i) = 0$ and $\operatorname{Var}(\epsilon_i) = \sigma^2$, so that

$$E(Y_i) = \mu_i = \beta_0 + \beta_1 x_i \tag{10.2}$$

is the unknown mean of $Y_i$.

Here $\mu = \beta_0 + \beta_1 x$ is the true regression line, with unknown intercept $\beta_0$ and unknown slope $\beta_1$.

4 BASIC ASSUMPTIONS
1. $Y_i$ is a linear function of $x_i$.
2. The errors have a common variance $\sigma^2$, the same for all values of $x$.
3. The $\epsilon_i$ are normally distributed.
4. The $\epsilon_i$ are independent.
Comments:
1. The model is "linear" not because of $x$: it is linear in the parameters $\beta_0$ and $\beta_1$.
Example: $E(Y) = \beta_0 + \beta_1 \log x$ is linear, with $x' = \log x$ as the predictor.

2. The predictor variable need not be set at predetermined fixed values; it can be random along with $Y$.

Example: height and weight of children.
Height ($X$) - given
Weight ($Y$) - to be predicted

$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

is the conditional expectation of $Y$ given $X = x$.


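To make model (10.1) concrete, here is a minimal SAS sketch that simulates one data set from it; all parameter values (intercept 360, slope -7, sigma 19, nine mileage settings) are hypothetical, chosen only to echo the tire example that follows.

/* Sketch: simulate n = 9 observations from Y = b0 + b1*x + e, e ~ N(0, sigma^2) */
data sim;
  call streaminit(572);
  beta0 = 360; beta1 = -7; sigma = 19;   /* hypothetical parameter values */
  do i = 1 to 9;
    x = 4*(i - 1);                       /* fixed settings of the predictor */
    y = beta0 + beta1*x + rand('normal', 0, sigma);
    output;
  end;
  keep x y;
run;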
10.2 Fitting the Simple Linear Regression Model

10.2.1 Least Squares (LS) Fit
Example 10.1 (Tire Tread Wear vs. Mileage: Scatter Plot)

Fit a line $y = \beta_0 + \beta_1 x$ by minimizing the sum of the squared vertical deviations $y_i - (\beta_0 + \beta_1 x_i)$ $(i = 1, 2, \ldots, n)$:

$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$$

The "best" fitting straight line in the sense of minimizing $Q$ is the LS estimate.

One way to find the LS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ is via the partial derivatives

$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]$$

$$\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)]$$

Setting these partial derivatives equal to zero and simplifying, we get

$$\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

Solving these equations, we get

$$\hat{\beta}_0 = \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

To simplify, we introduce

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$$

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$

Then

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is known as the least squares line, which is an estimate of the true regression line.
Example 10.2 (Tire Tread Wear vs. Mileage: LS Line Fit)
To find the equation of the LS line for the tire tread wear data from Table 10.1, we have

$$\sum x_i = 144, \quad \sum y_i = 2197.32, \quad \sum x_i^2 = 3264, \quad \sum y_i^2 = 589{,}887.08, \quad \sum x_i y_i = 28{,}167.72$$

and $n = 9$. From these we calculate $\bar{x} = 16$, $\bar{y} = 244.15$, and

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 28{,}167.72 - \frac{1}{9}(144 \times 2197.32) = -6989.40$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 3264 - \frac{1}{9}(144)^2 = 960$$

The slope and intercept estimates are

$$\hat{\beta}_1 = \frac{-6989.40}{960} = -7.281 \quad \text{and} \quad \hat{\beta}_0 = 244.15 + 7.281 \times 16 = 360.64$$

Therefore, the equation of the LS line is

$$\hat{y} = 360.64 - 7.281x$$

Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving.

Given a particular $x = 25$, we can find

$$\hat{y} = 360.64 - 7.281 \times 25 = 178.62 \text{ mils}$$

which means the mean groove depth for all tires driven for 25,000 miles is estimated to be 178.62 mils.
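The same fit can be reproduced in SAS; the sketch below re-enters Table 10.1, where the ninth point (x = 32, y = 150.33) is inferred from the column totals above, since only eight rows are reproduced on the later slides.

data tire;
  input x y @@;
datalines;
0 394.33  4 329.50  8 291.00  12 255.17  16 229.33
20 204.83  24 179.00  28 163.83  32 150.33
;
run;

proc reg data=tire;
  model y = x;   /* should report intercept ~360.64 and slope ~-7.281 */
run;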
10.2.2 Goodness of Fit of the LS Line

Coefficient of Determination and Correlation

The fitted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the residuals

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \quad (i = 1, 2, \ldots, n)$$

are used to evaluate the goodness of fit of the LS line.

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SSE} + \underbrace{2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})}_{=\,0}$$

We define:

$$SST = SSR + SSE$$

and the ratio

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Note: total sum of squares (SST), regression sum of squares (SSR), error sum of squares (SSE).

$r^2$ is called the coefficient of determination, with $0 \le r^2 \le 1$.


Example 10.3 (Tire Tread Wear vs. Mileage: Coefficient of Determination and Correlation)

For the tire tread wear data, calculate $r^2$ and $r$ using the results from Example 10.2. We have

$$SST = S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 589{,}887.08 - \frac{1}{9}(2197.32)^2 = 53{,}418.73$$

Next calculate

$$SSR = SST - SSE = 53{,}418.73 - 2531.53 = 50{,}887.20$$

Therefore

$$r^2 = \frac{50{,}887.20}{53{,}418.73} = 0.953 \quad \text{and} \quad r = -\sqrt{0.953} = -0.976$$

where the sign of $r$ follows from the sign of $\hat{\beta}_1 = -7.281$. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.
10.2.3 Estimation of $\sigma^2$

An unbiased estimate of $\sigma^2$ is given by

$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2}$$

Example 10.4 (Tire Tread Wear vs. Mileage: Estimate of $\sigma^2$)

Find the estimate of $\sigma^2$ for the tread wear data using the results from Example 10.3. We have $SSE = 2531.53$ and $n - 2 = 7$, therefore

$$s^2 = \frac{2531.53}{7} = 361.65$$

which has 7 d.f. The estimate of $\sigma$ is $s = \sqrt{361.65} = 19.02$ mils.
10.3 Statistical Inference on $\beta_0$ and $\beta_1$

Point estimators: $\hat{\beta}_0$, $\hat{\beta}_1$

Sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\hat{\beta}_0 \sim N\!\left(\beta_0,\ \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad SE(\hat{\beta}_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}$$

$$\hat{\beta}_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right), \qquad SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}$$

For mathematical derivations, please refer to the textbook, p. 331.
Statistical Inference on $\beta_0$ and $\beta_1$, Con't

Pivotal quantities (P.Q.'s):

$$\frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}, \qquad \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$$

$100(1-\alpha)\%$ CI's:

$$\hat{\beta}_0 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_0), \qquad \hat{\beta}_1 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_1)$$
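For instance, for the tire data (a worked check using the numbers from Examples 10.2 and 10.4, with $t_{7,.025} = 2.365$):

$$SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}} = \frac{19.02}{\sqrt{960}} \approx 0.614, \qquad \hat{\beta}_1 \pm t_{7,.025}\, SE(\hat{\beta}_1) = -7.281 \pm 2.365 \times 0.614 \approx [-8.73,\ -5.83]$$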
Statistical Inference on $\beta_0$ and $\beta_1$, Con't

Hypothesis tests:

$$H_0: \beta_1 = \beta_1^0 \ \text{vs.}\ H_a: \beta_1 \ne \beta_1^0 \qquad \text{or} \qquad H_0: \beta_1 = 0 \ \text{vs.}\ H_a: \beta_1 \ne 0$$

Test statistics:

$$t_0 = \frac{\hat{\beta}_1 - \beta_1^0}{SE(\hat{\beta}_1)} \qquad \text{or} \qquad t_0 = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

At the significance level $\alpha$, we reject $H_0$ in favor of $H_a$ iff $|t_0| \ge t_{n-2,\alpha/2}$.

The second test can be used to show whether there is a linear relationship between $x$ and $y$.
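For the tire data (using $SE(\hat{\beta}_1) \approx 0.614$ from the CI computation above):

$$t_0 = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{-7.281}{0.614} \approx -11.86, \qquad |t_0| = 11.86 > t_{7,.025} = 2.365$$

so $H_0: \beta_1 = 0$ is rejected: mileage has a statistically significant linear effect on tread wear. Note that $t_0^2 \approx 140.7$, which matches (up to rounding) the F statistic in the ANOVA table below.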
Analysis of Variance (ANOVA), Con't

Mean Square: a sum of squares divided by its d.f.

$$MSR = \frac{SSR}{1}, \qquad MSE = \frac{SSE}{n-2}$$

$$\frac{MSR}{MSE} = \frac{SSR}{s^2} = \frac{\hat{\beta}_1^2 S_{xx}}{s^2} = \left(\frac{\hat{\beta}_1}{s/\sqrt{S_{xx}}}\right)^2 = \left(\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\right)^2 = t^2 \ \overset{H_0}{\sim}\ F_{1,\,n-2}$$
Analysis of Variance (ANOVA)

ANOVA Table:

Source of Variation   SS    d.f.   MS                F
Regression            SSR   1      MSR = SSR/1       F = MSR/MSE
Error                 SSE   n-2    MSE = SSE/(n-2)
Total                 SST   n-1

Example:

Source       SS          d.f.   MS          F
Regression   50,887.20   1      50,887.20   140.71
Error         2,531.53   7         361.65
Total        53,418.73   8
10.4 Regression Diagnostics

10.4.1 Checking for Model Assumptions

● Checking for Linearity
● Checking for Constant Variance
● Checking for Normality
● Checking for Independence
Checking for Linearity

$x_i$ = Mileage, $y_i$ = Groove Depth, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ = fitted value, $e_i = y_i - \hat{y}_i$ = residual.

 i    xi    yi        yi-hat    ei
 1     0    394.33    360.64     33.69
 2     4    329.50    331.51     -2.01
 3     8    291.00    302.39    -11.39
 4    12    255.17    273.27    -18.10
 5    16    229.33    244.15    -14.82
 6    20    204.83    215.02    -10.19
 7    24    179.00    185.90     -6.90
 8    28    163.83    156.78      7.05
 9    32    150.33    127.65     22.68

(Row 9 is inferred from the column totals in Example 10.2.)

[Scatterplot of $e_i$ vs. $x_i$]
Checking for Normality

[Normal probability plot of the residuals: Mean = 3.95E-16, StDev = 17.79, N = 9, AD = 0.514, P-Value = 0.138]
Checking for Constant Variance

[Sample residual plots: one where Var(Y) is not constant, one where Var(Y) is constant.]
Checking for Independence

● Does not apply for the Simple Linear Regression Model
● Only applies for time series data
10.4.2 Checking for Outliers & Influential Observations

● What is an OUTLIER?
● Why checking for outliers is important
● Mathematical definition
● How to deal with them

10.4.2-A. Intro

Recall the box-and-whiskers plot (Chapter 4):
● A (mild) OUTLIER is any observation that lies outside Q1 - (1.5*IQR) and Q3 + (1.5*IQR) (interquartile range, IQR = Q3 - Q1).
● An (extreme) OUTLIER lies outside Q1 - (3*IQR) and Q3 + (3*IQR).
● Informally: an observation "far away" from the rest of the data.
10.4.2-B. Why Are Outliers a Problem?

● They may indicate a sample peculiarity, a data entry error, or another problem;
● Regression coefficients estimated by minimizing the Sum of Squares for Error (SSE) are very sensitive to outliers >> bias or distortion of estimates;
● Any statistical test based on sample means and variances can be distorted in the presence of outliers >> distortion of p-values;
● Faulty conclusions.

(Estimators not sensitive to outliers are said to be robust.)

Example:

                  Sorted Data    Median   Mean   Variance   95% CI for mean
Real data         1 3 5 9 12     5         6.0      20.6    [0.45, 11.55]
Data with error   1 3 5 9 120    5        27.6    2676.8    [-36.63, 91.83]
10.4.2-C. Mathematical Definition

● Outlier
The standardized residual is given by

$$e_i^* = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

where $h_{ii}$ is the leverage of the $i$th observation (see the next slide). If $|e_i^*| > 2$, then the corresponding observation may be regarded as an outlier.

Example: (Tire Tread Wear vs. Mileage)

 i     1      2      3      4      5      6      7      8     9
 ei*   2.25  -0.12  -0.66  -1.02  -0.83  -0.57  -0.40   0.43  1.51

● STUDENTIZED RESIDUAL: a type of standardized residual calculated with the current observation deleted from the analysis.
● The LS fit can be excessively influenced by an observation that is not necessarily an outlier as defined above.
10.4.2-C. Mathematical Definition

● Influential Observation
An observation with an extreme x-value, y-value, or both.

● On average $h_{ii}$ is $(k+1)/n$, where $k$ is the number of predictors; regard any $h_{ii} > 2(k+1)/n$ as high leverage;
● If $x_i$ deviates greatly from the mean $\bar{x}$, then $h_{ii}$ is large (see the formula sketched below);
● The standardized residual will be large for a high-leverage observation;
● Influence can be thought of as the product of leverage and outlierness.

Example: (An observation can be influential/high leverage, but not an outlier.)
[eg. 1: scatter plots with and without the observation; eg. 2: scatter plot and residual plot]


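For simple linear regression ($k = 1$), the leverage has the standard closed form (consistent with the bullets above):

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$

For the tire data, the most extreme mileages $x = 0$ and $x = 32$ give $h_{ii} = 1/9 + (16)^2/960 \approx 0.378$, below the cutoff $2(k+1)/n = 4/9 \approx 0.444$, so no point is flagged as high leverage.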
10.4.2-C. SAS Code for the Examples

SAS code
proc reg data=tire;
  model y=x;
  /* save studentized residuals, leverages, Cook's D, and DFFITS */
  output out=resid rstudent=r h=lev cookd=cd dffits=dffit;
proc print data=resid;
  /* flag observations exceeding the cutoffs used here for n=9, k=1 */
  where abs(r)>=2 or lev>(4/9) or cd>(4/9) or abs(dffit)>(2*sqrt(1/9));
run;

[SAS output]
10.4.2-D. How to Deal with Outliers & Influential Observations

● Investigate (Data errors? Rare events? Can they be corrected?)
● Ways to accommodate outliers:
  ● Nonparametric methods (robust to outliers)
  ● Data transformations
  ● Deletion (or report model results both with and without the outliers or influential observations, to see how much they change)
10.4.3 Data Transformations

Reasons:
● To achieve linearity
● To achieve homogeneity of variance
● To achieve normality or symmetry about the regression equation

Types of transformation:
● Linearizing transformation: a transformation of the response variable, the predictor variable, or both, which produces an approximately linear relationship between the variables.
● Variance-stabilizing transformation: a transformation applied when the constant variance assumption is violated.
Method of Linearizing Transformation

● Use a mathematical operation, e.g. square root, power, log, exponential, etc.
● Only one variable needs to be transformed in simple linear regression. Which one, predictor or response? Why?

e.g. For the exponential model $Y = \alpha e^{-\beta x}$, we take logs:

$$Y = \alpha e^{-\beta x} \iff \log Y = \log \alpha - \beta x$$

[Plot of residuals $e_i$ vs. $x_i$: original fit and exponential fit]

 xi    yi        fitted log(yi)   exp(fitted)   ei
  0    394.33    5.926            374.64        19.69
  4    329.50    5.807            332.58        -3.08
  8    291.00    5.688            295.24        -4.24
 12    255.17    5.569            262.09        -6.92
 16    229.33    5.450            232.67        -3.34
 20    204.83    5.331            206.54        -1.71
 24    179.00    5.211            183.36        -4.36
 28    163.83    5.092            162.77         1.06

[Normal probability plot of $e_i$ (original) and $e_i$ (with transformation):
 original:     Mean = 3.95E-16, StDev = 17.79, N = 9, AD = 0.514, P = 0.138
 transformed:  Mean = 0.3256,   StDev = 8.142, N = 9, AD = 0.912, P = 0.011]
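A sketch of how this transformed fit could be carried out in SAS, reusing the `tire` data set entered earlier (variable and data set names are illustrative):

data tire_log;
  set tire;
  logy = log(y);                 /* natural log of the response */
run;

proc reg data=tire_log;
  model logy = x;                /* fits log Y = log(alpha) - beta*x */
  output out=fit_log p=logyhat;  /* fitted values on the log scale */
run;

data fit_orig;
  set fit_log;
  yhat = exp(logyhat);           /* back-transform fits to the original scale */
  e    = y - yhat;               /* residuals, as tabulated above */
run;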
Method of Variance-Stabilizing Transformation

Delta method: a two-term Taylor-series approximation gives

$$\operatorname{Var}(h(Y)) \approx [h'(\mu)]^2\, g^2(\mu), \quad \text{where } \operatorname{Var}(Y) = g^2(\mu),\ E(Y) = \mu$$

1. Set $[h'(\mu)]^2\, g^2(\mu) \equiv 1$

2. $h'(\mu) = \dfrac{1}{g(\mu)}$

3. $h(\mu) = \displaystyle\int \frac{d\mu}{g(\mu)}$, i.e. $h(y) = \displaystyle\int \frac{dy}{g(y)}$

e.g. $\operatorname{Var}(Y) = c^2\mu^2$, where $c > 0$: $g(\mu) = c\mu \leftrightarrow g(y) = cy$

$$h(y) = \int \frac{dy}{cy} = \frac{1}{c}\int \frac{dy}{y} = \frac{1}{c}\log(y)$$

Therefore it is the logarithmic transformation.
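As a second instance of the same recipe (a standard case, not from the slides): if $\operatorname{Var}(Y) = c^2\mu$, as for count data whose variance is proportional to the mean, then $g(y) = c\sqrt{y}$ and

$$h(y) = \int \frac{dy}{c\sqrt{y}} = \frac{2}{c}\sqrt{y}$$

so the square-root transformation stabilizes the variance.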
Correlation Analysis

● Correlation: a measurement of how closely two variables share a linear relationship.

$$\rho = \operatorname{corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$

● Useful when it is not possible to determine which variable is the predictor and which is the response.
● Health vs. wealth: which is the predictor? Which is the response?
Statistical Inference on the Correlation Coefficient ρ

● We can derive a test on the correlation coefficient in the same way that we have been doing in class.
● Assumptions: $X, Y$ are from the bivariate normal distribution.
● Start with the point estimator $R$, the sample estimate of the population correlation coefficient ρ:

$$R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

● The distribution of $R$ is quite complicated, so we transform the point estimator into a pivotal quantity:

$$T = \frac{R\sqrt{n-2}}{\sqrt{1 - R^2}}$$
Bivariate Normal Distribution

● pdf:
$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2} \right] \right\}$$

● Properties:
  ● $\mu_1, \mu_2$: means of $X, Y$
  ● $\sigma_1^2, \sigma_2^2$: variances of $X, Y$
  ● $\rho$: the correlation coefficient between $X$ and $Y$
Derivation of T

Are these equivalent?

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \ \overset{?}{=}\ \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

Substitute:

$$r = \hat{\beta}_1 \frac{s_x}{s_y} = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{SST}}$$

$$1 - r^2 = \frac{SSE}{SST} = \frac{(n-2)s^2}{SST}$$

Then:

$$t = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{SST}} \cdot \sqrt{\frac{(n-2)\,SST}{(n-2)s^2}} = \frac{\hat{\beta}_1}{s/\sqrt{S_{xx}}} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

Yes, they are equivalent. Therefore, we can use $t$ as a statistic for testing against the null hypothesis $H_0: \beta_1 = 0$; equivalently, we can test against $H_0: \rho = 0$.
Exact Statistical Inference on ρ

● Test: $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$

● Test statistic:
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

● Reject $H_0$ if $|t_0| > t_{n-2,\alpha/2}$

● Example (from textbook): A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

$H_0: \rho = 0$ vs. $H_a: \rho \ne 0$

$$t_0 = \frac{0.7\sqrt{15-2}}{\sqrt{1-0.7^2}} = 3.534$$

For $\alpha = .01$: $3.534 = t_0 > t_{13,.005} = 3.012$ ▲ Reject $H_0$.
Approximate Statistical Inference on ρ

● There is no exact method of testing ρ vs. an arbitrary ρ0:
  ● the distribution of R is very complicated;
  ● T ~ $t_{n-2}$ only when ρ = 0.
● To test ρ vs. an arbitrary ρ0, use Fisher's normal approximation:

$$\tanh^{-1} R = \frac{1}{2}\ln\left(\frac{1+R}{1-R}\right) \ \approx\ N\left(\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right),\ \frac{1}{n-3}\right)$$

● Transform the sample estimate:

$$\hat{\psi} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), \qquad \text{under } H_0:\ \hat{\psi} \approx N\left(\frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right),\ \frac{1}{n-3}\right)$$
Approximate Statistical Inference on ρ

● Test: $H_0: \rho = \rho_0$ vs. $H_1: \rho \ne \rho_0$, equivalently $H_0: \psi = \psi_0 = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ vs. $H_1: \psi \ne \psi_0$

● Sample estimate: $\hat{\psi} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$

● Z statistic:
$$z_0 = \sqrt{n-3}\,(\hat{\psi} - \psi_0); \qquad \text{reject } H_0 \text{ if } |z_0| > z_{\alpha/2}$$

● CI for ψ:
$$\hat{\psi} - \frac{z_{\alpha/2}}{\sqrt{n-3}} \le \psi \le \hat{\psi} + \frac{z_{\alpha/2}}{\sqrt{n-3}}$$

which transforms back to a CI for ρ with limits $\psi_l$ and $\psi_u$:

$$\frac{e^{2\psi_l} - 1}{e^{2\psi_l} + 1} \le \rho \le \frac{e^{2\psi_u} - 1}{e^{2\psi_u} + 1}$$
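Applied to the textbook example above ($r = 0.7$, $n = 15$), a 95% CI works out as follows:

$$\hat{\psi} = \frac{1}{2}\ln\frac{1.7}{0.3} = 0.867, \qquad \psi \in 0.867 \pm \frac{1.96}{\sqrt{12}} = [0.302,\ 1.433]$$

$$\rho \in \left[\frac{e^{2(0.302)} - 1}{e^{2(0.302)} + 1},\ \frac{e^{2(1.433)} - 1}{e^{2(1.433)} + 1}\right] \approx [0.29,\ 0.89]$$

The interval excludes 0, agreeing with the exact test's rejection of $H_0: \rho = 0$.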
Approximate Statistical Inference on ρ Using SAS
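The slide's original code and output are images; below is a minimal sketch of equivalent SAS code, assuming a data set `scores` with the two test-score variables `test1` and `test2`. PROC CORR's FISHER option requests confidence limits and tests based on Fisher's z transformation.

/* Sketch only: `scores`, `test1`, `test2` are assumed names. */
proc corr data=scores fisher(alpha=0.01 rho0=0);
  var test1 test2;
run;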
Pitfalls of Regression and Correlation Analysis

● Correlation and causation
  ● Ticks cause good health
● Coincidental data
  ● Sun spots and Republicans
● Lurking variables
  ● Church, suicide, population
● Restricted range
  ● Local vs. global linearity
Summary

Probabilistic model for linear regression:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
with model assumptions: linearity, constant variance, normality, independence.

Least Squares (LS) Fit: minimize
$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$$
to obtain the LS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.

Goodness of fit: the sample correlation coefficient $r$ and the coefficient of determination
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Statistical inference on $\beta_0$ and $\beta_1$:
$$\hat{\beta}_0 \sim N\!\left(\beta_0,\ \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad \hat{\beta}_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right)$$

Confidence intervals:
$$\hat{\beta}_0 \text{ or } \hat{\beta}_1\ \pm\ t_{n-2,\alpha/2}\, SE(\hat{\beta}_0 \text{ or } \hat{\beta}_1)$$

Prediction interval for $Y^*$ at $x = x^*$:
$$\hat{Y}^* \pm t_{n-2,\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

Regression diagnostics: outliers? influential observations? data transformations?

Correlation analysis:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad \hat{\psi} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$$
Thank you! Any questions?
