Linear Regression and Correlation
● When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
● When it is not clear which variable is the response and which is the predictor, correlation analysis is used to study the strength of the relationship.
History:
● The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805 and by Gauss in 1809.
● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon.
● This work was extended by Karl Pearson and Udny Yule to a more general statistical context.
Values of the predictor variable: $x_1, x_2, \ldots, x_n$
Corresponding values of the response variable: $y_1, y_2, \ldots, y_n$
ASSUME:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad (i = 1, 2, \ldots, n) \tag{10.1}$$
$\epsilon_i$ - random error with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$; the $\epsilon_i$ are normally distributed and independent.
Comments:
1. The model is called linear not because of $x$: it is linear in the parameters $\beta_0$ and $\beta_1$.
Example: $E(Y \mid X = x) = \beta_0 + \beta_1 x$
One way to find the LS estimates $\hat\beta_0$ and $\hat\beta_1$ is to minimize
$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$$
Setting the partial derivatives to zero:
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)] = 0$$
$$\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)] = 0$$
This gives the normal equations:
$$\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
Solve the equations and we get
$$\hat\beta_0 = \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
$$\hat\beta_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
To simplify, we introduce
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$$
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar x)^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$
$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
Then
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
The equation $\hat y = \hat\beta_0 + \hat\beta_1 x$ is known as the least squares line, which is an estimate of the true regression line.
Example 10.2 (Tire Tread Wear vs. Mileage: LS Line Fit)
Find the equation of the LS line for the tire tread wear data from Table 10.1. With $n = 9$, we have $\bar x = 16$, $\bar y = 244.15$, $S_{xx} = 960$, and $S_{xy} = -6989.40$.
The slope and intercept estimates are
$$\hat\beta_1 = \frac{-6989.40}{960} = -7.281 \quad \text{and} \quad \hat\beta_0 = 244.15 + 7.281 \times 16 = 360.64$$
Therefore, the equation of the LS line is
$$\hat y = 360.64 - 7.281x$$
Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving.
Given a particular $x^* = 25$, we can find
$$\hat y = 360.64 - 7.281 \times 25 = 178.62 \text{ mils}$$
which means the mean groove depth for all tires driven for 25,000 miles is estimated to be 178.62 mils.
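As a numerical cross-check, here is a minimal Python/NumPy sketch of the LS fit. The data values are the Table 10.1 mileage and groove-depth pairs as commonly reproduced for this example; they are an assumption here, but they match the quoted summaries ($\bar x = 16$, $S_{xx} = 960$, $S_{xy} = -6989.40$).

```python
import numpy as np

# Tire tread wear data (assumed, consistent with Table 10.1 summaries):
# x = mileage in 1000s of miles, y = groove depth in mils.
x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

Sxx = np.sum((x - x.mean()) ** 2)                 # 960
Sxy = np.sum((x - x.mean()) * (y - y.mean()))     # -6989.40

b1 = Sxy / Sxx                   # slope: about -7.281
b0 = y.mean() - b1 * x.mean()    # intercept: about 360.64

print(f"LS line: y_hat = {b0:.2f} + ({b1:.3f}) x")
print(f"Predicted depth at x = 25: {b0 + b1 * 25:.2f} mils")
```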
10.2.2 Goodness of Fit of the LS Line
Coefficient of Determination and Correlation
The fitted values: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i \quad (i = 1, 2, \ldots, n)$
The residuals: $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i) \quad (i = 1, 2, \ldots, n)$
We define:
$$SST = \sum_{i=1}^{n}(y_i - \bar y)^2, \quad SSR = \sum_{i=1}^{n}(\hat y_i - \bar y)^2, \quad SSE = \sum_{i=1}^{n}(y_i - \hat y_i)^2$$
with $SST = SSR + SSE$, and the coefficient of determination $r^2 = SSR/SST$.
For the tire data ($n = 9$), $SST = 53{,}418.73$ and $SSE = 2531.53$. Next calculate $SSR = SST - SSE = 53{,}418.73 - 2531.53 = 50{,}887.20$. Therefore
$$r^2 = \frac{50{,}887.20}{53{,}418.73} = 0.953 \quad \text{and} \quad r = -\sqrt{0.953} = -0.976$$
where the sign of $r$ follows from the sign of $\hat\beta_1 = -7.281$. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.
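A sketch of the same computation in code, under the same illustrative data assumption as above; it recovers SST, SSE, SSR, and $r^2$ from the fitted line.

```python
import numpy as np

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

b1, b0 = np.polyfit(x, y, 1)          # LS slope and intercept
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)     # about 53,418.73
SSE = np.sum((y - y_hat) ** 2)        # about 2,531.53
SSR = SST - SSE                       # about 50,887.20

r2 = SSR / SST                        # about 0.953
r = np.sign(b1) * np.sqrt(r2)         # about -0.976
print(f"r^2 = {r2:.3f}, r = {r:.3f}")
```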
10.2.3 Estimation of $\sigma^2$
An unbiased estimate of $\sigma^2$ is given by
$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2}$$
Example 10.4 (Tire Tread Wear vs. Mileage: Estimate of $\sigma^2$)
Find the estimate of $\sigma^2$ for the tread wear data using the results from Example 10.3. We have $SSE = 2531.53$ and $n - 2 = 7$; therefore
$$s^2 = \frac{2531.53}{7} = 361.65$$
which has 7 d.f. The estimate of $\sigma$ is $s = \sqrt{361.65} = 19.02$ mils.
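A short sketch, under the same data assumption as before, that reproduces $s^2$ and $s$:

```python
import numpy as np

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

b1, b0 = np.polyfit(x, y, 1)
SSE = np.sum((y - (b0 + b1 * x)) ** 2)

n = len(x)
s2 = SSE / (n - 2)        # about 361.65, with n - 2 = 7 d.f.
s = np.sqrt(s2)           # about 19.02 mils
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```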
Statistical Inference on $\beta_0$ and $\beta_1$
Point estimators: $\hat\beta_0$, $\hat\beta_1$
Sampling distributions of $\hat\beta_0$ and $\hat\beta_1$:
$$\hat\beta_0 \sim N\left(\beta_0,\; \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad SE(\hat\beta_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}$$
$$\hat\beta_1 \sim N\left(\beta_1,\; \frac{\sigma^2}{S_{xx}}\right), \qquad SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$$
For mathematical derivations, please refer to the textbook, p. 331.
Statistical Inference on $\beta_0$ and $\beta_1$ (cont'd)
P.Q.'s:
$$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}, \qquad \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$$
CI's:
$$\hat\beta_0 \pm t_{n-2,\,\alpha/2}\, SE(\hat\beta_0), \qquad \hat\beta_1 \pm t_{n-2,\,\alpha/2}\, SE(\hat\beta_1)$$
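Under the same data assumption, the standard errors and two-sided 95% confidence intervals can be sketched with SciPy's t quantiles:

```python
import numpy as np
from scipy import stats

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
Sxx = np.sum((x - x.mean()) ** 2)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

se_b1 = s / np.sqrt(Sxx)                          # SE of the slope
se_b0 = s * np.sqrt(np.sum(x ** 2) / (n * Sxx))   # SE of the intercept

t_crit = stats.t.ppf(0.975, df=n - 2)             # t_{n-2, .025}
print(f"beta1: {b1:.3f} +/- {t_crit * se_b1:.3f}")
print(f"beta0: {b0:.2f} +/- {t_crit * se_b0:.2f}")
```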
Statistical Inference on $\beta_0$ and $\beta_1$ (cont'd)
Hypothesis test:
$$H_0: \beta_1 = \beta_1^0 \;\text{ vs. }\; H_a: \beta_1 \ne \beta_1^0 \qquad \text{(in particular, } H_0: \beta_1 = 0 \text{ vs. } H_a: \beta_1 \ne 0\text{)}$$
-- Test statistic:
$$t_0 = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)} \qquad \left(t_0 = \frac{\hat\beta_1}{SE(\hat\beta_1)} \text{ for } \beta_1^0 = 0\right)$$
-- At the significance level $\alpha$, we reject $H_0$ in favor of $H_a$ iff $|t_0| \ge t_{n-2,\,\alpha/2}$.
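A sketch of the t-test for $H_0: \beta_1 = 0$ on the same assumed data; the two-sided p-value uses the $t_{n-2}$ distribution:

```python
import numpy as np
from scipy import stats

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
Sxx = np.sum((x - x.mean()) ** 2)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t0 = b1 / (s / np.sqrt(Sxx))                 # test H0: beta1 = 0
p = 2 * stats.t.sf(abs(t0), df=n - 2)        # two-sided p-value
print(f"t0 = {t0:.2f}, p-value = {p:.3g}")   # |t0| about 11.86: reject H0
```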
Mean Square:
-- a sum of squares divided by its d.f.
$$MSR = \frac{SSR}{1}, \qquad MSE = \frac{SSE}{n-2}$$
Relation to the t statistic:
$$F = \frac{MSR}{MSE} = \frac{SSR}{s^2} = \frac{\hat\beta_1^2 S_{xx}}{s^2} = \frac{\hat\beta_1^2}{s^2/S_{xx}} = \left[\frac{\hat\beta_1}{SE(\hat\beta_1)}\right]^2 = t^2 \;\overset{H_0}{\sim}\; F_{1,\,n-2}$$
Analysis of Variance (ANOVA)
ANOVA Table

Source of     Sum of        Degrees of       Mean                F
Variation     Squares (SS)  Freedom (d.f.)   Square (MS)
Regression    SSR           1                MSR = SSR/1         F = MSR/MSE
Error         SSE           n - 2            MSE = SSE/(n - 2)
Total         SST           n - 1
Example:

Source        SS           d.f.   MS           F
Regression    50,887.20    1      50,887.20    140.71
Error         2,531.53     7      361.65
Total         53,418.73    8
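The ANOVA quantities above can be reproduced in code (same data assumption); the F statistic should come out near 140.71:

```python
import numpy as np
from scipy import stats

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = SST - SSE

MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE                                  # about 140.71
p = stats.f.sf(F, 1, n - 2)                    # upper-tail F probability
print(f"F = {F:.2f} on (1, {n - 2}) d.f., p = {p:.3g}")
```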
10.4 Regression Diagnostics
[Figure: plot of the residuals $e_i$ against $x_i$ for the tire data]
Checking for Normality
[Figure: normal probability plot of the residuals (Minitab). Mean ≈ 0, StDev = 17.79, N = 9, AD = 0.514, p-value = 0.138]
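A sketch of the normality check in SciPy, assuming the same tire data. Note that scipy.stats.anderson reports the raw $A^2$ statistic, so it may differ slightly from Minitab's small-sample-adjusted AD value quoted above:

```python
import numpy as np
from scipy import stats

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Anderson-Darling test of normality on the residuals.
ad = stats.anderson(resid, dist='norm')
print(f"A^2 = {ad.statistic:.3f}")
print(f"critical values: {ad.critical_values}")

# A normal probability plot can be drawn with stats.probplot(resid, plot=...).
```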
Checking for Constant Variance
What is an OUTLIER?
Why checking for outliers is important
Mathematical definition
How to deal with them
10.4.2-A. Intro
Recall the box-and-whiskers plot (Chapter 4), where a (mild) OUTLIER is defined as any observation that lies outside of Q1 - (1.5*IQR) and Q3 + (1.5*IQR) (interquartile range, IQR = Q3 - Q1), and an (extreme) OUTLIER as one that lies outside of Q1 - (3*IQR) and Q3 + (3*IQR).
In short: an observation "far away" from the rest of the data.
10.4.2-B. Why are outliers a problem?
May indicate a sample peculiarity, a data entry error, or some other problem;
Regression coefficients estimated by minimizing the Sum of Squares for Error (SSE) are very sensitive to outliers >> bias or distortion of estimates;
Any statistical test based on sample means and variances can be distorted in the presence of outliers >> distortion of p-values;
Faulty conclusions.
Example:
(Estimators not sensitive to outliers are said to be robust.)

                  Sorted Data     Median   Mean   Variance   95% CI for mean
Real data         1 3 5 9 12      5        6.0    20.6       [0.45, 11.55]
Data with error   1 3 5 9 120     5        27.6   2676.8     [-36.63, 91.83]
10.4.2-C. Mathematical Definition
Outlier
The standardized residual is given by
$$e_i^* = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \qquad h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{S_{xx}}$$
and an observation is typically flagged as an outlier when $|e_i^*|$ is large (say, greater than 2). For the tire data:

i      1      2      3      4      5      6      7      8      9
e_i*   2.25   -0.12  -0.66  -1.02  -0.83  -0.57  -0.40  0.43   1.51
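A sketch computing the standardized residuals with the leverage formula above (same data assumption); the first value should come out near 2.25:

```python
import numpy as np

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))

Sxx = np.sum((x - x.mean()) ** 2)
h = 1 / n + (x - x.mean()) ** 2 / Sxx        # leverages h_ii
e_star = resid / (s * np.sqrt(1 - h))        # standardized residuals

print(np.round(e_star, 2))   # first entry about 2.25: flagged, |e*| > 2
```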
Influential Observation
An observation with an extreme x-value, y-value, or both.
[SAS regression output shown on the original slide]
10.4.2-D. How to Deal with Outliers & Influential Observations
Transform the data:
To achieve linearity
To achieve homogeneity of variance
To achieve normality or symmetry about the regression equation
Types of Transformation
Linearizing transformation: a transformation of the response variable, the predictor variable, or both, which produces an approximately linear relationship between the variables.
Method of Variance Stabilizing Transformation
If the standard deviation of $Y$ is a function $g(\mu)$ of its mean, the transformation $h$ that stabilizes the variance satisfies
$$h(y) = \int \frac{dy}{g(y)}$$
When the standard deviation is proportional to the mean, $g(y) = cy$, so
$$h(y) = \int \frac{dy}{cy} = \frac{1}{c}\log(y)$$
Therefore it is the logarithmic transformation.
Correlation Analysis
Correlation: a measurement of how closely two variables share a linear relationship.
$$\mathrm{corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
The sample correlation coefficient:
$$R = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar X)^2 \sum_{i=1}^{n}(Y_i - \bar Y)^2}}$$
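The sample correlation can be computed directly from this definition; the sketch below (same assumed tire data) checks it against numpy.corrcoef:

```python
import numpy as np

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(f"r = {r:.3f}")                  # about -0.976
print(np.corrcoef(x, y)[0, 1])         # same value via NumPy
```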
Properties
Notation: $\mu_1, \mu_2$ are the means of X and Y; $\sigma_1^2, \sigma_2^2$ are the variances of X and Y; $\rho$ is the correlation coefficient between X and Y.
Derivation of T
Are these equivalent?
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \;\overset{?}{=}\; \frac{\hat\beta_1}{SE(\hat\beta_1)}$$
Substitute:
$$r = \hat\beta_1 \frac{s_x}{s_y} = \hat\beta_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \hat\beta_1 \sqrt{\frac{S_{xx}}{SST}}, \qquad 1 - r^2 = \frac{SSE}{SST} = \frac{(n-2)s^2}{SST}$$
then:
$$t = \hat\beta_1 \sqrt{\frac{S_{xx}}{SST}}\,\sqrt{\frac{(n-2)\,SST}{(n-2)\,s^2}} = \frac{\hat\beta_1}{s/\sqrt{S_{xx}}} = \frac{\hat\beta_1}{SE(\hat\beta_1)}$$
Yes, they are equivalent. Therefore, we can use $t$ as a statistic for testing against the null hypothesis $H_0: \beta_1 = 0$; equivalently, we can test against $H_0: \rho = 0$.
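A quick numerical confirmation of the equivalence, under the same data assumption; both expressions give the same t value:

```python
import numpy as np

x = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
y = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
              204.83, 179.00, 163.83, 150.33])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]

t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
t_from_b1 = b1 / (s / np.sqrt(Sxx))
print(t_from_r, t_from_b1)   # identical, about -11.86
```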
Exact Statistical Inference on ρ
Test:
$$H_0: \rho = 0 \quad \text{vs.} \quad H_a: \rho \ne 0$$
Test statistic:
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Example (from textbook): A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?
With $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$:
$$t_0 = \frac{0.7\sqrt{15-2}}{\sqrt{1-0.7^2}} = 3.534$$
Reject $H_0$ if $|t_0| > t_{n-2,\,\alpha/2}$. For $\alpha = .01$: $3.534 = t_0 > t_{13,\,.005} = 3.012$, so we reject $H_0$.
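A minimal sketch of this exact test in Python with SciPy, using the example's numbers (r = 0.7, n = 15, α = .01):

```python
import numpy as np
from scipy import stats

r, n, alpha = 0.7, 15, 0.01

t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # about 3.534
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{13, .005}, about 3.012

print(f"t0 = {t0:.3f}, critical value = {t_crit:.3f}")
print("Reject H0" if abs(t0) > t_crit else "Fail to reject H0")
```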
Approximate Statistical Inference on ρ
There is no exact method of testing ρ against an arbitrary ρ0:
the distribution of R is very complicated, and
T ~ t_{n-2} only when ρ = 0.
Transform the sample estimate:
$$\hat\psi = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), \quad \text{under } H_0:\; \hat\psi \approx N\left(\psi_0, \frac{1}{n-3}\right) \text{ where } \psi_0 = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$$
0
Approximate Statistical Inference on ρ
H0 : 0 vs. H1 : 0
Test :
1 1 0
H0 : 0 ln vs. H1 : 0
2 1 - 0
^ 1 1 r
Sample estimate: ln
2 1- r
^
z 0 n - 3 - 0
Z statistic:
reject H0 if |z0| > zα/2
^ 1 ^ 1
- za / 2 za / 2
CI: n-3 n-3
e 2l - 1 e 2u - 1
2u
e 2l 1 e 1
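A sketch of the Fisher-transform test and CI with SciPy; here $\rho_0 = 0$ and $\alpha = 0.05$ are illustrative choices, and the back-transform uses the identity $(e^{2l}-1)/(e^{2l}+1) = \tanh(l)$:

```python
import numpy as np
from scipy import stats

r, n, rho0, alpha = 0.7, 15, 0.0, 0.05   # rho0, alpha chosen for illustration

psi_hat = 0.5 * np.log((1 + r) / (1 - r))        # Fisher z of r
psi0 = 0.5 * np.log((1 + rho0) / (1 - rho0))

z0 = np.sqrt(n - 3) * (psi_hat - psi0)
print(f"z0 = {z0:.3f}")                          # reject H0 if |z0| > z_{a/2}

# CI for psi, then back-transform to a CI for rho.
zc = stats.norm.ppf(1 - alpha / 2)
l, u = psi_hat - zc / np.sqrt(n - 3), psi_hat + zc / np.sqrt(n - 3)
lo, hi = np.tanh(l), np.tanh(u)
print(f"{100 * (1 - alpha):.0f}% CI for rho: [{lo:.3f}, {hi:.3f}]")
```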
Approximate Statistical Inference on ρ using SAS
[SAS code and output shown on the original slide]
Pitfalls of Regression and Correlation Analysis
Correlation and causation
  Ticks cause good health
Coincidental data
  Sunspots and Republicans
Lurking variables
  Church, suicide, population
Restricted range
Local vs. global linearity
Summary
Probabilistic model for linear regression: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with the assumptions on $\epsilon_i$ above.
Least Squares (LS) fit: minimize
$$Q = \sum_{i=1}^{n}[y_i - (\beta_0 + \beta_1 x_i)]^2$$
The LS estimates $\hat\beta_0$ and $\hat\beta_1$, with
$$SE(\hat\beta_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}, \qquad SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$$
Confidence intervals: $\hat\beta_0$ or $\hat\beta_1 \pm t_{n-2,\,\alpha/2}\, SE(\hat\beta_0$ or $\hat\beta_1)$
Coefficient of determination and sample correlation coefficient $r$:
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Prediction interval for a future observation $Y^*$ at $x = x^*$:
$$\hat Y^* \pm t_{n-2,\,\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$$
Correlation analysis; regression diagnostics: outliers? influential observations? data transformations?