Corr and Reg
Correlation and Regression
§ 9.1
Correlation
A correlation is a relationship between two variables. The
data can be represented by the ordered pairs (x, y) where
x is the independent (or explanatory) variable, and y is
the dependent (or response) variable.
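A scatter plot can be used to determine whether a linear (straight-line)
correlation exists between two variables.
Example:
x    1    2    3    4    5
y   –4   –2   –1    0    2
[Scatter plot of the five (x, y) pairs]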
Larson & Farber, Elementary Statistics: Picturing the World, 3e 3
Linear Correlation
[Figure: four scatter plots illustrating the types of correlation]
Negative Linear Correlation: as x increases, y tends to decrease.
Positive Linear Correlation: as x increases, y tends to increase.
No Correlation
Nonlinear Correlation
Correlation Coefficient
The correlation coefficient is a measure of the strength
and the direction of a linear relationship between two
variables. The symbol r represents the sample correlation
coefficient. The formula for r is
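$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}.$$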
[Figure: four scatter plots with their correlation coefficients]
r = –0.91: strong negative correlation
r = 0.88: strong positive correlation
r = 0.42: weak positive correlation
r = 0.07: nonlinear correlation
Calculating a Correlation Coefficient
In Words                                                     In Symbols
1. Find the sum of the x-values.                             $\sum x$
2. Find the sum of the y-values.                             $\sum y$
3. Multiply each x-value by its corresponding y-value
   and find the sum.                                         $\sum xy$
4. Square each x-value and find the sum.                     $\sum x^2$
5. Square each y-value and find the sum.                     $\sum y^2$
6. Use these five sums to calculate the correlation
   coefficient.                                              $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
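The six steps can be sketched in Python as follows. The function name `correlation_coefficient` is an illustrative choice, not something from the slides, and the sample call uses a small five-point data set (x = 1 to 5, y = –3, –1, 0, 1, 2) purely as a check.

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, computed from the five sums."""
    n = len(xs)
    sum_x = sum(xs)                               # step 1
    sum_y = sum(ys)                               # step 2
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # step 3
    sum_x2 = sum(x * x for x in xs)               # step 4
    sum_y2 = sum(y * y for y in ys)               # step 5
    # step 6: combine the five sums
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = correlation_coefficient([1, 2, 3, 4, 5], [-3, -1, 0, 1, 2])
print(round(r, 3))  # 0.986
```

The sketch assumes both lists have the same length and that neither variable is constant; a constant variable would make the denominator zero.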
Correlation Coefficient
Example:
Calculate the correlation coefficient r for the following data.
  x     y    xy    x²    y²
  1    –3    –3     1     9
  2    –1    –2     4     1
  3     0     0     9     0
  4     1     4    16     1
  5     2    10    25     4
Σx = 15   Σy = –1   Σxy = 9   Σx² = 55   Σy² = 15
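Substituting the column totals into the formula for r gives the following worked step. It is added here as a sketch of the arithmetic, with n = 5 ordered pairs and Σy = –1 taken from the y column of the table.

```latex
r = \frac{n\sum xy - (\sum x)(\sum y)}
         {\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}
  = \frac{5(9) - (15)(-1)}
         {\sqrt{5(55) - 15^2}\,\sqrt{5(15) - (-1)^2}}
  = \frac{60}{\sqrt{50}\,\sqrt{74}}
  \approx 0.986
```

A value this close to 1 indicates a strong positive linear correlation.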
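Example:
Hours of television watched during the weekend (x) and test scores the
following Monday (y) for 12 students.
Hours, x           0     1     2     3     3     5     5     5     6     7     7    10
Test score, y     96    85    82    74    95    68    76    84    58    65    75    50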
Correlation Coefficient
Example continued:
[Scatter plot: test score (y) against hours watching TV (x)]
Correlation Coefficient
Example continued:
Hours, x           0     1     2     3     3     5     5     5     6     7     7    10
Test score, y     96    85    82    74    95    68    76    84    58    65    75    50
xy                 0    85   164   222   285   340   380   420   348   455   525   500
x²                 0     1     4     9     9    25    25    25    36    49    49   100
y²              9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
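The row totals and the resulting value of r can be checked with a short Python sketch. The variable names are illustrative; the data are the twelve (hours, test score) pairs from the table.

```python
from math import sqrt

hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))  # total of the xy row
sum_x2 = sum(x * x for x in hours)                  # total of the x² row
sum_y2 = sum(y * y for y in scores)                 # total of the y² row

r = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 54 908 3724 332 70836
print(round(r, 3))                            # -0.831
```

The negative sign matches the downward trend in the data: students who watched more television tended to score lower.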
Example:
The following data represent the number of hours 12
different students watched television during the
weekend and the score each student earned on a test
the following Monday.
The correlation coefficient is r ≈ –0.831.
Hours, x           0     1     2     3     3     5     5     5     6     7     7    10
Test score, y     96    85    82    74    95    68    76    84    58    65    75    50
Testing a Population Correlation Coefficient
Example continued:
r ≈ –0.831, n = 12, α = 0.01
Appendix B: Table 11
 n    α = 0.05   α = 0.01
 4     0.950      0.990
 5     0.878      0.959
 6     0.811      0.917
10     0.632      0.765
11     0.602      0.735
12     0.576      0.708
13     0.553      0.684
For n = 12 and α = 0.01 the critical value is 0.708, and |r| ≈ 0.831 > 0.708.
Because |r| ≈ 0.831 is greater than the critical value 0.708, the population
correlation is significant: there is enough evidence at the 1% level of
significance to conclude that there is a significant linear correlation
between the number of hours of television watched during the weekend and
the test scores the following Monday.
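The table lookup can be sketched in Python. The dictionary below holds only the rows of Table 11 that appear in this excerpt, and the names `CRITICAL_VALUES` and `is_significant` are hypothetical helpers, not part of any library.

```python
# Critical values copied from the rows of Appendix B, Table 11 shown in the
# excerpt, keyed by sample size n and then by significance level alpha.
CRITICAL_VALUES = {
    4: {0.05: 0.950, 0.01: 0.990},
    5: {0.05: 0.878, 0.01: 0.959},
    6: {0.05: 0.811, 0.01: 0.917},
    10: {0.05: 0.632, 0.01: 0.765},
    11: {0.05: 0.602, 0.01: 0.735},
    12: {0.05: 0.576, 0.01: 0.708},
    13: {0.05: 0.553, 0.01: 0.684},
}

def is_significant(r, n, alpha):
    """True if |r| exceeds the tabulated critical value for (n, alpha)."""
    return abs(r) > CRITICAL_VALUES[n][alpha]

print(is_significant(-0.831, 12, 0.01))  # True
```

Sample sizes that are not in the excerpt raise a `KeyError` in this sketch rather than guessing a value.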
Hypothesis Testing for ρ
A hypothesis test can also be used to determine whether the
sample correlation coefficient r provides enough evidence to
conclude that the population correlation coefficient ρ is
significant at a specified level of significance.
A hypothesis test can be one tailed or two tailed.
Left-tailed test
H0: ρ ≥ 0 (no significant negative correlation)
Ha: ρ < 0 (significant negative correlation)
The standardized test statistic
$$t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
follows a t-distribution with n – 2 degrees of freedom.
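As a rough illustration, the standardized test statistic can be computed in Python for the television example (r ≈ –0.831, n = 12, so 10 degrees of freedom). The function name `t_statistic` is an assumption made for this sketch, and the comparison against a critical t-value is left to a t-table.

```python
from math import sqrt

def t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    standard_error = sqrt((1 - r ** 2) / (n - 2))
    return r / standard_error

t = t_statistic(-0.831, 12)
print(round(t, 2))  # -4.72
```

The sign of t follows the sign of r, which is why a left-tailed test is the natural choice when the question is about negative correlation.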
[Figure: a data point, its predicted y-value on the line, and the vertical
distance d between them]
For each data point, dᵢ represents the difference between the
observed y-value and the predicted y-value for a given x-value
on the line. These differences are called residuals.
Regression Line
A regression line, also called a line of best fit, is the line
for which the sum of the squares of the residuals is a
minimum.
The Equation of a Regression Line
The equation of a regression line for an independent variable
x and a dependent variable y is
ŷ = mx + b
where ŷ is the predicted y-value for a given x-value. The
slope m and y-intercept b are given by
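$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}$$
where ȳ is the mean of the y-values and x̄ is the mean of the
x-values. The regression line always passes through (x̄, ȳ).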
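A minimal Python sketch of these two formulas follows, applied to the television data set used throughout this section. The function name `regression_line` is illustrative only.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n   # b = y-bar - m * x-bar
    return m, b

hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

m, b = regression_line(hours, scores)
print(round(m, 3), round(b, 2))  # -4.067 93.97
```

Evaluating `m * x_bar + b` at the mean of the x-values returns the mean of the y-values, which is the "passes through (x̄, ȳ)" property stated above.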
[Figure: regression line for the five-point example, passing through (x̄, ȳ) = (3, –1/5)]
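$$b = \bar{y} - m\bar{x} = \frac{908}{12} - (-4.067)\,\frac{54}{12} \approx 93.97$$
(x̄, ȳ) = (54/12, 908/12) ≈ (4.5, 75.7)
ŷ = –4.07x + 93.97
[Figure: scatter plot of test score (y) against hours watching TV (x) with the regression line ŷ = –4.07x + 93.97]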
Regression Line
Example continued:
Using the equation ŷ = –4.07x + 93.97, we can predict
the test score for a student who watches 9 hours of TV.
ŷ = –4.07x + 93.97
= –4.07(9) + 93.97
= 57.34
Variation About a Regression Line
The total variation about a regression line is the sum of the
squares of the differences between the y-value of each ordered
pair and the mean of y.
$$\text{Total variation} = \sum \left(y_i - \bar{y}\right)^2$$
Example:
The correlation coefficient for the data that represent
the number of hours students watched television and the
test scores of each student is r ≈ –0.831. Find the
coefficient of determination.
r² ≈ (–0.831)² ≈ 0.691
About 69.1% of the variation in the test scores can be explained
by the variation in the hours of TV watched. About 30.9% of the
variation is unexplained.
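The split into explained and unexplained variation can be checked numerically. The sketch below assumes the standard definitions (explained variation is the sum of squared differences between each predicted value and ȳ, unexplained variation is the sum of squared residuals) and fits the line with unrounded coefficients; the variable names are illustrative.

```python
hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Least-squares slope and intercept, written in deviation form
# (algebraically equivalent to the sum-based formulas).
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / sum(
    (x - x_bar) ** 2 for x in hours
)
b = y_bar - m * x_bar
predicted = [m * x + b for x in hours]

total = sum((y - y_bar) ** 2 for y in scores)
explained = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, predicted))

r_squared = explained / total
print(round(r_squared, 3))  # 0.691
```

For a least-squares line the explained and unexplained parts add up to the total variation, so `1 - r_squared` is the unexplained share.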
$$s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}}$$
where n is the number of ordered pairs in the data set.
Hours, xi           5       5       6       7       7      10
Test score, yi     76      84      58      65      75      50
ŷi              73.62   73.62   69.55   65.48   65.48   53.27
(yi – ŷi)²       5.66  107.74   133.4    0.23   90.63   10.69
The Standard Error of Estimate
Example continued:
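$$\sum \left(y_i - \hat{y}_i\right)^2 = 658.25 \quad \text{(unexplained variation)}$$
$$s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} = \sqrt{\frac{658.25}{12 - 2}} \approx 8.11$$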
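The same calculation can be sketched in Python from the raw data. The function name `standard_error_of_estimate` is an assumption for illustration. Because the sketch keeps the slope and intercept unrounded, the unexplained variation it computes may differ slightly in the second decimal place from the 658.25 obtained with ŷ = –4.07x + 93.97.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys):
    """s_e = sqrt(sum of squared residuals / (n - 2)) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - m * x_bar
    unexplained = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return sqrt(unexplained / (n - 2))

hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

s_e = standard_error_of_estimate(hours, scores)
print(round(s_e, 2))  # 8.11
```

A smaller value of s_e means the observed test scores lie closer to the regression line.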
Prediction Intervals
Construct a Prediction Interval for y for a Specific Value of x
In Words                                        In Symbols
4. Find the standard error of estimate se.      $s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$