Chapter 8. Correlation and Regression Analyses
Correlation analysis is used to quantify the association between two continuous variables,
for example between an independent and a dependent variable. Simple descriptive statistics can
reveal a great deal about each variable, but it is often more informative to examine the
relationships within the data. Correlation measures and hypothesis tests allow these
relationships to be studied formally. Together, regression and correlation analysis describe
the nature and strength of the relationship between two continuous variables.
To identify the relationship between two variables visually, a scatter plot can be drawn. A
scatter plot is a graphical method used to examine the correlation between two quantitative
variables. It is similar to a line chart, except that the points are not connected: a horizontal
axis and a vertical axis are drawn, and each pair of values is plotted as one point. The pattern
of the points is then studied to see whether a correlation exists. A positive correlation exists
when the points resemble a line that rises from left to right. A negative correlation exists
when the points resemble a line that falls from left to right.
Example 1. Below are the ages and weights of 10 randomly selected elementary pupils.
Age (X) 6 8 7 9 7 10 12 12 11 8
Weight (Y) 44 48 49 51 46 52 54 55 56 49
[Figure: scatter plot of Weight (Y) against Age (X) for the 10 pupils.]
The scatter plot shows a strong positive correlation, since the points rise from left to right
and lie close to a straight line.
Below are examples of scatter plots that can be used as a guide in interpreting correlations.
[Figures: example scatter plots of Y against X showing different strengths and directions of
correlation, including Fig 3. Very Strong Positive Correlation and Fig 4. Very Strong Negative
Correlation.]
2. Plot the data points.
X 2 3 4 5 5 6 8 10 12 8
Y 2 4 3 5 6 7 8 9 11 9
[Figure: scatter plot of the data points, with X on the horizontal axis and Y on the vertical
axis.]
Clearly, a scatter plot can help us determine the relationship between two variables. However,
the Pearson correlation coefficient (r) gives a more exact measure of the relationship, since
Pearson's r is a statistic that measures the linear correlation between two variables. Its
value ranges from -1 to +1. A positive value indicates a positive correlation, while a negative
value indicates a negative correlation. When r = ±1, the correlation is perfect. As r
approaches ±1, the correlation becomes stronger. When r = 0, there is no (zero) linear
correlation. The table below presents values of r and their verbal interpretation as suggested
by Evans (1996).
Interpreting Correlation (Evans, 1996)
r Verbal Interpretation
0 No Correlation
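$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
Where:
n = the number of paired observations
x, y = the paired values of the two variables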
Age (X) 6 8 7 9 7 10 12 12 11 8
Weight (Y) 44 48 49 51 46 52 54 55 56 49
Solution: To make the solution easier, we use a table with the columns x, y, x², y², and xy.
X Y X² Y² XY
6 44 36 1936 264
8 48 64 2304 384
7 49 49 2401 343
9 51 81 2601 459
7 46 49 2116 322
10 52 100 2704 520
12 54 144 2916 648
12 55 144 3025 660
11 56 121 3136 616
8 49 64 2401 392
SUM 90 504 852 25540 4608
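$$ n = 10, \quad \sum x = 90, \quad \sum y = 504, \quad \sum x^2 = 852, \quad \sum y^2 = 25540, \quad \sum xy = 4608 $$
$$ (\sum x)^2 = 90^2 = 8100, \quad (\sum y)^2 = 504^2 = 254016 $$
$$ r = \frac{10(4608) - (90)(504)}{\sqrt{[10(852) - 8100][10(25540) - 254016]}} = \frac{720}{\sqrt{(420)(1384)}} $$
$$ r = 0.944 $$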
The computed r = 0.944 indicates a very strong positive correlation between the pupils' ages
and weights. In this sample, older pupils tend to weigh more than younger ones.
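The table-based computation above can be cross-checked with a short script. The following is a minimal Python sketch, not part of the original text; the function name `pearson_r` is a hypothetical helper chosen for illustration. It applies the same computational formula for r to the Example 1 data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw-score (computational) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 1 data: ages and weights of the 10 pupils
age = [6, 8, 7, 9, 7, 10, 12, 12, 11, 8]
weight = [44, 48, 49, 51, 46, 52, 54, 55, 56, 49]

r = pearson_r(age, weight)
print(round(r, 3))  # 0.944
```

The intermediate sums computed inside the function correspond to the SUM row of the solution table, so each one can be compared with the hand computation.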
Example 2. A study was conducted to determine the linear association between a car's weight and
its mileage. The data are presented below:
Weight Mileage
1080 14
988 20
1140 16
1250 12
1178 12
980 18
1050 15
1095 16
1225 11
1180 13
1010 17
1160 12
Solution: Let X = weight and Y = mileage
X Y X² Y² XY
1080 14 1166400 196 15120
988 20 976144 400 19760
1140 16 1299600 256 18240
1250 12 1562500 144 15000
1178 12 1387684 144 14136
980 18 960400 324 17640
1050 15 1102500 225 15750
1095 16 1199025 256 17520
1225 11 1500625 121 13475
1180 13 1392400 169 15340
1010 17 1020100 289 17170
1160 12 1345600 144 13920
SUM 13336 176 14912978 2668 193071
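$$ n = 12, \quad \sum x = 13336, \quad \sum y = 176, \quad \sum x^2 = 14912978, \quad \sum y^2 = 2668, \quad \sum xy = 193071 $$
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
$$ r = \frac{12(193071) - (13336)(176)}{\sqrt{[12(14912978) - 177848896][12(2668) - 30976]}} = \frac{-30284}{\sqrt{(1106840)(1040)}} $$
$$ r = -0.893 $$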
The computed r = -0.893 indicates a strong negative correlation between a car's weight and its
mileage. This implies that as a car's weight increases, its mileage tends to decrease. It can
also be concluded that heavier cars, in general, consume more gasoline.
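As a cross-check on Example 2, r can also be computed from deviations about the means, which is algebraically equivalent to the computational formula. The sketch below is an illustration only and not part of the original solution; the variable names are assumptions made for this example.

```python
from math import sqrt

# Example 2 data: car weight (X) and mileage (Y)
weight = [1080, 988, 1140, 1250, 1178, 980, 1050, 1095, 1225, 1180, 1010, 1160]
mileage = [14, 20, 16, 12, 12, 18, 15, 16, 11, 13, 17, 12]

n = len(weight)
mean_x = sum(weight) / n
mean_y = sum(mileage) / n

# Sums of cross-products and squares of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weight, mileage))
sxx = sum((x - mean_x) ** 2 for x in weight)
syy = sum((y - mean_y) ** 2 for y in mileage)

r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # -0.893
```

Both forms give the same value; the deviation form makes it easier to see that the sign of r comes from the sum of cross-products.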
Example 3. The data below show the Self Efficacy Score (SES) and Intelligence Quotient (IQ) of
10 randomly selected teenagers.
SES IQ
35 104
46 125
48 100
55 112
52 120
39 117
48 105
30 116
50 108
45 108
X Y X² Y² XY
35 104 1225 10816 3640
46 125 2116 15625 5750
48 100 2304 10000 4800
55 112 3025 12544 6160
52 120 2704 14400 6240
39 117 1521 13689 4563
48 105 2304 11025 5040
30 116 900 13456 3480
50 108 2500 11664 5400
45 108 2025 11664 4860
SUM 448 1115 20624 124883 49933
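From the solution table, we have:
$$ n = 10, \quad \sum x = 448, \quad \sum y = 1115, \quad \sum x^2 = 20624, \quad \sum y^2 = 124883, \quad \sum xy = 49933 $$
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
$$ r = \frac{10(49933) - (448)(1115)}{\sqrt{[10(20624) - 200704][10(124883) - 1243225]}} = \frac{-190}{\sqrt{(5536)(5605)}} $$
$$ r = -0.034 $$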
The computed r = -0.034 indicates a very weak negative correlation between teenagers' Self
Efficacy Score and Intelligence Quotient. Since the computed r is very close to zero, we can say
that there is practically no correlation between SES and IQ. Therefore, we can conclude that
there is no linear association between teenagers' SES and IQ.
Practice Exercise 8.1.1
[Figures: five scatter plots, numbered 1 to 5, each with a horizontal axis from 0 to 40 and a
vertical axis from 0 to 25.]
1. 𝑟 = 0.781
2. 𝑟 = −0.95
3. 𝑟 = −0.389
4. 𝑟 = 0.53
5. 𝑟 = −0.49
6. 𝑟 = −0.88
7. 𝑟 = −0.001
8. 𝑟 = 1.0
9. 𝑟 = 0.27
10. 𝑟 = −0.56
1. Given are the scores of ten randomly selected Grade 11 students in their long quizzes in
Statistics and Basic Math.
Statistics (X) 18 15 13 16 13 10 13 15 10 14
Basic Math (Y) 19 17 14 15 14 11 12 14 17 13
2. Chapman and Demeritt (Elements of Forest Mensuration, 2nd ed., Albany, NY, J.B. Lyon
Company [now Williams Press], 1936) reported diameters (in inches) and ages (in years) of oak
trees.
Age(X) 4 5 8 8 8 10 10 12 13 30
Diameter(Y) 0.8 0.8 1 2 3 2 3.5 4.9 3.5 6
3. Below are the prices (in pesos per kg) and supply (in kg) of dragon fruit in 10 supermarkets
in Cavite.
Supply (X) 128 132 95 105 125 112 132 100 140 130
Price (Y) 85 90 120 115 110 110 100 120 95 90
4. A study was conducted to determine the relationship between daily allowance and weekly
expenses on cellphone load. The data are presented below.
Allowance (X) 1500 1200 800 750 600 750 1000 1000 900 700
Expenses (Y) 100 150 120 100 110 150 150 120 150 120
5. A recent study claims that the number of casinos and the crime rate in a city are linearly
related, such that cities with more casinos have higher crime rates. To test the claim, a group
of researchers studied 8 major cities in CALABARZON and gathered the following information:
Cities A B C D E F G H
Number of Casinos 7 9 12 11 14 5 7 9
Crime rate 1.8 1.6 2.2 2.4 2.3 1.1 1.4 1.9
8.2 REGRESSION ANALYSIS
$$ \hat{y} = a + bx $$
Where:
a = the y-intercept
b = the slope of the line
$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} $$
$$ a = \bar{y} - b\bar{x} $$
Example 1. Given the following data for mothers' heights and their daughters' heights, in inches:
Mother's Height 63 67 64 60 65 67 59 60
Daughter's Height 63.6 64.7 65.3 61 65.4 67.4 60.9 63.1
a. Find the best fit linear equation that relates the mother’s height to her daughter’s height.
b. Sketch the regression line in the scatter plot.
c. What is the best predicted height for a daughter whose mother’s height is 66 inches tall?
Solution:
Mother's Height (X)   Daughter's Height (Y)   x²   y²   xy
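$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{8(32326.8) - (505)(511.4)}{8(31949) - (505)^2} $$
$$ b = \frac{357.4}{567} = 0.63 $$
$$ a = \bar{y} - b\bar{x} = \frac{511.4}{8} - \frac{357.4}{567} \cdot \frac{505}{8} = 24.14 $$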
Answer: ŷ = 24.14 + 0.63x
[Figure: scatter plot of daughter's height against mother's height with the regression line,
labeled y = 0.63x + 24.16 in the chart.]
c. What is the best predicted height for a daughter whose mother’s height is 66 inches tall?
The predicted height for a daughter whose mother’s height is 66 inches is 65.74 inches.
The y-intercept of the regression line is the predicted value of y when x = 0. In practice,
however, the y-intercept is meaningful only when the data contain a value of 0 for x or when a
value of 0 for x is possible.
From the result, the slope of the line, b = 0.63, suggests that for every 1-inch increase in
the mother's height, the daughter's predicted height increases by 0.63 inch.
Furthermore, since b is positive, a daughter's height is positively related to her mother's
height. Thus, we can conclude that taller mothers tend to have taller daughters.
The y-intercept, 24.14, has no meaningful interpretation here because we cannot assume a zero
value for a mother's height.
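The slope, intercept, and prediction in this example can be reproduced with a few lines of code. The following Python sketch is illustrative only; `fit_line` is a hypothetical helper name, and the intercept is computed as the mean of y minus b times the mean of x.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # intercept: mean(y) - b * mean(x)
    return a, b

# Heights in inches
mother = [63, 67, 64, 60, 65, 67, 59, 60]
daughter = [63.6, 64.7, 65.3, 61, 65.4, 67.4, 60.9, 63.1]

a, b = fit_line(mother, daughter)
predicted = a + b * 66  # daughter's predicted height when the mother is 66 inches tall
print(round(b, 2), round(a, 2), round(predicted, 2))  # 0.63 24.14 65.74
```

The script keeps the unrounded slope when computing the intercept and the prediction, which is why it gives a = 24.14 and a predicted height of 65.74 inches.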
Example 3. Using the data in Example 2, (a) determine and interpret the equation of the
regression line, and (b) estimate the mileage of a car which is 1000 kg in weight. Use weight as
the independent variable and mileage the dependent variable.
Solution: Let X = weight and Y = mileage
X Y X² Y² XY
1080 14 1166400 196 15120
988 20 976144 400 19760
1140 16 1299600 256 18240
1250 12 1562500 144 15000
1178 12 1387684 144 14136
980 18 960400 324 17640
1050 15 1102500 225 15750
1095 16 1199025 256 17520
1225 11 1500625 121 13475
1180 13 1392400 169 15340
1010 17 1020100 289 17170
1160 12 1345600 144 13920
SUM 13336 176 14912978 2668 193071
$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{12(193071) - (13336)(176)}{12(14912978) - (13336)^2} $$
$$ b = \frac{-30284}{1106840} = -0.027 $$
$$ a = 45.07 $$
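For part (b), the fitted line can be evaluated at a weight of 1000 kg. The sketch below is a hedged illustration and not the original worked answer; the variable names are assumptions. It recomputes the slope and intercept from the data and then evaluates the line at x = 1000.

```python
# Example 2 data: car weight in kg (X) and mileage (Y)
weight = [1080, 988, 1140, 1250, 1178, 980, 1050, 1095, 1225, 1180, 1010, 1160]
mileage = [14, 20, 16, 12, 12, 18, 15, 16, 11, 13, 17, 12]

n = len(weight)
sum_x, sum_y = sum(weight), sum(mileage)
sum_x2 = sum(x * x for x in weight)
sum_xy = sum(x * y for x, y in zip(weight, mileage))

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * sum_x / n                                 # intercept

estimate = a + b * 1000  # estimated mileage of a 1000 kg car
print(round(b, 3), round(a, 2))  # -0.027 45.07
print(round(estimate, 2))        # 17.71
```

With unrounded coefficients the estimate is about 17.71. Substituting the rounded values gives 45.07 - 0.027(1000) = 18.07, so the answer depends noticeably on how many decimal places of the slope are kept.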
Source: https://edu.casio.com/support/qsg/pdf/991EX_570EX/05_CASIO_QuickStartGuide_fx-991EX_fx-570EX_STATISTICS.pdf
Performing correlation and regression analysis using Excel
2. Input the “Y” data range and the “X” data range, and choose the cell for the output range.
3. The output shows the value of the correlation coefficient “r”, the y-intercept “a”, and the
slope “b”.
From the output:
𝑟 = 0.893986 ≈ 0.89
Direction: Identify the two variables considered in each statement and indicate which is the
independent variable and which is the dependent variable.
2. The number of hours a child spends playing computer/online games makes him more prone to eye
problems in the near future.
3. An increase in the number of dengue patients in a locality increased the number of Dengue
Awareness programs in the locality.
4. Fewer incidents of stress in a workplace produce more productive employees.
6. The life of a light bulb depends on the length of time it is turned on daily.
7. The number of enrollees in a university increased as the number of board passers from the
university increased.
8. The number of foreign tourists increased as islands in the country were developed.
9. The sale of cars decreased with the implementation of the new taxation scheme for car
sales.
10. The number of motorcycle incidents increases as the sales of motorcycles increase.
1. Chapman and Demeritt (Elements of Forest Mensuration, 2nd ed., Albany, NY, J.B. Lyon
Company [now Williams Press], 1936) reported diameters (in inches) and ages (in years) of oak
trees.
Age(X) 4 5 8 8 8 10 10 12 13 30
Diameter(Y) 0.8 0.8 1 2 3 2 3.5 4.9 3.5 6
a. Estimate the equation of the regression line. What conclusion can be made?
2. Giovanni L. Nazareno, a businessman from Cavite, owns 10 fast-food restaurants in 10 towns
of Cavite. He wants to know if a town's population affects monthly sales. The average monthly
sales, in millions of pesos, and each town's population, in hundred thousands, are given
below:
TOWN A B C D E F G H I J
POPULATION 3.25 7.72 8.65 9.74 5.76 4.38 6.41 8.53 9.12 6.15
SALES 1.12 1.56 1.75 1.98 1.64 1.21 1.48 1.73 2.07 1.78
a. Find the best-fitting equation of the regression line.
c. If the population of Town A increases by 120,000, what are the expected monthly sales?
3. The systolic and diastolic pressure readings of 12 randomly selected senior citizens (aged 60-
70 years old) were recorded.
Systolic 135 130 135 140 120 125 120 130 130 144 143 140 125 150
(b) How does senior citizens' systolic blood pressure relate to their diastolic blood pressure?
4. Data on Biological Oxygen Demand, Dissolved Oxygen and Diversity is available for 16 sites
on the Calder Catchment. It is hypothesized that the level of Diversity depends on the level of
BOD - the higher the level of BOD, the more polluted the river and the less Diversity of life
(insects, fish, plants etc.)
a. Determine the best fit linear equation that relates Diversity to BOD.
Answer to Practice Exercise in Chapter 8
Practice Exercise 8.1.1
1. scatter plot
Interpretation: There is a very strong positive correlation between the age and
diameter of oak trees. This implies that as the age of an oak tree increases,
its diameter tends to increase as well. It can also be concluded that
there is a direct relationship between the age and diameter of oak trees.
3. scatter plot
5. scatter plot
Conclusion: The slope of the line, b = 0.203, suggests that for every 1-year increase in the
age of an oak tree, its diameter increases by 0.203 inch on average.
The y-intercept, 0.554, has no meaningful interpretation because we cannot assume a zero value
for the age of the tree.
b. Graph
For every 1-unit (100,000) increase in population, monthly sales increase by 0.125 million
pesos on average.
Therefore, the expected monthly sales if the population of Town A increases by 120,000 is
P1,314,000.
b. How does senior citizens' systolic blood pressure relate to their diastolic blood pressure?
For every one unit increase in diastolic blood pressure, there will be an average increase of
0.63 in the systolic blood pressure.
4. a. The best fitted equation of the regression line.
b. The data support the hypothesis. It can be concluded that for every one-unit increase in
BOD, the Diversity of life decreases by 0.46 on average.