Handout 4 Regression and Correlation
HAND OUT 4
REGRESSION ANALYSIS
4.0 Introduction
The statistical methods covered so far dealt with problems involving a single variable. Regression is
concerned with problems involving two or more variables. This module, however, considers only
problems relating to two variables, that is, paired data. The focus is on establishing the relationship
between the variables in question, which can be either positive or negative. A positive relationship
means that an increase in one variable causes, or is merely associated with, an increase in the other.
Whether causation can be assumed is a subject for another day. What may exist is simply association:
an increase in demand for roller meal could be associated with an increase in demand for all types of
relishes, or an increase in covid-19 cases could be associated with a rise in deaths in a community. The
same can be assumed of demand for bread and milk, and of demand for hospital beds and the outbreak
of a pandemic. A negative relationship means that as one variable increases, the associated variable
decreases, and vice versa. An interesting example is that of rent and demand for accommodation: as
rent rises, demand for accommodation falls. Similarly, one can speculate that as entry requirements for
degree programmes are raised, applications fall, and as they are lowered, applications rise.
Interest in the associations or relationships will focus on;
The relationship between variables can be either linear or non-linear. A linear relationship implies a
constant change in the dependent variable for each unit change in the independent variable.
A non-linear relationship implies a varying absolute change in the dependent variable for each unit
change in the independent variable. The focus in this module is on linear relationships only.
4.1 Terms
I. Dependent variable - this is the factor in a relationship whose values are influenced or affected
by a change in another variable, e.g. kudya nenzara (hunger and food or ukhudhla ngedhlala).
You eat because you are hungry. Eating is therefore the dependent variable; it is caused by
hunger.
Mphokololo series
II. Independent variable - this is the factor which, when changed, leads to changes in
another. It is the cause variable. A change in it could be the result of a decision by
an individual or company, and it produces the required change in the dependent variable.
Consider price and profit: a change in price can lead to a change in the profit of an
organisation.
Linear regression implies a straight-line relationship between two variables. It must be noted, however,
that the choice of which of the two variables is designated as dependent or independent is of no
consequence. The process of obtaining a linear regression line for a given data set (bivariate values) is
referred to as fitting a regression line. Three methods are available for this task: the inspection
method, the semi-averages method and the least squares method.
The inspection method involves plotting a scatter diagram of the data and then drawing a line through
the points to describe the data. Its major weakness is that different people may come up with different
lines, because each person relies on individual judgement in drawing the line.
[Figure: scatter diagram with a fitted straight line. Y axis: Housing demand; X axis: Rentals.]
The semi-averages method involves splitting the paired data into two equal halves and then calculating
the mean of x and the mean of y for each half. The two mean points, one for each half, are then plotted
and joined by a straight line, giving the regression line.
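The semi-averages idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `semi_average_line` is hypothetical, the pairs are ordered by x before splitting, and the middle pair is dropped when the number of pairs is odd. The handout does not specify either of those two choices. The data are the orders and overtime figures used in Example 2.

```python
def semi_average_line(x, y):
    """Fit y = a + b*x through the mean points of the lower and upper halves."""
    pairs = sorted(zip(x, y))            # assumption: order the pairs by x first
    half = len(pairs) // 2
    lower = pairs[:half]                 # lower half of the data
    upper = pairs[-half:]                # upper half (middle pair dropped if n is odd)
    x1 = sum(p[0] for p in lower) / half
    y1 = sum(p[1] for p in lower) / half
    x2 = sum(p[0] for p in upper) / half
    y2 = sum(p[1] for p in upper) / half
    b = (y2 - y1) / (x2 - x1)            # slope of the line joining the two mean points
    a = y1 - b * x1                      # intercept, since the line passes through (x1, y1)
    return a, b

# Orders received (x) and total overtime in hours (y) from Example 2.
orders = [22, 107, 55, 48, 92, 135, 32, 67, 83, 122]
overtime = [9, 42, 18, 11, 30, 48, 10, 29, 38, 51]

a, b = semi_average_line(orders, overtime)
print(f"y = {a:.2f} + {b:.2f}x")
```

Because the line depends on how the data are split, a different ordering of the pairs would generally give a different line.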
[Figure: semi-averages line through the plotted points. Y-axis: Rent; X-axis: Space occupied.]
The least squares regression line takes the form y = a + bx, where;
a = y intercept, i.e. the point where the regression line crosses the y axis
b = gradient of the regression line, or the slope coefficient, representing the amount of change in y for
every unit change in x. This is illustrated in the diagram below;
[Figure: regression line showing the intercept a and slope b. y-axis: Rent ($); x-axis: Space occupied.]
To get the line of best fit, you first need to establish the values of a and b, which are given by;
\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \]
\[ a = \frac{\sum y - b\sum x}{n} \]
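The two formulas translate directly into code. The sketch below is a minimal illustration rather than a definitive implementation: the function name `least_squares` is hypothetical, and the five data pairs are made up for illustration and do not come from the handout.

```python
def least_squares(x, y):
    """Return (a, b) for the line y = a + b*x using the summation formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                   # intercept
    return a, b

# Hypothetical illustrative data, not taken from the handout.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
print(f"y = {a:.2f} + {b:.2f}x")
```

Note that b must be computed before a, since the formula for a uses the value of b.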
Example 1
Josphat, the manager of an engineering company, collected the following data on monthly power costs
and machine hours;
Obtain the least squares regression line for the above data.
Solution
\[ a = \frac{\sum y - b\sum x}{n} = \frac{167 - (-8.1158119658)(129)}{9} = \frac{167 + 1046.9397435897}{9} = \frac{1213.9397435897}{9} = 134.8821937322 \]
Y = 134.88 - 8.12x
Example 2
As part of an investigation into levels of overtime working, a company decides to tabulate the number of
orders received weekly and compare this with the total weekly overtime worked, giving the following
data:
Week no                1    2    3    4    5    6    7    8    9    10
Orders received        22   107  55   48   92   135  32   67   83   122
Total overtime (Hrs)   9    42   18   11   30   48   10   29   38   51
Using the least squares method, obtain the regression line and estimate the overtime required for 100
orders.
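As a cross-check on hand calculation, the sketch below computes the four sums directly from the paired data, applies the least squares formulas, and then substitutes x = 100 into the fitted line. The variable names are illustrative assumptions; only the data come from the example.

```python
# Orders received (x) and total overtime in hours (y) for the 10 weeks.
orders = [22, 107, 55, 48, 92, 135, 32, 67, 83, 122]
overtime = [9, 42, 18, 11, 30, 48, 10, 29, 38, 51]

n = len(orders)
sum_x = sum(orders)
sum_y = sum(overtime)
sum_xy = sum(x * y for x, y in zip(orders, overtime))
sum_x2 = sum(x ** 2 for x in orders)
print(f"n={n}, sum x={sum_x}, sum y={sum_y}, sum xy={sum_xy}, sum x^2={sum_x2}")

# Least squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
print(f"y = {a:.4f} + {b:.4f}x")

# Estimated overtime for 100 orders, using the unrounded a and b.
estimate_100 = a + b * 100
print(f"Estimated overtime for 100 orders: {estimate_100:.2f} hours")
```

Since 100 orders lies inside the observed range of 22 to 135 orders, this estimate is an interpolation.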
The error of an estimate is measured by the standard error of estimate. It is denoted by Syx, read as
the standard error of Y on X, and is given by
\[ S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^2}{N}} \]
Where;
Y = observed Y values
Ye = estimated Y values given by the regression line
N = number of values
Procedure
1) Establish the estimating line by obtaining the values of a and b as shown above.
2) Given the estimating line Y = a + bx, obtain the estimate Ye for each actual value of X.
3) Subtract each estimate Ye from the corresponding observed Y value to get the deviation (Y-Ye).
4) Square each deviation and add the squares over all N values to get ∑(Y-Ye)².
5) Divide the sum ∑(Y-Ye)² from step 4 by N and then take the square root to get Syx, the standard
error of estimate.
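The five steps can be followed line by line in code. This is a sketch under stated assumptions: the function name `standard_error` is hypothetical and the small data set is made up for illustration. The values a = 2.2 and b = 0.6 are the least squares coefficients for that made-up data.

```python
from math import sqrt

def standard_error(x, y, a, b):
    """Standard error of estimate Syx for the line Ye = a + b*x."""
    estimates = [a + b * xi for xi in x]                     # step 2: Ye for each X
    deviations = [yi - ye for yi, ye in zip(y, estimates)]   # step 3: Y - Ye
    sum_sq = sum(d ** 2 for d in deviations)                 # step 4: sum of (Y - Ye)^2
    return sqrt(sum_sq / len(y))                             # step 5: divide by N, take root

# Hypothetical illustrative data, not taken from the handout.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6   # step 1: least squares line for this data

syx = standard_error(x, y, a, b)
print(f"Syx = {syx:.4f}")
```

The divisor here is N, as in the formula above. Some texts divide by N - 2 instead, which gives a slightly larger value.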
1. Syx is a concept that is statistically parallel to the standard deviation Sy. The difference
between the two is that Sy measures the dispersion around the mean, while Syx measures the
dispersion around the best fit provided by the regression line Ye = a + bx.
2. When Sy = 0, the corresponding Syx is also equal to 0.
3. Since Syx measures how close the observed values of Y are to Ye, it serves as a measure of
the reliability of the estimate. The closer the observed values of Y are to the estimated
values Ye, the smaller the error and the more reliable the estimate, and vice versa. The
estimate is perfect when the value of Syx is zero.
Example 3
As part of an investigation into levels of overtime working, a company decides to tabulate the number of
orders received weekly and compare this with the total weekly overtime worked, giving the following
data:
Week no                1    2    3    4    5    6    7    8    9    10
Orders received        22   107  55   48   92   135  32   67   83   122
Total overtime (Hrs)   9    42   18   11   30   48   10   29   38   51
Using the least squares method, obtain the regression line and estimate the overtime required for 100
orders.
X      Y      XY       X²       Y-Ye     (Y-Ye)²
22     9      198      484       2.37      5.60
107    42     4494     11449     0.98      0.96
55     18     990      3025     -1.98      3.93
48     11     528      2304     -6.15     37.84
92     30     2760     8464     -4.95     24.52
135    48     6480     18225    -4.35     18.89
32     10     320      1024     -0.68      0.46
67     29     1943     4489      4.16     17.32
83     38     3154     6889      6.69     44.75
122    51     6222     14884     3.91     15.31
763    286    27089    71237             169.59
\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(27089) - (763)(286)}{10(71237) - 763^2} = \frac{270890 - 218218}{712370 - 582169} = \frac{52672}{130201} = 0.404543744 \]
\[ a = \frac{\sum y - b\sum x}{n} = \frac{286 - 0.404543744(763)}{10} = \frac{286 - 308.6668767}{10} = \frac{-22.6668767}{10} = -2.26668767 \]
Y = -2.27 + 0.40X
To use the regression line to estimate Ye, we first replace Y with Ye in the equation. The new equation
is then used to estimate the Ye values by substituting the actual X values from the table. For better
accuracy, we use the unrounded values of a and b rather than the approximate values.
Ye = -2.26668767 + 0.404543744X
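Procedure
Step 1 Calculate the values of Ye by substituting each actual X value from the table into the equation.
Step 2 Subtract each Ye from the corresponding observed Y to obtain (Y-Ye), giving column five.
Step 3 Square each difference to obtain (Y-Ye)², giving column six, and total the column.
Step 4 Substitute the column total and N = 10 into
\[ S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^2}{N}} \]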
Generally, the bigger the standard error of estimate, the less accurate the regression line is as an
estimator. Given that Syx ≈ 4.12 hours, against observed overtime ranging from 9 to 51 hours, one can
say the regression line is a good estimator and can therefore be relied upon.
Regression is only a measure of the relationship between two variables and does not establish causality.
The designation of variables as dependent and independent is only a matter of convenience and does not
automatically establish causality. One should therefore not assume that X causes Y, nor that causality
is absent between the two. Regression analysis only reveals the statistical relationship between two
variables; it says nothing about cause and effect.
Estimating the most likely value of the dependent variable Y for a value of the independent variable X
that lies between the lowest and highest observations is called interpolation. The results of
interpolation hold for values generated by the regression line between the lowest and highest observed
points. Extrapolation, on the other hand, extends the regression line beyond the lowest or highest
observed point, and raises the question of how reliable the extended line is in estimating the
behaviour of Y outside the observed range.
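The distinction depends only on whether the new X value falls inside the observed range, which is easy to check before relying on an estimate. The helper name `classify_estimate` below is a hypothetical illustration; the X values are the weekly orders from the overtime example.

```python
def classify_estimate(x_new, x_observed):
    """Say whether estimating at x_new is interpolation or extrapolation."""
    lowest, highest = min(x_observed), max(x_observed)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"

# Weekly orders received in the overtime example (observed range 22 to 135).
orders = [22, 107, 55, 48, 92, 135, 32, 67, 83, 122]

print(classify_estimate(100, orders))   # inside the observed range
print(classify_estimate(200, orders))   # beyond the highest observation
```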
4.6 CORRELATION
Correlation is concerned with describing the strength of the relationship between two variables by
measuring the degree of ‘scatter’ of the data values.
The correlation coefficient is represented by r and measures the strength of the relationship between
bivariate values. It lies between -1 and +1, i.e. −1 ≤ r ≤ 1. If r = 0, there is no linear relationship,
but if r is less than or greater than 0, a relationship exists. If r = -1, there is a perfect negative
correlation: as one variable increases, the other decreases. If r = +1, there is a perfect positive
correlation: as one variable increases, the other also increases.
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \]
Example 4
The data below relates to weekly maintenance costs ($) to the age (in months) of 10 machines of
similar type in a manufacturing company. Calculate the product moment correlation coefficient
between age and cost.
Machine 1 2 3 4 5 6 7 8 9 10
Age (x) 5 10 15 20 30 30 30 50 50 60
Cost (y) 190 240 250 300 310 335 300 300 350 395
Solution
X Y XY X² Y²
5 190 950 25 36100
10 240 2400 100 57600
15 250 3750 225 62500
20 300 6000 400 90000
30 310 9300 900 96100
30 335 10050 900 112225
30 300 9000 900 90000
50 300 15000 2500 90000
50 350 17500 2500 122500
60 395 23700 3600 156025
300 2970 97650 12050 913050
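\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \]
\[ = \frac{10(97650) - (300)(2970)}{\sqrt{\left(10(12050) - 300^2\right)\left(10(913050) - 2970^2\right)}} \]
\[ = \frac{85500}{174.64 \times 556.42} \approx 0.88 \]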
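The same calculation can be checked in code. The sketch below applies the product moment formula directly to the age and cost data of Example 4; the function name `pearson_r` is a hypothetical choice for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Product moment correlation coefficient from the summation formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Age in months (x) and weekly maintenance cost in $ (y) for the 10 machines.
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

r = pearson_r(age, cost)
print(f"r = {r:.4f}")
```

A value of r close to +1 indicates a strong positive correlation between the age of a machine and its maintenance cost.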
The coefficient of determination is the ratio of explained variation to total variation and is
determined by squaring the value of r. It gives the proportion of all variation in the y-values that is
explained by the variation in the x-values.
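As a worked illustration, the coefficient of determination for the age and cost data of Example 4 can be obtained from the column totals in the solution table (n = 10, ∑x = 300, ∑y = 2970, ∑xy = 97650, ∑x² = 12050, ∑y² = 913050). The rounding to two decimal places is a choice made here.

```latex
\[
r^2 = \frac{\left(n\sum xy - \sum x \sum y\right)^2}
           {\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}
    = \frac{85500^2}{30500 \times 309600}
    = \frac{7310250000}{9442800000}
    \approx 0.77
\]
```

On this reading, about 77% of the variation in maintenance cost is explained by the variation in machine age.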
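Spearman's rank correlation coefficient is given by
\[ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]
where d is the difference between the two ranks of each pair and n is the number of pairs.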
Example 5
The data below shows the average rent and rates ($ per square metre ) for a selection of areas.
Calculate the Spearman’s rank correlation to assess whether there is any correlation between rent and
rates
Rates(x) 1.68 1.46 1.57 13.37 3.18 1.95 1.07 1.71 1.22 6.46
Rent(y) 3.81 4.19 4.87 22.85 6.47 6.48 2.66 6.49 5.33 15.23
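The ranking steps can be sketched in code as follows. This is a minimal illustration: the function names `ranks` and `spearman` are hypothetical, values are ranked from 1 for the smallest, and the ranking helper assumes there are no tied values, which holds for this data set.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rank correlation from the sum of squared rank differences."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Rates (x) and rent (y) in $ per square metre, from Example 5.
rates = [1.68, 1.46, 1.57, 13.37, 3.18, 1.95, 1.07, 1.71, 1.22, 6.46]
rent = [3.81, 4.19, 4.87, 22.85, 6.47, 6.48, 2.66, 6.49, 5.33, 15.23]

r_s = spearman(rates, rent)
print(f"Spearman's rank correlation = {r_s:.4f}")
```

With tied values, the tied items would normally share the average of the ranks they occupy, which this simple helper does not do.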
Revision Questions
One
The following data relate to the length of service and monthly salary of the employees of an
organisation. Obtain
Two
Ten year data on price (in RTGS$ per unit) of a commodity sold through a retail store and the
corresponding sales (in thousand units) are given below;
Price 4 5 5 7 6 8 10 12 15 16
Sales 10 12 13 16 20 22 20 25 28 30
Three
Ten year data on price ($ per kg) of a commodity sold through a consumer cooperative store and the
corresponding sales (in thousand units) are as given below;
Price 4 5 5 7 6 8 10 12 15 16
Sales 10 12 13 16 20 22 20 25 28 30
a) Establish the line of best fit for the data (10 marks)
b) Calculate the standard error of estimate for the line of best fit (10 marks)
c) Compute the coefficient of correlation, using product-moment formula (10 marks)