Handout 4 Regression and Correlation

This document provides an overview of regression and correlation analysis, focusing on the relationship between two variables. It explains key concepts such as dependent and independent variables, linear regression methods, and the calculation of regression lines using least squares. Additionally, it discusses the importance of measuring the error of estimate and the distinction between correlation and causality.

Uploaded by

mkaratibran
Copyright: © All Rights Reserved

Mphokololo series

HAND OUT 4

REGRESSION AND CORRELATION

REGRESSION ANALYSIS

4.0 Introduction

The statistical methods covered so far dealt with problems involving a single variable. Regression is
concerned with problems involving two or more variables. This module, however, considers only
problems involving two variables, that is, paired data. The focus is on establishing the relationship
between the two variables in question, which can be either positive or negative. A positive
relationship means that an increase in one variable causes, or is merely associated with, an increase
in the other. Whether causation can be assumed is a subject for another day. What may exist is simply
association: an increase in demand for roller meal could be associated with an increase in demand for
all types of relish, and an increase in Covid-19 cases could be associated with a rise in deaths in a
community. The same can be assumed of the demand for bread and milk, and of the demand for
hospital beds and the outbreak of a pandemic. A negative relationship means that as one variable
increases the associated variable decreases, and vice versa. An example is rent and the demand for
accommodation: as rent rises, demand for accommodation decreases. Similarly, one can speculate
that as entry requirements for degree programmes are raised, applications fall, and as they are
lowered, applications rise. Interest in these relationships focuses on:

I. The nature and degree of the relationship between any two variables, X and Y.
II. Measuring the rate of change in one variable (the dependent variable) that results from a change in
another (the independent variable).
III. Evaluating the predictive strength of the relationship, i.e. to what extent the change in one
variable can be told from a unit change in the other.

The relationship between variables can be either linear or non-linear. A linear relationship implies a
constant change in the dependent variable for every unit change in the independent variable.

A non-linear relationship implies varying absolute change in the dependent variable for a unit change to
the independent variable. Focus in this module will however be on linear relationships only.

4.1 Terms

I. Dependent variable: the factor in a relationship whose values are influenced or affected
by a change in another variable, e.g. kudya nenzara or ukhudhla ngedhlala (hunger and food).
You eat because you are hungry. Eating is therefore the dependent variable: it is caused by
hunger.


II. Independent variable: the factor which, when changed, leads to changes in
another. It is the cause (explanatory) variable. A change in it may result from a decision by
an individual or a company, and it leads to a change in the dependent variable.
Consider price and profit: a change in price can lead to a change in the
profit of an organisation.

4.2 Linear Regression

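It implies a straight-line relationship between two variables. Note that designating one variable as
dependent and the other as independent is a modelling choice and does not by itself establish that
one causes the other. The fitted line does depend on the choice, however: regressing y on x generally
gives a different line from regressing x on y.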

4.2.1 Standard Methods of obtaining a regression line

The process of obtaining a linear regression line for a given data set (bivariate values) is referred to as
fitting a regression line. Three methods are available for this task: the inspection method, the
semi-averages method and the least squares method.

4.2.1.1 Inspection Method

The method involves plotting a scatter diagram of the relevant data and then drawing a line
through the points to describe the data. Its major weakness is that different people may come
up with different lines, because each person uses individual judgement in drawing the line.

[Figure: scatter diagram with a straight line drawn through the points. X axis: Rentals; Y axis: Housing demand.]


4.2.1.2 Semi-averages method

The method involves splitting the paired data (usually after ordering the pairs by x) into two equal
halves and then calculating the mean of x and the mean of y for each half. The two points given by
these means are then plotted and joined by a straight line, giving the regression line.
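The semi-averages steps can be sketched in Python as follows. This is only an illustrative sketch: the function name semi_averages_line is made up for this handout, the pairs are assumed to be ordered by x before splitting, and with an odd number of pairs the middle pair is assumed to be left out. The data are the orders and overtime figures from Example 2 of this handout.

```python
def semi_averages_line(xs, ys):
    """Return (a, b) for the line y = a + b*x through the two half-means."""
    pairs = sorted(zip(xs, ys))      # assumption: order the pairs by x first
    half = len(pairs) // 2
    lower = pairs[:half]             # lower half of the data
    upper = pairs[-half:]            # upper half (middle pair dropped if n is odd)
    x1 = sum(x for x, _ in lower) / half
    y1 = sum(y for _, y in lower) / half
    x2 = sum(x for x, _ in upper) / half
    y2 = sum(y for _, y in upper) / half
    b = (y2 - y1) / (x2 - x1)        # slope of the line joining the two mean points
    a = y1 - b * x1                  # intercept, from y1 = a + b*x1
    return a, b


orders = [22, 107, 55, 48, 92, 135, 32, 67, 83, 122]
overtime = [9, 42, 18, 11, 30, 48, 10, 29, 38, 51]
a, b = semi_averages_line(orders, overtime)
print(f"y = {a:.2f} + {b:.2f}x")
```

Because the line depends only on two mean points, it is quicker to obtain than the least squares line but will generally differ from it.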

[Figure: semi-averages line. X axis: space occupied; Y axis: Rent.]

4.2.1.3 Least Squares Method

This is expressed as y = a + bx

Where:

y = dependent variable (rent)

x = independent variable (space occupied)

a = y-intercept, i.e. the point where the regression line crosses the y axis

b = gradient of the regression line, or the slope coefficient, representing the amount of change in y for
every unit change in x. This is illustrated in the diagram below:


[Figure: regression line y = a + bx. X axis: space occupied; Y axis: Rent ($).]

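To get the line of best fit, first establish the values of b and a, which are given by:

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}} \]

\[ a = \frac{\sum y - b\sum x}{n} \]

where n is the number of paired observations.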

Example 1

Josphat, the manager of an engineering company, collected the following data on monthly power costs
and machine hours:

Month        Power costs ($)   Machine hours
April              23               22
May                25               23
June               20               19
July               20               12
August             20               12
September          15                9
October            14                7
November           14               11
December           16               14
Total             167              129


Obtain the least squares regression line for the above data.

Solution

Month        Machine hours (X)   Power cost (Y)     XY      X²
April               22                 23           506     484
May                 23                 25           575     529
June                19                 20           380     361
July                12                 20           240     144
August              12                 20           240     144
September            9                 15           135      81
October              7                 14            98      49
November            11                 14           154     121
December            14                 16           224     196
∑                  129                167          2552    2109

With n = 9:

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}} = \frac{9 \times 2552 - 129 \times 167}{9 \times 2109 - 129^{2}} = \frac{22968 - 21543}{18981 - 16641} = \frac{1425}{2340} = 0.6089743590 \]

\[ a = \frac{\sum y - b\sum x}{n} = \frac{167 - 0.6089743590 \times 129}{9} = \frac{167 - 78.5576923}{9} = \frac{88.4423077}{9} = 9.8269231 \]

The regression line is therefore:

Y = 9.83 + 0.61X
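A hand calculation like the one in Example 1 can be checked with a few lines of Python. The sketch below applies the least squares formulas for b and a, with the factor n in the numerator (n∑xy − ∑x∑y), to the machine hours and power cost data; the function name least_squares_line is chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # a = (sum(y) - b*sum(x)) / n
    a = (sum_y - b * sum_x) / n
    return a, b


machine_hours = [22, 23, 19, 12, 12, 9, 7, 11, 14]    # X
power_costs = [23, 25, 20, 20, 20, 15, 14, 14, 16]    # Y
a, b = least_squares_line(machine_hours, power_costs)
print(f"Y = {a:.2f} + {b:.2f}X")   # prints: Y = 9.83 + 0.61X
```

A positive slope is what one would expect here, since power costs rise with machine hours in the table.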

Example 2

As part of an investigation into levels of overtime working, a company decides to tabulate the number of
orders received weekly and compare this with total weekly overtime worked, giving the following:

Week no.                1    2    3    4    5    6    7    8    9   10
Orders received        22  107   55   48   92  135   32   67   83  122
Total overtime (hrs)    9   42   18   11   30   48   10   29   38   51

Using the least squares method, obtain the regression line and estimate the level of overtime
required for 100 orders.


4.3 MEASURING THE ERROR OF ESTIMATE

The error of estimate is measured by the standard error of estimate. It is denoted by Syx, read as the
standard error of Y on X, and is given by

\[ S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^{2}}{N}} \]

Y = observed Y values

Ye = estimated Y values obtained from the regression line

N = number of paired values

Procedure

1) Establish the estimating line by obtaining values for a and b as shown above.
2) Given the estimating line Y = a + bx, obtain the estimated value Ye for each actual value of X.
3) Subtract each Ye from the corresponding observed Y value to give the deviation (Y-Ye).
4) Square each deviation and add the squares over all N observations to get ∑(Y-Ye)².
5) Divide the sum ∑(Y-Ye)² from step 4 by N and then take the square root to get Syx, the standard
error of estimate.
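The five steps of the procedure map directly onto code. The sketch below is a minimal illustration in Python, using the orders and overtime data from Example 2; the function names are chosen for this sketch only, and the divisor is N as in the formula for Syx given in this section.

```python
from math import sqrt


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def standard_error_of_estimate(xs, ys):
    a, b = least_squares_line(xs, ys)                       # step 1: obtain a and b
    estimates = [a + b * x for x in xs]                     # step 2: Ye for each X
    deviations = [y - ye for y, ye in zip(ys, estimates)]   # step 3: Y - Ye
    sum_sq = sum(d * d for d in deviations)                 # step 4: sum of squares
    return sqrt(sum_sq / len(xs))                           # step 5: divide by N, root


orders = [22, 107, 55, 48, 92, 135, 32, 67, 83, 122]
overtime = [9, 42, 18, 11, 30, 48, 10, 29, 38, 51]
s_yx = standard_error_of_estimate(orders, overtime)
print(f"Syx = {s_yx:.2f} hours")
```

Working with the unrounded a and b inside the function avoids the small rounding differences that creep into a hand calculation.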

INTERPRETATION OF STANDARD ERROR Syx

1. Syx is a concept that is statistically parallel to the standard deviation Sy. The difference between
the two is that Sy measures the dispersion around the mean, while Syx measures the
dispersion around the line of best fit given by the regression line Ye = a + bx.
2. When Sy = 0, the corresponding Syx is also equal to 0.
3. Since Syx measures how close the observed values of Y are to Ye, it serves as a
measure of the reliability of the estimate. The closer the observed values of Y are to the
estimated values Ye, the smaller the error and the more reliable the estimate, and vice
versa. The fit is perfect when the value of Syx is zero.


Example 3

As part of an investigation into levels of overtime working, a company decides to tabulate the number of
orders received weekly and compare this with total weekly overtime worked, giving the following:

Week no.                1    2    3    4    5    6    7    8    9   10
Orders received        22  107   55   48   92  135   32   67   83  122
Total overtime (hrs)    9   42   18   11   30   48   10   29   38   51

Using the least squares method, obtain the regression line, estimate the level of overtime
required for 100 orders, and calculate the standard error of estimate.

X      Y      XY      X²      Y-Ye    (Y-Ye)²
22     9      198     484     2.37    5.60
107    42     4494    11449   0.98    0.96
55     18     990     3025    -1.98   3.93
48     11     528     2304    -6.15   37.84
92     30     2760    8464    -4.95   24.52
135    48     6480    18225   -4.35   18.89
32     10     320     1024    -0.68   0.46
67     29     1943    4489    4.16    17.32
83     38     3154    6889    6.69    44.75
122    51     6222    14884   3.91    15.31
763    286    27089   71237           169.59

The last two columns are completed once the regression line has been obtained, as described in the
procedure below. The total of (Y-Ye)² is computed from unrounded values.

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}} = \frac{10 \times 27089 - 763 \times 286}{10 \times 71237 - 763^{2}} = \frac{270890 - 218218}{712370 - 582169} = \frac{52672}{130201} = 0.404543744 \]

\[ a = \frac{\sum y - b\sum x}{n} = \frac{286 - 0.404543744 \times 763}{10} = \frac{286 - 308.666877}{10} = \frac{-22.666877}{10} = -2.2666877 \]

To 2 dp, b = 0.40 and a = -2.27.

Therefore the regression line is:

Y = -2.27 + 0.40X

To use the regression line to estimate Ye, we first replace Y with Ye in the equation. The new
equation is then used to calculate Ye by substituting the actual X values from the table. For better
accuracy, we use the unrounded values of a and b instead of the approximate values.

The new equation reads:

Ye = -2.2666877 + 0.404543744X

For 100 orders, the estimated overtime is Ye = -2.2666877 + 0.404543744(100) = 38.19 hours (to 2 dp).

Procedure

Step 1 Calculate the Ye values by substituting the actual X values from the table into the equation.

Step 2 Subtract each calculated Ye value from the actual Y value to give (Y-Ye), column five.

Step 3 Square each difference to give (Y-Ye)², column six.

Step 4 Add the individual squared differences to give ∑(Y-Ye)².

Step 5 Apply the formula

\[ S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^{2}}{N}} \]

as shown below:

\[ S_{yx} = \sqrt{\frac{169.59}{10}} = \sqrt{16.959} = 4.1181 = 4.12 \text{ (to 2 dp)} \]

Generally, the bigger the error of estimate, the less accurate the regression line is as an
estimator. Here Syx = 4.12 hours is small relative to the spread of the observed overtime values,
which range from 9 to 51 hours, so the regression line can be regarded as a reasonably good estimator
and can be relied upon.

4.4 Regression and Causality

Regression only measures the relationship between two variables; it does not establish causality. The
designation of variables as dependent and independent is a matter of convenience and does not
automatically establish causality. One should therefore not assume that X causes Y, nor assume
that no causal link exists between the two. Regression analysis reveals only the statistical
relationship between two variables, not cause and effect.

4.5 REGRESSION AND EXTRAPOLATION

Estimating the most likely value of the dependent variable Y for a value of the independent
variable X that lies between the lowest and highest observations is called interpolation. Results of
interpolation hold for values generated by the regression line between the lowest and highest
observed points. Extrapolation, on the other hand, extends the regression line beyond
the lowest or highest observed point. Its estimates are less reliable, because there is no evidence that
the linear relationship continues to hold outside the range of the observations.
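One simple safeguard when using a regression line is to record whether each estimate is an interpolation or an extrapolation. The Python sketch below does this for the orders and overtime data used in the worked examples of this handout; the function names and the test value of 200 orders are illustrative assumptions, not part of the original examples.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def estimate_y(xs, ys, x_new):
    """Return (estimate, kind), kind being 'interpolation' or 'extrapolation'."""
    a, b = least_squares_line(xs, ys)
    inside = min(xs) <= x_new <= max(xs)   # within the observed range of X?
    kind = "interpolation" if inside else "extrapolation"
    return a + b * x_new, kind


orders = [22, 107, 55, 48, 92, 135, 32, 67, 83, 122]
overtime = [9, 42, 18, 11, 30, 48, 10, 29, 38, 51]

y_100, kind_100 = estimate_y(orders, overtime, 100)   # 100 lies between 22 and 135
y_200, kind_200 = estimate_y(orders, overtime, 200)   # 200 is above the highest X
print(f"{y_100:.2f} hours ({kind_100})")
print(f"{y_200:.2f} hours ({kind_200})")
```

The second estimate should be treated with caution, since no week in the data had anywhere near 200 orders.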

4.6 CORRELATION

It is concerned with describing the strength of the relationship between two variables by measuring the
degree of ‘scatter’ of the data values.

4.6.1 The Correlation Coefficient

It is represented by r and measures the strength of the linear relationship between bivariate values. It
lies between -1 and +1, i.e. −1 ≤ r ≤ 1. If r = 0, there is no linear relationship; if r is less than or
greater than 0, a relationship exists. If r = -1 there is perfect negative correlation, meaning that
as one variable increases the other decreases along a straight line. If r = +1 there is perfect positive
correlation, meaning that as one variable increases the other also increases along a straight line.

4.6.2 Product Moment Correlation Coefficient

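This is given by the formula

\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \]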


Example 4

The data below relate weekly maintenance costs ($) to the age (in months) of 10 machines of
similar type in a manufacturing company. Calculate the product moment correlation coefficient
between age and cost.

Machine     1    2    3    4    5    6    7    8    9   10
Age (x)     5   10   15   20   30   30   30   50   50   60
Cost (y)  190  240  250  300  310  335  300  300  350  395

Solution

X      Y       XY       X²       Y²
5      190     950      25       36100
10     240     2400     100      57600
15     250     3750     225      62500
20     300     6000     400      90000
30     310     9300     900      96100
30     335     10050    900      112225
30     300     9000     900      90000
50     300     15000    2500     90000
50     350     17500    2500     122500
60     395     23700    3600     156025
300    2970    97650    12050    913050

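\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \]

\[ = \frac{10 \times 97650 - 300 \times 2970}{\sqrt{\left(10 \times 12050 - 300^{2}\right)\left(10 \times 913050 - 2970^{2}\right)}} \]

\[ = \frac{976500 - 891000}{\sqrt{30500 \times 309600}} = \frac{85500}{174.64 \times 556.42} \]

= 0.880 (to 3 dp)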
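The calculation in Example 4 can be reproduced with a short Python sketch. It applies the product moment formula to the age and cost data from the table; the function name product_moment_r is chosen for illustration only.

```python
from math import sqrt


def product_moment_r(xs, ys):
    """Product moment correlation coefficient for paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]                 # x, months
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]     # y, $ per week
r = product_moment_r(age, cost)
print(f"r = {r:.3f}")   # prints: r = 0.880
```

Swapping the two lists gives the same r, because the formula treats x and y symmetrically.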

4.6.3 Coefficient of Determination

It is the ratio of explained variation to total variation and is obtained by squaring the value of r. It
gives the proportion of all the variation in the y values that is explained by the variation in the x values.

\[ r^{2} = \frac{\text{Explained variation}}{\text{Total variation}} \]

In Example 4, r = 0.880, so r² = 0.774: about 77% of the variation in maintenance cost is explained by
the variation in machine age.
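To see that the ratio of explained to total variation really does equal r², the Python sketch below computes both variations directly for the Example 4 data. It is a minimal illustration: the function name is made up, and the slope is computed in its deviation form, ∑(x − x̄)(y − ȳ) / ∑(x − x̄)², which is algebraically equivalent to the least squares formula for b used earlier in this handout.

```python
def explained_and_total_variation(xs, ys):
    """Return (explained, total) variation in y for the least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx                      # slope (deviation form)
    a = mean_y - b * mean_x              # intercept
    total = sum((y - mean_y) ** 2 for y in ys)                # variation of Y about its mean
    explained = sum((a + b * x - mean_y) ** 2 for x in xs)    # variation of Ye about the mean
    return explained, total


age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]
explained, total = explained_and_total_variation(age, cost)
r_squared = explained / total
print(f"r squared = {r_squared:.3f}")   # prints: r squared = 0.774
```

The remaining share, 1 − r², is the variation in cost that the regression line leaves unexplained.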
Spearman's Rank Correlation Coefficient

Procedure

1. Rank the x values (to give rx values).
2. Rank the y values (to give ry values).
3. For each pair of ranks, calculate d² = (rx − ry)².
4. Calculate ∑d².
5. Apply the formula

\[ r_{s} = 1 - \frac{6\sum d^{2}}{n\left(n^{2} - 1\right)} \]

where n = number of bivariate (paired) observations and r_s denotes the rank correlation coefficient.

Example 5

The data below show the average rent and rates ($ per square metre) for a selection of areas.
Calculate Spearman's rank correlation coefficient to assess whether there is any correlation between
rent and rates.

Rates (x)   1.68   1.46   1.57   13.37   3.18   1.95   1.07   1.71   1.22    6.46
Rent (y)    3.81   4.19   4.87   22.85   6.47   6.48   2.66   6.49   5.33   15.23
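The ranking procedure can be worked through for the Example 5 data with the Python sketch below. It is an illustrative sketch only: the function names are made up, ranks are assigned from 1 for the smallest value, and it assumes there are no tied values (true for this data set); tied values would need average ranks, which this sketch does not handle.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rank(xs, ys):
    """Return (rank correlation coefficient, sum of squared rank differences)."""
    rx, ry = ranks(xs), ranks(ys)                          # steps 1 and 2
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))     # steps 3 and 4
    n = len(xs)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


rates = [1.68, 1.46, 1.57, 13.37, 3.18, 1.95, 1.07, 1.71, 1.22, 6.46]
rent = [3.81, 4.19, 4.87, 22.85, 6.47, 6.48, 2.66, 6.49, 5.33, 15.23]
r_s, sum_d2 = spearman_rank(rates, rent)
print(sum_d2, round(r_s, 3))   # prints: 26 0.842
```

A value of about 0.84 points to a strong positive association between the ranking of areas by rates and their ranking by rent.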

Uses of Rank Correlation Coefficient

1. It is used as an approximation of the product moment correlation coefficient.
2. It is used to measure correlation between non-numeric variables, especially if these can be
ranked in some way.


Revision Questions

One

The following data relate to the length of service and monthly salary of the employees of an
organisation. Obtain

a) The regression line for the data (15 marks)
b) The coefficient of correlation and comment on strength of relationship (10 marks)
c) Standard error of estimate (5 marks)

Length of service (yrs)     11      7      2      5      8      6     10
Salary ($)                7000   5000   3000   2000   6000   4000   8000

Two

Ten-year data on the price (in RTGS$ per unit) of a commodity sold through a retail store and the
corresponding sales (in thousand units) are given below:

Price    4    5    5    7    6    8   10   12   15   16
Sales   10   12   13   16   20   22   20   25   28   30

a) Come up with the regression line (25 marks)
b) If the price was increased to 20, what would be the expected sales? (5 marks)

Three

Ten-year data on the price ($ per kg) of a commodity sold through a consumer cooperative store and the
corresponding sales (in thousand units) are given below:

Price    4    5    5    7    6    8   10   12   15   16
Sales   10   12   13   16   20   22   20   25   28   30

a) Establish the line of best fit for the data (10 marks)
b) Calculate the standard error of estimate for the line of best fit (10 marks)
c) Compute the coefficient of correlation, using the product-moment formula (10 marks)
