UNIT 18 CORRELATION
Objectives
After completion of this unit, you should be able to:
Structure
18.1 Introduction
18.2 The Correlation Coefficient
18.3 Testing for the Significance of the Correlation Coefficient
18.4 Rank Correlation
18.5 Practical Applications of Correlation
18.6 Auto-correlation and Time Series Analysis
18.7 Summary
18.8 Self-assessment Exercises
18.9 Key Words
18.10 Further Readings
18.1 INTRODUCTION
We often encounter situations where data appears as pairs of figures relating to two
variables. A correlation problem considers the joint variation of two measurements
neither of which is restricted by the experimenter. The regression problem, which is
treated in Unit 19, considers the frequency distributions of one variable (called the
dependent variable) when another (independent variable) is held fixed at each of
several levels.
Examples of correlation problems are found in the study of the relationship between
IQ and the aggregate percentage of marks obtained by a person in the SSC examination,
between blood pressure and metabolism, or between the height and weight of individuals. In
these examples both variables are observed as they naturally occur, since neither
variable is fixed at predetermined levels.
Examples of regression problems can be found in the study of the yields of crops
grown with different amounts of fertiliser, the length of life of certain animals exposed
to different amounts of radiation, the hardness of plastics which are heat-treated for
different periods of time, and so on. In these problems the variation in one
measurement is studied for particular levels of the other variable selected by the
experimenter. Thus the factors or independent variables in regression analysis are not
assumed to be random variables, though the dependent variable is modelled as a
random variable for which intervals of given precision and confidence are often
worked out. In correlation analysis, all variables are assumed to be random variables.
For example, we may have figures on advertisement expenditure (X) and sales (Y) of
a firm for the last ten years, as shown in Table 1. When this data is plotted on a graph
as in Figure I, we obtain a scatter diagram. A scatter diagram gives two very useful
types of information. First, we can observe patterns between variables that indicate
whether the variables are related. Secondly, if the variables are related we can get an
idea of what kind of relationship (linear or non-linear) would describe the
relationship.
Table 1

Year    Advertisement Expenditure    Sales in thousand
        in thousand Rs. (X)          Rs. (Y)
1988            50                        700
1987            50                        650
1986            50                        600
1985            40                        500
1984            30                        450
1983            20                        400
1982            20                        300
1981            15                        250
1980            10                        210
1979             5                        200
Correlation examines the first question of determining whether an association exists
between the two variables and, if it does, to what extent. Regression examines the
second question of establishing an appropriate relation between the variables.
Figure I: Scatter Diagram
The scatter diagram may exhibit different kinds of patterns. Some typical patterns
indicating different correlations between two variables are shown in Figure II.
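As a concrete illustration, a scatter diagram like Figure I can be reproduced from the Table 1 data with a few lines of code. This is a minimal sketch assuming Python with matplotlib, neither of which is prescribed by this unit:

```python
import matplotlib.pyplot as plt

# Table 1 data: advertisement expenditure (X) and sales (Y), in thousand Rs.
x = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]

plt.scatter(x, y)
plt.xlabel("Advertisement expenditure in thousand Rs. (X)")
plt.ylabel("Sales in thousand Rs. (Y)")
plt.title("Scatter diagram of sales against advertisement expenditure")
plt.show()
```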
What we shall study next is a precise and quantitative measure of the degree of
association between two variables: the correlation coefficient.
18.2 THE CORRELATION COEFFICIENT

The degree of linear association between two variables X and Y is measured by the correlation coefficient

r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \, \sigma_X \, \sigma_Y}        ... (18.1)

where r is the correlation coefficient between X and Y, σ_X and σ_Y are the standard deviations of X and Y respectively, X̄ and Ȳ are their means, and n is the number of values of the pair of variables X and Y.
Activity A
Suggest five pairs of variables which you expect to be positively correlated.
Activity B
Suggest five pairs of variables which you expect to be negatively correlated.
This value of r (= 0.976) indicates a high degree of association between the variables
X and Y. For this particular problem, it indicates that an increase in advertisement
expenditure is likely to yield higher sales.
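Since the detailed computation of Table 2 is not reproduced here, the following sketch (in Python, purely for illustration) computes r directly from the definition in equation (18.1) for the Table 1 data:

```python
import math

# Table 1 data: advertisement expenditure (X) and sales (Y), thousand Rs.
x = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n

# Standard deviations with divisor n, as used in equation (18.1)
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

# r = sum of cross-deviations / (n * sd_x * sd_y)
r = sum((xi - mean_x) * (yi - mean_y)
        for xi, yi in zip(x, y)) / (n * sd_x * sd_y)
print(round(r, 3))  # prints 0.976
```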
You may have noticed that in carrying out calculations for the correlation coefficient
in Table 2, large values for x² and y² resulted in a great computational burden.
Computations can be simplified by calculating the deviations of the observations
from an assumed average rather than the actual average, and also by
scaling these deviations conveniently. To illustrate this short-cut procedure, let us
compute the correlation coefficient for the same data. We shall take U to be the
deviation of X values from the assumed mean of 30 divided by 5. Similarly, V
represents the deviation of Y values from the assumed mean of 400 divided by 10.
Table 3
Short-cut Procedure for Calculation of Correlation Coefficient

S.No.    X     Y     U     V    UV    U²     V²
1.      50   700    4    30   120    16    900
2.      50   650    4    25   100    16    625
3.      50   600    4    20    80    16    400
4.      40   500    2    10    20     4    100
5.      30   450    0     5     0     0     25
6.      20   400   -2     0     0     4      0
7.      20   300   -2   -10    20     4    100
8.      15   250   -3   -15    45     9    225
9.      10   210   -4   -19    76    16    361
10.      5   200   -5   -20   100    25    400
Total              -2    26   561   110  3,136

In terms of the coded variables, the correlation coefficient is

r = \frac{n \sum UV - (\sum U)(\sum V)}{\sqrt{[n \sum U^2 - (\sum U)^2][n \sum V^2 - (\sum V)^2]}}        ... (18.2)

  = \frac{10(561) - (-2)(26)}{\sqrt{[10(110) - (-2)^2][10(3136) - (26)^2]}} = \frac{5662}{\sqrt{1096 \times 30684}} = 0.976

which is the same value obtained earlier, since the correlation coefficient is unaffected by a linear change of origin and scale.
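As a cross-check on the short-cut procedure, a small sketch (again Python, as an illustration only) reproduces r from the coded totals of Table 3:

```python
import math

u = [4, 4, 4, 2, 0, -2, -2, -3, -4, -5]         # U = (X - 30) / 5
v = [30, 25, 20, 10, 5, 0, -10, -15, -19, -20]  # V = (Y - 400) / 10
n = len(u)

sum_u, sum_v = sum(u), sum(v)                   # -2 and 26
sum_uv = sum(ui * vi for ui, vi in zip(u, v))   # 561
sum_u2 = sum(ui ** 2 for ui in u)               # 110
sum_v2 = sum(vi ** 2 for vi in v)               # 3136

# Equation (18.2): r from the coded totals
r = (n * sum_uv - sum_u * sum_v) / math.sqrt(
    (n * sum_u2 - sum_u ** 2) * (n * sum_v2 - sum_v ** 2)
)
print(round(r, 3))  # prints 0.976, matching the direct computation
```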
18.3 TESTING FOR THE SIGNIFICANCE OF THE CORRELATION COEFFICIENT

The significance of an observed sample correlation coefficient can be examined in two ways:

i) Providing confidence limits for the population correlation coefficient from the
sample size n and the sample correlation coefficient r. If this confidence interval
includes the value zero, then we say that r is not significant, implying thereby
that the population correlation coefficient may be zero and the value of r may be
due to sampling variability.
ii) Testing the null hypothesis that the population correlation coefficient equals zero
against the alternative hypothesis that it does not, by using the t-statistic.
The use of both these procedures is now illustrated.
The value of the sample correlation coefficient is used as an estimate of the true
population correlation ρ. It is desirable to include a confidence interval for the true
value along with the sample statistic. There are several methods for obtaining the
confidence interval for ρ. However, the most straightforward method is to use a chart
such as that shown in Figure III.
Figure III: Confidence Bands for the Population Correlation
Once r has been calculated, the chart can be used to determine the upper and lower
values of the interval for the sample size used. In this chart the range of unknown
values of ρ is shown on the vertical scale, while the sample r values are shown on the
horizontal axis, with a number of curves for selected sample sizes. Notice that for
every sample size there are two curves. To read the 95% confidence limits for an
observed sample correlation coefficient of 0.8 for a sample of size 10, we simply
look along the horizontal axis for a value of 0.8 (the sample correlation coefficient)
and construct a vertical line from there till it intersects the first curve for n = 10. This
happens for ρ = 0.2, which is the lower limit of the confidence interval. Extending the
vertical line upwards, it again intersects the second n = 10 curve at ρ = 0.92, which
represents the upper confidence limit. Thus the 95% confidence interval for the
population correlation coefficient becomes 0.2 ≤ ρ ≤ 0.92.
If a confidence interval for ρ includes the value zero, then r is not considered
significant since that value of r may be due to nothing more than sampling
variability.
This method of using charts to determine the confidence intervals is convenient,
though of course we must use a different chart for different confidence limits (e.g.
90%, 95%, 99%).
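Where a chart is not at hand, Fisher's z-transformation (a standard large-sample technique, not the chart method of this unit) gives an approximate confidence interval for ρ. The sketch below assumes Python; note that its limits will differ somewhat from values read off the curves of Figure III:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for the population correlation."""
    z = math.atanh(r)             # Fisher transform of the sample r
    se = 1.0 / math.sqrt(n - 3)   # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

print(fisher_ci(0.8, 10))  # roughly (0.34, 0.95) for r = 0.8, n = 10
```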
The alternative approach for testing the significance of r is to use the statistic

t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}        ... (18.3)

which follows a t-distribution with (n - 2) degrees of freedom when the population correlation coefficient is zero.
Referring to the table of the t-distribution for (n - 2) degrees of freedom, we can find the
critical value of t at any desired level of significance (the 5% level of significance is
commonly used). If the calculated value of t (as obtained by equation 18.3) is less
than or equal to the table value, we accept the null hypothesis (H₀: the population correlation
coefficient equals zero), meaning that the correlation between the variables is not
significantly different from zero.
For example, for a sample correlation coefficient of 0.2 based on a sample of size 10,

t = \frac{0.2\sqrt{8}}{\sqrt{1 - 0.04}} = 0.577

and from the t-distribution with 8 degrees of freedom for a 5% level of significance,
the table value is 2.306. Since 0.577 is less than 2.306, we conclude that this r of 0.2
for n = 10 is not significantly different from zero.
It should be mentioned here that if the same value of the correlation coefficient
of 0.2 were obtained on a sample of size 100, then

t = \frac{0.2\sqrt{98}}{\sqrt{1 - 0.04}} = 2.02

and the table value for a t-distribution with 98 degrees of freedom and a 5% level
of significance is 1.99. Since the calculated t exceeds this figure of 1.99, we can
conclude that this correlation coefficient of 0.2 on a sample of size 100 could be
considered significantly different from zero, or alternatively that there is statistically
significant association between the variables.
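Both worked examples can be reproduced with a short sketch (Python with scipy assumed, for illustration only):

```python
import math
from scipy import stats  # used here only to look up the t table

def t_for_r(r, n):
    """Equation (18.3): t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

for n in (10, 100):
    t = t_for_r(0.2, n)
    crit = stats.t.ppf(0.975, df=n - 2)  # two-sided 5% critical value
    verdict = "significant" if abs(t) > crit else "not significant"
    print(f"n = {n:3d}: t = {t:.3f}, critical value = {crit:.3f} -> {verdict}")
# n = 10 gives t ~ 0.58 (< 2.306); n = 100 gives t ~ 2.02 (> 1.99)
```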
18.4 RANK CORRELATION

When data are available in the form of ranks rather than actual measurements, the
degree of association between two variables can be measured by Spearman's rank
correlation coefficient

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

where n is the number of pairs of observations and d_i is the difference in ranks for the
ith observation set.
Suppose the ranks obtained by a set of ten students in a Mathematics test (variable X)
and a Physics test (variable Y) are as shown below:

Table 4
Ranks of ten students in Mathematics (X) and Physics (Y); the squared rank
differences d² for the ten students are 4, 1, 1, 4, 1, 9, 1, 4, 16 and 9, so that Σd² = 50.

Hence

r_s = 1 - \frac{6(50)}{10(10^2 - 1)} = 1 - \frac{300}{990} = 0.697
We can thus say that there is a high degree of correlation between the performance in
Mathematics and Physics.
We can also test the significance of the value obtained. The null hypothesis H₀ is that
the two variables are not associated in the population, i.e. the population rank
correlation coefficient is zero, and that the observed value of r_s differs from zero
only by chance. The t-statistic that is used to test this is

t = \frac{r_s \sqrt{n-2}}{\sqrt{1 - r_s^2}} = \frac{0.697\sqrt{8}}{\sqrt{1 - 0.486}} = 2.75
Referring to the table of the t-distribution for n-2 = 8 degrees of freedom, the critical
value for t at a 5% level of significance is 2.306. Since the calculated value of t is
higher than the table value, we reject the null hypothesis concluding that the
performances in Mathematics and Physics are closely associated.
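A sketch reproducing both r_s and the t-test for this example (Python, illustrative only; the d² values are those of Table 4):

```python
import math

d_squared = [4, 1, 1, 4, 1, 9, 1, 4, 16, 9]  # d_i^2 from Table 4
n = len(d_squared)

# Spearman's rank correlation coefficient
r_s = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

# t-statistic with n - 2 degrees of freedom
t = r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)
print(round(r_s, 3), round(t, 2))  # 0.697 and about 2.75 (> 2.306)
```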
When two or more items have the same rank, a correction has to be applied to
Σd_i². For example, if the ranks of X are 1, 2, 3, 3, 5, ..., showing that there are two
items with the same 3rd rank, then instead of writing 3 we write 3½ for each, so that
the sum of these ranks is 7 and the mean of the ranks is unaffected. But in such cases
the standard deviation is affected, and therefore a correction is required. For this,
Σd_i² is increased by (t³ - t)/12 for each tie, where t is the number of items in each
tie.
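As an illustrative sketch of this correction (Python; the tie structure here is hypothetical, not that of Table 4 or Activity D):

```python
def spearman_with_ties(d_squared, tie_group_sizes, n):
    # Each group of t tied ranks adds (t^3 - t)/12 to the sum of d_i^2
    correction = sum((t ** 3 - t) / 12 for t in tie_group_sizes)
    adjusted = sum(d_squared) + correction
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))

# e.g. one pair of tied ranks (t = 2) adds (8 - 2)/12 = 0.5 to sum(d^2)
print(spearman_with_ties([4, 1, 1, 4, 1, 9, 1, 4, 16, 9], [2], 10))
```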
Activity D
Suppose the ranks in Table 4 were tied as follows: Individuals 3 and 4 both ranked
3rd in Maths and individuals 6, 7 and 8 ranked 8th in Physics. Assuming that other
rankings remain unaltered, compute the value of Spearman's rank correlation.
18.5 PRACTICAL APPLICATIONS OF CORRELATION

A major application of correlation is in identifying, from among many candidates,
the variables relevant for building a regression model.
Correlation is also used in factor analysis, wherein attempts are made to resolve a
large set of measured variables in terms of relatively few new categories, known as
factors. The results could be useful in the following three ways:
i) to reveal the underlying or latent factors that determine the relationship between
the observed data,
ii) to make evident relationships between data that had been obscured before such
analysis, and
iii) to provide a classification scheme when data scored on various rating scales have
to be grouped together.
Another major application of correlation is in forecasting with the help of time series
models. In using past data (which is often a time series of the variable of interest
available at equal time intervals) one has to identify the trend, seasonality and
random pattern in the data before an appropriate forecasting model can be built. The
notion of auto-correlation and plots of auto-correlation for various time lags help one
to identify the nature of the underlying process. Details of time series analysis are
discussed in Unit 20. However, some fundamental concepts of auto-correlation and
its use for time series analysis are outlined below.
18.6 AUTO-CORRELATION AND TIME SERIES ANALYSIS

Auto-correlation measures the association between the values of a variable and the
values of the same variable a fixed number of periods (the time lag) apart.
One could construct from one variable another time-lagged variable which is twelve
periods removed. If the data consists of monthly figures, a twelve-month time lag
will show how values of the same month but of different years correlate with each
other. If the auto-correlation coefficient is positive, it implies that there is a seasonal
pattern of twelve months' duration. On the other hand, a near-zero auto-correlation
indicates the absence of a seasonal pattern. Similarly, if there is a trend in the data,
values next to each other will relate, in the sense that if one increases, the other too
will tend to increase in order to maintain the trend. Finally, in case of completely
random data, all auto-correlations will tend to zero (or not significantly different
from zero).
The formula for the auto-correlation coefficient at time lag k is

r_k = \frac{\sum_{t=1}^{n-k} (X_t - \bar{X})(X_{t+k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}

where
r_k denotes the auto-correlation coefficient for time lag k,
k denotes the length of the time lag,
n is the number of observations,
X_t is the value of the variable at time t, and
X̄ is the mean of all the data.
Using the data of Figure IV the calculations can be illustrated.
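Since the data of Figure IV is not reproduced here, the sketch below applies the formula above to an illustrative series (Python, for demonstration only); a trending series gives a large positive r_1:

```python
def autocorr(x, k):
    """r_k = sum over t of (x_t - xbar)(x_{t+k} - xbar) / sum of (x_t - xbar)^2"""
    n = len(x)
    xbar = sum(x) / n
    num = sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

series = [200, 210, 250, 300, 400, 450, 500, 600, 650, 700]  # illustrative
print([round(autocorr(series, k), 3) for k in range(1, 4)])
```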
A plot of the auto-correlations for various lags is often made to identify the nature of
the underlying time series. We, however, reserve the detailed discussion on such
plots and their use for time series analysis for Unit 20.
18.7 SUMMARY
In this unit the concept of correlation or the association between two variables has
been discussed. A scatter plot of the variables may suggest that the two variables are
related but the value of the Pearson correlation coefficient r quantifies this
association. The correlation coefficient r may assume values between -1 and 1. The
sign indicates whether the association is direct (+ve) or inverse (-ve). A numerical
value of r equal to unity indicates perfect association while a value of zero indicates
no association.
Tests for significance of the correlation coefficient have been described. Spearman's
rank correlation for data with ranks is outlined. Applications of correlation in
identifying relevant variables for regression, factor analysis and in forecasting using
time series have been highlighted. Finally, the concept of auto-correlation is defined
and illustrated for use in time series analysis.
18.8 SELF-ASSESSMENT EXERCISES

1) What do you understand by the term correlation? Explain how the study of
correlation helps in forecasting demand of a product.
2) A company wants to study the relation between R&D expenditure (X) and annual
profit (Y). The following table presents the information for the last eight years:
Year                 1988  1987  1986  1985  1984  1983  1982  1981
Annual Profit (Y)      45    42    41    60    30    34    25    20
a)
b)
c) What are the 95% confidence limits for the population correlation coefficient?
d)
3) The following data pertains to the length of service (in years) and the annual income
for a sample of ten employees of an industry:

Compute the correlation coefficient between X and Y and test its significance at
the 0.01 and 0.05 levels.
4) Twelve salesmen are ranked for efficiency and length of service as below:
Salesman                A   B   C   D   E   F   G   H   I   J   K   L
Efficiency (X)          1   2   3   5   5   5   7   8   9  10  11  12
Length of Service (Y)   2   1   5   3   9   7   7   6   4  11  10  11
a)
b)
5) An alternative definition of the correlation coefficient between a two-dimensional
random variable (X, Y) is

\rho = \frac{E\{[X - E(X)][Y - E(Y)]\}}{\sqrt{V(X)\,V(Y)}}
where E(·) represents expectation and V(·) the variance of the random variable. Show
that the above expression can be simplified as follows:

\rho = \frac{E(XY) - E(X)E(Y)}{\sqrt{V(X)\,V(Y)}}

(Notice here that the numerator is called the covariance of X and Y.)
6) In studying the relationship between the index of industrial production and the index
of security prices, the following data from the Economic Survey 1980-81
(Government of India publication) was collected:

Year                          70-71  71-72  72-73  73-74  74-75  75-76  76-77  77-78  78-79
Index of Industrial
Production (1970 = 100)
Index of Security
Prices (1970-71 = 100)
a)
b)
7) Compute and plot the first five auto-correlations (i.e. up to time lag 5 periods) for
the time series given below:
18.10 FURTHER READINGS

Box, G.E.P. and G.M. Jenkins, 1976. Time Series Analysis, Forecasting and
Control, Holden-Day: San Francisco.
Draper, N. and H. Smith, 1966. Applied Regression Analysis, John Wiley: New
York.
Edwards, B. 1980. The Readable Maths and Statistics Book, George Allen and
Unwin: London.
Makridakis, S. and S. Wheelwright, 1978. Interactive Forecasting: Univariate and
Multivariate Methods, Holden-Day: San Francisco.
Peters, W.S. and G.W. Summers, 1968. Statistical Analysis for Business Decisions,
Prentice-Hall: Englewood Cliffs.
Srivastava, U.K., G.V. Shenoy and S.C. Sharma, 1987. Quantitative Techniques for
Managerial Decision Making, Wiley Eastern: New Delhi.
Stevenson, W.J., 1978. Business Statistics: Concepts and Applications, Harper and
Row: New York.