Chapter 4 Correlational Analysis
CORRELATIONAL ANALYSIS
Topic Outline:
1. Introduction
2. Hypothesis Testing for Correlation
3. Pearson Product-Moment Correlation (Pearson r)
4. Spearman Rank Correlation (Spearman rho, ρ)
5. Gamma Correlation (G)
6. Point-Biserial Correlation (rpb)
7. Lambda Correlation (λ)
8. Chi-Square (χ²) Tests
Learning Outcomes:
At the end of the unit, the students must have:
1. discussed the conditions imposed by each measure of
relationship/associations;
2. computed and interpreted each measure of relationship;
3. performed hypothesis testing involving each of the measures of relationship;
and
4. differentiated multiple from partial correlation.
Prepared by:
Prof. Jeanne Valerie Agbayani-Agpaoa
STAT 201: Statistical Methods I
Dr. Virgilio Julius P. Manzano, Jr.
Engr. Lawrence John C. Tagata
CHAPTER IV: CORRELATIONAL ANALYSIS
TOPIC 1: INTRODUCTION
A measure of correlation or relationship is used to find the amount and degree of relationship, or the absence of relationship, between two sets of values, characteristics, or variables. This relationship is expressed by a factor called the coefficient of correlation. It is an abstract number, the ratio of the two values, series of values, or variables being compared, and it can also be expressed as a percent.
Correlation is a measure of the degree of relationship between paired data. Much statistical research aims to establish the relationship between paired variables so that the researcher can predict one variable in terms of the other. For example, high grades in Science and English tend to go with high grades in Mathematics. In other instances the relationship may be weak or absent altogether; for example, the volume of candy sales tends to be unrelated to the crime rate in a particular place. It must be remembered that correlation does not establish cause and effect; it merely measures the strength of the relationship between paired data.
Simple correlation is amenable to either ungrouped or grouped data, for nominal, ordinal, or interval
scales of data. Usually, however, rank correlation is aptly applied to ordinal data when the number of items or
cases is rather small (less than 30).
The term correlation refers to the association which occurs between two or more statistical series of values. The coefficient of correlation shows the extent to which two variables are related, and to what extent variations in one set of data go with variations in the other. It is a single number that tells us how closely two sets of values are related. It can vary from +1.00, which means perfect positive correlation, through 0, which means no correlation at all, to −1.00, which means perfect negative correlation.
Perfect positive correlation refers to a direct relationship between two sets of data: any increase in the values of the first set is accompanied by a corresponding increase in the second set. When correlation is negative, an inverse behaviour of the data is observed: an increase in the values of the first set is accompanied by a decrease in the second set, or vice versa. When the two sets of data change with little or no connection to each other, there is little or no correlation at all.
The coefficient of correlation does not directly give anything like a percentage of relationship. It cannot be concluded that a correlation of 0.50 indicates twice the relationship indicated by a correlation of 0.25. A coefficient of correlation is an index number, not a measurement on an interval scale. Moreover, we cannot compute a coefficient of correlation from just two measurements on one person alone.
Anybody who wants to interpret the result of the coefficient of correlation should be guided by the
following reminders:
1. The relationship of two variables does not necessarily mean that one is the cause or the effect of the other
variable. It does not imply cause-effect relationship.
2. When the computed r is high, it does not necessarily mean that one factor is strongly dependent on the other. Height and intelligence, for example, may correlate in a sample, yet interpreting one as depending on the other makes no sense at all. On the other hand, when the computed r is small, it does not necessarily mean that one factor has no dependence on the other. This may apply to IQ and grades in school: a low grade may simply mean that a student did not make good use of his study time.
3. If there is reason to believe that the two variables are related and the computed r is high, the two variables really are associated. On the other hand, if the computed correlation is low (though the variables are theoretically related), other factors might be responsible for the small association.
4. Lastly, the correlation coefficient simply informs us that when two variables change together, the relationship between them may be strong or weak.
Example: Consider the values of x and y on the descriptive problem, “What is the relationship
between the NSAT percentile rank and the scholastic rating of BS Physics students in
selected universities and colleges in a certain region?”
Student 1 2 3 4 5 6 7 8 9 10
NSAT Percentile Rank, x 60 73 61 70 75 79 65 67 77 80
Scholastic Rating, y 78 87 80 86 87 90 85 84 89 90
Student   NSAT Percentile Rank, x   Scholastic Rating, y   x²   y²   xy
1 60 78 3,600 6,084 4,680
2 73 87 5,329 7,569 6,351
3 61 80 3,721 6,400 4,880
4 70 86 4,900 7,396 6,020
5 75 87 5,625 7,569 6,525
6 79 90 6,241 8,100 7,110
7 65 85 4,225 7,225 5,525
8 67 84 4,489 7,056 5,628
9 77 89 5,929 7,921 6,853
10 80 90 6,400 8,100 7,200
Totals 707 856 50,459 73,420 60,772
Interpretation:
The rxy value obtained is 0.9596, which denotes a very high positive relationship. This means the higher the NSAT percentile rank, the higher the scholastic rating of the BS Physics students.
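The value above comes from the Pearson product-moment formula, r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}. As an illustrative sketch (not part of the original computation), a few lines of Python reproduce it from the column totals:

```python
import math

# NSAT percentile rank (x) and scholastic rating (y) for the 10 students
x = [60, 73, 61, 70, 75, 79, 65, 67, 77, 80]
y = [78, 87, 80, 86, 87, 90, 85, 84, 89, 90]
n = len(x)

sum_x, sum_y = sum(x), sum(y)               # 707, 856
sum_x2 = sum(v * v for v in x)              # 50,459
sum_y2 = sum(v * v for v in y)              # 73,420
sum_xy = sum(a * b for a, b in zip(x, y))   # 60,772

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))  # 0.9596
```

The script simply evaluates the raw-score formula, so its totals can be checked term by term against the table.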
Example: Consider the specific problem: “What is the rank relationship between capital and profit of light bulbs?”
Item   Capital, x   Profit, y   Rx   Ry   D = |Rx − Ry|   D²
1 20,000 5,000 6 7 1 1
2 50,000 15,000 3 3.5 0.5 0.25
3 10,000 3,000 9 9.5 0.5 0.25
4 100,000 30,000 2 2 0 0
5 15,000 4,000 7 8 1 1
6 25,000 9,000 5 5 0 0
7 11,000 6,000 8 6 2 4
8 150,000 70,000 1 1 0 0
9 5,000 3,000 10 9.5 0.5 0.25
10 40,000 15,000 4 3.5 0.5 0.25
TOTAL 7.0
rs = 1 − (6 ΣD²)/(n³ − n) = 1 − (6 × 7)/(10³ − 10) = 0.9576
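As an illustrative sketch, the rank computation can be reproduced in Python; the small helper below assigns average ranks to tied values, exactly as done in the Rx and Ry columns:

```python
def avg_ranks_desc(values):
    """Rank from highest (rank 1) to lowest, giving tied values their average rank."""
    srt = sorted(values, reverse=True)
    return [srt.index(v) + (srt.count(v) + 1) / 2 for v in values]

capital = [20000, 50000, 10000, 100000, 15000, 25000, 11000, 150000, 5000, 40000]
profit  = [5000, 15000, 3000, 30000, 4000, 9000, 6000, 70000, 3000, 15000]

rx = avg_ranks_desc(capital)
ry = avg_ranks_desc(profit)
n = len(capital)

sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # 7.0
rs = 1 - 6 * sum_d2 / (n ** 3 - n)
print(round(rs, 4))  # 0.9576
```

Note the two tied profit values (15,000 and 3,000) each receive the average of the ranks they would otherwise occupy (3.5 and 9.5).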
Example: Compute for the gamma for the data shown below
Socio-Economic Educational Status
Status Upper Middle Lower
Upper 24 19 5
Middle 12 54 29
Lower 9 26 25
Solution:
Step 1. Arrange the ordering for one of the two characteristics from the highest to the lowest (or vice versa) from top to bottom through the rows, and for the other characteristic from the highest to the lowest (or vice versa) from left to right through the columns.
Step 2. Compute Ns by multiplying the frequency in every cell by the sum of the frequencies in all of the other cells that lie both below and to the right of that cell, and then summing up the products obtained.
Ns = 24*(54 + 29 + 26 + 25) + 19*(29 + 25) + 12*(26 + 25) + 54*(25)
Ns = 6,204
Step 3. To solve for Ni, partially reverse the process described in Step 2: multiply the frequency of every cell by the sum of the frequencies in all of the cells that lie both below and to the left of that cell, and then sum up the products obtained.
Ni = 19*(12 + 9) + 5*(12 + 54 + 9 + 26) + (54*9) + 29*(9 + 26)
Ni = 2,405
Step 4. Compute the gamma coefficient: G = (Ns − Ni)/(Ns + Ni) = (6,204 − 2,405)/(6,204 + 2,405) = 0.4413, a positive association between socio-economic and educational status.
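The counts above, together with the standard Goodman–Kruskal formula G = (Ns − Ni)/(Ns + Ni), can be verified with a short Python sketch (illustrative only); the nested loops count the below-right (concordant) and below-left (discordant) products generically:

```python
# Socio-economic status (rows) vs educational status (columns), both ordered high to low
table = [
    [24, 19, 5],   # Upper
    [12, 54, 29],  # Middle
    [9, 26, 25],   # Lower
]
R, C = len(table), len(table[0])

Ns = Ni = 0
for i in range(R):
    for j in range(C):
        for k in range(i + 1, R):          # only rows below the current cell
            for l in range(C):
                if l > j:                  # below and to the right: concordant
                    Ns += table[i][j] * table[k][l]
                elif l < j:                # below and to the left: discordant
                    Ni += table[i][j] * table[k][l]

G = (Ns - Ni) / (Ns + Ni)
print(Ns, Ni, round(G, 4))  # 6204 2405 0.4413
```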
Example:
A researcher wishes to determine if a significant relationship exists between the sex of a worker and how long the worker has been performing an electronics assembly task. The independent variable is the answer to the question “What is your sex, male or female?” (dichotomous). The dependent variable is the answer to the question “How many years have you been performing the task?” (ratio).
Respondent 1 2 3 4 5 6 7 8 9 10
Sex M M M M F F M F F F
Number of years 10 11 6 11 4 3 12 2 2 1
Males Females
10 4
11 3
6 2
11 2
12 1
Mean 10.0 2.4
Standard deviation of all ten observations (sample): 4.37
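The point-biserial coefficient itself is not worked out above; one common form of the formula is rpb = [(M1 − M0)/σx]·√(pq), where M1 and M0 are the two group means, σx is the population standard deviation of all n scores, and p and q are the proportions of cases in each group. The sketch below uses that assumed form; with it, rpb equals the Pearson r computed after coding male = 1, female = 0:

```python
import math

# 1 = male, 0 = female; years spent performing the task
sex   = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
years = [10, 11, 6, 11, 4, 3, 12, 2, 2, 1]
n = len(years)

m1 = sum(y for s, y in zip(sex, years) if s == 1) / sex.count(1)  # 10.0
m0 = sum(y for s, y in zip(sex, years) if s == 0) / sex.count(0)  # 2.4
p = sex.count(1) / n
q = 1 - p

mean = sum(years) / n
sd_pop = math.sqrt(sum((y - mean) ** 2 for y in years) / n)  # population SD

r_pb = (m1 - m0) / sd_pop * math.sqrt(p * q)
print(round(r_pb, 4))  # 0.9173
```

The very high positive value indicates that, in this sample, male workers have been performing the task far longer than female workers.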
Formula:
λc = (ΣFbi − Mbc) / (N − Mbc)
where:
ΣFbi = the sum, over all of the rows, of the biggest cell frequency in the ith row
Mbc = the biggest of the column totals
N = the number of observations
However, if your dependent variable is regarded as the row variable, the formula to be used is:
λr = (ΣFbj − Mbr) / (N − Mbr)
where:
ΣFbj = the sum, over all of the columns, of the biggest cell frequency in the jth column
Mbr = the biggest of the row totals
N = the number of observations
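As a hypothetical illustration of λc (reusing, purely for demonstration, the socio-economic vs. educational status table from the gamma example):

```python
# Rows: socio-economic status; columns: educational status
table = [
    [24, 19, 5],
    [12, 54, 29],
    [9, 26, 25],
]

N = sum(sum(row) for row in table)                # 203 observations
sum_Fbi = sum(max(row) for row in table)          # 24 + 54 + 26 = 104
col_totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
Mbc = max(col_totals)                             # biggest column total = 99

lam_c = (sum_Fbi - Mbc) / (N - Mbc)
print(round(lam_c, 4))  # 0.0481
```

The small value here says that knowing a case's row barely improves prediction of its column category over always guessing the modal column.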
The chi-square distribution was introduced into statistical testing by Karl Pearson to determine whether or not discrepancies between observed and theoretical counts were significant. The test used to find out how well an observed frequency distribution conforms to, or fits, some theoretical frequency distribution is referred to as a “goodness of fit” test.
Also, chi-square distribution can be used to test the normality of any distribution. Testing a hypothesis
made about several population proportions are sometimes considered. In this section, a discussion for testing the
normality with the use of chi-square is being emphasized.
On the other hand, tables of rows and columns are often called contingency tables. This topic is equally important: it helps us determine whether the two classification variables are independent. The critical value of chi-square varies with the number of degrees of freedom, and one of the assumptions that apply to a contingency table is that every category has an expected frequency of at least 5.
USES OF CHI-SQUARE
1. Chi-square is used in descriptive research if the researcher wants to determine the significant difference
between the observed and the expected or theoretical frequencies from independent variables.
2. It is used to test the goodness of fit where a theoretical distribution is fitted to some data, i.e., the fitting
of a normal curve.
3. It is used to test the hypothesis that the variances of a normal population are equal to a given value.
4. It is also used for the construction of confidence interval for variances.
5. It is used to compare two uncorrelated and correlated proportions.
Using the degrees of freedom, we can use the table of chi-square values to compare against our obtained χ² value. If the computed χ² is equal to or greater than the tabular value, at the required degrees of freedom and the chosen probability level, the chi-square value is significant and the null hypothesis set earlier is rejected.
Example 1: Suppose we want to test the claim that the rate of fatal accidents differs with the width of the road.
χ² = Σ (O − E)² / E
where
𝜒2 = chi-square
O = observed frequency
E = expected frequency
Observed frequency 95 90 83 73
Expected frequency 85.25 85.25 85.25 85.25
Ho: The rate of fatal accidents does not differ with the width of the road.
Ha: The rate of fatal accidents differs with the width of the road.
The computed χ² = (9.75² + 4.75² + 2.25² + 12.25²)/85.25 = 3.1994. The tabular value of χ² at the 0.05 level of significance with degrees of freedom df = k − 1 = 4 − 1 = 3 is 7.815.
Since the computed value is less than the critical value of χ², the null hypothesis is not rejected.
Thus, we can say that the rate of fatal accidents does not differ with the width of the road.
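The arithmetic can be checked with a short Python sketch (illustrative):

```python
# Observed fatal accidents at the four road widths; expected assumes an even split
observed = [95, 90, 83, 73]
expected = [sum(observed) / 4] * 4        # 341 / 4 = 85.25 per category

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))  # 3.1994, below the critical value 7.815 (df = 3, alpha = 0.05)
```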
Example 2: Students from MMSU claim that among the four most popular flavors of ice cream, students
have these preference rates: 58% prefer Double Dutch, 25% prefer Rocky Road, 12% prefer
chocolate mocha and 5% prefer vanilla. A random sample of 300 students was chosen. Test the
claim that the percentages given by the students are correct. Use 0.01 significance level.
Flavor Number of Students
Double Dutch 123
Rocky Road 72
Chocolate Mocha 55
Vanilla 50
Solution:
Ho: The claim of the students is correct, that is, P1 = 0.58, P2 = 0.25, P3 = 0.12, and P4 = 0.05.
Ha: At least one of the proportions is not equal to the value claimed.
The expected frequencies are 300(0.58) = 174, 300(0.25) = 75, 300(0.12) = 36, and 300(0.05) = 15, giving a computed χ² = 106.76. The tabular value at the 0.01 level with df = 4 − 1 = 3 is 11.345, so the null hypothesis is rejected.
Thus, we can say that at least one of the proportions is not equal to the value claimed.
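As an illustrative check of the computation:

```python
n = 300
observed = [123, 72, 55, 50]           # Double Dutch, Rocky Road, Chocolate Mocha, Vanilla
claimed = [0.58, 0.25, 0.12, 0.05]     # proportions claimed by the students
expected = [n * p for p in claimed]    # 174, 75, 36, 15

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 106.76, far above 11.345 (df = 3, alpha = 0.01), so Ho is rejected
```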
TESTING NORMALITY WITH CHI-SQUARE
The chi-square goodness-of-fit test can also be used to test whether a sample came from a normal population, using the following steps:
Step 1: Use the mean and the standard deviation of the sample to estimate the mean and the standard deviation of the population, if these are not known or assumed.
Step 2: Group the sample data into class intervals or categories.
Step 3: Calculate the z-values for the class boundaries.
Step 4: Determine the area under the standard normal curve between z-values to obtain the hypothesized proportion of the sample in each class.
Step 5: Multiply each proportion by the total number of observations to obtain the expected frequency FE.
Step 6: Compute for χ².
Remarks:
1. The hypothesis being tested is that the sample came from a population that has a normal distribution.
2. The degrees of freedom for the chi-square test is 𝑘 − 1 − 𝑚, where k is the number of classes and m is
the number of population parameters estimated. If the sample mean and the standard deviation have been
used to estimate the population mean and the standard deviation, then 𝑚 = 2; thus, the
𝑑𝑒𝑔𝑟𝑒𝑒𝑠 𝑜𝑓 𝑓𝑟𝑒𝑒𝑑𝑜𝑚 (𝑑𝑓) = 𝑘 − 3.
CONTINGENCY TABLES
In contingency tables, we intend to test that the row variable is independent of the column variable.
Computation for expected frequency for the contingency table is different from the one in the goodness of fit. The
expected frequency E can be computed with the use of this formula:
E = (Row Total × Column Total) / Grand Total
Teenagers and Young adults have their own style of studying. Some prefer to study with music; others
do not. A group of psychologists conducted a study to determine the particular age of the students who like
studying with music. At the 0.01 level of significance, test the claim that style of studying is independent of the
listed age groups. The table below summarizes the information.
Age Groups
Study Habit
9-12 13-16 17-20 21-24
With Music 89 75 63 52
Without Music 28 20 34 39
Contingency Table:
Age Groups Row
Study Habit
9-12 13-16 17-20 21-24 Totals
With Music 89 (81.61) 75 (66.26) 63 (67.66) 52 (63.47) 279
[0.67] [1.15] [0.32] [2.07]
Without Music 28 (35.39) 20 (28.74) 34 (29.34) 39 (27.53) 121
[1.54] [2.66] [0.74] [4.78]
Column Totals 117 95 97 91 400
Interpretation:
At the 0.01 significance level, the computed χ² = 13.9373 exceeds the critical value of 11.345 (df = 3), so the obtained value lies within the rejection region. Therefore, there is sufficient evidence to reject the null hypothesis. The result implies that the type of study habit is related to age.
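The expected frequencies shown in parentheses and the χ² statistic follow directly from E = (Row Total × Column Total)/Grand Total; a generic Python sketch (illustrative) reproduces the value:

```python
def chi_square_contingency(table):
    """Return the chi-square statistic for a two-way contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand  # E = row total * column total / grand total
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Study habit (rows) by age group (columns)
music = [
    [89, 75, 63, 52],   # with music
    [28, 20, 34, 39],   # without music
]
print(round(chi_square_contingency(music), 4))  # 13.9373
```

The same function reproduces the computed values in the one-way classification examples that follow: 9.3701 for the divorce-question table and 8.2492 for the mentors table.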
ONE-WAY CLASSIFICATION
Chi-square in one-way classification is applicable when the researcher is interested in determining the number of subjects, objects, or responses that fall in various categories.
Example:
The subjects are 30 women and 30 men, a total of 60 subjects in all. When asked “Can divorce be applied in the Philippines?”, of the 30 women, 9 answered yes; 12, no; and 9, undecided; of the 30 men, 15 answered yes; 2, no; and 13, undecided. Test for a significant difference in their responses.
Sex
Responses Row Totals
Women Men
Yes 9 (12.00) [0.75] 15 (12.00) [0.75] 24
No 12 (7.00) [3.57] 2 (7.00) [3.57] 14
Undecided 9 (11.00) [0.36] 13 (11.00) [0.36] 22
Column Totals 30 30 60
Interpretation:
At the 0.05 significance level, the computed χ² = 9.3701 exceeds the critical value of 5.991 (df = 2), so the obtained value lies within the rejection region. Therefore, there is sufficient evidence to reject the null hypothesis. The result implies that the response to the survey question is related to sex.
Example:
The frequencies shown in the table below are observed frequencies. The specific question is “Is there a significant
difference in the job performance of mentors who failed and mentors who passed the teacher’s licensure
examination?” Of the 100 subjects, 20 failed but with satisfactory job performance; 40 passed with satisfactory
job performance; 25 failed with unsatisfactory job performance; and 15 passed with unsatisfactory job
performance. Test the significant difference existing in the foregoing data.
Ho: There is no significant difference in the job performance of mentors who failed and mentors who
passed the teacher’s licensure examination.
Ha: There is a significant difference in the job performance of mentors who failed and mentors who
passed the teacher’s licensure examination.
Teachers Licensure Examination
Job Performance
Failed Passed Total
Satisfactory 20 (27.00) 40 (33.00) 60
[1.81] [1.48]
Unsatisfactory 25 (18.00) 15 (22.00) 40
[2.72] [2.23]
Total 45 55 100
Interpretation:
At the 0.05 significance level, the computed χ² = 8.2492 exceeds the critical value of 3.841 (df = 1), so the obtained value lies within the rejection region. Therefore, there is sufficient evidence to reject the null hypothesis. The result implies that there is a significant difference in the job performance of mentors who failed and mentors who passed the teacher’s licensure examination.
ASSESSMENT
Login to mVLE portal to access the assessment for Chapter IV.
REFERENCES:
• D.C. Montgomery and G.C. Runger, Applied Statistics and Probability for Engineers, 5th Edition, John
Wiley & Sons, Inc., 2011.
• R.E. Walpole, R.H. Myers, S.L. Myers and K. Ye, Probability and Statistics for Engineers and
Scientists, 9th Edition, Pearson International Edition, 2012.
• Zulueta, F. M. and Nestor Edilberto B. Costales, Jr. (2005). Methods of Research: Thesis Writing and
Applied Statistics. Mandaluyong City: National Bookstore, Inc.