Lecture 5 - Correlation


Relationships between the data set


Purpose of Data Analysis
• To understand the distribution of data - use descriptive statistics (numerical or graphical)
• Investigate relationships
• Examine the difference
• Make predictions
• THE SCALE OF DATA MEASUREMENT (nominal, ordinal, ratio/interval) will determine the type of analysis to be used
• One of the purposes of analysis is to see relationships
• Relationship – must involve at least two things/variables/parameters
• Z score – used to see the distribution (ratio/proportion) of one thing only
Relationship???
• Is there a relationship between A and B?
• For example – Height and Weight of people aged 18-30 years
• If there is no relationship - then the study is COMPLETED
• If there IS a relationship - one more question arises:
• WHAT IS THE STRENGTH LEVEL OF THE RELATIONSHIP?
• In theory, relationship analysis involves 2 forms of analysis, namely:
• First - To see whether or not there is a relationship between A and B
• If there is none - then the analysis is over and a conclusion can be drawn
• Second - If there is, it is necessary to look at the relationship in two aspects, namely:
– Type of relationship (Positive or Negative)
– Level of relationship (Strong or Weak)
• In any research, we always want to see the relationship
between 2 interrelated variables.
• The selection of the type of analysis depends on the
scale of data measurement
Ratio/interval scale
Ratio scale: A ratio scale has all the properties of an interval scale,
but also has a true zero point. This means that ratios of scores are
meaningful. Examples include height, weight, and temperature
measured in Kelvin.

Interval scale: An interval scale is a scale of measurement where the
difference between two values is meaningful. However, there is no
true zero point. Common examples include temperature measured in
Celsius or Fahrenheit, as well as calendar dates.

Ratio/interval scale – Pearson Correlation Analysis
Nominal Scale:
Nominal data are categorical and do not have a
natural order. Examples include gender, ethnicity, or types
of cars.
Statistical analysis methods for nominal data include
frequency counts, chi-square tests, and mode calculation.
These methods focus on understanding the distribution
and association of categories within the data.

Nominal Data Scale – Chi-Square Test/Cramer’s V
Ordinal Scale
When dealing with ordinal scale data, which involves
categories with a meaningful order or ranking, but the
intervals between categories may not be uniform or well-
defined, different statistical tests are used to analyze
relationships and differences between variables.

• Ordinal Scale - Spearman Correlation Test (sample of 2 variables) and Kruskal-Wallis Test (more than 2 variables)
Types of analysis

TESTING DATA RELATIONSHIPS

RATIO/INTERVAL DATA     NOMINAL DATA        ORDINAL DATA
Pearson correlation     a) Chi-square       a) Spearman correlation
                        b) Cramer's V       b) Kruskal-Wallis
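The chart above can be read as a simple lookup from measurement scale to test. The sketch below is only an illustration of that reading: the names `TESTS_BY_SCALE` and `suggest_tests` are made up for this example, and the mapping contains nothing beyond the tests named on these slides.

```python
# Tests named on the slides, keyed by the scale of data measurement.
# TESTS_BY_SCALE and suggest_tests are hypothetical names used for illustration.
TESTS_BY_SCALE = {
    "nominal": ["Chi-square test", "Cramer's V"],
    "ordinal": ["Spearman correlation", "Kruskal-Wallis test"],
    "interval": ["Pearson correlation"],
    "ratio": ["Pearson correlation"],
}

def suggest_tests(scale):
    """Return the tests the slides associate with a measurement scale."""
    key = scale.strip().lower()
    if key not in TESTS_BY_SCALE:
        raise ValueError(f"unknown measurement scale: {scale!r}")
    return TESTS_BY_SCALE[key]

print(suggest_tests("Ordinal"))
```

A real choice of test also depends on the research question, so this is a reminder of the chart rather than a decision rule.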
Correlation
What is the relationship between air temperature and rainfall?
What is the relationship between air temperature and flooding incidences?
What is the relationship between tree disease and air temperature?
Correlation

• The first step in any analysis is to visually examine the data. In the
case of two quantitative variables the most appropriate graphical
display is the scatter plot
• The scatter plot allows investigation of the relationship
between two variables
– the independent variable is plotted along the horizontal axis and
– the dependent variable is plotted on the vertical axis.
– A scatterplot is frequently also referred to as a plot of Y versus
X.
Correlation
Different calculations for different data scales

Data scale

Nominal     Ordinal            Interval/Ratio
(none)      Spearman’s rho     Pearson’s r
            Kendall’s tau      Spearman’s rho
                               Kendall’s tau
Numerical summary of the data — Correlation

• Suppose there are two quantitative variables, namely a family’s
annual expenditure on recreation and their annual income.
• Since annual expenditure is likely to depend on a family’s annual
income, we refer to expenditure as the dependent variable, while
income is considered to be the independent variable.
• The dependent variable, often denoted by Y, is also referred to as the
response variable, while the independent variable, often denoted
by X, is also referred to as the explanatory or predictor variable.
What to look out for in a scatterplot?

• Overall pattern
– Direction
– Form
– Strength
• Deviations from the pattern
– Outliers
Overall pattern & Direction
• A trend that runs from the lower left to the upper right has a
positive correlation (+ve): if x increases, y also increases.
• A trend that runs from the upper left to the lower right is said to
have a negative correlation (-ve): if x increases, y decreases.
Strength

• Dots close to each other, whatever the shape of the curve, indicate a
strong relationship between x and y. For a linear pattern this gives a
high correlation coefficient (close to +1 or -1).
• Dots far away from each other, with no clear pattern, mean x and y are
not likely to be related. Low correlation coefficient (close to 0).
Correlation

• A negative value indicates a decreasing relationship between X and Y,
that is, as X increases, Y decreases.
• A positive value indicates an increasing relationship between X and Y,
that is, as X increases, so does Y.
• A value of 0 indicates that there is no linear relationship between
the two variables. This, however, does not imply that there is no
relationship.
• The correlation does not give an indication about the value of the
slope of any linear relationship.
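The last two points can be checked on a small made-up data set. The sketch below assumes the computational formula for Pearson's r given later in these notes. The x values, the curve y = x² and the two straight lines are invented purely for illustration, and `pearson_r` is a helper name chosen for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums (computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [-3, -2, -1, 0, 1, 2, 3]   # made-up x values, symmetric about 0

r_curve = pearson_r(xs, [x ** 2 for x in xs])     # y depends on x exactly, but not linearly
r_steep = pearson_r(xs, [10 * x for x in xs])     # straight line with slope 10
r_gentle = pearson_r(xs, [0.1 * x for x in xs])   # straight line with slope 0.1

print(round(r_curve, 3), round(r_steep, 3), round(r_gentle, 3))
```

For the curve, r comes out as 0 even though y is completely determined by x. The two straight lines both give r of 1 (up to rounding) although their slopes differ by a factor of 100.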
Scatter plot
(Figure: scatter plot with Annual Income ($) on the horizontal axis.)
• Direction: Positive, i.e. as income increases so does recreation
expenditure;
• Shape: Roughly linear, i.e. the points appear to fall along a straight
line; and
• Strength: Reasonably strong, i.e. there is some scatter about a
straight line, but the points still follow it.
Numerical summary
• After investigating the data visually, a numerical
summary of the strength of the association
between the two variables is often desired.
• This can be achieved with the population correlation coefficient ρ,
which measures the strength of the linear association between two
variables, X and Y.
• Since X and Y are two quantitative variables, ρ is also known as the
Pearson correlation coefficient or Pearson’s product-moment
correlation coefficient.
PEARSON’s
CORRELATION
• The technique of determining this relationship is
called PEARSON'S CORRELATION ANALYSIS
• The strength of a relationship can be measured
by using an index called PEARSON'S
CORRELATION COEFFICIENT (r)
• The value of the correlation coefficient is
between -1 and +1
• A +ve value indicates a direct relationship and a –ve value indicates
an inverse relationship
• Coefficients approaching the value of -1 or +1
indicate a strong relationship
Direct relationship (positive r)
The relationship between the price of rice and the weight of rice
(Figure: Price (RM), 0 to 80, plotted against Weight (kg), 5 to 20.)
Inverse relationship (negative r)
Relationship between Car Price and Car Age
(Figure: Price (RM), 0 to 80, plotted against Age (years), 5 to 20.)
• Pearson's correlation is a type of analysis that is often used
• It can ONLY be used for data in Interval Scale, Ratio Scale and
Percentage.
• This method should NOT be used on NOMINAL SCALE data
The strength of relationships
• Depends on the value of r
• Conclusions about the strength of the relationship depend on the
researcher
• Example:
Correlation coefficient (r), +ve or -ve     Strength of correlation
0.91 – 1.00                                 Very Strong
0.71 – 0.90                                 Strong
0.51 – 0.70                                 Moderate
0.31 – 0.50                                 Weak
0.01 – 0.30                                 Very Weak
0.00                                        No Relationship
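The example table can be turned into a small helper. This is only a sketch: `strength_label` is a name made up here, and because the table leaves gaps between bands (for instance between 0.30 and 0.31), the code assumes each band's upper limit is inclusive and that anything just above it falls into the next band.

```python
def strength_label(r):
    """Map a correlation coefficient to the example strength labels.

    Assumption: band upper limits are inclusive, so 0.305 is read as "Weak".
    """
    size = abs(r)                      # the sign gives direction, not strength
    if size > 1:
        raise ValueError("a correlation coefficient lies between -1 and +1")
    if size == 0:
        return "No Relationship"
    bands = [
        (0.30, "Very Weak"),
        (0.50, "Weak"),
        (0.70, "Moderate"),
        (0.90, "Strong"),
        (1.00, "Very Strong"),
    ]
    for upper_limit, label in bands:
        if size <= upper_limit:
            return label

print(strength_label(0.94), "/", strength_label(-0.45))
```

Other researchers may draw the bands differently, so the labels are a convention and not a fixed rule.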
The formula for calculating r

$$ r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{\left[ n \sum x^2 - (\sum x)^2 \right] \left[ n \sum y^2 - (\sum y)^2 \right]}} $$
Example 1

• For example, a geographer wants to see the relationship between the
amount of rain (RATIO DATA) and the rate of slope erosion (RATIO DATA).
• Based on observation or reading (theory), it is found that there is a
possibility that a large amount of rain will increase the rate of slope
erosion.
• With that, a hypothesis was put forward: "The amount of rain will
affect the rate of slope erosion"
• In this study, the researcher wants to see the extent of the
relationship between the amount of rain and the rate of erosion.
• Does the amount of rain affect the rate of slope erosion (strong
relationship), does it not affect the rate of slope erosion (no
relationship), or does the relationship exist but is weak?
Location Rain (cm) Erosion (mg/l)
Station A1 25 40
Station A2 27 45
Station A3 18 21
Station B1 21 35
Station B2 24 38
Station C1 22 39
Station C2 10 20
Station D 35 75
Station E 9 16
Location Rain (x) Erosion (y) x² y² xy
Station A1 25 40 625 1600 1000
Station A2 27 45 729 2025 1215
Station A3 18 21 324 441 378
Station B1 21 35 441 1225 735
Station B2 24 38 576 1444 912
Station C1 22 39 484 1521 858
Station C2 10 20 100 400 200
Station D 35 75 1225 5625 2625
Station E 9 16 81 256 144
Σ 191 329 4585 14537 8067

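$$ (\sum x)^2 = (191)^2 = 36{,}481 \qquad (\sum y)^2 = (329)^2 = 108{,}241 $$

$$
\begin{aligned}
r &= \frac{9(8{,}067) - (191)(329)}{\sqrt{[9(4{,}585) - 36{,}481]\,[9(14{,}537) - 108{,}241]}} \\
  &= \frac{72{,}603 - 62{,}839}{\sqrt{[41{,}265 - 36{,}481]\,[130{,}833 - 108{,}241]}} \\
  &= \frac{9{,}764}{\sqrt{[4{,}784]\,[22{,}592]}} = \frac{9{,}764}{\sqrt{108{,}080{,}128}} \\
  &= \frac{9{,}764}{10{,}396.16} \\
  &= 0.94
\end{aligned}
$$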
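The hand calculation can be cross-checked with a few lines of Python. This is a minimal sketch that uses only the nine rain and erosion readings from the table and the same raw-sum formula; the variable names are chosen for this example.

```python
from math import sqrt

rain = [25, 27, 18, 21, 24, 22, 10, 35, 9]        # x, rain (cm)
erosion = [40, 45, 21, 35, 38, 39, 20, 75, 16]    # y, erosion (mg/l)

n = len(rain)
sum_x, sum_y = sum(rain), sum(erosion)
sum_x2 = sum(x * x for x in rain)
sum_y2 = sum(y * y for y in erosion)
sum_xy = sum(x * y for x, y in zip(rain, erosion))

# Same steps as the worked example: numerator, then square root of the product.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)   # the column totals
print(round(r, 2))
```

The printed totals should match the Σ row of the table, and r rounded to two decimals should match the result of the hand calculation.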
• A geographer has done a study on the
use of organic fertilizers on vegetable
production.
• Is there a relationship between the rate of fertilizer use (g/l) and
vegetable production (kg)?
Fertilizer (g/l) Yield (kg)
28 12
32 16
65 42
79 51
41 38
38 30
22 35
35 49
24 42
50 21
SPEARMAN'S
CORRELATION
Ordinal form data
Introduction
• The concept of SPEARMAN CORRELATION ANALYSIS is the same as PEARSON
CORRELATION ANALYSIS
• The strength of a relationship can be measured by using an index
called SPEARMAN'S CORRELATION COEFFICIENT (ρ)
• Difference - it is for ORDINAL data only
• Usually the data has only 2 categories; if there are more than 2
categories, use the Kruskal-Wallis Test
• The value of the correlation coefficient is between -1 and +1
• A +ve value indicates a direct relationship and a –ve value indicates
an inverse relationship
• Coefficients approaching the value of -1 or +1 indicate a strong
relationship
Spearman's correlation coefficient (ρ)

$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

where d = rank X – rank Y
n = number of observations
Example 1
The following are the math test results and science
scores for 10 students
Mathematics (x) Science (y)
75 65
86 72
100 98
95 70
99 69
89 75
97 83
76 85
55 52
65 68
Step 1 - Arrange the x values and mark the rank for x and y
Mathematics (x) Science (y)
100 98 (1)
99 69 (7)
97 83 (3)
95 70 (6)
89 75 (4)
86 72 (5)
76 85 (2)
75 65 (9)
65 68 (8)
55 52 (10)

NOTE: In general, ordinal data rarely exists in the real world. It is
established by the researcher. One method of forming ordinal data is to
ORGANIZE (rank) the data.
Step 2 - Calculate the rank difference for x and y
Mathematics (x) Science (y) Rank x Rank y d d²
100 98 1 1 0 0
99 69 2 7 -5 25
97 83 3 3 0 0
95 70 4 6 -2 4
89 75 5 4 1 1
86 72 6 5 1 1
76 85 7 2 5 25
75 65 8 9 -1 1
65 68 9 8 1 1
55 52 10 10 0 0

Total d² = 58

$$ \rho = 1 - \frac{6 \times 58}{10(10^2 - 1)} = 1 - \frac{348}{990} = 1 - 0.352 = 0.648 $$

A moderate positive relationship (0.51 – 0.70 on the example scale)
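The two steps can be sketched in Python as a cross-check. This is an illustration only: it ranks both subjects from largest (rank 1) to smallest, as in Step 1, where the science score 65 is ranked 9th and 68 is ranked 8th. The helper name `ranks_descending` is made up here and assumes there are no tied scores, which holds for this data.

```python
maths = [75, 86, 100, 95, 99, 89, 97, 76, 55, 65]     # x
science = [65, 72, 98, 70, 69, 75, 83, 85, 52, 68]    # y

def ranks_descending(values):
    """Rank 1 = largest value. Assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

rank_x = ranks_descending(maths)
rank_y = ranks_descending(science)

# d = rank X - rank Y for each student, then the sum of the squared differences.
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(maths)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rho, 3))
```

With the Step 1 ranks the squared differences add up to 58, which gives a coefficient of about 0.65.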


Example 2
• Rain greatly affects the production of
vegetables. An agricultural geographer
wants to do a study on the relationship
between the number of rainy days and the
amount of vegetable yield. He has recorded the number of rainy
days in a month for 16 months.
• Based on the data, what conclusions can be made by using the
Spearman Correlation Method?
No. of rainy days (x) Production (y)
17 119
20 78
8 121
44 69
27 93
23 75
35 67
17 134
22 93
11 108
39 58
31 52
13 148
29 88
19 93
25 83
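Day (x) Production (y) Rank x Rank y d d²
8 121 1 3 -2 4
11 108 2 5 -3 9
13 148 3 1 2 4
17 119 4.5 4 0.5 0.25
17 134 4.5 2 2.5 6.25
19 93 6 7 -1 1
20 78 7 11 -4 16
22 93 8 7 1 1
23 75 9 12 -3 9
25 83 10 10 0 0
27 93 11 7 4 16
29 88 12 9 3 9
31 52 13 16 -3 9
35 67 14 14 0 0
39 58 15 15 0 0
44 69 16 13 3 9

Total d² = 93.5

$$ \rho = 1 - \frac{6(93.5)}{16(16^2 - 1)} = 1 - \frac{561}{4080} = 1 - 0.1375 = 0.86 $$

Tied values share the average of the ranks they occupy (17 rainy days: rank 4.5; production 93: rank 7).
Here x is ranked from smallest to largest while y is ranked from largest to smallest, so the positive value of ρ means that months with fewer rainy days have higher production, i.e. the relationship between rainy days and production is inverse.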
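The sign of the coefficient depends on ranking both variables in the same direction. The sketch below ranks rainy days and production from smallest (rank 1) to largest and gives tied values the average of their ranks. The helper name `average_ranks` is made up for this example. When there are ties, the simplified formula is only an approximation.

```python
rainy_days = [17, 20, 8, 44, 27, 23, 35, 17, 22, 11, 39, 31, 13, 29, 19, 25]         # x
production = [119, 78, 121, 69, 93, 75, 67, 134, 93, 108, 58, 52, 148, 88, 93, 83]   # y

def average_ranks(values):
    """Rank 1 = smallest value; tied values share the average of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1      # first rank occupied by this value
        ties = ordered.count(v)           # number of observations sharing it
        ranks.append(first + (ties - 1) / 2)
    return ranks

rank_x = average_ranks(rainy_days)
rank_y = average_ranks(production)

sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(rainy_days)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rho, 3))
```

Ranked this way the coefficient is about -0.86. That is similar in size to the positive value in the worked calculation, which ranks x from smallest to largest and y from largest to smallest, but the negative sign shows directly that more rainy days go with lower production.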
ATTENTION
• For RATIO/INTERVAL data: if the question does not specify which
method to use, you can choose the PEARSON or SPEARMAN method
• Each method has advantages and disadvantages
Calculate the relationship between BMI and age
Student BMI Age
Student A1 16.5 37
Student A2 18.2 29
Student A3 19.4 65
Student F1 19.7 18
Student F2 19.9 52
Student G1 20.5 33
Student G2 21.1 26
Student H 22.3 20
Student B1 22.5 21
Student B2 24.0 29
Student C1 24.6 48
Student C2 25.7 45
Student D 28.9 32
Student E 29.5 54
