Module 5: Bivariate Analysis

The document provides an overview of bivariate data analysis, focusing on the relationships between two variables, including nominal, ordinal, and numerical types. It covers methods such as contingency tables, chi-square tests, correlation coefficients, and various graphical representations for exploratory data analysis (EDA). Additionally, it details the steps for conducting chi-square tests for independence and includes examples to illustrate the concepts.


Unit V

Bivariate Data Analysis & EDA – Graphical Representation
Topics
• Introduction to bivariate analysis
• Association between two nominal variables
• Contingency tables
• Chi-Square calculations
• Phi Coefficient
• Scatter plot and its interpretations
• Correlation coefficient
• Regression coefficient
• Relationship between two ordinal variables
• Spearman Rank correlation
• Kendall’s Tau Coefficients
• Measuring association between mixed combination of numerical, ordinal and
nominal variables.
EDA – Graphical Representation:
• Introduction to graphical representation of data
• dot plot, stem and leaf plot
• bar chart, stacked bar chart, multiple bar chart, percentage bar chart
• histogram, bimodal and symmetric interpretation of histogram,
• pie chart and its legends
• block plot, Box Plot, contour plot,
• star plot, dendrogram, interpretation of dendrogram,
• heat map, tree map.
Introduction to bivariate analysis
• Bivariate analysis refers to the analysis of two variables to determine
relationships between them.
• Bivariate analyses are conducted to determine whether a statistical
association exists between two variables, the degree of association if
one does exist, and whether one variable may be predicted from
another.
• For example, bivariate analyses could be used to answer the question
of whether there is an association between income and quality of life,
or whether quality of life can be predicted from income.
• Bivariate analysis is of the following types:
• Bivariate Analysis of two Numerical Variables (Numerical-Numerical)
• In this type, both variables of the bivariate data, independent and dependent, have
numerical values.
• Inferential Statistics such as Correlation Coefficient can be used to explore two numerical
variables.
• Visualization techniques such as Scatterplot can be used.
• Bivariate Analysis of two categorical Variables (Categorical-Categorical)
• When both the variables are categorical.
• Inferential Statistics such as Chi-Square test can be used to explore two categorical variables.
• Visualization techniques such as Stacked Column Charts can be used.
• Bivariate Analysis of one numerical and one categorical
variable (Numerical-Categorical)
• When one variable is numerical and one is categorical.
• Inferential Statistics such as T-Test, Z-Test, ANOVA can be used.
• The insights provided by such statistics can help us explore the dataset by looking at the various
combinations of numerical and categorical variables.
• Visualization techniques such as Line Chart with Error Bars can be used for such analysis.
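The three combinations above can be sketched in a few lines of pandas and scipy. This is a minimal sketch with an invented dataset; the column names and values are hypothetical, chosen only to show one statistic per bivariate type:

```python
import pandas as pd
from scipy import stats

# Invented data: two numerical and two categorical columns
df = pd.DataFrame({
    'income':   [20, 25, 30, 35, 40, 45, 50, 55],             # numerical
    'quality':  [3, 4, 4, 5, 6, 6, 7, 8],                     # numerical
    'gender':   ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],     # categorical
    'owns_car': ['N', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y'],     # categorical
})

# Numerical-Numerical: correlation coefficient
r, p = stats.pearsonr(df['income'], df['quality'])

# Categorical-Categorical: chi-square test on a contingency table
table = pd.crosstab(df['gender'], df['owns_car'])
chi2, p2, dof, expected = stats.chi2_contingency(table)

# Numerical-Categorical: two-sample t-test of income across gender
m = df.loc[df['gender'] == 'M', 'income']
f = df.loc[df['gender'] == 'F', 'income']
t, p3 = stats.ttest_ind(m, f)
```

Each statistic is covered in detail in the sections that follow.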
Association between two nominal variables
• Nominal variables are variables that are measured at the nominal level, and
have no inherent ranking.
• Nominal variable association refers to statistical relationships between
nominal variables.
• Examples of nominal variables that are commonly assessed include
gender, race, religious affiliation, and college major.
• Cross-tabulation (also known as contingency or bivariate tables) is
commonly used to examine the relationship between nominal variables.
• Chi Square tests-of-independence are widely used to assess relationships
between two independent nominal variables.
Contingency tables
• In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of
table in a matrix format that displays the (multivariate) frequency distribution of the
variables. OR
• A contingency table is a table showing the distribution of one variable in rows and
another variable in columns. It is used to study the association between the two variables.
• Contingency tables are also called crosstabs or two-way tables, used in statistics to
summarize the relationship between several categorical variables.
• The term contingency table was first used by Karl Pearson.
• The pandas crosstab function builds a cross-tabulation table that can show the frequency
with which certain groups of data appear.
• The contingency coefficient is a coefficient of association which tells whether two
variables or datasets are independent of or dependent on each other. It is also known as
Pearson's contingency coefficient.
• They are mainly used in survey research, business intelligence, engineering and scientific
research.
• Example: 2x2 Contingency table
• The following table shows the numbers of red and black cards in a
deck that are Aces and non-Aces:
          Red   Black   Total
Ace         2       2       4
Non-Ace    24      24      48
Total      26      26      52

• The total in the Red column is 26, which means there are 26 red cards
in the deck. Of these, 2 are Aces and 24 are non-Aces. There are 52
cards in a deck. Use the values in the table to calculate some
conditional probabilities.
A contingency table is a type of table that summarizes the relationship
between two categorical variables.
To create a contingency table in Python, we can use
the pandas.crosstab() function.
Syntax:
pandas.crosstab(index, columns)
• index: name of variable to display in the rows of the contingency
table
• columns: name of variable to display in the columns of the
contingency table
• Step 1: Create the Data
Let’s create a dataset that shows information for 20 different product orders,
including the type of product purchased (TV, computer, or radio) along with the
country (A, B, or C) that the product was purchased in:
#create data
import pandas as pd
df = pd.DataFrame({'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
'Product': ['TV', 'TV', 'Comp', 'TV', 'TV', 'Comp',
'Comp', 'Comp', 'TV', 'Radio', 'TV', 'Radio', 'Radio',
'Radio', 'Comp', 'Comp', 'TV', 'TV', 'Radio', 'TV'],
'Country': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C']})

#view data
df
Step 2: Create the Contingency Table:
To create a contingency table to count the number of each product
ordered by each country:
#create contingency table
pd.crosstab(index=df['Country'], columns=df['Product'])

Step 3: Add Margin Totals to the Contingency Table


We can use the argument margins=True to add the margin totals to
the contingency table:
#add margins to contingency table
pd.crosstab(index=df['Country'], columns=df['Product'], margins=True)
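Putting the three steps together, the margin totals can be checked programmatically. This is a runnable sketch of the same product/country data; the 'All' row and column produced by margins=True hold the totals:

```python
import pandas as pd

# Same 20 product orders as in Step 1
df = pd.DataFrame({
    'Product': ['TV', 'TV', 'Comp', 'TV', 'TV', 'Comp',
                'Comp', 'Comp', 'TV', 'Radio', 'TV', 'Radio', 'Radio',
                'Radio', 'Comp', 'Comp', 'TV', 'TV', 'Radio', 'TV'],
    'Country': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
                'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C']})

ct = pd.crosstab(index=df['Country'], columns=df['Product'], margins=True)
print(ct)
# Each cell counts the orders for one country/product pair;
# the 'All' margins sum to the 20 orders in the data.
```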
Example
• In the below example we take the iris flower data set for analysis.
This data set consists of 50 samples from each of three species of Iris
(Iris setosa, Iris virginica and Iris versicolor). Four features were
measured from each sample: the length and the width of the sepals
and petals, in centimeters.
• We will create contingency tables from these features, which can
ultimately be used to distinguish the species from each other.
Chi-Square calculations
• Chi-Square test is a statistical method to determine if two categorical
variables have a significant correlation between them.
• Both variables should be from the same population and they should be
categorical, like Yes/No, Male/Female, Red/Green, etc.
Chi-Square Test for Independence:
• The test is applied when you have two categorical variables from a single
population. It is used to determine whether there is a significant
association between the two variables.
• For example, in an election survey, voters might be classified by gender
(male or female) and voting preference (Democrat, Republican, or
Independent). We could use a chi-square test for independence to
determine whether gender is related to voting preference.
Steps for Chi-Square Test for Independence
(1) state the hypotheses
(2) formulate an analysis plan
(3) analyze sample data
(4) interpret results.

(1): State the Hypotheses:


• Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that knowing the
level of Variable A does not help you predict the level of Variable B. That is, the variables are independent.
Ho: Variable A and Variable B are independent.
Ha: Variable A and Variable B are not independent.
• The alternative hypothesis is that knowing the level of Variable A can help you predict the level of Variable B.
• Note: Support for the alternative hypothesis suggests that the variables are related; but the relationship is not
necessarily causal, in the sense that one variable "causes" the other.
(2): Formulate an Analysis Plan
• The analysis plan describes how to use sample data to reject or fail to reject the null hypothesis. The plan should
specify the following elements.
Significance level : Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1
can be used.
Test method: Use the chi-square test for independence to determine whether there is a significant relationship between two
categorical variables.
(3): Analyze Sample Data
• Using sample data, find the degrees of freedom, expected frequencies, the test statistic, and
the P-value associated with the test statistic.
• Degrees of freedom: The degrees of freedom (DF) is equal to: DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the number of levels
for the other categorical variable.

• Expected frequencies: The expected frequency counts are computed separately for each level of one
categorical variable at each level of the other categorical variable. Compute r * c expected frequencies,
according to the following formula.
Er,c = (nr * nc) / n
• where Er,c is the expected frequency count for level r of Variable A and level c of Variable
B, nr is the total number of sample observations at level r of Variable A, nc is the total
number of sample observations at level c of Variable B, and n is the total sample size.
• Test statistic. The test statistic is a chi-square random variable (Χ²) defined by the following
equation: Χ² = Σ [ (Or,c − Er,c)² / Er,c ]
• where Or,c is the observed frequency count at level r of Variable A and level c of Variable B, and
Er,c is the expected frequency count at level r of Variable A and level c of Variable B.
• P-value. The P-value is the probability of observing a sample statistic as extreme as the test
statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to
assess the probability associated with the test statistic. Use the degrees of freedom computed
above.
(4): Interpret Results:
• If the sample findings are unlikely given the null hypothesis, reject the null hypothesis.
• Typically, this involves comparing the P-value to the significance level, and rejecting the null
hypothesis when the P-value is less than the significance level.
• A public opinion poll surveyed a simple random sample of 1000
voters. Respondents were classified by gender (male or female) and by
voting preference (Republican, Democrat, or Independent). Results are
shown in the contingency table below.

                    Voting Preferences
          Republican   Democratic   Independent
Male             200          150            50
Female           250          300            50

Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.
Find the row and column totals:

                      Voting Preferences
               Rep    Dem    Ind    Row Total
Male           200    150     50          400
Female         250    300     50          600
Column Total   450    450    100         1000
• State the hypotheses. The first step is to state the null
hypothesis and an alternative hypothesis.
Ho: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
Formulate an analysis plan. For this analysis, the significance level is
0.05. Using sample data, we will conduct a chi-square test for
independence.
Analyze sample data. Applying the chi-square test for independence to
sample data, we compute the degrees of freedom, the expected
frequency counts, and the chi-square test statistic. Based on the
chi-square statistic and the degrees of freedom, we determine
the P-value.

• DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
• Er,c = (nr * nc) / n
E1,1 = (400 * 450) / 1000 = 180000/1000 = 180
E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60
• Χ² = Σ [ (Or,c − Er,c)² / Er,c ]

Χ² = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40 + (250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60
Χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
Χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
• where DF is the degrees of freedom, r is the number of levels of gender, c is the number of levels of the voting
preference, nr is the total number of observations from level r of gender, nc is the total number of observations
from level c of voting preference, n is the number of observations in the sample, Er,c is the expected frequency
count when gender is level r and voting preference is level c, and Or,c is the observed frequency count when
gender is level r voting preference is level c.
• The P-value is the probability that a chi-square statistic having 2
degrees of freedom is more extreme than 16.2.
• We use the Chi-Square Distribution Calculator to find P(Χ2 > 16.2) =
0.0003.
• Interpret results. Since the P-value (0.0003) is less than the
significance level (0.05), we reject the null hypothesis. Thus,
we conclude that there is a relationship between gender and voting
preference.
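The same result can be checked with scipy's chi2_contingency, which computes the expected frequencies, test statistic, degrees of freedom, and P-value in one call (a sketch; assumes scipy is installed):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Male/Female, columns = Rep/Dem/Ind
observed = np.array([[200, 150, 50],
                     [250, 300, 50]])

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 1), dof, round(p, 4))  # 16.2 2 0.0003
```

The expected matrix returned by scipy matches the hand-computed E values (180, 180, 40 in the first row).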
Example 2
• The results of a random sample of children with pain from
musculoskeletal injuries treated with acetaminophen, ibuprofen, or
codeine are shown in the table. At α = 0.10, is there enough evidence
to conclude that the treatment and result are independent?
                                Acetaminophen (c.1)   Ibuprofen (c.2)   Codeine (c.3)
(r.1) Significant improvement        58 (66.7)            81 (66.7)        61 (66.7)
(r.2) Slight improvement             42 (33.3)            19 (33.3)        39 (33.3)
First, calculate the column and row totals. Then find the expected
frequency for each item and write it in the parenthesis next to the
observed frequency.

                                Acetaminophen (c.1)   Ibuprofen (c.2)   Codeine (c.3)   Total
(r.1) Significant improvement        58 (66.7)            81 (66.7)        61 (66.7)      200
(r.2) Slight improvement             42 (33.3)            19 (33.3)        39 (33.3)      100
Total                                  100                  100              100          300


Now perform the hypothesis test.
(i) The null hypothesis H0: the treatment and response are independent.
(ii) The alternative hypothesis, Ha: the treatment and response are dependent.
(iii) α = 0.10.
(iv) The degrees of freedom: (number of rows - 1)×(number of columns - 1) =
(2 − 1) × (3 − 1) = 1 × 2 = 2.
(v) The test statistic can be calculated using a table:

Row, Column   E                       O     O − E    (O − E)²   (O − E)²/E
1,1           (200×100)/300 = 66.7    58     -8.7      75.69       1.134
1,2           (200×100)/300 = 66.7    81     14.3     204.49       3.066
1,3           (200×100)/300 = 66.7    61     -5.7      32.49       0.487
2,1           (100×100)/300 = 33.3    42      8.7      75.69       2.272
2,2           (100×100)/300 = 33.3    19    -14.3     204.49       6.140
2,3           (100×100)/300 = 33.3    39      5.7      32.49       0.975

• χ² = Σ (observed − expected)²/expected = Σ (O − E)²/E
• = 1.134 + 3.066 + 0.487 + 2.272 + 6.140 + 0.975
• = 14.074
• With d.f. = 2, the P-value for χ² = 14.074 is 0.00088.
• Since the P-value (0.00088) is less than the significance level (0.10),
we reject the null hypothesis. Thus, we conclude that there is
a relationship between the treatment and response.
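As with the previous example, the calculation can be verified with scipy (a sketch; assumes scipy is available):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: significant / slight improvement; columns: the three treatments
observed = np.array([[58, 81, 61],
                     [42, 19, 39]])

chi2, p, dof, expected = chi2_contingency(observed)
# chi2 is about 14.07 with df = 2; P is well below alpha = 0.10,
# so independence of treatment and response is rejected.
```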
Phi Coefficient
• The phi coefficient (represented symbolically as φ) is a test of the relationship or association
between two dichotomous variables (A dichotomous variable is one that takes on one of only two
possible values when observed or measured). OR
• A Phi coefficient is a non-parametric test (A non parametric test (sometimes called a distribution
free test) does not assume anything about the underlying distribution) of relationships that operates
on two dichotomous (or dichotomized) variables.
• It is also called the Yule phi or Mean Square Contingency Coefficient and is used for contingency
tables when:
At least one variable is a nominal variable.
Both variables are dichotomous variables.
• It intersects variables across a 2x2 matrix to estimate whether there is a non-random pattern across
the four cells in the 2x2 matrix.
• In other words, both variables have only two, mutually exclusive responses options, such as yes/no
or left-handed/right-handed.
Interpreting the Phi Coefficient

• For a 2x2 contingency table with cell counts a, b (first row) and c, d (second row), the Φ coefficient is calculated as:

Φ = (ad − bc) / √((a + b)(c + d)(a + c)(b + d))
• The interpretation for the phi coefficient is similar to the Pearson Correlation Coefficient. The range is from
-1 to 1, where:
• 0 is no relationship.
• 1 is a perfect positive relationship: most of your data falls along the diagonal cells.
• -1 is a perfect negative relationship: most of your data is not on the diagonal.
Φ value           Interpretation
+.70 or higher    Very strong positive relationship
+.40 to +.69      Strong positive relationship
+.30 to +.39      Moderate positive relationship
+.20 to +.29      Weak positive relationship
+.01 to +.19      No or negligible relationship
0                 No relationship
-.01 to -.19      No or negligible relationship
-.20 to -.29      Weak negative relationship
-.30 to -.39      Moderate negative relationship
-.40 to -.69      Strong negative relationship
-.70 or lower     Very strong negative relationship
• Phi Coefficient from the Chi-Square Test Value
• A measure of association used for 2 x 2 tables is the phi coefficient computed from the chi-square statistic: Φ = √(χ²/N).

In this form, the measure ranges between 0 and 1, with higher values meaning a stronger association.
Example: Find the phi coefficient for the following data (Marital Status: 1 = Single, 2 = Married; Gender: 1 = Male, 2 = Female).

Subjects (Marital Status, Gender): A (2, 1), B (1, 1), C (1, 1), D (2, 2), E (2, 2), F (1, 1), G (2, 2), H (1, 2), I (2, 2), J (1, 1), K (1, 1), L (1, 2)

Tabulated as a 2x2 table:

                   Male   Female
Marital  Married     1        4
Status   Single      5        2

Φ = (2 − 20)/√((5)(7)(6)(6))
  = −18/√1260 = −18/35.496 = −0.507

A negative phi coefficient indicates that most of the data are in the off-diagonal cells.
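The cell-count formula can be wrapped in a small helper that reproduces this example. A sketch; `phi_coefficient` is a name of my own, not a library function:

```python
import math

def phi_coefficient(a, b, c, d):
    # Phi for a 2x2 table [[a, b], [c, d]]:
    # (ad - bc) over the square root of the product of the
    # two row totals and the two column totals.
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Marital status example: rows Married/Single, columns Male/Female
phi = phi_coefficient(1, 4, 5, 2)
print(round(phi, 3))  # -0.507
```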
The table below shows the ‘first time’ driving test results of a sample of 200 individuals classified by
gender and success or failure in the examination. We wish to explore the association between the two
variables, the null hypothesis being that there is no relationship between gender and success/failure in
driving test results. Consider a 1% significance level.

When each of the variables is dichotomous, that is, can only take two values (male/female or pass/fail),
then the phi coefficient (φ) is an appropriate test of association.

Find the row and column totals:

Gender    SUCCESS   FAILURE   Total
Male           70        28      98
Female         50        52     102
Total         120        80     200

Φ = (70×52 − 28×50)/√((98)(102)(120)(80))
  = (3640 − 1400)/9795.99 = 2240/9795.99 = 0.229

The significance of phi can be tested from the following formula:
χ² = N(φ)², where the degrees of freedom are given by df = (r-1)(c-1) = 1.

χ² = 200(0.229)² = 10.48

The critical value at the 1% significance level (df = 1) is 6.635. Our obtained value exceeds this.
We therefore reject the null hypothesis and conclude that there is a statistically significant
association between the gender of the driver and ‘first time’ driving success or failure.
Example:
• A researcher wishes to determine if a significant relationship exists between the gender of the worker and whether they experience pain
while performing an electronics assembly task. Consider a 0.05 significance level.
• First question asked is “Do you experience pain while performing the assembly task? Yes /No”
• Second question asked is “What is your gender? Male/Female”.
- Box A contains the number of Males that said Yes to the pain item (4)
– Box B contains the number of Females that said Yes to the pain item (6)
– Box C contains the number of Males that said No to the pain item (11)
– Box D contains the number of Females that said No to the pain item (8)

Step 1: Null and Alternative Hypotheses


• Ho: There is no relationship between the gender of the worker and if they feel pain while performing the task.
• H1: There is a significant relationship between the gender of the worker and if they feel pain while performing the task.

Step 2: Determine dependent and independent variables and their formats.


• Gender is the independent variable; feeling pain is the dependent variable. An independent variable is the variable doing the
causing or influencing. A dependent variable is the thing being caused or influenced by the independent variable.
• Gender is a dichotomous variable. In this study it can only take on two values: 1 = Male, 2 = Female.
• Feeling pain is a dichotomous variable. In this study it can only take on two values: 1 = Feel Pain, 2 = Don't Feel Pain.
Step 3: Choose test statistic.
• Because we are investigating the relationship between two dichotomous variables, the appropriate test statistic is the Phi
Coefficient.
Step 4: Run the Test:
– Box A contains the number of Males that said Yes to the pain item (4)
– Box B contains the number of Females that said Yes to the pain item (6)
– Box C contains the number of Males that said No to the pain item (11)
– Box D contains the number of Females that said No to the pain item (8)

        Male   Female   Total
Yes        4        6      10
No        11        8      19
Total     15       14      29

Φ = (4×8 − 6×11)/√((10)(19)(15)(14))
  = (32 − 66)/√39900 = −34/199.749 = −0.170
• Compare your score to the critical score.
• To interpret the −0.17, we need to convert it to a chi-square value.
• To do this, multiply N × (phi coefficient)².
• If the obtained score is greater than the critical value, reject the null
hypothesis and accept the alternative hypothesis.
• Here, χ² = 29 × (−0.17)² = 0.84.
• Since 0.84 is less than the critical value of 3.84 (df = 1, α = 0.05), we fail to reject the null hypothesis.

• Conclusions:
• There is no significant relationship between the genders of the workers and
if they feel pain while they perform the task.
• Both males and females have pain or (no pain) at approximate equal
frequencies.
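A sketch of the whole test for this example, including the N × phi² conversion (3.84 is the chi-square critical value at df = 1, α = 0.05):

```python
import math

# Pain (Yes/No) vs gender (Male/Female): a=4, b=6, c=11, d=8
a, b, c, d = 4, 6, 11, 8
n = a + b + c + d

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi2 = n * phi ** 2            # convert phi to a chi-square value
critical = 3.84                # df = 1, alpha = 0.05

print(round(phi, 2), round(chi2, 2))  # -0.17 0.84
print(chi2 > critical)                # False: fail to reject H0
```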
Scatter plot and its interpretations
• The most useful graph for displaying the relationship between two quantitative variables is a
scatterplot.
Interpreting Scatterplots:
• As in any graph of data, look for the overall pattern and for striking departures from that pattern.
• The overall pattern of a scatterplot can be described by the direction, form, and strength of the
relationship.
• An important kind of departure is an outlier, an individual value that falls outside the overall pattern of
the relationship.
Interpreting Scatterplots: Direction
• One important component of a scatterplot is the direction of the relationship between the two variables.
• Each observation (or point) in a scatterplot has two coordinates; the first corresponds to the first piece of
data in the pair (that’s the X coordinate; the amount that you go left or right). The second coordinate
corresponds to the second piece of data in the pair (that’s the Y-coordinate; the amount that you go up or
down). The point representing that observation is placed at the intersection of the two coordinates.

If the data show an uphill pattern as you move from left to right,
this indicates a positive relationship between X and Y. As the
X-values increase (move right), the Y-values tend to increase
(move up).
If the data show a downhill pattern as you move from left to right,
this indicates a negative relationship between X and Y. As the
X-values increase (move right) the Y-values tend to decrease (move
down).

If the data don’t seem to resemble any kind of pattern (even a vague one),
then no relationship exists between X and Y.
Interpreting Scatterplots: Form
• Another important component of a scatterplot is the form of the relationship between the two variables.

Linear relationship: the points on the scatterplot closely resemble a straight line. A relationship
is linear if one variable increases by approximately the same rate as the other variable changes
by one unit.

Curvilinear relationship: the relationship has the form of a curve, rather than a straight line. This
is due to the fact that one variable does not increase at a constant rate and may even start
decreasing after a certain point. An example is the relationship between the variable "age" and the
variable "working memory": working memory increases throughout childhood, remains steady in
adulthood, and begins decreasing around age 50.
Interpreting Scatterplots: Strength
• Another important component to a scatterplot is the strength of the relationship between the two variables.
• The strength of a linear relationship is reflected in how tightly the points cluster around a
straight line: the closer the points lie to a single line, the stronger the relationship.
• The strength of the relationship between two variables is a crucial piece of information. Relying on the
interpretation of a scatterplot is too subjective. More precise evidence is needed, and this evidence is
obtained by computing a coefficient that measures the strength of the relationship under investigation.
Correlation Coefficient
• A correlation is a statistical measure of the relationship between two variables.
• The correlation coefficient is a measure of the association between two variables. OR
• The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two
variables.
• The values range between -1.0 and 1.0.
• A calculated number greater than 1.0 or less than -1.0 means that there was an error in the correlation measurement.
• A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation.
• A correlation of 0.0 shows no linear relationship between the movement of the two variables.
• The absolute value of the correlation coefficient gives us the relationship strength. The larger the number, the stronger the
relationship.
• Pearson correlation coefficient or Pearson’s correlation coefficient or Pearson’s r is defined in statistics as the measurement
of the strength of the relationship between two variables and their association with each other.
Pearson correlation coefficient formulas

Pearson correlation coefficient:

r = (n∑xy − ∑x∑y) / √([n∑x² − (∑x)²][n∑y² − (∑y)²])

(Here r = Pearson correlation coefficient,
x = values in the first set of data,
y = values in the second set of data,
n = total number of values.)
Sample correlation coefficient:

r = Sxy / (Sx Sy)

Here, Sx and Sy are the sample standard deviations, and Sxy is the sample covariance.

Population correlation coefficient:

ρ = σxy / (σx σy)

The population correlation coefficient uses σx and σy as the population standard deviations,
and σxy as the population covariance.
Marks obtained by 5 students in EDA Test1 and Test2 as given below:

Test1(x) 15 16 12 10 8
Test2(y) 18 11 10 20 17

Calculate the Pearson correlation coefficient.


Construct the following table:
x       y       x²        y²        xy
15      18      225       324       270
16      11      256       121       176
12      10      144       100       120
10      20      100       400       200
8       17       64       289       136
∑x = 61   ∑y = 76   ∑x² = 789   ∑y² = 1234   ∑xy = 902
• The formula for the Pearson correlation coefficient gives:

r = (5×902 − 61×76) / √([5×789 − (61)²][5×1234 − (76)²])
  = (4510 − 4636)/√([3945 − 3721][6170 − 5776])
  = −126/√((224)(394))
  = −126/√88256
  = −126/297.079
  = −0.424
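The hand calculation can be checked with numpy (a sketch; np.corrcoef returns the full 2x2 correlation matrix, and the off-diagonal entry is Pearson's r):

```python
import numpy as np

x = np.array([15, 16, 12, 10, 8])   # Test1 marks
y = np.array([18, 11, 10, 20, 17])  # Test2 marks

r = np.corrcoef(x, y)[0, 1]         # off-diagonal entry is Pearson's r
print(round(r, 3))  # -0.424
```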
John is an investor. His portfolio primarily tracks the performance of the S&P 500 and John wants to add the
stock of Apple Inc. Before adding Apple to his portfolio, he wants to assess the correlation between the stock
and the S&P 500 to ensure that adding the stock won’t increase the systematic risk of his portfolio. To find the
correlation coefficient, John gathers the following prices for the last five years.

Determine the correlation between the prices of the S&P 500 Index and Apple Inc.

Solution:
Advantages and disadvantages of Pearson Correlation Coefficient

Advantages:
• It helps in knowing how strong the relationship between the two variables is. Not only the presence or the absence of
the correlation between the two variables is indicated using the Pearson Correlation Coefficient but it also determines
the exact extent to which those variables are correlated.
• Using this method, one can ascertain the direction of correlation i.e. whether the correlation between two variables is
negative or positive.
Disadvantages:
• The Pearson correlation coefficient r is not sufficient to tell the difference between the dependent and
independent variables, as the correlation coefficient between the variables is symmetric. For example, if a person
is trying to find the correlation between high stress and blood pressure, a high value of the correlation might
suggest that high stress causes high blood pressure. If the variables are switched around, the result will be
the same, suggesting that stress is caused by blood pressure, which makes no sense. Thus, the researcher
should be aware of the data being used for conducting the analysis.
• Using this method one cannot get the information about the slope of the line as it only states whether any relationship
between the two variables exists or not.
• It is likely that the Pearson Correlation Coefficient may be misinterpreted especially in case of the homogeneous data.
• When compared with the other methods of the calculation, this method takes much time for arriving at the results.
Regression coefficient
• Regression is a method or an algorithm in Machine Learning that models a target value based on independent
predictors. It is essentially a statistical tool used in finding out the relationship between a dependent variable and
independent variable.
• Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and
one or more independent variables.
• In regression analysis, one variable is considered as dependent and other(s) as independent. Thus, it measures the
degree of dependence of one variable on the other(s).
• Regression coefficient is a statistical measure of the average functional relationship between two or more variables.
Note:
1. Regression coefficient is denoted by b.
2. It is expressed in terms of original unit of data.
3. Between two variables (say x and y), two values of regression coefficient can be obtained. One will be obtained when
we consider x as independent and y as dependent and the other when we consider y as independent and x as dependent.
The regression coefficient of y on x is represented as byx and that of x on y as bxy.
4. Both regression coefficients must have the same sign. If byx is positive, bxy will also be positive and vice versa.
5. For calculation of the regression coefficient from un-replicated data, three estimates are required: (1) the sums of all observations on the x and y variables (∑x, ∑y), (2) their sums of squares (∑x², ∑y²), and (3) the sum of products of paired observations on the x and y variables (∑xy).
• Regression coefficient of X on Y:
  bxy = (NΣXY - ΣXΣY) / (NΣY² - (ΣY)²)
• Regression equation of X on Y:
  X - X̄ = bxy(Y - Ȳ)
• Regression coefficient of Y on X:
  byx = (NΣXY - ΣXΣY) / (NΣX² - (ΣX)²)
• Regression equation of Y on X:
  Y - Ȳ = byx(X - X̄)
• Calculate the regression coefficients and obtain the lines of regression for the following data.

X: 1 2 3 4 5 6 7
Y: 9 8 10 12 11 13 14

X      Y      X²       Y²       XY
1      9      1        81       9
2      8      4        64       16
3      10     9        100      30
4      12     16       144      48
5      11     25       121      55
6      13     36       169      78
7      14     49       196      98
ΣX=28  ΣY=77  ΣX²=140  ΣY²=875  ΣXY=334

Calculate the means of X and Y:
X̄ = ΣX/N = 28/7 = 4
Ȳ = ΣY/N = 77/7 = 11

Regression coefficient of X on Y:
bxy = (NΣXY - ΣXΣY)/(NΣY² - (ΣY)²)
    = (7(334) - (28)(77))/(7(875) - (77)²)
    = (2338 - 2156)/(6125 - 5929)
    = 182/196
    = 0.929

Regression equation of X on Y:
X - X̄ = bxy(Y - Ȳ)
X - 4 = 0.929(Y - 11)
X - 4 = 0.929Y - 10.219
X = 0.929Y - 6.219
The regression equation of X on Y is: X = 0.929Y - 6.219

Regression coefficient of Y on X:
byx = (NΣXY - ΣXΣY)/(NΣX² - (ΣX)²)
    = (7(334) - (28)(77))/(7(140) - (28)²)
    = (2338 - 2156)/(980 - 784)
    = 182/196
    = 0.929

Regression equation of Y on X:
Y - Ȳ = byx(X - X̄)
Y - 11 = 0.929(X - 4)
Y = 0.929X - 3.716 + 11
Y = 0.929X + 7.284
The regression equation of Y on X is: Y = 0.929X + 7.284
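The hand calculation above can be verified in a few lines of Python; this sketch implements the bxy and byx formulas directly from the column totals.

```python
X = [1, 2, 3, 4, 5, 6, 7]
Y = [9, 8, 10, 12, 11, 13, 14]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)              # ΣX² = 140
syy = sum(y * y for y in Y)              # ΣY² = 875
sxy = sum(x * y for x, y in zip(X, Y))   # ΣXY = 334

bxy = (n * sxy - sx * sy) / (n * syy - sy * sy)  # regression coefficient of X on Y
byx = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # regression coefficient of Y on X

print(round(bxy, 3), round(byx, 3))  # 0.929 0.929
```

Both coefficients come out to 182/196 ≈ 0.929, matching the worked example.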
• Obtain the regression equation of Y on X and estimate Y when X = 55 from the following.

X: 40 50 38 60 65 50 35
Y: 38 60 55 70 60 48 30
X Y X2 Y2 XY
40 38 1600 1444 1520
50 60 2500 3600 3000
38 55 1444 3025 2090
60 70 3600 4900 4200
65 60 4225 3600 3900
50 48 2500 2304 2400
35 30 1225 900 1050
ΣX=338 ΣY=361 ΣX2=17094 ΣY2 =19773 ΣXY=18160
X̄ = ΣX/N = 338/7 = 48.29
Ȳ = ΣY/N = 361/7 = 51.57

Regression coefficient of Y on X:
byx = (7(18160) - (338)(361))/(7(17094) - (338)²)
    = (127120 - 122018)/(119658 - 114244)
    = 5102/5414
    = 0.942

Regression equation of Y on X:
Y - 51.57 = 0.942(X - 48.29)
Y = 0.942X - 45.49 + 51.57
Y = 0.942X + 6.08
The regression equation of Y on X is Y = 0.942X + 6.08

Estimation of Y when X = 55:
Y = 0.942(55) + 6.08 = 57.89
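The same example can be checked with exact arithmetic; note that the slide's intercept 6.08 comes from rounding byx to 0.942 before computing the intercept, while unrounded arithmetic gives ≈ 6.07, and the estimate at X = 55 agrees at ≈ 57.9 either way.

```python
X = [40, 50, 38, 60, 65, 50, 35]
Y = [38, 60, 55, 70, 60, 48, 30]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))

byx = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope of Y on X
a = sy / n - byx * (sx / n)                      # intercept: Ȳ - byx·X̄

# Estimate Y when X = 55
y_at_55 = byx * 55 + a
print(round(byx, 3), round(a, 2), round(y_at_55, 2))  # 0.942 6.07 57.9
```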
• The values of x and their corresponding values of y are shown in the table below:

X: 0 1 2 3 4
Y: 2 3 5 4 6

a) Obtain the regression equations of X on Y and Y on X.
b) Estimate the value of Y when X = 10.
Comparison Between Correlation and Regression

Meaning:
  Correlation: a statistical measure that defines the co-relationship or association of two variables.
  Regression: describes how an independent variable is associated with the dependent variable.
Dependent and independent variables:
  Correlation: no distinction between the two variables.
  Regression: the two variables play different roles.
Usage:
  Correlation: to describe a linear relationship between two variables.
  Regression: to fit the best line and estimate one variable based on another variable.
Objective:
  Correlation: to find a value expressing the relationship between variables.
  Regression: to estimate values of a random variable based on the values of a fixed variable.
Relationship between two ordinal variables: Spearman Rank correlation
• Ordinal variable is a kind of categorical variable with a set order or scale to it.
• In ordinal data, there is no standard scale on which the difference in each score is measured.
• All bivariate correlation analyses express the strength of association between two variables in a
single value between -1 and +1. This value is called the correlation coefficient.
• A positive correlation coefficient indicates a positive relationship between the two variables (as
values of one variable increase, values of the other variable also increase) while a negative
correlation coefficient expresses a negative relationship (as values of one variable increase, values
of the other variable decrease). A correlation coefficient of zero indicates that no relationship
exists between the variables.
• A rank correlation coefficient measures the degree of similarity between two rankings, and can be
used to assess the significance of the relation between them.
• A Spearman correlation coefficient is also referred to as Spearman rank correlation or Spearman’s
rho. It is denoted either with the Greek letter rho (ρ), or rs. Like all correlation coefficients,
Spearman’s rho measures the strength of association between two variables.
• The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal
variables. In a monotonic relationship, the variables tend to change together, but not necessarily at
a constant rate. The Spearman correlation coefficient is based on the ranked values for each
variable rather than the raw data.
• Spearman’s correlation coefficient returns a value from -1 to 1, where:
+1 = a perfect positive correlation between ranks
-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.
• A monotonic function is one that either never increases or never decreases as its independent variable
increases.
Monotonically increasing - as the x variable increases the y variable never decreases.
Monotonically decreasing - as the x variable increases the y variable never increases.
Not monotonic - as the x variable increases the y variable sometimes decreases and sometimes increases.
Note:
• Compared to the Pearson correlation coefficient, the Spearman correlation does not require
continuous-level data (interval or ratio), because it uses ranks instead of assumptions about the
distributions of the two variables. This allows us to analyze the association between variables of ordinal
measurement levels.
• The Spearman correlation does not assume that the variables are normally distributed. A Spearman
correlation analysis can therefore be used in many cases in which the assumptions of the Pearson
correlation (continuous-level variables, linearity and normality) are not met.
• The sign of the Spearman correlation coefficient indicates the direction of association between X
(the independent variable) and Y (the dependent variable).
• If Y tends to increase when X increases, the Spearman correlation coefficient is positive.
• If Y tends to decrease when X increases, the Spearman correlation coefficient is negative.
• A Spearman correlation of zero indicates that there is no tendency for Y to either increase or
decrease when X increases.
Spearman Rank Correlation
The formula for the Spearman rank correlation coefficient when there are no tied ranks is:

rs = 1 - (6Σdi²)/(n(n² - 1))

where di = difference in paired ranks and n = number of cases.

The formula to use when there are tied ranks (repetition of ranks) is:

rs = 1 - [6(Σd² + Σ m(m² - 1)/12)]/(n(n² - 1))

When there is a repetition of ranks, a correction factor m(m² - 1)/12 is added to Σd² in the Spearman's rank correlation coefficient formula, where m is the number of times a rank is repeated. It is very important to know that this correction factor is added for every repetition of rank in both variables.
• Interpretation
• Spearman’s rank correlation coefficient is a statistical measure of the strength of a monotonic
(increasing/decreasing) relationship between paired data. Its interpretation is similar to that of
Pearson’s. That is, the closer to the ±1 means the stronger the monotonic relationship.
• The scores for nine students in ML and Big Data are as follows:
ML: 35, 23, 47, 17, 10, 43, 9, 6, 28
Big Data: 30, 33, 45, 23, 8, 49, 12, 4, 31
Compute the student’s ranks in the two subjects and compute the Spearman rank correlation.
Sorting the data in descending order and assigning ranks:

ML:       47(1)  43(2)  35(3)  28(4)  23(5)  17(6)  10(7)  9(8)  6(9)
Big Data: 49(1)  45(2)  33(3)  31(4)  30(5)  23(6)  12(7)  8(8)  4(9)

Assigning the ranks to the original data and finding d:

ML   Rank   Big Data   Rank   d    d²
35   3      30         5      -2   4
23   5      33         3      2    4
47   1      45         2      -1   1
17   6      23         6      0    0
10   7      8          8      -1   1
43   2      49         1      1    1
9    8      12         7      1    1
6    9      4          9      0    0
28   4      31         4      0    0
                              Σd² = 12

rs = 1 - (6 × 12)/(9(81 - 1))
   = 1 - 72/720
   = 1 - 0.1 = 0.9
This indicates a strong positive relationship between the ranks individuals obtained in the ML and Big Data
exam. That is, the higher you ranked in ML, the higher you ranked in Big Data also, and vice versa.
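Assuming SciPy is available, scipy.stats.spearmanr reproduces the hand computation; with no tied ranks it is exactly the Σd² formula.

```python
from scipy.stats import spearmanr

ml = [35, 23, 47, 17, 10, 43, 9, 6, 28]
big_data = [30, 33, 45, 23, 8, 49, 12, 4, 31]

# spearmanr ranks each list internally and correlates the ranks
rho, p = spearmanr(ml, big_data)
print(round(rho, 1))  # 0.9
```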
Spearman Rank Correlation: with Tied Ranks
• Tied ranks occur when two or more items in a column share the same value. Each tied data point is assigned the mean of the ranks it would otherwise occupy.
• Calculate the rank correlation coefficient from the following data:

Expenditure on advertisement: 10 15 14 25 14 14 20 22
Profit:                       6 25 12 18 25 40 10 7

Sort the data in descending order and assign ranks (ties receive the mean rank):

Exp. Advt (x): 25(1)  22(2)  20(3)  15(4)  14(6)  14(6)  14(6)  10(8)
Profit (y):    40(1)  25(2.5)  25(2.5)  18(4)  12(5)  10(6)  7(7)  6(8)

Assigning the ranks to the original data and finding d:

Exp. Advt(x)   Rank(x)   Profit(y)   Rank(y)   d     d²
10             8         6           8         0     0
15             4         25          2.5       1.5   2.25
14             6         12          5         1     1
25             1         18          4         -3    9
14             6         25          2.5       3.5   12.25
14             6         40          1         5     25
20             3         10          6         -3    9
22             2         7           7         -5    25
                                               Σd² = 83.5
N = 8
The formula to use when there are tied ranks (repetition of ranks):

rs = 1 - [6(Σd² + correction factors)]/(n(n² - 1))

Here rank 6 is repeated three times in the ranks of x and rank 2.5 is repeated twice in the ranks of y, so the correction factor is 3(3² - 1)/12 + 2(2² - 1)/12 = 2 + 0.5.

Hence the rank correlation coefficient is:

rs = 1 - (6(83.5 + 2 + 0.5))/(8(8² - 1))
   = 1 - 6(86)/504
   = 1 - 516/504
   = 1 - 1.0238
   = -0.0238

The coefficient is close to zero, indicating a very weak negative association between expenditure on advertisement and profit.
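The tie handling above can be sketched in Python. Here rank_desc and correction are hypothetical helper names: one assigns descending ranks with ties given their mean rank, and the other accumulates the m(m² - 1)/12 terms for every group of tied ranks.

```python
from collections import Counter

def rank_desc(values):
    """Rank values in descending order; tied values get the mean of their ranks."""
    order = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = order.index(v) + 1          # first position the value occupies
        count = order.count(v)              # how many times it is tied
        ranks[v] = first + (count - 1) / 2  # mean of the occupied positions
    return [ranks[v] for v in values]

def correction(ranks):
    """Sum of m(m²-1)/12 over every group of m tied ranks."""
    return sum(m * (m * m - 1) / 12 for m in Counter(ranks).values() if m > 1)

x = [10, 15, 14, 25, 14, 14, 20, 22]   # expenditure on advertisement
y = [6, 25, 12, 18, 25, 40, 10, 7]     # profit

rx, ry = rank_desc(x), rank_desc(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # Σd² = 83.5
cf = correction(rx) + correction(ry)             # 2 + 0.5

n = len(x)
rs = 1 - 6 * (d2 + cf) / (n * (n * n - 1))
print(round(rs, 4))  # -0.0238
```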
Example
• Calculate the Spearman rank correlation coefficient from the following data:

X: 10 20 30 30 40 45 50
Y: 15 20 25 30 40 40 40
• Quotations of index numbers of equity share prices of a certain joint stock company and the prices of preference shares are given below.

Years:             2013  2014  2015  2016  2017  2008  2007
Equity shares:     97.5  99.4  98.6  96.2  95.1  98.4  97.1
Preference shares: 75.1  75.9  77.1  78.2  79    74.6  76.2

Using the method of rank correlation, determine the relationship between equity share and preference share prices.
Sort the data in descending order and assign ranks:

Equity shares (x):     99.4(1)  98.6(2)  98.4(3)  97.5(4)  97.1(5)  96.2(6)  95.1(7)
Preference shares (y): 79.0(1)  78.2(2)  77.1(3)  76.2(4)  75.9(5)  75.1(6)  74.6(7)

Assigning the ranks to the original data and finding d:

Equity   Rank(x)   Preference   Rank(y)   d    d²
97.5     4         75.1         6         -2   4
99.4     1         75.9         5         -4   16
98.6     2         77.1         3         -1   1
96.2     6         78.2         2         4    16
95.1     7         79.0         1         6    36
98.4     3         74.6         7         -4   16
97.1     5         76.2         4         1    1
Σd² = 90, n = 7

rs = 1 - (6Σd²)/(n(n² - 1))
   = 1 - [6(90)/(7(49 - 1))]
   = 1 - [540/336]
   = 1 - 1.607
   = -0.607

There is a moderately strong negative relationship between equity share and preference share prices.
• Compute the rank correlation coefficient for the following data of the marks obtained by 8 students in Python and Time Series.

Marks in Python:      15 20 28 12 40 60 20 80
Marks in Time Series: 40 30 50 30 20 10 30 60
Kendall’s Tau Coefficients
• Kendall rank correlation coefficient (often called Kendall’s τ or tau) is a non-parametric test which
measures the strength of the relationship between two variables.
• This correlation procedure was developed by Kendall (1938). Kendall’s tau is based on an analysis
of two sets of ranks, X and Y.
• This test is also used when one or more of the assumptions of the Pearson correlation are violated. It is also an alternative to Spearman's rho when the sample size is small.
• Like Spearman's rho, the range of tau is from -1.00 to +1.00. Though there are some similarities in the properties of tau and rs, the logic employed by tau is entirely different from that of rho.
• The interpretation is based on the sign and the value. Higher value indicates stronger relationship.
Positive value indicates positive relationship and negative value indicates negative relationship.
• The tau is based on concordant and discordant among two sets of ranks.
• Concordant pairs: the number of observed ranks below a particular rank which are larger than that particular rank.
• Discordant pairs: the number of observed ranks below a particular rank which are smaller than that particular rank.

For example:

Student   Rx   Ry   C   D
A         1    1    3   0
B         2    3    1   1
C         3    4    0   1
D         4    2
The Kendall τ coefficient is defined as:

τ = (nC - nD) / (n(n - 1)/2)

where
τ = value of τ obtained on the sample
nC = total number of concordant pairs
nD = total number of discordant pairs
n = number of observations
Steps to find Kendall’s Tau Coefficient:
Step 1: Make a table of rankings. The first column, “Observation” is optional and for reference only. The
rankings for Rx should be in ascending order.
Step 2: Accordingly ranks of Ry are arranged as it is.
Step 3: Count the number of concordant pairs from Ry. Concordant pairs are how many larger ranks are below a
certain rank.
Step 4: Count the number of discordant pairs and insert them into the next column.
Step 5: Sum the values in the concordant and discordant columns.
Step 6: Calculate Kendall's tau using the formula.
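The six steps can be sketched as a short function (kendall_tau is a hypothetical helper name, not from the source): it orders the Ry ranks by ascending Rx, then counts larger (concordant) and smaller (discordant) ranks below each position.

```python
def kendall_tau(rx, ry):
    """Kendall's tau from two lists of ranks (no ties assumed)."""
    # Steps 1-2: order the Ry ranks by ascending Rx rank
    ry_sorted = [y for _, y in sorted(zip(rx, ry))]
    n = len(ry_sorted)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ry_sorted[j] > ry_sorted[i]:    # Step 3: concordant pair
                nc += 1
            elif ry_sorted[j] < ry_sorted[i]:  # Step 4: discordant pair
                nd += 1
    # Steps 5-6: totals and the tau formula
    return (nc - nd) / (n * (n - 1) / 2)

# Ranks for four subjects A-D on two variables
print(round(kendall_tau([1, 4, 3, 2], [1, 2, 4, 3]), 3))  # 0.333
```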
• Two interviewers ranked 12 candidates (A through L) for a position. The results from most preferred to least preferred are:
  Interviewer 1: ABCDEFGHIJKL
  Interviewer 2: ABDCFEHGJILK
• Calculate the Kendall tau correlation.

Step 1: Make a table of rankings. The first column, "Candidate", is optional and for reference only. The rankings for Interviewer 1 should be in ascending order.

Candidate   Interviewer1   Interviewer2   C    D
A           1              1              11   0
B           2              2              10   0
C           3              4              8    1
D           4              3              8    0
E           5              6              6    1
F           6              5              6    0
G           7              8              4    1
H           8              7              4    0
I           9              10             2    1
J           10             9              2    0
K           11             12             0    1
L           12             11
Total                                     61   5

τ = (61 - 5)/((12(12 - 1))/2)
  = 56/66
  = 0.848

The tau coefficient is 0.85, suggesting a strong relationship between the rankings.
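Assuming SciPy is available, scipy.stats.kendalltau gives the same result; with no tied ranks its tau-b coincides with the tau computed here.

```python
from scipy.stats import kendalltau

interviewer1 = list(range(1, 13))                        # A..L in preference order
interviewer2 = [1, 2, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]   # ranks from ABDCFEHGJILK

tau, p = kendalltau(interviewer1, interviewer2)
print(round(tau, 3))  # 0.848
```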
• Calculate the Kendall tau correlation for the following data.

Subjects   Rank X (Rx)   Rank Y (Ry)
A          1             1
B          4             2
C          3             4
D          2             3

Step 1: Make a table of rankings. The first column, "Subjects", is optional and for reference only. The rankings for Rank X should be in ascending order.

Subjects   Rank X (Rx)   Rank Y (Ry)   C   D
A          1             1             3   0
D          2             3             1   1
C          3             4             0   1
B          4             2
Total                                  4   2

Kendall tau can be calculated as:
τ = (4 - 2)/((4(4 - 1))/2)
  = 2/6
  = 0.333
• Consider the ranking of grades and IQ levels for a group of students shown below. Calculate Kendall's correlation coefficient.

Students   Grades   IQ
A          1        1
B          2        4
E          5        2
C          3        3
D          4        5

Step 1: Make a table of rankings. The first column, "Students", is optional and for reference only. The rankings for Grades should be in ascending order.

Students   Grades   IQ   C   D
A          1        1    4   0
B          2        4    1   2
C          3        3    1   1
D          4        5    0   1
E          5        2
Total                    6   4

Kendall tau can be calculated as:
τ = (6 - 4)/(5(5 - 1)/2)
  = 2/10
  = 0.2