DSBDA Unit 2
Statistical Inference
-Ashwini Jarali
Computer Engineering
- The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth
of the data distribution. They are more commonly referred to as
quartiles.
- The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100 equal-sized
consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles.
• A plot of the data distribution for some attribute X, with the quantiles plotted being the quartiles. The three quartiles divide the distribution into four equal-size consecutive subsets; the second quartile corresponds to the median.
• The distance between the first and third quartiles is a simple measure
of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as
IQR = Q3 - Q1
• 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (salaries in thousands of dollars).
• The data in the above example contain 12 observations, already sorted in increasing order.
• Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted list.
• Therefore, Q1 = $47,000 and Q3 = $63,000.
• Thus, the interquartile range is IQR = 63 − 47 = $16,000.
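As a quick check, here is a minimal sketch of the same computation in NumPy (assuming NumPy ≥ 1.22, where percentile accepts a method argument; inverted_cdf matches the textbook convention of picking the 3rd, 6th, and 9th sorted values):

import numpy as np

# Salary data from the example above, in thousands of dollars
salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# "inverted_cdf" selects actual observations (the 3rd, 6th, and 9th values here);
# NumPy's default "linear" method would interpolate between observations instead.
q1, q2, q3 = np.percentile(salaries, [25, 50, 75], method="inverted_cdf")

print(q1, q2, q3)        # 47 52 63
print("IQR =", q3 - q1)  # IQR = 16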
• The prices (in dollars) for a sample of round-trip flights from Chicago, Illinois to Cancun, Mexico are listed. What is the mean price of the flights?
• 872 432 397 427 388 782 397
• The sum of the flight prices is 872 + 432 + 397 + 427 + 388 + 782 + 397 = 3695. To find the mean price, divide the sum of the prices by the number of prices in the sample: 3695 / 7 ≈ $527.86.
Finding a Weighted Mean
You are taking a class in which your grade is determined from five sources:
50% from your test mean, 15% from your midterm, 20% from your final exam,
10% from your computer lab work, and 5% from your homework. Your scores
are 86 (test mean), 96 (midterm), 82 (final exam), 98 (computer lab), and 100
(homework). What is the weighted mean of your scores? If the minimum
average for an A is 90, did you get an A?
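Working this out (the weights sum to 1, so the weighted mean is just the sum of weight × score):
Weighted mean = 0.50(86) + 0.15(96) + 0.20(82) + 0.10(98) + 0.05(100) = 43 + 14.4 + 16.4 + 9.8 + 5.0 = 88.6.
Since 88.6 < 90, you did not get an A.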
• A frequency distribution is symmetric when a vertical line can be drawn through the middle of a graph of the distribution and the resulting halves are approximately mirror images.
• A frequency distribution is uniform (or rectangular) when all entries, or
classes, in the distribution have equal or approximately equal frequencies.
• A uniform distribution is also symmetric.
• A frequency distribution is skewed if the “tail” of the graph elongates more
to one side than to the other. A distribution is skewed left (negatively
skewed) if its tail extends to the left. A distribution is skewed right
(positively skewed) if its tail extends to the right.
Finding the Range of a Data Set
Two corporations each hired 10 graduates. The starting salaries for each graduate are shown. Find the range of the starting salaries for Corporation A. (The range is the difference between the maximum and minimum entries in the data set.)
• Variance and Standard Deviation
- Variance and standard deviation are measures of data
dispersion.
- They indicate how spread out a data distribution is.
- A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
- The variance of N observations, x1, x2, …, xN, for a numeric attribute X is
σ² = (1/N) Σ (xi − x̄)², where x̄ is the mean of the observations; the standard deviation σ is the square root of the variance.
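A minimal sketch of this formula in NumPy, applied to the salary data (in thousands) from the earlier example:

import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = x.mean()                              # x-bar = 58
variance = ((x - mean) ** 2).sum() / len(x)  # (1/N) * sum of squared deviations
std_dev = variance ** 0.5

print(variance, std_dev)             # ~379.17, ~19.47
print(x.var(ddof=0), x.std(ddof=0))  # same results via the built-ins (population formula)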
• You have to figure out what your “tests” and “events” are first.
For two events, A and B, Bayes’ theorem allows you to figure out
p(A|B) (the probability that event A happened, given that test B
was positive) from p(B|A) (the probability that test B happened,
given that event A happened).
• Bayes’ Theorem Example #1
• You might be interested in finding out a patient’s probability of
having liver disease if they are an alcoholic. “Being an alcoholic”
is the test (kind of like a litmus test) for liver disease.
• A could mean the event “Patient has liver disease.” Past data tells
you that 10% of patients entering your clinic have liver disease.
P(A) = 0.10.
• B could mean the litmus test that “Patient is an alcoholic.” Five
percent of the clinic’s patients are alcoholics. P(B) = 0.05.
• You might also know that among those patients diagnosed with
liver disease, 7% are alcoholics. This is your B|A: the probability
that a patient is alcoholic, given that they have liver disease, is
7%.
• Bayes’ theorem tells you:
P(A|B) = P(B|A) · P(A) / P(B) = (0.07 × 0.10) / 0.05 = 0.14
• In other words, if the patient is an alcoholic, their chance of having liver disease is 0.14 (14%). This is a large increase from the 10% suggested by past data, but it is still unlikely that any particular patient has liver disease.
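The arithmetic above is simple enough to wrap in a small function; a minimal sketch (the function name and layout are just illustrative):

def bayes(p_a, p_b, p_b_given_a):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Liver-disease example: P(A) = 0.10, P(B) = 0.05, P(B|A) = 0.07
print(bayes(p_a=0.10, p_b=0.05, p_b_given_a=0.07))  # 0.14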
• Bayesian Spam Filtering
• Although Bayes’ Theorem is used extensively in the medical
sciences, there are other applications. For example, it’s used
to filter spam. The event in this case is that the message is spam.
The test for spam is that the message contains some flagged
words (like “viagra” or “you have won”). Here’s the equation set up (from Wikipedia), read as “The probability a message is spam given that it contains certain flagged words”:
P(spam | words) = P(words | spam) · P(spam) / P(words)
– Descriptive Analysis
“What is happening now based on incoming data.” It is
a method for quantitatively describing the main features of a collection
of data. Here are a few key points about descriptive analysis:
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data, such as census data.
• Description and interpretation processes are different steps.
– Diagnostic Analytics
Diagnostic analytics are used for discovery, or to determine why
something happened.
When done hands-on with a small dataset, this type of analytics is sometimes also known as causal analysis, since it involves at least one cause (usually more than one) and one effect.
• For example, for a social media marketing campaign, you can
use descriptive analytics to assess the number of posts,
mentions, followers, fans, page views, reviews, or pins, etc.
There can be thousands of online mentions that can be distilled
into a single view to see what worked and what did not work in
your past campaigns.
• There are various types of techniques available for diagnostic or
causal analytics. Among them, one of the most frequently used
is correlation.
• Predictive Analytics
– Predictive analytics has its roots in our ability to predict what
might happen. These analytics are about understanding the future using
the data and the trends we have seen in the past, as well as emerging new
contexts and processes.
– An example is trying to predict how people will spend their tax refunds
based on how consumers normally behave around a given time of the
year (past data and trends), and
how a new tax policy (new context) may affect people’s refunds.
– Predictive analytics provides companies with actionable insights based
on data. Such information includes estimates about the likelihood of a
future outcome. It is important to remember that no statistical algorithm
can “predict” the future with 100% certainty because the foundation of
predictive analytics is based on probabilities.
– Companies use these statistics to forecast what might happen.
– Some of the software most commonly used by data science
professionals for predictive analytics are SAS predictive analytics, IBM
predictive analytics, RapidMiner, and others.
• Let us assume that Salesforce kept campaign data for the last eight quarters. This data comprises total sales generated by newspaper, TV, and online ad campaigns and the associated expenditures, as provided in the accompanying table.
With this data, we can predict the sales based on the expenditures of ad
campaigns in different media for Salesforce.
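A hedged sketch of that idea in Python: the campaign table itself is not reproduced here, so the spend and sales figures below are made up purely for illustration; only the approach (fit a linear model on past quarters, then predict sales for a planned spend) reflects the text.

import numpy as np

# Hypothetical per-quarter ad spend: newspaper, TV, online (8 quarters)
X = np.array([
    [10, 40, 20], [12, 45, 25], [ 8, 38, 22], [15, 50, 30],
    [11, 42, 24], [13, 47, 28], [ 9, 40, 21], [14, 48, 29],
], dtype=float)
sales = np.array([200, 230, 190, 260, 215, 245, 198, 252], dtype=float)

# Ordinary least squares with an intercept column
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, sales, rcond=None)

# Predict sales for a planned spend of (12, 44, 26)
planned = np.array([12, 44, 26, 1.0])  # trailing 1 pairs with the intercept
print("predicted sales:", planned @ coef)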
• Predictive analytics has a number of common applications.
• For example, many people turn to predictive analytics to
produce their credit scores.
Financial services use such numbers to determine the
probability that a customer will make their credit payments on
time.
Customer relationship management (CRM) is another common area for predictive analytics. Here, the process
contributes to objectives such as marketing campaigns, sales,
and customer service.
• Predictive analytics applications are also used in the healthcare
field. They can determine which patients are at risk for
developing certain conditions such as diabetes, asthma, and
other chronic or serious illnesses.
• Prescriptive Analytics
– Prescriptive analytics is the area of business analytics dedicated to finding
the best course of action for a given situation. This may start by first
analyzing the situation (using descriptive analysis), but then moves
toward finding connections among various parameters/variables, and their relation to each other, to address a specific problem.
– A process-intensive task, the prescriptive approach analyzes potential
decisions, the interactions between decisions, the influences that bear
upon these decisions, and the bearing all of this has on an outcome to
ultimately prescribe an optimal course of action in real time.
– Prescriptive analytics can also suggest options for taking advantage of a future opportunity or mitigating a future risk, and illustrate the implications of each option.
– Specific techniques used in prescriptive analytics include optimization,
simulation, game theory, and decision-analysis methods.
Exploratory Analysis
-Exploratory analysis is an approach to analyzing datasets to find previously
unknown relationships. Often such analysis involves using various data
visualization approaches.
- Exploratory analysis consists of a range of techniques, and its application is varied as well.
However, the most common application is looking for patterns in the data,
such as finding groups of similar genes from a collection of samples.
- Let us consider the US census data available from the US census website. This data has dozens of variables; if you are looking for something specific (e.g., which state has the highest population), you could go with descriptive analysis. If you are trying to predict something (e.g., which city will have the lowest influx of immigrant population), you could use prescriptive or predictive analysis. But if someone gave you this data and asked you to find interesting insights, then what do you do? You could still do descriptive or prescriptive analysis, but given that there are lots of variables with massive amounts of data, it may be futile to try all possible combinations of those variables. So, you need to go exploring.
Mechanistic Analysis
-Mechanistic analysis involves understanding the exact changes in
variables that lead to changes in other variables for individual objects.
-For instance, we may want to know how the number of free doughnuts
per employee per day affects employee productivity. Perhaps by giving
them one extra doughnut we gain a 5% productivity boost, but two extra
doughnuts could end up making them lazy (and diabetic)
-More seriously, though, think about studying the effects of carbon
emissions on bringing about the Earth’s climate change. Here, we are
interested in seeing how the increased amount of CO2 in the atmosphere
is causing the overall temperature to change.
• Basics and need of hypothesis & hypothesis testing
– Hypothesis testing is a common statistical tool used in research and data science to support the certainty of findings. The aim of testing is to answer how probable it is that an apparent effect was detected by chance, given a random data sample.
• What is a hypothesis?
A hypothesis is often described as an “educated guess” about a specific
parameter or population. Once it is defined, one can collect data to determine
whether it provides enough evidence that the hypothesis is true.
Parameters and statistics
In statistics, a parameter is a description of a population,
while a statistic describes a small portion of a population (sample).
For example, if you ask everyone in your class (the population) for their height, the resulting average height is a parameter, a true description of the population, since everyone was asked.
If you now want to guess the average height of people in your grade
(population) using the information you have from your class (sample), this
information turns into a statistic.
• A hypothesis is a calculated prediction or assumption about
a population parameter based on limited evidence. The whole
idea behind hypothesis formulation is testing—this means the
researcher subjects his or her calculated assumption to a series of
evaluations to know whether they are true or false.
• Typically, every research starts with a hypothesis—the
investigator makes a claim and experiments to prove that this
claim is true or false. For instance, if you predict that students
who drink milk before class perform better than those who don't,
then this becomes a hypothesis that can be confirmed or refuted
using an experiment.
• Hypothesis testing is an assessment method that allows researchers to
determine the plausibility of a hypothesis. It involves testing an
assumption about a specific population parameter to know whether it's true
or false. These population parameters include variance, standard deviation,
and median.
• Typically, hypothesis testing starts with developing a null hypothesis and
then performing several tests that support or reject the null hypothesis. The
researcher uses test statistics to compare the association or relationship
between two or more variables.
• How Hypothesis Testing Works
• The basis of hypothesis testing is to examine and analyze the null hypothesis
and alternative hypothesis to know which one is the most plausible
assumption. Since both assumptions are mutually exclusive, only one can be true. In other words, if the null hypothesis holds, the alternative cannot, and vice versa.
What are the Types of Hypotheses?
1. Simple Hypothesis
2. Complex Hypothesis
3. Null Hypothesis
4. Alternative Hypothesis
5. Logical Hypothesis
6. Empirical Hypothesis
7. Statistical Hypothesis
• Five-Step Procedure for Testing a Hypothesis
Step 1: State the Null Hypothesis (H0) and the Alternate
Hypothesis (H1):
• The first step is to state the hypothesis being tested. It is called
the null hypothesis, designated H0, and read “H sub zero.”
The capital letter H stands for hypothesis,and the subscript
zero implies “no difference.” There is usually a “not” or a “no”
term in the null hypothesis, meaning that there is “no
change.”
• For example, the null hypothesis is that the mean number of miles driven on the steel-belted tire is not different from 60,000. The null hypothesis would be written H0: µ = 60,000.
• Generally speaking, the null hypothesis is developed for the
purpose of testing. We either reject or fail to reject the null
hypothesis. The null hypothesis is a statement that is not
rejected unless our sample data provide convincing evidence
that it is false.
• The alternate hypothesis describes what you will conclude if you reject the null hypothesis. It is written H1 and is read “H sub one.” It is also referred
to as the research hypothesis. The alternate hypothesis is accepted if the
sample data provide us with enough statistical evidence that the null
hypothesis is false.
• The actual test begins by considering two hypotheses. They are
called the null hypothesis and the alternative hypothesis. These
hypotheses contain opposing viewpoints.
• H0: The null hypothesis: It is a statement of no difference
between sample means or proportions or no difference between
a sample mean or proportion and a population mean or
proportion. In other words, the difference equals 0.
• Ha: The alternative hypothesis: It is a claim about the
population that is contradictory to H0 and what we conclude
when we reject H0.
• The following example will help clarify what is meant by the null
hypothesis and the alternate hypothesis. A recent article indicated the
mean age of U.S. commercial aircraft is 15 years. To conduct a statistical
test regarding this statement, the first step is to determine the null and
the alternate hypotheses.
• The null hypothesis represents the current or reported condition. It is
written H0: µ=15.
• The alternate hypothesis is that the statement is not true, that is, H1: µ≠
15.
• It is important to remember that no matter how the problem is stated,
the null hypothesis will always contain the equal sign. The equal sign (=)
will never appear in the alternate hypothesis. Why? Because the null
hypothesis is the statement being tested, and we need a specific value to
include in our calculations. We turn to the alternate hypothesis only if the
data suggests the null hypothesis is untrue.
• Null & Alternative hypothesis
• The null and alternative hypotheses are the two mutually
exclusive statements about a parameter or population.
• The null hypothesis (often abbreviated as H0) claims that there
is no effect or no difference.
• The alternative hypothesis (often abbreviated as H1 or HA) is
what you want to prove. Using one of the examples from above:
• H0: There is no difference in the mean return from A and B, or
the difference between A and B is zero.
• H1: There is a difference in the mean return from A and B, or the difference between A and B ≠ zero.
H0: equal (=)
Ha: not equal (≠), greater than (>), or less than (<)
Type I and Type II Error
● A type I error is the rejection of the null hypothesis when the null hypothesis is TRUE. The probability of a type I error is denoted by the Greek letter α.
● A type II error is the acceptance of the null hypothesis when the null hypothesis is FALSE. The probability of a type II error is denoted by the Greek letter β.
• Step 3: Select the Test Statistic
• There are many test statistics. Here we use both z and t as the test statistic; later we will also use test statistics such as F and χ² (chi-square).
Left: example of normally distributed data. Right: example of non-normal data distribution.
• Real-world examples
• Hypothesis 1: Average order value has increased since last financial year
-Parameter: Mean order value
-Test type: one-sample, parametric test (assuming the order value follows a
normal distribution)
• Hypothesis 2: Investing in A brings a higher return than investing in B
– Parameter: Difference in mean return
– Test type: two-sample, parametric test, also AB test (assuming the return
follows a normal distribution)
• Hypothesis 3: The new user interface converts more users into customers than
the expected 30%
– Parameter: none
– Test type: one-sample, non-parametric test (assuming number of customers
is not normally distributed)
• One-sample, two-sample, or more-sample test
• When testing hypotheses, we distinguish between one-sample, two-sample, and more-sample tests.
• In a one-sample test, a sample (average order value this year) is
compared to a known value (average order value of last year).
• In a two-sample test, two samples (investment A and B) are
compared to each other.
• What exactly is a test statistic?
– A test statistic describes how closely the distribution of your data
matches the distribution predicted under the null hypothesis of the
statistical test you are using.
– The distribution of data is how often each observation occurs, and can
be described by its central tendency and variation around that central
tendency. Different statistical tests predict different types of
distributions, so it’s important to choose the right statistical test for your
hypothesis.
– The test statistic summarizes your observed data into a single
number using the central tendency, variation, sample size, and
number of predictor variables in your statistical model.
– Generally, the test statistic is calculated as the pattern in your data
(i.e. the correlation between variables or difference between
groups) divided by the variance in the data (i.e. the standard
deviation).
• What exactly is a test statistic?
For Example You are testing the relationship between temperature
and flowering date for a certain type of apple tree. You use a long-
term data set that tracks temperature and flowering dates from the
past 25 years by randomly sampling 100 trees every year in an
experimental field.
– Null hypothesis: There is no correlation between temperature and
flowering date.
– Alternate hypothesis: There is a correlation between temperature
and flowering date.
To test this hypothesis you perform a regression test, which generates
a t-value as its test statistic. The t-value compares the observed
correlation between these variables to the null hypothesis of zero
correlation.
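A hedged sketch of this test in SciPy: the 25-year data set is not available here, so the numbers below are simulated, but scipy.stats.linregress really does return a t-based p-value for the slope, which is the test statistic described above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
temperature = rng.normal(15, 2, size=100)                        # simulated mean temps
flowering_day = 120 - 2.5 * temperature + rng.normal(0, 5, 100)  # simulated response

result = stats.linregress(temperature, flowering_day)
print(result.slope, result.pvalue)  # p-value derived from the slope's t-statistic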
• Types of test statistics (the original table listed each test statistic, its null and alternative hypotheses, and the statistical tests that use it)
Our p-value is greater than 0.05; thus we fail to reject the null hypothesis and do not have enough evidence to support the hypothesis that, on average, girls score more than 600 in the exam.
• T-test formula
• The formula for the two-sample t-test (a.k.a. the Student’s t-test) is
t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)), with pooled variance sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2),
where x̄1, x̄2 are the sample means, s1², s2² the sample variances, and n1, n2 the sample sizes.
If the sample size is large enough, then the Z test and t-Test will conclude with the same
results. For a large sample size, Sample Variance will be a better estimate of
Population variance so even if population variance is unknown, we can use the Z test
using sample variance.
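For completeness, a minimal sketch of the two-sample t-test in SciPy (the scores below are hypothetical; ttest_ind with its default equal_var=True implements the pooled Student's t formula above):

from scipy.stats import ttest_ind

sample_a = [86, 90, 79, 88, 95, 84]  # hypothetical scores, group A
sample_b = [80, 83, 77, 85, 79, 82]  # hypothetical scores, group B

t_stat, p_value = ttest_ind(sample_a, sample_b)  # equal_var=True by default
print(t_stat, p_value)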
What is the Z Test?
Z tests are a statistical way of testing a hypothesis when either:
- we know the population variance, or
- we do not know the population variance but our sample size is large (n ≥ 30).
If we have a sample size of less than 30 and do not know the population variance, then we must use a t-test.
One-Sample Z test
We perform the One-Sample Z test when we want to compare a
sample mean with the population mean.
• Here’s an Example to Understand a One Sample Z Test
• Let’s say we need to determine if girls on average score higher than
600 in the exam. We have the information that the standard
deviation for girls’ scores is 100. So, we collect the data of 20 girls
by using random samples and record their marks. Finally, we also
set our ⍺ value (significance level) to be 0.05.
In this example:
Mean Score for Girls is 641
The size of the sample is 20
The population mean is 600
Standard Deviation for Population is 100
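Filling in the computation with the standard one-sample z formula:
z = (x̄ − μ) / (σ/√n) = (641 − 600) / (100/√20) ≈ 41 / 22.36 ≈ 1.83,
and the one-tailed P(Z > 1.83) ≈ 0.034.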
Since the P-value is less than 0.05, we can reject the null hypothesis and conclude, based on our result, that girls on average score higher than 600.
• Two Sample Z Test
• We perform a Two Sample Z test when we want to compare the
mean of two samples. Here’s an Example to Understand a Two
Sample Z Test
• Here, let’s say we want to know if Girls on average score 10
marks more than the boys. We have the information that the standard deviation for girls’ scores is 100 and for boys’ scores is 90.
Then we collect the data of 20 girls and 20 boys by using random
samples and record their marks. Finally, we also set our ⍺ value
(significance level) to be 0.05.
In this example:
Mean Score for Girls (Sample Mean) is 641
Mean Score for Boys (Sample Mean) is 613.3
Standard Deviation for the Population of Girls is 100
Standard Deviation for the Population of Boys is 90
Sample Size is 20 for both Girls and Boys
Hypothesized Difference Between the Population Means is 10
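Filling in the omitted arithmetic with the standard two-sample z formula:
z = ((x̄g − x̄b) − 10) / √(σg²/ng + σb²/nb) = (27.7 − 10) / √(100²/20 + 90²/20) = 17.7 / √905 ≈ 0.59,
and the one-tailed P(Z > 0.59) ≈ 0.28. Since 0.28 > 0.05, we fail to reject the null hypothesis.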
• The critical value formula is χ²c = Σ [(O − E)² / E]. The subscript “c” is the degrees of freedom, “O” is your observed value, and E is your expected value. It’s very rare that you’ll want to actually use this formula to find a critical chi-square value by hand: the summation symbol means that you’ll have to perform a calculation for every single data item in your data set, and as you can probably imagine, the calculations can get very, very lengthy and tedious. Instead, you’ll probably want to use technology.
• A chi-square statistic is one way to show a relationship between
two categorical variables. In statistics, there are two types of
variables: numerical (countable) variables and non-numerical (categorical)
variables. The chi-squared statistic is a single number that tells you how
much difference exists between your observed counts and the counts you
would expect if there were no relationship at all in the population.
• There are a few variations on the chi-square statistic. Which one you use
depends upon how you collected the data and which hypothesis is being
tested. However, all of the variations use the same idea, which is that you are
comparing your expected values with the values you actually collect. One of
the most common forms can be used for contingency tables:
χ² = Σ [(Oi − Ei)² / Ei]
• where O is the observed value, E is the expected value, and “i” is the “ith” position in the contingency table.
• Chi Square P-Values.
– A chi square test will give you a p-value. The p-value will tell you if your test results
are significant or not. In order to perform a chi square test and get the p-value, you need
two pieces of information:
– Degrees of freedom. That’s just the number of categories minus 1.
– The alpha level (α). This is chosen by you, the researcher. The usual alpha level is 0.05 (5%), but you could also have other levels like 0.01 or 0.10.
A chi-square test for independence
• Example: a scientist wants to know if education level and marital status are related for all people in some country. He collects data on a simple random sample of n = 300 people, part of which is shown below.
• Chi-Square Test - Observed Frequencies
• A good first step for these data is inspecting the contingency table of marital status by
education. Such a table -shown below- displays the frequency distribution of marital
status for each education category separately. So let's take a look at it.
• Chi-Square Test - Column Percentages
• Although our contingency table is a great starting point, it doesn't really show
us if education level and marital status are related. This question is answered
more easily from a slightly different table as shown below.
• This table shows, for each education level separately, the percentages of respondents that fall into each marital status category. Before reading on, take a careful look at this table and ask: is marital status related to education level and, if so, how?
Marital status is clearly associated with education level. The lower someone’s education, the smaller the chance he’s married. That is: education “says something” about marital status (and reversely) in our sample. So what about the population?
• Chi-Square Test - Null Hypothesis
The null hypothesis for a chi-square independence test is that two categorical
variables are independent in some population.
Chi-Square Test - Statistical Independence
Independence means that one variable doesn’t “say anything” about another variable.
A different way of saying the exact same thing is that independence means that the relative frequencies of one variable are identical over all levels of some other variable.
Expected Frequencies
• Expected frequencies are the frequencies we expect in our sample
if the null hypothesis holds.
• If education and marital status are independent in our population, then we
expect this in our sample too. This implies the contingency table -holding
expected frequencies- shown below.
These expected frequencies are calculated as
eij = (oi × oj) / N
where
eij is an expected frequency;
oi is a marginal column frequency;
oj is a marginal row frequency;
N is the total sample size.
So for our first cell, that’ll be eij = (39 × 90) / 300 = 11.7.
• Test Statistic
• The chi-square test statistic is calculated as χ² = Σ [(oij − eij)² / eij], summed over all cells of the contingency table.
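A hedged sketch of this test in SciPy: the full 300-person table is not reproduced in these slides, so the counts below are illustrative, but chi2_contingency computes exactly the expected frequencies eij and the chi-square statistic defined above.

import numpy as np
from scipy.stats import chi2_contingency

# Illustrative observed counts: rows = marital status, cols = education level
observed = np.array([
    [18, 36, 21,  9],   # married
    [12, 36, 27, 15],   # never married
    [ 6, 18, 33, 69],   # divorced / other
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)  # dof = (rows - 1) * (cols - 1)
print(expected)            # the e_ij table implied by independence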
In simpler and general terms, it can be stated that the ANOVA test is used to identify which process, among all the other
processes, is better. The fundamental concept behind the Analysis of Variance is the “Linear Model”.
Example of ANOVA
An example to understand this can be prescribing medicines. Suppose patients are given three different medicines that have the same functionality, i.e., to cure fever. To understand the effectiveness of each medicine and choose the best among them, the ANOVA test is used.
You may wonder whether a t-test could be used instead of the ANOVA test. You are probably right, but since t-tests compare only two things at a time, you would have to run multiple t-tests to come up with an outcome. That is not the case with the ANOVA test.
That is why the ANOVA test is also regarded as an extension of the t-test and z-test.
Terminologies in ANOVA Test
There are a few terms that we continuously come across while performing the ANOVA test. We have listed and explained them below:
1. Mean (Grand and Sample Mean)
As we know, a mean is defined as an arithmetic average of a given range of values. In the ANOVA test, two types of mean are calculated: the grand mean and the sample mean.
A sample mean (μn) represents the average value for a group, while the grand mean (μ) represents the average value of the sample means of different groups, or the mean of all the observations combined.
2. F-Statistic
The statistic which measures the extent of difference between the means of different samples or how
significantly the means differ is called the F-statistic or F-Ratio. It gives us a ratio of the effect we are
measuring (in the numerator) and the variation associated with the effect (in the denominator).
Since we use variances to explain both the measure of the effect and the measure of the error, F is
more of a ratio of variances. The value of F can never be negative.
• When the value of F exceeds 1 it means that the variance due to the effect is larger than the
variance associated with sampling error; we can represent it as:
• When F>1, variation due to the effect > variation due to error
• If F<1, it means variation due to effect < variation due to error
• When F = 1, it means variation due to the effect = variation due to error; this situation does not favor rejecting the null hypothesis.
3. Sums of Squares
In statistics, the sum of squares is defined as a statistical technique that is used in regression
analysis to determine the dispersion of data points. In the ANOVA test, it is used while computing
the value of F.
As the sum of squares tells you about the deviation from the mean, it is also known as variation.
4. Degrees of Freedom
Degrees of freedom refers to the maximum number of logically independent values that are free to vary in a data set.
5. Mean Squared Error (MSE)
The Mean Squared Error tells us about the average error in a data set. To find the mean squared
error, we just divide the sum of squares by the degrees of freedom.
6. Null and Alternate Hypotheses
In the ANOVA test, we use the Null Hypothesis (H0) and the Alternate Hypothesis (H1). The Null Hypothesis in ANOVA is valid when the sample means are equal or have no significant difference.
The Alternate Hypothesis is valid when at least one of the sample means is different from the others.
7. Group Variability (Within-group and Between-group)
To understand group variability, we should know about groups first. In the ANOVA test, a group is the
set of samples within the independent variable.
There are variations among the individual groups as well as within the group. This gives rise to the two
terms: Within-group variability and Between-group variability.
• When there is a big variation in the sample distributions of the individual groups, it is called
between-group variability.
• On the other hand, when there are variations in the sample distribution within an individual group,
it is called Within-group variability.
Types of ANOVA Test
The ANOVA test is generally done in three ways depending on the number of Independent Variables
(IVs) included in the test. Sometimes the test includes one IV, sometimes it has two IVs, and sometimes
the test may include multiple IVs.
1. One-Way ANOVA
2. Two-Way ANOVA
3. N-Way ANOVA (MANOVA)
One-Way ANOVA
One-way ANOVA is generally the most used method of performing the ANOVA test. It is also referred to as one-factor ANOVA, between-subjects ANOVA, and independent-factor ANOVA. It is used to compare the means of two or more independent groups using the F-distribution.
To carry out the one-way ANOVA test, you should have exactly one independent variable with at least two levels. With only two levels, one-way ANOVA does not differ much from a t-test.
Example where one-way ANOVA is used: Suppose a teacher wants to know how effective his teaching has been. He can split the students of the class into different groups and assign different projects related to the topics taught to them.
He can use one-way ANOVA to compare the average score of each group, getting a rough understanding of which topics to teach again. However, he won’t be able to identify the student who could not understand the topic.
Two-way ANOVA
Two-way ANOVA is carried out when you have two independent variables. It is an extension of one-
way ANOVA. You can use the two-way ANOVA test when your experiment has a quantitative outcome
and there are two independent variables.
Two-way ANOVA with replication: It is performed when there are two groups and the
members of these groups are doing more than one thing. Our example in the beginning can be a good
example of two-way ANOVA with replication.
Two-way ANOVA without replication: This is used when you have only one group but you are
double-testing that group. For example, a patient is being observed before and after medication.
When we have multiple or more than two independent variables, we use MANOVA. The main purpose
of the MANOVA test is to find out the effect on dependent/response variables against a change in the IV.
• Does the change in the independent variable significantly affect the dependent variable?
• What are interactions among the dependent variables?
• What are interactions between independent variables?
The one way ANOVA test is used to determine whether there is any difference between the means of three or
more groups. A one way ANOVA will have only one independent variable. The hypothesis for a one way
ANOVA test can be set up as follows:
Null Hypothesis, H0: μ1 = μ2 = μ3 = ... = μk
Alternative Hypothesis, H1: the means are not all equal
Decision Rule: If test statistic > critical value, then reject the null hypothesis and conclude that the means of at least two groups differ significantly.
The steps to perform the one way ANOVA test are given below:
○ Step 1: Calculate the mean for each group.
○ Step 2: Calculate the total mean. This is done by adding all the means and dividing it by the total number
of means.
○ Step 3: Calculate the SSB.
○ Step 4: Calculate the between groups degrees of freedom.
○ Step 5: Calculate the SSE.
○ Step 6: Calculate the degrees of freedom of errors.
○ Step 7: Determine the MSB and the MSE.
○ Step 8: Find the f test statistic.
○ Step 9: Using the f table for the specified level of significance, α, find the critical value. This is given by F(α, df1, df2).
○ Step 10: If f > F then reject the null hypothesis.
● Examples on ANOVA Test
● Example 1: Three types of fertilizers are used on three groups of plants for 5 weeks. We want to
check if there is a difference in the mean growth of each group. Using the data given below apply a
one way ANOVA test at 0.05 significance level.
Fertilizer 1 Fertilizer 2 Fertilizer 3
6 8 13
8 12 9
4 9 11
5 11 8
3 6 7
4 8 12
Solution:
The group means are x̄1 = 30/6 = 5, x̄2 = 54/6 = 9, and x̄3 = 60/6 = 10, so the grand mean is (5 + 9 + 10)/3 = 8,
SSB = 6[(5 − 8)² + (9 − 8)² + (10 − 8)²] = 84, and df1 = k − 1 = 3 − 1 = 2.
The values and squared deviations within each group are:
x1 (x1 − 5)² | x2 (x2 − 9)² | x3 (x3 − 10)²
6 1 | 8 1 | 13 9
8 9 | 12 9 | 9 1
4 1 | 9 0 | 11 1
5 0 | 11 4 | 8 4
3 4 | 6 9 | 7 9
4 1 | 8 1 | 12 4
SSE = 16 + 24 + 28 = 68
N = 18
df2 = N - k = 18 - 3 = 15
MSB = SSB / df1 = 84 / 2 = 42
MSE = SSE / df2 = 68 / 15 = 4.53
ANOVA test statistic, f = MSB / MSE = 42 / (68/15) ≈ 9.26
Using the f table at α= 0.05 the critical value is given as F(0.05, 2, 15) = 3.68
As f > F, thus, the null hypothesis is rejected and it can be concluded that there is a
difference in the mean growth of the plants.
Answer: Reject the null hypothesis
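As a check, the same example in SciPy (f_oneway implements exactly the one-way ANOVA F = MSB/MSE computed above):

from scipy.stats import f_oneway

fertilizer1 = [6, 8, 4, 5, 3, 4]
fertilizer2 = [8, 12, 9, 11, 6, 8]
fertilizer3 = [13, 9, 11, 8, 7, 12]

f_stat, p_value = f_oneway(fertilizer1, fertilizer2, fertilizer3)
print(f_stat, p_value)  # F ≈ 9.26, p ≈ 0.002 < 0.05: reject H0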
• Pearson Correlation
• Pearson’s correlation coefficient is a test statistic that measures the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.
• Questions Answered:
• Do test scores and hours spent studying have a statistically significant
relationship?
• Is there a statistical association between IQ scores and depression?
• Assumptions:
• Independence of cases: cases should be independent of each other.
• Linear relationship: Two variables should be linearly related to each other.
This can be assessed with a scatterplot: plot the value of variables on a scatter
diagram, and check if the plot yields a relatively straight line.
• Homoscedasticity: the residuals scatterplot should be roughly rectangular-
shaped.
• Properties:
• Limit: Coefficient values can range from +1 to −1, where +1 indicates a perfect positive relationship, −1 indicates a perfect negative relationship, and 0 indicates that no relationship exists.
• Pure number: It is independent of the unit of measurement. For example, if one
variable’s unit of measurement is in inches and the second variable is in quintals, even
then, Pearson’s correlation coefficient value does not change.
• Symmetric: The correlation coefficient between two variables is symmetric: whether computed between X and Y or Y and X, the value remains the same.
• Degree of correlation:
• Perfect: If the value is near ± 1, then it said to be a perfect correlation: as one variable
increases, the other variable tends to also increase (if positive) or decrease (if
negative).
• High degree: If the coefficient value lies between ± 0.50 and ± 1, then it is said to be a
strong correlation.
• Moderate degree: If the value lies between ± 0.30 and ± 0.49, then it is said to be a
medium correlation.
• Low degree: When the value lies below ±0.29, then it is said to be a small correlation.
• No correlation: When the value is zero.
• Correlation coefficients are used to measure how strong a relationship is
between two variables. There are several types of correlation coefficient, but
the most popular is Pearson’s. Pearson’s correlation (also called
Pearson’s R) is a correlation coefficient commonly used in linear
regression. If you’re starting out in statistics, you’ll probably learn about
Pearson’s R first. In fact, when anyone refers to the correlation coefficient,
they are usually talking about Pearson’s.
• Correlation Coefficient Formula: Definition
• Correlation coefficient formulas are used to find how strong a relationship is
between data. The formulas return a value between -1 and 1, where:
– 1 indicates a strong positive relationship.
– -1 indicates a strong negative relationship.
– A result of zero indicates no relationship at all.
• Types of correlation coefficient formulas.
• There are several types of correlation coefficient formulas.
– One of the most commonly used formulas is Pearson’s correlation coefficient formula. If you’re taking a basic stats class, this is the one you’ll probably use:
r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
– Two other formulas are commonly used: the sample correlation coefficient and the population correlation coefficient.
– The sample correlation coefficient is r = sxy / (sx sy), where sx and sy are the sample standard deviations and sxy is the sample covariance.
– The population correlation coefficient is ρ = σxy / (σx σy), where σx and σy are the population standard deviations and σxy is the population covariance.
• What is Pearson Correlation?
• Correlation between sets of data is a measure of how well they are related.
The most common measure of correlation in stats is the Pearson
Correlation. The full name is the Pearson Product Moment Correlation
(PPMC). It shows the linear relationship between two sets of data. In
simple terms, it answers the question, Can I draw a line graph to represent
the data? Two letters are used to represent the Pearson correlation: Greek
letter rho (ρ) for a population and the letter “r” for a sample.
• Potential problems with Pearson correlation.
• The PPMC is not able to tell the difference between dependent
variables and independent variables. For example, if you are trying to find
the correlation between a high calorie diet and diabetes, you might find a
high correlation of .8. However, you could also get the same result with the
variables switched around. In other words, you could say that diabetes
causes a high calorie diet. That obviously makes no sense. Therefore, as a
researcher you have to be aware of the data you are plugging in. In
addition, the PPMC will not give you any information about the slope of
the line; it only tells you whether there is a relationship.
Example question: Find the value of the correlation coefficient from the
following table:
(table columns: SUBJECT, AGE X, GLUCOSE LEVEL Y, XY, X², Y²; the data rows appear in the original table)
• The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or 52.98%, which means
the variables have a moderate positive correlation.
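A minimal sketch in SciPy; the table's data rows are not reproduced above, so the age/glucose values below are the commonly used version of this exercise (they reproduce the stated r ≈ 0.5298):

from scipy.stats import pearsonr

age = [43, 21, 25, 42, 57, 59]      # X
glucose = [99, 65, 79, 75, 87, 81]  # Y

r, p_value = pearsonr(age, glucose)
print(r)  # ≈ 0.5298: a moderate positive correlation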