DSBDA Unit 2


Unit II

Statistical Inference

-Ashwini Jarali
Computer Engineering

International Institute of Information Technology, I²IT


www.isquareit.edu.in
Introduction to Data Science and Big
Data
• Need of statistics in Data Science and Big Data Analytics
• Measures of Central Tendency: Mean, Median, Mode, Mid-range
• Measures of Dispersion: Range, Variance, Mean Deviation,
Standard Deviation
• Bayes theorem
• Basics and need of hypothesis & hypothesis testing
• Pearson Correlation
• Sample Hypothesis testing
• Chi-Square Tests, t-test
• Need of statistics in Data Science and Big Data
Analytics
• Statistics is the practice or science of collecting and analyzing
numerical data in large quantities, especially for the purpose of
inferring proportions in a whole from those in a representative
sample.
• Statistics is the science concerned with developing and studying
methods for collecting, analyzing, interpreting and presenting
empirical data.
• A branch of mathematics dealing with the collection, analysis,
interpretation, and presentation of masses of numerical data.
• Example: Government statistics detail a surge in crime, domestic
abuse, substance abuse, alcoholism, and suicidal ideation.
• Data Science is about extraction, preparation, analysis,
visualization, and maintenance of information. It is a cross-
disciplinary field which uses scientific methods and processes to
draw insights from data.
• How are statistics used in data science?
• In data science, statistics is at the core of sophisticated machine
learning algorithms, capturing and translating data patterns into
actionable evidence. Data scientists use statistics to gather,
review, analyze, and draw conclusions from data, as well as
apply quantified mathematical models to appropriate variables.
• Measures of Central Tendency:
Measures of central tendency include the mean, median, mode,
and midrange.
-The most common and effective numeric measure of the
“center” of a set of data is the (arithmetic) mean. Let
x1, x2, …, xN be a set of N values or observations, such as for
some numeric attribute X, like salary. The mean of this set of
values is

    x̄ = (x1 + x2 + … + xN) / N

• Mean. Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Using the equation above, we have x̄ = 696/12 = 58, that is, $58,000.

• Sometimes, each value xi in a set may be associated with a weight
wi for i = 1, …, N. The weights reflect the significance, importance,
or occurrence frequency attached to their respective values. In this
case, we can compute

    x̄ = (w1x1 + w2x2 + … + wNxN) / (w1 + w2 + … + wN)

• This is called the weighted arithmetic mean or the weighted
average.
• A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values. Even a small number of extreme values can
corrupt the mean. For example, the mean salary at a company
may be substantially pushed up by that of a few highly paid
managers. Similarly, the mean score of a class in an exam could
be pulled down quite a bit by a few very low scores. To offset the
effect caused by a small number of extreme values, we can
instead use the trimmed mean
• For skewed (asymmetric) data, a better measure of the center of
data is the median, which is the middle value in a set of ordered
data values. It is the value that separates the higher half of a data
set from the lower half.

• Let’s find the median of the salary data above. It has 12
observations, an even number, so we assign the average of the
two middlemost values as the median; that is,
(52 + 56)/2 = 108/2 = 54, giving a median of $54,000.
• The mode is another measure of central tendency. The mode for a
set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative
attributes.
• Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with two
or more modes is multimodal. At the other extreme, if each data
value occurs only once, then there is no mode.
• The data from previous example are bimodal. The two modes are
$52,000 and $70,000.
• The midrange can also be used to assess the central tendency
of a numeric data set.
It is the average of the largest and smallest values in the set.
This measure is easy to compute using the SQL aggregate
functions, max() and min().
• Midrange. The midrange of the data in the previous example is
(30,000 + 110,000)/2 = $70,000.
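All four measures can be checked quickly in Python; a minimal sketch using the standard library's statistics module on the salary data above (values in thousands, multimode needs Python 3.8+):

    import statistics

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # $1,000s

    print(statistics.mean(salaries))            # 58.0 -> mean $58,000
    print(statistics.median(salaries))          # 54.0 -> median $54,000
    print(statistics.multimode(salaries))       # [52, 70] -> bimodal
    print((min(salaries) + max(salaries)) / 2)  # 70.0 -> midrange $70,000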
In a unimodal frequency curve with perfect symmetric data
distribution, the mean,median, and mode are all at the same
center value, as shown in Figure
• Data in most real applications are not symmetric. They may
instead be either positively skewed, where the mode occurs at a
value that is smaller than the median (Figure b), or negatively
skewed, where the mode occurs at a value greater than the
median (Figure c)
• Measuring the Dispersion of Data: Range, Quartiles, Variance,
Standard Deviation, and Interquartile Range
– We now look at measures to assess the dispersion or spread of
numeric data. The measures include range, quantiles, quartiles,
percentiles, and the interquartile range. The five-number
summary, which can be displayed as a boxplot, is useful in
identifying outliers. Variance and standard deviation also indicate
the spread of a data distribution.
– Let x1, x2, …, xN be a set of observations for some numeric
attribute, X. The range of the set is the difference between the
largest (max()) and smallest (min()) values.
– Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size consecutive
sets.
-The 2-quantile is the data point dividing the lower and upper
halves of the data distribution. It corresponds to the median.

- The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth
of the data distribution. They are more commonly referred to as
quartiles.
- The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100 equal-sized
consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles.

A plot of the data distribution for some attribute X. The quantiles
plotted are quartiles. The three quartiles divide the distribution
into four equal-size consecutive subsets. The second
quartile corresponds to the median.
• The distance between the first and third quartiles is a simple measure
of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as
IQR = Q3 - Q1

• 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• The data in the above example contain 12 observations, already sorted in
increasing order.
• Thus, the quartiles for this data are the third, sixth, and ninth values,
respectively, in the sorted list.
• Therefore, Q1 = $47,000 and Q3 = $63,000.
• Thus, the interquartile range is IQR = 63 – 47 = $16,000.
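A quick check in Python; note that library quartile functions interpolate between ranks, so they do not exactly reproduce the simple rank-based rule used on this slide (a sketch on the same salary data):

    import numpy as np

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

    # Rank rule used above for n = 12: Q1 = 3rd value, Q3 = 9th value
    q1, q3 = salaries[2], salaries[8]
    print(q1, q3, q3 - q1)                    # 47 63 16 -> IQR = $16,000

    # numpy interpolates between ranks, so its quartiles differ slightly
    print(np.percentile(salaries, [25, 75]))  # [49.25 64.75]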
• The prices (in dollars) for a sample of round-trip flights from Chicago,
Illinois to Cancun, Mexico are listed.What is the mean price of the flights?
• 872 432 397 427 388 782 397

• The sum of the flight prices is 872 + 432 + 397 + 427 + 388 + 782 +
397 = 3695. To find the mean price, divide the sum of the prices by the
number of prices in the sample: 3695/7 ≈ $527.90, or about $528.
Finding a Weighted Mean
You are taking a class in which your grade is determined from five sources:
50% from your test mean, 15% from your midterm, 20% from your final exam,
10% from your computer lab work, and 5% from your homework. Your scores
are 86 (test mean), 96 (midterm), 82 (final exam), 98 (computer lab), and 100
(homework). What is the weighted mean of your scores? If the minimum
average for an A is 90, did you get an A?
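A worked computation, using the weighted-mean formula above:

    x̄ = (86·0.50 + 96·0.15 + 82·0.20 + 98·0.10 + 100·0.05) / (0.50 + 0.15 + 0.20 + 0.10 + 0.05)
      = (43 + 14.4 + 16.4 + 9.8 + 5) / 1.00
      = 88.6

Since 88.6 < 90, the weighted mean falls short of the minimum average, so you did not get an A.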
• A frequency distribution is symmetric when a vertical line can be drawn
• through the middle of a graph of the distribution and the resulting halves
are approximately mirror images.
• A frequency distribution is uniform (or rectangular) when all entries, or
classes, in the distribution have equal or approximately equal frequencies.
• A uniform distribution is also symmetric.
• A frequency distribution is skewed if the “tail” of the graph elongates more
to one side than to the other. A distribution is skewed left (negatively
skewed) if its tail extends to the left. A distribution is skewed right
(positively skewed) if its tail extends to the right.
Finding the Range of a Data Set
Two corporations each hired 10 graduates. The starting salaries for each
graduate are shown. Find the range of the starting salaries for Corporation A.
• Variance and Standard Deviation
- Variance and standard deviation are measures of data
dispersion.
-They indicate how spread out a data distribution is.
- A low standard deviation means that the data observations tend
to be very close to the mean,
-while a high standard deviation indicates that the data are
spread out over a large range of values
- The variance of N observations, x1, x2, …, xN, for a numeric
attribute X is

    σ² = (1/N) Σ (xi − x̄)²,  for i = 1, …, N

where x̄ is the mean value of the observations.
• The standard deviation, σ, of the observations is the square root
of the variance.
• 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• The data in the above example contain N = 12 observations with
mean x̄ = $58,000, so
σ² = [(30 − 58)² + (36 − 58)² + … + (110 − 58)²] / 12 ≈ 379.17
and σ ≈ √379.17 ≈ 19.47 (about $19,470).
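A minimal check in Python; numpy's ddof=0 matches the population formula above, dividing by N:

    import numpy as np

    salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

    print(round(salaries.var(ddof=0), 2))  # 379.17 -> population variance
    print(round(salaries.std(ddof=0), 2))  # 19.47  -> population std deviation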
• The basic properties of the standard deviation, σ , as a measure
of spread are as follows:
– σ measures spread about the mean and should be considered
only when the mean is chosen as the measure of center.
– σ= 0 only when there is no spread, that is, when all
observations have the same value. Otherwise, σ > 0.
• Finding the Population Variance and Standard Deviation
• Using the Empirical Rule
In a survey conducted by the National Center for Health Statistics, the sample
mean height of women in the United States (ages 20–29) was 64.3 inches, with
a sample standard deviation of 2.62 inches. Estimate the percent of women
whose heights are between 59.06 inches and 64.3 inches.
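A worked answer: 59.06 = 64.3 − 2(2.62), so the interval runs from two standard deviations below the mean up to the mean. The Empirical Rule places about 95% of values within two standard deviations of the mean; by symmetry, half of that lies below the mean. So approximately 47.5% of the women have heights between 59.06 and 64.3 inches.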
• Another important application of quartiles is to represent data sets using
box-and-whisker plots. A box-and-whisker plot (or boxplot) is an
exploratory data analysis tool that highlights the important features of a data
set. To graph a box-and-whisker plot, you must know the following values:
• 1. The minimum entry
• 2. The first quartile Q1
• 3. The median Q2
• 4. The third quartile Q3
• 5. The maximum entry
• These five numbers are called the five-number summary of the data set.
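A minimal matplotlib sketch that draws the five-number summary; the salary data is reused here as an assumption, since the slide names no data set:

    import matplotlib.pyplot as plt

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

    # The box spans Q1..Q3 with a line at the median; points beyond the
    # whiskers (here the 110 value) are drawn as potential outliers.
    plt.boxplot(salaries, vert=False)
    plt.xlabel("Salary ($1,000s)")
    plt.show()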
• Bayes Theorem:
– Bayes’ theorem describes the probability of occurrence of
an event related to any condition. It is also considered for the
case of conditional probability.
– Bayes’ theorem is a way to figure out conditional
probability.
– Conditional probability is the probability of an event
happening, given that it has some relationship to one or more
other events.
– For example, your probability of getting a parking space is
connected to the time of day you park, where you park, and
what conventions are going on at any time.
– it gives you the actual probability of an event given
information about tests.
• “Events” Are different from “tests.” For example, there is
a test for liver disease, but that’s separate from the event of
actually having liver disease.
• Tests are flawed: just because you have a positive test does not
mean you actually have the disease.
• Many tests have a high false positive rate. Rare events tend to
have higher false positive rates than more common events.
• We’re not just talking about medical tests here. For example,
spam filtering can have high false positive rates.
• Bayes’ theorem takes the test results and calculates your real
probability that the test has identified the event.
• EVENT A collection of one or more outcomes of an experiment.
• OUTCOME A particular result of an experiment.
• EXPERIMENT A process that leads to the occurrence of one and only one
of several possible observations.
• Classical Probability
• Classical probability is based on the assumption that the outcomes of an
experiment are equally likely. Using the classical viewpoint, the probability
of an event happening is computed by dividing the number of favorable
outcomes by the number of possible outcomes:

    P(event) = number of favorable outcomes / total number of possible outcomes
• The variable “gender” presents mutually exclusive outcomes, male and
female. An employee selected at random is either male or female but
cannot be both. A manufactured part is acceptable or unacceptable. The
part cannot be both acceptable and unacceptable at the same time. In a
sample of manufactured parts, the event of selecting an unacceptable part
and the event of selecting an acceptable part are mutually exclusive.
• If an experiment has a set of events that includes every possible outcome,
such as the events “an even number” and “an odd number” in the die-
tossing experiment, then the set of events is collectively exhaustive. For
the die-tossing experiment,every outcome will be either even or odd. So
the set is collectively exhaustive.
• The empirical approach to probability is based on what is called the law of
large numbers. The key to establishing probabilities empirically is that
more observations will provide a more accurate estimate of the
probability.
• Subjective Probability If there is little or no experience or information on
which to base a probability, it may be arrived at subjectively. Essentially,
this means an individual evaluates the available opinions and information
and then estimates or assigns the probability.
• This probability is aptly called a subjective probability.
• Illustrations of subjective probability are:
• 1. Estimating the likelihood the New England Patriots will play in the Super
Bowl next year.
• 2. Estimating the likelihood you will be married before the age of 30.
• 3. Estimating the likelihood the U.S. budget deficit will be reduced by half in
the next 10 years.
Summary of Approaches to
Probability
• Some Rules for Computing Probabilities
• Rules of Addition
• There are two rules of addition, the special rule of addition and the
general rule of addition. We begin with the special rule of addition.
• For three mutually exclusive events designated A, B, and C, the rule is
written:
• P(A or B or C) = P(A) +P(B)+ P(C)
• An example will help to show the details.
Complement Rule The probability that a bag of mixed vegetables selected is
underweight, P(A), plus the probability that it is not an underweight bag,
written P(~A) and read “not A,” must logically equal 1.
This is written: P(A) + P(~A) = 1
This can be revised to read: P(A) = 1 − P(~A)
This is the complement rule. It is used to determine the probability of an
event occurring by subtracting the probability of the event not occurring from 1.
• Rules of Multiplication
• we find the likelihood that two events both happen.
• For example, a marketing firm may want to estimate the likelihood
that a person is 21 years old or older and buys a Hummer.
• Venn diagrams illustrate this as the intersection of two events.
• To find the likelihood of two events happening we use the rules of
multiplication. There are two rules of multiplication, the special rule
and the general rule.
• Special Rule of Multiplication The special rule of multiplication
requires that two events A and B are independent. Two events are
independent if the occurrence of one event does not alter the
probability of the occurrence of the other event.
• For three independent events, A, B, and C, the special rule of
multiplication used to determine the probability that all three events will
occur is: P(A and B and C) = P(A)P(B)P(C)
• Assume that the events A1 and A2 are mutually exclusive and collectively
exhaustive, and Ai refers to either event A1 or A2. Hence A1 and A2 are in
this case complements. The meaning of the symbols used is illustrated by the
following example.
• Suppose 5 percent of the population of Umen, a fictional Third World
country, have a disease that is peculiar to that country. We will let A1
refer to the event “has the disease” and A2 refer to the event “does not
have the disease.” Thus, we know that if we select a person from Umen at
random, the probability the individual chosen has the disease is .05, or
P(A1)=.05 This probability P(A1)=P(has disease)=.05,is called the prior
probability. It is given this name because the probability is assigned
before any empirical data are obtained.
• The prior probability a person is not afflicted with the disease is therefore
.95,or P(A2)=.95,found by 1-.05=.95
• There is a diagnostic technique to detect the disease, but it is not very
accurate.
• Let B denote the event “test shows the disease is present.” Assume that
historical evidence shows that if a person actually has the disease, the
probability that the test will indicate the presence of the disease is .90.
Using the conditional probability definitions developed earlier in this
chapter, this statement is written as: P(B|A1)=.90
• Assume the probability is .15 that for a person who actually does not have
the disease the test will indicate the disease is present.
• P(B|A2)=.15
• Let’s randomly select a person from Umen and perform the test. The test
results indicate the disease is present. What is the probability the person
actually has the disease? In symbolic form, we want to know P(A1|B),
which is interpreted as P(has the disease | the test results are positive).
The probability P(A1|B) is called a posterior probability.
• So the probability that a person has the disease, given that he or she
tested positive, is .24. How is the result interpreted? If a person is
selected at random from the population, the probability that he or she
has the disease is .05. If the person is tested and the test result is
positive, the probability that the person actually has the disease is
increased about fivefold, from .05 to .24.
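The .24 follows from Bayes' theorem applied to the numbers above:

    P(A1|B) = P(A1)·P(B|A1) / [P(A1)·P(B|A1) + P(A2)·P(B|A2)]
            = (.05)(.90) / [(.05)(.90) + (.95)(.15)]
            = .045 / .1875 = .24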
• Bayes’ Theorem (also known as Bayes’ rule) is a deceptively
simple formula used to calculate conditional probability. The
Theorem was named after English mathematician Thomas Bayes
(1701-1761). The formal definition for the rule is:

    P(A|B) = P(B|A) · P(A) / P(B)

• You have to figure out what your “tests” and “events” are first.
For two events, A and B, Bayes’ theorem allows you to figure out
p(A|B) (the probability that event A happened, given that test B
was positive) from p(B|A) (the probability that test B happened,
given that event A happened).
• Bayes’ Theorem Example #1
• You might be interested in finding out a patient’s probability of
having liver disease if they are an alcoholic. “Being an alcoholic”
is the test (kind of like a litmus test) for liver disease.
• A could mean the event “Patient has liver disease.” Past data tells
you that 10% of patients entering your clinic have liver disease.
P(A) = 0.10.
• B could mean the litmus test that “Patient is an alcoholic.” Five
percent of the clinic’s patients are alcoholics. P(B) = 0.05.
• You might also know that among those patients diagnosed with
liver disease, 7% are alcoholics. This is your B|A: the probability
that a patient is alcoholic, given that they have liver disease, is
7%.
• Bayes’ theorem tells you:
P(A|B) = (0.07 * 0.1)/0.05 = 0.14
• In other words, if the patient is an alcoholic, their chances of
having liver disease is 0.14 (14%). This is a large increase from
the 10% suggested by past data. But it’s still unlikely that any
particular patient has liver disease.
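A one-line check of this example in Python (a sketch of the arithmetic, not a library routine):

    # P(A|B) = P(B|A) * P(A) / P(B)
    p_a = 0.10          # P(liver disease)
    p_b = 0.05          # P(alcoholic)
    p_b_given_a = 0.07  # P(alcoholic | liver disease)

    print(round(p_b_given_a * p_a / p_b, 2))  # 0.14 -> 14%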
• Bayesian Spam Filtering
• Although Bayes’ Theorem is used extensively in the medical
sciences, there are other applications. For example, it’s used
to filter spam. The event in this case is that the message is spam.
The test for spam is that the message contains some flagged
words (like “viagra” or “you have won”). Here’s the equation set
up (from Wikipedia), read as “The probability a message is spam
given that it contains certain flagged words”:
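The equation referred to, in the standard form Wikipedia gives for a single flagged word W:

    P(spam | W) = P(W | spam) · P(spam) / [P(W | spam) · P(spam) + P(W | not spam) · P(not spam)]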
– Descriptive Analysis
“What is happening now based on incoming data.” It is
a method for quantitatively describing the main features of a collection
of data. Here are a few key points about descriptive analysis:
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data, such as census data.
• Description and interpretation processes are different steps.
– Diagnostic Analytics
Diagnostic analytics are used for discovery, or to determine why
something happened.
Sometimes this type of analytics when done hands-on with a small
dataset is also known as causal analysis, since it involves at least one
cause (usually more than one) and one effect.
• For example, for a social media marketing campaign, you can
use descriptive analytics to assess the number of posts,
mentions, followers, fans, page views, reviews, or pins, etc.
There can be thousands of online mentions that can be distilled
into a single view to see what worked and what did not work in
your past campaigns.
• There are various types of techniques available for diagnostic or
causal analytics. Among them, one of the most frequently used
is correlation.
• Predictive Analytics
– predictive analytics has its roots in our ability to predict what
might happen. These analytics are about understanding the future using
the data and the trends we have seen in the past, as well as emerging new
contexts and processes.
– An example is trying to predict how people will spend their tax refunds
based on how consumers normally behave around a given time of the
year (past data and trends), and
how a new tax policy (new context) may affect people’s refunds.
– Predictive analytics provides companies with actionable insights based
on data. Such information includes estimates about the likelihood of a
future outcome. It is important to remember that no statistical algorithm
can “predict” the future with 100% certainty because the foundation of
predictive analytics is based on probabilities.
– Companies use these statistics to forecast what might happen.
– Some of the software most commonly used by data science
professionals for predictive analytics are SAS predictive analytics, IBM
predictive analytics, RapidMiner, and others.
• Let us assume that Salesforce kept campaign data for the last
eight quarters. This data comprises total sales generated by
newspaper, TV, and online ad campaigns and associated
expenditures, as provided in Table

With this data, we can predict the sales based on the expenditures of ad
campaigns in different media for Salesforce.
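Since the slide's table is not reproduced here, a hedged sketch of the idea in Python, with made-up expenditure and sales figures, might look like this:

    from sklearn.linear_model import LinearRegression

    # Hypothetical quarterly data: newspaper, TV, online ad spend per row
    X = [[120, 300, 80], [110, 320, 95], [140, 280, 60], [130, 310, 100],
         [150, 330, 120], [125, 290, 70], [135, 305, 90], [145, 325, 110]]
    y = [520, 560, 490, 555, 610, 500, 545, 590]  # total sales per quarter

    model = LinearRegression().fit(X, y)
    print(model.predict([[130, 315, 105]]))  # predicted sales for a planned budget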
• Predictive analytics has a number of common applications.
• For example, many people turn to predictive analytics to
produce their credit scores.
Financial services use such numbers to determine the
probability that a customer will make their credit payments on
time.
Customer relationship management (CRM) is another
common area for predictive analytics. Here, the process
contributes to objectives such as marketing campaigns, sales,
and customer service.
• Predictive analytics applications are also used in the healthcare
field. They can determine which patients are at risk for
developing certain conditions such as diabetes, asthma, and
other chronic or serious illnesses.
• Prescriptive Analytics
– Prescriptive analytics is the area of business analytics dedicated to finding
the best course of action for a given situation. This may start by first
analyzing the situation (using descriptive analysis), but then moves
toward finding connections among various parameters/variables, and their
relation to each other to address a specific problem .
– A process-intensive task, the prescriptive approach analyzes potential
decisions, the interactions between decisions, the influences that bear
upon these decisions, and the bearing all of this has on an outcome to
ultimately prescribe an optimal course of action in real time.
– Prescriptive analytics can also suggest options for taking advantage of a
future opportunity or mitigate a future risk and illustrate the implications
of each
– Specific techniques used in prescriptive analytics include optimization,
simulation, game theory, and decision-analysis methods.
Exploratory Analysis
-Exploratory analysis is an approach to analyzing datasets to find previously
unknown relationships. Often such analysis involves using various data
visualization approaches.
-exploratory analysis consists of a range of techniques; its application is varied
as well.
However, the most common application is looking for patterns in the data,
such as finding groups of similar genes from a collection of samples.
-Let us consider the US census data available from the US census website.
This data has dozens of variables; If you are looking for something specific
(e.g., which State has the highest population), you could go with descriptive
analysis. If you are trying to predict something (e.g., which city will have
the lowest influx of immigrant population), you could use prescriptive or
predictive analysis. But, if someone gave you this data and asks you to find
interesting insights, then what do you do? You could still do descriptive or
prescriptive analysis, but given that there are lots of variables with massive
amounts of data, it may be futile to do all possible combinations of those
variables. So, you need to go exploring.
Mechanistic Analysis
-Mechanistic analysis involves understanding the exact changes in
variables that lead to changes in other variables for individual objects.
-For instance, we may want to know how the number of free doughnuts
per employee per day affects employee productivity. Perhaps by giving
them one extra doughnut we gain a 5% productivity boost, but two extra
doughnuts could end up making them lazy (and diabetic)
-More seriously, though, think about studying the effects of carbon
emissions on bringing about the Earth’s climate change. Here, we are
interested in seeing how the increased amount of CO2 in the atmosphere
is causing the overall temperature to change.
• Basics and need of hypothesis & hypothesis testing
– Hypothesis testing is a common statistical tool used in research and data
science to support the certainty of findings. The aim of testing is to
answer how probable it is that an apparent effect arose by chance,
given a random data sample.
• What is a hypothesis?
A hypothesis is often described as an “educated guess” about a specific
parameter or population. Once it is defined, one can collect data to determine
whether it provides enough evidence that the hypothesis is true.
Parameters and statistics
In statistics, a parameter is a description of a population,
while a statistic describes a small portion of a population (sample).
For example, if you ask everyone in your class (population) about their average
height, you receive a parameter, a true description about the population since
everyone was asked.
If you now want to guess the average height of people in your grade
(population) using the information you have from your class (sample), this
information turns into a statistic.
• A hypothesis is a calculated prediction or assumption about
a population parameter based on limited evidence. The whole
idea behind hypothesis formulation is testing—this means the
researcher subjects his or her calculated assumption to a series of
evaluations to know whether they are true or false.
• Typically, every research starts with a hypothesis—the
investigator makes a claim and experiments to prove that this
claim is true or false. For instance, if you predict that students
who drink milk before class perform better than those who don't,
then this becomes a hypothesis that can be confirmed or refuted
using an experiment.
• Hypothesis testing is an assessment method that allows researchers to
determine the plausibility of a hypothesis. It involves testing an
assumption about a specific population parameter to know whether it's true
or false. These population parameters include variance, standard deviation,
and median.
• Typically, hypothesis testing starts with developing a null hypothesis and
then performing several tests that support or reject the null hypothesis. The
researcher uses test statistics to compare the association or relationship
between two or more variables.
• How Hypothesis Testing Works
• The basis of hypothesis testing is to examine and analyze the null hypothesis
and alternative hypothesis to know which one is the most plausible
assumption. Since both assumptions are mutually exclusive, only one can be
true. In other words, the occurrence of a null hypothesis destroys the chances
of the alternative coming to life, and vice-versa.
What are the Types of Hypotheses?
1. Simple Hypothesis
2. Complex Hypothesis
3. Null Hypothesis
4. Alternative Hypothesis
5. Logical Hypothesis
6. Empirical Hypothesis
7. Statistical Hypothesis
• Five-Step Procedure for Testing a Hypothesis
Step 1: State the Null Hypothesis (H0) and the Alternate
Hypothesis (H1):
• The first step is to state the hypothesis being tested. It is called
the null hypothesis, designated H0, and read “H sub zero.”
The capital letter H stands for hypothesis,and the subscript
zero implies “no difference.” There is usually a “not” or a “no”
term in the null hypothesis, meaning that there is “no
change.”
• For example, the null hypothesis is that the mean number of
miles driven on the steel-belted tire is not different from
60,000. The null hypothesis would be written H0: µ = 60,000.
• Generally speaking, the null hypothesis is developed for the
purpose of testing. We either reject or fail to reject the null
hypothesis. The null hypothesis is a statement that is not
rejected unless our sample data provide convincing evidence
that it is false.
• The alternate hypothesis describes what you will conclude if you reject the
null hypothesis. It is written H1 and is read “H sub one.” It is also referred
to as the research hypothesis. The alternate hypothesis is accepted if the
sample data provide us with enough statistical evidence that the null
hypothesis is false.
• The actual test begins by considering two hypotheses. They are
called the null hypothesis and the alternative hypothesis. These
hypotheses contain opposing viewpoints.
• H0: The null hypothesis: It is a statement of no difference
between sample means or proportions or no difference between
a sample mean or proportion and a population mean or
proportion. In other words, the difference equals 0.
• Ha: The alternative hypothesis: It is a claim about the
population that is contradictory to H0 and what we conclude
when we reject H0.
• The following example will help clarify what is meant by the null
hypothesis and the alternate hypothesis. A recent article indicated the
mean age of U.S. commercial aircraft is 15 years. To conduct a statistical
test regarding this statement, the first step is to determine the null and
the alternate hypotheses.
• The null hypothesis represents the current or reported condition. It is
written H0: µ=15.
• The alternate hypothesis is that the statement is not true, that is, H1: µ≠
15.
• It is important to remember that no matter how the problem is stated,
the null hypothesis will always contain the equal sign. The equal sign (=)
will never appear in the alternate hypothesis. Why? Because the null
hypothesis is the statement being tested, and we need a specific value to
include in our calculations. We turn to the alternate hypothesis only if the
data suggests the null hypothesis is untrue.
• Null & Alternative hypothesis
• The null and alternative hypotheses are the two mutually
exclusive statements about a parameter or population.
• The null hypothesis (often abbreviated as H0) claims that there
is no effect or no difference.
• The alternative hypothesis (often abbreviated as H1 or HA) is
what you want to prove. Using one of the examples from above:
• H0: There is no difference in the mean return from A and B, or
the difference between A and B is zero.
• H1: There is a difference in the mean return from A and B or
the difference between A and B > zero.

• Note: H0 must always contain equality (=). Ha always
contains difference (≠, >, <).
For example,
1) If the task is to identify the effect of drug A compared to drug B on patients, the
null hypothesis and alternative hypothesis would be this.
H0: Drug A and drug B have the same effect on patients.
HA: Drug A has a greater effect than drug B on patients.
2) If the task is to identify whether advertising Campaign C is effective in reducing
customer churn, the null hypothesis and alternative hypothesis would be as
follows.
H0: Campaign C does not reduce customer churn better than the current
campaign method.
HA: Campaign C does reduce customer churn better than the current campaign.
• Mathematical Symbols Used in H0 and Ha:
– If H0 contains equal (=), Ha contains not equal (≠), greater than (>), or
less than (<)
– If H0 contains greater than or equal to (≥), Ha contains less than (<)
– If H0 contains less than or equal to (≤), Ha contains more than (>)
• Try It
We want to test whether the mean height of eighth graders is 66 inches. State the
null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >)
for the null and alternative hypotheses.
H0: μ __ 66
Ha: μ __ 66
Step 2: Select a Level of Significance
• After setting up the null hypothesis and alternate hypothesis, the next
step is to state the level of significance.
• The level of significance is designated α , the Greek letter alpha. It is also
sometimes called the level of risk. This may be a more appropriate term
because it is the risk you take of rejecting the null hypothesis when it is
really true.
• There is no one level of significance that is applied to all tests. A decision
is
made to use the .05 level (often stated as the 5 percent level), the .01 level,
the .10 level, or any other level between 0 and 1. Traditionally, the .05 level is
selected for consumer research projects, .01 for quality assurance, and .10
for political polling.
• You, the researcher, must decide on the level of significance before
formulating a decision rule and collecting sample data.
• Type I and Type II Errors
A hypothesis test may result in two types of errors, depending on whether
the test accepts or rejects the null hypothesis. These two errors are known as
type I and type II errors.

● A type I error is the rejection of the null hypothesis when the null hypothesis
is TRUE. The probability of the type I error is denoted by the Greek letter α.

● A type II error is the acceptance of a null hypothesis when the null hypothesis
is FALSE. The probability of the type II error is denoted by the Greek letter
β.
Type I and Type II Error
• Step 3: Select the Test Statistic
• There are many test statistics. Here we use both z and t as the test
statistic; later we will use test statistics such as F and χ2 (chi-square).
• Step 4: Formulate the Decision Rule
• The decision rule states the conditions when H0 is rejected.
• A decision rule is a statement of the specific conditions under
which the null hypothesis is rejected and the conditions under
which it is not rejected. The region or area of rejection defines
the location of all those values that are so large or so small
that the probability of their occurrence under a true null
hypothesis is rather remote.
• Sampling Distribution of the Statistic z, a Right-Tailed Test, .05 Level of
Significance.
Note in the chart that:
• The area where the null hypothesis is not rejected is to the left of 1.65.
• The area of rejection is to the right of 1.65.
• A one-tailed test is being applied. (This will also be explained later.)
• The .05 level of significance was chosen.
• The sampling distribution of the statistic z follows the normal probability distribution.
• The value 1.65 separates the regions where the null hypothesis is rejected and
where it is not rejected.
• The value 1.65 is the critical value.
• Step 5: Make a Decision
• The fifth and final step in hypothesis testing is computing the test statistic,
comparing it to the critical value, and making a decision to reject or not to
reject the null hypothesis.
• If, based on sample information, z is computed to be 2.34, the null
hypothesis is rejected at the .05 level of significance.
• The decision to reject H0 was made because 2.34 lies in the region of
rejection, that is, beyond 1.65.
• We would reject the null hypothesis, reasoning that it is highly improbable
that a computed z value this large is due to sampling error (chance).
• Had the computed value been 1.65 or less, say 0.71, the null hypothesis
would not be rejected. It would be reasoned that such a small computed
value could be attributed to chance, that is, sampling error.
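This decision is easy to reproduce in Python; scipy's normal quantile function gives the critical value that the slides round to 1.65:

    from scipy.stats import norm

    alpha = 0.05
    critical = norm.ppf(1 - alpha)  # ~1.645; the slides round this to 1.65
    z = 2.34                        # test statistic computed from the sample

    print(round(critical, 3))                                    # 1.645
    print("reject H0" if z > critical else "fail to reject H0")  # reject H0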
Performing Hypothesis Testing
When you perform a hypothesis test, you should follow the steps of hypothesis
testing in this order:
1. State the null hypothesis, H0, and the alternative hypothesis, H1.
2. Evaluate the risks of making type I and II errors, and choose the level
of significance, α, and the sample size as appropriate.
3. Determine the appropriate test statistic and sampling distribution to
use and identify the critical values that divide the rejection and nonrejection
regions.
4. Collect the data, calculate the appropriate test statistic, and determine
whether the test statistic has fallen into the rejection or the nonrejection region.
5. Make the proper statistical inference. Reject the null hypothesis if the
test statistic falls into the rejection region. Do not reject the null
hypothesis if the test statistic falls into the nonrejection region.
One-Tailed Test:
• A one-tailed test is a statistical hypothesis test in which the critical area
of a distribution is one-sided so that it is either greater than or less than
a certain value, but not both. If the sample being tested falls into the
one-sided critical area, the alternative hypothesis will be accepted
instead of the null hypothesis.
• A one-tailed test is also known as a directional hypothesis or
directional test.
• for a one-tailed test, we define H0: µ1 = µ2 and Ha: µ1 >
µ2 or Ha: µ1 < µ2
• Two-Tailed Test:
• A two-tailed test is a method in which the critical area of a distribution is two-
sided and tests whether a sample is greater than or less than a certain range of
values. If the sample being tested falls into either of the critical areas, the
alternative hypothesis is accepted instead of the null hypothesis.
• for a two-tailed test, we define H0: µ1 = µ2 and Ha: µ1≠µ2
To illustrate a one-tailed test, let’s consider the problem. Suppose the vice
president wants to know whether there has been an increase in the number of
units assembled. Can we conclude, because of the improved production methods,
that the mean number of desks assembled in the last 50 weeks was more
than 200? Look at the difference in the way the problem is formulated. In the first
case, we wanted to know whether there was a difference in the mean number
assembled, but now we want to know whether there has been an increase.
Because we are investigating different questions, we will set our hypotheses
differently. The biggest difference occurs in the alternate hypothesis. Before, we
stated the alternate hypothesis as “different from”; now we want to state it as
“greater than.” In symbols: H0: µ ≤ 200 and H1: µ > 200.

Rejection Regions for Two-Tailed and One-Tailed Tests, α = .01
• Test Statistic:
– The value based on the sample statistic and the sampling distribution
for the sample statistic is called the test statistic.
– Example: If you are testing whether the mean of a population was
equal to a specific value, the sample statistic is the sample mean. The
test statistic is based on the difference between the sample mean and
the value of the population mean stated in the null hypothesis. This
test statistic follows a statistical distribution called the t distribution.
– If you are testing whether the mean of population one is equal to the
mean of population two, the sample statistic is the difference
between the mean in sample one and the mean in sample two. The
test statistic is based on the difference between the mean in sample
one and the mean in sample two.This test statistic also follows the t
distribution .
– The sampling distribution of the test statistic is divided into two
regions, a region of rejection (also known as the critical region)
and a region of nonrejection. If the test statistic falls into the region
of nonrejection, the null hypothesis is not rejected.
• The region of rejection contains the values of the test statistic that are unlikely to
occur if the null hypothesis is true.
• If the null hypothesis is false, these values are likely to occur. Therefore, if you
observe a value of the test statistic that falls into the rejection region, the null
hypothesis is rejected, because that value is unlikely if the null hypothesis is true.
• To make a decision concerning the null hypothesis, you first determine the
critical value of the test statistic that separates the nonrejection region from
the rejection region.
• You determine the critical value by using the appropriate sampling distribution and
deciding on the risk you are willing to take of rejecting the null hypothesis when it
is true
• Level of Significance – It is the probability of making type I error and is
denoted by α. It is the maximum probability of making type I error.
• Alpha is also known as the level of significance of the statistical test.
• Traditionally, we control the probability of a type I error by deciding the risk
level α to tolerate rejecting the null hypothesis when it is true.
• Because we specify the level of significance before performing the hypothesis
test, the risk of committing a type I error, α, is directly under our control.
• The most common α values are 0.01, 0.05, and 0.10, and researchers
traditionally select a value of 0.05 or smaller .
• After selecting a value for α, you determine the rejection region and,
using the appropriate sampling distribution, the critical value or values
that divide the rejection and nonrejection regions.
• Confidence level: The probability that if a poll/test/survey were repeated
over and over again, the results obtained would be the same. A confidence
level = 1 – alpha.
• The p-Value Approach to Hypothesis Testing
– The p-value is the smallest level at which H0 can be
rejected for a given set of data.
– consider the p-value the actual risk of having a type I error for a
given set of data.
– Using p-values, we reject the null hypothesis if the p-value is
less than α and do not reject the null hypothesis if the p-value is
greater than or equal to α.
– The p-value is also known as the observed level of significance.
• When using p-values, you can restate the steps of hypothesis
testing as follows:
1. State the null hypothesis, H0, and the alternative hypothesis,
H1.
2. Evaluate the risks of making type I and II errors, and choose
the level of significance, α, and the sample size as appropriate.
3. Collect the data and calculate the sample value of the
appropriate test statistic.
4. Calculate the p-value based on the test statistic and compare the
p -value to α.
5. Make the proper statistical inference. Reject the null hypothesis
if the p-value is less than α. Do not reject the null hypothesis if
the p-value is greater than or equal to α.
• Parametric tests
– Hypothesis tests including a specific parameter are called parametric
tests. In parametric tests, the population is assumed to have a normal
distribution (e.g., the height of people in a class).
• Non-parametric tests
– In contrast, non-parametric tests (also distribution-free tests) are used
when parameters of a population cannot be assumed to be normally
distributed. For example, the price of diamonds seems exponentially
distributed (below right). Non-parametric doesn’t mean that you do not
know anything about a population but rather that it is not normally
distributed.

Left: example of normally distributed data. Right: example of non-normal data distribution.
• Real-world examples
• Hypothesis 1: Average order value has increased since last financial year
-Parameter: Mean order value
-Test type: one-sample, parametric test (assuming the order value follows a
normal distribution)
• Hypothesis 2: Investing in A brings a higher return than investing in B
– Parameter: Difference in mean return
– Test type: two-sample, parametric test, also AB test (assuming the return
follows a normal distribution)
• Hypothesis 3: The new user interface converts more users into customers than
the expected 30%
– Parameter: none
– Test type: one-sample, non-parametric test (assuming number of customers
is not normally distributed)
• One-sample, two-sample, or more-sample test
• When testing hypotheses, we distinguish between one-sample,
two-sample, and more-sample tests.
• In a one-sample test, a sample (average order value this year) is
compared to a known value (average order value of last year).
• In a two-sample test, two samples (investment A and B) are
compared to each other.
• What exactly is a test statistic?
– A test statistic describes how closely the distribution of your data
matches the distribution predicted under the null hypothesis of the
statistical test you are using.
– The distribution of data is how often each observation occurs, and can
be described by its central tendency and variation around that central
tendency. Different statistical tests predict different types of
distributions, so it’s important to choose the right statistical test for your
hypothesis.
– The test statistic summarizes your observed data into a single
number using the central tendency, variation, sample size, and
number of predictor variables in your statistical model.
– Generally, the test statistic is calculated as the pattern in your data
(i.e. the correlation between variables or difference between
groups) divided by the variance in the data (i.e. the standard
deviation).
• What exactly is a test statistic?
For example: You are testing the relationship between temperature
and flowering date for a certain type of apple tree. You use a
long-term data set that tracks temperature and flowering dates from
the past 25 years by randomly sampling 100 trees every year in an
experimental field.
– Null hypothesis: There is no correlation between temperature and
flowering date.
– Alternate hypothesis: There is a correlation between temperature
and flowering date.
To test this hypothesis you perform a regression test, which generates
a t-value as its test statistic. The t-value compares the observed
correlation between these variables to the null hypothesis of zero
correlation.
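A minimal Python sketch of such a test; the 25-year orchard data set is not shown, so the arrays below are made up for illustration:

    from scipy.stats import pearsonr

    temperature = [13.2, 13.8, 14.1, 14.5, 15.0, 15.3, 15.9]  # mean spring temp
    flowering_day = [112, 110, 107, 106, 102, 101, 97]        # day of year

    r, p = pearsonr(temperature, flowering_day)
    print(r, p)  # r near -1 with a small p-value -> reject "no correlation"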
• Types of test statistics
– t-value (used by t-tests and regression tests). Null: the means of two
groups are equal. Alternative: the means of two groups are not equal.
– z-value (used by z-tests). Null: the means of two groups are equal.
Alternative: the means of two groups are not equal.
– F-value (used by ANOVA, ANCOVA, and MANOVA). Null: the variation
among two or more groups is greater than or equal to the variation
between the groups. Alternative: the variation among two or more
groups is smaller than the variation between the groups.
• T-Tests
– A t-test is a statistical test that is used to compare the means of
two groups. It is often used in hypothesis testing to determine
whether a process or treatment actually has an effect on the
population of interest, or whether two groups are different
from one another.
– You want to know whether the mean petal length of iris
flowers differs according to their species. You find two
different species of irises growing in a garden and measure 25
petals of each species. You can test the difference between
these two groups using a t-test.
– The null hypothesis (H0) is that the true difference between
these group means is zero.
– The alternate hypothesis (Ha) is that the true difference is
different from zero.
• When to use a t-test
• A t-test can only be used when comparing the means of two
groups (a.k.a. pairwise comparison). If you want to compare more
than two groups, or if you want to do multiple pairwise
comparisons, use an ANOVA test or a post-hoc test.
• The t-test is a parametric test of difference, meaning that it makes
the same assumptions about your data as other parametric tests.
The t-test assumes your data:
• are independent
• are (approximately) normally distributed.
• have a similar amount of variance within each group being
compared (a.k.a. homogeneity of variance)
• If your data do not fit these assumptions, you can try
a nonparametric alternative to the t-test, such as the Wilcoxon
Signed-Rank test for data with unequal variances.
• What type of t-test should I use?
• When choosing a t-test, you will need to consider two things:
whether the groups being compared come from a single
population or two different populations, and whether you want to
test the difference in a specific direction.
• One-sample, two-sample, or paired t-test?
• If the groups come from a single population (e.g. measuring
before and after an experimental treatment), perform a paired t-
test.
• If the groups come from two different populations (e.g. two
different species, or people from two separate cities), perform
a two-sample t-test (a.k.a. independent t-test).
• If there is one group being compared against a standard value
(e.g. comparing the acidity of a liquid to a neutral pH of 7),
perform a one-sample t-test.
One-tailed or two-tailed t-test?
If you only care whether the two populations are different from
one another, perform a two-tailed t-test.
If you want to know whether one population mean is greater
than or less than the other, perform a one-tailed t-test.
Performing a t-test
The t-test estimates the true difference between two group
means using the ratio of the difference in group means over
the pooled standard error of both groups. You can calculate
it manually using a formula, or use statistical analysis
software.
• One-Sample t-Test
• We perform a One-Sample t-test when we want to compare a
sample mean with the population mean. The difference from
the Z Test is that we do not have the information on
Population Variance here. We use the sample standard
deviation instead of population standard deviation in this case.
• Here’s an Example to Understand a One-Sample t-Test
• Let’s say we want to determine if on average girls score more
than 600 in the exam. We do not have the information related to
variance (or standard deviation) for girls’ scores. To perform a
t-test, we randomly collect the data of 10 girls with their marks
and choose our ⍺ value (significance level) to be 0.05 for
Hypothesis Testing.
In this example:
Mean Score for Girls is 606.8
The size of the sample is 10
The population mean is 600
Standard Deviation for the sample is 13.14

Our P-value is greater than 0.05 thus we fail to reject the null
hypothesis and don’t have enough evidence to support the
hypothesis that on average, girls score more than 600 in the exam.
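A sketch of this computation in Python with scipy, using only the summary numbers on the slide:

    from math import sqrt
    from scipy import stats

    x_bar, mu0, s, n = 606.8, 600, 13.14, 10

    t = (x_bar - mu0) / (s / sqrt(n))
    p = stats.t.sf(t, df=n - 1)      # one-tailed: Ha is "mean > 600"
    print(round(t, 3), round(p, 3))  # t ≈ 1.636, p ≈ 0.068 > 0.05 -> fail to reject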
• T-test formula
• The formula for the two-sample t-test (a.k.a. the Student’s t-test)
is shown below:

    t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) )

• In this formula, t is the t-value, x̄1 and x̄2 are the means of the
two groups being compared, s² is the pooled standard error of
the two groups, and n1 and n2 are the number of observations
in each of the groups.
• A larger t-value shows that the difference between group
means is greater than the pooled standard error, indicating a
more significant difference between the groups.
– Example Python
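A minimal sketch of a two-sample t-test in Python with scipy; the score arrays below are made up for illustration:

    from scipy import stats

    group_a = [82, 90, 74, 88, 79, 85, 91, 77, 84, 80]
    group_b = [75, 72, 81, 69, 78, 74, 70, 76, 73, 71]

    t, p = stats.ttest_ind(group_a, group_b)  # pools variance by default
    print(t, p)  # if p < 0.05, reject H0 that the two group means are equal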
• Two-Sample t-Test
• We perform a Two-Sample t-test when we want to compare
the mean of two samples.
• Here’s an Example to Understand a Two-Sample t-Test
• Here, let’s say we want to determine if on average, boys
score 15 marks more than girls in the exam. We do not
have the information related to variance (or standard
deviation) for girls’ scores or boys’ scores. To perform a
t-test, we randomly collect the data of 10 girls and boys with
their marks. We choose our ⍺ value (significance level) to
be 0.05 as the criteria for Hypothesis Testing.
In this example:
Mean Score for Boys is 630.1
Mean Score for Girls is 606.8
Hypothesized difference between population means is 15
Standard Deviation for Boys’ score is 13.42
Standard Deviation for Girls’ score is 13.14

Thus, the P-value is less than 0.05, so we can reject the null
hypothesis and conclude that on average boys score 15
marks more than girls in the exam.
• Z-test
• Z-test is a statistical method to determine whether the distribution
of the test statistics can be approximated by a normal distribution. It
is the method to determine whether two sample means are
approximately the same or different when their variance is known
and the sample size is large (should be >= 30).
• When to Use Z-test:
• The sample size should be greater than 30. Otherwise, we should
use the t-test.
• Samples should be drawn at random from the population.
• The standard deviation of the population should be known.
• Samples that are drawn from the population should be independent
of each other.
• The data should be normally distributed, however for large sample
size, it is assumed to have a normal distribution.
• Deciding between Z Test and T-Test
• So when should we perform the Z test and when should we
perform the t-test? It’s a key question we need to answer if we want
to master statistics.

If the sample size is large enough, then the Z test and t-test will conclude with the same
results. For a large sample size, sample variance will be a better estimate of
population variance, so even if population variance is unknown, we can use the Z test
using sample variance.
What is the Z Test?
Z tests are a statistical way of testing a hypothesis when either:
We know the population variance, or
We do not know the population variance but our sample size is large n
≥ 30
If we have a sample size of less than 30 and do not know the
population variance, then we must use a t-test.
One-Sample Z test
We perform the One-Sample Z test when we want to compare a
sample mean with the population mean.
• Here’s an Example to Understand a One Sample Z Test
• Let’s say we need to determine if girls on average score higher than
600 in the exam. We have the information that the standard
deviation for girls’ scores is 100. So, we collect the data of 20 girls
by using random samples and record their marks. Finally, we also
set our ⍺ value (significance level) to be 0.05.
In this example:
Mean Score for Girls is 641
The size of the sample is 20
The population mean is 600
Standard Deviation for Population is 100

Since the P-value is less than 0.05, we can reject the null
hypothesis and conclude based on our result that Girls on average scored
higher than 600.
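A sketch of this Z test in Python, using the slide's summary numbers:

    from math import sqrt
    from scipy.stats import norm

    x_bar, mu0, sigma, n = 641, 600, 100, 20

    z = (x_bar - mu0) / (sigma / sqrt(n))
    p = norm.sf(z)                   # one-tailed: Ha is "mean > 600"
    print(round(z, 3), round(p, 3))  # z ≈ 1.834, p ≈ 0.033 < 0.05 -> reject H0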
• Two Sample Z Test
• We perform a Two Sample Z test when we want to compare the
mean of two samples. Here’s an Example to Understand a Two
Sample Z Test
• Here, let’s say we want to know if Girls on average score 10
marks more than the boys. We have the information that the
standard deviation for girls’ Score is 100 and for boys’ score is 90.
Then we collect the data of 20 girls and 20 boys by using random
samples and record their marks. Finally, we also set our ⍺ value
(significance level) to be 0.05.
In this example:
Mean Score for Girls (Sample Mean) is 641
Mean Score for Boys (Sample Mean) is 613.3
Standard Deviation for the Population of Girls’ is 100
Standard deviation for the Population of Boys’ is 90
Sample Size is 20 for both Girls and Boys
Difference between Mean of Population is 10

Thus, we can conclude based on the P-value that we fail to reject the
Null Hypothesis.
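A sketch of the two-sample Z computation in Python with the slide's numbers:

    from math import sqrt
    from scipy.stats import norm

    x_g, x_b = 641, 613.3   # sample means (girls, boys)
    sd_g, sd_b = 100, 90    # known population standard deviations
    n = 20                  # sample size per group
    d0 = 10                 # hypothesized difference in means

    z = ((x_g - x_b) - d0) / sqrt(sd_g**2 / n + sd_b**2 / n)
    p = norm.sf(z)                   # one-tailed
    print(round(z, 3), round(p, 3))  # z ≈ 0.588, p ≈ 0.278 -> fail to reject H0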
The Paired t-Test
t = x̄d / (sd / √n), with df = n − 1 and α = 0.05,
where x̄d is the mean of the paired differences, sd is the sample
standard deviation of the differences, and n is the sample size.
Example: An instructor has prepared two sets of question papers. She
wants to know whether both sets are equally difficult. Using the marks
obtained by the same students on both exams, we can test whether the
exams are equally difficult.
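A minimal paired t-test sketch in Python; the marks below are made up for the instructor's two paper sets, graded for the same students:

    from scipy import stats

    set_a = [62, 75, 58, 80, 69, 73, 66, 71]  # marks on paper set A
    set_b = [60, 78, 55, 79, 64, 70, 68, 66]  # same students, paper set B

    t, p = stats.ttest_rel(set_a, set_b)  # t-test on the paired differences
    print(t, p)  # if p < 0.05, conclude the two sets differ in difficulty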
• Chi-Square (χ2) test
• The chi-square independence test is a procedure for testing if two categorical
variables are related in some population.
• There are two types of chi-square tests. Both use the chi-square statistic and
distribution for different purposes:
• A chi-square goodness of fit test determines if sample data matches a population.
• A chi-square test for independence compares two variables in a contingency
table to see if they are related. In a more general sense, it tests to see whether
distributions of categorical variables differ from each another.
• The formula for the chi-square statistic used in the chi-square test is:

    χ²c = Σ (Oi − Ei)² / Ei

• The subscript “c” is the degrees of freedom. “O” is your observed value and E is
your expected value. It’s very rare that you’ll want to actually use this formula to
find a critical chi-square value by hand. The summation symbol means that
you’ll have to perform a calculation for every single data item in your data set.
As you can probably imagine, the calculations can get very, very lengthy and
tedious. Instead, you’ll probably want to use technology.
• A chi-square statistic is one way to show a relationship between
two categorical variables. In statistics, there are two types of
variables: numerical (countable) variables and non-numerical (categorical)
variables. The chi-squared statistic is a single number that tells you how
much difference exists between your observed counts and the counts you
would expect if there were no relationship at all in the population.
• There are a few variations on the chi-square statistic. Which one you use
depends upon how you collected the data and which hypothesis is being
tested. However, all of the variations use the same idea, which is that you are
comparing your expected values with the values you actually collect. One of
the most common forms can be used for contingency tables:
χ² = Σ (Oi − Ei)² / Ei
• Where O is the observed value, E is the expected value and “i” is the “ith”
position in the contingency table.
• Chi Square P-Values.
– A chi square test will give you a p-value. The p-value will tell you if your test results
are significant or not. In order to perform a chi square test and get the p-value, you need
two pieces of information:
– Degrees of freedom. That’s just the number of categories minus 1.
– The alpha level (α). This is chosen by you, the researcher. The usual alpha level is 0.05
(5%), but you could also have other levels like 0.01 or 0.10.
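Both inputs map directly onto the chi-square distribution in scipy; a small sketch (the df of 3 and the statistic of 2.20 are illustrative values):

```python
# Critical value and p-value from the chi-square distribution.
from scipy.stats import chi2

df, alpha = 3, 0.05                       # e.g., 4 categories -> df = 3
critical = chi2.ppf(1 - alpha, df)        # reject H0 when the statistic exceeds this
print(f"critical value = {critical:.3f}") # ≈ 7.815 for df = 3

chi2_stat = 2.20                          # an illustrative test statistic
p = chi2.sf(chi2_stat, df)                # p-value = P(chi-square >= statistic)
print(f"p = {p:.3f}")
```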
A chi-square test for independence
• Example: a scientist wants to know if education level and marital status are related
for all people in some country. He collects data on a simple random sample of n =
300 people, part of which are shown below.
• Chi-Square Test - Observed Frequencies
• A good first step for these data is inspecting the contingency table of marital status by
education. Such a table -shown below- displays the frequency distribution of marital
status for each education category separately. So let's take a look at it.
[Contingency table: observed frequencies of marital status for each education level]
• Chi-Square Test - Column Percentages
• Although our contingency table is a great starting point, it doesn't really show
us if education level and marital status are related. This question is answered
more easily from a slightly different table as shown below.
[Table: column percentages of marital status within each education level]
• This table shows -for each education level separately- the percentages of
respondents that fall into each marital status category. Before reading on, take
a careful look at this table and ask yourself: is marital status related to
education level and, if so, how?
Marital status is clearly associated with education level. The lower
someone’s education, the smaller the chance that he’s married. That is:
education “says something” about marital status (and conversely) in
our sample. So what about the population?
• Chi-Square Test - Null Hypothesis
The null hypothesis for a chi-square independence test is that two categorical
variables are independent in some population.
Chi-Square Test - Statistical Independence
independence means that one variable doesn't “say anything” about another
variable.
A different way of saying the exact same thing is that
independence means that the relative frequencies of one variable are identical
over all levels of some other variable.
Expected Frequencies
• Expected frequencies are the frequencies we expect in our sample
if the null hypothesis holds.
• If education and marital status are independent in our population, then we
expect this in our sample too. This implies the contingency table -holding
expected frequencies- shown below.
These expected frequencies are calculated as
eij = (oi × oj) / N
where
eij is an expected frequency;
oi is a marginal column frequency;
oj is a marginal row frequency;
N is the total sample size.
So for our first cell, that'll be e11 = (39 × 90) / 300 = 11.7
• Test Statistic
• The chi-square test statistic is calculated as
χ² = Σ (oij − eij)² / eij
• so for our data
χ² = (18 − 11.7)²/11.7 + (36 − 27)²/27 + … = 23.57
• Chi-Square Test - Degrees of Freedom
• We'll get the p-value we're after from the chi-square distribution if we give it
2 numbers:
• the χ2 value (23.57) and
• the degrees of freedom (df).
• The degrees of freedom is basically a number that determines the exact shape
of our distribution.
• Degrees of freedom -or df- are calculated as
• df=(i−1)⋅(j−1)
• so in our example
• df=(5−1)⋅(4−1)=12.
And with df = 12, the probability of finding χ² ≥ 23.57 is approximately 0.023. This is
our 1-tailed significance. It basically means there's a 0.023 (or 2.3%)
chance of finding this association in our sample if it is zero in our
population.
Since this is a small chance, we no longer believe our null hypothesis of our
variables being independent in our population.
Conclusion: marital status and education are related in our population.
Now, keep in mind that our p-value of 0.023 only tells us that the association
between our variables is probably not zero.
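The quoted p-value can be reproduced from the statistic and the degrees of freedom; a small sketch (with the full observed table available, scipy.stats.chi2_contingency would compute the statistic, p-value, df, and expected frequencies in one call):

```python
# p-value for the chi-square independence test from the statistic and df.
from scipy.stats import chi2

chi2_stat, df = 23.57, 12
p = chi2.sf(chi2_stat, df)    # right-tail probability
print(f"p = {p:.3f}")         # ≈ 0.023, the 1-tailed significance quoted above
```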
• Nonparametric Methods:
Goodness-of-Fit Tests
– The goodness-of-fit test is one of the most commonly used statistical
tests. It is particularly useful because it requires only the nominal level of
measurement. So we are able to conduct a test of hypothesis on data that
has been classified into groups.
– As the full name implies, the purpose of the goodness-of-fit test is to
compare an observed distribution to an expected distribution. An
example will describe the hypothesis-testing situation.
– Consider an Example
– Bubba’s Fish and Pasta is a chain of restaurants located along the Gulf
Coast of Florida. Bubba, the owner, is considering adding steak to his
menu. Before doing so, he decides to hire Magnolia Research, LLC, to
conduct a survey of adults as to their favorite meal when eating out.
Magnolia selected a sample of 120 adults and asked each to indicate their
favorite meal when dining out. The results are reported below.
• No one entrée is assumed better than another. Therefore, the nominal scale is
appropriate
• If the entrées are equally popular, we would expect 30 adults to select each
meal. Why is this so? If there are 120 adults in the sample and four categories,
we expect that one-fourth of those surveyed would select each
entrée. So 30, found by 120/4, is the expected frequency for each category or
cell, assuming there is no preference for any of the entrées.
[Table: observed survey frequencies for the four entrées]
• Is the difference in the number of times each entrée is selected due to chance,
or should we conclude that the entrées are not equally preferred?
• To investigate the issue, we use the five-step hypothesis-testing procedure.
• Step 1: State the null hypothesis and the alternate hypothesis. The
null hypothesis, H0, is that there is no difference between the set of
observed frequencies and the set of expected frequencies. In other words,
any difference between the two sets of frequencies is attributed to
sampling error. The alternate hypothesis, H1, is that there is a difference
between the observed and expected sets of frequencies. If the null
hypothesis is rejected and the alternate hypothesis is accepted, we
conclude the preferences are not equally distributed among the four
categories (cells).
H0: There is no difference in the proportion of adults selecting each
entrée.
H1: There is a difference in the proportion of adults selecting each entrée.
• Step 2: Select the level of significance. We selected the .05 significance
level.
The probability is .05 that a true null hypothesis is rejected.
• Step 3: Select the test statistic. The test statistic follows the chi-square
distribution, designated by χ²:
χ² = Σ (fo − fe)² / fe
• with k − 1 degrees of freedom, where:
k is the number of categories.
fo is an observed frequency in a particular category.
fe is an expected frequency in a particular category.
• Step 4: Formulate the decision rule. Recall that the decision rule in hypothesis
testing gives the value that separates the region where H0 is not rejected from the
region where H0 is rejected. This number is called the critical value.
• The number of degrees of freedom is k − 1, where k is the number of categories.
In this particular problem, there are four categories, the four meal entrées.
Because there are four categories, there are k − 1 = 4 − 1 = 3 degrees of freedom.
As noted, a category is called a cell, and there are four cells. The critical value
for 3 degrees of freedom and the .05 level of significance is found in the χ²
table: it is 7.815, found by locating 3 degrees of freedom in the
left margin and then moving horizontally (to the right) and reading the critical
value in the .05 column.
• The decision rule is to reject the null hypothesis if the computed value
of chi-square is greater than 7.815. If it is less than or equal to 7.815, we fail
to reject the null hypothesis.
• Step 5: Compute the value of chi-square and make a decision. Of the 120
adults in the sample, 32 indicated their favorite entrée was chicken. The
calculations for chi-square follow. (Note again that the expected frequencies are
the same for each cell.)
[Table: chi-square calculations, (fo − fe)²/fe for each of the four entrées]
• The computed χ² of 2.20 is not in the rejection region. It is less than
the critical value of 7.815.
• The decision, therefore, is to not reject the null hypothesis. We conclude that the
differences between the observed and the expected frequencies could be due to
chance. That means there is no preference among the four entrées.
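A hedged sketch of this test in Python. Only the chicken count (32) and the total (120) survive in the text above, so the other three observed counts are assumed here for illustration; they sum to 120 and reproduce the χ² of 2.20:

```python
# Chi-square goodness-of-fit test for the entrée example.
from scipy.stats import chisquare

observed = [32, 24, 35, 29]        # chicken plus three assumed counts (sum = 120)
expected = [30, 30, 30, 30]        # 120 / 4 under H0: no preference

stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")   # chi2 = 2.20, p ≈ 0.53 -> fail to reject H0
```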
• The chi-square distribution, which is used as the test statistic in this chapter,
has the following characteristics.
1. Chi-square values are never negative. This is because the difference between
fo and fe is squared, that is, (fo − fe)².
2. There is a family of chi-square distributions. There is a chi-square
distribution for 1 degree of freedom, another for 2 degrees of freedom, another
for 3 degrees of freedom, and so on. In this type of problem, the number of
degrees of freedom is determined by k -1, where k is the number of categories.
• Therefore, the shape of the chi-square distribution does not depend on the size of
the sample, but on the number of categories used. For example, if 200 employees
of an airline were classified into one of three categories—flight personnel, ground
support, and administrative personnel—there would be k - 1 = 3- 1= 2 degrees of
freedom.
3. The chi-square distribution is positively skewed. However, as the number of
degrees of freedom increases, the distribution begins to approximate the normal
probability distribution.
[Figure: Chi-Square Distributions for Selected Degrees of Freedom]
ANOVA
What does the ANOVA test mean?
The ANOVA, which stands for the Analysis of Variance test, is a tool in statistics for comparing the
means of two or more groups and determining to what extent they differ.

In simpler and general terms, it can be stated that the ANOVA test is used to identify which process, among all the other
processes, is better. The fundamental concept behind the Analysis of Variance is the “Linear Model”.

Example of ANOVA
An example to understand this can be prescribing medicines.

Suppose, there is a group of patients who are suffering from fever.

They are being given three different medicines that have the same functionality i.e. to cure fever.

To understand the effectiveness of each medicine and choose the best among them, the ANOVA test is used.

You may wonder whether a t-test could be used instead of the ANOVA test. It could, but since t-tests
compare only two groups at a time, you would have to run multiple t-tests to reach an outcome. That is not the case
with the ANOVA test.

That is why the ANOVA test is also regarded as an extension of the t-test and z-tests.
Terminologies in ANOVA Test
There are a few terms that we repeatedly come across while performing the ANOVA
test. We have listed and explained them below:

1. Means(Grand and Sample)

As we know, a mean is defined as an arithmetic average of a given range of values. In the ANOVA test, there
are two types of mean that are calculated: Grand and Sample Mean.

A sample mean (μn) represents the average value for a group while the grand mean (μ) represents the average
value of sample means of different groups or mean of all the observations combined.
2. F-Statistic

The statistic which measures the extent of difference between the means of different samples or how
significantly the means differ is called the F-statistic or F-Ratio. It gives us a ratio of the effect we are
measuring (in the numerator) and the variation associated with the effect (in the denominator).

The formula used to calculate the F-ratio is:
F = variation between the sample means / variation within the samples = MSB / MSE
Since we use variances to explain both the measure of the effect and the measure of the error, F is
more of a ratio of variances. The value of F can never be negative.

• When the value of F exceeds 1, it means that the variance due to the effect is larger than the
variance associated with sampling error; we can represent it as:
• When F > 1, variation due to the effect > variation due to error
• If F < 1, variation due to the effect < variation due to error
• When F = 1, variation due to the effect = variation due to error; the effect cannot be
distinguished from sampling error
3. Sums of Squares

In statistics, the sum of squares is defined as a statistical technique that is used in regression
analysis to determine the dispersion of data points. In the ANOVA test, it is used while computing
the value of F.

As the sum of squares tells you about the deviation from the mean, it is also known as variation.

The formula used to calculate the sum of squares is:
SS = Σ (xi − x̄)², the sum of squared deviations from the mean
4. Degrees of Freedom (Df)

Degrees of Freedom refers to the maximum numbers of logically independent values that have the
freedom to vary in a data set.
5. Mean Squared Error (MSE)
The Mean Squared Error tells us about the average error in a data set. To find the mean squared
error, we just divide the sum of squares by the degrees of freedom.

6. Hypothesis (Alternate and Null)


Hypothesis, in general terms, is an educated guess about something around us. When we are given a set
of data and are required to predict, we use some calculations and make a guess. This is all a hypothesis.

In the ANOVA test, we use Null Hypothesis (H0) and Alternate Hypothesis (H1). The Null Hypothesis
in ANOVA is valid when the sample means are equal or have no significant difference.

The Alternate Hypothesis is valid when at least one of the sample means is different from the other.
7. Group Variability (Within-group and Between-group)

To understand group variability, we should know about groups first. In the ANOVA test, a group is the
set of samples within the independent variable.

There are variations among the individual groups as well as within the group. This gives rise to the two
terms: Within-group variability and Between-group variability.

• When there is a big variation in the sample distributions of the individual groups, it is called
between-group variability.
• On the other hand, when there are variations in the sample distribution within an individual group,
it is called Within-group variability.
Types of ANOVA Test

The ANOVA test is generally done in three ways depending on the number of Independent Variables
(IVs) included in the test. Sometimes the test includes one IV, sometimes it has two IVs, and sometimes
the test may include multiple IVs.

We have three known types of ANOVA test:

1. One-Way ANOVA
2. Two-Way ANOVA
3. N-Way ANOVA (MANOVA)

One-Way ANOVA

One-way ANOVA is generally the most used method of performing the ANOVA test. It is also referred
to as one-factor ANOVA, between-subjects ANOVA, and independent-factor ANOVA. It is used to
compare the means of two or more independent groups using the F-distribution.

To carry out the one-way ANOVA test, you must have exactly one independent variable
with at least two levels. With only two groups, one-way ANOVA does not differ much from a t-test.

Example where one-way ANOVA is used: Suppose a teacher wants to know how effective his
teaching has been. He can split the students of the class into different groups and assign
different projects related to the topics taught to them.

He can use one-way ANOVA to compare the average score of each group and get a rough
understanding of which topics to teach again. However, he won’t be able to identify which student
could not understand the topic.
Two-way ANOVA

Two-way ANOVA is carried out when you have two independent variables. It is an extension of one-
way ANOVA. You can use the two-way ANOVA test when your experiment has a quantitative outcome
and there are two independent variables.

Two-way ANOVA is performed in two ways:

Two-way ANOVA with replication: It is performed when there are two groups and the
members of these groups are doing more than one thing. Our example in the beginning can be a good
example of two-way ANOVA with replication.

Two-way ANOVA without replication: This is used when you have only one group but you are
double-testing that group. For example, a patient is being observed before and after medication.

Assumptions for Two-way ANOVA

• The population must be close to a normal distribution.


• Samples must be independent.
• Population variances must be equal.
• Groups must have equal sample sizes.
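A minimal two-way ANOVA sketch using statsmodels; the data frame, factor names, and scores below are invented purely for illustration:

```python
# Two-way ANOVA with replication: two factors plus their interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "score":  [8, 9, 7, 6, 5, 7, 9, 10, 8, 6, 5, 4],
    "method": ["A", "A", "A", "B", "B", "B"] * 2,   # first factor, 3 replicates per cell
    "gender": ["M"] * 6 + ["F"] * 6,                # second factor
})

model = ols("score ~ C(method) + C(gender) + C(method):C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS, df, F, and p for each term
```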
N-way ANOVA (MANOVA)

When we have multiple (more than two) independent variables, we use MANOVA. The main purpose
of the MANOVA test is to find out the effect on the dependent/response variables of a change in the IVs.

It answers the following questions:

• Does the change in the independent variable significantly affect the dependent variable?
• What are interactions among the dependent variables?
• What are interactions between independent variables?

MANOVA is advantageous as compared to ANOVA because it allows you to test multiple
dependent variables and protects against inflating the Type I error rate (rejecting a true null hypothesis).

• Real-world application of ANOVA test


• Suppose medical researchers want to find the best diabetes medicine and they have to choose from four medicines. They
can choose 20 patients and give them each of the four medicines for four months.
• The researchers can take note of the sugar levels before and after medication for each medicine and then to understand
whether there is a statistically significant difference in the mean results from the medications, they can use one-way
ANOVA.
• The type of medicine can be a factor and the reduction in sugar level can be considered the response. Researchers can then
calculate the p-value and compare it with the significance level.
• If the results reveal that there is a statistically significant difference in mean sugar level reductions caused by the four
medicines, the post hoc tests can be run further to determine which medicine led to this result.

ANOVA Table
The ANOVA formulas can be arranged systematically in the form of a table. This
ANOVA table can be summarized as follows:
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F
Between groups | SSB | df1 = k − 1 | MSB = SSB/(k − 1) | F = MSB/MSE
Error (within groups) | SSE | df2 = N − k | MSE = SSE/(N − k) |
Total | SST = SSB + SSE | N − 1 | |
● One Way ANOVA

The one way ANOVA test is used to determine whether there is any difference between the means of three or
more groups. A one way ANOVA will have only one independent variable. The hypothesis for a one way
ANOVA test can be set up as follows:
Null Hypothesis, H0: μ1 = μ2 = μ3 = ... = μk
Alternative Hypothesis, H1: The means are not equal
Decision Rule: If test statistic > critical value then reject the null hypothesis and conclude that the means of at
least two groups differ significantly.
The steps to perform the one way ANOVA test are given below:
○ Step 1: Calculate the mean for each group.
○ Step 2: Calculate the total mean. This is done by adding all the means and dividing it by the total number
of means.
○ Step 3: Calculate the SSB.
○ Step 4: Calculate the between groups degrees of freedom.
○ Step 5: Calculate the SSE.
○ Step 6: Calculate the degrees of freedom of errors.
○ Step 7: Determine the MSB and the MSE.
○ Step 8: Find the f test statistic.
○ Step 9: Using the F table for the specified level of significance, α, find the critical value. This is given by
F(α, df1, df2).
○ Step 10: If f > F then reject the null hypothesis.
● Examples on ANOVA Test
● Example 1: Three types of fertilizers are used on three groups of plants for 5 weeks. We want to
check if there is a difference in the mean growth of each group. Using the data given below apply a
one way ANOVA test at 0.05 significance level.
Fertilizer 1: 6, 8, 4, 5, 3, 4 (mean X̄1 = 5)
Fertilizer 2: 8, 12, 9, 11, 6, 8 (mean X̄2 = 9)
Fertilizer 3: 13, 9, 11, 8, 7, 12 (mean X̄3 = 10)

Solution:

H0: μ1 = μ2 = μ3; H1: The means are not equal

Total mean X̄ = 8; n1 = n2 = n3 = 6; k = 3
SSB = 6(5 − 8)² + 6(9 − 8)² + 6(10 − 8)² = 84,
df1 = k − 1 = 2
Fertilizer 1 (mean 5): squared deviations (X − 5)² are 1, 9, 1, 0, 4, 1, total = 16
Fertilizer 2 (mean 9): squared deviations (X − 9)² are 1, 9, 0, 4, 9, 1, total = 24
Fertilizer 3 (mean 10): squared deviations (X − 10)² are 9, 1, 1, 4, 9, 4, total = 28

SSE = 16 + 24 + 28 = 68

N = 18
df2 = N - k = 18 - 3 = 15
MSB = SSB / df1 = 84 / 2 = 42
MSE = SSE / df2 = 68 / 15 = 4.53
ANOVA test statistic, f = MSB / MSE = 42 / 4.53 ≈ 9.27
Using the f table at α= 0.05 the critical value is given as F(0.05, 2, 15) = 3.68
As f > F, thus, the null hypothesis is rejected and it can be concluded that there is a
difference in the mean growth of the plants.
Answer: Reject the null hypothesis
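The worked example can be checked with scipy; f_oneway reproduces the F-statistic and also returns the p-value:

```python
# One-way ANOVA for the fertilizer example.
from scipy.stats import f_oneway, f

fert1 = [6, 8, 4, 5, 3, 4]
fert2 = [8, 12, 9, 11, 6, 8]
fert3 = [13, 9, 11, 8, 7, 12]

stat, p = f_oneway(fert1, fert2, fert3)
print(f"F = {stat:.2f}, p = {p:.4f}")      # F ≈ 9.26, p ≈ 0.002

critical = f.ppf(0.95, dfn=2, dfd=15)      # F(0.05, 2, 15)
print(f"critical value = {critical:.2f}")  # ≈ 3.68; F > critical -> reject H0
```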
• Pearson Correlation
• Pearson’s correlation coefficient is the test statistics that measures the
statistical relationship, or association, between two continuous variables. It is
known as the best method of measuring the association between variables of
interest because it is based on the method of covariance. It gives information
about the magnitude of the association, or correlation, as well as the direction
of the relationship.
• Questions Answered:
• Do test scores and hours spent studying have a statistically significant
relationship?
• Is there a statistical association between IQ scores and depression?
• Assumptions:
• Independence of cases: Cases should be independent of each other.
• Linear relationship: Two variables should be linearly related to each other.
This can be assessed with a scatterplot: plot the value of variables on a scatter
diagram, and check if the plot yields a relatively straight line.
• Homoscedasticity: the residuals scatterplot should be roughly rectangular-
shaped.
• Properties:
• Limit: Coefficient values can range from +1 to −1, where +1 indicates a perfect
positive relationship, −1 indicates a perfect negative relationship, and 0 indicates that no
relationship exists.
• Pure number: It is independent of the unit of measurement. For example, if one
variable’s unit of measurement is in inches and the second variable is in quintals,
Pearson’s correlation coefficient value does not change.
• Symmetric: The correlation coefficient between two variables is symmetric. This
means the coefficient value will remain the same between X and Y or Y and X.
• Degree of correlation:
• Perfect: If the value is near ±1, it is said to be a perfect correlation: as one variable
increases, the other variable tends to also increase (if positive) or decrease (if
negative).
• High degree: If the coefficient value lies between ±0.50 and ±1, it is said to be a
strong correlation.
• Moderate degree: If the value lies between ±0.30 and ±0.49, it is said to be a
medium correlation.
• Low degree: When the value lies below ±0.29, it is said to be a small correlation.
• No correlation: When the value is zero.
• Correlation coefficients are used to measure how strong a relationship is
between two variables. There are several types of correlation coefficient, but
the most popular is Pearson’s. Pearson’s correlation (also called
Pearson’s R) is a correlation coefficient commonly used in linear
regression. If you’re starting out in statistics, you’ll probably learn about
Pearson’s R first. In fact, when anyone refers to the correlation coefficient,
they are usually talking about Pearson’s.
• Correlation Coefficient Formula: Definition
• Correlation coefficient formulas are used to find how strong a relationship is
between data. The formulas return a value between -1 and 1, where:
– 1 indicates a strong positive relationship.
– -1 indicates a strong negative relationship.
– A result of zero indicates no relationship at all.
• Types of correlation coefficient formulas.
• There are several types of correlation coefficient formulas.
– One of the most commonly used formulas is Pearson’s correlation coefficient formula. If you’re
taking a basic stats class, this is the one you’ll probably use:
r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
– Two other formulas are commonly used: the sample correlation coefficient and the population
correlation coefficient.
– Sample correlation coefficient
rxy = sxy / (sx sy)
– sx and sy are the sample standard deviations, and sxy is the sample covariance.
– Population correlation coefficient
ρxy = σxy / (σx σy)
– The population correlation coefficient uses σx and σy as the population standard deviations, and
σxy as the population covariance.
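The sample formula is easy to sanity-check numerically; a small sketch with made-up data showing that the covariance divided by the product of the standard deviations matches numpy's built-in correlation:

```python
# Pearson's r as sample covariance scaled by the two sample standard deviations.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

s_xy = np.cov(x, y, ddof=1)[0, 1]                       # sample covariance
r = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))      # r = s_xy / (s_x * s_y)

print(f"r = {r:.4f}")
print(f"np.corrcoef check: {np.corrcoef(x, y)[0, 1]:.4f}")   # same value
```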
• What is Pearson Correlation?
• Correlation between sets of data is a measure of how well they are related.
The most common measure of correlation in stats is the Pearson
Correlation. The full name is the Pearson Product Moment Correlation
(PPMC). It shows the linear relationship between two sets of data. In
simple terms, it answers the question, Can I draw a line graph to represent
the data? Two letters are used to represent the Pearson correlation: Greek
letter rho (ρ) for a population and the letter “r” for a sample.
• Potential problems with Pearson correlation.
• The PPMC is not able to tell the difference between dependent
variables and independent variables. For example, if you are trying to find
the correlation between a high calorie diet and diabetes, you might find a
high correlation of .8. However, you could also get the same result with the
variables switched around. In other words, you could say that diabetes
causes a high calorie diet. That obviously makes no sense. Therefore, as a
researcher you have to be aware of the data you are plugging in. In
addition, the PPMC will not give you any information about the slope of
the line; it only tells you whether there is a relationship.
Example question: Find the value of the correlation coefficient from the
following table:

SUBJECT | AGE X | GLUCOSE LEVEL Y | XY | X² | Y²
1 | 43 | 99 | 4257 | 1849 | 9801
2 | 21 | 65 | 1365 | 441 | 4225
3 | 25 | 79 | 1975 | 625 | 6241
4 | 42 | 75 | 3150 | 1764 | 5625
5 | 57 | 87 | 4959 | 3249 | 7569
6 | 59 | 81 | 4779 | 3481 | 6561
Σ | 247 | 486 | 20485 | 11409 | 40022

Use the following correlation coefficient formula:
r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
• From our table:
• Σx = 247
• Σy = 486
• Σxy = 20,485
• Σx² = 11,409
• Σy² = 40,022
• n is the sample size, in our case = 6
• The correlation coefficient:
r = [6(20,485) − (247 × 486)] / √([6(11,409) − (247)²] × [6(40,022) − (486)²]) = 2,868 / 5,413.27 ≈ 0.5298
• The range of the correlation coefficient is from −1 to 1. Our result is 0.5298, which means
the variables have a moderate positive correlation.
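The worked answer can be verified with scipy:

```python
# Pearson correlation for the age / glucose-level example above.
from scipy.stats import pearsonr

age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r, p = pearsonr(age, glucose)
print(f"r = {r:.4f}, p = {p:.3f}")   # r ≈ 0.5298, matching the hand calculation
```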
