Advanced Educational Statistics - Edu 901C
Hillary Rutto
Solomon Kiplimo
Statistical hypothesis
b) The research (alternative) hypothesis (HA or H1). The objective of hypothesis testing is to
decide, based on sample evidence, which of the two competing hypotheses is better supported by
the data.
We usually do new research to challenge the existing (accepted) beliefs. The burden of proof is
placed on those who believe in the alternative claim. This initially favored claim (H0) will not be
rejected in favor of the alternative claim (HA or H1) unless the sample evidence provides
significant support for the alternative assertion. If the sample does not strongly contradict H0, we
will continue to believe in the plausibility of the null hypothesis. This reflects the Popperian
Principle of Falsification put forward by Karl Popper, who argued that we can't conclusively
confirm a hypothesis, but we can conclusively negate one.
Example: Suppose a school is considering replacing its current academic revision programme
with a new one. The school would be reluctant to change over to the new programme unless
evidence strongly suggests that the new programme is superior to the current one.
An appropriate problem formulation would involve testing:
H0: There is no difference between the current programme and the new programme, against
HA: The new programme is superior to the current programme.
The conclusion that a change is justified is identified with HA, and it would take conclusive
evidence to justify rejecting H0 and switching to the new programme.
The alternative to the null hypothesis H0: μ1 = μ2 will look like one of the following three
assertions:
a) HA: μ1 ≠ μ2 (two-tailed test)
b) HA: μ1 > μ2 (in which case the null hypothesis is μ1 ≤ μ2)
c) HA: μ1 < μ2 (in which case the null hypothesis is μ1 ≥ μ2)
A researcher might believe that the parameter has increased, decreased or changed.
Upper-tailed and lower-tailed tests are one-tailed tests and correspond to a directional research
hypothesis, which reflects an expected difference between groups and specifies the direction of
this difference, e.g. HA: The mean height of males is greater than that of females.
c) Where a difference is hypothesized but no direction is specified, this is called a two-tailed
test. A two-tailed test reflects an expected difference between groups but does not specify the
direction of this difference, e.g.
HA: The mean height of males is different from that of females.
The exact form of the research hypothesis depends on the investigator's belief about the
parameter of interest and whether it has possibly increased, decreased or is different from the
null value. The research hypothesis is set up by the investigator before any data are collected.
To decide whether we have sufficient evidence against the null hypothesis to reject it in favour of
the alternative hypothesis, one must first decide upon a significance level.
The significance level (or α level) is a threshold that determines whether a study result can be
considered statistically significant after performing the planned statistical tests. It is most often
set to 5% (or 0.05). It is the probability of rejecting the null hypothesis when it is in fact true,
and represents the probability of committing a type I error. For example, a significance level of
0.05 indicates a 5% risk of concluding that a difference exists when in fact there is none.
A confidence level is a way to express how sure we are about the results of a study or experiment. It is
often represented as a percentage, such as 95% or 99%. The confidence level tells us the likelihood that
the true value or effect we are estimating falls within a given range. A 95% confidence level, for
example, means that if we were to conduct the same study many times, we would expect the true
value to fall within our estimated range about 95 out of 100 times. The remaining 5 times, our
estimate might not capture the true value, but this is considered an acceptable level of risk.
Confidence level + α = 1, e.g. 95% + 5% = 100% (0.95 + 0.05 = 1).
In hypothesis testing, either the H0 is rejected or it is not. Because this is based on a sample and
not the entire population, we could be wrong about the true treatment effect. Just by chance, it is
possible that this sample reflects a relationship which is not present in the population – this is
when type I and type II errors can happen.
Type I error
A type I error is the incorrect rejection of a true null hypothesis. Usually a type I error leads one
to conclude that a supposed effect or relationship exists when in fact it doesn't. Examples of type
I errors include a test that shows a patient to have a disease when in fact the patient does not have
the disease, or an experiment indicating that a medical treatment should cure a disease when in
fact it does not. Type I errors cannot be completely avoided, but investigators should decide on
an acceptable level of risk of making type I errors when designing the study.
Type II error.
A type II error is the failure to reject a false null hypothesis. This leads to the conclusion that an
effect or relationship doesn't exist when it really does. Examples of type II errors would be a
blood test failing to detect the disease it was designed to detect, in a patient who really has the
disease.
A contingency table for type I and type II errors

                                 In reality
Decision             H0 is TRUE                       H0 is FALSE
Do not reject H0     Correct decision                 Type II error (false negative)
Reject H0            Type I error (false positive)    Correct decision
Choosing a higher significance level, such as 0.10, increases the chances of making a Type I
error but reduces the chances of making a Type II error (failing to reject a false null hypothesis).
On the other hand, choosing a lower significance level, like 0.01, decreases the chances of a
Type I error but increases the chances of a Type II error. Researchers need to strike a balance
based on the context and the consequences of making each type of error.
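The trade-off can be illustrated by simulation, sketched here under stated assumptions (Python with numpy and scipy available; the population parameters are arbitrary): when H0 is actually true, the long-run proportion of (incorrect) rejections should be close to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed for a reproducible illustration
alpha = 0.05
n_sims = 2000

# Simulate many studies in which H0 is true:
# both groups are drawn from the same normal population.
false_rejections = 0
for _ in range(n_sims):
    a = rng.normal(loc=50, scale=10, size=30)
    b = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_ind(a, b)
    if p <= alpha:            # Type I error: rejecting a true H0
        false_rejections += 1

type1_rate = false_rejections / n_sims
print(f"Observed Type I error rate: {type1_rate:.3f}")  # should be close to alpha
```

Raising α to 0.10 in this sketch would roughly double the simulated false-rejection rate, which is exactly the Type I/Type II trade-off described above.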
3. Data Collection
The type of data collected plays a crucial role in determining the appropriate statistical test
for hypothesis testing. There are two main types of data:
1. Categorical Data: This type of data represents categories or groups and is often nominal
or ordinal. Examples include gender, colors, or education levels. For hypothesis testing
with categorical data, chi-square tests or Fisher's exact tests may be used.
2. Numerical Data (Quantitative Data): This type of data consists of numerical values and
can be further categorized as either continuous or discrete. Continuous numerical data
includes measurements like height or weight, while discrete numerical data includes
counts, such as the number of people in a household. Depending on the characteristics of
the data and the research question, t-tests, ANOVA, regression analysis, or other
appropriate statistical tests may be applied.
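As a minimal sketch (assuming Python with scipy is available, and using made-up counts and scores), the two data types map onto tests like this:

```python
from scipy import stats

# Categorical data: hypothetical pass/fail counts by gender -> chi-square test
observed = [[30, 10],   # males: pass, fail
            [25, 15]]   # females: pass, fail
chi2, p_cat, dof, expected = stats.chi2_contingency(observed)

# Numerical (continuous) data: hypothetical exam scores for two groups -> t-test
group_a = [72, 85, 78, 90, 66, 81, 75]
group_b = [68, 74, 70, 83, 65, 72, 69]
t_stat, p_num = stats.ttest_ind(group_a, group_b)

print(f"chi-square p = {p_cat:.3f}, t-test p = {p_num:.3f}")
```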
4. Determine the appropriate test statistic and calculate it using the sample data.
The test statistic is a function of the sample data that will be used to make a decision
about whether the null hypothesis should be rejected or not and represents the likelihood of
obtaining sample outcomes if the null hypothesis were true.
The test statistic summarizes the observed data into a single number using the central
tendency, variation, sample size, and number of predictor variables in the statistical model.
The type of test statistic to be used in a hypothesis test depends on several factors
including:
a) The type of statistic you are using in the test.
b) The size of your sample. For a statistical test to be valid, your sample size needs to be
large enough to approximate the true distribution of the population being studied.
c) Assumptions you can make about the distribution of your data.
d) Assumptions you can make about the distribution of the statistic used in the test.
Statistical tests make some common assumptions about the data they are testing. They
include:
a) Independence of observations. The observations/variables you include in your test are not
related.
b) Homogeneity of variance. The variance within all comparison groups is the same. If one
group has much more variation than the others, it will limit the test’s effectiveness.
c) Normality of data. The data follows a normal distribution. This assumption applies only
to quantitative data.
If the data do not meet the assumptions of normality or homogeneity of variance, one can
perform nonparametric statistical tests, which allow comparisons to be made without
assumptions about the data distribution.
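If those assumptions fail, a nonparametric comparison can be sketched like this (scipy assumed; the scores are hypothetical and deliberately skewed, so a t-test would be a poor fit):

```python
from scipy import stats

# Hypothetical skewed scores for two classes
class_a = [3, 4, 2, 30, 5, 4, 3, 41]
class_b = [8, 9, 7, 55, 10, 9, 8, 60]

# Mann-Whitney U test: compares the two groups using ranks,
# with no assumption that the data are normally distributed.
u_stat, p = stats.mannwhitneyu(class_a, class_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.4f}")
```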
The distribution of data is how often each observation occurs, and can be described by its central
tendency and variation around that central tendency. Different statistical tests predict different
types of distributions, so it’s important to choose the right statistical test for the stated
hypothesis.
Choosing a parametric test
Parametric tests usually have stricter requirements than nonparametric tests, and are able to
make stronger inferences from the data. They can only be conducted with data that adheres to
the common assumptions of statistical tests. The most common types of parametric test include
regression tests, comparison tests, and correlation tests.
Regression tests
Regression tests look for cause-and-effect relationships. They can be used to estimate the effect
of one or more continuous variables on another variable.
Correlation tests
Correlation tests check whether variables are related without hypothesizing a cause-and-effect
relationship.
Test          Variables                Research question example
Pearson’s r   2 continuous variables   How are latitude and temperature related?
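The Pearson’s r example can be illustrated with a small sketch (scipy assumed; the latitude and temperature values are invented for illustration, chosen to decrease roughly linearly):

```python
from scipy import stats

latitude    = [1, 10, 20, 30, 40, 50, 60]   # degrees from the equator
temperature = [27, 25, 22, 18, 12, 7, 1]    # mean annual temperature, deg C

# Pearson's r: strength and direction of the linear relationship
r, p = stats.pearsonr(latitude, temperature)
print(f"r = {r:.3f}, p = {p:.4f}")  # strongly negative for these values
```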
Comparison tests
Comparison tests look for differences among group means. They can be used to test the effect of
a categorical variable on the mean value of some other characteristic.
T-tests are used when comparing the means of precisely two groups (e.g., the average heights of
men and women). ANOVA and MANOVA tests are used when comparing the means of more
than two groups (e.g., the average heights of children, teenagers, and adults).
Test            Predictor variable              Outcome variable      Research question example
Paired t-test   Categorical, 1 predictor;       Quantitative          What is the effect of two different test
                groups come from the                                  prep programs on the average exam
                same population                                       scores for students from the same class?
Independent     Categorical, 1 predictor;       Quantitative          What is the difference in average exam
t-test          groups come from                                      scores for students from two different
                different populations                                 schools?
ANOVA           Categorical,                    Quantitative,         What is the difference in average pain
                1 or more predictors            1 outcome             levels among post-surgical patients
                                                                      given three different painkillers?
MANOVA          Categorical,                    Quantitative,         What is the effect of flower
                1 or more predictors            2 or more outcomes    species on petal length, petal width,
                                                                      and stem length?
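The ANOVA case above can be sketched as follows (scipy assumed; the pain ratings for the three painkiller groups are hypothetical):

```python
from scipy import stats

# Hypothetical post-surgical pain ratings under three painkillers
painkiller_1 = [4, 5, 4, 6, 5]
painkiller_2 = [6, 7, 6, 8, 7]
painkiller_3 = [3, 2, 4, 3, 3]

# One-way ANOVA: tests whether at least one group mean differs
f_stat, p = stats.f_oneway(painkiller_1, painkiller_2, painkiller_3)
print(f"F = {f_stat:.2f}, p = {p:.4f}")
```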
A number of test statistics are available for testing hypotheses; they vary in suitability and in
how they are calculated. Common ones include the t-statistic, z-statistic, F-statistic, and
chi-square statistic.
Test statistics are typically calculated using a statistical program, e.g. SPSS, which will also
calculate the p-value of the test statistic. Tables for estimating the p-value of the test statistic are
also available; they show, based on the test statistic and degrees of freedom (number of
observations minus number of independent variables) of the test, how frequently you would
expect to see that test statistic under the null hypothesis.
The p-value (probability value) is the probability of obtaining a result as extreme as, or more
extreme than, the result actually obtained when the null hypothesis is true. The p-value ranges
from 0 to 1.
Example
Suppose one wants to run an independent-samples t-test to determine whether or not the average
scores of male and female students in a chemistry test are equal; data are collected and analysed
and a p-value arrived at.
A high p-value, for example 0.90 leaves little reason to doubt the null hypothesis. On the other
hand, if the p-value is small (for example 0.01), there would only be a small chance that the data
would be obtained if the null hypothesis was true.
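The worked example can be sketched as follows (scipy assumed; the scores are invented, so the printed conclusion applies only to this illustration):

```python
from scipy import stats

# Hypothetical chemistry scores for the two groups
male_scores   = [62, 70, 68, 75, 66, 71, 69, 73]
female_scores = [64, 69, 70, 74, 67, 72, 68, 71]

t_stat, p = stats.ttest_ind(male_scores, female_scores)

# Interpret the p-value against a 0.05 significance level
if p <= 0.05:
    print(f"p = {p:.3f}: reject H0, the mean scores differ")
else:
    print(f"p = {p:.3f}: fail to reject H0, no evidence of a difference")
```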
In order to decide whether to reject the null hypothesis, a test statistic is calculated. The
decision is made based on the numerical value of that test statistic. There are two approaches
to arriving at that decision: the critical value approach and the p-value approach.
The observed test statistic, calculated from the sample data, is compared to the critical value
(a cutoff value). The critical value divides the area under the probability distribution curve
into rejection region(s) and a non-rejection region.
The null hypothesis is rejected if the test statistic is more extreme than the critical value. The
null hypothesis is not rejected if the test statistic is not as extreme as the critical value. The
critical value is computed based on the given significance level α and the type of probability
distribution employed.
In a two-tailed test, the null hypothesis is rejected if the test statistic is too small or too large.
The rejection region for such a test consists of two parts: one on the left and one on the right.
Rejection region in a one-tailed test (left/lower-tailed)
The null hypothesis is rejected for a left-tailed test if the test statistic is too small. Thus, the
rejection region for such a test consists of one region to the left of the centre.
The null hypothesis is rejected for a right-tailed test if the test statistic is too large. Thus, the
rejection region for such a test consists of one region to the right of the centre.
In the p-value approach, the p-value derived from the observed test statistic is compared to the
specified significance level of the hypothesis test.
The p-value corresponds to the probability of observing sample data at least as extreme as the
actually obtained test statistic. Small p-values provide evidence against the null hypothesis.
The smaller (closer to 0) the p-value, the stronger is the evidence against the null hypothesis.
The null hypothesis is rejected if the p-value is less than or equal to the specified significance
level (α). Otherwise, the null hypothesis is not rejected.
If p ≤ α, reject H0; otherwise (p > α), do not reject H0.
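The decision rule can be expressed directly in code (a minimal sketch):

```python
# Decision rule: reject H0 when the p-value is at or below alpha.
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule for hypothesis testing."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

print(decide(0.01))   # small p-value: strong evidence against H0
print(decide(0.90))   # large p-value: H0 is not rejected
```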
If the null hypothesis is rejected, results are interpreted in the context of the study and the
alternative hypothesis. If on the contrary the null hypothesis is not rejected, limitations and the
lack of evidence against it are acknowledged.
In addition to hypothesis testing, the effect size is considered as this quantifies the magnitude of
the observed effect. A small p-value does not necessarily imply a large practical significance.
A sensitivity analysis also needs to explore how the results change under different assumptions,
significance levels, or statistical methods.