Assignment#8614 2



Name: Maria ayub

Semester: 3rd

User Id: 0000391332

Assignment no: 02

Program: b.ed 1.5

Course Code: 8614

Q.1 Explain three major measures of central tendency. Also explain the
procedure to calculate them.

Central Tendency

In statistics, central tendency is a descriptive summary of a dataset. It uses a
single value to reflect the centre of the data distribution. It does not provide
information about individual values in the dataset; instead, it summarizes the
dataset as a whole. Generally, the central tendency of a dataset can be described
using a few standard statistical measures.


The central tendency is stated as the statistical measure that represents the single
value of the entire distribution or a dataset. It aims to provide an accurate
description of the entire data in the distribution.

Measures of Central Tendency

The central tendency of the dataset can be found out using the three important
measures namely mean, median and mode.


The mean represents the average value of the dataset. It can be calculated as the
sum of all the values in the dataset divided by the number of values. In general, it
is considered as the arithmetic mean. Some other measures of mean used to find
the central tendency are as follows:

● Geometric Mean

● Harmonic Mean

● Weighted Mean

If all the values in the dataset are the same, then the arithmetic, geometric and
harmonic means are all equal. If there is variability in the data, the three means
differ from one another. Calculating the arithmetic mean is straightforward. The
formula is:

Mean = (sum of all the values in the dataset) / (number of values)

In a symmetric data distribution, the mean is located exactly at the centre.
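
As a rough illustration of the point about the three means, the following Python
sketch compares them on two small datasets. The datasets and the helper name
three_means are assumptions made up for this example; they do not come from the
assignment.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical datasets, chosen only for illustration.
constant_values = [4, 4, 4, 4]   # no variability
varied_values = [2, 4, 8]        # some variability


def three_means(values):
    """Return the (arithmetic, geometric, harmonic) means of positive values."""
    return mean(values), geometric_mean(values), harmonic_mean(values)


am_c, gm_c, hm_c = three_means(constant_values)
am_v, gm_v, hm_v = three_means(varied_values)

print("constant:", am_c, gm_c, hm_c)
print("varied:  ", am_v, gm_v, hm_v)
```

With identical values the three means coincide. With varied positive values the
harmonic mean is the smallest and the arithmetic mean is the largest.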

The median is the middle value of the dataset when the values are arranged in
ascending or descending order. When the dataset contains an even number of values,
the median is found by taking the mean of the two middle values.

Consider the given dataset with an odd number of observations arranged in
descending order: 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2.

Here 12 is the middle value, or median, with 6 values above it and 6 values
below it.

Now, consider another example with an even number of observations that are
arranged in descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and

When you look at the given dataset, the two middle values obtained are 27 and 29.

Now, find out the mean value for these two numbers.

i.e., (27 + 29) / 2 = 28

Therefore, the median for the given data distribution is 28.


The mode is the most frequently occurring value in the dataset. A dataset may
contain more than one mode, and in some cases it has no mode at all.

Consider the given dataset 5, 4, 2, 3, 2, 1, 5, 4, 5

Since the mode represents the most common value, the mode of the given dataset
is 5, which occurs three times.
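
The median and mode examples above can be checked with a short Python sketch. It
uses the odd-sized dataset and the mode dataset exactly as given in the text; the
variable names are chosen here for illustration only.

```python
from statistics import median, multimode

# Datasets copied from the worked examples in the text.
odd_data = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
mode_data = [5, 4, 2, 3, 2, 1, 5, 4, 5]

# median() sorts the values internally and picks the middle one.
middle_value = median(odd_data)

# With an even count, median() averages the two middle values,
# as in the (27 + 29) / 2 step of the second example.
mean_of_middle_pair = median([27, 29])

# multimode() returns every value that shares the highest frequency.
modes = multimode(mode_data)

print(middle_value, mean_of_middle_pair, modes)
```

Using multimode rather than mode makes it visible when a dataset has several
modes, which the text notes is possible.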

Based on the properties of the data, the measures of central tendency are selected.

● If you have a symmetrical distribution of continuous data, all three measures
of central tendency hold good. But most of the time, the analyst uses the mean
because it involves all the values in the distribution or dataset.

● If you have a skewed distribution, the best measure of central tendency is
the median.


Measures of Central Tendency and Dispersion

The central tendency measure is defined as the number used to represent the centre
or middle of a set of data values. The three commonly used measures of central
tendency are the mean, median, and mode.

Measures of central tendency



A measure of central tendency (also referred to as a measure of centre or central
location) is a summary measure that attempts to describe a whole set of data with
a single value that represents the middle or centre of its distribution.

There are three main measures of central tendency:

● mode

● median

● mean

Each of these measures describes a different indication of the typical or
central value in the distribution.


The mode is the most commonly occurring value in a distribution.

Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.

Frequency distribution table

Age (years)    Frequency
54             3
55             1
56             1
57             2
58             2
60             2

The most commonly occurring value is 54, therefore the mode of this distribution
is 54 years.

Advantage of the mode

The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.

Limitations of the mode

There are some limitations to using the mode. In some distributions, the mode may
not reflect the centre of the distribution very well. When the distribution of
retirement age is ordered from lowest to highest value, it is easy to see that the
centre of the distribution is 57 years, but the mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

It is also possible for there to be more than one mode for the same distribution of
data, (bi-modal, or multi-modal).


The median is the middle value in a distribution when the values are arranged in
ascending or descending order.

The median divides the distribution in half (50% of observations lie on either
side of the median value). In a distribution with an odd number of observations,
the median value is the middle value.

Advantage of the median

The median is less affected by outliers and skewed data than the mean and is
usually the preferred measure of central tendency when the distribution is not
symmetrical.
Limitation of the median

The median cannot be identified for categorical nominal data, as it cannot be
logically ordered.


The mean is the sum of the value of each observation in a dataset divided by the
number of observations. This is also known as the arithmetic average.

Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
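
The three measures for the retirement age data can be worked out in a few lines of
Python. This is only a sketch using the eleven ages listed above; the variable
names are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median, mode

# Retirement ages of 11 people, copied from the text.
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Frequency of each age, i.e. a simple frequency distribution.
frequency = Counter(ages)

total = sum(ages)             # sum of all observations
average = mean(ages)          # total divided by the number of observations
middle_age = median(ages)     # 6th value of the 11 ordered values
most_common_age = mode(ages)  # value with the highest frequency

print(frequency)
print(total, round(average, 1), middle_age, most_common_age)
```

The ages sum to 623, so the mean is 623 / 11, or about 56.6 years. This sits
between the mode (54) and the median (57).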

Advantage of the mean

The mean can be used for both continuous and discrete numeric data.

Limitations of the mean

The mean cannot be calculated for categorical data, as the values cannot be
summed.

As the mean includes every value in the distribution, the mean is influenced by
outliers and skewed distributions.

Another thing about the mean

The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When
the mean is calculated on a distribution from a sample it is indicated by the symbol
x̅ (pronounced X-bar).

Q.2 What do you mean by inferential statistics? How is it important in
educational research?

Inferential Statistics | An Easy Introduction & Examples

Published on September 4, 2020 by Pritha Bhandari. Revised on June 22, 2023.

While descriptive statistics summarize the characteristics of a data set,
inferential statistics help you come to conclusions and make predictions based on
your data.

Inferential statistics have two main uses:

● making estimates about populations (for example, the mean SAT score of all
11th graders in the US).
● testing hypotheses to draw conclusions about populations (for example, the
relationship between SAT scores and family income).


Descriptive versus inferential statistics

Descriptive statistics allow you to describe a data set, while inferential
statistics allow you to make inferences based on a data set.

Descriptive statistics

Using descriptive statistics, you can report characteristics of your data:

● The distribution concerns the frequency of each value.

● The central tendency concerns the averages of the values.
● The variability concerns how spread out the values are.

In descriptive statistics, there is no uncertainty – the statistics precisely
describe the data that you collected. If you collect data from an entire
population, you can directly compare these descriptive statistics to those from
other populations.

Example: Descriptive statistics
You collect data on the SAT scores of all 11th graders in a school for three years.

Inferential statistics
Most of the time, you can only acquire data from samples, because it is too difficult
or expensive to collect data from the whole population that you’re interested in.

Example: Inferential statistics
You randomly select a sample of 11th graders in your state and collect data on
their SAT scores and other characteristics.

You can use inferential statistics to make estimates and test hypotheses about the
whole population of 11th graders in the state based on your sample data.

Sampling error in inferential statistics

Since the size of a sample is always smaller than the size of the population, some of
the population isn’t captured by sample data. This creates sampling error, which is
the difference between the true population values (called parameters) and the
measured sample values (called statistics).


Estimating population parameters from sample statistics

The characteristics of samples and populations are described by numbers
called statistics and parameters:

● A statistic is a measure that describes the sample (e.g., sample mean).

● A parameter is a measure that describes the whole population (e.g.,
population mean).

Sampling error is the difference between a parameter and a corresponding statistic.

Since in most cases you don’t know the real population parameter, you can use
inferential statistics to estimate these parameters in a way that takes sampling error
into account.

There are two important types of estimates you can make about the population: point
estimates and interval estimates.

● A point estimate is a single value estimate of a parameter. For instance, a
sample mean is a point estimate of a population mean.
● An interval estimate gives you a range of values where the parameter is
expected to lie. A confidence interval is the most common type of interval
estimate.

Both types of estimates are important for gathering a clear idea of where a parameter
is likely to lie.

Confidence intervals

A confidence interval uses the variability around a statistic to come up with an
interval estimate for a parameter. Confidence intervals are useful for estimating
parameters because they take sampling error into account.

While a point estimate gives you a precise value for the parameter you are interested
in, a confidence interval tells you the uncertainty of the point estimate. They are best
used in combination with each other.
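
A point estimate and an interval estimate can be sketched together as follows.
The SAT scores are hypothetical values invented for this example, and the interval
uses a normal (z) approximation, which is an assumption made here for simplicity;
with a sample this small a t-based interval would normally be wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical SAT scores from a random sample of 10 students (invented data).
sample = [1050, 1120, 980, 1210, 1100, 1005, 1150, 1075, 990, 1130]

# Point estimate of the population mean.
point_estimate = mean(sample)

# The standard error measures how much the sample mean varies between samples.
standard_error = stdev(sample) / sqrt(len(sample))

# z value that leaves 2.5% in each tail (about 1.96) for a 95% interval.
z = NormalDist().inv_cdf(0.975)

lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print(point_estimate, round(lower, 1), round(upper, 1))
```

The point estimate gives one value, and the interval around it shows how much
uncertainty sampling error introduces.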

Hypothesis testing

Hypothesis testing is a formal process of statistical analysis using inferential
statistics. The goal of hypothesis testing is to compare populations or assess
relationships between variables using samples.

Hypotheses, or predictions, are tested using statistical tests. Statistical tests also
estimate sampling errors so that valid inferences can be made.

Parametric tests make assumptions that include the following:

● the population that the sample comes from follows a normal distribution
● the sample size is large enough to represent the population
● the variances, a measure of variability, of each group being compared are
similar

Comparison tests

Comparison tests assess whether there are differences in means, medians or
rankings of scores of two or more groups.

Means can only be found for interval or ratio data, while medians and rankings are
more appropriate measures for ordinal data.

Comparison test                      Parametric?  What’s being compared?  Samples
t test                               Yes          Means                   2 samples
ANOVA                                Yes          Means                   3+ samples
Mood’s median                        No           Medians                 2+ samples
Wilcoxon signed-rank                 No           Distributions           2 samples
Wilcoxon rank-sum (Mann-Whitney U)   No           Sums of rankings        2 samples
Kruskal-Wallis H                     No           Mean rankings           3+ samples

Correlation tests

Correlation tests determine the extent to which two variables are associated.
Although Pearson’s r is the most statistically powerful test, Spearman’s r is
appropriate for interval and ratio variables when the data doesn’t follow a normal
distribution.

The chi square test of independence is the only test that can be used
with nominal variables.

Correlation test                  Parametric?  Variables
Pearson’s r                       Yes          Interval/ratio variables
Spearman’s r                      No           Ordinal/interval/ratio variables
Chi square test of independence   No           Nominal/ordinal variables

Regression tests
Regression tests assess whether changes in predictor variables are associated with
changes in an outcome variable. You can decide which regression test to use based
on the number and types of variables you have as predictors and outcomes.

Data transformations help you make your data normally distributed using
mathematical operations, like taking the square root of each value.

Regression test              Predictor                      Outcome
Simple linear regression     1 interval/ratio variable      1 interval/ratio variable
Multiple linear regression   2+ interval/ratio variable(s)  1 interval/ratio variable
Logistic regression          1+ any variable(s)             1 binary variable
Nominal regression           1+ any variable(s)             1 nominal variable
Ordinal regression           1+ any variable(s)             1 ordinal variable

Inferential Statistics

Inferential statistics is a branch of statistics that makes the use of various analytical
tools to draw inferences about the population data from sample data. Apart from
inferential statistics, descriptive statistics forms another branch of statistics.

What is Inferential Statistics?

Inferential statistics helps to develop a good understanding of the population data
by analyzing the samples obtained from it. It helps in making generalizations about
the population by using various analytical tests and tools.

Q.3 When and where do we use correlation and regression in research?

Correlation is the relationship or association between two variables. There are
multiple ways to measure correlation, but the most common is Pearson's
correlation coefficient (r), which tells you the strength of the linear relationship
between two variables.

Spurious Relationships

It's important to remember that correlation does not always indicate causation.
Two variables can be correlated without either variable causing the other. For
instance, ice cream sales and drownings might be correlated, but that doesn't mean
that ice cream causes drownings—instead, both ice cream sales and drownings
increase when the weather is hot. Relationships like this are called spurious
relationships.

Q: When can I use correlation analysis as opposed to regression analysis?


Detailed Question -

I would like to test a number of hypotheses but I am not sure which analysis method
is best. I am therefore looking for information indicating instances when to use
each method, including interpretation and reporting of results. Can you suggest
suitable resources? I am new to research work.



The usage of correlation analysis or regression analysis depends on your data set
and the objective of the study. Correlation analysis is used to quantify the degree
to which two variables are related.

Correlational Research | When & How to Use

Published on July 7, 2021 by Pritha Bhandari. Revised on June 22, 2023.

A correlational research design investigates relationships
between variables without the researcher controlling or manipulating any of them.

A correlation reflects the strength and/or direction of the relationship between two
(or more) variables. The direction of a correlation can be either positive or negative.

Positive correlation   Both variables change in the same direction       As height increases, weight also increases
Negative correlation   The variables change in opposite directions       As coffee consumption increases, tiredness decreases
Zero correlation       There is no relationship between the variables    Coffee consumption is not correlated with height


When to use correlational research

Correlational research is ideal for gathering data quickly from natural settings. That
helps you generalize your findings to real-life situations in an externally valid way.

There are a few situations where correlational research is an appropriate choice.

To investigate non-causal relationships

You want to find out if there is an association between two variables, but you don’t
expect to find a causal relationship between them.

Correlational research can provide insights into complex real-world relationships,
helping researchers develop theories and make predictions.

To explore causal relationships between variables

You think there is a causal relationship between two variables, but it is impractical,
unethical, or too costly to conduct experimental research that manipulates one of
the variables.

To test new measurement tools

You have developed a new instrument for measuring your variable, and you need
to test its reliability or validity.

Correlational research can be used to assess whether a tool consistently or
accurately captures the concept it aims to measure.

How to collect correlational data

There are many different methods you can use in correlational research. In the
social and behavioral sciences, the most common data collection methods for this
type of research include surveys, observations, and secondary data.


In survey research, you can use questionnaires to measure your variables of
interest. You can conduct surveys online, by mail, by phone, or in person.

Surveys are a quick, flexible way to collect standardized data from many
participants, but it’s important to ensure that your questions are worded in an
unbiased way and capture relevant insights.

Naturalistic observation
Naturalistic observation is a type of field research where you gather data about a
behavior or phenomenon in its natural environment.

This method often involves recording, counting, describing, and categorizing
actions and events. Naturalistic observation can include both qualitative and
quantitative elements, but to assess correlation, you collect data that can
be analyzed quantitatively (e.g., frequencies, durations, scales, and amounts).

Secondary data

Instead of collecting original data, you can also use data that has already been
collected for a different purpose, such as official records, polls, or previous studies.

Using secondary data is inexpensive and fast, because data collection is complete.
However, the data may be unreliable, incomplete or not entirely relevant, and you
have no control over the reliability or validity of the data collection procedures.

How to analyze correlational data

After collecting data, you can statistically analyze the relationship between
variables using correlation or regression analyses, or both. You can also visualize
the relationships between variables with a scatterplot.

Correlation analysis

Using a correlation analysis, you can summarize the relationship between variables
into a correlation coefficient: a single number that describes the strength and
direction of the relationship between variables. With this number, you’ll quantify
the degree of the relationship between variables.
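
As a minimal sketch of how a correlation coefficient summarizes a relationship,
the following Python code computes Pearson's r by hand. The paired data (hours
studied and test scores) and the function name pearson_r are hypothetical and were
invented for this illustration.

```python
from math import sqrt

# Hypothetical paired observations: hours studied and test score (invented data).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator of r).
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / sqrt(ss_x * ss_y)


r = pearson_r(hours, scores)
print(round(r, 3))
```

A value of r close to +1 indicates a strong positive linear relationship, a value
close to -1 a strong negative one, and a value near 0 little or no linear
relationship.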

Regression analysis
With a regression analysis, you can predict how much a change in one variable will
be associated with a change in the other variable. The result is a regression
equation that describes the line on a graph of your variables.
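
A regression equation of the form y = intercept + slope * x can be sketched with
ordinary least squares as shown below. The data are the same kind of hypothetical
hours and scores used for illustration elsewhere, and the names fit_line and
predict are assumptions made for this example.

```python
# Hypothetical paired observations: hours studied and test score (invented data).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]


def fit_line(x, y):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    slope = co_deviation / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(hours, scores)


def predict(x_value):
    """Predicted outcome for a given predictor value."""
    return intercept + slope * x_value


print(round(slope, 3), round(intercept, 3), round(predict(7), 1))
```

The slope says how much the outcome is expected to change for a one-unit change
in the predictor. It describes an association in the data, not proof that one
variable causes the other.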


Correlation and causation

It’s important to remember that correlation does not imply causation. Just because
you find a correlation between two things doesn’t mean you can conclude one of
them causes the other for a few reasons.

Directionality problem

If two variables are correlated, it could be because one of them is a cause and the
other is an effect. But the correlational research design doesn’t allow you
to infer which is which. To err on the side of caution, researchers don’t conclude
causality from correlational studies.

Third variable problem

A confounding variable is a third variable that influences other variables to make
them seem causally related even though they are not. Instead, there are separate
causal links between the confounder and each variable.

In correlational research, there’s limited or no researcher control over extraneous
variables. Even if you statistically control for some potential confounders, there
may still be other hidden variables that disguise the relationship between your
study variables.

Example: You find a strong positive correlation between working hours and
work-related stress: people with lower working hours report lower levels of
work-related stress. However, this doesn’t prove that lower working hours cause a
reduction in stress.

Q.4 How F Distribution is helpful in making conclusion in educational
research? Briefly discuss the interpretation of F Distribution.

Why do we need F-distribution?

The F-distribution is useful in hypothesis testing. Hypothesis testing is used by
scientists to statistically compare data from two or more populations. The
F-distribution is needed to determine whether the F-value for a study indicates any
statistically significant differences between two populations.

What is F-distribution and what is an example of it?

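The F-distribution describes all of the possible values of the F test statistic
and how likely each one is. Its shape is determined by two degrees of freedom
(numerator and denominator). It is always skewed to the right, and because the F
statistic is a ratio of variances, its values are never negative.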

What does an F-test tell you?

An F-test is a statistical method for comparing the variances of two populations.
This can be used to determine whether statistically significant differences occur
between two populations.

What is an example of an F-test?

The F-test can be used in a variety of experimental settings. For example, if a
scientist wants to determine whether statistically significant weight loss exists
between two groups based on the amount of time spent exercising, the F-test could
be used.

What is a two sample F-test?

The two sample F-test is used when comparing the variances of two populations.
This allows the researcher to determine whether there are statistically significant
differences between the two populations.
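
The F-value for a two sample F-test is the ratio of the two sample variances. The
sketch below computes it for two hypothetical groups; the measurements and all
variable names are invented for illustration. Judging significance would still
require comparing the F-value with a critical value from an F table (or a
statistics library) for the two degrees of freedom, which is not shown here.

```python
from statistics import variance

# Hypothetical weight-loss measurements in kg for two groups (invented data).
group_a = [2.1, 3.4, 1.8, 4.0, 2.9, 3.6]
group_b = [2.5, 2.7, 2.4, 2.9, 2.6, 2.8]

# Sample variances (n - 1 in the denominator).
var_a = variance(group_a)
var_b = variance(group_b)

# By convention the larger variance goes in the numerator, so F >= 1.
if var_a >= var_b:
    f_value = var_a / var_b
    df_numerator, df_denominator = len(group_a) - 1, len(group_b) - 1
else:
    f_value = var_b / var_a
    df_numerator, df_denominator = len(group_b) - 1, len(group_a) - 1

print(round(f_value, 2), df_numerator, df_denominator)
```

An F-value near 1 suggests similar variances. The further it lies above 1, the
stronger the evidence that the two populations differ in variability.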

7 types of statistical distributions with practical examples

Data Science Dojo Staff

Statistical distributions help us understand a problem better by assigning a range
of possible values to the variables, making them very useful in data science and
machine learning. Here are 7 types of distributions with intuitive examples that
often occur in real-life data.

Whether you’re guessing if it’s going to rain tomorrow, betting on a sports team
to win an away match, framing a policy for an insurance company, or simply trying
your luck on blackjack at the casino, probability and distributions come into action
in all aspects of life to determine the likelihood of events.

Having a sound statistical background can be incredibly beneficial in the daily life
of a data scientist. Probability is one of the main building blocks of data science and
machine learning. While the concept of probability gives us mathematical
calculations, statistical distributions help us visualize what’s happening.

Having a good grip on statistical distribution makes exploring a new dataset and
finding patterns within a lot easier. It helps us choose the appropriate machine
learning model to fit our data on and speeds up the overall process.


In this blog, we will be going over diverse types of data, the common distributions
for each of them, and compelling examples of where they are applied in real life.


Common types of data

Explaining various distributions becomes more manageable if we are familiar with
the type of data they use. We encounter two different outcomes in day-to-day
experiments: finite and infinite outcomes.

Difference between Discrete and Continuous Data

When you roll a die or pick a card from a deck, you have a limited number of
outcomes possible. This type of data is called Discrete Data, which can only take a
specified number of values. For example, in rolling a die, the specified values are 1,
2, 3, 4, 5, and 6.

Types of statistical distributions

Depending on the type of data we use, we have grouped distributions into two
categories, discrete distributions for discrete data (finite outcomes) and continuous
distributions for continuous data (infinite outcomes).

Discrete distributions

Discrete uniform distribution: All outcomes are equally likely

In statistics, uniform distribution refers to a statistical distribution in which all
outcomes are equally likely. Consider rolling a six-sided die. Each of the six
numbers (1, 2, 3, 4, 5, or 6) is equally likely to come up on your next roll, with
a probability of 1/6 each, which makes it an example of a discrete uniform
distribution.

Fair Dice Uniform Distribution Graph

Uniform distribution is represented by the function U(a, b), where a and b
represent the starting and ending values, respectively. Similar to a discrete uniform
distribution, there is a continuous uniform distribution for continuous variables.

Bernoulli Distribution: Single-trial with two possible outcomes

The Bernoulli distribution is one of the easiest distributions to understand. It can
be used as a starting point to derive more complex distributions. Any event with a
single trial and only two possible outcomes follows a Bernoulli distribution.

Binomial Distribution: A sequence of Bernoulli events

The Binomial Distribution can be thought of as the sum of outcomes of an event
following a Bernoulli distribution. Therefore, Binomial Distribution is used in binary
outcome events, and the probability of success and failure is the same in all
successive trials. An example of a binomial event would be flipping a coin multiple
times to count the number of heads and tails.

Binomial vs Bernoulli distribution.

The difference between these distributions can be explained through an example.
Suppose you’re attempting a quiz that contains 10 True/False questions. Trying a
single T/F question would be considered a Bernoulli trial, whereas attempting the
entire quiz of 10 T/F questions would be categorized as a Binomial trial. The main
characteristics of Binomial Distribution are:

● Given multiple trials, each of them is independent of the other. That is, the
outcome of one trial doesn’t affect another one.

● Each trial can lead to just two possible results (e.g., winning or losing), with
probabilities p and (1 – p).

A binomial distribution is represented by B(n, p), where n is the number of trials
and p is the probability of success in a single trial. A Bernoulli distribution can be
expressed as the binomial distribution B(1, p) since it has only one trial. The
expected value of a binomial variable x, the number of times a success occurs, is
E(x) = np. Similarly, the variance is Var(x) = np(1 − p).

Binomial Distribution Graph
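
The formulas E(x) = np and Var(x) = np(1 − p) can be checked numerically with a
short Python sketch. It models the 10-question True/False quiz from the example,
assuming each answer is a pure guess with p = 0.5; that value of p and the
function name binomial_pmf are assumptions made here.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10    # number of True/False questions
p = 0.5   # assumed probability of guessing one question correctly

# Probability of each possible number of correct answers, 0 through n.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the distribution.
expected_value = sum(k * prob for k, prob in enumerate(pmf))
variance_value = sum((k - expected_value) ** 2 * prob for k, prob in enumerate(pmf))

print(expected_value, n * p)
print(variance_value, n * p * (1 - p))
```

Setting n = 1 in the same function gives the Bernoulli case B(1, p).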

Poisson Distribution: The probability that an event may or may not occur

Poisson distribution deals with the frequency with which an event occurs within a
specific interval. Instead of the probability of an event, Poisson distribution
requires knowing how often it happens in a particular period or distance. For
example, a cricket chirps two times in 7 seconds on average.

The main characteristics which describe the Poisson Processes are:

● The events are independent of each other.

● An event can occur any number of times (within the defined period).

● Two events can’t take place simultaneously.

Poisson Distribution Graph

The graph of Poisson distribution plots the number of instances an event occurs in
the standard interval of time and the probability of each one.
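
The probabilities that such a graph plots can be sketched in Python as follows.
The rate of 2 events per interval comes from the cricket example above (two chirps
per 7 seconds on average); the function name poisson_pmf and the range of counts
shown are choices made for this illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events in one interval with the given average rate."""
    return rate**k * exp(-rate) / factorial(k)


rate = 2  # average number of chirps per 7-second interval

# Probability of observing 0, 1, 2, ... 5 chirps in one interval.
probabilities = {k: poisson_pmf(k, rate) for k in range(6)}

for count, prob in probabilities.items():
    print(count, round(prob, 4))
```

Each printed pair corresponds to one bar of the graph: a count of events and the
probability of seeing exactly that count in the interval.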

Continuous distributions

Normal Distribution: Symmetric distribution of values around the mean

Normal distribution is the most used distribution in data science. In a normal
distribution graph, data is symmetrically distributed with no skew.

Normal Distribution Bell Curve Graph

Here, you can see the “bell-shaped” curve around the central region, indicating
that most data points lie there. The normal distribution is represented as
N(µ, σ²), where µ is the mean and σ² is the variance. The expected value of a
normal distribution is equal to its mean. The curve is symmetric about its centre,
so the mean, mode, and median are all the same value, with the other values
distributed symmetrically around it.
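
The symmetry of the normal distribution can be illustrated with Python's standard
library. The parameters µ = 100 and σ = 15 are hypothetical values picked for this
sketch and do not come from the text.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
scores_dist = NormalDist(mu=100, sigma=15)

# Half of the probability lies below the mean, so the mean is also the median.
probability_below_mean = scores_dist.cdf(100)
median_value = scores_dist.inv_cdf(0.5)

# The density is the same at equal distances either side of the mean.
density_left = scores_dist.pdf(90)
density_right = scores_dist.pdf(110)

# The density peaks at the mean, so the mean is also the mode.
density_at_mean = scores_dist.pdf(100)

print(probability_below_mean, median_value)
print(density_left, density_right, density_at_mean)
```

Because the curve is highest at the mean and mirrors itself on both sides, the
mean, median, and mode coincide.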


Data is an essential component of the data exploration and model development
process. The first thing that springs to mind when working with continuous
variables is looking at the data distribution. We can adjust our Machine Learning
models to best match the problem if we can identify the pattern in the data
distribution, which reduces the time to get to an accurate outcome.

Q.5 Discuss, in details, Chi-square as independent test and Goodness-of-fit test.

Chi-Square (Χ²) Tests | Types, Formula & Examples

Published on May 23, 2022 by Shaun Turney. Revised on June 22, 2023.

A Pearson’s chi-square test is a statistical test for categorical data. It is used to
determine whether your data are significantly different from what you expected.
There are two types of Pearson’s chi-square tests:

● The chi-square goodness of fit test is used to test whether the frequency
distribution of a categorical variable is different from your expectations.
● The chi-square test of independence is used to test whether two categorical
variables are related to each other.

Chi-square is often written as Χ² and is pronounced “kai-square” (rhymes with
“eye-square”). It is also called chi-squared.

What is a chi-square test?

Pearson’s chi-square (Χ²) tests, often referred to simply as chi-square tests, are
among the most common nonparametric tests.

Note: Parametric tests can’t test hypotheses about the distribution of a
categorical variable, but they can involve a categorical variable as an independent
variable (e.g., ANOVAs).

Test hypotheses about frequency distributions

There are two types of Pearson’s chi-square tests, but they both test whether the
observed frequency distribution of a categorical variable is significantly different
from its expected frequency distribution. A frequency distribution describes how
observations are distributed between different groups.

Example: Bird species at a bird feeder

Frequency of visits by bird species at a bird feeder during a 24-hour period

Bird species        Frequency
House sparrow       15
House finch         12
Black-capped        9
Common grackle      8
European starling   8
Mourning dove       6

A chi-square test (a chi-square goodness of fit test) can test whether these
observed frequencies are significantly different from what was expected, such as
equal frequencies.

Example: Handedness and nationality

Contingency table of the handedness of a sample of Americans and Canadians

            Right-handed   Left-handed
American    236            19
Canadian    157            16

A chi-square test (a test of independence) can test whether these observed
frequencies are significantly different from the frequencies expected if handedness
is unrelated to nationality.
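
A minimal Python sketch of the test of independence for this contingency table is
given below. The observed counts come from the table above. The 0.05 significance
level and the critical value 3.841 (chi-square, 1 degree of freedom, taken from
standard tables) are assumptions added for illustration.

```python
# Observed counts: rows are American, Canadian; columns are right-, left-handed.
observed = [[236, 19], [157, 16]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if handedness is unrelated to nationality:
# (row total * column total) / grand total.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841  # assumed: alpha = 0.05, 1 degree of freedom

print(round(chi_square, 3), degrees_of_freedom, chi_square > critical_value)
```

Here the statistic is well below the critical value, so under these assumptions
the data give no evidence that handedness depends on nationality.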

The chi-square formula

Both of Pearson’s chi-square tests use the same formula to calculate the test
statistic, chi-square (Χ²):

Χ² = Σ (O − E)² / E

Where:

● Χ² is the chi-square test statistic

● Σ is the summation operator (it means “take the sum of”)

● O is the observed frequency

● E is the expected frequency

The larger the difference between the observations and the expectations (O − E in
the equation), the bigger the chi-square will be.
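As a rough illustration of the formula, the sketch below computes the chi-square statistic for the bird feeder counts. It assumes the expected frequencies are equal across species; the function name `chi_square_statistic` is made up for this example and is not from any library.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all groups."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed visits from the bird feeder example.
observed = [15, 12, 9, 8, 8, 6]

# Assumption: the null hypothesis expects equal frequencies in every group.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

Each species contributes its own (O − E)² / E term, so the species whose counts sit furthest from the expected value add the most to the total.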

When to use a chi-square test

A Pearson’s chi-square test may be an appropriate option for your data if all of the
following are true:
1. You want to test a hypothesis about one or more categorical variables. If
one or more of your variables is quantitative, you should use a
different statistical test. Alternatively, you could convert the quantitative
variable into a categorical variable by separating the observations into
intervals.

2. The sample was randomly selected from the population.

3. There are a minimum of five observations expected in each group or
combination of groups.
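The third condition can be checked before running the test. The following is a minimal sketch, again assuming equal expected frequencies for the bird feeder data; `meets_minimum_expected` is a hypothetical helper name.

```python
def meets_minimum_expected(expected, minimum=5):
    """Return True if every expected frequency is at least `minimum`."""
    return all(e >= minimum for e in expected)


# Observed visits from the bird feeder example.
observed = [15, 12, 9, 8, 8, 6]

# Assumption: equal expected frequencies under the null hypothesis.
expected = [sum(observed) / len(observed)] * len(observed)

ok = meets_minimum_expected(expected)
print(f"smallest expected frequency = {min(expected):.2f}, condition met: {ok}")
```

Note that the check applies to the expected frequencies, not the observed ones, so a group with a small observed count does not by itself rule out the test.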

Types of chi-square tests

The two types of Pearson’s chi-square tests are:

● Chi-square goodness of fit test

● Chi-square test of independence

Mathematically, these are actually the same test. However, we often think of them
as different tests because they’re used for different purposes.

Chi-square goodness of fit test

You can use a chi-square goodness of fit test when you have one categorical
variable. It allows you to test whether the frequency distribution of the categorical
variable is significantly different from your expectations.

Example: Hypotheses for chi-square goodness of fit test

Expectation of equal proportions

● Null hypothesis (H0): The bird species visit the bird feeder
in equal proportions.

● Alternative hypothesis (HA): The bird species visit the bird feeder
in different proportions.

Expectation of different proportions

● Null hypothesis (H0): The bird species visit the bird feeder in
the same proportions as the average over the past five years.

● Alternative hypothesis (HA): The bird species visit the bird feeder
in different proportions from the average over the past five years.

Chi-square test of independence

You can use a chi-square test of independence when you have two categorical
variables.

Example: Chi-square test of independence

● Null hypothesis (H0): The proportion of people who are left-handed is the
same for Americans and Canadians.

● Alternative hypothesis (HA): The proportion of people who are left-handed
differs between nationalities.
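One way to sketch the test of independence for the handedness table is shown below. The expected count for each cell is taken as row total × column total / grand total, and the degrees of freedom as (rows − 1) × (columns − 1); both are standard conventions rather than details stated in the text above, and the function names are illustrative only.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables are unrelated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: American, Canadian. Columns: right-handed, left-handed.
handedness = [[236, 19], [157, 16]]

chi2, df = chi_square_independence(handedness)
print(f"chi-square = {chi2:.3f}, df = {df}")
```

The statistic here is small because the observed counts are close to the counts expected under independence.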

Other types of chi-square tests

Some consider the chi-square test of homogeneity to be another variety of
Pearson’s chi-square test. It tests whether two populations come from the same
distribution by determining whether the two populations have the same
proportions as each other.

McNemar’s test is a test that uses the chi-square test statistic. It isn’t a variety
of Pearson’s chi-square test, but it’s closely related. You can conduct this test when
you have a related pair of categorical variables that each have two groups.

Example: McNemar’s test

Suppose that a sample of 100 people is offered two flavors of ice cream and asked
whether they like the taste of each.

Contingency table of ice cream flavor preference

Like chocolate Dislike chocolate

Like vanilla 47 32

Dislike vanilla 8 13

● Null hypothesis (H0): The proportion of people who like chocolate is the
same as the proportion of people who like vanilla.

● Alternative hypothesis (HA): The proportion of people who like
chocolate is different from the proportion of people who like vanilla.
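The text above does not give the formula for McNemar’s test. A commonly used form of the statistic (without continuity correction) depends only on the two discordant cells, here the people who like one flavor but not the other: (b − c)² / (b + c). The sketch below applies that form to the ice cream table; treat it as an assumption about which variant is intended.

```python
def mcnemar_statistic(table):
    """McNemar statistic (no continuity correction) for a 2x2 paired table.

    Only the discordant cells are used: table[0][1] and table[1][0].
    """
    b = table[0][1]
    c = table[1][0]
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (b - c) ** 2 / (b + c)


# Rows: like vanilla, dislike vanilla.
# Columns: like chocolate, dislike chocolate.
ice_cream = [[47, 32], [8, 13]]

stat = mcnemar_statistic(ice_cream)
print(f"McNemar chi-square = {stat:.1f}")
```

The 47 people who like both flavors and the 13 who dislike both do not enter the calculation, because they carry no information about a difference between the two proportions.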

There are several other types of chi-square tests that are not Pearson’s chi-square
tests, including the test of a single variance and the likelihood ratio chi-square
test.

How to perform a chi-square test

The exact procedure for performing a Pearson’s chi-square test depends on which
test you’re using, but it generally follows these steps:
1. Create a table of the observed and expected frequencies. This can
sometimes be the most difficult step because you will need to carefully
consider which expected values are most appropriate for your null
hypothesis.

2. Calculate the chi-square value from your observed and expected
frequencies using the chi-square formula.

3. Find the critical chi-square value in a chi-square critical value table or
using statistical software.

4. Compare the chi-square value to the critical value to determine
which is larger.

5. Decide whether to reject the null hypothesis. You should reject the
null hypothesis if the chi-square value is greater than the critical value. If you
reject the null hypothesis, you can conclude that your data are significantly
different from what you expected.
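The five steps can be sketched for the bird feeder example as follows. Several things here are assumptions rather than details from the text: equal expected frequencies, a significance level of 0.05, degrees of freedom equal to the number of groups minus one, and a small hard-coded excerpt of a standard critical value table in place of statistical software.

```python
# Excerpt of a chi-square critical value table at the 0.05 significance level,
# keyed by degrees of freedom.
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def goodness_of_fit(observed, expected, critical_values=CRITICAL_VALUES_05):
    """Return (chi-square, critical value, reject_null) for one categorical variable."""
    # Step 2: chi-square value from the observed and expected frequencies.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Step 3: critical value for df = number of groups - 1.
    df = len(observed) - 1
    critical = critical_values[df]
    # Steps 4 and 5: reject the null hypothesis only if chi2 exceeds it.
    return chi2, critical, chi2 > critical


# Step 1: observed and expected frequencies (equal proportions assumed).
observed = [15, 12, 9, 8, 8, 6]
expected = [sum(observed) / len(observed)] * len(observed)

chi2, critical, reject = goodness_of_fit(observed, expected)
print(f"chi-square = {chi2:.2f}, critical value = {critical}, reject H0: {reject}")
```

Under these assumptions the chi-square value is below the critical value, so the null hypothesis of equal proportions would not be rejected.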

How to report a chi-square test

If you decide to include a Pearson’s chi-square test in your research
paper, dissertation or thesis, you should report it in your results section. You can
follow these rules if you want to report statistics in APA Style:

● You don’t need to provide a reference or formula since the chi-square test is
a commonly used statistic.
● Refer to chi-square using its Greek symbol, Χ2. Although the symbol looks
very similar to an “X” from the Latin alphabet, it’s actually a different symbol.
Greek symbols should not be italicized.

Chi-Square Goodness of Fit Test

What is the Chi-square goodness of fit test?

The Chi-square goodness of fit test is a statistical hypothesis test used to
determine whether a variable is likely to come from a specified distribution
or not.

When can I use the test?

You can use the test when you have counts of values for a categorical
variable.

The Chi-square goodness of fit test checks whether your sample data is likely
to be from a specific theoretical distribution.

What do we need?

For the goodness of fit test, we need one variable. We also need an idea, or
hypothesis, about how that variable is distributed. For example:

● We have bags of candy with five flavors in each bag. The bags should contain
an equal number of pieces of each flavor. The idea we'd like to test is that
the proportions of the five flavors in each bag are the same.

Understanding results

Let’s use a few graphs to understand the test and the results.

A simple bar chart of the data shows the observed counts for the flavors of
candy.
