Chi Square Test

CHI-SQUARE TEST

The Chi-Square test is widely used in machine learning, most notably for feature
selection. Feature selection is a critical topic in machine learning: you will usually
have many candidate features and must choose the best ones to build the model. By
examining the relationship between categorical variables, the chi-square test helps
solve feature selection problems. In this lesson, you will learn about the chi-square
test and its application.

What Is a Chi-Square Test?

The Chi-Square test is a statistical procedure for determining the difference between
observed and expected data. It can also be used to determine whether two categorical
variables in our data are associated. It helps to find out whether a difference between
two categorical variables is due to chance or to a relationship between them.

Chi-Square Test Definition

A chi-square test is a statistical test that is used to compare observed and expected
results. The goal of this test is to identify whether a disparity between actual and
predicted data is due to chance or to a link between the variables under consideration.
As a result, the chi-square test is an ideal choice for aiding in our understanding and
interpretation of the connection between our two categorical variables.

A chi-square test or comparable nonparametric test is required to test a hypothesis
regarding the distribution of a categorical variable. Categorical variables, which indicate
categories such as animals or countries, can be nominal or ordinal. They cannot have a
normal distribution since they can only have a few particular values.

For example, a meal delivery firm in India wants to investigate the link between gender,
geography, and people's food preferences.

It is used to determine whether the difference between two categorical variables is:

• A result of chance, or

• Because of a relationship between them

Formula For Chi-Square Test

$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Where

c = Degrees of freedom

O = Observed Value

E = Expected Value

The degrees of freedom in a statistical calculation represent the number of values that
are free to vary in the calculation. The degrees of freedom must be calculated to ensure
that chi-square tests are statistically valid, because they determine which chi-square
distribution the statistic is compared against. These tests are frequently used to compare
observed data with data that would be expected if a particular hypothesis were true.

The observed values are those you gather yourself.

The expected values are the frequencies expected, based on the null hypothesis.
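The formula can be sketched in a few lines of Python. This is a minimal illustration, not a full test: the function name `chi_square_statistic` and the coin-toss counts are assumptions made up for the example.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 100 coin tosses, 60 heads and 40 tails.
# Under the null hypothesis of a fair coin we expect 50 of each.
observed = [60, 40]
expected = [50, 50]

statistic = chi_square_statistic(observed, expected)
print(statistic)  # (10**2 / 50) + (10**2 / 50) = 4.0
```

A statistic of 0 would mean the observed counts match the expected counts exactly; larger values mean a larger disparity.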

Fundamentals of Hypothesis Testing

Hypothesis testing is a technique for interpreting and drawing inferences about a
population based on sample data. It helps determine which of two mutually exclusive
claims about the population is better supported by the sample data.

Null Hypothesis (H0) - The Null Hypothesis is the assumption that the event will not
occur, that is, that there is no effect or relationship. A null hypothesis has no bearing
on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

Alternate Hypothesis (H1 or Ha) - The Alternate Hypothesis is the logical opposite of the
null hypothesis. The acceptance of the alternative hypothesis follows the rejection of
the null hypothesis. H1 is the symbol for it.

What Are Categorical Variables?

Categorical variables are variables whose values fall into discrete categories. Names
or labels are the most common categories. These variables are also known as qualitative
variables because they depict the variable's quality or characteristics.

Categorical variables can be divided into two categories:

1. Nominal Variable: A nominal variable's categories have no natural ordering.
Example: Gender, Blood groups

2. Ordinal Variable: An ordinal variable's categories can be sorted. Customer
satisfaction (Excellent, Very Good, Good, Average, Bad, and so on) is an example.

Why Do You Use the Chi-Square Test?

Chi-square is a statistical test that examines the differences between categorical
variables from a random sample in order to determine whether the observed results fit
the expected results well.

Here are some of the uses of the Chi-Squared test:

• The Chi-squared test can be used to see if your data follows a well-known
theoretical probability distribution like the Normal or Poisson distribution.

• The Chi-squared test allows you to assess your trained regression model's
goodness of fit on the training, validation, and test data sets.

What Does A Chi-Square Statistic Test Tell You?

A Chi-Square test (symbolically represented as χ²) is fundamentally a data analysis
based on the observations of a random set of variables. It measures how well a model
matches the actual observed data. A Chi-Square statistic is calculated from data that
must be raw, random, mutually exclusive, drawn from independent variables, and drawn
from a large enough sample. In simple terms, two sets of statistical data are compared,
for instance the observed and expected results of tossing a fair coin. Karl Pearson
introduced this test in 1900 for categorical data analysis and distribution. This test is
also known as ‘Pearson’s Chi-Squared Test’.

Chi-Squared Tests are most commonly used in hypothesis testing. A hypothesis is an
assumption that a given condition might be true, which can be tested afterwards. The
Chi-Square test estimates the size of the inconsistency between the expected results and
the actual results, given the size of the sample and the number of variables in the
relationship.

These tests use degrees of freedom to determine if a particular null hypothesis can be
rejected based on the total number of observations made in the experiments. The larger
the sample size, the more reliable the result.

There are two main types of Chi-Square tests:

1. Independence

2. Goodness-of-Fit

Independence

The Chi-Square Test of Independence is an inferential statistical test which examines
whether two variables are likely to be related to each other or not. This test is used
when we have counts of values for two nominal or categorical variables and is
considered a non-parametric test. A relatively large sample size and independence of
observations are the required criteria for conducting this test.

For example:

In a movie theatre, suppose we made a list of movie genres. Let us consider this as the
first variable. The second variable is whether or not the people who came to watch
those genres of movies bought snacks at the theatre. Here the null hypothesis is
that the genre of the film and whether people bought snacks are unrelated. If this
is true, the movie genres don’t impact snack sales.
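Before the test can be run, the raw observations have to be counted into a contingency table. The sketch below shows one way to do that with the standard library; the records are invented for illustration and are far too few for a real test.

```python
from collections import Counter

# Hypothetical raw observations: one (genre, bought_snacks) pair per customer.
records = [
    ("Action", "Yes"), ("Action", "Yes"), ("Action", "No"),
    ("Comedy", "Yes"), ("Comedy", "No"), ("Comedy", "No"),
    ("Horror", "Yes"), ("Horror", "Yes"), ("Horror", "Yes"),
]

cell_counts = Counter(records)
genres = sorted({genre for genre, _ in records})
snack_answers = ["Yes", "No"]

# Rows are genres, columns are snack answers.
table = [[cell_counts[(g, s)] for s in snack_answers] for g in genres]

for genre, row in zip(genres, table):
    print(genre, row)
```

Each row of `table` holds the counts for one genre, which is the layout the test of independence works on.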

Goodness-Of-Fit

In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines
whether a variable is likely to come from a given distribution or not. We must have a set
of data values and an idea of the distribution of this data. We can use this test when we
have value counts for categorical variables. This test provides a way of deciding if
the data values have a “good enough” fit for our idea, that is, whether the sample is
representative of the entire population.

For example:

Suppose we have bags of balls with five different colours in each bag. The given
condition is that the bag should contain an equal number of balls of each colour. The
idea we would like to test here is that the proportions of the five colours of balls in each
bag are equal.
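As a rough sketch of the ball example, the snippet below computes the goodness-of-fit statistic against equal expected proportions. The counts are hypothetical, and the critical value 9.488 is the standard chi-square table entry for a significance level of 0.05 with 4 degrees of freedom (five categories minus one).

```python
def goodness_of_fit_statistic(observed_counts):
    """Chi-square statistic against equal expected proportions."""
    total = sum(observed_counts)
    expected = total / len(observed_counts)
    return sum((o - expected) ** 2 / expected for o in observed_counts)


# Hypothetical counts of the five colours in one bag of 100 balls.
observed_counts = [22, 18, 20, 25, 15]

statistic = goodness_of_fit_statistic(observed_counts)
degrees_of_freedom = len(observed_counts) - 1

# Critical value for alpha = 0.05 and 4 degrees of freedom,
# taken from a standard chi-square table.
CRITICAL_VALUE = 9.488

print(round(statistic, 3))         # 2.9
print(statistic > CRITICAL_VALUE)  # False: do not reject equal proportions
```

With these made-up counts the statistic stays below the critical value, so the data give no reason to doubt that the colours are equally represented.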

Who Uses Chi-Square Analysis?

Chi-square is most commonly used by researchers who are studying survey response
data because it applies to categorical variables. Demography, consumer and marketing
research, political science, and economics are all examples of this type of research.

Example

Let's say you want to know if gender has anything to do with political party preference.
You poll 440 voters in a simple random sample to find out which political party they
prefer. The results of the survey are shown in the table below:

To see if gender is linked to political party preference, perform a Chi-Square test of
independence using the steps below.

Step 1: Define the Hypothesis

H0: There is no link between gender and political party preference.

H1: There is a link between gender and political party preference.

Step 2: Calculate the Expected Values

Now you will calculate the expected frequency.

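For example, the expected value for Male Republicans is the Male row total multiplied
by the Republican column total, divided by the total number of respondents:

$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$

Similarly, you can calculate the expected value for each of the cells.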
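The expected-value rule (row total times column total, divided by the grand total) can be sketched as follows. The table is a hypothetical 2 x 3 example rather than the survey data, and `expected_counts` is a name chosen for this illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [r * c / grand_total for c in column_totals]
        for r in row_totals
    ]


# Hypothetical 2 x 3 table of observed counts (not the survey data).
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

for row in expected_counts(observed):
    print(row)  # [15.0, 20.0, 25.0] for both rows
```

Both rows have the same total here, so independence implies identical expected counts in each row.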

Step 3: Calculate (O - E)² / E for Each Cell in the Table

Now you will calculate (O - E)² / E for each cell in the table.

Where

O = Observed Value

E = Expected Value
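A minimal sketch of this step, assuming the observed and expected tables have the same shape. The numbers are hypothetical and are not the survey data.

```python
def cell_contributions(observed, expected):
    """Return (O - E)^2 / E for every cell of two same-shaped tables."""
    return [
        [(o - e) ** 2 / e for o, e in zip(observed_row, expected_row)]
        for observed_row, expected_row in zip(observed, expected)
    ]


# Hypothetical observed counts and their expected counts under independence.
observed = [[10, 20, 30], [20, 20, 20]]
expected = [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]

contributions = cell_contributions(observed, expected)
for row in contributions:
    print([round(value, 3) for value in row])  # [1.667, 0.0, 1.0] for both rows
```

Cells where the observed count equals the expected count contribute nothing to the statistic.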

Step 4: Calculate the Test Statistic χ²

χ² is the sum of all the values in the last table:

χ² = 0.743 + 2.05 + 2.33 + 3.33 + 0.384 + 1 = 9.837
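The sum can be checked quickly in Python using the six per-cell values listed above; the variable names are chosen for this illustration.

```python
import math

# Per-cell values of (O - E)^2 / E as listed in the text.
cell_contributions = [0.743, 2.05, 2.33, 3.33, 0.384, 1]

# math.fsum avoids accumulating floating-point rounding error.
chi_square = math.fsum(cell_contributions)
print(round(chi_square, 3))  # 9.837
```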

Before you can draw a conclusion, you must first determine the critical value, which
requires the degrees of freedom. The degrees of freedom in this case are equal to
the table's number of rows minus one multiplied by the table's number of columns
minus one, or (r-1)(c-1). For this table, that gives (3-1)(2-1) = 2.

Finally, you compare the obtained statistic to the critical value found in the chi-square
table. For an alpha level of 0.05 and two degrees of freedom, the critical value is
5.991, which is less than the obtained statistic of 9.837. You can reject the null
hypothesis because the obtained statistic is higher than the critical value.

This means you have sufficient evidence to say that there is an association between
gender and political party preference.
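The decision rule can be sketched without a lookup table for this particular case. The shortcut relies on a fact that holds only for 2 degrees of freedom: the probability of a chi-square value exceeding x is exp(-x / 2). For other degrees of freedom you would need a chi-square table or a statistics library.

```python
import math

chi_square = 9.837    # test statistic from Step 4
rows, columns = 3, 2  # table dimensions as used in the text: (3-1)(2-1)
alpha = 0.05

degrees_of_freedom = (rows - 1) * (columns - 1)

# For exactly 2 degrees of freedom the chi-square survival function is
# exp(-x / 2), so the critical value and p-value have closed forms.
# This shortcut does not hold for other degrees of freedom.
assert degrees_of_freedom == 2
critical_value = -2 * math.log(alpha)
p_value = math.exp(-chi_square / 2)

reject_null = chi_square > critical_value
print(round(critical_value, 3))  # 5.991
print(reject_null)               # True
```

The p-value computed this way is below 0.05, which agrees with the comparison against the critical value.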

When to Use a Chi-Square Test?

A Chi-Square test is used to examine whether the observed results are consistent with
the expected values. When the data to be analysed come from a random sample and the
variable in question is categorical, the Chi-Square test is the most appropriate choice.
A categorical variable consists of selections such as breeds of dogs, types of cars,
genres of movies, educational attainment, male vs. female, etc. Survey responses and
questionnaires are the primary sources of these types of data, so this type of analysis
is helpful for researchers who are studying survey response data. The research can
range from customer and marketing research to political science and economics.

Types of Chi-square Tests

Pearson's chi-square tests are classified into two types:

1. Chi-square goodness-of-fit analysis

2. Chi-square independence test

Mathematically, these are the same test. However, because they are used for
distinct purposes, we generally think of them as separate tests.

Properties

The chi-square distribution, on which the test is based, has the following significant
properties:

1. The variance is equal to twice the number of degrees of freedom.

2. The chi-square distribution curve approaches the normal distribution as the
degrees of freedom increase.

3. The mean of the distribution is equal to the number of degrees of freedom.
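The mean and variance properties can be checked with a small simulation. The sketch relies on the standard definition of a chi-square variable with k degrees of freedom as the sum of k squared independent standard normal values. The seed, k = 4 and the number of draws are arbitrary choices for the illustration.

```python
import random
import statistics


def chi_square_sample(degrees_of_freedom, rng):
    """One draw: the sum of k squared standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(degrees_of_freedom))


rng = random.Random(0)  # fixed seed so the run is repeatable
k = 4
draws = [chi_square_sample(k, rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_variance = statistics.variance(draws)

print(round(sample_mean, 2))      # close to k = 4
print(round(sample_variance, 2))  # close to 2k = 8
```

The simulated values only approximate the theoretical ones, and the match improves as the number of draws grows.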

Limitations of Chi-Square Test

There are two limitations to using the chi-square test that you should be aware of.

• The chi-square test is extremely sensitive to sample size. Even
insignificant relationships can appear statistically significant when a large
enough sample is used. Keep in mind that "statistically significant" does not
always imply "meaningful" when using the chi-square test.

• Be mindful that the chi-square test can only determine whether two variables are
related. It does not necessarily follow that one variable has a causal
relationship with the other. It would require a more detailed analysis to
establish causality.
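The sample-size sensitivity can be illustrated with a hypothetical 2 x 2 table: multiplying every count by 10 leaves the proportions unchanged but multiplies the statistic by 10. The critical value 3.841 is the standard table entry for a significance level of 0.05 with 1 degree of freedom.

```python
def chi_square_independence(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical 2 x 2 table and the same proportions with 10x the sample.
small = [[55, 45], [45, 55]]
large = [[10 * count for count in row] for row in small]

CRITICAL_VALUE = 3.841  # alpha = 0.05, 1 degree of freedom (standard table)

small_statistic = chi_square_independence(small)
large_statistic = chi_square_independence(large)
print(small_statistic, small_statistic > CRITICAL_VALUE)  # 2.0 False
print(large_statistic, large_statistic > CRITICAL_VALUE)  # 20.0 True
```

The same 55/45 split is not significant with 200 observations but is significant with 2,000, which is why significance alone says little about how strong a relationship is.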

Chi-Square Goodness of Fit Test

When there is only one categorical variable, the chi-square goodness of fit test can be
used. The frequency distribution of the categorical variable is evaluated to determine
whether it differs significantly from what you expected. Often the expectation is that
the categories will have equal proportions; however, this is not always the case.

SPSS

When you want to see if there is a link between two categorical variables, you perform
the chi-square test. To acquire the test statistic and its related p-value in SPSS, use the
chisq option on the statistics subcommand of the crosstabs command. Remember that
the chi-square test assumes that each cell's expected value is five or greater.
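The expected-value assumption can be checked before running the test. The sketch below is written in Python rather than SPSS syntax, and the table and function name are hypothetical.

```python
def smallest_expected_count(table):
    """Smallest expected cell count under independence."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return min(
        r * c / grand_total for r in row_totals for c in column_totals
    )


# Hypothetical table with one sparse column.
observed = [[30, 2], [40, 3]]

minimum = smallest_expected_count(observed)
print(round(minimum, 2))  # 2.13
print(minimum >= 5)       # False: the assumption is not met
```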

Conclusion

In this lesson, you explored the concept of the Chi-square test and how to find the
related values. You also looked at how the critical value and the chi-square value are
related to each other.

FAQs

1) What is the chi-square test used for?

The chi-square test is a statistical method used to determine if there is a significant
association between two categorical variables. It helps researchers understand whether
the observed distribution of data differs from the expected distribution, allowing them
to assess whether any relationship exists between the variables being studied.

2) What is the chi-square test and its types?

The chi-square test is a statistical test used to analyze categorical data and assess the
independence or association between variables. There are two main types of chi-square
tests:

a) Chi-square test of independence: This test determines whether there is a significant
association between two categorical variables.

b) Chi-square goodness-of-fit test: This test compares the observed data to the
expected data to assess how well the observed data fit the expected distribution.

3) How can the chi-square test be explained simply?

The chi-square test is a statistical tool used to check if two categorical variables are
related or independent. It helps us understand if the observed data differs significantly
from the expected data. By comparing the two datasets, we can draw conclusions
about whether the variables have a meaningful association.
