Chi Square Test
The Chi-Square test's application in machine learning makes a real difference in practice. Feature selection is a critical topic in machine learning: you will often have many candidate features and must choose the best ones to build the model. By examining the relationship between features and the target, the chi-square test helps solve feature selection problems. In this lesson, you will learn about the chi-square test and its applications.
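As a minimal sketch of that idea (all feature names and counts below are hypothetical), feature selection with the chi-square test amounts to ranking each categorical feature by its chi-square statistic against the class label and keeping the highest-scoring ones:

```python
def chi2_score(table):
    """Chi-square statistic for a feature-vs-label contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    grand = sum(row_tot)
    return sum((o - row_tot[i] * col_tot[j] / grand) ** 2
               / (row_tot[i] * col_tot[j] / grand)
               for i, r in enumerate(table) for j, o in enumerate(r))

# Contingency tables (feature value x class label) for two candidate features.
scores = {"gender": chi2_score([[30, 20], [10, 40]]),
          "region": chi2_score([[26, 24], [25, 25]])}
best = max(scores, key=scores.get)
print(best)  # the feature most strongly associated with the label
```

In practice, libraries such as scikit-learn wrap this ranking for you (for example, `SelectKBest` with the `chi2` score function), but the underlying computation is the same.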
The Chi-Square test is a statistical procedure for determining whether the difference between observed and expected data is significant. It helps to find out whether a difference between two categorical variables is due to chance or to a relationship between them. In other words, the goal of the test is to identify whether a disparity between actual and expected frequencies arises by chance or reflects a link between the variables under consideration. As a result, the chi-square test is an ideal choice for aiding in our understanding and interpretation of the connection between two categorical variables.
For example, a meal delivery firm in India wants to investigate the link between gender,
geography, and people's food preferences.
It is used to determine whether the difference between two categorical variables is due to chance or to a relationship between them. The chi-square statistic is calculated as:

χ²c = Σ (O − E)² / E

Where
c = Degrees of freedom
O = Observed Value
E = Expected Value
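The formula above can be computed directly. A small sketch with made-up counts (a fair-coin check: 100 flips, so 50 heads and 50 tails expected):

```python
def chi_square_statistic(observed, expected):
    """Chi-square statistic: the sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: 60 heads observed vs. 50 expected, 40 tails vs. 50.
print(chi_square_statistic([60, 40], [50, 50]))  # 4.0
```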
The expected values are the frequencies expected, based on the null hypothesis.
Null Hypothesis (H0) - The Null Hypothesis is the assumption that the event will not
occur. A null hypothesis has no bearing on the study's outcome unless it is rejected.
Alternate Hypothesis (H1 or Ha) - The Alternate Hypothesis is the logical opposite of the null hypothesis. It is accepted once the null hypothesis has been rejected.
Categorical variables belong to a subset of variables that can be divided into discrete
categories. Names or labels are the most common categories. These variables are also
known as qualitative variables because they depict the variable's quality or
characteristics.
Categorical variables can be divided into two categories: nominal variables, which have no inherent order (such as names or labels), and ordinal variables, which have a natural order (such as levels of educational attainment).
1. The Chi-squared test can be used to see if your data follows a well-known theoretical probability distribution, such as the Normal or Poisson distribution.
2. The Chi-squared test allows you to assess your trained regression model's goodness of fit on the training, validation, and test data sets.
These tests use degrees of freedom to determine if a particular null hypothesis can be
rejected based on the total number of observations made in the experiments. The larger the sample size, the more reliable the result.
There are two main types of Chi-Square tests:
1. Independence
2. Goodness-of-Fit
Independence
The Chi-Square test of independence determines whether two categorical variables are related to each other.
For Example-
In a movie theatre, suppose we made a list of movie genres. Let us consider this as the
first variable. The second variable is whether or not the people who came to watch
those genres of movies have bought snacks at the theatre. Here the null hypothesis is that the genre of the film and whether people bought snacks or not are unrelated. If this is true, the movie genres don't impact snack sales.
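The movie-theatre scenario can be sketched as a test of independence on a contingency table. The counts below are hypothetical; the expected value for each cell is (row total × column total) / grand total, and the statistic sums (O − E)² / E over all cells:

```python
# Rows = movie genres; columns = bought snacks (yes / no). Hypothetical data.
table = [[50, 30],   # action
         [20, 40],   # drama
         [30, 30]]   # comedy

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected count
        chi2 += (o - e) ** 2 / e

df = (len(table) - 1) * (len(table[0]) - 1)  # (rows - 1) * (cols - 1)
print(round(chi2, 2), df)  # 11.67 2
```

A large statistic relative to the critical value for `df` degrees of freedom would lead us to reject the null hypothesis that genre and snack purchases are unrelated.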
Goodness-Of-Fit
The Chi-Square goodness-of-fit test determines whether the observed frequency distribution of a single categorical variable matches an expected distribution.
For Example-
Suppose we have bags of balls with five different colours in each bag. The given
condition is that the bag should contain an equal number of balls of each colour. The
idea we would like to test here is that the proportions of the five colours of balls in each bag are equal.
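This goodness-of-fit check can be sketched directly (the ball counts below are hypothetical): under the null hypothesis of equal proportions, each colour's expected count is the total divided by the number of colours.

```python
# Hypothetical bag: observed counts of five ball colours.
observed = [18, 22, 20, 25, 15]
total = sum(observed)
expected = total / len(observed)  # 20 per colour under the null

chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(chi2)  # ~2.9, with 5 - 1 = 4 degrees of freedom
```

The statistic is then compared against the chi-square critical value for 4 degrees of freedom to decide whether the deviation from equal proportions is significant.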
Chi-square is most commonly used by researchers who are studying survey response
data because it applies to categorical variables. Demography, consumer and marketing
research, political science, and economics are all examples of this type of research.
Example
Let's say you want to know if gender has anything to do with political party preference.
You poll 440 voters in a simple random sample to find out which political party they
prefer. The results of the survey are shown in the table below:
The expected value for each cell is:
E = (row total × column total) / grand total
Similarly, you can calculate the expected value for each of the cells.
Now you will calculate (O − E)² / E for each cell in the table.
Where
O = Observed Value
E = Expected Value
Before you can conclude, you must first determine the critical statistic, which requires determining the degrees of freedom. The degrees of freedom here equal the table's number of rows minus one multiplied by the table's number of columns minus one, or (r − 1)(c − 1). We have (3 − 1)(2 − 1) = 2.
Finally, you compare the obtained statistic to the critical statistic found in the chi-square table. For an alpha level of 0.05 and two degrees of freedom, the critical statistic is 5.991, which is less than the obtained statistic of 9.83. You can reject the null hypothesis because the obtained statistic is higher than the critical statistic.
This means you have sufficient evidence to say that there is an association between
gender and political party preference.
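The final comparison step can be sketched in a few lines. The critical values below are a small excerpt from a standard chi-square table (alpha = 0.05); the statistic 9.83 comes from the worked example above:

```python
# Critical values from a chi-square table at alpha = 0.05, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def reject_null(statistic, df, critical=CRITICAL_05):
    """Reject H0 when the obtained statistic exceeds the critical value."""
    return statistic > critical[df]

print(reject_null(9.83, 2))  # True: 9.83 > 5.991, so reject H0
```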
When to Use a Chi-Square Test?
A Chi-Square test is used to examine whether the observed results are consistent with the expected values. When the data to be analysed come from a random sample, and when the variable in question is categorical, the Chi-Square test is the most appropriate choice. A categorical variable consists of selections such as breeds of dogs, types of cars, genres of movies, educational attainment, male vs. female, etc. Survey responses and questionnaires are the primary sources of these types of data, which makes this kind of analysis especially useful for researchers studying survey responses, in fields ranging from customer and marketing research to political science and economics.
These are, mathematically, the same test. However, because they are used for distinct goals, we generally think of them as separate tests.
Properties
1. The variance of a chi-square distribution is equal to twice its number of degrees of freedom.
There are two limitations to using the chi-square test that you should be aware of.
First, the chi-square test is extremely sensitive to sample size. Even
insignificant relationships can appear statistically significant when a large
enough sample is used. Keep in mind that "statistically significant" does not
always imply "meaningful" when using the chi-square test.
Be mindful that the chi-square can only determine whether two variables are
related. It does not necessarily follow that one variable has a causal
relationship with the other. It would require a more detailed analysis to
establish causality.
When there is only one categorical variable, the chi-square goodness of fit test can be
used. The frequency distribution of the categorical variable is evaluated for determining
whether it differs significantly from what you expected. The usual assumption is that the categories occur in equal proportions; however, this is not always the case.
SPSS
When you want to see if there is a link between two categorical variables, you perform
the chi-square test. To acquire the test statistic and its related p-value in SPSS, use the
chisq option on the statistics subcommand of the crosstabs command. Remember that the chi-square test assumes that each cell's expected value is five or greater.
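Outside SPSS, the same crosstab check can be reproduced in Python (a sketch assuming SciPy is installed; the counts are hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical crosstab: rows are groups, columns are response categories.
table = [[50, 30], [20, 40], [30, 30]]
stat, p, dof, expected = chi2_contingency(table)
print(round(stat, 2), dof)  # the test statistic and degrees of freedom
```

`chi2_contingency` also returns the table of expected counts, which is handy for verifying the "expected value of five or greater" assumption mentioned above.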
Conclusion
In this lesson, you explored the concept of the Chi-square test and how to find the related values. You also took a look at how the critical value and the chi-square value are related to each other.
FAQs
The chi-square test is a statistical test used to analyze categorical data and assess the independence or association between variables. There are two main types of chi-square tests: the test of independence and the goodness-of-fit test.
The chi-square test is a statistical tool used to check if two categorical variables are
related or independent. It helps us understand if the observed data differs significantly
from the expected data. By comparing the two datasets, we can draw conclusions
about whether the variables have a meaningful association.