Statistics
Examples:
Descriptive statistics deals with the collection, organization, analysis, interpretation, and presentation of data. It focuses on summarizing and describing the main features of a set of data, without making inferences or predictions about the larger population.
Inferential statistics deals with making conclusions and predictions about a population based on a sample. It involves the use of probability theory to estimate the likelihood of certain events occurring, hypothesis testing to determine if a certain claim about a population is supported by the data, and regression analysis to examine the relationships between variables.
Examples
1. Sample Size
2. Random
3. Representative
Parameter Vs Statistics
Inferential statistics is a branch of statistics that deals with making inferences or predictions
about a larger population based on a sample of data. It involves using statistical techniques to
test hypotheses and draw conclusions from data. Some of the topics that come under
inferential statistics are:
1. Hypothesis testing: This involves testing a hypothesis about a population parameter based
on a sample of data. For example, testing whether the mean height of a population is
different from a given value.
2. Confidence intervals: This involves estimating the range of values that a population
parameter could take based on a sample of data. For example, estimating the population
mean height within a given confidence level.
3. Analysis of variance (ANOVA): This involves comparing means across multiple groups to
determine if there are any significant differences. For example, comparing the mean
height of individuals from different regions.
4. Regression analysis: This involves modelling the relationship between a dependent
variable and one or more independent variables. For example, predicting the sales of a
product based on advertising expenditure.
5. Chi-square tests: This involves testing the independence or association between two
categorical variables. For example, testing whether gender and occupation are
independent variables.
6. Sampling techniques: This involves ensuring that the sample of data is representative of
the population. For example, using random sampling to select individuals from a
population.
7. Bayesian statistics: This is an alternative approach to statistical inference that involves
updating beliefs about the probability of an event based on new evidence. For example,
updating the probability of a disease given a positive test result.
Mean: The mean is the sum of all values in the dataset divided by the number of
values.
Median: The median is the middle value in the dataset when the data is arranged
in order.
Mode: The mode is the value that appears most frequently in the dataset.
Weighted Mean: The weighted mean is the sum of the products of each value and
its weight, divided by the sum of the weights. It is used to calculate a mean when
the values in the dataset have different importance or frequency.
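A minimal sketch of these measures in Python (the values and weights below are purely illustrative, not from the notes):
import numpy as np
from statistics import mode

data = [3, 2, 1, 5, 4, 4]            # illustrative values
weights = [1, 2, 1, 3, 2, 1]         # illustrative weights

mean = np.mean(data)                               # sum of values / number of values
median = np.median(data)                           # middle value of the sorted data
most_frequent = mode(data)                         # most frequently occurring value
weighted_mean = np.average(data, weights=weights)  # sum(value * weight) / sum(weights)

print(mean, median, most_frequent, weighted_mean)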
Range: The range is the difference between the maximum and minimum values in
the dataset. It is a simple measure of dispersion that is easy to calculate but can be
affected by outliers.
Variance: The variance is the average of the squared differences between each
data point and the mean. It measures the average distance of each data point
from the mean and is useful in comparing the dispersion of datasets with different
means.
For the values {3, 2, 1, 5, 4}, the mean is 3:
X    X − mean    (X − mean)²
3    0           0
2    −1          1
1    −2          4
5    2           4
4    1           1
Sum of squared differences = 10, so variance = 10 / 5 = 2.
Standard Deviation: The standard deviation is the square root of the variance. It is
a widely used measure of dispersion that is useful in describing the shape of a
distribution.
Coefficient of Variation (CV): The CV is the ratio of the standard deviation to the
mean expressed as a percentage. It is used to compare the variability of datasets
with different means and is commonly used in fields such as biology, chemistry,
and engineering.
The coefficient of variation (CV) is a statistical measure that expresses the amount
of variability in a dataset relative to the mean. It is a dimensionless quantity that is
expressed as a percentage.
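A minimal sketch of these dispersion measures in Python, using the same five values as the variance table above:
import numpy as np

data = [3, 2, 1, 5, 4]

data_range = np.max(data) - np.min(data)   # Range = max - min
variance = np.var(data)                    # population variance (divides by n), here 2
std_dev = np.std(data)                     # standard deviation = sqrt(variance)
cv = std_dev / np.mean(data) * 100         # coefficient of variation, as a percentage

print(data_range, variance, std_dev, cv)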
A frequency distribution table is a table that summarizes the number of times (or
frequency) that each value occurs in a dataset.
Let's say we have a survey of 200 people and we ask them about their favourite
type of vacation, which could be one of six categories: Beach, City, Adventure,
Nature, Cruise, or Other
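A minimal sketch of building such a frequency table with pandas; the survey responses below are hypothetical stand-ins for the 200 answers:
import pandas as pd

responses = pd.Series(["Beach", "City", "Beach", "Adventure", "Cruise",
                       "Nature", "Beach", "Other", "City", "Beach"])

freq_table = responses.value_counts()               # count of each category
rel_freq = responses.value_counts(normalize=True)   # relative frequency (proportions)
print(freq_table)
print(rel_freq)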
Shapes of Histogram
Contingency Table/Crosstab
Scatter Plot
Quantiles are statistical measures used to divide a set of numerical data into
equal-sized groups, with each group containing an equal number of observations.
Quantiles are important measures of variability and can be used to understand the
distribution of data and to summarize and compare different datasets. They can also be
used to identify outliers.
a. Quartiles: Divide the data into four equal parts, Q1 (25th percentile), Q2
(50th percentile or median), and Q3 (75th percentile).
b. Deciles: Divide the data into ten equal parts, D1 (10th percentile), D2
(20th percentile), ..., D9 (90th percentile).
c. Percentiles: Divide the data into 100 equal parts, P1 (1st percentile), P2
(2nd percentile), ..., P99 (99th percentile).
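A minimal sketch of computing quartiles, deciles, and a percentile with NumPy (the dataset of the numbers 1 to 100 is illustrative):
import numpy as np

data = np.arange(1, 101)

q1, q2, q3 = np.percentile(data, [25, 50, 75])           # quartiles
deciles = np.percentile(data, np.arange(10, 100, 10))    # D1 ... D9
p90 = np.percentile(data, 90)                            # a single percentile

print(q1, q2, q3)
print(deciles)
print(p90)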
Percentile
A percentile is a statistical measure that represents the percentage of observations in a
dataset that fall below a particular value. For example, the 75th percentile is the value below
which 75% of the observations in the dataset fall.
PL = (p / 100) × (n + 1)   (one common convention for the percentile location)
where:
p = the desired percentile
n = the number of observations in the dataset
Example:
1. Minimum: The smallest value in the dataset.
2. First quartile (Q1): The value that separates the lowest 25% of the data from
the rest of the dataset.
3. Median (Q2): The value that separates the lowest 50% from the highest 50%
of the data.
4. Third quartile (Q3): The value that separates the lowest 75% of the data from
the highest 25% of the data.
5. Maximum: The largest value in the dataset.
The five-number summary is often represented visually using a box plot, which
displays the range of the dataset, the median, and the quartiles.
The five-number summary is a useful way to quickly summarize the central
tendency, variability, and distribution of a dataset.
Interquartile Range
The interquartile range (IQR) is a measure of variability that is based on the five-number
summary of a dataset. Specifically, the IQR is defined as the difference between the third
quartile (Q3) and the first quartile (Q1) of a dataset.
1. What is a boxplot
A box plot, also known as a box-and-whisker plot, is a graphical representation of a
dataset that shows the distribution of the data. The box plot displays a summary of the
data, including the minimum and maximum values, the first quartile (Q1), the median
(Q2), and the third quartile (Q3).
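A small sketch tying the five-number summary, the IQR, and the usual 1.5 × IQR outlier fences together (the data, including the outlier 40, is illustrative):
import numpy as np
import matplotlib.pyplot as plt

data = [12, 15, 18, 16, 20, 17, 14, 22, 19, 21, 23, 18, 25, 40]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr          # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr          # values above this are flagged as outliers

print(min(data), q1, q2, q3, max(data))   # the five-number summary
print(iqr, lower_fence, upper_fence)

plt.boxplot(data)                     # box plot of the same data
plt.show()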
If the covariance between two variables is positive, it means that the variables tend to
move together in the same direction. If the covariance is negative, it means that the
variables tend to move in opposite directions. A covariance of zero indicates that the
variables are not linearly related.
• How is it calculated?
Sample covariance: cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
2. What is correlation?
Correlation refers to a statistical relationship between two or more variables.
Specifically, it measures the degree to which two variables are related and
how they tend to change together.
The phrase "correlation does not imply causation" means that just because
two variables are associated with each other, it does not necessarily mean that
one causes the other. In other words, a correlation between two variables
does not necessarily imply that one variable is the reason for the other
variable's behaviour.
Thus, while correlations can provide valuable insights into how different
variables are related, they cannot be used to establish causality. Establishing
causality often requires additional evidence such as experiments, randomized
controlled trials, or well-designed observational studies.
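A minimal sketch of computing covariance and correlation with NumPy (the two variables below are illustrative, e.g. advertising spend vs sales):
import numpy as np

x = [1, 2, 3, 4, 5]                 # illustrative advertising spend
y = [2, 4, 5, 4, 6]                 # illustrative sales

cov_xy = np.cov(x, y)[0, 1]         # sample covariance between x and y
corr_xy = np.corrcoef(x, y)[0, 1]   # Pearson correlation, always between -1 and 1

print(cov_xy, corr_xy)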
1. 3D Scatter Plots
2. Hue Parameter
3. Facetgrids
5. Pairplots
In many scenarios, the number of outcomes can be much larger and hence a table would
be tedious to write down. Worse still, the number of possible outcomes could be infinite,
in which case, good luck writing a table for that.
Solution - Function?
What if we use a mathematical function to model the relationship between outcome and
probability?
Note - A lot of the time, the terms Probability Distribution and Probability Distribution Function are used interchangeably.
A note on Parameters
Parameters in probability distributions are numerical values that determine the shape,
location, and scale of the distribution.
Different probability distributions have different sets of parameters that determine their
shape and characteristics, and understanding these parameters is essential in statistical
analysis and inference.
The PMF of a discrete random variable assigns a probability to each possible value
of the random variable. The probabilities assigned by the PMF must satisfy two
conditions:
1. Each assigned probability must be between 0 and 1 (inclusive).
2. The probabilities over all possible values must sum to 1.
Examples
https://en.wikipedia.org/wiki/Bernoulli_distribution
https://en.wikipedia.org/wiki/Binomial_distribution
The cumulative distribution function (CDF) F(x) describes the probability that a
random variable X with a given probability distribution will be found at a value less
than or equal to x
Examples:
https://en.wikipedia.org/wiki/Bernoulli_distribution
https://en.wikipedia.org/wiki/Binomial_distribution
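A small sketch of evaluating the Bernoulli/Binomial PMF and CDF with scipy.stats (the parameter values are illustrative):
from scipy.stats import bernoulli, binom

p = 0.3                                    # illustrative success probability

print(bernoulli.pmf(1, p))                 # P(X = 1) for a single Bernoulli trial
print(binom.pmf(4, n=10, p=p))             # P(exactly 4 successes in 10 trials)
print(binom.cdf(4, n=10, p=p))             # P(at most 4 successes) - the CDF at 4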
There are various methods for density estimation, including parametric and non-
parametric approaches. Parametric methods assume that the data follows a
specific probability distribution (such as a normal distribution), while non-
parametric methods do not make any assumptions about the distribution and
instead estimate it directly from the data.
But sometimes the distribution is not clear or it's not one of the famous distributions.
The KDE technique involves using a kernel function to smooth out the data and create a
continuous estimate of the underlying density function.
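A minimal KDE sketch using scipy's gaussian_kde on simulated data:
import numpy as np
from scipy.stats import gaussian_kde

np.random.seed(0)
sample = np.random.normal(loc=0, scale=1, size=500)   # simulated observations

kde = gaussian_kde(sample)        # kernel density estimate built from the sample
grid = np.linspace(-4, 4, 9)
print(kde(grid))                  # estimated density evaluated at a few points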
-> Tail
-> Asymptotic in nature
-> Lots of points near the mean and very few far away
The normal distribution is characterized by two parameters: the mean (μ) and the
standard deviation (σ). The mean represents the centre of the distribution, while the
standard deviation represents the spread of the distribution.
Denoted as: X ~ N(μ, σ²)
Why is it so important?
https://samp-suman-normal-dist-visualize-app-lkntug.streamlit.app/
Equation:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Suppose the heights of adult males in a certain population follow a normal distribution
with a mean of 68 inches and a standard deviation of 3 inches. What is the probability
that a randomly selected adult male from this population is taller than 72 inches?
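A quick way to check this example numerically (a sketch using scipy.stats; the same answer can be read from a z-table):
from scipy.stats import norm

mu, sigma = 68, 3
z = (72 - mu) / sigma                              # z-score of 72 inches, about 1.33
p_taller = 1 - norm.cdf(72, loc=mu, scale=sigma)   # P(X > 72)
print(z, p_taller)                                 # roughly 1.33 and 0.09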
A z-table tells you the area underneath a normal distribution curve, to the left of the z-
score
https://www.ztable.net/
For a normal distribution X ~ N(μ, σ), what percentage of the population lies within 1, 2, and 3 standard deviations of the mean?
1. Symmetricity
The normal distribution is symmetric about its mean, which means that the probability of
observing a value above the mean is the same as the probability of observing a value below
the mean. The bell-shaped curve of the normal distribution reflects this symmetry.
3. Empirical Rule
The normal distribution has a well-known empirical rule, also called the 68-95-99.7 rule,
which states that approximately 68% of the data falls within one standard deviation of the
mean, about 95% of the data falls within two standard deviations of the mean, and about
99.7% of the data falls within three standard deviations of the mean.
• What is skewness?
A normal distribution is a bell-shaped, symmetrical distribution with a specific
mathematical formula that describes how the data is spread out. Skewness indicates that
the data is not symmetrical, which means it is not normally distributed.
In a symmetrical distribution, the mean, median, and mode are all equal. In contrast, in a
skewed distribution, the mean, median, and mode are not equal, and the distribution
tends to have a longer tail on one side than the other.
Skewness can be positive, negative, or zero. A positive skewness means that the tail of
the distribution is longer on the right side, while a negative skewness means that the tail
is longer on the left side. A zero skewness indicates a perfectly symmetrical distribution.
The greater the skew, the greater the distance between the mean, the median, and the mode.
• Python Example
• Interpretation
• Outlier detection
• Assumptions on data for ML algorithms -> Linear Regression and GMM
• Hypothesis Testing
• Central Limit Theorem
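Following the "Python Example" bullet above, a minimal sketch of measuring skewness on simulated data with scipy:
import numpy as np
from scipy.stats import skew

np.random.seed(0)
right_skewed = np.random.exponential(scale=2, size=1000)   # long right tail
symmetric = np.random.normal(size=1000)

print(skew(right_skewed))   # clearly positive (right-skewed)
print(skew(symmetric))      # close to 0 (roughly symmetric)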
• What is Kurtosis?
Kurtosis is the 4th statistical moment. In probability theory and statistics, kurtosis
(meaning "curved, arching") is a measure of the "tailedness" of the probability
distribution of a real-valued random variable. Like skewness, kurtosis describes a
particular aspect of a probability distribution.
• Formula
Kurtosis is the fourth standardized moment: Kurt[X] = E[((X − μ) / σ)⁴]. Excess kurtosis = Kurt[X] − 3, so the normal distribution has an excess kurtosis of 0.
• Practical Use-case
In finance, kurtosis risk refers to the risk associated with the possibility of extreme
outcomes or "fat tails" in the distribution of returns of a particular asset or portfolio.
Types of Kurtosis
Leptokurtic
Distributions with positive excess kurtosis are called leptokurtic; they have heavier tails than the normal distribution.
Example - Assets with positive excess kurtosis are riskier and more volatile than those
with a normal distribution, and they may experience sudden price movements that
can result in significant gains or losses.
Platykurtic
Distributions with negative excess kurtosis are called platykurtic; they have lighter tails than the normal distribution.
Assets with negative excess kurtosis are less risky and less volatile than those with a
normal distribution, and they may experience more gradual price movements that
are less likely to result in large gains or losses.
Mesokurtic
Distributions with zero excess kurtosis are called mesokurtic. The most prominent
example of a mesokurtic distribution is the normal distribution family, regardless of
the values of its parameters.
Example -
In finance, a mesokurtic distribution is considered to be the ideal distribution for
assets or portfolios, as it represents a balance between risk and return.
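A small sketch comparing the excess kurtosis of simulated normal and heavy-tailed data (scipy's kurtosis returns excess kurtosis by default):
import numpy as np
from scipy.stats import kurtosis

np.random.seed(0)
normal_data = np.random.normal(size=10_000)
heavy_tailed = np.random.standard_t(df=3, size=10_000)   # t with few df has fat tails

print(kurtosis(normal_data))    # mesokurtic, near 0
print(kurtosis(heavy_tailed))   # leptokurtic, clearly positive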
○ QQ Plot: Another way to check for normality is to create a normal probability plot
(also known as a Q-Q plot) of the data. A normal probability plot plots the observed
data against the expected values of a normal distribution. If the data points fall along
a straight line, the distribution is likely to be normal.
○ Statistical tests: There are several statistical tests that can be used to test for
normality, such as the Shapiro-Wilk test, the Anderson-Darling test, and the
Kolmogorov-Smirnov test. These tests compare the observed data to the expected
values of a normal distribution and provide a p-value that indicates whether the data
is likely to be normal or not. A p-value less than the significance level (usually 0.05)
suggests that the data is not normal.
In a QQ plot, the quantiles of the two sets of data are plotted against each other. The
quantiles of one set of data are plotted on the x-axis, while the quantiles of the other
set of data are plotted on the y-axis. If the two sets of data have the same
distribution, the points on the QQ plot will fall on a straight line. If the two sets of
data do not have the same distribution, the points will deviate from the straight line.
• Python example
• https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html
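A minimal sketch using the statsmodels qqplot function linked above, applied to simulated data:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(loc=50, scale=5, size=200)   # simulated, roughly normal data

sm.qqplot(data, line="45", fit=True)   # sample quantiles vs a fitted normal distribution
plt.show()                             # points near the 45-degree line suggest normality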
Types
Denoted as
• Examples
a. The height of a person randomly selected from a group of individuals whose heights
range from 5'6" to 6'0" would follow a continuous uniform distribution.
b. The time it takes for a machine to produce a product, where the production time
ranges from 5 to 10 minutes, would follow a continuous uniform distribution.
c. The distance that a randomly selected car travels on a tank of gas, where the
distance ranges from 300 to 400 miles, would follow a continuous uniform
distribution.
d. The weight of a randomly selected apple from a basket of apples that weighs
between 100 and 200 grams, would follow a continuous uniform distribution.
https://en.wikipedia.org/wiki/Continuous_uniform_distribution
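A small sketch of the production-time example (b) with scipy's continuous uniform distribution:
from scipy.stats import uniform

# production time uniformly distributed between 5 and 10 minutes (loc=5, scale=10-5)
prod_time = uniform(loc=5, scale=5)

print(prod_time.pdf(7))        # density is flat: 1 / (10 - 5) = 0.2 everywhere in [5, 10]
print(prod_time.cdf(7))        # P(time <= 7) = (7 - 5) / 5 = 0.4
print(prod_time.rvs(size=3))   # a few random production times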
• Skewness
b. Sampling: Uniform distribution can also be used for sampling. For example, if you
have a dataset with an equal number of samples from each class, you can use
uniform distribution to randomly select a subset of the data that is representative of
all the classes.
c. Data augmentation: In some cases, you may want to artificially increase the size of
your dataset by generating new examples that are similar to the original data.
Uniform distribution can be used to generate new data points that are within a
specified range of the original data.
Examples
Denoted as
PDF Equation
CDF
Skewness
Pareto Distribution
The Pareto distribution is a type of probability distribution that is commonly used to model the
distribution of wealth, income, and other quantities that exhibit a similar power-law behaviour
In mathematics, a power law is a functional relationship between two variables, where one
variable is proportional to a power of the other. Specifically, if y and x are two variables
related by a power law, then the relationship can be written as:
y = k * x^a
Vilfredo Pareto originally used this distribution to describe the allocation of wealth among
individuals since it seemed to show rather well the way that a larger portion of the wealth of
any society is owned by a smaller percentage of the people in that society. He also used it to
describe distribution of income. This idea is sometimes expressed more simply as the Pareto
principle or the "80-20 rule" which says that 20% of the population controls 80% of the wealth
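A rough simulation sketch of this power-law behaviour using scipy's Pareto distribution; the shape value 1.16 is chosen because it roughly reproduces the 80-20 split:
import numpy as np
from scipy.stats import pareto

alpha = 1.16                          # shape (tail) parameter
samples = pareto.rvs(alpha, size=10_000, random_state=0)

# share of the total held by the top 20% of observations
top_20_share = np.sort(samples)[-2000:].sum() / samples.sum()
print(top_20_share)   # often lands near 0.8, though the heavy tail makes single runs noisy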
Examples
CDF
Population Vs Sample
Population: A population is the entire group or set of individuals, objects, or events that a
researcher wants to study or draw conclusions about. It can be people, animals, plants, or
even inanimate objects, depending on the context of the study. The population usually
represents the complete set of possible data points or observations.
Sample: A sample is a subset of the population that is selected for study. It is a smaller group
that is intended to be representative of the larger population. Researchers collect data from
the sample and use it to make inferences about the population as a whole. Since it is often
impractical or impossible to collect data from every member of a population, samples are used
as an efficient and cost-effective way to gather information.
Parameter Vs Estimate
Parameter: A parameter is a numerical value that describes a characteristic of a population.
Parameters are usually denoted using Greek letters, such as μ (mu) for the population mean or
σ (sigma) for the population standard deviation. Since it is often difficult or impossible to
obtain data from an entire population, parameters are usually unknown and must be
estimated based on available sample data.
Estimate: An estimate is a numerical value calculated from sample data that is used to approximate an unknown population parameter. For example, the sample mean is used as an estimate of the population mean.
Inferential Statistics
Inferential statistics is a branch of statistics that focuses on making predictions, estimations, or
generalizations about a larger population based on a sample of data taken from that
population. It involves the use of probability theory to make inferences and draw conclusions
about the characteristics of a population by analysing a smaller subset or sample.
The key idea behind inferential statistics is that it is often impractical or impossible to collect
data from every member of a population, so instead, we use a representative sample to make
inferences about the entire group. Inferential statistical techniques include hypothesis testing,
confidence intervals, and regression analysis, among others.
Inferential statistics are widely used in various fields, such as economics, social sciences,
medicine, and natural sciences, to make informed decisions and guide policy based on limited
data.
A point estimate is a single value, calculated from a sample, that serves as the best guess or
approximation for an unknown population parameter, such as the mean or standard
deviation. Point estimates are often used in statistics when we want to make inferences about
a population based on a sample.
Confidence interval, in simple words, is a range of values within which we expect a particular
population parameter, like a mean, to fall. It's a way to express the uncertainty around an
estimate obtained from a sample of data.
Confidence level, usually expressed as a percentage like 95%, indicates how sure we are that
the true value lies within the interval.
A confidence interval is created for parameters, not statistics. Statistics help us construct the confidence interval for a parameter.
Examples of CI usage
Assumptions
1. Random sampling: The data must be collected using a random sampling method to
ensure that the sample is representative of the population. This helps to minimize biases
and ensures that the results can be generalized to the entire population.
2. Known population standard deviation: The population standard deviation (σ) must be
known or accurately estimated. In practice, the population standard deviation is often
unknown, and the sample standard deviation (s) is used as an estimate. However, if the
sample size is large enough, the sample standard deviation can provide a reasonably
accurate approximation.
3. Normal distribution or large sample size: The Z-procedure assumes that the underlying
population is normally distributed. However, if the population distribution is not normal,
the Central Limit Theorem can be applied when the sample size is large (usually, sample
size n ≥ 30 is considered large enough). According to the Central Limit Theorem, the
sampling distribution of the sample mean will approach a normal distribution as the
sample size increases, regardless of the shape of the population distribution.
A confidence interval is a range of values within which a population parameter, such as the
population mean, is estimated to lie with a certain level of confidence. The confidence interval
provides an indication of the precision and uncertainty associated with the estimate. To
interpret the confidence interval values, consider the following points:
1. Confidence level: The confidence level (commonly set at 90%, 95%, or 99%) represents
the probability that the confidence interval will contain the true population parameter if
the sampling and estimation process were repeated multiple times. For example, a 95%
confidence interval means that if you were to draw 100 different samples from the
population and calculate the confidence interval for each, approximately 95 of those
intervals would contain the true population parameter.
2. Interval range: The width of the confidence interval gives an indication of the precision of
the estimate. A narrower confidence interval suggests a more precise estimate of the
population parameter, while a wider interval indicates greater uncertainty. The width of
the interval depends on the sample size, variability in the data, and the desired level of
confidence.
3. Interpretation: To interpret the confidence interval values, you can say that you are "X%
confident that the true population parameter lies within the range (lower limit, upper
limit)." Keep in mind that this statement is about the interval, not the specific point
estimate, and it refers to the confidence level you chose when constructing the interval.
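A small sketch of constructing such an interval when σ is known (a Z-interval); the summary numbers are illustrative:
import numpy as np
from scipy.stats import norm

x_bar, sigma, n = 53, 5, 30                   # sample mean, known population std, sample size
confidence = 0.95

z_crit = norm.ppf(1 - (1 - confidence) / 2)   # about 1.96 for 95% confidence
margin = z_crit * sigma / np.sqrt(n)          # margin of error
print(x_bar - margin, x_bar + margin)         # the 95% confidence interval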
What is the trade-off
1. Random sampling: The data must be collected using a random sampling method to
ensure that the sample is representative of the population. This helps to minimize biases
and ensures that the results can be generalized to the entire population.
2. Sample standard deviation: The population standard deviation (σ) is unknown, and the
sample standard deviation (s) is used as an estimate. The t-distribution is specifically
designed to account for the additional uncertainty introduced by using the sample
standard deviation instead of the population standard deviation.
Student's t-distribution, or simply the t-distribution, is a probability distribution that arises when
estimating the mean of a normally distributed population when the sample size is small and the
population standard deviation is unknown. It was introduced by William Sealy Gosset, who
published under the pseudonym "Student."
The t-distribution is similar to the normal distribution (also known as the Gaussian distribution or
the bell curve) but has heavier tails. The shape of the t-distribution is determined by the degrees of
freedom, which is closely related to the sample size (degrees of freedom = sample size - 1). As the
degrees of freedom increase (i.e., as the sample size increases), the t-distribution approaches the
normal distribution.
In hypothesis testing and confidence interval estimation, the t-distribution is used in place of the
normal distribution when the sample size is small (usually less than 30) and the population standard
deviation is unknown. The t-distribution accounts for the additional uncertainty that arises from
estimating the population standard deviation using the sample standard deviation.
To use the t-distribution in practice, you look up critical t-values from a t-distribution table, which
provides values corresponding to specific degrees of freedom and confidence levels (e.g., 95%
confidence). These critical t-values are then used to calculate confidence intervals or perform
hypothesis tests.
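A small sketch of the same idea with the t-distribution (σ unknown, small sample); the summary statistics are illustrative:
import numpy as np
from scipy.stats import t

x_bar, s, n = 49.7, 1.2, 25                   # sample mean, sample std, sample size
confidence = 0.95
df = n - 1

t_crit = t.ppf(1 - (1 - confidence) / 2, df)  # critical t-value for df = 24
margin = t_crit * s / np.sqrt(n)
print(x_bar - margin, x_bar + margin)         # 95% confidence interval for the mean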
Bernoulli distribution is a probability distribution that models a binary outcome, where the
outcome can be either success (represented by the value 1) or failure (represented by the
value 0). The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli,
who first introduced it in the late 1600s.
The Probability of anyone watching this lecture in the future and then liking it is 0.5. What is the
probability that:
PDF Formula:
Graph of PDF:
Criteria:
4. A/B testing: A/B testing is a common technique used to compare two different
versions of a product, web page, or marketing campaign. In A/B testing, we
randomly assign individuals to one of two groups and compare the outcomes of
interest between the groups. Since the outcomes are often binary (e.g., click-
through rate or conversion rate), the binomial distribution can be used to model
the distribution of outcomes and test for differences between the groups.
The Central Limit Theorem (CLT) states that the distribution of the sample means of a large
number of independent and identically distributed random variables will approach a normal
distribution, regardless of the underlying distribution of the variables.
1. The sample size is large enough, typically greater than or equal to 30.
2. The sample is drawn from a finite population or an infinite population with a finite
variance.
3. The random variables in the sample are independent and identically distributed.
Step-by-step process:
1. Collect many random samples of a fixed size from the population (here, salaries of Indians).
2. Calculate the mean of each sample.
3. Calculate the average of the sample means. This value will be your best
estimate of the population mean (average salary of all Indians).
4. Calculate the standard error of the sample means, which is the standard
deviation of the sample means divided by the square root of the number
of samples.
5. Calculate the confidence interval around the average of the sample means
to get a range within which the true population mean likely falls. For a
95% confidence interval:
confidence interval = (average of the sample means) ± 1.96 × (standard error)
Python code
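A rough sketch of the steps above in Python; the salary "population" here is simulated (made-up) data standing in for real survey data:
import numpy as np
from scipy.stats import norm

np.random.seed(42)
# stand-in "population" of salaries - right-skewed, deliberately not normal
population = np.random.exponential(scale=50_000, size=1_000_000)

n_samples, sample_size = 200, 100
sample_means = [np.random.choice(population, sample_size).mean() for _ in range(n_samples)]

estimate = np.mean(sample_means)                            # best guess of the population mean
standard_error = np.std(sample_means) / np.sqrt(n_samples)  # as described in step 4
z = norm.ppf(0.975)                                         # 95% confidence
print(estimate - z * standard_error, estimate + z * standard_error)
print(population.mean())                                    # true mean, for comparison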
Remember that the validity of your results depends on the quality of your
data and the representativeness of your samples. To obtain accurate
results, it's crucial to ensure that your samples are unbiased and
representative.
In simple terms, the null hypothesis is a statement that assumes there is no significant
effect or relationship between the variables being studied. It serves as the starting point
for hypothesis testing and represents the status quo or the assumption of no effect until
proven otherwise. The purpose of hypothesis testing is to gather evidence (data) to either
reject or fail to reject the null hypothesis in favour of the alternative hypothesis, which
claims there is a significant effect or relationship.
The alternative hypothesis is a statement that contradicts the null hypothesis and claims
there is a significant effect or relationship between the variables being studied. It
represents the research hypothesis or the claim that the researcher wants to support
through statistical analysis.
Important Points
• How to decide what will be the Null hypothesis and what will be the Alternate
Hypothesis (typically, the Null hypothesis says nothing new is happening)
• It's important to note that failing to reject the null hypothesis doesn't necessarily mean
that the null hypothesis is true; it just means that there isn't enough evidence to support
the alternative hypothesis.
Hypothesis tests are similar to jury trials, in a sense. In a jury trial, H0 is similar to the not-guilty verdict,
and Ha is the guilty verdict. You assume in a jury trial that the defendant isn’t guilty unless the
prosecution can show beyond a reasonable doubt that he or she is guilty. If the jury says the evidence is
beyond a reasonable doubt, they reject H0, not guilty, in favour of Ha , guilty.
Suppose a company is evaluating the impact of a new training program on the productivity of
its employees. The company has data on the average productivity of its employees before
implementing the training program. The average productivity was 50 units per day with a
known population standard deviation of 5 units. After implementing the training program, the
company measures the productivity of a random sample of 30 employees. The sample has an
average productivity of 53 units per day. The company wants to know if the new training
program has significantly increased productivity.
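A sketch of how this could be tested with a one-tailed Z-test, using the values from the example above (scipy is used only for the normal CDF):
import numpy as np
from scipy.stats import norm

mu0, sigma = 50, 5            # pre-training mean and known population std
x_bar, n = 53, 30             # sample mean and sample size after training

z = (x_bar - mu0) / (sigma / np.sqrt(n))   # test statistic
p_value = 1 - norm.cdf(z)                  # one-tailed: has productivity increased?
print(z, p_value)                          # z ≈ 3.29, p ≈ 0.0005 -> reject H0 at α = 0.05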
Suppose a snack food company claims that their Lays wafer packets contain an
average weight of 50 grams per packet. To verify this claim, a consumer watchdog
organization decides to test a random sample of Lays wafer packets. The
organization wants to determine whether the actual average weight differs
significantly from the claimed 50 grams. The organization collects a random
sample of 40 Lays wafer packets and measures their weights. They find that the
sample has an average weight of 49 grams, with a known population standard
deviation of 4 grams.
The critical region is the region of values that corresponds to the rejection of the null
hypothesis at some chosen probability level.
One-sided (one-tailed) test: A one-sided test is used when the researcher is interested in
testing the effect in a specific direction (either greater than or less than the value specified in
the null hypothesis). The alternative hypothesis in a one-sided test contains an inequality
(either ">" or "<").
Example: A researcher wants to test whether a new medication increases the average
recovery rate compared to the existing medication.
Two-sided (two-tailed) test: A two-sided test is used when the researcher is interested in
testing the effect in both directions (i.e., whether the value specified in the null hypothesis is
different, either greater or lesser). The alternative hypothesis in a two-sided test contains a
"not equal to" sign (≠).
Example: A researcher wants to test whether a new medication has a different average
recovery rate compared to the existing medication.
The main difference between them lies in the directionality of the alternative hypothesis and
how the significance level is distributed in the critical regions.
Advantages:
1. Detects effects in both directions: Two-tailed tests can detect effects in both directions,
which makes them suitable for situations where the direction of the effect is uncertain or
when researchers want to test for any difference between the groups or variables.
2. More conservative: Two-tailed tests are more conservative because the significance level
(α) is split between both tails of the distribution. This reduces the risk of Type I errors in
cases where the direction of the effect is uncertain.
Disadvantages:
1. Less powerful: Two-tailed tests are generally less powerful than one-tailed tests because
the significance level (α) is divided between both tails of the distribution. This means the
test requires a larger effect size to reject the null hypothesis, which could lead to a higher
risk of Type II errors (failing to reject the null hypothesis when it is false).
2. Not appropriate for directional hypotheses: Two-tailed tests are not ideal for cases where
the research question or hypothesis is directional, as they test for differences in both
directions, which may not be of interest or relevance.
Advantages:
1. More powerful: One-tailed tests are generally more powerful than two-tailed tests, as the
entire significance level (α) is allocated to one tail of the distribution. This means that the
test is more likely to detect an effect in the specified direction, assuming the effect exists.
2. Directional hypothesis: One-tailed tests are appropriate when there is a strong theoretical
or practical reason to test for an effect in a specific direction.
Disadvantages:
1. Missed effects: One-tailed tests can miss effects in the opposite direction of the specified
alternative hypothesis. If an effect exists in the opposite direction, the test will not be
able to detect it, which could lead to incorrect conclusions.
2. Increased risk of Type I error: One-tailed tests can be more prone to Type I errors if the
effect is actually in the opposite direction than the one specified in the alternative
hypothesis.
3. Analysing relationships between variables: Hypothesis testing can be used to evaluate the
association between variables, such as the correlation between age and income or the
relationship between advertising spend and sales.
4. Evaluating the goodness of fit: Hypothesis testing can help assess if a particular theoretical
distribution (e.g., normal, binomial, or Poisson) is a good fit for the observed data.
6. A/B testing: In marketing, product development, and website design, hypothesis testing is
often used to compare the performance of two different versions (A and B) to determine
which one is more effective in terms of conversion rates, user engagement, or other
metrics.
2. Feature selection: Hypothesis testing can help identify which features are significantly
related to the target variable or contribute meaningfully to the model's performance. For
example, you can use a t-test, chi-square test, or ANOVA to test the relationship between
individual features and the target variable. Features with significant relationships can be
selected for building the model, while non-significant features may be excluded.
4. Assessing model assumptions: In some cases, machine learning models rely on certain
statistical assumptions, such as linearity or normality of residuals in linear regression.
Hypothesis testing can help assess whether these assumptions are met, allowing you to
determine if the model is appropriate for the data.
In simple words, the p-value is a measure of the strength of the evidence against the Null
Hypothesis that is provided by our sample data.
1. Very small p-values (e.g., p < 0.01) indicate strong evidence against the null hypothesis,
suggesting that the observed effect or difference is unlikely to have occurred by chance
alone.
2. Small p-values (e.g., 0.01 ≤ p < 0.05) indicate moderate evidence against the null
hypothesis, suggesting that the observed effect or difference is less likely to have
occurred by chance alone.
3. Large p-values (e.g., 0.05 ≤ p < 0.1) indicate weak evidence against the null hypothesis,
suggesting that the observed effect or difference might have occurred by chance alone,
but there is still some level of uncertainty.
4. Very large p-values (e.g., p ≥ 0.1) indicate weak or no evidence against the null
hypothesis, suggesting that the observed effect or difference is likely to have occurred by
chance alone.
Suppose a company is evaluating the impact of a new training program on the productivity of its employees. The
company has data on the average productivity of its employees before implementing the training program. The
average productivity was 50 units per day. After implementing the training program, the company measures the
productivity of a random sample of 30 employees. The sample has an average productivity of 53 units per day and
the pop std is 4. The company wants to know if the new training program has significantly increased productivity.
Suppose a snack food company claims that their Lays wafer packets contain an average weight of 50 grams per
packet. To verify this claim, a consumer watchdog organization decides to test a random sample of Lays wafer
packets. The organization wants to determine whether the actual average weight differs significantly from the
claimed 50 grams. The organization collects a random sample of 40 Lays wafer packets and measures their
weights. They find that the sample has an average weight of 49 grams, with a pop standard deviation of 5
grams.
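A sketch of the corresponding two-tailed Z-test and p-value for the Lays example above:
import numpy as np
from scipy.stats import norm

mu0, sigma = 50, 5            # claimed mean weight and population std
x_bar, n = 49, 40             # observed sample mean and sample size

z = (x_bar - mu0) / (sigma / np.sqrt(n))
p_value = 2 * norm.sf(abs(z))              # two-tailed: weight could differ in either direction
print(z, p_value)                          # z ≈ -1.26, p ≈ 0.21 -> fail to reject H0 at α = 0.05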
A t-test is a statistical test used in hypothesis testing to compare the means of two samples or
to compare a sample mean to a known population mean. The t-test is based on the t-
distribution, which is used when the population standard deviation is unknown and the
sample size is small.
One-sample t-test: The one-sample t-test is used to compare the mean of a single sample to a
known population mean. The null hypothesis states that there is no significant difference
between the sample mean and the population mean, while the alternative hypothesis states
that there is a significant difference.
Independent two-sample t-test: The independent two-sample t-test is used to compare the
means of two independent samples. The null hypothesis states that there is no significant
difference between the means of the two samples, while the alternative hypothesis states that
there is a significant difference.
Paired t-test (dependent two-sample t-test): The paired t-test is used to compare the means of
two samples that are dependent or paired, such as pre-test and post-test scores for the same
group of subjects or measurements taken on the same subjects under two different
conditions. The null hypothesis states that there is no significant difference between the
means of the paired differences, while the alternative hypothesis states that there is a
significant difference.
A one-sample t-test checks whether a sample mean differs from the population mean.
Suppose a manufacturer claims that the average weight of their new chocolate bars is 50
grams, we highly doubt that and want to check this so we drew out a sample of 25 chocolate
bars and measured their weight, the sample mean came out to be 49.7 grams and the sample
std deviation was 1.2 grams. Consider the significance level to be 0.05
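Since only summary statistics are given, the one-sample t-test can be computed directly from the formula (a sketch using scipy.stats for the t-distribution):
import numpy as np
from scipy.stats import t

mu0 = 50                      # claimed mean weight of a chocolate bar
x_bar, s, n = 49.7, 1.2, 25   # sample mean, sample std, sample size
df = n - 1

t_stat = (x_bar - mu0) / (s / np.sqrt(n))
p_value = 2 * t.sf(abs(t_stat), df)        # two-tailed p-value
print(t_stat, p_value)                     # t = -1.25, p ≈ 0.22 -> fail to reject H0 at α = 0.05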
2. Normality: The data in each of the two groups should be approximately normally
distributed. The t-test is considered robust to mild violations of normality, especially
when the sample sizes are large (typically n ≥ 30) and the sample sizes of the two groups
are similar. If the data is highly skewed or has substantial outliers, consider using a non-
parametric test, such as the Mann-Whitney U test.
4. Random sampling: The data should be collected using a random sampling method from
the respective populations. This ensures that the sample is representative of the
population and reduces the risk of selection bias.
Suppose a website owner claims that there is no difference in the average time spent on their
website between desktop and mobile users. To test this claim, we collect data from 30
desktop users and 30 mobile users regarding the time spent on the website in minutes. The
sample statistics are as follows:
desktop users = [12, 15, 18, 16, 20, 17, 14, 22, 19, 21, 23, 18, 25, 17, 16, 24, 20, 19, 22, 18, 15,
14, 23, 16, 12, 21, 19, 17, 20, 14]
mobile_users = [10, 12, 14, 13, 16, 15, 11, 17, 14, 16, 18, 14, 20, 15, 14, 19, 16, 15, 17, 14, 12,
11, 18, 15, 10, 16, 15, 13, 16, 11]
Desktop users:
○ Sample size (n1): 30
○ Sample mean (mean1): 18.5 minutes
○ Sample standard deviation (std_dev1): 3.5 minutes
Mobile users:
○ Sample size (n2): 30
○ Sample mean (mean2): 14.3 minutes
○ Sample standard deviation (std_dev2): 2.7 minutes
We will use a significance level (α) of 0.05 for the hypothesis test.
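A sketch of the independent two-sample t-test using the raw observations listed above (scipy's ttest_ind assumes equal variances by default; pass equal_var=False for Welch's test):
from scipy.stats import ttest_ind

desktop_users = [12, 15, 18, 16, 20, 17, 14, 22, 19, 21, 23, 18, 25, 17, 16,
                 24, 20, 19, 22, 18, 15, 14, 23, 16, 12, 21, 19, 17, 20, 14]
mobile_users = [10, 12, 14, 13, 16, 15, 11, 17, 14, 16, 18, 14, 20, 15, 14,
                19, 16, 15, 17, 14, 12, 11, 18, 15, 10, 16, 15, 13, 16, 11]

t_stat, p_value = ttest_ind(desktop_users, mobile_users)
print(t_stat, p_value)   # p well below 0.05 -> the average times differ significantly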
2. Matched or correlated groups: Comparing the performance of two groups that are
matched or correlated in some way, such as siblings or pairs of individuals with similar
characteristics.
Assumptions
1. Paired observations: The two sets of observations must be related or paired in some way,
such as before-and-after measurements on the same subjects or observations from
matched or correlated groups.
Let's assume that a fitness center is evaluating the effectiveness of a new 8-week weight loss
program. They enroll 15 participants in the program and measure their weights before and
after the program. The goal is to test whether the new weight loss program leads to a
significant reduction in the participants' weight.
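A minimal sketch of the paired t-test; the before/after weights below are hypothetical placeholders, not real measurements from the example:
import numpy as np
from scipy.stats import ttest_rel

# hypothetical before/after weights (kg) for the 15 participants
before = np.array([82, 91, 78, 95, 88, 76, 84, 90, 79, 86, 93, 81, 87, 89, 77])
after  = np.array([80, 89, 77, 92, 85, 75, 83, 88, 78, 84, 90, 80, 85, 87, 76])

t_stat, p_value = ttest_rel(before, after)   # paired (dependent) t-test
print(t_stat, p_value)   # small p -> the program reduced weight significantly in this fake data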
The Chi-Square distribution has a single parameter, the degrees of freedom (df),
which influences the shape and spread of the distribution. The degrees of
freedom are typically associated with the number of independent variables or
constraints in a statistical problem.
The Chi-Square distribution is used in various statistical tests, such as the Chi-
Square goodness-of-fit test, which evaluates whether an observed frequency
distribution fits an expected theoretical distribution, and the Chi-Square test for
independence, which checks the association between categorical variables in a
contingency table.
The Chi-Square test is a statistical hypothesis test used to determine if there is a significant
association between categorical variables or if an observed distribution of categorical data
differs from an expected theoretical distribution. It is based on the Chi-Square (χ²) distribution,
and it is commonly applied in two main scenarios:
2. Chi-Square Test for Independence (Chi-Square Test for Association): This test is used to
determine whether there is a significant association between two categorical variables in
a sample.
Steps
• Define the null hypothesis (H0) and the alternative hypothesis (H1):
○ H0: The observed data follows the expected theoretical
distribution.
○ H1: The observed data does not follow the expected theoretical
distribution.
• Calculate the p-value for the test statistic using the Chi-Square
distribution with the calculated degrees of freedom.
Assumptions
Suppose we have a six-sided fair die, and we want to test if the die is indeed fair. We roll the
die 60 times and record the number of times each side comes up. We'll use the Chi-Square
Goodness-of-Fit test to determine if the observed frequencies are consistent with a fair die
(i.e., a uniform distribution of the sides).
Observed frequencies:
○ Side 1: 12 times
○ Side 2: 8 times
○ Side 3: 11 times
○ Side 4: 9 times
○ Side 5: 10 times
○ Side 6: 10 times
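A sketch of the goodness-of-fit test for the die example, using the observed counts above:
from scipy.stats import chisquare

observed = [12, 8, 11, 9, 10, 10]        # rolls of each side in 60 throws
expected = [10, 10, 10, 10, 10, 10]      # fair die -> 60 / 6 = 10 per side

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat, p_value)                # chi2 = 1.0, large p -> no evidence the die is unfair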
Suppose a marketing team at a retail company wants to understand the distribution of visits to
their website by day of the week. They have a hypothesis that visits are uniformly distributed
across all days of the week, meaning they expect an equal number of visits on each day. They
collected data on website visits for four weeks and want to test if the observed distribution
matches the expected uniform distribution.
Observed frequencies (number of website visits per day of the week for four weeks):
• Monday: 420
• Tuesday: 380
• Wednesday: 410
• Thursday: 400
• Friday: 410
• Saturday: 430
• Sunday: 390
Is this data consistent with the result that male and female births are
equally probable?
The Chi-Square test for independence, also known as the Chi-Square test for association, is a
statistical test used to determine whether there is a significant association between two
categorical variables in a sample. It helps to identify if the occurrence of one variable is
dependent on the occurrence of the other variable, or if they are independent of each other.
The test is based on comparing the observed frequencies in a contingency table (a table that
displays the frequency distribution of the variables) with the frequencies that would be
expected under the assumption of independence between the two variables.
Steps
○ H0: There is no association between the two categorical variables (they are
independent).
○ H1: There is an association between the two categorical variables (they are
dependent).
2. Create a contingency table with the observed frequencies for each combination of the
categories of the two variables.
3. Calculate the expected frequencies for each cell in the contingency table assuming that
the null hypothesis is true (i.e., the variables are independent).
4. Calculate the Chi-Square test statistic: χ² = Σ (O_ij − E_ij)² / E_ij,
where O_ij is the observed frequency in each cell and E_ij is the expected frequency.
5. Calculate the degrees of freedom: df = (number of rows − 1) × (number of columns − 1).
6. Obtain the critical value or p-value using the Chi-Square distribution table or a statistical
software/calculator with the given degrees of freedom and significance level (commonly
α = 0.05).
7. Compare the test statistic to the critical value or the p-value to the significance level to
decide whether to reject or fail to reject the null hypothesis. If the test statistic is greater
than the critical value, or if the p-value is less than the significance level, we reject the null
hypothesis and conclude that there is a significant association between the two variables.
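A minimal sketch of the test for independence using scipy; the contingency table below is hypothetical:
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical contingency table: gender (rows) vs product preference (columns)
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, p_value, dof)   # compare p_value with α = 0.05
print(expected)                  # frequencies expected if the variables were independent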
Assumptions
2. Categorical variables: Both variables being tested must be categorical, either ordinal or
nominal. The Chi-Square test for independence is not appropriate for continuous
variables.
3. Adequate sample size: The sample size should be large enough to ensure that the
expected frequency for each cell in the contingency table is sufficient. A common rule of
thumb is that the expected frequency for each cell should be at least 5. If some cells have
expected frequencies less than 5, the test may not be valid, and other methods like
Fisher's exact test may be more appropriate.
4. Fixed marginal totals: The marginal totals (the row and column sums of the contingency
table) should be fixed before the data is collected. This is because the Chi-Square test for
independence assesses the association between the two variables under the assumption
that the marginal totals are fixed and not influenced by the relationship between the
variables.
1. Feature selection: Chi-Square test can be used as a filter-based feature selection method to
rank and select the most relevant categorical features in a dataset. By measuring the
association between each categorical feature and the target variable, you can eliminate
irrelevant or redundant features, which can help improve the performance and efficiency
of machine learning models.
3. Analysing relationships between categorical features: In exploratory data analysis, the Chi-
Square test for independence can be applied to identify relationships between pairs of
categorical features. Understanding these relationships can help inform feature
engineering and provide insights into the underlying structure of the data.
5. Variable selection in decision trees: Some decision tree algorithms, such as the CHAID (Chi-
squared Automatic Interaction Detection) algorithm, use the Chi-Square test to determine
the most significant splitting variables at each node in the tree. This helps construct more
effective and interpretable decision trees.
One-way ANOVA (Analysis of Variance) is a statistical method used to compare the means of
three or more independent groups to determine if there are any significant differences
between them. It is an extension of the t-test, which is used for comparing the means of two
independent groups. The term "one-way" refers to the fact that there is only one independent
variable (factor) with multiple levels (groups) in this analysis.
The primary purpose of one-way ANOVA is to test the null hypothesis that all the group means
are equal. The alternative hypothesis is that at least one group mean is significantly different
from the others.
Steps
• Calculate the p-value associated with the calculated F-statistic using the F-distribution and
the appropriate degrees of freedom. The p-value represents the probability of obtaining
an F-statistic as extreme or more extreme than the calculated value, assuming the null
hypothesis is true.
• Choose a significance level (alpha), typically 0.05.
• Compare the calculated p-value with the chosen significance level (alpha).
a. If the p-value is less than or equal to alpha, reject the null hypothesis in favour of the alternative
hypothesis, concluding that there is a significant difference between at least one pair of group
means.
b. If the p-value is greater than alpha, fail to reject the null hypothesis, concluding that there is not
enough evidence to suggest a significant difference between the group means.
It's important to note that one-way ANOVA only determines if there is a significant difference
between the group means; it does not identify which specific groups have significant
differences. To determine which pairs of groups are significantly different, post-hoc tests, such
as Tukey's HSD or Bonferroni, are conducted after a significant ANOVA result.
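A minimal sketch of a one-way ANOVA with scipy; the three groups below are hypothetical:
from scipy.stats import f_oneway

group_a = [85, 90, 88, 92, 87]
group_b = [78, 82, 80, 85, 79]
group_c = [90, 94, 91, 89, 95]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # p < 0.05 -> at least one group mean differs from the others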
Assumptions
2. Normality: The data within each group should be approximately normally distributed.
While one-way ANOVA is considered to be robust to moderate violations of normality,
severe deviations may affect the accuracy of the test results. Normality can be checked with
tests such as the Shapiro-Wilk test; if it is in doubt, a non-parametric alternative like the
Kruskal-Wallis test can be considered.
3. Homogeneity of variances: The variances of the populations from which the samples are
drawn should be equal, or at least approximately so. This assumption is known as
homoscedasticity. If the variances are substantially different, the accuracy of the test
results may be compromised. Levene's test or Bartlett's test can be used to assess the
homogeneity of variances. If this assumption is violated, alternative tests such as Welch's
ANOVA can be used.
Post hoc tests, also known as post hoc pairwise comparisons or multiple comparison tests, are
used in the context of ANOVA when the overall test indicates a significant difference among
the group means. These tests are performed after the initial one-way ANOVA to determine
which specific groups or pairs of groups have significantly different means.
The main purpose of post hoc tests is to control the family-wise error rate (FWER) and adjust
the significance level for multiple comparisons to avoid inflated Type I errors. There are
several post hoc tests available, each with different characteristics and assumptions. Some
common post hoc tests include:
1. Bonferroni correction: This method adjusts the significance level (α) by dividing it by the
number of comparisons being made. It is a conservative method that can be applied when
making multiple comparisons, but it may have lower statistical power when a large
number of comparisons are involved.
2. Tukey's HSD (Honestly Significant Difference) test: This test controls the FWER and is used
when the sample sizes are equal and the variances are assumed to be equal across the
groups. It is one of the most commonly used post hoc tests.
When performing post hoc tests, it is essential to choose a test that aligns with the
assumptions of your data (e.g., equal variances, equal sample sizes) and provides an
appropriate balance between controlling Type I errors and maintaining statistical power.
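A sketch of Tukey's HSD using statsmodels, continuing the hypothetical groups from the ANOVA sketch above:
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([85, 90, 88, 92, 87, 78, 82, 80, 85, 79, 90, 94, 91, 89, 95])
groups = ["A"] * 5 + ["B"] * 5 + ["C"] * 5

result = pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05)
print(result)    # shows which specific pairs of groups differ significantly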
1. Increased Type I error: When you perform multiple comparisons using individual t-tests,
the probability of making a Type I error (false positive) increases. The more tests you
perform, the higher the chance that you will incorrectly reject the null hypothesis in at least
one of the tests, even if the null hypothesis is true for all groups.
2. Difficulty in interpreting results: When comparing multiple groups using multiple t-tests,
the interpretation of the results can become complicated. For example, if you have 4
groups and you perform 6 pairwise t-tests, it can be challenging to interpret and summarize
the overall pattern of differences among the groups.
3. Inefficiency: Using multiple t-tests is less efficient than using a single test that accounts for
all groups, such as one-way ANOVA. One-way ANOVA uses the information from all the
groups simultaneously to estimate the variability within and between the groups, which
can lead to more accurate conclusions.
1. Hyperparameter tuning: When selecting the best hyperparameters for a machine learning
model, one-way ANOVA can be used to compare the performance of models with different
hyperparameter settings. By treating each hyperparameter setting as a group, you can
perform one-way ANOVA to determine if there are any significant differences in
performance across the various settings.
2. Feature selection: One-way ANOVA can be used as a univariate feature selection method to
identify features that are significantly associated with the target variable, especially when
the target variable is categorical with more than two levels. In this context, the one-way
ANOVA is performed for each feature, and features with low p-values are considered to be
more relevant for prediction.
4. Model stability assessment: One-way ANOVA can be used to assess the stability of a
machine learning model by comparing its performance across different random seeds or
initializations. If the model's performance varies significantly between different
initializations, it may indicate that the model is unstable or highly sensitive to the choice of
initial conditions.