Full Statistics
What is Statistics
Statistics is a branch of mathematics that involves collecting, analysing, interpreting, and
presenting data. It provides tools and methods to understand and make sense of large amounts
of data and to draw conclusions and make decisions based on the data.
Types of Statistics
Descriptive
Inferential
Descriptive Statistics: Descriptive statistics deals with the collection, organization, analysis,
interpretation, and presentation of data. It focuses on summarizing and describing the main
features of a set of data, without making inferences or predictions about the larger population.
Inferential statistics: Inferential statistics deals with making conclusions and predictions about
a population based on a sample. It involves the use of probability theory to estimate the
likelihood of certain events occurring, hypothesis testing to determine if a certain claim about a
population is supported by the data, and regression analysis to examine the relationships
between variables
Population Vs Sample
Population: Population refers to the entire group of individuals or objects that we are interested
in studying. It is the complete set of observations that we want to make inferences about. For
example, the population might be all the students in a particular school or all the cars in a
particular city.
Sample : A sample, on the other hand, is a subset of the population. It is a smaller group of
individuals or objects that we select from the population to study. Samples are used to estimate
characteristics of the population, such as the mean or the proportion with a certain attribute. For
example, we might randomly select 100 students.
Mean : The mean is the sum of all values in the dataset divided by the number of values
Median: The median is the middle value in the dataset when the data is arranged in order.
Mode: The mode is the value that appears most frequently in the dataset
Weighted Mean: The weighted mean is the sum of the products of each value and its
weight, divided by the sum of the weights. It is used to calculate a mean when the values
in the dataset have different importance or frequency.
Range: The range is the difference between the maximum and minimum values in the
dataset. It is a simple measure of dispersion that is easy to calculate but can be affected
by outliers.
Variance: The variance is the average of the squared differences between each data point
and the mean. It measures the average distance of each data point from the mean and is
useful in comparing the dispersion of datasets with different means.
Coefficient of Variation (CV): The CV is the ratio of the standard deviation to the mean
expressed as a percentage. It is used to compare the variability of datasets with different
means and is commonly used in fields such as biology, chemistry, and engineering.
The coefficient of variation (CV) is a statistical measure that expresses the amount of variability
in a dataset relative to the mean. It is a dimensionless quantity that is expressed as a
percentage.
The formula for calculating the coefficient of variation is: CV = (standard deviation / mean) x
100%
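A minimal sketch (with hypothetical values, not from the original notebook) computing the measures described above with NumPy:

import numpy as np
from statistics import mode

data = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical values
weights = [1, 2, 2, 1, 3, 1, 2, 1]           # hypothetical weights

mean = np.mean(data)                         # sum of values / number of values
median = np.median(data)                     # middle value of the sorted data
most_frequent = mode(data)                   # most frequently occurring value
weighted_mean = np.average(data, weights=weights)   # sum(w*x) / sum(w)

data_range = np.max(data) - np.min(data)     # max - min
variance = np.var(data)                      # average squared deviation from the mean
std_dev = np.std(data)                       # square root of the variance
cv = std_dev / mean * 100                    # coefficient of variation, in percent

print(mean, median, most_frequent, weighted_mean)
print(data_range, variance, std_dev, cv)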
Quantiles are statistical measures used to divide a set of numerical data into equal-sized
groups, with each group containing an equal number of observations.
Quantiles are important measures of variability and can be used to understand the distribution of
data and to summarize and compare different datasets. They can also be used to identify outliers.
Quartiles: Divide the data into four equal parts, Q1 (25th percentile), Q2 (50th percentile
or median), and Q3 (75th percentile).
Deciles: Divide the data into ten equal parts, D1 (10th percentile), D2 (20th percentile), ...,
D9 (90th percentile).
Percentiles: Divide the data into 100 equal parts, P1 (1st percentile), P2 (2nd percentile),
..., P99 (99th percentile).
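A short sketch (hypothetical sample, not from the original notebook) of computing quartiles, deciles, and percentiles with np.percentile:

import numpy as np

data = np.random.normal(loc=50, scale=10, size=1000)      # hypothetical sample

q1, q2, q3 = np.percentile(data, [25, 50, 75])             # quartiles
deciles = np.percentile(data, np.arange(10, 100, 10))      # D1 ... D9
p99 = np.percentile(data, 99)                              # 99th percentile
print(q1, q2, q3)
print(deciles)
print(p99)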
Percentile
A percentile is a statistical measure that represents the percentage of observations in a dataset
that fall below a particular value. For example, the 75th percentile is the value below which 75%
of the observations in the dataset fall.
5 number summary
1. Minimum: The smallest value in the dataset.
2. First quartile (Q1): The value that separates the lowest 25% of the data from the rest of
the dataset.
3. Median (Q2): The value that separates the lowest 50% from the highest 50% of the data.
4. Third quartile (Q3): The value that separates the lowest 75% of the data from the
highest 25% of the data.
5. Maximum: The largest value in the dataset.
The five-number summary is often represented visually using a box plot, which displays the
range of the dataset, the median, and the quartiles. The five-number summary is a useful way
to quickly summarize the central tendency, variability, and distribution of a dataset
1. What is a boxplot
A box plot, also known as a box-and-whisker plot, is a graphical representation of a dataset that
shows the distribution of the data. The box plot displays a summary of the data, including the
minimum and maximum values, the first quartile (Q1), the median (Q2), and the third quartile
(Q3).
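A minimal sketch (hypothetical data, not from the original notebook) of the five-number summary and the corresponding box plot:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=200)   # hypothetical sample

minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()
print(minimum, q1, median, q3, maximum)               # the five-number summary

plt.boxplot(data)      # box from Q1 to Q3, line at the median, whiskers and outlier points
plt.show()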
Covariance
Covariance measures the direction of the relationship between two variables. A positive
covariance means that both variables tend to be high or low at the same time. A negative
covariance means that when one variable is high, the other tends to be low
In [2]: df = pd.DataFrame()
In [4]: # x and y are numeric arrays defined in earlier (unshown) cells
df['x'] = x
df['y'] = y
In [6]: # Covariance: scaling both variables by 2 scales the covariance by a factor of 4
print(np.cov(df['x'],df['y'])[0,1])
print(np.cov(df['x']*2,df['y']*2)[0,1])
1148.75
4595.0
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It's a common tool for
describing simple relationships without making a statement about cause and effect.
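A short sketch (hypothetical x and y, not from the original notebook) showing that correlation is covariance rescaled by the standard deviations, so it stays between -1 and 1 and is unaffected by scaling:

import numpy as np
import pandas as pd

x = np.random.normal(50, 10, 100)              # hypothetical variable
y = 2 * x + np.random.normal(0, 5, 100)        # related variable plus noise
df = pd.DataFrame({'x': x, 'y': y})

print(np.cov(df['x'], df['y'])[0, 1])          # covariance (scale-dependent)
print(np.corrcoef(df['x'], df['y'])[0, 1])     # Pearson correlation, between -1 and 1
print(df.corr())                               # correlation matrix via pandas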
Probability distribution
A probability distribution is an idealized frequency distribution. A frequency distribution
describes a specific sample or dataset. It's the number of times each possible value of a
variable occurs in the dataset. The number of times a value occurs in a sample is determined
by its probability of occurrence.
The PMF of a discrete random variable assigns a probability to each possible value of the
random variable.
a.The probability assigned to each value must be non-negative (i.e., greater than or equal
to zero).
b. The sum of the probabilities assigned to all possible values must equal 1.
In [11]: l = []
for i in range(10000):
    l.append(random.randint(1,6))
# Here we take values from 1 to 6
In [12]: len(l)
Out[12]: 10000
In [13]: l[:5]
Out[13]: [2, 4, 5, 1, 6]
In [14]: pd.Series(l).value_counts()
Out[14]: 2 1728
5 1682
3 1677
4 1670
6 1657
1 1586
dtype: int64
Out[15]: 2 0.1728
5 0.1682
3 0.1677
4 0.1670
6 0.1657
1 0.1586
dtype: float64
In [18]: # Now rolling two dice and adding the results a and b
l = []
for i in range(10000):
    a = random.randint(1,6)
    b = random.randint(1,6)
    l.append(a+b)
In [19]: len(l)
Out[19]: 10000
In [20]: l[:5]
In [21]: (pd.Series(l).value_counts())
Out[21]: 7 1724
8 1396
6 1369
9 1098
5 1089
10 836
4 820
11 545
3 537
2 295
12 291
dtype: int64
In [22]: (pd.Series(l).value_counts()/pd.Series(l).value_counts().sum())
Out[22]: 7 0.1724
8 0.1396
6 0.1369
9 0.1098
5 0.1089
10 0.0836
4 0.0820
11 0.0545
3 0.0537
2 0.0295
12 0.0291
dtype: float64
In [23]: (pd.Series(l).value_counts()/pd.Series(l).value_counts().sum()).sort_index()
Out[23]: 2 0.0295
3 0.0537
4 0.0820
5 0.1089
6 0.1369
7 0.1724
8 0.1396
9 0.1098
10 0.0836
11 0.0545
12 0.0291
dtype: float64
In [24]: s = (pd.Series(l).value_counts()/pd.Series(l).value_counts().sum()).sort_index()
In [25]: s.plot(kind='bar')
Out[25]: <AxesSubplot:>
The cumulative distribution function (CDF) F(x) describes the probability that a random variable
X with a given probability distribution will be found at a value less than or equal to x
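A small sketch (not an original cell) of how such a CDF can be computed: the cumulative sum of the PMF gives P(X <= x). The values in Out[27] below are consistent with this, using the list l of two-dice sums from In [18].

pmf = pd.Series(l).value_counts(normalize=True).sort_index()
cdf = pmf.cumsum()        # P(X <= x) for each possible sum x
print(cdf)                # rises from about 0.03 at 2 up to 1.0 at 12
cdf.plot(kind='bar')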
Out[27]: 2 0.0295
3 0.0832
4 0.1652
5 0.2741
6 0.4110
7 0.5834
8 0.7230
9 0.8328
10 0.9164
11 0.9709
12 1.0000
dtype: float64
Density Estimation
Density estimation is a statistical technique used to estimate the probability density function
(PDF) of a random variable based on a set of observations or data. In simpler terms, it involves
estimating the underlying distribution of a set of data points
There are various methods for density estimation, including parametric and
nonparametric approaches.
Parametric methods assume that the data follows a specific probability distribution (such
as a normal distribution),
while nonparametric methods do not make any assumptions about the distribution and
instead estimate it directly from the data.
Commonly used techniques for density estimation include kernel density estimation (KDE),
histogram estimation, and Gaussian mixture models (GMMs). The choice of method depends
on the specific characteristics of the data and the intended use of the density estimate.
In [30]: sample
Out[31]: 50.06793537132727
Out[32]: 4.70306827374624
Out[33]: (array([ 3., 28., 92., 180., 282., 205., 139., 52., 15., 4.]),
array([35.20388838, 38.31826811, 41.43264784, 44.54702757, 47.6614073 ,
50.77578703, 53.89016676, 57.0045465 , 60.11892623, 63.23330596,
66.34768569]),
<BarContainer object of 10 artists>)
In [37]: sample.max()
Out[37]: 66.34768568938694
In [38]: sample.min()
Out[38]: 35.203888376819734
C:\Users\user\anaconda3\lib\site-packages\seaborn\distributions.py:2619: Futu
reWarning: `distplot` is a deprecated function and will be removed in a futur
e version. Please adapt your code to use either `displot` (a figure-level fun
ction with similar flexibility) or `histplot` (an axes-level function for his
tograms).
warnings.warn(msg, FutureWarning)
But sometimes the distribution is not clear, or it is not one of the well-known distributions.
Non-parametric density estimation has several advantages over parametric density estimation.
One of the main advantages is that it does not require the assumption of a specific
distribution, which allows for more flexible and accurate estimation in situations where the
underlying distribution is unknown or complex. However, non-parametric density estimation can
be computationally intensive and may require more data to achieve accurate estimates
compared to parametric methods
In [43]: sample
Out[44]: (array([ 2., 2., 0., 4., 6., 7., 6., 12., 18., 22., 17., 22., 33.,
17., 16., 21., 23., 19., 13., 21., 6., 9., 8., 20., 13., 21.,
26., 25., 40., 48., 58., 42., 51., 67., 45., 59., 43., 38., 25.,
23., 12., 12., 13., 7., 2., 2., 2., 1., 0., 1.]),
array([ 6.80805431, 7.80548117, 8.80290803, 9.80033489, 10.79776175,
11.7951886 , 12.79261546, 13.79004232, 14.78746918, 15.78489604,
16.7823229 , 17.77974976, 18.77717662, 19.77460347, 20.77203033,
21.76945719, 22.76688405, 23.76431091, 24.76173777, 25.75916463,
26.75659148, 27.75401834, 28.7514452 , 29.74887206, 30.74629892,
31.74372578, 32.74115264, 33.7385795 , 34.73600635, 35.73343321,
36.73086007, 37.72828693, 38.72571379, 39.72314065, 40.72056751,
41.71799437, 42.71542122, 43.71284808, 44.71027494, 45.7077018 ,
46.70512866, 47.70255552, 48.69998238, 49.69740924, 50.69483609,
51.69226295, 52.68968981, 53.68711667, 54.68454353, 55.68197039,
56.67939725]),
<BarContainer object of 50 artists>)
Out[45]: KernelDensity(bandwidth=5)
score_samples(values) returns the log-density estimate of the input samples values. This is
because the score_samples() method of the KernelDensity class returns the logarithm of the
probability density estimate rather than the actual probability density estimate.
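A hedged sketch of how the KernelDensity estimator shown above might be used end to end; `sample` is assumed to be the 1000-point array from the cells above, and np.exp converts the log-densities back to densities.

import numpy as np
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=5, kernel='gaussian')
kde.fit(sample.reshape(-1, 1))                    # the estimator expects a 2-D array

values = np.linspace(sample.min(), sample.max(), 200).reshape(-1, 1)
log_density = kde.score_samples(values)           # log of the estimated density
density = np.exp(log_density)                     # actual density estimate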
In [49]: sns.kdeplot(sample.reshape(1000),bw_adjust=0.3)
Out[49]: <AxesSubplot:ylabel='Density'>
Function
In [51]: df = sns.load_dataset('iris')
In [52]: df.head()
Out[52]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
We can tell from the graph that petal length and petal width are crucial for analysis.
In [58]: titanic.head()
Out[58]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN
In [59]: # Using the PDF (probability density function) we can analyse the data
sns.kdeplot(data=titanic,x='Age',hue='Survived')
# We can examine the likelihood of survival for children aged 0 to 8 here.
# The PDF gives the probability density at particular points
In [60]: sns.kdeplot(df['petal_width'],hue=df['species'])
# ecdfPlot
sns.ecdfplot(data=df,x='petal_width',hue='species')
2D density curve
Normal Distribution
The normal distribution is described by two parameters: the mean (μ) and the standard deviation (σ). The mean represents the centre of the
distribution, while the standard deviation represents the spread of the distribution. It is denoted as X ~ N(μ, σ²).
Why is it so important?
Commonality in Nature: Many natural phenomena follow a normal distribution, such as the
heights of people, the weights of objects, the IQ scores of a population, and many more. Thus,
the normal distribution provides a convenient way to model and analyse such data.
https://samp-suman-normal-dist-visualize-app-lkntug.streamlit.app/
A Standard Normal Variate (Z) is a standardized form of the normal distribution with mean = 0
and standard deviation = 1, obtained as Z = (X - μ) / σ.
In [62]: # Example
sns.kdeplot(titanic['Age'])
In [63]: titanic['Age'].mean()
Out[63]: 29.69911764705882
In [64]: titanic['Age'].std()
Out[64]: 14.526497332334044
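The standardized values in Out[65] below are consistent with z-scoring the Age column; a minimal sketch of that step (assuming the titanic DataFrame loaded earlier):

x = (titanic['Age'] - titanic['Age'].mean()) / titanic['Age'].std()   # z = (x - mean) / std
x.head()   # values similar to those shown in Out[65]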
Out[65]: 0 -0.530005
1 0.571430
2 -0.254646
3 0.364911
4 0.364911
...
886 -0.185807
887 -0.736524
888 NaN
889 -0.254646
890 0.158392
Name: Age, Length: 891, dtype: float64
In [68]: x.mean()
Out[68]: 2.0039214607642444e-16
In [69]: x.std()
Out[69]: 0.9999999999999994
Problem
Z table
Positive Z score
Problem: For a normal distribution X ~ N(μ, σ), what percent of the population lies between the mean
and 1 standard deviation, 2 standard deviations, and 3 standard deviations?
Empirical Rule:
The normal distribution has a well-known empirical rule, also called the 68-95-99.7 rule, which
states that approximately 68% of the data falls within one standard deviation of the mean,
about 95% of the data falls within two standard deviations of the mean, and about 99.7% of the
data falls within three standard deviations of the mean.
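A quick check of these percentages from the standard normal CDF (a sketch using SciPy, not from the original notebook):

from scipy.stats import norm

for k in [1, 2, 3]:
    p = norm.cdf(k) - norm.cdf(-k)     # P(mean - k*std < X < mean + k*std)
    print(k, round(p * 100, 2), '%')   # approximately 68.27, 95.45, 99.73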
1. Symmetry
The normal distribution is symmetric about its mean, which means that the probability of
observing a value above the mean is the same as the probability of observing a value below
the mean. The bell-shaped curve of the normal distribution reflects this symmetry.
3. Empirical Rule
Skewness
What is skewness?
In a symmetrical distribution, the mean, median, and mode are all equal. In contrast, in a
skewed distribution, the mean, median, and mode are not equal, and the distribution
tends to have a longer tail on one side than the other.
Skewness can be positive, negative, or zero. A positive skewness means that the tail of the
distribution is longer on the right side, while a negative skewness means that the tail is
longer on the left side. A zero skewness indicates a perfectly symmetrical distribution
The greater the skew the greater the distance between mode, median and mean
Interpretation
On the other hand, standardisation is beneficial in cases where the dataset follows the
Gaussian distribution. Unlike Normalization, Standardisation is not affected by the outliers in
the dataset as it does not have any bounding range.
Applying Normalization or Standardisation depends on the problem and the machine learning
algorithm. There are no definite rules as to when to use Normalization or Standardisation. One
can fit the normalized or standardized dataset into the model and compare the two.
It is always advisable to first fit the scaler on the training data and then transform the testing
data. This would prohibit data leakage during the model testing process, and the scaling of
target values is generally not required.
Q:What is Kurtosis.?
Kurtosis is the 4th statistical moment. In probability theory and statistics, kurtosis (meaning
"curved, arching") is a measure of the "tailedness" of the probability distribution of a real-valued
random variable. Like skewness, kurtosis describes a particular aspect of a probability
distribution.
Moment in distribution
1. Mean
2. Variance (often expressed through the standard deviation)
3. Skewness
4. Kurtosis
Example:
In short:
Kurtosis describes how fat the tails of a distribution are; tail fatness indicates the presence of outliers.
Formula: Kurt[X] = E[(X - μ)⁴] / σ⁴, the fourth standardized moment.
Practical Use-case
In finance, kurtosis risk refers to the risk associated with the possibility of extreme
outcomes or "fat tails" in the distribution of returns of a particular asset or portfolio.
If a distribution has high kurtosis, it means that there is a higher likelihood of extreme
events occurring, either positive or negative, compared to a normal distribution.
In finance, kurtosis risk is important to consider because it indicates that there is a greater
probability of large losses or gains occurring, which can have significant implications for
investors. As a result, investors may want to adjust their investment strategies to account
for kurtosis risk.
Excess Kurtosis
Excess kurtosis measures how much more peaked or flat a distribution is compared to a normal
distribution, which has a kurtosis of 3 (and therefore an excess kurtosis of 0). It is calculated by
subtracting 3 from the sample kurtosis coefficient.
Types of Kurtosis:
1. Leptokurtic
A distribution with positive excess kurtosis is called leptokurtic. "Lepto-" means "slender". In
terms of shape, a leptokurtic distribution has fatter tails. This indicates that there are more
extreme values or outliers in the distribution.
Example - Assets with positive excess kurtosis are riskier and more volatile than those with a
normal distribution, and they may experience sudden price movements that can result in
significant gains or losses.
2. Platykurtic
A distribution with negative excess kurtosis is called platykurtic. "Platy-" means "broad". In
terms of shape, a platykurtic distribution has thinner tails. This indicates that there are fewer
extreme values or outliers in the distribution.
Assets with negative excess kurtosis are less risky and less volatile than those with a normal
distribution, and they may experience more gradual price movements that are less likely to
result in large gains or losses
3. Mesokurtic
Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of
a mesokurtic distribution is the normal distribution family, regardless of the values of its
parameters.
Mesokurtic is a term used to describe a distribution with an excess kurtosis of 0, indicating that
it has the same degree of "peakedness" or "flatness" as a normal distribution.
Visual inspection: One of the easiest ways to check for normality is to visually inspect a
histogram or a density plot of the data. A normal distribution has a bell-shaped curve,
which means that the majority of the data falls in the middle, and the tails taper off
symmetrically. If the distribution looks approximately bell-shaped, it is likely to be normal.
QQ Plot: Another way to check for normality is to create a normal probability plot (also
known as a Q-Q plot) of the data. A normal probability plot plots the observed data against
the expected values of a normal distribution. If the data points fall along a straight line, the
distribution is likely to be normal.
Statistical tests: There are several statistical tests that can be used to test for normality,
such as the Shapiro-Wilk test, the Anderson-Darling test, and the Kolmogorov-Smirnov
test. These tests compare the observed data to the expected values of a normal
distribution and provide a p-value that indicates whether the data is likely to be normal or
not. A p-value less than the significance level (usually 0.05) suggests that the data is not
normal.
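As a sketch (not part of the original notebook), a Shapiro-Wilk normality test with SciPy on a hypothetical sample might look like this:

import numpy as np
from scipy.stats import shapiro

data = np.random.normal(loc=0, scale=1, size=200)   # hypothetical sample
stat, p_value = shapiro(data)
print(stat, p_value)
if p_value < 0.05:
    print("Reject normality at the 5% level")
else:
    print("No evidence against normality")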
In [71]: df = sns.load_dataset('iris')
In [72]: sns.kdeplot(df['sepal_length'])
In [74]: y_quant = []
for i in range(1,101):
    y_quant.append(np.percentile(temp,i))   # quantiles of the observed data (temp comes from an unshown cell)
In [76]: x_quant = []
for i in range(1,101):
    x_quant.append(np.percentile(samples,i))   # quantiles of a reference normal sample (samples comes from an unshown cell)
In [77]: sns.scatterplot(x=x_quant,y=y_quant)
Out[77]: <AxesSubplot:>
using statsmodel
In [78]: # Style
plt.style.use('ggplot')
# using statsmodel
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Create a QQ plot of the two sets of data
fig = sm.qqplot(df['sepal_length'], line='45', fit=True)
# Add a title and labels to the plot
plt.title('QQ Plot')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
# Show the plot
plt.show()
C:\Users\user\anaconda3\lib\site-packages\statsmodels\graphics\gofplots.py:99
3: UserWarning: marker is redundantly defined by the 'marker' keyword argumen
t and the fmt string "bo" (-> marker='o'). The keyword argument will take pre
cedence.
ax.plot(x, y, fmt, **plot_style)
In a QQ plot, the quantiles of the two sets of data are plotted against each other. The quantiles
of one set of data are plotted on the x-axis, while the quantiles of the other set of data are
plotted on the y-axis. If the two sets of data have the same distribution, the points on the QQ
plot will fall on a straight line. If the two sets of data do not have the same distribution, the
points will deviate from the straight line.
• https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html
In [82]: plt.hist(x)
Out[82]: (array([ 88., 85., 110., 84., 93., 114., 104., 99., 102., 121.]),
array([0.00154826, 0.10131761, 0.20108696, 0.30085631, 0.40062566,
0.50039501, 0.60016435, 0.6999337 , 0.79970305, 0.8994724 ,
0.99924175]),
<BarContainer object of 10 artists>)
A QQ plot lets us compare any two distributions, not only the normal distribution but also the
uniform distribution and others; however, it is most commonly used to check for normality.
Uniform Distribution
In probability theory and statistics, a uniform distribution is a probability distribution where all
outcomes are equally likely within a given range. This means that if you were to select a
random value from this range, any value would be as likely as any other value
The height of a person randomly selected from a group of individuals whose heights range
from 5'6" to 6'0" would follow a continuous uniform distribution.
The time it takes for a machine to produce a product, where the production time ranges
from 5 to 10 minutes, would follow a continuous uniform distribution.
The distance that a randomly selected car travels on a tank of gas, where the distance
ranges from 300 to 400 miles, would follow a continuous uniform distribution.
The weight of a randomly selected apple from a basket of apples that weighs between 100
and 200 grams, would follow a continuous uniform distribution.
https://en.wikipedia.org/wiki/Continuous_uniform_distribution
Sampling: Uniform distribution can also be used for sampling. For example, if you have a
dataset with an equal number of samples from each class, you can use uniform
distribution to randomly select a subset of the data that is representative of all the classes.
Data augmentation: In some cases, you may want to artificially increase the size of your
dataset by generating new examples that are similar to original data. Uniform distribution
can be used to generate new data points that are within a specified range of the original
data
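A minimal sketch (hypothetical data, not from the original notebook) of sampling from a uniform distribution and using it for simple augmentation:

import numpy as np
import matplotlib.pyplot as plt

samples = np.random.uniform(low=0, high=1, size=10000)    # equally likely values in [0, 1)
plt.hist(samples, bins=20, density=True)                   # roughly flat histogram
plt.show()

original = np.array([5.0, 6.2, 7.1])                       # hypothetical data points
augmented = original + np.random.uniform(-0.1, 0.1, size=original.shape)   # jitter within a small range
print(augmented)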
Log-normal Distribution
Examples
Users' dwell time on online articles (jokes, news, etc.) follows a log-normal distribution.
The length of chess games tends to follow a log-normal distribution. In economics, there is
evidence that the income of 97%–99% of the population is distributed log-normally.
Formula
f(x) = (1 / (x · σ · √(2π))) · exp(-(ln x - μ)² / (2σ²)), for x > 0
Create a Q-Q plot of the log-transformed data. If the points approximately lie on a straight line, it
suggests that the original data follows a log-normal distribution.
Pareto Distribution
The Pareto distribution is a type of probability distribution that is commonly used to model the
distribution of wealth, income, and other quantities that exhibit a similar power-law behaviour
In mathematics, a power law is a functional relationship between two variables, where one
variable is proportional to a power of the other. Specifically, if y and x are two variables related
by a power law, then the relationship can be written as
y = k * x^a
The Long Tail offers a great contemporary example of the Pareto probability distribution – a few
extreme events or “blockbusters” on the left hand side of the curve and a very long tail of much
less popular events on the right hand side of the curve. The Pareto distribution has also been
popularized as the “80/20” rule.
Vilfredo Pareto originally used this distribution to describe the allocation of wealth among
individuals since it seemed to show rather well the way that a larger portion of the wealth of any
society is owned by a smaller percentage of the people in that society. He also used it to
describe distribution of income. This idea is sometimes expressed more simply as the Pareto
principle or the "80-20 rule" which says that 20% of the population controls 80% of the
wealth
The Pareto distribution is a power-law probability distribution, and has only two parameters to
describe the distribution: α ("alpha") and Xm. The α value is the shape parameter of the
distribution, which determines how steeply the distribution slopes (see Figure 1). The Xm
parameter is the scale parameter, which represents the minimum possible value for the
distribution and helps to determine the distribution's spread. The probability density function is
given by the following formula:
f(x) = α · Xm^α / x^(α + 1), for x ≥ Xm
When we plot this function across a range of x values, we see that the distribution slopes
downward as x increases. This means that the majority of the distribution’s density is
concentrated near Xm on the left-hand side, with only a small proportion of the density as we
move to the right. For reference, the “80-20 Rule” is represented by a distribution with
alpha equal to approximately 1.16.
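A small sketch (not from the original notebook) of drawing Pareto samples with alpha of about 1.16 and Xm = 1; note that numpy's pareto draws the shifted (Lomax) form, so 1 is added back:

import numpy as np
import matplotlib.pyplot as plt

alpha, xm = 1.16, 1
samples = (np.random.pareto(alpha, 10000) + 1) * xm   # classical Pareto samples
plt.hist(samples, bins=100, density=True)
plt.xlim(0, 20)            # most of the density sits near Xm, with a long right tail
plt.show()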
Examples
File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few
larger ones)
Sales: In fact, there are companies where only 20% of the salespeople are responsible for
80% of the sales.
Warehouses: It is not uncommon for 20% of the products to take up 80% of the available
space.
Productivity: With proper prioritization, as little as 20% of all efforts can often get 80% of
the work done.
Time Management: With 20% of the time (properly) spent, 80% of the tasks can be
completed.
The Pareto Principle states that 80% of the results are achieved with 20% of the total
effort. The remaining 20% of the results require the most work with 80% of the total
effort.
CDF: F(x) = 1 - (Xm / x)^α, for x ≥ Xm
In [89]: plt.hist(x)
Out[89]: (array([947., 36., 8., 2., 2., 2., 1., 1., 0., 1.]),
array([ 1.00049542, 4.50522233, 8.00994925, 11.51467616, 15.01940307,
18.52412999, 22.0288569 , 25.53358381, 29.03831073, 32.54303764,
36.04776455]),
<BarContainer object of 10 artists>)
In sklearn
Function Transformer
1. Log transform
2. reciprocal
3. square/square_root
4. custom
Power Transformer
1. boxcox
2. Yeo-Johnson
LOG TRANSFORMATION:
Generally, these transformations make our data closer to a normal distribution but are not
able to make it exactly follow a normal distribution.
This transformation is not applied to features which have negative values.
It converts data from a multiplicative scale to an additive scale, i.e., more linearly distributed data.
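A minimal sketch (hypothetical DataFrame and values, not from the original notebook) of a log transform with sklearn's FunctionTransformer; np.log1p is used so zeros are handled safely. The worked Titanic example follows below.

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

X = pd.DataFrame({'Fare': [7.25, 71.28, 7.93, 53.1, 8.05]})   # hypothetical skewed feature
trf = FunctionTransformer(np.log1p)                            # applies log(x + 1) element-wise
X_log = trf.fit_transform(X)
print(X_log)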
Example
In [93]: df = titanic[['Age','Fare','Survived']]
In [94]: df.head()
Out[94]:
Age Fare Survived
0 22.0 7.2500 0
1 38.0 71.2833 1
2 26.0 7.9250 1
3 35.0 53.1000 1
4 35.0 8.0500 0
In [97]: df.isnull().sum()
Out[97]: Age 0
Fare 0
Survived 0
dtype: int64
In [102]: # Modelling
clf = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf.fit(X_train,y_train)        # fit both models on the training data
clf2.fit(X_train,y_train)
y_pred = clf.predict(X_test)
y_pred1 = clf2.predict(X_test)
print("Accuracy LR",accuracy_score(y_test,y_pred))
print("Accuracy DT",accuracy_score(y_test,y_pred1))
Accuracy LR 0.6480446927374302
Accuracy DT 0.6871508379888268
np.log is used when we only need a plain logarithm. If the data contains zeros, np.log returns
-inf and the model becomes less accurate, so we use np.log1p instead: it computes log(x + 1),
so zero values map to 0 and no value in the data becomes undefined.
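A tiny illustrative check (hypothetical values) of the difference between np.log and np.log1p at zero:

import numpy as np

values = np.array([0.0, 1.0, 10.0])
print(np.log1p(values))            # [0.     0.6931 2.3979]  -> log(x + 1)
with np.errstate(divide='ignore'):
    print(np.log(values))          # [-inf  0.     2.3026]   -> log(0) is -inf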
Accuracy LR 0.6815642458100558
Accuracy DT 0.6815642458100558
LR 0.678027465667915
DT 0.6622471910112359
Observation: After applying the log transformation to the 'Age' column, the model did not
perform well and gave worse results.
The application of a logarithmic transformation to the 'Age' column did not yield favorable
results. It is possible that the 'Age' column was already normally distributed,
while another column, such as 'Fare', exhibited right skewness. This suggests that the
logarithmic transformation was unnecessary for the 'Age' column, potentially leading to
undesirable changes in the data. Consider the initial distribution and specific requirements
of the analysis when deciding on the appropriate data transformation technique.
# Helper used by the cells below; it applies the given transform to 'Fare',
# reports cross-validated accuracy, and compares QQ plots before and after
def apply_transform(transform):
    trf = ColumnTransformer([('log',
                              FunctionTransformer(transform),
                              ['Fare'])],
                            remainder='passthrough')
    X_trans = trf.fit_transform(X)
    clf = LogisticRegression()
    print("Accuracy",np.mean(cross_val_score(clf,X_trans,y,
                                             scoring='accuracy',
                                             cv=10)))
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    stats.probplot(X['Fare'], dist="norm", plot=plt)
    plt.title('Fare Before Transform')
    plt.subplot(122)
    stats.probplot(X_trans[:,0], dist="norm", plot=plt)
    plt.title('Fare After Transform')
    plt.show()
In [111]: # Square
apply_transform(lambda x:x**2)
Accuracy 0.6442446941323345
In [112]: # Square root (note: x**1/2 evaluates as (x**1)/2, i.e. x/2, because of
# operator precedence; a true square root would be x**0.5)
apply_transform(lambda x:x**1/2)
Accuracy 0.6589013732833957
In [113]: # Reciprocal
apply_transform(lambda x:1/(x+0.00000001))
Accuracy 0.61729088639201
Accuracy 0.6195131086142323
Power Transformers
Box-cox Transformer
Yeo-Johnson Transformer
In [117]: df =pd.read_csv("concrete_data.csv")
In [118]: df.head()
Out[118]:
   Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  Coarse Aggregate  Fine Aggregate  Age  Strength
In [119]: df.shape
Out[119]: (1030, 9)
In [120]: df.isnull().sum()
Out[120]: Cement 0
Blast Furnace Slag 0
Fly Ash 0
Water 0
Superplasticizer 0
Coarse Aggregate 0
Fine Aggregate 0
Age 0
Strength 0
dtype: int64
Check if any column has '0' values, because Box-Cox will not work if any '0' values are
present in the data.
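A one-line sketch of that check on the concrete DataFrame df loaded above (columns with zeros would call for Yeo-Johnson instead of Box-Cox):

zero_counts = (df == 0).sum()          # per-column count of zero values
print(zero_counts[zero_counts > 0])    # columns that Box-Cox cannot handle directly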
In [121]: df.describe()
Out[121]:
   Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  Coarse Aggregate  ...
In [122]: X = df.drop(columns=['Strength'])
y = df.iloc[:,-1]
Out[124]: 0.627553179231485
Out[125]: 0.4609940491662866
In [126]: # Plotting the distplots without any transformation, for every column
for col in X_train.columns:
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    sns.distplot(X_train[col])
    plt.title(col)
    plt.subplot(122)
    stats.probplot(X_train[col], dist="norm", plot=plt)
    plt.title(col)
    plt.show()
Box-Cox Transform
Out[127]:
cols box_cox_lambdas
0 Cement 0.177025
3 Water 0.772682
4 Superplasticizer 0.098811
7 Age 0.066631
Out[128]: 0.8047825006181187
Out[129]: 0.6658537942219862
Yeo-Johnson transform
0.8161906513339305
Out[131]:
cols Yeo_Johnson_lambdas
0 Cement 0.174348
3 Water 0.771307
4 Superplasticizer 0.253935
7 Age 0.019885
In [132]:
# applying cross val score
pt = PowerTransformer()
X_transformed2 = pt.fit_transform(X)
lr = LinearRegression()
np.mean(cross_val_score(lr,X_transformed2,y,scoring='r2'))
Out[132]: 0.6834625134285743
Out[135]:
cols box_cox_lambdas Yeo_Johnson_lambdas
Bernoulli Distribution
It is a probability distribution that models a binary outcome, where the outcome can be either
success (represented by the value 1) or failure (represented by the value 0). The Bernoulli
distribution is named after the Swiss mathematician Jacob Bernoulli, who first introduced it in
the late 1600s.
The Bernoulli distribution is commonly used in machine learning for modelling binary
outcomes, such as whether a customer will make a purchase or not, whether an email is spam
or not, or whether a patient will have a certain disease or not.
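A minimal sketch (hypothetical p, not from the original notebook): Bernoulli trials can be simulated as a binomial with n = 1.

import numpy as np

p = 0.3                                         # hypothetical probability of success
trials = np.random.binomial(n=1, p=p, size=10)  # each draw is 0 (failure) or 1 (success)
print(trials)
print(trials.mean())                            # roughly p for large sample sizes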
Binomial Distribution
n - number of trials
p - probability of success
x - desired number of successes
PMF: P(X = x) = C(n, x) · p^x · (1 - p)^(n - x)
Graph of the PMF:
In [136]: # code
n = 10 # number of trials
p = 0.5 # probability of success
size = 1000 # number of samples to generate
binomial_dist = np.random.binomial(n, p, size)
plt.hist(binomial_dist,density=True)
plt.show()
In [138]: # Plot
n = 10 # number of trials
p = 0.8 # probability of success
size = 1000 # number of samples to generate
binomial_dist = np.random.binomial(n, p, size)
plt.hist(binomial_dist,density=True)
plt.show()
Criteria:
3. Logistic regression: Logistic regression is a popular machine learning algorithm used for
classification problems. It models the probability of an event happening as a logistic
function of the input variables. Since the logistic function can be viewed as a
transformation of a linear combination of inputs, the output of logistic regression can be
thought of as a binomial distribution.
4. A/B testing: A/B testing is a common technique used to compare two different versions of
a product, web page, or marketing campaign. In A/B testing, we randomly assign
individuals to one of two groups and compare the outcomes of interest between the
groups. Since the outcomes are often binary (e.g., clickthrough rate or conversion rate),
the binomial distribution can be used to model the distribution of outcomes and test for
differences between the groups.
Sampling Distribution
1. The sample size is large enough, typically greater than or equal to 30.
2. The sample is drawn from a finite population or an infinite population with a finite variance.
3. The random variables in the sample are independent and identically distributed.
The CLT is important in statistics and machine learning because it allows us to make
probabilistic inferences about a population based on a sample of data.
For example, we can use the CLT to construct confidence intervals, perform hypothesis tests,
and make predictions about the population mean based on the sample data. The CLT also
provides a theoretical justification for many commonly used statistical techniques, such as t-
tests, ANOVA, and linear regression
In [140]: # code
# Set the parameters
num_samples = 10000
sample_size = 300
distribution_range = (0, 1)
# Generate samples from a uniform distribution
samples = np.random.uniform(distribution_range[0],
distribution_range[1],
(num_samples, sample_size))
# Calculate the sample means
sample_means = np.mean(samples, axis=1)
# Plot the histogram of the sample means
plt.hist(sample_means, bins=30, density=True, edgecolor='black',color='blue')
plt.title('Histogram of Sample Means')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.show()
Out[144]: 0.04
Step-by-step process:
1. Collect multiple random samples of salaries from a representative group of Indians. Each
sample should be large enough (usually, n > 30) to ensure the CLT holds. Make sure the
samples are representative and unbiased to avoid skewed results.
2. Calculate the sample mean (average salary) and sample standard deviation for each
sample.
3. Calculate the average of the sample means. This value will be your best estimate of the
population mean (average salary of all Indians).
4. Calculate the standard error of the sample means, which is the standard deviation of the
sample means divided by the square root of the number of samples.
5. Calculate the confidence interval around the average of the sample means to get a range
within which the true population mean likely falls. For a 95% confidence interval, compute
(average of the sample means) ± 1.96 × (standard error of the sample means).
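A sketch of these steps with simulated (entirely hypothetical) salary data:

import numpy as np

num_samples, sample_size = 200, 50
samples = np.random.exponential(scale=50000, size=(num_samples, sample_size))  # skewed 'salary' samples

sample_means = samples.mean(axis=1)                            # step 2: mean of each sample
estimate = sample_means.mean()                                 # step 3: estimate of the population mean
standard_error = sample_means.std() / np.sqrt(num_samples)     # step 4: standard error of the sample means

lower = estimate - 1.96 * standard_error                       # step 5: 95% confidence interval
upper = estimate + 1.96 * standard_error
print(estimate, (lower, upper))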