Article Review 1 Eng
Introduction
Statistics is a core part of a data scientist's daily work. Each time you start an analysis, the first step, before applying fancy algorithms and making predictions, is to perform some exploratory data analysis (EDA) and try to read and understand the data by applying statistical techniques. With this first analysis, you can understand what type of distribution the data presents.
At the end of this brief introduction, we will work through real datasets to make sense of these concepts.
When looking at data, the first step of your statistical analysis is to determine whether the dataset you're dealing with is a population or a sample.
A population is the collection of all items of interest in your study, and it is generally denoted with the capital letter N. The values calculated when analyzing a population are known as parameters. On the other hand, a sample is a subset of a population, and it is usually denoted by the lowercase letter n. The values calculated when using a sample are known as statistics.
Populations are hard to define and analyze in real life. It is easy to miss values when studying a population, which will bias the analysis, and analyzing the whole population is very expensive and time-consuming. Therefore, you normally hear about samples. In contrast to a population, a sample is not expected to account for all the data and is easier to analyze: its smaller size makes the analysis less time-consuming, less costly and less prone to error. A sample must be both random and representative of the population. With a representative sample, you can make inferences about the population.
Examples of sampling techniques (a short code sketch follows the list) include:
● Simple Random Sampling
○ Simple random sampling is like putting all the elements of a population into
a hat and picking names without any particular order or reason. Every
individual in the group has an equal chance of being chosen. Imagine you
have a class of students, and you want to select a few for a survey. If you use
simple random sampling, you give each student a number, put those
numbers in a hat, and pick out the ones you need. It's completely random,
like drawing names out of a hat.
● Stratified Random Sampling
○ Stratified random sampling is a bit like organizing your class into groups
based on something important, like their grades. Instead of picking
randomly from everyone, you first divide the students into these groups,
called strata, and then randomly select from each group. So, if you're doing a
survey in your class, you might first divide the students by grades, like A, B,
and C, and then randomly choose a few from each grade. This helps make
sure you have a good mix from each important subgroup.
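To make the two schemes concrete, here is a minimal pandas sketch; the student table and its column names are invented purely for illustration:

```python
import pandas as pd

# Toy roster: six students with made-up grade strata
df = pd.DataFrame({
    "student": ["Ana", "Ben", "Carla", "Dan", "Eva", "Filipe"],
    "grade":   ["A",   "A",   "B",     "B",   "C",   "C"],
})

# Simple random sampling: every row has an equal chance of selection
simple = df.sample(n=3, random_state=42)

# Stratified random sampling: draw within each grade stratum,
# so every grade is represented
stratified = df.groupby("grade", group_keys=False).sample(n=1, random_state=42)

print(simple)
print(stratified)
```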
Types of data
● Discrete - data which can only take certain values; you only have a fixed set of values to choose from. For example, age in whole years, the number of cars in a street, or the number of fingers.
● Continuous - data which can take any real or fractional value within a certain range, without any restrictions (e.g. weight, balance in a bank account, value spent on a purchase, grade on an exam, foot size).
Within categorical data, there are also Nominal and Ordinal types:
● Nominal Data:
○ Nominal data are categories without any inherent order or
ranking. They represent different groups or labels, but there is
no implied order among them.
○ Example: Colors of cars (red, blue, green). Each color is distinct,
but there is no inherent order or ranking among them.
● Ordinal Data:
○ Ordinal data, on the other hand, have categories with a specific
order or rank. The intervals between the categories are not
necessarily uniform, but there is a clear sequence.
○ Example: Educational attainment levels (high school diploma,
bachelor's degree, master's degree). Here, there is an order
from less to more education, but the difference between having
a high school diploma and a bachelor's degree may not be the
same as the difference between a bachelor's and a master's
degree.
Figure 1. Data Types
Levels of Measurement
● Interval: represented by numbers, without having a true zero. In this case, the zero point is arbitrary and does not mean an absence of the quantity.
● Ratio: represented by numbers and has a true zero.
Whether quantitative data is regarded as interval or ratio depends on the context in which we use it. For example, think about temperature. 0 ºC or 0 ºF does not represent a true absence of temperature, so neither scale has a true zero: absolute zero is -273.15 ºC, or -459.67 ºF. Therefore, temperature in Celsius or Fahrenheit must be treated as interval data. However, if you measure temperature in kelvins, absolute zero is 0 K, so the value is now ratio data, since it has a true zero.
Measures of Central Tendency
A measure of central tendency refers to the idea that there is one number that best summarizes the entire dataset. The most popular are the mean, median and mode.
1. Mean
This is considered the most reliable measure of central tendency for making inferences about a population from a single sample. The symbol μ is used for the population mean, whereas x̅ denotes the sample mean.
2. Median
The median is the midpoint, or the "middle" value, of your dataset once it is sorted in ascending order. It is also known as the 50th percentile. To avoid the error that outliers introduce into the mean, it is usually a good idea to also calculate the median.
3. Mode
The mode shows us the value that occurs most often. It can be used for numerical as well as categorical variables. If no value appears more than once, we say there is no mode.
Suppose you have the following exam scores for a class of students:
● 85, 92, 75, 88, 92, 92, 75, 82, 88, 92
To find the mode:
● Count the frequency of each value:
○ 75 appears twice
○ 82 appears once
○ 85 appears once
○ 88 appears twice
○ 92 appears four times
So the mode is 92.
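A quick way to verify this count, using only Python's standard library, is shown below; multimode returns every value tied for the highest frequency:

```python
from statistics import multimode

scores = [85, 92, 75, 88, 92, 92, 75, 82, 88, 92]

# multimode returns a list of all values with the highest frequency
print(multimode(scores))  # [92]
```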
Measures of variability
A measure of variability captures the dispersion of the data around the mean value. The best-known measures of variability are the range, the interquartile range (IQR), the variance and the standard deviation.
1. Range
The range is the most obvious measure of dispersion and describes the difference between the largest and the smallest points in your data. For example, if the largest value is 99 and the smallest is 12, the range is 99 − 12 = 87.
2. Interquartile range (IQR)
The IQR is a measure of variability between the upper (75th) and lower (25th) quartiles. The data is sorted into ascending order and divided into four quarters:

$$IQR = Q_3 - Q_1$$

where $Q_1$ is the first quartile (25th percentile) and $Q_3$ is the third quartile (75th percentile).
While the range measures the full span of values over which the dataset is distributed, the interquartile range measures the interval where the middle 50% of the values lie.
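A short NumPy sketch of the quartile computation, on an invented ten-point dataset (NumPy's default linear interpolation may give slightly different quartiles than hand methods):

```python
import numpy as np

data = np.array([12, 19, 23, 35, 41, 47, 56, 63, 78, 99])

# 25th and 75th percentiles (linear interpolation by default)
q1, q3 = np.percentile(data, [25, 75])
print(q1, q3, q3 - q1)  # IQR = Q3 - Q1
```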
3. Variance
The variance, like the standard deviation, is a more sophisticated way of measuring how much the data disperses from the mean of the dataset. For a population:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

where $x_i$ is each data point, $\mu$ is the population mean and $N$ is the total number of data points. The variance is found by computing the difference between every data point and the mean, squaring that value and summing over all available data points. In the end, the variance is obtained by dividing that sum by the total number of points (for a sample, divide by $n - 1$ instead).
Squaring the differences has two main purposes:
1. Dispersion is non-negative: squaring ensures we do not get negative values, so deviations cannot cancel each other out.
2. It amplifies the effect of large differences.
The problem with variance is that, because of the squaring, it is not in the same unit of measurement as the original data. This is why the standard deviation is used more often: it is in the original unit. Squared dollars mean nothing in statistics.
4. Standard Deviation
Usually the standard deviation is much more meaningful than the variance. It is the preferred measure of variability, as it is directly interpretable: the standard deviation is simply the square root of the variance.
Figure 6. How to calculate standard deviation

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$

where the symbols are the same as in the variance formula above.
Standard deviation is best used when the data presents a unimodal shape. In a normal distribution, approximately 34% of data points fall between the mean and one standard deviation above it. Since a normal distribution is symmetrical, about 68% of data points lie within one standard deviation of the mean. Around 95% of points fall within two standard deviations of the mean, and 99.7% within three.
With the z-score, you can check how many standard deviations below (or above) the mean a specific data point is:

$$z = \frac{x - \mu}{\sigma}$$
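As a sanity check of the 68-95-99.7 rule and the z-score, here is a sketch with simulated normal data (the exact percentages will vary slightly with a different seed):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=170, scale=10, size=100_000)  # simulated, e.g. heights

mu, sigma = data.mean(), data.std()

# Empirical rule: share of points within 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    share = np.mean(np.abs(data - mu) < k * sigma)
    print(f"within {k} sd: {share:.1%}")  # ~68%, ~95%, ~99.7%

# Z-score: how many standard deviations a point sits from the mean
x = 185
print(f"z-score of {x}: {(x - mu) / sigma:.2f}")
```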
Measures of Asymmetry
1. Modality
The modality of a distribution is determined by the number of peaks the data presents. Most distributions are unimodal, meaning they have a single frequently occurring value clustered at the peak, while a bimodal distribution has two values occurring frequently.
2. Skewness
Skewness is the most common tool to measure asymmetry. It indicates to which side the data is concentrated and captures the presence of outliers: if a distribution is left-skewed, the tail, and hence the outliers, extend to the left. When the mean is higher than the median we have a right skew; if it is lower, we have a left skew.
Figure 9. Skewness type
Measures of asymmetry are the link between measures of central tendency and probability theory, which will ultimately allow us to obtain a more accurate knowledge of the data we are working with.
Impact of Skewness on Mean, Median, and Mode:
● Mean:
○ In a right-skewed distribution (positive skewness), the mean
will be larger than the median due to higher values in the right
tail.
○ In a left-skewed distribution (negative skewness), the mean will
be smaller than the median due to lower values in the left tail.
● Median:
○ The median is less sensitive to extreme values and is not
significantly influenced by skewness. It represents the middle
value when the data is sorted.
● Mode:
○ If the distribution is skewed, the mode (the most frequently occurring value) may not align with the mean or median. The mode stays at the peak of the distribution, while the mean is pulled toward the longer tail.
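To see these effects numerically, here is a small pandas sketch on simulated right-skewed data (an exponential sample, invented purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(rng.exponential(scale=10, size=10_000))  # right-skewed sample

print(f"mean:   {s.mean():.2f}")    # pulled toward the right tail
print(f"median: {s.median():.2f}")  # smaller than the mean
print(f"skew:   {s.skew():.2f}")    # positive => right skew
```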
Now, let's take a look at the EPL 2014–2015 Player Heights and Weights dataset, which contains English Premier League players' height, weight and age, as well as their name, number, position and team.
Starting simple, what do you think the relationship between height and weight is for football players? You probably assumed that the taller the player, the heavier he will be. So, we expect to see a positive relationship between height and weight.
And indeed, taller players tend to be heavier, with some exceptions.
Covariance
Covariance is a measure that indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related. The formula for calculating the covariance of sample data is:

$$cov(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $x_i$ and $y_i$ are the individual data points, $\bar{x}$ and $\bar{y}$ are the sample means, and $n$ is the number of data points.
Nevertheless, with covariance we have an issue with the units. If you calculated the covariance by hand you might have noticed it, but with pandas it is easy to miss. Take a look again at the formula. We've chosen two variables: Height, measured in centimeters (cm), and Weight, measured in kilograms (kg). Notice the numerator: for each data point you subtract the mean of the respective variable and then multiply both differences. In the end, our value for the covariance will be 34.43 cm·kg. This is not very informative! First of all, the covariance depends on the units and magnitude of our variables: if we had used the imperial system for height and weight, the covariance would return a different value and deceive us about the relationship's strength.
So, our metric shows to what extent these variables change together, which is good, but it depends on the magnitude of the variables themselves, which is generally not what we want. A better question than "How do our variables relate?" is "How strong is the relationship between our variables?". For that, correlation is the best answer.
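A minimal pandas sketch contrasting the two measures, with invented height and weight values:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [172, 180, 168, 190, 175, 185],
    "weight_kg": [68, 80, 62, 92, 74, 85],
})

# Sample covariance: unit-dependent (here, cm*kg)
print(df["height_cm"].cov(df["weight_kg"]))

# Pearson correlation: unit-free, always between -1 and +1
print(df["height_cm"].corr(df["weight_kg"]))
```

Rescaling either column changes the covariance but leaves the correlation untouched, which is exactly why correlation answers the "how strong" question.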
Probability Distributions
One way to think about a distribution is as a function that shows the possible values of a variable and how often they occur. It is a common mistake to believe that the distribution is the graph, when in fact it is the "rule" that determines how values are positioned in relation to each other.
Below is a map of the relationships between the different distributions out there, with many deriving naturally from the Bernoulli distribution. Each distribution is illustrated by an example of its probability density function (PDF), which we will see later.
Figure 12. Probability distribution
We will first focus our attention on the most widely used distribution, the Normal Distribution, for the following reasons:
● It approximates a wide variety of random variables;
● Distributions of sample means with large enough sample sizes can be approximated by a normal distribution (the Central Limit Theorem);
● Its computable statistics are elegant;
● It is heavily used in regression analysis;
● Decisions based on normal distribution insights have a good track record.
Normal Distribution
Also known as the Gaussian distribution or the bell curve, this is a continuous probability distribution, and it is the most common distribution you will find. A distribution of a dataset shows the frequency at which possible values occur. The normal distribution presents the following notation:

$$X \sim N(\mu, \sigma^2)$$

with $N$ standing for normal, $\sim$ read as "is distributed as", $\mu$ being the mean and $\sigma^2$ the variance. The normal distribution is symmetrical and its median, mean and mode are equal, so it has no skewness.
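A small SciPy sketch of the standard normal (assuming μ = 0 and σ = 1):

```python
from scipy.stats import norm

dist = norm(loc=0, scale=1)  # standard normal: mu = 0, sigma^2 = 1

print(dist.pdf(0))     # density at the mean, ~0.3989
print(dist.cdf(1.96))  # P(X <= 1.96), ~0.975
print(dist.ppf(0.5))   # median = mean = mode = 0 (no skewness)
```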
Univariate Analysis
The primary aim of univariate analysis is to succinctly describe the data and identify patterns within it. This is achieved by examining metrics such as the mean, median, mode, dispersion, variance, range, standard deviation, and so on.
Univariate analysis employs various descriptive methods (a short sketch follows the list), including:
● Frequency Distribution Tables
● Histograms
● Frequency Polygons
● Pie Charts
● Bar Charts
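As an illustration of two of these methods, here is a frequency table and a histogram built with pandas and matplotlib, reusing the toy exam scores from earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.Series([85, 92, 75, 88, 92, 92, 75, 82, 88, 92])

# Frequency distribution table
print(scores.value_counts().sort_index())

# Histogram of the same data
scores.plot.hist(bins=5, title="Exam scores")
plt.show()
```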
Bivariate Analysis
Bivariate analysis examines the relationship between two variables. Common techniques include:
● Correlation analysis
○ The correlation coefficient measures the strength and direction of the linear relationship between two variables, on a scale from -1 to +1. A correlation coefficient of +1 indicates a perfect positive linear relationship.
○ A correlation coefficient of -1 indicates a perfect negative linear relationship.
○ A correlation coefficient of 0 suggests no linear relationship.
○ For example, if we have data on hours of study and exam
scores, a positive correlation coefficient would imply that as
hours of study increase, exam scores also tend to increase.
Conversely, a negative correlation coefficient would suggest
that as one variable increases, the other tends to decrease.
● Regression analysis
○ Regression analysis is a statistical method used to model the
relationship between a dependent variable and one or more
independent variables. It helps us understand how changes in
the independent variables are associated with changes in the
dependent variable.
○ In simple terms, if we take the same example of hours of study
and exam scores, regression analysis would allow us to create a
mathematical formula (a regression equation) that predicts
exam scores based on the number of hours studied. The
equation might tell us how much, on average, an additional
hour of study is associated with an increase or decrease in
exam scores.
○ Regression analysis is widely used for prediction, understanding cause-and-effect relationships, and making informed decisions based on the relationships between variables (see the sketch below).
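A sketch of simple linear regression with SciPy, using invented hours-of-study and exam-score data:

```python
from scipy.stats import linregress

hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 83]

fit = linregress(hours, scores)

# Slope: average change in score per additional hour of study
print(f"score = {fit.intercept:.1f} + {fit.slope:.1f} * hours")
print(f"r = {fit.rvalue:.3f}")  # correlation coefficient
```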
Multivariate Analysis
Multivariate analysis examines three or more variables at once. Common techniques include:
● Variance Analysis
○ Variance analysis is about understanding differences. If you planned to spend a certain amount of money but actually spent more or less, variance analysis helps figure out why. It looks at the differences (variances) between what was planned and what actually happened.
● Discriminant Analysis
○ Discriminant analysis is like finding the features that make
things different. If you have two or more groups (like different
types of animals), discriminant analysis helps you identify which
characteristics discriminate or set them apart from each other.
● Principal Component Analysis
○ Principal component analysis is about simplifying complex data.
Imagine you have a lot of information about students, like
grades, study time, and test scores. Principal component
analysis helps you find the most important things that explain
most of the differences among students, making it easier to
understand the data.
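A quick scikit-learn sketch of PCA on an invented table of student metrics (grade, study hours, test score):

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows are students; columns are grade, study hours, test score (made up)
X = np.array([
    [78, 10, 72], [85, 14, 80], [60, 5, 55],
    [90, 16, 88], [70, 8, 65], [82, 12, 79],
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # variance explained per component
print(X_reduced.shape)                # (6, 2): same students, fewer columns
```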
Inferential Analysis
The process of inferential analysis typically involves hypothesis testing, estimation,
and drawing conclusions based on probability theory. Researchers use inferential
statistics to make judgments about the characteristics of a population, determine
the significance of relationships between variables, or make predictions about
future observations.
Suppose a group of researchers is interested in understanding whether a new
teaching method enhances student performance in mathematics compared to the
traditional teaching method.
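One common way to frame this, sketched below with SciPy and invented scores for the two groups (assuming roughly normal data, a two-sample t-test applies):

```python
from scipy.stats import ttest_ind

new_method = [78, 85, 88, 92, 81, 87, 90, 84]
traditional = [72, 75, 80, 78, 74, 79, 77, 76]

# Welch's t-test: does not assume equal variances between groups
t_stat, p_value = ttest_ind(new_method, traditional, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (e.g. < 0.05) suggests the difference in mean scores
# is unlikely to be due to chance alone
```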
Use Case
One of the steps you perform during EDA is descriptive statistics. Suppose we have the following ten loan installment values: 50,000; 55,000; 60,000; 65,000; 70,000; 75,000; 80,000; 85,000; 90,000; 95,000. Calculate the mean, median, mode and range!
Solution:
Formula:
- Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
- Median: sort the data, then take the middle value (or the average of the two middle values when $n$ is even)
Result:
Mean (average):
(50,000 + 55,000 + 60,000 + 65,000 + 70,000 + 75,000 + 80,000 + 85,000 + 90,000 + 95,000) / 10 = 72,500
Median (middle value):
(70,000 + 75,000) / 2 = 72,500
Mode (most frequent value): no mode in this example; all installments are unique.
Range:
95,000 − 50,000 = 45,000
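The same numbers can be checked with Python's statistics module:

```python
from statistics import mean, median, multimode

installments = [50_000, 55_000, 60_000, 65_000, 70_000,
                75_000, 80_000, 85_000, 90_000, 95_000]

print(mean(installments))                     # 72500
print(median(installments))                   # 72500.0
print(multimode(installments))                # every value ties: no real mode
print(max(installments) - min(installments))  # range: 45000
```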
References
https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-6c246ed2468d
https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-3087b80eb1c6
https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-7bf596237ac6
https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-a67a3199dcd4
https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-16a188a400ca
https://www.kdnuggets.com/2020/02/probability-distributions-data-science.html
https://www.yourdatateacher.com/2021/04/16/the-most-used-probability-distributions-in-data-science/
https://towardsdatascience.com/statistical-significance-hypothesis-testing-the-normal-curve-and-p-values-93274fa32687
https://towardsdatascience.com/statistical-significance-in-action-84a4f47b51ba
https://hotcubator.com.au/research/what-is-univariate-bivariate-and-multivariate-analysis/
https://sciencing.com/similarities-of-univariate-multivariate-statistical-analysis-12549543.html
https://thecleverprogrammer.com/2021/01/13/univariate-and-multivariate-for-data-science/