Skewness and Kurtosis
Atul Sharma
There are two things to notice about a distribution's shape: the peak of the curve and the tails of the curve. The kurtosis measure captures this phenomenon. The formula for kurtosis is complex (it is the fourth moment in the moment-based calculation), so we will stick to the concept and its visual interpretation. A normal distribution has a kurtosis of 3 and is called mesokurtic. Distributions with kurtosis greater than 3 are called leptokurtic, and those with kurtosis less than 3 are called platykurtic; the greater the value, the greater the peakedness. Kurtosis ranges from 1 to infinity. Since the kurtosis of a normal distribution is 3, we can calculate excess kurtosis by taking the normal distribution as the zero reference, so excess kurtosis ranges from -2 to infinity.
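If you want to see these numbers in practice, here's a minimal sketch using NumPy and SciPy on simulated data (the distributions and sample sizes are my own arbitrary choices):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=100_000)   # mesokurtic reference
heavy = rng.laplace(size=100_000)   # heavier tails than the normal

# fisher=False gives "plain" kurtosis (about 3 for a normal sample);
# the default fisher=True gives excess kurtosis (about 0 for a normal sample).
print(kurtosis(normal, fisher=False), kurtosis(normal))  # ~3.0 and ~0.0
print(kurtosis(heavy, fisher=False), kurtosis(heavy))    # ~6.0 and ~3.0 -> leptokurtic
```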
The topic of kurtosis has been controversial for decades. Its interpretation has long been linked with peakedness, but the prevailing verdict is that outliers (fatter tails) govern kurtosis far more than the values near the mean (the peak).
Statistics for Data Science: What is Skewness and Why is it Important?
Published on July 5, 2020. Republished by Plato.
Overview
Skewness is a key statistics concept you must know in the data science and analytics fields.
Learn what skewness is, the formula for skewness, and why it's important for you as a data science professional.
Introduction
The concept of skewness is baked into our way of thinking. When we look at a
visualization, our minds intuitively discern the pattern in that chart.
Think about it – you look at a chart of a cricket team’s batting performance in a
50-over game and you’ll quickly notice how there’s a sudden deluge of runs in the
last 10 overs. Now think of that in terms of a bar chart – there’s a skew towards
the end, right?
So even if you haven't read up on skewness as a data science or analytics professional, you have definitely interacted with the concept informally. And it's actually a pretty easy topic in statistics – and yet a lot of folks skim through it in their haste to learn other seemingly complex data science concepts. To me, that's a mistake.
Distributions that are exactly symmetrical have skewness = 0.000. The sample skewness is computed as

\[ \text{Skewness} = \frac{N}{(N-1)(N-2)} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{S} \right)^3 \]

where
X_i is each individual score;
\bar{X} is the sample mean;
S is the sample standard deviation; and
N is the sample size.
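To make the formula concrete, here's a small worked example in Python with made-up numbers; it assumes the bias-corrected sample skewness written above, which is also what scipy.stats.skew with bias=False computes:

```python
import numpy as np
from scipy.stats import skew

x = np.array([2, 4, 4, 5, 7, 9, 15], dtype=float)  # small made-up sample

n = x.size
s = x.std(ddof=1)  # sample standard deviation S
manual = n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

# scipy's bias-corrected skewness should agree with the formula above
print(round(manual, 4), round(skew(x, bias=False), 4))
```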
Skewness in SPSS
First off, "skewness" in SPSS always refers to sample skewness: it quietly assumes that your data are a sample rather than an entire population. There are plenty of options for obtaining it. My favorite is via MEANS because the syntax and output are clean and simple. The syntax below guides you through.
The syntax can be as simple as

means v1 to v5
/cells skew.

A very complete table, including means, standard deviations, medians and more, is run from

means v1 to v5
/cells count min max mean median stddev skew kurt.

The result is a compact table of these statistics for each variable.
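If you work in Python rather than SPSS, a rough pandas analogue of that table could look like the sketch below (the columns v1 to v5 are made up; pandas' skew and kurt are bias-corrected sample statistics, so the values should be comparable to what SPSS reports):

```python
import numpy as np
import pandas as pd

# Made-up data standing in for variables v1 to v5
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"v{i}" for i in range(1, 6)])

# Roughly the same cells as the MEANS command above
summary = df.agg(["count", "min", "max", "mean", "median", "std", "skew", "kurt"])
print(summary.round(3))
```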
Credits: Wikipedia
The probability distribution with its tail on the right side is a positively skewed
distribution and the one with its tail on the left side is a negatively skewed
distribution. If you’re finding the above figures confusing, that’s alright. We’ll
understand this in more detail later.
Before that, let’s understand why skewness is such an important concept for you
as a data science professional.
Why is Skewness Important?
Now, we know that the skewness is the measure of asymmetry and its types are
distinguished by the side on which the tail of probability distribution lies. But why
is knowing the skewness of the data important?
First, linear models work on the assumption that the distributions of the independent variables and the target variable are similar. Therefore, knowing the skewness of the data helps us create better linear models.
Secondly, let’s take a look at the below distribution. It is the distribution of
horsepower of cars:
You can clearly see that the above distribution is positively skewed. Now, let’s say
you want to use this as a feature for the model which will predict the mpg (miles
per gallon) of a car.
Since our data here is positively skewed, most of the data points have low values, i.e., cars with less horsepower. So when we train our model on this data, it will perform better at predicting the mpg of cars with lower horsepower than of cars with higher horsepower. This is similar to how class imbalance affects classification problems.
Also, skewness tells us about the direction of outliers. You can see that our
distribution is positively skewed and most of the outliers are present on the right
side of the distribution.
Note: The skewness does not tell us about the number of outliers. It only tells us
the direction.
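If you'd like to reproduce this kind of check, here's a quick sketch using seaborn's bundled "mpg" dataset as a stand-in for the horsepower data shown above (the dataset is downloaded on first use, so it needs an internet connection; the dataset choice is mine, not the article's):

```python
import seaborn as sns

df = sns.load_dataset("mpg")
hp = df["horsepower"].dropna()

print(round(hp.skew(), 2))                 # positive -> right-skewed, as in the figure
print(round((hp < hp.mean()).mean(), 2))   # well over half the cars sit below the mean

# Outliers beyond the boxplot fences all sit on the right side
q1, q3 = hp.quantile([0.25, 0.75])
iqr = q3 - q1
print((hp > q3 + 1.5 * iqr).sum(), (hp < q1 - 1.5 * iqr).sum())
```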
Now that we know why skewness is important, let's understand the distributions I showed you earlier.
What is a Symmetric/Normal Distribution?
Credits: Wikipedia
Yes, we’re back again with the normal distribution. It is used as a reference for
determining the skewness of a distribution. As I mentioned earlier, the normal
distribution is the probability distribution with almost no skewness. It is nearly
perfectly symmetrical. Due to this, the value of skewness for a normal distribution
is zero.
But why is it nearly perfectly symmetrical and not absolutely symmetrical?
That's because, in reality, no real-world data has a perfectly normal distribution. Therefore, the value of skewness is not exactly zero; it is nearly zero. Nevertheless, the value of zero is used as the reference for determining the skewness of a distribution.
You can see in the above image that the same line represents the mean, median,
and mode. It is because the mean, median, and mode of a perfectly normal
distribution are equal.
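A quick simulation makes the "nearly zero" point concrete (the sample and its parameters are arbitrary choices of mine):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=100_000)  # simulated, roughly normal data

print(round(skew(x), 3))                           # close to, but not exactly, 0
print(round(x.mean(), 2), round(np.median(x), 2))  # mean and median nearly coincide
```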
So far, we’ve understood the skewness of normal distribution using a probability
or frequency distribution. Now, let’s understand it in terms of a boxplot because
that’s the most common way of looking at a distribution in the data science
space.
The above image is a boxplot of a symmetric distribution. You'll notice here that the distance between Q1 and Q2 and between Q2 and Q3 is equal, i.e., Q2 - Q1 = Q3 - Q2.
But that's not enough for concluding whether a distribution is skewed. We also take a look at the lengths of the whiskers; if they are equal, then we can say that the distribution is symmetric, i.e. it is not skewed.
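Here's what that check might look like in code, assuming the usual 1.5 * IQR rule for the whiskers and a simulated symmetric sample:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)  # roughly symmetric sample

q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
# Whiskers: the most extreme points still within 1.5 * IQR of the box
lower_whisker = x[x >= q1 - 1.5 * iqr].min()
upper_whisker = x[x <= q3 + 1.5 * iqr].max()

print(round(q2 - q1, 3), round(q3 - q2, 3))                        # roughly equal box halves
print(round(q1 - lower_whisker, 3), round(upper_whisker - q3, 3))  # roughly equal whiskers
```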
Now that we’ve discussed the skewness in the normal distribution, it’s time to
learn about the two types of skewness which we discussed earlier. Let’s start with
positive skewness.
Understanding Positively Skewed Distribution
Source: Wikipedia
A positively skewed distribution is the distribution with the tail on its right side.
The value of skewness for a positively skewed distribution is greater than zero. As
you might have already understood by looking at the figure, the value of mean is
the greatest one followed by median and then by mode.
So why is this happening?
Well, the tail of the distribution is on the right side; it pulls the mean toward the right, making it greater than the median. Also, the mode occurs at the point of highest frequency in the distribution, which lies to the left of the median. Therefore, mode < median < mean.
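You can see this ordering on simulated right-skewed data; the sketch below uses a lognormal sample and a crude histogram-based mode estimate (both choices are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.6, size=100_000)  # right-skewed sample

mean, median = x.mean(), np.median(x)
# Crude mode estimate for continuous data: centre of the tallest histogram bin
counts, edges = np.histogram(x, bins=200)
mode = (edges[counts.argmax()] + edges[counts.argmax() + 1]) / 2

print(round(mode, 2), round(median, 2), round(mean, 2))  # mode < median < mean
```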
In the above boxplot, you can see that Q2 is closer to Q1. This represents a positively skewed distribution. In terms of quartiles, it can be given by Q3 - Q2 > Q2 - Q1.
In this case, it was very easy to tell if the data is skewed or not. But what if we
have something like this:
Here, Q2-Q1 and Q3-Q2 are equal and yet the distribution is positively skewed.
The keen-eyed among you will have noticed the length of the right whisker is
greater than the left whisker. From this, we can conclude that the data is
positively skewed.
So, the first step is always to check whether Q2 - Q1 and Q3 - Q2 are equal. If they are found to be equal, then we look at the lengths of the whiskers.
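Here's how that two-step check, quartile gaps first and whisker lengths second, might look in code, again assuming 1.5 * IQR whiskers and a simulated positively skewed sample:

```python
import numpy as np
from scipy.stats import skew

def box_summary(x):
    """Quartile gaps and whisker lengths, as a boxplot would draw them."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower = x[x >= q1 - 1.5 * iqr].min()
    upper = x[x <= q3 + 1.5 * iqr].max()
    return {"Q2-Q1": q2 - q1, "Q3-Q2": q3 - q2,
            "left whisker": q1 - lower, "right whisker": upper - q3}

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)  # positively skewed sample

print(round(skew(x), 2))                                    # > 0
print({k: round(v, 3) for k, v in box_summary(x).items()})  # right whisker clearly longer
```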
Understanding Negatively Skewed Distribution
Source: Wikipedia
As you might have already guessed, a negatively skewed distribution is the
distribution with the tail on its left side. The value of skewness for a negatively
skewed distribution is less than zero. You can also see in the above figure that
the mean < median < mode.
In the boxplot, the relationship between the quartiles for negative skewness is given by Q3 - Q2 < Q2 - Q1.
Similar to what we did earlier, if Q3-Q2 and Q2-Q1 are equal, then we look for the
length of whiskers. And if the length of the left whisker is greater than that of the
right whisker, then we can say that the data is negatively skewed.
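And the mirror image for negative skewness, on a sample that is left-skewed by construction (same assumptions as before):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(5)
x = 10 - rng.exponential(scale=1.0, size=100_000)  # left-skewed by construction

q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
left = q1 - x[x >= q1 - 1.5 * iqr].min()   # left whisker length
right = x[x <= q3 + 1.5 * iqr].max() - q3  # right whisker length

print(round(skew(x), 2))                           # < 0
print(round(x.mean(), 2), round(np.median(x), 2))  # mean < median
print(round(left, 2), round(right, 2))             # left whisker longer than the right
```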