Measures of Variability
In statistics, variability, dispersion, and spread are synonyms that denote the width of the
distribution. Just as there are multiple measures of central tendency, there are several
measures of variability. In this blog post, you’ll learn why understanding the variability of
your data is critical. Then, I explore the most common measures of variability—the range,
interquartile range, variance, and standard deviation. I’ll help you determine which one is
best for your data.
The two plots below graphically show the difference between distributions with the same mean but different amounts of dispersion. The panel on the left shows a distribution that is
tightly clustered around the average, while the distribution in the right panel is more
spread out.
Why Understanding Variability is Important
Let’s take a step back and first get a handle on why understanding variability is so
essential. Analysts frequently use the mean to summarize the center of a population or a
process. While the mean is relevant, people often react to variability even more. When a
distribution has lower variability, the values in a dataset are more consistent. However,
when the variability is higher, the data points are more dissimilar and extreme values
become more likely. Consequently, understanding variability helps you grasp the likelihood
of unusual events.
In some situations, extreme values can cause problems! Have you seen a weather report
where the meteorologist shows extreme heat and drought in one area and flooding in
another? It would be nice to average those together! Frequently, we feel discomfort at
the extremes more than the mean. Understanding the variability around the mean provides critical information.
Variability is everywhere. Your commute time to work varies a bit every day. When you
order a favorite dish at a restaurant repeatedly, it isn’t exactly the same each time. The
parts that come off an assembly line might appear to be identical, but they have subtly
different lengths and widths.
These are all examples of real-life variability. Some degree of variation is unavoidable.
However, too much inconsistency can cause problems. If your morning commute takes much
longer than the mean travel time, you will be late for work. If the restaurant dish is much different from how it usually is, you might not like it at all. And, if a manufactured part is too far out of spec, it won’t function as intended.
Some variation is inevitable, but problems occur at the extremes. Distributions with
greater variability produce observations with unusually large and small values more
frequently than distributions with less variability.
To see this in action, imagine two pizza restaurants that both advertise a mean delivery time of 20 minutes. Which one should you order from? The graphs below display each restaurant’s distribution of delivery times and provide the answer. The restaurant with more variable delivery times has the broader distribution curve. I’ve used the same scales in both graphs so you can visually compare the two distributions.
As this example shows, the central tendency doesn’t provide complete information. We
also need to understand the variability around the middle of the distribution to get the
full picture. Now, let’s move on to the different ways of measuring variability!
Range
Let’s start with the range because it is the most straightforward measure of variability to
calculate and the simplest to understand. The range of a dataset is the difference
between the largest and smallest values in that dataset. For example, in the two datasets
below, dataset 1 has a range of 38 – 20 = 18 while dataset 2 has a range of 52 – 11 = 41. Dataset 2 has a broader range and, hence, more variability than dataset 1.
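To make the calculation concrete, here’s a minimal Python sketch. The values are hypothetical stand-ins chosen only to match the minimums and maximums quoted above, since the full datasets aren’t reproduced here:

```python
# Hypothetical values; only the smallest and largest entries matter
# for the range, and they match the figures quoted above.
dataset_1 = [20, 24, 27, 30, 33, 38]
dataset_2 = [11, 18, 25, 34, 46, 52]

def value_range(data):
    """Range = largest value minus smallest value."""
    return max(data) - min(data)

print(value_range(dataset_1))  # 38 - 20 = 18
print(value_range(dataset_2))  # 52 - 11 = 41
```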
While the range is easy to understand, it is based on only the two most extreme values in
the dataset, which makes it very susceptible to outliers. If one of those numbers is
unusually high or low, it affects the entire range even if it is atypical.
Additionally, the size of the dataset affects the range. In general, you are less likely to
observe extreme values. However, as you increase the sample size, you have more
opportunities to obtain these extreme values. Consequently, when you draw random samples from the same population, the range tends to increase as the sample size increases. Therefore, use the range to compare variability only when the sample sizes are similar.
The Interquartile Range (IQR) … and other Percentiles
The interquartile range is the middle half of the data. To visualize it, think about
the median value that splits the dataset in half. Similarly, you can divide the data into
quarters. Statisticians refer to these quarters as quartiles and denote them from low to high as Q1, Q2, Q3, and Q4. The lowest quartile (Q1) contains the quarter of the dataset with the smallest values. The upper quartile (Q4) contains the quarter of the dataset with the highest values. The interquartile range is the middle half of the data that is in between the upper and lower quartiles. In other words, the interquartile range includes the 50% of data points that fall in the two middle quarters, Q2 and Q3. The IQR is the red area in the graph below.
The interquartile range is a robust measure of variability in a similar manner that the
median is a robust measure of central tendency. Neither measure is influenced
dramatically by outliers because they don’t depend on every value. Additionally, the
interquartile range is excellent for skewed distributions, just like the median. As you’ll
learn, when you have a normal distribution, the standard deviation tells you the percentage
of observations that fall specific distances from the mean. However, this doesn’t work for
skewed distributions, and the IQR is a great alternative.
I’ve divided the dataset below into quartiles. The interquartile range (IQR) extends from the low end of Q2 to the upper limit of Q3. For this dataset, it spans the values from 21 to 39, so the IQR is 39 – 21 = 18.
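In practice, you’d let software find the quartiles. Here is a small sketch using NumPy’s percentile function on a made-up dataset (the interpolation method can shift the exact cutoffs slightly between packages):

```python
import numpy as np

# Hypothetical dataset; substitute your own values
data = np.array([15, 18, 21, 24, 27, 30, 33, 36, 39, 42])

# The Q1 and Q3 cutoffs are the 25th and 75th percentiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)  # 21.75 35.25 13.5
```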
Using other percentiles
When you have a skewed distribution, I find that reporting the median with the
interquartile range is a particularly good combination. The interquartile range is equivalent
to the region between the 75th and 25th percentile (75 – 25 = 50% of the data). You can
also use other percentiles to determine the spread of different proportions. For example,
the range between the 97.5th percentile and the 2.5th percentile covers 95% of the data.
The broader these ranges, the higher the variability in your dataset.
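The same call generalizes to any pair of percentiles. For example, a quick sketch of the middle 95% on simulated data:

```python
import numpy as np

# Simulated data: 1,000 draws from a normal distribution
data = np.random.default_rng(seed=1).normal(loc=20, scale=5, size=1000)

# The 2.5th-97.5th percentile range covers the middle 95% of the data
p_low, p_high = np.percentile(data, [2.5, 97.5])
print(p_high - p_low)  # roughly 4 standard deviations, about 20 here
```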
Variance
Variance is the average squared difference of the values from the mean. Unlike the
previous measures of variability, the variance includes all values in the calculation by
comparing each value to the mean. To calculate this statistic, you calculate a set of
squared differences between the data points and the mean, sum them, and then divide by
the number of observations. Hence, it’s the average squared difference.
There are two formulas for the variance depending on whether you are calculating the
variance for an entire population or using a sample to estimate the population variance. The
equations are below, and then I work through an example to help bring it to life.
Population variance
The formula for the variance of an entire population is the following:

σ² = Σ(X − μ)² / N
In the equation, σ² is the population parameter for the variance, μ is the parameter for the population mean, and N is the number of data points, which should include the entire population.
Sample variance
To use a sample to estimate the variance for a population, use the following formula. Using the previous equation with sample data tends to underestimate the variability. Because it’s usually impossible to measure an entire population, statisticians use the equation for the sample variance much more frequently.

s² = Σ(X − M)² / (N − 1)
In the equation, s² is the sample variance, and M is the sample mean. N − 1 in the denominator corrects for the tendency of a sample to underestimate the population variance.
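To see the two denominators in action, here is a small sketch using Python’s built-in statistics module with made-up numbers:

```python
import statistics

values = [46, 69, 32, 60, 52, 41]  # hypothetical data; mean = 50

# Population variance divides the summed squared differences by N
print(statistics.pvariance(values))  # 886 / 6 = 147.67 (approximately)

# Sample variance divides by N - 1 to correct the underestimate
print(statistics.variance(values))   # 886 / 5 = 177.2
```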
Standard Deviation
The standard deviation is the standard or typical difference between each data point and
the mean. When the values in a dataset are grouped closer together, you have a smaller
standard deviation. On the other hand, when the values are spread out more, the standard
deviation is larger because the standard distance is greater.
Conveniently, the standard deviation uses the original units of the data, which makes
interpretation easier. Consequently, the standard deviation is the most widely used
measure of variability. For example, in the pizza delivery example, a standard deviation of
5 indicates that the typical delivery time is plus or minus 5 minutes from the mean. It’s
often reported along with the mean: 20 minutes (s.d. 5).
The standard deviation is just the square root of the variance. Recall that the variance is
in squared units. Hence, the square root returns the value to the natural units. The symbol
for the standard deviation as a population parameter is σ while s represents it as a sample
estimate. To calculate the standard deviation, calculate the variance as shown above, and
then take the square root of it. Voila! You have the standard deviation!
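Continuing the sketch from the variance section, taking the square root is all that’s left, and the statistics module will also do it in one step:

```python
import math
import statistics

values = [46, 69, 32, 60, 52, 41]  # same hypothetical data as above

# The square root of the sample variance...
print(math.sqrt(statistics.variance(values)))  # 13.31 (approximately)

# ...is exactly what statistics.stdev computes directly
print(statistics.stdev(values))                # same value
```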
People often confuse the standard deviation with the standard error of the mean. Both measures assess variability, but they have very different purposes.

When your data follow a normal distribution, the Empirical Rule uses the standard deviation to tell you the percentage of observations that fall within a given distance of the mean:

Standard deviations from the mean    Percentage of data
1                                    68%
2                                    95%
3                                    99.7%
Let’s take another look at the pizza delivery example where we have a mean delivery time
of 20 minutes and a standard deviation of 5 minutes. Using the Empirical Rule, we can use
the mean and standard deviation to determine that 68% of the delivery times will fall between 15 and 25 minutes (20 ± 5) and 95% will fall between 10 and 30 minutes (20 ± 2×5).
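Those intervals are simple arithmetic on the mean and standard deviation. A quick sketch using the pizza example’s numbers:

```python
mean, sd = 20, 5  # pizza delivery example: mean and standard deviation

# Empirical Rule intervals for approximately normal data
for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"{pct} of deliveries: {mean - k * sd} to {mean + k * sd} minutes")
```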
Which Measure of Variability Is Best?
When you are comparing samples that are the same size, consider using the range as the
measure of variability. It’s a reasonably intuitive statistic. Just be aware that a
single outlier can throw the range off. The range is particularly suitable for small samples
when you don’t have enough data to calculate the other measures reliably, and the
likelihood of obtaining an outlier is also lower.
When you have a skewed distribution, the median is a better measure of central tendency,
and it makes sense to pair it with either the interquartile range or other percentile-based
ranges because all of these statistics divide the dataset into groups with specific
proportions.
For normally distributed data, or even data that aren’t terribly skewed, the tried-and-true combination of reporting the mean and the standard deviation is the way to go. This combination is by far the most common. You can still supplement this approach with percentile-based ranges as needed.