Histogram
What is a histogram?
A graphic representation of the frequency distribution of a continuous variable. Rectangles are
drawn in such a way that their bases lie on a linear scale representing different intervals, and
their heights are proportional to the frequencies of the values within each of the intervals.
A histogram is a type of graph that shows the frequency distribution of data within equal
intervals. Because adjacent intervals share a boundary, there are no spaces between the bars.
Break the range of values into intervals and count how many observations fall into each
interval.
It shows the number of values within an interval and not the actual values.
You can change the intervals of the histogram to see which choice gives a better description of
the data.
Organize the data into groups and count how many observations fall into each group.
Groups are also called bins.
Frequency: the number of observations in a group.
Range = Max - Min
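The binning procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the sample values, the helper name bin_counts, and the choice of equal-width bins spanning exactly the range of the data are all assumptions made for the example.

```python
def bin_counts(values, num_bins):
    """Count how many observations fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    data_range = hi - lo                 # Range = Max - Min
    if data_range == 0:
        raise ValueError("all values are equal, so the range is zero")
    width = data_range / num_bins        # every bin has the same width
    counts = [0] * num_bins
    for v in values:
        # Index of the bin containing v; the maximum is placed in the last bin.
        i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    return counts


# Hypothetical observations, made up for illustration.
values = [2, 3, 3, 5, 7, 8, 8, 9, 12, 14]

# Trying several numbers of bins shows how the picture changes with the intervals.
for k in (2, 3, 6):
    print(k, "bins:", bin_counts(values, k))
```

Each list holds frequencies only. The individual values are no longer visible once they have been counted into bins.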
A histogram is used to show the following information about the given data:
center of data
spread of data
shape of data
When describing the shape of a distribution, we should consider:
1. Symmetry/skewness of the distribution.
2. Peakedness (modality): the number of peaks (modes) the distribution has.
We distinguish between:
Symmetric Distributions
Note that all three distributions are symmetric, but are different in their modality
(peakedness).
The first distribution is unimodal: it has one mode (roughly at 10) around which the
observations are concentrated.
The second distribution is bimodal: it has two modes (roughly at 10 and 20) around
which the observations are concentrated.
The third distribution is roughly flat, or uniform. It has no modes, that is, no value
around which the observations are concentrated.
Rather, the observations are roughly uniformly distributed among the different values.
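One rough way to make the idea of modality concrete is to count local peaks in the list of bin frequencies. The sketch below is a crude heuristic under stated assumptions: the function name count_peaks and the three frequency lists are hypothetical, and a bin counts as a peak only if it is strictly taller than both of its neighbours.

```python
def count_peaks(frequencies):
    """Count bins that are strictly taller than both neighbouring bins."""
    # Pad with zeros so that the first and last bins also have two neighbours.
    padded = [0] + list(frequencies) + [0]
    peaks = 0
    for i in range(1, len(padded) - 1):
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]:
            peaks += 1
    return peaks


# Hypothetical bin frequencies, made up for illustration.
unimodal = [1, 3, 7, 3, 1]
bimodal = [1, 6, 2, 6, 1]
flat = [4, 4, 4, 4, 4]

for name, freqs in [("unimodal", unimodal), ("bimodal", bimodal), ("flat", flat)]:
    print(name, count_peaks(freqs))
```

Real histograms are noisy, so small bumps would be counted as peaks and a peak spread over two equally tall bins would be missed. In practice modality is usually judged by eye from the overall shape.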
Skewed Right Distributions
A right skewed distribution is sometimes called a positively skewed distribution. That's because
the tail is longer in the positive direction of the number line.
For a right skewed distribution, the mean is typically greater than the median, and the tail of
the distribution on the right-hand (positive) side is longer than the tail on the left-hand side.
From the box and whisker diagram we can also see that the median is closer to the first quartile
than to the third quartile, and that the right-hand tail is the longer one.
Note that in a skewed right distribution, the bulk of the observations are small/medium, with a
few observations that are much larger than the rest.
An example of a real-life variable that has a skewed right distribution is salary. Most people
earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.)
that are distributed along a large range (long "tail") of higher values.
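The relationship between the mean and the median in a right-skewed distribution can be checked numerically. The sketch below uses made-up salary figures (an assumption for illustration, not data from this text) in which most values are low/medium and one value is far larger than the rest.

```python
from statistics import mean, median

# Hypothetical annual salaries in thousands, made up for illustration.
# Nine values are low/medium; one value forms the long right tail.
salaries = [28, 30, 32, 35, 38, 40, 45, 50, 60, 400]

salary_mean = mean(salaries)      # pulled upward by the single large value
salary_median = median(salaries)  # middle of the sorted values, barely affected

print("mean:", salary_mean)
print("median:", salary_median)
print("mean greater than median:", salary_mean > salary_median)
```

The single large salary moves the mean well above the median, which is the typical pattern for a right-skewed variable.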
A distribution that is skewed left has exactly the opposite characteristics of one that is skewed
right:
Table 1
Skewed Left Distributions
A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is
much longer than the right tail (larger values). Note that in a skewed left distribution, the bulk of
the observations are medium/large, with a few observations that are much smaller than the rest.
An example of a real-life variable that has a skewed left distribution is age of death from natural
causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases
happening at younger ages.
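The direction of skew can also be summarised by a single number. The sketch below computes the moment coefficient of skewness, a standard measure that this text does not itself introduce, so treat it as an added illustration: it is negative for a longer left tail, positive for a longer right tail, and zero for perfectly symmetric data. The function name and all three data sets are hypothetical, and the function assumes the values are not all equal.

```python
from statistics import mean


def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets, made up for illustration.
ages_at_death = [45, 62, 70, 74, 78, 80, 82, 84, 86, 88]   # long left tail
incomes = [1, 2, 3, 4, 20]                                  # long right tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("ages", ages_at_death), ("incomes", incomes), ("symmetric", symmetric)]:
    print(name, round(sample_skewness(data), 3))
```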
Example
State whether each of the following data sets is symmetric, skewed right, or skewed left.
a) A data set with this histogram:
Answer
1. skewed right
2. skewed right
3. skewed left
Center
The center of the distribution is its midpoint: the value that divides the distribution so that
approximately half the observations take smaller values, and approximately half the observations
take larger values. Note that from looking at the histogram we can get only a rough estimate for
the center of the distribution.
As you can see from the histogram, the center of the grades distribution is roughly 70 (7 students
scored below 70, and 8 students scored above 70).
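When the raw values are available, the center can be checked directly instead of being read off the histogram. The sketch below uses a hypothetical list of 15 grades (made up for illustration; these are not the grades behind the histogram discussed here) and a hypothetical helper, split_counts, that counts how many observations lie below and above a candidate center.

```python
from statistics import median


def split_counts(values, center):
    """Return (number of values below center, number of values above center)."""
    below = sum(1 for v in values if v < center)
    above = sum(1 for v in values if v > center)
    return below, above


# Hypothetical grades for 15 students, made up for illustration.
grades = [52, 58, 61, 64, 66, 68, 69, 71, 73, 75, 78, 82, 85, 90, 96]

print("below/above 70:", split_counts(grades, 70))
print("median:", median(grades))
```

A candidate center is reasonable when the two counts are close to each other. The median is the value that balances them as evenly as possible.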
Spread
The spread (also called variability) of the distribution can be described by the approximate range
covered by the data. From looking at the histogram, we can approximate the smallest observation
(min), and the largest observation (max), and thus approximate the range. In our example:
Outliers
Outliers are observations that fall outside the overall pattern.
For example, the following histogram represents a distribution that has a probable high outlier:
Advantages of Histogram
The vertical axis represents the count of items falling into each category.
Disadvantages of Histogram
Exact values are not known, because the data is grouped into categories/groups to draw the
bars.
Multiple histograms can be drawn for the same data, which can make the data difficult to read
and interpret.
Consider the exam scores of a group of students. Defining data classes with an interval of 10
points and counting the number of scores in each data class gives the following frequency
table:
Group Count
0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79
1
80-89 90-99
2
Solution
Group    Frequency
0-9
10-19
20-29
30-39
40-49
50-59
60-69
70-79
80-89
90-99
Histogram chart
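The construction used in this example (classes of 10 points, then a count per class) can be sketched as follows. The scores in the sketch are hypothetical and are not the exam scores from the table, and the function name frequency_table is likewise made up for illustration. The code assumes integer scores from 0 to 99 and prints a simple text version of the histogram chart.

```python
from collections import Counter


def frequency_table(scores, width=10, low=0, high=99):
    """Group integer scores into classes of the given width and count each class."""
    # Class index of every score, e.g. 64 -> 6 for the class 60-69.
    counts = Counter((s - low) // width for s in scores)
    table = []
    for i in range((high - low) // width + 1):
        start = low + i * width
        label = f"{start}-{start + width - 1}"
        table.append((label, counts.get(i, 0)))   # empty classes get a count of 0
    return table


# Hypothetical exam scores, made up for illustration.
scores = [35, 47, 52, 58, 61, 64, 66, 68, 72, 75, 77, 79, 81, 84, 88, 93, 97]

table = frequency_table(scores)
for label, freq in table:
    # One '#' per observation gives a sideways text histogram.
    print(f"{label:>5} | {'#' * freq} ({freq})")
```

Empty classes are kept in the table so that the bars stay on a linear scale, as a histogram requires.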