
Histogram



What is a histogram?
A graphic representation of the frequency distribution of a continuous variable. Rectangles are
drawn in such a way that their bases lie on a linear scale representing different intervals, and
their heights are proportional to the frequencies of the values within each of the intervals.
Histogram

A histogram is a type of graph that shows the frequency distribution of data within equal
intervals (thus, there are no spaces between the bars).

Break the range of values into intervals and count how many observations fall into each
interval.

It was first introduced by Karl Pearson.

In statistics, a histogram is a graphical representation that gives a visual impression of the
distribution of the data.

It shows the number of values within an interval and not the actual values.

You can graph huge data sets easily with histograms.

They are used only for numerical data.

You could change the intervals of the histogram to see which gives a better description of
the data.

Organize the data into groups by counting how many observations fall in each group.

Groups are called bins.

Frequency is the number of observations in a bin.

Max = maximum value of the given data

Min = minimum value of the given data

Range = Max - Min
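
The definitions above can be sketched in a few lines of Python. This is only an illustration: the observations and the choice of three bins are assumptions, not values from the text.

```python
# Hypothetical observations (assumed for illustration only).
data = [10, 12, 15, 21, 22, 27, 30, 34, 38, 40]

data_min = min(data)                # Min
data_max = max(data)                # Max
data_range = data_max - data_min    # Range = Max - Min

num_bins = 3                        # assumed number of bins
bin_width = data_range / num_bins

# Frequency: count how many observations fall into each bin.
frequencies = [0] * num_bins
for value in data:
    index = int((value - data_min) / bin_width)
    index = min(index, num_bins - 1)  # the maximum value goes in the last bin
    frequencies[index] += 1

for i, count in enumerate(frequencies):
    low = data_min + i * bin_width
    high = low + bin_width
    print(f"{low:.0f} to {high:.0f}: {count}")
```

Each bin includes its lower edge and excludes its upper edge, except the last bin, which also includes the maximum.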
A histogram is used to show the following information about the given data:

center of data

spread of data

skewness of the data

presence of outliers (if any); and

presence of multiple modes in the data.

Shape
When describing the shape of a distribution, we should consider:
1. Symmetry/skewness of the distribution.
2. Peakedness (modality): the number of peaks (modes) the distribution has.
We distinguish between:
Symmetric Distributions

Note that all three distributions are symmetric, but are different in their modality
(peakedness).
The first distribution is unimodal: it has one mode (roughly at 10) around which the
observations are concentrated.
The second distribution is bimodal: it has two modes (roughly at 10 and 20) around
which the observations are concentrated.

The third distribution is roughly flat, or uniform. It has no modes, that is, no
value around which the observations are concentrated.
Rather, the observations are roughly uniformly distributed among the different
values.
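
As a rough illustration of modality, the number of peaks can be counted from the bin frequencies. The sketch below treats a bin as a peak when it is strictly taller than both neighbours. That rule and the bin counts are assumptions made for this example; real data with noisy counts would usually need smoothing first.

```python
def count_peaks(counts):
    """Count bins that are strictly taller than both neighbouring bins."""
    peaks = 0
    for i, height in enumerate(counts):
        # Treat a missing neighbour at either end as lower than any count.
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if height > left and height > right:
            peaks += 1
    return peaks

# Hypothetical bin counts for three symmetric shapes.
unimodal = [1, 3, 7, 3, 1]
bimodal = [1, 6, 2, 6, 1]
flat = [4, 4, 4, 4, 4]

for name, counts in [("unimodal", unimodal), ("bimodal", bimodal), ("flat", flat)]:
    print(name, count_peaks(counts))
```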
Skewed Right Distributions
A right skewed distribution is sometimes called a positively skewed distribution. That's because
the tail is longer in the positive direction of the number line.

Right Skewed Histogram


A histogram is right skewed if its peak lies toward the left, leaving a long tail that extends
to the right (the positive direction).

For a right skewed distribution, the mean is typically greater than the median. Also notice that
the tail of the distribution on the right hand (positive) side is longer than on the left hand side.

Right Skewed Box Plot


If a box plot is skewed to the right, the mean will typically be greater than the median. The box
will look as if it was shifted to the left, and the right whisker will be longer.

From the box and whisker diagram we can also see that the median is closer to the first quartile
than the third quartile. The fact that the right hand side tail of the distribution is longer than the
left can also be seen.
Note that in a skewed right distribution, the bulk of the observations are small/medium, with a
few observations that are much larger than the rest.
An example of a real-life variable that has a skewed right distribution is salary. Most people earn
in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) that
are distributed along a large range (long "tail") of higher values.
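
The salary example can be checked numerically. The sketch below uses made-up salaries (in thousands) and Python's standard `statistics` module. The figures are assumptions chosen only to show the pattern described above.

```python
import statistics

# Hypothetical salaries in thousands: most are low/medium, two are very high.
salaries = [28, 30, 32, 35, 36, 38, 40, 42, 45, 50, 120, 300]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
q1, q2, q3 = statistics.quantiles(salaries, n=4)  # first, second, third quartile

print("mean:", round(mean, 1))
print("median:", median)
print("distance from median to Q1:", median - q1)
print("distance from median to Q3:", q3 - median)
```

With these values the mean exceeds the median, and the median sits closer to the first quartile than to the third, as expected for a skewed right distribution.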

A distribution that is skewed left has exactly the opposite characteristics of one that is skewed
right:

the mean is typically less than the median;

the tail of the distribution is longer on the left hand side than on the right hand side; and

the median is closer to the third quartile than to the first quartile.

The table below summarises the different categories visually.


Table 1: Symmetric, skewed right (positive), and skewed left (negative) distributions.
Skewed Left Distributions

A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is
much longer than the right tail (larger values). Note that in a skewed left distribution, the bulk of
the observations are medium/large, with a few observations that are much smaller than the rest.
An example of a real-life variable that has a skewed left distribution is age of death from natural
causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases
happening at younger ages.
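
The terms "negative" and "positive" skew can be made concrete with a skewness coefficient. The text does not define one, so the sketch below assumes the common moment-based formula (third central moment divided by the 1.5 power of the second). The ages are hypothetical.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed measure)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical ages at death: mostly old, with a few much younger.
ages = [35, 58, 66, 71, 74, 77, 79, 81, 83, 85, 87, 90]

# Reflecting the data turns the long left tail into a long right tail.
mirrored = [125 - a for a in ages]

print("skewed left: ", moment_skewness(ages))      # negative value
print("skewed right:", moment_skewness(mirrored))  # positive value
```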

Example
State whether each of the following data sets is symmetric, skewed right or skewed left.
a) A data set with this histogram:

b) A data set with this box and whisker plot:

c) A data set with this frequency polygon:

Answer
a) skewed right
b) skewed right
c) skewed left

Center
The center of the distribution is its midpoint: the value that divides the distribution so that
approximately half the observations take smaller values, and approximately half the observations
take larger values. Note that from looking at the histogram we can get only a rough estimate for
the center of the distribution.

As you can see from the histogram, the center of the grades distribution is roughly 70 (7 students
scored below 70, and 8 students scored above 70).
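
The counts quoted above can be reproduced with a short sketch. The individual grades are not given in the text, so the list below is hypothetical, chosen so that 7 grades fall below 70 and 8 fall above it.

```python
import statistics

# Hypothetical grades for 15 students (assumed; the text gives only the counts).
grades = [45, 52, 58, 61, 64, 66, 68, 71, 73, 75, 78, 82, 85, 90, 95]

center_guess = 70
below = sum(1 for g in grades if g < center_guess)
above = sum(1 for g in grades if g > center_guess)
median = statistics.median(grades)

print("below 70:", below)
print("above 70:", above)
print("median:", median)
```

The exact median of these grades is 71, which agrees with the rough estimate of 70 read from a histogram.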
Spread
The spread (also called variability) of the distribution can be described by the approximate range
covered by the data. From looking at the histogram, we can approximate the smallest observation
(min), and the largest observation (max), and thus approximate the range. In our example:

Outliers
Outliers are observations that fall outside the overall pattern.
For example, the following histogram represents a distribution that has a probable high outlier:
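
One common way to flag such observations numerically, which the text does not prescribe, is the 1.5 x IQR rule: a value counts as a probable outlier if it lies more than 1.5 interquartile ranges below the first quartile or above the third. The sketch below assumes that rule and uses hypothetical data.

```python
import statistics

def iqr_outliers(values):
    """Return values outside the 1.5 x IQR fences (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

# Hypothetical data with one value far above the rest.
data = [52, 55, 57, 58, 60, 61, 63, 64, 66, 68, 120]

print(iqr_outliers(data))
```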

Advantages of Histogram

Represents the frequency distribution of grouped data

Visually strong; gives immediate information

Can be compared to a normal distribution curve when the data set is large

The vertical axis represents the count of items falling into each category

Disadvantages of Histogram

Exact values are not known as the data is grouped into categories/groups to draw the bar
graph

Difficult to compare two data sets

Use only with continuous data

Representation of the frequency distribution depends upon the number of categories
selected.

Multiple histograms can be drawn for the same data making it difficult to read and
interpret.
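
The dependence on the number of categories can be seen by binning the same data twice. The sketch below uses hypothetical data and two assumed bin widths. The helper name `bin_counts` is made up for this example.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per bin [k*width, (k+1)*width), keyed by the bin's lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical observations (assumed for illustration).
data = [3, 7, 12, 14, 18, 21, 25, 26, 33, 38]

wide = bin_counts(data, 20)    # two wide bins
narrow = bin_counts(data, 5)   # eight narrow bins

print("width 20:", wide)
print("width 5: ", narrow)
```

With a width of 20 the data look evenly split between two bins, while a width of 5 shows small peaks at 10 and 25. Both histograms describe the same data.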

Consider the exam scores of a group of students. Defining data classes with an interval of 10
points and counting the number of scores in each class gives the following frequency
table:

Group Count
0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79
1

80-89 90-99
2

Solution
Group

frequency

0-9

10-19

20-29

30-39

40-49

50-59

60-69

70-79

80-89

90-99

Histogram chart
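
A text version of such a chart can be sketched as follows. The scores are hypothetical, since the counts in the frequency table above are incomplete, and the sketch assumes every score lies between 0 and 99.

```python
# Hypothetical exam scores (assumed; not the data from the example).
scores = [42, 55, 58, 61, 64, 67, 68, 72, 73, 75, 77, 79, 81, 84, 88, 93]

# Ten classes of width 10: 0-9, 10-19, ..., 90-99.
frequency = [0] * 10
for score in scores:
    frequency[score // 10] += 1

# Print one row per class, with one '#' per score in that class.
for i, count in enumerate(frequency):
    label = f"{i * 10}-{i * 10 + 9}"
    print(f"{label:>5} | {'#' * count} ({count})")
```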
