Module Part 2 Frequency Distribution and Graphs
OUTLINE
Introduction
2–1 Organizing Data
2–2 Histograms, Frequency Polygons, and Ogives
2–3 Other Types of Graphs
Summary
OBJECTIVES
1. Organize data using a frequency distribution.
2. Represent data in frequency distributions graphically, using histograms, frequency polygons, and ogives.
3. Represent data using bar graphs, Pareto charts, time series graphs, pie graphs, and dot plots.
4. Draw and interpret a stem and leaf plot.
Introduction
When conducting a statistical study, the researcher must gather data for the particular variable under
study. For example, if a researcher wishes to study the number of people who were bitten by poisonous snakes
in a specific geographic area over the past several years, he or she has to gather the data from various doctors,
hospitals, or health departments.
To describe situations, draw conclusions, or make inferences about events, the researcher must
organize the data in some meaningful way. The most convenient method of organizing data is to construct a
frequency distribution.
After organizing the data, the researcher must present them so they can be understood by those who
will benefit from reading the study. The most useful method of presenting the data is by constructing statistical
charts and graphs. There are many different types of charts and graphs, and each one has a specific purpose.
This chapter explains how to organize data by constructing frequency distributions and how to present
the data by constructing charts and graphs. The charts and graphs illustrated here are histograms, frequency
polygons, ogives, pie graphs, Pareto charts, and time series graphs. A graph that combines the characteristics of
a frequency distribution and a histogram, called a stem and leaf plot, is also explained.
2–1 Organizing Data
Since little information can be obtained from looking at raw data, the researcher organizes the data into what is
called a frequency distribution.
A frequency distribution is the organization of raw data in table form, using classes and frequencies.
Each raw data value is placed into a quantitative or qualitative category called a class. The frequency of a class
then is the number of data values contained in a specific class. A frequency distribution is shown for the
preceding data set.
Now some general observations can be made from looking at the frequency distribution. For example, it can be
stated that the majority of the wealthy people in the study are 45 years old or older.
The classes in this distribution are 27–35, 36–44, etc. These values are called class limits. The data
values 27, 28, 29, 30, 31, 32, 33, 34, 35 can be tallied in the first class; 36, 37, 38, 39, 40, 41, 42, 43, 44 in the
second class; and so on. Two types of frequency distributions that are most often used are the categorical
frequency distribution and the grouped frequency distribution. The procedures for constructing these
distributions are shown now.
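As a rough illustration, the tallying just described can be sketched in Python. The ages listed here are hypothetical, the function name is made up for this sketch, and the classes after 36–44 are assumed to continue with the same width of 9.

```python
from collections import Counter

def grouped_frequencies(values, first_lower, width, num_classes):
    """Tally whole-number values into equal-width classes.

    Returns a list of ((lower_limit, upper_limit), frequency) pairs.
    Values that fall outside the listed classes are silently ignored,
    so the caller must supply enough classes to cover the data.
    """
    # Index of the class each value falls into (0 = first class).
    counts = Counter((v - first_lower) // width for v in values)
    table = []
    for i in range(num_classes):
        lower = first_lower + i * width
        upper = lower + width - 1
        table.append(((lower, upper), counts.get(i, 0)))
    return table

# Hypothetical ages, used only to show the tallying.
ages = [29, 35, 36, 44, 47, 52, 61, 45, 38, 70]
for (lower, upper), freq in grouped_frequencies(ages, 27, 9, 5):
    print(f"{lower}-{upper}: {freq}")
```

Each value lands in exactly one class because the class index is computed from the lower limit of the first class and the class width.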
If the data are in tenths, such as 6.2, 7.8, and 12.6, the limits for a class hypothetically might be
7.8–8.8, and the boundaries for that class would be 7.75–8.85. Find these values by subtracting 0.05 from 7.8
and adding 0.05 to 8.8. Class boundaries are not always included in frequency distributions; however, they give
a more formal approach to the procedure of organizing data, including the fact that sometimes the data have
been rounded. You should be familiar with boundaries since you may encounter them in a statistical study.
Finally, the class width for a class in a frequency distribution is found by subtracting the lower (or
upper) class limit of one class from the lower (or upper) class limit of the next class. For example, the class
width in the preceding distribution of blood glucose levels is 7, found from 65 − 58 = 7. The class width can
also be found by subtracting the lower boundary from the upper boundary for any given class. In this case,
64.5 − 57.5 = 7. Note: Do not subtract the limits of a single class. It will result in an incorrect answer
(here, 64 − 58 = 6 rather than 7).
The researcher must decide how many classes to use and the width of each class.
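The boundary and width arithmetic above can be checked with a short Python sketch. The function names are made up for this illustration, and the `unit` argument (1 for whole numbers, 0.1 for tenths) is an assumption about how the precision of the data is supplied. Decimal arithmetic is used so that values such as 7.75 come out exactly.

```python
from decimal import Decimal

def class_boundaries(lower_limit, upper_limit, unit):
    """Return (lower_boundary, upper_boundary) for one class.

    Arguments are passed as strings so Decimal keeps them exact.
    Half of the measurement unit is subtracted from the lower limit
    and added to the upper limit.
    """
    half = Decimal(unit) / 2
    return Decimal(lower_limit) - half, Decimal(upper_limit) + half

def class_width(lower_limit, next_lower_limit):
    """Width = lower limit of the next class minus lower limit of this class."""
    return Decimal(next_lower_limit) - Decimal(lower_limit)

low, high = class_boundaries("7.8", "8.8", "0.1")
print(f"Boundaries for 7.8-8.8: {low} to {high}")

low, high = class_boundaries("58", "64", "1")
print(f"Boundaries for 58-64: {low} to {high}")
print(f"Width from limits: {class_width('58', '65')}")
print(f"Width from boundaries: {high - low}")
```

Both ways of finding the width agree, while subtracting the limits of the single class 58–64 would give 6.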
To construct a frequency distribution, follow these rules:
1. There should be between 5 and 20 classes. Although there is no hard-and-fast rule for the number
of classes contained in a frequency distribution, it is of utmost importance to have enough classes
to present a clear description of the collected data.
2. It is preferable but not absolutely necessary that the class width be an odd number. This ensures
that the midpoint of each class has the same place value as the data. The class midpoint Xm is
obtained by adding the lower and upper boundaries and dividing by 2, or adding the lower and
upper limits and dividing by 2:
Xm = (lower boundary + upper boundary) / 2 or Xm = (lower limit + upper limit) / 2
The midpoint is the numeric location of the center of the class. Midpoints are necessary for
graphing (see Section 2–2). If the class width is an even number, the midpoint is in tenths. For
example, if the class width is 6 and the boundaries are 5.5 and 11.5, the midpoint is (5.5 + 11.5) / 2 = 8.5.
Rule 2 is only a suggestion, and it is not rigorously followed, especially when a computer is used
to group data.
3. The classes must be mutually exclusive. Mutually exclusive classes have nonoverlapping class
limits so that data cannot be placed into two classes. Many times, frequency distributions such as
this
Age
10–20
20–30
30–40
40–50
are found in the literature or in surveys. If a person is 40 years old, into which class should she or
he be placed? A better way to construct a frequency distribution is to use classes such as
Age
10–20
21–31
32–42
43–53
Recall that boundaries are mutually exclusive. For example, when a class boundary is 5.5 to 10.5,
the data values that are included in that class are values from 6 to 10. A data value of 5 goes into
the previous class, and a data value of 11 goes into the next-higher class.
4. The classes must be continuous. Even if there are no values in a class, the class must be included in
the frequency distribution. There should be no gaps in a frequency distribution. The only exception
occurs when the class with a zero frequency is the first or last class. A class with a zero frequency
at either end can be omitted without affecting the distribution.
5. The classes must be exhaustive. There should be enough classes to accommodate all the data.
6. The classes must be equal in width. This avoids a distorted view of the data. One exception occurs
when a distribution has a class that is open-ended. That is, the first class has no specific lower
limit, or the last class has no specific upper limit. A frequency distribution with an open-ended
class is called an open-ended distribution. Here are two examples of distributions with
open-ended classes.
The frequency distribution for age is open-ended for the last class, which means that anybody who is 54 years
or older will be tallied in the last class. The distribution for minutes is open-ended for the first class, meaning
that any minute values below 110 will be tallied in that class.
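Some of the rules above lend themselves to a mechanical check. The following Python sketch, with made-up function names, computes a class midpoint and tests a list of whole-number class limits for overlaps (rule 3), gaps (rule 4), and unequal widths (rule 6). It assumes whole-number data and classes listed in ascending order, and it does not handle open-ended classes.

```python
def midpoint(lower, upper):
    """Class midpoint; works with either the two limits or the two boundaries."""
    return (lower + upper) / 2

def check_classes(limits):
    """Check whole-number class limits given as (lower, upper) pairs.

    Returns a list of problem descriptions; an empty list means the
    classes are nonoverlapping, continuous, and equal in width.
    """
    problems = []
    for (lo1, hi1), (lo2, hi2) in zip(limits, limits[1:]):
        if lo2 <= hi1:
            problems.append(f"{lo1}-{hi1} and {lo2}-{hi2} overlap")
        elif lo2 > hi1 + 1:
            problems.append(f"gap between {lo1}-{hi1} and {lo2}-{hi2}")
    # Width is measured between consecutive lower limits, as in the text.
    widths = {lo2 - lo1 for (lo1, _), (lo2, _) in zip(limits, limits[1:])}
    if len(widths) > 1:
        problems.append("class widths are not equal")
    return problems

print(midpoint(5.5, 11.5))
print(check_classes([(10, 20), (20, 30), (30, 40), (40, 50)]))
print(check_classes([(10, 20), (21, 31), (32, 42), (43, 53)]))
```

The first age distribution is flagged three times because each class shares an endpoint with the next one; the second passes all three checks.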
The steps for constructing a grouped frequency distribution are summarized in the following Procedure Table.
Example 2–2 shows the procedure for constructing a grouped frequency distribution, i.e., when the classes
contain more than one data value.
Sometimes it is necessary to use a cumulative frequency distribution. A cumulative frequency
distribution is a distribution that shows the number of data values less than or equal to a specific value
(usually an upper boundary). The values are found by adding the frequencies of the classes less than or
equal to the upper class boundary of a specific class. This gives an ascending cumulative frequency. In
this example, the cumulative frequency for the first class is 0 + 2 = 2; for the second class it is 0 + 2 +
8 = 10; for the third class it is 0 + 2 + 8 + 18 = 28. Naturally, a shorter way to do this would be to just
add the cumulative frequency of the class below to the frequency of the given class. For example, the
cumulative frequency for the number of data values less than 114.5 can be found by adding 10 + 18 =
28. The cumulative frequency distribution for the data in this example is as follows:
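The running-total shortcut described above can be sketched in Python with `itertools.accumulate`. Only the first three class frequencies quoted in the text (2, 8, and 18) are used here.

```python
from itertools import accumulate

# Frequencies of the first three classes, as quoted in the text.
frequencies = [2, 8, 18]

# Each cumulative frequency is the previous total plus the current frequency.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [2, 10, 28]
```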
Cumulative frequencies are used to show how many data values are accumulated up to and including a
specific class. In Example 2–2, 28 of the record high temperatures are less than or equal to 114°F, and
48 of the record high temperatures are less than or equal to 124°F.
After the raw data have been organized into a frequency distribution, they are analyzed by
looking for peaks and extreme values. The peaks show which class or classes have the most data values
compared to the other classes. Extreme values, called outliers, are data values that are extremely large or
small relative to the other data values.
When the range of the data values is relatively small, a frequency distribution can be
constructed using single data values for each class. This type of distribution is called an ungrouped
frequency distribution and is shown next.
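A minimal Python sketch of an ungrouped frequency distribution follows. The data values are hypothetical and the function name is made up for this illustration. It assumes whole-number data, and values with a frequency of zero inside the range are kept so that the distribution has no gaps.

```python
from collections import Counter

def ungrouped_frequencies(values):
    """Return (value, frequency) pairs for every whole number in the data range."""
    counts = Counter(values)
    # Counter returns 0 for values that never occur, so gaps show up as zeros.
    return [(v, counts[v]) for v in range(min(values), max(values) + 1)]

# Hypothetical data with a small range.
data = [12, 14, 12, 15, 12, 14, 17]
for value, freq in ungrouped_frequencies(data):
    print(f"{value}: {freq}")
```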
All the different types of distributions are used in statistics and are helpful when one is organizing and
presenting data. The reasons for constructing a frequency distribution are as follows:
1. To organize the data in a meaningful, intelligible way.
2. To enable the reader to determine the nature or shape of the distribution.
3. To facilitate computational procedures for measures of average and spread.
4. To enable the researcher to draw charts and graphs for the presentation of data.
5. To enable the reader to make comparisons among different data sets.
The factors used to analyze a frequency distribution are essentially the same as those used to analyze
histograms and frequency polygons, which are shown in Section 2–2.
1. Were the data obtained from a population or a sample? Explain your answer.
2. What was the age of the oldest President?
3. What was the age of the youngest President?
4. Construct a frequency distribution for the data. (Use your own judgment as to the number of classes
and class size.)
5. Are there any peaks in the distribution?
6. Identify any possible outliers.
7. Write a brief summary of the nature of the data as shown in the frequency distribution.
2–2 Histograms, Frequency Polygons, and Ogives
After you have organized the data into a frequency distribution, you can present them in graphical
form. The purpose of graphs in statistics is to convey the data to the viewers in pictorial form. It is easier for
most people to comprehend the meaning of data presented graphically than data presented numerically in tables
or frequency distributions. This is especially true if the users have little or no statistical knowledge.
Statistical graphs can be used to describe the data set or to analyze it. Graphs are also useful in getting
the audience’s attention in a publication or a speaking presentation. They can be used to discuss an issue,
reinforce a critical point, or summarize a data set. They can also be used to discover a trend or pattern in a
situation over a period of time.
The three most commonly used graphs in research are
1. The histogram.
2. The frequency polygon.
3. The cumulative frequency graph, or ogive (pronounced o-jive).
The steps for constructing the histogram, frequency polygon, and the ogive are summarized in the
procedure table.
The Histogram
The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a
class is 0) of various heights to represent the frequencies of the classes.
The Frequency Polygon
Another way to represent the same data set is by using a frequency polygon.
The frequency polygon is a graph that displays the data by using lines that connect points plotted for the
frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
Example 2–5 shows the procedure for constructing a frequency polygon. Be sure to begin and end on the
x-axis.
The frequency polygon and the histogram are two different ways to represent the same data set. The choice of
which one to use is left to the discretion of the researcher.
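The points that a frequency polygon connects can be computed as in the following Python sketch. The function name is made up, and the inputs are assumptions for illustration: a first lower boundary of 99.5, a class width of 5, and the three frequencies 2, 8, and 18. One empty class is added at each end so that the polygon begins and ends on the x-axis.

```python
def polygon_points(first_lower_boundary, width, frequencies):
    """Return (midpoint, frequency) pairs for a frequency polygon.

    An extra class with frequency 0 is added before the first class and
    after the last class so the polygon starts and ends on the x-axis.
    """
    points = []
    for i in range(-1, len(frequencies) + 1):
        mid = first_lower_boundary + i * width + width / 2
        freq = frequencies[i] if 0 <= i < len(frequencies) else 0
        points.append((mid, freq))
    return points

# Assumed boundaries and width; frequencies for three classes.
for mid, freq in polygon_points(99.5, 5, [2, 8, 18]):
    print(f"midpoint {mid}: frequency {freq}")
```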
The Ogive
The third type of graph that can be used represents the cumulative frequencies for the classes. This type of
graph is called the cumulative frequency graph, or ogive. The cumulative frequency is the sum of the
frequencies accumulated up to the upper boundary of a class in the distribution.
The ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution.
Example 2–6 shows the procedure for constructing an ogive. Be sure to start on the x-axis.
Cumulative frequency graphs are used to visually represent how many values are below a certain
upper class boundary. For example, to find out how many record high temperatures are less than 114.5°F,
locate 114.5°F on the x-axis, draw a vertical line up until it intersects the graph, and then draw a horizontal
line at that point to the y-axis. The y-axis value is 28, as shown in Figure 2–5.
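The points of an ogive can be computed in the same spirit. In this Python sketch the function name is made up, and the first lower boundary (99.5) and class width (5) are assumptions chosen so that the third upper boundary is 114.5; the frequencies 2, 8, and 18 are the ones quoted earlier. The graph starts on the x-axis at the lower boundary of the first class.

```python
from itertools import accumulate

def ogive_points(first_lower_boundary, width, frequencies):
    """Return (upper_boundary, cumulative_frequency) pairs for an ogive."""
    # Start on the x-axis: nothing lies below the first lower boundary.
    points = [(first_lower_boundary, 0)]
    for i, cum in enumerate(accumulate(frequencies), start=1):
        points.append((first_lower_boundary + i * width, cum))
    return points

points = ogive_points(99.5, 5, [2, 8, 18])
print(points)

# Reading the graph at an upper boundary is a simple lookup.
print(dict(points)[114.5])  # 28
```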
2–3 Other Types of Graphs
In addition to the histogram, the frequency polygon, and the ogive, several other types of graphs are often
used in statistics. They are the bar graph, Pareto chart, time series graph, pie graph, and the dotplot. Figure
2–8 shows an example of each type of graph.
Bar Graphs
When the data are qualitative or categorical, bar graphs can be used to represent the data.
A bar graph can be drawn using either horizontal or vertical bars. A bar graph represents the data by using
vertical or horizontal bars whose heights or lengths represent the frequencies of the data.
Bar graphs can also be used to compare data for two or more groups. These types of bar graphs are called
compound bar graphs. Consider the following data for the number (in millions) of never married adults in
the United States.
Figure 2–10 shows a bar graph that compares the number of never married males with the number
of never married females for the years shown. The comparison is made by placing the bars next to each other
for the specific years. The heights of the bars can be compared. This graph shows that there have consistently
been more never married males than never married females and that the difference in the two groups has
increased slightly over the last 50 years.
Pareto Charts
When the variable displayed on the horizontal axis is qualitative or categorical, a Pareto chart can also be
used to represent the data.
A Pareto chart is used to represent a frequency distribution for a categorical variable, and the frequencies
are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.
When you analyze a Pareto chart, make comparisons by looking at the heights of the bars.
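The ordering that distinguishes a Pareto chart from an ordinary bar graph can be sketched in Python. The categories and counts below are hypothetical, and the text bars only stand in for a drawn chart.

```python
def pareto_order(frequencies):
    """Return (category, frequency) pairs sorted from highest to lowest frequency."""
    return sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

# Hypothetical categorical data.
counts = {"A": 12, "B": 30, "C": 7, "D": 21}
for category, freq in pareto_order(counts):
    print(f"{category:<3}{'#' * freq} {freq}")
```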
Time Series Graphs
A time series graph represents data that occur over a specific period of time.
Example 2–10 shows the procedure for constructing a time series graph.
When you analyze a time series graph, look for a trend or pattern that occurs over the time period. For
example, is the line ascending (indicating an increase over time) or descending (indicating a decrease over
time)? Another thing to look for is the slope, or steepness, of the line. A line that is steep over a specific time
period indicates a rapid increase or decrease over that period.
Two or more data sets can be compared on the same graph, called a compound time series graph, if
two or more lines are used, as shown in Figure 2–13. This graph shows the percentage of elderly males and
females in the U.S. labor force from 1960 to 2010. It shows that the percentage of elderly men decreased
significantly from 1960 to 1990 and then increased slightly after that. For the elderly females, the percentage
decreased slightly from 1960 to 1980 and then increased from 1980 to 2010.
Pie Graphs
A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in
each category of the distribution.
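The size of each wedge follows directly from the frequencies. The Python sketch below uses hypothetical counts and a made-up function name. It converts each frequency to a percentage of the total and, since a full circle has 360 degrees, to the corresponding central angle.

```python
def pie_wedges(frequencies):
    """Return {category: (percent, degrees)} for a pie graph."""
    total = sum(frequencies.values())
    return {
        category: (100 * freq / total, 360 * freq / total)
        for category, freq in frequencies.items()
    }

# Hypothetical categorical data.
counts = {"A": 5, "B": 15, "C": 20, "D": 10}
for category, (percent, degrees) in pie_wedges(counts).items():
    print(f"{category}: {percent:.0f}% of the circle, {degrees:.0f} degrees")
```

The percentages should add up to 100 and the angles to 360, apart from rounding.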
Dotplots
A dotplot uses points or dots to represent the data values. If the data values occur more than once, the
corresponding points are plotted above one another.
A dotplot is a statistical graph in which each data value is plotted as a point (dot) above the horizontal axis.
Dotplots are used to show how the data values are distributed and to see if there are any extremely high or
low data values.
Stem and Leaf Plots
The stem and leaf plot is a method of organizing data and is a combination of sorting and graphing. It has the
advantage over a grouped frequency distribution of retaining the actual data while showing them in graphical
form.
A stem and leaf plot is a data plot that uses part of the data value as the stem and part of the data value as
the leaf to form groups or classes. For example, a data value of 34 would have 3 as the stem and 4 as the leaf.
A data value of 356 would have 35 as the stem and 6 as the leaf.
Example 2–14 shows the procedure for constructing a stem and leaf plot.
Figure 2–17 shows that the distribution peaks in the center and that there are no gaps in the data. For 7 of the
20 days, the number of patients receiving cardiograms was between 31 and 36. The plot also shows that the
testing center treated from a minimum of 2 patients to a maximum of 57 patients in any one day.
If there are no data values in a class, you should write the stem number and leave the leaf row
blank. Do not put a zero in the leaf row.
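A stem and leaf plot can be produced with a few lines of Python. This is a minimal sketch with a made-up function name and hypothetical data. It assumes nonnegative whole numbers, takes the last digit as the leaf and the remaining digits as the stem, and leaves the leaf row blank for a stem with no data values.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot as strings."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # 34 -> stem 3, leaf 4; 356 -> stem 35, leaf 6
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # A stem with no data keeps its row, but the leaf part stays blank.
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

# Hypothetical data.
for row in stem_and_leaf([2, 13, 14, 31, 32, 36, 57, 51]):
    print(row)
```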
When you analyze a stem and leaf plot, look for peaks and gaps in the distribution. See if the distribution is
symmetric or skewed. Check the variability of the data by looking at the spread.
Related distributions can be compared by using a back-to-back stem and leaf plot. The back-to-back
stem and leaf plot uses the same digits for the stems of both distributions, but the digits that are used for the
leaves are arranged in order out from the stems on both sides. Example 2–16 shows a back-to-back stem and
leaf plot.
Stem and leaf plots are part of the techniques called exploratory data analysis. More information on this
topic is presented in Chapter 3.
Misleading Graphs
Graphs give a visual representation that enables readers to analyze and interpret data more easily than they
could simply by looking at numbers. However, inappropriately drawn graphs can misrepresent the data and
lead the reader to false conclusions. For example, a car manufacturer’s ad stated that 98% of the vehicles it
had sold in the past 10 years were still on the road. The ad then showed a graph similar to the one in
Figure 2–20. The graph shows the percentage of the manufacturer’s automobiles still on the road and the
percentage of its competitors’ automobiles still on the road. Is there a large difference? Not necessarily.
Notice the scale on the vertical axis in Figure 2–20. It has been cut off (or truncated) and starts at
95%. When the graph is redrawn using a scale that goes from 0 to 100%, as in Figure 2–21, there is hardly a
noticeable difference in the percentages. Thus, changing the units or the starting point on the y axis can
convey a very different visual representation of the data.
It is not wrong to truncate an axis of the graph; many times it is necessary to do so. However, the reader
should be aware of this fact and interpret the graph accordingly. Do not be misled if an inappropriate
impression is given.
Let us consider another example. The projected required fuel economy in miles per gallon for
General Motors vehicles is shown. In this case, an increase from 21.9 to 23.2 miles per gallon is projected.
When you examine the graph shown in Figure 2–22(a), using a scale of 0 to 25 miles per gallon, the graph
shows a slight increase. However, when the scale is changed to 21 to 24 miles per gallon, the graph shows a
much larger increase even though the data remain the same. See Figure 2–22(b). Again, by changing the
units or starting point on the y axis, one can change the visual representation.
Another misleading graphing technique sometimes used involves exaggerating a one-dimensional increase
by showing it in two dimensions. For example, the average cost of a 30-second Super Bowl commercial has
increased from $42,000 in 1967 to $4.5 million in 2015 (Source: USA TODAY).
The increase shown by the graph in Figure 2–23(a) represents the change by a comparison of the
heights of the two bars in one dimension. The same data are shown two-dimensionally with circles in Figure
2–23(b). Notice that the difference seems much larger because the eye is comparing the areas of the circles
rather than the lengths of the diameters.
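The size of the exaggeration can be estimated with a little arithmetic. The Python sketch below assumes that the circles are drawn with diameters proportional to the two costs, which the figure may or may not do exactly. Under that assumption the area ratio is the square of the cost ratio.

```python
cost_1967 = 42_000
cost_2015 = 4_500_000

# One-dimensional comparison: ratio of bar heights (or circle diameters).
height_ratio = cost_2015 / cost_1967

# Two-dimensional comparison: the area of a circle grows with the square
# of its diameter, so the visual ratio is the height ratio squared.
area_ratio = height_ratio ** 2

print(f"Ratio of costs (heights or diameters): about {height_ratio:.1f} to 1")
print(f"Ratio of circle areas: about {area_ratio:.0f} to 1")
```

An increase of roughly 107 times in cost appears as an increase of more than 11,000 times in area.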
Note that it is not wrong to use the graphing techniques of truncating the scales or representing
data by two-dimensional pictures. But when these techniques are used, the reader should be cautious of the
conclusion drawn on the basis of the graphs. Another way to misrepresent data on a graph is by omitting
labels or units on the axes of the graph. The graph shown in Figure 2–24 compares the cost of living,
economic growth, population growth, etc., of four main geographic areas in the United States.
However, since there are no numbers on the y axis, very little information can be gained from this
graph, except a crude ranking of each factor. There is no way to decide the actual magnitude of the
differences.
Finally, all graphs should contain a source for the information presented. The inclusion of a source
for the data will enable you to check the reliability of the organization presenting the data.
Applying the Concepts
Causes of Accidental Deaths in the United States, 1999–2009
The graph shows the number of deaths in the United States due to accidents. Answer the following questions
about the graph.
Determine whether each statement is true or false. If the statement is false, explain why.
1. In the construction of a frequency distribution, it is a good idea to have overlapping class limits, such
as 10–20, 20–30, 30–40.
2. Bar graphs can be drawn by using vertical or horizontal bars.
3. It is not important to keep the width of each class the same in a frequency distribution.
4. Frequency distributions can aid the researcher in drawing charts and graphs.
5. The type of graph used to represent data is determined by the type of data collected and by the
researcher’s purpose.
6. In construction of a frequency polygon, the class limits are used for the x axis.
7. Data collected over a period of time can be graphed by using a pie graph.
Select the best answer.
8. What is another name for the ogive?
a. Histogram b. Frequency polygon c. Cumulative frequency graph d. Pareto chart
9. What are the boundaries for 8.6–8.8?
a. 8–9 b. 8.5–8.9 c. 8.55–8.85 d. 8.65–8.75
10. What graph should be used to show the relationship between the parts and the whole?
a. Histogram b. Pie graph c. Pareto chart d. Ogive
11. Except for rounding errors, relative frequencies should add up to what sum?
a. 0 b. 1 c. 50 d. 100