Chapter 3
Histograms
❑ The height of each bar represents the frequency or the relative frequency within that
particular interval.
Class Interval    Frequency    Relative Frequency
100 to <115            2             0.025
115 to <130           10             0.127
130 to <145           21             0.266
145 to <160           15             0.190
160 to <175           15             0.190
175 to <190            8             0.101
190 to <205            3             0.038
205 to <220            1             0.013
220 to <235            2             0.025
235 to <250            2             0.025
Total                 79             1.000
Bin Size: The width of each interval when the entire span of values is sliced into equal-width intervals.
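Equal-width binning can be sketched in a few lines of Python. The helper name `bin_counts` and the sample weights are hypothetical and chosen only for illustration; the starting point (100) and bin width (15) mirror the class intervals in the table.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width interval.

    Interval k runs from start + k*width up to, but not including,
    start + (k+1)*width, matching the "100 to <115" convention.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the interval that holds v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical weights (pounds), not the class data set.
sample_weights = [103, 118, 121, 129, 130, 144, 145, 159, 160, 249]
counts = bin_counts(sample_weights, start=100, width=15, n_bins=10)
print(counts)
```

A value that sits exactly on a boundary, such as 130, is counted in the interval that starts there, not the one that ends there.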
In a histogram, the class frequencies are represented by bars. The height of each bar corresponds to its
class frequency. A histogram of these data is shown below:
The histogram makes it plain that most of the scores are in the middle of the distribution, with fewer
scores in the extremes. You can also see that the distribution is not symmetric; the scores extend to the
right farther than they do on the left. The distribution is therefore said to be skewed. (We'll have more to
say about the shape of distributions in our discussion on summarizing distributions.)
a) Estimate the percentage of students weighing between 130 and 175 pounds.
b) What percentage of students were less than 130 pounds or more than 205?
c) What percentage of students were less than 130 pounds and more than 205?
If we wanted to create a relative frequency histogram, we would replace the frequencies on the y-axis
with percentages.
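As a rough illustration, the conversion from frequencies to relative frequencies and percentages might look like this in Python. The frequencies are copied from the weight table; the variable names are only illustrative.

```python
# Class frequencies from the weight table (100 to <115, ..., 235 to <250).
frequencies = [2, 10, 21, 15, 15, 8, 3, 1, 2, 2]

total = sum(frequencies)                        # number of students
relative = [f / total for f in frequencies]     # proportions that add to 1
percentages = [round(100 * r, 1) for r in relative]

# One row per class interval: frequency, relative frequency, percentage.
for f, r, p in zip(frequencies, relative, percentages):
    print(f"{f:>3}  {r:.3f}  {p:>5}%")
```

The bar heights change scale but the shape of the histogram stays the same, because every frequency is divided by the same total.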
Stem Plots
*Good for small data sets; a very quick way to see the distribution of the data.
Guidelines For Constructing Regular Stem Plots (Stem and Leaf Display)
1. Put all of your data in order from the smallest to the largest value.
2. Separate observations into stem (all but rightmost digit) and leaf (final digit).
3. Write stems in a vertical column; draw a vertical line to the right of the column.
4. Write the leaves to the right of the appropriate stem, in increasing order (round or truncate data if
necessary)
10 | 3
11 | 37
12 | 011444555
13 | 000000455589
14 | 000000000555
15 | 000000555567
16 | 000005558
17 | 0000005555
18 | 0358
19 | 5
20 | 00
21 | 0
22 | 55
23 | 79
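The four guidelines can be sketched in Python as follows. This is a minimal illustration for non-negative whole numbers; the function name `stem_plot` and the sample values are hypothetical.

```python
from collections import defaultdict


def stem_plot(values):
    """Return the lines of a regular stem plot for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):            # step 1: order the data
        stem, leaf = divmod(v, 10)      # step 2: split off the final digit
        leaves[stem].append(str(leaf))
    lines = []
    # step 3: list every stem from smallest to largest, including empty ones
    for stem in range(min(leaves), max(leaves) + 1):
        # step 4: leaves were appended in increasing order
        lines.append(f"{stem:>2} | {''.join(leaves[stem])}".rstrip())
    return lines


# Hypothetical values, not the class weights.
for line in stem_plot([103, 113, 117, 120, 121, 121, 135, 140]):
    print(line)
```

Stems with no leaves are still printed so that gaps in the data remain visible.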
Example: Minitab Output of Weights of College Students from example #1, separated by gender.
Female | Stem | Male
     3 |  10  |
     3 |  11  | 7
554410 |  12  | 145
 95000 |  13  | 0004558
   000 |  14  | 000000555
 75000 |  15  | 0005556
     0 |  16  | 00005558
     0 |  17  | 000005555
       |  18  | 0358
     5 |  19  |
     0 |  20  | 0
       |  21  | 0
       |  22  | 55
       |  23  | 79
**It is reasonable to separate the weights by gender because women tend to weigh less than men on average.
Split Stem Plots:
Follow the guidelines for a regular stem plot. The difference is you will write each stem value twice and
the leaves will be separated into two different categories (Low and High). The first set of leaves will
consist of digits from 0-4 (L) and the second set of leaves will consist of digits from 5-9 (H).
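A split stem plot can be sketched the same way. The function name `split_stem_plot` and the sample scores are hypothetical and are not the temperature data from Example #2.

```python
def split_stem_plot(values):
    """Return the lines of a split stem plot for non-negative integers.

    Each stem appears twice: an L row for leaves 0-4 and an H row for leaves 5-9.
    """
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = "L" if leaf <= 4 else "H"
        rows.setdefault((stem, half), []).append(str(leaf))
    stems = [stem for stem, _ in rows]
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        for half in ("L", "H"):
            leaves = "".join(rows.get((stem, half), []))
            lines.append(f"{stem:>2}{half} | {leaves}".rstrip())
    return lines


# Hypothetical exam scores.
for line in split_stem_plot([52, 57, 58, 61, 63, 64, 66, 70]):
    print(line)
```

Splitting the stems spreads the same data over twice as many rows, which helps when a regular stem plot piles too many leaves onto a few stems.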
Example #2: The following data represent the mean August temperature (Fahrenheit) in 20 U.S. cities:
64 64 68 69 70 71 71 72 74 75
76 76 76 77 81 82 82 83 85 98
a) Construct a split-stem plot.
b) What percent of the 20 cities have a mean August temperature in the 80s?
Describing your quantitative graphs using:
Shape of the distribution (How your data is spread out),
Center
Spread
Shape
Uniform Bell/normal
If the upper tail stretches out much further than the lower tail, the distribution is positively skewed (right skewed).
If the lower tail is much longer than the upper tail, the distribution is negatively skewed (left skewed).
Unimodal Bimodal
E. Outlier: Unusual observation apart from the group. (Either high or low)
CENTER:
SAMPLE MEAN ($\bar{X}$): This measure of the average is found by adding up all of the values in the data
set and dividing by the number of values.
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$$
Median: “The Midpoint” or typical value where half of the values are above the median and half are
below. (50% mark)
Calculation:
1st: Put all the data in order from smallest to largest
2nd: Find the middle position using the following location formula: location = $\frac{n+1}{2}$
2. For an even number of values: The average of the two middle values (add and divide by 2)
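Both measures of center can be sketched in Python. The function names and the quiz scores below are hypothetical, chosen only to illustrate the definitions above.

```python
def sample_mean(values):
    """Add up all of the values and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data, located at position (n + 1) / 2."""
    ordered = sorted(values)        # 1st: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]         # odd n: the single middle value
    # even n: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical quiz scores.
scores = [70, 85, 90, 75, 80]
print(sample_mean(scores), median(scores))
```

With five values the location formula gives (5 + 1) / 2 = 3, so the median is the third value in the ordered list.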
SPREAD:
Example #1: Below are the numbers of deaths from tornadoes in the U.S. from 1990-2000.
53 39 39 33 69 30 25 67 130 94 40
[Figure: histogram of Length of Calls (horizontal axis, 0 to 64) against Frequency (vertical axis)]
Example #3: U.S. Department of Labor Bureau of Labor Statistics
Unemployment Rates June 2016
Variance measures how far the data are from the mean: it is (approximately) the average of the squared
deviations from the mean. The Standard Deviation is simply the square root of the variance.
Sample Standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
Population Standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Mean X
Standard deviation s
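The variance and standard deviation formulas can be sketched in Python as follows. The function names are illustrative, and the data set is hypothetical, chosen so the arithmetic is easy to check by hand (its mean is 5 and its squared deviations add to 32).

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Sample standard deviation s: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


def population_sd(values):
    """Population standard deviation: divide by N instead of n - 1."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


# Hypothetical data, not taken from the examples in these notes.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_sd(data), population_sd(data))
```

Dividing by n - 1 instead of N makes the sample standard deviation slightly larger than the population version computed on the same numbers.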
Properties of the standard deviation and variance:
1. Sensitive to OUTLIERS and SKEWNESS
2. Standard deviation is greater than, or equal to, zero.
3. Values that are very close together have a small standard deviation.
4. Values that are very far apart have a large standard deviation.
• The mean will be pulled towards the tail of the skewed data.
• For Normal Bell Shaped Distributions, the mean is the more appropriate measure of the center. The
standard deviation is the appropriate measure of spread.
• For Skewed Distributions, the median is the more appropriate measure of the center. The IQR is the
appropriate measure of spread.
Example: The back-to-back stemplot below gives the bowling scores of male and female participants in
the finals of a national tournament.
 Males |    | Females
       | 23 | 89
     0 | 24 | 028
     3 | 25 | 0267
    80 | 26 | 378
 88421 | 27 | 25
997632 | 28 | 6
Identify the best measure of center and spread for each gender, based on the distributions. Then calculate
these values and justify your choice of this measure in one sentence:
Justification of choice_______________________________________________________
Females Choice of center_____________ calculated value_____________
Justification of choice_______________________________________________________
5-NUMBER SUMMARY
The 5-number summary of a distribution consists of the following:
Example: The back-to-back stemplot below gives the bowling scores of male and female participants in
the finals of a national tournament.
 Males |    | Females
       | 23 | 89
     0 | 24 | 028
     3 | 25 | 0267
    80 | 26 | 378
 88421 | 27 | 25
997632 | 28 | 6
Calculate (and clearly label) the five-number summary for the distribution of male scores
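A five-number summary can be sketched in Python as shown here. Two things are assumptions and not taken from these notes: the summary is taken to be the usual minimum, Q1, median, Q3, and maximum, and the quartiles are computed as the medians of the lower and upper halves of the ordered data (software such as Minitab may use a slightly different quartile rule). The data are hypothetical, not the bowling scores.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving the overall median out of both halves when n is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data.
summary = five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15])
print(summary)
```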
BOXPLOTS: We use the 5-NUMBER SUMMARY to construct box plots. Boxplots can be useful for
comparing several groups of data.
Side-by-side skeletal boxplots comparing the distributions of earnings for two levels of education:
Step 3: Determine the Fences. Fences serve as cutoff points for determining outliers
If you identify an outlier in a data set and are asked to graph the data, you must create a modified box
plot! In a modified box plot drawn by hand, you will flag a mild outlier with an open circle and an
extreme outlier with an asterisk. You will extend the lines from your quartiles to the next smallest or next
largest value that is not an outlier. Note: In Minitab, the box plot “flags” all outliers with an asterisk.
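The outlier check can be sketched in Python. The fence multipliers are an assumption based on the common textbook convention and are not stated in these notes: inner fences 1.5 × IQR beyond the quartiles (mild outliers) and outer fences 3 × IQR beyond the quartiles (extreme outliers). The function name and data are hypothetical.

```python
def classify_outliers(values, q1, q3):
    """Sort values into mild outliers, extreme outliers, and the rest.

    Assumed convention: inner fences at 1.5 * IQR beyond the quartiles,
    outer fences at 3 * IQR beyond the quartiles.
    """
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    mild, extreme, inside = [], [], []
    for x in sorted(values):
        if x < outer_low or x > outer_high:
            extreme.append(x)       # beyond an outer fence
        elif x < inner_low or x > inner_high:
            mild.append(x)          # between an inner and an outer fence
        else:
            inside.append(x)
    return {
        "inner_fences": (inner_low, inner_high),
        "outer_fences": (outer_low, outer_high),
        "mild": mild,
        "extreme": extreme,
        # whiskers stop at the most extreme values that are not outliers
        "whiskers": (inside[0], inside[-1]),
    }


# Hypothetical data whose quartiles are Q1 = 10 and Q3 = 20.
data = [2, 9, 10, 12, 15, 18, 19, 20, 38, 55]
result = classify_outliers(data, q1=10, q3=20)
print(result)
```

In a modified box plot of these hypothetical values, the whiskers would end at the `whiskers` values and each outlier would be drawn as its own symbol.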
Example #1: Oxygen capacity
To understand better the effects of exercise and aging on various circulatory functions, a study was done
on 16 middle-aged male runners. The following data set gives oxygen capacity values (ml/kg per
minute) measured while the participants pedaled at a specified rate on a bicycle.
14 16 18 19 20 21 21 21 22 22 23 23 24 26 35 36
Are there any outliers? Check the mild and extreme criteria and show all work! Then draw a boxplot to
represent the distribution of your data.
Step 3: Determine the Fences. Fences serve as cutoff points for determining outliers
Lower Fence=
Upper Fence=
Lower Fence=
Upper Fence=
____________________________________________________________________
10 15 20 25 30 35 40
Example #2:
The following back-to-back stemplot represents the amount a sample of students spent on their last
haircut. Because there are significant differences between males and females, the data was separated by
gender.
Male spending Female spending
4 3 2 0 | 1 |
5 3 1 0 0 | 2 | 0 5 5
5 4 3 2 | 3 | 5 7 7 9
2 | 4 | 3 5 6
| 5 | 0 3
| 6 | 0
| 7 | 0
| 8 | 5
| 9 |
| 10 | 0
I. Calculate (and label) the five-number summary for the amount spent by females.
II. Construct, and clearly label, a modified boxplot for the amount spent by females.
a) Check for outliers, show all work:
IQR = _________________
b) Modified Boxplot:
_____________________________________________________________________________
20 30 40 50 60 70 80 90 100
Comparing Groups
[Figure: side-by-side boxplots of CasePrice (vertical axis, 50 to 150) by Location (Cayuga, Keuka, Seneca)]
Outliers
What should be done with outliers? First try to understand them in the context of the data. A histogram
can show how the outlier fits with the rest of the data:
• Is there a large gap between the outlier and the rest of the data?
• Do not leave an outlier in place without comment and proceed as if nothing were unusual.
Example: Don’t automatically discard outliers! In 1985 three researchers (Farman, Gardiner, and
Shanklin) were puzzled by some data gathered by the British Antarctic Survey showing that the ozone levels
for Antarctica had dropped 10% below normal January levels. The puzzle was why the Nimbus 7 satellite,
which had instruments aboard for recording ozone levels, hadn't recorded similarly low ozone
concentrations. When they examined the data from the satellite it didn't take long to realize that the
satellite was in fact recording these low concentration levels and had been doing so for years. But
because the ozone concentrations recorded by the satellite were so low they were being treated as outliers
by a computer program and discarded! The Nimbus 7 satellite had in fact been gathering evidence of low
ozone levels since 1976. The damage to our atmosphere caused by chlorofluorocarbons went undetected
and untreated for up to nine years because outliers were discarded without being examined.
Standardizing
When comparing scores from different variables it is helpful to standardize the values, to determine how
many standard deviations the value is away from the mean.
So, $Z = \frac{X - \bar{X}}{s}$ or $Z = \frac{X - \mu}{\sigma}$
The z-score tells us how many standard deviations an observation is above or below the mean.
Example:
• A Z-score of 1 means the observation is 1 standard deviation larger than the mean.
• A Z-Score of –2 means the observation is 2 standard deviations smaller than the mean.
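Standardizing can be sketched in Python as shown here. The function name `z_score` and the two test scores are hypothetical; they only illustrate comparing values measured on different scales.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical scores on two tests with different scales.
z_a = z_score(85, mean=70, sd=10)       # test A, scored out of 100
z_b = z_score(620, mean=500, sd=100)    # test B, scored out of 800
print(z_a, z_b)
```

The raw score on test B is much larger, but the z-score for test A is larger, so the test A result stands further above its own mean.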
Example: Dan is working diligently to get through his general education requirements at a local
community college. As a result, he is enrolled in a statistics course as well as a history course. On his
first round of exams, Dan got a 79% on his statistics exam and an 84% on his history exam. The class
results for his statistics exam were normally distributed with a mean of 64% and a standard deviation of
6.23%. The results of his history exam were also normally distributed with a mean of 78% and a standard
deviation of 3.27%. Which of the following is true about Dan’s performance?
A. Dan performed better on his statistics exam because the z-score of his statistics exam was larger
than the z-score for his history exam.
B. Dan performed better on his history exam because the z-score of his history exam was closer to 0
than the z-score for his statistics exam.
C. Dan performed better on his history exam because the z-score of his history exam was larger than
the z-score for his statistics exam.
D. Dan performed better on his statistics exam because the z-score of his statistics exam was closer
to 0 than the z-score for his history exam.
Timeplots
A graph of data collected over a period of time (measured by seconds, days, months, years, etc.). Time
goes on the x-axis. This graph is useful when looking for trends or patterns over time.
Total Revenues and Outlays in CBO's Baseline and Under the President's Budget (Percentage of
GDP)
http://www.cbo.gov/ftpdocs/
Practice Problems:
1. A. The median score of a student’s seven quizzes is 80 points. The instructor allows each student to
drop the lowest score, which in this case is 62 points. The median score of the remaining six is:
a) 71 points.
b) 80 points.
c) 83 points.
d) Cannot be determined from the information given.
B. Using the information above: if 80 was the mean score, could the mean of the remaining 6 be found?
2. A severe drought affected several western states for 3 years. A Christmas tree farmer is worried about the
drought’s effect on the size of his trees. To decide whether the growth of the trees has been retarded, the
farmer takes a sample of the heights of 15 trees and obtains the following results (in inches):
60 57 62 69 46 54 64 60 58 75 51 49 67 65 44
Answer
c. Calculate the mean and the standard deviation for the data above:
3. The distribution of payoffs at a racetrack has a mean of $5.85 and a median of $3.40. We can conclude
the following:
a) More than half the winners get less than a $3.40 payoff
b) More than half the winners get less than a $5.85 payoff
c) The distribution of payoffs is symmetric
d) The distribution of payoffs is left-skewed
4. About how many music CDs do you own? Responses to this question for 24 students are shown in the
stem-and-leaf plot below:
Stem-and-leaf of CDs N = 24
Leaf Unit = 10
0 001222233
0 55569
1 002
1 5
2 002
2 5
3 0
3
4
4 5
d. Would the mean or the median be the best measure of center for this dataset? Explain your choice.
5. An athlete completed an 800-m race in 150 seconds. The distribution of 800-m race times followed a
bell curve, with mean 165 seconds and standard deviation 7 seconds. The same athlete also competed in a swim,
finishing in 12.25 minutes. The distribution of swim times also followed a bell curve, with mean 15
minutes and standard deviation 1.5 minutes. In which event does the athlete have a better standing relative
to the other competitors in the event?
6. A meteorological station in Hawaii has gathered the following average daily wind speeds over 43 days.
In the following histogram, these average speeds (miles/h) are displayed:
Answer:
b) Describe in detail the shape, center (find the median interval), and spread of the distribution.
Shape:
Spread:
▪ For a histogram, report the range from the beginning interval to the highest value interval (that
does not contain an outlier).