

Probability and Statistics Lec 4 & 5 After Eid

Composed by Adnan Alam Khan


Previous Topics:

1|Page aoabyadnan2@gmail.com
Today’s Topic
1. Stem and leaf plots
2. Double stem and leaf plots
3. Dot plots
Stem and Leaf Plots
The stem and leaf plot is a method of organizing data and is a combination of sorting and graphing. It has
the advantage over a grouped frequency distribution of retaining the actual data while showing them in
graphical form.
A stem and leaf plot is a data plot that uses part of the data value as the stem and part of the data value
as the leaf to form groups or classes.
Example: At an outpatient testing center, the number of cardiograms performed each day for 20 days is
shown. Construct a stem and leaf plot for the data.
25 31 20 32 13
14 43 02 57 23
36 32 33 32 44
32 52 44 51 45
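The construction can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stem_and_leaf` is made up for this example, and it assumes two-digit data with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves}; tens digit = stem, units digit = leaf."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    # List every stem from the smallest to the largest, even one with no leaves.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

# Cardiogram counts from the example (02 is entered as 2).
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

plot = stem_and_leaf(cardiograms)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Each printed row is one class (00–09, 10–19, and so on), and the actual data values can still be read back from the stem and its leaves.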

If there are no data values in a class, you should write the stem number and leave the leaf row blank.
Do not put a zero in the leaf row, since a leaf of 0 would be read as a data value (e.g., “20” for stem 2).

Example 2.
An insurance company researcher conducted a survey on the number of car thefts in a large city for a
period of 30 days last summer. The raw data are shown. Construct a stem and leaf plot by using classes
50–54, 55–59, 60–64, 65–69, 70–74, and 75–79.
52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55

Class Work: Solve it Answer:

Solution on NEXT page
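One way to check the class work is the following sketch, which splits every stem into two rows (leaves 0–4 and leaves 5–9) so that the rows match the classes 50–54, 55–59, and so on. The helper name `split_stem_plot` is hypothetical, and the code assumes two-digit values.

```python
def split_stem_plot(values):
    """Group two-digit values into classes of width 5, two rows per stem."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 0 if leaf < 5 else 1      # 0 = leaves 0-4, 1 = leaves 5-9
        rows.setdefault((stem, half), []).append(leaf)
    return rows

# Car-theft counts from Example 2.
thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57,
          75, 56, 55, 67, 73, 79, 59, 68, 65, 72,
          57, 51, 63, 69, 75, 65, 53, 78, 66, 55]

rows = split_stem_plot(thefts)
for (stem, half), leaves in sorted(rows.items()):
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

The stem 5 appears twice in the output, once for 50–54 and once for 55–59, and likewise for stems 6 and 7.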

When the data values are in the hundreds, such as 325, the stem is 32 and the leaf is 5. For example, the
stem and leaf plot for the data values 325, 327, 330, 332, 335, 341, 345, and 347 looks like this.
32 | 5 7
33 | 0 2 5
34 | 1 5 7

Construct a back-to-back stem and leaf plot:


The number of stories in two selected samples of tall buildings in Atlanta and Philadelphia is shown.
Construct a back-to-back stem and leaf plot, and compare the distributions.
Atlanta          |  Philadelphia
55 70 44 36 40   |  61 40 38 32 30
63 40 44 34 38   |  58 40 40 25 30
60 47 52 32 32   |  54 40 36 30 30
50 53 32 28 31   |  53 39 36 34 33
52 32 34 32 50   |  50 38 36 39 32
26 29            |
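A back-to-back plot shares one column of stems, with the leaves of one data set written to the left and the leaves of the other to the right. The sketch below is an illustration only; it assumes the first five values in each row of the table belong to Atlanta, the last five to Philadelphia, and the trailing 26 and 29 to Atlanta. The helper name `leaves_by_stem` is made up for this example.

```python
def leaves_by_stem(values):
    """Return {stem: sorted leaves} for two-digit values."""
    out = {}
    for v in sorted(values):
        out.setdefault(v // 10, []).append(v % 10)
    return out

# Assumed split of the table into the two cities (see the note above).
atlanta = [55, 70, 44, 36, 40, 63, 40, 44, 34, 38, 60, 47, 52, 32, 32,
           50, 53, 32, 28, 31, 52, 32, 34, 32, 50, 26, 29]
philadelphia = [61, 40, 38, 32, 30, 58, 40, 40, 25, 30, 54, 40, 36, 30, 30,
                53, 39, 36, 34, 33, 50, 38, 36, 39, 32]

a, p = leaves_by_stem(atlanta), leaves_by_stem(philadelphia)
for stem in range(min(min(a), min(p)), max(max(a), max(p)) + 1):
    # Atlanta leaves are reversed so the smallest leaf sits next to the stem.
    left = " ".join(str(d) for d in reversed(a.get(stem, [])))
    right = " ".join(str(d) for d in p.get(stem, []))
    print(f"{left:>24} | {stem} | {right}")
```

Under that split, both cities peak in the 30s, but Philadelphia's values are more tightly concentrated there, while Atlanta has more buildings in the 50s and above.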

Stem and leaf plots are part of the techniques called exploratory data analysis.

Where will we use a Pareto chart?
A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the cumulative total is represented by the line.
Pareto charts show the ordered frequency counts of values for the different levels of a categorical or
nominal variable. They are often used to identify the areas to focus on first in process improvement.
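The numbers behind a Pareto chart are the frequencies sorted in descending order (the bars) and the running cumulative percentage (the line). The sketch below computes both; the defect counts are hypothetical values invented purely for illustration, and `pareto_table` is not a library function.

```python
from itertools import accumulate

def pareto_table(counts):
    """Sort categories by descending frequency and attach cumulative percentages."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(freq for _, freq in ordered)
    return [(name, freq, round(100 * cum / total, 1))
            for (name, freq), cum in zip(ordered, running)]

# Hypothetical defect counts, not taken from the lecture.
defects = {"scratch": 12, "dent": 30, "misalignment": 5, "paint": 3}

table = pareto_table(defects)
for name, freq, cum_pct in table:
    print(f"{name:<14}{freq:>4}{cum_pct:>8}%")
```

In this made-up data the first two categories already account for 84% of all defects, which is the kind of "focus here first" reading a Pareto chart is meant to support.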

Leading Cause of Death:
The following shows approximations of the leading causes of death among men ages 25–44 years. The
rates are per 100,000 men. Answer the following questions about the graph.

Section 2–3 Leading Cause of Death


1. The variables in the graph are the year, cause of death, and rate of death per 100,000 men.
2. The cause of death is qualitative, while the year and death rates are quantitative.
3. Year is a discrete variable, and death rate is continuous. Since cause of death is qualitative, it is neither
discrete nor continuous.
4. A line graph was used to display the data.
5. No, a Pareto chart could not be used to display the data, since we can only have one quantitative
variable and one categorical variable in a Pareto chart.
6. We cannot use a pie chart for the same reasons as given for the Pareto chart.
7. A Pareto chart is typically used to show a categorical variable listed from the highest-frequency category
to the category with the lowest frequency.
8. A time series chart is used to see trends in the data. It can also be used for forecasting and predicting.

Summary
• When data are collected, the values are called raw data. Since very little knowledge can be
obtained from raw data, they must be organized in some meaningful way. A frequency
distribution using classes is the common method that is used. (2–1)
• Once a frequency distribution is constructed, graphs can be drawn to give a visual representation
of the data. The most commonly used graphs in statistics are the histogram, frequency polygon,
and ogive. (2–2)
• Other graphs such as the bar graph, Pareto chart, time series graph, and pie graph can also be
used. Some of these graphs are frequently seen in newspapers, magazines, and various statistical
reports. (2–3)
• Finally, a stem and leaf plot uses part of the data values as stems and part of the data values as
leaves. This graph has the advantage of a frequency distribution and a histogram. (2–3)

From the outline:
Measures of Central Tendency, Measures of Variation, Measures of Position, Exploratory Data
Analysis, Detection of outliers
Chapter 3:
The authors go on to give examples of averages:
The average American man is five feet, nine inches tall; the average woman is five feet, 3.6 inches.
The average American is sick in bed seven days a year missing five days of work.
On the average day, 24 million people receive animal bites.
By his or her 70th birthday, the average American will have eaten 14 steers, 1050 chickens, 3.5 lambs, and
25.2 hogs.
Loosely stated, the average means the center of the distribution or the most typical case.
Measures of average are also called measures of central tendency and include the mean, median, mode,
and midrange.
Do the data values cluster around the mean, or are they spread more evenly throughout
the distribution? The measures that determine the spread of the data values are called measures of
variation, or measures of dispersion. These measures include the range, variance, and standard deviation.
Finally, another set of measures is
necessary to describe data. These measures are called measures of position. The most common position
measures are percentiles, deciles, and quartiles. These measures are used extensively in psychology and
education. Sometimes they are referred to as norms.
The measures of central tendency, variation, and
position explained in this chapter are part of what is called traditional statistics. Section 4 shows the
techniques of what is called exploratory data analysis. These techniques include the boxplot and the
five-number summary. They can be used to explore data to see what they show (as opposed to the
traditional techniques, which are used to confirm conjectures about the data).
A statistic is a characteristic or measure obtained by using the data values from a sample.
A parameter is a characteristic or measure obtained by using all the data values from a specific population.
Properties and Uses of Central Tendency
The Mean
1. The mean is found by using all the values of the data.
2. The mean varies less than the median or mode when samples are taken from the same population and
all three measures are computed for these samples.
3. The mean is used in computing other statistics, such as the variance.
4. The mean for the data set is unique and not necessarily one of the data values.
5. The mean cannot be computed for the data in a frequency distribution that has an open-ended class.
6. The mean is affected by extremely high or low values, called outliers, and may not be the appropriate
average to use in these situations.
The Median
1. The median is used to find the center or middle value of a data set.
2. The median is used when it is necessary to find out whether the data values fall into the upper half or
lower half of the distribution.
3. The median is used for an open-ended distribution.
4. The median is affected less than the mean by extremely high or extremely low values.
The Mode
1. The mode is used when the most typical case is desired.
2. The mode is the easiest average to compute.
3. The mode can be used when the data are nominal or categorical, such as religious preference, gender,
or political affiliation.

4. The mode is not always unique. A data set can have more than one mode, or the mode may not exist
for a data set.
The Midrange
1. The midrange is easy to compute.
2. The midrange gives the midpoint.
3. The midrange is affected by extremely high or low values in a data set.
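The four measures of central tendency can be computed side by side with Python's standard `statistics` module. This is a small sketch on the data set 5, 8, 8, 9, 10, 12, 13; the midrange has no library function, so it is computed directly as (lowest + highest) / 2.

```python
import statistics

data = [5, 8, 8, 9, 10, 12, 13]

mean = statistics.mean(data)              # uses every data value
median = statistics.median(data)          # middle value of the ordered data
modes = statistics.multimode(data)        # list, since the mode need not be unique
midrange = (min(data) + max(data)) / 2    # midpoint of the lowest and highest values

print(f"mean = {mean:.2f}, median = {median}, mode(s) = {modes}, midrange = {midrange}")
```

Here the mean (about 9.29) is not one of the data values, while the median (9) and the mode (8) are, which illustrates property 4 of the mean.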
For the spread or variability of a data set, three measures are commonly used: range, variance, and
standard deviation. Each measure will be discussed in this section.
Range
The range is the simplest of the three measures and is defined now.
The range is the highest value minus the lowest value. The symbol R is used for the range.
R = highest value - lowest value

Find the ranges for the paints in the example above.


Solution
For brand A, the range is
R = 60 - 10 = 50 months
For brand B, the range is
R = 45 - 25 = 20 months
Make sure the range is given as a single number.
The range for brand A shows that 50 months separate the largest data value from the smallest data value.
For brand B, 20 months separate the largest data value from the smallest data value, which is less than
one-half of brand A’s range.
Q. Find the variance and standard deviation for the data set for brand A paint.
10, 60, 50, 30, 40, 20

Q. Find the variance and standard deviation for the brand B paint data in the example. The months were
35, 45, 30, 35, 40, 25
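Both questions can be checked with the sketch below. It assumes the six values for each brand are treated as a whole population, so the variance divides by N; with the sample formula (dividing by n - 1) the results would be somewhat larger. The helper name `spread` is made up for this example.

```python
import statistics

brand_a = [10, 60, 50, 30, 40, 20]   # months, brand A
brand_b = [35, 45, 30, 35, 40, 25]   # months, brand B

def spread(values):
    """Return (range, population variance, population standard deviation)."""
    r = max(values) - min(values)
    var = statistics.pvariance(values)   # divides by N (population assumption)
    return r, var, var ** 0.5

for name, values in (("A", brand_a), ("B", brand_b)):
    r, var, sd = spread(values)
    print(f"Brand {name}: range = {r}, variance = {var:.1f}, std dev = {sd:.1f}")
```

Both brands have the same mean (35 months), but brand A's standard deviation is much larger, so brand B's paint lasts a more consistent length of time.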

Volatility often refers to the amount of uncertainty or risk related to the size of changes in a security's
value. A higher volatility means that a security's value can potentially be spread out over a larger range of
values. This means that the price of the security can change dramatically over a short time period in either
direction.

Where to invest and where not to invest?
Use the coefficient of variation (volatility divided by expected return): the lower the ratio, the better the
risk-return trade-off.

Investment A: stock volatility is 11%, expected return is 15%, coefficient of variation is 0.73
Investment B: stock volatility is 16%, expected return is 19%, coefficient of variation is 0.84 (worst)
Investment C: stock volatility is 6%, expected return is 10%, coefficient of variation is 0.60 (BEST)
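The three ratios can be reproduced with a short sketch. It assumes the coefficient of variation is taken here as volatility divided by expected return, expressed as a plain ratio rather than a percentage; the dictionary layout and function name are illustrative choices.

```python
# (volatility %, expected return %) for each investment, from the notes.
investments = {
    "A": (11, 15),
    "B": (16, 19),
    "C": (6, 10),
}

def coefficient_of_variation(volatility, expected_return):
    """Risk per unit of expected return."""
    return volatility / expected_return

cv = {name: round(coefficient_of_variation(v, r), 2)
      for name, (v, r) in investments.items()}
best = min(cv, key=cv.get)    # lowest ratio = best risk-return trade-off

print(cv, "best:", best)
```

Investment B has the highest expected return but also the most risk per unit of return, which is why it ranks last on this measure.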

Coefficient of variation
Historical Note: Karl Pearson devised the coefficient of variation to compare the deviations of two
different groups such as the heights of men and women.
Whenever two samples have the same units of measure, the variance and standard deviation for each can
be compared directly. For example, suppose an automobile dealer wanted to compare the standard
deviation of miles driven for the cars she received as trade-ins on new cars. She found that for a specific
year, the standard deviation for Buicks was 422 miles and the standard deviation for Cadillacs was 350
miles. She could say that the variation in mileage was greater in the Buicks. But what if a manager wanted
to compare the standard deviations of two different variables, such as the number of sales per salesperson
over a 3-month period and the commissions made by these salespeople?
A statistic that allows you to compare standard deviations when the units are different, as in this example,
is called the coefficient of variation.

In other words, if the range is divided by 4, an approximate value for the standard deviation is obtained.
For example, the standard deviation for the data set 5, 8, 8, 9, 10, 12, and 13 is 2.7, and the range is 13 - 5
= 8. The range rule of thumb gives s ≈ 8/4 = 2.
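A quick check of the rule of thumb against the computed value, as a sketch using the sample standard deviation from the standard `statistics` module:

```python
import statistics

data = [5, 8, 8, 9, 10, 12, 13]

estimate = (max(data) - min(data)) / 4   # range rule of thumb: s is roughly range / 4
s = statistics.stdev(data)               # sample standard deviation (divides by n - 1)

print(f"rule of thumb: {estimate}, computed: {s:.1f}")
```

The estimate (2) is in the right neighborhood of the computed value (2.7) but is only a rough approximation.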

Travel Allowances
A survey of local companies found that the mean amount of travel allowance for executives was $0.25 per
mile. The standard deviation was $0.02. Using Chebyshev’s theorem, find the minimum percentage of the
data values that will fall between $0.20 and $0.30.
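Chebyshev's theorem states that at least 1 - 1/k² of the data values fall within k standard deviations of the mean, for any k > 1. The sketch below applies it to the travel allowance question; the function name is hypothetical, and it assumes the interval is symmetric about the mean, as it is here.

```python
def chebyshev_min_fraction(mean, sd, low, high):
    """Minimum fraction of values in [low, high] by Chebyshev's theorem."""
    # Number of standard deviations from the mean to the nearer endpoint.
    k = min(mean - low, high - mean) / sd
    if k <= 1:
        return 0.0    # the theorem gives no useful bound when k <= 1
    return 1 - 1 / k ** 2

fraction = chebyshev_min_fraction(mean=0.25, sd=0.02, low=0.20, high=0.30)
print(f"At least {100 * fraction:.0f}% of the values fall between $0.20 and $0.30")
```

Both endpoints are $0.05 from the mean, so k = 0.05 / 0.02 = 2.5 and the bound is 1 - 1/6.25 = 0.84, that is, at least 84%.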

The Empirical (Normal) Rule


Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is
bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are
true.
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.

Outliers:
An extremely high or extremely low data value in a data set can have a striking effect on the mean of the
data set. These extreme values are called outliers. This is one reason why when analyzing a frequency
distribution, you should be aware of any of these values. For the data set shown in Example 3–14, the
mean, median, and mode can be quite different because of extreme values. A method for identifying
outliers is given in Section 3–3.
After the raw data have been organized into a frequency distribution, they are analyzed by looking for
peaks and extreme values. The peaks show which class or classes have the most data values compared to
the other classes. Extreme values, called outliers, are data values that are large or small relative to the
other data values.

How to calculate Q1, Q2 & Q3?


Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
Solution
Step 1 Arrange the data in order.
5, 6, 12, 13, 15, 18, 22, 50
Step 2 Find the median (Q2).
5, 6, 12, 13, 15, 18, 22, 50

MD = (13+15)/2=14
Step 3 Find the median of the data values less than 14.
5, 6, 12, 13

Q1=(6+12)/2=9
So Q1 is 9.
Step 4 Find the median of the data values greater than 14.
15, 18, 22, 50

Q3=(18+22)/2=20
Here Q3 is 20. Hence, Q1 = 9, Q2 = 14, and Q3 = 20.
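The four steps translate directly into code. This sketch follows the same "median of each half" method; the function name is made up, and other quartile conventions (including those used by some calculators and libraries) can give slightly different values.

```python
import statistics

def quartiles_by_halves(values):
    """Q1, Q2, Q3 as the medians of the lower half, whole set, and upper half."""
    ordered = sorted(values)                       # Step 1: arrange in order
    q2 = statistics.median(ordered)                # Step 2: median
    lower = [v for v in ordered if v < q2]         # Step 3: values below the median
    upper = [v for v in ordered if v > q2]         # Step 4: values above the median
    return statistics.median(lower), q2, statistics.median(upper)

q1, q2, q3 = quartiles_by_halves([15, 13, 6, 5, 12, 50, 22, 18])
print(q1, q2, q3)
```

For the data set in the example this reproduces Q1 = 9, Q2 = 14, and Q3 = 20.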

Q# Check the following data set for outliers.
5, 6, 12, 13, 15, 18, 22, 50
Solution
From the previous question, Q1 = 9, Q2 = 14, and Q3 = 20.
The data value 50 is extremely suspect. These are the steps in checking for an outlier.
Step 1 Find Q1 and Q3. This was done in the example above, where Q1 is 9 and Q3 is 20.
Step 2 Find the interquartile range (IQR), which is Q3 - Q1.
IQR = Q3 - Q1 = 20 - 9 = 11
Step 3 Multiply this value by 1.5.
1.5(11) = 16.5
Step 4 Subtract the value obtained in step 3 from Q1, and add the value obtained in step 3 to Q3.
9 - 16.5 = -7.5 and 20 + 16.5 = 36.5
Step 5 Check the data set for any data values that fall outside the interval from -7.5 to 36.5. The value 50
is outside this interval; hence, it can be considered an outlier.
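The five steps can be sketched as a small function that takes Q1 and Q3 as inputs and returns the interval and any values outside it. The function name and the `factor` parameter are illustrative choices, with 1.5 as the multiplier used in the steps.

```python
def iqr_outliers(values, q1, q3, factor=1.5):
    """Return ((low, high), outliers) using the 1.5 x IQR rule."""
    iqr = q3 - q1                       # Step 2: interquartile range
    margin = factor * iqr               # Step 3: multiply by 1.5
    low, high = q1 - margin, q3 + margin    # Step 4: lower and upper limits
    # Step 5: any value outside [low, high] is flagged.
    return (low, high), [v for v in values if v < low or v > high]

fences, outliers = iqr_outliers([5, 6, 12, 13, 15, 18, 22, 50], q1=9, q3=20)
print(fences, outliers)
```

Only 50 falls outside the interval from -7.5 to 36.5, matching the conclusion above.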

1. What is a z score?
A z score tells how many standard deviations the data value is above or below the mean.
2. Define percentile rank.
A percentile rank indicates the percentage of data values that fall below the specific rank.
3. What is the difference between a percentage and a percentile?
A percentile is a relative measurement of position; a percentage is an absolute measure of the part to the
total.
4. Define quartile.
A quartile is a relative measure of position obtained by dividing the data set into quarters.
5. What is the relationship between quartiles and percentiles?
Q1 = P25; Q2 = P50; Q3 = P75
6. What is a decile?
A decile is a relative measure of position obtained by dividing the data set into tenths.
7. How are deciles related to percentiles?
D1 = P10; D2 = P20; D3 = P30; etc.
8. To which percentile, quartile, and decile does the median correspond?
P50; Q2; D5
Exploratory Data Analysis:
Use the techniques of exploratory data analysis, including boxplots and five number summaries, to
discover various aspects of data. The purpose of traditional analysis is to confirm various conjectures
about the nature of the data. For example, from a carefully designed study, a researcher might want to
know if the proportion of Americans who are exercising today has increased from 10 years ago. This study
would contain various assumptions about the population, various definitions such as of exercise, and so
on. In exploratory data analysis (EDA), data can be organized using a stem and leaf plot. The measure of
central tendency used in EDA is the median. The measure of variation used in EDA is the interquartile
range Q3 - Q1.
In EDA the data are represented graphically using a boxplot (sometimes called a box-and-whisker plot).
The purpose of exploratory data analysis is to examine data to find out what information can be discovered
about the data such as the center and the spread. Exploratory data analysis was developed by John Tukey.

The Five-Number Summary and Boxplots
A boxplot can be used to graphically represent the data set. These plots involve five specific values:
1. The lowest value of the data set (i.e., minimum)
2. Q1
3. The median
4. Q3
5. The highest value of the data set (i.e., maximum)
These values are called a five-number summary of the data set.
Procedure for constructing a boxplot
1. Find the five-number summary for the data values, that is, the maximum and minimum data values, Q1
and Q3, and the median.
2. Draw a horizontal axis with a scale such that it includes the maximum and minimum data values.
3. Draw a box whose vertical sides go through Q1 and Q3, and draw a vertical line through the median.
4. Draw a line from the minimum data value to the left side of the box and a line from the maximum data
value to the right side of the box.
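Step 1 of the procedure, finding the five-number summary, can be sketched as follows. The quartiles are computed as the medians of the lower and upper halves of the ordered data, which is an assumption about the convention; the function name is made up for this example, and the data set 5, 6, 12, 13, 15, 18, 22, 50 is used for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    med = statistics.median(ordered)
    lower = [v for v in ordered if v < med]    # values below the median
    upper = [v for v in ordered if v > med]    # values above the median
    return (ordered[0], statistics.median(lower), med,
            statistics.median(upper), ordered[-1])

summary = five_number_summary([5, 6, 12, 13, 15, 18, 22, 50])
print(summary)
```

For this data the box would run from 9 to 20 with the median line at 14, and the right-hand line would extend all the way to 50, which makes the long upper tail easy to see.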

In exploratory data analysis, hinges are used instead of quartiles to construct boxplots. When the data set
consists of an even number of values, hinges are the same as quartiles. Hinges for a data set with an odd
number of values differ somewhat from quartiles. However, since most calculators and computer
programs use quartiles, they will be used in this textbook. Another important point to remember is that
the summary statistics (median and interquartile range) used in exploratory data analysis are said to be
resistant statistics. A resistant statistic is relatively less affected by outliers than a nonresistant statistic.
The mean and standard deviation are nonresistant statistics. Sometimes when a distribution is skewed or
contains outliers, the median and interquartile range may more accurately summarize the data than the
mean and standard deviation, since the mean and standard deviation are more affected in this case.
