Mmw Unit IV Statistics
MATHEMATICS IN THE MODERN WORLD
UNIT IV: STATISTICS
Prepared by: Engr. Beljan Marzan, REE, RME
TOPICS
4.1 Introduction to Statistics
4.3 Normal Distributions
INTRODUCTION TO
STATISTICS
DEFINING STATISTICS; BRANCHES; CONCEPTS
Statistics deals with the collection, presentation, analysis, and use of data to make decisions, solve problems, and design products and processes.
Branches of Statistics
Branches of Applied Statistics
Inferential Statistics
A statistic is a numerical measure that describes a characteristic of a sample.
NUMERICAL DESCRIPTIVE
MEASURES
SUMMATION NOTATION; MEASURES OF CENTRAL TENDENCY;
MEASURES OF DISPERSION; MEASURES OF RELATIVE POSITION
Summation Notation
Summation notation is used to denote the sum of values and is expressed using the Greek capital letter sigma, Σ.
$\sum_{i=1}^{5} x_i$
Here i is the subscript (index), i = 1 sets the initial value, 5 is the final value, and x_i is the variable or expression being summed.
Example: The following table lists four pairs of m and f values.
m   12   15   20   30
f    5    9   10   16
Compute the following:
a) Σm
b) Σf²
c) Σmf
d) Σm²f
e) (Σmf)²
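The five sums in this example can be checked with a short Python sketch. The variable names are illustrative only; the m and f values are the ones from the table.

```python
# Paired values from the example table
m = [12, 15, 20, 30]
f = [5, 9, 10, 16]

sum_m = sum(m)                                           # a) sum of m
sum_f_sq = sum(fi ** 2 for fi in f)                      # b) sum of f squared
sum_mf = sum(mi * fi for mi, fi in zip(m, f))            # c) sum of the products m*f
sum_m_sq_f = sum(mi ** 2 * fi for mi, fi in zip(m, f))   # d) sum of m squared times f
sq_of_sum_mf = sum_mf ** 2                               # e) square of the sum in (c)

print(sum_m, sum_f_sq, sum_mf, sum_m_sq_f, sq_of_sum_mf)
# prints: 77 462 875 21145 765625
```

Note that (d) and (e) differ: in (d) each m is squared before summing, while in (e) the whole sum is squared.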
Measures of Dispersion
Help us learn about the spread of a data set.
Measures of Position
Determine the position of a single value in relation to other
values.
Measures of Central Tendency
At times, we require a common numerical value to aid in the description of a set of data. This typical value is often called an average value or a mean. We are searching for a single figure that, in some way, embodies the entirety of the set of data.
Measures of Central Tendency for Ungrouped Data
The mean, also called the arithmetic mean, is the most frequently used measure of central tendency.
The mean for a sample consisting of n observations is
$\bar{x} = \frac{\sum x}{n}$
The mean for a population consisting of N observations is
$\mu = \frac{\sum x}{N}$
Example: The number of 911 emergency calls classified as domestic disturbance calls in a large metropolitan location was sampled for thirty randomly selected 24-hour periods, with the following results. Find the mean number of calls per 24-hour period.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50
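A minimal Python sketch of this calculation, using the thirty values listed above and treating them as a sample (the variable names are illustrative):

```python
# Number of domestic disturbance calls in each of the 30 sampled periods
calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

n = len(calls)            # number of observations
total = sum(calls)        # sum of all observations
sample_mean = total / n   # sample mean = (sum of x) / n

print(n, total, round(sample_mean, 2))
# prints: 30 1168 38.93
```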
The median of a set of data is a value that divides the bottom 50% of
the data from the top 50% of the data. To find the median of a data
set, first arrange the data in increasing order. If the number of
observations is odd, the median is the number in the middle of the
ordered list. If the number of observations is even, the median is the
mean of the two values closest to the middle of the ordered list.
Example: To find the median number of domestic disturbance calls per 24-hour period for the data in the last example, first arrange the data in increasing order.
19 22 23 25 27 29 29 30 30 34 36 37 37 38 38
40 40 40 44 45 46 46 47 47 49 50 50 56 56 58
Because there are 30 observations, the median is the mean of the 15th and 16th ordered values, 38 and 40, which gives 39.
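The same steps can be sketched in Python. The standard library's `statistics.median` applies the odd/even rule described earlier; the variable names are illustrative.

```python
from statistics import median

calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

ordered = sorted(calls)   # arrange the data in increasing order
n = len(ordered)          # 30 observations, an even count

# With an even count, the two values closest to the middle are averaged
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
median_calls = median(calls)

print(middle_pair, median_calls)
# prints: (38, 40) 39.0
```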
The mode is the value in a data set that occurs the most
often. If no such value exists, we say that the data set has no
mode. If two such values exist, we say the data set is bimodal.
If three such values exist, we say the data set is trimodal.
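As an illustration, the sketch below applies this definition to the 911 call data from the earlier example. `describe_modes` is a hypothetical helper written for this note, and the "no mode" rule used here (every distinct value ties) is one reading of the definition above.

```python
from statistics import multimode

def describe_modes(data):
    """Return the most frequent value(s) and a label for the data set."""
    modes = multimode(data)   # all values tied for the highest count
    # If every distinct value ties, no value occurs "most often"
    if len(modes) > 1 and len(modes) == len(set(data)):
        return [], "no mode"
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")

calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

print(describe_modes(calls))
# prints: ([40], 'unimodal')
```

The value 40 occurs three times and no other value occurs more than twice, so the data set has a single mode.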
Measures of Central Tendency for Grouped Data
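The mean for grouped data is
$\bar{x} = \frac{\sum mf}{n}$ for a sample and $\mu = \frac{\sum mf}{N}$ for a population,
where m represents the class marks and f represents the class frequencies.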
The median for grouped data is found by locating the value that divides the data into two equal parts, and the class where it is located is called the median class.
The modal class is defined to be the class with the maximum
frequency. The mode for grouped data is defined to be the class mark
of the modal class.
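A short sketch of the grouped-data mean and mode follows. The frequency table here is a hypothetical one (five classes of width 10 whose frequencies total 25), assumed only for illustration; it is not taken from the slides.

```python
# Hypothetical frequency table: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

marks = [(lo + hi) / 2 for lo, hi, _ in classes]   # class marks m
freqs = [f for _, _, f in classes]                 # class frequencies f

N = sum(freqs)                                     # total number of observations
sum_mf = sum(m * f for m, f in zip(marks, freqs))  # sum of m*f
grouped_mean = sum_mf / N                          # mean = (sum of m*f) / N

# Modal class = class with the maximum frequency; the mode is its class mark
modal_index = freqs.index(max(freqs))
grouped_mode = marks[modal_index]

print(N, sum_mf, grouped_mean, grouped_mode)
# prints: 25 535.0 21.4 15.0
```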
The table gives the frequency distribution of the daily commuting times (in minutes) from home to work for all 25
employees of a company. Calculate the mean of the daily commuting times.
Relationships Among the Mean, Median, and Mode
Data Set   Mean   Median   Mode   Distribution Shape
1          15     15       15     Bell-shaped
2          10     10.5     15     Left-skewed
3          20.5   19.5     15     Right-skewed
Measures of Dispersion
Consider the following two data sets on the ages (in years) of all workers working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40 years. However, the variation in the workers’ ages for each of these two companies is very different. As illustrated in the diagram, the ages of the workers at the second company have a much larger variation than the ages of the workers at the first company.
Thus, the mean, median, or mode by itself is usually not a sufficient measure to reveal the shape of the distribution of a data set. We also need a measure that can provide some information about the variation among data values. The measures that help us learn about the spread of a data set are called the measures of dispersion.
Measures of Dispersion for Ungrouped Data
The range for a data set is equal to the maximum value in the data
set minus the minimum value in the data set. The range is reflective
of the spread in the data set since the difference between the largest
and the smallest value is directly related to the spread in the data.
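For instance, the two company age lists from the dispersion example have the same mean but very different ranges. A small sketch (the function name `data_range` is illustrative):

```python
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

def data_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

mean_1 = sum(company_1) / len(company_1)
mean_2 = sum(company_2) / len(company_2)
range_1 = data_range(company_1)
range_2 = data_range(company_2)

print(mean_1, range_1)   # prints: 40.0 12
print(mean_2, range_2)   # prints: 40.0 52
```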
Find the variance and standard deviation for these data.
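The data for this exercise are not reproduced in this text, so the sketch below instead uses the two company age lists from the dispersion example. It assumes the standard definitions: population variance divides the sum of squared deviations by N, and sample variance divides by n - 1. The ages are treated as populations because the lists cover all workers; the function names are illustrative.

```python
import math

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

var_1 = population_variance(company_1)
var_2 = population_variance(company_2)
sd_1 = math.sqrt(var_1)   # standard deviation = square root of the variance
sd_2 = math.sqrt(var_2)

print(round(var_1, 2), round(sd_1, 2))   # prints: 17.14 4.14
print(round(var_2, 2), round(sd_2, 2))   # prints: 349.2 18.69
```

Both companies have a mean age of 40, but the second company's standard deviation is more than four times larger.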
Measures of Dispersion for Grouped Data
The range for grouped data is the upper boundary of the class having the largest values minus the lower boundary of the class having the smallest values.
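For a sample:
$s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1} = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1}$
For a population:
$\sigma^2 = \frac{\sum f(m - \mu)^2}{N} = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N}$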
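The sketch below checks that the definitional and shortcut forms of the population variance agree. The frequency table is hypothetical (class marks 5 to 45 with frequencies totalling 25) and is assumed only for illustration.

```python
import math

# Hypothetical grouped data: class marks m and class frequencies f
marks = [5, 15, 25, 35, 45]
freqs = [4, 9, 6, 4, 2]

N = sum(freqs)
sum_mf = sum(m * f for m, f in zip(marks, freqs))          # sum of m*f
sum_m2f = sum(m ** 2 * f for m, f in zip(marks, freqs))    # sum of m^2 * f
mu = sum_mf / N

# Definitional form: sum of f*(m - mu)^2, divided by N
var_definition = sum(f * (m - mu) ** 2 for m, f in zip(marks, freqs)) / N

# Shortcut form: (sum of m^2*f minus (sum of m*f)^2 / N), divided by N
var_shortcut = (sum_m2f - sum_mf ** 2 / N) / N

sigma = math.sqrt(var_shortcut)

print(sum_mf, sum_m2f, round(var_shortcut, 2), round(sigma, 2))
# prints: 535 14825 135.04 11.62
```

For a sample, the same code applies with the final division by N replaced by n - 1.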
The table gives the frequency distribution of the daily commuting times (in minutes) from home to work for all 25
employees of a company. Calculate the variance and standard deviation.
Z Scores
The mean salary for deputies in Douglas County is $27,500 and the standard deviation is $4,500. The mean salary for deputies in Hall County is $24,250 and the standard deviation is $2,750. A deputy who makes $30,000 in Douglas County makes $1,500 more than a deputy in Hall County who makes $28,500. Which deputy has the higher salary relative to the county in which he works?
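One way to answer this is to compare z-scores, assuming the usual definition z = (x - mean) / standard deviation. A minimal sketch with illustrative names:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_douglas = z_score(30_000, mean=27_500, sd=4_500)
z_hall = z_score(28_500, mean=24_250, sd=2_750)

print(round(z_douglas, 2), round(z_hall, 2))
# prints: 0.56 1.55
```

The Hall County deputy's salary is about 1.55 standard deviations above that county's mean, compared with about 0.56 for the Douglas County deputy, so the Hall County deputy has the higher salary relative to his county.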
Measures of Position
A measure of position determines the position of a single value in relation to other values in a sample or a population data set.
Measures of Position
Quartiles and IQR
Quartiles are the summary measures that divide a ranked data set into four equal parts. Three measures will divide any data set into four equal parts. These three measures are the first quartile (denoted by Q1), the second quartile (denoted by Q2), and the third quartile (denoted by Q3). The data should be ranked in increasing order before the quartiles are determined.
The difference between the third quartile and the first quartile for a data set is called the interquartile range (IQR).
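A minimal sketch of one common convention for computing quartiles: Q2 is the median of the ranked data, Q1 is the median of the values below Q2, and Q3 is the median of the values above Q2. Other conventions exist and can give slightly different results. The data values and the `quartiles` helper are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ranked = sorted(data)               # rank the data in increasing order
    n = len(ranked)
    lower = ranked[: n // 2]            # values below the median position
    upper = ranked[(n + 1) // 2 :]      # values above the median position
    return median(lower), median(ranked), median(upper)

# Hypothetical data set of 12 values
values = [13, 2, 9, 4, 18, 7, 4, 12, 5, 15, 8, 10]

q1, q2, q3 = quartiles(values)
iqr = q3 - q1                           # interquartile range = Q3 - Q1

print(q1, q2, q3, iqr)
# prints: 4.5 8.5 12.5 8.0
```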
The table below gives the total compensations (in millions of dollars) for the year 2010 of
the 12 highest-paid CEOs of U.S. companies. That table is reproduced here:
Measures of Position
Percentiles
Percentiles are the summary measures that divide a ranked data set into 100 equal parts. The kth percentile is denoted by Pk, where k is an integer in the range 1 to 99. For instance, the 25th percentile is denoted by P25.
The table below gives the total compensations (in millions of dollars) for the year 2010 of the 12 highest-paid CEOs of
U.S. companies. That table is reproduced here:
NORMAL DISTRIBUTION
PROPERTIES; EMPIRICAL RULE; STANDARD NORMAL DISTRIBUTION
A normal distribution forms a bell-shaped curve that is symmetric about a vertical line through the mean of the data.
PROPERTIES OF A NORMAL DISTRIBUTION
The graph is symmetric about a vertical line through the mean of the distribution.
The mean, median, and mode are equal.
The y-value of each point on the curve is the percent (expressed as a decimal) of the data at the corresponding x-value.
Areas under the curve that are symmetric about the mean are equal.
The total area under the curve is 1.
In the normal distribution shown, the area of the shaded region is
0.159 units. This region represents the fact that 15.9% of the
data values are greater than or equal to 10. Because the area
under the curve is 1, the unshaded region under the curve has
area 1-0.159, or 0.841, representing the fact that 84.1% of the
data are less than 10.
EMPIRICAL RULE
In a normal distribution, approximately
68% of the data lie within 1 standard deviation of the mean.
95% of the data lie within 2 standard deviations of the mean.
99.7% of the data lie within 3 standard deviations of the mean.
A survey of 1000 U.S. gas stations found that the price charged for a gallon of regular gas could be closely approximated
by a normal distribution with a mean of $3.10 and a standard deviation of $0.18. How many of the stations charge
a. between $2.74 and $3.46 for a gallon of regular gas?
b. less than $3.28 for a gallon of regular gas?
c. more than $3.46 for a gallon of regular gas?
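Each price in this example lies a whole number of standard deviations from the mean, so the Empirical Rule percentages apply directly. The sketch below is one way to organize the arithmetic; the names are illustrative, and rounding to whole standard deviations is valid only because of how this example is constructed.

```python
MEAN, SD, STATIONS = 3.10, 0.18, 1000

# Empirical Rule: share of data within k standard deviations of the mean
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def sd_from_mean(price):
    """Whole number of standard deviations between price and the mean."""
    return round((price - MEAN) / SD)

# a) $2.74 and $3.46 are 2 SD below and above the mean
k_low, k_high = sd_from_mean(2.74), sd_from_mean(3.46)
between = round(WITHIN[k_high] * STATIONS)

# b) $3.28 is 1 SD above the mean: lower half plus half of the 68% band
below = round((0.5 + WITHIN[sd_from_mean(3.28)] / 2) * STATIONS)

# c) above $3.46: half of the data lying outside 2 SD
above = round((1 - WITHIN[k_high]) / 2 * STATIONS)

print(between, below, above)
# prints: 950 840 25
```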
THE STANDARD NORMAL DISTRIBUTION
If the original distribution of x values is a normal distribution, then the corresponding distribution of z-scores will also be a normal distribution. This normal distribution of z-scores is called the standard normal distribution. The standard normal distribution is the normal distribution that has a mean of 0 and a standard deviation of 1.
Tables and calculators are often used to determine the area under a portion of the standard normal curve. We will refer to this type of area as an area of the standard normal distribution.
In the standard normal distribution, the area of the distribution from z = a to z = b represents
the percentage of z-values that lie in the interval from a to b.
the probability that z lies in the interval from a to b.
Find the area of the standard normal distribution
a. between z = -1.44 and z = 0
b. between z = -0.67 and z = 0
c. to the right of z = 0.82
d. to the left of z = -1.47
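In place of a printed table, these areas can be computed with Python's standard library (`statistics.NormalDist`, available in Python 3.8 and later). This is a sketch for checking answers, not necessarily the method used in class; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

# cdf(z) is the area to the left of z
area_a = Z.cdf(0) - Z.cdf(-1.44)   # a) between z = -1.44 and z = 0
area_b = Z.cdf(0) - Z.cdf(-0.67)   # b) between z = -0.67 and z = 0
area_c = 1 - Z.cdf(0.82)           # c) to the right of z = 0.82
area_d = Z.cdf(-1.47)              # d) to the left of z = -1.47

for label, area in zip("abcd", (area_a, area_b, area_c, area_d)):
    print(label, round(area, 4))
# prints:
# a 0.4251
# b 0.2486
# c 0.2061
# d 0.0708
```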
A soda machine dispenses soda into 12-ounce cups. Tests show that the actual amount of soda dispensed is normally
distributed, with a mean of 11.5 oz and a standard deviation of 0.2 oz.
a. What percent of cups will receive less than 11.25 oz of soda?
b. What percent of cups will receive between 11.2 oz and 11.55 oz of soda?
c. If a cup is filled at random, what is the probability that the machine will overflow the cup?
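A sketch of how these three parts could be checked numerically, again using `statistics.NormalDist`. It assumes that "overflow" means more than 12 oz is dispensed into the 12-ounce cup; the variable names are illustrative.

```python
from statistics import NormalDist

# Amount dispensed: normal with mean 11.5 oz and standard deviation 0.2 oz
fill = NormalDist(mu=11.5, sigma=0.2)

p_under = fill.cdf(11.25)                     # a) less than 11.25 oz
p_between = fill.cdf(11.55) - fill.cdf(11.2)  # b) between 11.2 oz and 11.55 oz
p_overflow = 1 - fill.cdf(12)                 # c) more than 12 oz (assumed overflow)

print(f"{p_under:.1%} {p_between:.1%} {p_overflow:.1%}")
# prints: 10.6% 53.2% 0.6%
```

The corresponding z-scores are -1.25 for part (a), -1.5 and 0.25 for part (b), and 2.5 for part (c).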
LINEAR REGRESSION AND
CORRELATION
LEAST-SQUARES REGRESSION; LINEAR CORRELATION
COEFFICIENT
The Least-Squares Regression Line
When performing research studies, scientists often wish to know
whether two variables are related. If the variables are determined to be
related, a scientist may then wish to find an equation that can be used
to model the relationship.
The least-squares regression line for a set of bivariate data is the line
that minimizes the sum of the squares of the vertical deviations from
each data point to the line. The equation of the least-squares line can be
used to predict the value of one variable when the value of the other
variable is known.
A geologist might want to know whether there is a relationship between the duration of an eruption of a geyser and the time between eruptions. A first step in this determination is to collect some data. Data involving two variables are called bivariate data. The table gives bivariate data showing the time between two eruptions and the duration of the second eruption for 10 eruptions of the geyser Old Faithful. Once the data are collected, a scatter diagram or scatter plot can be drawn, as shown.
One way for the geologist to create a model of the relationship between the time between two eruptions and the duration of the second eruption is to find a line that approximates the data points plotted in the scatter plot. There are many such lines that can be drawn, as shown.
Of all the possible lines that can be drawn, the one that is usually of most interest is called the line of best fit or the least-squares regression line. The least-squares regression line is the line that fits the data better than any other line that might be drawn. In the definition, the phrase “minimizes the sum of the squares of the vertical deviations” means that of all the lines possible, the linear equation that minimizes the sum
$d_1^2 + d_2^2 + \cdots + d_n^2$
is the equation of the line of best fit. In this expression, each $d_n$ represents the vertical distance from data point n to the line.
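The slides' own formula for the line is not reproduced in this text, so the sketch below uses the standard closed-form solution for the line y = ax + b that minimizes the sum of squared vertical deviations. The function name and the data pairs are hypothetical.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    # a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2);  b = mean(y) - a*mean(x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

def sum_squared_deviations(points, slope, intercept):
    """Sum of the squared vertical distances d_i from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

# Hypothetical ordered pairs
pairs = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

slope, intercept = least_squares_line(pairs)
print(f"y = {slope:.2f}x + {intercept:.2f}")
# prints: y = 0.90x + 1.30
```

Passing any other slope and intercept to `sum_squared_deviations` for the same pairs gives a larger total, which is what "line of best fit" means here.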
Find the equation of the least-squares line for the given ordered pairs.