Mmw Unit IV Statistics


MATHEMATICS IN THE MODERN WORLD
UNIT IV: STATISTICS
Prepared by: Engr. Beljan Marzan, REE, RME

TOPICS
4.1 Introduction to Statistics
4.2 Numerical Descriptive Measures
4.3 Normal Distributions
4.4 Linear Regression and Correlation

INTRODUCTION TO
STATISTICS
DEFINING STATISTICS; BRANCHES; CONCEPTS

Statistics deals with the collection, presentation, analysis, and use of data to make decisions, solve problems, and design products and processes.

Branches of Statistics

Theoretical statistics deals with the development, derivation, and proof of statistical theorems, formulas, rules, and laws.

Applied statistics involves the applications of those theorems, formulas, rules, and laws to solve real-world problems.

Branches of Applied Statistics

Descriptive statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures.

Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.

Inferential Statistics

A population consists of all elements (individuals, items, or objects) whose characteristics are being studied. The population that is being studied is also called the target population.

A portion of the population selected for study is referred to as a sample.

A parameter is a numerical measure that describes a characteristic of a population.

A statistic is a numerical measure that describes a characteristic of a sample.

NUMERICAL DESCRIPTIVE
MEASURES
SUMMATION NOTATION; MEASURES OF CENTRAL TENDENCY;
MEASURES OF DISPERSION; MEASURES OF RELATIVE POSITION

Summation Notation

Summation notation is used to denote the sum of values and is expressed using the Greek capital letter sigma, Σ.

$$\sum_{i=1}^{5} x_i$$

In this expression, $x_i$ is the variable or expression being summed, i is the subscript, i = 1 gives its initial value, and 5 is its final value.

Example: The following table lists four pairs of m and f values.

m   12   15   20   30
f    5    9   10   16

Compute the following:
a) $\sum m$
b) $\sum f^2$
c) $\sum mf$
d) $\sum m^2 f$
e) $(\sum mf)^2$
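
The five sums in this example can be checked with a short Python sketch. The lists `m` and `f` mirror the table; the other variable names are illustrative choices, not notation from the slides.

```python
# Values from the table in the example.
m = [12, 15, 20, 30]
f = [5, 9, 10, 16]

sum_m = sum(m)                                          # a) sum of m
sum_f_sq = sum(fi ** 2 for fi in f)                     # b) sum of f squared
sum_mf = sum(mi * fi for mi, fi in zip(m, f))           # c) sum of m times f
sum_m_sq_f = sum(mi ** 2 * fi for mi, fi in zip(m, f))  # d) sum of m squared times f
sq_sum_mf = sum_mf ** 2                                 # e) square of the sum in (c)

print(sum_m, sum_f_sq, sum_mf, sum_m_sq_f, sq_sum_mf)
# prints: 77 462 875 21145 765625
```

Parts (d) and (e) are easy to confuse: in (d) each m is squared before it is multiplied by f, while in (e) the whole sum from (c) is squared.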


Measures of Central Tendency
Give a typical value that represents the data set.

Measures of Dispersion
Help us learn about the spread of a data set.

Measures of Position
Determine the position of a single value in relation to other values.

Measures of Central Tendency

At times, we require a common numerical value to aid in the description of a set of data. This typical value is often called an average value or a mean. We are searching for a single figure that, in some way, embodies the entirety of the set of data.

Measures of Central Tendency for Ungrouped Data

The mean, also called the arithmetic mean, is the most frequently used measure of central tendency.
• The mean for a sample consisting of n observations is
$$\bar{x} = \frac{\sum x}{n}$$
• The mean for a population consisting of N observations is
$$\mu = \frac{\sum x}{N}$$

Example: The number of 911 emergency calls classified as domestic disturbance calls in a large metropolitan location was recorded for thirty randomly selected 24-hour periods, with the following results. Find the mean number of calls per 24-hour period.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 38 47 58 22 29 56 40 46 38 19 49 50
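
As a quick check of this example, the sample mean can be computed directly from its definition (the sum of the observations divided by n). The list name `calls` is just an illustrative label for the thirty values above.

```python
# The thirty sampled values from the example.
calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

n = len(calls)       # number of observations in the sample
total = sum(calls)   # sum of all observations
x_bar = total / n    # sample mean

print(n, total, round(x_bar, 2))  # prints: 30 1168 38.93
```

So the mean is roughly 38.9 domestic disturbance calls per 24-hour period.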

• The median of a set of data is a value that divides the bottom 50% of the data from the top 50% of the data. To find the median of a data set, first arrange the data in increasing order. If the number of observations is odd, the median is the number in the middle of the ordered list. If the number of observations is even, the median is the mean of the two values closest to the middle of the ordered list.

Example: To find the median number of domestic disturbance calls per 24-hour period for the data in the last example,
first arrange the data in increasing order.

19 22 23 25 27 29 29 30 30 34 36 37 37 38 38
40 40 40 44 45 46 46 47 47 49 50 50 56 56 58
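
Because there are 30 observations (an even number), the median is the mean of the 15th and 16th ordered values. A minimal sketch of that step, with `calls` again standing in for the example data:

```python
from statistics import median

calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

ordered = sorted(calls)
n = len(ordered)                                    # 30, an even count
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])  # 15th and 16th values
med = sum(middle_pair) / 2                          # mean of the two middle values

print(middle_pair, med)  # prints: (38, 40) 39.0
```

The standard-library call `median(calls)` applies the same rule and gives the same result.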

• The mode is the value in a data set that occurs the most often. If no such value exists, we say that the data set has no mode. If two such values exist, we say the data set is bimodal. If three such values exist, we say the data set is trimodal.
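
One way to find the mode or modes is to count how often each value occurs. The helper below is a sketch; the name `find_modes` and the convention of returning an empty list when every value occurs only once are assumptions made for illustration.

```python
from collections import Counter

def find_modes(data):
    """Return the most frequent value(s), or [] if every value occurs once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so the data set has no mode
    return sorted(value for value, count in counts.items() if count == top)

calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

print(find_modes(calls))            # prints: [40]
print(find_modes([1, 1, 2, 2, 3]))  # prints: [1, 2]  (bimodal)
```

For the 911 call data, 40 occurs three times and no other value occurs more than twice, so the mode is 40.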

Measures of Central Tendency for Grouped Data

• The mean for grouped data is given by
$$\bar{x} = \frac{\sum mf}{n} \qquad \mu = \frac{\sum mf}{N}$$
where m represents the class marks and f represents the class frequencies.
• The median for grouped data is found by locating the value that divides the data into two equal parts; the class where it is located is called the median class.
• The modal class is defined to be the class with the maximum frequency. The mode for grouped data is defined to be the class mark of the modal class.

The table gives the frequency distribution of the daily commuting times (in minutes) from home to work for all 25
employees of a company. Calculate the mean of the daily commuting times.
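
The commuting-time table is not reproduced in this text, so the sketch below only illustrates the calculation. It reuses the m and f values from the earlier summation example as stand-in class marks and frequencies; the function name `grouped_mean` is an assumption.

```python
def grouped_mean(class_marks, frequencies):
    """Grouped-data mean: sum of m*f divided by the total frequency."""
    total_f = sum(frequencies)  # n for a sample, N for a population
    sum_mf = sum(m * f for m, f in zip(class_marks, frequencies))
    return sum_mf / total_f

# Illustrative values only (taken from the summation example, not from the
# commuting-time table): class marks m and class frequencies f.
m = [12, 15, 20, 30]
f = [5, 9, 10, 16]

print(grouped_mean(m, f))  # prints: 21.875
```

The same arithmetic serves for a sample or a population; only the symbol for the result changes.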

Relationships Among the Mean, Median, and Mode

For a very large data set, as the number of classes is increased (and the width of classes is decreased), the frequency polygon eventually becomes a smooth curve. Such a curve is called a frequency distribution curve or simply a frequency curve.

For a symmetric histogram and frequency distribution curve with one peak, the values of the mean, median, and mode are identical, and they lie at the center of the distribution. In this case, the data set is said to have a bell-shaped distribution or a mound-shaped distribution.

For a histogram and a frequency distribution curve skewed to the right, the value of the mean is the largest, that of the mode is the smallest, and the value of the median lies between these two. (Notice that the mode always occurs at the peak point.) The value of the mean is the largest in this case because it is sensitive to outliers that occur in the right tail. These outliers pull the mean to the right.

If a histogram and a frequency distribution curve are skewed to the left, the value of the mean is the smallest and that of the mode is the largest, with the value of the median lying between these two. In this case, the outliers in the left tail pull the mean to the left.

Find the mean, median, and mode for the following data sets, as well as the shape of the distribution.

Data set 1: 10, 12, 15, 15, 18, 20
Data set 2: 2, 4, 6, 15, 15, 18
Data set 3: 12, 15, 15, 24, 26, 28

Data Set   Mean   Median   Mode   Distribution Shape
1          15     15       15     Bell-shaped
2          10     10.5     15     Left-skewed
3          20     19.5     15     Right-skewed

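
The three data sets can be worked through with Python's `statistics` module. The helper `describe_shape` is a rough illustration of the ordering rules stated above (mean, median, and mode equal; mean smallest; mean largest) and is not a general test for skewness. For data set 3 the values sum to 120, so the mean is 120/6 = 20.

```python
from statistics import mean, median, multimode

data_sets = {
    1: [10, 12, 15, 15, 18, 20],
    2: [2, 4, 6, 15, 15, 18],
    3: [12, 15, 15, 24, 26, 28],
}

def describe_shape(avg, med, mode):
    """Classify the shape from the ordering of mean, median, and mode."""
    if avg == med == mode:
        return "bell-shaped"
    if avg < med < mode:
        return "left-skewed"
    if avg > med > mode:
        return "right-skewed"
    return "unclear"

for label, data in data_sets.items():
    avg, med = mean(data), median(data)
    mode = multimode(data)[0]  # each of these sets has a single mode
    print(label, avg, med, mode, describe_shape(avg, med, mode))
```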

Measures of Dispersion
Consider the following two data sets on the ages (in years) of all
workers working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40
years. However, the variation in the workers’ ages for each of these
two companies is very different. As illustrated in the diagram, the
ages of the workers at the second company have a much larger
variation than the ages of the workers at the first company.
Thus, the mean, median, or mode by itself is usually not a sufficient
measure to reveal the shape of the distribution of a data set. We also
need a measure that can provide some information about the
variation among data values. The measures that help us learn about
the spread of a data set are called the measures of dispersion.
Measures of Dispersion for Ungrouped Data

• The range for a data set is equal to the maximum value in the data set minus the minimum value in the data set. The range is reflective of the spread in the data set since the difference between the largest and the smallest value is directly related to the spread in the data.

• The variance and the standard deviation of a data set measure the spread of the data about the mean of the data set.
• The variance of a sample of size n is represented by $s^2$ and is given by
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
• The variance of a population of size N is represented by $\sigma^2$ and is given by
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
• The sample standard deviation is
$$s = \sqrt{s^2}$$
• The population standard deviation is
$$\sigma = \sqrt{\sigma^2}$$

Find the variance and standard deviation for these data.
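
The data for this exercise are not reproduced in the text. As a stand-in, the sketch below applies the standard definitional formulas to the ages of the workers at the two companies introduced at the start of this section. Function names are illustrative, and the population formula is used in the printout because each list covers all workers at its company.

```python
from math import sqrt

company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

def data_range(data):
    """Maximum value minus minimum value."""
    return max(data) - min(data)

def population_variance(data):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

for name, ages in [("Company 1", company_1), ("Company 2", company_2)]:
    var = population_variance(ages)
    print(name, data_range(ages), round(var, 2), round(sqrt(var), 2))
# prints:
# Company 1 12 17.14 4.14
# Company 2 52 349.2 18.69
```

Both companies have a mean age of 40, yet the second company's standard deviation is more than four times the first's, which matches the larger spread described earlier.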

Measures of Dispersion for Grouped Data

• The range for grouped data is the upper boundary of the class having the largest values minus the lower boundary of the class having the smallest values.

• The sample variance for grouped data is given by
$$s^2 = \frac{\sum f (m - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1}$$

• The population variance for grouped data is given by
$$\sigma^2 = \frac{\sum f (m - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N}$$

• The sample standard deviation for grouped data is
$$s = \sqrt{s^2}$$
• The population standard deviation for grouped data is
$$\sigma = \sqrt{\sigma^2}$$

The table gives the frequency distribution of the daily commuting times (in minutes) from home to work for all 25
employees of a company. Calculate the variance and standard deviation.
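
Since the commuting-time table is not reproduced here, the following sketch shows the shortcut formula for grouped variance on stand-in values: the m and f pairs from the earlier summation example, treated as class marks and frequencies. The function name and the `sample` flag are illustrative assumptions.

```python
from math import sqrt

def grouped_variance(class_marks, frequencies, sample=True):
    """Shortcut formula: (sum of m^2*f - (sum of m*f)^2 / n) / (n - 1),
    with divisor n instead of n - 1 when sample is False (population)."""
    n = sum(frequencies)
    sum_mf = sum(m * f for m, f in zip(class_marks, frequencies))
    sum_m2f = sum(m ** 2 * f for m, f in zip(class_marks, frequencies))
    numerator = sum_m2f - sum_mf ** 2 / n
    return numerator / (n - 1 if sample else n)

# Illustrative values only, not the commuting-time table.
m = [12, 15, 20, 30]
f = [5, 9, 10, 16]

pop_var = grouped_variance(m, f, sample=False)
print(pop_var, round(sqrt(pop_var), 2))  # prints: 50.109375 7.08
```

The exercise refers to all 25 employees, so with the real table the population form (`sample=False`) would be the natural choice.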

Z Scores

• A z score is the number of standard deviations that a given observation, x, is below or above the mean.
$$z = \frac{x - \bar{x}}{s} \ \text{(sample)} \qquad z = \frac{x - \mu}{\sigma} \ \text{(population)}$$

The mean salary for deputies in Douglas County is $27,500 and the standard deviation is $4,500. The mean salary for
deputies in Hall County is $24,250 and the standard deviation is $2,750. A deputy who makes $30,000 in Douglas
County makes $1,500 more than a deputy does in Hall County who makes $28,500. Which deputy has the higher salary
relative to the county in which he works?
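
The comparison comes down to computing one z score per deputy. A minimal sketch, with the function name `z_score` chosen for illustration:

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_douglas = z_score(30_000, 27_500, 4_500)
z_hall = z_score(28_500, 24_250, 2_750)

print(round(z_douglas, 2), round(z_hall, 2))  # prints: 0.56 1.55
```

The Hall County deputy's salary is about 1.55 standard deviations above the county mean, against about 0.56 for the Douglas County deputy, so the Hall County deputy has the higher salary relative to his county even though the dollar amount is lower.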

Measures of Position

A measure of position determines the position of a single value in relation to other values in a sample or a population data set.

Measures of Position: Quartiles and IQR

• Quartiles are the summary measures that divide a ranked data set into four equal parts. Three measures will divide any data set into four equal parts. These three measures are the first quartile (denoted by Q1), the second quartile (denoted by Q2), and the third quartile (denoted by Q3). The data should be ranked in increasing order before the quartiles are determined.
• The difference between the third quartile and the first quartile for a data set is called the interquartile range (IQR).

The table below gives the total compensations (in millions of dollars) for the year 2010 of
the 12 highest-paid CEOs of U.S. companies. That table is reproduced here:

a) Find the values of the three quartiles.


b) Find the interquartile range.
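
The CEO compensation table is not reproduced in this text, so the sketch below demonstrates the procedure on the 911 call data from the earlier examples instead. It assumes the common "median of each half" convention for Q1 and Q3; other textbooks and software use slightly different rules, so results can differ a little.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2:]   # values above the median position
    return median(lower), median(ordered), median(upper)

calls = [25, 46, 34, 45, 37, 36, 40, 30, 29, 37, 44, 56, 50, 47, 23,
         40, 30, 27, 38, 47, 58, 22, 29, 56, 40, 46, 38, 19, 49, 50]

q1, q2, q3 = quartiles(calls)
print(q1, q2, q3, q3 - q1)  # prints: 30 39.0 47 17
```

Here Q2 equals the median found earlier (39), and the interquartile range is Q3 - Q1 = 17.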

Measures of Position: Percentiles

• Percentiles are the summary measures that divide a ranked data set into 100 equal parts. The kth percentile is denoted by Pk, where k is an integer in the range 1 to 99. For instance, the 25th percentile is denoted by P25.

The table below gives the total compensations (in millions of dollars) for the year 2010 of the 12 highest-paid CEOs of
U.S. companies. That table is reproduced here:

a) Find the value of the 60th percentile.


b) Find the percentile rank for $26.5 million (2010 total compensation of Alan Mulally, CEO of Ford Motor).

NORMAL DISTRIBUTION
PROPERTIES; EMPIRICAL RULE; STANDARD NORMAL DISTRIBUTION

A normal distribution forms a bell-shaped curve that is
symmetric about a vertical line through the mean of the data.
PROPERTIES OF A NORMAL DISTRIBUTION
• The graph is symmetric about a vertical line through the mean of the distribution.
• The mean, median, and mode are equal.
• The y-value of each point on the curve is the percent (expressed as a decimal) of the data at the corresponding x-value.
• Areas under the curve that are symmetric about the mean are equal.
• The total area under the curve is 1.

Because a normal distribution is symmetric about the mean, the area under the curve to the right of the mean is one-half the total area. The total area under a normal distribution is 1, so the area under the curve to the right of the mean is 0.5.

In the normal distribution shown, the area of the shaded region is
0.159 units. This region represents the fact that 15.9% of the
data values are greater than or equal to 10. Because the area
under the curve is 1, the unshaded region under the curve has
area 1 - 0.159, or 0.841, representing the fact that 84.1% of the
data are less than 10.

EMPIRICAL RULE
In a normal distribution, approximately
• 68% of the data lie within 1 standard deviation of the mean.
• 95% of the data lie within 2 standard deviations of the mean.
• 99.7% of the data lie within 3 standard deviations of the mean.

A survey of 1000 U.S. gas stations found that the price charged for a gallon of regular gas could be closely approximated
by a normal distribution with a mean of $3.10 and a standard deviation of $0.18. How many of the stations charge
a. between $2.74 and $3.46 for a gallon of regular gas?
b. less than $3.28 for a gallon of regular gas?
c. more than $3.46 for a gallon of regular gas?
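
Each price in this example sits a whole number of standard deviations from the mean ($2.74 and $3.46 are 2 standard deviations away, $3.28 is 1 above), so the Empirical Rule applies directly. The sketch below is one way to lay out the arithmetic; the variable names are illustrative and the percentages are the approximate ones from the rule.

```python
stations = 1000
mean, sd = 3.10, 0.18

# Approximate Empirical Rule proportions.
within_1, within_2 = 0.68, 0.95

# a. $2.74 to $3.46 is the mean plus or minus 2 standard deviations.
between = round(stations * within_2)

# b. $3.28 is 1 standard deviation above the mean:
#    the lower half of the data plus half of the central 68%.
below = round(stations * (0.5 + within_1 / 2))

# c. Above $3.46: half of what lies outside 2 standard deviations.
above = round(stations * (1 - within_2) / 2)

print(between, below, above)  # prints: 950 840 25
```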

THE STANDARD NORMAL DISTRIBUTION
If the original distribution of x values is a normal distribution, then the corresponding distribution of z-scores will also be a normal distribution. This normal distribution of z-scores is called the standard normal distribution. The standard normal distribution is the normal distribution that has a mean of 0 and a standard deviation of 1.

Tables and calculators are often used to determine the area under a portion of the standard normal curve. We will refer to this type of area as an area of the standard normal distribution.

In the standard normal distribution, the area of the distribution from z = a to z = b represents
• the percentage of z-values that lie in the interval from a to b.
• the probability that z lies in the interval from a to b.

Find the area of the standard normal distribution
a. between z = -1.44 and z = 0
b. between z = -0.67 and z = 0
c. to the right of z = 0.82
d. to the left of z = -1.47
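
In place of a printed table, these areas can be read from the cumulative distribution function of the standard normal distribution. The sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

area_a = std_normal.cdf(0) - std_normal.cdf(-1.44)  # between z = -1.44 and z = 0
area_b = std_normal.cdf(0) - std_normal.cdf(-0.67)  # between z = -0.67 and z = 0
area_c = 1 - std_normal.cdf(0.82)                   # to the right of z = 0.82
area_d = std_normal.cdf(-1.47)                      # to the left of z = -1.47

for label, area in zip("abcd", (area_a, area_b, area_c, area_d)):
    print(label, round(area, 3))
```

The results should be close to 0.425, 0.249, 0.206, and 0.071 respectively; a four-place table may differ in the last digit because of rounding.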

A soda machine dispenses soda into 12-ounce cups. Tests show that the actual amount of soda dispensed is normally
distributed, with a mean of 11.5 oz and a standard deviation of 0.2 oz.
a. What percent of cups will receive less than 11.25 oz of soda?
b. What percent of cups will receive between 11.2 oz and 11.55 oz of soda?
c. If a cup is filled at random, what is the probability that the machine will overflow the cup?
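
A sketch of the soda-machine calculation, again using `statistics.NormalDist`. It assumes "overflow" means that more than 12 oz is dispensed into the 12-ounce cup.

```python
from statistics import NormalDist

soda = NormalDist(mu=11.5, sigma=0.2)  # ounces dispensed per cup

p_less = soda.cdf(11.25)                     # a. less than 11.25 oz
p_between = soda.cdf(11.55) - soda.cdf(11.2)  # b. between 11.2 oz and 11.55 oz
p_overflow = 1 - soda.cdf(12)                # c. more than 12 oz (assumed overflow)

for label, p in zip("abc", (p_less, p_between, p_overflow)):
    print(label, f"{p:.1%}")
```

Equivalently, convert each amount to a z score first (for example, 11.25 oz gives z = -1.25) and look up the area in a standard normal table. The answers come out to roughly 10.6%, 53.2%, and 0.6%.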

LINEAR REGRESSION AND CORRELATION
LEAST-SQUARES REGRESSION; LINEAR CORRELATION COEFFICIENT

The Least-Squares Regression Line
When performing research studies, scientists often wish to know
whether two variables are related. If the variables are determined to be
related, a scientist may then wish to find an equation that can be used
to model the relationship.
The least-squares regression line for a set of bivariate data is the line
that minimizes the sum of the squares of the vertical deviations from
each data point to the line. The equation of the least-squares line can be
used to predict the value of one variable when the value of the other
variable is known.

A geologist might want to know whether there is a relationship between the duration of an eruption of a geyser and the
time between eruptions. A first step in this determination is to collect some data. Data involving
two variables are called bivariate data. The table gives bivariate data showing the time between
two eruptions and the duration of the second eruption for 10 eruptions of the geyser Old Faithful.

Once the data are collected, a scatter diagram or scatter plot can be drawn, as shown

One way for the geologist to create a model of the relationship between the time between two eruptions and the duration of
the second eruption is to find a line that approximates the data points plotted in the scatter plot. There are many such lines
that can be drawn, as shown

Of all the possible lines that can be drawn, the one that is usually of most interest is called the line of best fit or the least-
squares regression line. The least-squares regression line is the line that fits the data better than any other line that might
be drawn. In the definition, the phrase “minimizes the sum of the squares of the vertical deviations” means that of all the
lines possible, the linear equation that minimizes the sum

$$d_1^2 + d_2^2 + \cdots + d_n^2$$

is the equation of the line of best fit. In this expression, each $d_n$ represents the vertical distance from data point n to the line.

Find the equations of the least-squares line for the given ordered pairs.
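
The ordered pairs for this exercise are not reproduced in the text. The sketch below therefore uses hypothetical pairs, made up purely for illustration, together with the standard closed-form expressions for the slope and intercept of the least-squares line y = ax + b and for the linear correlation coefficient r. The function name is an assumption.

```python
from math import sqrt

def least_squares_line(points):
    """Return slope a and intercept b of y = ax + b, and the correlation r."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)

    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - a * sum_x / n  # intercept: mean of y minus a times mean of x
    r = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return a, b, r

# Hypothetical ordered pairs (not from the slides).
pairs = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

a, b, r = least_squares_line(pairs)
print(round(a, 2), round(b, 2), round(r, 2))  # prints: 0.9 1.3 0.9
```

For these made-up pairs the fitted line is y = 0.9x + 1.3, and r = 0.9 indicates a fairly strong positive linear relationship. The function assumes the x values are not all equal; otherwise the denominator is zero.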
