Probability and Statistics_Y2Phys


PROBABILITY AND STATISTICS (MAT 2161)

LECTURER: Dr. Emelyne UMUNOZA GASANA
Email: emygasana@gmail.com

Year 2 Physics/Material Science (Gako)
FIRST SEMESTER 2024/25

Indicative Course content
1. Descriptive and Inferential statistical Analysis methods (Mean, Median, Mode, Range,
Inter quartile Range, Variance, Standard Deviation, Confidence Interval, Hypothesis
Testing).
2. Data presentation using appropriate graphs (Bar, Histogram, Frequency polygon, Pie)
and Tables (Frequency distribution, 2X2 Table).
3. Introduction to probability theory and spaces.
4. Application of different probability distributions (Binomial, Poisson, Chi-square,
Normal) according to the type of variable/data.
5. Inference using confidence interval and hypothesis testing.
6. Sample size determination and sampling techniques.
7. Tests of significance.
8. Simple linear regression and correlation.
9. Application of different statistical control charts.

ASSESSMENT COMPONENT BY TIME AND WEIGHT

Component             Weight (%)   Time
CAT 1 (Individual)    15           Decision to be made
CAT 2 (Individual)    15           Decision to be made
Group Assignments     10           Any time
Quiz                  10           Any time
Final exam            50           At the end of the Semester
Total                 100

INTRODUCTION TO STATISTICS
• Statistics is the branch of applied mathematics that deals with observational
data. It is the science that studies the methods of data collection,
presentation, analysis and interpretation.
 Methods of data collection:
Data can be collected through observation (gender), measurement
(weight), asking (age), counting (number of students in class) etc.
 Presentation:
Tables (such as frequency distribution, 2X2 table), Graphs (Bar,
Histogram, Pie, Frequency polygon, cumulative frequency polygon).
 Analysis:
 Descriptive method consists of the collection, organization,
summarization, and presentation of data through measure of central
tendency or location (such as mean, median and mode) and using
measures of variation or dispersion (such as range, interquartile range,
variance and standard deviation).
• Inferential method consists of generalization from samples to
populations, performing estimations and hypothesis tests, determining
relationships among variables, and making predictions. It uses
probability theory.
• Interpretation:
Giving meaning to the output of the data analysis.
• A population includes all objects of interest; it consists of all
subjects (human or otherwise) that are being studied.
• A sample is a portion of the population, i.e. a group of
subjects selected from a population.
Data:
Generally, we have two kinds of data:
 Primary data are those which are collected from the units or
individuals directly and these data have never been used for
any purpose earlier.
• Secondary data are data that were collected by some
individual or agency and statistically treated to draw certain
conclusions, and that are then reused to extract some
other information.
• A variable is a characteristic or attribute that can assume
different values.
• A random variable is a variable whose values are determined
by chance.

Classification of data/variable
 Numerical or Quantitative data is data where the
observations are numbers. For example, age, height, weight,
Systolic Blood Pressure measurement, number of children, etc.
• Numerical data can be either discrete or continuous.
  • Discrete if the number of possible values within every bounded
  range is finite. Examples: rolling dice (1, 2, …, 6), number of
  children (0, 1, 2, …).
  • Continuous if the value of the variable is not restricted to an
  integer. Examples: height, weight, temperature, Systolic Blood
  Pressure measurement, Diastolic Blood Pressure measurement, etc.
• Categorical or Qualitative data is data where the
observations are non-numerical. Examples: place of residence
(urban, rural), favorite color (blue, white, red, green, …),
marital status (single, married, …), disease status (positive,
negative), etc.

The simplest type of categorical variable is one,
which can take only two categories. Such a variable
is known as binary (or dichotomous). e.g. Sex
(Male/Female), Smoking status (Smoker/Non-
smoker), Disease status (Positive/Negative) etc.
Some qualitative variables can take more than two
values. e.g. Marital status (Single, married,
divorced, widowed), Disease severity (Low, mild,
moderate, high), etc.
Generally, a qualitative variable can be:
• Unordered if the categories may be listed in any
order, such as marital status (it does not involve
ranking).
• Ordered if the categories have a natural ordering,
such as disease severity (it involves ranking).
• Note: When using a qualitative variable, it is very important to
check that the provided categories are exhaustive and mutually
exclusive.
 Exhaustiveness: All categories included. For instance, if the
variable is "religion" and the only options are "Protestant",
“Catholic", and "Muslim", there are many other religions that
haven't been included. The list does not exhaust all
possibilities.
 On the other hand, if you exhaust all the possibilities with
some variables, religion being one of them, you would
simply have too many responses.
 The way to deal with this is to explicitly list the most
common attributes and then use a general category like
"Other" to account for all remaining ones.
• Mutual exclusiveness: no intersection between the given
categories, i.e. they cannot occur at the same time, e.g. when
tossing a coin, the result is head or tail, but never both;
employment status (employed/unemployed), etc.


FREQUENCY DISTRIBUTION
A frequency distribution table is more difficult to
construct for numerical data than for categorical
data because the scale of the observations must
first be divided into classes.
Therefore, the steps for constructing a frequency
distribution table for numerical data are as
follows:
i. Identify the largest and smallest observations.
ii. Subtract the smallest observation from the
largest to obtain the range of the data.
iii. Determine the number of classes.
iv. Divide the range of observations by the
number of classes to obtain the width of the
classes.
COLUMNS IN FREQUENCY
DISTRIBUTION TABLE
Frequency is the number of times that a
particular event (or category) occurs.
Relative frequency (percent) is the
proportion of observed responses in the
category to the total number of
observations.
Cumulative relative frequency is the
running total of the relative frequencies,
reading from top to bottom.
EXAMPLE
We asked the students what country their car is from (or no
car) and made a tally of the answers. Then we computed
the frequency and relative frequency of each category. The
relative frequency is computed by dividing the frequency by
the total number of respondents. The following table
summarizes the results.

Country   Frequency   Relative Frequency (%)
US        6           30
Japan     7           35
Europe    2           10
Korea     1           5
None      4           20
Total     20          100
Example:
The following are the marks out of 20 obtained by 50 students.
18 15 17 17 12 9 16 10 12 12
12 5 18 13 19 15 7 18 15 16
20 11 18 9 19 16 14 18 10 11
16 18 20 15 15 10 12 17 8 16
19 17 15 8 5 17 11 16 16 7
The test scores could be grouped into various classes:

Score (x)   Frequency (f)   Relative Frequency (%)   Cumulative Relative Frequency (%)
1-5         2               4                        4
6-10        9               18                       22
11-15       16              32                       54
16-20       23              46                       100
Total       50              100
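
The grouping above can be checked with a short script. This is only a sketch in Python: the helper name `frequency_table` and the list `classes` are illustrative choices, not part of the course material, and the class limits are simply those used in the table.

```python
# Marks out of 20 for the 50 students in the example above.
marks = [
    18, 15, 17, 17, 12, 9, 16, 10, 12, 12,
    12, 5, 18, 13, 19, 15, 7, 18, 15, 16,
    20, 11, 18, 9, 19, 16, 14, 18, 10, 11,
    16, 18, 20, 15, 15, 10, 12, 17, 8, 16,
    19, 17, 15, 8, 5, 17, 11, 16, 16, 7,
]

# Inclusive class limits, copied from the table above.
classes = [(1, 5), (6, 10), (11, 15), (16, 20)]


def frequency_table(values, class_limits):
    """Return rows of (label, frequency, relative %, cumulative relative %)."""
    total = len(values)
    rows = []
    cumulative = 0.0
    for low, high in class_limits:
        freq = sum(1 for v in values if low <= v <= high)
        relative = 100 * freq / total
        cumulative += relative  # running total of the relative frequencies
        rows.append((f"{low}-{high}", freq, relative, cumulative))
    return rows


table = frequency_table(marks, classes)
for label, freq, rel, cum in table:
    print(f"{label:>6} {freq:>4} {rel:>6.1f} {cum:>6.1f}")
```

The same function can be reused with other class limits, which makes it easy to see how the choice of classes changes the shape of the table.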
Two by Two Table (2X2)
 The measures of association between risk factor (exposure)
and outcome are often calculated from data presented in 2X2
table or in the general form of nxn Table.
 The following is a 2X2 table showing association between
exposure and outcome variables.

EXPOSURE   OUTCOME                        Total
           Positive (+)   Negative (-)
Yes        A              B              A+B
No         C              D              C+D
Total      A+C            B+D            A+B+C+D

EXERCISE

For the following results in the tables, construct a 2x2 table.

Table 1
Lung cancer                +ve  +ve  +ve  +ve  +ve  -ve  -ve  -ve  -ve  -ve
Cigarette smoking status   yes  yes  yes  no   no   no   no   no   yes  yes

Table 2
HIV/AIDS                   +ve  +ve  +ve  +ve  +ve  -ve  -ve  -ve  -ve  -ve
Condom use status          no   no   no   yes  yes  yes  yes  yes  no   no

GRAPHS

"A picture is worth a thousand words."

A distribution presented as a graph or chart
gives a more immediate message than a
frequency table does.
The type of graph/chart used depends on the
type of data you have. In general, if the data is
categorical or discrete, we use a bar or a pie
chart. If the data is continuous, a histogram or
frequency polygon is more appropriate.
BAR CHART

• For categorical variables, the frequency for each category is easily displayed in a bar chart.
Key points about Bar Chart:
• It is used to display qualitative (or discrete numerical) data.
• One bar represents one category, and the height of the bar equals its frequency.
• Each bar has the same width, and the bars are equally spaced.
• Bars should have a space between them to stress that they represent categorical data.
• The position of each category is arbitrary if the variable is unordered.
• It is important that the vertical axis of a bar chart starts at zero, to avoid distortion of true differences between frequencies.
• Example: Below is a bar chart for the car data. The height of each bar represents the frequency; when the bars are arranged in descending order of frequency, the bar graph is called a Pareto chart.
[Figure: bar chart of the car data]
PIE CHART

• Also called a circle graph, it is an alternative display for categorical data where the frequency of each category is represented by the angle at the center of each slice of the circle.
• To find the angles of each of the slices we use the formula
$\text{Angle} = \frac{\text{Frequency}}{\text{Total}} \times 360^\circ$
• For example, to find the angle for US cars we have
$\text{Angle} = \frac{6}{20} \times 360^\circ = 108^\circ$
• Example: We make a pie chart of the car data by placing wedges in the circle of proportionate size to the frequencies.
[Figure: pie chart of the car data]
HISTOGRAM

• For quantitative continuous variables we need a different type of plot from a bar chart. Instead, we use a histogram.
• A histogram is like a bar chart, but because we use it to display quantitative continuous variables there are no spaces between adjacent bars.
• Another important feature of a histogram is that it is the area of each bar, not the height, which is proportional to the frequency in each group.
• Key points about Histogram
  • The x-axis must be continuous, and there are no spaces between the bars.
  • The y-axis always begins at zero; this is important because relative comparisons are being made.
  • The area of each bar represents the frequency in each group.
  • The width of each bar is the size of the interval for each group.
EXAMPLE:
[Figure: three example histograms. Panels: Unimodal, Symmetric, Nonskewed; Nonsymmetric, Skewed right; Bimodal]
• The shape of a histogram can be unimodal if there is one hump, bimodal if there are two humps and multimodal if there are many humps.
• A histogram that is not symmetric is called skewed.
• If the right tail is longer than the left tail then it is positively skewed.
• If the right tail is shorter than the left tail then it is negatively skewed.
FREQUENCY POLYGON

• The relative frequency polygon can be constructed easily by joining the midpoints of the tops of the vertical bars of a histogram.
[Figure: Frequency polygon for respondent's age. x-axis: AGE, y-axis: Frequency]
• The cumulative relative frequency polygon can be constructed by first getting the cumulative relative frequency and then using line graphs.
[Figure: Cumulative frequency polygon for respondent's age. x-axis: AGE, y-axis: Cumulative Frequency]
MEASURE OF CENTRAL TENDENCY/LOCATION

One of the most important objectives of
statistical analysis is to get one single
value that describes the characteristics
of the entire data set. Such a value is called a
central value or average.
Those measures are: mean, median,
mode.
MEAN

In mathematical terms, it is given as:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
where $n$: sample size and $x_i$: observations.
• We use the Greek letter μ for the population mean instead of $\bar{x}$.
• Arithmetic mean from discrete frequency distribution:
$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{N} = \frac{1}{N} \sum_{i=1}^{n} f_i x_i$, where $N = \sum_{i=1}^{n} f_i$
• Notes: When you have class-interval type of data, to get the mean, variance and standard deviation you must first get the midpoint of each interval.
• Let m1, m2, …, mn be the midpoints of class intervals and f1, f2, …, fn be the corresponding frequencies; then the arithmetic mean is given by:
$\bar{X} = \frac{1}{N} \sum_{i=1}^{n} f_i m_i$
MEDIAN

One problem with using the mean is that it often does not depict the typical outcome: if there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome. Such an outcome is called an outlier. In this case an alternative measure is the median.

Median from individual observations
After arranging the data in ascending order,
• if the number of observations (n) is odd, then the median value is the ((n+1)/2)th observation.
• if n is even, then the median is the arithmetic mean of the (n/2)th and ((n/2)+1)th observations.
• Example: The following are daily milk yield (lit.) of eleven Friesian and ten Cross-breed cows respectively,
13, 11, 12, 10, 13, 11, 12, 15, 12, 9, 16
10, 7, 11, 9, 8, 10, 11, 7, 9, 12
Find the mean and median of daily milk production of the two breeds.
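
As a quick check of the milk-yield example, the mean and median can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median

# Daily milk yield (litres) from the example above.
friesian = [13, 11, 12, 10, 13, 11, 12, 15, 12, 9, 16]   # n = 11 (odd)
crossbreed = [10, 7, 11, 9, 8, 10, 11, 7, 9, 12]         # n = 10 (even)

friesian_mean = mean(friesian)
friesian_median = median(friesian)      # the ((n+1)/2)th = 6th ordered value
crossbreed_mean = mean(crossbreed)
crossbreed_median = median(crossbreed)  # mean of the 5th and 6th ordered values

print(f"Friesian:    mean = {friesian_mean:.2f}, median = {friesian_median}")
print(f"Cross-breed: mean = {crossbreed_mean:.2f}, median = {crossbreed_median}")
```

`median` sorts the data internally, so the lists do not need to be arranged in ascending order first.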
Median from discrete series
Example: From the following average lactation yield (lit.) of 122
cows, find the median (and mean).

Lactation yield (lit.)   4000   4500   5800   5060   6600   5380
No. of cows              24     26     16     20     6      30

Median from continuous frequency distribution
The median for a continuous frequency distribution is
$\text{Median} = L_i + \left( \frac{\frac{N}{2} - F_{i-1}}{f_i} \right) c_i$
where $L_i$ is the lower limit of the median class, $F_{i-1}$ is the cumulative
frequency of the classes before the median class, $f_i$ is the frequency of the median
class, $c_i$ is the width of the class interval and $N$ is the total
frequency.
Example: Estimate the median (and mean) from the following
grouped data.

Class    Frequency (f)   Cumulative frequency (F)
0-5      4               4
5-10     5               9
10-15    5               14    (Me: median class)
15-20    4               18
20-25    2               20
Total    20

$Me = 10 + \frac{10 - 9}{5} \times 5 = 10 + 1 = 11$
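
The grouped-data median can also be sketched in code. The function below applies the interpolation formula Median = L + ((N/2 - F) / f) × c to the table above; the function name `grouped_median` and its argument layout are assumptions made for illustration.

```python
def grouped_median(lower_limits, width, freqs):
    """Median of grouped data by linear interpolation inside the median class."""
    half = sum(freqs) / 2          # N / 2
    cumulative = 0                 # cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            # This is the median class: interpolate within it.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency table is empty")


lower_limits = [0, 5, 10, 15, 20]
freqs = [4, 5, 5, 4, 2]
width = 5

me = grouped_median(lower_limits, width, freqs)

# Grouped mean from class midpoints, for the "(and mean)" part of the example.
midpoints = [low + width / 2 for low in lower_limits]
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

print(f"median = {me}, mean = {grouped_mean}")
```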
MODE

• The mode of a set of data is the number with the highest frequency. For grouped data, the modal class is the class with the greatest frequency. There may of course be more than one mode or modal class.
• Prepare a frequency distribution and determine the modal class, i.e. the class containing the value which occurs most frequently. The mode is then
$\text{Mode} = L_i + \frac{f_i - f_{i-1}}{2 f_i - f_{i-1} - f_{i+1}} \times c_i$
where $L_i$ is the lower limit of the modal class, $f_i$ is the frequency of the modal class, $f_{i-1}$ is the frequency of the class just preceding the modal class, $f_{i+1}$ is the frequency of the class just following the modal class and $c_i$ is the class width.

Exercise: Find the value of mode from the following data on body weight (kg) of 60 calves:

Weight (kg)     93-98   98-103   103-108   108-113   113-118   118-123   123-128   128-133
No. of calves   2       5        12        17        14        6         3         1

MEASURE OF DISPERSION/VARIATION

• Suppose we want to compare the performance of students in three sections A, B, C, for which a random sample of 5 students is selected from each section, and we have the following marks obtained by the students in each section.
• Section A: 60, 60, 60, 60, 60. $\bar{x} = 60$
• Section B: 58, 59, 60, 61, 62. $\bar{x} = 60$
• Section C: 80, 20, 80, 70, 50. $\bar{x} = 60$
• All three sections have the same mean, yet the marks are spread very differently. Measures of dispersion measure how far the data is spread apart.
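
To make the comparison of the three sections concrete, the sketch below computes the mean and the sample standard deviation (divisor n - 1) for each section with the standard `statistics` module. The dictionary names are illustrative.

```python
from statistics import mean, stdev

# Marks of the 5 sampled students in each section (from the example above).
sections = {
    "A": [60, 60, 60, 60, 60],
    "B": [58, 59, 60, 61, 62],
    "C": [80, 20, 80, 70, 50],
}

# Map each section to (mean, sample standard deviation).
summary = {name: (mean(m), stdev(m)) for name, m in sections.items()}

for name, (avg, sd) in summary.items():
    print(f"Section {name}: mean = {avg}, standard deviation = {sd:.2f}")
```

The means are identical, while the standard deviations differ widely, which is exactly what a measure of central tendency alone cannot show.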
MEASURES OF DISPERSION

Range
• It is the difference between the largest (L) and the smallest (S) values.
Range (R) = L - S.
• The problem with this is that it reports the extreme values, while the actual distribution of all the values in between will not be summarized in any way.

Inter Quartile Range
IQR = 3rd Quartile - 1st Quartile
• After arranging the data in ascending order, the positions of the first (Q1) and third quartile (Q3) are the (n+1)/4th and 3(n+1)/4th values, respectively.
• For a frequency distribution
$Q_1 = L_i + \left( \frac{\frac{N}{4} - F_{i-1}}{f_i} \right) c_i, \qquad Q_3 = L_i + \left( \frac{\frac{3N}{4} - F_{i-1}}{f_i} \right) c_i$
Where,
• $L_i$ = lower limit of the first/third quartile class
• $c_i$ = class interval (difference between two consecutive lower or upper limits)
• $f_i$ = frequency of the first/third quartile class
• $F_{i-1}$ = cumulative frequency of the class preceding the first/third quartile class
• $N$ = total frequency
MEASURES OF DISPERSION

• Variance and Standard Deviation
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Steps to calculate Variance
1. Calculate the sample mean, x-bar.
2. Write a column in a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add the values in this column.
5. Divide by n-1, where n is the number of items in the sample. This is the variance.
6. To get the standard deviation we take the square root of the variance.
• Standard Deviation
$s = \sqrt{s^2}$
• The Variance of Grouped data
$s^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} f_i x_i^2}{n} - \bar{x}^2$
Here $n = \sum f_i$ is the total frequency; note that this grouped form divides by n, whereas step 5 divides by n-1.
COEFFICIENT OF VARIATION

• Example:
From the following Table calculate the mean, variance and
coefficient of variation.
Number of hours per week spent watching television by
students.

Hours    No. of students
10-14    2
15-19    12
20-24    23
25-29    60
30-34    77
35-39    38
40-44    8
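
A possible way to work this example is sketched below. It assumes the usual definition of the coefficient of variation, CV = (standard deviation / mean) × 100%, which is not spelled out on the slide, and it uses the grouped-data variance with divisor N (the total frequency); dividing by N - 1 instead would give a slightly larger variance.

```python
from math import sqrt

# Class limits (hours per week) and number of students, from the table above.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freqs = [2, 12, 23, 60, 77, 38, 8]

midpoints = [(low + high) / 2 for low, high in classes]
n_total = sum(freqs)

# Grouped mean: sum(f * m) / N
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n_total

# Grouped variance with divisor N (assumption: the grouped formula, not n - 1).
grouped_var = sum(f * (m - grouped_mean) ** 2
                  for f, m in zip(freqs, midpoints)) / n_total

# Coefficient of variation in percent (assumed definition: s / mean * 100).
cv_percent = 100 * sqrt(grouped_var) / grouped_mean

print(f"N = {n_total}, mean = {grouped_mean:.2f}, "
      f"variance = {grouped_var:.2f}, CV = {cv_percent:.1f}%")
```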
SKEWNESS

• Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
• A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
• The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
• Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right.
KURTOSIS

• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
• Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.
• Positive kurtosis indicates a "peaked" distribution and negative kurtosis indicates a "flat" distribution, relative to the normal distribution.

ELEMENTARY PROBABILITY THEORY
• Assume that an experiment can be repeated many times, with each repetition called a trial, and assume that one or more outcomes can result from each trial. Then the probability of a given outcome is the number of times that outcome occurs (favourable outcomes) divided by the total number of trials (total outcomes).
$\text{Prob}(x) = \frac{r}{n} = \frac{\text{Favourable outcomes}}{\text{Total outcomes}}$
• If the outcome is sure to occur, it has a probability of 1; if an outcome cannot occur, its probability is 0.
• Example: The probability of flipping a fair coin and getting tails is 0.50, or 50%. If a coin is flipped 10 times, there is no guarantee that exactly 5 tails will be observed; the proportion of tails can range from 0 to 1.
• An experiment is defined as any planned process of data collection.
• For an experiment we define an event to be any collection of possible outcomes.
• A conditional probability is the probability of one event given that another event has occurred.
• A simple event is an event that consists of exactly one outcome.
• In probability "OR" means the union, that is, either can occur, and "AND" means the intersection, that is, both must occur.
CHARACTERISTICS OF PROBABILITY
• P(E) is always between 0 and 1.
• The sum of the probabilities of all simple events must be 1.
• P(E) + P(not E) = 1
• If A and B are mutually exclusive, then
P(A or B) = P(A) + P(B)
• If A and B are not mutually exclusive, then
P(A or B) = P(A) + P(B) - P(A and B)
• If A and B are independent, then
P(A and B) = P(A)P(B)
• For conditional probability
P(A and B) = P(A|B) × P(B)
or
P(B and A) = P(B|A) × P(A)
• Example
A bag contains 80 balls of which 20 are red, 25 are blue and 35 are white. A ball is picked at random. What is the probability that the ball picked is:
(i) Red ball
(ii) Black ball
(iii) Red or Blue ball.
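
The bag-of-balls example can be worked with exact fractions. This is a small sketch; the helper `prob` is a hypothetical name, and the "red or blue" case uses the addition rule for mutually exclusive events stated above.

```python
from fractions import Fraction

# Contents of the bag in the example above.
bag = {"red": 20, "blue": 25, "white": 35}
total = sum(bag.values())  # 80 balls


def prob(colour):
    """Probability of drawing one ball of the given colour (0 if absent)."""
    return Fraction(bag.get(colour, 0), total)


p_red = prob("red")
p_black = prob("black")                     # no black balls: impossible event
p_red_or_blue = prob("red") + prob("blue")  # mutually exclusive events

print(p_red, p_black, p_red_or_blue)
```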
MUTUALLY EXCLUSIVE AND INDEPENDENT EVENTS

• A set of events is said to be mutually exclusive if the occurrence of one of the events precludes the occurrence of any of the other events.
• e.g. when tossing a coin, the events are a head or a tail; these are said to be mutually exclusive since the occurrence of heads, for instance, implies that tails cannot and has not occurred.
• Events are said to be independent when the occurrence of any of the events does not affect the occurrence of the other(s).
• e.g. the outcome of tossing a coin is independent of the outcome of the preceding or succeeding toss.
• Example
From a pack of playing cards what is the probability of:
(i) Picking either a 'Diamond' or a 'Heart'
(ii) Picking either a 'Flower' or an 'Ace'
EXERCISES

1. In a competitive examination, 30 candidates are to be selected. Of the 600
candidates appearing in a written test, 100 will be called for the interview.
(i) What is the probability that a person will be called for the interview?
(ii) What is the probability of a person getting selected if he has been called
for the interview?
(iii) What is the probability that a person is called for the interview and is
selected?
Solution:
Let event A be that the person is called for the interview and event B
that he is selected.
(i) $P(A) = \frac{100}{600} = \frac{1}{6}$
(ii) $P(B \mid A) = \frac{30}{100} = \frac{3}{10}$
(iii) $P(A \cap B) = P(A) \times P(B \mid A) = \frac{1}{6} \times \frac{3}{10} = \frac{3}{60} = \frac{1}{20}$

EXERCISE 2

From experience, a machine is known to be set up correctly on 90% of
occasions. If the machine is set up correctly, then 95% of the parts are
expected to be good, but if the machine is not set up correctly then the
probability of a good part is only 30%.
On a particular day, the machine is set up and the first component
produced is found to be good. What is the probability that the
machine is set up correctly?

Solution

Let CS denote a correct set-up, IS an incorrect set-up and GP a good part.
• Probability of getting a good part (GP) = P(CSGP or ISGP) = P(CSGP) + P(ISGP)
• Probability that the machine is correctly set up after getting a good
part = P(CS|GP) = P(CSGP) / P(GP)
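
The two solution steps amount to the law of total probability followed by Bayes' rule. A minimal numerical sketch follows; the variable names mirror the slide's abbreviations (CS = correctly set up, IS = incorrectly set up, GP = good part), which are read from context.

```python
# Probabilities given in the exercise.
p_cs = 0.90            # P(CS): machine set up correctly
p_gp_given_cs = 0.95   # P(GP | CS)
p_gp_given_is = 0.30   # P(GP | IS)

p_is = 1 - p_cs        # P(IS): machine not set up correctly

# Joint probabilities P(CS and GP), P(IS and GP).
p_csgp = p_cs * p_gp_given_cs
p_isgp = p_is * p_gp_given_is

# Total probability of a good part, then Bayes' rule.
p_gp = p_csgp + p_isgp
p_cs_given_gp = p_csgp / p_gp

print(f"P(GP) = {p_gp:.3f}, P(CS|GP) = {p_cs_given_gp:.4f}")
```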
EXERCISE 3

A machine comprises 3 transformers A, B and C.
The machine may operate if at least 2 transformers
are working. The probabilities of each transformer
working are as follows: P(A) = 0.6,
P(B) = 0.5, P(C) = 0.7
A mechanical engineer went to inspect the working
conditions of those transformers. Find the
probabilities of having the following outcomes:
i. Only one transformer operating
ii. Two transformers are operating
iii. All three transformers are operating
iv. None is operating
v. At least 2 are operating
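
One way to organise this exercise is to enumerate all eight working/failed combinations. The sketch below assumes the three transformers work independently of one another, which the exercise does not state explicitly; the names `p_working` and `dist` are illustrative.

```python
from itertools import product

# Probability that each transformer is working (from the exercise).
p_working = {"A": 0.6, "B": 0.5, "C": 0.7}

# dist[k] = probability that exactly k transformers are working,
# ASSUMING the transformers work independently of one another.
dist = {k: 0.0 for k in range(4)}
for states in product([True, False], repeat=3):
    p = 1.0
    for works, pw in zip(states, p_working.values()):
        p *= pw if works else 1 - pw
    dist[sum(states)] += p

p_at_least_two = dist[2] + dist[3]

for k, p in dist.items():
    print(f"exactly {k} working: {p:.2f}")
print(f"at least 2 working: {p_at_least_two:.2f}")
```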
PERMUTATIONS

• A permutation is an ordered arrangement of items, in which the order must be strictly observed.
• Example
Let x, y and z be any three items. Arrange these in all possible permutations.
• The number of permutations of "r" items taken from a sample of "n" items may be provided as
$^{n}P_{r} = \frac{n!}{(n-r)!}$
• Example
Calculate
i. $^{3}P_{3}$
ii. $^{5}P_{3}$
iii. $^{7}P_{5}$
• Example
There are 6 contestants for the post of chairman, secretary and treasurer. These positions can be filled by any of the 6. Find the possible number of ways in which the 3 positions may be filled.
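
The permutation examples can be checked with a few lines of Python. This sketch uses the standard formula nPr = n! / (n - r)!; the helper name `n_p_r` is illustrative, and `math.perm` (Python 3.8 or later) is used only as a cross-check.

```python
from itertools import permutations
from math import factorial, perm


def n_p_r(n, r):
    """Number of ordered arrangements of r items chosen from n: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)


# All ordered arrangements of the three items x, y and z.
arrangements = ["".join(p) for p in permutations("xyz")]

p_3_3 = n_p_r(3, 3)
p_5_3 = n_p_r(5, 3)
p_7_5 = n_p_r(7, 5)

# Chairman, secretary and treasurer chosen from 6 contestants (order matters).
officer_arrangements = n_p_r(6, 3)

print(arrangements)
print(p_3_3, p_5_3, p_7_5, officer_arrangements)
```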
COMBINATIONS

• A combination is a group of items in which order is not important.
• For a combination to hold at any given time it must comprise the same items; if a new item is added to the group or removed from the group then we have a new combination.
• e.g. 3 items x, y and z will have 6 different permutations but only one combination.
• Example
A committee of 5 people is to be selected from a group of 5 men and 6 women. If the selection is randomly done, find the probability of having the following combinations.
i. Three men and two women.
ii. At least one man and at least one woman must be in the committee.
iii. One particular man and one particular woman must not be in the committee (one man, four women).
SOLUTION

i. P(committee of three men and two women)
• The committee size = 5 people
• The group size = 5m + 6w
• Assuming no restrictions the committee can be selected in 11C5 ways
• The committee has to consist of 3m & 2w
• These may be selected as follows: 5C3 × 6C2
• P(comm 3m and 2w) = (5C3 × 6C2) / 11C5
ii. P(at least one man and at least one woman must be in the committee)
• The no. of possible combinations of selecting the committee without any woman = 5C5
• The probability of having a committee of five men only = 5C5 / 11C5
• The probability of having a committee of five women only = 6C5 / 11C5
• P(at least one man and at least one woman) = 1 - {P(no man) + P(no woman)}
iii. P(one particular man and one particular woman must not be in the committee) would be determined as follows
• The group size = 5m + 6w
• Committee size = 5 people
• Actual group size from which to select the committee = 4m + 5w
• Committee = 1m + 4w
• The committee may be selected in 9C5 ways
• The one man may be selected in 4C1 ways
• The four women may be selected in 5C4 ways
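
Parts i and ii of the committee problem can be evaluated with `math.comb`. This is only a sketch under the assumption that every 5-person committee is equally likely; exact fractions are used so that the results can be compared with hand calculations.

```python
from fractions import Fraction
from math import comb

# 5 men and 6 women; a committee of 5 is chosen at random.
total_committees = comb(11, 5)

# i. Three men and two women.
p_3m_2w = Fraction(comb(5, 3) * comb(6, 2), total_committees)

# ii. At least one man and at least one woman:
#     complement of "five men only" or "five women only".
p_all_men = Fraction(comb(5, 5), total_committees)
p_all_women = Fraction(comb(6, 5), total_committees)
p_mixed = 1 - (p_all_men + p_all_women)

print(total_committees, p_3m_2w, p_mixed)
```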
DISCRETE PROBABILITY DISTRIBUTIONS

Binomial Probability Distribution
• Binomial probability distribution is a set of probabilities for discrete events.
• Discrete events are those whose results or outcomes can be counted.
• e.g. in quality control activities, the binomial probabilities are frequently used, especially when determining the probability of having a certain no. of defective items in a given consignment.
• We call a distribution a binomial distribution if there is a fixed number of trials, n, which are all independent, the outcomes are binary (such as true or false, yes or no, success or failure, positive or negative), and the probability of success is the same for each trial.
• $P(r \text{ successes}) = {}^{n}C_{r} \, p^{r} q^{n-r}$
• Where p = Probability of success
r = no. of successes
n = sample size
q = 1 - p = Probability of failure

Example
A medical survey was conducted in order to establish the proportion of the population which was suffering from cancer. The results indicated that 40% of the population were suffering from the disease. A sample of 6 people was later taken and examined for the disease. Find the probability that the following outcomes were observed
i. Only one person had the disease
ii. Exactly two people had the disease
iii. At most two people had the disease
iv. At least two people had the disease
v. Three or four people had the disease
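
The medical-survey example can be evaluated directly from the binomial formula P(r successes) = nCr p^r q^(n-r) with n = 6 and p = 0.4. The function name `binomial_pmf` is an illustrative choice, not something defined in the course material.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 6, 0.4  # sample of 6 people, 40% of the population has the disease

p_exactly_one = binomial_pmf(1, n, p)
p_exactly_two = binomial_pmf(2, n, p)
p_at_most_two = sum(binomial_pmf(r, n, p) for r in range(0, 3))
p_at_least_two = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)
p_three_or_four = binomial_pmf(3, n, p) + binomial_pmf(4, n, p)

print(f"exactly one:   {p_exactly_one:.4f}")
print(f"exactly two:   {p_exactly_two:.4f}")
print(f"at most two:   {p_at_most_two:.4f}")
print(f"at least two:  {p_at_least_two:.4f}")
print(f"three or four: {p_three_or_four:.4f}")
```

"At least two" is computed through the complement, which avoids summing five separate terms.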
BINOMIAL MATHEMATICAL PROPERTIES

1. The mean or expected value, E(X) = np
2. The variance = npq
3. The standard deviation = $\sqrt{npq}$
Where: n = Sample Size
p = Probability of success
q = probability of failure = 1 - p

Examples
1. A firm is manufacturing 45,000 units of nuts. The probability of having a defective nut is 0.15. Calculate the following
i. The expected no. of defective nuts.
ii. The variance and standard deviation of the defective nuts in a daily consignment of 45,000.
2. A die is tossed 180 times. Find the mean and standard deviation of the random
POISSON PROBABILITY DISTRIBUTION

• This is a set of probabilities which is obtained for discrete events described as being rare.
• It applies to occasions like those of the binomial distribution, but with very low probabilities and large sample sizes.
• The formula used to determine the probability of exactly x occurrences is as follows
$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$
where
x = No. of successes
λ = mean no. of the successes in the sample (λ = np)
e = the base of natural logarithms, equal to 2.718

Example
A manufacturer assures his customers that the probability of having a defective item is 0.005. A sample of 1000 items was inspected. Find the probabilities of having the following possible outcomes
i. Only one is defective
ii. At most 2 defective
iii. More than 3 defective
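
For the defective-items example, the mean is λ = np = 1000 × 0.005 = 5. The sketch below evaluates the standard Poisson formula P(x) = e^(-λ) λ^x / x! for the three requested outcomes; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(exactly x occurrences) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)


n, p = 1000, 0.005
lam = n * p  # mean number of defective items in the sample

p_only_one = poisson_pmf(1, lam)
p_at_most_two = sum(poisson_pmf(x, lam) for x in range(0, 3))
p_more_than_three = 1 - sum(poisson_pmf(x, lam) for x in range(0, 4))

print(f"lambda = {lam}")
print(f"only one defective:    {p_only_one:.4f}")
print(f"at most 2 defective:   {p_at_most_two:.4f}")
print(f"more than 3 defective: {p_more_than_three:.4f}")
```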
POISSON MATHEMATICAL PROPERTIES

1. The mean or expected value = np = λ
2. The variance = np = λ
3. Standard deviation = $\sqrt{np} = \sqrt{\lambda}$
Where
n = Sample Size
p = Probability of success

Example
The probability of a rare disease striking a given population is
0.003. A sample of 10000 was examined.
Find the expected no. suffering from the disease and hence
determine the variance and the standard deviation for the
above problem.
PROBABILITY DISTRIBUTION FOR CONTINUOUS RANDOM VARIABLES

• In a continuous distribution, the variable can take any value within a specified range.
• The probability is represented by the area under the probability density curve between the given values.
• The uniform distribution, the normal probability distribution and the exponential distribution are examples of a continuous distribution.
• Examples of continuous variables are: distances, times, heights, capacity, etc.

Normal Distribution
• It is a special distribution that we will use just about every day for a continuous random variable.
• It can take on any value (not just integers, as do the binomial and Poisson distributions).
• It is symmetric about the mean, μ.
• The standard deviation, σ, is the horizontal distance between the mean and the point of inflection on the curve.
• The area under the curve is equal to 1.
• Since it is a symmetrical distribution, half the area is on the left of the mean and half is on the right.
• Its density is $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, with e = 2.718 and π = 3.1416.
THE STANDARD NORMAL DISTRIBUTION

• The Standard Normal distribution follows a normal distribution and has mean 0 and standard deviation 1.
[Figure: standard normal curve]
• The curve is perfectly symmetric about 0.
• If a distribution is normal but not standard, we can convert a value to the standard normal z-score by first finding how many standard deviations away the number is from the mean.
• The number of standard deviations from the mean is called the z-score and can be found by the formula:
$z = \frac{x - \mu}{\sigma}$
where
x = value to be standardized
z = standardization of x
µ = population mean
σ = standard deviation
• Often, we want to find the probability that a z-score will be less than a given value, greater than a given value, or in between two values.
EXAMPLE

1. The age of the subscribers to a newspaper has a normal distribution with mean 50 years and standard deviation 5 years. Compute
I. the percentage of subscribers who are less than 40 years old.
II. the percentage of subscribers who are between 40 and 60 years old.
2. A sample of students had a mean age of 35 years with a standard deviation of 5 years. A student was randomly picked from a group of 200 students. Find the probability that the age of the student turned out to be as follows
i. Lying between 35 and 40
ii. Lying between 30 and 40
iii. Lying between 25 and 30
iv. Lying beyond 45 years
v. Lying beyond 30 years
vi. Lying below 25 years
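
Instead of reading a z-table, the areas under the normal curve can be obtained from `statistics.NormalDist` in the Python standard library (Python 3.8 or later). The sketch below works Example 1 and two parts of Example 2; treating the student ages as normally distributed is an assumption, since Example 2 does not say so explicitly.

```python
from statistics import NormalDist

# Example 1: subscriber ages, mean 50 years, standard deviation 5 years.
subscribers = NormalDist(mu=50, sigma=5)

z_40 = (40 - 50) / 5                      # z-score of 40 years: -2.0
p_under_40 = subscribers.cdf(40)          # area to the left of 40
p_40_to_60 = subscribers.cdf(60) - subscribers.cdf(40)

# Example 2 (ASSUMING ages are normally distributed): mean 35, sd 5.
students = NormalDist(mu=35, sigma=5)
p_35_to_40 = students.cdf(40) - students.cdf(35)
p_beyond_45 = 1 - students.cdf(45)

print(f"z(40) = {z_40}")
print(f"P(age < 40)      = {100 * p_under_40:.2f}%")
print(f"P(40 < age < 60) = {100 * p_40_to_60:.2f}%")
print(f"P(35 < age < 40) = {p_35_to_40:.4f}")
print(f"P(age > 45)      = {p_beyond_45:.4f}")
```

The remaining parts of Example 2 follow the same pattern: subtract two `cdf` values for an interval, or use `1 - cdf` for an upper tail.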
CHI-SQUARE DISTRIBUTION

• Chi-square test is used to compare frequencies or proportions in two or more groups, especially for their independence.
• The total number of observations in each column and the total number of observations in each row are considered to be given or fixed (marginal frequencies).
• After assuming that the columns and rows are independent, we can calculate the number of observations expected to occur by chance (expected frequencies).
• Expected Frequency = (Row total X Column total)/Grand total
• Chi-square test compares the observed frequency in each cell with the expected frequency.
• If the two variables are independent, the observed frequencies will be very close to the expected frequencies; they will differ only by small amounts.
$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$
• Where d.f. = degree of freedom = (r-1)(c-1),
r and c are the number of rows and columns, respectively
Oi = Observed frequency
Ei = Expected frequency
H0: The two variables are independent/not associated
H1: The two variables are dependent/associated
• Reject H0 when the calculated value of chi-square is greater than the tabulated chi-square (obtained from the chi-square distribution table).
EXERCISE
The following table summarizes students’ knowledge of statistical computer software and their performance in the applied probability and statistics exam.
Calculate chi-square and make a conclusion about the independence of the two variables.

Software            Course performance
knowledge           Good     Bad     Total
Yes                 70       5       75
No                  10       15      25
Total               80       20      100
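One way to carry out the calculation for this exercise is sketched below. It assumes the usual statistic, the sum of (O − E)²/E over all cells, with expected frequency (row total × column total)/grand total, and the standard tabulated value 3.841 for 1 d.f. at the 5% level; the variable names are illustrative.

```python
# Observed counts from the exercise table
# rows: software knowledge (yes, no); columns: performance (good, bad)
observed = [[70, 5], [10, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
critical_5pct = 3.841  # tabulated chi-square, 1 d.f., 5% level
reject_h0 = chi_square > critical_5pct

print(f"chi-square = {chi_square:.2f}, d.f. = {df}, reject H0: {reject_h0}")
```

Here the calculated value is far above the tabulated one, so H0 (independence) would be rejected.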
STUDENT T-DISTRIBUTION
• The t distribution is similar in shape to the z-distribution and one of its major uses is to answer research questions about means.
• The t-distribution is symmetric and has a mean of 0, but its standard deviation is larger than 1.
• The precise size of the standard deviation depends on the degrees of freedom (d.f.), which are determined by the sample size.
• The t statistic is given by
  t = (x̄ − μ) / (S/√n)
  where
  x̄ = sample mean
  μ = population mean
  S = sample standard deviation
  n = sample size
• Here the hypothesis to be tested is
  H0: Population means among groups are equal
  H1: Population means among groups are not equal
• Reject H0 when the absolute value of the calculated t-value is greater than the tabulated t-value (obtained from the t-distribution Table)
EXAMPLE
The following table shows the weight of nine female students.
By using the t-test, make a conclusion about the equality of the selected female students’ mean weight with the female population mean weight of 55 kg.

Female student wt. in kg
55
50
50
50
55
50
60
55
50
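A possible way to work this example is sketched below, assuming the one-sample statistic t = (x̄ − μ)/(S/√n) with n − 1 = 8 d.f. and the two-sided 5% tabulated value 2.306; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

weights = [55, 50, 50, 50, 55, 50, 60, 55, 50]  # kg, from the table
mu0 = 55                                        # hypothesised population mean

n = len(weights)
x_bar = mean(weights)
s = stdev(weights)                  # sample standard deviation (divides by n - 1)
t = (x_bar - mu0) / (s / sqrt(n))   # one-sample t statistic

t_tabulated = 2.306                 # 8 d.f., 5% level, two-sided
reject_h0 = abs(t) > t_tabulated

print(f"mean = {x_bar:.2f}, s = {s:.2f}, t = {t:.3f}, reject H0: {reject_h0}")
```

Since |t| is below the tabulated value, the sample gives no evidence that the mean weight differs from 55 kg.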
STUDENT T-DISTRIBUTION FOR TWO POPULATIONS’ MEAN DIFFERENCES (INDEPENDENT T-TEST)
The following table shows age for nine male and female individuals. By using the t-test, make a conclusion about the equality of the selected male and female individuals’ mean age.
(Independent t-test) (t tabulated with 16 d.f. and at 5% level = 2.12)
Note: Here t is (x̄1 − x̄2) / S.E.(x̄1 − x̄2) with n1 + n2 − 2 d.f.

Age of Male     Age of Female
26              40
22              17
18              15
38              44
18              16
15              20
27              28
17              48
STUDENT T-DISTRIBUTION FOR MEANS DIFFERENCES FROM A SINGLE POPULATION (PAIRED T-TEST)
The following table shows SBP repeated measurements for nine individuals. By using the t-test, make a conclusion about the measurement difference. (paired t-test) (t tabulated with 8 d.f. and at 5% level = 2.306).
Note: Here you can use a one-sample t-test on the differences (di), with n − 1 d.f., comparing their mean to zero.

SBP1     SBP2
120      120
125      120
130      135
140      140
125      120
130      130
120      120
140      140
125      135
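Following the note, the paired test reduces to a one-sample t-test on the differences. A minimal sketch, assuming di = SBP1 − SBP2 and the tabulated value 2.306 quoted above (variable names are illustrative):

```python
from math import sqrt
from statistics import mean, stdev

sbp1 = [120, 125, 130, 140, 125, 130, 120, 140, 125]
sbp2 = [120, 120, 135, 140, 120, 130, 120, 140, 135]

# Differences between the paired measurements
d = [first - second for first, second in zip(sbp1, sbp2)]

n = len(d)
t = mean(d) / (stdev(d) / sqrt(n))   # compare the mean difference to zero

t_tabulated = 2.306                  # 8 d.f., 5% level, two-sided
reject_h0 = abs(t) > t_tabulated

print(f"mean difference = {mean(d):.3f}, t = {t:.3f}, reject H0: {reject_h0}")
```

The calculated |t| is well below 2.306, so there is not enough evidence of a difference between the two measurements.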
HYPOTHESIS TESTING
• A hypothesis is a claim or an opinion about an item or issue. Therefore, it has to be tested statistically in order to establish whether it is correct or not.
• Whenever testing a hypothesis, one must fully understand the 2 basic hypotheses to be tested, namely
  i. The null hypothesis (H0)
  ii. The alternative hypothesis (H1)
• Whenever we have a decision to make about a population characteristic, we make a hypothesis.
The null hypothesis
• This is the hypothesis being tested, the belief of a certain characteristic.
• e.g. Rwanda Bureau of Standards (RBS) may walk to a sugar making company with the intention of confirming that the 2kg bags of sugar produced are actually 2kg and not less. They conduct hypothesis testing with the null hypothesis being: H0: each bag weighs 2kg.
• The testing will set out to confirm this or to refute it.
HYPOTHESIS TESTING
Alternative hypothesis
• While formulating a null hypothesis we also consider the fact that the belief might be found to be untrue, hence we will reject it.
• We therefore formulate an alternative hypothesis which is a contradiction to the null hypothesis; thus when we reject the null hypothesis we accept the alternative hypothesis.
• In our example the alternative hypothesis would be
  H1: each bag does not weigh 2kg
• For the null hypothesis we always use equality, since we are comparing μ with a previously determined mean. For the alternative hypothesis, we have the choices: <, >, or ≠.
Procedures in Hypothesis Testing
When we test a hypothesis, we proceed as follows:
  i. Formulate the null and alternative hypothesis.
  ii. Choose a level of significance.
  iii. Determine the sample size.
  iv. Collect data.
  v. Calculate z (or t) score.
  vi. Utilize the table to determine if the z score falls within the acceptance region.
Decide to:
  a. Reject the null hypothesis and therefore accept the alternative hypothesis or
  b. Fail to reject the null hypothesis and therefore state that there is not enough evidence to suggest the truth of the alternative hypothesis.
ERRORS IN HYPOTHESIS TESTS
• We define a “type I error” as the event of rejecting the null hypothesis when the null hypothesis was true. The probability of a type I error (α) is called the significance level.
• We define a “type II error” (with probability ß) as the event of failing to reject the null hypothesis when the null hypothesis was false.
• Note: For a fixed sample size, larger α results in smaller ß, and smaller α results in larger ß.

                          Null hypothesis
Action                    True                 False
Reject                    Type I Error (α)     Correct
Fail to reject (accept)   Correct              Type II Error (ß)
REJECTION REGIONS
• Suppose that α = 0.05. We can draw the appropriate picture and find the z-scores that cut off an area of 0.025 in each tail, namely −1.96 and 1.96. We call the outside regions the rejection regions.
• We call the blue areas the rejection region since if the value of z falls in these regions, we can say that the null hypothesis is very unlikely, so we can reject the null hypothesis.
Note: Here our test statistic is Z = (x̄ − μ0) / (σ/√n) [TEST STATISTIC]. At the 5% level, reject the null hypothesis H0 if |Z| > 1.96.
Example
In the study of the piglets being fed the supplemented diet, we know that the mean weekly weight gain for our sample is 311.9 grams, and that this is based on 16 observations. We also know that we have assumed the population standard deviation, σ, to be 120 grams and that we want to test H0: μ = 200 grams.
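The piglet example can be worked through as follows. This is a sketch under the assumptions stated in the example (known σ, two-sided test at the 5% level, test statistic Z = (x̄ − μ0)/(σ/√n)); the variable names are illustrative.

```python
from math import sqrt

x_bar = 311.9   # sample mean weekly weight gain (grams)
mu0 = 200       # hypothesised population mean (grams)
sigma = 120     # assumed population standard deviation (grams)
n = 16          # number of observations

z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic

critical = 1.96                         # two-sided cutoff at the 5% level
reject_h0 = abs(z) > critical

print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

The statistic is 111.9/30 = 3.73, which falls in the rejection region, so H0: μ = 200 grams would be rejected.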
ONE SIDED VS TWO SIDED TESTS
• The statement of the alternative hypothesis in the above example was “not equal to (≠)”, that is, either higher than or lower than.
• This is called a two-sided test.
• If we were only interested in testing whether this diet would give a greater weight gain, we would have used a one-sided test, implying alternative hypothesis H1: μ > 200 g.
• For a one-sided test, we would not use the same cutoff as that of a two-sided test.
• For a one-sided test, we are only interested in a critical value or cutoff above which 5% of the distribution lies.
• Instead of 1.96, for a one-sided test, the cutoff above which 5% of the normal distribution lies is 1.645.
SMALL SAMPLE HYPOTHESIS TESTS FOR A NORMAL POPULATION
• When we have a small sample from a normal population, we use the same method as for a large sample, except we use the t-statistic instead of the z-statistic.
• Hence, we need to find the degrees of freedom (n − 1) and use the t-table in the book.
Hypothesis Testing for a Population Proportion
• The process is completely analogous with the mean, although we will need to use the standard deviation formula for a proportion.
• That is, we use √(pq/n), where q = 1 − p, instead of the formula for the mean, s/√n.
• If the sample is large, we can use the central limit theorem to say that the distribution of proportions is approximately normal.
POINT AND INTERVAL ESTIMATION
• Assume that we have a sample (x1, x2, …, xn) from a given population. All parameters of the population are known except some parameter.
• We want to determine, from the given observations, the unknown parameter. In other words, we want to determine a number or range of numbers from the observations that can be taken as a value.
  • Estimator – is a method of estimation.
  • Estimate – is a result of an estimator.
  • Point estimation – as the name suggests, is the estimation of the population parameter with one number.
• The problem of statistics is not to find estimates but to find estimators.
• An estimator is not rejected because it may give one bad result for one sample.
• It is rejected when it gives bad results in the long run.
• An estimator is accepted or rejected depending on its sampling properties.
• Every member of a population cannot be examined, so we use the data from a sample, taken from the same population, to estimate some measure, such as the mean, of the population itself.
• The sample will provide us with the best estimate of the exact 'truth' about the population.
• The method of sampling depends on the data available, but the ideal method, as every member of the population has an equal chance of being selected, is random sampling.
POINT AND INTERVAL ESTIMATION
• We estimate limits within which we are expecting the 'truth' about the population to lie and state how confident we are about this estimation.
• Therefore, there are two types of estimate of a population parameter:
  • Point estimate - one particular value;
  • Interval estimate - an interval centred on the point estimate.
• Point Estimate of Parameter (e.g. Mean): from the sample, a single value is calculated to serve as an estimate for the population parameter.
• The best estimate of the unknown population mean, μ, is the sample mean, that is,
  x̄ = Σxi / n
• The best estimate of the unknown population standard deviation, σ, is the sample standard deviation s, where:
  s = √[ Σ(xi − x̄)² / (n − 1) ]
PROPERTIES OF ESTIMATOR
• We want an estimator to have several desirable properties like
  • Consistency
  • Unbiasedness
  • Minimum variance
• In general, it is not possible for an estimator to have all these properties.
• Note that an estimator is a sample statistic, i.e. it is a function of the sample elements.
Consistency
• For many estimators, the variance of the sampling distribution of the estimator decreases as sample size increases.
• We would like the estimator to stay as close as possible to the parameter it estimates as sample size increases.
• Suppose we want to estimate θ, and θ̂n is an estimate based on n observations.
• If θ̂n tends to θ in probability as n increases, then the estimator is called consistent.
• A consistent estimator is not unique: if there is one consistent estimator then you can construct infinitely many others.
• For example, if θ̂n is consistent then [n/(n − 1)] θ̂n is also consistent.
Example: (1/n) Σxi and [1/(n − 1)] Σxi are both consistent estimators for the population mean.
PROPERTIES OF ESTIMATOR
Unbiasedness
• If an estimator θ̂ estimates θ, then the difference between them, (θ̂ − θ), is called the estimation error.
• Bias of the estimator is defined as the expectation value of this difference, that is:
  B = E(θ̂ − θ) = E(θ̂) − θ
• If the bias is equal to zero, then the estimator is called unbiased.
• For example, the sample mean is an unbiased estimator, since the expectation for the difference of the estimate and the parameter is equal to zero. That is:
  E(x̄ − μ) = E(x̄) − μ = 0
Sample variance
• The expectation value of the square of the difference between the estimator and the expectation of the estimator is called its variance.
• If tn is an estimator of θ, then
  Var(tn) = E[(tn − E(tn))²]
• In an ideal world we would like to have a minimum variance and unbiased estimator. But it is not always possible.
INTERVAL ESTIMATE OR CONFIDENCE INTERVAL
• Often it is more useful to quote two limits between which the parameter is expected to lie, together with the probability of it lying in that range.
• The limits are called the confidence limits and the interval between them is the confidence interval.
• The width of the confidence interval depends on three sensible factors:
  • the degree of confidence we wish to have in it, the chance of it including the 'truth', e.g. 95%;
  • the size of the sample, n;
  • the standard deviation of the underlying distribution.
Confidence interval for single population mean
• Confidence interval is a widely used tool to describe a population based on sample data.
• The idea here is to obtain an interval, based on sample statistics, that we can be confident contains the population parameter of interest.
• Applying the properties of the sampling distribution to the results of a single sample leads us to the confidence interval for the mean.
CONFIDENCE INTERVAL FOR SINGLE POPULATION MEAN
• This is an interval around the estimated mean which we can be confident contains the true population mean.
• A confidence interval extends either side of the sample mean by a multiple of the standard error.
• It is most common to calculate a 95% confidence interval; this extends 1.96 SE either side of the mean.
• Thus, a 95% confidence interval for a single population mean (μ) is calculated as follows:
  x̄ ± 1.96 [S.E.(x̄)]
  where
  • x̄ is the estimated mean
  • 1.96 is the value of Z for 95% confidence
  • S.E.(x̄) is the standard error of the estimate, s/√n
• Sometimes we may wish to use other confidence intervals such as 90% or 99% confidence intervals. For a 90% and 99% confidence interval the value 1.96 in the formula used previously becomes 1.645 and 2.58, respectively.
Exercise
Calculate and interpret a 90% and 95% confidence interval for the population mean height from a sample of 150 students, having sample mean height = 1.69 m and standard error SE(x̄) = 0.70 cm.
CONFIDENCE INTERVAL FOR TWO POPULATIONS MEANS DIFFERENCE
• Testing a hypothesis that a parameter equals some specified value (such as μ1 − μ2 = 0) can be done by determining whether or not 0 falls in the interval.
• Therefore, similar to the confidence interval for a single mean, a 95% confidence interval for a population mean difference (μ1 − μ2) is calculated as follows:
  (x̄1 − x̄2) ± 1.96 [S.E.(x̄1 − x̄2)]
  where
  • (x̄1 − x̄2) is the estimated mean difference
  • 1.96 is the value of Z for 95% confidence
  • S.E.(x̄1 − x̄2) is the standard error of the estimate, which is given as:
    S.E.(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)
Exercise
The following data show systolic blood pressure measurements in two population groups, one receiving and one not receiving appropriate treatment. Based on this information, find the 95% confidence interval for the difference between the population means of SBP measurements in the treated and untreated groups and interpret the result.

                      With treatment     Without treatment
Mean                  120 mmHg           140 mmHg
Standard deviation    10 mmHg            15 mmHg
Sample size           144                144
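The exercise can be sketched as below. The standard error formula is an assumption here, since it did not survive on the slide: the usual large-sample form √(s1²/n1 + s2²/n2) is used. Variable names are illustrative.

```python
from math import sqrt

# Summary statistics from the exercise table (mmHg)
mean_treated, sd_treated, n_treated = 120, 10, 144
mean_untreated, sd_untreated, n_untreated = 140, 15, 144

diff = mean_treated - mean_untreated
# Assumed large-sample standard error of the difference between two means
se_diff = sqrt(sd_treated**2 / n_treated + sd_untreated**2 / n_untreated)

z = 1.96   # multiplier for 95% confidence
lower, upper = diff - z * se_diff, diff + z * se_diff

print(f"difference = {diff} mmHg, SE = {se_diff:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f} mmHg")
```

Because the interval does not contain 0, the data are consistent with a real difference in mean SBP between the two groups.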
CONFIDENCE INTERVAL FOR SINGLE POPULATION PROPORTION
The 95% confidence interval for a single population proportion (p) is calculated as follows:
  p ± 1.96 [S.E.(p)]
  where
  • p is the estimated proportion
  • 1.96 is the multiplier
  • S.E.(p) is the standard error of the estimate = √[p(1 − p)/n]
Exercise
Suppose that it is known that in a certain population of women, 90% entering their 3rd trimester of pregnancy have had some prenatal care. A random sample of size 200 is drawn from the population in an informal settlement and it is found that 170 have had prenatal care at the beginning of their third trimester. From this, find the 95% confidence interval for the proportion of women in the informal settlement who have had some form of prenatal care by the third trimester and interpret the result.
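A short sketch of this exercise, assuming the standard error √[p(1 − p)/n] and the multiplier 1.96 (variable names are illustrative):

```python
from math import sqrt

n = 200
with_care = 170

p = with_care / n               # estimated proportion
se = sqrt(p * (1 - p) / n)      # standard error of the proportion
lower, upper = p - 1.96 * se, p + 1.96 * se

print(f"p = {p:.3f}, 95% CI: {lower:.3f} to {upper:.3f}")
```

The interval can then be compared with the 90% figure quoted for the wider population.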
CONFIDENCE INTERVAL FOR TWO POPULATION PROPORTIONS DIFFERENCE
Similarly, the 95% confidence interval for the difference between two population proportions (P1 − P2) is calculated as follows:
  (p1 − p2) ± 1.96 [S.E.(p1 − p2)]
  where
  • p1 and p2 are the estimated proportions
  • 1.96 is the multiplier
  • S.E.(p1 − p2) is the standard error of the estimate = √[p1(1 − p1)/n1 + p2(1 − p2)/n2]
Example
A study of teenage suicide included a sample of 96 boys and 123 girls between ages of 12 and 16 years selected scientifically from admissions records to a private psychiatric hospital. Suicide attempts were reported by 18 of the boys and 60 of the girls. We assume that the girls constitute a simple random sample from a population of similar girls and likewise for the boys. Construct a 95 percent confidence interval for the difference between the two proportions.
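The example can be worked as sketched below, assuming the unpooled standard error √[p1(1 − p1)/n1 + p2(1 − p2)/n2] and taking boys as group 1 and girls as group 2 (an arbitrary choice that only affects the sign).

```python
from math import sqrt

n_boys, attempts_boys = 96, 18
n_girls, attempts_girls = 123, 60

p1 = attempts_boys / n_boys      # proportion among boys
p2 = attempts_girls / n_girls    # proportion among girls

diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n_boys + p2 * (1 - p2) / n_girls)
lower, upper = diff - 1.96 * se, diff + 1.96 * se

print(f"p1 = {p1:.4f}, p2 = {p2:.4f}, difference = {diff:.4f}")
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```

Both limits are negative, so the interval excludes 0 and points to a lower proportion among the boys.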
Note
In general, the width of a confidence interval depends on:
• The confidence level (1−α): As (1−α) increases, so does the width of the interval. If we want to increase our confidence that the interval contains the parameter, we must increase the width of the interval.
• The sample size: The larger the sample size, the smaller the standard error of the estimator, and thus the narrower the interval.
• The standard deviation of the underlying distributions: If the standard deviations are large, then the standard error of the estimator is also large, and the interval is wider.
SAMPLE SIZE DETERMINATION
• One question statisticians are frequently asked is, “How large a sample must I take?” This is a fairly complicated question to answer and requires a number of assumptions.
• Therefore, the initial questions a statistician will ask are:
  • What do you mean by how large a sample?
  • How large a sample to do what? Because, in order to detect a very small difference, say 0.01, between two means, one will need a much larger sample than to detect a larger difference, say 1.0.
  • What one is trying to do is to use a sufficiently large sample so that the confidence interval for the mean will allow one to detect the required difference.
• In practice the sample size is usually fixed by the number of subjects available and the cost involved.
• In order to determine the sample size required, we need to specify the size of difference (d) that we want to detect. One can then decide whether this fulfills the expectations of the study.
• Using too large a sample creates more work and may waste money.
• Using too small a sample may result in a difference not being detected or not being detectable.
• In most cases where the sample size is too small, no information is gained from the study at all, which implies that the effort, money and subjects (material) involved are totally wasted.
• In the case where animals are involved, using too many animals is unethical.
SAMPLE SIZE CALCULATION FOR SINGLE MEAN FROM ITS CONFIDENCE INTERVAL
• The sample mean is used to estimate the population mean, and the Confidence Interval is used to determine how big or small the population mean is.
• s², the variance, is required before sample size calculation. This may be obtained from the literature, previous studies or a pilot study.
• d, the expected difference (size of effect), is a major factor in determining the sample size. The smaller the size of effect that you want to detect, the larger the sample size that is needed. Therefore, for a 95% C.I.
  n = (1.96 s/d)²
Example
If we want to estimate the mean SBP of Rwandan males and the standard deviation is known to be around 20 mmHg and we wish to estimate the true mean within 10 mmHg at 95% confidence, what will be the sample size?
Solution
• We are given s = 20 mmHg, d = 10 mmHg and z = 1.96
• Therefore, n = [(1.96 × 20)/10]² = 15.37, which rounds up to 16
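The solution above can be reproduced with a small helper. This is only a sketch of n = (z·s/d)² rounded up to the next whole subject; the function name `sample_size_mean` is hypothetical.

```python
from math import ceil

def sample_size_mean(s, d, z=1.96):
    """Sample size so that the z-based interval has half-width at most d."""
    return ceil((z * s / d) ** 2)   # always round up to a whole subject

n = sample_size_mean(s=20, d=10)    # SBP example: s = 20 mmHg, d = 10 mmHg
print(n)
```

Halving d to 5 mmHg roughly quadruples the required sample, which illustrates why the size of effect is a major factor.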
SAMPLE SIZE CALCULATION FOR SINGLE PROPORTION FROM ITS CONFIDENCE INTERVAL
• Research questions such as “What proportions are infected? What is the prevalence of HSV-2 in rural area? What is the sensitivity (or specificity) of a particular diagnostic test for disease x?” lead to the estimation of a proportion.
• To determine how big or how small the population proportion is likely to be, a confidence interval is calculated and the sample size for 95% confidence is given by
  n = (1.96/d)² p(1 − p)
• Thus, to determine the sample size required to estimate the proportion with the desired level of precision, some idea is required beforehand about the possible magnitude of the proportion.
Example
We wish to estimate the proportion of males who smoke in a given country. What sample size do we require to achieve a 95% confidence interval of width ± 5% (that is, to be within 5% of the true value)? A study some years ago found approximately 30% were smokers.
Solution
• We take p = 0.30, d = 0.05 and z = 1.96
• n = (1.96/0.05)² × 0.3 × (1 − 0.3) = 322.69, rounded up to 323 men
• If we anticipate a 75% response rate, then we need to sample 323/0.75 = 431 men
• If we had no idea what the prevalence of smoking is likely to be we would use p = 0.50 to give n =
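The arithmetic in this solution can be checked with the sketch below, assuming n = (z/d)² p(1 − p) rounded up; the function name `sample_size_proportion` is hypothetical.

```python
from math import ceil

def sample_size_proportion(p, d, z=1.96):
    """Sample size to estimate a proportion p to within +/- d."""
    return ceil((z / d) ** 2 * p * (1 - p))

n = sample_size_proportion(p=0.30, d=0.05)          # prior estimate of 30% smokers
n_adjusted = ceil(n / 0.75)                         # allow for a 75% response rate
n_no_prior = sample_size_proportion(p=0.50, d=0.05) # no prior idea of the prevalence

print(n, n_adjusted, n_no_prior)
```

Using p = 0.50 gives the largest value of p(1 − p), so it yields the most conservative (largest) sample size for a given d.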
POWER OF A TEST
• There is a further concept that we should include in these sample size calculations, which concerns the power of the test. This is defined as the probability of detecting a difference that really exists, and is denoted by 1 − β.
• In the above, we have only considered the probability α of rejecting the null hypothesis when it is really true, that is, the chance of wrongly concluding that a difference exists when it does not really exist. This probability is called the type I error.
• There is a second kind of error that could be made, namely failing to reject the null hypothesis when it is really false. In other words, failing to detect a difference that really exists. This probability is called the type II error, and is usually denoted by β.
• Therefore, the power of the test is defined as (1 − β), which is the probability of making a correct decision and rejecting H0 when it is false.
• In order to determine the power of a test, we need to specify the size of the difference between the means that one is interested in.
• We would usually like to have α small, so that there is a small chance of wrongly saying there is an effect when there is not, and (1 − β) large, so that there is a good chance of detecting a difference that really exists.
POWER OF A TEST
• Therefore, the power must also be included in the sample size calculations, since it is of little use being relatively sure that the sample is large enough not to reject the null hypothesis falsely, without also being sure that the sample size is large enough not to miss a difference that is really there.
• This is done by replacing the critical value Zα by (Zα + Zβ). That is:
  • n = (Zα + Zβ)² s²/d² for a single mean study and;
  • n = (Zα + Zβ)² p(1 − p)/d² for a single proportion study.
Notes
• Zα is the value of the standard normal distribution cutting off probability α in one tail for a one-sided alternative or α/2 in each tail for a two-sided alternative, and Zβ is the value of the standard normal distribution cutting off probability β in the upper (or right hand) tail.
• Commonly used values for Zα and Zβ are Zα = 1.96 for α = 0.05 (two tailed) and Zβ = 0.84 for 80% power or Zβ = 1.28 for 90% power.
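As an illustration, the single-mean formula with power can be applied to the earlier SBP setting (s = 20 mmHg, d = 10 mmHg). This is a sketch using the tabulated values quoted in the notes; the function name `sample_size_mean_with_power` is hypothetical.

```python
from math import ceil

def sample_size_mean_with_power(s, d, z_alpha=1.96, z_beta=0.84):
    """n = (Z_alpha + Z_beta)^2 * s^2 / d^2, rounded up."""
    return ceil((z_alpha + z_beta) ** 2 * s**2 / d**2)

n_80 = sample_size_mean_with_power(s=20, d=10)               # 80% power
n_90 = sample_size_mean_with_power(s=20, d=10, z_beta=1.28)  # 90% power

print(n_80, n_90)
```

Compared with the 16 subjects obtained from the confidence interval alone, asking for 80% or 90% power raises the requirement considerably.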
PROBABILITY SAMPLING METHODS (RANDOM)
• The best way to ensure that a sample will lead to reliable and valid inferences is to use probability samples, in which the probability of being included in the sample is known for each subject in the population.
• The four commonly used probability-sampling methods are:
  • Simple random sampling
  • Systematic sampling
  • Stratified sampling
  • Cluster sampling.
Simple random sampling
• This is one in which every subject has an equal probability of being selected.
• It is the simplest form of probability sampling.
• The recommended way to select a simple random sample is to use a table of random numbers or a computer-generated list of random numbers.
• To select a simple random sample, you need to make a numbered list of all the units in the population from which you want to draw a sample or use an already existing one (sampling frame).
  • Decide on the size of the sample
PROBABILITY SAMPLING METHODS (RANDOM)
Systematic random sampling
• It is an alternative to simple random sampling that is useful in some cases.
• The items or individuals of the population are arranged in some order. A random starting point is selected (by lottery) and then every kth member of the population is selected.
• k is determined by dividing the total number of items in the sampling frame by the desired sample size.
• For example, if there are N = 1000 stores along Fifth Avenue and we want to select n = 100 stores in the sample, k = N/n or 10. We pick one of the first k stores at random, say #4, and then systematically select every kth store from there.
Stratified random sampling
• It is one in which the population is first divided into relevant strata (subgroups), and a random sample is then selected from each stratum proportionally.
• Characteristics used to stratify should be related to the measurement of interest, in which case stratified random sampling is the most efficient.
• e.g. 70% males, 30% females. If a sample of 10 is selected (n = 10), 70% of n = 7, so select 7 males and 3 females.
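Both schemes can be sketched in a few lines. The code below is an illustration only, assuming Python's standard `random` module; the function names, the numbered store list and the labelled population are hypothetical stand-ins for a real sampling frame.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start among the first k units, then take every kth unit."""
    k = len(frame) // n          # sampling interval, k = N / n
    start = rng.randrange(k)     # random position within the first k units
    return frame[start::k][:n]

def stratified_sample(strata, n, rng):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        size = round(n * len(units) / total)   # proportional allocation
        sample[name] = rng.sample(units, size)
    return sample

rng = random.Random(2161)   # arbitrary fixed seed so the sketch is repeatable

stores = list(range(1, 1001))                   # N = 1000 numbered stores
chosen_stores = systematic_sample(stores, 100, rng)

population = {
    "male": [f"M{i}" for i in range(70)],       # 70% males
    "female": [f"F{i}" for i in range(30)],     # 30% females
}
chosen_people = stratified_sample(population, 10, rng)

print(chosen_stores[:5])
print({name: len(units) for name, units in chosen_people.items()})
```

With simple rounding the stratum sizes may not always add up to n for less convenient proportions, so a real design would need a rule for allocating the remainder.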


PROBABILITY SAMPLING METHODS (RANDOM)
Cluster random sampling
• It is a survey procedure in which the sampling units consist of a group of elements known as a cluster.
• Information on clusters is used in making inferences about a population.
• The scheme of selection of these clusters could be simple random sampling or systematic sampling.
• Clusters are usually formed by grouping units which can be conveniently observed together. These could either be mutually exclusive or overlapping. In practice, however, it is more convenient to use clusters that are non-overlapping.
• In cluster sampling, the first task is to specify the appropriate clusters. This is then followed by composing a sampling frame which lists all the clusters in the population.
• A simple/systematic random sample of clusters is then selected.
• In practice, the selection could either be carried out in a single stage, two stages or multiple stages. At each of these stages, the sample sizes could either be equal or unequal, known or unknown.
• The three major advantages of using cluster sampling in some surveys are:
  • The sampling frame from which the list of individual elements can be obtained may not be available or may not be easily obtained, but that of the clusters may be available and can be obtained easily.
  • It is more expensive to construct a list of elements in a population compared to a list of clusters. Greater costs may be incurred in taking a sample of elements as compared to clusters of elements.
  • The use of cluster sampling usually requires fewer personnel.
NON-RANDOM SAMPLING
• Non-probability samples are those in which the probability that a subject is selected is unknown.
• Non-probability samples often reflect selection biases of the person doing the study and do not fulfill the requirements of randomness needed to estimate sampling errors. Examples: convenience samples or quota samples.
Convenience sampling
• That is, one chooses a sample that is convenient.
• Examples:
  • When checking oranges packed in boxes, only oranges from the top layer are inspected, which are often the better oranges.
  • Examining plants that are easily accessible, for example, found near a road. Soil near a road has usually been disturbed and differs from soil further from the road.
  • Sampling patients coming to a specific clinic on a specific day. That is, if the clinic survey is in the morning, you may miss children that are at school.
NON-RANDOM SAMPLING
Quota sampling
• It is often used for opinion surveys.
• That is, each interviewer is told to collect opinions from a specified number of, for example, males and females of certain ages.
• This is fraught with danger, as interviewers often have difficulty in making up their quota and may ask relatives or friends to masquerade as the desired case.
• Even if they don’t just re-label the people interviewed, they are likely to interview friends or acquaintances, who are likely to be more similar to each other than the people in the general population.
• A related method, sometimes useful when one has no list of the population and where the population is hard to find, is called snowball sampling. That is, once you have identified some respondents, you ask them to provide you with contact details of other respondents.
QUIZ 1 /10.
1. What do we mean by a skewed distribution (to the right or left)? How can we know that a given variable follows a normal distribution?
2. In hypothesis testing, what do we mean by Type II error?
3. An insurance company takes a keen interest in the age at which a person is insured. Consequently, a survey conducted on prospective clients indicated that for clients having the same age the probability that they will be alive in 30 years’ time is 2/3. If a sample of 5 people was insured now, answer the following questions:
  i. Why is the Binomial probability distribution applied here?
  ii. What is the expected number of people to be alive in 30 years from now?
  iii. Find the probability that at least 2 people are alive.
SCATTER DIAGRAM
• If data is given in pairs, then the scatter diagram of the data is just the points plotted on the xy-plane.
• The scatter plot is used to visually identify relationships between the first and the second entries of paired data.
• The following scatter plot represents the age versus size of a plant. It is clear from the scatter plot that as the plant ages, its size tends to increase.
• If it seems to be the case that the points follow a linear pattern well, then we say that there is a high linear correlation, while if it seems that the data do not follow a linear pattern, we say that there is no linear correlation.
• If the data somewhat follow a linear path, then we say that there is a moderate linear correlation.
SIMPLE LINEAR REGRESSION
• Linear regression is applicable when one has collected data on two or more variables and wants to quantify a relationship between the Response (dependent) and Predictor (independent) variables.
• Therefore, regression is used:
  • To predict the value of one variable from the other variables
  • To examine the actual relationship between variables
  • To determine trends in data.
• The simplest form of regression is simple linear regression, with a single predictor variable.
Examples:
  • Predicting weight from height.
  • Relating sales to advertising expenditure.
• Given a scatter plot, we can draw the line that best fits the data. To find the equation of a line, we need the slope and the y-intercept.
• We will write the equation of the line as:
  y = a + bx
  where
  • a is the y-intercept and b is the slope.
  • x is the independent or predictor variable and
  • y is the dependent or response variable.
LEAST SQUARE ESTIMATION
• Least squares can be interpreted as a method of fitting data.
• The best fit in the least-squares sense is that instance of the model for which the sum of squared residuals has its least value.
• A residual is the difference between an observed value and the value given by the model.
• Let ŷi = a + b xi be the value predicted by the line for xi.
• The least squares method determines the line that minimizes the sum of squared vertical differences between the actual and predicted values of the outcome variable, yi and ŷi, respectively:
  S = Σ (yi − ŷi)² = Σ (yi − a − b xi)²
• S is minimized through partial derivatives of the equation with respect to a and b.
• The two resulting equations are set equal to zero to locate the minimum values; these two equations in two unknowns are solved simultaneously to obtain the formulas for a and b.
• Then the formulas for the sample estimates “a” and “b” of “α” and “β” will be:
  b = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
  a = ȳ − b x̄
MODEL ASSUMPTIONS

 For the simple linear regression model
$Y_i = \alpha + \beta X_i + \varepsilon_i$
 where
 the $Y_i$'s are the measurements on the dependent variable,
 the $X_i$'s are the measurements on the independent or predictor variable, and
 α and β are the parameters in the linear regression model that we want to estimate.
 The $\varepsilon_i$'s, or error terms, make allowance for the random scatter about the line; in other words, we are allowing for the fact that there is variability in our sample.
 Most of the assumptions are
• The errors $\varepsilon_i$ are symmetric, which means the deviations from the model are equally likely to be positive or negative.
• The error for any particular case is not related to the value of the predictor variable. This means we assume that the variability of the errors is the same over the whole range of the regression, which we quantify by saying that the error variance is constant.
• The error for one case is not affected by the error for another case; in other words, the errors are independent.
• The distribution of the error terms is normal or symmetric.
• The dependent variable $Y_i$ is a continuous variable.
• The only assumption made about the independent variable is that you
95
LEAST SQUARE ESTIMATION

 For the simple linear regression model,
 $Y_i = \alpha + \beta X_i + \varepsilon_i$
 the following hypotheses can be tested:
H0: α = 0   H1: α ≠ 0
H0: β = 0   H1: β ≠ 0
The test statistics are:
• $t = (a - 0)/SE_a$ and $t = (b - 0)/SE_b$, respectively,
 where $SE_a$ and $SE_b$ are the standard errors of the estimates a and b.
Example
 Suppose that a study was done to determine the weight loss after taking various amounts of a diet pill in combination with exercise. If the regression line was
 Y = 3 + 2X
 where X denotes the grams of the pill per day and Y represents the weight loss, then we can say that with only the exercise and no pill the average weight loss is 3 pounds.
 We can also say that if a person takes an additional gram of the pill, then on average that person should expect to lose an additional 2 pounds. If a person takes 5 grams, then that person can expect to lose an average of 3 + 2(5) = 13 pounds.
96
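The two t statistics above can be computed directly from paired data. The following is a minimal sketch in Python, not the course's own procedure. It assumes the usual OLS standard errors, $SE_b = s/\sqrt{S_{xx}}$ and $SE_a = s\sqrt{1/n + \bar{x}^2/S_{xx}}$ with $s^2 = SS_{Res}/(n-2)$, which are not spelled out in the slide. The function name `ols_t_statistics` is hypothetical, and the five (hours, wage) pairs are the ones used in the worked example of these notes.

```python
import math


def ols_t_statistics(x, y):
    """Return (a, b, t_a, t_b) for the fitted line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx            # slope estimate
    a = y_bar - b * x_bar      # intercept estimate
    # Residual variance, using n - 2 degrees of freedom (assumed convention)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_res / (n - 2)
    se_b = math.sqrt(s2 / s_xx)
    se_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))
    return a, b, (a - 0) / se_a, (b - 0) / se_b


hours = [6, 4, 3, 7, 5]
wage = [7, 6, 5, 8, 6]
a, b, t_a, t_b = ols_t_statistics(hours, wage)
print(f"a = {a:.2f}, b = {b:.2f}, t_a = {t_a:.2f}, t_b = {t_b:.2f}")
```

Each t value would then be compared with a critical value of the t distribution with n - 2 degrees of freedom to decide whether to reject H0.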
EXAMPLE

The following table shows individuals' working hours and wages.
Determine the regression equation of wage on working hours using the OLS
estimation technique, and hence estimate the wage of an individual whose
working hours were 5.6.

Individual ID    Working Hours    Wage
1                6                7
2                4                6
3                3                5
4                7                8
5                5                6
97
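One way to work this example is sketched below in Python. It applies the standard least squares formulas $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$ to the table above. The function name `fit_line` is hypothetical and chosen only for illustration.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


hours = [6, 4, 3, 7, 5]   # working hours from the table
wage = [7, 6, 5, 8, 6]    # wages from the table

a, b = fit_line(hours, wage)
predicted_wage = a + b * 5.6
print(f"wage = {a:.2f} + {b:.2f} * hours")
print(f"estimated wage at 5.6 hours: {predicted_wage:.2f}")
```

By hand, $\bar{x} = 5$, $\bar{y} = 6.4$, $S_{xx} = 10$ and $S_{xy} = 7$, so b = 0.7, a = 6.4 - 0.7(5) = 2.9, and the estimated wage at 5.6 hours is 2.9 + 0.7(5.6) = 6.82.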
COEFFICIENT OF DETERMINATION ($R^2$)

 From the linear regression line, the residual is generally given by:
$y_i - y'_i = y_i - (a + b x_i)$
 We define the coefficient of determination $R^2$ as an indication of how linear the data are:
$R^2 = 1 - \frac{SS_{Res}}{SS_{Tot}} = \frac{SS_{Reg}}{SS_{Tot}}$
where
• $SS_{Tot} = \sum (y - \bar{y})^2$
• $SS_{Res} = \sum (y - y')^2$
• $SS_{Reg} = \sum (y' - \bar{y})^2$
• $SS_{Tot} = SS_{Res} + SS_{Reg}$
 $R^2$ has the following properties:
 $R^2$ is between 0 and 1.
 If $R^2 = 1$ then all points lie on a line (perfectly linear).
 If $R^2 = 0$ then the regression line is a useless indicator for predicting y values.
 If we multiply $R^2$ by 100%, we arrive at the percent of the observed variation attributable to the linear relationship.
98
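The sums of squares above can be checked numerically. The sketch below is a minimal Python illustration, reusing the working-hours and wage data from the earlier example of these notes. The function name `sums_of_squares` is hypothetical.

```python
def sums_of_squares(x, y):
    """Return (ss_tot, ss_res, ss_reg) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]                       # the y' values
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
    return ss_tot, ss_res, ss_reg


hours = [6, 4, 3, 7, 5]
wage = [7, 6, 5, 8, 6]
ss_tot, ss_res, ss_reg = sums_of_squares(hours, wage)
r_squared = 1 - ss_res / ss_tot
print(f"SSTot = {ss_tot:.2f}, SSRes = {ss_res:.2f}, SSReg = {ss_reg:.2f}")
print(f"R^2 = {r_squared:.4f}")
```

For these data SSTot = 5.2, SSRes = 0.3 and SSReg = 4.9, so both forms of the formula give $R^2$ = 4.9/5.2, about 0.94: roughly 94% of the observed variation in wage is attributable to the linear relationship with working hours.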
COEFFICIENT OF DETERMINATION ($R^2$)

Note
 Although a good regression will give a high $R^2$, a high $R^2$ does not necessarily mean a good fit.
 Consider the plot below: its $R^2$ is high (98.8%), but the plot shows one point far away from the others.
 Such a point is often termed an influential point, because it has a great effect on the estimation of the regression line.
 A low $R^2$ does not necessarily mean that there is no relation. The relation could be very strong but nonlinear, such as a semi-circle relation, which gives an $R^2$ of zero.
99
CORRELATION COEFFICIENT (R)

 In many cases, more than one variable has been measured on each unit, such as an animal, plant or object. So, if there are several variables of interest, one is frequently interested in correlations between these variables.
 In the simple linear regression model discussed previously, we may want to determine not just whether the variables are linearly related, but also whether the relationship is positive or negative (b > 0 or b < 0).
 One method of examining the relationship between two continuous variables (such as weight and height) is to look at
 For the sample this is defined as:
 $r_{xy} = \frac{S_{xy}}{S_x S_y}$
 where $S_x$ and $S_y$ are the standard deviations of x and y, and $S_{xy}$ is the covariance between x and y, defined as:
$S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
100
CORRELATION COEFFICIENT (R)
 The correlation coefficient measures the strength of the linear relation between x and y.
 If the relationship between x and y is positive (direct), then when the x value is higher the y value is also very likely to be higher.
 For a negative (indirect) relationship, one would find that when y is higher, x will be lower and vice versa.
 A correlation coefficient of zero does not necessarily imply that there is no relationship between the variables; a zero correlation coefficient would also be obtained if the plot of x versus y showed a semi-circle.
 Similarly, a correlation coefficient near one does not imply a near perfect linear relation: if there is one point in the data set that is far away from the rest of the points, the correlation coefficient may be near 1 or -1.
 In simple linear regression, the square of the correlation coefficient ($r^2$) is equal to the coefficient of determination ($R^2$).
• If r < 0 then the variables are negatively correlated.
• If r > 0 then the variables are positively correlated.
We say that the correlation is
• Strong if |r| > 0.8,
• Moderate if 0.5 < |r| < 0.8, and
• Weak otherwise.
101
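The definition of r and the semi-circle remark can be illustrated with a short Python sketch. It assumes sample standard deviations and covariance computed with the n - 1 divisor (any consistent divisor gives the same r). The function names `pearson_r` and `strength`, and the semi-circle points, are hypothetical choices for illustration.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / (S_x * S_y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


def strength(r):
    """Label |r| using the thresholds quoted in these notes."""
    if abs(r) > 0.8:
        return "strong"
    # Assumption: |r| exactly equal to 0.8 is counted as moderate here
    if abs(r) > 0.5:
        return "moderate"
    return "weak"


# Working-hours and wage data from the earlier example
r_lin = pearson_r([6, 4, 3, 7, 5], [7, 6, 5, 8, 6])

# Points on the upper half of the unit circle: a perfect but nonlinear relation
xs = [i / 10 for i in range(-10, 11)]
ys = [math.sqrt(1 - xv ** 2) for xv in xs]
r_semi = pearson_r(xs, ys)

print(f"linear data: r = {r_lin:.3f} ({strength(r_lin)})")
print(f"semi-circle: r = {r_semi:.3f} ({strength(r_semi)})")
```

The wage data give r of about 0.97, whose square matches the coefficient of determination of the fitted line, while the semi-circle gives r of essentially zero even though y is completely determined by x.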
STATISTICAL CONTROL CHARTS

 The control chart is also known as the Shewhart chart or process-behavior chart.
 The control chart is a statistical process control tool used to determine whether a manufacturing or business process is in a state of statistical control or not.
 For this purpose, Shewhart set 3-sigma limits.
 If the chart indicates that the process is currently under control, then it can be used with confidence to predict the future performance of the process.
 In this case all points will plot within the control limits.
 If the chart indicates that the process being monitored is not in control, the pattern it reveals can help determine the source of variation to be eliminated to bring the process back into control.
 In this case a point falls outside the limits established for a given control chart.
 Those responsible for the underlying process are expected to determine whether a special cause has occurred. If one has, then that cause should be eliminated if possible.
 It is known that even when a process is in control, there is approximately a 0.3% probability of a point exceeding the 3-sigma control limits.

104
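The 0.3% figure can be checked with a short calculation. The Python sketch below assumes that the plotted statistic is normally distributed, which the slide does not state explicitly. The function names `normal_cdf` and `prob_outside` are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_outside(k):
    """Probability that a normal value falls more than k sigma from the mean."""
    return 2 * (1 - normal_cdf(k))


p3 = prob_outside(3)
print(f"P(point outside 3-sigma limits) = {100 * p3:.2f}%")
```

Under this assumption the probability is about 0.27%, which rounds to the 0.3% quoted above: roughly 1 false alarm in every 370 plotted points.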
STATISTICAL CONTROL CHARTS

 Generally, a control chart consists of:
• Points representing measurements of a quality characteristic in samples taken from the process at different times, i.e. the data.
• A center line, drawn at the process characteristic mean, which is calculated from the data.
• Upper and lower control limits that indicate the threshold at which the process output is considered statistically 'unlikely'.
Key terms of control chart
• Out of Control: the process may not be performing correctly.
• In Control: the process may be performing correctly.
• UCL: upper control limit.
• LCL: lower control limit.
• Average value: average (arithmetic mean).

105
PROCESS OUT/IN OF CONTROL

Process is out of control if there are:
• One or multiple points outside the control limits.
• Eight points in a row above the average value.
• Multiple points in a row near the control limits.
Process is in control if:
• The sample points fall between the control limits.
• There are no major trends forming, i.e. the points vary, both above and below the average value.

106
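Two of the out-of-control rules above are easy to automate. The following Python sketch is only an illustration: the function name `out_of_control_signals`, the sample values and the limits are all hypothetical. The third rule (points "near" the control limits) is left out because the slide does not say how near is near.

```python
def out_of_control_signals(points, center, lcl, ucl, run_length=8):
    """Return a list of (index, reason) pairs for two out-of-control rules."""
    signals = []
    # Rule 1: any point outside the control limits
    for i, p in enumerate(points):
        if p < lcl or p > ucl:
            signals.append((i, "outside control limits"))
    # Rule 2: run_length consecutive points above the average value
    run = 0
    for i, p in enumerate(points):
        run = run + 1 if p > center else 0
        if run == run_length:
            signals.append((i, f"{run_length} points in a row above the average"))
    return signals


# Hypothetical measurements with center line 10 and limits 7 and 13
points = [10.1, 9.8, 10.2, 10.3, 10.1, 10.4, 10.2, 10.5, 10.3, 10.1, 13.5]
signals = out_of_control_signals(points, center=10, lcl=7, ucl=13)
for index, reason in signals:
    print(f"point {index}: {reason}")
```

An empty result corresponds to the "in control" case: every point lies between the limits and no long run has formed on one side of the average.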
TYPES OF CONTROL CHARTS

 There are two types of control charts:
• Variable Control Charts
• Attribute Control Charts
Variable control charts
 They deal with items that can be measured.
 Examples: weight, height, speed, volume, time, money, length, width, depth, etc.
 Types of variable control charts are:
 X-Bar ($\bar{X}$) chart
 R chart
 MA chart
X-Bar chart:
It deals with the average value in a process.
 The horizontal axis denotes the sample number. The vertical axis depicts the sample mean (X-Bar) for a series of lots or subgroup samples.
 It has a centerline represented by X-double bar ($\bar{\bar{X}}$), which is simply the overall process average, as well as two horizontal lines, one above and one below the centerline, known as the upper control limit (UCL) and lower control limit (LCL), respectively.
 These lines are drawn at a distance of plus and minus three standard deviations (i.e. standard deviations of the sampling distribution of sample means) from the process average.
107
 Note: standard error for $\bar{X}$ = $s/\sqrt{n}$
108
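A minimal Python sketch of the X-Bar chart lines follows. It assumes equal-sized subgroups and takes s to be the pooled within-subgroup standard deviation (the square root of the average subgroup variance); the slide does not say which estimate of s is intended, and in practice tabulated control-chart factors are often used instead. The function name `xbar_chart_limits` and the pH readings are hypothetical.

```python
import math
import statistics


def xbar_chart_limits(subgroups):
    """Return (lcl, center, ucl) for an X-Bar chart of equal-sized subgroups."""
    n = len(subgroups[0])                          # observations per subgroup
    means = [statistics.mean(g) for g in subgroups]
    center = statistics.mean(means)                # X-double bar
    # Assumption: s is the pooled within-subgroup standard deviation
    variances = [statistics.variance(g) for g in subgroups]
    s = math.sqrt(statistics.mean(variances))
    se = s / math.sqrt(n)                          # standard error of X-Bar
    return center - 3 * se, center, center + 3 * se


# Hypothetical pH readings: four subgroups of three batches each
ph_subgroups = [
    [5.0, 5.1, 4.9],
    [5.2, 5.0, 5.1],
    [4.8, 5.0, 4.9],
    [5.1, 5.1, 5.0],
]
lcl, center, ucl = xbar_chart_limits(ph_subgroups)
print(f"LCL = {lcl:.3f}, center = {center:.3f}, UCL = {ucl:.3f}")
```

Each subgroup mean would then be plotted against the sample number and compared with these three lines.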
VARIABLE CONTROL CHARTS

 R chart: It takes into account the range of the values.
 The R-chart is sensitive to changes in variation in the measurement process. It consists of:
 Vertical axis = the range for each sub-group; Horizontal axis = sub-group designation.
 In addition, horizontal lines are drawn at the mean range value and at the upper and lower control limits.
 Note: standard error for R, where D3 and D2 are the factors for control limits provided according to the number of observations in the sample.
 MA chart: It takes into account the moving average of a process.
 It is important if we are mostly interested in detecting small trends across successive sample means.

109
ATTRIBUTE CONTROL CHARTS

 Attribute control charts monitor the quality attributes (which can be counted) of a process to determine if the process is performing in or out of control. Examples: number of defects, mistakes, errors, injuries, etc.
 Types of attribute control charts are:
 P-chart
 C-chart
 U-chart
P-chart: a chart of the percent defective in each sample set.
 The P chart is sensitive to changes in the proportion of defective items in the measurement process. The "P" in P chart stands for the p (the proportion of successes) of a binomial distribution. The P control chart consists of: Vertical axis = the percentage of defectives for each sub-group; Horizontal axis = the sub-group designation.
 Note: standard error for p = $\sqrt{\bar{p}(1 - \bar{p})/n}$, where $\bar{p}$ is the overall proportion defective and n is the sub-group size.
C-chart: a chart of the number of defects per unit in each sample set.
 The C-chart is sensitive to changes in the number of defective items in the measurement process. The "C" in C-control chart stands for "counts", as in defectives per lot.
 $\bar{c}$ = total number of defects / total number of units
 The C-control chart consists of: Vertical axis = the number defective for each sub-group;
 Horizontal axis = sub-group designation.
110
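The P-chart lines can be sketched in Python as follows. The sketch assumes equal sample sizes, the binomial standard error $\sqrt{\bar{p}(1-\bar{p})/n}$ and 3-sigma limits, with the limits clipped to the interval from 0 to 1 because a proportion cannot fall outside it. The function name `p_chart_limits` and the inspection counts are hypothetical.

```python
import math


def p_chart_limits(defectives, sample_size):
    """Return (lcl, p_bar, ucl) for a P chart with equal sample sizes."""
    total_inspected = len(defectives) * sample_size
    p_bar = sum(defectives) / total_inspected      # overall proportion defective
    se = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    lcl = max(0.0, p_bar - 3 * se)                 # a proportion cannot be below 0
    ucl = min(1.0, p_bar + 3 * se)                 # or above 1
    return lcl, p_bar, ucl


# Hypothetical inspection results: defective items found in 5 lots of 100
defectives = [4, 6, 5, 3, 7]
lcl, p_bar, ucl = p_chart_limits(defectives, sample_size=100)
print(f"LCL = {lcl:.4f}, center = {p_bar:.4f}, UCL = {ucl:.4f}")
```

With these counts the center line is 0.05, the lower limit is clipped to 0 and the upper limit is about 0.115, so a lot with 12 or more defectives out of 100 would plot above the UCL.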

ATTRIBUTE CONTROL CHARTS

U-chart: a chart of the average number of defects in each sample set.
 The U-chart is sensitive to changes in the normalized number of defective items in the measurement process.
 Normalized means that the number of defectives is divided by the unit area.
 In U-chart, U stands for "units", as in defectives per lot.
 The U control chart consists of:
 Vertical axis = the normalized number of defectives (number of defectives/area for sub-group = u) for each sub-group;
 Horizontal axis = sub-group designation.
 $\bar{u}$ is the total number of defects divided by the total area (i.e., the sum of the areas variable) and A is the area corresponding to a given sub-group.
 Note: standard error for u = $\sqrt{\bar{u}/A}$.
111
CONTROL CHARTS

Steps to construct a statistical control chart
• Determine what type of data you are working with.
• Determine what type of control chart to use with your data set.
• Calculate the average and the control limits.
Calculating the major lines in a control chart
• Average Value: take the average of the sample data.
• UCL: multiply the standard deviation by three, then add that value to the Average Value.
• LCL: multiply the standard deviation by three, then subtract that value from the Average Value.

112
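The three lines described above can be computed in a few lines of Python. This is a sketch only: it assumes individual measurements and the sample standard deviation (n - 1 divisor), which the slide does not specify. The function name `control_lines` and the pH readings are hypothetical.

```python
import statistics


def control_lines(values):
    """Return (lcl, average, ucl) using the mean plus or minus three standard deviations."""
    average = statistics.mean(values)
    sd = statistics.stdev(values)      # assumption: sample standard deviation
    return average - 3 * sd, average, average + 3 * sd


# Hypothetical pH readings from successive batches
ph_readings = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 5.0]
lcl, average, ucl = control_lines(ph_readings)
print(f"LCL = {lcl:.3f}, average = {average:.3f}, UCL = {ucl:.3f}")
```

For charts of sample means or proportions, the standard deviation in this calculation would be replaced by the standard error of the plotted statistic.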
EXERCISES

1. A manufacturer of hair products needs to monitor the pH of its
shampoo. The normal pH of hair is slightly acidic, in the range of
4.5 – 5.5. To maintain hair strength and vitality, shampoo and
conditioning products should be pH balanced with hair. In order
to ensure quality, 40 separate output batches are measured at
regular time intervals and their pH recorded. Based on the
collected information, use control charts to monitor this process.
2. A clothing manufacturer employs quality inspectors to track the
manufacturing process. From each lot produced at the factory,
the inspectors take a sample of clothes and count the number of
clothes that are unacceptable. For the collected information, use
control charts to track the proportion of defective clothes.

113
X-BAR (MEAN) CHART

114
R-CHART

115
P-CHART

116
