Probability and Statistics_Y2Phys
STATISTICS
(MAT 2161)
Indicative Course content
1. Descriptive and inferential statistical analysis methods (Mean, Median, Mode, Range,
Interquartile Range, Variance, Standard Deviation, Confidence Interval, Hypothesis
Testing).
2. Data presentation using appropriate graphs (Bar, Histogram, Frequency polygon, Pie)
and Tables (Frequency distribution, 2X2 Table).
3. Introduction to probability theory and spaces.
4. Application of different probability distributions (Binomial, Poisson, Chi-square,
Normal) according to the type of variable/data.
5. Inference using confidence interval and hypothesis testing.
6. Sample size determination and sampling techniques.
7. Tests of significance.
8. Simple linear regression and correlation.
9. Application of different statistical control charts.
Component    Time    Weight (%)
Total                100
INTRODUCTION TO STATISTICS
Statistics is the branch of applied mathematics concerned with observational
data. It is the science of the methods of data collection,
presentation, analysis and interpretation.
Methods of data collection:
Data can be collected through observation (gender), measurement
(weight), asking (age), counting (number of students in class) etc.
Presentation:
Tables (such as frequency distribution, 2X2 table), Graphs (Bar,
Histogram, Pie, Frequency polygon, cumulative frequency polygon).
Analysis:
The descriptive method consists of the collection, organization,
summarization, and presentation of data using measures of central
tendency or location (such as mean, median and mode) and
measures of variation or dispersion (such as range, interquartile range,
variance and standard deviation).
The inferential method consists of generalization from samples to
populations, performing estimations and hypothesis tests, determining
relationships among variables, and making predictions. It uses
probability theory.
Interpretation:
Giving meaning to the output of the data analysis.
Population includes all objects of interest; it consists of all
subjects (human or otherwise) that are being studied.
A sample is a portion of the population, i.e. a group of
subjects selected from a population.
Data:
Generally, we have two kinds of data:
Primary data are those which are collected from the units or
individuals directly and these data have never been used for
any purpose earlier.
Secondary data are data that have already been collected by some
individual or agency and statistically treated to draw certain
conclusions. The same data are then used again to extract some
other information.
A variable is a characteristic or attribute that can assume
different values.
The simplest type of categorical variable is one,
which can take only two categories. Such a variable
is known as binary (or dichotomous). e.g. Sex
(Male/Female), Smoking status (Smoker/Non-
smoker), Disease status (Positive/Negative) etc.
Some qualitative variables can take more than two
values. e.g. Marital status (Single, married,
divorced, widowed), Disease severity (Low, mild,
moderate, high), etc.
Generally, a qualitative variable can be:
Unordered if the categories may be listed in any
order, such as marital status (it does not involve
ranking).
Ordered if the categories have a natural
ordering, such as disease severity (it involves
ranking).
Note: When using a qualitative variable, it is very important to
check that the provided categories are exhaustive and mutually
exclusive.
Exhaustiveness: all categories are included. For instance, if the
variable is "religion" and the only options are "Protestant",
"Catholic", and "Muslim", there are many other religions that
haven't been included. The list does not exhaust all
possibilities.
On the other hand, if you exhaust all the possibilities with
some variables, religion being one of them, you would
simply have too many responses.
The way to deal with this is to explicitly list the most
common attributes and then use a general category like
"Other" to account for all remaining ones.
Mutual exclusiveness: no intersection between the given
categories, i.e. they cannot occur at the same time. E.g. when
tossing a coin, the result is head or tail, but never both;
employment status (employed/unemployed), etc.
FREQUENCY DISTRIBUTION
Frequency distribution table is more difficult to
construct for numerical data than for categorical
data because the scale of the observations must
first be divided into classes.
Therefore, the steps for constructing a frequency
distribution table for numerical data are as
follows:
i. Identify the largest and smallest observations.
ii. Subtract the smallest observation from the
largest to obtain the range of the data.
iii. Determine the number of classes.
iv. Divide the range of observations by the
number of classes to obtain the width of the
classes.
COLUMNS IN FREQUENCY
DISTRIBUTION TABLE
Frequency is the number of times that a
particular event occurs.
Relative frequency (percent) is the
proportion of observed responses in the
category to the total number of
observations.
Cumulative relative frequency is the
running total of the relative frequencies by
reading from top to bottom.
EXAMPLE
We asked the students what country their car is from (or no
car) and made a tally of the answers. Then we computed
the frequency and relative frequency of each category. The
relative frequency is computed by dividing the frequency by
the total number of respondents. The following table
summarizes.
Country   Frequency   Relative Frequency (%)
US        6           30
Japan     7           35
Europe    2           10
Korea     1           5
None      4           20
Total     20          100
Example:
The following are the marks out of 20 obtained by 50 students.
18 15 17 17 12 9 16 10 12 12
12 5 18 13 19 15 7 18 15 16
20 11 18 9 19 16 14 18 10 11
16 18 20 15 15 10 12 17 8 16
19 17 15 8 5 17 11 16 16 7
The test scores could be grouped into various classes:
Score (x)   Frequency (f)   Relative Frequency (%)   Cumulative Relative Frequency (%)
1-5         2               4                        4
6-10        9               18                       22
11-15       16              32                       54
16-20       23              46                       100
Total       50              100
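The grouped table above can be reproduced with a short tally. The following is a minimal Python sketch, not part of the original slides; the function name `frequency_table` and the inclusive class limits are illustrative assumptions taken from the table.

```python
# Marks out of 20 for the 50 students in the example above.
marks = [
    18, 15, 17, 17, 12, 9, 16, 10, 12, 12,
    12, 5, 18, 13, 19, 15, 7, 18, 15, 16,
    20, 11, 18, 9, 19, 16, 14, 18, 10, 11,
    16, 18, 20, 15, 15, 10, 12, 17, 8, 16,
    19, 17, 15, 8, 5, 17, 11, 16, 16, 7,
]

# Inclusive class limits, as used in the table above.
classes = [(1, 5), (6, 10), (11, 15), (16, 20)]


def frequency_table(data, classes):
    """Return rows of (label, frequency, relative %, cumulative relative %)."""
    n = len(data)
    rows = []
    cumulative = 0.0
    for low, high in classes:
        freq = sum(1 for x in data if low <= x <= high)
        rel = 100 * freq / n          # relative frequency in percent
        cumulative += rel             # running total of relative frequencies
        rows.append((f"{low}-{high}", freq, rel, cumulative))
    return rows


table = frequency_table(marks, classes)
for label, freq, rel, cum in table:
    print(f"{label:>6} {freq:>4} {rel:>6.1f} {cum:>6.1f}")
```

The same function works for any set of non-overlapping classes, provided they cover every observation (the exhaustiveness and mutual exclusiveness conditions discussed earlier).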
Two by Two Table (2X2)
The measures of association between a risk factor (exposure)
and an outcome are often calculated from data presented in a 2X2
table or, in the general form, an nxn table.
The following is a 2X2 table showing association between
exposure and outcome variables.
                      OUTCOME
EXPOSURE   Positive (+)   Negative (-)   Total
Yes        A              B              A+B
No         C              D              C+D
EXERCISE
Table 2
HIV/AIDS: +(ve) +(ve) +(ve) +(ve) +(ve) -(ve) -(ve) -(ve) -(ve) -(ve)
GRAPHS
• The shape of a histogram can be unimodal
if there is one hump, bimodal
if there are two humps and
multimodal if there are many humps.
• A nonsymmetric histogram is called
skewed.
• If the right tail is longer than the left
tail then it is positively skewed.
• If the right tail is shorter than the left
tail then it is negatively skewed.
[Figures: a nonsymmetric, right-skewed histogram; a bimodal histogram]
FREQUENCY POLYGON
A frequency polygon can be constructed easily from the bars of a histogram.
[Figure: frequency polygon; x-axis: Age (18 to 45), y-axis: Frequency]
MEASURE OF CENTRAL
TENDENCY/LOCATION
MEDIAN
MODE
No. of calves: 2 5 12 17 14 6 3 1
MEASURE OF DISPERSION/VARIATION
$$s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{\sum_i f_i} = \frac{\sum_i f_i (x_i - \bar{x})^2}{n} = \frac{\sum_i f_i x_i^2}{n} - \bar{x}^2$$
3. Square each of the differences.
4. Add values in this column.
5. Divide by n-1 where n is the number of items in the sample. This is the variance.
• Standard Deviation
6. To get the standard deviation we take the square root of the variance.
COEFFICIENT OF VARIATION
• Example:
From the following Table calculate the mean, variance and
coefficient of variation.
Number of hours per week spent watching television by
students.
Hours    No. of students
10-14    2
15-19    12
20-24    23
25-29    60
30-34    77
35-39    38
40-44    8
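One possible way to work this example is sketched below in Python. It is only an illustration under stated assumptions: each class is represented by its midpoint, the variance uses the n-1 divisor mentioned in the steps for a sample, and the coefficient of variation is taken as the standard deviation divided by the mean, expressed in percent.

```python
import math

# Class limits and frequencies from the table above.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freqs = [2, 12, 23, 60, 77, 38, 8]

# Assumption: every observation in a class sits at the class midpoint.
midpoints = [(low + high) / 2 for low, high in classes]

n = sum(freqs)                                            # total number of students
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n   # grouped mean

# Sum of squared deviations from the mean, weighted by frequency.
ss = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))

variance = ss / (n - 1)        # assumption: sample variance (n-1 divisor)
sd = math.sqrt(variance)
cv = 100 * sd / mean           # coefficient of variation in percent

print(f"n = {n}, mean = {mean:.2f}, variance = {variance:.2f}, CV = {cv:.1f}%")
```

Dividing by n instead of n-1 changes the variance only slightly here because n is large.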
SKEWNESS
Skewness is a measure of
symmetry, or more precisely, the
lack of symmetry.
A distribution, or data set, is
symmetric if it looks the same to the
left and right of the center point.
The skewness for a normal
distribution is zero, and any
symmetric data should have a
skewness near zero.
Negative values for the skewness
indicate data that are skewed left
and positive values for the skewness
indicate data that are skewed right.
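The sign convention above can be checked numerically. The slides do not give a formula at this point, so the sketch below assumes the moment coefficient of skewness, m3 / m2^(3/2), where m2 and m3 are the second and third central moments; the data sets are made up for illustration.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]              # looks the same left and right of 3
right_skewed = [1, 1, 2, 2, 3, 10]       # long right tail
left_skewed = [-x for x in right_skewed] # mirror image: long left tail

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(name, round(skewness(data), 3))
```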
KURTOSIS
ELEMENTARY PROBABILITY THEORY
Assume that an experiment can be repeated many times, with each repetition called a trial, and assume that one or more outcomes can result from each trial. Then the probability of a given outcome is the number of times that outcome occurs (favourable outcomes) divided by the total number of trials (total outcomes):
Prob(x) = r/n = Favourable outcomes / Total outcomes
If the outcome is sure to occur, it has a probability of 1; if an outcome cannot occur, its probability is 0.
Example: The probability of flipping a fair coin and getting tails is 0.50, or 50%. If a coin is flipped 10 times, there is no guarantee that exactly 5 tails will be observed; the proportion of tails can range from 0 to 1.
An experiment is defined as any planned process of data collection.
For an experiment we define an event to be any collection of possible outcomes.
A conditional probability is the probability of one event given that another event has occurred.
A simple event is an event that consists of exactly one outcome.
In probability, "OR" means the union, that is, either event can occur, and "AND" means the intersection, that is, both must occur.
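The coin-flip example above can be illustrated with a small simulation. This is a hedged sketch rather than part of the course material: the helper `proportion_of_tails` and the seed value are arbitrary choices made for the illustration.

```python
import random


def proportion_of_tails(n_flips, rng):
    """Flip a fair coin n_flips times and return r/n, the proportion of tails."""
    tails = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return tails / n_flips


rng = random.Random(2161)  # fixed seed so the run is repeatable

# With only 10 flips the proportion can be anywhere between 0 and 1.
few = [proportion_of_tails(10, rng) for _ in range(5)]

# With many trials the relative frequency settles near the probability 0.5.
many = proportion_of_tails(100_000, rng)

print("5 runs of 10 flips:", few)
print("100000 flips:", many)
```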
CHARACTERISTICS OF PROBABILITY
P(E) is always between 0 and 1.
The sum of the probabilities of all simple events must be 1.
P(E) + P(not E) = 1
If A and B are mutually exclusive, then
P(A or B) = P(A) + P(B)
If A and B are not mutually exclusive, then
P(A or B) = P(A) + P(B) - P(A and B)
If A and B are independent, then
P(A and B) = P(A)P(B)
For conditional probability
P(A and B) = P(A|B) × P(B)
or
P(B and A) = P(B|A) × P(A)
Example
A bag contains 80 balls of which 20 are red, 25 are blue and 35 are white. A ball is picked at random. What is the probability that the ball picked is:
(i) Red ball
(ii) Black ball
(iii) Red or Blue ball.
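The bag-of-balls example on this slide can be worked with exact fractions. The sketch below is one way to do it in Python; the helper `prob` is a hypothetical name, and part (iii) uses the addition rule for mutually exclusive events, since a single ball cannot be both red and blue.

```python
from fractions import Fraction

# Contents of the bag as stated in the example.
counts = {"red": 20, "blue": 25, "white": 35}
total = sum(counts.values())  # 80 balls


def prob(colour):
    """P(colour) = favourable outcomes / total outcomes."""
    return Fraction(counts.get(colour, 0), total)


p_red = prob("red")                          # (i)   20/80
p_black = prob("black")                      # (ii)  no black balls in the bag
p_red_or_blue = prob("red") + prob("blue")   # (iii) mutually exclusive events

print(p_red, p_black, p_red_or_blue)
```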
MUTUALLY EXCLUSIVE AND
INDEPENDENT EVENTS
1. In a competitive examination, 30
candidates are to be selected. Of the 600
candidates appearing in a written test, 100
will be called for the interview.
(i) What is the probability that a person
will be called for the interview?
(ii) Determine the probability of a person
getting selected if he has been called for the
interview.
(iii) What is the probability that a person is
called for the interview and is selected?
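A possible worked solution, using the conditional probability rule P(A and B) = P(A|B) × P(B), is sketched below. It assumes that all 30 selected candidates come from the 100 who are interviewed and that every candidate is equally likely to be called or selected; the variable names are illustrative.

```python
from fractions import Fraction

appeared = 600   # candidates in the written test
called = 100     # candidates called for the interview
selected = 30    # candidates finally selected (assumed to be among those called)

# (i) P(called)
p_called = Fraction(called, appeared)

# (ii) P(selected | called)
p_selected_given_called = Fraction(selected, called)

# (iii) P(called and selected) = P(selected | called) * P(called)
p_called_and_selected = p_selected_given_called * p_called

print(p_called, p_selected_given_called, p_called_and_selected)
```

The result of (iii) agrees with counting directly: 30 of the 600 candidates are both called and selected.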
EXERCISES
EXERCISE 2
Example i. ${}^{3}P_{3}$
Example
The probability of a rare disease striking a given population is
0.003. A sample of 10000 was examined.
Find the expected number suffering from the disease and hence
determine the variance and the standard deviation for the
above problem.
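The slide does not say which distribution is intended, so the sketch below shows both natural readings: an exact binomial model with n = 10000 and p = 0.003, and its Poisson approximation with the same mean. Treat the choice of model as an assumption.

```python
import math

p = 0.003     # probability that the disease strikes an individual
n = 10_000    # sample size

expected = n * p                       # expected number of cases, about 30

# Binomial model: variance = n p (1 - p)
var_binomial = n * p * (1 - p)         # about 29.91
sd_binomial = math.sqrt(var_binomial)

# Poisson approximation (rare event, large n): variance = mean
var_poisson = expected                 # about 30
sd_poisson = math.sqrt(var_poisson)

print(f"expected = {expected:.2f}")
print(f"binomial: variance = {var_binomial:.2f}, sd = {sd_binomial:.3f}")
print(f"poisson:  variance = {var_poisson:.2f}, sd = {sd_poisson:.3f}")
```

Because p is small, the two models give nearly the same variance and standard deviation.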
PROBABILITY DISTRIBUTION FOR
CONTINUOUS RANDOM VARIABLES.
Yes 70 5 75
No 10 15 25
Total 80 20 100
STUDENT T-DISTRIBUTION
distribution Table)
EXAMPLE
125 135
HYPOTHESIS TESTING
Assume that we have a sample (x1, x2, …, xn) from a given population. All parameters of the population are known except some parameter.
We want to determine, from the given observations, the unknown parameter. In other words, we want to determine a number or range of numbers from the observations that can be taken as its value.
• Estimator: a method of estimation.
• Estimate: a result of an estimator.
• Point estimation: as the name suggests, the estimation of the population parameter with one number.
The problem of statistics is not to find estimates but to find estimators.
An estimator is not rejected just because it may give one bad result for one sample. It is rejected when it gives bad results in the long run.
An estimator is accepted or rejected depending on its sampling properties.
Every member of a population cannot be examined, so we use the data from a sample, taken from the same population, to estimate some measure, such as the mean, of the population itself.
The sample will provide us with the best estimate of the exact 'truth' about the population.
The method of sampling depends on the data available, but the ideal method, as every member of the population has an equal chance of being selected, is random sampling.
POINT AND INTERVAL ESTIMATION
Unbiasedness
If an estimator $\hat{\theta}$ estimates $\theta$, then the difference between them, $(\hat{\theta} - \theta)$, is called the estimation error.
Bias of the estimator is defined as the expectation value of this difference, that is:
$B = E(\hat{\theta} - \theta) = E(\hat{\theta}) - \theta$
Sample variance
The expectation value of the square of the difference between the estimator and the expectation of the estimator is called its variance.
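Bias, as defined above, can be estimated by simulation: draw many samples, apply the estimator to each, and compare the average estimate with the true parameter. The sketch below is an illustration under assumptions not taken from the slides: a standard normal population (true variance 1), samples of size 5, and two candidate variance estimators that divide by n and by n-1.

```python
import random


def variance_estimates(sample):
    """Return the variance estimates using divisors n and n-1."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)


rng = random.Random(0)   # fixed seed for a repeatable run
true_var = 1.0           # variance of the assumed N(0, 1) population
n, trials = 5, 20_000

sum_n = sum_n1 = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    v_n, v_n1 = variance_estimates(sample)
    sum_n += v_n
    sum_n1 += v_n1

# Empirical bias: average estimate minus the true parameter value.
bias_n = sum_n / trials - true_var     # expected to be close to -1/n = -0.2
bias_n1 = sum_n1 / trials - true_var   # expected to be close to 0

print(f"bias with divisor n:   {bias_n:+.3f}")
print(f"bias with divisor n-1: {bias_n1:+.3f}")
```

This is why the steps for the sample variance divide by n-1: in the long run that estimator is not systematically too small.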
Mean difference, treated and untreated groups: 120 mmHg, 140 mmHg
CONFIDENCE INTERVAL FOR TWO
POPULATION PROPORTIONS
DIFFERENCE
Quota sampling
It is often used for opinion surveys.
That is each interviewer is told to collect opinions from a specified number
of, for example, males and females of certain ages.
This is fraught with danger, as interviewers often have difficulty in making
up their quota and may ask relatives or friends to masquerade as the
desired case.
Even if they don’t just re-label the people interviewed, they are likely to
interview friends or acquaintances, who are likely to be more similar to
each other than the people in the general population.
A method that is sometimes useful when one has no list of the population,
and where the population is hard to find, is called snowball sampling. That
is, once you have identified some respondents, you ask them to provide you
with contact details of other respondents.
QUIZ 1 /10.
SIMPLE LINEAR REGRESSION
The following table shows individuals' working hours and wages.
Determine the regression equation using the OLS estimation
technique and hence estimate the wage of an individual whose working
hours were 5.6.
Individual ID   Working Hours   Wage
1               6               7
2               4               6
3               3               5
4               7               8
5               5               6
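A minimal sketch of the OLS calculation for this table follows. It assumes wage is the dependent variable (y) and working hours the explanatory variable (x), and it uses the usual least-squares formulas b = Sxy/Sxx and a = ȳ - b·x̄; the function name `ols_fit` is illustrative.

```python
# Data from the table above.
hours = [6, 4, 3, 7, 5]   # x: working hours
wage = [7, 6, 5, 8, 6]    # y: wage


def ols_fit(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = ols_fit(hours, wage)
predicted = a + b * 5.6      # estimated wage for 5.6 working hours

print(f"wage = {a:.2f} + {b:.2f} * hours")
print(f"estimated wage at 5.6 hours: {predicted:.2f}")
```

With these data the sums are x̄ = 5, ȳ = 6.4, Sxy = 7 and Sxx = 10, so the fitted line is wage = 2.9 + 0.7 × hours.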
COEFFICIENT OF DETERMINATION (R²)
Note
Although a good regression will
give a high R², a high R² does not
necessarily mean a good fit.
Consider the plot below: its R² is
high (98.8%), but the plot shows
one point far away from the
others.
This point is often termed an
influential point, because it has a
great effect on the estimation of
the regression line.
A low R² does not necessarily
mean that there is no relation.
The relation could be very strong
but nonlinear, such as a
semicircle relation, which gives an
R² of zero.
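The semicircle remark can be verified numerically. The sketch below builds points on the upper half of the unit circle and computes the R² of a straight-line fit, using the fact that for simple linear regression R² equals Sxy² / (Sxx · Syy). The grid of x values is an arbitrary choice for the illustration.

```python
import math


def r_squared(x, y):
    """R^2 of the least-squares line of y on x: Sxy^2 / (Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# y is an exact (but nonlinear) function of x: the upper unit semicircle.
x = [i / 10 for i in range(-10, 11)]
y = [math.sqrt(1 - xi ** 2) for xi in x]

print(f"R^2 for the semicircle: {r_squared(x, y):.6f}")
```

Here y is completely determined by x, yet the best straight line is horizontal and explains none of the variation.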
CORRELATION COEFFICIENT (R)
In many cases, more than one variable has been measured on each unit, such as an animal, plant or object. So, if there are several variables of interest, one is frequently interested in correlations between these variables.
In the previously discussed simple linear regression model, we may want to determine not just whether the variables are linearly related, but also whether there is a positive relationship or a negative relationship (b>0 or b<0).
One method of examining the relationship between two continuous variables (such as weight and height) is to look at
For the sample this is defined as:
rxy = Sxy/(SxSy)
where Sx and Sy are the standard deviations of x and y, and Sxy is the covariance between x and y, defined as:
CORRELATION COEFFICIENT (R)
The correlation coefficient measures the strength of the linear relation between x and y.
If the relationship between x and y is positive (direct), then when the x value is higher the y value is also very likely to be higher.
Similarly, a correlation coefficient near one does not imply a near-perfect linear relation: if there is one point in the data set that is far away from the rest of the points, the correlation coefficient may be near 1 or -1.
Generally, the square of the correlation coefficient (r²) is equal to the coefficient of determination (R²).
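The sample correlation coefficient rxy = Sxy/(SxSy) can be sketched as follows, reusing the working hours and wage data from the regression example. The covariance and standard deviations are computed with an n-1 divisor, which is an assumption (the divisor cancels in the ratio, so the value of r does not depend on it).

```python
import math

# Working hours and wage from the simple linear regression example.
hours = [6, 4, 3, 7, 5]
wage = [7, 6, 5, 8, 6]


def correlation(x, y):
    """Sample correlation r = Sxy / (Sx * Sy), using n-1 divisors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


r = correlation(hours, wage)
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}")
```

The positive sign of r matches the positive slope of the fitted regression line for the same data.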
The control chart is also known as the Shewhart chart or process-behavior chart.
Statistical process control is a tool used to determine whether a manufacturing or business process is in a state of statistical control or not.
For this, Shewhart set 3-sigma limits.
If the chart indicates that the process is currently under control then it can be used with confidence to predict the future performance of the process.
In this case all points will plot within the control limits.
If the chart indicates that the process being monitored is not in control, the pattern it reveals can help determine the source of variation to be eliminated to bring the process back into control.
In this case a point falls outside the limits established for a given control chart.
Those responsible for the underlying process are expected to determine whether a special cause has occurred. If one has, then that cause should be eliminated if possible.
It is known that even when a process is in control, there is approximately a 0.3% probability of a point exceeding 3-sigma control limits.
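The figure of roughly 0.3% can be checked, and the 3-sigma rule applied, with the short Python sketch below. It assumes the plotted statistic is normally distributed; the centre line, sigma and sample points are hypothetical values invented for the illustration.

```python
import math


def prob_outside(k):
    """P(|Z| > k) for a standard normal Z, via the error function."""
    return 1 - math.erf(k / math.sqrt(2))


def out_of_control(points, centre, sigma, k=3):
    """Return the points that fall outside the k-sigma control limits."""
    upper = centre + k * sigma   # upper control limit
    lower = centre - k * sigma   # lower control limit
    return [p for p in points if p > upper or p < lower]


p3 = prob_outside(3)   # about 0.0027, i.e. roughly 0.3%

# Hypothetical process with centre line 10.0 and sigma 1.0.
points = [10.1, 9.8, 10.4, 13.5, 9.9]
flagged = out_of_control(points, centre=10.0, sigma=1.0)

print(f"P(point beyond 3-sigma limits) = {p3:.4f}")
print("points outside the control limits:", flagged)
```

A flagged point is a signal to look for a special cause, not proof that one exists, since an in-control process still produces such points occasionally.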
STATISTICAL CONTROL CHARTS
PROCESS OUT/IN OF CONTROL
TYPES OF CONTROL CHARTS
ATTRIBUTE CONTROL CHARTS
EXERCISES
X-BAR (MEAN) CHART
R-CHART
P-CHART