A1: Probability and Statistics
SJ Payne
Michaelmas Term 2017 - 4 lectures, 1 tutorial sheet
With acknowledgements to T Blakeborough and V Grau
Books
All the recommended general mathematics textbooks have sections on probability and statistics. The relevant chapters of three popular ones are:
- Riley, Hobson and Bence, Mathematical Methods for Physics and Engineering (3rd edition): Chapter 24
- Kreyszig, Advanced Engineering Mathematics (10th edition): Chapters 24-25
- James, Modern Engineering Mathematics (4th edition): Chapter 13 for Probability, and James, Advanced Modern Engineering Mathematics (3rd edition): Chapter 11 for Statistics
1 Set theory
A set is a collection of objects with some common property. The objects are the elements of the set. A set is represented by enclosing the objects inside curly brackets, e.g.
A = {1, 2, 3, 4, 5, 6} and C = {x : x > 0}
The symbol ∈ means 'is a member of', e.g. y ∈ C means y is a member of C.
The empty or null set ∅ contains no elements.
The symbol ⊆ means 'is a subset of', i.e. A ⊆ B (or equivalently B ⊇ A, B is a superset of A) means every element of A is also an element of B.
The symbol ∪ means union, and A ∪ B is the set containing all the elements of A and B.
The symbol ∩ means intersection, and A ∩ B contains the elements common to both A and B.
A space is the set containing all the elements that are being considered.
The set containing all the elements not in a particular set is called its complement. There are various conventions about how this is written down. These notes use a superscript 'c', e.g. $A^c$, but you will also see $\bar{A}$ and $S \backslash A$ in books.
If the intersection of two sets is the null set (i.e. they have no elements in common) the sets are said to be disjoint or mutually exclusive.
A good way of visualising sets, and set operations, is to
use Euler or Venn diagrams. At this point it should be
obvious to you what the union and the intersection sets
correspond to in the diagram.
[Venn diagram: sets A, B and C inside the space S, the set of all elements]
1.3 Probability values
We will now associate a probability to each outcome. We
do this such that a probability of 1 corresponds to
certainty and 0 to impossibility.
We can observe that the probability of an event depends on the probabilities of the outcomes it contains. This can be achieved by defining the probability of an event as the sum of the probabilities of all the outcomes contained in the set. P(A), the probability of event A, can thus be defined as
$$P(A) = \sum_i P(e_i) \quad \text{for all } e_i \in A,$$
where $e_i$ represents an outcome within the set A.
Since something is bound to happen as the result of an
experiment we can say straight away that P(S) = 1.
Similarly, the probability that none of the possible outcomes will happen is 0 (i.e. P(∅) = 0), and we can note that since no event can be less likely than the impossible event, there are no negative probabilities, i.e. P(A) ≥ 0.
1.4 Permutations and combinations
A central idea in understanding probability calculations is
the concept of relative frequency, i.e. the frequency with
which we can expect a particular event to appear among
all possible events. If all events are equally likely, we just
need to count the number of possible results to assess
probabilities. There are tools to calculate the number of
results that can occur in different conditions. To
understand the possibilities, it is useful to contemplate a
simple situation for our examples – the drawing of balls
with different colours (or numbers) from bags.
A typical question could be: what is the chance of picking
first a white ball, followed immediately by two black balls,
from a bag containing 7 white balls and 3 black ones?
There are two factors that we need to take into account.
The first relates to whether the balls are put back into the
bag or not:
a.1) when we put the ball back into the bag before drawing the next one (in which case the chances of selecting a particular ball do not change, i.e. the outcomes are independent), or
a.2) when we do not replace the ball (and thus the odds
change for subsequent selections).
In the first case we say that sampling is done with
replacement, while the second corresponds to sampling
without replacement.
The second factor relates to whether we are interested in
the order in which we have drawn them:
b.1) If we do not care about the order, then we talk about
combinations.
b.2) If the order is important, then we talk about
permutations.
These two factors combined give rise to four possible cases: permutations/combinations with/without replacement. In the following we will obtain mathematical rules for the number of possible outcomes in each case.
Let us first find out in how many ways it is possible to
arrange a set of objects. Let us imagine a bag containing
n balls, numbered 1 to n, and let us draw r of them in
sequence without replacement.
Now there are n possibilities for the first ball, n-1 for the
second, and so on until you get to n-r+1 ways for the rth
and last ball.
The total number of possible sequences of r balls from a
bag with n balls is therefore
$$^nP_r = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!}$$
Since we do not put the balls back and the order is considered, we are talking about permutations without replacement, and the number of permutations is written as $^nP_r$, as shown above.
$$^nC_r = \frac{^nP_r}{r!} = \frac{n!}{(n-r)!\,r!}$$
$^nC_r$ can also be written $\binom{n}{r}$ – you might also know it as the binomial coefficient, which will be used later. It is easy to see that $^nC_r = {}^nC_{n-r}$.
Example: If we throw five dice, how many combinations of numbers can we get?
Obviously we need to allow for repetitions (replacement), and given that the order is not important in this case we are talking about combinations with replacement. Given that r = 5 and n = 6, the number of combinations is
$$\binom{6+5-1}{5} = \binom{10}{5} = \frac{10!}{5!\,5!} = 252$$
The answer to the first question, that the number is even,
is the set of outcomes E={2, 4, 6}, and according to the
summing rule the probability of the event is the sum of
the probabilities of the outcomes contained in the event
set.
$$P(E) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
Intuitively this corresponds again to the idea of relative
frequency, i.e. if we repeated the test a large number of
times, how often will the outcome be in E? In this example,
we can expect to get even values half of the time. In a
similar way, we could ask for the probability of the
number being exactly divisible by 3, i.e. what is P(T) when
T={3, 6}? We can see that the probability of T is 1/3
(=2/6).
Now let us ask the combined question “what is the chance
of the number being exactly divisible by two or three or
both?”
This set is C={2, 3, 4, 6}, which has four equally likely outcomes, so the probability $P(C) = 4/6 = 2/3$. Note we cannot just add the probabilities of the two events, because the element '6' appears in both sets and its probability would be counted twice. We must correct for this by subtracting one copy of the probability of the joint outcomes.
This illustrates the rule of addition of probabilities
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
It is easy to confirm this visually by checking the sets A, B,
their union and their intersection. We can also check that
it gives the same answer as before:
$$P(C) = P(E \cup T) = P(E) + P(T) - P(E \cap T) = \frac{1}{2} + \frac{1}{3} - \frac{1}{6} = \frac{2}{3}$$
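A minimal sketch checking the addition rule on this die example with Python sets, using exact fractions:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                    # equally likely outcomes
P = lambda X: Fraction(len(X), len(S))    # probability of an event (a set)

E = {2, 4, 6}   # divisible by 2
T = {3, 6}      # divisible by 3

print(P(E | T))                   # 2/3, directly from the union
print(P(E) + P(T) - P(E & T))     # 2/3, from the addition rule
```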
2 Prior knowledge, Independence & Bayes’
theorem
In the examples above, the sample space S was the whole
set of possible outcomes. There are many situations in
which we possess prior information, i.e. we already know
something about the outcome. For example, the likelihood
of a device failure in the next minutes might be affected
by the knowledge that the room temperature is unusually
high. As an example for our die we can ask “if we know
that the outcome is even, what is the probability of it
being larger than 3?”
The set of outcomes in S that are both even and larger than 3 is {4,6}, and the set of outcomes that are even but not larger than 3 is {2}. Thus the event in question will happen 2/3 of the time. Notice that this is different from the probability of the value being larger than 3 in the absence of any prior information, which would be 1/2.
Again, let’s try to calculate this in a more formal way.
What we are trying to calculate here is a conditional
probability, i.e. the probability of the event A given that
the event B has happened. By introducing our prior
knowledge about the outcome (i.e. that it is even), we
have reduced the set of possible outcomes to a subset of
S, which we will denote B = {2,4,6}. The event for which
we want to calculate the probability is A = {4,5,6}, of
which the only outcomes possible now are those in A ∩ B, i.e. {4,6}. This conditional probability of A given B is written P(A|B) (remember the form of the words; it will help you with the order of the arguments), and following the argument above it can be defined as:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad \text{where } P(B) \neq 0$$
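The same die example, sketched as a direct application of this definition:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = lambda X: Fraction(len(X), len(S))

A = {4, 5, 6}   # larger than 3
B = {2, 4, 6}   # even: the prior knowledge

print(P(A & B) / P(B))   # 2/3, as argued above
print(P(A))              # 1/2, the probability without prior information
```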
[Venn diagram: the die outcomes 1-6 within S, with A = {4, 5, 6} and B = {2, 4, 6} overlapping in {4, 6}]
Intuitively we can read this as: the chance of two events happening simultaneously, $P(A \cap B) = P(A|B)\,P(B)$, is the chance of one of them happening multiplied by the chance of the second happening given that the first has happened.
2.1 Total probability theorem
We can use this result to derive important relationships.
The first is the total probability theorem. If we partition
the sample space into a set of n disjoint sets Ai
[Figure: the sample space S partitioned into disjoint sets $A_1, A_2, \ldots, A_n$, with the event B overlapping several of them]
we can see from the addition rule above that the
probability of B is given by
probability of B is given by
$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$
2.3 Independence
The final thing to note is that if prior knowledge does not affect the probability of the second event, i.e. $P(A|B) = P(A)$, then
$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A) = P(A)\,P(B)$$
and we say that the events A and B are independent.
So: if two events are independent, the probability that the
two of them happen in the same experiment is the
product of their individual probabilities. Note that
statistical dependence does not require any causative link,
as we have seen before in the example of the die
outcomes.
Example: Clinical test:
A test for a rare disease detects the disease with a
probability of 99%, and has a false positive ratio (i.e. it
tests positive even though the person is healthy) of 0.5%.
We know that the percentage of the general population
who have the disease is 1 in 10000.
Suppose we chose a random subject and perform the test,
which comes out positive. What is the probability of the
person actually having the disease?
The probability of D (having the disease) before the test is
1 in 10000: P(D)=0.0001 (i.e. P(Dc)=0.9999)
The conditional probabilities of getting a positive in the
test P(T) when the person does / does not have the
disease are P(T|D)=0.99 (true positive ratio) and
P(T|Dc)=0.005 (false positive ratio).
The probability of getting a positive result in the test is P(T), which we can calculate using total probability:
$$P(T) = P(T|D)P(D) + P(T|D^c)P(D^c) = 0.99 \times 0.0001 + 0.005 \times 0.9999 \approx 0.0051$$
Now we just need to apply Bayes’ theorem
$$P(D|T) = \frac{P(T|D)\,P(D)}{P(T)} = \frac{0.99 \times 0.0001}{0.0051} \approx 0.02$$
You may be surprised to know that even with a high
quality test such as the one described, if a random subject
is tested and it comes out positive, the probability of the
subject being ill is only 2%. Of course this would all
change if the person has shown previous symptoms or
belongs to a certain risk group (i.e. if you have further
prior information).
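A minimal sketch reproducing these numbers via total probability and Bayes' theorem:

```python
P_D = 0.0001            # prevalence of the disease
P_T_given_D = 0.99      # true positive ratio
P_T_given_Dc = 0.005    # false positive ratio

# Total probability: P(T) = P(T|D)P(D) + P(T|Dc)P(Dc)
P_T = P_T_given_D * P_D + P_T_given_Dc * (1 - P_D)

# Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
P_D_given_T = P_T_given_D * P_D / P_T
print(round(P_T, 4), round(P_D_given_T, 3))   # 0.0051 0.019, i.e. about 2%
```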
If you like brain teasers, you may want to look for the
“Monty Hall problem”, an application of prior probability
that has puzzled many since it was originally proposed in
1975.
3 Probability distributions
3.1 Random variables
If we assign a numerical value to each point in a sample space, then we have a function defined on the space. This function is called a random variable, usually represented by a capital letter, e.g. X or Y. Variables that are random in time are also called stochastic variables.
If the sample space is a set of discrete points then the
variable is discrete, otherwise it is non-discrete, or
continuous. For example, the number of students at a
particular lecture is a discrete random variable; the height
of the tallest student is a continuous random variable.
3.2 Discrete random variables
Given a sample space containing a discrete set of values $x_i$, we say that the probability that the random variable X equals the number $x_i$ is $p(x_i)$, or
$$P(X = x_i) = p(x_i)$$
[Figure: probability density function of a sample discrete random variable]
where $n_i$ is the number of times the value $x_i$ appears in the N samples. But as $N \to \infty$ we know that $n_i \to N\,p(x_i)$, so we can write
$$\mu_x = \sum_i x_i\,p(x_i)$$
In the same way, the expectation of a linear function of X follows directly:
$$E[aX + b] = a\sum_i x_i\,p(x_i) + b\sum_i p(x_i) = a\,E[X] + b$$
Similarly, it is easy to demonstrate that, given two
random variables X and Y, the expectation of a linear
combination of the two (Z = aX + bY) is a linear
combination of the respective expectations:
$$E[Z] = E[aX + bY] = a\,E[X] + b\,E[Y]$$
The expectation gives us an idea of which values we can
expect to get (though we still need to be careful on the
use of this parameter: for example, there is no guarantee
that the expectation corresponds to the value that is most
likely to appear). We would also like to know how much
the values are likely to vary between tests: this is related
to the breadth of the distribution. A simple, intuitive way
to quantify this is by getting the ‘average’ value of the
square of the difference of each point from the mean. This
figure is called the variance of the distribution
$$\sigma_X^2 = E\left[(X - \mu_x)^2\right]$$
We can expand this to give a useful way of calculating the variance:
$$\sigma_X^2 = E\left[(X - \mu_x)^2\right] = E\left[X^2 - 2X\mu_x + \mu_x^2\right] = E[X^2] - 2\mu_x E[X] + \mu_x^2 = E[X^2] - \left(E[X]\right)^2$$
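A short sketch confirming that the two variance expressions agree, using a fair die as an illustrative distribution:

```python
xs = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6                                               # fair die pmf

mu = sum(x * px for x, px in zip(xs, p))                      # E[X] = 3.5
var_def = sum((x - mu) ** 2 * px for x, px in zip(xs, p))     # E[(X - mu)^2]
var_alt = sum(x ** 2 * px for x, px in zip(xs, p)) - mu ** 2  # E[X^2] - E[X]^2
print(mu, var_def, var_alt)   # 3.5 2.9166... 2.9166...
```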
3.2.3 The uniform discrete probability function
probability of getting a single '1' anywhere in the sequence is $nq^{n-1}p$.
r '1's – let the r ones come first, followed by n − r zeros. The probability of this is $q^{n-r}p^r$, but the r '1's and n − r zeros can be arranged in $\binom{n}{r}$ ways (from Section 1.4), so the probability is $\binom{n}{r}q^{n-r}p^r$.
All '1's – all trials must be '1', so the probability is $p^n$.
The values for each possible outcome are summarised in the following table:
No. of successes:  0, 1, ..., r, ..., n
Probability:  $q^n$, $nq^{n-1}p$, ..., $\binom{n}{r}q^{n-r}p^r$, ..., $p^n$
We can prove that this distribution meets the properties of a discrete probability distribution: a) it is obvious that all probabilities are non-negative; and b) to demonstrate that the probabilities sum to 1, we can see that the terms in the table above correspond to those of the binomial expansion of $(q+p)^n$, which equals 1 since q + p = 1:
$$(q+p)^n = q^n + nq^{n-1}p + \cdots + \binom{n}{r}q^{n-r}p^r + \cdots + p^n = 1$$
This distribution is called the binomial distribution (from the generating function) or the Bernoulli distribution, after Jacques Bernoulli, the 17th century Swiss mathematician who first discovered it. It is often represented in the form B(n,p).
The binomial distribution is very common and it occurs
whenever there is a set of independent trials with
outcomes ‘0’ or ‘1’ and which all have the same
probability of success. Below you can see sample plots
corresponding to B(100,1/3) and B(100,1/2) in the first
Figure, and B(25,1/3) and B(25,1/2) in the second. Check
whether the position of the maximum and the width of the
distribution are those you expected.
[Figure: binomial probabilities for B(100, 1/3) and B(100, 1/2)]
[Figure: binomial probabilities for B(25, 1/3) and B(25, 1/2)]
Example: At a particular junction 10% of cars turn left.
Five cars approach the junction, what is the probability
that 3 or more will turn left?
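A minimal computational sketch of this question, modelling the number of left-turning cars as X ~ B(5, 0.1):

```python
from math import comb

n, p = 5, 0.1
# P(X >= 3) = sum of the binomial probabilities for k = 3, 4, 5
P_3_or_more = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))
print(P_3_or_more)   # ~0.00856
```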
period into a series of n small, equal periods of time. The probability of an event happening in each of the smaller periods is λ/n, but there are n of them. If we keep on dividing the period up into more and more of ever shorter periods, eventually we will get to the point where every little period has at most one event in it. Sampling the process for a unit period of time can then be considered a binomial process (since each sub-period can have either one event in it or none) with a very large number of sub-periods n.
What is the distribution of successes (note that we use
success here as a probability term, and it does not
necessarily reflect a positive event)?
The process is binomial, so the probability of there being k successes is
$$p_k = \binom{n}{k}q^{n-k}p^k$$
with p being the probability of a success within each of the n periods, i.e. p = λ/n. Let us make the substitution for p:
$$p_k = \frac{n!}{(n-k)!\,k!}\left(1 - \frac{\lambda}{n}\right)^{n-k}\left(\frac{\lambda}{n}\right)^k$$
which can be rewritten
$$p_k = \frac{n!}{(n-k)!\,k!}\left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-k}\left(\frac{\lambda}{n}\right)^k$$
Now what happens to this as $n \to \infty$? Let us look at each component in turn:
$$\frac{n!}{(n-k)!} \to n^k, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}, \qquad \left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$$
and so the probability becomes
$$p_k \to \frac{n^k}{k!}\,e^{-\lambda}\left(\frac{\lambda}{n}\right)^k = \frac{\lambda^k\,e^{-\lambda}}{k!}$$
This is the probability distribution of independent rare events. The requirements are that:
- each event is independent of the others,
- only one event can happen at a time,
- the mean rate of events is constant.
So in our example of the call centre, we can use the
function above to calculate the probability of getting a
certain number of calls in one hour, helping us to estimate
the number of lines we need to hire to attend the calls.
Note that this is much more than the original information
(which was just the average number of calls per hour).
We can see that the sum of all probabilities is
$$\sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda}\,e^{\lambda} = 1$$
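For the call-centre example, a sketch of how these probabilities might be used; the rate of λ = 10 calls per hour is an assumed, illustrative figure:

```python
from math import exp, factorial

lam = 10.0   # assumed mean rate: 10 calls per hour
poisson = lambda k: exp(-lam) * lam**k / factorial(k)

# Probability that 15 lines suffice, i.e. at most 15 calls arrive in an hour:
print(sum(poisson(k) for k in range(16)))   # ~0.951
```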
[Figure: Poisson probabilities against k for λ = 1, λ = 5 and λ = 10]
3.2.7 Poisson approximation to the binomial
distribution
3.3 Continuous random variables
In the previous section we considered discrete random
variables – variables that can take one of a countable set
of distinct values. In other cases the variable can vary
continuously over a range: take for example the
distribution of student heights within your class. In this
case it is no longer possible to define the probability of
getting any single value of the variable because there is
an infinite number of them. It turns out that we have to
define the probability not of getting a particular value, but
the probability of the variable being within a given range.
It is best to approach this problem indirectly by starting
with the cumulative distribution function (CDF) – sometimes also called the cumulative probability function or CPF. This is defined as the function $F(x)$ equal to the probability that the variable is less than or equal to a particular value:
$$F(x) = P(X \leq x)$$
Discrete random variables also have CDFs; the one for the Poisson distribution with λ = 25 is shown below.
[Figure: CDF (staircase) and probability function for the Poisson distribution with λ = 25, plotted against k]
We can see that because the probability function is discrete this CDF has a staircase shape. We can also note that (all these should be evident if you have understood the definition above):
- it starts from a value of 0,
- rising monotonically
- to a final value of 1, and that
- it is steeper where the probability density function is largest.
We can determine the probability that the random variable lies inside a given range from the CDF:
$$P(x_l < X \leq x_u) = F(x_u) - F(x_l)$$
$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \quad \text{(condition B)}$$
[Figure: a pdf f(x), with $P(x_l < X < x_u)$ equal to the area under the pdf graph between $x_l$ and $x_u$]
$$\mu_x = E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$$
defined in the same way as before, and is derived from the second moment of the distribution:
$$\sigma^2 = E\left[(X - \mu_x)^2\right] = E[X^2] - \left(E[X]\right)^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left(\int_{-\infty}^{\infty} x f(x)\,dx\right)^2$$
[Figure: a skewed pdf f(x), marking the mode $x_{\text{mode}}$ and the median $x_{\text{median}}$; 50% of the area lies on either side of the median]
The standard deviation is one measure of the dispersion of
a distribution: the higher the value of the standard
deviation, the less concentrated around the mean the
distribution is. For some distributions it makes more sense
to extend the median idea to give other proportions of the
distribution, so you get quartiles (25% of samples are
likely to fall below the lower quartile and 75% above).
There is also the upper quartile, where the proportions are the other way around. More generally we talk about percentiles. A typical one is the 95th percentile: on average only 5% of the samples will be higher than this value. It is used in civil engineering to define the 'characteristic load', the standard load used in design for the service (everyday) condition, which is the load expected to be exceeded only once in 20 times.
3.3.4 Uniform distribution
[Figure: the uniform pdf, equal to 1/(b − a) between a and b, together with its CDF rising from 0 to 1 over the same interval]
The uniform distribution is used where there is no reason
to assume any particular value is more likely to occur than
any other, and being particularly simple it is easy to use
mathematically.
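A small simulation sketch (with illustrative endpoints a = 2 and b = 5), checking the sample mean against the midpoint (a + b)/2:

```python
import random

a, b = 2.0, 5.0
samples = [random.uniform(a, b) for _ in range(100_000)]
print(sum(samples) / len(samples), (a + b) / 2)   # both ~3.5
```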
3.3.5 Exponential probability density function
$$F(t) = 1 - \exp(-\lambda t)$$
We can now differentiate this to get the pdf
$$f(t) = \begin{cases} 0 & t < 0 \\ \lambda \exp(-\lambda t) & t \geq 0 \end{cases}$$
The exponential distribution function is used extensively in
failure analysis, where the event we are trying to detect is
a failure occurring. Think about questions like this: if on
average we get one failure every three days, what is the
likelihood that a certain component will fail today? It is
also the distribution of inter-arrival times in a Poisson
process. This can be very useful in systems which have to
react to events that occur randomly: think for example
about a maintenance department at a large company
where system failures represent huge costs, or an
ambulance service, which cannot afford to make the users
wait for more than a few minutes.
$$\mu_t = E[T] = \int_0^{\infty} t\,\lambda e^{-\lambda t}\,dt = \frac{1}{\lambda}$$
$$E[T^2] = \int_0^{\infty} t^2\,\lambda e^{-\lambda t}\,dt = \frac{2}{\lambda^2}$$
from which we can say
$$\sigma_T^2 = E[T^2] - \left(E[T]\right)^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$
so the mean and the standard deviation have the same value, 1/λ.
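A simulation sketch of this result; λ = 0.5, i.e. one failure every two days on average, is an illustrative choice:

```python
import random
import statistics

lam = 0.5
samples = [random.expovariate(lam) for _ in range(200_000)]

# Mean and standard deviation should both approach 1/lambda = 2.0
print(statistics.mean(samples), statistics.stdev(samples))
```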
3.3.6 Normal or Gaussian distribution
This is plotted, together with the pdf, in the Figure below:
[Figure: the standardised normal N(0,1) pdf and its CDF Φ(z), plotted for z from −5 to 5]
want to solve the inverse problem, for example finding the
value z0 corresponding to the 90th percentile, i.e.
P(z<z0)=0.9. Tables for this are also widely available, including in HLT. Assuming a distribution N(0,1), and given that the values in the HLT table correspond to P(z>z0) and are provided in percentage points, we just need to find the entry for P = 10, which gives 1.2816. A useful statistic is that the deviation giving the 95th percentile is 1.64σ.
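These lookups can be reproduced without tables using the standard library's NormalDist (Python 3.8+); a minimal sketch:

```python
from statistics import NormalDist

z = NormalDist()                # N(0, 1)
print(z.inv_cdf(0.90))          # ~1.2816, the 90th percentile quoted above
print(z.inv_cdf(0.95))          # ~1.6449, the 1.64-sigma figure for the 95th
```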
3.3.7 Central limit theorem: approximations using
the normal distribution
The continuity correction
In the conversion between discrete and continuous distributions, you can improve the accuracy of the estimate by evaluating the continuous CDF at the mid-point between the steps in the discrete CDF. This is called the continuity correction.
The CDF for a discrete probability function corresponding to a discrete random variable X is flat between the discrete values, so we can say
$$P(X \leq k) = P(X < k+1)$$
To approximate using a continuous random variable Y, we have the choice of any value in the range between k and k+1. It turns out that the value in the middle of the interval is a pretty good approximation:
$$P(X \leq k) \approx P(Y \leq k + 0.5)$$
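A sketch comparing the exact binomial CDF with the normal approximation, with and without the correction; B(25, 0.5) is an illustrative choice:

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 25, 0.5
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(13))  # P(X <= 12)

Y = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))   # normal approximation
print(exact)        # 0.5
print(Y.cdf(12.0))  # ~0.42, without the correction
print(Y.cdf(12.5))  # 0.5, with the continuity correction
```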
[Figure: a discrete staircase CDF and its continuous approximation, illustrating the continuity correction]
3.3.8 The chi-square distribution
[Figure: chi-square pdfs for n = 3, n = 5 and n = 10 degrees of freedom]
3.3.9 The t-distribution
PART 2. STATISTICS
In the lectures up to now we have been trying to build
models that provide an accurate representation of actual
phenomena which include a certain random component.
This is called probability theory. In the following we will
study ways to apply these models to analyse real-life
observations, in order to inform our decisions: this is
known as statistics.
4 Random sampling
Imagine you are in charge of quality control for a
company which manufactures a certain product. Your duty
is to assure that the products that come out of the
production line are of sufficient quality: if a faulty batch is detected when the product is already on the market, a fix will be much more expensive and might cause significant damage to the company's reputation.
If you are manufacturing a cheap product (let’s say
screws), it obviously makes no sense to test every single
one of them: the cost of testing will be higher than the
actual production costs. If the product is expensive (let’s
say cars), you might afford to test each one of them, but
then again there are different levels of testing, and you
might not be able to test absolutely every single possible
defect on every single car.
The solution: take a subset of the products, selected
randomly (this is called a random sample) and test this
as a representation of the whole production. There are a
number of questions you can think about: how many
samples do I need? What are the maximum/minimum
values for the tested parameters that I can accept?
Statistics provides tools to solve these questions and
similar ones in lots of different engineering applications.
4.1 Characterising random samples
Probably the most important parameter we can use to
characterise a random sample is its mean. Let’s say we
want to estimate the mean height of the buildings in
Oxford. We can take a random sample of n=100 buildings
(here, as in many other situations, you need to take extra
care to assure that the sample is really random), measure
each one of them and calculate the sample mean:
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$$
the distribution produces precisely this sample, which in a discrete distribution would be
$$l = f(x_1)\,f(x_2)\cdots f(x_n)$$
and thus the log-likelihood (written here for a normal distribution with mean m and standard deviation σ) is
$$\ln l = \sum_{j=1}^{n} \ln f(x_j) = -\frac{n}{2}\ln 2\pi - n\ln\sigma - \sum_{j=1}^{n}\frac{(x_j - m)^2}{2\sigma^2}$$
Now we just have to differentiate this with respect to m and σ and set the results equal to zero to find the estimates (i.e. the stationary points), which are
$$\hat{m} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n}\left(x_j - \hat{m}\right)^2$$
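A sketch applying these estimators to a small sample; here we borrow the mpg data used later in Section 7:

```python
xs = [100.3, 102.1, 95.6, 97.7, 99.8, 103.2, 96.4, 98.5]
n = len(xs)

m_hat = sum(xs) / n                               # maximum-likelihood mean
var_hat = sum((x - m_hat) ** 2 for x in xs) / n   # note the divisor n
print(m_hat, var_hat)   # 99.2 ~6.19
```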
As you can see, these expressions are hardly surprising:
they are almost exactly the same as the sample mean and
variance. Note the little hat (more formally called a caret) on the symbols we have used here, $\hat{m}$ and $\hat{\sigma}^2$. These indicate that the values are estimates, rather than the exact mean and variance of the underlying probability distribution.
Note
Some people define the sample variance with a divisor of n − 1 rather than n:
$$s^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(x_j - \bar{x}\right)^2$$
This unbiased form is the one that appears in the confidence interval for the variance below.
6 Confidence intervals
So if we take a random sample from a normal distribution,
and we want to use the sample mean as an estimate for
the actual mean of the distribution, how confident can we
be that the estimate is good enough for our purposes?
Important decisions might rely on this: for example, a
change in the mean of the diameters of the measured
screws might be due to errors in the manufacturing
process (which should lead to stopping the production line
for an assessment), or might just be caused by chance.
To assess this we introduce the concept of confidence intervals. For a distribution parameter θ (e.g. the mean), a confidence interval is a range of values θ1 ≤ θ ≤ θ2 that contains the actual value of θ with a certain confidence level γ (more on this later). Typically the confidence level is chosen beforehand, and is usually above the 90% value. We will refer to θ1 and θ2 as the lower and upper confidence limits.
There are different symbols used for confidence intervals.
Here we will use the convention in Kreyszig:
$$\text{CONF}_\gamma\left\{\theta_1 \leq \theta \leq \theta_2\right\}$$
We also know that we can transform a normal distribution $\bar{X} \sim N(m, \sigma^2/n)$ into the standardised $Z \sim N(0,1)$ by using the change of variable
$$Z = \frac{\bar{X} - m}{\sigma/\sqrt{n}} \quad\Longleftrightarrow\quad \bar{X} = m + \frac{\sigma}{\sqrt{n}}\,Z$$
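A sketch combining this standardisation with an inverse-CDF lookup to produce a 95% confidence interval for the mean; the numbers (x̄ = 99.2, σ² = 10, n = 8) are borrowed from the Section 7 example:

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma2, n, gamma = 99.2, 10.0, 8, 0.95
c = NormalDist().inv_cdf(0.5 + gamma / 2)        # ~1.96 for gamma = 0.95
half_width = c * sqrt(sigma2 / n)
print(x_bar - half_width, x_bar + half_width)    # ~(97.0, 101.4)
```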
6.2 Confidence interval for the mean when the
variance is unknown
We will skip the mathematical proofs here – you can go to
the recommended literature for details. Overall the
process is very similar to the one described above, except
in two aspects. First, given that we do not know the value
of the variance, we will use the sample variance as a
substitute. Second, in this case the mean does not follow
a normal distribution but the t-Distribution with n-1
degrees of freedom (where, as before, n is the number
of elements in our random sample).
As mentioned above, tables for the t-distribution are also
easy to find (in HLT, for example). Let’s say our random
sample contains 6 measurements (that means 5 degrees
of freedom). For a confidence level γ = 0.95, we need to look at the percentage point 2.5 in the HLT table, which lists a value of 2.571 (if you don't understand why we looked at the percentage point 2.5 rather than 5, take into account the symmetry of the confidence interval).
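The same lookup can be done in software; a sketch using scipy (an assumed dependency – the HLT tables give the same value):

```python
from scipy.stats import t

n, gamma = 6, 0.95
c = t.ppf(1 - (1 - gamma) / 2, df=n - 1)   # two-sided: percentage point 2.5
print(c)   # ~2.571
```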
When the number of degrees of freedom is large (i.e.
when our sample contains many measurements), the
normal distribution is a good approximation; however
when we have a small number of measurements the
difference can be important.
6.3 Confidence interval for the variance of a normal
distribution
So we have seen how to calculate confidence intervals for
the mean of the distribution when we assume we know
the distribution variance, and when we need to use the
sample variance as an estimation. There is one more thing
we may want to characterise: can we estimate a
confidence interval for the variance itself?
As above, we will skip the mathematical proof, which can
be found in any of the recommended books. Let’s just say
that the sample variance can be characterised using a
random variable with a chi-square distribution with n-1
degrees of freedom, in the following way:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
where n is the number of samples, S² is the sample variance, and σ² represents the population variance. Some simple rearrangement of the variables results in the following expression for a confidence interval:
$$\text{CONF}_\gamma\left\{\frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}} \;\leq\; \sigma^2 \;\leq\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}\right\}$$
with $\chi^2_{\alpha,n}$ being the value for which $P(X > \chi^2_{\alpha,n}) = \alpha$.
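A sketch of this interval using scipy's chi-square quantiles (an assumed dependency); n = 10 and S² = 4.0 are illustrative values:

```python
from scipy.stats import chi2

n, S2, gamma = 10, 4.0, 0.95
alpha = 1 - gamma

# chi2.ppf gives the value x with P(X <= x) = q, while the notes' chi^2_{a,n}
# is an upper-tail value, hence the 1 - alpha/2 and alpha/2 arguments below.
lower = (n - 1) * S2 / chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * S2 / chi2.ppf(alpha / 2, df=n - 1)
print(lower, upper)   # ~(1.89, 13.33)
```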
$$np(1-p) \approx n\hat{p}(1-\hat{p}), \qquad np \approx n\hat{p}$$
and thus the confidence interval is
$$\text{CONF}_\gamma\left\{\hat{p} - c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\leq\; p \;\leq\; \hat{p} + c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right\}$$
where c is the corresponding quantile of the standardised normal distribution.
This can also be used to calculate the required sample size to achieve a specified confidence level. Note however that this would require an initial estimate of p.
7 Hypothesis testing
As suggested above, many times in engineering
applications we have to make decisions based on
information coming from random processes. Imagine, for
example, that you have to decide whether the pieces from
a certain manufacturer meet the requirements you need
for your products. It is not possible to test every single
piece, so you may have to rely on the measurements
coming from a small random sample for your decision. Or
let’s say your department has come up with a new
formulation for a chemical product, and you want to check
whether this is better or worse than the previous one you
were producing. Again, you will need to make an
important decision based on a limited number of samples.
The theory that lies behind this type of decision is called hypothesis testing, and is one of the most important areas of statistics when applied to engineering.
As a working example, let’s say we are producing a new
hybrid car which we expect to get 100 miles per gallon.
We have carried out tests on the first 8 cars coming out of
the production line, which produced the following values:
100.3 102.1 95.6 97.7 99.8 103.2 96.4 98.5
How can we determine whether the average mpg of the
produced cars (we will call it µ) is the one we expected?
The basic idea is the following: we are trying to check a
hypothesis (µ=100). We call this the null hypothesis.
We also need to define an alternative hypothesis, which
in this case will be µ≠100. We also have to define a
significance level. This concept has a similar role to the
confidence level we were using in parameter estimation,
and it is related to how sure we need to be about the
result in order to accept or reject the hypothesis. Let’s say
we are looking for a significance level α = 0.05 (we will revisit the concept of significance level later). We will also assume that the measurements come from a normal distribution. Finally, let's assume that the variance of the distribution is known: σ² = 10.
You can see the parallelism to the situation in which we
were trying to establish a confidence interval for the mean
when the standard deviation is known. We can thus use
the tables for the normal distribution (if we did not know
the variance we would use instead the Student’s t-
distribution, with n-1=7 degrees of freedom). We can see
that 95% of the area under the probability distribution
corresponds to a range of ±1.96 standard deviations about the mean. From the list of values
above, the sample mean is 99.2 and we know the
variance is σ² = 10. Applying the usual change of variable,
if the mean is 100 we can expect the sample mean to be
in the following interval with 95% probability (i.e. a
significance level of 100-95=5%):
$$100 - 1.96\,\sigma/\sqrt{n} \;\leq\; \bar{x} \;\leq\; 100 + 1.96\,\sigma/\sqrt{n}, \quad \text{i.e.} \quad 97.8 \leq \bar{x} \leq 102.2$$
Our sample mean is 99.2, which falls within the interval,
and thus we can accept the hypothesis: to the significance
level required, the car meets the mpg condition. Note
however that, if we had set a lower significance level the
result might have been different. As you can see, setting
the significance level is an important decision.
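The whole test, sketched as a direct computation:

```python
from math import sqrt
from statistics import NormalDist

xs = [100.3, 102.1, 95.6, 97.7, 99.8, 103.2, 96.4, 98.5]
mu0, sigma2, alpha = 100.0, 10.0, 0.05

x_bar = sum(xs) / len(xs)                     # 99.2
z = (x_bar - mu0) / sqrt(sigma2 / len(xs))    # ~-0.72
c = NormalDist().inv_cdf(1 - alpha / 2)       # ~1.96
print(abs(z) <= c)   # True: accept the null hypothesis at this level
```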
There is another aspect in which we could have taken a
different approach. We used µ=100 as the null hypothesis,
so the alternative hypothesis µ≠100 includes either
having a mean significantly lower or significantly higher
than this. This is called a two-sided test (called two-
tailed in some textbooks). If we were sure that the mean
mpg is not going to be higher than 100, we would only
need to check the alternative hypothesis µ<100. This
would be a one-sided test (one-tailed), which in turn can
be left-sided (alternative hypothesis µ<100) or right-sided
(alternative hypothesis µ>100). You can easily check that
the limits of the confidence interval are different between
the one- and two-sided cases. Thus the choice between
the two types of tests is important, and requires a careful
assessment of the particular situation. This is important
for example in medical research tests: the developers of a new drug may insist that only a possible improvement in patients' condition is tested (a one-sided test), while it
may be more appropriate to use a two-sided test to check
whether the drug can actually be worse than currently
available treatments.
7.1 Type I and type II errors
When we make a decision based on the result of a
hypothesis test, we run the risk of it being wrong. There
are two possible types of errors:
Type I errors occur when a true hypothesis is rejected.
In the previous example, it would happen if we wrongly
conclude that the mpg is different from 100. This can
happen in any case if we are unlucky with the values that
come up in the random sample, but it is more likely to occur if we choose an overly restrictive significance level α. In fact, we can easily see that the value of α is the probability of making a type I error.
Type II errors occur when a hypothesis that should have
been rejected is accepted as true: in the previous example,
if we conclude that mpg=100 when in reality it is not. This
is more likely to occur if α is small: in general there is a trade-off between type I and type II errors. We call β the probability of making a type II error, and η = 1 − β is called the power of the test.
In the example of the hybrid car above, we calculated the interval that gave us a probability of type I errors α = 0.05. We can now go on to calculate η, the power of the test. Let's again assume that the variance of the distribution is known: σ² = 10 (the same could be calculated for an unknown variance, a case we will not discuss here). The value of η depends on the actual mean of the distribution µ: obviously if µ is very far from the test value 100, the
probability of making a type II error will be very small. We can calculate η as
$$\eta(\mu) = P_\mu(\bar{x} < 97.8) + P_\mu(\bar{x} > 102.2)$$
where the subscript µ indicates that the probability is calculated for this particular value of the mean. To give some example values, η(99.0) = 0.14, meaning that if the mean of the process is 99.0 (rather than the hypothesised 100.0), the probability of this test correctly rejecting the hypothesis is only 14%. On the other hand η(97.0) = 0.76, so if the mean goes down to 97.0, the test will correctly reject the hypothesis with probability 76%. We could continue finding values manually or, much more efficiently, using suitable software, to produce a curve for the power function.
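A minimal sketch of such a computation, using the standard library's NormalDist:

```python
from math import sqrt
from statistics import NormalDist

se = sqrt(10.0 / 8)   # sigma / sqrt(n) for sigma^2 = 10, n = 8

def power(mu):
    """Power of the two-sided test: probability the sample mean falls
    outside the acceptance interval (97.8, 102.2) when the true mean is mu."""
    sampling = NormalDist(mu, se)   # distribution of the sample mean
    return sampling.cdf(97.8) + (1 - sampling.cdf(102.2))

for mu in (99.0, 98.0, 97.0):
    print(mu, round(power(mu), 2))   # 99.0 -> 0.14, 97.0 -> 0.76
```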
Keeping the value of α = 0.05, we can see how the curve changes with different values of n:
In the next Figure, we keep the size of the random sample at n = 8, and we illustrate the effects of changing the value of
8 Conclusion
In these four lectures we have discussed:
- The concept of probability, both intuitively and in a formal mathematical framework.
- Methods to calculate probability using the concept of relative frequency.
- The concept of prior probability and Bayes' theorem.
- Some of the most important probability distributions, in particular the normal or Gaussian.
- The concept of statistical testing, and some basic testing methods.
These concepts find applications in all branches of
engineering – and are explored in more detail in some of
the advanced courses.