ST104b Statistics 2
J.S. Abdey
2014
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This is an extract from a subject guide for an undergraduate course offered as part of the
University of London International Programmes in Economics, Management, Finance and
the Social Sciences. Materials for these programmes are developed by academics at the
London School of Economics and Political Science (LSE).
For more information, see: www.londoninternational.ac.uk
This guide was prepared for the University of London International Programmes by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London School
of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that due
to pressure of work the author is unable to enter into any correspondence relating to, or
arising from, the guide. If you have any comments on this subject guide, favourable or
unfavourable, please use the form at the back of this guide.
Contents

1 Introduction
2 Probability theory
3 Random variables
4 Common distributions of random variables
5 Multivariate random variables
6 Sampling distributions of statistics
7 Point estimation
8 Interval estimation
9 Hypothesis testing
10 Analysis of variance (ANOVA)
11 Linear regression
Chapter 1
Introduction
1.1 Route map to the subject guide
This subject guide provides you with a framework for covering the syllabus of the
ST104b Statistics 2 half course and directs you to additional resources such as
readings and the virtual learning environment (VLE).
The following 10 chapters will cover important aspects of elementary statistical theory,
upon which many applications in EC2020 Elements of econometrics draw heavily.
The chapters are not a series of self-contained topics, rather they build on each other
sequentially. As such, you are strongly advised to follow the subject guide in chapter
order. There is little point in rushing past material which you have only partially
understood in order to reach the final chapter. Once you have completed your work on
all of the chapters, you will be ready for examination revision. A good place to start is
the sample examination paper which you will find at the end of the subject guide.
ST104b Statistics 2 extends the work of ST104a Statistics 1 and provides a precise
and accurate treatment of probability, distribution theory and statistical inference. As
such there will be a strong emphasis on mathematical statistics as important discrete
and continuous probability distributions are covered and properties of these
distributions are investigated.
Point estimation techniques are discussed, including the method of moments, least squares
and maximum likelihood estimation. Confidence interval construction and statistical
hypothesis testing follow. Analysis of variance and a treatment of linear regression
models, featuring the interpretation of computer-generated regression output and
implications for prediction, round off the course.
Collectively, these topics provide a solid training in statistical analysis. As such,
ST104b Statistics 2 is of considerable value to those intending to pursue further
study in statistics, econometrics and/or empirical economics. Indeed, the quantitative
skills developed in the subject guide are readily applicable to all fields involving real
data analysis.
1.2 Introduction to the subject area

Statistical methods are of practical importance in many applied areas. The examples in this subject guide will
concentrate on the social sciences, but the methods are important for the physical
sciences too. This subject aims to provide a grounding in probability theory and some
of the most common statistical methods.
The material in ST104b Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing
how the ideas emerge in detail.
How to study statistics
For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
1.3 Syllabus
1.4 Aims of the course

The aim of this half course is to develop students' knowledge of elementary statistical
theory. The emphasis is on topics that are of importance in applications to
econometrics, finance and the social sciences. Concepts and methods that provide the
foundation for more specialised courses in statistics are introduced.
1.5 Learning outcomes
At the end of this half course, and having completed the Essential reading and
activities, you should be able to:
- apply and be competent users of standard statistical operators and be able to recall a variety of well-known distributions and their respective moments
- explain the fundamentals of statistical inference and apply these principles to justify the use of an appropriate model and perform hypothesis tests in a number of different settings
- demonstrate understanding that statistical techniques are based on assumptions and the plausibility of such assumptions must be investigated when analysing real problems.
1.6 Overview of learning resources

1.6.1 The subject guide
This course builds on the ideas encountered in ST104a Statistics 1. Although this
subject guide offers a complete treatment of the course material, students may wish to
consider purchasing a textbook. Apart from the textbooks recommended in this subject
guide, you may wish to look in bookshops and libraries for alternative textbooks which
may help you. A critical part of a good statistics textbook is the collection of problems
to solve, and you may want to look at several different textbooks just to see a range of
practice questions, especially for tricky topics. The subject guide is there mainly to
describe the syllabus and to show the level of understanding expected.
The subject guide is divided into chapters which should be worked through in the order
in which they appear. There is little point in rushing past material you only partly
understand to get to later chapters, as the presentation is somewhat sequential and not
a series of self-contained topics. You should be familiar with the earlier chapters and
have a solid understanding of them before moving on to the later ones.
The following procedure is recommended:
1. Read the introductory comments.
2. Consult the appropriate section of your textbook.
3. Study the chapter content, examples and learning activities.
4. Go through the learning outcomes carefully.
5. Attempt some of the problems from your textbook.
6. Refer back to this subject guide, or to the textbook, or to supplementary texts, to
improve your understanding until you are able to work through the problems
confidently.
The last two steps are the most important. It is easy to think that you have understood
the material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
Each chapter of the subject guide has suggestions for reading from the main textbook.
Usually, you will only need to read the material in the main textbook (see Essential
reading below), but it may be helpful from time to time to look at others.
Basic notation
We often use the symbol □ to denote the end of a proof, where we have finished explaining why a particular result is true. This is just to make it clear where the proof ends and the following text begins.
Time management
About one-third of your self-study time should be spent reading and the rest should be
spent solving problems. An internal student would expect maybe 15 hours of formal
teaching and another 50 hours of private study to be enough to cover the subject. Of
the 50 hours of private study, about 17 hours should be spent on the initial study of the
textbook and subject guide. The remaining 33 hours should be spent on attempting
problems, which may well require more reading.
Calculators
A calculator may be used when answering questions on the examination paper for
ST104b Statistics 2. It must comply in all respects with the specification given in the
Regulations. You should also refer to the admission notice you will receive when
entering the examination and the Notice on permitted materials.
Make sure you accustom yourself to using your chosen calculator and feel comfortable
with it. Specifically, calculators must:

- have no external wires

and must be:

- hand held
- compact and portable
- quiet in operation
- non-programmable

and must:

- not be capable of receiving, storing or displaying user-supplied non-numerical data.
The Regulations state: 'The use of a calculator that communicates or displays textual messages, graphical or algebraic information is strictly forbidden. Where a calculator is permitted in the examination, it must be a non-scientific calculator. Where calculators are permitted, only calculators limited to performing just basic arithmetic operations may be used. This is to encourage candidates to show the Examiners the steps taken in arriving at the answer.'
Computers
If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package such as
Minitab, R or SPSS. It is not necessary for this course to have such software available,
but if you do have access to it you may benefit from using it in your study of the
material.
1.6.2 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060].
Statistical tables
Lindley, D.V. and W.F. Scott, New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 978-0521484855].
These statistical tables are the same ones that are distributed for use in the
examination, so it is advisable that you become familiar with them, rather than those
at the end of a textbook.
1.6.3 Further reading
Please note that, as long as you read the Essential reading, you are then free to read
around the subject area in any text, paper or online resource. You will need to support
your learning by reading as widely as possible and by thinking about how these
principles apply in the real world. To help you read extensively, you have free access to
the virtual learning environment (VLE) and University of London Online Library (see
below).
Other useful texts for this course include:
Johnson, R.A. and G.K. Bhattacharyya, Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.L. Marx, Introduction to Mathematical Statistics and Its
Applications (Pearson, 2013) fifth edition [ISBN 9781292023557].
While Newbold et al. is the main textbook for this course, there are many that are just
as good. You are encouraged to look at those listed above and at any others you may
find. It may be necessary to look at several textbooks for a single topic, as you may find
that the approach of one textbook suits you better than that of another.
1.6.4 Online study resources
In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
virtual learning environment (VLE) and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at:
http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you forget your login details, please click on the 'Forgotten your password' link on the login page.
The VLE
The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:
- Self-testing activities: Doing these allows you to test your own understanding of the subject material.
- Electronic study materials: The printed materials that you receive from the University of London are available to download, including updated reading lists and references.
- Past examination papers and Examiners' commentaries: These provide advice on how each examination question might best be answered.
- A student discussion forum: This is an open space for you to discuss interests and experiences, seek support from your peers, work collaboratively to solve problems and discuss subject material.
- Videos: There are recorded academic introductions to the subject, interviews and debates and, for some courses, audio-visual tutorials and conclusions.
- Recorded lectures: For some courses, where appropriate, the sessions from previous years' Study Weekends have been recorded and made available.
- Study skills: Expert advice on preparing for examinations and developing your digital literacy skills.
- Feedback forms.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
Making use of the Online Library
The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login:
http://tinyurl.com/ollathens
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please see the online help pages:
www.external.shl.lon.ac.uk/summon/about.php
Additional material
There is a lot of computer-based teaching material available freely over the web. A
fairly comprehensive list can be found in the Books & Manuals section of
http://statpages.org
Unless otherwise stated, all websites in this subject guide were accessed in April 2014.
We cannot guarantee, however, that they will stay current and you may need to perform an internet search to find the relevant pages.
1.7 Examination advice
Important: the information and advice given here are based on the examination
structure used at the time this subject guide was written. Please note that subject
guides may be used for several years. Because of this we strongly advise you to always
check both the current Regulations for relevant information about the examination, and
the VLE where you should be advised of any forthcoming changes. You should also
carefully check the rubric/instructions on the paper you actually sit and follow those
instructions.
Remember, it is important to check the VLE for:
- up-to-date information on examination and assessment arrangements for this course
- where available, past examination papers and Examiners' commentaries for the course, which give advice on how each question might best be answered.
The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found in past examination papers
available on the VLE).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the Examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully and not to dwell on one question for too long; move on and focus on solving the easier questions, coming back to harder ones later.
Chapter 2
Probability theory
2.1 Synopsis of chapter
Probability is very important for statistics because it provides the rules that allow us to
reason about uncertainty and randomness, which is the basis of statistics. Independence
and conditional probability are profound ideas, but they must be fully understood in
order to think clearly about any statistical investigation.
2.2 Aims of the chapter

2.3 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
- explain the fundamental ideas of random experiments, sample spaces and events
- list the axioms of probability and be able to derive all the common probability rules from them
- list the formulae for the number of combinations and permutations of k objects out of n, and be able to routinely use such results in problems
- explain conditional probability and the concept of independent events
- prove the law of total probability and apply it to problems where there is a partition of the sample space
- prove Bayes' theorem and apply it to find conditional probabilities.
2.4 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapter 3.
In addition there is essential watching of this chapter's accompanying video tutorials
accessible via the ST104b Statistics 2 area at http://my.londoninternational.ac.uk
2.5 Introduction
Consider the following hypothetical example: A country will soon hold a referendum
about whether it should join the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question 'Will you vote Yes or No to joining the EU?' as follows:
Answer    Count    %
Yes       513      54%
No        437      46%
Total     950      100%
However, we are not interested in just this sample of 950 respondents, but in the
population that they represent, that is all likely voters.
Statistical inference will allow us to say things like the following about the
population:
- A 95% confidence interval for the population proportion, π, of 'Yes' voters is (0.508, 0.572).
- The null hypothesis that π = 0.5, against the alternative hypothesis that π > 0.5, is rejected at the 5% significance level.
In short, the opinion poll gives statistically significant evidence that Yes voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results:
- Each response Xi is a realisation of a random variable from a Bernoulli distribution with probability parameter π.
- The responses X1, X2, . . . , Xn are independent of each other.
- The sampling distribution of the sample mean (proportion) X̄ has expected value π and variance π(1 − π)/n.
In the next few chapters, we will learn about the terms in bold, among others.
The need for probability in statistics
In statistical inference, the data that we have observed are regarded as a sample from a
broader population, selected with a random process:
Values in a sample are variable: If we collected a different sample we would not
observe exactly the same values again.
Values in a sample are also random: We cannot predict the precise values that will
be observed before we actually collect the sample.
Probability theory is the branch of mathematics that deals with randomness. So we
need to study this first.
A preview of probability
The first basic concepts in probability will be the following:
Experiment: For example, rolling a single die and recording the outcome.
Outcome of the experiment: For example, rolling a 3.
Sample space S: The set of all possible outcomes; here {1, 2, 3, 4, 5, 6}.
Event: Any subset A of the sample space, for example A = {4, 5, 6}.
Probability, P (A), will be defined as a function which assigns probabilities (real
numbers) to events (sets). This uses the language and concepts of set theory. So we
need to study the basics of set theory first.
2.6 Set theory: the basics
A set A is a subset of a set B, denoted A ⊆ B, when x ∈ A ⇒ x ∈ B.
Example 2.4 An example of the distinction between subsets and non-subsets is:
- {1, 2, 3} ⊆ {1, 2, 3, 4}, because all elements appear in the larger set.
- {1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This implies that A ⊆ B and B ⊆ A.
Unions of sets ('or')

The union, denoted ∪, of two sets is:

A ∪ B = {x | x ∈ A or x ∈ B}.

That is, the set of those elements which belong to A or B (or both). An example is shown in Figure 2.3.
For sets A1, A2, . . . , An, the union and intersection of all of them are written as:

⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ ⋯ ∪ An   and   ⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ ⋯ ∩ An.

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement ('not')

Suppose S is the set of all possible elements which are under consideration. In probability, S will be referred to as the sample space.

It follows that A ⊆ S for every set A we may consider. The complement of A with respect to S is:

Aᶜ = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.
Associative laws:

A ∪ (B ∪ C) = (A ∪ B) ∪ C   and   A ∩ (B ∩ C) = (A ∩ B) ∩ C.

Distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)   and   A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

De Morgan's laws:

(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ   and   (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ.

If S is the sample space and A and B are any sets in S, you can also use the following results without proof:

- ∅ᶜ = S.
- ∅ ⊆ A, A ⊆ A and A ⊆ S.
- A ∪ A = A and A ∩ A = A.
- A ∩ Aᶜ = ∅ and A ∪ Aᶜ = S.
- If B ⊆ A, A ∩ B = B and A ∪ B = A.
- A ∩ ∅ = ∅ and A ∪ ∅ = A.
- A ∩ S = A and A ∪ S = S.
- ∅ ∪ ∅ = ∅ and ∅ ∩ ∅ = ∅.
Partition
The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint and if ⋃_{i=1}^{n} Ai = A, that is, if A1, A2, . . . , An are collectively exhaustive of A. Similarly, an infinite sequence of sets A1, A2, . . . forms a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.
Example 2.7 Suppose A ⊆ B, so that A ∪ B = B. We have:

A ∩ (B ∩ Aᶜ) = (A ∩ Aᶜ) ∩ B = ∅ ∩ B = ∅

and:

A ∪ (B ∩ Aᶜ) = (A ∪ B) ∩ (A ∪ Aᶜ) = B ∩ S = B.

Hence A and B ∩ Aᶜ are mutually exclusive and collectively exhaustive of B, and so they form a partition of B.
2.7 Axioms of probability

Probability is defined as a function P(·) which assigns a real number P(A) to each event A, and which satisfies the following three axioms.

Axiom 1: P(A) ≥ 0 for all events A.

Axiom 2: P(S) = 1.

Axiom 3: If A1, A2, . . . is an infinite sequence of mutually exclusive events, then:

P(⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).
The precise definition also requires a careful statement of which subsets of S are allowed as events;
we can skip that on this course.
The axioms require that a probability function must always satisfy these requirements:
Axiom 1 requires that probabilities are always non-negative.
Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.
Axiom 3 states that if events A1 , A2 , . . . are mutually exclusive, the probability of
their union is simply the sum of their individual probabilities.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.
2.7.1 Basic properties of probability

Probability property
For the empty set, ∅, we have:

P(∅) = 0.    (2.1)
Proof: The sets ∅, ∅, . . . are mutually exclusive, so Axiom 3 gives P(∅) = Σ_{i=1}^{∞} P(∅), which is only possible if P(∅) = 0.

Probability property (finite additivity)

If A1, A2, . . . , An are mutually exclusive events, then:

P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai).

Proof: Apply Axiom 3 to the infinite sequence A1, . . . , An, ∅, ∅, . . ., whose union is ⋃_{i=1}^{n} Ai. Using (2.1):

P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{∞} P(Ai) = Σ_{i=1}^{n} P(Ai) + Σ_{i=n+1}^{∞} P(∅) = Σ_{i=1}^{n} P(Ai).
Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.
Probability property
For any event A, we have:
P(Aᶜ) = 1 − P(A).

Proof: We have that A ∪ Aᶜ = S and A ∩ Aᶜ = ∅. Therefore:

1 = P(S) = P(A ∪ Aᶜ) = P(A) + P(Aᶜ)

using the previous result with n = 2, A1 = A and A2 = Aᶜ, and the result follows by rearranging.
Probability property
For any event A, we have:
P(A) ≤ 1.

Proof (by contradiction): If it were true that P(A) > 1 for some A, then we would have:

P(Aᶜ) = 1 − P(A) < 0.

This violates Axiom 1, so cannot be true. Therefore it must be that P(A) ≤ 1 for all A. Putting this and Axiom 1 together, we get:

0 ≤ P(A) ≤ 1
for all events A.
Probability property
For any two events A and B, if A ⊆ B, then P(A) ≤ P(B).
2.8 Classical probability

In the classical case, where the sample space consists of m equally likely outcomes of which k belong to the event A, the probability of A is:

P(A) = k/m = (number of outcomes in A) / (total number of outcomes in sample space S).
That is, the probability of A is the proportion of outcomes that belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes that belong to the event, and the total number of possible
outcomes.
Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
Sample space: the 36 ordered pairs:
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
Outcomes in the event: A = {(1, 4), (2, 3), (3, 2), (4, 1)}.
The probability: P (A) = 4/36 = 1/9.
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P(A) = 1 − P(Aᶜ) is convenient when we want P(A) but the probability of the complementary event, P(Aᶜ), is easier to find.
Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?
The complement is that the sum is at most 3, i.e. the complementary event is Aᶜ = {(1, 1), (1, 2), (2, 1)}.

Therefore, P(A) = 1 − 3/36 = 33/36 = 11/12.
The formula:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?
We have P(A) = 6/36, P(B) = 3/36 and P(A ∩ B) = 1/36.

So P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.
2.8.1 Counting rules
A powerful set of counting methods answers the following question: How many ways are
there to select k objects out of n distinct objects?
The answer will depend on two things:
Whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once).
Whether the selected set is treated as ordered or unordered.
Ordered sets, with replacement
Suppose that the selection of k objects out of n needs to be:
ordered, so that the selection is an ordered sequence where we distinguish between
the 1st object, 2nd, 3rd etc.
with replacement, so that each of the n objects may appear several times in the
selection.
Then:
n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the
sequence.
The number of possible ordered sequences of k objects selected with replacement from n objects is therefore:

n × n × ⋯ × n (k times) = n^k.
Ordered sets, without replacement
Suppose that the selection of k objects out of n is again treated as an ordered sequence,
but that selection is now:
ordered, so that the selection is an ordered sequence where we distinguish between
the 1st object, 2nd, 3rd etc.
without replacement: if an object is selected once, it cannot be selected again.
Now:

- n objects are available for selection for the 1st object in the sequence
- n − 1 objects are available for selection for the 2nd object
- n − 2 objects are available for selection for the 3rd object
- . . . and so on, until n − k + 1 objects are available for selection for the kth object.

The number of possible ordered sequences of k objects selected without replacement from n objects is therefore:

n × (n − 1) × ⋯ × (n − k + 1).    (2.2)
Using factorials, (2.2) can also be written as:

n! / (n − k)!.
3. Only the dates matter, but not who has which one (unordered ), i.e. Amy
(January 1st), Bob (May 5th) and Sam (December 5th) is treated as the same
as Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different
people must have different birthdays (without replacement). The number of
different sets of birthdays is:
C(365, 3) = 365! / ((365 − 3)! × 3!) = (365 × 364 × 363) / (3 × 2 × 1) = 8,038,030

where C(365, 3) denotes the binomial coefficient '365 choose 3'.
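As a quick numerical check of these counting rules, Python's math module (used here purely for illustration) reproduces the counts for k = 3 selections from n = 365:

    import math

    print(365 ** 3)           # ordered, with replacement: 48,627,125
    print(math.perm(365, 3))  # ordered, without replacement: 365 * 364 * 363 = 48,228,180
    print(math.comb(365, 3))  # unordered, without replacement: 8,038,030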
Example 2.16 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following:
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with replacement), is 365^r.

2. The number of sequences where all birthdays are different (i.e. without replacement) is 365!/(365 − r)!.
Here 1. is the size of the sample space, and 2. is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:

P(Aᶜ) = [365! / (365 − r)!] / 365^r

and:

P(A) = 1 − P(Aᶜ) = 1 − [365! / (365 − r)!] / 365^r.
The probability P(A) that at least two people share a birthday, for different values of the number of people r, is given in the following table:

r    P(A)     r    P(A)     r    P(A)     r    P(A)
2    0.003    12   0.167    22   0.476    32   0.753
3    0.008    13   0.194    23   0.507    33   0.775
4    0.016    14   0.223    24   0.538    34   0.795
5    0.027    15   0.253    25   0.569    35   0.814
6    0.040    16   0.284    26   0.598    36   0.832
7    0.056    17   0.315    27   0.627    37   0.849
8    0.074    18   0.347    28   0.654    38   0.864
9    0.095    19   0.379    29   0.681    39   0.878
10   0.117    20   0.411    30   0.706    40   0.891
11   0.141    21   0.444    31   0.730    41   0.903

The smallest r for which P(A) > 1/2 is therefore r = 23.
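The table can be reproduced directly from the formula for P(A). A minimal Python sketch (illustrative only):

    def p_shared(r):
        # P(at least two of r people share a birthday), all days equally likely
        p_distinct = 1.0
        for i in range(r):
            p_distinct *= (365 - i) / 365  # builds 365!/((365 - r)! * 365^r)
        return 1 - p_distinct

    print(round(p_shared(23), 3))  # 0.507: the smallest r with P(A) > 1/2
    print(round(p_shared(41), 3))  # 0.903, matching the table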
2.9 Independence and conditional probability
Example 2.17 Suppose we roll two dice. We assume that all combinations of the values of the two dice are equally likely. Define the events:

A = 'score of die 1 is not 6'
B = 'score of die 2 is not 6'.

Then P(A) = 30/36 = 5/6 and P(B) = 30/36 = 5/6, and:

P(A ∩ B) = 25/36 = 5/6 × 5/6 = P(A) × P(B)

so A and B are independent.
Example 2.18 Consider four teachers, of whom one has a hat, a scarf and gloves, one has only a hat, one has only a scarf and one has only gloves. One teacher is selected at random; let H, S and G denote the events that the selected teacher has a hat, a scarf and gloves, respectively. Then:

P(H) = 2/4 = 1/2,   P(S) = 2/4 = 1/2   and   P(G) = 2/4 = 1/2

and similarly:

P(H ∩ S) = 1/4,   P(H ∩ G) = 1/4   and   P(S ∩ G) = 1/4.

From these results, we can verify that:

P(H ∩ S) = P(H) × P(S)
P(H ∩ G) = P(H) × P(G)
P(S ∩ G) = P(S) × P(G)

and so the events are pairwise independent. But one teacher has a hat, a scarf and gloves, so:

P(H ∩ S ∩ G) = 1/4 ≠ 1/8 = P(H) × P(S) × P(G).

Hence the three events are not independent. If the selected teacher has a hat and a scarf, then we know that the teacher has gloves. There is no independence for all three events together.
Independent versus mutually exclusive events
The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.8.
Conditional probability

The conditional probability of A given B is:

P(A | B) = P(A ∩ B) / P(B)

assuming that P(B) > 0. The conditional probability is not defined if P(B) = 0.
Example 2.19 Suppose we roll two independent fair dice again. Consider the following events:

A = 'at least one of the scores is 2'
B = 'the sum of the scores is greater than 7'.

These are shown in Figure 2.9. Now P(A) = 11/36 ≈ 0.31, P(B) = 15/36 and P(A ∩ B) = 2/36. The conditional probability of A given B is therefore:

P(A | B) = P(A ∩ B) / P(B) = (2/36) / (15/36) = 2/15 ≈ 0.13.
Figure 2.9: The 36 outcomes of rolling two dice, with the outcomes belonging to A and B marked. Conditioning on B restricts attention to the 15 outcomes in B, of which 2 also belong to A, so:

P(A | B) = (cases of A within B) / (cases of B) = 2/15.
If A and B are independent, so that P(A ∩ B) = P(A) × P(B), then:

P(A | B) = P(A ∩ B) / P(B) = P(A) × P(B) / P(B) = P(A)

and:

P(B | A) = P(A ∩ B) / P(A) = P(A) × P(B) / P(A) = P(B).

In other words, if A and B are independent, learning that B has occurred does not change the probability of A, and learning that A has occurred does not change the probability of B. This is exactly what we would expect under independence.
Chain rule: rearranging the definition of conditional probability gives:

P(A ∩ B) = P(A | B) × P(B).

That is, the probability that both A and B occur is the probability that A occurs given that B has occurred, multiplied by the probability that B occurs.
Example 2.22 Suppose you draw 4 cards from a deck of 52 playing cards. What is the probability of A = 'the cards are the 4 aces (cards of rank 1)'?

We could calculate this using counting rules. There are C(52, 4) = 270,725 possible subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore P(A) = 1/270,725.

Let us try with conditional probabilities. Define Ai as 'the ith card is an ace', so that A = A1 ∩ A2 ∩ A3 ∩ A4. The necessary probabilities are:

- P(A1) = 4/52, since there are initially 4 aces in the deck of 52 playing cards.
- P(A2 | A1) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51 playing cards from which the second card will be drawn.
- P(A3 | A1, A2) = 2/50.
- P(A4 | A1, A2, A3) = 1/49.
The chain rule then gives:

P(A) = (4/52) × (3/51) × (2/50) × (1/49) = 24/6,497,400 = 1/270,725.
Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.
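Both routes to the answer in Example 2.22 are easy to confirm numerically, for instance in Python (an illustrative check only):

    import math

    # Counting rule: one favourable subset out of C(52, 4)
    print(1 / math.comb(52, 4))               # 1/270,725

    # Chain rule: product of the conditional probabilities
    print((4/52) * (3/51) * (2/50) * (1/49))  # the same value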
More methods for summing probabilities
We now return to probabilities of partitions like the situation shown in Figure 2.10.
Figure 2.10: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3, and on the right the paths to A.
Both diagrams in Figure 2.10 represent the partition A = A1 ∪ A2 ∪ A3. For the next results, it will be convenient to use diagrams like the one on the right in Figure 2.10, where A1, A2 and A3 are symbolised as different paths to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
2.9.1 Total probability formula

Total probability formula

If B1, B2, . . . , BK form a partition of the sample space, then for any event A:

P(A) = Σ_{i=1}^{K} P(A ∩ Bi) = Σ_{i=1}^{K} P(A | Bi) × P(Bi).
Figure 2.11: On the left, a Venn diagram depicting the set A and the partition of S, and on the right the paths to A via B1, B2, . . . , B5.
Example 2.24 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity: If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99%
specificity: If a person does not have the disease, the test will give a negative result
with a probability of 0.99.
Let B denote the presence of the disease, and Bᶜ denote no disease. Let A denote a positive test result. We want to calculate P(A).
The probabilities we need are P(B) = 0.0001, P(Bᶜ) = 0.9999, P(A | B) = 0.99 and P(A | Bᶜ) = 0.01, and therefore:

P(A) = P(A | B) × P(B) + P(A | Bᶜ) × P(Bᶜ) = 0.99 × 0.0001 + 0.01 × 0.9999 = 0.010098.
2.9.2 Bayes' theorem
So far we have considered how to calculate P (A) for an event A which can happen in
different ways, via different events B1 , B2 , . . . , BK .
Now we reverse the question: Suppose we know that A has happened, as shown in
Figure 2.12.
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.13.
So we need:

P(Bj | A) = P(A ∩ Bj) / P(A)

where the denominator P(A) = Σ_{i=1}^{K} P(A | Bi) × P(Bi) is given by the total probability formula.
Bayes' theorem

Using the chain rule and the total probability formula, we have:

P(Bj | A) = [P(A | Bj) × P(Bj)] / [Σ_{i=1}^{K} P(A | Bi) × P(Bi)].
Example 2.25 Continuing with Example 2.24, let B denote the presence of the disease, Bᶜ denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:

- P(B) = 0.0001
- P(Bᶜ) = 0.9999
- P(A | B) = 0.99 and P(A | Bᶜ) = 0.01.
Then:

P(B | A) = [P(A | B) × P(B)] / [P(A | B) × P(B) + P(A | Bᶜ) × P(Bᶜ)] = (0.99 × 0.0001) / 0.010098 ≈ 0.0098.
Why is this so small? The reason is that most people do not have the disease and the test has a small, but non-zero, false positive rate P(A | Bᶜ). Therefore most positive test results are actually false positives.
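The calculations in Examples 2.24 and 2.25 can be scripted in a few lines; the sketch below (Python, with the probabilities taken from the examples) applies the total probability formula and then Bayes' theorem:

    p_B = 0.0001  # P(B): prevalence of the disease
    sens = 0.99   # P(A | B): sensitivity
    fpr = 0.01    # P(A | B^c): false positive rate

    p_A = sens * p_B + fpr * (1 - p_B)  # total probability formula
    print(p_A)                          # 0.010098
    print(sens * p_B / p_A)             # P(B | A), roughly 0.0098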
Example 2.26 You are waiting for your bag at the baggage return carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags that come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = 'your bag has been lost', and let 'x' denote the event that your bag is not among the first x bags to arrive. What we want to know is the conditional probability P(A | x) for any x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are:

- P(x | A) = 1 for all x. If the bag has been lost, it will not arrive!
- P(x | Aᶜ) = (200 − x)/200, if we assume that bags come out in a completely random order.

Using Bayes' theorem, we get:

P(A | x) = [P(x | A) × P(A)] / [P(x | A) × P(A) + P(x | Aᶜ) × P(Aᶜ)]
         = P(A) / (P(A) + [(200 − x)/200] × [1 − P(A)]).
Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were mishandled per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:
Figure 2.14: Plot of P(A | x) as a function of x (the number of bags arrived) for the two airlines in Example 2.26, Air Malta and BA.

So, the moral of the story is that even when nearly everyone else has collected their bags and left, do not despair!
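The AEA figures themselves are not reproduced in this extract, but the shape of the curves in Figure 2.14 can be explored with any assumed prior probability. In the sketch below (Python), the two prior values are illustrative placeholders only, not the 2006 AEA rates:

    def p_lost(x, prior, n=200):
        # P(A | x): bag lost, given that x of n bags have arrived without it
        return prior / (prior + ((n - x) / n) * (1 - prior))

    for prior in (0.005, 0.023):  # illustrative priors only
        print([round(p_lost(x, prior), 3) for x in (0, 100, 190, 199)])

Whatever the prior, P(A | x) stays close to P(A) until most bags have appeared, and only climbs steeply for the very last few.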
2.10 Overview of chapter
This chapter introduced some formal terminology related to probability. The axioms of
probability were introduced, from which various other probability results can be
derived. There followed a brief discussion of counting rules (using permutations and
combinations). The important concepts of independence and conditional probability
were discussed, and Bayes' theorem was derived.
2.11 Key terms and concepts

Axiom                       Bayes' theorem
Combination                 Complement
Conditional probability     Disjoint
Elementary outcome          Empty set
Experiment                  Event
Independence                Intersection
Mutually exclusive          Outcome
Partition                   Permutation
Probability                 Random experiment
Sample space                Set
Union                       Venn diagram

2.12 Learning activities
1. Why is S = {1, 1, 2} not a sensible way to try to define a sample space?
2. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
3. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅ and A ∪ ∅.
4. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
5. Suppose that we toss a fair coin twice. The sample space is therefore S = {HH, HT, TH, TT}, where the elementary outcomes are defined in the obvious way; for instance, HT is heads on the first toss and tails on the second toss. Show that if all four elementary outcomes are equally likely, then the events 'heads on the first toss' and 'heads on the second toss' are independent.
6. Show that if A and B are disjoint events, and are also independent, then P (A) = 0
or P (B) = 0. (Notice that independence and disjointness are not similar ideas.)
7. Write down the condition for the three events A, B and C to be independent.
8. Prove Bayes' theorem from first principles.
9. A statistics teacher knows from past experience that a student who does homework
consistently has a probability of 0.95 of passing the examination, whereas a student
who does not do homework at all has a probability of 0.30 of passing the
examination.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass the examination?
(b) If a student chosen at random from the group gets a pass in the examination,
what is the probability that the student had done homework consistently?
2.13 A reminder of your learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

- explain the fundamental ideas of random experiments, sample spaces and events
- list the axioms of probability and be able to derive all the common probability rules from them
- list the formulae for the number of combinations and permutations of k objects out of n, and be able to routinely use such results in problems
- explain conditional probability and the concept of independent events
- prove the law of total probability and apply it to problems where there is a partition of the sample space
- prove Bayes' theorem and apply it to find conditional probabilities.
2.14 Sample examination questions
1. (a) A, B and C are any three events in the sample space S. Prove that:

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C).

(b) A and B are events in a sample space S. Show that:

P(A ∩ B) ≤ [P(A) + P(B)] / 2 ≤ P(A ∪ B).
Chapter 3
Random variables
3.1 Synopsis of chapter
This chapter introduces the concept of random variables and probability distributions.
These distributions are univariate, which means that they are used to model a single
numerical quantity. The concepts of expected value and variance are also discussed.
3.2 Aims of the chapter
3.3 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
- define a random variable and distinguish it from the values that it takes
- explain the difference between discrete and continuous random variables
- find the mean and the variance of simple random variables, whether discrete or continuous
- demonstrate how to proceed and use simple properties of expected values and variances.
3.4 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapters 4 and 5.
In addition there is essential watching of this chapter's accompanying video tutorials
accessible via the ST104b Statistics 2 area at http://my.londoninternational.ac.uk
3.5 Introduction
Notation
A random variable is typically denoted by an upper-case letter, for example X (or Y ,
W , etc.). A specific value of a random variable is often denoted by a lower-case letter,
for example, x.
Probabilities of values of a random variable are written like this: P(X = x).

Each of the following sample quantities, familiar from ST104a Statistics 1, has a population counterpart for a random variable:

- sample distribution
- sample mean (average)
- sample variance and standard deviation
- sample median.
3.6 Discrete random variables
Example 3.1 The following two examples will be used throughout this chapter.
1. Number of people living in a randomly selected household in England.
For simplicity, we use the value 8 to represent 8 or more (because 9 and
above are not reported separately in official statistics).
This is a discrete random variable, with possible values of 1, 2, 3, 4, 5, 6, 7
and 8.
2. A person throws a basketball repeatedly from the free-throw line, trying to
make a basket. Consider the following random variable:
Number of unsuccessful throws before the first successful throw.
The possible values of this are 0, 1, 2, . . . .
Example 3.2 Consider the following probability distribution for the household
size, X.
Number of people in household (x)    P(X = x)
1                                    0.3002
2                                    0.3417
3                                    0.1551
4                                    0.1336
5                                    0.0494
6                                    0.0145
7                                    0.0034
8                                    0.0021
Probability function
The probability function (pf) of a discrete random variable X, denoted by p(x),
is a real-valued function such that for any number x the function is:
p(x) = P (X = x).
We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.
Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).
Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
especially when it is necessary to indicate clearly to which random variable the
function corresponds.
A pf must satisfy p(xi) ≥ 0 for all xi ∈ S, and Σ_{xi ∈ S} p(xi) = 1.

The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any value x that is not one of the possible values of X.
For the household size distribution, the pf takes the values p(1) = 0.3002, p(2) = 0.3417, p(3) = 0.1551, p(4) = 0.1336, p(5) = 0.0494, p(6) = 0.0145, p(7) = 0.0034 and p(8) = 0.0021, with p(x) = 0 otherwise. Note that Σ_{x=1}^{8} p(x) = 1.
For the next example we need the formula for the sum of a geometric series: for any real number r ≠ 1:

Σ_{x=0}^{n−1} a r^x = a(1 − r^n) / (1 − r)

and, when |r| < 1:

Σ_{x=0}^{∞} a r^x = a / (1 − r).
Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that:
the probability of a successful throw is π at each throw, and therefore the probability of an unsuccessful throw is 1 − π.
This gives the pf:

p(x) = (1 − π)^x π   for x = 0, 1, 2, . . .

and p(x) = 0 otherwise. Using the sum of an infinite geometric series, the probabilities sum to one:

Σ_{x=0}^{∞} p(x) = Σ_{x=0}^{∞} (1 − π)^x π = π Σ_{x=0}^{∞} (1 − π)^x = π / (1 − (1 − π)) = π/π = 1.
Figure 3.2: Probability function for Example 3.4, with π = 0.7 (a fairly good free-throw shooter) and π = 0.3.
The cumulative distribution function (cdf) of a discrete random variable X is:

F(x) = P(X ≤ x) = Σ_{xi ∈ S, xi ≤ x} p(xi)

i.e. the sum of the probabilities of those possible values of X that are less than or equal to x.
Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:
Number of people
in household (x)    p(x)      F(x)
1                   0.3002    0.3002
2                   0.3417    0.6419
3                   0.1551    0.7970
4                   0.1336    0.9306
5                   0.0494    0.9800
6                   0.0145    0.9945
7                   0.0034    0.9979
8                   0.0021    1.0000
For the basketball example, where p(x) = (1 − π)^x π, the sum of a finite geometric series gives, for any y = 0, 1, 2, . . .:

Σ_{x=0}^{y} p(x) = Σ_{x=0}^{y} (1 − π)^x π = π × [1 − (1 − π)^{y+1}] / [1 − (1 − π)] = 1 − (1 − π)^{y+1}.

Therefore:

F(x) = 0 when x < 0,   and   F(x) = 1 − (1 − π)^{x+1} when x = 0, 1, 2, . . . .
The expected value (or mean) of X is denoted E(X), and defined as:
E(X) = Σ_{xi ∈ S} xi p(xi)

which is often written more simply as E(X) = Σ_x x p(x).
We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: Instead of E(X), the symbol μ (the lower-case Greek letter 'mu'), or μX, is often used.
Expected value versus sample mean

The mean (expected value) E(X) of a probability distribution is analogous to the sample mean (average) X̄ of a sample distribution.
This is easiest to see when the sample space is finite. Suppose the random variable X can have K different values x1, . . . , xK, and their frequencies in a sample are f1, . . . , fK, respectively. Then the sample mean of X is:

X̄ = (f1 x1 + ⋯ + fK xK) / (f1 + ⋯ + fK) = x1 p̂(x1) + ⋯ + xK p̂(xK) = Σ_{i=1}^{K} xi p̂(xi)

where p̂(xi) = fi / Σ_{i=1}^{K} fi denotes the sample proportion of the value xi. In contrast:

E(X) = Σ_{i=1}^{K} xi p(xi).

So X̄ uses the sample proportions p̂(xi), whereas E(X) uses the population probabilities p(xi).
For the household size distribution, E(X) = Σ_{xi ∈ S} xi p(xi) is computed as follows:

x    p(x)      x × p(x)
1    0.3002    0.3002
2    0.3417    0.6834
3    0.1551    0.4653
4    0.1336    0.5344
5    0.0494    0.2470
6    0.0145    0.0870
7    0.0034    0.0238
8    0.0021    0.0168
Sum            2.3579 = E(X)
For the basketball example, the expected number of unsuccessful throws before the first success is:

E(X) = Σ_{x=0}^{∞} x π(1 − π)^x
     = Σ_{x=1}^{∞} x π(1 − π)^x                        (starting from x = 1)
     = (1 − π) Σ_{x=1}^{∞} x π(1 − π)^{x−1}
     = (1 − π) Σ_{y=0}^{∞} (y + 1) π(1 − π)^y          (using y = x − 1)
     = (1 − π) [Σ_{y=0}^{∞} y π(1 − π)^y + Σ_{y=0}^{∞} π(1 − π)^y]
     = (1 − π) [E(X) + 1]
     = (1 − π) E(X) + (1 − π)

where the first sum inside the brackets is E(X) itself and the second is 1. From this we can solve:

E(X) = (1 − π) / (1 − (1 − π)) = (1 − π) / π.
So, before scoring a basket, a good free-throw shooter (with π = 0.7) misses on average about 0.43 shots, and a poor shooter (with π = 0.3) misses on average about 2.33 shots.
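The formula E(X) = (1 − π)/π can also be checked by simulation. A minimal sketch (Python; the simulation size is arbitrary):

    import random

    def failures_before_success(pi):
        # count unsuccessful throws before the first successful one
        n = 0
        while random.random() > pi:  # each throw succeeds with probability pi
            n += 1
        return n

    random.seed(1)
    for pi in (0.7, 0.3):
        sims = [failures_before_success(pi) for _ in range(100000)]
        print(pi, sum(sims) / len(sims), (1 - pi) / pi)  # simulated vs exact mean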
In general:

E[g(X)] ≠ g[E(X)]

when g(X) is a nonlinear function of X. For example, E(X²) ≠ [E(X)]² and E(1/X) ≠ 1/E(X).
Probability property

For any constants a and b, E(aX + b) = a E(X) + b.

Proof:

E(aX + b) = Σ_x (ax + b) p(x) = a Σ_x x p(x) + b Σ_x p(x) = a E(X) + b

where the last step follows from:

i. Σ_x x p(x) = E(X)

ii. Σ_x p(x) = 1.
Variance and standard deviation

The variance of X is defined as:

Var(X) = E[(X − E(X))²]

and the standard deviation of X is:

sd(X) = √Var(X).

Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation) of the distribution of X.

Alternative notation: The variance is often denoted σ² ('sigma-squared') and the standard deviation by σ ('sigma').
An alternative formula: The variance can also be calculated as:
Var(X) = E(X²) − [E(X)]².
This will be proved later.
For the household size distribution:

x    p(x)      x²    x² × p(x)
1    0.3002    1     0.300
2    0.3417    4     1.367
3    0.1551    9     1.396
4    0.1336    16    2.138
5    0.0494    25    1.235
6    0.0145    36    0.522
7    0.0034    49    0.167
8    0.0021    64    0.134
Sum                  7.259 = E(X²)

Therefore:

Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]² = 7.259 − (2.358)² = 1.699

and sd(X) = √Var(X) = √1.699 ≈ 1.30.
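The same arithmetic is easy to script. The following sketch (Python, illustrative) recomputes E(X), E(X²), Var(X) and sd(X) from the household pf:

    p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
         5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

    mean = sum(x * px for x, px in p.items())    # E(X) = 2.3579
    ex2 = sum(x**2 * px for x, px in p.items())  # E(X^2) = 7.259
    var = ex2 - mean**2                          # 1.699
    print(mean, ex2, var, var**0.5)              # sd(X) is about 1.30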
Probability property

For any constants a and b, Var(aX + b) = a² Var(X).

Proof:

Var(aX + b) = E[((aX + b) − E(aX + b))²]
            = E[(aX + b − a E(X) − b)²]
            = E[(aX − a E(X))²]
            = E[a² (X − E(X))²]
            = a² E[(X − E(X))²]
            = a² Var(X).
For the binomial distribution, with pf p(x) = C(n, x) π^x (1 − π)^{n−x} for x = 0, 1, . . . , n, we have Σ_{x=0}^{n} p(x) = 1. This is easiest to show by using the binomial theorem, which states that, for any integer n ≥ 0 and any real numbers y and z:

(y + z)^n = Σ_{x=0}^{n} C(n, x) y^x z^{n−x}.    (3.2)
This does not simplify into a simple formula, so we just calculate the values
from the definition, by summation.
At the values x = 0, 1, . . . , n, the value of the cdf is:
F(x) = P(X ≤ x) = Σ_{y=0}^{x} C(n, y) π^y (1 − π)^{n−y}.
Since X is a discrete random variable, F(x) is a step function. For E(X), we have:

E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} [n(n − 1)! / ((x − 1)! [(n − 1) − (x − 1)]!)] π^x (1 − π)^{n−x}
     = nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^{(n−1)−y}
     = nπ × 1
     = nπ

where y = x − 1, and the last summation is over all the values of the pf of another binomial distribution, this time with possible values 0, 1, . . . , n − 1 and probability parameter π.
3.7 Continuous random variables
Strictly speaking, having an uncountably infinite number of possible values does not necessarily
imply that it is a continuous random variable. For example, the Cantor distribution (not covered in
ST104b Statistics 2) is neither a discrete nor an absolutely continuous probability distribution, nor is
it a mixture of these. However, we will not consider this matter any further in this course.
Many definitions and results work in essentially the same way for both types. But there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.
Probability density function (pdf)
For a continuous random variable X, the probability function is replaced by the
probability density function (pdf), denoted as f (x) [or fX (x)].
Example 3.16 Consider the size of an insurance claim on a policy with a deductible, so that the smallest possible claim is some known amount k > 0. One model for such a quantity is a random variable X with the pdf:

f(x) = α k^α / x^{α+1}   when x ≥ k, and f(x) = 0 otherwise

where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. A graph of this pdf when α = 2.2 is shown in Figure 3.5.
Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f(x) ≠ P(X = x). In fact, for a continuous distribution:

P(X = x) = 0   for all x.    (3.3)

That is, the probability that X has any particular value exactly is always 0.
Because of (3.3), with a continuous distribution we do not need to be very careful about differences between '<' and '≤', and between '>' and '≥'. Therefore, the following probabilities are all equal:

P(a < X < b),   P(a ≤ X ≤ b),   P(a ≤ X < b)   and   P(a < X ≤ b).
Probabilities of intervals for continuous random variables

Integrals of the pdf give probabilities of intervals of values:

P(a < X ≤ b) = ∫_a^b f(x) dx

for any a < b. For the insurance example, P(1.5 < X ≤ 3) is the area ∫_{1.5}^{3} f(x) dx under the pdf between 1.5 and 3.
Properties of pdfs
The pdf f (x) of any continuous random variable must satisfy the following conditions:
1. f(x) ≥ 0 for all x.

2. ∫_{−∞}^{∞} f(x) dx = 1.
Example 3.18 Continuing with the insurance example, we check that the conditions hold for the pdf:

f(x) = 0 when x < k,   and   f(x) = α k^α / x^{α+1} when x ≥ k

where α > 0 and k > 0.

1. Clearly, f(x) ≥ 0 for all x, since α > 0, k > 0 and x^{α+1} ≥ k^{α+1} > 0.

2. We have:

∫_{−∞}^{∞} f(x) dx = ∫_k^∞ (α k^α / x^{α+1}) dx = α k^α ∫_k^∞ x^{−α−1} dx = α k^α × (k^{−α} / α) = 1.
The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.
The cdf is obtained as \( F(x) = \int_{-\infty}^{x} f(t)\, dt \) for all x. For the Pareto distribution, F(x) = 0 when x < k, while for x ≥ k:
\[ F(x) = \int_k^x \alpha k^\alpha\, t^{-\alpha-1}\, dt = (\alpha k^\alpha) \int_k^x t^{-\alpha-1}\, dt = -k^\alpha \big[ t^{-\alpha} \big]_k^x = -k^\alpha (x^{-\alpha} - k^{-\alpha}) = 1 - k^\alpha x^{-\alpha} = 1 - (k/x)^\alpha. \]
Therefore:
\[ F(x) = \begin{cases} 0 & \text{when } x < k \\ 1 - (k/x)^\alpha & \text{when } x \geq k. \end{cases} \tag{3.4} \]
If we were given (3.4), we could obtain the pdf by differentiation, since F′(x) = 0 when x < k, and:
\[ F'(x) = -k^\alpha (-\alpha)\, x^{-\alpha-1} = \frac{\alpha k^\alpha}{x^{\alpha+1}} \quad \text{when } x \geq k. \]
[Figure: cdf F(x) of the Pareto distribution with k = 1 and α = 2.2.]
Example 3.20 Continuing with the insurance example (with k = 1 and α = 2.2), then:
\[
\begin{aligned}
P(X \leq 1.5) &= F(1.5) = 1 - (1/1.5)^{2.2} \approx 0.59 \\
P(X \leq 3) &= F(3) = 1 - (1/3)^{2.2} \approx 0.91 \\
P(X > 3) &= 1 - F(3) \approx 1 - 0.91 = 0.09 \\
P(1.5 \leq X \leq 3) &= F(3) - F(1.5) \approx 0.91 - 0.59 = 0.32.
\end{aligned}
\]
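These cdf calculations are simple to reproduce. Here is a minimal sketch in Python (my illustration, not part of the guide), with k and α as in the example.

```python
# cdf of the Pareto distribution: F(x) = 1 - (k/x)**alpha for x >= k.
def pareto_cdf(x, k=1.0, alpha=2.2):
    return 0.0 if x < k else 1.0 - (k / x) ** alpha

print(pareto_cdf(1.5))                    # P(X <= 1.5) ~ 0.59
print(pareto_cdf(3.0))                    # P(X <= 3)   ~ 0.91
print(1 - pareto_cdf(3.0))                # P(X > 3)    ~ 0.09
print(pareto_cdf(3.0) - pareto_cdf(1.5))  # P(1.5 <= X <= 3) ~ 0.32
```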
Example 3.21 Consider now a continuous random variable with the following pdf:
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x > 0 \\ 0 & \text{for } x \leq 0 \end{cases} \tag{3.5} \]
where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter.
Since:
\[ \int_0^x \lambda e^{-\lambda t}\, dt = \big[ -e^{-\lambda t} \big]_0^x = 1 - e^{-\lambda x} \]
the cdf is:
\[ F(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 - e^{-\lambda x} & \text{for } x > 0. \end{cases} \]
2. Since we have just done the integration to derive the cdf F(x), we can also use it to show that f(x) integrates to one. This follows from:
\[ \int_{-\infty}^{\infty} f(x)\, dx = P(-\infty < X < \infty) = \lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) = 1 - 0 = 1. \]

For a continuous random variable X, expected values of functions of X are defined as integrals:
\[ E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx. \]
Example 3.22 For the Pareto distribution, introduced in Example 3.16, we have:
\[
\begin{aligned}
E(X) &= \int_{-\infty}^{\infty} x f(x)\, dx = \int_k^{\infty} x\, \frac{\alpha k^\alpha}{x^{\alpha+1}}\, dx \\
&= \int_k^{\infty} \frac{\alpha k^\alpha}{x^{\alpha}}\, dx \\
&= \frac{\alpha k}{\alpha - 1} \underbrace{\int_k^{\infty} \frac{(\alpha - 1)\, k^{\alpha-1}}{x^{(\alpha-1)+1}}\, dx}_{=1} \\
&= \frac{\alpha k}{\alpha - 1}.
\end{aligned}
\]
Here the last step follows because the last integrand has the form of the Pareto pdf with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if α − 1 > 0, i.e. if α > 1.
Similarly:
\[
\begin{aligned}
E(X^2) &= \int_{-\infty}^{\infty} x^2 f(x)\, dx = \int_k^{\infty} x^2\, \frac{\alpha k^\alpha}{x^{\alpha+1}}\, dx = \int_k^{\infty} \frac{\alpha k^\alpha}{x^{\alpha-1}}\, dx \\
&= \frac{\alpha k^2}{\alpha - 2} \underbrace{\int_k^{\infty} \frac{(\alpha - 2)\, k^{\alpha-2}}{x^{(\alpha-2)+1}}\, dx}_{=1} = \frac{\alpha k^2}{\alpha - 2} \quad (\text{if } \alpha > 2)
\end{aligned}
\]
and therefore:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 = \frac{\alpha k^2}{\alpha - 2} - \left( \frac{\alpha k}{\alpha - 1} \right)^2. \]
In the insurance example, with k = 1 and α = 2.2, we have:
\[ E(X) = \frac{2.2}{2.2 - 1} \approx 1.83 \quad \text{and} \quad \text{Var}(X) = \frac{2.2}{2.2 - 2} - \left( \frac{2.2}{2.2 - 1} \right)^2 \approx 7.6. \]
For the exponential distribution, we can obtain E(X) by choosing f = x and g′ = λe^{−λx} and using integration by parts:
\[
\begin{aligned}
E(X) &= \int_0^{\infty} x\, \lambda e^{-\lambda x}\, dx = \big[ -x e^{-\lambda x} \big]_0^{\infty} - (1/\lambda) \big[ e^{-\lambda x} \big]_0^{\infty} \\
&= [0 - 0] - (1/\lambda)[0 - 1] = 1/\lambda.
\end{aligned}
\]
To obtain E(X²), we choose f = x² and g′ = λe^{−λx}, and use integration by parts:
\[
\begin{aligned}
E(X^2) &= \int_0^{\infty} x^2\, \lambda e^{-\lambda x}\, dx = \big[ -x^2 e^{-\lambda x} \big]_0^{\infty} + 2 \int_0^{\infty} x\, e^{-\lambda x}\, dx \\
&= 0 + \frac{2}{\lambda} \int_0^{\infty} x\, \lambda e^{-\lambda x}\, dx = \frac{2}{\lambda^2}
\end{aligned}
\]
where the last step follows because the last integral is simply E(X) = 1/λ again. Finally:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}. \]
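If you have access to a computer algebra system, these integrals can be checked symbolically. A minimal sketch using Python's sympy library (my illustration, not part of the guide; any CAS would do):

```python
import sympy as sp

x, lam = sp.symbols('x lambda', positive=True)
pdf = lam * sp.exp(-lam * x)          # exponential pdf on (0, oo)

mean = sp.integrate(x * pdf, (x, 0, sp.oo))        # E(X) = 1/lambda
second = sp.integrate(x**2 * pdf, (x, 0, sp.oo))   # E(X^2) = 2/lambda^2
var = sp.simplify(second - mean**2)                # Var(X) = 1/lambda^2

print(mean, second, var)   # 1/lambda, 2/lambda**2, lambda**(-2)
```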
This is actually useful in some insurance applications, for example liability insurance and medical insurance. There most claims are relatively small, but there is a non-negligible probability of extremely large claims. The Pareto distribution with a small α can be a reasonable representation of such situations. Figure 3.8 shows plots of Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed that E(X) is infinite.

[Figure 3.8: cdfs of Pareto distributions with α = 2.2 and α = 0.8.]
For the Pareto distribution, the distribution is defined for all α > 0, but the mean is infinite if α ≤ 1 and the variance is infinite if α ≤ 2. This happens because for small values of α the distribution has very heavy tails, i.e. the probabilities of very large values of X are non-negligible.
The median m of a continuous random variable is the value which satisfies F(m) = 0.5.

For the Pareto distribution, the cdf is:
\[ F(x) = 1 - (k/x)^\alpha \quad \text{for } x \geq k \tag{3.6} \]
so the median satisfies (k/m)^α = 0.5, i.e.:
\[ k/m = 1/2^{1/\alpha} \quad \Leftrightarrow \quad m = k\, 2^{1/\alpha}. \]
For example, with k = 1:
α = 2.2 gives m = 2^{1/2.2} ≈ 1.37.
α = 0.8 gives m = 2^{1/0.8} ≈ 2.38.

For the exponential distribution, F(x) = 1 − e^{−λx} for x > 0, so the median satisfies e^{−λm} = 0.5, i.e.:
\[ \lambda m = \log 2 \quad \Leftrightarrow \quad m = \frac{\log 2}{\lambda}. \]

3.8 Overview of chapter
This chapter has formally introduced random variables, making a distinction between
discrete and continuous random variables. Properties of probability distributions were
discussed, including the determination of expected values and variances.
3.9 Key terms and concepts

Constant
Continuous
Cumulative distribution function
Discrete
Estimators
Expected value
Experiment
Interval
Median
Outcome
Parameter
Probability distribution
Random variable
Variance
3.10 Learning activities
1. Suppose that the random variable X takes the values {x₁, x₂, . . .}, where x₁ < x₂ < · · · . Prove the following results:
(a) \( \sum_{i=1}^{\infty} p(x_i) = 1. \)
(b) \( p(x_k) = F(x_k) - F(x_{k-1}). \)
(c) \( F(x_k) = \sum_{i=1}^{k} p(x_i). \)
2. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the event, one of the tickets is selected at random and the person with that number wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What is the probability that each of them wins the prize?
3. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old fruit
and which are new fruit. However, 20% of old oranges are mouldy inside, but only
10% of new oranges are mouldy. Suppose that you choose 5 oranges at random.
What is the distribution of the number of mouldy oranges in your sample?
4. What is the expectation of the random variable X if the only possible value it can
take is c?
5. Show that E(X − E(X)) = 0.
6. Show that if Var(X) = 0 then P(X = E(X)) = 1. (We say in this case that X is almost surely equal to its mean.)
7. For a random variable X and constants a and b, prove that:
\[ \text{Var}(aX + b) = a^2\, \text{Var}(X). \]
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk
3.11 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
define a random variable and distinguish it from the values that it takes
3.12 Sample examination questions
Chapter 4
Common distributions of random
variables
4.1
4.2
4.3
Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson,
exponential and normal
calculate probabilities of events for these distributions using the probability
function, probability density function or cumulative distribution function
determine probabilities using statistical tables, where appropriate
state properties of these distributions such as the expected value and variance.
4.4 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapters 4 and 5.
69
4.5 Introduction

In statistical inference we will treat observed data (the sample) as values of a random variable X, which has some probability distribution (population distribution).
How to choose that probability distribution?
Usually we do not try to invent distributions from scratch.
Instead, we use one of many existing standard distributions.
There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.
This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:
continuous versus discrete
among discrete: finite versus infinite number of possible values
among continuous: different sets of possible values (for example, all real numbers x, x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions.
The distributions discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it, etc.
In the statistical analysis of a random variable X we typically:
select a family of distributions based on the basic characteristics of X
use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.
Example 4.1 An opinion poll on a referendum, where each Xᵢ is an answer to the question 'Will you vote Yes or No to joining the European Union?', has answers
4.6 Common discrete distributions

4.6.1 Discrete uniform distribution
A random variable X with the discrete uniform distribution on {1, 2, . . . , k} has probability function p(x) = 1/k for x = 1, 2, . . . , k (and 0 otherwise). Its mean is:
\[ E(X) = \sum_{x=1}^{k} x\, p(x) = \frac{1 + 2 + \cdots + k}{k} = \frac{k+1}{2} \tag{4.1} \]
and:
\[ E(X^2) = \frac{1^2 + 2^2 + \cdots + k^2}{k} = \frac{(k+1)(2k+1)}{6}. \tag{4.2} \]
So:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 = \frac{k^2 - 1}{12}. \]
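A quick numerical check of (4.1) and (4.2): the following minimal Python sketch (my illustration, not part of the guide) compares the summation definitions with the closed-form formulas.

```python
k = 6  # e.g. the score of a fair die

xs = range(1, k + 1)
mean = sum(x / k for x in xs)          # E(X) by summation
second = sum(x**2 / k for x in xs)     # E(X^2) by summation
var = second - mean**2                 # Var(X) = E(X^2) - (E(X))^2

print(mean, var)                       # 3.5, 2.9166...
print((k + 1) / 2, (k**2 - 1) / 12)    # same values from the formulas
```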
4.6.2 Bernoulli distribution
A Bernoulli trial is an experiment with only two possible outcomes. We will number
these outcomes 1 and 0, and refer to them as success and failure, respectively.
Example 4.3 Examples of outcomes of Bernoulli trials are:
Agree / Disagree
Male / Female
Employed / Not employed
Owns a car / Does not own a car
Business goes bankrupt / Continues trading.
The Bernoulli distribution is the distribution of the outcome of a single Bernoulli
trial. This is the distribution of a random variable X with the following probability
function:
\[ p(x) = \begin{cases} \pi^x (1 - \pi)^{1-x} & \text{for } x = 0, 1 \\ 0 & \text{otherwise.} \end{cases} \]
Therefore P(X = 1) = π and P(X = 0) = 1 − P(X = 1) = 1 − π, and no other values are possible. Such a random variable X has a Bernoulli distribution with (probability) parameter π. This is often written as:
\[ X \sim \text{Bernoulli}(\pi). \]
If X ~ Bernoulli(π), then:
\[ E(X) = \sum_{x=0}^{1} x\, p(x) = 0 \times (1 - \pi) + 1 \times \pi = \pi \tag{4.3} \]
\[ E(X^2) = \sum_{x=0}^{1} x^2\, p(x) = 0^2 \times (1 - \pi) + 1^2 \times \pi = \pi \tag{4.4} \]
and therefore:
\[ \text{Var}(X) = E(X^2) - (E(X))^2 = \pi - \pi^2 = \pi(1 - \pi). \]

4.6.3 Binomial distribution
Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers. Bob is taking the test, but has no idea at all about the correct answers. So he guesses every answer and therefore has the probability of 1/4 of getting any individual question correct.
Let X denote the number of correct answers in Bob's test. X follows the binomial distribution with n = 4 and π = 0.25, i.e. we have:
\[ X \sim \text{Bin}(4, 0.25). \]
For example, what is the probability that Bob gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability π = 0.25 of being correct. The probability of any particular sequence of 3 correct and 1 incorrect answers, for example 1110, is π³(1 − π)¹, where 1 denotes a correct answer and 0 denotes an incorrect answer.
However, we do not care about the order of the 0s and 1s, only about the number of 1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these also has the probability π³(1 − π)¹.
The total number of sequences with three 1s (and therefore one 0) is the number of locations for the three 1s that can be selected in the sequence of 4 answers. This is \( \binom{4}{3} = 4 \). Therefore the probability of obtaining three 1s is:
\[ \binom{4}{3} \pi^3 (1 - \pi)^1 = 4 \times 0.25^3 \times 0.75^1 \approx 0.0469. \]
In general, the pf of X ~ Bin(n, π) is:
\[ p(x) = \begin{cases} \dbinom{n}{x} \pi^x (1 - \pi)^{n-x} & \text{for } x = 0, 1, \dots, n \\ 0 & \text{otherwise.} \end{cases} \tag{4.5} \]
We have already shown that (4.5) satisfies the conditions for being a probability function in the previous chapter (see Example 3.14).
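The pf (4.5) is straightforward to evaluate directly. A minimal Python sketch (my illustration, not part of the guide):

```python
from math import comb

def binomial_pf(x, n, pi):
    """p(x) for X ~ Bin(n, pi), following (4.5)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

print(binomial_pf(3, 4, 0.25))   # ~ 0.0469, Bob's chance of exactly 3 correct
print(sum(binomial_pf(x, 4, 0.25) for x in range(5)))  # pf sums to 1
```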
If X ~ Bin(n, π), then:
\[ E(X) = n\pi \quad \text{and} \quad \text{Var}(X) = n\pi(1 - \pi). \]
The expected value E(X) was derived in the previous chapter. The variance will be derived later.
Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4 possible answers. Consider again a student who guesses each one of the answers. Let X denote the number of correct answers by such a student, so that we have X ~ Bin(20, 0.25). For such a student, the expected number of correct answers is E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P(X ≥ x) < 0.05, i.e. such that P(X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 2 decimal places):

x     0     1     2     3     4     5     6     7     8     9     10
p(x)  0.00  0.02  0.07  0.13  0.19  0.20  0.17  0.11  0.06  0.03  0.01

x     11    12    13    14    15    16    17    18    19    20
p(x)  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Calculating the cumulative probabilities, we find that F(7) = P(X < 8) = 0.898 and F(8) = P(X < 9) = 0.959. Therefore P(X ≥ 8) = 0.102 > 0.05 and also P(X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
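This kind of search is easy to automate. A sketch in Python (my illustration, not from the guide):

```python
from math import comb

def binomial_cdf(x, n, pi):
    """P(X <= x) for X ~ Bin(n, pi)."""
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(x + 1))

n, pi = 20, 0.25
# Smallest pass mark x with P(X >= x) < 0.05, i.e. 1 - P(X <= x - 1) < 0.05.
pass_mark = next(x for x in range(n + 1)
                 if 1 - binomial_cdf(x - 1, n, pi) < 0.05)
print(pass_mark)  # 9
```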
More generally, consider a student who has the same probability π of the correct answer for every question, so that X ~ Bin(20, π). Figure 4.1 shows plots of the probabilities for π = 0.25, 0.5, 0.7 and 0.9.
[Figure 4.1: pfs of Bin(20, π) for π = 0.25 (E(X) = 5), π = 0.5 (E(X) = 10), π = 0.7 (E(X) = 14) and π = 0.9 (E(X) = 18), plotted against the number of correct answers.]

4.6.4 Poisson distribution

The possible values of the Poisson distribution are the non-negative integers 0, 1, 2, . . . . Its pf, with parameter λ > 0, is:
\[ p(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & \text{for } x = 0, 1, 2, \dots \\ 0 & \text{otherwise.} \end{cases} \tag{4.6} \]
Activity 4.1 Show that (4.6) satisfies the conditions to be a probability function.
Hint: You can use the following result from standard calculus: for any number a,
\[ e^a = \sum_{x=0}^{\infty} \frac{a^x}{x!}. \]

Activity 4.2 Prove that the mean and variance of a Poisson-distributed random variable are both equal to λ.
Poisson distributions are used for counts of occurrences of various kinds. To give a formal motivation, suppose that we consider the number of occurrences of some phenomenon in time, and that the process that generates the occurrences satisfies the following conditions:
1. The numbers of occurrences in any two disjoint intervals of time are independent of each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λt for some constant λ > 0.
In essence, these state that individual occurrences should be independent, sufficiently rare, and happen at a constant rate per unit of time. A process like this is a Poisson process.
If occurrences are generated by a Poisson process, then the number of occurrences in a randomly selected time interval of length t = 1, X, follows a Poisson distribution with mean λ, i.e. X ~ Poisson(λ).
The single parameter λ of the Poisson distribution is therefore the rate of occurrences per unit of time.
Example 4.7 Examples of variables for which we might use a Poisson distribution:
Number of telephone calls received at a call centre per minute.
Number of accidents on a stretch of motorway per week.
Number of customers arriving at a checkout per minute.
Number of misprints per page of newsprint.
Because λ is the rate per unit of time, its value also depends on the unit of time (that is, the length of interval) we consider.

Example 4.8 If X is the number of arrivals per hour and X ~ Poisson(1.5), then if Y is the number of arrivals per two hours, Y ~ Poisson(2 × 1.5) = Poisson(3).

λ is also the mean of the distribution, i.e. E(X) = λ. Both motivations suggest that distributions with higher values of λ have higher probabilities of large values of X.
Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for X ~ Poisson(2) and X ~ Poisson(4).
[Figure 4.2: pfs of Poisson(2) and Poisson(4) for x = 0, 1, . . . , 10.]
Example 4.10 Customers arrive at a bank at an average rate of 1.6 per minute. Let X denote the number of arrivals per minute and Y the number of arrivals per five minutes, so that:
\[ X \sim \text{Poisson}(1.6) \quad \text{and} \quad Y \sim \text{Poisson}(5 \times 1.6) = \text{Poisson}(8). \]
1. What is the probability that no customers arrive in a one-minute interval?
\[ P(X = 0) = \frac{e^{-1.6}(1.6)^0}{0!} = e^{-1.6} = 0.2019. \]
2. What is the probability that more than two customers arrive in a one-minute interval?
P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)] which is:
\[ 1 - p_X(0) - p_X(1) - p_X(2) = 1 - \frac{e^{-1.6}(1.6)^0}{0!} - \frac{e^{-1.6}(1.6)^1}{1!} - \frac{e^{-1.6}(1.6)^2}{2!} \approx 0.2166. \]
3. What is the probability that at most one customer arrives in a five-minute interval?
\[ P(Y \leq 1) = \frac{e^{-8}(8)^0}{0!} + \frac{e^{-8}(8)^1}{1!} = e^{-8} + 8e^{-8} = 9e^{-8} = 0.0030. \]
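These Poisson calculations can be cross-checked numerically. A minimal Python sketch (my illustration, not part of the guide):

```python
from math import exp, factorial

def poisson_pf(x, lam):
    """p(x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

print(poisson_pf(0, 1.6))                             # ~ 0.2019
print(1 - sum(poisson_pf(x, 1.6) for x in range(3)))  # P(X > 2) ~ 0.2166
print(sum(poisson_pf(x, 8) for x in range(2)))        # P(Y <= 1) ~ 0.0030
```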
A word on calculators
In the examination you will be allowed a basic calculator only. To calculate binomial
and Poisson probabilities directly requires access to a factorial key (for the binomial)
and an 'e^x' key (for the Poisson), which will not appear on a basic calculator. Note that any
probability calculations which are required in the examination will be possible on a
basic calculator. For example, for the Poisson probabilities in Example 4.10, it would be
acceptable to give your answers in terms of e (in the simplest form).
4.6.5 Connections between probability distributions
There are close connections between some probability distributions, even across
different families of them. Some connections are exact, i.e. one distribution is exactly
equal to another, for particular values of the parameters. For example, Bernoulli(π) is the same distribution as Bin(1, π).
Some connections are approximate (or asymptotic), i.e. one distribution is closely
approximated by another under some limiting conditions. We next discuss one of these,
the Poisson approximation of the binomial distribution.
4.6.6 Poisson approximation of the binomial distribution
Suppose that:
X ~ Bin(n, π)
n is large and π is small.
Under such circumstances, the distribution of X is well-approximated by a Poisson(λ) distribution with λ = nπ.
The connection is exact at the limit, i.e. Bin(n, π) → Poisson(λ) if n → ∞ and π → 0 in such a way that nπ = λ remains constant.
Activity 4.3 Suppose that X ~ Bin(n, π) and Y ~ Poisson(λ). Show that, if n → ∞ and π → 0 in such a way that nπ = λ remains constant, then, for any x, we have:
\[ P(X = x) \to P(Y = x) \quad \text{as } n \to \infty. \]
Hint 1: Because nπ = λ remains constant, substitute λ/n for π from the beginning.
Hint 2: One step of the proof uses the limit definition of the exponential function, which states that, for any number y, we have:
\[ \lim_{n \to \infty} \left( 1 + \frac{y}{n} \right)^n = e^y. \]
This law of small numbers provides another motivation for the Poisson distribution.
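The quality of the approximation is easy to inspect numerically. A sketch in Python (my illustration, not part of the guide), comparing Bin(n, π) with Poisson(nπ):

```python
from math import comb, exp, factorial

n, pi = 200, 0.01          # large n, small pi
lam = n * pi               # Poisson parameter: lambda = n * pi

for x in range(5):
    binom = comb(n, x) * pi**x * (1 - pi)**(n - x)
    poisson = exp(-lam) * lam**x / factorial(x)
    print(x, round(binom, 4), round(poisson, 4))
# The two columns agree to two or three decimal places.
```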
Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen Zahlen) helps to remember the key elements of the law of small numbers.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 Army Corps of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is X ~ Bin(n, π), where:
n is large: the number of men in a corps (perhaps 50,000)
π is small: the probability that any given soldier is killed by a horsekick in a year.
Men killed    0      1      2      3     4     More
Frequency     144    91     32     11    2     0
Percentage    51.4   32.5   11.4   3.9   0.7   0
The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson distribution. X ~ Poisson(0.7) is indeed a good fit to the data, as shown in Figure 4.4.
Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian army in each of the years 1875–94. Source: Bortkiewicz (1898) Das Gesetz der kleinen Zahlen, Leipzig: Teubner.
Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that, on average, about 1% of customers who have bought tickets fail to arrive for the flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial distribution, X ~ Bin(200, 0.01). We have:
\[ P(X \geq 2) = 1 - P(X = 0) - P(X = 1) = 1 - 0.1340 - 0.2707 = 0.5953. \]
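Since n = 200 is large and π = 0.01 small, this is also a natural case for the Poisson approximation with λ = nπ = 2. A Python sketch (my illustration, not from the guide) comparing the two:

```python
from math import exp

n, pi = 200, 0.01
exact = 1 - (1 - pi)**n - n * pi * (1 - pi)**(n - 1)  # binomial: 1 - P(0) - P(1)
approx = 1 - exp(-2) - 2 * exp(-2)                    # Poisson(2): 1 - p(0) - p(1)

print(round(exact, 4), round(approx, 4))  # 0.5953 vs ~ 0.5940
```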
[Figure 4.4: sample proportions of men killed, together with the fitted Poisson(0.7) probabilities.]
4.6.7 Some other discrete distributions
Just their names and short comments are given here, so that you have an idea of what
else there is.
Geometric(π) distribution:
Distribution of the number of failures in Bernoulli trials before the first success.
π is the probability of success at each trial.
Sample space is {0, 1, 2, . . .}.
See the basketball example in Chapter 3.

Negative binomial(r, π) distribution:
Distribution of the number of failures in Bernoulli trials before r successes occur.
π is the probability of success at each trial.
Sample space is {0, 1, 2, . . .}.
Negative binomial(1, π) is the same as Geometric(π).

Hypergeometric(n, A, B) distribution:
Experiment where initially A + B objects are available for selection, and A of them represent success.
n objects are selected at random, without replacement.
Hypergeometric is then the distribution of the number of successes.
4.7 Common continuous distributions

4.7.1 Uniform distribution

The (continuous) uniform distribution on an interval [a, b], where a < b, has pdf:
\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \leq x \leq b \\ 0 & \text{otherwise.} \end{cases} \]
The pdf is flat, as shown in Figure 4.5 (along with the cdf). Clearly, f(x) ≥ 0 for all x, and:
\[ \int_a^b f(x)\, dx = \int_a^b \frac{1}{b-a}\, dx = \frac{1}{b-a}\, [x]_a^b = \frac{1}{b-a}\, [b - a] = 1. \]
The cdf is:
\[ F(x) = \int_{-\infty}^{x} f(t)\, dt = \begin{cases} 0 & \text{for } x < a \\ (x-a)/(b-a) & \text{for } a \leq x \leq b \\ 1 & \text{for } x > b. \end{cases} \]
Activity 4.4 Derive the cdf for the continuous uniform distribution.
The probability of an interval [x₁, x₂], where a ≤ x₁ < x₂ ≤ b, is therefore:
\[ P(x_1 \leq X \leq x_2) = F(x_2) - F(x_1) = \frac{x_2 - x_1}{b - a}. \]
[Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).]
If X ~ Uniform[a, b], then:
\[ E(X) = \frac{b+a}{2} \quad (= \text{median of } X) \quad \text{and} \quad \text{Var}(X) = \frac{(b-a)^2}{12}. \]
The mean and median also follow from the fact that the distribution is symmetric about (b + a)/2, i.e. the midpoint of the interval [a, b].
Activity 4.5 Derive the mean and variance of the continuous uniform distribution.
4.7.2 Exponential distribution
The exponential distribution with parameter λ > 0 has pdf:
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x > 0 \\ 0 & \text{for } x \leq 0. \end{cases} \]
It was shown in the previous chapter that this satisfies the conditions for a pdf (see Example 3.21). The general shape of the pdf is that of exponential decay, as shown in Figure 4.6 (hence the name).

[Figure 4.6: pdf of the exponential distribution.]

The cdf is:
\[ F(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 - e^{-\lambda x} & \text{for } x > 0. \end{cases} \]
The mean and variance are E(X) = 1/λ and Var(X) = 1/λ² (as derived in the previous chapter), and the median is:
\[ m = \frac{\log 2}{\lambda} = (\log 2)\, \frac{1}{\lambda} = (\log 2)\, E(X) \approx 0.69\, E(X). \]
[Figure: cdf F(x) of the exponential distribution.]
Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
Uses of the exponential distribution

The exponential is, among other things, a basic distribution of waiting times of various kinds. This arises from a connection between the Poisson distribution (the simplest distribution for counts) and the exponential.
If the number of events per unit of time has a Poisson distribution with parameter λ, the time interval (measured in the same units of time) between two successive events has an exponential distribution with the same parameter λ.
Note that the expected values of these behave as we would expect:
E(X) = λ for Poisson(λ), i.e. a large λ means many events per unit of time, on average.
E(X) = 1/λ for Exponential(λ), i.e. a large λ means short waiting times between successive events, on average.
Example 4.13 Consider Example 4.10.
The number of customers arriving at a bank per minute has a Poisson distribution with parameter λ = 1.6.
Then the time X, in minutes, between the arrivals of two successive customers follows an exponential distribution with parameter λ = 1.6.
From this exponential distribution, the expected waiting time between arrivals of customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be (log 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the cumulative distribution function:
\[ F(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ 1 - e^{-1.6x} & \text{for } x > 0. \end{cases} \]
For example:
\[ P(X \leq 1) = F(1) = 1 - e^{-1.6 \times 1} = 1 - e^{-1.6} = 0.7981. \]
The probability is about 0.8 that two arrivals are at most a minute apart.
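A sketch of the same calculations in Python (my illustration, not part of the guide):

```python
from math import exp, log

lam = 1.6  # arrival rate per minute

def exp_cdf(x):
    """cdf of Exponential(lam): F(x) = 1 - exp(-lam * x) for x > 0."""
    return 1 - exp(-lam * x) if x > 0 else 0.0

print(1 / lam)        # mean waiting time: 0.625 minutes
print(log(2) / lam)   # median waiting time: ~ 0.433 minutes
print(exp_cdf(1.0))   # P(X <= 1) ~ 0.7981
```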
4.7.3 Some other continuous distributions

These are generalisations of the uniform and exponential distributions. Only their names and short comments are given here, just so that you know they exist.

Beta(α, β) distribution, shown in Figure 4.8.
Generalising the uniform, these are distributions for a closed interval, which is taken to be [0, 1].
Sample space is therefore {x | 0 ≤ x ≤ 1}.
Unlike for the uniform distribution, the pdf is generally not flat.
Beta(1, 1) is the same as Uniform[0, 1].

Gamma(α, β) distribution, shown in Figure 4.9.
Generalising the exponential distribution, this is a two-parameter family of skewed distributions for positive values.
Sample space is {x | x > 0}.
Gamma(1, β) is the same as Exponential(β).
[Figure 4.8: Beta(α, β) pdfs for (α, β) = (0.5, 1), (1, 2), (1, 1), (0.5, 0.5), (2, 2) and (4, 2).]

4.7.4 Normal distribution
The normal (or Gaussian) distribution has a crucial role in statistical inference. This will be discussed later in the course.
Normal distribution pdf

The pdf of the normal distribution is:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]
for −∞ < x < ∞, where μ and σ² > 0 are the parameters of the distribution.
If X ~ N(μ, σ²), then:
\[ E(X) = \mu \quad \text{and} \quad \text{Var}(X) = \sigma^2 \]
and the standard deviation is therefore sd(X) = σ.
The mean can also be inferred from the observation that the normal pdf is symmetric about μ. This also implies that the median of the normal distribution is μ.
The normal density is the so-called bell curve. The two parameters affect it as follows: μ determines the location (centre) of the curve, and σ² determines its dispersion (spread).
[Figure 4.9: Gamma(α, β) pdfs for (α, β) = (0.5, 1), (1, 0.5), (2, 1) and (2, 0.25).]

A linear transformation of a normal random variable is itself normally distributed: if X ~ N(μ, σ²) and Y = aX + b, then:
\[ Y \sim N(a\mu + b,\; a^2\sigma^2). \tag{4.7} \]
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
[Figure: pdfs of N(0, 1), N(0, 9) and N(5, 1).]
A special case of (4.7) is standardisation: if X ~ N(μ, σ²), then:
\[ Z = \frac{1}{\sigma} X - \frac{\mu}{\sigma} = \frac{X - \mu}{\sigma} \sim N\left( \frac{\mu - \mu}{\sigma},\; \frac{\sigma^2}{\sigma^2} \right) = N(0, 1). \]
The cdf of a normal distribution is:
\[ F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(t - \mu)^2}{2\sigma^2} \right) dt. \]
In the special case of the standard normal distribution, the cdf is:
\[ F(x) = \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) dt. \]
The integral defining Φ(x) has no closed-form expression, so probabilities are obtained from statistical tables. Note two limitations of Table 4 of the New Cambridge Statistical Tables:
1. It is only for N(0, 1), not for N(μ, σ²) for any other μ and σ².
2. Even for N(0, 1), it only shows probabilities for x ≥ 0.
The key to using the tables is that the standard normal distribution is symmetric about 0. This means that for an interval in one tail, its mirror image in the other tail has the same probability. Another way to justify these results is that if Z ~ N(0, 1), then −Z ~ N(0, 1) also. See ST104a Statistics 1 for a discussion of how to use Table 4 of the New Cambridge Statistical Tables.
Probabilities for any normal distribution

If X ~ N(μ, σ²), probabilities of intervals can be obtained by standardising:
\[ P(a < X \leq b) = P\left( \frac{a - \mu}{\sigma} < \frac{X - \mu}{\sigma} \leq \frac{b - \mu}{\sigma} \right) = P\left( \frac{a - \mu}{\sigma} < Z \leq \frac{b - \mu}{\sigma} \right) \]
which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note that this also covers the cases of the one-sided inequalities P(X ≤ b), with a = −∞, and P(X > a), with b = ∞.)
Example 4.15 Let X denote the diastolic blood pressure of a randomly selected person in England. This is approximately distributed as X ~ N(74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:
X > 90 (high blood pressure)
X < 60 (low blood pressure)
60 ≤ X ≤ 90 (normal blood pressure).
These are calculated using standardisation with μ = 74.2 and σ² = 127.87, and therefore σ = 11.31. So here:
\[ \frac{X - 74.2}{11.31} = Z \sim N(0, 1) \]
and we can refer values of this standardised variable to Table 4 of the New Cambridge Statistical Tables.
First:
\[ P(X > 90) = P\left( \frac{X - 74.2}{11.31} > \frac{90 - 74.2}{11.31} \right) = P(Z > 1.40) = 1 - \Phi(1.40) = 1 - 0.9192 = 0.0808. \]
Similarly:
\[ P(X < 60) = P\left( \frac{X - 74.2}{11.31} < \frac{60 - 74.2}{11.31} \right) = P(Z < -1.26) = P(Z > 1.26) = 1 - \Phi(1.26) = 1 - 0.8962 = 0.1038. \]
Finally:
\[ P(60 \leq X \leq 90) = P(X \leq 90) - P(X < 60) = 0.9192 - 0.1038 = 0.8154. \]
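Φ(x) can also be evaluated without tables using the error function. A Python sketch (my illustration, not from the guide); the values differ very slightly from the table-based ones because the z-values are not rounded to two decimal places.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 74.2, sqrt(127.87)

high = 1 - phi((90 - mu) / sigma)   # P(X > 90) ~ 0.081
low = phi((60 - mu) / sigma)        # P(X < 60) ~ 0.105
print(high, low, 1 - high - low)    # mid ~ 0.814
```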
[Figure: pdf of N(74.2, 127.87) divided into the three regions: low (probability 0.10), mid (0.82) and high (0.08).]
[Figure 4.12: Some probabilities around the mean for the normal distribution, e.g. P(μ − σ < X < μ + σ) ≈ 0.683, with the points μ ± 1.96σ marked.]
4.7.5 Normal approximation of the binomial distribution
For 0 < π < 1, the binomial distribution Bin(n, π) tends to the normal distribution N(nπ, nπ(1 − π)) as n → ∞.
Less formally: the binomial is well-approximated by the normal when the number of trials n is reasonably large.
For a given n, the approximation is best when π is not very close to 0 or 1. One rule-of-thumb is that the approximation is good enough when nπ > 5 and n(1 − π) > 5. Illustrations of the approximation are shown in Figure 4.13 for different values of n and π. Each plot shows values of the pf of Bin(n, π), and the pdf of the normal approximation, N(nπ, nπ(1 − π)).
When the normal approximation is appropriate, we can calculate probabilities for X ~ Bin(n, π) using Y ~ N(nπ, nπ(1 − π)) and Table 4 of the New Cambridge Statistical Tables.
[Figure 4.13: pfs of Bin(n, π) with the pdf of the normal approximation N(nπ, nπ(1 − π)) superimposed, for (n, π) = (10, 0.5), (25, 0.5), (25, 0.25), (10, 0.9), (25, 0.9) and (50, 0.9).]
Unfortunately, there is one small caveat. The binomial distribution is discrete, but the normal distribution is continuous. To see why this is problematic, consider the following. Suppose X ~ Bin(40, 0.4). Since X is discrete, taking values x = 0, 1, . . . , 40, we have:
\[ P(X \leq 4) = P(X \leq 4.5) = P(X < 5) \]
since P(4 < X ≤ 4.5) = 0 and P(4.5 < X < 5) = 0 due to the gaps in the probability mass for this distribution. In contrast, if Y ~ N(16, 9.6), then:
\[ P(Y \leq 4) < P(Y \leq 4.5) < P(Y < 5) \]
since P(4 < Y < 4.5) > 0 and P(4.5 < Y < 5) > 0 because this is a continuous distribution.
Continuity correction

This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n, by the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which x values are included in the required probability. Suppose we are approximating X ~ Bin(n, π) with Y ~ N(nπ, nπ(1 − π)). Then:
\[ P(X < 4) = P(X \leq 3) \approx P(Y < 3.5) \quad (\text{since 4 is excluded}) \]
\[ P(X \leq 4) = P(X < 5) \approx P(Y < 4.5) \quad (\text{since 4 is included}) \]
\[ P(1 \leq X < 6) = P(1 \leq X \leq 5) \approx P(0.5 < Y < 5.5) \quad (\text{since 1 to 5 are included}). \]
Example 4.16 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ~ Bin(1000, 0.361).
1. What is the probability that X ≥ 400?
Using the normal approximation, noting n = 1000 and π = 0.361, with Y ~ N(1000 × 0.361, 1000 × 0.361 × 0.639) = N(361, 230.68), we get:
\[ P(X \geq 400) \approx P(Y \geq 399.5) = P\left( \frac{Y - 361}{\sqrt{230.68}} \geq \frac{399.5 - 361}{\sqrt{230.68}} \right) = P(Z \geq 2.53) = 1 - \Phi(2.53) = 0.0057. \]
The exact probability from the binomial distribution is P(X ≥ 400) = 0.0059. Without the continuity correction, the normal approximation would give 0.0051.
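The approximation with and without the continuity correction is easy to reproduce. A Python sketch (my illustration, not from the guide):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, pi = 1000, 0.361
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))   # 361, sqrt(230.68)

# P(X >= 400) with continuity correction: P(Y >= 399.5).
print(1 - phi((399.5 - mu) / sigma))   # ~ 0.0057
# Without the correction:
print(1 - phi((400 - mu) / sigma))     # ~ 0.0051
```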
2. What is the largest number x for which P(X ≤ x) < 0.01?
We need the largest x which satisfies:
\[ P(X \leq x) \approx P(Y \leq x + 0.5) = P\left( Z \leq \frac{x + 0.5 - 361}{\sqrt{230.68}} \right) < 0.01. \]
From Table 4, this requires:
\[ \frac{x + 0.5 - 361}{\sqrt{230.68}} \leq -2.33 \]
which gives x ≤ 325.1. The largest integer value which satisfies this is x = 325. Therefore P(X ≤ x) < 0.01 for all x ≤ 325.
The sum of the exact binomial probabilities from 0 to x is 0.0093 for x = 325, and 0.011 for x = 326. The normal approximation gives exactly the correct answer in this instance.
3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P(X ≤ 300) < 0.01, if π = 0.361. In other words, if the Conservatives' support remains 36.1%, we would be very unlikely to get a random sample where only 300 (or fewer) respondents would say they would vote for the Conservative Party.
Now X = 300 is actually observed. We can then conclude one of two things (if we exclude other possibilities, such as a biased sample or lying by the respondents):
(a) The Conservatives' true level of support is still 36.1% (or even higher), but by chance we ended up with an unusual sample with only 300 of their supporters.
(b) The Conservatives' true level of support is currently less than 36.1% (in which case getting 300 in the sample would be more probable).
Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
4.8 Overview of chapter
This chapter has introduced some common discrete and continuous probability
distributions. Their properties, uses and applications have been discussed. The
relationships between some of these distributions have also been discussed.
4.9 Key terms and concepts

Bernoulli distribution
Binomial distribution
Central limit theorem
Continuity correction
Exponential distribution
Normal distribution
Parameter
Poisson distribution
Population distribution
Standardised variable
Uniform distribution
z-score

4.10 Learning activities
4.11 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson,
exponential and normal
calculate probabilities of events for these distributions using the probability
function, probability density function or cumulative distribution function
4.12 Sample examination questions

\[ \sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!\,(n-x)!}\, \pi^{x-1} (1 - \pi)^{n-x}. \]
4. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.
Other motor vehicles (vans, motorcycles etc.) pass the same point independently at
the rate of 75 per hour. Assume a Poisson distribution for these vehicles too.
(d) What is the probability of one car and one other motor vehicle in a two-minute
period?
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk