Statistikskript VWL Final E v2 Slides


Statistics for

Economics

1
Contents
Statistics for Economics
Part I: Probability
1. Probability theory: the building blocks Slide 7
1.1. Events and the sample space Slide 8
1.2. Relations of set theory Slide 19
1.3. The concept of probability Slide 27
1.4. Axiomatic definition of probability Slide 41
1.5. Basic theorems Slide 44
1.6. Probability spaces Slide 50
1.7. Conditional probability and stochastic independence Slide 64
1.8. Law of total probability Slide 77
1.9. Bayes’ theorem Slide 85
2. Combinatorial methods Slide 93
2.1. Factorials and binomial coefficients Slide 94
2.2. Multiplication rule Slide 97
2.3. Permutations Slide 99
2.4. Combinations Slide 101
2.5. Sampling with replacement Slide 103

2
3. Random variables Slide 105
3.1. The (cumulative) distribution function Slide 117
3.2. Discrete random variables Slide 125
3.3. Continuous random variables Slide 129
3.4. The expectation of a random variable Slide 138
3.5. Variance Slide 156
3.6. Standardization Slide 169
4. Special distributions Slide 173
4.1. The uniform discrete distribution Slide 176
4.2. The Bernoulli distribution (discrete) Slide 183
4.3. The Binomial distribution (discrete) Slide 187
4.4. The Poisson distribution (discrete) Slide 195
4.5. The uniform continuous distribution Slide 201
4.6. The exponential distribution (continuous) Slide 205
4.7. The normal distribution (continuous) Slide 209
5. Multivariate random variables Slide 218
5.1. Joint distribution and marginal distributions Slide 223
5.2. Conditional distributions and stochastic independence Slide 240
5.3. Covariance and correlation Slide 248
5.4. Sums and sample means of random variables Slide 254
6. The Central Limit Theorem Slide 261
3
Part II: Statistics
7. Descriptive statistics Slide 278
7.1. Frequency tables, histograms, and empirical distributions Slide 281
7.2. Summarizing data using numerical techniques Slide 289
7.3. Boxplot Slide 302
7.4. Quantile-Quantile-plot Slide 305
7.5. Scatter diagram Slide 310
8. Estimation of unknown parameters Slide 313
8.1. Intuitive examples of estimators Slide 318
8.2. Properties of estimators Slide 328
8.3. Main methods to get estimators Slide 346
9. Confidence intervals Slide 361
9.1. The idea Slide 362
9.2. Example of a confidence interval
(mean of a distribution, large samples) Slide 366
9.3. Relation with testing hypotheses Slide 370

Part III: Exercises Slide 375

4
Part I: Probability

5
Reading: Chapter 1.3/4, DeGroot and Schervish
1. Probability theory:
building blocks

7
1.1. Events and the sample space

A (random) experiment is any process, real or
hypothetical, in which the possible outcomes can be
identified ahead of time:

• it is performed under clear rules;
• it can be repeated as often as necessary under
  the same conditions; and
• the outcome is unknown and cannot be predicted.

8
Definition

Every experiment has a number of possible


single outcomes (elementary events).

9
Definition

The collection of all possible outcomes of an


experiment is called the

sample space

of the experiment.

10
Example 1.1.1:
• When a six-sided die is rolled:
  S = {1, 2, 3, 4, 5, 6}
• If we flip a coin twice:
  S = {HH, HT, TH, TT}
• If we flip a coin until we get a head:
  S = {H, TH, TTH, TTTH, …}
(countably many elements)

11
• For the lifespan of a light bulb we have:
  S = {t : t ≥ 0} = [0, ∞) = ℝ⁺
  (uncountably many elements)
  → a continuum of outcomes.
• For modeling the behavior of a share price, a
  possible choice is
  S = {all functions: ℝ⁺ → ℝ⁺}.

12
Remark:

An experiment might be described by


several sample spaces.

13
Example 1.1.2:
Two coins are tossed:
→ if we are interested in the outcomes heads or tails
of the two coins:
S1 = {(H, H), (H, T), (T, H), (T, T)}
→ if, instead, we count the number of heads/tails:
S2 = {( 2, 0 ) , (1, 1), ( 0, 2)}
→ finally, if we only want to see whether they show the
same (s) or a different (d) result:
S3 = {{s}, {d}}.
14
Definition

An event A is a well-defined, arbitrary set of possible
outcomes of the experiment. It is a subset of the
sample space S.

We say that an event A occurred if the outcome of
the experiment is an element contained in A.
15
Definition

All events of an experiment with sample space S
form the set of events E(S)
(→ the set of all subsets of S).

16
We assume that two specific events must always be
contained in E:

1. The sample space, which as an event we call the


sure event: S є E;

2. The empty set, which as an event we call the


impossible event: Ø є E.

17
Reading: Chapter 1.4, DeGroot and Schervish
18
1.2. Relations of set theory

Complement: Ā
is the set that contains all elements of S
that do not belong to A.

19
Definition

The complement Ā = S\A occurs when A does not occur.

(Venn diagram: Ā is the part of S outside A.)

20
Definition
Union: A∪B (‘‘A or B’’) is defined to be the set
containing all outcomes that belong to A alone, to B
alone, or both A and B.

21
Definition
Intersection: A ∩ B (‘‘A and B’’) is defined to be the
set that contains all outcomes that belong both to A
and to B.

22
Definition

Two events A and D are called disjoint or


mutually exclusive if A and D have no outcomes in
common: A ∩ D = Ø.

Disjoint events 23
Definition

Difference: A\B (‘‘A without B’’) occurs if A, but not B,
occurs.

(Venn diagram: A\B is the part of A outside B.)

24
Definition

Containment: It is said that a set C is contained in a


set A if every element of C also belongs to the set A.

(Venn diagram: C inside A, inside S.)

C is contained in A  ⇔  C ⊂ A
25
Reading: Chapter 1.2, DeGroot and Schervish
26
1.3. The concept of probability

P: E → ℝ
A ↦ P(A)

is a real-valued function:
to each event in E is assigned exactly one
element of ℝ.
27
1) The subjective interpretation of probability

• is often used for one-time events;

• is the probability that a person assigns to a
  possible outcome of some process, representing
  her own judgement of the likelihood that the
  outcome will be obtained.

“P(A) = the degree of belief that someone holds
about the likelihood of A occurring’’

28
2) The frequency interpretation of probability

• the probability that some specific outcome of a
  process will be obtained is interpreted to mean the
  relative frequency with which that outcome would be
  obtained if the process were repeated a large number
  of times under similar conditions.

29
Definition

The limit

    P(A) = lim (n→∞) h_n(A)

denotes the frequentist probability of A, where

    h_n(A) = "relative frequency that A occurs".

30
Example 1.3.1:
A die is rolled 3,000 times in succession. A running
tally is kept of the number of times we get a ‘‘6’’.
What do we expect for P(‘‘6’’)?

P(‘‘6’’) = 1/6 ≅ 0.1666   → third concept:
“classical definition of probability”

31
Illustration of Example 1.3.1:

(Figure: running relative frequency of ‘‘6’’ over the
3,000 rolls, settling near 1/6.)
32
3) The classical interpretation of probability

• attributed to Laplace, 1812;
• nevertheless, Bernoulli had already discussed the
  same concept more than 100 years earlier.

33
The classical interpretation of probability is based
on the concept of equally likely outcomes.

If the outcome of some process must be one of n
different outcomes, and if these n outcomes are
equally likely to occur, then the probability that an
event A occurs is given by the ratio between the
number of outcomes in A and the total number of
outcomes n.

34
P(A) = (# outcomes belonging to A) / (# possible outcomes) = |A| / n
35
Definition

An experiment with finitely many equally likely
elementary events is called a

Laplace experiment.

36
Example 1.3.2:
We toss a die and a coin simultaneously.
We want to compute the probability of the event
A = "heads and number larger than 4"

Definition of the sample space:


Possible outcomes: ( H, ≤ 4 ) , ( H, > 4 ) , (T , ≤ 4 ) , (T , > 4 )
Are they equally likely? No!

37
→ Alternative: elementary events
  (H,1), (T,1), (H,2), (T,2), …, (H,6), (T,6)
  are equally likely!

We can use the Laplace theory:

⇒ A = {(H,5), (H,6)}  and  P(A) = 2/12 = 1/6

38
Example 1.3.3:
When flipping a coin twice:

S = {HH, HT, TH, TT}.

P["at least one head"] = 3/4

39
Reading: Chapter 1.5, DeGroot and Schervish
40
1.4. Axiomatic definition of probability
Definition
Each function P

    P: E → ℝ
    A ↦ P(A)

that assigns to each event A in E a real number is
called a probability function (or measure).
P(A) is called the probability of the event A when
the following axioms hold (Kolmogorov, 1933):

41
Definition (continued)

Axiom 1: P(A) ≥ 0, ∀ A є E

Axiom 2: P(S) = 1

Axiom 3: P(A ∪ B) = P(A) + P(B),
         when A ∩ B = Ø
         (addition rule for disjoint events)

42
Reading: Chapter 1.5, DeGroot and Schervish
1.5. Basic theorems

Theorem 1
The probability of the complement of an event A is
given by
P(Ā) = 1 - P(A), for each event A є E

Theorem 2
The probability of the impossible event is given by:
P(Ø) = 0

44
Theorem 3
For every finite sequence of n pairwise disjoint events
A_1, A_2, ..., A_n є E the probability of the union of the
events equals the sum of the individual probabilities.
That is

    P(A_1 ∪ A_2 ∪ … ∪ A_n) = Σ_{i=1}^{n} P(A_i).

Theorem 4
For an event resulting from a difference A\B we have
that P(A\B) = P(A) – P(A∩B).
45
Theorem 5 (addition rule)
For every two events A and B in E we have that
P(A∪B) = P(A) + P(B) – P(A∩B).

Theorem 6 (monotonicity property)
If an event A is contained in an event B, then its
probability will never be larger than that of B, that is

A ⊂ B  ⇒  P(A) ≤ P(B)

46
Example 1.5.1:

What is the probability that an arbitrarily chosen
number with three digits will have at least two of the
same digit? (Use the Laplace theory.)

Let us define the sample space as

S = {000, …, 999},   |S| = # outcomes in S = 10³ = 1000

47
We consider the event

A = ''number with at least two of the same digits'',

then (Th. 1):

P(A) = 1 − P(Ā) = 1 − |Ā| / |S|,   and

|Ā| = # three-digit numbers with all different digits
    = 10 · 9 · 8 = 720

⇒ P(A) = 1 − 720/1000 = 0.28
48
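A quick cross-check of this result (an added illustration, not part of the original slides) is to enumerate S = {000, …, 999} directly in Python; all names below are made up:

    hits = 0
    for n in range(1000):
        digits = f"{n:03d}"          # write n with leading zeros, as in S = {000, ..., 999}
        if len(set(digits)) < 3:     # at least two of the three digits coincide
            hits += 1
    print(hits / 1000)               # 0.28, matching 1 - 720/1000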
Reading: Chapter 1.6, DeGroot and Schervish
1.6. Probability spaces
Discrete probability spaces
Definition (discrete sample space)

A sample space S with a finite number or a


countable number of outcomes is called discrete.

50
Let us now consider a probability function P(·) that
satisfies the axioms.

Then we assign to each elementary event e_i a
probability p_i = P(e_i). The number p_i indicates the
probability that the outcome will be exactly e_i.

That is:
e_1    e_2    e_3    .....    e_i    .....
p_1    p_2    p_3    .....    p_i    .....

In order to satisfy the axioms of probability, the
numbers p_i must satisfy the following conditions:

51
(1) p_i ≥ 0 for each i = 1, 2, ...

(2) Σ_{all i} p_i = 1, because e_i ∩ e_j = Ø ∀ i ≠ j
    (e_i pairwise disjoint) and ∪_{all i} e_i = S, so that
    (Th. 3)  Σ_{all i} p_i = P(∪_{all i} e_i) = P(S) = 1

(3) P(A) = Σ_{e_i є A} p_i,

that is, the probability of each event A є E is computed
as the sum of the probabilities of all outcomes e_i
contained in A.
52
Example 1.6.1:
1. Probability space with m equally likely outcomes
(Laplace-experiment):

e_1, …, e_m;   p_i = 1/m   ∀ i = 1, …, m

⇒ Σ_{i=1}^{m} p_i = m · (1/m) = 1

53
2. Probability space with infinitely, countably many
outcomes:
Experiment: “Flip a coin until we get a head’’.

S = {H, TH, TTH, TTTH, TTTTH, …}
     e_1  e_2   e_3    e_4     e_5   ......

p_1 = P(H) = 1/2,  p_2 = P(TH) = 1/4,  p_3 = 1/8, …

p_i = P(TT…TH) = 1/2^i   (i−1 tails followed by a head)

⇒ Σ_{all i} p_i = Σ_{i=1}^{∞} 1/2^i = Σ_{i=1}^{∞} (1/2)^i = 1,

a geometric series that converges to 1!

54
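As an added illustration (not from the slides), a few lines of Python show how the partial sums of this geometric series approach 1:

    s = 0.0
    for i in range(1, 21):
        s += (1 / 2) ** i            # add p_i = 1/2^i
        print(i, s)                  # partial sums 0.5, 0.75, 0.875, ... -> 1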
Proof using the properties of the geometric
series or through visualization:

(Figure: a unit square of area 1 divided into pieces of
area 1/2 = p_1, 1/4 = p_2, 1/8, 1/16, …; the pieces fill
the square, so the series sums to 1.)

55
General probability spaces

Definition (continuous sample space)

A sample space S with uncountably many
outcomes is called continuous.

56
Example 1.6.2:

Consider a piece of the line going from 0 to 1 (closed
interval):

    [0, 1]

In this interval we can identify infinitely, uncountably
many points with zero length.

57
Let us now conduct an experiment constructed as
follows: choose randomly an arbitrary real number
0 ≤ a ≤ 1 in the interval [0,1].
In any case we have that
P(a) = 0   ∀ a є S = [0, 1].

We define the events:

A = {a | a < 0.4}
B = {a | 0.6 < a < 0.9}
C = {a | a > 0.8}
58
Intuition:
P(A) = 0.4;  P(B) = 0.3  and  P(C) = 0.2.  Why?
→ S can be seen as a Laplace continuous sample space,
where all real numbers are equally likely to be chosen.

Thus for an event A:  P(A) = length(A) / length(S);

P[B ∩ C] = length(B ∩ C) / length(S) = 0.1 / 1 = 0.1

(Figure: the interval [0, 1] with the events A, B, C marked.)

Th.5:
P[B ∪ C] = P(B) + P(C) − P[B ∩ C] = 0.3 + 0.2 − 0.1 = 0.4

59
Remark:
The intervals (events) can be open or closed.
For example

B+ = {a | 0.6 ≤ a ≤ 0.9} → P ( B+ ) = 0.3 = P ( B ) ,

In both cases the probability is the same, because


even though B+ has two additional points (boundaries),
their probability is “of zero measure’’.

60
2. Define  S = {(a, b) | 0 < a < 3 and 0 < b < 2}
           D = {(a, b) | 0 < a < b < 2}
           K = {(a, b) | (a−2)² + (b−1)² < 1}
               (circle included in the rectangle)

(Figure: the rectangle S in the (a, b)-plane with the
circle K inside it.)
61
Using the classic definition of probability:

P(D) = Area of D / Area of S = 2/6 = 1/3

P(K) = Area of K / Area of S = π/6 ≅ 0.5236

62
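A Monte Carlo sketch (added for illustration; sample size and names are arbitrary) approximates P(K) by drawing points uniformly from the rectangle S:

    import random

    random.seed(1)
    N = 200_000
    inside = 0
    for _ in range(N):
        a = random.uniform(0, 3)                 # uniform point in the rectangle S
        b = random.uniform(0, 2)
        if (a - 2) ** 2 + (b - 1) ** 2 < 1:      # the point falls inside the circle K
            inside += 1
    print(inside / N)                            # close to pi/6 = 0.5236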
Reading: Chapter 2.1/2, DeGroot and Schervish
1.7. Conditional probability and stochastic
independence

64
Example 1.7.1:
Rolling a die:

What is the probability of getting a “6’’?

→ P(“6’’) = 1/6

And when we additionally know that the resulting
number is even?

P(“6’’ | “result is an even number’’) = 1/3

65
Definition (conditional probability)

Suppose that we learn that an event B has occurred


and we wish to compute the probability of another
event A, taking into account our knowledge that B
has occurred.
This probability is called the conditional probability of
A given that B has occurred, is denoted by P(A|B),
and can be computed as

    P(A|B) = P(A ∩ B) / P(B),   if P(B) > 0,

and is not defined if P(B) = 0.

66
Example 1.7.2:

Suppose that two dice were rolled.


What is the probability (Laplace experiment) that at
least one of the dice results in a ‘‘6’’ if we already
observe that the sum of the two numbers is larger
than 9?

67
(1,1) (1,2) ...... ...... ...... (1,6)
(2,1) (2,2) ...... ...... ...... (2,6)
(3,1) (3,2) ...... ...... ...... (3,6)
(4,1) (4,2) ...... ...... (4,5) (4,6)
(5,1) (5,2) ...... ...... (5,5) (5,6)
(6,1) (6,2) ...... (6,4) (6,5) (6,6)

A = "at least one 6":   P(A) = 11/36

B = "sum > 9":   P(B) = 6/36   and   P(A ∩ B) = 5/36

⇒ P(A|B) = P(A ∩ B) / P(B) = (5/36) / (6/36) = 5/6
68
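The same conditional probability can be checked by listing the 36 equally likely outcomes; the short Python sketch below is an added illustration:

    outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 equally likely pairs
    B = [o for o in outcomes if o[0] + o[1] > 9]                    # "sum > 9"
    A_and_B = [o for o in B if 6 in o]                              # "at least one 6" within B
    print(len(A_and_B) / len(B))                                    # 5/6 = 0.8333...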
Theorem 7 (multiplication rule)
Let A and B be events with positive probability. The
probability of the intersection of A and B is given by

P(A ∩ B) = P(A) · P(B | A)

or

P(B ∩ A) = P(B) · P(A | B).

69
Example 1.7.3:

An urn contains 4 colored balls, 3 of them being red and


1 blue.
What is the probability that, after randomly drawing 2
balls without replacement, 2 red balls are observed?

Let Ri = "i-th drawn ball is red" for i=1, 2. Then we have

P[R_1 ∩ R_2] = P[R_1] · P[R_2 | R_1] = (3/4) · (2/3) = 1/2.

70
Definition (stochastic independent events)

Two events A and B with positive probability are


(stochastically) independent if

P(A | B) = P(A).

Analogously, it also holds that

P(B | A) = P(B).

71
Theorem 8 (multiplication rule for independent events)

If two events A and B are independent, then it follows


that
P(A ∩ B) = P(A) · P(B).

72
Example 1.7.4:

A coin is flipped twice.


S= {HH, HT, TH, TT} , all equally likely (Laplace).
Consider the events: A="head in 1st toss" = {HH, HT}
B="head in 2nd toss" = {HH, TH}
Then:  P(A) = P(B) = 2/4 = 1/2   and

P(A ∩ B) = P({HH}) = 1/4 = P(A) · P(B).
Thus, A and B are independent (as expected).

73
Example 1.7.5:

See exercise 1.6.1 b) “discrete probability spaces’’


(24 tosses of 2 dice; consider the sum of the numbers
→ compute P["at least one result equal 12"] = ? ).

74
Remark:
Stochastic independence is not a transitive relation!
From "A and B independent" and "B and C
independent" does not necessarily follow
"A and C independent"!

Example: Rolling two dice.

Let A be the event "1,2 or 3 in first die", B "4, 5 or 6 in


second die", and C "4,5 or 6 in first die".

Clearly A and B as well as B and C are independent,


but in any case A and C will be dependent.
75
Reading: Chapter 2.1, DeGroot and Schervish
1.8. Law of total probability
Example 1.8.1: (identifying defective items)
A manufactured article can be produced using two
different machines.
Machine M1 produces twice as many articles as the
slower machine M2, but 10% of M1's articles are
defective while only 7% of M2's articles are defective.

What is the probability that a randomly chosen article


from the total production is defective?
77
Known proportions:

machine   production   defective items
M1          2/3           10%
M2          1/3            7%

P("defective article" | M1) = 0.1 ;
P("defective article" | M2) = 0.07
P("article produced by M1") = 2/3
P("article produced by M2") = 1/3

→ P(A = ‘‘defective article") = ?

A = "defective article",   M_i : "produced by M_i"

(Figure: S partitioned into M1 and M2, with the pieces
A ∩ M1 and A ∩ M2 of the event A.)
78
We have (Theorem 7, multiplication rule):

P(A ∩ M_i) = P(A | M_i) · P(M_i),   i = 1, 2

Following from Axiom 3:

P(A) = P(A ∩ M1) + P(A ∩ M2)
     = P(A | M1) · P(M1) + P(A | M2) · P(M2)
     = 0.1 · (2/3) + 0.07 · (1/3) = 0.09
79
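The computation is a two-term law-of-total-probability sum; the following Python sketch (illustrative, names invented) reproduces P(A) = 0.09:

    p_machine = {"M1": 2 / 3, "M2": 1 / 3}      # P(M_i)
    p_def = {"M1": 0.10, "M2": 0.07}            # P(A | M_i)
    p_A = sum(p_def[m] * p_machine[m] for m in p_machine)
    print(p_A)                                  # 0.09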
Definition (partition)

Let S denote the sample space of some experiment,


and consider n events H1, H2,..., Hn in S such that

Hi ∩ Hj = Ø for i ≠ j (pairwise disjoint)


and
H1 ∪ H2 ∪ ... ∪ Hn = S.

It is said that these events form a partition of S.

80
Theorem 9 (law of total probability)

Suppose that the events H1, H2, ... , Hn form a


partition of the sample space S, and that all have
positive probability. Then, for every event A є E,

    P(A) = Σ_{j=1}^{n} P(A | H_j) · P(H_j)

(Figure: S partitioned into H_1, H_2, …, H_n, with the
event A overlapping several of the H_j.)
81
Example 1.8.2:
In Example 1.7.3, an urn with 4 balls (3 red and 1
blue):

R_1 = "first ball drawn is red"
and
R̄_1 = "first ball drawn is blue"

form a partition of S.
We can therefore compute the probability of the event
R_2 = "second ball drawn is red" as

P(R_2) = P(R_2 | R_1) · P(R_1) + P(R_2 | R̄_1) · P(R̄_1)
       = (2/3) · (3/4) + (3/3) · (1/4) = 3/4
82
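The partition argument can also be verified by brute force over all ordered draws; the Python sketch below is an added illustration that treats the three red balls as distinguishable:

    from itertools import permutations

    balls = ["r", "r", "r", "b"]                 # 3 red balls, 1 blue ball
    draws = list(permutations(balls, 2))         # 12 equally likely ordered pairs
    p_r2 = sum(1 for d in draws if d[1] == "r") / len(draws)
    print(p_r2)                                  # 0.75 = 3/4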
Remark:
Computations like those appearing in the law of total
probability can often be visualized using a tree
diagram.

R_1 → R_2    red-red
R_1 → R̄_2    red-blue
R̄_1 → R_2    blue-red
R̄_1 → R̄_2    blue-blue (impossible)

83
Reading: Chapter 2.3, DeGroot and Schervish
1.9. Bayes’ theorem
Bayes’ theorem:

Let the events H1,...,Hn form a partition of the


space S such that P(Hi) > 0, for all i=1,…,n,
and let B be an event such that P(B) > 0.
Then, for each Hi

    P(H_i | B) = P(B | H_i) · P(H_i) / Σ_{j=1}^{n} P(B | H_j) · P(H_j)

85
Example 1.9.1: (identifying defective items)
In Example 1.8.1 the probability that an article randomly
chosen from the total production is produced by
machine M1 was (a-priori)
P("article produced by machine M1") = 2/3 ≅ 0.66.
If we now observe that the chosen article is defective,
we will surely increase that probability, given that
machine M1 leaves a larger portion of defective items
behind.

86
Example 1.9.1 (continued):
From Bayes’ theorem we get:

P("article from M1" | "article defective")
    = 0.1 · (2/3) / (0.1 · (2/3) + 0.07 · (1/3))
    = 20/27 ≅ 0.741

In Bayes’ language, H1,...., Hn are called alternative


hypotheses, P(Hi) is called the prior probability of the i-
th hypothesis and P(Hi|B) is called the posterior
probability of the i-th hypothesis after having observed
that B has occurred.
87
Example 1.9.2: (test for a disease)
The following numbers are known about a free
medical test for a certain disease:

• If a person has the disease, there is a probability
  of 90% that the test will give a positive response.

• If a person does not have the disease, there is a
  probability of 99% that the test will be negative.

• The chances of having the disease are only 0.1%.


88
Example 1.9.2 (continued):
Given that the test is free, fast and harmless, you
decide to take it. A few days later you learn that you
had a positive response to the test.

What is the probability that you have the disease?

Let:  H_1 = "you have the disease";
      B = "test is positive"

→ P(B | H_1) = 0.9 ;     P(B̄ | H_1) = 0.1 ;
  P(B̄ | H̄_1) = 0.99 ;    P(B | H̄_1) = 0.01
89
Example 1.9.2 (continued):

→ P(H_1) = 0.001  and  P(H̄_1) = 0.999      (prior probabilities)

Law of total probability:

⇒ P(B) = P(B | H_1) · P(H_1) + P(B | H̄_1) · P(H̄_1)
       = 0.9 · 0.001 + 0.01 · 0.999 = 0.01089,   and

Bayes’ theorem:

⇒ P(H_1 | B) = P(B | H_1) · P(H_1) / P(B)
             = 0.9 · 0.001 / 0.01089 ≅ 0.0826,

that is, the reliability of a positive response
is about 8.3%.
90
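The whole calculation fits in a few lines; this Python sketch (added for illustration, names invented) reproduces P(B) and the posterior P(H_1 | B):

    p_h1 = 0.001                       # prior P(H_1): having the disease
    p_pos_given_h1 = 0.90              # P(B | H_1)
    p_pos_given_not_h1 = 0.01          # P(B | complement of H_1)
    p_b = p_pos_given_h1 * p_h1 + p_pos_given_not_h1 * (1 - p_h1)   # law of total probability
    posterior = p_pos_given_h1 * p_h1 / p_b                         # Bayes' theorem
    print(p_b, posterior)              # 0.01089 and about 0.0826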
Tree diagram:
(Tree: first branch on the condition, H_1 vs. H̄_1; then
on the test result, B vs. B̄.)
91
Reading: Chapter 1.7/8/9, DeGroot and Schervish
2. Combinatorial methods

93
2.1. Factorials and binomial coefficients

Definition

The notation n! indicates the product of all integer


numbers between 1 and n, that is
n! = 1 · 2 · 3 · … · (n−1) · n
and is read n factorial.

Moreover, we assume that: 0! = 1

94
Definition

The binomial coefficient, denoted by the symbol
(n over k), is defined for integers n > 0 and k ≥ 0
with n ≥ k by

    (n over k) = n! / (k! · (n − k)!)

95
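In Python, factorials and binomial coefficients are available in the standard library; a minimal sketch added for illustration:

    import math

    print(math.factorial(5))      # 5! = 120
    print(math.comb(5, 2))        # "5 choose 2" = 5! / (2! * 3!) = 10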
2.2. Multiplication rule
Suppose that an experiment has k parts, that the i-th part
of the experiment can have ni possible outcomes, and
that all the outcomes in each part can occur regardless of
which specific outcomes have occurred in the other parts.

Then, the total number of outcomes of the experiment will


be equal to the product

n_1 · n_2 · n_3 · … · n_k.

97
2.3. Permutations
Definition

Suppose that a set has n elements.


Suppose that an experiment consists of selecting k of the
elements at a time without replacement.
Let each outcome consist of the k elements in the order
selected.
Each such outcome is called a permutation of n
elements taken k at a time.
The number of permutations is given by n! / (n – k)!
99
2.4. Combinations

Definition

Consider a set with n elements.


Each subset of size k chosen from this set is called
a combination of n elements taken k at a time.

The number of distinct subsets of size k that can be


chosen from a set of size n is given by the binomial
coefficient (n over k) and therefore equals n! / (k! · (n−k)!).

101
2.5. Sampling with replacement

Definition

How many distinct sequences of length m can we
obtain using a set of size n (with replacement)?   n^m

And if we do not care about the ordering in the
different samples?   (n+m−1 over m)

103
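Both counts are easy to evaluate; the sketch below (added for illustration, with the arbitrary values n = 6 and m = 3) uses exactly the two formulas just stated:

    import math

    n, m = 6, 3
    ordered = n ** m                        # ordered samples with replacement: n^m = 216
    unordered = math.comb(n + m - 1, m)     # ordering ignored: (n+m-1 over m) = 56
    print(ordered, unordered)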
Reading: Chapters 3 and 4, DeGroot and Schervish
3. Random variables

105
Definition

Let us consider a probability space [S, E, P(·)].
A real-valued function

    X: S → ℝ
    e ↦ X(e) є ℝ,

assigning to each elementary event e in S a real
number X(e) is called a random variable when
an event A_r є E with A_r = {e | X(e) ≤ r} can be
defined for every arbitrary real number r.

106
(Figure: elements of S mapped by X to points x є ℝ
on the real line.)

107
Example 3.0.1:
If we roll one die once, we have that

S = {1, 2,…,6}.

The number resulting on the die defines a random


variable that can be described by the function

X (e) = e .

108
Example 3.0.2:
A coin is tossed once.
Let X denote the number of heads. X has only
two possible values:
X ( "tail" ) = 0 und X ( "head" ) = 1.
The set of events E includes the four events:
E = {0,
/ "tail", "head", S} .

We have: if -∞ < r < 0 then A r =


0/
if 0 ≤ r < 1 then A r =
" tail"
if 1 ≤ r < ∞ then A r =
S.
109
Example 3.0.3:
Two dice are rolled once. The sample space contains
the following 36 single outcomes:
S= {( i, j) | i=1,...,6, j=1,...,6}.
We can consider different random variables, such as:
X = "sum of the numbers":
X ( i, j ) = i+j = x, x = 2, 3,…,12.
Y = "absolute difference of the numbers":
Y ( i, j ) = i-j = y, y = 0, 1,…,5.
110
Definition
Let W be the set of values a given function can take.
W is called the image (or range) of the function.

We say that a given random variable X is discrete if X


can take only a finite number or a countably infinite
number of different values (i.e. W ⊂ ℝ is discrete).

If the image W ⊂ ℝ of a random variable X contains an


interval of the real line (or the whole real line), then X is
called continuous (uncountably many values in W).
111
Example 3.0.4: (sick notes, see exercise 1.5.2)

O: "number of sick notes"

S = {{-}, {X},{Y},{Z},{XY},{XZ},{YZ},{XYZ}}
→ W = {0, 1, 2, 3}

112
Example 3.0.5: (revenue under uncertain conditions)

The total revenue of a company results from the order


volumes of few major contracts.
For next year the company management hopes to get
three big orders A, B, and C.
The management estimates the chances of success of
getting each specific order differently.
Let us assume for simplicity that getting order A, B, or
C are independent events.

113
Example 3.0.5 (continued):

Order Order volume Probability of


[Mio. CHF] getting the order

A 10 0.8
B 14 0.5
C 24 0.75

The revenue is a random variable X that depends on


the chances of success of the orders:
P(X = 34) = P(A ∩ B̄ ∩ C) = 0.8 · (1 − 0.5) · 0.75 = 0.3

Image of X ?
114
Order positions (e_i)    Revenue X(e_i)    P({e_i})
−                              0             0.025
A                             10             0.1
B                             14             0.025
C                             24             0.075
AB                            24             0.1
AC                            34             0.3
BC                            38             0.075
ABC                           48             0.3
                                              Σ = 1

W = {0, 10, 14, 24, 34, 38, 48}

→ P(X = 24) = P({C, AB}) = 0.175.
115
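The table can be generated mechanically by running over the eight order constellations; the following Python sketch (added for illustration, names invented) rebuilds the distribution of X:

    from itertools import product

    orders = {"A": (10, 0.8), "B": (14, 0.5), "C": (24, 0.75)}   # volume, success probability
    dist = {}
    for won in product([True, False], repeat=3):                 # which of A, B, C are obtained
        revenue, prob = 0, 1.0
        for got, (volume, p) in zip(won, orders.values()):
            revenue += volume if got else 0
            prob *= p if got else (1 - p)
        dist[revenue] = dist.get(revenue, 0) + prob
    print(sorted(dist))      # image W = [0, 10, 14, 24, 34, 38, 48]
    print(dist[24])          # P(X = 24) = 0.175 (constellations C and AB)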
Reading: Chapter 3.3, DeGroot and Schervish
3.1. The (cumulative) distribution function

Definition

The (cumulative) distribution function (for short


c.d.f.) F of a random variable X is the function
F(x) = P(X ≤ x) = P({e|X(e) ≤ x}), for all x є ℝ,
that assigns to each real value x the probability that
the random variable takes a value X ≤ x.

117
Example 3.1.1:
X = “number of heads when tossing a coin once’’.

X can only take the values 0 or 1.


If the coin is fair, that is P(‘‘head’’) = 0.5, then:
         0   , if x < 0
F(x) =   1/2 , if 0 ≤ x < 1
         1   , if x ≥ 1

(Figure: graph of F(x), a step function jumping from 0
to 1/2 at x = 0 and from 1/2 to 1 at x = 1.)

118
Example 3.1.2:
Two dice are rolled once.
Let Y be ‘‘the absolute difference between the two numbers’’.
Then:
Y = 0 , if (i, j) = (1,1); (2,2); (3,3); ...; (6,6) : 6 couples
Y = 1 , if (i, j) = (1,2); (2,1); ... : 10 couples
Y = 2 , if (i, j) = (1,3); (3,1); (2,4); ... : 8 couples
Y = 3 , if (i, j) = (1,4); (4,1); ... : 6 couples
Y = 4 , if (i, j) = (1,5); (5,1); (2,6); (6,2) : 4 couples
Y = 5 , if (i, j) = (1,6); (6,1) : 2 couples
36 couples
119
Example 3.1.2 (continued):

Thus:   P(Y = 0) = 6/36 = 1/6 ;      P(Y = 3) = 6/36 = 1/6 ;
        P(Y = 1) = 10/36 = 5/18 ;    P(Y = 4) = 4/36 = 1/9 ;
        P(Y = 2) = 8/36 = 2/9 ;      P(Y = 5) = 2/36 = 1/18

120
          0     , y < 0
          1/6   , 0 ≤ y < 1
          4/9   , 1 ≤ y < 2
⇒ F(y) =  2/3   , 2 ≤ y < 3
          5/6   , 3 ≤ y < 4
          17/18 , 4 ≤ y < 5
          1     , y ≥ 5

121
(Figure: graph of F(y), a step function with jumps at
y = 0, 1, 2, 3, 4, 5.)

122
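The probability function and the c.d.f. of Y can also be tabulated by enumeration; the short Python sketch below (added for illustration, using exact fractions) reproduces the values above:

    from fractions import Fraction

    counts = {}
    for i in range(1, 7):
        for j in range(1, 7):
            d = abs(i - j)                      # Y = absolute difference of the two numbers
            counts[d] = counts.get(d, 0) + 1
    acc = Fraction(0)
    for y in sorted(counts):
        p = Fraction(counts[y], 36)             # probability function value f(y)
        acc += p                                # c.d.f. F(y) at the jump points
        print(y, p, acc)                        # 0 1/6 1/6, 1 5/18 4/9, ..., 5 1/18 1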
Properties of the distribution function:

(1) Continuity from the right: a c.d.f. is always continuous
    from the right; that is

    lim (Δx↓0) F(x + Δx) = F(x)   at every point x.

(2) Nondecreasing: the function F(x) is nondecreasing as
    x increases; that is

    F(a) ≤ F(b) for any arbitrary a < b.

(3) Limits: F(x) has the limits:

    lim (x→−∞) F(x) = 0   and   lim (x→+∞) F(x) = 1.
123
Reading: Chapter 3.1/3, DeGroot and Schervish
3.2. Discrete random variables
Definition
If a random variable X has a discrete distribution,
the probability (mass) function of X is defined as
the function f such that for every real number x
f(x) = P [X=x].

Clearly, the values pi = P(X = xi) are positive only at the points x = xi belonging to the image W of X:

                    pi , if x = xi ∈ W,
f(x) = P(X = x) =
                    0  , else.
125
Every probability function f satisfies the properties:

(1) f(xi) ≥ 0 (probabilities non-negative);

(2) ∑_{all i} f(xi) = 1 (sure event has probability 1).

From (1) and (2) we get directly the following


property:

(3) f(xi) ≤ 1.
126
Remark:
For real-valued intervals, we can generally compute the probabilities using the following formula:

P(a < X ≤ b) = F(b) − F(a) = ∑_{a < xi ≤ b} p(xi)

For discrete random variables, the formula has to be modified when the boundaries are included/excluded:

P(a < X < b) = F(b) − F(a) − f(b)
P(a ≤ X ≤ b) = F(b) − F(a) + f(a)
P(a ≤ X < b) = F(b) − F(a) + f(a) − f(b)

127
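These boundary corrections are easy to verify numerically, for instance for the dice-difference variable Y of Example 3.1.2; the following sketch (assuming NumPy) is only an illustration:

    import numpy as np

    y_vals = np.array([0, 1, 2, 3, 4, 5])
    p = np.array([6, 10, 8, 6, 4, 2]) / 36       # probability function of Y

    def F(y):                                    # c.d.f. of Y
        return p[y_vals <= y].sum()

    def f(y):                                    # probability function of Y
        return p[y_vals == y].sum()

    a, b = 1, 4
    print(F(b) - F(a))                           # P(a < Y <= b)
    print(F(b) - F(a) - f(b))                    # P(a < Y < b)
    print(F(b) - F(a) + f(a))                    # P(a <= Y <= b)
    print(F(b) - F(a) + f(a) - f(b))             # P(a <= Y < b)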
Reading: DeGroot and Schervish, Chapter 3.2/3.
3.3. Continuous random variables

Definition

Let X be a continuous random variable with distribution function F. The first derivative of the function in x,
f(x) = (d/dx) F(x),
is called the (probability) density function of X.
In this case:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
129
Every density function satisfies the following properties:

(1) f(x) ≥ 0 (distribution function nondecreasing)

(2) ∫_{−∞}^{+∞} f(x) dx = 1 (area under the density is exactly 1)

130
Example 3.3.1:

A continuous random variable X is characterized by


the distribution function

        0                    , x < 0
F(x) =  (1/27)·(x − 3)³ + 1  , 0 ≤ x < 3
        1                    , x ≥ 3

131
Example 3.3.1 (continued):

Computation of the density function:


(take the derivative of F(∙) in each part of the c.d.f.)

        0               , x < 0
f(x) =  (1/9)·(x − 3)²  , 0 ≤ x < 3
        0               , x ≥ 3

132
Example 3.3.1 (continued):
[Figure: the c.d.f. F(x) (left) and the density f(x) (right) of Example 3.3.1, plotted for x between 0 and 3.]

133
What is the probability P (1 ≤ X ≤ 2 ) ?

P(1 ≤ X ≤ 2) = F(2) − F(1) = (−1/27 + 1) − ((−2)³/27 + 1) = 7/27 = 0.2593

or

P(1 ≤ X ≤ 2) = ∫_1^2 (1/9)·(x − 3)² dx = [ (1/27)·(x − 3)³ ]_1^2 = −1/27 − (−8/27) = 7/27.

In both cases we clearly get the same result!


The information content of the two functions F and f is
the same.
134
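As a small sanity check (not in the original slides), the same probability can be reproduced numerically with SciPy's quadrature:

    from scipy.integrate import quad

    f = lambda x: (x - 3) ** 2 / 9          # density on [0, 3)
    F = lambda x: (x - 3) ** 3 / 27 + 1     # c.d.f. on [0, 3)

    p_int, _ = quad(f, 1, 2)                # integrate the density
    p_cdf = F(2) - F(1)                     # difference of the c.d.f.
    print(p_int, p_cdf)                     # both equal 7/27 = 0.2593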
Example 3.3.2: (waiting time at the ‘S-Bahn’ station)

Trains come every 12 minutes at a given


‘S-Bahn’ station. Suppose you do not know the exact
schedule and arrive at the station at a randomly
chosen point in time.

Let us define
X = ‘‘waiting time at the station’’
as the random variable of interest.
The image set of X is W = [0,12] (minutes).
135

Density function:   f(x) =  1/12 , if x ∈ [0, 12]
                            0    , else.

Distribution function:

        0     , x < 0
F(x) =  x/12  , 0 ≤ x ≤ 12      ← ∫_0^x (1/12) du = [ u/12 ]_0^x = x/12
        1     , x > 12

Then:  P(10 < X < 15) = F(15) − F(10) = 1 − 10/12 = 0.1667
       P(X > 9) = 1 − F(9) = 1 − 9/12 = 0.25
136
Reading: DeGroot and Schervish, Chapter 4.1/2.
3.4. The expectation of a random variable

Definition
Let X be a random variable and f be its probability
or density function (discrete or continuous X).

The expected value of X is defined as

E[X] = ∑_{all j} xj·f(xj) = ∑_{all j} xj·pj , if X discrete;

E[X] = ∫ x·f(x) dx , if X continuous.

It is usually denoted by μX.
138
The expectation of a random variable can be regarded
as being the center of gravity of that distribution.

Theorem:

The expected value of the difference between


the random variable X and its expectation μx
equals zero, that is

E[X - μx] = 0 (central property).

139
Example 3.4.1:

Let X be a random variable with probability function

        x/3 , x = 1, 2
f(x) =
        0   , else.

⇒ E[X] = 1·(1/3) + 2·(2/3) = 5/3

140
Example 3.4.2: (rolling two dice)

Y = "absolute difference between the two numbers"


We computed the probability function as follows:
Y       0      1      2      3      4      5
f(yi)   6/36   10/36  8/36   6/36   4/36   2/36

⇒ E[Y] = 0·(1/6) + 1·(5/18) + 2·(2/9) + 3·(1/6) + 4·(1/9) + 5·(1/18)
       = (5 + 8 + 9 + 8 + 5)/18 = 35/18
141
Example 3.4.3:

Let X be a random variable with density function

        (1/4)·x , 1 ≤ x ≤ 3
f(x) =
        0       , else.

⇒ E[X] = ∫_{−∞}^{∞} x·f(x) dx = ∫_1^3 (1/4)·x² dx = [ x³/12 ]_1^3
       = (27 − 1)/12 = 26/12 = 13/6.

142
The expectation of a function of a random
variable (law of the unconscious statistician)

Let X be a random variable with probability


function or density function f, and g(X) be a real-
valued function. Then:

E[g(X)] = ∑_{all j} g(xj)·pj , if X discrete;

E[g(X)] = ∫ g(x)·f(x) dx , if X continuous.


143
Example 3.4.4:
Breakdowns are observed during the activity of a
production center.

Analyzing past data, we get for


X= "number of breakdowns per day"
the following probability function:

X 0 1 2 3
f( x ) 0.35 0.4 0.15 0.1

144
Example 3.4.4 (continued):

To eliminate the breakdowns the firm incurs the following costs:

g(x) = 5 − 4/(x + 1)   (per thousand CHF).

If we first compute
E[X] = 0·0.35 + 1·0.4 + 2·0.15 + 3·0.1 = 1
(the expected number of breakdowns) and plug this value into the cost function, we get expected costs
g(E[X]) = 5 − 4/(1 + 1) = 3.
145
Example 3.4.4 (continued):

But:
The correct way to compute the expected costs is

E[g(X)] = g(0)·0.35 + g(1)·0.4 + g(2)·0.15 + g(3)·0.1
        = 1·0.35 + 3·0.4 + (11/3)·0.15 + 4·0.1
        = 2.5.

146
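The gap between E[g(X)] and g(E[X]) is easy to reproduce; a minimal sketch in Python (assuming NumPy):

    import numpy as np

    x = np.array([0, 1, 2, 3])                # breakdowns per day
    p = np.array([0.35, 0.4, 0.15, 0.1])      # probability function
    g = lambda x: 5 - 4 / (x + 1)             # cost function (thousand CHF)

    print(g(p @ x))                           # g(E[X]) = 3.0  (the shortcut)
    print(p @ g(x))                           # E[g(X)] = 2.5  (the correct value)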
Example 3.4.5:

X       0     1     2     3
fx(x)   0.1   0.3   0.2   0.4

Y = (X − 2)²  →  fy ?    Wy = {0, 1, 4}

Y       0     1     4
fy(y)   0.2   0.7   0.1        → E[Y] = 0·0.2 + 1·0.7 + 4·0.1 = 1.1

147
Computing expectations: linear function

Theorem

Let X be a random variable with expected value E[X]


and a, b two real-valued finite constants, if
Y = aX + b,
then the expected value of Y equals
E[Y] = E[aX + b] = a·E[X] + b.

148
Example 3.4.6:

Let X be a random variable with density function


         e^(−x) , x ≥ 0
fx(x) =                          and   Y = 2X + 1
         0      , else

What is E[Y]? Linear case: two ways.

E[Y] = E[2X + 1] = 2·E[X] + 1 = 3

E[X] = ∫_0^{+∞} x·e^(−x) dx = [ −x·e^(−x) ]_0^{+∞} + ∫_0^{+∞} e^(−x) dx      (integration by parts)
     = [ −e^(−x) ]_0^{∞} = 1

Computing the density fy = ?
149


Theorem (density transformation)

Assume that some regularity conditions on the function g(∙) are satisfied. Then

fy(y) = fx(g⁻¹(y)) · | d g⁻¹(y) / dy | .

Example 3.4.6 (continued):

→ y = g(x) = 2x + 1  →  g⁻¹(y) = (y − 1)/2

⇒ fy(y) = e^(−(y−1)/2) · (1/2) , if (y − 1)/2 ≥ 0 ⇔ y ≥ 1.
150
Example 3.4.6 (continued):

Do we have a density function?

∫_1^∞ (1/2)·e^(−(y−1)/2) dy = [ −e^(−(y−1)/2) ]_1^∞ = 1  ✓

Then:  E[Y] = (1/2)·∫_1^∞ y·e^(−(y−1)/2) dy
            = [ −y·e^(−(y−1)/2) ]_1^∞ + ∫_1^∞ e^(−(y−1)/2) dy      (integration by parts)
            = 1 + 2·[ −e^(−(y−1)/2) ]_1^∞ = 1 + 2 = 3.
151
Example 3.4.7:
Let us consider a random variable X with p.f.
X -1 0 1.5 2
f (x) 0.3 0.1 0.4 0.2

⇒ E[X] = (−1)·0.3 + 0·0.1 + 1.5·0.4 + 2·0.2 = 0.7
  E[X + 3] = E[X] + 3 = 3.7
  E[4X] = 4·E[X] = 4·0.7 = 2.8

152
Example 3.4.8: (rolling die game)

Player 1 promises Player 2 that he will pay the


following amounts when rolling one die:

10 cents, if the number is a 1 or 2;


20 cents, if the number is a 3 or 4;
40 cents, if the number is a 5; and
80 cents, if the number is a 6.

How much does Player 2 have to pay before each


roll of the die such that the game is fair?

153
Example 3.4.8 (continued):
“Fair game’’ means that the fee one has to pay
exactly equals the expected gain. Let X denote the
gain (in cents).
X = x    10    20    40    80
f(x)     2/6   2/6   1/6   1/6

Then, the expected gain is given by

E[X] = 10·(2/6) + 20·(2/6) + 40·(1/6) + 80·(1/6) = 30.
Thus the fee must be set equal to 30 cents to get a
fair game. (→ Casino-games are not fair!) 154
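A small simulation (not part of the slides, assuming NumPy) makes the idea of a fair fee tangible:

    import numpy as np

    rng = np.random.default_rng(1)
    pay = np.array([0, 10, 10, 20, 20, 40, 80])   # pay[k] = gain in cents for face k
    rolls = rng.integers(1, 7, size=1_000_000)    # one million simulated rolls
    print(pay[rolls].mean())                      # close to the exact value E[X] = 30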
Reading: DeGroot and Schervish, Chapter 4.3.
3.5. Variance

Definition
Let X be a random variable with finite mean μx.
The variance of X is defined as follows:
σx² = V(X) = E[(X − μx)²],

provided that the sum or the integral exists.


If the mean of X does not exist, we say that V(X)
does not exist as well.
The standard deviation of X is the nonnegative
square root of V(X), that is σx = + √ V(X) .
156
The variance of a random variable is a measure
of how spread out the distribution of X is.
It can be computed by:
σx² = ∑_{all j} (xj − μx)²·pj , if X discrete;

σx² = ∫ (x − μx)²·f(x) dx , if X continuous.

157
Example 3.5.1: (flipping a coin)

Let X be the discrete random variable defined as


X = “number of heads when flipping a coin twice’’

Then:   X       0     1     2
        f(x)    1/4   1/2   1/4

and μx = 1

σx² = (0 − 1)²·(1/4) + (1 − 1)²·(1/2) + (2 − 1)²·(1/4) = 1/2

158
Example 3.5.2: (rolling two dice)

Let us consider again (see Example 3.4.2)


Y = “absolute difference of the numbers’’.

Above we computed that


E[Y] = 35/18.

Then, we get
V(Y) = σY² = 2.05247.

159
Example 3.5.3: (continuous case)

Let X be a continuous random variable with density


function given by

        c·(x − (1/2)·x²) , 0 < x < 2
f(x) =
        0                , else.

160
Example 3.5.3 (continued):

What is the value of the constant c such that the


function f is a density function?

c·∫_0^2 (x − (1/2)·x²) dx = c·[ x²/2 − x³/6 ]_0^2 = c·(2 − 8/6) = c·(2/3) ≝ 1

⇒ c = 3/2

161
Example 3.5.3 (continued):

Then:

E[X] = (3/2)·∫_0^2 (x² − (1/2)·x³) dx = (3/2)·[ x³/3 − x⁴/8 ]_0^2 = (3/2)·(8/3 − 2) = 1

and

V(X) = (3/2)·∫_0^2 (x − 1)²·(x − (1/2)·x²) dx = (3/2)·∫_0^2 (x − (5/2)·x² + 2·x³ − (1/2)·x⁴) dx
     = (3/2)·(2 − 40/6 + 16/2 − 32/10) = 1/5 ;      σx = √(1/5) = 0.4472.

162
Computation of variances: simple rules

Theorem (linear function)

Let X be a random variable with existing variance


V(X), and a and b two real constants. Then
Y = aX + b
has the variance
V(Y) = a²·V(X)
and standard deviation
σY = |a|·σx.

163
Example 3.5.4:
Let us consider the random variable X with p.f.
X 6060 6100 6140
f (x) 0.2 0.3 0.5

For the computation of the variance V(X) let us first


consider the linear transformation
Y = (X − 6100)/40
with associated probability function
Y -1 0 1
f ( y ) 0.2 0.3 0.5 164
Example 3.5.4 (continued):

We then get:

E[Y] = 0.3
V(Y) = (−1 − 0.3)²·0.2 + (0 − 0.3)²·0.3 + (1 − 0.3)²·0.5 = 0.61

⇒ V(X) = V(6100 + 40·Y) = 40²·V(Y) = 40²·0.61 = 976

165
Theorem (alternative method for computing variances)
For every random variable X: V(X) = E[X²] − μx².

Example 3.5.5: (rolling two dice)


Let us consider again
Y = “absolute difference of the numbers’’
We already computed:  E[Y] = 35/18 ,   V(Y) = 2.05247.

Check using the theorem:  E[Y²] = 0²·(1/6) + 1²·(10/36) + ... = 210/36

⇒ V(Y) = 210/36 − (70/36)² = 5.8333 − 3.78086 = 2.05247.
166
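Both routes to V(Y) can be checked directly from the probability function; a short Python/NumPy sketch (illustration only):

    import numpy as np

    y = np.array([0, 1, 2, 3, 4, 5])
    p = np.array([6, 10, 8, 6, 4, 2]) / 36

    mu = p @ y                        # E[Y] = 35/18
    var_def = p @ (y - mu) ** 2       # definition: E[(Y - mu)^2]
    var_alt = p @ y**2 - mu**2        # alternative: E[Y^2] - mu^2
    print(mu, var_def, var_alt)       # 1.9444..., 2.05247, 2.05247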
Steiner rule:
Let X be a random variable with E[X] = μ and
let d be a real-valued constant. Then

V(X) = E[(X − d)²] − (μ − d)².

167
3.6. Standardization

Definition

Let X be a random variable with μx = μ and


σx = σ > 0 (both finite). Then we call the
random variable Z resulting from the transformation
Z = (X − μ)/σ   standardized.

169
Every standardized random variable has
expectation 0 and variance 1:
             translation       stretching
X            Y = X − μ         Z = (1/σ)·Y = (X − μ)/σ

E[X] = μ     E[Y] = 0          E[Z] = 0
V(X) = σ²    V(Y) = σ²         V(Z) = 1

170
Example 3.6.1: (flipping a coin)

X = “number of heads when flipping a coin twice’’


We have: E(X) = 1 and V(X) = ½ .
Standardization:  Z = (X − μ)/σ = (X − 1)/√(1/2)

We then get for Z:

Z       −√2    0     √2
f(z)    1/4    1/2   1/4

and, as expected,
E[Z] = 0  and  V(Z) = 1.
171
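The same check can be done in a few lines (NumPy assumed); this is only an illustrative sketch:

    import numpy as np

    x = np.array([0, 1, 2])
    p = np.array([0.25, 0.5, 0.25])       # number of heads in two tosses

    mu = p @ x                            # 1.0
    sigma = np.sqrt(p @ (x - mu) ** 2)    # sqrt(1/2)

    z = (x - mu) / sigma                  # standardized values: -sqrt(2), 0, sqrt(2)
    print(p @ z, p @ z**2)                # E[Z] = 0 and E[Z^2] = V(Z) = 1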
Reading: DeGroot and Schervish, Chapter 5.
4. Special distributions

173
Several distributions play a special role in probability
and statistics: they are known to be useful in a wide
variety of applied problems.

Each special class constitutes a whole family of


distributions.

The different members of each family can be


obtained by specifying the value of the underlying
parameter(s).

174
4.1. The uniform discrete distribution
Let us consider a random variable X with m (finite)
possible outcomes.
We assume that all outcomes are equally likely:

X x1 x2 x3 … xm-1 xm
f(X) 1/m 1/m 1/m .... 1/m 1/m

Notation: fUni(x; m), where m is the parameter of the family.

176
Example 4.1.1: (rolling one die)

For m = 6, we have that the probability function of

X = ‘‘number when rolling one die’’

is given by

1
 , x = 1, 2,...,6
f Uni ( x;6 ) =  6
 0, else.
177
Example 4.1.1 (continued):
What about the distribution function?
[Figure: step plot of the c.d.f. FUni(x; 6), increasing by 1/6 at each of x = 1, 2, ..., 6.]
178
Example 4.1.1 (continued):

What is the expected value?


E[X] = ∑_{i=1}^m xi·pi = ∑_{i=1}^m xi·P[X = xi] = ∑_{i=1}^m xi·(1/m) = (1/m)·∑_{i=1}^m xi

→ E[X] = (1/6)·∑_{i=1}^6 i = (1/6)·(6·(6 + 1)/2) = 3.5

179
Example 4.1.1 (continued):

And the variance (or standard deviation)?

E[X²] = (1/m)·∑_{i=1}^m xi² = (1/6)·∑_{i=1}^6 i² = (1/6)·((6 + 1)·(2·6 + 1)·6/6) = 7·13/6 = 91/6

V(X) = E[X²] − (E[X])² = (1/m)·∑_{i=1}^m xi² − ((1/m)·∑_{i=1}^m xi)²
     = 91/6 − (7/2)² = 91/6 − 49/4 = (182 − 147)/12 = 35/12 = 2.9167

σX = √V(X) = √2.9167 = 1.7078
180
Remark: (no additive property)

Unfortunately the sum of two (or more) independent uniform discrete random variables does not belong to the same family of distributions, that is, it is not uniform discrete.

Ex: X = ‘‘sum of the numbers when rolling two dice’’:

Value:        2      3      4     ...   12
Probability:  1/36   2/36   3/36  ...          ← not equally likely!

not a uniform discrete distribution!
181
Reading: DeGroot and Schervish, Chapter 5.2.
4.2. The Bernoulli distribution (discrete)
The random variable X has only two possible
outcomes, denoted by 0 and 1.

result of an experiment with two outcomes A


and Ā, where A is called success and Ā failure.
The probability that event A occurs is called
success probability, is the parameter of the
distribution, and is denoted by p.

Such experiments are called


Bernoulli experiments/trials.
183
Definition

A random variable X has the Bernoulli distribution


with parameter p if its probability function equals

               1 − p , x = 0
fBe(x; p) =    p     , x = 1
               0     , else.

The parameter p can take any value between 0 and


1.

184
Example 4.2.1:

Let us consider a game where in order to win the


player has to score at least one six when rolling three
dice (‘‘success’’). Let X denote success/failure.

→ What is the success probability?

p = 1 − P["no six"] = 1 − (5/6)³ = (216 − 125)/216 = 91/216.

⇒ E[X] = 91/216 = 0.4213,

V(X) = (91/216)·(1 − 91/216) = 0.2438,

σX = 0.4938.
185
Reading: DeGroot and Schervish, Chapter 5.2.
4.3. The Binomial distribution (discrete)

Definition

A random variable X has the binomial distribution


with parameters n and p if its probability function
equals

n x x = 0,1,2,..., n
fBi ( x; p, n )=   ⋅ p ⋅ (1 − p ) ,
n−x

x 0 ≤ p ≤ 1.

187
Derivation of the binomial distribution:
Let Yi, i=1,…n, be independent random
variables, each one Bernoulli distributed
with parameter p (Bernoulli trials).

Then, the sum


X = ∑_{i=1}^n Yi
is distributed according to a binomial
distribution with parameters n and p.

188
Example 4.3.1: (urn with replacement)

Let us consider an urn containing colored balls. In


particular we have 10 black and 20 white balls.

We draw four balls from the urn with replacement.

We are interested in the total number of black balls


we have drawn.

189
Example 4.3.1 (continued):

Let X denote the random variable

X = ‘‘number of black balls’’.

→ possible outcomes of X: x = 0, 1, 2, 3, 4
→ parameter values:  p = 10/30 = 1/3 ;   n = 4

→ P(X=0) = (4 choose 0)·p⁰·(1 − p)⁴ = 16/81 ;
  P(X=1) = (4 choose 1)·p¹·(1 − p)³ = 32/81 ;
190
Example 4.3.1 (continued):

That is:
X is binomially distributed with fBi(x; 1/3, 4)

→ E[X] = n·p = 4·(1/3) = 4/3
→ V(X) = n·p·(1 − p) = 4·(1/3)·(2/3) = 8/9

191
Example 4.3.2: (election)

Let us assume that 35% of the whole voting


population of a given country decides to vote for
party C.

We ask 12 randomly chosen persons whom they


intend to vote for (random sample of size n=12).

What will be the result of our sample test?

192
Example 4.3.2 (continued):

Expected number of electors voting for C in the sample:


μ = 12·0.35 = 4.2.

Standard deviation:
σ = √(12·0.35·0.65) = 1.6523.

P("electors voting for C reach majority")
= P[X > 6] = ∑_{x=7}^{12} P[X = x] = ∑_{x=7}^{12} fBi(x; 0.35, 12) = 0.0846.

193
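The tail probability above can be reproduced with SciPy's binomial distribution (an illustrative sketch, not part of the original slides):

    from scipy.stats import binom

    n, p = 12, 0.35
    print(binom.mean(n, p), binom.std(n, p))   # 4.2 and 1.6523
    print(binom.sf(6, n, p))                   # P(X > 6) = 0.0846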
Reading: DeGroot and Schervish, Chapter 5.4.
4.4. The Poisson distribution (discrete)
Definition

Let λ > 0. A random variable X has the Poisson distribution with mean λ if its probability function is as follows:

               (λ^x / x!)·e^(−λ) , for x = 0, 1, 2, ...,
fPo(x; λ) =
               0                 , else.

195
Example 4.4.1: (roulette)

A roulette player is convinced that the number 17 is


his lucky number.

Therefore he continuously bets on that number.

What is the probability that he is going to win exactly 8


times out of 200 trials?

→ success probability by one trial: p = P(‘‘17’’) = 1/37;


number of trials: n=200.

196
Example 4.4.1 (continued):

Let us use the Poisson distribution with λ = n·p = 5.4054.

Then, we find:  fPo(8; λ) = (5.4054⁸ / 8!)·e^(−5.4054) = 0.0812.

Remark: If, instead, we use the correct binomial distribution, we find:

fBi(8; 1/37, 200) = 0.0814.

197
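Both numbers are easy to reproduce with SciPy (sketch only):

    from scipy.stats import binom, poisson

    n, p = 200, 1 / 37
    lam = n * p                                # 5.4054

    print(poisson.pmf(8, lam))                 # Poisson approximation: 0.0812
    print(binom.pmf(8, n, p))                  # exact binomial value:  0.0814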
Example 4.4.2: (minigolf)

The professional minigolf player E. Findhole claims that, even in the most difficult conditions, he fails to make a hole in one in only 20% of the cases.

Today many journalists are present at the minigolf


course. The player will demonstrate his ability with a
series of 50 attempts.

198
Example 4.4.2 (continued):
Let us compute the probability that Mr. Findhole
makes a hole in one in only 38 out of the 50
attempts: (a) exact; (b) with a suitable approximation.

a) X = "# failed attempts" follows fBi ( x ; 0.2 , 50 )


P [ X=12] = 0.1033

b) Let us use the Poisson distribution: λ = np = 10


P[X = 12] ≈ P(Y = 12) = 0.0948 , where Y ~ Poisson(10)

Approximation error: too large! p = 0.2 not suitable!


199
4.5. The uniform continuous distribution
Definition
A random variable X has the uniform (continuous)
distribution if its density function is defined as
follows (a, b two real-valued constants):

                  1/(b − a) , if a ≤ x ≤ b
fUni(x; a, b) =
                  0         , else.

The density function is constant in the interval [a, b].

201
Example 4.5.1:
According to schedule, a bus is expected to arrive
every 30 minutes between midnight and 6am.
What is the probability that a passenger has to wait
more than 10 minutes?
T = "waiting time for the next bus"

is a random variable with   fUni(t; 0, 30) =  1/30 , 0 ≤ t ≤ 30
                                              0    , else.

Thus:  P[T > 10] = 1 − P[T ≤ 10] = 1 − FUni(10) = 1 − 10/30 = 2/3

Moreover:  E[T] = 15 min  and  V(T) = 75 min²
202
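With SciPy the same quantities follow from the uniform distribution on [0, 30] (illustrative sketch):

    from scipy.stats import uniform

    T = uniform(loc=0, scale=30)       # waiting time in minutes
    print(1 - T.cdf(10))               # P(T > 10) = 2/3
    print(T.mean(), T.var())           # 15 minutes and 75 minutes^2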
Example 4.5.2: (waiting time at the ‘S-Bahn’ station)
Trains are coming every 12 minutes at a given
‘S-Bahn’ station. Suppose you do not know the exact
schedule and arrive at the station at a randomly
chosen point in time.

Let us define
X = ‘‘waiting time at the station’’
as the random variable of interest.
The image set of X is W = [0,12] (minutes).

203
Reading: DeGroot and Schervish, Chapter 5.7.
4.6. The exponential distribution (continuous)

Definition

Let λ > 0. A random variable X has the exponential distribution with parameter λ if its density function is as follows:

              λ·e^(−λx) , if 0 ≤ x < ∞
fEx(x; λ) =
              0         , else.

Then: E[X] = 1/λ and V(X) = 1/λ².

205
Example 4.6.1: (life test)

If we believe in the manufacturer's claim, the


expected life of a light bulb is 5000 hours.
We assume as a good approximation for the
distribution of the random variable
X = ‘‘lifetime of the light bulb’’:

fEx(x) = (1/5000)·e^(−x/5000) ,   0 ≤ x < ∞.

Remark:  λ = 1/E[X]
206
Example 4.6.1 (continued):

What is the probability that a light bulb:

(1) runs less than 2500 hours?

P[X ≤ 2500] = FEx(2500) = 1 − e^(−1/2) = 0.3935;

(2) runs more than 10000 hours?

P[X > 10000] = 1 − FEx(10000) = 1 − (1 − e^(−2)) = 0.1353.

207
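The two probabilities can be checked with SciPy's exponential distribution, parametrized by the scale 1/λ = 5000 (sketch only):

    from scipy.stats import expon

    X = expon(scale=5000)              # lifetime in hours, E[X] = 5000
    print(X.cdf(2500))                 # P(X <= 2500) = 1 - e^(-1/2) = 0.3935
    print(X.sf(10000))                 # P(X > 10000) = e^(-2)       = 0.1353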
Reading: DeGroot and Schervish, Chapter 5.6.
4.7. The normal distribution (continuous)
Definition (standard normal distribution)
A random variable X has the standard normal
distribution if its density function is defined as
follows:
fZ(x) = (1/√(2π))·e^(−x²/2) ,   for −∞ < x < ∞.

The expected value and the variance of this


distribution are 0 and 1, respectively.

209
Definition (general normal distribution)
A random variable X has the normal distribution
with mean μ and variance σ² (−∞ < μ < ∞; σ > 0) if its density function is defined as follows:

fN(x; μ, σ²) = (1/√(2πσ²))·e^(−(x − μ)²/(2σ²)) ,   −∞ < x < ∞.

In fact, a parameter μ for the center of gravity and a


parameter σ for the dispersion of the distribution are
introduced.
210
Illustration of the normal distribution (density function)
for different (μ,σ) values.
[Figure: 3×3 grid of normal density plots, columns σ = 1, 2, 3 and rows μ = −5, 0, 5.]
211
Example 4.7.1: (working with normal distribution)

Let the random variable X be normally distributed with E[X] = 5 and V(X) = 9.
What is the probability P(−2 < X ≤ 4)?

P(−2 < X ≤ 4) = P( (−2 − μ)/σ < (X − μ)/σ ≤ (4 − μ)/σ )
              = P( −7/3 < Z ≤ −1/3 )
              = FZ(−1/3) − FZ(−7/3)
              = 0.3694 − 0.0098 = 0.3596
212
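The same probability follows directly from SciPy's normal c.d.f. (sketch only):

    from scipy.stats import norm

    X = norm(loc=5, scale=3)           # E[X] = 5, V(X) = 9
    print(X.cdf(4) - X.cdf(-2))        # P(-2 < X <= 4) = 0.3596

    # equivalent computation via the standardized variable Z
    print(norm.cdf(-1/3) - norm.cdf(-7/3))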
Example 4.7.1 (continued):
The probability P(−2 < X ≤ 4) for different values of E[X] = μ and V(X) = σ².

[Figure: 3×3 grid of shaded normal densities, columns σ = 1, 2, 3 and rows μ = −5, 0, 5.]
213
Example 4.7.2: (finance: asset allocation)
It is usually assumed in the portfolio theory that financial
asset returns are normally distributed random variables.
→ E[R] = μ : expected return;
  σR = √V(R) : volatility (risk).
An investor wants to invest a certain amount of money in
three different shares:

share expected return volatility


A1 44% 22%
A2 36% 20%
A3 10% 4%
214
Example 4.7.2 (continued)

a) Investing the whole amount in a single share (trying to avoid a loss):

P(R1 < 0) = P( (R1 − μ1)/σ1 < (0 − 0.44)/0.22 ) = P(Z < −2)   = 0.0228
P(R2 < 0) = P( (R2 − μ2)/σ2 < (0 − 0.36)/0.20 ) = P(Z < −1.8) = 0.0359
P(R3 < 0) = P( (R3 − μ3)/σ3 < (0 − 0.1)/0.04 )  = P(Z < −2.5) = 0.0062

i.e., the risk of a loss is smaller for the third share (but the expected return is smaller, too).
215
Example 4.7.2 (continued)

b) Investing the same amount in each of the three shares:

portfolio return:  R = (1/3)·(R1 + R2 + R3)  is normally distributed with

μR = E[R] = (1/3)·(44 + 36 + 10) = 30 [%]

σR² = V(R) = (1/3)²·(22² + 20² + 4²) = 100 [%²]

Then:  P(R < 0) = 0.0013 < P(R3 < 0) !

→ Splitting the capital among different investment options turns out to be the best strategy (diversification).
216
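A short SciPy sketch reproduces the loss probabilities; the portfolio variance uses the slides' implicit assumption of independent returns:

    import numpy as np
    from scipy.stats import norm

    mu = np.array([0.44, 0.36, 0.10])             # expected returns
    sigma = np.array([0.22, 0.20, 0.04])          # volatilities

    # single shares
    print(norm.cdf(0, loc=mu, scale=sigma))       # 0.0228, 0.0359, 0.0062

    # equally weighted portfolio (independent returns assumed)
    mu_p = mu.mean()                              # 0.30
    sigma_p = np.sqrt((sigma ** 2).sum()) / 3     # 0.10
    print(norm.cdf(0, loc=mu_p, scale=sigma_p))   # 0.0013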
5. Multivariate random variables

218
To formalize many underlying theories as well as to
solve many applied problems, one also needs to
consider the relation among the different random
variables under investigation. In fact, that information
might play a prominent role and cannot be
neglected.

219
Example 5.0.1: (finance: portfolio selection)

Let us consider a portfolio composed of two indices,


namely the S&P 500 and the FTSE 100.
To analyze the behavior of the portfolio returns, one
possibility is to model the returns of the two indices
separately using two univariate random variables.
Proceeding this way, however, we would completely
neglect the stochastic relation between the two
indices, and the results of the analysis might be
misleading.
Therefore what we need is a multivariate approach
that takes into account such a relation.
220
→ In general, the relation among the variables is
stochastic and must be taken into account.

→ If we consider two random variables:


→ bivariate distribution.

→ If we consider more than two random variables:


→ multivariate distribution.

221
Reading: DeGroot and Schervish, Chapter 3.4/5.
5.1. Joint distribution and marginal
distributions

Discrete random variables:

Let X and Y be discrete random variables, and


consider the ordered pair (X,Y).
The joint probability function of X and Y is defined as the function f such that, for every point (x, y) in the xy-plane,
f(x, y) = P[{X = x} ∩ {Y = y}].
223
Properties:

(1) f(xi, yj) ≥ 0
(2) ∑_{∀i} ∑_{∀j} f(xi, yj) = 1        ⇒ (3) f(xi, yj) ≤ 1, ∀i, j

Finally, for each set C of ordered pairs,

P((X, Y) ∈ C) = ∑_{(xi, yj) ∈ C} f(xi, yj).

224
Example 5.1.1: (urn without replacement)

An urn contains 6 balls: three balls are labelled with


"1", two balls with "2", and the last ball with "3".

[Figure: urn containing six balls labelled 1, 1, 1, 2, 2, 3.]
Two balls are drawn from the urn without
replacement. The joint probability of
(X, Y) = (" label first ball ", " label second ball ")
is: [see next slide]
225
Example 5.1.1 (continued):

   Y     y1=1    y2=2    y3=3    fx
X
x1=1     1/5     1/5     1/10    1/2
x2=2     1/5     1/15    1/15    1/3
x3=3     1/10    1/15    0       1/6
fy       1/2     1/3     1/6     1

For example:  f(1, 1) = (3/6)·(2/5) = 1/5 ;   f(2, 1) = (2/6)·(3/5) = 1/5 ;   f(3, 1) = (1/6)·(3/5) = 1/10.

Such a table is called a table of probabilities.


226
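Such a table and its marginals can be handled conveniently as a NumPy array; a sketch using exact fractions (illustration only):

    import numpy as np
    from fractions import Fraction as F

    # rows: X = 1, 2, 3;  columns: Y = 1, 2, 3
    joint = np.array([[F(1, 5),  F(1, 5),  F(1, 10)],
                      [F(1, 5),  F(1, 15), F(1, 15)],
                      [F(1, 10), F(1, 15), F(0)]])

    print(joint.sum(axis=1))   # marginal of X: [1/2, 1/3, 1/6]
    print(joint.sum(axis=0))   # marginal of Y: [1/2, 1/3, 1/6]
    print(joint.sum())         # total probability: 1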
Example 5.1.2: (urn with replacement)
Consider the same example as in 5.1.1 with
replacement.

   Y     y1=1    y2=2    y3=3    fx
X
x1=1     1/4     1/6     1/12    1/2
x2=2     1/6     1/9     1/18    1/3
x3=3     1/12    1/18    1/36    1/6
fy       1/2     1/3     1/6     1

For example:  f(1, 1) = (3/6)·(3/6) = 1/4.

227
Example 5.1.3: (tossing a coin)

A fair coin is tossed four times. Let


(X, Y) = (" number of heads ", " number of changes ").

→ # outcomes: 2⁴ = 16

TTTT: (0,0); TTTH: (1,1); TTHT: (1,2); THTT: (1,2); HTTT: (1,1);
TTHH: (2,1); THTH: (2,3); THHT: (2,2); HTHT: (2,3); HHTT: (2,1);
HTTH: (2,2); THHH: (3,1); HTHH: (3,2); HHTH: (3,2); HHHT: (3,1);
HHHH: (4,0).

228
Example 5.1.3 (continued):
Y y1 = 0 y2 = 1 y3 = 2 y4 = 3 fx
X

x1 = 0 1/16 0 0 0 1/16

x2 = 1 0 1/8 1/8 0 1/4

x3 = 2 0 1/8 1/8 1/8 3/8

x4 = 3 0 1/8 1/8 0 1/4

x5 = 4 1/16 0 0 0 1/16

fy 1/8 3/8 3/8 1/8 1


229
Remark:
In the previous three examples:

fx(xi) = P[X = xi] = ∑_j f(xi, yj) = p_{i,•} ;

fy(yj) = P[Y = yj] = ∑_i f(xi, yj) = p_{•,j} ;

are called marginal probability functions of X and


Y, respectively.

230
Continuous random variables:
Let X and Y be continuous random variables. The
function f(x,y) with
∫_a^b ∫_c^d f(x, y) dy dx = P[{a < X ≤ b} ∩ {c < Y ≤ d}]

for real-valued constants a < b and c < d is called the joint (probability) density function of X and Y.

Properties:  (1) f(x, y) ≥ 0, ∀x, y ;

             (2) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1.
231
Example 5.1.4:
The joint density function of a (specific) two-dimensional
normal distribution could be defined by:

f(x, y) = (1/(2π))·e^(−(x² + y²)/2) ,   −∞ < x, y < +∞.

232
Example 5.1.5:
The joint density function of (X, Y) is given by

12 2
 ( x + xy ) , if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1;
f ( x,y ) =  7
 0, else.

→ non-negativity property ? 
→ 12
( )
2
12 xy 12
1 1 1 1
1 1
x
∫ ∫ ( x +xy ) dydx = ∫ x y + ∫
2
2
dx = 2
x + dx
7 0 0
7 0 0 2 0 7 0 2
12  x 3 x 
( )
12 1 1 12 7
2 1
=  +  = + = ⋅ = 1
7  3 4  0 7 3 4 7 12
233
In the continuous case, the marginal (probability)
density functions of X and Y are


fx(x) = ∫_{−∞}^{∞} f(x, y) dy
and
fy(y) = ∫_{−∞}^{∞} f(x, y) dx,
respectively.

234
Example 5.1.5 (continued):
fx(x) = (12/7)·∫_0^1 (x² + x·y) dy = (12/7)·[ x²·y + x·y²/2 ]_0^1
      = (12/7)·(x² + x/2) ,   x ∈ [0, 1].

fy(y) = (12/7)·∫_0^1 (x² + x·y) dx = (12/7)·[ x³/3 + x²·y/2 ]_0^1
      = (12/7)·(1/3 + y/2) ,   y ∈ [0, 1].
235
Remark:
The expected values and variances of the marginal
distributions of a bivariate random vector can be
computed using the marginal probability/density
functions:
→ μx = E[X] = ∑_i xi·fx(xi) ;
  σx² = V(X) = ∑_i (xi − μx)²·fx(xi)              (discrete)

→ μx = E[X] = ∫_{−∞}^{∞} x·fx(x) dx ;
  σx² = V(X) = ∫_{−∞}^{∞} (x − μx)²·fx(x) dx      (continuous)
236
Definition

The joint (cumulative) distribution function of


two random variables X and Y is defined as the
function F such that for all real values x and y,

F ( x, y ) = P [ X ≤ x, Y ≤ y ] .

It is clear that F(x,y) is monotone increasing in x for


each fixed y and is monotone increasing in y for
each fixed x.

237
Practical computation:

If (X, Y) is discrete:

F(x, y) = ∑_{xi ≤ x} ∑_{yj ≤ y} f(xi, yj).

If (X, Y) is continuous:

F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du.

238
Reading: DeGroot and Schervish, Chapter 3.6.
5.2. Conditional distributions and stochastic
independence
Definition
a) Let X and Y be two discrete random variables with joint probability function pij. The conditional probability function of X given that Y = yj is defined as follows:

f_{X|Y=yj}(xi) = f_{X,Y}(xi, yj) / fY(yj) = p_{i,j} / p_{•,j} ,  if p_{•,j} > 0, and 0 else.

b) Let X and Y be two continuous random variables with joint density function f(x, y). The conditional density function of X given that Y = y is defined as follows:

f_{X|Y=y}(x) = f_{X,Y}(x, y) / fY(y) ,  if fY(y) > 0, and 0 else.
240
Example 5.2.1: (discrete case)
Consider tossing a coin three times. Define

(X,Y)=(" # heads in first toss ", " # heads ").

We compute the conditional distribution of X given


that Y=1.
        Y    y1=0   y2=1   y3=2   y4=3   f_X
X
x1=0         1/8    1/4    1/8    0      1/2
x2=1         0      1/8    1/4    1/8    1/2
f_Y          1/8    3/8    3/8    1/8    1
241
Example 5.2.1 (continued):

f_{X|Y=1}(x):   X = x1 = 0:   p_{1,2} / p_{·,2} = (1/4) / (3/8) = 2/3
                X = x2 = 1:   p_{2,2} / p_{·,2} = (1/8) / (3/8) = 1/3

Another example:
f_{Y|X=0}(y):   Y = 0: 1/4,   Y = 1: 1/2,   Y = 2: 1/4,   Y = 3: 0,  ...
242
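A minimal sketch in Python of the computation above: the conditional probability function is obtained by dividing a column (or row) of the joint table of Example 5.2.1 by the corresponding marginal probability. Variable names are illustrative.

import numpy as np

# Joint probabilities p_ij from Example 5.2.1: rows X = 0, 1; columns Y = 0, 1, 2, 3
p = np.array([[1/8, 1/4, 1/8, 0.0],
              [0.0, 1/8, 1/4, 1/8]])

p_X = p.sum(axis=1)     # marginal of X: [1/2, 1/2]
p_Y = p.sum(axis=0)     # marginal of Y: [1/8, 3/8, 3/8, 1/8]

# Conditional distribution of X given Y = 1 (column index 1)
print(p[:, 1] / p_Y[1])     # [2/3, 1/3]

# Conditional distribution of Y given X = 0 (row index 0)
print(p[0, :] / p_X[0])     # [1/4, 1/2, 1/4, 0]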
Example 5.2.2: (continuous case)
Let (X,Y) denote a two-dimensional continuous
random vector with joint density given by
   f(x, y) = λ² · e^(−λx)   for 0 ≤ y ≤ x,   and 0 else.

→ Marginal densities:

   f_X(x) = ∫₀ˣ λ² · e^(−λx) dy = λ² · e^(−λx) · x,   x ≥ 0

   f_Y(y) = ∫_y^{+∞} λ² · e^(−λx) dx = λ² · [−(1/λ) · e^(−λx)]_y^{+∞} = λ · e^(−λy),   y ≥ 0

(→ Y is exponentially distributed: Exp(λ)).
243
Example 5.2.2 (continued):

How should we specify the domain of the joint density
function (and therefore the integrals' boundaries)?

[Sketch: the region A between the x-axis and the line y = x]

A = {(x, y) | x ∈ [0, ∞), y ∈ [0, x]}
or
A = {(x, y) | y ∈ [0, ∞), x ∈ [y, ∞)}
244
Example 5.2.2 (continued):

→ conditional density of X given Y:

   f_{X|Y=y}(x) = [λ² · e^(−λx) · 1{0 ≤ y ≤ x}] / [λ · e^(−λy) · 1{y ≥ 0}] = λ · e^(−λ(x−y)),   x ≥ y ≥ 0.

→ conditional density of Y given X:

   f_{Y|X=x}(y) = [λ² · e^(−λx) · 1{0 ≤ y ≤ x}] / [λ² · e^(−λx) · x · 1{x ≥ 0}] = 1/x,   0 ≤ y ≤ x.
245
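A simulation sketch of Example 5.2.2 (assumptions: λ = 2 is an arbitrary illustrative value; the marginal density λ²·x·e^(−λx) of X is the Gamma density with shape 2 and rate λ, and Y given X = x is uniform on [0, x], as derived above). If the derivation is right, the simulated Y should behave like an Exp(λ) variable.

import numpy as np

rng = np.random.default_rng(seed=1)    # seed chosen arbitrarily
lam = 2.0                              # illustrative value of lambda
n = 200_000

# X has marginal density lambda^2 * x * exp(-lambda*x), i.e. Gamma(shape=2, scale=1/lambda)
x = rng.gamma(shape=2.0, scale=1/lam, size=n)
# Given X = x, Y is uniform on [0, x] (conditional density 1/x)
y = rng.uniform(0.0, x)

print(y.mean(), 1/lam)       # sample mean of Y vs. theoretical mean 1/lambda
print(y.var(),  1/lam**2)    # sample variance of Y vs. theoretical variance 1/lambda^2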
Definition (independent random variables)

Two random variables X and Y with joint probability or
density function f_{X,Y} and marginal probability or density
functions f_X and f_Y are said to be independent if and only if

   f_{X,Y}(x, y) = f_X(x) · f_Y(y),   −∞ < x, y < +∞.

It also follows that (for all y with f_Y(y) > 0 and all x
with f_X(x) > 0, respectively)

   f_{X|Y=y}(x) = f_X(x)   and   f_{Y|X=x}(y) = f_Y(y).
246
Reading: Chapter 4.6, DeGroot and Schervish
5.3. Covariance and correlation
Definition
Let X and Y be two random variables with joint probability
or density function f(x,y). The expectation of the two-
dimensional function g(X,Y) is defined as

   E[g(X,Y)] = Σ_i Σ_j g(x_i, y_j) · f(x_i, y_j),   if (X,Y) discrete;

   E[g(X,Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) · f(x, y) dy dx,   if (X,Y) continuous.

248
We introduce summaries of a joint distribution that
enable us to measure the relationship between two
random variables, i.e. their tendency to vary together
rather than independently.
Definition (covariance)

Let X and Y be random variables with finite means
μ_X and μ_Y.
The covariance of X and Y is defined as

   Cov(X, Y) = E[(X − μ_X) · (Y − μ_Y)],

if the expectation exists.
249
Multiplication rule for expectations:
For all random variables X and Y with finite variance:
E [ XY ] = E [ X ] ⋅ E [Y ] + Cov ( X ,Y ) .

If X and Y are independent random variables with


finite variance, then
E [ XY ] = E [ X ] ⋅ E [Y ].
As a consequence, the covariance yields a necessary condition
for independence that can be used to rule it out:

   X, Y independent  ⇒  Cov(X, Y) = 0,
   equivalently,  Cov(X, Y) ≠ 0  ⇒  X, Y not independent.
250
Computational rules for the covariance

The covariance is a so-called bilinear operator, which


means that it is linear in both arguments. Let U, V, X, Y
be random variables with finite means and variances and
a, b, c, d real-valued constants.
Then: Cov(a·U+b·V, c·X+d·Y) =
a·c·Cov(U,X) + a·d·Cov(U,Y) + b·c·Cov(V,X)+ b·d·Cov(V,Y)

Remark: The covariance can also be seen as a scalar


product.
Thus, when we say that X and Y are orthogonal, we
mean that Cov(X,Y) = 0.
251
Definition (correlation)
Let X and Y be random variables with finite variances σx
and σy, respectively.
Then the correlation of X and Y is defined as follows:
   ρ_{X,Y} = Cov(X, Y) / (σ_X · σ_Y).

It is said that X and Y are positively correlated if ρ_{X,Y} > 0,
that X and Y are negatively correlated if ρ_{X,Y} < 0,
and that X and Y are uncorrelated if ρ_{X,Y} = 0.

252
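A minimal sketch computing Cov(X, Y) and ρ_{X,Y} directly from the joint table of Example 5.2.1 (X = number of heads in the first toss, Y = total number of heads); the printed values follow from the table, not from simulation.

import numpy as np

x_vals = np.array([0.0, 1.0])               # values of X
y_vals = np.array([0.0, 1.0, 2.0, 3.0])     # values of Y
p = np.array([[1/8, 1/4, 1/8, 0.0],         # joint probabilities p_ij
              [0.0, 1/8, 1/4, 1/8]])

EX  = (x_vals[:, None] * p).sum()           # E[X]  = 1/2
EY  = (y_vals[None, :] * p).sum()           # E[Y]  = 3/2
EXY = (np.outer(x_vals, y_vals) * p).sum()  # E[XY] = 1

cov = EXY - EX * EY                         # multiplication rule: Cov = E[XY] - E[X]E[Y]
VX  = ((x_vals - EX)[:, None]**2 * p).sum()
VY  = ((y_vals - EY)[None, :]**2 * p).sum()
rho = cov / np.sqrt(VX * VY)

print(cov, rho)                             # 0.25 and about 0.577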
Reading: Chapter 4.2/6, DeGroot and Schervish
5.4. Sums and sample means of random
variables

Expected value of a sum of two random variables?

   E[X + Y] = Σ_i Σ_j (x_i + y_j) · f(x_i, y_j)

            = Σ_i Σ_j x_i · f(x_i, y_j) + Σ_i Σ_j y_j · f(x_i, y_j)

            = E[X] + E[Y]

This result can be generalized to:

   E[X_1 + ... + X_n] = E[X_1] + ... + E[X_n].
254
Variance of a sum of two random variables?

   V(X + Y) = E[((X + Y) − (μ_X + μ_Y))²]

            = E[((X − μ_X) + (Y − μ_Y))²]

            = E[(X − μ_X)² + (Y − μ_Y)² + 2·(X − μ_X)(Y − μ_Y)]

            = E[(X − μ_X)²] + E[(Y − μ_Y)²] + 2·E[(X − μ_X)(Y − μ_Y)]

            = V(X) + V(Y) + 2·Cov(X, Y)

Thus, in the case of uncorrelated random variables,
this can be generalized to:

   V(X_1 + ... + X_n) = V(X_1) + ... + V(X_n).
255
Sample mean of uncorrelated random variables:

Let X_1, ..., X_n be uncorrelated random variables with
common mean μ and common variance σ² (both finite). Then:

   E(X̄_n) = E[(1/n)(X_1 + ... + X_n)] = (1/n) · Σ_{i=1}^{n} E[X_i] = (1/n) · n · μ = μ

   V(X̄_n) = V[(1/n)(X_1 + ... + X_n)] = (1/n²) · Σ_{i=1}^{n} V(X_i) = (1/n²) · n · σ² = σ²/n

   ⇒ σ_{X̄_n} = σ / √n

(the standard deviation is reduced by the factor √n: the so-called √n-rule!)
256
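A short simulation sketch of the √n-rule (the normal population, its parameters, and the sample sizes are arbitrary illustrative choices): the empirical standard deviation of the sample mean shrinks like σ/√n.

import numpy as np

rng = np.random.default_rng(seed=2)
sigma = 2.0                          # population standard deviation (illustrative)
reps = 20_000

for n in (4, 16, 64, 256):
    means = rng.normal(loc=5.0, scale=sigma, size=(reps, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))   # empirical vs. theoretical sigma / sqrt(n)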
Example 5.4.1:

Let us consider the following game:


To participate the player has to pay 1 Euro (fee).
A fair coin is tossed three times: for each ‘‘head’’ the
player wins exactly “1 Euro’’.

257
Example 5.4.1 (continued):
a) Describe the random variable
X: “player’s net winnings’’
using Bernoulli distributed random variables.

Show that E[X] = ½ and V(X) = ¾.


 1
3 y
 i ,1 =1, "head" , p = , Yi are independent,
X = ∑Yi -1, Yi =  2
i =1  y i ,2 =0, else.

 3  3 3
1 1
→ E  X  = E  ∑Yi -1 = ∑ E Yi  -1 = ∑ p -1 = 3 ⋅ − 1 = ;
 i =1  i =1 i=1 2 2
 3  3 3
1 3
→ V ( X ) = V  ∑Yi -1 = ∑ V ( Yi ) = ∑ p ⋅ (1-p ) = ⋅ 3 = .
 i =1  i=1 i=1 4 4
258
Example 5.4.1 (continued):
b) Anton plays the game three times consecutively.
Let U be the winnings after three games. Express
the random variable U using

X_i = ''player's net winnings in game i'', i = 1, 2, 3.

Compute E[U] and V(U).


   U = X_1 + X_2 + X_3,   with the X_i independent.

→ E[U] = 3 · E[X] = 3/2;

→ V(U) = 3 · V(X) = 3 · (3/4) = 9/4.
259
Reading: Chapter 6.3, DeGroot and Schervish
6. The Central Limit Theorem

261
Let us consider a sequence of n random variables.
Assume that the random variables X1,..., Xn are
independent and identically distributed (i.i.d) with
(both finite)
   E[X_i] = μ   and   V(X_i) = σ².

This sequence of random variables is called a


random sample of size n.
How can we generate it in practice? → two (main)
cases:
1) random sampling with replacement;

2) series of tests / experiments. 262


Example 6.1:
Let X describe the winnings from a gambling game.
If we play that game several times in a row we get the random
sample X1,..., Xn.
n
The total winnings after n games is Sn =∑ Xi .
i=1
1
The average winnings per game after n games is Xn = Sn .
n
Question: What is the probability that after a lot of games the
total winnings is between a and b, i.e., P[a ≤ Sn ≤ b] = ?

→ The central limit theorem gives us an approximate way to


answer this type of question, in particular when n is large
(n→∞).
263
Theorem: Central Limit Theorem (CLT)

Let X1, X2,..., Xn be i.i.d. random variables with μ = E[Xi]


and σ² = V(X_i) (both finite). Let S_n be the sum and
X̄_n = S_n / n the sample mean of the random sample. Then, the
distribution function F_n of the standardized variable

   Z_n = (S_n − n·μ) / (σ·√n) = (X̄_n − μ) / (σ/√n)

converges for n → ∞ to the standard normal distribution:

   F_n(z) → F_Z(z)   for every z.

Special case: Xi i.i.d. Bernoulli distributed.


264
Illustration of the Central Limit Theorem:
Let X1, X2,..., Xn be a random sample from a
continuous uniform distribution on [0,1].

265
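A simulation sketch of this illustration (the sample sizes and the number of replications are arbitrary choices): standardized sample means of uniform [0,1] samples are compared with the standard normal distribution via a few quantiles.

import numpy as np

rng = np.random.default_rng(seed=3)
mu, sigma = 0.5, np.sqrt(1/12)       # mean and standard deviation of the uniform [0,1] distribution
reps = 50_000

for n in (2, 5, 30):
    xbar = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))          # standardized sample mean Z_n
    print(n, np.quantile(z, [0.025, 0.5, 0.975]))   # N(0,1) reference: about [-1.96, 0, 1.96]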
Theorem: Limit Theorem of De Moivre and Laplace
Let Sn be a binomially distributed random variable with
parameters n and p.
Then its distribution function converges with increasing n
towards a normal distribution with corresponding moments:
   F_Bi(s_n; n, p) → F_N(s_n; n·p, n·p·(1 − p))

Similarly, the distribution function F_n of the standardized
variable

   Z_n = (S_n − n·p) / √(n·p·(1 − p)) ≡ (X̄_n − p) / √(p·(1 − p)/n)

converges for n → ∞ to the standard normal distribution:

   F_n(z) → F_Z(z).
266
Example 6.2: (finance)
A financial theory assumes that the share (log-) prices
in efficient markets behave according to the so-called
random walk:   K_t = K_{t−1} + ε_t.

As a consequence, the returns are given by

   ε_t = K_t − K_{t−1},   with E[ε_t] = 0 and V(ε_t) = σ²,

i.e., they all have the same expected value zero and the
same variance σ².
267
Example 6.2 (continued):

The monthly return would then be a sum


   ε_t + ε_{t+1} + ... + ε_{t+n},   with n = 22

(approximate number of working days in a month).

→ By the CLT, the monthly returns are approximately
normally distributed with expected value zero and n
times the variance.
→ naive prediction: the price stays at the same level
it is today.

268
Example 6.3:

Let X1,..., X12 be i.i.d. from a uniform distribution on


[−½, ½], and S_12 = Σ_{i=1}^{12} X_i.

We know that E[X_i] = 0 = μ and V(X_i) = (b − a)²/12 = 1/12 = σ².

Applying the CLT we get that

   S_12 ~ (approx.) N(n·μ, n·σ²) = N(0, 1).

Remark: With as few as n=12 we get a reasonably good


approximation of the true distribution using the CLT.
269
Example 6.4:
We toss a coin 100 times and we get ‘‘heads’’ exactly
60 times. Is the coin fair?

Let X_i = 1 if toss i shows heads (i = 1, ..., 100), and X_i = 0 else.

Xi is Bernoulli distributed with p=½ (under the


assumption that the coin is fair).

270
Example 6.4 (continued):

S100 is binomially distributed with p=½, n=100.


Standardization (with n·p = 50 and n·p·(1 − p) = 25):

   P[S_100 ≥ 60] = P[ Z_100 = (S_100 − 50)/√25 ≥ (60 − 50)/√25 ]

   CLT: approx. = 1 − F_Z(2) = 0.0228.

Thus: Assuming a fair coin, the probability of the


event {S100 ≥ 60} is very small. Given that we observed
this event, we may question the fairness of the coin.

Remark: Use a correction for continuity when


approximating discrete distributions using the CLT.
271
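A numerical sketch of Example 6.4 using only the Python standard library: the exact binomial probability P[S_100 ≥ 60] is compared with the CLT approximation from the slide and with the continuity-corrected version mentioned in the remark.

import math

n, p = 100, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))      # 50 and 5

def phi(z):                                     # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

exact      = sum(math.comb(n, k) for k in range(60, n + 1)) / 2**n
approx_clt = 1 - phi((60 - mu) / sd)            # plain CLT approximation (slide value 0.0228)
approx_cc  = 1 - phi((59.5 - mu) / sd)          # with continuity correction

print(exact, approx_clt, approx_cc)             # about 0.0284, 0.0228, 0.0287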
Part II: Statistics

272
Representative random sample:

The difficulty in analyzing many phenomena, be they


economic, social, or otherwise, is that there is simply too
much information for the mind to assimilate.
It would be more useful to have much less information, but
information which was still representative of the original
data. In achieving this, much of the original information
would be deliberately lost.

Remark: There is no formal statistical definition of


representative.
273
A very important characteristic of statistical variables is
the scale in which they are measured.
Depending on the scale, variables can be divided into
different classes.
The appropriate method of analyzing the data as well as
the possible statistical evaluation depend on this
classification.

Other relevant factors are the sophistication of the
audience and the 'message' which one intends to convey.
274
1. Nominal scale:

A variable is measured on a nominal scale when


there is not an obvious natural ordering of the
outcomes: only equality or inequality of the outcomes
can be determined.

2. Ordinal scale:

A variable is measured on an ordinal scale when the


different outcomes can be naturally ordered, but the
‘distance’ between them cannot be measured.

275
3. Ratio scale:

A variable is measured on a ratio scale when not


only the different outcomes can be ordered or
ranked, but also the distance between them can be
computed.
The outcomes in this case must be numbers.

Finally, another important distinction is whether we


have to analyze a sample of cross-sectional data
(measured at one specific point in time) or a sample
of time-series data.
276
Reading: Chapter 1, Barrow; Chapters 1-3, ASWFS
7. Descriptive statistics

278
Goal:
The task of descriptive statistics is to introduce a
number of descriptive, in most cases graphical
methods to summarize all the information about the
variables under investigation and illustrate the main
features, without distorting the picture.

279
7.1. Frequency tables, histograms, and
empirical distributions

281
Example 7.1.1: (radioactive decay of Americium-241)

The radioactive element Am-241 emits by decay α -


particles. We are interested in the number of emissions
in given intervals of a fixed length, for example 10
seconds.

We therefore observe the decay process for some time


and we want to find a suitable model for the recorded
data.

For measuring purposes, we split the whole recording


period in 1207 intervals, each lasting 10 seconds.
282
Example 7.1.1 (continued):

In each interval we count the number of emissions, yielding


the data x1,…, x n ; n =
1207. The total number of recorded
emissions is 10,129.

In a first step we build the following classes:
- for y_j = 3, 4, ..., 16, we count the number of intervals with
exactly y_j emissions;
- for intervals with 0-2 emissions and for intervals with 17 or
more emissions we build two boundary classes.

This results in the following frequency table (next slide):


283
Example 7.1.1 (continued):
Frequency table:
Class
interval 0-2 3 4 5 6 7 8 9
(emissions)
Numbers 18 28 56 105 126 146 164 161

Class
interval 10 11 12 13 14 15 16 >17
(emissions)
Numbers 123 101 74 53 23 15 9 5

Let us draw a histogram for the recorded data (see


next slide):
284
Example 7.1.1 (continued):

285
Example 7.1.2: (‘population pyramids’, book page 22)
Remark: Histograms can also be used for variables
measured on a nominal or ordinal scale (bar charts).

Another helpful method is the empirical distribution
function F_n, defined as follows:

   F_n(y) = (# observations x_i with x_i ≤ y) / n = (1/n) · Σ_{j: y_j ≤ y} f_j.

Fn gives an estimate of the distribution function F that


generated the observed data. Clearly, Fn is constant
except for the outcomes y1, …., ym. Usually one plots
the points ( y j , Fn ( y j )) , for j = 1, ...., m. 286
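A minimal sketch of the empirical distribution function, applied to the small data set of Example 7.2.1 below (any sample would do); it prints the points (y_j, F_n(y_j)) that one would plot.

import numpy as np

x = np.array([4, 7, 7, 7, 12, 12, 13, 16, 19, 23, 23, 97])   # observations (Example 7.2.1)

def F_n(y, data=x):
    # Empirical distribution function: share of observations <= y
    return np.sum(data <= y) / len(data)

for y_j in np.unique(x):              # the distinct outcomes y_1 < ... < y_m
    print(y_j, F_n(y_j))              # points (y_j, F_n(y_j))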
Example 7.1.1 (Am-241, continued):
Empirical distribution function:

   y_j        0   1   2       3       4       5       6       7       ...
   F_n(y_j)   ?   ?   0.0149  0.0381  0.0845  0.1715  0.2759  0.3969  ...

[Plot: empirical distribution function F_n(y_j) against y_j]
287
7.2. Summarizing data using numerical
techniques

Definition (measures of location)

a1) Arithmetic mean (or average) is the most


familiar measure of location: (→ ratio scale)

   x̄ = (1/n) · Σ_{i=1}^{n} x_i

289
Definition (measures of location)
a2) Median: (→ ordinal and ratio scale)
   x_Med = x_((n+1)/2),                     if n odd,
   x_Med = (1/2) · (x_(n/2) + x_(n/2+1)),   if n even
   (where x_(1) ≤ ... ≤ x_(n) are the ordered observations).

a3) Mode: (→ nominal, ordinal, and ratio scale)

   x_M = x_i with h_i ≥ h_j for all j ≠ i   (the most frequent outcome).

290
Example 7.2.1:
Let us consider the following observations:
4, 7, 7, 7, 12, 12, 13, 16, 19, 23, 23, 97 .

We compute the different measures of location:

   mean     x̄ = 20;
   mode     x_M = 7;
   median   x_Med = (12 + 13)/2 = 12.5.
291
Definition (quantiles)

b) From the ordered observations x(1), ...., x(n) we can


compute the so-called empirical α - quantiles for
different probability levels α ∈ (0,1) as follows:

Compute first K = ⌊α·n⌋ + 1, where ⌊·⌋ denotes the
integer part of the number α·n. Then get the
empirical α-quantile as

   x_(K),                        if α·n is not an integer number;
   (1/2) · (x_(K−1) + x_(K)),    if α·n is an integer number.
Interpretation: α -percent of the observations lie
below the empirical α - quantile. 292
Example 7.2.2:
Let α = 75%. For n = 100, α·n = 75 is an integer
number, thus K = 76.
We have to choose a value z between x_(75) and x_(76)
if we want 75% of the observations to lie below z.
The empirical 75%-quantile (also called Q3 or third
quartile) (1/2)·(x_(75) + x_(76)) fulfills the requirement.

→ For n = 101: α ⋅ n = 75.75 ⇒ K=76,


α ⋅ n not an integer number ⇒ Q3 = x (76) .

293
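A direct translation of the quantile rule above into code (a sketch: note that statistical software often uses other interpolation conventions, so library defaults may give slightly different values). It is checked against the quartiles of Example 7.2.4 below.

import math

def empirical_quantile(data, alpha):
    # Empirical alpha-quantile following the rule K = floor(alpha * n) + 1
    x = sorted(data)
    n = len(x)
    k = math.floor(alpha * n) + 1
    if alpha * n == int(alpha * n):           # alpha * n is an integer
        return 0.5 * (x[k - 2] + x[k - 1])    # (x_(K-1) + x_(K)) / 2
    return x[k - 1]                           # x_(K)

data = [11, 12.5, 15, 18, 19.5, 23, 25.6, 28, 29, 30, 31.5, 34, 35, 38]  # Example 7.2.4
print(empirical_quantile(data, 0.25),    # Q1 = 18
      empirical_quantile(data, 0.50),    # Q2 = 26.8
      empirical_quantile(data, 0.75))    # Q3 = 31.5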
Location measures like the mean, median, or mode
give only some information about the central
tendency of a distribution.

Two distributions might have the same location


parameter while being different.

A measure of dispersion can help distinguish


among distributions with the same location measure.

294
Definition (measures of dispersion)
c1) Range, which is the difference between the
smallest and largest observations, is the
simplest measure of dispersion:
range = xmax – xmin.

c2) Mean-quartile range as dispersion measure:

   MQA = ((Q3 − Q2) + (Q2 − Q1)) / 2 = IQA / 2,

where IQA = Q3 − Q1 is called the inter-quartile range.

295
Example 7.2.4:
n = 14 observations
range = 38 − 11 = 27

   11  12.5  15  18  19.5  23  25.6  28  29  30  31.5  34  35  38
               Q1              Q2                Q3

   Q2 = (1/2) · (x_(7) + x_(8)) = (25.6 + 28)/2 = 26.8
   Q1 = x_(4) = 18
   Q3 = x_(11) = 31.5
   ⇒ IQA = Q3 − Q1 = 13.5  ⇒  MQA = 6.75.
296
Definition (measures of dispersion)
c3) Variance and standard deviation as measures
of dispersion:
The mean of the squared distances of the
observations from the arithmetic mean
   s_x² = (1/n) · Σ_{i=1}^{n} (x_i − x̄)²

is called the (empirical) variance.
The positive square root of the variance

   s_x = +√(s_x²)
is called (empirical) standard deviation. 297
Example 7.2.5:
Consider the observations
3, 5, 9, 9, 6, 6, 3, 7, 7, 6, 7, 6, 5, 7, 6, 9, 6, 5, 3, 5.
Let us compute the empirical variance:

   j   x_j   n_j   h_j    h_j·x_j   x_j − x̄   (x_j − x̄)²   h_j·(x_j − x̄)²
   1   3     3     0.15   0.45      −3         9            1.35
   2   5     4     0.20   1.00      −1         1            0.20
   3   6     6     0.30   1.80       0         0            0
   4   7     4     0.20   1.40       1         1            0.20
   5   9     3     0.15   1.35       3         9            1.35

   n = 20    Σ h_j = 1    x̄ = Σ h_j·x_j = 6              s_x² = Σ h_j·(x_j − x̄)² = 3.1
298
Definition

The measures of dispersion introduced so far are all


measures of absolute dispersion and their values depend
upon the units in which the variable is measured.

To compare the degrees of dispersion of two variables


measured in different units we have to define a measure
of relative dispersion such as the coefficient of variation
defined (provided that x ≠ 0) as
sx
VK x = .
x

299
Example 7.2.6: (stock prices, 250 working days)

   Daimler Chrysler share:   x̄ = 50.59 Euro,   s_x = 36.18 Euro
   Porsche AG share:         ȳ = 396.10 Euro,  s_y = 182.96 Euro

   ⇒ VK_x = 36.18 / 50.59 = 0.72
     VK_y = 182.96 / 396.10 = 0.46
Thus, although it shows a smaller standard deviation, the
Daimler Chrysler-share has larger relative dispersion.

→ VK is often used to measure the volatility of stock prices!

300
7.3. Boxplot

[Schematic of a boxplot (use the right scale!):]
- upper whisker end: d = largest observation x_i with x_i − Q3 < 1.5 · IQA
- Q3 = 75%-quantile (25% of the data lie above Q3)
- Q2 = median
- Q1 = 25%-quantile (25% of the data lie below Q1)
- box height: IQA = Q3 − Q1
- lower whisker end: c = smallest observation x_i with Q1 − x_i < 1.5 · IQA
- observations beyond the whiskers (marked x) are outliers
302
Example 7.3.1: (lifetime of 16 devices in months)

1.5; 3.5; 6.5; 11.5; 12.5; 14; 17; 17; 19; 20; 23.5;
32.5; 34.5; 39; 55.5; 119

   Q2 = (x_(8) + x_(9))/2 = 18;   Q1 = (x_(4) + x_(5))/2 = 12;   Q3 = (x_(12) + x_(13))/2 = 33.5

   ⇒ IQA = 21.5,   1.5 · IQA = 32.25,   Q3 + 1.5 · IQA = 65.75,   Q1 − 1.5 · IQA < 0

[Boxplot of the data:]
- upper whisker ends at 55.5; 119 is an outlier
- lower whisker ends at 1.5 (the smallest observation, since Q1 − 1.5 · IQA < 0)
- small relative dispersion
- not symmetric (more concentration around Q1/Q2)
303
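A sketch reproducing the numbers behind the boxplot of Example 7.3.1 (the quantile rule of Section 7.2 and the 1.5·IQA convention from the schematic above are used; everything else is an implementation choice).

import math

def quantile(x_sorted, alpha):
    n = len(x_sorted)
    k = math.floor(alpha * n) + 1
    if alpha * n == int(alpha * n):
        return 0.5 * (x_sorted[k - 2] + x_sorted[k - 1])
    return x_sorted[k - 1]

data = sorted([1.5, 3.5, 6.5, 11.5, 12.5, 14, 17, 17, 19, 20,
               23.5, 32.5, 34.5, 39, 55.5, 119])

q1, q2, q3 = (quantile(data, a) for a in (0.25, 0.50, 0.75))
iqa = q3 - q1
d = max(x for x in data if x - q3 < 1.5 * iqa)      # end of the upper whisker
c = min(x for x in data if q1 - x < 1.5 * iqa)      # end of the lower whisker
outliers = [x for x in data if x - q3 >= 1.5 * iqa or q1 - x >= 1.5 * iqa]

print(q1, q2, q3, iqa)      # 12, 18, 33.5, 21.5
print(c, d, outliers)       # 1.5, 55.5, [119]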
7.4. Quantile-Quantile-plot (QQ-plot)
Sometimes we have an idea about the distribution of
the stochastic process that generated the data.

A QQ-plot is a graphical device that allows us to


investigate whether our assumption about the
distribution is supported by the observed data.

For practical purposes, generally one uses specific


software.

But: what is the theory underlying the QQ-plot?


305
Assumption:
The data x1, …., xn are the realizations of random
variables X1,….., Xn, from a common distribution F.

Considering the analogy between empirical and


theoretical quantiles, we expect
x ( α n  +1) ≈ F-1 (α )
 
given that about α -percent of the data are smaller than
x ( α n  +1) per construction, and, on the other hand, values
of Xi are smaller than F-1(α ) with probability α :

P  X i ≤ F−1 (α ) = F ( F-1 (α ) ) ≅ α .


306
Let K = ⌊αn⌋ + 1; then (K − 1)/n ≈ α, and thus we expect

   x_(K) ≈ F^(−1)(α) ≈ F^(−1)((K − 1)/n) ≈ F^(−1)((K − ½)/n).

Therefore, if the two distributions being compared
are similar, the points

   ( F^(−1)((K − ½)/n), x_(K) ),   K = 1, ..., n,

must lie approximately on the line y = x.
307
Practically, we plot the theoretical quantiles F^(−1)((K − ½)/n)
vs. the empirical quantiles x_(K) and graphically
investigate how much the points deviate from a line.

Example 7.3.1 (continued):
[QQ-plot of the data; theoretical distribution: normal distribution.]
308
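A sketch of the QQ-plot construction described above, using the data of Example 7.3.1 and a normal distribution as the theoretical model. Fitting the normal by the sample mean and standard deviation, and using statistics.NormalDist for F^(−1), are implementation choices, not part of the slides.

import numpy as np
from statistics import NormalDist

data = np.sort(np.array([1.5, 3.5, 6.5, 11.5, 12.5, 14, 17, 17, 19, 20,
                         23.5, 32.5, 34.5, 39, 55.5, 119]))   # Example 7.3.1
n = len(data)

# Theoretical quantiles F^(-1)((K - 1/2)/n) of a normal distribution fitted to the data
normal = NormalDist(mu=data.mean(), sigma=data.std())
theo = [normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]

for t, x_k in zip(theo, data):
    print(round(t, 1), x_k)    # the plotted points; strong deviations from the line y = x
                               # (here, e.g., for the value 119) speak against normality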
7.5. Scatter diagram
For variables measured on the ratio scale, we can compute
the differences among the observations.

One-dimensional: we plot the individual observations as


points on the x-axis.
Two-dimensional: we display the data as a collection of
points, each having the value of one variable on the x-axis
and the value of the other variable on the y-axis.

Usually the goal of this graphical method is to get an idea


about how to define the classes in a histogram and/or about
the relation between two variables.
310
Example 7.5.1:
One-dimensional:
x xxx xxx xxx xxx x
Two-dimensional:
waiting time between eruptions
and the duration of eruption for
the Old Faithful geyser in
Yellowstone National Park.

Scatter diagrams are easy to construct and to interpret


(provided that the sample size is not too large): one can see
the domain, concentrations of values (clusters), outliers,
relation between two variables,… 311
Reading: Chapter 7.1, DeGroot and Schervish
8. Estimation of unknown
parameters

313
A central problem in statistics consists of the
identification of random variables of interest, the
specification of a joint distribution or a family of
possible joint distributions for the observable random
variables, and the identification of any parameters of
those distributions that are assumed unknown.

These tasks must be done using an observable


random sample X1,…, Xn with realizations x1,…, xn.
In particular, the unknown parameters must be
estimated as accurately as possible.
314
For the estimation of the unknown parameters, we
have two main approaches:

1) Point estimation:
For each parameter one gets a single value from
the sample as a result of the estimation procedure.

2) Confidence intervals:
The idea is to get some intervals of values in which
the true unknown parameters are contained with
high probability (confidence).

315
The starting point in both approaches is the definition of a
so-called estimator (or statistic).

Suppose that the observable random variables of interest


are X1 ,…, Xn . Any arbitrary real-valued function

θ n ( X1 ,…, X n )

of the n random variables is called an estimator.


The estimator tells us how the random variables must be
handled to get an optimal estimation of the parameter(s).

316
Reading: Chapter 7.1, DeGroot and Schervish
8.1. Intuitive examples of estimators

We often have to deal in practice with the problem that


although we can sometimes safely identify the
underlying distribution generating the data, we have no
clue about the unknown parameters of that distribution.
Sometimes even the distribution of the data-generating process
is not easily understood from the data available.
However, we can still try to understand the main
features of the underlying distribution such as the
location or the variance parameter.
318
We want to estimate µ=E[X] thanks to a random
sample X1,..., Xn, generated from the population
distribution having probability or density function fx.

Idea: From the natural interpretation of µ as location


parameter of the distribution, use the arithmetic mean
of the sample (sample mean) as estimator:
   μ̂ = t(X_1, ..., X_n) = X̄_n = (1/n) · Σ_{i=1}^{n} X_i.

This type of estimation is called point estimation,
given that the result is a single value (and not a
plausible interval of values). 319
Example 8.1.1: (height of students)
We select a random sample of 10 students from all
students participating in a specific class. The height X
in cm is determined and reported in the following table:
i 1 2 3 4 5 6 7 8 9 10

xi 176 180 181 168 177 186 184 173 182 177

The estimate of the average height of the students in


the class is computed as

µˆ = x n = 178.4 cm.
320
Question: Is this a good estimator?

To answer this kind of question we need to introduce


properties that estimators must satisfy to be good estimators.
This is done in the next section.

Example 8.1.2: (binomial experiment)


Following the same reasoning as in the last example, if we
consider a binomial experiment, we should use as estimators
for the population variance σ2 and the success probability
parameter p (respectively):
   σ̂² = t(X_1, ..., X_n) = S_x² = (1/n) · Σ_{i=1}^{n} (X_i − X̄_n)²

and

   p̂ = t(X_1, ..., X_n) = (1/n) · Σ_{i=1}^{n} X_i.
321
Illustration: binomial experiment

322
Example 8.1.1 (continued):

Using the data summarized in the table on slide 320,


we get for the variance

   s_x² = 25.84.

Thus, a natural estimate for the variance (scale
parameter) of the underlying distribution is

   σ̂² = s_x² = 25.84.

323
Example 8.1.3: (firm’s lifetime)

324
Estimators t(X1,…, Xn) are random variables and therefore
have their own distribution.
Notation:
• symbol "^" means estimator (pronounced "Hat")
• T = t(X1,…, Xn) is a random variable; its realization (called
estimate) is based on the sample observations xi (i = 1,…,
n) of the corresponding random variables in the sample.
• E[T] = µT denotes the expected value of the estimator T.
• V(T) = E[(T − μ_T)²] = σ_T² denotes the variance of the
estimator T.

Note that often for a given parameter there may be different


estimators.
325
Example 8.1.4:
Let X be Poisson distributed with parameter λ; then:

   P(X = x) = (λ^x / x!) · e^(−λ),   with E[X] = V(X) = λ.

Now, should we use

   λ̂ = t(X_1, ..., X_n) = X̄,
   λ̂ = t(X_1, ..., X_n) = S_x²,

or even other functions as the estimator of λ?


This decision must be made based on whether the
candidate estimators satisfy some relevant goodness
properties.
326
Reading: Chapters 7.4 and 8.7, DeGroot and Schervish
8.2. Properties of estimators
Definition (goodness properties)

A) Unbiased estimators
An estimator T = t ( X1 ,…, X n ) = θˆ is an unbiased estimator
of θ if E [ T ] =μT exists and E [ T ] =μT =θ for all values of θ.

In fact, it is desirable to use an estimator T with probability


distribution concentrated around θ.

The difference between the expected value of T and the


parameter θ is called bias: bias = E [ T ] − θ.
328
Example 8.2.1: (sample mean and variance)

i) T = t(X_1, ..., X_n) = X̄_n = (1/n) · Σ_{i=1}^{n} X_i is an unbiased
   estimator for μ = E[X] (i.e., μ = E[X_i], i = 1, ..., n):

   E[T] = E[X̄_n] = E[(1/n) · Σ_{i=1}^{n} X_i] = (1/n) · Σ_{i=1}^{n} E[X_i] = (1/n) · n · μ = μ.

ii) T = t(X_1, ..., X_n) = S² = (1/n) · Σ_{i=1}^{n} (X_i − X̄)² is a biased
    estimator for σ² = V(X); in fact, its expectation
    equals ((n−1)/n) · σ² and not σ² (for proof see next slide).
329
Example 8.2.1 (continued):

   E[T] = E[S²] = E[(1/n) · Σ_{i=1}^{n} (X_i − X̄)²] = E[(1/n) · Σ_{i=1}^{n} ((X_i − μ) − (X̄ − μ))²]

        = (1/n) · E[Σ_{i=1}^{n} (X_i − μ)² − n · (X̄ − μ)²]

        = (1/n) · ( Σ_{i=1}^{n} E[(X_i − μ)²] − n · E[(X̄ − μ)²] )

        = (1/n) · ( n · V(X) − n · V(X̄) )

        = (1/n) · ( n · σ² − n · σ²/n ) = (1/n) · σ² · (n − 1) = ((n−1)/n) · σ²

=> Therefore, an unbiased estimator of σ² is given by

   T = t(X_1, ..., X_n) = (n/(n−1)) · S² = (1/(n−1)) · Σ_{i=1}^{n} (X_i − X̄)².
330
Example 8.2.1 (continued):
iii) Consider a Bernoulli random sample Z1,…, Zn:
   f_Z(z) = p^z · (1 − p)^(1−z),   z ∈ {0, 1}.

Then, the estimator

   t(Z_1, ..., Z_n) = Z̄ = (1/n) · Σ_{i=1}^{n} Z_i

is unbiased for the success probability parameter p.

This means that the share of successes in the


sample is an unbiased estimator for the success
probability parameter p in a binomial experiment.
331
Question: Which estimator for V(Z) = p(1 − p)?

Idea: Consider Z̄(1 − Z̄).

Is this estimator unbiased?

   E[Z̄(1 − Z̄)] = E[Z̄] − E[Z̄²] = p − (n²·p² + n·p·(1 − p)) / n²

                = ((n − 1)/n) · p·(1 − p) ≠ p·(1 − p)

But: E[Z̄(1 − Z̄)] → p·(1 − p) for n → ∞.
332
Graphical illustration:

333
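A simulation sketch of the bias just derived (p and the sample sizes are arbitrary choices): the average of Z̄(1 − Z̄) over many samples is close to ((n−1)/n)·p·(1−p) and approaches p·(1−p) as n grows.

import numpy as np

rng = np.random.default_rng(seed=4)
p = 0.3
reps = 100_000

for n in (5, 20, 100):
    zbar = rng.binomial(n, p, size=reps) / n          # sample shares Z-bar
    est = zbar * (1 - zbar)
    print(n, est.mean(), (n - 1) / n * p * (1 - p), p * (1 - p))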
Definition (goodness properties)

B) Asymptotically unbiased estimators

If the bias of an estimator becomes monotonically


smaller when the sample size increases, and in the
limit as n → ∞ vanishes, then we say that the
estimator is asymptotically unbiased:

lim E [ T ] = θ.
n →∞

334
Definition (goodness properties)

C) Consistent estimators
A sequence of estimators {θˆ n = t ( X1,…, Xn )}n that
converges in probability to the unknown parameter θ
being estimated, as n → ∞, is called consistent;
that is:

   P(|θ̂_n − θ| > ε) → 0 as n → ∞, ∀ε > 0.

Notation:

   θ̂_n →(P) θ as n → ∞.

335
Graphical illustration of consistency:
Data: firm’s lifetime example 8.1.3.

336
Theorem (practical consistency check):

An estimator is consistent when the following two


conditions are satisfied:

• it is unbiased (or at least asymptotically unbiased);

and

• as n → ∞, its variance vanishes.

337
Example 8.2.2: (sample mean)

Let X1,…, Xn be a random sample from a distribution


with expected value parameter µ and standard
deviation parameter σ.
Let T_n = t(X_1, ..., X_n) = X̄_n = (1/n) · Σ_{i=1}^{n} X_i.

We proved already that T_n is unbiased for μ.

Now: V(T_n) = V((1/n) · Σ_{i=1}^{n} X_i) = σ²/n → 0 as n → ∞.

⇒ By the theorem above, T_n is consistent for μ, i.e., T_n →(P) μ.
338
Definition (goodness properties)
D) (Relative) efficient estimators
Let T and U1, U2,…, UK denote unbiased estimators for
the unknown parameter θ :
E [ T ] = E [U1 ] =  = E [UK ] = θ.
Then, T is called efficient if
V ( T ) ≤ V (Ui ) , i=1,…, K.
If more than one unbiased estimator exists, we choose
the one with the smallest variance.
(→ we expect that the realized estimates are less
dispersed around the true unknown θ ).
339
Graphical illustration of efficiency
Let X1, …, Xn be a random sample from a distribution
with expected value μ and variance σ², and let n be
an even number. Then:

   X̄_a = (1/n) · Σ_{i=1}^{n} X_i

   X̄_b = (2/n) · Σ_{i=1}^{n/2} X_{2i}

are two unbiased estimators for the mean parameter,
given that

   E(X̄_a) = E((1/n) · Σ_{i=1}^{n} X_i) = (1/n) · Σ_{i=1}^{n} E(X_i) = (1/n) · n · E(X_i) = μ;

   E(X̄_b) = E((2/n) · Σ_{i=1}^{n/2} X_{2i}) = (2/n) · Σ_{i=1}^{n/2} E(X_{2i}) = (2/n) · (n/2) · E(X_{2i}) = μ.
340
Graphical illustration of efficiency (continued)
What about the variance? We can compute that

   V(X̄_a) = (1/n) · σ²   and   V(X̄_b) = (2/n) · σ².

[Histograms of the sampling distributions of X̄_a and X̄_b]
341
Graphical illustration of efficiency (regression)

342
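A simulation sketch of the efficiency comparison above (normal data, the parameter values, and the sample size are arbitrary choices): both estimators are unbiased, but X̄_b, which uses only every second observation, has about twice the variance of X̄_a.

import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma, n, reps = 10.0, 3.0, 50, 100_000     # n even

x = rng.normal(mu, sigma, size=(reps, n))
xbar_a = x.mean(axis=1)                 # uses all n observations
xbar_b = x[:, 1::2].mean(axis=1)        # uses only X_2, X_4, ..., X_n (n/2 observations)

print(xbar_a.mean(), xbar_b.mean())     # both close to mu (unbiased)
print(xbar_a.var(), sigma**2 / n)       # about sigma^2 / n
print(xbar_b.var(), 2 * sigma**2 / n)   # about 2 * sigma^2 / n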
Definition (goodness properties)

E) Mean squared error (MSE)

The mean squared error (MSE) complements the criteria
introduced so far to judge the goodness of an estimator.

Let T = t(X_1, ..., X_n) denote an estimator of the unknown
parameter θ. Then

   MSE(θ) = E[(T − θ)²]

is the mean squared error of the estimator T.


343
From the definition we see that for an estimator T with
finite variance, the MSE of T as an estimator of θ
equals its variance plus the square of its bias:

   MSE(θ) = E[(T − θ)²] = E[((T − μ_T) − (θ − μ_T))²]

          = E[(T − μ_T)²] − 2·(θ − μ_T)·E[T − μ_T] + (θ − μ_T)²        (note: E[T − μ_T] = 0)

          = E[(T − μ_T)²] + (θ − μ_T)²

          = V(T) + bias².
344
Reading: Chapters 7.5 and 7.6, DeGroot and Schervish
8.3. Main methods to get estimators

So far we focused our attention on the properties


required for estimators to be good estimators and
therefore yield accurate estimates.

Question: What general methods are available to yield


estimators satisfying the goodness
properties?

We have several possible approaches. Among them:


 Method of moments
 Least-squares method (regression)
 Maximum Likelihood method 346
A) Method of moments
Idea: estimate the moment of the underlying distribution
assumed to generate the data using the corresponding
sample moments.

Assume that X1 ,…, Xn form a random sample from a


distribution with at least k existing moments. Define
   μ_j(θ) = E_θ[X^j],   j = 1, ..., k.

Suppose that the function M^(−1)(θ) = (μ_1(θ), ..., μ_k(θ)) is a one-
to-one function of θ. Define the sample moments by

   m_j = (1/n) · Σ_{i=1}^{n} X_i^j,   j = 1, ..., k.
347
The method of moments estimator (MME) of θ is
M(m_1, ..., m_k).

   theoretical moments                sample moments
   μ_1 = E[X] = μ                     m_1 = (1/n) · Σ_{j=1}^{n} X_j
   μ_2 = E[X²] = μ² + σ²              m_2 = (1/n) · Σ_{j=1}^{n} X_j²
   ...                                ...
   μ_k = E[X^k]                       m_k = (1/n) · Σ_{j=1}^{n} X_j^k
348
The usual way of implementing the method of
moments is to set up the k equations m_j = μ_j(θ)
and then solve for θ.

For example, we then get the following method of
moments estimators for the mean and the variance
parameters, respectively:

   μ̂ = X̄_n;

   σ̂² = (1/n) · Σ_{i=1}^{n} (X_i − X̄_n)² = S².

Goodness properties of MMEs: Method of moments
estimators are consistent and asymptotically unbiased.
349
Illustration: normal distribution

350
B) Least-squares method
Idea: It is generally used in linear regressions.
Suppose that the goal is to estimate the unknown mean
parameter μ.
This method implies choosing as estimator of μ the
function μ̂ LS defined as
   μ̂_LS = argmin_μ Σ_{i=1}^{n} (X_i − μ)²,

that is, the statistic that minimizes the sum of the squared
distances between the sample observations and μ.

Solution:

   μ̂_LS = (1/n) · Σ_{i=1}^{n} X_i = X̄_n.
351
C) Maximum Likelihood method

Illustrative starting example:

A certain lion has three possible states of activity each
night; it is "very active" (denoted by θ_1), "moderately
active" (denoted by θ_2), or "lethargic" (denoted by θ_3).

Also, each night this lion eats people; it eats i people with
probability

   p(i | θ),   θ ∈ Θ = {θ_j, j = 1, 2, 3}.
The numerical values are given in the following table (see
next slide):
352
   i           0      1      2      3
   p(i | θ_1)  0.00   0.05   0.05   0.90
   p(i | θ_2)  0.05   0.05   0.80   0.10
   p(i | θ_3)  0.90   0.08   0.02   0.00
If we are told that exactly X = x people were eaten last
night, how should we estimate the lion's activity state?

One seemingly reasonable method is to estimate θ as
that θ ∈ Θ for which p(x|θ) is largest.

This kind of reasoning is the core of the maximum
likelihood method: we choose the most plausible
parameter, in the sense that it maximizes the
probability of the event we observe.
353
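To make this concrete, the following small Python sketch (an addition, not part of the original slides) encodes the table above and returns the state θ that maximizes p(x|θ) for an observed count x.

# Probabilities p(i | theta) taken from the table above.
p = {
    "theta1": [0.00, 0.05, 0.05, 0.90],  # very active
    "theta2": [0.05, 0.05, 0.80, 0.10],  # moderately active
    "theta3": [0.90, 0.08, 0.02, 0.00],  # lethargic
}

def ml_state(x):
    """Return the state theta maximizing p(x | theta)."""
    return max(p, key=lambda theta: p[theta][x])

for x in range(4):
    print(x, ml_state(x))
# x = 0 -> theta3, x = 1 -> theta3, x = 2 -> theta2, x = 3 -> theta1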
Maximum Likelihood method:
Let X1,…, Xn be a random sample from a known
population distribution fX depending on the parameter θ
to be estimated.
When the joint probability/density function of the
observations in a random sample is regarded as a
function of θ,
$L(\theta;\, x_1,\dots,x_n) = f_{(X_1,\dots,X_n)}(\theta;\, x_1,\dots,x_n) = \prod_{i=1}^{n} f_X(\theta;\, x_i),$

for given values of x1,…, xn, it is called the likelihood
function.
354
For each possible observed vector x1,…, xn, let
T(x1,…, xn) denote a value of θ for which the
likelihood function is a maximum, and let
$\hat{\theta}_{ML} = T(X_1,\dots,X_n)$
be the estimator of θ defined in this way. The
estimator $\hat{\theta}_{ML}$ is called a maximum likelihood
estimator (MLE) of θ:
$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta;\, X_1,\dots,X_n) = \arg\max_{\theta} \log L(\theta;\, X_1,\dots,X_n),$
where the two maximizers coincide because the logarithm is strictly increasing.

After (X1,…, Xn) = (x1,…, xn) is observed, the value
T(x1,…, xn) is called the maximum likelihood estimate.
355
Example 8.3.1:
MLE of p in a Bernoulli population:
$L(p;\, x_1,\dots,x_n) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}, \quad x_i \in \{0,1\}.$

$\log L(p;\, x_1,\dots,x_n) = \sum_{i=1}^{n} \big( x_i \log p + (1-x_i) \log(1-p) \big)$

$\frac{\partial \log L(p;\, x_1,\dots,x_n)}{\partial p} = \frac{\sum_{i=1}^{n} x_i}{\hat{p}} - \frac{\sum_{i=1}^{n} (1-x_i)}{1-\hat{p}} \overset{!}{=} 0$

$\Leftrightarrow \; (1-\hat{p}) \cdot \sum_{i=1}^{n} x_i = \hat{p} \cdot \Big( n - \sum_{i=1}^{n} x_i \Big)$

$\Rightarrow \; \hat{p}_{ML} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_n.$
356
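As a quick numerical cross-check (added here, not in the original slides), the sketch below maximizes the Bernoulli log-likelihood over a grid of p values and compares the grid maximizer with the closed-form MLE, the sample mean.

import numpy as np

def bernoulli_loglik(p, x):
    """Log-likelihood of a 0/1 sample x for parameter p."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)   # hypothetical 0/1 sample

grid = np.linspace(0.001, 0.999, 999)
p_grid = grid[np.argmax([bernoulli_loglik(p, x) for p in grid])]

print(p_grid, x.mean())  # the two values agree up to the grid resolution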
Example 8.3.2: Let X1,…, Xn be a random sample
from a uniform continuous distribution:
$f_{Uni}(x) = \begin{cases} \frac{1}{\theta}, & 0 \le x \le \theta \\ 0, & \text{else} \end{cases}$

$L(\theta;\, x_1,\dots,x_n) = \left(\frac{1}{\theta}\right)^{\!n}, \quad \text{if all } x_i \in [0,\theta].$

L is monotonically decreasing in θ, but all xi must lie in [0,θ],
that is, θ ≥ xi for i = 1,…, n

$\Rightarrow \; \hat{\theta}_{ML} = \max(X_1,\dots,X_n).$

357
Graphical illustration of example 8.3.2:

358
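Since the figure itself did not survive extraction, here is a minimal Python sketch (an assumed stand-in for the illustration) that evaluates the uniform likelihood for a small hypothetical sample: L(θ) is zero for θ below max(x) and decreases like θ^(-n) above it, so the maximum sits exactly at max(x).

import numpy as np

x = np.array([0.9, 2.1, 1.4, 3.2, 0.5])   # hypothetical sample

def uniform_likelihood(theta, x):
    """L(theta) = theta^(-n) if all observations lie in [0, theta], else 0."""
    return theta ** (-len(x)) if theta >= x.max() else 0.0

thetas = np.linspace(0.1, 6.0, 60)
values = [uniform_likelihood(t, x) for t in thetas]
print(x.max(), thetas[int(np.argmax(values))])  # the grid maximizer sits at (or just above) max(x)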
Properties of maximum likelihood estimators

1. Invariance property of MLEs:
If $\hat{\theta}$ is the maximum likelihood estimator of θ and
if g is a one-to-one function, then $g(\hat{\theta})$ is the
maximum likelihood estimator of g(θ).

2. Goodness properties of MLEs:
Maximum likelihood estimators are consistent
and asymptotically unbiased.

359
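A brief illustration of the invariance property (an added example, assuming a normal model): the MLE of the variance is the uncorrected sample variance, so invariance with the one-to-one map $g(t) = \sqrt{t}$ on $[0,\infty)$ gives the MLE of the standard deviation,

$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \quad \Longrightarrow \quad \hat{\sigma}_{ML} = g\big(\hat{\sigma}^2_{ML}\big) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}.$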
Reading: DeGroot and Schervish, Chapter 8.5
9. Confidence intervals

361
9.1. The idea
Confidence intervals provide a method of adding
more information to an estimator when we wish to
estimate an unknown parameter θ.
We can find an interval (A,B) that we think has high
probability of containing θ. The length of such an
interval gives us an idea about how closely we can
estimate θ and how large the sampling error is.

362
Definition
Symmetric (1-α)-confidence interval:
$\mathrm{CONF}_{1-\alpha}(\theta) = \big[\, \hat{\theta}_n - f_n \;;\; \hat{\theta}_n + f_n \,\big],$
where $f_n$ denotes the sampling error and is
computed in such a way that the confidence interval
contains the unknown parameter θ with a given
probability (1-α):
$P\big[\, \theta \in \mathrm{CONF}_{1-\alpha}(\theta) \,\big] = 1-\alpha.$

The probability (1-α) is called the confidence level.
Classical values for confidence levels are 0.95 and
0.99.
363
Interpretation: The confidence interval can be regarded
as the observed value of the random interval (A,B) after
observing the data.

In fact, one way to think of the random interval (A,B) is to
imagine that the sample that we observed is one of many
possible samples that we could have observed.

Each such sample would allow us to compute an observed
interval.

[Figure: observed intervals from repeated samples; axis labels X and θ.]
364
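To make the "many possible samples" picture concrete, the following Python sketch (an addition, not from the slides) simulates repeated samples from a normal population, builds the large-sample 95% interval for the mean from each, and reports the fraction of intervals that cover the true μ; under these assumptions the fraction should be close to 0.95.

import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 5.0, 2.0, 100, 10_000
q = 1.959964  # 0.975-quantile of the standard normal distribution

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    half_width = q * x.std(ddof=0) / np.sqrt(n)   # sampling error with estimated sigma
    if x.mean() - half_width <= mu <= x.mean() + half_width:
        covered += 1

print(covered / reps)  # empirical coverage, close to 0.95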
9.2. Example of a confidence interval
(mean of a distribution, large samples)

Let us consider the construction of the (1-α)-confidence
interval for the mean parameter μ of the population
distribution.

CLT: $\; P\!\left[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le q \right] \xrightarrow[n \to \infty]{} F_Z(q).$

Then (if n is large enough):
$P\!\left[ -q_{1-\frac{\alpha}{2}} \le \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le q_{1-\frac{\alpha}{2}} \right] \cong 1-\alpha,$

where $q_{1-\frac{\alpha}{2}}$ denotes the $\big(1-\frac{\alpha}{2}\big)$-quantile of the standard
normal distribution.
366
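The step to the next slide is a single rearrangement of the event inside the probability, added here as a bridging step:

$-q_{1-\frac{\alpha}{2}} \le \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le q_{1-\frac{\alpha}{2}} \;\Longleftrightarrow\; \bar{X}_n - q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X}_n + q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}.$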


We can now solve with respect to the parameter of
interest, getting the (1-α)-confidence interval for μ:
$\mathrm{CONF}_{1-\alpha}(\mu) = \left[\, \bar{X}_n - q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \;;\; \bar{X}_n + q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \,\right].$

If σ is also unknown, for n ≥ 50 we have that
$\mathrm{CONF}_{1-\alpha}(\mu) = \left[\, \bar{X}_n - q_{1-\frac{\alpha}{2}} \frac{S_n}{\sqrt{n}} \;;\; \bar{X}_n + q_{1-\frac{\alpha}{2}} \frac{S_n}{\sqrt{n}} \,\right]$

is the (1-α)-confidence interval for μ, with
$S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$

367
Example 9.2.1: (rent index)

The municipal administration asks 50 households about
their rent per m² (exclusive of heating) to compute the
local rent index. This results in the following numbers:
$\bar{X}_{50} = 8.30\,€ \quad \text{and} \quad S_{50} = 2.07\,€.$

What is the 0.9-confidence interval for the average rent
parameter μ?

$\mathrm{CONF}_{90\%}(\mu) = \left[\, 8.30 - 1.645 \cdot \frac{2.07}{\sqrt{50}} \;;\; 8.30 + 1.645 \cdot \frac{2.07}{\sqrt{50}} \,\right] = [\,7.82\;;\;8.78\,].$
368
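The interval is easy to reproduce directly; the Python sketch below (an addition to the slides) implements the large-sample formula from slide 367 and checks the rent-index numbers, with the 0.95-quantile 1.645 of the standard normal hard-coded rather than looked up in a table.

import math

def large_sample_ci(xbar, s, n, q):
    """(1-alpha)-confidence interval for the mean; q is the (1 - alpha/2)-quantile of N(0,1)."""
    half_width = q * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Rent-index example: n = 50, mean 8.30, standard deviation 2.07, alpha = 0.10
print(large_sample_ci(8.30, 2.07, 50, 1.645))  # approximately (7.82, 8.78)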
Reading: DeGroot and Schervish, Chapter 9.1
9.3. Relation with testing hypotheses

Consider again a statistical problem involving a
parameter θ whose true value is unknown but must
lie in a certain parameter space Θ.

Suppose Θ can be partitioned into two disjoint
subsets Θ0 and Θ1, and we are interested in verifying
whether θ lies in Θ0 or in Θ1.

A problem of this type is called a problem of
hypothesis testing. The observed values provide
information about θ on which to base the decision.
370
$H_0: \theta \in \Theta_0$
$H_1: \theta \in \Theta_1$
are called the target (null) and alternative hypotheses.

When performing a test, if we decide that θ lies in
Θ1, we are said to reject the target hypothesis.

If we decide that θ lies in Θ0, we are said not to
reject H0.

One possible way to make our decision about θ is
to construct a confidence interval.
371
In such a case we are interested in the following
type of hypotheses:
$H_0: \theta = \theta_0$
$H_1: \theta \neq \theta_0.$

Example: Mean of a distribution (large samples)

Idea: Reject $H_0: \mu = \mu_0$ if the distance between the
arithmetic mean and $\mu_0$ is large enough:
$\left| \bar{X}_n - \mu_0 \right| \ge c,$
where the value c is determined by the significance
level α of the test: $c = q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}.$
372
If $\bar{X}_n = \bar{x}_n$ is observed, the set of $\mu_0$ such that we
would not reject $H_0$ is the set of $\mu_0$ such that
$\left| \bar{x}_n - \mu_0 \right| < q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}$
in case σ is known (otherwise see slide 367).

This inequality translates directly into the formula
given on slide 367 for the confidence interval
(with $\mu = \mu_0$):
$\mathrm{CONF}_{1-\alpha}(\mu) = \left[\, \bar{X}_n - q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \;;\; \bar{X}_n + q_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \,\right].$
373
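Operationally, the duality means: reject H0: μ = μ0 at level α exactly when μ0 falls outside the (1-α)-confidence interval. The short Python sketch below (an added illustration) applies this to the rent-index numbers from example 9.2.1 for two hypothetical values of μ0.

import math

def large_sample_ci(xbar, s, n, q):
    """(1-alpha)-confidence interval for the mean; q is the (1 - alpha/2)-quantile of N(0,1)."""
    half_width = q * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def reject_h0(mu0, xbar, s, n, q):
    """Two-sided large-sample test of H0: mu = mu0 via the confidence-interval duality."""
    lower, upper = large_sample_ci(xbar, s, n, q)
    return not (lower <= mu0 <= upper)

print(reject_h0(9.00, 8.30, 2.07, 50, 1.645))  # True: 9.00 lies outside [7.82, 8.78]
print(reject_h0(8.50, 8.30, 2.07, 50, 1.645))  # False: 8.50 lies inside the interval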
Part III: Exercises

375
Exercise 1.1.1:
Give the sample space in the following cases:
1) A person is asked about her birthday.
2) K persons are asked about their birthdays.
3) Position of a locator on the unit circle.
4) Let S = {"0 times six", "2 times six"}.
Is it a sample space for the experiment of rolling
two dice?

376
Exercise 1.2.1:
1) Rolling one die:
   A: "number smaller than 4"
   B: "odd number"
   → A ∪ B = ?   A ∩ B = ?

2) A = {(x, y) | ax + by + c = 0}
   B = {(x, y) | ax + by + d = 0}
   → A ∩ B = ?

3) Prove De Morgan's laws:
$\overline{A \cup B} = \bar{A} \cap \bar{B} \quad \text{and} \quad \overline{A \cap B} = \bar{A} \cup \bar{B}.$
377
Exercise 1.2.1 (continued):

4) Rolling two dice:


S = {(1,1) , (1,2 ) ,..., (1,6 ) , ( 2,1) , ( 2,2 ) ,..., ( 6,6 )}
and
A: ‘‘at least one die is a six’’
B: ‘‘the two dice show the same number’’
C: ‘‘the two dice show odd numbers’’

378
Exercise 1.2.1 (continued):
Write the following events as subsets of S:

A = ?
B = ?
C = ?
$\bar{A}$ = ?
$\bar{C}$ = ?

379
Exercise 1.2.1 (continued):

B∩C = ?
B\C =?
A ∪C = ?
A ∩B ∩C = ?

380
Exercise 1.3.1:

Which probability concept?

(1) The probability that, playing "Lotto", we pick three
or more correct numbers equals only about 2%.

(2) The probability that next June in Frankfurt it is


going to snow is smaller than 5%.

(3) The probability that the applicant X for the


announced position Y is invited for an interview
equals 80%.

381
Exercise 1.5.1:

“Rolling two dice” (see exercise 1.2.1 (4))

A: “at least one die is a six”


B: “the two dice show the same number”
C: “the two dice show odd numbers”

Compute the probabilities of the events above
and of the derived events from exercise 1.2.1 (continued).

382
Exercise 1.5.2: (sick notes)

The (frequentist) probabilities for the sick notes of three


employees X, Y and Z are summarized in the following
table:
Ei {-} {X} {Y} {Z} {XY} {XZ} {YZ} {XYZ}

P(Ei) 0.751 0.1 0.063 0.061 0.011 0.008 0.005 0.001 Σ=1

Compute:
→ P(“X ill”) = ?
→ P(“Y ill”) = ?
→ P(“X and Y ill”) = ?
→ P(“X or Y ill”) = ?

383
Exercise 1.6.1:
Compute the probability of the following two events:
a) A: “Rolling four dice we get at least one six’’
b) B: “Rolling two dice 24 times we get at least one
twelve as the sum of the numbers’’

384
Exercise 1.6.2:

Let us consider the sample space

S= {( a,b ) | 0 < a < 3 and 0 < b < 2}.


Define the event
$U = \left\{ (a,b) \;\middle|\; a \in [0,3],\; b < 1 - \tfrac{a}{6} \right\}.$
P(U) = ?

385
Exercise 1.7.1: (sick notes, see exercise 1.5.2)

→ Are the sick notes of the three employees X, Y


and Z pairwise independent?
I) P(‘‘X and Y ill’’) = ?
II) P(‘‘X and Z ill’’) = ?
III) P(‘‘Y and Z ill’’) = ?
⇒ How does the probability of sick notes of Y change
in reaction to a sick note of X?
P(‘‘Y ill’’ | ‘‘X ill’’) = ?

386
Exercise 1.7.2:
Two independent elevators A and B, identical from
both a technical and a functional point of view, are
located in an office building.
The probability that the elevator A (or B) at a given
point in time is on the ground floor equals 0.2.

387
Exercise 1.7.2 (continued):

→ What is the probability that a visitor coming at a


randomly chosen point in time....
I) finds both elevators on the ground floor?
II) finds at least one elevator on the ground floor?
III) finds exactly one elevator on the ground floor?
→ Both elevators have a failure probability equal to 5%
when they are not on the ground floor.
What is the probability that...
IV) elevator A fails given that it is not on the ground
floor?
388
Exercise 1.7.3:
The transmission of a communication from A to B can
be done using 3 independent channels. Each channel
can fail with a given probability ‘‘p’’.

Compute the following probability:


P[“transmission successful’’] = ?

389
Exercise 1.8.1: (draw of the ‘Zusatzzahl’ in Lotto)

Lotto works as follows: seven balls are drawn without


replacement from an urn containing 49 balls numbered
consecutively (with integer numbers from 1 to 49).

The number of the last ball is called ‘Zusatzzahl’.

What is the probability of the event

A: “Zusatzzahl 1 is drawn’’?

390
Exercise 1.8.2: (supplier with differences in quality)
An automaker equips its vehicles with air conditioning
systems that it obtains from three different suppliers.

Supplier Share Defective items


A 50% 5%
B 30% 9%
C 20% 24%

M: “a randomly chosen vehicle has a malfunctioning


air conditioning system’’
→ P(M) = ?
→ when M is observed: P(A|M) = ? = P(A) = 0.5 ?
P(B|M) = ? and P(C|M) = ?
391
Exercise 1.9.1: (supplier with differences in quality)
(see exercise 1.8.2, continued)

P(A | M) = ?
P(B | M) = ?
P(C | M) = ?

392
Exercise 1.9.2: (urn)

Let us consider two urns U1 and U2.


U1 contains 5 white and 7 red balls. U2 contains one
white and 5 red balls.

We randomly choose an urn and from that urn we


randomly draw one ball.

The ball drawn is red. What is the probability that urn


U1 was chosen?
393
Exercise 3.0.1:

Define the appropriate random variable for:

→ the number of customers in a given store;

→ the burning time of a light bulb (in hours).

394
Exercise 3.1.1:

Let us consider rolling two dice.

S = {(i, j) | i , j ∈ {1,...,6}}
X = "sum of the numbers": ( i, j ) → i+j
W = {2,...,12}

What is the distribution function of X?

395
Exercise 3.1.2:
A machine produces defective items with probability 5%.

We choose randomly 4 items from the total production.

Let us denote with X the random variable


X = ‘‘number of defective items in the sample’’.

Then: W={0,1,2,3,4}.

What is the distribution function of X?

396
Exercise 3.2.1:

A random variable has the following probability
function:

$f_X(x) = \begin{cases} K & \text{for } x = 0 \\ 2K & \text{for } x = 1 \\ 3K & \text{for } x = 2 \\ 5K & \text{for } x = 3 \\ 0 & \text{else.} \end{cases}$

397
Exercise 3.2.1 (continued):

→ Compute the constant K.

→ P (1 < X ≤ 3 ) = ?
P ( X > 1) = ?
P ( X = 1) = ?

→ Find the smallest value x of X such that


P ( X ≤ x ) = Fx ( x ) ≥ 0.5.

398
Exercise 3.2.2: (revenue under uncertain conditions)

(see example 3.0.5, continued)

The management is interested in the probability that

1. the total order volume next year amounts to at most
   30 million (M);
2. next year's absolute deviation from the fixed target of
   36 M for the total order volume amounts to at most 6 M.

399
Exercise 3.3.1:
Let us consider a random variable X with density
function f given by
$f(x) = \begin{cases} 3x^2, & 0 \le x \le \tfrac{1}{2} \\ \tfrac{3}{2}x, & \tfrac{1}{2} < x \le c \\ 0, & \text{else.} \end{cases}$
a) Determine the constant c such that the function f is
a density function. Sketch the density function.
b) Compute the distribution function of X.
400
Exercise 3.4.1:

1. Consider a random variable X with
$f(x) = \begin{cases} 3x^2, & 0 \le x \le \tfrac{1}{2} \\ \tfrac{3}{2}x, & \tfrac{1}{2} < x \le c \\ 0, & \text{else.} \end{cases}$
E[X] = ?

401
2. Waiting time at the 'S-Bahn' station (continued)

(see example 3.3.2)


E [ X] = ?
3. Two players A and B roll one die alternately.

The rolling player wins from his competitor


• 3 Euro if he gets 1 or 2;
• 6 Euro if he gets 6; and
• 0 Euro if the number obtained is 3, 4 or 5.
Compute the expected value of the random variable
X = "winning of A" when each player rolls the die once.
402
Exercise 3.5.1:
A random variable X has the density function
$f(x) = \begin{cases} \tfrac{3}{8}x^2, & 0 \le x \le c \\ 0, & \text{else.} \end{cases}$

Compute E[X] and V(X).

403
Exercise 3.5.2:
Compute the expected value and the variance of the
random variable
Y = 3X + 2,
where X has the probability function

X 1 2 5
f ( x ) 0.2 0.3 0.5

404
Exercise 3.5.3: (defective piping)
A piping is made of 20 segments. Given that the
outflow quantity is smaller than the inflow quantity,
there must be a leak somewhere.
Let us assume that there is exactly one leak and
that it is located in each one of the segments with
equal probability 1/20.
We would like to find the segment in which the
leak is located with the smallest possible number
of inspections (that is, measuring the flow rate at
each segment’s borders).
405
Exercise 3.5.3 (continued):

a) Compute the distribution of


X = ‘‘number of inspections’’
in the case that we check gradually each segment’s
border starting with the first one.
Compute E(X) and $\sigma_X^2$.

b) Is there a better, cheaper strategy?

406
Exercise 4.3.1: (quality check)
In the production of high-quality drinking glasses the
percentage of defective items equals 20%.
In the course of a quality check we take randomly
four drinking glasses with replacement.
X: ‘‘# of defective glasses in the sample’’
Y: ‘‘# of flawless glasses in the sample’’
Compute the probability that:
(1) exactly one glass in the sample is defective;
(2) at least two glasses in the sample are defective;
(3) exactly one glass in the sample is flawless.
Compute E[X], E[Y], V(X), V(Y).
407
Exercise 4.4.1: (clients at the bank counter)
Clients come to a given bank counter at some unpredictable
point in time: in the morning (8-12) on average 12 clients per
hour and in the afternoon (14-16) on average 10 clients per
hour.
Assume that clients arrive independently of each other,
whether it is morning or afternoon. Compute the probability…
1) that on a given day between 09.00h and 09.15h no client
comes to the bank counter;
2) that on a given day between 15.00h and 15.15h no client
comes to the bank counter;
3) that on a given day between 15.30h and 16.00h more than
6 clients show up at the bank counter.

408
Exercise 4.6.1: (clients at the bank counter)
(see exercise 4.4.1, continued)

→ in the morning (8-12): 12 clients per hour


→ in the afternoon (14-16): 10 clients per hour

What is the probability that at a given point in time in


the morning/afternoon a client shows up at the bank
counter in the following five minutes?

→ Xmor = ‘‘time until the arrival of the next client’’


(morning) ~ fEx (x; 12)
→ Xaft = ‘‘time until the arrival of the next client’’
(afternoon) ~ fEx (x; 10)
409
Exercise 4.7.1:
Working with general normal distributions N(µ, σ2):

Let X ~ fN (x ; 2,16)

→ P ( X ≤ 0) =?
→ P ( X ≤ 2) =?

Now consider X ~ fN (x ; 5, 100). Find q such that:

→ P(X ≤ q) = 0.25;
→ P(X ≤ q) = 0.75.

410
Exercise 5.1.1:

Let us consider a box containing 2 white balls, 3 black
balls, and 1 blue ball. We draw 2 balls with replacement.

Define the random variables:


X = ‘‘number of white balls in the sample’’
Y = ‘‘number of blue balls in the sample’’

Find the joint (bivariate) distribution and the marginal


distributions of X and Y.

411
Exercise 5.4.1:

Let {X1, X2, …} be a random sample with E(Xi) = μ
and V(Xi) = σ². Let
$Z = \frac{1}{22} \cdot \sum_{i=1}^{2} \left( 4X_i^2 + 3X_i + 2 \right).$

Compute E[Z] and V(Z).

412
Exercise 5.4.2:

Let X and Y denote two independent, Poisson


distributed random variables. We know that V(X) +
V(Y) = 5.

Compute the probability P(X+Y ≤ 2).

Hint: The Poisson family of distributions satisfies the
additive property: if X ~ fPo and Y ~ fPo are
independent, then (X + Y) ~ fPo.
413
Exercise 6.1:
The (historical) probability that on a given day in June
in a Mediterranean holiday resort it rains equals 0.08.

a) What is the distribution of the number of rainy


days in a week (X7) and in the whole month of
June (X30)?
b) Compute expected value and variance of X7.
c) What is the probability that...
- it does not rain for a whole week in June?
- it rains at least three days in a week in June?
- in the whole month of June we observe at most
two rainy days?
414
Exercise 6.1 (continued):

In the same resort, the sunshine duration of a day in


June can be modeled as a normally distributed
random variable with µ = 10 [hours] and σ2 = 10.8
[hours2].

d) What is the distribution of the total sunshine
duration in June (Y30) and that of the average
sunshine duration in June ($\bar{Y}_{30}$), respectively?

e) What is the probability that the sun shines in the


whole month of June of a given year on average
more than 11 hours per day?
415
Exercise 6.2:
100 integer numbers from 1 to 5 are randomly chosen
and summed. What is the probability that the sum…

a) equals at most the value 250?

b) lies between the values 275 and 305 (boundaries


included)?

416
Exercise 6.3:

A die is rolled 300 times. Let X denote the number of


‘3’ that are observed.

Compute P(50 < X ≤ 53) and P(X < 40).

417
Exercise 7.4.1:

Below are summarized the results (in points) of an


exam (23 students):

6.2; 4.82; 2.96; 6.18; 6.52; 7.9; 9.62; 6.22; 0.42;


9.06; 11.7; 6.54; 3.14; 4.74; 2.66; 7.04; 7.78; 11.8;
9.44; 20.76; 2.9; 8.42; 8.02

Boxplot? QQ-Plot? Other summary statistics?

418
Exercise 8.2.1: (estimation of λ in a Poisson distribution)

Let us consider the following two estimators for the
parameter λ of a Poisson distributed population:
$T_1 = \bar{X}_n \quad \text{and} \quad T_2 = \frac{n}{n-1}\, S^2.$

i) Are T1 and T2 unbiased for λ?
ii) Is T1 consistent for λ?
iii) Is T1 (relatively) efficient with respect to T2?
419
Exercise 8.2.2:

Let
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad \tilde{X} = \frac{1}{n+1}\left( 2X_1 + \sum_{i=2}^{n} X_i \right)$
be two competing estimators for the mean
parameter E[X] = μ of the population distribution.
Assume that V(X) = σ² exists.

1. Show that both estimators are unbiased.
2. Compute their variances.
3. Which of the two estimators is (relatively)
efficient?
420
Exercise 8.3.1:

We toss n identical coins, each one until we get
'heads' for the first time.

Find the maximum likelihood estimator of


p = P[ "heads" ]
based on the random sample X1,…, Xn, where Xi
denotes the number of ‘tails’ needed before we get
for the first time ‘heads’ for the coin i.

421
Exercise 9.2.1:

A sector is made of N=12,100 individual companies.


We consider a random sample of size n=225.
The variable of interest is P = “annual profit” (in
Swiss francs).
Summary statistics for the results in 2006 are:
$\bar{p}_{225} = 600{,}000.- \quad ; \quad s_P = 90{,}000.-$
Find:
1. a confidence interval for the mean annual profit
at the level α=4.55%;
2. a confidence interval for the total annual profit of
the sector at the level α=4.55%.
422
Exercise 9.2.2:

In the June 1986 issue of Consumer Reports, some data on the


calorie content of beef hot dogs is given. Here are the numbers
of calories for 20 different hot dog brands:

186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
152, 111, 141, 153, 190, 157, 131, 149, 135, 132.

Assume that these numbers are the observed values from a


random sample of twenty independent normal random variables
with mean µ and standard deviation σ, both unknown.

Find a 90% confidence interval for the mean number of calories


µ.

423
