
Chapter 1

Monte Carlo Methods


Goals:
1. basics of Monte Carlo methods
2. design of a Monte Carlo study

1.1 Basics of Monte Carlo methods


In a previous chapter, we developed the crude Monte Carlo estimator of the expectation
$$\mu = E_\theta[g(X)].$$
Given $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta)$, with the same distribution as $X$, the strong law of large numbers (SLLN) implies that
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} g(X_i) \to \mu \quad \text{as } n \to \infty.$$
The precision of our estimate $\hat\mu$ is dictated by
$$\mathrm{Var}(\hat\mu) = \frac{1}{n}\mathrm{Var}_\theta[g(X)] \equiv \frac{\sigma_g^2}{n},$$

which can be estimated via
$$\hat\sigma_g^2 = \frac{1}{n-1}\sum_{i=1}^{n} \{g(X_i) - \hat\mu\}^2 \equiv \text{sample variance of the } g(X_i)\text{s}.$$
Note that $\hat\mu$ is unbiased, and typically in large samples
$$\hat\mu \,\dot\sim\, \text{Normal}(\mu, \hat\sigma_g^2/n).$$


The precision of $\hat\mu$ depends on $\sigma_g^2$ and $n$. We will discuss several methods that aim to increase precision, besides increasing $n$. Note that more complex methods may increase precision for a given $n$, but may incur increased programming effort or computational time. Some assessment of the trade-offs between variance reduction and added labor or cost needs to be made.
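
As a concrete illustration, here is a minimal R sketch of the crude Monte Carlo estimator, its estimated standard error, and an approximate 95% confidence interval. The target $g(x) = e^{-x}$ with $X \sim \text{Exponential}(1)$ is an assumed example, chosen only because the true value $E[e^{-X}] = 1/2$ is known for checking.

# crude Monte Carlo: estimate mu = E[g(X)] with g(x) = exp(-x), X ~ Exp(1)
# (assumed example; the true value is 1/2, so we can check the answer)
set.seed(1)
n  <- 1e4
x  <- rexp(n)               # draw X_1, ..., X_n iid from f(x | theta)
gx <- exp(-x)               # evaluate g at each draw

mu.hat     <- mean(gx)                 # crude MC estimate of mu
sigma2.hat <- var(gx)                  # sample variance of the g(X_i)s
se.hat     <- sqrt(sigma2.hat / n)     # estimated standard error of mu.hat

# approximate 95% CI from the large-sample normal approximation
c(estimate = mu.hat, lower = mu.hat - 1.96 * se.hat, upper = mu.hat + 1.96 * se.hat)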

1.1.1 Control variates


As before, suppose we wish to estimate (assuming $x$ continuous)
$$\mu \equiv E_\theta[g(X)] = \int g(x) f(x \mid \theta)\, dx.$$
If we have a $g^*(x)$ that is "similar to" $g(x)$ and for which
$$\tau \equiv E_\theta[g^*(X)] = \int g^*(x) f(x \mid \theta)\, dx$$
is known, then writing
$$\mu = \int \{g(x) - g^*(x)\} f(x \mid \theta)\, dx + \tau = E_\theta[g(X) - g^*(X)] + \tau,$$

we can use crude Monte Carlo to estimate $E_\theta[g(X) - g^*(X)]$. That is,
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} \{g(x_i) - g^*(x_i)\} + \tau = \frac{1}{n}\sum_{i=1}^{n} g(x_i) - \frac{1}{n}\sum_{i=1}^{n} g^*(x_i) + \tau,$$
with
$$\mathrm{Var}[\hat\mu] = \frac{1}{n}\mathrm{Var}_\theta[g(X) - g^*(X)] = \frac{1}{n}\left\{\mathrm{Var}_\theta[g(X)] + \mathrm{Var}_\theta[g^*(X)] - 2\,\mathrm{Cov}_\theta[g(X), g^*(X)]\right\}.$$
If $g^*(X)$ mimics $g(X)$, then $\mathrm{Var}_\theta[g(X)] \doteq \mathrm{Var}_\theta[g^*(X)]$ and
$$\mathrm{Var}[\hat\mu] \doteq \frac{1}{n}\left\{2\,\mathrm{Var}_\theta[g(X)] - 2\,\mathrm{Var}_\theta[g(X)]\,\mathrm{Corr}_\theta[g(X), g^*(X)]\right\} < \frac{1}{n}\mathrm{Var}_\theta[g(X)] \quad \text{if } \mathrm{Corr}_\theta[g(X), g^*(X)] > \tfrac{1}{2}.$$
Thus, there is a reduction in variability relative to crude MC if $\mathrm{Corr}_\theta[g(X), g^*(X)] > \tfrac{1}{2}$.

Example, median  Let $X = (X_1, \ldots, X_n)$ be a sample from some distribution with known mean $E[X_i] = \tau$. Let $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$ and let $M$ be the sample median of the $X_i$s, and suppose we wish to estimate
$$E[M] \equiv \mu \quad \text{(nonstandard notation)}$$
given $R$ samples, each giving $M_r$ and $\bar x_r$. Consider using
$$\hat\mu = \frac{1}{R}\sum_{r=1}^{R} \{M_r - \bar x_r\} + \tau$$

as the estimate. That is, use $\bar x$ as a control variate for estimating $E[M]$.
# Gamma(2, 4) distribution, with E[X]=a*b and Var[X]=a*b^2.
a <- 2
b <- 4

# true median
qgamma(0.5, a, scale=b)

## [1] 6.713

# true mean of gamma distribution


tau <- a*b
tau

## [1] 8

# sample from gamma distribution


R <- 1e4 # samples
n <- 25 # sample size
x <- matrix(rgamma(R*n, a, scale=b), ncol=n) # draw R random samples in rows

# Monte Carlo estimate of E[M]: mean and variance of the sample medians
M <- apply(x, 1, median)
c(mean(M), var(M))

## [1] 6.806 1.638

# using mean as control variate


x.bar <- apply(x, 1, mean)
c(mean(x.bar), var(x.bar))

## [1] 8.005 1.289

# Check that the correlation between our variate of interest (median)
# and our control variate (mean) is at least 1/2
cor(M, x.bar)

## [1] 0.7612

# The control-variate estimate of mu = E[M] has lower variance than the crude estimate mean(M)
mu.hat <- mean(M - x.bar) + tau
c(mu.hat, var(M - x.bar))

## [1] 6.8005 0.7149

Example, cdf  Let $X = (X_1, X_2, \ldots, X_k)$ and let $T(X)$ be some statistic. Suppose we wish to estimate
$$\mu \equiv \mu(t) = \Pr[T(X) \le t] = E[1_{(T(X) \le t)}],$$
that is, estimate the cumulative distribution function (cdf) of $T(X)$. In other words, the cdf of $T(X)$ is the probability that the statistic $T(X)$ is less than or equal to $t$, for each quantile $t$. The crude MC estimate is the empirical cdf: given $T(X_1), T(X_2), \ldots, T(X_n)$,
$$\hat\mu \equiv \hat\mu(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } (T(X_i) \le t)}{n}.$$
Suppose a statistic $S(X)$ mimics $T(X)$ and the cdf of $S(X)$,
$$\tau(t) \equiv \Pr[S(X) \le t],$$
is known. Then the control variate estimate is
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} \{1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)}\} + \tau(t),$$
where
$$1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)} =
\begin{cases}
 1 & T \le t,\ S > t \\
 0 & T \le t,\ S \le t \ \text{ or } \ T > t,\ S > t \\
-1 & T > t,\ S \le t.
\end{cases}$$


The variance reduction could be substantial.


This idea was used in the "Princeton Robustness Study"¹, which among other things considered distributional properties of the trimmed-mean t-like statistic
$$t_T = \frac{\bar x_T - \theta}{SE[\bar x_T]} \quad \text{(based on sample size } k\text{)}.$$
If the underlying population distribution is Normal with mean $\theta$, you can use
$$t = \frac{\bar x - \theta}{SE[\bar x]} \sim t_{k-1}$$
as a control variate for estimating the cdf of $t_T$.
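
A minimal R sketch of this idea under assumed settings (Normal(0, 1) data, sample size $k = 20$, 10% trimming, and a fixed quantile $t_0$): the ordinary $t$ statistic, whose $t_{k-1}$ cdf is known, serves as the control variate for the cdf of the trimmed-mean statistic. The standard error used for the trimmed mean here is a simple placeholder choice, not the one from the study.

# Control variate for a cdf (a sketch, assumed settings):
# estimate Pr[t_T <= t0] for a 10%-trimmed-mean t-like statistic when the data
# are Normal(0, 1), using the ordinary t statistic (known t_{k-1} cdf) as control.
set.seed(2)
k    <- 20      # sample size per replicate
R    <- 1e4     # MC replicates
t0   <- 1.0     # fixed quantile at which to estimate the cdf
trim <- 0.10

x <- matrix(rnorm(R * k), nrow = R)

# ordinary t statistic (control variate); its cdf tau(t0) = pt(t0, k - 1) is known
t.stat <- apply(x, 1, function(xi) mean(xi) / (sd(xi) / sqrt(k)))

# trimmed-mean t-like statistic (statistic of interest); the SE used here is a
# crude placeholder: sd of the retained observations over sqrt(k)
tT.stat <- apply(x, 1, function(xi) {
  xbarT <- mean(xi, trim = trim)
  xs    <- sort(xi)
  keep  <- xs[(floor(k * trim) + 1):(k - floor(k * trim))]
  xbarT / (sd(keep) / sqrt(k))
})

ind.T <- (tT.stat <= t0)
ind.S <- (t.stat  <= t0)

p.crude <- mean(ind.T)                              # crude MC estimate
p.cv    <- mean(ind.T - ind.S) + pt(t0, k - 1)      # control variate estimate
c(crude = p.crude, control.variate = p.cv)

# variance comparison: sample variance of the indicator vs. of the difference
c(var(ind.T), var(ind.T - ind.S)) / R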

Example, Multinomial  Suppose
$$X = (X_1, X_2, \ldots, X_k) \sim \text{Multinomial}(m, \theta),$$
where $\theta = (\theta_1, \ldots, \theta_k)$. Two standard statistics for testing $H_0: \theta_1 = \theta_{01}, \ldots, \theta_k = \theta_{0k}$ are the Pearson statistic
$$P = \sum_{i=1}^{k} \frac{(x_i - m\theta_{0i})^2}{m\theta_{0i}}$$
and the likelihood ratio statistic
$$G^2 = 2\sum_{i=1}^{k} x_i \log_e\!\left(\frac{x_i}{m\theta_{0i}}\right).$$
¹ John W. Tukey (1973). The Estimators of the Princeton Robustness Study. Princeton University, Department of Statistics.

Note that $0 \log_e(0) \equiv 0$. In large samples, both $P$ and $G^2 \,\dot\sim\, \chi^2_{k-1}$ when $H_0$ is true. One way to study the closeness of the $\chi^2_{k-1}$ approximation is through the moments: how closely do the moments of $P$ and $G^2$ match those of the $\chi^2_{k-1}$ distribution? The moments of $P$ are tractable, but the moments of $G^2$ are not. This suggests using $P$ as a control variate for estimating moments of $G^2$. For example, suppose we wish to estimate
$$E[G^2] = \mu.$$
We know
$$E[P] = E[\chi^2_{k-1}] = k - 1.$$
Thus, given $R$ multinomial samples, estimate $\mu$ via
$$\hat\mu = \frac{1}{R}\sum_{r=1}^{R} \{G^2_r - P_r\} + (k - 1),$$
where $G^2_r$ and $P_r$ are the values of $G^2$ and $P$ from the $r$th sample.
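
A short R sketch of this control-variate estimate, under assumed settings ($k = 4$ equally likely cells, $m = 30$ trials): it draws $R$ multinomial samples, computes $P_r$ and $G^2_r$ for each, and compares the crude and control-variate estimates of $E[G^2]$.

# Control variate estimate of E[G^2] using the Pearson statistic P
# (a sketch with assumed settings: k = 4 cells, theta0 uniform, m = 30 trials).
set.seed(3)
k      <- 4
m      <- 30
theta0 <- rep(1 / k, k)
R      <- 1e4

x <- rmultinom(R, size = m, prob = theta0)   # k-by-R matrix of counts
e <- m * theta0                              # expected counts under H0

P  <- colSums((x - e)^2 / e)                         # Pearson statistic per sample
G2 <- 2 * colSums(ifelse(x == 0, 0, x * log(x / e))) # LR statistic, with 0*log(0) = 0

mu.crude <- mean(G2)                     # crude MC estimate of E[G^2]
mu.cv    <- mean(G2 - P) + (k - 1)       # control variate estimate, since E[P] = k - 1
c(crude = mu.crude, control.variate = mu.cv)

# variance comparison (per-sample variances; divide by R for the variances of the estimates)
c(var(G2), var(G2 - P))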

1.1.2 Antithetic variates (AV)


Suppose we have two estimators $\hat\mu_1$ and $\hat\mu_2$ of $\mu$, each with variance $\sigma^2/n$ when based on a sample of size $n$. If the correlation $\rho$ between these estimators is negative (antithetic means "directly opposed or contrasted; mutually incompatible"), then the estimator
$$\hat\mu_{AV} = \frac{\hat\mu_1 + \hat\mu_2}{2}$$
has
$$\begin{aligned}
\mathrm{Var}[\hat\mu_{AV}] &= \tfrac{1}{4}\left\{\mathrm{Var}[\hat\mu_1] + \mathrm{Var}[\hat\mu_2] + 2\,\mathrm{Cov}[\hat\mu_1, \hat\mu_2]\right\} \\
&= \tfrac{1}{4}\left\{\mathrm{Var}[\hat\mu_1] + \mathrm{Var}[\hat\mu_2] + 2\rho\sqrt{\mathrm{Var}[\hat\mu_1]\,\mathrm{Var}[\hat\mu_2]}\right\} \\
&= \tfrac{1}{4n}\left\{\sigma^2 + \sigma^2 + 2\rho\sigma^2\right\} \\
&= \tfrac{1}{2n}\,\sigma^2(1 + \rho) \\
&< \frac{\sigma^2}{2n},
\end{aligned}$$
where the last term is the variance of either µ̂1 or µ̂2 based on a sample of
size 2n. That is, averaging the two estimators based on the same sample
of size n (necessary to make estimators correlated) is better than doubling
the sample size using either estimator individually.
Put another way, two negatively correlated estimators can be combined
to provide a more precise estimator than either estimate individually, even
when the combined estimator is based on half the number of samples.
The AV method is often difficult to implement since you need to find
negatively correlated estimators. This can often be done in situations with
certain symmetry constraints.

Example, AV  Suppose $X \sim \text{Normal}(0, 1)$ and we wish to estimate
$$\mu = E[h(X)] \quad \text{where} \quad h(X) = \frac{X}{2^X - 1}.$$
Since $-X \sim \text{Normal}(0, 1)$, the distributions of $h(X)$ and $h(-X)$ are identical, and thus $E[h(-X)] = \mu$. Based on a sample of $n = 10000$, we find the AV estimate is much more precise than either individual estimate based on $n = 20000$ samples.

# define h(x)
f.h <- function(x) {
h <- x / (2^x - 1)
return(h)
}

# sample from normal distribution


R <- 1e4 # samples
x <- rnorm(R) # draw R random samples
x2 <- rnorm(R) # double the samples for later comparison

# calculate h(x) and h(-x)


h.x <- f.h(x)
h.negx <- f.h(-x)

# these are negatively correlated, so the AV approach is profitable


cor(h.x, h.negx)

## [1] -0.9527

# AV estimate
combine.h.x <- (h.x + h.negx) / 2
mu.hat.AV <- mean(combine.h.x)
mu.hat.AV

## [1] 1.499

# sd of AV estimate
sd(combine.h.x)

## [1] 0.07764

# sd of individual estimate based on 2*R samples


h.x2 <- f.h(x2)
h.negx2 <- f.h(-x2)

sd(c(h.x, h.x2))

## [1] 0.5086

sd(c(h.negx, h.negx2))

## [1] 0.507

The AV approach combines two estimates of the same parameter as best we can, that is, by averaging them. A real gain comes about if the estimates have negative correlation.

In general, if we have estimates $\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_\ell$ of $\mu$ with covariance matrix
$$\Sigma = [\mathrm{Cov}(\hat\mu_i, \hat\mu_j)],$$
then we can use generalized least squares (GLS) to get the optimal estimate. That is, set
$$\hat\mu^* = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \\ \vdots \\ \hat\mu_\ell \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\mu + \varepsilon = \mathbf{1}\mu + \varepsilon, \qquad \mathrm{Cov}[\varepsilon] = \Sigma;$$
then the best estimate is
$$\hat\mu = (\mathbf{1}^\top \Sigma^{-1} \mathbf{1})^{-1}\, \mathbf{1}^\top \Sigma^{-1} \hat\mu^*.$$
Remarks

- Typically we estimate $\Sigma$ with $\hat\Sigma$ and plug that into $\hat\mu$;
- with two estimates of equal variance, the estimate is always the average; and
- depending on $\Sigma$, we could potentially reduce $n$ and get the same precision as using an individual estimator $\hat\mu_j$.
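
As an illustration of the GLS combination, here is a minimal R sketch under an assumed toy setting: a few unbiased estimates of the same $\mu$ with an assumed known covariance matrix. With two estimates of equal variance the GLS weights reduce to simple averaging, as noted above. The helper name gls.combine is made up for this sketch.

# GLS combination of several estimates of the same mu (a sketch, assumed toy setting).
# gls.combine() returns (1' Sigma^{-1} 1)^{-1} 1' Sigma^{-1} mu.star.
gls.combine <- function(mu.star, Sigma) {
  one <- rep(1, length(mu.star))
  w   <- solve(Sigma, one)                 # Sigma^{-1} 1
  sum(w * mu.star) / sum(w * one)          # weighted combination of the estimates
}

# assumed example: three estimates of mu with a made-up covariance matrix
mu.star <- c(1.02, 0.97, 1.10)
Sigma   <- matrix(c(0.04, 0.01, 0.00,
                    0.01, 0.05, 0.02,
                    0.00, 0.02, 0.10), nrow = 3, byrow = TRUE)
gls.combine(mu.star, Sigma)

# with two estimates of equal variance, the GLS combination is the plain average
gls.combine(c(1.02, 0.97), matrix(c(0.05, -0.03, -0.03, 0.05), nrow = 2))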

1.1.3 Importance sampling (IS)


As before, we wish to estimate
$$\mu \equiv E_\theta[g(X)] = \int g(x) f(x \mid \theta)\, dx.$$
Assume $\theta$ is fixed and let $f(x) \equiv f(x \mid \theta)$. The crude MC estimate $\hat\mu$ is unbiased with
$$\mathrm{Var}(\hat\mu) = \frac{1}{n}\mathrm{Var}_\theta[g(X)] = \frac{1}{n}\left(E_\theta[g^2(X)] - \mu^2\right) = \frac{1}{n}\left(\int g^2(x) f(x \mid \theta)\, dx - \mu^2\right).$$

Importance sampling seeks to reduce $\mathrm{Var}(\hat\mu)$ as follows. Note that for any other density $h(x)$ (with $h(x) > 0$ wherever $g(x)f(x) \neq 0$),
$$\mu = \int g(x) f(x)\, dx = \int g(x)\frac{f(x)}{h(x)} h(x)\, dx = \int g(x) w(x) h(x)\, dx = E_h[g(X) w(X)],$$
where $w(x) \equiv f(x)/h(x)$ and the expectation is with respect to $h(x)$. Thus, drawing a sample of size $n$, $X_1, X_2, \ldots, X_n$, from $h(x)$, we can use the MC estimate
$$\hat\mu_{IS} = \frac{1}{n}\sum_{i=1}^{n} g(x_i) w(x_i)$$
as an unbiased estimator of $\mu$ with
$$\mathrm{Var}(\hat\mu_{IS}) = \frac{1}{n}\mathrm{Var}_h[g(X) w(X)] = \frac{1}{n}\left(E_h[g^2(X) w^2(X)] - \mu^2\right).$$

Note that the expected value of the weight function is
$$E_h[w(X)] = \int \frac{f(x)}{h(x)} h(x)\, dx = \int f(x)\, dx = 1,$$
that is, the average weight is 1.

Since the average weight is one, some weights may be very large ($\gg 1$). IS tends to work well when $w(x)$ is large only where $g(x)$ is small. This requires the choice of $h(x)$ to be made carefully!

Remarks

1. IS is a crude MC, so we can estimate $\mathrm{Var}(\hat\mu_{IS})$ via
$$\mathrm{Var}(\hat\mu_{IS}) = \frac{\sigma_{IS}^2}{n}, \quad \text{where} \quad \sigma_{IS}^2 = \frac{1}{n-1}\sum_{i=1}^{n} \{g(x_i) w(x_i) - \hat\mu_{IS}\}^2,$$
which is the sample variance of the $g(x_i) w(x_i)$s.



2. Another IS estimate is obtained by writing
$$\mu = \frac{\int g(x) f(x)\, dx}{\int f(x)\, dx}
     = \frac{\int g(x)\frac{f(x)}{h(x)} h(x)\, dx}{\int \frac{f(x)}{h(x)} h(x)\, dx}
     = \frac{\int g(x) w(x) h(x)\, dx}{\int w(x) h(x)\, dx}
     = \frac{E_h[g(X) w(X)]}{E_h[w(X)]}. \tag{1.1}$$
This also makes sense because $E_h[w(X)] = 1$. Given $X_1, X_2, \ldots, X_n$ from $h(x)$, estimate $\mu$ via
$$\hat\mu = \frac{\frac{1}{n}\sum_{i=1}^{n} g(x_i) w(x_i)}{\frac{1}{n}\sum_{i=1}^{n} w(x_i)} = \frac{1}{n}\sum_{i=1}^{n} g(x_i) w^*(x_i),$$
where
$$w^*(x_i) = \frac{w(x_i)}{\frac{1}{n}\sum_{\ell=1}^{n} w(x_\ell)}$$
are the normalized weights.

This approach is important because we can think of $f(x)$ in (1.1) not as a density but as the kernel of a density. That is, the actual density is
$$c\,f(x) = \frac{f(x)}{\int f(x)\, dx}.$$
That is, we don't need to know the normalization constant, which makes this a useful strategy in Bayesian calculations.
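
Here is a small R sketch of the self-normalized IS estimate, using an assumed toy target: the kernel $f(x) = e^{-x^2/2}$ (a standard normal without its normalizing constant), a Uniform(-5, 5) proposal $h$, and $g(x) = x^2$, so the answer should be close to $E[X^2] = 1$.

# Self-normalized importance sampling with an unnormalized target (a sketch).
# Assumed toy setting: kernel f(x) = exp(-x^2/2) (normal kernel, constant unknown),
# proposal h = Uniform(-5, 5), g(x) = x^2; the answer should be near E[X^2] = 1
# (the truncation at +/- 5 is negligible).
set.seed(4)
n <- 1e5

x      <- runif(n, -5, 5)              # draws from the proposal h
f.kern <- exp(-x^2 / 2)                # unnormalized target kernel f(x)
h.dens <- dunif(x, -5, 5)              # proposal density h(x)

w      <- f.kern / h.dens              # unnormalized weights w(x) = f(x) / h(x)
w.star <- w / mean(w)                  # normalized weights, average 1

g <- x^2
mu.hat <- mean(g * w.star)             # self-normalized IS estimate of E[g(X)]
mu.hat

# the exact standard error of a ratio estimate needs more care, but the simple
# sample variance of g * w.star gives a feel for the Monte Carlo noise
sd(g * w.star) / sqrt(n)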

3. Sometimes IS is used because sampling from $h(x)$ is easier than sampling from $f(x)$.

Example of IS, Beta  Suppose $X \sim \text{Beta}(\alpha, \beta)$ with density
$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}, \qquad 0 < x < 1,$$
and we wish to compute the moment generating function (mgf) of $X$,
$$M_X(t) = E[e^{tX}] = \int_0^1 e^{tx} f(x)\, dx,$$
for which there is no closed-form solution.

Define $h(x) = 1$ for $0 < x < 1$, and $0$ otherwise; that is, $h(x)$ is the Uniform(0, 1) density. Then
$$M_X(t) = \int_0^1 e^{tx} \frac{f(x)}{1}\, h(x)\, dx = \int_0^1 e^{tx} w(x) h(x)\, dx = E_h[e^{tX} f(X)],$$
where the expectation is taken with respect to $X \sim \text{Uniform}(0, 1)$.

If $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \text{Uniform}(0, 1)$, the IS estimate is
$$\hat\mu_{IS} = \frac{1}{n}\sum_{i=1}^{n} e^{t x_i} f(x_i).$$
We can do crude MC by sampling $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \text{Beta}(\alpha, \beta)$ and computing
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} e^{t x_i}.$$

How well does this work?
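
One way to answer is to try it. The sketch below uses assumed values $\alpha = 2$, $\beta = 3$, and $t = 1$, and compares the crude and IS estimates along with their estimated standard errors; whether IS helps here depends on how peaked $f$ is relative to the uniform proposal.

# Crude MC vs. importance sampling for the Beta(alpha, beta) mgf at t
# (a sketch with assumed values alpha = 2, beta = 3, t = 1).
set.seed(5)
alpha <- 2
beta  <- 3
t     <- 1
n     <- 1e5

# crude MC: sample from the Beta density itself
x.beta   <- rbeta(n, alpha, beta)
g.crude  <- exp(t * x.beta)
mu.crude <- mean(g.crude)

# IS: sample from Uniform(0, 1), weight by w(x) = f(x) / h(x) = dbeta(x, ...) / 1
x.unif <- runif(n)
g.is   <- exp(t * x.unif) * dbeta(x.unif, alpha, beta)
mu.is  <- mean(g.is)

rbind(crude = c(estimate = mu.crude, se = sd(g.crude) / sqrt(n)),
      IS    = c(estimate = mu.is,    se = sd(g.is)    / sqrt(n)))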

1.2 Some basics on designing an MC study


Principles of experimental design apply to designing an MC study. For a given parameter $\mu$ (or set of parameters) that we wish to estimate, we need to assess

- the sample size needed to obtain a specified precision (1/variance), and
- whether the crude MC can be improved upon.

A sample size calculation requires some knowledge of uncertainty, possibly based on a "pilot study". To make things concrete, suppose we have a statistic
$$T(X) = T(X_1, X_2, \ldots, X_n)$$
for which we wish to estimate the CDF
$$\Pr[T(X) \le t] = E[1_{(T(X) \le t)}].$$
More generally, we would consider estimating $\Pr[T(X) \in C]$ for some set $C$. If we, for the moment, assume that $t$ is fixed, then all we are doing is estimating the probability
$$p = \Pr[T(X) \le t].$$


For crude MC, we generate $n$ copies $X_1, X_2, \ldots, X_n$ from the same distribution as $X$, and compute
$$\hat p = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } \{T(X_i) \le t\}}{n} = \{\text{sample proportion} \le t\}.$$
We know
$$\mathrm{Var}[\hat p] = \frac{1}{n}\mathrm{Var}[1_{(T(X_i) \le t)}] = \frac{1}{n}\, p(1 - p),$$
which can be estimated via
$$\widehat{\mathrm{Var}}[\hat p] = \frac{1}{n}\,\hat p(1 - \hat p)$$
or (a close approximation)
$$\widehat{\mathrm{Var}}[\hat p] \doteq \frac{1}{n}\left(\frac{1}{n-1}\sum_{i=1}^{n} \{1_{(T(X_i) \le t)} - \hat p\}^2\right).$$

Thus, our general results can be applied to this setting.


Given this method, how do you choose n? One approach is based on
the margin-of-error (MOE). We note that an approximate 95% CI for p
based on p̂ is
r
p̂(1 − p̂)
p̂ ± 1.96 ,
n
1.2 Some basics on designing an MC study 17

q
which implies that p is within (approximately) 2 p̂(1−p̂)n of p̂ in 95% of
samples. That is, the error on p̂ as an estimate of p is within
r
p̂(1 − p̂)
MOE = 2
n
in 95% of samples. Since p(1 − p) ≤ 0.25,
r
0.25 1
MOE ≤ 2 = √ .
n n
If we pre-specify a desired MOE, then choosing
$$\frac{1}{\sqrt{n}} = \text{MOE} \quad \text{implies} \quad n = \frac{1}{\text{MOE}^2}$$
gives the desired result. For an MOE of 0.01, we need $n = \frac{1}{0.01^2} = 100^2 = 10000$. For an MOE of 0.05, we need $n = \frac{1}{0.05^2} = 20^2 = 400$. In general, decreasing the MOE by a factor of two requires quadrupling $n$.

Note that this is a worst-case scenario. If you know $p \doteq 0.1$, then
$$\text{MOE} \doteq 2\sqrt{\frac{0.1 \times 0.9}{n}} = \frac{2(0.3)}{\sqrt{n}} = \frac{0.6}{\sqrt{n}},$$
or
$$n \doteq \frac{0.6^2}{\text{MOE}^2} = \frac{0.36}{\text{MOE}^2},$$
which reduces the necessary sample size by a factor of approximately 3 relative to using $p = 0.5$.
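
A tiny R helper capturing this calculation (a sketch; moe.n is a made-up name, not from the notes):

# required number of MC replicates for a given margin of error on a proportion
# (a sketch; moe.n is a made-up helper name)
moe.n <- function(moe, p = 0.5) {
  ceiling(4 * p * (1 - p) / moe^2)   # from MOE = 2 * sqrt(p * (1 - p) / n)
}

moe.n(0.01)            # worst case p = 0.5: 10000
moe.n(0.05)            # worst case p = 0.5: 400
moe.n(0.01, p = 0.1)   # a tail probability near 0.1: 3600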

Remark  If $p$ is "really small", that is, a tail probability, you probably want an MOE of 0.01 or less!

1.3 Using the same stream of random numbers

This can have the effect of "pairing".
Suppose again we have a univariate random variable $X$ (though the following holds for multivariate as well), and now we wish to estimate both
$$p_T = \Pr[T(X) \le t] \qquad \text{and} \qquad p_S = \Pr[S(X) \le t]$$
for two different statistics $T(X)$ and $S(X)$ and a fixed $t$.

One approach would be to use separate random samples of size $n$ and crude MC:
$$\hat p_T = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } \{T(X_i) \le t\}}{n},$$
$$\hat p_S = \frac{1}{n}\sum_{i=1}^{n} 1_{(S(X_i^*) \le t)} = \frac{\text{number of } \{S(X_i^*) \le t\}}{n}.$$
This gives
$$\mathrm{Var}[\hat p_T] = \frac{p_T(1 - p_T)}{n}, \qquad \mathrm{Var}[\hat p_S] = \frac{p_S(1 - p_S)}{n},$$
and, since the samples are independent,
$$\mathrm{Var}[\hat p_T - \hat p_S] = \frac{1}{n}\{p_T(1 - p_T) + p_S(1 - p_S)\}.$$
This is a two-independent-proportions problem.

If the goal is to estimate $p_T$ and $p_S$ but also to estimate $p_T - p_S$ accurately, then we should identify a way to make $\hat p_T$ and $\hat p_S$ positively correlated (similar to the control variate idea), since
$$\mathrm{Var}[\hat p_T - \hat p_S] = \mathrm{Var}[\hat p_T] + \mathrm{Var}[\hat p_S] - 2\,\mathrm{Cov}[\hat p_T, \hat p_S] = \frac{1}{n}\{p_T(1 - p_T) + p_S(1 - p_S)\} - 2\,\mathrm{Cov}[\hat p_T, \hat p_S].$$
If $T(X)$ and $S(X)$ are similar, just using the same stream of random numbers is often sufficient (and more efficient!).

With the same sample $X_1, X_2, \ldots, X_n$, calculate
$$\hat p_T = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } \{T(X_i) \le t\}}{n},$$
$$\hat p_S = \frac{1}{n}\sum_{i=1}^{n} 1_{(S(X_i) \le t)} = \frac{\text{number of } \{S(X_i) \le t\}}{n},$$
and
$$\hat p_T - \hat p_S = \frac{1}{n}\sum_{i=1}^{n} \{1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)}\} = \frac{1}{n}\sum_{i=1}^{n} \Delta_t\{T(X_i), S(X_i)\},$$
where
$$\Delta_t\{T(X_i), S(X_i)\} = 1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)} =
\begin{cases}
 1 & T \le t,\ S > t \\
 0 & T \le t,\ S \le t \ \text{ or } \ T > t,\ S > t \\
-1 & T > t,\ S \le t.
\end{cases}$$

Let the joint distribution of the indicators be given by the following 2-by-2 contingency table:
$$\begin{array}{c|cc|c}
 & S \le t & S > t & \\ \hline
T \le t & p_{TS} & p_{T\bar S} & p_{T} \\
T > t & p_{\bar T S} & p_{\bar T \bar S} & p_{\bar T} \\ \hline
 & p_{S} & p_{\bar S} & 1
\end{array}$$
where $p_{TS} = \Pr[T \le t, S \le t]$, $p_{T\bar S} = \Pr[T \le t, S > t]$, etc. Then

$$\begin{aligned}
E[\Delta_t\{T(X_i), S(X_i)\}] &= (1)p_{T\bar S} + (-1)p_{\bar T S} \\
&= (p_{TS} + p_{T\bar S}) - (p_{\bar T S} + p_{TS}) \\
&= p_T - p_S,
\end{aligned}$$
that is,
$$E[\hat p_T - \hat p_S] = p_T - p_S,$$
and
$$\begin{aligned}
\mathrm{Var}[\Delta_t\{T(X_i), S(X_i)\}] &= E[\Delta_t^2] - (E[\Delta_t])^2 \\
&= p_{T\bar S} + p_{\bar T S} - (p_T - p_S)^2 \\
&= p_{T\bar S} + p_{\bar T S} - \{(p_{TS} + p_{T\bar S}) - (p_{\bar T S} + p_{TS})\}^2 \\
&\;\;\vdots \quad \text{(a little work)} \\
&= p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}.
\end{aligned}$$
Thus,
$$\mathrm{Var}[\hat p_T - \hat p_S] = \frac{1}{n}\mathrm{Var}[\Delta_t\{T(X_i), S(X_i)\}] = \frac{p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}}{n}.$$

Remarks

1. This is just a paired-proportion problem. If we let $n_{TS} =$ number of $(T(X_i) \le t, S(X_i) \le t)$, $n_{T\bar S} =$ number of $(T(X_i) \le t, S(X_i) > t)$, etc., then the 2-by-2 table of counts
$$\begin{array}{c|cc|c}
 & S \le t & S > t & \\ \hline
T \le t & n_{TS} & n_{T\bar S} & n_{T} \\
T > t & n_{\bar T S} & n_{\bar T \bar S} & n_{\bar T} \\ \hline
 & n_{S} & n_{\bar S} & n
\end{array}$$
leads to estimates of cell and marginal probabilities, for example
$$\begin{array}{c|cc|c}
 & S \le t & S > t & \\ \hline
T \le t & \hat p_{TS} = n_{TS}/n & \hat p_{T\bar S} = n_{T\bar S}/n & \hat p_{T} = n_{T}/n \\
T > t & \hat p_{\bar T S} = n_{\bar T S}/n & \hat p_{\bar T \bar S} = n_{\bar T \bar S}/n & \hat p_{\bar T} = n_{\bar T}/n \\ \hline
 & \hat p_{S} = n_{S}/n & \hat p_{\bar S} = n_{\bar S}/n & 1 = n/n
\end{array}$$
Then
$$\hat p_T - \hat p_S = \frac{(n_{TS} + n_{T\bar S}) - (n_{TS} + n_{\bar T S})}{n} = \hat p_{T\bar S} - \hat p_{\bar T S}.$$
That is, the estimate of $p_T - p_S$ is based only on the cases that disagree. This is unbiased for $p_T - p_S$ with
$$\begin{aligned}
\mathrm{Var}[\hat p_T - \hat p_S] &= \mathrm{Var}[\hat p_{T\bar S} - \hat p_{\bar T S}] \\
&= \mathrm{Var}[\hat p_{T\bar S}] + \mathrm{Var}[\hat p_{\bar T S}] - 2\,\mathrm{Cov}[\hat p_{T\bar S}, \hat p_{\bar T S}] \\
&= \frac{p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}}{n}.
\end{aligned}$$
If $T(X)$ and $S(X)$ mimic each other, we expect the number (or proportion) of disagreements to be low, i.e. $p_{T\bar S} \doteq 0$ and $p_{\bar T S} \doteq 0$, leading to a very small $\mathrm{Var}[\hat p_T - \hat p_S]$ based on using the same sample of $X_i$s.

2. From the earlier results,
$$\mathrm{Var}[\hat p_T - \hat p_S] = \frac{1}{n}\mathrm{Var}[\Delta_t\{T(X_i), S(X_i)\}] = \frac{p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}}{n}.$$
We can estimate this in two ways:

1. plug in estimates of $p_{T\bar S}$ and $p_{\bar T S}$ from the contingency table, or
2. compute the sample variance of $\Delta_t\{T(X_i), S(X_i)\}$, which is easy to do if you have one column with entries $1_{(T(X_i) \le t)}$ and another with $1_{(S(X_i) \le t)}$: simply take the difference of the columns and calculate the sample variance of the differences (see the sketch below).
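
A minimal R sketch of the whole idea, under an assumed setting: for Normal(0, 1) samples of size $k$, take $T$ to be the 10%-trimmed mean and $S$ the ordinary mean, estimate $p_T$, $p_S$, and $p_T - p_S$ from the same simulated samples, and estimate $\mathrm{Var}[\hat p_T - \hat p_S]$ both ways.

# Same stream of random numbers for two statistics (a sketch, assumed setting):
# T = 10%-trimmed mean, S = ordinary mean, data Normal(0, 1), sample size k.
set.seed(6)
k  <- 20
n  <- 1e4      # MC replicates
t0 <- 0.2      # fixed quantile

x <- matrix(rnorm(n * k), nrow = n)    # one stream of samples, used for both statistics

ind.T <- apply(x, 1, function(xi) mean(xi, trim = 0.10)) <= t0
ind.S <- rowMeans(x) <= t0

p.T <- mean(ind.T)
p.S <- mean(ind.S)
c(p.T = p.T, p.S = p.S, diff = p.T - p.S)

# estimate Var[p.T - p.S] two ways:
# (1) plug-in estimates from the 2-by-2 table of (dis)agreements
p.TSbar <- mean(ind.T & !ind.S)    # T <= t, S > t
p.TbarS <- mean(!ind.T & ind.S)    # T > t, S <= t
(p.TSbar * (1 - p.TSbar) + p.TbarS * (1 - p.TbarS) + 2 * p.TSbar * p.TbarS) / n

# (2) sample variance of the per-replicate differences of indicators
var(ind.T - ind.S) / n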

1.4 Other methods

A variety of other MC techniques exist, such as

- conditioning swindles,
- Rao-Blackwellization, and
- stratified sampling.
