
Chapter 1

Monte Carlo Methods


Goals:
1. basics of Monte Carlo methods
2. design of a Monte Carlo study

1.1 Basics of Monte Carlo methods


In a previous chapter, we developed the crude Monte Carlo estimator of the expectation
$$\mu = E_\theta[g(X)].$$
Given $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta)$, with the same distribution as $X$, the strong law of large numbers (SLLN) implies that
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} g(X_i) \to \mu \quad \text{as } n \to \infty.$$
The precision of our estimate $\hat\mu$ is dictated by
$$\mathrm{Var}(\hat\mu) = \frac{1}{n}\mathrm{Var}_\theta[g(X)] \equiv \frac{\sigma_g^2}{n},$$

which can be estimated via
$$\hat\sigma_g^2 = \frac{1}{n-1}\sum_{i=1}^{n} \{g(X_i) - \hat\mu\}^2 \equiv \text{sample variance of the } g(X_i)\text{s}.$$
Note that $\hat\mu$ is unbiased, and typically in large samples
$$\hat\mu \,\dot\sim\, \text{Normal}(\mu, \hat\sigma_g^2/n).$$


The precision of $\hat\mu$ depends on $\sigma_g^2$ and $n$. We will discuss several methods that aim to increase precision, besides increasing $n$. Note that more complex methods may increase precision for a given $n$, but may incur increased programming effort or computational time. Some assessment of the trade-offs between variance reduction and added labor or cost needs to be made.
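
As a concrete illustration, here is a minimal R sketch of the crude Monte Carlo estimator, its estimated standard error, and an approximate 95% confidence interval. The target $g(x) = e^{-x}$ with $X \sim \text{Exponential}(1)$ is an assumed example, chosen only because the true value $E[e^{-X}] = 1/2$ is known for checking.

# crude Monte Carlo: estimate mu = E[g(X)] with g(x) = exp(-x), X ~ Exp(1)
# (assumed example; the true value is 1/2, so we can check the answer)
set.seed(1)
n  <- 1e4
x  <- rexp(n)               # draw X_1, ..., X_n iid from f(x | theta)
gx <- exp(-x)               # evaluate g at each draw

mu.hat     <- mean(gx)                 # crude MC estimate of mu
sigma2.hat <- var(gx)                  # sample variance of the g(X_i)s
se.hat     <- sqrt(sigma2.hat / n)     # estimated standard error of mu.hat

# approximate 95% CI from the large-sample normal approximation
c(estimate = mu.hat, lower = mu.hat - 1.96 * se.hat, upper = mu.hat + 1.96 * se.hat)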

1.1.1 Control variates


As before, suppose we wish to estimate (assuming $x$ continuous)
$$\mu \equiv E_\theta[g(X)] = \int g(x) f(x \mid \theta)\, dx.$$
If we have a $g^*(x)$ that is "similar to" $g(x)$ and for which
$$\tau \equiv E_\theta[g^*(X)] = \int g^*(x) f(x \mid \theta)\, dx$$
is known, then writing
$$\mu = \int \{g(x) - g^*(x)\} f(x \mid \theta)\, dx + \tau = E_\theta[g(X) - g^*(X)] + \tau,$$

we can use crude Monte Carlo to estimate $E_\theta[g(X) - g^*(X)]$. That is,
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} \{g(x_i) - g^*(x_i)\} + \tau = \frac{1}{n}\sum_{i=1}^{n} g(x_i) - \frac{1}{n}\sum_{i=1}^{n} g^*(x_i) + \tau,$$
with
$$\mathrm{Var}[\hat\mu] = \frac{1}{n}\mathrm{Var}_\theta[g(X) - g^*(X)] = \frac{1}{n}\left\{\mathrm{Var}_\theta[g(X)] + \mathrm{Var}_\theta[g^*(X)] - 2\,\mathrm{Cov}_\theta[g(X), g^*(X)]\right\}.$$
If $g^*(X)$ mimics $g(X)$, then $\mathrm{Var}_\theta[g(X)] \doteq \mathrm{Var}_\theta[g^*(X)]$ and
$$\mathrm{Var}[\hat\mu] \doteq \frac{1}{n}\left\{2\,\mathrm{Var}_\theta[g(X)] - 2\,\mathrm{Var}_\theta[g(X)]\,\mathrm{Corr}_\theta[g(X), g^*(X)]\right\} < \frac{1}{n}\mathrm{Var}_\theta[g(X)] \quad \text{if } \mathrm{Corr}_\theta[g(X), g^*(X)] > \tfrac{1}{2}.$$
Thus, there is a reduction in variability relative to crude MC if $\mathrm{Corr}_\theta[g(X), g^*(X)] > \tfrac{1}{2}$.

Example, median  Let $X = (X_1, \ldots, X_n)$ be a sample from some distribution with known mean $E[X_i] = \tau$. Let $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$ and let $M$ be the sample median of the $X_i$s, and suppose we wish to estimate
$$E[M] \equiv \mu \quad \text{(nonstandard notation)}$$
given $R$ samples, each giving $M_r$ and $\bar x_r$. Consider using
$$\hat\mu = \frac{1}{R}\sum_{r=1}^{R} \{M_r - \bar x_r\} + \tau$$

as the estimate. That is, use $\bar x$ as a control variate for estimating $E[M]$.
# Gamma(2, 4) distribution, with E[X]=a*b and Var[X]=a*b^2.
a <- 2
b <- 4

# true median
qgamma(0.5, a, scale=b)

## [1] 6.713

# true mean of gamma distribution


tau <- a*b
tau

## [1] 8

# sample from gamma distribution


R <- 1e4 # samples
n <- 25 # sample size
x <- matrix(rgamma(R*n, a, scale=b), ncol=n) # draw R random samples in rows

# Monte Carlo estimate of E[M]: mean and variance of the sample medians
M <- apply(x, 1, median)
c(mean(M), var(M))

## [1] 6.806 1.638

# using mean as control variate


x.bar <- apply(x, 1, mean)
c(mean(x.bar), var(x.bar))

## [1] 8.005 1.289

# Check that the correlation between our variate of interest (median)
# and our control variate (mean) is at least 1/2
cor(M, x.bar)

## [1] 0.7612

# The control-variate estimate of mu = E[M] has lower variance than the crude estimate mean(M)
mu.hat <- mean(M - x.bar) + tau
c(mu.hat, var(M - x.bar))

## [1] 6.8005 0.7149

Example, cdf  Let $X = (X_1, X_2, \ldots, X_k)$ and let $T(X)$ be some statistic. Suppose we wish to estimate
$$\mu \equiv \mu(t) = \Pr[T(X) \le t] = E[1_{(T(X) \le t)}],$$
that is, estimate the cumulative distribution function (cdf) of $T(X)$. In other words, the cdf of $T(X)$ is the probability that the statistic $T(X)$ is less than or equal to $t$, for each quantile $t$. The crude MC estimate is the empirical cdf: given $T(X_1), T(X_2), \ldots, T(X_n)$,
$$\hat\mu \equiv \hat\mu(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } (T(X_i) \le t)}{n}.$$
Suppose a statistic $S(X)$ mimics $T(X)$ and the cdf of $S(X)$,
$$\tau(t) \equiv \Pr[S(X) \le t],$$
is known. Then the control variate estimate is
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} \{1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)}\} + \tau(t),$$
where
$$1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)} =
\begin{cases}
 1 & T \le t,\ S > t \\
 0 & T \le t,\ S \le t \ \text{ or } \ T > t,\ S > t \\
-1 & T > t,\ S \le t.
\end{cases}$$


The variance reduction could be substantial.


This idea was used in the "Princeton Robustness Study"¹, which among other things considered distributional properties of the trimmed-mean t-like statistic
$$t_T = \frac{\bar x_T - \theta}{SE[\bar x_T]} \quad \text{(based on sample size } k\text{)}.$$
If the underlying population distribution is Normal with mean $\theta$, you can use
$$t = \frac{\bar x - \theta}{SE[\bar x]} \sim t_{k-1}$$
as a control variate for estimating the cdf of $t_T$.
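
A minimal R sketch of this idea under assumed settings (Normal(0, 1) data, sample size $k = 20$, 10% trimming, and a fixed quantile $t_0$): the ordinary $t$ statistic, whose $t_{k-1}$ cdf is known, serves as the control variate for the cdf of the trimmed-mean statistic. The standard error used for the trimmed mean here is a simple placeholder choice, not the one from the study.

# Control variate for a cdf (a sketch, assumed settings):
# estimate Pr[t_T <= t0] for a 10%-trimmed-mean t-like statistic when the data
# are Normal(0, 1), using the ordinary t statistic (known t_{k-1} cdf) as control.
set.seed(2)
k    <- 20      # sample size per replicate
R    <- 1e4     # MC replicates
t0   <- 1.0     # fixed quantile at which to estimate the cdf
trim <- 0.10

x <- matrix(rnorm(R * k), nrow = R)

# ordinary t statistic (control variate); its cdf tau(t0) = pt(t0, k - 1) is known
t.stat <- apply(x, 1, function(xi) mean(xi) / (sd(xi) / sqrt(k)))

# trimmed-mean t-like statistic (statistic of interest); the SE used here is a
# crude placeholder: sd of the retained observations over sqrt(k)
tT.stat <- apply(x, 1, function(xi) {
  xbarT <- mean(xi, trim = trim)
  xs    <- sort(xi)
  keep  <- xs[(floor(k * trim) + 1):(k - floor(k * trim))]
  xbarT / (sd(keep) / sqrt(k))
})

ind.T <- (tT.stat <= t0)
ind.S <- (t.stat  <= t0)

p.crude <- mean(ind.T)                              # crude MC estimate
p.cv    <- mean(ind.T - ind.S) + pt(t0, k - 1)      # control variate estimate
c(crude = p.crude, control.variate = p.cv)

# variance comparison: sample variance of the indicator vs. of the difference
c(var(ind.T), var(ind.T - ind.S)) / R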

Example, Multinomial  Suppose
$$X = (X_1, X_2, \ldots, X_k) \sim \text{Multinomial}(m, \theta),$$
where $\theta = (\theta_1, \ldots, \theta_k)$. Two standard statistics for testing $H_0: \theta_1 = \theta_{01}, \ldots, \theta_k = \theta_{0k}$ are the Pearson statistic
$$P = \sum_{i=1}^{k} \frac{(x_i - m\theta_{0i})^2}{m\theta_{0i}}$$
and the likelihood ratio statistic
$$G^2 = 2\sum_{i=1}^{k} x_i \log_e\!\left(\frac{x_i}{m\theta_{0i}}\right).$$
¹ John W. Tukey (1973). The Estimators of the Princeton Robustness Study. Princeton University, Department of Statistics.

Note that $0 \log_e(0) \equiv 0$. In large samples, both $P$ and $G^2 \,\dot\sim\, \chi^2_{k-1}$ when $H_0$ is true. One way to study the closeness of the $\chi^2_{k-1}$ approximation is through the moments: how closely do the moments of $P$ and $G^2$ match those of the $\chi^2_{k-1}$ distribution? The moments of $P$ are tractable, but the moments of $G^2$ are not. This suggests using $P$ as a control variate for estimating moments of $G^2$. For example, suppose we wish to estimate
$$E[G^2] = \mu.$$
We know
$$E[P] = E[\chi^2_{k-1}] = k - 1.$$
Thus, given $R$ multinomial samples, estimate $\mu$ via
$$\hat\mu = \frac{1}{R}\sum_{r=1}^{R} \{G^2_r - P_r\} + (k - 1),$$
where $G^2_r$ and $P_r$ are the values of $G^2$ and $P$ from the $r$th sample.
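
A short R sketch of this control-variate estimate, under assumed settings ($k = 4$ equally likely cells, $m = 30$ trials): it draws $R$ multinomial samples, computes $P_r$ and $G^2_r$ for each, and compares the crude and control-variate estimates of $E[G^2]$.

# Control variate estimate of E[G^2] using the Pearson statistic P
# (a sketch with assumed settings: k = 4 cells, theta0 uniform, m = 30 trials).
set.seed(3)
k      <- 4
m      <- 30
theta0 <- rep(1 / k, k)
R      <- 1e4

x <- rmultinom(R, size = m, prob = theta0)   # k-by-R matrix of counts
e <- m * theta0                              # expected counts under H0

P  <- colSums((x - e)^2 / e)                         # Pearson statistic per sample
G2 <- 2 * colSums(ifelse(x == 0, 0, x * log(x / e))) # LR statistic, with 0*log(0) = 0

mu.crude <- mean(G2)                     # crude MC estimate of E[G^2]
mu.cv    <- mean(G2 - P) + (k - 1)       # control variate estimate, since E[P] = k - 1
c(crude = mu.crude, control.variate = mu.cv)

# variance comparison (per-sample variances; divide by R for the variances of the estimates)
c(var(G2), var(G2 - P))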

1.1.2 Antithetic variates (AV)


Suppose we have two estimators $\hat\mu_1$ and $\hat\mu_2$ of $\mu$, each with variance $\sigma^2/n$ when based on a sample of size $n$. If the correlation $\rho$ between these estimators is negative (antithetic means "directly opposed or contrasted; mutually incompatible"), then the estimator
$$\hat\mu_{AV} = \frac{\hat\mu_1 + \hat\mu_2}{2}$$
has
$$\begin{aligned}
\mathrm{Var}[\hat\mu_{AV}] &= \tfrac{1}{4}\left\{\mathrm{Var}[\hat\mu_1] + \mathrm{Var}[\hat\mu_2] + 2\,\mathrm{Cov}[\hat\mu_1, \hat\mu_2]\right\} \\
&= \tfrac{1}{4}\left\{\mathrm{Var}[\hat\mu_1] + \mathrm{Var}[\hat\mu_2] + 2\rho\sqrt{\mathrm{Var}[\hat\mu_1]\,\mathrm{Var}[\hat\mu_2]}\right\} \\
&= \tfrac{1}{4n}\left\{\sigma^2 + \sigma^2 + 2\rho\sigma^2\right\} \\
&= \tfrac{1}{2n}\,\sigma^2(1 + \rho) \\
&< \frac{\sigma^2}{2n},
\end{aligned}$$
where the last term is the variance of either µ̂1 or µ̂2 based on a sample of
size 2n. That is, averaging the two estimators based on the same sample
of size n (necessary to make estimators correlated) is better than doubling
the sample size using either estimator individually.
Put another way, two negatively correlated estimators can be combined
to provide a more precise estimator than either estimate individually, even
when the combined estimator is based on half the number of samples.
The AV method is often difficult to implement since you need to find
negatively correlated estimators. This can often be done in situations with
certain symmetry constraints.

Example, AV  Suppose $X \sim \text{Normal}(0, 1)$ and we wish to estimate
$$\mu = E[h(X)] \quad \text{where} \quad h(X) = \frac{X}{2^X - 1}.$$
Since $-X \sim \text{Normal}(0, 1)$, the distributions of $h(X)$ and $h(-X)$ are identical, and thus $E[h(-X)] = \mu$. Based on a sample of $n = 10000$, we find the AV estimate is much more precise than either individual estimate based on $n = 20000$ samples.

# define h(x)
f.h <- function(x) {
h <- x / (2^x - 1)
return(h)
}

# sample from normal distribution


R <- 1e4 # samples
x <- rnorm(R) # draw R random samples
x2 <- rnorm(R) # double the samples for later comparison

# calculate h(x) and h(-x)


h.x <- f.h(x)
h.negx <- f.h(-x)

# these are negatively correlated, so the AV approach is profitable


cor(h.x, h.negx)

## [1] -0.9527

# AV estimate
combine.h.x <- (h.x + h.negx) / 2
mu.hat.AV <- mean(combine.h.x)
mu.hat.AV

## [1] 1.499

# sd of AV estimate
sd(combine.h.x)

## [1] 0.07764

# sd of individual estimate based on 2*R samples


h.x2 <- f.h(x2)
h.negx2 <- f.h(-x2)

sd(c(h.x, h.x2))

## [1] 0.5086

sd(c(h.negx, h.negx2))

## [1] 0.507

The AV approach combines two estimates of the same parameter as best we can, that is, by averaging them. A real gain comes about if the estimates have negative correlation.

In general, if we have estimates $\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_\ell$ of $\mu$ with covariance matrix
$$\Sigma = [\mathrm{Cov}(\hat\mu_i, \hat\mu_j)],$$
then we can use generalized least squares (GLS) to get the optimal estimate. That is, set
$$\hat\mu^* = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \\ \vdots \\ \hat\mu_\ell \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\mu + \varepsilon = \mathbf{1}\mu + \varepsilon, \qquad \mathrm{Cov}[\varepsilon] = \Sigma;$$
then the best estimate is
$$\hat\mu = (\mathbf{1}^\top \Sigma^{-1} \mathbf{1})^{-1}\, \mathbf{1}^\top \Sigma^{-1} \hat\mu^*.$$
Remarks

- Typically we estimate $\Sigma$ with $\hat\Sigma$ and plug that into $\hat\mu$;
- with two estimates of equal variance, the estimate is always the average; and
- depending on $\Sigma$, we could potentially reduce $n$ and get the same precision as using an individual estimator $\hat\mu_j$.
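
As an illustration of the GLS combination, here is a minimal R sketch under an assumed toy setting: a few unbiased estimates of the same $\mu$ with an assumed known covariance matrix. With two estimates of equal variance the GLS weights reduce to simple averaging, as noted above. The helper name gls.combine is made up for this sketch.

# GLS combination of several estimates of the same mu (a sketch, assumed toy setting).
# gls.combine() returns (1' Sigma^{-1} 1)^{-1} 1' Sigma^{-1} mu.star.
gls.combine <- function(mu.star, Sigma) {
  one <- rep(1, length(mu.star))
  w   <- solve(Sigma, one)                 # Sigma^{-1} 1
  sum(w * mu.star) / sum(w * one)          # weighted combination of the estimates
}

# assumed example: three estimates of mu with a made-up covariance matrix
mu.star <- c(1.02, 0.97, 1.10)
Sigma   <- matrix(c(0.04, 0.01, 0.00,
                    0.01, 0.05, 0.02,
                    0.00, 0.02, 0.10), nrow = 3, byrow = TRUE)
gls.combine(mu.star, Sigma)

# with two estimates of equal variance, the GLS combination is the plain average
gls.combine(c(1.02, 0.97), matrix(c(0.05, -0.03, -0.03, 0.05), nrow = 2))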

1.1.3 Importance sampling (IS)


As before, we wish to estimate
$$\mu \equiv E_\theta[g(X)] = \int g(x) f(x \mid \theta)\, dx.$$
Assume $\theta$ is fixed and let $f(x) \equiv f(x \mid \theta)$. The crude MC estimate $\hat\mu$ is unbiased with
$$\mathrm{Var}(\hat\mu) = \frac{1}{n}\mathrm{Var}_\theta[g(X)] = \frac{1}{n}\left(E_\theta[g^2(X)] - \mu^2\right) = \frac{1}{n}\left(\int g^2(x) f(x \mid \theta)\, dx - \mu^2\right).$$

Importance sampling seeks to reduce $\mathrm{Var}(\hat\mu)$ as follows. Note that for any other density $h(x)$ (with $h(x) > 0$ wherever $g(x)f(x) \neq 0$),
$$\mu = \int g(x) f(x)\, dx = \int g(x)\frac{f(x)}{h(x)} h(x)\, dx = \int g(x) w(x) h(x)\, dx = E_h[g(X) w(X)],$$
where $w(x) \equiv f(x)/h(x)$ and the expectation is with respect to $h(x)$. Thus, drawing a sample of size $n$, $X_1, X_2, \ldots, X_n$, from $h(x)$, we can use the MC estimate
$$\hat\mu_{IS} = \frac{1}{n}\sum_{i=1}^{n} g(x_i) w(x_i)$$
as an unbiased estimator of $\mu$ with
$$\mathrm{Var}(\hat\mu_{IS}) = \frac{1}{n}\mathrm{Var}_h[g(X) w(X)] = \frac{1}{n}\left(E_h[g^2(X) w^2(X)] - \mu^2\right).$$

Note that the expected value of the weight function is
$$E_h[w(X)] = \int \frac{f(x)}{h(x)} h(x)\, dx = \int f(x)\, dx = 1,$$
that is, the average weight is 1.

Since the average weight is one, some weights may be very large ($\gg 1$). IS tends to work well when $w(x)$ is large only where $g(x)$ is small. This requires the choice of $h(x)$ to be made carefully!

Remarks

1. IS is a crude MC, so we can estimate $\mathrm{Var}(\hat\mu_{IS})$ via
$$\mathrm{Var}(\hat\mu_{IS}) = \frac{\sigma_{IS}^2}{n}, \quad \text{where} \quad \sigma_{IS}^2 = \frac{1}{n-1}\sum_{i=1}^{n} \{g(x_i) w(x_i) - \hat\mu_{IS}\}^2,$$
which is the sample variance of the $g(x_i) w(x_i)$s.



2. Another IS estimate is obtained by writing
$$\mu = \frac{\int g(x) f(x)\, dx}{\int f(x)\, dx}
     = \frac{\int g(x)\frac{f(x)}{h(x)} h(x)\, dx}{\int \frac{f(x)}{h(x)} h(x)\, dx}
     = \frac{\int g(x) w(x) h(x)\, dx}{\int w(x) h(x)\, dx}
     = \frac{E_h[g(X) w(X)]}{E_h[w(X)]}. \tag{1.1}$$
This also makes sense because $E_h[w(X)] = 1$. Given $X_1, X_2, \ldots, X_n$ from $h(x)$, estimate $\mu$ via
$$\hat\mu = \frac{\frac{1}{n}\sum_{i=1}^{n} g(x_i) w(x_i)}{\frac{1}{n}\sum_{i=1}^{n} w(x_i)} = \frac{1}{n}\sum_{i=1}^{n} g(x_i) w^*(x_i),$$
where
$$w^*(x_i) = \frac{w(x_i)}{\frac{1}{n}\sum_{\ell=1}^{n} w(x_\ell)}$$
are the normalized weights.

This approach is important because we can think of $f(x)$ in (1.1) not as a density but as the kernel of a density. That is, the actual density is
$$c\,f(x) = \frac{f(x)}{\int f(x)\, dx}.$$
That is, we don't need to know the normalization constant, which makes this a useful strategy in Bayesian calculations.
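
Here is a small R sketch of the self-normalized IS estimate, using an assumed toy target: the kernel $f(x) = e^{-x^2/2}$ (a standard normal without its normalizing constant), a Uniform(-5, 5) proposal $h$, and $g(x) = x^2$, so the answer should be close to $E[X^2] = 1$.

# Self-normalized importance sampling with an unnormalized target (a sketch).
# Assumed toy setting: kernel f(x) = exp(-x^2/2) (normal kernel, constant unknown),
# proposal h = Uniform(-5, 5), g(x) = x^2; the answer should be near E[X^2] = 1
# (the truncation at +/- 5 is negligible).
set.seed(4)
n <- 1e5

x      <- runif(n, -5, 5)              # draws from the proposal h
f.kern <- exp(-x^2 / 2)                # unnormalized target kernel f(x)
h.dens <- dunif(x, -5, 5)              # proposal density h(x)

w      <- f.kern / h.dens              # unnormalized weights w(x) = f(x) / h(x)
w.star <- w / mean(w)                  # normalized weights, average 1

g <- x^2
mu.hat <- mean(g * w.star)             # self-normalized IS estimate of E[g(X)]
mu.hat

# the exact standard error of a ratio estimate needs more care, but the simple
# sample variance of g * w.star gives a feel for the Monte Carlo noise
sd(g * w.star) / sqrt(n)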

3. Sometimes IS is used because sampling from $h(x)$ is easier than sampling from $f(x)$.

Example of IS, Beta  Suppose $X \sim \text{Beta}(\alpha, \beta)$ with density
$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}, \qquad 0 < x < 1,$$
and we wish to compute the moment generating function (mgf) of $X$,
$$M_X(t) = E[e^{tX}] = \int_0^1 e^{tx} f(x)\, dx,$$
for which there is no closed-form solution.

Define $h(x) = 1$ for $0 < x < 1$, and $0$ otherwise; that is, $h(x)$ is the Uniform(0, 1) density. Then
$$M_X(t) = \int_0^1 e^{tx} \frac{f(x)}{1}\, h(x)\, dx = \int_0^1 e^{tx} w(x) h(x)\, dx = E_h[e^{tX} f(X)],$$
where the expectation is taken with respect to $X \sim \text{Uniform}(0, 1)$.

If $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \text{Uniform}(0, 1)$, the IS estimate is
$$\hat\mu_{IS} = \frac{1}{n}\sum_{i=1}^{n} e^{t x_i} f(x_i).$$
We can do crude MC by sampling $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \text{Beta}(\alpha, \beta)$ and computing
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} e^{t x_i}.$$

How well does this work?
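
One way to answer is to try it. The sketch below uses assumed values $\alpha = 2$, $\beta = 3$, and $t = 1$, and compares the crude and IS estimates along with their estimated standard errors; whether IS helps here depends on how peaked $f$ is relative to the uniform proposal.

# Crude MC vs. importance sampling for the Beta(alpha, beta) mgf at t
# (a sketch with assumed values alpha = 2, beta = 3, t = 1).
set.seed(5)
alpha <- 2
beta  <- 3
t     <- 1
n     <- 1e5

# crude MC: sample from the Beta density itself
x.beta   <- rbeta(n, alpha, beta)
g.crude  <- exp(t * x.beta)
mu.crude <- mean(g.crude)

# IS: sample from Uniform(0, 1), weight by w(x) = f(x) / h(x) = dbeta(x, ...) / 1
x.unif <- runif(n)
g.is   <- exp(t * x.unif) * dbeta(x.unif, alpha, beta)
mu.is  <- mean(g.is)

rbind(crude = c(estimate = mu.crude, se = sd(g.crude) / sqrt(n)),
      IS    = c(estimate = mu.is,    se = sd(g.is)    / sqrt(n)))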

1.2 Some basics on designing an MC study


Principles of experimental design apply to designing an MC study. For a given parameter $\mu$ (or set of parameters) that we wish to estimate, we need to assess

- the sample size needed to obtain a specified precision (1/variance), and
- whether the crude MC can be improved upon.

A sample size calculation requires some knowledge of uncertainty, possibly based on a "pilot study". To make things concrete, suppose we have a statistic
$$T(X) = T(X_1, X_2, \ldots, X_n)$$
for which we wish to estimate the CDF
$$\Pr[T(X) \le t] = E[1_{(T(X) \le t)}].$$
More generally, we would consider estimating $\Pr[T(X) \in C]$ for some set $C$. If we, for the moment, assume that $t$ is fixed, then all we are doing is estimating the probability
$$p = \Pr[T(X) \le t].$$


For crude MC, we generate $n$ copies $X_1, X_2, \ldots, X_n$ from the same distribution as $X$, and compute
$$\hat p = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } \{T(X_i) \le t\}}{n} = \{\text{sample proportion} \le t\}.$$
We know
$$\mathrm{Var}[\hat p] = \frac{1}{n}\mathrm{Var}[1_{(T(X_i) \le t)}] = \frac{1}{n}\, p(1 - p),$$
which can be estimated via
$$\widehat{\mathrm{Var}}[\hat p] = \frac{1}{n}\,\hat p(1 - \hat p)$$
or (a close approximation)
$$\widehat{\mathrm{Var}}[\hat p] \doteq \frac{1}{n}\left(\frac{1}{n-1}\sum_{i=1}^{n} \{1_{(T(X_i) \le t)} - \hat p\}^2\right).$$

Thus, our general results can be applied to this setting.


Given this method, how do you choose n? One approach is based on
the margin-of-error (MOE). We note that an approximate 95% CI for p
based on p̂ is
r
p̂(1 − p̂)
p̂ ± 1.96 ,
n
1.2 Some basics on designing an MC study 17

q
which implies that p is within (approximately) 2 p̂(1−p̂)n of p̂ in 95% of
samples. That is, the error on p̂ as an estimate of p is within
r
p̂(1 − p̂)
MOE = 2
n
in 95% of samples. Since p(1 − p) ≤ 0.25,
r
0.25 1
MOE ≤ 2 = √ .
n n
If we pre-specify a desired MOE, then choosing
$$\frac{1}{\sqrt{n}} = \text{MOE} \quad \text{implies} \quad n = \frac{1}{\text{MOE}^2}$$
gives the desired result. For an MOE of 0.01, we need $n = \frac{1}{0.01^2} = 100^2 = 10000$. For an MOE of 0.05, we need $n = \frac{1}{0.05^2} = 20^2 = 400$. In general, decreasing the MOE by a factor of two requires quadrupling $n$.

Note that this is a worst-case scenario. If you know $p \doteq 0.1$, then
$$\text{MOE} \doteq 2\sqrt{\frac{0.1 \times 0.9}{n}} = \frac{2(0.3)}{\sqrt{n}} = \frac{0.6}{\sqrt{n}},$$
or
$$n \doteq \frac{0.6^2}{\text{MOE}^2} = \frac{0.36}{\text{MOE}^2},$$
which reduces the necessary sample size by a factor of approximately 3 relative to using $p = 0.5$.
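
A tiny R helper capturing this calculation (a sketch; moe.n is a made-up name, not from the notes):

# required number of MC replicates for a given margin of error on a proportion
# (a sketch; moe.n is a made-up helper name)
moe.n <- function(moe, p = 0.5) {
  ceiling(4 * p * (1 - p) / moe^2)   # from MOE = 2 * sqrt(p * (1 - p) / n)
}

moe.n(0.01)            # worst case p = 0.5: 10000
moe.n(0.05)            # worst case p = 0.5: 400
moe.n(0.01, p = 0.1)   # a tail probability near 0.1: 3600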

Remark  If $p$ is "really small", that is, a tail probability, you probably want an MOE of 0.01 or less!

1.3 Using the same stream of random numbers

This can have the effect of "pairing".
Suppose again we have a univariate random variable $X$ (though the following holds for multivariate as well), and now we wish to estimate both
$$p_T = \Pr[T(X) \le t] \qquad \text{and} \qquad p_S = \Pr[S(X) \le t]$$
for two different statistics $T(X)$ and $S(X)$ and a fixed $t$.

One approach would be to use separate random samples of size $n$ and crude MC:
$$\hat p_T = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } \{T(X_i) \le t\}}{n},$$
$$\hat p_S = \frac{1}{n}\sum_{i=1}^{n} 1_{(S(X_i^*) \le t)} = \frac{\text{number of } \{S(X_i^*) \le t\}}{n}.$$
This gives
$$\mathrm{Var}[\hat p_T] = \frac{p_T(1 - p_T)}{n}, \qquad \mathrm{Var}[\hat p_S] = \frac{p_S(1 - p_S)}{n},$$
and, since the samples are independent,
$$\mathrm{Var}[\hat p_T - \hat p_S] = \frac{1}{n}\{p_T(1 - p_T) + p_S(1 - p_S)\}.$$
This is a two-independent-proportions problem.

If the goal is to estimate $p_T$ and $p_S$ but also to estimate $p_T - p_S$ accurately, then we should identify a way to make $\hat p_T$ and $\hat p_S$ positively correlated (similar to the control variate idea), since
$$\mathrm{Var}[\hat p_T - \hat p_S] = \mathrm{Var}[\hat p_T] + \mathrm{Var}[\hat p_S] - 2\,\mathrm{Cov}[\hat p_T, \hat p_S] = \frac{1}{n}\{p_T(1 - p_T) + p_S(1 - p_S)\} - 2\,\mathrm{Cov}[\hat p_T, \hat p_S].$$
If $T(X)$ and $S(X)$ are similar, just using the same stream of random numbers is often sufficient (and more efficient!).

With the same sample $X_1, X_2, \ldots, X_n$, calculate
$$\hat p_T = \frac{1}{n}\sum_{i=1}^{n} 1_{(T(X_i) \le t)} = \frac{\text{number of } \{T(X_i) \le t\}}{n},$$
$$\hat p_S = \frac{1}{n}\sum_{i=1}^{n} 1_{(S(X_i) \le t)} = \frac{\text{number of } \{S(X_i) \le t\}}{n},$$
and
$$\hat p_T - \hat p_S = \frac{1}{n}\sum_{i=1}^{n} \{1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)}\} = \frac{1}{n}\sum_{i=1}^{n} \Delta_t\{T(X_i), S(X_i)\},$$
where
$$\Delta_t\{T(X_i), S(X_i)\} = 1_{(T(X_i) \le t)} - 1_{(S(X_i) \le t)} =
\begin{cases}
 1 & T \le t,\ S > t \\
 0 & T \le t,\ S \le t \ \text{ or } \ T > t,\ S > t \\
-1 & T > t,\ S \le t.
\end{cases}$$

Let the joint distribution of the indicators be given by the following 2-by-2 contingency table:
$$\begin{array}{c|cc|c}
 & S \le t & S > t & \\ \hline
T \le t & p_{TS} & p_{T\bar S} & p_{T} \\
T > t & p_{\bar T S} & p_{\bar T \bar S} & p_{\bar T} \\ \hline
 & p_{S} & p_{\bar S} & 1
\end{array}$$
where $p_{TS} = \Pr[T \le t, S \le t]$, $p_{T\bar S} = \Pr[T \le t, S > t]$, etc. Then

$$\begin{aligned}
E[\Delta_t\{T(X_i), S(X_i)\}] &= (1)p_{T\bar S} + (-1)p_{\bar T S} \\
&= (p_{TS} + p_{T\bar S}) - (p_{\bar T S} + p_{TS}) \\
&= p_T - p_S,
\end{aligned}$$
that is,
$$E[\hat p_T - \hat p_S] = p_T - p_S,$$
and
$$\begin{aligned}
\mathrm{Var}[\Delta_t\{T(X_i), S(X_i)\}] &= E[\Delta_t^2] - (E[\Delta_t])^2 \\
&= p_{T\bar S} + p_{\bar T S} - (p_T - p_S)^2 \\
&= p_{T\bar S} + p_{\bar T S} - \{(p_{TS} + p_{T\bar S}) - (p_{\bar T S} + p_{TS})\}^2 \\
&\;\;\vdots \quad \text{(a little work)} \\
&= p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}.
\end{aligned}$$
Thus,
$$\mathrm{Var}[\hat p_T - \hat p_S] = \frac{1}{n}\mathrm{Var}[\Delta_t\{T(X_i), S(X_i)\}] = \frac{p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}}{n}.$$

Remarks

1. This is just a paired-proportion problem. If we let $n_{TS} =$ number of $(T(X_i) \le t, S(X_i) \le t)$, $n_{T\bar S} =$ number of $(T(X_i) \le t, S(X_i) > t)$, etc., then the 2-by-2 table of counts
$$\begin{array}{c|cc|c}
 & S \le t & S > t & \\ \hline
T \le t & n_{TS} & n_{T\bar S} & n_{T} \\
T > t & n_{\bar T S} & n_{\bar T \bar S} & n_{\bar T} \\ \hline
 & n_{S} & n_{\bar S} & n
\end{array}$$
leads to estimates of cell and marginal probabilities, for example
$$\begin{array}{c|cc|c}
 & S \le t & S > t & \\ \hline
T \le t & \hat p_{TS} = n_{TS}/n & \hat p_{T\bar S} = n_{T\bar S}/n & \hat p_{T} = n_{T}/n \\
T > t & \hat p_{\bar T S} = n_{\bar T S}/n & \hat p_{\bar T \bar S} = n_{\bar T \bar S}/n & \hat p_{\bar T} = n_{\bar T}/n \\ \hline
 & \hat p_{S} = n_{S}/n & \hat p_{\bar S} = n_{\bar S}/n & 1 = n/n
\end{array}$$
Then
$$\hat p_T - \hat p_S = \frac{(n_{TS} + n_{T\bar S}) - (n_{TS} + n_{\bar T S})}{n} = \hat p_{T\bar S} - \hat p_{\bar T S}.$$
That is, the estimate of $p_T - p_S$ is based only on the cases that disagree. This is unbiased for $p_T - p_S$ with
$$\begin{aligned}
\mathrm{Var}[\hat p_T - \hat p_S] &= \mathrm{Var}[\hat p_{T\bar S} - \hat p_{\bar T S}] \\
&= \mathrm{Var}[\hat p_{T\bar S}] + \mathrm{Var}[\hat p_{\bar T S}] - 2\,\mathrm{Cov}[\hat p_{T\bar S}, \hat p_{\bar T S}] \\
&= \frac{p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}}{n}.
\end{aligned}$$
If $T(X)$ and $S(X)$ mimic each other, we expect the number (or proportion) of disagreements to be low, i.e. $p_{T\bar S} \doteq 0$ and $p_{\bar T S} \doteq 0$, leading to a very small $\mathrm{Var}[\hat p_T - \hat p_S]$ based on using the same sample of $X_i$s.

2. From the earlier results,
$$\mathrm{Var}[\hat p_T - \hat p_S] = \frac{1}{n}\mathrm{Var}[\Delta_t\{T(X_i), S(X_i)\}] = \frac{p_{T\bar S}(1 - p_{T\bar S}) + p_{\bar T S}(1 - p_{\bar T S}) + 2 p_{T\bar S}\, p_{\bar T S}}{n}.$$
We can estimate this in two ways:

1. plug in estimates of $p_{T\bar S}$ and $p_{\bar T S}$ from the contingency table, or
2. compute the sample variance of $\Delta_t\{T(X_i), S(X_i)\}$, which is easy to do if you have one column with entries $1_{(T(X_i) \le t)}$ and another with $1_{(S(X_i) \le t)}$: simply take the difference of the columns and calculate the sample variance of the differences (see the sketch below).
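
A minimal R sketch of the whole idea, under an assumed setting: for Normal(0, 1) samples of size $k$, take $T$ to be the 10%-trimmed mean and $S$ the ordinary mean, estimate $p_T$, $p_S$, and $p_T - p_S$ from the same simulated samples, and estimate $\mathrm{Var}[\hat p_T - \hat p_S]$ both ways.

# Same stream of random numbers for two statistics (a sketch, assumed setting):
# T = 10%-trimmed mean, S = ordinary mean, data Normal(0, 1), sample size k.
set.seed(6)
k  <- 20
n  <- 1e4      # MC replicates
t0 <- 0.2      # fixed quantile

x <- matrix(rnorm(n * k), nrow = n)    # one stream of samples, used for both statistics

ind.T <- apply(x, 1, function(xi) mean(xi, trim = 0.10)) <= t0
ind.S <- rowMeans(x) <= t0

p.T <- mean(ind.T)
p.S <- mean(ind.S)
c(p.T = p.T, p.S = p.S, diff = p.T - p.S)

# estimate Var[p.T - p.S] two ways:
# (1) plug-in estimates from the 2-by-2 table of (dis)agreements
p.TSbar <- mean(ind.T & !ind.S)    # T <= t, S > t
p.TbarS <- mean(!ind.T & ind.S)    # T > t, S <= t
(p.TSbar * (1 - p.TSbar) + p.TbarS * (1 - p.TbarS) + 2 * p.TSbar * p.TbarS) / n

# (2) sample variance of the per-replicate differences of indicators
var(ind.T - ind.S) / n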

1.4 Other methods

A variety of other MC techniques exist, such as

- conditioning swindles,
- Rao-Blackwellization, and
- stratified sampling.
