Design and Analysis of Computer Experiments: Theory
Amitay Isaacs
1 Density Estimation
In density estimation, one considers a sample X1 , X2 , . . . , Xn of independent, identically
distributed (iid) random variables with common density function f , and one seeks to
estimate f from the data. Statistical estimation deals with the following issues.
3. Deriving bounds on the accuracy for a fixed “amount” n of data, and asymptotically
for n → ∞. The ultimate goal is to obtain the distribution of the accuracy, viewed
as a random variable, for fixed n or asymptotically.
4. Quantifying notions of optimality of the estimator. One would like to know the limits
on the accuracy of any estimator, and whether the estimators under consideration
reach these limits.
5. Settling the relevance or adequacy of the estimators for finite (small) n, given the
model.
6. Certifying the adequacy of the indicated model. The latter goes by the name of
“goodness-of-fit testing”.
1.1 Parametric: Maximum Likelihood Method
Suppose X1 , X2 , . . . , Xn are iid random variables with common pdf fo , and we wish to
estimate fo . Here, n is referred to as the sample size. In the parametric model, the pdf fo
is assumed to belong to a parametric family
\[
\mathcal{F} \stackrel{\mathrm{def}}{=} \{ f(\cdot\,; \theta) : \theta \in \Theta \},
\]
described by a (low-dimensional) parameter θ belonging to the set of all possible parameters
Θ. Thus, there exists a θ0 ∈ Θ such that
\[
f_o(x) = f(x; \theta_0), \qquad -\infty < x < \infty.
\]
The standard example for density estimation is the family of normal densities, parametrized
by θ = (µ, σ):
\[
f(x; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
\qquad -\infty < x < \infty.
\]
The standard method for estimating θ0 is maximum likelihood estimation. In this method,
the estimator (assuming it exists) is any value of θ for which the likelihood is maximal,
which amounts to solving
\[
\text{maximize } \prod_{i=1}^{n} f(X_i; \theta) \quad \text{subject to } \theta \in \Theta,
\]
but it is more natural to consider the averaged logarithm of the likelihood, written as
\[
\frac{1}{n} \sum_{i=1}^{n} \log f(X_i; \theta) = \int_{\mathbb{R}} \log f(x; \theta) \, dF_n(x),
\]
where F_n denotes the empirical distribution function of the sample.
The reason for this is that, under reasonable conditions, the log-likelihood converges as
n → ∞:
\[
\int_{\mathbb{R}} \log f(x; \theta) \, dF_n(x) \longrightarrow \int_{\mathbb{R}} \log f(x; \theta) \, dF_o(x)
\qquad (\theta \text{ fixed}),
\]
almost surely, by the strong law of large numbers. Thus, the maximum likelihood estimation
problem is equivalent to
\[
\text{minimize } -\int_{\mathbb{R}} \log f(x; \theta) \, dF_n(x) \quad \text{subject to } \theta \in \Theta.
\]
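For instance, for the normal family introduced above, this minimization can be solved in
closed form:
\[
\mu_n = \frac{1}{n} \sum_{i=1}^{n} X_i , \qquad
\sigma_n^2 = \frac{1}{n} \sum_{i=1}^{n} ( X_i - \mu_n )^2 ,
\]
the sample mean and the sample variance with divisor n.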
The questions of interest are whether the above problem has a unique solution θn , how
it may be computed, and what can be said about the error θn − θ0 or f (·; θn ) − f (·; θ0 ).
One reason for the popularity of maximum likelihood estimation is that under reasonable
conditions,
\[
\sqrt{n}\,(\theta_n - \theta_0) \to_d Y \sim N(0, \sigma^2),
\]
i.e., √n (θn − θ0) converges in distribution to a normally distributed random variable Y,
with mean 0 and variance σ² given via
\[
\frac{1}{\sigma^2} = \int_{\mathbb{R}} \frac{|g(x; \theta_0)|^2}{f(x; \theta_0)} \, dx,
\]
where g(x; θ) denotes the partial derivative of f (x; θ) with respect to θ.
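As a concrete illustration, the averaged negative log-likelihood can be minimized numerically.
The sketch below does this for the normal family; the simulated sample, the starting point,
and the choice of optimizer are assumptions made for the example, not part of the text above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)   # simulated sample, true theta_0 = (2.0, 1.5)

def neg_avg_loglik(theta, data):
    """Averaged negative log-likelihood: -(1/n) sum_i log f(X_i; theta)."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.mean(norm.logpdf(data, loc=mu, scale=sigma))

# Minimize over theta = (mu, sigma); Nelder-Mead avoids the need for gradients.
res = minimize(neg_avg_loglik, x0=np.array([0.0, 1.0]), args=(X,), method="Nelder-Mead")
mu_n, sigma_n = res.x
print(mu_n, sigma_n)   # close to the sample mean and sample standard deviation
```

For this family the numerical minimizer coincides (up to optimizer tolerance) with the
closed-form estimates noted above.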
1.2 Parametric: Least Squares Method
The idea of least-squares estimation is that, if fo were known, an ideal choice of θ would
minimize
\[
\int_{\mathbb{R}} | f(x; \theta) - f_o(x) |^2 \, dx,
\]
the latter being equal to 0 if the model is correct. Because fo is unknown, this method
cannot be used. However, by rewriting the above equation as
\[
\int_{\mathbb{R}} | f(x; \theta) |^2 \, dx
- 2 \int_{\mathbb{R}} f(x; \theta) \, dF_o(x)
+ \int_{\mathbb{R}} | f_o(x) |^2 \, dx,
\]
we see that the first term can be computed exactly, and that the last term is independent
of θ. The second term may be estimated by
\[
-2 \int_{\mathbb{R}} f(x; \theta) \, dF_n(x),
\]
so the least-squares estimator θn is taken to minimize, over θ ∈ Θ,
\[
\int_{\mathbb{R}} | f(x; \theta) |^2 \, dx - 2 \int_{\mathbb{R}} f(x; \theta) \, dF_n(x).
\]
Under reasonable conditions, the error θn − θo is normally distributed with mean 0 and
variance σ² given by
\[
\sigma^2 = \frac{ \displaystyle\int_{\mathbb{R}} \big| g(x; \theta_o) - \mathbb{E}\, g(X; \theta_o) \big|^2 \, f(x; \theta_o) \, dx }
                { \left( \displaystyle\int_{\mathbb{R}} | g(x; \theta_o) |^2 \, dx \right)^{2} },
\]
where E g(X; θo) denotes ∫ g(x; θo) f(x; θo) dx.
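A sketch of this least-squares criterion for the normal family, where the first term
∫ f(x; θ)² dx has the closed form 1/(2σ√π); the simulated sample and the optimizer are
again illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.5, size=500)   # simulated sample

def ls_criterion(theta, data):
    """Least-squares criterion: int f(x;theta)^2 dx - 2 * (1/n) sum_i f(X_i; theta)."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    first = 1.0 / (2.0 * sigma * np.sqrt(np.pi))            # closed form of int f^2 for N(mu, sigma^2)
    second = -2.0 * np.mean(norm.pdf(data, loc=mu, scale=sigma))
    return first + second

res = minimize(ls_criterion, x0=np.array([0.0, 1.0]), args=(X,), method="Nelder-Mead")
print(res.x)   # least-squares estimate of (mu, sigma)
```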
2 Least Squares Estimation
2.1 Ordinary Least Squares
A linear statistical model is of the form
\[
y = \beta_0 + \beta_1 \phi_1(x) + \beta_2 \phi_2(x) + \cdots + \beta_k \phi_k(x) + \varepsilon,
\]
where y is a random variable called the response; φ1 (x), φ2 (x), . . . , φk (x) are mathematical
variables called regressors; x is the vector of parameters whose values are controlled (or at
least accurately observed) by the experimenter; ε is a random variable that accounts for
unexplained random variation in the response; and β0 , β1 , . . . , βk are parameters whose exact
values are not known and must be estimated from the experimental data. The random errors
are assumed to be normally distributed,
\[
\varepsilon \sim N(0, \sigma^2 I).
\]
Taking expectations in the model, E[yi ] = β0 + β1 φ1 (xi ) + · · · + βk φk (xi ), so that
\[
\varepsilon_i = y_i - E[y_i].
\]
From this it can be seen that the ith random error is the difference between the ith response
and its mean. Since the average response depends on β0 , β1 , . . . , βk , whose values are not
known, the random errors cannot be observed from a sample. Supposing b0 , b1 , . . . , bk are
estimators for β0 , β1 , . . . , βk , we can observe residuals: the estimators are used to
approximate E[yi ] as
\[
\hat{y}_i = b_0 + b_1 \phi_1(x_i) + \cdots + b_k \phi_k(x_i),
\]
and the ith residual is ei = yi − ŷi .
The method of least squares is to choose estimators in such a way that the sum of
the squares of the residuals is minimized. To obtain the least squares estimators for
β0 , β1 , . . . , βk in the linear model, first express the responses in terms of the residuals
as shown:
\[
\begin{aligned}
y_1 &= b_0 + b_1 \phi_1(x_1) + b_2 \phi_2(x_1) + \cdots + b_k \phi_k(x_1) + e_1 \\
y_2 &= b_0 + b_1 \phi_1(x_2) + b_2 \phi_2(x_2) + \cdots + b_k \phi_k(x_2) + e_2 \\
&\;\;\vdots \\
y_n &= b_0 + b_1 \phi_1(x_n) + b_2 \phi_2(x_n) + \cdots + b_k \phi_k(x_n) + e_n
\end{aligned}
\]
These equations can be written in the matrix form as
y = Φb + e
The sum of squared residuals is e′e = (y − Φb)′(y − Φb). Differentiating with respect to b,
\[
\frac{\partial\, e'e}{\partial b}
= -2 \Phi' y + (\Phi'\Phi) b + (\Phi'\Phi)' b
= -2 \Phi' y + 2 (\Phi'\Phi) b .
\]
Setting this derivative equal to zero gives the normal equations
\[
(\Phi'\Phi) b = \Phi' y .
\]
We can solve for b to get
\[
b = (\Phi'\Phi)^{-1} \Phi' y .
\]
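A minimal numerical sketch of the normal equations, assuming a simple polynomial basis
φ1(x) = x, φ2(x) = x²; the data, the noise level, and the use of NumPy are illustrative
choices rather than part of the text.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)   # illustrative data

# Design matrix Phi with columns [1, phi_1(x), phi_2(x)] for phi_1(x) = x, phi_2(x) = x^2.
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Solve the normal equations (Phi' Phi) b = Phi' y.
b = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Equivalently, and more numerically stable, use a least-squares solver.
b_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(b, b_lstsq)
```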
2.2 Generalized Least Squares
Earlier we assumed that the errors ε1 , ε2 , . . . are uncorrelated with common variance.
In many situations one or both of these assumptions are unrealistic. Here we assume that
the covariance matrix of ε is an n × n positive definite matrix Σ and that ε follows a
multivariate normal distribution,
\[
\varepsilon \sim N_n(0, \Sigma).
\]
The joint density for ε1 , ε2 , . . . , εn is given by
\[
f(\varepsilon; \beta, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}}
\exp\!\left( -\tfrac{1}{2}\, \varepsilon' \Sigma^{-1} \varepsilon \right).
\]
To find the maximum likelihood estimator of β, we must find the vector b∗ that maximizes this
function. It can be shown that to maximize f we must minimize the term ε′Σ⁻¹ε.
This term is similar to the residual sum of squares that is minimized in ordinary least
squares. Differentiating this term with respect to b∗,
\[
\frac{\partial\, \varepsilon' \Sigma^{-1} \varepsilon}{\partial b^*}
= \frac{\partial (y - \Phi b^*)' \Sigma^{-1} (y - \Phi b^*)}{\partial b^*}
= \frac{\partial}{\partial b^*} \left[ y' \Sigma^{-1} y - 2 b^{*\prime} \Phi' \Sigma^{-1} y + b^{*\prime} \Phi' \Sigma^{-1} \Phi b^* \right]
= -2 \Phi' \Sigma^{-1} y + 2 \Phi' \Sigma^{-1} \Phi b^* .
\]
Setting this derivative equal to zero gives the generalized least squares estimator
\[
b^* = (\Phi' \Sigma^{-1} \Phi)^{-1} \Phi' \Sigma^{-1} y .
\]
When Σ is diagonal with entries σ1², . . . , σn², each residual yi − ŷi is weighted by 1/σi, the
reciprocal of its standard deviation. In this way, a point drawn from a population with small
variability receives a heavy weight; one drawn from a population with a large degree of
variability receives less importance in estimating β.
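A minimal sketch of the generalized least squares estimator, assuming a diagonal Σ so that the
weighting by 1/σi is explicit; the design, the noise pattern, and the sample size are
illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x])      # illustrative regressors [1, x]
sigmas = 0.05 + 0.5 * x                          # heteroscedastic standard deviations
Sigma = np.diag(sigmas**2)                       # diagonal error covariance matrix
y = Phi @ np.array([1.0, 2.0]) + rng.normal(scale=sigmas)

# Generalized least squares: b* = (Phi' Sigma^-1 Phi)^-1 Phi' Sigma^-1 y.
Sigma_inv = np.linalg.inv(Sigma)
b_gls = np.linalg.solve(Phi.T @ Sigma_inv @ Phi, Phi.T @ Sigma_inv @ y)

# Equivalent "whitening" view: divide each row by sigma_i and run ordinary least squares.
W = np.diag(1.0 / sigmas)
b_wls, *_ = np.linalg.lstsq(W @ Phi, W @ y, rcond=None)
print(b_gls, b_wls)
```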
3 Maximum Likelihood Estimation
Consider maximum likelihood estimation of the mean µ in the multivariate normal model with
identity covariance. The negative log-likelihood Ln (µ) splits into a term that depends on µ
and “other terms” that are independent of µ. It can be verified that, apart from the “other
terms”,
\[
L_n(\mu) = \frac{1}{2} \Big\| \mu - \frac{1}{n} \sum_{i=1}^{n} X_i \Big\|^2
+ \frac{1}{n} \sum_{i=1}^{n} \| X_i \|^2
- \frac{1}{n^2} \Big\| \sum_{i=1}^{n} X_i \Big\|^2 .
\]
It follows that the mle of µ is
\[
\mu_n = \frac{1}{n} \sum_{i=1}^{n} X_i .
\]
So, in this case, the mle exists and is unique.
Next, consider the linear model X ∼ N(Zβ, σ²I), with density
\[
f(X; \beta, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}}
\exp\!\left( -\frac{\| X - Z\beta \|^2}{2\sigma^2} \right),
\qquad X \in \mathbb{R}^n .
\]
It is clear that the optimal β is not influenced by the choice of σ, so that βn solves the
linear least-squares problem
\[
\text{minimize } \| X - Z\beta \|^2 \quad \text{over } \beta .
\]
It follows that the solution βn is unique if and only if Z has full column rank. In that case,
\[
\beta_n = (Z^T Z)^{-1} Z^T X .
\]
Taking β = βn and solving the minimization problem for σn , we get a unique solution given by
\[
\sigma_n^2 = \frac{1}{n} \| Z\beta_n - X \|^2 .
\]
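A short numerical sketch of these two maximum likelihood estimators; the design matrix Z, the
true coefficients, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
Z = rng.normal(size=(n, p))                  # illustrative full-column-rank design matrix
beta_true = np.array([1.0, -2.0, 0.5])
X = Z @ beta_true + rng.normal(scale=0.3, size=n)

# MLE of beta: the least-squares solution (Z^T Z)^{-1} Z^T X.
beta_n, *_ = np.linalg.lstsq(Z, X, rcond=None)

# MLE of sigma^2: (1/n) ||Z beta_n - X||^2 (divisor n, not n - p).
sigma2_n = np.sum((Z @ beta_n - X) ** 2) / n
print(beta_n, sigma2_n)
```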
4 DACE
The DACE model is a combination of the linear model and the multivariate normal model for
the random variable X,
X ∼ Nn (Zβo , Σ),
where Z ∈ R^{n×p} is a known matrix, βo is a vector of unknown parameters, and Σ = σo² R is
the covariance matrix, with σo² a scalar and R the unknown correlation matrix. The correlation
between two random points X1 and X2 is given by
R = R(X1 , X2 ; θ),
where θ are the unknown parameters. The parametric family of distributions is given by