
Design and Analysis of Computer Experiments: Theory

Amitay Isaacs

1 Density Estimation
In density estimation, one considers a sample X1 , X2 , . . . , Xn of independent, identically
distributed (iid) random variables with common density function f , and one seeks to
estimate f from the data. Statistical estimation deals with the following issues.

1. The construction of estimators.

2. Quantifying the notion of accuracy of the estimator.

3. Deriving bounds on the accuracy for a fixed “amount” n of data, and asymptotically
for n → ∞. The ultimate goal is to obtain the distribution of the accuracy, viewed
as a random variable, for fixed n or asymptotically.

4. Quantifying notions of optimality of the estimator. One would like to know the limits
on the accuracy of any estimator, and whether the estimators under consideration
reach these limits.

5. Settling the relevance or adequacy of the estimators for finite (small) n, given the
model.

6. Certifying the adequacy of the indicated model. The latter goes by the name of
“goodness-of-fit testing”.

7. Finally, the actual computation of the estimators.

1.1 Parametric: Maximum Likelihood Estimation


To illustrate the problem of density estimation, let f_0 be a univariate probability density
function (pdf), with corresponding cumulative distribution function (cdf) F_0. The simplest
density estimation problem is the one where the data
\[
X_1, X_2, \ldots, X_n \ \text{are iid, univariate random variables with common density } f_0(x),
\]
and we wish to estimate f_0. Here, n is referred to as the sample size. In the parametric
model, the pdf f_0 is assumed to belong to a parametric family
\[
\mathcal{F} \stackrel{\text{def}}{=} \{ f(\cdot\,; \theta) : \theta \in \Theta \},
\]
described by a (low-dimensional) parameter θ belonging to the set of all possible parameters
Θ. Thus, there exists a θ_0 ∈ Θ such that
\[
f_0(x) = f(x; \theta_0), \qquad -\infty < x < \infty.
\]
The standard example for density estimation is the family of normal densities, parametrized
by θ = (µ, σ),
\[
f(x; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right),
\qquad -\infty < x < \infty.
\]
The standard method for estimating θ_0 is maximum likelihood estimation. In this
method, the estimator (assuming it exists) is any value of θ for which the likelihood is
maximal, which amounts to solving
\[
\begin{aligned}
\text{maximize} \quad & \prod_{i=1}^{n} f(X_i; \theta) \\
\text{subject to} \quad & \theta \in \Theta,
\end{aligned}
\]
but it is more natural to consider the averaged logarithm of the likelihood, written as
\[
\frac{1}{n} \sum_{i=1}^{n} \log f(X_i; \theta) = \int_{\mathbb{R}} \log f(x; \theta) \, dF_n(x),
\]
where F_n denotes the empirical distribution function of the sample.

The reason for this is that, under reasonable conditions, the log-likelihood converges as
n → ∞,
\[
\int_{\mathbb{R}} \log f(x; \theta) \, dF_n(x) \longrightarrow \int_{\mathbb{R}} \log f(x; \theta) \, dF_0(x) \qquad (\theta \ \text{fixed}),
\]
almost surely, by the strong law of large numbers. Thus, the maximum likelihood estimation
problem is equivalent to
\[
\begin{aligned}
\text{minimize} \quad & -\int_{\mathbb{R}} \log f(x; \theta) \, dF_n(x) \\
\text{subject to} \quad & \theta \in \Theta.
\end{aligned}
\]
The questions of interest are whether the above problem has a unique solution θ_n, how
it may be computed, and what can be said about the error θ_n − θ_0 or f(·; θ_n) − f(·; θ_0).
One reason for the popularity of maximum likelihood estimation is that, under reasonable
conditions,
\[
\sqrt{n}\,(\theta_n - \theta_0) \xrightarrow{d} Y \sim N(0, \sigma^2),
\]
i.e., √n (θ_n − θ_0) converges in distribution to a normally distributed random variable Y,
with mean 0 and variance σ² given via
\[
\frac{1}{\sigma^2} = \int_{\mathbb{R}} \frac{|g(x; \theta_0)|^2}{f(x; \theta_0)} \, dx,
\]
where g(x; θ) denotes the partial derivative of f(x; θ) with respect to θ.
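
As an illustration (not part of the original text), the maximum likelihood estimate for the
normal family above can be computed numerically by minimizing the averaged negative
log-likelihood. The Python sketch below assumes NumPy and SciPy are available; the sample,
starting point, and true parameter values are arbitrary.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # iid sample; true theta_0 = (2.0, 1.5)

def neg_log_likelihood(theta, data):
    """Averaged negative log-likelihood of the normal family f(x; mu, sigma)."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (data - mu)**2 / (2 * sigma**2))

# Minimize over theta = (mu, sigma); the starting point is arbitrary.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(x,), method="Nelder-Mead")
mu_n, sigma_n = result.x
print(mu_n, sigma_n)   # close to the sample mean and the (biased) sample standard deviation

The numerical minimizer agrees with the closed-form normal mle (the sample mean and root
mean squared deviation), which provides a sanity check for the numerical approach.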

1.2 Parametric: Least Squares Method
The idea of least-squares estimation is that, if f_0 were known, an ideal choice of θ
would minimize
\[
\int_{\mathbb{R}} |f(x; \theta) - f_0(x)|^2 \, dx,
\]
the latter being equal to 0 if the model is correct. Because f_0 is unknown, this method
cannot be used directly. However, by rewriting the above expression as
\[
\int_{\mathbb{R}} |f(x; \theta)|^2 \, dx - 2 \int_{\mathbb{R}} f(x; \theta) \, dF_0(x) + \int_{\mathbb{R}} |f_0(x)|^2 \, dx,
\]
we see that the first term can be computed exactly, and that the last term is independent
of θ. The second term may be estimated by
\[
-2 \int_{\mathbb{R}} f(x; \theta) \, dF_n(x),
\]

and so the least-squares estimator of the parameter θ_0 is the solution to
\[
\begin{aligned}
\text{minimize} \quad & -2 \int_{\mathbb{R}} f(x; \theta) \, dF_n(x) + \int_{\mathbb{R}} |f(x; \theta)|^2 \, dx \\
\text{subject to} \quad & \theta \in \Theta.
\end{aligned}
\]
Under reasonable conditions, √n (θ_n − θ_0) converges in distribution to a normal random
variable with mean 0 and variance σ² given by
\[
\sigma^2 = \frac{\displaystyle\int_{\mathbb{R}} |g(x; \theta_0) - E|^2 \, f(x; \theta_0) \, dx}{\left( \displaystyle\int_{\mathbb{R}} |g(x; \theta_0)|^2 \, dx \right)^2},
\]
where g(x; θ) is the partial derivative of f(x; θ) with respect to θ, and
\[
E = \int_{\mathbb{R}} g(x; \theta_0) \, f(x; \theta_0) \, dx.
\]
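
As a companion sketch (an addition, not from the original notes), the least-squares criterion
can be evaluated for the normal family by estimating the second term with the sample average
and computing the first term by quadrature; NumPy and SciPy are assumed, and the data are
arbitrary.

import numpy as np
from scipy import integrate, optimize, stats

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # iid sample from the (unknown) f_0

def ls_criterion(theta, data):
    """-2 * integral of f dF_n plus integral of f^2 dx, for the normal family."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    f = stats.norm(loc=mu, scale=sigma).pdf
    term1 = -2.0 * np.mean(f(data))                                # -2 * (1/n) sum f(X_i; theta)
    term2, _ = integrate.quad(lambda t: f(t)**2, -np.inf, np.inf)  # integral of f^2 over R
    return term1 + term2

result = optimize.minimize(ls_criterion, x0=[0.0, 1.0], args=(x,), method="Nelder-Mead")
print(result.x)   # least-squares estimates of (mu, sigma)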

1.3 Nonparametric Estimation


If, for whatever reason, a parametric model for f_0 is not forthcoming, we are dealing with what is
customarily referred to as a nonparametric estimation problem. Another way of saying this
is that one wishes to make as few assumptions as possible about the density f_0. Without
any assumptions, the density f_0 is just a nonnegative, integrable function with integral
equal to 1. Thus, it is an infinite-dimensional object, and infinitely many parameters are
required to describe f_0.

2 Least Squares Estimation
2.1 Ordinary Least Squares
A linear statistical model is of the form

y = β0 + β1 φ1 (x) + β2 φ2 (x) + · · · + βk φk (x) + ε

where y is a random variable called the response; φ1(x), φ2(x), . . . , φk(x) are mathematical
variables called regressors; x is the vector of input variables whose values are controlled (or at
least accurately observed) by the experimenter; ε is a random variable that accounts for
unexplained random variation in the response; and β0, β1, . . . , βk are parameters whose exact
values are not known and must be estimated from the experimental data. The distribution of
the vector of random errors ε can be written as
\[
\varepsilon \sim N(0, \sigma^2 I).
\]

For a response yi corresponding to xi, the model gives

yi = β0 + β1 φ1 (xi ) + β2 φ2 (xi ) + · · · + βk φk (xi ) + εi

Taking expectations,

E[yi ] = β0 + β1 φ1 (xi ) + β2 φ2 (xi ) + · · · + βk φk (xi ) + E[εi ]

Since E[εi ] = 0, we can rewrite the model in the form

εi = yi − E[yi ]

From this it can be seen that the ith random error is the difference between the ith response
and its mean. Since the average response depends on β0 , β1 , . . . , βk whose values are not
known, the random errors cannot be observed from a sample. Supposing b0, b1, . . . , bk are
estimators for β0, β1, . . . , βk, we can observe residuals. These estimators can be used to
approximate E[yi] as
\[
\widehat{E[y_i]} = b_0 + b_1 \phi_1(x_i) + b_2 \phi_2(x_i) + \cdots + b_k \phi_k(x_i).
\]
The ith residual is denoted by ei and is given by
\[
e_i = y_i - \bigl( b_0 + b_1 \phi_1(x_i) + b_2 \phi_2(x_i) + \cdots + b_k \phi_k(x_i) \bigr)
    = y_i - \widehat{E[y_i]}.
\]

The method of least squares is to choose estimators in such a way that the sum of
the squares of the residuals is minimized. To obtain the least squares estimators for

β0 , β1 , . . . , βk in the linear model, first express the responses in terms of the residuals
as shown:
\[
\begin{aligned}
y_1 &= b_0 + b_1 \phi_1(x_1) + b_2 \phi_2(x_1) + \cdots + b_k \phi_k(x_1) + e_1 \\
y_2 &= b_0 + b_1 \phi_1(x_2) + b_2 \phi_2(x_2) + \cdots + b_k \phi_k(x_2) + e_2 \\
    &\;\;\vdots \\
y_n &= b_0 + b_1 \phi_1(x_n) + b_2 \phi_2(x_n) + \cdots + b_k \phi_k(x_n) + e_n
\end{aligned}
\]
These equations can be written in the matrix form as

y = Φb + e

We want to minimize the sum of squares of the residuals,
\[
e'e = \sum_{i=1}^{n} e_i^2 .
\]

We can write this as


\[
e'e = (y - \Phi b)'(y - \Phi b).
\]
Simplifying,
\[
\begin{aligned}
e'e &= (y' - b'\Phi')(y - \Phi b) \\
    &= y'y - y'\Phi b - b'\Phi'y + b'\Phi'\Phi b.
\end{aligned}
\]
Since b′Φ′y is 1 × 1, b′Φ′y = (b′Φ′y)′ = y′Φb. By substitution,
\[
\begin{aligned}
e'e &= y'y - 2y'\Phi b + b'(\Phi'\Phi)b \\
    &= y'y - 2(\Phi'y)'b + b'(\Phi'\Phi)b.
\end{aligned}
\]

To minimize e′e, we differentiate the expression with respect to b and set the result to zero:
\[
\begin{aligned}
\frac{\partial\, e'e}{\partial b} &= -2\Phi'y + (\Phi'\Phi)b + (\Phi'\Phi)'b \\
                                  &= -2\Phi'y + 2(\Phi'\Phi)b.
\end{aligned}
\]
Setting the derivative to zero, we obtain
\[
-2\Phi'y + 2(\Phi'\Phi)b = 0,
\]
or
\[
(\Phi'\Phi)b = \Phi'y .
\]
We can solve for b to get
\[
b = (\Phi'\Phi)^{-1}\Phi'y .
\]
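
A minimal numerical sketch of this computation (added for illustration; the quadratic
regressor basis and the simulated data are assumptions, not from the text): build Φ from the
regressors and solve the normal equations, here via np.linalg.lstsq, which is preferred over
forming (Φ′Φ)⁻¹ explicitly.

import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0.0, 1.0, size=n)

def design_matrix(x):
    # Columns: intercept, phi_1(x) = x, phi_2(x) = x^2 (an assumed basis for illustration).
    return np.column_stack([np.ones_like(x), x, x**2])

beta_true = np.array([1.0, -2.0, 3.0])
y = design_matrix(x) @ beta_true + rng.normal(scale=0.1, size=n)   # y = Phi*beta + eps

Phi = design_matrix(x)
b, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # solves (Phi' Phi) b = Phi' y
print(b)   # least-squares estimates of (beta_0, beta_1, beta_2)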

2.2 Generalized Least Squares
Earlier we assumed that the errors ε1, ε2, . . . are uncorrelated with common variance.
In many situations one or both of these assumptions is unrealistic. Here we assume that
the covariance matrix of ε is an n × n positive definite matrix Σ, and that ε follows a
multivariate normal distribution,
\[
\varepsilon \sim N_n(0, \Sigma).
\]
The joint density of ε1, ε2, . . . , εn is then given by
\[
f(\varepsilon; \beta, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2}\, \varepsilon' \Sigma^{-1} \varepsilon \right).
\]
To find the maximum likelihood estimator of β we must find the vector b* that maximizes this
function. It can be shown that to maximize f we must minimize the quadratic form
\[
\varepsilon' \Sigma^{-1} \varepsilon = (y - \Phi b^*)' \Sigma^{-1} (y - \Phi b^*).
\]
This term is similar to the residual sum of squares that is minimized in ordinary least
squares. Differentiating this term with respect to b*,
\[
\begin{aligned}
\frac{\partial\, \varepsilon' \Sigma^{-1} \varepsilon}{\partial b^*}
  &= \frac{\partial\, (y - \Phi b^*)' \Sigma^{-1} (y - \Phi b^*)}{\partial b^*} \\
  &= \frac{\partial}{\partial b^*} \left( y' \Sigma^{-1} y - 2\, (b^*)' \Phi' \Sigma^{-1} y + (b^*)' \Phi' \Sigma^{-1} \Phi b^* \right) \\
  &= -2\Phi' \Sigma^{-1} y + 2\Phi' \Sigma^{-1} \Phi b^*.
\end{aligned}
\]

Setting this derivative equal to zero, we obtain
\[
\Phi' \Sigma^{-1} \Phi b^* = \Phi' \Sigma^{-1} y .
\]
Solving for b*, we get
\[
b^* = (\Phi' \Sigma^{-1} \Phi)^{-1} \Phi' \Sigma^{-1} y .
\]
The estimator b∗ is called the generalized least squares estimator.
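
The generalized least squares estimator can be computed without explicitly inverting Σ by
whitening with a Cholesky factor: since Σ = LL′, minimizing (y − Φb*)′Σ⁻¹(y − Φb*) is the
same as an ordinary least-squares fit of L⁻¹y on L⁻¹Φ. The NumPy sketch below is
illustrative only; the covariance model and data are assumptions.

import numpy as np

def gls(Phi, y, Sigma):
    """Generalized least squares: solve (Phi' Sigma^-1 Phi) b = Phi' Sigma^-1 y."""
    L = np.linalg.cholesky(Sigma)          # Sigma = L L'
    Phi_w = np.linalg.solve(L, Phi)        # whitened regressors, L^-1 Phi
    y_w = np.linalg.solve(L, y)            # whitened response, L^-1 y
    b, *_ = np.linalg.lstsq(Phi_w, y_w, rcond=None)
    return b

# Small example with correlated errors (the covariance form and values are arbitrary).
rng = np.random.default_rng(3)
n = 30
x = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([np.ones(n), x])
Sigma = 0.05 * np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)
eps = rng.multivariate_normal(np.zeros(n), Sigma)
y = Phi @ np.array([1.0, 2.0]) + eps
print(gls(Phi, y, Sigma))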

2.3 Weighted Least Squares


An important special case of generalized least squares occurs when the variance-covariance
matrix of ε is diagonal with entries σ_1^2, σ_2^2, . . . , σ_n^2. Here we assume that ε1, ε2, . . . , εn are
uncorrelated but do not have a common variance. This situation occurs frequently
in scientific studies when the variation in the response depends on the magnitude of
the x variables. When generalized least squares is used in this context, the technique is
referred to as weighted regression or weighted least squares. In this case
\[
\varepsilon' \Sigma^{-1} \varepsilon = (y - \Phi b^*)' \Sigma^{-1} (y - \Phi b^*)
= \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{\sigma_i} \right)^{2}.
\]

Each residual yi − ŷi is weighted by 1/σi, the reciprocal of its standard deviation. In this way,
a point drawn from a population with small variability receives a heavy weight; one drawn
from a population with a large degree of variability receives less importance in estimating
β.
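
In this diagonal case the computation reduces to rescaling each row of Φ and y by 1/σi and
then solving an ordinary least-squares problem. A short NumPy sketch (the heteroscedastic
error model and data below are assumed for illustration):

import numpy as np

def wls(Phi, y, sigma):
    """Weighted least squares: weight row i of Phi and y by 1/sigma_i."""
    w = 1.0 / np.asarray(sigma)        # reciprocal standard deviations
    Phi_w = Phi * w[:, None]
    y_w = y * w
    b, *_ = np.linalg.lstsq(Phi_w, y_w, rcond=None)
    return b

rng = np.random.default_rng(4)
n = 40
x = np.linspace(1.0, 5.0, n)
sigma = 0.1 * x                        # error spread grows with the magnitude of x
Phi = np.column_stack([np.ones(n), x])
y = Phi @ np.array([0.5, 1.5]) + rng.normal(scale=sigma)
print(wls(Phi, y, sigma))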

3 Maximum Likelihood Estimation


We are interested in maximum likelihood estimation for parametric families of probability
density functions, typically involving only a small number of scalar parameters. The specific
problem is as follows. Let X = (X_1, X_2, . . . , X_n)^T ∈ R^{n×d} denote a sample of size n of an
R^d-valued random variable with joint pdf f(x; θ_0), x ∈ R^{n×d}. The pdf of X is known to
belong to the parametric family ℱ = {f(x; θ) : θ ∈ Θ}, where Θ ⊂ R^p is the set of all
possible parameters. We wish to estimate θ_0 on the basis of X. The (averaged) negative
log-likelihood of θ given X is defined as
\[
L_n(\theta) = -\frac{1}{n} \log f(X; \theta).
\]
A maximum likelihood estimator (mle) of θ, denoted by θ_n, is any solution of the problem
\[
\begin{aligned}
\text{minimize} \quad & L_n(\theta) \\
\text{subject to} \quad & \theta \in \Theta.
\end{aligned}
\]

3.1 The Multivariate normal model


A random variable X ∈ R^d is said to be multivariate normal with mean µ and covariance
matrix Σ if its pdf is given by
\[
f(x; \mu, \Sigma) = \frac{\exp\!\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)}{\left( (2\pi)^d \det(\Sigma) \right)^{1/2}},
\qquad x \in \mathbb{R}^d,
\]
with µ = E[X] ∈ R^d, Σ ∈ R^{d×d} a positive-definite matrix, and det(Σ) the determinant
of Σ. The above is denoted simply by
\[
X \sim N_d(\mu, \Sigma).
\]
Suppose now that X_1, X_2, . . . , X_n are iid with X_1 ∼ N_d(µ_0, I), where I is the identity
matrix and µ_0 ∈ R^d is the unknown parameter we wish to estimate. The parametric
family is then
\[
f(x; \mu) = (2\pi)^{-d/2} \exp\!\left( -\tfrac{1}{2} \|x - \mu\|^2 \right), \qquad x \in \mathbb{R}^d,
\]
with µ ∈ R^d the parameter. Here ‖·‖ denotes the Euclidean norm on R^d. The negative
log-likelihood of µ given X_1, X_2, . . . , X_n is
\[
L_n(\mu) = \frac{1}{2n} \sum_{i=1}^{n} \|X_i - \mu\|^2 + \text{other terms},
\]
where the “other terms” are independent of µ. It can be verified that, apart from the “other
terms”,
\[
L_n(\mu) = \frac{1}{2} \Bigl\| \mu - \frac{1}{n} \sum_{i=1}^{n} X_i \Bigr\|^2
         + \frac{1}{2n} \sum_{i=1}^{n} \|X_i\|^2
         - \frac{1}{2n^2} \Bigl\| \sum_{i=1}^{n} X_i \Bigr\|^2 .
\]
It follows that the mle of µ is
\[
\mu_n = \frac{1}{n} \sum_{i=1}^{n} X_i .
\]
So, in this case, the mle exists and is unique.
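
A brief numerical check (added for illustration, with NumPy and SciPy assumed): minimizing
L_n(µ) directly recovers the sample mean, confirming the closed-form mle above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
d, n = 3, 100
mu_true = np.array([1.0, -1.0, 0.5])
X = rng.normal(size=(n, d)) + mu_true      # iid N_d(mu_true, I) sample

def L_n(mu, X):
    """Averaged negative log-likelihood, up to terms independent of mu."""
    return np.mean(np.sum((X - mu)**2, axis=1)) / 2.0

result = minimize(L_n, x0=np.zeros(d), args=(X,))
print(result.x)           # numerical minimizer
print(X.mean(axis=0))     # sample mean; agrees up to solver tolerance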

3.2 Linear model


The linear multivariate normal model for a random variable X specifies the distribution as
\[
X \sim N_n(Z\beta_0, \sigma_0^2 I),
\]
where Z ∈ R^{n×p} is a known matrix, β_0 ∈ R^p is a vector of unknown parameters, and σ_0^2 ∈ R
is also an unknown parameter. This is the basic model underlying both regression analysis
and the design of experiments. The parametric family of distributions is given by
\[
f(X; \beta, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{\|X - Z\beta\|^2}{2\sigma^2} \right),
\qquad X \in \mathbb{R}^n .
\]

The maximum likelihood estimation problem for estimating β_0 and σ_0 is
\[
\begin{aligned}
\text{minimize} \quad & (2n\sigma^2)^{-1} \|X - Z\beta\|^2 + \log \sigma \\
\text{subject to} \quad & \beta \in \mathbb{R}^p, \ \sigma > 0.
\end{aligned}
\]
It is clear that the optimal β is not influenced by the choice of σ, so that β_n solves the
linear least-squares problem
\[
\begin{aligned}
\text{minimize} \quad & \|Z\beta - X\|^2 \\
\text{subject to} \quad & \beta \in \mathbb{R}^p.
\end{aligned}
\]
It follows that the solution β_n is unique if and only if Z has full column rank. In that case
\[
\beta_n = (Z^T Z)^{-1} Z^T X.
\]
Taking β = β_n and solving the minimization problem for σ_n, we get a unique solution given
by
\[
\sigma_n^2 = \frac{1}{n} \|Z\beta_n - X\|^2 .
\]
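
A short NumPy sketch of these two estimators (the design matrix Z and the true parameter
values below are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 3
Z = rng.normal(size=(n, p))                           # known design matrix
beta_0 = np.array([1.0, 0.5, -2.0])
sigma_0 = 0.3
X = Z @ beta_0 + rng.normal(scale=sigma_0, size=n)    # X ~ N_n(Z beta_0, sigma_0^2 I)

beta_n, *_ = np.linalg.lstsq(Z, X, rcond=None)        # mle of beta (unique: Z has full column rank)
sigma_n2 = np.sum((Z @ beta_n - X)**2) / n            # mle of sigma^2 (note 1/n, not 1/(n - p))
print(beta_n, np.sqrt(sigma_n2))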

4 DACE
The DACE model is a combination of the linear model and the multivariate normal model for
the random variable X,
\[
X \sim N_n(Z\beta_0, \Sigma),
\]
where Z ∈ R^{n×p} is a known matrix, β_0 is a vector of unknown parameters, and Σ = σ_0^2 R is
the covariance matrix, with σ_0^2 ∈ R and R the unknown correlation matrix. The correlation
between two random points X_1 and X_2 is given by
\[
R = R(X_1, X_2; \theta),
\]
where θ are the unknown parameters. The parametric family of distributions is given by
\[
f(X; \beta, \sigma, \theta) = \frac{\exp\!\left( -\tfrac{1}{2} (X - Z\beta)^T \Sigma^{-1} (X - Z\beta) \right)}{\left( (2\pi)^n \det(\Sigma) \right)^{1/2}},
\qquad X \in \mathbb{R}^n .
\]
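
For illustration only, the sketch below evaluates the negative logarithm of this density for
given β, σ², and θ. The notes leave the correlation function R(·, ·; θ) unspecified; a Gaussian
(squared-exponential) correlation is assumed here purely as an example, and the design sites,
trend, and data are likewise assumptions. NumPy is assumed.

import numpy as np

def gauss_corr(S, theta):
    """Assumed Gaussian correlation: R_ij = exp(-sum_k theta_k * (s_ik - s_jk)^2)."""
    diff = S[:, None, :] - S[None, :, :]
    return np.exp(-np.einsum('ijk,k->ij', diff**2, theta))

def neg_log_density(X, Z, S, beta, sigma2, theta):
    """Negative log of the DACE density f(X; beta, sigma, theta)."""
    n = len(X)
    Sigma = sigma2 * gauss_corr(S, theta)                 # Sigma = sigma^2 R
    r = X - Z @ beta
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

rng = np.random.default_rng(7)
n = 20
S = rng.uniform(size=(n, 2))               # design sites in [0, 1]^2
Z = np.ones((n, 1))                        # constant-mean trend
X = np.sin(S[:, 0]) + rng.normal(scale=0.05, size=n)
print(neg_log_density(X, Z, S, beta=np.array([0.0]), sigma2=1.0, theta=np.array([1.0, 1.0])))

In practice β, σ², and θ would be chosen by minimizing this quantity, which is the maximum
likelihood estimation problem of the preceding section specialized to the covariance Σ = σ²R(θ).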
