Density Estimation 36-708
1 Introduction
Density estimation is used for regression, classification, clustering and unsupervised prediction. For example, if $\hat p(x, y)$ is an estimate of $p(x, y)$ then we get the following estimate of the regression function:
$$\hat m(x) = \int y\,\hat p(y|x)\,dy$$
where $\hat p(y|x) = \hat p(y, x)/\hat p(x)$. For classification, recall that the Bayes rule is $h(x) = I\big(\pi_1 p_1(x) > \pi_0 p_0(x)\big)$ where $\pi_1 = P(Y = 1)$, $\pi_0 = P(Y = 0)$, $p_1(x) = p(x|y = 1)$ and $p_0(x) = p(x|y = 0)$. Inserting sample estimates of $\pi_1$ and $\pi_0$, and density estimates for $p_1$ and $p_0$, yields an estimate of the Bayes rule; many classifiers that you are familiar with can be re-expressed this way. For clustering, we look for the high density regions, based on an estimate of the density. Unsupervised prediction is discussed in Section 9; there we want to predict $X_{n+1}$ from $X_1, \ldots, X_n$.
Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
$$p(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10}\sum_{j=0}^{4}\phi\big(x; (j/2) - 1, 1/10\big) \qquad (1)$$
where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson
density. Based on 1,000 draws from p, we computed a kernel density estimator, described
later. The estimator depends on a tuning parameter called the bandwidth. The top right plot
is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.
Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.
2 Loss Functions
Devroye and Györfi (1985) make a strong case for using the L1 norm
$$\|\hat p - p\|_1 \equiv \int |\hat p(x) - p(x)|\,dx$$
as the loss instead of L2. The L1 loss has the following nice interpretation. If P and Q are distributions, define the total variation metric
$$d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$$
where the supremum is over all measurable sets. Now if P and Q have densities p and q then
$$d_{TV}(P, Q) = \frac{1}{2}\int |p - q| = \frac{1}{2}\|p - q\|_1.$$
Thus, if $\int |p - q| < \delta$ then we know that $|P(A) - Q(A)| < \delta/2$ for all A. Also, the L1 norm is transformation invariant. Suppose that T is a one-to-one smooth function. Let $Y = T(X)$. Let p and q be densities for X and let $\tilde p$ and $\tilde q$ be the corresponding densities for Y. Then
$$\int |p(x) - q(x)|\,dx = \int |\tilde p(y) - \tilde q(y)|\,dy.$$
Hence the distance is unaffected by transformations. The L1 loss is, in some sense, a much better loss function than L2 for density estimation, but it is much more difficult to deal with. For now, we will focus on L2 loss; we may discuss L1 loss later.
Another loss function is the Kullback-Leibler loss $\int p(x)\log\big(p(x)/q(x)\big)\,dx$. This is not a good loss function to use for nonparametric density estimation because the Kullback-Leibler loss is completely dominated by the tails of the densities.
3 Histograms
Perhaps the simplest density estimators are histograms. For convenience, assume that the
data X1 , . . . , Xn are contained in the unit cube X = [0, 1]d (although this assumption is not
crucial). Divide X into bins, or sub-cubes, of size h. We discuss methods for choosing
h later. There are $N \approx (1/h)^d$ such bins and each has volume $h^d$. Denote the bins by $B_1, \ldots, B_N$. The histogram density estimator is
$$\hat p_h(x) = \sum_{j=1}^{N}\frac{\hat\theta_j}{h^d}\,I(x\in B_j) \qquad (2)$$
where
$$\hat\theta_j = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in B_j)$$
is the fraction of data points in bin $B_j$. Now we bound the bias and variance of $\hat p_h$. We will assume that $p \in \mathcal P(L)$ where
$$\mathcal P(L) = \Big\{ p : |p(x) - p(y)| \le L\|x - y\| \ \text{ for all } x, y \Big\}. \qquad (3)$$
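To make estimator (2) concrete, here is a minimal NumPy sketch on the unit cube. This is our own illustration (the function names and the dictionary-based bin counting are not from the notes), not a definitive implementation.

```python
import numpy as np

def histogram_density(X, h):
    """Histogram density estimator (2) on [0,1]^d with cubic bins of side h.

    X : (n, d) array of points in [0, 1]^d.  Returns a function x -> p_hat(x).
    Illustrative sketch; bins on the boundary are handled by clipping.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = int(np.ceil(1.0 / h))                     # bins per coordinate, N ~ (1/h)^d total
    idx = np.minimum((X / h).astype(int), m - 1)  # bin index of each observation
    counts = {}
    for row in map(tuple, idx):
        counts[row] = counts.get(row, 0) + 1

    def p_hat(x):
        j = tuple(np.minimum((np.asarray(x, dtype=float) / h).astype(int), m - 1))
        theta_hat = counts.get(j, 0) / n          # fraction of points in the bin containing x
        return theta_hat / h ** d                 # estimator (2)

    return p_hat

# usage: 500 draws from a Beta-like density on [0,1]^2
rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=(500, 2))
p_hat = histogram_density(X, h=0.1)
print(p_hat([0.2, 0.2]))
```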
First we bound the bias. Let $\theta_j = P(X\in B_j) = \int_{B_j} p(u)\,du$. For any $x \in B_j$,
$$p_h(x) \equiv E\big(\hat p_h(x)\big) = \frac{\theta_j}{h^d} \qquad (4)$$
and hence
$$p(x) - p_h(x) = p(x) - \frac{\int_{B_j} p(u)\,du}{h^d} = \frac{1}{h^d}\int_{B_j}\big(p(x) - p(u)\big)\,du.$$
Thus,
$$|p(x) - p_h(x)| \le \frac{1}{h^d}\int_{B_j}|p(x) - p(u)|\,du \le \frac{1}{h^d}\int_{B_j} Lh\sqrt{d}\,du = Lh\sqrt{d}$$
where we used the fact that if $x, u \in B_j$ then $\|x - u\| \le \sqrt{d}\,h$.
Next we bound the variance:
$$\mathrm{Var}\big(\hat p_h(x)\big) = \frac{1}{h^{2d}}\mathrm{Var}(\hat\theta_j) = \frac{\theta_j(1 - \theta_j)}{nh^{2d}} \le \frac{\theta_j}{nh^{2d}} \le \frac{C}{nh^d}.$$
Combining the bias and variance bounds, the L2 risk satisfies
$$\sup_{p\in\mathcal P(L)} E\int\big(\hat p_h(x) - p(x)\big)^2\,dx \le L^2 d\,h^2 + \frac{C}{nh^d}. \qquad (5)$$
The right hand side is minimized by choosing $h = \left(\frac{C}{L^2 d\,n}\right)^{1/(d+2)}$, which gives
$$\sup_{p\in\mathcal P(L)} E\int\big(\hat p_h(x) - p(x)\big)^2\,dx \le C_0\left(\frac{1}{n}\right)^{2/(d+2)} \qquad (6)$$
where $C_0 = L^2 d\,(C/(L^2 d))^{2/(d+2)}$.
Later, we will prove a theorem which shows that this upper bound is tight: no estimator can uniformly beat the rate $n^{-2/(d+2)}$ over $\mathcal P(L)$.

We can also bound $\sup_{P\in\mathcal P} P^n\big(\|\hat p_h - p\|_\infty > \epsilon\big)$. For $x\in B_j$ we have $|\hat p_h(x) - p_h(x)| = |\hat\theta_j - \theta_j|/h^d$, so Bernstein's inequality gives
$$P^n\big(|\hat p_h(x) - p_h(x)| > \epsilon\big) \le 2\exp\big(-c\,nh^d\epsilon^2\big)$$
for all small ε, where $c = 1/(2(C + 1/3))$. By the union bound and the fact that $N \le (1/h)^d$,
$$\pi_n \equiv P^n\big(\|\hat p_h - p_h\|_\infty > \epsilon\big) \le 2\left(\frac{1}{h}\right)^d\exp\big(-c\,nh^d\epsilon^2\big).$$
Earlier we saw that $\sup_x |p(x) - p_h(x)| \le L\sqrt{d}\,h$. Hence, with probability at least $1 - \pi_n$,
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \epsilon + L\sqrt{d}\,h. \qquad (7)$$
Now set
$$\epsilon = \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)}.$$
Then, with probability at least 1 − δ,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\,h. \qquad (8)$$
4 Kernel Density Estimation

A one-dimensional smoothing kernel is any smooth function K such that $\int K(x)\,dx = 1$, $\int xK(x)\,dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\,dx > 0$. Some commonly used kernels are the following:

Boxcar: $K(x) = \frac{1}{2}I(x)$, Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2)I(x)$, Tricube: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$,

where $I(x) = I(|x| \le 1)$. These kernels are plotted in Figure 2.
Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).
Figure 3: A kernel density estimator $\hat p$. At each point x, $\hat p(x)$ is the average of the kernels centered over the data points $X_i$. The data points are indicated by short vertical bars. The kernels are not drawn to scale.
Suppose that $X \in \mathbb{R}^d$. Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be
$$\hat p(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}K\left(\frac{\|x - X_i\|}{h}\right). \qquad (10)$$
More generally, we can define
$$\hat p_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i)$$
where H is a positive definite bandwidth matrix and $K_H(x) = |H|^{-1/2}K(H^{-1/2}x)$. For simplicity, we will take $H = h^2 I$, which gives back the previous formula.
Sometimes we write the estimator as $\hat p_h$ to emphasize the dependence on h. In the multivariate case the coordinates of $X_i$ should be standardized so that each has the same variance, since the norm $\|x - X_i\|$ treats all coordinates as if they are on the same scale.
The kernel estimator places a smoothed out lump of mass of size 1/n over each data point
Xi ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.
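As a quick illustration (ours, not from the notes), estimator (10) with a Gaussian kernel can be written in a few lines of NumPy; standardizing the coordinates beforehand, as recommended above, is left to the caller.

```python
import numpy as np

def kde_gaussian(X, h):
    """Kernel density estimator (10) with the radially symmetric Gaussian kernel.

    X : (n, d) data array; h : bandwidth.  Returns a function evaluating p_hat.
    Illustrative sketch only.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    norm_const = (2.0 * np.pi) ** (-d / 2.0)   # makes the d-dimensional kernel integrate to 1

    def p_hat(x):
        x = np.atleast_2d(x)                                  # (m, d) query points
        sq = ((x[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # ||x - X_i||^2, shape (m, n)
        K = norm_const * np.exp(-0.5 * sq / h ** 2)           # kernel at ||x - X_i|| / h
        return K.mean(axis=1) / h ** d                        # average of scaled kernels

    return p_hat

# usage
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
p_hat = kde_gaussian(X, h=0.4)
print(p_hat([[0.0, 0.0], [2.0, 2.0]]))
```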
4.1 Risk Analysis
In this section we examine the accuracy of kernel density estimation. We will first need a
few definitions.
Given a vector $s = (s_1, \ldots, s_d)$ of nonnegative integers, define $|s| = s_1 + \cdots + s_d$, $s! = s_1!\cdots s_d!$, $x^s = x_1^{s_1}\cdots x_d^{s_d}$ and
$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1}\cdots\partial x_d^{s_d}}.$$
Let β be a positive integer. Define the Hölder class
$$\Sigma(\beta, L) = \Big\{ g : |D^s g(x) - D^s g(y)| \le L\|x - y\| \ \text{ for all } s \text{ such that } |s| = \beta - 1, \text{ and all } x, y \Big\}. \qquad (11)$$
For example, if d = 1 and β = 2 this means that
$$|p'(x) - p'(y)| \le L|x - y| \quad \text{for all } x, y. \qquad (12)$$
The most common case is β = 2; roughly speaking, this means that the functions have bounded second derivatives.
If $g \in \Sigma(\beta, L)$ then g is close to its Taylor polynomial:
$$|g(u) - g_{x,\beta}(u)| \le L\|u - x\|^{\beta}$$
where
$$g_{x,\beta}(u) = \sum_{|s|\le\beta}\frac{(u - x)^s}{s!}\,D^s g(x). \qquad (13)$$
Let $p_h(x) = E[\hat p_h(x)]$. The next lemma provides a bound on the bias $p_h(x) - p(x)$.
Lemma 3 The bias of $\hat p_h$ satisfies
$$\sup_{p\in\Sigma(\beta, L)}|p_h(x) - p(x)| \le ch^{\beta} \qquad (14)$$
for some c.
Proof. We have
$$|p_h(x) - p(x)| = \left|\frac{1}{h^d}\int K(\|u - x\|/h)\,p(u)\,du - p(x)\right| = \left|\int K(\|v\|)\big(p(x + hv) - p(x)\big)\,dv\right|$$
$$\le \left|\int K(\|v\|)\big(p(x + hv) - p_{x,\beta}(x + hv)\big)\,dv\right| + \left|\int K(\|v\|)\big(p_{x,\beta}(x + hv) - p(x)\big)\,dv\right|.$$
The first term is bounded by $Lh^{\beta}\int K(\|s\|)\,\|s\|^{\beta}\,ds$ since $p\in\Sigma(\beta, L)$. The second term is 0 from the properties on K, since $p_{x,\beta}(x + hv) - p(x)$ is a polynomial of degree β (with no constant term). □
Lemma 4 The variance of $\hat p_h$ satisfies
$$\mathrm{Var}\big(\hat p_h(x)\big) \le \frac{c}{nh^d} \qquad (15)$$
for some c > 0.

Proof. Write $\hat p_h(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where $Z_i = h^{-d}K(\|x - X_i\|/h)$. Then
$$\mathrm{Var}(Z_i) \le E(Z_i^2) = \frac{1}{h^{2d}}\int K^2\left(\frac{\|x - u\|}{h}\right)p(u)\,du = \frac{1}{h^d}\int K^2(\|v\|)\,p(x + hv)\,dv \le \frac{\sup_x p(x)}{h^d}\int K^2(\|v\|)\,dv \le \frac{c}{h^d}$$
for some c, since the densities in Σ(β, L) are uniformly bounded. The result follows since $\mathrm{Var}(\hat p_h(x)) = n^{-1}\mathrm{Var}(Z_i)$. □
Since the mean squared error is equal to the variance plus the bias squared, we have:

Theorem 5 The L2 risk is bounded above, uniformly over Σ(β, L), by $h^{2\beta} + \frac{1}{nh^d}$ (up to constants). If $h \asymp n^{-1/(2\beta + d)}$ then
$$\sup_{p\in\Sigma(\beta, L)} E\int\big(\hat p_h(x) - p(x)\big)^2\,dx \lesssim \left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta + d}}. \qquad (16)$$
4.2 Minimax Bound
According to the next theorem, there does not exist an estimator that converges faster than
O(n−2β/(2β+d) ). We state the result for integrated L2 loss although similar results hold for
other loss functions and other function spaces. We will prove this later in the course.

Theorem 6 There exists a constant c > 0 such that
$$\inf_{\hat p}\ \sup_{p\in\Sigma(\beta, L)} E\int\big(\hat p(x) - p(x)\big)^2\,dx \ge c\left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta + d}} \qquad (17)$$
where the infimum is over all estimators $\hat p$.

Theorem 6 together with (16) implies that kernel estimators are rate minimax.
Now we state a result which says how fast $\hat p(x)$ concentrates around p(x). First, recall Bernstein's inequality: suppose that $Y_1, \ldots, Y_n$ are iid with mean μ, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$P\big(|\overline Y - \mu| > \epsilon\big) \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right). \qquad (18)$$

Theorem 7 For all small ε > 0,
$$P\big(|\hat p(x) - p(x)| > ch^{\beta} + \epsilon\big) \le 2\exp\left(-\frac{nh^d\epsilon^2}{4c_2}\right) \qquad (19)$$
where $c_2$ is a constant depending only on K. Hence, if $h \asymp n^{-1/(2\beta + d)}$ then
$$|\hat p(x) - p(x)| = O_P\big(n^{-\beta/(2\beta + d)}\big). \qquad (20)$$

Note that the last statement follows from the bias-variance calculation followed by Markov's inequality. The first statement does not.
Proof. We have
$$|\hat p(x) - p(x)| \le |\hat p(x) - p_h(x)| + |p_h(x) - p(x)| \qquad (21)$$
where $p_h(x) = E(\hat p(x))$. From Lemma 3, $|p_h(x) - p(x)| \le ch^{\beta}$ for some c. Now $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where
$$Z_i = \frac{1}{h^d}K\left(\frac{\|x - X_i\|}{h}\right).$$
Note that $|Z_i| \le c_1/h^d$ where $c_1 = K(0)$. Also, $\mathrm{Var}(Z_i) \le c_2/h^d$ from Lemma 4. Hence, by Bernstein's inequality,
$$P\big(|\hat p(x) - p_h(x)| > \epsilon\big) \le 2\exp\left(-\frac{n\epsilon^2}{2c_2 h^{-d} + 2c_1 h^{-d}\epsilon/3}\right) \le 2\exp\left(-\frac{nh^d\epsilon^2}{4c_2}\right)$$
whenever $\epsilon \le 3c_2/c_1$. If we choose $\epsilon = \sqrt{C\log(2/\delta)/(nh^d)}$ where $C = 4c_2$ then
$$P\left(|\hat p(x) - p_h(x)| > \sqrt{\frac{C\log(2/\delta)}{nh^d}}\right) \le \delta. \qquad \square$$
4.4 Concentration in L∞
Theorem 7 shows that, for each x, $\hat p(x)$ is close to p(x) with high probability. We would like a version of this result that holds uniformly over all x. That is, we want a concentration result for
$$\|\hat p - p\|_\infty = \sup_x |\hat p(x) - p(x)|.$$
We can write
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + ch^{\beta}.$$
We can bound the first term using something called bracketing together with Bernstein's theorem to prove that
$$P\big(\|\hat p_h - p_h\|_\infty > \epsilon\big) \le 4\left(\frac{C}{h^{d+1}}\right)^d \exp\left(-\frac{3n\epsilon^2 h^d}{28K(0)}\right). \qquad (22)$$
A more refined result is due to Giné and Guillou (2002). Assume that the class $\mathcal F_h = \{K(\|x - \cdot\|/h) : x \in \mathbb{R}^d\}$ satisfies, for some positive constants A and v,
$$\sup_P N\big(\mathcal F_h, L_2(P), \epsilon\|F\|_{L_2(P)}\big) \le \left(\frac{A}{\epsilon}\right)^v \qquad (23)$$
where $N(T, d, \epsilon)$ denotes the ε-covering number of the metric space (T, d), F is the envelope function of $\mathcal F_h$ and the supremum is taken over the set of all probability measures on $\mathbb{R}^d$. The quantities A and v are called the VC characteristics of $\mathcal F_h$.
Theorem 8 (Giné and Guillou 2002) Assume that the kernel satisfies the above property.

1. Let h > 0 be fixed. Then there exist constants $c_1 > 0$ and $c_2 > 0$ such that, for all small ε > 0 and all large n,
$$P\left(\sup_{x\in\mathbb{R}^d}|\hat p_h(x) - p_h(x)| > \epsilon\right) \le c_1\exp\big(-c_2 nh^d\epsilon^2\big). \qquad (24)$$

2. Let $h_n \to 0$ as $n \to \infty$ in such a way that $\frac{nh_n^d}{|\log h_n^d|} \to \infty$. Let
$$\epsilon_n \ge \sqrt{\frac{|\log h_n|}{nh_n^d}}. \qquad (25)$$
Then, for all n large enough, (24) holds with h and ε replaced by $h_n$ and $\epsilon_n$, respectively.
The above theorem imposes minimal assumptions on the kernel K and, more importantly,
on the probability distribution P , whose density is not required to be bounded or smooth,
and, in fact, may not even exist. Combining the above theorem with Lemma 3 we have the
following result.
With probability at least 1 − δ,
$$\sup_x |\hat p(x) - p(x)| \le ch^{\beta} + \sqrt{\frac{C|\log h|}{nh^d}}$$
for some constants C and c, where C depends on δ. Choosing $h \asymp (\log n/n)^{1/(2\beta + d)}$ we have
$$P\left(\sup_x |\hat p(x) - p(x)|^2 > \frac{C\log n}{n^{2\beta/(2\beta + d)}}\right) < \delta.$$
We have ignored what happens near the boundary of the sample space. If x is O(h) close to the boundary, the bias is O(h) instead of $O(h^2)$. There are a variety of fixes including: data reflection, transformations, boundary kernels, and local likelihood.
4.6 Confidence Bands and the CLT
Consider first a single point x. Let $s_n(x) = \sqrt{\mathrm{Var}(\hat p_h(x))}$. The CLT implies that
$$Z_n(x) \equiv \frac{\hat p_h(x) - p_h(x)}{s_n(x)} \rightsquigarrow N(0, 1)$$
as long as
$$\lim_{n\to\infty}\sum_{i=1}^{n} E|L_{n,i}|^{2+\delta} = 0$$
for some δ > 0. But this does not yield a confidence interval for p(x). To see why, let us write
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} = \frac{\hat p_h(x) - p_h(x)}{s_n(x)} + \frac{p_h(x) - p(x)}{s_n(x)} = Z_n(x) + \frac{\mathrm{bias}}{\sqrt{\mathrm{var}(x)}}.$$
Assuming that we optimize the risk by balancing the bias and the variance, the second term is asymptotically some constant c. So
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} \rightsquigarrow N\big(c, \tau^2(x)\big).$$
This means that the usual confidence interval $\hat p_h(x) \pm z_{\alpha/2}\,s_n(x)$ will not cover p(x) with probability tending to 1 − α. One fix for this is to undersmooth the estimator (we sacrifice risk for coverage). An easier approach is just to interpret $\hat p_h(x) \pm z_{\alpha/2}\,s_n(x)$ as a confidence interval for the smoothed density $p_h(x)$ instead of p(x).
But this only gives an interval at one point. To get a confidence band we use the bootstrap. Let $P_n$ be the empirical distribution of $X_1, \ldots, X_n$. The idea is to estimate the distribution
$$F_n(t) = P\left(\sqrt{nh^d}\,\|\hat p_h - p_h\|_\infty \le t\right)$$
by the bootstrap estimate
$$\widehat F_n(t) = P\left(\sqrt{nh^d}\,\|\hat p^*_h - \hat p_h\|_\infty \le t \,\Big|\, X_1, \ldots, X_n\right)$$
where $\hat p^*_h$ is constructed from the bootstrap sample $X_1^*, \ldots, X_n^* \sim P_n$. Later in the course, we will show that
$$\sup_t |F_n(t) - \widehat F_n(t)| \stackrel{P}{\to} 0.$$
This suggests the following algorithm:

1. Let $P_n$ be the empirical distribution that puts mass 1/n at each data point $X_i$.
2. Draw $X_1^*, \ldots, X_n^* \sim P_n$. This is called a bootstrap sample.
3. Compute the density estimator $\hat p^*_h$ based on the bootstrap sample.
4. Compute $R = \sqrt{nh^d}\,\|\hat p^*_h - \hat p_h\|_\infty$.
5. Repeat steps 2–4 B times. This gives $R_1, \ldots, R_B$.
6. Let $z_\alpha$ be the upper α quantile of $R_1, \ldots, R_B$.
7. Let
$$\ell_n(x) = \hat p_h(x) - \frac{z_\alpha}{\sqrt{nh^d}}, \qquad u_n(x) = \hat p_h(x) + \frac{z_\alpha}{\sqrt{nh^d}}.$$

See Figure 4.

Figure 4: Confidence bands for bandwidths h = 1, h = 2 and h = 3.
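A compact sketch (ours) of steps 1–7 for d = 1 with a Gaussian kernel; the grid, B and the helper `kde1d` are our own choices and the band is for the smoothed density $p_h$, as discussed above.

```python
import numpy as np

def kde1d(data, grid, h):
    """1-d Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def bootstrap_band(X, h, alpha=0.05, B=500, grid=None, rng=None):
    """Bootstrap confidence band for p_h (steps 1-7 above).  Illustrative sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    if grid is None:
        grid = np.linspace(X.min(), X.max(), 200)
    p_hat = kde1d(X, grid, h)
    R = np.empty(B)
    for b in range(B):
        Xstar = rng.choice(X, size=n, replace=True)       # bootstrap sample from P_n
        p_star = kde1d(Xstar, grid, h)
        R[b] = np.sqrt(n * h) * np.max(np.abs(p_star - p_hat))
    z_alpha = np.quantile(R, 1 - alpha)                   # upper-alpha quantile of R_1,...,R_B
    half_width = z_alpha / np.sqrt(n * h)
    return grid, p_hat - half_width, p_hat + half_width   # (ell_n, u_n), a band for p_h

# usage
rng = np.random.default_rng(2)
X = rng.normal(size=300)
grid, lower, upper = bootstrap_band(X, h=0.3, rng=rng)
```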
If you want a confidence band for p you need to reduce the bias (undersmooth). A simple way to do this is with twicing. Suppose that β = 2 and that we use the kernel estimator $\hat p_h$. Note that
$$E[\hat p_h(x)] = p(x) + C(x)h^2 + o(h^2)$$
$$E[\hat p_{2h}(x)] = p(x) + 4C(x)h^2 + o(h^2)$$
for some C(x). That is, the leading term of the bias is $b(x) = C(x)h^2$. So if we define the bias estimate $\hat b(x) = (\hat p_{2h}(x) - \hat p_h(x))/3$, then the twicing estimator $\tilde p_h(x) = \hat p_h(x) - \hat b(x)$ removes the leading bias term.
5 Cross-Validation
In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h. Here, we
describe two cross-validation methods.
A common method for estimating risk is leave-one-out cross-validation. Recall that the loss function is
$$\int\big(\hat p(x) - p(x)\big)^2\,dx = \int\hat p^2(x)\,dx - 2\int\hat p(x)p(x)\,dx + \int p^2(x)\,dx.$$
The last term does not involve $\hat p$ so we can drop it. Thus, we now define the loss to be
$$L(h) = \int\hat p^2(x)\,dx - 2\int\hat p(x)p(x)\,dx.$$
The risk is $R(h) = E[L(h)]$. Leave-one-out cross-validation estimates the risk by
$$\widehat R(h) = \int\hat p^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat p_{(-i)}(X_i) \qquad (26)$$
where $\hat p_{(-i)}$ is the density estimator obtained after removing the ith observation. It is easy to check that $E[\widehat R(h)] = R(h)$.
When the kernel is Gaussian, the cross-validation score can be written, after some tedious algebra, as follows. Let φ(z; σ) denote a Normal density with mean 0 and variance $\sigma^2$. Then,
$$\widehat R(h) = \frac{\phi^d(0; \sqrt{2}h)}{n - 1} + \frac{n - 2}{n(n-1)^2}\sum_{i\ne j}\prod_{\ell=1}^{d}\phi\big(X_{i\ell} - X_{j\ell}; \sqrt{2}h\big) \qquad (27)$$
$$\qquad - \frac{2}{n(n-1)}\sum_{i\ne j}\prod_{\ell=1}^{d}\phi\big(X_{i\ell} - X_{j\ell}; h\big). \qquad (28)$$
The estimator pb and the cross-validation score can be computed quickly using the fast Fourier
transform; see pages 61–66 of Silverman (1986).
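Formula (27)–(28) can also be evaluated directly in $O(n^2 d)$ time. The following is our own sketch (no FFT speed-up; the bandwidth grid is an arbitrary choice):

```python
import numpy as np

def phi(z, sigma):
    """Normal density with mean 0 and standard deviation sigma."""
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def cv_score(X, h):
    """Cross-validation score (27)-(28) for the Gaussian product kernel (sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]            # pairwise coordinate differences
    off = ~np.eye(n, dtype=bool)                    # indicator of i != j
    prod_sqrt2 = phi(diff, np.sqrt(2) * h).prod(axis=2)[off].sum()
    prod_h = phi(diff, h).prod(axis=2)[off].sum()
    return (phi(0.0, np.sqrt(2) * h) ** d / (n - 1)
            + (n - 2) / (n * (n - 1) ** 2) * prod_sqrt2
            - 2.0 / (n * (n - 1)) * prod_h)

# usage: pick the bandwidth minimizing the estimated risk over a grid
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
bandwidths = np.linspace(0.05, 1.0, 40)
scores = [cv_score(X, h) for h in bandwidths]
print(bandwidths[int(np.argmin(scores))])
```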
For histograms, it is easy to work out the leave-one-out cross-validation score in closed form:
$$\widehat R(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_j\hat\theta_j^2.$$
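This closed form is immediate to evaluate; below is our own sketch for a one-dimensional histogram on [0, 1], with an arbitrary grid of binwidths.

```python
import numpy as np

def hist_cv_score(X, h):
    """Leave-one-out CV score for a 1-d histogram on [0, 1] with binwidth h (sketch)."""
    n = len(X)
    m = int(np.ceil(1.0 / h))
    counts = np.bincount(np.minimum((np.asarray(X) / h).astype(int), m - 1), minlength=m)
    theta_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(theta_hat ** 2)

# usage: choose the binwidth with the smallest estimated risk
rng = np.random.default_rng(4)
X = rng.beta(2.0, 5.0, size=500)
print(min(np.linspace(0.01, 0.2, 20), key=lambda h: hist_cv_score(X, h)))
```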
A further justification for cross-validation is given by the following theorem due to Stone
(1984).
Theorem 12 (Stone's theorem) Suppose that p is bounded. Let $\hat p_h$ denote the kernel estimator with bandwidth h and let $\hat h$ denote the bandwidth chosen by cross-validation. Then,
$$\frac{\int\big(p(x) - \hat p_{\hat h}(x)\big)^2\,dx}{\inf_h\int\big(p(x) - \hat p_h(x)\big)^2\,dx}\ \stackrel{a.s.}{\longrightarrow}\ 1. \qquad (29)$$
The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator pb is wiggly, then cross-validation
has let you down. The eye is not a good judge of risk.
There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.
Another approach is data splitting. Let $\hat p_h$ denote the kernel estimator based on bandwidth h. For simplicity, assume the sample size is even and denote the sample size by 2n. Randomly split the data $X = (X_1, \ldots, X_{2n})$ into two sets of size n. Denote these by $Y = (Y_1, \ldots, Y_n)$ and $Z = (Z_1, \ldots, Z_n)$.$^1$ Let $\mathcal H = \{h_1, \ldots, h_N\}$ be a finite grid of bandwidths. Let
$$\hat p_j(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_j^d}K\left(\frac{\|x - Y_i\|}{h_j}\right).$$
Thus we have a set $\mathcal P = \{\hat p_1, \ldots, \hat p_N\}$ of density estimators.
We would like to minimize $L(p, \hat p_j) = \int\hat p_j^2(x)\,dx - 2\int\hat p_j(x)p(x)\,dx$. Define the estimated risk
$$\widehat L_j \equiv \widehat L(p, \hat p_j) = \int\hat p_j^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat p_j(Z_i). \qquad (30)$$
$$X = (X_1, \ldots, X_{2n}) \ \overset{\text{split}}{\Longrightarrow}\ \begin{cases} Y \ \longrightarrow\ \{\hat p_1, \ldots, \hat p_N\} = \mathcal P\\[4pt] Z \ \longrightarrow\ \{\widehat L_1, \ldots, \widehat L_N\}\end{cases}$$
We then select $\hat p = \hat p_{\hat j}$ where $\hat j = \mathrm{argmin}_j\,\widehat L_j$.
The selected estimator satisfies an oracle inequality, which can be proved using concentration of measure techniques that we discuss later in class. A similar result can be proved for V-fold cross-validation.
In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.
Theorem 14 Let $R_x = E\big(p(x) - \hat p(x)\big)^2$ and let $R = \int R_x\,dx$. Assume that $p''$ is absolutely continuous and that $\int\big(p'''(x)\big)^2\,dx < \infty$. Then
$$R_x = \frac{1}{4}\sigma_K^4 h_n^4\big(p''(x)\big)^2 + \frac{p(x)\int K^2(x)\,dx}{nh_n} + O\left(\frac{1}{n}\right) + O(h_n^6)$$
and
$$R = \frac{1}{4}\sigma_K^4 h_n^4\int\big(p''(x)\big)^2\,dx + \frac{\int K^2(x)\,dx}{nh_n} + O\left(\frac{1}{n}\right) + O(h_n^6) \qquad (31)$$
where $\sigma_K^2 = \int x^2 K(x)\,dx$.
Proof. Write $K_h(x, X) = h^{-1}K\big((x - X)/h\big)$ and $\hat p(x) = n^{-1}\sum_i K_h(x, X_i)$. Thus, $E[\hat p(x)] = E[K_h(x, X)]$ and $\mathrm{Var}[\hat p(x)] = n^{-1}\mathrm{Var}[K_h(x, X)]$. Now,
$$E[K_h(x, X)] = \int\frac{1}{h}K\left(\frac{x - t}{h}\right)p(t)\,dt = \int K(u)\,p(x - hu)\,du$$
$$= \int K(u)\left[p(x) - hu\,p'(x) + \frac{h^2u^2}{2}p''(x) + \cdots\right]du = p(x) + \frac{1}{2}h^2 p''(x)\int u^2 K(u)\,du + \cdots$$
since $\int K(x)\,dx = 1$ and $\int xK(x)\,dx = 0$. The bias is
$$E[K_{h_n}(x, X)] - p(x) = \frac{1}{2}\sigma_K^2 h_n^2 p''(x) + O(h_n^4).$$
By a similar calculation,
$$\mathrm{Var}[\hat p(x)] = \frac{p(x)\int K^2(x)\,dx}{nh_n} + O\left(\frac{1}{n}\right).$$
The first result then follows since the risk is the squared bias plus the variance. The second result follows from integrating the first result. □
If we differentiate (31) with respect to h and set the derivative equal to 0, we see that the asymptotically optimal bandwidth is
$$h_* = \left(\frac{c_2}{c_1^2 A(f)\,n}\right)^{1/5} \qquad (32)$$
where $c_1 = \int x^2 K(x)\,dx$, $c_2 = \int K^2(x)\,dx$ and $A(f) = \int\big(f''(x)\big)^2\,dx$. This is informative because it tells us that the best bandwidth decreases at rate $n^{-1/5}$. Plugging $h_*$ into (31), we see that if the optimal bandwidth is used then $R = O(n^{-4/5})$.
6 High Dimensions
The rate of convergence n−2β/(2β+d) is slow when the dimension d is large. In this case it is
hopeless to try to estimate the true density p precisely in the L2 norm (or any similar norm).
We need to change our notion of what it means to estimate p in a high-dimensional problem.
Instead of estimating p precisely we have to settle for finding an adequate approximation
of p. Any estimator that finds the regions where p puts large amounts of mass should be
considered an adequate approximation. Let us consider a few ways to implement this type
of thinking.
One approach is to estimate a smoothed version of p. Define
$$P_h = P \oplus K_h$$
where ⊕ denotes convolution$^2$ and $K_h$ is the distribution with density $h^{-d}K(\|u\|/h)$. In other words, if $X \sim P_h$ then $X = Y + Z$ where $Y \sim P$ and $Z \sim K_h$. This is just another way to say that $P_h$ is a blurred or smoothed version of P. Its density $p_h$ need not be close in L2 to p but could still preserve most of the important shape information about p. Consider then choosing a fixed h > 0 and estimating $p_h$ instead of p. This corresponds to ignoring the bias in the density estimator. From Theorem 8 we conclude:
Theorem 15 Let h > 0 be fixed. Then $P\big(\|\hat p_h - p_h\|_\infty > \epsilon\big) \le Ce^{-nc\epsilon^2}$. Hence,
$$\|\hat p_h - p_h\|_\infty = O_P\left(\sqrt{\frac{\log n}{n}}\right).$$
The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.
Independence Based Methods. If we can live with some bias, we can reduce the dimen-
sionality by imposing some independence assumptions. The simplest example is to treat the
components (X1 , . . . , Xd ) as if they are independent. In that case
$$p(x_1, \ldots, x_d) = \prod_{j=1}^{d} p_j(x_j)$$
and we only need to estimate each one-dimensional marginal $p_j$.
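Under this independence assumption the joint estimate is just a product of one-dimensional kernel estimates. The sketch below is ours; it uses a Gaussian kernel and a single common bandwidth h, which is a simplification (in practice each marginal would get its own bandwidth).

```python
import numpy as np

def product_kde(X, h):
    """Estimate p(x) = prod_j p_j(x_j) with a 1-d Gaussian KDE per coordinate (sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    def p_hat(x):
        x = np.asarray(x, dtype=float)
        value = 1.0
        for j in range(d):
            z = (x[j] - X[:, j]) / h
            # one-dimensional marginal estimate p_j(x_j)
            value *= np.exp(-0.5 * z ** 2).mean() / (h * np.sqrt(2 * np.pi))
        return value

    return p_hat

# usage
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
print(product_kde(X, h=0.4)([0.0, 0.0, 0.0]))
```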
An extension is to use a forest. We represent the distribution with an undirected graph. A graph with no cycles is a forest. Let E be the edges of the graph. Any density consistent with the forest can be written as
$$p(x) = \prod_{j=1}^{d} p_j(x_j)\prod_{(j,k)\in E}\frac{p_{j,k}(x_j, x_k)}{p_j(x_j)p_k(x_k)}.$$
To estimate the density we therefore only need to estimate the one- and two-dimensional marginals. But how do we find the edge set E? Some methods are discussed in Liu et al
(2011) under the name “Forest Density Estimation.” A simple approach is to connect pairs
greedily using some measure of correlation.
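As one concrete, entirely illustrative version of this greedy idea, the sketch below ranks pairs by absolute Pearson correlation and adds edges Kruskal-style while avoiding cycles; the notes do not prescribe this particular measure or algorithm.

```python
import numpy as np

def greedy_forest(X, max_edges=None):
    """Greedily build a cycle-free edge set E by absolute correlation (sketch)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    pairs = sorted(((corr[j, k], j, k) for j in range(d) for k in range(j + 1, d)),
                   reverse=True)
    parent = list(range(d))                      # union-find structure to detect cycles

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    E = []
    limit = d - 1 if max_edges is None else max_edges
    for w, j, k in pairs:
        rj, rk = find(j), find(k)
        if rj != rk:                             # adding (j, k) keeps the graph a forest
            parent[rj] = rk
            E.append((j, k))
        if len(E) >= limit:
            break
    return E

# usage: four coordinates driven by one common factor
rng = np.random.default_rng(6)
Z = rng.normal(size=(500, 1))
X = np.hstack([Z + 0.1 * rng.normal(size=(500, 1)) for _ in range(4)])
print(greedy_forest(X))
```

Given E, the density estimate plugs one- and two-dimensional kernel estimates of the marginals into the forest formula above.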
Density Trees. Ram and Gray (2011) suggest a recursive partitioning scheme similar to decision trees. They split each coordinate dyadically, in a greedy fashion. The density estimator is taken to be piecewise constant. They use an L2 risk estimator to decide when to split. The idea seems to have been re-discovered in Yand and Wong (arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising; it would be nice if there were an R package to do this and if there were more theoretical results.
7 Example
Figure 5 shows a synthetic two-dimensional data set, the cross-validation function and two
kernel density estimators. The data are 100 points generated as follows. We select a point
randomly on the unit circle and then add Normal noise with standard deviation 0.1. The first
estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation
score. The second uses twice that bandwidth. The cross-validation curve is very sharply
peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is
because cross-validation is aiming to minimize L2 error which does not guarantee that the
estimate is smooth. Also, the dataset is small so this effect is more noticeable. The estimator
with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator
is not necessarily a bad thing.
8 Derivatives
Kernel estimators can also be used to estimate the derivatives of a density.3 Let D⊗r p denote
the rth derivative p. We are using Kronecker notation. Let D⊗0 p = p, D⊗1 f is the gradient
$^3$In this section we follow Chacon and Duong (2013), Electronic Journal of Statistics, 7, 499–532.
Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation
function. Bottom left: kernel estimator based on the bandwidth that minimizes the cross-
validation score. Bottom right: kernel estimator based on the twice the bandwidth that
minimizes the cross-validation score.
Furthermore, $D^{\otimes 2}p = \mathrm{vec}\,H$ where H is the Hessian of p. We also write the rth derivative as $p^{(r)}$ when convenient.
The asymptotic mean squared error is derived in Chacon, Duong and Wand (2011) and is given by
$$\frac{1}{n}|H|^{-1/2}\,\mathrm{tr}\big((H^{-1})^{\otimes r}R(D^{\otimes r}K)\big) + \frac{m_2^2(K)}{4}\,\mathrm{tr}\big((I_{d^r}\otimes\mathrm{vec}^T H)\,R(D^{\otimes(r+2)}p)\,(I_{d^r}\otimes\mathrm{vec}\,H)\big)$$
where $R(g) = \int g(x)g^T(x)\,dx$ and $m_2(K) = \int xx^T K(x)\,dx$. The optimal H has entries of order $n^{-2/(d+2r+4)}$, which yields an asymptotic mean squared error of order $n^{-4/(d+2r+4)}$. In dimension
d = 1, the risk looks like this as a function of r:

r    risk
0    $n^{-4/5}$
1    $n^{-4/7}$
2    $n^{-4/9}$
We see that estimating derivatives is harder than estimating the density itself.
One application of this that we consider later in the course is mode-based clustering. Here,
we use density estimation to find the modes of the density. We associate clusters with these
modes. We can also test for a mode by testing if D2 p(x) < 0 at the estimated modes.
9 Unsupervised Prediction and Anomaly Detection
We can use density estimation to do unsupervised prediction and anomaly detection. The
basic idea is due to Vovk, and was developed in a statistical framework in Lei, Robins and
Wasserman (2014).
Given $Y_1, \ldots, Y_n \sim P$, we want a set $C_n = C_n(Y_1, \ldots, Y_n)$ such that
$$P(Y_{n+1}\in C_n) \ge 1 - \alpha.$$
Fix a value y. Let $A = (Y_1, \ldots, Y_n, y)$ be the augmented dataset; that is, we set $Y_{n+1} = y$. Let $\hat p_A$ be a density estimate based on A, and consider the vector $\big(\hat p_A(Y_1), \ldots, \hat p_A(Y_n), \hat p_A(y)\big)$. Think of testing $H_0: Y_{n+1} = y$.
Under $H_0$, the rank of $\hat p_A(y)$ among these values is uniformly distributed. That is, for each i,
$$P\big(\hat p_A(y) \text{ has rank } i\big) = \frac{1}{n+1}.$$
A p-value for the test is
$$\pi(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} I\big(\hat p_A(Y_i) \le \hat p_A(y)\big)$$
with $Y_{n+1} = y$. The conformal prediction set is $C_n = \{y : \pi(y) \ge \alpha\}$, which satisfies $P(Y_{n+1}\in C_n)\ge 1 - \alpha$.
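A direct sketch (ours) of the p-value π(y), using a one-dimensional Gaussian kernel estimate for $\hat p_A$ with a fixed bandwidth; small p-values flag candidate anomalies, as noted below.

```python
import numpy as np

def kde1d(data, x, h):
    """1-d Gaussian KDE evaluated at points x (sketch)."""
    z = (np.atleast_1d(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def conformal_pvalue(Y, y, h):
    """pi(y): fraction of augmented points whose density is <= the density at y."""
    A = np.append(Y, y)            # augmented data set, Y_{n+1} = y
    dens = kde1d(A, A, h)          # p_hat_A evaluated at each augmented point
    return np.mean(dens <= dens[-1])

# usage: a typical point gets a large p-value, an outlying point a small one
rng = np.random.default_rng(7)
Y = rng.normal(size=200)
print(conformal_pvalue(Y, 0.0, h=0.3), conformal_pvalue(Y, 5.0, h=0.3))
```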
Computing $C_n$ is tedious. Fortunately, Lei, Robins and Wasserman (2014) show that there is a simpler set that still has the correct coverage (but is slightly larger). The set is constructed as follows. Let $Z_i = \hat p(Y_i)$. Order these values
$$Z_{(1)} \le \cdots \le Z_{(n)}.$$
Lemma 16 We have that $C_n \subset C_n^+$ and hence
$$P(Y_{n+1}\in C_n^+) \ge 1 - \alpha.$$
Finally, we note that any Yi with a small p-value can be regarded as an outlier (anomaly).
The above method is exact. We can also use a simpler, asymptotic approach. With $Z_{(k)}$ defined above, set $\widehat C = \{y : \hat p(y) \ge t\}$ where now $t = Z_{(k)}$. From Cadre, Pelletier and Pudlo (2013) we have that
$$\sqrt{nh^d}\,\mu\big(\widehat C\,\Delta\,C\big) \stackrel{P}{\to} c$$
for some constant c, where C is the true 1 − α level set. Hence, $P(Y_{n+1}\in\widehat C) = 1 - \alpha + o_P(1)$.
Note also that the L2 loss does not make any sense. If you tried to use cross-validation, you would find that the estimated risk is minimized at h = 0.
A simple solution is to focus on estimating the smoothed density ph (x) which is well-defined
for every h > 0. More sophisticated ideas are based on topological data analysis which we
discuss later in the course.
11 Series Methods
We have emphasized kernel density estimation. There are many other density estimation
methods. Let us briefly mention a method based on basis functions. For simplicity, suppose
that $X_i \in [0, 1]$ and let $\phi_1, \phi_2, \ldots$ be an orthonormal basis for
$$\mathcal F = \Big\{ f : [0,1]\to\mathbb{R},\ \int_0^1 f^2(x)\,dx < \infty \Big\}.$$
Thus
$$\int\phi_j^2(x)\,dx = 1, \qquad \int\phi_j(x)\phi_k(x)\,dx = 0 \ \text{ for } j \ne k.$$
An example is the cosine basis:
$$\phi_0(x) = 1, \qquad \phi_j(x) = \sqrt{2}\cos(2\pi j x), \quad j = 1, 2, \ldots$$
If $p \in \mathcal F$ then
$$p(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x)$$
where $\beta_j = \int_0^1 p(x)\phi_j(x)\,dx$. An estimate of p is $\hat p(x) = \sum_{j=1}^{k}\hat\beta_j\phi_j(x)$ where
$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^{n}\phi_j(X_i).$$
The number of terms k is the smoothing parameter and can be chosen using cross-validation.
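A sketch (ours) of the cosine-series estimator; note that the truncated series can dip below zero, a known feature of series estimates.

```python
import numpy as np

def series_density(X, k):
    """Cosine-basis density estimate on [0, 1] with k non-constant terms (sketch)."""
    X = np.asarray(X, dtype=float)
    j = np.arange(1, k + 1)
    # beta_hat_j = (1/n) sum_i sqrt(2) cos(2 pi j X_i)
    beta_hat = np.sqrt(2) * np.cos(2 * np.pi * j[None, :] * X[:, None]).mean(axis=0)

    def p_hat(x):
        x = np.atleast_1d(x)
        basis = np.sqrt(2) * np.cos(2 * np.pi * j[None, :] * x[:, None])
        return 1.0 + basis @ beta_hat     # beta_0 = 1 since phi_0 = 1

    return p_hat

# usage
rng = np.random.default_rng(8)
X = rng.beta(2.0, 5.0, size=1000)
p_hat = series_density(X, k=8)
print(p_hat([0.1, 0.5, 0.9]))
```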
The risk of this estimator is
$$R(k) = E\int\big(\hat p(x) - p(x)\big)^2\,dx = \sum_{j=1}^{k}\mathrm{Var}(\hat\beta_j) + \sum_{j=k+1}^{\infty}\beta_j^2.$$
The first term is of order O(k/n). To bound the second term (the bias) one usually assumes that p lies in a Sobolev space of order q, which means that $p\in\mathcal P$ with
$$\mathcal P = \Big\{ p\in\mathcal F : p = \sum_j\beta_j\phi_j,\ \sum_{j=1}^{\infty}\beta_j^2 j^{2q} < \infty \Big\}.$$
11.1 L1 Methods
VC Classes. Let $\mathcal A$ be a class of sets with VC dimension ν. As in Section 5.2, split the data X into Y and Z, with $\mathcal P = \{\hat p_1, \ldots, \hat p_N\}$ constructed from Y. For $g\in\mathcal P$ define
$$\Delta_n(g) = \sup_{A\in\mathcal A}\left|\int_A g(x)\,dx - P_n(A)\right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n}I(Z_i\in A)$. Let $\hat p = \mathrm{argmin}_{g\in\mathcal P}\,\Delta_n(g)$.
The difficulty in implementing this idea is computing and minimizing ∆n (g). Hjort and
Walker (2001) presented a similar method which can be practically implemented when d = 1.
Yatracos Classes. Devroye and Györfi (2001) use a class of sets called a Yatracos class which leads to estimators with some remarkable properties. Let $\mathcal P = \{p_1, \ldots, p_N\}$ be a set of densities and define the Yatracos class of sets $\mathcal A = \big\{A(i, j) : i \ne j\big\}$ where $A(i, j) = \{x : p_i(x) > p_j(x)\}$. Let
$$\hat p = \mathrm{argmin}_{g\in\mathcal P}\,\Delta_n(g)$$
where
$$\Delta_n(g) = \sup_{A\in\mathcal A}\left|\int_A g(u)\,du - P_n(A)\right|$$
and $P_n(A) = n^{-1}\sum_{i=1}^{n}I(Z_i\in A)$ is the empirical measure based on a sample $Z_1, \ldots, Z_n \sim p$.
Theorem 18 The estimator $\hat p$ satisfies
$$\int|\hat p - p| \le 3\min_j\int|p_j - p| + 4\Delta \qquad (34)$$
where $\Delta = \sup_{A\in\mathcal A}\left|\int_A p - P_n(A)\right|$.

The term $\min_j\int|p_j - p|$ is like a bias while the term Δ is like the variance.
Proof. Let i be such that $\hat p = p_i$ and let s be such that $\int|p_s - p| = \min_j\int|p_j - p|$. Let $B = \{p_i > p_s\}$ and $C = \{p_s > p_i\}$. Now,
$$\int|\hat p - p| \le \int|p_s - p| + \int|p_s - p_i|. \qquad (35)$$
Now we apply this to kernel estimators. Again we split the data X into two halves $Y = (Y_1, \ldots, Y_n)$ and $Z = (Z_1, \ldots, Z_n)$. For each h let
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}K\left(\frac{\|x - Y_i\|}{h}\right).$$
Let
$$\mathcal A = \Big\{A(h, \nu) : h, \nu > 0,\ h \ne \nu\Big\}$$
where $A(h, \nu) = \{x : \hat p_h(x) > \hat p_\nu(x)\}$. Define
$$\Delta_n(g) = \sup_{A\in\mathcal A}\left|\int_A g(u)\,du - P_n(A)\right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n}I(Z_i\in A)$ is the empirical measure based on Z. Let
$$\hat p = \mathrm{argmin}_{g\in\mathcal G}\,\Delta_n(g), \qquad \mathcal G = \{\hat p_h : h > 0\}.$$
Under some regularity conditions on the kernel, we have the following result.
The proof involves showing that the terms on the right hand side of (34) are small. We refer
the reader to Devroye and Györfi (2001) for the details.
Recall that $d_{TV}(P, Q) = \sup_A|P(A) - Q(A)| = \frac{1}{2}\int|p(x) - q(x)|\,dx$ where the supremum
is over all measurable sets. The above theorem says that the estimator does well in the
total variation metric, even though the method only used the Yatracos class of sets. Finding
computationally efficient methods to implement this approach remains an open question.
12 Mixtures
Another approach to density estimation is to use mixtures. We will discuss mixture modelling
when we discuss clustering.
In some problems, X is not just high dimensional, it is infinite dimensional. For example, suppose that each $X_i$ is a curve. An immediate problem is that the concept of a density is no longer well defined. On a Euclidean space, the density p for a probability measure P is the function that satisfies $P(A) = \int_A p(u)\,d\mu(u)$ for all measurable A, where μ is Lebesgue measure. Formally, we say that p is the Radon-Nikodym derivative of P with respect to the dominating measure μ. Geometrically, we can think of p as
$$p(x) = \lim_{\epsilon\to 0}\frac{P(\|X - x\|\le\epsilon)}{V(\epsilon)}$$
where $V(\epsilon) = \epsilon^d\pi^{d/2}/\Gamma(d/2 + 1)$ is the volume of a ball of radius ε. Under appropriate conditions, these two notions of density agree. (This is the Lebesgue density theorem.)
When the outcome space $\mathcal X$ is a set of curves, there is no dominating measure and hence there is no density. Instead, we define the density geometrically by
$$q_\epsilon(x) = P\big(\xi(x, X) \le \epsilon\big)$$
for a small ε > 0, where ξ is some metric on $\mathcal X$. However, we cannot divide by $V(\epsilon)$ and let ε tend to 0, since the dimension d is infinite.

One way around this is to use a fixed ε and work with the unnormalized density $q_\epsilon$. For the purpose of finding high-density regions this may be adequate. An estimate of $q_\epsilon$ is
$$\hat q_\epsilon(x) = \frac{1}{n}\sum_{i=1}^{n} I\big(\xi(x, X_i) \le \epsilon\big).$$
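The estimate $\hat q_\epsilon$ is easy to compute once a distance ξ is available. The sketch below is ours; it uses the Hausdorff distance between point sets as one possible choice of ξ for curve-valued data.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two point sets A, B (rows are points in R^2)."""
    D = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def q_hat(x, data, eps, xi):
    """Unnormalized density estimate: fraction of observations within eps of x."""
    return np.mean([xi(x, Xi) <= eps for Xi in data])

# toy usage: each "observation" is a curve stored as an (m, 2) array of points
rng = np.random.default_rng(9)
curves = [np.cumsum(rng.normal(scale=0.1, size=(50, 2)), axis=0) for _ in range(40)]
print(q_hat(curves[0], curves, eps=1.0, xi=hausdorff))
```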
An alternative is to expand $X_i$ in a basis: $X(t) \approx \sum_{j=1}^{k}\beta_j\psi_j(t)$. A density can then be defined in terms of the $\beta_j$'s.
Example 20 Figure 6 shows the tracks (or paths) of 40 North Atlantic tropical cyclones (TC). The full dataset, consisting of 608 TCs from 1950 to 2006, is shown in Figure 7. Buchman, Lee and Schafer (2009) provide a thorough analysis of the data; we refer the reader to their paper for the full details.$^4$
Each data point— that is, each TC track— is a curve in R2 . Various questions are of
interest: Where are the tracks most common? Is the density of tracks changing over time?
Is the track related to other variables such as windspeed and pressure?
Each curve $X_i$ can be regarded as a mapping $X_i : [0, T_i]\to\mathbb{R}^2$, where $X_i(t) = (X_{i1}(t), X_{i2}(t))$ is the position of the TC at time t and $T_i$ is the lifelength of the TC. Let
$$\Gamma_i = \Big\{(X_{i1}(t), X_{i2}(t)) : 0 \le t \le T_i\Big\}$$
denote the track of the ith TC, regarded as a subset of $\mathbb{R}^2$; distances between tracks are measured with the Hausdorff distance $d_H$.
Figure 6: Paths of 40 tropical cyclones in the North Atlantic.
Figure 8 shows the 10 TCs with the highest local density and the 10 TCs with the lowest local density, using ε = 16.38. This choice of ε corresponds to the 10th percentile of the values $\{d_H(X_i, X_j) : i \ne j\}$. The high density tracks correspond to TCs in the Gulf of Mexico with short paths. The low density tracks correspond to TCs in the Atlantic with long paths.
15 Miscellanea
Another method for selecting h, which is sometimes used when p is thought to be very smooth, is the plug-in method. The idea is to take the formula for the mean squared error (equation (31)), insert a guess of $p''$, and then solve for the optimal bandwidth h. For example, if d = 1
Figure 7: Paths of 608 tropical cyclones in the North Atlantic.
Figure 8: 10 highest density paths (black) and 10 lowest density paths (blue).
and under the idealized assumption that p is a univariate Normal, this yields $h_* = 1.06\,\sigma\,n^{-1/5}$. Usually, σ is estimated by $\min\{s, Q/1.34\}$ where s is the sample standard deviation and Q
is the interquartile range.5 This choice of h∗ works well if the true density is very smooth
and is called the Normal reference rule.
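For d = 1 the Normal reference rule is a one-liner; the function below is our own sketch of it.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """h* = 1.06 * sigma_hat * n^(-1/5) with sigma_hat = min(s, Q/1.34) (sketch)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    s = X.std(ddof=1)                          # sample standard deviation
    q75, q25 = np.percentile(X, [75, 25])
    sigma_hat = min(s, (q75 - q25) / 1.34)     # robust scale estimate
    return 1.06 * sigma_hat * n ** (-1 / 5)

# usage
rng = np.random.default_rng(10)
print(normal_reference_bandwidth(rng.normal(size=500)))
```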
Since we don’t want to necessarily assume that p is very smooth, it is usually better to
estimate h using cross-validation. See Loader (1999) for an interesting comparison between
cross-validation and plugin methods.
A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h(x) for each point x. One can also use a different bandwidth h(xi ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.
Density estimation is sometimes used to find unusual observations or outliers. These are
observations for which pb(Xi ) is very small.
16 Summary
1. A commonly used nonparametric density estimator is the kernel estimator
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}K\left(\frac{\|x - X_i\|}{h}\right).$$
$^5$Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a $N(\mu, \sigma^2)$.