Density Estimation 36-708
1 Introduction
Density estimation is used for regression, classification, clustering and unsupervised prediction. For example, if $\hat p(x, y)$ is an estimate of $p(x, y)$ then we get the following estimate of the regression function:
$$\hat m(x) = \int y\,\hat p(y|x)\,dy$$
where $\hat p(y|x) = \hat p(y, x)/\hat p(x)$. For classification, recall that the Bayes rule is $h(x) = I\big(\pi_1 p_1(x) > \pi_0 p_0(x)\big)$ where $\pi_1 = P(Y = 1)$, $\pi_0 = P(Y = 0)$, $p_1(x) = p(x|y = 1)$ and $p_0(x) = p(x|y = 0)$. Inserting sample estimates of $\pi_1$ and $\pi_0$, and density estimates for $p_1$ and $p_0$, yields an estimate of the Bayes rule; many classifiers that you are familiar with can be re-expressed this way. For clustering, we look for the high density regions, based on an estimate of the density. Unsupervised prediction is discussed in Section 9; there we want to predict $X_{n+1}$ from $X_1, \ldots, X_n$.
Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
$$p(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10}\sum_{j=0}^{4}\phi\big(x; (j/2) - 1, 1/10\big) \qquad (1)$$
where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson
density. Based on 1,000 draws from p, we computed a kernel density estimator, described
later. The estimator depends on a tuning parameter called the bandwidth. The top right plot
is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.
Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.
2 Loss Functions
Devroye and Györfi (1985) make a strong case for using the L1 norm
$$\|\hat p - p\|_1 \equiv \int |\hat p(x) - p(x)|\,dx$$
as the loss instead of L2. The L1 loss has the following nice interpretation. If P and Q are distributions, define the total variation metric
$$d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$$
where the supremum is over all measurable sets. Now if P and Q have densities p and q then
$$d_{TV}(P, Q) = \frac{1}{2}\int |p - q| = \frac{1}{2}\|p - q\|_1.$$
Thus, if $\int |p - q| < \delta$ then we know that $|P(A) - Q(A)| < \delta/2$ for all A. Also, the L1 norm is transformation invariant. Suppose that T is a one-to-one smooth function. Let $Y = T(X)$. Let p and q be densities for X and let $\tilde p$ and $\tilde q$ be the corresponding densities for Y. Then
$$\int |p(x) - q(x)|\,dx = \int |\tilde p(y) - \tilde q(y)|\,dy.$$
Hence the distance is unaffected by transformations. The L1 loss is, in some sense, a much better loss function than L2 for density estimation, but it is much more difficult to deal with. For now, we will focus on L2 loss; we may discuss L1 loss later.
Another loss function is the Kullback-Leibler loss $\int p(x)\log\big(p(x)/q(x)\big)\,dx$. This is not a good loss function to use for nonparametric density estimation because the Kullback-Leibler loss is completely dominated by the tails of the densities.
3 Histograms
Perhaps the simplest density estimators are histograms. For convenience, assume that the
data X1 , . . . , Xn are contained in the unit cube X = [0, 1]d (although this assumption is not
crucial). Divide X into bins, or sub-cubes, of size h. We discuss methods for choosing
h later. There are $N \approx (1/h)^d$ such bins and each has volume $h^d$. Denote the bins by $B_1, \ldots, B_N$. The histogram density estimator is
$$\hat p_h(x) = \sum_{j=1}^{N}\frac{\hat\theta_j}{h^d}\,I(x\in B_j) \qquad (2)$$
where
$$\hat\theta_j = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in B_j)$$
is the fraction of data points in bin $B_j$. Now we bound the bias and variance of $\hat p_h$. We will assume that $p \in \mathcal P(L)$ where
$$\mathcal P(L) = \Big\{ p : |p(x) - p(y)| \le L\|x - y\| \ \text{ for all } x, y \Big\}. \qquad (3)$$
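To make estimator (2) concrete, here is a minimal NumPy sketch on the unit cube. This is our own illustration (the function names and the dictionary-based bin counting are not from the notes), not a definitive implementation.

```python
import numpy as np

def histogram_density(X, h):
    """Histogram density estimator (2) on [0,1]^d with cubic bins of side h.

    X : (n, d) array of points in [0, 1]^d.  Returns a function x -> p_hat(x).
    Illustrative sketch; bins on the boundary are handled by clipping.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = int(np.ceil(1.0 / h))                     # bins per coordinate, N ~ (1/h)^d total
    idx = np.minimum((X / h).astype(int), m - 1)  # bin index of each observation
    counts = {}
    for row in map(tuple, idx):
        counts[row] = counts.get(row, 0) + 1

    def p_hat(x):
        j = tuple(np.minimum((np.asarray(x, dtype=float) / h).astype(int), m - 1))
        theta_hat = counts.get(j, 0) / n          # fraction of points in the bin containing x
        return theta_hat / h ** d                 # estimator (2)

    return p_hat

# usage: 500 draws from a Beta-like density on [0,1]^2
rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=(500, 2))
p_hat = histogram_density(X, h=0.1)
print(p_hat([0.2, 0.2]))
```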
First we bound the bias. Let $\theta_j = P(X\in B_j) = \int_{B_j} p(u)\,du$. For any $x \in B_j$,
$$p_h(x) \equiv E\big(\hat p_h(x)\big) = \frac{\theta_j}{h^d} \qquad (4)$$
and hence
$$p(x) - p_h(x) = p(x) - \frac{\int_{B_j} p(u)\,du}{h^d} = \frac{1}{h^d}\int_{B_j}\big(p(x) - p(u)\big)\,du.$$
Thus,
$$|p(x) - p_h(x)| \le \frac{1}{h^d}\int_{B_j}|p(x) - p(u)|\,du \le \frac{1}{h^d}\int_{B_j} Lh\sqrt{d}\,du = Lh\sqrt{d}$$
where we used the fact that if $x, u \in B_j$ then $\|x - u\| \le \sqrt{d}\,h$.
Next we bound the variance:
$$\mathrm{Var}\big(\hat p_h(x)\big) = \frac{1}{h^{2d}}\mathrm{Var}(\hat\theta_j) = \frac{\theta_j(1 - \theta_j)}{nh^{2d}} \le \frac{\theta_j}{nh^{2d}} \le \frac{C}{nh^d}.$$
Combining the bias and variance bounds, the L2 risk satisfies
$$\sup_{p\in\mathcal P(L)} E\int\big(\hat p_h(x) - p(x)\big)^2\,dx \le L^2 d\,h^2 + \frac{C}{nh^d}. \qquad (5)$$
The right hand side is minimized by choosing $h = \left(\frac{C}{L^2 d\,n}\right)^{1/(d+2)}$, which gives
$$\sup_{p\in\mathcal P(L)} E\int\big(\hat p_h(x) - p(x)\big)^2\,dx \le C_0\left(\frac{1}{n}\right)^{2/(d+2)} \qquad (6)$$
where $C_0 = L^2 d\,(C/(L^2 d))^{2/(d+2)}$.
Later, we will prove a theorem which shows that this upper bound is tight: no estimator can uniformly beat the rate $n^{-2/(d+2)}$ over $\mathcal P(L)$.

We can also bound $\sup_{P\in\mathcal P} P^n\big(\|\hat p_h - p\|_\infty > \epsilon\big)$. For $x\in B_j$ we have $|\hat p_h(x) - p_h(x)| = |\hat\theta_j - \theta_j|/h^d$, so Bernstein's inequality gives
$$P^n\big(|\hat p_h(x) - p_h(x)| > \epsilon\big) \le 2\exp\big(-c\,nh^d\epsilon^2\big)$$
for all small ε, where $c = 1/(2(C + 1/3))$. By the union bound and the fact that $N \le (1/h)^d$,
$$\pi_n \equiv P^n\big(\|\hat p_h - p_h\|_\infty > \epsilon\big) \le 2\left(\frac{1}{h}\right)^d\exp\big(-c\,nh^d\epsilon^2\big).$$
Earlier we saw that $\sup_x |p(x) - p_h(x)| \le L\sqrt{d}\,h$. Hence, with probability at least $1 - \pi_n$,
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \epsilon + L\sqrt{d}\,h. \qquad (7)$$
Now set
$$\epsilon = \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)}.$$
Then, with probability at least 1 − δ,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{cnh^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\,h. \qquad (8)$$
4 Kernel Density Estimation

A one-dimensional smoothing kernel is any smooth function K such that $\int K(x)\,dx = 1$, $\int xK(x)\,dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\,dx > 0$. Some commonly used kernels are the following:

Boxcar: $K(x) = \frac{1}{2}I(x)$, Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2)I(x)$, Tricube: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$,

where $I(x) = I(|x| \le 1)$. These kernels are plotted in Figure 2.
Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).
Figure 3: A kernel density estimator $\hat p$. At each point x, $\hat p(x)$ is the average of the kernels centered over the data points $X_i$. The data points are indicated by short vertical bars. The kernels are not drawn to scale.
Suppose that $X \in \mathbb{R}^d$. Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be
$$\hat p(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}K\left(\frac{\|x - X_i\|}{h}\right). \qquad (10)$$
More generally, we can define
$$\hat p_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i)$$
where H is a positive definite bandwidth matrix and $K_H(x) = |H|^{-1/2}K(H^{-1/2}x)$. For simplicity, we will take $H = h^2 I$, which gives back the previous formula.
Sometimes we write the estimator as $\hat p_h$ to emphasize the dependence on h. In the multivariate case the coordinates of $X_i$ should be standardized so that each has the same variance, since the norm $\|x - X_i\|$ treats all coordinates as if they are on the same scale.
The kernel estimator places a smoothed out lump of mass of size 1/n over each data point
Xi ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.
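As a quick illustration (ours, not from the notes), estimator (10) with a Gaussian kernel can be written in a few lines of NumPy; standardizing the coordinates beforehand, as recommended above, is left to the caller.

```python
import numpy as np

def kde_gaussian(X, h):
    """Kernel density estimator (10) with the radially symmetric Gaussian kernel.

    X : (n, d) data array; h : bandwidth.  Returns a function evaluating p_hat.
    Illustrative sketch only.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    norm_const = (2.0 * np.pi) ** (-d / 2.0)   # makes the d-dimensional kernel integrate to 1

    def p_hat(x):
        x = np.atleast_2d(x)                                  # (m, d) query points
        sq = ((x[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # ||x - X_i||^2, shape (m, n)
        K = norm_const * np.exp(-0.5 * sq / h ** 2)           # kernel at ||x - X_i|| / h
        return K.mean(axis=1) / h ** d                        # average of scaled kernels

    return p_hat

# usage
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
p_hat = kde_gaussian(X, h=0.4)
print(p_hat([[0.0, 0.0], [2.0, 2.0]]))
```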
4.1 Risk Analysis
In this section we examine the accuracy of kernel density estimation. We will first need a
few definitions.
Given a vector $s = (s_1, \ldots, s_d)$ of nonnegative integers, define $|s| = s_1 + \cdots + s_d$, $s! = s_1!\cdots s_d!$, $x^s = x_1^{s_1}\cdots x_d^{s_d}$ and
$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1}\cdots\partial x_d^{s_d}}.$$
Let β be a positive integer. Define the Hölder class
$$\Sigma(\beta, L) = \Big\{ g : |D^s g(x) - D^s g(y)| \le L\|x - y\| \ \text{ for all } s \text{ such that } |s| = \beta - 1, \text{ and all } x, y \Big\}. \qquad (11)$$
For example, if d = 1 and β = 2 this means that
$$|p'(x) - p'(y)| \le L|x - y| \quad \text{for all } x, y. \qquad (12)$$
The most common case is β = 2; roughly speaking, this means that the functions have bounded second derivatives.
If $g \in \Sigma(\beta, L)$ then g is close to its Taylor polynomial:
$$|g(u) - g_{x,\beta}(u)| \le L\|u - x\|^{\beta}$$
where
$$g_{x,\beta}(u) = \sum_{|s|\le\beta}\frac{(u - x)^s}{s!}\,D^s g(x). \qquad (13)$$
Let $p_h(x) = E[\hat p_h(x)]$. The next lemma provides a bound on the bias $p_h(x) - p(x)$.
Lemma 3 The bias of $\hat p_h$ satisfies
$$\sup_{p\in\Sigma(\beta, L)}|p_h(x) - p(x)| \le ch^{\beta} \qquad (14)$$
for some c.
Proof. We have
$$|p_h(x) - p(x)| = \left|\frac{1}{h^d}\int K(\|u - x\|/h)\,p(u)\,du - p(x)\right| = \left|\int K(\|v\|)\big(p(x + hv) - p(x)\big)\,dv\right|$$
$$\le \left|\int K(\|v\|)\big(p(x + hv) - p_{x,\beta}(x + hv)\big)\,dv\right| + \left|\int K(\|v\|)\big(p_{x,\beta}(x + hv) - p(x)\big)\,dv\right|.$$
The first term is bounded by $Lh^{\beta}\int K(\|s\|)\,\|s\|^{\beta}\,ds$ since $p\in\Sigma(\beta, L)$. The second term is 0 from the properties on K, since $p_{x,\beta}(x + hv) - p(x)$ is a polynomial of degree β (with no constant term). □
Lemma 4 The variance of $\hat p_h$ satisfies
$$\mathrm{Var}\big(\hat p_h(x)\big) \le \frac{c}{nh^d} \qquad (15)$$
for some c > 0.

Proof. Write $\hat p_h(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where $Z_i = h^{-d}K(\|x - X_i\|/h)$. Then
$$\mathrm{Var}(Z_i) \le E(Z_i^2) = \frac{1}{h^{2d}}\int K^2\left(\frac{\|x - u\|}{h}\right)p(u)\,du = \frac{1}{h^d}\int K^2(\|v\|)\,p(x + hv)\,dv \le \frac{\sup_x p(x)}{h^d}\int K^2(\|v\|)\,dv \le \frac{c}{h^d}$$
for some c, since the densities in Σ(β, L) are uniformly bounded. The result follows since $\mathrm{Var}(\hat p_h(x)) = n^{-1}\mathrm{Var}(Z_i)$. □
Since the mean squared error is equal to the variance plus the bias squared, we have:

Theorem 5 The L2 risk is bounded above, uniformly over Σ(β, L), by $h^{2\beta} + \frac{1}{nh^d}$ (up to constants). If $h \asymp n^{-1/(2\beta + d)}$ then
$$\sup_{p\in\Sigma(\beta, L)} E\int\big(\hat p_h(x) - p(x)\big)^2\,dx \lesssim \left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta + d}}. \qquad (16)$$
4.2 Minimax Bound
According to the next theorem, there does not exist an estimator that converges faster than
O(n−2β/(2β+d) ). We state the result for integrated L2 loss although similar results hold for
other loss functions and other function spaces. We will prove this later in the course.

Theorem 6 There exists a constant c > 0 such that
$$\inf_{\hat p}\ \sup_{p\in\Sigma(\beta, L)} E\int\big(\hat p(x) - p(x)\big)^2\,dx \ge c\left(\frac{1}{n}\right)^{\frac{2\beta}{2\beta + d}} \qquad (17)$$
where the infimum is over all estimators $\hat p$.

Theorem 6 together with (16) implies that kernel estimators are rate minimax.
Now we state a result which says how fast $\hat p(x)$ concentrates around p(x). First, recall Bernstein's inequality: suppose that $Y_1, \ldots, Y_n$ are iid with mean μ, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$P\big(|\overline Y - \mu| > \epsilon\big) \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right). \qquad (18)$$

Theorem 7 For all small ε > 0,
$$P\big(|\hat p(x) - p(x)| > ch^{\beta} + \epsilon\big) \le 2\exp\left(-\frac{nh^d\epsilon^2}{4c_2}\right) \qquad (19)$$
where $c_2$ is a constant depending only on K. Hence, if $h \asymp n^{-1/(2\beta + d)}$ then
$$|\hat p(x) - p(x)| = O_P\big(n^{-\beta/(2\beta + d)}\big). \qquad (20)$$

Note that the last statement follows from the bias-variance calculation followed by Markov's inequality. The first statement does not.
Proof. We have
$$|\hat p(x) - p(x)| \le |\hat p(x) - p_h(x)| + |p_h(x) - p(x)| \qquad (21)$$
where $p_h(x) = E(\hat p(x))$. From Lemma 3, $|p_h(x) - p(x)| \le ch^{\beta}$ for some c. Now $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where
$$Z_i = \frac{1}{h^d}K\left(\frac{\|x - X_i\|}{h}\right).$$
Note that $|Z_i| \le c_1/h^d$ where $c_1 = K(0)$. Also, $\mathrm{Var}(Z_i) \le c_2/h^d$ from Lemma 4. Hence, by Bernstein's inequality,
$$P\big(|\hat p(x) - p_h(x)| > \epsilon\big) \le 2\exp\left(-\frac{n\epsilon^2}{2c_2 h^{-d} + 2c_1 h^{-d}\epsilon/3}\right) \le 2\exp\left(-\frac{nh^d\epsilon^2}{4c_2}\right)$$
whenever $\epsilon \le 3c_2/c_1$. If we choose $\epsilon = \sqrt{C\log(2/\delta)/(nh^d)}$ where $C = 4c_2$ then
$$P\left(|\hat p(x) - p_h(x)| > \sqrt{\frac{C\log(2/\delta)}{nh^d}}\right) \le \delta. \qquad \square$$
4.4 Concentration in L∞
Theorem 7 shows that, for each x, $\hat p(x)$ is close to p(x) with high probability. We would like a version of this result that holds uniformly over all x. That is, we want a concentration result for
$$\|\hat p - p\|_\infty = \sup_x |\hat p(x) - p(x)|.$$
We can write
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + ch^{\beta}.$$
We can bound the first term using something called bracketing together with Bernstein's theorem to prove that
$$P\big(\|\hat p_h - p_h\|_\infty > \epsilon\big) \le 4\left(\frac{C}{h^{d+1}}\right)^d \exp\left(-\frac{3n\epsilon^2 h^d}{28K(0)}\right). \qquad (22)$$
A more refined result is due to Giné and Guillou (2002). Assume that the class $\mathcal F_h = \{K(\|x - \cdot\|/h) : x \in \mathbb{R}^d\}$ satisfies, for some positive constants A and v,
$$\sup_P N\big(\mathcal F_h, L_2(P), \epsilon\|F\|_{L_2(P)}\big) \le \left(\frac{A}{\epsilon}\right)^v \qquad (23)$$
where $N(T, d, \epsilon)$ denotes the ε-covering number of the metric space (T, d), F is the envelope function of $\mathcal F_h$ and the supremum is taken over the set of all probability measures on $\mathbb{R}^d$. The quantities A and v are called the VC characteristics of $\mathcal F_h$.
Theorem 8 (Giné and Guillou 2002) Assume that the kernel satisfies the above property.

1. Let h > 0 be fixed. Then there exist constants $c_1 > 0$ and $c_2 > 0$ such that, for all small ε > 0 and all large n,
$$P\left(\sup_{x\in\mathbb{R}^d}|\hat p_h(x) - p_h(x)| > \epsilon\right) \le c_1\exp\big(-c_2 nh^d\epsilon^2\big). \qquad (24)$$

2. Let $h_n \to 0$ as $n \to \infty$ in such a way that $\frac{nh_n^d}{|\log h_n^d|} \to \infty$. Let
$$\epsilon_n \ge \sqrt{\frac{|\log h_n|}{nh_n^d}}. \qquad (25)$$
Then, for all n large enough, (24) holds with h and ε replaced by $h_n$ and $\epsilon_n$, respectively.
The above theorem imposes minimal assumptions on the kernel K and, more importantly,
on the probability distribution P , whose density is not required to be bounded or smooth,
and, in fact, may not even exist. Combining the above theorem with Lemma 3 we have the
following result.
With probability at least 1 − δ,
$$\sup_x |\hat p(x) - p(x)| \le ch^{\beta} + \sqrt{\frac{C|\log h|}{nh^d}}$$
for some constants C and c, where C depends on δ. Choosing $h \asymp (\log n/n)^{1/(2\beta + d)}$ we have
$$P\left(\sup_x |\hat p(x) - p(x)|^2 > \frac{C\log n}{n^{2\beta/(2\beta + d)}}\right) < \delta.$$
We have ignored what happens near the boundary of the sample space. If x is O(h) close to the boundary, the bias is O(h) instead of $O(h^2)$. There are a variety of fixes including: data reflection, transformations, boundary kernels, and local likelihood.
4.6 Confidence Bands and the CLT
Consider first a single point x. Let $s_n(x) = \sqrt{\mathrm{Var}(\hat p_h(x))}$. The CLT implies that
$$Z_n(x) \equiv \frac{\hat p_h(x) - p_h(x)}{s_n(x)} \rightsquigarrow N(0, 1)$$
as long as
$$\lim_{n\to\infty}\sum_{i=1}^{n} E|L_{n,i}|^{2+\delta} = 0$$
for some δ > 0. But this does not yield a confidence interval for p(x). To see why, let us write
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} = \frac{\hat p_h(x) - p_h(x)}{s_n(x)} + \frac{p_h(x) - p(x)}{s_n(x)} = Z_n(x) + \frac{\mathrm{bias}}{\sqrt{\mathrm{var}(x)}}.$$
Assuming that we optimize the risk by balancing the bias and the variance, the second term is asymptotically some constant c. So
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} \rightsquigarrow N\big(c, \tau^2(x)\big).$$
This means that the usual confidence interval $\hat p_h(x) \pm z_{\alpha/2}\,s_n(x)$ will not cover p(x) with probability tending to 1 − α. One fix for this is to undersmooth the estimator (we sacrifice risk for coverage). An easier approach is just to interpret $\hat p_h(x) \pm z_{\alpha/2}\,s_n(x)$ as a confidence interval for the smoothed density $p_h(x)$ instead of p(x).
But this only gives an interval at one point. To get a confidence band we use the bootstrap. Let $P_n$ be the empirical distribution of $X_1, \ldots, X_n$. The idea is to estimate the distribution
$$F_n(t) = P\left(\sqrt{nh^d}\,\|\hat p_h - p_h\|_\infty \le t\right)$$
by the bootstrap estimate
$$\widehat F_n(t) = P\left(\sqrt{nh^d}\,\|\hat p^*_h - \hat p_h\|_\infty \le t \,\Big|\, X_1, \ldots, X_n\right)$$
where $\hat p^*_h$ is constructed from the bootstrap sample $X_1^*, \ldots, X_n^* \sim P_n$. Later in the course, we will show that
$$\sup_t |F_n(t) - \widehat F_n(t)| \stackrel{P}{\to} 0.$$
This suggests the following algorithm:

1. Let $P_n$ be the empirical distribution that puts mass 1/n at each data point $X_i$.
2. Draw $X_1^*, \ldots, X_n^* \sim P_n$. This is called a bootstrap sample.
3. Compute the density estimator $\hat p^*_h$ based on the bootstrap sample.
4. Compute $R = \sqrt{nh^d}\,\|\hat p^*_h - \hat p_h\|_\infty$.
5. Repeat steps 2–4 B times. This gives $R_1, \ldots, R_B$.
6. Let $z_\alpha$ be the upper α quantile of $R_1, \ldots, R_B$.
7. Let
$$\ell_n(x) = \hat p_h(x) - \frac{z_\alpha}{\sqrt{nh^d}}, \qquad u_n(x) = \hat p_h(x) + \frac{z_\alpha}{\sqrt{nh^d}}.$$

See Figure 4.

Figure 4: Confidence bands for bandwidths h = 1, h = 2 and h = 3.
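A compact sketch (ours) of steps 1–7 for d = 1 with a Gaussian kernel; the grid, B and the helper `kde1d` are our own choices and the band is for the smoothed density $p_h$, as discussed above.

```python
import numpy as np

def kde1d(data, grid, h):
    """1-d Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def bootstrap_band(X, h, alpha=0.05, B=500, grid=None, rng=None):
    """Bootstrap confidence band for p_h (steps 1-7 above).  Illustrative sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    if grid is None:
        grid = np.linspace(X.min(), X.max(), 200)
    p_hat = kde1d(X, grid, h)
    R = np.empty(B)
    for b in range(B):
        Xstar = rng.choice(X, size=n, replace=True)       # bootstrap sample from P_n
        p_star = kde1d(Xstar, grid, h)
        R[b] = np.sqrt(n * h) * np.max(np.abs(p_star - p_hat))
    z_alpha = np.quantile(R, 1 - alpha)                   # upper-alpha quantile of R_1,...,R_B
    half_width = z_alpha / np.sqrt(n * h)
    return grid, p_hat - half_width, p_hat + half_width   # (ell_n, u_n), a band for p_h

# usage
rng = np.random.default_rng(2)
X = rng.normal(size=300)
grid, lower, upper = bootstrap_band(X, h=0.3, rng=rng)
```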
If you want a confidence band for p you need to reduce the bias (undersmooth). A simple way to do this is with twicing. Suppose that β = 2 and that we use the kernel estimator $\hat p_h$. Note that
$$E[\hat p_h(x)] = p(x) + C(x)h^2 + o(h^2)$$
$$E[\hat p_{2h}(x)] = p(x) + 4C(x)h^2 + o(h^2)$$
for some C(x). That is, the leading term of the bias is $b(x) = C(x)h^2$. So if we define the bias estimate $\hat b(x) = (\hat p_{2h}(x) - \hat p_h(x))/3$, then the twicing estimator $\tilde p_h(x) = \hat p_h(x) - \hat b(x)$ removes the leading bias term.
5 Cross-Validation
In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h. Here, we
describe two cross-validation methods.
A common method for estimating risk is leave-one-out cross-validation. Recall that the loss function is
$$\int\big(\hat p(x) - p(x)\big)^2\,dx = \int\hat p^2(x)\,dx - 2\int\hat p(x)p(x)\,dx + \int p^2(x)\,dx.$$
The last term does not involve $\hat p$ so we can drop it. Thus, we now define the loss to be
$$L(h) = \int\hat p^2(x)\,dx - 2\int\hat p(x)p(x)\,dx.$$
The risk is $R(h) = E[L(h)]$. Leave-one-out cross-validation estimates the risk by
$$\widehat R(h) = \int\hat p^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat p_{(-i)}(X_i) \qquad (26)$$
where $\hat p_{(-i)}$ is the density estimator obtained after removing the ith observation. It is easy to check that $E[\widehat R(h)] = R(h)$.
When the kernel is Gaussian, the cross-validation score can be written, after some tedious algebra, as follows. Let φ(z; σ) denote a Normal density with mean 0 and variance $\sigma^2$. Then,
$$\widehat R(h) = \frac{\phi^d(0; \sqrt{2}h)}{n - 1} + \frac{n - 2}{n(n-1)^2}\sum_{i\ne j}\prod_{\ell=1}^{d}\phi\big(X_{i\ell} - X_{j\ell}; \sqrt{2}h\big) \qquad (27)$$
$$\qquad - \frac{2}{n(n-1)}\sum_{i\ne j}\prod_{\ell=1}^{d}\phi\big(X_{i\ell} - X_{j\ell}; h\big). \qquad (28)$$
The estimator pb and the cross-validation score can be computed quickly using the fast Fourier
transform; see pages 61–66 of Silverman (1986).
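Formula (27)–(28) can also be evaluated directly in $O(n^2 d)$ time. The following is our own sketch (no FFT speed-up; the bandwidth grid is an arbitrary choice):

```python
import numpy as np

def phi(z, sigma):
    """Normal density with mean 0 and standard deviation sigma."""
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def cv_score(X, h):
    """Cross-validation score (27)-(28) for the Gaussian product kernel (sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]            # pairwise coordinate differences
    off = ~np.eye(n, dtype=bool)                    # indicator of i != j
    prod_sqrt2 = phi(diff, np.sqrt(2) * h).prod(axis=2)[off].sum()
    prod_h = phi(diff, h).prod(axis=2)[off].sum()
    return (phi(0.0, np.sqrt(2) * h) ** d / (n - 1)
            + (n - 2) / (n * (n - 1) ** 2) * prod_sqrt2
            - 2.0 / (n * (n - 1)) * prod_h)

# usage: pick the bandwidth minimizing the estimated risk over a grid
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
bandwidths = np.linspace(0.05, 1.0, 40)
scores = [cv_score(X, h) for h in bandwidths]
print(bandwidths[int(np.argmin(scores))])
```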
For histograms, it is easy to work out the leave-one-out cross-validation score in closed form:
$$\widehat R(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_j\hat\theta_j^2.$$
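This closed form is immediate to evaluate; below is our own sketch for a one-dimensional histogram on [0, 1], with an arbitrary grid of binwidths.

```python
import numpy as np

def hist_cv_score(X, h):
    """Leave-one-out CV score for a 1-d histogram on [0, 1] with binwidth h (sketch)."""
    n = len(X)
    m = int(np.ceil(1.0 / h))
    counts = np.bincount(np.minimum((np.asarray(X) / h).astype(int), m - 1), minlength=m)
    theta_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(theta_hat ** 2)

# usage: choose the binwidth with the smallest estimated risk
rng = np.random.default_rng(4)
X = rng.beta(2.0, 5.0, size=500)
print(min(np.linspace(0.01, 0.2, 20), key=lambda h: hist_cv_score(X, h)))
```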
A further justification for cross-validation is given by the following theorem due to Stone
(1984).
Theorem 12 (Stone's theorem) Suppose that p is bounded. Let $\hat p_h$ denote the kernel estimator with bandwidth h and let $\hat h$ denote the bandwidth chosen by cross-validation. Then,
$$\frac{\int\big(p(x) - \hat p_{\hat h}(x)\big)^2\,dx}{\inf_h\int\big(p(x) - \hat p_h(x)\big)^2\,dx}\ \stackrel{a.s.}{\longrightarrow}\ 1. \qquad (29)$$
The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator pb is wiggly, then cross-validation
has let you down. The eye is not a good judge of risk.
There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.
Another approach is data splitting. Let $\hat p_h$ denote the kernel estimator based on bandwidth h. For simplicity, assume the sample size is even and denote the sample size by 2n. Randomly split the data $X = (X_1, \ldots, X_{2n})$ into two sets of size n. Denote these by $Y = (Y_1, \ldots, Y_n)$ and $Z = (Z_1, \ldots, Z_n)$.$^1$ Let $\mathcal H = \{h_1, \ldots, h_N\}$ be a finite grid of bandwidths. Let
$$\hat p_j(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_j^d}K\left(\frac{\|x - Y_i\|}{h_j}\right).$$
Thus we have a set $\mathcal P = \{\hat p_1, \ldots, \hat p_N\}$ of density estimators.
We would like to minimize $L(p, \hat p_j) = \int\hat p_j^2(x)\,dx - 2\int\hat p_j(x)p(x)\,dx$. Define the estimated risk
$$\widehat L_j \equiv \widehat L(p, \hat p_j) = \int\hat p_j^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat p_j(Z_i). \qquad (30)$$
$$X = (X_1, \ldots, X_{2n}) \ \overset{\text{split}}{\Longrightarrow}\ \begin{cases} Y \ \longrightarrow\ \{\hat p_1, \ldots, \hat p_N\} = \mathcal P\\[4pt] Z \ \longrightarrow\ \{\widehat L_1, \ldots, \widehat L_N\}\end{cases}$$
We then select $\hat p = \hat p_{\hat j}$ where $\hat j = \mathrm{argmin}_j\,\widehat L_j$.
The selected estimator satisfies an oracle inequality, which can be proved using concentration of measure techniques that we discuss later in class. A similar result can be proved for V-fold cross-validation.
In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.
Theorem 14 Let $R_x = E\big(p(x) - \hat p(x)\big)^2$ and let $R = \int R_x\,dx$. Assume that $p''$ is absolutely continuous and that $\int\big(p'''(x)\big)^2\,dx < \infty$. Then
$$R_x = \frac{1}{4}\sigma_K^4 h_n^4\big(p''(x)\big)^2 + \frac{p(x)\int K^2(x)\,dx}{nh_n} + O\left(\frac{1}{n}\right) + O(h_n^6)$$
and
$$R = \frac{1}{4}\sigma_K^4 h_n^4\int\big(p''(x)\big)^2\,dx + \frac{\int K^2(x)\,dx}{nh_n} + O\left(\frac{1}{n}\right) + O(h_n^6) \qquad (31)$$
where $\sigma_K^2 = \int x^2 K(x)\,dx$.
Proof. Write $K_h(x, X) = h^{-1}K\big((x - X)/h\big)$ and $\hat p(x) = n^{-1}\sum_i K_h(x, X_i)$. Thus, $E[\hat p(x)] = E[K_h(x, X)]$ and $\mathrm{Var}[\hat p(x)] = n^{-1}\mathrm{Var}[K_h(x, X)]$. Now,
$$E[K_h(x, X)] = \int\frac{1}{h}K\left(\frac{x - t}{h}\right)p(t)\,dt = \int K(u)\,p(x - hu)\,du$$
$$= \int K(u)\left[p(x) - hu\,p'(x) + \frac{h^2u^2}{2}p''(x) + \cdots\right]du = p(x) + \frac{1}{2}h^2 p''(x)\int u^2 K(u)\,du + \cdots$$
since $\int K(x)\,dx = 1$ and $\int xK(x)\,dx = 0$. The bias is
$$E[K_{h_n}(x, X)] - p(x) = \frac{1}{2}\sigma_K^2 h_n^2 p''(x) + O(h_n^4).$$
By a similar calculation,
$$\mathrm{Var}[\hat p(x)] = \frac{p(x)\int K^2(x)\,dx}{nh_n} + O\left(\frac{1}{n}\right).$$
The first result then follows since the risk is the squared bias plus the variance. The second result follows from integrating the first result. □
If we differentiate (31) with respect to h and set the derivative equal to 0, we see that the asymptotically optimal bandwidth is
$$h_* = \left(\frac{c_2}{c_1^2 A(f)\,n}\right)^{1/5} \qquad (32)$$
where $c_1 = \int x^2 K(x)\,dx$, $c_2 = \int K^2(x)\,dx$ and $A(f) = \int\big(f''(x)\big)^2\,dx$. This is informative because it tells us that the best bandwidth decreases at rate $n^{-1/5}$. Plugging $h_*$ into (31), we see that if the optimal bandwidth is used then $R = O(n^{-4/5})$.
6 High Dimensions
The rate of convergence n−2β/(2β+d) is slow when the dimension d is large. In this case it is
hopeless to try to estimate the true density p precisely in the L2 norm (or any similar norm).
We need to change our notion of what it means to estimate p in a high-dimensional problem.
Instead of estimating p precisely we have to settle for finding an adequate approximation
of p. Any estimator that finds the regions where p puts large amounts of mass should be
considered an adequate approximation. Let us consider a few ways to implement this type
of thinking.
One approach is to estimate a smoothed version of p. Define
$$P_h = P \oplus K_h$$
where ⊕ denotes convolution$^2$ and $K_h$ is the distribution with density $h^{-d}K(\|u\|/h)$. In other words, if $X \sim P_h$ then $X = Y + Z$ where $Y \sim P$ and $Z \sim K_h$. This is just another way to say that $P_h$ is a blurred or smoothed version of P. Its density $p_h$ need not be close in L2 to p but could still preserve most of the important shape information about p. Consider then choosing a fixed h > 0 and estimating $p_h$ instead of p. This corresponds to ignoring the bias in the density estimator. From Theorem 8 we conclude:
Theorem 15 Let h > 0 be fixed. Then $P\big(\|\hat p_h - p_h\|_\infty > \epsilon\big) \le Ce^{-nc\epsilon^2}$. Hence,
$$\|\hat p_h - p_h\|_\infty = O_P\left(\sqrt{\frac{\log n}{n}}\right).$$
The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.
Independence Based Methods. If we can live with some bias, we can reduce the dimen-
sionality by imposing some independence assumptions. The simplest example is to treat the
components (X1 , . . . , Xd ) as if they are independent. In that case
$$p(x_1, \ldots, x_d) = \prod_{j=1}^{d} p_j(x_j)$$
and we only need to estimate each one-dimensional marginal $p_j$.
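Under this independence assumption the joint estimate is just a product of one-dimensional kernel estimates. The sketch below is ours; it uses a Gaussian kernel and a single common bandwidth h, which is a simplification (in practice each marginal would get its own bandwidth).

```python
import numpy as np

def product_kde(X, h):
    """Estimate p(x) = prod_j p_j(x_j) with a 1-d Gaussian KDE per coordinate (sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    def p_hat(x):
        x = np.asarray(x, dtype=float)
        value = 1.0
        for j in range(d):
            z = (x[j] - X[:, j]) / h
            # one-dimensional marginal estimate p_j(x_j)
            value *= np.exp(-0.5 * z ** 2).mean() / (h * np.sqrt(2 * np.pi))
        return value

    return p_hat

# usage
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
print(product_kde(X, h=0.4)([0.0, 0.0, 0.0]))
```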
An extension is to use a forest. We represent the distribution with an undirected graph. A graph with no cycles is a forest. Let E be the edges of the graph. Any density consistent with the forest can be written as
$$p(x) = \prod_{j=1}^{d} p_j(x_j)\prod_{(j,k)\in E}\frac{p_{j,k}(x_j, x_k)}{p_j(x_j)p_k(x_k)}.$$
To estimate the density we therefore only need to estimate the one- and two-dimensional marginals. But how do we find the edge set E? Some methods are discussed in Liu et al
(2011) under the name “Forest Density Estimation.” A simple approach is to connect pairs
greedily using some measure of correlation.
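As one concrete, entirely illustrative version of this greedy idea, the sketch below ranks pairs by absolute Pearson correlation and adds edges Kruskal-style while avoiding cycles; the notes do not prescribe this particular measure or algorithm.

```python
import numpy as np

def greedy_forest(X, max_edges=None):
    """Greedily build a cycle-free edge set E by absolute correlation (sketch)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    pairs = sorted(((corr[j, k], j, k) for j in range(d) for k in range(j + 1, d)),
                   reverse=True)
    parent = list(range(d))                      # union-find structure to detect cycles

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    E = []
    limit = d - 1 if max_edges is None else max_edges
    for w, j, k in pairs:
        rj, rk = find(j), find(k)
        if rj != rk:                             # adding (j, k) keeps the graph a forest
            parent[rj] = rk
            E.append((j, k))
        if len(E) >= limit:
            break
    return E

# usage: four coordinates driven by one common factor
rng = np.random.default_rng(6)
Z = rng.normal(size=(500, 1))
X = np.hstack([Z + 0.1 * rng.normal(size=(500, 1)) for _ in range(4)])
print(greedy_forest(X))
```

Given E, the density estimate plugs one- and two-dimensional kernel estimates of the marginals into the forest formula above.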
Density Trees. Ram and Gray (2011) suggest a recursive partitioning scheme similar to decision trees. They split each coordinate dyadically, in a greedy fashion. The density estimator is taken to be piecewise constant. They use an L2 risk estimator to decide when to split. The idea seems to have been re-discovered in Yand and Wong (arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising; it would be nice if there were an R package to do this and if there were more theoretical results.
7 Example
Figure 5 shows a synthetic two-dimensional data set, the cross-validation function and two
kernel density estimators. The data are 100 points generated as follows. We select a point
randomly on the unit circle and then add Normal noise with standard deviation 0.1. The first
estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation
score. The second uses twice that bandwidth. The cross-validation curve is very sharply
peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is
because cross-validation is aiming to minimize L2 error which does not guarantee that the
estimate is smooth. Also, the dataset is small so this effect is more noticeable. The estimator
with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator
is not necessarily a bad thing.
8 Derivatives
Kernel estimators can also be used to estimate the derivatives of a density.3 Let D⊗r p denote
the rth derivative p. We are using Kronecker notation. Let D⊗0 p = p, D⊗1 f is the gradient
$^3$In this section we follow Chacon and Duong (2013), Electronic Journal of Statistics, 7, 499–532.
Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation
function. Bottom left: kernel estimator based on the bandwidth that minimizes the cross-
validation score. Bottom right: kernel estimator based on the twice the bandwidth that
minimizes the cross-validation score.
Furthermore, $D^{\otimes 2}p = \mathrm{vec}\,H$ where H is the Hessian of p. We also write the rth derivative as $p^{(r)}$ when convenient.
The asymptotic mean squared error is derived in Chacon, Duong and Wand (2011) and is given by
$$\frac{1}{n}|H|^{-1/2}\,\mathrm{tr}\big((H^{-1})^{\otimes r}R(D^{\otimes r}K)\big) + \frac{m_2^2(K)}{4}\,\mathrm{tr}\big((I_{d^r}\otimes\mathrm{vec}^T H)\,R(D^{\otimes(r+2)}p)\,(I_{d^r}\otimes\mathrm{vec}\,H)\big)$$
where $R(g) = \int g(x)g^T(x)\,dx$ and $m_2(K) = \int xx^T K(x)\,dx$. The optimal H has entries of order $n^{-2/(d+2r+4)}$, which yields an asymptotic mean squared error of order $n^{-4/(d+2r+4)}$. In dimension
d = 1, the risk looks like this as a function of r:

r    risk
0    $n^{-4/5}$
1    $n^{-4/7}$
2    $n^{-4/9}$
We see that estimating derivatives is harder than estimating the density itself.
One application of this that we consider later in the course is mode-based clustering. Here,
we use density estimation to find the modes of the density. We associate clusters with these
modes. We can also test for a mode by testing if D2 p(x) < 0 at the estimated modes.
9 Unsupervised Prediction and Anomaly Detection
We can use density estimation to do unsupervised prediction and anomaly detection. The
basic idea is due to Vovk, and was developed in a statistical framework in Lei, Robins and
Wasserman (2014).
Given $Y_1, \ldots, Y_n \sim P$, we want a set $C_n = C_n(Y_1, \ldots, Y_n)$ such that
$$P(Y_{n+1}\in C_n) \ge 1 - \alpha.$$
Fix a value y. Let $A = (Y_1, \ldots, Y_n, y)$ be the augmented dataset; that is, we set $Y_{n+1} = y$. Let $\hat p_A$ be a density estimate based on A, and consider the vector $\big(\hat p_A(Y_1), \ldots, \hat p_A(Y_n), \hat p_A(y)\big)$. Think of testing $H_0: Y_{n+1} = y$.
Under $H_0$, the rank of $\hat p_A(y)$ among these values is uniformly distributed. That is, for each i,
$$P\big(\hat p_A(y) \text{ has rank } i\big) = \frac{1}{n+1}.$$
A p-value for the test is
$$\pi(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} I\big(\hat p_A(Y_i) \le \hat p_A(y)\big)$$
with $Y_{n+1} = y$. The conformal prediction set is $C_n = \{y : \pi(y) \ge \alpha\}$, which satisfies $P(Y_{n+1}\in C_n)\ge 1 - \alpha$.
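A direct sketch (ours) of the p-value π(y), using a one-dimensional Gaussian kernel estimate for $\hat p_A$ with a fixed bandwidth; small p-values flag candidate anomalies, as noted below.

```python
import numpy as np

def kde1d(data, x, h):
    """1-d Gaussian KDE evaluated at points x (sketch)."""
    z = (np.atleast_1d(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def conformal_pvalue(Y, y, h):
    """pi(y): fraction of augmented points whose density is <= the density at y."""
    A = np.append(Y, y)            # augmented data set, Y_{n+1} = y
    dens = kde1d(A, A, h)          # p_hat_A evaluated at each augmented point
    return np.mean(dens <= dens[-1])

# usage: a typical point gets a large p-value, an outlying point a small one
rng = np.random.default_rng(7)
Y = rng.normal(size=200)
print(conformal_pvalue(Y, 0.0, h=0.3), conformal_pvalue(Y, 5.0, h=0.3))
```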
Computing $C_n$ is tedious. Fortunately, Lei, Robins and Wasserman (2014) show that there is a simpler set that still has the correct coverage (but is slightly larger). The set is constructed as follows. Let $Z_i = \hat p(Y_i)$. Order these values
$$Z_{(1)} \le \cdots \le Z_{(n)}.$$
Lemma 16 We have that $C_n \subset C_n^+$ and hence
$$P(Y_{n+1}\in C_n^+) \ge 1 - \alpha.$$
Finally, we note that any Yi with a small p-value can be regarded as an outlier (anomaly).
The above method is exact. We can also use a simpler, asymptotic approach. With $Z_{(k)}$ defined above, set $\widehat C = \{y : \hat p(y) \ge t\}$ where now $t = Z_{(k)}$. From Cadre, Pelletier and Pudlo (2013) we have that
$$\sqrt{nh^d}\,\mu\big(\widehat C\,\Delta\,C\big) \stackrel{P}{\to} c$$
for some constant c, where C is the true 1 − α level set. Hence, $P(Y_{n+1}\in\widehat C) = 1 - \alpha + o_P(1)$.
Note also that the L2 loss does not make any sense. If you tried to use cross-validation, you would find that the estimated risk is minimized at h = 0.
A simple solution is to focus on estimating the smoothed density ph (x) which is well-defined
for every h > 0. More sophisticated ideas are based on topological data analysis which we
discuss later in the course.
11 Series Methods
We have emphasized kernel density estimation. There are many other density estimation
methods. Let us briefly mention a method based on basis functions. For simplicity, suppose
that $X_i \in [0, 1]$ and let $\phi_1, \phi_2, \ldots$ be an orthonormal basis for
$$\mathcal F = \Big\{ f : [0,1]\to\mathbb{R},\ \int_0^1 f^2(x)\,dx < \infty \Big\}.$$
Thus
$$\int\phi_j^2(x)\,dx = 1, \qquad \int\phi_j(x)\phi_k(x)\,dx = 0 \ \text{ for } j \ne k.$$
An example is the cosine basis:
$$\phi_0(x) = 1, \qquad \phi_j(x) = \sqrt{2}\cos(2\pi j x), \quad j = 1, 2, \ldots$$
If $p \in \mathcal F$ then
$$p(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x)$$
where $\beta_j = \int_0^1 p(x)\phi_j(x)\,dx$. An estimate of p is $\hat p(x) = \sum_{j=1}^{k}\hat\beta_j\phi_j(x)$ where
$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^{n}\phi_j(X_i).$$
The number of terms k is the smoothing parameter and can be chosen using cross-validation.
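A sketch (ours) of the cosine-series estimator; note that the truncated series can dip below zero, a known feature of series estimates.

```python
import numpy as np

def series_density(X, k):
    """Cosine-basis density estimate on [0, 1] with k non-constant terms (sketch)."""
    X = np.asarray(X, dtype=float)
    j = np.arange(1, k + 1)
    # beta_hat_j = (1/n) sum_i sqrt(2) cos(2 pi j X_i)
    beta_hat = np.sqrt(2) * np.cos(2 * np.pi * j[None, :] * X[:, None]).mean(axis=0)

    def p_hat(x):
        x = np.atleast_1d(x)
        basis = np.sqrt(2) * np.cos(2 * np.pi * j[None, :] * x[:, None])
        return 1.0 + basis @ beta_hat     # beta_0 = 1 since phi_0 = 1

    return p_hat

# usage
rng = np.random.default_rng(8)
X = rng.beta(2.0, 5.0, size=1000)
p_hat = series_density(X, k=8)
print(p_hat([0.1, 0.5, 0.9]))
```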
The risk of this estimator is
$$R(k) = E\int\big(\hat p(x) - p(x)\big)^2\,dx = \sum_{j=1}^{k}\mathrm{Var}(\hat\beta_j) + \sum_{j=k+1}^{\infty}\beta_j^2.$$
The first term is of order O(k/n). To bound the second term (the bias) one usually assumes that p lies in a Sobolev space of order q, which means that $p\in\mathcal P$ with
$$\mathcal P = \Big\{ p\in\mathcal F : p = \sum_j\beta_j\phi_j,\ \sum_{j=1}^{\infty}\beta_j^2 j^{2q} < \infty \Big\}.$$
11.1 L1 Methods
VC Classes. Let $\mathcal A$ be a class of sets with VC dimension ν. As in Section 5.2, split the data X into Y and Z, with $\mathcal P = \{\hat p_1, \ldots, \hat p_N\}$ constructed from Y. For $g\in\mathcal P$ define
$$\Delta_n(g) = \sup_{A\in\mathcal A}\left|\int_A g(x)\,dx - P_n(A)\right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n}I(Z_i\in A)$. Let $\hat p = \mathrm{argmin}_{g\in\mathcal P}\,\Delta_n(g)$.
The difficulty in implementing this idea is computing and minimizing ∆n (g). Hjort and
Walker (2001) presented a similar method which can be practically implemented when d = 1.
Yatracos Classes. Devroye and Györfi (2001) use a class of sets called a Yatracos class which leads to estimators with some remarkable properties. Let $\mathcal P = \{p_1, \ldots, p_N\}$ be a set of densities and define the Yatracos class of sets $\mathcal A = \big\{A(i, j) : i \ne j\big\}$ where $A(i, j) = \{x : p_i(x) > p_j(x)\}$. Let
$$\hat p = \mathrm{argmin}_{g\in\mathcal P}\,\Delta_n(g)$$
where
$$\Delta_n(g) = \sup_{A\in\mathcal A}\left|\int_A g(u)\,du - P_n(A)\right|$$
and $P_n(A) = n^{-1}\sum_{i=1}^{n}I(Z_i\in A)$ is the empirical measure based on a sample $Z_1, \ldots, Z_n \sim p$.
Theorem 18 The estimator $\hat p$ satisfies
$$\int|\hat p - p| \le 3\min_j\int|p_j - p| + 4\Delta \qquad (34)$$
where $\Delta = \sup_{A\in\mathcal A}\left|\int_A p - P_n(A)\right|$.

The term $\min_j\int|p_j - p|$ is like a bias while the term Δ is like the variance.
Proof. Let i be such that $\hat p = p_i$ and let s be such that $\int|p_s - p| = \min_j\int|p_j - p|$. Let $B = \{p_i > p_s\}$ and $C = \{p_s > p_i\}$. Now,
$$\int|\hat p - p| \le \int|p_s - p| + \int|p_s - p_i|. \qquad (35)$$
Now we apply this to kernel estimators. Again we split the data X into two halves $Y = (Y_1, \ldots, Y_n)$ and $Z = (Z_1, \ldots, Z_n)$. For each h let
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}K\left(\frac{\|x - Y_i\|}{h}\right).$$
Let
$$\mathcal A = \Big\{A(h, \nu) : h, \nu > 0,\ h \ne \nu\Big\}$$
where $A(h, \nu) = \{x : \hat p_h(x) > \hat p_\nu(x)\}$. Define
$$\Delta_n(g) = \sup_{A\in\mathcal A}\left|\int_A g(u)\,du - P_n(A)\right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n}I(Z_i\in A)$ is the empirical measure based on Z. Let
$$\hat p = \mathrm{argmin}_{g\in\mathcal G}\,\Delta_n(g), \qquad \mathcal G = \{\hat p_h : h > 0\}.$$
Under some regularity conditions on the kernel, we have the following result.
The proof involves showing that the terms on the right hand side of (34) are small. We refer
the reader to Devroye and Györfi (2001) for the details.
Recall that $d_{TV}(P, Q) = \sup_A|P(A) - Q(A)| = \frac{1}{2}\int|p(x) - q(x)|\,dx$ where the supremum
is over all measurable sets. The above theorem says that the estimator does well in the
total variation metric, even though the method only used the Yatracos class of sets. Finding
computationally efficient methods to implement this approach remains an open question.
12 Mixtures
Another approach to density estimation is to use mixtures. We will discuss mixture modelling
when we discuss clustering.
In some problems, X is not just high dimensional, it is infinite dimensional. For example, suppose that each $X_i$ is a curve. An immediate problem is that the concept of a density is no longer well defined. On a Euclidean space, the density p for a probability measure P is the function that satisfies $P(A) = \int_A p(u)\,d\mu(u)$ for all measurable A, where μ is Lebesgue measure. Formally, we say that p is the Radon-Nikodym derivative of P with respect to the dominating measure μ. Geometrically, we can think of p as
$$p(x) = \lim_{\epsilon\to 0}\frac{P(\|X - x\|\le\epsilon)}{V(\epsilon)}$$
where $V(\epsilon) = \epsilon^d\pi^{d/2}/\Gamma(d/2 + 1)$ is the volume of a ball of radius ε. Under appropriate conditions, these two notions of density agree. (This is the Lebesgue density theorem.)
When the outcome space $\mathcal X$ is a set of curves, there is no dominating measure and hence there is no density. Instead, we define the density geometrically by
$$q_\epsilon(x) = P\big(\xi(x, X) \le \epsilon\big)$$
for a small ε > 0, where ξ is some metric on $\mathcal X$. However, we cannot divide by $V(\epsilon)$ and let ε tend to 0, since the dimension d is infinite.

One way around this is to use a fixed ε and work with the unnormalized density $q_\epsilon$. For the purpose of finding high-density regions this may be adequate. An estimate of $q_\epsilon$ is
$$\hat q_\epsilon(x) = \frac{1}{n}\sum_{i=1}^{n} I\big(\xi(x, X_i) \le \epsilon\big).$$
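The estimate $\hat q_\epsilon$ is easy to compute once a distance ξ is available. The sketch below is ours; it uses the Hausdorff distance between point sets as one possible choice of ξ for curve-valued data.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two point sets A, B (rows are points in R^2)."""
    D = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def q_hat(x, data, eps, xi):
    """Unnormalized density estimate: fraction of observations within eps of x."""
    return np.mean([xi(x, Xi) <= eps for Xi in data])

# toy usage: each "observation" is a curve stored as an (m, 2) array of points
rng = np.random.default_rng(9)
curves = [np.cumsum(rng.normal(scale=0.1, size=(50, 2)), axis=0) for _ in range(40)]
print(q_hat(curves[0], curves, eps=1.0, xi=hausdorff))
```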
An alternative is to expand $X_i$ in a basis: $X(t) \approx \sum_{j=1}^{k}\beta_j\psi_j(t)$. A density can then be defined in terms of the $\beta_j$'s.
Example 20 Figure 6 shows the tracks (or paths) of 40 North Atlantic tropical cyclones (TC). The full dataset, consisting of 608 TCs from 1950 to 2006, is shown in Figure 7. Buchman, Lee and Schafer (2009) provide a thorough analysis of the data; we refer the reader to their paper for the full details.$^4$
Each data point— that is, each TC track— is a curve in R2 . Various questions are of
interest: Where are the tracks most common? Is the density of tracks changing over time?
Is the track related to other variables such as windspeed and pressure?
Each curve $X_i$ can be regarded as a mapping $X_i : [0, T_i]\to\mathbb{R}^2$, where $X_i(t) = (X_{i1}(t), X_{i2}(t))$ is the position of the TC at time t and $T_i$ is the lifelength of the TC. Let
$$\Gamma_i = \Big\{(X_{i1}(t), X_{i2}(t)) : 0 \le t \le T_i\Big\}$$
denote the track of the ith TC, regarded as a subset of $\mathbb{R}^2$; distances between tracks are measured with the Hausdorff distance $d_H$.
Figure 6: Paths of 40 tropical cyclones in the North Atlantic.
Figure 8 shows the 10 TCs with the highest local density and the 10 TCs with the lowest local density, using ε = 16.38. This choice of ε corresponds to the 10th percentile of the values $\{d_H(X_i, X_j) : i \ne j\}$. The high density tracks correspond to TCs in the Gulf of Mexico with short paths. The low density tracks correspond to TCs in the Atlantic with long paths.
15 Miscellanea
Another method for selecting h, which is sometimes used when p is thought to be very smooth, is the plug-in method. The idea is to take the formula for the mean squared error (equation (31)), insert a guess of $p''$, and then solve for the optimal bandwidth h. For example, if d = 1
Figure 7: Paths of 608 tropical cyclones in the North Atlantic.
Figure 8: 10 highest density paths (black) and 10 lowest density paths (blue).
and under the idealized assumption that p is a univariate Normal, this yields $h_* = 1.06\,\sigma\,n^{-1/5}$. Usually, σ is estimated by $\min\{s, Q/1.34\}$ where s is the sample standard deviation and Q
is the interquartile range.5 This choice of h∗ works well if the true density is very smooth
and is called the Normal reference rule.
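For d = 1 the Normal reference rule is a one-liner; the function below is our own sketch of it.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """h* = 1.06 * sigma_hat * n^(-1/5) with sigma_hat = min(s, Q/1.34) (sketch)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    s = X.std(ddof=1)                          # sample standard deviation
    q75, q25 = np.percentile(X, [75, 25])
    sigma_hat = min(s, (q75 - q25) / 1.34)     # robust scale estimate
    return 1.06 * sigma_hat * n ** (-1 / 5)

# usage
rng = np.random.default_rng(10)
print(normal_reference_bandwidth(rng.normal(size=500)))
```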
Since we don’t want to necessarily assume that p is very smooth, it is usually better to
estimate h using cross-validation. See Loader (1999) for an interesting comparison between
cross-validation and plugin methods.
A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h(x) for each point x. One can also use a different bandwidth h(xi ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.
Density estimation is sometimes used to find unusual observations or outliers. These are
observations for which pb(Xi ) is very small.
16 Summary
1. A commonly used nonparametric density estimator is the kernel estimator
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}K\left(\frac{\|x - X_i\|}{h}\right).$$
$^5$Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a $N(\mu, \sigma^2)$.