Density estimation in the uniform deconvolution model
Piet Groeneboom and Geurt Jongbloed
June 13, 2002
Abstract
We consider the problem of estimating a probability density function based on data that
are corrupted by noise from a uniform distribution. The (nonparametric) maximum likelihood
estimator for the corresponding distribution function is well defined. For the density function
this is not the case. We study two nonparametric estimators for this density. The first is a type
of kernel density estimate based on the empirical distribution function of the observable data.
The second is a kernel density estimate based on the MLE of the distribution function of the
unobservable (uncorrupted) data.
1 Introduction
Missing and incomplete data problems have been an important field of research in mathematical
statistics during the past decades. In particular, a lot of theory has been developed for the analysis
of right censored data and martingale methods have been very successful here. Other forms of
censoring have also been studied, like interval censored data. In the latter situation the data are
“very incomplete” in the sense that one can never observe the variable of interest directly, but one
only has information about a region or interval to which the variable of interest belongs.
In contrast with the situation for right-censored data, martingales have not been very useful for
developing theory for interval censored data and one is forced to develop theory from scratch. For
example, the familiar √n-convergence, still valid in the situation of right-censored data, generally
does not hold anymore and pointwise asymptotic normality of the maximum likelihood estimator
of the unknown distribution function will generally not hold either.
These two facts, rates of convergence slower than √n and non-normal limit distributions for maximum likelihood estimators, are characteristic of many incomplete data problems. These problems belong to the category of inverse statistical problems, see, e.g., Groeneboom (1996), and
are subject to “heavy loss of information”. A convenient way of describing the situation is by
introducing a hidden space, containing the information one would like to observe, an observation space, containing the observations one can actually observe, and a mapping from the hidden space to the observation space. We use this set-up in the sequel, in the development of asymptotic distribution theory.
We focus on one particular problem of this type: deconvolution, which poses problems of the
same nature as interval censored data. In fact, in section 3 we give an example of a situation where
the two models give the same type of maximum likelihood estimator, with the same convergence
rate and the same (non-normal) limit distribution.
In the deconvolution model one can also never observe the variable of interest directly, but only has access to the sum of the variable of interest and some “noise” added to it. We
specialize further to “uniform noise”. In fact, the present investigation has been inspired by work
on “deblurring of pictures” in Roy Choudhury (1998), where methods for recovering pictures blurred by Poisson noise are studied. Further work on the latter problem is reported in O’Sullivan
and Roy Choudhury (2001).
Another motivation for the present study has been to provide an alternative for “using EM with
early stopping”, which at present enjoys much popularity in certain circles. The trouble with the
latter procedure is that it is very hard to determine what one actually gets if one does EM with early stopping. One gets a result that depends on the distribution used as a starting point for the algorithm, and this prevents the development of distribution theory.
We show that if one wants to estimate a density in deconvolution problems, one can compute
the MLE of the distribution function (without any early stopping!) and use a convolution of a
kernel with this MLE to estimate the density. In Roy Choudhury (1998) the latter procedure is
discussed as a method for estimating the deblurred picture, where the picture plays the role of a
density. Section 5 is based on the last part of a special topics course, given by the first author
in the spring of 1998 at the University of Washington, Seattle, USA. Parts of this course are also
included as an appendix in Roy Choudhury (1998).
The type of deconvolution we study has also been called “boxcar deconvolution”, and recent
work on this model (not taking the maximum likelihood approach) can be found in Johnstone and
Raimondo (2002) and Hall et al. (2001).
2 Uniform deconvolution
Consider independent and identically distributed random variables Z1, Z2, . . . with density

    g(z) = gF(z) = ∫_R k(z − x) dF(x),
where k is a known probability density on R and F is an arbitrary distribution function. Estimating F based on a sample from gF is a statistical inverse problem in the sense that the sampling distribution is the image of the distribution of interest under a known transformation K. Estimating F can in that sense be interpreted as first estimating g and then applying ‘some inverse’ of K to obtain an estimate for F. This inverse problem, however, can also be viewed as an incomplete data
problem. The complete observations consist of independent pairs (Xi , Yi ) where Xi ∼ F and Yi ∼ k
are independent. If the complete data were observable, the distribution function F could be adequately estimated by the empirical distribution function of the Xi ’s (or better if more information
on F is available). However, a part of the data is missing. We do not observe the pairs (Xi , Yi )
but only observe the random variables Zi = Xi + Yi . The question then is: how can we estimate F
and quantities related to F based on the sample of Z’s?
In this paper we will not consider the deconvolution problem in great generality. We focus on
one specific convolution kernel: the uniform density on [0, 1). In that case, we have

    g(z) = gF(z) = ∫ k(z − x) dF(x) = ∫_{(z−1,z]} dF(x) = F(z) − F(z − 1),   z ∈ R.   (2.1)
Note that, formally, the distribution function F can be recovered from the density g by
    F(z) = Σ_{i=0}^∞ g(z − i).   (2.2)
Moreover, we assume throughout that the true distribution function F0 satisfies F0(0) = 0.
In the deblurring application considered in Roy Choudhury (1998), it is necessary to estimate the probability density f0 of X based on data from the convolution of f0 with a uniform density on (0, b) for general (known) b > 0. Although restricting the convolution kernel to the standard uniform density is quite a loss compared with the general deconvolution problem, where k is an arbitrary density, the results of this paper remain valid for nonstandard uniform deconvolution, i.e. with arbitrary (but known) b > 0. The procedure then is to first divide the data by b and to apply the techniques for the standard uniform deconvolution model of this paper, which yields an estimate of the density of the rescaled variable X/b. This estimate can then be rescaled again to give a density estimate for f0. We will not restate our results for these more general b, since the modifications are obvious.
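As a sketch of that reduction (in Python; the function and argument names are our own, and `estimator` stands for any estimator of the standard uniform deconvolution model, such as the inverse kernel estimator sketched in section 4):

```python
import numpy as np

def estimate_density_general_b(t, z, b, h, estimator):
    """Uniform(0, b) noise: rescale to the standard model, estimate, rescale back.

    If Z = X + b*U with U ~ Uniform(0, 1), then Z/b = X/b + U, so `estimator`
    (which expects standard uniform noise and arguments (t, z, h)) is applied to
    Z/b; the density of X at t is (1/b) times the density of X/b at t/b.
    The bandwidth h refers to the rescaled scale.
    """
    z_scaled = np.asarray(z, dtype=float) / b
    t_scaled = np.asarray(t, dtype=float) / b
    return estimator(t_scaled, z_scaled, h) / b
```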
Of course, in order for (2.2) to be valid, g should belong to the image of the set of distribution
functions under the convolution operator. This range is a strict subset of the class of all probability
densities. In fact, this is one of the characteristic features of inverse problems. The nonparametric
model on the hidden space (here the product measures that can be constructed from an arbitrary
distribution function and the uniform density) transforms under the mapping from the hidden to
the observation space to a model that is smaller than the usual nonparametric model one is used
to considering in the observation space. The example below, which will be used in the sequel to illustrate the various estimators, stresses the fact that applying (2.2) to densities outside the range of the convolution operator gives functions F that are not distribution functions.
Example 1. Consider the exponential model for F . This means that we assume F to belong to
the class
    F = {Fθ : θ > 0}   with   Fθ(x) = 1 − e^{−θx},   x > 0.

Then the sampling densities belong to the parametric model

    G = {gθ : θ > 0}   with   gθ(z) = 1 − e^{−θz} for z ∈ (0, 1)   and   gθ(z) = (e^θ − 1) e^{−θz} for z ≥ 1.   (2.3)
This is not one of the ‘usual parametric families’ on (0, ∞). It is important to note that any family
F of distribution functions on R results in a family of densities on R using (2.1).
Now suppose we take the class of exponential densities as model in the observation space, so
gθ(z) = θe^{−θz} in (2.3). Then we get as associated model in the hidden space

    Fθ(x) = θe^{−θx} (1 + e^θ + e^{2θ} + · · · + e^{⌊x⌋θ}).
It is clear that Fθ (0) = θ (which is not necessarily in (0, 1)) and that Fθ is initially decreasing. This
shows that the class of exponential densities is outside the range of the set of distribution functions
under the uniform convolution mapping.
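A short numerical check of this example may be helpful. The Python sketch below (the grid, the truncation point of the sum and all names are our own choices, not part of the model) confirms that (2.2) applied to gθ of (2.3) returns the exponential distribution function, whereas applied to the plain exponential density it produces a function that starts at θ and decreases on (0, 1).

```python
import numpy as np

theta = 1.5

def g_theta(z):
    """Sampling density (2.3): the Exp(theta) distribution convolved with Uniform(0,1)."""
    z = np.asarray(z, dtype=float)
    vals = np.where(z < 1.0, 1.0 - np.exp(-theta * z), (np.exp(theta) - 1.0) * np.exp(-theta * z))
    return np.where(z <= 0.0, 0.0, vals)

def invert(g, z, n_terms=50):
    """Inversion formula (2.2): F(z) = sum_{i >= 0} g(z - i) (truncated; g vanishes on (-inf, 0])."""
    return sum(g(z - i) for i in range(n_terms))

z = np.linspace(0.01, 5.0, 500)
F_rec = invert(g_theta, z)
print(np.max(np.abs(F_rec - (1.0 - np.exp(-theta * z)))))   # ~ 0: (2.2) recovers F_theta

# Applying (2.2) to the plain exponential density, which lies outside the range
# of the convolution operator, does not give a distribution function:
exp_density = lambda z: np.where(z > 0.0, theta * np.exp(-theta * z), 0.0)
F_bad = invert(exp_density, z)
print(F_bad[0])   # ~ theta = 1.5, and F_bad decreases on (0, 1)
```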
In section 3 we discuss the nonparametric maximum likelihood estimator F̂ of the distribution function F and address computational issues associated with that estimator. Section 4 is devoted
to a kernel estimator which is based on (2.2). The sampling density is estimated using a kernel
estimator and this estimate is plugged into (2.2). This estimator has the disadvantage that it is
not a probability density function due to the fact that the usual kernel density estimates are not
contained in the class of densities of the form (2.1). However, its asymptotic behavior can be
derived using the classical central limit theorem for triangular arrays. In section 5 we introduce a
natural kernel density estimator that is based on F̂ and study its asymptotic behavior.
3 Nonparametric Maximum Likelihood Estimator F̂
In this section we consider the nonparametric maximum likelihood approach to estimating the distribution function F in the uniform deconvolution model. Given data z1 < z2 < . . . < zn, the loglikelihood function on the class of distribution functions is given by

    l(F) = (1/n) Σ_{i=1}^n log gF(zi) = (1/n) Σ_{i=1}^n log(F(zi) − F(zi − 1)).
This estimator was studied in Groeneboom and Wellner (1992). Note that l(F ) ≤ 0 for all
F , so that the log likelihood function is bounded above. Note also that l depends on F only via
its values at the observed points zi and the points zi − 1. This means that when maximizing l,
attention may be restricted to discrete distribution functions having their mass concentrated at the
points zi and zi − 1. However, any maximizer of l will assign mass zero to these latter points. The
loglikelihood is increased by shifting such masses to the first zj bigger than zi −1. Finally, note that
l can be viewed as a strictly concave function on the set of all these discrete distribution functions.
All this shows that we can uniquely define the MLE F̂ as the discrete distribution function having its mass concentrated on the observed zi’s that maximizes l.
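To make this parametrization concrete, the following Python sketch (the name `loglik` and the vector representation are ours) evaluates l(F) for a candidate vector of probability masses placed at the sorted observations; F(zi) and F(zi − 1) are then simple partial sums of these masses.

```python
import numpy as np

def loglik(masses, z):
    """l(F) for a discrete F putting mass masses[j] at the j-th sorted observation."""
    z = np.sort(np.asarray(z, dtype=float))
    masses = np.asarray(masses, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(masses)))        # partial sums of the masses
    F = lambda x: cum[np.searchsorted(z, x, side="right")]  # F(x) = sum of masses at z_j <= x
    vals = F(z) - F(z - 1.0)                                # g_F(z_i) = F(z_i) - F(z_i - 1)
    return np.mean(np.log(vals)) if np.all(vals > 0) else -np.inf
```

For instance, `loglik(np.full(len(z), 1/len(z)), z)` evaluates l at the empirical distribution of the observations, which is in general not the maximizer.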
Another well-known feature in this setting is that if the support of the distribution with distribution function F is contained in (0, 1), the MLE can be computed using the fact that the loglikelihood has the same structure as the loglikelihood for current status data. If the support of F is bigger, this correspondence no longer exists and the estimator has to be computed using
iterative optimization techniques. Figure 1 shows the maximum likelihood estimator of the mixing
distribution F and the corresponding density g based on a sample of size n = 500.
The true F (mean one exponential) and g are also included in the picture.
Figure 1: Above: ML estimate of the distribution function F based on a sample of size n = 500 with
the true (exponential) distribution. Below: the corresponding estimate for the sampling density g
with the true density.
To obtain Figure 1, we used a Newton algorithm to compute F̂ , where the basic quadratic
maximizations are performed using the support reduction algorithm. This algorithm is introduced
in Groeneboom, Jongbloed and Wellner (2002), where it is also used to compute the maximum
likelihood estimator of the distribution function in the Gaussian deconvolution model. It is an algorithm that belongs to the vertex direction family of algorithms and is closely related to the
algorithm proposed in Simar (1976).
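For readers who want a simple, reproducible way to approximate F̂ without implementing the support reduction algorithm, the sketch below runs a plain EM (self-consistency) iteration over the masses on the observed points, iterated to convergence rather than stopped early. It is only an illustration, with our own naming and tolerance choices, and is not the algorithm we used for the figures.

```python
import numpy as np

def uniform_deconv_mle(z, tol=1e-10, max_iter=100_000):
    """EM / self-consistency iteration for the masses of F-hat on the observed points.

    E-step: P(X = z_j | Z = z_i) is proportional to p_j * 1{z_i - 1 < z_j <= z_i}.
    M-step: p_j is replaced by the average of these posterior probabilities over i.
    """
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    # compat[i, j] = 1 if support point z_j is compatible with observation z_i
    compat = ((z[:, None] - 1.0 < z[None, :]) & (z[None, :] <= z[:, None])).astype(float)
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        denom = compat @ p                          # g_F(z_i) = F(z_i) - F(z_i - 1)
        p_new = (compat / denom[:, None]).mean(axis=0) * p
        if np.max(np.abs(p_new - p)) < tol:
            return z, p_new
        p = p_new
    return z, p                                     # support points and fitted masses
```

The iteration increases l monotonically and, l being concave in the masses, typically converges to the maximizer, but it can take very many steps; this slowness is one reason to prefer the support reduction algorithm.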
4 Inverse kernel density estimation
There are various methods that come to mind if one wants to estimate f based on a sample from
g. For example, the convolution structure of (2.1) suggests the use of the Fourier transform to get
an estimate of f . However, this approach is not attractive since the characteristic function of the
uniform distribution has zeroes.
In this section we study an estimator that is suggested by (2.2): an inverse density estimator.
Indeed, assuming that F has a density, we get that

    f(z) = Σ_{i=0}^∞ g′(z − i),
so that a usual density estimate of g can be used to estimate f . In this section we study an inverse
kernel density estimator for f . This estimator is based on a conventional kernel estimate of g,
    gn,h(z) = ∫_R Kh(z − x) dGn(x) = (1/h) ∫_R K((z − x)/h) dGn(x).
Here h > 0 is the bandwidth of the density estimate, K is the kernel, and Gn denotes the empirical distribution function of Z1, . . . , Zn. Based on this estimate, we define
    f̃n,h(z) = Σ_{i=0}^∞ g′n,h(z − i) = (1/h^2) Σ_{i=0}^∞ ∫_R K′((z − x − i)/h) dGn(x) = Σ_{i=0}^∞ ∫_R Kh′(z − x − i) dGn(x).   (4.4)
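Written out for the sample, (4.4) is f̃n,h(t) = (1/(n h^2)) Σ_k Σ_i K′((t − Zk − i)/h), which the following Python sketch implements (function names are ours; the biweight kernel of Corollary 4.1 below is used for concreteness, and we assume h < 1 when truncating the sum over i).

```python
import numpy as np

def biweight_deriv(u):
    """K'(u) for the biweight kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, -(15.0 / 4.0) * u * (1.0 - u ** 2), 0.0)

def f_tilde(t, z, h, kprime=biweight_deriv):
    """Inverse kernel density estimate (4.4): (1/(n h^2)) sum_k sum_i K'((t - Z_k - i)/h)."""
    z = np.asarray(z, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    n_terms = int(np.ceil(t.max())) + 1        # the sum over i is finite (Z_k >= 0, h < 1)
    est = np.zeros_like(t)
    for i in range(n_terms):
        u = (t[:, None] - z[None, :] - i) / h
        est += kprime(u).mean(axis=1)
    return est / h ** 2
```

For a simulated data set like the one behind Figure 1, `f_tilde(np.linspace(0.01, 10, 400), z, 0.88)` would produce a curve like the one shown in Figure 2.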
For this estimator we have the following theorem.
Theorem 4.1 Let K be a compactly supported twice continuously differentiable kernel and let the distribution function F0 be continuously differentiable at the point t ∈ (0, M). Define the estimator f̃n,h(t) of the derivative f0 of F0 at t by (4.4) for t ∈ (0, M). Then:

(i) As n → ∞, h ↓ 0, and nh → ∞,

    {nh^3}^{1/2} { f̃n,h(t) − Σ_{i=0}^∞ ∫ Kh′(t − x − i) dG0(x) } →_D N(0, σ^2),   (4.5)

where

    σ^2 = F0(t) ∫ K′(u)^2 du.   (4.6)

(ii) Suppose that f0 is twice differentiable at t. Then, for h ↓ 0,

    h^{−2} ( Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz − f0(t) ) → (1/2) f0′′(t) ∫ u^2 K(u) du =: β.   (4.7)

(iii) If hn ∼ c · n^{−1/7}, for some c > 0, as n → ∞,

    n^{2/7} { f̃n,h(t) − f0(t) } →_D N(βc^2, c^{−3}σ^2),   n → ∞,   (4.8)

where σ^2 is as in (4.6), and β as in (4.7).
Proof. For (i) write
    {nh^3}^{1/2} { f̃n,h(t) − Σ_{i=0}^∞ ∫ Kh′(t − z − i) dG0(z) } = Σ_{k=1}^n ξnk

with

    ξnk = (h^3/n)^{1/2} { Σ_{i=0}^∞ Kh′(t − Zk − i) − Σ_{i=0}^∞ ∫ Kh′(t − z − i) dG0(z) }.
Note that the infinite sum is in fact a finite sum due to the fact that F0 (0) = 0. By the central
limit theorem for triangular arrays (theorem 7.2 in Billingsley (1968)), we have that
    s_n^{−1} Σ_{k=1}^n ξnk →_D N(0, 1),
where
    s_n^2 = Σ_{k=1}^n var(ξnk) = n E ξ_{n1}^2 = h^3 var( Σ_{i=0}^∞ Kh′(t − Z1 − i) )

        = h^3 ∫_0^∞ ( Σ_{i=0}^∞ Kh′(t − z − i) )^2 g0(z) dz − h^3 ( ∫_0^∞ Σ_{i=0}^∞ Kh′(t − z − i) g0(z) dz )^2

        = I_{1,n} − I_{2,n}^2.
Because K has compact support, for h sufficiently small, the first term can be written as
    I_{1,n} = h^3 Σ_{i=0}^∞ ∫_0^∞ Kh′(t − z − i)^2 g0(z) dz = Σ_{i=0}^∞ ∫ K′(u)^2 g0(t − i − hu) du

        = ∫ K′(u)^2 F0(t − hu) du → F0(t) ∫ K′(u)^2 du,   for h ↓ 0,
by dominated convergence. For the second term we have that
    I_{2,n} = h^{3/2} Σ_{i=0}^∞ ∫_0^∞ Kh′(t − z − i) g0(z) dz = h^{3/2} h^{−2} Σ_{i=0}^∞ ∫ K′((t − z − i)/h) g0(z) dz = √h ∫_R K′(u) F0(t − hu) du,
which is of smaller order than the first term. Hence, s_n^2 → F0(t) ∫ K′(u)^2 du, yielding (i). Finally, note that
    P(|ξnk| > εs_n) ≤ var(ξnk)/(ε^2 s_n^2) = 1/(ε^2 n)   and   ξ_{nk}^2 ≤ c/(nh)

for a c > 0. Hence,

    (n/s_n^2) E[ ξ_{nk}^2 1_{[|ξnk| > εs_n]} ] ≤ (n/s_n^2) · (c/(nh)) · (1/(ε^2 n)) = (c/(ε^2 s_n^2)) · (1/(nh)) → 0

as nh → ∞.
For (ii), note that

    Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz = h^{−2} Σ_{i=0}^∞ ∫ K′((t − z − i)/h) g0(z) dz

        = h^{−1} Σ_{i=0}^∞ ∫ K′(u) g0(t − i − hu) du = h^{−1} ∫ K′(u) F0(t − hu) du

        = h^{−1} ∫ K′(u) ( F0(t) − hu f0(t) + (1/2) h^2 u^2 f0′(t) − (1/6) h^3 u^3 f0′′(t − hζ_u) ) du

        = 0 + f0(t) + 0 + (1/2) h^2 f0′′(t) ∫ u^2 K(u) du + o(h^2),   h ↓ 0.

Here we use integration by parts and the fact that K is a smooth, symmetric probability density with mean zero.
Now, for (iii) note that

    n^{2/7} ( f̃n,h(t) − f0(t) ) = c^{−3/2} (nh^3)^{1/2} { f̃n,h(t) − Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz }

        + n^{2/7} { Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz − f0(t) } →_D N(βc^2, c^{−3}σ^2),

as n → ∞ by (i) and (ii).   ✷
Corollary 4.1 To estimate f(t) with kernel estimator (4.4), the asymptotic MSE optimal bandwidth is given by

    h = hn = c_opt · n^{−1/7}   with   c_opt = ( 3σ^2 / (4β^2) )^{1/7},

with σ^2 and β as in (4.6) and (4.7). For the biweight kernel K(u) = (15/16)(1 − u^2)^2 1_{[−1,1]}(u) this gives

    σ^2 = (15/7) F(t0)   and   β = (4/21) f′′(t0)   ⇒   c_opt = ( 2835 F(t0) / (64 f′′(t0)^2) )^{1/7} ≈ 1.72 ( F(t0)/f′′(t0)^2 )^{1/7}.

Proof: For c_opt we minimize the sum of the squared bias and the variance given in Theorem 4.1 (iii) as a function of c.   ✷
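As a numerical check (a Python sketch with our own function names), plugging the mean-one exponential of the simulation example, F(t0) = 1 − e^{−t0} and f′′(t0) = e^{−t0}, with n = 500 into the corollary reproduces the bandwidth used for the figures and the t0-dependence 0.71(e^{2 t0} − e^{t0})^{1/7} quoted below.

```python
import numpy as np

def c_opt_biweight(F_t0, fpp_t0):
    """c_opt of Corollary 4.1 for the biweight kernel: (2835 F(t0) / (64 f''(t0)^2))^(1/7)."""
    return (2835.0 * F_t0 / (64.0 * fpp_t0 ** 2)) ** (1.0 / 7.0)

# mean-one exponential of the simulation example: F(t0) = 1 - exp(-t0), f''(t0) = exp(-t0)
t0, n = 1.0, 500
h = c_opt_biweight(1.0 - np.exp(-t0), np.exp(-t0)) * n ** (-1.0 / 7.0)
print(round(h, 2))   # 0.88
```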
Figure 2 shows f̃n,h based on the same data set as Figure 1. The choice of K is the biweight kernel and the bandwidth is h = 0.88 (asymptotically MSE optimal for t0 = 1). Note that the optimal bandwidth in our simulation example (exponential F0 and n = 500) depends on t0 as t0 ↦ 0.71 (e^{2 t0} − e^{t0})^{1/7}. This is an increasing function, explaining to some extent that the density estimate in Figure 2 seems oversmoothed near zero and undersmoothed in the right tail. Of course, in practical situations one could first estimate F0 and f0′′ using a pilot bandwidth and then compute a density estimate with varying bandwidth based on this pilot estimate. Also, one could apply boundary kernels to get rid of the obvious boundary problems of the estimates at zero. Boundary kernels in the spirit of Wand and Jones (1995) have a discontinuity at zero. In the current context this would be a problem since we need the density estimate to be differentiable. Although the estimate of Figure 2 could be improved near the boundary, we will not pursue this here.
Figure 2: The inverse kernel estimate f̃n,h based on a sample of size 500, using the biweight kernel and bandwidth h = 0.88.
5 Density estimation based on F̂
We want to study the asymptotic behavior of the estimator

    f̂n,h(t) = ∫ Kh(t − y) dF̂n(y)

of the derivative f0(t) at t of the distribution function F0 of the distribution we want to estimate, where F̂n is the MLE of this distribution function F0 as defined in section 3. Figure 3 shows this estimate based on the same data as Figures 1 and 2. The estimate f̂n,h has the same undesirable behavior near zero as f̃n,h. In this case it is straightforward to implement boundary kernels in the spirit of section 2.11 in Wand and Jones (1995). The resulting estimate (which coincides with f̂n,h on [h, ∞)) is shown in Figure 4.
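Since F̂n is a discrete distribution function, with mass p̂j at its jump point τj, the estimator is just a kernel-weighted sum of those masses, f̂n,h(t) = Σ_j p̂j Kh(t − τj). A minimal Python sketch (our naming; the support points and masses could, for instance, come from the EM sketch of section 3, and the biweight kernel is assumed):

```python
import numpy as np

def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)

def f_hat(t, support, masses, h, kernel=biweight):
    """Smoothed MLE: f-hat_{n,h}(t) = sum_j p_j * K_h(t - tau_j)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    support = np.asarray(support, dtype=float)
    masses = np.asarray(masses, dtype=float)
    u = (t[:, None] - support[None, :]) / h
    return (kernel(u) * masses[None, :]).sum(axis=1) / h
```

In contrast to f̃n,h, the result is nonnegative and integrates to one over the whole line; the mass that spills over to the negative half-line near zero is exactly the boundary problem repaired in Figure 4.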
We impose the following condition on F0:

(F) The distribution corresponding to the distribution function F0 has support [0, M], and has a continuous derivative f0 which satisfies inf_{u∈(0,M)} f0(u) > 0.

Note that condition (F) implies that for any δ ∈ (0, M ∧ 1):

    inf_{x∈[δ,M−δ]} {F0(x) − F0(x − 1)} ∧ {F0(x + 1) − F0(x)} > 0.   (5.9)
For the kernel function we assume

(K) Kh(u) = h^{−1} K(u/h), where K is a fixed non-negative twice differentiable kernel with support [−1, 1] and bounded second derivative, that integrates to 1 on [−1, 1] and is symmetric around zero.
For our pictures we used the biweight kernel, as defined in Corollary 4.1. This kernel satisfies (K).
We show below that under these conditions the following result holds:
Figure 3: The density estimate fˆn,h based on a sample of size 500, using the biweight kernel and
h = 0.88.
Figure 4: The smoothed MLE density estimate based on the sample of size 500 using boundary
kernels. Again, h = 0.88 and the kernel used is the biweight.
Theorem 5.1 Let K satisfy condition (K) and let the distribution function F0 satisfy condition (F). Let t be a fixed point in the open interval (0, M), such that t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1), and let the estimator f̂n,h(t) of the derivative f0 of F0 at t be defined by

    f̂n,h(t) = ∫ Kh(t − x) dF̂n(x),   for t ∈ (0, M).

Moreover, let m be the largest integer < M + 1. Then:

(i) As n → ∞, h ↓ 0, and nh^3 → ∞,

    {nh^3}^{1/2} { f̂n,h(t) − ∫ Kh(t − x) dF0(x) } →_D N(0, σ^2),   (5.10)

where

    σ^2 = lim_{h↓0} h^3 ∫ θh,t,F0^2 dG0 > 0,   (5.11)

for a function θh,t,F0, defined by

    θh,t,F0(x + k) = Σ_{i=0}^m {1 − F0(x + i)} Kh′(t − (x + i)),   if x ∈ [0, 1], k = 0,
    θh,t,F0(x + k) = −Σ_{i=0}^{k−1} Kh′(t − (x + i)) + θh,t,F0(x),   if x ∈ [0, 1], k = 1, . . . , m.   (5.12)

(ii) Suppose that f0 is twice differentiable at t. Then, for h ↓ 0,

    h^{−2} ( ∫ Kh(t − x) dF0(x) − f0(t) ) → (1/2) f0′′(t) ∫ u^2 K(u) du =: β.   (5.13)

(iii) If hn ∼ c · n^{−1/7}, for some c > 0, as n → ∞,

    n^{2/7} { f̂n,h(t) − f0(t) } →_D N(βc^2, c^{−3}σ^2),   n → ∞,   (5.14)

where σ^2 is as in (5.11), and β as in (5.13).
Remark. The statements of Theorem 5.1 resemble those of Theorem 4.1. The asymptotic bias is
exactly the same for both estimators.
The definition (5.12) of θh,t,F0 in part (i) is based on a definition given in Example 11.2.3e on
p. 230 of van de Geer (2000). If M ≤ 1, it can be seen that this is given by
    θh,t,F0(x) = (1 − F0(x)) Kh′(t − x) − F0(x − 1) Kh′(t − x + 1)

if (t − h, t + h) ⊂ (0, M). This means that σ^2 reduces to:

    σ^2 = F0(t){1 − F0(t)} ∫ K′(u)^2 du.   (5.15)

In this case the asymptotic variances for the two estimators depend on the kernel K in the same way, but the dependence on F0 shows that f̂n,h is more efficient than f̃n,h:

    F0(t){1 − F0(t)} < F0(t),   t ∈ (0, M),
under condition (F) at the beginning of this section.
Apart from the pointwise asymptotic properties of fˆn,h , which we think are overall better than
those of f˜n,h , another reason to prefer fˆn,h is that fˆn,h is a genuine probability density function. In
contrast to f˜n,h, it only takes nonnegative values.
The assumption that t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1) is somewhat unpleasant and probably not needed, but we saw no way to avoid this in the present proof.
The proof of Theorem 5.1 is much more involved than that of Theorem 4.1 and will be given below
in a sequence of steps. The “hidden probability space” contains random variables (X, Y ), where
X has distribution function F0, Y has a Uniform(0, 1)-distribution, and X and Y are independent. The mapping relating the random variables in the hidden space to the random variables in the observation space is

    T(x, y) = x + y,   x ∈ [0, M], y ∈ [0, 1].
We will need some facts about score operators for this particular model, given in, e.g., Groeneboom and Wellner (1992), Chapter 1. Let L2(F0) denote the space of square integrable functions w.r.t. the measure dF0, that is: the space of Borel measurable functions h : R → R such that ∫ h^2 dF0 < ∞. Moreover, the space L2^0(F0) is the subset of functions h ∈ L2(F0) satisfying ∫ h dF0 = 0.
The score operator LF0, mapping scores a ∈ L2^0(F0), associated with a Hellinger differentiable path (Fτ)_{τ∈[0,δ)} tending to F0, to scores ā in the observation space, is given by the conditional expectation

    [LF0(a)](z) = EF0{ a(X) | T(X, Y) = z }.

Its adjoint on the space of scores contained in L2^0(G0), where G0 is defined by the convolution density gF0, is given by the conditional expectation

    [L*(ā)](x) = EF0{ ā(Z) | X = x } = ∫_x^{x+1} ā(z) dz.

Note that the adjoint operator L* does not involve the distribution function F0, in contrast with the score operator LF0 itself (of which it is the adjoint).
The guiding principle in finding the asymptotic distribution of f̂n,h(t) is to try to find a solution θh,t,F to the equation

    L* θh,t,F = Kh(t − ·) − ∫ Kh(t − x) dF(x),

where θh,t,F is a score in L2^0(GF), belonging to the (closure of the) range of LF, and GF denotes the distribution with density

    gF(z) = F(z) − F(z − 1).

The next step is then to prove that

    ∫ Kh(t − x) d(F̂n − F0)(x) ∼ ∫ θh,t,F0(z) d(Gn − G0)(z),   (5.16)

in the sense that the right-hand side of (5.16) represents the left-hand side up to terms that are of lower order, as n → ∞, and h = hn ↓ 0 (at a certain rate that will be specified below). By the central limit theorem, the right-hand side of (5.16), divided by

    σh,t = { ∫ θh,t,F0(z)^2 dG0(z) }^{1/2},

is asymptotically standard normal (again under some conditions on the rate at which h tends to zero as a function of n).
Note that, since the convolution kernel is the Uniform(0, 1) density, we have:

    [LF(a)](z) = ∫_{z−1}^z a(x) dF(x) / {F(z) − F(z − 1)}.
Defining

    θ(z) = ∫_{z−1}^z a(x) dF(x) / {F(z) − F(z − 1)},   z ∈ R,   (5.17)

where the ratio is defined to be zero if the denominator is zero, we have to solve the equation

    ∫_x^{x+1} θ(z) dz = Kh(t − x) − ∫ Kh(t − y) dF(y),

for x in the interior of the support of F, where θ will in fact depend on t, h and F. Differentiation w.r.t. x yields the equation

    θ(x + 1) − θ(x) = −Kh′(t − x),   (5.18)

if θ is continuous at the points x and x + 1.
Let the function φ be defined by

    φ(x) = ∫_0^x a(u) dF(u)   if x ≥ 0,   and   φ(x) = 0   otherwise.   (5.19)

Then, by (5.18), φ should satisfy

    {φ(x + 1) − φ(x)} / {F(x + 1) − F(x)} − {φ(x) − φ(x − 1)} / {F(x) − F(x − 1)} = −Kh′(t − x),   (5.20)

if x and x + 1 are points of continuity of θ. Conversely, if φ satisfies (5.20), then θ satisfies (5.18).

The following lemma is one of the stepping stones in achieving our goal (compare to Lemma 3.3 in Groeneboom (1996)).
Lemma 5.1 Let the kernel K satisfy condition (K) above. Furthermore, let t ∈ (0, M) and h ∈ (0, 1/2) satisfy (t − h, t + h) ⊂ (0, M), and let Kh′(t − x) = (d/du) Kh(u)|_{u=t−x}, where Kh(u) = h^{−1} K(u/h). Then we have:

(i) Suppose that the distribution function F on [0, ∞) satisfies F(M + 1) = 1 and (5.9) for a δ ∈ (0, (M ∧ 1)/2) (not necessarily for all δ ∈ (0, M ∧ 1)). Then the equation

    {F(x + 1) − F(x − 1)}ψ(x) − {F(x + 2) − F(x + 1)}ψ(x + 1) − {F(x − 1) − F(x − 2)}ψ(x − 1) = Kh′(t − x),   x ∈ R,   (5.21)

under the side condition ψ(x) = 0, x ∉ [0, M + 1], has a bounded solution ψh,t,F.

(ii) Let F be as in condition (i), and let the distribution function F0 satisfy condition (F) above. Moreover, let ψh,t,F be defined as in (i), and let θh,t,F be defined by

    θh,t,F(x) = {F(x + 1) − F(x)}ψ(x) − {F(x − 1) − F(x − 2)}ψ(x − 1),   x ∈ R.   (5.22)

Then

    ∫ θh,t,F(x) gF(x) dx = 0,

and

    ∫ Kh(t − y) d(F − F0)(y) = − ∫ θh,t,F(z) dG0(z).   (5.23)
Remark. The function φh,t,F, defined by

    φh,t,F(x) = {F(x) − F(x − 1)}{F(x + 1) − F(x)} ψh,t,F(x),   x ∈ R,   (5.24)
satisfies (5.20), if the denominators in (5.20) are strictly positive. The introduction of the function
ψh,t,F (x) is needed to cover the situation that the denominators are zero. If condition (F) is satisfied
(as is true for the distribution function F0 ), the introduction of the function ψh,t,F0 is not necessary
in the definition of θh,t,F0 . We will however use the function ψh,t,F in the case that F equals the
MLE F̂n . In particular we will need the fact that φh,t,F̂ (x) contains the factors F̂n (x) − F̂n (x − 1)
and F̂n (x + 1) − F̂n (x), as is clear from its representation (5.24) in terms of ψh,t,F̂ (x).
Proof of Lemma 5.1. Fix x ∈ [0, 1) and let m be the largest integer such that x + m ≤ M + 1 and F(x + m − 1) < 1. We define, for j = 0, . . . , m,

    α(x + j) = F(x + j) − F(x + j − 1).   (5.25)

Moreover, we define, for j = 0, . . . , m,

    β(x + j) = F(x + j + 1) − F(x + j − 1)   (5.26)

and

    µ(x + j) = Kh′(t − x − j).   (5.27)

Note that β(x + j) > 0, for all j, 0 ≤ j ≤ m, again by the positivity condition on the differences F(x) − F(x − 1) on [δ, M − δ].
We now get the matrix equation
    A(x)ψ(x) = µ(x),

where A(x) is the tridiagonal matrix

    A(x) =
        [  β(x)      −α(x+2)      0          0        · · ·      0             0       ]
        [ −α(x)       β(x+1)     −α(x+3)     0        · · ·      0             0       ]
        [   0        −α(x+1)      β(x+2)    −α(x+4)   · · ·      0             0       ]
        [   ⋮           ⋮            ⋮          ⋮        ⋱          ⋮             ⋮       ]
        [   0           0            0          0      · · ·  −α(x+m−1)      β(x+m)    ]

and

    ψ(x) = (ψ(x), . . . , ψ(x + m))′,   µ(x) = (µ(x), . . . , µ(x + m))′.
As is well-known, this system of equations can be solved in the following way. Let bj and uj, j = 0, . . . , m, be recursively defined by

    b0 = β(x),   u0 = µ(x)/b0,   (5.28)

and

    bj = β(x + j) − α(x + j − 1)α(x + j + 1)/bj−1,   uj = {µ(x + j) + α(x + j − 1) uj−1}/bj,   (5.29)

for j = 1, . . . , m. Then the solution of the system can be found by the following back-substitution:

    ψ(x + m) = um,   (5.30)

    ψ(x + j) = uj + α(x + j + 2) ψ(x + j + 1)/bj,   j = m − 1, . . . , 0.   (5.31)
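This is the standard forward elimination and back-substitution for a tridiagonal system. A small Python sketch, under our own naming conventions (alpha[j] = α(x + j), beta[j] = β(x + j), mu[j] = µ(x + j)); note that the entries α(x + j + 1) and α(x + j + 2) appearing in (5.29) and (5.31) require α up to index m + 1:

```python
import numpy as np

def solve_psi(alpha, beta, mu):
    """Solve the tridiagonal system A(x) psi = mu by (5.28)-(5.31).

    Row j of A(x) reads -alpha[j-1]*psi[j-1] + beta[j]*psi[j] - alpha[j+2]*psi[j+1] = mu[j],
    so alpha must be supplied up to index m + 1 (length m + 2).
    """
    m = len(beta) - 1
    b = np.empty(m + 1)
    u = np.empty(m + 1)
    b[0], u[0] = beta[0], mu[0] / beta[0]
    for j in range(1, m + 1):                   # forward elimination, cf. (5.29)
        b[j] = beta[j] - alpha[j - 1] * alpha[j + 1] / b[j - 1]
        u[j] = (mu[j] + alpha[j - 1] * u[j - 1]) / b[j]
    psi = np.empty(m + 1)
    psi[m] = u[m]                               # back-substitution, cf. (5.30)-(5.31)
    for j in range(m - 1, -1, -1):
        psi[j] = u[j] + alpha[j + 2] * psi[j + 1] / b[j]
    return psi
```

A solution can be verified by checking the residual of A(x)ψ = µ row by row.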
We have, if m > 1,
b1 = F (x + 2) − F (x) − {F (x + 2) − F (x + 1)}{F (x) − F (x − 1)}/{F (x + 1) − F (x − 1)}
> F (x + 2) − F (x) − {F (x + 2) − F (x + 1)} = F (x + 1) − F (x) > 0
since 0 ≤ {F (x) − F (x − 1)}/{F (x + 1) − F (x − 1)} < 1 and F (x + 2) − F (x) > F (x + 1) − F (x).
Continuing in this way, we find inductively that bj > 0, 0 ≤ j < m.
Moreover, by the positivity condition on the differences F (x) − F (x − 1) on [δ, M + 1 − δ], the
coefficients bj are bounded away from zero, for any choice of x ∈ [0, 1) and 0 ≤ j < m. The
solution ψ is therefore bounded. The function φ, defined by
φ(x) = {F (x) − F (x − 1)}{F (x + 1) − F (x)}ψ(x)
is therefore also bounded.
We now have, by (5.22),

    θh,t,F(x + 1) − θh,t,F(x) = −Kh′(t − x),   x ∈ [0, M + 1].

Furthermore, denoting ψh,t,F by ψ for convenience of notation, we get

    ∫ θh,t,F(z) gF(z) dz = ∫ ψ(z){F(z + 1) − F(z)}{F(z) − F(z − 1)} dz − ∫ ψ(z − 1){F(z − 1) − F(z − 2)}{F(z) − F(z − 1)} dz

        = ∫ ψ(z){F(z + 1) − F(z)}{F(z) − F(z − 1)} dz − ∫ ψ(z){F(z) − F(z − 1)}{F(z + 1) − F(z)} dz

        = 0.
Hence

    ∫ θh,t,F(z) dG0(z) = ∫ θh,t,F(z) g0(z) dz = ∫ θh,t,F(z) {g0(z) − gF(z)} dz

        = ∫ θh,t,F(z) {F0(z) − F0(z − 1) − (F(z) − F(z − 1))} dz

        = − ∫ {θh,t,F(z + 1) − θh,t,F(z)} {F0(z) − F(z)} dz

        = ∫ Kh′(t − z) {F0(z) − F(z)} dz = ∫ Kh(t − z) d(F0 − F)(z).   ✷
Since F0 satisfies condition (F), and since (5.24), with F = F0 , implies that φh,t,F0 is bounded
and zero outside (0, M ), we have that φh,t,F0 is absolutely continuous w.r.t. F0 on the interval
(0, M ). So we can define ah,t,F0 as the Radon-Nikodym derivative of the function φh,t,F0 w.r.t. F0
on (0, M ). The condition on the density f0 also ensures that this Radon-Nikodym derivative is
almost surely bounded on (0, M ) and hence ah,t,F0 ∈ L2 (F0 ), where we use (5.23) of part (ii) of
Lemma 5.1.
Note that it was shown in the proof of Lemma 5.1 that, under the conditions of this lemma, we
have the following relation
    ∫ θh,t,F̂n dG0 = ∫ θh,t,F̂n(z) { F0(z) − F0(z − 1) − (F̂n(z) − F̂n(z − 1)) } dz.   (5.32)
In the proof of Theorem 5.1 we also need the following lemma.
Lemma 5.2 Let condition (F) be satisfied. Then we have for each deterministic sequence of points (tn), converging to a fixed point t ∈ (0, M), and for each c > 0:

    sup_{u∈[−c,c]} | F̂n(tn + n^{−1/3} u) − F0(tn) | = Op(n^{−1/3}).   (5.33)
The proof of Lemma 5.2 proceeds along similar lines as the proof of Lemma 4.6 on p. 155 of
Groeneboom (1996). It is omitted here for reasons of space. It shows that the MLE has so-called
“cube root n” convergence, a property it shares with the MLE in the case of interval censoring.
Proof of Theorem 5.1. By the consistency of F̂n, we may assume that condition (5.9) is satisfied for a δ ∈ (0, (M ∧ 1)/2), and n sufficiently large, and therefore that the equation (5.21) has a solution ψh,t,F̂n for F = F̂n. Hence we can define φh,t,F̂n by (5.24).
Let τ0 = 0, τm+1 = M + 1 and τ1 < . . . < τm be the points of jump of F̂n , and let Ji denote the
interval [τi , τi+1 ), for 0 ≤ i < m. The interval Jm will denote the closed interval [τm , τm+1 ]. Let
the function ξh,t,F be defined by

    ξh,t,F(x) = φh,t,F(x) / ( F(x){1 − F(x)} )   if 0 < F(x) < 1,   and   ξh,t,F(x) = 0   otherwise.
Definition (5.24) implies that the function ξh,t,F is bounded if F satisfies the conditions of Lemma
5.1.
We define a piecewise constant version of ξh,t,F0, constant on the intervals Ji, by

    ξ̄h,t,F̂n(x) = ξh,t,F0(τi),        if x ∈ Ji and F0(x) > F̂n(τi) for all x ∈ Ji,
    ξ̄h,t,F̂n(x) = ξh,t,F0(τi+1 −),    if x ∈ Ji and F0(x) < F̂n(τi) for all x ∈ Ji,
    ξ̄h,t,F̂n(x) = ξh,t,F0(u),         if x ∈ Ji and F0(u) = F̂n(u) for a point u ∈ Ji,
for x ∈ Ji . We note that a construction of this type was introduced under III on p. 213 of Geskus
and Groeneboom (1997). We here use the version in Lemma 11.10 of van de Geer (2000), where
instead of a piecewise constant version of ξh,t,F̂n a piecewise constant version of ξh,t,F0 is used.
We next define the piecewise constant function φ̄h,t,F̂n by

    φ̄h,t,F̂n(x) = F̂n(x){1 − F̂n(x)} ξ̄h,t,F̂n(x),   x ∈ R,
and the piecewise constant function θ̄h,t,F̂n by

    θ̄h,t,F̂n(x) = { φ̄h,t,F̂n(x) − φ̄h,t,F̂n(x − 1) } / { F̂n(x) − F̂n(x − 1) }   if F̂n(x) > F̂n(x − 1),   and   θ̄h,t,F̂n(x) = 0   otherwise.

Note that, by Lemma 5.2 and the assumption t ≢ 0 (mod 1) and t ≢ M (mod 1), we may assume that F̂n(x) − F̂n(x − 1) stays away from zero for all large n, if φ̄h,t,F̂n(x) ≠ 0 and/or φ̄h,t,F̂n(x − 1) ≠ 0, implying that the function θ̄h,t,F̂n is bounded.
We have:

    ∫ { θ̄h,t,F̂n − θh,t,F̂n } dG0 = ∫ { θ̄h,t,F̂n(z) − θh,t,F̂n(z) } { F0(z) − F0(z − 1) − (F̂n(z) − F̂n(z − 1)) } dz.   (5.34)
This follows in the same way as (5.32).
We now show that this implies:

    ∫ θh,t,F̂n dG0 = ∫ θ̄h,t,F̂n dG0 + op( n^{−2/3} h^{−2} ).   (5.35)
Lemma 11.10 in van de Geer (2000) implies that there exists a constant c > 0 such that

    | ξ̄h,t,F̂n(x) − ξh,t,F0(x) | ≤ c h^{−3} | F̂n(x) − F0(x) |,   for all x ∈ R,

where the factor h^{−3} arises from the second derivative of Kh, which appears in dξh,t,F0/dF0, using the notation in (11.61) of Lemma 11.10 in van de Geer (2000). The function θh,t,F0 is defined in terms of the function φh,t,F0, with support consisting of intervals [t − h + k, t + h + k] which, by the assumption t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1), are strictly contained in the open interval (0, M), if h is sufficiently small.
Lemma 5.2 then implies that the closest points of jump of F̂n to the left or right of these intervals (finite in number) have a distance Op(n^{−1/3}) to the endpoints of these intervals. This means, by the condition nh^3 → ∞, that

    | ξ̄h,t,F̂n(x) − ξh,t,F0(x) | ≤ c h^{−3} | F̂n(x) − F0(x) |,

for x in a finite number of intervals of order h, shrinking to points t + k, for integers k such that t + k ∈ (0, M), and that

    ξ̄h,t,F̂n(x) − ξh,t,F0(x) = 0
outside these intervals. But this, in turn, implies by (5.34), the Cauchy-Schwarz inequality and the fact that we may assume that the differences F̂n(x) − F̂n(x ± 1) and F0(x) − F0(x ± 1) stay bounded away from zero on these intervals:

    ∫ { θ̄h,t,F̂n(x) − θh,t,F0(x) } dG0(x) = Op( h^{−3} dh,n(F̂n, F0)^2 ) = Op( h^{−2} n^{−2/3} ),   (5.36)

where dh,n(F̂n, F0)^2 is defined as

    dh,n(F̂n, F0)^2 = Σ_{j∈Z} ∫_{(t−2h+j)∨0}^{(t+2h+j)∧M} { F̂n(x) − F0(x) }^2 dx = Op( h n^{−2/3} ).
Note that the infinite sum is in fact a sum with a (uniformly) bounded number of non-zero terms. The property

    Σ_{j∈Z} ∫_{(t−2h+j)∨0}^{(t+2h+j)∧M} { F̂n(x) − F0(x) }^2 dx = Op( h n^{−2/3} )

follows from the fact that, by the assumption t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1), the intervals [(t − 2h + j) ∨ 0, (t + 2h + j) ∧ M] are strictly contained in (0, M) and the fact that the processes

    x ↦ F̂n(x) − F0(x),   x ∈ [(t − 2h + j) ∨ 0, (t + 2h + j) ∧ M],

have a completely similar probabilistic behavior, together with the property

    ‖F̂n − F0‖_2^2 = Op( n^{−2/3} ).
On the other hand, by the structure of the solutions ψh,t,F̂n and ψh,t,F0, there also exists a constant c2 > 0 such that

    | ψh,t,F̂n(x) − ψh,t,F0(x) | ≤ c2 h^{−2} Σ_{k∈Z} | F̂n(x + k) − F0(x + k) |,

for x belonging to an interval (t + k − h, t + k + h) ⊂ (0, M + 1), and that

    ψh,t,F̂n(x) − ψh,t,F0(x) = 0

outside these intervals. Note that for x ∈ (t + k − h, t + k + h) ⊂ (M, M + 1), we have ψh,t,F0(x) = 0 and

    φh,t,F̂n(x) / { F̂n(x) − F̂n(x − 1) } = { F̂n(x + 1) − F̂n(x) } ψh,t,F̂n(x) = { 1 − F̂n(x) } ψh,t,F̂n(x)
        = { F0(x) − F̂n(x) } ψh,t,F̂n(x) ≤ c h^{−2} | F0(x) − F̂n(x) |,

for a constant c > 0, and that we get in a similar way:

    φh,t,F̂n(x) / { F̂n(x + 1) − F̂n(x) } ≤ c h^{−2} | F0(x) − F̂n(x) |,

for a constant c > 0, if x ∈ (M, M + 1).
Hence, again by (5.34) and the Cauchy-Schwarz inequality,

    ∫ { θh,t,F̂n(x) − θh,t,F0(x) } dG0(x) = Op( h^{−2} ‖F̂n − F0‖_2^2 ) = Op( h^{−2} n^{−2/3} ).   (5.37)

Relation (5.35) now follows from (5.36) and (5.37).

We then apply, using the score equations for F̂n, implying ∫ θ̄h,t,F̂n dGn = 0,

    ∫ θ̄h,t,F̂n dG0 = ∫ θ̄h,t,F̂n d(G0 − Gn),

and

    ∫ { θ̄h,t,F̂n − θh,t,F0 } d(G0 − Gn) = Op( h^{−2} n^{−2/3} ),

where the latter result again follows from an L2-entropy argument. This shows, since nh^3 → ∞,
    ∫ Kh(t − x) d(F̂n − F0)(x) = ∫ θh,t,F0 d(Gn − G0) + op( n^{−1/2} h^{−3/2} ),
and yields part (i) of Theorem 5.1.
Part (ii) of Theorem 5.1 now follows from

    ∫ Kh(t − x) dF0(x) − f0(t) = ∫ K(u) { f0(t + hu) − f0(t) } du ∼ (1/2) h^2 f0′′(t) ∫ u^2 K(u) du   as h ↓ 0,

where we use that the kernel K integrates to 1 and is symmetric around zero. Combining (i) and (ii), using that hn ∼ c · n^{−1/7}, we get (iii). This finishes the proof of Theorem 5.1.   ✷
References
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Geer, S. van de (1996). Rates of convergence for the maximum likelihood estimator in mixture models. Journal of Nonparametric Statistics 6, 293-310.

Geer, S. van de (2000). Applications of Empirical Process Theory. Cambridge University Press, Cambridge, U.K.

Geskus, R.B. and Groeneboom, P. (1996). Asymptotically optimal estimation of smooth functionals for interval censoring, part 1. Statistica Neerlandica 50, 69-99.

Geskus, R.B. and Groeneboom, P. (1997). Asymptotically optimal estimation of smooth functionals for interval censoring, part 2. Statistica Neerlandica 51, 201-219.

Groeneboom, P. (1996). Lectures on inverse problems. In: Lectures on Probability Theory, Ecole d'Eté de Probabilités de Saint-Flour XXIV-1994, Editor: P. Bernard. Springer Verlag, Berlin.

Groeneboom, P. (1998). Nonparametric Estimation for Inverse Problems: Algorithms and Asymptotics. Special topics course 593C, Department of Statistics, University of Washington, Seattle, USA.

Groeneboom, P., Jongbloed, G. and Wellner, J.A. (2002). The support reduction algorithm for computing nonparametric function estimates. Manuscript in progress.

Groeneboom, P. and Wellner, J.A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser, Basel.

Hall, P., Ruymgaart, F., van Gaans, O. and van Rooy, A. (2001). Inverting noisy integral equations using wavelet expansions: a class of irregular convolutions. In: State of the Art in Probability and Statistics: Festschrift for Willem R. van Zwet, Vol. 30 of Lecture Notes-Monograph Series. Institute of Mathematical Statistics, pp. 251-280.

Johnstone, I.M. and Raimondo, M. (2002). Periodic boxcar deconvolution and diophantine approximation. Technical Report. http://www-stat.stanford.edu/~imj/Reports/index.html

O'Sullivan, F. and Roy Choudhury, K. (2001). An analysis of the role of positivity and mixture model constraints in Poisson deconvolution problems. J. Computational and Graphical Statistics 10, 673-696.

Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order Restricted Statistical Inference. Wiley, New York.

Roy Choudhury, K. (1998). Additive Mixture Models for Multichannel Image Data. Ph.D. Dissertation, University of Washington, Seattle.

Simar, L. (1976). Maximum likelihood estimation of a compound Poisson process. Ann. Statist. 4, 1200-1209.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall, London.