Density estimation in the uniform deconvolution model
Piet Groeneboom and Geurt Jongbloed
June 13, 2002
Abstract
We consider the problem of estimating a probability density function based on data that
are corrupted by noise from a uniform distribution. The (nonparametric) maximum likelihood
estimator for the corresponding distribution function is well defined. For the density function
this is not the case. We study two nonparametric estimators for this density. The first is a type
of kernel density estimate based on the empirical distribution function of the observable data.
The second is a kernel density estimate based on the MLE of the distribution function of the
unobservable (uncorrupted) data.
1 Introduction
Missing and incomplete data problems have been an important field of research in mathematical
statistics during the past decades. In particular, a lot of theory has been developed for the analysis
of right censored data and martingale methods have been very successful here. Other forms of
censoring have also been studied, like interval censored data. In the latter situation the data are
“very incomplete” in the sense that one can never observe the variable of interest directly, but one
only has information about a region or interval to which the variable of interest belongs.
In contrast with the situation for right-censored data, martingales have not been very useful for
developing theory for interval censored data and one is forced to develop theory from scratch. For
example, the familiar √n-convergence, still valid in the situation of right-censored data, generally
does not hold anymore and pointwise asymptotic normality of the maximum likelihood estimator
of the unknown distribution function will generally not hold either.
These two facts, rates of convergence slower than √n and non-normal limit distributions for maximum likelihood estimators, are characteristic of many incomplete data problems. These problems belong to the category of inverse statistical problems, see, e.g., Groeneboom (1996), and
are subject to “heavy loss of information”. A convenient way of describing the situation is by
introducing a hidden space, containing the information one would like to observe, an observation space, containing the observations one can actually observe, and a mapping from the hidden space to the observation space. We use this set-up in the sequel, in the development of asymptotic distribution theory.
We focus on one particular problem of this type: deconvolution, which poses problems of the
same nature as interval censored data. In fact, in section 3 we give an example of a situation where
the two models give the same type of maximum likelihood estimator, with the same convergence
rate and the same (non-normal) limit distribution.
In the deconvolution model one can also never observe the variable of interest directly, but only has access to the sum of the variable of interest and some “noise” added to it. We
specialize further to “uniform noise”. In fact, the present investigation has been inspired by work
on “deblurring of pictures” in Roy Choudhury (1998), where methods for recovering pictures blurred by Poisson noise are studied. Further work on the latter problem is reported in O’Sullivan
and Roy Choudhury (2001).
Another motivation for the present study has been to provide an alternative for “using EM with
early stopping”, which at present enjoys much popularity in certain circles. The trouble with the
latter procedure is that it is very hard to determine what one actually gets if one does EM with early stopping. One gets a result that depends on the distribution used as a starting point for the algorithm, and this prevents the development of distribution theory.
We show that if one wants to estimate a density in deconvolution problems, one can compute
the MLE of the distribution function (without any early stopping!) and use a convolution of a
kernel with this MLE to estimate the density. In Roy Choudhury (1998) the latter procedure is
discussed as a method for estimating the deblurred picture, where the picture plays the role of a
density. Section 5 is based on the last part of a special topics course, given by the first author
in the spring of 1998 at the University of Washington, Seattle, USA. Parts of this course are also
included as an appendix in Roy Choudhury (1998).
The type of deconvolution we study has also been called “boxcar deconvolution”, and recent
work on this model (not taking the maximum likelihood approach) can be found in Johnstone and
Raimondo (2002) and Hall et al. (2001).
2 Uniform deconvolution
Consider independent and identically distributed random variables Z1, Z2, . . . with density

    g(z) = gF(z) = ∫_R k(z − x) dF(x),
where k is a known probability density on R and F is an arbitrary distribution function. Estimating F based on a sample from gF is a statistical inverse problem in the sense that the sampling distribution is the image of the distribution of interest under a known transformation K. Estimating F can in that sense be interpreted as first estimating g and then applying ‘some inverse’ of K to obtain an estimate for F. This inverse problem, however, can also be viewed as an incomplete data
problem. The complete observations consist of independent pairs (Xi , Yi ) where Xi ∼ F and Yi ∼ k
are independent. If the complete data were observable, the distribution function F could be adequately estimated by the empirical distribution function of the Xi ’s (or better if more information
on F is available). However, a part of the data is missing. We do not observe the pairs (Xi , Yi )
but only observe the random variables Zi = Xi + Yi . The question then is: how can we estimate F
and quantities related to F based on the sample of Z’s?
In this paper we will not consider the deconvolution problem in great generality. We focus on
one specific convolution kernel: the uniform density on [0, 1). In that case, we have

    g(z) = gF(z) = ∫ k(z − x) dF(x) = ∫_{(z−1,z]} dF(x) = F(z) − F(z − 1),   z ∈ R.   (2.1)
Note that, formally, the distribution function F can be recovered from the density g by
    F(z) = Σ_{i=0}^∞ g(z − i).   (2.2)
Moreover, we assume throughout that the true distribution function F0 satisfies F0(0) = 0.
In the deblurring application considered in Roy Choudhury (1998), it is necessary to estimate the probability density f0 of X based on data from the convolution of f0 with a uniform density on (0, b) for general (known) b > 0. Although restricting the convolution kernel to the standard uniform density is quite a loss compared with the general deconvolution problem, where k is an arbitrary density, the results of this paper remain valid for nonstandard uniform deconvolution, i.e. with arbitrary (but known) b > 0. The procedure then is to first divide the data by b and to apply the techniques for the standard uniform deconvolution model of this paper, which yields an estimate of the density of the rescaled variable X/b. This estimate can then be rescaled again to give a density estimate for f0. We will not restate our results for these more general b, since the modifications are obvious.
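As a sketch of that reduction (in Python; the function and argument names are our own, and `estimator` stands for any estimator of the standard uniform deconvolution model, such as the inverse kernel estimator sketched in section 4):

```python
import numpy as np

def estimate_density_general_b(t, z, b, h, estimator):
    """Uniform(0, b) noise: rescale to the standard model, estimate, rescale back.

    If Z = X + b*U with U ~ Uniform(0, 1), then Z/b = X/b + U, so `estimator`
    (which expects standard uniform noise and arguments (t, z, h)) is applied to
    Z/b; the density of X at t is (1/b) times the density of X/b at t/b.
    The bandwidth h refers to the rescaled scale.
    """
    z_scaled = np.asarray(z, dtype=float) / b
    t_scaled = np.asarray(t, dtype=float) / b
    return estimator(t_scaled, z_scaled, h) / b
```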
Of course, in order for (2.2) to be valid, g should belong to the image of the set of distribution
functions under the convolution operator. This range is a strict subset of the class of all probability
densities. In fact, this is one of the characteristic features of inverse problems. The nonparametric
model on the hidden space (here the product measures that can be constructed from an arbitrary
distribution function and the uniform density) transforms under the mapping from the hidden to
the observation space to a model that is smaller than the usual nonparametric model one is used
to considering in the observation space. The example below, which will be used in the sequel to illustrate the various estimators, stresses the fact that applying (2.2) to densities outside the range of the convolution operator gives functions F that are not distribution functions.
Example 1. Consider the exponential model for F . This means that we assume F to belong to
the class
    F = {Fθ : θ > 0}   with   Fθ(x) = 1 − e^{−θx},   x > 0.

Then the sampling densities belong to the parametric model

    G = {gθ : θ > 0}   with   gθ(z) = 1 − e^{−θz} for z ∈ (0, 1)   and   gθ(z) = (e^θ − 1) e^{−θz} for z ≥ 1.   (2.3)
This is not one of the ‘usual parametric families’ on (0, ∞). It is important to note that any family
F of distribution functions on R results in a family of densities on R using (2.1).
Now suppose we take the class of exponential densities as model in the observation space, so
gθ(z) = θe^{−θz} in (2.3). Then we get as associated model in the hidden space

    Fθ(x) = θe^{−θx} (1 + e^θ + e^{2θ} + · · · + e^{⌊x⌋θ}).
It is clear that Fθ (0) = θ (which is not necessarily in (0, 1)) and that Fθ is initially decreasing. This
shows that the class of exponential densities is outside the range of the set of distribution functions
under the uniform convolution mapping.
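A short numerical check of this example may be helpful. The Python sketch below (the grid, the truncation point of the sum and all names are our own choices, not part of the model) confirms that (2.2) applied to gθ of (2.3) returns the exponential distribution function, whereas applied to the plain exponential density it produces a function that starts at θ and decreases on (0, 1).

```python
import numpy as np

theta = 1.5

def g_theta(z):
    """Sampling density (2.3): the Exp(theta) distribution convolved with Uniform(0,1)."""
    z = np.asarray(z, dtype=float)
    vals = np.where(z < 1.0, 1.0 - np.exp(-theta * z), (np.exp(theta) - 1.0) * np.exp(-theta * z))
    return np.where(z <= 0.0, 0.0, vals)

def invert(g, z, n_terms=50):
    """Inversion formula (2.2): F(z) = sum_{i >= 0} g(z - i) (truncated; g vanishes on (-inf, 0])."""
    return sum(g(z - i) for i in range(n_terms))

z = np.linspace(0.01, 5.0, 500)
F_rec = invert(g_theta, z)
print(np.max(np.abs(F_rec - (1.0 - np.exp(-theta * z)))))   # ~ 0: (2.2) recovers F_theta

# Applying (2.2) to the plain exponential density, which lies outside the range
# of the convolution operator, does not give a distribution function:
exp_density = lambda z: np.where(z > 0.0, theta * np.exp(-theta * z), 0.0)
F_bad = invert(exp_density, z)
print(F_bad[0])   # ~ theta = 1.5, and F_bad decreases on (0, 1)
```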
In section 3 we discuss the nonparametric maximum likelihood estimator F̂ of the distribution function F and address computational issues associated with that estimator. Section 4 is devoted
to a kernel estimator which is based on (2.2). The sampling density is estimated using a kernel
estimator and this estimate is plugged into (2.2). This estimator has the disadvantage that it is
not a probability density function due to the fact that the usual kernel density estimates are not
contained in the class of densities of the form (2.1). However, its asymptotic behavior can be
derived using the classical central limit theorem for triangular arrays. In section 5 we introduce a
natural kernel density estimator that is based on F̂ and study its asymptotic behavior.
3 Nonparametric Maximum Likelihood Estimator F̂
In this section we consider the nonparametric maximum likelihood approach to estimating the distribution function F in the uniform deconvolution model. Given data z1 < z2 < . . . < zn, the loglikelihood function on the class of distribution functions is given by

    l(F) = (1/n) Σ_{i=1}^n log gF(zi) = (1/n) Σ_{i=1}^n log(F(zi) − F(zi − 1)).
This estimator was studied in Groeneboom and Wellner (1992). Note that l(F ) ≤ 0 for all
F , so that the log likelihood function is bounded above. Note also that l depends on F only via
its values at the observed points zi and the points zi − 1. This means that when maximizing l,
attention may be restricted to discrete distribution functions having their mass concentrated at the
points zi and zi − 1. However, any maximizer of l will assign mass zero to these latter points. The
loglikelihood is increased by shifting such masses to the first zj bigger than zi −1. Finally, note that
l can be viewed as a strictly concave function on the set of all these discrete distribution functions.
All this shows that we can uniquely define the MLE F̂ as the discrete distribution function having its mass concentrated on the observed zi’s that maximizes l.
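To make this parametrization concrete, the following Python sketch (the name `loglik` and the vector representation are ours) evaluates l(F) for a candidate vector of probability masses placed at the sorted observations; F(zi) and F(zi − 1) are then simple partial sums of these masses.

```python
import numpy as np

def loglik(masses, z):
    """l(F) for a discrete F putting mass masses[j] at the j-th sorted observation."""
    z = np.sort(np.asarray(z, dtype=float))
    masses = np.asarray(masses, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(masses)))        # partial sums of the masses
    F = lambda x: cum[np.searchsorted(z, x, side="right")]  # F(x) = sum of masses at z_j <= x
    vals = F(z) - F(z - 1.0)                                # g_F(z_i) = F(z_i) - F(z_i - 1)
    return np.mean(np.log(vals)) if np.all(vals > 0) else -np.inf
```

For instance, `loglik(np.full(len(z), 1/len(z)), z)` evaluates l at the empirical distribution of the observations, which is in general not the maximizer.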
Another well-known feature in this setting is that if the support of the distribution with distribution function F is contained in (0, 1), the MLE can be computed using the fact that the loglikelihood has the same structure as the loglikelihood for current status data. If the support of F is bigger, this correspondence no longer exists and the estimator has to be computed using
iterative optimization techniques. Figure 1 shows the maximum likelihood estimator of the mixing
distribution F and the corresponding density g based on a sample of size n = 500.
The true F (mean one exponential) and g are also included in the picture.
Figure 1: Above: ML estimate of the distribution function F based on a sample of size n = 500 with
the true (exponential) distribution. Below: the corresponding estimate for the sampling density g
with the true density.
To obtain Figure 1, we used a Newton algorithm to compute F̂ , where the basic quadratic
maximizations are performed using the support reduction algorithm. This algorithm is introduced
in Groeneboom, Jongbloed and Wellner (2002), where it is also used to compute the maximum
likelihood estimator of the distribution function in the Gaussian deconvolution model. It is an algorithm that belongs to the vertex direction family of algorithms and is closely related to the
algorithm proposed in Simar (1976).
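For readers who want a simple, reproducible way to approximate F̂ without implementing the support reduction algorithm, the sketch below runs a plain EM (self-consistency) iteration over the masses on the observed points, iterated to convergence rather than stopped early. It is only an illustration, with our own naming and tolerance choices, and is not the algorithm we used for the figures.

```python
import numpy as np

def uniform_deconv_mle(z, tol=1e-10, max_iter=100_000):
    """EM / self-consistency iteration for the masses of F-hat on the observed points.

    E-step: P(X = z_j | Z = z_i) is proportional to p_j * 1{z_i - 1 < z_j <= z_i}.
    M-step: p_j is replaced by the average of these posterior probabilities over i.
    """
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    # compat[i, j] = 1 if support point z_j is compatible with observation z_i
    compat = ((z[:, None] - 1.0 < z[None, :]) & (z[None, :] <= z[:, None])).astype(float)
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        denom = compat @ p                          # g_F(z_i) = F(z_i) - F(z_i - 1)
        p_new = (compat / denom[:, None]).mean(axis=0) * p
        if np.max(np.abs(p_new - p)) < tol:
            return z, p_new
        p = p_new
    return z, p                                     # support points and fitted masses
```

The iteration increases l monotonically and, l being concave in the masses, typically converges to the maximizer, but it can take very many steps; this slowness is one reason to prefer the support reduction algorithm.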
4 Inverse kernel density estimation
There are various methods that come to mind if one wants to estimate f based on a sample from
g. For example, the convolution structure of (2.1) suggests the use of the Fourier transform to get
an estimate of f . However, this approach is not attractive since the characteristic function of the
uniform distribution has zeroes.
In this section we study an estimator that is suggested by (2.2): an inverse density estimator.
Indeed, assuming that F has a density, we get that

    f(z) = Σ_{i=0}^∞ g′(z − i),
so that a usual density estimate of g can be used to estimate f . In this section we study an inverse
kernel density estimator for f . This estimator is based on a conventional kernel estimate of g,
    gn,h(z) = ∫_R Kh(z − x) dGn(x) = (1/h) ∫_R K((z − x)/h) dGn(x).
Here h > 0 is the bandwidth of the density estimate, K is the kernel, and Gn denotes the empirical distribution function of Z1, . . . , Zn. Based on this estimate, we define
    f̃n,h(z) = Σ_{i=0}^∞ g′n,h(z − i) = (1/h^2) Σ_{i=0}^∞ ∫_R K′((z − x − i)/h) dGn(x) = Σ_{i=0}^∞ ∫_R Kh′(z − x − i) dGn(x).   (4.4)
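Written out for the sample, (4.4) is f̃n,h(t) = (1/(n h^2)) Σ_k Σ_i K′((t − Zk − i)/h), which the following Python sketch implements (function names are ours; the biweight kernel of Corollary 4.1 below is used for concreteness, and we assume h < 1 when truncating the sum over i).

```python
import numpy as np

def biweight_deriv(u):
    """K'(u) for the biweight kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, -(15.0 / 4.0) * u * (1.0 - u ** 2), 0.0)

def f_tilde(t, z, h, kprime=biweight_deriv):
    """Inverse kernel density estimate (4.4): (1/(n h^2)) sum_k sum_i K'((t - Z_k - i)/h)."""
    z = np.asarray(z, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    n_terms = int(np.ceil(t.max())) + 1        # the sum over i is finite (Z_k >= 0, h < 1)
    est = np.zeros_like(t)
    for i in range(n_terms):
        u = (t[:, None] - z[None, :] - i) / h
        est += kprime(u).mean(axis=1)
    return est / h ** 2
```

For a simulated data set like the one behind Figure 1, `f_tilde(np.linspace(0.01, 10, 400), z, 0.88)` would produce a curve like the one shown in Figure 2.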
For this estimator we have the following theorem.
Theorem 4.1 Let K be a compactly supported twice continuously differentiable kernel and let the distribution function F0 be continuously differentiable at the point t ∈ (0, M). Define the estimator f̃n,h(t) of the derivative f0 of F0 at t by (4.4) for t ∈ (0, M). Then:

(i) As n → ∞, h ↓ 0, and nh → ∞,

    {nh^3}^{1/2} { f̃n,h(t) − Σ_{i=0}^∞ ∫ Kh′(t − x − i) dG0(x) } →_D N(0, σ^2),   (4.5)

where

    σ^2 = F0(t) ∫ K′(u)^2 du.   (4.6)

(ii) Suppose that f0 is twice differentiable at t. Then, for h ↓ 0,

    h^{−2} ( Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz − f0(t) ) → (1/2) f0′′(t) ∫ u^2 K(u) du =: β.   (4.7)

(iii) If hn ∼ c · n^{−1/7}, for some c > 0, as n → ∞,

    n^{2/7} { f̃n,h(t) − f0(t) } →_D N(βc^2, c^{−3}σ^2),   n → ∞,   (4.8)

where σ^2 is as in (4.6), and β as in (4.7).
Proof. For (i) write
    {nh^3}^{1/2} { f̃n,h(t) − Σ_{i=0}^∞ ∫ Kh′(t − z − i) dG0(z) } = Σ_{k=1}^n ξnk

with

    ξnk = (h^3/n)^{1/2} { Σ_{i=0}^∞ Kh′(t − Zk − i) − Σ_{i=0}^∞ ∫ Kh′(t − z − i) dG0(z) }.
Note that the infinite sum is in fact a finite sum due to the fact that F0 (0) = 0. By the central
limit theorem for triangular arrays (theorem 7.2 in Billingsley (1968)), we have that
    s_n^{−1} Σ_{k=1}^n ξnk →_D N(0, 1),
where
    s_n^2 = Σ_{k=1}^n var(ξnk) = n E ξ_{n1}^2 = h^3 var( Σ_{i=0}^∞ Kh′(t − Z1 − i) )

        = h^3 ∫_0^∞ ( Σ_{i=0}^∞ Kh′(t − z − i) )^2 g0(z) dz − h^3 ( ∫_0^∞ Σ_{i=0}^∞ Kh′(t − z − i) g0(z) dz )^2

        = I_{1,n} − I_{2,n}^2.
Because K has compact support, for h sufficiently small, the first term can be written as
    I_{1,n} = h^3 Σ_{i=0}^∞ ∫_0^∞ Kh′(t − z − i)^2 g0(z) dz = Σ_{i=0}^∞ ∫ K′(u)^2 g0(t − i − hu) du

        = ∫ K′(u)^2 F0(t − hu) du → F0(t) ∫ K′(u)^2 du,   for h ↓ 0,
by dominated convergence. For the second term we have that
    I_{2,n} = h^{3/2} Σ_{i=0}^∞ ∫_0^∞ Kh′(t − z − i) g0(z) dz = h^{3/2} h^{−2} Σ_{i=0}^∞ ∫ K′((t − z − i)/h) g0(z) dz = √h ∫_R K′(u) F0(t − hu) du,
which is of smaller order than the first term. Hence, s_n^2 → F0(t) ∫ K′(u)^2 du, yielding (i). Finally, note that
    P(|ξnk| > εs_n) ≤ var(ξnk)/(ε^2 s_n^2) = 1/(ε^2 n)   and   ξ_{nk}^2 ≤ c/(nh)

for a c > 0. Hence,

    (n/s_n^2) E[ ξ_{nk}^2 1_{[|ξnk| > εs_n]} ] ≤ (n/s_n^2) · (c/(nh)) · (1/(ε^2 n)) = (c/(ε^2 s_n^2)) · (1/(nh)) → 0

as nh → ∞.
For (ii), note that

    Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz = h^{−2} Σ_{i=0}^∞ ∫ K′((t − z − i)/h) g0(z) dz

        = h^{−1} Σ_{i=0}^∞ ∫ K′(u) g0(t − i − hu) du = h^{−1} ∫ K′(u) F0(t − hu) du

        = h^{−1} ∫ K′(u) ( F0(t) − hu f0(t) + (1/2) h^2 u^2 f0′(t) − (1/6) h^3 u^3 f0′′(t − hζ_u) ) du

        = 0 + f0(t) + 0 + (1/2) h^2 f0′′(t) ∫ u^2 K(u) du + o(h^2),   h ↓ 0.

Here we use integration by parts and the fact that K is a smooth, symmetric probability density with mean zero.
Now, for (iii) note that

    n^{2/7} ( f̃n,h(t) − f0(t) ) = c^{−3/2} (nh^3)^{1/2} { f̃n,h(t) − Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz }

        + n^{2/7} { Σ_{i=0}^∞ ∫ Kh′(t − z − i) g0(z) dz − f0(t) } →_D N(βc^2, c^{−3}σ^2),

as n → ∞ by (i) and (ii).   ✷
Corollary 4.1 To estimate f(t) with kernel estimator (4.4), the asymptotic MSE optimal bandwidth is given by

    h = hn = c_opt · n^{−1/7}   with   c_opt = ( 3σ^2 / (4β^2) )^{1/7},

with σ^2 and β as in (4.6) and (4.7). For the biweight kernel K(u) = (15/16)(1 − u^2)^2 1_{[−1,1]}(u) this gives

    σ^2 = (15/7) F(t0)   and   β = (4/21) f′′(t0)   ⇒   c_opt = ( 2835 F(t0) / (64 f′′(t0)^2) )^{1/7} ≈ 1.72 ( F(t0)/f′′(t0)^2 )^{1/7}.

Proof: For c_opt we minimize the sum of the squared bias and the variance given in Theorem 4.1 (iii) as a function of c.   ✷
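As a numerical check (a Python sketch with our own function names), plugging the mean-one exponential of the simulation example, F(t0) = 1 − e^{−t0} and f′′(t0) = e^{−t0}, with n = 500 into the corollary reproduces the bandwidth used for the figures and the t0-dependence 0.71(e^{2 t0} − e^{t0})^{1/7} quoted below.

```python
import numpy as np

def c_opt_biweight(F_t0, fpp_t0):
    """c_opt of Corollary 4.1 for the biweight kernel: (2835 F(t0) / (64 f''(t0)^2))^(1/7)."""
    return (2835.0 * F_t0 / (64.0 * fpp_t0 ** 2)) ** (1.0 / 7.0)

# mean-one exponential of the simulation example: F(t0) = 1 - exp(-t0), f''(t0) = exp(-t0)
t0, n = 1.0, 500
h = c_opt_biweight(1.0 - np.exp(-t0), np.exp(-t0)) * n ** (-1.0 / 7.0)
print(round(h, 2))   # 0.88
```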
Figure 2 shows f̃n,h based on the same data set as Figure 1. The choice of K is the biweight kernel and the bandwidth is h = 0.88 (asymptotically MSE optimal for t0 = 1). Note that the optimal bandwidth in our simulation example (exponential F0 and n = 500) depends on t0 as t0 ↦ 0.71 (e^{2 t0} − e^{t0})^{1/7}. This is an increasing function, explaining to some extent that the density estimate in Figure 2 seems oversmoothed near zero and undersmoothed in the right tail. Of course, in practical situations one could first estimate F0 and f0′′ using a pilot bandwidth and then compute a density estimate with varying bandwidth based on this pilot estimate. Also, one could apply boundary kernels to get rid of the obvious boundary problems of the estimates at zero. Boundary kernels in the spirit of Wand and Jones (1995) have a discontinuity at zero. In the current context this would be a problem since we need the density estimate to be differentiable. Although the estimate of Figure 2 could be improved near the boundary, we will not pursue this here.
Figure 2: The inverse kernel estimate f̃n,h based on a sample of size 500, using the biweight kernel and bandwidth h = 0.88.
5 Density estimation based on F̂
We want to study the asymptotic behavior of the estimator

    f̂n,h(t) = ∫ Kh(t − y) dF̂n(y)

of the derivative f0(t) at t of the distribution function F0 of the distribution we want to estimate, where F̂n is the MLE of this distribution function F0 as defined in section 3. Figure 3 shows this estimate based on the same data as Figures 1 and 2. The estimate f̂n,h has the same undesirable behavior near zero as f̃n,h. In this case it is straightforward to implement boundary kernels in the spirit of section 2.11 in Wand and Jones (1995). The resulting estimate (which coincides with f̂n,h on [h, ∞)) is shown in Figure 4.
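Since F̂n is a discrete distribution function, with mass p̂j at its jump point τj, the estimator is just a kernel-weighted sum of those masses, f̂n,h(t) = Σ_j p̂j Kh(t − τj). A minimal Python sketch (our naming; the support points and masses could, for instance, come from the EM sketch of section 3, and the biweight kernel is assumed):

```python
import numpy as np

def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)

def f_hat(t, support, masses, h, kernel=biweight):
    """Smoothed MLE: f-hat_{n,h}(t) = sum_j p_j * K_h(t - tau_j)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    support = np.asarray(support, dtype=float)
    masses = np.asarray(masses, dtype=float)
    u = (t[:, None] - support[None, :]) / h
    return (kernel(u) * masses[None, :]).sum(axis=1) / h
```

In contrast to f̃n,h, the result is nonnegative and integrates to one over the whole line; the mass that spills over to the negative half-line near zero is exactly the boundary problem repaired in Figure 4.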
We impose the following condition on F0:

(F) The distribution corresponding to the distribution function F0 has support [0, M], and has a continuous derivative f0 which satisfies inf_{u∈(0,M)} f0(u) > 0.

Note that condition (F) implies that for any δ ∈ (0, M ∧ 1):

    inf_{x∈[δ,M−δ]} {F0(x) − F0(x − 1)} ∧ {F0(x + 1) − F0(x)} > 0.   (5.9)
For the kernel function we assume

(K) Kh(u) = h^{−1} K(u/h), where K is a fixed non-negative twice differentiable kernel with support [−1, 1] and bounded second derivative, that integrates to 1 on [−1, 1] and is symmetric around zero.
For our pictures we used the biweight kernel, as defined in Corollary 4.1. This kernel satisfies (K).
We show below that under these conditions the following result holds:
Figure 3: The density estimate fˆn,h based on a sample of size 500, using the biweight kernel and
h = 0.88.
Figure 4: The smoothed MLE density estimate based on the sample of size 500 using boundary
kernels. Again, h = 0.88 and the kernel used is the biweight.
Theorem 5.1 Let K satisfy condition (K) and let the distribution function F0 satisfy condition (F). Let t be a fixed point in the open interval (0, M), such that t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1), and let the estimator f̂n,h(t) of the derivative f0 of F0 at t be defined by

    f̂n,h(t) = ∫ Kh(t − x) dF̂n(x),   for t ∈ (0, M).

Moreover, let m be the largest integer < M + 1. Then:

(i) As n → ∞, h ↓ 0, and nh^3 → ∞,

    {nh^3}^{1/2} { f̂n,h(t) − ∫ Kh(t − x) dF0(x) } →_D N(0, σ^2),   (5.10)

where

    σ^2 = lim_{h↓0} h^3 ∫ θh,t,F0^2 dG0 > 0,   (5.11)

for a function θh,t,F0, defined by

    θh,t,F0(x + k) = Σ_{i=0}^m {1 − F0(x + i)} Kh′(t − (x + i)),   if x ∈ [0, 1], k = 0,
    θh,t,F0(x + k) = −Σ_{i=0}^{k−1} Kh′(t − (x + i)) + θh,t,F0(x),   if x ∈ [0, 1], k = 1, . . . , m.   (5.12)

(ii) Suppose that f0 is twice differentiable at t. Then, for h ↓ 0,

    h^{−2} ( ∫ Kh(t − x) dF0(x) − f0(t) ) → (1/2) f0′′(t) ∫ u^2 K(u) du =: β.   (5.13)

(iii) If hn ∼ c · n^{−1/7}, for some c > 0, as n → ∞,

    n^{2/7} { f̂n,h(t) − f0(t) } →_D N(βc^2, c^{−3}σ^2),   n → ∞,   (5.14)

where σ^2 is as in (5.11), and β as in (5.13).
Remark. The statements of Theorem 5.1 resemble those of Theorem 4.1. The asymptotic bias is
exactly the same for both estimators.
The definition (5.12) of θh,t,F0 in part (i) is based on a definition given in Example 11.2.3e on
p. 230 of van de Geer (2000). If M ≤ 1, it can be seen that this is given by
    θh,t,F0(x) = (1 − F0(x)) Kh′(t − x) − F0(x − 1) Kh′(t − x + 1)

if (t − h, t + h) ⊂ (0, M). This means that σ^2 reduces to:

    σ^2 = F0(t){1 − F0(t)} ∫ K′(u)^2 du.   (5.15)

In this case the asymptotic variances for the two estimators depend on the kernel K in the same way, but the dependence on F0 shows that f̂n,h is more efficient than f̃n,h:

    F0(t){1 − F0(t)} < F0(t),   t ∈ (0, M),
under condition (F) at the beginning of this section.
Apart from the pointwise asymptotic properties of fˆn,h , which we think are overall better than
those of f˜n,h , another reason to prefer fˆn,h is that fˆn,h is a genuine probability density function. In
contrast to f˜n,h, it only takes nonnegative values.
The assumption that t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1) is somewhat unpleasant and probably not needed, but we saw no way to avoid this in the present proof.
The proof of Theorem 5.1 is much more involved than that of Theorem 4.1 and will be given below
in a sequence of steps. The “hidden probability space” contains random variables (X, Y ), where
X has distribution function F0, Y has a Uniform(0, 1)-distribution, and X and Y are independent. The mapping relating the random variables in the hidden space to the random variables in the observation space is

    T(x, y) = x + y,   x ∈ [0, M], y ∈ [0, 1].
We will need some facts about score operators for this particular model, given in, e.g., Groeneboom and Wellner (1992), Chapter 1. Let L2(F0) denote the space of square integrable functions w.r.t. the measure dF0, that is: the space of Borel measurable functions h : R → R such that ∫ h^2 dF0 < ∞. Moreover, the space L2^0(F0) is the subset of functions h ∈ L2(F0) satisfying ∫ h dF0 = 0.
The score operator LF0, mapping scores a ∈ L2^0(F0), associated with a Hellinger differentiable path (Fτ)_{τ∈[0,δ)} tending to F0, to scores ā in the observation space, is given by the conditional expectation

    [LF0(a)](z) = EF0{ a(X) | T(X, Y) = z }.

Its adjoint on the space of scores contained in L2^0(G0), where G0 is defined by the convolution density gF0, is given by the conditional expectation

    [L*(ā)](x) = EF0{ ā(Z) | X = x } = ∫_x^{x+1} ā(z) dz.

Note that the adjoint operator L* does not involve the distribution function F0, in contrast with the score operator LF0 itself (of which it is the adjoint).
The guiding principle in finding the asymptotic distribution of f̂n,h(t) is to try to find a solution θh,t,F to the equation

    L* θh,t,F = Kh(t − ·) − ∫ Kh(t − x) dF(x),

where θh,t,F is a score in L2^0(GF), belonging to the (closure of the) range of LF, and GF denotes the distribution with density

    gF(z) = F(z) − F(z − 1).

The next step is then to prove that

    ∫ Kh(t − x) d(F̂n − F0)(x) ∼ ∫ θh,t,F0(z) d(Gn − G0)(z),   (5.16)

in the sense that the right-hand side of (5.16) represents the left-hand side up to terms that are of lower order, as n → ∞, and h = hn ↓ 0 (at a certain rate that will be specified below). By the central limit theorem, the right-hand side of (5.16), divided by

    σh,t = { ∫ θh,t,F0(z)^2 dG0(z) }^{1/2},

is asymptotically standard normal (again under some conditions on the rate at which h tends to zero as a function of n).
Note that, since the convolution kernel is the Uniform(0, 1) density, we have:

    [LF(a)](z) = ∫_{z−1}^z a(x) dF(x) / {F(z) − F(z − 1)}.
Defining

    θ(z) = ∫_{z−1}^z a(x) dF(x) / {F(z) − F(z − 1)},   z ∈ R,   (5.17)

where the ratio is defined to be zero if the denominator is zero, we have to solve the equation

    ∫_x^{x+1} θ(z) dz = Kh(t − x) − ∫ Kh(t − y) dF(y),

for x in the interior of the support of F, where θ will in fact depend on t, h and F. Differentiation w.r.t. x yields the equation

    θ(x + 1) − θ(x) = −Kh′(t − x),   (5.18)

if θ is continuous at the points x and x + 1.
Let the function φ be defined by

    φ(x) = ∫_0^x a(u) dF(u)   if x ≥ 0,   and   φ(x) = 0   otherwise.   (5.19)

Then, by (5.18), φ should satisfy

    {φ(x + 1) − φ(x)} / {F(x + 1) − F(x)} − {φ(x) − φ(x − 1)} / {F(x) − F(x − 1)} = −Kh′(t − x),   (5.20)

if x and x + 1 are points of continuity of θ. Conversely, if φ satisfies (5.20), then θ satisfies (5.18).

The following lemma is one of the stepping stones in achieving our goal (compare to Lemma 3.3 in Groeneboom (1996)).
Lemma 5.1 Let the kernel K satisfy condition (K) above. Furthermore, let t ∈ (0, M) and h ∈ (0, 1/2) satisfy (t − h, t + h) ⊂ (0, M), and let Kh′(t − x) = (d/du) Kh(u)|_{u=t−x}, where Kh(u) = h^{−1} K(u/h). Then we have:

(i) Suppose that the distribution function F on [0, ∞) satisfies F(M + 1) = 1 and (5.9) for a δ ∈ (0, (M ∧ 1)/2) (not necessarily for all δ ∈ (0, M ∧ 1)). Then the equation

    {F(x + 1) − F(x − 1)}ψ(x) − {F(x + 2) − F(x + 1)}ψ(x + 1) − {F(x − 1) − F(x − 2)}ψ(x − 1) = Kh′(t − x),   x ∈ R,   (5.21)

under the side condition ψ(x) = 0, x ∉ [0, M + 1], has a bounded solution ψh,t,F.

(ii) Let F be as in condition (i), and let the distribution function F0 satisfy condition (F) above. Moreover, let ψh,t,F be defined as in (i), and let θh,t,F be defined by

    θh,t,F(x) = {F(x + 1) − F(x)}ψ(x) − {F(x − 1) − F(x − 2)}ψ(x − 1),   x ∈ R.   (5.22)

Then

    ∫ θh,t,F(x) gF(x) dx = 0,

and

    ∫ Kh(t − y) d(F − F0)(y) = − ∫ θh,t,F(z) dG0(z).   (5.23)
Remark. The function φh,t,F, defined by

    φh,t,F(x) = {F(x) − F(x − 1)}{F(x + 1) − F(x)} ψh,t,F(x),   x ∈ R,   (5.24)
satisfies (5.20), if the denominators in (5.20) are strictly positive. The introduction of the function
ψh,t,F (x) is needed to cover the situation that the denominators are zero. If condition (F) is satisfied
(as is true for the distribution function F0 ), the introduction of the function ψh,t,F0 is not necessary
in the definition of θh,t,F0 . We will however use the function ψh,t,F in the case that F equals the
MLE F̂n . In particular we will need the fact that φh,t,F̂ (x) contains the factors F̂n (x) − F̂n (x − 1)
and F̂n (x + 1) − F̂n (x), as is clear from its representation (5.24) in terms of ψh,t,F̂ (x).
Proof of Lemma 5.1. Fix x ∈ [0, 1) and let m be the largest integer such that x + m ≤ M + 1 and F(x + m − 1) < 1. We define, for j = 0, . . . , m,

    α(x + j) = F(x + j) − F(x + j − 1).   (5.25)

Moreover, we define, for j = 0, . . . , m,

    β(x + j) = F(x + j + 1) − F(x + j − 1)   (5.26)

and

    µ(x + j) = Kh′(t − x − j).   (5.27)

Note that β(x + j) > 0, for all j, 0 ≤ j ≤ m, again by the positivity condition on the differences F(x) − F(x − 1) on [δ, M − δ].
We now get the matrix equation
    A(x)ψ(x) = µ(x),

where A(x) is the tridiagonal matrix

    A(x) =
        [  β(x)      −α(x+2)      0          0        · · ·      0             0       ]
        [ −α(x)       β(x+1)     −α(x+3)     0        · · ·      0             0       ]
        [   0        −α(x+1)      β(x+2)    −α(x+4)   · · ·      0             0       ]
        [   ⋮           ⋮            ⋮          ⋮        ⋱          ⋮             ⋮       ]
        [   0           0            0          0      · · ·  −α(x+m−1)      β(x+m)    ]

and

    ψ(x) = (ψ(x), . . . , ψ(x + m))′,   µ(x) = (µ(x), . . . , µ(x + m))′.
As is well-known, this system of equations can be solved in the following way. Let bj and uj, j = 0, . . . , m, be recursively defined by

    b0 = β(x),   u0 = µ(x)/b0,   (5.28)

and

    bj = β(x + j) − α(x + j − 1)α(x + j + 1)/bj−1,   uj = {µ(x + j) + α(x + j − 1) uj−1}/bj,   (5.29)

for j = 1, . . . , m. Then the solution of the system can be found by the following back-substitution:

    ψ(x + m) = um,   (5.30)

    ψ(x + j) = uj + α(x + j + 2) ψ(x + j + 1)/bj,   j = m − 1, . . . , 0.   (5.31)
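This is the standard forward elimination and back-substitution for a tridiagonal system. A small Python sketch, under our own naming conventions (alpha[j] = α(x + j), beta[j] = β(x + j), mu[j] = µ(x + j)); note that the entries α(x + j + 1) and α(x + j + 2) appearing in (5.29) and (5.31) require α up to index m + 1:

```python
import numpy as np

def solve_psi(alpha, beta, mu):
    """Solve the tridiagonal system A(x) psi = mu by (5.28)-(5.31).

    Row j of A(x) reads -alpha[j-1]*psi[j-1] + beta[j]*psi[j] - alpha[j+2]*psi[j+1] = mu[j],
    so alpha must be supplied up to index m + 1 (length m + 2).
    """
    m = len(beta) - 1
    b = np.empty(m + 1)
    u = np.empty(m + 1)
    b[0], u[0] = beta[0], mu[0] / beta[0]
    for j in range(1, m + 1):                   # forward elimination, cf. (5.29)
        b[j] = beta[j] - alpha[j - 1] * alpha[j + 1] / b[j - 1]
        u[j] = (mu[j] + alpha[j - 1] * u[j - 1]) / b[j]
    psi = np.empty(m + 1)
    psi[m] = u[m]                               # back-substitution, cf. (5.30)-(5.31)
    for j in range(m - 1, -1, -1):
        psi[j] = u[j] + alpha[j + 2] * psi[j + 1] / b[j]
    return psi
```

A solution can be verified by checking the residual of A(x)ψ = µ row by row.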
We have, if m > 1,
b1 = F (x + 2) − F (x) − {F (x + 2) − F (x + 1)}{F (x) − F (x − 1)}/{F (x + 1) − F (x − 1)}
> F (x + 2) − F (x) − {F (x + 2) − F (x + 1)} = F (x + 1) − F (x) > 0
since 0 ≤ {F (x) − F (x − 1)}/{F (x + 1) − F (x − 1)} < 1 and F (x + 2) − F (x) > F (x + 1) − F (x).
Continuing in this way, we find inductively that bj > 0, 0 ≤ j < m.
Moreover, by the positivity condition on the differences F (x) − F (x − 1) on [δ, M + 1 − δ], the
coefficients bj are bounded away from zero, for any choice of x ∈ [0, 1) and 0 ≤ j < m. The
solution ψ is therefore bounded. The function φ, defined by
φ(x) = {F (x) − F (x − 1)}{F (x + 1) − F (x)}ψ(x)
is therefore also bounded.
We now have, by (5.22),

    θh,t,F(x + 1) − θh,t,F(x) = −Kh′(t − x),   x ∈ [0, M + 1].

Furthermore, denoting ψh,t,F by ψ for convenience of notation, we get

    ∫ θh,t,F(z) gF(z) dz = ∫ ψ(z){F(z + 1) − F(z)}{F(z) − F(z − 1)} dz − ∫ ψ(z − 1){F(z − 1) − F(z − 2)}{F(z) − F(z − 1)} dz

        = ∫ ψ(z){F(z + 1) − F(z)}{F(z) − F(z − 1)} dz − ∫ ψ(z){F(z) − F(z − 1)}{F(z + 1) − F(z)} dz

        = 0.
Hence

    ∫ θh,t,F(z) dG0(z) = ∫ θh,t,F(z) g0(z) dz = ∫ θh,t,F(z) {g0(z) − gF(z)} dz

        = ∫ θh,t,F(z) {F0(z) − F0(z − 1) − (F(z) − F(z − 1))} dz

        = − ∫ {θh,t,F(z + 1) − θh,t,F(z)} {F0(z) − F(z)} dz

        = ∫ Kh′(t − z) {F0(z) − F(z)} dz = ∫ Kh(t − z) d(F0 − F)(z).   ✷
Since F0 satisfies condition (F), and since (5.24), with F = F0 , implies that φh,t,F0 is bounded
and zero outside (0, M ), we have that φh,t,F0 is absolutely continuous w.r.t. F0 on the interval
(0, M ). So we can define ah,t,F0 as the Radon-Nikodym derivative of the function φh,t,F0 w.r.t. F0
on (0, M ). The condition on the density f0 also ensures that this Radon-Nikodym derivative is
almost surely bounded on (0, M ) and hence ah,t,F0 ∈ L2 (F0 ), where we use (5.23) of part (ii) of
Lemma 5.1.
Note that it was shown in the proof of Lemma 5.1 that, under the conditions of this lemma, we
have the following relation
    ∫ θh,t,F̂n dG0 = ∫ θh,t,F̂n(z) { F0(z) − F0(z − 1) − (F̂n(z) − F̂n(z − 1)) } dz.   (5.32)
In the proof of Theorem 5.1 we also need the following lemma.
Lemma 5.2 Let condition (F) be satisfied. Then we have for each deterministic sequence of points (tn), converging to a fixed point t ∈ (0, M), and for each c > 0:

    sup_{u∈[−c,c]} | F̂n(tn + n^{−1/3} u) − F0(tn) | = Op(n^{−1/3}).   (5.33)
The proof of Lemma 5.2 proceeds along similar lines as the proof of Lemma 4.6 on p. 155 of
Groeneboom (1996). It is omitted here for reasons of space. It shows that the MLE has so-called
“cube root n” convergence, a property it shares with the MLE in the case of interval censoring.
Proof of Theorem 5.1. By the consistency of F̂n, we may assume that condition (5.9) is satisfied for a δ ∈ (0, (M ∧ 1)/2), and n sufficiently large, and therefore that the equation (5.21) has a solution ψh,t,F̂n for F = F̂n. Hence we can define φh,t,F̂n by (5.24).
Let τ0 = 0, τm+1 = M + 1 and τ1 < . . . < τm be the points of jump of F̂n , and let Ji denote the
interval [τi , τi+1 ), for 0 ≤ i < m. The interval Jm will denote the closed interval [τm , τm+1 ]. Let
the function ξh,t,F be defined by

    ξh,t,F(x) = φh,t,F(x) / ( F(x){1 − F(x)} )   if 0 < F(x) < 1,   and   ξh,t,F(x) = 0   otherwise.
Definition (5.24) implies that the function ξh,t,F is bounded if F satisfies the conditions of Lemma
5.1.
We define a piecewise constant version of ξh,t,F0, constant on the intervals Ji, by

    ξ̄h,t,F̂n(x) = ξh,t,F0(τi),        if x ∈ Ji and F0(x) > F̂n(τi) for all x ∈ Ji,
    ξ̄h,t,F̂n(x) = ξh,t,F0(τi+1 −),    if x ∈ Ji and F0(x) < F̂n(τi) for all x ∈ Ji,
    ξ̄h,t,F̂n(x) = ξh,t,F0(u),         if x ∈ Ji and F0(u) = F̂n(u) for a point u ∈ Ji,
for x ∈ Ji . We note that a construction of this type was introduced under III on p. 213 of Geskus
and Groeneboom (1997). We here use the version in Lemma 11.10 of van de Geer (2000), where
instead of a piecewise constant version of ξh,t,F̂n a piecewise constant version of ξh,t,F0 is used.
We next define the piecewise constant function φ̄h,t,F̂n by

    φ̄h,t,F̂n(x) = F̂n(x){1 − F̂n(x)} ξ̄h,t,F̂n(x),   x ∈ R,
and the piecewise constant function θ̄h,t,F̂n by

    θ̄h,t,F̂n(x) = { φ̄h,t,F̂n(x) − φ̄h,t,F̂n(x − 1) } / { F̂n(x) − F̂n(x − 1) }   if F̂n(x) > F̂n(x − 1),   and   θ̄h,t,F̂n(x) = 0   otherwise.

Note that, by Lemma 5.2 and the assumption t ≢ 0 (mod 1) and t ≢ M (mod 1), we may assume that F̂n(x) − F̂n(x − 1) stays away from zero for all large n, if φ̄h,t,F̂n(x) ≠ 0 and/or φ̄h,t,F̂n(x − 1) ≠ 0, implying that the function θ̄h,t,F̂n is bounded.
We have:

    ∫ { θ̄h,t,F̂n − θh,t,F̂n } dG0 = ∫ { θ̄h,t,F̂n(z) − θh,t,F̂n(z) } { F0(z) − F0(z − 1) − (F̂n(z) − F̂n(z − 1)) } dz.   (5.34)
This follows in the same way as (5.32).
We now show that this implies:

    ∫ θh,t,F̂n dG0 = ∫ θ̄h,t,F̂n dG0 + op( n^{−2/3} h^{−2} ).   (5.35)
Lemma 11.10 in van de Geer (2000) implies that there exists a constant c > 0 such that

    | ξ̄h,t,F̂n(x) − ξh,t,F0(x) | ≤ c h^{−3} | F̂n(x) − F0(x) |,   for all x ∈ R,

where the factor h^{−3} arises from the second derivative of Kh, which appears in dξh,t,F0/dF0, using the notation in (11.61) of Lemma 11.10 in van de Geer (2000). The function θh,t,F0 is defined in terms of the function φh,t,F0, with support consisting of intervals [t − h + k, t + h + k] which, by the assumption t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1), are strictly contained in the open interval (0, M), if h is sufficiently small.
Lemma 5.2 then implies that the closest points of jump of F̂n to the left or right of these intervals (finite in number) have a distance Op(n^{−1/3}) to the endpoints of these intervals. This means, by the condition nh^3 → ∞, that

    | ξ̄h,t,F̂n(x) − ξh,t,F0(x) | ≤ c h^{−3} | F̂n(x) − F0(x) |,

for x in a finite number of intervals of order h, shrinking to points t + k, for integers k such that t + k ∈ (0, M), and that

    ξ̄h,t,F̂n(x) − ξh,t,F0(x) = 0
outside these intervals. But this, in turn, implies by (5.34), the Cauchy-Schwarz inequality and the fact that we may assume that the differences F̂n(x) − F̂n(x ± 1) and F0(x) − F0(x ± 1) stay bounded away from zero on these intervals:

    ∫ { θ̄h,t,F̂n(x) − θh,t,F0(x) } dG0(x) = Op( h^{−3} dh,n(F̂n, F0)^2 ) = Op( h^{−2} n^{−2/3} ),   (5.36)

where dh,n(F̂n, F0)^2 is defined as

    dh,n(F̂n, F0)^2 = Σ_{j∈Z} ∫_{(t−2h+j)∨0}^{(t+2h+j)∧M} { F̂n(x) − F0(x) }^2 dx = Op( h n^{−2/3} ).
Note that the infinite sum is in fact a sum with a (uniformly) bounded number of non-zero terms. The property

    Σ_{j∈Z} ∫_{(t−2h+j)∨0}^{(t+2h+j)∧M} { F̂n(x) − F0(x) }^2 dx = Op( h n^{−2/3} )

follows from the fact that, by the assumption t ≢ 0 (mod 1) and M − t ≢ 0 (mod 1), the intervals [(t − 2h + j) ∨ 0, (t + 2h + j) ∧ M] are strictly contained in (0, M) and the fact that the processes

    x ↦ F̂n(x) − F0(x),   x ∈ [(t − 2h + j) ∨ 0, (t + 2h + j) ∧ M],

have a completely similar probabilistic behavior, together with the property

    ‖F̂n − F0‖_2^2 = Op( n^{−2/3} ).
On the other hand, by the structure of the solutions ψh,t,F̂n and ψh,t,F0, there also exists a constant c2 > 0 such that

    | ψh,t,F̂n(x) − ψh,t,F0(x) | ≤ c2 h^{−2} Σ_{k∈Z} | F̂n(x + k) − F0(x + k) |,

for x belonging to an interval (t + k − h, t + k + h) ⊂ (0, M + 1), and that

    ψh,t,F̂n(x) − ψh,t,F0(x) = 0

outside these intervals. Note that for x ∈ (t + k − h, t + k + h) ⊂ (M, M + 1), we have ψh,t,F0(x) = 0 and

    φh,t,F̂n(x) / { F̂n(x) − F̂n(x − 1) } = { F̂n(x + 1) − F̂n(x) } ψh,t,F̂n(x) = { 1 − F̂n(x) } ψh,t,F̂n(x)
        = { F0(x) − F̂n(x) } ψh,t,F̂n(x) ≤ c h^{−2} | F0(x) − F̂n(x) |,

for a constant c > 0, and that we get in a similar way:

    φh,t,F̂n(x) / { F̂n(x + 1) − F̂n(x) } ≤ c h^{−2} | F0(x) − F̂n(x) |,

for a constant c > 0, if x ∈ (M, M + 1).
Hence, again by (5.34) and the Cauchy-Schwarz inequality,

    ∫ { θh,t,F̂n(x) − θh,t,F0(x) } dG0(x) = Op( h^{−2} ‖F̂n − F0‖_2^2 ) = Op( h^{−2} n^{−2/3} ).   (5.37)

Relation (5.35) now follows from (5.36) and (5.37).

We then apply, using the score equations for F̂n, implying ∫ θ̄h,t,F̂n dGn = 0,

    ∫ θ̄h,t,F̂n dG0 = ∫ θ̄h,t,F̂n d(G0 − Gn),

and

    ∫ { θ̄h,t,F̂n − θh,t,F0 } d(G0 − Gn) = Op( h^{−2} n^{−2/3} ),

where the latter result again follows from an L2-entropy argument. This shows, since nh^3 → ∞,
    ∫ Kh(t − x) d(F̂n − F0)(x) = ∫ θh,t,F0 d(Gn − G0) + op( n^{−1/2} h^{−3/2} ),
and yields part (i) of Theorem 5.1.
Part (ii) of Theorem 5.1 now follows from

    ∫ Kh(t − x) dF0(x) − f0(t) = ∫ K(u) { f0(t + hu) − f0(t) } du ∼ (1/2) h^2 f0′′(t) ∫ u^2 K(u) du   as h ↓ 0,

where we use that the kernel K integrates to 1 and is symmetric around zero. Combining (i) and (ii), using that hn ∼ c · n^{−1/7}, we get (iii). This finishes the proof of Theorem 5.1.   ✷
References
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Geer, S. van de (1996). Rates of convergence for the maximum likelihood estimator in mixture models. Journal of Nonparametric Statistics 6, 293-310.

Geer, S. van de (2000). Applications of Empirical Process Theory. Cambridge University Press, Cambridge, U.K.

Geskus, R.B. and Groeneboom, P. (1996). Asymptotically optimal estimation of smooth functionals for interval censoring, part 1. Statistica Neerlandica 50, 69-99.

Geskus, R.B. and Groeneboom, P. (1997). Asymptotically optimal estimation of smooth functionals for interval censoring, part 2. Statistica Neerlandica 51, 201-219.

Groeneboom, P. (1996). Lectures on inverse problems. In: Lectures on Probability Theory, Ecole d'Eté de Probabilités de Saint-Flour XXIV-1994, Editor: P. Bernard. Springer Verlag, Berlin.

Groeneboom, P. (1998). Nonparametric Estimation for Inverse Problems: Algorithms and Asymptotics. Special topics course 593C, Department of Statistics, University of Washington, Seattle, USA.

Groeneboom, P., Jongbloed, G. and Wellner, J.A. (2002). The support reduction algorithm for computing nonparametric function estimates. Manuscript in progress.

Groeneboom, P. and Wellner, J.A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser, Basel.

Hall, P., Ruymgaart, F., van Gaans, O. and van Rooy, A. (2001). Inverting noisy integral equations using wavelet expansions: a class of irregular convolutions. In: State of the Art in Probability and Statistics: Festschrift for Willem R. van Zwet, Vol. 30 of Lecture Notes-Monograph Series. Institute of Mathematical Statistics, pp. 251-280.

Johnstone, I.M. and Raimondo, M. (2002). Periodic boxcar deconvolution and diophantine approximation. Technical Report. http://www-stat.stanford.edu/~imj/Reports/index.html

O'Sullivan, F. and Roy Choudhury, K. (2001). An analysis of the role of positivity and mixture model constraints in Poisson deconvolution problems. J. Computational and Graphical Statistics 10, 673-696.

Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order Restricted Statistical Inference. Wiley, New York.

Roy Choudhury, K. (1998). Additive Mixture Models for Multichannel Image Data. Ph.D. Dissertation, University of Washington, Seattle.

Simar, L. (1976). Maximum likelihood estimation of a compound Poisson process. Ann. Statist. 4, 1200-1209.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall, London.