Adaptive Bayesian Density Regression For High-Dimensional Data
DOI: 10.3150/14-BEJ663
Density regression provides a flexible strategy for modeling the distribution of a response variable Y given
predictors X = (X1, . . . , Xp) by treating the conditional density of Y given X as a completely unknown
function and allowing its shape to change with the value of X. The number of predictors p may be very
large, possibly much larger than the number of observations n, but the conditional density is assumed to
depend only on a much smaller number of predictors, which are unknown. In addition to estimation, the
goal is also to select the important predictors which actually affect the true conditional density. We consider
a nonparametric Bayesian approach to density regression by constructing a random series prior based on
tensor products of spline functions. The proposed prior also incorporates the issue of variable selection.
We show that the posterior distribution of the conditional density contracts adaptively around the truth at
nearly the optimal oracle rate, determined by the unknown sparsity and smoothness levels, even in the ultra high-
dimensional settings where p increases exponentially with n. The result is also extended to the anisotropic
case where the degree of smoothness can vary in different directions, and both random and deterministic
predictors are considered. We also propose a technique to calculate posterior moments of the conditional
density function without requiring Markov chain Monte Carlo methods.
1. Introduction
We consider Bayesian estimation of the conditional density of a response Y given a large number
of predictors X = (X1 , . . . , Xp ), where p is possibly much larger than the sample size n. This
problem is sometimes referred to as density regression and has received attention in many scien-
tific application areas such as genome association studies. Non-Bayesian approaches to density
regression usually focus on the kernel approach [14,25], which requires estimating the bandwidth
using cross-validation [15], bootstrap [25] or other methods.
In the Bayesian literature, there are two common approaches to density regression. One ap-
proach models the joint density and obtains the conditional density as a by-product. The other
approach directly models the conditional density while leaving the marginal distribution of X un-
specified. In this paper, we focus on the latter approach in a nonparametric manner. Many of the
existing methods are based on assigning priors on the space of densities through countable mixtures.
In this paper, we consider the Bayesian density regression problem using a finite linear com-
bination of tensor products of B-splines to construct a prior distribution. We obtain the posterior
contraction rate in the $p \gg n$ setting and show that the rate is adaptive to both the dimension
and the smoothness, in the sense that it agrees with the optimal rate of convergence (up to a logarith-
mic factor) of the oracle procedure that uses the knowledge of the true predictors and the underlying
smoothness of the true conditional density, simultaneously for all smoothness levels and dimen-
sion of true predictors. We further extend the result to the anisotropic situation where smoothness
can vary in different coordinates of the true conditional density function, and allow both random
and deterministic predictors. We also devise an effective computing strategy. Typically, a re-
versible jump Markov chain Monte Carlo (RJMCMC) introduced by [21] is used for Bayesian
computation for models with varying dimension. For high dimensional data, RJMCMC-based
methods may be computationally challenging and may give unreliable results due to limited
exploration of the model space. We propose a group-wise Dirichlet distribution prior on the co-
efficients of B-spline functions that leads to a conjugacy-like structure which can be utilized to
develop a computing algorithm based on direct sampling without resorting to MCMC techniques.
As in the univariate density estimation example in [37], the proposed computing method presents
closed form expressions for posterior moments including the mean and the variance.
The paper is organized as follows. In the next section, we describe the model and the prior
and discuss some preliminaries on tensor product of B-splines. Posterior contraction rates for
both the isotropic and anisotropic cases and for both random and deterministic predictors are
obtained in Section 3. Computational strategies are described and simulation results are presented
in Section 4. Proofs are presented in Section 5.
Let $\mathbb{N} = \{1, 2, \ldots\}$, $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$ and $\Delta_J$ be the unit $J$-dimensional simplex. For any real
number $x$, define $\lfloor x\rfloor$ to be the largest integer less than or equal to $x$. For a multi-index $l =
(l_1, \ldots, l_d) \in \mathbb{N}_0^d$, $d \in \mathbb{N}$, we define the sum $l. = l_1 + \cdots + l_d$ and the mixed partial derivative
operator $D^l = \partial^{l.}/\partial x_1^{l_1}\cdots\partial x_d^{l_d}$. For a bounded open connected set $\Omega \subset \mathbb{R}^d$ (e.g., $\Omega = (0,1)^d$),
define the $\beta$-Hölder class $C^\beta(\Omega)$ as the collection of functions $f$ on $\Omega$ that have bounded mixed
partial derivatives $D^l f$ of all orders up to $l. \le \beta_0$ and such that for every $l \in \mathbb{N}_0^d$ satisfying $l. = \beta_0$,
\[
\bigl|D^l f(x) - D^l f(y)\bigr| \le C\|x - y\|_2^{\beta - \beta_0} \tag{1}
\]
for some constant $C > 0$, any $x, y \in \Omega$, with $\beta_0$ the largest integer strictly smaller than $\beta$. Any
such function uniquely extends to a continuous function on the closure of $\Omega$.
Let the indicator function be denoted by $\mathbb{1}$. We use "$\lesssim$" to denote an inequality up to a constant
multiple. We write $f \asymp g$ if $f \lesssim g \lesssim f$. Let $D(\varepsilon, T, \rho)$ denote the packing number, which is
defined as the maximum cardinality of an $\varepsilon$-dispersed subset of $T$ with respect to the distance $\rho$.
The symbol P will stand for a generic probability measure.
B-spline functions and their tensor-products have been widely used to approximate functions
in both mathematics and statistics literature. Here we provide a brief overview of their defini-
tions and approximation properties; see [9] for more details. For $K, q \in \mathbb{N}$, let the unit
interval (0, 1) be divided into K equally spaced subintervals. A spline of
order q with knots at the end points of these intervals is a function f such that the restriction
of f in each subinterval is a polynomial of degree less than q and if q ≥ 2, f is (q − 2)-times
continuously differentiable (interpreted as only continuous if q = 2). Splines of order q form a
linear space of dimension J = K + q − 1, a convenient basis of which is given by the set of
B-splines $B_1, \ldots, B_J$. In particular, if $q = 1$, then the corresponding B-splines form the Haar ba-
sis $\{\mathbb{1}_{((j-1)/J,\, j/J]} : 1 \le j \le J\}$. The B-splines are nonnegative and add up to one at any x. Each
Bj is positive only on an interval of length q/K and at most q many B-splines are nonzero at
any given x. Most importantly, splines of order $q$ with knots at $\{0, 1/K, \ldots, (K-1)/K, 1\}$ ap-
proximate any function in $C^\alpha(0,1)$ at the rate $K^{-\alpha}$, or equivalently, any $f \in C^\alpha(0,1)$ can be
approximated by a linear combination of $B_1, \ldots, B_J$ up to an error of the order $J^{-\alpha}$.
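For readers who want to experiment with this basis, the following minimal sketch (not part of the paper, assuming SciPy is available) builds the $J = K + q - 1$ B-splines of order $q$ with equally spaced knots on $(0,1)$ and checks the properties quoted above; the function name bspline_basis is ours.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_basis(K, q):
        # B-splines of order q (degree q-1) with knots 0, 1/K, ..., 1 (endpoints repeated)
        k = q - 1
        t = np.r_[np.zeros(k), np.linspace(0.0, 1.0, K + 1), np.ones(k)]
        J = len(t) - q                      # = K + q - 1 basis functions
        basis = []
        for j in range(J):
            c = np.zeros(J)
            c[j] = 1.0                      # j-th coordinate vector as coefficients
            basis.append(BSpline(t, c, k, extrapolate=False))
        return basis

    B = bspline_basis(K=5, q=3)
    x = np.linspace(0.01, 0.99, 7)
    vals = np.array([[Bj(xi) for Bj in B] for xi in x])
    print(np.allclose(vals.sum(axis=1), 1.0))   # partition of unity
    print((vals > -1e-12).all())                # nonnegativity (up to rounding)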
This idea works in the multidimensional case as well. For $(0,1)^d$, $d \in \mathbb{N}$, we split $(0,1)$ into
$K_i$ equal intervals in the $i$th direction and consider the corresponding spline functions $B_1, \ldots, B_{J_i}$,
where $J_i = q + K_i - 1$. Hence, there are $\prod_{i=1}^d K_i$ equal cubes in total. Define tensor-product
B-spline basis functions as the product of univariate basis functions of each direction:
\[
B_j(x_1, \ldots, x_d) = \prod_{k=1}^d B_{j_k}(x_k), \qquad j = (j_1, \ldots, j_d),\ j_k = 1, \ldots, J_k,\ k = 1, \ldots, d. \tag{2}
\]
For simplicity, we use B to denote a column vector of all basis functions and define the total
number of basis functions by $J = \prod_{k=1}^d J_k$.
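As an illustration of (2) (again a sketch of ours, reusing bspline_basis from the earlier example), tensor-product basis values at a point can be assembled from outer products of the univariate basis values:

    import numpy as np

    def tensor_bspline_values(x, K_list, q):
        # returns an array of shape (J_1,...,J_d) with entries B_j(x) = prod_k B_{j_k}(x_k)
        factors = []
        for xk, Kk in zip(x, K_list):
            basis_k = bspline_basis(Kk, q)          # from the previous sketch
            factors.append(np.array([Bj(xk) for Bj in basis_k]))
        out = factors[0]
        for f in factors[1:]:
            out = np.multiply.outer(out, f)         # outer product builds the tensor
        return out

    vals = tensor_bspline_values((0.3, 0.7), K_list=[5, 4], q=3)
    print(vals.shape, np.isclose(vals.sum(), 1.0))  # (7, 6); entries again sum to one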
Tensor-products of B-splines maintain a lot of nice properties that the univariate B-splines
enjoy. In the following, we list a few of them that will be used in our modeling:
(iii) For every $x$, $B_j(x) > 0$ only if $\lceil x_i K\rceil \le j_i \le \lceil x_i K\rceil + q - 1$ for every $i = 1, \ldots, d$.
We also define the normalized version of a univariate B-spline $B$ by $\bar B = B/\int_0^1 B(x)\,dx$. Like
univariate B-splines, the approximation ability of tensor-product B-splines is determined by the
smoothness level α of the function to be approximated and J provided that q is chosen to be
larger than α. In the following lemma, we state their approximation results. In particular, the
result in part (c) suggests that the approximation power remains the same when the coefficients
satisfy certain restrictions (positivity, summing to one), which later helps us assign prior distri-
butions.
Lemma 1. (a) For any function $f \in C^\beta((0,1)^d)$, $0 < \beta \le q$, there exist $\theta \in \mathbb{R}^J$ and a constant
$C_1 > 0$ such that
\[
\Bigl\| f - \sum_{j_1=1}^{J_0}\cdots\sum_{j_d=1}^{J_0}\theta_j B_j\Bigr\|_\infty \le C_1 J_0^{-\beta}\bigl\|f^{(\beta)}\bigr\|_\infty,
\]
where $j = (j_1, \ldots, j_d)$.
(b) Further, if $f > 0$, then for sufficiently large $J_0$, we can choose every element of $\theta$ to be
positive.
(c) Assume that $f(y|x_1, \ldots, x_d)$ is a positive density function in $y$ for every $(x_1, \ldots, x_d)$
and, as a function of $(y, x_1, \ldots, x_d)$, belongs to $C^\beta((0,1)^{d+1})$, where $0 < \beta \le q$. Then for suf-
ficiently large $J = J_0^{d+1}$, there exist $\eta \in (0,1)^J$ satisfying $\sum_{j_0=1}^{J_0}\eta_{j_0,j_1,\ldots,j_d} = 1$ for every fixed
$(j_1, \ldots, j_d) \in \{1, \ldots, J_0\}^d$, and a constant $C_2 > 0$ such that
\[
\Bigl\| f(y|x_1, \ldots, x_d) - \sum_{j_0=1}^{J_0}\cdots\sum_{j_d=1}^{J_0}\eta_{j_0,\ldots,j_d}\bar B_{j_0}(y)\prod_{k=1}^d B_{j_k}(x_k)\Bigr\|_\infty \le C_2 J_0^{-\beta} = C_2 J^{-\beta/(d+1)}.
\]
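The error decay in Lemma 1(a) can be checked numerically in the univariate case. The sketch below (ours, assuming SciPy's make_lsq_spline) fits least-squares splines of increasing dimension J to a smooth test function and reports the sup-norm error on a fine grid:

    import numpy as np
    from scipy.interpolate import make_lsq_spline

    f = lambda u: np.sin(2 * np.pi * u) + u ** 2
    x = np.linspace(0.0, 1.0, 2000)
    q = 4                                            # cubic splines, degree k = q - 1 = 3
    for K in (4, 8, 16, 32):
        k = q - 1
        t = np.r_[np.zeros(k), np.linspace(0.0, 1.0, K + 1), np.ones(k)]
        spl = make_lsq_spline(x, f(x), t, k=k)       # least-squares spline with J = K + q - 1 terms
        print(K + q - 1, np.max(np.abs(f(x) - spl(x))))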
We consider the data generated from n independent and identically distributed pairs of obser-
vations (Y1 , X1 ), . . . , (Yn , Xn ), where Yi ∈ (0, 1) and Xi ∈ (0, 1)p for every i = 1, . . . , n, and
p ∈ N. It may be noted that the unit intervals appearing in the ranges of these random variables
are not special as we can apply an appropriate affine transform on the data otherwise. We assume
that Y is related only to d covariates, say $X_{m_1^0}, \ldots, X_{m_d^0}$, that is, the conditional density of Y
given X is a function of these coordinates only. This is an important sparsity assumption that
will allow us to make valid inference about the conditional density even when p is very large,
provided that d is small. However, neither d nor these indexes are known. The goal is to estimate
the conditional density of Y given X with accuracy comparable with the oracle procedure which
assumes the knowledge of d and m01 , . . . , m0d using a Bayesian procedure.
A prior on the conditional density given p covariate values $x_1, \ldots, x_p$ can be induced by a
finite series expansion in terms of tensor products of B-splines,
\[
h(y, x\,|\,J_0, J, \eta) = \sum_{j_0=1}^{J_0}\cdots\sum_{j_p=1}^{J_p}\eta_{j_0,\ldots,j_p}\bar B_{j_0}(y)\prod_{k=1}^p B_{j_k}(x_k), \tag{3}
\]
When only the predictors with $\gamma_k = 1$, say $X_{m_1}, \ldots, X_{m_r}$, enter the model, the corresponding coefficients can be denoted by $\theta_{j_0, j_{m_1},\ldots,j_{m_r}}$. Now the conditional density can be written as
\[
h(y, x\,|\,J_0, J, \eta) = \sum_{j_0=1}^{J_0}\sum_{j_{m_1}=1}^{J_{m_1}}\cdots\sum_{j_{m_r}=1}^{J_{m_r}}\theta_{j_0, j_{m_1},\ldots,j_{m_r}}\bar B_{j_0}(y)\prod_{k:\,\gamma_k=1}B_{j_k}(x_k). \tag{4}
\]
By assigning prior distributions on indicator variables γk , number of terms Jk and the corre-
sponding coefficients θ , we obtain an induced prior on f . The prior on the model indicator γ
is constructed by first putting a prior on the total model size r, and then selecting models with
size r. More specifically, we construct the prior distribution through the following scheme:
(A1) Prior on the model size r: Let $r = \sum_{k=1}^p\gamma_k$ be the number of variables included in the
model and $\Pi_1$ be a fixed, positive prior probability mass function for $r$. Assume that there
exist constants $c_0, t_0 > 0$ such that for every $r \in \mathbb{N}$,
\[
\Pi_1(r) \le \exp\bigl\{-\exp\bigl(c_0 r^{t_0}\bigr)\bigr\}. \tag{5}
\]
(A2) Prior on the inclusion variables $(\gamma_1, \ldots, \gamma_p)$: Given a value of $r$, define the support
of $\gamma$ by $\{m_1, \ldots, m_r\}$. We assume that the probability $\Pi_2(m_1, \ldots, m_r\,|\,r)$ of each set of
variables $1 \le m_1 < \cdots < m_r \le p$ of size $r$ satisfies
\[
c_1\binom{p}{r}^{-1} \le \Pi_2(m_1, \ldots, m_r\,|\,r) \le c_1'\binom{p}{r}^{-1}.
\]
(A3) Prior on the numbers of terms in the basis expansion: Given $r$ and $m_1 < \cdots < m_r$, the numbers $J_0, J_{m_1}, \ldots, J_{m_r}$ are independently distributed according to a common prior $\tilde\Pi_3(\cdot\,|\,r)$, that is,
\[
\Pi_3(J_0, J_{m_1}, \ldots, J_{m_r}\,|\,r; m_1, \ldots, m_r) = \tilde\Pi_3(J_0\,|\,r)\prod_{k=1}^r\tilde\Pi_3(J_{m_k}\,|\,r).
\]
(A4) Prior on the coefficients: Given the values of $r$, $m_1, \ldots, m_r$ and $J_0, J_{m_1}, \ldots, J_{m_r}$, recall
that the conditional density of Y given X is written as in (4),
where $(\theta_{j_0, j_{m_1},\ldots,j_{m_r}} : 1 \le j_0 \le J_0) \in \Delta_{J_0}$ for every $j_{m_1}, \ldots, j_{m_r}$. We let every
$(\theta_{j_0, j_{m_1},\ldots,j_{m_r}} : 1 \le j_0 \le J_0) \in \Delta_{J_0}$ be distributed independently with identical prior dis-
tribution $\tilde\Pi_4(\cdot\,|\,J_0)$ and denote the induced prior on the coefficients by $\Pi_4(\cdot\,|\,r; m_1,
\ldots, m_r; J_0, J_{m_1}, \ldots, J_{m_r})$. In particular, we choose $\tilde\Pi_4$ to be a Dirichlet distribution
$\mathrm{Dir}(a, \ldots, a)$, where $a$ is a fixed positive constant.
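To make the hierarchical scheme (A1)-(A4) concrete, the sketch below draws one realization from a prior of this form; the particular pmf for $r$, the Poisson mean lam, and the rounding of $N^{1/(r+1)}$ to an integer are illustrative choices of ours, not the paper's defaults.

    import numpy as np

    def sample_prior(p, rng, r_max=5, lam=50.0, a=1.0):
        # (A1): model size r from an illustrative pmf with a doubly exponential tail (cf. (5))
        sizes = np.arange(1, r_max + 1)
        probs = np.exp(-np.exp(0.5 * sizes))
        probs /= probs.sum()
        r = int(rng.choice(sizes, p=probs))
        # (A2): r predictor indices chosen uniformly without replacement from {1,...,p}
        m = np.sort(rng.choice(p, size=r, replace=False)) + 1
        # (A3): each of J_0, J_{m_1},...,J_{m_r} as J = N^{1/(r+1)} with N a zero-truncated Poisson
        J = []
        for _ in range(r + 1):
            N = 0
            while N == 0:
                N = rng.poisson(lam)
            J.append(max(1, int(round(N ** (1.0 / (r + 1))))))
        J0, Jm = J[0], J[1:]
        # (A4): for each combination of predictor spline indices, an independent
        # Dirichlet(a,...,a) vector over the J0 response-direction indices
        theta = np.empty((J0,) + tuple(Jm))
        for idx in np.ndindex(*Jm):
            theta[(slice(None),) + idx] = rng.dirichlet(np.full(J0, a))
        return r, m, J0, Jm, theta

    r, m, J0, Jm, theta = sample_prior(p=1000, rng=np.random.default_rng(0))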
Our prior (A1) on the model size includes the truncated binomial prior used in [28] as a special
case. Condition (5) implies that $\Pi_1(r > \bar r) \le \exp\{-\exp(c_0'\bar r^{t_0})\}$ for some constant $c_0' > 0$ and
any $\bar r \in \mathbb{N}$. Since $r$ should not be greater than $p$, which changes with $n$, we may also let $\Pi_1$
depend on $n$. In this case, we assume the decay rate to hold with $c_0$ and $t_0$ both
free from $n$, and for any fixed $\bar d$, $\Pi_1(\bar d)$ is bounded below, or more generally $\Pi_1(\bar d)$ satisfies
$-\log\Pi_1(\bar d) = o(n^\delta)$ for all $\delta > 0$.
In (A2), an easy choice is to let $c_1 = c_1' = 1$, that is, to assign equal probability to each choice of $r$
indices from $\{1, \ldots, p\}$. We may also allow $c_1, c_1'$ to depend on $n$ as long as $\log(1/c_1) = o(n^\delta)$ and
$\log c_1' = o(n^\delta)$ for all $\delta > 0$. The posterior contraction rate obtained in the next section
will remain the same.
For (A3), using the same prior $\tilde\Pi_3$ for every coordinate is not necessary, but it is a convenient and appropriate
default choice. The independence between components is also not essential. In Section 3.2, where
the true density function is anisotropic, we shall have to use a different and more complicated
prior, which will obviously be appropriate for this isotropic case as well. Relation (6) is satisfied
with $\kappa = \kappa' = 1$ if a zero-truncated Poisson distribution is assigned to $J^{r+1}$, in the sense that $N$
is a zero-truncated Poisson variable and $J = N^{1/(r+1)}$.
In (A4), using the same value of $a$ for every coordinate is not necessary, but we use it as a default choice. In particular,
$a = 1$ leads to the uniform prior distribution on the simplex. The same contraction rate will be
obtained as long as the parameters are kept in a fixed compact subset of $(0, \infty)$. More generally,
we may allow the lower bound to approach zero at most polynomially fast in $n^{-1}$, although the
upper bound needs to be fixed.
Define a distance $\rho$ on conditional densities by $\rho^2(f_1, f_2) = \int\!\!\int\{f_1^{1/2}(y|x) - f_2^{1/2}(y|x)\}^2\,dy\,dG(x)$, where $f_1$ and $f_2$ stand for generic conditional densities of Y on (0, 1) given X in $(0,1)^p$. Let $f_0$
be a fixed conditional density function for Y on (0, 1) given X in $(0,1)^p$, standing for the true
conditional density. We say that the posterior distribution of the density regression model based
on the prior $\Pi$ contracts to $f_0$ at a rate $\varepsilon_n$ in the metric $\rho$ if, for any $M_n \to \infty$,
\[
\lim_{n\to\infty}\Pi\bigl\{f : \rho(f_0, f) > M_n\varepsilon_n\,\big|\,X^n, Y^n\bigr\} = 0 \quad\text{in probability.} \tag{8}
\]
Theorem 1. Suppose that we have i.i.d. observations X1 , . . . , Xn from a possibly unknown prob-
ability distribution G on (0, 1)p . Assume that the true conditional density satisfies conditions
(B1)–(B3). If the prior satisfies conditions (A1)–(A4), then the posterior distribution of f con-
tracts at f0 at the rate
\[
\varepsilon_n = \max\bigl\{n^{-(1-\alpha)/2}(\log n)^{1/(2t_0)},\ n^{-\beta/(2\beta+d+1)}(\log n)^{\kappa\beta/(2\beta+d+1)}\bigr\} \tag{9}
\]
with respect to ρ, where t0 and κ are defined in (A1) and (A3).
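As a quick arithmetic illustration of the second term in (9): if $d = 1$ and $\beta = 2$, and if the first term in the maximum happens to be the smaller of the two, then
\[
\varepsilon_n = n^{-2/(2\cdot 2 + 1 + 1)}(\log n)^{2\kappa/(2\cdot 2 + 1 + 1)} = n^{-1/3}(\log n)^{\kappa/3},
\]
which is, up to the logarithmic factor, the rate $n^{-\beta/(2\beta+d+1)}$ for estimating a $(d+1)$-variate density of smoothness $\beta$.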
Remark 1. Theorem 1 establishes contraction rates for the posterior distribution of the entire
conditional density function f (y|x). As a consequence, we can obtain the same posterior con-
traction rate for other quantities of interest such as conditional quantile functions, conditional
moment functions and so on. This rate may not be optimal for the estimation of these quantities
because $y$ has been integrated out; that is, we conjecture that the optimal rate is $n^{-\beta/(2\beta+d)}$ instead
of $n^{-\beta/(2\beta+d+1)}$, up to logarithmic factors.
Remark 2. After examining the proof, we find that condition (5) in (A1) can be relaxed if α is
small. For example, if $\alpha = 0$, then we only need $\Pi_1(r) \le \exp(-c_0 r^{t_0})$.
If predictors are qualitatively different, then it may be interesting to consider the situation where
f0 has different smoothness levels in different directions. In the following, we propose an alter-
native anisotropic smoothness assumption replacing condition (B1).
For $\beta = (\beta_0, \ldots, \beta_d) \in \mathbb{N}^{d+1}$ and $\beta_0, \ldots, \beta_d \le q$, define a tensor Sobolev space $S^\beta((0,1)^{d+1})$
of functions $f$ of $(d+1)$ variables by
\[
S^\beta\bigl((0,1)^{d+1}\bigr) = \bigl\{f : \|D^l f\|_\infty < \infty,\ l = (l_0, \ldots, l_d),\ l_k \le \beta_k,\ k = 0, \ldots, d\bigr\}.
\]
As in Lemma 1, we show that the tensor-product B-splines still have nice approximation abil-
ities within anisotropic function spaces.
Lemma 2. (a) For any function $f \in S^\beta((0,1)^{d+1})$, where $0 < \beta_0, \ldots, \beta_d \le q$, there exist $\theta \in
\mathbb{R}^{\prod_{k=0}^d J_k}$ and a constant $C_1 > 0$ depending only on $q$, $d$ and $\beta_0, \ldots, \beta_d$ such that
\[
\Bigl\| f - \sum_{j_0=1}^{J_0}\cdots\sum_{j_d=1}^{J_d}\theta_j B_j\Bigr\|_\infty \le C_1\sum_{k=0}^d J_k^{-\beta_k}\Bigl\|\frac{\partial^{\beta_k} f}{\partial x_k^{\beta_k}}\Bigr\|_\infty,
\]
where $j = (j_0, \ldots, j_d)$.
(b) Further, if $f > 0$, we can choose every element of $\theta$ to be positive.
(c) Assume that $f(y|x_1, \ldots, x_d)$ is a positive probability density function in $y$ for ev-
ery $(x_1, \ldots, x_d)$ and, as a function of $(y, x_1, \ldots, x_d)$, belongs to $S^\beta((0,1)^{d+1})$, where $\beta =
(\beta_0, \ldots, \beta_d) \in \mathbb{N}^{d+1}$ satisfies $0 < \beta_0, \ldots, \beta_d \le q$. Then there exist $\eta \in (0,1)^{\prod_{k=0}^d J_k}$ satisfying
$\sum_{j_0=1}^{J_0}\eta_{j_0,j_1,\ldots,j_d} = 1$ for every fixed $(j_1, \ldots, j_d) \in \{1, \ldots, J_1\}\times\cdots\times\{1, \ldots, J_d\}$ and a constant
$C_2 > 0$ such that
\[
\Bigl\| f(y|x_1, \ldots, x_d) - \sum_{j_0=1}^{J_0}\cdots\sum_{j_d=1}^{J_d}\eta_{j_0,\ldots,j_d}\bar B_{j_0}(y)\prod_{k=1}^d B_{j_k}(x_k)\Bigr\|_\infty \le C_2\sum_{k=0}^d J_k^{-\beta_k}.
\]
(B4) We assume that the true density $f_0$ is only related to $d$ predictors with indices $1 \le m_1^0 <
\cdots < m_d^0 \le p$, where $d$ is a fixed number, and as a function of $(y, x_{m_1^0}, \ldots, x_{m_d^0})$ belongs
to $S^\beta((0,1)^{d+1})$.
In order to obtain the adaptive convergence rate, we replace the independent prior distribution
on J in condition (A3) by the following joint distribution condition.
(A3') Prior on the number of terms in the basis expansion: Given the model size $r$ and active
predictor indices $1 \le m_1 < \cdots < m_r \le p$, let the numbers of terms in the basis expansion
of $(r+1)$-fold tensor products of B-splines be denoted by $(J_0, J_{m_1}, \ldots, J_{m_r})$, and let
$\Pi_3(\cdot\,|\,r; m_1, \ldots, m_r)$ stand for their joint prior distribution. We assume that for some
fixed constants $c_2, c_2' > 0$ and $\kappa \ge \kappa' \ge 1$,
\[
\exp\Bigl\{-c_2 J_0\prod_{k=1}^r J_{m_k}\Bigl(\log J_0 + \sum_{k=1}^r\log J_{m_k}\Bigr)^{\kappa}\Bigr\}
\le \Pi_3(J_0, J_{m_1}, \ldots, J_{m_r}\,|\,r; m_1, \ldots, m_r)
\le \exp\Bigl\{-c_2' J_0\prod_{k=1}^r J_{m_k}\Bigl(\log J_0 + \sum_{k=1}^r\log J_{m_k}\Bigr)^{\kappa'}\Bigr\}.
\]
Clearly, the rate reduces to that of the isotropic case when β0 = · · · = βd . Thus the rate now
can be viewed as the optimal rate (up to a logarithmic factor) of estimating a (d + 1)-variate func-
tion with smoothness levels β0 , . . . , βd . Note that the rate is determined by the harmonic mean
of smoothness levels in different coordinates, thus the worst smoothness has the most prominent
effect on the rate. However, the rate thus obtained is strictly better than that obtained by a naive
application of Theorem 1 using the worst smoothness condition in all co-ordinates. Thus addi-
tional smoothness in the other co-ordinates helps improve the rate from the worst-case scenario. This
result agrees with the minimax rate associated with estimating a (d + 1)-dimensional anisotropic
density with respect to the Hellinger distance (cf. [1]). Interestingly, the posterior automatically
adapts to the vector of smoothness levels. Noticeably, as in the isotropic case, the ambient
dimension p does not affect the rate unless it grows exponentially in n. It will be interesting
to generalize the result to allow anisotropic Hölder classes with noninteger smoothness levels as
in [38]. Since approximation properties of tensor products of B-splines are presently known only
for integer smoothness in the anisotropic case, we restrict the smoothness parameters to integer
values only.
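As a concrete arithmetic illustration of the role of the harmonic mean, take $d = 2$ and $\beta = (2, 1, 2)$. Then
\[
\beta^{*} = (d+1)\Bigl(\sum_{k=0}^{d}\beta_k^{-1}\Bigr)^{-1} = \frac{3}{\tfrac12 + 1 + \tfrac12} = \frac32,
\qquad
\frac{\beta^{*}}{2\beta^{*}+d+1} = \frac{3/2}{6} = \frac14,
\]
so the anisotropic rate is $n^{-1/4}$ up to logarithmic factors, whereas applying Theorem 1 with the worst smoothness level 1 in every coordinate would only give $n^{-1/(2+3)} = n^{-1/5}$.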
Our method also applies to the case of deterministic predictors. In order to obtain the posterior
convergence rate, we first define the empirical measure $\mathbb{P}_n^X = n^{-1}\sum_{i=1}^n\delta_{X_i}$, in which $\delta_{X_i}$ is
a point-mass probability measure at $X_i$, based on the observations $X_1, \ldots, X_n$. Then we define
an empirical Hellinger distance on the space of conditional densities by
\[
\rho_n^2(f_1, f_2) = \int\!\!\int\bigl\{f_1^{1/2}(y|x_1, \ldots, x_p) - f_2^{1/2}(y|x_1, \ldots, x_p)\bigr\}^2\,dy\,\mathbb{P}_n^X(dx_1, \ldots, dx_p), \tag{11}
\]
where f1 and f2 are generic conditional densities of Y on (0, 1) given X in (0, 1)p . We can
obtain the same posterior contraction rates as the case of random predictors for both isotropic
and anisotropic cases.
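For completeness, the empirical distance (11) is straightforward to approximate numerically; the following sketch (ours, with a simple Riemann sum over y and user-supplied density callables) averages over the design points:

    import numpy as np

    def empirical_hellinger(f1, f2, X, n_grid=400):
        # f1, f2: callables f(y, x) returning a conditional density value; X: (n, p) design
        y = np.linspace(0.0, 1.0, n_grid)
        dy = y[1] - y[0]
        total = 0.0
        for x in X:
            g1 = np.array([f1(yi, x) for yi in y])
            g2 = np.array([f2(yi, x) for yi in y])
            total += np.sum((np.sqrt(g1) - np.sqrt(g2)) ** 2) * dy   # inner integral over y
        return np.sqrt(total / len(X))                               # average over X_i, then root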
4. Numerical results
4.1. Computation
First, we ignore that we have a variable selection issue and pretend that we know which d
predictors are relevant, say {m1 , . . . , md }. Thus, we may pretend that we are in the fixed di-
mensional setting p = d and all predictors are relevant. Then given the observations (Xi , Yi ) =
$(X_{i1}, \ldots, X_{id}; Y_i)$, $i = 1, \ldots, n$, the conditional likelihood
\[
L(\eta, J\,|\,X, Y) = \prod_{i=1}^n\sum_{j_0=1}^{J_0}\sum_{j_{m_1}=1}^{J_{m_1}}\cdots\sum_{j_{m_d}=1}^{J_{m_d}}\eta_{j_0, j_{m_1},\ldots,j_{m_d}}\bar B_{j_0}(Y_i)\prod_{k=1}^d B_{j_{m_k}}(X_{i,m_k}) \tag{12}
\]
expands to $\sum_{s_1\in\mathcal{J}}\cdots\sum_{s_n\in\mathcal{J}}\prod_{i=1}^n\eta_{s_i}B^*_{s_i}(Y_i|X_i)$, where $\mathcal{J} = \{1, \ldots, J_0\}\times\cdots\times\{1, \ldots, J_{m_d}\}$
and $B^*_s(y|x)$ is defined as $\bar B_{s_0}(y)\prod_{k=1}^d B_{s_{m_k}}(x_{m_k})$ for every $(d+1)$-dimensional index $s$ taking values in $\mathcal{J}$. The posterior distribution satisfies
\[
P(J, \eta\,|\,X, Y) \propto P(X, Y\,|\,J, \eta)\,\Pi_4(\eta\,|\,J)\,\tilde\Pi_3(J_0)\prod_{k=1}^d\tilde\Pi_3(J_{m_k}).
\]
In view of (12) and the form of $f(y|x)$ given by (4), both the numerator
and the denominator of (13) involve computing integrals of the form $I(s_1, \ldots, s_n) = \int\pi(\eta)\prod_{k=1}^n\eta_{s_k}\,d\eta$, with $\pi$ the prior density of $\eta$. We collect coefficients
$\eta$ with the same index together to form their powers and observe that, by (A4), coefficients
whose indices differ anywhere except in the zeroth co-ordinate are independent, and the collection
of coefficients with the same last d components are Dirichlet distributed. In view of the conju-
gacy of the Dirichlet functional form with respect to a multinomial function, I (s1 , . . . , sn ) can
be written down in terms of products of certain Dirichlet forms, and hence can be computed for
any given (s1 , . . . , sn ). Therefore (13) simplifies to
\[
\frac{\displaystyle\sum_{J_0=1}^{\infty}\cdots\sum_{J_{m_d}=1}^{\infty}\tilde\Pi_3(J_0)\prod_{k=1}^{d}\tilde\Pi_3(J_{m_k})\sum_{s_0\in\mathcal{J}}\cdots\sum_{s_n\in\mathcal{J}}I(s_0, \ldots, s_n)\prod_{i=0}^{n}B^*_{s_i}(Y_i|X_i)}
{\displaystyle\sum_{J_0=1}^{\infty}\cdots\sum_{J_{m_d}=1}^{\infty}\tilde\Pi_3(J_0)\prod_{k=1}^{d}\tilde\Pi_3(J_{m_k})\sum_{s_1\in\mathcal{J}}\cdots\sum_{s_n\in\mathcal{J}}I(s_1, \ldots, s_n)\prod_{i=1}^{n}B^*_{s_i}(Y_i|X_i)},
\tag{14}
\]
where $(Y_0, X_0)$ is interpreted as the point $(y, x)$ at which the posterior mean of $f(y|x)$ is evaluated.
\[
W^0(m_1, \ldots, m_r\,|\,X, Y) = \sum_{J_0=1}^{\infty}\sum_{J_{m_1}=1}^{\infty}\cdots\sum_{J_{m_r}=1}^{\infty}\tilde\Pi_3(J_0)\prod_{k=1}^{r}\tilde\Pi_3(J_{m_k})\sum_{s_0\in\mathcal{J}}\cdots\sum_{s_n\in\mathcal{J}}I(s_0, \ldots, s_n)\prod_{i=0}^{n}B^*_{s_i}(Y_i|X_i),
\]
\[
W^1(m_1, \ldots, m_r\,|\,X, Y) = \sum_{J_0=1}^{\infty}\sum_{J_{m_1}=1}^{\infty}\cdots\sum_{J_{m_r}=1}^{\infty}\tilde\Pi_3(J_0)\prod_{k=1}^{r}\tilde\Pi_3(J_{m_k})\sum_{s_1\in\mathcal{J}}\cdots\sum_{s_n\in\mathcal{J}}I(s_1, \ldots, s_n)\prod_{i=1}^{n}B^*_{s_i}(Y_i|X_i).
\]
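To spell out the "Dirichlet forms" mentioned above, group the indices $s_i = (s_{i,0}, v_i)$, where $v_i$ collects the last components of $s_i$ (the slice of coefficients sharing one Dirichlet vector). Writing $n_v = \#\{i : v_i = v\}$ and $n_{v,j_0} = \#\{i : v_i = v,\ s_{i,0} = j_0\}$ (count notation introduced here only for illustration), independence across slices and the Dirichlet moment formula under the $\mathrm{Dir}(a, \ldots, a)$ prior of (A4) give the sketch
\[
I(s_1, \ldots, s_n) = \prod_{v}\mathbb{E}\Bigl[\prod_{j_0=1}^{J_0}\eta_{j_0,v}^{\,n_{v,j_0}}\Bigr] = \prod_{v}\frac{\Gamma(J_0 a)}{\Gamma(J_0 a + n_v)}\prod_{j_0=1}^{J_0}\frac{\Gamma(a + n_{v,j_0})}{\Gamma(a)},
\]
which can be evaluated in closed form for any given index combination.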
Similar expressions can be obtained for other posterior moments, in particular, for the posterior
second moment and hence the posterior variance. This estimate can be viewed as a kernel mixture
estimator whose kernel is determined jointly by selected covariates and associated tensor product
B-splines. Since a B-spline basis function takes nonzero values only at q intervals, the calculation
of $W^0$ for a given $r$ involves $(J_{\max} - J_{\min} + 1)^{r+1}q^{(r+1)(n+1)}$ terms if we restrict $J_0$ and each
$J_{m_k}$, $k = 1, \ldots, r$, to take values between $J_{\min}$ and $J_{\max}$. Then there will be $\sum_{r=1}^{\bar r}\binom{p}{r}(J_{\max} -
J_{\min} + 1)^{r+1}q^{(r+1)(n+1)}$ terms in total. Instead of evaluating all terms, we randomly sample a
number of terms in both the numerator and the denominator and take the associated average values. If we
choose $q = 1$, then the prior can be viewed as a multivariate random histogram and the number
of terms in the expression for the posterior mean reduces to $\sum_{r=1}^{\bar r}\binom{p}{r}(J_{\max} - J_{\min} + 1)^{r+1}$,
although the resulting density estimate will be less smooth and the rate adaptation property of the
posterior distribution will apply only to smoothness up to order 1. We shall make this choice in
our simulation studies to save on computational cost in exchange for some smoothness.
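The direct-sampling idea can be summarized generically as follows (a sketch of ours, not the paper's code): a sum over an index set too large to enumerate is estimated by its cardinality times the average of terms at uniformly drawn indices, and the posterior-mean ratio is then approximated by the ratio of two such estimates.

    def mc_sum(draw_index, term, n_terms, n_draws, rng):
        # Estimate sum_{s in S} term(s), with |S| = n_terms, by uniform sampling of indices.
        vals = [term(draw_index(rng)) for _ in range(n_draws)]
        return n_terms * sum(vals) / len(vals)

    # Applied separately to the numerator and the denominator sums (each with its own
    # index set and term function), the posterior mean is estimated by
    #     numerator_estimate / denominator_estimate,
    # mirroring the N* random draws used in the simulations below.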
For each example, we generate p covariates X1 , . . . , Xp uniformly from [0.05, 0.95]. In the
computation of (15), we randomly draw $N^* = 100$ or $500$ terms in the sums of $W^1$ and $W^0$ for
every fixed choice of m1 , . . . , mr and r. We compare our method (rsp) with least-squares kernel
conditional density estimation (L-S) developed by [40], where they use L1 -regularization to se-
lect variables. Prediction errors under the L2 -loss and their maximum standard errors associated
with 10 Monte Carlo replications are summarized in Tables 1 and 2.
Compared with the least-squares method, our approach performs better in most cases.
Since we directly sample a fixed number of terms from the sums in (15), our prediction error
changes little with $p$, which makes the proposed method outperform L-S when $p$ is
large. Comparing the prediction errors under the choices $N^* = 100$ and $500$, we find that the
performances are quite close to each other. Hence direct sampling does not introduce much
variability. We also carry out a sensitivity analysis by using different parameter values in the prior
distribution, for example, r̄ = 6, λ = 50 and Jmin = 5, Jmax = 10. Similar results are obtained.
In practice, one may choose r̄ as a constant multiple (e.g., twice) of the possible maximal model
size to let all important covariates be included in the considered model with a high probability.
The Poisson mean parameter λ in (C3) shall be modified according to the choice of r̄ to ensure
that λ1/r falls into an appropriate range, say, between 4 and 20.
5. Proofs
Proof of Theorem 1. Note that the conditional density f (y|x) is the same as the joint density
of (X, Y ) at (x, y) with respect to the dominating measure μ equal to the product of G and the
Lebesgue measure. Further, the distance ρ on the space of conditional densities is equivalent to
the Hellinger distance on the space of joint densities with respect to μ. Hence, in order to derive
contraction rate of the posterior distribution of the conditional density at a true density f0 (·|·), we
need only to apply the standard result on posterior convergence rate for (joint) densities given by
Theorem 1 of [18]. The required conditions characterizing the posterior contraction rate εn → 0
can therefore be rewritten in the present context as follows: there exists a sequence of subsets Fn
of the space of conditional densities, called sieves, such that
\[
\Pi\bigl(\mathcal{F}_n^c\bigr) \lesssim \exp\bigl(-8n\varepsilon_n^2\bigr), \tag{18}
\]
\[
\log D(\varepsilon_n, \mathcal{F}_n, \rho) \lesssim n\varepsilon_n^2, \tag{19}
\]
\[
\Pi\bigl\{f : K(f_0, f) \le \varepsilon_n^2,\ V(f_0, f) \le \varepsilon_n^2\bigr\} \gtrsim \exp\bigl(-n\varepsilon_n^2\bigr), \tag{20}
\]
where $K(f_0, f) = \int\!\!\int f_0(y|x)\log\{f_0(y|x)/f(y|x)\}\,dy\,dG(x)$ is the Kullback–Leibler divergence
and $V(f_0, f) = \int\!\!\int f_0(y|x)\log^2\{f_0(y|x)/f(y|x)\}\,dy\,dG(x)$ is the Kullback–Leibler variation.
We define a sieve in the following way:
\[
\mathcal{F}_n = \Bigl\{h(x, y\,|\,r; m_1, \ldots, m_r; J_0, J_{m_1}, \ldots, J_{m_r};\theta) : r \le \bar r_n,\ 1 \le m_1 < \cdots < m_r \le p;\ J_0, J_{m_1}, \ldots, J_{m_r} \le \tilde J_n,\ \theta \in (\Delta_{J_0})^{\prod_{k=1}^r J_{m_k}}\Bigr\}, \tag{21}
\]
where $\tilde J_n = (LJ_n^*)^{(d+1)/(r+1)}(\log n)^{\kappa''/(r+1)}$, $J_n^*$ and $\bar r_n$ are two sequences of numbers going to
infinity, and $L$ and $\kappa''$ are some fixed positive constants. We shall choose the values of these numbers
later.
We first verify (18). Note that $\Pi(\mathcal{F}_n^c)$ is bounded by
\[
\Pi_1(r > \bar r_n) + \sum_{r=1}^{\bar r_n}\sum_{1\le m_1<\cdots<m_r\le p}\Pi_3\bigl(J_{m_k} > \tilde J_n\ \text{for some}\ k = 1, \ldots, r\,\big|\,r, m_1, \ldots, m_r\bigr)
\]
\[
\le \exp\bigl\{-\exp\bigl(c_0\bar r_n^{t_0}\bigr)\bigr\} + \sum_{r=1}^{\bar r_n}\binom{p}{r}\,r\,\tilde\Pi_3\bigl(J > \tilde J_n\bigr)
\le \exp\bigl\{-\exp\bigl(c_0\bar r_n^{t_0}\bigr)\bigr\} + \sum_{r=1}^{\bar r_n}\bar r_n p^{\bar r_n}\exp\bigl\{-c_2'\tilde J_n^{\,r+1}(\log\tilde J_n)^{\kappa'}\bigr\}
\]
\[
\le \exp\bigl\{-\exp\bigl(c_0\bar r_n^{t_0}\bigr)\bigr\} + \bar r_n^2 p^{\bar r_n}\exp\bigl\{-c_3 L^{d+1}(J_n^*)^{d+1}(\log\tilde J_n)^{\kappa'+\kappa''}\bigr\}
\]
\[
= \exp\bigl\{-\exp\bigl(c_0\bar r_n^{t_0}\bigr)\bigr\} + \exp\bigl\{2\log\bar r_n + \bar r_n\log p - c_3 L^{d+1}(J_n^*)^{d+1}(\log\tilde J_n)^{\kappa'+\kappa''}\bigr\}
\le \exp\bigl(-bn\varepsilon_n^2\bigr)
\]
for any b > 0 and some constant c3 > 0 provided L is chosen sufficiently large and the following
relations hold:
\[
\log\bar r_n \lesssim n\varepsilon_n^2,\qquad \bar r_n\log p \lesssim n\varepsilon_n^2,\qquad (J_n^*)^{d+1}(\log n)^{\kappa'+\kappa''} \lesssim n\varepsilon_n^2,\qquad \exp\bigl(c_0\bar r_n^{t_0}\bigr) \gtrsim n\varepsilon_n^2. \tag{22}
\]
Now we bound the covering number $D(\varepsilon_n, \mathcal{F}_n, \rho)$ using the relation $D(\varepsilon_n, \mathcal{F}_n, \rho) \le
D(\varepsilon_n^2, \mathcal{F}_n, \|\cdot\|_1)$, where $\|\cdot\|_1$ stands for the $L_1$-distance on the space of conditional densities
given by
\[
\|f_1 - f_2\|_1 = \int\!\!\int\bigl|f_1(y|x) - f_2(y|x)\bigr|\,dy\,dG(x) \le \sup_x\int\bigl|f_1(y|x) - f_2(y|x)\bigr|\,dy,
\]
since the collection of $B_j(x)$'s adds up to 1 for any $x$. Using the fact that $D(\varepsilon, \Delta_d, \|\cdot\|_1) \le (3/\varepsilon)^d$,
we obtain
\[
D\bigl(\varepsilon_n^2, \mathcal{F}_{n;r;m_1,\ldots,m_r;J_0,J_{m_1},\ldots,J_{m_r}}, \|\cdot\|_1\bigr)
\le \prod_{1\le j_{m_1},\ldots,j_{m_r}\le\tilde J_n} D\bigl(\varepsilon_n^2, \Delta_{J_0}, \|\cdot\|_1\bigr)
\le \prod_{1\le j_{m_1},\ldots,j_{m_r}\le\tilde J_n}\Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{J_0}
= \Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{\tilde J_n^{\,r+1}}
\le \Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}}. \tag{23}
\]
Therefore,
\[
D(\varepsilon_n, \mathcal{F}_n, \rho) \le D\bigl(\varepsilon_n^2, \mathcal{F}_n, \|\cdot\|_1\bigr)
\le \sum_{r=1}^{\bar r_n}\ \sum_{\substack{1\le m_1<\cdots<m_r\le p,\\ 1\le J_0, J_{m_1},\ldots,J_{m_r}\le\tilde J_n}}\Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}}
\le \sum_{r=1}^{\bar r_n}\binom{p}{r}\tilde J_n^{\,r+1}\Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}}
\]
\[
\le \sum_{r=1}^{\bar r_n} p^r L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}\Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}}
\le \bar r_n p^{\bar r_n}L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}\exp\bigl\{L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}\log\bigl(3/\varepsilon_n^2\bigr)\bigr\}
\]
\[
= \exp\bigl\{\log\bar r_n + \bar r_n\log p + (d+1)\log(LJ_n^*) + \kappa''\log\log n + L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa''}\log\bigl(3/\varepsilon_n^2\bigr)\bigr\}
\le \exp\bigl\{c_4(J_n^*)^{d+1}(\log n)^{\kappa''+1} + \bar r_n\log p\bigr\}
\]
for some $c_4 > 0$. Thus it suffices to have the following relations:
\[
(J_n^*)^{d+1}(\log n)^{\kappa''+1} \lesssim n\varepsilon_n^2,\qquad \bar r_n\log p \lesssim n\varepsilon_n^2. \tag{24}
\]
For (20), in order to lower bound the prior concentration probability around $f_0(y|x)$, we shall
restrict to the oracle model consisting of the $d$ true covariates $X_{m_1^0}, \ldots, X_{m_d^0}$. By Lemma 1, there
exists $\theta^0 = (\theta^0_{j_0, j_{m_1^0},\ldots,j_{m_d^0}} : 1 \le j_0, j_{m_1^0}, \ldots, j_{m_d^0} \le J_n^*)$ such that
\[
\sup_{x,y}\bigl|f_0(y|x) - h\bigl(x, y\,|\,d; m_1^0, \ldots, m_d^0; J_n^*, \ldots, J_n^*;\theta^0\bigr)\bigr| \lesssim J_n^{*-\beta} \le \varepsilon_n. \tag{25}
\]
Now for every $(j_{m_1^0}, \ldots, j_{m_d^0})$, we define $\theta^0_{j_{m_1^0},\ldots,j_{m_d^0}} = (\theta^0_{j_0, j_{m_1^0},\ldots,j_{m_d^0}} : 1 \le j_0 \le J_n^*) \in \Delta_{J_n^*}$.
Then $\theta^0$ can be written as $\theta^0 = (\theta^0_{j_{m_1^0},\ldots,j_{m_d^0}} : 1 \le j_{m_1^0}, \ldots, j_{m_d^0} \le J_n^*)$. We consider $\theta =
(\theta_{j_{m_1^0},\ldots,j_{m_d^0}} : 1 \le j_{m_1^0}, \ldots, j_{m_d^0} \le J_n^*)$ with $\theta_{j_{m_1^0},\ldots,j_{m_d^0}} = (\theta_{j_0, j_{m_1^0},\ldots,j_{m_d^0}} : 1 \le j_0 \le J_n^*) \in \Delta_{J_n^*}$. If
\[
\max_{1\le j_{m_1^0},\ldots,j_{m_d^0}\le J_n^*}\bigl\|\theta_{j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_{m_1^0},\ldots,j_{m_d^0}}\bigr\|_1 \le \varepsilon, \tag{26}
\]
then
\[
\bigl|h\bigl(x, y\,|\,d; m_1^0, \ldots, m_d^0; J_n^*, \ldots, J_n^*;\theta\bigr) - h\bigl(x, y\,|\,d; m_1^0, \ldots, m_d^0; J_n^*, \ldots, J_n^*;\theta^0\bigr)\bigr|
\le \sum_{j_0=1}^{J_n^*}\sum_{j_{m_1^0}=1}^{J_n^*}\cdots\sum_{j_{m_d^0}=1}^{J_n^*}\bigl|\theta_{j_0, j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_0, j_{m_1^0},\ldots,j_{m_d^0}}\bigr|\,\bar B_{j_0}(y)B_{j_{m_1^0}}(x_{m_1^0})\cdots B_{j_{m_d^0}}(x_{m_d^0}). \tag{27}
\]
The right-hand side of (27) is bounded by $\varepsilon_n$ provided that
\[
\bigl\|\theta_{j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_{m_1^0},\ldots,j_{m_d^0}}\bigr\|_1 \le J_n^{*-(d+1)}\varepsilon_n\qquad\text{for all } j_{m_1^0}, \ldots, j_{m_d^0}. \tag{28}
\]
To simplify the notation, we denote $h(x, y\,|\,d; m_1^0, \ldots, m_d^0; J_n^*, \ldots, J_n^*;\theta)$ by $f_\theta$. Combining (25)
and (27), we have the desired approximation $\sup_{x,y}|f_0(y|x) - f_\theta(y|x)| \le 2\varepsilon_n$.
Using condition (B3), $\inf f_\theta \ge \inf f_0 - \|f_0 - f_\theta\|_\infty \ge m_0/2$ given that $\varepsilon_n$ is sufficiently small.
This implies that $\|f_0/f_\theta\|_\infty \le 2\|f_0\|_\infty/m_0 < \infty$ since $f_0$ can be regarded as a fixed continuous
function on the compact set $[0,1]^{d+1}$. Hence, for every $f_\theta$ satisfying $\|f_\theta - f_0\|_\infty \le 2\varepsilon_n$,
\[
\rho^2(f_0, f_\theta) = \int\!\!\int\frac{|f_0(y|x) - f_\theta(y|x)|^2}{(f_0^{1/2}(y|x) + f_\theta^{1/2}(y|x))^2}\,dy\,dG(x) \le \frac{1}{m_0}\|f_0 - f_\theta\|_\infty^2 \lesssim \varepsilon_n^2. \tag{29}
\]
Thus, it suffices to lower bound the prior probability of the event in (28), which is
\[
\Pi_1(d)\times\Pi_2\bigl(m_1^0, \ldots, m_d^0\,\big|\,r = d\bigr)\times\tilde\Pi_3\bigl(J_n^*\bigr)^{d+1}\times\prod_{1\le j_{m_1^0},\ldots,j_{m_d^0}\le J_n^*}\tilde\Pi_4\Bigl(\bigl\|\theta_{j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_{m_1^0},\ldots,j_{m_d^0}}\bigr\|_1 \le J_n^{*-(d+1)}\varepsilon_n\Bigr)
\]
\[
\gtrsim \binom{p}{d}^{-1}\exp\bigl\{-(d+1)c_2(J_n^*)^{d+1}\bigl(\log J_n^*\bigr)^{\kappa}\bigr\}\times\exp\Bigl\{-(J_n^*)^{d}\,c_5 J_n^*\log\frac{(J_n^*)^{d+1}}{\varepsilon_n}\Bigr\}
\]
for some constant $c_5 > 0$ by the small ball probability estimates of a Dirichlet distribution in
Lemma 6.1 of [18]. As long as $J_n^*$ and $\varepsilon_n^{-1}$ are powers of $n$ within slowly varying factors, the
last expression can be bounded below by $\exp\{-d\log p - c_6(J_n^*)^{d+1}(\log n)^{\kappa}\}$ for some $c_6 > 0$.
Hence, in order to obtain (20), it suffices to have the following relationships:
\[
J_n^{*-\beta} \lesssim \varepsilon_n,\qquad \log p \lesssim n\varepsilon_n^2,\qquad (J_n^*)^{d+1}(\log n)^{\kappa} \lesssim n\varepsilon_n^2. \tag{31}
\]
We can determine the rate $\varepsilon_n$ as the smallest sequence of numbers that satisfies (22), (24)
and (31), that is,
\[
\varepsilon_n = \max\bigl\{n^{-(1-\alpha)/2}(\log n)^{1/(2t_0)},\ n^{-\beta/(2\beta+d+1)}(\log n)^{\kappa\beta/(2\beta+d+1)}\bigr\},\qquad
J_n^* = \bigl\lfloor\bigl\{n\varepsilon_n^2(\log n)^{-\kappa}\bigr\}^{1/(d+1)}\bigr\rfloor + 1, \tag{32}
\]
and $\kappa'' = \kappa - \kappa'$, $\bar r_n = L'(\log n)^{1/t_0}$ for some sufficiently large $L'$, provided that the condition
$\exp\{\exp(c_0\bar r_n^{t_0})\} \gtrsim n\varepsilon_n^2$ is satisfied.
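For intuition, the second expression in (32) comes from balancing the first and third requirements in (31): setting
\[
J_n^{*-\beta} \asymp \varepsilon_n \quad\text{and}\quad (J_n^*)^{d+1}(\log n)^{\kappa} \asymp n\varepsilon_n^2
\]
gives $\varepsilon_n^{-(d+1)/\beta}(\log n)^{\kappa} \asymp n\varepsilon_n^{2}$, that is, $\varepsilon_n \asymp n^{-\beta/(2\beta+d+1)}(\log n)^{\kappa\beta/(2\beta+d+1)}$, the second term in (9); the first term comes from the requirements involving $\bar r_n\log p$ in (22) and (24).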
Proof of Theorem 2. The proof essentially follows the outline given in Theorem 1 except for
two main differences. First, we shall need to use different $J_0, J_{m_1^0}, \ldots, J_{m_d^0}$ due to the approxi-
mation result of Lemma 2. In particular, we need $J_0^{-\beta_0} \asymp J_{m_1^0}^{-\beta_1} \asymp \cdots \asymp J_{m_d^0}^{-\beta_d}$. This will slightly
change our definition of the sieve and the calculation of the prior concentration. The second
difference is that we now have a dependent prior distribution in (A3 ), which will change the
calculation of the prior concentration rate.
We define a new sieve as follows:
\[
\mathcal{F}_n = \Bigl\{h(x, y\,|\,r; m_1, \ldots, m_r; J_0, J_{m_1}, \ldots, J_{m_r};\theta) : r \le \bar r_n,\ 1 \le m_1 < \cdots < m_r \le p;\ J_0\prod_{k=1}^r J_{m_k} \le \tilde J_n^{\,r+1},\ \theta \in (\Delta_{J_0})^{\prod_{k=1}^r J_{m_k}}\Bigr\}, \tag{33}
\]
where $\tilde J_n = (LJ_n^*)^{(d+1)/(r+1)}(\log n)^{\kappa''/(r+1)}$, $J_n^*$ and $\bar r_n$ are two sequences of numbers going to
infinity, and $L$ and $\kappa''$ are some fixed positive constants. We shall choose the values of these
numbers later.
We first verify (18). It follows that
\[
\Pi\bigl(\mathcal{F}_n^c\bigr) \le \Pi_1(r > \bar r_n) + \sum_{r=1}^{\bar r_n}\sum_{1\le m_1<\cdots<m_r\le p}\Pi_3\Bigl(J_0\prod_{k=1}^r J_{m_k} > \tilde J_n^{\,r+1}\Bigr). \tag{34}
\]
Note that the joint distribution of $(J_0, J_{m_1}, \ldots, J_{m_r})$ depends only on the value of their product
$J_0\prod_{k=1}^r J_{m_k}$. Let $t$ be a given integer and let $N_t$ stand for the number of ways one can choose
$\{J_0, J_{m_1}, \ldots, J_{m_r}\}$ such that $J_0\prod_{k=1}^r J_{m_k} = t$. Then
\[
\Pi_3\Bigl(J_0\prod_{k=1}^r J_{m_k} = t\Bigr) \le N_t\exp\bigl\{-c_2' t(\log t)^{\kappa'}\bigr\} \le \exp\bigl\{-c_8 t(\log t)^{\kappa'}\bigr\}
\]
for some $c_8 > 0$, provided $t$ is of larger order than $r$. In the sieve, we choose the cut-off $\tilde J_n^{\,r+1}$, which is clearly of order higher than $r$,
and hence the requirement is met. As a result, the second term in (34) is bounded by
\[
\sum_{r=1}^{\bar r_n} p^r\exp\bigl\{-c_8\tilde J_n^{\,r+1}\bigl(\log\tilde J_n^{\,r+1}\bigr)^{\kappa'}\bigr\}
\le \sum_{r=1}^{\bar r_n}\exp\bigl\{r\log p - c_8' L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa'+\kappa''}\bigr\}
\le \exp\bigl\{\log\bar r_n + \bar r_n\log p - c_8' L^{d+1}(J_n^*)^{d+1}(\log n)^{\kappa'+\kappa''}\bigr\}
\]
for some $c_8' > 0$, which is of the same form as the corresponding bound for the isotropic case,
provided that $L$ is chosen sufficiently large. Thus, relation (22) is obtained.
The calculation of entropy proceeds in the same way as in the isotropic case. We split Fn into
layers following the same definition. Then
\[
D\bigl(\varepsilon_n^2, \mathcal{F}_{n;r;m_1,\ldots,m_r;J_0,J_{m_1},\ldots,J_{m_r}}, \|\cdot\|_1\bigr) \le \Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{J_0\prod_{k=1}^r J_{m_k}} \le \Bigl(\frac{3}{\varepsilon_n^2}\Bigr)^{\tilde J_n^{\,r+1}}
\]
and the remaining calculations are identical, which give entropy estimates of the sieve as in the
isotropic case and hence relation (24) is obtained.
Now we estimate the prior concentration rate. Consider the oracle model given by $(d; m_1^0, \ldots,
m_d^0; J_{n,0}^*, J_{n,1}^*, \ldots, J_{n,d}^*)$, where
\[
J_{n,0}^{*-\beta_0} \asymp J_{n,1}^{*-\beta_1} \asymp \cdots \asymp J_{n,d}^{*-\beta_d} \le \varepsilon_n.
\]
By Lemma 2, there exists $\theta^0$ such that
\[
\sup_{x,y}\bigl|f_0(y|x) - h\bigl(x, y\,|\,d; m_1^0, \ldots, m_d^0; J_{n,0}^*, \ldots, J_{n,d}^*;\theta^0\bigr)\bigr| \lesssim \sum_{k=0}^d J_{n,k}^{*-\beta_k} \lesssim \varepsilon_n.
\]
As before, $\theta^0$ can be regarded as an element of $(\Delta_{J_{n,0}^*})^{\prod_{k=1}^d J_{n,k}^*}$. Let $\theta \in (\Delta_{J_{n,0}^*})^{\prod_{k=1}^d J_{n,k}^*}$ be represented by $(\theta_{j_{m_1^0},\ldots,j_{m_d^0}} : 1 \le j_{m_k^0} \le J_{n,k}^*,\ k = 1, \ldots, d)$. Then, arguing as in the proof of Theorem 1, the corresponding difference of conditional densities is bounded by
\[
J_{n,0}^*\prod_{k=1}^d J_{n,k}^*\max_{\substack{1\le j_{m_k^0}\le J_{n,k}^*,\\ k=1,\ldots,d}}\bigl\|\theta_{j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_{m_1^0},\ldots,j_{m_d^0}}\bigr\|_1
\le \bigl(J_n^*\bigr)^{d+1}\max_{\substack{1\le j_{m_k^0}\le J_{n,k}^*,\\ k=1,\ldots,d}}\bigl\|\theta_{j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_{m_1^0},\ldots,j_{m_d^0}}\bigr\|_1,
\]
where $J_n^* = \bigl(\prod_{k=0}^d J_{n,k}^*\bigr)^{1/(d+1)} + 1$ is the smallest integer greater than the geometric mean of
$J_{n,0}^*, \ldots, J_{n,d}^*$. Thus it suffices to lower bound
\[
\Pi_1(d)\times\Pi_2\bigl(m_1^0, \ldots, m_d^0\,\big|\,r = d\bigr)\times\Pi_3\bigl(J_{n,0}^*, \ldots, J_{n,d}^*\bigr)\times\prod_{\substack{1\le j_{m_k^0}\le J_{n,k}^*,\\ k=1,\ldots,d}}\tilde\Pi_4\Bigl(\bigl\|\theta_{j_{m_1^0},\ldots,j_{m_d^0}} - \theta^0_{j_{m_1^0},\ldots,j_{m_d^0}}\bigr\|_1 \le J_n^{*-(d+1)}\varepsilon_n\Bigr).
\]
Since the other factors are as before, it suffices to look at the third factor only, whose lower bound
is given by
\[
\exp\Bigl\{-c_2\prod_{k=0}^d J_{n,k}^*\Bigl(\sum_{k=0}^d\log J_{n,k}^*\Bigr)^{\kappa}\Bigr\} \ge \exp\bigl\{-c_9\bigl(J_n^*\bigr)^{d+1}\bigl(\log J_n^*\bigr)^{\kappa}\bigr\}
\]
for some constant c9 > 0, which is identical with the corresponding expression for the isotropic
case. Thus we need
\[
J_{n,k}^{*-\beta_k} \lesssim \varepsilon_n,\qquad \log p \lesssim n\varepsilon_n^2,\qquad \bigl(J_n^*\bigr)^{d+1}(\log n)^{\kappa} \lesssim n\varepsilon_n^2,\qquad J_n^* \asymp \Bigl(\prod_{k=0}^d J_{n,k}^*\Bigr)^{1/(d+1)}. \tag{35}
\]
Combining (22), (24) and (35), we can choose $\kappa'' = \kappa - \kappa'$, $\bar r_n$ as a large multiple of $(\log n)^{1/t_0}$,
$J_{n,k}^* \asymp \varepsilon_n^{-1/\beta_k}$ and
\[
\varepsilon_n = \max\bigl\{n^{-\beta^*/(2\beta^*+d+1)}(\log n)^{\kappa\beta^*/(2\beta^*+d+1)},\ n^{-(1-\alpha)/2}(\log n)^{1/(2t_0)}\bigr\},
\]
where $\beta^* = (d+1)\bigl(\sum_{k=0}^d\beta_k^{-1}\bigr)^{-1}$ is the harmonic mean of $(\beta_0, \beta_1, \ldots, \beta_d)$.
Proof of Theorem 3. Note that the distance $\rho_n$ on the space of conditional densities can be expressed mathematically as the Hellinger distance on the space of joint densities with respect to the
dominating measure $\mu$, which is the product of $\mathbb{P}_n^X$ and the Lebesgue measure. This is notwith-
standing the fact that the predictor variables are actually deterministic. We only need to replace
$G(\cdot)$ by $\mathbb{P}_n^X$ in the definitions of the Kullback–Leibler divergence and the Kullback–Leibler variation
in (20). The rest of the argument proceeds exactly in the same way as in Theorems 1 and 2 for the
isotropic and the anisotropic cases, respectively.
Proof of Lemma 1. Part (a) is a well-known approximation result for tensor product splines,
see Theorem 12.7 of [36] or Lemma 2.1 of [10], for example. Part (b) is a direct multivariate
generalization of Lemma 1, part (b) of [37].
For part (c), note that by part (b) we have θ ∈ (0, 1)J such that
\[
\Bigl\| f(y|x_1, \ldots, x_d) - \sum_{j_0=1}^{J_0}\cdots\sum_{j_d=1}^{J_0}\theta_{j_0,\ldots,j_d}B_{j_0}(y)\prod_{k=1}^d B_{j_k}(x_k)\Bigr\|_\infty \le C_1 J^{-\beta/(d+1)}
\]
for the constant $C_1 = C\|f^{(\beta)}\|_\infty$. Define $\xi$ as the column vector of $\xi_{j_0,\ldots,j_d} = \theta_{j_0,\ldots,j_d}\int_0^1 B_{j_0}(y)\,dy$
and $\mathbf{B}^*$ as the column vector of $B_j^*(y, x) = \bar B_{j_0}(y)\prod_{k=1}^d B_{j_k}(x_k)$. Then
\[
\bigl\| f(y|x_1, \ldots, x_d) - \xi^{\mathrm{T}}\mathbf{B}^*(y, x)\bigr\|_\infty \le C_1 J^{-\beta/(d+1)} \tag{36}
\]
for any x ∈ (0, 1)d . Applying a multivariate analog of Theorem 4.38 of [36] for tensor product
of B-splines, we can bound the maximum norm of coefficients in a tensor product B-spline
expansion by a constant multiple of the supremum norm of the function formed by corresponding
linear combination. This is possible by forming a dual basis consisting of tensor product of
functions in a dual basis for univariate B-splines and by noting that the supremum norms of the
elements of the dual basis can be taken to be uniformly bounded (see Theorem 4.41 of [36]).
This leads to the relation
\[
\Bigl|1 - \sum_{j_0=1}^{J_0}\xi_{j_0,\ldots,j_d}\Bigr| \le C_2 J^{-\beta/(d+1)} \tag{37}
\]
for some positive constant $C_2$. Combining with (36), the result now follows.
Proof of Lemma 2. Part (a) is a well-known approximation result for tensor Sobolev spaces; see
Theorem 12.7 of [36], for example. The proofs of (b) and (c) proceed exactly as in Lemma 1.
Acknowledgements
The authors would like to thank the Associate Editor and two referees for their helpful comments
that greatly improved the quality of the paper.
References
[1] Birgé, L. (1986). On estimating a density using Hellinger distance and some other strange facts.
Probab. Theory Related Fields 71 271–291. MR0816706
[2] Brown, P.J., Vannucci, M. and Fearn, T. (1998). Multivariate Bayesian variable selection and predic-
tion. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 627–641. MR1626005
[3] Brown, P.J., Vannucci, M. and Fearn, T. (2002). Bayes model averaging with selection of regressors.
J. R. Stat. Soc. Ser. B Stat. Methodol. 64 519–536. MR1924304
[4] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer Series in Statistics. Heidelberg: Springer. MR2807761
[5] Carlin, B.P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods.
J. R. Stat. Soc. Ser. B Stat. Methodol. 57 473–484.
[6] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for
possibly sparse sequences. Ann. Statist. 40 2069–2101. MR3059077
[7] Chung, Y. and Dunson, D.B. (2009). Nonparametric Bayes conditional distribution modeling with
variable selection. J. Amer. Statist. Assoc. 104 1646–1660. MR2750582
[8] Dalalyan, A.S. and Tsybakov, A.B. (2012). Mirror averaging with sparsity priors. Bernoulli 18 914–
944. MR2948907
[9] de Boor, C. (2001). A Practical Guide to Splines, Revised ed. Applied Mathematical Sciences 27. New
York: Springer. MR1900298
[10] de Jonge, R. and van Zanten, J.H. (2012). Adaptive estimation of multivariate functions using condi-
tionally Gaussian tensor-product spline priors. Electron. J. Stat. 6 1984–2001. MR3020254
[11] Dunson, D.B. and Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika 95 307–323.
MR2521586
[12] Dunson, D.B., Pillai, N. and Park, J.-H. (2007). Bayesian density regression. J. R. Stat. Soc. Ser. B
Stat. Methodol. 69 163–183. MR2325270
[13] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space.
J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911. MR2530322
[14] Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity measures in
nonlinear dynamical systems. Biometrika 83 189–206. MR1399164
[15] Fan, J. and Yim, T.H. (2004). A crossvalidation method for estimating conditional densities.
Biometrika 91 819–834. MR2126035
[16] George, E.I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist.
Assoc. 88 881–889.
[17] George, E.I. and McCulloch, R.E. (1997). Approaches for Bayesian variable selection. Statist. Sinica
7 339–373.
[18] Ghosal, S., Ghosh, J.K. and van der Vaart, A.W. (2000). Convergence rates of posterior distributions.
Ann. Statist. 28 500–531. MR1790007
[19] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-i.i.d.
observations. Ann. Statist. 35 192–223. MR2332274
[20] Ghosal, S. and van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth
densities. Ann. Statist. 35 697–723. MR2336864
[21] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82 711–732. MR1380810
[22] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and
the virtue of overparametrization. Bernoulli 10 971–988. MR2108039
[23] Griffin, J.E. and Steel, M.F.J. (2006). Order-based dependent Dirichlet processes. J. Amer. Statist.
Assoc. 101 179–194. MR2268037
[24] Hall, P., Racine, J. and Li, Q. (2004). Cross-validation and the estimation of conditional probability
densities. J. Amer. Statist. Assoc. 99 1015–1026. MR2109491
[25] Hall, P., Wolff, R.C.L. and Yao, Q. (1999). Methods for estimating a conditional distribution function.
J. Amer. Statist. Assoc. 94 154–163. MR1689221
[26] Huang, J., Horowitz, J.L. and Wei, F. (2010). Variable selection in nonparametric additive models.
Ann. Statist. 38 2282–2313. MR2676890
[27] Jara, A. and Hanson, T.E. (2011). A class of mixtures of dependent tail-free processes. Biometrika 98
553–566. MR2836406
[28] Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: Con-
vergence rates of the fitted densities. Ann. Statist. 35 1487–1511. MR2351094
[29] Khasminskii, R.Z. (1978). A lower bound for risks of nonparametric density estimates in the uniform
metric. Theory Probab. Appl. 23 794–798.
[30] Lepski, O. (2013). Multivariate density estimation under sup-norm loss: Oracle approach, adaptation
and independence structure. Ann. Statist. 41 1005–1034. MR3099129
[31] Ma, L. (2012). Recursive partitioning and Bayesian inference on conditional distributions. Technical
report. Duke Univ.
[32] Müller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures.
Biometrika 83 67–79. MR1399156
[33] Norets, A. and Pelenis, J. (2014). Posterior consistency in conditional density estimation by covariate
dependent mixtures. Econometric Theory 30 606–646. MR3205608
[34] Pati, D., Dunson, D.B. and Tokdar, S.T. (2013). Posterior consistency in conditional distribution esti-
mation. J. Multivariate Anal. 116 456–472. MR3049916
[35] Rivoirard, V. and Rousseau, J. (2012). Posterior concentration rates for infinite dimensional exponen-
tial families. Bayesian Anal. 7 311–333. MR2934953
[36] Schumaker, L.L. (2007). Spline Functions: Basic Theory, 3rd ed. Cambridge Mathematical Library.
Cambridge: Cambridge Univ. Press. MR2348176
[37] Shen, W. and Ghosal, S. (2012). Adaptive Bayesian procedures using random series prior. Technical
report. Available at arXiv:1403.0625.
[38] Shen, W., Tokdar, S.T. and Ghosal, S. (2013). Adaptive Bayesian multivariate density estimation with
Dirichlet mixtures. Biometrika 100 623–640. MR3094441
[39] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29
687–714. MR1865337
[40] Sugiyama, M., Takeuchi, I., Kanamori, T., Suzuki, T., Hachiya, H. and Okanohara, D. (2010). Least-
squares conditional density estimation. IEICE Trans. Inf. Syst. E93-D 583–594.
[41] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58
267–288. MR1379242
[42] Tokdar, S.T. (2011). Dimension adaptability of Gaussian process models with variable selection and
projection. Technical report. Available at arXiv:1112.0716.
[43] Tokdar, S.T., Zhu, Y.M. and Ghosh, J.K. (2010). Bayesian density regression with logistic Gaussian
process and subspace projection. Bayesian Anal. 5 319–344. MR2719655
[44] Trippa, L., Müller, P. and Johnson, W. (2011). The multivariate beta process and an extension of the
Polya tree model. Biometrika 98 17–34. MR2804207
[45] van de Geer, S.A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36
614–645. MR2396809
[46] van der Vaart, A.W. and van Zanten, J.H. (2008). Rates of contraction of posterior distributions based
on Gaussian process priors. Ann. Statist. 36 1435–1463. MR2418663
[47] Yang, Y. and Dunson, D. (2012). Bayesian conditional tensor factorizations for high-dimensional
classification. Technical report. Available at arXiv:1301.4950.