Bayesian Regularized Quantile Regression

Q. Li (Department of Mathematics, Washington University in St. Louis, St. Louis, MO, qli@math.wustl.edu), R. Xi (Center for Biomedical Informatics, Harvard Medical School, Cambridge, MA, Ruibin_Xi@hms.harvard.edu) and N. Lin (Department of Mathematics, Washington University in St. Louis, St. Louis, MO, nlin@math.wustl.edu)
1 Introduction
Quantile regression (Koenker and Bassett 1978) has gained increasing popularity as it
provides richer information than the classic mean regression. Suppose that we have
a sample (x_1, y_1), ..., (x_n, y_n). Then the linear quantile regression model for the θth
quantile (0 < θ < 1) is y_i = x_i^T β + u_i, i = 1, ..., n, where β = (β_1, ..., β_p)^T ∈ R^p
and the u_i's are independent with their θth quantiles equal to 0. It can be shown that the
coefficients β can be estimated consistently by the solution to the following minimization
problem

$$
\min_{\beta} \sum_{i=1}^{n} \rho_\theta\left(y_i - x_i^T \beta\right), \tag{1}
$$

where $\rho_\theta(t) = t\{\theta - I(t < 0)\}$ is the check loss function and $I(\cdot)$ denotes the indicator function.
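For readers who prefer code, the following is a minimal sketch of the check loss and the objective in (1); the function names are ours and are used only for illustration.

```python
import numpy as np

def check_loss(u, theta):
    """Check (pinball) loss rho_theta(u) = u * (theta - I(u < 0))."""
    return u * (theta - (u < 0))

def qr_objective(beta, X, y, theta):
    """Objective in (1): sum of check losses of the residuals."""
    return np.sum(check_loss(y - X @ beta, theta))
```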
Although the asymptotic theory for quantile regression has been well developed
(Koenker and Bassett 1978; Koenker 2005), a Bayesian approach enables exact infer-
ence even when the sample size is small. Yu and Moyeed (2001) proposed a Bayesian
formulation of quantile regression using the skewed Laplace distribution for the errors
and sampling β from its posterior distribution using a random walk Metropolis-Hastings
algorithm. Similar formulations were also employed by Tsionas (2003) and Kozumi and
Kobayashi (2009), both of which developed Gibbs samplers to estimate their mod-
els. Reed and Yu (2009) considered a similar scale-mixture expression of the skewed
Laplace distribution as in Kozumi and Kobayashi (2009) and derived efficient Gibbs
samplers. Reed et al. (2009) studied the stochastic search variable selection (SSVS)
algorithm based on the scale-mixture expression in Reed and Yu (2009). Kottas and
Krnjajić (2009) extended this idea to Dirichlet process scale mixture of skewed Laplace
distributions and another scale mixture of uniform distributions. The special case of
median regression was discussed by Walker and Mallick (1999), Kottas and Gelfand
(2001) and Hanson and Johnson (2002). They modeled the error distribution as mix-
ture distributions based on either the Dirichlet process or the Pólya tree. Hjort and
Walker (2009) introduced the quantile pyramids and briefly discussed their application
to Bayesian quantile regression. Geraci and Bottai (2007) and Reich et al. (2009)
considered Bayesian quantile regression models for clustered data.
One crucial problem in building a quantile regression model is the selection of pre-
dictors. The prediction accuracy can often be improved by choosing an appropriate
subset of predictors. Also, in practice, it is often desired to identify a smaller subset
of predictors from a large set of predictors to obtain better interpretation. There has
been active research on sparse representation of linear regression. Tibshirani (1996)
introduced the least absolute shrinkage and selection operator (lasso) technique which
can simultaneously perform variable selection and parameter estimation. The lasso
estimate is an $L_1$-regularized least squares estimate, i.e., the lasso estimate is the solution
to $\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \|\beta\|_1$ for some $\lambda \ge 0$, where $\|\beta\|_1 = \sum_{k=1}^{p} |\beta_k|$.
Several limitations of the lasso and corresponding improvements are as follows. When categorical
predictors are present in the regression model, the lasso is not satisfactory since it only
selects individual dummy variables instead of the whole predictor. To solve this prob-
lem, Yuan and Lin (2006) introduced the group lasso by generalizing the lasso penalty.
Another extension of the lasso is the elastic net (Zou and Hastie 2005), which has an
improved performance for correlated predictors and can select more than n variables
in the case of p > n. Other related approaches include the smoothly clipped absolute
deviation (SCAD) model (Fan and Li 2001) and the fused lasso (Tibshirani et al. 2005).
The first use of regularization in quantile regression was made by Koenker (2004),
who put the lasso penalty on the random effects in a mixed-effect quantile regression
model to shrink the random effects towards zero. Wang et al. (2007) considered the
least absolute deviation (LAD) estimate with the adaptive lasso penalty (LAD-lasso) and
proved its oracle property. Recently, Li and Zhu (2008) considered quantile regression
with the lasso penalty and developed its piecewise linear solution path. Wu and Liu
(2009) demonstrated the oracle properties of the SCAD and adaptive lasso regularized
quantile regression.
Our goal is to develop a Bayesian framework for regularization in linear quantile
regression. For linear regression, Bae and Mallick (2004) and Park and Casella (2008)
treated the lasso from a Bayesian perspective, and proposed hierarchical models which
can be solved efficiently through Gibbs samplers. These works shed light on incorpo-
rating the regularization methods into the Bayesian quantile regression framework. In
this paper, we consider three penalties: the lasso, the group lasso and the elastic net.
Gibbs samplers are developed for these three types of regularized quantile
regression. As demonstrated later by simulation studies, these Bayesian regularized
quantile regression methods provide more accurate estimates and better prediction ac-
curacy than their non-Bayesian peers.
The rest of the paper is organized as follows. In Section 2, we introduce Bayesian
regularized quantile regression with lasso, group lasso and elastic net penalties, and
derive the corresponding Gibbs samplers. Simulation studies are then presented in
Section 3, followed by a real data example in Section 4. Discussions and conclusions
are given in Section 5. An appendix contains technical proofs and derivations.
Hence, maximizing the likelihood (3) is equivalent to minimizing (1). Recently, Kozumi
and Kobayashi (2009) proved that the skewed Laplace distribution (2) can be viewed
as a mixture of an exponential and a scaled normal distribution. More specifically, we
have the following lemma.
f (ṽi | τ ) = τ exp(−τ ṽi ). Denote ṽ = (ṽ1 , · · · , ṽn ) and z = (z1 , · · · , zn ). Then, we have
the hierarchical model

$$
\begin{aligned}
y_i &= x_i^T \beta + \xi_1 \tilde{v}_i + \tau^{-1/2} \xi_2 \sqrt{\tilde{v}_i}\, z_i, \\
\tilde{v} \mid \tau &\sim \prod_{i=1}^{n} \tau \exp(-\tau \tilde{v}_i), \\
z &\sim \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right).
\end{aligned} \tag{4}
$$
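A small simulation illustrates the mixture representation (4). The constants below are the standard choices ξ1 = (1 − 2θ)/[θ(1 − θ)] and ξ2² = 2/[θ(1 − θ)] from Kozumi and Kobayashi (2009); since the lemma stating them is not reproduced above, treat them as an assumption of this sketch.

```python
import numpy as np

theta, tau, n = 0.3, 1.0, 200000
rng = np.random.default_rng(0)

# Assumed constants from the Kozumi-Kobayashi mixture representation.
xi1 = (1 - 2 * theta) / (theta * (1 - theta))
xi2 = np.sqrt(2 / (theta * (1 - theta)))

v = rng.exponential(scale=1 / tau, size=n)   # v_i | tau ~ Exp(tau)
z = rng.standard_normal(n)                   # z_i ~ N(0, 1)
u = xi1 * v + xi2 * np.sqrt(v / tau) * z     # error term in (4) with x_i^T beta = 0

# The theta-th quantile of the simulated errors should be close to zero.
print(np.quantile(u, theta))
```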
The lasso, group lasso and elastic net estimates are all regularized least squares
estimates, and the differences among them lie only in their penalty terms. Specifically,
they are all solutions to a minimization problem of the form $\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda_1 h_1(\beta) + \lambda_2 h_2(\beta)$,
for some $\lambda_1, \lambda_2 \ge 0$ and penalty functions $h_1(\cdot)$ and $h_2(\cdot)$.
The lasso corresponds to $\lambda_1 = \lambda$, $\lambda_2 = 0$, $h_1(\beta) = \|\beta\|_1$ and $h_2(\beta) = 0$. Suppose that the
predictors are classified into $G$ groups and $\beta_g$ is the coefficient vector of the $g$th group.
Denote $\|\beta_g\|_{K_g} = (\beta_g^T K_g \beta_g)^{1/2}$ for positive definite matrices $K_g$ ($g = 1, \dots, G$). Then,
the group lasso corresponds to $\lambda_1 = \lambda$, $\lambda_2 = 0$, $h_1(\beta) = \sum_{g=1}^{G} \|\beta_g\|_{K_g}$ and $h_2(\beta) = 0$.
The elastic net corresponds to $\lambda_1, \lambda_2 > 0$, $h_1(\beta) = \|\beta\|_1$ and $h_2(\beta) = \|\beta\|_2^2 = \sum_{k=1}^{p} \beta_k^2$.
Similarly, we form the minimization problem for regularized quantile regression as

$$
\min_{\beta} \sum_{i=1}^{n} \rho_\theta(y_i - x_i^T \beta) + \lambda_1 h_1(\beta) + \lambda_2 h_2(\beta). \tag{5}
$$
That is, we replace the squared error loss by the check loss, while keeping the corre-
sponding penalty terms unchanged. Starting from (4) we can show that, by introducing
suitable priors on β, the solution to (5) is equivalent to the maximum a posteriori
(MAP) estimate in the Bayesian formulation. In the following three subsections, we
will discuss (5) with lasso, elastic net and group lasso penalties separately. For each
penalty, we present the Bayesian hierarchical model and derive its corresponding Gibbs
sampler.
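To make the penalized criteria concrete, here is a sketch of objective (5) with the lasso and elastic net penalties (the group lasso case is analogous, with $h_1$ replaced by the grouped norm); the helper names are ours.

```python
import numpy as np

def check_loss(u, theta):
    return u * (theta - (u < 0))

def penalized_qr_objective(beta, X, y, theta, lam1=0.0, lam2=0.0):
    """Objective (5) with h1 the L1 norm and h2 the squared L2 norm.

    lam2 = 0 gives the lasso penalty; lam1, lam2 > 0 gives the elastic net.
    """
    fit = np.sum(check_loss(y - X @ beta, theta))
    return fit + lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta ** 2)
```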
$$
\begin{aligned}
\pi(\beta \mid \tau, \lambda) &= \prod_{k=1}^{p} \frac{\eta}{2} \exp\{-\eta |\beta_k|\} \\
&= \prod_{k=1}^{p} \int_0^{\infty} \frac{1}{\sqrt{2\pi s_k}} \exp\left(-\frac{\beta_k^2}{2 s_k}\right) \frac{\eta^2}{2} \exp\left(-\frac{\eta^2}{2} s_k\right) ds_k.
\end{aligned}
$$
Denote s = (s_1, ..., s_p). We further put Gamma priors on the parameters τ and η² and
have the following Bayesian hierarchical model.

$$
\begin{aligned}
y_i &= x_i^T \beta + \xi_1 \tilde{v}_i + \xi_2 \tau^{-1/2} \sqrt{\tilde{v}_i}\, z_i, \\
\tilde{v} \mid \tau &\sim \prod_{i=1}^{n} \tau \exp(-\tau \tilde{v}_i), \\
z &\sim \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right), \\
\beta, s \mid \eta^2 &\sim \prod_{k=1}^{p} \frac{1}{\sqrt{2\pi s_k}} \exp\left(-\frac{\beta_k^2}{2 s_k}\right) \prod_{k=1}^{p} \frac{\eta^2}{2} \exp\left(-\frac{\eta^2}{2} s_k\right), \\
\tau, \eta^2 &\sim \tau^{a-1} \exp(-b\tau)\, (\eta^2)^{c-1} \exp(-d\eta^2).
\end{aligned} \tag{9}
$$
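Under model (9), the joint posterior is proportional to the product of the conditional likelihood of y and the priors displayed above. A sketch of the unnormalized log posterior, useful for checking a sampler, follows; as in the earlier sketch, the constants ξ1 and ξ2 are the standard ones from the mixture representation and are an assumption here.

```python
import numpy as np

def log_posterior(beta, s, v, tau, eta2, X, y, theta, a=0.1, b=0.1, c=0.1, d=0.1):
    """Unnormalized log posterior of hierarchical model (9).

    beta, s: length-p arrays; v: length-n array; tau, eta2 (= eta^2): positive scalars.
    The constants xi1, xi2 below are assumed, not quoted from the paper.
    """
    xi1 = (1 - 2 * theta) / (theta * (1 - theta))
    xi2_sq = 2 / (theta * (1 - theta))

    resid = y - X @ beta - xi1 * v
    var = xi2_sq * v / tau
    loglik = -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

    log_v_prior = np.sum(np.log(tau) - tau * v)                            # v_i ~ Exp(tau)
    log_beta_prior = -0.5 * np.sum(np.log(2 * np.pi * s) + beta ** 2 / s)  # beta_k | s_k ~ N(0, s_k)
    log_s_prior = np.sum(np.log(eta2 / 2) - eta2 * s / 2)                  # s_k ~ Exp(eta^2 / 2)
    log_tau_prior = (a - 1) * np.log(tau) - b * tau                        # tau ~ Gamma(a, b)
    log_eta2_prior = (c - 1) * np.log(eta2) - d * eta2                     # eta^2 ~ Gamma(c, d)

    return (loglik + log_v_prior + log_beta_prior + log_s_prior
            + log_tau_prior + log_eta2_prior)
```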
Maximizing the posterior distribution (12) is thus equivalent to minimizing (10). Cal-
culation of the constant C(η₁, η₂) is in Appendix B. Let η̃₁ = η₁²/(4η₂). Putting Gamma
priors on τ, η̃₁ and η₂, we then have the following hierarchical model.

$$
\begin{aligned}
y_i &= x_i^T \beta + \xi_1 \tilde{v}_i + \xi_2 \tau^{-1/2} \sqrt{\tilde{v}_i}\, z_i, \\
\tilde{v} \mid \tau &\sim \prod_{i=1}^{n} \tau \exp(-\tau \tilde{v}_i), \\
z &\sim \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right), \\
\beta_k \mid t_k, \eta_2 &\overset{\text{ind.}}{\sim} \frac{1}{\sqrt{2\pi (t_k - 1)/(2\eta_2 t_k)}} \exp\left\{-\frac{1}{2} \left(\frac{t_k - 1}{2\eta_2 t_k}\right)^{-1} \beta_k^2\right\}, \\
t_k \mid \tilde{\eta}_1 &\overset{\text{ind.}}{\sim} \Gamma^{-1}(1/2, \tilde{\eta}_1)\, t_k^{-1/2}\, \tilde{\eta}_1^{1/2} \exp\{-\tilde{\eta}_1 t_k\}\, I(t_k > 1), \\
\tau, \tilde{\eta}_1, \eta_2 &\sim \tau^{a-1} \exp(-b\tau)\, \tilde{\eta}_1^{c_1 - 1} \exp(-d_1 \tilde{\eta}_1)\, \eta_2^{c_2 - 1} \exp(-d_2 \eta_2),
\end{aligned} \tag{13}
$$

where a, b, c₁, c₂, d₁, d₂ ≥ 0, I(·) is the indicator function and Γ(·, ·) is the upper incomplete gamma function.
Appendix B gives the details of the full conditionals for the Gibbs sampler. The full
conditional distributions are all common distributions except the full conditional of η̃1 .
The full conditional distribution of η̃1 is
$$
f(\tilde{\eta}_1 \mid X, y, \tilde{v}, \beta, t, \tau, \eta_2) \propto \Gamma^{-p}(1/2, \tilde{\eta}_1)\, \tilde{\eta}_1^{p/2 + c_1 - 1} \exp\left\{-\tilde{\eta}_1 \left[d_1 + \sum_{k=1}^{p} t_k\right]\right\}.
$$
and hence $\lim_{\tilde{\eta}_1 \to \infty} f(\tilde{\eta}_1 \mid X, y, \tilde{v}, \beta, t, \tau, \eta_2)\, q^{-1}(\tilde{\eta}_1 \mid t)$ exists and equals some positive
constant. So the tail behaviors of q(η̃₁ | t) and f(η̃₁ | X, y, ṽ, β, t, τ, η₂) are similar. At
each iteration of the Gibbs sampler, we sample from f(η̃₁ | X, y, ṽ, β, t, τ, η₂) using a
one-step Metropolis-Hastings update.
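The following sketch shows one Metropolis-Hastings step targeting the density displayed above; we use a simple independence Gamma proposal as a stand-in, since the exact proposal q(η̃₁ | t) used in the paper is not reproduced here.

```python
import numpy as np
from scipy.special import gammaincc, gammaln

def log_target_eta1(eta1, t, p, c1, d1):
    """log f(eta1_tilde | ...) up to a constant, from the display above."""
    # Upper incomplete gamma Gamma(1/2, eta1) = gammaincc(0.5, eta1) * Gamma(0.5).
    log_upper_inc = np.log(gammaincc(0.5, eta1)) + gammaln(0.5)
    return (-p * log_upper_inc + (p / 2 + c1 - 1) * np.log(eta1)
            - eta1 * (d1 + np.sum(t)))

def mh_step_eta1(eta1, t, p, c1, d1, rng, shape=2.0, rate=1.0):
    """One independence MH step with a Gamma(shape, rate) proposal (an arbitrary choice)."""
    prop = rng.gamma(shape, 1.0 / rate)
    log_q = lambda x: (shape - 1) * np.log(x) - rate * x
    log_accept = (log_target_eta1(prop, t, p, c1, d1)
                  - log_target_eta1(eta1, t, p, c1, d1)
                  + log_q(eta1) - log_q(prop))
    return prop if np.log(rng.uniform()) < log_accept else eta1
```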
for this scenario. The group lasso penalty (Bakin 1999; Yuan and Lin 2006; Meier
et al. 2008) takes the group structure into account and can do variable selection at
the group level. Suppose that the predictors are grouped into G groups and β_g is
the coefficient vector of the gth group of predictors x_{ig}. Then, β = (β_1^T, ..., β_G^T)^T and
x_i = (x_{i1}^T, ..., x_{iG}^T)^T. Denote by d_g the dimension of the vector β_g. Let K_g be a known
d_g × d_g positive definite matrix (g = 1, ..., G). Define $\|\beta_g\|_{K_g} = (\beta_g^T K_g \beta_g)^{1/2}$. We
consider the following group lasso regularized quantile regression

$$
\min_{\beta} \sum_{i=1}^{n} \rho_\theta(y_i - x_i^T \beta) + \lambda \sum_{g=1}^{G} \|\beta_g\|_{K_g}. \tag{14}
$$
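A sketch of the group-penalized objective (14); the grouping structure and the K_g matrices are inputs (Section 3 later uses K_g = d_g I_{d_g}), and the function names are ours.

```python
import numpy as np

def check_loss(u, theta):
    return u * (theta - (u < 0))

def group_lasso_qr_objective(beta, X, y, theta, groups, K_list, lam):
    """Objective (14): check loss plus lambda * sum_g ||beta_g||_{K_g}.

    groups: list of index arrays, one per group; K_list: matching list of K_g matrices.
    """
    fit = np.sum(check_loss(y - X @ beta, theta))
    penalty = sum(np.sqrt(beta[g] @ K @ beta[g]) for g, K in zip(groups, K_list))
    return fit + lam * penalty
```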
Set the prior of β_g as the Laplace prior $\pi(\beta_g \mid \eta) = C_{d_g} \sqrt{\det(K_g)}\, \eta^{d_g} \exp(-\eta \|\beta_g\|_{K_g})$,
where $C_{d_g} = 2^{-(d_g+1)/2} (2\pi)^{-(d_g-1)/2} / \Gamma((d_g+1)/2)$, η = τλ, and Γ(·) is the gamma
function. Then, the posterior distribution of β is

$$
f(\beta \mid y, X, \tau, \lambda) \propto \exp\left\{-\tau \sum_{i=1}^{n} \rho_\theta(y_i - x_i^T \beta) - \tau\lambda \sum_{g=1}^{G} \|\beta_g\|_{K_g}\right\}. \tag{15}
$$
Maximizing the posterior (15) is then equivalent to minimizing (14). From the equality
(8), we have
$$
\begin{aligned}
\pi(\beta_g \mid \eta) &= C_{d_g} \sqrt{\det(K_g)}\, \eta^{d_g} \exp(-\eta \|\beta_g\|_{K_g}) \\
&= \int_0^{\infty} \frac{1}{\sqrt{\det(2\pi s_g K_g^{-1})}} \exp\left\{-\frac{1}{2} \beta_g^T \left(s_g K_g^{-1}\right)^{-1} \beta_g\right\} \times \frac{(\eta^2/2)^{(d_g+1)/2}}{\Gamma\left(\frac{d_g+1}{2}\right)}\, s_g^{(d_g-1)/2} \exp\left(-\frac{\eta^2}{2} s_g\right) ds_g.
\end{aligned}
$$
Putting Gamma priors on τ and η², we then have the following hierarchical model.

$$
\begin{aligned}
y_i &= x_i^T \beta + \xi_1 \tilde{v}_i + \xi_2 \tau^{-1/2} \sqrt{\tilde{v}_i}\, z_i, \\
\tilde{v} \mid \tau &\sim \prod_{i=1}^{n} \tau \exp(-\tau \tilde{v}_i), \\
z &\sim \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right), \\
\beta_g \mid s_g &\overset{\text{ind.}}{\sim} N(0, s_g K_g^{-1}), \\
s_g \mid \eta^2 &\overset{\text{ind.}}{\sim} \left(\eta^2/2\right)^{(d_g+1)/2} s_g^{(d_g-1)/2} \exp\left(-\frac{\eta^2}{2} s_g\right), \\
\tau, \eta^2 &\sim \tau^{a-1} \exp(-b\tau)\, (\eta^2)^{c-1} \exp(-d\eta^2).
\end{aligned} \tag{16}
$$
As for the Gibbs sampler, the full conditional distributions of ṽ_i and τ are the same
as in the lasso regularized quantile regression. The full conditional distribution of β_g
is a normal distribution and that of η² is a Gamma distribution. The full conditional
distribution of s_g is a generalized inverse Gaussian distribution. All details are included
in Appendix C.
3 Simulation studies
In this section, we carry out Monte Carlo simulations to study the performance of
Bayesian regularized quantile regression with comparison to some non-Bayesian ap-
proaches. The methods in the comparison include:
• Bayesian regularized quantile regressions with the lasso penalty (BQR.L), the
elastic net penalty (BQR.EN) and the group lasso penalty (BQR.GL).
• Regularized mean regression methods including the lasso and the elastic net (EN).
$$
y_i = x_i^T \beta + u_i, \quad i = 1, \dots, n, \tag{17}
$$
where ui ’s have the θth quantile equal to 0. We first consider models with homogeneous
errors and then those with heterogenous errors.
• Simulation 2: β = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85), which corresponds
to the dense case.
• Simulation 5: β = ((−1.2, 1.8, 0), (0, 0, 0), (0.5, 1, 0), (0, 0, 0), (1, 1, 0)), which cor-
responds to the case with group structures in the predictors.
Table 1: MMADs for Simulation 1. In the parentheses are standard deviations of the
MMADs obtained by 500 bootstrap resampling. The bold numbers correspond to the
smallest MMAD in each category.
θ      Method      Normal      Normal mixture      Laplace      Laplace mixture
θ = 0.5 BQR.L 0.939 (0.048) 1.461 (0.089) 1.090 (0.103) 1.615 (0.149)
BQR.EN 1.051 (0.044) 1.521 (0.075) 1.174 (0.076) 1.642 (0.095)
lasso 1.089 (0.072) 1.510 (0.098) 1.384 (0.142) 2.026 (0.167)
EN 1.169 (0.066) 1.628 (0.102) 1.649 (0.098) 1.921 (0.133)
QR 1.262 (0.048) 1.812 (0.073) 1.636 (0.091) 1.965 (0.085)
QR-L 1.102 (0.038) 1.458 (0.090) 1.228 (0.081) 1.719 (0.125)
θ = 0.3 BQR.L 0.967 (0.059) 1.618 (0.073) 1.117 (0.102) 1.952 (0.140)
BQR.EN 0.973 (0.066) 1.623 (0.102) 1.180 (0.091) 2.006 (0.076)
lasso 1.135 (0.071) 1.762 (0.107) 1.436 (0.082) 2.144 (0.131)
EN 1.280 (0.063) 1.876 (0.091) 1.436 (0.105) 2.025 (0.135)
QR 1.213 (0.070) 1.712 (0.069) 1.406 (0.059) 2.157 (0.075)
QR-L 1.035 (0.071) 1.745 (0.109) 1.223 (0.104) 2.002 (0.115)
θ = 0.1 BQR.L 1.079 (0.060) 2.057 (0.143) 1.097 (0.103) 2.886 (0.214)
BQR.EN 1.078 (0.049) 2.099 (0.152) 1.168 (0.109) 2.959 (0.209)
lasso 1.076 (0.052) 2.652 (0.244) 1.361 (0.091) 3.136 (0.187)
EN 1.185 (0.074) 2.591 (0.212) 1.445 (0.103) 3.049 (0.261)
QR 1.318 (0.036) 1.873 (0.073) 1.453 (0.088) 1.913 (0.149)
QR-L 1.460 (0.101) 2.310 (0.150) 2.035 (0.114) 3.308 (0.215)
In the first three simulation studies, the rows of X in (17) are generated inde-
pendently from N (0, Σ), where the (i, j)th element of Σ is 0.5|i−j| . In Simulation
4, we first generate Z_1 and Z_2 independently from N(0, 1). Then let x_j = Z_1 + ε_j,
j = 1, ..., 10, x_j ∼ N(0, 1), j = 11, ..., 20, and x_j = Z_2 + ε_j, j = 21, ..., 30, where
ε_j ∼ N(0, 0.01), j = 1, ..., 10, 21, ..., 30. In Simulation 5, we first simulate a latent
variable, Z = (Z1 , . . . , Z5 )T , from N (0, Σ), where the (i, j)th element of Σ is 0.5|i−j| .
Then each Zj is trichotomized as 0, 1 or 2, depending on whether it is smaller than
Φ−1 (1/3), between Φ−1 (1/3) and Φ−1 (2/3), or larger than Φ−1 (2/3). Here Φ(·) is the
cumulative distribution function for standard normal distribution. The rows of X are
given by (I(Z1 = 0), I(Z1 = 1), I(Z1 = 2), . . . , I(Z5 = 0), I(Z5 = 1), I(Z5 = 2)).
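A sketch of how the design matrices just described might be generated; the variable names and the sample size used below are ours.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20  # training-set size used in Section 3

# Simulation 4: three blocks of 10 predictors, two of them driven by latent Z1, Z2.
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)
X4 = np.empty((n, 30))
X4[:, :10] = Z1[:, None] + rng.normal(0, 0.1, (n, 10))    # eps_j ~ N(0, 0.01)
X4[:, 10:20] = rng.standard_normal((n, 10))
X4[:, 20:30] = Z2[:, None] + rng.normal(0, 0.1, (n, 10))

# Simulation 5: trichotomize 5 correlated latent variables and dummy-code them.
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(5), np.arange(5)))
Z = rng.multivariate_normal(np.zeros(5), Sigma, size=n)
cuts = norm.ppf([1 / 3, 2 / 3])
levels = np.digitize(Z, cuts)          # 0, 1 or 2 for each latent variable
X5 = np.concatenate([(levels == k).astype(float)[:, j][:, None]
                     for j in range(5) for k in range(3)], axis=1)
```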
Within each simulation study, we consider four different choices for the distribution
of ui ’s.
• The first choice is a normal distribution N(µ, σ²), with µ chosen so that the θth
quantile is 0. σ² is set as 9.
• The second choice is a mixture of two normal distributions, 0.1 N(µ, σ₁²) +
0.9 N(µ, σ₂²), with µ chosen so that the θth quantile is 0. σ₁² is set as 1 and σ₂² is
set as 5.
Table 2: MMADs for Simulation 4. In the parentheses are standard deviations of the
MMADs obtained by 500 bootstrap resampling. The bold numbers correspond to the
smallest MMAD in each category.
θ      Method      Normal      Normal mixture      Laplace      Laplace mixture
θ = 0.5 BQR.L 3.736 (0.116) 5.613 (0.202) 4.907 (0.209) 6.941 (0.128)
BQR.EN 3.399 (0.143) 5.114 (0.177) 4.587 (0.122) 6.007 (0.134)
lasso 4.707 (0.268) 6.512 (0.388) 6.187 (0.365) 7.861 (0.196)
EN 3.172 (0.193) 5.329 (0.224) 4.326 (0.231) 6.490 (0.171)
QR 4.812 (0.300) 7.059 (0.229) 6.017 (0.263) 10.263 (0.552)
QR-L 7.480 (0.289) 7.766 (0.278) 7.899 (0.219) 8.581 (0.201)
θ = 0.3 BQR.L 3.840 (0.122) 6.141 (0.162) 5.135 (0.228) 7.012 (0.220)
BQR.EN 3.660 (0.189) 5.869 (0.172) 4.636 (0.239) 6.288 (0.145)
lasso 4.822 (0.399) 6.644 (0.227) 6.242 (0.157) 8.184 (0.274)
EN 3.490 (0.155) 5.645 (0.151) 4.841 (0.277) 6.585 (0.338)
QR 4.613 (0.155) 9.088 (0.401) 6.556 (0.411) 10.476 (0.388)
QR-L 6.641 (0.148) 7.726 (0.168) 7.180 (0.104) 8.366 (0.178)
θ = 0.1 BQR.L 3.920 (0.194) 7.861 (0.204) 5.474 (0.238) 9.217 (0.309)
BQR.EN 3.773 (0.147) 7.233 (0.197) 4.987 (0.198) 8.374 (0.183)
lasso 4.742 (0.301) 8.654 (0.225) 6.479 (0.310) 9.821 (0.265)
EN 3.577 (0.132) 8.046 (0.334) 4.836 (0.262) 9.581 (0.436)
QR 4.793 (0.288) 13.553 (0.704) 6.669 (0.319) 16.808 (0.921)
QR-L 6.890 (0.136) 9.865 (0.331) 7.637 (0.221) 10.760 (0.206)
• The third choice is a Laplace distribution Laplace(µ, b), with µ chosen so that the
θth quantile is 0. b is set to 3 so that the variance is 2b² = 18.
• The fourth choice is a mixture of two Laplace distributions, 0.1 Laplace(µ, b₁) +
0.9 Laplace(µ, b₂), with µ chosen so that the θth quantile is 0. b₁ is set as 1 and
b₂ is set as √5.
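In each case the location µ is the value that makes the θth quantile of the error distribution zero; a sketch of how it can be found numerically for the normal mixture (the other cases are analogous):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def normal_mixture_mu(theta, w=(0.1, 0.9), sigma2=(1.0, 5.0)):
    """Return mu such that the theta-th quantile of 0.1 N(mu, s1^2) + 0.9 N(mu, s2^2) is 0."""
    sd = np.sqrt(sigma2)
    # CDF of the mixture at 0, as a function of mu; we solve F(0) = theta.
    F0 = lambda mu: w[0] * norm.cdf(0, loc=mu, scale=sd[0]) + w[1] * norm.cdf(0, loc=mu, scale=sd[1])
    return brentq(lambda mu: F0(mu) - theta, -50, 50)

print(normal_mixture_mu(0.1))  # mu making the 10% quantile of the mixture equal to zero
```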
Note that we intentionally choose error distributions that differ from the skewed
Laplace distribution to see how Bayesian regularized quantile regression depends on
this error assumption, as there has been criticism that assigning a specific error
distribution departs from the semiparametric nature of quantile regression, which treats
the error distribution nonparametrically. Our simulation results show that, in terms of
parameter estimation accuracy and quantile estimation accuracy, the Bayesian
regularized quantile regression methods still perform well even when this error
distribution assumption is violated.
For each simulation study and each choice of the error distribution, we run 50 simu-
lations. In each simulation, we generate a training set with 20 observations, a validation
set with 20 observations, and a testing set with 200 observations. The validation set is
used to choose the penalty parameters in lasso (λ), EN (λ1 and λ2 ) and QR-L (λ). After
Table 3: MMADs for Simulation 5. In the parentheses are standard deviations of the
MMADs obtained by 500 bootstrap resampling. The bold numbers correspond to the
smallest MMAD in each category.
θ      Method      Normal      Normal mixture      Laplace      Laplace mixture
θ = 0.5 BQR.L 1.353 (0.058) 1.857 (0.144) 1.397 (0.109) 2.181 (0.085)
BQR.EN 1.228 (0.098) 1.827 (0.150) 1.563 (0.078) 2.263 (0.081)
lasso 1.269 (0.056) 2.197 (0.135) 1.895 (0.120) 2.530 (0.083)
EN 1.251 (0.058) 1.909 (0.124) 1.651 (0.107) 2.272 (0.134)
QR-L 1.498 (0.089) 1.676 (0.080) 1.627 (0.138) 1.836 (0.120)
BQR.GL 1.222 (0.087) 1.800 (0.070) 1.466 (0.077) 1.834 (0.080)
θ = 0.3 BQR.L 1.275 (0.066) 2.202 (0.058) 1.416 (0.101) 2.330 (0.091)
BQR.EN 1.295 (0.066) 2.240 (0.075) 1.653 (0.129) 2.434 (0.038)
lasso 1.308 (0.083) 2.326 (0.159) 1.783 (0.119) 2.791 (0.141)
EN 1.346 (0.061) 2.122 (0.219) 1.658 (0.064) 2.530 (0.036)
QR-L 1.360 (0.051) 2.257 (0.056) 1.621 (0.076) 2.492 (0.100)
BQR.GL 1.232 (0.048) 2.013 (0.110) 1.361 (0.136) 2.182 (0.075)
θ = 0.1 BQR.L 1.259 (0.047) 2.424 (0.018) 1.450 (0.060) 2.457 (0.027)
BQR.EN 1.315 (0.071) 2.491 (0.016) 1.524 (0.066) 2.510 (0.011)
lasso 1.292 (0.055) 2.693 (0.190) 1.856 (0.119) 2.530 (0.148)
EN 1.240 (0.058) 2.530 (0.101) 1.779 (0.086) 2.530 (0.022)
QR-L 1.746 (0.068) 6.025 (0.611) 2.343 (0.109) 6.972 (0.742)
BQR.GL 1.225 (0.058) 2.350 (0.045) 1.424 (0.070) 2.366 (0.049)
the penalty parameters are selected, we combine the training set and the validation set
together to estimate β. Since QR is not a regularization method, it does not need the
validation set to choose any penalty parameters, so we directly combine the training
and validation sets together to estimate β. Similarly, BQR.L, BQR.EN and BQR.GL
do not need the validation set since they estimate the penalty parameter automatically,
so we also combine the training and validation sets together for estimation. The testing
set is used to evaluate the performance for these methods.
Priors for the Bayesian methods are taken to be almost noninformative. In BQR.L
and BQR.GL, the parameters a, b, c, d in the Gamma priors for τ and η 2 are all set to
be 0.1. Similarly, in BQR.EN, the parameters a, b, c1 , d1 , c2 , d2 in the Gamma priors for
τ, η̃1 and η2 are also chosen to be 0.1.
In the first four simulation studies, there is no group structure in the predictors, so
BQR.GL reduces to BQR.L. In Simulation 5, we choose the positive definite matrices
Kg = dg Idg , where dg ’s are dimensions of βg ’s, as suggested by Yuan and Lin (2006).
We considered five different values of the given quantile θ: 10%, 30%, 50%, 70% and
90%. But since all four distributions of ui ’s are symmetric, only 10%, 30% and 50%
need to be reported. Due to the space limit, we only present the results for Simulations
1, 4 and 5 (Tables 1, 2 and 3).
Table 4: The parameter estimates for the first three simulation studies. The error
distribution is chosen to be normal and the quantile θ is chosen to be 0.1. Within each
simulation study, the median of the 50 estimates of β is reported.
Simulation Study Method β̂1 β̂2 β̂3 β̂4 β̂5 β̂6 β̂7 β̂8
Simulation 1 βtrue 3.000 1.500 0.000 0.000 2.000 0.000 0.000 0.000
BQR.L 2.917 1.436 0.053 0.038 1.623 -0.001 0.045 0.019
BQR.EN 2.852 1.522 0.091 0.099 1.638 -0.055 0.126 0.018
lasso 2.605 1.276 0.000 0.000 1.292 0.000 0.000 0.000
EN 2.606 1.569 0.198 0.021 1.249 0.000 0.000 0.000
QR 3.007 1.642 -0.030 0.123 2.000 -0.109 0.218 -0.007
QR-L 2.501 1.453 0.000 0.000 1.548 0.000 0.000 0.000
Simulation 2 βtrue 0.850 0.850 0.850 0.850 0.850 0.850 0.850 0.850
BQR.L 0.655 0.885 0.616 0.734 0.630 0.548 0.807 0.585
BQR.EN 0.714 0.879 0.759 0.797 0.726 0.556 0.926 0.729
lasso 0.534 0.756 0.758 0.839 0.630 0.592 0.728 0.527
EN 0.562 0.905 0.960 1.102 0.964 0.845 0.864 0.499
QR 0.884 1.000 0.808 0.962 0.864 0.729 1.057 0.831
QR-L 0.403 0.721 0.746 0.853 0.772 0.620 0.626 0.574
Simulation 3 βtrue 5.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
BQR.L 4.802 0.0851 0.034 0.032 0.008 -0.081 0.021 0.031
BQR.EN 4.689 0.1634 0.037 0.008 0.040 -0.140 0.093 -0.016
lasso 4.018 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
EN 3.956 0.1568 0.000 0.000 0.000 0.000 0.000 0.000
QR 4.981 0.1615 -0.030 0.123 0.025 -0.109 0.218 -0.007
QR-L 4.377 0.0017 0.000 0.000 0.000 0.000 0.000 0.000
In these tables, the MMAD stands for the median of mean absolute deviations, i.e.,
$\mathrm{median}\bigl(\frac{1}{200}\sum_{i=1}^{200} |x_i^T \hat{\beta} - x_i^T \beta^{\mathrm{true}}|\bigr)$, where the median
is taken over the 50 simulations. In the parentheses are the standard deviations of
the MMAD obtained from 500 bootstrap resamples of the 50 mean absolute deviations
(MAD).
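For concreteness, a sketch of how the MMAD and its bootstrap standard deviation can be computed from the 50 per-simulation MADs (array names are ours):

```python
import numpy as np

def mmad_and_bootstrap_sd(X_test, beta_hats, beta_true, n_boot=500, seed=0):
    """beta_hats: array of shape (50, p), one estimate per simulation replicate."""
    rng = np.random.default_rng(seed)
    fitted_true = X_test @ beta_true
    # Mean absolute deviation of the fitted quantiles for each replicate.
    mads = np.array([np.mean(np.abs(X_test @ b - fitted_true)) for b in beta_hats])
    mmad = np.median(mads)
    boot = np.array([np.median(rng.choice(mads, size=mads.size, replace=True))
                     for _ in range(n_boot)])
    return mmad, boot.std(ddof=1)
```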
Simulations 1, 2 and 3 show that, in terms of the MMAD, the two Bayesian regularized
quantile regression methods perform better than the other four methods in general,
especially for the last three non-normal error distributions. In Simulation 2, whose
results are not shown here, BQR.EN performs the best, having the smallest MMAD in
9 out of 12 simulation setups. In Simulation 3, with results not shown either, BQR.L
performs the best, having the smallest MMAD in 9 out of 12 simulation setups.
Although none of these four error distributions is assumed in the Bayesian regularized
quantile regression methods, the performance of BQR.L and BQR.EN shows that the
Bayesian methods are quite robust to the error distribution assumption. Secondly, as
the Bayesian counterpart of QR-L, BQR.L generally behaves better in terms of the
MMAD. Thirdly, while QR-L performs well for θ = 0.5 and θ = 0.3, its performance
drops noticeably when θ = 0.1. This phenomenon also shows up in
where x_{1i} is generated independently from N(0, 1), x_{3i} is generated independently from
the uniform distribution on [0, 1], and x_{2i} = x_{1i} + x_{3i} + z_i, where z_i follows N(0, 1). The
ε_i's are independent and normally distributed with variance 1 and mean equal to the
negative of the θth standard normal quantile. There are also five more independent
noise variables, x_4, ..., x_8, each distributed as N(0, 1). The results are summarized in
Table 5. Here we use the same prior specification as in Section 3.1, setting a = b = c =
d = 0.1 for BQR.L and a = b = c₁ = d₁ = c₂ = d₂ = 0.1 for BQR.EN. Table 5 contains
another performance criterion, the test error, which refers to the average check loss on
the independent testing data set (Wu and Liu 2009).
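The test error is simply the average check loss of the fitted θth quantile on the testing data; a sketch:

```python
import numpy as np

def test_error(beta_hat, X_test, y_test, theta):
    """Average check loss on an independent testing set (the criterion of Wu and Liu 2009)."""
    u = y_test - X_test @ beta_hat
    return np.mean(u * (theta - (u < 0)))
```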
From Table 5 we can see that BQR.L, BQR.EN, QR and QR-L behave significantly
better than lasso and EN, which demonstrates the broader applicability of quantile
regression based methods. Secondly, within the set of quantile regression based methods,
all four methods considered here behave similarly in terms of the test error. However,
in terms of the MMAD, BQR.L and BQR.EN uniformly outperform the others. Thirdly,
BQR.L performs better than its non-Bayesian counterpart, QR-L, uniformly for all
quantiles considered.
Table 5: MMADs and test errors for the simulation with heterogeneous random errors.
The bold numbers correspond to the smallest MMAD or test error in each category.
Figure 1: The estimates of the predictor effects for the Boston Housing data using
different methods. The given quantile is θ = 0.1. The 95% credible intervals given by
BQR.L and BQR.EN are also plotted.
Table 6: 10-fold cross-validation results for the Boston Housing data. The bold numbers
correspond to the smallest MMAD in each category.
Appendix
A. The Gibbs sampler for the lasso regularized quantile regression
Let β−k be the parameter vector β excluding the component βk , s−k be the variable
s excluding the component sk and ṽ−i be the variable ṽ excluding the component ṽi .
Denote X = (x1 , · · · , xn ). Then, the conditional distribution f (y|X, ṽ, β, s, τ, η 2 ) in
the lasso regularized quantile regression is
$$
\begin{aligned}
f(y \mid X, \tilde{v}, \beta, s, \tau, \eta^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \tau^{-1} \xi_2^2 \tilde{v}_i}} \exp\left\{-\frac{(y_i - x_i^T \beta - \xi_1 \tilde{v}_i)^2}{2\tau^{-1} \xi_2^2 \tilde{v}_i}\right\} \\
&= \exp\left\{-\frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - x_i^T \beta - \xi_1 \tilde{v}_i)^2}{\tau^{-1} \xi_2^2 \tilde{v}_i}\right\} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \tau^{-1} \xi_2^2 \tilde{v}_i}}.
\end{aligned}
$$
Thus, the full conditional distribution of ṽi is a generalized inverse Gaussian distribution.
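The exact generalized inverse Gaussian parameters follow from multiplying the likelihood above by the exponential prior on ṽ_i and completing the square; the constants in the sketch below are our derivation under that reading, not a quotation of the paper, and the scipy parameterization is mapped accordingly.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_v_i(y_i, x_i, beta, tau, theta, rng):
    """Draw v_i from its GIG full conditional (parameters derived, not quoted from the paper)."""
    xi1 = (1 - 2 * theta) / (theta * (1 - theta))
    xi2_sq = 2 / (theta * (1 - theta))
    chi = tau * (y_i - x_i @ beta) ** 2 / xi2_sq   # coefficient of 1/v_i
    psi = tau * (xi1 ** 2 / xi2_sq + 2.0)          # coefficient of v_i
    # GIG(lambda=1/2, chi, psi) in scipy's geninvgauss(p, b, scale) parameterization.
    return geninvgauss.rvs(p=0.5, b=np.sqrt(chi * psi),
                           scale=np.sqrt(chi / psi), random_state=rng)
```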
The full conditional distribution f (sk | X, y, ṽ, β, s−k , τ, η 2 ) of sk is
The full conditional f (sk | X, y, ṽ, β, s−k , τ, η 2 ) is then again a generalized inverse
Gaussian distribution. Now we consider the full conditional distribution of βk which is
given by
distribution is just the normal distribution N (µ̃k , σ̃k2 ). The full conditional distribution
of τ is
$$
\begin{aligned}
f(\tau \mid X, y, \tilde{v}, \beta, s, \eta^2) &\propto f(y \mid X, \tilde{v}, \beta, s, \tau, \eta)\, \pi(\tilde{v} \mid \tau)\, \pi(\tau) \\
&\propto \tau^{n/2} \exp\left\{-\frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - x_i^T \beta - \xi_1 \tilde{v}_i)^2}{\tau^{-1} \xi_2^2 \tilde{v}_i}\right\} \tau^{n} \exp\left(-\tau \sum_{i=1}^{n} \tilde{v}_i\right) \tau^{a-1} \exp(-b\tau) \\
&\propto \tau^{a + 3n/2 - 1} \exp\left\{-\tau \left[\sum_{i=1}^{n} \left(\frac{(y_i - x_i^T \beta - \xi_1 \tilde{v}_i)^2}{2\xi_2^2 \tilde{v}_i} + \tilde{v}_i\right) + b\right]\right\}.
\end{aligned}
$$
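Reading off the display, τ can be drawn directly from a Gamma distribution; a sketch in the shape-rate convention, with the same assumed ξ1, ξ2 constants as before:

```python
import numpy as np

def sample_tau(y, X, beta, v, theta, a, b, rng):
    """Draw tau from Gamma(a + 3n/2, rate) as implied by the display above."""
    xi1 = (1 - 2 * theta) / (theta * (1 - theta))
    xi2_sq = 2 / (theta * (1 - theta))
    resid = y - X @ beta - xi1 * v
    rate = np.sum(resid ** 2 / (2 * xi2_sq * v) + v) + b
    shape = a + 1.5 * len(y)
    return rng.gamma(shape, 1.0 / rate)
```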
That is, the full conditional distribution of τ is a Gamma distribution. At last, the full
conditional distribution of η² is

$$
f(\eta^2 \mid X, y, \tilde{v}, \beta, s, \tau) \propto (\eta^2)^{c-1} \exp(-d\eta^2) \prod_{k=1}^{p} \frac{\eta^2}{2} \exp\left(-\frac{\eta^2}{2} s_k\right) \propto (\eta^2)^{p+c-1} \exp\left\{-\eta^2 \left(d + \sum_{k=1}^{p} \frac{s_k}{2}\right)\right\},
$$

which is again a Gamma distribution.
B.1 Calculation of the constant C(η₁, η₂)

Before putting priors on η₁ and η₂, let us first calculate the quantity C(η₁, η₂). By (8),
we have
$$
\begin{aligned}
\pi(\beta_k \mid \eta_1, \eta_2) &= C(\eta_1, \eta_2)\, \frac{\eta_1}{2} \exp\{-\eta_1 |\beta_k| - \eta_2 \beta_k^2\} \\
&= C(\eta_1, \eta_2) \int_0^{\infty} \frac{1}{\sqrt{2\pi s_k}} \exp\left\{-\frac{1 + 2\eta_2 s_k}{2 s_k}\, \beta_k^2\right\} \frac{\eta_1^2}{2} \exp\left(-\frac{\eta_1^2}{2} s_k\right) ds_k.
\end{aligned}
$$

Making the change of variables $t_k = 1 + 2\eta_2 s_k$, we obtain

$$
\pi(\beta_k \mid \eta_1, \eta_2) = C(\eta_1, \eta_2) \int_1^{\infty} \frac{t_k^{-1/2}}{\sqrt{2\pi (t_k-1)/(2\eta_2 t_k)}} \exp\left\{-\frac{1}{2} \left(\frac{t_k - 1}{2\eta_2 t_k}\right)^{-1} \beta_k^2\right\} \frac{\eta_1^2}{4\eta_2} \exp\left\{-\frac{\eta_1^2}{4\eta_2} (t_k - 1)\right\} dt_k.
$$
Since $\int_{-\infty}^{\infty} \pi(\beta_k \mid \eta_1, \eta_2)\, d\beta_k = 1$, by Fubini's theorem, we have

$$
\begin{aligned}
\int_{-\infty}^{\infty} \pi(\beta_k \mid \eta_1, \eta_2)\, d\beta_k &= C(\eta_1, \eta_2) \int_1^{\infty} t_k^{-1/2}\, \frac{\eta_1^2}{4\eta_2} \exp\left\{-\frac{\eta_1^2}{4\eta_2} (t_k - 1)\right\} dt_k \\
&= C(\eta_1, \eta_2) \left(\frac{\eta_1^2}{4\eta_2}\right)^{1/2} \exp\left\{\frac{\eta_1^2}{4\eta_2}\right\} \int_{4^{-1}\eta_1^2\eta_2^{-1}}^{\infty} t^{-1/2} \exp(-t)\, dt \\
&= 1.
\end{aligned}
$$
Notice that $\int_{4^{-1}\eta_1^2\eta_2^{-1}}^{\infty} t^{-1/2} \exp(-t)\, dt = \Gamma(1/2,\, 4^{-1}\eta_1^2\eta_2^{-1})$ is the upper incomplete
gamma function. Therefore, we have

$$
C(\eta_1, \eta_2) = \Gamma^{-1}\left(1/2,\, \frac{\eta_1^2}{4\eta_2}\right) \left(\frac{\eta_1^2}{4\eta_2}\right)^{-1/2} \exp\left\{-\frac{\eta_1^2}{4\eta_2}\right\}.
$$
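A sketch of evaluating this constant numerically with SciPy's regularized upper incomplete gamma function:

```python
import numpy as np
from scipy.special import gammaincc, gamma

def C(eta1, eta2):
    """Normalizing constant C(eta1, eta2) from the display above."""
    x = eta1 ** 2 / (4 * eta2)
    upper_inc = gammaincc(0.5, x) * gamma(0.5)   # unregularized Gamma(1/2, x)
    return x ** (-0.5) * np.exp(-x) / upper_inc
```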
B.2 The Gibbs sampler for quantile regression with the elastic net penalty
The full conditional distributions of ṽi and τ are again the same as in the lasso regular-
ized quantile regression case. Let
$$
\tilde{y}_{ik} = y_i - \xi_1 \tilde{v}_i - \sum_{j=1, j\neq k}^{p} x_{ij} \beta_j, \qquad
\tilde{\sigma}_k^{-2} = \tau \xi_2^{-2} \sum_{i=1}^{n} x_{ik}^2 \tilde{v}_i^{-1} + 2\eta_2 t_k (t_k - 1)^{-1}, \qquad
\tilde{\mu}_k = \tilde{\sigma}_k^2\, \tau \xi_2^{-2} \sum_{i=1}^{n} \tilde{y}_{ik} x_{ik} \tilde{v}_i^{-1}.
$$
Then, similar to the lasso regularized quantile regression case, we can show that the full
conditional f (βk | X, y, ṽ, β−k , t, τ, η̃1 , η2 ) of βk is the normal distribution N (µ̃k , σ̃k2 ).
The full conditional distribution of tk − 1 is
$$
\begin{aligned}
f(t_k - 1 \mid y, X, \tilde{v}, \beta, t_{-k}, \tau, \tilde{\eta}_1, \eta_2)
&\propto \frac{1}{\sqrt{t_k - 1}} \exp\left\{-\frac{1}{2} \left(\frac{t_k - 1}{2\eta_2 t_k}\right)^{-1} \beta_k^2\right\} \exp\{-\tilde{\eta}_1 t_k\}\, I(t_k > 1) \\
&\propto \frac{1}{\sqrt{t_k - 1}} \exp\left\{-\frac{1}{2} \left[2\tilde{\eta}_1 (t_k - 1) + \frac{2\eta_2 \beta_k^2}{t_k - 1}\right]\right\} I(t_k - 1 > 0).
\end{aligned}
$$
Therefore, the full conditional distribution of t_k − 1 is a generalized inverse Gaussian distribution.
The full conditional distribution of η̃₁ is
$$
\begin{aligned}
f(\tilde{\eta}_1 \mid X, y, \tilde{v}, \beta, t, \tau, \eta_2)
&\propto \tilde{\eta}_1^{c_1 - 1} \exp(-d_1 \tilde{\eta}_1) \prod_{k=1}^{p} \Gamma^{-1}(1/2, \tilde{\eta}_1)\, \tilde{\eta}_1^{1/2} \exp\{-\tilde{\eta}_1 t_k\} \\
&\propto \Gamma^{-p}(1/2, \tilde{\eta}_1)\, \tilde{\eta}_1^{p/2 + c_1 - 1} \exp\left\{-\tilde{\eta}_1 \left[d_1 + \sum_{k=1}^{p} t_k\right]\right\}.
\end{aligned}
$$
The full conditional distribution of η₂ is

$$
\begin{aligned}
f(\eta_2 \mid X, y, \tilde{v}, \beta, t, \tau, \tilde{\eta}_1)
&\propto \eta_2^{c_2 - 1} \exp(-d_2 \eta_2) \prod_{k=1}^{p} \eta_2^{1/2} \exp\left\{-\frac{1}{2} \left(\frac{t_k - 1}{2\eta_2 t_k}\right)^{-1} \beta_k^2\right\} \\
&\propto \eta_2^{p/2 + c_2 - 1} \exp\left\{-\eta_2 \left(d_2 + \sum_{k=1}^{p} t_k (t_k - 1)^{-1} \beta_k^2\right)\right\},
\end{aligned}
$$

which is a Gamma distribution.
C. The Gibbs sampler for the group lasso regularized quantile regression
Let β−g be the parameter vector β excluding the component vector βg . It is easy to see
that the full conditional distributions of ṽi and τ are the same as in the lasso regularized
quantile regression. The full conditional distribution of sg is
That is, the full conditional distribution of sg is again a generalized inverse Gaussian
distribution. The full conditional distribution of βg is
Let $\tilde{y}_{ig} = y_i - \sum_{k=1, k\neq g}^{G} x_{ik}^T \beta_k - \xi_1 \tilde{v}_i$. Then, we have
$$
\begin{aligned}
f(\eta^2 \mid X, y, \tilde{v}, \beta, s, \tau)
&\propto \prod_{g=1}^{G} \left(\eta^2\right)^{(d_g+1)/2} \exp\left(-\frac{\eta^2}{2} s_g\right) \left(\eta^2\right)^{c-1} \exp\left(-d\eta^2\right) \\
&\propto \left(\eta^2\right)^{(p+G)/2 + c - 1} \exp\left\{-\eta^2 \left(\sum_{g=1}^{G} s_g / 2 + d\right)\right\},
\end{aligned}
$$

which is again a Gamma distribution.
References
Andrews, D. F. and Mallows, C. L. (1974). “Scale Mixtures of Normal Distributions.”
Journal of the Royal Statistical Society, Ser. B, 36: 99–102. 537
Bae, K. and Mallick, B. (2004). “Gene Selection Using a Two-Level Hierarchical
Bayesian Model.” Bioinformatics, 20: 3423–3430. 534
Bakin, S. (1999). “Adaptive Regression and Model Selection in Data Mining Problems.”
PhD thesis, Australian National University, Canberra. 539
Fan, J. and Li, R. (2001). “Variable Selection via Nonconcave Penalized Likelihood and
Its Oracle Properties.” Journal of the American Statistical Association, 96: 1348–
1360. 534
Geraci, M. and Bottai, M. (2007). “Quantile Regression for Longitudinal Data Using
the Asymmetric Laplace Distribution.” Biostatistics, 8: 140–154. 534
Hanson, T. and Johnson, W. (2002). “Modeling Regression Error with a Mixture of
Pólya trees.” Journal of the American Statistical Association, 97: 1020–1033. 534
Harrison, D. and Rubinfeld, D. L. (1978). “Hedonic Prices and the Demand for Clean
Air.” Journal of Environmental Economics and Management, 5: 81–102. 547
Hjort, N. and Walker, S. (2009). “Quantile Pyramids for Bayesian Nonparametrics.”
Annals of Statistics, 37: 105–131. 534
Jørgensen, B. (1982). Statistical Properties of the Generalized Inverse Gaussian Distri-
bution, Lecture Notes in Statistics, volume 9. Springer-Verlag. 537
Koenker, R. (2004). “Quantile Regression for Longitudinal Data.” Journal of Multi-
variate Analysis, 91: 74–89. 534
— (2005). Quantile Regression. New York: Cambridge University Press. 533
Wang, H., Li, G., and Jiang, G. (2007). “Robust Regression Shrinkage and Consis-
tent Variable Selection Through the LAD-Lasso.” Journal of Business & Economic
Statistics, 25: 347–355. 534
Wu, Y. and Liu, Y. (2009). “Variable Selection in Quantile Regression.” Statistica
Sinica, 19: 801–817. 533, 534, 545
Yu, K. and Moyeed, R. (2001). “Bayesian Quantile Regression.” Statistics & Probability
Letters, 54: 437–447. 533
Yuan, M. and Lin, Y. (2006). “Model Selection and Estimation in Regression with
Grouped Variables.” Journal of the Royal Statistical Society, Ser. B, 68: 49–67. 534,
539, 543
Zou, H. and Hastie, T. (2005). “Regularization and Variable Selection via the Elastic
Net.” Journal of the Royal Statistical Society, Ser. B, 67: 301–320. 534
Zou, H. and Yuan, M. (2008a). “Composite Quantile Regression and The Oracle Model
Selection Theory.” Annals of Statistics, 36: 1108–1126. 549