All of Linear Regression
1. Introduction
Linear regression is one of the oldest and most widely practiced data analysis
methods. In many real data settings, least squares linear regression leads to performance
on par with state-of-the-art (and often far more complicated) methods
while remaining amenable to interpretation. These advantages, coupled with the
argument that "all models are wrong," warrant a detailed study of the least squares
linear regression estimator in settings that are close to practical/realistic ones.
Instead of proposing assumptions that we think are practical/realistic, we start
with a clean slate. We start by not assuming anything about the observations
$(X_1^\top, Y_1)^\top, \ldots, (X_n^\top, Y_n)^\top \in \mathbb{R}^d \times \mathbb{R}$ and study the OLS estimator $\hat\beta$ given by
$$
\hat\beta := \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^\top\theta)^2,
$$
where $\arg\min$ represents a $\theta$ at which the minimum is attained; this $\hat\beta$ may not
be unique, in which case any of the minimizers is set as $\hat\beta$. This clean slate study
should be compared to the usual assumption-laden approach, where one usually
starts by assuming that there exists a vector $\beta_0 \in \mathbb{R}^d$ such that $Y_i = X_i^\top\beta_0 + \varepsilon_i$ for
independent and identically distributed Gaussian homoscedastic errors $\varepsilon_1, \ldots, \varepsilon_n$.
The classical linear regression setting (Gauss-Markov model) sometimes also assumes
$X_1, \ldots, X_n$ are deterministic/non-stochastic. In this model, it is well known
that $\hat\beta$ has a normal distribution and is the best linear unbiased estimator (BLUE)
for every sample size $n \ge d$.
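As a minimal numerical illustration of this assumption-lean starting point, the following sketch computes $\hat\beta$ without positing any model for the observations. It is written in Python with numpy; the data-generating process shown is purely hypothetical (nonlinear mean, heavy-tailed errors), chosen only so that no linear model, Gaussianity, or homoscedasticity holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5

# Hypothetical observations: no linear model, Gaussianity, or homoscedasticity is assumed.
X = rng.standard_t(df=5, size=(n, d))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.standard_t(df=5, size=n)

# OLS estimator: any minimizer of the average squared error.
# lstsq returns a minimum-norm minimizer even when the minimizer is not unique.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
```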
Why is a clean slate study possible? At first glance it might seem strange
that a study without assumptions is possible. For a simple explanation, set
$$
\hat\Gamma := \frac{1}{n}\sum_{i=1}^n X_i Y_i \quad \text{and} \quad \hat\Sigma := \frac{1}{n}\sum_{i=1}^n X_i X_i^\top. \tag{1}
$$
Recall the quantities $\hat\Gamma$ and $\hat\Sigma$ defined in (1). The following result proves deterministic
bounds on the estimation error and the linear representation error for the OLS
estimator $\hat\beta$. Let $(t)_+ := \max\{0, t\}$ for $t \in \mathbb{R}$ and, for any $\Sigma \in \mathbb{R}^{d \times d}$, set
$$
D^{\Sigma} := \|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_d\|_{\mathrm{op}}. \tag{4}
$$
Theorem 2.1 (Deterministic Inequality). For any symmetric matrix $\Sigma \in \mathbb{R}^{d \times d}$ and
for any vector $\beta \in \mathbb{R}^d$, we have
$$
(1 + D^{\Sigma})^{-1}\,\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} \;\le\; \|\hat\beta - \beta\|_{\Sigma} \;\le\; (1 - D^{\Sigma})_+^{-1}\,\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma}. \tag{5}
$$
Furthermore,
$$
\|\hat\beta - \beta - \Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} \;\le\; D^{\Sigma}(1 - D^{\Sigma})_+^{-1}\,\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma}. \tag{6}
$$
Proof. From the definition of $\hat\beta$, we have the normal equations $\hat\Sigma\hat\beta = \hat\Gamma$. Subtracting
$\hat\Sigma\beta \in \mathbb{R}^d$ from both sides, we get $\hat\Sigma(\hat\beta - \beta) = \hat\Gamma - \hat\Sigma\beta$, which is equivalent to
$$
(\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2})\,\Sigma^{1/2}(\hat\beta - \beta) = \Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta).
$$
Adding and subtracting $I_d$ from the parenthesized term with further rearrangement,
we get $\Sigma^{1/2}\big[(\hat\beta - \beta) - \Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\big] = \big(I_d - \Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}\big)\Sigma^{1/2}(\hat\beta - \beta)$.
Taking the Euclidean norm on both sides yields
$$
\big\|\Sigma^{1/2}\big[\hat\beta - \beta - \Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\big]\big\| = \big\|\big(I_d - \Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}\big)\Sigma^{1/2}(\hat\beta - \beta)\big\|.
$$
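Because Theorem 2.1 is purely deterministic, it can be checked directly on any data set. The sketch below does so on one simulated draw; the standard normal design (so that one may take $\Sigma = I_d$) and the candidate vector $\beta$ are our illustrative choices, not requirements of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 4
X = rng.standard_normal((n, d))                 # illustrative design; take Sigma = I_d below
Y = X @ np.ones(d) + rng.standard_normal(n)
beta = np.ones(d)                               # any candidate vector beta is allowed

Sigma = np.eye(d)
Sigma_hat = X.T @ X / n
Gamma_hat = X.T @ Y / n
beta_hat = np.linalg.solve(Sigma_hat, Gamma_hat)

D = np.linalg.norm(Sigma_hat - np.eye(d), ord=2)            # D^Sigma for Sigma = I_d
lin = np.linalg.solve(Sigma, Gamma_hat - Sigma_hat @ beta)  # Sigma^{-1}(Gamma_hat - Sigma_hat beta)
lin_norm = np.sqrt(lin @ Sigma @ lin)
err = np.sqrt((beta_hat - beta) @ Sigma @ (beta_hat - beta))

# Inequality (5): two-sided bound on the estimation error.
assert lin_norm / (1 + D) <= err <= lin_norm / max(1 - D, 1e-12)
# Inequality (6): bound on the linear representation error.
rem = beta_hat - beta - lin
assert np.sqrt(rem @ Sigma @ rem) <= D * lin_norm / max(1 - D, 1e-12)
```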
Flexibility in the Choice of $\Sigma$ and $\beta$. For most purposes the canonical choices
of $\Sigma, \beta$ in (8) suffice, but for some applications involving sub-sampling and cross-validation,
the flexibility in choosing $\Sigma, \beta$ helps. For instance, consider the OLS
estimator constructed based on the first $n-1$ observations, that is,
$$
\hat\beta_{-n} := \operatorname*{arg\,min}_{\theta\in\mathbb{R}^d}\,(n-1)^{-1}\sum_{i=1}^{n-1}(Y_i - X_i^\top\theta)^2 = \operatorname*{arg\,min}_{\theta\in\mathbb{R}^d}\,\big\{-2\theta^\top\hat\Gamma_{-n} + \theta^\top\hat\Sigma_{-n}\theta\big\},
$$
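A small sketch of this leave-one-out flexibility: $\hat\beta_{-n}$ can be computed directly from rank-one downdates of $\hat\Gamma$ and $\hat\Sigma$. The arrays X and Y are assumed to be the hypothetical data from the earlier sketches.

```python
import numpy as np

def loo_ols(X, Y):
    """OLS estimator based on the first n-1 observations via downdated Gram quantities."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    Gamma_hat = X.T @ Y / n
    # Rank-one downdates after dropping observation n.
    Sigma_minus = (n * Sigma_hat - np.outer(X[-1], X[-1])) / (n - 1)
    Gamma_minus = (n * Gamma_hat - X[-1] * Y[-1]) / (n - 1)
    return np.linalg.solve(Sigma_minus, Gamma_minus)
```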
2.1. Consistency of β̂
If $D^{\Sigma} < 1$, then the inequalities in (5) provide both upper and lower bounds
on the estimation error $\|\hat\beta - \beta\|_{\Sigma}$ that match up to a constant multiple. This allows
one to state that a necessary and sufficient condition for convergence of $\|\hat\beta - \beta\|_{\Sigma}$ to
zero is that $\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma}$ converges to zero. Note that with the choices in (8),
$\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)$ is a mean zero random vector obtained by averaging $n$ random vectors,
and hence "weak" dependence implies convergence of its covariance to zero, which implies
convergence to zero. This implies consistency of the OLS estimator $\hat\beta$ for $\beta$:
Corollary 2.1 (Consistency). If $D^{\Sigma} < 1$ and $\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma}$ converges to zero
in probability, then $\|\hat\beta - \beta\|_{\Sigma}$ converges to zero in probability.
Turning to inequality (6), note that if $D^{\Sigma} \to 0$ (in an appropriate sense), then
inequality (6) provides an expansion of $\hat\beta - \beta$, since the remainder (the right
hand side of (6)) is of smaller order than $\hat\beta - \beta$. Observe that $\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta) =
n^{-1}\sum_{i=1}^n \Sigma^{-1}X_i(Y_i - X_i^\top\beta)$, and hence (6) shows that $\hat\beta - \beta$ behaves like an average.
where $\mathcal{C}_d$ represents the set of all convex sets in $\mathbb{R}^d$ and $K := \mathrm{Var}(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta))$.
For any matrix $A$, let $\|A\|_{\mathrm{HS}}$ represent the Hilbert–Schmidt (or Frobenius) norm,
that is, $\|A\|_{\mathrm{HS}}^2 := \sum_{i,j} A^2(i,j)$. Also, for any positive semi-definite matrix $A$, let
$\|A\|_*$ denote the nuclear norm of the matrix $A$.
Corollary 2.2 (Berry–Esseen bound for OLS). Fix any $\eta \in (0,1)$. Then there
exist universal constants $c_1, c_2 > 0$ such that for all $n \ge 1$,
$$
\sup_{A\in\mathcal{C}_d}\Big|\mathbb{P}\big(\hat\beta - \beta \in A\big) - \mathbb{P}\big(N(0, \Sigma^{-1/2}K\Sigma^{-1/2}) \in A\big)\Big|
\;\le\; 4\Delta_n + 2n^{-1} + c_2\|K^{-1}\|_*^{1/4} r_n\eta + \mathbb{P}\big(D^{\Sigma} > \eta\big),
$$
where recall $D^{\Sigma}$ from (4) and $r_n := c_1^{-1}\|K^{1/2}\|_{\mathrm{op}}\sqrt{\log n} + \|K^{1/2}\|_{\mathrm{HS}}$.
The proof of the corollary can be found in Appendix A, and it does not require (8).
The proof of normal approximation for multivariate minimum contrast
estimators in Pfanzagl (1973) is very similar to that of Corollary 2.2. Like Theorem 2.1,
Corollary 2.2 is also a finite sample result that does not assume any
specific dependence structure on the observations. The quantity $\Delta_n$ in (10) quantifies
the convergence of the right hand side of (9) to a normal distribution
and can be bounded using available multivariate Berry–Esseen bounds. Such bounds
for independent (but not necessarily identically distributed) random vectors can
be found in Bentkus (2004) and Raič (2018). For dependent settings, multivariate
Berry–Esseen bounds are hard to find, but the univariate versions available (in Romano
and Wolf (2000) and Hörmann (2009)) can be extended to multivariate versions
by the characteristic function method and smoothing inequalities. In this respect,
we note here that the proof of Corollary 2.2 can be extended to prove a normal
approximation result for $\alpha^\top(\hat\beta - \beta)$ for any specific direction $\alpha \in \mathbb{R}^d$, and for this
univariate random variable the results from the above references apply directly. Finally, to
get concrete rates from the bound in Corollary 2.2, we only need to choose $\eta \in (0,1)$,
and for this we need to control the tail probability of $D^{\Sigma}$ in (4). There are two
choices for this. Firstly, assuming moment bounds for $X_1, \ldots, X_n$, it is possible to
get a tail bound for $D^{\Sigma}$ under reasonable dependence structures; see Kuchibhotla
et al. (2018a) and Koltchinskii and Lounici (2017a) for the independent case. Secondly,
one can use a Berry–Esseen type result (Koltchinskii and Lounici, 2017b)
for $D^{\Sigma}$, which also implies an exponential tail bound up to an analogue of the $\Delta_n$ term.
Under a weak enough dependence structure, $\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta)$ is $O_p(n^{-1/2})$ in any fixed
direction and hence $\|K\|_{\mathrm{op}} = O_p(n^{-1})$, where, recall, $K = \mathrm{Var}(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta))$.
Assuming $\|K^{-1}\|_{\mathrm{op}} \asymp \|K\|_{\mathrm{op}}^{-1}$, we get $\|K^{-1}\|_*^{1/4} r_n = O\big(n^{-1/4}[p^{1/4}\sqrt{\log n} + p^{3/4}]\big)$. In
the best case scenario $\Delta_n \ge O(p^{7/4}n^{-1/2})$ and hence, to match this rate, we can
take $\eta = O(n^{-1/4})$, which is a permissible choice under (11). Hence we can claim
$$
\sup_{A\in\mathcal{C}_d}\Big|\mathbb{P}\big(\hat\beta - \beta \in A\big) - \mathbb{P}\big(N(0, \Sigma^{-1/2}K\Sigma^{-1/2}) \in A\big)\Big| = O(1)\,\frac{p^{7/4}}{n^{1/2}}.
$$
We have intentionally left the conditions vague; they will be made precise in Section 5.
since $\eta$ can be taken to be zero in the limit. In fact, a careful modification of the proof
leads to the sharper right hand side $\Delta_n$. These calculations hint at a previously unnoticed
phenomenon: the bounds for random covariates are inherently larger than
those for fixed covariates (although they are all of the same order). A similar statement
also holds when some of the covariates are fixed and others are random (the bounds
have extra terms only for the random set of covariates). This phenomenon means that,
when working with finite samples, the statistical conclusions can be significantly
distorted depending on whether the covariates are treated as fixed or random. Here it
is worth mentioning that the canonical choice of $\beta$ changes depending on whether
the covariates are treated as random or fixed. If the covariates are fixed, then the canonical
choice of $\beta$ is $\beta = \big(n^{-1}\sum_{i=1}^n x_ix_i^\top\big)^{-1}\big(n^{-1}\sum_{i=1}^n x_i\mathbb{E}[Y_i]\big)$, where we write $x_i$ (rather
than $X_i$) to represent the fixed nature of the covariates. If the covariates are random, then
the canonical choice of $\beta$ is $\beta = \big(n^{-1}\sum_{i=1}^n \mathbb{E}[X_iX_i^\top]\big)^{-1}\big(n^{-1}\sum_{i=1}^n \mathbb{E}[X_iY_i]\big)$.
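The distinction between the two canonical targets is easy to probe by simulation. The sketch below approximates both choices of $\beta$ under a hypothetical nonlinear data-generating process of our own making; it is only meant to show that the fixed-design and random-design targets are computed from different averages.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, reps = 200, 3, 2000

# Fixed-design target: the covariates x_i are held fixed; only E[Y_i] enters.
x = rng.standard_normal((n, d))
EY_fixed = np.tanh(x[:, 0]) + x[:, 1] * x[:, 2]
beta_fixed = np.linalg.solve(x.T @ x / n, x.T @ EY_fixed / n)

# Random-design target: expectations also average over the covariate distribution.
EXX, EXY = np.zeros((d, d)), np.zeros(d)
for _ in range(reps):
    X = rng.standard_normal((n, d))
    Y = np.tanh(X[:, 0]) + X[:, 1] * X[:, 2] + rng.standard_normal(n)
    EXX += X.T @ X / n / reps
    EXY += X.T @ Y / n / reps
beta_random = np.linalg.solve(EXX, EXY)
print(beta_fixed, beta_random)
```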
3. Statistical Inference for the OLS estimator
which is, sometimes, referred to as the sandwich variance. The two ends of the
variance, $\Sigma^{-1}$, can be estimated by $\hat\Sigma^{-1}$. The only troublesome part is the "meat"
part, which is the variance of a mean zero average. Estimation of this part requires
an understanding of the dependence structure of the observations. For instance, if the
observations are independent, then we can readily write
$$
V = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}\big(X_i(Y_i - X_i^\top\beta)\big) \;\preceq\; \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}\big[X_iX_i^\top(Y_i - X_i^\top\beta)^2\big]. \tag{13}
$$
The inequality above is the matrix inequality representing that the difference of the matrices
is positive semi-definite. A strict inequality above can hold since the observations
need not satisfy $\mathbb{E}[X_i(Y_i - X_i^\top\beta)] = 0$. (The definition of $\beta$ only implies
$\sum_{i=1}^n \mathbb{E}[X_i(Y_i - X_i^\top\beta)] = 0$.) The last term on the right of (13) can be estimated
by $n^{-2}\sum_{i=1}^n X_iX_i^\top(Y_i - X_i^\top\hat\beta)^2$ (obtained by removing the expectation and then
substituting $\hat\beta$ for $\beta$).
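A minimal sketch of this plug-in sandwich estimator under independence (continuing the hypothetical arrays X and Y from the earlier sketches; the helper name sandwich_cov is ours):

```python
import numpy as np

def sandwich_cov(X, Y):
    """Plug-in sandwich estimate of Var(beta_hat) for independent observations."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    beta_hat = np.linalg.solve(Sigma_hat, X.T @ Y / n)
    resid = Y - X @ beta_hat
    # "Meat": n^{-2} sum_i X_i X_i^T (Y_i - X_i^T beta_hat)^2
    meat = (X * resid[:, None] ** 2).T @ X / n ** 2
    bread_inv = np.linalg.inv(Sigma_hat)
    return bread_inv @ meat @ bread_inv       # estimate of Sigma^{-1} V Sigma^{-1}
```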
where $\chi^2_{d,\alpha}$ represents the $(1-\alpha)$-th quantile of the chi-square distribution with $d$
degrees of freedom. If $\hat V$ is an asymptotically conservative estimator for $V$, that is,
$\hat V \to \bar V$ (in an appropriate sense) and $\bar V \succeq V$, then
$$
\mathbb{P}\Big(N(0,\Sigma^{-1}V\Sigma^{-1})^\top \hat\Sigma\hat V^{-1}\hat\Sigma\, N(0,\Sigma^{-1}V\Sigma^{-1}) \le \chi^2_{d,\alpha}\Big)
\;\to\; \mathbb{P}\Big(N(0,\bar V^{-1/2}V\bar V^{-1/2})^\top N(0,\bar V^{-1/2}V\bar V^{-1/2}) \le \chi^2_{d,\alpha}\Big) \ge 1 - \alpha,
$$
where strict inequality holds if $\bar V \succ V$; the inequality above is true because of
Anderson's lemma (Anderson, 1955, Corollary 3) and it may not be true for non-symmetric
confidence regions. An alternate $(1-\alpha)$-confidence region for $\beta$ is
$$
\hat R_{\infty,\alpha} := \Big\{\theta \in \mathbb{R}^d : \max_{1\le j\le d}\big|\widehat{\mathrm{AV}}_j^{-1/2}(\hat\beta_j - \theta_j)\big| \le z_{\infty,\alpha}\Big\}, \tag{15}
$$
where $\widehat{\mathrm{AV}}_j$ represents the $j$-th diagonal entry of the variance estimator $\hat\Sigma^{-1}\hat V\hat\Sigma^{-1}$
and $z_{\infty,\alpha}$ is the $(1-\alpha)$-th quantile of $\max_{1\le j\le d}|\mathrm{AV}_j^{-1/2} N(0,\mathrm{AV})_j|$, with $\mathrm{AV} \in \mathbb{R}^{d\times d}$
representing the variance matrix $\Sigma^{-1}V\Sigma^{-1}$.
Hypothesis tests for $\beta \in \mathbb{R}^d$ can also be performed based on the statistics used
in (14) and (15). It is easy to verify that neither statistic uniformly beats the other
in terms of power. Tests for a single coordinate $\beta_j$ are easy to obtain from the
statistic $(\hat\beta_j - \beta_j)/\widehat{\mathrm{AV}}_j^{1/2}$, which is close to a standard normal random variable.
The advantage of $\hat R_{\infty,\alpha}$ over $\hat R_{2,\alpha}$ is that it leads to a rectangular region and
hence easily interpretable inference for the coordinates of $\beta$. The confidence region
$\hat R_{2,\alpha}$, which is elliptical, makes this interpretation difficult.
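A sketch of the rectangular region in (15), with the quantile $z_{\infty,\alpha}$ approximated by simulating from the estimated normal limit; the helper names are ours and the sketch reuses sandwich_cov from above.

```python
import numpy as np

def max_t_confidence_intervals(X, Y, alpha=0.05, n_draws=10_000, seed=0):
    """Simultaneous intervals for all coordinates of beta (rectangular region R_hat_{infty,alpha})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    AV_hat = sandwich_cov(X, Y)                       # estimate of Sigma^{-1} V Sigma^{-1}
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    sd = np.sqrt(np.diag(AV_hat))
    # Approximate the (1 - alpha) quantile of max_j |N(0, AV)_j| / sd_j by simulation.
    Z = rng.multivariate_normal(np.zeros(d), AV_hat, size=n_draws)
    z_inf = np.quantile(np.max(np.abs(Z) / sd, axis=1), 1 - alpha)
    return beta_hat - z_inf * sd, beta_hat + z_inf * sd
```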
Inference based on a closed form variance estimator can be thought of as a direct
method and is, in general, hard to extend to general dependence structures. A safer
and more unified way of estimating the variance is to use a resampling
scheme. The bootstrap and subsampling, or their block versions, are robust
to slight changes in the dependence structure and are more widely applicable. The
literature along these lines is too vast to review here, and we refer the reader to Kunsch
(1989), Liu and Singh (1992), Politis and Romano (1994), and Lahiri (1999) for general
block resampling techniques for variance/distribution estimation. A finite sample
study of the direct method is easy, while such a study for resampling methods (under
dependence) is as yet non-existent.
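For example, a moving-block bootstrap estimate of the variance of $\hat\beta$ might look like the following sketch; the block length and the helper name are our illustrative choices, not prescriptions from the text.

```python
import numpy as np

def moving_block_bootstrap_cov(X, Y, block_len=10, n_boot=500, seed=0):
    """Moving-block bootstrap estimate of Var(beta_hat) for weakly dependent observations."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    starts_all = np.arange(n - block_len + 1)
    n_blocks = int(np.ceil(n / block_len))
    betas = []
    for _ in range(n_boot):
        starts = rng.choice(starts_all, size=n_blocks, replace=True)
        idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n]
        betas.append(np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0])
    return np.cov(np.array(betas), rowvar=False)
```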
Having understood the properties of the OLS estimator obtained from the full set
of covariates, we now proceed to the practically important aspect of OLS under
variable selection. More often than not, the set of covariates in
the final reported model is not the same as the full set of covariates and, more
concerningly, the final set of covariates is chosen based on the data at hand. For
concreteness, let $\hat M \subseteq \{1,2,\ldots,d\}$ represent the set of covariates selected and let
$\hat\beta_{\hat M}$ represent the OLS estimator constructed based on the covariate (indices) in $\hat M$.
More generally, for any set $M \subseteq \{1,2,\ldots,d\}$, let $\hat\beta_M$ represent the OLS estimator
from the covariates in $M$, that is,
$$
\hat\beta_M := \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \sum_{i=1}^n (Y_i - X_{i,M}^\top\theta)^2.
$$
The aim of this section is to understand the properties of $\hat\beta_{\hat M}$ (irrespective of how
$\hat M$ is chosen). This problem further highlights the strength of the deterministic
inequality in Theorem 2.1, which applies irrespective of the randomness of $\hat M$. Define,
for any $M \subseteq \{1,2,\ldots,d\}$, the canonical "target" for the OLS estimator $\hat\beta_M$ as
$$
\beta_M := \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \sum_{i=1}^n \mathbb{E}\big[(Y_i - X_{i,M}^\top\theta)^2\big].
$$
Also, define $D^{\Sigma}_M := \|\Sigma_M^{-1/2}\hat\Sigma_M\Sigma_M^{-1/2} - I_{|M|}\|_{\mathrm{op}}$, where recall that $\Sigma_M$ (and $\hat\Sigma_M$) represents
the corresponding submatrix of $\Sigma$ (and $\hat\Sigma$). Recall $(t)_+ := \max\{0, t\}$.
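For concreteness, a sketch that computes the submodel estimator $\hat\beta_M$ for every model in a small collection (here all models of size at most max_size, a hypothetical choice):

```python
import numpy as np
from itertools import combinations

def submodel_ols(X, Y, max_size=2):
    """OLS estimator beta_hat_M for every model M with 1 <= |M| <= max_size."""
    d = X.shape[1]
    estimates = {}
    for k in range(1, max_size + 1):
        for M in combinations(range(d), k):
            estimates[M] = np.linalg.lstsq(X[:, list(M)], Y, rcond=None)[0]
    return estimates
```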
Corollary 4.1. For any $\hat M$, we have
$$
\big\|\hat\beta_{\hat M} - \beta_{\hat M} - \Sigma_{\hat M}^{-1}(\hat\Gamma_{\hat M} - \hat\Sigma_{\hat M}\beta_{\hat M})\big\|_{\Sigma_{\hat M}}
\;\le\; \frac{D^{\Sigma}_{\hat M}}{1 - D^{\Sigma}_{\hat M}}\,\big\|\Sigma_{\hat M}^{-1}(\hat\Gamma_{\hat M} - \hat\Sigma_{\hat M}\beta_{\hat M})\big\|_{\Sigma_{\hat M}}.
$$
Corollary 4.1 follows immediately from Theorem 2.1; for simplicity it is
stated with the $\Sigma, \beta$ choices in (8), but other choices are possible. The first inequality
in the corollary proves an influence function type expansion for the estimator $\hat\beta_{\hat M}$
around (a possibly random) target vector $\beta_{\hat M}$. In order to prove convergence of
the remainder in this expansion to zero, one needs to control $D^{\Sigma}_{\hat M}$, which can be a
bit complicated to deal with directly. With some information on how "strongly"
dependent $\hat M$ is on the data, such a direct approach can be worked out; see Russo
and Zou (2016, Proposition 1) and Jiao et al. (2018). If no information is available other than the
fact that $\hat M \in \mathcal{M}$ for some set $\mathcal{M}$ of subsets of covariates, then we have
$$
D^{\Sigma}_{\hat M} \le U_{\hat M} \times \max_{M\in\mathcal{M}} \frac{D^{\Sigma}_M}{U_M}, \tag{17}
$$
for any set of (non-stochastic) numbers $\{U_M : M \in \mathcal{M}\}$; $U_M$ usually converges to
zero at rate $\sqrt{|M|\log(ed/|M|)/n}$; see Proposition 5.1. Some examples of $\mathcal{M}$ include
$$
\mathcal{M}_{\le k} := \{M \subseteq \{1,\ldots,d\} : 1 \le |M| \le k\}, \qquad \mathcal{M}_{=k} := \{M \subseteq \{1,\ldots,d\} : |M| = k\},
$$
for some $k \ge 1$. Note that the maximum on the right hand side of (17) is random
only through $\hat\Sigma$ (dissolving the randomness of $\hat M$ into the maximum over $\mathcal{M}$). We
will take this indirect approach in our study since we do not want to make any
assumption on how the model $\hat M$ is obtained, which might as well be adversarial.
Further note that (17) is tight (in that it cannot be improved) in an agnostic setting
since one can take $\hat M$ such that $D^{\Sigma}_{\hat M}/U_{\hat M} = \max_{M\in\mathcal{M}} D^{\Sigma}_M/U_M$. We take the same
indirect approach to bound $\|\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|_{\Sigma_M}$ over $M \in \mathcal{M}$. These bounds
prove consistency and linear representation error bounds for the OLS estimator
under variable selection. Similar results can be derived for other modifications of
the OLS estimator such as transformations.
From Corollary 4.1 it is easy to prove the following corollary (similar to Corollary 2.1)
for consistency.
Corollary 4.2 (Consistency of $\hat\beta_{\hat M}$). If $D^{\Sigma}_{\hat M} < 1$ and $\|\Sigma_{\hat M}^{-1}(\hat\Gamma_{\hat M} - \hat\Sigma_{\hat M}\beta_{\hat M})\|_{\Sigma_{\hat M}} \to 0$ in
probability, then $\|\hat\beta_{\hat M} - \beta_{\hat M}\|_{\Sigma_{\hat M}}$ converges to zero in probability.
The conditions of Corollary 4.2 are reasonable and can be shown to hold under
various dependence settings; see Kuchibhotla et al. (2018a). Under these conditions,
we get that $\hat\beta_{\hat M}$ "converges" to $\beta_{\hat M}$ and hence, under reasonable conditions,
it is only possible to perform consistent asymptotic inference for $\beta_{\hat M}$ based
on $\hat\beta_{\hat M}$. In other words, if a confidence region is constructed for a parameter $\eta$ centered
at $\hat\beta_{\hat M}$ and such a region becomes a singleton asymptotically, then $\|\eta - \beta_{\hat M}\|$
should converge to zero. In relation to the well-known consistent model selection
literature, we can say that if a claim is made about inference for $\beta_{M_0}$ (for $M_0$ the true
support), then $\|\beta_{\hat M} - \beta_{M_0}\|$ should converge to zero asymptotically.
From Corollary 4.1 (if $D^{\Sigma}_{\hat M} \to 0$), we have
and hence inference for $\beta_{\hat M}$ requires understanding the asymptotic distribution
of $\Sigma_{\hat M}^{-1}(\hat\Gamma_{\hat M} - \hat\Sigma_{\hat M}\beta_{\hat M})$, which is an average indexed by a random model $\hat M$. The
impossibility results of Leeb and Pötscher (2008) imply that
one cannot (uniformly) consistently estimate the asymptotic distribution of the
right hand side of (18). Hence the approach we take for inference is as follows: if
we know a priori that $\hat M$ belongs to $\mathcal{M}$ either with probability 1 or with probability
approaching 1, then by simultaneously inferring about $\beta_M$ over all $M \in \mathcal{M}$ we can
perform inference about $\beta_{\hat M}$. This is necessarily a conservative approach for any
particular variable selection procedure leading to ($\hat M$ or) $\beta_{\hat M}$, but over all random
models $\hat M \in \mathcal{M}$ this procedure is exact (or non-conservative); see Kuchibhotla
et al. (2018b, Theorem 3.1). We achieve this simultaneous inference by using high-dimensional
normal approximation results for averages of random vectors. Based
on Corollary 4.1, we prove the following corollary (similar to Corollary 2.2).
Because of the finite sample nature (not requiring any specific structure), the
result is cumbersome and requires some notation. We first briefly describe the
method of proof of the corollary to make the notation and result clear. We have already
proved (16) for all $M \in \mathcal{M}$. Since the Euclidean norm majorizes the maximum norm,
$$
\max_{1\le j\le|M|}\big|(\hat\beta_M - \beta_M)_j - \big(\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\big)_j\big|
\;\lesssim\; \frac{D^{\Sigma}_M\,\|\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|_{\Sigma_M}}{(1 - D^{\Sigma}_M)_+}.
$$
Here we write $\lesssim$ since the scaled Euclidean norm leads to other constant factors. We
can use a CLT for $(\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M))_{M\in\mathcal{M}}$ to compare $(\hat\beta_M - \beta_M)_{M\in\mathcal{M}}$ to a Gaussian
counterpart. The CLT error term for the averages $(\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M))_{M\in\mathcal{M}}$ is
denoted $\Delta_{n,\mathcal{M}}$. Here we also note that $\hat\beta_M - \beta_M$ is only close to the average up to
an error term on the right hand side. This leads to two terms: first, we need to show
that the right hand side term is indeed small, for which we use a CLT for the scaled Euclidean
norm (leading to $\Xi_{n,\mathcal{M}}$ below); and second, we need to account for closeness up to
this small error, which appears as the probability of a Gaussian process belonging to a
small strip (leading to an anti-concentration term in the bound).
Now some notation. Let $V_M$ represent the version of $V$ in (12) for model $M$.
Note that $V_M = O(n^{-1})$ in general. Define the Gaussian process $(G_{M,j})_{M\in\mathcal{M},\,1\le j\le|M|}$
with mean zero and covariance operator given by: $\mathrm{Cov}(G_{M,j}, G_{M',j'})$ equals
$$
\mathrm{Cov}\left(\frac{1}{n}\sum_{i=1}^n \frac{(\Sigma_M^{-1}X_{i,M})_j\,(Y_i - X_{i,M}^\top\beta_M)}{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j^{1/2}},\;
\frac{1}{n}\sum_{i=1}^n \frac{(\Sigma_{M'}^{-1}X_{i,M'})_{j'}\,(Y_i - X_{i,M'}^\top\beta_{M'})}{(\Sigma_{M'}^{-1}V_{M'}\Sigma_{M'}^{-1})_{j'}^{1/2}}\right),
$$
where $\preceq$ represents the vector coordinate-wise inequality, $N_{|M|}^{1/2}$ represents the $1/2$-net
of $\{\theta \in \mathbb{R}^{|M|} : \|\theta\| \le 1\}$, that is, $\max_{\theta\in\mathbb{R}^{|M|}:\,\|\theta\|=1}\min_{\theta'\in N_{|M|}^{1/2}}\|\theta - \theta'\| \le 1/2$,
and $(\bar G_M)_{M\in\mathcal{M}}$ represents a Gaussian process that has mean zero and shares the
same covariance structure as $(V_M^{-1/2}(\hat\Gamma_M - \hat\Sigma_M\beta_M))_{M\in\mathcal{M}}$. Note that $\mathrm{Var}(\bar G_M) = I_{|M|}$
for any $M \in \mathcal{M}$. The quantity $\Xi_{n,\mathcal{M}}$ helps control one of the remainder factors,
$\|\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|_{\Sigma_M V_M^{-1}\Sigma_M}$. For the main term, define $C := \sum_{M\in\mathcal{M}} |M|$ and
$$
\Delta_{n,\mathcal{M}} := \sup_{a\in\mathbb{R}_+^C}\left|\mathbb{P}\left(\left(\frac{|(\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M))_j|}{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j^{1/2}}\right)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a\right)
- \mathbb{P}\left(\big(|G_{M,j}|\big)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a\right)\right|.
$$
The proof of Corollary 4.3 can be found in Appendix B. The first inequality in
Corollary 4.3 is slightly different from the conclusion of Corollary 4.1 but is more
important for inference since the scaling in Corollary 4.3 is with respect to the
"asymptotic" variance of $\hat\beta_M - \beta_M$. The second conclusion of Corollary 4.3 is a
"randomness-free" version of a finite sample Berry–Esseen type result for $(\hat\beta_M - \beta_M)$
simultaneously over all $M \in \mathcal{M}$. The terms each have a meaning that is explained
before the notation above. For a simpler result, consider the case of fixed (non-stochastic)
covariates. In this case $D^{\Sigma}_M = 0$ for all $M$ and hence the result becomes
$$
\left|\mathbb{P}\left(\left(\frac{|(\hat\beta_M - \beta_M)_j|}{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j^{1/2}}\right)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a\right)
- \mathbb{P}\left(\big(|G_{M,j}|\big)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a\right)\right| \le \Delta_{n,\mathcal{M}} + 3\,\Xi_{n,\mathcal{M}},
$$
for all $a \in \mathbb{R}_+^C$, since we can take $\eta_M$ to be zero in the limit. Getting back to the bound
in Corollary 4.3, the quantities $\Delta_{n,\mathcal{M}}$ and $\Xi_{n,\mathcal{M}}$ can be easily controlled by using
high-dimensional CLT results, which depend on the number of coordinates of
the vector only logarithmically. In particular, for $\max\{\Delta_{n,\mathcal{M}}, \Xi_{n,\mathcal{M}}\} = o(1)$ they only
require $\log\big(\sum_{M\in\mathcal{M}}|M|\big) = o(n^{\gamma})$ for some $\gamma > 0$; see Chernozhukov et al. (2017a,
2014), Zhang and Wu (2017), Zhang and Cheng (2014), and Koike (2019) for details. For
instance, if $\mathcal{M} = \{M \subseteq \{1,\ldots,d\} : |M| \le k\}$ then the requirement becomes
$k\log(ed/k) = o(n^{\gamma})$. For the case of independent observations and sufficiently
weakly dependent observations, we have
$$
\max\{\Delta_{n,\mathcal{M}}, \Xi_{n,\mathcal{M}}\} = O(1)\Big(n^{-1}\log^7\big(\textstyle\sum_{M\in\mathcal{M}}|M|\big)\Big)^{1/6}.
$$
Bounds for $\mathbb{P}\big(\bigcup_{M\in\mathcal{M}}\{D^{\Sigma}_M \ge \eta_M\}\big)$ can be obtained using certain tail and "weak
dependence" assumptions on the covariates $X_1,\ldots,X_n$ (and, as mentioned before, one
only needs to be concerned with the stochastic coordinates of the covariates). This
often necessitates exponential tails on the covariates if the total number of covariates
$d$ is allowed to grow almost exponentially with $n$ (Guédon et al., 2015;
Tikhomirov, 2017). Finally, the control of the anti-concentration term (the last one
in Corollary 4.3) only concerns tail properties of a Gaussian process. A dimension
dependent bound (that only depends logarithmically on the dimension) for this
probability can be found in Nazarov (2003) and Chernozhukov et al. (2017b):
$$
\mathbb{P}\Big(\bigcup_{M\in\mathcal{M},\,1\le j\le|M|}\big\{\big||G_{M,j}| - a_{M,j}\big| \le \varepsilon\big\}\Big) \le H\varepsilon\sqrt{\log\big(\textstyle\sum_{M\in\mathcal{M}}|M|\big)},
$$
for some constant $H > 0$. Dimension-free bounds for this probability exist only for
some special cases (Chernozhukov et al., 2015; Kuchibhotla et al., 2018). Regarding
the constant in the anti-concentration probability, note that $\pi_{|M|}|\mathcal{M}| \le (ed/|M|)^{|M|}$
for any collection $\mathcal{M}$ and hence $\log\big(|\mathcal{M}|\pi_{|M|}5^{2|M|}/\Xi_{n,\mathcal{M}}\big) \le |M|\log\big(25ed/\{|M|\Xi_{n,\mathcal{M}}\}\big)$.
4.3. Inference under Variable Selection
Suppose that we can find $(\eta_M)_{M\in\mathcal{M}}$ such that $\mathbb{P}\big(\bigcup_{M\in\mathcal{M}}\{D^{\Sigma}_M \ge \eta_M\}\big)$ and the anti-concentration
term go to zero; then from Corollary 4.3 we get that
$$
\mathbb{P}\Big(\big((\Sigma_M^{-1}V_M\Sigma_M^{-1})_j^{-1/2}\,|(\hat\beta_M - \beta_M)_j|\big)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a\Big)
\;\approx\; \mathbb{P}\Big(\big(|G_{M,j}|\big)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a\Big),
$$
uniformly for all $a \in \mathbb{R}^{\sum_{M\in\mathcal{M}}|M|}$. In order to perform inference (or, in particular, to construct
confidence regions), one can choose a vector $a = a_\alpha$ such that
$$
\mathbb{P}\Big(\big(|G_{M,j}|\big)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \preceq a_\alpha\Big) = 1 - \alpha. \tag{19}
$$
This implies that for any $\hat M \in \mathcal{M}$ chosen (possibly) randomly based on the data,
$$
\mathbb{P}\Big(\big((\Sigma_{\hat M}^{-1}V_{\hat M}\Sigma_{\hat M}^{-1})_j^{-1/2}\,|(\hat\beta_{\hat M} - \beta_{\hat M})_j|\big)_{1\le j\le|\hat M|} \preceq (a_\alpha)_{\hat M}\Big) \ge 1 - \alpha + o(1),
$$
then we have
$$
\mathbb{P}\Big(\max_{1\le j\le|\hat M|}\big|(\Sigma_{\hat M}^{-1}V_{\hat M}\Sigma_{\hat M}^{-1})_j^{-1/2}(\hat\beta_{\hat M} - \beta_{\hat M})_j\big| \le K_{\hat M}(\alpha)\Big) \ge 1 - \alpha + o(1), \tag{23}
$$
for any $\hat M$ (possibly random) such that $\mathbb{P}(\hat M \in \mathcal{M}) = 1$ (this equality can be
relaxed to convergence to 1). Inequality (23) readily implies an (asymptotically valid)
post-selection confidence region for $\beta_{\hat M}$ as
$$
\hat R_{\infty,\hat M} := \Big\{\theta \in \mathbb{R}^{|\hat M|} : \max_{1\le j\le|\hat M|}\big|(\Sigma_{\hat M}^{-1}V_{\hat M}\Sigma_{\hat M}^{-1})_j^{-1/2}(\hat\beta_{\hat M,j} - \theta_j)\big| \le K_{\hat M}(\alpha)\Big\}.
$$
Note that the confidence regions, or more generally the inference obtained from the
maximum statistic, correspond to taking $(K_M(\alpha))_{M\in\mathcal{M}}$ in (22) to be a constant
multiple of $(1)_{M\in\mathcal{M}}$ (the all-ones vector). Further note that the event in (22) represents a
specific choice of the vector $a_\alpha$ in (19) for which Corollary 4.3 applies. Before we discuss
how to choose $(K_M(\alpha))_{M\in\mathcal{M}}$, we list some of the disadvantages of using the
maximum statistic (20).
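To make the choice concrete, one way such simultaneous quantiles might be approximated in practice is to bootstrap the standardized statistics over every model in $\mathcal{M}$ and take an empirical quantile of their maximum. The sketch below is our illustration of this idea (a simple pairs bootstrap reusing sandwich_cov and a small collection of models), not the exact construction analyzed in the paper.

```python
import numpy as np
from itertools import combinations

def max_stat_quantile(X, Y, max_size=2, alpha=0.05, n_boot=500, seed=0):
    """Bootstrap (1 - alpha) quantile of max over models M and coordinates j of the standardized statistic."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    models = [M for k in range(1, max_size + 1) for M in combinations(range(d), k)]
    centers = {M: np.linalg.lstsq(X[:, list(M)], Y, rcond=None)[0] for M in models}

    def max_stat(Xb, Yb):
        stats = []
        for M in models:
            XM = Xb[:, list(M)]
            bM = np.linalg.lstsq(XM, Yb, rcond=None)[0]
            sd = np.sqrt(np.diag(sandwich_cov(XM, Yb)))   # sandwich sketch from Section 3
            stats.append(np.max(np.abs(bM - centers[M]) / sd))
        return max(stats)

    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws.append(max_stat(X[idx], Y[idx]))
    return np.quantile(draws, 1 - alpha)
```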
Disadvantages of the maximum statistic. The maximum statistic is a natural
generalization of inference for a single model to simultaneous inference over
a collection of models. The maximum statistic would be the right thing to do if
we were concerned with simultaneous inference for $p$ parameters (all of which are
of the same order), but this is not the case with OLS under variable selection. It is
intuitively expected that models with more covariates will have wider
intervals. For this reason, by taking the maximum over the collection $\mathcal{M}$ of
models, one is ignoring the smaller models and the fact that small models have
narrower confidence intervals. To be concrete, if $\mathcal{M}$ is $\mathcal{M}_{\le k}$, it follows from
the results of Berk et al. (2013) and Zhang (2017) that
$$
\max_{M\in\mathcal{M}}\max_{1\le j\le|M|}|G_{M,j}| = O_p\big(\sqrt{k\log(ed/k)}\big), \tag{24}
$$
and in the worst case this rate can be attained. But if $k = 40$ (for example) and
the selected model $\hat M$ happens to have only two covariates, then the confidence
interval is (unnecessarily) wider by a factor of $\sqrt{20}$. By allowing a model dependent
quantile $K_M(\alpha)$ as in (22), we can tighten confidence intervals appropriately. For
this particular disadvantage, it is enough to have $K_M(\alpha)$ depend on $M$ only through
its size $|M|$. There is a second disadvantage of the maximum statistic that requires
dependence of $K_M(\alpha)$ on the covariates in $M$.
To describe the second disadvantage, we look at the conditions under which the worst
case rate in (24) is attained when $k = d$. Berk et al. (2013, Section 6.2) show that
if the covariates are non-stochastic, and
$$
\hat\Sigma := \begin{bmatrix} I_{d-1} & c\mathbf{1}_{d-1} \\ \mathbf{0}_{d-1}^\top & \sqrt{1 - (d-1)c^2} \end{bmatrix}, \qquad \text{for some } c^2 < 1/(d-1),
$$
Comparing (25) and (26), it is clear that the inclusion of the last covariate increases
the order of the maximum statistic from $\sqrt{\log(ed)}$ to $\sqrt{d}$; this shift is because of
increased collinearity. This means that if in the selection procedure we allow all
models but end up choosing the model that contains only the first $d-1$ covariates,
we pay a lot more than necessary. Note that if $d$ increases with $n$, this increase
(in rate) could hurt even more. Once again, allowing $K_M(\alpha)$ to be a model dependent
quantile for the maximum (over $j$) in that model resolves this disadvantage.
How to choose $K_M(\alpha)$? Now that we have understood the need for model
dependent quantiles $K_M(\alpha)$, it remains to decide how to find these quantiles. But
first note that these are not uniquely defined because multivariate quantiles are
not unique. We do not yet know of an "optimal" construction of $K_M(\alpha)$, and we
describe a few choices below motivated by the multi-scale testing literature (Dumbgen
and Spokoiny, 2001; Datta and Sen, 2018). Before we proceed to this, we note an
impossibility of uniform improvement over the maximum statistic. Suppose we
select a (random) model $\hat M$ such that
$$
\max_{1\le j\le|\hat M|}\big|(\hat\beta_{\hat M,j} - \beta_{\hat M,j})/\sigma_{\hat M,j}\big| = \max_{M\in\mathcal{M}}\max_{1\le j\le|M|}\big|(\hat\beta_{M,j} - \beta_{M,j})/\sigma_{M,j}\big|,
$$
where $\sigma_{M,j}$ represents the standard deviation, $(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j^{1/2}$, of $\hat\beta_{M,j} - \beta_{M,j}$. For
this random model $\hat M$, $K(\alpha)$, the quantile of the maximum statistic in (21), leads
to the smallest possible rectangular confidence region for $\beta_{\hat M}$. This implies that
$K_{\hat M}(\alpha) \ge K(\alpha)$ for any $\alpha \in [0,1]$ and any sequence $(K_M(\alpha))_{M\in\mathcal{M}}$. Therefore no
sequence of quantiles $(K_M(\alpha))_{M\in\mathcal{M}}$ satisfying (22) can improve on $K(\alpha)$ uniformly
over $M \in \mathcal{M}$; any gain for some model is paid for by a loss for some other model.
The hope is that the gain outweighs the loss, and we see this in our simulations.
Getting back to the construction of $K_M(\alpha)$, let the maximum for model $M$ be
$$
T_M := \max_{1\le j\le|M|}\big|(\hat\beta_{M,j} - \beta_{M,j})/\hat\sigma_{M,j}\big|,
$$
for an estimator $\hat\sigma_{M,j}$ of the standard deviation $\sigma_{M,j}$; recall that $\sigma_{M,j}$ involves $V_M$, which
converges to zero. Recall that the maximum statistic (20) is given by $\max_{M\in\mathcal{M}} T_M$.
We now present three choices that will lead to three different quantiles $K_M(\alpha)$.
1. In order to take into account the set of covariates in $M$, we center $T_M$ by its
median before taking the maximum:
where $\mathrm{med}(\cdot)$ represents the median. One could center by the mean of $T_M$, but
estimation of the mean of a maximum using the bootstrap is not yet clear. Higher
collinearity between the covariates in $M$ could increase the order of $T_M$, the
effect of which we avoid spilling into other models by centering by the median.
Also, it is clear that the median of $T_M$ has order depending only on
$M$, not on the maximum model size in the collection $\mathcal{M}$. Further, it is well known
that the maximum of Gaussians exhibits a super-concentration phenomenon,
in that its variance decreases to zero as the number of entries in the maximum
goes to infinity. For this reason, it may not be important to scale
by the standard deviation of $T_M$. If $K_M^{(1)}(\alpha)$ represents the quantile of the
statistic (27), then the post-selection confidence intervals are given by
$$
\hat R_M^{(1)} := \Big\{\theta \in \mathbb{R}^{|M|} : \max_{1\le j\le|M|}\big|(\hat\beta_{M,j} - \theta_j)/\hat\sigma_{M,j}\big| \le \widehat{\mathrm{med}}(T_M) + K_M^{(1)}(\alpha)\Big\}.
$$
where $\mathrm{MAD}(T_M) := \mathrm{med}(|T_M - \mathrm{med}(T_M)|)$. If $K_M^{(2)}(\alpha)$ represents the quantile
of the statistic (28), then the post-selection confidence intervals are given by
$$
\hat R_M^{(2)} := \Big\{\theta \in \mathbb{R}^{|M|} : \max_{1\le j\le|M|}\big|(\hat\beta_{M,j} - \theta_j)/\hat\sigma_{M,j}\big| \le \widehat{\mathrm{med}}(T_M) + \widehat{\mathrm{MAD}}(T_M)\,K_M^{(2)}(\alpha)\Big\}.
$$
3. Now that we have centered and scaled $T_M$ with its median and MAD, it is
expected that, even for models of different sizes, $(T_M - \mathrm{med}(T_M))/\mathrm{MAD}(T_M)$ are
of the same order. However, when we take the maximum over all models of the
same size, they may not be. The reason for this is that the maximum over models
of size 1 involves $d$ terms while the maximum over models of size 2 involves
$d(d-1)/2$ terms. Hence naturally the maximum over models of size 2 is
expected to be bigger. To account for this discrepancy, define the centered
and scaled maximum statistic for model size $s$ as
We emphasize once again that even though these choices improve the width of
confidence intervals for some models, they will deteriorate the width for other
models. We will see from the simulations in Section 6 that the gain (for some
models) outweighs the loss (for other models) in width. All the choices above
involve $\mathrm{med}(T_M)$ and $\mathrm{MAD}(T_M)$, which are simple functions of quantiles and can be
computed readily from the bootstrap procedures mentioned above.
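As an illustration of how these model dependent quantiles could be extracted from bootstrap output, the sketch below assumes that boot_T maps each model M to an array of bootstrap draws of T_M (produced, for example, by a resampling scheme like the ones sketched earlier); the function name is ours.

```python
import numpy as np

def centered_scaled_quantiles(boot_T, alpha=0.05):
    """Median/MAD centering of T_M and the shared (1 - alpha) quantile of the centered-scaled maximum."""
    models = list(boot_T)
    med = {M: np.median(boot_T[M]) for M in models}
    mad = {M: np.median(np.abs(boot_T[M] - med[M])) for M in models}
    centered = np.stack([(boot_T[M] - med[M]) / mad[M] for M in models])
    K = np.quantile(centered.max(axis=0), 1 - alpha)   # one quantile for the centered-scaled max statistic
    return med, mad, K
```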
All the theoretical analysis in the previous sections is deterministic, and the complete
study in any specific setting requires bounding the remainder terms in the deterministic
inequalities above. In this section, we complete the program by bounding
the remainder terms in the case of independent observations. The two main quantities
that need bounding for Theorem 2.1 are
$$
D^{\Sigma} := \|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_d\|_{\mathrm{op}} \quad\text{and}\quad \|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} = \|\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta)\|.
$$
The concentration of the sample covariance matrix around its expectation has been
studied for decades, documented in the works of Vershynin (2012, 2018), Rudelson
and Zhou (2013), Guédon et al. (2015), and Tikhomirov (2017). We state here the
result from Tikhomirov (2017) with the minimal tail assumptions that we know of.
Theorem 5.1 (Theorem 1.1 of Tikhomirov (2017)). Fix $n \ge 2d$ and $p \ge 2$. If
$X_1, \ldots, X_n$ are centered iid random vectors satisfying: for some $B \ge 1$,
This simultaneous control often necessitates exponential tails for the covariates if one
needs to allow $d$ to grow (almost exponentially) with $n$. Guédon et al. (2015)
provide sharp results for $\sup_{|M|\le k}\|\hat\Sigma_M - \Sigma_M\|_{\mathrm{op}}$ for both polynomial and exponential
tails on the covariates. We do not know of such sharp results for $\sup_{M\in\mathcal{M}} D^{\Sigma}_M$. By a simple
union bound, the following result can be proved for both quantities in (31). For this
we assume the following extension of (30): for all $1 \le i \le n$,
$$
\mathbb{E}\left[\exp\left(\frac{|a^\top X_i|^{\beta}}{K_{\beta}^{\beta}\|a\|_{\Sigma}^{\beta}}\right)\right] \le 2, \quad\text{for some } \beta > 0,\ 0 < K_{\beta} < \infty \text{ and for all } a \in \mathbb{R}^d. \tag{32}
$$
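Although the bounds above are analytic, the magnitude of $D^{\Sigma}$ and of $\sup_{M\in\mathcal{M}} D^{\Sigma}_M$ over a small collection of models is easy to probe by Monte Carlo. The sketch below assumes, purely for illustration, independent standard normal covariates so that $\Sigma = I_d$.

```python
import numpy as np
from itertools import combinations

def simulate_D_sigma(n=200, d=10, max_size=2, reps=1000, seed=0):
    """Monte Carlo draws of D^Sigma and of sup_M D_M^Sigma over models of size <= max_size (Sigma = I_d)."""
    rng = np.random.default_rng(seed)
    models = [M for k in range(1, max_size + 1) for M in combinations(range(d), k)]
    D_full, D_sup = np.empty(reps), np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal((n, d))
        S = X.T @ X / n
        D_full[r] = np.linalg.norm(S - np.eye(d), ord=2)
        D_sup[r] = max(np.linalg.norm(S[np.ix_(M, M)] - np.eye(len(M)), ord=2) for M in models)
    return D_full, D_sup
```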
6. Simulation Results
are reported with $\hat R_M$ replaced by $\hat R_M^{(j)}$, $1 \le j \le 3$, given on page 19, where $m(\hat R_M)$
represents the threshold of the confidence region for model $M$ (e.g., for $\hat R_M^{(1)}$ it
is $\widehat{\mathrm{med}}(T_M) + K_M^{(1)}(\alpha)$). Note that this threshold is a proxy for the volume of the
confidence region. Additionally, we consider $\hat R_M^{(0)}$ given by
$$
\hat R_M^{(0)} := \Big\{\theta \in \mathbb{R}^{|M|} : \max_j\Big|\frac{\hat\beta_{M,j} - \theta_j}{\hat\sigma_{M,j}}\Big| \le K_M^{(0)}(\alpha)\Big\}
\quad\text{with}\quad K_M^{(0)}(\alpha) := (1-\alpha)\text{-quantile of } \max_{M\in\mathcal{M}} T_M.
$$
Finally, we also report $\mathbb{P}\big(\bigcap_{M\in\mathcal{M}}\{\beta_M \in \hat R_M\}\big)$. Note that by construction this probability
has to be about 0.95 and, by noting the first quantity in (36), we see whether
the constructed confidence regions are too conservative for models of smaller sizes.
Table 1 shows the average coverage from all methods in all settings, confirming that
these are valid post-selection confidence regions.
Figures 1, 2, and 3 show the results (for settings (a), (b), and (c), respectively) from
500 simulations, within each of which 200 bootstrap samples were used. In all the settings,
the coverage from the proposed methods ($\hat R_M^{(j)}$, $j = 1, 2, 3$) is closer to 0.95, and for
many models the proposed intervals are shorter than the ones from $\hat R_M^{(0)}$.
Method   Setting (a)   Setting (b)   Setting (c)
0        0.986         0.976         0.972
1        0.964         0.960         0.964
2        0.964         0.956         0.958
3        0.964         0.956         0.958
Table 1
The numbers in the table represent the average simultaneous coverage of all confidence regions for
all settings, that is, the estimate of $\mathbb{P}\big(\bigcap_{M\in\mathcal{M}}\{\beta_M \in \hat R_M\}\big)$, based on 500 replications.
[Figure 1: four panels (Coverage, MinVolume, MedianVolume, MaxVolume) plotted against model size 1-10.]
Fig 1: The results for setting (a) with orthogonal design. The lines represent the
average (of the quantities in (36)) over 500 replications and the error bars are $\pm 1$ SD over
replications. Method $j$ in the legend refers to confidence regions $\hat R_M^{(j)}$ for $j = 0, 1, 2, 3$.
Volume in the plots refers to the threshold $m(\hat R_M)$.
[Figure 2: the corresponding results for setting (b); four panels (Coverage, MinVolume, MedianVolume, MaxVolume) plotted against model size 1-10.]
[Figure 3: four panels (Coverage, MinVolume, MedianVolume, MaxVolume) plotted against model size 1-10.]
Fig 3: The results for setting (c) with worst case design.
We have chosen to study the OLS linear regression estimator because of its
simplicity; even in this case some calculations get messy. An almost parallel set
of results can be derived for other regression estimators, including GLMs, the Cox
proportional hazards model, and so on; see Kuchibhotla (2018) for details.
Finally, we close with some comments on computation. The methods of inference
after variable selection mentioned in Section 4.3 involve computing a maximum over
all models with $|M| = s$, and there are $\binom{d}{s}$ many such models. This can be prohibitive if $d$ or $s$ is
large. Allowing for slightly enlarged confidence regions (conservative inference), one
can try to approximate these maxima from above without exact computation.
We now briefly discuss one way of doing this; the details are left for future work.
Suppose we want to find the maximum norm of $w = (w_1, \ldots, w_m) \in \mathbb{R}_+^m$. Further
suppose we know an upper bound, $B$, on $\|w\|_{\infty}$. Note the trivial inequality
$$
\Big(m^{-1}\sum_{j=1}^m w_j^q\Big)^{1/q} \;\le\; \|w\|_{\infty} \;\le\; m^{1/q}\Big(m^{-1}\sum_{j=1}^m w_j^q\Big)^{1/q},
$$
for any $q \ge 1$. If $q = \log(m)/\varepsilon$, then $\|w\|_{\infty}$ is $\big(m^{-1}\sum_{j=1}^m w_j^q\big)^{1/q}$ up to a factor of
$e^{\varepsilon}$. Observe now that $m^{-1}\sum_{j=1}^m w_j^q = \mathbb{E}_J[w_J^q]$ (for $J \sim \mathrm{Unif}\{1,\ldots,m\}$) is an expectation,
which can be estimated by $k^{-1}\sum_{\ell=1}^k w_{j_\ell}^q$ for $j_1,\ldots,j_k \overset{\mathrm{iid}}{\sim} \mathrm{Unif}\{1,\ldots,m\}$.
This is only an estimator of the expectation, but using the a priori upper bound $B$,
one can use any of the existing concentration inequalities to get a finite sample
confidence interval for $\big(m^{-1}\sum_{j=1}^m w_j^q\big)^{1/q}$, which leads to an upper estimate of $\|w\|_{\infty}$.
The details, such as "which concentration inequality is good?" and "how good is the upper
bound?", will be given elsewhere.
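A sketch of this sampling-based upper estimate, with w a numpy array of non-negative entries; the Hoeffding-type confidence bound used below is just one possible choice of concentration inequality, and the function name is ours.

```python
import numpy as np

def upper_estimate_max(w, B, eps=0.1, k=2000, delta=0.01, seed=0):
    """Upper estimate of ||w||_inf from a random subsample, given an a priori bound B >= ||w||_inf."""
    rng = np.random.default_rng(seed)
    m = len(w)
    q = np.log(m) / eps                                  # ||w||_inf is within a factor e^eps of the averaged q-norm
    sample = (w[rng.integers(0, m, size=k)] / B) ** q    # rescale to [0, 1] to avoid overflow
    # Hoeffding: the sample mean undershoots E[(w_J/B)^q] by at most sqrt(log(1/delta)/(2k)) w.p. >= 1 - delta.
    upper_mean = min(sample.mean() + np.sqrt(np.log(1 / delta) / (2 * k)), 1.0)
    return B * (m * upper_mean) ** (1 / q)               # (sum_j w_j^q)^{1/q} upper-bounds ||w||_inf
```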
Proof. From the definition of $\Delta_n$ and Theorem 2.1 of Rudelson et al. (2013), we
get for all $r > 0$,
$$
\mathbb{P}\Big(\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} > r + \|K^{1/2}\|_{\mathrm{HS}}\Big)
\le \mathbb{P}\Big(\|K^{1/2}N(0,I_d)\|_2 > r + \|K^{1/2}\|_{\mathrm{HS}}\Big) + \Delta_n
\le 2\exp\Big(-\frac{c_1^2 r^2}{\|K^{1/2}\|_{\mathrm{op}}^2}\Big) + \Delta_n,
$$
for some constant $c_1 > 0$ (independent of $p$ and $n$). Thus, we get for all $n \ge 1$,
$$
\mathbb{P}\Big(\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} > c_1^{-1}\|K^{1/2}\|_{\mathrm{op}}\sqrt{\log n} + \|K^{1/2}\|_{\mathrm{HS}}\Big) \le 2n^{-1} + \Delta_n. \tag{37}
$$
For any set $A \subseteq \mathbb{R}^p$ and $\epsilon > 0$, let $A^{\epsilon}$ denote the $\epsilon$-inflation of the set $A$ with
respect to the norm $\|\cdot\|_{\Sigma}$, that is, $A^{\epsilon} := \{y \in \mathbb{R}^p : \|y - x\|_{\Sigma} \le \epsilon \text{ for some } x \in A\}$.
Using Theorem 2.1, we get, with $D^{\Sigma}$ as in (4), for any set $A \subseteq \mathbb{R}^p$,
$$
\mathbb{P}\big(\Sigma^{1/2}(\hat\beta - \beta) \in A\big) \le \mathbb{P}\big(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta) \in A^{r_n\eta}\big)
+ \mathbb{P}\big(\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} > r_n\big) + \mathbb{P}\big(D^{\Sigma} > \eta\big),
$$
$$
\mathbb{P}\big(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta) \in A\big) \le \mathbb{P}\big(\Sigma^{1/2}(\hat\beta - \beta) \in A^{r_n\eta}\big)
+ \mathbb{P}\big(\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} > r_n\big) + \mathbb{P}\big(D^{\Sigma} > \eta\big).
$$
Therefore, we get
$$
\Big|\mathbb{P}\big(\Sigma^{1/2}(\hat\beta - \beta) \in A\big) - \mathbb{P}\big(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta) \in A\big)\Big|
\le \mathbb{P}\big(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta) \in A^{r_n\eta}\setminus A\big) + \mathbb{P}\big(D^{\Sigma} > \eta\big) + \mathbb{P}\big(\|\Sigma^{-1}(\hat\Gamma - \hat\Sigma\beta)\|_{\Sigma} > r_n\big).
$$
Recall that $N(0,I_d)$ represents a standard normal random vector. Now we get,
from Lemma 2.6 of Bentkus (2003) and the discussion following it, that there exists
a constant $c_2 > 0$ such that $\sup_{A\in\mathcal{C}_d}\mathbb{P}\big(K^{1/2}N(0,I_d) \in A^{r_n\eta}\setminus A\big) \le c_2\|K^{-1}\|_*^{1/4} r_n\eta$,
where $\|M\|_*$ for a matrix $M \in \mathbb{R}^{p\times p}$ denotes the nuclear norm of $M$. Hence
$$
\sup_{A\in\mathcal{C}_d}\Big|\mathbb{P}\big(\Sigma^{1/2}(\hat\beta - \beta) \in A\big) - \mathbb{P}\big(\Sigma^{-1/2}(\hat\Gamma - \hat\Sigma\beta) \in A\big)\Big|
\le c_2\|K^{-1}\|_*^{1/4} r_n\eta + 2n^{-1} + 3\Delta_n + \mathbb{P}\big(D^{\Sigma} > \eta\big).
$$
Here we have used inequality (37). Finally, from the definition of $\Delta_n$, we get
$$
\sup_{A\in\mathcal{C}_d}\Big|\mathbb{P}\big(\Sigma^{1/2}(\hat\beta - \beta) \in A\big) - \mathbb{P}\big(K^{1/2}N(0,I_d) \in A\big)\Big|
\le c_2\|K^{-1}\|_*^{1/4} r_n\eta + 2n^{-1} + 4\Delta_n + \mathbb{P}\big(D^{\Sigma} > \eta\big).
$$
We first prove a version of Corollary 4.1 for the purpose of normal approximation,
with $\|\cdot\|_{\Sigma_M}$ replaced by $\|\cdot\|_{\Sigma_M V_M^{-1}\Sigma_M}$. We start with the equality before (7) in the proof
of Theorem 2.1 for model $M$:
$$
\Sigma_M^{1/2}\Big[\hat\beta_M - \beta_M - \Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\Big] = \big(I_{|M|} - \Sigma_M^{-1/2}\hat\Sigma_M\Sigma_M^{-1/2}\big)\Sigma_M^{1/2}(\hat\beta_M - \beta_M).
$$
Multiplying both sides by $V_M^{-1/2}$ and applying the Euclidean norm, we get
$$
\begin{aligned}
\|\hat\beta_M - \beta_M - \Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|_{\Sigma_M V_M^{-1}\Sigma_M}
&\le \big\|V_M^{-1/2}\big(I_{|M|} - \Sigma_M^{-1/2}\hat\Sigma_M\Sigma_M^{-1/2}\big)V_M^{1/2}\,V_M^{-1/2}\Sigma_M^{1/2}(\hat\beta_M - \beta_M)\big\| \\
&\le \big\|V_M^{-1/2}\big(I_{|M|} - \Sigma_M^{-1/2}\hat\Sigma_M\Sigma_M^{-1/2}\big)V_M^{1/2}\big\|_{\mathrm{op}}\,\|\hat\beta_M - \beta_M\|_{\Sigma_M V_M^{-1}\Sigma_M} \\
&= D^{\Sigma}_M\,\|\hat\beta_M - \beta_M\|_{\Sigma_M V_M^{-1}\Sigma_M}.
\end{aligned}
$$
The last equality above follows from the fact that $\|AB\|_{\mathrm{op}} = \|BA\|_{\mathrm{op}}$. This implies
$$
\|\hat\beta_M - \beta_M - \Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|_{\Sigma_M V_M^{-1}\Sigma_M}
\le \frac{D^{\Sigma}_M}{(1 - D^{\Sigma}_M)_+}\,\|\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|_{\Sigma_M V_M^{-1}\Sigma_M}. \tag{38}
$$
Observe now that for any $x \in \mathbb{R}^{|M|}$ and any invertible matrix $A$,
See, e.g., Rigollet and Hütter (2015, Theorem 1.19). Therefore, for all $M \in \mathcal{M}$,
$$
\max_{1\le j\le|M|}\frac{\big|\big(\hat\beta_M - \beta_M - \Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\big)_j\big|}{\sqrt{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j}}
\le \frac{2D^{\Sigma}_M\,\max_{\theta\in N_{|M|}^{1/2}}\theta^\top V_M^{-1/2}(\hat\Gamma_M - \hat\Sigma_M\beta_M)}{(1 - D^{\Sigma}_M)_+}.
$$
Using the definition of $\Xi_{n,\mathcal{M}}$, we can control $\max_{\theta\in N_{|M|}^{1/2}}\theta^\top V_M^{-1/2}(\hat\Gamma_M - \hat\Sigma_M\beta_M)$.
Observe first that
$$
\mathbb{P}\left(\max_{M\in\mathcal{M}}\max_{\theta\in N_{|M|}^{1/2}}\frac{\theta^\top\bar G_M}{\sqrt{2\log(|\mathcal{M}|5^{|M|}\pi_{|M|}) + 2\log(|M|^2/\Xi_{n,\mathcal{M}})}} \ge 1\right)
\le \sum_{s=1}^d\mathbb{P}\left(\max_{M\in\mathcal{M},\,|M|=s}\;\max_{\theta\in N_s^{1/2}}\theta^\top\bar G_M \ge \sqrt{2\log(|\mathcal{M}|5^s\pi_s) + 2\log(s^2/\Xi_{n,\mathcal{M}})}\right). \tag{40}
$$
Since $\bar G_M$ is a standard normal random vector for each $M \in \mathcal{M}$, $\theta^\top\bar G_M$ is a standard
Gaussian random variable and it follows from Rigollet and Hütter (2015, Theorem
1.14) that for all $t \ge 0$,
$$
\mathbb{P}\left(\max_{M\in\mathcal{M},\,|M|=s}\;\max_{\theta\in N_s^{1/2}}\theta^\top\bar G_M \ge \sqrt{2\log(|\mathcal{M}|5^s\pi_s) + 2t}\right) \le \exp(-t),
$$
Taking $t = \log(s^2/\Xi_{n,\mathcal{M}})$ yields
$$
\mathbb{P}\left(\max_{M\in\mathcal{M},\,|M|=s}\;\max_{\theta\in N_s^{1/2}}\theta^\top\bar G_M \ge \sqrt{2\log(|\mathcal{M}|5^s\pi_s) + 2\log(s^2/\Xi_{n,\mathcal{M}})}\right) \le \frac{\Xi_{n,\mathcal{M}}}{s^2}.
$$
Combining this with (40) and using $\sum_{s=1}^d s^{-2} \le \pi^2/6 < 1.65$, we get
$$
\mathbb{P}\left(\max_{M\in\mathcal{M}}\max_{\theta\in N_{|M|}^{1/2}}\frac{\theta^\top\bar G_M}{\sqrt{2\log(|\mathcal{M}|5^{|M|}\pi_{|M|}) + 2\log(|M|^2/\Xi_{n,\mathcal{M}})}} \ge 1\right) \le 1.65\,\Xi_{n,\mathcal{M}}.
$$
Hence for any $(\eta_M)_{M\in\mathcal{M}}$ $(\le 1/2)$, on an event with probability at least $1 - 2.65\,\Xi_{n,\mathcal{M}} -
\mathbb{P}\big(\bigcup_{M\in\mathcal{M}}\{D^{\Sigma}_M \ge \eta_M\}\big)$, we get
$$
\max_{1\le j\le|M|}\frac{\big|\big(\hat\beta_M - \beta_M - \Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\big)_j\big|}{\sqrt{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j}}
\le 4\eta_M\sqrt{2\log\big(|\mathcal{M}|5^{|M|}|M|^2\pi_{|M|}/\Xi_{n,\mathcal{M}}\big)}. \tag{41}
$$
Define a vector $\varepsilon \in \mathbb{R}^{\sum_{M\in\mathcal{M}}|M|}$, indexed by $M \in \mathcal{M}$, $1 \le j \le |M|$, such that
$$
\varepsilon_{M,j} := 4\eta_M\sqrt{2\log\big(|\mathcal{M}|5^{|M|}|M|^2\pi_{|M|}/\Xi_{n,\mathcal{M}}\big)},
$$
and
$$
\mathbb{P}\left(\left(\frac{(\hat\beta_M - \beta_M)_j}{\sqrt{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j}}\right)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \in A\right)
\ge \mathbb{P}\left(\left(\frac{(\Sigma_M^{-1}(\hat\Gamma_M - \hat\Sigma_M\beta_M))_j}{\sqrt{(\Sigma_M^{-1}V_M\Sigma_M^{-1})_j}}\right)_{\substack{M\in\mathcal{M},\\1\le j\le|M|}} \in A - \varepsilon\right)
- 2.65\,\Xi_{n,\mathcal{M}} - \mathbb{P}\left(\bigcup_{M\in\mathcal{M}}\{D^{\Sigma}_M \ge \eta_M\}\right),
$$
Observe that
$$
D^{\Sigma}_M = \|\Sigma_M^{-1/2}\hat\Sigma_M\Sigma_M^{-1/2} - I_{|M|}\|_{\mathrm{op}}
\le 2\sup_{\nu\in N_{|M|}^{1/4}}\Big|\frac{1}{n}\sum_{i=1}^n(\nu^\top\Sigma_M^{-1/2}X_{i,M})^2 - 1\Big|, \tag{42}
$$
where $N_{|M|}^{1/4}$ represents the $1/4$-net of $\{\theta \in \mathbb{R}^{|M|} : \|\theta\| = 1\}$; see Lemma 2.2 of Vershynin
(2012). Note that $|N_{|M|}^{1/4}| \le 9^{|M|}$. Therefore the right hand side of (42) is a
maximum over a finite number of mean zero averages with summands satisfying
$$
\mathbb{E}\Big[\exp\Big(K_{\beta}^{-\beta}\,|\nu^\top\Sigma_M^{-1/2}X_{i,M}|^{\beta}\Big)\Big] \le 2, \qquad \text{for all } \nu \in N_{|M|}^{1/4} \text{ and } M \subseteq \{1,2,\ldots,d\}.
$$
Applying Theorem 3.4 of Kuchibhotla and Chakrabortty (2018), we get for any
$t \ge 0$ that, with probability $1 - 3e^{-t}$,
$$
D^{\Sigma}_M \le 14\sqrt{\frac{\kappa_{\Sigma_M}(t + |M|\log 9)}{n}} + \frac{C_{\beta}K_{\beta}^2(\log(2n))^{2/\beta}(t + |M|\log 9)^{\max\{1,2/\beta\}}}{n},
$$
for some constant $C_{\beta} > 0$ depending only on $\beta$. Since there are $\binom{d}{s} \le (ed/s)^s$ models
of size $s$, taking $t = s\log(ed/s) + u$ (for any $u \ge 0$) and applying a union bound over
all models of size $s$, we get that with probability $1 - 3e^{-u}$, simultaneously for all
$M \subseteq \{1,2,\ldots,d\}$ with $|M| = s$,
$$
D^{\Sigma}_M \le 14\sqrt{\frac{\kappa_{\Sigma_M}(u + s\log(9ed/s))}{n}} + \frac{C_{\beta}K_{\beta}^2(\log(2n))^{2/\beta}(u + s\log(9ed/s))^{\max\{1,2/\beta\}}}{n}.
$$
To prove the result simultaneously over all $1 \le s \le d$, take $u = v + \log(\pi^2 s^2/6)$ and
apply a union bound over $1 \le s \le d$ to get, with probability $1 - 3e^{-v}$, simultaneously
over all $M \subseteq \{1,2,\ldots,d\}$ with $|M| = s$ for some $1 \le s \le d$,
$$
D^{\Sigma}_M \le 14\sqrt{\frac{\kappa_{\Sigma_M}(v + \log(\pi^2 s^2/6) + s\log(9ed/s))}{n}}
+ \frac{C_{\beta}K_{\beta}^2(\log(2n))^{2/\beta}(v + \log(\pi^2 s^2/6) + s\log(9ed/s))^{\max\{1,2/\beta\}}}{n}.
$$
Since $s^{-1}\log(\pi^2 s^2/6) \le (2\pi/\sqrt{6})\sup_{x\ge\pi/\sqrt{6}}\exp(-x)x \le 1$, we get, with probability
$1 - 3e^{-v}$, simultaneously for any $1 \le s \le d$ and for any model $M \subseteq \{1,2,\ldots,d\}$
with $|M| = s$,
$$
D^{\Sigma}_M \le 14\sqrt{\frac{\kappa_{\Sigma_M}(v + s\log(9e^2d/s))}{n}} + \frac{C_{\beta}K_{\beta}^2(\log(2n))^{2/\beta}(v + s\log(9e^2d/s))^{\max\{1,2/\beta\}}}{n}.
$$
This completes the proof of (34).
We now bound $\|\Sigma_M^{-1/2}(\hat\Gamma_M - \hat\Sigma_M\beta_M)\|$ simultaneously over all $M$. Observe from
the definition of $\beta_M$ that
$$
0 \le \sum_{i=1}^n \mathbb{E}\big[(Y_i - X_{i,M}^\top\beta_M)^2\big] = \sum_{i=1}^n \mathbb{E}[Y_i^2] - \sum_{i=1}^n \mathbb{E}\big[(X_{i,M}^\top\beta_M)^2\big],
$$
and hence $\|\tilde\beta_M\| = \|\Sigma_M^{1/2}\beta_M\| \le \big(\sum_{i=1}^n \mathbb{E}[Y_i^2]/n\big)^{1/2}$. Now note that since $\mathbb{E}[\hat\Gamma_M - \hat\Sigma_M\beta_M] = 0$,
Hence we get by Theorem 3.4 of Kuchibhotla and Chakrabortty (2018) that for
any $t \ge 0$, with probability $1 - 3e^{-t}$,
$$
E_{M,1} \le 7\sqrt{\frac{V_M(t + |M|\log 5)}{n}} + \frac{C_{\beta}BK_{\beta}(\log(2n))^{1/\beta}(t + |M|\log 5)^{\max\{1,1/\beta\}}}{n}.
$$
Now following the same approach as used for $D^{\Sigma}_M$, we get, with probability $1 - 3e^{-v}$,
for any $1 \le s \le d$ and for any model $M \subseteq \{1,2,\ldots,d\}$ such that $|M| = s$,
$$
E_{M,1} \le 7\sqrt{\frac{V_M(v + s\log(5e^2d/s))}{n}} + \frac{C_{\beta}BK_{\beta}(\log(2n))^{1/\beta}(v + s\log(5e^2d/s))^{\max\{1,1/\beta\}}}{n}. \tag{43}
$$
To bound $E_{M,2}$ simultaneously over all $M$, we take
$$
B := 8\,\mathbb{E}\Big[\max_{1\le i\le n}|Y_i|\Big] \le 8n^{1/r}\max_{1\le i\le n}\big(\mathbb{E}[|Y_i|^r]\big)^{1/r} = 8n^{1/r}K_{n,r},
$$
which is motivated by Proposition 6.8 of Ledoux and Talagrand (1991). Now consider
the normalized process
$$
E_{2,\mathrm{Norm}} := \max_{1\le s\le d}\max_{|M|=s}\frac{n^{1/2}E_{M,2}}{n^{-1/2+1/r}K_{n,r}K_{\beta}\big(s\log(5e^2d/s) + \log n\big)^{1/\beta}}.
$$
Note that $E^{(1)}$ is an average of non-negative random variables and hence, by the
choice of $B$ above and Proposition 6.8 of Ledoux and Talagrand (1991), we get
$$
\begin{aligned}
\mathbb{E}\big[E^{(1)}\big]
&\le 8\,\mathbb{E}\left[\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{1}{n}\cdot\frac{n^{1/2}|\nu^\top\tilde X_{i,M}Y_{i,2}|}{n^{-1/2+1/r}K_{n,r}K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right] \\
&\le 8\,\mathbb{E}\left[\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{n^{-1/2}|\nu^\top\tilde X_{i,M}Y_i|}{n^{-1/2+1/r}K_{n,r}K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right] \qquad (44) \\
&\le 8\left\|\max_{1\le i\le n}\frac{|Y_i|}{K_{n,r}n^{1/r}}\right\|_2\left\|\max_{1\le s\le d}\;\max_{\substack{1\le i\le n,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{|\nu^\top\tilde X_{i,M}|}{K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_2.
\end{aligned}
$$
Here we use $\|W\|_2$ for a random variable $W$ to denote $(\mathbb{E}[W^2])^{1/2}$. In the second
factor, the number of items in the maximum for any fixed $s$ is given by $n\binom{d}{s}5^s \le n(5ed/s)^s$,
and hence from (33), we get
for a constant $C_{\beta} > 0$ (which is different from the one in (46)). Applying Theorem
8 of Boucheron et al. (2005) now yields for every $q \ge 1$
$$
\|E^{(1)}\|_q \le 2\,\mathbb{E}\big[E^{(1)}\big] + Cq\left\|\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{1}{n}\cdot\frac{n^{1/2}|\nu^\top\tilde X_{i,M}Y_{i,2}|}{n^{-1/2+1/r}K_{n,r}K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_q,
$$
for some (other) absolute constant $C > 0$. This implies (using (48)) that
$$
\|E_{2,\mathrm{Norm}}\|_q \le 3C_{\beta} + Cq\left\|\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{1}{n}\cdot\frac{n^{1/2}|\nu^\top\tilde X_{i,M}Y_{i,2}|}{n^{-1/2+1/r}K_{n,r}K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_q.
$$
As before, we have
$$
\begin{aligned}
&\left\|\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{1}{n}\cdot\frac{n^{1/2}|\nu^\top\tilde X_{i,M}Y_{i,2}|}{n^{-1/2+1/r}K_{n,r}K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_q \\
&\qquad\le \left\|\max_{1\le i\le n}\frac{|Y_i|}{K_{n,r}n^{1/r}}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{|\nu^\top\tilde X_{i,M}|}{K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_q \\
&\qquad\le \left\|\max_{1\le i\le n}\frac{|Y_i|}{K_{n,r}n^{1/r}}\right\|_r\left\|\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{|\nu^\top\tilde X_{i,M}|}{K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_{rq/(r-q)},
\end{aligned}
$$
where the last inequality holds for any $q < r$ by Hölder's inequality. We already
have that the first factor is bounded by 1. From (45), we have
$$
\left\|\max_{1\le i\le n}\;\max_{\substack{1\le s\le d,\\|M|=s}}\;\max_{\nu\in N_s^{1/2}}\frac{|\nu^\top\tilde X_{i,M}|}{K_{\beta}(s\log(5e^2d/s)+\log n)^{1/\beta}}\right\|_{rq/(r-q)} \le C_{\beta}\left(\frac{rq}{r-q}\right)^{1/\beta}.
$$
Hence by Markov's inequality, we get, with probability at least $1 - 1/t^{r-1}$, for any
$1 \le s \le d$, for any model $M \subseteq \{1,2,\ldots,d\}$ such that $|M| = s$,