Conjugate Gradient Methods for High-Dimensional GLMMs
Andrea Pandolfi∗, Omiros Papaspiliopoulos†, and Giacomo Zanella‡
November 8, 2024
Generalized linear mixed models (GLMMs) are a widely used tool in statistical analysis. The main bottleneck of many computational approaches lies in the inversion of the high-dimensional precision matrices associated with the random effects. Such matrices are typically sparse; however, the sparsity pattern resembles a multipartite random graph, which does not lend itself well to default sparse linear algebra techniques. Notably, we show that, for typical GLMMs, the Cholesky factor is dense even when the original precision is sparse. We thus turn to approximate iterative techniques, in particular to the conjugate gradient (CG) method. We combine a detailed analysis of the spectrum of said precision matrices with results from random graph theory to show that CG-based methods applied to high-dimensional GLMMs typically achieve a fixed approximation error with a total cost that scales linearly with the number of parameters and observations. Numerical illustrations with both real and simulated data confirm the theoretical findings, while at the same time illustrating situations, such as nested structures, where CG-based methods struggle.
∗ Bocconi University, Department of Decision Sciences
† Bocconi University, Department of Decision Sciences and BIDSA
‡ Bocconi University, Department of Decision Sciences and BIDSA
GZ acknowledges support from the European Research Council (ERC), through StG “PrScHDBayLe”, grant ID 101076564.
1. Introduction
Generalized linear mixed models (GLMMs) are a foundational statistical tool, widely
used across multiple disciplines (Gelman and Hill, 2007; Wood, 2017). GLMMs extend
the framework of Generalized Linear Models by incorporating both fixed effects and
random effects. Fixed effects capture population-level trends and relationships, while
random effects account for individual deviations from these trends. GLMMs utilize a
link function to establish a relationship between the mean of the response variable and
the linear predictor, which is given by a linear combination of both fixed and random
effects (see Section 3).
In various settings, the factors associated with the random effects may include numerous categories, known as levels, resulting in models with a large number of parameters p and a large number of observations N, both potentially on the order of several thousand. This scenario often arises in contemporary applications. For instance, within the political sciences, one may consider a geographic factor with numerous units, or one may include factors encoding so-called deep interactions (Ghitza and Gelman, 2013).
Another context where such models find utility is in recommendation systems, where
categorical variables represent customers and products (Perry, 2016; Gao and Owen,
2017).
Our work is relevant for various computational algorithms used to fit GLMMs, such as Gibbs sampling, variational inference, algorithms for restricted maximum likelihood estimation, and Laplace approximations (see Section 8 for a discussion). The primary computational bottleneck of these methodologies lies in the factorization of a sparse high-dimensional matrix, denoted by Q. This matrix decomposes as

Q = T + V^T Ω V ,   (1)

where T is the prior component (or, equivalently, a regularization term) and is usually diagonal or easily factorizable; Ω is a diagonal matrix that adjusts for the variances of the response variables; and V ∈ R^{N×p} denotes the design matrix. In particular, we mostly focus on performing Bayesian inference using so-called blocked Gibbs samplers, which alternate updates of the regression parameters, the variance hyperparameters and potential additional latent variables (see Section 6.1). The update of the regression parameters requires sampling from a multivariate Gaussian distribution with precision matrix Q as in (1), and it is usually the most computationally intensive step in those algorithms.
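To fix ideas, here is a minimal sketch (in Python; the paper's companion code is in Julia) of how a matrix with the structure (1) arises from a crossed design built from sparse one-hot blocks. All sizes, seeds and hyperparameter values below are illustrative assumptions, not taken from the paper.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
N, G1, G2 = 5000, 100, 200           # observations and factor sizes (illustrative)
f1 = rng.integers(0, G1, size=N)     # observed level of factor 1 for each observation
f2 = rng.integers(0, G2, size=N)     # observed level of factor 2 for each observation

rows = np.arange(N)
Z1 = sp.csr_matrix((np.ones(N), (rows, f1)), shape=(N, G1))  # one-hot block, factor 1
Z2 = sp.csr_matrix((np.ones(N), (rows, f2)), shape=(N, G2))  # one-hot block, factor 2
V = sp.hstack([sp.csr_matrix(np.ones((N, 1))), Z1, Z2]).tocsr()  # [intercept | Z1 | Z2]

p = 1 + G1 + G2
T = sp.identity(p, format="csr")     # prior precision (regularization term)
Omega = sp.diags(np.ones(N))         # likelihood weights (identity, Gaussian case)
Q = (T + V.T @ Omega @ V).tocsr()    # precision matrix with the structure in (1)

print(Q.shape, Q.nnz)                # Q is sparse, but its sparsity pattern is "crossed"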
The standard procedure to sample from a multivariate normal distribution involves
the computation of the Cholesky factor of Q (see Section 2.1). It is well established
that the Cholesky factorization can be computed efficiently with nested factors (see e.g.
Papaspiliopoulos and Zanella, 2017 and references therein). On the contrary, we will
show that, for general crossed factors, under standard random design assumptions, the
cost of computing the Cholesky factor scales as O(p^3), even when Q is sparse (see Section 4 for further details). Thus, since exact factorization of Q is too expensive, we consider
an alternative strategy to sample from the desired Gaussian distribution, which involves
the solution of a properly perturbed linear system Qθ = b (see Section 2.2). Here,
instead of finding the exact solution of the linear system, we employ the well-known
conjugate gradient (CG) algorithm to find an approximate solution or, equivalently, produce an approximate sample. We will demonstrate that, in the general case where the Cholesky factorization becomes inefficient, the CG sampler only requires a constant (not growing with N and p) number of matrix-vector multiplications with Q, which results in a total cost of O(max(N, p)) to produce an approximate sample.
The paper is organized as follows. In Section 2, we briefly review the algorithms for
the Cholesky factorization and the CG method, with a particular focus on their time
complexity. In Section 3, we describe the specific instance of GLMMs, which we refer
to as random-intercept crossed effects models, that will serve as the reference model for
the theoretical analysis in Sections 4 and 5. In Section 4, we show that exact Cholesky
factorization is expensive for such models; while, in Section 5, we show that the CG
algorithm converges fast for such models. The bulk of our technical contribution lies
in the analysis of the spectrum of Q performed in Section 5. Finally, in Section 6, we
describe the proposed methodologies for the case of general GLMMs, and illustrate them
with simulated and real data. Specifically, we consider an application to survey data for the 2004 American presidential election (Ghitza and Gelman, 2013) and to a data set of Instructor Evaluations by Students at ETH (from the lme4 R package). The proofs
of the presented results are given in the supplementary material and code to reproduce
the experiments is available at github.com/AndreaPandolfi/ASLA.jl.
Given the Cholesky factor L, one can sample from N(Q^{-1}m, Q^{-1}) by solving linear systems in L and L^T, which can be done efficiently in O(n_L) time via forward and backward substitution. A description of the algorithm can be found in the supplementary material.
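A minimal dense-matrix sketch of this sampling step (illustrative, not the paper's implementation): given the Cholesky factor L of Q, a draw from N(Q^{-1}m, Q^{-1}) is obtained by one forward and one backward triangular solve.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_gaussian_chol(Q, m, rng):
    """Draw theta ~ N(Q^{-1} m, Q^{-1}) given a dense precision matrix Q."""
    L = cholesky(Q, lower=True)                        # Q = L L^T
    w = solve_triangular(L, m, lower=True)             # forward substitution: L w = m
    z = rng.standard_normal(len(m))                    # z ~ N(0, I)
    theta = solve_triangular(L.T, w + z, lower=False)  # backward substitution: L^T theta = w + z
    return theta

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
Q = A.T @ A + np.eye(5)          # a small positive definite precision matrix (illustrative)
m = rng.standard_normal(5)
print(sample_gaussian_chol(Q, m, rng))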
For dense matrices, the exact computation of the Cholesky factor requires O(p^3) time, and the storage of O(p^2) entries. For sparse matrices with specific structures, the
Theorem 1. Denoting with Cost(Chol) the number of floating point operations (flops) needed to compute the Cholesky factor L of a positive definite matrix Q, we have

O( n_L^2 / p ) ≤ Cost(Chol) = O( Σ_{m=1}^{p} n_{L,m}^2 ) ≤ O( n_L^{1.5} ) .   (5)
The equality and lower bound in (5) are well known, while the upper bound is more involved, and we have not been able to find it in the literature. Note that, trivially, the result also implies Cost(Chol) ≥ O(n_Q^2 / p).
When Q is a dense matrix we have n_L = O(p^2) and thus the lower and upper bounds in (5) coincide, being both cubic in p, and they are both tight. For sparse matrices, instead, the two bounds can differ up to a O(p^{0.5}) multiplicative factor and each can be tight depending on the sparsity pattern. For example, if Q is a banded matrix with bandwidth b, we have n_L = O(pb) and Cost(Chol) = O(pb^2), hence the lower bound is tight while the upper bound is off by a O((p/b)^{0.5}) factor. On the contrary, for a matrix Q with a dense p^{0.5} × p^{0.5} sub-matrix and diagonal elsewhere we have n_L = O(p) and Cost(Chol) = O(p^{1.5}), meaning that the upper bound is tight while the lower bound is off by a O(p^{0.5}) factor.
Qθ = m + z, z ∼ N (0, Q) (6)
Simple computations show that, when the linear system in (6) is solved exactly, θ is an exact sample from N_p(Q^{-1}m, Q^{-1}). If the linear system is solved via the CG method, an approximate sample is returned. A potential limitation for the application of this algorithm is the need to sample efficiently from N_p(0, Q), which is however feasible for matrices as in (1) (see the end of Section 3). Notice that in this case CG is only used to solve the linear system in (6), while other CG samplers (Parker and Fox, 2012; Vono et al., 2022) exploit CG algorithms, actually Lanczos tri-diagonalization (Lanczos, 1950), to build an approximate low-rank square root of Q^{-1}, which is then used to sample. Notice, however, that, if Q is full-rank but has only k < p distinct eigenvalues, then (6) would return an exact sample in k steps (see Theorem 2), while other methodologies based on low-rank approximations would not.
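As an illustration of this perturbation-based approach, the following is a minimal sketch (Python, not the paper's Julia implementation): a plain CG solver applied to the perturbed system Qθ = m + z. The caller must supply a routine that draws z ~ N(0, Q), which is cheap for matrices of the form (1); all function names below are ours.

import numpy as np

def conjugate_gradient(Q, b, tol=1e-8, max_iter=1000):
    """Solve Q x = b for symmetric positive definite Q (plain CG, no preconditioner)."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - Q x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Qp = Q @ p
        alpha = rs / (p @ Qp)
        x += alpha * p
        r -= alpha * Qp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def cg_sample(Q, m, draw_prior_noise, rng):
    """Approximate draw from N(Q^{-1} m, Q^{-1}) by solving Q theta = m + z, z ~ N(0, Q).
    `draw_prior_noise(rng)` is assumed to return a draw from N(0, Q)."""
    z = draw_prior_noise(rng)
    return conjugate_gradient(Q, m + z)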
Theorem 2. Consider the linear system Qθ = b, with Q positive definite. Denote the
starting vector with θ 0 , then the Q-norm distance between the k-th CG iterate θ k and
the solution θ satisfies the inequality
‖θ_k − θ‖_Q / ‖θ_0 − θ‖_Q ≤ 2 ( ( √κ(Q) − 1 ) / ( √κ(Q) + 1 ) )^k ,   (7)

where κ(Q) = λ_max(Q)/λ_min(Q) denotes the condition number of Q. Moreover, if Q has only k < p distinct eigenvalues, CG returns the exact solution after k iterations.
See Trefethen and Bau (1997) for a proof of these results. Theorem 2 shows that CG
has a fast convergence rate when either Q has a small condition number or when it has
few distinct eigenvalues.
It is well known in the CG literature that the convergence rate presented in Theorem 2 is very conservative. Indeed, CG is also fast when most of the eigenvalues, except for a few outlying ones, are clustered in a small interval [µ_s, µ_{p−r}], with µ_s not close to 0. In this case, one observes that, after a small number of iterations, CG behaves as if the components corresponding to the outlying eigenvalues had been removed, and the CG convergence rate changes as if the condition number in (7) were replaced by the effective value µ_{p−r}/µ_s (Concus et al., 1976; van der Vorst, 2003). This behavior is known as superlinear convergence of conjugate gradients. For this reason, instead of considering κ(Q), we will focus on the effective condition number (Van der Sluis and Van der Vorst, 1986)

κ_{s+1,p−r}(Q) = µ_{p−r} / µ_{s+1} ,   (8)

obtained by removing the smallest s eigenvalues of Q and the r largest ones. In Section 5, we will provide upper bounds on κ_{s+1,p−r}(Q) in the GLMM context, for small s and r.
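As a small illustration (not from the paper's code), the effective condition number in (8) can be computed directly from the ordered eigenvalues; the example matrix below is arbitrary.

import numpy as np

def effective_condition_number(Q, s, r):
    """kappa_{s+1, p-r}(Q): condition number after discarding the s smallest
    and the r largest eigenvalues of Q."""
    mu = np.sort(np.linalg.eigvalsh(Q))   # ascending eigenvalues mu_1 <= ... <= mu_p
    p = len(mu)
    return mu[p - r - 1] / mu[s]          # mu_{p-r} / mu_{s+1} (0-based indexing)

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 50))
Q = A.T @ A + 0.1 * np.eye(50)
print(effective_condition_number(Q, s=0, r=0))   # plain condition number kappa(Q)
print(effective_condition_number(Q, s=2, r=1))   # after removing extreme eigenvalues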
3. Random-intercept crossed effects models
We now describe the model we will refer to in the theoretical analysis of Sections 4 and 5. Specifically, we present the class of random-intercept crossed effects models.
Model 1 (Random-intercept crossed effects models). For each observation i ∈ {1, . . . , N}, consider a univariate continuous response y_i ∈ R distributed as

y_i | η_i ∼ N(η_i, τ^{−1}) ,    η_i = θ_0 + Σ_{k=1}^{K} z_{i,k}^T θ_k ,   (9)

where θ_0 is a global intercept, θ_k = (θ_{k,1}, . . . , θ_{k,G_k})^T is the vector of random effects for factor k, and z_{i,k} ∈ {0, 1}^{G_k} with Σ_{g=1}^{G_k} z_{i,k,g} = 1 is a “one-hot” vector which encodes the level of factor k for observation i. Here K is the number of factors, and G_k is the number of levels in factor k. We assume independent priors θ_{k,g} ∼ N(0, T_k^{−1}), for each factor k and level g. The fixed effect is assigned either a normal or an improper flat prior, and the precision parameters are assigned conjugate gamma distributions. However, neither of these choices is directly relevant for our analysis.
If we define θ = (θ_0, θ_1^T, . . . , θ_K^T)^T ∈ R^p (with p = 1 + Σ_k G_k) and v_i = (1, z_{i,1}^T, . . . , z_{i,K}^T)^T ∈ R^p, then the conditional posterior distribution of θ given the remaining parameters is the multivariate Gaussian in (10), whose precision matrix is

Q = T + τ V^T V .   (11)
In Sections 4 and 5, we study the time complexity of sampling from (10) with the algorithms presented in Section 2. Regarding the CG sampler, we will only focus on the complexity of the CG step. Indeed, sampling z ∼ N_p(0, Q), with Q as in (11), can be done in O(N + p) time, by sampling ζ ∼ N_p(0, I_p) and η ∼ N_N(0, I_N) independently, and setting z = T^{1/2} ζ + √τ V^T η.
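A minimal sketch of this O(N + p) sampling step (Python, with an illustrative sparse design; the paper's own code is in Julia, and all names below are ours):

import numpy as np
import scipy.sparse as sp

def sample_prior_noise(T_diag, V, tau, rng):
    """Draw z ~ N_p(0, Q) for Q = T + tau * V^T V with T diagonal, in O(N + p) time,
    via z = T^{1/2} zeta + sqrt(tau) * V^T eta with zeta ~ N(0, I_p), eta ~ N(0, I_N)."""
    N, p = V.shape
    zeta = rng.standard_normal(p)
    eta = rng.standard_normal(N)
    return np.sqrt(T_diag) * zeta + np.sqrt(tau) * (V.T @ eta)

rng = np.random.default_rng(3)
N, G = 1000, 50
levels = rng.integers(0, G, size=N)
V = sp.hstack([sp.csr_matrix(np.ones((N, 1))),
               sp.csr_matrix((np.ones(N), (np.arange(N), levels)), shape=(N, G))]).tocsr()
z = sample_prior_noise(np.ones(1 + G), V, tau=1.0, rng=rng)
print(z.shape)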
Random-intercept models defined as in Model 1 are a specific instance of the more
general class of GLMMs considered in Section 6, which motivates our work. Throughout
the paper, Model 1 will serve as a study case to develop theoretical results. We restrict
our attention to the random-intercept case as it allows for an exhaustive theoretical
tractability, while still preserving the crossed sparsity structure that constitute the main
computational challenge for general GLMMs with non-nested designs (see e.g. Section
4). Notice also that various recent works provided methodologies and theoretical results
for this class of model (Papaspiliopoulos et al., 2019; Ghosh et al., 2022). Note that,
while we develop theoretical results in the context of Model 1, our methodology and
numerical results apply to the more general class of GLMMs described in Section 6,
which in particular includes random slopes and interaction terms.
For the sake of simplicity, the only fixed effect included in Model 1 is the intercept
θ0 ∈ R. Nonetheless, all the results of the following sections could be restated to include
multivariate fixed effects, θ 0 ∈ RD0 , leading to the same conclusions, provided that D0
is sufficiently small relative to the dimensionality of the random effects (θ 1 , . . . , θ K ) (see
Section B.1 in the supplementary material for further details).
where U[θ_{k,g}, θ_{k′,g′}] refers to the entry relative to the pair (θ_{k,g}, θ_{k′,g′}). We assume without loss of generality that each level of each factor is observed at least once, which is equivalent to saying that U[θ_{k,g}, θ_{k,g}] ≥ 1, for all factors k = 1, . . . , K and levels g = 1, . . . , G_k.
In the analogy with conditional independence graphs described in Section 2.1.1, the
last equality in (12) implies that θk,g is connected to θk′ ,g′ in GQ if and only if k ̸= k ′ and
there exists at least one observation where the two levels are observed together. Thus, the
resulting CI graph is a (K +1)-partite graph, with one trivial block containing the vertex
θ_0, and a block of size G_k for each factor k = 1, . . . , K. Since U[θ_0, θ_{k,g}] = U[θ_{k,g}, θ_{k,g}] > 0 for each k, g, θ_0 is connected to all the other vertices.
Throughout our analysis, we will consider a regime where N, p → ∞, while K is fixed.
Notice that N ≥ p/(K + 1), since we assume that each θk,g appears in at least one
observation. Finally, we assume that n_Q or, equivalently, the number of edges in G_Q is O(N).
Proposition 1 implies that we can place θ0 last without loss of generality. This helps to
reduce nL but it does not solve the problem. The following example describes a simple
sparse design that is catastrophic in this sense, where n_Q = O(p), but n_L = O(p^2).
g = G connect θ_{1,G} to all the θ_{2,j}'s that have degree less than or equal to d. A graphical
representation of the resulting precision matrix Q can be found in Section D of the
supplementary material.
Proposition 2. Under the design of Example 1 with d fixed and G → ∞, it holds that
Proposition 2 shows that sparse crossed designs can lead to poor performance of sparse Cholesky solvers. We now provide a numerical study showing that this is indeed the case for most sparse crossed designs. We consider missing-completely-at-random designs where each cell in the K-dimensional data contingency table contains an observation with probability π ∈ (0, 1) and is empty otherwise. We take G_1 = · · · = G_K = G, hence N ∼ Binomial(G^K, π), and we consider, for increasing values of G, different values of K and different ways that π relates to G. Specifically, we examine the following designs:

Notice that the expected degree is constant in scenario (a), but it increases with G in scenarios (b) and (c). Moreover, notice that by picking levels independently among all the possible G^K combinations, we guarantee n_Q = O(N). Figure 1 reports the results, plotting the number of flops required to compute L versus p = KG + 1 on a log-log scale. In all three regimes, despite Q being sparse, the Cholesky factor L is dense, leading to the worst-case scenario Cost(Chol) = O(p^3). Notably, the same results were obtained regardless of the ordering strategy. In Figure 1, we also include the number of flops required for sampling with the CG algorithm, which we will analyze in Section 5.5.
Figure 1: Cost(Chol) (red dots) as a function of p (x-axis) in a log-log scale, for scenarios
(a)–(c) described above. The purple triangles represent the cost of the CG
sampler, which is discussed in Section 5.5. Q is obtained by setting τ = 1
and T = I p . The results are obtained by averaging over 30 independent
simulations.
5. Spectral analysis and CG convergence
In this section, we provide a series of spectral results about the posterior precision matrix
for Model 1. Namely, we show that all but a fixed number of the eigenvalues of Q, if
appropriately preconditioned, concentrate in a small interval, as p → +∞. In particular,
Theorems 4 and 6 will show that the effective condition number remains bounded for
increasing p. This suggests that the number of CG iterations required to reach a desired
level of accuracy is independent of the size of the problem, leading to dimension-free
convergence of the algorithm, which is consistent with numerical experiments (see Section
5.5).
5.1. Notation
We denote with Diag(Q) the diagonal matrix whose diagonal coincides with that of Q and, for a given vector v, we denote with Diag(v) the diagonal matrix whose diagonal is v. Recalling that U = V^T V, we start by decomposing Q in (11) as Q = T + τ·Diag(U) + τ A, where A = U − Diag(U) can be seen as a weighted adjacency matrix, which counts the number of times that two levels θ_{k,g} and θ_{k′,g′} are observed together, see (12). We also define D = Diag(A 1_p), where 1_p ∈ R^p is a vector of ones, so that D is a diagonal matrix whose elements are the row-wise sums of A. Notice that, for crossed effects models, D = K·Diag(U): indeed, U[θ_{k,g}, θ_{k,g}] counts the number of times that the level θ_{k,g} is observed, see (12); however, each time it is observed, it is also “connected” to K − 1 levels from the other factors and to the global parameter θ_0, hence the equality.
When we consider the sub-matrix relative to the random effects only, we use the notation Q^{(r)} ∈ R^{(p−1)×(p−1)}, which is obtained by removing the row and column relative to the global effect θ_0. We use the same notation for the adjacency matrix A^{(r)}. We let D^{(r)} = Diag(A^{(r)} 1_{p−1}) ∈ R^{(p−1)×(p−1)}, so that D^{(r)} = (K − 1)·Diag(U^{(r)}).
Denote with Q̄ the Jacobi-preconditioned precision matrix (Golub and van Loan, 2013, Sec. 11.5.3), i.e.

Q̄ = Diag(Q)^{−1/2} Q Diag(Q)^{−1/2} .   (14)

In Section 5.2, we expand on the importance of such preconditioning. We define Ā^{(r)} = (D^{(r)})^{−1/2} A^{(r)} (D^{(r)})^{−1/2}, as one would usually normalize the adjacency matrix of a graph.
For positive definite matrices, the Jacobi preconditioner M = Diag(Q) is often a natural choice, since x ↦ M^{−1}x is very cheap to evaluate.
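For concreteness, here is a minimal sketch of CG with a diagonal (Jacobi) preconditioner, in the spirit of Algorithm 4 in the appendix; the function and the test matrix are ours, for illustration only.

import numpy as np

def preconditioned_cg(Q, b, M_diag, tol=1e-8, max_iter=1000):
    """CG for Q x = b with a diagonal (Jacobi) preconditioner M = diag(M_diag)."""
    x = np.zeros_like(b)
    r = b.copy()
    z = r / M_diag               # z = M^{-1} r, cheap because M is diagonal
    p = z.copy()
    rz = r @ z
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        Qp = Q @ p
        alpha = rz / (p @ Qp)
        x += alpha * p
        r -= alpha * Qp
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = r / M_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, n_iter

rng = np.random.default_rng(4)
A = rng.standard_normal((500, 100)) * np.linspace(1.0, 30.0, 100)  # badly scaled columns
Q = A.T @ A + np.eye(100)
b = rng.standard_normal(100)
x_pcg, k_pcg = preconditioned_cg(Q, b, M_diag=np.diag(Q))
x_cg, k_cg = preconditioned_cg(Q, b, M_diag=np.ones(100))  # M = I recovers plain CG
print(k_pcg, k_cg)   # Jacobi preconditioning typically needs fewer iterations on badly scaled Q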
For the specific case of Model 1, the Jacobi preconditioning allows rescaling Q so that
its eigenvalues lie within a bounded interval independently of the design (see Theorem 3
for more details). To illustrate this point, we consider the following setting. There are K
factors of sizes G_1, . . . , G_K and, for a given N, we sample uniformly without replacement N out of the possible ∏_{k=1}^{K} G_k combinations of levels. This generates a specific random
design V . Notice that this is equivalent to the random design of Figure 1, but for general
Gk and for fixed N . In this case, simple computations show that the expected value of
the diagonal element Q[θk,g , θk,g ] is Tk + τ N/Gk . When looking at the spectrum of the
resulting Q, one can observe, for each factor k, a bulk of Gk eigenvalues centered around
the expected value Tk + τ N/Gk , which is a manifestation of spectral concentration for
high-dimensional random matrices. The left panel in Figure 2 shows this phenomenon in
the case of K = 3 factors. On the other hand, when applying Jacobi preconditioning to
Q, the bulks relative to different factors are grouped into a single one centered around 1
(see the right panel of Figure 2). These two different spectral behaviors are reflected in
Figure 2: The two histograms show the spectrum of Q (left) and Q̄ (right). They are
obtained by setting T = I p , τ = 1, G1 = 300, G2 = 1000, G3 = 2000.
The dashed lines on the left panel correspond to the values Tk + τ N/Gk , for
k = 1, 2, 3.
the number of iterations needed for convergence of CG. See for example Table 1, which
considers a sequence of designs with 2 factors of increasingly different sizes. There, the
effective condition number and the number of CG iterations increase as the difference between factor sizes increases, while they remain constant when considering the Jacobi-preconditioned version Q̄. Theorems 4 and 6 provide theoretical results supporting the empirical behavior of the spectrum of Q̄. Given this better performance, we will focus our analysis on the spectrum of Q̄.
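The qualitative behavior described above can be reproduced with a small simulation along the following lines (a sketch with smaller, arbitrary sizes than those used for Figure 2; not the paper's code).

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(5)
G = [30, 100, 200]                       # factor sizes (smaller than in Figure 2, for speed)
N, tau = 2000, 1.0
blocks = [sp.csr_matrix(np.ones((N, 1)))]
for Gk in G:                              # one one-hot block per factor
    lev = rng.integers(0, Gk, size=N)
    blocks.append(sp.csr_matrix((np.ones(N), (np.arange(N), lev)), shape=(N, Gk)))
V = sp.hstack(blocks).tocsr()
p = 1 + sum(G)
Q = (sp.identity(p) + tau * (V.T @ V)).toarray()     # T = I_p

d = np.sqrt(np.diag(Q))
Q_bar = Q / np.outer(d, d)                            # Jacobi-preconditioned matrix (14)

eig_Q = np.sort(np.linalg.eigvalsh(Q))
eig_Qbar = np.sort(np.linalg.eigvalsh(Q_bar))
# eig_Q shows one bulk per factor around T_k + tau*N/G_k; eig_Qbar has a single bulk near 1,
# plus a few small eigenvalues and one large one, as in Figure 2.
print(eig_Q[:5], eig_Q[-3:])
print(eig_Qbar[:5], eig_Qbar[-3:])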
Note that, in the right panel of Figure 2, one can observe a cluster of 2 small eigenvalues
separated from the remaining part of the spectrum. This shows why we need to consider
effective condition numbers defined as in (8), and it will be made formal in Theorem 3.
Table 1: Effective condition number and number of CG iterations for Q and Q̄. The
design is the same as in Figure 1, with π = p^{−9/10}. Q is obtained with T = I_p
and τ = 1. In the linear system Qθ = b, b has independent entries with uniform
distribution on (−0.5, 0.5).
Remark 1 (CG preconditioners). There is a vast literature that studies the choice of
the optimal CG preconditioner for different linear systems (Saad, 2003, Ch.10). In
our analysis, we have also explored other preconditioning techniques, such as so-called
incomplete Cholesky factorizations (see the supplementary material for a more detailed
discussion). Nevertheless, we observed results comparable to, or slightly worse than, the ones obtained with Jacobi preconditioning. Nishimura and Suchard (2023) also proposed a prior-preconditioned version of the conjugate gradient method. However, they focused on Bayesian sparse regression with a large number of covariates, identifying the relevant ones through shrinkage priors. Since we do not consider sparsity-inducing priors, the diagonal elements of the prior term T are much smaller than those of the likelihood term τ V^T V. This is why, in our case, using T as a preconditioner does not improve the spectrum of Q much, leading to the same issues observed in Table 1.
0 < µ̄_1 ≤ · · · ≤ µ̄_K ≤ ( max_{k=0,...,K} T_k ) / ( τ + max_{k=0,...,K} T_k ) ,    K + 1 ≤ µ̄_p ≤ K + 2 .
fixed and random effects, whose direction corresponds to the eigenvalue µ̄p . However, it
provides little information about their relative differences, whose directions correspond
to the eigenvalues λ1 (Ū ) = · · · = λK (Ū ). Notice also that the upper bounds on the
small eigenvalues µ̄1 , . . . µ̄K in Theorem 3 are quite conservative. In all the numerical
examples we have explored, the K smallest eigenvalues of Q̄ are very close to zero.
κ_{K+1,p−2}(Q̄) ≤ ( 1 + (K − 1) ν̄_{p−2} ) / ( 1 + (K − 1) ν̄_K ) .   (16)
Lemma 1 provides an interesting connection between the effective condition number κ_{K+1,p−2}(Q̄) and the concentration of the eigenvalues of Ā^{(r)} around 0. This allows us to focus on the spectrum of Ā^{(r)} and to leverage results from random graph theory, in order to bound the condition number of Q̄ using (16).
Remark 2. The upper bound in (16) is quite tight in the regimes that we consider. When the bound becomes uninformative for ν̄_K = −1/(K − 1), we numerically observed that κ_{K+1,p−2}(Q̄) is very large, on the order of several hundreds. Theoretically, if ν̄_K = −1/(K − 1), we can prove that κ_{K+1,p−2}(Q̄) is lower bounded by τ (max_k T_k)^{−1}, hence the effective condition number can be arbitrarily large depending on the value of max_k T_k. Heuristically, if we want κ_{K+1,p−2}(Q̄) to be small, we need ν̄_K to be bounded away from −1/(K − 1).
However, it is important to remark that the condition ν̄_K > −1/(K − 1) is only sufficient for the fast convergence of CG. Clearly, it can happen that ν̄_r > −1/(K − 1) for some small r > K, even if ν̄_K = −1/(K − 1). In such a case, we can extend Lemma 1 to bound κ_{r+1,p−2}(Q̄), and CG would still converge fast.
In the following analysis, we will focus on κ_{K+1,p−2}(Q̄) for mathematical convenience, in order to obtain a sufficient condition for CG to converge fast. We will, however, also study necessary conditions, as in Theorem 5.
In order for the upper bound in (16) to be effective, we would like ν̄_{p−2} to be small or, equivalently, 1 − ν̄_{p−2} to be large; but mostly we need ν̄_K to be bounded away from −1/(K − 1).
The spectral gap 1 − ν̄p−2 is a well-studied quantity since it is directly related to
the global connectivity of a graph. One may wonder whether global connectivity is also
associated with the gap ν̄_K + 1/(K − 1). In Section 5.4.1, we will see that, in the simple case with
two factors, this association is straightforward. Indeed, we will prove that a sufficient
condition to bound the effective condition number is that the associated graph is made of
a unique connected component. On the other hand, in Section 5.4.2, we will show that,
when considering more than two factors, a different notion of connectivity is needed.
Assume now that A^{(r)} is a proper adjacency matrix with binary entries in {0, 1}. In this case, one can show that ν̄_{p−2} < 1 if and only if the graph associated to the adjacency matrix Ā^{(r)} consists of a unique connected component (Godsil and Royle, 2001). More generally, the higher its connectivity, the larger the gap 1 − ν̄_{p−2} (Godsil and Royle, 2001, Ch. 13.5). With this in mind, Corollary 2 states that CG converges fast when the graph associated to Ā^{(r)} is well-connected.
Clearly, higher connectivity comes at the price of a denser support for Q and consequently a larger cost per iteration of CG. Ideally, CG would be most efficient when the CI graph is well-connected while being very sparse. There is a vast literature in random graph theory that studies graphs with this property. For example, the well-known Alon-Boppana bound (Alon, 1986) gives an upper bound on the spectral gap of regular graphs. Graphs that attain this bound are called Ramanujan. Random regular graphs are also known for having excellent connectivity properties: indeed, they can get arbitrarily close to the Alon-Boppana bound, with high probability, as the number of vertices goes to infinity (Friedman, 2003). Brito et al. (2022) extend this result to the case of random biregular bipartite graphs, i.e. bipartite graphs sampled uniformly at random among all the possible bipartite graphs where the G_1 vertices of the first part have constant degree d_1 and the remaining G_2 vertices have degree d_2. This is why one would expect the CG algorithm to have a fast convergence rate under this random design. The following result formalizes this intuition.
Theorem 4. Consider Model 1 with A^{(r)} being the adjacency matrix of a bipartite, biregular random graph. Then, for any sequence ϵ_p → 0 as p → ∞, it holds asymptotically almost surely that

κ_{3,p−2}(Q̄) ≤ ( 1 + (1/√d_1 + 1/√d_2) + ϵ_p ) / ( 1 − (1/√d_1 + 1/√d_2) − ϵ_p ) .   (18)
Theorem 4 shows that, even with relatively small degrees d1 and d2 , the effective
condition number of Q̄ becomes close to 1. An analogous, but weaker, result can also be stated for Erdős-Rényi random bipartite graphs and can be found in the supplementary material.
The intuition behind Example 2 is made more general in the following theorem, which
shows that pairwise connectivity between all factors is a necessary condition for a well-conditioned precision matrix.
Theorem 5. Consider Model 1. Fix any permutation π of the first K integers. For each ℓ = 1, . . . , K − 1, let C_ℓ be the number of connected components of the sub-graph restricted to the factors (π(ℓ), π(ℓ + 1)). Then Ā^{(r)} has at least Σ_{ℓ=1}^{K−1} C_ℓ eigenvalues equal to −1/(K − 1).
Theorem 5 shows that, if we want ν̄_K to be bounded away from −1/(K − 1), one needs
each pair of bipartite sub-graphs to have only one connected component, that is C_ℓ = 1 for each ℓ and π. This is clearly a stronger property than global connectivity of the graph of Ā^{(r)}, and in several cases it may not be satisfied. For instance, some datasets may have one factor which is nested into another one, i.e. each level of the former factor can be observed only with one level of the latter (see Section 6.2 for an example). In this case, the sub-graph associated to these two factors would present as many connected components as the number of levels in the first factor, and Ā^{(r)} will have as many eigenvalues equal to −1/(K − 1). This issue becomes even more relevant when considering interaction terms between factors. Consider two factors of sizes G_1 and G_2, respectively; the interaction term between them is given by a third factor of size G_1 · G_2 that considers all the possible pairwise combinations of the levels of the two factors. One can notice that, by construction, the interaction term is nested inside both of the original factors. Hence, the corresponding Ā^{(r)} will have at least G_1 + G_2 eigenvalues equal to −1/(K − 1). As we will see in Section 6.2, including nested factors and interaction terms adds to the spectrum of Q̄ several “problematic” eigenvalues close to zero, slowing down the convergence of the CG algorithm.
Unfortunately, not even pairwise connectivity between all factors guarantees ν̄_K > −1/(K − 1); an example is given in the supplementary material. Indeed, a stronger notion of pairwise connectivity is needed to lower bound ν̄_K. The following result describes such a condition, and shows that the effective condition number κ_{K+1,p−2}(Q̄) is bounded independently of p.
Theorem 6 (Strong pairwise connectivity). Consider Model 1. For any pair of factors k ≠ h, let A^{(k,h)} ∈ R^{(G_k+G_h)×(G_k+G_h)} be the adjacency matrix of the bipartite graph restricted to the pair of factors (k, h). Denote with Ā^{(k,h)} the normalized version obtained as M^{−1/2} A^{(k,h)} M^{−1/2}, with M = Diag(A^{(k,h)} · 1). Assume that

λ* = √(K − 1) · max{ |λ| : λ ∈ σ(Ā^{(k,h)}), |λ| ≠ 1, k ≠ h } < 1 .   (19)
5.5. Numerical experiments
The above bounds on the effective condition numbers of Q̄ suggest that CG samplers
are highly effective for large unstructured designs, which resemble random graphs. In
this section, we verify these conclusions through numerical simulations. Specifically, we
consider the same numerical experiment as at the end of Section 4 and compute the cost of obtaining an approximate sample by solving (6) with CG, which is given by the product of the cost per iteration and the number of iterations needed for convergence. The former is of order O(n_Q), while the latter can be quantified as

N. iterations = inf{ k ≥ 0 : ‖Qθ_k − b‖_2 < ϵ‖b‖_2 } ,   (21)
             N. iterations
p        Case (a)   Case (b)   Case (c)
100         17         22         45
205         18         21         47
435         19         20         48
910         19         21         43
1910        19         18         39
4000        19         17         34
in all three designs. Therefore, while this framework represents a worst-case scenario
for sampling via Cholesky factorization, as its computation requires O(p^3) time, it
is optimal for sampling via CG, since the number of iterations is independent of the
dimension p.
Here θ_0 ∈ R^{D_0} denotes the vector of fixed effects and x_{i,0} ∈ R^{D_0} denotes the corresponding covariates. For each factor k and level g, we consider the random effect θ_{k,g} ∈ R^{D_k}, for g = 1, . . . , G_k, with corresponding covariates w_{i,k} ∈ R^{D_k}. Up to now, we only considered random-intercept models with D_k = 1 and w_{i,k} = 1, while we now also consider random-slopes models with D_k > 1. As before, the vector z_{i,k} ∈ {0, 1}^{G_k} encodes the level of factor k to which observation y_i is assigned. With such notation, θ_k has covariates x_{i,k} = z_{i,k} ⊗ w_{i,k} ∈ R^{G_k D_k}, where ⊗ denotes the Kronecker product.
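For a single observation, the construction of the random-slope covariates via the Kronecker product looks as follows (a toy sketch with made-up values):

import numpy as np

G_k, D_k = 4, 2                      # levels and random-effect dimension (illustrative)
level = 2                            # level of factor k observed for observation i
w_ik = np.array([1.0, 0.7])          # slope covariates for observation i

z_ik = np.zeros(G_k)
z_ik[level] = 1.0                    # one-hot encoding of the level
x_ik = np.kron(z_ik, w_ik)           # covariate vector of theta_k, length G_k * D_k

print(x_ik)   # only the D_k entries of the block corresponding to `level` are non-zero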
In previous sections, we discussed the case of a Gaussian likelihood function with identity link, namely f(y_i | η_i) = N(η_i, τ^{−1}). We also consider a binomial likelihood with logistic link function, f(y_i | η_i) = Binomial(n_i, (1 + exp(−η_i))^{−1}), using the Polya-Gamma data augmentation (Polson et al., 2013) defined as

p(y_i, ω_i | η_i) = 2^{−n_i} exp( [y_i − n_i/2] η_i − ω_i η_i²/2 ) f_PG(ω_i | n_i, 0) ,
(ω_1, . . . , ω_N)^T the vector of PG latent variables. With such notation, simple Bayesian computation leads to the following conditional posterior updates:

p(θ | y, V, Ω, {T_k}_{k=1}^{K}) ∼ N_p( Q^{−1}(T m_0 + V^T κ), Q^{−1} ) ,   (23)

p(ω_i | y, V, θ) ∼ PG(n_i, v_i^T θ) ,   independently for i = 1, . . . , N,   (24)

p(T_k | y, V, θ) ∼ W( α_k + G_k, (Φ_k + Σ_{g=1}^{G_k} θ_{k,g} θ_{k,g}^T)^{−1} ) ,   independently for k = 1, . . . , K.   (25)
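To make the structure of such a blocked Gibbs sweep concrete, here is a minimal sketch for the simpler Gaussian-likelihood, random-intercept case (for the logistic case one would additionally draw the Pólya-Gamma variables in (24) with an external sampler). The Gamma(a, b) hyperparameters, the exact solve in place of CG, and all function names are our own illustrative assumptions, not the paper's implementation.

import numpy as np

def gibbs_sweep(y, V, T_diag, tau, a=1.0, b=1.0, rng=None):
    """One sweep of a blocked Gibbs sampler for a Gaussian random-intercept model:
    theta | rest ~ N(Q^{-1} m, Q^{-1}) with Q = diag(T_diag) + tau * V^T V, m = tau * V^T y;
    tau | rest follows a conjugate Gamma update (hyperparameters a, b are illustrative)."""
    rng = rng or np.random.default_rng()
    N, p = V.shape

    # update theta: draw z ~ N(0, Q) cheaply and solve the perturbed system.
    # Here a dense solve is used; in the paper's setting this step would be done
    # approximately with CG, as in (6).
    Q = np.diag(T_diag) + tau * V.T @ V
    m = tau * V.T @ y
    z = np.sqrt(T_diag) * rng.standard_normal(p) + np.sqrt(tau) * V.T @ rng.standard_normal(N)
    theta = np.linalg.solve(Q, m + z)

    # update the observation precision tau (conjugate Gamma update);
    # updates of the random-effect precisions T_k are analogous Gamma updates
    # over each factor's block of theta.
    resid = y - V @ theta
    tau = rng.gamma(a + N / 2, 1.0 / (b + resid @ resid / 2))

    return theta, tau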
1. Survey data from the 2004 American presidential election (Ghitza and Gelman, 2013; Goplerud et al., 2024). It collects the vote (Democrat or Republican) of 79,148
American citizens. The data set also contains information about their income (5
levels), ethnic group (4 levels), age (4 levels), region of the country (5 levels) and
state (51 levels).
when moving away from this favorable random design. Moreover, we also construct
two simulated datasets with the same number of factors and levels as in the voter
turnout and student evaluations data, but with a random design matrix simulated
as described in Section 5.5. We will compare the simulated datasets with the real
ones, in order to isolate the contribution of the complexity in the sparsity structure
of Q.
• Sample size. We can analyze the impact of sample size by considering, for each
data set, a subsample with a smaller size, such as N = 7000, and a larger one with
N = 70000. However, instead of subsampling uniformly at random, we enforce
each level of each factor to appear in at least one observation. This guarantees that p stays the same, as long as we don't include interaction terms, whose size depends not only on which levels are observed, but also on how they interact.
• Factor size. The two data sets we consider differ significantly when it comes to
factor size. E.g. the student evaluations data set involves factors with thousands
of levels, compared to tens of levels for the voter turnout. We will see that factor
size plays a crucial role when considering the number of CG iterations.
Table 3 summarizes the results of our numerical study. Each row shows the number
of CG iterations for the different designs described above. We use the simple additive
model (a) as a benchmark and consider only one additional term at a time to isolate
the contribution of each component. E.g., if we include random slopes, we remove the
nested factor and interaction terms. The only exception is the three-way interactions,
as it would be unusual to consider them without the pairwise interactions. The last row
presents results for the full model, which includes all the effects mentioned. Each entry
in Table 3 reports the number of iterations for different sample sizes. Additionally, for
each data set, we include a column displaying the number of iterations obtained using
simulated data.
Case                 |      Voter Turnout        |    Students Evaluations
                     |  Real        Simulated    |  Real           Simulated
Random intercepts    |  30 (68)     25 (68)      |  26 (4101)      18 (4101)
                     |  36 (68)     24 (68)      |  35 (4101)      16 (4101)
Nested factor        |  42 (73)     28 (73)      |  63 (4115)      25 (4115)
                     |  54 (73)     27 (73)      |  94 (4115)      22 (4115)
Random slopes        |  48 (127)    38 (127)     |  84 (5229)      34 (5229)
                     |  68 (127)    40 (127)     |  150 (5229)     34 (5229)
2 way interactions   |  134 (763)   99 (787)     |  91 (19523)     31 (23585)
                     |  343 (786)   190 (787)    |  122 (89385)    35 (97897)
3 way interactions   |  166 (2952)  111 (3720)   |  98 (27588)     32 (31650)
                     |  442 (3569)  258 (3723)   |  128 (159633)   36 (168028)
Full                 |  189 (3016)  147 (3784)   |  122 (28730)    58 (32792)
                     |  532 (3633)  351 (3787)   |  266 (160775)   65 (169170)

Table 3: Average number of CG iterations. The average is obtained over 200 Gibbs sampler iterations after an initial burn-in. The number in parentheses represents p. In each entry, we report results obtained with N = 7 000 (above) and N = 70 000 (below).
• Table 3 shows, for the real data, an overall increase in the number of CG iterations
when the sample size increases. This might seem counter-intuitive since, for larger
N , the CI graph becomes denser and better connected. However, when examining
the spectrum of Q̄ for the random intercept only case, we observe that, for small
N , the interval [µ̄K+1 , µ̄p−2 ] is larger but most of the eigenvalues therein are
tightly concentrated in its center, around 1. In contrast, for larger N , the interval
[µ̄_{K+1}, µ̄_{p−2}] is smaller, consistent with the CI graph being better connected, but the spectrum is more diffuse. In other words, for smaller N, the effective condition number κ_{K+1,p−2}(Q̄) is worse, in accordance with the theoretical analysis of Section 5. However, in this example, κ_{s+1,p−r} is better for larger s, r, resulting in an overall faster convergence rate.
• The comparison between these two data sets highlights the significant impact of
factor size on the number of CG iterations. For the voter turnout data, which presents relatively small factors, CG can require several hundred iterations, even for solving relatively small problems of dimension on the order of a few thousand. Moreover, this issue is not specific to the details of the voter turnout design, as a similarly large number of iterations is also needed in the simulated case with a missing-completely-at-random design but the same factor sizes. On the other hand, when considering the student evaluations case, where some factors have numerous levels, CG proves to be very efficient, requiring only a few hundred iterations to solve significantly larger linear systems of dimension on the order of 10^5. This difference becomes even more significant in the simulated case, where approximately 50 iterations are
sufficient for convergence, even for the largest problems.
of the use of CG in the context of large Bayesian sparse regression models. Differently from their analysis, we consider the case where the design matrix V itself is sparse.
8. Discussion
We remark that the applicability of the methodologies presented in this article goes be-
yond Gibbs sampling. Standard inferential methodologies such as restricted maximum
likelihood estimation, Laplace approximation and variational inference require the com-
putation of the Cholesky factorization of a matrix as in (1), and would suffer from the
same limitations presented in Section 4. On the other hand, some of these methodolo-
gies can benefit from the use of CG and similar iterative methods. For example, in the
context of ML estimation, one could employ the method of moments proposed in Gao and Owen (2020) to obtain consistent estimates of the variance components and then use conjugate gradient to compute the GLS estimator. Also, in the context of partially factorized or unfactorized variational inference (Goplerud et al., 2024), one is required to invert matrices with the same structure as (1), which can be done via CG. In all these cases, the spectral analysis of Section 5 may be of interest to predict theoretical performance. However, the applicability of the CG algorithm is hindered in those cases where the determinant of large matrices is also needed, e.g., for the computation of marginal quantities. Several strategies have been proposed to approximate determinants of matrices via the Lanczos algorithm (Ubaru et al., 2017), though they are computationally more expensive than CG, and it is harder to control the error induced by such approximations.
This paper raises both theoretical and methodological questions. For example, it
would be interesting to provide an explicit result proving the cubic complexity of the Cholesky factorization under generic (K-partite) random designs. As for the analysis of the CG algorithm, Theorems 3 and 5 show how linear dependence among the columns of the design matrix V relates to small eigenvalues of Q, which, in general, significantly slows down the convergence of CG. A standard technique to address this issue involves imposing linear constraints on the vector of parameters θ (Wood, 2017; Zanella and Roberts, 2020). Perhaps surprisingly, an initial numerical exploration of such a procedure suggests that it does not actually lead to a significant improvement in the rate of convergence of the CG algorithm, and, in some cases, may even slow it down.
We leave a more detailed investigation to future work.
References
Alon, N. (1986). Eigenvalues and expanders. Combinatorica, 6(2):83–96.
Brito, G., Dumitriu, I., and Harris, K. D. (2022). Spectral gap in random bipartite
biregular graphs and applications. Combinatorics, Probability and Computing.
Ceriani, P. M. and Zanella, G. (2024). Linear-cost unbiased posterior estimates
for crossed effects and matrix factorization models via couplings. preprint
arXiv:2410.08939.
Chung, F. R. (1997). Spectral graph theory, volume 92. American Mathematical Soc.
Concus, P., Golub, G. H., and O’Leary, D. P. (1976). A generalized conjugate gradient
method for the numerical solution of elliptic partial differential equations. In Sparse
matrix computations, pages 309–332. Elsevier.
Erdős, P., Rényi, A., et al. (1960). On the evolution of random graphs. Publ. math. inst.
hung. acad. sci, 5(1):17–60.
Frazier, D. T., Loaiza-Maya, R., and Martin, G. M. (2023). Variational Bayes in state
space models: Inferential and predictive accuracy. Journal of Computational and
Graphical Statistics, 32(3):793–804.
Gao, K. and Owen, A. (2017). Efficient moment calculations for variance components
in large unbalanced crossed random effects models. Electronic Journal of Statistics,
11(1):1235–1296.
Gao, K. and Owen, A. B. (2020). Estimation and inference for very large linear mixed
effects models. Statistica Sinica, 30:1741–1771.
Gelman, A. and Hill, J. (2007). Data analysis using regression and multilevel/hierarchical
models, volume 3. Cambridge University Press New York, New York, USA.
Ghitza, Y. and Gelman, A. (2013). Deep interactions with MRP: Election turnout
and voting patterns among small electoral subgroups. American Journal of Political
Science, 57(3):762–776.
Ghosh, S., Hastie, T., and Owen, A. B. (2022). Backfitting for large scale crossed random
effects regressions. The Annals of Statistics, 50(1):560 – 583.
Godsil, C. and Royle, G. F. (2001). Algebraic graph theory, volume 207. Springer Science
& Business Media.
Golub, G. H. and van Loan, C. F. (2013). Matrix Computations. JHU Press, fourth
edition.
Goplerud, M., Papaspiliopoulos, O., and Zanella, G. (2024). Partially factorized varia-
tional inference for high-dimensional mixed models. Biometrika, in press.
Katzfuss, M. and Guinness, J. (2021). A general framework for Vecchia approximations
of Gaussian processes. Statistical Science, 36(1):124 – 141.
Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of
linear differential and integral operators. Journal of Research of the National Bureau
of Standards.
Lin, C.-J. and Moré, J. J. (1999). Incomplete Cholesky factorizations with limited
memory. SIAM Journal on Scientific Computing, 21(1):24–45.
Menictas, M., Di Credico, G., and Wand, M. P. (2023). Streamlined variational inference
for linear mixed models with crossed random effects. Journal of Computational and
Graphical Statistics, 32(1):99–115.
Nishimura, A. and Suchard, M. A. (2023). Prior-preconditioned conjugate gradient
method for accelerated Gibbs sampling in “large n, large p” Bayesian sparse regression.
Journal of the American Statistical Association, 118(544):2468–2481.
Papandreou, G. and Yuille, A. L. (2010). Gaussian sampling by local perturbations.
Advances in Neural Information Processing Systems, 23.
Papaspiliopoulos, O., Roberts, G. O., and Zanella, G. (2019). Scalable inference for
crossed random effects models. Biometrika, 107(1):25–40.
Papaspiliopoulos, O., Stumpf-Fétizon, T., and Zanella, G. (2023). Scalable Bayesian
computation for crossed and nested hierarchical models. Electronic Journal of Statis-
tics, 17(2):3575 – 3612.
Papaspiliopoulos, O. and Zanella, G. (2017). A note on MCMC for nested multilevel regression models via belief propagation. arXiv preprint arXiv:1704.06064.
Parker, A. and Fox, C. (2012). Sampling Gaussian distributions in Krylov spaces with
conjugate gradients. SIAM Journal on Scientific Computing, 34(3):B312–B334.
Perry, P. O. (2016). Fast moment-based estimation for hierarchical models. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 79(1):267–291.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models
using Pólya–Gamma latent variables. Journal of the American Statistical Association,
108(504):1339–1349.
Rivin, I. (2002). Counting cycles and finite dimensional Lp norms. Advances in Applied
Mathematics, 29(4):647 – 662.
Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications.
Chapman & Hall.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
royal statistical society: Series b (statistical methodology), 71(2):319–392.
Saad, Y. (2003). Iterative Methods for Sparse Linear Systems. Other Titles in Applied
Mathematics. SIAM, second edition.
Schäfer, F., Katzfuss, M., and Owhadi, H. (2021). Sparse Cholesky factorization by
Kullback–Leibler minimization. SIAM Journal on Scientific Computing, 43(3):A2019–
A2046.
Simpson, D. P., Turner, I. W., Strickland, C. M., and Pettitt, A. N. (2013). Scalable
iterative methods for sampling from massive Gaussian random vectors. preprint at
arXiv:1312.1476.
Tan, L. S. L. and Nott, D. J. (2013). Variational inference for generalized linear mixed
models using partially noncentered parametrizations. Statistical Science, 28(2):168–
188.
Ubaru, S., Chen, J., and Saad, Y. (2017). Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications.
Van der Sluis, A. and Van der Vorst, H. A. (1986). The rate of convergence of conjugate
gradients. Numerische Mathematik, 48:543–560.
van der Vorst, H. A. (2003). Iterative Krylov methods for large linear systems. Cambridge
University Press, Cambridge; New York.
Vono, M., Dobigeon, N., and Chainais, P. (2022). High-dimensional Gaussian sampling:
A review and a unifying approach based on a stochastic proximal point algorithm.
SIAM Review, 64(1):3–56.
Zanella, G. and Roberts, G. (2020). Multilevel linear models, Gibbs samplers and multigrid decompositions. Bayesian Analysis.
A. Proofs
Proof of Theorem 1. The number of operations required by the recursive equation in (2) is

Cost(Chol) = Σ_{m=1}^{p} [ 1 + n_{L,m} + n_{L,m}(1 + n_{L,m}) ] ,   (26)

which gives the equality in (5). For the lower bound, note that, by Jensen's inequality,

Σ_{m=1}^{p} n_{L,m}^2 = p ( p^{−1} Σ_{m=1}^{p} n_{L,m}^2 ) ≥ p ( p^{−1} Σ_{m=1}^{p} n_{L,m} )^2 = n_L^2 / p ,
operations. By definition of G_L, if {ℓ, m} ∈ G_L and {ℓ, j} ∈ G_L for ℓ < m < j, then also {m, j} ∈ G_L. Thus, the cost in (27) coincides with the number of 3-cycles in G_L. The upper bound in (5) then follows, noting that the number of 3-cycles in an undirected graph with n_L edges is less than or equal to n_L^{1.5}, see e.g. (Rivin, 2002, Theorem 4).
Proof of Proposition 2. Denote with Q[θk,g ; θk′ ,g′ ] the entry of Q relative to the pair
(θk,g , θk′ ,g′ ), and recall that, for k ̸= k ′ , its value coincides with the number of observa-
tions i = 1, . . . , N s.t. zi,k,g = 1 and zi,k′ ,g′ = 1. Finally, θk,g is adjacent to θk′ ,g′ in GQ
if and only if Q[θk,g ; θk′ ,g′ ] > 0.
Define the function r : {1, . . . , G} → {1, . . . , G} as r(1) = 1 and r(j) = ⌈d^{−1}(j − 1)⌉ for j = 2, . . . , G, so that Q[θ_{1,r(j)}; θ_{2,j}] = 1 for all j ≥ 1. Then, for every pair j and m such that r(j) ≤ m ≤ j, we now show that the future set of θ_{2,m} does not separate it
from θ2,j in GQ . We construct a path in GQ between θ2,m and θ2,j that goes through
θ2,1 as follows. Since both Q[θ1,j ; θ2,j ] and Q[θ1,r(j) ; θ2,j ] are positive for all j ≥ 1,
the path going from θ2,j to θ1,r(j) to θ2,r(j) to θ1,r(r(j)) to θ2,r(r(j)) etc. is supported on
GQ . Also, since r(ℓ) < ℓ for all ℓ ≥ 2 and r(1) = 1, the above path eventually reaches
θ2,1 . The same strategy can be applied to construct a path in GQ from θ2,m to θ2,1 .
Joining the two above paths at θ2,1 , we obtain a path from θ2,m to θ2,j in GQ . The
assumption r(j) ≤ m ≤ j together with r(ℓ) ≤ ℓ for all ℓ ensures that such path does
not involve elements in the future set of θ_{2,m} apart from θ_{2,j}. Thus, by (4), L[θ_{2,j}; θ_{2,m}] is a potential non-zero whenever r(j) ≤ m ≤ j, meaning that the row of L corresponding to θ_{2,j} contains at least j − r(j) + 1 potential non-zeros. Summing over j we obtain

n_L ≥ Σ_{j=2}^{G} ( j − r(j) + 1 ) ≥ Σ_{j=2}^{G} ( j − (j−1)/d ) ≥ ( (d−1)/d ) Σ_{j=2}^{G} j = ( (d−1)/d ) ( G(G+1)/2 − 1 ) .

The statements about n_L and Cost(Chol) follow from the above inequalities, Theorem 1 and p = 1 + 2G.
Proof of Theorem 3. We start by proving the results relative to the likelihood term U = V^T V = Σ_{i=1}^{N} v_i v_i^T.
Proof of part 1. First recall, from Section 3, that v_i = (1, z_{i,1}^T, . . . , z_{i,K}^T)^T ∈ R^p, where each z_{i,k} has only one entry equal to one and the remaining ones equal to zero. Consider the set of vectors W = {w_k ; k = 1, . . . , K}, where each w_k = (−1, 0_{G_1}^T, . . . , 1_{G_k}^T, . . . , 0_{G_K}^T)^T, and the vector of ones 1_{G_k} is in the position relative to the k-th factor. Then, for all i = 1, . . . , N and k = 1, . . . , K, we have v_i^T w_k = 0. Thus, U w_k = 0_p for each k. Since the vectors in W are linearly independent, we can deduce that dim(Null(U)) ≥ K. Also, since U is positive semidefinite, at least the first K eigenvalues must be equal to zero. The same holds for Ū, since the matrix Diag(U) is positive definite by hypothesis (U_ii ≥ 1 for each i). We have thus shown λ_1(Ū) = · · · = λ_K(Ū) = 0, as desired.
Proof of part 2. Recall that U = Diag(U ) + A, where A can be interpreted as a
weighted adjacency matrix representing a (K + 1)-partite graph (we can think of the
global effect θ0 as an additional part with one single vertex). We define Ā as the adja-
cency matrix A normalized by the row-wise sum of its elements, i.e. Ā = D −1/2 AD −1/2 .
As we showed in Section 5, it holds that
D_ii = Σ_{j=1}^{p} A_ij = K U_ii .
It follows that Ū and Ā satisfy Ū = I p + K Ā and, in particular, they share the same
eigenvectors. Finally, recall that the largest eigenvalue of any normalized adjacency
matrix Ā is λp (Ā) = 1 (Chung, 1997). As a consequence, the largest eigenvalue relative
to Ū is λp (Ū ) = 1 + Kλp (Ā) = K + 1, as desired.
Proof of the bounds on µ̄_1, . . . , µ̄_p. Finally, we translate the spectral results on Ū into bounds on the eigenvalues of Q̄. Notice that Q̄ = E + Ū, where E is a diagonal matrix with elements E_ii = T_ii / (T_ii + τ U_ii). Since U_ii ≥ 1, we can thus bound the eigenvalues of E as

λ_p(E) = max_i E_ii ≤ max_i T_ii / (T_ii + τ) ≤ T̄ / (τ + T̄) ,

where T̄ = max_{k=0,...,K} T_k. We can then obtain the inequalities of Theorem 3 by combining the above inequality with Weyl's inequality (Horn and Johnson, 2012, Theorem 4.3.1). Indeed, we have

0 < µ̄_K ≤ λ_K(Ū) + λ_p(E) ≤ T̄ / (τ + T̄)    (using λ_K(Ū) = 0),

as well as

µ̄_p ≤ λ_p(Ū) + λ_p(E) ≤ K + 2    (using λ_p(E) ≤ 1)

and

µ̄_p ≥ λ_p(Ū) + λ_1(E) ≥ K + 1    (using λ_1(E) ≥ 0).
Proof of Corollary 1. We begin by defining Ū^{(r)} = Diag(U^{(r)})^{−1/2} U^{(r)} Diag(U^{(r)})^{−1/2}. Theorem 3 applied to this matrix guarantees that its spectrum is concentrated in the interval [0, K] and, more specifically,

λ_1(Ū^{(r)}) = · · · = λ_{K−1}(Ū^{(r)}) = 0 ;    λ_{p−1}(Ū^{(r)}) = K .   (28)
Note that, with the notation introduced in Section 5.1, it holds that

Q̄^{(r)} = I_{p−1} + (C^{(r)})^{1/2} Ā^{(r)} (C^{(r)})^{1/2} ,
Proof of Corollary 2. We just need to prove that the spectrum of Ā^{(r)} is symmetric around zero. This together with Lemma 1 gives the result. Recall that the matrix Ā^{(r)} can be represented as
Ā^{(r)} = ( 0     B
            B^T   0 ) ,
with B ∈ R^{G_1×G_2}. Consider w.l.o.g. any positive eigenvalue λ > 0 of Ā^{(r)}. Let x be one of its eigenvectors and split it into x = [x_1, x_2], where x_1 ∈ R^{G_1} and x_2 ∈ R^{G_2}. Then x* = [x_1, −x_2] is an eigenvector relative to −λ. Indeed:

Ā^{(r)} x* = ( −B x_2 , B^T x_1 ) = ( −λ x_1 , λ x_2 ) = −λ x* .

Since this operation also preserves the multiplicity of the eigenvalues, the result holds true.
Proof of Theorem 4. By Theorem 3.2 of Brito et al. (2022), the adjacency matrix A^{(r)} satisfies

λ_{p−2}(A^{(r)}) ≤ √(d_1 − 1) + √(d_2 − 1) + ϵ_p ,
λ_2(A^{(r)}) ≥ − ( √(d_1 − 1) + √(d_2 − 1) ) − ϵ_p ,

asymptotically almost surely, with ϵ_p → 0 as p → ∞. For a bipartite biregular graph, it holds that Ā^{(r)} = A^{(r)} / √(d_1 d_2), so the bounds above extend to the eigenvalues ν̄_2 and ν̄_{p−2} of Ā^{(r)} as follows:

ν̄_{p−2} ≤ 1/√d_1 + 1/√d_2 + ϵ_p ,
ν̄_2 ≥ − 1/√d_1 − 1/√d_2 − ϵ_p .

Finally, we apply Lemma 1 with s = q = 2 so that, when K = 2, equation (16) becomes

κ_{3,p−2} ≤ ( 1 + ν̄_{p−2} ) / ( 1 + ν̄_2 ) .

Combining the above bounds we obtain the desired statement.
x^T U^{(r)} x = ‖V^{(r)} x‖² = Σ_{i=1}^{N} ( Σ_{k=1}^{K} x_{g_k[i]} )² = 0

happens if and only if Σ_{k=1}^{K} x_{g_k[i]} = 0 for each i = 1, . . . , N. Hence, we have the equivalence

x ∈ Null(U^{(r)}) ⇔ Σ_{k=1}^{K} x_{g_k[i]} = 0 , ∀ i = 1, . . . , N .   (30)
P_m^{(ℓ)}'s are disjoint subsets of the levels in factor ℓ, and the same goes for the Q_m^{(ℓ)}'s. For any ℓ = 1, . . . , K − 1 and m = 1, . . . , C_ℓ, we define the vector x_m^{(ℓ)} ∈ Null(U^{(r)}) as

x_{m,j}^{(ℓ)} = 1    if j belongs to P_m^{(ℓ)} ,
x_{m,j}^{(ℓ)} = −1   if j belongs to Q_m^{(ℓ)} ,
x_{m,j}^{(ℓ)} = 0    otherwise .

Note that x_m^{(ℓ)} satisfies (30) because, if an observation i does not involve P_m^{(ℓ)}, then x_{m,g_k[i]}^{(ℓ)} = 0 for each k. On the other hand, if x_{m,g_ℓ[i]}^{(ℓ)} = 1, then x_{m,g_{ℓ+1}[i]}^{(ℓ)} = −1 (by definition of connected components, an edge from P_m^{(ℓ)} can only connect to Q_m^{(ℓ)}), while x_{m,g_k[i]}^{(ℓ)} = 0 for each k ≠ ℓ, ℓ + 1 (by construction).
2. {x_m^{(ℓ)}}_{ℓ,m} are linearly independent. Consider the linear combination

Σ_{ℓ=1}^{K−1} Σ_{m=1}^{C_ℓ} β_m^{(ℓ)} x_m^{(ℓ)} = 0 .

If we consider the first G_1 components (i.e. we restrict to the levels of the first factor), only {x_m^{(1)}}_{m=1}^{C_1} have non-zero entries on such components. Moreover, since they are supported on the disjoint sets (P_m^{(1)})_{m=1}^{C_1}, it implies that β_1^{(1)} = · · · = β_{C_1}^{(1)} = 0. We now consider the G_2 components relative to the second factor. If we don't consider the previous eigenvectors, only {x_m^{(2)}}_{m=1}^{C_2} have non-zero entries on such components. Analogously to the previous case, we can conclude β_1^{(2)} = · · · = β_{C_2}^{(2)} = 0. We can further extend this procedure, proving that β_m^{(ℓ)} = 0 for each ℓ, m.
Proof of Theorem 6. Consider the matrix Ā^{(r)} ∈ R^{(p−1)×(p−1)}; Theorem 3 applied to it implies that ν̄_1 = · · · = ν̄_{K−1} = −1/(K − 1) and ν̄_{p−1} = 1. In particular, the invariant subspace associated to these eigenvalues is the K-dimensional space W = {x ∈ R^{p−1} : x is constant factor-wise} = Span(w_1, . . . , w_K), where w_k = (0_{G_1}^T, . . . , 1_{G_k}^T, . . . , 0_{G_K}^T)^T. We denote its orthogonal complement with δW = W^⊥. With this notation, Theorem 6 becomes a straightforward consequence of Lemmas 1 and 2 below.
where the multiplicative factor K − 1 comes from the fact that Ā^{(r)} is made of K cliques, hence each edge in the clique is connected to K − 1 levels in each other factor; however, when restricting to the pair (k, h), it is connected to only one level. Moreover, by properties of bipartite adjacency matrices (Brito et al., 2022), it holds that

max_{λ ∈ σ(Ā^{(k,h)}) : |λ| ≠ 1} |λ| = (K − 1) · max_{y ∈ R^{G_h} : y^T 1_{G_h} = 0} ‖Ā^{(r)}[k, h] y‖ / ‖y‖ .

For a given vector x ∈ δW, denote with x[k] ∈ R^{G_k} the vector restricted to factor k; by definition of δW, it holds that x[k]^T 1_{G_k} = 0. Finally, taking x ∈ δW, we have

‖Ā^{(r)} x‖² = Σ_{k=1}^{K} ‖ Σ_{h≠k} Ā^{(r)}[k, h] x[h] ‖²
            ≤ Σ_{k=1}^{K} Σ_{h≠k} ‖ Ā^{(r)}[k, h] x[h] ‖²
            ≤ ( η² / (K − 1)² ) Σ_{k=1}^{K} Σ_{h≠k} ‖x[h]‖²
            = ( η² / (K − 1)² ) (K − 1) ‖x‖²
            = ( η² / (K − 1) ) ‖x‖²
            = (λ*)² ‖x‖² / (K − 1)² .
where ν̄_K and ν̄_{p−D_0−1} would be, respectively, the K-th smallest and the second-largest eigenvalues of Ā^{(r)} ∈ R^{(p−D_0)×(p−D_0)}. Provided that θ_0 is low-dimensional, the above inequality would give an upper bound on the effective condition number after the removal of a few extreme eigenvalues.
Case                 |  Voter Turnout       |  Students Evaluations
                     |  Jacobi    IChol     |  Jacobi    IChol
Random intercepts    |    36        19      |    35        16
Nested effect        |    53        31      |    94        37
Random slopes        |    70        27      |   150        17
2 way interactions   |   338       104      |   121        53
3 way interactions   |   445       108      |   126        54
Everything           |   517       140      |   262        67

Table 4: Average number of CG iterations with Jacobi and Incomplete Cholesky preconditioning for the data sets considered in Section 6. We considered N = 70 000 observations.
the one of Theorem 4; however, in this case, we need an asymptotic lower bound on the expected degrees G_1 π and G_2 π to guarantee sufficient connectivity in the limit.
Theorem 7. Let Q = Diag(Q) + τ A be the posterior precision matrix in (11). Let A(r)
be the adjacency matrix of an Erdős-Rényi bipartite random graph. Let G2 ≥ G1 ≥ 1
and π = π(G1 ) be such that
G_1 π = Ω( √(G_1 π) log³(G_1) ) ,    G_2 π = Ω( √(G_2 π) log³(G_2) ) .
Proof. The proof of this result follows the same steps as the proof of Theorem 4, using the additional result from Ashcroft (2021).
The corresponding U^{(r)} will be

U^{(r)} = ( 1 0 1 0 0 1
            0 2 1 1 1 1
            1 1 2 0 1 1
            0 1 0 1 0 1
            0 1 1 0 1 0
            1 1 1 1 0 2 ) .

Notice that dim(Null(U^{(r)})) = 3 and that the following vectors generate the null space of U^{(r)}:

x_1 = (1, 1, −1, −1, 0, 0)^T ,   x_2 = (1, 1, 0, 0, −1, −1)^T ,   x_3 = (1, 0, −1/2, 1/2, 1/2, −1/2)^T .
In particular, the first two vectors are the directions identified by Theorem 3, while the
third one is specific to the symmetry of this particular example. By construction, Ū^{(r)} also has a null space of dimension 3. Finally, because of the relationship in (15), the corresponding eigenvalues of Ā^{(r)} will satisfy ν̄_1 = ν̄_2 = ν̄_3 = −1/2.
To conclude, this example provides a design where the bipartite subgraph of any pair of factors is made of a unique connected component, but ν̄_K = −1/(K − 1).
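A quick numerical check of this example (using the matrix U^{(r)} as written above; numpy is used purely for illustration):

import numpy as np

U_r = np.array([
    [1, 0, 1, 0, 0, 1],
    [0, 2, 1, 1, 1, 1],
    [1, 1, 2, 0, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 2],
], dtype=float)

x1 = np.array([1, 1, -1, -1, 0, 0], dtype=float)
x2 = np.array([1, 1, 0, 0, -1, -1], dtype=float)
x3 = np.array([1, 0, -0.5, 0.5, 0.5, -0.5])

for x in (x1, x2, x3):
    print(U_r @ x)                   # all zero vectors: x1, x2, x3 lie in Null(U^(r))

print(np.linalg.matrix_rank(U_r))    # rank 3, hence dim(Null(U^(r))) = 3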
Algorithm 1 Cholesky factorization
Input: Q ∈ R^{p×p} symmetric and positive definite
Output: L ∈ R^{p×p} lower triangular s.t. Q = LL^T
1: for j ∈ {1, . . . , p} do
2:    L_jj ← sqrt( Q_jj − Σ_{k=1}^{j−1} L_jk^2 )
3:    for i ∈ {j + 1, . . . , p} do
4:       L_ij ← Q_ij / L_jj
5:       for k ∈ {1, . . . , j − 1} do
6:          L_ij ← L_ij − (L_ik · L_jk / L_jj)
7:       end for
8:    end for
9: end for
10: return L

Algorithm 2 Incomplete Cholesky
Input: Q ∈ R^{p×p} symmetric, S sparsity set
Output: L ∈ R^{p×p} lower triangular with Supp(L) ⊂ S
1: for j ∈ {1, . . . , p} do
2:    L_jj ← sqrt( Q_jj − Σ_{k : (j,k)∈S} L_jk^2 )
3:    for i > j : (i, j) ∈ S do
4:       L_ij ← Q_ij / L_jj
5:       for k < j : (i, k), (j, k) ∈ S do
6:          L_ij ← L_ij − (L_ik · L_jk / L_jj)
7:       end for
8:    end for
9: end for
10: return L
L w = m ,    L^T θ = w + z .

Specifically, one needs to solve a lower- and an upper-triangular linear system, respectively, which can be done with O(n_L) cost via forward and backward substitution.
Algorithm 3 Conjugate Gradient
Input: Q ∈ R^{p×p} symmetric positive definite, b ∈ R^p, ϵ > 0 desired accuracy.
1: x_0 = 0, r_0 = b, p_0 = r_0.
2: for j = 0, 1, . . . until ‖r_j‖_2 < ϵ, do
3:    α_j = r_j^T r_j / (p_j^T Q p_j)
4:    x_{j+1} = x_j + α_j p_j
5:    r_{j+1} = r_j − α_j Q p_j
6:    β_j = r_{j+1}^T r_{j+1} / (r_j^T r_j)
7:    p_{j+1} = r_{j+1} + β_j p_j
8: end for
9: return x_{j+1}

Algorithm 4 Preconditioned CG
Input: Q ∈ R^{p×p} symmetric positive definite, b ∈ R^p, ϵ > 0, M preconditioner.
1: x_0 = 0, r_0 = b, z_0 = M^{−1} r_0, p_0 = z_0.
2: for j = 0, 1, . . . until ‖r_j‖_2 < ϵ, do
3:    α_j = r_j^T z_j / (p_j^T Q p_j)
4:    x_{j+1} = x_j + α_j p_j
5:    r_{j+1} = r_j − α_j Q p_j
6:    z_{j+1} = M^{−1} r_{j+1}
7:    β_j = r_{j+1}^T z_{j+1} / (r_j^T z_j)
8:    p_{j+1} = z_{j+1} + β_j p_j
9: end for
10: return x_{j+1}
D. Additional Figures
D.1. Sparsity structure of Q resulting from Example 1