
Conjugate gradient methods for

high-dimensional GLMMs
Andrea Pandolfi∗, Omiros Papaspiliopoulos†, and Giacomo Zanella‡

November 8, 2024

Generalized linear mixed models (GLMMs) are a widely used tool in sta-
tistical analysis. The main bottleneck of many computational approaches
lies in the inversion of the high dimensional precision matrices associated
with the random effects. Such matrices are typically sparse; however, the
sparsity pattern resembles that of a multipartite random graph, which does not
lend itself well to default sparse linear algebra techniques. Notably, we show
that, for typical GLMMs, the Cholesky factor is dense even when the origi-
nal precision is sparse. We thus turn to approximate iterative techniques, in
particular to the conjugate gradient (CG) method. We combine a detailed
analysis of the spectrum of said precision matrices with results from random
graph theory to show that CG-based methods applied to high-dimensional
GLMMs typically achieve a fixed approximation error with a total cost that
scales linearly with the number of parameters and observations. Numerical
illustrations with both real and simulated data confirm the theoretical find-
ings, while at the same time illustrating situations, such as nested structures,
where CG-based methods struggle.

Keywords: Bayesian computation; High-dimensional Gaussians; Conjugate gradient samplers; Cholesky factorization; Random graphs.


∗ Bocconi University, Department of Decision Sciences
† Bocconi University, Department of Decision Sciences and BIDSA
‡ Bocconi University, Department of Decision Sciences and BIDSA. GZ acknowledges support from the European Research Council (ERC), through StG “PrSc-HDBayLe” grant ID 101076564.

1. Introduction
Generalized linear mixed models (GLMMs) are a foundational statistical tool, widely
used across multiple disciplines (Gelman and Hill, 2007; Wood, 2017). GLMMs extend
the framework of Generalized Linear Models by incorporating both fixed effects and
random effects. Fixed effects capture population-level trends and relationships, while
random effects account for individual deviations from these trends. GLMMs utilize a
link function to establish a relationship between the mean of the response variable and
the linear predictor, which is given by a linear combination of both fixed and random
effects (see Section 3).
In various settings, the factors associated with the random effects may include numer-
ous categories, known as levels, resulting in models with a large number of parameters
p and a large number of observations N , both potentially in the order of several thou-
sands. This scenario often arises in contemporary applications. For instance, within
the political sciences, one may consider a geographic factor with numerous units, or one
may include factors encoding so-called deep interactions (Ghitza and Gelman, 2013).
Another context where such models find utility is in recommendation systems, where
categorical variables represent customers and products (Perry, 2016; Gao and Owen,
2017).
Our work is relevant for various computational algorithms used to fit GLMMs, such
as Gibbs sampling, variational inference, algorithms for restricted maximum likelihood
estimation, and Laplace approximations (see Section 8 for a discussion). The primary
computational bottleneck of these methodologies lies in the factorization of a sparse
high-dimensional matrix, denoted by Q. This matrix decomposes as
Q = T + V^T Ω V ,   (1)
where T is the prior component (or, equivalently, a regularization term) and is usually
diagonal or easily factorizable; Ω is a diagonal matrix and adjusts for the variances of the
response variables; and V ∈ RN ×p denotes the design matrix. In particular, we mostly
focus on performing Bayesian inferences using so-called blocked Gibbs samplers, that
alternate the updates of regression parameters, variance hyperparameters and potential
additional latent variables (see Section 6.1). The update of the regression parameters
requires sampling from a multivariate Gaussian distribution with precision matrix Q as
in (1), and it is usually the most computationally intensive step in those algorithms.
The standard procedure to sample from a multivariate normal distribution involves
the computation of the Cholesky factor of Q (see Section 2.1). It is well established
that the Cholesky factorization can be computed efficiently with nested factors (see e.g.
Papaspiliopoulos and Zanella, 2017 and references therein). On the contrary, we will
show that, for general crossed factors, under standard random design assumptions, the
cost of computing the Cholesky factor scales as O(p^3), even when Q is sparse (see Section
4 for further details). Thus, since exact factorization of Q is too expensive, we consider
an alternative strategy to sample from the desired Gaussian distribution, which involves
the solution of a properly perturbed linear system Qθ = b (see Section 2.2). Here,
instead of finding the exact solution of the linear system, we employ the well-known

conjugate gradient (CG) algorithm to find an approximate solution or, equivalently,
produce an approximate sample. We will demonstrate that, in the general case where
the Cholesky factorization becomes inefficient, the CG sampler only requires a constant
(not growing with N and p) number of matrix-vector multiplications Qb, which results
in a total of O (max(N, p)) cost to produce an approximate sample.
The paper is organized as follows. In Section 2, we briefly review the algorithms for
the Cholesky factorization and the CG method, with a particular focus on their time
complexity. In Section 3, we describe the specific instance of GLMMs, which we refer
to as random-intercept crossed effects models, that will serve as the reference model for
the theoretical analysis in Sections 4 and 5. In Section 4, we show that exact Cholesky
factorization is expensive for such models; while, in Section 5, we show that the CG
algorithm converges fast for such models. The bulk of our technical contribution lies
in the analysis of the spectrum of Q performed in Section 5. Finally, in Section 6, we
describe the proposed methodologies for the case of general GLMMs, and illustrate them
with simulated and real data. Specifically, we consider an application to survey data for the 2004 American presidential elections (Ghitza and Gelman, 2013) and to a data set of Instructor Evaluations by Students at ETH (from the lme4 R library). The proofs
of the presented results are given in the supplementary material and code to reproduce
the experiments is available at github.com/AndreaPandolfi/ASLA.jl.

2. Review of high-dimensional Gaussian sampling


In this section, we review some well-known algorithms to sample from a high dimen-
sional Gaussian distribution. Specifically, we consider the problem of sampling θ ∼
N p (Q−1 m, Q−1 ). Such structure arises in several contexts, such as Bayesian regression
(Nishimura and Suchard, 2023), spatial models with GMRFs (Rue et al., 2009), and
GLMMs, as in this case (see Section 3). We are mostly interested in the complexity of
these algorithms when p is large and Q is sparse. We say that a matrix Q ∈ Rp×p is
sparse, if the number of non-zero entries grows slower than the number of total possible
entries, namely if nQ = o(p2 ) as p → +∞, where nQ denotes the number of non-zero
entries of Q.

2.1. Cholesky factorization


The Cholesky factorization of a positive-definite matrix Q ∈ Rp×p is a factorization of
the form Q = LLT , where L ∈ Rp×p is a lower triangular matrix with real and positive
diagonal entries. Such a factorization can be computed with the following column-wise recursion:

L_{mm} = \sqrt{ Q_{mm} - \sum_{\ell=1}^{m-1} L_{m\ell}^2 } ,  \qquad  L_{jm} = \frac{1}{L_{mm}} \Big( Q_{jm} - \sum_{\ell=1}^{m-1} L_{m\ell} L_{j\ell} \Big) ,  \quad j > m .   (2)

Given the Cholesky factor L, one can sample from N (Q−1 m, Q−1 ) by solving linear
systems in L and LT , which can be done efficiently in O (nL ) time via forward and

backward substitution. A description of the algorithm can be found in the supplementary
material.
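To make the sampling step concrete, the following is a minimal NumPy/SciPy sketch (not taken from the paper's codebase; the function name is illustrative) of drawing θ ∼ N_p(Q^{-1} m, Q^{-1}) via the Cholesky factor and forward/backward substitution.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_gaussian_chol(Q, m, rng=None):
    """Draw theta ~ N(Q^{-1} m, Q^{-1}) using the Cholesky factor Q = L L^T."""
    rng = np.random.default_rng() if rng is None else rng
    L = cholesky(Q, lower=True)                 # O(p^3) flops for a dense Q
    # mean: solve Q mu = m by one forward and one backward substitution
    w = solve_triangular(L, m, lower=True)
    mu = solve_triangular(L.T, w, lower=False)
    # noise: if zeta ~ N(0, I_p), then L^{-T} zeta ~ N(0, Q^{-1})
    zeta = rng.standard_normal(Q.shape[0])
    return mu + solve_triangular(L.T, zeta, lower=False)
```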
For dense matrices, the exact computation of the Cholesky factor requires O(p^3) time and the storage of O(p^2) entries. For sparse matrices with specific structures, the computational requirements can be drastically reduced. If Q can be turned into a banded matrix via row and column permutations, the cost of computing L can be reduced to O(p b^2), where b is the bandwidth. When considering spatial Gaussian Markov random
fields (GMRFs) or nested hierarchical models, such reordering can be done efficiently,
which makes sampling via Cholesky factor efficient in such cases (see Papaspiliopoulos
and Zanella, 2017, and extensive references therein). On the other hand, there is little
work in the literature that studies the computational cost of the Cholesky factorization
for general GLMMs with non-nested designs.

2.1.1. Conditional Independence Graphs and Cholesky complexity analysis


The conditional independence structure of a given Gaussian vector θ ∼ N p (Q−1 m, Q−1 )
is described by Q through the relation
θj ⊥ θm | θ −jm ⇐⇒ Qjm = 0 , (3)
where θ −jm denotes θ after the removal of the entries j and m. Hence, the conditional
independence (CI) graph of θ is entirely described by Supp(Q), the support of Q (Rue
and Held, 2005, Ch. 2.1.5). We denote the CI graph of θ by GQ , whose vertices are the
variables {θj , j = 1, . . . , p} and the edges are those (θj , θm ) s.t. Qjm ̸= 0 for j ̸= m.
The Cholesky factor L of Q also has a probabilistic interpretation: for m < j, L_{jm} = 0 if and only if θ_j and θ_m are independent given the future set of θ_m excluding θ_j, i.e. θ_m ⊥ θ_j | θ_{{(m+1):p}\j} (Rue and Held, 2005, Theorem 2.8). Therefore, even when Q is sparse, L might not be, and the order we assign to the elements of θ influences the sparsity of L. The possibly non-zero entries of L can be deduced from G_Q as follows: a sufficient condition to ensure L_{jm} = 0 for m < j is that the future set of θ_m separates it from θ_j in G_Q. This motivates defining the number of possible non-zero entries in L as n_L = \sum_{m=1}^{p} n_{L,m}, where
nL,m = |{j ≥ m : the future set of θm does not separate it from θj in GQ }| . (4)
Thus, the Cholesky factor L involves nL − nQ ≥ 0 additional potential non-zero entries
compared to the original matrix Q. Such additional non-zero terms are commonly
referred to as fill-ins. Since nL depends on the ordering of the variables in θ, standard
algorithms for Cholesky factorizations of sparse matrices proceed in two steps: first they
try to find an ordering of variables that reduces nL as much as possible, and then compute
the corresponding Cholesky factor using (2). Finding the ordering that minimizes nL
is NP-hard, but various heuristic strategies to find good orderings are available (Golub
and van Loan, 2013, Ch. 11).
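As an illustration of how n_L can be computed for a given ordering, the following sketch runs the standard "elimination game" on the CI graph: when a vertex is eliminated, its not-yet-eliminated neighbours are pairwise connected, and the resulting filled pattern matches the characterization in (4) (assuming no exact numerical cancellation). Names are ours.

```python
def fill_in_count(adj, order):
    """Count the possible non-zeros n_L of the Cholesky factor for a given
    elimination order. `adj` maps each vertex to its set of neighbours in Supp(Q)."""
    adj = {v: set(nb) for v, nb in adj.items()}     # work on a copy
    eliminated = set()
    n_L = 0
    for v in order:
        later = adj[v] - eliminated                 # neighbours in the future set of v
        n_L += len(later) + 1                       # column v: diagonal + potential fill
        for u in later:                             # fill-in: clique on `later`
            adj[u] |= later - {u}
        eliminated.add(v)
    return n_L

# toy check: a path graph 0-1-2-3 produces no fill-in under the natural order
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(fill_in_count(adj, [0, 1, 2, 3]))             # 7 = lower triangle of Q incl. diagonal
```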
Sparsity in L has direct consequences for the computational cost required to compute it, although it has to be appreciated that a sparse Cholesky factor is not necessarily computable efficiently. The following theorem quantifies this connection.

Theorem 1. Denoting with Cost(Chol) the number of floating point operations (flops)
needed to compute the Cholesky factor L of a positive definite matrix Q, we have
O( n_L^2 / p ) \le \mathrm{Cost(Chol)} = O\Big( \sum_{m=1}^{p} n_{L,m}^2 \Big) \le O( n_L^{1.5} ) .   (5)

The equality and lower bound in (5) are well-known, while the upper bound is more involved, and we have not been able to find it in the literature. Note that, trivially, the result also implies Cost(Chol) ≥ O(n_Q^2 / p).
When Q is a dense matrix we have n_L = O(p^2) and thus the lower and upper bounds in (5) coincide, being both cubic in p, and they are both tight. For sparse matrices, instead, the two bounds can differ up to a O(p^{0.5}) multiplicative factor and each can be tight depending on the sparsity pattern. For example, if Q is a banded matrix with bandwidth b, we have n_L = O(pb) and Cost(Chol) = O(p b^2), hence the lower bound is tight while the upper bound is off by a O((p/b)^{0.5}) factor. On the contrary, for a matrix Q with a dense p^{0.5} × p^{0.5} sub-matrix and diagonal elsewhere, we have n_L = O(p) and Cost(Chol) = O(p^{1.5}), meaning that the upper bound is tight while the lower bound is off by a O(p^{0.5}) factor.

2.2. Conjugate Gradient


Conjugate Gradient (CG) is a widely used iterative algorithm for solving large systems of linear equations Qθ = b in which Q is positive definite (Golub and van Loan, 2013; Saad, 2003). CG iteratively minimizes the quadratic form associated with the linear system. However, instead of optimizing along the gradient direction, it restricts to the component of the gradient which is conjugate (i.e. Q-orthogonal) to the previous search directions.
Each CG iteration only requires evaluation of matrix-vector products Qθ and scalar
products of p-dimensional vectors. This feature makes CG methods very appealing for
solving sparse linear systems, as each iteration only requires O (nQ ) operations. The
algorithm is also optimal in terms of memory efficiency, as it only requires storing O (p)
values: basically only the approximate solution, which gets updated in place.
Several strategies have been proposed to use the CG algorithm to sample from high-dimensional Gaussian distributions. In this paper, we will refer to the perturbation optimization
sampler (Papandreou and Yuille, 2010; Nishimura and Suchard, 2023), which requires
solving the following linear system

Qθ = m + z, z ∼ N (0, Q) (6)

Simple computations show that, when the linear system in (6) is solved exactly, θ is an
exact sample from N_p(Q^{-1} m, Q^{-1}). If the linear system is solved via the CG method, an
approximate sample will be returned. A potential limitation for the application of this
algorithm is given by the necessity to sample efficiently from N p (0, Q), which is however
feasible for matrices as in (1) (see end of Section 3). Notice that in this case, CG is only

used to solve the linear system in (6), while other CG samplers (Parker and Fox, 2012;
Vono et al., 2022) exploit CG-related algorithms, specifically Lanczos tri-diagonalization (Lanczos, 1950), to build an approximate low-rank square root of Q^{-1}, which is then used to sample. Notice, however, that if Q is full-rank but has only k < p distinct eigenvalues, then (6) would return an exact sample in k steps (see Theorem 2), while other methodologies based on low-rank approximations would not.
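As a concrete illustration, a minimal sketch of the perturbation optimization sampler in (6), assuming the caller supplies the matrix-vector product v ↦ Qv and a draw z ∼ N_p(0, Q); it relies on SciPy's generic CG solver, and the function name is ours, not the paper's.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def perturbation_optimization_sample(Q_matvec, m, z, p):
    """Approximate draw from N_p(Q^{-1} m, Q^{-1}) by solving Q theta = m + z with CG,
    where z ~ N_p(0, Q) is supplied by the caller (see (6)).
    Q_matvec computes v -> Q v, so Q is never formed or factorized explicitly."""
    Q_op = LinearOperator((p, p), matvec=Q_matvec, dtype=float)
    theta, info = cg(Q_op, m + z)    # info == 0 signals convergence
    return theta
```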

2.2.1. Rate of convergence


When studying the complexity of the CG algorithm, one needs to quantify how fast the solution at iteration k approaches the exact solution of the linear system. The convergence behavior of the CG algorithm has been extensively studied in the literature (Golub and van Loan, 2013, Section 11). We report the most well-known result in the following theorem.

Theorem 2. Consider the linear system Qθ = b, with Q positive definite. Denote the
starting vector with θ 0 , then the Q-norm distance between the k-th CG iterate θ k and
the solution θ satisfies the inequality
\frac{ ||θ_k − θ||_Q }{ ||θ_0 − θ||_Q } \le 2 \left( \frac{ \sqrt{κ(Q)} − 1 }{ \sqrt{κ(Q)} + 1 } \right)^{k} ,   (7)

where κ(Q) = λmax (Q)/λmin (Q) denotes the condition number of Q. Moreover, if Q
has only k < p distinct eigenvalues, CG returns the exact solution after k iterations.

See Trefethen and Bau (1997) for a proof of these results. Theorem 2 shows that CG
has a fast convergence rate when either Q has a small condition number or when it has
few distinct eigenvalues.
It is well-known in the CG literature that the convergence rate presented in Theorem
2 is very conservative. Indeed, CG is also fast when most of the eigenvalues, except a few outlying ones, are clustered in a small interval [µ_s, µ_{p−r}], with µ_s not close to 0. In
this case, one observes that, after a small number of iterations, CG behaves as if the
components corresponding to the outlying eigenvalues have been removed, and the CG
convergence rate changes as if the condition number in (7) is replaced by the effective
value µp−r /µs (Concus et al., 1976; van der Vorst, 2003). This behavior is known as
superlinear convergence of conjugate gradients. For this reason, instead of considering
κ(Q), we will focus on the effective condition number (Van der Sluis and Van der Vorst,
1986)
κ_{s+1,p−r}(Q) = µ_{p−r} / µ_{s+1} ,   (8)
obtained by removing the smallest s eigenvalues of Q and the r largest ones. In Section
5, we will provide upper bounds on κs+1,p−r (Q) in the GLMM context, for small s and
r.
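As a concrete reading of (8), the following small sketch computes κ_{s+1,p−r} from a full eigendecomposition; this is only sensible for moderate p and is meant for illustration (the function name is ours).

```python
import numpy as np

def effective_condition_number(Q_bar, s, r):
    """kappa_{s+1,p-r}(Q_bar) = mu_{p-r} / mu_{s+1}: the condition number after
    discarding the s smallest and the r largest eigenvalues."""
    mu = np.sort(np.linalg.eigvalsh(Q_bar))   # ascending eigenvalues mu_1 <= ... <= mu_p
    return mu[-(r + 1)] / mu[s]
```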

3. Random-intercept crossed effects models
We now describe the model we will refer to in the theoretical analysis of Section 4 and
5. Specifically, we present the class of random-intercept crossed effects models.
Model 1 (Random-intercept crossed effects models). For each observation i ∈ {1, . . . , N}, consider a univariate continuous response y_i ∈ R distributed as

y_i | η_i ∼ N(η_i, τ^{-1}) ,   η_i = θ_0 + \sum_{k=1}^{K} z_{i,k}^T θ_k ,   (9)

where θ_0 is a global intercept, θ_k = (θ_{k,1}, . . . , θ_{k,G_k})^T is the vector of random effects for factor k, and z_{i,k} ∈ {0, 1}^{G_k} with \sum_{g=1}^{G_k} z_{i,k,g} = 1 is a “one-hot” vector which encodes the level of factor k for observation i. Here K is the number of factors, and G_k is the number of levels of factor k. We assume independent priors θ_{k,g} ∼ N(0, T_k^{-1}), for each factor k and level g. The fixed effect is assigned either a normal or an improper flat prior, and the precision parameters are assigned conjugate gamma distributions, though neither of these choices is directly relevant for our analysis.
If we define θ = (θ_0, θ_1^T, . . . , θ_K^T)^T ∈ R^p (with p = 1 + \sum_k G_k) and v_i = (1, z_{i,1}^T, . . . , z_{i,K}^T)^T ∈ R^p, we can write η_i as v_i^T θ. Under Model 1, the posterior conditional distribution of θ is

θ | y, V, T, τ ∼ N_p( Q^{-1}(T m_0 + τ V^T y), Q^{-1} ) ,   (10)

where m_0 is the prior mean of θ, T its prior precision, V = [v_1 | . . . | v_N]^T ∈ R^{N×p} and

Q = T + τ V^T V .   (11)

In Sections 4 and 5, we study the time complexity of sampling from (10) with the algorithms presented in Section 2. Regarding the CG sampler, we will only focus on the complexity of the CG step. Indeed, sampling z ∼ N_p(0, Q) with Q as in (11) can be done in O(N + p) time, by sampling ζ ∼ N_p(0, I_p) and η ∼ N_N(0, I_N) independently, and setting z = T^{1/2} ζ + \sqrt{τ} V^T η.
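The following sketch (with illustrative sizes, not taken from the paper) assembles V and Q for a random-intercept crossed effects model, draws z ∼ N_p(0, Q) in O(N + p) time as described above, and produces an approximate sample from (10) with SciPy's CG solver.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)

# toy crossed design with K = 2 factors (sizes are illustrative)
N, G = 5000, (200, 300)
levels = [rng.integers(0, Gk, size=N) for Gk in G]   # observed level of each factor
p = 1 + sum(G)
tau, T_diag = 1.0, np.ones(p)                        # prior precision T = I_p

# design matrix V = [1 | Z_1 | Z_2] in sparse format
offsets = np.cumsum([1] + list(G[:-1]))              # first column of each factor block
cols = np.concatenate([np.zeros(N, dtype=int)] +
                      [off + lv for off, lv in zip(offsets, levels)])
rows = np.tile(np.arange(N), 1 + len(G))
V = sp.csr_matrix((np.ones(len(cols)), (rows, cols)), shape=(N, p))

Q = sp.diags(T_diag) + tau * (V.T @ V)               # precision as in (11)

# z ~ N_p(0, Q) in O(N + p): z = T^{1/2} zeta + sqrt(tau) V^T eta
z = (np.sqrt(T_diag) * rng.standard_normal(p)
     + np.sqrt(tau) * (V.T @ rng.standard_normal(N)))

b = rng.standard_normal(p)                           # stand-in for T m_0 + tau V^T y
theta, info = cg(Q, b + z)                           # approximate draw from (10)
```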
Random-intercept models defined as in Model 1 are a specific instance of the more
general class of GLMMs considered in Section 6, which motivates our work. Throughout
the paper, Model 1 will serve as a case study to develop theoretical results. We restrict our attention to the random-intercept case as it allows for exhaustive theoretical tractability, while still preserving the crossed sparsity structure that constitutes the main computational challenge for general GLMMs with non-nested designs (see e.g. Section 4). Notice also that various recent works provided methodologies and theoretical results for this class of models (Papaspiliopoulos et al., 2019; Ghosh et al., 2022). Note that,
while we develop theoretical results in the context of Model 1, our methodology and
numerical results apply to the more general class of GLMMs described in Section 6,
which in particular includes random slopes and interaction terms.
For the sake of simplicity, the only fixed effect included in Model 1 is the intercept
θ0 ∈ R. Nonetheless, all the results of the following sections could be restated to include

multivariate fixed effects, θ 0 ∈ RD0 , leading to the same conclusions, provided that D0
is sufficiently small relative to the dimensionality of the random effects (θ 1 , . . . , θ K ) (see
Section B.1 in the supplementary material for further details).

3.1. Design assumptions


Consider the precision matrix defined in (11). Since T is a diagonal matrix, the off-diagonal support of Q is entirely characterized by the likelihood term, which we will denote with U = V^T V = \sum_{i=1}^{N} v_i v_i^T. In particular, simple computations show that

U[θ_0, θ_0] = N ,   U[θ_0, θ_{k,g}] = |{i = 1, . . . , N : z_{i,k,g} = 1}| ,
U[θ_{k,g}, θ_{k,g}] = |{i : z_{i,k,g} = 1}| ,   U[θ_{k,g}, θ_{k′,g′}] = |{i : z_{i,k,g} = 1, z_{i,k′,g′} = 1}| ,   (12)

where U[θ_{k,g}, θ_{k′,g′}] refers to the entry relative to the pair (θ_{k,g}, θ_{k′,g′}). We assume without loss of generality that each level of each factor is observed at least once, which is equivalent to saying that U[θ_{k,g}, θ_{k,g}] ≥ 1, for all factors k = 1, . . . , K and levels g = 1, . . . , G_k.
In the analogy with conditional independence graphs described in Section 2.1.1, the
last equality in (12) implies that θk,g is connected to θk′ ,g′ in GQ if and only if k ̸= k ′ and
there exists at least one observation where the two levels are observed together. Thus, the
resulting CI graph is a (K +1)-partite graph, with one trivial block containing the vertex
θ0 , and a block of size Gk for each factor k = 1, . . . , K. Since U [θ0 , θk,g ] = U [θk,g , θk,g ] > 0
for each k, g, then θ0 is connected to all the other vertices.
Throughout our analysis, we will consider a regime where N, p → ∞, while K is fixed.
Notice that N ≥ p/(K + 1), since we assume that each θk,g appears in at least one
observation. Finally, we assume that n_Q or, equivalently, the number of edges in G_Q, is O(N).

4. Cholesky factorization with sparse crossed designs


In this section, we will provide a simple theoretical example as well as numerical results
with random designs showing that the Cholesky factorization is not suitable to efficiently
factorize the posterior precision matrices that arise from Model 1.

Proposition 1. Under Model 1, nL is a non-increasing function of the position of θ0


in the ordering of θ. Hence, placing θ0 last in the ordering always minimizes nL .

Proposition 1 implies that we can place θ0 last without loss of generality. This helps to
reduce nL but it does not solve the problem. The following example describes a simple
sparse design that is catastrophic in this sense, where n_Q = O(p), but n_L = O(p^2).


Example 1. Consider a random intercept model with K = 2 and G1 = G2 = G. Fix


an integer d ≥ 2, and for each g = 1, . . . , G − 1, connect the vertex θ1,g to θ2,j (this is
equivalent to adding an observation i, s.t. g1 [i] = g and g2 [i] = j) for all j’s for which
at least one of the following conditions hold: (a) g = j; (b) d(g − 1) ≤ j − 2 < dg
mod (G − 1) and g < j; (c) d(g − 1) ≤ j − 1 < dg mod (G − 1) and g > j. For

g = G, connect θ_{1,G} to all the θ_{2,j}'s that have degree less than or equal to d. A graphical
representation of the resulting precision matrix Q can be found in Section D of the
supplementary material.

Proposition 2. Under the design of Example 1 with d fixed and G → ∞, it holds that

n_Q = O(pd) ,   n_L = O(p^2)   and   Cost(Chol) = O(p^3) .   (13)

Proposition 2 shows that sparse crossed designs can lead to poor performances of sparse
Cholesky solvers. We now provide a numerical study that shows that this is indeed the
case for most sparse crossed designs. We consider missing completely at random designs
where each cell in the K-dimensional data contingency table contains an observation
with probability π ∈ (0, 1) and is empty otherwise. We take G1 = · · · = GK = G, hence
N ∼ Binomial(G^K, π), and we consider, for increasing values of G, different values of K
and different ways that π relates to G. Specifically, we examine the following designs:

(a) K = 2, π = 20 G^{-1} ;   (b) K = 2, π = G^{-1/2} ;   (c) K = 5, π = G^{-K+3/2} .

Notice that the expected degree is constant in scenario (a), but it increases with G in
scenarios (b) and (c). Moreover, notice that by picking levels independently among all
the possible G^K combinations, we guarantee n_Q = O(N). Figure 1 reports the results,
plotting the number of flops required to compute L, versus p = KG+1 in a log-log scale.
In all three regimes, despite Q being sparse, the Cholesky factor L is dense, leading to
the worst-case scenario Cost(Chol) = O(p^3). Notably, the same results were obtained
regardless of the ordering strategy. In Figure 1, we also include the number of flops
required for sampling with CG algorithm, which we will analyze in Section 5.5.

Figure 1: Cost(Chol) (red dots) as a function of p (x-axis) in a log-log scale, for scenarios
(a)–(c) described above. The purple triangles represent the cost of the CG
sampler, which is discussed in Section 5.5. Q is obtained by setting τ = 1
and T = I p . The results are obtained by averaging over 30 independent
simulations.
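For reference, the missing-completely-at-random designs used in these experiments can be generated by sampling the observed cells of the G^K contingency table without replacement, as in the sketch below (feasible when G^K is moderate; names are illustrative).

```python
import numpy as np

def mcar_crossed_design(G, K, pi, rng):
    """Missing-completely-at-random crossed design: each of the G^K cells of the
    contingency table is observed independently with probability pi.
    Returns an (N, K) array of 0-based factor levels."""
    n_cells = G ** K
    N = rng.binomial(n_cells, pi)
    cells = rng.choice(n_cells, size=N, replace=False)
    return np.column_stack(np.unravel_index(cells, (G,) * K))

rng = np.random.default_rng(1)
levels = mcar_crossed_design(G=200, K=2, pi=20 / 200, rng=rng)   # scenario (a) with G = 200
```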

5. Spectral analysis and CG convergence
In this section, we provide a series of spectral results about the posterior precision matrix
for Model 1. Namely, we show that all but a fixed number of the eigenvalues of Q, if
appropriately preconditioned, concentrate in a small interval, as p → +∞. In particular,
Theorems 4 and 6 will show that the effective condition number remains bounded for
increasing p. This suggests that the number of CG iterations required to reach a desired
level of accuracy is independent of the size of the problem, leading to dimension-free
convergence of the algorithm, which is consistent with numerical experiments (see Section
5.5).

5.1. Notation
We denote with Diag(Q) the diagonal matrix whose diagonal coincides with that of Q and, for a given vector v, we denote with Diag(v) the diagonal matrix whose diagonal is v. Recalling that U = V^T V, we start by decomposing Q in (11) as Q = T + τ · Diag(U) + τ A, where A = U − Diag(U) can be seen as a weighted adjacency matrix, which counts the number of times that two levels θ_{k,g} and θ_{k′,g′} are observed together, see (12). We also define D = Diag(A 1_p), where 1_p ∈ R^p is a vector of ones, so that D is a diagonal matrix whose elements are the row-wise sums of A. Notice that, for crossed effects models, D = K Diag(U): indeed, U[θ_{k,g}, θ_{k,g}] counts the number of times that the level θ_{k,g} is observed, see (12); however, each time it is observed, it is also “connected” to K − 1 levels from the other factors and to the global parameter θ_0, hence the equality.
When we consider the sub-matrix relative to the random effects only, we use the notation Q^{(r)} ∈ R^{(p−1)×(p−1)}, which is obtained by removing the row and column relative to the global effect θ_0. We use the same notation for the adjacency matrix A^{(r)}. We let D^{(r)} = Diag(A^{(r)} 1_{p−1}) ∈ R^{(p−1)×(p−1)}, so that D^{(r)} = (K − 1) · Diag(U^{(r)}).
Denote with Q̄ the Jacobi-preconditioned precision matrix (Golub and van Loan, 2013, Sec. 11.5.3), i.e.

Q̄ = Diag(Q)^{−1/2} Q Diag(Q)^{−1/2} .   (14)

In Section 5.2, we expand on the importance of such preconditioning. We define Ā^{(r)} = (D^{(r)})^{−1/2} A^{(r)} (D^{(r)})^{−1/2}, as one would usually normalize the adjacency matrix of a graph.
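To make the notation concrete, a short sketch constructing Q̄ as in (14) and the normalized adjacency Ā^{(r)} from sparse inputs (the function names are ours).

```python
import numpy as np
import scipy.sparse as sp

def jacobi_scaled(Q):
    """Q_bar = Diag(Q)^{-1/2} Q Diag(Q)^{-1/2}, as in (14)."""
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(Q.diagonal()))
    return d_inv_sqrt @ Q @ d_inv_sqrt

def normalized_adjacency(U_r):
    """A_bar^(r) = (D^(r))^{-1/2} A^(r) (D^(r))^{-1/2} for the random-effects block,
    where A^(r) = U^(r) - Diag(U^(r)) and D^(r) collects its row sums."""
    A = U_r - sp.diags(U_r.diagonal())
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ A @ d_inv_sqrt
```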

5.2. Jacobi preconditioning


In this section, we briefly investigate the choice of Jacobi preconditioning. The general
idea of preconditioning (Golub and van Loan, 2013, Section 11.5) is to find a non-singular
and easy to factorize matrix M s.t. the preconditioned matrix M −1/2 QM −1/2 is better
conditioned than the original Q. A simple modification of the standard CG algorithm
allows working with the preconditioned matrix instead of Q (Golub and van Loan, 2013,
Algorithm 11.5.1) at the additional cost of computing x → M −1 x at each iteration.

For positive definite matrices, the Jacobi preconditioning is often a natural choice, since
x → M −1 x is very cheap to evaluate.
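For completeness, a minimal sketch of CG with Jacobi preconditioning, M = Diag(Q), using the relative-residual stopping rule later adopted in (21); this is a generic textbook implementation (cf. Golub and van Loan, 2013, Algorithm 11.5.1), not the paper's code.

```python
import numpy as np

def jacobi_pcg(Q, b, eps=1e-8, maxiter=1000):
    """Conjugate gradient with Jacobi (diagonal) preconditioning M = Diag(Q).
    Q may be any object supporting Q @ v and Q.diagonal() (e.g. a SciPy sparse matrix).
    Stops when ||Q theta - b|| < eps ||b||."""
    diag = np.asarray(Q.diagonal(), dtype=float)
    theta = np.zeros_like(b, dtype=float)
    r = b - Q @ theta
    z = r / diag                         # apply M^{-1}
    direction = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for it in range(maxiter):
        if np.linalg.norm(r) < eps * b_norm:
            return theta, it             # converged after `it` iterations
        Qd = Q @ direction
        alpha = rz / (direction @ Qd)
        theta = theta + alpha * direction
        r = r - alpha * Qd
        z = r / diag
        rz_new = r @ z
        direction = z + (rz_new / rz) * direction
        rz = rz_new
    return theta, maxiter
```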
For the specific case of Model 1, the Jacobi preconditioning allows rescaling Q so that
its eigenvalues lie within a bounded interval independently of the design (see Theorem 3
for more details). To illustrate this point, we consider the following setting. There are K factors of sizes G_1, . . . , G_K and, for a given N, we sample uniformly without replacement N out of the possible \prod_{k=1}^{K} G_k combinations of levels. This generates a specific random
design V . Notice that this is equivalent to the random design of Figure 1, but for general
Gk and for fixed N . In this case, simple computations show that the expected value of
the diagonal element Q[θk,g , θk,g ] is Tk + τ N/Gk . When looking at the spectrum of the
resulting Q, one can observe, for each factor k, a bulk of Gk eigenvalues centered around
the expected value Tk + τ N/Gk , which is a manifestation of spectral concentration for
high-dimensional random matrices. The left panel in Figure 2 shows this phenomenon in
the case of K = 3 factors. On the other hand, when applying Jacobi preconditioning to
Q, the bulks relative to different factors are grouped into a single one centered around 1
(see the right panel of Figure 2). These two different spectral behaviors are reflected in

Figure 2: The two histograms show the spectrum of Q (left) and Q̄ (right). They are
obtained by setting T = I p , τ = 1, G1 = 300, G2 = 1000, G3 = 2000.
The dashed lines on the left panel correspond to the values Tk + τ N/Gk , for
k = 1, 2, 3.

the number of iterations needed for convergence of CG. See for example Table 1, which
considers a sequence of designs with 2 factors of increasingly different sizes. There, the
effective condition number and the number of CG iterations increase as the difference
between factor sizes increases, while they remain constant when considering the Jacobi
preconditioned version Q̄. Theorems 4 and 6 provide theoretical results supporting the
empirical behavior of the spectrum of Q̄. Given the better performances, we will focus
our analysis on the spectrum of Q̄.
Note that, in the right panel of Figure 2, one can observe a cluster of 2 small eigenvalues
separated from the remaining part of the spectrum. This shows why we need to consider

effective condition numbers defined as in (8), and it will be made formal in Theorem 3.

                              No preconditioning        Jacobi preconditioning
G1      G2      N             κ3,p−2      n. iters      κ3,p−2      n. iters
100     100     2828          10.61       31            5.37        25
100     1000    36482         33.08       50            4.39        24
100     10000   1015037       166.91      64            3.33        19

Table 1: Effective condition number and number of CG iterations for Q and Q̄. The
design is the same as in Figure 1, with π = p−9/10 . Q is obtained with T = I p
and τ = 1. In the linear system Qθ = b, b has independent entries with uniform
distribution on (−0.5, 0.5).

Remark 1 (CG preconditioners). There is a vast literature that studies the choice of
the optimal CG preconditioner for different linear systems (Saad, 2003, Ch.10). In
our analysis, we have also explored other preconditioning techniques, such as so-called
incomplete Cholesky factorizations (see the supplementary material for a more detailed
discussion). Nevertheless, we observed results comparable to, or slightly worse than, the ones
obtained with Jacobi preconditioning. Nishimura and Suchard (2023) also proposed a
prior-preconditioned version of the conjugate gradient method. However, they focused on
Bayesian sparse regression with a large number of covariates, identifying the relevant
ones through shrinkage priors. Since we do not consider sparsity-inducing priors, the
diagonal elements of the prior term T are much smaller than those of the likelihood term
τ V T V . This is why, in our case, using T as a preconditioner does not improve much
the spectrum of Q, leading to the same issues observed in Table 1.

5.3. Outlying eigenvalues of Q̄


In this section, we identify a set of outlying eigenvalues of the matrix Q̄. Namely, The-
orem 3 shows that, independently of the data design, Q̄ always has K small eigenvalues
and a large one, close to K + 1.

Theorem 3. Let U = V T V be the likelihood term of Q in (11). Denote by Ū the


matrix Diag(U )−1/2 U Diag(U )−1/2 and let λ1 (Ū ) ≤ · · · ≤ λp (Ū ) be its eigenvalues.
Then:
λ1 (Ū ) = · · · = λK (Ū ) = 0 , λp (Ū ) = K + 1 .
The previous equalities extend to the eigenvalues µ̄_1 ≤ · · · ≤ µ̄_p of Q̄ as follows:

0 < µ̄_1 ≤ · · · ≤ µ̄_K ≤ \frac{\max_{k=0,...,K} T_k}{τ + \max_{k=0,...,K} T_k} ,   K + 1 ≤ µ̄_p ≤ K + 2 .

The eigenvalues of Theorem 3 correspond to those directions where the likelihood is


highly or poorly informative. This is a well-known behavior due to the additive structure
of the linear predictor ηi in (9). The predictor is very informative about the sum of

fixed and random effects, whose direction corresponds to the eigenvalue µ̄p . However, it
provides little information about their relative differences, whose directions correspond
to the eigenvalues λ1 (Ū ) = · · · = λK (Ū ). Notice also that the upper bounds on the
small eigenvalues µ̄1 , . . . µ̄K in Theorem 3 are quite conservative. In all the numerical
examples we have explored, the K smallest eigenvalues of Q̄ are very close to zero.

5.4. Connection with graphs and adjacency matrices


The purpose of this section is to draw a connection between the spectrum of Q̄ and the one of the adjacency matrix Ā^{(r)} of a K-partite graph. We will give sufficient conditions that guarantee that the eigenvalues of Q̄, except for those identified by Theorem 3, concentrate around 1 as p → +∞. The following corollary extends Theorem 3 to the matrix Ā^{(r)}.

Corollary 1. Let Ā^{(r)} = (D^{(r)})^{−1/2} A^{(r)} (D^{(r)})^{−1/2} ∈ R^{(p−1)×(p−1)} be as in Section 5.1, and denote with ν̄_1 ≤ · · · ≤ ν̄_{p−1} its eigenvalues. Then,

ν̄_1 = · · · = ν̄_{K−1} = −(K − 1)^{−1} ,   ν̄_{p−1} = 1 .   (15)


We now focus on the remaining part of the spectrum of Ā^{(r)}, which lies in the interval [ν̄_K, ν̄_{p−2}]. If this interval becomes smaller, we would also expect the spectrum of Q̄ to concentrate in a smaller interval, therefore improving the effective condition number. The following lemma quantifies this connection.

Lemma 1. Under Model 1, consider Q̄ ∈ R^{p×p} and Ā^{(r)} as above, with eigenvalues µ̄_1 ≤ · · · ≤ µ̄_p and ν̄_1 ≤ · · · ≤ ν̄_{p−1} respectively. Let ν̄_K ≤ 0 ≤ ν̄_{p−2}. Then,

κ_{K+1,p−2}(Q̄) ≤ \frac{1 + (K − 1) ν̄_{p−2}}{1 + (K − 1) ν̄_K} .   (16)
Lemma 1 provides an interesting connection between the effective condition number κ_{K+1,p−2}(Q̄) and the concentration of the eigenvalues of Ā^{(r)} around 0. This allows us to focus on the spectrum of Ā^{(r)} and to leverage results from random graph theory, in order to bound the condition number of Q̄ using (16).
Remark 2. The upper-bound in (16) is quite tight in the regimes that we consider.
When the bound becomes uninformative for ν̄K = −1/(K − 1), we numerically observed
that κK+1,p−2 (Q̄) is very large, in the order of several hundreds. Theoretically, if ν̄K =
−1/(K − 1), we can prove that κK+1,p−2 (Q̄) is lower bounded by τ (maxk Tk )−1 , hence
the effective condition number can be arbitrarily large depending on the value of maxk Tk .
Heuristically, if we want κK+1,p−2 (Q̄) to be small, we need ν̄K to be bounded away from
−1/(K − 1).
However, it is important to remark that the condition ν̄K > −1/(K − 1) is only
sufficient for the fast convergence of CG. Clearly, it can happen that ν̄_r > −1/(K − 1) for some small r > K, even if ν̄_K = −1/(K − 1). In such a case, we can extend Lemma
1 to bound κr+1,p−2 (Q̄), and CG would still converge fast.

In the following analysis, we will focus on κK+1,p−2 (Q̄) for mathematical convenience,
in order to obtain a sufficient condition for CG to converge fast. We will, though, also
study necessary conditions, as in Theorem 5.

In order for the upper bound in (16) to be effective, we would like ν̄_{p−2} to be small, or equivalently 1 − ν̄_{p−2} to be large; but mostly we need ν̄_K to be bounded away from −1/(K − 1).
The spectral gap 1 − ν̄_{p−2} is a well-studied quantity since it is directly related to the global connectivity of a graph. One may wonder whether global connectivity is also associated to the gap ν̄_K + 1/(K − 1). In Section 5.4.1, we will see that, in the simple case with two factors, this association is straightforward. Indeed, we will prove that a sufficient condition to bound the effective condition number is that the associated graph is made of a unique connected component. On the other hand, in Section 5.4.2, we will show that, when considering more than two factors, a different notion of connectivity is needed.

5.4.1. The case K = 2


In this section, we consider Model 1 with only two factors. In this case, Ā^{(r)} represents a bipartite graph and its spectrum has a very peculiar structure. Indeed, its eigenvalues are symmetric around zero and less than or equal to 1 in absolute value (Brito et al., 2022). With such a property, the link with the connectivity of a graph expressed in the previous section becomes straightforward. Indeed, 1 − ν̄_{p−2} = ν̄_2 + 1, which simplifies the bound in Lemma 1 as follows.

Corollary 2. Consider Model 1 with K = 2. Then

κ_{3,p−2}(Q̄) ≤ \frac{1 + ν̄_{p−2}}{1 − ν̄_{p−2}} .   (17)

Assume now that A^{(r)} is a proper adjacency matrix with binary entries in {0, 1}. In this case, one can show that ν̄_{p−2} < 1 if and only if the graph associated to the adjacency matrix Ā^{(r)} consists of a unique connected component (Godsil and Royle, 2001). More generally, the higher its connectivity, the larger the gap 1 − ν̄_{p−2} (Godsil and Royle, 2001, Ch. 13.5). With this in mind, Corollary 2 states that CG converges fast when the graph associated to Ā^{(r)} is well-connected.
Clearly, higher connectivity comes at the price of having a denser support for Q and
consequently a larger cost per iteration of CG. Ideally, CG would be most efficient when
the CI graph is well-connected while being very sparse. There is a vast literature in random graph theory that studies graphs with this property. For example, the well-known Alon-Boppana bound (Alon, 1986) gives an upper bound on the spectral gap of regular graphs. Graphs that attain this bound are called Ramanujan. Random regular graphs are also known for having great connectivity properties: indeed, they can get arbitrarily close to the Alon-Boppana bound, with high probability, as the number of vertices goes to infinity (Friedman, 2003). Brito et al. (2022) extend this result to the case of random biregular bipartite graphs, i.e. a bipartite graph sampled uniformly at
random among all the possible bipartite graphs, where the G1 vertices of the first part
have constant degree d_1, and the remaining G_2 vertices have degree d_2. This is why one would expect the CG algorithm to have a fast convergence rate under this random design.
The following result formalizes this intuition.

Theorem 4. Consider Model 1 with A^{(r)} being the adjacency matrix of a bipartite, biregular random graph. Then, for any sequence ϵ_p → 0 as p → ∞, it holds asymptotically almost surely that

κ_{3,p−2}(Q̄) ≤ \frac{1 + (1/\sqrt{d_1} + 1/\sqrt{d_2}) + ϵ_p}{1 − (1/\sqrt{d_1} + 1/\sqrt{d_2}) − ϵ_p} .   (18)
Theorem 4 shows that, even with relatively small degrees d1 and d2 , the effective
condition number of Q̄ becomes close to 1. An analogous, but weaker, result can also be stated for Erdős-Rényi random bipartite graphs and can be found in the supplementary material.
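Such biregular designs can be generated numerically by stub matching, as sketched below; this is a configuration-model approximation that may produce multi-edges (which simply become weights in A^{(r)}), rather than an exactly uniform draw over simple biregular bipartite graphs.

```python
import numpy as np
import scipy.sparse as sp

def biregular_bipartite(G1, d1, G2, d2, rng):
    """Random (d1, d2)-biregular bipartite (multi)graph via stub matching.
    Requires G1 * d1 == G2 * d2; duplicate pairings become edge weights."""
    assert G1 * d1 == G2 * d2
    left = np.repeat(np.arange(G1), d1)
    right = rng.permutation(np.repeat(np.arange(G2), d2))
    B = sp.csr_matrix((np.ones(len(left)), (left, right)), shape=(G1, G2))
    # adjacency of the bipartite graph on the G1 + G2 vertices
    return sp.bmat([[None, B], [B.T, None]]).tocsr()

rng = np.random.default_rng(2)
A_r = biregular_bipartite(G1=300, d1=20, G2=1000, d2=6, rng=rng)
```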

5.4.2. The case K > 2


When considering more than two factors, the symmetry of the spectrum of Ā^{(r)} around 0 is lost. One may wonder whether the direct link established by Corollary 2, between the connectivity of the graph associated to Ā^{(r)}, expressed by the spectral gap 1 − ν̄_{p−2}, and the good conditioning of the matrix Q̄, holds true also for K > 2. Unfortunately, this is not the case. Example 2 describes a 3-partite graph with excellent connectivity, whose associated precision matrix Q̄ has O(p) eigenvalues close to zero.

Example 2. Consider the case with K = 3 factors and G_1 = G_2 = G_3 = G. We consider N = G^2 observations, s.t. for each i = 1, . . . , N, the first two factors appear with the level g_1[i] = g_2[i] = ⌈i/G⌉, while the third factor has level g_3[i] = (i mod G) + 1.
An example of the resulting conditional independence graph is given in the supplementary
material. With such a design, the first and third factors, as well as the second and third, are fully connected; however, the first and second factors are very poorly connected. Indeed,
each level θ1,g of the first factor only connects to θ2,g of the second one, and the sub-graph
restricted to the first two factors has G disconnected components.
In this case, the spectral gap of Ā^{(r)} is 1 − ν̄_{p−2} = 0.5 and the CI graph has great connectivity, as each vertex is at most at distance 2 from any other one. Nonetheless, G + 1 eigenvalues of Ā^{(r)} are equal to the lower bound of −1/2 in (15). As a consequence,
CG would need to remove many small eigenvalues before getting to a faster rate of
convergence. For instance, if we take G = 200, T = I p (the identity matrix) and τ = 1,
we have κG+3,p−2 (Q̄) ≈ 2, but κG+2,p−2 (Q̄) ≈ 400.

The intuition behind Example 2 is made more general in the following theorem, which
shows that pairwise connectivity between all factors is a necessary condition for a well
conditioned precision matrix.

Theorem 5. Consider Model 1. Fix any permutation π of the first K integers. For each ℓ = 1, . . . , K − 1, let C_ℓ be the number of connected components of the sub-graph restricted to the factors (π(ℓ), π(ℓ + 1)). Then Ā^{(r)} has at least \sum_{ℓ=1}^{K−1} C_ℓ eigenvalues equal to −1/(K − 1).
Theorem 5 shows that, if we want ν̄_K to be bounded away from −1/(K − 1), one needs each pair of bipartite sub-graphs to have only one connected component, that is C_ℓ = 1 for each ℓ and π. This is clearly a stronger property than global connectivity of the graph of Ā^{(r)}, and in several cases it may not be satisfied. For instance, some datasets may have one factor which is nested into another one, i.e. when each level of the former factor can be observed only with one level of the latter one (see Section 6.2 for an example). In this case, the sub-graph associated to these two factors would present as many connected components as the number of levels in the first factor, and Ā^{(r)} will have as many eigenvalues equal to −1/(K − 1). This issue becomes even more relevant when considering interaction terms between factors. Consider two factors of sizes G_1 and G_2 respectively; the interaction term between them is given by a third factor of size G_1 · G_2 that considers all the possible pairwise combinations among the levels of the two factors. One can notice that, by construction, the interaction term is nested inside both of the original factors. Hence, the corresponding Ā^{(r)} will have at least G_1 + G_2 eigenvalues equal to −1/(K − 1). As we will see in Section 6.2, including nested factors and interaction terms adds to the spectrum of Q̄ several “problematic” eigenvalues close to zero, slowing down the convergence of the CG algorithm.
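For concreteness, a two-way interaction can be encoded as a single factor by combining the level indices of its parents, as in the short sketch below (0-based levels; names are ours). By construction, each interaction level co-occurs with exactly one level of each parent factor, which is precisely the nesting that produces the eigenvalues equal to −1/(K − 1) discussed above.

```python
import numpy as np

def interaction_levels(levels_k, levels_h, G_h):
    """Encode the two-way interaction of factors k and h as a single factor with
    G_k * G_h potential levels: the pair (g_k, g_h) is mapped to g_k * G_h + g_h."""
    return np.asarray(levels_k) * G_h + np.asarray(levels_h)
```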
Unfortunately, not even pairwise connectivity between all factors guarantees ν̄_K > −1/(K − 1); an example is given in the supplementary material. Indeed, a stronger notion of pairwise connectivity is needed to lower bound ν̄_K. The following result describes such a condition, and shows that the effective condition number κ_{K+1,p−2}(Q̄) is bounded independently of p.

Theorem 6 (Strong pairwise connectivity). Consider Model 1. For any pair of factors k ≠ h, let A^{(k,h)} ∈ R^{(G_k+G_h)×(G_k+G_h)} be the adjacency matrix of the bipartite graph restricted to the pair of factors (k, h). Denote with Ā^{(k,h)} the normalized version obtained as M^{−1/2} A^{(k,h)} M^{−1/2}, with M = Diag(A^{(k,h)} · 1). Assume that

λ^∗ = \sqrt{K − 1} · max { |λ| : λ ∈ σ(Ā^{(k,h)}), |λ| ≠ 1, k ≠ h } < 1 .   (19)

Then, the effective condition number is bounded by

κ_{K+1,p−2}(Q̄) ≤ \frac{1 + λ^∗}{1 − λ^∗} .   (20)
Condition (19) states that, if the second-largest eigenvalue of Ā^{(k,h)} is bounded away from 1/\sqrt{K − 1} for all pairs (k, h), then Q̄ has a bounded condition number after the removal of a few extreme eigenvalues. For example, if we consider a random K-partite d-regular graph, the condition 2/\sqrt{d} < 1/\sqrt{K − 1}, i.e. d > 4(K − 1), would guarantee (19) asymptotically almost surely, as p → ∞ (see Theorem 4).

5.5. Numerical experiments
The above bounds on the effective condition numbers of Q̄ suggest that CG samplers
are highly effective for large unstructured designs, which resemble random graphs. In
this section, we verify these conclusions through numerical simulations. Specifically, we
consider the same numerical experiment as in the end of Section 4 and compute the
cost of obtaining an approximate sample by solving (6) with CG, which is given by the
product of the cost per iteration and the number of iterations needed for convergence.
The former is of order O(n_Q), while the latter can be quantified as

N. iterations = inf { k ≥ 0 : ||Qθ_k − b||_2 < ϵ ||b||_2 } ,   (21)

where θ k is the approximate CG solution of the linear system Qθ = b at iteration k, and


Q is the precision matrix of the conditional distribution of θ. Monitoring the relative
norm of the residuals is a standard stopping criterion in libraries that implement the CG
algorithm. In this case, it also allows for a fair comparison between problems with
different sizes. In our numerical experiments, we will set the tolerance to ϵ = 10−8 .
We consider the three scenarios presented in Figure 1. Table 2 reports the number of
iterations defined in (21) in the three different designs, while the plots in Figure 1 show,
in purple, the total number of floating-point operations in logarithmic scale.
In the first design, we have N ∼ Binomial(G^2, 20/G), so that n_Q = O(p). Here the number of CG iterations is basically constant with p (see the first column of Table 2) and, as a result, the slope for CG in Figure 1 is roughly equal to 1. To be more precise, the small increase in the number of CG iterations is due to the fact that the number of disconnected components in Erdős-Rényi random graphs with a constant degree also increases as the graph size grows (Erdős et al., 1960). Instead, for designs (b) and (c) we have n_Q = O(p^{3/2}) and expected degree and connectivity of the graph that increase with p. This leads to an increasingly better effective condition number, which explains the decrease in the number of iterations observed in Table 2 (second and third columns) as well as why the slopes in Figure 1 are below 1.5 even though the cost per iteration is of order O(p^{3/2}).

                 N. iterations
p        Case (a)    Case (b)    Case (c)
100      17          22          45
205      18          21          47
435      19          20          48
910      19          21          43
1910     19          18          39
4000     19          17          34

Table 2: Number of CG iterations (21) for different values of p, in scenarios (a)-(c).

As these simulations illustrate, CG excels when it comes to large sparse unstructured


random designs. Indeed, for p in the order of 10^3, CG converges in less than 40 iterations in all three designs. Therefore, while this framework represents a worst-case scenario for sampling via Cholesky factorization, as its computation requires O(p^3) time, it is optimal for sampling via CG, since the number of iterations is independent of the dimension p.

6. Application to Generalized Linear Mixed Models


In this section, we discuss and illustrate the application of the conjugate gradient sampler
to the more general class of GLMMs. In particular, relative to the simplified framework
of Model 1, we allow for the inclusion of multivariate fixed effects, random slopes and
interaction terms, as well as non-Gaussian likelihoods with data augmentation. We nu-
merically explore the computational performances of the CG sampler when applied to
synthetic and real data in this general framework, considering applications to recommendation-system-type data and to survey data. This comparison will illustrate that the CG sampler is efficient in the former scenario, but loses efficiency in the latter one.

6.1. Model and algorithms


We consider GLMMs of the form

y_i | η_i ∼ f(y_i | η_i) ,   η_i = x_{i,0}^T θ_0 + \sum_{k=1}^{K} x_{i,k}^T θ_k ,   i = 1, . . . , N ,
θ_{k,g} ∼ N(0, T_k^{-1}) ,   θ_k = (θ_{k,1}^T, · · · , θ_{k,G_k}^T)^T ,   g = 1, . . . , G_k ;  k = 1, . . . , K .   (22)

Here θ 0 ∈ RD0 denotes the vector of fixed effects and xi,0 ∈ RD0 denotes the correspond-
ing covariates. For each factor k and level g, we consider the random effect θ k,g ∈ RDk ,
for g = 1, . . . , Gk , with corresponding covariates wi,k ∈ RDk . Up to now, we only con-
sidered random intercept models with Dk = 1 and wi,k = 1, while we now also consider
random slopes models with Dk > 1. As before, the vector z i,k ∈ {0, 1}Gk encodes the
level of factor k to which observation yi is assigned. With such notation, θ k has
covariates xi,k = z i,k ⊗ wi,k ∈ RGk Dk , where ⊗ denotes the Kronecker product.
In previous sections, we discussed the case of a Gaussian likelihood function with identity link, namely f(y_i | η_i) = N(η_i, τ^{-1}). We also consider a binomial likelihood with logistic link function, f(y_i | η_i) = Binomial(n_i, (1 + exp(−η_i))^{-1}), using the Polya-Gamma data augmentation (Polson et al., 2013) defined as

p(y_i, ω_i | η_i) = \frac{1}{2^{n_i}} \exp\left( [y_i − n_i/2] η_i − ω_i \frac{η_i^2}{2} \right) f_{PG}(ω_i | n_i, 0) ,

where f_{PG}(ω | b, c) denotes the density of a Polya-Gamma variable with parameters b and c. We assign improper flat priors to the fixed-effect parameters θ_0, and we assign Wishart priors W(α_k, Φ_k^{−1}) to the matrices T_k, which simplify to gamma distributions when D_k = 1.
Denote with T and m0 the prior precision and prior mean of θ respectively, with
κ = (y1 − n1 /2, . . . , yN − nN /2)T the vector of “centered” observations, and with Ω =

(ω1 , . . . , ωN )T the vector of PG latent variables. With such notation, simple Bayesian
computation leads to the following conditional posterior updates
p(θ | y, V, Ω, {T_k}_{k=1}^{K}) ∼ N_p( Q^{-1}(T m_0 + V^T κ), Q^{-1} ) ,   (23)
p(ω_i | y, V, θ) ∼ PG(n_i, v_i^T θ) , independently ∀ i = 1, . . . , N ,   (24)
p(T_k | y, V, θ) ∼ W( α_k + G_k , (Φ_k + \sum_{g=1}^{G_k} θ_{k,g} θ_{k,g}^T)^{-1} ) , independently ∀ k = 1, . . . , K ,   (25)

where Q = T + V^T Ω V. The resulting Gibbs sampler is described in the supplementary material. The computational bottleneck is the O(p^3) ≫ O(N) cost for exact sampling from (23), which we seek to reduce with CG samplers.
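Below is a sketch of the CG-based update of θ in (23) for the case of a diagonal prior precision T. The Pólya-Gamma and Wishart updates (24)-(25) are only indicated in comments: any sampler used for them (e.g. an external Pólya-Gamma library) is an assumption here and not part of the paper's code.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def gibbs_theta_update(V, T_diag, m0, kappa, omega, rng):
    """CG-based draw of theta from (23): theta ~ N(Q^{-1}(T m0 + V^T kappa), Q^{-1}),
    with Q = T + V^T Omega V, V a sparse design matrix and T = diag(T_diag)."""
    p = V.shape[1]
    Q = sp.diags(T_diag) + V.T @ sp.diags(omega) @ V
    b = T_diag * m0 + V.T @ kappa
    # z ~ N_p(0, Q) exploiting the structure of Q: z = T^{1/2} zeta + V^T Omega^{1/2} eta
    z = (np.sqrt(T_diag) * rng.standard_normal(p)
         + V.T @ (np.sqrt(omega) * rng.standard_normal(V.shape[0])))
    theta, info = cg(Q, b + z)
    return theta

# The remaining Gibbs steps would draw omega_i | ... ~ PG(n_i, v_i^T theta) with a
# Polya-Gamma sampler (an external library is assumed), and T_k | ... from the
# Wishart/gamma update in (25).
```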

6.2. Application to voter turnout and student evaluations


We explore the performances of the CG sampler when used to sample from the distri-
bution in (23) in high-dimensional contexts. We consider the following two data sets:

1. A survey data set on the 2004 American presidential election (Ghitza and Gelman, 2013;
Goplerud et al., 2024). It collects the vote (Democrat or Republican) of 79148
American citizens. The data set also contains information about their income (5
levels), ethnic group (4 levels), age (4 levels), region of the country (5 levels) and
state (51 levels).

2. Instructor Evaluations by Students at ETH (Papaspiliopoulos et al., 2019). A data


set with 73421 observations for the following variables: student id (2972 levels), a
factor denoting individual professors (1128 levels), student’s age (4 levels), the
department of the lecture (14 levels) and a factor describing when the rated lecture had taken place (6 levels). The response is the rating of lectures by the
students (discrete rating from poor to very good ), which will be modelled as yi ∼
Binom(4, ηi ), according to the notation of (22).

We monitor the number of iterations needed for CG to converge, defined as in (21). We


also explored other stopping criteria, such as ||θ k −θ||2 < ϵ1 ||θ||2 or ||θ k −θ||∞ < ϵ2 , but
decided not to include them in the analysis since they yielded similar results. We set the
relative accuracy threshold to 10−8 , which in our simulations led to practically negligible
error introduced by the CG solver. We also tested different accuracy levels (from 10−4
to 10−8 ), but opted for a more conservative one. Indeed, in our experiments, once CG reaches a higher convergence rate after the “removal” of the extreme eigenvalues, it often attains very low error with only a few additional iterations (see end of Section 2.2).
We investigate how the number of CG iterations changes depending on:

• Design complexity. So far, we have considered missing completely at random


designs, which have proven very beneficial for the convergence of the CG method
(see Section 5.5). Now, we aim to study how the number of CG iterations varies

when moving away from this favorable random design. Moreover, we also construct
two simulated datasets with the same number of factors and levels as in the voter
turnout and student evaluations data, but with a random design matrix simulated
as described in Section 5.5. We will compare the simulated datasets with the real
ones, in order to isolate the contribution of the complexity in the sparsity structure
of Q.

• Model complexity. We consider several variations of the model in (22), which


we report below in order of increasing size and complexity.

(a) Random intercept. Here we associate to each factor a one-dimensional random effect.
(b) Random intercept with nested factors. This scenario often arises when certain
levels of a given factor are divided into groups and the group label is also
included as another factor, e.g. students who are divided in different classes.
For the voter turnout, we consider the factor state, which is nested inside
region of the country. For the student evaluations, the factor denoting the
professor is nested inside department.
(c) Random slopes. We include interactions between a continuous predictor and
a factor, which “modifies the slope” of the predictor depending on the level
of the factor.
(d) Interactions. We add interaction terms to the original factors in each data
set, e.g. if we have two factors of sizes G1 and G2 respectively, the two-
way interaction term is a third factor with G1 G2 levels, which encodes the
information about the pairwise interactions of the first two factors. Notice
that each interaction term between two factors is, by construction, nested in
both factors. We will also include three-way interactions.

• Sample size. We can analyze the impact of sample size by considering, for each
data set, a subsample with a smaller size, such as N = 7000, and a larger one with
N = 70000. However, instead of subsampling uniformly at random, we enforce
each level of each factor to appear in at least one observation. This guarantees that we
have the same p, as long as we don’t include interaction terms, whose size doesn’t
depend only on which levels are observed, but also on how they interact.

• Factor size. The two data sets we consider differ significantly when it comes to
factor size. E.g. the student evaluations data set involves factors with thousands
of levels, compared to tens of levels for the voter turnout. We will see that factor
size plays a crucial role when considering the number of CG iterations.

Table 3 summarizes the results of our numerical study. Each row shows the number
of CG iterations for the different designs described above. We use the simple additive
model (a) as a benchmark and consider only one additional term at a time to isolate
the contribution of each component. E.g., if we include random slopes, we remove the

nested factor and interaction terms. The only exception is the three-way interactions,
as it would be unusual to consider them without the pairwise interactions. The last row
presents results for the full model, which includes all the effects mentioned. Each entry
in Table 3 reports the number of iterations for different sample sizes. Additionally, for
each data set, we include a column displaying the number of iterations obtained using
simulated data.
Case                   |       Voter Turnout           |     Students Evaluations
                       |  Real          Simulated      |  Real             Simulated
Random intercepts      |  30 (68)       25 (68)        |  26 (4101)        18 (4101)
                       |  36 (68)       24 (68)        |  35 (4101)        16 (4101)
Nested factor          |  42 (73)       28 (73)        |  63 (4115)        25 (4115)
                       |  54 (73)       27 (73)        |  94 (4115)        22 (4115)
Random slopes          |  48 (127)      38 (127)       |  84 (5229)        34 (5229)
                       |  68 (127)      40 (127)       |  150 (5229)       34 (5229)
2 way interactions     |  134 (763)     99 (787)       |  91 (19523)       31 (23585)
                       |  343 (786)     190 (787)      |  122 (89385)      35 (97897)
3 way interactions     |  166 (2952)    111 (3720)     |  98 (27588)       32 (31650)
                       |  442 (3569)    258 (3723)     |  128 (159633)     36 (168028)
Full                   |  189 (3016)    147 (3784)     |  122 (28730)      58 (32792)
                       |  532 (3633)    351 (3787)     |  266 (160775)     65 (169170)

Table 3: Average number of CG iterations. The average is obtained over 200 Gibbs
sampler iterations after an initial burn-in. The number in parentheses repre-
sents p. In each entry, we report results obtained with N = 7 000 (above) and N = 70 000 (below).

Based on the results in Table 3, we draw the following conclusions:


• In accordance with Theorem 5, we observe that introducing factors with poor pair-
wise connectivity results in an increase in CG iterations. This increase occurs when
we include nested factors (e.g., region for voter turnout and department for student
evaluations) or interaction terms, which introduce nesting by construction. For the
voter turnout case, the number of CG iterations increases significantly when multi-
way interactions are added, moving from 36 to 343 iterations by simply introducing
2 way interactions. For the student evaluations data, adding interaction terms ap-
pears less problematic relative to the problem size, although including interactions
between such large factors is usually uncommon. Analogous patterns are observed
in the simulated data sets.

• Table 3 shows, for the real data, an overall increase in the number of CG iterations
  as the sample size increases. This might seem counter-intuitive since, for larger
  N, the CI graph becomes denser and better connected. However, when examining
  the spectrum of Q̄ for the random-intercept-only case, we observe that, for small
  N, the interval [µ̄_{K+1}, µ̄_{p−2}] is larger but most of the eigenvalues therein are
  tightly concentrated around its center, close to 1. In contrast, for larger N, the interval
  [µ̄_{K+1}, µ̄_{p−2}] is smaller, consistently with the CI graph being better connected, but
  the spectrum is more diffuse. In other words, for smaller N, the effective condition
  number κ_{K+1,p−2}(Q̄) is worse, in accordance with the theoretical analysis of Section
  5. However, in this example, κ_{s+1,p−r} is better for larger s and r, resulting in an
  overall faster convergence rate.

• The comparison between the two data sets highlights the significant impact of
  factor size on the number of CG iterations. For the voter turnout data, whose
  factors are relatively small, CG can require several hundred iterations even for
  relatively small problems, with p of the order of a few thousand. Moreover, this
  issue is not specific to the details of the voter turnout design, as a similarly large
  number of iterations is also needed in the simulated case with a missing-completely-
  at-random design but the same factor sizes. On the other hand, in the student
  evaluations case, where some factors have numerous levels, CG proves to be very
  efficient, requiring only a few hundred iterations to solve significantly larger linear
  systems, with p of order 10^5. This difference becomes even more pronounced in the
  simulated case, where approximately 50 iterations are sufficient for convergence,
  even for the largest problems.

7. Comparison with existing works


GLMMs present significant computational challenges, especially in high-dimensional set-
tings. Efficient computation for these models has been extensively studied in scenarios
where the random effects have a nested dependence structure (Tan and Nott, 2013,
2018; Tan, 2020; Papaspiliopoulos and Zanella, 2017). Successful strategies have also
been proposed for more complex dependence structures, arising e.g. from time-series
models (Saul and Jordan, 1995; Salimans and Knowles, 2012; Tan and Nott, 2018; Fra-
zier et al., 2023) or spatial data analysis (Rue and Held, 2005; Katzfuss and Guinness,
2021; Schäfer et al., 2021). However, the complexity of the crossed dependence structure
hinders the applicability of these methodologies to GLMMs with more than one factor.
It is known that standard techniques become inefficient when applied to high-dimensional
crossed effects models (Gao and Owen, 2020; Papaspiliopoulos et al., 2019). For
this reason, various computational approaches have been explored to address this issue,
including backfitting with centering (Ghosh et al., 2022), method of moments (Gao and
Owen, 2020), collapsed Gibbs sampler (Papaspiliopoulos et al., 2019; Ceriani and Zanella,
2024; Papaspiliopoulos et al., 2023), and variational inference methods (Menictas et al.,
2023; Goplerud et al., 2024).
The CG method is widely recognized for its effectiveness in solving large, sparse linear
systems, and it has also been extensively applied to sampling from high-dimensional
Gaussian distributions (Papandreou and Yuille, 2010; Simpson et al., 2013; Vono et al.,
2022). However, there is a lack of theoretical studies analyzing the effectiveness of CG
for GLMMs. The recent work of Nishimura and Suchard (2023) provides a detailed study
of the use of CG in the context of large Bayesian sparse regression models. Differently
from their analysis, we consider the case where the design matrix V itself is sparse.

8. Discussion
We remark that the applicability of the methodologies presented in this article goes be-
yond Gibbs sampling. Standard inferential methodologies, such as restricted maximum
likelihood estimation, Laplace approximation and variational inference, require the com-
putation of the Cholesky factorization of a matrix as in (1), and would suffer from the
same limitations presented in Section 4. On the other hand, some of these methodolo-
gies can benefit from the use of CG and similar iterative methods. For example, in the
context of ML estimation, one could employ the method of moments proposed in Gao
and Owen (2020) to obtain consistent estimates of the variance components, and then
use conjugate gradient to compute the GLS estimator. Also, in the context of partially
factorized or unfactorized variational inference (Goplerud et al., 2024), one is required
to invert matrices with the same structure as (1), which can be done via CG. In all
these cases, the spectral analysis of Section 5 may be of interest to predict theoretical
performance. However, the applicability of the CG algorithm is hindered in those cases
where the determinant of large matrices is also needed, e.g., for the computation of
marginal quantities. Several strategies have been proposed to approximate determinants
of large matrices via the Lanczos algorithm (Ubaru et al., 2017), though they are computa-
tionally more expensive than CG, and it is harder to control the error induced by such
approximations.
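To fix ideas, the following minimal sketch illustrates this kind of usage: it assembles a
precision matrix with the structure of (1) for a synthetic sparse design and solves the
corresponding linear system with Jacobi-preconditioned CG from scipy. The matrices V, T
and the scalar tau below are synthetic placeholders, not the models or data analysed in
this paper.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

# Minimal sketch: solve (T + tau * V'V) theta = tau * V'y with Jacobi-preconditioned CG.
# V, y, T and tau are synthetic placeholders standing in for the GLMM design matrix,
# observations, prior precision and noise precision.
rng = np.random.default_rng(0)
N, p = 5000, 400
V = sp.random(N, p, density=0.01, format="csr", random_state=0)
y = rng.standard_normal(N)
T = sp.identity(p, format="csr")           # prior precision (diagonal)
tau = 1.0                                  # noise precision

Q = T + tau * (V.T @ V)                    # precision with the structure of (1), sparse
b = tau * (V.T @ y)

# Jacobi preconditioner: M = Diag(Q), applied as M^{-1} r = r / diag(Q)
d = Q.diagonal()
M_inv = LinearOperator((p, p), matvec=lambda r: r / d)

theta_hat, info = cg(Q, b, M=M_inv)
print("CG converged:", info == 0)
```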
This paper raises both theoretical and methodological questions. For example, it
would be interesting to provide an explicit result proving the cubic complexity of the
Cholesky factorization under generic (K-partite) random designs. As for the analysis
of the CG algorithm, Theorems 3 and 5 show how linear dependence among the columns
of the design matrix V relates to small eigenvalues of Q, which, in general, significantly
slows down the convergence of CG. A standard technique to address this issue involves
imposing linear constraints on the vector of parameters θ (Wood, 2017; Zanella and
Roberts, 2020). Perhaps surprisingly, an initial numerical exploration of such a procedure
suggests that it does not lead to a significant improvement in the convergence rate of the
CG algorithm and, in some cases, may even slow it down. We leave a more detailed
investigation to future work.

References
Alon, N. (1986). Eigenvalues and expanders. Combinatorica, 6(2):83–96.

Ashcroft, C. J. (2021). On the eigenvalues of Erdős-Rényi random bipartite graphs.

Brito, G., Dumitriu, I., and Harris, K. D. (2022). Spectral gap in random bipartite
biregular graphs and applications. Combinatorics, Probability and Computing.

Ceriani, P. M. and Zanella, G. (2024). Linear-cost unbiased posterior estimates
for crossed effects and matrix factorization models via couplings. preprint
arXiv:2410.08939.

Chung, F. R. (1997). Spectral graph theory, volume 92. American Mathematical Soc.

Concus, P., Golub, G. H., and O’Leary, D. P. (1976). A generalized conjugate gradient
method for the numerical solution of elliptic partial differential equations. In Sparse
matrix computations, pages 309–332. Elsevier.

Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publ. Math. Inst.
Hung. Acad. Sci., 5(1):17–60.

Frazier, D. T., Loaiza-Maya, R., and Martin, G. M. (2023). Variational Bayes in state
space models: Inferential and predictive accuracy. Journal of Computational and
Graphical Statistics, 32(3):793–804.

Friedman, J. (2003). Relative expanders or weakly relatively Ramanujan graphs. Duke
Mathematical Journal, 118(1).

Gao, K. and Owen, A. (2017). Efficient moment calculations for variance components
in large unbalanced crossed random effects models. Electronic Journal of Statistics,
11(1):1235–1296.

Gao, K. and Owen, A. B. (2020). Estimation and inference for very large linear mixed
effects models. Statistica Sinica, 30:1741–1771.

Gelman, A. and Hill, J. (2007). Data analysis using regression and multilevel/hierarchical
models, volume 3. Cambridge University Press New York, New York, USA.

Ghitza, Y. and Gelman, A. (2013). Deep interactions with MRP: Election turnout
and voting patterns among small electoral subgroups. American Journal of Political
Science, 57(3):762–776.

Ghosh, S., Hastie, T., and Owen, A. B. (2022). Backfitting for large scale crossed random
effects regressions. The Annals of Statistics, 50(1):560 – 583.

Godsil, C. and Royle, G. F. (2001). Algebraic graph theory, volume 207. Springer Science
& Business Media.

Golub, G. H. and van Loan, C. F. (2013). Matrix Computations. JHU Press, fourth
edition.

Goplerud, M., Papaspiliopoulos, O., and Zanella, G. (2024). Partially factorized varia-
tional inference for high-dimensional mixed models. Biometrika, in press.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press,
second edition.
Katzfuss, M. and Guinness, J. (2021). A general framework for Vecchia approximations
of Gaussian processes. Statistical Science, 36(1):124 – 141.
Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of
linear differential and integral operators. Journal of Research of the National Bureau
of Standards.
Lin, C.-J. and Moré, J. J. (1999). Incomplete Cholesky factorizations with limited
memory. SIAM Journal on Scientific Computing, 21(1):24–45.
Menictas, M., Di Credico, G., and Wand, M. P. (2023). Streamlined variational inference
for linear mixed models with crossed random effects. Journal of Computational and
Graphical Statistics, 32(1):99–115.
Nishimura, A. and Suchard, M. A. (2023). Prior-preconditioned conjugate gradient
method for accelerated Gibbs sampling in “large n, large p” Bayesian sparse regression.
Journal of the American Statistical Association, 118(544):2468–2481.
Papandreou, G. and Yuille, A. L. (2010). Gaussian sampling by local perturbations.
Advances in Neural Information Processing Systems, 23.
Papaspiliopoulos, O., Roberts, G. O., and Zanella, G. (2019). Scalable inference for
crossed random effects models. Biometrika, 107(1):25–40.
Papaspiliopoulos, O., Stumpf-Fétizon, T., and Zanella, G. (2023). Scalable Bayesian
computation for crossed and nested hierarchical models. Electronic Journal of Statis-
tics, 17(2):3575 – 3612.
Papaspiliopoulos, O. and Zanella, G. (2017). A note on MCMC for nested multilevel
regression models via belief propagation. Arxiv 1704.06064.
Parker, A. and Fox, C. (2012). Sampling Gaussian distributions in Krylov spaces with
conjugate gradients. SIAM Journal on Scientific Computing, 34(3):B312–B334.
Perry, P. O. (2016). Fast moment-based estimation for hierarchical models. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 79(1):267–291.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models
using Pólya–Gamma latent variables. Journal of the American Statistical Association,
108(504):1339–1349.
Rivin, I. (2002). Counting cycles and finite dimensional Lp norms. Advances in Applied
Mathematics, 29(4):647 – 662.
Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications.
Chapman & Hall.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
royal statistical society: Series b (statistical methodology), 71(2):319–392.

Saad, Y. (2003). Iterative Methods for Sparse Linear Systems. Other Titles in Applied
Mathematics. SIAM, second edition.

Salimans, T. and Knowles, D. (2012). Fixed-form variational posterior approximation
through stochastic linear regression. Bayesian Analysis, 8.

Saul, L. and Jordan, M. (1995). Exploiting tractable substructures in intractable net-
works. In Advances in Neural Information Processing Systems, volume 8. MIT Press.

Schäfer, F., Katzfuss, M., and Owhadi, H. (2021). Sparse Cholesky factorization by
Kullback–Leibler minimization. SIAM Journal on Scientific Computing, 43(3):A2019–
A2046.

Simpson, D. P., Turner, I. W., Strickland, C. M., and Pettitt, A. N. (2013). Scalable
iterative methods for sampling from massive Gaussian random vectors. preprint at
arXiv:1312.1476.

Tan, L. S. L. (2020). Use of model reparametrization to improve variational Bayes.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(1):30–57.

Tan, L. S. L. and Nott, D. J. (2013). Variational inference for generalized linear mixed
models using partially noncentered parametrizations. Statistical Science, 28(2):168–
188.

Tan, L. S. L. and Nott, D. J. (2018). Gaussian variational approximation with sparse
precision matrices. Statistics and Computing, 28:259–275.

Trefethen, L. and Bau, D. (1997). Numerical Linear Algebra. SIAM.

Ubaru, S., Chen, J., and Saad, Y. (2017). Fast estimation of tr(f(A)) via stochastic
Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications.

Van der Sluis, A. and Van der Vorst, H. A. (1986). The rate of convergence of conjugate
gradients. Numerische Mathematik, 48:543–560.

van der Vorst, H. A. (2003). Iterative Krylov methods for large linear systems. Cambridge
University Press, Cambridge; New York.

Vono, M., Dobigeon, N., and Chainais, P. (2022). High-dimensional Gaussian sampling:
A review and a unifying approach based on a stochastic proximal point algorithm.
SIAM Review, 64(1):3–56.

Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press.

Zanella, G. and Roberts, G. (2020). Multilevel linear models, Gibbs samplers and multi-
grid decompositions. Bayesian Analysis.

A. Proofs
Proof of Theorem 1. The number of operations required by the recursive equation in (2)
is
$$\mathrm{Cost(Chol)} = \sum_{m=1}^{p} \left[\, 1 + n_{L,m} + n_{L,m}\left(1 + n_{L,m}\right) \right]\,, \qquad (26)$$
see Theorem 2.2 of https://www.tau.ac.il/~stoledo/Support/chapter-direct.pdf
for a step-by-step derivation of (26). In both cases, we have
$\mathrm{Cost(Chol)} = O\!\left(\sum_{m=1}^{p} n_{L,m}^{2}\right)$ as stated in (5). Then, using Jensen's
inequality and $\sum_{m=1}^{p} n_{L,m} = n_L$, we have
$$\sum_{m=1}^{p} n_{L,m}^{2} = p \left( p^{-1} \sum_{m=1}^{p} n_{L,m}^{2} \right) \ge p \left( p^{-1} \sum_{m=1}^{p} n_{L,m} \right)^{\!2} = n_L^{2}/p \,,$$
which proves the lower bound in (5).


The upper bound in (5), instead, cannot be deduced purely from
$\mathrm{Cost(Chol)} = O\!\left(\sum_{m=1}^{p} n_{L,m}^{2}\right)$, but rather it follows from a characterization of
Cost(Chol) in terms of 3-cycles of an undirected graph associated to L. More precisely, define
$G_L$ as the undirected graph with nodes {1, . . . , p} and an edge between vertices j > m if and
only if the future set of θ_m does not separate it from θ_j in $G_Q$. For any j, m ∈ {1, . . . , p}
with j ≠ m, we write {j, m} ∈ $G_L$ if there is an edge between j and m in $G_L$. By the
arguments in Section 2.1, $L_{j\ell}$ is (a potential) non-zero if and only if {j, ℓ} ∈ $G_L$. The
dominating cost of the recursion (2) to obtain L is the computation of $\sum_{\ell=1}^{m-1} L_{m\ell} L_{j\ell}$ for
m = 2, . . . , p and j = m + 1, . . . , p. Ignoring multiplications by zero and summations of
zeros, this corresponds to
$$O\left(\,\left|\left\{(\ell, m, j) : \{m, \ell\} \in G_L,\ \{j, \ell\} \in G_L \text{ and } 1 \le \ell < m < j \le p\right\}\right|\,\right) \qquad (27)$$
operations. By definition of $G_L$, if {ℓ, m} ∈ $G_L$ and {ℓ, j} ∈ $G_L$ for ℓ < m < j, then also
{m, j} ∈ $G_L$. Thus, the cost in (27) coincides with the number of 3-cycles in $G_L$. The
upper bound in (5) then follows, noting that the number of 3-cycles in an undirected
graph with $n_L$ edges is less than or equal to $n_L^{1.5}$, see e.g. (Rivin, 2002, Theorem 4).

Proof of Proposition 1. Let θ ∈ R^p be an arbitrary permutation of (θ_0, θ_1, . . . , θ_K) and
denote by ℓ the position of θ_0 in θ. We show that, whenever ℓ < p, switching the
positions of θ_0 and the variable immediately after it in θ does not increase n_L. Such
switching of positions does not change the future set of θ_m for any m ∉ {ℓ, ℓ + 1} and
thus leaves n_{L,m} unchanged, by its definition in (4). Moreover, since θ_0 is connected
to all other variables in $G_Q$, it follows that n_{L,m} equals the maximum value p − m + 1
for all θ_m located after θ_0 in θ and for θ_0 itself. Thus, when θ_0 is in position ℓ, both n_{L,ℓ}
and n_{L,ℓ+1} take their maximal values, and moving θ_0 to position ℓ + 1 cannot increase
the value of (n_{L,ℓ} + n_{L,ℓ+1}).

Proof of Proposition 2. Denote with Q[θ_{k,g}; θ_{k′,g′}] the entry of Q relative to the pair
(θ_{k,g}, θ_{k′,g′}), and recall that, for k ≠ k′, its value coincides with the number of observa-
tions i = 1, . . . , N such that z_{i,k,g} = 1 and z_{i,k′,g′} = 1. Finally, θ_{k,g} is adjacent to θ_{k′,g′}
in $G_Q$ if and only if Q[θ_{k,g}; θ_{k′,g′}] > 0.
Define the function r : {1, . . . , G} → {1, . . . , G} as r(1) = 1 and r(j) = ⌈(j − 1)/d⌉
for j = 2, . . . , G, so that Q[θ_{1,r(j)}; θ_{2,j}] = 1 for all j ≥ 1. Then, for every pair j and
m such that r(j) ≤ m ≤ j, we now show that the future set of θ_{2,m} does not separate it
from θ_{2,j} in $G_Q$. We construct a path in $G_Q$ between θ_{2,m} and θ_{2,j} that goes through
θ_{2,1} as follows. Since both Q[θ_{1,j}; θ_{2,j}] and Q[θ_{1,r(j)}; θ_{2,j}] are positive for all j ≥ 1,
the path going from θ_{2,j} to θ_{1,r(j)} to θ_{2,r(j)} to θ_{1,r(r(j))} to θ_{2,r(r(j))} etc. is supported on
$G_Q$. Also, since r(ℓ) < ℓ for all ℓ ≥ 2 and r(1) = 1, the above path eventually reaches
θ_{2,1}. The same strategy can be applied to construct a path in $G_Q$ from θ_{2,m} to θ_{2,1}.
Joining the two above paths at θ_{2,1}, we obtain a path from θ_{2,m} to θ_{2,j} in $G_Q$. The
assumption r(j) ≤ m ≤ j, together with r(ℓ) ≤ ℓ for all ℓ, ensures that such a path does
not involve elements in the future set of θ_{2,m} apart from θ_{2,j}. Thus, by (4), L[θ_{2,j}; θ_{2,m}]
is a potential non-zero whenever r(j) ≤ m ≤ j, meaning that the row of L corresponding
to θ_{2,j} contains at least j − r(j) + 1 potential non-zeros. Summing over j we obtain
$$n_L \ge \sum_{j=2}^{G} \left(j - r(j) + 1\right) \ge \sum_{j=2}^{G} \left(j - \frac{j-1}{d}\right) \ge \frac{d-1}{d}\sum_{j=2}^{G} j = \frac{d-1}{d}\left(\frac{G(G+1)}{2} - 1\right)\,.$$
Thus, n_L is of order at least G², which also implies that Cost(Chol) is of order at least
G³, by the lower bound in Theorem 1. The statements about n_L and Cost(Chol) then
follow from the above inequalities and p = 1 + 2G.

Proof of Theorem 3. We start by proving the results relative to the likelihood term
$U = V^T V = \sum_{i=1}^{N} v_i v_i^T$.
Proof of part 1. First recall, from Section 3, that $v_i = (1, z_{i,1}^T, \dots, z_{i,K}^T)^T \in \mathbb{R}^p$, where the
$z_{i,k}$ have only one entry with unitary value and the remaining ones are zero. Consider
the set of vectors W = {w_k ; k = 1, . . . , K}, where each $w_k = (-1, 0_{G_1}^T, \dots, 1_{G_k}^T, \dots, 0_{G_K}^T)^T$,
and the vector of ones $1_{G_k}$ is in the position relative to the k-th factor. Then, for all
i = 1, . . . , N and k = 1, . . . , K, we have $v_i^T w_k = 0$. Thus, $U w_k = 0_p$ for each k. Since
the vectors in W are linearly independent, we can deduce that dim(Null(U)) ≥ K. Also,
since U is positive semidefinite, at least the first K eigenvalues must be equal to zero.
The same holds for Ū, since the matrix Diag(U) is positive definite by hypothesis
(U_{ii} ≥ 1 for each i). We have thus shown λ_1(Ū) = · · · = λ_K(Ū) = 0, as desired.
Proof of part 2. Recall that U = Diag(U) + A, where A can be interpreted as a
weighted adjacency matrix representing a (K + 1)-partite graph (we can think of the
global effect θ_0 as an additional part with one single vertex). We define Ā as the adja-
cency matrix A normalized by the row-wise sum of its elements, i.e. $\bar A = D^{-1/2} A D^{-1/2}$.
As we showed in Section 5, it holds that
$$D_{ii} = \sum_{j=1}^{p} A_{ij} = K U_{ii}\,.$$
It follows that Ū and Ā satisfy $\bar U = I_p + K \bar A$ and, in particular, they share the same
eigenvectors. Finally, recall that the largest eigenvalue of any normalized adjacency
matrix Ā is λ_p(Ā) = 1 (Chung, 1997). As a consequence, the largest eigenvalue of
Ū is λ_p(Ū) = 1 + K λ_p(Ā) = K + 1, as desired.
Proof of the bounds on µ̄_1, . . . , µ̄_p. Finally, we translate the spectral results on Ū into
bounds on the eigenvalues of Q̄. Notice that Q̄ = E + Ū, where E is a diagonal matrix
with elements $E_{ii} = T_{ii}/(T_{ii} + \tau U_{ii})$. Since U_{ii} ≥ 1, we can thus bound the eigenvalues
of E as
$$\lambda_p(E) = \max_i E_{ii} \le \max_i \frac{T_{ii}}{T_{ii} + \tau} \le \frac{\bar T}{\tau + \bar T}\,,$$
where $\bar T = \max_{k=0,\dots,K} T_k$. We can then obtain the inequalities of Theorem 3 by combining
the above inequality with Weyl's inequality (Horn and Johnson, 2012, Theorem 4.3.1).
Indeed, we have
$$0 < \bar\mu_K \le \underbrace{\lambda_K(\bar U)}_{=0} + \lambda_p(E) \le \frac{\bar T}{\tau + \bar T}\,,$$
as well as
$$\bar\mu_p \le \lambda_p(\bar U) + \underbrace{\lambda_p(E)}_{\le 1} \le K + 2$$
and
$$\bar\mu_p \ge \lambda_p(\bar U) + \underbrace{\lambda_1(E)}_{\ge 0} \ge K + 1\,.$$

Proof of Corollary 1. We begin by defining $\bar U^{(r)} = \mathrm{Diag}(U^{(r)})^{-1/2}\, U^{(r)}\, \mathrm{Diag}(U^{(r)})^{-1/2}$.
Theorem 3 applied to such a matrix guarantees that its spectrum is concentrated in the
interval [0, K] and, more specifically:
$$\lambda_1(\bar U^{(r)}) = \dots = \lambda_{K-1}(\bar U^{(r)}) = 0\,; \qquad \lambda_{p-1}(\bar U^{(r)}) = K\,. \qquad (28)$$
Recalling from Section 5.1 that $D^{(r)} = (K-1)\,\mathrm{Diag}(U^{(r)})$, we have
$$\bar U^{(r)} = I_{p-1} + (K-1)\,\bar A^{(r)}\,.$$
This implies that, for each eigenvalue λ of $\bar U^{(r)}$, (λ − 1)/(K − 1) is an eigenvalue of $\bar A^{(r)}$,
from which (15) follows.
Proof of Lemma 1. Since $\bar Q^{(r)}$ is obtained from Q̄ by removing the row and column rela-
tive to θ_0, we can apply the Cauchy interlacing theorem (Horn and Johnson, 2012, Theorem
4.3.17), which implies
$$\bar\mu_1 \le \lambda_1(\bar Q^{(r)}) \le \bar\mu_2 \le \dots \le \bar\mu_{p-1} \le \lambda_{p-1}(\bar Q^{(r)}) \le \bar\mu_p\,. \qquad (29)$$
Note that, with the notation introduced in Section 5.1, it holds that
$$\bar Q^{(r)} = I_{p-1} + (C^{(r)})^{1/2}\, \bar A^{(r)}\, (C^{(r)})^{1/2}\,,$$
where $C^{(r)}$ is the diagonal matrix with elements
$$C^{(r)}_{ii} = \frac{\tau D^{(r)}_{ii}}{T^{(r)}_{ii} + \tau U^{(r)}_{ii}} = \frac{\tau (K-1)\, U^{(r)}_{ii}}{T^{(r)}_{ii} + \tau U^{(r)}_{ii}}\,.$$
Since $U^{(r)}_{ii} > 0$ by assumption and $T^{(r)}_{ii} \ge 0$, we also have $0 < C^{(r)}_{ii} \le K - 1$. Thus, $C^{(r)}$
is a positive definite diagonal matrix with largest diagonal element $\gamma = \max_i C^{(r)}_{ii} \le K - 1$.
Now, if $\bar\nu_{p-r} \ge 0$, then also $\lambda_{p-r}\big((C^{(r)})^{1/2} \bar A^{(r)} (C^{(r)})^{1/2}\big) \ge 0$, and it can be upper-
bounded by $\gamma\,\bar\nu_{p-r}$. An analogous lower bound can be obtained for all those $\bar\nu_s \le 0$. To
conclude, we have
$$\kappa_{s+1,p-r}(\bar Q) = \frac{\bar\mu_{p-r}}{\bar\mu_{s+1}} \le \frac{\lambda_{p-r}(\bar Q^{(r)})}{\lambda_s(\bar Q^{(r)})} = \frac{1 + \lambda_{p-r}\big((C^{(r)})^{1/2} \bar A^{(r)} (C^{(r)})^{1/2}\big)}{1 + \lambda_s\big((C^{(r)})^{1/2} \bar A^{(r)} (C^{(r)})^{1/2}\big)} \le \frac{1 + (K-1)\bar\nu_{p-r}}{1 + (K-1)\bar\nu_s}\,.$$

Proof of Corollary 2. We just need to prove that the spectrum of $\bar A^{(r)}$ is symmetric
around zero. This, together with Lemma 1, gives the result. Recall that the matrix $\bar A^{(r)}$
can be represented as
$$\bar A^{(r)} = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix},$$
with $B \in \mathbb{R}^{G_1 \times G_2}$. Consider w.l.o.g. any positive eigenvalue λ > 0 of $\bar A^{(r)}$. Let x be one
of its eigenvectors and split it into $x = [x_1, x_2]$, where $x_1 \in \mathbb{R}^{G_1}$ and $x_2 \in \mathbb{R}^{G_2}$. Then
$x^* = [x_1, -x_2]$ is an eigenvector relative to −λ. Indeed:
$$\bar A^{(r)} x^* = \begin{pmatrix} -B x_2 \\ B^T x_1 \end{pmatrix} = \begin{pmatrix} -\lambda x_1 \\ \lambda x_2 \end{pmatrix} = -\lambda x^*\,.$$
Since this operation also preserves the multiplicity of the eigenvalues, the result holds
true.

Proof of Theorem 4. By Theorem 3.2 of Brito et al. (2022), the adjacency matrix $A^{(r)}$
satisfies
$$\lambda_{p-2}(A^{(r)}) \le \sqrt{d_1 - 1} + \sqrt{d_2 - 1} + \epsilon_p\,, \qquad \lambda_2(A^{(r)}) \ge -\left(\sqrt{d_1 - 1} + \sqrt{d_2 - 1}\right) - \epsilon_p\,,$$
asymptotically almost surely, with $\epsilon_p \to 0$ as $p \to \infty$. For a bipartite biregular graph it
holds that $\bar A^{(r)} = A^{(r)}/\sqrt{d_1 d_2}$, hence the bounds above extend to the eigenvalues $\bar\nu_2$
and $\bar\nu_{p-2}$ of $\bar A^{(r)}$ as follows:
$$\bar\nu_{p-2} \le \frac{1}{\sqrt{d_1}} + \frac{1}{\sqrt{d_2}} + \epsilon_p\,, \qquad \bar\nu_2 \ge -\frac{1}{\sqrt{d_1}} - \frac{1}{\sqrt{d_2}} - \epsilon_p\,.$$
Finally, we apply Lemma 1 with s = r = 2 so that, when K = 2, equation (16) becomes
$$\kappa_{3,p-2} \le \frac{1 + \bar\nu_{p-2}}{1 + \bar\nu_2}\,.$$
Combining the above bounds we obtain the desired statement.

Proof of Theorem 5. Notice that
$$\bar U^{(r)} = I_{p-1} + (K-1)\cdot \bar A^{(r)}\,.$$
Then, the eigenspace relative to −1/(K − 1) of $\bar A^{(r)}$ coincides with the null space of
$\bar U^{(r)}$. For simplicity, we will refer to the latter. Notice also that $\dim(\mathrm{Null}(\bar U^{(r)})) =
\dim(\mathrm{Null}(U^{(r)}))$, since $\bar U^{(r)}$ is the product of $U^{(r)}$ and full-rank matrices. In the re-
maining part of the proof, we will find $\sum_{\ell=1}^{K-1} C_\ell$ linearly independent eigenvectors in the
null space of $U^{(r)}$, hence proving the result. We split the proof in two parts.
1. Construction of $\{x_m^{(\ell)},\ \ell = 1, \dots, K-1,\ m = 1, \dots, C_\ell\} \subset \mathrm{Null}(U^{(r)})$. Consider
w.l.o.g. the trivial permutation π equal to the identity. For each observation i and factor
k, denote with $g_k[i]$ the unique non-null entry of the one-hot vector $z_{i,k}$. Then, for each
$x \in \mathbb{R}^{p-1}$, $(V^{(r)} x)_i = \sum_{k=1}^{K} x_{g_k[i]}$. Thus
$$x^T U^{(r)} x = \|V^{(r)} x\|^2 = \sum_{i=1}^{N} \left( \sum_{k=1}^{K} x_{g_k[i]} \right)^{\!2} = 0$$
happens if and only if $\sum_{k=1}^{K} x_{g_k[i]} = 0$ for each i = 1, . . . , N. Hence, we have the
equivalence
$$x \in \mathrm{Null}(U^{(r)}) \iff \sum_{k=1}^{K} x_{g_k[i]} = 0, \quad \forall\, i = 1, \dots, N\,. \qquad (30)$$

For any ℓ = 1, . . . , K − 1, denote with $P_1^{(\ell)}, \dots, P_{C_\ell}^{(\ell)}$ and $Q_1^{(\ell)}, \dots, Q_{C_\ell}^{(\ell)}$ the disconnected
components in the bipartite graph restricted to the pair (ℓ, ℓ + 1). In particular, the
$P_m^{(\ell)}$'s are disjoint subsets of the levels in factor ℓ, and the same goes for the $Q_m^{(\ell)}$'s. For
any ℓ = 1, . . . , K − 1 and m = 1, . . . , $C_\ell$, we define the vector $x_m^{(\ell)} \in \mathrm{Null}(U^{(r)})$ as
$$x_{m,j}^{(\ell)} = \begin{cases} 1, & \text{if } j \text{ belongs to } P_m^{(\ell)}, \\ -1, & \text{if } j \text{ belongs to } Q_m^{(\ell)}, \\ 0, & \text{otherwise}\,. \end{cases}$$
Note that $x_m^{(\ell)}$ satisfies (30) because, if an observation i does not involve $P_m^{(\ell)}$, then
$x_{m,g_k[i]}^{(\ell)} = 0$ for each k. On the other hand, if $x_{m,g_\ell[i]}^{(\ell)} = 1$, then $x_{m,g_{\ell+1}[i]}^{(\ell)} = -1$ (by
definition of connected components, an edge from $P_m^{(\ell)}$ can only connect to $Q_m^{(\ell)}$), while
$x_{m,g_k[i]}^{(\ell)} = 0$ for each k ≠ ℓ, ℓ + 1 (by construction).
2. The $\{x_m^{(\ell)}\}_{\ell,m}$ are linearly independent. Consider the linear combination
$$\sum_{\ell=1}^{K-1} \sum_{m=1}^{C_\ell} \beta_m^{(\ell)} x_m^{(\ell)} = 0\,.$$
If we consider the first $G_1$ components (i.e. we restrict to the levels of the first factor),
only $\{x_m^{(1)}\}_{m=1}^{C_1}$ have non-zero entries on such components. Moreover, since they are
supported on the disjoint sets $(P_m^{(1)})_{m=1}^{C_1}$, it follows that $\beta_1^{(1)} = \dots = \beta_{C_1}^{(1)} = 0$. We
now consider the $G_2$ components relative to the second factor. Excluding the previous
eigenvectors, only $\{x_m^{(2)}\}_{m=1}^{C_2}$ have non-zero entries on such components. Analogously to
the previous case, we can conclude $\beta_1^{(2)} = \dots = \beta_{C_2}^{(2)} = 0$. Iterating this procedure
proves that $\beta_m^{(\ell)} = 0$ for each ℓ, m.
Proof of Theorem 6. Consider the matrix $\bar A^{(r)} \in \mathbb{R}^{(p-1)\times(p-1)}$; Theorem 3 applied to it im-
plies that $\bar\nu_1 = \dots = \bar\nu_{K-1} = -1/(K-1)$ and $\bar\nu_{p-1} = 1$. In particular, the invari-
ant subspace associated to these eigenvalues is the K-dimensional space $W = \{x \in
\mathbb{R}^{p-1} : x \text{ is constant factor-wise}\} = \mathrm{Span}(w_1, \dots, w_K)$, where $w_k = (0_{G_1}^T, \dots, 1_{G_k}^T, \dots, 0_{G_K}^T)^T$.
We denote its orthogonal complement by $\delta W = W^\perp$. With this notation, Theorem 6 becomes a
straightforward consequence of Lemmas 1 and 2 below.

Lemma 2. Under assumption (19), it holds that
$$\max(|\bar\nu_K|,\, \bar\nu_{p-2}) = \max_{x \in \delta W} \frac{\|\bar A^{(r)} x\|}{\|x\|} < \frac{\lambda^*}{K-1}\,, \qquad (31)$$
where $\|x\|$ denotes the standard Euclidean norm.
Proof. Denote for simplicity $\eta = \lambda^*/\sqrt{K-1}$. Denote instead with $\bar A^{(r)}[k,h] \in \mathbb{R}^{G_k \times G_h}$
the block of $\bar A^{(r)}$ relative to the pair of factors (k, h). It holds that
$$\bar A^{(k,h)} = (K-1) \cdot \begin{pmatrix} 0 & \bar A^{(r)}[k,h] \\ \bar A^{(r)}[h,k] & 0 \end{pmatrix},$$
where the multiplicative factor K − 1 comes from the fact that $\bar A^{(r)}$ is made of K-cliques,
hence each level is connected, through each observation, to K − 1 levels (one in each of the
other factors), whereas, when restricting to the pair (k, h), it is connected to only one level.
Moreover, by properties of bipartite adjacency matrices (Brito et al., 2022), it holds that
$$\max_{\lambda \in \sigma(\bar A^{(k,h)})\, :\, |\lambda| \ne 1} |\lambda| = (K-1) \max_{y \in \mathbb{R}^{G_h}\, :\, y^T 1_{G_h} = 0} \frac{\|\bar A^{(r)}[k,h]\, y\|}{\|y\|}\,.$$
Hence, (19) is equivalent to
$$\max_{y \in \mathbb{R}^{G_h}\, :\, y^T 1_{G_h} = 0} \frac{\|\bar A^{(r)}[k,h]\, y\|}{\|y\|} < \frac{\eta}{K-1}\,.$$
For a given vector $x \in \delta W$, denote with $x[k] \in \mathbb{R}^{G_k}$ the vector restricted to factor
k; by definition of δW, it holds that $x[k]^T 1_{G_k} = 0$. Finally, taking $x \in \delta W$, we have
$$\|\bar A^{(r)} x\|^2 = \sum_{k=1}^{K} \Big\| \sum_{h \ne k} \bar A^{(r)}[k,h]\, x[h] \Big\|^2 \le \sum_{k=1}^{K} \sum_{h \ne k} \big\| \bar A^{(r)}[k,h]\, x[h] \big\|^2 \le \frac{\eta^2}{(K-1)^2} \underbrace{\sum_{k=1}^{K} \sum_{h \ne k} \|x[h]\|^2}_{=\, (K-1)\|x\|^2} = \frac{\eta^2}{K-1}\, \|x\|^2 = \frac{(\lambda^*)^2 \|x\|^2}{(K-1)^2}\,.$$

B. Supplementary material for Section 5


B.1. Including multivariate fixed effects
Lemma 1 provides an upper bound on $\kappa_{K+1,p-2}(\bar Q)$ by interlacing the spectrum of Q̄
with that of $\bar Q^{(r)}$ (see the proof of the lemma). A similar result can also be obtained for
$\theta_0 \in \mathbb{R}^{D_0}$ with $D_0 > 1$, i.e. when we include multivariate fixed effects. In that case, the
upper bound becomes
$$\kappa_{K+D_0,\,p-D_0-1}(\bar Q) \le \frac{1 + (K-1)\,\bar\nu_{p-D_0-1}}{1 + (K-1)\,\bar\nu_K}\,,$$
where $\bar\nu_K$ and $\bar\nu_{p-D_0-1}$ are, respectively, the K-th smallest and the second-largest eigen-
values of $\bar A^{(r)} \in \mathbb{R}^{(p-D_0)\times(p-D_0)}$. Provided that $\theta_0$ is low-dimensional, the above inequal-
ity gives an upper bound on the effective condition number after the removal of a few
extreme eigenvalues.

B.2. Conjugate Gradient with Incomplete Cholesky preconditioner


In this section, we explore alternative preconditioning techniques to lower the number of
CG iterations. In particular, we refer to the Incomplete Cholesky (IC) preconditioner
(Saad, 2003, Ch. 10). The IC algorithm computes an approximate factorization $LL^T \approx Q$
by restricting the support of L to a subset S, which can be chosen a priori or on the fly
(see Algorithm 2). A standard choice for S is the lower-triangular support of Q itself.
Algorithm 2 with such a choice of S is also referred to as zero fill-in incomplete Cholesky,
which is a widely used CG preconditioner.
An important limitation of Algorithm 2 is that it may fail for general positive def-
inite matrices. Indeed, by restricting the support of L, it may happen that
$Q_{jj} - \sum_{k : (j,k) \in S} L_{jk}^2 < 0$. Several modifications have been proposed to prevent this issue,
usually by scaling or shifting Q. Nevertheless, in the following numerical examples, we
obtained the best results (in terms of rate of convergence of CG) by taking the absolute
value of the previous quantity, hence forcing the continuation of Algorithm 2.
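For concreteness, the following is a minimal dense sketch of the zero fill-in variant with the
absolute-value safeguard just described; it mirrors Algorithm 2 with S equal to the
lower-triangular support of Q, and is meant as an illustration rather than the implementation
used in our experiments.

```python
import numpy as np

def ichol0_abs(Q):
    """Zero fill-in incomplete Cholesky of a symmetric matrix Q (dense sketch).

    The support of L is restricted to the lower-triangular support of Q; if a pivot
    Q_jj - sum_k L_jk^2 turns out negative, its absolute value is taken so that the
    factorization can continue (the safeguard discussed above).
    """
    p = Q.shape[0]
    L = np.zeros_like(Q, dtype=float)
    support = Q != 0
    for j in range(p):
        pivot = Q[j, j] - np.sum(L[j, :j] ** 2)
        L[j, j] = np.sqrt(abs(pivot))            # absolute-value safeguard
        for i in range(j + 1, p):
            if not support[i, j]:                # zero fill-in: keep L_ij = 0
                continue
            L[i, j] = (Q[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

# Toy usage: unless the safeguard is triggered, L @ L.T reproduces Q exactly on its support.
rng = np.random.default_rng(1)
V = (rng.random((200, 40)) < 0.05).astype(float)   # synthetic 0/1 design
Q = np.eye(40) + V.T @ V
L = ichol0_abs(Q)
print("max error on the support of Q:", np.abs(L @ L.T - Q)[Q != 0].max())
```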
Considerable work has been done to refine the choice of S in order to improve the quality
of the approximation while preserving the computational efficiency of Algorithm 2. For
example, Lin and Moré (1999) proposed a variation of the IC algorithm that selects S
depending on the numerical entries of the matrix and satisfies |S| ≤ p · n_Q. Here, the
choice of p allows controlling the density of the factorization, obtaining a more accurate
solution at the price of additional memory and computational time. The results obtained
with such a preconditioner were comparable to the ones obtained with the standard IC
preconditioner, hence we did not include them in this article.
Table 4 reports the comparison between the Jacobi (see Section 5.2) and the Incom-
plete Cholesky preconditioner. For both the American political elections survey data and
the Instructor Evaluations data set, we display the average number of CG iterations (af-
ter an initial burn-in) obtained with the Jacobi and the IC preconditioner, respectively.
While the IC-preconditioned CG requires a smaller number of iterations, the reduction is
modest (at most a factor of 4) and comes at the additional price, in terms of time and
memory complexity, of computing and inverting L. Overall, the cost of the two
algorithms is roughly equivalent in the cases we have considered. However, the Jacobi
preconditioning is easier to implement and does not incur the problems of Algorithm 2.

                          Voter Turnout            Students Evaluations
 Case                     Jacobi      IChol        Jacobi      IChol
 Random intercepts          36          19           35          16
 Nested effect              53          31           94          37
 Random slopes              70          27          150          17
 2-way interactions        338         104          121          53
 3-way interactions        445         108          126          54
 Everything                517         140          262          67

Table 4: Average number of CG iterations with Jacobi and Incomplete Cholesky precon-
         ditioning for the data sets considered in Section 6. We considered N = 70 000
         observations.

B.3. Erdős-Rényi random bipartite graph

In this section, we extend Theorem 4 to the case with K = 2 factors and a design given
by an Erdős-Rényi bipartite random graph (i.e. each edge is missing independently at
random with probability 1 − π). The resulting bound on $\kappa_{3,p-2}(\bar Q)$ is very similar to
the one of Theorem 4; however, in this case, we need an asymptotic lower bound on the
expected degrees $G_1\pi$ and $G_2\pi$ to guarantee sufficient connectivity in the limit.

Theorem 7. Let Q = Diag(Q) + τA be the posterior precision matrix in (11). Let $A^{(r)}$
be the adjacency matrix of an Erdős-Rényi bipartite random graph. Let $G_2 \ge G_1 \ge 1$
and $\pi = \pi(G_1)$ be such that
$$G_1\pi = \Omega\!\left(\sqrt{G_1\pi}\,\log^3(G_1)\right), \qquad G_2\pi = \Omega\!\left(\sqrt{G_2\pi}\,\log^3(G_2)\right).$$
Then, with probability tending to 1 as $G_1$ tends to infinity,
$$\kappa_{3,p-2}(\bar Q) = \frac{\bar\mu_{p-2}}{\bar\mu_3} \le \frac{1+\epsilon}{1-\epsilon}\,,$$
with
$$\epsilon = 2\,[2 + o(1)]\left(\frac{1}{\sqrt{G_1\pi}} + \frac{1}{\sqrt{G_2\pi}} + \sqrt{\frac{1}{G_1\pi} + \frac{1}{G_2\pi}}\,\right),$$
as $G_1 \to \infty$.

Proof. The proof follows the same steps as the proof of Theorem 4, using the additional
result from Ashcroft (2021).
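As a quick numerical illustration (not part of the formal argument), one can simulate an
Erdős-Rényi bipartite graph and inspect the extreme non-trivial eigenvalues of its normalized
adjacency matrix, comparing them with the scale 1/√(G1π) + 1/√(G2π) appearing above; the
values of G1, G2 and π below are arbitrary and purely illustrative.

```python
import numpy as np

# Eigenvalues of the normalized adjacency of an Erdős-Rényi bipartite graph (illustration).
rng = np.random.default_rng(0)
G1, G2, pi = 300, 400, 0.1

B = (rng.random((G1, G2)) < pi).astype(float)        # bipartite adjacency block
A = np.block([[np.zeros((G1, G1)), B],
              [B.T, np.zeros((G2, G2))]])
deg = A.sum(axis=1)
deg[deg == 0] = 1.0                                  # guard against isolated nodes
Abar = A / np.sqrt(np.outer(deg, deg))               # D^{-1/2} A D^{-1/2}

nu = np.sort(np.linalg.eigvalsh(Abar))
print("second largest eigenvalue :", nu[-2])
print("second smallest eigenvalue:", nu[1])
print("reference scale 1/sqrt(G1*pi) + 1/sqrt(G2*pi):",
      1 / np.sqrt(G1 * pi) + 1 / np.sqrt(G2 * pi))
```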

B.4. Pairwise connected components is not sufficient


Example 3. Consider a simple crossed random effects model with K = 3 factors of size
G1 = G2 = G3 = 2. Let N = 3, and let the observed levels be
g1[1] = 1, g2[1] = 1, g3[1] = 2,
g1[2] = 2, g2[2] = 2, g3[2] = 2,
g1[3] = 2, g2[3] = 1, g3[3] = 1.

The corresponding $U^{(r)}$ is
$$U^{(r)} = \begin{pmatrix}
1 & 0 & 1 & 0 & 0 & 1 \\
0 & 2 & 1 & 1 & 1 & 1 \\
1 & 1 & 2 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 0 & 2
\end{pmatrix}.$$

Notice that $\dim(\mathrm{Null}(U^{(r)})) = 3$ and that the following vectors generate the null space
of $U^{(r)}$:
$$x_1 = (1, 1, -1, -1, 0, 0)^T, \qquad x_2 = (1, 1, 0, 0, -1, -1)^T, \qquad x_3 = (1, 0, -1/2, 1/2, 1/2, -1/2)^T.$$
In particular, the first two vectors are the directions identified by Theorem 3, while the
third one is specific to the symmetry of this particular example. By construction, also
$\bar U^{(r)}$ has a null space of dimension 3. Finally, because of the relationship in (15), the
corresponding eigenvalues of $\bar A^{(r)}$ satisfy $\bar\nu_1 = \bar\nu_2 = \bar\nu_3 = -1/2$.
To conclude, this example provides a design where the bipartite subgraph of any pair
of factors is made of a unique connected component, but $\bar\nu_K = -\frac{1}{K-1}$.
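The claims of Example 3 are easy to verify numerically. The short sketch below rebuilds
$U^{(r)}$ from the three observations listed above and checks that its null space has dimension
3; it is included purely as an illustration.

```python
import numpy as np

# Observed levels of Example 3: three observations, K = 3 factors with two levels each.
# Columns of V are ordered as (factor 1: levels 1-2, factor 2: levels 1-2, factor 3: levels 1-2).
levels = [(1, 1, 2), (2, 2, 2), (2, 1, 1)]      # (g1[i], g2[i], g3[i]) for i = 1, 2, 3

V = np.zeros((3, 6))
for i, (g1, g2, g3) in enumerate(levels):
    V[i, g1 - 1] = 1
    V[i, 2 + g2 - 1] = 1
    V[i, 4 + g3 - 1] = 1

U = V.T @ V
print(U.astype(int))                             # matches the matrix displayed above
print("null-space dimension:", 6 - np.linalg.matrix_rank(U))   # prints 3
```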

C. Pseudocode of the algorithms


C.1. Cholesky factorization
Algorithm 1 shows a pseudocode of the algorithm that allows computing the Cholesky
factorization of a given positive-definite matrix Q ∈ Rp×p . A simple modification of
Algorithm 1 allows computing an approximate factorization L̂, by restricting the support
of L to a given subset S ⊆ {(i, j) : i ≥ j, i = 1, . . . , p, j = 1, . . . , p}. The resulting
algorithm is called incomplete Cholesky factorization (Golub and van Loan, 2013) and
it is outlined in Algorithm 2.

Algorithm 1 Cholesky factorization.
Input: Q ∈ R^{p×p} symmetric and positive definite
Output: L ∈ R^{p×p} lower triangular s.t. Q = LL^T
 1: for j ∈ {1, . . . , p} do
 2:   L_jj ← sqrt( Q_jj − Σ_{k<j} L_jk² )
 3:   for i ∈ {j + 1, . . . , p} do
 4:     L_ij ← Q_ij / L_jj
 5:     for k ∈ {1, . . . , j − 1} do
 6:       L_ij ← L_ij − (L_ik · L_jk / L_jj)
 7:     end for
 8:   end for
 9: end for
10: return L

Algorithm 2 Incomplete Cholesky.
Input: Q ∈ R^{p×p} symmetric, S sparsity set
Output: L ∈ R^{p×p} lower triangular with Supp(L) ⊂ S
 1: for j ∈ {1, . . . , p} do
 2:   L_jj ← sqrt( Q_jj − Σ_{k : (j,k)∈S} L_jk² )
 3:   for i > j : (i, j) ∈ S do
 4:     L_ij ← Q_ij / L_jj
 5:     for k < j : (i, k), (j, k) ∈ S do
 6:       L_ij ← L_ij − (L_ik · L_jk / L_jj)
 7:     end for
 8:   end for
 9: end for
10: return L

C.2. Sampling with Cholesky factorization

For a given Cholesky factor L, sampling from a Gaussian distribution with precision
matrix $Q = LL^T$ is straightforward. E.g. one can obtain a sample $\theta \sim N(Q^{-1}m, Q^{-1})$
by first sampling $z \sim N(0, I_p)$, and then solving the following two linear systems
$$L w = m\,, \qquad L^T \theta = w + z\,.$$
Specifically, one needs to solve a lower- and an upper-triangular linear system respectively,
which can be done with O(n_L) cost via forward and backward substitution.
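A minimal dense sketch of this sampling step, using scipy's Cholesky factorization and
triangular solvers (the sparse case is analogous, with a sparse Cholesky factor), is given
below; the toy Q and m are arbitrary placeholders.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_gaussian_with_precision(Q, m, rng):
    """Draw theta ~ N(Q^{-1} m, Q^{-1}) for a dense precision matrix Q.

    Uses Q = L L^T: solve L w = m, then L^T theta = w + z with z ~ N(0, I),
    exactly as in the scheme described above (dense illustration).
    """
    L = cholesky(Q, lower=True)
    w = solve_triangular(L, m, lower=True)
    z = rng.standard_normal(Q.shape[0])
    return solve_triangular(L.T, w + z, lower=False)

# Toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)          # symmetric positive definite precision
m = rng.standard_normal(5)
theta = sample_gaussian_with_precision(Q, m, rng)
```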

C.3. Conjugate gradient

For the conjugate gradient algorithm, we refer to the formulation of Saad (2003): namely,
to Algorithm 6.17 and, for the preconditioned version, to Algorithm 9.1. For simplicity,
we report them in Algorithms 3 and 4 respectively.
If we only consider vector operations, Algorithm 3 requires the computation of three
scalar products, three linear combinations of vectors and a matrix-vector multiplication,
accounting for a total of 4p + 2n_Q flops per iteration. The preconditioned variant only
requires the additional cost of evaluating $M^{-1}z$, which, for Jacobi preconditioning,
accounts for p flops.

Algorithm 3 Conjugate Gradient.
Input: Q ∈ R^{p×p} symmetric positive definite, b ∈ R^p, ϵ > 0 desired accuracy.
Output: Approx. solution of Qx = b.
 1: x_0 = 0, r_0 = b, p_0 = r_0.
 2: for j = 0, 1, . . . until ||r_j||_2 < ϵ, do
 3:   α_j = r_j^T r_j / (p_j^T Q p_j)
 4:   x_{j+1} = x_j + α_j p_j
 5:   r_{j+1} = r_j − α_j Q p_j
 6:   β_j = r_{j+1}^T r_{j+1} / (r_j^T r_j)
 7:   p_{j+1} = r_{j+1} + β_j p_j
 8: end for
 9: return x_{j+1}

Algorithm 4 Preconditioned CG.
Input: Q ∈ R^{p×p} symmetric positive definite, b ∈ R^p, ϵ > 0, M preconditioner.
Output: Approx. solution of Qx = b.
 1: x_0 = 0, r_0 = b, z_0 = M^{-1} r_0, p_0 = z_0.
 2: for j = 0, 1, . . . until ||r_j||_2 < ϵ, do
 3:   α_j = r_j^T z_j / (p_j^T Q p_j)
 4:   x_{j+1} = x_j + α_j p_j
 5:   r_{j+1} = r_j − α_j Q p_j
 6:   z_{j+1} = M^{-1} r_{j+1}
 7:   β_j = r_{j+1}^T z_{j+1} / (r_j^T z_j)
 8:   p_{j+1} = z_{j+1} + β_j p_j
 9: end for
10: return x_{j+1}
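A direct transcription of Algorithm 4 in Python, with a Jacobi preconditioner used as an
example, is sketched below; this is an illustrative implementation, not the exact code used
for the experiments of Section 6.

```python
import numpy as np

def pcg(Q, b, M_inv, eps=1e-8, max_iter=1000):
    """Preconditioned CG (Algorithm 4): approximately solve Q x = b.

    Q: symmetric positive definite matrix (anything supporting @),
    M_inv: function applying the inverse preconditioner to a vector.
    """
    x = np.zeros_like(b)
    r = b.copy()
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) < eps:
            break
        Qp = Q @ p
        alpha = rz / (p @ Qp)
        x = x + alpha * p
        r = r - alpha * Qp
        z = M_inv(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Jacobi preconditioning: M = Diag(Q), so M^{-1} r = r / diag(Q).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))
Q = A.T @ A + np.eye(30)
b = rng.standard_normal(30)
x = pcg(Q, b, M_inv=lambda r: r / np.diag(Q))
print("residual norm:", np.linalg.norm(Q @ x - b))
```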

C.4. Polya-Gamma augmented Gibbs sampler

Algorithm 5 PG augmented Gibbs sampler.
Input: hyperparameters m_0, α_k, Φ_k; vector of observations y; design matrix V;
       initial values θ^(0), Ω^(0), T_k^(0)
for t = 1 : T do
  Sample θ^(t) ∼ p(θ | y, V, Ω^(t−1), {T_k^(t−1)}_{k=1}^K) according to (23)
  Sample ω_i^(t) ∼ p(ω_i | y, V, θ^(t)), for each i = 1, . . . , N, according to (24)
  Sample T_k^(t) ∼ p(T_k | y, V, θ^(t)), for each k = 1, . . . , K, according to (25)
end for
Output: MCMC samples {(θ^(t), Ω^(t), {T_k^(t)}_{k=1}^K); t = 0, . . . , T}.
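The sweep above maps directly onto code. The sketch below shows one iteration for a
logistic link with prior mean zero; `sample_pg`, `sample_precisions` and
`sample_gaussian_with_precision` are hypothetical placeholders standing in for the full
conditionals (24), (25) and (23), which are not reproduced here. The only specific detail
shown is how the Pólya-Gamma weights ω enter the Gaussian full conditional for θ through
the weighted cross-product V^T Ω V, as in Polson et al. (2013).

```python
import numpy as np

def pg_gibbs_sweep(theta, V, y, T_diag, sample_pg, sample_precisions,
                   sample_gaussian_with_precision, rng):
    """One structural sweep of Algorithm 5 for a logistic GLMM (sketch only).

    sample_pg, sample_precisions and sample_gaussian_with_precision are
    hypothetical placeholders for the full conditionals (24), (25) and (23);
    the prior mean of theta is assumed to be zero.
    """
    eta = V @ theta                          # linear predictor
    omega = sample_pg(eta)                   # omega_i ~ PG(1, eta_i), one per observation
    kappa = y - 0.5                          # PG representation of the logistic likelihood
    Q = np.diag(T_diag) + V.T @ (omega[:, None] * V)    # precision of theta | rest
    m = V.T @ kappa
    theta = sample_gaussian_with_precision(Q, m, rng)   # e.g. Cholesky- or CG-based sampler
    T_diag = sample_precisions(theta)        # update the variance components
    return theta, omega, T_diag
```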

D. Additional Figures
D.1. Sparsity structure of Q resulting from Example 1

Figure 3: Precision matrix Q of L(θ | y, g) induced by the structured design of Example 1
          with G = 20, d = 3 and default ordering (θ_{1,1}, . . . , θ_{1,G}, θ_{2,1}, . . . , θ_{2,G}, θ_0).

D.2. Graphical representation of the graph in Example 3

Figure 4: Conditional independence graph of Example 3. Obtained for G1 = G2 = G3 = 3
          and N = 9.
