UNCONSTRAINED PARAMETERIZATIONS FOR VARIANCE-COVARIANCE MATRICES
The estimation of variance-covariance matrices in situations that involve the optimization of an objective function (e.g. a log-likelihood function) is usually a difficult numerical problem, since the resulting estimates should be positive semi-definite matrices. We can either use constrained optimization, or employ a parameterization that enforces this condition. We describe here five different parameterizations for variance-covariance matrices that ensure positive definiteness, while leaving the estimation problem unconstrained. We compare the parameterizations based on their computational efficiency and statistical interpretability. The results described here are particularly useful in maximum likelihood and restricted maximum likelihood estimation in mixed-effects models, but are also applicable to other areas of statistics.

KEY WORDS: Unconstrained estimation; Variance-covariance components estimation; Cholesky factorization; Matrix logarithm.
transforming the estimation of unstructured (general) variance-covariance matrices into an unconstrained problem. In Section 3 we compare the parameterizations with respect to their computational efficiency and statistical interpretability. Our conclusions and suggestions for further research are presented in Section 4.
2. PARAMETERIZATIONS

Let Σ denote a symmetric positive definite n × n variance-covariance matrix corresponding to a random vector X = (X₁, …, Xₙ). We do not assume any further structure for Σ. Because Σ is symmetric, only n(n+1)/2 parameters are needed to represent it. We will denote by θ any such minimal set of parameters to determine Σ. The rationale behind all parameterizations considered in this section is to write

    Σ = LᵀL                                                (2.1)

where L = L(θ) is an n × n matrix of full rank obtained from an n(n+1)/2-dimensional vector of unconstrained parameters θ. It is clear that any Σ defined from a full rank L as in (2.1) is positive definite.
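As a quick numerical illustration of this construction (not part of the original paper; a numpy-based sketch), the code below builds Σ = LᵀL from an arbitrary full-rank L and checks positive definiteness through the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Any full-rank n x n matrix L yields a symmetric matrix L^T L.
L = rng.normal(size=(n, n))
assert np.linalg.matrix_rank(L) == n

sigma = L.T @ L

# Sigma is symmetric and, because L has full rank, positive definite:
# x^T (L^T L) x = ||L x||^2 > 0 for every x != 0.
assert np.allclose(sigma, sigma.T)
assert np.all(np.linalg.eigvalsh(sigma) > 0)
```

The inequality xᵀ(LᵀL)x = ‖Lx‖² > 0 for every x ≠ 0 is the reason full rank of L is the only requirement.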
Different choices of L lead to different parameterizations of Σ. We will consider here two classes of L: one based on the Cholesky factorization of Σ (Thisted, 1988, §3.3) and another based on the spectral decomposition of Σ (Rao, 1973, §1c.3). The first three parameterizations presented below use the Cholesky factorization of Σ, while the last two are based on its spectral decomposition.

In some of the parameterizations there are particular components of the parameter vector θ that have meaningful statistical interpretations. These can include the eigenvalues of Σ, which are important in considering when the matrix is ill-conditioned, the individual variances or standard deviations, and the correlations.

The following variance-covariance matrix will be used throughout this section to illustrate the use of the various parameterizations.

        [ 1  1   1 ]
    A = [ 1  5   5 ]                                       (2.2)
        [ 1  5  14 ]
2.1 Cholesky Parameterization

Because Σ is positive definite, it may be factored as Σ = LᵀL, where L is an upper triangular matrix. Setting θ to be the upper triangular elements of L gives the Cholesky parameterization of Σ. Lindstrom and Bates (1988) use this parameterization to obtain derivatives of the log-likelihood of a linear mixed-effects model for use in a Newton-Raphson algorithm. They reported that the use of this parameterization dramatically improved the convergence properties of the optimization algorithm, when compared to a constrained estimation approach.

One problem with the Cholesky parameterization is that the Cholesky factor is not unique. In fact, if L is a Cholesky factor of Σ then so is any matrix obtained by multiplying a subset of the rows of L by -1. This has implications on parameter identification, since up to 2ⁿ different θ may represent the same Σ. Numerical problems can arise in the optimization of an objective function when different optimal solutions are close together in the parameter space.

Another problem with the Cholesky parameterization is the lack of a straightforward relationship between θ and the elements of Σ. This makes it hard to interpret the estimates of θ and to obtain confidence intervals for the variances and covariances in Σ based on confidence intervals for the elements of θ. One exception is |[L]₁₁| = √[Σ]₁₁, so confidence intervals on [Σ]₁₁ can be obtained from confidence intervals on [L]₁₁, where [A]ᵢⱼ denotes the ijth element of the matrix A. By appropriately permuting the columns and rows of Σ we can in fact derive confidence intervals for all the variance terms based on confidence intervals for the elements of L.

The main advantage of this parameterization, apart from the fact that it ensures positive definiteness of the estimate of Σ, is that it is computationally simple and stable.

The Cholesky factorization of A in (2.2) is

        [ 1  0  0 ] [ 1  1  1 ]
    A = [ 1  2  0 ] [ 0  2  2 ]
        [ 1  2  3 ] [ 0  0  3 ]

By convention, the components of the upper triangular part of L are listed column-wise to give θ = (1, 1, 2, 1, 2, 3)ᵀ.
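The worked example can be reproduced in a few lines. The sketch below is ours, not code from the paper (note that numpy's `cholesky` returns the lower factor C with Σ = CCᵀ, so the upper factor is L = Cᵀ):

```python
import numpy as np

def theta_to_sigma(theta, n):
    """Build Sigma = L^T L from the column-wise upper triangle of L."""
    # The column-wise upper triangle of L is the row-wise lower
    # triangle of L^T, which matches numpy's tril_indices ordering.
    Lt = np.zeros((n, n))
    Lt[np.tril_indices(n)] = theta
    L = Lt.T                          # upper triangular Cholesky factor
    return L.T @ L

def sigma_to_theta(sigma):
    """Recover theta from Sigma."""
    n = sigma.shape[0]
    C = np.linalg.cholesky(sigma)     # lower factor: Sigma = C C^T = L^T L
    return C[np.tril_indices(n)]      # column-wise upper triangle of L = C^T

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 5.0, 5.0],
              [1.0, 5.0, 14.0]])

theta = sigma_to_theta(A)             # array([1., 1., 2., 1., 2., 3.])
assert np.allclose(theta_to_sigma(theta, 3), A)
```

Multiplying any subset of the rows of L by -1 leaves LᵀL unchanged, which is exactly the identification problem discussed above.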
2.2 Log-Cholesky Parameterization

If one requires the diagonal elements of L in the Cholesky factorization to be positive then L is unique. To avoid constrained estimation, one can use the logarithms of the diagonal elements of L. We call this parameterization the log-Cholesky parameterization. It inherits the good computational properties of the Cholesky parameterization, but has the advantage of being uniquely defined. As in the Cholesky parameterization, the parameters lack direct interpretation in terms of the original variances and covariances, except for L₁₁.

The log-Cholesky parameterization of A is

    θ = (0, 1, log(2), 1, 2, log(3))ᵀ.
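A minimal sketch of the log-Cholesky map (ours, using the same column-wise convention as the Cholesky example):

```python
import numpy as np

def logchol_theta(sigma):
    """Cholesky parameters with the diagonal entries of L replaced by logs."""
    n = sigma.shape[0]
    L = np.linalg.cholesky(sigma).T        # upper factor, Sigma = L^T L
    theta = []
    for j in range(n):                     # columns, listed column-wise
        theta.extend(L[:j, j])             # off-diagonal entries unchanged
        theta.append(np.log(L[j, j]))      # log of the (positive) diagonal
    return np.array(theta)

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 5.0, 5.0],
              [1.0, 5.0, 14.0]])
theta = logchol_theta(A)                   # (0, 1, log 2, 1, 2, log 3)
assert np.allclose(theta, [0, 1, np.log(2), 1, 2, np.log(3)])
```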
2.3 Spherical Parameterization

The purpose of this parameterization is to combine the computational efficiency of the Cholesky parameterization with direct interpretation of θ in terms of the variances and correlations in Σ.

Let Lᵢ denote the ith column of L in the Cholesky factorization of Σ and lᵢ denote the spherical coordinates of the first i elements of Lᵢ. That is

    [Lᵢ]₁   = [lᵢ]₁ cos([lᵢ]₂)
    [Lᵢ]₂   = [lᵢ]₁ sin([lᵢ]₂) cos([lᵢ]₃)
        ⋮
    [Lᵢ]ᵢ₋₁ = [lᵢ]₁ sin([lᵢ]₂) ⋯ sin([lᵢ]ᵢ₋₁) cos([lᵢ]ᵢ)
    [Lᵢ]ᵢ   = [lᵢ]₁ sin([lᵢ]₂) ⋯ sin([lᵢ]ᵢ₋₁) sin([lᵢ]ᵢ)

It then follows that σᵢᵢ = [lᵢ]₁² and ρ₁ᵢ = cos([lᵢ]₂), i = 2, …, n, where ρᵢⱼ denotes the correlation coefficient between Xᵢ and Xⱼ. The correlations between other variables can be expressed as linear combinations of products of sines and cosines of the elements in l₁, …, lₙ, but the relationship is not as straightforward as those involving X₁. If confidence intervals are available for the elements of lᵢ, i = 1, …, n, then we can also obtain confidence intervals for the variances and the correlations ρ₁ᵢ. By appropriately permuting the rows and columns of Σ, we can in fact obtain confidence intervals for all the variances and correlations of X₁, …, Xₙ. The exact same reasoning can be applied to derive profile traces and profile contours (Bates and Watts, 1988) for variances and correlations of X₁, …, Xₙ based on a likelihood function.

To ensure uniqueness of the spherical parameterization we must have

    [lᵢ]₁ > 0, i = 1, …, n    and    [lᵢ]ⱼ ∈ (0, π), i = 2, …, n, j = 2, …, i.

Unconstrained estimation is obtained by defining θ as follows

    θᵢ = log([lᵢ]₁), i = 1, …, n    and
    θ_{n+(i-2)(i-1)/2+(j-1)} = log( [lᵢ]ⱼ / (π - [lᵢ]ⱼ) ), i = 2, …, n, j = 2, …, i.

The spherical parameterization has about the same computational efficiency as the Cholesky and log-Cholesky parameterizations, is uniquely defined, and allows direct interpretability of θ in terms of the variances and correlations in Σ.

The spherical parameterization of A is

    θ = (0, log(5)/2, log(14)/2, -0.608, -0.348, -0.787)ᵀ.
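The example θ can be reproduced as follows. This sketch (ours, not code from the paper) extracts each column's radius and spherical angles with an arccos recursion and then applies the log and logit transforms described above; the angle-extraction details are our reading of the text:

```python
import numpy as np

def spherical_theta(sigma):
    """Log radii of the Cholesky columns, then logit-transformed angles."""
    n = sigma.shape[0]
    L = np.linalg.cholesky(sigma).T            # upper factor, Sigma = L^T L
    log_radii, angles = [], []
    for i in range(n):
        v = L[:i + 1, i]                       # first i+1 elements of a column
        log_radii.append(np.log(np.linalg.norm(v)))
        for k in range(i):
            # angle between v[k] and the norm of the remaining tail;
            # lies in (0, pi) when the diagonal of L is positive
            phi = np.arccos(v[k] / np.linalg.norm(v[k:]))
            angles.append(np.log(phi / (np.pi - phi)))   # logit on (0, pi)
    return np.array(log_radii + angles)

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 5.0, 5.0],
              [1.0, 5.0, 14.0]])
theta = spherical_theta(A)
# theta is approximately (0, log(5)/2, log(14)/2, -0.608, -0.348, -0.787)
```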
2.4 Matrix Logarithm Parameterization

This and the next parameterization are based on the spectral decomposition of Σ. Because Σ is positive definite, it has n positive eigenvalues λ = (λ₁, …, λₙ). Letting U denote the orthogonal matrix of orthonormal eigenvectors of Σ and Λ = diag(λ), we can write

    Σ = U Λ Uᵀ                                             (2.3)

By setting

    L = Λ^{1/2} Uᵀ                                         (2.4)

in (2.1), where Λ^{1/2} denotes the diagonal matrix with [Λ^{1/2}]ᵢᵢ = √[Λ]ᵢᵢ, we get a factorization of Σ based on the spectral decomposition.

The matrix logarithm of Σ is defined as log(Σ) = U log(Λ) Uᵀ, where log(Λ) = diag[log(λ)]. Note that Σ and log(Σ) share the same eigenvectors. The matrix log(Σ) can take any value in the space of n × n symmetric matrices. Letting θ be equal to its upper triangular elements gives the matrix logarithm parameterization of Σ.

The matrix logarithm parameterization defines a one-to-one mapping between Σ and θ and therefore does not have the identification problems of the Cholesky factorization. It does involve considerable calculation, as θ produces log(Σ), whose eigenstructure must be determined before L in (2.4) can be calculated. Similarly to the Cholesky and log-Cholesky parameterizations, the vector θ in the matrix logarithm parameterization does not have a straightforward interpretation in terms of the original variances and covariances in Σ. We note that even though the matrix logarithm is based on the spectral decomposition of Σ, there is not a straightforward relationship between θ and the eigenvalues-eigenvectors of Σ.

The matrix logarithm of A is

             [ -0.174  0.392  0.104 ]
    log(A) = [  0.392  1.265  0.650 ]
             [  0.104  0.650  2.492 ]

and therefore the matrix logarithm parameterization of A is

    θ = (-0.174, 0.392, 1.265, 0.104, 0.650, 2.492)ᵀ.
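A short sketch (ours) of the matrix logarithm parameterization via the symmetric eigendecomposition:

```python
import numpy as np

def matrixlog_theta(sigma):
    """Column-wise upper triangle of log(Sigma) = U log(Lambda) U^T."""
    lam, U = np.linalg.eigh(sigma)             # Sigma = U diag(lam) U^T
    log_sigma = (U * np.log(lam)) @ U.T        # same eigenvectors as Sigma
    n = sigma.shape[0]
    # log_sigma is symmetric, so its row-wise lower triangle equals the
    # column-wise upper triangle used in the text.
    return log_sigma[np.tril_indices(n)]

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 5.0, 5.0],
              [1.0, 5.0, 14.0]])
theta = matrixlog_theta(A)
# theta is approximately (-0.174, 0.392, 1.265, 0.104, 0.650, 2.492)
```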
2.5 Givens Parameterization

The eigenstructure of Σ contains valuable information for determining whether some linear combination of X₁, …, Xₙ could be regarded as nearly constant. This is useful, for example, in model building for mixed-effects models, as near zero eigenvalues may indicate overparameterization (Pinheiro and Bates, 1995). The Givens parameterization uses the eigenvalues of Σ directly in the definition of the parameter vector θ.

The Givens parameterization is based on the spectral decomposition of Σ given in (2.3) and the fact that the eigenvector matrix U can be represented by n(n-1)/2 angles, used to generate a series of Givens rotation matrices (Thisted, 1988) whose product reproduces U as follows

    U = G₁ G₂ ⋯ G_{n(n-1)/2},    where

    Gᵢ[j,k] =  cos(φᵢ)   if j = k = m₁(i) or j = k = m₂(i)
               sin(φᵢ)   if j = m₁(i), k = m₂(i)
              -sin(φᵢ)   if j = m₂(i), k = m₁(i)
               1         if j = k ≠ m₁(i) and j = k ≠ m₂(i)
               0         otherwise

Here φᵢ denotes the ith rotation angle and m₁(i) < m₂(i) are integers taking values in {1, …, n} and satisfying i = m₂(i) - m₁(i) + (m₁(i) - 1)(n - m₁(i)/2). To ensure uniqueness of the Givens parameterization we must have φᵢ ∈ (0, π), i = 1, …, n(n-1)/2.

The spectral decomposition (2.3) is unique up to a reordering of the diagonal elements of Λ and columns of U and up to switching of signs in each column of U. Uniqueness can be achieved by forcing the eigenvalues to be sorted in ascending order. This can be attained, within an unconstrained estimation framework, by using a parameterization suggested by Jupp (1978) and defining the first n elements of θ as

    θᵢ = log(λᵢ - λᵢ₋₁), i = 1, …, n

where λᵢ denotes the ith eigenvalue of Σ in ascending order and with the convention that λ₀ = 0. The remaining elements of θ in the Givens parameterization are defined by the relation

    θ_{n+i} = log( φᵢ / (π - φᵢ) ), i = 1, …, n(n-1)/2

The main advantage of this parameterization is that the first n elements of θ give information about the eigenvalues of Σ directly. Another advantage of the Givens parameterization is that it can be easily modified to handle general (not necessarily positive definite) symmetric matrices. The only modification needed is to set θ₁ = λ₁ and to obtain the remaining eigenvalues

    λᵢ = λ₁ + Σ_{j=2}^{i} exp(θⱼ), i = 2, …, n
from the parameter vector θ. Another problem with the Givens parameterization is that its components cannot be related to the elements of Σ in a straightforward manner, so inferences about variances and covariances require indirect methods.

The eigenvector matrix U in (2.3) can also be expressed in terms of Householder reflections (Thisted, 1988, §3.1.2), and these in turn can be derived from n(n-1)/2 unconstrained parameters, but because this approach is similar to the Givens parameterization we do not pursue it here.

[Figure 1: Time (seconds) versus dimension (5 to 100) for Structures I to IV.]
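The eigenvalue part of the Givens parameterization is easy to sketch (ours, numpy-based); the remaining n(n-1)/2 angle parameters, which require factoring U into Givens rotations, are omitted here:

```python
import numpy as np

def givens_eigenvalue_theta(sigma):
    """First n elements of the Givens parameterization:
    logs of successive gaps of the ascending eigenvalues (Jupp, 1978)."""
    lam = np.linalg.eigvalsh(sigma)                 # ascending eigenvalues
    gaps = np.diff(np.concatenate(([0.0], lam)))    # lam_i - lam_{i-1}, lam_0 = 0
    return np.log(gaps)

def theta_to_eigenvalues(theta):
    """Inverse map: any real theta yields 0 < lam_1 < ... < lam_n."""
    return np.cumsum(np.exp(theta))

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 5.0, 5.0],
              [1.0, 5.0, 14.0]])
theta = givens_eigenvalue_theta(A)
lam = theta_to_eigenvalues(theta)
assert np.allclose(lam, np.linalg.eigvalsh(A))
assert np.all(np.diff(lam) > 0)                    # eigenvalues stay ordered
```

Because the inverse map is a cumulative sum of exponentials, any unconstrained θ produces a strictly increasing, positive set of eigenvalues.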
[Figure 2: Box-plots of user time and number of iterations to convergence for 300 random samples of model (3.1) with Σ of dimension 3, by parameterization (Cholesky, log-Cholesky, spherical, matrix log, Givens).]

[Figure 3: Box-plots of user time and number of iterations to convergence for 50 random samples of model (3.1) with Σ of dimension 6, by parameterization (Cholesky, log-Cholesky, spherical, matrix log, Givens).]
where the bᵢ are independent, identically distributed random effects with common N(0, σ²Σ) distribution and the εᵢ are independent and identically distributed error terms with common distribution N_{nᵢ}(0, σ²I), independent of the bᵢ, with nᵢ representing the number of observations on the ith cluster. Lindstrom and Bates (1988) have shown that the log-likelihood in (3.1) can be profiled to produce a function of θ alone. In the simulation, we used Σ matrices of dimensions 3 and 6. These were defined such that the nonzero elements of the ith column of the corresponding Cholesky factor were equal to the integers between 1 and i. For n = 3 we have Σ = A, as given in (2.2). For n = 3 we used M = 10, nᵢ = 15, i = 1, …, 10, σ² = 1, and β = (10, 1, 2)ᵀ, while for n = 6 we used M = 50, nᵢ = 25, i = 1, …, 50, σ² = 1, and β = (10, 1, 2, 3, 4, 5)ᵀ. In both cases, the elements of the first column of X were set equal to 1 and the remaining elements were independently generated according to a U(1, 20) distribution. A total of 300 and 50 samples were generated respectively for n = 3 and n = 6, and the number of iterations and the user time to calculate the maximum likelihood estimate of Σ for each parameterization recorded.

Figures 2 and 3 present box-plots of the number of iterations and of the user times for the various parameterizations. The Cholesky, the log-Cholesky, the spherical, and the matrix logarithm parameterizations had similar performances for n = 3, considerably better than the Givens parameterization. For n = 6 the Cholesky and the matrix logarithm parameterizations gave the best performances, followed by the log-Cholesky and spherical parameterizations, all considerably better than the Givens parameterization. Because n is relatively small in these examples, the numerical complexity of the different parameterizations did not play a major role in their performances. It is interesting to note that even though the matrix logarithm is the least efficient parameterization in terms of numerical complexity, it had the best performance in terms of number of iterations and user time to obtain the maximum likelihood estimate of Σ, suggesting that this parameterization is the most numerically stable.

Another important aspect in which the parameterizations should be compared is their behavior as Σ approaches singularity. All parameterizations described in Section 2 require Σ to be positive definite, though the Givens parameterization can be modified to handle general symmetric matrices. It is usually an important statistical issue to test whether Σ is of less than full rank, in which case the dimension of the parameter space can be reduced.

As Σ approaches singularity its determinant goes to zero and so at least one of the diagonal elements of its Cholesky factor goes to zero too. The Cholesky parameterization would then become numerically unstable, since equivalent solutions would get closer together in the estimation space. At least one element of θ in the log-Cholesky parameterization would go to -∞ (the logarithm of the diagonal element of L that goes to zero). In the spherical parameterization we would also have at least one element of θ going in absolute value to ∞: if the first diagonal element of L goes to zero, θ₁ → -∞; otherwise at least one angle of the spherical coordinates of the column of L whose diagonal element approaches 0 would either approach 0 or π, in which cases the corresponding element of θ would go respectively to -∞ or ∞.

Singularity of Σ implies that at least one of its eigenvalues is zero. The Givens parameterization would then have at least the first element of θ going to -∞. To understand what happens with the matrix logarithm parameterization when Σ approaches singularity, we note that letting (λ₁, u₁), …, (λₙ, uₙ) represent the eigenvalue-eigenvector pairs corresponding to Σ, we can write Σ = λ₁u₁u₁ᵀ + ⋯ + λₙuₙuₙᵀ. As λ₁ → 0, all entries of log(Σ) corresponding to nonzero elements of u₁u₁ᵀ would converge in absolute value to ∞. Hence in the matrix logarithm parameterization we could have all elements of θ going either to -∞ or ∞ as Σ approached singularity.

Finally we consider the statistical interpretability of the parameterizations of Σ. The least interpretable parameterization is the matrix logarithm: none of its elements can be directly related to the individual variances, covariances, or eigenvalues of Σ. The Cholesky and log-Cholesky parameterizations have the first component directly related to the variance of X₁, the first underlying random variable in X. By permuting the order of the random variables in the definition of Σ, one can derive measures of variability and confidence intervals for all the variances in Σ, from corresponding quantities obtained for the parameters in the Cholesky or log-Cholesky parameterizations. The Givens parameterization is the only one considered here that uses the eigenvalues of Σ directly in the definition of θ. It is a very useful parameterization for identifying ill-conditioning of Σ. None of its parameters, though, can be directly related to the variances and covariances in Σ. Finally, the spherical parameterization is the one that gives the largest number of interpretable parameters
ACKNOWLEDGEMENTS
This research was partially supported by the National Science Foundation through grant number DMS-9309101 and by the Coordenação para Aperfeiçoamento de Pessoal de Nível Superior, Brazil. We are grateful to Bruce E. Ankenman for suggesting the Givens parameterization and to two anonymous referees for helpful comments and suggestions.