Analysis of Correlation Structures Using Generalized Estimating Equation Approach For Longitudinal Binary Data
Analysis of Correlation Structures Using Generalized Estimating Equation Approach For Longitudinal Binary Data
Analysis of Correlation Structures Using Generalized Estimating Equation Approach For Longitudinal Binary Data
1. Introduction
In longitudinal studies, observations measured repeatedly from the same
subject over time are serially correlated. Time series with rather pronounced
serial correlation are common in clinical trials and stock markets. When
observations are measured on a continuous scale, the dependency structure
294
between observations can be modelled in a covariance matrix with nonconstant variances and non-zero covariances. There are different types of
covariance matrices to model such dependency between observations from
the same subject.
On the other hand, when the outcomes are binary such as the positive
or negative results to certain tests, we usually consider a generalized linear model (GLM) within the exponential family of sampling distributions
with different link functions. In this paper, we propose a marginal logistic
model and the parameters are estimated using the first order Generalised
Estimating Equation (GEE) approach (Liang & Zeger 1986, Zeger et al.
1988 and Hardin & Hilbe 2003). Marginal model is easy to interpret and
forecast. While the dependency structure can be explicitly modelled by a
working correlation matrix adopting some popular and general correlation
structures, the dependency structure of a conditional model is often limited to an auto-regressive type only. Moreover marginal model allows the
study of population-averaged effects of covariates on the response variable.
Although random effects models allow within subject correlation, it fails to
account for the serial correlation over time.
Over the past two decades, GEE approach has been developed and advanced. For example, Zhao and Prentice (1990) and Lipsitz et al. (1991)
apply the GEE approach to analyse binary data, Kenward et al. (1994)
consider ordinal data, Lipsitz et al. (1994) use categorical data, Prentice
and Zhao (1991) focus on multivariate data, Park (1993) compares GEE approach to maximum likelihood approach and Miller et al. (1993) to weighted
least squares approach, etc. Despite the emergence of new techniques in
recent years, for example, the Bayesian and semi-parametric approaches,
GEE approach is still popular and often offers an easy alternative. See
Horton and Lipsitz (1999) for its implementation in various softwares and
the new geepack module in R. Recent literature on GEE approach includes
Ballinger (2004) for counts and continuous data and Copas and Seaman
(2010) on the asymptotic bias using GEE under the missing at random
(MAR) dropout environment. Although it has been known that parameter
estimates using the GEE approach are robust to misspecification of the true
correlation matrix, our experience with real data shows disagreement. Instead, we found that parameter estimates change considerably across models
adopting different correlation matrices. In this regard, sensitivity analyses
through simulation experiments are often performed to evaluate the per-
295
296
Properties
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
1
1
c
c
c
c
1
c
c
c
c
1
c
c
c
c
1
1
2
3
3
2
1
1
0
0
1
1
1
0
0
1
1
1
0
0
1
1
1
1
1
2
2
1
1
1
0
2
1
1
297
1
1
2
0
298
M
X
P Tm
U () =
[Cov (Y m )]1 (Y m P m ) = 0
m=1
(1)
P Tm
2
2
),
, . . . , SmN
= X Tm Diag(Sm1
m
(2)
is
where
1
M
P
M X
Nm X
n1
X
m=1 n=2 n0 =1
Nm (Nm 1)/2
for EC,
(3)
m=1
M NX
m 1
X
1
(Ymn Pmn )(Ym n+1 Pm n+1 )
b, b1 =
for AR1 & TOEP2,3, (4)
N M m=1 n=1 [Pmn (1 Pmn ) Pm n+1 (1 Pm n+1 )] 21
M NX
m 2
X
(Ymn Pmn )(Ym n+2 Pm n+2 )
1
b2 =
for TOEP3.
N 2M m=1 n=1 [Pmn (1 Pmn ) Pm n+2 (1 Pm n+2 )] 21
since c , 1 , 2 are correlations between any pairs, only lag-1 pairs and only
lag-2 pairs of observations respectively and they are estimated by correlations based on corresponding pairs in the sample. Note that is correlation
between any pairs allowing for the number of lag and it is estimated by
sample correlation between successive (lag-1) pairs only. With estimates
(l)
(l)
(l)
c , (l) , 1 , 2 in iteration l, we can update (l) to (l+1) by solving (1)
(l)
(l) (l)
using the Newton-Raphson method. Then c , (l) , 1 , 2 are subsequently
(l+1)
(l+1)
(l+1)
using (l+1) in (3) to (5) and the cycle
updated to c , (l+1) , 1 , 2
repeats again until convergence is reached. Finally, the covariance matrix
(5)
299
b =
Cov()
T
m
[ X ( P V
m=1
M
X
P Tm
[ ( V
m=1
1 P m
m
T
1 P m
m
T
T
m
)] [ X ( P V
m=1
)]
1 b
1 P m
m Cov(Y m )V m
T
)]
(6)
T
b
where Cov(Y
m ) is given by (Y m P m )(Y m P m ) .
4. Empirical studies
300
301
Table 2: Parameter estimates with SE in italic for the three data sets
0
1
2
or
U
Blood glucose data (N = 3)
CAR1
1.9004 1.0310 -0.0285 0.0131
1.7001 0.2885 0.2830 0.2804 -0.55521
CI
1.7774 1.0744 -0.0259 0.0136
1.7821 0.2993
-0.55682
EC
2.4270 1.1060 -0.0341 0.0141
1.7444 0.3020 0.1388
-0.55747
AR1
2.1176 1.0802 -0.0301 0.0137
1.7595 0.3009 0.0986
-0.55700
TOEP2 2.1027 1.0794 -0.0299 0.0137
1.7602 0.3009 0.0983
-0.55699
TOEP3 2.4388 1.1235 -0.0343 0.0143
1.7467 0.3022 0.1040 0.2149 -0.55751
Plasma citrate concentration data (N = 14)
CAR1 -0.3413 0.5101 -0.1176 0.0554 -0.5981 0.5344 2.6297 0.4311 -0.50189
CI
0.6953 0.6216 -0.0999 0.0603 -0.7274 0.2902
-0.66648
EC
0.6700 0.6370 -0.1000 0.0624 -0.7286 0.2960 0.4324
-0.66656
AR1
0.4359 0.5624 -0.0719 0.0581 -0.5555 0.2774 0.5967
-0.66803
TOEP3 0.4511 0.6872 -0.0902 0.0858 -0.2419 0.2776 0.6046 0.4909 -0.67134
Methadone clinic data (4 Nm 26)
CAR1 -0.8423 0.2189 -0.0088 0.00282 -0.4049 0.0628 2.3960 0.1196 -0.73692
CI
-0.1332 0.4011 -0.0113 0.00620 -0.3710 0.0765
-0.78251
EC
-0.1892 0.3795 -0.0118 0.00517 -0.2911 0.0771 0.2360
-0.67457
AR1
-0.2197 0.3768 -0.0108 0.00569 -0.3434 0.0736 0.4310
-0.74800
TOEP2 -0.2614 0.4017 -0.0107 0.00605 -0.3282 0.0745 0.4424
-0.72970
TOEP3 -0.1900 0.3636 -0.0104 0.00549 -0.3683 0.0752 0.4422 0.3533 -0.77870
Note: The larger parameter estimate 2 with * is 30% more than the smaller
parameter estimate for the plasma citrate concentration data. - refers to
parameter not existing in the model.
4.2 Plasma Citrate Concentration data
An experiment involving 10 subjects (Andrews and Herzberg, 1985, P.237) was carried out to study the variation of plasma citrate concentration during a day. For each
subject, the concentration of citrate in plasma (in mol per litre) was measured hourly
at 14 time points from 8am to 9pm during a day, with a total of K = 140 observations.
Meals were given at 8am, at noon and at 5pm. The binary outcome of the logistic model
is Y , an indicator of whether the plasma citrate concentration (Y1 ) is more than 120
mol per litre. Covariates include the time of the day (X1 ) taking values from 1 to 14
and an indicator of meal time (X2 ). A summary for the counts M and K and the means
of Y1 and Y across observations taken at non-meal and meal times are reported in Table
1. Model parameters are (0 , 1 , 2 ) for the intercept and the 2 covariates.
Again the CAR1 model with an autoregressive parameter for the previous outcome
Ym n1 and the marginal models with 5 different working correlation matrices are fitted
to the data. However model with TOEP2 does not have stable parameter estimates.
Table 2 shows that for all except the CAR1 models, only the indicator of meal time (X2 )
is significantly associated with a higher probability of having the concentration of citrate
in plasma in excess of 120 mol per litre.
302
where Pbmn =
emn
1+emn
mn =
(7)
and
o + 1 Xmn1 + 2 Xmn2
for marginal models,
o + 1 Xmn1 + 2 Xmn2 + Ym n1 for CAR1 model,
303
that the nuisance parameters in the working correlation matrix of the marginal models do not enter explicitly into the calculation of Pbmn . However, as they appear in the
estimating equations, they affect the parameter estimates (0 , 1 , 2 ).
For the blood glucose data, Table 2 shows that the estimates of are very small
for all marginal models and the autoregressive parameter in the CAR1 model is also
insignificant. This shows that the strength of serial correlation is weak possibly because
a short time series cannot adequately reveal the dependency structure for binary data.
Hence the U values are similar across conditional and marginal models. Marginal model
adopting a CI correlation matrix has the least negative U value showing that observations
within patients are essentially independent.
For the plasma citrate concentration data, the (s) are much larger for all marginal
models and the autoregressive parameter in the CAR1 model is strongly significant.
This shows that the strength of serial correlation is much stronger. Clearly the c in
the correlation matrix of EC type is generally smaller than the in the AR1 matrix
or the 1 in the TOEP2 and TOEP3 matrices because EC assumes equal correlation
independent of the time-lag between any two observations. Moreover, 1 in the TOEP3
matrix which describes the correlation between observations of 1 time-lag is generally
larger than 2 which reveals the correlation between observations of 2 time-lag. Since
the CAR1 model gives the least negative U value, it again confirms the strong serial
correlation between observations within patients. However, across marginal models, the
U value is least negative for model with CI correlation matrix but this value is very close
to that of model with EC matrix. For this data, we believe that the EC matrix is the
best correlation matrix to describe the dependency structure between observations from
the same subject.
Lastly, for the methadone clinic data, Table 2 shows that the (s) are moderate
for all marginal models and the autoregressive parameter in the CAR1 model is
also significant. The U values differ substantially across models possibly because of the
generally longer time series and their specific dependency structures. The marginal model
adopting an EC correlation matrix gives the least negative U value again confirming the
strong serial correlation between observations within patients but such serial correlation
does not decrease as the time-lag between observations increases. Again the EC matrix
is the best correlation matrix for this data to describe the dependency structure between
observations from the same patient.
6. Conclusion
Although the GEE approach has been proposed for more than twenty years, few
studies investigate choices of working correlation matrices for data with different correlation structures, possibly because parameter estimates for marginal models using a GEE
approach are known to be robust against misspecification of working correlation matrix.
However our experience with real data analyses shows the contrary: parameter estimates
(0 , 1 , 2 ) as well as the model fit measure U differ substantially across models adopting
different working correlation matrices. Results from three real data analyses suggest that
the EC matrix is an appropriate choice of working correlation matrix if the time series is
304
short or the serial correlation does not decrease sharply with time. This is perhaps due
to the limited information in binary data that favors simple correlation structures. If
computationally feasible, fitting models with all the five working correlation matrices and
choosing one with the least negative posterior expected utility U is also a good practice.
References
Andrews, D.F., Herzberg, A.M. (1985). Data. Springer-Verlag, New York.
Ballinger, G.A. (2004) Using Generalized Estimating Equations for Longitudinal Data
Analysis. Organizational Research Methods 7, 127-150.
Chan, J.S.K., Kuk, A.Y.C., Bell, J., McGilchrist, C. (1998) The analysis of methadone
clinic data using marginal and conditional logistic models with mixture or random
effects. Australian and New Zealand Journal of Statistics 40, 1-40.
Copas, A.J., Seaman, S.R. (2010) Bias from the use of generalized estimating equations
to analyze incomplete longitudinal binary data. Journal of Applied Statistics 37,
911-922.
Diggle, P.J ., Liang, K. Y., Zeger, S.L. (1996). Analysis of Longitudinal Data. Oxford
Science Publications.
Hardin, J., Hilbe, J. (2003). Generalized Estimating Equations. London: Chapman and
Hall.
Horton, N.J., Lipsitz, S.R. (1999) Review of software to fit generalized estimating equation regression models. The American Statistician 53, 160-169.
Kenward, M.G., Lesaffre, E., Molenberghs, G. (1994) Application of maximum likelihood and generalized estimating equatins to the analysis of ordinal data from a
longitudinal study with cases missing at random. Biometrics 50, 945-953.
Liang, K.Y., Zeger, S.L. (1986) Longitudinal data analysis using generalized linear
models. Biometrika 73, 13-22.
Lipsitz, S.R., Kim, K., Zhao, L.P. (1994) Analysis of repeated categorical data using
generalized estimating equations, Statistics in Medicine 13, 1149-1163.
Lipsitz, S.R., Laird, N.M., Harrington, D.P. (1991) Generalized estimating equations
for correlated binary data: using the odds ratio as a measure of association,
Biometrika 78, 153-160.
Miller, M.E., Davis, S.C. and Landis, J.R. (1993) The analysis of longitudinal polytomous data: Generalized estimating equations and connections with weighted least
squares, Biometrics 49, 1033-1044.
Park, T. (1993) A comparison of the generalizing estimating equation approach with the
maximum likelihood approach for repeated measurements. Statistics in Medicine
12, 1723-1732.
305
Prentice, R.L. (1988) Correlated binary regression with covariates specific to each binary
observation. Biometrics 44, 1033-1048.
Prentice, R.L., Zhao, L.P. (1991) Estimating equations for parameters in means and
covariance of multivariate dicrete and continuous responses, Biometrics 47, 825839.
Walker, S. G., Gutierrez-Pena, E. (1999). Robustifying Bayesian Procedures. Bayesian
Statistics 6, 685-710.
Zeger, S.L., Liang, K.Y. Liang, Albert, P.S. (1988) Models for Longitudinal Data: A
Generalized Estimating Equation Approach, Biometrics 44, 1049- 1060.
Zhao, L.P., Prentice, R.L. (1990) Correlated binary regression using a generalized
quadratic model. Wiley Interdisciplinary Reviews: Computational Statistics 77,
642-648.
Received March 14, 2013; accepted August 15, 2013.