Statistical Issues
Statistical issues on the analysis Andrew Blance , Yu-Kang Tu1,2,
statistical literature (3, 4), the impact of baseline effects, particularly on statistical power, is not widely appreciated. Furthermore, insufficient consideration is given to the choice of statistical methods and their consequences in non-randomized studies, particularly as non-randomization reverses otherwise standard advice to use ANCOVA to generally maximize statistical power within randomized controlled trials (RCTs).

The aim of this article is to provide a non-technical introduction to the current problems in study design and associated analyses of follow-up studies in oral health research, particularly addressing the issues of: baseline effects, power and non-randomization.

Baseline effects

Many studies in the dental literature show an association between baseline outcome status and change from baseline, i.e. a treatment–baseline interaction or baseline effect. For instance, in periodontal follow-up studies, probing pocket depth (PPD) reductions and clinical attachment level (CAL) gains have often been found to be positively associated with baseline measurements of PPD and CAL (5, 6). Similarly, the effect of orthodontic treatment of malocclusions, assessed as changes in the peer assessment rating (PAR) score, has been found to be positively associated with pre-treatment PAR scores (7–12). This is not unique to dentistry, and examples are found in studies of hypertension treatments, showing that patients with higher-than-average blood pressure might experience greater blood pressure reduction following a pharmacological intervention than those with baseline blood pressures lower than the study average (13).

The problem is that use of correlation or regression to test the association between change in an outcome and its baseline value suffers a serious statistical artefact: mathematical coupling (MC) (14, 15). MC occurs where there exists a formulaic relationship between two variables, i.e. one can be expressed as a function of the other. MC distorts the perceived relationship between variables, as the usual statistical testing of the null hypothesis (i.e. that the correlation coefficient or regression slope is zero) becomes inappropriate. For instance, suppose that pre-treatment PPD is x1, post-treatment PPD is x2, and therefore PPD reduction following treatment is x1 − x2. To correlate (or regress) x1 − x2 with x1 may invalidate the usual null hypothesis because x1 appears in both variables. Any association between x1 − x2 and x1 (i.e. a non-zero statistical correlation between x1 − x2 and x1) may exist, in part, because of MC (as x1 − x2 and x1 are formulaically related). For instance, if x1 and x2 were two series of random numbers with the same mean and standard deviation, the expected correlation between x1 and x2 is close to zero. However, it can be shown that the correlation between x1 − x2 and x1 in such circumstances will be close to 1/√2 ≈ 0.71 (16). This value can be highly significant when tested against the (incorrect) null hypothesis of zero, even with a small sample size (14). Researchers may thereby be misled to infer an underlying 'causal' relationship between x1 − x2 and x1, where none exists.

Problematic uses of correlation and/or regression in analysing the association between treatment effects and baseline values have been noted for a long time now (17–20). One needs to know the correct null hypothesis, and a method has been proposed to obtain an estimate of this (21). However, this approach does not provide a gauge of the extent of association, as provided by a correlation coefficient. Alternatively, over 40 years ago, Oldham (17) suggested that one solution was to test the correlation of x1 − x2 with the average (x1 + x2)/2. The reason is that to know whether or not a baseline effect exists, a statistically correct approach is to test for differences in the variances of the two measurements, rather than to test the correlation coefficient between change and baseline. In the latter (erroneous) approach, MC adds to and exacerbates the statistical artefact known as regression to the mean (RTM) (22, 23), as the variables and also their measurement errors are formulaically related. A more technical explanation of this is given elsewhere (15).

It is important to note that Oldham's method does not remove MC; rather, it uses the fact that if there is a baseline effect, the follow-up measurements will vary differently from the baseline measurements, because the baseline effect will decrease the value of the observations. For example, consider periodontal treatment where a baseline effect means that initially deeper pocket depths (PPD) reduce (improve) more than initially shallower pockets. In statistical terms, this means that the follow-up measurements have a smaller standard deviation than the baseline measurements. An illustration adopting vector geometry is presented here, while the statistical theory is
provided in brief in Appendix 1. For readers not wishing to consider vector geometry, the following section can be omitted without loss of continuity.

Using vector geometry (24), pre-treatment (x1) and post-treatment (x2) PPD values can be represented as vectors with lengths equal to their standard deviations (SDs), positioned such that the cosine of the angle between them is their bivariate correlation (Fig. 1). Under the null hypothesis (H0) of no baseline effect, the SD of pre- and post-treatment values should be equal (15, 17). In vector-geometrical terms this means that the two vectors x1 and x2 are perpendicular, and their lengths the same. (Note: we use bold to distinguish between vector representation of the variables and their usual variable representation.) The correlation between change (x1 − x2) and baseline (x1) is now equivalent to the cosine of the angle ψ between the vectors x1 − x2 and x1. This is typically not zero, but depends upon the angle between x1 and x2, i.e. the correlation between pre- and post-treatment PPD. When this is near zero, i.e. the vectors x1 and x2 are perpendicular, the angle between x1 − x2 and x1 is 45°, and its cosine is 1/√2 ≈ 0.71. Thus, under H0, the correlation between change and baseline is generally far from the standard assumption of a value of zero, and this distortion to the null hypothesis is a consequence of MC.

Vector geometry also illustrates the rationale behind Oldham's method. A relationship between change and baseline requires that the variances of the baseline and follow-up measurements are unequal, i.e. one standard deviation is smaller than the other. In vector geometrical terms, this would mean unequal lengths of x1 and x2. However, under H0 (of no underlying relation), the vectors x1 and x2 are of the same length, and the vectors x1 − x2 and x1 + x2 are therefore always perpendicular, irrespective of the angle between x1 and x2 (Fig. 1). Thus, the correlation between x1 − x2 (change) and (x1 + x2)/2 (mean) is always zero, under H0. Therefore, although MC remains, the use of Oldham's method under H0 provides a special instance where its adverse effect (i.e. distortion to the null hypothesis) is annulled.

[Figure 1: vector diagrams in two panels, (a) and (b), showing the vectors x1, x2, x1 − x2 and x1 + x2 and the angle θ between x1 and x2.]

Fig. 1. Variables x1 (baseline PPD) and x2 (follow-up PPD) represented as vectors with lengths equal to their standard deviation (SD); cosine θ is the correlation between x1 and x2; under H0 (the SDs of x1 and x2 are equal) the vectors x1 − x2 and x1 + x2 are always perpendicular, irrespective of the correlation between x1 and x2: (a) the correlation between x1 and x2 is zero, hence θ = 90° and ψ = 45°; (b) the correlation between x1 and x2 is positive, hence θ < 90° and ψ > 45°. MC is still present but use of Oldham's method annuls the effects.

Modelling baseline effects

Simple statistical methods, such as Oldham's correlation (17), have been recommended to overcome the problem of testing the interaction between treatment effects and baseline values. However, these methods have limited applications. For instance, Oldham's method assumes that measurement errors are constant across occasions, and cannot take into consideration other explanatory variables, such as treatment group variables. An alternative approach would be to use multilevel modelling (MLM) (25, 26), which is more flexible in dealing with repeated measurement data, avoids the problems of MC, permits multiple follow-up time-points, and permits the inclusion of additional covariates (27–31). MLM is more complex than Oldham's method, so we only outline the basic principles here for the pre-/post-test study design; more technical details and discussion are given in Appendix 2 and elsewhere (32).

The MLM required to analyse change in relation to baseline, while completely avoiding MC, is where one specifies baseline and follow-up values as repeated outcomes (at level 1) clustered within individuals (at level 2). Within this model, measurement occasion is a covariate, where its coefficient exhibits random variation about its mean (26). This is known as a random slope model because the estimated slope (randomly) varies across individuals (level 2) (33). The occasion covariate is centred about 0 to aid model-fitting procedures (32), and its interval, though arbitrary, is set to 1 so that interpretation of its regression coefficient becomes the mean change between occasions. The random structure of the model comprises subject-level random intercept, subject-level random slope and a covariance between them, which is used to derive
Analysis of change in follow-up studies
the correlation between baseline (intercept) and change (slope) (34), free from the distortion due to MC. This strategy can be extended to accommodate observations of multiple sites (e.g. treatment of different lesions) within the same individual, by including an extra level for site. Thus, baseline and follow-up values (level 1) are clustered within sites (level 2), which in turn are clustered within individuals (level 3). Furthermore, other factors such as treatment group may be incorporated in the MLM as additional covariates. More complex variations can also be developed to consider multiple follow-up measures, though outlining these is beyond the scope of this article.

As a simple example, consider orthodontic PAR scores to evaluate the effect of orthodontic treatment of malocclusions: baseline and follow-up PAR scores form level 1 observations nested within subjects at level 2. PAR scores from both occasions are regressed on the occasion covariate and its coefficient is allowed to exhibit random variation about an overall mean value. MC is not present in this model as the dependent variable has no formulaic relationship with the independent variable.

Statistical power

When conducting a randomized controlled trial (RCT), a priori power calculations are necessary to determine the required sample size. This is often overlooked or under-reported in the oral health literature (35). In the repeated measurement study design, typically adopted by RCTs, it is not well known among oral health researchers that the analytical method of choice affects statistical power. Moreover, the power of most statistical methods to analyse repeated measurement designs is affected by baseline effects.

In a separate study (36), only summarized here, computer simulations were performed to compare the power of four univariate statistical methods and two multivariate statistical methods for the analysis of change in a hypothetical RCT involving two measurements, one at baseline and the other at follow-up. The univariate methods considered were: (a) testing post-treatment scores only using the two-sample t-test; (b) testing change scores using the two-sample t-test; (c) testing percentage change scores using the two-sample t-test; and (d) analysis of covariance (ANCOVA). The two multivariate methods considered were: (e) multivariate analysis of variance (MANOVA); and (f) multilevel modelling (MLM). All simulations were undertaken initially assuming that treatment effects were not related to baseline values (i.e. there was no baseline effect) and repeated assuming that treatment effect would increase for higher baseline values (i.e. there was a baseline effect). In general, ANCOVA proved to be the most powerful method and always had greater power than the other commonly used methods such as change scores and percentage change scores. The two multivariate methods did not achieve greater power than ANCOVA (37).

Many statisticians claim that ANCOVA always achieves the greatest power unless the correlation between the pre- and post-treatment measures is zero, at which point ANCOVA achieves the same power as using post-treatment values only (3). However, this is true only when the sample size is 'reasonably' large. In our simulations (36), it was noted that ANCOVA might achieve less power than testing post-treatment values only, where the correlation between the pre- and post-intervention measurements was low (≤0.3), corresponding to varying treatment effect across individuals, and the sample size was small (≤20). The reason for this is that ANCOVA uses baseline values as a covariate and thus loses one degree of freedom more than the other methods; for small sample sizes, one degree of freedom can have a substantial effect if the correlation between pre- and post-treatment values is also small. Given that the average sample size of RCTs in oral health research is quite small (36), this finding might be important. Otherwise, in general, ANCOVA is the preferred method of analysis for reasonably sized RCTs, as this yields optimal statistical power.

ANCOVA and Lord's paradox

Although ANCOVA is recommended for RCT data (3, 4, 37), and is typically described as useful because it 'adjusts for baseline differences', the implicit assumption underlying ANCOVA is often overlooked or misunderstood. Consequently, many researchers have developed the naïve view that ANCOVA adjusts for baseline differences between groups, when the reality is that it adjusts only for baseline differences within groups. ANCOVA achieves this adjustment within treatment groups by using baseline values as a covariate, and it is this
that increases statistical power. If there is an interaction between baseline values and treatment groups, i.e. the patient selection process causes differences in the baseline values between treatment groups, the assumptions for ANCOVA may be violated and subsequent conclusions drawn could be erroneous.

For RCTs, no substantial differences in the mean baseline values across groups should exist, because (appropriate) randomization ensures that the distributions of baseline variables are very similar. In reality, small differences might be found, though these are assumed to be caused by chance alone and will not bias the ANCOVA estimates. By implication, within observational studies, i.e. where randomization is not performed, or randomization is not conducted appropriately, using ANCOVA to adjust for baseline differences could mislead by introducing bias into the ANCOVA estimates, giving rise to Lord's paradox (38) and yielding difficulties in the interpretation of results.

Lord's paradox occurs where baseline differences cannot be attributed to chance alone. Lord's paradox dictates that in instances where real baseline differences exist, it is erroneous to attempt to adjust for baseline differences, because ANCOVA has the potential to yield biased estimates of treatment differences (see Fig. 2). The original example described by Lord (38) is where one examines differences between males and females in the changes in body mass. Suppose, for instance, we wish to know if a special diet has a differential impact by sex on weight loss. Males and females will have different mean body mass at baseline and this cannot be attributed to chance, as sex cannot be assigned randomly. Controlling for baseline body mass in this instance is questionable and will invoke Lord's paradox.

To visually explain this phenomenon with respect to follow-up studies, consider an investigation into the effect of water fluoridation on dental caries (DMFT) increments. Researchers might use data retrospectively or prospectively, collected from one geographical area with water fluoridation and another without fluoridation. Suppose repeated oral examinations are performed on children in both areas at an interval of 5 years and there are substantial differences in baseline caries rates. Even if the two areas had been randomly selected from fluoridated and non-fluoridated areas, there would remain the possibility that baseline differences occur due to the lack of 'appropriate' randomization. Here we imply that appropriate randomization warrants random allocation of fluoridation to previously non-fluoridated areas, which is not the same as randomly selecting fluoridated and non-fluoridated areas. Moreover, even with appropriate random allocation of fluoridation to areas, the study sample size would be only two! The problem is whether or not the DMFT increment over the 5-year period can be compared between the two areas; and if there is a significant difference between areas, can this be attributed to water fluoridation?

[Figure 2: scatter plot of DMFT at follow-up (vertical axis) against DMFT at baseline (horizontal axis), with one cloud of points for Area 1 and one for Area 2. Annotations: "Both groups of observations fall 'around' the 45° gradient line: there are no genuine area differences in change in DMFT"; "Difference between areas in baseline DMFT"; "'Apparent' difference is not zero: Lord's paradox".]

Fig. 2. Plot of baseline DMFT (x1) versus follow-up DMFT (x2) for children in a follow-up study of 5 years, following an implicit 'intervention' of water fluoridation in one area.

Many researchers might seek some form of statistical adjustment for differences in baseline DMFT. However, this would be inappropriate. Under the null hypothesis (H0) of the same 5-year change in DMFT among all children (i.e. irrespective of area), without any biological variation and/or measurement error, the follow-up DMFT (x2) plotted against baseline DMFT (x1) would yield a straight line (Fig. 2; the 45° dotted line). However, because of biological variation and/or measurement error, the reality is that the data form a 'cloud' of points around the 45° incline. Furthermore, as baseline DMFT differs between areas, there are two such 'clouds', one for each area (Fig. 2: the data points form ellipsoids). Because of RTM, the slope of the fitted line for follow-up DMFT regressed on baseline DMFT is not coincident with the 45° incline. Thus, the ANCOVA estimate of the difference between fluoridated and non-fluoridated areas is not zero, as required under H0. This artefactual effect of area on the changes in DMFT (which some could erroneously interpret as being due to fluoridation) is due to RTM, thereby yielding Lord's paradox.
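The fluoridation scenario can be sketched numerically. The following is a minimal simulation, not the authors' analysis: the area means (2.0 and 4.0), the common true change (1.0), the reliability (0.6), and all function names are hypothetical values chosen for illustration. Both areas share exactly the same true change, so H0 holds by construction. The ANCOVA estimate is computed from the pooled within-area slope, which equals the ordinary least-squares area coefficient when follow-up is regressed on area and baseline.

```python
import random
from statistics import fmean


def simulate_area(n, base_mean, true_change, true_sd, error_sd, rng):
    """Return observed (baseline, follow-up) pairs for one area."""
    pairs = []
    for _ in range(n):
        true_base = rng.gauss(base_mean, true_sd)
        x1 = true_base + rng.gauss(0.0, error_sd)                # observed baseline
        x2 = true_base + true_change + rng.gauss(0.0, error_sd)  # observed follow-up
        pairs.append((x1, x2))
    return pairs


def within_area_summary(pairs):
    """Means of x1 and x2, plus within-area sum of squares and cross-products."""
    m1 = fmean([p[0] for p in pairs])
    m2 = fmean([p[1] for p in pairs])
    sxx = sum((p[0] - m1) ** 2 for p in pairs)
    sxy = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    return m1, m2, sxx, sxy


def lords_paradox_demo(n_per_area=2000, reliability=0.6, seed=1):
    """Compare change-score and ANCOVA estimates of the area difference under H0."""
    rng = random.Random(seed)
    true_sd = 1.0
    # Error SD chosen so that var(true) / var(observed) equals the reliability.
    error_sd = true_sd * ((1 - reliability) / reliability) ** 0.5
    # Hypothetical areas: different baseline means, identical true change.
    area1 = simulate_area(n_per_area, 2.0, 1.0, true_sd, error_sd, rng)
    area2 = simulate_area(n_per_area, 4.0, 1.0, true_sd, error_sd, rng)

    a1_m1, a1_m2, a1_sxx, a1_sxy = within_area_summary(area1)
    a2_m1, a2_m2, a2_sxx, a2_sxy = within_area_summary(area2)

    # Pooled within-area slope of follow-up on baseline (flattened by RTM).
    slope = (a1_sxy + a2_sxy) / (a1_sxx + a2_sxx)
    # Difference between areas in mean change scores.
    change_diff = (a2_m2 - a2_m1) - (a1_m2 - a1_m1)
    # ANCOVA-adjusted difference between areas in follow-up values.
    ancova_diff = (a2_m2 - a1_m2) - slope * (a2_m1 - a1_m1)
    return change_diff, ancova_diff, slope


if __name__ == "__main__":
    change_diff, ancova_diff, slope = lords_paradox_demo()
    print(f"change-score difference: {change_diff:.3f}")
    print(f"ANCOVA-adjusted difference: {ancova_diff:.3f}")
    print(f"pooled within-area slope: {slope:.3f}")
```

Under these assumptions the change-score difference should be close to zero, whereas the ANCOVA-adjusted difference should be clearly non-zero because the within-area slope falls below 1 (here roughly equal to the assumed reliability). The apparent area effect is therefore an artefact of RTM combined with the baseline difference, not a real difference in change.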
to the treatment, giving rise to RTM, the correlation between baseline and post-treatment values ($r_{12}$) will be smaller than 1, and, therefore, the correlation between change and baseline tends to be greater than 0, unless $\sigma_2^2$ is much greater than $\sigma_1^2$. It is directly the consequence of the formulaic relationship between x1 − x2 and x1 (i.e. MC) that the numerator in A1 depends upon $r_{12}$, and when this is not unity (i.e. when RTM operates), the usual null hypothesis of zero correlation is effectively 'distorted' (away from zero). Consequently, to correctly test the association between change and baseline, the impact of RTM needs to be estimated and then explicitly accommodated, and, unfortunately, this is not always achievable.

The Pearson correlation coefficient for Oldham's method is given by (17):

$$r_{x_1 - x_2,\,(x_1 + x_2)/2} = \frac{\sigma_1^2 - \sigma_2^2}{\sqrt{\left(\sigma_1^2 + \sigma_2^2\right)^2 - 4\, r_{12}^2\, \sigma_1^2 \sigma_2^2}}. \tag{A2}$$

Clearly, the correlation between (x1 − x2) and (x1 + x2)/2 will be zero if the variances of x1 and x2 are equal, and positive if and only if $\sigma_1^2$ is greater than $\sigma_2^2$. The impact of MC has been annulled, even though there remains a formulaic relationship between (x1 − x2) and (x1 + x2)/2 because, in this special instance, the numerator of equation (A2) no longer contains $r_{12}$, and is therefore unaffected by RTM (when $r_{12}$ is not unity).

Appendix 2

In order to illustrate the MLM approach in determining a baseline effect (i.e. interaction between change following treatment and baseline), while completely avoiding MC, consider the use of orthodontic PAR scores to evaluate the malocclusion of patients pre- and post-treatment. Baseline and follow-up PAR scores form level 1 observations ($i = 1, 2$) nested within subjects at level 2 ($j = 1, \ldots, N$), where $N$ is the number of study subjects. PAR scores from both occasions are then regressed on the occasion covariate, say $T$, which is centred about zero [to avoid inducing the bias: see Blance et al. (32) for details] and adopts values such that it spans an interval of one ($T = \pm 1/2$). The coefficient for $T$ exhibits random variation about its mean, yielding a multilevel regression model of the following form:

$$\mathrm{PAR}_{ij} = B_{0ij} + B_{1j} T, \qquad B_{0ij} = B_0 + u_{0j} + e_{0ij}, \qquad B_{1j} = B_1 + u_{1j},$$

where $B_0$ is the mean intercept of the sample values at a time point midway between baseline (pre-intervention) and follow-up (post-intervention); $B_1$ is the slope of the change in PAR score between measurement occasions; $u_{0j}$ is the residual variation for individual $j$ about the mean intercept due to population biological variation (heterogeneity between individuals of a population); $e_{0ij}$ is the residual variation for individual $j$ about the mean outcome on measurement occasion $i$, due to instantaneous biological variation (variation within an individual) and/or measurement error (which may differ between occasions though it is assumed at least for this illustration to be independent across occasions); $u_{1j}$ is the responsive biological variation between subjects, i.e. the variation of the regression slope; and all variation is assumed to be normally distributed with zero mean.

Allowing for instantaneous biological variation and/or measurement error to differ across measurement occasions, there are five random parameters to be estimated, yet only three degrees of freedom: one for each occasion and one between occasions (change). It is therefore necessary to reduce the number of random parameters by making various model assumptions. The final model is contingent on these assumptions. For instance, if we were to acknowledge that we are unable to distinguish between population biological variation and instantaneous biological variation and/or measurement error, and we further assume the latter to have constant variance across occasions, we may estimate either the subject-level random intercept or the occasion-level random intercept, though not both. It does not affect our interpretation of the model whichever we choose (constraining the other to be zero), as the chosen estimate represents the combined effects of population and instantaneous biological variation and measurement error across the study period.

While the MLM strategy removes MC, it does not remove the impact of measurement error, i.e. the remaining effects of RTM. However, if an estimate of the error variance were obtained (or estimated), adjustment can then be made for the effects of measurement error. Although the working details of this are beyond the scope of this article, it can be shown that, providing the measurement error variance is constant across
