Transformation, Normalization and Batch Effect in The Analysis of Mass Spectrometry Data For Omics Studies
Abstract
Data transformation, normalization and handling of batch effect
are a key part of data analysis for almost all spectrometry-based omics
data. This paper reviews and contrasts these three distinct aspects.
We present a systematic overview of the key approaches and criti-
cally review some common procedures. Much of this paper is inspired
by mass spectrometry based experimentation, but most of our discus-
sion carries over to omics data using distinct spectrometric approaches
generally.
spectrometry. An excellent recent introduction to statistical methods in this
field was written by Naes et al. [13]. The log-transform has a special significance
in traditional spectrometry. A crucial component of the appeal of the
log-transform in (near) infrared spectrometry is due to Beer’s law (1852) [2],
which states that absorbance of light in a medium is logarithmically related
to the concentration of the material through which the light must pass before
reaching the detector. In other words,
\[
\text{Absorbance} = \log\left(\frac{I_0}{I}\right) = kLC,
\]
where I0 is the incident intensity of the light and I the measured intensity
after passing through the medium, with C the concentration, L the path
length the light travels through and k the absorption constant. Another
way to put this is that relative intensity is linearly related to concentration
through the log-transform as
\[
\log\left(\frac{1}{I}\right) = \alpha + \beta C,
\]
where the path length, initial intensity and absorption constants are sub-
sumed in the parameters α, β. Similar formulae exist for light reflection
spectrometry. The formula has been used to provide justification for the ap-
plication of classical linear regression procedures with log-transformed uni-
variate spectrometric intensity readings.
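The step from Beer's law to the linear form can be made explicit: log(1/I) = -log(I0) + kLC, so that α = -log(I0) and β = kL. The following is a minimal sketch in Python of the resulting linear regression, assuming a base-10 logarithm and noise-free readings; the values of k, L and I0 are invented for illustration and do not come from any real instrument.

```python
import numpy as np

# Hypothetical values, chosen for illustration only
k, L, I0 = 0.8, 1.0, 100.0       # absorption constant, path length, incident intensity
C = np.linspace(0.1, 2.0, 20)    # concentrations

# Beer's law with a base-10 logarithm: log10(I0 / I) = k * L * C
I = I0 * 10.0 ** (-k * L * C)

# Regress log10(1 / I) on C; polyfit returns the slope first, then the intercept
beta, alpha = np.polyfit(C, np.log10(1.0 / I), deg=1)

print(f"beta  = {beta:.3f} (k * L = {k * L:.3f})")
print(f"alpha = {alpha:.3f} (-log10(I0) = {-np.log10(I0):.3f})")
```

On the original intensity scale the same relation is exponential in C, which is why a straight-line fit is only appropriate after the log-transform.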
The above constitutes an argument in favor of the log transform for
spectrometry data based on non-linearity of spectral response. It is partly
responsible for causing the early literature on statistical and chemometric
approaches in the analysis of spectrometry data to be based on the log-
transformed measures. Beer’s law applies only to (univariate) IR or NIR
spectrometric readings at a single wavelength. There is no multivariate ex-
tension of the law to cover full spectra consisting of the spectrometric readings
across an entire wavelength range. Nevertheless, the log-transform would also
be routinely applied once truly multivariate spectrometry became available,
jointly recording the spectrometric response at several wavelengths, as shown
in figure 1 which plots 50 NIR reflectance spectra across the range 1100-2500
nanometers.
[Figure 1: 50 NIR reflectance spectra; vertical axis log(1/R), horizontal axis wavelength in nanometres.]
mass spectrometry based omics data analysis. This is because the statistical
reasons and rationale behind the log-transform are more enduring and
powerful than the appeal to Beer’s law might reveal. We discuss four distinct
arguments in favor of the log-transform.
1.2 Skew and influential observations
A second issue related to the above is that spectral measures will tend to be
extremely skewed, not only within an individual (across the within-sample
spectrum responses), but also across samples or patients at a single m/z
point. This may render the data unsuitable for standard analyses such as
linear discriminant or similar when used on the original scale.
A related issue is that the skew may cause or be associated with a limited
set of highly influential observations. This is particularly troubling in omics
applications, as the spectra are typically very high-dimensional observations
either when storing the response on a large grid of m/z values which can
easily range in the thousands, or after reduction to an integrated peak list.
Influential observations may affect the robustness of conclusions reported, as
results may differ substantially after removal of a single - or isolated group
of - spectra from the analysis. A good example may be found in the cal-
culation of a principal component decomposition as a dimension reduction
prior to application of some subsequent data analytic procedure. Principal
component analysis is known to be sensitive to extreme observations [7], particularly
in high-dimensional applications with small sample size. The same
phenomenon will however also tend to apply for other analysis approaches,
such as regression methods, discriminant procedures and so on. Transforma-
tion to log-scale may mitigate this problem.
operation at each m/z position along the mass/charge range which is being
investigated. As a consequence, spectrometry measures tend to have statisti-
cal properties reminiscent of those observed in Poisson (counting) processes.
The variation of the spectral response tends to be related to the magnitude
of the signal itself. This implies that multiplicative noise models are often a
more faithful description of the data. Multiplicative error data are however
more difficult to analyze using standard software. Log-transforming can be
used to bring the data closer to the additive error scenario.
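The effect of the log-transform on multiplicative noise can be illustrated with a small simulation. The sketch below assumes a log-normal error model (observed = signal × exp(noise)) and three invented signal levels; both are assumptions made for illustration, not a description of any particular instrument.

```python
import numpy as np

rng = np.random.default_rng(1)

signal = np.array([10.0, 100.0, 1000.0])   # hypothetical true intensities

# Multiplicative error with the same relative noise level at every intensity
noise = rng.normal(0.0, 0.2, size=(5000, signal.size))
observed = signal * np.exp(noise)

sd_raw = observed.std(axis=0)           # grows with the magnitude of the signal
sd_log = np.log(observed).std(axis=0)   # roughly constant after the log-transform

print("sd on original scale:", np.round(sd_raw, 2))
print("sd on log scale:     ", np.round(sd_log, 3))
```

On the original scale the standard deviation scales with the signal, whereas on the log scale the error is additive with an approximately constant spread.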
which reduces to
\[
f(E(Y)) = \alpha + \log\left(\frac{X_1}{X_2}\right).
\]
The result is that the linear dependence of the expected outcome via the link
function on the log-transformed data actually implies regression on the log-ratio
of the spectral responses X1 and X2, such that any multiplicative effect
would cancel. This can be regarded as an implicit form of standardization
through the log-transform. It is a general property which can be used with
many statistical approaches such as (generalized) linear regression, discrimi-
nant analysis and so on.
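The cancellation of a multiplicative effect in the log-ratio can be checked numerically. In the sketch below the responses x1 and x2 and the sample-specific factor c are simulated, hypothetical quantities; the only point is that log(c·x1) − log(c·x2) equals log(x1) − log(x2) for any positive c.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

x1 = rng.uniform(50, 100, size=n)   # hypothetical spectral responses at one m/z position
x2 = rng.uniform(50, 100, size=n)   # hypothetical responses at a second position
c = rng.uniform(0.5, 2.0, size=n)   # hypothetical sample-specific multiplicative factor

log_ratio_clean = np.log(x1) - np.log(x2)
log_ratio_scaled = np.log(c * x1) - np.log(c * x2)

# The factor c drops out of the log-ratio for every sample
print(np.allclose(log_ratio_clean, log_ratio_scaled))
```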
For spectrometry data generally and mass spectrometry in particular, good
advice is to replace the raw measurements with log-transformed values at an
early stage of the analysis, by application of the transformation
\[
\log(Y + a),
\]
where a is a suitably chosen constant. Many statistical texts will also mention
the Box-Cox transform when discussing the log transform. For (mass) spectrometry
data this approach is of limited value however, both because the optimal
transform may lack the multiple justifications given above - which might as
well be used as a priori grounds for choosing the logarithm - and because
the approach is by definition univariate, while a modern mass spectrometry
reading will consist of thousands of measures across m/z values and samples.
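A shifted log-transform of this kind is straightforward to apply. The sketch below is one possible implementation; the function name `shifted_log` and the default a = 1 are illustrative assumptions, since the text does not prescribe how the constant should be chosen.

```python
import numpy as np

def shifted_log(y, a=1.0):
    """Return log(y + a); a is an analyst-chosen constant (a = 1 is only an example)."""
    y = np.asarray(y, dtype=float)
    if np.any(y + a <= 0):
        raise ValueError("y + a must be strictly positive")
    return np.log(y + a)

intensities = np.array([0.0, 9.0, 99.0, 999.0])   # invented raw intensities, including a zero
transformed = shifted_log(intensities, a=1.0)
print(transformed)
```

The shift keeps zero intensities finite after transformation, at the price of making the result depend on the chosen constant for small values.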
2 Normalization and scaling
Normalization is an issue which is often encountered in omics data analyses,
but is somewhat resistant to a precise definition. We can identify what are
usually perceived as the main objectives of it and warn about the dangers
associated with the topic, so that we may avoid the most common pitfalls.
The objective of normalization can be loosely described as removing any
unwanted variation in the spectrometric signal which cannot be controlled
for or removed in any other way, for example by modifying the experiment.
This sets it apart from batch effect, which we will discuss later.
The latter can sometimes be adjusted for or accommodated by changing the
experiment so that its effects can be either explicitly removed or adjusted
for in subsequent analysis, by exploiting the structure of the experimental
design. Not all effects can be accounted for in this manner however.
Examples of effects which may induce a need for normalization are
variations in the amount of material analyzed, ionization changes (e.g., due to
small changes in ‘spotting’ sample material to plates), subtle fluctuations in
temperature, small changes in sample preparation prior to measurement (such
as bead-based processing to extract protein), differential sample degradation,
sensor degradation and so forth. An important feature of such variation is
that, while we may suspect that such sources of variability are present and affect
our experiment, they are difficult to either control or predict. Typically,
all we can do is try to adjust for them post hoc, but prior to any
subsequent analysis steps.
Important in devising an appropriate normalization strategy is that we
should try to remove or reduce these effects on the measured spectral data,
while retaining the relevant (biological) signal of interest. Unfortunately,
this is typically problematic for spectrometry. This is because, as explained
in the above paragraphs on transformation, the spectrometric signal and
its variability are typically linked, often even after log-transform, while the
unwanted sources of variation affect all measures derived from the spectrum.
Imagine we have a study recording a mass spectrum on a dense grid of
finely spaced points along the mass range or alternatively storing the data as a
sequence of integrated peaks representing protein or peptide abundances and
this for a collection of samples (be it patients, animals or other). We write the
ordered sequence of spectrometry measures for the ith sample unit as
$x_i = (x_{i1}, \ldots, x_{ip})$, with p the number of grid points at which the spectrum is stored
or the number of summary peaks. A transformation choice favored by some
analysts is to apply, early on in the analysis (possibly after first application of
the log transform), standardization to unit standard deviation of all spectral
measures at each grid point separately. In other words, we replace the original
data at each jth grid point with the measures $x_{ij}/\mathrm{std}(x_j)$, where $\mathrm{std}(x_j)$ is
the standard deviation of the measures $x_j = (x_{1j}, \ldots, x_{nj})^T$ across all n
samples at that grid point. This procedure, sometimes also referred to as
reduction to z-scores, is a form of scaling. It is identical to standardization
to unit standard deviation of predictor variables in regression analysis ([16],
pages 124-125, [15], page 349 and pages 357-358) when predictor variables
are measured at different measurement scales (different units, such as kg, cm,
mg/l and so on). Indeed, in the early days of regression analysis, reporting
standardized regression coefficients was an early attempt at assessing relative
importance of effects.
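For concreteness, the scaling to unit standard deviation described above can be sketched as follows. The data matrix is simulated (rows are samples, columns are grid points) and the log-transform is applied first; this only illustrates the operation, which the discussion that follows argues against for spectrometry data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: n = 40 samples (rows), p = 5 grid points (columns)
X = rng.lognormal(mean=[1, 2, 3, 4, 5], sigma=0.5, size=(40, 5))

logX = np.log(X)
col_sd = logX.std(axis=0, ddof=1)   # std(x_j) across the n samples at grid point j
scaled = logX / col_sd              # x_ij / std(x_j)

print(np.round(scaled.std(axis=0, ddof=1), 6))   # every column now has unit standard deviation
```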
For spectrometry data generally and mass spectrometry in particular,
standardization is more complicated. Measurement units are by definition
identical within a spectrum across the mass range. This would counsel
against transforming to unit standard deviation as calibrated effects then
remain directly comparable across the mass range on the original untrans-
formed scale. There are however stronger arguments against this form of
standardization in spectrometry. Figure 2 illustrates the issues. The plot
shows mean (MALDI-TOF) spectra from a clinical case-control study, after
suitable transformation. To ease comparison, we plot the negative control
[Figure 2: Mean spectra for the cases group and the control group; spectral intensity versus mass-to-charge value (Dalton), range 1000 to 5000, with labelled peaks at 1352.4, 1467.7, 1780 and 1867.2 D.]
spectrum versus the mean case spectrum. The rectangular region highlights
and enlarges a region between 1200 and 1900 Dalton where most of the dis-
criminant effects are found between the cases and control groups, based on a
discriminant analysis. Indicated are 4 key peaks at 1352.4, 1467.7, 1780 and
1867.2 Dalton which together summarize most of the between-group contrast
between cases and controls. Figure 3 shows different statistics calculated on
the same data within the same mass range. The top plot again shows the
mean spectra for cases and controls within the 1200-1900 Dalton region as
before, while the middle graph plots the weighted discriminant coefficients
obtained from a linear discriminant model calibrated from the data.
The discriminant analysis clearly identifies the peaks at 1352.4 and
1467.7 Dalton and contrasts these with the peaks at 1780 and 1867.2 Dalton.
The bottom graph in figure 3 shows the first two principal components calculated
on the same data and based on the pooled variance-covariance matrix.
Figure 3: The top plot shows mean cases and controls spectra separately.
The bottom curves are the loadings of the first two principal components
across the same mass/charge range. The middle curve shows the discriminant
weights from a logistic regression model calibrated to distinguish cases from
controls with the same data.
There are several things to note in this picture. The first is how much
the principal component and mean spectra curves resemble one another. The
first component closely approximates the mean control spectrum, while the
second component does the same for the mean cases spectrum. At first
sight, this might seem all the more remarkable, since the principal component
decomposition is based on the pooled variance-covariance matrix, and hence
on the ‘residual spectra’ $x_i - x_{g(i)}$, where $x_{g(i)}$ denotes the mean spectrum of
group g(i) to which the ith observation belongs, with g = 1, 2 for the cases and
control groups respectively. So the figure shows two different aspects of the
data. One is the systematic (mean) spectral response (top graphs), the other
is the set of deviations relative to the mean spectral outcome (bottom graphs).
From the figure, we can see that the component decomposition tells us that
the peaks at 1352.4 and 1467.7 Dalton are highly correlated and account for
much of the variation in the spectral data, as they weigh heavily in the first
principal component. Similarly, the second component summarizes much of
the expression in both peaks at 1780 and 1867.2, which are again highly
correlated. Because of this, the classification might as well be summarized
as a contrast between the first and second principal component, since this
would contrast peaks 1352.4 and 1467.7 with the expression at peaks 1780
and 1867.2 (see Mertens et al. [11] for the full analysis).
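A principal component decomposition based on the pooled variance-covariance matrix, as used above, can be sketched by first subtracting each group's mean spectrum and then decomposing the residual spectra. The function `pooled_pca` and the simulated data below are illustrative assumptions and do not reproduce the analysis of the study discussed.

```python
import numpy as np

def pooled_pca(X, groups, n_components=2):
    """Principal components of the pooled within-group covariance matrix."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    residuals = X.copy()
    for g in np.unique(groups):
        mask = groups == g
        residuals[mask] -= X[mask].mean(axis=0)   # residual spectra x_i - group mean
    dof = X.shape[0] - np.unique(groups).size
    # Rows of vt are eigenvectors of the pooled covariance matrix
    _, s, vt = np.linalg.svd(residuals, full_matrices=False)
    variances = s ** 2 / dof
    return vt[:n_components], variances[:n_components]

rng = np.random.default_rng(4)
groups = np.repeat([1, 2], 30)        # hypothetical cases (1) and controls (2)
X = rng.normal(size=(60, 8))          # hypothetical spectra on 8 grid points
X[groups == 1] += 5.0                 # large systematic difference between group means

loadings, variances = pooled_pca(X, groups)
print(loadings.shape, np.round(variances, 2))
```

Because the group means are removed before the decomposition, the systematic shift between the groups does not enter the components; only the within-group deviations do.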
This feature of the data where the mean expression and deviations from
the mean are closely linked as shown in the above example, is typical of
spectrometric variation. It is the consequence of the connection between
mean expression and variance we mentioned above when discussing the log-
transform and can be observed in almost all spectrometry data, often even
after log transforming. To put this differently, in spectrometry data, we
will find the signal where the variance is (even if we correct the variance
calculation for systematic differences in expression, as shown in our above
example). It is for this reason that transforming to unit standard deviation
should be avoided with spectrometry data, unless scale-invariant methods
are explicitly used to counter this problem.
In addition to the above considerations, there are also other arguments for
avoiding reduction to z-scores or transformation to unit standard deviation.
An important argument here is that summary measures such as means and
standard deviations are prone to outliers or influential observations, which
can be a particular problem in high-dimensional statistics and with spec-
trometry in particular. A specific problem with such form of standardization
is that it may cause problems when comparing results between studies. This
is because systematic differences may be introduced between studies (or sim-
ilarly, when executing separate analyses between batches - see further), due
to distinct outliers which affect the estimates of the standard deviations for
standardizing between repetitions of the experiment.
Our final comment on the above standardization approach is that medians
and interquartile ranges (IQR) are sometimes used instead of means
and standard deviations in an attempt to alleviate some of the robustness
concerns. Other authors advocate use of some function of the standard deviation,
such as the square root of the standard deviation, instead of the standard
deviation itself. This is sometimes referred to as Pareto scaling [18]. The rationale
for this amendment is that it upweights the median expressed features
without excessively inflating the (spectral) baselines. An advantage of the
approach may be that it does not completely remove the scale information
from the data. Nevertheless, the choice of the square root would still appear
to be an ad hoc decision in any practical data analytic setting.
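Both variants mentioned above can be written in a few lines. The sketch below assumes column-wise operation on a samples-by-features matrix and includes centering (by the median or the mean respectively); the function names and the simulated data are illustrative assumptions only.

```python
import numpy as np

def robust_scale(X):
    """Centre each column by its median and divide by its interquartile range."""
    med = np.median(X, axis=0)
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    return (X - med) / (q75 - q25)

def pareto_scale(X):
    """Centre each column by its mean and divide by the square root of its standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

rng = np.random.default_rng(9)
X = rng.normal(loc=[5.0, 50.0, 500.0], scale=[1.0, 4.0, 25.0], size=(200, 3))

R = robust_scale(X)
P = pareto_scale(X)

# After Pareto scaling the column standard deviations equal sqrt(original sd),
# so part of the scale information is retained
print(np.round(P.std(axis=0, ddof=1), 3), np.round(np.sqrt(X.std(axis=0, ddof=1)), 3))
```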
Some authors make a formal distinction between scaling and normalization
methods and consider the first as operations on each feature across the
available samples in the study [5]. Normalization is then specifically defined
as manipulation of the observed spectral data measurements on the same
sample unit or collection of samples taken from the same individual (within-spectrum
or within-unit normalization). A potential issue with the above
described approaches to normalization via statistical transformation is that
they are based on a borrowing of information across samples within an exper-
iment. Another extreme form of such borrowing is a normalization approach
which replaces the original set of spectral expression measures for a specific
sample with the sample spectrum measures divided by the sample sum, such
that the transformed set of measures adds to 1. An argument sometimes
used in favor of such transformation is that it would account for systematic
differences in abundance - possibly caused by varying degrees of ionization or
similar effects from sample to sample - such that only the relative abundances
within a sample are interpretable. Although this approach is unfortunately
common, it has in fact no biological foundation [5]. Even if arguments based
on either the physics or chemical properties of the measurement methodology
could be found, these could not be used in favor of such data-analytic ap-
proach as described above, which we shall refer to as ‘closure normalization’.
The problem with the approach is that it actually induces spurious - and
large - biases in the correlations between the spectral measures which mask
the true population associations between the compounds we wish to inves-
tigate. Figure 4 shows the effect of closure normalization on uncorrelated
normal data in 3 dimensions. The left plot shows scatterplots between each
of the three normally distributed measures. The right shows scatterplots of
the resulting transformed variables after closure normalization. The absolute
correlation between each variable pair has increased from 0 (for the original
uncorrelated data) to 0.5 (after transformation). This becomes particularly
spectral range in the transformed data, which cannot have biological inter-
pretation. Both approaches should be avoided.
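The spurious correlation induced by closure normalization is easy to reproduce by simulation. The sketch below draws three uncorrelated normal variables and divides each sample by its sum; the mean of 10 and standard deviation of 1 are assumed values chosen so that all measures are positive, and are not taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(5)

# Three uncorrelated normal variables; mean 10 and sd 1 are hypothetical choices
X = rng.normal(loc=10.0, scale=1.0, size=(20000, 3))

# Closure normalization: divide each sample by its own sum, so each row adds to 1
closed = X / X.sum(axis=1, keepdims=True)

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(closed, rowvar=False)

print("before:", np.round(corr_before[0, 1], 2))
print("after: ", np.round(corr_after[0, 1], 2))
```

Under these assumptions the pairwise correlation moves from approximately 0 to approximately -0.5, that is, an absolute correlation of about 0.5 that is purely an artefact of the constraint that the transformed measures sum to 1.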
Data normalized by the sum of the combined expression (closure nor-
malization) can be viewed as an instance of compositional data [1]. Hence,
instead of applying such normalization, one could therefore think of using
special-purpose methods from the compositional data analysis literature, or
to develop or adapt such methods for application in omics applications. This
has not been attempted to our knowledge at time of writing. As an alterna-
tive, it should be recommended to take a conservative approach and refrain
from excessive transformation when the consequences are not well under-
stood or accounted for in subsequent analysis. In such cases restricting to
log-transformation as discussed earlier is safer. In any case, the original un-
transformed data should always be at hand and stored to allow verification
of results through possible sensitivity analysis.
The above is only an introduction to some of the main forms of normal-
ization in use at this time. Many other forms exist and will undoubtedly
continue to emerge. An interesting one worth mentioning is the idea of ‘lag-
ging’ the spectra by taking differences between subsequent values within the
spectral range. With log-transformed spectrometric data, this is another
approach which induces ratios between subsequent spectral intensities which
eliminate multiplicative change effects. An example is found in an interesting
paper by Krzanowski et al. [10]. It has the drawback that results from subsequent
statistical analysis can be more difficult to interpret, but it might
be of use in pure prediction problems. Other forms of standardization and
normalization are also found in the literature, particularly methods which
seem to inherit more from common approaches in microarray analysis, such
as quantile normalization [6]. Ranking of spectral response, including the
extreme form of reducing to binary, has also been investigated. The latter
can be particularly useful as a simple approach when data are subject to a
lower detection limit [8]. Other forms of normalization and standardization
worth mentioning at time of writing are scatter correction and orthogonal
signal correction. We refer to Naes et al [13] for a good introduction to these
methods.
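The ‘lagging’ idea mentioned above can be illustrated as follows: differencing a log-transformed spectrum yields the logs of ratios between subsequent intensities, which are unaffected by a multiplicative change of the whole spectrum. The spectrum and the factor 3.7 below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

spectrum = rng.uniform(1.0, 10.0, size=12)   # hypothetical positive intensities
scaled = 3.7 * spectrum                      # same spectrum under a multiplicative change

# Differences of logs are logs of ratios between subsequent intensities
lag_a = np.diff(np.log(spectrum))
lag_b = np.diff(np.log(scaled))

print(np.allclose(lag_a, lag_b))   # the multiplicative factor is eliminated
```

Note that the lagged spectrum has one value fewer than the original and no longer refers to individual m/z positions, which is part of what makes subsequent interpretation harder.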
Which transformations should be applied first? What is a good order
of applying distinct normalization or transformation steps? There is some
difference of opinion between researchers on the precise sequence in which
various normalization procedures are applied to the data. As a general rule
it seems wise to apply logarithms early and calculate means and standard
deviations only after log-transforming.
The issue of normalization is closely linked to the problem of standardiza-
tion of mass spectra. Several definitions of standardization may be possible
here. One option is to define the problem as ‘external’ standardization, which
would form part of the experiment itself (as opposed to the post-experimental
data processing we described before) where we somehow try to change the
experiment so that part of the systematic experimental variation is either
prevented from occurring or could be accounted for through post-processing
of the data. Examples would be in the use of spike-in controls, on a sample
plate, or even within the sample material itself, so that the spectral response
can be adjusted for the expression of the known spike-in material which is
added. Another example would be in the use of technical controls on a sample
plate with known concentrations. Yet another example would be systematic
equipment re-calibration to re-produce a (set of) known standards, so that
sample-to-sample variation due to experimental drift is suppressed as much
as possible. All these approaches to standardization are different from the
above described methods in that they try to circumvent known sources of
variation by changing the experiment itself, rather than post-hoc attempting
to adjust for it.
and after the experiment. Before the experiment, because we may want to
tweak or change the experimental structure to take the presence of the batch
effect into account (this may involve discussion between both statistician and
spectrometrist). The objective is to change the experimental design so as to
avoid confounding of the batch structure with the effect or group structure
of interest. After the experiment, because we may wish to apply some data-
analytic approach to remove the effect (which may depend on the experiment
having been properly designed in the first place). Accounting for batch effect
in proteomic studies will hence involve two key steps.
experimentation. The first are time-fixed batch effects. Examples of these
are
• instrument re-calibration
Both types of batch effects may occur in proteomic experiments, but for
different reasons and with distinct consequences. The treatment of the
two types of batch effect will also differ.
3.1 Time-fixed batch effects
We consider a case-control study as an example. The experiment contrasted
175 cases with 242 healthy controls. Due to the large sample size and multiple
replicate measurements per sample, sample material needed to be assigned
to six target plates prior to mass spectrometric measurement. The plates
constitute a systematic - batch - structure within the experiment. On in-
specting within-sample medians and inter-quartile ranges (IQR), systematic
plate-to-plate differences were noted, shown in figure 5. Analysis of the data
using a discriminant approach led to correct classification of 97% of sam-
ples. Unfortunately, on closer investigation of the experimental design, it
turned out that all case material had been assigned sequentially to the first 3
plates, after which the control samples were thawed and assigned to the
subsequent plates. It is not possible to statistically adjust for such perfect
confounding between the plate structure and the potential between-group
effect which was the primary target of the research study. After discussion,
the study was abandoned, leading to significant loss of both experimentation
time and resources, among which the valuable sample materials.
A simple procedure exists to prevent such problems, called blocked ran-
domization. It consists of assigning cases and controls in equal proportions
and at random to the distinct plates. Table 1 shows such a design for a case-
control study randomizing cancer cases and healthy controls to three target
plates, as reported by Mertens et al. [11], which gives more details about
the study. In addition to randomizing the cases and controls in roughly equal
proportions across the plates, the study also tried to have cancer stages in
roughly equal distributions from plate-to-plate. A recent overview of classi-
[Figure 5: Within-sample spectrum median and spectrum IQR plotted against sample sequence number, against each other, and by batch number (plates 1 to 6).]
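Table 1: Assignment of controls and cases, and of cases by cancer stage, to three target plates.

            Plate 1    Plate 2    Plate 3    Total
Controls       17         17         16        50
Cases          22         22         19        63

Cases by stage   Stage 1   Stage 2   Stage 3   Stage 4
Plate 1             4        10         4         4
Plate 2             4        10         4         4
Plate 3             3         8         4         4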
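The blocked randomization described above can be sketched in a few lines. The function `blocked_randomization` below is a hypothetical helper that shuffles the samples within each group and deals them to plates in turn; it balances cases and controls only and, unlike the design in table 1, does not additionally balance cancer stage.

```python
import numpy as np

def blocked_randomization(labels, n_plates, rng):
    """Randomly assign samples to plates, spreading each group evenly over the plates."""
    labels = np.asarray(labels)
    plate = np.empty(labels.size, dtype=int)
    for g in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == g))
        # Deal the shuffled samples of this group to the plates in turn
        plate[idx] = np.arange(idx.size) % n_plates
    return plate

rng = np.random.default_rng(7)
labels = np.array(["control"] * 50 + ["case"] * 63)
plate = blocked_randomization(labels, n_plates=3, rng=rng)

counts = {g: np.bincount(plate[labels == g], minlength=3) for g in ("control", "case")}
print(counts)
```

Each plate then holds cases and controls in (nearly) the same proportion, so that a plate effect cannot be confounded with the group contrast.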
of the study period, 288 samples had become available, of which 97 were
cases and 191 controls. The case-control assignment to plates is shown in
table 2. As can be seen, case-control assignment is perfectly confounded
with the plate effect for 25 out of 34 experimental batches, which makes these
measurements useless for between-group comparison. Note also how only two
plates, indicated in red typescript contain appreciable numbers of both cases
and controls. After analysis of the data, it was found that the estimate of
Table 2: Number of cases (ca) and controls (co) assigned to each plate.

 ca  co    ca  co    ca  co    ca  co
  4   0     1   3     0  15     1   6
  3   0     0   9     1   0     1   9
 11   0     4   0     3   0     0   7
  5   0     1   0     0   4     0   4
 12   0     1   0     1   0
 21  40     1   0     0   5
  1   3     2   0     1   9
  2   0     0   4     2   8
 16  13     0   3     0   4
  0  15     0  16     2  14
the batch effect (std) substantially exceeded the measurement error estimate
(std). No clear evidence of differential expression of glycans between groups
emerged, though it could be hypothesized that any differences might be small
and at least smaller than the observed between-batch variation. This raised
questions whether the ‘design’ was to blame for the failure of the study to
identify differential expression.
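The degree of confounding in table 2 can be verified directly from the plate counts. The sketch below transcribes the (cases, controls) pairs from the table and counts the plates that contain only cases or only controls; the transcription order of the plates is an assumption and does not affect the totals.

```python
import numpy as np

# (cases, controls) per plate, transcribed from table 2
plates = np.array([
    (4, 0), (1, 3), (0, 15), (1, 6),
    (3, 0), (0, 9), (1, 0), (1, 9),
    (11, 0), (4, 0), (3, 0), (0, 7),
    (5, 0), (1, 0), (0, 4), (0, 4),
    (12, 0), (1, 0), (1, 0),
    (21, 40), (1, 0), (0, 5),
    (1, 3), (2, 0), (1, 9),
    (2, 0), (0, 4), (2, 8),
    (16, 13), (0, 3), (0, 4),
    (0, 15), (0, 16), (2, 14),
])

n_cases, n_controls = plates.sum(axis=0)
# A plate is perfectly confounded if it holds only cases or only controls
n_confounded = int((plates.min(axis=1) == 0).sum())

print(len(plates), n_cases, n_controls, n_confounded)
```

The counts agree with the figures quoted in the text: 34 plates, 97 cases, 191 controls, and 25 plates on which group and plate are perfectly confounded.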
To investigate the consequences of using a ‘design’ as used in the above
glycomics study, we conduct a simulation study which contrasts several
potential alternative designs. For each of these 4 designs we assume the
same sample size of 288 samples with 97 cases and 191 controls, just as for
the original study. We consider the following alternative scenarios.
We now simulate experimental data for each of the above 4 scenarios, gener-
ating effect sizes ranging from 0 to 1.5 for a single glycan (univariate simula-
tion) and assuming between-batch effects with standard deviations σB taking
values 3.6, 1.8, 0.9 and 0.45. The standard deviation of the error σE takes
the value 1.8 throughout (these numbers are inspired by results from the real
data analysis).
In the analysis of the simulated data for the above experiments, we fit
linear mixed effect models [12] to the simulated data of experiments 2, 3 and
4. The mixed effect models correct for the known batch structure using a
random effect while estimating the between-group effect with a fixed effect
term. For the first simulated experiment a simple linear regression model
is used, which is equivalent to a two-sample pooled t-testing approach. Figures
6 to 10 show the probabilities of detecting the between-group effect (power)
across the effect size range simulated and for the standard deviations of the
batch effect indicated. As expected, we find that the probability of detecting
the effect increases as the effect size grows for the single-plate experiment
[1]. The blocked experiment [2] matches the power of the single-plate experiment
regardless of the size of the batch random effect. The power of the
actually implemented glycomics experiment [4] depends on the size of the
batch effect. As the batch effect gets smaller (associated std goes to zero),
we can eventually ignore the batch effect altogether and we obtain the
same powers as if the batch effect was not present. This is of course a confirmation
of what we would expect to find. If the batch effect is substantial
however (associated std is large relative to the size of the between-group effect),
then we pay a penalty in terms of seriously reduced power of detecting
the between-group effect. The perfectly confounded experiment [3] performs
dramatically poorly whatever the batch effect is, since we are forced to account for
a known batch structure - irrespective of the true but unknown population
batch effect - and thus lose all power of detecting the effect of interest.
The excellent performance of the blocked version of the experiment again
emphasizes the need and importance of pro-actively designing and imple-
menting block-randomized experiments when batch structures are identified
in advance of the experimentation.
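The contrast between the blocked and the perfectly confounded design can be illustrated with a much simplified sketch. Instead of the linear mixed effect models used in the simulation study, the code below adjusts for the plate structure by subtracting each plate's mean, which is a stand-in chosen for brevity; the sample sizes and plate layouts are invented, while σB = 3.6 and σE = 1.8 are taken from the values quoted above.

```python
import numpy as np

def simulate(plate, group, effect, sigma_b, sigma_e, rng):
    """One response per sample: group effect + random plate effect + measurement error."""
    plate_effect = rng.normal(0.0, sigma_b, size=plate.max() + 1)
    error = rng.normal(0.0, sigma_e, size=plate.size)
    return effect * group + plate_effect[plate] + error

def plate_adjusted_difference(y, plate, group):
    """Case minus control mean after subtracting each plate's own mean."""
    centred = y.copy()
    for p in np.unique(plate):
        centred[plate == p] -= y[plate == p].mean()
    return centred[group == 1].mean() - centred[group == 0].mean()

rng = np.random.default_rng(8)
group = np.tile(np.repeat([1, 0], 20), 3)        # 60 cases (1) and 60 controls (0)
blocked_plate = np.repeat([0, 1, 2], 40)         # each plate: 20 cases and 20 controls
# Confounded layout: all cases on plate 0, controls split over plates 1 and 2
confounded_plate = np.where(group == 1, 0, 1 + (np.arange(group.size) % 2))

y_blocked = simulate(blocked_plate, group, effect=1.5, sigma_b=3.6, sigma_e=1.8, rng=rng)
y_confounded = simulate(confounded_plate, group, effect=1.5, sigma_b=3.6, sigma_e=1.8, rng=rng)

est_blocked = plate_adjusted_difference(y_blocked, blocked_plate, group)
est_confounded = plate_adjusted_difference(y_confounded, confounded_plate, group)
print(round(est_blocked, 2), round(est_confounded, 2))
```

In the blocked layout the plate-adjusted difference still estimates the group effect, while in the confounded layout removing the plate means removes the group contrast entirely, which mirrors the loss of power described above.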
[Figure 6: Power versus effect size (0 to 1.5) with σE = 1.8 and σB = 3.6, for four designs: single plate; 3 plates, blocked; 3 plates, confounded; glyco experiment.]
[Figure 7: Power versus effect size.]
[Figure 8: Power versus effect size with σE = 1.8 and σB = 0.9, same four designs.]
[Figure 9: Power versus effect size.]
[Power versus effect size with σE = 1.8 and σB = 0.2, same four designs.]
[Figure 11]
excellent text describing modern survival analysis methodology in novel high-
dimensional data applications was recently provided by van Houwelingen and
Putter [17][Part IV - Chapters 11 and 12].
It is at time of writing not clear how this issue should be addressed in
design and analysis of longitudinal and survival studies with spectrometry-
based omic data generally. The problem is particularly important because it
could affect all existing biobanks. One could propose that instead of - or in
addition to - the development of biobanks which store sample materials, at-
tention should be given to establishing databanks of (omic) spectra, which are
measured at pre-specified and regular time points instead. Such an approach
could break the confounding between the follow-up time of patients which is
then de-coupled from the measurement times. The problem with the latter
proposal is that the measurement devices (spectrometers) themselves may
exhibit ageing, such that the ageing problem is replaced with a spectrometer
calibration problem.
4 Discussion
We have critically discussed and contrasted the distinct issues of transforma-
tion, normalization and management of batch effect in the analysis of omic
data. Transformation, standardization and normalization are typically dealt
with ‘after’ the experiment and usually by the data analyst or statistician
involved. Batch effect and the presence thereof is an issue that should be
considered both ‘before’ and ‘after’ the data-generating experiment. The first
objective here should be to optimize the design for known batch effects such
that these cannot unduly affect conclusions or invalidate the experiment.
This task is usually carried out in collaboration and prior discussion between
statistician and spectrometrist when planning the experiment. Secondly,
appropriate methods should be used after the experiment to either
eliminate or otherwise accommodate the batch effect
when analyzing the data. The latter task will usually be carried out by the
statistician alone. Discussion of such methods falls outside the scope of this
chapter.
Methodological choice in pre-processing of spectrometry data should in
practice depend on many aspects. Purpose of the study is a key consideration.
If prediction or diagnostic models are to be calibrated, then pre-processing
methods which can be applied ‘within-sample’ without borrowing of infor-
mation across distinct samples are more attractive because this can make
subsequent calibration of any predictive rule easier. Other considerations
may be ease of communication of results, established practice (insofar as it
is reasonable, of course), variance stabilization, interpretability and so forth.
Our discussion has not been comprehensive but highlighted the main ideas
and approaches instead, pointing to common pitfalls, opportunities and fu-
ture problems left to be solved.
Acknowledgements
This work was supported by funding from the European Community’s Seventh
Framework Programme FP7/2011: Marie Curie Initial Training Network
MEDIASRES (Novel Statistical Methodology for Diagnostic/Prognostic and
Therapeutic Studies and Systematic Reviews, www.mediasres-itn.eu) with
the Grant Agreement Number 290025 and by funding from the European
Union’s Seventh Framework Programme FP7/Health/F5/2012: MIMOmics
(Methods for Integrated Analysis of Multiple Omics Datasets,
http://www.mimomics.eu) under the Grant Agreement Number 305280.
References
1. Aitchison, J. (1986) The Statistical Analysis of Compositional Data, Blackburn.
4. Cox, D.R. and Oakes, D. (1984) Analysis of Survival Data, Chapman and
Hall.
8. Kakourou, A., Vach, W., Nicolardi, S., van der Burgt, Y., Mertens, B.
(2016) Statistical development and assessment of summary measures to
account for isotopic clustering of Fourier transform mass spectrometry data
in clinical diagnostic studies. arXiv:1602.02908 [stat.ME]
9. Klein, J.P., van Houwelingen, H.C., Ibrahim, J.G., Scheike, T.H. (2014)
Handbook of Survival Analysis, Chapman and Hall/CRC Press.
10. Krzanowski, W.J., Jonathan, P., McCarthy, W.V. and Thomas, M.R.
(1995) Discriminant analysis with singular covariance matrices: methods
and applications to spectroscopic data, Applied Statistics, 44, 101-115.
12. Molenberghs, G. and Verbeke, G. (2000) Linear Mixed Models for Longitudinal
Data, Springer.
13. Naes, T., Isaksson, T., Fearn, T., Davies, T. (2002) A User-Friendly
Guide to Multivariate Classification and Calibration, NIR Publications.
16. Vach, W. (2013) Regression Models as a Tool in Medical Research, Chapman
and Hall/CRC Press.
18. van den Berg, R. A., Hoefsloot, H. C. J., Westerhuis, J. A., Smilde, A.
K., van der Werf, M. J. (2006) Centering, scaling and transformations:
improving the biological information content of metabolomics data, BMC
Genomics, 7:142.