Bts 595

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Vol. 28 no.

24 2012, pages 3290–3297


BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bts595

Systems biology Advance Access publication October 9, 2012

Bayesian correlated clustering to integrate multiple datasets


Paul Kirk1, Jim E. Griffin2, Richard S. Savage1, Zoubin Ghahramani3 and David L. Wild1,*
1
Systems Biology Centre, University of Warwick, Coventry, CV4 7AL, 2School of Mathematics, Statistics and
Actuarial Science, University of Kent, CT2 7NF and 3Department of Engineering, University of Cambridge,
Cambridge, CB2 1PZ, UK
Associate Editor: Olga Troyanskaya

ABSTRACT Rigaut et al., 1999) datasets are available for a broad selection of
Motivation: The integration of multiple datasets remains a key chal- organisms, providing measurements of mRNA expression,
lenge in systems biology and genomic medicine. Modern high- protein–DNA binding and protein–protein interactions (PPIs).
throughput technologies generate a broad array of different data In the forthcoming era of personal genomic medicine, we may
types, providing distinct—but often complementary—information. reasonably expect genome sequences and other forms of
We present a Bayesian method for the unsupervised integrative mod- high-throughput data (such as gene expression, alternative spli-
elling of multiple datasets, which we refer to as MDI (Multiple Dataset cing, DNA methylation, histone acetylation and protein abun-
Integration). MDI can integrate information from a wide range of dif- dances) to be routinely measured for large numbers of people.
ferent datasets and data types simultaneously (including the ability to The development of novel statistical and computational method-
model time series data explicitly using Gaussian processes). Each ology for integrating diverse data sources is therefore essential,
dataset is modelled using a Dirichlet-multinomial allocation (DMA) mix- and it is with this that the present work is concerned.
ture model, with dependencies between these models captured As is common in statistics and machine learning, data integra-
through parameters that describe the agreement among the datasets. tion techniques can be broadly categorized as either supervised
Results: Using a set of six artificially constructed time series datasets, (where a training/gold-standard set with known labels is used to
we show that MDI is able to integrate a significant number of datasets learn statistical relationships) or unsupervised (where there is no
simultaneously, and that it successfully captures the underlying struc- training dataset, but we nevertheless seek to identify hidden
tural similarity between the datasets. We also analyse a variety of real structure in the observed data; e.g. by clustering). Our proposed
Saccharomyces cerevisiae datasets. In the two-dataset case, we method is unsupervised, but there are also a number of super-
show that MDI’s performance is comparable with the present vised learning algorithms that are designed to integrate multiple
state-of-the-art. We then move beyond the capabilities of current data sources; we now briefly mention these for the sake of com-
approaches and integrate gene expression, chromatin immunopreci- pleteness. These have proven highly successful in several con-
pitation–chip and protein–protein interaction data, to identify a set of texts, often when predicting whether a link or interaction exists
protein complexes for which genes are co-regulated during the cell between two genes or proteins. Depending on the application,
cycle. Comparisons to other unsupervised data integration tech- the link might represent (to provide just a few examples) protein–
niques—as well as to non-integrative approaches—demonstrate that protein binding (Jansen et al., 2003; Rhodes et al., 2005), or a
MDI is competitive, while also providing information that would be synthetic sick or lethal interaction (Wong et al., 2004) or might
difficult or impossible to extract using other methods. indicate that the two genes have been implicated in the same
Availability: A Matlab implementation of MDI is available from biological process (Myers and Troyanskaya, 2007). Approaches
http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/
for predicting these links often proceed by collecting a
software/.
gold-standard set of positive and negative interactions (see, for
Contact: D.L.Wild@warwick.ac.uk
contrasting examples, Jansen et al., 2003; Lee et al., 2004; Myers
Supplementary information: Supplementary data are available at
et al., 2005), and then training statistical models (e.g. decision
Bioinformatics online.
trees, naive Bayes classifiers) that predict the presence/absence of
Received on April 12, 2012; revised on September 19, 2012; accepted these interactions. These models may then be applied to predict
on September 29, 2012 the presence/absence of previously unknown interactions.
Because training and prediction are performed on the basis of
information collected from multiple different data sources, these
1 INTRODUCTION approaches provide a form of data integration. Such supervised
The wide range of modern high-throughput genomics technolo- data integration techniques have proven highly effective for pre-
gies has led to a rapid increase in both the quantity and variety of dicting interactions, some of which may then be verified experi-
functional genomics data that can be collected. For example, mentally (e.g. Rhodes et al., 2005; Huttenhower et al., 2009).
large-scale microarray (Lockhart et al., 1996; Schena et al., Moreover, the work of Huttenhower et al. (2009) demonstrates
1995), chromatin immunoprecipitation (ChIP) chip (Solomon that such approaches may be used to integrate whole-genome
et al., 1988) and tandem affinity purification (Puig et al., 2001; scale datasets. The Bayesian network approach of Troyanskaya
et al. (2003) was a precursor to many of these supervised
*To whom correspondence should be addressed. approaches, but differs from the others in that it uses knowledge

ß The Author(s) 2012. Published by Oxford University Press.


This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multiple dataset integration

from human experts to integrate predictions derived from diverse hierarchical Dirichlet process (DP) to perform integrative mod-
datasets. elling of two datasets. As well as significant methodological
Here we propose a novel unsupervised approach for the inte- differences, the principal practical distinction between this ap-
grative modelling of multiple datasets, which may be of different proach and MDI is that we are able to integrate more than
types. For brevity, we refer to our approach as MDI, simply as a two datasets, any or all of which may be of different types
shorthand for ‘Multiple Dataset Integration’. We model each (Section 2). Like MDI, the iCluster method of Shen et al.
dataset using a Dirichlet-multinomial allocation (DMA) mixture (2009) permits integrative clustering of multiple ( 2) genomic
model (Section 2.1), and exploit statistical dependencies between datasets, but uses a joint latent variable model (for details, see
the datasets to share information (Section 2.2). MDI permits the Shen et al., 2009). In contrast to MDI, iCluster seeks to find a
identification of groups of genes that tend to cluster together in single common clustering structure for all datasets. Moreover,
one, some or all of the datasets. In this way, our method is able iCluster must resort to heuristic approaches to estimate the
to use the information contained within diverse datasets to number of clusters, whereas MDI infers this automatically
identify groups of genes with increasingly specific characteristics (Section 2.1). We demonstrate that MDI provides results that
(e.g. not only identifying groups of genes that are co-regulated, are competitive with the two-dataset approach of Savage et al.
but additionally identifying groups of genes that are both (2010) in Section 3.2, and provide a comparison of results ob-
co-regulated and whose protein products appear in the same tained using MDI, iCluster and simple clustering approaches in
complex). the Supplementary Material.
Informally, our approach may be considered as a ‘correlated The potential biological applications of our approach are
clustering’ model, in which the allocation of genes to clusters in diverse, as there are many experimental platforms that produce
one dataset has an influence on the allocation of genes to clusters measurements of different types, which might be expected to
in another. This contrasts with ‘simple’ clustering approaches possess similar (but not necessarily identical) clustering struc-
(such as k-means, hierarchical clustering, etc) in which the data- tures. For example, in the two-dataset case, related methodolo-
sets are clustered independently (or else concatenated and treated gies have been used to discover transcriptional modules (Liu
as a single dataset). It also clearly distinguishes our methodology et al., 2007; Savage et al., 2010) and prognostic cancer subtypes
from biclustering (e.g. Cheng and Church, 2000; Reiss et al., (Yuan et al., 2011) through the integration of gene expression
2006). Biclustering is the clustering of both dimensions in a data with TF binding (ChIP–chip) data and copy number vari-
single dataset (e.g. both genes and experiments in a gene expres- ation data, respectively. A related approach was also used by
sion dataset). MDI, in contrast, clusters a single dimension Rogers et al. (2008) to investigate the correspondence between
(e.g. genes) across multiple datasets. Biclustering is not applicable transcriptomic and proteomic expression profiles. In the example
here as the datasets can be arbitrarily different, making any clus- presented in this article, we focus on the biological question of
tering across all features difficult. MDI avoids the problem of identifying protein complexes whose genes undergo transcrip-
comparing different data types by instead learning the degree of tional co-regulation during the cell cycle.
similarity between the clustering structures (i.e. the The outline of this article is as follows. In Section 2, we briefly
gene-to-cluster allocations) in different datasets (Section 2.2). provide some modelling background and present our approach.
MDI makes use of mixture models, which have become wide- Inference in our model is performed via a Gibbs sampler, which
spread in the context of unsupervised integrative data modelling is provided in the Supplementary Material. In Section 3, we de-
(e.g. Barash and Friedman, 2002; Liu et al., 2006, 2007), gaining scribe three case study examples, in all of which we use publicly
increased popularity in recent years (Rogers et al., 2010; Savage
available Saccharomyces cerevisiae (baker’s yeast) datasets. We
et al., 2010). The principal advantages of using mixture models
present results in Section 4 and a discussion in Section 5.
are as follows: (i) they provide flexible probabilistic models of the
data; (ii) they naturally capture the clustering structure that is
commonly present in functional genomics datasets; and (iii) by
adopting different parametric forms for the mixture components, 2 METHODS
they permit different data types to be modelled (see also Section In this section, we provide some background regarding DMA mixture
2.1). An early application to data integration is provided by models (Section 2.1), and consider how these may be extended to allow us
Barash and Friedman (2002), who performed integrative model- to perform integrative modelling of multiple datasets (Section 2.2).
ling of gene expression and binding site data. Inference in the resulting model (which we henceforth refer to as MDI)
is performed using a Gibbs sampler (Supplementary Material). We briefly
As part of our approach, we infer parameters that describe the
describe in Section 2.4 how the resulting posterior samples may be effect-
levels of agreement between the datasets. Our method may thus
ively summarized.
be viewed as extending the work of Balasubramanian et al.
(2004). In this regard, MDI is also related to the approach of
Wei and Pan (2012), which models the correlation between data 2.1 Dirichlet-multinomial allocation mixture models
sources as part of a method that classifies genes as targets or We model each dataset using a finite approximation to a DP mixture
non-targets of a given transcription factor (TF) using ChIP–chip, model (Ishwaran and Zarepour, 2002), known as a DMA mixture model
gene expression and DNA binding data, as well as information (Green and Richardson, 2001). Such models have the following general
regarding the position of genes on a gene network. Perhaps most form:
closely related to MDI (in terms of application) are the methods
X
N
of Savage et al. (2010) and iCluster (Shen et al., 2009). Savage pðxÞ ¼ c fðxjc Þ: ð1Þ
et al. (2010) adopt a mixture modelling approach, using a c¼1

3291
P.Kirk et al.

(a) (b)
In the above, p(x) denotes the probability density model for the data,
which is here an N component mixture model. The c ’s are mixture
proportions, f is a parametric density (such as a Gaussian) and c denotes
the vector of parameters associated with the c-th component.
Importantly, different choices for the density f allow us to model different
types of data (for example, a normal distribution might be appropriate
for continuous data, whereas a multinomial might be appropriate for
categorical data).
Given observed data x1 , . . . , xn , we wish to perform Bayesian infer- Fig. 1. Graphical representation of three DMA mixture models. (a)
ence for the unknown parameters in this model. As is common in mixture Independent case. (b) The MDI model. In both (a) and (b), xik denotes
modelling (e.g. Dempster et al., 1977; see also Friedman et al., 2004 for a the ith observation in dataset k and is generated by mixture component
graphical model perspective), we introduce latent component allocation cik . The prior probabilities associated with the distinct component allo-
variables cj 2 f1, . . . , Ng, such that ci is the component responsible for cation variables, ½c1k , . . . , cNk , are given in the vector pk , which is itself
observation xi . We then specify the model as follows: assigned a symmetric Dirichlet prior with parameter k . The parameter
vector, hck , for component c in dataset k is assigned a G0k prior. In (b), we
xi jci ,   Fðci Þ, additionally have k‘ parameters, each of which models the dependence
ci j  Multinomialð1 , . . . , N Þ, between the component allocations of observations in dataset k and ‘
ð2Þ
1 , . . . , N  Dirichletð=N, . . . , =NÞ,
c  Gð0Þ ,
different mass parameter, k : MDI links these models together at the
where F is the distribution corresponding to density f, p ¼ ð1 , . . . , N Þ is
level of the component allocation variables via the following conditional
the collection of N mixture proportions,  is a mass/concentration par-
prior:
ameter (which may also be inferred) and Gð0Þ is the prior for the compo-
nent parameters. Bayesian inference for such models may be performed Y
K Y
K 1 Y
K

via Gibbs sampling (Neal, 2000). Note that a realization of the collection pðci1 , ci2 , . . . , ciK j/Þ / cik k ð1 þ k‘ Iðcik ¼ ci‘ ÞÞ, ð3Þ
k¼1 k¼1 ‘¼kþ1
of component allocation variables, ðc1 , . . . , cn Þ, defines a clustering of the
data (i.e. if ci ¼ cj , then xi and xj are clustered together). Because each cj where I is the indicator function, k‘ 2 R0 is a parameter that controls
is a member of the set f1, . . . , Ng, it follows that the value of N places an the strength of association between datasets k and ‘, and  is the collec-
upper bound on the number of clusters in the data. tion of all K(K  1)/2 of the k‘ ’s. For clarity, note that cik 2 f1, . . . , Ng is
The DP mixture model may be derived by considering the limit the component allocation variable associated with gene i in model k, and
N ! 1 in Equation (1) (Neal, 1992; Rasmussen, 2000). In the present that cik k is the mixture proportion associated with component cik in
article, it is convenient to persist with finite N (Section 2.2). The import- model k. Informally, the larger k‘ , the more likely it is that cik and ci‘
ant point is that N just places an upper bound on the number of clusters will be the same, and hence the greater the degree of similarity between
present in the data (because, as in the infinite DP case, not all of the the clustering structure of dataset k and dataset ‘. In Figure 1b, we pro-
components need to be ‘occupied’; i.e. not all components need to have vide a graphical representation of our model in the case K ¼ 3. If all
observations associated with them), and hence N does not specify the k‘ ¼ 0, then we recover the case of K-independent DMA mixture
precise number of clusters a priori. Provided N is taken sufficiently models (Fig. 1a). Note that ð1 þ k‘ Iðcik ¼ ci‘ ÞÞ  1, hence if k‘ 40
large, the number of clusters present in the data will be (much) less then we are up-weighting the prior probability that cik ¼ ci‘ (relative to
than N, and we will retain the ability to identify automatically the the independent case).
number of clusters supported by the data. Theoretical justifications for Linking the mixture models at the level of the component allocation
‘large’ mixture models such as this (in which the number of components variables provides us with a means to capture dependencies between the
in the mixture is larger than the true number of clusters in the data) are datasets in a manner that avoids difficulties associated with the datasets
provided by Rousseau and Mengersen (2011). A choice of N ¼ n would being of different types and/or having different noise properties.
set the upper bound on the number of clusters to be equal to the number   An important feature of our model is that there is a correspondence
of genes. As a tradeoff with computational cost, we take N ¼ n=2 between the component labels across the datasets. That is, our model
throughout this article. implicitly ‘matches up’ Component c in Dataset k with Component c
in Dataset ‘. This allows us to identify groups of genes that tend to be
allocated to the same component (i.e. which tend to cluster together) in
2.2 Dependent component allocations
multiple datasets (Section 2.4). It is this desire to ‘match up’ components
We are interested in the situation where we have a collection of n genes, across datasets that motivates our use of finite approximations to DP
for each of which we have measurements from K different data sources. mixture models. Had we used an infinite mixture model, matching com-
One possible modelling approach would be to fit K independent DMA ponents across datasets would be more problematic. We reiterate that the
mixture models, represented graphically in Figure 1a for the case K ¼ 3. finite N that appears in our mixture models merely places an upper bound
However, this neglects to consider (and fails to exploit) structure within on the number of clusters in each dataset (as not all components need to
the data that may be common across some or all of the different sources. be occupied), and hence is not restrictive in practice. Note that while this
For example, a set of co-regulated genes might be expected to have simi- upper bound is the same for each data set, the actual number of occupied
lar expression profiles, as well as have a common collection of proteins components (i.e. clusters) is inferred separately for each dataset and in
that bind their promoters. We therefore propose a model in which we general will be different for each one.
allow dependencies between datasets at the level of the component allo-
cation variables, ci :
We consider K mixture models (one for each dataset), each defined as 2.3 Modelling different data types
in Equations (1) and (2). We add right subscripts to our previous notation To specify our model fully, we must provide parametric densities, f, ap-
to distinguish between the parameters of the K different models (so that propriate for each data source. It is important to note that we may tailor
k is the mass parameter associated with model k, etc.) and take Nk ¼ N our choice of f to reflect the data sources that we seek to model. In the
in all mixture models. Note that each model is permitted to have a present work, we use Gaussian process models (Cooke et al., 2011; Kirk

3292
Multiple dataset integration

and Stumpf, 2009; Rasmussen and Williams, 2006) for gene expression 3.1 6-dataset synthetic example
time course data, and use multinomial models for categorical data
(e.g. discretized gene expression levels). For comparison with the results
To illustrate the properties of our model, we start with a
of Savage et al. (2010), we also consider in our second example (Sections six-dataset synthetic example. Dataset 1 is constructed by
3.2 and 4.2) a bag-of-words model for ChIP–chip data. Full details of all taking a 100-gene subset of the gene expression time course
of these models are given in the Supplementary Material, where we also data of Cho et al. (1998), and may be partitioned into seven
provide a Gibbs sampler for performing inference. As in Nieto-Barajas easily distinguishable clusters (Fig. 2a). We therefore associate
et al. (2004), posterior simulation for our model is aided by the strategic with each time course a cluster label, Z 2 f1, . . . , 7g. For
introduction of an additional latent variable (Supplementary Material for i ¼ 1, . . . , 5, we form Dataset i þ 1 by randomly selecting 25
details). time courses from Dataset i and randomly permuting their asso-
ciated gene names (but not their cluster labels). Thus, for a max-
2.4 Extracting fused clusters from posterior samples imum of 25 genes, the cluster label associated with gene g in
We wish to identify groups of genes that tend to be grouped together in
Dataset i may be different from the cluster label associated
multiple datasets. Suppose we have a collection of K datasets, which we with the same gene in Dataset i þ 1. Figure 2b and c further
label as Dataset 1,. . ., Dataset K. We are interested in identifying groups illustrate this dataset. A formal approach for comparing the al-
of genes that tend to cluster together amongst some subcollection of the location of genes to clusters is to calculate the ARI between each
datasets. Let fk1 , k2 , . . . , km g be a subset of f1, . . . , Kg. Our aim is to pair of clustering partitions (Hubert and Arabie, 1985; Rand,
identify groups of genes that cluster together in all of Dataset k1 ,. . ., 1971). Figure 2d provides a heatmap depiction of the similarity
Dataset km . Adapting terminology from Savage et al. (2010), we define matrix formed by calculating pairwise ARIs.
the probability of the i-th gene being fused across datasets k1 , . . . , km to
be the posterior probability that cik1 ¼ cik2 ¼ . . . ¼ cikm : For brevity, we
denote this posterior probability by pðcik1 ¼ cik2 ¼ . . . ¼ cikm Þ. We calcu- 3.2 Integrating expression and ChIP data
late this quantity as the proportion of posterior samples for which
To compare our method with an existing approach for unsuper-
cik1 , cik2 , . . . , cikm are all equal. We may clearly calculate these posterior
vised data integration, we apply MDI to an example previously
fusion probabilities for any combination of the datasets (pairs, triplets,
etc.), simply by considering the appropriate subset of f1, . . . , Kg. We say considered by Savage et al. (2010) in the context of transcrip-
that the i-th gene is fused across datasets k1 , k2 , . . . , km if tional module discovery. We take expression data from a
pðcik1 ¼ cik2 ¼ . . . ¼ cikm Þ40:5, and we denote the set of all such fused 205-gene subset of the galactose-use data of Ideker et al.
genes by F k1 , k2 , ..., km . (2001), which we integrate with ChIP–chip data from Harbison
If gene i is a member of F k1 , k2 , ..., km , this simply tells us that the com- et al. (2004). The expression data were discretized, as in Savage
ponent allocation variables cik1 , cik2 , . . . , cikm tend to be equal (i.e. gene i et al. (2010). The 205 genes appearing in this dataset were se-
tends to be allocated to the same component across datasets lected in Yeung et al. (2003) to reflect four functional Gene
k1 , k2 , . . . , km ). We also wish to identify the clustering structure that Ontology (GO) categories. Although this functional classifica-
exists amongst these fused genes. From our Gibbs sampler, we have a
tion must be used with some degree of caution (Yeung et al.,
collection of sampled component allocations for each member of
F k1 , k2 , ..., km . We identify a final clustering for the set of fused genes by
2003), it provides a reasonable means by which to validate the
searching amongst the sampled component allocations to find the one groupings defined by our method. We use the same version of
that maximizes the posterior expected adjusted Rand index (ARI; Fritsch
and Ickstadt, 2009). The resulting fused clusters contain groups of genes
that tend to cluster together across datasets k1 , k2 , . . . , km . (a) (b) (c)

3 EXAMPLES
To demonstrate the usage and utility of MDI, we consider three
examples using publicly available S. cerevisiae datasets. We spe-
cify the priors adopted for unknown parameters and provide (d)
Markov chain Monte Carlo running specifications in the
Supplementary Material. Each of our examples serves a different
purpose. In the first (Section 3.1), we consider an easily inter-
pretable synthetic dataset, which allows us to illustrate the types
of results that can be obtained using MDI. In the second (Section
3.2), we seek to compare our method with the present
state-of-the-art in data integration (namely, the approach of
Savage et al., 2010). Although this approach is limited to inte-
grating two datasets only, it provides a useful benchmark for
Fig. 2. (a) The data for the six-dataset synthetic example, separated into
MDI. Finally, in Section 3.3, we provide an example that
seven clusters. (b) A representation of how the cluster labels associated
allows us to explore the benefits offered by MDI that go with each gene vary from dataset to dataset. Genes are ordered so that
beyond the existing state-of-the-art. We consider the integration the clustering of Dataset 1 is the one that appears coherent. (c) A table
of three datasets, two of which comprise static measurements showing the number of genes having the same cluster labels in datasets
(ChIP–chip and PPI), and the other of which comprises gene i and j. (d) A heatmap depiction of the similarity matrix formed by
expression time course data. calculating the ARI between pairs of datasets

3293
P.Kirk et al.

(a) (b)
the Harbison et al. dataset as considered by Savage et al. (2010)
(significance threshold P ¼ 0.001), which provides binding infor-
mation for 117 transcriptional regulators. For brevity, we hence-
forth refer to the data of Harbison et al. as ‘ChIP data’, although
we emphasise that this dataset comprises measurements corres-
ponding to a compendium of 117 TFs, rather than to a single
particular TF. Discretizing the data (both expression and ChIP–
chip) might seem like an unnecessary simplification (as our
model can accommodate continuous static measurements
through an appropriate choice of component density function,
Fig. 3. (a) Densities fitted to the sampled values of k . (b) Heatmap
f), but it helps to ensure that our comparison to the results of
representation of the matrix with k‘-entry k‘ , the posterior mean
Savage et al. (2010) is fair. Moreover, discretization of the ChIP value for kl
data simplifies modelling and interpretation of the data (the
ij-entry of our ChIP data matrix is 1 if we have high confidence
that TFj is able to bind the promoter region of gene i, and 0
otherwise), although we acknowledge that this is likely to incur To test our ability to identify fused genes, we calculated pair-
some small information loss. wise fusion probabilities, pðcik ¼ ci‘ Þ, for each gene i and each
pair of datasets ðk, ‘Þ. If the true cluster label of gene i is the same
in datasets k and ‘, then pðcik ¼ ci‘ Þ should be high (40.5) so that
3.3 Integrating expression, ChIP and PPI data the gene may be correctly identified as fused. Across all pairs of
For an example with three diverse data types, we integrate datasets, the minimum pairwise fusion probability for such genes
the ChIP data of Harbison et al. with binary PPI data obtained was 0.90 and the mean was 0.97. Conversely, for genes having
from BioGRID (Stark et al., 2006) and a gene expression time different cluster labels in datasets k and ‘, the maximum pairwise
course dataset of Granovskaia et al. (2010), with the initial in- fusion probability was 0.05 and the mean was 0.01. Because our
tention of identifying protein complexes whose genes undergo fusion threshold is 0.5, we are in this case able to identify the
transcriptional co-regulation during the cell cycle. We consider fusion status correctly for all genes.
the Granovkskaia et al. cell cycle dataset that comprises meas-
urements taken at 41 time points, and which was obtained from
cells synchronized using alpha factor arrest. We considered only 4.2 Expression þ ChIP example
genes identified in Granovskaia et al. (2010) as having periodic
We ran MDI using a multinomial likelihood model for both
expression profiles. After removing those for which there was no
the discretized expression data and the binary ChIP–chip
ChIP or PPI data, we were left with 551 genes. Our binary PPI
data. We estimated pairwise fusion probabilities and extracted
data matrix then has rows indexed by these 551 genes, and col-
fused clusters, as described in Section 2.4. We identified 52
umns indexed by all of the proteins for which physical inter-
fused genes, grouped into three clusters. We compared these
actions identified via yeast 2-hybrid or affinity capture assays
clusters to the functional classes defined in Yeung et al. (2003).
have been reported in BioGRID. The ij-entry of the PPI data
Within each cluster, all genes had the same functional classifica-
matrix is 1 if there is a reported interaction between protein j and
the protein product of gene i (and 0 otherwise). In an effort to tion, whereas genes in different clusters possessed different
reduce the number of uninformative features, we removed col- classifications.
umns containing fewer than five 1s, leaving 603 columns. In Savage et al. (2010), a bag-of-words model was used to
model TF binding data. To permit a fair comparison of the
two approaches, we therefore re-ran MDI using a bag-of-words
4 RESULTS likelihood model for the ChIP data. Following Savage et al.
(2010), we then calculated the Biological Homogeneity Index
4.1 6-dataset synthetic example (BHI; Datta and Datta, 2006) for the resulting fused clusters.
Figure 3a shows estimated posterior densities for the mass par- To calculate the BHI scores, we used the R package clValid
ameters, k (obtained from the samples generated by our Gibbs (Brock et al., 2008) together with the GO annotations in the
sampler using kernel density estimation). Because each of our org.Sc.sgd.db Bioconductor package (Carlson et al., 2010). The
datasets is identical (up to permutation of gene names), these clValid package provides four different BHI scores, depending
distributions should be close to identical, as is the case. For on which GO functional categories are used to define the set of
each pair of datasets, we used the posterior k‘ samples to esti- annotations. All categories may be considered or just one of
mate posterior means, k‘ . We used these to form a similarity biological process (bp), cellular component (cc) and molecular
matrix whose k‘-entry is k‘ (with k‘ defined to be ‘k whenever function (mf). We report all four BHI scores in Table 1, for the
k4‘, and with kk left undefined). This is shown as a heatmap in fused clusters defined by (i) the method of Savage et al. (2010);
Figure 3b. Although they do so in different ways, both the ARI (ii) MDI using a bag-of-words likelihood and (iii) MDI using a
and the dataset association parameters quantify the degree of multinomial likelihood. The BHI scores for MDI (bag-of-words)
similarity between the allocation of genes to clusters in pairs of and the method of Savage et al. (2010) are almost identical, al-
datasets. The similarity of Figures 2d and 3b is therefore though MDI (bag-of-words) identifies a greater number of fused
reassuring. genes.

3294
Multiple dataset integration

Table 1. BHI scores for the fused clusters obtained using the method of expression profiles for genes identified as fused across the PPI
Savage et al. (2010), together with those obtained using MDI and ChIP datasets, with genes fused across all three datasets
shown in colour. Supplementary Figure 2 further illustrates the
Method BHI BHI BHI BHI Number fused clusters, whereas Table 3 shows the fused cluster labels and
(all) (bp) (mf) (cc) of genes provides descriptions for the genes fused across all three datasets.
We can see from Figure 4a and b that the integration of the
Savage et al. (2010) 0.98 0.85 0.71 0.98 72 expression data in addition to the ChIP and PPI data results in
MDI (bag-of-words) 0.98 0.85 0.72 0.97 172 Cluster 1 (green) and Cluster 6 (black) being effectively removed.
MDI (multinomial) 1.00 0.89 0.77 1.00 52 Although many of the genes in Cluster 1 are annotated as cell
wall proteins (Supplementary Material), and although the two
genes in Cluster 6 are both cyclins, the genes within these clusters
4.3 Expression þ ChIP þ PPI example have different expression patterns to one another (Fig. 4c, panels
We applied MDI to the example of Section 3.3 (using GP models 1 and 6). Genes are also lost from Clusters 4 and 5 (shown pink
for the gene expression time courses, and multinomial models for and purple). However, further analysis suggests that this is owing
the ChIP and PPI datasets), to identify groups of genes that are to data normalization effects (Supplementary Material). Cluster
co-regulated during the yeast cell cycle, and whose protein prod- 2 (blue) is robust to the additional inclusion of expression data,
indicating that there is no significant disagreement amongst the
ucts appear in the same complex. We identified genes fused
three datasets regarding the existence of this cluster. Cluster 3
across all three datasets, as well as genes fused across pairs of
(red) is also relatively robust, with only one less gene when we
datasets. We then determined the fused clusters for each of these
consider the fusion of all three datasets, compared to the fusion
combinations (Section 2.4). Additionally, we identified clusters
of just the ChIP and PPI datasets (Fig. 4a and b). We note that
for the ‘single dataset fusion’ case (which amounts to identifying
the genes in Clusters 2 and 3 all have key roles, either encoding
a single clustering partition for each of our three datasets con-
core histone proteins or being involved in ribosome biogenesis
sidered separately). We assess the quality of our clusterings using
(Table 3).
GO Term Overlap (GOTO) scores (Mistry and Pavlidis, 2008).
Interestingly, the gene lost from Cluster 3 (the histone cluster)
These assign a score to a pair of genes according to how many
is HTZ1, which encodes the variant histone H2A.Z (Jackson
GO terms they have in common. This contrasts with BHI, which
et al., 1996; Santisteban et al., 2000). The function of H2A.Z is
just assigns a score of 0 or 1 to gene pairs depending on whether
different to that of the major H2As (e.g. Jackson and Gorovsky,
or not they share a common GO term. The GOTO scores there-
2000). We can see from Figure 4c (panel 3) that the expression of
fore provide a more finely grained assessment, which implicitly
this gene (shown grey) is subtly different to the expression of
takes into account the hierarchical structure of the GO. This is
others in the cluster.
invaluable here because (as a result of selecting only genes found
to have periodic expression profiles during the cell cycle) any two
randomly selected genes are likely to share some high-level GO 4.4 Comparison to other methods
terms (see the Supplementary Material for more details). The In Section G of the Supplementary Material, we provide a com-
GOTO scores are reported in Table 2 parison of MDI with other clustering methods, both in terms of
The GOTO scores generally increase as we require agreement performance and the types of results that can be obtained. The
across more datasets, while the number of fused genes decreases. key properties of MDI that distinguish it from other clustering
Note that this decrease is simply a consequence of requiring methods are (i) the clustering of genes in dataset k influences
agreement among a larger collection of datasets. For example, (and is influenced by) the clustering in dataset ‘, to an extent
as the set S1 ¼ {genes that are co-regulated and have protein determined by the inferred k‘ parameter; (ii) each dataset is
products that appear in the same complex} is a subset of permitted to have a different clustering structure (so each dataset
S2 ¼ fgenes that are co-regulatedg, it is inevitable that the may, for example, have a different number of clusters); (iii) the
number of genes of the former type will be less than or equal number of clusters is determined automatically as part of the
to the number of genes of the latter type. In other words, requir- inference procedure and (iv) there is a correspondence between
ing agreement across multiple datasets enables us to identify the cluster labels in different datasets, which enables us to iden-
clusters of genes that have increasingly specific shared character- tify clusters of genes that exist across some or all of the datasets.
istics. This is reflected in the increasing GOTO scores, which Simple clustering methods (such as k-means and hierarchical
indicate that genes in the same cluster tend to share a greater clustering) can be used to cluster each of the datasets independ-
number of lower-level (more specific) GO terms. ently, but do not model the dependence/similarity between clus-
In Figure 4, we compare the clusters formed by the genes fused tering structures in different datasets and do not enable clusters
across all three datasets with those formed by the genes fused that exist across multiple datasets to be identified automatically.
across just the PPI and ChIP datasets. Figure 4a and b illustrate More sophisticated methods such as iCluster (Shen et al., 2009)
fusion probabilities for the 31 genes identified as fused across the often share some of MDI’s properties, but do not allow for the
PPI and ChIP datasets. Each bar in Figure 4a corresponds to a identification of subsets of genes that cluster together across mul-
particular gene (as labelled), and represents the posterior prob- tiple datasets. The results of Section G of the Supplementary
ability of that gene being fused across the ChIP and PPI datasets. Material demonstrate that the ability to share information
The corresponding bar in Figure 4b represents the probability of across datasets typically provides improvements in clustering
the gene being fused across all three datasets. Figure 4c shows the quality, while MDI’s additional ability to pick out clusters that

3295
P.Kirk et al.

(a) (b) (c)

Fig. 4. (a) Pairwise fusion probabilities for the 31 genes identified as fused across the ChIP and PPI datasets in the ‘Expression þ ChIP þ PPI’ example.
Colours correspond to fused clusters and the dashed line indicates the fusion threshold. (b) Three-way fusion probabilities for the same 31 genes. Genes
that do not exceed the fusion threshold have white bars. (c) The expression profiles for genes identified as fused according to the ChIP and PPI datasets.
The coloured lines indicate genes that are also fused across the expression dataset as well

Table 2. GOTO scores for fused clusters obtained for all combinations of 4.5 Scaling and run-times
the expression, ChIP and PPI datasets
For typical examples (where the number of datasets, K, is rela-
tively small), the scaling of MDI will be O(KNn) (see
Dataset(s) GOTO GOTO GOTO Number Supplementary Section D.5 for further details and specific
(bp) (mf) (cc) Of genes run-times). MDI is particularly appropriate for applications in
which a gene pre-selection step is performed (e.g. on the basis of
ChIP 6.36 0.97 8.53 551 differential expression). We anticipate applications to collections
PPI 11.04 1.51 11.11 551 of 5 datasets, each comprising 1000 genes. Parallelizing MDI
Expression 7.66 1.15 9.48 551
using an approach such as the one described by Suchard et al.
ChIP þ PPI 27.04 3.47 18.99 31
ChIP þ Expression 24.46 2.93 16.87 48
(2010) should be possible, and we are currently investigating this.
PPI þ Expression 26.04 3.69 22.35 32
ChIP þ PPI þ Expression 34.81 2.46 26.70 16
5 DISCUSSION
We have presented MDI, a novel Bayesian method for the un-
supervised integrative modelling of multiple datasets. We have
Table 3. Clusters formed by the genes fused across all 3 datasets established that MDI provides competitive results with an exist-
ing method for integrating two datasets (Section 4.2), and is also
able to integrate collections of more than two datasets (Sections
ID Gene Brief description 4.1 and 4.3). Our application to a three-dataset example (Section
4.3) demonstrated that requiring agreement across multiple data-
2 NOB1 Involved in synthesis of 40S ribosomal subunits sets of different types can enable us to identify clusters of genes
2 ENP2 Required for biogenesis of the small ribosomal subunit
with increasingly specific shared characteristics. Moreover, we
2 RPF2 Involved in assembly of 60S ribosomal subunit
2 IMP3 Component of the SSU processome have found that sharing information across multiple datasets
2 DBP9 Involved in biogenesis of 60S ribosomal subunit can improve cluster quality.
3 HHF2 Histone H4, core histone protein MDI adopts a modelling approach distinctly different from
3 HTB2 Histone H2B, core histone protein those adopted by existing integrative modelling methods. For
3 HTA1 Histone H2A, core histone protein example, the model of Savage et al. (2010) performs integrative
3 HHT1 Histone H3, core histone protein modelling of two datasets only, achieved by introducing a ‘fused
3 HTB1 Histone H2B, core histone protein context’ (in which the two datasets are modelled together via a
3 HHT2 Histone H3, core histone protein product of likelihoods) in addition to two ‘unfused contexts’ in
3 HHF1 Histone H4, core histone protein
which the two datasets are modelled separately. This is analo-
5 SMC3 Subunit of the cohesin complex
5 IRR1 Subunit of the cohesin complex
gous to introducing—and modelling—an additional dataset. In
contrast, MDI introduces just a single parameter, k‘ 2 R0 , for
Descriptions were derived from the Saccharomyces Genome Database (Cherry each pair of datasets (Section 2.2), and it is this that provides
et al., 1998). The IDs in this table correspond to the cluster IDs in Figure 4, with MDI with the flexibility to perform integrative modelling of mul-
singletons omitted. tiple datasets. The scalability of MDI may be further improved
through parallelization of the type described by Suchard et al.
exist across multiple datasets permits the identification of groups (2010). This is an important direction for future work.
of genes with specific shared characteristics. Increasing the
number of datasets across which we seek agreement in cluster
assignment has the effect of increasing the specificity of these ACKNOWLEDGEMENTS
shared characteristics (which typically reduces the size of the We thank John Pinney for help with GO Term Overlap, and
gene subset—see Section 4.3 for further explanation). Maxime Huvet for useful discussions.

3296
Multiple dataset integration

Funding: P.K., J.E.G., Z.G. and D.L.W. acknowledge support Lockhart,D.J. et al. (1996) Expression monitoring by hybridization to high-density
oligonucleotide arrays. Nat. Biotechnol., 14, 1675–1680.
from the Engineering and Physical Sciences Research Council
Mistry,M. and Pavlidis,P. (2008) Gene Ontology term overlap as a measure of gene
(grant EP/I036575/1). R.S.S. was supported by an Medical functional similarity. BMC Bioinformatics, 9, 327.
Research Council Biostatistics Fellowship. Myers,C.L. and Troyanskaya,O.G. (2007) Context-sensitive data integration and
prediction of biological networks. Bioinformatics, 23, 2322–2330.
Conflict of Interest: none declared. Myers,C.L. et al. (2005) Discovery of biological networks from diverse functional
genomic data. Genome Biol., 6, R114.
Neal,R.M. (1992) Bayesian mixture modeling. In Maximum Entropy and Bayesian
Methods: Proceedings of the 11th International Workshop on Maximum Entropy
REFERENCES and Bayesian Methods of Statistical Analysis. pp. 197–211.
Balasubramanian,R. et al. (2004) A graph-theoretic approach to testing associations Neal,R.M. (2000) Markov chain sampling methods for Dirichlet process mixture
between disparate sources of functional genomics data. Bioinformatics, 20, models. J. Comput. Graph. Stat., 9, 249–265.
3353–3362. Nieto-Barajas,L. et al. (2004) Normalized random measures driven by increasing
Barash,Y. and Friedman,N. (2002) Context-specific Bayesian clustering for gene additive processes. Ann. Stat., 32, 2343–2360.
expression data. J. Comput. Biol., 9, 169–191. Puig,O. et al. (2001) The tandem affinity purification (TAP) method: a general
Brock,G. et al. (2008) clValid: an R package for cluster validation. J. Stat. Softw., procedure of protein complex purification. Methods, 24, 218–229.
25, 1–22. Rand,W. (1971) Objective criteria for the evaluation of clustering methods. J. Am.
Carlson,M. et al. (2010) org.Sc.sgd.db: genome wide annotation for Yeast. R package Stat. Assoc., 66, 846–850.
version 2.6.3. Rasmussen,C.E. (2000) The infinite Gaussian mixture model. In: Solla,S.A.,
Cheng,Y. and Church,G.M. (2000) Biclustering of expression data. Proc. Int. Conf. Leen,T.K. and Müller,K.-R. (eds.) Advances in Neural Information Processing
Intell. Syst. Mol. Biol., 8, 93–103. Systems, Vol. 12. MIT Press, Cambridge, MA, pp. 554–560.
Cherry,J.M. et al. (1998) SGD: Saccharomyces genome database. Nucleic Acids Rasmussen,C.E. and Williams,C.K. (2006) Gaussian Processes for Machine Learning
Res., 26, 73–79. (Adaptive Computation and Machine Learning). MIT Press, Cambridge, MA.
Cho,R.J. et al. (1998) A genome-wide transcriptional analysis of the mitotic cell Reiss,D.J. et al. (2006) Integrated biclustering of heterogeneous genome-wide data-
cycle. Mol. Cell, 2, 65–73. sets for the inference of global regulatory networks. BMC Bioinformatics, 7, 280.
Cooke,E.J. et al. (2011) Bayesian hierarchical clustering for microarray time series Rhodes,D.R. et al. (2005) Probabilistic model of the human protein-protein inter-
data with replicates and outlier measurements. BMC Bioinformatics, 12, 399. action network. Nat. Biotechnol., 23, 951–959.
Datta,S. and Datta,S. (2006) Methods for evaluating clustering algorithms for gene Rigaut,G. et al. (1999) A generic protein purification method for protein complex
expression data using a reference set of functional classes. BMC Bioinformatics, characterization and proteome exploration. Nat. Biotechnol., 17, 1030–1032.
7, 397. Rogers,S. et al. (2008) Investigating the correspondence between transcriptomic and
Dempster,A. et al. (1977) Maximum likelihood from incomplete data via EM proteomic expression profiles using coupled cluster models. Bioinformatics, 24,
Algorithm. J. R. Stat. Soc. Series B Methodol., 39, 1–38. 2894–2900.
Friedman,J. et al. (2004) Consistency in boosting: discussion. Ann. Stat., 32, Rogers,S. et al. (2010) Infinite factorization of multiple non-parametric views.
102–107. Mach. Learn., 79, 201–226.
Fritsch,A. and Ickstadt,K. (2009) Improved criteria for clustering based on the Rousseau,J. and Mengersen,K. (2011) Asymptotic behaviour of the posterior dis-
posterior similarity matrix. Bayesian Anal., 4, 367–391. tribution in overfitted mixture models. J. R. Stat. Soc. Series B Stat. Methodol.,
Granovskaia,M.V. et al. (2010) High-resolution transcription atlas of the mitotic 73, 689–710.
cell cycle in budding yeast. Genome Biol., 11, R24. Santisteban,M.S. et al. (2000) Histone H2A.Z regulates transcription and is partially
Green,P. and Richardson,S. (2001) Modelling heterogeneity with and without the redundant with nucleosome remodeling complexes. Cell, 103, 411–422.
Dirichlet process. Scand. J. Stat., 28, 355–375. Savage,R.S. et al. (2010) Discovering transcriptional modules by Bayesian data
Harbison,C.T. et al. (2004) Transcriptional regulatory code of a eukaryotic genome. integration. Bioinformatics, 26, i158–i167.
Nature, 431, 99–104. Schena,M. et al. (1995) Quantitative monitoring of gene expression patterns with a
Hubert,L. and Arabie,P. (1985) Comparing partitions. J. Classif., 2, 193–218. complementary DNA microarray. Science, 270, 467–470.
Huttenhower,C. et al. (2009) Exploring the human genome with functional maps. Shen,R. et al. (2009) Integrative clustering of multiple genomic data types using a
Genome Res., 19, 1093–1106. joint latent variable model with application to breast and lung cancer subtype
Ideker,T. et al. (2001) Integrated genomic and proteomic analyses of a systematic- analysis. Bioinformatics, 25, 2906–2912.
ally perturbed metabolic network. Science, 292, 929–934. Solomon,M.J. et al. (1988) Mapping protein-DNA interactions in vivo with formal-
Ishwaran,H. and Zarepour,M. (2002) Exact and approximate representations for dehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell,
the sum Dirichlet process. Can. J. Stat., 30, 269–283. 53, 937–947.
Jackson,J.D. and Gorovsky,M.A. (2000) Histone H2A.Z has a conserved function Stark,C. et al. (2006) BioGRID: a general repository for interaction datasets.
that is distinct from that of the major H2A sequence variants. Nucleic Acids Nucleic Acids Res., 34, D535–D539.
Res., 28, 3811–3816. Suchard,M.A. et al. (2010) Understanding GPU programming for statistical com-
Jackson,J.D. et al. (1996) A likely histone H2A.F/Z variant in Saccharomyces cer- putation: studies in massively parallel massive mixtures. J. Comput. Graph. Stat.,
evisiae. Trends Biochem. Sci., 21, 466–467. 19, 419–438.
Jansen,R. et al. (2003) A Bayesian networks approach for predicting protein-protein Troyanskaya,O.G. et al. (2003) A Bayesian framework for combining heteroge-
interactions from genomic data. Science, 302, 449–453. neous data sources for gene function prediction (in Saccharomyces cerevisiae).
Kirk,P.D. and Stumpf,M.P. (2009) Gaussian process regression bootstrapping: Proc. Natl Acad. Sci. USA, 100, 8348–8353.
exploring the effects of uncertainty in time course data. Bioinformatics, 25, Wei,P. and Pan,W. (2012) Bayesian joint modeling of multiple gene networks and
1300–1306. diverse genomic data to identify target genes of a transcription factor. Ann.
Lee,I. et al. (2004) A probabilistic functional network of yeast genes. Science, 306, Appl. Stat., 6, 334–355.
1555–1558. Wong,S.L. et al. (2004) Combining biological networks to predict genetic inter-
Liu,X. et al. (2006) Context-specific infinite mixtures for clustering gene expression actions. Proc. Natl Acad. Sci. USA, 101, 15682–15687.
profiles across diverse microarray dataset. Bioinformatics, 22, 1737–1744. Yeung,K.Y. et al. (2003) Clustering gene-expression data with repeated measure-
Liu,X. et al. (2007) Bayesian hierarchical model for transcriptional module discov- ments. Genome Biol., 4, R34.
ery by jointly modeling gene expression and ChIP-chip data. BMC Yuan,Y. et al. (2011) Patient-specific data fusion defines prognostic cancer subtypes.
Bioinformatics, 8, 283. PLoS Comput. Biol., 7, e1002227.

3297

You might also like