Biotechnology Advances 49 (2021) 107739

Research review paper

Using machine learning approaches for multi-omics data analysis: A review

Parminder S. Reel a,1, Smarti Reel a,1, Ewan Pearson a, Emanuele Trucco b, Emily Jefferson a,*
a Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
b VAMPIRE project, Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom


Keywords: Multi-omics; Machine Learning; Predictive Modelling; Supervised Learning; Unsupervised Learning; Systems Biology

With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.

1. Introduction

Digital information is growing rapidly, in terms of five V’s (volume, velocity, veracity, variety and value), and hence this is hailed as the big data era (BCS, 2014; Bellazzi, 2014; Lee and Yoon, 2017). Health-based big data, including linked information for patients such as their clinical data (for example gender, age, pathological and physiological history) and omics data (such as genetics, proteomics and metabolomics), has now become more widely available (Canuel et al., 2015; Singhal et al., 2016). Recently, such data has been used for precision (also called personalised or stratified) medicine to provide customised healthcare, i.e. providing a bespoke treatment for individuals (Gibson et al., 2015; Kalaitzopoulos, 2016; Malod-Dognin et al., 2017). There has been unprecedented growth in the development of precision medicine supported by ML (machine learning) approaches (Delavan et al., 2017; Peterson et al., 2013; Zou et al., 2017) and data mining tools (Chawla and Davis, 2013; Cheng et al., 2015; Margolies et al., 2016). These techniques have also helped to discover novel omics biological markers which can identify the molecular cause of a disease.

A biomarker is a substance, structure, or process that can be measured in the human body or its products and can provide surrogate information about the presence of a disease/condition (Strimbu and

Abbreviations: ATHENA, Analysis Tool for Heritable and Environmental Network Associations; BCC, Bayesian consensus clustering; BN, Bayesian Network; CS,
Concatenation-based Supervised Learning; CU, Concatenation-based Unsupervised Learning; DNA, Deoxyribo-Nucleic Acid; FCA, Formal Concept Analysis; FDA,
Food and Drug Administration; fMKL-DR, fast multiple kernel learning for dimensionality reduction; FSMKL, Multiple Kernel Learning with Feature Selection;
HI-DFNForest, Hierarchical integration deep flexible neural forest; JBF, Joint Bayes Factor; JIVE, Joint and Individual Variation Explained; KNN, k-nearest neighbors;
LASSO, Least Absolute Shrinkage and Selection Operator; LDA, Linear Discriminant Analysis; lncRNAs, long non-coding RNAs; MDI, Multiple Dataset Integration;
MDS, Multi-Dimensional Scaling; Meta-SVM, Meta-analytic SVM; miRNA, microRNA; ML, Machine Learning; MOFA, Multi-Omics Factor Analysis; MOLI, Multi-omics
late integration; MORONET, Multi-Omics gRaph cOnvolutional NETworks; MOSAE, Multi-omics Supervised Autoencoder; mRNA, messenger Ribo-Nucleic Acid; MS,
Model-based Supervised Learning; MU, Model-based Unsupervised Learning; NEMO, NEighborhood based Multi-Omics clustering; NMF, Non-negative Matrix Factorisation;
PCA, Principal Component Analysis; PINS, Perturbation clustering for data integration and disease subtyping; PSDF, Patient-Specific Data Fusion; RF,
Random Forest; rMKL-LPP, regularised multiple kernel learning for Locality Preserving Projections; RVM, Relevance Vector Machine; SDP-SVM, Semi-Definite
Programming SVM; SmSPK, smoothed shortest path graph kernel; SNF, Similarity Network Fusion; SSL, Semi-supervised learning; SVM, Support Vector Machine;
SVR, Support vector regression; TS, Transformation-based Supervised Learning; TU, Transformation-based Unsupervised Learning.
* Corresponding author at: Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, UK.
E-mail address: (E. Jefferson).
1 These authors contributed equally to this study (shared first authorship).
Table 1
The omics technologies which help us draw a complete picture of cell biology and related function.
S. No | Omic name | Term coined in | Data extracted | Commonly used high-throughput technologies | Common reference databases | Recent reviews
1 | Genomics | 1986 (Kuska, 1998) | Single nucleotide polymorphisms, Rare variants and Copy number variations. | DNA-Sequencing (Sanger (Sanger et al., 1977), Whole-genome (Huang et al., 2017a), Whole-exome (Weisz Hubshman et al., 2018), Single-Cell DNA (Zhang et al., 2019a) and targeted sequencing (Bewicke-Copley et al., 2019)), Microarray (Bumgarner, 2013). | DDBJ (Tateno et al., 2002), GenBank (Benson et al., 2011), ENA (Leinonen et al., 2011) | (Reuter et al., 2015)
2 | Transcriptomics | 1999 (“Proteomics, transcriptomics,”, 1999) | Messenger, Micro and Long non-coding RNA expression. | RNA-Sequencing (Sanger (Alidjinou et al., 2017), Single-Cell RNA (Hwang et al., 2018) and targeted sequencing (Mercer et al., 2012)), Microarray (Zhao et al., 2014). | miRBase (Kozomara et al., 2019), Rfam (Kalvari et al., 2018) | (Lowe et al., 2017)
3 | Proteomics | 1994 (Wilkins and Appel, 2007) | Protein expression | Reverse Phase Protein Array (Boellner and Becker, 2015), Liquid Chromatography - Mass Spectrometry (Karpievitch et al., 2010) and Mass Spectrometry (Timp and Timp, 2020) | HPA (Uhlen et al., 2010), PDB (Burley et al., 2019), Pfam (Finn et al., 2010), UniProt (The UniProt Consortium, 2019) | (Aslam et al., 2017)
4 | Metabolomics | 2001 (Lindon et al., 2011) | Metabolite expression | Mass Spectrometry (Glaves et al., 2014), Liquid Chromatography - Mass Spectrometry (Zhou et al., 2012), Gas Chromatography - Mass Spectrometry (Fiehn, 2016). | HMDB (Wishart et al., 2018), KEGG (Kanehisa and Goto, 2000) | (Zampieri et al., 2017)
5 | Lipidomics | 2003 (Wang et al., 2016) | Lipids | Liquid Chromatography - Mass Spectrometry (Li et al., 2020), High-performance Liquid Chromatography - Mass Spectrometry (Knittelfelder et al., 2014) and Direct-Infusion/Shotgun - Mass Spectrometry (Köfeler et al., 2012). | LMSD (Sud et al., 2007), LipiDAT (Caffrey and Hogan, 1992), LipidBank (Watanabe et al., 2000), LipidHome (Foster et al., 2013), LipidPedia (Kuo and Tseng, 2018) | (Yang and Han, 2016)
6 | Glycomics | 1990 (Vasta and Ahmed, 2008) | Glycomes | Matrix-Assisted Laser Desorption/Ionization Time-of-Flight - Mass Spectrometry (Zhang et al., 2019b). | GlyTouCan (Tiemeyer et al., 2017), UniCarb-DB (Campbell et al., 2014) | (Rojas-Macias et al., 2019)
7 | Metagenomics | 1998 (Handelsman et al., 1998) | Genetic data from environmental (soil, water) samples. | Target Gene Sequencing, Shotgun Metagenome Sequencing, Metatranscriptome Sequencing (Zhou et al., 2015) | MG-RAST (Meyer et al., 2008), SRA (Kodama et al., 2012), MGnify (Mitchell et al., 2020) | (Pérez-Cobas et al., 2020)

Tavel, 2010). Molecular biomarkers are discovered by analysing the cascade of information provided by different omics (Debnath et al., 2010). For example, the high-sensitivity C-Reactive protein test provides an accurate and quantitative risk assessment for cardiovascular disease (Pfützner and Forst, 2006; Shrivastava et al., 2015). Biomarkers play a significant role in planning preventive measures and decisions for patients (Nielsen, 2017) and can be classified as diagnostic, prognostic or predictive (Le et al., 2016; Shaw et al., 2015). Diagnostic biomarkers are used for determining the presence of disease in a patient, while prognostic biomarkers provide information on the overall outcome with or without the standard treatment (Carlomagno et al., 2017). Predictive biomarkers are used to identify who is at risk of an outcome (Nalejska et al., 2014). All of these biomarkers can also be used to identify which treatment will be most suitable for a given patient. For example, the ADNI (Alzheimer’s Disease Neuroimaging Initiative) study used a combination of neuroimaging, biochemical and genetic biomarkers to discriminate early Alzheimer’s patients from healthy volunteers with an accuracy of 98% (Gupta et al., 2019). Similarly, different forms of Parkinson’s syndromes have been investigated by developing an automated tool that fuses multi-site diffusion-weighted MRI imaging biomarkers and disease rating score (MDS-UPDRS III) (Archer et al., 2019). Biomarkers can help identify high-risk individuals before their physiological symptoms are evident. Moreover, they also help in measuring disease progression (Mandel et al., 2010).

In the context of precision medicine, ML has been used to develop diagnostic, prognostic and predictive tools from single omics data (Dias-Audibert et al., 2020; Mamoshina et al., 2018; Sonsare and Gunavathi, 2019). However, ML performance may deteriorate for certain single omics, such as gene data, due to their inherent characteristics (Kim et al., 2020). ML methods are now also being applied to multi-omics data (Bersanelli et al., 2016), to investigate and interpret the relationships between data and phenotypes (Kim and Tagkopoulos, 2018). Although ML analysis of multi-omics is still in its embryonic stage, it has already been explored for a wide range of applications, as reported in recent reviews on brain diseases (Garali et al., 2018; Young et al., 2013), diabetes (Kavakiotis et al., 2017), cancers (Borad and LoRusso, 2017; Chaudhary et al., 2017; Wong et al., 2016), cardiovascular disease (Weng et al., 2017), medical imaging (Erickson et al., 2017), single-cell analysis in humans (Cao et al., 2020; Ma et al., 2020a) and plant science studies (Acharjee et al., 2011). Currently, many of the multi-omic reviews are focused on individual sub-topics, for example designing studies (Haas et al., 2017; Hasin et al., 2017), setting up workflows (Kohl et al., 2014), choosing software tools (Misra et al., 2019) and evaluating overfitted performance (McCabe et al., 2020).

In contrast, this review aims at a broader focus, presenting an interdisciplinary perspective to new readers in this domain by providing a background on multi-omics and ML. It takes forward the integration terminologies introduced by Ritchie (Ritchie et al., 2015) and summarises the recent integrative state-of-the-art approaches. We aim to cover various integration methods concisely and include a recommendation flowchart enabling interdisciplinary scientists to have a quick head start in this domain (Bersanelli et al., 2016; Nguyen and Wang, 2020).

Scope of this review: This review investigates the two primary learning strategies in ML, i.e. supervised and unsupervised, which are commonly used within the context of multi-omics integration. This review considers multi-omics integration as a process of combining different single omics. Although various ML specialisations such as reinforcement (Coronato et al., 2020), hybrid (Zhou et al., 2019), multi-view (Zhao et al., 2017) and self-supervised learning (Chen et al., 2019) are now emerging in generic healthcare applications, they have not yet gained enough momentum in multi-omics analysis, hence they remain beyond the scope of this review.

Table 2
The different ML learning approaches reviewed for multi-omics integration.
Learning | Goal | Description
Supervised | Predict new data | Supervised learning involves fitting a model with labelled training data and then using it for prediction. It can be classed as either a regression (predicted variable is numeric) or a classification (predicted variable is categorical) problem (Jiang et al., 2020). The three steps in supervised learning are: (1) fitting a model from the sample input observations, (2) evaluating the model and then extensively tuning the hyper-parameters of the model, and (3) setting up the model for the production stage and using it for prediction (Foster et al., 2014).
Unsupervised | Identify clusters | Unsupervised learning is used to find the underlying patterns in unlabelled data using input feature variables without the target/output variable (Badillo et al., 2020). It can be used for clustering (Xu and Tian, 2015), anomaly detection (Thudumu et al., 2020) and dimensionality reduction (Xu et al., 2019a).
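
The three supervised learning steps listed in Table 2 can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier (chosen here only for brevity, it is not a method prescribed by this review), synthetic two-feature data, and hypothetical class labels "case" and "control". The hyper-parameter tuning part of step 2 is omitted because this simple model has no hyper-parameters.

```python
import random


def fit_centroids(X, y):
    """Step 1: fit a model, here one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Step 3: assign a new sample to the class with the nearest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    """Step 2: evaluate the fitted model on held-out labelled samples."""
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


# Hypothetical toy data: 50 samples per class, two numeric features each.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = ["control"] * 50 + ["case"] * 50

# Hold out every fifth sample for testing; train on the rest.
train_idx = [i for i in range(len(X)) if i % 5 != 0]
test_idx = [i for i in range(len(X)) if i % 5 == 0]

model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
test_accuracy = accuracy(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
new_label = predict(model, [3.8, 4.1])  # prediction for an unseen sample
```

The held-out samples play the role of the testing dataset: they are never seen during fitting, so `test_accuracy` estimates performance on new data rather than on the training set.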

This paper is organised as follows. Section 2 provides a short background related to multi-omics and ML. Section 3 describes how ML is employed for multi-omics analysis and outlines its real-world challenges. In Section 4, details of different multi-omics integration approaches are presented. In Section 5, published multi-omics studies using ML methods are discussed. Section 6 describes a recommendation flowchart for choosing an appropriate method for multi-omics integration. Conclusions are provided in Section 7.

2. Background

2.1. Multi-omics

In living beings, genetic information in the cells flows from DNA (deoxyribo-nucleic acid) to the mRNA (messenger ribo-nucleic acid) to protein and is dictated by the central dogma of molecular biology (Lodish et al., 2000). This flow of information is often considered analogous to a computer system, which has facilitated the understanding of biological information processing (Wang and Gribskov, 2005; D’Onofrio and An, 2010).

The study of DNA, mRNA and proteins is broadly denoted as genomics, transcriptomics, and proteomics respectively. The genetic blueprint of a cell is explored using genomics, which looks at the DNA of individuals and helps us to investigate the presence or absence of certain genes (Gibson, 2015; Vogel and Motulsky, 1997). Transcriptomics studies the transcribed genetic material and examines the genes which are actively expressed and provides information about what is happening at the cellular level (Milward et al., 2016). Proteomics helps in characterising the information flow happening within the cell and the organism in the form of protein pathways and their networks (Wu et al., 2014).

Although metabolomics, lipidomics, and glycomics do not form part of the central dogma analysis (Cobb, 2017), they still provide an invaluable amount of information regarding the metabolites, lipids and glycans (synthesised by the proteome via biosynthetic pathways) (Barh et al., 2011). These substances are the intermediate products of a cell’s information flow and therefore are considered to be excellent indicators of the cell’s activity. Similar to single-genome studies, metagenomics is used to sequence genetic information from environmental samples without the requirement of isolating individual species (Hugenholtz and Tyson, 2008).

All measured omics data can be used as a biomarker which helps us to understand and analyse the underlying characteristics and complexities of biological systems (Alberts et al., 2008). Table 1 shows some of the important omics used to study biological systems (Handelsman et al., 1998; Kuska, 1998; Lindon et al., 2011; “Proteomics, transcriptomics,”, 1999; Vasta and Ahmed, 2008; Wang et al., 2016; Wilkins and Appel, 2007). All of them are part of the same pipeline of biological information, whose output depends on the different inputs and regulation. As shown in Table 1, each of these omics can be measured using specialised high-throughput technologies (for example microarray (Bumgarner, 2013) and mass spectrometry (Glaves et al., 2014) for genomics and

Table 3
The standard ML terminology and related terms.
Term | Definition
Accuracy | It is a ratio of correctly predicted outcomes of a given class to the total outcomes. Accuracy is a measure of the performance of an ML model. It ranges from 0% to 100%.
Classification | It is a supervised learning method that provides predicted output as a discrete class. Classification can be binary, multi-class or multi-label.
Clustering | It is an unsupervised learning method that can group data based on the attributes of the input features.
Cross-Validation | It is a technique that allocates a given set of samples from the dataset which are not used for model training but set aside for testing (to evaluate model performance). K-fold and Leave one out are commonly used cross-validation methods.
Curse of dimensionality | It refers to a set of problems that arise when using datasets with high dimensionality. In the context of ML, it can impact the predictive performance of an ML model (Duda et al., 2001).
Dataset | It is a collection of structured data which comprises input feature variables and sometimes a corresponding target/output variable.
Ensemble Learning | It is a paradigm where different models are trained for solving the same problem and then combined to get better performance. Bagging, boosting and stacking are commonly used ensemble learning methods.
Explainability | Supervised learning models can be classed as ‘white’ or ‘black-box’ based on their explanation (or lack thereof) of how a decision is reached. This is a growing and important domain of research in deep learning.
Feature Selection | It is a process for selecting the most discriminating features without impacting the classification performance.
Hyper-parameter | It is an empirically tuned internal parameter of an ML model.
Imputation | It is a process of replacing missing values in a dataset with a corresponding statistical estimate. Imputation can be done using mean or median values, or by employing methods such as KNN (Crookston and Finley, 2008) or MICE (Azur et al., 2011).
Outlier | It is an extremely low or high value of a feature in a dataset (based on the range and distribution). The performance of ML algorithms is sensitive to outliers, hence their detection and exclusion are crucial (Domingues et al., 2018).
Performance Metric | It is a method to evaluate and compare the performance of ML models. For example, precision, recall/sensitivity, specificity, F1 score, Kappa and mean absolute error.
Regression | It is a supervised learning method that provides predicted output as a continuous value.
Training | It is the first step in the learning process, which uses a training dataset to fit the parameters of a supervised ML model.
Testing | It is the second step in the learning process, which uses a testing dataset (independent of the training dataset) to assess the predictive performance of a trained supervised ML model.
Bias-variance trade-off | In order to achieve optimal prediction performance, a supervised model should ideally have low bias and low variance. A model is over or underfitted when a trade-off is not achieved.
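
To make the accuracy and performance metric entries in Table 3 concrete, the following sketch computes several of them from a binary confusion matrix. The function name `binary_metrics`, the labels "rare" and "common", and the toy predictions are assumptions made for illustration only. The formulas are the standard definitions of precision, recall, specificity, F1 score and Cohen's Kappa.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute common performance metrics, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0

    # Cohen's Kappa: observed agreement corrected for agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical example: 5 rare and 95 common samples, and a trivial
# classifier that always predicts the common class.
y_true = ["rare"] * 5 + ["common"] * 95
y_pred = ["common"] * 100
scores = binary_metrics(y_true, y_pred, positive="rare")
```

In this toy case the accuracy is 0.95 even though no rare sample is detected, while recall, F1 score and Kappa are all 0. This is why accuracy alone can be a misleading summary when one class dominates the dataset.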

Table 4
The commonly used ML algorithms and their attributes. The rank [1 – Low, 2 – Medium, 3 – High, 4 – Very High] given to each attribute is pragmatically assigned based on available literature (Amancio et al., 2014; Barredo Arrieta et al., 2020; de Andrade et al., 2020; Lorena et al., 2011; Rashidi et al., 2019; Sakr et al., 2017).
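Family | Models | Comparative Accuracy | Overfitting Risk | Samples needed | Explainability | Hyper-parameter | Complexity | Implementation Time | Computation Cost
Probability-based (Bayesian) | Bayesian Network | 2 | 2 | 2 | 2 | 3 | 3 | 2 | 3
Probability-based (Bayesian) | Naive Bayes | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 3
Information based (Tree) | Decision Tree | 2 | 3 | 2 | 3 | 2 | 2 | 1 | 2
Information based (Tree) | Random Forest | 3 | 2 | 1 | 3 | 3 | 2 | 1 | 2
Information based (Tree) | Gradient | 3 | 3 | 2 | 1 | 4 | 4 | 2 | 3
Error based (Linear) | Linear Regression | 1 | 3 | 2 | 3 | 1 | 2 | 1 | 2
Error based (Linear) | Logistic | 1 | 3 | 2 | 3 | 1 | 2 | 1 | 2
Error based (Linear) | Partial Linear | 2 | 1 | 3 | 3 | 2 | 2 | 1 | 2
Similarity-based (Instance) | K nearest neighbour | 2 | 3 | 2 | 2 | 2 | 3 | 1 | 1
Similarity-based (Instance) | Self-Organising | 2 | 3 | 2 | 2 | 3 | 3 | 1 | 1
Support Vectors (Kernel) | Linear SVM | 3 | 3 | 3 | 1 | 3 | 2 | 2 | 2
Support Vectors (Kernel) | Non-linear SVM | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3
Neural Network-based | Artificial Neural Network | 3 | 3 | 2 | 1 | 3 | 3 | 3 | 3
Neural Network-based | Deep Learning | 4 | 1 | 4 | 1 | 4 | 4 | 4 | 4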

metabolomics respectively). The table also includes a list of recent reviews on each of these omics. High-throughput generated omics data (Lightbody et al., 2019) has played a pivotal role in developing precision medicine biomarkers for diseases such as Alzheimer’s (Hampel et al., 2017; Hampel et al., 2016; Kovacs, 2016), diabetes (Capobianco, 2017; McCarthy, 2017; Mutie et al., 2017), cancer (Borad and LoRusso, 2017; Senft et al., 2017), hypertension (Barnes et al., 2016; Dominiczak et al., 2017), cardiovascular (Costantino et al., 2017) and chronic respiratory diseases (Agache and Rogozea, 2017; Hanania and Diamant, 2017). Recently, these omics have also been integrated for COVID-19 studies (Barh et al., 2020; Overmyer et al., 2020; Zhou et al., 2020). Many other specialised omics have also emerged such as pharmacogenomics (Wang, 2010), methylomics (Liu et al., 2013), interactomics (Luck et al., 2017) and radiomics (Lambin et al., 2017; Wong et al., 2016).

Overall, these omics provide a complete picture of cell biology and related cellular function (Cox, 2009). This provided the impetus for the development of various software mechanisms which can offer a prediction of a particular phenotype while using the available next-generation multi-omics data (Ritchie et al., 2015). Furthermore, they can be utilised to develop materials and devices which can be used for diagnostic and preventive purposes at the molecular level while targeting molecules with greater accuracy (Giovanni Martinelli et al., 2015).

2.2. Machine learning

Classical statistical modelling has always been the de facto standard choice for health data analysis and its interpretation. In recent years, with the increasing availability of affordable computing power and high-throughput omics data and the success of artificial intelligence technology in various fields, the use of ML has become popular in health sciences (Lee and Yoon, 2017; Clifton et al., 2015; Hung, 2019; Barnett-Itzhaki et al., 2020; Kirchebner et al., 2020). ML can be used to mine information hidden in the experimental data. In contrast, a conventional statistics-based model is usually developed using statistical assumptions and draws an inference about a population from a given dataset (Bzdok, …). The objective of ML methods is to acquire knowledge from historic or present-day data and utilise that understanding to make forecasts or choices for unidentified forthcoming data measures (Gammerman, 2010; Obermeyer and Emanuel, 2016). To assist beginners in the ML domain, a glossary of learning approaches covered in this review (Table 2), standard ML terminology (Table 3) and commonly used ML algorithms (Table 4) are provided. The basic foundations of ML and its uses have been extensively covered in the literature (Bishop, 2006).

ML is employed in a wide range of scenarios, where designing and programming explicit algorithms with optimal results is challenging, such as email filtering (Dada et al., 2019), hand-written optical character recognition (Memon et al., 2020), and computer vision (O’Mahony et al., 2020). Also, it has been deployed for self-driving cars (Badue et al., 2021), cyber-security (Handa et al., 2019), automated assistants such as ‘Siri’, websites that recommend items based on the purchasing decisions of other people and novel solutions to some of the challenging problems of the real world (Watt et al., 2020).

Deep learning has emerged in recent years as the leading class of ML algorithms. It uses neural networks composed of hidden layers performing different operations to find complex representations of data. It has pushed the performance of classifiers beyond that of traditional ML algorithms, especially in scenarios involving large-scale datasets with high dimensionality. On the other hand, it is very computationally intensive, requiring high-throughput or high-performance hardware, and lacks explainability (transparency) in feature selection (black-box approach), in the sense that it is difficult to extract from the network the features that the network has found as mainly responsible for the task, e.g. classification (LeCun et al., 2015). However, in the context of multi-omic integration, deep learning offers an exciting opportunity.

Fig. 1 shows the number of publications indexed on the Web of Science website (Clarivate Analytics, 2020) with different key topics. This information was collected from the Web of Science by entering a

Fig. 1. Number of publications published per year on different search keywords. *For the year 2020, the annual count was extrapolated using the count of publications until October 2020.

different keyword and searching across all databases (Web of Science Core Collection, BIOSIS Citation Index, BIOSIS Previews, Current Contents Connect, Data Citation Index, Derwent Innovations Index, KCI-Korean Journal Database, MEDLINE®, Russian Science Citation Index, SciELO Citation Index and Zoological Record). Although the use of ML in medical science can be dated back to the 1970s (Davenport and Kalakota, 2019), more rapid growth is evident in the past 10 years. Moreover, publications based on ‘multi-omics integration’ and ‘multi-omics and machine learning’ have started to emerge in the last 5 years and have gained popularity in the precision and computational medicine domain. Although deep learning is widely popular in other related domains (such as medical imaging (Erickson et al., 2017) and clinical natural language processing (Wu et al., 2020)), the interest has been more limited for multi-omics analysis (Tan et al., 2020b). This is because multi-omics studies are challenging to deploy as they require specialised high-throughput omic infrastructure (as highlighted earlier in section 2). This fact is reinforced by the evidence that most of the current literature employs deep learning on large-scale multi-omics datasets from open sources such as TCGA (The Cancer Genome Atlas), CCLE (Cancer Cell Line Encyclopaedia) and GDSC (Genomics of Drug Sensitivity in Cancer) for cancer prognosis (Poirion et al., 2018; Seal et al., 2020; Tong et al., 2020; Lee et al., 2020; Zhu et al., 2020) and anti-cancer drug response (Sharifi-Noghabi et al., 2019; Li et al., 2019; Deng et al., 2020).

3. Challenges in multi-omics analysis using machine learning

The use of ML to analyse high-throughput generated multi-omic data poses key unique challenges. They can be summarised as follows.

3.1. Heterogeneity, sparsity and outliers

Multi-omic data from different high-throughput sources are usually heterogeneous (Bersanelli et al., 2016). For example, transcriptomics and proteomics use different normalisation and scaling techniques before omics analysis. This leads to different dynamic ranges and data distribution. Also, some omics are more prone to generating sparse data than others (e.g. in the case of metabolomics, some values might be below the limit of detection and hence assigned a null value (Antonelli et al., 2019)). Therefore, imputation (Liew et al., 2011) and outlier detection (Vivian et al., 2020) should be considered for each omic separately, before planning their integration.

3.2. Class imbalance and overfitting

In disease classification, certain disease classes are rarer than others, which can cause a class imbalance in the multi-omics dataset (Haas et al., 2017). For example, primary hypertension is the most common form of hypertension with 95% prevalence while endocrine hypertension occurs in only 5% (Rimoldi et al., 2014). An ML model trained using an imbalanced dataset may be overfitted, i.e. show high accuracy for training data but underperform for unseen test data. Therefore, to classify these two types of hypertension one of the following approaches can be used: 1) collect more data if possible, or 2) consider using weighted or normalised metrics to measure the ML performance (such as F1-Score or Kappa (Jeni et al., 2013)), or 3) consider over or under-sampling the under or over-represented class respectively, or 4) consider synthetic sample generation (such as SMOTE (Chawla et al., 2002) or ADASYN (Haibo He et al., 2008)) for the under-represented class. Similarly, techniques such as regularisation, bagging, hyperparameter tuning and cross-validation can be used to balance the bias-variance trade-off (Lee, 2010). Any of the above approaches can be used, depending on data and problem, to overcome the class imbalance and overfitting problems.

3.3. More features than data (p >> n)

Most multi-omics datasets suffer from the classical ‘curse of dimensionality’ problem, i.e. having far fewer observation samples (n) than

multi-omics features (p) (Misra et al., 2019). The resulting high-dimensional space often contains correlated features which are redundant and can mislead the algorithm training (James et al., 2017). The dimensional space of the data can be reduced by employing dimensionality reduction techniques such as feature extraction and feature selection. Feature extraction refers here to techniques computing a subset of representative features which summarise the original dataset and its dimensions. These features are functions of the original ones, for instance, PCA (principal component analysis) (Jolliffe, 2002), LDA (linear discriminant analysis) (Martinez and Kak, 2001) and MDS (multidimensional scaling) (Young and Hamer, 1987). On the other hand, feature selection finds a subset of the original features that maximises the accuracy of a predictive model (Guyon and Elisseeff, 2003). It can be based on prior knowledge, i.e. evident from known literature, or based on a database such as Biofilter (Bush et al., 2009). Formally, feature selection methods can be classed as filter (Information gain (Roobaert et al., 2006), ReliefF (Beretta and Santaniello, 2011), Chi-square statistics (Lee et al., 2011)), wrapper (Recursive feature elimination (Guyon et al., 2002), Sequential feature selection (Pudil et al., 1994)) and embedded (such as LASSO (Least Absolute Shrinkage and Selection Operator) (Zou, 2006)) techniques. Xu et al. (Xu et al., 2019b) and Stańczyk (Stańczyk and Jain, 2015) provide excellent resources for understanding and exploring the use of different dimensionality reduction techniques in the generic ML domain. Meng (Meng et al., 2016b) offers a review of these methods from the perspective of multi-omics data analysis.

3.4. Computation and storage cost

The use of ML for multi-omics analysis comes with computational and data storage costs (Herrmann et al., 2020). Most ML algorithms require high computation power and large volumes of storage capacity to save the logs, results and analysis. In recent years, it has become possible to deploy ML models on dedicated graphics processing units (Schmidhuber, 2015) and cloud computing platforms (Armbrust et al., 2010) such as Amazon EC2 (“Amazon EC2,”, 2021), Microsoft Azure (“Cloud Computing Services | Microsoft Azure,”, 2021a) and Google Cloud Platform (“Cloud Computing Services,”, 2021b). The related costs should be considered well in advance before planning an ML-based multi-omics workflow.

3.6. Translating ML: bench to bedside

Various ML-based multi-omics publications have emerged in the past 5 years (see Fig. 1) and some use performance metrics such as decision curve (Vickers and Elkin, 2006) and calibration (Dankers et al., 2019) analytics to evaluate their diagnostic utility. Still, only very few have been translated into clinical practice, for example, Idx (diabetic retinopathy detection), FerriSmart (measure liver iron concentration) and SubtleMR (image processing software for radiology) (Benjamens et al., 2020; Hamamoto et al., 2020).

One of the key issues which hinder the clinical deployment of ML methods is transparency and explainability (Black box medicine and transparency, 2020). A transparent and explainable ML algorithm seems essential to building trust for clinical decision making (Gunning et al., 2019). Recently, the U.S. Food and Drug Administration (FDA) has issued the “Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device Action Plan” to ensure that the deployment of ML-based products is safe for patients and better assists health care providers (Health, 2021). A recent in-depth analysis by Muehlematter showed that most of the FDA approved and “Conformité Européenne” marked ML products are in the field of radiology. It also highlighted the key differences between U.S. and European policy implications around the approval of AI/ML-based devices (Muehlematter et al., 2021).

All the above challenges directly impact the use of ML for multi-omics analysis. However, there are a few other challenges related to multi-omics studies which are not ML related, such as study design (Haas et al., 2017), multi-site sample collection & management (Pinu et al., 2019), multi-site data sharing and governance (Saulnier et al., 2019), visualisation (Mougin et al., 2018), ethical standards (Lévesque et al., 2018), and finally making the research reproducible (Conesa and Beck, 2019) and translational (Schumacher et al., 2014). A broader checklist of criteria is investigated by McShane, focussing on various aspects ranging from specimen requirements and predictive model development to clinical trial designing and related regulatory approvals (McShane et al., 2013b; McShane et al., 2013a).

4. Data integration methods for multi-omics

In recent years, various new data integration methods have been introduced from the modern developments in mathematical, statistical
and computational sciences. For the benefit of the readers, Table 5 in­
cludes a summary of a few reviews which cover the breadth of multi-
3.5. What algorithm works best for what conditions? omics integration for generic as well as specialised domains such as
oncology (Buescher and Driggers, 2016; Nicora et al., 2020) and toxi­
The commonly used ML algorithms have different attributes cology (Canzler et al., 2020). Most of these reviews have strived to
(Table 4) and therefore it is crucial to choose an appropriate algorithm introduce different categorical terminologies (for example: “early’, ‘late’
for the multi-omics analysis. In the literature, many reviews cover the and ‘intermediate’ in (Gligorijević and Pržulj, 2015) or ‘bottom-up’ and
key strengths and weaknesses of different ML algorithms using single ‘top-down’ in (Yu and Zeng, 2018)) which enable them to group the
omics (Amancio et al., 2014; López Pineda et al., 2015; Sakr et al., 2017; integration methods based on different factors/parameters.
Uddin et al., 2019) and multi-omics (Ma et al., 2016; Francescatto et al., As mentioned earlier, this section adopts the categorical terminol­
2018; Xu et al., 2019a; Sathyanarayanan et al., 2020) datasets. Most of ogies from Ritchie (Ritchie et al., 2015) and builds upon it to summarise
them use a systematic workflow that involves simultaneous performance a complete spectrum of recent integration methods. It concisely covers
evaluation of different algorithms using a common dataset. Since each them giving a clear perspective to a new interdisciplinary user. The
multi-omics dataset is unique, using a similar workflow could allow the various integration methods are classed as either ‘concatenation-’,
selection of the best-suited algorithm. Later, in Section 6 a recommen­ ‘model-’ or ‘transformation’-based and described below in detail.
dation flowchart is proposed which can help the inter-disciplinary user
to choose from available methods. 4.1. Concatenation-based integration methods
Recently, various artificial intelligence-driven automated ML plat­
forms and tools (Feurer et al., 2015; Olson et al., 2018; Waring et al., Concatenation-based integration methods consider developing a model
2020) have also emerged which can be utilised to exhaustively search using a joint data matrix which is formed by combining multiple omics
for the best ML model and corresponding parameter tuning, however, datasets. Fig. 2 shows the stages of concatenation-based integration.
they are computationally expensive. Stage 1 includes the raw data from three individual omics (e.g. genomics,
proteomics, and metabolomics) along with the corresponding pheno­
typic information. Commonly, concatenation-based integration does not
We note that “feature extraction” has a different meaning in image pro­ require any pre-processing and hence does not have a Stage 2. In Stage 3,
cessing and computer vision. the data from the individual omics is concatenated to form a single large

Table 5
Summary table of few reviews in multi-omics integration
Year of review | Review reference | Terminology introduced for classifying various integration methods | Omics reviewed (Genomics, Transcriptomics, Metabolomics, Proteomics, Epigenomics, Interactomics, Metagenomics, Lipidomics, Phosphoproteomics) | Application domain covered
2009 | (Van Deun et al., 2009) | Matrix decomposition | ✓ | Micro-organism (Escherichia coli)
2009 | (Ebbels and Cavill, 2009) | ‘Conceptual’, ‘statistical’ & ‘model’ | ✓ ✓ | Generic
2012 | (Lussier and Li, 2012) | ‘Cross-scale’ & ‘multi-scale’ | ✓ ✓ | Prediction of clinical outcomes.
2015 | (Ritchie et al., 2015) | ‘Concatenation’, ‘transformation’ & | ✓ ✓ | Generic
2015 | (Gligorijević and Pržulj, | ‘Early’, ‘late’ & ‘intermediate’ | ✓ ✓ | Generic
2016 | (Bersanelli et al., 2016) | ‘Sequential’, ‘simultaneous’, ‘network-based versus network-free’ & ‘Bayesian vs non- | ✓ ✓ ✓ | Generic
2016 | (Gligorijević et al., 2016) | - | ✓ ✓ ✓ | Disease subtyping, biomarkers discovery & drug
2016 | (Buescher and Driggers, 2016) | - | ✓ ✓ ✓ ✓ | Cancer biology
2017 | (Lin and Lane, 2017) | - | ✓ ✓ ✓ ✓ | Investigated (Ritchie et al., 2015) from ML
2017 | (Huang et al., 2017b) | - | ✓ ✓ ✓ | Patient survival prediction
2017 | (Hasin et al., 2017) | ‘Genome’, ‘phenotype’ & ‘environment-’ first | ✓ ✓ ✓ ✓ ✓ | Generic
2018 | (Yu and Zeng, 2018) | ‘Bottom-up’ & ‘top-down’ mode | ✓ ✓ ✓ ✓ | Generic
2018 | (Kim and Tagkopoulos, 2018) | ‘Data-to-data’, ‘data-to-knowledge’ & ‘knowledge-to- | ✓ ✓ ✓ ✓ ✓ ✓ | Generic
2018 | (Rappoport and Shamir, 2018) | - | ✓ ✓ ✓ | Cancer benchmarking
2019 | (Tini et al., 2019) | - | ✓ ✓ ✓ ✓ | Multiple (Mitochondrial metabolism, Platelet reactivity & Breast
2019 | (Mirza et al., 2019) | - | ✓ ✓ ✓ ✓ ✓ | Generic (more focussed on ML)
2019 | (López de Maturana et al., | OnO (omics & non-omics) | ✓ ✓ ✓ ✓ ✓ ✓ | Generic
2019 | (Wu et al., | | ✓ ✓ ✓ ✓ | Generic
Table 5 (continued)
Year of review | Review reference | Terminology introduced for classifying various integration methods | Omics reviewed | Application domain
Entries recoverable from the continued table: review references (Canzler et al.,; (Nicora et al.,; (Eicher et al.,; (Nguyen and Wang, 2020); (Jamil et al.,. Terminology: ‘Vertical’, ‘horizontal’,; ‘parallel’ &; ‘Single-view’ & ‘multi-; ‘Element’, ‘pathway’ and ‘mathematical’ based approach. Application domains: Generic (more focussed on ML); Plant systems.

matrix of multi-omics data. Finally, in Stage 4 the joint matrix is used for supervised or unsupervised analysis. The main advantage of using concatenation-based methods is the simplicity of employing ML for analysing continuous or categorical data, once the concatenation of all individual omics is completed. These methods use all the concatenated features equally and can select the most discriminating features for a given phenotype.

The different concatenation-based integration methods can be further classed as:

4.1.1. Supervised learning concatenation-based methods

Different concatenation-based supervised learning methods have been used for phenotypic prediction. In scenarios where the number of features in the joint matrix is high, different feature selection methods described in Section 3 can be employed during concatenation (Sorzano et al., 2014).

The concatenated multi-omics data (in the form of a joint matrix) is provided as input to different classical ML methods such as DT (decision tree) (Quinlan, 1993), NB (naive Bayes) (Domingos and Pazzani, 1997), ANN (artificial neural networks) (Bishop, 1995), SVM (support vector machine) (Vapnik, 1995), KNN (k-nearest neighbors) (Altman, 1992), RF (random forest) (Breiman, 2001) and K-Star (Cleary and Trigg, 1995) in the literature (Kim and Tagkopoulos, 2018; Lin and Lane, 2017; Auslander et al., 2016; Acharjee et al., 2016; Zhang et al., 2018; Ding et al., 2018; Wang et al., 2020). For example, a joint matrix of multi-omics features (which included gene expression, copy number variation and mutation) was used with classical RF and SVM to predict anti-cancer drug response (Stetson et al., 2014).

Similarly, multivariate LASSO models (Zou, 2006; Nicolai and Peter, 2010; Mankoo et al., 2011) have been investigated. Also, Boosted trees (Elith et al., 2008) and SVR (support vector regression) (Awad and Khanna, 2015) have been investigated for finding the longitudinal predictors of glycaemic health (Prelot et al., 2018).

Other than classical ML algorithms, deep neural networks (Tang et al., 2019) have also been widely used to analyse concatenated multi-omics data. They have been studied to identify robust survival subgroups of liver cancer using RNA, miRNA and methylation data (Chaudhary et al., 2017).

4.1.2. Unsupervised learning concatenation-based methods

Various concatenation-based unsupervised methods have been used for clustering and association analysis. Different matrix factorisation-based methods have evolved in recent years. Joint NMF (non-negative matrix factorisation) (Zhang et al., 2012) was proposed to integrate multi-omics data with non-negative values. It involved decomposing the joint matrix into loadings and factors, bringing the different omics into a common basis matrix. Joint NMF is computationally slow and needs large memory allocation.

Similarly, Shen (Shen et al., 2009) proposed the iCluster framework, which uses principles similar to NMF but allows integration of datasets having negative values. They showed the functioning of the framework by using copy number, mRNA expression and methylation data to conduct a cancer subtype discovery in glioblastoma. This framework was also employed for a landmark study that used genomic and transcriptomic data from 2,000 breast tumours and discovered novel subgroups amongst them (Curtis et al., 2012).

Later, the iCluster+ framework by Mo (Mo et al., 2013) offered a significant enhancement over the iCluster framework. The iCluster+ framework can discover patterns and combine a range of omics having binary, categorical and continuous values and was demonstrated by combining genomic data from the colorectal cancer datasets.

Another adaptation of NMF was evaluated as JIVE (Joint and Individual Variation Explained) which captures joint variation across integrating data types and structural variation of each data type along with the residual noise (Lock et al., 2013). It was used to investigate gene expression and miRNA data on brain tumour samples. The sparsity
Fig. 2. Workflow pipelines for different types of integration methods for multi-omics analysis.
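As a minimal sketch of the concatenation-based pipeline in Fig. 2 (Stages 1, 3 and 4), the example below builds a joint matrix from three synthetic omics blocks and trains a classifier on it. Everything in it is an illustrative assumption rather than a method from the reviewed studies: the block names, sizes and planted signal are invented, each block is z-scored with training-set statistics before concatenation (following the normalisation point in Table 6), and a simple nearest-centroid classifier stands in for the RF and SVM models cited in Section 4.1.1 so that only NumPy is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: three hypothetical omics blocks measured on the same 60 samples.
n_samples = 60
labels = np.repeat([0, 1], n_samples // 2)  # binary phenotype
omics_blocks = {
    "genomics": rng.normal(size=(n_samples, 50)),
    "proteomics": rng.normal(size=(n_samples, 30)),
    "metabolomics": rng.normal(size=(n_samples, 20)),
}
# Plant a weak class signal in the first five features of every block.
for block in omics_blocks.values():
    block[labels == 1, :5] += 1.0

# Hold out a third of the samples for evaluation.
order = rng.permutation(n_samples)
train, test = order[:40], order[40:]


def scale_with_train(block):
    """Z-score every feature using training-set mean and standard deviation."""
    mean = block[train].mean(axis=0)
    std = block[train].std(axis=0)
    std[std == 0] = 1.0
    return (block - mean) / std


# Stage 3: concatenate the blocks column-wise into one joint matrix.
joint = np.hstack([scale_with_train(b) for b in omics_blocks.values()])

# Stage 4: supervised analysis on the joint matrix (nearest-centroid stand-in).
centroids = np.stack(
    [joint[train][labels[train] == c].mean(axis=0) for c in (0, 1)]
)
distances = np.linalg.norm(
    joint[test][:, None, :] - centroids[None, :, :], axis=2
)
predicted = distances.argmin(axis=1)
accuracy = (predicted == labels[test]).mean()

# Rank joint features by the absolute difference between class centroids.
top_features = np.argsort(-np.abs(centroids[1] - centroids[0]))[:10]
print(f"joint matrix shape: {joint.shape}, held-out accuracy: {accuracy:.2f}")
```

Because all blocks end up in one matrix, the ranking in `top_features` is over features from every omics at once, which is the sense in which concatenation-based methods treat the concatenated features equally.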

problem in JIVE was improved by JBF (Joint Bayes Factor) (Ray et al., 2014). JBF used joint factor analysis to evaluate the feature space and converted it into shared and datatype-specific components.

The MoCluster method proposed by Meng (Meng et al., 2016a) uses multi-block multivariate analysis for highlighting the patterns across different input omics data and then finds the joint clusters amongst them. MoCluster was validated by integrating proteomic and transcriptomic data and shows a noticeably higher clustering accuracy and lower computation cost in comparison to both iCluster and iCluster+.

Fridley (Fridley et al., 2012) has studied the genomic effects due to the gemcitabine drug using high-throughput data from mRNA expression and SNPs. They integrated these two datasets into one large input matrix and developed a Bayesian pathway analysis that uses a stochastic search variable selection. They proposed that the Bayesian integrative model offers better performance in detecting the genomic effects in comparison to using conventional single-omics analysis. Similarly, Zhu (Zhu et al., 2012) has also explored BN (Bayesian network) to understand cell regulation in yeast using metabolomics and transcriptomics data.

Also, LRAcluster (Wu et al., 2015) was developed to integrate high-dimensional multi-omics data and find a low-dimensional manifold to identify molecular subtypes of cancer.

Recently, iClusterBayes was introduced by Mo (Mo et al., 2018), which is a fully Bayesian latent variable model. It overcomes the limitations of iCluster+ in terms of statistical inference and computational speed. iClusterBayes includes a binary indicator prior for variable selection and generalises to binary data and count data. Also, Argelaguet (Argelaguet et al., 2018) has developed MOFA (Multi-Omics Factor Analysis) which disentangles the heterogeneity shared across different omics to discover the principal source of variability. It can integrate partially overlapping datasets.
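The idea behind joint NMF described in Section 4.1.2, where the joint matrix is decomposed so that all omics share a common basis matrix, can be sketched as follows. This is an illustration under stated assumptions and not the algorithm of Zhang et al. (2012): the two non-negative blocks are synthetic, the rank is fixed by hand, and plain multiplicative-update NMF on the column-wise concatenation is used as a stand-in, which yields one shared sample-level basis `W` and one coefficient matrix per omics block.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical non-negative omics blocks sharing the same 40 samples.
n_samples, rank = 40, 3
true_basis = rng.random((n_samples, rank))
blocks = [
    true_basis @ rng.random((rank, 25)),
    true_basis @ rng.random((rank, 15)),
]

# Concatenate the blocks so that one basis matrix W is shared by all omics.
joint = np.hstack(blocks)


def nmf(matrix, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise matrix ~ W @ H with multiplicative updates (Frobenius loss)."""
    init = np.random.default_rng(seed)
    W = init.random((matrix.shape[0], rank))
    H = init.random((rank, matrix.shape[1]))
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ matrix) / (W.T @ W @ H + eps)
        W *= (matrix @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(matrix - W @ H))
    return W, H, errors


W, H, errors = nmf(joint, rank)

# Split H back into one coefficient matrix per omics block.
split_points = np.cumsum([b.shape[1] for b in blocks])[:-1]
H_blocks = np.hsplit(H, split_points)

# Assign each sample to the basis component with the largest weight.
clusters = W.argmax(axis=1)
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

The full joint matrix and both factors are held in memory and updated at every iteration, which gives a feel for why the text describes joint NMF as slow and memory-hungry on real multi-omics data.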

4.2. Model-based integration methods

Model-based integration methods create multiple intermediate models for the different omics data and then build a final model from various intermediate models (Fig. 2). Stage 1 sets up the raw data from the three individual omics along with the corresponding phenotypic information. In Stage 2, individual models are developed for each of the omics which are later integrated into a joint model in Stage 3. Finally, in Stage 4 the joint model is analysed. The major advantage of model-based integration methods is that they can be used for merging models based on different omic types, where each model is developed from a different patient group having the same disease information (He et al., 2016; Ritchie et al., 2015).

Model-based integration approaches facilitate the understanding of interactions amongst different omics for a certain phenotype (for example, survival in pancreatic cancer). The final multi-dimensional joint model in Stage 4 can be built using an ML algorithm (such as neural networks) which uses the most relevant variables from each omics model (from Stage 3). This approach allows the analysis of the improvement in the predictive power for individual models and also finds the best discriminating features.

The different model-based integration methods can be further classed as follows.

4.2.1. Supervised learning model-based methods

Model-based supervised learning methods include a variety of frameworks for developing a model, such as majority-based voting (Drăghici and Potter, 2003), hierarchical classifiers (Bavafaye Haghighi et al., 2019) and ensemble-based approaches (such as XGBoost (Ma et al., 2020b) and KNN (Shen and Chou, 2006)).

Deep learning methods have also been adopted for model-based supervised learning (Poirion et al., 2020). The MOLI (multi-omics late integration) (Sharifi-Noghabi et al., 2019) method used type-specific encoding sub-networks to learn features from somatic mutation, CNA and gene expression data independently and then later concatenated them for predicting the response to a given drug. Lee (Lee et al., 2020) has proposed a deep learning-based auto-encoding approach for integrating four omics to create a survival prediction model. Also, the HI-DFNForest (hierarchical integration deep flexible neural forest) framework (Xu et al., 2019a) was developed which uses a stacked auto-encoder (Vincent et al., 2010) to learn high-level representations from three omic datasets. Later, these representations are integrated to predict cancer subtype classification. Similarly, Chaudhary (Chaudhary et al., 2017) has used autoencoders along with SVM for survival prediction in subgroups of hepatocellular carcinoma.

In the past years, ATHENA (Analysis Tool for Heritable and Environmental Network Associations) was developed for analysing multi-omics data (Chung and Kang, 2019; Holzinger et al., 2014). It uses grammatical evolution neural networks along with Biofilter (Bush et al., 2009) and Random Jungle (Schwarz et al., 2010) to investigate different categorical and quantitative variables and develop prediction models. Recently, MOSAE (Multi-omics Supervised Autoencoder) (Tan et al., 2020a) was developed for pan-cancer analysis and compared with conventional ML methods such as SVM, DT, naïve Bayes, KNN, RF and AdaBoost. Similarly, a denoising autoencoder has been incorporated along with L1-penalized logistic regression for identifying ovarian cancer subtypes (Guo et al., 2020).

4.2.2. Unsupervised learning model-based methods

Various model-based unsupervised learning methods have been implemented in the past. PSDF (Patient-Specific Data Fusion) (Yuan et al., 2011) is a non-parametric Bayesian model for clustering prognostic cancer subtypes by combining gene expression and copy number variation data. It uses a two-step process and limits the integration to only two datatypes. Similarly, CONEXIC (Akavia et al., 2010) also uses a BN to integrate gene expression and copy number variation from tumour samples to identify driver mutations. On the other hand, clustering methods such as FCA (Formal Concept Analysis) consensus clustering (Hristoskova et al., 2014), MDI (Multiple Dataset Integration) (Kirk et al., 2012), PINS (Perturbation clustering for data integration and disease subtyping) (Nguyen et al., 2017), PINS+ (Nguyen et al., 2019) and BCC (Bayesian consensus clustering) (Lock and Dunson, 2013) are more flexible and allow late-stage integration of clusters.

Different network-based methods are also available for association analysis. Lemon-Tree (Bonnet et al., 2015) implemented ensemble methods for reconstructing module networks which used somatic copy number alterations and gene expression in brain tumour samples. Furthermore, SNF (Similarity Network Fusion) (Wang et al., 2014) constructs a network of samples for each data type and then fuses them into a joint network which denotes the complete range of original data. It combines mRNA expression, DNA methylation and microRNA (miRNA) expression data from cancer datasets.

4.3. Transformation-based integration methods

Transformation-based integration methods first transform each of the omics datasets into graphs or kernel matrices and then combine all of them into one before constructing a model.

Fig. 2 shows the various stages of transformation-based integration. Stage 1 sets up the raw data from the three individual omics along with the corresponding phenotypic information. In Stage 2, the individual transformations (in the form of graph or kernel relationship) are developed for each of the omics which are later integrated into a joint transformation in Stage 3. Finally, in Stage 4 it is analysed. The primary advantage of the transformation-based integration methods is that they can be used to combine a wide range of omics if unique information (such as patient ID) is available.

Graphs provide a formal means to transform and portray relationships between different omics samples where the nodes and edges of a graph represent the subjects and their relationships, respectively. Similarly, kernel methods enable the transformation of data from its original space into a higher dimensional feature space. These methods then explore linear decision functions in the feature space which were non-linear in the original space.

The transformation-based integrative methods can be classed as follows.

4.3.1. Supervised learning transformation-based methods

In the past, various transformation-based supervised learning methods have been presented. Most of them are kernel- and graph-based algorithms (Yan et al., 2017). The kernel-based integration approaches include SDP-SVM (Semi-Definite Programming SVM) (Lanckriet et al., 2004), FSMKL (Multiple Kernel Learning with Feature Selection) (Seoane et al., 2014), RVM (Relevance Vector Machine) (Bowd et al., 2005; Tipping, 2001) and Ada-boost RVM (Wu et al., 2010). Moreover, fMKL-DR (fast multiple kernel learning for dimensionality reduction) (Giang et al., 2020) has been used along with SVM for combining gene expression, miRNA expression, and DNA methylation data. Similarly, the graph-based integration approaches consist of graph-based SSL (semi-supervised learning⁷) (Tsuda et al., 2005; Culp and Michailidis, 2008; Kim et al., 2015; Yue et al., 2017; Bhardwaj and Van Steen, 2020), graph sharpening (Shin et al., 2010; Shin et al., 2007), composite network (Mostafavi and Morris, 2010) and BN (Rhodes et al., 2005).

Overall, it is evident from the literature that kernel-based algorithms have superior performance to graph-based approaches, but they usually need more time for the training phase. In contrast, graph-based approaches can disclose the relations between samples while taking less computation time. Yan (Yan et al., 2017) provides an extensive

⁷ For the sake of simplicity, the semi-supervised integration methods (graph-based) are grouped under supervised learning.

Table 6
The advantages and disadvantages of using different integrative methods.

Concatenation-based
Advantages:
• Easy and straightforward.
• Enables the use of classical supervised and unsupervised methods.
Disadvantages:
• Ideally, requires all omics data for all patients.
• Need proper normalisation before concatenation.
• Does not consider the unique distribution of each omics.
• Memory and computation-intensive when the concatenated matrix is

Model-based
Advantages:
• Facilitates the understanding of interactions amongst different omics.
• Omics data can be from a different set of patients with a similar phenotype.
• Does not increase dimensional complexity.
Disadvantages:
• Not effective if omics data is extremely heterogeneous.
• Could lead to an overfitted solution
• Weak signals could be lost.

Transformation-based
Advantages:
• Graph representation easy to understand and computationally less intensive.
• Kernel methods provide superior performance.
• Multi-omics data for the same patient can be used for their disease subgroup
Disadvantages:
• Kernel methods are computationally more intensive than graph methods.
• Transformation can be sometimes challenging.
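To make the model-based route of Section 4.2 concrete, the sketch below fits one intermediate model per omics block (Stage 2), integrates them by majority voting (Stage 3) and compares the joint model with the individual ones (Stage 4). It is a hedged illustration only: the three blocks, their names and the planted signal are synthetic assumptions, and a nearest-centroid classifier is used as a hypothetical intermediate model in place of the classifiers cited in Section 4.2.1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stage 1: three hypothetical omics blocks with a shared binary phenotype.
n_samples = 90
labels = np.repeat([0, 1], n_samples // 2)
omics_blocks = {
    "mutation": rng.normal(size=(n_samples, 40)),
    "copy_number": rng.normal(size=(n_samples, 30)),
    "expression": rng.normal(size=(n_samples, 60)),
}
for block in omics_blocks.values():
    block[labels == 1, :5] += 0.8

order = rng.permutation(n_samples)
train, test = order[:60], order[60:]


def fit_centroids(features, targets):
    """Intermediate model for one omics block: one centroid per class."""
    return np.stack([features[targets == c].mean(axis=0) for c in (0, 1)])


def predict(centroids, features):
    """Return the index of the nearest class centroid for every sample."""
    distances = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2
    )
    return distances.argmin(axis=1)


# Stage 2: fit one intermediate model per omics block.
models = {
    name: fit_centroids(block[train], labels[train])
    for name, block in omics_blocks.items()
}
votes = np.stack(
    [predict(models[name], block[test]) for name, block in omics_blocks.items()]
)

# Stage 3: integrate the intermediate models by majority voting.
majority = (votes.sum(axis=0) >= 2).astype(int)

# Stage 4: compare each intermediate model with the joint model.
per_omics_accuracy = {
    name: (v == labels[test]).mean() for name, v in zip(omics_blocks, votes)
}
joint_accuracy = (majority == labels[test]).mean()
print(per_omics_accuracy, f"majority vote: {joint_accuracy:.2f}")
```

Since each intermediate model only ever sees its own block, the feature dimension handled by any single model does not grow with the number of omics, in line with the "does not increase dimensional complexity" entry of Table 6.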

comparison between different graph- and kernel-based integration approaches in a supervised learning context using various standardised test datasets. It highlights the better classification performance of RVM, Ada-boost RVM and SDP-SVM in comparison to SSL, graph sharpening, composite network and BN.

Recently, MORONET (Multi-Omics gRaph cOnvolutional NETworks) (Wang et al., 2020) was introduced, which uses graph convolutional networks to take advantage of the omics features and the associations among patients (as defined by the patient similarity networks) for better classification results.

4.3.2. Unsupervised learning transformation-based methods

Different transformation-based unsupervised methods have been introduced. Some of them are kernel- and graph-based methods. Lately, rMKL-LPP (regularised multiple kernel learning for Locality Preserving Projections) (Speicher and Pfeifer, 2015) was implemented for clustering analysis. It used an individual kernel for each omics along with a graph embedding framework to identify biologically meaningful subgroups for five different cancer types. Similarly, PAMOGK (Tepeli et al., 2019) was developed for integrating multi-omics data with pathways using a graph kernel, SmSPK (smoothed shortest path graph kernel). It used somatic mutations, transcriptomics and proteomics data to find subgroups of kidney cancer.

Meta-SVM (Meta-analytic SVM), proposed by Kim (Kim et al., 2017), integrates multiple omics data and is able to detect consensus genes associated with diseases across studies such as breast cancer and idiopathic pulmonary fibrosis. Recently, NEMO (NEighborhood based Multi-Omics clustering) (Rappoport and Shamir, 2019) was introduced, which uses an inter-patient similarity matrix-based distance metric for evaluating the input omic datasets individually. These omics matrices are then combined into one matrix and then analysed using spectral-based clustering. It can work on partial data sets (no imputation needed), where measurements are only available for a subset of omics data.

Table 6 highlights the advantages and disadvantages of various integration methods. Table 7 summarises various multi-omics integration methods based on learning type.

5. Application of integrative methods in multi-omics studies

The availability of high-throughput omics provides a unique opportunity to explore the complex relationships between different omics and phenotypic targets instead of mono-omics evaluation. This section describes various multi-omics studies which deployed methods investigated in the previous section. Table 8 summarises different phenotypic target-based, multi-omics studies published and tabulates them across the span of 7 main omics, namely genomics, transcriptomics, metabolomics, proteomics, glycomics, lipidomics and epigenomics. Genomics is further divided into gene expression, DNA methylation, somatic point mutation and copy number alteration. Similarly, transcriptomics is further classed into mRNA, miRNAs (microRNAs) and lncRNAs (long non-coding RNAs). The various multi-omics studies are broadly

Table 7
The summary of multi-omics integration methods based on learning type. For abbreviations please refer to List of Abbreviations.

Supervised learning

Concatenation-based
• Classical ML (DT (Quinlan, 1993), NB (Domingos and Pazzani, 1997), ANN (Bishop, 1995), SVM (Vapnik, 1995), KNN (Altman, 1992), K-Star (Cleary and Trigg, 1995))
• LASSO (Zou, 2006; Nicolai and Peter, 2010; Mankoo et al., 2011)
• BT (Elith et al., 2008)
• SVR (Awad and Khanna, 2015)
• DNN (Tang et al., 2019)

Model-based
• Majority-based voting (Drăghici and Potter, 2003)
• Hierarchical Classifiers (Bavafaye Haghighi et al., 2019)
• Ensemble-based classifiers (XGBoost (Ma et al., 2020a) and KNN (Shen and Chou, 2006))
• MOLI (Sharifi-Noghabi et al., 2019)
• HI-DFNForest (Xu et al., 2019a)
• ATHENA (Chung and Kang, 2019; Holzinger et al., 2014)

Transformation-based
• SDP-SVM (Lanckriet et al., 2004)
• FSMKL (Seoane et al., 2014)
• RVM (Bowd et al., 2005; Tipping, 2001)
• Ada-boost RVM (Wu et al., 2010)
• fMKL-DR (Giang et al., 2020)
• SSL (Tsuda et al., 2005; Culp and Michailidis, 2008; Kim et al., 2015; Yue et al., 2017; Bhardwaj and Van Steen, 2020)
• Graph sharpening (Shin et al., 2010; Shin et al., 2007)
• Composite network (Mostafavi and Morris,
• BN (Rhodes et al., 2005)
• MORONET (Wang et al., 2020)

Unsupervised learning

Concatenation-based
• Joint NMF (Zhang et al., 2012)
• iCluster (Shen et al., 2009)
• iCluster+ (Mo et al., 2013)
• JIVE (Lock et al., 2013)
• JBF (Ray et al., 2014)
• BN (Fridley et al., 2012; Zhu et al., 2012)
• MoCluster (Meng et al., 2016a)
• iClusterBayes (Mo et al., 2018)
• MOFA (Argelaguet et al., 2018)

Model-based
• PSDF (Yuan et al., 2011)
• FCA consensus clustering (Hristoskova et al., 2014)
• MDI (Kirk et al., 2012)
• BCC (Lock and Dunson, 2013)
• Lemon-Tree (Bonnet et al., 2015)
• SNF (Wang et al., 2014)

Transformation-based
• rMKL-LPP (Speicher and Pfeifer, 2015)
• PAMOGK (Tepeli et al., 2019)
• Meta-SVM (Kim et al., 2017)
• NEMO (Rappoport and Shamir, 2019)
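For the kernel-based members of the transformation-based family listed above, the four stages of Section 4.3 can be sketched as follows. The example is a simplified illustration under explicit assumptions and does not reproduce SDP-SVM, FSMKL or any other cited method: the two omics blocks are synthetic, each block is turned into an RBF kernel with a hand-picked bandwidth, the kernels are fused by an unweighted mean (the simplest possible multiple-kernel combination), and a kernel ridge classifier stands in for the SVM used in the literature. Rows are assumed to be matched across blocks by a shared patient ID.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stage 1: two hypothetical omics blocks on the same, identically ordered patients.
n_samples = 80
labels = np.repeat([-1.0, 1.0], n_samples // 2)
blocks = [
    rng.normal(size=(n_samples, 30)),
    rng.normal(size=(n_samples, 20)),
]
for block in blocks:
    block[labels == 1, :4] += 1.0


def rbf_kernel(features):
    """Stage 2: transform one omics block into a patient-by-patient kernel."""
    sq_norms = (features ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    gamma = 1.0 / features.shape[1]       # assumed bandwidth heuristic
    return np.exp(-gamma * sq_dists)


# Stage 3: fuse the per-omics kernels; an unweighted mean is the simplest choice.
fused = np.mean([rbf_kernel(b) for b in blocks], axis=0)

# Stage 4: kernel ridge classifier trained on the fused kernel.
order = rng.permutation(n_samples)
train, test = order[:55], order[55:]
ridge = 1.0
alpha = np.linalg.solve(
    fused[np.ix_(train, train)] + ridge * np.eye(len(train)), labels[train]
)
scores = fused[np.ix_(test, train)] @ alpha
predicted = np.where(scores >= 0, 1.0, -1.0)
accuracy = (predicted == labels[test]).mean()
print(f"fused kernel shape: {fused.shape}, held-out accuracy: {accuracy:.2f}")
```

The fused kernel is a patients-by-patients matrix whatever the number of features per omics, which is why this route can combine heterogeneous omics once samples are matched; methods such as SDP-SVM learn the kernel weights instead of fixing them to a plain average.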

Table 8
Multi-omics studies using different ML methods. For abbreviations please refer to List of Abbreviations.
Target | Genomics (Gene expression, DNA methylation, Somatic point mutation, Copy number alteration) | Transcriptomics (mRNA, miRNA, lncRNA) | Metabolomics | Proteomics | Glycomics | Lipidomics | Epigenomics | Method used | Method type | Reference

Age-related ✓ ✓ ✓ ✓ Graphical RF CU (Zierer et al.,
Acute myeloid ✓ ✓ LASSO CS (Taskesen et al.,
leukaemia 2015)
Anti-cancer ✓ ✓ RF & SVM CS (Stetson et al.,
therapeutic 2014)
Biomedical ✓ ✓ ✓ MORONET TS (Wang et al.,
data 2020)
Brain cancer ✓ ✓ ✓ ✓ ✓ LASSO CS (Lu et al., 2016)
✓ ✓ JIVE CU (Lock et al.,
✓ ✓ ✓ ✓ iClusterBayes CU (Mo et al., 2018)
✓ ✓ Lemon-Tree MU (Bonnet et al.,
✓ ✓ ✓ SNF MU (Wang et al.,
Breast cancer ✓ ✓ RF CS (List et al., 2014)
✓ ✓ LASSO CS (Lee et al., 2017)

✓ ✓ RF & SVM CS (Nam et al.,

✓ ✓ ✓ LASSO CS (Chen et al.,
✓ ✓ SVM CS (Auslander et al.,
✓ ✓ iCluster CU (Shen et al.,
✓ ✓ ✓ SVM, RF, SVM CS & TS (Ma et al., 2016)
& Multi-
✓ ✓ iCluster CU (Curtis et al.,
✓ ✓ ✓ ✓ BCC MU (Lock and
Dunson, 2013)
✓ ✓ FSMKL TS (Seoane et al.,


✓ ✓ ✓ Meta-SVM TU (Kim et al., 2017)
Cancer survival ✓ ✓ SVM & RF CS (Kim et al., 2014)
Cancer ✓ ✓ ✓ ✓ LASSO CS (Zhao et al.,
prognosis 2015)
✓ ✓ PSDF MU (Yuan et al.,
✓ ✓ CONEXIC MU (Akavia et al.,
Cancer drug ✓ ✓ ✓ MOLI (DL) MS (Sharifi-Noghabi
response et al., 2019)

Cardiac tissue ✓ ✓ RF CS (Dimitrakopoulos

ageing et al., 2014)
Colorectal ✓ ✓ Neural Fuzzy CU (Vineetha et al.,
cancer Network 2013)
COVID-19 ✓ ✓ ✓ PLS-DA CS (Thomas et al.,
analysis 2020)
✓ ✓ ✓ Extra Trees CS (Overmyer et al.,
Chronic ✓ ✓ ✓ MOFA CU (Argelaguet et al.,
lymphocytic 2018)
Gastric cancer ✓ ✓ SVM & RF CS (Yan et al., 2012)
Kidney cancer ✓ ✓ ✓ ✓ ✓ iClusterBayes CU (Mo et al., 2018)
✓ ✓ ✓ PAMOGK TU (Tepeli et al.,
Liver cancer ✓ ✓ ✓ Auto-encoder, MS (Chaudhary et al.,
SVM 2017)
Lung cancer ✓ ✓ iCluster CU (Shen et al.,
✓ ✓ ✓ ✓ Auto-encoder MS (Lee et al., 2020)
Neuroblastoma ✓ ✓ Auto- CS (Zhang et al.,
encoders, 2018)

Ovarian cancer ✓ ✓ RF CS (Anděl et al.,
✓ ✓ RF CS (Paik et al., 2017)
✓ ✓ ✓ ✓ LASSO CS (Mankoo et al.,
✓ ✓ ✓ Joint NMF CU (Zhang et al.,
✓ ✓ ✓ JBF CU (Ray et al., 2014)
✓ ✓ ✓ ✓ BN TS (Zhang et al.,
✓ ✓ ✓ ✓ Graph SSL TS (Kim et al., 2015)
Oral squamous ✓ ✓ ✓ SVM CS (Li et al., 2017)
Pan-cancer ✓ ✓ ✓ iCluster+ CU (Mo et al., 2013)
analysis ✓ ✓ moCluster CU (Meng et al.,


✓ ✓ ✓ ✓ LRAcluster CU (Wu et al., 2015)
✓ ✓ ✓ XGBoost MS (Ma et al., 2020a)
✓ ✓ ✓ RF MS (Bavafaye
Haghighi et al.,
✓ ✓ ✓ HI-DFN Forest MS (Xu et al., 2019a)
✓ ✓ ✓ ✓ MOSAE MS (Tan et al.,
✓ ✓ ✓ ✓ PINS MU (Nguyen et al.,

✓ ✓ ✓ fMKL-DR TS (Giang et al.,

✓ ✓ ✓ rMKL-LPP TU (Speicher and
Pfeifer, 2015)
✓ ✓ ✓ NEMO TU (Rappoport and
Shamir, 2019)
Pancreatic ✓ ✓ SVM MS (Kwon et al.,
cancer 2015)
Prostate cancer ✓ ✓ RF CS (Fan et al., 2011)
Precision ✓ ✓ Auto- CS & (Ding et al.,
oncology encoders, MU 2018)
Elastic Net,

Thyroid ✓ ✓ RF CS (Pietzner et al.,
function 2017)
Ulcerative ✓ ✓ LASSO CS (Bjerrum et al.,
colitis 2014)

Potato flesh ✓ ✓ ✓ RF CS (Acharjee et al.,
colour 2016)
✓ ✓ RF MS (Acharjee et al.,

Animals & Micro-organisms

Dog heart ✓ ✓ RF CS (Li et al., 2015)
Yeast ✓ ✓ BN CU (Zhu et al., 2012)


Fig. 3. Recommendation flowchart for choosing a method for multi-omics integration. For abbreviations please refer to List of Abbreviations.

unfortunately, the existing literature does not provide many direct comparisons between methods using the same publicly available datasets. Hence, to choose the method that best suits a given dataset and question, an empirical approach that investigates the use of different methods, guided by ML practitioners, is recommended.

7. Conclusions

This paper reviewed various ML approaches used for the integration of multi-omics data for analysis. A concise background of multi-omics and ML was presented. It examined the concatenation-, model- and transformation-based integration methods employed for multi-omics data, along with their advantages and disadvantages. Also, various existing multi-omics studies have been summarised. Finally, a recommendation flowchart is presented for interdisciplinary professionals to choose an appropriate method for a multi-omics dataset. Overall, this work showcases the recent findings in the multi-omics domain and signifies the key role of ML in the future of personalised healthcare.

Disclosure

The authors have nothing to disclose.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 633983. Ewan Pearson and Emanuele Trucco would like to acknowledge the National Institute for Health Research (NIHR) global health research unit on global diabetes outcomes research at the University of Dundee (INSPIRED project, Award number 16/136/102) for useful discussions.

grouped based on the target and the corresponding ML method used.

It is evident from Table 8 that most of the multi-omics studies focus on different forms of cancer. In particular, the many multi-omics studies related to breast (Chen et al., 2017; Lee et al., 2017; List et al., 2014; Ma et al., 2016; Nam et al., 2009) and ovarian (Anděl et al., 2015; Mankoo et al., 2011; Paik et al., 2017; Zhang et al., 2014) cancer highlight the research thrust by the scientific community in these domains.

Many intra-omics studies have successfully explored the integration of gene expression and DNA methylation. LASSO methods have been used for this particular integration by Taskesen et al. (2015) and Lee et al. (2017) for acute myeloid leukaemia and breast cancer, respectively. LASSO has also been employed for cancer prognosis (Zhao et al., 2015). Similarly, mRNA-miRNA integration was investigated using a Neural Fuzzy Network for colorectal cancer (Vineetha et al., 2013), SVM for pancreatic cancer (Kwon et al., 2015), and RF for cardiac tissue ageing (Dimitrakopoulos et al., 2014) and ovarian cancer (Anděl et al., 2015). SVM has also been used in an oral squamous cell carcinoma study that integrated different transcriptomics data, namely mRNA, miRNA and lncRNA (Li et al., 2017).

Metabolomics and proteomics have been integrated using RF for the analysis of prostate cancer (Fan et al., 2011) and thyroid function (Pietzner et al., 2017). Similarly, metabolomics has been integrated with mRNA for studying ulcerative colitis (Bjerrum et al., 2014) and cancer survival (Kim et al., 2014). On the other hand, glycomics and epigenomics have only appeared once in the multi-omics context (along with mRNA and metabolomics), used by Zierer et al. (2016) for the study of age-related comorbidities with a graphical variant of RF.

Recently, metabolomics and proteomics have also been integrated with lipidomics to evaluate COVID-19 patients using PLS-DA (Partial Least Squares Discriminant Analysis) and Extra Trees (Overmyer et al., 2020; Thomas et al., 2020).

Multi-omics studies have also been successfully conducted in plants (potato (Acharjee et al., 2011, 2016)) and animals (such as canine heart disease (Li et al., 2015)).

Overall, the recent multi-omics studies highlight the superiority of integration methods in understanding the complexity of different diseases and uncovering the underlying abnormalities from the vast amounts of generated multi-omics data, which is not always possible with individual omics analysis.
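
To make the LASSO-based integration of gene expression and DNA methylation mentioned above more concrete, the following is a minimal sketch of concatenation-based integration followed by a LASSO fit. It is not the pipeline of any of the cited studies: the data are synthetic, the block sizes, coefficients and penalty value are arbitrary assumptions, and the names `standardise`, `concatenate`, `soft_threshold` and `lasso` are hypothetical helpers written in plain Python for illustration only.

```python
import random


def standardise(block):
    """Centre each feature and scale it to unit variance (done per omics block)."""
    n = len(block)
    scaled_cols = []
    for col in zip(*block):
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5 or 1.0  # guard constant features
        scaled_cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def concatenate(*blocks):
    """Concatenation-based integration: join the omics blocks sample by sample."""
    return [sum((list(row) for row in rows), []) for rows in zip(*blocks)]


def soft_threshold(value, penalty):
    """Shrink a value towards zero by `penalty`, returning exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(X, y, penalty, n_iter=200):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + penalty*||b||_1."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # all coefficients start at zero
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # covariance of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, penalty) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                beta[j] = new
    return beta


# Synthetic stand-ins for two omics blocks measured on the same 60 samples (assumption).
rng = random.Random(0)
n_samples = 60
expression = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n_samples)]
methylation = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n_samples)]

# Hypothetical outcome driven by one feature from each block plus a little noise.
y = [2.0 * e[0] - 1.5 * m[1] + rng.gauss(0, 0.1) for e, m in zip(expression, methylation)]

X = concatenate(standardise(expression), standardise(methylation))
beta = lasso(X, y, penalty=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
```

In this sketch each block is standardised separately before concatenation so that neither omics layer dominates the penalty purely through its measurement scale. In practice the penalty would be chosen by cross-validation, and an established implementation would normally be preferred over hand-written coordinate descent.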
6. Recommendations
