
Biotechnology Advances 49 (2021) 107739

Contents lists available at ScienceDirect

Biotechnology Advances
journal homepage: www.elsevier.com/locate/biotechadv

Research review paper

Using machine learning approaches for multi-omics data analysis: A review


Parminder S. Reel a,1, Smarti Reel a,1, Ewan Pearson a, Emanuele Trucco b, Emily Jefferson a,*
a Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
b VAMPIRE project, Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom

ARTICLE INFO

Keywords: Multi-omics; Machine Learning; Predictive Modelling; Supervised Learning; Unsupervised Learning; Systems Biology

ABSTRACT

With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.

1. Introduction

Digital information is growing rapidly, in terms of five V’s (volume, velocity, veracity, variety and value), and hence this is hailed as the big data era (BCS, 2014; Bellazzi, 2014; Lee and Yoon, 2017). Health-based big data including linked information for patients, such as their clinical data (for example gender, age, pathological and physiological history) and omics data (such as genetics, proteomics and metabolomics) has now become more widely available (Canuel et al., 2015; Singhal et al., 2016). Recently, such data has been used for precision (also called personalised or stratified) medicine to provide customised healthcare, i.e. providing a bespoke treatment for individuals (Gibson et al., 2015; Kalaitzopoulos, 2016; Malod-Dognin et al., 2017). There has been unprecedented growth in the development of precision medicine supported by ML (machine learning) approaches (Delavan et al., 2017; Peterson et al., 2013; Zou et al., 2017) and data mining tools (Chawla and Davis, 2013; Cheng et al., 2015; Margolies et al., 2016). These techniques have also helped to discover novel omics biological markers which can identify the molecular cause of a disease.

A biomarker is a substance, structure, or process that can be measured in the human body or its products and can provide surrogate information about the presence of a disease/condition (Strimbu and

Abbreviations: ATHENA, Analysis Tool for Heritable and Environmental Network Associations; BCC, Bayesian consensus clustering; BN, Bayesian Network; CS, Concatenation-based Supervised Learning; CU, Concatenation-based Unsupervised Learning; DNA, Deoxyribo-Nucleic Acid; FCA, Formal Concept Analysis; FDA, Food and Drug Administration; fMKL-DR, fast multiple kernel learning for dimensionality reduction; FSMKL, Multiple Kernel Learning with Feature Selection; HI-DFNForest, Hierarchical integration deep flexible neural forest; JBF, Joint Bayes Factor; JIVE, Joint and Individual Variation Explained; KNN, k-nearest neighbors; LASSO, Least Absolute Shrinkage and Selection Operator; LDA, Linear Discriminant Analysis; lncRNAs, long non-coding RNAs; MDI, Multiple Dataset Integration; MDS, Multi-Dimensional Scaling; Meta-SVM, Meta-analytic SVM; miRNA, microRNA; ML, Machine Learning; MOFA, Multi-Omics Factor Analysis; MOLI, Multi-omics late integration; MORONET, Multi-Omics gRaph cOnvolutional NETworks; MOSAE, Multi-omics Supervised Autoencoder; mRNA, messenger Ribo-Nucleic Acid; MS, Model-based Supervised Learning; MU, Model-based Unsupervised Learning; NEMO, NEighborhood based Multi-Omics clustering; NMF, Non-negative Matrix Factorisation; PCA, Principal Component Analysis; PINS, Perturbation clustering for data integration and disease subtyping; PSDF, Patient-Specific Data Fusion; RF, Random Forest; rMKL-LPP, regularised multiple kernel learning for Locality Preserving Projections; RVM, Relevance Vector Machine; SDP-SVM, Semi-Definite Programming SVM; SmSPK, smoothed shortest path graph kernel; SNF, Similarity Network Fusion; SSL, Semi-supervised learning; SVM, Support Vector Machine; SVR, Support vector regression; TS, Transformation-based Supervised Learning; TU, Transformation-based Unsupervised Learning.
* Corresponding author at: Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, UK.
E-mail address: e.r.jefferson@dundee.ac.uk (E. Jefferson).
1 These authors contributed equally to this study (shared first authorship).

https://doi.org/10.1016/j.biotechadv.2021.107739
Received 15 December 2020; Received in revised form 1 March 2021; Accepted 25 March 2021
Available online 29 March 2021
0734-9750/© 2021 Published by Elsevier Inc.

Table 1
The omics technologies which help us draw a complete picture of cell biology and related function.

1. Genomics
   Term coined in: 1986 (Kuska, 1998)
   Data extracted: Single nucleotide polymorphisms, Rare variants and Copy number variations.
   Commonly used high-throughput technologies: DNA-Sequencing (Sanger (Sanger et al., 1977), Whole-genome (Huang et al., 2017a), Whole-exome (Weisz Hubshman et al., 2018), Single-Cell DNA (Zhang et al., 2019a) and targeted sequencing (Bewicke-Copley et al., 2019)), Microarray (Bumgarner, 2013).
   Common reference databases: DDBJ (Tateno et al., 2002), GenBank (Benson et al., 2011), ENA (Leinonen et al., 2011)
   Recent reviews: (Reuter et al., 2015)

2. Transcriptomics
   Term coined in: 1999 (“Proteomics, transcriptomics,”, 1999)
   Data extracted: Messenger, Micro and Long non-coding RNA expression.
   Commonly used high-throughput technologies: RNA-Sequencing (Sanger (Alidjinou et al., 2017), Single-Cell RNA (Hwang et al., 2018) and targeted sequencing (Mercer et al., 2012)), Microarray (Zhao et al., 2014).
   Common reference databases: miRBase (Kozomara et al., 2019), Rfam (Kalvari et al., 2018)
   Recent reviews: (Lowe et al., 2017)

3. Proteomics
   Term coined in: 1994 (Wilkins and Appel, 2007)
   Data extracted: Protein expression
   Commonly used high-throughput technologies: Reverse Phase Protein Array (Boellner and Becker, 2015), Liquid Chromatography - Mass Spectrometry (Karpievitch et al., 2010) and Mass Spectrometry (Timp and Timp, 2020)
   Common reference databases: HPA (Uhlen et al., 2010), PDB (Burley et al., 2019), Pfam (Finn et al., 2010), UniProt (The UniProt Consortium, 2019)
   Recent reviews: (Aslam et al., 2017)

4. Metabolomics
   Term coined in: 2001 (Lindon et al., 2011)
   Data extracted: Metabolite expression
   Commonly used high-throughput technologies: Mass Spectrometry (Glaves et al., 2014), Liquid Chromatography - Mass Spectrometry (Zhou et al., 2012), Gas Chromatography - Mass Spectrometry (Fiehn, 2016).
   Common reference databases: HMDB (Wishart et al., 2018), KEGG (Kanehisa and Goto, 2000)
   Recent reviews: (Zampieri et al., 2017)

5. Lipidomics
   Term coined in: 2003 (Wang et al., 2016)
   Data extracted: Lipids
   Commonly used high-throughput technologies: Liquid Chromatography - Mass Spectrometry (Li et al., 2020), High-performance Liquid Chromatography - Mass Spectrometry (Knittelfelder et al., 2014) and Direct-Infusion/Shotgun - Mass Spectrometry (Köfeler et al., 2012).
   Common reference databases: LMSD (Sud et al., 2007), LipiDAT (Caffrey and Hogan, 1992), LipidBank (Watanabe et al., 2000), LipidHome (Foster et al., 2013), LipidPedia (Kuo and Tseng, 2018)
   Recent reviews: (Yang and Han, 2016)

6. Glycomics
   Term coined in: 1990 (Vasta and Ahmed, 2008)
   Data extracted: Glycomes
   Commonly used high-throughput technologies: Matrix-Assisted Laser Desorption/Ionization Time-of-Flight - Mass Spectrometry (Zhang et al., 2019b).
   Common reference databases: GlyTouCan (Tiemeyer et al., 2017), UniCarb-DB (Campbell et al., 2014)
   Recent reviews: (Rojas-Macias et al., 2019)

7. Metagenomics
   Term coined in: 1998 (Handelsman et al., 1998)
   Data extracted: Genetic data from environmental (soil, water) samples.
   Commonly used high-throughput technologies: Target Gene Sequencing, Shotgun Metagenome Sequencing, Metatranscriptome Sequencing (Zhou et al., 2015)
   Common reference databases: MG-RAST (Meyer et al., 2008), SRA (Kodama et al., 2012), MGnify (Mitchell et al., 2020)
   Recent reviews: (Pérez-Cobas et al., 2020)

Tavel, 2010). Molecular biomarkers are discovered by analysing the cascade of information provided by different omics (Debnath et al., 2010). For example, the high-sensitivity C-Reactive protein test provides an accurate and quantitative risk assessment for cardiovascular disease (Pfützner and Forst, 2006; Shrivastava et al., 2015). Biomarkers play a significant role in planning preventive measures and decisions for patients (Nielsen, 2017) and can be classified as either diagnostic, prognostic or predictive (Le et al., 2016; Shaw et al., 2015). Diagnostic biomarkers are used for determining the presence of disease in a patient, while prognostic biomarkers provide information on the overall outcome with or without the standard treatment (Carlomagno et al., 2017). Predictive biomarkers are used to identify who is at risk of an outcome (Nalejska et al., 2014). All of these biomarkers can also be used to identify which treatment will be most suitable for a given patient. For example, the ADNI (Alzheimer’s Disease Neuroimaging Initiative) study used a combination of neuroimaging, biochemical and genetic biomarkers to discriminate early Alzheimer’s patients from healthy volunteers with an accuracy of 98% (Gupta et al., 2019). Similarly, different forms of Parkinson’s syndromes have been investigated by developing an automated tool that fuses multi-site diffusion-weighted MRI imaging biomarkers and disease rating score (MDS-UPDRS III) (Archer et al., 2019). Biomarkers can help identify high-risk individuals before their physiological symptoms are evident. Moreover, they also help in measuring disease progression (Mandel et al., 2010).

In the context of precision medicine, ML has been used to develop diagnostic, prognostic and predictive tools from single omics data (Dias-Audibert et al., 2020; Mamoshina et al., 2018; Sonsare and Gunavathi, 2019). However, ML performance may deteriorate for certain single omics such as gene data due to their inherent characteristics (Kim et al., 2020). ML methods are now also being applied to multi-omics data (Bersanelli et al., 2016), to investigate and interpret the relationships between data and phenotypes (Kim and Tagkopoulos, 2018). Although ML analysis of multi-omics is still in its embryonic stage, it has already been explored for a wide range of applications, as reported in recent reviews on brain diseases (Garali et al., 2018; Young et al., 2013), diabetes (Kavakiotis et al., 2017), cancers (Borad and LoRusso, 2017; Chaudhary et al., 2017; Wong et al., 2016), cardiovascular disease (Weng et al., 2017), medical imaging (Erickson et al., 2017), single-cell analysis in humans (Cao et al., 2020; Ma et al., 2020a) and plant science studies (Acharjee et al., 2011). Currently, many of the multi-omic reviews are focused on individual sub-topics, for example designing studies (Haas et al., 2017; Hasin et al., 2017), setting up workflows (Kohl et al., 2014), choosing software tools (Misra et al., 2019) and evaluating overfitted performance (McCabe et al., 2020).

In contrast, this review aims at a broader focus, presenting an interdisciplinary perspective to new readers in this domain by providing a background on multi-omics and ML. It takes forward the integration terminologies introduced by Ritchie (Ritchie et al., 2015) and summarises the recent integrative state-of-the-art approaches. We aim to cover various integration methods concisely and include a recommendation flowchart enabling interdisciplinary scientists to have a quick head start in this domain (Bersanelli et al., 2016; Nguyen and Wang, 2020).

Scope of this review: This review investigates the two primary learning strategies in ML, i.e. supervised and unsupervised, which are commonly used within the context of multi-omics integration. This review considers multi-omics integration as a process of combining different single omics. Although various ML specialisations such as reinforcement (Coronato et al., 2020), hybrid (Zhou et al., 2019), multi-view (Zhao et al., 2017) and self-supervised learning (Chen et al., 2019) are now emerging in generic healthcare applications, they have not yet gained enough momentum in multi-omics analysis, hence they remain beyond the scope of this review.


Table 2
The different ML learning approaches reviewed for multi-omics integration.

Supervised
   Goal: Predict new data
   Description: Supervised learning involves fitting a model with labelled training data and then using it for prediction. It can be classed as either a regression (predicted variable is numeric) or a classification (predicted variable is categorical) problem (Jiang et al., 2020). The three steps in supervised learning are: (1) fitting a model from the sample input observations, (2) evaluating the model and then extensively tuning the hyper-parameters of the model, and (3) setting up the model for the production stage and using it for prediction (Foster et al., 2014).

Unsupervised
   Goal: Identify clusters
   Description: Unsupervised learning is used to find the underlying patterns in unlabelled data using input feature variables without the target/output variable (Badillo et al., 2020). It can be used for clustering (Xu and Tian, 2015), anomaly detection (Thudumu et al., 2020) and dimensionality reduction (Xu et al., 2019a).
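
The three supervised learning steps listed in Table 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from any of the reviewed studies: the two-feature toy data, the choice of a k-nearest-neighbour classifier, the candidate values of k, and the helper names `make_toy_data`, `knn_predict` and `accuracy` are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    # Rank training samples by Euclidean distance to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, eval_X, eval_y, k):
    """Fraction of evaluation samples whose predicted label is correct."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)


def make_toy_data(n_per_class, seed=0):
    """Hypothetical data: two Gaussian blobs labelled 0 and 1."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
            y.append(label)
    return X, y


X, y = make_toy_data(60)
idx = list(range(len(X)))
random.Random(1).shuffle(idx)
train, valid, test = idx[:72], idx[72:96], idx[96:]


def pick(ids):
    return [X[i] for i in ids], [y[i] for i in ids]


Xtr, ytr = pick(train)
Xva, yva = pick(valid)
Xte, yte = pick(test)

# Step 1: "fitting" a k-NN model simply stores the labelled training samples.
# Step 2: evaluate on held-out validation data and tune the hyper-parameter k.
best_k = max((1, 3, 5, 7), key=lambda k: accuracy(Xtr, ytr, Xva, yva, k))
# Step 3: apply the tuned model to data it has never seen.
test_accuracy = accuracy(Xtr, ytr, Xte, yte, best_k)
print("chosen k:", best_k, "test accuracy:", round(test_accuracy, 2))
```

Keeping a separate validation split for tuning and a test split for the final estimate is the design choice that matters here: the test samples play no part in choosing k, so the reported accuracy is not inflated by the tuning step.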

This paper is organised as follows. Section 2 provides a short background related to multi-omics and ML. Section 3 describes how ML is employed for multi-omics analysis and the real-world challenges this poses. In Section 4, details of different multi-omics integration approaches are presented. In Section 5, published multi-omics studies using ML methods are discussed. Section 6 describes a recommendation flowchart for choosing an appropriate method for multi-omics integration. Conclusions are provided in Section 7.

2. Background

2.1. Multi-omics

In living beings, genetic information in the cells flows from DNA (deoxyribo-nucleic acid) to the mRNA (messenger ribo-nucleic acid) to protein and is dictated by the central dogma of molecular biology (Lodish et al., 2000). This flow of information is often considered analogous to a computer system which has facilitated the understanding of biological information processing (Wang and Gribskov, 2005; D’Onofrio and An, 2010).

The study of DNA, mRNA and proteins is broadly denoted as genomics, transcriptomics, and proteomics respectively. The genetic blueprint of a cell is explored using genomics, which looks at the DNA of individuals and helps us to investigate the presence or absence of certain genes (Gibson, 2015; Vogel and Motulsky, 1997). Transcriptomics studies the transcribed genetic material and examines the genes which are actively expressed and provides information about what is happening at the cellular level (Milward et al., 2016). Proteomics helps in characterising the information flow happening within the cell and the organism in the form of protein pathways and their networks (Wu et al., 2014).

Although metabolomics, lipidomics, and glycomics do not form part of the central dogma analysis (Cobb, 2017), they still provide an invaluable amount of information regarding the metabolites, lipids and glycans (synthesised by the proteome via biosynthetic pathways) (Barh et al., 2011). These substances are the intermediate products of a cell’s information flow and therefore are considered to be excellent indicators of the cell’s activity. Similar to single-genome studies, metagenomics is used to sequence genetic information from environmental samples without the requirement of isolating individual species (Hugenholtz and Tyson, 2008).

All measured omics data can be used as a biomarker which helps us to understand and analyse the underlying characteristics and complexities of biological systems (Alberts et al., 2008). Table 1 shows some of the important omics used to study biological systems (Handelsman et al., 1998; Kuska, 1998; Lindon et al., 2011; “Proteomics, transcriptomics,”, 1999; Vasta and Ahmed, 2008; Wang et al., 2016; Wilkins and Appel, 2007). All of them are part of the same pipeline of biological information, whose output depends on the different inputs and regulation. As shown in Table 1, each of these omics can be measured using specialised high-throughput technologies (for example microarray (Bumgarner, 2013) and mass spectrometry (Glaves et al., 2014) for genomics and

Table 3
The standard ML terminology and related terms.

Accuracy: It is a ratio of correctly predicted outcomes of a given class to the total outcomes. Accuracy is a measure of the performance of an ML model. It ranges from 0% to 100%.
Classification: It is a supervised learning method that provides predicted output as a discrete class. Classification can be binary, multi-class or multi-label.
Clustering: It is an unsupervised learning method that can group data based on the attributes of the input features.
Cross-Validation: It is a technique that allocates a given set of samples from the dataset which are not used for model training but set aside for testing (to evaluate model performance). K-fold and Leave one out are commonly used cross-validation methods.
Curse of dimensionality: It refers to a set of problems that arise when using datasets with high dimensionality. In the context of ML, it can impact the predictive performance of an ML model (Duda et al., 2001).
Dataset: It is a collection of structured data which comprises input feature variables and sometimes a corresponding target/output variable.
Ensemble Learning: It is a paradigm where different models are trained for solving the same problem and then combined to get better performance. Bagging, boosting and stacking are commonly used in ensemble learning methods.
Explainability: Supervised learning models can be classed as ‘white’ or ‘black-box’ based on their explanation (or lack thereof) of how a decision is reached. This is a growing and important domain of research in deep learning.
Feature Selection: It is a process for selecting the most discriminating features without impacting the classification performance.
Hyper-parameter: It is an empirically tuned internal parameter of an ML model.
Imputation: It is a process of replacing missing values in a dataset with a corresponding statistical estimate. Imputation can be done using mean, median values or employing methods such as KNN (Crookston and Finley, 2008) or MICE (Azur et al., 2011).
Outlier: It is an extremely low or high value of a feature in a dataset (based on the range and distribution). The performance of ML algorithms is sensitive to outliers, hence their detection and exclusion are crucial (Domingues et al., 2018).
Performance Metric: It is a method to evaluate and compare the performance of ML models. For example, precision, recall/sensitivity, specificity, F1 score, Kappa and mean absolute error.
Regression: It is a supervised learning method that provides predicted output as a continuous value.
Training: It is a first step in the learning process that uses a training dataset to fit the parameters of a supervised ML model.
Testing: It is a second step in the learning process which uses a testing dataset (independent of the training dataset) to assess the predictive performance of a trained supervised ML model.
Bias-variance trade-off: In order to achieve optimal prediction performance, a supervised model should ideally have low bias and low variance. A model is over or underfitted when a trade-off is not achieved.
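
To make the "Accuracy" and "Performance Metric" entries of Table 3 concrete, the sketch below computes accuracy, precision, recall, F1 score and Cohen's Kappa for a binary problem from the confusion-matrix counts. The labels are hypothetical (5 positive and 95 negative samples, chosen only to illustrate an imbalanced dataset), and the function names `confusion_counts` and `metrics` are introduced here for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 score and Cohen's Kappa."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Kappa compares observed agreement with the agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (acc - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": acc, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}


# Hypothetical imbalanced labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95
# A trivial model that always predicts the majority (negative) class.
always_negative = [0] * 100
# A model that finds 4 of the 5 positives at the cost of 2 false positives.
screening = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

print(metrics(y_true, always_negative))
print(metrics(y_true, screening))
```

On data like this, the always-negative model still reaches high accuracy although it never detects a positive case, while its F1 score and Kappa fall to zero; this is why chance-corrected or class-sensitive metrics are usually preferred when classes are imbalanced.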


Table 4
The commonly used ML algorithms and their attributes. The rank [1 – Low, 2 – Medium, 3 – High, 4 – Very High] denoted to attributes is pragmatically assigned based on available literature (Amancio et al., 2014; Barredo Arrieta et al., 2020; de Andrade et al., 2020; Lorena et al., 2011; Rashidi et al., 2019; Sakr et al., 2017).

Family | Model | Comparative Accuracy | Overfitting Risk | Samples needed | Explainability | Hyper-parameter Tuning | Complexity | Implementation Time | Computation Cost
Probability-based (Bayesian) | Bayesian Network | 2 | 2 | 2 | 2 | 3 | 3 | 2 | 3
Probability-based (Bayesian) | Naive Bayes | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 3
Information based (Tree) | Decision Tree | 2 | 3 | 2 | 3 | 2 | 2 | 1 | 2
Information based (Tree) | Random Forest | 3 | 2 | 1 | 3 | 3 | 2 | 1 | 2
Information based (Tree) | Gradient Boosting | 3 | 3 | 2 | 1 | 4 | 4 | 2 | 3
Error based (Linear) | Linear Regression | 1 | 3 | 2 | 3 | 1 | 2 | 1 | 2
Error based (Linear) | Logistic Regression | 1 | 3 | 2 | 3 | 1 | 2 | 1 | 2
Error based (Linear) | Partial Linear Regression | 2 | 1 | 3 | 3 | 2 | 2 | 1 | 2
Similarity-based (Instance) | K nearest neighbour | 2 | 3 | 2 | 2 | 2 | 3 | 1 | 1
Similarity-based (Instance) | Self-Organising Maps | 2 | 3 | 2 | 2 | 3 | 3 | 1 | 1
Support Vectors (Kernel) | Linear SVM | 3 | 3 | 3 | 1 | 3 | 2 | 2 | 2
Support Vectors (Kernel) | Non-linear SVM | 3 | 3 | 3 | 1 | 3 | 3 | 3 | 3
Neural Network-based (Neural Network) | Artificial Neural Network | 3 | 3 | 2 | 1 | 3 | 3 | 3 | 3
Neural Network-based (Neural Network) | Deep Learning | 4 | 1 | 4 | 1 | 4 | 4 | 4 | 4

metabolomics respectively). The table also includes a list of recent reviews on each of these omics. High-throughput generated omics data (Lightbody et al., 2019) has played a pivotal role in developing precision medicine biomarkers for diseases such as Alzheimer’s (Hampel et al., 2017; Hampel et al., 2016; Kovacs, 2016), diabetes (Capobianco, 2017; McCarthy, 2017; Mutie et al., 2017), cancer (Borad and LoRusso, 2017; Senft et al., 2017), hypertension (Barnes et al., 2016; Dominiczak et al., 2017), cardiovascular (Costantino et al., 2017) and chronic respiratory diseases (Agache and Rogozea, 2017; Hanania and Diamant, 2017). Recently, these omics have also been integrated for COVID-19 studies (Barh et al., 2020; Overmyer et al., 2020; Zhou et al., 2020). Many other specialised omics have also emerged such as pharmacogenomics (Wang, 2010), methylomics (Liu et al., 2013), interactomics (Luck et al., 2017) and radiomics (Lambin et al., 2017; Wong et al., 2016).

Overall, these omics provide a complete picture of cell biology and related cellular function (Cox, 2009). This provided the impetus for the development of various software mechanisms which can offer a prediction of a particular phenotype while using the available next-generation multi-omics data (Ritchie et al., 2015). Furthermore, they can be utilised to develop materials and devices which can be used for diagnostic and preventive purposes at the molecular level while targeting molecules with greater accuracy (Giovanni Martinelli et al., 2015).

2.2. Machine learning

Classical statistical modelling has always been the de facto standard choice for health data analysis and its interpretation. In recent years, with the increasing availability of affordable computing power and high-throughput omics data and the success of artificial intelligence technology in various fields, the use of ML has become popular in health sciences (Lee and Yoon, 2017; Clifton et al., 2015; Hung, 2019; Barnett-Itzhaki et al., 2020; Kirchebner et al., 2020). ML can be used to mine information hidden in the experimental data. In contrast, a conventional statistics-based model is usually developed using statistical assumptions and draws an inference about a population from a given dataset (Bzdok, 2017).

The objective of ML methods is to acquire knowledge from historic or present-day data and utilise that understanding to make forecasts or choices for unidentified forthcoming data measures (Gammerman, 2010; Obermeyer and Emanuel, 2016). To assist beginners in the ML domain, a glossary of learning approaches covered in this review (Table 2), standard ML terminology (Table 3) and commonly used ML algorithms (Table 4) are provided. The basic foundations of ML and its uses have been extensively covered in the literature (Bishop, 2006).

ML is employed in a wide range of scenarios, where designing and programming explicit algorithms with optimal results is challenging, such as email filtering (Dada et al., 2019), hand-written optical character recognition (Memon et al., 2020), and computer vision (O’Mahony et al., 2020). Also, it has been deployed for self-driving cars (Badue et al., 2021), cyber-security (Handa et al., 2019), automated assistants such as ‘Siri’, websites that recommend items based on the purchasing decisions of other people and novel solutions to some of the challenging problems of the real world (Watt et al., 2020).

Deep learning has emerged in recent years as the leading class of ML algorithms. It uses neural networks composed of hidden layers performing different operations to find complex representations of data. It has pushed the performance of classifiers beyond that of traditional ML algorithms, especially in scenarios involving large-scale datasets with high dimensionality. On the other hand, it is very computationally intensive, requiring high-throughput or high-performance hardware, and lacks explainability (transparency) in feature selection (black-box approach), in the sense that it is difficult to extract from the network the features that the network has found as mainly responsible for the task, e.g. classification (LeCun et al., 2015). However, in the context of multi-omic integration, deep learning offers an exciting opportunity.

Fig. 1 shows the number of publications indexed on the Web of Science website (Clarivate Analytics, 2020) with different key topics. This information was collected from the Web of Science by entering a


Fig. 1. Number of publications published per year on different search keywords. *For the year 2020, the annual count was extrapolated using the count of publications until October 2020.

different keyword and searching across all databases². Although the use of ML in medical science can be dated back to the 1970s (Davenport and Kalakota, 2019), more rapid growth is evident in the past 10 years. Moreover, publications based on ‘multi-omics integration’ and ‘multi-omics and machine learning’ have started to emerge in the last 5 years and have gained popularity in the precision and computational medicine domain. Although deep learning is widely popular in other related domains (such as medical imaging (Erickson et al., 2017) and clinical natural language processing (Wu et al., 2020)), the interest has been more limited for multi-omics analysis (Tan et al., 2020b). This is because multi-omics studies are challenging to deploy as they require specialised high-throughput omic infrastructure (as highlighted earlier in section 2). This fact is reinforced by the evidence that most of the current literature employs deep learning on large-scale multi-omics datasets from open sources such as TCGA³, CCLE⁴ and GDSC⁵ for cancer prognosis (Poirion et al., 2018; Seal et al., 2020; Tong et al., 2020; Lee et al., 2020; Zhu et al., 2020) and anti-cancer drug response (Sharifi-Noghabi et al., 2019; Li et al., 2019; Deng et al., 2020).

² All Web of Science databases included: Web of Science Core Collection, BIOSIS Citation Index, BIOSIS Previews, Current Contents Connect, Data Citation Index, Derwent Innovations Index, KCI-Korean Journal Database, MEDLINE®, Russian Science Citation Index, SciELO Citation Index and Zoological Record.
³ TCGA: The Cancer Genome Atlas
⁴ CCLE: Cancer Cell Line Encyclopaedia
⁵ GDSC: Genomics of Drug Sensitivity in Cancer

3. Challenges in multi-omics analysis using machine learning

The use of ML to analyse high-throughput generated multi-omic data poses key unique challenges. They can be summarised as follows.

3.1. Heterogeneity, sparsity and outliers

Multi-omic data from different high-throughput sources are usually heterogeneous (Bersanelli et al., 2016). For example, transcriptomics and proteomics use different normalisation and scaling techniques before omics analysis. This leads to different dynamic ranges and data distribution. Also, some omics are more prone to generating sparse data (e.g. in the case of metabolomics, some values might be below the limit of detection and hence assigned null value (Antonelli et al., 2019)) than others. Therefore, imputation (Liew et al., 2011) and outlier detection (Vivian et al., 2020) should be considered for each omic separately, before planning their integration.

3.2. Class imbalance and overfitting

In disease classification, certain disease classes are rarer than others which can cause a class imbalance in the multi-omics dataset (Haas et al., 2017). For example, primary hypertension is the most common form of hypertension with 95% prevalence while endocrine hypertension occurs in only 5% (Rimoldi et al., 2014). The ML model trained using an imbalanced dataset may be overfitted, i.e. high accuracy for training data but underperformance for unseen test data. Therefore, to classify these two types of hypertension one of the following approaches can be used: 1) collect more data if possible, or 2) consider using weighted or normalised metrics to measure the ML performance (such as F1-Score or Kappa (Jeni et al., 2013)), or 3) consider over or under-sampling the under or over-represented class respectively, or 4) consider synthetic sample generation (such as SMOTE (Chawla et al., 2002) or ADASYN (Haibo He et al., 2008)) for the under-represented class. Similarly, techniques such as regularisation, bagging, hyperparameter tuning and cross-validation can be used to balance the bias-variance trade-off (Lee, 2010). Any of the above approaches can be used, depending on data and problem, to overcome the class imbalance and overfitting problems.

3.3. More features than data (p >> n)

Most multi-omics datasets suffer from the classical ‘curse of dimensionality’ problem, i.e. having far fewer observation samples (n) than


multi-omics features (p) (Misra et al., 2019). The resulting high-dimensional space often contains correlated features which are redundant and can mislead the algorithm training (James et al., 2017). The dimensionality of the data can be reduced by employing dimensionality reduction techniques such as feature extraction and feature selection. Feature extraction refers here to techniques that compute a small set of representative features which summarise the original dataset and its dimensions⁶. These features are functions of the original ones, for instance, PCA (principal component analysis) (Jolliffe, 2002), LDA (linear discriminant analysis) (Martinez and Kak, 2001) and MDS (multidimensional scaling) (Young and Hamer, 1987). On the other hand, feature selection finds a subset of the original features that maximises the accuracy of a predictive model (Guyon and Elisseeff, 2003). It can be based on prior knowledge, i.e. evident from known literature, or based on a database such as Biofilter (Bush et al., 2009). Formally, feature selection methods can be classed as filter (Information gain (Roobaert et al., 2006), ReliefF (Beretta and Santaniello, 2011), Chi-square statistics (Lee et al., 2011)), wrapper (Recursive feature elimination (Guyon et al., 2002), Sequential feature selection (Pudil et al., 1994)) and embedded (such as LASSO (Least Absolute Shrinkage and Selection Operator) (Zou, 2006)) techniques. Xu et al. (Xu et al., 2019b) and Stańczyk (Stańczyk and Jain, 2015) provide an excellent resource for understanding and exploring the use of different dimensionality reduction techniques in the generic ML domain. Meng (Meng et al., 2016b) offers a review of these methods from the perspective of multi-omics data analysis.

3.4. Computation and storage cost

The use of ML for multi-omics analysis comes with computational and data storage costs (Herrmann et al., 2020). Most ML algorithms require high computation power and large volumes of storage capacity to save the logs, results and analysis. In recent years, it has become possible to deploy ML models on dedicated graphics processing units (Schmidhuber, 2015) and cloud computing platforms (Armbrust et al., 2010) such as Amazon EC2 (“Amazon EC2,”, 2021), Microsoft Azure (“Cloud Computing Services | Microsoft Azure,”, 2021a) and Google Cloud Platform (“Cloud Computing Services,”, 2021b). The related costs should be considered well in advance when planning an ML-based multi-omics workflow.

3.5. What algorithm works best for what conditions?

The commonly used ML algorithms have different attributes (Table 4) and therefore it is crucial to choose an appropriate algorithm for the multi-omics analysis. In the literature, many reviews cover the key strengths and weaknesses of different ML algorithms using single omics (Amancio et al., 2014; López Pineda et al., 2015; Sakr et al., 2017; Uddin et al., 2019) and multi-omics (Ma et al., 2016; Francescatto et al., 2018; Xu et al., 2019a; Sathyanarayanan et al., 2020) datasets. Most of them use a systematic workflow that involves simultaneous performance evaluation of different algorithms using a common dataset. Since each multi-omics dataset is unique, using a similar workflow could allow the selection of the best-suited algorithm. Later, in Section 6, a recommendation flowchart is proposed which can help the inter-disciplinary user to choose from available methods.
Recently, various artificial intelligence-driven automated ML platforms and tools (Feurer et al., 2015; Olson et al., 2018; Waring et al., 2020) have also emerged which can be utilised to exhaustively search for the best ML model and corresponding parameter tuning; however, they are computationally expensive.

3.6. Translating ML: bench to bedside

Various ML-based multi-omics publications have emerged in the past 5 years (see Fig. 1) and some use performance metrics such as decision curve (Vickers and Elkin, 2006) and calibration (Dankers et al., 2019) analytics to evaluate their diagnostic utility. Still, only very few have been translated into clinical practice, for example, Idx (diabetic retinopathy detection), FerriSmart (measure liver iron concentration) and SubtleMR (image processing software for radiology) (Benjamens et al., 2020; Hamamoto et al., 2020).
One of the key issues hindering the clinical deployment of ML methods is the lack of transparency and explainability (Black box medicine and transparency, 2020). A transparent and explainable ML algorithm seems essential to building trust for clinical decision making (Gunning et al., 2019). Recently, the U.S. Food and Drug Administration (FDA) has issued the “Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device Action Plan” to ensure deployment of ML-based products is safe for patients to better assist the health care providers (Health, 2021). A recent in-depth analysis by Muehlematter showed that most of the FDA approved and “Conformité Européenne” marked ML products are in the field of radiology. It also highlighted the key differences between U.S. and European policy implications around the approval of AI/ML-based devices (Muehlematter et al., 2021).
All the above challenges directly impact the use of ML for multi-omics analysis. However, there are a few other challenges related to multi-omics studies which are not ML related, such as study design (Haas et al., 2017), multi-site sample collection & management (Pinu et al., 2019), multi-site data sharing and governance (Saulnier et al., 2019), visualisation (Mougin et al., 2018), ethical standards (Lévesque et al., 2018), and finally making the research reproducible (Conesa and Beck, 2019) and translational (Schumacher et al., 2014). A broader checklist of criteria is investigated by McShane, focussing on various aspects ranging from specimen requirements and predictive model development to clinical trial design and related regulatory approvals (McShane et al., 2013b; McShane et al., 2013a).

4. Data integration methods for multi-omics

In recent years, various new data integration methods have been introduced from the modern developments in mathematical, statistical and computational sciences. For the benefit of the readers, Table 5 includes a summary of a few reviews which cover the breadth of multi-omics integration for generic as well as specialised domains such as oncology (Buescher and Driggers, 2016; Nicora et al., 2020) and toxicology (Canzler et al., 2020). Most of these reviews have strived to introduce different categorical terminologies (for example: ‘early’, ‘late’ and ‘intermediate’ in (Gligorijević and Pržulj, 2015) or ‘bottom-up’ and ‘top-down’ in (Yu and Zeng, 2018)) which enable them to group the integration methods based on different factors/parameters.
As mentioned earlier, this section adopts the categorical terminologies from Ritchie (Ritchie et al., 2015) and builds upon them to summarise a complete spectrum of recent integration methods. It covers them concisely, giving a clear perspective to a new interdisciplinary user. The various integration methods are classed as either ‘concatenation-’, ‘model-’ or ‘transformation’-based and described below in detail.

4.1. Concatenation-based integration methods

Concatenation-based integration methods consider developing a model using a joint data matrix which is formed by combining multiple omics datasets. Fig. 2 shows the stages of concatenation-based integration. Stage 1 includes the raw data from three individual omics (e.g. genomics, proteomics, and metabolomics) along with the corresponding phenotypic information. Commonly, concatenation-based integration does not require any pre-processing and hence does not have a Stage 2. In Stage 3, the data from the individual omics is concatenated to form a single large

⁶ We note that “feature extraction” has a different meaning in image processing and computer vision.

Table 5
Summary table of few reviews in multi-omics integration
Omics columns in the original layout: Genomics, Transcriptomics, Metabolomics, Proteomics, Epigenomics, Interactomics, Metagenomics, Lipidomics, Phosphoproteomics (the ✓ marks are listed per row; their column positions are not preserved).

Year of review | Review reference | Terminology introduced for classifying various integration methods | Omics reviewed | Application domain covered
2009 | (Van Deun et al., 2009) | Matrix decomposition | ✓ | Micro-organism (Escherichia coli)
2009 | (Ebbels and Cavill, 2009) | ‘Conceptual’, ‘statistical’ & ‘model’ | ✓ ✓ | Generic
2012 | (Lussier and Li, 2012) | ‘Cross-scale’ & ‘multi-scale’ | ✓ ✓ | Prediction of clinical outcomes.
2015 | (Ritchie et al., 2015) | ‘Concatenation’, ‘transformation’ & ‘model’ | ✓ ✓ | Generic
2015 | (Gligorijević and Pržulj, 2015) | ‘Early’, ‘late’ & ‘intermediate’ | ✓ ✓ | Generic
2016 | (Bersanelli et al., 2016) | ‘Sequential’, ‘simultaneous’, ‘network-based versus network-free’ & ‘Bayesian vs non-Bayesian’ | ✓ ✓ ✓ | Generic
2016 | (Gligorijević et al., 2016) | - | ✓ ✓ ✓ | Disease subtyping, biomarkers discovery & drug repurposing.
2016 | (Buescher and Driggers, 2016) | - | ✓ ✓ ✓ ✓ | Cancer biology
2017 | (Lin and Lane, 2017) | - | ✓ ✓ ✓ ✓ | Investigated (Ritchie et al., 2015) from ML perspective
2017 | (Huang et al., 2017b) | - | ✓ ✓ ✓ | Patient survival prediction
2017 | (Hasin et al., 2017) | ‘Genome’, ‘phenotype’ & ‘environment-’ first approach | ✓ ✓ ✓ ✓ ✓ | Generic
2018 | (Yu and Zeng, 2018) | ‘Bottom-up’ & ‘top-down’ mode | ✓ ✓ ✓ ✓ | Generic
2018 | (Kim and Tagkopoulos, 2018) | ‘Data-to-data’, ‘data-to-knowledge’ & ‘knowledge-to-knowledge’ | ✓ ✓ ✓ ✓ ✓ ✓ | Generic
2018 | (Rappoport and Shamir, 2018) | - | ✓ ✓ ✓ | Cancer benchmarking
2019 | (Tini et al., 2019) | - | ✓ ✓ ✓ ✓ | Multiple (Mitochondrial metabolism, Platelet reactivity & Breast cancer)
2019 | (Mirza et al., 2019) | - | ✓ ✓ ✓ ✓ ✓ | Generic (more focussed on ML)
2019 | (López de Maturana et al., 2019) | OnO (omics & non-omics) | ✓ ✓ ✓ ✓ ✓ ✓ | Generic
2019 | (Wu et al., 2019) | | ✓ ✓ ✓ ✓ | Generic
matrix of multi-omics data. Finally, in Stage 4 the joint matrix is used for supervised or unsupervised analysis. The main advantage of using concatenation-based methods is the simplicity of employing ML for analysing continuous or categorical data, once the concatenation of all individual omics is completed. These methods use all the concatenated features equally and can select the most discriminating features for a given phenotype.
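The concatenation step (Stage 3) can be illustrated with a minimal sketch. It assumes that every omics block is available for the same patients and that rows are matched by a patient identifier; the function names, the toy data and the per-feature z-scoring are illustrative assumptions (the standardisation is motivated by the normalisation requirement noted in Table 6) and are not taken from any of the cited methods.

```python
import numpy as np

def zscore(block):
    """Standardise each feature (column) to zero mean and unit variance."""
    mean = block.mean(axis=0)
    std = block.std(axis=0)
    std[std == 0] = 1.0  # constant features become all zeros instead of NaN
    return (block - mean) / std

def concatenate_omics(blocks, sample_ids):
    """Join per-omics matrices column-wise into one samples x features matrix.

    blocks: dict mapping an omics name to (sample id list, 2-D array)
    sample_ids: samples to keep, in the row order wanted for the joint matrix
    """
    parts, feature_names = [], []
    for name, (ids, matrix) in blocks.items():
        matrix = np.asarray(matrix, dtype=float)
        row_of = {sample: i for i, sample in enumerate(ids)}
        rows = [row_of[sample] for sample in sample_ids]  # align to a common order
        parts.append(zscore(matrix[rows]))
        feature_names += [f"{name}_{j}" for j in range(matrix.shape[1])]
    return np.hstack(parts), feature_names

# Toy data (hypothetical): three omics on the same four patients, stored in different orders.
rng = np.random.default_rng(0)
blocks = {
    "genomics": (["p1", "p2", "p3", "p4"], rng.normal(size=(4, 5))),
    "proteomics": (["p3", "p1", "p4", "p2"], rng.normal(size=(4, 3))),
    "metabolomics": (["p2", "p4", "p1", "p3"], rng.normal(size=(4, 2))),
}
shared = ["p1", "p2", "p3", "p4"]
joint, names = concatenate_omics(blocks, shared)
print(joint.shape)  # (4, 10)
```

The resulting joint matrix can then be passed to any supervised or unsupervised learner in Stage 4.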
The different concatenation-based integration methods can be further classed as:

4.1.1. Supervised learning concatenation-based methods


Different concatenation-based supervised learning methods have been used for phenotypic prediction. In scenarios where the number of features in the joint matrix is high, different feature selection methods described in Section 3 can be employed during concatenation (Sorzano et al., 2014).
The concatenated multi-omics data (in the form of a joint matrix) is provided as input to different classical ML methods such as DT (decision tree) (Quinlan, 1993), NB (naive Bayes) (Domingos and Pazzani, 1997), ANN (artificial neural networks) (Bishop, 1995), SVM (support vector machine) (Vapnik, 1995), KNN (k-nearest neighbors) (Altman, 1992), RF (random forest) (Breiman, 2001) and K-Star (Cleary and Trigg, 1995) in the literature (Kim and Tagkopoulos, 2018; Lin and Lane, 2017; Auslander et al., 2016; Acharjee et al., 2016; Zhang et al., 2018; Ding et al., 2018; Wang et al., 2020). For example, a joint matrix of multi-omics features (which included gene expression, copy number variation and mutation) was used with classical RF and SVM to predict anti-cancer drug response (Stetson et al., 2014).
Similarly, multivariate LASSO models (Zou, 2006; Nicolai and Peter, 2010; Mankoo et al., 2011) have been investigated. Also, Boosted trees (Elith et al., 2008) and SVR (support vector regression) (Awad and Khanna, 2015) have been investigated for finding the longitudinal predictors of glycaemic health (Prelot et al., 2018).
Other than classical ML algorithms, deep neural networks (Tang et al., 2019) have also been widely used to analyse concatenated multi-omics data. They have been studied to identify robust survival subgroups of liver cancer using RNA, miRNA and methylation data (Chaudhary et al., 2017).
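As an illustration of how an embedded method such as LASSO selects discriminating features from a joint matrix, the sketch below fits a LASSO regression by cyclic coordinate descent. It is a generic textbook formulation under stated assumptions, not the procedure of any cited study: the joint matrix is synthetic, the phenotype is continuous, and the penalty value `alpha=0.3` is an arbitrary choice.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink a scalar towards zero by `threshold` (the LASSO proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            residual = residual + X[:, j] * w[j]   # drop feature j's current contribution
            rho = X[:, j] @ residual / n           # correlation with the partial residual
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual = residual - X[:, j] * w[j]   # restore it with the updated weight
    return w

# Synthetic joint matrix (assumption): 60 samples, 30 concatenated features,
# of which only features 0 and 12 drive the phenotype.
rng = np.random.default_rng(1)
n_samples = 60
joint = rng.normal(size=(n_samples, 30))
joint = (joint - joint.mean(axis=0)) / joint.std(axis=0)
phenotype = 2.0 * joint[:, 0] - 1.5 * joint[:, 12] + 0.1 * rng.normal(size=n_samples)
phenotype = phenotype - phenotype.mean()

weights = lasso_coordinate_descent(joint, phenotype, alpha=0.3)
selected = np.flatnonzero(weights)  # indices of features with non-zero weight
print("selected feature indices:", selected)
```

Features whose weights are shrunk exactly to zero are discarded, which is what makes the L1 penalty act as a feature selector on the concatenated data.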

4.1.2. Unsupervised learning concatenation-based methods


Various concatenation-based unsupervised methods have been used for clustering and association analysis. Different matrix factorisation-based methods have evolved in recent years. Joint NMF (non-negative matrix factorisation) (Zhang et al., 2012) was proposed to integrate multi-omics data with non-negative values. It involved decomposing the joint matrix into loadings and factors, bringing the different omics into a common basis matrix. Joint NMF is computationally slow and needs large memory allocation.
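The idea of a common basis matrix can be sketched as follows. The code factorises the column-wise concatenation of the omics blocks, X = [X1 | X2] ≈ W [H1 | H2], using the standard multiplicative updates for the Frobenius objective, so that W is shared across omics while each block keeps its own coefficient matrix. This is a simplified illustration on synthetic data and an assumption about the general scheme, not the implementation of Zhang et al. (2012).

```python
import numpy as np

def joint_nmf(blocks, rank, n_iter=300, seed=0, eps=1e-9):
    """Factorise non-negative blocks X_i (samples x features_i) as X_i ~ W @ H_i.

    W (samples x rank) is shared by all omics; one H_i is returned per block.
    """
    rng = np.random.default_rng(seed)
    X = np.hstack(blocks)
    n, p = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, p)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    splits = np.cumsum([b.shape[1] for b in blocks])[:-1]
    return W, np.split(H, splits, axis=1)

# Synthetic example (assumption): two non-negative omics blocks sharing two latent factors.
rng = np.random.default_rng(2)
W_true = rng.random((20, 2))
expression = W_true @ rng.random((2, 15))
methylation = W_true @ rng.random((2, 8))

W, (H_expr, H_meth) = joint_nmf([expression, methylation], rank=2)
print(W.shape, H_expr.shape, H_meth.shape)  # (20, 2) (2, 15) (2, 8)
```

Each row of W gives a sample's loading on the shared factors, so samples can be grouped, for example, by their largest loading.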


Similarly, Shen (Shen et al., 2009) proposed the iCluster framework, which uses principles similar to NMF but allows integration of datasets having negative values. They showed the functioning of the framework by using copy number, mRNA expression and methylation data to conduct a cancer subtype discovery in glioblastoma. This framework was also employed for a landmark study that used genomic and transcriptomic data from 2,000 breast tumours and discovered novel subgroups amongst them (Curtis et al., 2012).
Later, the iCluster+ framework by Mo (Mo et al., 2013) offered a significant enhancement over the iCluster framework. The iCluster+ framework can discover patterns and combine a range of omics having binary, categorical and continuous values and was demonstrated by combining genomic data from the colorectal cancer datasets.

Table 5 (continued)
Reviews from 2020: (Canzler et al., 2020), (Nicora et al., 2020), (Eicher et al., 2020), (Nguyen and Wang, 2020) and (Jamil et al., 2020).
Terminology introduced for classifying various integration methods: ‘Vertical’, ‘horizontal’, ‘parallel’ & ‘hierarchical’; ‘Single-view’ & ‘multi-view’; ‘Element’, ‘pathway’ and ‘mathematical’ based approach.
Application domains covered: Generic; Generic (more focussed on ML); Plant systems biology; Toxicological research; Oncology.
Row-wise alignment of these entries and the ✓ marks for the omics reviewed are not preserved.


Another adaptation of NMF was evaluated as JIVE (Joint and Individual Variation Explained) which captures joint variation across integrating data types and structural variation of each data type along with the residual noise (Lock et al., 2013). It was used to investigate gene expression and miRNA data on brain tumour samples. The sparsity problem in JIVE was improved by JBF (Joint Bayes Factor) (Ray et al., 2014). JBF used joint factor analysis to evaluate the feature space and converted it into shared and datatype-specific components.

Fig. 2. Workflow pipelines for different types of integration methods for multi-omics analysis.

The MoCluster proposed by Meng (Meng et al., 2016a) used multi-block multivariate analysis for highlighting the patterns across different input omics data and then finds the joint clusters amongst them. MoCluster was validated by integrating proteomic and transcriptomic data and shows a noticeably higher clustering accuracy and lower computation cost in comparison to both iCluster and iCluster+.
Fridley (Fridley et al., 2012) has studied the genomic effects due to the gemcitabine drug using high-throughput data from mRNA expression and SNPs. They integrated these two datasets into one large input matrix and developed a Bayesian pathway analysis that uses a stochastic search variable selection. They proposed that the Bayesian integrative model offers better performance in detecting the genomic effects in comparison to using conventional single-omics analysis. Similarly, Zhu (Zhu et al., 2012) has also explored BN (Bayesian network) to understand cell regulation in yeast using metabolomics and transcriptomics data.
Also, LRAcluster (Wu et al., 2015) was developed to integrate high-dimensional multi-omics data and find a low-dimensional manifold to identify molecular subtypes of cancer.
Recently, iClusterBayes was introduced by Mo (Mo et al., 2018), which is a fully Bayesian latent variable model. It overcomes the limitations of iCluster+ in terms of statistical inference and computational speed. iClusterBayes includes a binary indicator prior for variable selection and generalises to binary data and count data. Also, Argelaguet (Argelaguet et al., 2018) have developed MOFA (Multi-Omics Factor Analysis) which disentangles the heterogeneity shared across different omics to discover the principal source of variability. It can integrate partially overlapping datasets.


4.2. Model-based integration methods

Model-based integration methods create multiple intermediate models for the different omics data and then build a final model from various intermediate models (Fig. 2). Stage 1 sets up the raw data from the three individual omics along with the corresponding phenotypic information. In Stage 2, individual models are developed for each of the omics which are later integrated into a joint model in Stage 3. Finally, in Stage 4 the joint model is analysed. The major advantage of model-based integration methods is that they can be used for merging models based on different omic types, where each model is developed from a different patient group having the same disease information (He et al., 2016; Ritchie et al., 2015).
Model-based integration approaches facilitate the understanding of interactions amongst different omics for a certain phenotype (for example, survival in pancreatic cancer). The final multi-dimensional joint model in Stage 4 can be built using an ML algorithm (such as neural networks) which uses the most relevant variables from each omics model (from Stage 3). This approach allows the analysis of the improvement in the predictive power for individual models and also finds the best discriminating features.
The different model-based integration methods can be further classed as follows.

4.2.1. Supervised learning model-based methods

Model-based supervised learning methods include a variety of frameworks for developing a model, such as majority-based voting (Drăghici and Potter, 2003), hierarchical classifiers (Bavafaye Haghighi et al., 2019) and ensemble-based approaches (such as XGBoost (Ma et al., 2020b) and KNN (Shen and Chou, 2006)).
Deep learning methods have also been adopted for model-based supervised learning (Poirion et al., 2020). The MOLI (multi-omics late integration) (Sharifi-Noghabi et al., 2019) method used type-specific encoding sub-networks to learn features from somatic mutation, CNA and gene expression data independently and then later concatenated them for predicting the response to a given drug. Lee (Lee et al., 2020) has proposed a deep learning-based auto-encoding approach for integrating four omics to create a survival prediction model. Also, the HI-DFNForest (hierarchical integration deep flexible neural forest) framework (Xu et al., 2019a) was developed which uses a stacked auto-encoder (Vincent et al., 2010) to learn high-level representations from three omic datasets. Later, these representations are integrated to predict cancer subtype classification. Similarly, Chaudhary (Chaudhary et al., 2017) has used autoencoders along with SVM for survival prediction in subgroups of hepatocellular carcinoma.
In the past years, ATHENA (Analysis Tool for Heritable and Environmental Network Associations) was developed for analysing multi-omics data (Chung and Kang, 2019; Holzinger et al., 2014). It uses grammatical evolution neural networks along with Biofilter (Bush et al., 2009) and Random Jungle (Schwarz et al., 2010) to investigate different categorical and quantitative variables and develop prediction models.
Recently, MOSAE (Multi-omics Supervised Autoencoder) (Tan et al., 2020a) was developed for pan-cancer analysis and compared with the conventional ML methods such as SVM, DT, naïve Bayes, KNN, RF and AdaBoost. Similarly, a denoising autoencoder has been incorporated along with L1-penalized logistic regression for identifying ovarian cancer subtypes (Guo et al., 2020).

4.2.2. Unsupervised learning model-based methods

Various model-based unsupervised learning methods have been implemented in the past. PSDF (Patient-Specific Data Fusion) (Yuan et al., 2011) is a non-parametric Bayesian model for clustering prognostic cancer subtypes by combining gene expression and copy number variation data. It uses a two-step process and limits the integration to only two datatypes. Similarly, CONEXIC (Akavia et al., 2010) also uses a BN to integrate gene expression and copy number variation from tumour samples to identify driver mutations. On the other hand, clustering methods such as FCA (Formal Concept Analysis) consensus clustering (Hristoskova et al., 2014), MDI (Multiple Dataset Integration) (Kirk et al., 2012), PINS (Perturbation clustering for data integration and disease subtyping) (Nguyen et al., 2017), PINS+ (Nguyen et al., 2019) and BCC (Bayesian consensus clustering) (Lock and Dunson, 2013) are more flexible and allow late-stage integration of clusters.
Different network-based methods are also available for association analysis. Lemon-Tree (Bonnet et al., 2015) implemented ensemble methods for reconstructing module networks which used somatic copy number alterations and gene expression in brain tumour samples. Furthermore, SNF (Similarity Network Fusion) (Wang et al., 2014) constructs networks of samples for each data type and then effectively fuses them into a joint network which denotes the complete range of original data. It combines mRNA expression, DNA methylation and microRNA (miRNA) expression data from cancer datasets.

4.3. Transformation-based integration methods

Transformation-based integration methods first transform each of the omics datasets into graphs or kernel matrices and then combine all of them into one before constructing a model.
Fig. 2 shows the various stages of transformation-based integration. Stage 1 sets up the raw data from the three individual omics along with the corresponding phenotypic information. In Stage 2, the individual transformations (in the form of graph or kernel relationship) are developed for each of the omics which are later integrated into a joint transformation in Stage 3. Finally, in Stage 4 it is analysed. The primary advantage of the transformation-based integration methods is that they can be used to combine a wide range of omics if unique information (such as patient ID) is available.
Graphs provide a formal means to transform and portray relationships between different omics samples where the nodes and edges of a graph represent the subjects and their relationships, respectively. Similarly, kernel methods enable the transformation of data from its original space into a higher dimensional feature space. These methods then explore linear decision functions in the feature space which were non-linear in the original space.
The transformation-based integrative methods can be classed as follows.

4.3.1. Supervised learning transformation-based methods

In the past, various transformation-based supervised learning methods have been presented. Most of them are kernel and graph-based algorithms (Yan et al., 2017). The kernel-based integration approaches include SDP-SVM (Semi-Definite Programming SVM) (Lanckriet et al., 2004), FSMKL (Multiple Kernel Learning with Feature Selection) (Seoane et al., 2014), RVM (Relevance Vector Machine) (Bowd et al., 2005; Tipping, 2001) and Ada-boost RVM (Wu et al., 2010). Moreover, fMKL-DR (fast multiple kernel learning for dimensionality reduction) (Giang et al., 2020) has been used along with SVM for combining gene expression, miRNA expression, and DNA methylation data. Similarly, the graph-based integration approaches consist of graph-based SSL (semi-supervised learning⁷) (Tsuda et al., 2005; Culp and Michailidis, 2008; Kim et al., 2015; Yue et al., 2017; Bhardwaj and Van Steen, 2020), graph sharpening (Shin et al., 2010; Shin et al., 2007), composite network (Mostafavi and Morris, 2010) and BN (Rhodes et al., 2005).
Overall, it is evident from the literature that kernel-based algorithms have superior performance to graph-based approaches, but they usually need more time for the training phase. In contrast, graph-based approaches can disclose the relations between samples while taking less computation time. Yan (Yan et al., 2017) provide an extensive

⁷ For the sake of simplicity, the semi-supervised integration methods (graph-based) are grouped under supervised learning.


Table 6
The advantages and disadvantages of using different integrative methods.

Concatenation-based
  Advantages:
  • Easy and straightforward.
  • Enables the use of classical supervised and unsupervised methods.
  Disadvantages:
  • Ideally, requires all omics data for all patients.
  • Need proper normalisation before concatenation.
  • Does not consider the unique distribution of each omics.
  • Memory and computation-intensive when the concatenated matrix is large.

Model-based
  Advantages:
  • Facilitates the understanding of interactions amongst different omics.
  • Omics data can be from a different set of patients with a similar phenotype.
  • Does not increase dimensional complexity.
  Disadvantages:
  • Not effective if omics data is extremely heterogeneous.
  • Could lead to an overfitted solution.
  • Weak signals could be lost.

Transformation-based
  Advantages:
  • Graph representation easy to understand and computationally less intensive.
  • Kernel methods provide superior performance.
  • Multi-omics data for the same patient can be used for their disease subgroup analysis.
  Disadvantages:
  • Kernel methods are computationally more intensive than graph methods.
  • Transformation can be sometimes challenging.
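The model-based workflow of Section 4.2 (one intermediate model per omics, followed by a joint decision) can be sketched with majority-based voting. Everything below is an illustrative assumption: the nearest-centroid classifier stands in for whichever per-omics model a study would use, the data are synthetic, and for simplicity all omics are measured on the same samples.

```python
import numpy as np

class NearestCentroid:
    """Tiny per-omics classifier: assign each sample to the closest class mean."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dist = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dist.argmin(axis=1)]

def majority_vote(predictions):
    """Combine per-omics label predictions; ties go to the smallest label."""
    stacked = np.stack(predictions, axis=1)  # samples x models
    voted = []
    for row in stacked:
        labels, counts = np.unique(row, return_counts=True)
        voted.append(labels[counts.argmax()])
    return np.array(voted)

# Synthetic data (assumption): three omics blocks, two phenotype classes.
rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 50)
omics = [rng.normal(size=(100, 5)) + 2.0 * labels[:, None] for _ in range(3)]
train = np.arange(100) % 2 == 0
test = ~train

# Stage 2: one intermediate model per omics. Stage 3: joint decision by voting.
models = [NearestCentroid().fit(X[train], labels[train]) for X in omics]
per_omics = [m.predict(X[test]) for m, X in zip(models, omics)]
joint_prediction = majority_vote(per_omics)
accuracy = (joint_prediction == labels[test]).mean()
print("held-out accuracy of the voted model:", accuracy)
```

Because each intermediate model only sees its own omics block, the dimensionality handled by any single model does not grow with the number of omics.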

comparison between different graph- and kernel-based integration approaches in a supervised learning context using various standardised test datasets. It highlights the better classification performance of RVM, Ada-boost RVM and SDP-SVM in comparison to SSL, graph sharpening, composite network and BN.
Recently, MORONET (Multi-Omics gRaph cOnvolutional NETworks) (Wang et al., 2020) was introduced, which uses graph convolutional networks, taking benefit of the omics features and the associations among patients (as defined by the patient similarity networks) for better classification results.

4.3.2. Unsupervised learning transformation-based methods

Different transformation-based unsupervised methods have been introduced. Some of them are kernel- and graph-based methods. Lately, rMKL-LPP (regularised multiple kernel learning for Locality Preserving Projections) (Speicher and Pfeifer, 2015) was implemented for clustering analysis. It used an individual kernel for each omics along with a graph embedding framework to identify biologically meaningful subgroups for five different cancer types. Similarly, PAMOGK (Tepeli et al., 2019) was developed for integrating multi-omics data with pathways using a graph kernel, SmSPK (smoothed shortest path graph kernel). It used somatic mutations, transcriptomics and proteomics data to find subgroups of kidney cancer.
Meta-SVM (Meta-analytic SVM) was proposed by Kim (Kim et al., 2017); it integrates multiple omics data and is able to detect consensus genes associated with diseases across studies such as breast cancer and idiopathic pulmonary fibrosis. Recently, NEMO (NEighborhood based Multi-Omics clustering) (Rappoport and Shamir, 2019) was introduced, which uses an inter-patient similarity matrix-based distance metric for evaluating the input omic datasets individually. These omics matrices are then combined into one matrix and then analysed using spectral-based clustering. It can work on partial data sets (no imputation needed), where measurements are only available for a subset of omics data.
Table 6 highlights the advantages and disadvantages of various integration methods. Table 7 summarises various multi-omics integration methods based on learning type.

5. Application of integrative methods in multi-omics studies

The availability of high-throughput omics provides a unique opportunity to explore the complex relationships between different omics and phenotypic targets instead of mono-omics evaluation. This section describes various multi-omics studies which deployed methods investigated in the previous section. Table 8 summarises different phenotypic target-based, multi-omics studies published and tabulates them across the span of 7 main omics namely, genomics, transcriptomics, metabolomics, proteomics, glycomics, lipidomics and epigenomics. Genomics is further divided into gene expression, DNA methylation, somatic point mutation and copy number alteration. Similarly, transcriptomics is further classed into mRNA, microRNA (miRNA) and long non-coding RNA (lncRNA). The various multi-omics studies are broadly

Table 7
The summary of multi-omics integration methods based on learning type. For abbreviations please refer to List of Abbreviations.

Supervised learning
  Concatenation-based:
  • Classical ML (DT (Quinlan, 1993), NB (Domingos and Pazzani, 1997), ANN (Bishop, 1995), SVM (Vapnik, 1995), KNN (Altman, 1992), K-Star (Cleary and Trigg, 1995))
  • LASSO (Zou, 2006; Nicolai and Peter, 2010; Mankoo et al., 2011)
  • BT (Elith et al., 2008)
  • SVR (Awad and Khanna, 2015)
  • DNN (Tang et al., 2019)
  Model-based:
  • Majority-based voting (Drăghici and Potter, 2003)
  • Hierarchical Classifiers (Bavafaye Haghighi et al., 2019)
  • Ensemble-based classifiers (XGBoost (Ma et al., 2020a) and KNN (Shen and Chou, 2006))
  • MOLI (Sharifi-Noghabi et al., 2019)
  • HI-DFNForest (Xu et al., 2019a)
  • ATHENA (Chung and Kang, 2019; Holzinger et al., 2014)
  Transformation-based:
  • SDP-SVM (Lanckriet et al., 2004)
  • FSMKL (Seoane et al., 2014)
  • RVM (Bowd et al., 2005; Tipping, 2001)
  • Ada-boost RVM (Wu et al., 2010)
  • fMKL-DR (Giang et al., 2020)
  • SSL (Tsuda et al., 2005; Culp and Michailidis, 2008; Kim et al., 2015; Yue et al., 2017; Bhardwaj and Van Steen, 2020)
  • Graph sharpening (Shin et al., 2010; Shin et al., 2007)
  • Composite network (Mostafavi and Morris, 2010)
  • BN (Rhodes et al., 2005)
  • MORONET (Wang et al., 2020)

Unsupervised learning
  Concatenation-based:
  • Joint NMF (Zhang et al., 2012)
  • iCluster (Shen et al., 2009)
  • iCluster+ (Mo et al., 2013)
  • JIVE (Lock et al., 2013)
  • JBF (Ray et al., 2014)
  • BN (Fridley et al., 2012; Zhu et al., 2012)
  • MoCluster (Meng et al., 2016a)
  • iClusterBayes (Mo et al., 2018)
  • MOFA (Argelaguet et al., 2018)
  Model-based:
  • PSDF (Yuan et al., 2011)
  • FCA consensus clustering (Hristoskova et al., 2014)
  • MDI (Kirk et al., 2012)
  • BCC (Lock and Dunson, 2013)
  • Lemon-Tree (Bonnet et al., 2015)
  • SNF (Wang et al., 2014)
  Transformation-based:
  • rMKL-LPP (Speicher and Pfeifer, 2015)
  • PAMOGK (Tepeli et al., 2019)
  • Meta-SVM (Kim et al., 2017)
  • NEMO (Rappoport and Shamir, 2019)
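The kernel route of transformation-based integration (Section 4.3) can be sketched as follows: each omics block is transformed into a kernel matrix, the kernels are combined, and a single kernel model is trained on the fused kernel. The sketch makes several assumptions that are not drawn from the cited methods: RBF kernels with a width of one over the number of features, equal kernel weights (multiple kernel learning methods would instead learn them), and a kernel ridge classifier swapped in for the SVM to keep the example dependency-free.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF (Gaussian) kernel between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def fused_kernel(blocks_a, blocks_b):
    """Average one RBF kernel per omics block (equal weights are an assumption)."""
    kernels = [rbf_kernel(a, b, gamma=1.0 / a.shape[1]) for a, b in zip(blocks_a, blocks_b)]
    return sum(kernels) / len(kernels)

# Synthetic data (assumption): three omics blocks, phenotype coded as -1 / +1.
rng = np.random.default_rng(4)
labels = np.repeat([-1.0, 1.0], 40)
omics = [rng.normal(size=(80, d)) + labels[:, None] for d in (6, 4, 3)]
train = np.arange(80) % 2 == 0
test = ~train

# Stage 2 and 3: per-omics kernels, then one joint kernel.
K_train = fused_kernel([X[train] for X in omics], [X[train] for X in omics])
K_test = fused_kernel([X[test] for X in omics], [X[train] for X in omics])

# Stage 4: kernel ridge classifier on the fused kernel (stand-in for an SVM).
lam = 1.0
dual = np.linalg.solve(K_train + lam * np.eye(train.sum()), labels[train])
predicted = np.sign(K_test @ dual)
accuracy = (predicted == labels[test]).mean()
print("held-out accuracy with the fused kernel:", accuracy)
```

Only a shared sample identifier is needed to align the kernels, since each kernel is a samples-by-samples matrix regardless of how many features the underlying omics block has.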

Table 8
Multi-omics studies using different ML methods. For abbreviations please refer to List of Abbreviations.
Columns: Target | Genomics (Gene expression, DNA methylation, Somatic point mutation, Copy number alteration) | Transcriptomics (mRNA, miRNA, lncRNA) | Metabolomics | Proteomics | Glycomics | Lipidomics | Epigenomics | Method Used | Method Type | Reference

Humans
Age-related ✓ ✓ ✓ ✓ Graphical RF CU (Zierer et al.,
2016)
Acute myeloid ✓ ✓ LASSO CS (Taskesen et al.,
leukaemia 2015)
Anti-cancer ✓ ✓ RF & SVM CS (Stetson et al.,
therapeutic 2014)
response
Biomedical ✓ ✓ ✓ MORONET TS (Wang et al.,
data 2020)
classification
Brain cancer ✓ ✓ ✓ ✓ ✓ LASSO CS (Lu et al., 2016)
✓ ✓ JIVE CU (Lock et al.,
2013)
✓ ✓ ✓ ✓ iClusterBayes CU (Mo et al., 2018)
✓ ✓ Lemon-Tree MU (Bonnet et al.,
2015)
✓ ✓ ✓ SNF MU (Wang et al.,
2014)
Breast cancer ✓ ✓ RF CS (List et al., 2014)
✓ ✓ LASSO CS (Lee et al., 2017)
✓ ✓ RF & SVM CS (Nam et al.,
2009)
✓ ✓ ✓ LASSO CS (Chen et al.,
2017)
✓ ✓ SVM CS (Auslander et al.,
2016)
✓ ✓ iCluster CU (Shen et al.,
2009)
✓ ✓ ✓ SVM, RF, SVM CS & TS (Ma et al., 2016)
& Multi-
Kernel
Learning
✓ ✓ iCluster CU (Curtis et al.,
2012)
✓ ✓ ✓ ✓ BCC MU (Lock and
Dunson, 2013)
✓ ✓ FSMKL TS (Seoane et al.,
2014)
✓ ✓ ✓ Meta-SVM TU (Kim et al., 2017)
Cancer survival ✓ ✓ SVM & RF CS (Kim et al., 2014)
Cancer ✓ ✓ ✓ ✓ LASSO CS (Zhao et al.,
prognosis 2015)
✓ ✓ PSDF MU (Yuan et al.,
2011)
✓ ✓ CONEXIC MU (Akavia et al.,
2010)
Cancer drug ✓ ✓ ✓ MOLI (DL) MS (Sharifi-Noghabi
response et al., 2019)
Cardiac tissue ✓ ✓ RF CS (Dimitrakopoulos


ageing et al., 2014)
Colorectal ✓ ✓ Neural Fuzzy CU (Vineetha et al.,
cancer Network 2013)
COVID-19 ✓ ✓ ✓ PLS-DA CS (Thomas et al.,
analysis 2020)
✓ ✓ ✓ Extra Trees CS (Overmyer et al.,
2020)
Chronic ✓ ✓ ✓ MOFA CU (Argelaguet et al.,
lymphocytic 2018)
leukaemia
Gastric cancer ✓ ✓ SVM & RF CS (Yan et al., 2012)
Kidney cancer ✓ ✓ ✓ ✓ ✓ iClusterBayes CU (Mo et al., 2018)
✓ ✓ ✓ PAMOGK TU (Tepeli et al.,
2019)
Liver cancer ✓ ✓ ✓ Auto-encoder, MS (Chaudhary et al.,
SVM 2017)
Lung cancer ✓ ✓ iCluster CU (Shen et al.,
2009)
✓ ✓ ✓ ✓ Auto-encoder MS (Lee et al., 2020)
Neuroblastoma ✓ ✓ Auto- CS (Zhang et al.,
encoders, 2018)
SVM & NB
Ovarian cancer ✓ ✓ RF CS (Anděl et al.,
2015)
✓ ✓ RF CS (Paik et al., 2017)
✓ ✓ ✓ ✓ LASSO CS (Mankoo et al.,
2011)
✓ ✓ ✓ Joint NMF CU (Zhang et al.,
2012)
✓ ✓ ✓ JBF CU (Ray et al., 2014)
✓ ✓ ✓ ✓ BN TS (Zhang et al.,
2014)
✓ ✓ ✓ ✓ Graph SSL TS (Kim et al., 2015)
Oral squamous ✓ ✓ ✓ SVM CS (Li et al., 2017)
cell
carcinoma
Pan-cancer ✓ ✓ ✓ iCluster+ CU (Mo et al., 2013)
analysis ✓ ✓ moCluster CU (Meng et al.,

Biotechnology Advances 49 (2021) 107739


2016a)
✓ ✓ ✓ ✓ LRAcluster CU (Wu et al., 2015)
✓ ✓ ✓ XGBoost MS (Ma et al., 2020a)
✓ ✓ ✓ RF MS (Bavafaye
Haghighi et al.,
2019)
✓ ✓ ✓ HI-DFN Forest MS (Xu et al., 2019a)
(AE)
✓ ✓ ✓ ✓ MOSAE MS (Tan et al.,
2020a)
✓ ✓ ✓ ✓ PINS MU (Nguyen et al.,
2017)

✓ ✓ ✓ fMKL-DR TS (Giang et al., 2020)
✓ ✓ ✓ rMKL-LPP TU (Speicher and Pfeifer, 2015)
✓ ✓ ✓ NEMO TU (Rappoport and Shamir, 2019)
Pancreatic cancer ✓ ✓ SVM MS (Kwon et al., 2015)
Prostate cancer ✓ ✓ RF CS (Fan et al., 2011)
Precision oncology ✓ ✓ Auto-encoders, Elastic Net, SVM & Consensus Clustering CS & MU (Ding et al., 2018)
Thyroid function ✓ ✓ RF CS (Pietzner et al., 2017)
Ulcerative colitis ✓ ✓ LASSO CS (Bjerrum et al., 2014)

Plants
Potato flesh colour ✓ ✓ ✓ RF CS (Acharjee et al., 2016)
✓ ✓ RF MS (Acharjee et al., 2011)

Animals & Micro-organisms
Dog heart disease ✓ ✓ RF CS (Li et al., 2015)
Yeast ✓ ✓ BN CU (Zhu et al., 2012)
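
The two-letter codes in the Method Type column of Table 8 can be read as an integration strategy followed by a learning mode. The sketch below assumes that C, M and T stand for the concatenation-, model- and transformation-based families named in the Conclusions, and that S and U stand for supervised and unsupervised learning. This expansion is inferred from context, so the List of Abbreviations remains authoritative, and the helper name `decode_method_type` is hypothetical.

```python
# Assumed expansion of the Table 8 "Method Type" codes (inferred from context).
STRATEGY = {"C": "concatenation-based", "M": "model-based", "T": "transformation-based"}
LEARNING = {"S": "supervised", "U": "unsupervised"}


def decode_method_type(cell):
    """Expand a cell such as 'CS & TS' into (strategy, learning) pairs."""
    pairs = []
    for code in cell.split("&"):
        code = code.strip().upper()
        # Each code must be exactly one strategy letter plus one learning letter.
        if len(code) != 2 or code[0] not in STRATEGY or code[1] not in LEARNING:
            raise ValueError(f"unrecognised method type code: {code!r}")
        pairs.append((STRATEGY[code[0]], LEARNING[code[1]]))
    return pairs


if __name__ == "__main__":
    # Rows taken from Table 8: NEMO is listed as TU, Ding et al. (2018) as CS & MU.
    print(decode_method_type("TU"))
    print(decode_method_type("CS & MU"))
```

Under this reading, a row marked with two codes, such as the precision oncology entry, combines methods from two families within one study.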

Fig. 3. Recommendation flowchart for choosing a method for multi-omics integration. For abbreviations please refer to List of Abbreviations.
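
As an illustration only, the two worked examples that Section 6 gives for this flowchart can be written as a small rule function. This is a minimal Python sketch under stated assumptions: the function name `suggest_methods`, its parameters and the string labels are hypothetical, only the branches quoted in the text are encoded, and restricting NEMO to the unsupervised case is an assumption based on its TU entry in Table 8. Fig. 3 itself remains the reference for every other path.

```python
def suggest_methods(omics, supervised, approach, partially_overlapping=False):
    """Return candidate methods for the examples quoted in Section 6, else None.

    omics: list of omics names, e.g. ["gene expression", "CNV"]
    supervised: True for supervised learning, False for unsupervised
    approach: "concatenation", "model" or "transformation" (hypothetical labels)
    partially_overlapping: True if the datasets share only some samples
    """
    names = frozenset(name.strip().lower() for name in omics)

    # Example 1: two omics, unsupervised learning, model-based method.
    if not supervised and approach == "model" and len(names) == 2:
        if names == frozenset({"gene expression", "cnv"}):
            return ["PSDF", "Lemon-Tree"]
        return ["MDI", "SNF"]

    # Example 2: partially overlapping datasets with a transformation approach.
    # The unsupervised restriction is an assumption (NEMO is TU in Table 8).
    if not supervised and approach == "transformation" and partially_overlapping:
        return ["NEMO"]

    # Not covered by the examples in the text; consult Fig. 3 directly.
    return None


if __name__ == "__main__":
    print(suggest_methods(["Gene expression", "CNV"], False, "model"))
    print(suggest_methods(["mRNA", "miRNA"], False, "model"))
```

Returning `None` for every unencoded path keeps the sketch from suggesting a method that the text does not actually recommend.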

grouped based on the target and the corresponding ML method used.
It is evident from Table 8 that most of the multi-omics studies focus on different forms of cancer. In particular, the many multi-omics studies related to breast (Chen et al., 2017; Lee et al., 2017; List et al., 2014; Ma et al., 2016; Nam et al., 2009) and ovarian (Anděl et al., 2015; Mankoo et al., 2011; Paik et al., 2017; Zhang et al., 2014) cancer highlight the research thrust of the scientific community in these domains.
Many intra-omics studies have successfully explored the integration of gene expression and DNA methylation. LASSO methods have been used for this particular integration by Taskesen et al. (2015) and Lee et al. (2017) for acute myeloid leukaemia and breast cancer, respectively. LASSO has also been employed for cancer prognosis (Zhao et al., 2015). Similarly, mRNA–miRNA integration was investigated using a Neural Fuzzy Network for colorectal cancer (Vineetha et al., 2013), SVM for pancreatic cancer (Kwon et al., 2015), and RF for cardiac tissue ageing (Dimitrakopoulos et al., 2014) and ovarian cancer (Anděl et al., 2015). SVM has also been used in an oral squamous cell carcinoma study that integrated different transcriptomics, namely mRNA, miRNA and lncRNA (Li et al., 2017).
Metabolomics and proteomics have been integrated using RF for the analysis of prostate cancer (Fan et al., 2011) and thyroid function (Pietzner et al., 2017). Similarly, metabolomics has been integrated with mRNA for studying ulcerative colitis (Bjerrum et al., 2014) and cancer survival (Kim et al., 2014). On the other hand, glycomics and epigenomics have only appeared once in the multi-omics context (along with mRNA and metabolomics), where they were used by Zierer et al. (2016) for the study of age-related comorbidities using a graphical variant of RF.
Recently, metabolomics and proteomics have also been integrated with lipidomics to evaluate COVID-19 patients using PLS-DA (Partial Least Squares Discriminant Analysis) and Extra Trees (Overmyer et al., 2020; Thomas et al., 2020).
Multi-omics studies have also been successfully conducted in plants (potato (Acharjee et al., 2011, 2016)) and animals (such as canine heart disease (Li et al., 2015)).
Overall, the recent multi-omics studies highlight the superiority of integration methods in understanding the complexity of different diseases and in uncovering the underlying abnormalities from the vast amounts of multi-omics data generated, which is not always possible with individual omics analysis.

6. Recommendations

Today a plethora of multi-omic integration methods are available for both supervised and unsupervised learning, as is evident in the current review. This information can overwhelm interdisciplinary scientists, and understanding the challenging mathematical and computational concepts behind the methods would require a time-consuming effort. Hence, we suggest that interdisciplinary teams working on multi-omics always include ML practitioners to assist with the choice of methods, the development of solutions, and the interpretation of results and their significance and limits. Such truly interdisciplinary teams offer real opportunities for better mutual understanding of the different fields, practice and expertise necessary, leading ultimately to more robust conclusions.
Also, to facilitate the method selection process, a recommendation flowchart is proposed in Fig. 3. It shows the various decision steps required for choosing an appropriate method (or family of methods) for a given scenario. For example, to choose a method for integrating two omics for unsupervised learning, one can choose a model-based method such as ‘PSDF or Lemon-Tree’ if the two omics are gene expression and CNV; otherwise ‘MDI or SNF’ can be used. Similarly, ‘NEMO’ can be used in scenarios where the datasets are partially overlapping and a transformation approach is required. Hence, the flowchart can be used for biomedical analyses, including diagnosis, prognosis and biomarker identification, by posing them as supervised or unsupervised learning problems.
Clearly, a ‘one-size-fits-all’ approach is not feasible. Also, unfortunately, the existing literature does not provide many direct comparisons between methods using the same publicly available datasets. Hence, to choose the best method for a given dataset and question, an empirical approach that investigates the use of different methods, guided by ML practitioners, is recommended.

7. Conclusions

This paper reviewed various ML approaches used for the integration of multi-omics data for analysis. A concise background of multi-omics and ML was presented. It examined the concatenation-, model- and transformation-based integration methods employed for multi-omics data, along with their advantages and disadvantages. Also, various existing multi-omics studies have been summarised. Finally, a recommendation flowchart is presented for interdisciplinary professionals to choose an appropriate method for a multi-omics dataset. Overall, this work showcases the recent findings in the multi-omics domain and signifies the key role of ML in the future of personalised healthcare.

Disclosure

The authors have nothing to disclose.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 633983. Ewan Pearson and Emanuele Trucco would like to acknowledge the National Institute for Health Research (NIHR) global health research unit on global diabetes outcomes research at the University of Dundee (INSPIRED project, Award number 16/136/102) for useful discussions.

References

Acharjee, A., Kloosterman, B., de Vos, R.C.H., Werij, J.S., Bachem, C.W.B., Visser, R.G.F., Maliepaard, C., 2011. Data integration and network reconstruction with ~omics data using Random Forest regression in potato. Anal. Chim. Acta 705, 56–63. https://doi.org/10.1016/j.aca.2011.03.050.
Acharjee, A., Kloosterman, B., Visser, R.G.F., Maliepaard, C., 2016. Integration of multi-omics data for prediction of phenotypic traits using random forest. BMC Bioinformat. 17, 180. https://doi.org/10.1186/s12859-016-1043-4.
Agache, I., Rogozea, L., 2017. Asthma biomarkers: do they bring precision medicine closer to the clinic? Allergy, Asthma Immunol. Res. 9, 466–476. https://doi.org/10.4168/aair.2017.9.6.466.
Akavia, U.D., Litvin, O., Kim, J., Sanchez-Garcia, F., Kotliar, D., Causton, H.C., Pochanard, P., Mozes, E., Garraway, L.A., Pe’er, D., 2010. An integrated approach to uncover drivers of cancer. Cell 143, 1005–1017. https://doi.org/10.1016/j.cell.2010.11.013.
Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P., 2008. Molecular Biology of the Cell 5E, 5 edition. Garland Science, New York.
Alidjinou, E.K., Deldalle, J., Hallaert, C., Robineau, O., Ajana, F., Choisy, P., Hober, D., Bocket, L., 2017. RNA and DNA Sanger sequencing versus next-generation sequencing for HIV-1 drug resistance testing in treatment-naive patients. J. Antimicrob. Chemother. 72, 2823–2830. https://doi.org/10.1093/jac/dkx232.
Altman, N.S., 1992. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46, 175–185. https://doi.org/10.1080/00031305.1992.10475879.
Amancio, D.R., Comin, C.H., Casanova, D., Travieso, G., Bruno, O.M., Rodrigues, F.A., Costa, L. Da F., 2014. A systematic comparison of supervised classifiers. PLoS One 9, e94137. https://doi.org/10.1371/journal.pone.0094137.
Amazon EC2, 2021. Amaz. Web Serv. Inc. URL. https://aws.amazon.com/ec2/ (accessed 10.29.20 [WWW Document]).
Anděl, M., Kléma, J., Krejčík, Z., 2015. Network-constrained forest for regularized classification of omics data. In: Methods, Network-based Approaches to the Analysis of Omics Data, 83, pp. 88–97. https://doi.org/10.1016/j.ymeth.2015.04.006.


Antonelli, J., Claggett, B.L., Henglin, M., Kim, A., Ovsak, G., Kim, N., Deng, K., Rao, K., Tyagi, O., Watrous, J.D., Lagerborg, K.A., Hushcha, P.V., Demler, O.V., Mora, S., Niiranen, T.J., Pereira, A.C., Jain, M., Cheng, S., 2019. Statistical workflow for feature selection in human metabolomics data. Metabolites 9. https://doi.org/10.3390/metabo9070143.
Archer, D.B., Bricker, J.T., Chu, W.T., Burciu, R.G., McCracken, J.L., Lai, S., Coombes, S.A., Fang, R., Barmpoutis, A., Corcos, D.M., Kurani, A.S., Mitchell, T., Black, M.L., Herschel, E., Simuni, T., Parrish, T.B., Comella, C., Xie, T., Seppi, K., Bohnen, N.I., Müller, M.L., Albin, R.L., Krismer, F., Du, G., Lewis, M.M., Huang, X., Li, H., Pasternak, O., McFarland, N.R., Okun, M.S., Vaillancourt, D.E., 2019. Development and validation of the automated imaging differentiation in parkinsonism (AID-P): a multicentre machine learning study. Lancet Digit. Health 1, e222–e231. https://doi.org/10.1016/S2589-7500(19)30105-0.
Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J.C., Buettner, F., Huber, W., Stegle, O., 2018. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14 https://doi.org/10.15252/msb.20178124.
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M., 2010. A view of cloud computing. Commun. ACM 53, 50–58. https://doi.org/10.1145/1721654.1721672.
Aslam, B., Basit, M., Nisar, M.A., Khurshid, M., Rasool, M.H., 2017. Proteomics: technologies and their applications. J. Chromatogr. Sci. 55, 182–196. https://doi.org/10.1093/chromsci/bmw167.
Auslander, N., Yizhak, K., Weinstock, A., Budhu, A., Tang, W., Wang, X.W., Ambs, S., Ruppin, E., 2016. A joint analysis of transcriptomic and metabolomic data uncovers enhanced enzyme-metabolite coupling in breast cancer. Sci. Rep. 6 https://doi.org/10.1038/srep29662.
Awad, M., Khanna, R., 2015. Support vector regression. In: Awad, M., Khanna, R. (Eds.), Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers. Apress, Berkeley, CA, pp. 67–80. https://doi.org/10.1007/978-1-4302-5990-9_4.
Azur, M.J., Stuart, E.A., Frangakis, C., Leaf, P.J., 2011. Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20, 40–49. https://doi.org/10.1002/mpr.329.
Badillo, S., Banfai, B., Birzele, F., Davydov, I.I., Hutchinson, L., Kam-Thong, T., Siebourg-Polster, J., Steiert, B., Zhang, J.D., 2020. An introduction to machine learning. Clin. Pharmacol. Ther. 107, 871–885. https://doi.org/10.1002/cpt.1796.
Badue, C., Guidolini, R., Carneiro, R.V., Azevedo, P., Cardoso, V.B., Forechi, A., Jesus, L., Berriel, R., Paixão, T.M., Mutz, F., de Paula Veronese, L., Oliveira-Santos, T., De Souza, A.F., 2021. Self-driving cars: a survey. Expert Syst. Appl. 165, 113816. https://doi.org/10.1016/j.eswa.2020.113816.
Barh, D., Blum, K., Madigan, M.A. (Eds.), 2011. OMICS: Biomedical Perspectives and Applications, 1 edition. CRC Press, Boca Raton.
Barh, D., Tiwari, S., Weener, M.E., Azevedo, V., Góes-Neto, A., Gromiha, M.M., Ghosh, P., 2020. Multi-omics-based identification of SARS-CoV-2 infection biology and candidate drugs against COVID-19. Comput. Biol. Med. 126, 104051. https://doi.org/10.1016/j.compbiomed.2020.104051.
Barnes, J.W., Tonelli, A.R., Heresi, G.A., Newman, J.E., Mellor, N.E., Grove, D.E., Dweik, R.A., 2016. Novel methods in pulmonary hypertension phenotyping in the age of precision medicine (2015 Grover Conference series). Pulm. Circ. 6, 439–447. https://doi.org/10.1086/688847.
Barnett-Itzhaki, Z., Elbaz, M., Butterman, R., Amar, D., Amitay, M., Racowsky, C., Orvieto, R., Hauser, R., Baccarelli, A.A., Machtinger, R., 2020. Machine learning vs. classic statistics for the prediction of IVF outcomes. J. Assist. Reprod. Genet. 37, 2405–2412. https://doi.org/10.1007/s10815-020-01908-1.
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., Herrera, F., 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115. https://doi.org/10.1016/j.inffus.2019.12.012.
Bavafaye Haghighi, E., Knudsen, M., Elmedal Laursen, B., Besenbacher, S., 2019. Hierarchical classification of cancers of unknown primary using multi-omics data. Cancer Informat. 18 https://doi.org/10.1177/1176935119872163.
BCS, T.C.I. for I, 2014. Big Data: Opportunities and challenges. BCS, The Chartered Institute for IT.
Bellazzi, R., 2014. Big data and biomedical informatics: a challenging opportunity. Yearb. Med. Inform. 9, 8–13. https://doi.org/10.15265/IY-2014-0024.
Benjamens, S., Dhunnoo, P., Meskó, B., 2020. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. Npj Digit. Med. 3, 1–8. https://doi.org/10.1038/s41746-020-00324-0.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W., 2011. GenBank. Nucleic Acids Res. 39, D32–D37. https://doi.org/10.1093/nar/gkq1079.
Beretta, L., Santaniello, A., 2011. Implementing ReliefF filters to extract meaningful features from genetic lifetime datasets. J. Biomed. Inform. 44, 361–369. https://doi.org/10.1016/j.jbi.2010.12.003.
Bersanelli, M., Mosca, E., Remondini, D., Giampieri, E., Sala, C., Castellani, G., Milanesi, L., 2016. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformat. 17, 167–177. https://doi.org/10.1186/s12859-015-0857-9.
Bewicke-Copley, F., Arjun Kumar, E., Palladino, G., Korfi, K., Wang, J., 2019. Applications and analysis of targeted genomic sequencing in cancer studies. Comput. Struct. Biotechnol. J. 17, 1348–1359. https://doi.org/10.1016/j.csbj.2019.10.004.
Bhardwaj, A., Van Steen, K., 2020. Multi-omics data and analytics integration in ovarian cancer. Artif. Intell. Appl. Innov. 584, 347–357. https://doi.org/10.1007/978-3-030-49186-4_29.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Inc., New York, NY, USA.
Bishop, C.M., 2006. Pattern recognition and machine learning, Information science and statistics. Springer, New York.
Bjerrum, J.T., Rantalainen, M., Wang, Y., Olsen, J., Nielsen, O.H., 2014. Integration of transcriptomics and metabonomics: improving diagnostics, biomarker identification and phenotyping in ulcerative colitis. Metabolomics Off. J. Metabolomic Soc. 10, 280–290. https://doi.org/10.1007/s11306-013-0580-3.
Black box medicine and transparency (Executive Summary), 2020. PHG Foundation (University of Cambridge). London, UK.
Boellner, S., Becker, K.-F., 2015. Reverse phase protein arrays—quantitative assessment of multiple biomarkers in biopsies for clinical use. Microarrays 4, 98–114. https://doi.org/10.3390/microarrays4020098.
Bonnet, E., Calzone, L., Michoel, T., 2015. Integrative multi-omics module network inference with lemon-tree. PLoS Comput. Biol. 11, e1003983 https://doi.org/10.1371/journal.pcbi.1003983.
Borad, M.J., LoRusso, P.M., 2017. Twenty-first century precision medicine in oncology: genomic profiling in patients with cancer. Mayo Clin. Proc. 92, 1583–1591. https://doi.org/10.1016/j.mayocp.2017.08.002.
Bowd, C., Medeiros, F.A., Zhang, Z., Zangwill, L.M., Hao, J., Lee, T.-W., Sejnowski, T.J., Weinreb, R.N., Goldbaum, M.H., 2005. Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements. Invest. Ophthalmol. Vis. Sci. 46, 1322–1329. https://doi.org/10.1167/iovs.04-1122.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324.
Buescher, J.M., Driggers, E.M., 2016. Integration of omics: more than the sum of its parts. Cancer Metab. 4, 4. https://doi.org/10.1186/s40170-016-0143-y.
Bumgarner, R., 2013. DNA microarrays: types, applications and their future. Curr. Protoc. Mol. Biol. https://doi.org/10.1002/0471142727.mb2201s101. Ed. Frederick M Ausubel Al 0 22, Unit-22.1.
Burley, S.K., Berman, H.M., Bhikadiya, C., Bi, C., Chen, L., Costanzo, L.D., Christie, C., Duarte, J.M., Dutta, S., Feng, Z., Ghosh, S., Goodsell, D.S., Green, R.K., Guranovic, V., Guzenko, D., Hudson, B.P., Liang, Y., Lowe, R., Peisach, E., Periskova, I., Randle, C., Rose, A., Sekharan, M., Shao, C., Tao, Y.-P., Valasatava, Y., Voigt, M., Westbrook, J., Young, J., Zardecki, C., Zhuravleva, M., Kurisu, G., Nakamura, H., Kengaku, Y., Cho, H., Sato, J., Kim, J.Y., Ikegawa, Y., Nakagawa, A., Yamashita, R., Kudou, T., Bekker, G.-J., Suzuki, H., Iwata, T., Yokochi, M., Kobayashi, N., Fujiwara, T., Velankar, S., Kleywegt, G.J., Anyango, S., Armstrong, D.R., Berrisford, J.M., Conroy, M.J., Dana, J.M., Deshpande, M., Gane, P., Gáborová, R., Gupta, D., Gutmanas, A., Koča, J., Mak, L., Mir, S., Mukhopadhyay, A., Nadzirin, N., Nair, S., Patwardhan, A., Paysan-Lafosse, T., Pravda, L., Salih, O., Sehnal, D., Varadi, M., Vařeková, R., Markley, J.L., Hoch, J.C., Romero, P.R., Baskaran, K., Maziuk, D., Ulrich, E.L., Wedell, J.R., Yao, H., Livny, M., Ioannidis, Y.E., 2019. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528. https://doi.org/10.1093/nar/gky949.
Bush, W.S., Dudek, S.M., Ritchie, M.D., 2009. Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac. Symp. Biocomput. Pac. Symp. Biocomput. 368–379.
Bzdok, D., 2017. Classical statistics and statistical learning in imaging neuroscience. Front. Neurosci. 11 https://doi.org/10.3389/fnins.2017.00543.
Caffrey, M., Hogan, J., 1992. LIPIDAT: A database of lipid phase transition temperatures and enthalpy changes. DMPC data subset analysis. Chem. Phys. Lipids 61, 1–109. https://doi.org/10.1016/0009-3084(92)90002-7.
Campbell, M.P., Nguyen-Khuong, T., Hayes, C.A., Flowers, S.A., Alagesan, K., Kolarich, D., Packer, N.H., Karlsson, N.G., 2014. Validation of the curation pipeline of UniCarb-DB: Building a global glycan reference MS/MS repository. Biochim. Biophys. Acta BBA - Proteins Proteomics, Computational Proteomics in the Post-Identification Era 1844, 108–116. https://doi.org/10.1016/j.bbapap.2013.04.018.
Canuel, V., Rance, B., Avillach, P., Degoulet, P., Burgun, A., 2015. Translational research platforms integrating clinical and omics data: a review of publicly available solutions. Brief. Bioinform. 16, 280–290. https://doi.org/10.1093/bib/bbu006.
Canzler, S., Schor, J., Busch, W., Schubert, K., Rolle-Kampczyk, U.E., Seitz, H., Kamp, H., von Bergen, M., Buesen, R., Hackermüller, J., 2020. Prospects and challenges of multi-omics data integration in toxicology. Arch. Toxicol. 94, 371–388. https://doi.org/10.1007/s00204-020-02656-y.
Cao, K., Bai, X., Hong, Y., Wan, L., 2020. Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 36, i48–i56. https://doi.org/10.1093/bioinformatics/btaa443.
Capobianco, E., 2017. Systems and precision medicine approaches to diabetes heterogeneity: a Big Data perspective. Clin. Transl. Med. 6, 23. https://doi.org/10.1186/s40169-017-0155-4.
Carlomagno, N., Incollingo, P., Tammaro, V., Peluso, G., Rupealta, N., Chiacchio, G., Sandoval Sotelo, M.L., Minieri, G., Pisani, A., Riccio, E., Sabbatini, M., Bracale, U.M., Calogero, A., Dodaro, C.A., Santangelo, M., 2017. Diagnostic, predictive, prognostic, and therapeutic molecular biomarkers in third millennium: a breakthrough in gastric cancer. Biomed. Res. Int. 2017. https://doi.org/10.1155/2017/7869802.
Chaudhary, K., Poirion, O.B., Lu, L., Garmire, L.X., 2017. Deep Learning based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. doi. https://doi.org/10.1158/1078-0432.CCR-17-0853.
Chawla, N.V., Davis, D.A., 2013. Bringing big data to personalized healthcare: a patient-centered framework. J. Gen. Intern. Med. 28 (Suppl. 3), S660–S665. https://doi.org/10.1007/s11606-013-2455-8.


Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic application in cardiac tissue aging dataset. Conf. Proc. Annu. Int. Conf. IEEE Eng.
minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. https://doi.org/ Med. Biol. Soc. IEEE Eng. Med. Biol. Soc. Annu. Conf. 2014, 318–321. https://doi.
10.1613/jair.953. org/10.1109/EMBC.2014.6943593.
Chen, Y., Wang, X., Wang, G., Li, Z., Wang, J., Huang, L., Qin, Z., Yuan, X., Cheng, Z., Ding, M.Q., Chen, L., Cooper, G.F., Young, J.D., Lu, X., 2018. Precision oncology beyond
Zhang, S., Yin, Y., He, J., 2017. Integrating multiple omics data for the discovery of targeted therapy: combining omics data with machine learning matches the majority
potential Beclin-1 interactions in breast cancer. Mol. BioSyst. 13, 991–999. https:// of cancer cells to effective therapeutics. Mol. Cancer Res. 16, 269–278. https://doi.
doi.org/10.1039/c6mb00653a. org/10.1158/1541-7786.MCR-17-0378.
Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019. Self- Domingos, P., Pazzani, M., 1997. On the optimality of the simple bayesian classifier
supervised learning for medical image analysis using image context restoration. under zero-one loss. Mach. Learn. 29, 103–130. https://doi.org/10.1023/A:
Med. Image Anal. 58, 101539. https://doi.org/10.1016/j.media.2019.101539. 1007413511361.
Cheng, P.F., Dummer, R., Levesque, M.P., 2015. Data mining the cancer genome atlas in Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J., 2018. A comparative evaluation
the era of precision cancer medicine. Swiss Med. Wkly. 145, w14183. https://doi. of outlier detection algorithms: Experiments and analyses. Pattern Recogn. 74,
org/10.4414/smw.2015.14183. 406–421. https://doi.org/10.1016/j.patcog.2017.09.037.
Chung, R.-H., Kang, C.-Y., 2019. A multi-omics data simulator for complex disease Dominiczak, A., Delles, C., Padmanabhan, S., 2017. Genomics and precision medicine for
studies and its application to evaluate multi-omics data analysis methods for disease clinicians and scientists in hypertension. Hypertens. Dallas Tex 69, e10–e13. https://
classification. GigaScience 8. https://doi.org/10.1093/gigascience/giz045. doi.org/10.1161/HYPERTENSIONAHA.116.08252.
Clarivate Analytics, 2020. Web of science [v.5.35] - web of science core collection basic Drăghici, S., Potter, R.B., 2003. Predicting HIV drug resistance with neural networks.
search [WWW Document]. Web Sci. URL https://apps.webofknowledge.com/WOS Bioinforma. Oxf. Engl. 19, 98–107.
_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch accessed Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern classification, 2nd ed. Wiley, New York.
10.23.20. Ebbels, T.M.D., Cavill, R., 2009. Bioinformatic methods in NMR-based metabolic
Cleary, J.G., Trigg, L.E., 1995. K*: An Instance-based Learner Using and Entropic profiling. Prog. Nucl. Magn. Reson. Spectrosc. 55, 361–374. https://doi.org/
Distance Measure, in: Proceedings of the Twelfth International Conference on 10.1016/j.pnmrs.2009.07.003.
International Conference on Machine Learning, ICML’95. Morgan Kaufmann Eicher, T., Kinnebrew, G., Patt, A., Spencer, K., Ying, K., Ma, Q., Machiraju, R., Mathé, E.
Publishers Inc., San Francisco, CA, USA, pp. 108–114. A., 2020. Metabolomics and multi-omics integration: a survey of computational
Clifton, D.A., Niehaus, K.E., Charlton, P., Colopy, G.W., 2015. Health informatics via methods and resources. Metabolites 10. https://doi.org/10.3390/metabo10050202.
machine learning for the clinical management of patients. Yearb. Med. Inform. 10, Elith, J., Leathwick, J.R., Hastie, T., 2008. A working guide to boosted regression trees.
38–43. https://doi.org/10.15265/IY-2015-014. J. Anim. Ecol. 77, 802–813. https://doi.org/10.1111/j.1365-2656.2008.01390.x.
Cloud Computing Services, 2021a. Microsoft Azure [WWW Document]. URL. https Erickson, B.J., Korfiatis, P., Akkus, Z., Kline, T.L., 2017. Machine learning for medical
://azure.microsoft.com/en-gb/ (accessed 10.29.20). imaging. Radiogr. Rev. Publ. Radiol. Soc. N. Am. Inc 37, 505–515. https://doi.org/
Cloud Computing Services, 2021b. Google Cloud. URL. https://cloud.google.com/ 10.1148/rg.2017160130.
(accessed 10.29.20 [WWW Document]). Fan, Y., Murphy, T.B., Byrne, J.C., Brennan, L., Fitzpatrick, J.M., Watson, R.W.G., 2011.
Cobb, M., 2017. 60 years ago, Francis Crick changed the logic of biology. PLoS Biol. 15, Applying random forests to identify biomarker panels in serum 2D-DIGE data for the
e2003243 https://doi.org/10.1371/journal.pbio.2003243. detection and staging of prostate cancer. J. Proteome Res. 10, 1361–1373. https://
Conesa, A., Beck, S., 2019. Making multi-omics data accessible to researchers. Sci. Data doi.org/10.1021/pr1011069.
6, 251. https://doi.org/10.1038/s41597-019-0258-4. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F., 2015.
Coronato, A., Naeem, M., Pietro, G.D., Paragliola, G., 2020. Reinforcement learning for Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D.,
intelligent healthcare applications: a survey. Artif. Intell. Med. 101964 https://doi. Lee, D.D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information
org/10.1016/j.artmed.2020.101964. Processing Systems 28. Curran Associates, Inc, pp. 2962–2970.
Costantino, S., Libby, P., Kishore, R., Tardif, J.-C., El-Osta, A., Paneni, F., 2017. Fiehn, O., 2016. Metabolomics by gas chromatography-mass spectrometry: the
Epigenetics and precision medicine in cardiovascular patients: from basic concepts combination of targeted and untargeted profiling. Curr. Protoc. Mol. Biol. Ed.
to the clinical arena. Eur. Heart J. https://doi.org/10.1093/eurheartj/ehx568. Frederick M Ausubel Al 114, 30.4.1–30.4.32. https://doi.org/10.1002/0471142727.
Cox, B., 2009. Building bridges from “omics” to cell biology. Genome Biol. 10, 305. mb3004s114.
https://doi.org/10.1186/gb-2009-10-3-305. Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L.,
Crookston, N.L., Finley, A.O., 2008. yaImpute: An R Package for kNN Imputation. J. Stat. Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L.L., Eddy, S.R.,
Softw. 23, 1–16. https://doi.org/10.18637/jss.v023.i10. Bateman, A., 2010. The Pfam protein families database. Nucleic Acids Res. 38,
Culp, M., Michailidis, G., 2008. Graph-based semisupervised learning. IEEE Trans. D211–D222. https://doi.org/10.1093/nar/gkp985.
Pattern Anal. Mach. Intell. 30, 174–179. https://doi.org/10.1109/ Foster, J.M., Moreno, P., Fabregat, A., Hermjakob, H., Steinbeck, C., Apweiler, R.,
TPAMI.2007.70765. Wakelam, M.J.O., Vizcaíno, J.A., 2013. LipidHome: a database of theoretical lipids
Curtis, C., Shah, S.P., Chin, S.-F., Turashvili, G., Rueda, O.M., Dunning, M.J., Speed, D., optimized for high throughput mass spectrometry lipidomics. PLoS One 8. https://
Lynch, A.G., Samarajiwa, S., Yuan, Y., Gräf, S., Ha, G., Haffari, G., Bashashati, A., doi.org/10.1371/journal.pone.0061951.
Russell, R., McKinney, S., Langerød, A., Green, A., Provenzano, E., Wishart, G., Foster, K.R., Koprowski, R., Skufca, J.D., 2014. Machine learning, medical diagnosis, and
Pinder, S., Watson, P., Markowetz, F., Murphy, L., Ellis, I., Purushotham, A., biomedical engineering research - commentary. Biomed. Eng. Online 13, 94. https://
Børresen-Dale, A.-L., Brenton, J.D., Tavaré, S., Caldas, C., Aparicio, S., 2012. The doi.org/10.1186/1475-925X-13-94.
genomic and transcriptomic architecture of 2,000 breast tumours reveals novel Francescatto, M., Chierici, M., Rezvan Dezfooli, S., Zandonà, A., Jurman, G.,
subgroups. Nature 486, 346–352. https://doi.org/10.1038/nature10983. Furlanello, C., 2018. Multi-omics integration for neuroblastoma clinical endpoint
D’Onofrio, D.J., An, G., 2010. A comparative approach for the investigation of biological prediction. Biol. Direct 13, 5. https://doi.org/10.1186/s13062-018-0207-8.
information processing: An examination of the structure and function of computer Fridley, B.L., Lund, S., Jenkins, G.D., Wang, L., 2012. A Bayesian integrative genomic
hard drives and DNA. Theor. Biol. Med. Model. 7, 3. https://doi.org/10.1186/1742- model for pathway analysis of complex traits. Genet. Epidemiol. 36, 352–359.
4682-7-3. https://doi.org/10.1002/gepi.21628.
Dada, E.G., Bassi, J.S., Chiroma, H., Abdulhamid, S.M., Adetunmbi, A.O., Ajibuwa, O.E., Gammerman, A., 2010. Modern Machine Learning Techniques and Their Applications to
2019. Machine learning for email spam filtering: review, approaches and open Medical Diagnostics. In: Artificial Intelligence Applications and Innovations, IFIP
research problems. Heliyon 5. https://doi.org/10.1016/j.heliyon.2019.e01802. Advances in Information and Communication Technology, Presented at the IFIP
Dankers, F.J.W.M., Traverso, A., Wee, L., van Kuijk, S.M.J., 2019. Prediction modeling International Conference on Artificial Intelligence Applications and Innovations.
methodology. In: Kubben, P., Dumontier, M., Dekker, A. (Eds.), Fundamentals of Springer, Berlin, Heidelberg, p. 2. https://doi.org/10.1007/978-3-642-16239-8_2.
Clinical Data Science. Springer, Cham (CH). Garali, I., Adanyeguh, I.M., Ichou, F., Perlbarg, V., Seyer, A., Colsch, B., Moszer, I.,
Davenport, T., Kalakota, R., 2019. The potential for artificial intelligence in healthcare. Guillemot, V., Durr, A., Mochel, F., Tenenhaus, A., 2018. A strategy for multimodal
Future Healthc. J. 6, 94–98. https://doi.org/10.7861/futurehosp.6-2-94. data integration: application to biomarkers identification in spinocerebellar ataxia.
de Andrade, B.M., de Gois, J.S., Xavier, V.L., Luna, A.S., 2020. Comparison of the Brief. Bioinform. 19, 1356–1369. https://doi.org/10.1093/bib/bbx060.
performance of multiclass classifiers in chemical data: Addressing the problem of Giang, T.-T., Nguyen, T.-P., Tran, D.-H., 2020. Stratifying patients using fast multiple
overfitting with the permutation test. Chemom. Intell. Lab. Syst. 201, 104013. kernel learning framework: case studies of Alzheimer’s disease and cancers. BMC
https://doi.org/10.1016/j.chemolab.2020.104013. Med. Inform. Decis. Mak. 20, 108. https://doi.org/10.1186/s12911-020-01140-y.
Debnath, M., Prasad, G.B.K.S., Bisen, P.S., Prasad, G.B.K.S., 2010. Molecular Diagnostics: Gibson, G., 2015. A Primer of Human Genetics, 1st ed. 2015 edition. Sinauer,
Promises and Possibilities. Springer, Dordrecht. Sunderland, Massachusetts, U.S.A.
Delavan, B., Roberts, R., Huang, R., Bao, W., Tong, W., Liu, Z., 2017. Computational drug Gibson, G., Marigorta, U.M., Ojagbeghru, E.R., Park, S., 2015. PART of the WHOLE: A
repositioning for rare diseases in the era of precision medicine. Drug Discov. Today. case study in wellness-oriented personalized medicine. Yale J. Biol. Med. 88,
https://doi.org/10.1016/j.drudis.2017.10.009. 397–406.
Deng, L., Cai, Y., Zhang, W., Yang, W., Gao, B., Liu, H., 2020. Pathway-guided deep Glaves, J.P., Li, M.X., Mercier, P., Fahlman, R.P., Sykes, B.D., 2014. High-throughput,
neural network toward interpretable and predictive modeling of drug sensitivity. multi-platform metabolomics on very small volumes: 1H NMR metabolite
J. Chem. Inf. Model. 60, 4497–4505. https://doi.org/10.1021/acs.jcim.0c00331. identification in an unadulterated tube-in-tube system. Metabolomics 10,
Dias-Audibert, F.L., Navarro, L.C., de Oliveira, D.N., Delafiori, J., Melo, C.F.O.R., 1145–1151. https://doi.org/10.1007/s11306-014-0678-2.
Guerreiro, T.M., Rosa, F.T., Petenuci, D.L., Watanabe, M.A.E., Velloso, L.A., Gligorijević, V., Pržulj, N., 2015. Methods for biological data integration: perspectives
Rocha, A.R., Catharino, R.R., 2020. Combining machine learning and metabolomics and challenges. J. R. Soc. Interface 12, 20150571. https://doi.org/10.1098/
to identify weight gain biomarkers. Front. Bioeng. Biotechnol. 8 https://doi.org/ rsif.2015.0571.
10.3389/fbioe.2020.00006. Gligorijević, V., Malod-Dognin, N., Pržulj, N., 2016. Integrative methods for analyzing
Dimitrakopoulos, G.N., Dimitrakopoulou, K., Maraziotis, I.A., Sgarbas, K., Bezerianos, A., big data in precision medicine. PROTEOMICS 16, 741–758. https://doi.org/
2014. Supervised method for construction of microRNA-mRNA networks: 10.1002/pmic.201500396.


Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., Yang, G.-Z., 2019. XAI—Explainable artificial intelligence. Sci. Robot. 4 https://doi.org/10.1126/scirobotics.aay7120.
Guo, L.-Y., Wu, A.-H., Wang, Y., Zhang, L., Chai, H., Liang, X.-F., 2020. Deep learning-based ovarian cancer subtypes identification using multi-omics data. BioData Min. 13, 10. https://doi.org/10.1186/s13040-020-00222-x.
Gupta, Y., Lama, R.K., Kwon, G.-R., Initiative, A.D.N., Weiner, M.W., Aisen, P., Weiner, M., Aisen, P., Petersen, R., Jack, C.R.J., Jagust, W., Trojanowki, J.Q., Toga, A.W., Beckett, L., Green, R.C., Saykin, A.J., Morris, J., Shaw, L.M., Khachaturian, Z., Sorensen, G., Carrillo, M., Kuller, L., Raichle, M., Paul, S., Davies, P., Fillit, H., Hefti, F., Holtzman, D., Mesulam, M.M., Potter, W., Snyder, P., Schwartz, A., Green, R.C., Montine, T., Petersen, R., Aisen, P., Thomas, R.G., Donohue, Michael, Walter, S., Gessert, D., Sather, T., Jiminez, G., Balasubramanian, A.B., Mason, J., Sim, I., Beckett, L., Harvey, D., Donohue, Michael, Jack, C.R.J., Bernstein, M., Fox, N., Thompson, P., Schuff, N., DeCArli, C., Borowski, B., Gunter, J., Senjem, M., Vemuri, P., Jones, D., Kantarci, K., Ward, C., Jagust, W., Koeppe, R.A., Foster, N., Reiman, E.M., Chen, K., Mathis, C., Landau, S., Morris, J.C., Cairns, N.J., Franklin, E., Taylor-Reinwald, L., Shaw, L.M., Trojanowki, J.Q., Lee, V., Korecka, M., Figurski, M., Toga, A.W., Crawford, K., Neu, S., Saykin, A.J., Foroud, T.M., Potkin, S., Shen, L., Faber, K., Kim, S., Nho, K., Weiner, M.W., Thal, Lean, Khachaturian, Z., Thal, Leon, Buckholtz, N., Weiner, M.W., Snyder, P.J., Potter, W., Paul, S., Albert, M., Frank, R., Khachaturian, Z., Hsiao, J., Kaye, J., Quinn, J., Silbert, L., Lind, B., Carter, R., Dolen, S., Schneider, L.S., Pawluczyk, S., Becerra, M., Teodoro, L., Spann, B.M., Brewer, J., Vanderswag, H., Fleisher, A., Heidebrink, J.L., Lord, J.L., Petersen, R., Mason, S.S., Albers, C.S., Knopman, D., Johnson, Kris, Doody, R.S., Villanueva-Meyer, J., Pavlik, V., Shibley, V., Chowdhury, M., Rountree, S., Dang, M., Stern, Y., Honig, L.S., Bell, K.L., Ances, B., Morris, J.C., Carroll, M., Creech, M.L., Franklin, E., Mintun, M.A., Schneider, S., Oliver, A., Marson, D., Geldmacher, D., Natelson Love, M., Griffith, R., Clark, D., Brockington, J., Roberson, E., Grossman, H., Mitsis, E., Shah, R.C., deToledo-Morrell, L., Duara, R., Greig-Custo, M.T., Barker, W., Albert, M., Onyike, C., D’Agostino, D.I., Kielb, S., Sadowski, M., Sheikh, M.O., Anaztasia, U., Mrunalini, G., Doraiswamy, P.M., Petrella, J.R., Borges-Neto, S., Wong, T.Z., Coleman, E., Arnold, S.E., Karlawish, J.H., Wolk, D.A., Clark, C.M., Smith, C.D., Jicha, G., Hardy, P., Sinha, P., Oates, E., Conrad, G., Lopez, O.L., Oakley, M., Simpson, D.M., Porsteinsson, A.P., Goldstein, B.S., Martin, K., Makino, K.M., Ismail, M.S., Brand, C., Potkin, S.G., Preda, A., Nguyen, D., Womack, K., Mathews, D., Quiceno, M., Levey, A.I., Lah, J.J., Cellar, J.S., Burns, J.M., Swerdlow, R.H., Brooks, W.M., Apostolova, L., Tingus, K., Woo, E., Silverman, D.H.S., Lu, P.H., Bartzokis, G., Graff-Radford, N.R., Parfitt, F., Poki-Walker, K., Farlow, M.R., Hake, A.M., Matthews, B.R., Brosch, J.R., Herring, S., van Dyck, C.H., Carson, R.E., MacAvoy, M.G., Varma, P., Chertkow, H., Bergman, H., Hosein, C., Black, S., Stefanovic, B., Caldwell, C., Hsiung, G.-Y.R., Mudge, B., Sossi, V., Feldman, H., Assaly, M., Finger, E., Pasternack, S., Rachisky, I., Trost, D., Kertesz, A., Bernick, C., Munic, D., Mesulam, M.-M., Rogalski, E., Lipowski, K., Weintraub, S., Bonakdarpour, B., Kerwin, D., Wu, C.-K., Johnson, N., Sadowsky, C., Villena, T., Turner, R.S., Johnson, Kathleen, Reynolds, B., Sperling, R.A., Johnson, K.A., Marshall, G., Yesavage, J., Taylor, J.L., Lane, B., Rosen, A., Tinklenberg, J., Sabbagh, M.N., Belden, C.M., Jacobson, S.A., Sirrel, S.A., Kowall, N., Killiany, R., Budson, A.E., Norbash, A., Johnson, P.L., Obisesan, T.O., Wolday, S., Allard, J., Lerner, A., Ogrocki, P., Tatsuoka, C., Fatica, P., Fletcher, E., Maillard, P., Olichney, J., DeCarli, C., Carmichael, O., Kittur, S., Borrie, M., Lee, T.-Y., Bartha, R., Johnson, S., Asthana, S., Carlsson, C.M., Potkin, S.G., Preda, A., Nguyen, D., Tariot, P., Burke, A., Milliken, A.M., Trncic, N., Fleisher, A., Reeder, S., Bates, V., Capote, H., Rainka, M., Scharre, D.W., Kataki, M., Kelley, B., Zimmerman, E.A., Celmins, D., Brown, A.D., Pearlson, G.D., Blank, K., Anderson, K., Flashman, L.A., Seltzer, M., Hynes, M.L., Santulli, R.B., Sink, K.M., Leslie, G., Williamson, J.D., Garg, P., Watkins, F., Ott, B.R., Tremont, G., Daiello, L.A., Salloway, S., Malloy, P., Correia, S., Rosen, H.J., Miller, B.L., Perry, D., Mintzer, J., Spicer, K., Bachman, D., Finger, E., Pasternak, S., Rachinsky, I., Rogers, J., Kertesz, A., Drost, D., Pomara, N., Hernando, R., Sarrael, A., Schultz, S.K., Smith, K.E., Koleva, H., Nam, K.W., Shim, H., Relkin, N., Chiang, G., Lin, M., Ravdin, L., Smith, A., Raj, B.A., Fargher, K., Weiner, M.W., Aisen, P., Weiner, M., Aisen, P., Petersen, R., Green, R.C., Harvey, D., Jack, C.R.J., Jagust, W., Morris, J.C., Saykin, A.J., Shaw, L.M., Toga, A.W., Trojanowki, J.Q., Neylan, T., Grafman, J., Green, R.C., Montine, T., Weiner, M., Petersen, R., Aisen, P., Thomas, R.G., Donohue, Michael, Devon, G., Sather, T., Melissa, D., Morrison, R., Jiminez, G., Neylan, T., Jacqueline, H., Shannon, F., Harvey, D., Donohue, Michael, Jack, C.R.J., Bernstein, M., Borowski, B., Gunter, J., Senjem, M., Kejal, K., Chad, W., Jagust, W., Koeppe, R.A., Foster, N., Reiman, E.M., Chen, K., Landau, S., Morris, J.C., Cairns, N.J., Householder, E., Shaw, L.M., Trojanowki, J.Q., Lee, V., Korecka, M., Figurski, M., Toga, A.W., Karen, C., Scott, N., Saykin, A.J., Foroud, T.M., Potkin, S., Shen, L., Faber, K., Kim, S., Nho, K., Weiner, M.W., Karl, F., Schneider, L.S., Pawluczyk, S., Mauricio, B., Brewer, J., Vanderswag, H., Stern, Y., Honig, L.S., Bell, K.L., Fleischman, D., Arfanakis, K., Shah, R.C., Duara, R., Varon, D., Greig, M.T., Doraiswamy, P.M., Petrella, J.R., James, O., Porsteinsson, A.P., Goldstein, B., Martin, K.S., Potkin, S.G., Preda, A., Nguyen, D., Mintzer, J., Massoglia, D., Brawman-Mintzer, O., Sadowsky, C., Martinez, W., Villena, T., Jagust, W., Landau, S., Rosen, H., Perry, D., Turner, R.S., Behan, K., Reynolds, B., Sperling, R.A., Johnson, K.A., Marshall, G., Sabbagh, M.N., Jacobson, S.A., Sirrel, S.A., Obisesan, T.O., Wolday, S., Allard, J., Johnson, S.C., Fruehling, J.J., Harding, S., Peskind, E.R., Petrie, E.C., Li, G., Yesavage, J.A., Taylor, J.L., Furst, A.J., Chao, S., Relkin, N., Chiang, G., Ravdin, L., Ravdin, L., Mackin, S., Aisen, P., Raman, R., Mackin, S., Weiner, M., Aisen, P., Raman, R., Jack, C.R.J., Landau, S., Saykin, A.J., Toga, A.W., DeCarli, C., Koeppe, R.A., Green, R.C., Drake, E., Weiner, M., Aisen, P., Raman, R., Donohue, Mike, Jimenez, G., Gessert, D., Harless, K., Salazar, J., Cabrera, Y., Walter, S., Hergesheimer, L., Shaffer, E., Mackin, S., Nelson, C., Bickford, D., Butters, M., Zmuda, M., Jack, C.R.J., Bernstein, M., Borowski, B., Gunter, J., Senjem, M., Kantarci, K., Ward, C., Reyes, D., Koeppe, R.A., Landau, S., Toga, A.W., Crawford, K., Neu, S., Saykin, A.J., Foroud, T.M., Faber, K.M., Nho, K., Nudelman, K.N., Mackin, S., Rosen, H., Nelson, C., Bickford, D., Au, Y.H., Scherer, K., Catalinotto, D., Stark, S., Ong, E., Fernandez, D., Butters, M., Zmuda, M., Lopez, O.L., Oakley, M., Simpson, D.M., 2019. Prediction and classification of alzheimer’s disease based on combined features from apolipoprotein-e genotype, cerebrospinal fluid, MR, and FDG-PET imaging biomarkers. Front. Comput. Neurosci. 13 https://doi.org/10.3389/fncom.2019.00072.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422. https://doi.org/10.1023/A:1012487302797.
Haas, R., Zelezniak, A., Iacovacci, J., Kamrad, S., Townsend, S., Ralser, M., 2017. Designing and interpreting ‘multi-omic’ experiments that may change our understanding of biology. Curr. Opin. Syst. Biol. 6, 37–45. https://doi.org/10.1016/j.coisb.2017.08.009.
Hamamoto, R., Suvarna, K., Yamada, M., Kobayashi, K., Shinkai, N., Miyake, M., Takahashi, M., Jinnai, S., Shimoyama, R., Sakai, A., Takasawa, K., Bolatkan, A., Shozu, K., Dozen, A., Machino, H., Takahashi, S., Asada, K., Komatsu, M., Sese, J., Kaneko, S., 2020. Application of artificial intelligence technology in oncology: towards the establishment of precision medicine. Cancers 12, 3532. https://doi.org/10.3390/cancers12123532.
Hampel, H., O’Bryant, S.E., Castrillo, J.I., Ritchie, C., Rojkova, K., Broich, K., Benda, N., Nisticò, R., Frank, R.A., Dubois, B., Escott-Price, V., Lista, S., 2016. Precision medicine - the golden gate for detection, treatment and prevention of Alzheimer’s disease. J. Prev. Alzheimers Dis. 3, 243–259. https://doi.org/10.14283/jpad.2016.112.
Hampel, H., O’Bryant, S.E., Durrleman, S., Younesi, E., Rojkova, K., Escott-Price, V., Corvol, J.-C., Broich, K., Dubois, B., Lista, S., Alzheimer Precision Medicine Initiative, 2017. A Precision Medicine Initiative for Alzheimer’s disease: the road ahead to biomarker-guided integrative disease modeling. Climacteric J. Int. Menopause Soc. 20, 107–118. https://doi.org/10.1080/13697137.2017.1287866.
Hanania, N.A., Diamant, Z., 2017. The road to precision medicine in asthma: challenges and opportunities. Curr. Opin. Pulm. Med. https://doi.org/10.1097/MCP.0000000000000444.
Handa, A., Sharma, A., Shukla, S.K., 2019. Machine learning in cybersecurity: A review. WIREs Data Min. Knowl. Discov. 9, e1306 https://doi.org/10.1002/widm.1306.
Handelsman, J., Rondon, M.R., Brady, S.F., Clardy, J., Goodman, R.M., 1998. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem. Biol. 5, R245–R249. https://doi.org/10.1016/S1074-5521(98)90108-9.
Hasin, Y., Seldin, M., Lusis, A., 2017. Multi-omics approaches to disease. Genome Biol. 18, 83. https://doi.org/10.1186/s13059-017-1215-1.
He, Haibo, Yang, Bai, Garcia, E.A., Li, Shutao, 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). Presented at the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969.
He, H., Lin, D., Zhang, J., Wang, Y., Deng, H.-W., 2016. Biostatistics, data mining and computational modeling. In: Application of Clinical Bioinformatics, Translational Bioinformatics. Springer, Dordrecht, pp. 23–57. https://doi.org/10.1007/978-94-017-7543-4_2.
Health, C. for D. and R, 2021. Artificial Intelligence and Machine Learning in Software as a Medical Device [WWW Document]. FDA. URL. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device (accessed 2.1.21).
Herrmann, M., Probst, P., Hornung, R., Jurinovic, V., Boulesteix, A.-L., 2020. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief. Bioinform. https://doi.org/10.1093/bib/bbaa167.
Holzinger, E.R., Dudek, S.M., Frase, A.T., Pendergrass, S.A., Ritchie, M.D., 2014. ATHENA: the analysis tool for heritable and environmental network associations. Bioinforma. Oxf. Engl. 30, 698–705. https://doi.org/10.1093/bioinformatics/btt572.
Hristoskova, A., Boeva, V., Tsiporkova, E., 2014. A formal concept analysis approach to consensus clustering of multi-experiment expression data. BMC Bioinformat. 15, 151. https://doi.org/10.1186/1471-2105-15-151.
Huang, J., Liang, X., Xuan, Y., Geng, C., Li, Y., Lu, H., Qu, S., Mei, X., Chen, H., Yu, T., Sun, N., Rao, J., Wang, J., Zhang, W., Chen, Y., Liao, S., Jiang, H., Liu, X., Yang, Z., Mu, F., Gao, S., 2017a. A reference human genome dataset of the BGISEQ-500 sequencer. GigaScience 6, 1–9. https://doi.org/10.1093/gigascience/gix024.
Huang, S., Chaudhary, K., Garmire, L.X., 2017b. More is better: recent progress in multi-omics data integration methods. Front. Genet. 8 https://doi.org/10.3389/fgene.2017.00084.
Hugenholtz, P., Tyson, G.W., 2008. Metagenomics. Nature 455, 481–483. https://doi.org/10.1038/455481a.
Hung, A.J., 2019. Can machine-learning algorithms replace conventional statistics? BJU Int. 123, 1. https://doi.org/10.1111/bju.14542.
Hwang, B., Lee, J.H., Bang, D., 2018. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 1–14. https://doi.org/10.1038/s12276-018-0071-8.


James, G., Witten, D., Hastie, T., Tibshirani, R. (Eds.), 2017. An Introduction to Statistical Learning: with Applications in R, 1st ed. 2013, Corr. 7th printing 2017 edition. Springer, New York.
Jamil, I.N., Remali, J., Azizan, K.A., Nor Muhammad, N.A., Arita, M., Goh, H.-H., Aizat, W.M., 2020. Systematic Multi-Omics Integration (MOI) approach in plant systems biology. Front. Plant Sci. 11 https://doi.org/10.3389/fpls.2020.00944.
Jeni, L.A., Cohn, J.F., De La Torre, F., 2013. Facing imbalanced data–recommendations for the use of performance metrics. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. Presented at the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 245–251. https://doi.org/10.1109/ACII.2013.47.
Jiang, T., Gradus, J.L., Rosellini, A.J., 2020. Supervised machine learning: a brief primer. Behav. Ther. 51, 675–687. https://doi.org/10.1016/j.beth.2020.05.002.
Jolliffe, I.T., 2002. Principal Component Analysis, 2nd ed. Springer.
Kalaitzopoulos, D., 2016. The potential of precision medicine. New Horiz. Transl. Med. 3, 63–65. https://doi.org/10.1016/j.nhtm.2016.05.001.
Kalvari, I., Nawrocki, E.P., Argasinska, J., Quinones-Olvera, N., Finn, R.D., Bateman, A., Petrov, A.I., 2018. Non-Coding RNA analysis using the Rfam database. Curr. Protoc. Bioinformatics 62, e51. https://doi.org/10.1002/cpbi.51.
Kanehisa, M., Goto, S., 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30. https://doi.org/10.1093/nar/28.1.27.
Karpievitch, Y.V., Polpitiya, A.D., Anderson, G.A., Smith, R.D., Dabney, A.R., 2010. Liquid chromatography mass spectrometry-based proteomics: biological and technological aspects. Ann. Appl. Stat. 4, 1797–1823. https://doi.org/10.1214/10-AOAS341.
Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., Chouvarda, I., 2017. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116. https://doi.org/10.1016/j.csbj.2016.12.005.
Kim, M., Tagkopoulos, I., 2018. Data integration and predictive modeling methods for multi-omics datasets. Mol. Omics 14, 8–25. https://doi.org/10.1039/C7MO00051K.
Kim, S., Park, T., Kon, M., 2014. Cancer survival classification using integrated data sets and intermediate information. Artif. Intell. Med. 62, 23–31. https://doi.org/10.1016/j.artmed.2014.06.003.
Kim, D., Joung, J.-G., Sohn, K.-A., Shin, H., Park, Y.R., Ritchie, M.D., Kim, J.H., 2015. Knowledge boosting: a graph-based integration approach with multi-omics data and genomic knowledge for cancer clinical outcome prediction. J. Am. Med. Inform. Assoc. 22, 109–120. https://doi.org/10.1136/amiajnl-2013-002481.
Kim, S., Jhong, J.-H., Lee, J., Koo, J.-Y., 2017. Meta-analytic support vector machine for integrating multiple omics data. BioData Min. 10, 2. https://doi.org/10.1186/s13040-017-0126-8.
Kim, A.A., Rachid Zaim, S., Subbian, V., 2020. Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data. Int. J. Med. Inform. 141, 104148. https://doi.org/10.1016/j.ijmedinf.2020.104148.
Kirchebner, J., Günther, M.P., Sonnweber, M., King, A., Lau, S., 2020. Factors and predictors of length of stay in offenders diagnosed with schizophrenia - a machine-learning-based approach. BMC Psychiatry 20. https://doi.org/10.1186/s12888-020-02612-1.
Kirk, P., Griffin, J.E., Savage, R.S., Ghahramani, Z., Wild, D.L., 2012. Bayesian correlated clustering to integrate multiple datasets. Bioinforma. Oxf. Engl. 28, 3290–3297. https://doi.org/10.1093/bioinformatics/bts595.
Knittelfelder, O.L., Weberhofer, B.P., Eichmann, T.O., Kohlwein, S.D., Rechberger, G.N., 2014. A versatile ultra-high performance LC-MS method for lipid profiling. J. Chromatogr. B Anal. Technol. Biomed. Life Sci. 951–952, 119–128. https://doi.org/10.1016/j.jchromb.2014.01.011.
Kodama, Y., Shumway, M., Leinonen, R., 2012. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56. https://doi.org/10.1093/nar/gkr854.
Köfeler, H.C., Fauland, A., Rechberger, G.N., Trötzmüller, M., 2012. Mass spectrometry based lipidomics: an overview of technological platforms. Metabolites 2, 19–38. https://doi.org/10.3390/metabo2010019.
Kohl, M., Megger, D.A., Trippler, M., Meckel, H., Ahrens, M., Bracht, T., Weber, F., Hoffmann, A.-C., Baba, H.A., Sitek, B., Schlaak, J.F., Meyer, H.E., Stephan, C., Eisenacher, M., 2014. A practical data processing workflow for multi-OMICS projects. Biochim. Biophys Acta BBA - Proteins Proteomics, Computational Proteomics in the Post-Identification Era 1844, 52–62. https://doi.org/10.1016/j.bbapap.2013.02.029.
Kovacs, G.G., 2016. Molecular pathological classification of neurodegenerative diseases: turning towards precision medicine. Int. J. Mol. Sci. 17 https://doi.org/10.3390/ijms17020189.
Kozomara, A., Birgaoanu, M., Griffiths-Jones, S., 2019. miRBase: from microRNA sequences to function. Nucleic Acids Res. 47, D155–D162. https://doi.org/10.1093/nar/gky1141.
Kuo, T.-C., Tseng, Y.J., 2018. LipidPedia: a comprehensive lipid knowledgebase. Bioinformatics 34, 2982–2987. https://doi.org/10.1093/bioinformatics/bty213.
Kuska, B., 1998. Beer, Bethesda, and biology: how “genomics” came into being. J. Natl. Cancer Inst. 90, 93.
Kwon, M.-S., Kim, Y., Lee, S., Namkung, J., Yun, T., Yi, S.G., Han, S., Kang, M., Kim, S.W., Jang, J.-Y., Park, T., 2015. Integrative analysis of multi-omics data for identifying multi-markers for diagnosing pancreatic cancer. BMC Genomics 16 (Suppl. 9), S4. https://doi.org/10.1186/1471-2164-16-S9-S4.
Lambin, P., Leijenaar, R.T.H., Deist, T.M., Peerlings, J., de Jong, E.E.C., van Timmeren, J., Sanduleanu, S., Larue, R.T.H.M., Even, A.J.G., Jochems, A., van Wijk, Y., Woodruff, H., van Soest, J., Lustberg, T., Roelofs, E., van Elmpt, W., Dekker, A., Mottaghy, F.M., Wildberger, J.E., Walsh, S., 2017. Radiomics: the bridge between medical imaging and personalized medicine. Nat. Rev. Clin. Oncol. https://doi.org/10.1038/nrclinonc.2017.141.
Lanckriet, G.R.G., De Bie, T., Cristianini, N., Jordan, M.I., Noble, W.S., 2004. A statistical framework for genomic data fusion. Bioinforma. Oxf. Engl. 20, 2626–2635. https://doi.org/10.1093/bioinformatics/bth294.
Le, N., Sund, M., Vinci, A., 2016. Prognostic and predictive markers in pancreatic adenocarcinoma. Dig. Liver Dis. 48, 223–230. https://doi.org/10.1016/j.dld.2015.11.001.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539.
Lee, J.K. (Ed.), 2010. Statistical bioinformatics: a guide for life and biomedical science researchers. Wiley-Blackwell, Hoboken, N.J.
Lee, C.H., Yoon, H.-J., 2017. Medical big data: promise and challenges. Kidney Res. Clin. Pract 36, 3–11. https://doi.org/10.23876/j.krcp.2017.36.1.3.
Lee, I.-H., Lushington, G.H., Visvanathan, M., 2011. A filter-based feature selection approach for identifying potential biomarkers for lung cancer. J. Clin. Bioinforma. 1, 11. https://doi.org/10.1186/2043-9113-1-11.
Lee, G., Bang, L., Kim, S.Y., Kim, D., Sohn, K.-A., 2017. Identifying subtype-specific associations between gene expression and DNA methylation profiles in breast cancer. BMC Med. Genet. 10, 28. https://doi.org/10.1186/s12920-017-0268-z.
Lee, T.-Y., Huang, K.-Y., Chuang, C.-H., Lee, C.-Y., Chang, T.-H., 2020. Incorporating deep learning and multi-omics autoencoding for analysis of lung adenocarcinoma prognostication. Comput. Biol. Chem. 87, 107277. https://doi.org/10.1016/j.compbiolchem.2020.107277.
Leinonen, R., Akhtar, R., Birney, E., Bower, L., Cerdeno-Tárraga, A., Cheng, Y., Cleland, I., Faruque, N., Goodgame, N., Gibson, R., Hoad, G., Jang, M., Pakseresht, N., Plaister, S., Radhakrishnan, R., Reddy, K., Sobhany, S., Ten Hoopen, P., Vaughan, R., Zalunin, V., Cochrane, G., 2011. The European nucleotide archive. Nucleic Acids Res. 39, D28–D31. https://doi.org/10.1093/nar/gkq967.
Lévesque, E., Kirby, E., Bolt, I., Knoppers, B.M., de Beaufort, I., Pashayan, N., Widschwendter, M., 2018. Ethical, legal, and regulatory issues for the implementation of omics-based risk prediction of women’s cancer: points to consider. Public Health Genomics 21, 37–44. https://doi.org/10.1159/000492663.
Li, Q., Freeman, L.M., Rush, J.E., Huggins, G.S., Kennedy, A.D., Labuda, J.A., Laflamme, D.P., Hannah, S.S., 2015. Veterinary medicine and multi-omics research for future nutrition targets: metabolomics and transcriptomics of the common degenerative mitral valve disease in dogs. Omics J. Integr. Biol. 19, 461–470. https://doi.org/10.1089/omi.2015.0057.
Li, S., Chen, X., Liu, X., Yu, Y., Pan, H., Haak, R., Schmidt, J., Ziebolz, D., Schmalz, G., 2017. Complex integrated analysis of lncRNAs-miRNAs-mRNAs in oral squamous cell carcinoma. Oral Oncol. 73, 1–9. https://doi.org/10.1016/j.oraloncology.2017.07.026.
Li, M., Wang, Y., Zheng, R., Shi, X., Li, Y., Wu, F., Wang, J., 2019. DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines. IEEE/ACM Trans. Comput. Biol. Bioinform. 575–582. https://doi.org/10.1109/TCBB.2019.2919581.
Li, W., Zhang, A., Zhou, X., Nan, Y., Liu, Q., Sun, H., Fang, H., Wang, X., 2020. High-throughput liquid chromatography mass-spectrometry-driven lipidomics discover metabolic biomarkers and pathways as promising targets to reveal the therapeutic effects of the Shenqi pill. RSC Adv. 10, 2347–2358. https://doi.org/10.1039/C9RA07621B.
Liew, A.W.-C., Law, N.-F., Yan, H., 2011. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief. Bioinform. 12, 498–513. https://doi.org/10.1093/bib/bbq080.
Lightbody, G., Haberland, V., Browne, F., Taggart, L., Zheng, H., Parkes, E., Blayney, J.K., 2019. Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application. Brief. Bioinform. 20, 1795–1811. https://doi.org/10.1093/bib/bby051.
Lin, E., Lane, H.-Y., 2017. Machine learning and systems genomics approaches for multi-omics data. Biomark. Res. 5, 2. https://doi.org/10.1186/s40364-017-0082-y.
Lindon, J.C., Nicholson, J.K., Holmes, E., 2011. The Handbook of Metabonomics and Metabolomics. Elsevier.
List, M., Hauschild, A.-C., Tan, Q., Kruse, T.A., Mollenhauer, J., Baumbach, J., Batra, R., 2014. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. J. Integr. Bioinforma. 11, 236. https://doi.org/10.2390/biecoll-jib-2014-236.
Liu, Y., Ding, J., Reynolds, L.M., Lohman, K., Register, T.C., De La Fuente, A., Howard, T.D., Hawkins, G.A., Cui, W., Morris, J., Smith, S.G., Barr, R.G., Kaufman, J.D., Burke, G.L., Post, W., Shea, S., Mccall, C.E., Siscovick, D., Jacobs, D.R., Tracy, R.P., Herrington, D.M., Hoeschele, I., 2013. Methylomics of gene expression in human monocytes. Hum. Mol. Genet. 22, 5065–5074. https://doi.org/10.1093/hmg/ddt356.
Lock, E.F., Dunson, D.B., 2013. Bayesian consensus clustering. Bioinformatics 29, 2610–2616. https://doi.org/10.1093/bioinformatics/btt425.
Lock, E.F., Hoadley, K.A., Marron, J.S., Nobel, A.B., 2013. Joint and individual variation explained (jive) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523–542. https://doi.org/10.1214/12-AOAS597.
Lodish, H., Berk, A., Zipursky, S.L., Matsudaira, P., Baltimore, D., Darnell, J., 2000. Molecular Cell Biology, 4th ed. W. H. Freeman.
López de Maturana, E., Alonso, L., Alarcón, P., Martín-Antoniano, I.A., Pineda, S., Piorno, L., Calle, M.L., Malats, N., 2019. Challenges in the integration of omics and non-omics data. Genes 10, 238. https://doi.org/10.3390/genes10030238.
López Pineda, A., Ye, Y., Visweswaran, S., Cooper, G.F., Wagner, M.M., Tsui, F., 2015. Comparison of machine learning classifiers for influenza detection from emergency department free-text reports. J. Biomed. Inform. 58, 60–69. https://doi.org/10.1016/j.jbi.2015.08.019.
Lorena, A.C., Jacintho, L.F.O., Siqueira, M.F., Giovanni, R.D., Lohmann, L.G., de Carvalho, A.C.P.L.F., Yamamoto, M., 2011. Comparing machine learning classifiers


in potential distribution modelling. Expert Syst. Appl. 38, 5268–5275. https://doi. Mitchell, A.L., Almeida, A., Beracochea, M., Boland, M., Burgin, J., Cochrane, G.,
org/10.1016/j.eswa.2010.10.031. Crusoe, M.R., Kale, V., Potter, S.C., Richardson, L.J., Sakharova, E.,
Lowe, R., Shirley, N., Bleackley, M., Dolan, S., Shafee, T., 2017. Transcriptomics Scheremetjew, M., Korobeynikov, A., Shlemov, A., Kunyavskaya, O., Lapidus, A.,
technologies. PLoS Comput. Biol. 13, e1005457 https://doi.org/10.1371/journal. Finn, R.D., 2020. MGnify: the microbiome analysis resource in 2020. Nucleic Acids
pcbi.1005457. Res. 48, D570–D578. https://doi.org/10.1093/nar/gkz1035.
Lu, J., Cowperthwaite, M.C., Burnett, M.G., Shpak, M., 2016. Molecular predictors of Mo, Q., Wang, S., Seshan, V.E., Olshen, A.B., Schultz, N., Sander, C., Powers, R.S.,
long-term survival in glioblastoma multiforme patients. PLoS One 11, e0154313. Ladanyi, M., Shen, R., 2013. Pattern discovery and cancer gene identification in
https://doi.org/10.1371/journal.pone.0154313. integrated cancer genomic data. Proc. Natl. Acad. Sci. 110, 4245–4250. https://doi.
Luck, K., Sheynkman, G.M., Zhang, I., Vidal, M., 2017. Proteome-scale human org/10.1073/pnas.1208949110.
interactomics. Trends Biochem. Sci. 42, 342–354. https://doi.org/10.1016/j. Mo, Q., Shen, R., Guo, C., Vannucci, M., Chan, K.S., Hilsenbeck, S.G., 2018. A fully
tibs.2017.02.006. Bayesian latent variable model for integrative clustering analysis of multi-type omics
Lussier, Y.A., Li, H., 2012. Breakthroughs in genomics data integration for predicting data. Biostatistics 19, 71–86. https://doi.org/10.1093/biostatistics/kxx017.
clinical outcome. J. Biomed. Inform. 45, 1199–1201. https://doi.org/10.1016/j. Mostafavi, S., Morris, Q., 2010. Fast integration of heterogeneous data sources for
jbi.2012.10.003. predicting gene function with limited annotation. Bioinformatics 26, 1759–1765.
Ma, S., Ren, J., Fenyö, D., 2016. Breast cancer prognostics using multi-omics data. AMIA https://doi.org/10.1093/bioinformatics/btq262.
Summits Transl. Sci. Proc. 2016, 52–59. Mougin, F., Auber, D., Bourqui, R., Diallo, G., Dutour, I., Jouhet, V., Thiessard, F.,
Ma, A., McDermaid, A., Xu, J., Chang, Y., Ma, Q., 2020a. Integrative methods and Thiébaut, R., Thébault, P., 2018. Visualizing omics and clinical data: Which
practical challenges for single-cell multi-omics. Trends Biotechnol. 38, 1007–1022. challenges for dealing with their variety? Methods, Comp.Visualizat. Meth. High
https://doi.org/10.1016/j.tibtech.2020.02.013. Dimens. Biol. Data 132, 3–18. https://doi.org/10.1016/j.ymeth.2017.08.012.
Ma, B., Meng, F., Yan, G., Yan, H., Chai, B., Song, F., 2020b. Diagnostic classification of Muehlematter, U.J., Daniore, P., Vokinger, K.N., 2021. Approval of artificial intelligence
cancers using extreme gradient boosting algorithm and multi-omics data. Comput. and machine learning-based medical devices in the USA and Europe (2015–20): a
Biol. Med. 121, 103761. https://doi.org/10.1016/j.compbiomed.2020.103761. comparative analysis. Lancet Digit. Health 0. https://doi.org/10.1016/S2589-7500
Malod-Dognin, N., Petschnigg, J., Pržulj, N., 2017. Precision medicine — a promising, (20)30292-2.
yet challenging road lies ahead. Curr. Opin. Syst. Biol. https://doi.org/10.1016/j. Mutie, P.M., Giordano, G.N., Franks, P.W., 2017. Lifestyle precision medicine: the next
coisb.2017.10.003. generation in type 2 diabetes prevention? BMC Med. 15, 171. https://doi.org/
Mamoshina, P., Volosnikova, M., Ozerov, I.V., Putin, E., Skibina, E., Cortese, F., 10.1186/s12916-017-0938-x.
Zhavoronkov, A., 2018. Machine learning on human muscle transcriptomic data for Nalejska, E., Mączyńska, E., Lewandowska, M.A., 2014. Prognostic and predictive
biomarker discovery and tissue-specific drug target identification. Front. Genet. 9 biomarkers: tools in personalized oncology. Mol. Diagn. Ther. 18, 273–284. https://
https://doi.org/10.3389/fgene.2018.00242. doi.org/10.1007/s40291-013-0077-9.
Mandel, S.A., Morelli, M., Halperin, I., Korczyn, A.D., 2010. Biomarkers for prediction Nam, H., Chung, B.C., Kim, Y., Lee, K., Lee, D., 2009. Combining tissue transcriptomics
and targeted prevention of Alzheimer’s and Parkinson’s diseases: evaluation of drug and urine metabolomics for breast cancer biomarker identification. Bioinforma. Oxf.
clinical efficacy. EPMA J. 1, 273–292. https://doi.org/10.1007/s13167-010-0036-z. Engl. 25, 3151–3157. https://doi.org/10.1093/bioinformatics/btp558.
Mankoo, P.K., Shen, R., Schultz, N., Levine, D.A., Sander, C., 2011. Time to recurrence Nguyen, N.D., Wang, D., 2020. Multiview learning for understanding functional
and survival in serous ovarian tumors predicted from integrated genomic profiles. multiomics. PLoS Comput. Biol. 16, e1007677 https://doi.org/10.1371/journal.
PLoS One 6, e24709. https://doi.org/10.1371/journal.pone.0024709. pcbi.1007677.
Margolies, L.R., Pandey, G., Horowitz, E.R., Mendelson, D.S., 2016. Breast imaging in the Nguyen, T., Tagett, R., Diaz, D., Draghici, S., 2017. A novel approach for data integration
era of big data: structured reporting and data mining. AJR Am. J. Roentgenol. 206, and disease subtyping. Genome Res. 27, 2025–2039. https://doi.org/10.1101/
259–264. https://doi.org/10.2214/AJR.15.15396. gr.215129.116.
Martinelli, Giovanni, Foreman, Nicholas, Sanders, Sean, 2015. Advancing Precision Nguyen, H., Shrestha, S., Draghici, S., Nguyen, T., 2019. PINSPlus: a tool for tumor
Medicine Through Multi-Omics: An Integrated Approach To Tumor Profiling. subtype discovery in integrated genomic data. Bioinformatics 35, 2843–2846.
Martinez, A.M., Kak, A.C., 2001. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1093/bioinformatics/bty1049.
23, 228–233. https://doi.org/10.1109/34.908974. Nicolai, M., Peter, Bühlmann, 2010. Stability selection. J. R. Stat. Soc. Ser. B Stat
McCabe, S.D., Lin, D.-Y., Love, M.I., 2020. Consistency and overfitting of multi-omics Methodol. 72, 417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x.
methods on experimental data. Brief. Bioinform. 21, 1277–1284. https://doi.org/ Nicora, G., Vitali, F., Dagliati, A., Geifman, N., Bellazzi, R., 2020. Integrated multi-omics
10.1093/bib/bbz070. analyses in oncology: a review of machine learning methods and tools. Front. Oncol.
McCarthy, M.I., 2017. Painting a new picture of personalised medicine for diabetes. 10 https://doi.org/10.3389/fonc.2020.01030.
Diabetologia 60, 793–799. https://doi.org/10.1007/s00125-017-4210-x. Nielsen, J., 2017. Systems biology of metabolism: a driver for developing personalized
McShane, L.M., Cavenagh, M.M., Lively, T.G., Eberhard, D.A., Bigbee, W.L., Williams, P. and precision medicine. Cell Metab. 25, 572–579. https://doi.org/10.1016/j.
M., Mesirov, J.P., Polley, M.-Y.C., Kim, K.Y., Tricoli, J.V., Taylor, J.M., Shuman, D.J., cmet.2017.02.002.
Simon, R.M., Doroshow, J.H., Conley, B.A., 2013a. Criteria for the use of omics- O’Mahony, N., Campbell, S., Carvalho, A., Harapanahalli, S., Hernandez, G.V.,
based predictors in clinical trials: explanation and elaboration. BMC Med. 11, 220. Krpalkova, L., Riordan, D., Walsh, J., 2020. Deep learning vs. traditional computer
https://doi.org/10.1186/1741-7015-11-220. vision. In: Arai, K., Kapoor, S. (Eds.), Advances in Computer Vision. Springer
McShane, L.M., Cavenagh, M.M., Lively, T.G., Eberhard, D.A., Bigbee, W.L., Williams, P. International Publishing, Cham, pp. 128–144.
M., Mesirov, J.P., Polley, M.-Y.C., Kim, K.Y., Tricoli, J.V., Taylor, J.M.G., Shuman, D. Obermeyer, Z., Emanuel, E.J., 2016. Predicting the future — big data, machine learning,
J., Simon, R.M., Doroshow, J.H., Conley, B.A., 2013b. Criteria for the use of omics- and clinical medicine. N. Engl. J. Med. 375, 1216–1219. https://doi.org/10.1056/
based predictors in clinical trials. Nature 502, 317–320. https://doi.org/10.1038/ NEJMp1606181.
nature12564. Olson, R.S., Sipper, M., Cava, W.L., Tartarone, S., Vitale, S., Fu, W., Orzechowski, P.,
Memon, J., Sami, M., Khan, R.A., Uddin, M., 2020. Handwritten optical character Urbanowicz, R.J., Holmes, J.H., Moore, J.H., 2018. A system for accessible artificial
recognition (OCR): a comprehensive systematic literature review (SLR). IEEE Access intelligence. In: Banzhaf, W., Olson, R.S., Tozier, W., Riolo, R. (Eds.), Genetic
8, 142642–142668. https://doi.org/10.1109/ACCESS.2020.3012542. Programming Theory and Practice XV, Genetic and Evolutionary Computation.
Meng, C., Helm, D., Frejno, M., Kuster, B., 2016a. moCluster: identifying joint patterns Springer International Publishing, Cham, pp. 121–134. https://doi.org/10.1007/
across multiple omics data sets. J. Proteome Res. 15, 755–765. https://doi.org/ 978-3-319-90512-9_8.
10.1021/acs.jproteome.5b00824. Overmyer, K.A., Shishkova, E., Miller, I.J., Balnis, J., Bernstein, M.N., Peters-Clarke, T.
Meng, C., Zeleznik, O.A., Thallinger, G.G., Kuster, B., Gholami, A.M., Culhane, A.C., M., Meyer, J.G., Quan, Q., Muehlbauer, L.K., Trujillo, E.A., He, Y., Chopra, A.,
2016b. Dimension reduction techniques for the integrative analysis of multi-omics Chieng, H.C., Tiwari, A., Judson, M.A., Paulson, B., Brademan, D.R., Zhu, Y.,
data. Brief. Bioinform. bbv 108. https://doi.org/10.1093/bib/bbv108. Serrano, L.R., Linke, V., Drake, L.A., Adam, A.P., Schwartz, B.S., Singer, H.A.,
Mercer, T.R., Gerhardt, D.J., Dinger, M.E., Crawford, J., Trapnell, C., Jeddeloh, J.A., Swanson, S., Mosher, D.F., Stewart, R., Coon, J.J., Jaitovich, A., 2020. Large-scale
Mattick, J.S., Rinn, J.L., 2012. Targeted RNA sequencing reveals the deep complexity multi-omic analysis of COVID-19 severity. Cell Syst. https://doi.org/10.1016/j.
of the human transcriptome. Nat. Biotechnol. 30, 99–104. https://doi.org/10.1038/ cels.2020.10.003.
nbt.2024. Paik, E.S., Choi, H.J., Kim, T.-J., Lee, J.-W., Kim, B.-G., Bae, D.-S., Choi, C.H., 2017.
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E., Kubal, M., Paczian, T., Molecular signature for lymphatic invasion associated with survival of epithelial
Rodriguez, A., Stevens, R., Wilke, A., Wilkening, J., Edwards, R., 2008. The ovarian cancer. Cancer Res. Treat. Off. J. Korean Cancer Assoc. https://doi.org/
metagenomics RAST server – a public resource for the automatic phylogenetic and 10.4143/crt.2017.104.
functional analysis of metagenomes. BMC Bioinformat. 9, 386. https://doi.org/ Pérez-Cobas, A.E., Gomez-Valero, L., Buchrieser, C., 2020. Metagenomic approaches in
10.1186/1471-2105-9-386. microbial ecology: an update on whole-genome and marker gene sequencing
Milward, E.A., Shahandeh, A., Heidari, M., Johnstone, D.M., Daneshi, N., analyses. Microb. Genomics 6, e000409. https://doi.org/10.1099/mgen.0.000409.
Hondermarck, H., 2016. Transcriptomics, in: Encyclopedia of Cell Biology. Academic Peterson, T.A., Doughty, E., Kann, M.G., 2013. Towards precision medicine: advances in
Press, Waltham, pp. 160–165. https://doi.org/10.1016/B978-0-12-394447- computational approaches for the analysis of human variants. J. Mol. Biol. 425,
4.40029-5. 4047–4063. https://doi.org/10.1016/j.jmb.2013.08.008.
Mirza, B., Wang, W., Wang, J., Choi, H., Chung, N.C., Ping, P., 2019. Machine learning Pfützner, A., Forst, T., 2006. High-sensitivity C-reactive protein as cardiovascular risk
and integrative analysis of biomedical big data. Genes 10, 87. https://doi.org/ marker in patients with diabetes mellitus. Diabetes Technol. Ther. 8, 28–36. https://
10.3390/genes10020087. doi.org/10.1089/dia.2006.8.28.
Misra, B.B., Langefeld, C., Olivier, M., Cox, L.A., 2019. Integrated omics: tools, advances Pietzner, M., Engelmann, B., Kacprowski, T., Golchert, J., Dirk, A.-L., Hammer, E.,
and future approaches. J. Mol. Endocrinol. R21–R45. https://doi.org/10.1530/JME- Iwen, K.A., Nauck, M., Wallaschofski, H., Führer, D., Münte, T.F., Friedrich, N.,
18-0055. Völker, U., Homuth, G., Brabant, G., 2017. Plasma proteome and metabolome

21
P.S. Reel et al. Biotechnology Advances 49 (2021) 107739

characterization of an experimental human thyrotoxicosis model. BMC Med. 15, 6. omics integration. Genomics 112, 2833–2841. https://doi.org/10.1016/j.
https://doi.org/10.1186/s12916-016-0770-8. ygeno.2020.03.021.
Pinu, F.R., Beale, D.J., Paten, A.M., Kouremenos, K., Swarup, S., Schirra, H.J., Senft, D., Leiserson, M.D.M., Ruppin, E., Ronai, Z.A., 2017. Precision oncology: the road
Wishart, D., 2019. Systems biology and multi-omics integration: viewpoints from the ahead. Trends Mol. Med. 23, 874–898. https://doi.org/10.1016/j.
metabolomics research community. Metabolites 9. https://doi.org/10.3390/ molmed.2017.08.003.
metabo9040076. Seoane, J.A., Day, I.N.M., Gaunt, T.R., Campbell, C., 2014. A pathway-based data
Poirion, O.B., Chaudhary, K., Garmire, L.X., 2018. Deep Learning data integration for integration framework for prediction of disease progression. Bioinformatics 30,
better risk stratification models of bladder cancer. AMIA Summits Transl. Sci. Proc. 838–845. https://doi.org/10.1093/bioinformatics/btt610.
2018, 197–206. Sharifi-Noghabi, H., Zolotareva, O., Collins, C.C., Ester, M., 2019. MOLI: multi-omics late
Poirion, O.B., Chaudhary, K., Huang, S., Garmire, L.X., 2020. Multi-omics-based pan- integration with deep neural networks for drug response prediction. Bioinformatics
cancer prognosis prediction using an ensemble of deep-learning and machine- 35, i501–i509. https://doi.org/10.1093/bioinformatics/btz318.
learning models. medRxiv 19010082. https://doi.org/10.1101/19010082. Shaw, A., Bradley, M.D., Elyan, S., Kurian, K.M., 2015. Tumour biomarkers: diagnostic,
Prelot, L., Draisma, H., Anasanti, M., Balkhiyarova, Z., Wielscher, M., Yengo, L., prognostic, and predictive. BMJ 351, h3449. https://doi.org/10.1136/bmj.h3449.
Balkau, B., Roussel, R., Sebert, S., Korpela, M.A., Froguel, P., Jarvelin, M., Shen, H.-B., Chou, K.-C., 2006. Ensemble classifier for protein fold pattern recognition.
Kaakinen, M., Prokopenko, I., 2018. Machine Learning in Multi-Omics Data to Assess Bioinforma. Oxf. Engl. 22, 1717–1722. https://doi.org/10.1093/bioinformatics/
Longitudinal Predictors of Glycaemic Health. https://doi.org/10.1101/358390. btl170.
Proteomics, transcriptomics: what’s in a name?, 1999. Nature 402, 715. https://doi.org/ Shen, R., Olshen, A.B., Ladanyi, M., 2009. Integrative clustering of multiple genomic data
10.1038/45354. types using a joint latent variable model with application to breast and lung cancer
Pudil, P., Novovičová, J., Kittler, J., 1994. Floating search methods in feature selection. subtype analysis. Bioinformatics 25, 2906–2912. https://doi.org/10.1093/
Pattern Recogn. Lett. 15, 1119–1125. https://doi.org/10.1016/0167-8655(94) bioinformatics/btp543.
90127-9. Shin, H., Lisewski, A.M., Lichtarge, O., 2007. Graph sharpening plus graph integration: a
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers synergy that improves protein functional classification. Bioinformatics 23,
Inc, San Francisco, CA, USA. 3217–3224. https://doi.org/10.1093/bioinformatics/btm511.
Rappoport, N., Shamir, R., 2018. Multi-omic and multi-view clustering algorithms: Shin, H., Hill, N.J., Lisewski, A.M., Park, J.-S., 2010. Graph sharpening. Expert Syst.
review and cancer benchmark. Nucleic Acids Res. 46, 10546–10562. https://doi. Appl. 37, 7870–7879. https://doi.org/10.1016/j.eswa.2010.04.050.
org/10.1093/nar/gky889. Shrivastava, A.K., Singh, H.V., Raizada, A., Singh, S.K., 2015. C-reactive protein,
Rappoport, N., Shamir, R., 2019. NEMO: cancer subtyping by integration of partial multi- inflammation and coronary heart disease. Egypt. Heart J. 67, 89–97. https://doi.
omic data. Bioinformatics 35, 3348–3356. https://doi.org/10.1093/bioinformatics/ org/10.1016/j.ehj.2014.11.005.
btz058. Singhal, A., Simmons, M., Lu, Z., 2016. Text mining genotype-phenotype relationships
Rashidi, H.H., Tran, N.K., Betts, E.V., Howell, L.P., Green, R., 2019. Artificial intelligence from biomedical literature for database curation and precision medicine. PLoS
and machine learning in pathology: the present landscape of supervised methods. Comput. Biol. 12, e1005017 https://doi.org/10.1371/journal.pcbi.1005017.
Acad. Pathol. 6 https://doi.org/10.1177/2374289519873088, 2374289519873088. Sonsare, P.M., Gunavathi, C., 2019. Investigation of machine learning techniques on
Ray, P., Zheng, L., Lucas, J., Carin, L., 2014. Bayesian joint analysis of heterogeneous proteomics: A comprehensive survey. Prog. Biophys. Mol. Biol. 149, 54–69. https://
genomics data. Bioinformatics 30, 1370–1376. https://doi.org/10.1093/ doi.org/10.1016/j.pbiomolbio.2019.09.004.
bioinformatics/btu064. Sorzano, C.O.S., Vargas, J., Montano, A.P., 2014. A survey of dimensionality reduction
Reuter, J.A., Spacek, D.V., Snyder, M.P., 2015. High-throughput sequencing techniques. ArXiv14032877 Cs Q-Bio Stat 1–35.
technologies. Mol. Cell 58, 586–597. https://doi.org/10.1016/j. Speicher, N.K., Pfeifer, N., 2015. Integrating different data types by regularized
molcel.2015.05.004. unsupervised multiple kernel learning with application to cancer subtype discovery.
Rhodes, D.R., Tomlins, S.A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana- Bioinformatics 31, i268–i275. https://doi.org/10.1093/bioinformatics/btv244.
Sundaram, S., Ghosh, D., Pandey, A., Chinnaiyan, A.M., 2005. Probabilistic model of Stańczyk, U., Jain, L.C. (Eds.), 2015. Feature Selection for Data and Pattern Recognition,
the human protein-protein interaction network. Nat. Biotechnol. 23, 951. https:// Studies in Computational Intelligence. Springer, Berlin.
doi.org/10.1038/nbt1103. Stetson, L.C., Pearl, T., Chen, Y., Barnholtz-Sloan, J.S., 2014. Computational
Rimoldi, S.F., Scherrer, U., Messerli, F.H., 2014. Secondary arterial hypertension: when, identification of multi-omic correlates of anticancer therapeutic response. BMC
who, and how to screen? Eur. Heart J. 35, 1245–1254. https://doi.org/10.1093/ Genomics 15, S2. https://doi.org/10.1186/1471-2164-15-S7-S2.
eurheartj/eht534. Strimbu, K., Tavel, J.A., 2010. What are biomarkers? Curr. Opin. HIV AIDS 5, 463–466.
Ritchie, M.D., Holzinger, E.R., Li, R., Pendergrass, S.A., Kim, D., 2015. Methods of https://doi.org/10.1097/COH.0b013e32833ed177.
integrating data to uncover genotype-phenotype interactions. Nat. Rev. Genet. 16, Sud, M., Fahy, E., Cotter, D., Brown, A., Dennis, E.A., Glass, C.K., Merrill, A.H.,
85–97. https://doi.org/10.1038/nrg3868. Murphy, R.C., Raetz, C.R.H., Russell, D.W., Subramaniam, S., 2007. LMSD: LIPID
Rojas-Macias, M.A., Mariethoz, J., Andersson, P., Jin, C., Venkatakrishnan, V., Aoki, N. MAPS structure database. Nucleic Acids Res. 35, D527–D532. https://doi.org/
P., Shinmachi, D., Ashwood, C., Madunic, K., Zhang, T., Miller, R.L., Horlacher, O., 10.1093/nar/gkl838.
Struwe, W.B., Watanabe, Y., Okuda, S., Levander, F., Kolarich, D., Rudd, P.M., Tan, K., Huang, W., Hu, J., Dong, S., 2020a. A multi-omics supervised autoencoder for
Wuhrer, M., Kettner, C., Packer, N.H., Aoki-Kinoshita, K.F., Lisacek, F., Karlsson, N. pan-cancer clinical outcome endpoints prediction. BMC Med. Inform. Decis. Mak. 20,
G., 2019. Towards a standardized bioinformatics infrastructure for N - and O 129. https://doi.org/10.1186/s12911-020-1114-3.
-glycomics. Nat. Commun. 10, 3275. https://doi.org/10.1038/s41467-019-11131-x. Tan, X., Yu, Y., Duan, K., Zhang, J., Sun, P.S., 2020b. Current advances and limitations of
Roobaert, D., Karakoulas, G., Chawla, N.V., 2006. Information gain, correlation and deep learning in anticancer drug sensitivity prediction [WWW Document] Curr. Top.
support vector machines. In: Feature Extraction, Studies in Fuzziness and Soft Med. Chem. 20 (21), 1858–1867. https://doi.org/10.2174/
Computing. Springer, Berlin, Heidelberg, pp. 463–470. https://doi.org/10.1007/ 1568026620666200710101307. URL. https://www.eurekaselect.com/183610/art
978-3-540-35488-8_23. icle (accessed 11.2.20).
Sakr, S., Elshawi, R., Ahmed, A.M., Qureshi, W.T., Brawner, C.A., Keteyian, S.J., Tang, B., Pan, Z., Yin, K., Khateeb, A., 2019. Recent advances of deep learning in
Blaha, M.J., Al-Mallah, M.H., 2017. Comparison of machine learning techniques to bioinformatics and computational biology. Front. Genet. 10 https://doi.org/
predict all-cause mortality using fitness data: the Henry ford exercIse testing (FIT) 10.3389/fgene.2019.00214.
project. BMC Med. Inform. Decis. Mak. 17. https://doi.org/10.1186/s12911-017- Taskesen, E., Babaei, S., Reinders, M.M.J., de Ridder, J., 2015. Integration of gene
0566-6. expression and DNA-methylation profiles improves molecular subtype classification
Sanger, F., Nicklen, S., Coulson, A.R., 1977. DNA sequencing with chain-terminating in acute myeloid leukemia. BMC Bioinformat. 16 (Suppl. 4), S5. https://doi.org/
inhibitors. Proc. Natl. Acad. Sci. U. S. A. 74, 5463–5467. https://doi.org/10.1073/ 10.1186/1471-2105-16-S4-S5.
pnas.74.12.5463. Tateno, Y., Imanishi, T., Miyazaki, S., Fukami-Kobayashi, K., Saitou, N., Sugawara, H.,
Sathyanarayanan, A., Gupta, R., Thompson, E.W., Nyholt, D.R., Bauer, D.C., Nagaraj, S. Gojobori, T., 2002. DNA Data Bank of Japan (DDBJ) for genome scale research in life
H., 2020. A comparative study of multi-omics integration tools for cancer driver science. Nucleic Acids Res. 30, 27–30. https://doi.org/10.1093/nar/30.1.27.
gene identification and tumour subtyping. Brief. Bioinform. 21, 1920–1936. https:// Tepeli, Y.I., Ünal, A.B., Akdemir, F.M., Tastan, O., 2019. PAMOGK: a pathway graph
doi.org/10.1093/bib/bbz121. kernel based multi-omics clustering approach for discovering cancer patient
Saulnier, K.M., Bujold, D., Dyke, S.O.M., Dupras, C., Beck, S., Bourque, G., Joly, Y., 2019. subgroups. bioRxiv 834168. https://doi.org/10.1101/834168.
Benefits and barriers in the design of harmonized access agreements for international The UniProt Consortium, 2019. UniProt: a worldwide hub of protein knowledge. Nucleic
data sharing. Sci. Data 6, 297. https://doi.org/10.1038/s41597-019-0310-4. Acids Res. 47, D506–D515. https://doi.org/10.1093/nar/gky1049.
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. Thomas, T., Stefanoni, D., Dzieciatkowska, M., Issaian, A., Nemkov, T., Hill, R.C.,
Off. J. Int. Neural Netw. Soc. 61, 85–117. https://doi.org/10.1016/j. Francis, R.O., Hudson, K.E., Buehler, P.W., Zimring, J.C., Hod, E.A., Hansen, K.C.,
neunet.2014.09.003. Spitalnik, S.L., D’Alessandro, A., 2020. Evidence for structural protein damage and
Schumacher, A., Rujan, T., Hoefkens, J., 2014. A collaborative approach to develop a membrane lipid remodeling in red blood cells from COVID-19 patients. medRxiv.
multi-omics data analytics platform for translational research. Appl. Transl. https://doi.org/10.1101/2020.06.29.20142703, 2020.06.29.20142703.
Genomics, Global Sharing of Genomic Knowledge in a Free Market 3, 105–108. Thudumu, S., Branch, P., Jin, J., Singh, J., 2020. A comprehensive survey of anomaly
https://doi.org/10.1016/j.atg.2014.09.010. detection techniques for high dimensional big data. J. Big Data 7, 42. https://doi.
Schwarz, D.F., König, I.R., Ziegler, A., 2010. On safari to Random Jungle: a fast org/10.1186/s40537-020-00320-x.
implementation of Random Forests for high-dimensional data. Bioinformatics 26, Tiemeyer, M., Aoki, K., Paulson, J., Cummings, R.D., York, W.S., Karlsson, N.G.,
1752–1758. https://doi.org/10.1093/bioinformatics/btq257. Lisacek, F., Packer, N.H., Campbell, M.P., Aoki, N.P., Fujita, A., Matsubara, M.,
Seal, D.B., Das, V., Goswami, S., De, R.K., 2020. Estimating gene expression from DNA Shinmachi, D., Tsuchiya, S., Yamada, I., Pierce, M., Ranzinger, R., Narimatsu, H.,
methylation and copy number variation: A deep learning regression model for multi- Aoki-Kinoshita, K.F., 2017. GlyTouCan: an accessible glycan structure repository.
Glycobiology 27, 915–919. https://doi.org/10.1093/glycob/cwx066.

22
P.S. Reel et al. Biotechnology Advances 49 (2021) 107739

Timp, W., Timp, G., 2020. Beyond mass spectrometry, the next step in proteomics. Sci. Wu, C.-C., Asgharzadeh, S., Triche, T.J., D’Argenio, D.Z., 2010. Prediction of human
Adv. 6 https://doi.org/10.1126/sciadv.aax8978 eaax8978. functional genetic networks from heterogeneous data using RVM-based ensemble
Tini, G., Marchetti, L., Priami, C., Scott-Boyer, M.-P., 2019. Multi-omics integration—a learning. Bioinformatics 26, 807–813. https://doi.org/10.1093/bioinformatics/
comparison of unsupervised clustering methodologies. Brief. Bioinform. 20, btq044.
1269–1279. https://doi.org/10.1093/bib/bbx167. Wu, X., Hasan, M.A., Chen, J.Y., 2014. Pathway and network analysis in proteomics.
Tipping, M.E., 2001. Sparse bayesian learning and the relevance vector machine. J. Theor. Biol. 0, 44–52. https://doi.org/10.1016/j.jtbi.2014.05.031.
J. Mach. Learn. Res. 1, 211–244. https://doi.org/10.1162/15324430152748236. Wu, D., Wang, D., Zhang, M.Q., Gu, J., 2015. Fast dimension reduction and integrative
Tong, L., Wu, H., Wang, M.D., 2020. Integrating multi-omics data by learning modality clustering of multi-omics data using low-rank approximation: application to cancer
invariant representations for improved prediction of overall survival of cancer. molecular classification. BMC Genomics 16, 1022. https://doi.org/10.1186/s12864-
Methods. https://doi.org/10.1016/j.ymeth.2020.07.008. 015-2223-8.
Tsuda, K., Shin, H., Schölkopf, B., 2005. Fast protein classification with multiple Wu, C., Zhou, F., Ren, J., Li, X., Jiang, Y., Ma, S., 2019. A selective review of multi-level
networks. Bioinformatics 21, ii59–ii65. https://doi.org/10.1093/bioinformatics/ omics data integration using variable selection. High-Throughput 8, 4. https://doi.
bti1110. org/10.3390/ht8010004.
Uddin, S., Khan, A., Hossain, M.E., Moni, M.A., 2019. Comparing different supervised Wu, S., Roberts, K., Datta, S., Du, J., Ji, Z., Si, Y., Soni, S., Wang, Q., Wei, Q., Xiang, Y.,
machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. Zhao, B., Xu, H., 2020. Deep learning in clinical natural language processing: a
19, 281. https://doi.org/10.1186/s12911-019-1004-8. methodical review. J. Am. Med. Inform. Assoc. 27, 457–470. https://doi.org/
Uhlen, M., Oksvold, P., Fagerberg, L., Lundberg, E., Jonasson, K., Forsberg, M., 10.1093/jamia/ocz200.
Zwahlen, M., Kampf, C., Wester, K., Hober, S., Wernerus, H., Björling, L., Ponten, F., Xu, D., Tian, Y., 2015. A comprehensive survey of clustering algorithms. Ann. Data Sci. 2,
2010. Towards a knowledge-based human protein atlas. Nat. Biotechnol. 28, 165–193. https://doi.org/10.1007/s40745-015-0040-1.
1248–1250. https://doi.org/10.1038/nbt1210-1248. Xu, J., Wu, P., Chen, Y., Meng, Q., Dawood, Hussain, Dawood, Hassan, 2019a.
Van Deun, K., Smilde, A.K., van der Werf, M.J., Kiers, H.A., Van Mechelen, I., 2009. A hierarchical integration deep flexible neural forest framework for cancer subtype
A structured overview of simultaneous component based data integration. BMC classification by integrating multi-omics data. BMC Bioinformat. 20, 527. https://
Bioinformat. 10, 246. https://doi.org/10.1186/1471-2105-10-246. doi.org/10.1186/s12859-019-3116-7.
Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag New Xu, X., Liang, T., Zhu, J., Zheng, D., Sun, T., 2019b. Review of classical dimensionality
York, Inc., New York, NY, USA. reduction and sample selection methods for large-scale data processing. In:
Vasta, G.R., Ahmed, H., 2008. Animal Lectins: A Functional View. CRC Press. Neurocomputing, Chinese Conference on Computer Vision 2017, 328, pp. 5–15.
Vickers, A.J., Elkin, E.B., 2006. Decision curve analysis: a novel method for evaluating https://doi.org/10.1016/j.neucom.2018.02.100.
prediction models. Med. Decis. Mak. Int. J. Soc. Med. Decis. Mak. 26, 565–574. Yan, Z., Xiong, Y., Xu, W., Li, M., Cheng, Y., Chen, F., Ding, S., Xu, H., Zheng, G., 2012.
https://doi.org/10.1177/0272989X06295361. Identification of recurrence-related genes by integrating microRNA and gene
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., 2010. Stacked expression profiling of gastric cancer. Int. J. Oncol. 41, 2166–2174. https://doi.org/
denoising autoencoders: learning useful representations in a deep network with a 10.3892/ijo.2012.1637.
local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408. Yan, K.K., Zhao, H., Pang, H., 2017. A comparison of graph- and kernel-based –omics
Vineetha, S., Chandra Shekara Bhat, C., Idicula, S.M., 2013. MicroRNA-mRNA data integration algorithms for classifying complex traits. BMC Bioinformat. 18, 539.
interaction network using TSK-type recurrent neural fuzzy network. Gene 515, https://doi.org/10.1186/s12859-017-1982-4.
385–390. https://doi.org/10.1016/j.gene.2012.12.063. Yang, K., Han, X., 2016. Lipidomics: techniques, applications, and outcomes related to
Vivian, J., Eizenga, J.M., Beale, H.C., Vaske, O.M., Paten, B., 2020. Bayesian framework biomedical sciences. Trends Biochem. Sci. 41, 954–969. https://doi.org/10.1016/j.
for detecting gene expression outliers in individual samples. JCO Clin. Cancer tibs.2016.08.010.
Inform. 4 https://doi.org/10.1200/CCI.19.00095. Young, F.W., Hamer, R.M., 1987. Multidimensional Scaling: History, Theory, and
Vogel, F., Motulsky, A.G., 1997. Vogel and Motulsky’s Human Genetics: Problems and Applications. L. Erlbaum Associates, Hillsdale, N.J.
Approaches. Springer Science & Business Media. Young, J., Modat, M., Cardoso, M.J., Mendelson, A., Cash, D., Ourselin, S., Alzheimer’s
Wang, L., 2010. Pharmacogenomics: a systems approach. Wiley Interdiscip. Rev. Syst. Disease Neuroimaging Initiative, 2013. Accurate multimodal probabilistic prediction
Biol. Med. 2, 3–22. https://doi.org/10.1002/wsbm.42. of conversion to Alzheimer’s disease in patients with mild cognitive impairment.
Wang, D., Gribskov, M., 2005. Examining the architecture of cellular computing through NeuroImage Clin. 2, 735–745. https://doi.org/10.1016/j.nicl.2013.05.004.
a comparative study with a computer. J. R. Soc. Interface 2, 187–195. https://doi. Yu, X.-T., Zeng, T., 2018. Integrative analysis of omics big data. Methods Mol. Biol.
org/10.1098/rsif.2005.0038. Clifton NJ 1754, 109–135. https://doi.org/10.1007/978-1-4939-7717-8_7.
Wang, B., Mezlini, A.M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., Yuan, Y., Savage, R.S., Markowetz, F., 2011. Patient-specific data fusion defines
Goldenberg, A., 2014. Similarity network fusion for aggregating data types on a prognostic cancer subtypes. PLoS Comput. Biol. 7, e1002227 https://doi.org/
genomic scale. Nat. Methods 11, 333–337. https://doi.org/10.1038/nmeth.2810. 10.1371/journal.pcbi.1002227.
Wang, M., Wang, C., Han, R.H., Han, X., 2016. Novel advances in shotgun lipidomics for Yue, Z., Meng, D., He, J., Zhang, G., 2017. Semi-supervised learning through adaptive
biology and medicine. Prog. Lipid Res. 61, 83–108. https://doi.org/10.1016/j. Laplacian graph trimming. Image Vis. Comput. Regularizat.Tech. High Dimen. Data
plipres.2015.12.002. Analysis 60, 38–47. https://doi.org/10.1016/j.imavis.2016.11.013.
Wang, T., Shao, W., Huang, Z., Tang, H., Zhang, J., Ding, Z., Huang, K., 2020. Zampieri, M., Sekar, K., Zamboni, N., Sauer, U., 2017. Frontiers of high-throughput
MORONET: multi-omics integration via graph convolutional networks for metabolomics. Curr. Opin. Chem. Biol. Omics 36, 15–23. https://doi.org/10.1016/j.
biomedical data classification. bioRxiv. https://doi.org/10.1101/ cbpa.2016.12.006.
2020.07.02.184705, 2020.07.02.184705. Zhang, S., Liu, C.-C., Li, W., Shen, H., Laird, P.W., Zhou, X.J., 2012. Discovery of multi-
Waring, J., Lindvall, C., Umeton, R., 2020. Automated machine learning: Review of the dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids
state-of-the-art and opportunities for healthcare. Artif. Intell. Med. 104, 101822. Res. 40, 9379–9391. https://doi.org/10.1093/nar/gks725.
https://doi.org/10.1016/j.artmed.2020.101822. Zhang, Q., Burdette, J.E., Wang, J.-P., 2014. Integrative network analysis of TCGA data
Watanabe, K., Yasugi, E., Oshima, M., 2000. How to search the glycolipid data in for ovarian cancer. BMC Syst. Biol. 8, 1338. https://doi.org/10.1186/s12918-014-
“LIPIDBANK for Web” the newly developed lipid database in Japan. Trends Glycosci. 0136-9.
Glycotechnol. 12, 175–184. https://doi.org/10.4052/tigg.12.175. Zhang, L., Lv, C., Jin, Y., Cheng, G., Fu, Y., Yuan, D., Tao, Y., Guo, Y., Ni, X., Shi, T.,
Watt, J., Borhani, R., Katsaggelos, A.K., 2020. Machine Learning Refined: Foundations, 2018. Deep learning-based multi-omics data integration reveals two prognostic
Algorithms, and Applications, Second edition. Cambridge University Press, New subtypes in high-risk neuroblastoma. Front. Genet. 9 https://doi.org/10.3389/
York. fgene.2018.00477.
Weisz Hubshman, M., Broekman, S., van Wijk, E., Cremers, F., Abu-Diab, A., Khateb, S., Zhang, L., Dong, X., Lee, M., Maslov, A.Y., Wang, T., Vijg, J., 2019a. Single-cell whole-
Tzur, S., Lagovsky, I., Smirin-Yosef, P., Sharon, D., Haer-Wigman, L., Banin, E., genome sequencing reveals the functional landscape of somatic mutations in B
Basel-Vanagaite, L., de Vrieze, E., 2018. Whole-exome sequencing reveals POC5 as a lymphocytes across the human lifespan. Proc. Natl. Acad. Sci. 116, 9014–9019.
novel gene associated with autosomal recessive retinitis pigmentosa. Hum. Mol. https://doi.org/10.1073/pnas.1902510116.
Genet. 27, 614–624. https://doi.org/10.1093/hmg/ddx428. Zhang, Y., Wang, B., Jin, W., Wen, Y., Nan, L., Yang, M., Liu, R., Zhu, Y., Wang, C.,
Weng, S.F., Reps, J., Kai, J., Garibaldi, J.M., Qureshi, N., 2017. Can machine-learning Huang, L., Song, X., Wang, Z., 2019b. Sensitive and robust MALDI-TOF-MS
improve cardiovascular risk prediction using routine clinical data? PLoS One 12. glycomics analysis enabled by Girard’s reagent T on-target derivatization (GTOD) of
https://doi.org/10.1371/journal.pone.0174944. reducing glycans. Anal. Chim. Acta 1048, 105–114. https://doi.org/10.1016/j.
Wilkins, M.R., Appel, R.D., 2007. Ten years of the proteome. In: Wilkins, M.R., Appel, R. aca.2018.10.015.
D., Williams, K.L., Hochstrasser, D.F. (Eds.), Proteome Research: Concepts, Zhao, S., Fung-Leung, W.-P., Bittner, A., Ngo, K., Liu, X., 2014. Comparison of RNA-Seq
Technology and Application. Springer, Berlin Heidelberg, Berlin, Heidelberg, and Microarray in Transcriptome Profiling of Activated T Cells. PLoS One 9, e78644.
pp. 1–13. https://doi.org/10.1007/978-3-540-72910-5_1. https://doi.org/10.1371/journal.pone.0078644.
Wishart, D.S., Feunang, Y.D., Marcu, A., Guo, A.C., Liang, K., Vázquez-Fresno, R., Zhao, Q., Shi, X., Xie, Y., Huang, J., Shia, B., Ma, S., 2015. Combining multidimensional
Sajed, T., Johnson, D., Li, C., Karu, N., Sayeeda, Z., Lo, E., Assempour, N., genomic measurements for predicting cancer prognosis: observations from TCGA.
Berjanskii, M., Singhal, S., Arndt, D., Liang, Y., Badran, H., Grant, J., Serra- Brief. Bioinform. 16, 291–303. https://doi.org/10.1093/bib/bbu003.
Cayuela, A., Liu, Y., Mandal, R., Neveu, V., Pon, A., Knox, C., Wilson, M., Manach, C., Zhao, J., Xie, X., Xu, X., Sun, S., 2017. Multi-view learning overview: Recent progress and
Scalbert, A., 2018. HMDB 4.0: the human metabolome database for 2018. Nucleic new challenges. Inf. Fusion 38, 43–54. https://doi.org/10.1016/j.
Acids Res. 46, D608–D617. https://doi.org/10.1093/nar/gkx1089. inffus.2017.02.007.
Wong, A.J., Kanwar, A., Mohamed, A.S., Fuller, C.D., 2016. Radiomics in head and neck Zhou, B., Xiao, J.F., Tuli, L., Ressom, H.W., 2012. LC-MS-based metabolomics. Mol.
cancer: from exploration to application. Transl. Cancer Res. 5, 371–382. https://doi. BioSyst. 8, 470–481. https://doi.org/10.1039/c1mb05350g.
org/10.21037/8805.

23
P.S. Reel et al. Biotechnology Advances 49 (2021) 107739

Zhou, J., He, Z., Yang, Y., Deng, Y., Tringe, S.G., Alvarez-Cohen, L., 2015. High- regulation. PLoS Biol. 10, e1001301 https://doi.org/10.1371/journal.
throughput metagenomic technologies for complex microbial community analysis: pbio.1001301.
open and closed formats. mBio 6. https://doi.org/10.1128/mBio.02288-14. Zhu, W., Xie, L., Han, J., Guo, X., 2020. The application of deep learning in cancer
Zhou, J.T., Pan, S.J., Tsang, I.W., 2019. A deep learning framework for hybrid prognosis prediction. Cancers 12. https://doi.org/10.3390/cancers12030603.
heterogeneous transfer learning. Artif. Intell. 275, 310–328. https://doi.org/ Zierer, J., Pallister, T., Tsai, P.-C., Krumsiek, J., Bell, J.T., Lauc, G., Spector, T.D.,
10.1016/j.artint.2019.06.001. Menni, C., Kastenmüller, G., 2016. Exploring the molecular basis of age-related
Zhou, Y., Hou, Y., Shen, J., Mehra, R., Kallianpur, A., Culver, D.A., Gack, M.U., Farha, S., disease comorbidities using a multi-omics graphical model. Sci. Rep. 6, 37646.
Zein, J., Comhair, S., Fiocchi, C., Stappenbeck, T., Chan, T., Eng, C., Jung, J.U., https://doi.org/10.1038/srep37646.
Jehi, L., Erzurum, S., Cheng, F., 2020. A network medicine approach to investigation Zou, H., 2006. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101,
and population-based validation of disease manifestations and drug repurposing for 1418–1429.
COVID-19. PLoS Biol. 18, e3000970 https://doi.org/10.1371/journal.pbio.3000970. Zou, Q., Chen, L., Huang, T., Zhang, Z., Xu, Y., 2017. Machine learning and graph
Zhu, J., Sova, P., Xu, Q., Dombek, K.M., Xu, E.Y., Vu, H., Tu, Z., Brem, R.B., analytics in computational biomedicine. Artif. Intell. Med. https://doi.org/10.1016/
Bumgarner, R.E., Schadt, E.E., 2012. Stitching together multiple data dimensions j.artmed.2017.09.003.
reveals interacting metabolomic and transcriptomic networks that modulate cell
