Review
Deep Learning in Virtual Screening: Recent Applications
and Developments
Talia B. Kimber †, Yonghui Chen † and Andrea Volkamer *
Abstract: Drug discovery is a cost and time-intensive process that is often assisted by computational
methods, such as virtual screening, to speed up and guide the design of new compounds. For many
years, machine learning methods have been successfully applied in the context of computer-aided
drug discovery. Recently, thanks to the rise of novel technologies as well as the increasing amount of
available chemical and bioactivity data, deep learning has gained a tremendous impact in rational
active compound discovery. Herein, recent applications and developments of machine learning,
with a focus on deep learning, in virtual screening for active compound design are reviewed. This
includes introducing different compound and protein encodings, deep learning techniques as well
as frequently used bioactivity and benchmark data sets for model training and testing. Finally, the
present state-of-the-art, including the current challenges and emerging problems, are examined
and discussed.
Keywords: virtual screening; drug-target interaction; deep learning; protein encoding; ligand encoding
compound, they can narrow the search space down to a few hundred compounds with
desired properties to be further investigated [7].
Nowadays, VS has become an integral part of drug discovery. It is usually imple-
mented in the form of a hierarchical workflow, combining different methods (sequentially
or in parallel) as filters to prioritize potentially active compounds [7,8]. VS methods are
often divided into two major categories: 1. structure-based methods, which focus on the
complementarity of the target binding pocket and the ligand; as well as 2. ligand-based
methods, which rely on the similarity of novel compounds to known active molecules.
Structure-based methods (1) require 3D structural information of both ligand and
protein as a complex or at least of the protein with some knowledge about the binding
site. The most commonly used technique is molecular docking, which predicts one or
several binding pose(s) of a query ligand in the receptor structure and estimates their
binding affinity [9]. While protein-ligand docking shows great ability in enriching likely
active compounds over inactive ones, there are still complications in placing or scoring
the individual poses, some of which can be unmasked by visual inspection [10–13]. Dur-
ing the molecular docking process, thousands of possible ligand poses are generated
based on the target structure and ranked by a scoring function (SF) [14]. There are three
classical types of scoring functions: physics-, empirical-, and knowledge-based [15,16].
Physics-based methods rely on molecular mechanics force fields. In short, non-bonded
interaction terms such as Van der Waals interactions, electrostatics, and hydrogen bonds
are summed. Similarly, empirical SFs sum weighted energy terms. Terms describing, for
example, rotatable bonds or solvent-accessible surface area are also added, and all terms
are parameterized against experimental binding affinities. In contrast, knowledge-based
methods rely on statistical analyses of observed atom pair potentials from protein-ligand
complexes. More recently, new groups of scoring functions were introduced, namely
machine/deep learning-based SFs. One group of models is based on classical SFs which
try to learn the relationship between the interaction terms to predict binding affinity (see
the review by Shen et al. [16]). Other models encode the complex via protein-ligand
interaction fingerprints, grid- or graph-based methods [17]. Such models will be referred
to as complex-based methods throughout this review and discussed in greater detail, see
Figure 1. Note that pharmacophore-based VS has also incorporated machine learning,
and is suitable to screen very large databases, see for example Pharmit [18]. However, these
methods are not the focus of this review and recent developments in the pharmacophore
field are described by Schaller et al. [19].
Ligand-based methods (2), including QSAR (quantitative structure-activity relation-
ship) modeling, molecular similarity search and ligand-based pharmacophores, are rela-
tively mature technologies [20]. Unlike structure-based methods, ligand-based methods
only require ligand information. Note that they are not the focus of this review and the
reader is kindly referred to the respective literature, e.g., [20,21]. Nevertheless, the latter
category can also be enriched by simple protein—mostly sequence-based—information
and is often referred to as proteochemometric (PCM) modeling, which will be further
addressed in this review. PCM combines both ligand and target information within a
single model in order to predict an output variable of interest, such as the activity of a
molecule in a particular biological assay [22,23]. Thus, PCM methods do not only rely on
ligand similarities, but incorporate information from the target they bind to, and have been
found to outperform truly ligand-based methods [24]. Note that in some PCM applications
an additional cross-term is introduced that can describe the interaction between the two
objects [22]. To distinguish the herein described methods, which handle the two objects
individually, we refer to them as pair-based methods, see Figure 1.
Figure 1. Workflows in virtual screening. The first split separates the schemes that contain (1) protein and ligand information
and (2) ligand information only, which are typically used in models for QSAR predictions. For details on solely ligand-based
methods, see for example MoleculeNet [25]. For (1), a second split makes the differences between complex-based and
pair-based models. Complex-based models describe the protein and ligand in a complex, whereas pair-based models
(also PCM in the broader sense) treat the protein and ligand as two independent entities. The latter typically use protein
sequence and molecular SMILES information as input, while the complex-based models use, for example, a 3D grid of the
protein-ligand binding site or interaction fingerprints.
An extensive list of ligand encodings is carefully outlined in the review by Lo et al. [53]. For protein
descriptors, the work by Xu et al. [54] describes common sequence- as well as structure-
based descriptors, embedding representations and possible mutations.
Figure 2. Ligand encoding. Having a computer-readable format is one of the starting points for machine—and deep—
learning. The example molecule is the FDA–approved drug Fasudil [55] taken from the PKIDB database [56]. Recent
studies focused on virtual screening (detailed in Section 3) commonly use SMILES, circular fingerprints or graphs to encode
the ligand.
Graph
The molecular graph can be encoded using two matrices: the first one, called the
feature matrix X, gives a per atom description, where the type of information stored in
each node is decided a priori. Common per atom examples are atomic type and degree [57].
The dimension of X is N × D, where N is the number of nodes, i.e., atoms, in the graph
and D the number of pre-defined features. The second matrix, called connectivity matrix,
describes the structure of the molecule. Its purpose is to illustrate how the nodes are
connected in the graph, i.e., via bonds. Two frequent formats store this information:
1. the adjacency matrix A of dimension N × N, where Aij = 1 if node i is connected to node j and 0 otherwise; and 2. the adjacency list, which stores, for each node, the indices of its connected neighbors.
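As a minimal illustration (not code from any reviewed study), the following Python sketch builds X and A with RDKit and NumPy, assuming an illustrative element vocabulary and the atom degree as the D features:

```python
# Minimal sketch: feature matrix X (N x D) and adjacency matrix A (N x N)
# for a molecular graph. The element vocabulary is a toy assumption.
import numpy as np
from rdkit import Chem

ELEMENTS = ["C", "N", "O", "S", "F"]  # illustrative, task-dependent vocabulary

def encode_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    # Feature matrix X: element one-hot plus atom degree per node.
    X = np.zeros((n, len(ELEMENTS) + 1))
    for atom in mol.GetAtoms():
        i = atom.GetIdx()
        if atom.GetSymbol() in ELEMENTS:
            X[i, ELEMENTS.index(atom.GetSymbol())] = 1.0
        X[i, -1] = atom.GetDegree()
    # Adjacency matrix A: A[i, j] = 1 if atoms i and j share a bond.
    A = np.zeros((n, n))
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        A[i, j] = A[j, i] = 1.0
    return X, A

X, A = encode_graph("C1CC1N")  # cyclopropylamine as a toy input
```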
SMILES
An efficient way of storing information from the molecular graph using string
characters is the simplified molecular input line entry system (SMILES) developed by
Weininger [58]. The main idea behind SMILES is the linearization of the molecular graph
by enumerating the nodes and edges following a certain path. Due to the randomness
in the choice of the starting atom and the path followed along the 2D graph, there exist
several valid SMILES for one molecule [59]. However, it may be desirable to have one
unique SMILES for a given compound, called the canonical SMILES, and most software
have their own canonization algorithm. In order to apply mathematical operations in the
context of machine learning, SMILES still need to be transformed into numerical values,
where both label and one-hot encoding are often used [60,61]. Please find more information
on these encodings below.
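The following hedged sketch illustrates both points: RDKit's canonicalization of a non-canonically written SMILES, and a simple label encoding based on a toy character vocabulary (the vocabulary and maximum length are illustrative assumptions):

```python
# Minimal sketch: canonical SMILES with RDKit and integer label encoding.
from rdkit import Chem

mol = Chem.MolFromSmiles("OCC")      # ethanol, written non-canonically
print(Chem.MolToSmiles(mol))         # RDKit's canonical SMILES, e.g. 'CCO'

# Label encoding: map each character to an integer from a fixed vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(sorted(set("CNOcno()=1#")))}  # toy vocab

def label_encode(smiles: str, max_len: int = 100):
    codes = [VOCAB.get(ch, 0) for ch in smiles]   # 0 = unknown / padding
    return (codes + [0] * max_len)[:max_len]      # pad or truncate to max_len
```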
Circular Fingerprint
Circular fingerprints are, once folded, fixed-length binary vectors that indicate the
presence (encoded by 1) or absence (encoded by 0) of substructures. The recursive
algorithm behind extended-connectivity fingerprints (ECFP) [62] starts with an atom initial-
izer for each node and updates the atom identifiers using a hash function by accumulating
information from neighboring nodes. The substructures which are identified using the
local atom environments correspond to the bits in the fingerprints. A free version of the
algorithm is available in the open-source cheminformatics software RDKit [63] (under the
name of Morgan fingerprints), which will not produce the same results as the original
implementation in Pipeline Pilot [64] due to the difference in the hash functions, but will
yield similar results.
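As a brief illustration, Morgan fingerprints can be generated with RDKit as follows; the radius of 2 (roughly comparable to ECFP4) and the 2048-bit length are common but illustrative choices:

```python
# Minimal sketch: a folded Morgan fingerprint as a fixed-length bit vector.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("Cc1ccccc1")  # toluene as a toy input
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
bits = fp.ToBitString()                # '0'/'1' string, 1 = substructure present
```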
Other Encodings
Ligands are evidently not restricted to these encodings [53]. For example, different
types of fingerprints may be used, such as a physicochemical-based vector, describing the
global properties of the molecule, as in the study by Kundu et al. [65]. Also, the 166-bit long
MACCS keys [66] are a common way to encode molecular compounds as a fingerprint.
Recently, learned fingerprints have also shown to be effective in QSAR predictions [61,67].
Another way of employing the molecular structure as input to machine learning is the 2D
image itself. Rifaioglu et al. [68] use the 200-by-200 pixel 2D image generated directly from
the SMILES using the canonical orientation/depiction as implemented in RDKit [63].
Protein Identifier
A simple way to go beyond models using ligand information only is to include the
identifier (ID) of the protein. Such a descriptor adds no information whatsoever about
the physicochemical properties of the protein, the amino acid composition, nor the 3D
conformation. It is merely a way for a machine learning model to be able to differentiate
several proteins. For example, the one-hot encoding of the protein ID can be used, as in the
study by Sorgenfrei et al. [70].
Protein Sequence
The (full) sequence of a protein, often referred to as the primary structure, is the
enumeration of the amino acids as they appear from the beginning (N-terminus) to end
(C-terminus) of the protein, in which each of the 20 standard amino acids can be encoded as
a single letter. The length of a protein can vary greatly, some of them containing thousands
of amino acid residues. Although the sequence is a compact way of storing information
about the primary structure, it does not give any information about the 3D structure of the
protein. As opposed to the full sequence, it is possible to consider only the sequence of
the binding site, greatly reducing the number of residues.
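A minimal sketch of such a sequence encoding, assuming a fixed maximum length and the 20 standard amino acids, could look as follows:

```python
# Minimal sketch: one-hot encoding of a protein sequence over the 20
# standard amino acids, padded/truncated to a fixed length.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_sequence(seq: str, max_len: int = 1000) -> np.ndarray:
    out = np.zeros((max_len, 20))
    for pos, aa in enumerate(seq[:max_len]):
        if aa in AA_INDEX:            # non-standard residues stay all-zero
            out[pos, AA_INDEX[aa]] = 1.0
    return out

x = one_hot_sequence("MKTAYIAKQR")    # toy N-terminal fragment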
Z-Scales
The z-scale descriptors published in the late 80s by Hellberg et al. [71] are constructed
by considering, for each of the 20 amino acids, 29 physicochemical properties such as
the molecular weight, the logP and the logD (see [70], Table 1). A principal component
analysis (PCA) on the 20 × 29 matrix is performed and the three principal components z1,
z2 and z3 for each of the amino acids are retained. The authors suggest interpreting z1, z2
and z3 as hydrophilicity, bulk and electronic properties, respectively.
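The construction can be sketched with scikit-learn; note that the random matrix below is only a placeholder for the 29 experimental property values per amino acid used by Hellberg et al.:

```python
# Minimal sketch of the z-scale construction: PCA on a 20 x 29 matrix of
# amino acid properties. The random matrix is a placeholder, NOT the
# real experimental data; z1-z3 are the first three component scores.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
properties = rng.normal(size=(20, 29))    # placeholder property matrix

pca = PCA(n_components=3)
z_scales = pca.fit_transform(properties)  # shape (20, 3): z1, z2, z3 per amino acid
```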
The secondary structure is predicted from the sequence using SSpro, developed
by Magnan and Baldi [75]. Neighboring residues are thereby grouped together to form
secondary structure elements. Then, four letters are assigned to each of these elements.
The first letter represents the secondary structure: alpha helix (A), beta sheet (B) or coil (C).
The second letter determines the solvent exposure: N as “not exposed” or E as “exposed”.
The third letter describes the physicochemical properties, i.e., non-polar (G), polar (T),
acidic (D) or basic (K). The last letter represents the length: small (S), medium (M) or
large (L).
Figure 3. Complex encoding. Visual representation of encodings for protein-ligand complexes used in structure-based
virtual screening, exemplified with the drug Fasudil co-crystallized with the ROCK1 kinase (PDB ID: 2esm). 3D grids,
graphs and interaction fingerprints are among popular encodings for complexes, as discussed in Section 3.
Interaction Fingerprint
Interaction fingerprints (IFPs) describe—as the name implies—the interactions be-
tween a protein and a ligand based on a defined set of rules [21,76]. Typically, the IFP is
represented as a bit string, which encodes the presence (1) or absence (0) of interactions
between the ligand and the surrounding protein residues. In most implementations, each
binding site residue is described by the same number of features, which usually include
interaction types such as hydrophobic, hydrogen bond donor and acceptor.
IFPs encoding interaction types: The structural interaction fingerprint (SIFt) [77] de-
scribes the interactions between the ligand and n binding site residues as an n × 7 bit
string. Here, the seven interaction types include whether the residue (i), and more precisely
their main (ii) or side (iii) chain atoms, are in contact with the ligand; whether a polar (iv) or
apolar (v) interaction is involved; and whether the residue provides hydrogen bond accep-
tors (vi) or donors (vii). Similarly, the protein-ligand interaction fingerprint (PyPLIF) [78]
and IChem’s IFP [79] also encode each residue by seven, though slightly different, types,
while PADIF [80] uses the Gold [81] scoring function contributions as interactions types.
The aforementioned IFPs vary in size and are sensitive to the order of the residues,
which limits their application for ML purposes. Thus, SILRID [82], a binding site inde-
pendent and fixed-length IFP was introduced. SILRID generates a 168 long integer vector
per binding site, obtained by summing the bits corresponding to a specific amino acid or
cofactor (20 + 1), while each amino acid is described by eight interaction types.
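As a toy illustration of the SIFt-style layout, the snippet below flattens per-residue interaction flags into one bit string; the flags are hypothetical, hard-coded values that would normally come from a structure analysis tool:

```python
# Toy sketch of a SIFt-like fingerprint: n binding-site residues x 7
# interaction flags, flattened to one bit string.
import numpy as np

# One row per residue: [contact, main chain, side chain, polar, apolar,
#                       H-bond acceptor, H-bond donor]
residue_flags = np.array([
    [1, 1, 0, 1, 0, 1, 0],   # hypothetical contacting residue
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],   # residue not in contact with the ligand
])
sift = residue_flags.flatten()   # n x 7 = 21 bits for n = 3 residues
```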
IFPs including distance bits: To describe the interactions more explicitly, distances
with respect to interaction pairs or triples were introduced. APIF [83], an atom-pair
based IFP, encodes three interaction types for the protein and ligand atoms: hydrophobic
contact, hydrogen bond donor and acceptor. Combinations of these three types lead to
six pairings, including for example a protein acceptor-hydrophobic pair complemented
with a ligand donor-hydrophobic atom-pair. Moreover, for each pairwise interaction
in the active site, the respective receptor and ligand atom distances are measured and
binned into seven ranges. In this way, the total APIF is composed of 6 types × 7 protein
distances × 7 ligand distances = 294 bits. Pharm-IF [84], while using slightly different
interaction type definitions, calculates distances between the pharmacophore features of
their ligand atoms. Finally, triplets between interaction pseudoatoms are introduced in
TIFP [85]. The fingerprint registers the count of unique interaction pseudoatom triplets
encoded by seven properties (such as hydrophobic, aromatic or hydrogen-bond) and the
related distances between them, discretized into six intervals. Redundant and geometrically
invalid triplets are removed, and the fingerprint is pruned to 210 integers representing the
most frequently occurring triplets in the appointed data set.
IFPs including circular fingerprint idea: To become more independent of pre-defined in-
teraction types, circular fingerprint inspired IFPs were introduced by encoding all possible
interaction types (e.g., π–π, CH–π) implicitly via the atom environment. The structural
protein-ligand interaction fingerprint (SPLIF) [86] is constructed using the extended con-
nectivity fingerprint (ECFP, see Section 2.1.1 for more information). For each contacting
protein-ligand atom pair (i.e., distance less than 4.5 Å), the respective protein and ligand
atoms are each expanded to circular fragments using ECFP2 and hashed together into
the fingerprint. Similarly, ECFP is integrated in the protein-ligand extended connectivity
(PLEC) fingerprint [87], where n different bond diameters (called “depth”) for atoms from
protein and ligand are used.
3D Grid
Another type of encoding are 3D grids, in which the protein is embedded into a three-
dimensional Cartesian grid centered on the binding site. Similar to pixel representation in
images, each grid point holds one (or several) values that describe the physicochemical
properties of the complex at this specific position in 3D space. Such grids can, for example,
be unfolded to a 1D floating point array [88] or transformed into a 4D tensor [89] as
input for a DL model. Depending on the implementation, the cubic grids vary in side
length between 16 Å and 32 Å and in grid spacing (resolution), which is usually either
0.5 Å or 1 Å [88–91]. Per grid point attributes can be: 1. simple annotations of atom types or IFPs, as
in AtomNet [88] and DeepAtom [92], 2. physicochemical or pharmacophoric features, as in
Pafnucy [89] and BindScope [93], or 3. energies computed using one or several probe atoms as
in AutoGrid/smina [90,94].
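A minimal voxelization sketch, assuming illustrative grid settings and placeholder atom coordinates and type channels, could look as follows:

```python
# Minimal sketch of a 3D grid encoding: atoms are binned into a cubic grid
# centered on the binding site (16 Å side length, 1 Å spacing here), with
# one channel per atom type. All inputs are placeholders.
import numpy as np

SIZE, SPACING, N_CHANNELS = 16, 1.0, 4        # illustrative settings
n_points = int(SIZE / SPACING)

def voxelize(coords, channels, center):
    grid = np.zeros((N_CHANNELS, n_points, n_points, n_points))
    # Shift coordinates so the grid is centered on the binding site.
    idx = np.floor((coords - center + SIZE / 2) / SPACING).astype(int)
    for (i, j, k), c in zip(idx, channels):
        if 0 <= i < n_points and 0 <= j < n_points and 0 <= k < n_points:
            grid[c, i, j, k] = 1.0            # occupancy; could be a property value
    return grid

coords = np.array([[1.2, -0.5, 3.0], [0.0, 0.1, -2.2]])  # toy atom positions (Å)
grid = voxelize(coords, channels=[0, 2], center=np.zeros(3))
```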
Graph
Although the description of a small molecule as a graph seems natural, the idea
can be adapted to a molecular complex. As in the ligand case (see Section 2.1.1), two
main components have to be considered in the graph description of such protein-ligand
structures: the nodes, with an associated feature vector, and the relationship between
them, usually encoded in matrix form. When considering a complex, the atoms from
both the protein and the ligand can simply be viewed as the nodes of the graph and
the atomic properties can vary depending on the task at hand. Some might consider,
among other characteristics, the one-hot encoded atom type/degree and a binary value
to describe aromaticity, as in [95]. As simple as the node description is for complexes,
the intricacy arises when describing the interactions between the atoms, which should
account for covalent and non-covalent bonds. The focus here will be on two different
ways of describing such structures. The first one, developed by Lim et al. [95], considers
two adjacency matrices A1 and A2 . A1 is constructed in such a way that it only takes
into account covalent bonds, more precisely A1ij = 1 if i, j are covalently connected, and 0
otherwise. A2 , on the other hand, not only captures bonded intramolecular and non-bonded
intermolecular interactions, but also their strength through distances. Mathematically, this
can be translated as follows: if atom i belongs to the ligand, atom j to the protein, and they
live in a neighborhood of 5 Å, then
A2ij = exp(−(dij − µ)² / σ),
where dij is the distance between atoms i and j, and µ and σ are learned parameters.
The closer the distance dij is to µ, the stronger the encoded interaction. If atoms i
and j both belong to either the ligand or the protein, then A2ij = A1ij .
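A single entry of A2 can be sketched as follows; µ and σ are learned in the original model and fixed to toy values here:

```python
# Minimal sketch of the non-covalent adjacency entry of Lim et al. [95]:
# a Gaussian of the protein-ligand atom distance within 5 Å.
import numpy as np

def a2_entry(d_ij: float, mu: float = 1.0, sigma: float = 1.0) -> float:
    if d_ij > 5.0:                 # outside the 5 Å neighborhood
        return 0.0
    return np.exp(-((d_ij - mu) ** 2) / sigma)

print(a2_entry(2.8))  # the closer d_ij is to mu, the stronger the entry
```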
The other graph form of protein-ligand developed by Feinberg et al. [96] consists of
an enlarged adjacency matrix A ∈ ℝ^(N×N×Net), where N is the number of atoms and Net the
number of edge types. Aijk = 1 if atom j is in the neighborhood of atom i and if k is the
bond type between them. If not, that same entry is 0. This scheme numerically encodes the
spatial graph as well as the bonds through edge type.
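A toy construction of such an edge-type tensor (with hypothetical edge types and neighbor pairs) could look as follows:

```python
# Minimal sketch of the extended adjacency tensor A of shape (N, N, N_et)
# used by PotentialNet [96]: one-hot edge-type entries per atom pair.
import numpy as np

N, N_ET = 4, 3                       # 4 atoms, 3 hypothetical edge types
A = np.zeros((N, N, N_ET))

def add_edge(i: int, j: int, edge_type: int):
    A[i, j, edge_type] = A[j, i, edge_type] = 1.0

add_edge(0, 1, 0)   # e.g., covalent single bond
add_edge(1, 2, 2)   # e.g., non-covalent contact within a distance bin
```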
Other Encodings
Moreover, there are also other encoding methods to describe a complex, which will
only be shortly introduced here. Topology-based methods, as reported by Cang and
Wei [97], describe biomolecular data in a simplified manner. The topology thereby deals
with the connectivity of individual parts and characterizes independent entities, rings
and higher dimensional faces. In this way, element-specific topological fingerprints can
retain the 3D biological information and the complex can be represented by an image-like
topological representation (resembling barcodes).
Also, simply the protein-ligand atom pairs together with their distances can be used
as input. In the work by Zhu et al. [98], all atom pair energy contributions are summed,
where the contributions themselves are learned through a neural network considering the
properties of the two atoms and their distances. Similarly, Pereira et al. [99] introduced the
atom context method to represent the environment of the interacting atoms, i.e., atom and
amino acid embeddings.
Figure 4. Deep learning models. Schematic illustration of the neural networks described in Section 2.2. (A) Vanilla neural
network. (B) Multilayer perceptron with three hidden layers (MLP). (C) Convolutional neural network (CNN). (D) Recurrent
neural network (RNN). (E) Graph neural network (GNN). CNNs and GNNs particularly have become very popular in
recent virtual screening studies (see tables in Section 3).
Neural Networks
Neural networks (NNs) [32], also sometimes called artificial neural networks (ANNs),
are models that take as input a set of features on which mathematical computations are
performed that depend on a set of parameters. The sequential computations between the
input and the output are called hidden layers and the final one, the last layer, should account
for the targeted prediction: classification or regression. The information flows through the
network and is monitored by non-linearities called activation functions that determine if or
how much of the information can be passed on to the next layer. The parameters in the
network are optimized using back-propagation ([32], Chapter 6).
A simple example of a neural network connects the input to the output with one
single hidden layer and is sometimes called a “vanilla network” or a “single layer per-
ceptron” [100], in opposition to a multilayer perceptron (MLP) that has more than one
hidden layer. In a single layer perceptron, the hidden layer is composed of a set of nodes
where each input element is connected to every hidden node and every node in the hidden
layer is connected to the output. When all nodes from one layer are connected to the next,
the layer is called fully-connected, or dense. If the network contains only such layers, then the model is referred to as a fully-connected (or dense) network.
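As an illustration, both architectures can be sketched in a few lines of PyTorch; all layer sizes and the sigmoid classification output are illustrative choices, not taken from any reviewed study:

```python
# Minimal sketch: a single-hidden-layer ("vanilla") network and an MLP.
import torch.nn as nn

vanilla = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),    # one hidden layer
    nn.Linear(64, 1), nn.Sigmoid(),   # output: probability of "active"
)

mlp = nn.Sequential(                  # multilayer perceptron: >1 hidden layer
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),                 # regression output, e.g., a pIC50 value
)
```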
More details on graph neural networks can be found in the review by Zhou et al. [108].
In the context of molecular prediction, dozens of examples use GNNs, as summarized in
the review by Wieder et al. [109].
Table 1. Structure and bioactivity data sets. The table lists common labeled data sets used in virtual screening studies.
Freely available data is increasing each year and is an essential element for affinity prediction using machine and deep
learning models. The table summarizes the name, the size and the content covered as well as links to the respective website.
Furthermore, a refined and a core set with higher quality data are extracted from the
general set. In the refined set, the 4852 protein-ligand complexes meet certain quality
criteria (e.g., resolution, R-factor, protein-ligand covalent bonds, ternary complexes or steric
clashes, and type of affinity value). The core set, resulting after further filtering, provides
285 high-quality protein-ligand complexes for validating docking or scoring methods.
BindingDB
BindingDB [112] is a publicly accessible database, which collects experimental protein-
ligand binding data from scientific literature, patents, and other sources. The data extracted by
BindingDB includes not only the affinity, but also the respective experimental conditions
(i.e., assay description). BindingDB contains 2,229,892 data points, i.e., measured binding
affinity for 8499 protein targets and 967,208 compounds, including 2823 protein-ligand
crystal structures with mapped affinity measurements (requiring 100% sequence identity),
as of 1 March 2021 [113].
BindingMOAD
BindingMOAD (Mother of All Databases) [114,115] is another database focused on
providing combined high-quality structural and affinity data, similar to PDBbind. Binding-
MOAD (release 2019) contains 38,702 well-resolved protein-ligand crystal structures, with
ligand annotation and protein classifications, of which 15,964 are linked to experimental
binding affinity data with biologically-relevant ligands.
ChEMBL
ChEMBL [36,117] is a widely used open-access bioactivity database with information
about compounds and their bioassay results extracted from full-text articles, approved
drugs and clinical development reports. The last release, ChEMBL v.28, contains 14,347 tar-
gets and over 17 million activities, which are collected from more than 80,000 publications
and patents [37], alongside deposited data and data exchanged with other databases such
as BindingDB and PubChem BioAssay.
Table 2. Benchmark data sets. Evaluating novel models on labeled benchmark data is crucial for any machine learning task,
including deep learning-based virtual screening. The table depicts some commonly used databases with their respective
size, the origin of the data, provided information (affinity or activity) as well as their availability through websites (accessed
on 18 March 2021).
CASF
The comparative assessment of scoring functions (CASF) benchmark [122] is devel-
oped to monitor the performance of structure-based scoring functions. In the latest version,
CASF-2016, the PDBbind v.2016 core set was incorporated with 285 high-quality protein-
ligand complexes assigned to 57 clusters. Scoring functions can be evaluated by four
metrics: 1. The scoring power, indicating the binding affinity prediction capacity using the
Pearson correlation coefficient R [123]. 2. The ranking power, showing affinity-ranking
capacity using the Spearman correlation coefficient ρ [124,125]. 3. The docking power,
using the root mean square deviation (RMSD) [126] to analyze how well the method has
placed the ligand (pose prediction). 4. The screening power measures the enrichment
factor (EF) [127], showing the ability of the function to prioritize active over inactive
compounds.
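The scoring and ranking metrics can, for instance, be computed with SciPy; the affinity values below are placeholders, not CASF data:

```python
# Minimal sketch: Pearson R (scoring power) and Spearman rho (ranking
# power) on predicted vs. experimental affinities. Placeholder values.
import numpy as np
from scipy.stats import pearsonr, spearmanr

experimental = np.array([6.2, 7.5, 4.9, 8.1, 5.5])   # e.g., pKd values
predicted = np.array([6.0, 7.1, 5.4, 7.8, 5.2])

r, _ = pearsonr(experimental, predicted)     # scoring power
rho, _ = spearmanr(experimental, predicted)  # ranking power
```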
Note that the CASF team has evaluated scoring functions from well-known docking
programs, such as AutoDock vina [128], Gold [81], and Glide [129], and published the
results on their website [122].
DUD(-E)
The directory of useful decoys (DUD) [130] is a virtual screening benchmarking set
providing 2950 ligands for 40 different targets, and 36 decoy molecules per ligand drawn
from ZINC [4]. Decoys, i.e., negative samples, are chosen to have similar physicochemical
properties, but dissimilar 2D topology to the respective active molecules. DUD-E [131] is
an enhanced and rebuilt version of DUD, with 22,886 active compounds and affinity values
against 102 diverse targets. On average, 50 decoys for each active compound are selected.
DUD-E is usually used in classification tasks to benchmark molecular docking programs
with regard to their ability to rank active compounds over inactive ones (decoys).
MUV
The maximum unbiased validation (MUV) data set [132] is based on the PubChem
BioAssay database mostly for ligand-based studies, using refined nearest neighbor analysis
to select actives and inactives, to avoid analogue bias and artificial enrichment. It contains
17 different target data sets, each containing 30 actives and 15,000 inactives. Note that in
contrast to DUD(-E) decoys, the inactives have experimentally validated activities.
3. Recent Developments
In this section, recent developments in virtual screening (VS) are described and specif-
ically how deep learning (DL) helps to improve drug-target binding, i.e., activity/potency
prediction. Our review focuses on methods using protein and ligand information (see
Figure 1), either in form of a protein-ligand complex (complex-based) or considering pro-
tein and ligand as two independent entities (pair-based/PCM). It is imperative to state
that the aim of this section is not to directly compare different studies or models, but to
describe them and put them into context. The list of abbreviations can be found at the end
of this review.
Table 3. Complex-based models. Summary of recent work using a protein-ligand complex for active
molecule or binding affinity prediction. The year of publication, the name of the authors or the model,
the complex encoding and the machine/deep learning model(s) are shown in the respective columns.
Classification (class.) implies predicting, e.g., hit or non-hit, whereas regression (reg.) evaluates, e.g.,
pIC50 values. CNNs, coupled with 3D grids, have become frequent in state-of-the-art studies.
Year    Name    Complex Encoding 1    ML/DL Model    Framework
2010 Sato et al. [84] IFP SVM, RF, MLP class.
2016 Wang et al. [135] IFP Adaboost-SVM class.
2019 Li et al. [136] IFP MLP class.
2018 gnina [90] 3D grid CNN class.
2018 KDEEP [91] 3D grid CNN reg.
2018 Pafnucy [89] 3D grid CNN reg.
2018 DenseFS [137] 3D grid CNN class.
2019 DeepAtom [92] 3D grid CNN reg.
2019 Sato et al. [138] 3D grid CNN class.
2019 Erdas-Cicek et al. [94] 3D grid CNN reg.
2019 BindScope [93] 3D grid CNN class.
2018 PotentialNet [96] graph GGNN reg.
2019 Lim et al. [95] graph GANN class.
2017 TopologyNet [97] topol. CNN reg.
2019 Math-DL [139] topol. GAN, CNN reg.
2018 Cang et al. [140] topol. CNN reg.
2016 DeepVS [99] atom contexts CNN class.
2019 OnionNet [141] atom pairs CNN reg.
2020 Zhu et al. [98] atom pairs MLP reg.
1 Abbreviations: IFP: interaction fingerprints, topol.: algebraic topology.
3D Grid-Based Studies
Many methods using a 3D grid representation of a protein-ligand complex—comparable
to pixels in 3D images—for affinity prediction, have evolved over the last years [88,89,92,93],
especially due to the increased popularity of deep CNNs.
One of the first published models, AtomNet [88] uses a CNN, composed of an input
layer, i.e., the vectorized 3D grids, several 3D convolutional and fully-connected layers,
as well as an output layer, which assigns the probability of the two classes: active and
inactive. Among other data sets, the DUD-E benchmark, consisting of 102 targets, over
20,000 actives and 50 property matched decoys per active compound, was used for evalua-
tion. 72 targets were randomly assigned as training set, the remaining 30 targets as test set
(DUDE-30). For each target, a holo structure from the scPDB [143] is used to place the grid
around the binding site and multiple poses per molecule are sampled. Finally, the grid is
fixed to a side length of 20 Å and a 1 Å grid spacing, in which each grid point holds some
structural feature such as atom-type or IFP. On the DUDE-30 test set, AtomNet achieves
a mean AUC of 0.855 over the 30 targets, thus outperforming the classical docking tool
smina [144] (mean AUC of 0.7). Furthermore, AUC values greater than 0.9 were reported
for 46% of the targets in the DUDE-30 test set.
Similarly, BindScope [93] voxelizes the binding pocket by a 16 Å grid of 1 Å resolution,
molecules are placed using smina, and each voxel is assigned a distance-dependent input
based on eight pharmacophoric feature types [91]. The 3D-CNN model architecture was
adapted from DenseNet [145] and yields a mean AUC of 0.885 on the DUD-E benchmark in
a five-fold cross-validation (folds were assigned based on protein sequence similarity-based
clusters). Comparable AUC values on the DUD-E set were reported by Ragoza et al. [146]
(mean AUC of 0.867), a similar grid-based CNN method, which outperformed AutoDock
vina on 90% of the targets.
DeepAtom [92] uses a 32 Å box with 1 Å resolution and assigns a total of 24 features
to each voxel (11 Arpeggio atom types [147] and an exclusion volume for ligand and
protein respectively) in individual channels to encode the protein-ligand complex. The
PDBbind v.2016 served as baseline benchmark data, split into 290 complexes for testing
and 3767 non-overlapping complexes between the refined and core sets for training and
validation. In particular, each original example gets randomly translated and rotated for
data augmentation, which aims to improve the learning capacity. The performance of the
built 3D-CNN model, trained on the PDBbind refined set, in predicting the affinity for the
core set in a regression setting was reported with a low mean RMSE of 1.318 (R of 0.807)
over five runs. In this case, DeepAtom outperformed RF-Score [148], a classical ML method
(mean RMSE of 1.403), as well as Pafnucy [89] (mean RMSE of 1.553), a similar 3D-CNN
method, trained and applied to the same data using their open-source code. Note that in
the original publication, Pafnucy [89] achieved prediction results with an RMSE of 1.42 on
the PDBbind core set v.2016. In a further study, the training set for DeepAtom was extended
by combining BindingMOAD and PDBbind subsets, resulting in 10,383 complexes. While
the mean RMSE of DeepAtom slightly decreased to 1.232, the R value increased to 0.831
for the PDBbind core set.
The presented examples show the effectiveness of 3D grid-based encodings and CNN
models for affinity prediction, which seem to be well suited to implicitly capture the variety
of information important for ligand-binding. However, disadvantages are the high memory
demand of 3D grids and CNNs, as well as the implicit grid boundary definition to capture
the protein-ligand interactions.
Graph-Based Studies
Graph neural networks have proven to be some of the most effective deep learning
models, easily reaching state-of-the-art performance. In this context, two recent applications
of such models in virtual screening are described.
Lim et al. [95] construct a graph representation of the predicted protein-ligand binding
pose, obtained using smina [144], and train a graph neural network to predict activity.
predict activity. The node feature vector concatenates atomic information from the ligand
and from the protein. The features considered for both are the one-hot encoding of the
following atomic properties: type (10 symbols), degree (6 possibilities for 0 to 5 neighbors),
number of hydrogens (5 entries for 0 to 4 possible attached Hs), implicit valence of elec-
trons (6 entries) and a binary entry for aromaticity. This leads to 28 entries for the ligand,
another 28 for the protein, generating a feature vector of size 56. The 3D information
is encoded in the two matrices A1 and A2 described in Section 2.1.3, for covalent and
non-covalent interactions, respectively. The model applies four layers of GAT (gate aug-
mented graph attention) to both A1 and A2 , before aggregating the node information using
summation. A 128-unit fully-connected layer is then applied to this vector, which leads
to binary activity prediction. DUD-E is used for training and testing the VS performance,
where the training set contains 72 proteins with 15,864 actives and 973,260 inactives and
the test set another 25 proteins with 5841 actives and 364,149 inactives. The AUC value
on the test data set reaches 0.968, which is high compared to the value of 0.689 obtained
with the smina docking tool. The model also obtains better scores than other deep learn-
ing (DL) models such as the CNN-based models AtomNet [88] and the one developed by
Ragoza et al. [146]. The same trend holds for the reported PDBbind data set study. How-
ever, when testing their model and docking results on external data sets such as ChEMBL
and MUV, the performance drops, hinting to the fact that the DL model might not be able
to generalize to the whole chemical space.
The graph convolution family PotentialNet developed by Feinberg et al. [96] predicts
protein-ligand binding at state-of-the-art scales. The atomic features are atom type, formal
charge, hybridization, aromaticity, and the total numbers of bonds, hydrogens (total
and implicit), and radical electrons. The structure between the atoms is described using
A ∈ ℝ^(N×N×Net), the extended representation of an adjacency matrix, as described in
Section 2.1.3. The PotentialNet model uses a Gated Graph Neural Network (GGNN), which
means that unlike GNNs, the update function is a GRU, leading to the new node vector,
depending on its previous state and the message from its neighbors, in a learned manner.
PotentialNet also considers different stages, where stage 1 makes use of only the bonded
part of the adjacency matrix, leading to node updates for connectivity information, stage
2 considers spatial information, and stage 3 sums all node vectors from ligands before
applying a fully-connected layer for binding affinity prediction. The model is trained on
complexes of the PDBbind v.2007 using a subset of size 1095 of the initial refined set for
training, and then tested on the core set of 195 data points. The model reaches a test R2
value of 0.668 and a test R value of 0.822, outperforming RF-Score (R of 0.783) and X-Score
(R of 0.643) ([96], Table 1). However, similar results were reported by the CNN-based
model TopologyNet [97], introduced below.
Other Studies
In MathDL [139] and TopologyNet [97], the complexes—and thus the interactions be-
tween protein and ligand—are encoded using methods from algebraic topology. In MathDL,
advanced mathematical techniques (including geometry, topology and/or graph theory)
are used to encode the physicochemical interactions into lower-dimensional rotational and
translational invariant representations. Several CNNs and GANs (Generative Adversarial
Networks) are trained on the PDBbind v.2018 data set and applied on the data of the D3R
Grand Challenge 4 (GC4), a community-wide blind challenge for compound pose and
binding affinity prediction [149]. The models are among the top performing methods in
pose prediction on the beta secretase 1 (BACE) data set with an RMSD [126] of 0.55 Å and a
high ρ of 0.73 in affinity ranking of 460 Cathepsin S (CatS) compounds. Additionally good
performance was reported on the free energy set of 39 CatS compounds. TopologyNet [97],
a family of multi-channel topological CNNs, represent the protein-ligand complex geome-
try by a 1D topological invariant (using element-specific persistent homology) for affinity
prediction and protein mutation. In the affinity study, a TopologyNet model (TNet-BP) is
trained on the PDBbind v.2007 refined set (excluding the core set) and achieves an R of
0.826 and an RMSE of 1.37 in pKd/pKi units. The pKd and pKi values describe the negative
decimal logarithm of Kd and Ki values, respectively. Thus, TNet-BP seems to outperform
other well-known tools such as AutoDock vina and GlideScore-XP on this data set (note
that the results are adopted from the original study by Li et al. [150]).
DeepBindRG [151] and DeepVS [99] focus on the interacting atom environments in the
complex using atom pair and atom context encodings, respectively. DeepBindRG, a CNN
model trained on PDBbind v.2018 (excluding targets that appear in the respective test set),
achieves good performance on independent data sets such as the CASF-2013 and DUD-E
subsets, with an RMSE varying between 1.6 and 1.8 for a given protein and an R between
0.5 and 0.6. With these values, DeepBindRG performs slightly better than AutoDock vina,
while being in a similar range as Pafnucy [89]. DeepVS, another CNN model, trained and
tested on the DUD data set using leave-one-out cross-validation outperforms, with an AUC
of 0.81, AutoDock vina 1.2 which has an AUC value of 0.62.
Table 4. Pair-based models. The listed models consider information from the protein and the ligand, but the encodings are
built independently of each other. The year of publication, the name of the authors or the model, the ligand and the protein
encodings and the machine/deep learning model(s) are shown in the respective columns. Classification (class.) implies
hit or non-hit, whereas regression (reg.) evaluates an affinity measure, for example pIC50 values. Graphs and associated
GCNNs have become prominent in recent years.
Ligand as SMILES
In 2018, Öztürk et al. [60] proposed the DeepDTA (Deep Drug-Target Binding Affinity
Prediction) regression model which takes the SMILES and a fixed length truncation of
the full protein sequence as features for the ligand and protein, respectively. In the study,
two kinase-focused data sets are used: the Davis data [119] and the KIBA data [120] with
roughly 30,000 and 250,000 data points, respectively. The first reports Kd values, which
represents the dissociation constant, while the second reports Kinase Inhibitor BioActivity
(KIBA) scores, which combines information from IC50 , Ki and Kd measurements. As input
for the CNN, both the SMILES and the protein sequence are label encoded independently.
The authors apply convolutions to the embeddings of each object, before concatenating
them and predicting the pKd value or KIBA score, depending on the data set used. The data
are randomly split into six equal parts where one of them is used as a test set to evaluate the
model and the five remaining compose the folds for cross-validation and parameter tuning.
On the Davis and KIBA test sets, the model exhibits an MSE of 0.261 and 0.194, respectively
(see [60], Tables 3 and 4), which outperforms baselines such as KronRLS [157], a variation
of least squares regression and SimBoost [158], a tree-based gradient boosting method.
The success of the deep learning model could be explained by the use of convolution layers
which are able to extract information from the protein-ligand pair.
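A hedged sketch of such a two-branch architecture in PyTorch is shown below; all sizes (vocabularies, channels, kernel widths) are illustrative and not the published DeepDTA hyperparameters:

```python
# Sketch of a DeepDTA-style two-branch model: 1D convolutions over
# label-encoded SMILES and protein sequences, concatenated for regression.
import torch
import torch.nn as nn

class TwoBranchDTA(nn.Module):
    def __init__(self, smiles_vocab=64, prot_vocab=26, emb=128):
        super().__init__()
        self.smi_emb = nn.Embedding(smiles_vocab, emb, padding_idx=0)
        self.prot_emb = nn.Embedding(prot_vocab, emb, padding_idx=0)
        self.smi_conv = nn.Sequential(
            nn.Conv1d(emb, 32, kernel_size=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.prot_conv = nn.Sequential(
            nn.Conv1d(emb, 32, kernel_size=8), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, smiles_ids, protein_ids):
        # Embed, convolve and pool each branch, then fuse by concatenation.
        s = self.smi_conv(self.smi_emb(smiles_ids).transpose(1, 2)).squeeze(-1)
        p = self.prot_conv(self.prot_emb(protein_ids).transpose(1, 2)).squeeze(-1)
        return self.head(torch.cat([s, p], dim=1))  # predicted pKd / KIBA score
```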
The same authors extended DeepDTA to WideDTA [152]. This time, instead of only
considering the SMILES label encoding for the ligand, substructure information is also
included where a list of the 100,000 most frequent maximum common substructures defined
by Woźniak et al. [159] are used. For the protein description, approximately 500 motifs
and domains are extracted from the PROSITE database [160] and label encoded. The deep
learning architecture is similar to DeepDTA, but WideDTA does achieve slightly better
results, for example an MSE of 0.179 on the KIBA data ([152], Table 5).
In another study, Karimi et al. [74] use SMILES as ligand and structural property
sequences as protein descriptors to predict protein-ligand affinities. The first learning task
is an auto-encoder which aims at giving a latent representation of a compound-target pair.
Once the neural network is trained in an unsupervised setting, the resulting fingerprint
is then fed to recurrent plus convolution layers with an attention mechanism to predict
pIC50 values. The BindingDB data set [161], containing close to 500,000 labeled protein-
compound pairs after curation, is used for their study. After removing four protein classes
as generalization sets, the remaining ∼370,000 pairs are split into train (70%) and test (30%)
sets. On the test set, while RF yields an RMSE of 0.91 and a Pearson’s R of 0.78, the
DeepAffinity model reaches an R of 0.86 and a lower RMSE of 0.73 ([74], Table 2), thus
outperforming conventional methods such as RF. A GCNN is also tested on the graph
encoding of the compounds, but this alternative did not show improvements with respect
to the SMILES notation.
Ligand as Fingerprint
The DL-CPI model suggested by Tian et al. [153] stands for Deep Learning for
Compound-Protein Interactions and applies four fully-connected hidden layers to a 6404
long input binary vector which is the concatenation of compound and protein features;
881 entries for substructure identification in the ligand and another 5523 for Pfam [73]
identified protein domains. Using five-fold cross-validation, the AUC varies between
0.893 and 0.919 depending on the ratio of negative samples in the data set for DL-CPI and
between 0.687 and 0.724 for an RF model ([153], Table 2). The high accuracy performance
is explained by the abstraction coming from the hidden layers of the network.
The study by Kundu et al. [65] compares various types of models trained and tested on
a subset of 2864 instances of the PDBbind v.2015 data set. The 127 long input feature vector
combines features of the protein, such as the percentage of amino acids, the accessible
surface area of the protein, the number of chains, etc., as well as physicochemical (e.g.,
molecular weight, topological surface area, etc.) and structural properties (e.g., ring count)
of the ligand. RF is shown to outperform models such as MLP and SVM in the task of
predicting inhibition constant (Ki ) and dissociation constant (Kd ) values. One possible
reason for these results might originate from the size of the data set: RF models can be very
successful when the available data is small.
The study undertaken by Sorgenfrei et al. [70] focuses on the RF algorithm. They en-
code the ligand with Morgan fingerprints and note that using z-scales descriptors from
the binding site of the protein highly improves the performance of the model compared to
the baseline which only considers the one-hot encoded ID of the target. The data set used
contains over 1,300,000 compound-kinase activities and comes from combined sources
such as ChEMBL and the KIBA data provided by Tang et al. [120]. The activity threshold
for pIC50 values (pIC50 = −log10(IC50)) is set at 6.3. On a test set in which both target
and compound are left out during training, the AUC reaches a value of 0.75 ([70], Table 1),
justifying the usefulness of pair-based/PCM methods in hit identification.
Morgan fingerprints are also used in the study by Lee et al. [154] to represent the ligand,
while the full raw protein sequence is used as protein input. The deep learning model,
DeepConv-DTI, consists of convolutions applied to the embeddings of the full protein
sequence (padded to reach the length of 2500), which are then combined with the ligand
descriptors in fully-connected layers to predict whether the drug is a hit or not. The model
is built on combined data from various open sources, such as DrugBank [162], KEGG [163]
and IUPHAR [164]. After curation and generating negative samples, the training set
contains close to 100,000 data points. The model is externally tested on PubChem where
both protein and compound had not been seen during training. DeepConv-DTI reaches
an accuracy close to 0.8 ([154], Figure 3D) and seems to outperform the DeepDTA model
by Öztürk et al. [60], see ([154], Figure 4).
Ligand as Graph
In the study by Torng and Altman [57], GCNNs are used on both target and ligand.
The residues in the binding pocket of the protein correspond to the nodes and the 480
long feature vector, computed with the program developed by Bagley and Altman [165],
represents their physicochemical properties. The small molecule is also treated as a graph
and properties such as the one-hot encoded atomic element, degree, attached hydrogen(s),
valence(s), and aromaticity, are included in the 62 long feature vector. Graph convolutional
layers are applied to both graphs independently and the resulting vectors from both entities
are concatenated. A fully-connected layer is applied to the concatenated vector to learn the
interaction between the molecule and the target, leading to an interaction vector which is
then used to predict binding or non-binding. The model is trained on the DUD-E data set
and externally tested on MUV, and reaches an AUC value of 0.621, which is better than
results from 3D CNNs, AutoDock vina and RF-Score ([57], Table 2).
The research undertaken by the authors Jiang et al. [155] uses a similar workflow,
where GNNs are applied to both the ligand graph and the protein graph based on the con-
tact map. More precisely, the atomic properties of the ligand nodes are the one-hot encoding
of the element (44 entries), the degree (11 entries), the total (implicit and explicit) number
of attached hydrogens (11 entries), the number of implicit attached hydrogens only (11 en-
tries), and aromaticity (binary value), leading to a vector of length 78. The atomic properties
of the protein include the one-hot encoding of the residue (21 entries), the position-specific
scoring matrix for amino acid substitutions (21 entries), and the binary values of the 12
following properties: being aliphatic, aromatic, polar, acidic, or basic, the weight, three
different dissociation constants, the pH value and hydrophobicity at two different pH
values. The contact map, which can be computed using the PconsC4 tool [166] directly
from the protein sequence, can be used as a proxy for the adjacency matrix. Given the graph
representation for the ligand as well as the protein, three graph convolutional, pooling
layer, and fully-connected layers are subsequently applied to both ligand and protein
independently, then concatenated, and finally the binding affinity is predicted after another
two fully-connected layers. The deep learning model is called DGraphDTA, which stands
for “Double Graph Drug–Target Affinity predictor”. The MSE on the Davis and KIBA data
sets are as low as 0.202 and 0.126, respectively and DGraphDTA seems to give better results
than both DeepDTA and WideDTA, see ([155], Tables 7 and 8). This hints to the fact that
graph representations are well suited for drug-target interaction prediction.
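As an illustration, a binary contact map can be derived from Cα coordinates (randomly generated placeholders below) with a common 8 Å cutoff and used as a proxy for the protein adjacency matrix:

```python
# Minimal sketch: a binary residue-residue contact map from C-alpha
# coordinates; the 8 Å cutoff is a common convention, not from [155].
import numpy as np

def contact_map(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise residue distances
    return (dist < cutoff).astype(float)          # 1 = residues in contact

coords = np.random.default_rng(1).normal(size=(50, 3)) * 10  # toy structure
A = contact_map(coords)
```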
The PADME model, an acronym for “Protein And Drug Molecule interaction prE-
diction” [156], suggests two variants for ligand encoding in the regression context of
drug-target interaction prediction. The ECFP fingerprint (implemented as the Morgan
fingerprint in the code) as well as the graph encoding are used alongside an 8421 long
protein feature vector describing the sequence composition (8420 entries for the amino
acid, dipeptide, and tripeptide composition computed using the propy tool [167]) and
one entry for phosphorylation. The deep learning model is adapted depending on the
encoding of the ligand. In the case of circular fingerprint, the protein and ligand vectors
are concatenated to form a “combined input vector”, on which fully-connected layers are
then applied. In the graph setting, a graph layer is applied to the ligand, resulting in a
vector, which is then again concatenated to the protein vector as in the previous case. The
regression models (either graph or circular fingerprint) consistently outperform baseline
models such as KronRLS and SimBoost. The simulations are run on several kinase data
sets such as Davis [119] and KIBA [120]. Using cross-validation schemes that involve
testing the model on the fold for which no protein was trained on, the RMSE on the KIBA
data with the PADME graph setting is 0.6225 and on the Davis data with the circular
fingerprint setting is 0.5639 ([156], Table 2). This study provides further evidence that
deep learning models could indeed improve drug-target prediction compared to standard
machine learning algorithms.
In this review, the molecular and protein encodings (Section 2.1), the machine learning models (Section 2.2), the data sets (Section 2.3) as well as the model
performances (Section 3) are reported and put in context. These studies show overall very
promising results on typical benchmarks and often outperform the respective classical
approach chosen for comparison, such as docking or more standard machine learning
models. This is also exemplified on the Merck Molecular Activity Kaggle competition data,
where deep neural networks have been shown to routinely perform better than random forest
models [168]. Similarly, in other blind challenges for pose and affinity prediction such as
the D3R grand challenges, deep learning-based methods increasingly make it to the top
ranks ([149], Table 1). One possible reason for such outstanding achievements may be
explained by the way biological entities are encoded: for example, rather than using human-
engineered descriptors, features are learned by the models. Also, novel encodings, such
as voxels (where physicochemical atomic properties are pinned to locations in 3D space)
and graphs (that describe the connectivity, bonded and non-bonded, between the atoms),
seem to capture well the variety of information important for ligand-binding. For example,
DeepAtom [92], a 3D grid-based method where each grid cell is assigned a different
physicochemical property seems well suited to model the complexity of protein-ligand
binding using 3D information. Encoding chemical and biological objects in graph form
also seems to be very fitting, as shown in the study by Lim et al. [95] and the DGraphDTA
model by Jiang et al. [155].
Nevertheless, several challenges still remain open and new ones have also emerged,
including: 1. precision of chemical encoding, 2. generalization of chemical space, 3. lack
of (big and high-quality) data, 4. comparability of models, and 5. interpretability. All of
which will be discussed in the following.
1. Precision of chemical encoding: The better performance of structure-based methods
using ML-based vs. classical SFs is often attributed to the avoidance of a pre-determined
functional form of the protein-ligand complexes, meaning that the precision of the chemical
description does not necessarily lead to more accurate binding affinity prediction [169].
Contributing factors might be associated with: 1. modeling assumptions, where more
precise descriptions may introduce errors. 2. The dependence of encoding and regression
technique: more precise description might produce longer and sparser features which
could be problematic in cases such as RF models. 3. Restriction to data in the bound
state, neglecting contributions from both partners to solvation and induced-fit phenomena,
or missing consideration of conformational heterogeneity, where multiple conformations
might co-exist with different probabilities.
2. Generalization of chemical space: As mentioned in the work by Lim et al. [95], although
some deep learning models perform outstandingly well, there seems to still exist some
issues exploring the whole chemical space, a challenge also occurring in classical machine
learning methods. Some less successful results have been detected when evaluating
some models on external data sets, showing that since the data used for training is not
representative of the immense chemical space, the model, instead of learning and exploring
it, is rather memorizing patterns from it [170].
3. Lack of (big and high-quality) data: Deep learning is very data greedy and usually,
the bigger the training set is, the better the results. Goodfellow et al. [32] suggest that
a model trained on a data set on the order of 10 million examples may surpass human
performance. However, as previously discussed, biochemical data are still considerably
smaller than, for example, image or video data sets. Therefore, depending on the data
at hand, choosing more standard machine learning approaches, or more shallow neural
networks, that require less parameter training, may perform just as well. Examples are
shown in the studies by Kundu et al. [65], which employs random forest for activity
prediction, or by Göller et al. [171], which summarizes DL and ML models for ADMETox
predictions. Another alternative is to find a way to acquire more data, through, for example,
data augmentation. In image classification, this can be done using image rotating, cropping,
recoloring, etc., which can be adapted to virtual screening tasks. The Pharm-IF method [84]
performs better with more crystal structures or by employing additional docking poses.
DeepAtom [92] translates and rotates the protein-ligand complex to gain more training
data. In QSAR predictions, using SMILES augmentation has also become popular as a means
to enlarge the training set [59,61]. Note that not only the quantity of data, but also its
quality is often unsatisfactory, such as low resolution of crystal structures or relying on
docked poses, as well as activity data taken from various experiments (and conditions)
providing different measurements, e.g., Kd , Ki , IC50 or EC50 (which is the measured half
maximal effective concentration of a drug).
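As a brief illustration of the SMILES augmentation mentioned above, randomized (non-canonical) SMILES of the same molecule can be enumerated with RDKit:

```python
# Minimal sketch: SMILES augmentation by enumerating randomized SMILES.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as a toy input
augmented = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(10)}
```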
4. Comparability of models, benchmark data, open-source: When reviewing a multitude of studies, the wish to compare and rank them is understandable, but doing so is unreasonable for several reasons, starting with the data and the splits. Models that have been trained on different data or even on different tasks can hardly be compared; a regression task and a classification task, even when using similar performance metrics, are not analogous. Assuming that the models do use the same data, if the splits are different, then the evaluations can no longer be directly compared (see the sketch after this paragraph). Assuming now that the splits are identical, if the metrics used are different, then again no fair comparison can be made, as pointed out by Feinberg et al. [96]. This means that there is a chain of elements that has to be considered before blindly comparing and ranking methods. To this end, two major elements become crucial: 1. open-source data and 2. open-source code.
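As a small demonstration of the split issue, the sketch below evaluates the same model type on the same (synthetic) data under two different random splits; the resulting test errors differ, so reported numbers are only comparable when the split is fixed. This is an illustrative example assuming scikit-learn is available; all data and names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for descriptor/affinity data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = 2.0 * X[:, 0] + rng.normal(size=500)

for seed in (1, 2):  # two different random splits of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"split seed {seed}: test MSE = {mse:.3f}")
```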
Having benchmark data sets freely available, together with a code basis, such as MoleculeNet [25], TDC [133] or the work by Riniker and Landrum [134], and regularly updated resources, such as ChEMBL, is highly beneficial for academic research and method publication. Moreover, having access to the source code of newly developed methods and being able to reproduce results is also becoming more and more essential in the field, especially as the number of developed models grows (as embraced by the FAIR principles [172]).
Moreover, while several data sets are available to benchmark the performance of different approaches in VS, Sieg et al. [121] recently elaborated on the need for bias control in ML-based virtual screening studies. Several types of biases exist. For example, domain bias may be due to insufficient generalization, as discussed above, but is still acceptable if the models are applied within a narrow chemical space. Non-causal bias, in contrast, is dangerous: there is correlation, but no causation. Focusing mainly on structure-based ML models for VS on DUD, DUD-E and MUV, Sieg et al. [121] found that small molecule features dominated the predictions across dissimilar proteins even when structure-based methods/descriptors were used. Thus, special care needs to be taken when evaluating methods and descriptors on benchmark sets, checking whether the compilation protocol of the benchmark is suited for the context of the methodology. In another study, Chen et al. [173] also reported hidden analogue and decoy bias in the DUD-E database that may lead to misleadingly superior performance of CNN models during VS. Thus, there is an urgent need for bias control in benchmarking data sets, especially for structure-enabled ML-based VS.
5. Interpretability: With the rise of deep learning, the complexity of the architectures, and the depth of the models comes the issue of interpretability. Such models are often considered black boxes, and understanding the mechanisms in the hidden layers is a challenge. However, research undertaken in this direction aims at deciphering what the algorithm has learned [101,174], which may also be important for detecting bias in the data [121].
For further considerations on the type, quality and quantity of the data, as well as on the challenges for DL models built thereon to impact different areas of drug discovery, the reader is kindly referred to two recent reviews by Bender and Cortés-Ciriano [175,176].
In this work, the recent progress in DL-based VS methods has been reviewed, exemplifying the boost in development and application over the last few years. While some challenges remain, for example regarding data coverage, unbiased evaluation sets, molecular encoding, and the modeling of the underlying biological protein-ligand binding event, the reported results show the unprecedented advances in the field.
Author Contributions: Note this is a review article. Investigation, T.B.K., Y.C., A.V.; writing—
original draft preparation, T.B.K., Y.C., A.V.; writing—review and editing, T.B.K., Y.C., A.V.; visualiza-
tion, T.B.K., Y.C., A.V.; supervision, A.V. All authors have read and agreed to the published version
of the manuscript.
Funding: The authors received funding from the Stiftung Charité under the Einstein BIH Visiting
Fellow Project, and the China Scholarship Council Project (Grant Number: 201906210079).
Data Availability Statement: The Python code to generate most components of the figures in
the review is available on GitHub at https://github.com/volkamerlab/DL_in_VS_review, using
packages such as RDKit [63], NGLview [177], the Open Drug Discovery Toolkit (ODDT) [178] and
PyMOL [179].
Acknowledgments: The authors would like to thank Dominique Sydow for the highly beneficial
discussion concerning protein encodings and classification, as well as Maxime Gagnebin for his
valuable input on deep learning. We also thank Jaime Rodríguez-Guerra and David Schaller for their
feedback on the pre-final version of the review. We acknowledge support from the German Research
Foundation and the Open Access Publication Fund of Charité-Universitätsmedizin Berlin.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
General
HTS High-throughput screening
VS Virtual screening
SF Scoring function
QSAR Quantitative structure–activity relationship
PCM Proteochemometric
FDA Food and Drug Administration
Machine learning
ML Machine learning
DL Deep learning
SVM Support vector machine
RF Random forest
NN Neural network
ANN Artificial neural network
MLP Multilayer perceptron
CNN Convolutional neural network
RNN Recurrent neural network
GNN Graph neural network
GCNN Graph convolutional neural network
GRU Gated recurrent unit
GGNN Gated graph neural network
GANN Graph attention neural network
GAN Generative adversarial network
Encoding
SMILES Simplified molecular input line entry system
ECFP Extended-connectivity fingerprint
MCS Maximum common substructure
IPF Interaction fingerprint
ID Identifier
Metrics
MSE Mean squared error
RMSE Root mean squared error
RMSD Root mean square deviation
ROC Receiver operating characteristic
AUC Area under the ROC curve
EF Enrichment factor
Data
PDB Protein Data Bank
KIBA Kinase inhibitor bioactivity
MUV Maximum unbiased validation
TDC Therapeutics Data Commons
DUD Directory of useful decoys
CASF Comparative assessment of scoring functions
Appendix A. Figures
Figure A1. Encoding and padding. Starting from a SMILES string, the characters can be stored in a dictionary (or the dictionary can be constructed beforehand using a set of known characters). Label encoding consists of enumerating the characters in the dictionary. One-hot encoding consists of assigning a binary vector to each character. (a) Constructing the label encoding for a given SMILES by assigning the integer associated with each character as it appears in the SMILES. (b) Constructing the one-hot encoding by concatenating the binary vectors of the characters as they appear in the SMILES. (Pad) Inputs of the same dimension are often required when using machine learning. A common solution is padding, which consists of adding zeros to either the label vector or the one-hot matrix up to the maximum length of the SMILES in the data set.
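To complement Figure A1, the following minimal sketch shows one possible implementation of label encoding, one-hot encoding and zero-padding of SMILES strings; the toy data set and variable names are illustrative only.

```python
import numpy as np

smiles_list = ["CCO", "c1ccccc1O"]  # toy data set

# Build the character dictionary from the data set (index 0 reserved for padding)
chars = sorted(set("".join(smiles_list)))
char_to_int = {c: i + 1 for i, c in enumerate(chars)}
max_len = max(len(s) for s in smiles_list)

def label_encode(smiles):
    """Integer vector of character indices, zero-padded to max_len."""
    vec = np.zeros(max_len, dtype=int)
    for i, c in enumerate(smiles):
        vec[i] = char_to_int[c]
    return vec

def one_hot_encode(smiles):
    """Binary matrix of shape (max_len, vocabulary size); padded rows stay zero."""
    mat = np.zeros((max_len, len(chars)), dtype=int)
    for i, c in enumerate(smiles):
        mat[i, char_to_int[c] - 1] = 1
    return mat

print(label_encode("CCO"))          # [2 2 3 0 0 0 0 0 0]
print(one_hot_encode("CCO").shape)  # (9, 4)
```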
Appendix B. Metrics
In the following, the typical metrics used in the classification framework are first introduced, then the typical metrics used in the regression framework are described and finally the metrics that can be applied in both cases.
The accuracy (Acc) is defined by
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FN + FP},$$
where TP, TN, FN, FP are the true positives, true negatives, false negatives and false positives, respectively.
The enrichment factor (EF) [127] is a measure for evaluating screening efficiency. At a pre-defined sampling percentage χ, $\mathrm{EF}_{\chi\%}$ shows the proportion of true active compounds in the sampling set in relation to the proportion of true active compounds in the whole data set and is defined by
$$\mathrm{EF}_{\chi\%} = \frac{n_s / N_s}{n / N},$$
where $N$ is the number of compounds in the entire data set, $n$ the number of compounds in the sampling set, $N_s$ the number of true active compounds in the entire data set, and $n_s$ the number of true active compounds in the sampling set.
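A minimal sketch of how $\mathrm{EF}_{\chi\%}$ can be computed from ranked screening scores is given below, assuming binary activity labels (1 = active); names and data are illustrative.

```python
import numpy as np

def enrichment_factor(scores, labels, fraction):
    """EF at a given sampling fraction (e.g., fraction=0.01 for EF1%)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    N = len(labels)                          # compounds in the entire data set
    n = max(1, int(np.ceil(fraction * N)))   # compounds in the sampling set
    order = np.argsort(-scores)              # rank by decreasing score
    n_s = labels[order[:n]].sum()            # true actives in the sampling set
    N_s = labels.sum()                       # true actives in the entire data set
    return (n_s / N_s) / (n / N)

# Toy example: a perfect ranking of 2 actives among 10 compounds
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(scores, labels, fraction=0.2))  # 5.0, the maximum (N / N_s)
```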
The coefficient of determination $R^2$ [183,184] is defined by
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$
where
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$
The closer $R^2$ is to 1, the better the fit; $R^2 = 0$ when the model always predicts the mean $\bar{y}$.
The Pearson's correlation coefficient $R$ [123,155,185] is also often used; it is defined by
$$R = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_i (y_i - \bar{y})^2} \sqrt{\sum_i (\hat{y}_i - \bar{\hat{y}})^2}},$$
where $\bar{\hat{y}}$ denotes the mean of the predicted values.
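A minimal sketch computing $R^2$ and Pearson's $R$ with NumPy, following the definitions above, could look as follows; the data are made up for illustration.

```python
import numpy as np

y_true = np.array([6.2, 7.1, 5.4, 8.0, 6.9])  # e.g., experimental affinities
y_pred = np.array([6.0, 7.3, 5.9, 7.6, 6.5])  # model predictions

# Coefficient of determination
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson's correlation coefficient
r_pearson = np.corrcoef(y_true, y_pred)[0, 1]

print(f"R^2 = {r_squared:.3f}, Pearson R = {r_pearson:.3f}")
```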
References
1. Berdigaliyev, N.; Aljofan, M. An overview of drug discovery and development. Future Med. Chem. 2020, 12, 939–947. [CrossRef]
[PubMed]
2. Butkiewicz, M.; Wang, Y.; Bryant, S.; Lowe, E., Jr.; Weaver, D.; Meiler, J. High-Throughput Screening Assay Datasets from the
PubChem Database. Chem. Inform. (Wilmington Del.) 2017, 3. [CrossRef] [PubMed]
3. Walters, W.; Stahl, M.T.; Murcko, M.A. Virtual screening—An overview. Drug Discov. Today 1998, 3, 160–178. [CrossRef]
4. Sterling, T.; Irwin, J.J. ZINC 15–Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337. [CrossRef]
5. MolPORT. Available online: https://www.molport.com (accessed on 2 March 2021).
6. Enamine REAL. Available online: https://enamine.net/library-synthesis/real-compounds (accessed on 2 March 2021).
7. Scior, T.; Bender, A.; Tresadern, G.; Medina-Franco, J.L.; Martínez-Mayorga, K.; Langer, T.; Cuanalo-Contreras, K.; Agrafiotis, D.K.
Recognizing Pitfalls in Virtual Screening: A Critical Review. J. Chem. Inf. Model. 2012, 52, 867–881. [CrossRef]
8. Kumar, A.; Zhang, K.Y. Hierarchical virtual screening approaches in small molecule drug discovery. Methods 2015, 71, 26–37.
[CrossRef]
9. Brooijmans, N.; Kuntz, I.D. Molecular Recognition and Docking Algorithms. Annu. Rev. Biophys. Biomol. Struct. 2003, 32, 335–373.
[CrossRef]
10. Sulimov, V.B.; Kutov, D.C.; Sulimov, A.V. Advances in Docking. Curr. Med. Chem. 2020, 26, 7555–7580. [CrossRef]
11. Fischer, A.; Smieško, M.; Sellner, M.; Lill, M.A. Decision Making in Structure-Based Drug Discovery: Visual Inspection of Docking
Results. J. Med. Chem. 2021, 64, 2489–2500. [CrossRef]
12. Klebe, G. Virtual ligand screening: Strategies, perspectives and limitations. Drug Discov. Today 2006, 11, 580–594. [CrossRef]
13. Kolodzik, A.; Schneider, N.; Rarey, M. Structure-Based Virtual Screening. In Applied Chemoinformatics; John Wiley & Sons, Ltd.:
Hoboken, NJ, USA, 2018; Chapter 6.8, pp. 313–331. [CrossRef]
14. Pagadala, N.S.; Syed, K.; Tuszynski, J. Software for molecular docking: A review. Biophys. Rev. 2017, 9, 91–102. [CrossRef]
15. Li, J.; Fu, A.; Zhang, L. An Overview of Scoring Functions Used for Protein–Ligand Interactions in Molecular Docking. Interdiscip.
Sci. Comput. Life Sci. 2019, 11, 320–328. [CrossRef]
16. Shen, C.; Ding, J.; Wang, Z.; Cao, D.; Ding, X.; Hou, T. From machine learning to deep learning: Advances in scoring functions for
protein–ligand docking. WIREs Comput. Mol. Sci. 2019, 10. [CrossRef]
17. Ain, Q.U.; Aleksandrova, A.; Roessler, F.D.; Ballester, P.J. Machine-learning scoring functions to improve structure-based binding
affinity prediction and virtual screening. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2015, 5, 405–424. [CrossRef]
18. Sunseri, J.; Koes, D.R. Pharmit: Interactive exploration of chemical space. Nucleic Acids Res. 2016, 44, W442–W448. [CrossRef]
19. Schaller, D.; Šribar, D.; Noonan, T.; Deng, L.; Nguyen, T.N.; Pach, S.; Machalz, D.; Bermudez, M.; Wolber, G. Next generation 3D
pharmacophore modeling. WIREs Comput. Mol. Sci. 2020, 10, e1468. [CrossRef]
20. Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inform. 2010, 29, 476–488. [CrossRef]
21. Sydow, D.; Burggraaff, L.; Szengel, A.; van Vlijmen, H.W.T.; IJzerman, A.P.; van Westen, G.J.P.; Volkamer, A. Advances and
Challenges in Computational Target Prediction. J. Chem. Inf. Model. 2019, 59, 1728–1742. [CrossRef]
22. Lapinsh, M.; Prusis, P.; Gutcaits, A.; Lundstedt, T.; Wikberg, J.E. Development of proteo-chemometrics: A novel technology for
the analysis of drug-receptor interactions. Biochim. Biophys. Acta (BBA) Gen. Subj. 2001, 1525, 180–190. [CrossRef]
23. Van Westen, G.J.P.; Wegner, J.K.; IJzerman, A.P.; van Vlijmen, H.W.T.; Bender, A. Proteochemometric modeling as a tool to design
selective compounds and for extrapolating to novel targets. Med. Chem. Commun. 2011, 2, 16–30. [CrossRef]
24. Geppert, H.; Humrich, J.; Stumpfe, D.; Gärtner, T.; Bajorath, J. Ligand Prediction from Protein Sequence and Small Molecule
Information Using Support Vector Machines and Fingerprint Descriptors. J. Chem. Inf. Model. 2009, 49, 767–779. [CrossRef]
25. Wu, Z.; Ramsundar, B.; Feinberg, E.N.; Gomes, J.; Geniesse, C.; Pappu, A.S.; Leswing, K.; Pande, V. MoleculeNet: A benchmark
for molecular machine learning. Chem. Sci. 2018, 9, 513–530. [CrossRef]
26. Oladipupo, T. Types of Machine Learning Algorithms; IntechOpen: London, UK, 2010. [CrossRef]
27. Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms; Technical Report; Cornell Aeronautical
Lab Inc.: Buffalo, NY, USA, 1961.
28. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
29. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
30. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
31. Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive
Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 59–64. [CrossRef]
32. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
33. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al.
The Open Images Dataset V4. Int. J. Comput. Vis. 2020, 128, 1956–1981. [CrossRef]
34. LeCun, Y.; Cortes, C. MNIST Handwritten Digit Database. 2010. Available online: http://yann.lecun.com/exdb/mnist/
(accessed on 2 March 2021).
35. Kaggle. Available online: https://www.kaggle.com/ (accessed on 2 March 2021).
36. Mendez, D.; Gaulton, A.; Bento, A.P.; Chambers, J.; De Veij, M.; Félix, E.; Magariños, M.P.; Mosquera, J.F.; Mutowo, P.; Nowotka,
M.; et al. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 2018, 47, D930–D940. [CrossRef]
37. ChEMBL. Available online: https://www.ebi.ac.uk/chembl/ (accessed on 2 March 2021).
38. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The Protein Data Bank.
Nucleic Acids Res. 2000, 28, 235–242. [CrossRef]
39. Burley, S.K.; Bhikadiya, C.; Bi, C.; Bittrich, S.; Chen, L.; Crichlow, G.V.; Christie, C.H.; Dalenberg, K.; Di Costanzo, L.; Duarte,
J.M.; et al. RCSB Protein Data Bank: Powerful new tools for exploring 3D structures of biological macromolecules for basic and
applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic
Acids Res. 2020, 49, D437–D451. [CrossRef]
40. RCSB PDB. Available online: http://www.rcsb.org/stats/growth/growth-released-structures (accessed on 2 March 2021).
41. Berman, H.M.; Vallat, B.; Lawson, C.L. The data universe of structural biology. IUCrJ 2020, 7, 630–638. [CrossRef]
42. Helliwell, J.R. New developments in crystallography: Exploring its technology, methods and scope in the molecular biosciences.
Biosci. Rep. 2017, 37. [CrossRef] [PubMed]
43. Ajay; Walters, W.P.; Murcko, M.A. Can We Learn to Distinguish between “Drug-like” and “Nondrug-like” Molecules? J. Med.
Chem. 1998, 41, 3314–3324. [CrossRef]
44. Burden, F.R.; Winkler, D.A. Robust QSAR Models Using Bayesian Regularized Neural Networks. J. Med. Chem. 1999,
42, 3183–3187. [CrossRef] [PubMed]
45. Burden, F.R.; Ford, M.G.; Whitley, D.C.; Winkler, D.A. Use of Automatic Relevance Determination in QSAR Studies Using
Bayesian Neural Networks. J. Chem. Inf. Comput. Sci. 2000, 40, 1423–1430. [CrossRef] [PubMed]
46. Baskin, I.I.; Winkler, D.; Tetko, I.V. A renaissance of neural networks in drug discovery. Expert Opin. Drug Discov. 2016,
11, 785–795. [CrossRef]
47. Carpenter, K.A.; Cohen, D.S.; Jarrell, J.T.; Huang, X. Deep learning and virtual drug screening. Future Med. Chem. 2018,
10, 2557–2567. [CrossRef]
48. Ellingson, S.R.; Davis, B.; Allen, J. Machine learning and ligand binding predictions: A review of data, methods, and obstacles.
Biochim. Biophys. Acta (BBA) Gen. Subj. 2020, 1864, 129545. [CrossRef]
49. D’Souza, S.; Prema, K.; Balaji, S. Machine learning models for drug–target interactions: current knowledge and future directions.
Drug Discov. Today 2020, 25, 748–756. [CrossRef]
50. Li, H.; Sze, K.H.; Lu, G.; Ballester, P.J. Machine-learning scoring functions for structure-based drug lead optimization. WIREs
Comput. Mol. Sci. 2020, 10. [CrossRef]
51. Li, H.; Sze, K.H.; Lu, G.; Ballester, P.J. Machine-learning scoring functions for structure-based virtual screening. WIREs Comput.
Mol. Sci. 2020, 11. [CrossRef]
52. Rifaioglu, A.S.; Atas, H.; Martin, M.J.; Cetin-Atalay, R.; Atalay, V.; Doğan, T. Recent applications of deep learning and machine
intelligence on in silico drug discovery: Methods, tools and databases. Briefings Bioinform. 2018, 20, 1878–1912. [CrossRef]
53. Lo, Y.C.; Rensi, S.E.; Torng, W.; Altman, R.B. Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 2018,
23, 1538–1546. [CrossRef]
54. Xu, Y.; Verma, D.; Sheridan, R.P.; Liaw, A.; Ma, J.; Marshall, N.M.; McIntosh, J.; Sherer, E.C.; Svetnik, V.; Johnston, J.M. Deep Dive
into Machine Learning Models for Protein Engineering. J. Chem. Inf. Model. 2020, 60, 2773–2790. [CrossRef]
55. Bond, J.E.; Kokosis, G.; Ren, L.; Selim, M.A.; Bergeron, A.; Levinson, H. Wound Contraction Is Attenuated by Fasudil Inhibition
of Rho-Associated Kinase. Plast. Reconstr. Surg. 2011, 128, 438e–450e. [CrossRef]
56. Carles, F.; Bourg, S.; Meyer, C.; Bonnet, P. PKIDB: A Curated, Annotated and Updated Database of Protein Kinase Inhibitors in
Clinical Trials. Molecules 2018, 23, 908. [CrossRef]
57. Torng, W.; Altman, R.B. Graph Convolutional Neural Networks for Predicting Drug-Target Interactions. J. Chem. Inf. Model. 2019,
59, 4131–4149. [CrossRef]
58. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem.
Inf. Comput. Sci. 1988, 28, 31–36. [CrossRef]
59. Bjerrum, E.J. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv 2017,
arXiv:1703.07076.
60. Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics 2018, 34, i821–i829.
[CrossRef]
61. Kimber, T.B.; Engelke, S.; Tetko, I.V.; Bruno, E.; Godin, G. Synergy Effect between Convolutional Neural Networks and the
Multiplicity of SMILES for Improvement of Molecular Prediction. arXiv 2018, arXiv:1812.04439.
62. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. [CrossRef]
63. RDKit: Open-source cheminformatics. Available online: http://www.rdkit.org (accessed on 2 March 2021).
64. Hassan, M.; Brown, R.D.; Varma-O’Brien, S.; Rogers, D. Cheminformatics analysis and learning in a data pipelining environment.
Mol. Divers. 2006, 10, 283–299. [CrossRef] [PubMed]
65. Kundu, I.; Paul, G.; Banerjee, R. A machine learning approach towards the prediction of protein-ligand binding affinity based on
fundamental molecular properties. RSC Adv. 2018, 8, 12127–12137. [CrossRef]
66. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput.
Sci. 2002, 42, 1273–1280. [CrossRef] [PubMed]
67. Winter, R.; Montanari, F.; Noé, F.; Clevert, D.A. Learning continuous and data-driven molecular descriptors by translating
equivalent chemical representations. Chem. Sci. 2019, 10, 1692–1701. [CrossRef]
68. Rifaioglu, A.S.; Nalbat, E.; Atalay, V.; Martin, M.J.; Cetin-Atalay, R.; Doğan, T. DEEPScreen: High performance drug–target
interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem. Sci. 2020,
11, 2531–2557. [CrossRef]
69. Murray, R.K.; Bender, D.A.; Botham, K.M.; Kennelly, P.J.; Rodwell, V.W.; Weil, P.A. Harper’s Illustrated Biochemistry, Twenty-Eighth
Edition; McGraw-Hill Medical McGraw-Hill Distributor: New York, NY, USA, 2009.
70. Sorgenfrei, F.A.; Fulle, S.; Merget, B. Kinome-wide profiling prediction of small molecules. ChemMedChem 2018, 13, 495–499.
[CrossRef]
71. Hellberg, S.; Sjoestroem, M.; Skagerber, B.; Wold, S. Peptide quantitative structure-activity relationships, multivariate approach.
J. Med. Chem. 1987, 30, 1126–1135. [CrossRef]
72. Sigrist, C.J.A.; de Castro, E.; Cerutti, L.; Cuche, B.A.; Hulo, N.; Bridge, A.; Bougueleret, L.; Xenarios, I. New and continuing
developments at PROSITE. Nucleic Acids Res. 2012, 41, D344–D347. [CrossRef]
73. Finn, R.D.; Bateman, A.; Clements, J.; Coggill, P.; Eberhardt, R.Y.; Eddy, S.R.; Heger, A.; Hetherington, K.; Holm, L.; Mistry, J.; et al.
Pfam: The protein families database. Nucleic Acids Res. 2013, 42, D222–D230. [CrossRef]
74. Karimi, M.; Wu, D.; Wang, Z.; Shen, Y. DeepAffinity: Interpretable deep learning of compound–protein affinity through unified
recurrent and convolutional neural networks. Bioinformatics 2019, 35, 3329–3338. [CrossRef]
75. Magnan, C.N.; Baldi, P. SSpro/ACCpro 5: Almost perfect prediction of protein secondary structure and relative solvent
accessibility using profiles, machine learning and structural similarity. Bioinformatics 2014, 30, 2592–2597. [CrossRef] [PubMed]
76. De Freitas, R.F.; Schapira, M. A systematic analysis of atomic protein–ligand interactions in the PDB. MedChemComm 2017,
8, 1970–1981. [CrossRef] [PubMed]
77. Deng, Z.; Chuaqui, C.; Singh, J. Structural Interaction Fingerprint (SIFt): A Novel Method for Analyzing Three-Dimensional
Protein-Ligand Binding Interactions. J. Med. Chem. 2004, 47, 337–344. [CrossRef] [PubMed]
78. Radifar, M.; Yuniarti, N.; Istyastono, E.P. PyPLIF: Python-based Protein-Ligand Interaction Fingerprinting. Bioinformation 2013,
9, 325–328. [CrossRef]
79. DaSilva, F.; Desaphy, J.; Rognan, D. IChem: A Versatile Toolkit for Detecting, Comparing, and Predicting Protein-Ligand
Interactions. ChemMedChem 2017, 13, 507–510. [CrossRef]
80. Jasper, J.B.; Humbeck, L.; Brinkjost, T.; Koch, O. A novel interaction fingerprint derived from per atom score contributions:
Exhaustive evaluation of interaction fingerprint performance in docking based virtual screening. J. Cheminform. 2018, 10.
[CrossRef]
81. Verdonk, M.L.; Cole, J.C.; Hartshorn, M.J.; Murray, C.W.; Taylor, R.D. Improved protein-ligand docking using GOLD. Proteins
Struct. Funct. Bioinform. 2003, 52, 609–623. [CrossRef]
82. Chupakhin, V.; Marcou, G.; Gaspar, H.; Varnek, A. Simple Ligand–Receptor Interaction Descriptor (SILIRID) for alignment-free
binding site comparison. Comput. Struct. Biotechnol. J. 2014, 10, 33–37. [CrossRef]
83. Pérez-Nueno, V.I.; Rabal, O.; Borrell, J.I.; Teixidó, J. APIF: A New Interaction Fingerprint Based on Atom Pairs and Its Application
to Virtual Screening. J. Chem. Inf. Model. 2009, 49, 1245–1260. [CrossRef]
84. Sato, T.; Honma, T.; Yokoyama, S. Combining Machine Learning and Pharmacophore-Based Interaction Fingerprint for in Silico
Screening. J. Chem. Inf. Model. 2009, 50, 170–185. [CrossRef]
85. Desaphy, J.; Raimbaud, E.; Ducrot, P.; Rognan, D. Encoding Protein–Ligand Interaction Patterns in Fingerprints and Graphs.
J. Chem. Inf. Model. 2013, 53, 623–637. [CrossRef]
86. Da, C.; Kireev, D. Structural Protein–Ligand Interaction Fingerprints (SPLIF) for Structure-Based Virtual Screening: Method and
Benchmark Study. J. Chem. Inf. Model. 2014, 54, 2555–2561. [CrossRef]
87. Wójcikowski, M.; Kukiełka, M.; Stepniewska-Dziubinska, M.M.; Siedlecki, P. Development of a protein–ligand extended
connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics 2018, 35, 1334–1341. [CrossRef]
88. Wallach, I.; Dzamba, M.; Heifets, A. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based
drug discovery. arXiv 2015, arXiv:1510.02855.
89. Stepniewska-Dziubinska, M.M.; Zielenkiewicz, P.; Siedlecki, P. Development and evaluation of a deep learning model for
protein–ligand binding affinity prediction. Bioinformatics 2018, 34, 3666–3674. [CrossRef]
90. Sunseri, J.; King, J.E.; Francoeur, P.G.; Koes, D.R. Convolutional neural network scoring and minimization in the D3R 2017
community challenge. J. Comput.-Aided Mol. Des. 2018, 33, 19–34. [CrossRef]
91. Jiménez, J.; Škalič, M.; Martínez-Rosell, G.; Fabritiis, G.D. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via
3D-Convolutional Neural Networks. J. Chem. Inf. Model. 2018, 58, 287–296. [CrossRef]
92. Li, Y.; Rezaei, M.A.; Li, C.; Li, X. DeepAtom: A Framework for Protein-Ligand Binding Affinity Prediction. In Proceedings of the
2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019.
[CrossRef]
93. Skalic, M.; Martínez-Rosell, G.; Jiménez, J.; Fabritiis, G.D. PlayMolecule BindScope: Large scale CNN-based virtual screening on
the web. Bioinformatics 2018, 35, 1237–1238. [CrossRef]
94. Erdas-Cicek, O.; Atac, A.O.; Gurkan-Alp, A.S.; Buyukbingol, E.; Alpaslan, F.N. Three-Dimensional Analysis of Binding Sites for
Predicting Binding Affinities in Drug Design. J. Chem. Inf. Model. 2019, 59, 4654–4662. [CrossRef]
95. Lim, J.; Ryu, S.; Park, K.; Choe, Y.J.; Ham, J.; Kim, W.Y. Predicting Drug–Target Interaction Using a Novel Graph Neural Network
with 3D Structure-Embedded Graph Representation. J. Chem. Inf. Model. 2019, 59, 3981–3988. [CrossRef]
96. Feinberg, E.N.; Sur, D.; Wu, Z.; Husic, B.E.; Mai, H.; Li, Y.; Sun, S.; Yang, J.; Ramsundar, B.; Pande, V.S. PotentialNet for Molecular
Property Prediction. ACS Cent. Sci. 2018, 4, 1520–1530. [CrossRef]
97. Cang, Z.; Wei, G.W. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property
predictions. PLoS Comput. Biol. 2017, 13, e1005690. [CrossRef]
98. Zhu, F.; Zhang, X.; Allen, J.E.; Jones, D.; Lightstone, F.C. Binding Affinity Prediction by Pairwise Function Based on Neural
Network. J. Chem. Inf. Model. 2020, 60, 2766–2772. [CrossRef]
99. Pereira, J.C.; Caffarena, E.R.; dos Santos, C.N. Boosting Docking-Based Virtual Screening with Deep Learning. J. Chem. Inf. Model.
2016, 56, 2495–2506. [CrossRef]
100. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New
York, NY, USA, 2009. [CrossRef]
101. Webel, H.E.; Kimber, T.B.; Radetzki, S.; Neuenschwander, M.; Nazaré, M.; Volkamer, A. Revealing cytotoxic substructures in
molecules using deep learning. J. Comput.-Aided Mol. Des. 2020, 34, 731–746. [CrossRef]
102. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
103. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM
2017, 60, 84–90. [CrossRef]
104. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA,
USA, 7–12 June 2015. [CrossRef]
105. Liu, Z.; Zhou, J. Introduction to Graph Neural Networks. Synth. Lect. Artif. Intell. Mach. Learn. 2020, 14, 1–127. [CrossRef]
106. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated Graph Sequence Neural Networks. arXiv 2017, arXiv:1511.05493.
107. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903.
108. Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; Sun, M. Graph Neural Networks: A Review of Methods and Applications. arXiv
2018, arXiv:1812.08434.
109. Wieder, O.; Kohlbacher, S.; Kuenemann, M.; Garon, A.; Ducrot, P.; Seidel, T.; Langer, T. A compact review of molecular property
prediction with graph neural networks. Drug Discovery Today Technol. 2020. [CrossRef]
110. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem in 2021:
New data content and improved web interfaces. Nucleic Acids Res. 2020, 49, D1388–D1395. [CrossRef] [PubMed]
111. Liu, Z.; Su, M.; Han, L.; Liu, J.; Yang, Q.; Li, Y.; Wang, R. Forging the Basis for Developing Protein–Ligand Interaction Scoring
Functions. Accounts Chem. Res. 2017, 50, 302–309. [CrossRef]
112. Gilson, M.K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A public database for medicinal chemistry,
computational chemistry and systems pharmacology. Nucleic Acids Res. 2015, 44, D1045–D1053. [CrossRef] [PubMed]
113. BindingDB. Available online: https://www.bindingdb.org/bind/index.jsp (accessed on 2 March 2021).
114. Ahmed, A.; Smith, R.D.; Clark, J.J.; Dunbar, J.B.; Carlson, H.A. Recent improvements to Binding MOAD: A resource for
protein–ligand binding affinities and structures. Nucleic Acids Res. 2014, 43, D465–D469. [CrossRef]
115. Smith, R.D.; Clark, J.J.; Ahmed, A.; Orban, Z.J.; Dunbar, J.B.; Carlson, H.A. Updates to Binding MOAD (Mother of All Databases):
Polypharmacology Tools and Their Utility in Drug Repurposing. J. Mol. Biol. 2019, 431, 2423–2433. [CrossRef] [PubMed]
116. PubChem. Available online: https://pubchem.ncbi.nlm.nih.gov/ (accessed on 2 March 2021).
117. Davies, M.; Nowotka, M.; Papadatos, G.; Dedman, N.; Gaulton, A.; Atkinson, F.; Bellis, L.; Overington, J.P. ChEMBL web services:
Streamlining access to drug discovery data and utilities. Nucleic Acids Res. 2015, 43, W612–W620. [CrossRef]
118. Kooistra, A.J.; Volkamer, A. Kinase-Centric Computational Drug Development. In Annual Reports in Medicinal Chemistry; Elsevier:
Amsterdam, The Netherlands, 2017; pp. 197–236. [CrossRef]
119. Davis, M.I.; Hunt, J.P.; Herrgard, S.; Ciceri, P.; Wodicka, L.M.; Pallares, G.; Hocker, M.; Treiber, D.K.; Zarrinkar, P.P. Comprehensive
analysis of kinase inhibitor selectivity. Nat. Biotechnol. 2011, 29, 1046–1051. [CrossRef]
120. Tang, J.; Szwajda, A.; Shakyawar, S.; Xu, T.; Hintsanen, P.; Wennerberg, K.; Aittokallio, T. Making Sense of Large-Scale Kinase
Inhibitor Bioactivity Data Sets: A Comparative and Integrative Analysis. J. Chem. Inf. Model. 2014, 54, 735–743. [CrossRef]
121. Sieg, J.; Flachsenberg, F.; Rarey, M. In need of bias control: Evaluating chemical data for machine learning in structure-based
virtual screening. J. Chem. Inf. Model. 2019, 59, 947–961. [CrossRef]
122. Su, M.; Yang, Q.; Du, Y.; Feng, G.; Liu, Z.; Li, Y.; Wang, R. Comparative assessment of scoring functions: The CASF-2016 update.
J. Chem. Inf. Model. 2018, 59, 895–913. [CrossRef]
123. Rodgers, J.L.; Nicewander, W.A. Thirteen Ways to Look at the Correlation Coefficient. Am. Stat. 1988, 42, 59–66. [CrossRef]
124. Spearman, C. The Proof and Measurement of Association between Two Things. Am. J. Psychol 1904, 15, 72–101. [CrossRef]
125. Glasser, G.J.; Winter, R.F. Critical Values of the Coefficient of Rank Correlation for Testing the Hypothesis of Independence.
Biometrika 1961, 48, 444. [CrossRef]
126. Wells, R.D.; Bond, J.S.; Klinman, J.; Masters, B.S.S. (Eds.) RMSD, Root-Mean-Square Deviation. In Molecular Life Sciences:
An Encyclopedic Reference; Springer: New York, NY, USA, 2018; p. 1078. [CrossRef]
127. Truchon, J.F.; Bayly, C.I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem.
J. Chem. Inf. Model. 2007, 47, 488–508. [CrossRef]
128. Trott, O.; Olson, A.J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient
optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461. [CrossRef]
129. Halgren, T.A.; Murphy, R.B.; Friesner, R.A.; Beard, H.S.; Frye, L.L.; Pollard, W.T.; Banks, J.L. Glide: A New Approach for Rapid,
Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening. J. Med. Chem. 2004, 47, 1750–1759. [CrossRef]
130. Huang, N.; Shoichet, B.K.; Irwin, J.J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801. [CrossRef]
131. Mysinger, M.M.; Carchia, M.; Irwin, J.J.; Shoichet, B.K. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and
Decoys for Better Benchmarking. J. Med. Chem. 2012, 55, 6582–6594. [CrossRef]
132. Rohrer, S.G.; Baumann, K. Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity
Data. J. Chem. Inf. Model. 2009, 49, 169–184. [CrossRef] [PubMed]
133. Huang, K.; Fu, T.; Gao, W.; Zhao, Y.; Roohani, Y.; Leskovec, J.; Coley, C.; Xiao, C.; Sun, J.; Zitnik, M. Therapeutics Data Commons:
Machine Learning Datasets for Therapeutics. Available online: https://tdcommons.ai (accessed on 2 March 2021).
134. Riniker, S.; Landrum, G.A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Cheminform.
2013, 5, 1758–2946. [CrossRef] [PubMed]
135. Wang, M.; Li, P.; Qiao, P. The Virtual Screening of the Drug Protein with a Few Crystal Structures Based on the Adaboost-SVM.
Comput. Math. Methods Med. 2016, 2016, 1–9. [CrossRef] [PubMed]
136. Li, F.; Wan, X.; Xing, J.; Tan, X.; Li, X.; Wang, Y.; Zhao, J.; Wu, X.; Liu, X.; Li, Z.; et al. Deep Neural Network Classifier for Virtual
Screening Inhibitors of (S)-Adenosyl-L-Methionine (SAM)-Dependent Methyltransferase Family. Front. Chem. 2019, 7. [CrossRef]
137. Imrie, F.; Bradley, A.R.; van der Schaar, M.; Deane, C.M. Protein Family-Specific Models Using Deep Neural Networks and
Transfer Learning Improve Virtual Screening and Highlight the Need for More Data. J. Chem. Inf. Model. 2018, 58, 2319–2330.
[CrossRef]
138. Sato, A.; Tanimura, N.; Honma, T.; Konagaya, A. Significance of Data Selection in Deep Learning for Reliable Binding Mode
Prediction of Ligands in the Active Site of CYP3A4. Chem. Pharm. Bull. 2019, 67, 1183–1190. [CrossRef]
139. Nguyen, D.D.; Gao, K.; Wang, M.; Wei, G.W. MathDL: Mathematical deep learning for D3R Grand Challenge 4. J. Comput.-Aided
Mol. Des. 2019, 34, 131–147. [CrossRef]
140. Cang, Z.; Mu, L.; Wei, G.W. Representability of algebraic topology for biomolecules in machine learning based scoring and
virtual screening. PLoS Comput. Biol. 2018, 14, e1005929. [CrossRef]
141. Zheng, L.; Fan, J.; Mu, Y. OnionNet: A Multiple-Layer Intermolecular-Contact-Based Convolutional Neural Network for
Protein–Ligand Binding Affinity Prediction. ACS Omega 2019, 4, 15956–15965. [CrossRef]
142. Mordalski, S.; Kosciolek, T.; Kristiansen, K.; Sylte, I.; Bojarski, A.J. Protein binding site analysis by means of structural interaction
fingerprint patterns. Bioorganic Med. Chem. Lett. 2011, 21, 6816–6819. [CrossRef]
143. Desaphy, J.; Bret, G.; Rognan, D.; Kellenberger, E. sc-PDB: A 3D-database of ligandable binding sites—10 years on. Nucleic Acids
Res. 2014, 43, D399–D404. [CrossRef]
144. Koes, D.R.; Baumgartner, M.P.; Camacho, C.J. Lessons Learned in Empirical Scoring with smina from the CSAR 2011 Benchmark-
ing Exercise. J. Chem. Inf. Model. 2013, 53, 1893–1904. [CrossRef]
145. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [CrossRef]
146. Ragoza, M.; Hochuli, J.; Idrobo, E.; Sunseri, J.; Koes, D.R. Protein–Ligand Scoring with Convolutional Neural Networks. J. Chem.
Inf. Model. 2017, 57, 942–957. [CrossRef]
147. Jubb, H.C.; Higueruelo, A.P.; Ochoa-Montaño, B.; Pitt, W.R.; Ascher, D.B.; Blundell, T.L. Arpeggio: A Web Server for Calculating
and Visualising Interatomic Interactions in Protein Structures. J. Mol. Biol. 2017, 429, 365–371. [CrossRef]
148. Ballester, P.J.; Mitchell, J.B.O. A machine learning approach to predicting protein–ligand binding affinity with applications to
molecular docking. Bioinformatics 2010, 26, 1169–1175. [CrossRef]
149. Parks, C.D.; Gaieb, Z.; Chiu, M.; Yang, H.; Shao, C.; Walters, W.P.; Jansen, J.M.; McGaughey, G.; Lewis, R.A.; Bembenek, S.D.;
et al. D3R grand challenge 4: Blind prediction of protein–ligand poses, affinity rankings, and relative binding free energies.
J. Comput.-Aided Mol. Des. 2020, 34, 99–119. [CrossRef]
150. Li, H.; Leung, K.S.; Wong, M.H.; Ballester, P.J. Improving AutoDock Vina Using Random Forest: The Growing Accuracy of
Binding Affinity Prediction by the Effective Exploitation of Larger Data Sets. Mol. Inform. 2015, 34, 115–126. [CrossRef]
151. Zhang, H.; Liao, L.; Saravanan, K.M.; Yin, P.; Wei, Y. DeepBindRG: A deep learning based method for estimating effective
protein–ligand affinity. PeerJ 2019, 7, e7362. [CrossRef]
152. Öztürk, H.; Ozkirimli, E.; Özgür, A. WideDTA: Prediction of drug-target binding affinity. arXiv 2019, arXiv:1902.04166.
153. Tian, K.; Shao, M.; Wang, Y.; Guan, J.; Zhou, S. Boosting compound-protein interaction prediction by deep learning. Methods
2016, 110, 64–72. [CrossRef]
154. Lee, I.; Keum, J.; Nam, H. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein
sequences. PLoS Comput. Biol. 2019, 15, e1007129. [CrossRef] [PubMed]
155. Jiang, M.; Li, Z.; Zhang, S.; Wang, S.; Wang, X.; Yuan, Q.; Wei, Z. Drug–target affinity prediction using graph neural network and
contact maps. RSC Adv. 2020, 10, 20701–20712. [CrossRef]
156. Feng, Q.; Dueva, E.V.; Cherkasov, A.; Ester, M. PADME: A Deep Learning-based Framework for Drug-Target Interaction
Prediction. arXiv 2018, arXiv:1807.09741.
157. Van Laarhoven, T.; Nabuurs, S.B.; Marchiori, E. Gaussian interaction profile kernels for predicting drug–target interaction.
Bioinformatics 2011, 27, 3036–3043. [CrossRef]
158. He, T.; Heidemeyer, M.; Ban, F.; Cherkasov, A.; Ester, M. SimBoost: A read-across approach for predicting drug–target binding
affinities using gradient boosting machines. J. Cheminform. 2017, 9. [CrossRef]
159. Woźniak, M.; Wołos, A.; Modrzyk, U.; Górski, R.L.; Winkowski, J.; Bajczyk, M.; Szymkuć, S.; Grzybowski, B.A.; Eder, M. Linguistic
measures of chemical diversity and the “keywords” of molecular collections. Sci. Rep. 2018, 8. [CrossRef]
160. Sigrist, C.J.A.; Cerutti, L.; de Castro, E.; Langendijk-Genevaux, P.S.; Bulliard, V.; Bairoch, A.; Hulo, N. PROSITE, a protein domain
database for functional characterization and annotation. Nucleic Acids Res. 2009, 38, D161–D166. [CrossRef]
161. Liu, T.; Lin, Y.; Wen, X.; Jorissen, R.N.; Gilson, M.K. BindingDB: A web-accessible database of experimentally determined
protein-ligand binding affinities. Nucleic Acids Res. 2006, 35, D198–D201. [CrossRef]
162. Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A.C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; et al. DrugBank
4.0: Shedding new light on drug metabolism. Nucleic Acids Res. 2013, 42, D1091–D1097. [CrossRef]
163. Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and
drugs. Nucleic Acids Res. 2016, 45, D353–D361. [CrossRef]
164. Southan, C.; Sharman, J.L.; Benson, H.E.; Faccenda, E.; Pawson, A.J.; Alexander, S.; Buneman, O.P.; Davenport, A.P.; McGrath, J.C.;
Peters, J.A.; et al. The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: Towards curated quantitative interactions between
1300 protein targets and 6000 ligands. Nucleic Acids Res. 2015, 44, D1054–D1068. [CrossRef]
165. Bagley, S.C.; Altman, R.B. Characterizing the microenvironment surrounding protein sites. Protein Sci. 1995, 4, 622–635.
[CrossRef]
166. Michel, M.; Menéndez Hurtado, D.; Elofsson, A. PconsC4: Fast, accurate and hassle-free contact predictions. Bioinformatics 2018,
35, 2677–2679. [CrossRef]
167. Cao, D.S.; Xu, Q.S.; Liang, Y.Z. propy: A tool to generate various modes of Chou’s PseAAC. Bioinformatics 2013, 29, 960–962.
[CrossRef]
168. Ma, J.; Sheridan, R.P.; Liaw, A.; Dahl, G.E.; Svetnik, V. Deep Neural Nets as a Method for Quantitative Structure–Activity
Relationships. J. Chem. Inf. Model. 2015, 55, 263–274. [CrossRef]
169. Ballester, P.J.; Schreyer, A.; Blundell, T.L. Does a More Precise Chemical Description of Protein–Ligand Complexes Lead to More
Accurate Prediction of Binding Affinity? J. Chem. Inf. Model. 2014, 54, 944–955. [CrossRef]
170. Wallach, I.; Heifets, A. Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization. J. Chem.
Inf. Model. 2018, 58, 916–932. [CrossRef]
171. Göller, A.H.; Kuhnke, L.; Montanari, F.; Bonin, A.; Schneckener, S.; ter Laak, A.; Wichard, J.; Lobell, M.; Hillisch, A. Bayer’s
in silico ADMET platform: A journey of machine learning over the past two decades. Drug Discov. Today 2020, 25, 1702–1709.
[CrossRef]
172. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.;
da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship.
Sci. Data 2016, 3. [CrossRef]
173. Chen, L.; Cruz, A.; Ramsey, S.; Dickson, C.J.; Duca, J.S.; Hornak, V.; Koes, D.R.; Kurtzman, T. Hidden bias in the DUD-E dataset
leads to misleading performance of deep learning in structure-based virtual screening. PLoS ONE 2019, 14, e0220113. [CrossRef]
174. Jiménez-Luna, J.; Skalic, M.; Weskamp, N.; Schneider, G. Coloring Molecules with Explainable Artificial Intelligence for Preclinical
Relevance Assessment. J. Chem. Inf. Model. 2021. [CrossRef] [PubMed]
175. Bender, A.; Cortés-Ciriano, I. Artificial intelligence in drug discovery: What is realistic, what are illusions? Part 1: Ways to make
an impact, and why we are not there yet. Drug Discov. Today 2020. [CrossRef]
176. Bender, A.; Cortés-Ciriano, I. Artificial intelligence in drug discovery: What is realistic, what are illusions? Part 2: A discussion of
chemical and biological data. Drug Discov. Today 2021. [CrossRef]
177. Nguyen, H.; Case, D.A.; Rose, A.S. NGLview–interactive molecular graphics for Jupyter notebooks. Bioinformatics 2017,
34, 1241–1242. [CrossRef] [PubMed]
178. Wójcikowski, M.; Zielenkiewicz, P.; Siedlecki, P. Open Drug Discovery Toolkit (ODDT): A new open-source player in the drug
discovery field. J. Cheminform. 2015, 7. [CrossRef] [PubMed]
179. Schrödinger, LLC. The PyMOL Molecular Graphics System; Version 1.8; Schrödinger LLC: New York, NY, USA, 2015.
180. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997,
30, 1145–1159. [CrossRef]
181. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [CrossRef]
182. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary
classification evaluation. BMC Genom. 2020, 21. [CrossRef]
183. Kvålseth, T.O. Cautionary Note about R2. Am. Stat. 1985, 39, 279–285. [CrossRef]
184. Ash, A.; Shwartz, M. R2: A useful measure of model performance when predicting a dichotomous outcome. Stat. Med. 1999,
18, 375–384. [CrossRef]
185. Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson Correlation Coefficient; Springer: Berlin, Germany, 2009; pp. 1–4. [CrossRef]