Chemprop Benchmark 2019 SI
Kevin Yang,∗,† Kyle Swanson,∗,† Wengong Jin,† Connor Coley,‡ Philipp Eiden,¶
Code
In addition, we ran the baseline from Mayr et al. 1 using the code at https://github.com/yangkevin2/lsc_experiments.
Additional Dataset Statistics
Figure S2: Class balance on the publicly available classification datasets. The y-axis shows
the average percentage of positives across the tasks in a dataset, weighted by the number of
molecules with known values for each task.
In addition to the class balance statistics in Figure S2, we also analyze the class imbalances
introduced in both random and scaffold splits, quantifying the imbalance using the following
metric. Let $r$ be the fraction of the less common class (0 or 1) in the full dataset, and let
$r_t$ be the same fraction for a particular test fold of one of our splits. We then measure the
imbalance for a particular property and test fold as $\max(r/r_t,\, r_t/r)$ (that is, the ratio
of the larger over the smaller) and average across all properties and test folds for each
dataset. Thus, a higher metric indicates greater class imbalance introduced by data splitting.
On rare occasions for both the random and scaffold split, $r_t$ is 0 due to the sparsity and
imbalance of some properties in datasets that contain a large number of properties. We omit
these cases from the average, and for each dataset we denote the number of property-test
fold pairs for which this occurred. (Such properties are also omitted when calculating the
average AUC for that particular test fold.) Overall, as indicated in Figure S3 and Tables S1
and S2, the scaffold
split is more imbalanced than the random split, but the numbers are comparable. On no
dataset does the scaffold split test set’s class balance differ (on average) from the full dataset’s
class balance by more than a factor of 2.
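To make the metric concrete, here is a minimal sketch of the computation for one property
and one test fold (a hedged reconstruction, not our exact code; the array handling is an
assumption):

    import numpy as np

    def imbalance_ratio(full_labels, test_labels):
        """Class balance ratio between the full dataset and one test fold.

        full_labels, test_labels: 1-D arrays of 0/1 labels for a single
        property, with np.nan marking molecules whose value is unknown.
        """
        full = full_labels[~np.isnan(full_labels)]
        test = test_labels[~np.isnan(test_labels)]
        minority = 0.0 if full.mean() > 0.5 else 1.0  # less common class in full data
        r = (full == minority).mean()
        r_t = (test == minority).mean()
        if r_t == 0:
            return None  # failed property-fold pair; omitted from the average
        return max(r / r_t, r_t / r)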
Figure S3: Ratio of class balance between the full dataset and the test set on the publicly
available classification datasets (see text for definition of the ratio).
Table S2: Class Balance Ratio (Scaffold Split).
Dataset Class Balance Ratio # Failed Property-Fold Pairs # Property-Fold Pairs
PCBA 1.193 ± 0.018 0 384
MUV 1.562 ± 0.112 3 51
HIV 1.092 ± 0.035 0 3
BACE 1.096 ± 0.076 0 10
BBBP 1.251 ± 0.203 0 10
Tox21 1.603 ± 0.530 0 120
ToxCast 1.471 ± 0.082 180 6170
SIDER 1.294 ± 0.204 1 270
ClinTox 1.915 ± 1.142 0 20
ChEMBL 1.393 ± 0.017 284 3930
Comparison to Baselines
Note: All p-values for comparisons involving MoleculeNet, 2 Mayr et al., 1 and different
dataset split types are from a one-sided Welch’s t-test. All remaining p-values are from
a one-sided Wilcoxon signed-rank test. See the main text for more details.
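For reference, both tests are available in SciPy; a minimal sketch with hypothetical per-fold
scores (the alternative keyword requires a recent SciPy):

    from scipy import stats

    ours = [0.84, 0.85, 0.83]      # hypothetical per-fold AUCs for our model
    baseline = [0.81, 0.82, 0.80]  # hypothetical per-fold AUCs for a baseline

    # One-sided Welch's t-test (unequal variances), H1: ours > baseline.
    _, p_welch = stats.ttest_ind(ours, baseline, equal_var=False,
                                 alternative='greater')

    # One-sided Wilcoxon signed-rank test on paired per-fold scores.
    _, p_wilcoxon = stats.wilcoxon(ours, baseline, alternative='greater')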
Comparison to MoleculeNet
Comparison between our D-MPNN with RDKit features and the best model from MoleculeNet
using the splits from MoleculeNet. 2 We were unable to reproduce the splits from
MoleculeNet on QM7, BACE, and ToxCast, so we leave out those datasets. The QM8,
QM9, and PDBbind datasets include 3D coordinates that our model does not use but some
MoleculeNet models may use.
(a) Regression Datasets (lower = better). (b) Classification Datasets (higher = better).
Comparison to Mayr et al.
Comparison between our best single model (i.e., optimized hyperparameters and optionally
RDKit features, but without ensembling) and the feed-forward network (FFN) architecture
of Mayr et al. 1 using their best descriptor set. We ran this comparison only on the scaffold
split due to computational cost; as the model of Mayr et al. 1 uses hyperparameter
optimization spanning 40 different parameter settings for 300 epochs each, their
hyperparameter optimization is actually more expensive than that of our D-MPNN.
(a) Regression Datasets (Scaffold Split, lower = better). (b) Classification Datasets (Scaffold Split, higher = better).
(a) Regression Datasets (Random Split, lower = better). (b) Classification Datasets (Random Split, higher = better). (c) Regression Datasets (Scaffold Split, lower = better). (d) Classification Datasets (Scaffold Split, higher = better).
Table S5: Comparison to Baselines, Part I (Random Split).
Table S6: Comparison to Baselines, Part II (Random Split).
Table S7: Comparison to Baselines, Part III (Random Split).
Table S8: Comparison to Baselines, Part I (Scaffold Split).
Table S9: Comparison to Baselines, Part II (Scaffold Split).
Table S10: Comparison to Baselines, Part III (Scaffold Split).
Proprietary Datasets
Amgen
Comparison of our D-MPNN in both its unoptimized and optimized form against baseline
models on Amgen internal datasets using a time split of the data. Note that rPPB is in logit
while Sol and RLM are in log10 .
Table S11: Comparison to Baselines on Amgen, Part I. Note: The metric for hPXR (class)
is ROC-AUC; all others are RMSE. *Only one run.
(a) Regression Datasets (lower = better). (b) Classification Datasets (higher = better).
Table S12: Comparison to Baselines on Amgen, Part II. Note: The metric for hPXR (class)
is ROC-AUC; all others are MSE.
Table S13: Comparison to Baselines on Amgen, Part III. Note: The metric for hPXR (class)
is ROC-AUC; all others are MSE. *Only one run.
(a) Regression Datasets (lower = better). (b) Classification Datasets (higher = better).
Table S14: Optimizations on Amgen, Part I. Note: The metric for hPXR (class) is ROC-
AUC; all others are RMSE.
Table S15: Optimizations on Amgen, Part II. Note: The metric for hPXR (class) is ROC-
AUC; all others are RMSE.
BASF
Comparison of our D-MPNN in both its unoptimized and optimized form against baseline
models on BASF internal datasets using a scaffold split of the data.
Figure S9: Comparison to Baselines on BASF (higher = better).
Table S16: Comparison to Baselines on BASF, Part I. Note: All numbers are R².
Table S17: Comparison to Baselines on BASF, Part II. Note: All numbers are R².
Dataset FFN on Morgan FFN on Morgan Counts FFN on RDKit
Benzene 0.587 ± 0.007 (-28.21%) 0.630 ± 0.009 (-22.89%) 0.742 ± 0.007 (-9.23%)
Cyclohexane 0.521 ± 0.014 (-31.53%) 0.572 ± 0.012 (-24.86%) 0.682 ± 0.007 (-10.32%)
Dichloromethane 0.032 ± 0.047 (-96.04%) 0.012 ± 0.133 (-98.55%) 0.695 ± 0.014 (-15.15%)
DMSO 0.604 ± 0.007 (-27.92%) 0.654 ± 0.009 (-21.95%) 0.755 ± 0.007 (-9.92%)
Ethanol 0.597 ± 0.008 (-28.13%) 0.644 ± 0.012 (-22.54%) 0.755 ± 0.005 (-9.21%)
Ethyl acetate 0.587 ± 0.008 (-29.82%) 0.635 ± 0.012 (-24.10%) 0.748 ± 0.009 (-10.58%)
H2O 0.599 ± 0.014 (-27.73%) 0.643 ± 0.016 (-22.38%) 0.754 ± 0.002 (-8.99%)
Octanol 0.583 ± 0.017 (-29.66%) 0.638 ± 0.009 (-23.05%) 0.749 ± 0.007 (-9.60%)
Tetrahydrofuran 0.582 ± 0.002 (-30.38%) 0.629 ± 0.007 (-24.65%) 0.747 ± 0.013 (-10.55%)
Toluene 0.578 ± 0.016 (-30.34%) 0.622 ± 0.005 (-25.02%) 0.756 ± 0.005 (-8.82%)
Table S18: Optimizations on BASF, Part I. Note: All numbers are R².
Table S19: Optimizations on BASF, Part II. Note: All numbers are R².
Novartis
Comparison of our D-MPNN in both its unoptimized and optimized form against baseline
models on a Novartis internal dataset using a time split of the data.
Table S20: Comparison to Baselines on Novartis, Part I. Note: All numbers are RMSE.
Table S21: Comparison to Baselines on Novartis, Part II. Note: All numbers are RMSE.
Figure S11: Comparison to Baselines on Novartis (lower = better).
We provide additional results for different versions of our D-MPNN model as well as for a
Lasso regression model 3 on the Novartis dataset in Figures S12, S13, and S14. These results
showcase the importance of proper normalization for the additional RDKit features. Our base
D-MPNN predicts a logP within 0.3 of the ground truth on 47.4% of the test set, comparable
to the best Lasso baseline. However, augmenting our model with features normalized under a
Gaussian distribution assumption (simply subtracting the mean and dividing by the standard
deviation of each feature) results in only 43.0% of test set predictions within 0.3 of the
ground truth. Normalizing the features with properly fit empirical CDFs instead drastically
improves this number to 51.2%. Note that the Lasso baseline runs on Morgan fingerprints as
well as on RDKit features normalized with the same empirical CDFs.
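A minimal sketch of the two normalization schemes discussed above (the empirical-CDF fit is
one plausible implementation, not necessarily the exact one we used):

    import numpy as np

    def gaussian_normalize(train_feats, feats):
        # Z-score each feature using training-set statistics (Gaussian assumption).
        mean = train_feats.mean(axis=0)
        std = train_feats.std(axis=0) + 1e-8  # avoid division by zero
        return (feats - mean) / std

    def cdf_normalize(train_feats, feats):
        # Map each feature through its empirical CDF estimated on the training
        # set, so every normalized feature lies in [0, 1] regardless of its
        # original (possibly heavy-tailed) distribution.
        out = np.empty_like(feats, dtype=float)
        for j in range(feats.shape[1]):
            sorted_col = np.sort(train_feats[:, j])
            ranks = np.searchsorted(sorted_col, feats[:, j], side='right')
            out[:, j] = ranks / len(sorted_col)
        return out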
(a) Binned distribution of errors for Lasso baseline. (b) Scatterplot of Lasso baseline predictions vs. ground truth.
Figure S12: Performance of Lasso models on the proprietary Novartis logP dataset. For each
model, a pie chart shows the binned distribution of errors on the test set, and a scatterplot
shows the predictions vs. ground truth for individual data points.
Figure S13: Performance of base D-MPNN model on the proprietary Novartis logP dataset.
A pie chart shows the binned distribution of errors on the test set, and a scatterplot shows
the predictions vs. ground truth for individual data points.
(a) Binned distribution of errors for D-MPNN with features normalized according to a Gaussian distribution assumption. (b) Scatterplot of predictions vs. ground truth for D-MPNN with features normalized according to a Gaussian distribution assumption.
Figure S14: Performance of D-MPNN models with RDKit features on the proprietary Novartis
logP dataset. For each model, a pie chart shows the binned distribution of errors on the test
set, and a scatterplot shows the predictions vs. ground truth for individual data points.
Sliding Time Window Splits
We additionally evaluated our model using sliding time window splits on the datasets where
chronological splits were available. We divided each dataset chronologically into 14 equally
sized chunks. For each contiguous group of 5 chunks, we used the first 3 as training, the
fourth as validation, and the fifth as test, for a total of 10 3:1:1 splits. Due to
computational cost, we evaluated on only 3 of the splits for the Amgen datasets RLM, Sol,
and hPXR. The time window split results are very noisy due to the smaller dataset sizes, so
it is hard to draw strong conclusions, but the relative ranking of model architectures stays
approximately stable compared to the full time splits.
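A minimal sketch of this split generation, assuming the data is already sorted
chronologically (any leftover molecules beyond an even multiple of 14 are dropped in this
sketch):

    def sliding_time_window_splits(data, num_chunks=14, window=5):
        """Yield (train, val, test) splits over a chronologically sorted list."""
        size = len(data) // num_chunks
        chunks = [data[i * size:(i + 1) * size] for i in range(num_chunks)]
        for start in range(num_chunks - window + 1):  # 14 - 5 + 1 = 10 splits
            train = sum(chunks[start:start + 3], [])  # first 3 chunks: training
            val = chunks[start + 3]                   # fourth chunk: validation
            test = chunks[start + 4]                  # fifth chunk: test
            yield train, val, test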
(a) Time Split (lower = better). (b) Sliding Time Window Split (lower = better).
Figure S15: Comparison to Baselines Using Time and Sliding Time Window Splits.
Table S22: Comparison to Baselines Using Time Split, Part I. *Only one run.
Table S23: Comparison to Baselines Using Time Split, Part II.
Dataset FFN on Morgan FFN on Morgan Counts FFN on RDKit
PDBbind-F 2.403 ± 0.044 (+9.88%) 2.431 ± 0.055 (+11.16%) 2.180 ± 0.033 (-0.34%)
PDBbind-C 3.743 ± 0.808 (+3.05%) 4.380 ± 0.964 (+20.57%) 3.334 ± 0.114 (-8.21%)
PDBbind-R 2.993 ± 0.380 (+23.45%) 2.908 ± 0.185 (+19.95%) 2.387 ± 0.126 (-1.56%)
rPPB 1.856 ± 0.517 (+75.69%) 1.903 ± 0.468 (+80.09%) 1.119 ± 0.027 (+5.87%)
Sol 0.802 ± 0.017 (+13.64%) 0.779 ± 0.008 (+10.31%) 0.765 ± 0.003 (+8.30%)
RLM 0.359 ± 0.005 (+8.48%) 0.353 ± 0.003 (+6.81%) 0.384 ± 0.003 (+16.14%)
hPXR 45.428 ± 1.255 (+24.17%) 44.305 ± 1.226 (+21.10%) 39.426 ± 0.465 (+7.77%)
LogP 0.915 ± 0.020 (+32.23%) 0.838 ± 0.019 (+21.10%) 0.753 ± 0.013 (+8.82%)
Table S24: Comparison to Baselines Using Sliding Time Window Split, Part I. *Only one
run.
Table S25: Comparison to Baselines Using Sliding Time Window Split, Part II.
Dataset FFN on Morgan FFN on Morgan Counts FFN on RDKit
PDBbind-F 1.368 ± 0.083 (+8.67%) 1.485 ± 0.235 (+17.95%) 1.264 ± 0.169 (+0.41%)
PDBbind-C 1.697 ± 0.175 (+9.63%) 1.825 ± 0.211 (+17.88%) 1.651 ± 0.194 (+6.67%)
PDBbind-R 1.783 ± 0.460 (+32.33%) 1.721 ± 0.584 (+27.80%) 1.289 ± 0.010 (-4.28%)
rPPB 2.524 ± 0.483 (+92.64%) 2.194 ± 0.075 (+67.39%) 1.475 ± 0.084 (+12.57%)
Sol 1.148 ± 0.209 (+15.74%) 1.170 ± 0.223 (+17.96%) 1.018 ± 0.276 (+2.61%)
RLM 0.472 ± 0.007 (+19.44%) 0.461 ± 0.004 (+16.70%) 0.444 ± 0.011 (+12.48%)
hPXR 60.244 ± 8.015 (+26.00%) 57.859 ± 8.343 (+21.02%) 48.961 ± 9.012 (+2.40%)
LogP 1.240 ± 0.091 (+70.80%) 1.126 ± 0.064 (+55.10%) 0.810 ± 0.089 (+11.56%)
Experimental Error
Note that the experimental error is not computed on the same time split as the two models,
since it can only be measured on molecules which were tested more than once, but even so the
difference in performance is striking.
Analysis of Split Type
Figure S17: Overlap of molecular scaffolds between the train and test sets for a random or
chronological split of four Amgen regression datasets. Overlap is defined as the percent of
molecules in the test set which share a scaffold with a molecule in the train set.
Table S27: Overlap of molecular scaffolds between the train and test sets for a random or
chronological split of four Amgen regression datasets. Overlap is defined as the percent of
molecules in the test set which share a scaffold with a molecule in the train set.
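A minimal sketch of this overlap computation using RDKit's Bemis-Murcko scaffolds (the
SMILES lists are hypothetical inputs):

    from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles

    def scaffold_overlap(train_smiles, test_smiles):
        """Percent of test molecules whose Murcko scaffold occurs in the train set."""
        train_scaffolds = {MurckoScaffoldSmiles(s) for s in train_smiles}
        shared = sum(MurckoScaffoldSmiles(s) in train_scaffolds
                     for s in test_smiles)
        return 100.0 * shared / len(test_smiles)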
Figure S18: D-MPNN Performance by Split Type on Amgen datasets (lower = better).
Table S28: D-MPNN Performance by Split Type on Amgen datasets. Note: All numbers
are RMSE. *Only one run.
Figure S19: D-MPNN Performance by Split Type on Novartis datasets (lower = better).
Table S29: Performance of D-MPNN on different data splits on Novartis datasets. Note: All
numbers are RMSE.
Figure S20: D-MPNN Performance by Split Type on PDBbind (lower = better).
Table S30: D-MPNN Performance by Split Type on PDBbind. Note: All numbers are RMSE.
(a) Regression Datasets (lower = better). (b) Classification Datasets (higher = better).
Table S31: D-MPNN Performance by Split Type on Public Datasets.
Ablations
Message Type
Here we describe the implementation and performance of our atom-based and undirected
bond-based messages. For the most direct comparison, we implemented these as options in
our model; the changes amount to only a few lines of code in each case. We therefore simply
detail the differences from our directed bond-based messages.
Atom Messages
We initialize messages based on atom features rather than bond features, according to
$h_v^0 = \tau(W_i x_v)$ rather than $h_{vw}^0 = \tau(W_i\,\mathrm{cat}(x_v, e_{vw}))$, with
matrix dimensions adjusted accordingly. During message passing, each atom receives messages
according to $m_v^{t+1} = \sum_{k \in N(v)} h_k^t$.
Finally, $m_v$ is the sum of all of the atom hidden states at the end of message passing.
Undirected Bond Messages
The only difference between undirected bond-based messages and our D-MPNN is that before
each message passing step, for each pair of bonded atoms v and w, we set $h_{vw}^t$ and
$h_{wv}^t$ to each be equal to their average. Consequently, the hidden state for each
directed bond is always equal to the hidden state of its reverse bond, resulting in message
passing on undirected bonds.
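A minimal sketch of this symmetrization step (a hedged reconstruction with hypothetical
names, not our exact code):

    import numpy as np

    def symmetrize_bond_states(h, rev_index):
        """Average each directed bond hidden state with that of its reverse bond.

        h:         (num_directed_bonds, hidden_size) array of states h_{vw}^t
        rev_index: (num_directed_bonds,) index array where rev_index[b] is the
                   reverse bond of b (i.e., wv for vw)
        Applying this before every message passing step makes the messages
        effectively undirected.
        """
        return 0.5 * (h + h[rev_index])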
Comparison of performance using different message passing paradigms. Our D-MPNN uses
directed messages.
(a) Regression Datasets (Random Split, lower = better). (b) Classification Datasets (Random Split, higher = better). (c) Regression Datasets (Scaffold Split, lower = better). (d) Classification Datasets (Scaffold Split, higher = better).
Table S32: Message Type (Random Split).
Dataset Metric D-MPNN Undirected Atom
QM7 MAE 66.475 ± 2.088 68.628 ± 2.177 (+3.24% p=0.01) 72.811 ± 2.737 (+9.53% p=0.00)
QM8 MAE 0.0110 ± 0.0002 0.012 ± 0.000 (+5.99% p=0.00) 0.011 ± 0.000 (+2.43% p=0.00)
QM9 MAE 3.101 ± 0.010 3.263 ± 0.069 (+5.23% p=0.00) 3.589 ± 0.033 (+15.73% p=0.00)
ESOL RMSE 0.665 ± 0.052 0.702 ± 0.042 (+5.62% p=0.00) 0.719 ± 0.045 (+8.05% p=0.00)
FreeSolv RMSE 1.167 ± 0.150 1.242 ± 0.249 (+6.44% p=0.00) 1.243 ± 0.182 (+6.50% p=0.06)
Lipophilicity RMSE 0.596 ± 0.050 0.645 ± 0.075 (+8.15% p=0.00) 0.625 ± 0.056 (+4.87% p=0.00)
PDBbind-F RMSE 1.311 ± 0.034 1.337 ± 0.036 (+1.98% p=0.00) 1.330 ± 0.027 (+1.42% p=0.00)
PDBbind-C RMSE 2.151 ± 0.285 2.090 ± 0.270 (-2.86% p=0.98) 2.211 ± 0.339 (+2.79% p=0.17)
PDBbind-R RMSE 1.395 ± 0.087 1.427 ± 0.090 (+2.30% p=0.00) 1.424 ± 0.082 (+2.07% p=0.00)
PCBA PRC-AUC 0.337 ± 0.004 0.330 ± 0.007 (-2.23% p=0.01) 0.333 ± 0.010 (-1.15% p=0.03)
MUV PRC-AUC 0.1222 ± 0.0204 0.097 ± 0.042 (-20.66% p=0.00) 0.103 ± 0.022 (-16.05% p=0.08)
HIV ROC-AUC 0.816 ± 0.023 0.805 ± 0.022 (-1.40% p=0.92) 0.805 ± 0.019 (-1.33% p=0.80)
BACE ROC-AUC 0.878 ± 0.032 0.850 ± 0.039 (-3.13% p=0.00) 0.864 ± 0.035 (-1.63% p=0.00)
BBBP ROC-AUC 0.913 ± 0.026 0.910 ± 0.032 (-0.40% p=0.12) 0.908 ± 0.033 (-0.63% p=0.03)
Tox21 ROC-AUC 0.845 ± 0.015 0.844 ± 0.014 (-0.14% p=0.17) 0.845 ± 0.014 (+0.04% p=0.29)
ToxCast ROC-AUC 0.737 ± 0.013 0.732 ± 0.015 (-0.61% p=0.00) 0.735 ± 0.014 (-0.27% p=0.25)
SIDER ROC-AUC 0.646 ± 0.016 0.641 ± 0.014 (-0.73% p=0.34) 0.644 ± 0.014 (-0.23% p=0.50)
ClinTox ROC-AUC 0.894 ± 0.027 0.881 ± 0.037 (-1.49% p=0.03) 0.896 ± 0.037 (+0.22% p=0.62)
ChEMBL ROC-AUC 0.746 ± 0.040 0.745 ± 0.043 (-0.14% p=0.02) 0.744 ± 0.045 (-0.31% p=0.01)
RDKit Features
(a) Regression Datasets (Random Split, lower = better). (b) Classification Datasets (Random Split, higher = better). (c) Regression Datasets (Scaffold Split, lower = better). (d) Classification Datasets (Scaffold Split, higher = better).
Table S34: RDKit Features (Random Split).
Table S35: RDKit Features (Scaffold Split).
Hyperparameter Optimization
Effect of performing Bayesian hyperparameter optimization on the depth, hidden size, number
of fully connected layers, and dropout of our model. Optimization was done on random splits,
and the optimized model was then applied to both random and scaffold splits.
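A minimal sketch of such a search with the hyperopt library (the ranges and the
train_and_validate helper are illustrative assumptions, not necessarily our exact settings):

    from hyperopt import fmin, hp, tpe

    space = {
        'depth': hp.quniform('depth', 2, 6, 1),
        'hidden_size': hp.quniform('hidden_size', 300, 2400, 100),
        'ffn_num_layers': hp.quniform('ffn_num_layers', 1, 3, 1),
        'dropout': hp.quniform('dropout', 0.0, 0.4, 0.05),
    }

    def objective(params):
        # train_and_validate is a hypothetical helper that trains a model with
        # the given settings and returns a loss (e.g., validation RMSE, or
        # negative AUC for classification so that lower is always better).
        return train_and_validate(depth=int(params['depth']),
                                  hidden_size=int(params['hidden_size']),
                                  ffn_num_layers=int(params['ffn_num_layers']),
                                  dropout=params['dropout'])

    best = fmin(objective, space, algo=tpe.suggest, max_evals=20)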
(a) Regression Datasets (Random Split, lower = better). (b) Classification Datasets (Random Split, higher = better). (c) Regression Datasets (Scaffold Split, lower = better). (d) Classification Datasets (Scaffold Split, higher = better).
Table S36: Hyperparameter Optimization (Random Split).
Table S37: Hyperparameter Optimization (Scaffold Split).
Ensembling
Benefit of ensembling five models instead of a single model. All results use our best model
settings (i.e., optimized hyperparameters and RDKit features, if they improved performance
in the single-model setting).
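A minimal sketch of the ensembling step (models is a hypothetical list of five independently
trained models exposing a predict method):

    import numpy as np

    def ensemble_predict(models, inputs):
        """Average the predictions of independently trained models."""
        predictions = np.stack([model.predict(inputs) for model in models])
        return predictions.mean(axis=0)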
(a) Regression Datasets (Random Split, lower = better). (b) Classification Datasets (Random Split, higher = better). (c) Regression Datasets (Scaffold Split, lower = better). (d) Classification Datasets (Scaffold Split, higher = better).
Table S38: Ensembling (Random Split).
Effect of Data Size
Table S40: Effect of Data Size on ChEMBL. All numbers are ROC-AUC.
RDKit-Calculated Features
We used the following list of 200 RDKit functions to calculate the RDKit features used by
our model; a minimal sketch of computing these descriptors appears after the list.
Chi1n Chi1v Chi2n
Chi2v Chi3n Chi3v
Chi4n Chi4v EState_VSA1
EState_VSA10 EState_VSA11 EState_VSA2
EState_VSA3 EState_VSA4 EState_VSA5
EState_VSA6 EState_VSA7 EState_VSA8
EState_VSA9 ExactMolWt FpDensityMorgan1
FpDensityMorgan2 FpDensityMorgan3 FractionCSP3
HallKierAlpha HeavyAtomCount HeavyAtomMolWt
Ipc Kappa1 Kappa2
Kappa3 LabuteASA MaxAbsEStateIndex
MaxAbsPartialCharge MaxEStateIndex MaxPartialCharge
MinAbsEStateIndex MinAbsPartialCharge MinEStateIndex
MinPartialCharge MolLogP MolMR
MolWt NHOHCount NOCount
NumAliphaticCarbocycles NumAliphaticHeterocycles NumAliphaticRings
NumAromaticCarbocycles NumAromaticHeterocycles NumAromaticRings
NumHAcceptors NumHDonors NumHeteroatoms
NumRadicalElectrons NumRotatableBonds NumSaturatedCarbocycles
NumSaturatedHeterocycles NumSaturatedRings NumValenceElectrons
PEOE_VSA1 PEOE_VSA10 PEOE_VSA11
PEOE_VSA12 PEOE_VSA13 PEOE_VSA14
PEOE_VSA2 PEOE_VSA3 PEOE_VSA4
PEOE_VSA5 PEOE_VSA6 PEOE_VSA7
PEOE_VSA8 PEOE_VSA9 RingCount
SMR_VSA1 SMR_VSA10 SMR_VSA2
SMR_VSA3 SMR_VSA4 SMR_VSA5
SMR_VSA6 SMR_VSA7 SMR_VSA8
SMR_VSA9 SlogP_VSA1 SlogP_VSA10
SlogP_VSA11 SlogP_VSA12 SlogP_VSA2
SlogP_VSA3 SlogP_VSA4 SlogP_VSA5
SlogP_VSA6 SlogP_VSA7 SlogP_VSA8
SlogP_VSA9 TPSA VSA_EState1
VSA_EState10 VSA_EState2 VSA_EState3
VSA_EState4 VSA_EState5 VSA_EState6
VSA_EState7 VSA_EState8 VSA_EState9
fr_Al_COO fr_Al_OH fr_Al_OH_noTert
fr_ArN fr_Ar_COO fr_Ar_N
fr_Ar_NH fr_Ar_OH fr_COO
fr_COO2 fr_C_O fr_C_O_noCOO
fr_C_S fr_HOCCN fr_Imine
fr_NH0 fr_NH1 fr_NH2
fr_N_O fr_Ndealkylation1 fr_Ndealkylation2
fr_Nhpyrrole fr_SH fr_aldehyde
fr_alkyl_carbamate fr_alkyl_halide fr_allylic_oxid
fr_amide fr_amidine fr_aniline
fr_aryl_methyl fr_azide fr_azo
fr_barbitur fr_benzene fr_benzodiazepine
fr_bicyclic fr_diazo fr_dihydropyridine
fr_epoxide fr_ester fr_ether
fr_furan fr_guanido fr_halogen
fr_hdrzine fr_hdrzone fr_imidazole
fr_imide fr_isocyan fr_isothiocyan
fr_ketone fr_ketone_Topliss fr_lactam
fr_lactone fr_methoxy fr_morpholine
fr_nitrile fr_nitro fr_nitro_arom
fr_nitro_arom_nonortho fr_nitroso fr_oxazole
fr_oxime fr_para_hydroxylation fr_phenol
fr_phenol_noOrthoHbond fr_phos_acid fr_phos_ester
fr_piperdine fr_piperzine fr_priamide
fr_prisulfonamd fr_pyridine fr_quatN
fr_sulfide fr_sulfonamd fr_sulfone
fr_term_acetylene fr_tetrazole fr_thiazole
fr_thiocyan fr_thiophene fr_unbrch_alkane
fr_urea qed
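A minimal sketch of computing such a descriptor vector with RDKit (the exact descriptor set
exposed by rdkit.Chem.Descriptors may vary slightly between RDKit versions):

    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def rdkit_features(smiles):
        """Compute all RDKit 2D descriptors for one molecule."""
        mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
        return [func(mol) for _, func in Descriptors.descList]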
References
(1) Mayr, A.; Klambauer, G.; Unterthiner, T.; Steijaert, M.; Wegner, J. K.; Ceulemans, H.;
Clevert, D.-A.; Hochreiter, S. Large-Scale Comparison of Machine Learning Methods for
Drug Target Prediction on ChEMBL. Chemical Science 2018, 9, 5441–5451.
(2) Wu, Z.; Ramsundar, B.; Feinberg, E.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.;
Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chemical Science
2018, 9, 513–530.