
Efficient design of peptide-binding polymers using active learning approaches

2023, Journal of Controlled Release

bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021.

EFFICIENT DESIGN OF PEPTIDE-BINDING POLYMERS USING ACTIVE LEARNING APPROACHES

A. Rakhimbekova1, A. Lopukov2, N. Klyachko2, A. Kabanov2,3, T.I. Madzhidov1*, A. Tropsha4*

1 A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia. E-mail: timur.madzhidov@kpfu.ru
2 Laboratory of Chemical Design of Bionanomaterials, Faculty of Chemistry, M.V. Lomonosov Moscow State University, Moscow, Russia
3 Center for Nanotechnology in Drug Delivery and Division of Pharmacoengineering and Molecular Pharmaceutics, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, NC, USA
4 Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA. E-mail: alex_tropsha@unc.edu

Abstract

Active learning (AL) has become a subject of active recent research in both industry and academia as an efficient approach to the rapid design and discovery of novel chemicals, materials, and polymers. The key advantages of this approach are its ability to (i) employ relatively small datasets for model development, (ii) iterate between model development and model assessment using small external datasets that can be either generated in focused experimental studies or formed from subsets of the initial training data, and (iii) progressively evolve models toward increasingly reliable predictions and the identification of novel chemicals with the desired properties. Herein, we first compared various AL protocols for their effectiveness in finding biologically active molecules using synthetic datasets. We investigated the dependence of AL performance on the size of the initial training set, the relative complexity of the task, and the choice of the initial training dataset. We found that AL techniques applied to regression modeling offer no benefit over random search, whereas AL used for classification tasks performs better than models built for randomly selected training sets, although still quite far from perfectly. Using the best-performing AL protocol, we assessed the applicability of AL for the discovery of polymeric micelle formulations for poorly soluble drugs. Finally, the best-performing AL approach was employed to discover and experimentally validate novel binding polymers in a case study of the asialoglycoprotein receptor (ASGPR).

Keywords: active learning, molecular design, bioactivity, binders, polymers, polymer binders.

1. Introduction

Machine learning (ML) methods have been successfully applied in many areas of chemical research [1–5]. Classical ML models are trained on labeled datasets and then applied for virtual screening of chemical libraries to find hits with desired properties. However, in many areas of chemistry it is quite challenging to find a labeled dataset large enough to enable rigorous building and external validation of property prediction ML models, because experiments are too expensive and slow to collect enough data. These problems have given rise to the field of active learning (AL) algorithms [6].
Active learning is an iterative procedure in which a machine learning model proposes candidates for testing to the user, and the user then returns labeled candidates, which are used to update the model (Figure 1). Usually, the main task of AL is to maximize the predictive performance of models using the minimal required training data [6,7] and small datasets as user feedback. As applied to molecular and materials design, AL has been shown to accelerate the experimental discovery of promising candidates by rapidly optimizing the molecular properties of interest [8–11]. New candidates for testing in the AL framework are either generated de novo, if computational predictions are iterated with experimental testing, or selected from a pool of accessible data [6]. Candidates are typically prioritized for testing based on some expected benefit for improving the predictive performance of the models. In applications of AL for chemical property optimization, the candidates are in most cases selected from unexplored regions of chemical space where model uncertainty is the largest, which is referred to as "exploration" [7]. Alternatively, AL can be directed towards candidates with desired properties by maximizing the task-specific utility of a particular action/measurement [8–12]; this approach is referred to as "exploitation".

Figure 1. Active learning concept

There has been growing interest in both basic and applied research in exploring AL as a means of efficient search, design, and discovery of chemicals, materials, and polymers. For instance, there have been several recent efforts to use AL strategies to predict reaction yields [8], discover organic semiconductors [13], search for polymer dielectrics with a large band gap [14], design novel 19F magnetic resonance imaging (MRI) agents [15], develop ML models that predict quantum chemical properties [16–19], and facilitate drug discovery [10,11,20] (reviewed recently by Reker et al. [21]). In all the aforementioned examples, the initial datasets were usually quite large, and only a few studies were reported in which AL started from extremely small datasets [9,19]. Kim et al. [9] evaluated the effectiveness of three AL strategies (exploitation, exploration, and balanced exploitation-exploration) compared with a random selection approach (in which the training dataset is selected randomly) for the detection of polymers with high glass transition temperatures (Tg > 450 K). The initial surrogate model was built for five randomly selected polymers using Gaussian process regression (GPR). They found that the balanced exploration-exploitation approach had the highest accuracy and efficiency in identifying polymers with the desired properties. Loeffler et al. [19] attempted to minimize the size of the training set needed to build neural network models for predicting the energy of water clusters. They introduced an AL strategy that starts with minimal training data and is continuously updated via a nested ensemble Monte Carlo scheme.
The authors generated training data on the fly by selectively adding configurations from failed regions. They showed that this strategy can be applied to develop accurate force fields for molecular simulations using sparse training datasets.

Herein, we report on an AL methodology for the efficient and quick search for candidate compounds with the desired properties using minimal data and limited experimental testing. Following initial benchmarking studies of different AL protocols, we employ the best-performing AL approach for the discovery of synthetic polymers that can selectively bind specific peptides or small molecules. Such polymers can be used as synthetic scavengers of xenobiotics [22,23], as micelles for targeted delivery of peptides or small molecules [24], or to enhance the solubility of poorly soluble drugs [25], to name a few applications. Owing to the lack and sparsity of data, classic QSAR modeling cannot be applied in this case. Thus, this study focuses on the design of polymeric systems in the common case of small training datasets.

We chose the asialoglycoprotein receptor (ASGPR) as a target for molecular recognition screening. ASGPR is involved in the endocytosis of desialylated glycoproteins. The receptor is located predominantly on the surface of hepatocytes and is also found on urethral epithelial cells [26] and human sperm cells [27]. It is actively expressed by hepatocellular carcinoma (HCC) cells [28]. It belongs to the C-type lectin family, which also includes the macrophage mannose receptor, Dendritic Cell-Specific Intercellular adhesion molecule-3-Grabbing Non-integrin, a subset of selectins (E-selectin, P-selectin, L-selectin), and a number of others [29,30]. ASGPR takes part in the pathogenesis of Marburg hemorrhagic fever [31], viral hepatitis [32], and gonorrhea [26]. It is also involved in the development of colon adenocarcinoma metastasis [33], as well as in the metastasis of other cancers via activation of EGFR–ERK signaling with upregulation of MMP-9 [34]. ASGPR has been targeted for the delivery of anticancer therapy to HCC cells [35–38] and for the delivery of copper chelation agents as a therapy for Wilson disease [39]. ASGPR was used for identification of the ligand epitope and design of a polyplex-based gene carrier [40]. Other examples of ASGPR-based targeted delivery include the treatment of acute hepatic porphyria [41] and viral hepatitis [42,43]. It has also been exploited for the delivery of atorvastatin conjugated with a targeting moiety into hepatocytes for up-regulation of cholesterol metabolism in hypercholesterolemia patients [44], reducing side effects related to systemic distribution, such as myopathy and muscle pain [45]. ASGPR was also used as a model target for the development of a neuronal gene delivery platform: even though ASGPR is not expressed by primary sensory neurons, several galectin receptors with a similar ligand profile were detected. Thus, owing to its participation in various biological processes, ASGPR represents an attractive model object for creating a molecular recognition system. Although there are recognition systems based on small molecules, a macromolecule-based recognition system can exhibit greater specificity and stability. Macromolecules and polymers can interact with the cell membrane and the receptor in a fashion different from that of low molecular weight compounds.
The biodistribution and pharmacokinetic profile are expected to differ as well, offering the opportunity to overcome problems typical for small molecule-based drugs. A novel polymer-based targeting moiety could be conjugated with an active pharmaceutical ingredient for directed treatment of various liver diseases. Meanwhile, a water-soluble polymer-based recognition system could prevent the ASGPR-targeted interaction of various pathogens, such as viruses (Marburg virus, Hepatitis A and B viruses) or bacteria (Neisseria gonorrhoeae).

This paper is organized as follows. First, we conduct a series of synthetic experiments to study and compare various AL strategies using a small initial training set (5 to 50 data points) with known values of biological activities (pIC50, pKi) of chemical compounds. In doing so, the goal is to accelerate the discovery of the desired compounds, which is the ultimate objective of the experimental research, rather than to improve overall model performance. Second, we investigate how the performance of each AL strategy varies depending on (i) the size of the initial training set, (ii) different approaches to selecting the initial data, and (iii) the relative complexity of the task at hand. Finally, we evaluate the proposed AL methodology for finding polymers that can bind organic molecules, based on literature data [46] and using a real experimental setup, as well as polymers that can bind ASGPR.

2. Materials and methods

2.1. Datasets and descriptors

2.1.1. ChEMBL datasets

The comparison and selection of optimal AL approaches was carried out using datasets with known values of biological activities (pKi) of chemical compounds tested in biological assays for different target proteins; all the data were downloaded from the ChEMBL database (CHEMBL205, CHEMBL244, CHEMBL217) [47]. A histogram of the distribution of pKi values in the selected datasets is shown in Figure 2.

Figure 2. Distribution of the pKi values for all the datasets considered in this study

Molecules were randomly divided into training (80%) and external test (20%) sets (Table 1). Regression models were built using pKi values, whereas to build binary classification models we assumed that molecules with pKi greater than 7 were active and the others inactive (the distribution of active and inactive molecules is presented in Table 1).

Descriptors. Morgan fingerprints with a size of 1024 bits and a radius of 2, generated using the RDKit package [48], were used as descriptors for modeling.

Table 1. Sizes of training and external test sets for the three datasets. The numbers of active/inactive compounds are given in brackets.

Dataset    | Size of dataset   | Size of training set | Size of external test set
CHEMBL205  | 3808 (2463/1345)  | 3047 (1977/1070)     | 761 (486/275)
CHEMBL217  | 5011 (2016/2995)  | 4009 (1607/2402)     | 1002 (409/593)
CHEMBL244  | 3305 (1771/1534)  | 2644 (1403/1241)     | 661 (368/293)
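As a concrete illustration of the data preparation just described (1024-bit Morgan fingerprints of radius 2, an 80/20 split, and the pKi > 7 activity threshold), a minimal sketch is shown below. The CSV file and its column names ("smiles", "pKi") are illustrative assumptions, not part of the original study code.

```python
# Minimal sketch of the ChEMBL data preparation described above.
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 1024) -> np.ndarray:
    """Morgan fingerprint of one molecule as a dense numpy bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

data = pd.read_csv("chembl205_pki.csv")              # hypothetical ChEMBL export
X = np.vstack([morgan_fp(s) for s in data["smiles"]])
y_reg = data["pKi"].to_numpy()                        # continuous target (regression)
y_clf = (y_reg > 7.0).astype(int)                     # active (1) if pKi > 7, else 0

# 80% modeling pool (used for the AL simulations), 20% held-out external test set
X_pool, X_test, y_pool, y_test = train_test_split(X, y_clf, test_size=0.2,
                                                  random_state=0)
```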
2.1.2. Polymer datasets

We used literature data [46] to assess the application of AL strategies for the discovery of polymers with specified binding affinities for organic molecules. We employed published data [46] on the loading capacity (LC) of five hydrophobic drugs (curcumin (CUR), paclitaxel (PTX), the antiretroviral efavirenz (EFV), dexamethasone (DEX), and the anti-oxidative tanshinone IIA (T2A)) for micelles formed by 18 different amphiphilic polymers. The loading capacity values were measured at different time intervals: 0, 5, 10, and 30 minutes after adding the drug to the polymer solution. LC for PTX, EFV, and CUR was measured at all time points, while LC for DEX and T2A was measured only at 0 minutes. Thus, our collection included 252 data points for drug-polymer systems. For this dataset, a high loading capacity (LC > 33%) was observed for 77 drug-polymer systems. Such points were assigned the class label "active" ("1"), and the remaining systems were assigned to class "0". Models were developed for eight different datasets, each containing the LC data for a given drug measured at a given time interval. Only the solubilization data for curcumin (CUR) and paclitaxel (PTX) contained instances of both the active and inactive classes, and therefore these datasets were selected for model development (the other drugs were insoluble in the presence of any polymeric micelle; see Table S1, Supporting Materials). The stability of the polymer-drug compositions with CUR and PTX after a given time (0, 5, 10, and 30 minutes) was used for testing the AL approaches. Thus, we had 8 datasets with 18 data points each: a dataset included data for one drug (CUR or PTX) solubilized with 18 different polymeric micelles at one of the four time intervals. An individual virtual experiment was then conducted to design an optimally solubilizing polymer.

For experimental validation, the AL workflow was used for the design of a polymeric recognition system for ASGPR. The design of this system aimed to find block copolymers of PEG and polyamino acids (polylysine - PLKC, polyglutamate - PLE, polyaspartate - PLD) that bind to the model protein ASGPR. The PEG block can have a molecular weight of 5k or 1k; the PLKC, PLD, or PLE blocks can have 10, 30, 50, or 100 repeat units. The α-end of the polyethylene glycol could carry a CH3O group or an azide group. 13 different types of polymers were synthesized in the project (see the Results section).

Descriptors. To describe the chemical structure of small molecules, polymers, and drug-polymer systems, we used the modified simplex descriptors described in [25]. The descriptors for drug-polymer systems included:
i. traditional SiRMS descriptors (multiplets containing 4 atoms) of a pseudo small molecule representing a polymer. A pseudo small molecule is a block of the polymer repeated only once, with the corresponding terminal blocks of the polymer; in the case of a block copolymer, both blocks are repeated once;
ii. "mixture" simplex descriptors of a drug-polymer system, in which every simplex contains at least one atom from the drug molecule and at least one atom from a polymer building block;
iii. composition descriptors, which include the number of repetitions of each monomer in the polymer structure;
iv. the time interval of the loading capacity measurement (0, 5, 10, or 30 minutes); used only for the LC dataset.

Highly correlated descriptors (r ≥ 0.9) were removed (one descriptor of each correlated pair was chosen arbitrarily); similarly, low-variance descriptors were removed to reduce the dimensionality of the chemical space without losing important information. A total of 446 descriptors were obtained.
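This pruning step can be reproduced with standard pandas/scikit-learn tools, as in the minimal sketch below. The variance cutoff value is an assumption; the exact threshold is not reported in the text.

```python
# Sketch of the descriptor pruning described above: drop near-constant columns,
# then drop one descriptor from every pair correlated at r >= 0.9.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def prune_descriptors(desc: pd.DataFrame,
                      var_cutoff: float = 1e-3,
                      corr_cutoff: float = 0.9) -> pd.DataFrame:
    # 1) remove low-variance (nearly constant) descriptors
    vt = VarianceThreshold(threshold=var_cutoff).fit(desc)
    desc = desc.loc[:, vt.get_support()]
    # 2) for every highly correlated pair, arbitrarily drop the second member
    corr = desc.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= corr_cutoff).any()]
    return desc.drop(columns=to_drop)
```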
For the design of the ASGPR polymeric recognition system, descriptors of the target molecule were not added: only SiRMS descriptors [25] of the pseudo-small polymer molecule were used to describe the polymer structure, as described above. Low-variance and highly correlated descriptors were removed. As a result, 21 SiRMS descriptors were obtained, augmented by 2 descriptors describing the polymer composition (the MW of the PEG block and the number of polyamino acid repeat units). SiRMS descriptors were calculated using the Python implementation available on GitHub: https://github.com/DrrDom/sirms. Descriptor selection was accomplished using built-in tools from the scikit-learn library [49,50].

2.2. Modelling

Random Forest Regression (denoted RFR), Random Forest Classification (denoted RFClf), and Gaussian Process Regression (denoted GPR) were employed for building quantitative structure-property relationship (QSPR) models. Implementations of the RFR, RFClf, and GPR methods were taken from the scikit-learn library [50].

Random Forest is an ensemble of multiple decision trees trained independently on random subsets of the training data. This method has a small number of hyperparameters and is insensitive to the presence of many descriptors. The predicted classification values are defined by majority voting for one of the classes; the predicted regression values are defined by the mean prediction of the individual trees. As the regression model's prediction confidence, required for some AL approaches, we used the variance of the tree predictions in the Random Forest; we slightly modified the original RFR code to extract this variance together with the prediction itself. The number of trees in RFR and RFClf was 500, and the number of features considered at each tree split (the max_features option) was set to log2 of the number of features. These settings showed the best performance in our tests; other hyperparameters of RFR and RFClf were set to their default values.

GPR assumes that the joint distribution of a real-valued property of chemical objects and their descriptors is multivariate normal (Gaussian), with the elements of its covariance matrix computed using special covariance functions (kernels). A GPR model produces a posterior conditional distribution (the so-called prediction density) of the property given the descriptor vector of each chemical object. The prediction density is normal (Gaussian), with the mean corresponding to the predicted value of the property and the variance corresponding to the prediction confidence [51]. Based on our tests, the GPR noise-level hyperparameter alpha was set to 0.1 and the RBF kernel's gamma value to 10; other GPR hyperparameters were left at their defaults.
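As an illustration of these uncertainty estimates, the per-tree variance of a scikit-learn random forest can be obtained without modifying the library, and a GPR with an RBF kernel returns a predictive standard deviation directly. The sketch below is a minimal illustration; note that scikit-learn's RBF kernel is parameterized by a length scale rather than a gamma, so the conversion shown is an assumption about how the reported gamma = 10 maps onto it, and the training/candidate arrays are assumed to exist.

```python
# Sketch of the surrogate models and their uncertainty estimates.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def rf_predict_with_variance(rf: RandomForestRegressor, X):
    """Mean and variance of the predictions of the individual trees of a fitted forest."""
    per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
    return per_tree.mean(axis=0), per_tree.var(axis=0)

rfr = RandomForestRegressor(n_estimators=500, max_features="log2")
rfclf = RandomForestClassifier(n_estimators=500, max_features="log2")

# An RBF kernel written as exp(-gamma * d^2) with gamma = 10 corresponds to
# scikit-learn's exp(-d^2 / (2 l^2)) with length scale l = 1 / sqrt(2 * 10).
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0 / np.sqrt(2 * 10)),
                               alpha=0.1)

# Hypothetical usage (X_train, y_train, X_cand assumed to exist):
# rfr.fit(X_train, y_train)
# y_mean, y_var = rf_predict_with_variance(rfr, X_cand)
# gpr.fit(X_train, y_train)
# y_mean_gp, y_std_gp = gpr.predict(X_cand, return_std=True)
```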
2.3. Testing different AL strategies using ChEMBL datasets

The different AL approaches were tested using the ChEMBL datasets described above. Our objective was to compare these strategies for their relative ability to find as many candidate compounds possessing the desired property as possible.

Training and test datasets. The dataset used to train the surrogate model for the first time is called the initial set; it is usually quite small (5, 10, 20, or 50 objects). After an AL cycle, the initial dataset is enriched by a fixed number of new data points, and the resulting dataset is called the current set. As the AL cycles are executed, the size of the current set gradually increases. As the validation set for each model, we use the data remaining after exclusion of the current set data points from the training set. All model quality metrics are measured on the external test set, which is selected in advance and is by no means used in modeling (Figure 3).

Selection of the initial dataset. The active learning cycle starts by training the surrogate model on an initial dataset of N objects (N = 5, 10, 20, 50); see Figure 3. The initial dataset comprises data points that have to be experimentally measured prior to AL modeling. Two strategies for selecting the initial dataset were tested:
1. random selection (RANDOM) – N objects are randomly selected from the set of possible candidates;
2. selection of the most diverse objects (DIVERSE) – the entire sample is clustered by the KMeans algorithm implemented in the scikit-learn library [50], and one random object is selected from each cluster to form the initial dataset. The number of clusters was thus set equal to the desired size of the initial set.

Active learning strategies. First, the descriptors of the objects selected for testing, as well as their property values, are retrieved. We then train the surrogate model on the initial set using a machine learning model that relates the descriptors of the molecules to their pKi. After building the surrogate model, pKi values (along with the associated uncertainties) or class probabilities are predicted for the remaining candidate molecules from the validation set. Based on these predictions, K (hereafter, K = 1, 2, or 5) molecules are selected for further experimental testing and added to form the current set. The following selection strategies were tested:

Strategy 1. Exploitation (Y-MAX) – the top K molecules with the highest predicted values are selected as new objects.

Strategy 2. Exploration (Y-VAR) – objects with higher prediction uncertainties are selected. For regression, the candidates with the maximum variance of the predictions returned by RFR are selected. For RFClf, we calculated the variance of the predictions from the Bernoulli distribution as var = p(1 - p), where p is the probability that the molecule is active; the closer the probability is to 0.5, the larger the variance. The exploration approach is efficient in making the model more robust, but the added object may not be active [6,7].

Strategy 3. Hybrid selection (Y-MAX(M)-Y-VAR(L)) – the best M = K - L candidates are selected according to the Y-MAX strategy, while the best L candidates are selected according to the Y-VAR strategy. This approach balances exploitation and exploration. In this work, Y-MAX(2)-Y-VAR(3) or Y-MAX(1)-Y-VAR(1) is used.
Strategy 4. Random selection – in this strategy, we randomly select new candidates at every iteration and add them to the initial or current set. This strategy serves as a negative control.

Strategy 5. Hypothetical perfect strategy – this approach approximates an optimal selection strategy. To find the perfect candidate to add, we look for the candidate whose addition to the current set improves the predictive performance of the surrogate model the most. It cannot be considered an AL strategy, since we need to know the actual property value of every candidate to select the best one (in AL we do not know the candidates' properties in advance). This approach is used for benchmarking purposes only.

The AL cycle was repeated until the size of the current set reached 150 objects (Figure 3). The entire process was repeated 10 times for each strategy, each time starting from a different initial dataset, to eliminate the bias associated with the initial selection of N points.

Statistical analysis. Once the surrogate model is built, its utility for selecting new objects is assessed on the external test set. The following statistical metrics were used to assess different aspects of the performance of the AL strategies:

1. Model accuracy performance – the predictive performance of the models obtained on a current set. The coefficient of determination R2 was used to characterize the accuracy of the regression models and was calculated by the classic formula (1). Balanced accuracy (BA) was used to assess classification model performance and was calculated by formula (2). The motivation for this metric is that AL is often used to build the best model with the least amount of data [6,7]; also, a model with high predictive ability can better select candidates with the desired property values for further testing.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i^{exp} - y_i^{pred}\right)^2}{\sum_{i=1}^{n}\left(y_i^{exp} - \bar{y}^{exp}\right)^2} \qquad (1)$$

$$BA = \frac{TPR + TNR}{2}, \quad \text{where } TPR = \frac{TP}{TP + FN} \text{ and } TNR = \frac{TN}{TN + FP} \qquad (2)$$

2. Candidate improvement performance. This metric reflects the quality of the candidate selection process, i.e., how well the surrogate model selects highly active compounds from the external test set. The quality of selection was characterized using the Enrichment Factor criterion (EF10). For regression, it equals the fraction of the 10% most active compounds found in the list of the 10% most active compounds as predicted by the model. For classification, EF10 equals the fraction of the 10% most active compounds (selected according to pKi values) found in the list of the 10% of molecules with the highest predicted probability of the positive class (including only objects with a probability greater than 0.5). Thus, if the model correctly retrieves the most active compounds, EF10 reaches its maximum value of 1.

Figure 3. Illustration of the AL method. Left – dataset division into training and external test sets. Right – overview of the AL algorithm. First, the AL model is trained based on current knowledge. Using the surrogate model, predictions and associated uncertainties are obtained for new points. New points are selected based on the respective AL strategies (see text). After testing the newly selected points, the results are used to update the current knowledge, and iterations are repeated until the desired goal is achieved.
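To make the candidate-selection step of Figure 3 concrete, the sketch below implements one AL iteration for the classification setting, with Y-MAX, Y-VAR (Bernoulli variance p(1 - p)), and the hybrid Y-MAX(M)-Y-VAR(L) rule as simple acquisition functions over the candidate pool. It is a minimal illustration under the assumptions above, not the code used in the study; for regression, the per-tree variance helper sketched in Section 2.2 would replace the class-probability variance.

```python
# One AL iteration: score the remaining pool with the surrogate model and
# pick K candidates by Y-MAX, Y-VAR, or the hybrid Y-MAX(M)-Y-VAR(L) rule.
import numpy as np

def select_y_max(scores, k):
    """Exploitation: indices of the K highest predicted values/probabilities."""
    return np.argsort(scores)[::-1][:k]

def select_y_var(uncertainty, k):
    """Exploration: indices of the K most uncertain candidates."""
    return np.argsort(uncertainty)[::-1][:k]

def select_hybrid(scores, uncertainty, m, l):
    """Y-MAX(M)-Y-VAR(L): top-M by predicted value, then top-L by uncertainty."""
    chosen = list(select_y_max(scores, m))
    for idx in select_y_var(uncertainty, len(uncertainty)):
        if idx not in chosen:
            chosen.append(idx)
        if len(chosen) == m + l:
            break
    return np.array(chosen)

def al_iteration_classification(model, X_current, y_current, X_pool,
                                k=5, strategy="hybrid"):
    """Fit the surrogate on the current set and return pool indices to test next."""
    model.fit(X_current, y_current)
    p_active = model.predict_proba(X_pool)[:, 1]   # probability of class "active"
    variance = p_active * (1.0 - p_active)         # Bernoulli variance, largest at p = 0.5
    if strategy == "y_max":
        return select_y_max(p_active, k)
    if strategy == "y_var":
        return select_y_var(variance, k)
    # hybrid, e.g. Y-MAX(2)-Y-VAR(3) for k = 5 or Y-MAX(1)-Y-VAR(1) for k = 2
    return select_hybrid(p_active, variance, m=k // 2, l=k - k // 2)
```

In the benchmarking runs described above, this step would simply be repeated, moving the selected points from the validation pool into the current set, until the current set reaches 150 objects.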
2.4. Experimental settings for the ASGPR binding measurements

The experiments were conducted on a Biacore X100 instrument (Biacore AB, Uppsala, Sweden) using a CM5 chip. ASGPR from rabbit liver was purchased from Generic Assays (GA Generic Assays GmbH, Berlin, Germany) and used for the SPR assays. The ASGPR was immobilized according to the standard amine-coupling protocol provided by the manufacturer. Polymer ligands were dissolved in running buffer (150 mM NaCl, 50 mM CaCl2, 50 mM Tris, pH 7.4), followed by serial dilution to 5×10^-6 M. Samples were injected at a flow rate of 20 μL/min at 25 °C for 30 s, followed by 30 s of dissociation. The sensor chip was regenerated by injection of 20 μL of 20 mM EDTA. All solutions were filtered and deoxygenated. Data were analyzed using BIAevaluation 3.0 software. The KD values were evaluated using the 1:1 Langmuir association model. Data are mean ± SD of n = 3 independent measurements.

3. Results and discussion

3.1. Active learning strategy for finding biologically active molecules in "virtual" experiments

To compare the different AL strategies introduced above, we simulated the AL workflow using "virtual" experiments, represented by the collection of datasets with known values of biological activities (pKi) of chemical compounds.

3.1.1. Optimization of a continuous property

First, the performance of AL using regression modelling for the optimization of continuous properties was tested. Table 2 shows the performance of RFR and GPR models trained on the entire training set; the performance of the two regression methods is quite similar.

Table 2. Coefficient of determination (R2), RMSE of predictions, and EF10 estimated on the external test set using RFR and GPR models trained on the entire training sets.

Dataset    | R2 (RFR) | RMSE (RFR) | EF10 (RFR) | R2 (GPR) | RMSE (GPR) | EF10 (GPR)
CHEMBL205  | 0.69     | 0.64       | 59         | 0.69     | 0.64       | 59
CHEMBL217  | 0.61     | 0.60       | 57         | 0.65     | 0.56       | 61
CHEMBL244  | 0.78     | 0.76       | 61         | 0.81     | 0.71       | 67

We analyzed the influence of the different strategies (Y-MAX, Y-VAR, and Random selection) on the performance of the AL algorithms for RFR and GPR models (see Supporting Materials, Figures S1-S2). In general, the RFR and GPR machine learning approaches were characterized by quite similar changes in the model accuracy (R2) and virtual screening (EF10) metrics as the dataset size increased; thus, only RFR is used below, owing to its simplicity and speed. The maximum value of the coefficient of determination reached 0.38 after 150 iterations, and the maximum values of EF10 were about 30-40%, which is much lower than the corresponding metrics for models trained on all data (Table 2). None of the machine learning approaches combined with the Y-MAX or Y-VAR AL strategies was better than random selection within 150 iterations. Notably, the Y-VAR strategy was almost always better than the Y-MAX strategy. Since the AL cycle in the experiments above stopped when 150 compounds had been selected, which can be insufficient for modeling, we also tested the AL strategies by adding compounds iteratively until the current set contained the whole training set (Figure 4). It was found that, to achieve 70% of the performance of the model built on the entire dataset, more than 1000-1500 points had to be added to the current set.
Moreover, the R2 of the considered AL strategies was never better than that of Random selection (red line in Figure 4), showing no benefit of applying AL in this case. Interestingly, despite quite poor R2 values, the Y-MAX strategy shows an EF10 curve above the Random selection line for the CHEMBL217 dataset [47]. Since our goal was to find the desired compounds with minimal experimental effort, the Y-MAX AL strategy is a reasonable alternative to random search. In general, for the datasets explored in this study, we saw no benefit of applying AL in regression modeling. A possible reason for this observation is the selection of noisy datasets with a high random measurement error, collected from different bioassays. Such large fluctuations could mimic true differences in activity between compounds, especially when small datasets are used.

Figure 4. Influence of different AL strategies on the quality of RFR. The initial set contained 10 points; 5 objects were added at every AL cycle. The active learning strategies are Y-MAX (green line), Y-VAR (blue line), and random (red line); the black dotted line represents the performance of the model trained on the entire training data set. Lines correspond to median values over 10 repetitions.

3.1.2. Binary classification models

To reduce the noise in the pKi values, which could lead to weak model performance in the AL iterations, classification modeling was selected for surrogate model building. Compounds with pKi greater than 7 were considered active; otherwise, they were labelled inactive. Models built with the conventional QSAR approach using the complete training sets were characterized by reasonably high balanced accuracy (more than 0.8, Table 3).

Table 3. Predictive performance on the external test sets of RFClf models trained on the full training sets.

Dataset    | TN  | FP | FN  | TP  | ACC  | TPR  | TNR  | BA   | PPV  | AUC  | EF10
CHEMBL205  | 206 | 69 | 38  | 448 | 0.86 | 0.92 | 0.85 | 0.84 | 0.87 | 0.92 | 41.0
CHEMBL217  | 532 | 61 | 101 | 308 | 0.84 | 0.75 | 0.89 | 0.83 | 0.83 | 0.91 | 38.0
CHEMBL244  | 257 | 36 | 35  | 333 | 0.89 | 0.90 | 0.88 | 0.89 | 0.90 | 0.95 | 35.5

The influence of the different strategies (Y-MAX, Y-VAR, and Random selection) on the performance of the AL algorithms for RFClf models is shown in Figures 5-7. While in the case of regression modeling the predictive performance of the models was poor (R2 ~ 0) when only up to 150 data points had been explored (cf. Figure 4), for classification model building the Y-VAR strategy performed better than the regression models even when the datasets were small (BA significantly greater than 0.5, Figure 5). In contrast, the Y-MAX strategy yielded mediocre BA on small datasets, with no trend toward improvement as up to 150 data points were added to the current set (Figure 5). On the other hand, the regression models reached higher EF10 values than the classification models, even when all training data were used for modeling (black lines in Figures 4 and 5). Thus, a continuous scale of predicted values helps the model to better select the most active compounds.
This is quite logical, since the classifier has only class labels, without information on the actual activity values.

We analyzed the effect of the initial data selection procedure on the performance of each AL strategy (Figure 5; results for other sizes of the initial training set and for the other datasets are given in Figures S3-S6 of the Supporting Materials). It turned out that forming the initial dataset from the most diverse objects had no advantage over random selection. This conclusion was valid for all datasets, all sizes of the initial training set, and both AL strategies (Y-MAX, Y-VAR). Thus, below we show results for the simplest method, random selection of the initial set.

The size of the initial dataset also influences model performance: in general, larger initial sets yield models with greater performance. BA values for different initial set sizes are shown in Figure 6 for the CHEMBL244 dataset. For the Y-MAX strategy it is better to start from a larger dataset, since, as shown above (Figure 4), Y-MAX selection leads to a worse model than random selection of the same number of molecules. In contrast, with the Y-VAR strategy one may start with a smaller initial dataset and iteratively add objects to it, and the model performance will be similar (Figure 7).

Figure 5. Influence of different selection protocols for the initial dataset on AL performance (CHEMBL244 dataset). The initial set was selected randomly (Random, deep green or deep blue) or as a diverse set using clustering (Cluster, light green or light blue). Y-MAX (deep and light green) and Y-VAR (deep and light blue) strategies were tested. The initial set contains 10 points; 5 objects were added at every AL cycle. The black dotted line represents the performance of the model trained on the entire training data set. Statistics correspond to 10 repetitions of the AL cycle; lines correspond to median values.

Figure 6. Influence of the initial dataset size (CHEMBL244 dataset) on BA for the external test set. Y-MAX (top) and Y-VAR (bottom) strategies were tested. Five objects were added at every AL cycle. The black dotted line represents the performance of the model trained on the entire training data set. Error bars are calculated by averaging the results of 10 repetitions of the AL selection cycle.

The influence of the different AL strategies (Y-MAX, Y-VAR, and hybrid Y-MAX-Y-VAR) was thoroughly investigated in comparison with Random selection and the Hypothetical perfect strategy (Figure 7). Note that the hypothetical perfect model has both the best BA values (closest to the model built on all data) and the highest EF10, supporting the intuitive notion that more performant models are more efficient at selecting compounds. As in the case of the regression models, the Y-MAX strategy improves the performance of the classification models very slowly.
The models were biased towards the positive class, showing a high TPR on the external dataset (close to 1.0; Figure S7, Supporting Materials) but a low precision (PPV in Figure S7), which is close to the positive class fraction in the dataset. We also observed that the EF10 values for Y-MAX were the worst among all strategies. Only when the current dataset reaches 500 molecules or more does EF10 increase, and the approach then becomes the best at selecting active compounds, with a small advantage over Random selection (Figure S7), although the model remains biased towards the positive class.

As shown in Figures 7 and 8, the Y-VAR strategy and, to a lesser extent, the hybrid Y-MAX-Y-VAR strategy had an advantage over Random selection in improving the performance of the classification models (BA values on the external test set). However, when small current sets are selected (up to 150 objects, Figure 7), the EF10 values for the Y-VAR and hybrid Y-MAX-Y-VAR strategies did not outperform models that relied on Random selection. With further selection by Y-VAR (Figure S7), the EF10 values became generally even worse.

Hybrid Y-MAX-Y-VAR selection in combination with classification models showed reasonable quality on the ChEMBL datasets (Figure 7). It showed slightly lower BA values than Y-MAX, but still generally better than Random selection. At the same time, it showed the largest EF10 value, reflecting its utility for the selection of active compounds (Figure 7, top right). It is usually better than Random selection and approaches the accuracy seen with the Hypothetical perfect strategy; however, the fluctuations are rather high. Thus, hybrid Y-MAX-Y-VAR selection shows reasonable quality in both tests, adopting the best features of Y-VAR, which enhances model quality, and of Y-MAX, which is usually better at selecting active compounds.

To summarize, in our virtual experiments we compared different AL approaches. Application of AL to continuous-value properties required the selection of rather large datasets, whereas for a binary property the AL strategies showed good results even for small datasets. Neither the machine learning method nor the initial set selection strategy influenced the accuracy of the AL models; conversely, the selection strategy affected AL performance the most. The exploration (Y-VAR) and hybrid strategies could select compounds such that the model accuracy was higher than if compounds were selected randomly. The hybrid strategy also allows more efficient selection of compounds with the desired property, and thus we chose this approach for further application.

Figure 7. Performance of different AL strategies: Y-MAX (green box and line), Y-VAR (blue box and line), Y-MAX(2)-Y-VAR(3) (orange box and line), in comparison with random selection (red dotted line) and the hypothetical ideal model (cyan dotted line). Lines correspond to median values over repetitions. The initial datasets consist of 10 points.
The black dotted line represents the performance of the model trained on the entire training data set.

3.2. Testing the active learning strategy in searching for polymers with good solubilizing ability

We tested the AL strategy in a case study in which we searched for polymers that form micelles used in drug delivery systems. The dataset of drug LC for different polymeric micelles was extracted from the publication [46]. To form the initial training dataset, five random polymers were selected. The model was built using the Random Forest Classification method and the special polymer-drug descriptors developed previously [25]. The hybrid strategy Y-MAX(1)-Y-VAR(1), which proved to be the best on the ChEMBL datasets, was then applied for AL. The AL cycle was repeated 30 times for every dataset to collect statistics. Since we had only 18 objects in the training set, five of them were selected to form the initial set, and with every iteration one object was added. The performance statistics of the approach were collected on the validation set only, i.e., the objects left in the modeling set after selection of the current set. The iterations were carried out until the objects in the pool were exhausted.

Figure 8 shows a typical plot describing the performance of the selection process (plots for the other systems are given in Figures S8-S9, Supporting Materials). The median BA of the model built on the current set selected by the hybrid strategy was greater than for randomly selected current sets (Figure 8, left). It also provided a good model (mean BA of about 0.83) after only 10 iterations. Moreover, the chance of finding an active polymer in the first iterations of the AL procedure was maximal, reaching 50-70%; note that the chance of selecting an active polymer at random is only 33%. This gain in the chance of finding a solubilizing polymer can be explained by the detection of all active polymers at the early stages. Very similar results were obtained for the other datasets (Figures S8 and S9, Supporting Materials). For curcumin (CUR), the hybrid strategy did not show any boost in model performance compared with random selection, probably because of the high baseline chance of finding active compounds (almost 50%). At the same time, one can notice (Figure S8, Supporting Materials) that the chance of finding an active compound in the first iterations of the AL cycle (>85%) is much higher than the baseline probability. To summarize, the developed AL strategy can be successfully applied to select polymer compositions that provide high LC for drug molecules. This strategy shows good performance even if only a few (5) experimental data points are available as the initial set.

Figure 8. Performance of the selection of the optimal polymer for PTX solubilization. Solubilizing ability was studied after 5 minutes of mixing. Left – the predictive performance (BA) on the validation set of the model built on the current set. Right – the chance that the selected polymer is active (solubilizing). The baseline BA value for the model built on all data, assessed by leave-one-out validation, is shown by the black dotted line.

3.3. Validating the active learning strategy in an experimental case study

The developed AL strategy and modeling procedure were tested in a real-world scenario: finding specific polymers that bind a specific protein.
This is a complicated case, since the polymer space is rather large and there is no clear understanding of which features of a polymer are responsible for protein binding. In addition, the experiments are costly and slow, so a dataset of a size reasonable for QSAR modeling can hardly be collected. Here, we searched for block copolymers of polyethylene glycol (PEG) and polyamino acids (polylysine - PLKC, polyglutamate - PLE, polyaspartate - PLD) that could bind to the model ASGPR protein at pH 7.4. These polymers are non-toxic and can be applied as scavengers or tracers of proteins, or as polymer micelles for targeted delivery.

The initial search space contained 13 different types of polymers (Table 4). Six polymers (4 poly-L-lysine polymers with different numbers of repeat units and different PEG masses, and 2 poly-L-glutamate polymers with different numbers of repeat units) were randomly selected, and their KD constants were measured. KD values ranging from 0.51 to 3.36 mM were obtained (Table 4). Polymers with KD values below 1 mM were considered active (i.e., polymers #2 and #12 in Table 4).

Table 4. Dataset used to test the AL strategy. The KD column contains KD values of the initially tested polymers; the six polymers with measured KD values constituted the initial training set. Two polymers (#10 and #11) were selected after the first iteration of the AL model; the last column, KD (1st iteration), shows the KD values measured for these polymers.

#  | 1st block | Mw | 2nd block | DP  | α-PEG | KD, mM | KD (1st iteration), mM
1  | PEG       | 5k | PLKC      | 50  | m     | 3.27   |
2  | PEG       | 5k | PLKC      | 100 | m     | 0.85   |
3  | PEG       | 5k | PLKC      | 50  | N3    |        |
4  | PEG       | 5k | PLKC      | 30  | m     |        |
5  | PEG       | 5k | PLKC      | 10  | m     |        |
6  | PEG       | 5k | PLE       | 50  | m     |        |
7  | PEG       | 5k | PLE       | 50  | N3    |        |
8  | PEG       | 5k | PLE       | 100 | m     |        |
9  | PEG       | 5k | PLD       | 10  | m     |        |
10 | PEG       | 1k | PLD       | 10  | m     |        | 1.55
11 | PEG       | 1k | PLKC      | 100 | m     |        | 0.02
12 | PEG       | 1k | PLKC      | 30  | m     | 0.51   |
13 | PEG       | 1k | PLE       | 10  | m     |        |

The α-end of the polyethylene glycol carries either a CH3O group (m) or an azide group (N3). KD values of 1.20, 3.36, and 1.69 mM were measured for the remaining three polymers of the initial training set.

SiRMS descriptors [52] of the pseudo-small polymer molecules were employed to describe the polymer structure. The hybrid active learning strategy Y-MAX(1)-Y-VAR(1) was applied with the classification model: one polymer (#11) with the highest predicted probability of being active and one (#10) with the highest variance were selected at the first iteration. According to the surrogate model, polymer #10 was predicted to be inactive, while polymer #11 was predicted to be active. The dissociation constants of these polymers were then measured experimentally, and the predictions were fully confirmed: the predicted active polymer did indeed have a low dissociation constant, whereas polymer #10 was not active. Note that polymer #11 had a dissociation constant (0.023 mM) lower than that of the most active initial candidate (KD = 0.51 mM), so even the first iteration was successful. A second AL iteration was undertaken to continue searching for the best-performing polymer among the remaining candidates; however, the AL system did not find any additional putative active candidates in the set, and the calculations were terminated.

To conclude, the developed AL approach demonstrated its applicability and accuracy both in "virtual" experiments, for finding biologically active molecules and solubilizing polymers, and in a real setting, searching for polymers with a desired ability to bind a specific protein.
Conclusion

In this study, we introduced a methodology that uses classical machine learning methods (classification and regression) in combination with an AL system to discover compounds with the desired properties in a minimal number of experimental measurements. We benchmarked the performance of three AL strategies (Y-MAX, Y-VAR, and hybrid Y-MAX-Y-VAR) against Random selection and the Hypothetical perfect strategy that we proposed, for both continuous-property and binary datasets. Both regression and classification scenarios for AL were explored. In continuous-value (regression) modeling, AL required the selection of quite a large number of objects to make decisions more informative than random selection. In the classification setting, some AL approaches were found to work well even for small datasets. Thus, classification modeling is more efficient in an AL scenario when only a small exploration of chemical space is possible.

For classification modeling, adding the top candidates (by Y-MAX) led to bias in the selected experiments and thereby did not improve the quality of the models over random selection. Adding candidates based on Y-VAR (objects with the highest uncertainty are selected at every iteration) causes constant improvement of the model performance; however, the chance of finding a highly active compound with the respective model is lower than with a model built on a random dataset of the same size. Thus, the hybrid Y-MAX-Y-VAR strategy was proposed, which selects roughly half of the objects by Y-MAX and the other half by Y-VAR. The hybrid Y-MAX-Y-VAR strategy combined the benefits of both parent approaches and was capable of continually improving both the performance of the classification models and the chance of finding highly active compounds.

The influence of the AL method, the size of the initial training set, and the way of selecting the initial data on AL performance was thoroughly analyzed in "virtual" experiments, in which we used datasets of chemical compounds with known biological activities to test the AL selection strategies. We found that forming the initial dataset from the most diverse objects had no advantage over random selection of the initial set. For Y-MAX, the model performance obtained with a larger initial training set was higher than that obtained with a smaller one; however, if the Y-VAR strategy is applied, one can start from a smaller dataset and constantly add objects without loss of model accuracy. We also found that regression models, even those built using the Y-VAR strategy, improve their accuracy much more slowly over the AL cycles than classification models, and have lower performance metrics (R2) than models built on random datasets. Moreover, even the AL approach best at improving model performance, Y-VAR, is quite far from the hypothetical perfect AL strategy, meaning that there is still room for improvement and better strategy selection.

The developed AL algorithm was tested on practical cases where AL is really needed: the search for polymers selectively binding protein targets. The problem is challenging owing to the lack of mechanistic insight into the binding of polymers to drugs and proteins, and to the expensive and slow experimental data collection. Special descriptors reflecting drug and polymer structure were applied for building the surrogate model.
The developed AL algorithm demonstrated its applicability and accuracy both in "virtual" experiments on polymers solubilizing small-molecule drugs and in real experiments searching for polymers with a given ability to bind a protein. Previously, AL was primarily used to assemble quite large datasets for building better models [53] or to enhance the speed of computational drug design campaigns [54], and these studies mostly used quite large amounts of data for model building. Here, we show how AL can be efficiently used in the case of extremely small datasets, when the cost of an experiment is high and training sets of reasonable size cannot be collected. This reinforces the idea that a tighter interface between experiment and computation opens the door to extremely time- and resource-efficient design campaigns.

Conflict of interest

None.

Acknowledgements

This work was supported by the Russian Science Foundation grant 20-63-46029.

Abbreviations

AL – active learning; ASGPR – asialoglycoprotein receptor; BA – balanced accuracy; HCC – hepatocellular carcinoma; LC – loading capacity; PLKC – polylysine; PLE – polyglutamate; PLD – polyaspartate; PEG – polyethylene glycol; CUR – curcumin; PTX – paclitaxel; EFV – efavirenz; DEX – dexamethasone; T2A – tanshinone IIA; RFR – Random Forest Regression; RFClf – Random Forest Classification; GPR – Gaussian Process Regression; QSPR – quantitative structure-property relationships; QSAR – quantitative structure-activity relationships; RBF – radial basis function; EF – enrichment factor; TPR – true positive rate.

References

1. Baskin I.I. et al. Artificial intelligence in synthetic chemistry: achievements and prospects // Russian Chemical Reviews. 2017. Vol. 86, № 11. P. 1127–1156.
2. Artrith N. et al. Best practices in machine learning for chemistry // Nature Chemistry. 2021. Vol. 13, № 6.
3. Coley C.W., Eyke N.S., Jensen K.F. Autonomous Discovery in the Chemical Sciences Part I: Progress // Angewandte Chemie International Edition. 2020. Vol. 59, № 51.
4. Cherkasov A. et al. QSAR Modeling: Where Have You Been? Where Are You Going To? // Journal of Medicinal Chemistry. 2014. Vol. 57, № 12.
5. Muratov E.N. et al. QSAR without borders // Chemical Society Reviews. 2020. Vol. 49, № 11.
6. Settles B. Active Learning // Synthesis Lectures on Artificial Intelligence and Machine Learning. 2012. Vol. 6, № 1.
7. Reker D., Schneider G. Active-learning strategies in computer-assisted drug discovery // Drug Discovery Today. 2015. Vol. 20, № 4.
8. Eyke N.S., Green W.H., Jensen K.F. Iterative experimental design based on active machine learning reduces the experimental burden associated with reaction screening // Reaction Chemistry & Engineering. 2020. Vol. 5, № 10.
9. Kim C. et al. Active-learning and materials design: the example of high glass transition temperature polymers // MRS Communications. 2019. Vol. 9, № 3.
10. Jastrzębski S. et al. Emulating Docking Results Using a Deep Neural Network: A New Perspective for Virtual Screening // Journal of Chemical Information and Modeling. 2020. Vol. 60, № 9.
11. Graff D.E., Shakhnovich E.I., Coley C.W. Accelerating high-throughput virtual screening through molecular pool-based active learning // Chemical Science. 2021.
12. del Rosario Z. et al. Assessing the frontier: Active learning, model accuracy, and multi-objective candidate discovery and optimization // The Journal of Chemical Physics. 2020. Vol. 153, № 2.
13. Kunkel C. et al. Active discovery of organic semiconductors // Nature Communications. 2021. Vol. 12, № 1.
14. Lookman T. et al. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design // npj Computational Materials. 2019. Vol. 5, № 1.
15. Reis M. et al. Machine-Learning-Guided Discovery of 19F MRI Agents Enabled by Automated Copolymer Synthesis // Journal of the American Chemical Society. 2021. Vol. 143, № 42.
16. Smith J.S. et al. Less is more: Sampling chemical space with active learning // The Journal of Chemical Physics. 2018. Vol. 148, № 24.
17. Gubaev K., Podryabinkin E.V., Shapeev A.V. Machine learning of molecular properties: Locality and active learning // The Journal of Chemical Physics. 2018. Vol. 148, № 24.
18. Melnikov A.A. et al. Active learning machine learns to create new quantum experiments // Proceedings of the National Academy of Sciences. 2018. Vol. 115, № 6.
19. Loeffler T.D. et al. Active Learning the Potential Energy Landscape for Water Clusters from Sparse Training Data // The Journal of Physical Chemistry C. 2020. Vol. 124, № 8.
20. Kangas J.D., Naik A.W., Murphy R.F. Efficient discovery of responses of proteins to compounds using active learning // BMC Bioinformatics. 2014. Vol. 15, № 1.
21. Reker D. Practical considerations for active machine learning in drug discovery // Drug Discovery Today: Technologies. 2019. Vol. 32–33.
22. Liu Z.-W., Han B.-H. Evaluation of an Imidazolium-Based Porous Organic Polymer as Radioactive Waste Scavenger // Environmental Science & Technology. 2020. Vol. 54, № 1.
23. Samanta P. et al. Chemically stable microporous hyper-cross-linked polymer (HCP): an efficient selective cationic dye scavenger from an aqueous medium // Materials Chemistry Frontiers. 2017. Vol. 1, № 7.
24. Batrakova E.V. et al. Polymer Micelles as Drug Carriers // Nanoparticulates as Drug Carriers. Imperial College Press / World Scientific Publishing, 2006.
25. Alves V.M. et al. Cheminformatics-driven discovery of polymeric micelle formulations for poorly soluble drugs // Science Advances. 2019. Vol. 5, № 6.
26. Harvey H.A. et al. Receptor-mediated endocytosis of Neisseria gonorrhoeae into primary human urethral epithelial cells: the role of the asialoglycoprotein receptor // Molecular Microbiology. 2008. Vol. 42, № 3.
27. Harvey H.A. et al. Gonococcal lipooligosaccharide is a ligand for the asialoglycoprotein receptor on human sperm // Molecular Microbiology. 2000. Vol. 36, № 5.
28. Shi B., Abrams M., Sepp-Lorenzino L. Expression of Asialoglycoprotein Receptor 1 in Human Hepatocellular Carcinoma // Journal of Histochemistry & Cytochemistry. 2013. Vol. 61, № 12.
29. Kanazawa N. Dendritic cell immunoreceptors: C-type lectin receptors for pattern-recognition and signaling on antigen-presenting cells // Journal of Dermatological Science. 2007. Vol. 45, № 2.
30. Rigopoulou E.I. et al. Asialoglycoprotein receptor (ASGPR) as target autoantigen in liver autoimmunity: Lost and found // Autoimmunity Reviews. 2012. Vol. 12, № 2.
31. Becker S., Spiess M., Klenk H.-D. The asialoglycoprotein receptor is a potential liver-specific receptor for Marburg virus // Journal of General Virology. 1995. Vol. 76, № 2.
32. Dotzauer A. et al. Hepatitis A Virus-Specific Immunoglobulin A Mediates Infection of Hepatocytes with Hepatitis A Virus via the Asialoglycoprotein Receptor // Journal of Virology. 2000. Vol. 74, № 23.
Hepatitis A Virus-Specific Immunoglobulin A Mediates Infection of Hepatocytes with Hepatitis A Virus via the Asialoglycoprotein Receptor // Journal of Virology. 2000. Vol. 74, № 23. 19 bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. Mohr A.M. et al. Enhanced colorectal cancer metastases in the alcohol-injured liver // Clinical & Experimental Metastasis. 2017. Vol. 34, № 2. Ueno S. et al. Asialoglycoprotein Receptor Promotes Cancer Metastasis by Activating the EGFR–ERK Pathway // Cancer Research. 2011. Vol. 71, № 20. Pranatharthiharan S. et al. Asialoglycoprotein receptor targeted delivery of doxorubicin nanoparticles for hepatocellular carcinoma // Drug Delivery. 2017. Vol. 24, № 1. Oh H. et al. Galactosylated Liposomes for Targeted Co-Delivery of Doxorubicin/Vimentin siRNA to Hepatocellular Carcinoma // Nanomaterials. 2016. Vol. 6, № 8. Zheng G. et al. Co-delivery of sorafenib and siVEGF based on mesoporous silica nanoparticles for ASGPR mediated targeted HCC therapy // European Journal of Pharmaceutical Sciences. 2018. Vol. 111. Bhingardeve P. et al. Receptor-Specific Delivery of Peptide Nucleic Acids Conjugated to Three Sequentially Linked N -Acetyl Galactosamine Moieties into Hepatocytes // The Journal of Organic Chemistry. 2020. Vol. 85, № 14. Monestier M. et al. ASGPR-Mediated Uptake of Multivalent Glycoconjugates for Drug Delivery in Hepatocytes // ChemBioChem. 2016. Vol. 17, № 7. Thakor D.K., Teng Y.D., Tabata Y. Neuronal gene delivery by negatively charged pullulan– spermine/DNA anioplexes // Biomaterials. 2009. Vol. 30, № 9. Scott L.J. Givosiran: First Approval // Drugs. 2020. Vol. 80, № 3. Fiume L. et al. Liver targeting of antiviral nucleoside analogues through the asialoglycoprotein receptor // Journal of Viral Hepatitis. 1997. Vol. 4, № 6. Plourde R., Wu G.Y. Targeted therapy for viral hepatitis // Advanced Drug Delivery Reviews. 1995. Vol. 17, № 3. Zhang Y. et al. Targeted delivery of atorvastatin via asialoglycoprotein receptor (ASGPR) // Bioorganic & Medicinal Chemistry. 2019. Vol. 27, № 11. Sirtori C.R. The pharmacology of statins // Pharmacological Research. 2014. Vol. 88. Lübtow M.M. et al. Like Dissolves Like? A Comprehensive Evaluation of Partial Solubility Parameters to Predict Polymer–Drug Compatibility in Ultrahigh Drug-Loaded Polymer Micelles // Biomacromolecules. 2019. Vol. 20, № 8. P. 3041–3056. Zankov D. v. et al. QSAR Modeling Based on Conformation Ensembles Using a MultiInstance Learning Approach // Journal of Chemical Information and Modeling. 2021. Vol. 61, № 10. P. 4913–4923. RDKit: Open-Source Cheminformatics. http://www.rdkit.org. Pedregosa F. et al. Scikit-learn: Machine Learning in Python. 2012. Scikit-Learn User Guide. Available online: https://scikitlearn.org/stable/_downloads/scikit-learn-docs.pdf. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006; ISBN 026218253X. Available online: http://www.gaussianprocess.org/gpml/chapters/RW.pdf. Muratov E.N. et al. Per aspera ad astra : application of Simplex QSAR approach in antiviral research // Future Medicinal Chemistry. 2010. Vol. 2, № 7. P. 1205–1226. Smith J.S. et al. 
Less is more: Sampling chemical space with active learning // The Journal of Chemical Physics. 2018. Vol. 148, № 24. Yang Y. et al. Efficient Exploration of Chemical Space with Docking and Deep Learning // Journal of Chemical Theory and Computation. 2021. 20 bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Supporting Materials EFFICIENT DESIGN OF PEPTIDE-BINDING POLYMERS USING ACTIVE LEARNING APPROACHES Assima Rakhimbekovaa, Anton Lopukovb, Natalia Klyachkob, Alexander Kabanovb,c, Timur I. Madzhidova*, Alexander Tropshad* a A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia of Chemical Design of Bionanomaterials, Faculty of Chemistry, M.V. Lomonosov Moscow State University, Moscow, b Laboratory Russia c Center for Nanotechnology in Drug Delivery and Division of Pharmacoengineering and Molecular Pharmaceutics, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, NC, USA d Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA 21 bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Fig. S 1. Influence on the quality (R2) of RFR and GPR models by different strategies of the AL algorithms. The initial set contains 10 points, 5 objects were added at every AL cycle. Lines correspond to median values over 10 repetitions. Active learning strategies are Y-MAX (green line), Y-VAR (blue line), and random (red line), black dotted line represents performance of the model trained on the entire training data set. Lines correspond to median values over 10 repetitions of AL cycle. 22 bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Fig. S 2. Influence on the quality (EF10) of RFR and GPR models by different strategies of the AL algorithms. The initial set contains 10 points, 5 objects were added at every AL cycle. Lines correspond to median values over 10 repetitions. Active learning strategies are Y-MAX (green line), Y-VAR (blue line), and random (red line), black dotted line represents performance of the model trained on the entire training data set. Lines correspond to median values over 10 repetitions of AL cycle. 23 bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Fig. S 3. Influence of selection initial dataset strategy. The initial CHEMBL244 dataset consists of 5, 10, 20 and 50 points 5 data points 10 data points 20 data points 50 data points 24 bioRxiv preprint doi: https://doi.org/10.1101/2021.12.17.473241; this version posted December 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. 
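The selection strategies compared in Figs. S1 and S2 (Y-MAX, Y-VAR, and random) all reduce to scoring the unlabeled pool and taking the top-ranked candidates. The sketch below is a minimal illustration of that step using scikit-learn's Gaussian process regressor; the function and variable names (select_batch, X_pool, batch_size) are our own illustrative choices, not the code used in the study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def select_batch(model, X_pool, strategy="Y-MAX", batch_size=5, rng=None):
    """Return indices of the pool candidates to test in the next AL cycle."""
    if strategy == "random":
        rng = rng if rng is not None else np.random.default_rng(0)
        return rng.choice(len(X_pool), size=batch_size, replace=False)
    mean, std = model.predict(X_pool, return_std=True)
    if strategy == "Y-MAX":      # exploitation: highest predicted activity
        scores = mean
    elif strategy == "Y-VAR":    # exploration: largest predictive uncertainty
        scores = std
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.argsort(scores)[-batch_size:]

# One AL cycle (X_train/y_train: the small labeled set, X_pool: untested candidates):
# gpr = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X_train, y_train)
# next_indices = select_batch(gpr, X_pool, strategy="Y-MAX", batch_size=5)
```

The selected candidates would then be labeled (by experiment or "virtual" measurement), appended to the training set, and the model refit before the next cycle.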
Fig. S4. Influence of the strategy used to select the initial dataset (ChEMBL205). The initial dataset consists of 5, 10, 20, or 50 points (panels: 5, 10, 20, and 50 data points).

Fig. S5. Influence of the strategy used to select the initial dataset (ChEMBL217). The initial dataset consists of 5, 10, 20, or 50 points (panels: 5, 10, 20, and 50 data points).

Fig. S6. Influence of the initial dataset size (panels: ChEMBL244, ChEMBL205, ChEMBL217).

Fig. S7. Results of active learning strategies applied to the ChEMBL244 dataset. The initial datasets consist of 10 points; the black dotted line corresponds to training the model on all data.

Table S1. The number of active (N «1») / inactive (N «0») compounds in the loading capacity (LC) datasets for 5 hydrophobic drugs (curcumin (CUR), paclitaxel (PTX), the antiretroviral efavirenz (EFV), dexamethasone (DEX), and the anti-oxidative tanshinone IIA (T2A)) with amphiphilic polymers. The loading capacity values were measured at different time intervals: 0, 5, 10, and 30 minutes after adding the drug to the polymer solution.

Time, min  Count     EFV  CUR  DEX  T2A  PTX
0          N («0»)    14    5   18   18   12
0          N («1»)     4   13    0    0    6
5          N («0»)    15    7    0    0   12
5          N («1»)     3   11    0    0    6
10         N («0»)    16    7    0    0   13
10         N («1»)     2   11    0    0    5
30         N («0»)    15    8    0    0   15
30         N («1»)     3   10    0    0    3

Table S2. Performance of the models on leave-one-out validation for the datasets of 3 hydrophobic drugs (curcumin (CUR), paclitaxel (PTX), and the antiretroviral efavirenz (EFV)) with amphiphilic polymers. The loading capacity values were measured at different time intervals: 0, 5, 10, and 30 minutes after adding the drug to the polymer solution.

            EFV                         CUR                         PTX
Time, min   0      5      10     30     0      5      10     30     0      5      10     30
TN          13     15     16     13     0      4      3      5      10     12     11     14
FP          1      0      0      2      5      3      4      3      2      0      2      1
FN          3      3      2      3      2      1      4      2      2      2      4      3
TP          1      0      0      0      11     10     7      8      4      4      1      0
Acc         0.778  0.833  0.889  0.722  0.611  0.778  0.556  0.722  0.778  0.889  0.667  0.778
TPR         0.25   0      0      0      0.846  0.909  0.636  0.8    0.667  0.667  0.2    0
TNR         0.929  1      1      0.867  0      0.571  0.429  0.625  0.833  1      0.846  0.933
PPV         0.5    –      –      0      0.688  0.769  0.636  0.727  0.667  1      0.333  0
NPV         0.813  0.833  0.889  0.813  0      0.8    0.429  0.714  0.833  0.857  0.733  0.824
BA          0.589  0.5    0.5    0.433  0.423  0.740  0.533  0.713  0.750  0.833  0.523  0.467

(«–» denotes an undefined value: no compounds were predicted as active in that setting, so PPV cannot be computed.)
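The statistics in Table S2 follow directly from the leave-one-out confusion-matrix counts (TN, FP, FN, TP). The short sketch below is an illustrative helper, not code from the study; cells with a zero denominator (e.g., PPV when no polymer is predicted active) are left undefined, matching the blank cells in the table.

```python
def classification_metrics(tn, fp, fn, tp):
    """Recompute one column of Table S2 from raw confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else None   # None = undefined cell in the table
    tpr = ratio(tp, tp + fn)                # sensitivity (recall)
    tnr = ratio(tn, tn + fp)                # specificity
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ratio(tp, tp + fp),          # precision
        "NPV": ratio(tn, tn + fn),
        "BA":  (tpr + tnr) / 2 if tpr is not None and tnr is not None else None,
    }

# Example: the EFV, 0 min column (TN=13, FP=1, FN=3, TP=1)
# classification_metrics(13, 1, 3, 1) -> Acc 0.778, TPR 0.25, TNR 0.929, BA 0.589
```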
Fig. S8. Selection of the optimal polymer for PTX solubilization. Solubilizing ability was studied after 0 and 5 minutes of mixing. Left: predictive performance of the model (BA value). Right: proportion of active polymers as a function of AL iteration.

Fig. S9. Selection of the optimal polymer for CUR solubilization. Solubilizing ability was studied after 5 and 30 minutes of mixing. Left: predictive performance of the model (BA value). Right: proportion of active polymers as a function of AL iteration.
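Figs. S8 and S9 track two quantities per AL iteration: the balanced accuracy of the classifier and the fraction of active polymers already identified. The following is a minimal retrospective sketch of such an exploitation-driven classification AL cycle, assuming a fully labeled pool is available for simulation; all names (run_al_cycle, n_start, batch) and the specific Random Forest settings are illustrative assumptions rather than the exact protocol of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

def run_al_cycle(X, y, n_start=5, batch=3, n_iter=5, seed=0):
    """Retrospective AL simulation on a fully labeled pool (X: descriptors, y: 0/1 labels)."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X), size=n_start, replace=False))
    history = []
    for _ in range(n_iter):
        clf = RandomForestClassifier(n_estimators=500, random_state=seed)
        clf.fit(X[labeled], y[labeled])               # assumes both classes are present
        pool = [i for i in range(len(X)) if i not in labeled]
        if not pool:
            break
        p_active = clf.predict_proba(X[pool])[:, -1]  # probability of the "active" class
        picked = [pool[i] for i in np.argsort(p_active)[-batch:]]
        labeled += picked                              # in a prospective run, test these polymers
        ba = balanced_accuracy_score(y, clf.predict(X))        # model quality, cf. left panels
        hit_frac = y[labeled].sum() / max(int(y.sum()), 1)     # actives recovered, cf. right panels
        history.append((ba, hit_frac))
    return history
```

In a prospective campaign the labels of the picked polymers would come from the experiment rather than from the pre-existing vector y.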