Livingstone 2001


Journal of Computer-Aided Molecular Design, 15: 741–752, 2001.

KLUWER/ESCOM 741
© 2001 Kluwer Academic Publishers. Printed in the Netherlands.

Simultaneous prediction of aqueous solubility and octanol/water partition coefficient based on descriptors derived from molecular structure

David J. Livingstone^a,∗, Martyn G. Ford^b, Jarmo J. Huuskonen^c & David W. Salt^d

^a ChemQuest, Delamere House, 1, Royal Crescent, Sandown, Isle of Wight, PO36 8LZ, UK; ^b Centre for Molecular Design, University of Portsmouth, Portsmouth, Hants, PO1 2EG, UK; ^c Division of Pharmaceutical Chemistry, Department of Pharmacy, POB 56, FIN-00014 University of Helsinki, Finland; ^d School of Computer Science & Mathematics, University of Portsmouth, Portsmouth, Hants, PO1 2DY, UK

Received 20 November 2000; accepted 11 June 2001

Key words: canonical correlation, electrotopological descriptors, log P , log S, neural networks, regression analysis

Summary
It has been shown that water solubility and octanol/water partition coefficient for a large diverse set of compounds
can be predicted simultaneously using molecular descriptors derived solely from a two dimensional representation
of molecular structure. These properties have been modelled using multiple linear regression, artificial neural
networks and a statistical method known as canonical correlation analysis. The neural networks give slightly better
models both in terms of fitting and prediction presumably due to the fact that they include non-linear terms. The
statistical methods, on the other hand, provide information concerning the explanation of variance and allow easy
interrogation of the models. Models were fitted using a training set of 552 compounds, a validation set and test set
each containing 68 molecules and two separate literature test sets for solubility and partition.

Introduction

A wide variety of properties, experimental, empirical and theoretical, may be estimated or calculated for the description of chemical structures and their interactions with solvents, other small molecules and complex molecular structures such as membranes and pharmaceutical receptors [1]. Of the experimental properties, perhaps the two most important are water solubility (log S) and octanol/water partition coefficient (log P). Because of this importance several groups have reported methods for the calculation of solubility [2–7] and a rather larger number of studies have examined the prediction of log P [4, 8–18]. The compounds used in these reports have been characterised directly by molecular structure (group contributions) or by properties derived from structure and the mathematical models relating log S or log P to these descriptors have mainly been constructed using multiple linear regression (MLR).

Recently, non-linear models have been constructed using artificial neural networks (ANN) [19–22] and we have reported methods for the calculation of log S [23–26] and log P [27, 28] using electrotopological descriptors [29] coupled with ANN. Since both of these properties may be predicted successfully using the same set of descriptors it should be possible to model them simultaneously. An artificial neural network may be trained to compute two target values from the same set of inputs by simply including an extra neuron in the output layer. Statistical methods may also be used to model two or more response variables simultaneously, one such technique is known as canonical correlation analysis (CCA). Although there are often circumstances in drug discovery where multiple responses or measurements are available for a set of compounds, CCA has been applied rarely to analyse the data [30–33]. This paper reports a comparison of the use of MLR, ANNs and CCA to model log S and log P using a set of 37 electrotopological descriptors.
∗ To whom correspondence should be addressed; E-mail:
davel@chmqst.demon.co.uk
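
The idea of a single network with one output neuron per property can be illustrated with a minimal NumPy sketch. It uses the settings stated in the Methods (logistic activation, scaling to 0.1–0.9, initial weights between −0.5 and 0.5, learning rate 0.1, momentum 0.9). The class name `TwoOutputNet`, the bias terms, the squared-error loss and the batch form of the update are assumptions made for illustration; this is not the NeuDesk implementation used in the paper.

```python
import numpy as np

def scale_01_09(values, lo, hi):
    """Linearly map values from the range [lo, hi] onto [0.1, 0.9]."""
    return 0.1 + 0.8 * (values - lo) / (hi - lo)

def logistic(x):
    """Logistic activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

class TwoOutputNet:
    """Hypothetical three-layer network with two outputs (e.g. log S, log P)."""

    def __init__(self, n_in=37, n_hidden=10, n_out=2, seed=0):
        rng = np.random.default_rng(seed)
        # Random starting weights in [-0.5, 0.5]; the extra row is a bias
        # term, which is an assumption (the paper does not mention biases).
        self.w1 = rng.uniform(-0.5, 0.5, size=(n_in + 1, n_hidden))
        self.w2 = rng.uniform(-0.5, 0.5, size=(n_hidden + 1, n_out))
        self.dw1 = np.zeros_like(self.w1)  # previous updates, for momentum
        self.dw2 = np.zeros_like(self.w2)

    def forward(self, x):
        ones = np.ones((x.shape[0], 1))
        xb = np.hstack([x, ones])            # inputs plus bias column
        hb = np.hstack([logistic(xb @ self.w1), ones])
        out = logistic(hb @ self.w2)         # one column per target
        return xb, hb, out

    def train_step(self, x, t, lr=0.1, momentum=0.9):
        """One batch back-propagation step on a squared-error loss."""
        xb, hb, out = self.forward(x)
        delta_out = (out - t) * out * (1.0 - out)
        hidden = hb[:, :-1]
        delta_hid = (delta_out @ self.w2[:-1].T) * hidden * (1.0 - hidden)
        grad2 = hb.T @ delta_out / x.shape[0]
        grad1 = xb.T @ delta_hid / x.shape[0]
        # Gradient descent with momentum.
        self.dw2 = -lr * grad2 + momentum * self.dw2
        self.dw1 = -lr * grad1 + momentum * self.dw1
        self.w2 += self.dw2
        self.w1 += self.dw1
        return float(np.mean((out - t) ** 2))
```

Both targets share the hidden layer, so the only structural difference from a single-property network is the second column of the output weight matrix. Because the outputs are logistic, predictions would have to be mapped back from the 0.1–0.9 scale to log units.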

Methods

Data

A set of 900 drug and pesticide-like compounds was chosen from the AQUASOL dATAbASE [34] and the aqueous solubility values at 25 °C, expressed as log S, where S is the solubility in moles per litre, were used. The compounds were checked in the ‘starlist’ (preferred measured values) of Biobyte [35] and those with an entry were selected to give an overall set of 688 compounds with both log S and log P values. A referee has pointed out, however, a potential problem with literature solubility values for ionisable compounds. It appears that reported solubilities are ‘invariably’ those of the unbuffered solution where the pH is that of the saturated solution. The common data set of 688 compounds contains 82 carboxylic acids and 43 aliphatic amines; solubility values for these compounds will be for a mixture of the neutral and ionised species.

Training/test sets

The set of 688 compounds was subdivided into three sets: a validation set of 10% (n = 68), a test set of 10% (n = 68) and a training set of the remainder (n = 552). The validation and test sets were selected so that they contained molecules presenting all the different chemical structures and functional groups found in the training set. The log S values of the training set ranged between −6.30 and 1.60 and the log P values between −3.89 and 6.50. Figure 1 shows the distribution of solubilities and partition coefficient values for the training set. Two further test sets were chosen from the literature, a test set of 19 for log P [36] and a test set of 21 for log S [37], allowing comparison with earlier proposed log S and log P estimation methods.

Multiple linear regression analysis

Multiple linear regression (MLR) analysis was performed with the SPSS software (v.8.0, SPSS Inc., Chicago, IL) running on a Pentium PC. The quality criteria of the fit in MLR analysis were squared correlation coefficient, R², standard error, s, and Fisher significance value, F, when all parameters in the model were significant at the 95% confidence level.

Neural network implementation

The artificial neural network simulations were carried out using the NeuDesk software (v. 2.20, Neural Computational Sciences, UK). A three-layered, fully connected neural network was trained by the standard back-propagation learning algorithm with a logistic f(x) = 1/(1 + e^(−x)) activation function both for hidden and output nodes. Before the training was started, the input and output values were scaled between 0.1 and 0.9, and the adjustable weights between neurons were given random values of between −0.5 and 0.5. The learning rate and momentum parameters were set at 0.1 and 0.9, respectively. The optimal training endpoint and network architecture were determined on the basis of the validation set of 68 compounds. The network architecture and the training endpoint giving the highest coefficient of determination, R²pred, and the lowest standard error s for the predictions of the validation set were then used. The predictions were repeated 10 times with different random starting weights in the network and the averaged log S values were calculated.

Canonical correlation analysis

Canonical correlation analysis was carried out using the 6M routine of the BMDP package [38] running on a PC. Since canonical correlation is a somewhat less well known technique than other statistical methods a brief explanation is presented here:

Canonical correlation analysis (CCA) is a generalisation of multiple linear regression analysis. In the latter a weighted sum of the predictor variables is found which maximally correlates with a single response variable. CCA is a technique which finds a weighted sum of several response variables that is maximally correlated with a weighted sum of the predictor variables. Whereas in multiple regression the responses are analysed independently, ignoring any covariance amongst the response set of variables, CCA utilises this shared information and provides an analysis of all response variables simultaneously.

Canonical variates

Denote the variables in the response set Y1, Y2, ..., Yq and the variables in the predictor set X1, X2, ..., Xp. In CCA the coefficients a11, a12, ..., a1q and b11, b12, ..., b1p are found such that the two linear constructs W1 and Z1, where

Figure 1. Distribution of log S and log P values for the training set.

$$W_1 = a_{11}Y_1 + a_{12}Y_2 + \ldots + a_{1q}Y_q \tag{1}$$

and

$$Z_1 = b_{11}X_1 + b_{12}X_2 + \ldots + b_{1p}X_p \tag{2}$$

are maximally correlated. The two new variables, W1 and Z1, are referred to as canonical variates (CV) and the correlation between them (R1) is known as the canonical correlation. A second pair of CVs, (W2, Z2), is then selected to account for a maximum amount of the relationship between the two sets of variables unaccounted for by the first pair of CVs, and so on. The number of pairs of canonical variates constructed equals the smaller of q and p. Thus, the linear relationships,

$$
\begin{aligned}
W_1 &= a_{11}Y_1 + a_{12}Y_2 + \ldots + a_{1q}Y_q,\\
W_2 &= a_{21}Y_1 + a_{22}Y_2 + \ldots + a_{2q}Y_q,\\
&\;\;\vdots\\
W_s &= a_{s1}Y_1 + a_{s2}Y_2 + \ldots + a_{sq}Y_q
\end{aligned}
$$

and

$$
\begin{aligned}
Z_1 &= b_{11}X_1 + b_{12}X_2 + \ldots + b_{1p}X_p,\\
Z_2 &= b_{21}X_1 + b_{22}X_2 + \ldots + b_{2p}X_p,\\
&\;\;\vdots\\
Z_s &= b_{s1}X_1 + b_{s2}X_2 + \ldots + b_{sp}X_p
\end{aligned}
\tag{3}
$$

can be found, where s is the smaller of q and p. The pairs of canonical variates are extracted so that the correlation R1 between the first pair of CVs (W1, Z1) is a maximum; the correlation R2 between the second pair (W2, Z2) is a maximum subject to these variables being uncorrelated with (W1, Z1) and R2 < R1; and so forth.

Procedure for CCA

The method of extracting the successive pairs of CVs involves an eigenvalue-eigenvector analysis. The eigenvalues (Ri²) and associated eigenvectors constructed by the CCA are based on the combined (p + q) × (p + q) correlation matrix, C, between the descriptor variables and the response variables, where

$$
C = \begin{pmatrix}
R_{XX}\;(p \times p) & R_{XY}\;(p \times q)\\
R'_{XY}\;(q \times p) & R_{YY}\;(q \times q)
\end{pmatrix}
\tag{4}
$$

From this matrix an s × s matrix $R_{YY}^{-1} R'_{XY} R_{XX}^{-1} R_{XY}$ is constructed, and its eigenvalues λ1 > λ2 > ... > λs are the squares of the canonical correlations between the pairs of canonical variates and represent the amount of variance in the canonical variate Wi that is accounted for by the other canonical variate Zi. The corresponding eigenvectors allow the canonical variate coefficients (aij and bij in equation (3)) to be calculated.

Canonical weights and canonical loadings

Canonical weights, aij and bij in equation (3), are analogous to the coefficients in multiple linear regression analysis (MRA) and indicate the contribution of each variable to the variance of the respective canonical variate. Canonical loadings are more useful in identifying the nature of the canonical relationships. Canonical loadings give the simple product moment correlation of the original variable and its respective CV and reflect the degree to which the variable is represented by a CV. The canonical loadings can easily be found by correlating the raw variable scores with the canonical variate scores.

Predicting a response variable

There are several ways in which the results of a CCA can be used to obtain a prediction for a response variable. One approach is to regress each of the original responses on the appropriate set of canonical variates which have been constructed. Because each variate is a new, orthogonal variable with a known functional relationship to the original variables, the procedure is straightforward. Each original y variable is regressed on the set of canonical variates constructed from the molecular descriptors (the X block) using a standard multiple regression procedure.

A second approach is to base prediction on a method analogous to that used to solve simultaneous equations. The various canonical variates are regarded as i equations in i unknowns which can be solved analytically; i is the number of variables in the smallest set.

Results and discussion

Regression analysis

Stepwise and backward methods were employed in the regression analysis. The following regression equations were calculated for log S with 28 and log P with 26 significant descriptors at 95% confidence limits:

$$\log S = \sum_i (a_i S_i) + 1.128 \tag{5}$$

n = 552, R² = 0.783, s = 0.753, F = 68.18

$$\log P = \sum_i (a_i S_i) - 0.635 \tag{6}$$

n = 552, R² = 0.870, s = 0.645, F = 134.8

In both equations n is the number of the compounds used in the fit, F is the overall F-statistic for the addition of each successive descriptor, and ai and Si are the regression coefficients and the corresponding electrotopological indices. The regression coefficients in equations 5 and 6 are indicated in Table 1 with the t-scores of significant parameters. Fitting and prediction results for these two regression equations are shown in Table 2.

Neural networks

In order to allow direct comparison with the multiple linear regression models, neural networks were trained separately to reproduce log S and log P values using the same topological descriptors as selected by the stepwise regression procedure. In other words, these are the neural network equivalents of equations 5 and 6. In the case of solubility the ‘best’ network architecture was a 28-6-1 network and for partition coefficient a 26-5-1 network. Fitting and prediction results for these networks can be seen in Table 2, where the network fitting statistics are improved over the MLR results for the training and test sets and are equivalent to MLR for the validation sets.

Training a neural network with all 37 topological descriptors using both log S and log P as targets resulted in a ‘best’ network architecture of 37-10-2 with a training set R² of 0.894 for water solubility and R² = 0.931 for partition coefficient. Statistics for training, validation and test sets for this network are shown in Table 2. One of the drawbacks of neural network modelling is that the final model is hidden in the set of overall weights and thus it is difficult to estimate the contribution that individual descriptors make to the model. Neural network pruning techniques do allow an estimation of the importance of individual input variables [39]. Application of one such pruning method [40] resulted in the removal of seven of the topological descriptors as shown in Table 1. Training a network with the remaining 30 descriptors also resulted in a ‘best’ network architecture with 10 hidden neurons, i.e. a 30-10-2 ANN. This network gave improved statistics for all the data sets, except the test set for partition, as shown in Table 2. Figures 2 and 3 show plots of calculated against observed for the training set, validation set and test set compounds.

Figure 2. Plot of calculated vs. observed values of log S for the training set, validation set and test set compounds.

Figure 3. Plot of calculated vs. observed values of log P for the training set, validation set and test set compounds.

Calculated values for the compounds in the two literature test sets of solubility and partition coefficient are shown in Tables 3 and 4 respectively. It can be seen from Table 3 that the ANN gives improved results compared with both the MLR model and the two published comparisons for this set [41, 42]. Two of the compounds in this test set (2,2′,4,5,5′-PCB and 4,4′-DDT) have log S values which are well outside the range of values of the training set (−6.30 to 1.60). These two compounds are poorly predicted by both the MLR and ANN models but are quite well predicted by the two literature methods. For the partition coefficient test set, ANN performs better than XLOGP, MLOGP and the Rekker method but not as well as CLOGP or KOWWIN. Figures 4 and 5 show the correlation between predicted and observed log S values in the test set of 21 drug and pesticide compounds and
746
Table 1. Parameters used^a in multiple linear regression and neural network models.

No. Symbol Atom-type log S t-score log P t-score ANN obsd min max

1 SsCH3 − CH3 −0.208 11.752 0.322 21.685 X 348 0 13.905


2 SdCH2 = CH2 0.253 6.083 X 16 0 7.020
3 SssCH2 − CH2 - −0.271 14.568 0.337 21.025 X 299 −1.691 12.161
4 StCH ≡CH −0.239 2.901 0.368 1.919 4 0 5.703
5 SdsCH = CH- −0.222 5.297 X 72 0 5.982
6 SaaCH aCHa −0.177 16.224 0.212 26.156 X 336 0 19.332
7 SsssCH > CH - −0.174 3.680 0.189 5.580 X 173 −6.865 2.929
8 StC ≡C - −0.644 1.716 10 0 3.496
9 SdssC =C< −0.272 6.857 X 304 −7.562 2.904
10 SaasC asCa −0.109 2.370 0.066 1.928 X 350 −5.362 6.070
11 SaaaC aaCa −0.242 3.303 X 41 0 4.765
12 SssssC >C< −0.263 4.227 0.080 1.705 X 106 −6.759 0.601
13 SsNH3+ - NH3 + −0.512 14.179 X 19 0 5.939
14 SsNH2 - NH2 0.036 1.862 −0.103 6.474 X 77 0 10.790
15 SdNH = NH 2 0 6.804
16 SssNH - NH - 0.074 2.711 −0.117 5.584 X 131 0 6.666
17 SaaNH aNHa 4 0 3.084
18 StN ≡N −0.120 4.564 0.152 1.916 X 6 0 17.409
19 SdsN =N- −0.119 3.761 X 29 0 8.086
20 SaaN aNa −0.097 7.133 0.020 1.906 X 79 0 15.657
21 SsssN >N- 0.282 7.132 −0.275 13.470 X 124 0 7.444
22 SddsN - N << 0.549 4.150 −1.123 11.643 X 31 −3.706 0
23 SsOH - OH 0.041 7.782 X 225 0 66.476
24 SdO =O −0.054 11.728 X 353 0 58.507
25 SssO -O- −0.029 2.640 X 170 0 29.004
26 SaaO aOa 6 0 5.105
27 SsF -F −0.098 10.021 0.073 9.601 X 25 0 39.932
28 SdsssP ->P= −0.202 1.955 X 34 −5.743 0
29 SsSH - SH 2 0 3.983
30 SdS =S −0.320 6.989 0.153 6.703 X 37 0 10.862
31 SssS -S- −0.309 3.895 0.294 4.458 X 48 0 3.598
32 SaaS aSa −0.563 3.116 0.479 3.118 12 0 1.823
33 SdssS >S= 0.726 2.807 X 4 −1.861 0
34 SddssS > S << −0.132 3.077 0.082 2.513 X 33 −9.059 0
35 SsCl - Cl −0.177 23.065 0.162 26.346 X 86 0 51.006
36 SsBr - Br −0.319 6.877 0.365 9.203 X 11 0 9.650
37 SsI -I −0.932 4.469 0.851 4.761 X 4 0 2.280
a Where an entry is missing from the log S and log P columns that descriptor was removed during the stepwise MLR
analysis for that property. An X in the ANN column indicates that this parameter was retained following network
pruning, that is to say these descriptors were used in the 30-10-2 network.

log P values in the test set of 19 drug compounds, respectively.

Canonical correlation analysis

As there are only two variables in the smaller of the two sets of variables (log P and log S) there is a maximum of two pairs of canonical variates that can be extracted. Application of CCA yielded both pairs to be significant (Bartlett’s test: χ² = 442.1, 36 degrees of freedom, p = 0.0000) and Table 5 shows the canonical correlations along with the canonical coefficients and loadings for the response set of variables (cnvrf1, cnvrf2). The first canonical correlation between the 37 descriptors and the two response variables log P and log S is large (0.923). The second is, of course, smaller

Table 2. Comparison of the predictive ability of multiple linear regression and neural network
models.

Model Training set Validation set Test set


# R2 s n R2 s n R2 s n

(A) Aqueous solubility, log S

MLR 28 0.78 0.75 552 0.84 0.63 68 0.78 0.75 68


ANNa 28 0.87 0.60 552 0.84 0.63 68 0.83 0.65 68
ANNb 37 0.89 0.53 552 0.86 0.60 68 0.83 0.65 68
ANNc 30 0.90 0.52 552 0.92 0.44 68 0.84 0.63 68

(B) Partition coefficient, log P

MLR 26 0.87 0.65 552 0.91 0.50 68 0.86 0.65 68


ANNa 26 0.91 0.53 552 0.90 0.51 68 0.89 0.55 68
ANNb 37 0.93 0.47 552 0.90 0.51 68 0.89 0.55 68
ANNc 30 0.94 0.44 552 0.94 0.40 68 0.89 0.55 68

# = number of variables used in the model.


a Same parameters as in MLR equation used as input to ANNs.
b All atom-type E-state indices used as inputs to ANNs with two outputs.
c Significant parameters after pruning used as inputs to ANNs with two outputs.
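
As a rough cross-check of the regression statistics quoted with equations 5 and 6, the overall F value can be recomputed from R², n and the number of descriptors. The sketch below assumes the usual overall F-statistic for a linear model with an intercept, F = (R²/k) / ((1 − R²)/(n − k − 1)); the paper does not state the exact formula used by SPSS for the reported values, and the function name `overall_f` is introduced here for illustration.

```python
def overall_f(r2, n, k):
    """Overall F-statistic for a linear regression with k predictors and an
    intercept, computed from the squared correlation coefficient r2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Values quoted with equations 5 and 6 (n = 552 training compounds).
f_log_s = overall_f(0.783, n=552, k=28)  # reported F = 68.18
f_log_p = overall_f(0.870, n=552, k=26)  # reported F = 134.8

print(f"log S: recomputed F = {f_log_s:.1f}")
print(f"log P: recomputed F = {f_log_p:.1f}")
```

Under this assumption both recomputed values fall within about one unit of the reported ones, which is the level of agreement expected when R² is only available to three decimal places.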

Table 3. The observed and predicted aqueous solubility values for the test set.

No. Compound log Sexp MLR ANN Klopmana Kühneb

1 antipyrine 0.39 −1.47 −1.31 −2.76 −1.90


2 theophylline −1.39 −1.10 −1.24 −1.07 0.54
3 acetylsalicylic acid −1.72 −1.83 −1.84 −1.52 −1.93
4 benzocaine −2.32 −1.59 −1.73 −1.71 −1.75
5 phenobarbital −2.32 −2.80 −3.26 −2.08 −2.41
6 prostaglandin E2 −2.47 −4.65 −3.98 −4.21 na
7 phenolphthalein −2.90 −4.16 −3.79 −4.48 −4.61
8 malathion −3.37 −3.39 −3.22 −2.94 −3.48
9 nitrofurantoin −3.38 −2.50 −3.06 −2.19 −2.62
10 diazinon −3.64 −4.16 −4.13 −5.29 −4.98
11 diazepam −3.76 −4.17 −3.90 −5.54 −4.51
12 diuron −3.80 −3.04 −3.26 −2.85 −3.38
13 atrazine −3.85 −3.12 −3.98 −3.05 −3.95
14 phenytoin −3.90 −3.68 −4.02 −3.47 −5.25
15 testosterone −4.09 −4.24 −4.44 −5.17 −4.62
16 lindane −4.64 −4.66 −4.90 −4.88 −5.08
17 parathion −4.66 −3.82 −3.90 −3.94 −4.59
18 chlorpyriphos −5.49 −5.13 −5.16 −5.77 −3.75
19 a-chlordane −6.86 −6.79 −6.05 −7.55 −6.51
20 2,2′,4,5,5′-PCB −7.89 −6.27 −6.34 −7.90 −7.47
21 4,4′-DDT −8.08 −6.65 −6.30 −8.00 −7.75

r2 0.79 0.86 0.72 0.76


s 0.97 0.77 1.11 1.06
n 21 21 21 20
a Ref. 42.
b Ref. 41.

Table 4. The observed and predicted partition coefficient values for the test set.

No. Compound Log Pobsd MLR ANN XLOGP MLOGP Rekker CLOGP KOWWIN

1 chlorothiazide −0.24 −0.74 0.00 −0.58 −0.36 −0.68 −0.30 −0.23


2 cimetidine 0.40 1.90 1.74 0.20 0.82 0.63 0.35 0.57
3 procainamide 0.88 1.60 1.52 1.27 1.72 1.11 1.42 0.97
4 trimethoprim 0.91 1.44 1.27 0.72 1.26 −0.07 0.88 0.73
5 chloramphenicol 1.14 0.96 1.00 1.46 1.23 0.32 1.28 0.92
6 phenobarbital 1.47 1.82 1.91 1.77 0.78 1.23 1.36 1.33
7 atropine 1.83 2.59 2.23 2.29 2.21 1.88 1.31 1.91
8 lidocaine 2.26 2.96 2.82 2.47 2.52 2.30 1.95 1.66
9 phenytoin 2.47 2.91 2.64 2.23 1.80 2.76 2.08 2.16
10 diltiazem 2.70 4.23 3.31 3.14 2.67 4.53 3.64 2.79
11 propranolol 2.98 3.60 3.70 2.98 2.53 3.46 2.75 2.60
12 diazepam 2.99 3.52 3.00 2.98 3.36 3.18 3.16 2.70
13 diphenhydramine 3.27 4.88 4.21 3.74 3.26 3.41 3.54 3.11
14 tetracaine 3.51 3.31 2.78 2.73 2.64 3.55 3.83 3.02
15 verapamil 3.79 6.70 4.55 5.29 3.23 6.15 4.46 4.80
16 haloperidol 4.30 4.76 4.51 4.35 4.01 3.57 3.84 4.20
17 imipramine 4.80 4.70 4.71 4.26 3.88 4.43 5.03 5.01
18 chlorpromazine 5.19 5.13 5.35 4.45 3.86 5.81 5.51 5.15
19 flufenamic acid 5.25 3.99 4.65 4.91 3.77 5.10 5.92 5.65

r2 0.74 0.91 0.90 0.88 0.84 0.95 0.96


s 0.86 0.51 0.54 0.59 0.66 0.37 0.34
n 19 19 19 19 19 19 19

Table 5. Canonical correlations and canonical weights and loadings for the
canonical variates for the response set of variables.

Canonical weights Canonical loadings


cnvrf1 cnvrf2 cnvrf1 cnvrf2

log S −0.122 0.850 −0.806 0.592


log P 0.474 0.644 0.990 0.143
Canonical correlations 0.923 0.752

Table 6. Proportion of variance in the response set accounted for by the predictor
set.

Canonical variate (a) (b) (a) × (b)


Squared canonical Proportion of variance
correlation accounted for

1 0.8518 0.8143 0.6936


2 0.5651 0.1857 0.1049
Total = 0.7985
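
The entries of Table 6 can be reproduced, to within rounding, from the canonical loadings in Table 5 using equations 7 and 8. The short sketch below is only an arithmetic check; the variable names are introduced here for illustration and the input values are the rounded figures printed in Tables 5 and 6.

```python
import numpy as np

# Canonical loadings from Table 5 (rows: log S, log P; columns: cnvrf1, cnvrf2).
loadings = np.array([[-0.806, 0.592],
                     [0.990, 0.143]])

# Squared canonical correlations, column (a) of Table 6.
squared_cc = np.array([0.8518, 0.5651])

# Equation 7: mean of the squared loadings over the q = 2 responses,
# giving column (b) of Table 6 for each canonical variate.
prop_var = (loadings ** 2).mean(axis=0)

# Equation 8: redundancy coefficient, the sum of (a) x (b).
redundancy_terms = squared_cc * prop_var
redundancy = redundancy_terms.sum()

print("proportion of variance:", prop_var.round(4))
print("(a) x (b):", redundancy_terms.round(4))
print("redundancy:", round(float(redundancy), 4))
```

Small differences in the third or fourth decimal place relative to Table 6 are expected because the loadings are given to three decimals only.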
Table 7. Fit and prediction statistics (R 2 ) for the canonical correla-
tion analysis.

Training set   Validation set   Test set 1   Test set 2^b

Solubility
CCAa 0.752 0.828 0.759 0.785
Regression on CVs 0.751 0.828 0.758 0.820
Partition
CCAa 0.846 0.903 0.867 0.738
Regression on CVs 0.846 0.904 0.868 0.767
a Obtained by simultaneous equations.
b These are the two published test sets for log S (21 compounds) and
log P (19).
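
The eigenvalue procedure outlined in the Methods can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the BMDP 6M routine: it assumes q ≤ p (so that s = q), uses the standard product $R_{YY}^{-1} R'_{XY} R_{XX}^{-1} R_{XY}$, and is demonstrated on synthetic data in which the first response is an exact linear combination of the predictors. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def squared_canonical_correlations(X, Y):
    """Squared canonical correlations between predictor block X (n x p)
    and response block Y (n x q), assuming q <= p."""
    p = X.shape[1]
    # Combined (p + q) x (p + q) correlation matrix, as in equation 4.
    C = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    Rxx, Rxy, Ryy = C[:p, :p], C[:p, p:], C[p:, p:]
    # q x q matrix Ryy^-1 Rxy' Rxx^-1 Rxy, built with linear solves
    # rather than explicit inverses.
    M = np.linalg.solve(Ryy, Rxy.T) @ np.linalg.solve(Rxx, Rxy)
    # Eigenvalues are real in theory; discard numerical imaginary parts.
    lam = np.real(np.linalg.eigvals(M))
    return np.sort(lam)[::-1]

# Synthetic demonstration: 5 descriptors, 2 responses.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y1 = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # exactly linear in X
y2 = rng.normal(size=200)                            # unrelated noise
lam_demo = squared_canonical_correlations(X_demo, np.column_stack([y1, y2]))
print(lam_demo)
```

In the demonstration the first eigenvalue is essentially 1 because one response is perfectly predictable from the descriptors, while the second reflects only chance correlation with the noise column.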

(0.752), but still indicates high association between the second pair of canonical variates.

Let us now consider the other elements in the table. Essentially the canonical weights are comparable with regression weights and stress the importance of a variable from one set in relation to the other set with regard to maximising the correlation between the sets. The first canonical variate is based on a difference between log S and log P whereas the second variate is based on a weighted sum of the original variables. As the weights are comparable with multiple regression coefficients they are subject to the same problem of multicollinearity and consequently the canonical loadings should be used in conjunction with the weights if reliable interpretations of the canonical variates are to be achieved.

Figure 4. Correlation of predicted vs. observed log S values in the test set of 21 drug and pesticide compounds.

Figure 5. Correlation of predicted vs. observed log P values in the test set of 19 drug compounds.

As in multiple regression it is useful to know what proportion of the variance in the response set is accounted for by a particular canonical variate. The proportion of explained variance in the Y-set that is accounted for by a particular canonical variate is given by

$$[R_Y(j)]^2 = \frac{1}{q}\sum_{i=1}^{q} [r_{Y_i}(j)]^2 \tag{7}$$

where $[R_Y(j)]^2$ denotes the proportion of variance in the response set variables accounted for by the jth canonical variate, and $r_{Y_i}(j)$ is the canonical loading of the ith response variable on the jth canonical variate. Similarly, the proportion of variance in the predictor set of variables (X) accounted for by the jth canonical variate can be evaluated. The results of applying equation 7 to the current data set are shown in Table 6, where it can be seen that 81.4% of the variability in log P and log S is accounted for by the first canonical variate. A similar calculation shows that the proportion of the variance in the response set accounted for by the second canonical variate is 18.6%.

The proportion of variance accounted for in the response set just evaluated gives the amount of this variance explained by the respective canonical variates. It would be useful, however, to know how much

of the variance in the response set is accounted for by the predictor set. One might think that the squared canonical correlation, $R_C^2$, provides this information. However, although the squared canonical correlation coefficients do have some variance interpretations, they give the variance shared by the canonical variates and not the variance shared by the original X and Y variables. Stewart and Love [43] have proposed an index, called the redundancy coefficient, which represents the amount of variance in the response set that is ‘redundant’ to the variance in the predictor set. This redundancy coefficient, denoted by $RC_{Y/X}$, is given by

$$RC_{Y/X} = \sum_{j=1}^{s} \lambda_j [R_Y(j)]^2 \tag{8}$$

and is the sum of the product of the proportion of explained variance in the Y set that is accounted for by a particular canonical variate with its associated squared canonical correlation coefficient. A redundancy coefficient, $RC_{X/Y}$, can also be constructed and represents the amount of variance in the X set of variables that is redundant to the variance in the Y set, but this is usually of secondary importance. The redundancy coefficient calculated for the current data set is approximately 80%, which is very high. This figure comes about because the first response set canonical variate contains 81.4% of the response set variability and that variate shares 85.2% of its variance with the first predictor set (X) canonical variate. Considering both canonical variates together, about 80% of the response set variability is accounted for.

The performance of these canonical correlation equations in fitting and prediction is shown in Table 7. A comparison of the results from the ANN analyses of the various data sets (Table 2) with those obtained from the two methods used in the CCA (Table 7) shows that the neural networks provide better fits to the training data and are better at prediction. This is presumably due to the fact that the ANN are able to accommodate non-linearity in their fitting. The CCA models, on the other hand, do have the advantage that it is possible to ‘dissect’ the amount of variance explained in and by the response and descriptor sets and the resulting models, like MLR models, can be readily interrogated. Details of the CCA models are shown in the Appendix.

Conclusions

Two important pharmaceutical properties, namely aqueous solubility and partition coefficient, have been modelled simultaneously using ANN and CCA. Both methods produce quite satisfactory models with the ANN outperforming the CCA approach. CCA has the advantage that the proportion of variance explained is easily obtained and the resulting models can be expressed in the form of coefficients that are the equivalent of multiple regression coefficients (see Appendix). ANN models suffer from the fact that the model is contained within a set, often large, of network weights although techniques have been proposed by which these models may be extracted [44].

It is, of course, not very surprising that these two properties can be successfully modelled simultaneously since each property can be well described alone using electrotopological descriptors [25, 28]. What these results show, however, is the ease with which two, or more, dependent variables may be modelled using a set or sets of physicochemical properties. The dependent variables may be pharmacological responses, adverse effects such as toxicity measures, desirable (or undesirable) properties such as log P and log S, measures of stability or pharmacokinetics and so on. In this case a single set of descriptors was sufficient to model both properties but in other applications it may be necessary to use different types [1] of descriptor for the different dependent variables. The models produced by the CCA technique can be most informative since the first canonical variate is based on the difference between the two dependent variables whereas the second canonical variate is based on their weighted sum. It is obvious how such information may be useful in the drug design process.
it is possible to ‘dissect’ the amount of variance ex-
Appendix. Coefficients of the responses and descriptors on the two canonical
variates

Response coef1 coef2 Descriptor coef1 coef2

log S −0.122 0.850 SsCH3 0.185978 0.012381


log P 0.474 0.644 SdCH2 0.130015 0.117873
SssCH2 0.204915 −0.017625
StCH 0.161745 0.020671
SdsCH 0.044057 −0.244769
SaaCH 0.114486 −0.034184
SsssCH 0.126668 −0.068152
StC −0.219068 −0.542252
SdssC −0.152605 −0.268030
SaasC 0.093036 −0.035704
SaaaC 0.129195 −0.249057
SssssC 0.076709 −0.282966
SsNH3 −0.265181 −0.431079
SsNH2 −0.046216 −0.154948
SdNH −0.008897 −0.185470
SssNH −0.036128 −0.228728
SaaNH −0.035715 0.027962
StN 0.071061 −0.029496
SdsN 0.008058 −0.163893
SaaN 0.007526 −0.091930
SsssN −0.238424 −0.039155
SddsN −0.691107 −0.307503
SsOH 0.005564 −0.053314
SdO 0.003980 −0.065451
SssO 0.002466 −0.040553
SaaO 0.030563 −0.117513
SsF 0.051114 −0.051660
SdsssP 0.006999 −0.300027
SsSH 0.013365 0.010609
SdS 0.119694 −0.256348
SssS 0.193939 −0.114682
SaaS 0.342971 −0.376335
SdssS 0.324943 0.776224
SddssS 0.033125 −0.133106
SsCl 0.100302 −0.063910
SsBr 0.241055 −0.106695
SsI 0.562609 −0.353126
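
The ‘simultaneous equations’ route to prediction can be illustrated with the response coefficients listed above. The sketch below makes two assumptions that are not stated in the paper: that the coefficients apply to standardised log S and log P, and that each response-side canonical variate is estimated from its descriptor-side partner as Wi ≈ Ri·Zi (the regression slope between unit-variance variates). The function name `predict_responses` is hypothetical.

```python
import numpy as np

# Response coefficients from the Appendix: rows are the canonical variates
# (coef1, coef2), columns are log S and log P.
A = np.array([[-0.122, 0.474],
              [0.850, 0.644]])

# Canonical correlations from Table 5.
R = np.array([0.923, 0.752])

def predict_responses(z):
    """Solve the two canonical-variate equations for (log S, log P).

    z holds the two descriptor-side canonical variate scores (Z1, Z2).
    Assumption: the response-side scores are estimated as W_i = R_i * Z_i,
    and all quantities are in standardised units.
    """
    w_hat = R * np.asarray(z, dtype=float)
    # Two equations in two unknowns: A @ [log S, log P] = w_hat.
    return np.linalg.solve(A, w_hat)

# Round-trip check with an arbitrary standardised response pair.
y_example = np.array([-1.0, 0.5])     # (log S, log P), standardised
z_example = (A @ y_example) / R       # scores that would reproduce it
y_recovered = predict_responses(z_example)
print(y_recovered)
```

With two responses the system is 2 × 2 and always solvable as long as the coefficient matrix is non-singular, which is the case for the values listed above.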

References

1. Livingstone, D.J., J. Chem. Inf. Comput. Sci., 40 (2000) 195.
2. Nirmalakhandan, N.N. and Speece, R.E., Environ. Sci. Technol., 22 (1988) 328.
3. Bodor, N. and Huang, M-J., J. Pharm. Sci., 81 (1992) 954.
4. Patil, G.S., J. Hazard. Mater., 36 (1994) 35.
5. Sutter, J.M. and Jurs, P.C., J. Chem. Inf. Comput. Sci., 36 (1996) 100.
6. Huibers, P.D.T. and Katritzky, A.R., J. Chem. Inf. Comput. Sci., 38 (1998) 283.
7. Mitchell, B.E. and Jurs, P.C., J. Chem. Inf. Comput. Sci., 38 (1998) 489.
8. Rekker, R. E., Hydrophobic Fragment Constant; Elsevier: New York, 1977.
9. Hansch, C. and Leo, A., Substituent Constants for Correlation Analysis in Chemistry and Biology; Wiley: New York, 1979.
10. Klopman, G. and Iroff, L., J. Comput. Chem., 2 (1981) 157.
11. Bodor, N. and Huang, M-J., J. Pharm. Sci., 81 (1992) 272.
12. Leo, A., Chem. Rev., 93 (1993) 1281.
13. Klopman, G.; Li, J-Y.; Wang, S. and Dimayuga, M., J. Chem. Inf. Comput. Sci., 34 (1994) 752.
14. Meylan, W. M. and Howard, P. H., J. Pharm. Sci., 84 (1995) 83.
15. Wang, R.; Fu, Y. and Lai, L., J. Chem. Inf. Comput. Sci., 37 (1997) 615.
16. Haeberlin, M. and Brinck, T., J. Chem. Soc. Perkin Trans. 2, (1997) 289.
17. Bodor, N. and Buchwald, P., J. Phys. Chem., 101 (1997) 3404.
18. Buchwald, P. and Bodor, N., Current. Med. Chem., 5 (1998) 353.
19. Bodor, N. and Huang, M-J., J. Am. Chem. Soc., 113 (1991) 9480.
20. Breindl, A.; Beck, N.; Clark, T. and Glen, R. C., J. Mol. Model., 3 (1997) 142.
21. Schaper, K.-J. and Samitier, M. L. R., Quant. Struct.-Act. Relat., 16 (1997) 224.
22. Devillers, J.; Domine, D. and Guillon, C., Eur. J. Med. Chem., 33 (1998) 659.
23. Huuskonen, J. J.; Salo, M. and Taskinen, J., J. Pharm. Sci., 86 (1997) 450.
24. Huuskonen, J. J.; Salo, M. and Taskinen, J., J. Chem. Inf. Comput. Sci., 38 (1998) 450.
25. Huuskonen, J.J., Rantanen, J. and Livingstone, D., Eur. J. Med. Chem., 35 (2000) 1081.
26. Huuskonen, J. J., J. Chem. Inf. Comput. Sci., 40 (2000) 773.
27. Huuskonen, J. J.; Villa, A. E. P. and Tetko, I. V., J. Pharm. Sci., 88 (1999) 229.
28. Huuskonen, J. J.; Livingstone, D.J. and Tetko, I. V., J. Chem. Inf. Comput. Sci., 40 (2000) 947.
29. Hall, L.H. and Kier, L.B., J. Chem. Inf. Comput. Sci., 35 (1995) 1039.
30. Szydlo, R.M., Ford M.G., Greenwood, R.G. and Salt, D.W., In Dearden, J.C. (ed.), Quantitative Approaches to Drug Design, Elsevier, Amsterdam, 1983, pp. 203–14.
31. Laass, W., In Seydel, J.K. (ed.), QSAR and Strategies in the Design of Bioactive Compounds, VCH, Weinheim, 1985, pp. 285–289.
32. Bordas, B., In Seydel, J.K. (ed.), QSAR and Strategies in the Design of Bioactive Compounds, VCH, Weinheim, 1985, pp. 389–392.
33. Ford, M.G. and Salt, D.W., In van de Waterbeemd, H. (ed.), Chemometric Methods in Molecular Design, VCH, Weinheim, 1995, pp. 265–282.
34. Yalkowsky, S.H. and Dannenfelser, R-M. (ed.), AQUASOL dATAbASE of Aqueous Solubility, College of Pharmacy, University of Arizona, Arizona, USA, 1990.
35. Biobyte, Corp., 201 W. Fourth St., Suite #204, Claremont, CA 91711, USA.
36. Moriguchi, I., Hirono, S., Nakagome, I. and Hirono, H., Chem. Pharm. Bull., 42 (1994) 976–978.
37. Yalkowsky, S., Chemosphere, 26 (1993) 1239–1261.
38. BMDP Statistical Software Manual, Dixon, W.J. (ed.), University of California Press, 1990.
39. Tetko, I.V., Villa, A.E.P. and Livingstone, D.J., J. Chem. Inf. Comput. Sci., 36 (1996) 794.
40. Wikel, J.H. and Dow, E.R., BioMed. Chem. Lett., 3 (1993) 645.
41. Kühne, R., Ebert, R.-U., Kleint, F., Schmidt, G. and Schüürmann, G., Chemosphere, 30 (1995) 2061.
42. Klopman, G., Wang, S. and Balthasar, D.M., J. Chem. Inf. Comput. Sci., 32 (1992) 474.
43. Stewart, D. K. and Love, W. A., Psychol. Bull., 70 (1968) 160.
44. Roadknight, C.M., Palmer-Brown, D. and Mills, G.E., In Liu, X., Cohen, P. and Berthold, M. (eds), Advances in Intelligent Data Analysis, Springer-Verlag, Berlin, 1997, pp. 337–346.
