Livingstone 2001
KLUWER/ESCOM
© 2001 Kluwer Academic Publishers. Printed in the Netherlands.
Key words: canonical correlation, electrotopological descriptors, log P, log S, neural networks, regression analysis
Summary
It has been shown that water solubility and octanol/water partition coefficient for a large, diverse set of compounds can be predicted simultaneously using molecular descriptors derived solely from a two-dimensional representation of molecular structure. These properties have been modelled using multiple linear regression, artificial neural networks and a statistical method known as canonical correlation analysis. The neural networks give slightly better models, in terms of both fitting and prediction, presumably because they include non-linear terms. The statistical methods, on the other hand, provide information concerning the explanation of variance and allow easy interrogation of the models. Models were fitted using a training set of 552 compounds, a validation set and a test set each containing 68 molecules, and two separate literature test sets for solubility and partition.
Figure 1. Distribution of log S and log P values for the training set.
W1 = a11Y1 + a12Y2 + . . . + a1qYq (1)

and

Z1 = b11X1 + b12X2 + . . . + b1pXp (2)

are maximally correlated. The two new variables, W1 and Z1, are referred to as canonical variates (CV) and the correlation between them (R1) is known as the canonical correlation. A second pair of CVs, (W2, Z2), is then selected to account for a maximum amount of the relationship between the two sets of variables unaccounted for by the first pair of CVs, and so on. The number of pairs of canonical variates constructed equals the smaller of q and p. Thus, the linear relationships,

W1 = a11Y1 + a12Y2 + . . . + a1qYq,
W2 = a21Y1 + a22Y2 + . . . + a2qYq,
. . . . . . . . . . . . . . . . . .
Ws = as1Y1 + as2Y2 + . . . + asqYq

and

Z1 = b11X1 + b12X2 + . . . + b1pXp,
Z2 = b21X1 + b22X2 + . . . + b2pXp,
. . . . . . . . . . . . . . . . . .
Zs = bs1X1 + bs2X2 + . . . + bspXp (3)

can be found, where s is the smaller of q and p. The pairs of canonical variates are extracted so that the correlation R1 between the first pair of CVs (W1, Z1) is a maximum; the correlation R2 between the second pair (W2, Z2) is a maximum subject to these variables being uncorrelated with (W1, Z1) and R2 < R1; and so forth.

Procedure for CCA

The method of extracting the successive pairs of CVs involves an eigenvalue-eigenvector analysis. The eigenvalues (Ri²) and associated eigenvectors constructed by the CCA are based on the combined (p + q) × (p + q) correlation matrix, C, between the descriptor variables and the response variables, where

C = | RXX (p × p matrix)   RXY (p × q matrix) |
    | RYX (q × p matrix)   RYY (q × q matrix) |   (4)

From this matrix an s × s matrix RYY⁻¹ RYX RXX⁻¹ RXY is constructed, and its eigenvalues λ1 > λ2 > . . . > λs are the squares of the canonical correlations between the pairs of canonical variates; each eigenvalue represents the amount of variance in the canonical variate Wi that is accounted for by the other canonical variate Zi. The corresponding eigenvectors allow the canonical variate coefficients (aij and bij in equation (3)) to be calculated.

Canonical weights and canonical loadings

Canonical weights, aij and bij in equation (3), are analogous to the coefficients in multiple linear regression analysis (MRA) and indicate the contribution of each variable to the variance of the respective canonical variate. Canonical loadings are more useful in identifying the nature of the canonical relationships. Canonical loadings give the simple product moment correlation of the original variable and its respective CV and reflect the degree to which the variable is represented by a CV. The canonical loadings can easily be found by correlating the raw variable scores with the canonical variate scores.

Predicting a response variable

There are several ways in which the results of a CCA can be used to obtain a prediction for a response variable. One approach is to regress each of the original responses on the appropriate set of canonical variates which have been constructed. Because each variate is a new, orthogonal variable with a known functional relationship to the original variables, the procedure is straightforward. Each original y variable is regressed on the set of canonical variates constructed from the molecular descriptors (the X block) using a standard multiple regression procedure.

A second approach is to base prediction on a method analogous to that used to solve simultaneous equations. The various canonical variates are regarded as i equations in i unknowns which can be solved analytically; i is the number of variables in the smallest set.

Results and discussion

Regression analysis

Stepwise and backward methods were employed in the regression analysis. The following regression equations were calculated for log S with 28 and log P with 26 significant descriptors at 95% confidence limits:
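The eigen-analysis described in the Procedure for CCA section can be sketched in a few lines of numpy. This is a minimal illustration, not the software used in the study; the function name, variable names and simulated data are all illustrative. The squared canonical correlations are taken as the eigenvalues of RYY⁻¹ RYX RXX⁻¹ RXY, built from the blocks of the combined correlation matrix C of equation (4):

```python
import numpy as np

def cca(X, Y):
    """Sketch of the CCA eigen-analysis described in the text.

    The eigenvalues of Ryy^-1 Ryx Rxx^-1 Rxy are the squared
    canonical correlations; the eigenvectors give the Y-side
    canonical weights, from which the X-side weights follow."""
    p, q = X.shape[1], Y.shape[1]
    # combined (p + q) x (p + q) correlation matrix C (equation (4))
    C = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    Rxx, Rxy = C[:p, :p], C[:p, p:]
    Ryx, Ryy = C[p:, :p], C[p:, p:]
    # s x s matrix (s = min(p, q), here q) whose eigenvalues are Ri^2
    M = np.linalg.solve(Ryy, Ryx) @ np.linalg.solve(Rxx, Rxy)
    lam, A = np.linalg.eig(M)
    lam, A = lam.real, A.real
    order = np.argsort(lam)[::-1]            # lambda_1 > lambda_2 > ...
    lam, A = lam[order], A[:, order]
    R = np.sqrt(np.clip(lam, 0.0, 1.0))      # canonical correlations R1 >= R2 >= ...
    B = np.linalg.solve(Rxx, Rxy) @ A        # X-side weights from the Y-side ones
    # canonical variate scores on autoscaled data
    W = ((Y - Y.mean(0)) / Y.std(0)) @ A     # response-set variates W1 ... Ws
    Z = ((X - X.mean(0)) / X.std(0)) @ B     # descriptor-set variates Z1 ... Zs
    return R, W, Z
```

By construction, the correlation between W1 and Z1 reproduces R1 (up to sign), and similarly for the later pairs. Canonical loadings then follow exactly as the text says, by correlating each raw variable with the variate scores, and the first prediction approach amounts to an ordinary least-squares regression of each y variable on the Z columns (e.g. with np.linalg.lstsq).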
Neural networks
log P values in the test set of 19 drug compounds, respectively.

Canonical correlation analysis

As there are only two variables in the smaller of the two sets of variables (log P and log S), there is a maximum of two pairs of canonical variates that can be extracted. Application of CCA yielded both pairs to be significant (Bartlett’s test: χ² = 442.1, 36 degrees of freedom, p = 0.0000) and Table 5 shows the canonical correlations along with the canonical coefficients and loadings for the response set of variables (cnvrf1, cnvrf2). The first canonical correlation between the 37 descriptors and the two response variables log P and log S is large (0.923). The second is, of course, smaller
Table 2. Comparison of the predictive ability of multiple linear regression and neural network models.
Table 3. The observed and predicted aqueous solubility values for the test set.
Table 4. The observed and predicted partition coefficient values for the test set. (Columns: No., Compound, log Pobsd, MLR, ANN, XLOGP, MLOGP, Rekker, CLOGP, KOWWIN.)
Table 5. Canonical correlations and canonical weights and loadings for the canonical variates for the response set of variables.
Table 6. Proportion of variance in the response set accounted for by the predictor set.
Solubility
  CCAa                 0.752   0.828   0.759   0.785
  Regression on CVs    0.751   0.828   0.758   0.820
Partition
  CCAa                 0.846   0.903   0.867   0.738
  Regression on CVs    0.846   0.904   0.868   0.767
a Obtained by simultaneous equations.
b These are the two published test sets for log S (21 compounds) and log P (19).
of the variance in the response set is accounted for by the predictor set. One might think that RC² provides this information. However, although the squared canonical correlation coefficients do have some variance interpretations, they give the variance shared by the canonical variates and not the variance shared by the original X and Y variables. Stewart and Love [43] have proposed an index, called the redundancy coefficient, which represents the amount of variance in the response set that is ‘redundant’ to the variance in the predictor set. This redundancy coefficient, denoted by RCY/X, is given by

RCY/X = Σ (j = 1 to s) λj [RY(j)]²   (8)

and is the sum of the product of the proportion of explained variance in the Y set that is accounted for by a particular canonical variate with its associated squared canonical correlation coefficient. A redundancy coefficient, RCX/Y, can also be constructed and represents the amount of variance in the X set of variables that is redundant to the variance in the Y set, but this is usually of secondary importance. The redundancy coefficient calculated for the current data set is approximately 80%, which is very high. This figure comes about because the first response set canonical variate contains 81.4% of the response set variability and that variate shares 85.2% of its variance with the first predictor set (X) canonical variate. Considering both canonical variates together, about 80% of the response set variability is accounted for.

The performance of these canonical correlation equations in fitting and prediction is shown in Table 7. A comparison of the results from the ANN analyses of the various data sets (Table 2) with those obtained from the two methods used in the CCA (Table 7) shows that the neural networks provide better fits to the training data and are better at prediction. This is presumably because the ANN are able to accommodate non-linearity in their fitting. The CCA models, on the other hand, do have the advantage that it is possible to ‘dissect’ the amount of variance explained in and by the response and descriptor sets, and the resulting models, like MLR models, can be readily interrogated. Details of the CCA models are shown in the Appendix.

Conclusions

Two important pharmaceutical properties, namely aqueous solubility and partition coefficient, have been modelled simultaneously using ANN and CCA. Both methods produce quite satisfactory models, with the ANN outperforming the CCA approach. CCA has the advantage that the proportion of variance explained is easily obtained and the resulting models can be expressed in the form of coefficients that are the equivalent of multiple regression coefficients (see Appendix). ANN models suffer from the fact that the model is contained within a set, often large, of network weights, although techniques have been proposed by which these models may be extracted [44].

It is, of course, not very surprising that these two properties can be successfully modelled simultaneously, since each property can be well described alone using electrotopological descriptors [25, 28]. What these results show, however, is the ease with which two, or more, dependent variables may be modelled using a set or sets of physicochemical properties. The dependent variables may be pharmacological responses, adverse effects such as toxicity measures, desirable (or undesirable) properties such as log P and log S, measures of stability or pharmacokinetics, and so on. In this case a single set of descriptors was sufficient to model both properties, but in other applications it may be necessary to use different types [1] of descriptor for the different dependent variables. The models produced by the CCA technique can be most informative since the first canonical variate is based on the difference between the two dependent variables whereas the second canonical variate is based on their weighted sum. It is obvious how such information may be useful in the drug design process.
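The Stewart-Love redundancy calculation of equation (8) is easy to reproduce. The sketch below is a minimal illustration with names of my own choosing; it computes [RY(j)]² as the mean squared loading of the Y variables on the j-th response-set variate, and the loadings themselves by correlating raw variable scores with variate scores, as described earlier:

```python
import numpy as np

def redundancy(Y, W, canon_corr):
    """Stewart-Love redundancy coefficient of equation (8):
    RCY/X = sum_j lambda_j * [RY(j)]^2, where lambda_j is the j-th
    squared canonical correlation and [RY(j)]^2 is the proportion of
    Y-set variance carried by the j-th response-set canonical variate."""
    total = 0.0
    for j in range(W.shape[1]):
        # canonical loadings: correlate each raw Y variable with variate j
        loadings = np.array([np.corrcoef(Y[:, k], W[:, j])[0, 1]
                             for k in range(Y.shape[1])])
        var_extracted = np.mean(loadings ** 2)          # [RY(j)]^2
        total += (canon_corr[j] ** 2) * var_extracted   # lambda_j = Rj^2
    return total
```

With the figures quoted above, the first variate alone contributes 0.814 × 0.852 ≈ 0.69, and the two variates together give the roughly 80% redundancy reported for this data set.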
Appendix. Coefficients of the responses and descriptors on the two canonical variates
References

1. Livingstone, D.J., J. Chem. Inf. Comput. Sci., 40 (2000) 195.
2. Nirmalakhandan, N.N. and Speece, R.E., Environ. Sci. Technol., 22 (1988) 328.
3. Bodor, N. and Huang, M-J., J. Pharm. Sci., 81 (1992) 954.
4. Patil, G.S., J. Hazard. Mater., 36 (1994) 35.
5. Sutter, J.M. and Jurs, P.C., J. Chem. Inf. Comput. Sci., 36 (1996) 100.
6. Huibers, P.D.T. and Katritzky, A.R., J. Chem. Inf. Comput. Sci., 38 (1998) 283.
7. Mitchell, B.E. and Jurs, P.C., J. Chem. Inf. Comput. Sci., 38 (1998) 489.
8. Rekker, R.E., Hydrophobic Fragment Constant, Elsevier, New York, 1977.
9. Hansch, C. and Leo, A., Substituent Constants for Correlation Analysis in Chemistry and Biology, Wiley, New York, 1979.
10. Klopman, G. and Iroff, L., J. Comput. Chem., 2 (1981) 157.
11. Bodor, N. and Huang, M-J., J. Pharm. Sci., 81 (1992) 272.
12. Leo, A., Chem. Rev., 93 (1993) 1281.
13. Klopman, G., Li, J-Y., Wang, S. and Dimayuga, M., J. Chem. Inf. Comput. Sci., 34 (1994) 752.
14. Meylan, W.M. and Howard, P.H., J. Pharm. Sci., 84 (1995) 83.
15. Wang, R., Fu, Y. and Lai, L., J. Chem. Inf. Comput. Sci., 37 (1997) 615.
16. Haeberlin, M. and Brinck, T., J. Chem. Soc. Perkin Trans. 2, (1997) 289.
17. Bodor, N. and Buchwald, P., J. Phys. Chem., 101 (1997) 3404.
18. Buchwald, P. and Bodor, N., Curr. Med. Chem., 5 (1998) 353.
19. Bodor, N. and Huang, M-J., J. Am. Chem. Soc., 113 (1991) 9480.
20. Breindl, A., Beck, N., Clark, T. and Glen, R.C., J. Mol. Model., 3 (1997) 142.
21. Schaper, K.-J. and Samitier, M.L.R., Quant. Struct.-Act. Relat., 16 (1997) 224.
22. Devillers, J., Domine, D. and Guillon, C., Eur. J. Med. Chem., 33 (1998) 659.
23. Huuskonen, J.J., Salo, M. and Taskinen, J., J. Pharm. Sci., 86 (1997) 450.
24. Huuskonen, J.J., Salo, M. and Taskinen, J., J. Chem. Inf. Comput. Sci., 38 (1998) 450.
25. Huuskonen, J.J., Rantanen, J. and Livingstone, D., Eur. J. Med. Chem., 35 (2000) 1081.
26. Huuskonen, J.J., J. Chem. Inf. Comput. Sci., 40 (2000) 773.
27. Huuskonen, J.J., Villa, A.E.P. and Tetko, I.V., J. Pharm. Sci., 88 (1999) 229.
28. Huuskonen, J.J., Livingstone, D.J. and Tetko, I.V., J. Chem. Inf. Comput. Sci., 40 (2000) 947.
29. Hall, L.H. and Kier, L.B., J. Chem. Inf. Comput. Sci., 35 (1995) 1039.
30. Szydlo, R.M., Ford, M.G., Greenwood, R.G. and Salt, D.W., In Dearden, J.C. (ed.), Quantitative Approaches to Drug Design, Elsevier, Amsterdam, 1983, pp. 203–214.
31. Laass, W., In Seydel, J.K. (ed.), QSAR and Strategies in the Design of Bioactive Compounds, VCH, Weinheim, 1985, pp. 285–289.
32. Bordas, B., In Seydel, J.K. (ed.), QSAR and Strategies in the Design of Bioactive Compounds, VCH, Weinheim, 1985, pp. 389–392.
33. Ford, M.G. and Salt, D.W., In van de Waterbeemd, H. (ed.), Chemometric Methods in Molecular Design, VCH, Weinheim, 1995, pp. 265–282.
34. Yalkowsky, S.H. and Dannenfelser, R-M. (eds), AQUASOL dATAbASE of Aqueous Solubility, College of Pharmacy, University of Arizona, Arizona, USA, 1990.
35. BioByte Corp., 201 W. Fourth St., Suite #204, Claremont, CA 91711, USA.
36. Moriguchi, I., Hirono, S., Nakagome, I. and Hirono, H., Chem. Pharm. Bull., 42 (1994) 976–978.
37. Yalkowsky, S., Chemosphere, 26 (1993) 1239–1261.
38. Dixon, W.J. (ed.), BMDP Statistical Software Manual, University of California Press, 1990.
39. Tetko, I.V., Villa, A.E.P. and Livingstone, D.J., J. Chem. Inf. Comput. Sci., 36 (1996) 794.
40. Wikel, J.H. and Dow, E.R., BioMed. Chem. Lett., 3 (1993) 645.
41. Kühne, R., Ebert, R.-U., Kleint, F., Schmidt, G. and Schüürmann, G., Chemosphere, 30 (1995) 2061.
42. Klopman, G., Wang, S. and Balthasar, D.M., J. Chem. Inf. Comput. Sci., 32 (1992) 474.
43. Stewart, D.K. and Love, W.A., Psychol. Bull., 70 (1968) 160.
44. Roadknight, C.M., Palmer-Brown, D. and Mills, G.E., In Liu, X., Cohen, P. and Berthold, M. (eds), Advances in Intelligent Data Analysis, Springer-Verlag, Berlin, 1997, pp. 337–346.