Lecture Notes in Bioinformatics
Lecture Notes in Bioinformatics
Mathematical Methods
for Protein Structure
Analysis and Design
13
Series Editors
Sorin Istrail, Celera Genomics, Applied Biosystems, Rockville, MD, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA
Volume Editors
Concettina Guerra
Università degli Studi di Padova
Dipartimento di Ingegneria dell’Informazione
via Gradenigo 6a, 35131 Padova, Italy
E-mail: guerra@dei.unipd.it
Sorin Istrail
Celera Genomics, Applied Biosystems
45 West Gude Drive, Rockville, MD 20850, USA
E-mail: sorin.istrail@celera.com
A catalog record for this book is available from the Library of Congress.
ISSN 0302-9743
ISBN 3-540-40104-0 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
Concettina Guerra
Sorin Istrail
Contents
10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
11 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Identifying Flat Regions and Slabs in Protein Structures
Mary Ellen Bock, Concettina Guerra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2 A Geometric Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3 An Improved Geometric Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Performances of the Two Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Plane Detection in Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
OPTIMA: A New Score Function for the Detection of Remote
Homologs
Maricel Kann, Richard A. Goldstein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A Comparison of Methods for Assessing the Structural
Similarity of Proteins
Dean C. Adams, Gavin J.P. Naylor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2 The DALI Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3 The Root Mean Square Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4 Geometric Morphometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5 Comparison of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Prediction of Protein Secondary Structure at High Accuracy
Using a Combination of Many Neural Networks
Claus Lundegaard, Thomas Nordahl Petersen, Morten Nielsen,
Henrik Bohr, Jacob Bohr, Søren Brunak, Garry Gippert, Ole Lund . . . . . 117
1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Contents XI
1 Introduction
A protein is a complex molecule for which a simple linear structure, given by
the sequence of its aminoacids, determines a unique, often very beautiful, three
dimensional shape. Such shape (3D structure) is perhaps the most important
of all protein’s features, since it determines completely how the protein func-
tions and interacts with other molecules. Most biological mechanisms at the
protein level are based on shape-complementarity, so that proteins present
particular concavities and convexities that allow them to bind to each other
and form complex structures, such as skin, hair and tendon. For this reason,
for instance, the drug design problem consists primarily in the discovery of
ad hoc peptides whose 3D shape allows them to “dock” onto some specific
proteins and enzymes, to inhibit or enhance their function.
The analysis and development of models, algorithms and software tools
based on 3D structures are heceforth very important fields of modern Struc-
tural Genomics and Proteomics. In the past few years, many tools have
emerged which allow, among other things, the comparison and clustering of
3D structures. In this survey we cannot possibly recollect all of them, and even
if we tried, our work would soon be incomplete, as the field is dynamically
changing, with new tools and databases being introduced constantly. At the
heart of all of them, there is the Brookhaven Protein Data Bank, the largest
world’s repository of 3D protein structures.
Two typical problems related with protein structure are Fold Prediction
and Fold Comparison. The former problem consists in trying to determine the
3D structure of a protein from its aminoacid sequence alone. It is a problem
of paramount importance for its many implications to, e.g., medicine, genetic
engineering, protein classification and evolutionary studies. We will not discuss
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 1-33, 2003.
Springer-Verlag Berlin Heidelberg 2003
2 Giuseppe Lancia and Sorin Istrail
the Fold Prediction problem here, if not marginally, but we will instead focus
on algorithms and applications for protein fold (structure) comparison. The
following are some of the reasons why the Fold Comparison problem is also
extremely important:
• For determining function. The function of a new protein can be determined
by comparing its structure to some known ones. That is, given a set of pro-
teins whose fold has already been determined and whose fuction is known,
if a new one has a fold highly similar to a known one, then its function will
similar as well. This type of problems imply the design of search algorithm
for 3D databases, where a match must be based on structure similarity.
Analogous problems have already been studied in Computational Geom-
etry and Computer Vision, where a geometric form or object has to be
identified by comparing it to a set of known ones.
• For clustering. Given a set of proteins and their structures, we may want
to cluster them in families based on structure similarity. Furthermore, we
may want to identify a consensus structure for each family. In this case,
we would have to solve a multiple structure alignment problem.
• For assessment of fold Predictions. The Model Assessment Problem is the
following: Given a set of “tentative” folds for a protein, and a “correct”
one (determined experimentally), which of the guesses is the closest to the
true? This is, e.g., the problem faced by the CASP (Critical Assessment
of Structure Prediction) jurors, in a biannual competition where many re-
search groups try to predict protein structure from sequence. The large
number of predictions submitted (more than 10,000) makes the design of
sound algorithms for structure comparison a compelling need. In particu-
lar, such algorithms are at the base of CAFASP, a recent Fully Automated
CASP variant.
The problem of Structure Similarity/Alignment determination is in a way
analogous to the Sequence Alignment problem, but the analogy is only su-
perficial and it breaks down when it comes to their complexity. There is a
dramatic difference between the complexity of sequence alignment and struc-
ture alignment. As opposed to the protein sequence alignment, where we are
certain that there is a unique alignment to a common ancestor sequence, in
structure comparison the notion of a common ancestor does not exist. Simi-
larity in folding structure is due to a different balance in folding forces, and
there is not necessarily a one-to-one correspondence between positions in both
proteins. In fact, for two homologous proteins that are distantly related, it is
possible for the structural alignment to be entirely different from the correct
evolutionary alignment [14].
By their nature, three-dimensional computational problems are inherently
more complex than the similar one-dimensional ones for which we have more
effective solutions. The mathematics that can provide rigorous support in
understanding models for structure prediction and analysis is almost nonex-
istent, as the problems are a blend of continuous, geometric- and combinato-
Protein Structure Comparison: Algorithms and Applications 3
usually employed to justify the use of simple heuristics instead of any rigor-
ous techniques or provable property of the model. However, after a heuristic
optimization, the so–called “global optimum” found is then used to draw
conclusions of biological meaning or support some theory on the biological
process. Hence, it is important for the quality of the solution to be as good
as possible, which justifies the use of sophisticated optimization techniques.
It is auspicable therefore an ever increasing cooperation of the community of
theoretical computer scientists, applied discrete mathematicians and molec-
ular biologists to address these problems starting from their computational
complexity, devising exact and approximation algorithms, studying lower and
upper bounds, and designing effective heuristics of many sorts.
The remainder of the paper is organized as follows. We start with a short
account on 3D structures, their properties and how they are determined. The
rest of the article is divided roughly in two parts. The first part focuses on ap-
plications of structure alignment problems, while the second part is centered
on algorithms. This second part is in turn divided in two, first talking about
measures which are not based on contact maps, and then focusing on contact
map problems. Among the applications of structure comparison, we describe
some databases based on structure similarity (PDB, SCOP, HOMSTRAD,
SLoop, CAMPASS, FSSP) and the problem of protein folding prediction as-
sessment, relevant, e.g., in the CASP and CAFASP competitions. We then
describe some of the main algorithms for structure comparison, and in partic-
ular those used by some of the applications previously mentioned. We survey
RMSD and its variants, DALI, MaxSub, Lgscore, GDT, Geometric Hashing
(3dSEARCH), MNYFIT and STAMP. In the final part of the article, we talk
about novel algorithms and problems based on the contact map representa-
tion of structures. We review the literature and some of our work on exact
and heuristic algorithms for the maximum contact map overlap problem. The
chapter is closed by a section on possible generalizations and directions for
future work.
2 Preliminaries
A protein consists of a chain of aminoacids, of length varying from about 50 to
3,000 and more. The chemical structure of a single aminoacid is comprised of
a carbon atom (called Cα ) connected to a carboxyl group and an amine group,
a hydrogen atom and a part depending on the specific aminoacid, called the
residue. The amine group of one aminoacid is linked to the carboxyl group of
the next one, giving rise to the linear chain. The backbone of the protein is
the sequence of Cα atoms, to which the rest of the atoms are attached.
The chemical properties and forces between the aminoacids are such that,
whenever the protein is left in its natural environment, it folds to a spe-
cific 3-dimensional structure, called its native, which minimizes the total free
energy. Several experiments have confirmed that two proteins with the same
Protein Structure Comparison: Algorithms and Applications 5
aminoacid sequence have the same 3D structure under natural conditions, and
that, if a folded protein is artificially stretched to a chain and then released,
it will fold again to the same native.
The 3D structure of a protein is fully specified by the set of (x, y, z)–
coordinates of its atoms, with respect to an arbitrary origin. An alternative,
weaker description is the set of distances between either all atoms or only
the Cα s. The main procedures used to to determine the 3D structure of a
protein are X-ray crystallography and Nuclear Magnetic Resonance. In X–ray
crystallography the 3D structure of a molecule is determined by X–ray diffrac-
tion from crystals. This technique requires the molecule to be first crystalized,
at perfect quality; then, its diffraction pattern, produced by X–irradiation is
studied. This involves the analysis of thousands of spots, each with a posi-
tion and an intensity. Obtaining good crystals and studying the phases of the
waves forming each spot are very complex problems. For these reasons, X–ray
crystallography is a very long and delicate process, which may take several
months to complete. The result is the set of spacial coordinates of each atom
in the structure. For a good description of this technique, see [47] or [5].
Nuclear Magnetic Resonance (NMR) is performed in an aqueous solution,
where the molecules tumble and vibrate from termal motion. NMR detects
the chemical shift of atomic nuclei with nonnull spin. In order to get an ad-
equate resolution, the molecule must tumble rapidly, which typically limits
the size of the proteins to which this technique is applicable. The result of
NMR is a set of (estimates of) pairwise distances between the atoms and
hence it yields a collection of structures (namely all those compatible with
the observed atom–to–atom distances) rather than a single one. According to
some studies, the results of NMR are “not as detailed and accurate as that
obtained crystallographically” [8].
For a protein we distinguish several levels of structure. At a first level, we
have its primary structure, given by the monodimensional chain of aminoacids.
Subsequent structures depend on the protein fold. The secondary structure
describes the protein as a chain of structural elements, the most important of
which are α–helices and β–sheets. The tertiary structure is the full description
of the actual 3D fold, while the quaternary structure describes the interaction
of several copies of the same folded protein. Correspondingly, when comparing
two proteins, one may use algorithms that highlight differences and similarities
at each of these levels. Algorithms for the comparison of primary structures
are known as sequence alignment methods and are beyond the scope of this
paper. Here, we will focus mainly on algorithms based on the secondary and
tertiary structures.
The input to a generic structure alignment algorithm contains the de-
scription of the n–ary structures of two proteins (e.g., for n = 3, the set of 3D
atomic coordinates of two different molecules). Loosely speaking, the goal of
the algorithm is to find a geometrical transformation (typically rigid, made
of rotation and translation) which maps a “large” number of elements of the
first molecule to corresponding elements of the second. Such algorithms can
6 Giuseppe Lancia and Sorin Istrail
The Protein Data Bank (PDB, [3]) is the most comprehensive single worldwide
repository for the processing and distribution of 3D biological macromolecular
structure data. As soon as a new 3D structure has been determined, it can
be deposited in the PDB, this way making it readily available to the scientific
comunity. Currently, the 3D structures of proteins in the PDB are determined
mainly by X-ray crystallography (over 80%) and Nuclear Magnetic Resonance
(about 16%). The remaining structures are determined by other methods, such
as theoretical modeling, which are not experimental in nature, but rather
based on computation and/or similarity to known proteins.
In this archive, one can find the primary and secondary structure infor-
mation, atomic coordinates, crystallographic and NMR experimental data, all
annotated with the relevant bibliographic citations, in a suitable text file for-
mat. Several viewers are available which interpret this format to give a 3D
graphic rendering of the protein.
Protein Structure Comparison: Algorithms and Applications 7
The amount of information in the PDB has increased rapidly over the
past years, passing from the roughly one thousand structures of 1990 to over
14,000 of today (see Figure 3.13 ). In order for this information to be usable and
accessible, it must be organized into databases which can be queried, e.g., for
global/local 3D similarity and/or complementarity, or for finding evolutionary
related homologs. A number of such databases has been created in the last
decade, each with its own special features. Common to all of them is the
use of structure comparison algorithms as the basic tool for clustering and
information retrieval. In the remainder of the section we survey a few of the
most known, i.e. SCOP, FSSP, HOMSTRAD, SLoop and CAMPASS.
SCOP
In the SCOP (Structural Classification Of Proteins, [42]) database, all pro-
teins of known structure are related to each other according to their structural
similarity and evolutionary relationships. The database was created by both
manual inspection and the use of algorithms and automated methods. It con-
tains detailed information on all known protein folds, and on all structurally
close relatives to a given protein. The SCOP database is also a starting point
for the construction of more specific databases, of which we will mention a
few later on in the survey.
As of 2000, the database contained information on some 12,000 entries from
the PDB, with relative folds, families and superfamilies statistics. A hidden
Markov Model for SCOP superfamilies has recently been proposed by Gough
et al [19]. One of the programs used in the partially automatic classification
process in SCOP is 3dSEARCH [51], a procedure based on Geometric Hashing,
which is a technique originally developed in the field of Computer Vision. We
will shortly describe the technique and the algorithm later in this paper.
3
Figure from http://www.rcsb.org/pdb/holdings.html
8 Giuseppe Lancia and Sorin Istrail
HOMSTRAD
SLoop
The basis of the SLoop database [7] is in the work of Donate et al. [11], who de-
scribed a classification of protein loops and loop regions. SLoop is a database
of super secondary fragments, in which proteins are clustered according to
the similarity of their secondary structures, in particular the lenght and type
of loop regions. The clustering of structures derives form a two–stage pro-
cess. First, the program SSTRUC is used to identify the loop regions, and
each loop is separated into groups according to the length and type of the
bounding secondary structures. Then, the loops within each group are pair-
wise superimposed, the relative similarities are computed and then used in a
classical clustering scheme.
CAMPASS
to the SCOP database and the literature. Individual domains within multi-
domains proteins have been considered separately. Altough this database is
founded on primary structure similarity, it fits within the scope of this survey
since for the distantly related members of each superfamily, simple multiple
sequence alignment procedures are not appropriate. Hence, the alignment of
superfamily members is based on the conservation of structural features like
the presence of secondary structures, hydrogen bonding and solvent accessi-
bility. For this type of alignments, the program COMPARER [49] has been
used, which takes into account structural information. In the present version,
the database consists of 52 superfamilies for which structure-based sequence
alignments are available.
FSSP
The problem of protein fold prediction is perceived as one of the core ques-
tions, albeit extremely complex, for the molecular biology and gene therapy
of the 21st century. The amount of time and work required to determine a
protein structure experimentally makes the design of faster yet highly reli-
able methods a compelling need. One of the possibilities which is extensively
studied is the creation of algorithms capable of computing the final fold of
a protein from its aminoacid sequence. This is a challenging computational
problem and the prize for its solution includes, beyond the obvious scientific
and medical implications, also relevant economical interests. For this reasons,
some companies among which IBM, are devoting a good deal of research into
this problem. In 1999 IBM announced a $100 million research project, named
Blue gene [2], for the development of a system capable of more than one
10 Giuseppe Lancia and Sorin Istrail
petaflop (1015 operations per second) which will tackle the problem of protein
folding by simulating the actual physical phenomenon in a massive, parallel,
computation.
The computational problem of protein fold prediction is beyond the scope
of this paper. For a good introduction to the topic see, e.g., [34]. What is
relevant for our purposes is the assessment of the prediction quality, i.e., how
can we judge if a prediction is “close enough” to a model? Clearly, we need
tools of structure comparison which, given a model determined experimentally,
and a prediction of the former, are capable to come up with a score (possibly,
a real number in the range [0, 1], with 0 meaning completely wrong, 1 meaning
completely right).
We now describe some of the situations where this problem occurs and
how it is currently solved.
main algorithm used in CAFASP for scoring the predictions is MaxSub, de-
scribed later in this paper. This algorithm was partially validated by scoring
some of the CASP3 Fold-Recognition (FR) models and comparing the rank-
ings obtained to the human–expert assessment carried out by CASP jurors.
Even if some differences were observed, the top six predicting groups ranked
by MaxSub were the same as those ranked by CASP3. On the other hand,
this algorithm had very little success in comparing hard FR targets to their
models, because it fails to detect local structural similarities. In the words of
CAFASP organizers [13], even for a target with a known fold, the fact that
MaxSub scored a prediction with zero does not necessarily imply that the pre-
diction is totally wrong. (...) Because of the relatively low sensitivity of our
automated evaluation, a “human expert” is required to learn more about the
prediction capabilities of the servers.
Live Bench
The Live Bench Project [6] is an effort aimed at the continuous benchmark-
ing of protein structure prediction servers. This project implements an au-
tomatic large–scale assessment of the performance of publicly available fold–
recognition/prediction servers, such as PDB–Blast, GenTHREADER, FFAS,
T98-lib, 3D-PSSM and INBGU. The goal is to provide a consistent, automa-
tized framework in which such servers can be evaluated and compared. Every
week, all new PDB protein structures are submitted to all participating fold
recognition servers. For instance, in the period October 1999–April 2000, 125
targets were used for comparisons, divided into 30 “easy” structures and 95
“hard” ones. The results of the predictions are compared using similar algo-
rithms as in CAFASP, i.e., MaxSub and Lgscore. Note that, differently from
CASP and CAFASP, these “predictions” are in fact done after the struc-
tures are made known and available on the PDB, and hence the reliability of
LiveBench is based on the assumption that the evaluated servers do not use
any hidden feature directing the prediction towards the correct answer. One
of the main results of the LiveBench project was to recognize that the servers
largely disagree in many cases (they were able to produce structurally similar
models for only one half of the targets, and for only one third of the targets
such models were significantly accurate) but that a “combined consensus”
prediction, in which the results of all servers are considered, would increase
the percentage of correct assignments by about 50%.
RMSD
Given n points (a1 , . . . , an ), with ai = (ai1 , ai2 , ai3 ) and n points (b1 , . . . , bn ),
a “rigid body superposition”, is a composition of a rotation and a traslation,
that takes the points ai into a i and makes the sum of differences |a i − bi | as
small as possible.
More specifically, let ∆ be defined by
∆2 = min |Rai + t − bi |,
R,t
i
bi s. To determine the
n optimal R one must consider the correlation matrix A
defined by Aij = h=1 ahi bhj , as it can be shown that the optimal rotation
maximizes Tr(RA). One way to maximize Tr(RA), exploits the fact that the
problem is three dimensional. Represent the matrix R as
where
0 n −m
M = −n 0 l
m −l 1
and θ represent the rotation angle around an axis in the direction of the
unit vector u = (l, m, n). It then follows that
max Tr(RA) = Tr(A) + (Tr(M A))2 + (Tr(A) − Tr(M 2 A))2 .
θ
So, for a fixed axis of rotation, the calculation of the optimal angle θ is
immediate [36].
Alternative ways of determining the optimal rotation matrix R, based on
the computation of the eigenvalues of AT A, or on the quaternion method, are
described in [31].
As previously mentioned the RMSD is widely used as a sequence depen-
dent measure of 3D similarity, i.e., to find an optimal mapping of points of a
structure into a predetermined set of corresponding points in another struc-
ture. Despite some known flaws, the RMSD measure is at the hearth of many
other measures. The main problem with RMSD is that it is more sensitive to
small regions of local differences than to large regions of similarity. In fact,
small distances between well-matched atoms have a much lesser impact on
the RMSD than do the very large distances between poorly matched Cα s.
Another issue with RMSD is that it is not designed to be equivalent be-
tween different targets. For instance, the quality of a model with an RMSD
of a 4 Å for a 40 residues long target is not the same as that of a model of 4
Å RMSD over 400 residues.
The problem of local different regions affecting the total score is common
to other global measures, but for the RMSD value the effect is larger than for
other scores, e.g., the contact map similarity. In this situation, a solution is
typically to first calculate the similarity for segments of the protein and then
define a normalized score based on the number of residues in the segments.
However, the right relationship between the length of the segment and the
14 Giuseppe Lancia and Sorin Istrail
value must be defined, and this is not a trivial problem. There are two con-
flicting issues: we are interested in large segments that are highly similar, but
the similarity and the length of the segments are inversely proportional.
Similar problems are faced by “trimming” methods which, given a current
alignment, re-define iteratively a core of pairs of residues that are matched
within a small distance, work on these only, and then try to extend the align-
ment. The threshold and the degree of trimming are to some extent arbitrary,
and this choice affects the final outcome.
DALI
In DALI [25], the structural alignment of two proteins is done indirectly, not
by comparing the actual structures, but their distance matrices. The distance
matrix of a folded protein P is the matrix DP = (dP ij ) of the Euclidean dis-
tances between all pairs (i, j) of its residues. The matrix provides a 2D repre-
sentation of a 3D structure, and contains enough information for retrieving the
actual structure, except for overall chirality [22]. The idea underlying DALI is
that if two structures are similar, then their distance matrices must be similar
too. An analogous idea is used to compare structures via their contact maps
and is described later in this article. In order to find similarities between two
distance matrices, DA and DB , DALI uses a heuristic trick, and looks for all
6 × 6 submatrices of consecutive rows and columns in DA and DB (that is, it
maps iA , iA + 1, . . . , iA + 5 into iB , iB + 1, . . . , iB + 5, and j A , j A + 1, . . . , j A + 5
into j B , j B + 1, . . . , j B + 5) to find patterns of similarity. This is done in the
first step of the algorithm. The set {iA , . . . , iA + 5, j A , . . . , j A + 5} with its
counterpart in B, is called a contact pattern. The second step attempts at
merging the contact patterns into larger, consistent, alignments. An align-
ment of two proteins A and B is a one-to-one function M of some residues
MA (the matched ones) of A into some residues of B. DALI tries to determine
the alignment that maximizes the global objective function
S(M ) = φ(i, j, M (i), M (j)).
i,j∈MA
where θ = 1.5Å is the zero level of similarity. The more complex elastic score
uses relative, other than absolute deviations, and an envelope function to
weigh down the contribution of long distance matched pairs.
Some heuristic rules are employed to speed up the search. For instance,
only a subset of all contact patterns is considered. Also, overlapping contact
patterns (they could possibly overlap by 11 out of 12 residues) are suppressed
by partitioning the protein in nonoverlapping structural elements and merging
Protein Structure Comparison: Algorithms and Applications 15
repetitive segments. Further rules to prune the search involve the suppression
of pairs of patterns for which the sum of distances of one is not within a given
interval of the same value in the other. Other rules of the same nature are
employed which we do not describe here. The final optimization is done via a
Monte Carlo algorithm, which is basically the local search strategy known as
Simulated Annealing. In this search a neighborhood of a solution is explored
for a good move, leading to a new solution. A move which takes into a solution
worse than the current one can be accepted, but with probability inverse to the
degradation in the solution quality. The temperature of the system is also used
to favour or make more difficult such non–improving moves. DALI uses two
basic moves, named expansion and trimming ones. Expansion moves try to ex-
tend a current partial alignment by adding to it a contact pattern compatible
with one of the currently matched residues (i.e., containing the same match),
and then possibly removing some matches that have become noncompatible.
Trimming moves simply remove from a current alignment any subalignment
of 4 elements that gives negative contribution to the total similarity score.
These moves are alternated as one trimming cycle every five expansion cycles.
DALI was used to carry out an all–against–all structure comparison for
225 representative protein structures from the PDB, providing the basis for
classification in the FSSP database. Also, the DALI server can be used to
submit the coordinates of a query protein structure which is then compared
against the PDB.
MaxSub
1
S(M ∗ ) = /n,
i∈M ∗
1 + ( ddi )2
where di = ||ai − TM ∗ (bi )||. In our opinion, there are several non rigorous
steps that should be addressed in evaluating MaxSub and its reliability. For
instance, although the basic objective optimized by the algorithm is maximize
|M | such that di ≤ d for all i ∈ M , the optimization of S (whichis the score
1
eventually returned) calls ultimately for maximize over all M , i∈M d2 +d i
2
Lgscore
The measure Lgscore [9] is used in LiveBench and CAFASP for the automatic
assessment of fold recognition problems. This measure is statistically based,
and relies on the following formula, by Levitt and Gerstein [32], for the simi-
larity of two structures, after a superposition in which M denotes the aligned
residues:
1 Ng
Sstr (M ) = K 2
− ,
1 + (di /d0 ) 2
i∈M
where K and d0 are constants (usually set to 10 and 5Å), di is the distance
between the aligned pairs i of Cα atoms, and Ng is the number of gaps in the
alignment. The P–value of a similarity is the likelihood that such a similarity
could be achieved by chance. Levitt and Gerstein showed how to compute the
P–values for a distribution of Sstr depending on the length of the alignment.
Lgscore is the negative log of the P–value for the most significant subpart of
the alignment. In order to find such most significant segment, two heuristic
algorithms are used, i.e. “top–down” and “bottom–up”. The top–down ap-
proach consists of a loop in which (1) a superposition is done of all residues
that exist in the current model and the target; (2) the P–value of this super-
position is stored and the residues that are furthest apart are deleted in the
model and the target. The loop is repeated as long as there are at least still
25 residues. The alignment with the best P–value is returned. The bottom–up
approach essentially tries to match a fragment i, . . . , i+j of j ≥ 25 consecutive
residues in the model and the target, for different values of i and j and re-
turns the best P–valued fragment. None of the two approaches dominates the
other, although the authors report that the bottom–up algorithm found the
best subset in most cases. The arbitrary value of 25 residues as a threshold for
fragment length is justified in [9] since “short segments are given unrealistic
good P–values”.
GDT
GDT (Global Distance Test, [59]) is an algorithm for identifying large sets of
residues (not necessarily continuous) in a prediction and a target which do
not deviate more than a given cutoff value. The algorithm is heuristic, and
similar to Lgscore. It consists of a loop, started from the alignment between
model and target of the longest continuous fragments within the cutoff value,
found by enumeration, and all three–residue segments. At each iteration, an
RMS transfom is computed for the given alignment, and the pairs of residues
whose distance is greater than the threshold are removed. The loop is ended
as soon as no residues are removed by one iteration.
18 Giuseppe Lancia and Sorin Istrail
MNYFIT
STAMP
One of the programs used by the database HOMSTRAD for the detection
of homologous protein families is STAMP [48], a program for the alignment
of protein sequences based on their 3D structures. The core of this program
contains an implementation the basic Smith and Waterman dynamic pro-
gramming procedure for sequence alignment [52], but using suitable similarity
scores which express the probability of residue–residue structural equivalence.
These scores are computed according to the equation by Argos and Rossmann
[1]
d2ij s2ij
Pij = exp exp ,
−2E12 −2E22
where dij is the distance between the Cα atoms of residues i and j, and
sij measures the local main chain conformation. The Smith and Waterman
procedure is embedded in a loop as follows: at each iteration, the output of
the dynamic program is a residue–residue equivalence (the sequence align-
ment). This list of equivalences is used to compute a best fit transformation
minimizing the RMSD, via the least square method of McLaughlan. The new
set of coordinates and residue–residue distances, obtained under this trans-
formation, are then used to recompute the similarity score values, which are
then used for another round of the Smith Waterman procedure. The loop is
repeated iteratively until the alignment becomes stable.
The method has proved experimentally effective, allowing the generation
of tertiary structure based multiple protein sequence alignments for a vari-
ety of protein structural families. However, the method is only effective for
proteins which share a good deal of global topological similarity, while fails if
applied to, e.g., proteins with common secondary structures but with different
connectivity, orientation or organization.
STAMP allows for the specification of a minimum number of equivalent
residues to be matched in two structures, the reversals of strand directions, the
20 Giuseppe Lancia and Sorin Istrail
swapping of sequence segments and more. The output contains two measures
of alignment confidence: a “structural similarity score”, which can also be used
to measure the functional and evolutionary relationship, and an “individual
reside accuracy” which is intended to measure the quality of the topological
equivalence of the pairs of aligned residues.
A contact map [33, 21] is a binary version of the distance matrix representation
of a protein structure. More specifically, the contact map of a folded protein
of n residues is a 0-1, n × n matrix C, whose 1–elements correspond to pairs
of amino acids in three–dimensional “contact”. A contact can be defined in
many ways. Typically [37], one considers Cij = 1 when the distance of two
heavy atoms, one from the i–th aminoacid and one from the j–th aminoacid of
the protein, is smaller than a given threshold (e.g., 5Å). The framework of the
contact map representation of proteins is very appealing, since this intuitive
and fairly simple representation is already complex enough to capture the
most important properties of the folding phenomenon. It has been shown
that it is relatively easy to go from a map to a set of possible structures to
which it may correspond [21, 56]. This result has opened the possibility of
using contact maps to predict protein structure from sequence, by predicting
contact maps from sequence instead. Vendruscolo and collaborators, among
others, have looked at the problem of devising an energy function based on
contacts, which should be minimized by the protein’s native state contact map
[57, 45]. For this purpose, they have set up a system of linear inequalities, with
20 × 20 variables Cab for all pairs of aminoacids a and b, which represent the
weight to give to a contact between the aminoacids a and b. The inequalities
are built as follows. Given the contact map of a correct structure r, there
is an inequality for any alternative structure w over the same sequence of
aminoacids, imposing that the energy of r is lower than that of w. Alternative
structures are obtained, e.g., by threading through other known folds. The
results are that “a simple pairwise contact energy function is not capable of
assigning the lowest energy to an experimentally determined structure” [45],
but by using corrective factors, such as a distinction between contacts on the
surface or in the core, and simple distance–dependant interaction weights, one
can achieve contact potentials which are in fact often stabilized by the native
contact maps. The use of energy function minimization to predict contact
maps is just one possible way. To the best of our knowledge, very little success
has been met so far in the contact map prediction problem. It is possible that
the research on this question will be boosted by the fact that the competition
CAFAPS has recently introduced the new “contacts” prediction category.
The statistics of contact maps have been studied as well, and it has been
shown that the number of contact maps corresponding to the possible config-
urations of a polypeptide chain of n residues, represented by a self–avoiding
Protein Structure Comparison: Algorithms and Applications 21
G1
G2
Besides their use for protein fold prediction, contact maps can be exploited
to compare 3D structures. The basic idea is fairly obvious: if two structures
are similar, we should expect their contact maps to be similar as well, and
conversely. Hence, we can use an indirect method for structure comparison,
i.e., contact map comparison instead. In our previous work [29] we have de-
signed an exact algorithm based on an Integer Programming formulation of
this problem, which we will now review.
We can regard the contact map of a protein p as the adjacency matrix of a
graph Gp . Each residue is a node of Gp , and there is an edge between two nodes
if the the corresponding residues are in contact. The Contact Map Overlap
(CMO) problem, calls for determining a sequence–independent alignment of
some residues in the first protein (nodes in G1 ) with residues of the second
protein (nodes in G2 ) which highlights the largest set of common contacts as
follows: The value of the alignment is the number of contacts (i.e., edges) in
the first map whose endpoints are aligned with residues that are also in contact
in the second map. This value is called the overlap for the two proteins, and
the optimization problem is to find the maximum overlap. Figure 5.1 shows
two contact maps and a feasible alignment.
This measure was introduced in [15], and its optimization was proved
NP-hard in [18], thus justifying the use of sophisticated heuristics or Branch–
and–Bound methods.
Some of the most powerful algorithms for finding exact solutions of com-
binatorial optimization problems are based on Integer Programming (IP) [43].
The IP approach consists in formulating a problem as the maximization of a
linear function of some integer variables and then solving it via Branch–and–
Bound. The upper bound comes from the Linear Programming (LP) relax-
ation, in which the variables are not restricted to be integer, and is polyno-
22 Giuseppe Lancia and Sorin Istrail
mially solvable. When the LP–relaxation value is close to the value over the
integers, then the bound, and hence the pruning of the search space, is effec-
tive. In order to obtain good bounds, the formulation is often reinforced by
the use of additional constraints, called cuts (from which the approach name,
Branch–and–Cut): these are constraints that do not eliminate any feasible in-
teger solution, but make the space of fractional solutions smaller, this way
decreasing the value of the LP bound. In many cases a good IP formulation
requires an exponential number of constraints and/or cuts. This would make
its practical solution impossible, unless there is a way to include all of them
only implicitely. This way exists, and works as follows. Given an LP fractional
solution x∗ and an inequality ax ≤ b, we say that the inequality is violated
by x∗ if ax∗ > b. If we have an exponential number of inequalities, we can
solve the LP with only a (small) subset of them, obtain a solution x∗ and
then check if any of the (exponentially many) inequalities that were left out
is violated by x∗ . If not, the current solution is optimal with respect to the
full formulation, otherwise, we can add the violated inequality to the current
LP and iterate the process. The check for a violated inequality is called sepa-
ration and is carried out by a separation algorithm. By a fundamental result
of Grötschel, Lovász and Schrijver [20], the existence of a polynomial–time
separation algorithm is a necessary and sufficient condition for solving the
whole exponential–sized LP relaxation in polynomial time.
The CMO problem can be reduced to a (very large) Maximum Indepen-
dent Set (MIS) problem on a suitable graph. An independent set is a set of
vertices such that there is no edge between any two of them. The MIS is a
classic problem in combinatorial optimization which has a definition nice and
simple, but is one of the toughest to solve exactly. However, by exploiting the
particular characteristics of the graphs derived from the CMO problem, we
can in fact solve the MIS on graphs of 10,000 nodes and more.
The natural formulation of the MIS as an IP problem has a binary variable
xv for each vertex v, with the objective of maximizing v xv , subject to
xu + xv ≤ 1 for all edges {u, v}. This formulation gives a very weak bound,
but
it can be strengthened by the use of clique inequalities cuts, such as
v∈Q xv ≤ 1, which say that any clique Q can have at most one node in
common with any independent set. The addition of these constraints can lead
to tight formulations. This is exactly the case for the CMO problem. In [29]
we formulated the CMO problem as an IP and solved it by Branch–and–
Cut, where the cuts used are mainly clique–inequalities. Although there is an
exponential number of different clique inequalities, we can characterize them
completely and separate over them in fast (O(n2 )) polynomial time. Note that
finding cliques in a graph is in general a difficult problem, but in this case
we can solve it effectively since it can be shown that the underlying graph
is perfect. The following section gives some details on the formulations and
results from [29]. Further details and full proofs can be found in the original
paper.
Protein Structure Comparison: Algorithms and Applications 23
IP Formulation
We can phrase the CMO problem in graph–theoretic language as follows:
We are given two undirected graphs G1 = (V1 , E1 ) and G2 = (V2 , E2 ), with
ni = |Vi | for i = 1, 2. A total order is defined on V1 = {a1 < . . . < an1 } and
V2 = {b1 < . . . < bn2 }. It is customary to draw such a graph with the vertices
arranged increasingly on a line. We denote an edge by an ordered pair (i, j),
with a tail in the left endpoint and a head in the right endpoint.
A non–crossing alignment of V1 in V2 is defined by any two subsets of the
same size k, {i1 , . . . , ik } ⊆ V1 and {u1 , . . . , uk } ⊆ V2 , where i1 < i2 . . . < ik
and similarly for the uh ’s. In this alignment, uh is aligned with ih for 1 ≤
h ≤ k. Two edges (contacts) (i, j) ∈ E1 and (u, v) ∈ E2 are shared by the
alignment if there are l, t ≤ k s.t. i = il , j = it , u = ul and v = ut (see
Figure 5.1). Each pair of shared edges contributes a sharing to the objective
function. The problem consists in finding the non–crossing alignment which
maximizes the number of sharings.
An alignment corresponds to a set of lines connecting nodes of V1 and V2
in the usual drawing with V1 drawn on the top and V2 on the bottom. We
denote such a line for i ∈ V1 and j ∈ V2 by [i, j]. We say that two lines cross
if their intersection is a point. The sharings (e1 , f1 ) and (e2 , f2 ) can be both
achieved by an alignment if and only if they are compatible, i.e. no two of the
lines betweens the tails of e1 and f1 , the tails of e2 and f2 , the heads of e1
and f1 and the heads of e2 and f2 cross. A set of sharings is feasible if the
sharings are all mutually compatible, otherwise it is infeasible. Similarly we
define a feasible and infeasible set of lines. If we draw the lines connecting the
endpoints of an infeasible set of sharings, we have an infeasible set of lines.
We denote by yef a binary variable for e ∈ E1 and f ∈ E2 , which is 1 iff
the edges e and f are a sharing in a feasible solution. The objective function
of CMO is
max yef , (1)
e∈E1 ,f ∈E2
There are algorithms for finding a max weighted clique in a weakly trian-
gulated graph (as Gx can be shown to be) of time O(n5 ), due to Hayward,
Hoang, Maffray [23] and Raghunathan [46]. However, for this specific graph
we can do better and find max weighted cliques in time O(n2 ), thus making
a huge difference in the practical solution of the problem.
We start by chacterizing all cliques in Gx , i.e. sets of alignment lines which
are all mutually crossing. We define the following notion of a triangle: this is
a set of lines with one common endpoint and a second endpoint in a range of
consecutive nodes, like T (i, j|u) := {[i, u], [i + 1, u], . . . , [j − 1, u], [j, u]} where
i ≤ j ∈ V1 and u ∈ V2 , and T (i|j, u) := {[i, j], [i, j + 1], . . . , [i, u − 1], [i, u]}
where
i ∈ V1 and j ≤ u ∈ V2 . For S a set of lines, by x(S) we denote the value
[i,j]∈S xij .
Protein Structure Comparison: Algorithms and Applications 25
a1 an a1 an
1 1
b1 bn c1 cn
2 2
Fig. 3. Left: A zigzag path P (bold) and the set T (P ). Right: Same path after
flipping V 2.
Call a1 , an1 , b1 , and bn2 the set of terminal nodes. Consider a path P which
passes through all the terminal nodes, and alternates nodes of V1 and V2 in a
zig–zag fashion: That is, we can orient the path so that a1 is the first of the
nodes of V1 visited by the path, and if ak has been visited by the path, then
all of the nodes in V1 visited after ak are to its right. Similarly, bn2 is the first
of the nodes of V2 visited by the path, and if bh has been visited by the path,
then all of the nodes in V2 visited after bh are to its left. Note that any such
path must start and end at a terminal node (see figure 5.1, Left), and must
always include the lines [a1 , bn2 ] and [an1 , b1 ]. For each node of degree two in
P , a triangle is defined by considering the set of lines incident on the node
and contained within the two lines of the path. Let TA (P ) (TB (P )) be the set
of triangles defined by P with tip in the nodes of V1 (V2 ) having degree two
in P . We define T (P ) := TA (P ) ∪ TB (P ). The following theorem characterizes
the cliques of Gx .
The inequalities x(T (P )) ≤ 1 for all zigzag paths P are therefore the
strongest clique cuts for this particular independent set problem. We now
show that to find the most violated such inequality in time O(n2 ). To make
the following argument easier, we rename the nodes of V2 as {c1 , . . . , cn2 }, so
that the leftmost node c1 is bn2 and the rightmost, cn2 , is b1 (i.e., we flip the
nodes of V2 with respect to the usual drawing). A zigzag path P now looks as
a path which goes from left to right both in V1 and V2 . We call such a path a
leftright path (see figure 5.1, Right).
With respect to the new drawing of W , orient each line [a, c] in the two
possible ways and, given a real vector x∗ , define the length for each arc (a, c) ∈
V1 ×V2 and (c, a) ∈ V2 ×V1 as follows: l(a, c) = x∗ (T (a|1, c))−x∗ (T (1, a|c)) and
l(c, a) = x∗ (T (1, a|c)) − x∗ (T (a|1, c)). The lengths of four special arcs are de-
fined separately, as l(a1 , c1 ) = 0, l(c1 , a1 ) = 0, l(an1 , cn2 ) = x∗ (T (an1 |1, cn2 ))
26 Giuseppe Lancia and Sorin Istrail
and l(cn2 , an1 ) = x∗ (T (1, an1 |cn2 )). Now, consider a leftright path P starting
with either the arc (a1 , c1 ) or (c1 , a1 ) and ending with either the arc (an1 , cn2 )
or (cn2 , an1 ). Call l(P ) the standard length of this path, i.e. the sum of arcs
lengths. We then have the following lemma.
Fig. 4. The mutation operator. The top alignment shows the state before any
mutations, the middle alignment shows the state after the first mutation but before
the second mutation, the bottom alignment shows the state after both mutations.
Dotted lines are edges that have been removed by a mutation, dashed lines are edges
that have been added after a mutation is performed.
edge to be removed and a new edge to be randomly inserted on the right end
– exactly one of the dashed lines is inserted.
The recombination operator is used to create a new candidate solutions
(the child) from existing solutions. This is done by randomly selecting two
existing solutions (the parents) by a standard GA method, i.e., randomly, but
biased towards good solutions. A set of contiguous edges is randomly selected
from one of the parents and is copied directly to the child. Next, all edges
from the second parent that do not cross edges in the child are copied to the
child. Finally, new edges are added in any available positions, exactly as was
done with mutation.
Our Steepest Ascent Local Search heuristic algorithms follow the standard
approach: Let s be a feasible current solution. The neighborhood of s is the set
of solutions which can be obtained by applying a move to s. If all solutions
in the neighborhood are no better than the current solution s, s is a local
optimum and the search is terminated. Otherwise, the move that results in the
best solution value is applied to s, and the search continues. Since converging
to a local optimum is very fast, the search can be repeated many times, each
time starting from a random feasible solution.
A feasible contact map alignment solution is identified by a pair of lists
of the same size, (A, B), where A = (a1 < . . . < ak ) are nodes from G1 and
B = (b1 < . . . < bk ) are nodes from G2 . The alignment maps residue ai in the
first graph to residue bi in the second graph (see Figure 5.1(a)).
Our two local search algorithms differ only in the definition of the moves
that generate a neighborhood. The first algorithm uses moves that add a
single specific line to the solution, removing any lines that cross the new line.
28 Giuseppe Lancia and Sorin Istrail
a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
(a) Initial State (b) M(7,4)
Fig. 5. Moves for the two algorithms. (a) Starting solution s. (b) A move from the
first algorithm applied to s. (c) An increasing move applied to s. (d) A decreasing
move applied to s.
Computational Experiments
Our program has been implemented and run on some proteins from the PDB.
This is the first time that exact solutions have been found for real instances of
this problem. We have run our procedure on a set of 269 proteins, with sizes
ranging from 64 to 72 residues and 80 to 140 contact each. The set was chosen
to contain both a large number of similar proteins, as well as a large number
of dissimilar proteins. An all-against-all computation would have resulted in
36046 alignments; we selected a subset of 597 alignments, so that there would
be an equal number of similar pairs and dissimilar pairs.
In order to perform such a massive computation on a standard, single–
processor Pentium PC, we have limited the time devoted to each individual
instance to one hour (details can be found in [29]). Therefore, some prob-
lems have not been solved to optimality. However, even within the time limit
Protein Structure Comparison: Algorithms and Applications 29
set, we have been able to solve 55 problems optimally and for 421 problems
(70 percent) the gap between the best solution found and the current upper
bound was less than or equal to 5, thereby providing a strong certificate of
near–optimality. The results also show the effectiveness of our lower bound-
ing heuristic procedures, and in particular, of our genetic algorithm. The GA
heuristic turned out to be superior to the others, finding 52 of the 55 optimal
solutions.
In a second test, we used our programs to cluster proteins according to
their Contact Map Overlap. I.e., given a set of proteins, we compute a nor-
malized score based on their best CMO alignment, and check if pairs with
high score are actually coming from the same family.
A test set was suggested to us by Jeffrey Skolnick. The set contains 33 proteins
classified by SCOP into four families: CheY-related, Plasto–cyanin/azurin-
like, Triosephosphate isomerase and Ferratin (see Figure 5.1). Since this pro-
teins are relatively large (beyond the capability of Branch–and–Cut exact so-
lution, the problem is NP–hard after all), we used the heuristics that worked
well for providing lower bounds in the Branch–and–Cut algorithm, i.e. GA
and Local Search.
We applied the heuristics to all 528 contact map pairs, and clustered the
proteinss based on the following similarity score:
cij
sij =
min{mi , mj }
where mi and mj are the number of contacts in proteins i and j, and cij
is the number of shared contacts in the solution found. Note that 0 ≤ sij ≤ 1.
A score of 1.0 would indicate that the smaller contact map is completely
contained in the larger contact map. We found that 409 alignments with
score between 0.0 and 0.313 were almost exclusively between contact maps
in different families; 1.3% of these were alignments within the same family.
The remaining 119 alignments, with score between 0.314 and 0.999, were all
30 Giuseppe Lancia and Sorin Istrail
between contact maps in the same family. Hence, we validated the CMO score
for clustering with 98.7% accuracy (1.3% false negatives).
5.3 Conclusions
In order to promote the use of contact map overlap for structure comparison,
more work should be devoted into designing effective algorithms for proteins of
larger size than our current limit. Armed with such an algorithm, one could
run an all against all alignment of the PDB structures, and cluster them
according to their contact map similarity.
Our Branch–and–Bound and heuristic algorithms can be extended to new
contact map–based similarity measures. For instance, one can consider the
introduction of weights we on the contacts e = {i1 , i2 }. These weights can be
based on the residues appearing at positions i1 and i2 , or on the distance be-
tween i1 and i2 in the folded protein. This would allow to model a situation in
which some contacts are “more important” to preserve than others. Similarly,
penalties uij can be used to weigh each residue–residue xij alignment. Finally,
the objective function could be made parametric in the number of residues
considered, so that the measure can be used, e.g., for local alignments. The
problem would then read: for a given k find the best contact map alignment
which maps exactly k residues of the first proteins in k of the second.
These and similar measures will be the object of our future research.
6 Acknowledgements
Support for Lancia was provided in part by the Italian Ministry of Univer-
sity and Research under the National Project “Bioinformatics and Genomics
Research”, and by the Research Program of the University of Padova.
References
1. P. Argos and M. Rossmann. Exploring structural homology of proteins. J. Mol.
Biol. 105 (1976) 75–95
2. F. Allen et al. Blue Gene: A vision for protein science using a petaflop super-
computer. IBM System Journal 40(2) (2001) 310-321
3. H.M.Berman, J.Westbrook, Z.Feng, G.Gilliland, T.N.Bhat, H.Weissig,
I.N.Shindyalov, P.E.Bourne, The Protein Data Bank. Nucl. Ac. Res. 28
(2000) 235–242
4. T. L. Blundell. Structure-based drug design. Nature 384 (1996) 23–26
5. C. Branden and J. Tooze, Introduction to Protein Structure. Garland, 1999
6. J. M. Bujnicki, A. Elofsson, D. Fischer and L. Rychlewski. LiveBench-1: Contin-
uous benchmarking of protein structure prediction servers. Prot. Sc. 10 (2001)
352–361
Protein Structure Comparison: Algorithms and Applications 31
Haim J. Wolfson
School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
wolfson@post.tau.ac.il
1 Introduction
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 35-56, 2003.
Springer-Verlag Berlin Heidelberg 2003
36 Haim J. Wolfson
informative for the deduction of protein function, than the sequence informa-
tion alone. Since protein structure is more conserved (during evolution) than
its sequence, we can detect proteins with less than 25% sequence similarity,
yet with roughly similar structure and function. It should be noted, though,
that sometimes functionally non related proteins might fold into similar stable
structures. Among the 15,000 structures in the Protein Data Bank ([4], circa
March 2002) there are only about 700 structurally different single chain pro-
tein folds. Since proteins function by association, the spatial arrangement of
certain residues on the protein molecular surface is crucial for their function.
The books by Lesk ([44]) and Branden & Tooze ([8]) give an excellent in-depth
introduction into protein structure, while lively descriptions and illustrations
can be found in the popular book of Goodsell ([26]).
Most of the protein structures known today have been detected by X-ray
crystallography and NMR based techniques. These methods are time con-
suming and cannot be applied to all proteins due to physical constraints.
The most challenging task is, naturally, to computationally detect the fold
of novel proteins from basic principles. The crucial role of protein structure
analysis in the elucidation of protein function has triggered the initiation of
the Structural Genomics ([9]) project as a natural follow up of the Human
Genome project. The aim of this project is to detect (mainly, by X-ray crys-
tallography) the structures of representatives from each cluster of sequentially
homologous proteins and, consequently, to model the structures of the other
proteins in the respective clusters based on sequence similarity to the detected
structural template. This project is expected to supply numerous novel protein
structures for subsequent structural and functional analysis, which requires a
computational infrastructure similar to the one that was developed for protein
sequence analysis. This algorithmic infrastructure, which gained the name of
Structural Bioinformatics, receives as input the 3D shapes of proteins and,
thus, from the mathematical viewpoint belongs to the discipline of Geomet-
ric Computing, which includes Computational Geometry, Computer Vision,
Robotics, Computer Graphics, Medical Imaging and more. In particular, the
structural analog of the classical protein sequence alignment tool, which is
protein structure alignment, is conceptually similar to the classic problems of
Object Recognition in Computer Vision ([51]). Both are problems of spatial
pattern detection.
In our discussion we shall distinguish between pattern recognition and pat-
tern detection or pattern discovery. While the term pattern recognition will be
used for the detection of an a-priori known pattern/template in a database
of shapes, the term pattern detection will apply for the detection of a-priori
unknown (structural) patterns, which appear in several shapes (or recurrently
appear in a single shape) of the database.
In Structural Bioinformatics spatial pattern discovery dominates the task
of protein structural comparison and the task of the prediction of protein-
protein, protein-DNA or protein-drug interaction (docking). In the general
case the individual molecular structures also posses internal degrees of free-
Spatial Pattern Detection in Structural Bionformatics 37
covalent bonds. Hinge and shear movements of protein domains have been ob-
served as a combination of such rotations ([25]). Thus, a more complicated
task is to discover structural similarity among molecules modulo internal de-
grees of freedom as well.
There are several applications where structural alignment plays a key role.
Among these applications are the classification of the proteins universe by
shape similarity ([47, 32, 14]), and the detection of three dimensional struc-
tural patterns, so called, structural motifs, which imply similar protein
function. Structural alignment has also key applications to Computer-Assisted
Drug Design, where one is trying to detect or design drugs which snugly fit
some functional active site of a protein (receptor) molecular surface. Detection
of structural similarity between the active site of a protein and a molecular
surface patch of a novel protein implies that the new protein is a candidate
receptor for the same drug. Likewise, if one detects other drugs having a cer-
tain structural similarity of functional groups to the original one, these drugs
become candidate inhibitors for the same receptor.
Let us formulate the rigid structural alignment task as a particular instance
of pattern detection, which is often called partial matching:
Given two rigid sets of features in the 3-dimensional space and the group
of rigid 3D transformations, the partial matching task is to recover a transfor-
mation (or several transformations), which superimpose a large enough subset
of the first structure onto the second one.
Note that in partial matching not all the features of either structure will
have a matched counterpart. Moreover, one does not know in advance which
of the features will have a counterpart. One should also note that in biological
applications two features (e.g. atomic centers) are considered superimposed if
the distance between them (in some metric) is less than a predefined thresh-
old. Another practical observation is that we are not necessarily looking for
the maximally matching subsets. Due to inaccuracies and fuzziness of corre-
spondence the ”correct solution” might not achieve the maximal score of a
given algorithm. Moreover, even non-maximal matching substructures might
imply similarity in biological function. Thus, one is looking for large match-
ing subsets (and not only for the maximal one) each of which can represent a
correct solution.
In order to tackle the partial matching task, one has to address two prob-
lems - feature correspondence and corresponding feature superposition. The
correspondence problem is the difficult one. Once a correspondence hypoth-
esis is established, there are well known efficient methods to find the rigid
transformation which superimposes them with minimal least squares distance
among the matching feature points (see, e.g. [36, 33, 59]).
We have tackled the partial matching problems in Structural Biology by
methods which were originally developed for the object recognition task in
Computer Vision and Robotics [40, 71]. A major goal of computer vision
systems is to efficiently perform model-based object recognition in cluttered
scenes [5, 12]. Specifically, given a database of familiar objects (models) and a
40 Haim J. Wolfson
newly observed scene, the task is to detect all the appearances of the models
in the scene as well as the transformation between their pose in the model
database and the scene. The objects appearing in the scene may be partially
occluded and additional objects not appearing in the database may clutter
the scene. Objects are usually represented by feature sets, such as interest
points, line segments, surface patches etc.
One can notice that there is a striking conceptual similarity between the
two tasks. Actually, the model-based object recognition task, as formulated
above, is analogous to the structural alignment of a newly discovered molec-
ular structure against a database of such structures. The analog of the model
database is the molecular structural database, and the analog of the newly ob-
served scene is the new molecular structure. Partial occlusion and additional
scene clutter correspond to the fact that we are looking for previously unde-
fined structural patterns. This analogy is even more direct, when in Computer
Vision one is dealing with data acquired by a 3-D sensor, such as a range sen-
sor. In essence, both in Computer Vision and in Structural Biology we are
faced with the task of spatial pattern detection or partial matching . Guided
by this conceptual similarity we have introduced Computer Vision based ob-
ject recognition/matching techniques into Structural Bioinformatics [51].
Due to the need to compare databases of protein and DNA sequences,
algorithms for character string matching have been extensively applied in
Molecular Biology (see e.g. [58, 28]). Most of the methods are based on the
dynamic programming approach, although hashing techniques have been ap-
plied as well to speed up processing [46]. Consequently, dynamic programming
is deeply rooted in computational molecular biology and there have been sig-
nificant attempts to tackle the three-dimensional structural matching task
using this technique [63].
In order to apply the dynamic programming method, one has to exploit
the order of the amino acid residues on the polypeptide chain. Thus, as was
mentioned in section 2, from a purely geometric viewpoint the problem tack-
led is not that of 3D point set set partial matching, but of 3D curve partial
matching. This is a significantly easier problem, since a curve in any dimen-
sional space is just a one dimensional entity. The location of a point on a
curve is fully defined by only one parameter, which is the arc-length ([17]).
We shall see below that the sequence order constraint allows us to tackle the
flexible alignment task without prior knowledge of hinge positions in the pro-
teins ([61]). However, sequence order dependent matching algorithms cannot
detect geometric patterns which do not depend on such an order, especially,
molecular surface motifs.
Let us very shortly outline the sequence order independent protein struc-
tural alignment technique ([51, 2]). To facilitate the exposition we explain our
method for structural alignment of Cα atom sets. We consider rigid objects,
which are constellations of points. Each such point is located at the center of
a Cα atom and may have a label (e.g. a residue name, or a residue type).
Spatial Pattern Detection in Structural Bionformatics 41
For the sake of clarity let us discuss the following problem: one molecule is
rigid and the other is composed of two rigid parts (domains) which are joined
by a hinge. A hinge in this case is a rotational joint with full 3-D rotational
freedom. In practice, rotations are only around specific bonds, but we allow
this more general 3-D rotation model since it implicitly allows approximation
of several consecutive (or, nearby) bond rotations, as well. The fact that one
molecule is rigid can be assumed without loss of generality, while the intro-
duction of several hinges in the other molecule is a straightforward extension
of the algorithm that we present. Our flexible partial matching method is
based on the ideas that we have introduced for articulated (flexible) object
recognition in Computer Vision [71, 3]. It exploits an associative memory in-
dexing approach reminiscent of the one used in Geometric Hashing, except
that this time we have the additional task of handling the internal flexibility.
Of course, one can apply the rigid matching method to the individual rigid
Spatial Pattern Detection in Structural Bionformatics 43
parts, and then check whether there is a pair of high scoring hypotheses, one
for each part, which are spatially consistent with the single flexible object
input. However, our aim is to handle the information obtained from all the
molecule rigid parts in an integrated manner, so that the whole is more than
just the sum of its parts. We achieve this goal by accumulating evidence on
the position of the hinge location of the model molecule relative to the target
molecule. Since the hinge belongs to both rigid parts and its location is not
influenced by the internal rotation, both parts contribute evidence which is
integrated in our algorithm.
Below we describe the algorithm in more detail. Assume that we have a
database of flexible model molecules, each consisting of two rigid parts joined
by a hinge at a known position. Given a target molecule, we want to find
a large partial match of the target with some model molecule allowing both
external rotation and translation and internal rotation at the hinge. Thus, we
are trying to detect a flexible spatial pattern. The molecules are represented
by their Cα backbone atoms.
We first preprocess the database of the flexible molecules, and then per-
form the pattern discovery (recognition) versus the target. For each database
(model) molecule the following preprocessing is done:
(a) The molecule is represented by its Cα backbone atoms, as interest fea-
tures.
(b) The (known) hinge location is chosen as the origin of a 3-D reference
which is denoted as the ‘hinge frame’. The orientation of this frame is set
arbitrarily.
(c) Each ordered non-collinear triplet of atoms belonging to a single rigid part
and satisfying both proximity and non-degeneracy constraints is defined
as a ‘feature group’. An unambiguous 3D reference frame is defined for the
triangle. For example, given the non-collinear triplet of points p0 , p1 , p2 ,
one can set the origin of the reference frame at p0 . To set the reference
set axes, let us define two vectors v1 = p1 − p0 ; v2 = p2 − p0 . Define v1
as the direction of the x-axis, the cross product v1 ⊗ v2 as the direction
of the z-axis, and the unit vector of the y-axis as the cross product of the
unit vectors of the x and z axes.
(d) The (rotation and translation invariant) triple of triangle side lengths
serves as an address to a (hash) table, where one stores the information
on the model molecules, rigid part, feature group and, especially, on the
transformation between the feature group reference frame and the hinge
frame.
In the pattern discovery/recognition stage we repeat a similar proce-
dure for the rigid target molecule exploiting the information accumulated in
the associative memory:
44 Haim J. Wolfson
best alignments ranked according to the number of detected rigid parts, the
size of the overall alignment, the size of the individual rigid parts, the RMSD
of the alignment and some additional criteria. Thus, the algorithm overcomes
the requirement of prior partition of one of the molecules to rigid parts. Yet,
in order to accomplish this task efficiently, it exploits locally the amino acid
sequence order and is not fully sequence order independent as the algorithm
described above.
The goal of the FlexProt algorithm is to divide the two protein molecules
into a minimal number of separate consecutive fragments of maximal size, such
that the fragments which are matched will be almost congruent (-congruent).
Two rigid fragments are -congruent, if they have the same number of Cα
atoms and there exists a 3-D rotation and translation which superimposes the
corresponding atoms with an RMSD less than some pre-defined threshold .
The arrangement of the matching fragments should be consistent with their
order on the protein chain. Flexible regions are located between the rigid
matching fragments. A trivial way to achieve a flexible alignment of maximal
size is to allow flexibility between each pair of neighboring amino acids (Cα
atoms), thus aligning completely the two molecules, if they are of the same
length. Thus, to avoid such absurd ”optimal” solutions, our goal is to minimize
the number of flexible regions or rigid fragments. Clearly, the two goals of
maximal matching size and minimal number of flexible regions are conflicting
and some compromise heuristic should be applied.
The input to the algorithm are two protein molecules M1 and M2 , each
being represented by the sequence of its Cα atom coordinates. Assume that
molecule M1 has undergone hinge bending movements at several locations
along its backbone. Further assume that between the flexible joint regions
there are fragments without a significant structural change. The resulting
hinge-bent molecule is denoted M2 . Under our assumptions there exists a set
of rigid fragments of M2 that are -congruent to the corresponding set of frag-
ments of M1 . The model presented applies not only to different conformations
of a given molecule, but to the general case of flexible motif detection between
two molecules with different sequences.
The algorithm has the following major steps:
1. Detect all the large enough -congruent rigid fragment pairs, one from
each molecule. This is done by aligning each atom pair (one atom from
each molecule) and extending the alignment to the left and to the right,
along the backbone chain, until the RMSD of the superposition of the two
fragments gets larger than . The procedure is efficient, since (based on
the calculations of the previous steps) one can compute the RMSD of each
extension in O(1) time ([36, 59]). Thus, the procedure is linear in the size
of the matched Cα atom pairs. As a by-product we also get the rotation
and translation, which best aligns the pair of fragments. At the end of this
step, we have recovered all the possible -congruent rigid fragment pairs
together with the transformations, which superimpose them.
46 Haim J. Wolfson
4 Protein-Protein Docking
Protein-protein interactions play a major role in all biological processes. In
addition, protein-drug docking is a major tool in computer assisted drug de-
sign. The binding affinity of the molecules is affected mainly by electrostatic,
hydrophobic and van der Waals interactions. Since these non-covalent inter-
actions are weak and act at short distances, in order to be effective the in-
teracting molecules have to be very close to each other. As a result shape
complementarity of the interacting molecules becomes a necessary condition
Spatial Pattern Detection in Structural Bionformatics 47
for docking. Thus, the majority of the docking methods are searching, first, for
complementary spatial patterns on the molecular surfaces of the interacting
molecules ([43, 31]). The spatial pattern detection problem in docking is very
challenging, since molecules usually undergo conformational changes upon as-
sociation. This is aggravated by the fact that the residues on the molecular
surface, which take part in docking, are more flexible than the residues in the
protein interior.
When evaluating docking algorithms, one should distinguish between the,
so called, ‘bound’ and ‘unbound’ cases. In the ’bound’ case we are given the
co-crystallized complex of two molecules. The complex is artificially separated
by randomly rotating and translating one of the molecules. Now, the goal be-
comes to reconstruct the original complex. No conformational changes are
involved in the ‘bound’ case. Success in bound docking examples is a natural
pre-requisite for any docking algorithm, yet such a success does not ensure ad-
equate performance in ‘real life’ cases, which are ‘unbound’. In the ’unbound’
situation we are given two molecules in their native conformations. The goal
of the algorithm is to predict the ‘correct’ structure of the complex. In the
few tens of cases, where we do have an independently resolved structure of
the complex, one can verify the quality of the algorithm’s prediction. Most of
the docking algorithms encounter difficulties with the ‘unbound’ case ([31]).
As was mentioned above, significant geometric surface complementarity
is usually a prerequisite for successful docking. We shall discuss mainly this
geometric part of the docking methodology, where the task is to detect large
enough patches on the molecular surfaces of the docked molecules, which are
of complementary shape. Assuming that the molecules do not undergo large
conformational changes, we are faced again with the 3D rigid inexact partial
matching task.
Some of the algorithms approach this task by direct enumeration of the
six dimensional rigid transformation space, to detect a translation and ro-
tation, which best superimposes one molecule onto the surface of the other.
Most of these methods ([70, 66, 67, 68, 24, 10, 11]) use brute force search of
the 3 rotational parameters and the Fast Fourier Transform technique ([37])
for fast enumeration of the translations. Such algorithms are computationally
expensive. Other algorithms define discrete interest features on the molecular
surfaces and apply a partial shape matching algorithm on these features in a
way similar to protein structural alignment algorithms. The pioneering dock-
ing algorithm in this direction was suggested by Kuntz et al. [39], where the
problem was reduced to the detection of large enough cliques in the, so called,
docking graph. We have applied the Geometric Hashing technique ([21]) for
rigid docking as well as a variant of Geometric Hashing and Pose Cluster-
ing ([48, 49]), which proved to be relatively robust even for unbound docking
([50]). The flexible pattern detection method described in section 3.2 was ap-
plied to the docking of flexible molecules ([56, 57]). In protein-drug docking,
one is often looking for complementary patterns of hydrogen bond donors
versus hydrogen bond acceptors ( [52, 53]).
48 Haim J. Wolfson
Most of the geometric docking algorithms can be roughly divided into the
following major steps :
1. Molecular Surface Representation - a popular representation is that
of the solvent accessible surface as calculated by Connolly ([14, 13]).
Sparser discrete interest features, such a points and associated normals
([15, 45, 49]) are extracted in this stage.
2. Focusing on candidate binding (active) sites - in order to signifi-
cantly reduce the number of false positives and reduce computation time
it is desirable to focus a-priori on the approximate areas of the molec-
ular surface, where binding is likely to appear. Such candidate binding
sites are usually detected by biological and shape criteria. An excellent
example of ’biologically’ defined binding regions are the complementarity
defining regions(CDRs) in antibodies ([44]). An example of a shape cri-
terion is the binding of drugs and small ligands in the large cavities of a
receptor. There one might restrict the receptor surface to be explored to
such cavities ([39]).
3. Complementary spatial pattern detection - this is the heart of the
geometric docking algorithm and usually applies a partial shape matching
algorithm on a set of discrete features, such as surface points with asso-
ciated normals, which should align in roughly opposite directions. The
output of this step is a set of candidate rigid transformations, which dock
one molecule to the other.
4. Geometric Complementarity Scoring and Ranking - since mole–
cules cannot penetrate into each other, candidate transformations from
the previous step are discarded if they cause a significant penetration.
Minor penetrations are allowed to reflect conformational changes of the
molecular surface upon docking. In this step one can also calculate the size
of the detected complementary surface (size of binding site) and score the
hypotheses according to this size. (The fact that candidate solutions can
be discarded due to penetration supplies a powerful false positives filter,
which does not exist in the structural alignment algorithms described in
section 3.)
5. Biological Scoring and Re-ranking- in this step one would like to
accept the high enough scoring hypotheses of the previous step and re-rank
them according to a free-energy function ([10]), which could discriminate
between the biologically valid hypotheses and decoy complexes, which
exhibit only geometric complementarity. To date this step seems to be the
real bottleneck in the performance of unbound rigid docking algorithms
([31]).
We have recently developed an efficient rigid docking algorithm, which per-
forms relatively well in the unbound docking task ([19]). The geometric part
of the algorithm is based on methodology developed for Computer Vision
applications and it extends the ideas presented in ([22, 50]). The algorithm
partitions the molecular surface of the molecules into convex, concave and
Spatial Pattern Detection in Structural Bionformatics 49
flat local patches of almost equal size. Patches with higher probability of be-
longing to the binding site are considered, and complementary configurations
of pairs of interest points with associated normals are detected. Alignment
of such pairs induce rigid transformations, which are subsequently tested for
shape penetration and scored by geometric complementarity. The use of sur-
face patches reduces the number of potential docking solutions, while still (in
most tested cases) retaining the correct transformation. The algorithm can
treat receptors and ligands of variable sizes. It succeeds in docking of large
proteins (antibody with antigen) and small drug molecules. The running times
of the algorithm are on the order of seconds for small drug molecules and
minutes for large proteins. The algorithm was tested on most of the known
benchmark complexes (see e.g. [11]) with satisfactory results. Figure 4 illus-
trates a challenging antibody-antigen docking example. We briefly sketch the
outline of this algorithm, which is presented in [19].
Fig. 1. Unbound docking of the Antibody Fab 5G9 with Tissue factor (PDB codes
1FGN,1BOY). The antibody is depicted by ribbons and the CDRs by atomic balls.
The antigen is depicted as a bright backbone at the top of the figure. The dark
(antigen) backbone represents the best solution obtained by our program (which
was ranked 8’th by its score) superimposed on the complex with RMSD 2.27Å.
the MS program [14, 13], which outputs, concave, convex and saddle patches
that are sampled with high density. In addition, the sparse surface representa-
tion of [45], which retains for each patch one point with its associated normal
is calculated.
The input to this step is the sparse set of critical points. The goal is to di-
vide the surface to patches of almost equal area. Each patch is of homogenous
shape, namely, we segment the surface into convexities, concavities and flats.
In order to decide the shape type in the vicinity of a single critical surface
point we apply the shape function used in [15, 48]. A sphere is centered at each
critical point and the shape function measures the ratio of the sphere volume,
which is occupied by the molecule interior. The radius of the shape function
sphere is selected according to the molecule size. We use 6Å for proteins and
3Å for small ligands. According to the histogram of the shape function values
for the molecule, two cut-values are calculated, which split the critical points
into three equal size sets of knobs (convex), flats and holes (concave). Using
graph-theoretic techniques followed by a split-and-merge algorithm the molec-
ular surface is divided into connected, almost equal size patches of concave,
convex and flat points.
Pose Clustering.
For each candidate transformation from the previous step, we superimpose the
molecules according to the transformation. Now, we have to detect and discard
those transformations, which cause unacceptable steric clashes (penetrations),
and rank the rest of the solutions according to their shape complementarity.
The detection of steric clashes and the scoring of shape complementarity re-
quires to know the distance of the molecular surface points of one molecule
from the molecular surface of the other. In general, points of one molecule,
which penetrate deeply into the interior of the other discard the transforma-
tion, moderately penetrating points are penalized, and points, which are close
to the molecular surface of the other molecule receive a positive score (binding
site area). In order to accomplish this task efficiently, we compute in advance
52 Haim J. Wolfson
the distance transform of the first (stationary) molecule and calculate the
score of each transformation using this distance transform. To further speed
up the penetration detection and geometric scoring calculation, we construct
a multi-resolution data structure for the transformed molecule. Only trans-
formations that have not been rejected and received a large enough score at
the lower resolutions, are re-scored at a higher resolution.
The transformation from the matching step was computed based on aligning
pairs points/ normals. Since the interface includes a much higher number of
points, the transformation can be refined to improve geometric complemen-
tarity. Similar to the structural alignment algorithms described in section 3,
for each candidate transformation a new extended match-list is compiled and
the rotation and translation giving minimal RMSD for that match-list is cal-
culated. This can be done in several iterations.
5 Summary
We have presented several applications of spatial pattern discovery algorithms
to major tasks in Structural Bioinformatics, such as protein structural align-
ment and protein-protein docking. The exposition concentrated on algorithms
developed by the interdisciplinary Structural Bioinformatics Group of Tel
Aviv University and was not intended to review the enormous amount of
work on this topic, which is done elsewhere.
Among the challenging tasks ahead of us is the development of efficient
multiple structural alignment algorithms ([41]) and flexible multiple struc-
tural alignment algorithms. In docking, one should find better ways to handle
surface side chain flexibility, and, especially, detect new ways for biological
re-ranking of the geometrically ranked solutions. Our experience shows, that
even in unbound rigid docking, we almost always detect a solution, which is
very close to the ‘correct’ one, among the few hundred top ranked geome–tric
solutions. This implies that one needs an efficient and robust biologically based
score, which could successfully re-rank these top few hundred geome–trically
derived hypotheses.
Successful docking algorithms might have direct implications on auto-
mated protein folding as well. If one could derive from the protein sequence
partial structures of building blocks ([64]), then folding could be reduced to
the docking of these building blocks subject to some proximity constraints.
Work in this direction is underway.
Spatial Pattern Detection in Structural Bionformatics 53
Acknowledgments
I thank my long time colleague Ruth Nussinov and our joint graduate stu-
dents, in particular, Dina Duhovny and Maxim Shatsky for valuable discus-
sions and input. This research has been supported in part by the Israel Science
Foundation administered by the Israel Academy of Sciences - Center of Ex-
cellence in ”Geometric Computing”, and by the Tel Aviv University Research
Foundation.
References
1. V. Alesker, R. Nussinov, and H. J. Wolfson: Identification of non-topological
motifs in protein structures. Protein Engineering. 9(12) (1996) 1103–1119
2. O. Bachar, D. Fischer, R. Nussinov, and H. J. Wolfson: A computer vision
based technique for 3-d sequence independent structural comparison. Protein
Engineering. 6(3) (1993) 279–288
3. A. Beinglass and H. J. Wolfson: Articulated object recognition, or, how to gen-
eralize the generalized Hough transform. In Proc. of the 1991 IEEE Computer
Society Conf. on Computer Vision and Pattern Recognition (1991) 461–466
4. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N.
Shindyalov, and P. E. Bourne: The protein data bank. Nucleic Acids Research.
28 (2000) 235–242
5. P. J. Besl and R. C. Jain: Three-dimensional object recognition. ACM Comput-
ing Surveys. 17(1) (1985) 75–154
6. A. A. Bogan and K. S. Thorn: Anatomy of hot spots in protein interfaces. J.
Mol. Biol. 280 (1998) 1–9
7. F. L. Bookstein: Morphometric Tools for Landmark Data: Geometry and Biol-
ogy. Cambridge Univ. Press, Cambridge, UK 1991
8. C. Branden and G. Tooze: Introduction to Protein Structure. Garland Publish-
ing Inc., New York 1991
9. S. K. Burley: An Overview of Structural Genomics. Nature Struct. Bio. (Struc-
tural Genomics Suppl.) (2000) 932–934
10. J. C. Camacho, D. W. Gatchell, S. R. Kimura, and S. Vajda: Scoring docked
conformations generated by rigid body protein protein docking. PROTEINS:
Structure, Function and Genetics. 40 (2000) 525–537
11. R. Chen and Z Weng: Docking unbound proteins using shape complementarity,
desolvation, and electrostatics. ROTEINS: Structure, Function and Genetics.
47 (2002) 281–294
12. R. T. Chin and C. R. Dyer: Model-based recognition in robot vision. ACM
Computing Surveys. 18(1) (1986) 67–108
13. M.L. Connolly: Analytical molecular surface calculation. J. Appl. Cryst. 16
(1983) 548–558
14. M.L. Connolly: Solvent-accessible surfaces of proteins and nucleic acids. Science.
221 (1983) 709–713
15. M.L. Connolly: Shape complementarity at the hemoglobin α1 β1 subunit inter-
face. Biopolymers. 25 (1986) 1229–1247
54 Haim J. Wolfson
between Protein and their Ligands by Correlation Techniques. Proc. Natl. Acad.
Sci. USA. 89 (1992) 2195–2199
38. I. Koch, T. Lengauer, and E. Wanke: An algorithm for finding maximal common
subtopologies in a set of protein structures. J. Comp. Biol. 3 (1996) 289–306
39. I.D. Kuntz, J.M. Blaney, S.J. Oatley, R. Langridge, and T.E. Ferrin: A geometric
approach to macromolecule-ligand interactions. J. Mol. Biol. 161 (1982) 269–
288
40. Y. Lamdan and H. J. Wolfson: Geometric Hashing: A general and efficient
model-based recognition scheme. In Proceedings of the IEEE Int. Conf. on Com-
puter Vision. (1998) 238–249
41. N. Leibowitz, Z.Y. Fligelman, R. Nussinov, and H. J. Wolfson: Multiple struc-
tural alignment and core detection by geometric hashing. In Int. Conf. on Intell.
Systems for Mol. Bio. (1999) 169–177
42. C. Lemmen and T. Lengauer: Computational methods for the structural align-
ment of molecules. J. Computer-Aided Molecular Design. 14(3) (2000) 215–232
43. T. Lengauer and M. Rarey: Computational methods for biomolecular docking.
Current Opinion in Structural Biology. 6 (1996) 402–406
44. A. M. Lesk: Introduction to Protein Architecture : the structural biology of pro-
teins. Oxford Univ. Press, 2001
45. S. L. Lin, R. Nussinov, D. Fischer, and H. J. Wolfson: Molecular surface rep-
resentation by sparse critical points. PROTEINS: Structure, Function and Ge-
netics. 18 (1994) 94–101
46. D. J. Lipman and W. R. Pearson: Rapid and Sensitive protein Similarity
Searches. Science. 227 (1985) 1435–1441
47. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia: SCOP: a struc-
tural classification of proteins database for the investigation of sequences and
structures. J. Mol. Biol. 247 (1995) 536–540
48. R. Norel, S. L. Lin, H. J. Wolfson, and R. Nussinov: Shape complementarity at
protein-protein interfaces. Biopolymers. 34 (1994) 933–940
49. R. Norel, S. L. Lin, H. J. Wolfson, and R. Nussinov: Molecular surface com-
plementarity at protein-protein interfaces: The critical role played by surface
normals at well placed, sparse points in docking. J. Mol. Biol. 252 (1995) 263–
273
50. R. Norel, D. Petrey, H. J. Wolfson, and R. Nussinov: Examination of shape com-
plementarity in docking of unbound proteins. PROTEINS: Structure, Function
and Genetics. 35 (1999) 403–419
51. R. Nussinov and H.J. Wolfson: Efficient detection of three-dimensional motifs
in biological macromolecules by computer vision techniques. Proc. Natl. Acad.
Sci. USA. 88 (1991) 10495–10499
52. M. Rarey, B. Kramer, and T. Lengauer: Time-efficient docking of flexible ligands
into active sites of proteins. In 3’rd Int. Conf. on Intelligent Systems for Mol.
Bio. (1995) 300–308
53. M. Rarey, B. Kramer, T. Lengauer, and G. Klebe: A fast flexible docking method
using incremental construction algorithm. J. Mol. Biol. 261 (1996) 470–489
54. M. Rarey, S. Wefing, and T. Lengauer: Placement of medium-sized molecular
fragments into active sites of proteins. J. Computer-Aided Molecular Design. 10
(1996) 41–54
55. B. Sandak, R. Nussinov, and H.J. Wolfson: An automated robotics-based tech-
nique for biomolecular docking and matching allowing hinge-bending motion.
Computer Applications in the Biosciences (CABIOS). 11 (1995) 87–99
56 Haim J. Wolfson
1 Introduction
Protein structural comparison is an important operation in molecular biology
and bionformatics. It plays a central role in protein analysis and design. As
proteins fold in three dimensional space, assuming a variety of shapes, a careful
characterization of their geometry is needed to study their function which
is known to be related to the shape. Moreover, the comparison of protein
structures is essential to infer evolutionary information.
The problem of comparing three-dimensional structures has been widely
studied in other disciplines such as computer vision and image processing,
robotics, astronomy and some core methods have migrated from these disci-
plines to bionformatics.
There are many instances of the protein comparison problem that have
been addressed; they include: 1) protein pairwise comparison, 2) protein clas-
sification, to organize all known structures in a biologically relevant groups,
3) searching for common folding patterns and three-dimensional motifs, 4)
studying of protein interaction to identify binding sites for drug design.
From the application point of view, it is important to mention how the
growth of the Protein Data Bank (PDB) asks for effective automatic pro-
cedures for classification and search of the database elements. Currently the
PDB contains more than 17,000 structures and this number is rapidly growing.
The protein comparison may involve different levels of representations of
the three dimensional protein structures, from the atomic level to the level of
secondary structures. Most methods presented in the literature deal with a
protein representation in terms of atomic coordinates
and therefore with a matching problem that uses as basic elements sets of
points (atoms).
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 57-82, 2003.
Springer-Verlag Berlin Heidelberg 2003
58 Carlo Ferrari and Concettina Guerra
2 Protein Description
Several programs have also been developed to yield the vectorial represen-
tation of a protein [17], [41]. Singular-value decomposition (SVD) is a stan-
dard routine [4], [19] to find the axes of α helices and the best fit segments
62 Carlo Ferrari and Concettina Guerra
for the β strands. In this routine typically only Cα atoms are used. Other
simpler methods derive the vector associated to a β strand either by directly
connecting the starting and ending residues of the β strand assignments or
by connecting two points that are computed as the average points of few of
the extreme residues on both sides of the strand [51]. This second approach
is less sensitive to curved or kinked structures. Figure 5 shows the vectorial
representation of protein kinase CK2, where each segment is displayed as a
cylinder of fixed radius.
Different metrics have been used in the literature to determine the struc-
ture similarity between geometric objects. The most common metric is the
RMSD (Root Mean Square Deviation) defined for point sets as follows:
D(A, B) = RM SD(A, B) = ( i=1,n d(ai , bi )2 )1/2
where d is the Euclidean distance between two points, and assuming that
the sets have the same cardinality n, ai corresponds to bi .
The RMSD distance is useful when comparing very similar structures,
as those produced during the christallographic analysis or NMR at different
stages of the process. Its disadvantage becomes apparent in the presence of
outliers when the proteins are not structurally close. The existence of even
few outliers may significantly alter the value of the distance and therefore the
determination of the optimal superposition of the two structures.
A second important definition between two point sets is based on the use
of contact maps [33]. The chapter by G. Lancia and Sorin Istrail in this book
deals extensively with contact maps and they are not further discussed here.
Another definition is the Hausdorff metric widely used in the area of com-
puter vision and image processing, in astronomy and extensively studied in
the field of computational geometry. The Hausdorff distance H(A, B) between
A and B is:
H(A, B) = max(h(A, B), h(B, A)))
where h(A, B) is the one-way Hausdorff distance from A to B given by:
h(A, B) = maxai ∈A minbj ∈B d(ai , bj )
In the following we discuss different approaches to solve the above three
problems with different metrics. Problem 1 and its solutions are presented in
section 5. Problem 2 with the Hausdorff distance as metric is considered in
section 6.Problems 3 is reviewed in sections 8.
r11 r12 r13
r21 r22 r23
r31 r32 r33
Geometric Methods for Protein Structure Comparison 65
with nine different parameters, the orthonormality conditions (for the ma-
trix columns) add three more constraints, while the sign of the cross product
adds other three constraints. These six constraint equations, reduce the degree
of freedom to three, that is, only three parameters are needed to completely
represent a pure rotation in R3 . The matrix coefficients rij can be expressed
in term of these three parameters.
A rotation matrix R ∈ SO(3) represents a rigid body transformation. In
fact it can be proved that it preserves distances and orientations, that is:
• Rq − Rp = q − p for all points p, q ∈ R3
• R(v × w) = Rv × Rw for all v, w ∈ R3
Moreover a rotation matrix can be seen as an operator that takes the co-
ordinates of a point (or vector) from a frame to another. Let pb the coordinate
of a point P with respect to the frame B, and Rab the rotation matrix, the
coordinates of P with respect to the frame A are given as:
pa = Rab pb
The pure rotation operator is a linear operator (with the additional con-
straint that it is orthonormal). A sequence of two (or more) rotations will
result in a single combined rotation and conversely a given rotation can be
decomposed using two or more rotations. Rotation matrices can be combined
to form new rotation matrices using matrix multiplication. If a frame C has
orientation Rbc relative to frame B and B has orientation Rab from frame A,
then the orientation of C with respect to A is given by:
Rac = Rab Rbc
−1 T
In particular, as we could expect, Rab Rba = I and Rba = Rab = Rab .
An important result about rotations is the Euler Theorem that establishes
that any rotation R ∈ SO(3) is equivalent to a rotation about a given axis
ω ∈ R3 ( ω = 1), by an angle θ ∈ [0, 2π). In fact, it is possible to represent
the motion of a single point p rotating about ω at a constant unit velocity,
with the following differential equation:
ṗ(t) = ω × p(t) = ω̂p(t)
Solving this equation gives the expression for a single rotation about ω by
θ, that is R(ω, θ) = eω̂θ . The matrix ω̂ is defined as follows:
0 ω3 −ω2
−ω3 0 ω1
ω2 −ω1 0
where ω T = [ω1 , ω2 , ω3 ] and it has the property that ω̂ T = −ω̂. Such a matrix
is a skew-symmetric matrix. If ω = 1, ω̂ is a unit skew-symmetric matrix.
It can be proved that the exponential eω̂θ can be rewritten in term of the
skew-symmetric matrix, resulting in the so-called Rodriguez’s formula:
66 Carlo Ferrari and Concettina Guerra
n matrix then there exist two orthogonal matrices U ∈ Rm×m and V ∈ Rn×n
such that U T AV = diag(σ1 , . . . , σp ), p = min{m, n} where σ1 ≥ σ2 ≥ . . . ≥
σp ≥ 0. The problem of maximizing trace(RT B T A) can be solved through
This general method proves itself useful for computing 3-D rigid transfor-
mation estimation between two sets of corresponding points. The algorithm
can be sketched as a two step algorithm. The first step computes Nthe op- T
timal rotation matrix R using the 3x3 correlation matrix H = i=1 ai bi
T
through its singular value decomposition (H = U diag(σ1 , σ2 , σ3 )V ) obtain-
ing R = V U T . The second step computes the optimal translation vector as
P = a − Rb.
A similar approach computes the eigenvalues of a proper derived matrix
(the orthonormal matrices ) instead. It was proposed by Horn, Hilden and
Negahdaripour [25]. With this method the correlation matrix H is firstly
computed. However, rather than computing its SVD, a polar decomposition
is used, such that H = RS, where S = (HH T )1/2 . The optimal rotation is
given by:
1 1 1
R = H T √ u1 uT1 + √ u2 uT2 + √ u3 uT3
λ1 λ2 λ3
where {λi } and {ui } are the eigenvalus and eigenvectors of the matrix HH T .
Representing rotations using unit and dual quaternions gives two more
techniques that have been proposed respectively by Horn [28] and by Walker,
Shao and Voltz [56]. The former method asks to rewrite the minimization
problem in the quaternion framework. A new 4x4 matrix can be constructed
from the correlation matrix H as:
H00 + H11 + H22 H12 − H21 H20 − H02 H01 − H10
− − −
K=
H 12 H 21 H 00 H 11 H 22 H 01 + H 10 H20 + H02
H20 − H02 H01 + H10 H11 − H00 − H22 H12 + H21
H01 − H10 H20 + H02 H12 + H21 H22 − H11 − H00
The optimal rotation is the eigenvector related to the largest positive eigen-
value of K.
The latter method is the most significantly different of the four. It was
designed ot minimize the equation;
L N
Σ 2 = i=1 αi n1 i − Rn2 i 2 + i=1 βi ai − Rbi − P 2
where {n1i } and {n2i } are two sets of corresponding unit normal vectors,
and {αi }, {β} are weighting factors reflecting data reliability. Dual quater-
nions for representing both rotation and translation are used and again the
minimization problem can be rewritten in this new framework, resulting in
new equations involving the parametrization of the dual quaternions. Again
optimal values for R and P can be computed.
These four algorithms can be compared with respect to their accuracy,
stability and efficiency [40]. Experimentations shows that no one algorithm
is superior in all case. In fact difference in accuracy (on nondegenerate 3-D
point sets) is almost insignificant. Stability is more discriminant instead. The
SVD and the unit quaternion method are very similar and usually the most
stable. In terms of efficiency, the orthonormal matrix looks quicker with small
data sets, while the dual quaternions method is superior with larger data
Geometric Methods for Protein Structure Comparison 71
The time complexity of the overall algorithm is O(mn3 log n). In fact,
using the Hausdorff metric, the nearest neighbor query in a set of segments
(to identify the segment of B “closest” to a segment of A), reduces to a nearest
neighbor query among points in R6 that can be performed in optimal O(log n)
time within a known error bound. It is shown that the error introduced with
the approximation is within a bounded factor from optimal. This bound is
the same as the bound obtained in [15] for the simpler case of point sets.
Experiments have been conducted on several proteins and the results were
consistent with previous studies. As an example presented in [22], figure 6
shows the superposition of two sets of segments associated to proteins 1rpa
and 1rpt, with very similar structures.
Fig. 6. The alignment of proteins 1rpa (red segments) and 1rpt (green segments)
7 Indexing Techniques
Indexing techniques, initially proposed in the field of computer vision by Wolf-
son et al., have found interesting applications in the area of bioinformatics.
Indexing or geometric hashing provides a way to efficiently search a large
database of proteins by storing redundant transformation invariant informa-
tion about the proteins in a hash table, from which this information can be
easily retrieved. The construction of the hash table, that constitutes the most
complex part of the entire process, is done off-line at a preprocessing stage.
Indexing techniques have been applied to compare proteins at different
levels of representations [2], [8], [13], [14], [32], [36], [55] (see also the chapter
by H. Wolfson of this volume).
One major distinction of the comparison approaches is whether they are
order dependent or order independent, in other words whether the use the
74 Carlo Ferrari and Concettina Guerra
order of the elements along the protein chain as a constraint in the correspon-
dence process. Indexing techniques do not take into consideration the order of
elements (either points or secondary structures) along the chain and therefore
fall into the category of order-idenpendent methods.
For matching 3D point sets, quadruples of points are used to define ref-
erence frames or bases in which the coordinates of all other points remain
invariant. Models are stored into the table by considering all possible combi-
nations of quadruples of points as bases and using the invariant coordinates
of the remaining points to index the table. At recognition time, if the correct
quadruple of points is chosen from the image points, the candidate matches
are efficiently retrieved from the corresponding entries of the hash table.
Here we concentrate on the application of hashing techniques at the sec-
ondary structure level involving transformation invariant properties of vectors
associated to the secondary structures. The programs 3dSEARCH [51] and
3d-Lookup [26], based on hashing, compute geometric properties of pairs of
secondary structures. They both construct the hash table by a procedure that
consists of the following steps for the insertion of a protein in the database:
Step 1. For each pair of vectors of the protein, compute a reference frame
or coordinate system identified by the two endpoints of one vector and by
the orientation of the other vector.
Step 2. For each remaining vector in the protein, compute its coordinates
in the reference frame defined in the previous step 1.
Step 3. The coordinates are quantized into fixed size interval and used to
access the entry of the table corresponding to those coordinates where the
following pair of information is stored: 1) name of the protein that hashed
into it; 2) identifiers of the two vectors used as reference frame.
Once the hash table is built, each secondary structure vector from the given
query structure is simultaneously compared to the entire library of target
structures by simply indexing into this table. Thus, to compare a query protein
to all target proteins the above step 1 and 2 are repeated for the query protein.
Step 3 is replaced by the following:
Step 3’ The coordinates are quantized into fixed size interval and used to
access the entry of a 3d table corresponding to those coordinates where a
vote is cast to every pair (protein name, two vector identifiers) present at
that entry.
At the end of the process the proteins in the table which obtained the most
votes are the candidates for matching.
A recently proposed approach [21] considers triplets of secondary struc-
tures rather than pairs to build the hash table. The three dihedral angles
associated to all triplets of secondary structures are used to index a hash ta-
ble. Let (si , sj , sk ) be a triplet of segments, where s corresponds either to an
α-helix or a β-strand. Let αsr be the dihedral angle formed by two segments
s and r. The dihedral angle between two segments is the angle formed by
Geometric Methods for Protein Structure Comparison 75
the two planes perpendicular to the straight lines containing the segments
themselves and therefore is defined in the range [0, 180].
The triplet of segments (si , sj , sk ) is then described by the three angles
(αij , αjk , αki ). The three angles, quantized into uniform intervals, provide
three indeces for the table; a fourth index triplet type is used to access the
table; it depends on the types of the secondary structures in the triplets,
whether all α helices, one α helix and two β strands and so forth. Thus a
4-dimensional table is constructed during the set up phase of the method. No
explicit information is present in the tables about the order of the segments
along the polypeptide chain.
Each entry of the table keeps a record of each triplet hashed into it. The
record contains the following information: 1) the name of the protein, 2) the
identifiers for the three vectors, 3) the pairwise distances between the three
vectors. Such distances are used to filter incorrect hypotheses of associations
in the matching process, because false results could be obtained based on
angular information only. The distance is measured as the distance between
the middle points of two segments.
The construction of the table is computation intensive; it requires O(n3 )
time, for n secondary structures. Once it is built it allows fast retrieval of
candidate matches between the query protein and the proteins stored in the
database. The space requirements for this approach may be high. However,
the table is only partially occupied since the three angles are related by the
triangular inequality.
The table can be queried to find similarities in the arrangement of the sec-
ondary structures of a query protein with the proteins stored in the database.
A protein P is matched against the database of proteins by the following
procedure:
Step 1 For each triplet (si , sj , sk ) of secondary structures of P , compute the
three angles (α, β, γ) and the three distances of the associated segments.
Step 2 Access the cell of the hash table indexed by (α, β, γ, triplet type)
and tally a vote for each entry in the cell with similar distance values.
Step 3 Formulate and rank hypotheses of matching by determining the
proteins with the highest number of votes.
The verification of the hypothesized matches may be performed by a pair-
wise comparison between the proteins, either at the level of secondary struc-
tures [23] or by extending the matching to residue level.
Once compiled, the table can be used for different types of comparisons,
for instance for all-to-all structure comparison.
Experiments have been conducted that consisted in building the hash table
for all proteins in the PDB (approx. 14.000) have shown that the approach is
both robust and efficient. The construction of such a table for approximately
350 representative proteins from the PDB has led to interesting observations
about the distribution of the angles of the secondary structures which deviate
76 Carlo Ferrari and Concettina Guerra
from the distribution of randomly chosen vectors more significantly than one
would have expected [47].
8 Graph-Theoretic Approaches
structure element, say a strand, into two structures or converted a strand into
a turn.
Once an initial superposition of the secondary structures has been obtained
by the dynamic programming algorithm, it is refined by iteratively minimizing
the RMSD between pairs of nearest atoms from the two proteins. An inter-
esting feature of the approach is that it does not simply rely on RMSD for
judging the quality of the alignment but it takes into consideration also the
number of “well” aligned atoms. Well aligned atoms define the “core” of the
proteins and are selected as follows. Pairs of atoms, one atom from each pro-
tein, are selected so that each atom of the pair is the nearest atom of the other
atom of the pair in the other protein. Furthermore, to be included in the core,
such pairs of atoms have to satisfy the co-linearity property i.e. if (i, j) and
(h, k) are two such pairs and i < h then it must be j < k. Thus this method
is order-dependent, according to one major classification of protein compar-
ison approaches. The last step of the algorithm is to try to improve on the
superposition of the core atoms even at the cost of degrading the alignment
of the rest of the atoms.
The algorithm is efficient in terms of computational complexity and spends
most of the execution time on the secondary and atomic alignment and a small
fraction on the alignment of the core structures.
10 Conclusions
The problem of protein comparison can be successfully approached by first
considering the related geometric issues. In the paper, the power and limita-
tions of the different algorithms for protein structure comparison have been
reviewed and discussed. Most of them have already proved their utility in
computer vision and image processing, as well as in robotics, astronomy and
physics. Their use in molecular biology and bioinformatics opens new perspec-
tives for developing integrated methods for protein comparison, classification
and engineering. Even if the different methods have been introduced to be
used within different applications (characterized by different requirements),
they solve particular instances of a more general matching problem, as deeply
investigated in the area of computational geometry. The variety of protein
representations supports the reasoning both at the level of points (i.e. atomic
level) and at level of segments (or secondary structures). Estimation of rigid
transformations with different metrics is an important technique within the
protein structure comparison algorithms. Moreover geometric indexing tech-
niques prove their effectiveness in searching large protein databases. Finally,
graph-theoretic protein modeling helps in designing algorithms for substruc-
ture identification and comparison.
From the current research it has been recognized that the combination
of different methods and different protein representations may result in new
and effective algorithms with decreased computational complexity and better
Geometric Methods for Protein Structure Comparison 79
11 Acknowledgements
Support for Guerra was provided in part by the Italian Ministry of Univer-
sity and Research under the National Project “Bioinformatics and Genomics
Research”, and by the Research Program of the University of Padova.
Support for Ferrari was provided in part by the Italian Ministry of Univer-
sity and Research and by the Research Program of the University of Padova.
References
1. Abagyan, R.A. , Maiorov, V.N. (1989). An Automatic Search for Similar Spa-
tial Arrangements of α helices and β-strands in globular proteins. Journal of
Molecular Structural Dynamics, 6, 5, 1045-1060.
2. Alesker, A., Nussinov, R., Wolfson H.J. (1996). Detection of non-topological
motifs in protein structures. Protein Engineering, 9, 5, 1103-1119.
3. Alter, T. D. (1992). Robust and efficient 3D recognition by alignment. Technical
Report AITR-1410, Massachusetts Institute of Technology, Artificial Intelligence
Laboratory.
4. Arun, K.S., Huang,T.S., Blostein, S.D. (1987). Least-square fitting of two 3-D
point sets. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9, 5,
698-700.
5. Branden, C., Tooze, J. (1999) Introduction to Protein Structure, (second edi-
tion), Garland.
6. Brown, N.P., Orengo C.A., Taylor W.R. (1996). A protein structure comparison
methodology. Comp. Chem. , 27, 359-380.
7. Brint, A.T., Willett, P. (1987). Algorithms for the identification of three-
dimensional maximal common substructures. J. Chem. Inform. Comput. Sci.,
27, 152-156.
8. Califano, A., Mohan, R. (1992). Multidimensional indexing for recognizing visual
shapes. IEEE Trans. on Pattern Analysis and machine Intelligence, 16, 4, 373-
392.
9. Chen, H. H., and Huang, T. S. (1990) Matching 3d line segments with appli-
cation to multiple-objects motion estimation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 12, 1002-1008.
80 Carlo Ferrari and Concettina Guerra
10. Chew, P.L., Dor D.,pEfrat A., Kedem, K. (1995). Geometric pattern matching
in d-dimensional space, Algorithms-ESA ’95, 264-279.
11. Chew, L.P., Goodrich M.T., Huttenlocher,D.P., Kedem,K., Kleinberg, J.M.,
Kravets, D. (1997). Geometric pattern matching under Euclidean motion, Com-
putational Geometry. Theory and Applications, 7, 113-124.
12. Escalier V., Pothier, J., Soldano, H., Viari, A. (1998). Pairwise and Multi-
ple Identification of three-dimensional common substructures in proteins. J.
of Computational Biology, 5, 41-56,
13. Fischer, D., Bachar, O., Nussinov, R., and Wolfson, H. (1992). An efficient
automated computer vision based technique for detection of three dimensional
structural motifs in proteins. J. Biomol. Struct. Dyn., 9, 769-789.
14. Fischer, D., Tsai, C.J., Nussinov, R., and Wolfson, H. (1995). A 3D sequence-
independent representation of the protein data bank. Protein Engineering, 8,
981-997.
15. Goodrich, M.T., Mitchell, J.S.B., Orletsky, M.W. (1999) Practical Methods for
Approximate Geometric Pattern Matching Under Rigid Motion. IEEE Trans.
Pattern Analysis and Machine Intelligence, 21, 4, 371-379.
16. A. Glassner, ( editor), Graphics Gems, Academic Press, 1990.
17. Gerstein, M. (1992). A Resolution-Sensitive Procedure for Comparing Protein
Surfaces and its Application to the Comparison of Antigen-Combining Sites.
Acta Cryst., A8, 271-276.
18. Gerstein, M., Levitt, M. (1998). Comprehensive Assessment of automatic struc-
tural alignment against a manual standard, the scop classification of proteins.
Protein Science, 7, 445-456.
19. Golub, G.H., Van Loan C.F. , (1996) Matrix Computation, Johns Hopkins Uni-
versity Press.
20. Grindley, H.M., Artymiuk, P.J., Rice, D.W., and Willett, P. (1993). Identifi-
cation of tertiary structure resemblance in proteins using a maximal common
subgraph isomorphism algorithm. J. Mol. Biol., 229, 707-721.
21. Guerra, C., Lonardi, S., Zanotti, G., (2002). Analysis of proteins secondary
structures using indexing techniques. IEEE Proc First Int. Symposium on 3D
Data Processing Visualization and Transmission, 812-821.
22. C. Guerra, V. Pascucci. (1999) On matching sets of 3D segments”, Proceedings
of SPIE Vision Geometry VIII, 157-167.
23. Guerra, C., Pascucci, V. (1999) 3D segment matching using the Hausdorff dis-
tance. Proceedings of the IEEE Conference on Image Processing and its Appli-
cations, IPA99, 18-22.
24. Hagedoorn, M., Veltkamp, R. C. (1999). Reliable and efficient pattern matching
using affine invariant metric. Int. J. of Computer Vision, 31, (2/3), 203-225.
25. B. Horn, H. Hilden, S. Negahdaripour (1988). Closed-form solution of absolute
orientation using orthonormal matrices. J. Opt. Soc. Am., 5, 1127-1135.
26. Holm, L., Sander, C. (1995). 3D-Lookup: Fast protein structure database
searches at 90% reliability. Proc. Third Int. Conf. on Intell. Sys. for Mol. Biol.,
Menlo Park, 179-187.
27. Holm, L., Sander, C. (1996). Mapping the protein universe. Science, Science,
273, 595-602.
28. B. Horn, (1987). Closed-form solution of absolute orientation using unit quater-
nions. J. Opt. Soc. Am., 4, 629-642.
Geometric Methods for Protein Structure Comparison 81
48. Rucklidge, W. J. (1997) Efficiently locating objects using the Hausdorff distance.
International Journal of Computer Vision, 24, 3, 251-270.
49. Sabata, B., Aggarwal, J.K. (1991). Estimatiom of motion from a pair of range
images: a review. Computer Vision, Graphics and Image Processing: Image Un-
derstanding, 54, 3, 309-324.
50. Sali A., Blundell, T.L. (1990). Definition of a General Topological Equivalence
in protein structures: a procedure involving comparisons of properties and re-
lationships through simulated annealing and dynamic programming, Journal of
Molecular Biology, 212, 403-428.
51. Singh,A.P., Brutlag, D.L., (1997). Hierarchical protein structure superposition
using both secondary structures and atomic representations. Proc. Fifth Int.
Conf. on Intell. Sys. for Mol. Biol., Menlo Park, 284-293.
52. Smith, T.F., Waterman, M.S., (1981). Identification of common molecular sub-
sequences. J. Mol. Biol., 147, 195-197.
53. Sowdhamini R., Burke D.F, Huang J-F, Mizuguchi, K. Nagarajaram H.A., Srini-
vasan N., Steward RE. and Blundell T.L. (1998). CAMPASS: A database of
structurally aligned protein superfamilies. Structure, 6, 9, 1087-1094.
54. Ullman, J.R. (1995) J. Assoc. Comp., 23, 31-42.
55. Verbitsky, G. Nussinov, R., Wolfson H.J. (1998). Structural comparisons allow-
ing hinge bendings, swiveling motions. Proteins, 34, 232-254.
56. Walker M.W, Shao L., Voltz R.A., Estimating 3-D Location Parameters Using
Dual Number Quaternions, CVGIP:Image Understanding, 54, 3, 1991, 358-367
57. Willett, P. (1995) Searching fore pharmacophoric patterns in databases of three-
dimensional chemical structures. J. of Molecular Recognition, 8, 290-303.
Identifying Flat Regions and Slabs
in Protein Structures
1 Introduction
When we examine the 3D tangle of a protein string we sometimes observe
flat or planar regions. Certain large segments or groups of amino acids lie
roughtly in a plane or very narrow slab. The presence of these segments or
groups of amminoacids in a very narrow slab can be very useful for confirming
or suggesting potential secondary structures such as a β-strand. The episodic
occurence of this phenomenon motivated an interest in general algorithms
that quickly find flat regions in 3D point sets.
We formulate the problem of identifying flat regions as follows. Given a
set Γ of n points in 3D space and a non-negative constant , determine the
plane that is at a distance at most from the maximal number of points of
Γ.
In other words, we want to determine a pair of parallel planes (slab) a distance
of 2 apart enclosing a maximal number of points of Γ . The pair of planes is
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 83-97, 2003.
Springer-Verlag Berlin Heidelberg 2003
84 Mary Ellen Bock and Concettina Guerra
not necessarily unique. For the extraction of multiple planar regions from the
input data, the same problem is solved repeatedly after the removal of the
points that are found to belong to the best slab.
We present a new method for the detection of planar regions in point
sets; we also give some heuristics that can help speed-up the process while
maintaining a good quality for the solution. We show that our geometric
method has many advantages over the Hough transform, a popular technique
to extract parametric shapes, such as lines, planes, and circles from images.
Our detection algorithm solves the problem as a maximum coverage prob-
lem in the parameter space. A subset of three points of Γ determines a region
in the parameter space containing the parameters of all planes within dis-
tance from the three points. Consider the arrangement in the parameter space
formed by the regions determined by all subsets of three points of Γ . Define
the depth of a point in the parameter space as the number of regions that
contain the point. This is in fact the number of points of Γ within distance
from the corresponding plane. The point of maximum depth is sought among
the vertices of the arrangement. Each such vertex gives the parameters of a
plane at distance either + or − from each of the three points.
The search for a vertex of maximum depth can be done by simply counting
for a plane corresponding to a vertex of the arrangement the number of points
of Γ within distance from it and then choosing the plane that has the largest
count. This computation requires O(n) time for each plane. A more efficient
solution to this problem is also presented that avoids the above computation
for all planes, by exploiting the structure of the arrangement of the regions in
the parameter space. The algorithm has a O(n3 log n) time complexity.
The problem of detecting geometric primitives is a fundamental one in com-
puter vision; there it is often solved by the Hough transform, a method that
transforms the extraction problem into an intersection problem in the param-
eter space, that is is quantized in an accumulator array.
The most common application of the Hough transform in image processing
is for line detection. The method has been also applied to circle and ellipse
detection, and to the extraction of planes from range images. If storage space
is a concern this approach can be used only for parametric shapes with few pa-
rameters, since the accumulator array dimensionality grows with the number
of the parameters.
The plane extraction problem is related to a well-known problem in com-
putational geometry, the width problem, that is the determination of a pair of
parallel planes of minimum distance enclosing all points of a given set. Houle
and Toussain [9] give algorithms that produce the exact width of a point set
and run in O(n log n)time in 2 and in O(n2 ) in 3 . An efficient approxima-
tion to the width determination is given in [4], that can be used to perform the
metrology primitive of “flatness”. The requirement that all points are within
a single slab formed by the two parallel planes makes these approaches not
Identifying Flat Regions and Slabs in Protein Structures 85
suitable for applications that require the extraction of multiple slabs with a
predefined width.
2 A Geometric Algorithm
takes O(n4 ) time. In the next section we present a more efficient algorithm
that exploits the structure of the arrangement of the regions and achieves
O(n3 log n) time complexity. We have implemented the algorithm also using
random sampling, that is the selection of a small number of triplets of points
from the input data set. The results are satisfactory in most cases; we do not
however discuss them here for lack of space.
For the template matching cost function, methods which generate only
the first two cases above for the candidate planes, i.e. the triple of points
always lies in one of the two planes which form the enclosing pair, can miss
25% of the points as seen in example 2.1. Again for the template matching
cost function, methods which always place each triplet of points in the middle
plane (and form the potentially optimal pair of planes by creating parallel
planes on either side of the middle plane a distance of away) can miss (25%
) of the points as seen in example 2.2.
Example 2.1. Consider the following four points: P1 (0,.75,0), P2 (0, -
.75,0),P3 (1,0, .75), and P4 (1, 0, -.75). Let = 0.5. The pair of parallel planes,
one with coefficients (1, 0, 0, −1) containing P3 and P4 and the second parallel
to it and containing P1 , encloses all four points (in fact P2 belongs to the
second plane). The solution generated by our method is the middle plane of
this pair with coefficient (1, 0, 0, −.5) and it is within distance 0.5 from all
four points.
On the other hand, restricting one of the two planes which form the pair
to always contain three of the four points misses one point. The middle plane
of such a pair has always distance greater or equal to 0.7 from the remaining
point. See for example, the plane (0.6, 0, −0.8, −0.5) which is middle plane of
a pair with one member containing P1 , P2 and P3 . Thus the maximal number
of points within distance 0.5 from such planes is 3.
Example 2.2. Consider the same four points as in the previous example:
P1 (0,.75,0), P2 (0, -.75,0),P3 (1,0, .75), and P4 (1, 0, -.75). Let = 0.5. The
planes containing three of the four points always have the distance 1.2 from
the remaining point. Again the maximal number of points within distance 0.5
from such planes is 3.
An interesting feature of the algorithm is that its time complexity is in-
dependent of . Therefore it is the same even when high precision is required
(very small values of ). This is unlike the methods based on the Hough trans-
form for which the complexity in time and memory increases with the degree
of required precision.
We first observe that the vertices of the arrangement lie on the O(n2 )
ellipses eij formed by the intersection of two hyperplanes Qi + and Qj + ,
i, j = 1, ..., n, and the hypercylinder. More precisely, of the 8 vertices generated
by Ri , Rj and Rk 4 belong to the ellipse eij and 4 to the ellipse eik . (or to
the ellipse ejk ) On eij (eik ) one pair of vertices is obtained by intersecting
eij with Qk + and the second pair with Qk − (Qj + and Qj − ). Each pair of
vertices on eij defines an arc consisting of points in Ri ∩ Rj ∩ Rk . Consider
all arcs on eij obtained for all different values of k. The depth of a vertex is
clearly given by 2 plus the number of arcs it belongs to.
Thus the problem of finding the vertex of maximum depth becomes that of
computing the maximum overlap in a set of arcs. This problem can be solved
using standard geometric techniques in O(k log k) time for k arcs [5]. Since
this step is repeated for each ellipse corresponding to a pair of regions Ri Rj ,
the algorithm has an overall O(n3 log n) time complexity.
The equation of the ellipse eij can be obtained as follows. Let
axi + byi + czi + d + = 0
axj + byj + czj + d + = 0
a 2 + b 2 + c2 = 1
be the equations of the hyperplanes Qi + , Qj + and the hypercylinder, re-
spectively.
The intersection of the two hyperplanes in 4 can be represented by:
a = sua + tva
b = sub + tvb
c = sua + tvc
d = − + sud + tvd
where s and t are parameters and u = (ua , ub , uc , ud ) and v = (va , vb , vc , vd )
are orthogonal unit vectors. We can choose the vector v so that vd = 0. The
intersection of the above plane with the hypercylinder a2 + b2 + c2 = 1 is given
by:
(u2a + u2b + u2c )s2 + (ua va + ub vb + uc vc )s · t + (va2 + vb2 + vc2 )t2 = 1
Since vd = 0 and u ⊥ v the equation above is rewritten as:
αs2 + t2 = 1
where α = 1 − u2d ≤ 1. The above is the equation of the ellipse on the plane.
4 Hough Transform
arbitrary shapes. There exist several variants of the basic Hough transform;
for recent surveys on the subject see [13], [16].
The standard formulation of the Hough transform for line detection is as
follows. Suppose we are given a set of image points - for instance edge points
or some other local feature points - and we want to determine subsets of
them lying on straight lines. The Hough transform associates an image point
with a line in the parameter space. The method is based on the property that
collinear points are transformed into lines intersecting at the same point in the
parameter space. The problem is then solved as an intersection problem in the
parameter space, that is quantized in an array of cells, called the accumulator
array. Specifically, the Hough transform consists of the following steps.
step 1. For each image point, determine the parameters of all lines through
it and accumulate the corresponding cells in the parameter space.
steo 2. Find the maximum (or the local maxima) in the accumulator array.
The corresponding parameters describe the detected line (lines).
An important property of the Hough transform is its insensitivity to noise
and to missing parts of lines. The disadvantages are its high computational
and memory requirements which increase with the resolution of the accumula-
tor array. Furthermore, since the array dimensionality grows with the number
of the curve parameters, this approach is applicable only to curves with few
parameters. To overcome these limitations, many approaches have been pro-
posed for image and vision vision applications, including probabilistic [11],
randomized [12] and fast [15] Hough approaches.
Some of these approaches formulate the Hough transform as a many-to-
one mapping from the original set of points into the parameter space, instead
of a one-to-many mapping as in the standard formulation. For example, in
the case of line detection that means determining for any two points of the
set the line passing through them and incrementing the corresponding entry
of the array in the parameter space. In general, for the detection of a curve
represented by m parameter equations, the original approach maps one point
into a m−1 hypersurface into the parameter space, while the second approach
maps m points which define a curve into one point of the parameter space.
The second approach is generally used in conjunction with random sampling
[6], that is the random selection of few subsets of n points from the original
input to reduce storage and computation time.
The Hough transform is easily extended to the detection of planes in a set
Γ of points in 3D space. A plane is parametrized by the following expression,
containing three parameters α, β and d:
Table 1. Execution times of the two algorithms for fragments of the protein 1KDU.
Table 2. Execution times of the two algorithms for fragments of the protein 3PGM
92 Mary Ellen Bock and Concettina Guerra
geometry of the secondary structures that can help predict the secondary
structures from the sequence of amino-acids.
We have also applied the algorithm to the detection of slabs of given widths
enclosing the maximum numbe of points to several proteins from different
structural classes. In figures 6 and 7 we show the results for 6 proteins from
the PDB; in all cases we have considered the Ca atoms only.
Fig. 3. Protein 1emd has an helix with width larger than average
Fig. 4. Protein 1adb has an helix with width larger than average
Identifying Flat Regions and Slabs in Protein Structures 95
Fig. 5. A β-strand of protein 2bpa with a width value larger than average
7 Acknowledgements
Support for Guerra was provided in part by the Italian Ministry of Univer-
sity and Research under the National Project “Bioinformatics and Genomics
Research”, and by the Research Program of the University of Padova.
We would like to thank Ivano Tagliapietra for his help with this paper.
96 Mary Ellen Bock and Concettina Guerra
References
1. F.C. Bernstein, T.F. Koetzle, G.J. Williams, E.F. Meyer, M.D. Brice, J.R.
Rodgers, O. Kennard, T. Shimanouchi, M. Tasumi. The Protein data bank: a
computer-based archival file for macromolecular structures. J. Mol. Biol.,vol.
112, pp. 535-542, 1977.
2. N. Colloc’h, C. Etchebest, E. Thoreau, B. Henrissat, J.P. Mornon. Comparison
of three algorithms for the assignment of secondary structures in proteins: The
advantage of a consensus assignment. Protein Eng. , vol. 6, pp. 377-382, 1993.
3. R.O. Duda, P.E. Hart. Use of the Hough transform to detect lines and curves in
pictures. Commun. ACM, , vol. 15, pp.11-15, 1972.
4. C.A. Duncan, M.T. Goodrich, and E.A. Ramos, Efficient Approximation and
Optimization Algorithms for Computational Metrology, 8th ACM-SIAM Symp.
on Discrete Algorithms (SODA), 1997.
5. H. Edelsbrunner. Algorithms in Combinatorial Geometry . Springer-Verlag, Hei-
delberg, Germany, 1987.
6. M. A. Fischler, R.C. Bolles. Random sample consensus: A paradigm for model
fitting with applications to image analysis and automated cartography. Computer
Vision, Graphics and Image Processing, vol. 24, pp. 381-395, n.6, 1981.
7. D. Frishman, P. Argos, Knowledge-based secondary structure assignment, Pro-
teins: structure, function and genetics, vol. 23, pp. 566-579, 1995.
8. P.V. Hough. 1962. Methods and means to recognize complex patterns. U. S.
patent 3.069.654.
9. M.E. Houle, G.T. Toussaint, Computing the width of a set, IEEE Trans. on
Pattern Analysis and Machine Intelligence, PAMI-10, 1988, pp. 761-765.
10. W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recog-
nition of hydrogen-bonded and geometrical features, Biopolymers, vol. 22, pp.
2577-2637, 1983.
11. N. Kiryati, Y., Eldar, A.M., Bruckstein, A Probabilistic Hough Transform, Pat-
tern Recognition, vol. 24, pp. 303-316, 1991.
12. P. Kultanen, L. Xu,E. Oja, Randomized Hough Transform (Rht), Int. C. Pattern
Recognition, R-B(90), pp. 631-635, 1990.
13. J. Illingworth, J. Kittler, A survey of the Hough transform. Computer Vision,
Graphics and Image Processing, vol. 44, pp. 87-116, 1988
Identifying Flat Regions and Slabs in Protein Structures 97
1 Abstract
2 Introduction
The amount of sequence data from the rapidly progressing genome projects
keeps growing exponentially. The similar fast growing speed of computers
has make computer aided tools to infer structure and function of the newly
sequenced proteins an integral part of modern molecular biology. In order
to understand the evolutionary history of the new sequences, aligning the
primary structure of the probe sequence with others in the database using so-
phisticated software such as FASTA [1], BLAST [2, 3], and SSEARCH [1, 4]
is one the most significant and widely used techniques. Sequences with a
high similarity score usually share a common structure and might have simi-
lar functions or mechanisms. For any pair of proteins, the optimal alignment
that maximizes the total score can be done quickly, using standard dynamic
programming techniques [5]. The maximum possible score for a given pair
RAG also with the Biophysics Research Division. phone: (734) 763-8013, fax:
(734)764-3323, email:richardgumich.edu.
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 99-108, 2003.
Springer-Verlag Berlin Heidelberg 2003
100 Maricel Kann and Richard A. Goldstein
of proteins is then used to determine whether the pair of proteins are homol-
ogous. This is often done by computing such quantities as p(Sr > x), the
probability that a random pair of proteins of the same length would have that
score or higher, E, the expected number of random proteins in the database
that would have at least that score, and P , the probability that there is at
least one random pair with a higher score. Smaller values of p(Sr > x), E,
and P indicate a higher likelihood that the given pair is in fact homologous.
The first commonly-used score matrices were the PAM (percent accepted
mutations) series developed by Dayhoff and co-workers [6]. Others such as
those developed by Gonnet et al (GCB) [7] and Jones et al. (JTT) [8] have
applied Dayhoff’s method to larger sequence datasets. Henikoff and Henikoff
used a dataset of aligned sequence blocks to construct his popular BLOSUM62
matrix [9]. Overington and coworkers used Henikoffs’ cluster method to create
a score matrix (STR) where the protein sequences were aligned based on their
observed structures [10].
The matrices described above, represent considerable improvements in the
search procedure, allowing the immediately detection of similarities among
approximately half of all the newly discovered genes. However, there is still a
challenging problem to overcome, when sequence similarity is low with those
score functions (below 25% sequence identity) distant relationships among
protein sequences that are in fact homologs, are impossible to detect.
In this paper, we describe a new procedure to generate a score function
optimized to detect distantly related pairs of protein sequences. Results with
the independent test sets demonstrate the superiority of the resulting score
function compared with other commonly-used score functions for the detection
of distant homologies.
3 Methods
Theory
where ni,j refers to the number of times that amino acid type i is aligned with
amino acid type j, ngap−I is the total number of gaps in the alignment, ngap−E
is the total number of residues in each gap beyond one, γi,j , also known as
the score function, substitution matrix, or exchange residue matrix, represents
the contribution to the score for any amino acid match or mismatch, while
the gap penalties are given by γgap−I and γgap−E initialization and extention
of a gap, respectively . The score for the alignment of the target protein
and a putative homolog is x. To estimate the significance of this score, we
OPTIMA: A New Score Function for the Detection of Remote Homologs 101
where the averages are over the alignments of the target protein with the
ensemble of random non-homologous proteins in the dataset. By using the
Z score, we automatically account for variations in the expected score with
the length of the proteins. We can represent the distribution of scores for
ungapped [3, 14, 15] and gapped alignments [16] by an extreme value distri-
bution (EVD) [17]. In this case, the probability that a given random score Sr
would be equal or greater than x is
Sr
p(Sr > x) = 1 − ρ(x) dx
−∞
= 1 − exp (− exp (−αZ − β)) (3)
α = 1.282 and β = 0.5772... for a perfect EVD [18], although these parameters
are generally adjusted based on the observed distribution. For a search of a
database of size D, the expected number of scores between the target protein
and random pairs is equal to E = D p (Sr > x). In this paper, we use a
value of D = 100, 000. Assuming a Poisson distribution, the probability P of
observing at least one alignment with score equal to or greater than x is given
by
P = 1 − exp (−E) (4)
Both the E-value and the P -value depend upon the size of the database being
searched. E-values range from 0 to D, while P -values range from 0 to 1.
We are interested in optimizing the ability of a score function to discrimi-
nate between homologous and non-homologous pairs of proteins. That is, we
are interested in identifying a true homolog, and in having confidence in this
identification. Our confidence in a putative match is equal to the number of
correct matches divided by the number of matches, both correct and incor-
rect, with the same score or higher. Assuming that we have one true homolog
in the dataset, the average confidence C can be quantified as
1 −1
C= = 1 + D 1 − e− exp(−αZ−β) (5)
1+E
A C value close to 1 indicates a confident alignment, with C decreasing to
1/(1 + D) as the confidence of the alignment decreases. This represents our
average relative chance the match is to a true homolog. In this paper we
optimize the score function by maximizing C averaged over the training set.
By optimizing C we automatically focus on homologous pairs at the limit of
detection, reducing the dependence of the score function on homologies that
are either easily detectable (E 1) or overly distant (E ∼ D).
102 Maricel Kann and Richard A. Goldstein
Database Preparation
We are interested in optimizing our score functions for the detection of distant
homologs, beyond the capability of current score functions. We therefore need
a set of known homologs whose homology cannot be reliably determined with
standard pairwise sequence comparisons. For this purpose, we took advantage
of the Cluster of Orthologs Genomes (COG) database of Koonin and co-
workers [13]. A 900-pair training set was constructed of pairs of proteins
in the same COG that share less than 25% sequence identity. A 177-pair
disjoint test set was constructed in a similar manner from the COG database,
excluding all COGs that contributed to the training set, with each pair of
proteins again having less than 25% sequence identity. We also constructed
a test sets independently of the method used to construct the training setby
taking pairs of proteins identified as homologs in the PFAM database release
5.2 [19] In order to avoid overlap with the training and test sets derived
from the COG database, we ran a BLAST search [2] (using BLOSUM62 [9]
with -12,-2 for the gap penalties) of all the sequences in the PFAM database
against the 1077 pairs from the COGs that we were using either as the training
set or first test set, and excluded all PFAM families with any member with
similarity to these proteins (i.e. E < 10). From this set of protein sequences,
we selected 103 pairs that share less than 25% sequence identity as a second
test set of distantly related sequences.
We are interested in maximizing the confidence value C averaged over pro-
teins in the training set, where the calculation of C involves the distribution
of scores for the optimal alignment of the target proteins with homologous
and non-homologous proteins in the dataset. These optimal alignments are
themselves dependent upon the value of the score function. Thus an iterative
scheme is required.
We started with the BLOSUM62 matrix [9] and used the local dynamic
programming algorithm [5] to align each of the target proteins in the training
dataset against a homolog and a set of non-homologous proteins with a large
number of different gap penalties. We then calculated Z and C for each
pair of homologs, and averaged over the pairs in the training set to yield C.
The highest C values were obtained with gap penalties of γgap−I = −12.0 and
γgap−E = −2.0. This scoring scheme (BL62(12,2)) was then used to gener-
ate an initial set of alignments of the target proteins with homologous and
non-homologous proteins. The observed distributions of the non-homologous
proteins were used to adjust the values of α = 1.31 and β = 0.74, similar to
the values expected (α = 1.282 and β = 0.5772) for a perfect EVD [18].
As multiplication of the score function by any constant does not change
Z or any of the other statistics, we fixed one score function (γgap−I ) equal to
OPTIMA: A New Score Function for the Detection of Remote Homologs 103
Results
The values of C as averaged over the training set and distant COG homolog
test set during the optimization process are shown in Figure 1. The resulting
score function (OPTIMA) obtained after 10 iterations is shown in Table I.
Figure 2 shows the cumulative distribution of C values for COG distant-
homolog test set with the different score functions. As shown, the greater
discriminatory power of the OPTIMA score function is represented by the
larger fraction of the protein sequence pairs having larger values of C. That
implies a substantial improvement in our ability of making confident predic-
tions compared with other standard score functions. The optimal value for
A 36
R -9 56
N -19 4 59
D -20 -18 18 65
C 6 -29 -30 -30 99
Q -3 12 2 2 -30 46
E -10 3 3 20 -39 19 40
G 4 -18 7 -10 -29 -20 -23 67
H -19 3 12 -7 -29 3 2 -18 86
I -5 -28 -32 -34 -6 -30 -33 -41 -28 35
L -7 -20 -32 -43 -6 -23 -31 -42 -27 28 32
K -10 31 1 -4 -29 15 14 -18 -7 -31 -21 37
M -9 -10 -19 -30 -8 1 -21 -30 -19 12 24 -12 51
F -19 -30 -29 -33 -18 -29 -32 -32 -8 8 17 -29 2 57
P -5 -18 -17 -7 -30 -11 -7 -18 -18 -30 -33 -10 -21 -39 74
S 12 -11 10 4 -10 0 -1 2 -9 -20 -22 3 -10 -19 -8 36
T 0 -8 0 -10 -7 -7 -6 -17 -20 -8 -13 -8 -7 -18 -11 18 48
W -29 -29 -39 -40 -18 -19 -29 -19 -18 -28 -15 -30 -8 14 -38 -29 -19 110
Y -19 -15 -19 -20 -18 -9 -21 -29 20 -8 -2 -17 -9 37 -28 -19 -17 22 69
V 6 -32 -31 -31 -6 -19 -28 -30 -29 35 18 -23 10 0 -18 -22 6 -28 -9 38
A R N D C Q E G H I L K M F P S T W Y V
Table 1. OPTIMA score matrix achieved at the tenth iteration. The elements are
multiplied by ten to increase precision; corresponding gap penalties are -120 and
-20.
γgap−I was −11.97 with γgap−E fixed at −2.0. The small change in the gap
penalties indicates that most of the improvement comes from refinements of
104 Maricel Kann and Richard A. Goldstein
Score Matrix Gap penalties COGs Distant Homologs PFAM Distant Homologs
(Initiate/Extend) C p(Sr > x) P C p(Sr > x) P
OPTIMA -11.97/-2.0 0.736 0.004 0.283 0.710 0.007 0.311
1 BLOSUM62 [9] -12/-2 0.652 0.009 0.372 0.645 0.016 0.376
2 BLOSUM62 [9] -8/-0.5 [22] 0.248 0.024 0.800 0.312 0.033 0.737
3 PAM250 [6] -12/-2 0.480 0.017 0.549 0.669 0.012 0.359
4 PAM250 [6] -6/-1.3 [22] 0.013 0.072 0.999 0.035 0.061 0.988
5 GCB [7] -12/-2 0.647 0.007 0.377 0.703 0.022 0.324
6 GCB [7] -7.5/-0.4 [22] 0.023 0.061 0.997 0.030 0.042 0.987
7 STR [10] -12/-2 0.515 0.035 0.509 0.575 0.033 0.450
8 STR [10] -8/-0.5 [22] 0.172 0.041 0.866 0.281 0.034 0.774
9 JTT [8] -12/-2 0.517 0.009 0.516 0.642 0.022 0.392
10 JTT [8] -10.5/-0.4 [22] 0.076 0.035 0.958 0.127 0.036 0.916
Table 2. Comparison of the various score matrices and gap penalties on PFAM and
COG test sets of distant homologs (percentage identity less than 25%) as evaluated
with average confidence value (C), average probability that a random score would
be higher than the known homolog (p(Sr > x)), and average probability that at
least one of the random scores in a dataset of 100,000 proteins would be higher than
the known homolog (P ). For the purpose of these comparisons, we use both the
standard default gap penalties as well as the gap penalties derived by Argos [22].
0.75
test
train
0.7
<C>
0.65
0.6
0 5 10 15 20
iteration
Fig. 1. Value of C averaged over the training dataset and test dataset of distant
homologs during the optimization procedure.
the values of γi,j . As shown in Table II, OPTIMA has a significantly im-
proved average confidence (C) value compared with other commonly-used
score matrices . This improvement is not confined only to values of C; both
p(Sr > x), the average probability that any random score would be higher
than the homolog, as well as P , the average probability that at least one
OPTIMA: A New Score Function for the Detection of Remote Homologs 105
random score is higher than the homolog, are both substantially decreased
compared with other matrices.
0.8
Fraction of protein sequences
0.6
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
C
Fig. 2. Cumulative distribution of the C values for the various score matrices,
showing the fraction of all protein pairs in the COGs test set of distant homologs
with less than a given value of confidence. The various lines refer to the OPTIMA
score matrix ( – ); BLOSUM62 (12/2)[9] (· · · · · · ); GCB (12/2)[7] (− − −); STR
(12/2)[10] (− · − · − · −); JTT (12/2)[8] (− · · − · · −); and PAM250 (12/2)[6] (–).
Figures 3 A and B show the coverage or fraction of true positives vs. the
estimated number of false positives per query for the distant homolog COGs
and PFAM test sets, respectively, where the estimated number of false posi-
tives per query represents the expected number of random sequences with a
score greater than the pair of homologous sequences. The better performance
of OPTIMA can be seen from the large number of homologous pairs with
lower estimated number of false positives.
As a further test, we constructed a parametric plot where we calculated
the fraction of true positive homologs identified (coverage) and the fraction
of non-homologous pairs identified incorrectly as homologies (fraction of false
positives) as a function of the cut-off value of E. The results are shown for
the distant homolog COGs and PFAM test sets in Figures 3 C and D, respec-
tively. While this parametric plot may be compromised by the presence of
true homologs incorrectly labeled as non-homologs (false false-positives) in the
test sets, the qualitative agreement with the previous plots further supports
the performance of OPTIMA compared with the other score functions.
106 Maricel Kann and Richard A. Goldstein
-4 -4
10 10
-6 -6
10 0 0.2 0.4 0.8 1 10 0 0.2 0.4 0.8 1
0.6 0.6
0
Coverage 0
Coverage
10 10
Fraction of false positives
-1 C -1 D
10 10
-2 -2
10 10
-3 -3
10 10
-4 -4
10 0.7 0.8 0.9 1 10 0.7 0.8 0.9 1
Coverage Coverage
Discussion
Most methods for constructing a score function rely on creating a dataset
of reliably-aligned sequences or sequence fragments and gathering statistics
on the relative number of times that each possible pair of amino acids are
aligned. In practice, however, we are interested in distinguishing optimally
(but possibly incorrectly) aligned homologs from optimal alignments of non-
homologs. Our approach towards generating a score matrix is to optimize
the ability of this matrix to do what we want to do: discriminate between
OPTIMA: A New Score Function for the Detection of Remote Homologs 107
Acknowledgements. The authors would like to thank Bin Qian for his collaboration
to this project and Stephen Altschul, Ting-Lan Chiu, William Pearson, Romesh
Saigal, Stephen Bryant and John Spouge for helpful discussions, also Todd Raeker
and Michael Kitson for computational assistance. Financial support was provided
by NIH Grant LM0577 and NSF equipment grant BIR9512955.
References
1. Pearson, W. R., Lipman, D. J. Improved tools for biological sequence analysis.
Proc. Nat. Acad. Sci. USA 85:2444–2448, 1988.
2. Altschul, S. F., Gish, W., W.Miller, Myers, E., Lipman, D. Basic local alignment
tool. J. Mol. Biol. 215:403–410, 1990.
3. Altschul, S. F., Gish, W. Local alignment statistics. Methods Enzymol. 215:460–
480, 1996.
4. Lipman, D. J., Pearson, W. R. Rapid and sensitive protein similarity searches.
Science 227:1435–1441, 1985.
5. Smith, T. F., Waterman, M. S. Identification of common molecular subsequences.
J. Mol. Biol. 147:195–197, 1981.
6. Dayhoff, M. O., Schwartz, R. M., Orcutt, B. C. A model of evolutionary change
in proteins. In: Atlas of Protein Sequence and Structure. Vol. 5. Dayhoff, M. O.
ed. Vol. 5. . National Biomedical Research Foundation 1978.
7. Gonnet, G. H., Cohen, M. A., Benner, S. A. Exhaustive matching of the entire
protein database. Science 256:1443–1445, 1992.
8. Jones, D. T., Taylor, W. R., Thornton, J. M. The rapid generation of mutation
data matrices from protein sequences. CABIOS 8:275–282, 1992.
9. Henikoff, S., Henikoff, J. G. Amino acid substitution matrices from protein
blocks. Proc. Nat. Acad. Sci. USA 89:10915–10919, 1992.
10. Overington, J., Donnelly, D., Johnson, M. S., Šali, A., Blundell, T. L.
Environment-specific amino-acid substitution tables: Tertiary templates and pre-
diction of protein folds. Protein Sci. 1:216–226, 1992.
11. Brenner, S. E., Chothia, C., Hubbard, T. J. P. Assessing sequence comparison
methods with reliable structurally identified distant evolutionary relationships.
Proc. Nat. Acad. Sci. USA 95:6073–6078, 1998.
108 Maricel Kann and Richard A. Goldstein
12. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller,
W., Lipman, D. L. Gapped Blast and Psi-Blast: A new generation of protein
database search programs. Nucl. Acid Res. 25:3389–3402, 1997.
13. Tatusov, R. L., Galperin, M. Y., Koonin, E. V. The COG database: a tool for
genome-scale analysis of proteins functions and evolution. Nucleic Acids Res
28:33–36, 2000.
14. Karlin, S., Altschul, S. F. Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc. Nat. Acad.
Sci. USA 87:2264–2268, 1990.
15. Dembo, A., Karlin, S., Zeitouni, O. Limit distribution of maximal non-aligned
two-sequence segmental score. Ann. Prob. 22:2022, 1994.
16. Pearson, W. R. Empirical statistical estimates for sequence similarity searches.
J. Mol. Biol 276:71–84, 1998.
17. Gumbel, E. J. Statistics of Extremes. New York: Columbia University Press.
1958.
18. Gumbel, E. J. Statistics theory of extreme values and some practical applica-
tions. Washington: U.S. Government Printing Office: National Bureau of Stan-
dards Applied Mathematics Series 33. 1954.
19. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Finn, R. D., Sonnhammer,
E. L. L. Pfam 3.1: 1313 multiple alignments match the majority of proteins.
Nucleic Acids Research 27:260–262, 1999.
20. Dennis, J. E., Jr., and Schnabel, R. B. Numerical Methods for Unconstrained
Optimization and Nonlinear Equations. New York: Prentice-Hall. 1983.
21. Press, W. H., Teukolsky, S. A., Vetterling, W. T., Flannery, B. P. Numerical
Recipes in C: Cambridge University Press. 1992.
22. Vogt, G., Etzold, T., Argos, P. An assessment of amino acid exchange matrices
in aligning protein sequences: The twilight zone revisited. J. Mol. Biol. 249:816–
831, 1995.
23. Metz, C. E. ROC methodology in radiologic imaging. Invest. Radiol. 21:720–
733, 1986.
24. Swets, J. A. Measuring the accuracy of diagnostic systems. Science 240:1285–
1293, 1988.
A Comparison of Methods for Assessing the
Structural Similarity of Proteins
Dept. Zoology and Genetics, Iowa State University, Ames, IA 50011, U.S.A.
{dcadams, gnaylor}@iastate.edu
1 Introduction
The link between biological form and function is well known, and is assumed
to hold true at the molecular level. Since identifying similar protein struc-
tures is the first step in identifying similar functions, much effort has been
placed in developing methods to detect structural similarity. Several methods
exist, including: SCOP [8], the DALI algorithm (from the FSSP Database
[6]), the VAST algorithm (from the MMDB database [5]), and Root Mean
Square (RMS) superimposition [9]. The latter three provide quantitative met-
rics describing protein similarity on an objective, continuous scale. Statistical
analyses can then be performed on similarity scores for a set of proteins, to
obtain a plot of ’protein structure space’ [7]. Before such analyses are done
however, one must be sure that the metric used accurately represents similar-
ity.
In this paper, we describe the DALI Z-score and RMS-distance (DRMS )
metrics, and discuss their shortcomings. We then present a novel means of
comparing protein structures using Geometric Morphometric (GM) methods:
statistical shape methods borrowed from anatomy. Finally, we compare results
from these three methods for a data set of globin structures, and show that
the more intuitive GM method markedly outperforms existing techniques.
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 109-115, 2003.
Springer-Verlag Berlin Heidelberg 2003
110 Dean C. Adams and Gavin J.P. Naylor
between the ith and j th residues for that protein. Distance matrices are aligned
in pairwise fashion, and n homologous residues are identified. The structural
similarity for the two proteins (A and B) is then calculated as:
∗ 2
|dA B
ij − dij |
d
− 20ijÅ
S= 0.2 − e (1)
i j
d∗ij
where d∗ij is the mean distance for those residues (a standardized version of S,
the Z-score, is also calculated). Z-scores are calculated for all protein pairs, and
the best three-dimensional ordination of the structure space is found through
an eigen-decomposition (correspondence analysis) of the Z-score matrix, where
similar proteins are close together, and dissimilar proteins are far apart.
Though Z-scores quantify some aspects of structural similarity, details of
this metric warrant careful scrutiny. First, Z-scores are generated from pair-
wise alignments, so different residues can be used for each pair. Thus, values
in the Z-score matrix represent different aspects of structural similarity, and
are not directly comparable. Second, the metric contains a dissimilarity cut-off
(0.2) to eliminate protein comparisons > 20%. However, most protein com-
parisons in a large database are > 20%, yielding negative scores, which are
arbitrarily truncated to zero. An eigen-analysis of such data will explain little
of the variation in few dimensions, and a low dimensional ordination from
this analysis will fail to capture the essence of ’structure space.’ Finally, the
exponential term in the metric downweights contributions from residues far
from one another. This results in Z-scores for self-comparisons that are not the
same for each protein, implying that some proteins are more ’perfectly’ similar
to themselves than others, which is nonsensical. Thus, DALI Z-scores are not
a true similarity metric, and statistical analyses of them are unpredictable.
where Xij and Yij are the coordinate sets for the ith residue. DRMS is cal-
culated for all protein pairs, and the best three-dimensional ordination of the
structure space is found through an eigen-decomposition (principal coordi-
nates analysis) of the DRMS matrix.
Comparison of Methods for Assessing Protein Structures 111
RMS methods are appealing because DRMS makes intuitive sense: unlike
proteins have a large DRMS , while similar proteins align quite well and have
a small DRMS . It is also a true distance measure, because all self-comparisons
of proteins yield an identical value of zero (no structural differences). Like Z-
scores however, DRMS is calculated in pairwise fashion, so different residues
can be used for each protein pair, rendering DRMS scores incomparable.
4 Geometric Morphometrics
Both DALI Z-scores and DRMS can be used to generate a map of protein struc-
ture space. However, both have methodological difficulties which limit their
utility. Interestingly, these same difficulties have already been addressed in a
completely different discipline: Geometric Morphometrics (GM). GM meth-
ods were originally developed to analyze anatomical structures (e.g., skulls),
but may easily be adapted to compare macromolecular structures. First, a set
of homologous points recorded on all specimens are superimposed using gen-
eralized Procrustes analysis (GPA), which translates specimens to a common
location, scales them to unit size, and optimally rotates them (in a LS sense)
[3],[10]. Shape variables are then generated for each specimen, which may be
used in statistical analyses [2]. Additionally, Procrustes distance (DP ROC )
between two specimens (X & Y) can be calculated as:
−1
n 3
DP ROC = 2 sin (Xij − Yij ) /2
2 (3)
i=1 j=1
where Xij and Yij are the aligned coordinates for the ith residue. DP ROC is
calculated for all protein pairs, and the best three-dimensional plot of protein
shape space is found through an eigen-decomposition (principal coordinates
analysis) of this data. Although GM and RMS are quite similar, they differ
in two important respects. First, size is mathematically held constant in GPA
(not in RMS), and second, GPA superimposes all specimens simultaneously.
5 Comparison of Methods
To compare the three methods described above we used a representative set of
protein structures. We extracted all globin sequences (as of 12/10/1999) from
the Protein Data Bank, and separated them into their individual chains, so
that monomeric and non-monomeric globins could be used. Structural simi-
larity among the 560 chains was then assessed using each of the three methods
(Z-scores, DRMS , DP ROC ). Pairwise structural alignments were calculated in
the DALI domain dictionary [7] (http://www2.ebi.ac.uk/dali) and both DALI
Z-scores and DRMS scores were obtained for each protein pair. For GM, we
112 Dean C. Adams and Gavin J.P. Naylor
aligned the amino acid sequences with Clustal W [11] and deleted all gaps,
yielding 96 homologous residues [1]. We then superimposed the structural data
for these residues with GPA, and generated DP ROC for each protein pair.
The ability of each metric to capture structural variation was assessed
using multivariate ordination methods. The DALI Z-score matrix was sum-
marized using correspondence analysis (as per [7]), and the DRMS and DP ROC
matrices were summarized using principal coordinates analysis. The percent-
age of variation explained by the first three dimensions from the ordination
analysis was compared for each method, and their ability to identify bio-
logically meaningful clusters was assessed through a visual inspection of the
ordination plots.
Using DRMS , the 1st three dimensions of structure space explained 76.1%
of globin chain variability. Inspection of this ordination plot revealed separa-
tion of a few individual chains (mostly hemoglobin chains), but no obvious
groups were identifiable (Fig. 1). Thus, although DRMS explained much of the
variation, it was unable to identify any biologically meaningful globin clusters.
Using DALI Z-scores, the 1st three dimensions of structure space explained
33.5% of globin chain variation, and to describe an equivalent amount of vari-
ation to DRMS (76%), 56 dimensions of the ordination were needed. Further,
the ordination plot revealed no obvious clusters of globin chains (Fig. 2). Thus,
DALI Z-scores were much less effective at summarizing structural variability,
and were unable to reveal biologically interpretable clusters of proteins.
Using DP ROC , the 1st three dimensions of the GM shape space explained
76.6% of the variation, which was similar to that found with DRMS . How-
ever, the ordination plot revealed remarkable separation of globin chains into
Comparison of Methods for Assessing Protein Structures 113
2 4
5
6 8
3
Fig. 3. Three-dimensional view of globin shape space from DP ROC . Labels corre-
spond to the following groups: 1: bacterial hemoglobins, 2: clam hemoglobins, 3: fer-
ric globins, 4: hemoglobin α-chains, 5: hemoglobin β-chains, 6: lamprey hemoglobins,
7: leghemoglobins, and 8: myoglobins.
6 Discussion
In this paper we described two metrics used for assessing structural similar-
ity (DALI Z-scores and DRMS ), and described how geometric morphometric
(GM) methods, commonly used in anatomical studies, may also be employed
to compare protein structures. We then compared the ability of three metrics
114 Dean C. Adams and Gavin J.P. Naylor
References
1. D. C. Adams and G. J. P. Naylor: A new method for evaluating the structural
similarity of proteins using geometric morphometrics. in S. Miyano, R. Shamir,
and T. Takagi (eds.) ( Currents in computational molecular biology). Universal
Academy Press, Tokyo 2000.
2. D. C. Adams and F. J. Rohlf: Ecological character displacement in Plethodon:
biomechanical differences found from a geometric morphometric study. Proc.
Natl. Acad. Sci. U.S.A. 97 (2000) 4106–4111.
3. F. L. Bookstein: Morphometric Tools for Landmark Data: Geometry and Biol-
ogy. Cambridge University Press, Cambridge 1991.
4. M. Gerstain and M. Levitt: Comprehensive assessment of automatic structural
alignment against a manual standard, the scop classification of proteins. Protein
Sci. 7 (1998) 445–456.
5. J.-F. Gibrat, T. Madej, and S. H. Bryant: Surprising similarities in structure
comparison. Curr. Opin. Struct. Biol. 6 (1996) 377–385.
Comparison of Methods for Assessing Protein Structures 115
1 Summary
A protein secondary structure prediction protocol involving up to 800 neural
network predictions has been developed by SBI-AT. An overall performance
of 80% is obtained for three-state (helix, strand, coil) DSSP categories. Input
to primary-layer neural networks includes sequence profiles, relative residue
position, relative chain length, and amino-acid composition. Secondary struc-
ture predictions are made for three consecutive residues simultaneously – a
technique which we describe as ‘output expansion’ – which boosts the perfor-
mance of second-layer structure-to-structure networks. Independent network
predictions arise from 10-fold cross validated training and testing of 1032 pro-
tein sequences at both primary and secondary network layers. Network output
activities are converted to probabilities. Finally, 800 different predictions are
combined using a novel balloting procedure.
2 Introduction
Prediction of three-dimensional protein structure from primary amino-acid
sequence is one of the biggest challenges in structural biology today. One
step toward solving this problem is by increasing the accuracy of secondary
structure predictions for subsequent use as input to ab initio calculations or
threading algorithms. The predictions are of significant importance to pro-
tein fold recognition and homology modeling, and the degree to which the
secondary structure can be determined has become a benchmark for protein
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 117-122, 2003.
Springer-Verlag Berlin Heidelberg 2003
118 Claus Lundegaard et al.
3 Methods
3.1 Neural Networks
A feed-forward neural network architecture was used, with input, hidden and
output layers (Figure 1). Input consists of sequence profiles in window sizes
Prediction of Protein Secondary Structure at High Accuracy 119
Network training data was prepared from the PDB version of August 1999.
A set of 1168 protein chains remained after an extensive quality check [11],
application of a strict pairwise sequence similarity threshold [9], and removal
of transmembrane spanning chains. Eight-category DSSP secondary structure
assignments [8] were reduced to three-category H (helix), E (extended strand)
and C (all other) for use in network training. Sequence profiles were obtained
using iterative BLAST searches [2]. The set of 1168 protein chains was fur-
ther homology reduced against the evaluation set prior to network training,
yielding a set of 1032 chains representing ∼200,000 amino-acid residues.
4 Results
A performance of 80% mean per-residue secondary structure prediction accu-
racy was obtained using a benchmark set of 126 protein sequences [11, 12].
120 Claus Lundegaard et al.
SBI−AT
Psi−Pred
PHD server
population 0.20
0.10
0.00
40.0 50.0 60.0 70.0 80.0 90.0 100.0
HEC performance (% correct)
HEC probabilities are obtained from network output activities via matrix
transformation, which provides an additional 1.2% in performance over un-
transformed activities. A conversion matrix is obtained for each of 800 network
combinations by processing the training data set.
80.2
80.0
Q3 performance (%)
79.8
A
WA
SD1
SD2
79.6 SD3
Fig. 3. Performance vs. Number of predictions and balloting threshold. Mean and
standard deviations were obtained using random selections with ‘replacement’, which
tends to underestimate performance.
References
1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J.
Mol. Biol. 215, 403–410.
2. Altschul, S. F., Madden, T. L., Schffer, A. A., Zhang, J., Zhang, Z., Miller, W.
& Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402.
3. Bohr, H., Bohr, J. & Brunak, S., et al. (1988) FEBS Lett 241, 223–228.
4. Chandonia, J. M. & Karplus, M. (1999) Proteins 35, 293–306.
5. Chou, P. Y. & Fasman, G. D. (1974) Biochemistry 13, 211–222.
6. Cuff, J. A. & Barton, G. J. (1999) Proteins 34, 508–519.
7. Jones, D. T. (1999) J. Mol. Biol. 292, 195–202.
8. Kabsch, W. & Sander, C. (1983) Biopolymers 22, 2577–2637.
122 Claus Lundegaard et al.
9. Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J., Hansen, J. & Brunak,
S. (1997) Protein Engineering 10, 1241–1248.
10. Orengo, C.A., Bray, J.E., Hubbard T., LoConte L. & Sillitoe I. (1999) Proteins
Suppl 3, 149–170.
11. Petersen, T. N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S.,
Gippert, G. P. & Lund, O. (2000) Proteins 41, 17–20.
12. Rost, B. & Sander, C. (1993) J. Mol. Biol. 232, 584–599.
13. Rost, B. & Sander, C. (1994) Proteins 19, 55–72.
Self-consistent Knowledge-Based Approach to
Protein Design
1 Introduction
Two of the most investigated problems in molecular biology are protein fold-
ing and design. These problems stem from Anfinsen’s discovery [1] that the
sequence of amino acids of a naturally-occurring protein uniquely specifies
its thermodynamically stable native structure. The protein folding challenge
consists of predicting the native state of a protein from its sequence of amino
acids, while in protein design one is concerned to identify the amino acid se-
quences folding into a pre-assigned native conformation. This last issue, hav-
ing obvious practical and evolutionary significance, has attracted considerable
attention and effort of experimentalists and theorists [2, 3, 4, 5, 6, 7, 8].
The difficulty of the protein design problem is enormous because, in prin-
ciple, a rigorous approach [3, 7] would entail a simultaneous exploration of
both the family of viable sequences and the family of physical conformations.
By doing so, it would be possible to find the sequences having lower energy in
the target structure than in any other conformation. Stated mathematically,
to design a target structure Γ , one needs to identify the sequence of amino
acids, s, that maximizes the “occupation probability” according to Boltzmann
statistics:
exp (−βHs (Γ )) exp (−βHs (Γ ))
Ps (Γ ) = = (1)
exp (−βHs (Γ )) Zs
{Γ }
Ws (Γ ) ≡ Hs (Γ ) − Hs 0 , (2)
where the average · · · is carried out over all the conformations Γ that can
house s. Two main problems remain unsolved before the minimization of
Ws (Γ ) in sequence space can be exploited in automated contexts to design
protein structures. In fact neither the functional form of the effective energy
Hs is known, nor its average, · · · , is easily computable being dependent on
a large set of unknown conformations. In the next section we will introduce
a plausible simple form for Hs and will determine the corresponding Ws by
using data from a set of real proteins.
where i, j are the positions along the sequence of the amino acids and the sum
is taken over all possible pairs. B(si , sj ) represents the interaction strength
of the amino acid pair si and sj . However, only amino acids that are close
enough will interact in a non-negligible way. This is enforced with a suitable
weight function, or contact map.
Due to the linear dependence of the energy H on the contact map (the
only factors that contain geometric information about structures), the r.h.s.
of Eq. (2) can be re-casted into the following forms:
Self-consistent Knowledge-Based Approach to Protein Design 125
Hs = ∆ij B(si , sj ) , (4)
i<j
The average contact map, ∆(i, j), was obtained by collecting data on the
contact map of a variety of naturally occurring proteins. It turns out that
∆(i, j) is well reproduced as a (decreasing) function of the sequence sep-
aration, |i − j| when |i − j| < 16. Contacts between residues with sequence
separation larger than 16 are rather rare hence were modeled by assuming a
constant frequency of occurrence, ∆(0) . The value of ∆(0) is a free parameter
that is to be tuned separately for each protein length so that the average
number of overall contacts, i,j ∆i<j , matches the one observed in nature.
By using Eqs. (2) and (3), the scoring funcion Ws defined in Eq. (1) can
be re-written as:
Ws (Γ ) = (∆ij (Γ ) − ∆ij )B(si , sj ) , (5)
i<j
where the B(si , sj ) are parameters that have been set by requiring that the
scores Ws (Γ ) associated with the native sequences of 30 proteins on their own
native state be as small as possible, consistently with Eq. (2).
80
1erv
1fow
70 2imm
2mcm
3wrp
60
Success (%)
50
40
30
20
-2.5 -2 -1.5 -1 -0.5 0 0.5
W
Fig. 1. The success as a function of the cost function W (s, Γt ) = Hs (Γt ) − Hs per
site. Success is defined here by majority rule on a sampling of hundred (decorrelated)
sequences. The value of the cost function for the respective wild-type sequences is
between −0.48 and −0.78.
A further difficulty may stem from the fact that naturally occurring se-
quences may not have evolved to provide maximum thermodynamic stability,
Ws (Γ ) . Under these circumstances, one would observe that design solutions
which extremize Ws (Γ ) give a worse match with the natural protein than se-
quences with a higher Ws (Γ ) score. For this reason we chose to test the success
rate not only for the minimum value of W (s), but also for other sequences. In
particular it is interesting to compare all the sequences s with W (s) < W (s∗ ),
where s∗ is the wild-type sequence. For each annealing temperature we extract
100 decorrelated sequences and make statistical analysis on this sequence set.
We evaluate the average of W (s) for this set and a “super-sequence” by ap-
plying a pointwise majority rule to this set: for each site we assign the most
frequent amino acid class observed in this sequence set at the given location.
It appears that, indeed, the highest matching with the native sequence, is not
obtained for the lowest value of Ws (Γ ) , but for higher ones as shown in Fig. 1
. Furthermore one can see (figure not shown, see [12] for details) that the sites
that are assigned unambiguosly already at high values of Ws (Γ ) (i.e. early in
our stochastic minimization), have a high probability to match the natural
solution (for protein 1erv for the top 40 sites, there are 32 correct matches!).
It is tempting to conjecture that the residues that are assigned with very
little uncertainty by our design procedure (conserved design residues) could
also correspond to conserved residues in nature, i.e. amino acids that play a
fundamental role in the folding process.
This hypothesis is confirmed by the results of design attempts on two
heavily investigated proteins: barnase and chymotrypsin inhibitor [13, 14, 15].
Self-consistent Knowledge-Based Approach to Protein Design 127
Fig. 2. Scatter plot of the threshold value of the control parameter, T , at which
the designed class of each residue in barnase (sequentially labelled on the x axis) is
assigned with uncertainty less than 50 %. Lower values on the y axis correspond to
early locked sites in our design procedure. Circled dots represent sites belonging to
core1, core2 or core3 .
For each of them several experimental results are available which pinpoint the
key folding sites. For both proteins we found a striking agreement between
the conserved residues in our approach and the ones identified in mutagenesis
experiments. For example (see Fig. 2), for barnase, the residues for which the
class-locking occurs at high values of the control parameter, T , (i.e. large val-
ues of Ws (Γ ) ) correlate very well with the hydrophobic core1 which Fersht[14]
identified as the initiator of the folding transition. Analogous results hold for
CI2, where the top six conserved residues in our scheme contain the three
residues (ALA35, ILE76, LEU68) indicated by Itzhaki et al.[15] as the most
important in the folding process.
These striking results serve a two-fold purpose. On one hand they confirm
the validity of the present design approach; on the other they also show some of
its possible applications, in connection with the prediction of folding nucleus.
4 Summary
To summarize, we carried out automated protein design attempts over some
PDB conformations by introducing several novel strategies to identify optimal
energy-cost functions and select putative design solutions. A mere comparison
of designed sequences with the PDB ones gives a success rate between 40% and
55% when working with three classes of amino acids: a value well above the
random-guessing threshold. This success rate is not improving by introduc-
128 Andrea Rossi et al.
References
1. C. Anfinsen. Principles that govern the folding of protein chains. Science,
181:223–239, 1973.
2. S. Sun, R. Brem, R. Chan, and K. Dill. Designing amino acid sequences to fold
with good hydrophobic cores. Protein Eng., 8:1205–1213, 1995.
3. F. Seno, M. Vendruscolo, J. Banavar, and A. Maritan. Optimal protein design
procedure. Phys. Rev. Lett., 77:1901–1904, 1996.
4. J. Deutsch and T. Kurosky. New algorithm for protein design. Phys. Rev. Lett.,
76:323–326, 1996.
5. B. Dahiyat and S. Mayo. De novo protein design: fully automated sequence
selection. Science, 278(5335):82–87, 1997.
6. C. Micheletti, F. Seno, A. Maritan, and J. Banavar. Design of proteins with
hydrophobic and polar amino acids. Proteins: Structure Function and Genetics,
32:80, 1998.
7. C. Micheletti, J. Banavar, A. Maritan, and F. Seno. Protein structures and opti-
mal folding from a geometrical variational principle. Phys. Rev. Lett., 82:3372–
3375, 1999.
8. A. G. Street and L. S. Mayo. Computational protein design. Structure with
folding and design, 7:R105–109, 1999.
9. S. Kamtekar, J. Schiffer, H. Xiong, J. Babik, and M. Hecht. Protein design by
binary patterning of polar and nonpolar amino acids. Science, 262:1680–1685,
1993.
10. C. Chothia and A. M. Lesk. The relation between the divergence of sequences
and structures in proteins. EMBO J., 5:823–826, 1986.
11. C. Sander and R. Schneider. Database of homology-derived protein structures
and the structural meaning of sequence alignment. Proteins, 9:56–68, 1991.
12. A. Rossi, C. Micheletti, F. Seno and A. Maritan Self-consistent knowledge-based
approach to protein design. Bioph. J., 80, 2001 (in press).
13. A. Fersht. Structure and mechanism in proteinscience: a guide to enzyme catal-
ysis and protein folding. W.H. Freeman, New York, 1999.
14. A. R. Fersht. Optimization of rates of protein folding - the nucleation conden-
sation mechanism and its implications. Proc. Natl. Acad. Sci. USA, 92:10869–
10873, 1995.
Self-consistent Knowledge-Based Approach to Protein Design 129
Summary. This article deals with mathematical questions arising from the deter-
mination of protein structure from data obtained by solid-state nuclear magnetic
resonance (NMR). Solid-state NMR holds the promise of revealing the structure of
membrane proteins in a lipid bilayer. The derivation of protein structure from NMR
data has most often been done using proteins in liquid state, and the mathematical
analysis has been done using distance geometry and distance matrices. The math-
ematical analysis for solid state NMR uses orientational constraints rather than
distance constraints, and matrices of inner products rather than distance matri-
ces. Solving the structure from the data requires supplying a sequence of signs, a
situation somewhat analogous to the necessity to supply the phases to solve a struc-
ture from x-ray crystallographic data. Other problems in solving for the structure
arise from the condition that the gram determinants be non-negative, and this is
analogous problem in distance geometry that the distance matrix must satisfy the
conditions of the Cayley-Menger theorem.
1 Discrete Curves
It is convenient to think of a protein as a collection of discrete curves. A
discrete curve is a sequence of points p0 , . . . , pn in three dimensional space,
which can be thought of as atomic coordinates. The backbone is naturally
a discrete curve consisting of points representing the atoms –C1 –N–Cα –C1 –
proceeding from N-terminus to C-terminus. Although the atoms in side chains
have no natural sequential order, side chains are also formed from collection
of discrete curves.
A version of the Frenet frame for differentiable space curves can be given
for discrete space curves. Let
sj = |pj+1 − pj |
and define a unit tangent vector at pj , j = 0, . . . , n − 1, by
pj+1 − pj
tj = . (1)
sj
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 131-137, 2003.
Springer-Verlag Berlin Heidelberg 2003
132 John R. Quine and Timothy A. Cross
The discrete curve can be reconstructed from the sequences {tj } and {sj } by
k−1
pk − p1 = sj tj . (2)
j=1
If tj−1 and tj are not parallel, binormal and normal vectors can be given by
tj−1 × tj
bj = nj = bj × tj (3)
|tj−1 × tj |
are rotations about the x and z axes respectively. Note that θj is the exte-
rior bond angle at pj . Thus the discrete curve can be reconstructed up to a
Euclidean motion from the sequences {sj }, {tj−1 · tj } and the torsion angles
sequence {τj }, the bond lengths, bond angles and torsion angles.
The discrete Frenet frame at pj is related to one or more molecular frames
that can be defined there. A molecular frame can be defined for an atom
bonded to two others. Suppose atom B is bonded to atoms A and C, then a
right-handed orthogonal molecular frame M can be constructed by setting
pA − pB pC − pB u1 × u2
u1 = u2 = b= (7)
|pA − pB | |pC − pB | |u1 × u2 |
and forming the frame
M = (u1 , b × u1 , b) . (8)
This molecular frame is especially useful if the electron cloud is considered to
be symmetric the the plane of the three atoms, as is the case in the peptide
plane. In this case the local chemical and electric properties, as they relate to
NMR, are often assumed to be symmetric in the plane of u1 and u2 .
The frame M is related to the Frenet frame for the sequence pA , pB and
pC or the sequence pC , pB by a rotation leaving the plane of u1 and u2 fixed.
with respect to the magnetic field direction. This is in contrast to liquid state
NMR, where for each piece of data there is an ensemble of tensors T = R DR
where D is a fixed diagonal tensor and R is a random rotation with respect
to the laboratory frame. This rotation is random function both in the time
variable and in the statistical measure of the ensemble.
Because the NMR observable is a change in spin precession frequency
principally due to the chemical environment of the observed nucleus, each of
these tensors can be thought of as fixed in a molecular frame. Thus M TM is
constant and can be determined by powder experiments where the ensemble
of tensors is kept fixed with respect to time, but random with respect to the
statistical measure of the ensemble.
Generally, one of the principal axes of T is aligned with u1 × u2 and
where diag indicates a diagonal matrix and σ1 , σ2 , and σ3 are the principal
values of the tensor. The principal values and the parameter β are determined
from the powder experiment. A consequence of (9) is that for fixed bond angle,
and hence fixed u1 · u2 , the NMR observable B TB is a quadratic in u1 · B
and u2 · B, and setting it equal to its observed value gives and orientatioal
constraint.
The most common observables in our ss - NMR experiments are the
quadrupolar splitting, the dipolar splitting, and the chemical shift. The
quadrupolar and dipolar splittings give rise to zonal harmonic tensors with
2σ1 = −σ2 = −σ3 , and for these there is a unique principal major axis along a
bond direction. For the chemical shift tensors, the principal values are usually
distinct with σ1 the largest value, so that the maximum chemical shift occurs
in the u1 , u2 plane of the molecular frame.
Let
κj = B · tj βj = tj−1 · tj (10)
Proceeding under the assumption that the quadratic indices are known and
the sequence κj has been found, we solve for the unit vectors {tj }. Once the
unit tangent vectors are found, the atom coordinates can be found from the
sequence {sj } of bond lengths using (2).
We can solve recursively for {tj } under the condition that κ2j = 1 for all
j. Recall that we also assume that the bond angles are known, so that the
sequence βj is known.
To solve for the sequence {tj }, let
1 κj−1 κj
Gj = κj−1 1 βj (11)
κ j βj 1
be the matrix of inner products (gram matrix) of the vectors (B, tj−1 , tj ). Let
(13)
where εj = ±1 solves (10) recursively for the sequence {tj } of unit vectors
[7]. Note that
√
B · (tj × tj−1 ) = εj gj , (14)
i.e., εj is the sign of the above triple product. The εj are referred to as chi-
ralities.
The requirements for solving for the sequence {tj } using (13) are
1. κ2j = 1 for j = 1, . . . , n
2. gj ≥ 0 for j = 1, . . . , n − 1,
3. the choice of a sequence εj of chiralities.
Each of these requirements can lead to some problems in practice. The first
requirement can be a problem if one of the unit bond vectors to be determined
is parallel to B. In this case the sequence {tj } cannot be determined uniquely.
If, for example, one unit tangent vector is parallel to B then all of the sub-
sequent unit vectors can be rotated about B by the same rotation without
changing the bond angles or orientation constraints.
The second requirement is due to the fact that the gramian determinant
of three linearly independent vectors must be positive definite, since it gives
136 John R. Quine and Timothy A. Cross
the metric tensor. If some of the gj are computed to be negative, then some of
the data is bad, or the assumption about the bond angle is bad. The situation
is analogous to distance geometry where the distance matrix must be positive
semi-definite by the Cayley-Menger Theorem [3].
The third requirement indicates again there is another choice of signs re-
quired to solve for the structure from the orientational constraints.
The method of dealing with these problems varies. Occasionally the data
is good enough so that a sequence of orientations and bond angles can be sup-
plied so that 1. and 2. above are satisfied. The chiralities and the quadratic in-
dices must then be supplied and an initial structure obtained. There may be a
set of reasonable choices for these indices and several intitial structures. These
structures should then be refined using an energy penalty and a stereochem-
ical energy function in a way analogous to x-ray crystallographic refinement
[4] [5].
4 Acknowledgment
This work was supported by the National Science Foundation, DMB 9986036
(JRQ and TAC), DMS 9972858 (JRQ). This work was largely performed at
the National High Magnetic Field Laboratory supported by NSF Cooperative
Agreement (DMR 9527035) and the State of Florida.
References
1. T. A. Cross and J. R. Quine. Protein structure in anisotropic environments:
development of orientational constraints. Concepts in Magnetic Resonance, 12,
2000.
2. R. A. Engh and R. Huber. Accurate bond and angle parameters for x–ray protein
structure refinement. Acta Crystallogr. A, 47:392–400, 1991.
3. T. F. Havel and A. W. M. Dress. Distance geometry and geometric algebra.
Foundations of Physics, 23:1357–1374, 1993.
4. R. R. Ketchem, K.-C. Lee, S. Huo, and T. A. Cross. Macromolecular structural
elucidation with solid-state NMR–derived orientational constraints. J. Biomol.
NMR, 8:1–14, 1996.
5. R. R. Ketchum, B. Roux, and T. A. Cross. High-resolution polypeptide structure
in a lamellar phase lipid environment from solid state NMR derived orientational
constraints. Structure, 5:1655–1669, 1997.
6. A. Kvick, A.R. Al-Karaaghouli, and T.F. Koetzle. Deformation electron density
of α- glycylglycine at 82 K. I. the neutron diffraction study. Acta Crystallogr.,
B33:3796–3801, 1977.
7. J. R. Quine and T. A. Cross. Protein structure in anisotropic environments:
unique structural fold from orientational constraints. Concepts in Magnetic Res-
onance, 12, 2000.
Protein Structure from Solid-State NMR 137
t1
(a)
B t2
t1
(b)
(1)
B
t2
(2) t1
t2
(c)
Fig. 1. The recursion formula (13) for j = 2 is illustrated in the figure. The sphere
represents the set of unit vectors. The equations κ2 = B · t2 and β2 = t1 · t2 indicate
that given t1 , the the vector t2 lies on both of two circles about B and t1 .
1 Introduction
1.1 Background
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 139-145, 2003.
Springer-Verlag Berlin Heidelberg 2003
140 Flavio Seno et al.
1 1 6.5 − r
where : Ω(r) = Θ(4.65 − r) ∆(r) = + tanh (3)
2 2 2
Θ is the step function and rij is the distance (in Angstroms) between the
Cα atoms of the i-th and j-th amino acid. ∆ denotes the distance-dependent
strength of interactions between the i-th and the j-th amino acids along the
sequence mounted on the structure Γ , is the interaction matrix and Si
denotes the type of the ith amino acid in the sequence. Finally, r is a repulsive
term that penalizes cases where two non-consecutive pairs of C α ’s are closer
than 4.65 Å.
142 Flavio Seno et al.
20
20
(ndij − nnij )(i, j) + r (ndr − nnr ) ≡ aij (Γd ) · (i, j) + ar · r
i>j=1 i>j=1
≡ Q(Γd , ) > 0 . (4)
where nn,d
ij denotes the number of native/decoy contacts involving amino acids
of types i and j and nn,d
r denotes the strength of the native/decoy repulsive
term. Given the native state Γ and the sequence S, the 211 entries of ai,j (Γd )
plus ar depend on the geometrical properties of the decoys. For a given set of
M inequalities to be satisfied simultaneously, it is convenient to identify the
one (denoted with l) that, with the trial potentials is the worst satisfied one:
3 Results
We began by considering a single protein, PDB code: 1vcc, which has 77
residues. Starting from a set of random potentials, we generated 30 decoys for
Learning Effective Amino-Acid Interactions 143
which we computed the average root mean square deviation (RMSD),ḡ native
structure and its variance ∆ḡ. With these decoys, we found the potentials by
applying the perceptron algorithm. With the new interaction parameters, we
generated 30 more decoys and kept repeating this procedure.
In Fig. 1 we show the ḡ, as a function of the number of iterations. It is
remarkable that, with such a simplified model, ḡ can be decreased dramatically
from the initial value of about 6 Å to around 1.5-2.0 Å, which is just over the
order of the experimental uncertainty! This provides a nice demonstration of
the fact that it is possible to stabilize the native state in the native basin
within a low uncertainty.
Fig. 1. Results of the iterations on the single protein 1vcc: we show the asymptotic
RMSD of the decoy structures as a function of iteration.
4 Conclusions
Acknowledgements.
This article was supported by grants from INFM and MURST-COFIN99.
References
1. T.E. Creighton: Proteins: structure and molecular properties. W.H. Freeman
ed., New York 1992
2. G.M. Crippen: Prediction of protein folding from amino acid sequence over dis-
crete conformation space. Biochemistry 30 (1991) 4232-4237
3. F. Seno, A. Maritan and J.R. Banavar: Interaction potentials for protein folding.
Proteins: Structures, Function and Genetics 30 (1998) 244-248
4. C. Micheletti, F. Seno, A,. Maritan & J.R. Banavar: Learning effective amino acid
interaction through iterative stochastic technique. In press Proteins: Function,
Structure and Genetics (2000).
5. Bryngelson J. D. and Wolynes, P. G: Spin glasses and the statistical mechanics
of protein folding, Proc. Natl. Acad. Sci. USA 84 (1987) 7524-7528
6. C. Micheletti, J. R. Banavar, A. Maritan and F. Seno: Protein structures and
optimal folding from a geometrical variational principle. Phys. Rev. Lett. 82
(1999) 3372-3375
7. D.G. Covell and R. Jernigan: Conformations of folded proteins in restricted
spaces. Biochemistry 19 (1990) 3287
8. B.H. Park and M. Levitt: The complexity and accuracy of discrete state models
of protein structure. J. Mol. Biol. 249 (1995) 493-507
9. B.H. Park and M. Levitt: Energy functions that discriminate X-ray and near-
native folds from well-constructed decoys. J. Mol. Biol. 258 (1996) 367-392
Learning Effective Amino-Acid Interactions 145
10. W. Krauth & M. Mezard: Learning algorithms with optimal stability in neural
networks. J. Phys. A20 (1987) L745-L752
11. T. Lazaridis and M. Karplus: Effective energy functions for protein structure
prediction. Curr. Op. in Struct. Biol. 10 (2000) 139-145
Proteinlike Properties of Simple Models
CRPP, Avenue Albert Schweitzer, 33600 Pessac, and IRSAMC, Université Paul
Sabatier, 118 route de Narbonne, 31062 Toulouse Cédex, France.
During the last decade, it has been shown that several properties of natural
proteins can be captured by simple models, such as 2-D or 3-D lattice models
[1, 2, 5, 6, 9, 11, 13]. In these models, the protein is figured as a chain of beads
occupying the sites of a lattice in a self avoiding way.
C. Guerra, S. Istrail (Eds.): Protein Structure Analysis and Design, LNBI 2666, pp. 147-153, 2003.
Springer-Verlag Berlin Heidelberg 2003
148 Yves-Henri Sanejouand and Georges Trinquier
either hydrophobic (H) or polar (P) ones. Within the frame of this model,
there are 227 1.3.108 sequences and 103346 conformations[2], that is, 103346
different ways for a chain to go through all 27 lattice sites, going from a
site to a neighboring one at each step. One of these so-called hamiltonian
paths is shown in Fig. 1, for the following sequence: PHP4 HPHPHP15 H. Of
course, Fig. 1 can also be viewed as showing another possible conformation, the
reverse-labeled one, for the corresponding reverse-labeled sequence, namely,
HP15 HPHPHP4 HP. If such pairs of conformations are assumed to be identical,
the number of different compact conformations drops to 51704[6].
The fact that a sequence can be ”threaded” in a given structure in two different
ways, the position of the first residue in the first way being the position
of the last residue in the other way, is certainly not a property of natural
proteins, which are made with asymetric building blocks, namely, amino acids
of the L-series. As a consequence, the α−helix, one of their major structural
elements, is right-handed, and left-handed helices have not yet been observed
in the tridimensional structures of natural proteins. Note that the reverse-
labeled sequence of a given protein could quite well be synthesized. If buildt
with amino acids of the D-series, its tridimensional structure should be the
same as that of the natural protein, as far as the positioning of the amino
acid sidechains is concerned, except for proline residues, whose sidechain is
involved in a five-member ring with the backbone. However, while complete
syntheses of all-D proteins have been performed, to our knowledge, up to
now, following the seminal work of Shemyakin and its collaborators [12], only
small reverse-labeled all-D peptides, now called ”retro-inverso” peptides[3],
have been synthesized.
In order to study the sequence-structure relationship in the case of a given
model, it is necessary to choose an energy function. For lattice models, the
most usual one has the following form:
H= Eij ∆(ri − rj )
i<j
...............H.H.H.H..H. ...H..H.H...........H.H....
.............H.H...H.H..H. ...H..H.H.........H.....H..
...........H.....H.H.H..H. ...H..H.H.........H.H......
.........H.......H.H.H..H. ...H..H.H.......H.H........
.........H.....H...H.H..H. ...H..H.H.....H...H........
.........H...H.H.H..H..... ...H..H.H...H.............H
.........H.H.......H.H..H. ...H..H.H.H.........H......
.........H.H..H......H...H ...H..H.H.H.....H..........
.........H.H..H....H.....H ..H.............H.H.H....H.
.........H.H.H.....H....H. ..H...........H...H.H..H...
.........H.H.H...H......H. ..H...........H.H..H....H..
.........H.H.H...H..H..... ..H......H..H...H.H........
.........H.H.H.H........H. ..H.....H...........H.H..H.
.......H.....H.H.H..H..... ..H.....H......H..H.H......
.......H.H..H........H...H ..H.....H....H..H.H........
.......H.H..H......H.H.... ..H.....H..H..H.H..........
.......H.H..H....H.......H ..H.....H.H.H............H.
.......H.H.H..H......H.... ..H....H..............H.H.H
.......H.H.H.H......H..... ..H....H..H...H.H..........
......H..H.H.H.H.......... ..H....H..H.H.............H
.....H.......H.H.H..H..... ..H...H.............H.H..H.
..H..............H...H.H.H ..H...H....H..H.H..........
..H......H.....H.H.H...... ..H..H..H.H.H..............
..H....H.............H.H.H ..H.H...............H.H..H.
..H....H.....H.H.H........ ..H.H.............H.H..H...
..H....H.H.H.......H...... ..H.H.H.H................H.
..H..H...............H.H.H .H................H.H.H...H
..H..H.......H.H.H........ .H............H.H.H.......H
..H..H.H...............H.H .H........H.H.H...........H
..H..H.H.............H...H .H....H.H.H...............H
is, for each of them, a given conformation is the lowest-energy one, while, for all
of them, ∆, the energy gap between this conformation and the second lowest-
energy one, is: ∆ = 2. Strikingly, out of the 103346 possible conformations,
only 120 (0.11%) are found to be possible ground states. All of these so-
called remarkable structures are characterized by a large ”designability”[6],
as measured by Ns , the number of sequences they are the ground state of, Ns
ranging between 513 and 2306.
150 Yves-Henri Sanejouand and Georges Trinquier
In Fig. 1 the top structure[6] is shown, that is, the conformation which is
the ground state of the largest number of sequences. Among these 2306 se-
quences, PHP4 HPHPHP15 H has the smallest count of hydrophobic residues.
As a matter of fact, each remarkable structure is the ground state confor-
mation of a single five-hydrophobic residues sequence (all given in Table I),
and the corresponding five residues are always located as shown in Fig. 1,
that is, one of them being at the cube center, the four other ones being at
the center of facets that are non bonded to the residue at the cube center.
Furthermore, in all sequences sharing a given ground state, these five residues
are always hydrophobic[13]. The latter point helps to clarify why there are
120 remarkable conformations. Indeed, if a sequence like PHP4 HPHPHP15 H
has a non-degenerate ground state, this means that, out of the 103346 pos-
sible ones, there is only one way to bring its five hydrophobic residues close
together, so that each of them can interact with at least another hydrophobic
one[13]. In other words, it is from a topological point of view that the 120
remarkable conformations of the 3 × 3 × 3 cubic lattice model are atypical[8].
This is a quite satisfactory property of the model, since it means that the
same 120 conformations are also expected to be remarkable when different
Eij values are chosen. For instance, with Eij = EHH = −2 − γ, Eij = EHP =
−1, Eij = EP P = 0, and γ = 0.3, the conformation shown in Fig. 1 is still the
top structure, but with Ns = 3794[6], and the picture is now more complicated
than in the case of the additive potential. The difference between the additive
and nearly additive cases comes from the fact that the departure from addi-
tivity lifts the degeneracy of many sequences. Now energy gaps of ∆ = 0 + nγ
and ∆ = 2 ± nγ are observed, with n = 1, 2, etc[4]. For instance, with γ = 0.3,
nearly half of the sequences whose ground state is one of the remarkable con-
formations have energy gaps lower than 1.0[13]. Moreover, while 4.75% of the
sequences have a unique ground state, for the vast majority of them it is not
one of the remarkable conformations, and the corresponding energy gap is, on
average, close to 0.3, i.e., the γ value[6]. When only sequences with energy
gaps larger than 1.0 are considered, the picture obtained with the additive
potential is restored, despite a few minor differences arising from the overlap
of the energy gaps splitting around the ∆ = 0 and ∆ = 2 cases[13].
An interesting proteinlike property of the model allows for the determination
of all large gap sequences (i.e., with ∆ = 2, in the case of the additive po-
tential) whose ground state is a given remarkable conformation, without the
need of any huge enumeration[13]. To do so, starting for instance from the
corresponding remarkable five hydrophobic residues sequence given in Table
I, all 27 singly-mutated sequences are generated. Then, among them, those
with the same single ground state are retained. Next, for each sequence of this
subset, all 27 singly-mutated sequences are generated, and so on, until no new
sequence can be retained. What is shown through the success of such a proto-
col is that the Ns sequences of each of the 120 remarkable structures belong
to a ”neutral island” of the sequence space[13]. Note that such a property has
also been found in the case of square lattice models[1].
Proteinlike Properties of Simple Models 151
2 N-Soft-Spheres Models
Despite its many successes, the 3 × 3 × 3 cubic lattice model has several
drawbacks. In particular, the cubic geometry is a very unlikely one for a
polymer. Moreover, with such a geometry, one of the most spectacular prop-
erty of proteins, namely, their ability to fold into a given structure within a
limited amount of time, has to be studied during the course of Monte Carlo-
Metropolis processes[9, 11], that is, with a set of elementary displacements
chosen a priori. Noteworthy, with standard sets of displacements, it is diffi-
cult for a polymer to jump from a compact conformation to another, since
such a collective motion, during which several monomers are displaced, has
to be performed with local moves only, one or two monomers at a time. To
overcome this latter difficulty, and to allow for the study of the folding pro-
cess during the course of Molecular Dynamics simulations, it is necessary to
consider off-lattice models. However, with such models, one of the most ap-
pealing advantage of the 3×3×3 cubic lattice model is usually lost: the ability
to know, after some simple enumeration, which is the ground state and the
energy gap of any given sequence. The off-lattice model shortly introduced in
this section has been designed so as to retain most of the advantages of the
3 × 3 × 3 cubic lattice model. In this model, the polymer is represented as
a chain of N Lennard-Jones spheres of radius R2 , linked by bonds of length
R, and the sequence of the polymer is specified by ij , the well depth of the
interaction energy between two monomers. For instance, one can consider the
following case: ij = Eij , where Eij is the same as in lattice models. Thus,
the energy function is as follows:
152 Yves-Henri Sanejouand and Georges Trinquier
R 12 R6
H= −(Eij + Ec )( −2 )
i<j
rij rij
where rij is the distance between monomers i and j, and where Ec is a ”com-
paction term”, large enough to make sure that the lowest energy conformations
can be only one among the most compact. Such a term was also used in order
to study the folding process of 27-mers on a cubic lattice[11]. While in this
latter case the most compact geometry is the 3 × 3 × 3 cubic one, in the case
of N Lennard-Jones spheres, though usually more complicated, it is often a
very well defined one[14], like in the N = 19 case, shown in Fig. 2.
For such a family of models also, all the hamiltonian paths can be determined,
as well as the energy gap of any given sequence, and folding simulations of
large gap sequences can be performed. Such simulations are currently under
way.
References
1. E. Bornberg-Bauer. How are model protein structures distributed in sequence
space? Biophys. J., 73(5):2393–2403, 1997.
2. H. Chan and K. Dill. Compact polymers. Macromolecules, 22(12):4559–4573,
1989.
3. M. Chorev and M. Goodman. Recent developments in retro peptides and pro-
teins -an ongoing topochemical exploration. Trends Biotechnol., 13:438–445,
1995.
4. M. Ejtehadi, N. Hamedani, H. Seyed-Allaei, V. Shahrezaei, and M. Yahyane-
jad. Stability of preferable structures for a hydrophobic-polar model of protein
folding. Phys. Rev. E, 57(3):3298–3301, 1998.
5. N. Go. The consistency principle in protein-structure and pathways of folding.
Adv. Biophys., 18:149–164, 1984.
6. H. Li, R. Helling, C. Tang, and N. Wingreen. Emergence of preferred structures
in a simple model of protein folding. Science, 273:666–669, 1996.
7. H. Li, C. Tang, and N. Wingreen. Nature of driving force for protein folding: A
result from analyzing the statistical potential. Phys. Rev. letters, 79(4):765–768,
1997.
8. H. Li, C. Tang, and N. Wingreen. Are protein folds atypical? Proc. Natl. Acad.
Sci. USA, 95(9):4987–4990, 1998.
9. V. Pande and D. Rokhsar. Folding pathway of a lattice model for proteins.
Proc. Natl. Acad. Sci. USA, 96(4):1273–1278, 1999.
10. G. Raghunathan and R. Jernigan. Ideal architecture of residue packing and its
observation in protein structures. Protein Sci, 6(10):2072–2083, 1997.
11. A. Sali, E. Shakhnovich, and M. Karplus. How does a protein fold ? Nature,
369:248–251, 1994.
12. M. Shemyakin, Y. Ovchinnikov, and V. Ivanov. Topochemical investigations of
peptide systems. Angew. Chem. Int. Ed. Engl., 8:492–499, 1968.
13. G. Trinquier and Y.-H. Sanejouand. New protein-like properties of cubic lattice
models. Phys. Rev. E, 59(1):942–946, 1999.
Proteinlike Properties of Simple Models 153
14. D. Wales and J. Doye. Global optimization by basin-hopping and the lowest
energy structures of lennard-jones clusters containing up to 110 atoms. J. Phys.
Chem. A, 101:5111–5116, 1997.
List of Participants