Introduction To Structural Databases
Structural databases are essential tools for all crystallographic work and often need
to be consulted at several stages of the process of producing, solving, refining and
publishing the structure of a new material. Examples of such uses are:
The common information found in a structural database for each entry is:
The structures in the databases have been solved using X-ray, neutron and
electron diffraction techniques on samples that are generally single crystals but,
with advances in structure solution from powder diffraction data, may also be powders.
Some entries contain structures predicted by computational modelling and some
contain structures determined using NMR spectroscopy; these entries are generally
for protein samples.
The first protein structure, myoglobin, was solved in 1958, and over the following
three decades the number of structures solved and deposited in the Protein Data Bank
(PDB) grew only to the low thousands. At the time that the CATH and SCOP
databases were established there were only ~3000 protein structures in the PDB.
Currently, in 2015, there are over 100,000.
Although global structural characteristics (i.e. the folds of homologous proteins)
are largely conserved during evolution, it is the buried secondary structures in the core
of the protein domains that are most highly conserved. Studies comparing protein
domains showed that in more remote relatives, especially those from very distant
species, there can be considerable insertions/deletions of amino acid residues. These
usually occur in the loops connecting the core secondary structures and can be very
extensive, sometimes folding into additional secondary structures that decorate or
embellish the structural core of the domain.
The development of structure comparison algorithms by Rossmann and Argos
and Matthews and Remington in the 1970s prompted several large scale analyses of
protein structures and in 1976 Levitt and Chothia published a seminal paper which
classified proteins according to their dominant secondary structure composition [8].
Four classes were recognized: mainly alpha-helical, mainly beta-strand, alternating
alpha-beta and alpha plus beta structures. Other analyses by Thornton and Sternberg
recognised common motifs recurring in particular classes, for example the
right-handed beta-alpha-beta motifs recurrent in alpha-beta proteins and the
different classes of beta-turns.
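As a rough illustration of a composition-based class assignment of this kind, the
sketch below takes a per-residue secondary structure string (H for helix, E for
strand, anything else treated as coil) and returns one of the four classes. The
function name, the 15% threshold and the alternation count are assumptions made for
this example; they are not taken from the Levitt and Chothia paper.

```python
from itertools import groupby


def classify_by_composition(ss, min_fraction=0.15, min_alternations=3):
    """Assign a structural class from a secondary structure string.

    ss: one character per residue, 'H' = helix, 'E' = strand, other = coil.
    min_fraction and min_alternations are illustrative, hypothetical cutoffs.
    """
    if not ss:
        raise ValueError("empty secondary structure string")
    has_helix = ss.count("H") / len(ss) >= min_fraction
    has_strand = ss.count("E") / len(ss) >= min_fraction

    if has_helix and not has_strand:
        return "mainly alpha"
    if has_strand and not has_helix:
        return "mainly beta"
    if not has_helix and not has_strand:
        return "few secondary structures"

    # Both types are present: look at the order of helix and strand segments.
    # Coil is ignored, so consecutive helices separated by a loop count once.
    segment_types = [kind for kind, _ in groupby(c for c in ss if c in "HE")]
    switches = len(segment_types) - 1
    if switches >= min_alternations:
        return "alternating alpha/beta"
    return "alpha plus beta"


print(classify_by_composition("EEEECHHHHHCEEEECHHHHHCEEEE"))
print(classify_by_composition("HHHHHHHHCCEEEECCEEEE"))
```

Counting switches between helix and strand segments is only a crude stand-in for
the distinction between interleaved and segregated helices and strands.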
Early structure comparison algorithms used rigid-body superposition of the 3D
coordinates and struggled to achieve an optimal superposition of distant
homologues. Generally, they failed to converge on a solution where substantial
insertions/deletions (indels) and changes in secondary structure orientations had
occurred. In the late 1980s, therefore, more sophisticated approaches were explored
(SSAP, COMPARER and DALI, discussed in more detail below), which used a variety of
strategies for handling the shifts in secondary structure orientations and the
extensive indels between distant homologues. These more robust approaches enabled
large scale comparison and classification of structural relatives.
Structural approaches used to recognize fold similarities and homologues:
In the late 1980s the groups of Willie Taylor at NIMR and Tom Blundell at
Birkbeck College London adapted the dynamic programming algorithms used to
handle residue insertions and deletions in sequence alignment to cope with the
associated structural variations that these give rise to in 3D. Sali and Blundell
extended this strategy to compare a range of features between proteins and used a
Monte-Carlo optimization to obtain a structural superposition of relatives. This was
encoded in the COMPARER algorithm. Taylor and Orengo instead employed a double
dynamic programming strategy to go from 2D to 3D alignment and in 1989 developed
the SSAP algorithm. SSAP compares the ‘3D views’ of residues in the two proteins
and uses a summary level to accumulate the dynamic programming alignment ‘paths’
obtained by comparing views from similar structural contexts. A final application
of dynamic programming to this summary level determines the optimal alignment of
the structures.
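The two-level idea can be sketched as follows. This is a simplified illustration and
not the published SSAP method: each residue's 'view' is reduced here to its distances
to all other residues (an assumption made to keep the example short and independent
of orientation), the lower-level path is not forced to pass through the residue pair
being compared, and the scoring constants and gap penalty are hypothetical.

```python
import math


def dp_align(score, gap=1.0):
    """Global alignment by dynamic programming over a pairwise score matrix.

    Returns (total_score, aligned_pairs), where aligned_pairs is a list of
    (row, column) index pairs in increasing order.
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)
    # Trace back from the corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


def double_dp(coords_a, coords_b, gap=1.0):
    """Toy double dynamic programming alignment of two coordinate lists."""
    dist_a = [[math.dist(p, q) for q in coords_a] for p in coords_a]
    dist_b = [[math.dist(p, q) for q in coords_b] for p in coords_b]
    n, m = len(coords_a), len(coords_b)
    summary = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Lower level: compare the view from residue i with that from j.
            view = [[10.0 / (abs(dist_a[i][k] - dist_b[j][l]) + 1.0)
                     for l in range(m)] for k in range(n)]
            _, path = dp_align(view, gap)
            # Accumulate the scores along the lower-level path.
            for k, l in path:
                summary[k][l] += view[k][l]
    # Upper level: align on the accumulated summary matrix.
    return dp_align(summary, gap)


total, pairs = dp_align([[5, 0, 0], [0, 0, 5]])
print(total, pairs)  # 9.0 [(0, 0), (1, 2)]
```

Calling double_dp with two lists of (x, y, z) tuples returns the summary-level score
and the aligned residue index pairs. The cost grows with the square of the product of
the two chain lengths, which is consistent with the method being comparatively slow.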
SSAP was demonstrated to be robust enough to cope with significant variations
between homologues and revealed interesting ancestral relationships, such as that
between the globins and plastocyanins, that were undetectable using sequence data
alone. Although the algorithm is relatively slow compared to DALI, COMPARER
and the more recent STRUCTAL and FATCAT algorithms, this was not problematic in
the late 1980s when there were fewer than 2000 protein structures in the PDB (a
faster version of the method, CATHEDRAL, is now available).
In 1997 the SSAP algorithm was modified to increase the speed of large scale
comparisons within the PDB by employing a filter that only allows comparison of
proteins having sufficiently similar secondary structure arrangements and connectivity
in their common structural core. This new approach, CATHEDRAL, is nearly 1000
times faster than SSAP, allowing CATH to remain up to date with the PDB.
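The general pattern of a cheap filter placed in front of an expensive comparison can
be sketched as below. This is not the CATHEDRAL filter itself: it is assumed here
that each structure is summarised only by the order of its secondary structure
elements (H for helix, E for strand), and a pair is passed on when the longest
common subsequence covers a hypothetical fraction of the shorter string. The
function names and the 0.7 threshold are invented for the illustration, and the 3D
arrangement of the elements is ignored.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def passes_filter(sse_a, sse_b, min_shared=0.7):
    """True if the two element strings share enough ordered elements."""
    shorter = min(len(sse_a), len(sse_b))
    if shorter == 0:
        return False
    return lcs_length(sse_a, sse_b) / shorter >= min_shared


def pairs_to_compare(library, min_shared=0.7):
    """Return only the pairs worth passing to a slower residue-level method."""
    names = sorted(library)
    return [(p, q)
            for index, p in enumerate(names)
            for q in names[index + 1:]
            if passes_filter(library[p], library[q], min_shared)]


# Hypothetical element strings for three domains.
library = {"a": "EHEHE", "b": "EHEHEH", "c": "HHHHH"}
print(pairs_to_compare(library))  # [('a', 'b')]
```

Only the pairs returned by the filter would then be passed to the slower
residue-level alignment.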
Both SCOP and CATH further classified the domain superfamilies and fold
groups according to their architecture, where architecture describes the orientation of
the secondary structure elements in 3D regardless of their connectivity. However, in
CATH this was a formal level in the hierarchy whilst in SCOP architecture was simply
an annotation. Finally, domains were assigned to protein classes depending on the
composition of secondary structure elements, i.e. whether they were all-alpha, all-beta,
or mixtures of alpha and beta. SCOP used more classes than CATH to capture these
divisions, but most domains fall into similar categories in the two classifications.
Domain recognition:
Perhaps a major philosophical difference between the CATH and SCOP
classifications is in the approaches used to identify domains within multi-domain
protein structures. Domain recognition is problematic in that no formal quantitative
definition of a domain exists. However, heuristic approaches search for compact,
globular units with hydrophobic cores and more contacts between residues within the
domain unit than between domain units. Furthermore, secondary structures are
unlikely to be shared between domains. These physical criteria have been encoded in
a wide range of different algorithms since the 1990s. To identify domains in CATH,
three such independent ab initio methods (PUU, DETECTIVE, DOMAK) based on
these concepts are applied and the results compared to guide manual assignment of
domain boundaries.
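The contact criterion can be illustrated with a toy example. The sketch below is
not PUU, DETECTIVE or DOMAK: it assumes one representative point per residue, a
hypothetical 8 Angstrom contact cutoff and a single cut that divides the chain into
two contiguous domains, and it simply picks the cut with the fewest contacts across
it. All function names and parameters are invented for the illustration.

```python
import math


def contact_pairs(coords, cutoff=8.0):
    """All residue index pairs (i, j), i < j, closer than the cutoff."""
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) <= cutoff]


def split_counts(contacts, boundary):
    """Count contacts within and between the two domains of a single cut.

    Residues with index < boundary form the first domain, the rest the second.
    Returns (intra_domain_contacts, inter_domain_contacts).
    """
    intra = sum(1 for i, j in contacts if (i < boundary) == (j < boundary))
    return intra, len(contacts) - intra


def best_boundary(coords, cutoff=8.0, min_size=3):
    """Cut position with the fewest inter-domain contacts.

    min_size keeps both domains from becoming trivially small; chains
    shorter than 2 * min_size have no valid cut and raise ValueError.
    """
    contacts = contact_pairs(coords, cutoff)
    candidates = range(min_size, len(coords) - min_size + 1)
    return min(candidates, key=lambda b: split_counts(contacts, b)[1])


# Two compact, well separated clusters of five residues each.
cluster = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
chain = cluster + [(x + 100, y, z) for x, y, z in cluster]
print(best_boundary(chain))  # 5
```

Counting contacts across a single cut cannot handle domains made of discontiguous
segments, which is one reason automatic assignments need checking.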