Introduction To Structural Databases

The document discusses structural databases for biological macromolecules like proteins and nucleic acids. It provides examples of common structural databases, describes the type of information typically contained in database entries, and discusses the historical development and approaches used to classify structures and recognize structural similarities and evolutionary relationships.

Uploaded by

sumit mahajan

Sumit Mahajan MSCII Biotechnology

Review of Structural Databases of Nucleotides and Macromolecules

Introduction to Structural Databases:

Structural databases are essential tools for all crystallographic work and often need
to be consulted at several stages of the process of producing, solving, refining and
publishing the structure of a new material. Examples of such uses are:

i. Before deciding to synthesise a new compound the database could be used to
check how many compounds with a particular chemical composition have been
reported.
ii. After synthesising and indexing the unit cell of a material the database can be
searched to see if a material with the same or a similar unit cell is already
known.
iii. If a material is found in the database with a similar unit cell to the new material
then its structure may be close enough (i.e. same symmetry and similar unit
cell contents) to be used as the starting model for the Rietveld refinement of the
new material.
iv. To verify the results of a structure refinement the database can be consulted to
find structures that have comparable bond distances, bond angles or
coordination environments to the new structure.

The common information found in a structural database for each entry is:

1. bibliographic information - author(s) names, journal reference
2. the chemical compound name, formula and oxidation states of the elements
present
3. the contents (number of formula units per unit cell), dimensions and symmetry
(crystal system and space group) of the unit cell
4. the symmetry of the structure, atomic coordinates, occupancies and thermal
parameters (isotropic or anisotropic)
5. comments on any special features of the experiment to collect the diffraction
data, for example, the temperature, and on any problems found in the structure
itself.

Not every database holds all of the above information for every entry, but all
databases contain the information listed in points (1) to (3). All of the above data
is stored explicitly for each entry. Implicit data is not stored but can be calculated
from the explicit data. It includes such things as interatomic distances, bond angles,
torsion angles, coordination numbers and structural representations, all of which are
generated by software within the database suites.
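
As an illustration of how implicit data follows from the explicit data, the sketch below derives an interatomic distance and a bond angle from unit cell parameters and fractional atomic coordinates. It is a minimal example, not the method of any particular database suite: the function names are hypothetical, the axis convention (a along x, b in the xy-plane) is one common choice, and symmetry operations and periodic images are ignored.

```python
import math

def cell_to_matrix(a, b, c, alpha, beta, gamma):
    """Return the 3x3 matrix that maps fractional to Cartesian coordinates.

    Cell lengths are in angstroms and cell angles in degrees.
    Assumed convention: a lies along x and b lies in the xy-plane.
    """
    al, be, ga = (math.radians(x) for x in (alpha, beta, gamma))
    cos_al, cos_be, cos_ga = math.cos(al), math.cos(be), math.cos(ga)
    sin_ga = math.sin(ga)
    # Cell volume divided by a*b*c
    v = math.sqrt(1 - cos_al**2 - cos_be**2 - cos_ga**2
                  + 2 * cos_al * cos_be * cos_ga)
    return [
        [a, b * cos_ga, c * cos_be],
        [0.0, b * sin_ga, c * (cos_al - cos_be * cos_ga) / sin_ga],
        [0.0, 0.0, c * v / sin_ga],
    ]

def to_cartesian(frac, m):
    """Convert one fractional coordinate triple to Cartesian angstroms."""
    return tuple(sum(m[i][j] * frac[j] for j in range(3)) for i in range(3))

def distance(p, q):
    """Interatomic distance between two Cartesian points."""
    return math.dist(p, q)

def angle(p, q, r):
    """Angle p-q-r in degrees, with q at the vertex."""
    u = [p[i] - q[i] for i in range(3)]
    w = [r[i] - q[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(u, w))
    cos_angle = dot / (math.hypot(*u) * math.hypot(*w))
    # Clamp to guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
```

For a cubic cell with a = 4.0 Å, atoms at fractional positions (0, 0, 0) and (0.5, 0, 0) are 2.0 Å apart, and the angle they form with an atom at (0, 0.5, 0) is 90° at the origin atom.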

The structures in the databases have been solved using X-ray, neutron and
electron diffraction techniques on samples that are generally single crystals but,
with the advances in structure solution from powder diffraction data, may also be
powders. Some entries have structures predicted from computational modelling
and some have structures determined using NMR spectroscopy. These entries
generally occur for protein samples.

One important point to note is the difference between these structural
databases and the database of powder diffraction files (ICDD-PDF). The latter
contains the "fingerprint" powder diffraction pattern of crystalline materials whose
structure may or may not be known. The structural databases contain structural
information for each material (the unit cell at least) derived from analysis of diffraction
data.

Some Examples of Structural Databases:

• SCOP: The Structural Classification of Proteins, the very first and best known
protein structure classification scheme.
• CATH: CATH stands for Class, Architecture, Topology and Homology and
describes a hierarchy of four levels of increasingly detailed structural properties
for each protein in the PDB.
• BMRB: The BioMagResBank (BMRB: URL http://www.bmrb.wisc.edu) is
the publicly-accessible depository for the results of NMR experiments on
peptides, proteins, and nucleic acids, and is recognized by the International
Society of Magnetic Resonance.
• MSD: The Macromolecular Structure Database, a relational database
representation of cleaned Protein Data Bank (PDB) data.
• 3DSeq: 3D sequence alignment server, providing annotation of the alignments
between sequence databases and the PDB.
• FSSP: Based on exhaustive all-against-all 3D structure comparison of protein
structures currently in the Protein Data Bank (PDB).
• DALI: Fold classification based on structure-structure assignments.
• 3Dee: Database of protein domain definitions wherein the domains have been
clustered on sequence and structural similarity.
• NDB: Nucleic Acid Structure Database.

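
The four-level hierarchy described for CATH can be made concrete with a small sketch. It assumes the dotted four-number form in which CATH superfamily identifiers are usually written (one number each for Class, Architecture, Topology and Homologous superfamily). The helper names and the example identifier are chosen here purely for illustration.

```python
from typing import List, NamedTuple

class CathCode(NamedTuple):
    """The four levels of a CATH superfamily identifier."""
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_code(code: str) -> CathCode:
    """Split an identifier such as '3.40.50.720' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def hierarchy_path(code: str) -> List[str]:
    """List the identifier of each level, from class down to superfamily."""
    parts = code.strip().split(".")
    return [".".join(parts[:k]) for k in range(1, len(parts) + 1)]
```

Two domains whose identifiers share the first three numbers sit in the same fold (topology) group, and sharing all four places them in the same homologous superfamily.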
Historical Background:
The major structural classifications, SCOP and CATH, were established in the
mid-1990s. Several studies had shown the extent to which protein structures are
conserved during evolution, which suggested that 3D structure was a valuable fossil
capturing the essential features of an evolutionary protein family and making it
possible to identify even very remotely related proteins through similarities in their
structures.

The first protein structure, myoglobin, was solved in 1958, and for the following
three decades the number of structures solved and deposited in the Protein Data Bank
(PDB) only grew to the low thousands. At the time that the CATH and SCOP
databases were established there were only ~3000 protein structures in the PDB.
By 2015 there were over 100,000.
Although global structural characteristics (i.e. the folds of homologous proteins)
are largely conserved during evolution, it is the buried secondary structures in the core
of the protein domains that are most highly conserved. Studies comparing protein
domains showed that in more remote relatives, especially those from very distant
species, there can be considerable insertions/deletions of amino acid residues. These
usually occur in the loops connecting the core secondary structures and can be very
extensive, sometimes folding into additional secondary structures that decorate or
embellish the structural core of the domain.
The development of structure comparison algorithms by Rossmann and Argos
and by Matthews and Remington in the 1970s prompted several large scale analyses
of protein structures, and in 1976 Levitt and Chothia published a seminal paper which
classified proteins according to their dominant secondary structure composition [8].
Four classes were recognized: mainly alpha-helical, mainly beta-strand, alternating
alpha-beta and alpha plus beta structures. Other analyses by Thornton and Sternberg
recognised common motifs recurring in particular classes, for example the
right-handed beta-alpha-beta motifs recurrent in alpha-beta proteins and different
classes of beta-turns.
Early structure comparison algorithms exploited rigid body algorithms to
superimpose the structures based on the 3D coordinates and the methods struggled
to achieve an optimal superimposition between distant homologues. Generally, they
failed to converge on a solution where substantial insertions/deletions (indels) and
changes in secondary structure orientations had occurred. Therefore in the late 1980s,
more sophisticated approaches were explored (SSAP, COMPARER, DALI discussed
in more detail below) which used a variety of strategies for handling the shifts in
secondary structure orientations and the extensive indels between distant
homologues. These more robust approaches enabled large scale comparisons and
classification of structural relatives.
Structural approaches used to recognize fold similarities and homologues:
In the late 1980s the groups of Willie Taylor at NIMR and Tom Blundell at
Birkbeck College London adapted the dynamic programming algorithms used to
handle residue insertions and deletions in sequence alignment to cope with the
associated structural variations that these give rise to in 3D. Sali and Blundell
extended this strategy to compare a range of features between proteins and used a
Monte-Carlo optimization to obtain a structural superposition of relatives. This was
encoded in the COMPARER algorithm. Taylor and Orengo instead employed a
double dynamic programming strategy to go from 2D to 3D alignment, and in 1989
developed the SSAP algorithm. SSAP compares ‘3D views’ between residues in the
two proteins and uses a summary level to accumulate all the dynamic programming
alignment ‘paths’ obtained by comparing views from similar structural
contexts. A final application of dynamic programming to this summary level
determines the optimal alignment of the structures.
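
The dynamic programming step that these methods borrowed from sequence alignment can be sketched as follows. This is a plain single-level global alignment over a precomputed residue-residue score matrix. It is not SSAP's double dynamic programming or the COMPARER procedure, and the scoring helper (similarity of one numeric per-residue feature), the gap penalty and all function names are assumptions made for illustration.

```python
def feature_scores(a, b, match=2.0):
    """Score matrix from one numeric feature per residue (hypothetical scoring)."""
    return [[match - abs(x - y) for y in b] for x in a]

def align(score, gap=-1.0):
    """Global alignment by dynamic programming over a score matrix.

    score[i][j] is the similarity of residue i in protein A to residue j
    in protein B. Returns the aligned (i, j) index pairs; residues that
    appear in no pair are insertions or deletions.
    """
    n, m = len(score), len(score[0])
    # best[i][j]: best total score aligning the first i residues of A
    # with the first j residues of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # pair i with j
                best[i - 1][j] + gap,                      # residue of A unpaired
                best[i][j - 1] + gap,                      # residue of B unpaired
            )
    # Trace back from the corner to recover one optimal path
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

With feature values [1, 5, 9, 3] for one protein and [1, 9, 3] for the other, the alignment pairs residues 0, 2 and 3 of the first with residues 0, 1 and 2 of the second, leaving residue 1 as an insertion. In a double dynamic programming scheme, a comparison of this kind would be run once per candidate residue pair and the resulting paths accumulated before a final pass.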
SSAP was demonstrated to be robust enough to cope with significant variations
between homologues and revealed interesting ancestral relationships, such as that
between the globins and plastocyanins, that were undetectable using sequence
data alone. Although the algorithm is relatively slow compared to DALI, COMPARER
and the more recent STRUCTAL and FATCAT algorithms, this was not problematic in
the mid 80s when there were fewer than 2000 protein structures in the PDB (a faster
version of the method is now available: CATHEDRAL).

Figure 1 : Homepage of CATH database

In 1997 the SSAP algorithm was modified to increase the speed for large scale
comparisons within the PDB, by employing a filter that only allows comparison of
proteins having sufficiently similar secondary structure arrangements and connectivity
in their common structural core. This new approach - CATHEDRAL - is nearly 1000
times faster than SSAP allowing CATH to remain up to date with the PDB.

Figure 2 : Basic info and current statistics of CATH

The SCOP classification was largely constructed using manual evaluation of
domain relationships although available algorithms such as BLAST and DALI were
sometimes employed to guide this process. Despite the different approaches used
between CATH and SCOP (i.e. largely manual for SCOP and semi-automatic using
SSAP followed by manual curation for CATH) the two classifications identify similar
numbers of fold groups and homologous superfamilies and comparisons between
SCOP and CATH show a reasonable degree of equivalence between these structural
groupings.

Figure 3 : Homepage of SCOP database

Both SCOP and CATH further classified the domain superfamilies and fold
groups according to their architecture, where architecture describes the orientation of
the secondary structure elements in 3D regardless of their connectivity. However, in
CATH this was a formal level in the hierarchy whilst in SCOP architecture was simply
an annotation. Finally, domains were assigned to protein classes depending on the
composition of secondary structure elements, i.e. whether they were all-alpha, all-beta,
or mixtures of alpha and beta. SCOP used more classes than CATH to capture these
divisions but most domains fall into similar categories in the two classifications.

Domain recognition:
Perhaps a major philosophical difference between the CATH and SCOP
classifications is in the approaches used to identify domains within multi-domain
protein structures. Domain recognition is problematic in that no formal quantitative
definition of a domain exists. However, heuristic approaches search for compact,
globular units with hydrophobic cores and more contacts between residues within the
domain unit than between domain units. Furthermore, secondary structures are
unlikely to be shared between domains. These physical criteria have been encoded in
a wide range of different algorithms since the 1990s. To identify domains in CATH,
three such independent ab initio methods (PUU, DETECTIVE, DOMAK) based on
these concepts are applied and the results compared to guide manual assignment of
domain boundaries.
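
The contact criterion can be sketched in a few lines. The example below is not an implementation of PUU, DETECTIVE or DOMAK. It only illustrates the idea of preferring a chain split that leaves most residue contacts inside the two candidate domains. The 8 Å C-alpha cutoff, the minimum sequence separation, the restriction to a single split point and the function names are all assumptions.

```python
import math

def contact_counts(coords, boundary, cutoff=8.0, min_separation=3):
    """Count contacts within and between two segments split at `boundary`.

    coords: list of (x, y, z) C-alpha positions in chain order.
    Residues [0, boundary) form one candidate domain and residues
    [boundary, n) the other. Returns (intra, inter).
    """
    intra = inter = 0
    n = len(coords)
    for i in range(n):
        # Skip near neighbours in sequence, which are always close in space
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                if (i < boundary) == (j < boundary):
                    intra += 1
                else:
                    inter += 1
    return intra, inter

def best_boundary(coords, min_size=3, **kwargs):
    """Return (boundary, crossing_fraction) for the split with the
    smallest fraction of contacts crossing it, or None if no contacts."""
    best = None
    for b in range(min_size, len(coords) - min_size + 1):
        intra, inter = contact_counts(coords, b, **kwargs)
        total = intra + inter
        if total == 0:
            continue
        fraction = inter / total
        if best is None or fraction < best[1]:
            best = (b, fraction)
    return best
```

Real domains are often built from discontiguous stretches of chain, which is one reason a single split point is only a starting point and why the results of several methods are compared before manual assignment.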

Superfolds and the likely existence of limited folding arrangements in nature:
Perhaps the most interesting revelation to emerge from the structural
classification data was the highly uneven distribution observed in the populations of
the fold groups. In 1994 Orengo, Jones and Thornton reported the existence of ten
‘superfolds’ accounting for nearly 50% of all domain relatives in CATH. Many of the
most highly populated superfamilies adopt TIM barrel, Rossmann and other folds
which possess very regular architectures, i.e. layers of beta-sheets and/or
alpha-helices. This regularity could be one factor explaining their frequent occurrence
in nature. For example, these arrangements might be expected to accommodate
mutations more easily because secondary structures would be more able to slide
relative to each other, meaning that changes in residue size would be less likely to
disrupt the core packing arrangements. Furthermore, the large central
super-secondary features, e.g. beta-sheets or beta-barrels, provide a stable core.
Theoretical analyses have also suggested that these folding arrangements would be
able to support large numbers of diverse sequences.
Based on the number of diverse sequence families found across the SCOP
classification and the proportion of all known sequence data that this represented,
Chothia postulated that there could be fewer than 1000 fold groups in nature, a
relatively small number compared to the tens of thousands of known proteins at that
time and therefore an exciting hypothesis which suggested that the use of structural
classifications would make an understanding of protein evolution tractable.
The structural classification data and large scale comparisons of domain structures
also revealed novel folding motifs (split beta-alpha-beta motifs) common to a large
proportion of alpha-beta domain superfamilies in which structures comprise a central
antiparallel beta-sheet covered by a layer of alpha-helices (alpha-beta-plait folds).
Superfamilies adopting this fold contain one or two of these ‘split beta-alpha-beta
motifs’. They resemble the very common alpha-beta motifs earlier reported by
Thornton and Sternberg, in which two parallel strands are connected by an
alpha-helix, but in the ‘split beta-alpha-beta motifs’ the beta-strands are effectively
split by a third antiparallel beta-strand which hydrogen bonds to them both.

Exploiting domain structure superfamilies in CATH to examine the evolution of protein functions:
The expansion of CATH superfamilies with sequence data considerably
increased the amount of functional data too, allowing large scale studies of the
divergence of function within superfamilies during evolution. These studies showed
that whilst relatives in most superfamilies shared a common function, in the most
highly populated superfamilies considerable divergence of sequence and function had
occurred. A detailed study of 31 such diverse superfamilies revealed the molecular
mechanisms by which functions had changed. These phenomena ranged from small
local changes e.g. mutations of residues in the active site (which modified chemistry
or substrate specificity) or insertions of residues around the active site (which largely
affected binding of substrates); through to fusions of domains with different partners
(which could in turn modify active site geometries). Mutations and residue insertions
in other sites on the protein surface could bring about changes in protein interactions
or changes in oligomerisation state, again altering active site geometries.

Functional sub-classification in CATH-Gene3D and what this reveals about the evolution of enzyme active sites:
More recently, the extreme divergence of functional properties of relatives in
some highly populated CATH superfamilies prompted the development of protocols to
sub-classify superfamilies into functional families (termed FunFams). This was
achieved using a profile-based protocol that recognizes differences in
specificity-determining residues between putative families. This is a challenging task
as it requires sufficient sequence diversity across a FunFam to enable detection of
conserved residues. As a result, it is harder to distinguish functional groups that have
a narrow species distribution, and these groups will tend to merge with functionally
close families. Nevertheless, independent validation by an international assessment
(CAFA) showed the FunFams to be highly competitive in providing functional
annotations.
CATH-Gene3D currently identifies 110,000 functional families within 2700
superfamilies. 360 of these superfamilies comprise a single FunFam. In contrast, 350
of the largest superfamilies account for 75% of the FunFams. These are large,
universal superfamilies found in all kingdoms of life and accounting for more than 60%
of all predicted domain sequences in CATH-Gene3D.
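
The skew in these figures is easier to see with a short worked calculation. The sketch below uses only the numbers quoted above. The function name and dictionary keys are hypothetical, and the averages are simple means that say nothing about the spread within each group.

```python
def funfam_breakdown(total_funfams=110_000, total_superfamilies=2_700,
                     large_superfamilies=350, large_share=0.75):
    """Split the quoted FunFam total between the largest superfamilies
    and the remainder, and report the mean FunFams per superfamily."""
    in_large = total_funfams * large_share
    in_rest = total_funfams - in_large
    rest_superfamilies = total_superfamilies - large_superfamilies
    return {
        "funfams_in_large": in_large,                         # 82,500
        "mean_per_large": in_large / large_superfamilies,     # about 236
        "funfams_in_rest": in_rest,                           # 27,500
        "mean_per_rest": in_rest / rest_superfamilies,        # about 12
    }
```

On these numbers the 350 largest superfamilies hold roughly 236 FunFams each on average, against roughly 12 each for the remaining 2,350 superfamilies, a ratio of about twenty to one.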
Sub-classification into functional families allows comparison of functional sites
between relatives across a superfamily and gives insights into evolutionary
mechanisms underlying shifts in function. Information on functional sites is largely
restricted to relatives of known 3D structure, which on average comprise less than 10%
of sequences within the superfamily (or less in some of the more ubiquitous and
diverse superfamilies). Comparisons of interfaces across superfamilies showed that
functionally diverse relatives were exploiting different surface patches on the structure,
i.e. distinct sites, depending on their interaction partner, and that paralogues shared few
common interactors. However, there was often one location that was exploited more
frequently than others by diverse paralogues.
Conclusions:
The SCOP and CATH classifications organize the 3D structure of proteins into
evolutionary classifications that have enabled detailed studies of the molecular
mechanisms by which new protein structures and functions evolve. The sequence
patterns and fold libraries that they provide have enabled prediction of structural
relatives thereby providing structural annotations for more than 50 million domain
sequences, available on their sister sites (Gene3D, Superfamily respectively) and in
InterPro. The predicted data revealed the power law bias in superfamily populations
whereby most superfamilies are small but a few hundred are universal and very highly
populated. Combination of the sequence and structure data has supported large
scale comparative genome studies which revealed changes in domain architecture
across different species modifying the functional repertoires of those species. They
have also enabled phylogenetic studies that traced the evolution of different
chemistries within enzyme superfamilies; and structural studies that revealed the
changes in the catalytic machineries that bring about these functional shifts. CATH
superfamilies have also been used to detect patterns of domain presence and
absence in genomes that allow predictions of protein interactions.
Although ~20-25% of domain sequences in the genomes do not currently map
to any structural superfamilies in CATH or SCOP, the analyses of structures solved by
the structural genomics initiatives in the United States (which targeted structurally
uncharacterized domain families in Pfam for structural determination) showed that
once the structures of these superfamilies had been solved they revealed a structural
or evolutionary relationship with an existing fold group or superfamily in SCOP or
CATH. In fact, nearly 98% of all new structures deposited in the PDB can be classified
in an existing CATH superfamily, suggesting that these classifications now account for
the majority of superfamilies in nature.
Over the last few years collaborations between SCOP and CATH have led to
mappings between these resources that help to confirm detection of very remote
homologues. Future collaborations are likely to enhance the quality of the data in both
resources by removing errors and sharing curation tasks to enable these resources to
keep pace with the still exponential increases in the structure and sequence data.
REFERENCES:
• Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C. SCOP: a
structural classification of proteins database. Nucleic Acids Res. 2000;28(1):257–259.
doi:10.1093/nar/28.1.257
• Sillitoe I, Dawson N, Thornton J, Orengo C. The History of the CATH
Structural Classification of Protein Domains. Biochimie (2015).
doi:10.1016/j.biochi.2015.08.004
• Jones D. Structural Databases. Department of Biological Sciences,
University of Warwick, Coventry, UK. Genetics Databases. ISBN 0-12-101625-0.
• https://doi.org/10.1016/j.bbamem.2018.01.005