GKL 789
21 6195–6204
doi:10.1093/nar/gkl789
Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology,
Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110 016, India
ABSTRACT
We describe here an energy based computer software suite for narrowing down the search space of tertiary structures of small globular proteins. The protocol comprises eight different computational modules that form an automated pipeline. It combines physics based potentials with biophysical filters to arrive at 10 plausible candidate structures starting from sequence and secondary structure information. The methodology has been validated here on 50 small globular proteins consisting of 2–3 helices and strands with known tertiary structures. For each of these proteins, a structure within 3–6 Å RMSD (root mean square deviation) of the native has been obtained in the 10 lowest energy structures. The protocol has been web enabled and is accessible at http://www.scfbio-iitd.res.in/bhageerath.

INTRODUCTION
The tertiary structure prediction of a protein using amino acid sequence information alone is one of the fundamental unsolved problems in computational biology/molecular biophysics (1). The folding of protein molecules with a large number of degrees of freedom spontaneously into a unique three-dimensional (3-D) structure is of scientific interest intrinsically and due to its application in structure based drug design endeavors. The cost and time factors involved in experimental techniques urge for an early in silico solution to protein folding problem (2). The ultimate goal is to use computer algorithms to identify amino acid sequences that not only adopt particular 3-D structures but also perform specific functions i.e. to propose designer proteins (3).
Contemporary approaches for protein structure prediction can be broadly classified under two categories viz. (i) comparative modeling, which includes homology modeling and threading (4–7) and (ii) de novo folding (8–12). The first category of methods utilizes the structures of already solved proteins as templates (either locally or globally, at the sequence level or at the sub-structure level). With large amounts of genome and proteome data accumulating via sequencing projects, comparative modeling has become the method of choice to characterize sequences where related representatives of a family exist in structural databases (13–18). There are several web servers based on comparative modeling approaches such as Swiss Model (4), CPHmodels (19), FAMS (20) and ModWeb (21). The assessors for comparative modeling at CASP6 (Critical Assessment of protein Structure Prediction methods) have noted small improvements in model quality despite increase in the available structures but marginal improvement in alignment accuracy when compared to CASP5 (22). A natural limit for these approaches is the quantity of information available in the structural databases. This highlights the importance of de novo techniques for protein folding.
Significant progress has been made in recent years towards physics-based computation of protein structure, from a knowledge of the amino acid sequence. This approach, commonly referred to as an ab initio method (23–25) is based on the thermodynamic hypothesis formulated by Anfinsen (1973), according to which the native structure of a protein corresponds to the global minimum of its free energy under given conditions (26). Protein structure prediction using ab initio method is accomplished by a search for a conformation corresponding to the global-minimum of an appropriate potential energy function without the use of secondary structure prediction, homology modeling, threading etc. (27). In contrast, methods characterized as de novo use the ab initio strategies partly as well as database information directly or indirectly. Table 1 summarizes different known web servers/groups for protein structure prediction and the function(s) therein. The tertiary structure prediction of protein starting from its sequence has been successfully demonstrated on protein sequences <85 residues in length by Baker’s group (28,29) using a fragment assembly methodology. The ProtInfo
*To whom correspondence should be addressed. Tel: +91 11 2659 1505; Fax: +91 11 2658 2037; Email: bjayaram@chemistry.iitd.ac.in
1. ROBETTA (28,29) (http://robetta.bakerlab.org): De novo Automated structure prediction analysis tool used to infer protein structural information from protein sequence data
2. PROTINFO (30) (http://protinfo.compbio.washington.edu): De novo protein structure prediction web server utilizing simulated annealing for generation and different scoring functions for selection of final five conformers
3. SCRATCH (31) (http://www.igb.uci.edu/servers/psss.html): Protein structure and structural features prediction server which utilizes recursive neural networks, evolutionary information, fragment libraries and energy
4. ASTRO-FOLD (32): Astro-fold: first principles tertiary structure prediction based on overall deterministic framework coupled with mixed integer optimization
5. ROKKY (33) (http://www.proteinsilico.org/rokky/rokky-p/): De novo structure prediction by the simfold energy function with the multi-canonical ensemble fragment assembly
6. BHAGEERATH (http://www.scfbio-iitd.res.in/bhageerath): Energy based methodology for narrowing down the search space of small
Figure 1. The flow of information in Bhageerath web server, starting with the input from the user to the final 10 predictions made available to the user.
web server by Samudrala et al. (30) predicts protein tertiary structure for sequences <100 amino acids using de novo methodology, whereby structures are generated using a simulated annealing search phase which minimizes a target scoring function. Scratch web server by Baldi et al. (31) predicts the protein tertiary structure as well as structural features starting from the sequence information alone. Astro-fold (32), an ab initio structure prediction framework by Klepeis and Floudas, employs local interactions and hydrophobicity for the identification of helices and beta-sheets respectively, followed by global optimization, stochastic optimization and torsion angle dynamics. De novo structure prediction by simfold energy function with the multi-canonical ensemble fragment assembly has been developed by Fujitsuka et al. (33). The function has been tested on 38 proteins along with the fragment assembly simulations and predicts structures within 6.5 Å RMSD (root mean square deviation) of the native in 12 of the cases. Arriving at structures between 3 and 6 Å RMSD of the native expeditiously using ab initio or de novo methodologies remains a formidable challenge.
We have developed a computationally viable de novo strategy for tertiary structure prediction, processing and evaluation. The web server christened Bhageerath takes as input the amino acid sequence and secondary structure information for a query protein and returns 10 candidate structures for the native. In this article, we report the validation and testing of the protein structure prediction web suite Bhageerath with application to 50 small globular proteins. The programs are written in standard C++, with a total of more than 8000 lines of code and are easily portable on any POSIX (UNIX, LINUX, IRIX and AIX) compliant system.

MATERIALS AND METHODS
Bhageerath (www.scfbio-iitd.res.in/bhageerath) software suite for protein tertiary structure prediction narrows down the search space to generate probable candidate structures for the native. The flow chart diagram of Bhageerath is depicted in Figure 1.
Nucleic Acids Research, 2006, Vol. 34, No. 21 6197
The first module involves the formation of a 3-D structure from the amino acid sequence with the secondary structural elements in place. The second module involves generation of a large number of trial structures with a systematic sampling of the conformational space of loop dihedrals. The number of trial structures generated is 128^(n−1), where n is the number of secondary structural elements. These structures are generated by choosing seven dihedrals from each of the loops (three at both ends and one dihedral from the middle of the loop) and sampling two conformations for each dihedral, which gives 2^7 = 128 conformations per loop. The values assigned for dihedrals Φ, Ψ to each amino acid during structure generation are given in supplementary information (Supplementary Table S1). The trial structures

is <100 and the number of secondary structural elements varies between two and three. We have selected our test set of 50 proteins randomly from these 329 proteins. The length of the polypeptide chain varies from 17 to 70 and the total number of helices and strands ranges between two and three.
The results obtained for the 50 globular proteins with the web server are shown in Table 2. The table gives the PDB ID, the number of amino acids in the sequence as well as the number and type of secondary structural elements present in each protein in columns (i)–(iii). The number of structures obtained after the persistence length and radius of gyration filters are given in column (iv) of Table 2. The lowest
supplementary information (Supplementary Tables S2–S7). Thus, for new sequences with no known sequence homologues, the Bhageerath web server has the potential to predict a structure to within 3–6 Å RMSD of the native structure with accuracies comparable to the homology modeling servers. Further comparison of the 10 structures obtained from Bhageerath was carried out with the five candidate structures obtained from the ProtInfo web server (30) and 10 structures obtained with ROBETTA software (28) configured locally. The results shown in Table 4 indicate that the server described here is able to predict structures with RMSDs comparable to those obtained by ProtInfo web server and ROBETTA software. Supplementary Table S8 in the supplementary information provides the comparison of the GDT_TS scores obtained using LGA server (42) for structures obtained with Bhageerath and ProtInfo web servers and ROBETTA software. The GDT_TS scores are also found to be comparable for structures obtained from these three different structure prediction methodologies.

DISCUSSION
We describe here an energy based computational web server Bhageerath, for an automated candidate tertiary structure prediction. The web server permits predictive folding with moderate computational resources. The validation of the computational protocol on 50 globular proteins has shown that the web server selects one or more candidate structures within an RMSD of 3–6 Å with respect to the native in the 10 lowest energy structures. The results presented are for proteins having 2–3 secondary elements with α, β and α/β structures and are obtained solely from the amino acid sequence and secondary structure information (without the aid of multiple sequence alignment, or fold recognition). The results provide a benchmark as to the level of model accuracy one can expect from this web server.
All of the eight modules are currently being executed on a cluster with 32 dedicated UltraSparc III 900 MHz processors. In contrast to typical short return times (ranging from 1 to 10 min) for receiving results from comparative modeling
Table 3. A comparison of protein tertiary structure prediction accuracies with different homology modeling servers available in public domain
Sl. No. | PDB ID | CPHModels (19) RMSD (Å) | SwissModel (4) RMSD (Å) | EsyPred3D (38) RMSD (Å) | ModWeb (21) RMSD (Å) | Geno3D (39) RMSD (Å) | 3DJigSaw (40) RMSD (Å) | Bhageerath RMSD (Å)
The numbers in parenthesis indicate the length of the protein model obtained. Supplementary Tables S2–S7 in the supplementary information contain the template ID, % sequence identity and alignment for each
method and structure shown above.
Table 4. A comparison of protein tertiary structure prediction accuracy with ProtInfo web server and ROBETTA software available in the public domain for 50 test proteins
Sl. No. | PDB ID | RMSD without end loops (Å) (Bhageerath) | RMSD without end loops (Å) (ProtInfo)a (30) | RMSD without end loops (Å) (ROBETTA)a (28)
1 | 1E0Q | 4.5, 2.5, 3.0, 5.0, 3.4, 3.3, 3.2, 3.3, 5.9, 3.3 | 4.0, 4.1, 3.7, 3.9, 4.2 | 1.1b
2 | 1B03 | 10.3, 4.4, 5.9, 5.5, 6.7, 5.4, 4.5, 6.1, 6.9, 7.5 | 4.0, 4.7, 4.1, 4.5, 4.4 | 2.7, 3.0
3 | 1WQC | 4.0, 4.5, 2.5, 3.8, 2.9, 5.1, 4.2, 5.7, 3.8, 4.7 | 2.1, 1.8, 1.8, 2.0, 2.1 | 2.3, 3.4
4 | 1RJU | 6.1, 6.3, 6.6, 5.9, 6.6, 5.9, 6.6, 7.0, 6.7, 7.4 | 3.4, 4.9, 3.3, 4.8, 6.0 | 3.4, 4.0, 2.5, 3.2, 3.0, 3.6, 4.8, 2.9, 3.0, 3.1
5 | 1EDM | 3.9, 3.5, 3.8, 4.0, 3.6, 5.2, 5.4, 4.1, 3.9, 4.7 | 3.4, 4.0, 3.7, 3.3, 3.1 | 0.4, 0.5, 0.4, 0.5, 0.6, 0.4, 0.7, 0.7, 1.1, 0.4
6 | 1AB1 | 4.8, 4.5, 4.3, 5.2, 4.2, 2.9, 4.5, 3.8, 5.8, 3.3 | 3.3, 5.1, 6.3, 3.6, 4.9 | 2.2, 2.8, 2.9, 2.4, 2.9, 2.7, 3.7, 3.5, 2.2, 3.3
7 | 1BX7 | 3.3, 4.0, 5.0, 3.2, 4.5, 3.8, 4.8, 3.1, 4.0, 3.5 | 2.6, 4.2, 3.7, 4.5, 2.1 | 0.9, 1.5, 1.0, 1.6, 1.5, 1.6, 1.4, 1.0, 2.0, 1.5
8 | 1B6Q | 6.1, 8.4, 4.0, 4.4, 3.8, 10.1, 5.3, 9.7, 10.7, 3.1 | 10.2, 10.0, 10.0, 10.4, 10.5 | 10.0, 9.6, 8.5, 7.6, 12.0, 8.3, 8.2, 7.0, 10.2, 9.0
9 | 1ROP | 5.3, 4.3, 9.2, 7.3, 7.5, 11.0, 14.2, 11.5, 8.7, 6.2 | 10.8, 11.5, 11.5, 10.1, 12.4 | 5.8, 10.3, 10.0, 11.7, 8.6, 7.0, 8.3, 7.7, 11.2, 13.6
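As a reading aid for Table 4, the following sketch shows one way to summarize a row: the best (lowest) RMSD among the candidates returned by a method, and how many candidates fall within a cutoff. The RMSD values are copied from the first three rows of the table; the helper names and the default 6 Å cutoff (the upper end of the 3–6 Å range quoted in the text) are assumptions made for illustration, not part of the published analysis.

```python
# RMSD values (Å) copied from rows 1-3 of Table 4 (Bhageerath and ProtInfo columns).
TABLE4_ROWS = {
    "1E0Q": {
        "Bhageerath": [4.5, 2.5, 3.0, 5.0, 3.4, 3.3, 3.2, 3.3, 5.9, 3.3],
        "ProtInfo": [4.0, 4.1, 3.7, 3.9, 4.2],
    },
    "1B03": {
        "Bhageerath": [10.3, 4.4, 5.9, 5.5, 6.7, 5.4, 4.5, 6.1, 6.9, 7.5],
        "ProtInfo": [4.0, 4.7, 4.1, 4.5, 4.4],
    },
    "1WQC": {
        "Bhageerath": [4.0, 4.5, 2.5, 3.8, 2.9, 5.1, 4.2, 5.7, 3.8, 4.7],
        "ProtInfo": [2.1, 1.8, 1.8, 2.0, 2.1],
    },
}


def best_rmsd(values):
    """Lowest RMSD among the candidate structures of one method."""
    return min(values)


def count_within(values, cutoff=6.0):
    """Number of candidates with RMSD at or below the cutoff (hypothetical default)."""
    return sum(1 for v in values if v <= cutoff)


if __name__ == "__main__":
    for pdb_id, methods in TABLE4_ROWS.items():
        for method, values in methods.items():
            print(pdb_id, method, best_rmsd(values), count_within(values))
```

Comparing the best candidate per method, rather than the first-ranked one, matches the way the text states its result: a structure within 3–6 Å of the native somewhere among the 10 lowest energy structures.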
servers, the expected prediction time with Bhageerath web server for two helix systems is 4–5 min while for three helix systems it is 2–3 h. However, this depends on the length of the sequence, number of secondary structure elements and the number of structures accepted after the biophysical filters for processing the energetics of each trial structure at the atomic level. It is currently able to process 4–5 normally sized jobs per day on 32 processors.
The current version of the web server elicits secondary structure information from the user. For new sequences where secondary structure information is not available, web based secondary structure prediction tools can be employed. We have characterized the results obtained from five different freely available secondary structure prediction servers (43–47) available on the web for the 50 test proteins. The predictions are provided in the supplementary information
Table 5. A list of modules of Bhageerath converted to independent web utilities with their respective URL’s
1. Persistence length filter (http://www.scfbio-iitd.res.in/software/proteomics/perlen.jsp): A filter based on the maximum uninterrupted length of the polypeptide chain persisting in a particular direction
2. Radius of gyration filter (http://www.scfbio-iitd.res.in/software/proteomics/rg.jsp): A filter based on the radius of the molecule and defined as the root mean square distance of the collection of atoms from their common centre of gravity
3. Hydrophobicity ratio filter (http://www.scfbio-iitd.res.in/software/proteomics/hyphb.jsp): A filter based on hydrophobicity ratio, which is defined as the ratio of loss in accessible surface area (ASA) per atom of non-polar atoms to the loss in accessible surface area per atom of
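The radius of gyration definition in Table 5 (root mean square distance of the atoms from their common centre of gravity) can be written out as a minimal sketch. This is not the code behind the web utility; the function name, the plain list-of-tuples input and the choice of equal masses when none are supplied are assumptions made for illustration.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Root mean square distance of atoms from their centre of gravity.

    coords: sequence of (x, y, z) tuples in Å.
    masses: optional sequence of atomic masses; equal weights are assumed
            when omitted (an illustrative choice, not taken from the paper).
    """
    if not coords:
        raise ValueError("at least one atom is required")
    if masses is None:
        masses = [1.0] * len(coords)
    if len(masses) != len(coords):
        raise ValueError("coords and masses must have the same length")

    total_mass = sum(masses)
    # Mass-weighted centre of gravity, one component per axis.
    centre = [
        sum(m * c[axis] for m, c in zip(masses, coords)) / total_mass
        for axis in range(3)
    ]
    # Mass-weighted mean of squared distances from the centre.
    weighted_sq = sum(
        m * sum((c[axis] - centre[axis]) ** 2 for axis in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total_mass)


if __name__ == "__main__":
    square = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
    print(radius_of_gyration(square))
```

A filter built on this quantity would compare the value for each trial structure against a threshold and discard structures that are not compact enough; the threshold itself is not given in this section.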
10. Ortiz,A.R., Kolinski,A. and Skolnick,J. (1998) Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol., 277, 419–448.
11. Huang,E.S., Samudrala,R. and Ponder,J.W. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267–281.
12. Simons,K.T., Strauss,C. and Baker,D. (2001) Prospects for ab initio protein structural genomics. J. Mol. Biol., 306, 1191–1199.
13. Rost,B. and Sander,C. (1996) Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct., 25, 113–136.
14. Guex,N., Diemand,A. and Peitsch,M.C. (1999) Protein modeling for all. Trends Biochem. Sci., 24, 364–367.
30. Hung,L.-H., Ngan,S.-C., Liu,T. and Samudrala,R. (2005) PROTINFO: new algorithms for enhanced protein structure predictions. Nucleic Acids Res., 33, W77–W80.
31. Cheng,J., Randall,A.Z., Sweredoski,M.J. and Baldi,P. (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res., 33, W72–W76.
32. Klepeis,J.L. and Floudas,C.A. (2003) ASTRO_FOLD: A combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophys. J., 85, 2119–2146.
33. Fujitsuka,Y., Chikenji,G. and Takada,S. (2005) SimFold energy function for de novo protein structure prediction: consensus with Rosetta. Proteins, 62, 381–398.
34. Narang,P., Bhushan,K., Bose,S. and Jayaram,B. (2005) A computational pathway for bracketing native-like structures for small alpha helical