Academia.eduAcademia.edu

HYPO: A database of hypothetical human proteins

2018, Protein and peptide letters

All annotated genes were once hypothetical or uncharacterized. Keeping this as an epilogue, we have enhanced our erstwhile database of hypothetical proteins (HP) in human (HypoDB) with added annotation, application programming interfaces and descriptive features. The database hosts 1000+ manually curated records of the known unknown regions in the human genome. The new updated version of HypoDB with functionalities (Blast,Match) is freely accessible at http://www.bioclues.org/hypo2.

bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. HYPO: A database of hypothetical human proteins Vijayaraghava Seshadri Sundararajan1,2,*, Girik Malik3, Johny Ijaq1,4, Anuj Kumar1,5, Partha Sarathi Das1,6, Shidhi P.R7, Achuthsankar S Nair7, Prashanth Suravajhala1,8,*, and Pawan K Dhar 9* 1. Bioclues.org, Kukatpally, Hyderabad 500072, India 2. Environmental Health Institute, National Environment Agency, Singapore 138667 3. The Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, Department of Pediatrics, The Ohio state University, USA 4. Department of Zoology, Osmania University, Hyderabad 50007, India 5. Uttarakhand Council for Biotechnology (UCB), Prem Nagar, Dehradun-248007, India 6. Bioinformatics Infrastructure Facility, Department of Microbiology, Vidyasagar University, Midnapore 721102, West Bengal, India 7. Department of Computational Biology and Bioinformatics, University of Kerala,Thiruvanantapuram 695581, Kerala, India 8. Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, Rajasthan, India 9. School of Biotechnology, Jawaharlal Nehru University, New Delhi 110067, India * Corresponding authors: Prashanth Suravajhala (prash@bioclues.org) Pawan K Dhar (pawandhar@mail.jnu.ac.in) Vijayaraghava S Sundararajan (chanusuba@gmail.com) ABSTRACT All annotated genes were once hypothetical or uncharacterized. Keeping this as an epilogue, we have enhanced our former database of hypothetical proteins (HP) in human (HypoDB) with added annotation, application programming interfaces and descriptive features. The database hosts 1000+ manually curated records of the known ‘unknown’ regions in the human genome. The new updated version of HypoDB with functionalities (Blast, Match) is freely accessible at http://www.bioclues.org/hypo2. INTRODUCTION The advent of high-throughput genomic technologies has enabled understanding the components of the genome in a better way. Today, we can distinguish sequences that are coding, non-coding and also the ones that are not the bona fide genes at all, viz. pseudogenes. Nevertheless, there are some genes whose function remains obscure as they may not have similarities to known regions in the genome. Such known ‘unknown’ genes constituting the open reading frames (ORF) that remain in the epigenome are termed as orphan genes. The proteins that are expected to be expressed from these orphan genes but having no bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. experimental evidence of translation are termed as ‘hypothetical proteins’ (HPs) [1]. Moving from the then early stated 1.5% protein-coding sequences to the non-coding inter- and intragenic regions, all these HPs are a part of known ‘unknowns’ with undetermined role [2, 3]. Recently, the role of orphan genes and HPs has been deliberated in lieu of their role as artefacts and non-coding elements [4]. Furthermore, there are evolutionarily conserved regions (ECR) in the form of HPs which could be potential candidates for experimentation [5]. In addition, studies on structural aspects have led into determining tertiary structures of HPs based on geometrical, biophysical and biochemical studies which further emphasize the need for descriptors of these sequences [6]. Since 2006, we have been updating our primary database of HPs in human. Many of the reviews focusing on annotation from sequence/literature based searches [7], structural genomics [8] and functional genomics based approaches [9] have been well documented even as they constitute substantial part of human proteome. Due to lack of experimental confirmations regarding their molecular function, some of these variants are known as KIAA in bacteria while in eukaryotes they are tagged as 'unknown' as 'uncharacterized' with many accessions starting with prefix, “XP,” meaning predicted. Recently, we have reported making novel proteins from non-coding and less bona fide sequences, pseudogenes [10]. With the deluge of sequencing data pertaining to human HPs, there is a need to organize them into a database for their latent use and understand prime functions associated with various pathways and diseases. This would definitely serve as ready reference to researchers interested in finding the role of candidate HPs. What remains interesting to see is that there are many different types of regulatory sequences associated with HPs in controlling gene expression eventually falling under this list. METHODS Data Extraction and Collection Since 2006, we have been updating our primary database of HPs in human consistently [11]. The then 7540 of them are relatively deprecated with certainty and many are likely to be a part of potential duplicates, obscure list. Annotators and curators have made a magnanimous continuous effort in predicting the function of HPs and this finally resulted in 1048 manually curated proteins that have been experimentally proven / confirmed. They were included in our earlier version (Hypo DB 12_Mar_2012) of the Hypo database. In this current work, we have enhanced our database with new functionalities and core dependencies are added to the bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. new version of HypoDB 2.0. Set of HPs from earlier version are considered to extract the latest information on 13_Jun_2016 from UniProtKB (uniprot.org/uniprot/). This resulted in a set of 1015 final HPs that are considered in this newly updated version of HypoDB named as “Hypo DB 13_Jun_2016 (1015)". The HPs retrieved from UniProtKB are categorized as reviewed (n=923) and non-reviewed (n=92) and are tagged as UniProtKB[Swiss-Prot] and UniProtKB[TrEMBL], respectively in our DB. They form two sub-databases in Hypo; one with reviewed and the other with non-reviewed UniProtKB peptides. The total collection of these peptides is termed as ‘UniProtKB. The idea is to allow researchers to search and use the Hypo Database according to their research interest using experimentally validated sets (from Swiss-Prot) or non-experimentally validated set (TrEMBL) or use the complete set (UniProtKB) DATABASE DESIGN/ STRUCTURE/ORGANIZATION User Interface and Functionality At the core backend, the java construction allows views of every queried sequence mapped to several databases, viz. EMBL, PIR, HPRD and other interesting applications that include descriptors for structural databases, interaction and association databases like STRING, gene ontology and KEGG pathway reference databases apart from many sequence databases (an example is illustrated in Figure 1). The HypoDB contains primarily four application programming interfaces (API), viz. (a) Quick view interface where a few modular entries can be directly retrieved from the home page of the Hypo website. HPs overlapping classes are displayed in the home page (Quick view, as shown in Figure 2). This allows user to filter the list directly according to the selected HP. (b) Search interface (through Catalogue Sub-Menu system). Through several Catalogues, the Hypo Database is searched according to the user need and references (as detailed in a later paragraph as “Multiple search capabilities”, Figure 3). (c) Predict sequences from BLAST I/O parser [12] (BioTool page). Users can paste or upload their sequences to be tested against the total sets of HPs (sequences) through the integrated BLAST tool. (d) Another useful API is the core functionality with feature/annotation object model allowing the detailed view from outcome of features, while annotations can be downloaded by right click “save as” options containing ontology and featured lists. These bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. interfaces are embedded into catalogues result page, Quick View result page, BLAST outcome pages. Multiple search capabilities HYPO offers the end-user with multiple search capabilities (Figure 3). These six search tools can operate as independent search engines to interrogate the database or be executed as part of a more complex query. Five of these search utilities are based on individual catalogues that are created as vocabularies of terms from taxonomies, peptide families, species, keywords, and citations of 1015 HPs entries for easy browsing of the database. Taxonomy catalogue: Organisms are classified in a hierarchical tree structure. Taxonomy database contains every node (taxon) of the tree. Species catalogue: In this DB, we only considered “Homo sapiens” – which is set in this catalogue (the catalogue is not functional). Keyword catalogue: UniProt entries are tagged with keywords that can be used to retrieve particular subsets of entries. Family catalogue: In this DB, we only considered “Homo sapiens” – which is set in this catalogue (the catalogue is not functional). Citation catalogue: UniProt maintain publications with title (RT, example: Splicing variants of BLOM7); author name (RA, examples: Abaya), journal (RL, example: Am. J. Med. Genet. A), year of publication (YR, example: 1986). In Hypo2, we have catalogued “TITLE of manuscript”, “AUTHOR Name”, “JOURNAL Name”, and “YEAR of Publication”, are separately catalogued for easy search and reference. Advanced Search: The Hypo database also offers a straightforward option where user can search the database through any of the given ‘search fields’ and a ‘value’. For example, if the user selects species in ‘search field’ and Homo sapiens as ‘value’, it leads to a page that shows the list of all HPs described in the database. Clicking on them further shows detailed information about the selected list. This search category allows a combination of search terms, search fields and search values. Users can query the database using field names which are not listed in other catalogues. The HypoDB website also provides details on how to retrieve different components. The FAQ and help section will allow the users to get an introduction. The searches can be made in less than 10 seconds on a 2GB RAM and 2 GHz Core Duo processor. bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. INTEGRATED BIO-TOOLS Matcher: The Hypo database system presents a Biomarkers Matcher, along with a local gene card Matcher, which retrieves the Symbol, Description and Category of the Gene. The matcher also has an option to lookup the gene in NCBI, using the accession number. The matcher also describes the gene along with the HP and pseudogene linked to it. The result is presented in a tabular format, which can be sorted according to the needs, along with an option to perform a sub-search in the results displayed, based on any of the fields. BLAST functionality is integrated in the Hypo2 system, to enable user query of any new sequence to be aligned with the Hypo database entries and get the scores for a closer match. Users also can upload their sequences in the FASTA format and blast results will be (currently) shown on the screen. FUTURE DEVELOPMENT The HypoDB aims to develop APIs for implicitly searching resources linking them to other databases like NCBI Link-out. In the near future, we would like to streamline this with genomic sequences that are coding, non-coding and non-coding with coding potential, small RNAs, long non-coding RNAs that are not annotated etc. The lists of pseudogenes etc. are already under development, which will allow users to work with the concrete list. Although the Blast/GenBank parsing API is widely used, we may not use all output formats for interfacing, so a careful descriptor usage is needed even as we hope to continue the ongoing efforts with widely unsupported formats. Further, we plan to integrate analytical tools, viz. ClustalW, HMMER, NJplot, Hydrophobicity calculator, SignalP interfaces to the Hypo Database System. Researchers are welcome to identify niche areas and help us improve the interface. Currently the system is embedded with Java, Perl, PHP with SQL and HypoDB 3.0 version is expected by the end of 2017. CONCLUSIONS The HypoDB is perhaps the only open-source HP database with a range of tools for common bioinformatics retrievals. The homepage provides access to the interfaces with all search options. We hope to ensure that this serves as a standby reference to researchers who are interested in finding candidate sequences for their potential experimental work. As we march ahead in the post genomic era, a database such as HypoDB holds importance for ascertaining factual information from annotated entries. bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. ACKNOWLEDGEMENT We thank Dr. Chanditha Hapuarachchi, Environmental Health Institute, National Environment Agency, Singapore for proof-reading the final version of the manuscript. We acknowledge Arun Gupta for his contributions in the database design (Version 1). Conflict of interests The authors declare no competing interests, whatsoever. Author contributions VSS developed all APIs and database catalogue interfaces. GM provided descriptors and search interfaces. SPR, AK, JI and PSD have checked the annotations, inserted the figures and screenshots. PS, AK and JI wrote the preliminary version of the manuscript. PS and VSS wrote the final draft of the manuscript. ASN, PKD, and PS proofread the final manuscript. All authors agreed and have gone through the final version of the manuscript. REFERENCES 1. Galperin M.Y. Conserved 'hypothetical' proteins: new hints and new puzzles. Comp. Funct. Genomics. 2001. 2 (1): 14–18. 2. Little, P. F. Structure and function of the human genome. Genome research. 2005. 15(12), 1759-1766 3. Logan DC. Known ‘knowns’, known ‘unknowns’, unknown ‘unknowns’ and the propagation of scientific enquiry. J Exp Bot., 2009. 60(3):712-4. 4. Prabh N and Rödelsperger C. Are orphan genes protein-coding, prediction artifacts, or non-coding RNAs? BMC Bioinformatics. 2016. 17(1):226. 5. Galperin, M. Y. and Koonin, E. V. Conserved hypothetical” proteins: prioritization of targets for experimental study. Nucleic Acids Research. 2004. 32(18), 5452–5463 6. Kinoshita, K. and Nakamura, H. Protein informatics towards function identification. Current opinion in structural biology. 2003. 13(3), 396-400. bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 7. Lubec, G.; Afjehi-Sadat, L.; Yang, J. W. and John, J. P. P. Searching for hypothetical proteins: theory and practice based upon original data and literature. Progress in neurobiology. 2005. 77(1), 90-127. 8. Eisenstein E.; Gilliland G.L.; Herzberg O.; Moult J.; Orban J.; Poljak R.J.; Banerjei L.; Richardson D. and Howard A.J. Biological function made crystal clear annotation of hypothetical proteins via structural genomics. Curr. Opin. Biotech. 2000. 11(1):25-30. 9. Adams, M. A.; Suits, M. D. L.; Zheng, J. and Jia, Z. Piecing together the structure– function puzzle: Experiences in structure-based functional annotation of hypothetical proteins. Proteomics. 2007. 7: 2920–2932. 10. Shidhi, P. R., Suravajhala, P., Nayeema, A., Nair, A. S., Singh, S. and Dhar, P. K. Making novel proteins from pseudogenes. Bioinformatics. 2014. 31 (1) 33–39 11. Suravajhala, P. Hypo, hype and 'hyp' human proteins. Bioinformation. 2007. 2(1), 3133. 12. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol., 1990. 215:403-410 bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. TABLE AND FIGURES LEGENDS Figure 1: An overview of features and how the query is processed in HypoDB bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Fig.2. Screenshot showing the home page of the Hypo.2 database Quick view User can directly filter the list bioRxiv preprint doi: https://doi.org/10.1101/202887. this version posted October 13, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Figure.3. Multiple search capabilities.