BLAST-EXPLORER Helps You Building Datasets For Phylogenetic Analysis

Dereeper et al.
BMC Evolutionary Biology 2010, 10:8

http://www.biomedcentral.com/1471-2148/10/8
SOFTWARE Open Access
BLAST-EXPLORER helps you building datasets for phylogenetic analysis
Alexis Dereeper1,2, Stephane Audic1, Jean-Michel Claverie1, Guillaume Blanc1*
Abstract
Background: The right sampling of homologous sequences for phylogenetic or molecular evolution analyses is a
crucial step, the quality of which can have a significant impact on the final interpretation of the study. There is no
single way of constructing datasets suitable for phylogenetic analysis, because this task intimately depends on the
scientific question we want to address. Moreover, database-mining software such as BLAST, which is routinely
used for searching homologous sequences, is not specifically optimized for this task.
Results: To fill this gap, we designed BLAST-Explorer, an original and friendly web-based application that combines
a BLAST search with a suite of tools that allows interactive, phylogenetic-oriented exploration of the BLAST results
and flexible selection of homologous sequences among the BLAST hits. Once the selection of the BLAST hits is
done using BLAST-Explorer, the corresponding sequences can be imported locally for external analysis or passed to
the phylogenetic tree reconstruction pipelines available on the Phylogeny.fr platform.
Conclusions: BLAST-Explorer provides a simple, intuitive and interactive graphical representation of the BLAST
results and allows selection and retrieval of the BLAST hit sequences based on a wide range of criteria. Although
BLAST-Explorer primarily aims at helping the construction of sequence datasets for further phylogenetic study, it
can also be used as a standard BLAST server with enriched output. BLAST-Explorer is available at
http://www.phylogeny.fr
* Correspondence: blanc@igs.cnrs-mrs.fr
1 Information Génomique & Structurale - CNRS-UPR2589, Institut de Microbiologie de la Méditerranée -
IFR 88, 163 Avenue de Luminy, 13009 Marseille, France
© 2010 Dereeper et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the
terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

Background
The reconstruction of phylogenetic trees from molecular sequences has become a routine task not only for
specialists involved in molecular evolution or systematics but also for biologists working on their
favourite gene/protein family or annotating new genome sequences. The growing interest in phylogenetic
information has stimulated the emergence of new integrated, user-friendly software that produces robust
trees using sophisticated methods while remaining accessible to non-specialists. Developers have
concentrated most of their effort on improving the speed, accuracy and versatility of the algorithms
proposed for reconstructing phylogenetic trees from user-defined sets of homologous (ancestrally related)
sequences. However, although choosing a good initial sequence dataset is crucial to the validity of
subsequent phylogenetic analyses, this step has been largely overlooked in recent software developments.
There is no single way of constructing datasets suitable for phylogenetic analysis, because this task
intimately depends on the scientific question we want to address. For example, biologists may be concerned
with the taxonomic range of sequences, reduction of the long-branch attraction effect, presence of
paralogues, orthologues, pseudo-genes and/or multi-domain proteins, etc. Failure in the constitution of
datasets can lead to incorrect conclusions being drawn from phylogenetic studies. Tools have been
specifically designed to distinguish between orthologous and paralogous sequences in genome/proteome
datasets [1] and expressed sequence tag datasets [2,3], but they are not always the most convenient for
one-off analyses.
A frequent scenario in research is a biologist who has a particular sequence of interest in hand and needs
to find other sequences that are related to it in sequence databases to create a phylogenetic tree. The
Basic Local Alignment Search Tool (BLAST) [4] is the most widely used set of programs for this purpose,
primarily owing
to its speed of execution. However, the information presented in the traditional BLAST output is not
optimized for selecting and retrieving items for further phylogenetic study. For example, the ranking of
the identified matches by alignment score does not accurately reflect the evolutionary distances between
the query and matching sequences (subject sequences). This is mainly because the BLAST scoring scheme has
a bias favouring hits that have small or no gaps (which remain single, long, high-scoring local
alignments) over hits with long gaps (which are split into multiple, short, low-scoring local alignments).
Furthermore, BLAST results only show the levels of divergence between the query sequence and each of the
individual matches but not between pairs of matching sequences. Yet this information is important when one
wants to control sequence diversity in the phylogenetic dataset (e.g., avoiding over-sampling of sequences
arising from the same taxon). Many features of individual alignments are useful for sampling homologous
sequences (e.g., alignment coverage, level of similarity, etc.) but their organization and accessibility
across the BLAST result page do not facilitate their interpretation.
To fill this gap, we created BLAST-Explorer. This web-based resource combines a fast parallelized BLAST
search with a suite of tools that allows an interactive exploration of the BLAST results and the easy
selection of a suitable subset of homologous sequences. The traditional BLAST output is entirely
reformatted to highlight phylogeny-relevant information and augmented with new features not provided by
BLAST (taxonomy information, multiple alignments and similarity trees). BLAST hits can be selected either
individually or in bulk using various criteria. Selected items can then be imported locally as
FASTA-formatted sequences or passed to one of the phylogenetic tree reconstruction pipelines available on
the Phylogeny.fr platform [5].

Implementation
Initial BLAST search
BLAST searches are parallelized on an in-house 25-node Linux cluster (50 CPUs) and accept either proteins
or nucleic acids as query. Searches can be done using the BLASTP (protein against protein database) or
BLASTN (nucleotides against nucleotide database) algorithms. The TBLASTN (amino acids against a six-frame
translated nucleotide database) and BLASTX (translated nucleotides against a protein database) algorithms
are also offered, but in this case subsequent analyses (i.e., similarity tree and multiple sequence
alignment) are carried out at the protein level. TBLASTX is not offered because its output is not
manageable in subsequent post-processing (e.g., alignments in overlapping non-coding reading frames).
Protein and nucleotide sequence databases are updated monthly. The parallelization of the BLAST searches is
done as follows: in a first step, each compute node aligns the query sequence against a distinct
subdivision of the selected sequence database using the specified BLAST program. Then the resulting hit
sequences on each node are gathered together in a smaller database, which is searched again in a second
BLAST run for formatting the final output. In each BLAST run, the effective size of the generic database
(i.e., NR, NT, etc.) is specified using the -z flag to allow accurate calculation of the alignment E-value.
BLAST output post-processing
The BLAST output is parsed to collect various features (scores, pairwise alignments, sequence annotations,
sequence identification numbers, E-values). For each hit, taxonomic information (species and lineage) is
retrieved from a weekly updated local copy of the NCBI Taxonomy database
(ftp://ftp.ncbi.nih.gov/pub/taxonomy/). The BLAST output is entirely redesigned such that the information
most relevant to phylogenetic analysis becomes easily accessible. Menu panels and images of the similarity
tree and tiling diagrams (see below) are also included.
Construction of the similarity tree
This tool provides a phylogenetic-oriented graphical display of the BLAST results. First, a pseudo
multiple-sequence alignment (MSA) of the query and BLAST hit sequences is created by parsing the standard
BLAST output: individual BLAST-aligned hit sequences are piled up by positioning each residue relative to
its homologue in the query sequence (high-scoring segment pair [HSP] stacking). Multiple non-overlapping
HSPs for the same hit are concatenated; regions of the hit sequence not aligned with the query sequence are
substituted with gaps. When duplicated domains are present in the hit sequence, each repeat unit produces
an HSP with the homologous region in the query sequence. In this case, only the repeat unit contained in
the highest-scoring pair is included in the MSA (i.e., repeated domains with lower alignment scores are not
considered in the alignment). Although pseudo MSAs may be less accurate than MSAs created by conventional
programs (e.g., ClustalW, Muscle), we chose this option because it is much faster for large datasets (up to
5000 hit sequences).
This pseudo MSA is passed to ClustalW, which produces a similarity (p-distance) tree using the “-tree”
option. This tree is built with the neighbor-joining method, using either all sites of the alignment or
gap-free sites only, depending on the user's choice. A picture of the similarity tree is generated by the
TreeDyn program, using the “reporting annotations” functions for color-coding. This image is incorporated
into the new HTML BLAST output together with a map of embedded
JavaScript actions allowing the mouse-click selection of hits.
Implementation
The BLAST-Explorer web interface and scripts are implemented in CGI/Perl. The interactive web page is
powered by JavaScript and AJAX technologies. The HTML pages are best viewed on a 19-inch (or larger)
screen.

Results and discussion
Running BLAST-Explorer
The entry page of BLAST-Explorer is a simplified BLAST form that receives a single FASTA-formatted query
sequence as input and allows (i) the selection of BLASTN, BLASTP, TBLASTN, or BLASTX [4] as an alignment
algorithm, (ii) the selection of a sequence database (GenBank NT for nucleotides; GenBank Non Redundant
Protein, Ensembl, PDB, RefSeq, UniProt and Swiss-Prot for proteins), (iii) the selection of a BLAST E-value
threshold and (iv) the option of filtering out low-complexity sequence segments. BLAST searches report a
maximum of 5,000 hits.
Small-scale selection mode
By default, the result page only shows the top-100 scoring BLAST hits, while the remaining hits are kept in
memory and can be activated using the large-scale selection tools (next section). Small-scale selection
tools only apply to the top-100 scoring BLAST hits. The central tool in this mode is the sequence
similarity tree, which provides an approximate picture of the phylogenetic relationships between the query
and the top BLAST hits (Fig. 1A). BLAST hits are renamed according to the species name. The similarity tree
is documented with meta-information including hit description (Fig. 1B), alignment coverage (Fig. 1C) and
taxonomy-based coloring (Fig. 1D). The tree image allows navigation across the BLAST result page (clicking
on an alignment coverage bar [Fig. 1C] leads to the corresponding pairwise alignment [Fig. 1E]), gives
access to the database record (by clicking on the hit name), and allows the selection of hits individually
(check-boxes) or in bulk (by clicking on internal branches).
A dropdown menu (Fig. 1F) gives access to additional small-scale selection tools:

o The top panel shows the number of gap-free sites in the BLAST-reconstructed multiple alignment of
selected sequences (see supplementary data). This number is dynamically updated when BLAST hits are added
to or removed from the selection.
o The “score histogram” tool shows the BLAST score values ranked in decreasing order. A score threshold can
be applied by clicking on the histogram (e.g., Fig. 1G).
o Two “Update tree” options allow redrawing the similarity tree by setting the appropriate number of
top-scoring BLAST hits or using a user-defined sequence selection. The tree is generated by combining
ClustalW [6] and TreeDyn [7] using either all sites of the BLAST-reconstructed multiple alignment or
gap-free sites only (N.B., the initial tree is computed using all sites).
o The “Add sequences to tree” option allows incorporating up to five external sequences (supplied by
users) into the current hit sequence selection. The similarity tree is then recalculated to show the
phylogenetic position of the external sequences relative to the BLAST hit sequences.

At the end of the selection process, selected sequences can be imported in FASTA format (“get selected
sequence” button) or passed to one of the phylogenetic reconstruction pipelines available on the
Phylogeny.fr platform [5] (“One click mode” or “Advanced mode” buttons).
Large-scale selection mode
In the large-scale selection mode, several tools allow the sampling of homologous sequences among the
entire set of BLAST hits (including those that are not shown in the top-100 BLAST subset) using global
criteria. They are grouped in a dedicated panel (Fig. 1H) and comprise:

o A pull-down menu that allows changing the E-value threshold on BLAST hits.
o Buttons showing the distributions of the BLAST hits according to three BLAST alignment statistics (i.e.,
BLAST scores, percentage of similarity, and alignment coverage). Bulk selection among the BLAST hits can
then be done by selecting intervals of the distribution histogram.
o The “selection on taxonomy” tool, enabling the selection of BLAST hits according to their taxonomic rank
(e.g., Fig. 1I). The taxonomic information is presented as a hierarchical graph allowing users to adjust
the level of detail that is relevant to their needs.

Following the application of the selection rules, the result page (i.e., the similarity tree and
individual pairwise alignments) is updated to account for changes in the list of the top-100 best BLAST
hits.
Comparison with existing software
Several existing BLAST post-processors combine BLAST searches with automated phylogenetic analysis of the
BLAST hits. However, most of them do not pursue the same goal and therefore differ in the nature of the
results. Also, the functionalities proposed to interact with the results vary greatly. Some of the
applications
Figure 1 BLAST-Explorer main interface. BLAST-Explorer main interface showing the similarity tree (A), hit descriptions (B), a coverage diagram
representing the alignment of the hit sequences on the query (C), the taxonomy color code (D), individual BLAST pairwise alignments (E), the
small-scale (F) and large-scale (H) selection tool panels. The “Score histogram” tool (G) and “Selection on taxonomy” tool (I) are given as
examples.
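The selection tools shown in the figure (E-value threshold, intervals on score, similarity and coverage, and selection on taxonomy) amount to filtering the hit list on a few per-hit attributes. The following is a minimal Python sketch of that idea, not BLAST-Explorer's own Perl/CGI code: the `Hit` record, its field names, the `select_hits` function and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit with the attributes used for selection (hypothetical record)."""
    name: str
    evalue: float
    score: float
    similarity: float  # percentage of similar residues, 0-100
    coverage: float    # fraction of the query covered by the alignment, 0-1
    lineage: tuple     # taxonomic lineage, from the root down to the species


def select_hits(hits, max_evalue=None, score_range=None,
                similarity_range=None, coverage_range=None, taxon=None):
    """Return the hits that satisfy every criterion that was supplied.

    Ranges are inclusive (low, high) pairs; a criterion left as None is ignored.
    """
    def in_range(value, bounds):
        return bounds is None or bounds[0] <= value <= bounds[1]

    selected = []
    for hit in hits:
        if max_evalue is not None and hit.evalue > max_evalue:
            continue
        if not in_range(hit.score, score_range):
            continue
        if not in_range(hit.similarity, similarity_range):
            continue
        if not in_range(hit.coverage, coverage_range):
            continue
        # Taxonomy-based selection: keep hits whose lineage contains the taxon.
        if taxon is not None and taxon not in hit.lineage:
            continue
        selected.append(hit)
    return selected


# Made-up hits, for illustration only.
hits = [
    Hit("Homo sapiens", 1e-80, 310.0, 98.0, 0.99,
        ("Eukaryota", "Metazoa", "Homo sapiens")),
    Hit("Mus musculus", 1e-75, 295.0, 95.0, 0.97,
        ("Eukaryota", "Metazoa", "Mus musculus")),
    Hit("Arabidopsis thaliana", 1e-20, 110.0, 55.0, 0.60,
        ("Eukaryota", "Viridiplantae", "Arabidopsis thaliana")),
    Hit("Escherichia coli", 1e-3, 40.0, 35.0, 0.30,
        ("Bacteria", "Proteobacteria", "Escherichia coli")),
]

metazoa = select_hits(hits, max_evalue=1e-10,
                      coverage_range=(0.5, 1.0), taxon="Metazoa")
print([h.name for h in metazoa])  # ['Homo sapiens', 'Mus musculus']
```

Combining the criteria with a logical AND is an assumption of this sketch; it is one simple way to narrow a selection step by step.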
allow filtering of the BLAST hits before phylogenetic reconstruction, others do not.
PhyloGena is a standalone application for phylogenetic annotation of unknown sequences [8] and implements
an automated intelligent filtering of BLAST hits before phylogenetic reconstruction. In contrast with
BLAST-Explorer, the hit filtering method is optimized for sequence annotation and does not enable
interactive and progressive refinement of the sequence dataset. Furthermore, PhyloGena does not allow
retrieving the selected sequences for external analysis.
PhyloGenie is also a standalone application, for automated phylome generation and analysis [9]. Because the
principal strength of PhyloGenie is to automatically produce a large number of phylogenetic analyses in
batch, it does not allow interactive filtering of BLAST hits before phylogenetic reconstruction. PhyloGenie
is a command-line driven pipeline, requiring at least some familiarity with UNIX and command-line tools.
PhyloBLAST [10] and the NCBI BLAST server [11] are the two web services that have the most in common with
BLAST-Explorer. They produce an enriched BLAST output and allow selection of hits using various criteria.
The PhyloBLAST server is apparently no longer maintained. PhyloBLAST only allowed comparing a protein
sequence against a protein database using BLASTP, whereas BLAST-Explorer allows nucleotide/nucleotide,
protein/protein and translated nucleotide/protein comparisons. Its tools for selecting hits before
phylogenetic reconstruction are less versatile than those proposed by BLAST-Explorer (selection based on
species names and sequence description). The NCBI BLAST service also provides several tools for selecting
and retrieving matching sequences from the BLAST output; a distance tree of the BLAST hits can also be
calculated. Here again the hit selection tools are more limited than in BLAST-Explorer (simple check boxes
beside sequence descriptions). Furthermore, the image of the distance tree does not allow interactive
selection of the BLAST hits. This makes selection on phylogenetic criteria less straightforward.
The principal strength of BLAST-Explorer is the flexibility of the sequence selection process and the
richness of the information displayed on screen. However, BLAST-Explorer does not propose pre-defined
automated methods of hit selection such as those in PhyloGena. Rather, BLAST hit selection is
multi-dimensional and mainly human-driven through an interactive graphical interface, in order to respond
to a wide range of sequence selection strategies. Another feature that differentiates BLAST-Explorer from
other software is that it is entirely web-based. Thus no installation on a personal computer and no regular
update of the sequence databases are required.
The BLAST-Explorer output includes a phylogenetic representation of the BLAST hits (i.e., the similarity
tree) that aims at helping in the hit selection process. It is important to note that this tree is not
optimized for phylogenetic accuracy. Rather, we opted for a fast tree reconstruction strategy that is
nevertheless sufficiently robust to provide an approximate phylogenetic position of the BLAST hits. Thus we
advise users to use external specialized software if they want to improve or confirm the accuracy of the
phylogenetic tree.
Finally, it is important to note that, in some phylogenetic analyses, what matters is a correct distinction
between orthologous and paralogous sequences.

Conclusions
BLAST-Explorer provides a simple, intuitive and interactive graphical representation of the BLAST results
that can greatly help biologists in building their sequence datasets prior to phylogenetic studies.

Availability and requirements
• Project name: BLAST-Explorer
• Project home page: http://www.phylogeny.fr (direct link:
http://www.phylogeny.fr/version2_cgi/one_task.cgi?task_type=blast)
• Operating system(s): Platform independent
• Programming language: Perl/CGI, JavaScript
• Other requirements: best viewed on a 19-inch (or larger) screen
• Any restrictions to use by non-academics: None

Acknowledgements
This project was supported by funding from the ‘Réseau National des Génopoles’ (RNG), “Infrastructures en
Biologie Santé et Agronomie” (GIS-IBISA) and CNRS.
Author details
1 Information Génomique Structurale - CNRS-UPR2589, Institut de Microbiologie de la Méditerranée - IFR 88,
163 Avenue de Luminy, 13009 Marseille, France. 2 Current Address: INRA - UMR 1097, DIA-PC “Diversité et
adaptation des Plantes Cultivées”, 2 Place P. Viala, 34060 Montpellier - France.

Authors’ contributions
AD carried out most of the programming work and drafted the manuscript. SA participated in the programming.
JMC participated in the design and coordination of the project and drafted the manuscript. GB conceived of
the study, participated in its design and coordination, participated in the programming and drafted the
manuscript. All authors read and approved the final manuscript.

Received: 25 August 2009 Accepted: 12 January 2010 Published: 12 January 2010

References
1. Chen F, Mackey AJ, Vermunt JK, Roos DS: Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One 2007, 2(4).
2. Ebersberger I, Strauss S, von Haeseler A: HaMStR: Profile hidden markov model based search for orthologs in ESTs. BMC Evolutionary Biology 2009, 9(1):157.
3. Schreiber F, Pick K, Erpenbeck D, Worheide G, Morgenstern B: OrthoSelect: a protocol for selecting orthologous groups in phylogenomics. BMC Bioinformatics 2009, 10(1):219.
4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
5. Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S, Lefort V, Lescot M, et al: Phylogeny.fr: robust phylogenetic analysis for the non-specialist. Nucleic Acids Res 2008, 36(Web Server issue):W465-469.
6. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-4680.
7. Chevenet F, Brun C, Banuls AL, Jacq B, Christen R: TreeDyn: towards dynamic graphics and annotations for analyses of trees. BMC Bioinformatics 2006, 7:439.
8. Hanekamp K, Bohnebeck U, Beszteri B, Valentin K: PhyloGena - a user-friendly system for automated phylogenetic annotation of unknown sequences. Bioinformatics 2007, 23(7):793.
9. Frickey T, Lupas AN: PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res 2004, 32(17):5231-5238.
10. Brinkman FSL, Wan I, Hancock REW, Rose AM, Jones SJ: PhyloBLAST: facilitating phylogenetic analysis of BLAST results. Bioinformatics 2001, 17:385-387.
11. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S: Database resources of the national center for biotechnology information. Nucleic Acids Res 2007, 35(Database issue):D5.
doi:10.1186/1471-2148-10-8
Cite this article as: Dereeper et al.: BLAST-EXPLORER helps you building
datasets for phylogenetic analysis. BMC Evolutionary Biology 2010 10:8.