BPGA User Manual


BPGA: Bacterial Pan Genome Analysis Pipeline

User Manual

Developed by: Narendrakumar M. Chaudhari¹, Vinod Kumar Gupta¹ & Chitra Dutta*
Structural Biology & Bioinformatics Division, CSIR- Indian Institute of
Chemical Biology, 4, Raja S. C. Mullick Road, Kolkata 700032, India

Mailing addresses:
NMC: naren.niper@gmail.com
VKG: vinodgupta299@gmail.com
CD: cdutta@iicb.res.in, cdutta.iicb@gmail.com

¹ NMC & VKG contributed equally to this work.
* Corresponding author

Keywords: Bioinformatics, Pan-genome Analysis, Comparative Genomics

Address of the Corresponding Author:


Structural Biology & Bioinformatics Division, CSIR-Indian Institute of Chemical Biology,
4, Raja S. C. Mullick Road, Kolkata 700032, India
E-mail: cdutta@iicb.res.in, cdutta.iicb@gmail.com
Phone: 91 33 2499 5812, Fax: 91 33 2472 3967, 91 33 2473 5197
Description
BPGA is a Perl-based pipeline that uses protein clustering data to perform
complete pan-genome analysis of bacterial species. BPGA can process the output
of three major clustering tools (USEARCH, CD-HIT and OrthoMCL) to obtain
pan-genome profiles of bacterial gene pools.

Availability:
http://sourceforge.net/projects/bpgatool/
http://www.iicb.res.in/bpga/index.html
Installation
Installing BPGA is simple.
1. Download the installer for Windows or Linux from our SourceForge page:
http://sourceforge.net/projects/bpgatool/
2. Run the installer as Administrator. It will extract the files to a local
folder. The executables are inside the bin folder.
3. BPGA is written in Perl but bundled as an executable, so no Perl modules
need to be installed.
(To run BPGA from the Perl source instead, install the following Perl modules
manually.)
Win32::GUI (For Windows only)
Prima (For Linux only)
Term::ANSIScreen
File::chdir
File::Remove
Sort::Fields
Statistics::Basic
Statistics::Descriptive
Bio::Phylo
Data::Dumper

Other requirements
• gnuplot (4.6.6) must be installed for plotting graphs. Windows users can
download either the 32-bit or the 64-bit version.
Linux users can install gnuplot from the terminal with the command:
sudo apt-get install gnuplot-4.6.6
ps2pdf (part of Ghostscript) must also be installed for proper plotting, with
the command:
sudo apt-get install ghostscript
• BPGA uses USEARCH as its default clustering tool. Users need to obtain
their own licensed Windows/Linux copy, freely available at:
http://www.drive5.com/usearch/download.html
Rename the downloaded file to usearch.exe (for Windows) or usearch (for Linux)
and copy it into the bin folder.
Note: The 32-bit version is freely available. It also works on 64-bit systems.
Note: For USEARCH to work properly on Windows, check that the required system
file vcomp100.dll is present in the Windows\System32 or System64 folder of
your computer. If it is missing, copy it there. It is available at:
http://www.drive5.com/usearch/manual/vcomp100.html
• MUSCLE is used for alignments and tree generation. It is provided with
the package.
• rsvg-convert.exe is required to handle SVG image data. It is also
provided with the BPGA package.
A Linux build of this executable is not provided. To run rsvg-convert.exe on a
Linux system, install wine with the command:
sudo add-apt-repository ppa:ubuntu-wine/ppa -y && sudo apt-get update && sudo apt-get install wine
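
Before the first run it can help to confirm that the external tools listed above are reachable. The following is a minimal sketch, not part of BPGA: it assumes the tools are installed under the command names gnuplot, ps2pdf and wine and that they are on the PATH. The function name missing_tools is hypothetical.

```python
import shutil

# Command names assumed from the requirements above; adjust if your
# installation uses different names.
REQUIRED_TOOLS = ["gnuplot", "ps2pdf", "wine"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools were found on the PATH.")
```

On Windows, wine is not needed and can be removed from the list.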

Steps to run

BPGA has a simple, user-friendly command-line interface, and it is best run
from the command line/terminal.

• Option-1 (INPUT PREPARATION FOR CLUSTERING): Allows the user to prepare
input for clustering from different types of sequence files. The user can
select multiple files from the file selection dialog. Allowed formats are
*.faa (NCBI protein FASTA), *.pep.fsa (HMP protein FASTA) or any other protein
FASTA file, and *.gbk/*.gb (GenBank files). This step generates a single
sequence file, INPUT_all.faa, which will be used for clustering. The list of
organisms is written to the list file. Dataset.xls holds the organism details
for reference.

Note: BPGA treats each file as a separate organism. If an organism has multiple
files (chromosomes), the user should concatenate them into a single file for
that organism (applicable to all formats).
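
The concatenation described in the note can be done with any tool. Here is a minimal sketch, not part of BPGA, in which the function name and the example file names are hypothetical:

```python
from pathlib import Path


def concatenate_files(parts, destination):
    """Join several sequence files of one organism into a single file."""
    with open(destination, "w") as out:
        for part in parts:
            text = Path(part).read_text()
            out.write(text)
            # Make sure the next file starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return destination
```

For example, concatenate_files(["chromosome1.faa", "plasmid1.faa"], "organism1.faa") would produce one file for the organism, which can then be selected in Option-1.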

All files in a dataset must be in the same format. The user cannot use GenBank
files for some organisms and FASTA files for others. However, protein FASTA
files of any type can be used together (via the Any Protein FASTA File option
of the Input Preparation step).
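
The uniform-format rule can be checked before starting BPGA. This is a minimal sketch, assuming the format can be judged from the file extension alone and that anything other than *.gbk/*.gb is a protein FASTA file. The function names are hypothetical and BPGA's own checks may differ.

```python
def input_kind(filename):
    """Classify a file as 'genbank' or 'protein_fasta' by its extension."""
    name = filename.lower()
    if name.endswith((".gbk", ".gb")):
        return "genbank"
    # Assumption: every other file is a protein FASTA file of some type.
    return "protein_fasta"


def is_uniform(filenames):
    """True if all files fall into the same format class."""
    return len({input_kind(name) for name in filenames}) <= 1
```

With this sketch, a mix of *.faa and *.pep.fsa files passes, while a mix of *.gbk and *.faa files does not.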

• Option-2 (DEFAULT PAN GENOME ANALYSIS): Allows the user to perform
pan-genome analysis on the data, either by clustering with USEARCH or by
processing data pre-clustered with CD-HIT or OrthoMCL.

Please note that the user must use the input file (INPUT_all.faa, generated by
Option-1) when clustering with CD-HIT (online server/offline package) or with
the OrthoMCL pipeline, using the desired options.

The CD-HIT Web Server is available at:
http://weizhongli-lab.org/cdhit_suite/cgi-bin/index.cgi?cmd=cd-hit .
The user is free to set parameters as desired.

When clustering with USEARCH, the identity cut-off can be set by the user (see
picture below).

• Option-3 (ONE CLICK MODE): Allows the user to perform all the analyses in
a single step using the default parameters:
   - Clustering: USEARCH (identity cut-off = 50%)
   - No. of combinations: 30 for fewer than 20 genomes and 20 for 20-50
     genomes.
   - Atypical GC Content Analysis: 2 × δ (standard deviation)
   - Type of phylogeny tree: Neighbor Joining (NJ) tree.
   - KEGG/COG Functional analysis: performed only if the dataset contains
     fewer than 50 genomes.
   - Subset Analysis: NA

After DEFAULT PAN GENOME ANALYSIS has completed, the ADVANCED ANALYSIS OPTIONS
become available in the next step.
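
One plausible reading of the atypical GC default listed above is that a gene is flagged when its GC content lies more than 2 standard deviations from the mean. The sketch below illustrates that idea only. The manual does not say which reference set or which standard deviation formula BPGA uses, so the population standard deviation over the supplied genes is an assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """GC content of a nucleotide sequence, in percent."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc(sequences, n_sd=2.0):
    """Names of sequences whose GC content is more than n_sd SD from the mean.

    `sequences` maps a gene name to its nucleotide sequence.
    """
    gc = {name: gc_content(seq) for name, seq in sequences.items()}
    values = list(gc.values())
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD over the supplied genes
    return sorted(name for name, value in gc.items() if abs(value - mu) > n_sd * sd)
```

With nine genes at 50% GC and one at 100% GC, only the last one is returned; the mean is 55, the standard deviation is 15, and 45 exceeds the cut-off of 30.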

The user may perform any of the 5 analyses one by one. The completion status
of each analysis is displayed in brackets ( _NOT DONE_ or _DONE_ ).
After completing the desired analyses, the user should exit by typing 'exit'
and then closing the terminal.

For MLST analysis, BPGA will halt and allow the user to make the required
changes. To run the MLST analysis on user-defined housekeeping genes, check
"mlst_core.txt" for gene details and copy the respective cluster IDs from the
first column to "CORE_MLST.txt". If you press Enter to continue without this
step, 20 random clusters will be used for the core phylogeny.
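
Copying the cluster IDs can be done by hand in a text editor. As an alternative, here is a minimal sketch that picks the IDs by keyword. It assumes that mlst_core.txt is tab-delimited with the cluster ID in the first column and descriptive text in the remaining columns; the exact layout is not documented here, and the function name and gene keywords are hypothetical.

```python
def select_cluster_ids(lines, keywords):
    """Return first-column cluster IDs of rows whose other columns mention a keyword."""
    wanted = [keyword.lower() for keyword in keywords]
    ids = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        description = "\t".join(fields[1:]).lower()
        if any(keyword in description for keyword in wanted):
            ids.append(fields[0])
    return ids
```

The selected IDs would then be written to CORE_MLST.txt, for example one per line (also an assumption about the expected layout), before pressing Enter to let BPGA continue.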

Results
The input preparation option produces the input files (INPUT_all.faa,
INPUT_all.ffn) necessary for clustering and a dataset file (DATASET.xls)
containing organism details. A list file, required for further analysis, is
also generated.

The default analysis produces a simple pan/core plot
(Default_Core_Pan_Plot.pdf), the distribution of gene families (Histogram.pdf),
the number of new genes added (New_Genes_Plot.pdf), genome-wise statistics
(stats.xls), representative sequences for core, accessory and unique gene
families (REPSEQ_*.txt) and a tab-delimited pan-matrix (matrix.txt).
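
The core, accessory and unique categories follow from a presence/absence row of the pan-matrix. The sketch below uses the usual pan-genome definitions (core: present in all genomes; unique: present in exactly one; accessory: anything in between). It works on an in-memory row because the exact layout of matrix.txt is not described here, and the function name is hypothetical.

```python
def classify_family(presence):
    """Classify one gene family from its per-genome presence values (0 = absent)."""
    if not presence:
        raise ValueError("presence must contain one value per genome")
    n_present = sum(1 for value in presence if value > 0)
    if n_present == 0:
        return "absent"
    if n_present == len(presence):
        return "core"
    if n_present == 1:
        return "unique"
    return "accessory"
```

For three genomes, [1, 1, 1] is core, [1, 1, 0] is accessory and [0, 0, 1] is unique.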
Advanced options:
1. Pan-core plot with combinations produces a core and pan genome boxplot
(Core_Pan_Plot.pdf) and dot plot (Core_Pan_Dot_Plot.pdf), generated using the
desired number of unique combinations of genomes.
2. Phylogeny trees based on the pan-matrix (Pan_phylogeny.pdf) and on core/MLST
gene/protein sequences (Core_phylogeny.pdf) are generated. The respective *.ph
files are provided for visualization with TreeView
(http://taxonomy.zoology.gla.ac.uk/rod/treeview.html), and *.nwk files are also
available for visualization with TreeGraph2 (http://treegraph.bioinfweb.info/).
3. Atypical GC analysis gives the sequences of core, accessory and unique genes
with atypical (extreme) GC content (*_genes_with_atypical_GC_content.txt).
4. Subset analysis gives the default results for each group in a separate
folder. Group-specific genes are listed in groups_specific_genes.txt.
5. Functional analysis gives the COG and KEGG distribution of the core,
accessory and unique gene families based on representative sequences
(COG_DISTRIBUTION.pdf, COG_DISTRIBUTION_DETAILS.pdf,
KEGG_DISTRIBUTION.pdf, KEGG_DISTRIBUTION_DETAILS.pdf).
COG and KEGG assignments for the representative sequences are also provided
(*_COG_hits3.txt and *_KEGG_hits3.txt).

Additional instructions
For subset analysis, the user must create a text file describing the groups to
be created. Here is an example:

             Organism ID as per dataset list
   Group 1   1    2    3    4
   Group 2   6    7    8    9    13   15
   Group 3   5    10   11   12   14

Here, rows represent groups and each number represents a genome (refer to the
list file created during input preparation). The labels (the column heading
and the "Group" row names) are shown for illustration only; the actual file
should contain only tab-delimited values. A maximum of 10 groups can be formed.
There should be no repeated or invalid IDs.
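
The rules for the groups file can be sketched as a small helper that validates the groups and produces the tab-delimited text. This is an illustration only: the function name is hypothetical, and it assumes one group per line with IDs numbered from 1 as in the list file.

```python
def format_groups(groups, n_genomes=None, max_groups=10):
    """Validate groups of genome IDs and return tab-delimited text, one group per line."""
    if len(groups) > max_groups:
        raise ValueError("at most %d groups can be formed" % max_groups)
    seen = set()
    for group in groups:
        for genome_id in group:
            if genome_id in seen:
                raise ValueError("genome ID %s is repeated" % genome_id)
            if n_genomes is not None and not 1 <= genome_id <= n_genomes:
                raise ValueError("genome ID %s is not in the dataset" % genome_id)
            seen.add(genome_id)
    return "".join("\t".join(str(g) for g in group) + "\n" for group in groups)
```

For the example above, format_groups([[1, 2, 3, 4], [6, 7, 8, 9, 13, 15], [5, 10, 11, 12, 14]], n_genomes=15) returns three lines of tab-separated numbers without any labels, ready to be saved as the groups file.
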
Accepted file formats:

• GBK: Freshly downloaded GenBank files from the NCBI or HMP databases.
• FAA: Freshly downloaded amino acid sequence files from the NCBI database.
• PEP.FSA: Freshly downloaded amino acid sequence files from the HMP DACC
database.
