Evolutionary Mechanisms of Microbial Genomes


International Journal of Evolutionary Biology

Evolutionary Mechanisms
of Microbial Genomes
Guest Editors: Hiromi Nishida, Shinji Kondo, Hideaki Nojiri,
Ken-ichi Noma, and Kenro Oshima
Copyright © 2011 SAGE-Hindawi Access to Research. All rights reserved.

This is a special issue published in volume 2011 of “International Journal of Evolutionary Biology.” All articles are open access articles
distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
International Journal of Evolutionary Biology

Editorial Board
Giacomo Bernardi, USA
Terry A. Burke, UK
Ignacio Doadrio, Spain
Simon Easteal, Australia
Santiago F. Elena, Spain
Renato Fani, Italy
F. González-Candelas, Spain
D. Graur, USA
A. Rus Hoelzel, UK
Kazuho Ikeo, Japan
Yoh Iwasa, Japan
Henrik J. Jensen, UK
Amitabh Joshi, India
Hirohisa Kishino, Japan
A. Moya, Spain
G. Pesole, Italy
I. Popescu, USA
David Posada, Spain
Jeffrey R. Powell, USA
Hudson Kern Reeve, USA
Y. Satta, Japan
Koji Tamura, Japan
Yoshio Tateno, Japan
E. N. Trifonov, Israel
Eske Willerslev, Denmark
Shozo Yokoyama, USA
Contents
Evolutionary Mechanisms of Microbial Genomes, Hiromi Nishida, Shinji Kondo, Hideaki Nojiri,
Ken-ichi Noma, and Kenro Oshima
Volume 2011, Article ID 319479, 2 pages

Phylogenetic and Guanine-Cytosine Content Analysis of Symbiobacterium thermophilum Genes,
Hiromi Nishida and Choong-Soo Yun
Volume 2011, Article ID 634505, 5 pages

Unique Evolution of Symbiobacterium thermophilum Suggested from Gene Content and Orthologous
Protein Sequence Comparisons, Kenro Oshima, Kenji Ueda, Teruhiko Beppu, and Hiromi Nishida
Volume 2011, Article ID 376831, 8 pages

Prevalence of Mycobacterium tuberculosis in Taiwan: A Model for Strain Evolution Linked to Population
Migration, Horng-Yunn Dou, Shu-Chen Huang, and Ih-Jen Su
Volume 2011, Article ID 937434, 6 pages

Distribution of Genes Encoding Nucleoid-Associated Protein Homologs in Plasmids, Toshiharu Takeda,
Choong-Soo Yun, Masaki Shintani, Hisakazu Yamane, and Hideaki Nojiri
Volume 2011, Article ID 685015, 30 pages

New Insights on the Evolutionary History of Aphids and Their Primary Endosymbiont Buchnera
aphidicola, Vicente Pérez-Brocal, Rosario Gil, Andrés Moya, and Amparo Latorre
Volume 2011, Article ID 250154, 9 pages

Parallel Evolution and Horizontal Gene Transfer of the pst Operon in Firmicutes from Oligotrophic
Environments, Alejandra Moreno-Letelier, Gabriela Olmedo, Luis E. Eguiarte, Leon Martinez-Castilla,
and Valeria Souza
Volume 2011, Article ID 781642, 10 pages

Ectopic Gene Conversions in the Genome of Ten Hemiascomycete Yeast Species, Robert T. Morris and
Guy Drouin
Volume 2011, Article ID 970768, 11 pages

Sequence Analysis of SSR-Flanking Regions Identifies Genome Affinities between Pasture Grass Fungal
Endophyte Taxa, Eline van Zijll de Jong, Kathryn M. Guthridge, German C. Spangenberg,
and John W. Forster
Volume 2011, Article ID 921312, 11 pages

Evolutionary Origins of the Fumonisin Secondary Metabolite Gene Cluster in Fusarium verticillioides
and Aspergillus niger, Nora Khaldi and Kenneth H. Wolfe
Volume 2011, Article ID 423821, 7 pages

Computational Analysis Suggests That Lyssavirus Glycoprotein Gene Plays a Minor Role in Viral
Adaptation, Kevin Tang and Xianfu Wu
Volume 2011, Article ID 143498, 11 pages

Baculovirus: Molecular Insights on Their Diversity and Conservation, Solange Ana Belen Miele,
Matías Javier Garavaglia, Mariano Nicolás Belaich, and Pablo Daniel Ghiringhelli
Volume 2011, Article ID 379424, 15 pages
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 319479, 2 pages
doi:10.4061/2011/319479

Editorial
Evolutionary Mechanisms of Microbial Genomes

Hiromi Nishida,1 Shinji Kondo,2 Hideaki Nojiri,3 Ken-ichi Noma,4 and Kenro Oshima5

1 Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, The University of Tokyo,
Tokyo 113-8657, Japan
2 Laboratory for Cellular Systems Modeling, RIKEN Research Center for Allergy and Immunology, Kanagawa 230-0045, Japan
3 Laboratory of Environmental Biochemistry, Biotechnology Research Center, The University of Tokyo, Tokyo 113-8657, Japan
4 Gene Expression and Regulation Program, The Wistar Institute, Philadelphia, PA 19104, USA
5 Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo,
Tokyo 113-8657, Japan

Correspondence should be addressed to Hiromi Nishida, hnishida@iu.a.u-tokyo.ac.jp

Received 3 April 2011; Accepted 3 April 2011

Copyright © 2011 Hiromi Nishida et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Sequencing of more than 1,600 microbial genomes has been completed, and rigorous studies are underway to reveal the mechanisms of evolution which gave rise to the great variety in combinations of gene functions encoded on these genomes. Although comparative genomics based on orthologous genes has elucidated a great deal of the phylogenetic relationships among the sequenced genomes, the mechanisms which have shaped the current states of the microbial genomes remain elusive. Particularly, the contribution of external forces such as horizontal gene transfer and pressure from environmental factors to genome evolution has yet to be investigated. This special issue presents six, three, and two papers related, respectively, to bacterial, fungal, and viral evolutionary mechanisms.

Among the six papers regarding bacterial evolution, H. Nishida and C.-S. Yun in “Phylogenetic and guanine-cytosine content analysis of Symbiobacterium thermophilum genes” reported a mechanism of the Symbiobacterium genome which increased the GC content of horizontally transferred genes and thereby maintained the genome with a high GC content. K. Oshima et al. in “Unique evolution of Symbiobacterium thermophilum suggested from gene content and orthologous protein sequence comparisons” performed phylogenetic analyses of more than 50 Clostridia by comparing gene content and orthologous protein sequence and demonstrated that these two phylogenetic relationships are topologically different, strongly suggesting that each Clostridia species has a species-specific gene content, likely due to frequent genetic exchanges or gene losses which have occurred during evolution. H.-Y. Dou et al. in “Prevalence of Mycobacterium tuberculosis in Taiwan: a model for strain evolution linked to population migration” presented an association study of distinct Mycobacterium tuberculosis strains prevalent in Taiwan with historical migrations of different ethnic populations based on a comparison of the tandem repeat sequences as genetic markers. T. Takeda et al. in “Distribution of genes encoding nucleoid-associated protein homologs in plasmids” reported biases associated with certain bacterial plasmids, that is, an increase of nucleoid-associated protein genes in large bacterial plasmids and a low GC content of plasmids encoding the histone-like nucleoid structuring protein (H-NS). V. Pérez-Brocal et al. in “New insights on the evolutionary history of aphids and their primary endosymbiont Buchnera aphidicola” presented a study which supports the hypotheses of divergence of Buchnera aphidicola from their host lineages during an early Cretaceous period by demonstrating a closer relationship of the subfamily Eriosomatinae with Lachninae than with Aphidinae. A. Moreno-Letelier et al. in “Parallel evolution and horizontal gene transfer of the pst operon in Firmicutes from oligotrophic environments” demonstrated that the phosphate transport system gene operon of Firmicutes has two highly divergent clades which do not correlate either with the type of habitat or with a phylogenetic congruence and proposed parallel evolution of this gene after horizontal gene transfer events.

Of the three papers dealing with fungal evolution, R. T. Morris and G. Drouin in “Ectopic gene conversions in the genome of ten hemiascomycete yeast species” found that ectopic gene conversions in the genome of ten hemiascomycetes tend to occur more frequently between closely linked genes and proposed that the mechanisms responsible for the loss of introns in Saccharomyces cerevisiae were also involved in the 3′-end gene conversion bias observed among the paralogs. E. van Zijll de Jong et al. in “Sequence analysis of SSR-flanking regions identifies genome affinities between pasture grass fungal endophyte taxa” demonstrated that some asexual Neotyphodium species arose following interspecies hybridization between sexual Epichloë ancestors and characterized Neotyphodium isolates based on sequence analysis of genomic regions flanking simple sequence repeats. N. Khaldi and K. H. Wolfe in “Evolutionary origins of the fumonisin secondary metabolite gene cluster in Fusarium verticillioides and Aspergillus niger” compared the fumonisin secondary metabolite gene cluster and proposed that the gene cluster was horizontally transferred to Aspergillus niger from a Sordariomycete.
As for the two papers on viral evolution, K. Tang and
X. Wu in “Computational analysis suggests that Lyssavirus
glycoprotein gene plays a minor role in viral adaptation”
found no significant evidence of positive selection on any
site of the Lyssavirus glycoprotein-coding gene (except for
AY987478) and proposed that the glycoprotein gene has been
under purifying selection and that the evolution of this gene
may not play a significant role in Lyssavirus adaptation.
S. A. B. Miele et al. in “Baculovirus: molecular insights on
their diversity and conservation” reported evidence that
supports the current division of the Baculoviridae into
four genera, Alpha-, Beta-, Gamma-, and Deltabaculovirus,
based on comparative studies of 57 genome sequences from
baculoviruses.
In closing this introduction to the special issue, we
would like to express our full appreciation to all the authors
and reviewers for their enormous efforts, which have made
the timely completion of our assignment possible. We
sincerely hope that this special issue will stimulate further
investigation of the evolutionary mechanisms of microbial
genomes.
Hiromi Nishida
Shinji Kondo
Hideaki Nojiri
Ken-ichi Noma
Kenro Oshima
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 634505, 5 pages
doi:10.4061/2011/634505

Research Article
Phylogenetic and Guanine-Cytosine Content Analysis of
Symbiobacterium thermophilum Genes

Hiromi Nishida and Choong-Soo Yun


Agricultural Bioinformatics Research Unit, Graduate School of Agriculture and Life Sciences, The University of Tokyo, Bunkyo-ku,
Tokyo 113-8657, Japan

Correspondence should be addressed to Hiromi Nishida, hnishida@iu.a.u-tokyo.ac.jp

Received 10 September 2010; Revised 20 October 2010; Accepted 5 November 2010

Academic Editor: Shinji Kondo

Copyright © 2011 H. Nishida and C.-S. Yun. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Although the bacterium Symbiobacterium thermophilum has a genome with a high guanine-cytosine (GC) content (69%), it
belongs to a low GC content bacterial group. We detected only 18 low GC content regions with 5 or more consecutive genes whose
GC contents were below 65% in the genome of this organism. S. thermophilum has 66 transposase genes, which are markers of
transposable genetic elements, and 38 (58%) of them were located in the low GC content regions, suggesting that Symbiobacterium
has a gene silencing system similar to that of Salmonella. The top hit (best match) analyses for each Symbiobacterium protein showed that
putative horizontally transferred genes and vertically inherited genes are scattered across the genome. Approximately 25% of the
3338 Symbiobacterium proteins have the highest similarity with a protein of a phylogenetically distant organism. The putative
horizontally transferred genes also have a high GC content, suggesting that Symbiobacterium has gained many DNA fragments
from phylogenetically distant organisms during the early stage of Firmicutes evolution. After acquiring genes, Symbiobacterium
increased the GC content of the horizontally transferred genes and thereby maintained a genome with a high GC content.

1. Introduction

Symbiobacterium thermophilum is a syntrophic bacterium that grows effectively when cocultured with a cognate Geobacillus sp. [1]. Because of the loss of carbonic anhydrase in the course of Symbiobacterium evolution [2, 3], the major growth factor for this organism is CO2 generated by the growth of Geobacillus [4]. S. thermophilum has a 3.57 Mbp circular genome that consists of 3338 protein-coding sequences [3]. On the basis of comparative genomic studies, Symbiobacterium is classified as a member of the class Clostridia [3, 5]. Although Symbiobacterium phylogenetically belongs to Clostridia (low guanine-cytosine (GC) content bacterial group), the species S. thermophilum has a genome with a high GC content (69%).

GC content is commonly used as a marker in bacterial systematics; for example, actinobacteria have a high GC content genome, and clostridia have a low GC content genome. This variation in nucleotide content in bacteria is not clearly understood [6–8]. Analyzing a high GC content genome of a bacterium that belongs to a low GC content group or vice versa is useful and important. Symbiobacterium belongs to the class Clostridia (low GC content group), but its genome has a high GC content (69%). One possibility is that Symbiobacterium has acquired this high GC content from DNA fragments through horizontal gene transfer [9, 10], homologous gene recombination [11, 12], or both. Another possibility is that Symbiobacterium has increased the GC content of the acquired genes and maintained the high GC content during evolution. In this study, we determined the GC content of each gene of S. thermophilum. In addition, we identified the horizontally transferred and vertically inherited genes. In order to elucidate why the Symbiobacterium genome has a high GC content, we compared the GC contents of the horizontally transferred genes with those of the vertically inherited genes.
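The per-gene GC content referred to in the Introduction can be illustrated with a minimal sketch. The paper does not describe its own software, so the function name `gc_content`, the handling of ambiguous bases, and the toy sequences below are assumptions made purely for illustration.

```python
def gc_content(sequence: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N or gap characters do not dilute the value
    # (an assumption; the paper does not state how ambiguous bases were treated).
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


# Hypothetical toy sequences; real input would be the coding sequences of S. thermophilum.
genes = {"toy_high_gc": "GCGCGGCCAT", "toy_low_gc": "ATATTAGCAA"}
gc_by_gene = {name: gc_content(seq) for name, seq in genes.items()}
```

Applied to every protein-coding sequence, a table like `gc_by_gene` would give the values plotted against gene number in Figure 3.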
Figure 1 data (category, number of protein-coding genes):
Clostridia, 1231
Bacilli, 591
No hit, 392
Symbiobacterium, 341
Proteobacteria, 231
Chloroflexi, 173
Actinobacteria, 78
Archaea/Eukaryota, 63
Other bacteria, 45
Deinococcus-Thermus, 40
Cyanobacteria, 35
Bacteroidetes/Chlorobi, 30
Thermotogae, 27
Fibrobacteres/Acidobacteria, 20
Thermobaculum, 19
Aquificae, 11
Dictyoglomi, 11

Figure 1: Pie chart of the categories of the 3338 Symbiobacterium protein-coding genes. A BLAST search was conducted for all proteins from
147 eukaryotes, 1047 bacteria, and 84 archaea in the KEGG database (http://www.kegg.jp) considering the parameter values given on the
GenomeNet website (http://www.genome.jp). The query amino acid sequence was each protein of Symbiobacterium thermophilum. The top
hit (best match) for each Symbiobacterium protein was recorded. However, if the top hit was absent or if the E-value of the top hit exceeded
0.1, the Symbiobacterium protein was considered to have no similar protein (category, “No hit”). We categorized the 3338 Symbiobacterium
proteins into the following 17 categories: “Actinobacteria,” “Aquificae,” “Archaea/Eukaryota,” “Bacilli,” “Bacteroidetes/Chlorobi,” “Chlo-
roflexi,” “Clostridia,” “Cyanobacteria,” “Deinococcus-Thermus,” “Dictyoglomi,” “Fibrobacteres/Acidobacteria,” “No hit,” “Proteobacteria,”
“Symbiobacterium,” “Thermobaculum,” “Thermotogae,” and “Other bacteria.” If the top hit was another protein(s) of Symbiobacterium,
then the query protein was considered to belong to the category “Symbiobacterium.”
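The categorization rule in the Figure 1 caption can be sketched as follows. The E-value cutoff of 0.1 and the category counts are taken from the caption and the pie chart; the function `classify_top_hit`, its arguments, and the choice of which categories count as "phylogenetically distant" are assumptions for illustration, not the authors' code.

```python
from typing import Optional

E_VALUE_CUTOFF = 0.1  # threshold stated in the Figure 1 caption


def classify_top_hit(top_hit_category: Optional[str], e_value: Optional[float]) -> str:
    """Assign a query protein to a category from its best BLAST match.

    top_hit_category is the (hypothetical) lineage label of the organism that
    holds the top hit, or None when BLAST returned nothing.
    """
    if top_hit_category is None or e_value is None or e_value > E_VALUE_CUTOFF:
        return "No hit"
    return top_hit_category


# Category counts as read from the Figure 1 pie chart.
FIGURE1_COUNTS = {
    "Clostridia": 1231, "Bacilli": 591, "No hit": 392, "Symbiobacterium": 341,
    "Proteobacteria": 231, "Chloroflexi": 173, "Actinobacteria": 78,
    "Archaea/Eukaryota": 63, "Other bacteria": 45, "Deinococcus-Thermus": 40,
    "Cyanobacteria": 35, "Bacteroidetes/Chlorobi": 30, "Thermotogae": 27,
    "Fibrobacteres/Acidobacteria": 20, "Thermobaculum": 19,
    "Aquificae": 11, "Dictyoglomi": 11,
}

total = sum(FIGURE1_COUNTS.values())

# Assumption: "distant" means every category other than these four.
NOT_DISTANT = {"Clostridia", "Bacilli", "Symbiobacterium", "No hit"}
distant = sum(n for cat, n in FIGURE1_COUNTS.items() if cat not in NOT_DISTANT)
distant_share = 100.0 * distant / total
```

Under that assumption the counts sum to the 3338 proteins analyzed, and the distant categories hold 783 genes, about 23.5% of the total, which is in line with the "approximately 25%" quoted in the text.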

2. Materials and Methods

In this study, we classified the 3338 protein-coding sequences of S. thermophilum on the basis of the amino acid sequence of each coded protein. A BLAST search was conducted for all proteins from 147 eukaryotes, 1047 bacteria, and 84 archaea in the KEGG database (http://www.kegg.jp/) considering the parameter values given on the GenomeNet website (http://www.genome.jp/). The top hit (best match) for each Symbiobacterium protein was recorded. However, if the top hit was missing or if the E-value of the top hit exceeded 0.1, we considered the Symbiobacterium protein to have no similar protein (category, “No hit”). If the top hit was another protein(s) of Symbiobacterium (category, “Symbiobacterium”), the protein-coding gene was considered to have duplicated during evolution. Top hit analysis at the genome level is a powerful tool for elucidating the phylogenetic lineage of an organism [13, 14].

3. Results and Discussion

On the basis of the phylogenetic lineage of the organism possessing the top hit protein shown in the BLAST result, the 3338 S. thermophilum protein-coding genes were classified into 17 categories (Figure 1). The largest category was “Clostridia,” and the second largest category was “Bacilli.” This is consistent with the results of previous phylogenetic analyses [3]. The third and fourth largest categories were “No hit” and “Symbiobacterium,” respectively (Figure 1). Most genes belonging to the category “Symbiobacterium” might share their origin with other genes of the same category because 300 of the 341 genes had a protein sequence similar to that of other organisms that appeared below the top hit of the BLAST result (Table 1 (see Supplementary Material available online at doi:10.4061/2011/634505)). For example, most transposable elements belonged to “Symbiobacterium,” indicating that they were duplicated on the Symbiobacterium genome after invasion.

When each gene was plotted on the basis of its category, we detected 52 clusters containing 5 or more consecutive genes belonging to “Clostridia” (Figure 2, pink regions in Supplementary Table 1). These conserved gene clusters were probably not acquired by horizontal gene transfer and are strongly considered to be vertically inherited. The putative vertically inherited genes were scattered across the genome of S. thermophilum (Figure 2). In addition, we detected 18 low GC content regions containing 5 or more consecutive genes whose GC contents were below 65% (Figure 3, yellow regions in Supplementary Table 1). These low GC content regions

Figure 2: Plots of the location and category of the Symbiobacterium protein-coding genes. X-axis: STH gene number. Y-axis: 0: category
“No hit” (Symbiobacterium-specific genes); 1: category “Symbiobacterium” (multiple copied genes); 2: category “Clostridia;” 3: category
“Bacilli”; 4: categories “Actinobacteria,” “Aquificae,” “Bacteroidetes/Chlorobi,” “Chloroflexi,” “Cyanobacteria,” “Deinococcus-Thermus,”
“Dictyoglomi,” “Fibrobacteres/Acidobacteria,” “Proteobacteria,” “Thermobaculum,” “Thermotogae,” and “Other bacteria”; 5: category
“Archaea/Eukaryota.” The italicized numbers indicate 52 clusters (pink) containing 5 or more consecutive genes belonging to the category
“Clostridia.”


Figure 3: Plots of the location and GC content of the Symbiobacterium protein-coding genes. X-axis: STH gene number. Y-axis: GC content
(%) of gene. Red indicates the putative transposase-coding genes, and blue indicates the group II intron-coding maturase genes. The italicized
numbers indicate 18 low GC content regions containing 5 or more consecutive genes whose GC contents are below 65%.
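The regions marked in Figures 2 and 3 are runs of 5 or more consecutive genes that share a property (category “Clostridia,” or GC content below 65%). A minimal sketch of such a run detector is given here; the function name `find_runs`, the toy GC values, and the treatment of the gene order as linear rather than circular are assumptions, since the paper does not describe its implementation.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def find_runs(values: Sequence[T],
              predicate: Callable[[T], bool],
              min_length: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, inclusive, of runs of consecutive
    items that satisfy predicate and are at least min_length long."""
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current run began, if any
    for i, value in enumerate(values):
        if predicate(value):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last gene (the genome is treated as linear here).
    if start is not None and len(values) - start >= min_length:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical per-gene GC percentages in genome order.
toy_gc = [70, 64, 63, 60, 62, 61, 71, 64, 64, 72, 60, 61, 62, 63, 64]
low_gc_regions = find_runs(toy_gc, lambda gc: gc < 65.0)
```

With the toy values, the two genes at indices 7 and 8 fall below 65% but are not reported, because the run is shorter than 5 genes. The same function could be given a list of category labels and a predicate such as `lambda c: c == "Clostridia"` to locate the conserved clusters of Figure 2.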

do not overlap with the 52 vertically inherited clusters (Supplementary Table 1). On the basis of the KEGG gene cluster database, we found 12 gene clusters in the 18 low GC content regions (Supplementary Table 2).

Approximately 25% of the 3338 Symbiobacterium protein-coding genes belonged to categories consisting of organisms phylogenetically distant from Symbiobacterium (Figure 1), suggesting that Symbiobacterium frequently acquired genes during evolution. The proportion of horizontally transferred genes in the Symbiobacterium genome is strongly suggested to be the highest among bacteria [15]. These putative horizontally transferred genes are scattered across the genome of S. thermophilum (Figure 2). In addition, considering the species diversification of Bacilli and Clostridia, it is suggested that the categories “Bacilli” and “Clostridia” include not only vertically inherited genes but also horizontally transferred genes.

Transposase genes are generally used as markers of transposable genetic elements [16]. Most transposase-coding genes flank horizontally transferred genes [13]. S. thermophilum has 66 putative transposase-coding genes, of which 38 (58%) are located in the low GC content regions (P-value = 1.3 × 10^−99; Pearson’s chi-square test) (Figure 3), suggesting that Symbiobacterium has a silencing system similar to that of Salmonella [17, 18]. In the silencing system, a histone-like nucleoid structuring (H-NS) protein binds to the region with a low GC content. Functionally similar (H-NS) proteins were reported in Mycobacterium and Pseudomonas [19, 20]. If Symbiobacterium also has such proteins that bind the low GC content regions, the expression of the transposable elements located in these regions might be inhibited. As mentioned above, most transposable elements belong to the category “Symbiobacterium,” which is consistent with the fact that the genes of this category have lower GC contents than those of the other categories (Supplementary Figure 1). The regions consisting of low GC content genes cannot be explained by the directional mutation pressure or amelioration of bacterial genomes [21, 22]. Interestingly, although H-NS proteins bind the low GC content regions in Mycobacterium, Pseudomonas, and Salmonella [17–20], the H-NS protein of Escherichia coli does not specifically bind only these regions [23].

In addition, S. thermophilum has 30 group II intron-encoding maturase genes. Group II introns are transposable elements [24] that encode maturase as an intron-specific splicing factor [25]. The GC content of each maturase gene is approximately 65% (Figure 3). These maturase genes are classified in “Symbiobacterium,” on the basis of amino acid sequence similarity. In contrast to the transposase genes, the group II intron-encoding maturase genes are not located in the 18 low GC content regions (Figure 3). If Symbiobacterium has both an H-NS protein binding the low GC content regions and a gene silencing system similar to that of Mycobacterium, Pseudomonas, and Salmonella, these maturases could be activated and the group II introns could be transposed to the Symbiobacterium genome. Of course, it is also possible that this transposition of the group II introns is inhibited by another gene silencing system.

It is suggested that Symbiobacterium has gained many DNA fragments from phylogenetically distant organisms during the early stage of evolution in the Firmicutes (consisting of Bacilli and Clostridia). As the Symbiobacterium genes of all categories have a high GC content (Supplementary Figure 1), it can be concluded that, after acquiring genes, Symbiobacterium increased the GC content of the horizontally transferred genes and thereby maintained a genome with a high GC content.

In contrast to the Symbiobacterium genome, the Fusobacterium (phylogenetically closely related to Firmicutes) genome has a low GC content (27%) [13]. It is suggested that Fusobacterium has gained many genes from phylogenetically distant organisms [13]. In the course of evolution, Fusobacterium has probably decreased the GC content of the horizontally acquired genes and maintained a genome with a low GC content.

Does Symbiobacterium benefit from maintaining a genome with a high GC content? Considering that CO2 is the major growth factor of Symbiobacterium, its symbiotic partners may not be limited to Geobacillus. Symbiobacterium is widespread in different natural environments [26, 27]. The difference in the genome base compositions between Symbiobacterium and its symbiotic partners may lead to a decrease in the frequency of homologous recombination between the 2 genomes. For example, the 5 sequenced chromosomal genomes of Geobacillus have a GC content ranging from 42.8% to 52.5% (http://insilico.ehu.es/oligoweb/).

In addition, homologous recombination is generally effective for adaptive evolution [11]. However, if the population density is low or the recombining population is rare in the environment, adaptive evolution is hampered [11]. Considering the wide distribution of Symbiobacterium in natural environments, the population size of Symbiobacterium may be adequately large, suggesting that homologous recombination between the Symbiobacterium strains and different symbiotic partners may be effective for adaptive evolution. Thus, it is hypothesized that Symbiobacterium has maintained its extreme genome composition to avoid homologous recombination between its genome and the genomes of different species and to promote homologous recombination between its genome and the genomes of the same species (or genus).

Acknowledgment

The authors thank Professor Teruhiko Beppu for the helpful comments and critical review of the paper.

References

[1] K. Ueda and T. Beppu, “Lessons from studies of Symbiobacterium thermophilum, a unique syntrophic bacterium,” Bioscience, Biotechnology and Biochemistry, vol. 71, no. 5, pp. 1115–1121, 2007.
[2] H. Nishida, T. Beppu, and K. Ueda, “Symbiobacterium lost carbonic anhydrase in the course of evolution,” Journal of Molecular Evolution, vol. 68, no. 1, pp. 90–96, 2009.
[3] K. Ueda, A. Yamashita, J. Ishikawa et al., “Genome sequence of Symbiobacterium thermophilum, an uncultivable bacterium that depends on microbial commensalism,” Nucleic Acids Research, vol. 32, no. 16, pp. 4937–4944, 2004.
[4] T. O. Watsuji, T. Kato, K. Ueda, and T. Beppu, “CO2 supply induces the growth of Symbiobacterium thermophilum, a syntrophic bacterium,” Bioscience, Biotechnology and Biochemistry, vol. 70, no. 3, pp. 753–756, 2006.
[5] G. Ding, Z. Yu, J. Zhao et al., “Tree of life based on genome context networks,” PLoS One, vol. 3, no. 10, Article ID e3357, 2008.
[6] E. P. C. Rocha and E. J. Feil, “Mutational patterns cannot explain genome composition: are there any neutral sites in the genomes of bacteria?” PLoS Genetics, vol. 6, no. 9, Article ID e1001104, 2010.
[7] F. Hildebrand, A. Meyer, and A. Eyre-Walker, “Evidence of selection upon genomic GC-content in bacteria,” PLoS Genetics, vol. 6, no. 9, Article ID e1001107, 2010.
[8] R. Hershberg and D. A. Petrov, “Evidence that mutation is universally biased towards AT in bacteria,” PLoS Genetics, vol. 6, no. 9, Article ID e1001115, 2010.
[9] J. P. Gogarten and J. P. Townsend, “Horizontal gene transfer, genome innovation and evolution,” Nature Reviews Microbiology, vol. 3, no. 9, pp. 679–687, 2005.
[10] E. V. Koonin, K. S. Makarova, and L. Aravind, “Horizontal gene transfer in prokaryotes: quantification and classification,” Annual Review of Microbiology, vol. 55, pp. 709–742, 2001.
[11] B. R. Levin and O. E. Cornejo, “The population and evolutionary dynamics of homologous gene recombination in bacteria,” PLoS Genetics, vol. 5, no. 8, Article ID e1000601, 2009.
[12] J. M. Smith, N. H. Smith, M. O’Rourke, and B. G. Spratt, “How clonal are bacteria?” Proceedings of the National Academy of Sciences of the United States of America, vol. 90, no. 10, pp. 4384–4388, 1993.
[13] A. Mira, R. Pushker, B. A. Legault, D. Moreira, and F. Rodríguez-Valera, “Evolutionary relationships of Fusobacterium nucleatum based on phylogenetic analysis and comparative genomics,” BMC Evolutionary Biology, vol. 4, article no. 50, 2004.
[14] C. A. Fuchsman and G. Rocap, “Whole-genome reciprocal BLAST analysis reveals that Planctomycetes do not share an unusually large number of genes with Eukarya and Archaea,” Applied and Environmental Microbiology, vol. 72, no. 10, pp. 6841–6844, 2006.
[15] Y. Nakamura, T. Itoh, H. Matsuda, and T. Gojobori, “Biased biological functions of horizontally-transferred genes in prokaryotic genomes,” Nature Genetics, vol. 36, no. 7, pp. 760–766, 2004.
[16] M. G. I. Langille, W. W. L. Hsiao, and F. S. L. Brinkman, “Detecting genomic islands using bioinformatics approaches,” Nature Reviews Microbiology, vol. 8, no. 5, pp. 373–382, 2010.
[17] S. Lucchini, G. Rowley, M. D. Goldberg, D. Hurd, M. Harrison, and J. C. Hinton, “H-NS mediates the silencing of laterally acquired genes in bacteria,” PLoS Pathogens, vol. 2, no. 8, article e81, 2006.
[18] W. W. Navarre, S. Porwollik, Y. Wang et al., “Selective silencing of foreign DNA with low GC content by the H-NS protein in Salmonella,” Science, vol. 313, no. 5784, pp. 236–238, 2006.
[19] C.-S. Yun, C. Suzuki, K. Naito et al., “Pmr, a histone-like protein H1 (H-NS) family protein encoded by the IncP-7 plasmid pCAR1, is a key global regulator that alters host function,” Journal of Bacteriology, vol. 192, no. 18, pp. 4720–4731, 2010.
[20] B. R. G. Gordon, Y. Li, L. Wang et al., “Lsr2 is a nucleoid-associated protein that targets AT-rich sequences and virulence genes in Mycobacterium tuberculosis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 11, pp. 5154–5159, 2010.
[21] J. G. Lawrence and H. Ochman, “Amelioration of bacterial genomes: rates of change and exchange,” Journal of Molecular Evolution, vol. 44, no. 4, pp. 383–397, 1997.
[22] N. Sueoka, “On the genetic basis of variation and heterogeneity of DNA base composition,” Proceedings of the National Academy of Sciences of the United States of America, vol. 48, pp. 582–592, 1962.
[23] T. Oshima, S. Ishikawa, K. Kurokawa, H. Aiba, and N. Ogasawara, “Escherichia coli histone-like protein H-NS preferentially binds to horizontally acquired DNA in association with RNA polymerase,” DNA Research, vol. 13, no. 4, pp. 141–153, 2006.
[24] F. Michel and J. L. Ferat, “Structure and activities of group II introns,” Annual Review of Biochemistry, vol. 64, pp. 435–461, 1995.
[25] M. Matsuura, J. W. Noah, and A. M. Lambowitz, “Mechanism of maturase-promoted group II intron splicing,” EMBO Journal, vol. 20, no. 24, pp. 7259–7270, 2002.
[26] T. Sugihara, T. O. Watsuji, S. Kubota et al., “Distribution of Symbiobacterium thermophilum and related bacteria in the marine environment,” Bioscience, Biotechnology and Biochemistry, vol. 72, no. 1, pp. 204–211, 2008.
[27] K. Ueda, M. Ohno, K. Yamamoto et al., “Distribution and diversity of symbiotic thermophiles, Symbiobacterium thermophilum and related bacteria, in natural environments,” Applied and Environmental Microbiology, vol. 67, no. 9, pp. 3779–3784, 2001.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 376831, 8 pages
doi:10.4061/2011/376831

Research Article
Unique Evolution of Symbiobacterium thermophilum
Suggested from Gene Content and Orthologous Protein
Sequence Comparisons

Kenro Oshima,1 Kenji Ueda,2 Teruhiko Beppu,2 and Hiromi Nishida3


1 Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo,
Bunkyo-ku, Tokyo 113-8657, Japan
2 Life Science Research Center, College of Bioresource Sciences, Nihon University, 1866 Kameino, Fujisawa 252-8510, Japan
3 Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku,
Tokyo 113-8657, Japan

Correspondence should be addressed to Hiromi Nishida, hnishida@iu.a.u-tokyo.ac.jp

Received 21 September 2010; Accepted 27 November 2010

Academic Editor: Hideaki Nojiri

Copyright © 2011 Kenro Oshima et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Comparisons of gene content and orthologous protein sequence constitute a major strategy in whole-genome comparison
studies. It is expected that horizontal gene transfer between phylogenetically distant organisms and lineage-specific gene loss have
greater influence on gene content-based phylogenetic analysis than orthologous protein sequence-based phylogenetic analysis.
To determine the evolution of the syntrophic bacterium Symbiobacterium thermophilum, we analyzed phylogenetic relationships
among Clostridia on the basis of gene content and orthologous protein sequence comparisons. These comparisons revealed
that these 2 phylogenetic relationships are topologically different. Our results suggest that each member of Clostridia has a species-specific
gene content because frequent genetic exchanges or gene losses have occurred during evolution. Specifically, the phylogenetic
positions of syntrophic Clostridia were different between these 2 phylogenetic analyses, suggesting that large diversity in the living
environments may cause the observed species-specific gene content. S. thermophilum occupied the most distant position from the
other syntrophic Clostridia in the gene content-based phylogenetic tree. We identified 32 genes (14 under relaxed selection and 18
under functional constraint) evolving under Symbiobacterium-specific selection on the basis of nonsynonymous-to-synonymous
substitution ratios. Five of the 14 genes under relaxed selection are related to transcription. In contrast, none of the 18 genes under
functional constraint is related to transcription.

1. Introduction

Symbiobacterium thermophilum is a phylogenetically unique bacterium that effectively grows only in coculture with a
cognate Geobacillus sp. [1]. 16S rDNA-based phylogenetic analysis has shown that it is actually a Gram-positive
bacterium [2]. Although S. thermophilum phylogenetically belongs to Clostridia (low GC-content bacterial group), the
genome of S. thermophilum has a high GC content (68.7%) [3]. Furthermore, 2 recent independent analyses concluded
that Symbiobacterium affiliates with Clostridia (a class of Firmicutes): Ding et al. [4] carried out genome-context
network analysis of 195 fully sequenced representative species, including S. thermophilum, and we analyzed the
concatenated alignment of ribosomal protein sequences [5].

In a previous phylogenetic analysis that was based on ribosomal protein sequence comparisons [5], S. thermophilum
was closely related to 6 recently sequenced Clostridia that have distinct properties, that is, Carboxydothermus
hydrogenoformans, Desulfitobacterium hafniense, Moorella thermoacetica, Pelotomaculum thermopropionicum,
Desulfotomaculum reducens, and Syntrophomonas wolfei. Symbiobacterium is dependent on the multiple functions of
Geobacillus, including the supply of CO2 [1]. C. hydrogenoformans [6] grows by utilizing CO as a sole carbon source
and water as an electron acceptor, which produces CO2
and hydrogen as waste products. D. hafniense [7] carries out anaerobic dechlorination of tetrachloroethene (PCE).
M. thermoacetica [8] is an acetogenic bacterium that has been widely used to study the Wood-Ljungdahl pathway of
CO and CO2 fixation (reductive acetyl-CoA pathway). P. thermopropionicum [9] is a member of a complex anaerobic
microbial consortium where it catalyzes the intermediate bottleneck step by digesting volatile fatty acids (VFAs)
and alcohols produced by upstream fermenting bacteria and it supplies acetate, hydrogen, and CO2 to downstream
methanogenic archaea. D. reducens is an anaerobic sulfate-reducing bacterium [10]. S. wolfei is a fatty-acid-degrading
hydrogen/formate-producing anaerobic bacterium [11].

Comparisons of gene content and orthologous protein sequence constitute the major strategy in whole-genome
comparison studies [12]. Clostridia comprise a large number of bacteria. The phylogenetic position of Symbiobacterium
within Clostridia remains uncertain. In this study, we reconstructed phylogenetic trees of Clostridia on the basis of
these 2 methods and compared them.

2. Methods

2.1. Phylogenetic Analysis on the Basis of Gene Content Comparisons. We used the following 51 bacteria (50
Clostridia and 1 Bacillus belonging to Firmicutes) in this analysis: Alkaliphilus metalliredigens, Alkaliphilus oremlandii,
Ammonifex degensii, Anaerocellum thermophilum, Anaerococcus prevotii, Bacillus subtilis, Caldicellulosiruptor
saccharolyticus, Candidatus Desulforudis audaxviator, Carboxydothermus hydrogenoformans, Clostridium acetobutylicum,
Clostridium beijerinckii, Clostridium botulinum A ATCC 19397, C. botulinum A ATCC 3502, C. botulinum A Hall,
C. botulinum A2, C. botulinum A3 Loch Maree, C. botulinum B Eklund 17B, C. botulinum B1 Okra, C. botulinum Ba4,
C. botulinum E3, C. botulinum F Langeland, Clostridium cellulolyticum, Clostridium difficile 630, C. difficile CD196,
Clostridium kluyveri DSM 555, C. kluyveri NBRC 12016, Clostridium novyi, Clostridium perfringens ATCC 13124,
C. perfringens SM101, C. perfringens 13, Clostridium phytofermentans, Clostridium tetani E88, Clostridium thermocellum,
Coprothermobacter proteolyticus, Desulfitobacterium hafniense DCB-2, D. hafniense Y51, Desulfotomaculum acetoxidans,
Desulfotomaculum reducens, Eubacterium eligens, Eubacterium rectale, Finegoldia magna, Halothermothrix orenii,
Heliobacterium modesticaldum, Moorella thermoacetica, Natranaerobius thermophilus, Pelotomaculum thermopropionicum,
Symbiobacterium thermophilum, Syntrophomonas wolfei, Thermoanaerobacter pseudethanolicus, Thermoanaerobacter sp.
X514, and Thermoanaerobacter tengcongensis. Ortholog cluster analysis among the above 51 bacteria was performed
using the MBGD [13] (Microbial Genome Database for Comparative Analysis; http://mbgd.nibb.ac.jp/). The analysis
(minimum cluster size, 2) provided a gene presence/absence data matrix (10,636 genes × 51 organisms), which served as
the basis for a distance matrix between all pairs of the 51 organisms. The distance was calculated from the different
ratios between the presence/absence patterns of the 10,636 genes. On the basis of the distance matrix, a neighbor-joining
tree was reconstructed using MEGA software version 4 [14]. The bootstrap was performed with 1000 replicates.

2.2. Phylogenetic Analysis on the Basis of 112 Orthologous Protein Sequence Comparisons. We used the following 55
bacteria (54 Clostridia and 1 Bacillus) in this analysis: Acidaminococcus fermentans, A. metalliredigens, A. degensii,
A. thermophilum, A. prevotii, B. subtilis, C. saccharolyticus, Candidatus D. audaxviator, C. hydrogenoformans,
Clostridiales genomosp. BVAB3 UPII9-5, C. acetobutylicum, C. beijerinckii, C. botulinum A ATCC 19397, C. botulinum A
ATCC 3502, C. botulinum A Hall, C. botulinum A2 Kyoto, C. botulinum A3 Loch Maree, C. botulinum B Eklund 17B,
C. botulinum B1 Okra, C. botulinum Ba4 657, C. botulinum E3 Alaska E43, C. botulinum F Langeland, C. cellulolyticum,
C. difficile 630, C. difficile CD196, C. difficile R20291, C. kluyveri DSM 555, C. kluyveri NBRC 12016, C. novyi,
C. perfringens ATCC 13124, C. perfringens SM101, C. perfringens 13, C. phytofermentans, C. tetani, C. thermocellum,
C. proteolyticus, D. hafniense DCB-2, D. hafniense Y51, D. acetoxidans, D. reducens, E. eligens, E. rectale, F. magna,
H. orenii, H. modesticaldum, M. thermoacetica, N. thermophilus, P. thermopropionicum, S. thermophilum, S. wolfei,
Thermoanaerobacter italicus, T. pseudethanolicus, T. sp. X514, T. tengcongensis, and Veillonella parvula. From the
above 55 bacteria, 112 proteins were extracted as orthologous proteins by using a previously described method [15].
We then constructed 112 multiple alignments using Clustal W [16] and concatenated them into a single multiple
alignment. The complete multiple alignment had 52,204 amino acid sites, including 19,818 gap/insertion sites. Hence,
phylogenetic analyses were performed on the basis of the 32,386 amino acid sites that remained after the gap/insertion
sites were removed. The neighbor-joining tree was reconstructed using MEGA software version 4 [14]. The bootstrap was
performed with 1000 replicates. The rate variation among sites was considered to have a gamma-distributed rate
(α = 1). The other default parameters (e.g., Poisson distance) were not changed.

2.3. Extraction of Genes Evolving under Symbiobacterium-Specific Selection among Syntrophic Clostridia. Among
Bacillus subtilis, Carboxydothermus hydrogenoformans, Desulfitobacterium hafniense, Moorella thermoacetica,
Pelotomaculum thermopropionicum, Desulfotomaculum reducens, Symbiobacterium thermophilum, and Syntrophomonas
wolfei, 472 genes were extracted as orthologous genes by the previously described method [15]. Synonymous substitution
occurs more frequently than nonsynonymous substitution in protein-coding sequences because of functional
constraints (nonsynonymous-to-synonymous ratio ω < 1) [17], whereas they occur equally in noncoding regions and
pseudogenes (ω = 1). We calculated the likelihood of both the codon substitution model allowing for one ω (model
R1) and the S. thermophilum branch-specific model allowing for 2 ratios (ω0 and ω1; model R2), using PAML version
3.14 [18]. In model R2, the branches of the gene tree were partitioned into the Symbiobacterium branch (ω1) and
other related branches (ω0). Likelihood ratio test statistics
were calculated as twice the difference between the 2 log-likelihoods (2Δ ln L) and compared with a χ² distribution with
degrees of freedom equal to the difference in the number of parameters between the 2 models [19]. According to
this method, the genes evolving under the Symbiobacterium-specific selection among Bacillus and 7 Clostridia were
extracted.

3. Results and Discussion

Phylogenetic relationships among Clostridia on the basis of gene content comparison (Figure 1) were topologically
different from those generated on the basis of orthologous protein sequence comparison (Figure 2). For example, in the
gene content-based phylogenetic tree, Alkaliphilus, Clostridium (except for C. cellulolyticum and C. thermocellum),
Desulfitobacterium, and Eubacterium formed a monophyletic lineage with 85% bootstrap support (Figure 1). In contrast,
in the 112 orthologous protein sequence-based phylogenetic relationships, Alkaliphilus, Anaerococcus, Clostridium (except
for C. cellulolyticum and C. thermocellum), Eubacterium, and Finegoldia formed a monophyletic lineage with 98%
bootstrap support (Figure 2). Thus, the phylogenetic positions of Anaerococcus, Desulfitobacterium, and Finegoldia were
different between these 2 trees. In addition, Coprothermobacter proteolyticus was positioned differently in the 2 trees.
Moreover, the very long branch in the orthologous protein-based tree suggests that C. proteolyticus has a substitution
pattern that is different from other related Clostridia.

We expected horizontal gene transfer between phylogenetically distant organisms and lineage-specific gene
loss to have greater influence on the gene content-based phylogenetic analysis than the orthologous protein-based
analysis [12, 20]. Bacteria make their gene content suitable for the living environment by changing it through gene
acquisition and loss.

The phylogenetic positions of 2 D. hafniense strains are located near those of Alkaliphilus, Clostridium (except for
C. cellulolyticum and C. thermocellum), and Eubacterium in the gene content-based phylogenetic tree (Figure 1). However,
those phylogenetic positions were located in the phylogenetic lineage of syntrophic Clostridia in the orthologous
protein-based tree (Figure 2). The gene content-based phylogenetic tree (Figure 1) indicates that Symbiobacterium
branched off at the earliest stage of Clostridia species diversification. In contrast, Natranaerobius branched off at the
earliest species diversification stage in the orthologous protein sequence-based phylogenetic tree (Figure 2).

Although S. thermophilum occupied the most basal position in the gene content-based Clostridia lineage (Figure 1),
it was located in the syntrophic Clostridia lineage on the basis of orthologous protein sequence comparisons (Figure 2).
Syntrophic bacteria evolved to acquire different sets of genes despite their close phylogenetic relationship. Thus, although
Symbiobacterium clusters with syntrophic Clostridia, its gene content is very different. S. thermophilum has the most
distant position from the other syntrophic Clostridia in the phylogenetic tree on the basis of gene content comparisons.

Although the physiological reason for the high CO2 requirement of S. thermophilum is not yet known, we
assumed that it is related to the carbonic anhydrase deficiency (the ubiquitous enzyme catalyzing interconversion
between CO2 and bicarbonate; EC 4.2.1.1), as deficiency of this enzyme results in the need for high CO2 levels in several
model microorganisms [1]. S. thermophilum lost this enzyme in the course of evolution [5]. In this previous analysis,
we inferred that C. hydrogenoformans and M. thermoacetica have also lost the gene for carbonic anhydrase; however, we
recently noticed that C. hydrogenoformans had 2 potential carbonic anhydrase coding genes with structures different
from the other syntrophic Clostridia carbonic anhydrases. Therefore, only Moorella has lost the carbonic anhydrase
gene, in addition to Symbiobacterium. However, according to our results, these two bacteria are not closely related to each
other (Figures 1 and 2), suggesting that the gene loss in these 2 species occurred independently during evolution.

Our results imply that each syntrophic Clostridial organism, especially Symbiobacterium, would have genes that
evolved in an organism-specific manner. We expect that characterization of such genes will provide useful information
with regard to the evolutionary history and physiological features specific to the corresponding organism [21, 22]. We
identified 32 genes evolving under Symbiobacterium-specific selection (Table 1). The analysis revealed that the likelihood
of model R2 was significantly higher (P < .05) than that of model R1 in the 32 genes. Of these, 14 genes showed
ω1/ω0 > 1 and 18 showed ω1/ω0 < 1.

Among the 32 genes evolving under Symbiobacterium-specific selection, the RNA chaperone Hfq-coding gene has
the highest ω1 value (0.5347) (Table 1). Hfq facilitates pairing interactions between small regulatory RNAs and their mRNA
targets, which has a variety of functions in bacteria [23]. Among 73 conserved amino acid sites of Hfq (Figure 3),
S. thermophilum has more specific sites (7 sites) than the outgroup Bacillus (4 sites), indicating that the Hfq gene is
one of the genes evolving under Symbiobacterium-specific selection.

Two genes related to transcription, sigA (RNA polymerase sigma factor coding gene) and rpoC (RNA polymerase
subunit beta’ coding gene) have evolved under relaxed selection (Table 1). These results could be related to
the high GC content of Symbiobacterium genes. Thus, we hypothesized that the GC bias of the promoter sequence
induced Symbiobacterium-specific SigA, a DNA-binding protein, which led to the structural change of RNA polymerase
complex (including RpoC). We discussed the relationships between the GC content and phylogeny of the
Symbiobacterium genes [24].

In addition, spoIIAB and cheY are also related to transcription. Thus, 5 of the 14 genes under more relaxed
selection than other Clostridia are related to transcription. However, none of the 18 genes under functional constraint
is related to transcription. Those results suggest that, under relaxed selection, the transcription system may be related
to S. thermophilum-specific gene content. In fact, Symbiobacterium lost the transcriptional regulator genes arsR,
GntR, and Lrp compared to other syntrophic Clostridia
[Tree image for Figure 1 omitted: neighbor-joining tree of the 51 bacteria with bootstrap percentages at the nodes; scale bar, 200.]

Figure 1: Phylogenetic relationships on the basis of gene content comparisons among 50 Clostridia and Bacillus subtilis. The ortholog
cluster analysis (minimum cluster size, 2) among the 51 bacteria was performed using the MBGD [13]. This analysis produced the gene
presence/absence data matrix (10,636 genes × 51 organisms), which was used to generate the distance matrix between all pairs of the 51
bacteria. On the basis of the distance matrix, a neighbor-joining tree was reconstructed using MEGA software version 4 [14]. The bootstrap
was performed with 1000 replicates. The bar indicates a 200-gene difference.
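
The step from a presence/absence matrix to a pairwise distance matrix can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not give the exact distance formula, so the sketch assumes the distance is the number of ortholog clusters present in one genome but absent from the other (consistent with a scale bar expressed in gene differences). The genome names and profiles are hypothetical toy data.

```python
from itertools import combinations


def gene_content_distance(profile_a, profile_b):
    """Number of ortholog clusters present in exactly one of two genomes."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same ortholog clusters")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_matrix(profiles):
    """Symmetric matrix (dict of dicts) of pairwise gene-content distances."""
    names = sorted(profiles)
    matrix = {name: {other: 0 for other in names} for name in names}
    for first, second in combinations(names, 2):
        d = gene_content_distance(profiles[first], profiles[second])
        matrix[first][second] = d
        matrix[second][first] = d
    return matrix


# Hypothetical toy data: 3 genomes x 6 ortholog clusters (1 = present, 0 = absent).
toy_profiles = {
    "genome_A": [1, 1, 0, 1, 0, 1],
    "genome_B": [1, 0, 0, 1, 1, 1],
    "genome_C": [0, 1, 1, 0, 0, 1],
}

toy_matrix = distance_matrix(toy_profiles)
for name in sorted(toy_matrix):
    print(name, [toy_matrix[name][other] for other in sorted(toy_matrix)])
```

A matrix of this form is the kind of input that a neighbor-joining implementation takes; dividing each count by the number of clusters would give a proportion instead of a raw gene difference.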

(see Table S1 in the Supplementary Material available online at doi:10.4061/2011/376831).

It is noteworthy that some functionally related genes exhibited opposite nucleotide substitution patterns
in S. thermophilum (Table 1). For example, argD (N-acetylornithine aminotransferase coding gene) has evolved
under relaxed selection whereas argC (N-acetyl-gamma-glutamyl-phosphate reductase coding gene) has evolved
under functional constraint. Another example is the genes encoding flagella-associated proteins; flgG (flagellar hook
protein coding gene) has evolved under relaxed selection, whereas flgD (flagellar hook assembly protein coding gene)
[Tree image for Figure 2 omitted: neighbor-joining tree of the 55 bacteria with bootstrap percentages at the nodes; scale bar, 0.1.]

Figure 2: Phylogenetic relationships on the basis of 112 orthologous protein sequence comparisons among 54 Clostridia and B. subtilis. The
112 proteins were extracted as orthologous proteins from the 55 bacteria by a previously described method [15]. We constructed the 112
multiple alignments by using Clustal W [16]. Then, a concatenated multiple alignment of the 112 multiple alignments was generated. The
complete multiple alignment had 52,204 amino acid sites, including 19,818 gap/insertion sites. Hence, phylogenetic analyses were performed
on the basis of 32,386 amino acid sites without the gap/insertion sites. The neighbor-joining tree was reconstructed using MEGA software
version 4 [14]. The bootstrap was performed with 1000 replicates. The rate variation among sites was assumed to have a gamma distributed
rate (α = 1). No other default parameters were changed. The bar indicates a 10% difference.
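
The alignment handling described in the caption (concatenate, drop every column that contains a gap, then compute a gamma-corrected Poisson distance) can be sketched as below. This is an illustrative sketch rather than MEGA's implementation. It assumes the standard gamma distance for the Poisson model, d = a[(1 − p)^(−1/a) − 1], where p is the proportion of differing sites and a is the gamma shape parameter; the sequence names and toy alignments are hypothetical.

```python
def concatenate(alignments):
    """Join per-protein alignments (dicts of name -> aligned sequence) end to end."""
    names = list(alignments[0])
    return {name: "".join(aln[name] for aln in alignments) for name in names}


def strip_gap_columns(alignment, gap="-"):
    """Remove every alignment column in which at least one sequence has a gap."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if gap not in column]
    return {name: "".join(column[i] for column in kept) for i, name in enumerate(names)}


def p_distance(seq_a, seq_b):
    """Proportion of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def poisson_gamma_distance(p, alpha=1.0):
    """Gamma-corrected Poisson distance; requires p < 1."""
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


# Site bookkeeping reported in the text: 52,204 total sites, 19,818 with gaps.
assert 52_204 - 19_818 == 32_386

# Hypothetical toy input: two short protein alignments for three species.
toy = concatenate([
    {"sp1": "MK-LV", "sp2": "MKALV", "sp3": "MR-LI"},
    {"sp1": "GAT", "sp2": "GST", "sp3": "G-T"},
])
clean = strip_gap_columns(toy)
p = p_distance(clean["sp1"], clean["sp3"])
print(clean, p, poisson_gamma_distance(p, alpha=1.0))
```

With α = 1 the correction reduces to p / (1 − p), so the corrected distance grows faster than the raw proportion of differences as sequences diverge.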

and fliS (flagellar protein coding gene) have evolved under functional constraint. flgG exhibited the highest ω1/ω0 value
(75.48) (Table 1). Flagella mediate interactions between P. thermopropionicum and methanogenic archaea [25]. Similar
specialized functions in syntrophic association could have been a limiting factor for the evolution of the above 2
flagellum genes in Symbiobacterium.

In conclusion, our results suggest that S. thermophilum has evolved in a unique manner compared to other
syntrophic Clostridia from the perspective of gene content.
Table 1: Genes evolving under Symbiobacterium-specific selection.

Gene  ω1  ω1/ω0  2Δ ln L

ω1/ω0 > 1
hfq (RNA chaperone, STH1746)  0.5347  24.3046  5.7413
spoIIAB (anti-sigma F factor, STH1813)  0.3967  5.9744  8.7835
flgG (flagellar hook protein, STH2995)  0.3774  75.4800  6.4323
ilvC (ketol-acid reductoisomerase, STH2688)  0.2240  3.4675  10.7272
rplL (50S ribosomal protein L7/L12, STH3086)  0.2183  8.3640  13.4750
argD (N-acetylornithine aminotransferase, STH2881)  0.2084  2.0292  4.1224
rplK (50S ribosomal protein L11, STH3090)  0.1869  9.3450  4.2192
ylmE (alanine racemase domain-containing protein, STH1227)  0.1526  24.2222  15.4681
proJ (gamma-glutamyl kinase, STH2540)  0.1497  26.2632  4.4715
sigA (RNA polymerase sigma factor, STH0588)  0.1315  3.7679  17.9996
rpoC (RNA polymerase subunit beta’, STH3084)  0.0838  2.1487  7.2876
glmS (glucosamine-fructose-6-phosphate aminotransferase, STH1279)  0.0156  2.7857  13.0700
aroE (3-phosphoshikimate 1-carboxyvinyltransferase, STH1419)  0.0125  2.8409  4.4748
cheY (two-component response regulator involved in modulation of flagellar, STH1540)  0.0044  2.9333  6.7786

ω1/ω0 < 1
flgD (flagellar hook assembly protein, STH2996)  0.0123  0.0715  4.4609
fliS (flagellar protein FliS, STH2976)  0.0073  0.0885  4.0842
yloM (ribosomal RNA small subunit methyltransferase B, STH1349)  0.0045  0.0441  12.0081
ftsH (cell division protease, STH3198)  0.0040  0.0655  11.9908
spoVFB (dipicolinate synthase subunit B, STH1546)  0.0039  0.0591  6.5852
rplW (50S ribosomal protein L23, STH3073)  0.0039  0.1429  3.9835
trmD (tRNA methyltransferase, STH1470)  0.0038  0.0574  5.7865
argC (N-acetyl-gamma-glutamyl-phosphate reductase, STH2892)  0.0038  0.0721  4.1368
rpsC (30S ribosomal protein S3, STH3069)  0.0037  0.1504  4.1064
prfA (peptide chain release factor RF-1, STH0073)  0.0035  0.0750  6.3618
ligA (NAD-dependent DNA ligase, STH2825)  0.0034  0.0654  4.5717
spo0J (ParB-like nuclease domain-containing protein, STH3332)  0.0034  0.0397  10.1363
ftsE (cell division ATP-binding protein, STH0139)  0.0027  0.0407  5.6285
metG (methionyl-tRNA synthetase, STH3252)  0.0027  0.0470  4.1885
rplC (50S ribosomal protein L3, STH3075)  0.0023  0.0920  4.3304
rplB (50S ribosomal protein L2, STH3072)  0.0020  0.0617  3.9836
rpsH (30S ribosomal protein S8, STH3061)  0.0018  0.1047  4.5829
infA (translation initiation factor IF-1, STH3052)  0.0004  0.0234  4.7842
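
The likelihood ratio statistic in the last column of Table 1 can be turned into a P value with a few lines of code. The sketch below assumes that the two-ratio model R2 has exactly one more free parameter than the one-ratio model R1, so the statistic is compared with a χ² distribution with 1 degree of freedom; that degree-of-freedom count is an inference from the model descriptions, not a value stated in the text. The function names are hypothetical, and the inputs in the usage lines are taken from Table 1.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood difference between the alternative and null models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    """
    if statistic < 0:
        raise ValueError("the likelihood ratio statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))


def background_omega(omega1, ratio):
    """Recover the background ratio omega0 from omega1 and the reported omega1/omega0."""
    return omega1 / ratio


# Values reported in Table 1: (gene, omega1, omega1/omega0, likelihood ratio statistic).
rows = [
    ("hfq", 0.5347, 24.3046, 5.7413),
    ("sigA", 0.1315, 3.7679, 17.9996),
    ("rplW", 0.0039, 0.1429, 3.9835),
]
for gene, omega1, ratio, statistic in rows:
    print(gene, round(background_omega(omega1, ratio), 4), chi2_sf_df1(statistic) < 0.05)
```

Under this assumption, any statistic above roughly 3.84 corresponds to P < .05, which agrees with every row of Table 1 exceeding that threshold.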
Sth V T K A S A S L Q D G F L N L L R R E N I P A T I Y L V N G Y Q L K G Y I R G F D N F T V A V E V DG R V Q L V Y K H A L S T I T P A R P L P V S V S Q I M R A G E G Q E V E G E E ∗
Bsu - - MK P I N I . . Q . . . Q I . K . . T Y V . V F . L . . F . . R . Q V K . . . . . . . L L . S E . KQ . . I . . . . I . . F A . Q K N V Q L E L E ∗ - - - - - - - - - - - - - - -

Chy M S . N Q L N . . . A . . . Q V . K . . V G V . . F . I . . F . . . . F V K . . . . . . . I L . S E . KQ H M I . . . . I . . . I . Q . . V N T Y L A K G G N E E N T P S - - - - -
Dre M . . P Q I N . . . A . . . Q V . K . . . . V . . F . I . . F . . . . MV K . . . . . . . I L . S . . KQ L M . . . . . I . . . S . L . . V N T . F . E N K P I ∗ - - - - - - - - - -
Dha MN . . P I N . . . T . . . Q V . K . . M . V . . . . . . . F . . . . L V . . . . . . . . V I . F E . KQ . M . . . . . I . . V M . L . . I N L V A A S Q A S . E . R ∗ - - - - - - -
Mth MN . T Q GN . . . L . . . V . . . D . T . V . . . . . . . F . . . . V V . . . . . . . . V L D A . . KQ . M I . . . . I . . . M . F . . V N LM A E S R A Q . E A K . ∗ - - - - - -
Pth M . . P Q I N . . . A . . . Q V . K . . . . V . . F . . . . F . . . . MV . . . . . . . . I L . S E . KQ L M . . . . . I . . V S . L K . V S T . F . E A K A P E K S ∗ - - - - - - -
Swo M S . S Q I N . . . A . . . Q V . K D K . . V . V F . . . . F . I . . MV . . . . . . . . I I . . . Q KQ . . . . . . . I . . V A . L . . I S ML N L E A K S D D D ∗ - - - - - - - -

Figure 3: Alignment of amino acid sequences of Hfq. Sth, Symbiobacterium thermophilum; Bsu, Bacillus subtilis; Chy, Carboxydothermus
hydrogenoformans; Dre, Desulfotomaculum reducens; Dha, Desulfitobacterium hafniense; Mth, Moorella thermoacetica; Pth, Pelotomaculum
thermopropionicum; Swo, Syntrophomonas wolfei. Red and blue sites indicate Symbiobacterium- and Bacillus-specific sites, respectively. The
dots represent identical residues of S. thermophilum amino acid.
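
Counting lineage-specific sites in an alignment such as Figure 3 can be sketched as follows. The text does not define "specific site" formally, so this sketch assumes one plausible reading: a site is specific to a sequence when all other sequences share a single residue at that position and the target sequence carries a different one. The function name and the toy alignment are hypothetical and are not the Hfq data.

```python
def lineage_specific_sites(alignment, target, gap="-"):
    """Return 0-based positions where only `target` differs from an otherwise conserved column."""
    others = [seq for name, seq in alignment.items() if name != target]
    sites = []
    for i, residue in enumerate(alignment[target]):
        column = {seq[i] for seq in others}
        conserved_elsewhere = len(column) == 1 and gap not in column
        if conserved_elsewhere and residue != gap and residue not in column:
            sites.append(i)
    return sites


# Hypothetical toy alignment of four equal-length sequences.
toy_alignment = {
    "target":   "MTKGQL",
    "outgroup": "MSRGQL",
    "ingroup1": "MSKGQV",
    "ingroup2": "MSKGQV",
}

print(lineage_specific_sites(toy_alignment, "target"))
print(lineage_specific_sites(toy_alignment, "outgroup"))
```

Comparing the two counts for the focal lineage and the outgroup mirrors the comparison of Symbiobacterium-specific and Bacillus-specific sites made in the text.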

Codon substitution analysis also suggests several unique genes that evolved in a Symbiobacterium-specific manner.
Although speculative, the gene loss or relaxed evolution of several transcriptional regulator genes implies that
environmental response might be involved in Symbiobacterium-specific evolution.

Acknowledgment

This study was supported by the High-Tech Research Center Project of the Ministry of Education, Culture, Sports, Science
and Technology, Japan.

References

[1] K. Ueda and T. Beppu, “Lessons from studies of Symbiobacterium thermophilum, a unique syntrophic bacterium,”
Bioscience, Biotechnology and Biochemistry, vol. 71, no. 5, pp. 1115–1121, 2007.
[2] M. Ohno, H. Shiratori, M. J. Park et al., “Symbiobacterium thermophilum gen. nov., sp. nov., a symbiotic thermophile that
depends on co-culture with a Bacillus strain for growth,” International Journal of Systematic and Evolutionary Microbiology,
vol. 50, no. 5, pp. 1829–1832, 2000.
[3] K. Ueda, A. Yamashita, J. Ishikawa et al., “Genome sequence of Symbiobacterium thermophilum, an uncultivable bacterium
that depends on microbial commensalism,” Nucleic Acids Research, vol. 32, no. 16, pp. 4937–4944, 2004.
[4] G. Ding, Z. Yu, J. Zhao et al., “Tree of life based on genome context networks,” PLoS One, vol. 3, no. 10, Article ID e3357,
2008.
[5] H. Nishida, T. Beppu, and K. Ueda, “Symbiobacterium lost carbonic anhydrase in the course of evolution,” Journal of
Molecular Evolution, vol. 68, no. 1, pp. 90–96, 2009.
[6] M. Wu, Q. Ren, A. S. Durkin et al., “Life in hot carbon monoxide: the complete genome sequence of Carboxydothermus
hydrogenoformans Z-2901,” PLoS Genetics, vol. 1, no. 5, p. e65, 2005.
[7] H. Nonaka, G. Keresztes, Y. Shinoda et al., “Complete genome sequence of the dehalorespiring bacterium Desulfitobacterium
hafniense Y51 and comparison with Dehalococcoides ethenogenes 195,” Journal of Bacteriology, vol. 188, no. 6, pp. 2262–2274,
2006.
[8] E. Pierce, G. Xie, R. D. Barabote et al., “The complete genome sequence of Moorella thermoacetica (f. Clostridium
thermoaceticum),” Environmental Microbiology, vol. 10, no. 10, pp. 2550–2573, 2008.
[9] T. Kosaka, S. Kato, T. Shimoyama, S. Ishii, T. Abe, and K. Watanabe, “The genome of Pelotomaculum thermopropionicum
reveals niche-associated evolution in anaerobic microbiota,” Genome Research, vol. 18, no. 3, pp. 442–448, 2008.
[10] B. M. Tebo and A. Y. Obraztsova, “Sulfate-reducing bacterium grows with Cr(VI), U(VI), Mn(IV), and Fe(III) as electron
acceptors,” FEMS Microbiology Letters, vol. 162, no. 1, pp. 193–198, 1998.
[11] M. J. McInerney, M. P. Bryant, R. B. Hespell, and J. W. Costerton, “Syntrophomonas wolfei gen. nov. sp. nov.,
an anaerobic, syntrophic, fatty acid-oxidizing bacterium,” Applied and Environmental Microbiology, vol. 41, no. 4, pp.
1029–1039, 1981.
[12] Y. I. Wolf, I. B. Rogozin, N. V. Grishin, and E. V. Koonin, “Genome trees and the tree of life,” Trends in Genetics, vol. 18,
no. 9, pp. 472–479, 2002.
[13] I. Uchiyama, T. Higuchi, and M. Kawai, “MBGD update 2010: toward a comprehensive resource for exploring microbial
genome diversity,” Nucleic Acids Research, vol. 38, no. 1, pp. D361–D365, 2010.
[14] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: molecular evolutionary genetics analysis (MEGA) software
version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1596–1599, 2007.
[15] K. Oshima and H. Nishida, “Phylogenetic relationships among mycoplasmas based on the whole genomic information,”
Journal of Molecular Evolution, vol. 65, no. 3, pp. 249–258, 2007.
[16] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research,
vol. 22, no. 22, pp. 4673–4680, 1994.
[17] W. H. Li, T. Gojobori, and M. Nei, “Pseudogenes as a paradigm of neutral evolution,” Nature, vol. 292, no. 5820, pp. 237–239,
1981.
[18] Z. Yang, “PAML: a program package for phylogenetic analysis by maximum likelihood,” Computer Applications in the
Biosciences, vol. 13, no. 5, pp. 555–556, 1997.
[19] N. Goldman and Z. Yang, “A codon-based model of nucleotide substitution for protein-coding DNA sequences,” Molecular
Biology and Evolution, vol. 11, no. 5, pp. 725–736, 1994.
[20] B. Snel, M. A. Huynen, and B. E. Dutilh, “Genome trees and the nature of genome evolution,” Annual Review of
Microbiology, vol. 59, pp. 191–209, 2005.
[21] K. Oshima and H. Nishida, “Detection of the genes evolving under Ureaplasma-specific selection,” Journal of Molecular
Evolution, vol. 66, no. 5, pp. 529–532, 2008.
[22] H. Nishida, “Ureaplasma urease genes have undergone a unique evolutionary process,” Open Systems Biology Journal, vol. 2, pp. 1–7, 2009.
[23] A. Jousselin, L. Metzinger, and B. Felden, “On the facultative
requirement of the bacterial RNA chaperone, Hfq,” Trends in
Microbiology, vol. 17, no. 9, pp. 399–405, 2009.
[24] H. Nishida and C.-S. Yun, “Phylogenetic and guanine-cytosine
content analysis of Symbiobacterium thermophilum genes,”
International Journal of Evolutionary Biology, vol. 2011, p. 5,
2011.
[25] T. Shimoyama, S. Kato, S. Ishii, and K. Watanabe, “Flagellum
mediates symbiosis,” Science, vol. 323, no. 5921, p. 1574, 2009.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 937434, 6 pages
doi:10.4061/2011/937434

Review Article
Prevalence of Mycobacterium tuberculosis in Taiwan:
A Model for Strain Evolution Linked to Population Migration

Horng-Yunn Dou,1 Shu-Chen Huang,1 and Ih-Jen Su1, 2


1 Division of Infectious Diseases, National Health Research Institutes, No. 35, Keyan Road, Zhunan Town, Miaoli County 350, Taiwan
2 Department of Pathology, National Cheng Kung University Hospital, Tainan 704, Taiwan

Correspondence should be addressed to Ih-Jen Su, suihjen@nhri.org.tw

Received 14 October 2010; Accepted 20 December 2010

Academic Editor: Shinji Kondo

Copyright © 2011 Horng-Yunn Dou et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

The global evolution and spread of Mycobacterium tuberculosis (MTB), one of the most successful bacterial pathogens, remain
a mystery. Advances in molecular technology in the past decade now make it possible to understand MTB strain evolution and
transmission in the context of human population migration. Taiwan is a relatively isolated island that has served as a mixing vessel over the past four centuries, as successive waves of different ethnic groups colonized it. Studies using mycobacterial tandem repeat sequences as genetic markers show that the prevalence of MTB strains in Taiwan is associated with the historical migrations of different ethnic populations, making the island a good model for exploring the global evolution and spread of MTB.

1. Introduction

Tuberculosis (TB) remains a major worldwide health concern and has been characterized as one of three epidemics by the World Health Organization [1]. In 2006, more than 1.5 million people died of TB, an estimated 9.1 million new cases appeared, and the number of total TB cases worldwide reached about 14 million [2]. Findings from sites representative of Neolithic Europe, ancient Egypt, and the Greek and Roman empires revealed that TB is an ancient human disease [3]. Population migration driven by wars and New World expeditions accounts for the major transmission patterns of microbial pathogens, including Mycobacterium tuberculosis (MTB). In the past decade, the prevalence of MTB strains in different geographic regions and ethnic populations has been explored by molecular methods [4–6]. The reports revealed interesting patterns of strain distribution in different ethnic populations, which matched well with historical population migrations [5, 6]. Therefore, strain variations in different populations may be used to elucidate the transmission patterns of MTB.

The distribution of TB in different geographic regions is characterized by the prevalence of different MTB strains with varied virulence and drug resistance. Both environmental and host factors are responsible for the transmission and prevalence of different MTB strains. Because MTB has no detectable horizontal gene transfer [7, 8], large sequence polymorphisms (LSPs) can be used as phylogenetic markers to trace the evolutionary relationships of different strain families. Hirsh et al. presented a phylogenetic analysis of genomic deletions or LSPs, which were identified by comparative genome hybridization using DNA microarrays [7]. Mycobacterial interspersed repetitive unit (MIRU) loci comprise variable numbers of tandem repeat (VNTR) sequences, which allow them to be used as powerful genotyping markers [9]. In terms of genetic diversity and mutation rates, they resemble human microsatellites, which are widely used in human population genetics studies. By conducting MIRU-VNTR typing, Supply et al. were able to detect strong linkage disequilibrium between allele variants at these loci, indicative of a predominant clonal evolution in the MTB complex [8].

Taiwan is a relatively isolated island situated to the southeast of mainland China. The ethnic populations of Taiwan include Han Chinese who migrated to the island in two major waves: the first during the Ming dynasty around 1600 and the second between 1945 and 1950, when members of the military, veterans, and some civilians emigrated from mainland China due to the civil war there [4]; in total, about two million mainland Chinese have migrated to Taiwan to date. Taiwan was occupied by the Dutch beginning in 1660 for 40 years, and by the Japanese from 1895 until 1945. There are 12 tribes of aboriginals on this island, which are presumed to represent the ethnic groups that have inhabited the island for at least four thousand years (Figure 1).

[Figure: diagram relating RD type (No.) to ST. RD105, type 1: ST11. RD207, type 2: ST26, ST11. Insertion of IS6110 in NTF. RD181, type 3: ST3, ST10, ST19, ST22, ST25, STK, STN. RD150, type 4: ST10, ST19. RD142, type 5: ST3, ST10, ST19, ST22, STK. RD150 and RD142, type 6: ST10, ST19. Panel title: Scheme of the proposed evolution of the Beijing lineages.]

Figure 1: Proposed origins and routes of spread of four strains of MTB to Taiwan.

Although both the incidence and mortality rate of TB have steadily declined since 1950, TB is still a leading notifiable infectious disease in Taiwan. The populations affected by TB in Taiwan include aborigines, veterans, and Taiwanese (Hoklo). Therefore, the heterogeneous components of ethnic populations constitute a good model with which to study MTB transmission and host-pathogen relationships. An important question to be answered here is whether distinct genotypes or lineages of MTB are distributed differently according to their hosts’ ethnic origins and birthplaces. In recent years, we have applied MIRU-VNTR sequences as genetic markers and obtained interesting findings on the origin and evolution of MTB in Taiwan, as described below.

2. Associations of Mycobacterium tuberculosis Genotypes with Different Ethnic and Migratory Populations in Taiwan

Some epidemiologic studies have revealed that MTB genotype distribution is closely associated with geography, ethnicity, and population migrations [4, 5, 7]. Similar phylogeographical population structures have been reported for other human pathogens [10–13], some of which have been linked to ancient human migrations [11, 12, 14].

In Taiwan, TB is a major disease with an annual incidence of about 16,000 confirmed cases. The proportion of ethnic populations on the island is about 2% native aborigines and 98% Han Chinese (Council of Indigenous Peoples, Executive Yuan Taiwan, 2007). Previous studies in Taiwan have demonstrated a fivefold higher incidence of TB among aborigines compared to Han Chinese [15]. In addition, polymorphism at the NRAMP1 gene appears to be associated with susceptibility to TB among aborigines but not among the Han Chinese population [15]. Preliminary studies on Beijing family MTB strains reveal differential distributions by geographic region in Taiwan [16]. These multifactor influences, including waves of immigration, allow us to trace the evolutionary history of pulmonary TB in Taiwan. Accordingly, we investigated TB evolution or transmission in (1) the aborigines of Austronesian ethnicity, whose ancestors came to Taiwan more than 500 years ago; (2) the veterans of Han Chinese origin, first-generation immigrants who moved to Taiwan 55–60 years ago; and (3) the general Taiwanese population of Han Chinese, most of whose ancestors migrated to Taiwan around 200–400 years ago [4].

Based on spoligotyping classification, six distinct clades of MTB isolates among three Taiwanese subpopulations were identified: Beijing, Haarlem, East-African Indian (EAI), Latin American and Mediterranean (LAM), U, and the ill-defined T clade. Of the six known clades, the Beijing genotype overall was the most prevalent, being found in 40% of TB-positive aborigines, 72% of TB-positive veterans, and 56% of the TB-positive general population [4]. This result coincides with the global situation, with the most prevalent MTB strain worldwide being the Beijing genotype. Because Beijing strains are rapidly spreading worldwide, major TB outbreaks are most often associated with this strain [6, 17–19]. The second most frequent clade was that of the Haarlem family, which was present in 27% of aborigines and 13% of the general population, but in only 7% of veterans [4].
The third most frequent type was the T family, which was present in 5% of aborigines, 10% of veterans, and 6% of the general population. The remaining types were, in descending order of frequency, LAM, EAI, and U [4].

The Beijing family, which has the highest prevalence in the three Taiwanese subpopulations, can be further grouped into ancestral, modern, and recent strains by NTF locus analysis and RD deletion analysis. The NTF region and RD deletion are associated with the length of time since an MTB strain emerged in the human population; thus, they can be used to estimate the relative age of Beijing family clusters. Results of NTF and RD analyses revealed that ancient Beijing strains are prevalent among the aborigines, and modern Beijing strains predominate among veterans and the general population [4]. The retention of ancient characteristics of MTB among aborigines may be due to the historical tendency of Taiwan aborigines to live separately from the general population and thus have relatively little intermingling with Han Chinese.

The Haarlem genotype is the second most prevalent type of TB in Taiwan. The Haarlem strain was first isolated from a patient living in Holland [20, 21] and is found mainly in Central America, the Caribbean, Europe, and West Africa, suggesting a link between Haarlem and post-Columbus Europeans [19]. ogt and mgtC gene analyses for the Haarlem lineage demonstrated that Haarlem strains circulating among aborigines in Taiwan are wild-type strains, whereas most Haarlem strains currently isolated in Europe contain single nucleotide polymorphisms (SNPs) and are comparatively modern. These results are similar to those of the Beijing strains. Given Taiwan aborigines’ geographic isolation, the first transmission or exchange of Haarlem strains between the Dutch and the aborigines in Taiwan may have occurred in the 16th century during the Dutch colonization period. The late 16th century of the Ming Dynasty was also the period in which Han Chinese began to migrate from mainland China to Taiwan. Thus, the Han Chinese may have introduced ancient Beijing strains into the MTB gene pool in Taiwan at that time.

3. Molecular Epidemiology and Evolutionary Genetics of Mycobacterium tuberculosis in Taipei

We next turn to the strain distribution of MTB in Taipei, which is located in northern Taiwan and is the island’s capital city. The strain distribution of MTB in Taipei provides us with the transmission pattern in this metropolitan city against the background described above. The city proper occupies 272 square km and has a population of 2.6 million, with an additional 4.3 million inhabitants in the surrounding metropolitan area. The population of Taipei includes the same ethnicities as described above for the entire island: Han Chinese, veterans, and Taiwanese aborigines [22]. The prevalence of TB in large urban areas such as Taipei is complicated by the close human-to-human contacts and potential multiple sources of MTB strains from different ethnic and migratory populations.

In a molecular epidemiologic analysis undertaken to investigate the prevalence of genotypes, cluster pattern, and drug resistance of MTB isolates in metropolitan Taipei, 356 MTB isolates from patients presenting with pulmonary TB were studied; the major spoligotypes found were Beijing lineages (52.5%), followed by Haarlem lineages (13.5%) and EAI plus EAI-like lineages (11%) [1]. Based on NTF and RD analyses, as well as on drug-resistance testing, strains of the Beijing family were more likely to be modern strains and have a higher percentage of multiple drug resistance than all of the other families combined. Because Han Chinese make up almost all of the general population of Taipei City, Beijing isolates found there were overwhelmingly modern strains (96%). The predominance of the Beijing strain in Taipei City constitutes a big challenge for TB control. Another important observation was that patients infected with the Beijing family were statistically younger than those infected with other genotypes (Table 1). These results suggest a possible recent spread of the Beijing genotype among younger individuals in this area. Thus, even though Taiwan has had comprehensive BCG vaccinations for more than 40 years, the predominance of the Beijing family strain in the younger cohort in our study suggests that BCG may not adequately protect young people from the Beijing strain of MTB.

This situation warrants closer attention to control policy and suggests that a better BCG vaccine is needed.

Of the 356 strains in this study, 281 isolates (79%) were sensitive to all four of the first-line agents tested and 75 (21%) were resistant to at least one drug; 2.8% were multidrug resistant (MDR) (Table 2). Analysis of the association between MDR and genotypes (as determined by spoligotyping) showed that the Beijing genotype tended to be more likely to be MDR than all other genotypes combined (Haarlem, T, EAI, others, and orphan), although the difference was not statistically significant (P = .08, OR = 3.73, 95% C.I. = 0.78–17.83). The EAI family is significantly more likely to be sensitive to all drugs compared to other genotypes (P = .02, OR = 3.64, 95% C.I. = 1.09–12.15). EAI belongs to a branch in the early evolution of MTB and shows more antibiotic-sensitive properties, perhaps due to a lack of drug selection pressure. Interestingly, among the orphan strains, 5% were MDR and 20% were resistant to one drug, showing a distribution similar to that of the Beijing family.

Taken together, our data summarized in Figure 2 show the evolutionary relationships within the Beijing family of strains in Taipei City. RD group 1 comprises a single isolate of ST11; this isolate shows a deletion of the RD105 region. RD group 2 sublineages include ST11 and ST26; these isolates show deletion of the RD105 and RD207 regions. RD group 3 sublineages include ST3, ST10, ST19, ST22, STK, and STN; these isolates show deletion of the RD105, RD207, and RD181 regions. RD group 4 sublineages include ST10 and ST19; these isolates show deletion of the RD105, RD207, RD181, and RD150 regions. RD group 5 sublineages include ST3, ST10, ST19, and ST22; these isolates show deletion of the RD105, RD207, RD181, and RD142 regions. RD group 6 sublineages include ST10 and ST19; these isolates show deletion of the RD105, RD207, RD181, RD142, and RD150 regions. It has been suggested that insertion sequence- (IS-) mediated deletion events are an important mechanism driving mycobacterial genome variation. Based on our results (Figure 2), the RD105 and RD207 deletions appear to have been early events in the evolutionary history of Beijing strains; however, the IS6110 insertion occurred after the RD181 deletion but has not always persisted in later sublineage evolution. Thus, neither of the RD type 1 and type 2 groups (which include ST26 and ST11) has an IS6110 insert in the NTF region (N family). We still found some characteristics of ancient Beijing strains (N family) in ST19, ST10, and ST22.

[Figure labels: LAM strain (Portuguese, 1600); EAI strain (Austronesian ?, before 5000–2500 years); Beijing strain (China; Ming dynasty, 1368–1644; Second World War, 1945); Haarlem strain (Netherlands, 1600).]

Figure 2: Scheme of the proposed evolution of Beijing lineages. The scheme is based on the deletion of genomic regions (RD: region of difference, shown in gray rectangles), and types of sequence (ST) designations from the studies of Filliol et al. [23] and Iwamoto et al. [24].

Table 1: Association of Beijing MTB genotype and different age groups of patients (a).

Age group (yr)   No. (%) of isolates   No. (%) of Beijing isolates   Odds ratio   95% C.I.          P value
Total            356                   187 (52.53)
≤25              34 (9.55)             29 (85.29)                    5.80         2.11–15.98        .0002
≤30              54 (15.17)            37 (68.52)                    2.18         1.11–4.28         .02
31–60            95 (26.69)            50 (52.63)                    1.11         0.65–1.90         .7
61–75            85 (23.88)            39 (45.88)                    0.85         0.49–1.48         .56
≥76              122 (34.27)           61 (50.00)                    1            reference group

(a) Adapted from [1].

Table 2: Association between MTB genotype and drug resistance in patients (e).

Genotype family                        No. of isolates (%)   MDR (%)    Resistant to any one drug (%)   Sensitive to all drugs (%)
Beijing (a)                            187 (52.5)            8 (4.2)    36 (19.4)                       143 (76.4)
Haarlem                                48 (13.5)             0          9 (18.8)                        39 (81.2)
EAI (b)                                40 (11.2)             0          3 (7.5)                         37 (92.5)
T                                      25 (7.1)              0          8 (32.0)                        17 (68.0)
“Others” (c) (LAM, U, MANU, Bovis1)    16 (4.5)              0          1 (6.3)                         15 (93.7)
Unclassified (d)                       40 (11.2)             2 (5)      8 (20)                          30 (75)
Total                                  356                   10 (2.8)   65 (18.2)                       281 (79)

(a) Including Beijing-like strains.
(b) Including EAI-like strains.
(c) “Others”, all genotype families with a frequency of less than 10 cases.
(d) Unclassified, no internationally recognized genotype family assigned, based on the SpolDB4 spoligotype database.
(e) Adapted from [1].

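The RD group definitions given in Section 3 amount to a small lookup table of deletion sets. The Python sketch below encodes them as listed in the text; the names `RD_GROUPS`, `rd_group`, and `could_descend_from` are hypothetical. Treating a strict superset of deletions as a possible descendant is an illustrative assumption, motivated by the irreversibility of genomic deletions in an organism with no detectable horizontal gene transfer, and is not a method taken from the cited studies.

```python
# Deletion patterns of the six RD groups of the Beijing family,
# as listed in the text for the Taipei isolates.
RD_GROUPS = {
    1: frozenset({"RD105"}),
    2: frozenset({"RD105", "RD207"}),
    3: frozenset({"RD105", "RD207", "RD181"}),
    4: frozenset({"RD105", "RD207", "RD181", "RD150"}),
    5: frozenset({"RD105", "RD207", "RD181", "RD142"}),
    6: frozenset({"RD105", "RD207", "RD181", "RD142", "RD150"}),
}

def rd_group(deleted_regions):
    """Return the RD group whose deletion set matches exactly, or None."""
    deleted = frozenset(deleted_regions)
    for group, pattern in RD_GROUPS.items():
        if pattern == deleted:
            return group
    return None

def could_descend_from(child, parent):
    """True if the child group carries every deletion of the parent plus more.

    Assumption for illustration: deletions are never reverted, so a
    descendant's deletion set is a strict superset of its ancestor's.
    """
    return RD_GROUPS[parent] < RD_GROUPS[child]

print(rd_group({"RD105", "RD207", "RD181"}))
print(could_descend_from(6, 4), could_descend_from(5, 4))
```

Under this encoding, groups 4 and 5 are siblings (neither deletion set contains the other), while group 6 carries the deletions of both.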
Figure 1 illustrates the proposed origins and routes of spread of four strains of MTB in Taiwan.

Route 1. The Beijing strain may have migrated to Taiwan through two separate historic events: the first during the Ming dynasty and the second shortly after World War II. Through these two migrations, the ancient Beijing strain has evolved into the modern Beijing strain.

Route 2. Haarlem originated in the Netherlands. It migrated to Taiwan during the Dutch reign over the island in the 16th century and continues to be a major strain here. It is also important to note that there has been no observed genetic mutation in the strain that was passed on to the natives of Taiwan. The Haarlem strain that remained in the Netherlands, however, has mutations in the ogt and mgtC genes, resulting in SNP variants.

Route 3. LAM originated in both Europe and the Americas. It may have migrated to Taiwan during the Portuguese reign in the 16th century and been passed on to the natives of Taiwan.

Route 4. EAI originated in Taiwanese aborigines, entering Taiwan four thousand years ago. It may be closely associated with the Austronesian culture. The Austronesian peoples are a population in Oceania and Southeast Asia who speak languages of the Austronesian family. They include Taiwanese aborigines; the majority ethnic groups of East Timor, Indonesia, Malaysia, the Philippines, Brunei, Madagascar, Micronesia, and Polynesia; the Polynesian peoples of New Zealand and Hawaii; and the Austronesian peoples of Melanesia.

Several problems remain to be solved. Molecular genetic analysis of clinical MTB strains delineates relationships among closely related strains of pathogenic microbes and allows construction of genetic frameworks for examining the distribution of biomedically relevant traits such as virulence, transmissibility, and host range. Based on the strain distribution in different ethnic populations, we will attempt to identify factors that determine disease transmission. Comparative genomic hybridization (CGH) microarray chips will be designed based on the genomic sequence to conduct the population genetic study efficiently. The information provided in this paper will help us to better understand the dynamics of TB transmission in Taiwan, which in turn offers a good model for understanding the global distribution of MTB strains among different geographic regions and ethnic populations.

Conflict of Interests

The authors declare no conflict of interests.

Acknowledgments

This project was supported by grants from the National Health Research Institutes. All participants of this consortium are acknowledged for valuable discussions.

References

[1] H. Y. Dou, F. C. Tseng, C. W. Lin et al., “Molecular epidemiology and evolutionary genetics of Mycobacterium tuberculosis in Taipei,” BMC Infectious Diseases, vol. 8, article 170, 2008.
[2] World Health Organization, “Global Tuberculosis Control. Surveillance, Planning, Financing,” 2007.
[3] B. Mathema, N. E. Kurepina, P. J. Bifani, and B. N. Kreiswirth, “Molecular epidemiology of tuberculosis: current insights,” Clinical Microbiology Reviews, vol. 19, no. 4, pp. 658–685, 2006.
[4] H. Y. Dou, F. C. Tseng, J. J. Lu et al., “Associations of Mycobacterium tuberculosis genotypes with different ethnic and migratory populations in Taiwan,” Infection, Genetics and Evolution, vol. 8, no. 3, pp. 323–330, 2008.
[5] S. Gagneux, K. DeRiemer, T. Van et al., “Variable host-pathogen compatibility in Mycobacterium tuberculosis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 8, pp. 2869–2873, 2006.
[6] J. R. Glynn, J. Whiteley, P. J. Bifani, K. Kremer, and D. van Soolingen, “Worldwide occurrence of Beijing/W strains of Mycobacterium tuberculosis: a systematic review,” Emerging Infectious Diseases, vol. 8, no. 8, pp. 843–849, 2002.
[7] A. E. Hirsh, A. G. Tsolaki, K. DeRiemer, M. W. Feldman, and P. M. Small, “Stable association between strains of Mycobacterium tuberculosis and their human host populations,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 14, pp. 4871–4876, 2004.
[8] P. Supply, R. M. Warren, A. L. Bañuls et al., “Linkage disequilibrium between minisatellite loci supports clonal evolution of Mycobacterium tuberculosis in a high tuberculosis incidence area,” Molecular Microbiology, vol. 47, no. 2, pp. 529–538, 2003.
[9] P. Supply, E. Mazars, S. Lesjean, V. Vincent, B. Gicquel, and C. Locht, “Variable human minisatellite-like regions in the Mycobacterium tuberculosis genome,” Molecular Microbiology, vol. 36, no. 3, pp. 762–771, 2000.
[10] H. T. Agostini, R. Yanagihara, V. Davis, C. F. Ryschkewitsch, and G. L. Stoner, “Asian genotypes of JC virus in Native Americans and in a Pacific Island population: markers of viral evolution and human migration,” Proceedings of the National Academy of Sciences of the United States of America, vol. 94, no. 26, pp. 14542–14546, 1997.
[11] D. Falush, T. Wirth, B. Linz et al., “Traces of human migrations in Helicobacter pylori populations,” Science, vol. 299, no. 5612, pp. 1582–1585, 2003.
[12] M. Monot, N. Honoré, T. Garnier et al., “On the origin of leprosy,” Science, vol. 308, no. 5724, pp. 1040–1042, 2005.
[13] J. M. Musser, J. S. Kroll, D. M. Granoff et al., “Global genetic structure and molecular epidemiology of encapsulated Haemophilus influenzae,” Reviews of Infectious Diseases, vol. 12, no. 1, pp. 75–111, 1990.
[14] T. Wirth, X. Wang, B. Linz et al., “Distinguishing human ethnic groups by means of sequences from Helicobacter pylori: lessons from Ladakh,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 14, pp. 4746–4751, 2004.
[15] Y. H. Hsu, C. W. Chen, H. S. Sun, R. Jou, J. J. Lee, and I. J. Su, “Association of NRAMP 1 gene polymorphism with susceptibility to tuberculosis in Taiwanese aboriginals,” Journal of the Formosan Medical Association, vol. 105, no. 5, pp. 363–369, 2006.
[16] R. Jou, C. Y. Chiang, and W. L. Huang, “Distribution of the Beijing family genotypes of Mycobacterium tuberculosis in Taiwan,” Journal of Clinical Microbiology, vol. 43, no. 1, pp. 95–100, 2005.
[17] J. R. Glynn, K. Kremer, M. W. Borgdorff, M. P. Rodriguez, and D. van Soolingen, “Beijing/W genotype Mycobacterium tuberculosis and drug resistance: European concerted action on new generation genetic markers and techniques for the epidemiology and control of tuberculosis,” Emerging Infectious Diseases, vol. 12, no. 5, pp. 736–743, 2006.
[18] P. J. Bifani, B. Mathema, N. E. Kurepina, and B. N. Kreiswirth, “Global dissemination of the Mycobacterium tuberculosis W-Beijing family strains,” Trends in Microbiology, vol. 10, no. 1, pp. 45–52, 2002.
[19] K. Brudey, J. R. Driscoll, L. Rigouts et al., “Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology,” BMC Microbiology, vol. 6, article 23, 2006.
[20] P. Farnia, M. R. Masjedi, M. Mirsaeidi et al., “Prevalence of Haarlem I and Beijing types of Mycobacterium tuberculosis strains in Iranian and Afghan MDR-TB patients,” Journal of Infection, vol. 53, no. 5, pp. 331–336, 2006.
[21] K. Kremer, D. van Soolingen, R. Frothingham et al., “Comparison of methods based on different molecular epidemiological markers for typing of Mycobacterium tuberculosis complex strains: Interlaboratory study of discriminatory power and reproducibility,” Journal of Clinical Microbiology, vol. 37, no. 8, pp. 2607–2618, 1999.
[22] “A Brief History of Taiwan—A Sparrow Transformed
into a Phoenix,” http://www.gio.gov.tw/Taiwan-Website/5-gp/history/.
[23] I. Filliol, J. R. Driscoll, D. van Soolingen et al., “Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study,” Journal of Clinical Microbiology, vol. 41, no. 5, pp. 1963–1970, 2003.
[24] T. Iwamoto, S. Yoshida, K. Suzuki, and T. Wada, “Population structure analysis of the Mycobacterium tuberculosis Beijing family indicates an association between certain sublineages and multidrug resistance,” Antimicrobial Agents and Chemotherapy, vol. 52, no. 10, pp. 3805–3809, 2008.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 685015, 30 pages
doi:10.4061/2011/685015

Research Article
Distribution of Genes Encoding Nucleoid-Associated Protein
Homologs in Plasmids

Toshiharu Takeda,1 Choong-Soo Yun,1, 2 Masaki Shintani,3 Hisakazu Yamane,1 and Hideaki Nojiri1, 2

1 Biotechnology Research Center, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
2 Agricultural Bioinformatics Research Unit, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
3 Japan Collection of Microorganisms, RIKEN Bioresource Center, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan

Correspondence should be addressed to Hideaki Nojiri, anojiri@mail.ecc.u-tokyo.ac.jp

Received 14 October 2010; Accepted 27 November 2010

Academic Editor: Hiromi Nishida

Copyright © 2011 Toshiharu Takeda et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Bacterial nucleoid-associated proteins (NAPs) form nucleoprotein complexes and influence the expression of genes. Recent studies have shown that some plasmids carry genes encoding NAP homologs, which play important roles in transcriptional regulation networks between plasmids and host chromosomes. In this study, we determined the distributions of the well-known NAPs Fis, H-NS, HU, IHF, and Lrp and the newly found NAPs MvaT and NdpA among the 1382 completely sequenced plasmids found in Gram-negative bacteria. Comparisons between NAP distributions and plasmid features (size, G+C content, and putative transferability) were also performed. We found that larger plasmids frequently have NAP gene homologs. Plasmids with H-NS gene homologs had lower G+C content. Notably, plasmids with a NAP gene homolog also carried the relaxase gene involved in the conjugative transfer of plasmids more frequently than did those without one, implying that plasmid-encoded NAP homologs contribute positively to transmissible plasmids.

1. Introduction [5, 6], although they have distinct DNA-binding activities:


HU binds to DNA nonspecifically whereas IHF binds to
Bacterial chromosomal DNA is folded to form a compacted a consensus sequence [7]. Lrp has a global influence on
structure, the nucleoid. The proteins involved in folding transcription regulation and is also involved in microbial
the chromosome are known as nucleoid-associated proteins virulence [8]. In addition to these well-known NAPs, many
(NAPs) [1, 2]. Because of their DNA-binding ability, NAPs other NAPs are found not only in Enterobacteriaceae but
can also play an important role in global gene regula- also in other organisms. For instance, NdpA, a functionally
tion [1, 2]. Each well-known NAP in Enterobacteriaceae unknown NAP, has been found in Gram-negative bacteria
may be categorized as a “factor for inversion stimulation” [9]. The MvaT family protein is the functional homolog of
(Fis), “histone-like nucleoid structuring protein” (H-NS), H-NS in Pseudomonas bacteria [10].
“histone-like protein from Escherichia coli strain U93” (HU), Horizontal gene transfer (HGT), which is mediated by
“integration host factor” (IHF), or “leucine-responsive reg- transduction, transformation, and conjugation, plays an
ulatory protein” (Lrp) [1]. Fis is one of the most abundant important role in the evolution of prokaryotic genomes
NAPs in exponentially growing E. coli cells, and its role as [11, 12]. Genes acquired by HGT can provide beneficial
a transcriptional regulator has been investigated [3]. H-NS functions such as resistance to antibiotics and advantages
binds DNA, especially A+T-rich regions including promoter to their host under selective pressures [13]. However, the
regions or horizontally acquired DNA and acts as a global mechanisms underlying the integration of newly acquired
transcriptional repressor [4]. HU and IHF are similar in genes into host regulatory networks are still unclear. Recent
amino acid sequence level, and both are global regulators investigations have shown that some plasmids carry the genes
2 International Journal of Evolutionary Biology

encoding NAP homologs, which play important roles in transcriptional regulation networks between plasmids and host chromosomes and in maintaining host cell fitness. For example, Doyle et al. [14] reported that plasmid-encoded H-NS-like protein has a “stealth” function that allows for plasmid transfer into host cells without disrupting host regulatory networks, maintaining host cell fitness. Yun and Suzuki et al. [15] reported that plasmid-encoded H-NS-like protein can also play a key role in optimizing gene transcription both on the plasmid and in the host chromosome.

In this study, we determined the distributions of NAP homologs among plasmids and discussed their roles in the maintenance of plasmid and host cell fitness.

2. Materials and Methods

2.1. Plasmid Database Collection and Local BLAST Analyses. The completely sequenced plasmid database was downloaded from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Plasmids/). Some duplicated sequence data of the same plasmids were removed from the database. Identification of plasmids that contain the genes encoding NAP homologs was performed using the local TBLASTN program (ver. 2.2.24, ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) under strict conditions (i.e., thresholds of 30% identity and 70% query coverage). The complete amino acid sequences of Fis (DDBJ/EMBL/GenBank accession no. AP_003801), H-NS (AP_001863), Hha (AP_001109), HUα (AP_003818), HUβ (AP_001090), IHFα (AP_002332), IHFβ (AP_001542), Lrp (AP_001519), and NdpA (P33920) from E. coli K-12 W3110 and MvaT (AAP33788) from Pseudomonas aeruginosa PAO1 were used as query sequences.

2.2. Bacterial Genome Analyses. The complete genome sequences of bacteria were downloaded from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). The number of NAP genes on proteobacterial genomes was investigated using the TBLASTN program (http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi) under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).

2.3. Plasmid Classification. Plasmids in the database were classified into six groups according to their source organisms: Gram-negative, Gram-positive, archaeal, eukaryotic, viral, and unclassified. Putative transferability of each Gram-negative plasmid was determined by whether it carried the relaxase gene of each MOB family that Garcillán-Barcia et al. proposed [16]. Instead of using the local PSI-BLAST program (ver. 2.2.24, ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) as described by Garcillán-Barcia et al. [16], we used the local TBLASTN program.

3. Results and Discussion

3.1. Database Collection and Plasmid Classification by Origin. We downloaded the whole sequences of 2278 plasmids from the NCBI ftp site (April 2010). Duplicated plasmids were removed manually, and the resultant 2260 plasmid sequences were used in this study. To understand what types of plasmids were included in the database, we classified them into six groups according to their source organisms. The database included 1382 Gram-negative, 725 Gram-positive, 81 archaeal, 43 eukaryotic, 1 viral, and 28 unclassified plasmids.

3.2. Identification of the Plasmids Containing NAP Gene Homologs. Using the amino acid sequences of well-known NAPs (Fis, H-NS, HU, IHF, and Lrp) and newly found NAPs (MvaT and NdpA), their distributions were surveyed for plasmids using the TBLASTN program. Some plasmids had ORFs showing sequence similarities to both HU and IHF. We adopted the one with the higher E value. Of 2260 plasmids, 155 (7%) contained the gene encoding NAP homolog. Of those, 116 (75%) contained only one NAP gene homolog and 39 (25%) contained more than one NAP gene homolog. No plasmids carried the Fis gene homolog. Twenty-two plasmids carried the H-NS gene homolog, and all of them had a Gram-negative origin (Table 1). Sixty-six plasmids had the HU gene homolog; of these, 51 had a Gram-negative origin and 15 had a Gram-positive origin (Table 2). Twenty-seven plasmids (25 with Gram-negative and 2 with Gram-positive origins) carried the IHF gene homolog (Table 3). Forty-eight plasmids (46 with Gram-negative, 1 with a Gram-positive, and 1 with an archaeal origin) carried the Lrp gene homolog (Table 4). Of these, 23 (48%) contained more than one Lrp gene homolog. On the other hand, MvaT and NdpA homologs were encoded on only 3 plasmids, and all of them were of Gram-negative origin (Table 5). Previously reported plasmids that are known to have NAP gene homologs were included in those 155 plasmids. These included R27 (NC_002305) and pHCM1 (NC_003384) [18, 19] with the H-NS gene homolog; pQBR103 (NC_009444) [20] with the HU and NdpA gene homologs; and pCAR1 (NC_004444) [21, 22] with the MvaT, HU, and NdpA gene homologs. These results indicated the adequacy of our search. Because we used NAPs from Gram-negative bacteria as query sequences, it may be reasonable that 136 (88%) of 155 plasmids with the NAP gene homolog belonged to the group isolated from Gram-negative bacteria. Therefore, in further studies we discussed the Gram-negative plasmid group.

3.3. Relationships between Plasmid Size and NAP Gene Homolog Distributions. We first compared the sizes of 136 plasmids with NAP gene homologs with those of all 1382 Gram-negative group plasmids. All 1382 plasmids could be divided into 4 groups according to size, small (<10 kb), intermediate (10 to 100 kb), large (100 kb to 1 Mb), and mega (>1 Mb) plasmids. The distribution of the 136 plasmids, each of which had one or more genes encoding NAP homologs, is shown in Figure 1(a): none of 415 small plasmids, 34 (5%) of 686 intermediate plasmids, 90 (33%) of 269 large plasmids, and 12 (100%) of 12 mega plasmids carried at least one NAP gene homolog. The average size of the 136 plasmids was larger (364 kb) than that of all 1382 plasmids
Table 1: Plasmids containing the gene encoding H-NS homolog^a.

| Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | Identity (%)^c | Query coverage (%) | Subject start | Subject end | Classification^d | MOB family^e |
| 1 | NC_013972 | Erwinia amylovora ATCC 49946 | 28243 | 50 | 66 | 99 | 3129 | 2728 | − | |
| pAsa5 | NC_009350 | Aeromonas salmonicida subsp. salmonicida A449 | 155098 | 54 | 46 | 99 | 941 | 534 | − | MOBF |
| | | | | | 47 | 99 | 16890 | 16483 | | |
| pAsal5 | NC_009352 | Aeromonas salmonicida subsp. salmonicida | 18536 | 54 | 46 | 99 | 12285 | 12692 | − | |
| pEA29 | NC_013957 | Erwinia amylovora CFBP1430 | 28259 | 50 | 66 | 99 | 3129 | 2728 | − | |
| pEA29 | NC_005706 | Erwinia amylovora | 28185 | 50 | 64 | 99 | 2991 | 2590 | − | |
| pEC-IMP | NC_012555 | Enterobacter cloacae | 318782 | 48 | 64 | 99 | 109370 | 108969 | − | MOBH |
| pEC-IMPQ | NC_012556 | Enterobacter cloacae | 324503 | 48 | 64 | 99 | 109370 | 108969 | − | MOBH |
| pEJ30 | NC_004834 | Erwinia sp. Ejp 556 | 29593 | 50 | 66 | 99 | 4651 | 4250 | − | |
| pEP36 | NC_013263 | Erwinia pyrifoliae Ep1/96 | 35909 | 50 | 66 | 99 | 25040 | 25441 | − | |
| pEP36 | NC_004445 | Erwinia pyrifoliae Ep1/96 | 35904 | 50 | 64 | 98 | 4675 | 4280 | − | |
| pET45 | NC_010699 | Erwinia tasmaniensis Et1/99 | 44694 | 51 | 52 | 93 | 37435 | 37809 | − | MOBF |
| pET49 | NC_010697 | Erwinia tasmaniensis Et1/99 | 48751 | 44 | 36 | 94 | 30821 | 31204 | − | |
| pHCM1 | NC_003384 | Salmonella enterica subsp. enterica serovar Typhi str. CT18 | 218160 | 48 | 61 | 99 | 131861 | 131460 | − | MOBH |
| pK2044 | NC_006625 | Klebsiella pneumoniae NTUH-K2044 | 224152 | 50 | 67 | 99 | 35717 | 36112 | − | |
| plasmid 153kb | NC_009705 | Yersinia pseudotuberculosis IP 31758 | 153140 | 40 | 44 | 100 | 139846 | 140265 | − | |
| pLVPK | NC_005249 | Klebsiella pneumoniae | 219385 | 50 | 67 | 99 | 114397 | 114792 | − | |
| pMAK1 | NC_009981 | Salmonella enterica subsp. enterica serovar Choleraesuis | 208409 | 47 | 61 | 99 | 60046 | 59645 | − | MOBH |
| pO111_1 | NC_013365 | Escherichia coli O111:H- str. 11128 | 204604 | 47 | 61 | 99 | 80175 | 79774 | − | MOBH |
| pSG1 | NC_007713 | Sodalis glossinidius str. “morsitans” | 83306 | 49 | 43 | 97 | 2533 | 2922 | − | |
| R27 | NC_002305 | Salmonella enterica subsp. enterica serovar Typhi | 180461 | 46 | 61 | 99 | 148225 | 148626 | − | MOBH |
| R478 | NC_005211 | Serratia marcescens | 274762 | 46 | 64 | 99 | 111747 | 111346 | − | MOBH |
| Unnamed | NC_011148 | Salmonella enterica subsp. enterica serovar Agona str. SL483 | 37978 | 41 | 43 | 95 | 7671 | 7288 | − | |

^a This list is the result of a TBLASTN analysis using the amino acid sequence of H-NS as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage). Besides these plasmids, pSf-R27 from Shigella flexneri 2a str. 2457T was completely sequenced by Wei et al. [17] and encodes the H-NS-like protein Sfh.
^b Average G+C content of the plasmid.
^c Reported TBLASTN identity to H-NS.
^d Plasmid classification according to its source organism (−, Gram-negative plasmid).
^e Plasmid classification according to its relaxase gene sequence as described by Garcillán-Barcia et al. [16].
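
The strict conditions named in footnote a (30% identity and 70% query coverage) amount to a simple filter over TBLASTN hits. The Python sketch below only illustrates that filter; it is not the authors' implementation. The `Hit` record, its field names, and the inclusive handling of both thresholds are assumptions. The first two example hits use values from Table 1, and the third is made up to show a rejection.

```python
from dataclasses import dataclass

MIN_IDENTITY = 30.0   # percent identity threshold stated in the text
MIN_COVERAGE = 70.0   # percent query coverage threshold stated in the text


@dataclass(frozen=True)
class Hit:
    """One TBLASTN hit; this record layout is hypothetical."""
    plasmid: str
    identity: float        # percent identity to the query protein
    query_coverage: float  # percent of the query covered by the alignment


def passes_strict_conditions(hit: Hit) -> bool:
    # Assumption: both thresholds are inclusive.
    return hit.identity >= MIN_IDENTITY and hit.query_coverage >= MIN_COVERAGE


def filter_hits(hits):
    """Keep only the hits that satisfy both thresholds."""
    return [h for h in hits if passes_strict_conditions(h)]


hits = [
    Hit("pET49", 36, 94),            # values from Table 1
    Hit("R27", 61, 99),              # values from Table 1
    Hit("made-up example", 28, 95),  # invented hit, fails the identity threshold
]
print([h.plasmid for h in filter_hits(hits)])
```

Keeping the two thresholds as named constants makes it easy to see how the hit lists would change under looser or stricter conditions.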
Table 2: Plasmids containing the gene encoding HU homolog^a.

| Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | Identity (%)^c | Query coverage (%) | Subject start | Subject end | Classification^d | MOB family^e |
| 1 | NC_006823 | Aromatoleum aromaticum EbN1 | 207355 | 58 | 55 | 99 | 186175 | 185909 | − | |
| 1 | NC_007949 | Polaromonas sp. JS666 | 360405 | 57 | 52 | 99 | 61052 | 60786 | − | MOBH |
| 1 | NC_008010 | Deinococcus geothermalis DSM 11300 | 574127 | 66 | 38 | 97 | 550805 | 550545 | + | |
| 1 | NC_008503 | Lactococcus lactis subsp. cremoris SK11 | 14041 | 34 | 37 | 94 | 9732 | 10007 | + | MOBP |
| 1 | NC_008242 | Chelativorans sp. BNC1 | 343931 | 62 | 41 | 94 | 133932 | 133678 | − | MOBQ |
| 2 | NC_012529 | Deinococcus deserti VCD115 | 314317 | 64 | 38 | 93 | 269648 | 269899 | + | |
| 3 | NC_012528 | Deinococcus deserti VCD115 | 396459 | 61 | 40 | 96 | 8700 | 8957 | + | |
| Megaplasmid | NC_007974 | Cupriavidus metallidurans CH34 | 2580084 | 64 | 51 | 99 | 1393415 | 1393149 | − | MOBV |
| Megaplasmid | NC_005863 | Desulfovibrio vulgaris str. Hildenborough | 202301 | 66 | 31 | 98 | 5502 | 5765 | − | |
| Megaplasmid pDF308 | NC_013940 | Deferribacter desulfuricans SSM1 | 308544 | 24 | 41 | 100 | 253817 | 253548 | − | |
| Megaplasmid pHG1 | NC_005241 | Ralstonia eutropha H16 | 452156 | 62 | 48 | 99 | 343060 | 342791 | − | |
| p49879.1 | NC_006907 | Leptospirillum ferrooxidans | 28878 | 58 | 47 | 99 | 3281 | 3015 | − | MOBQ |
| p49879.2 | NC_006909 | Leptospirillum ferrooxidans | 28012 | 55 | 48 | 99 | 15858 | 15592 | − | MOBQ |
| pAH187_270 | NC_011655 | Bacillus cereus AH187 | 270082 | 34 | 59 | 100 | 113139 | 112870 | + | |
| pAH820_272 | NC_011777 | Bacillus cereus AH820 | 272145 | 34 | 58 | 100 | 153060 | 152791 | + | |
| pAM04528 | NC_012693 | Salmonella enterica | 158213 | 52 | 57 | 99 | 14067 | 14333 | − | MOBH |
| pAOVO01 | NC_008765 | Acidovorax sp. JS42 | 72689 | 62 | 46 | 100 | 65140 | 64871 | − | MOBF |
| pAPA01-011 | NC_013210 | Acetobacter pasteurianus IFO 3283-01 | 191799 | 53 | 47 | 100 | 154736 | 154467 | − | |
| | | | | | 46 | 99 | 38442 | 38708 | | |
| pAR060302 | NC_012692 | Escherichia coli | 166530 | 53 | 57 | 99 | 15755 | 16021 | − | MOBH |
| pAsa4 | NC_009349 | Aeromonas salmonicida subsp. salmonicida A449 | 166749 | 53 | 60 | 99 | 26844 | 26578 | − | MOBH |
| pAtS4c | NC_011984 | Agrobacterium vitis S4 | 211620 | 59 | 45 | 94 | 141245 | 140991 | − | MOBQ |
| pAtS4e | NC_011981 | Agrobacterium vitis S4 | 631775 | 57 | 41 | 94 | 40476 | 40222 | − | MOBQ |
| pBc239 | NC_011973 | Bacillus cereus Q1 | 239246 | 33 | 52 | 100 | 191895 | 192164 | + | |
| pBF9343 | NC_006873 | Bacteroides fragilis NCTC 9343 | 36560 | 32 | 35 | 92 | 15803 | 15558 | − | MOBP |
| pBPHY01 | NC_010625 | Burkholderia phymatum STM815 | 1904893 | 62 | 43 | 99 | 826527 | 826252 | − | |
| pBPHY02 | NC_010627 | Burkholderia phymatum STM815 | 595108 | 59 | 45 | 99 | 98625 | 98359 | − | |
| pBtoxis | NC_010076 | Bacillus thuringiensis serovar israelensis | 127923 | 32 | 52 | 99 | 77382 | 77648 | + | |
| pBWB401 | NC_010180 | Bacillus weihenstephanensis KBAB4 | 417054 | 34 | 59 | 100 | 338347 | 338078 | + | |
| pCAR1 | NC_004444 | Pseudomonas resinovorans | 199035 | 56 | 42 | 99 | 97809 | 98075 | − | MOBH |
| pCAUL01 | NC_010335 | Caulobacter sp. K31 | 233649 | 67 | 44 | 99 | 97598 | 97329 | − | MOBQ |
| pCER270 | NC_010924 | Bacillus cereus | 270082 | 34 | 59 | 100 | 169548 | 169279 | + | |
| pDBORO | NC_009137 | Lactococcus lactis subsp. lactis bv. diacetylactis | 16404 | 35 | 37 | 94 | 16387 | 16112 | + | |
| pDVUL01 | NC_008741 | Desulfovibrio vulgaris DP4 | 198504 | 66 | 31 | 98 | 198317 | 198054 | − | |
| peH4H | NC_012690 | Escherichia coli | 148105 | 53 | 57 | 99 | 14067 | 14333 | − | MOBH |
| pG9842_209 | NC_011775 | Bacillus cereus G9842 | 209488 | 30 | 60 | 100 | 88828 | 88559 | + | |
| pH308197_258 | NC_011339 | Bacillus cereus H3081.97 | 258484 | 34 | 59 | 100 | 83033 | 83302 | + | |
| pHD5AT | NC_012752 | Candidatus Hamiltonella defensa 5AT (Acyrthosiphon pisum) | 59032 | 45 | 45 | 99 | 14981 | 15247 | − | MOBP |
| pIP1202 | NC_009141 | Yersinia pestis bv. Orientalis str. IP275 | 182913 | 53 | 57 | 99 | 14067 | 14333 | − | MOBH |
| plasmid 2 | NC_007972 | Cupriavidus metallidurans CH34 | 171459 | 61 | 46 | 99 | 125530 | 125261 | − | |
| pMOL28 | NC_006525 | Cupriavidus metallidurans CH34 | 171461 | 61 | 46 | 99 | 51529 | 51798 | − | |
| pMP118 | NC_007930 | Lactobacillus salivarius UCC118 | 242436 | 32 | 54 | 99 | 56763 | 56497 | + | MOBV |
| pNPUN02 | NC_010632 | Nostoc punctiforme PCC 73102 | 254918 | 41 | 44 | 99 | 74804 | 74538 | − | MOBV |
| pOANT02 | NC_009670 | Ochrobactrum anthropi ATCC 49188 | 101491 | 59 | 49 | 94 | 32700 | 32446 | − | |
| pP91278 | NC_008613 | Photobacterium damselae subsp. piscicida | 131520 | 52 | 57 | 99 | 125918 | 126184 | − | MOBH |
| pP99-018 | NC_008612 | Photobacterium damselae subsp. piscicida | 150157 | 51 | 57 | 99 | 133314 | 133580 | − | MOBH |
| pPER272 | NC_010921 | Bacillus cereus | 272145 | 34 | 58 | 100 | 153060 | 152791 | + | |
| pPMA4326A | NC_005918 | Pseudomonas syringae pv. maculicola | 46697 | 55 | 42 | 99 | 1520 | 1786 | − | |
| pPMA4326B | NC_005919 | Pseudomonas syringae pv. maculicola | 40110 | 55 | 45 | 99 | 1457 | 1723 | − | |
| pQBR103 | NC_009444 | Pseudomonas fluorescens SBW25 | 425094 | 53 | 51 | 99 | 182862 | 183128 | − | |
| pR132503 | NC_012853 | Rhizobium leguminosarum bv. trifolii WSM1325 | 516088 | 59 | 47 | 94 | 300662 | 300916 | − | MOBQ |
| pRA1 | NC_012885 | Aeromonas hydrophila | 143963 | 51 | 58 | 99 | 15573 | 15839 | − | MOBH |
| pRALTA | NC_010529 | Cupriavidus taiwanensis | 557200 | 60 | 46 | 98 | 153542 | 153276 | − | |
| pREB1 | NC_009926 | Acaryochloris marina MBIC11017 | 374161 | 47 | 46 | 100 | 339743 | 340012 | − | MOBF |
| pREB2 | NC_009927 | Acaryochloris marina MBIC11017 | 356087 | 45 | 48 | 100 | 57583 | 57852 | − | MOBF |
| pREB3 | NC_009928 | Acaryochloris marina MBIC11017 | 273121 | 45 | 46 | 100 | 234682 | 234951 | − | MOBF |
| | | | | | 42 | 100 | 243339 | 243608 | | |
| pRL7 | NC_008382 | Rhizobium leguminosarum bv. viciae 3841 | 151564 | 58 | 48 | 94 | 20484 | 20230 | − | MOBQ |
| pRLG203 | NC_011370 | Rhizobium leguminosarum bv. trifolii WSM2304 | 308747 | 58 | 49 | 94 | 141121 | 140867 | − | |
| pRp12D01 | NC_012855 | Ralstonia pickettii 12D | 389779 | 58 | 37 | 99 | 321346 | 321080 | − | MOBH |
| pSG2 | NC_007184 | Sodalis glossinidius | 27240 | 45 | 45 | 86 | 10072 | 9845 | − | |
| pSG3 | NC_007186 | Sodalis glossinidius | 19201 | 51 | 51 | 100 | 13812 | 13543 | − | |
| pSN254 | NC_009140 | Salmonella enterica subsp. enterica serovar Newport str. SL254 | 176473 | 53 | 57 | 99 | 14067 | 14333 | − | MOBH |
| pTiS4 | NC_011982 | Agrobacterium vitis S4 | 258824 | 57 | 41 | 94 | 27356 | 27102 | − | MOBQ |
| | | | | | 40 | 94 | 83408 | 83154 | | |
| pTi-SAKURA | NC_002147 | Agrobacterium tumefaciens | 206479 | 56 | 44 | 94 | 95763 | 95509 | − | MOBQ |
| pVSAL840 | NC_011311 | Aliivibrio salmonicida LFI1238 | 83540 | 40 | 60 | 99 | 31361 | 31627 | − | MOBF |
| | | | | | 58 | 99 | 77350 | 77084 | | |
| pYR1 | NC_009139 | Yersinia ruckeri | 158038 | 51 | 57 | 99 | 15070 | 15336 | − | MOBH |
| Ti | NC_003065 | Agrobacterium tumefaciens str. C58 | 214233 | 57 | 44 | 94 | 139735 | 139481 | − | MOBQ |

^a This list is the result of a TBLASTN analysis using the amino acid sequence of HUα or HUβ as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
^b Average G+C content of the plasmid.
^c Reported TBLASTN identity to HU.
^d Plasmid classification according to its source organism (−, Gram-negative plasmid; +, Gram-positive plasmid).
^e Plasmid classification according to its relaxase gene sequence as described by Garcillán-Barcia et al. [16].
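
The Length (nt) values in these tables can be binned into the four size classes defined in Section 3.3 (small, <10 kb; intermediate, 10 to 100 kb; large, 100 kb to 1 Mb; mega, >1 Mb). The Python sketch below is a minimal illustration under stated assumptions: 1 kb is taken as 1000 nt, and because the text does not say which class a plasmid of exactly 10 kb, 100 kb, or 1 Mb belongs to, the boundary handling is a guess. The function names are hypothetical. The `reported` dictionary copies the counts given in Section 3.3, and the three example lengths come from Table 2.

```python
KB = 1_000  # assumption: 1 kb = 1000 nt


def size_class(length_nt: int) -> str:
    """Assign a plasmid to one of the four size classes of Section 3.3."""
    if length_nt < 10 * KB:
        return "small"
    if length_nt < 100 * KB:      # assumption: exactly 10 kb counts as intermediate
        return "intermediate"
    if length_nt <= 1_000 * KB:   # assumption: exactly 1 Mb counts as large
        return "large"
    return "mega"


def percent(part: int, total: int) -> int:
    """Percentage rounded to the nearest integer, as reported in the text."""
    return round(100 * part / total)


# Counts from Section 3.3: class -> (plasmids with a NAP homolog, all plasmids)
reported = {
    "small": (0, 415),
    "intermediate": (34, 686),
    "large": (90, 269),
    "mega": (12, 12),
}

for name, length in [("pAOVO01", 72689), ("pCAR1", 199035), ("pBPHY01", 1904893)]:
    print(name, size_class(length))

for cls, (with_nap, total) in reported.items():
    print(cls, f"{percent(with_nap, total)}%")
```

The per-class counts add up to the 136 NAP-carrying plasmids and the 1382 Gram-negative plasmids quoted in the text, which is a quick consistency check on the reported figures.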
Table 3: Plasmids containing the gene encoding IHF homolog^a.

| Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | Identity (%)^c | Query coverage (%) | Subject start | Subject end | Classification^d | MOB family^e |
| At | NC_003064 | Agrobacterium tumefaciens str. C58 | 542868 | 57 | 36 | 82 | 112654 | 112412 | − | MOBQ |
| Megaplasmid | NC_012811 | Methylobacterium extorquens AM1 | 1261460 | 68 | 33 | 94 | 720582 | 720860 | − | |
| p2META1 | NC_012809 | Methylobacterium extorquens AM1 | 37858 | 65 | 44 | 95 | 28369 | 28635 | − | MOBQ |
| pAACI01 | NC_013206 | Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446 | 91726 | 54 | 43 | 80 | 62668 | 62432 | + | |
| pACHL01 | NC_011879 | Arthrobacter chlorophenolicus A6 | 426858 | 64 | 32 | 92 | 408818 | 408546 | + | |
| pALVIN02 | NC_013862 | Allochromatium vinosum DSM 180 | 39929 | 53 | 60 | 98 | 10902 | 10627 | − | |
| pAph01 | NC_013193 | Candidatus Accumulibacter phosphatis clade IIA str. UW-1 | 167595 | 62 | 56 | 95 | 144197 | 144463 | − | MOBP |
| pAph03 | NC_013191 | Candidatus Accumulibacter phosphatis clade IIA str. UW-1 | 37695 | 59 | 58 | 97 | 5412 | 5140 | − | |
| pAtK84b | NC_011990 | Agrobacterium radiobacter K84 | 184668 | 59 | 38 | 86 | 54109 | 53855 | − | MOBQ |
| pAtK84c | NC_011987 | Agrobacterium radiobacter K84 | 388169 | 57 | 43 | 93 | 340807 | 340532 | − | |
| | | | | | 46 | 93 | 10327 | 10052 | | |
| pAtS4b | NC_011991 | Agrobacterium vitis S4 | 130435 | 56 | 47 | 97 | 44880 | 45152 | − | MOBQ |
| pBBta01 | NC_009475 | Bradyrhizobium sp. BTAi1 | 228826 | 61 | 39 | 86 | 6642 | 6388 | − | |
| pBFY46 | NC_006297 | Bacteroides fragilis YCH46 | 33716 | 34 | 35 | 89 | 25098 | 25343 | − | MOBP |
| pBIND01 | NC_010580 | Beijerinckia indica subsp. indica ATCC 9039 | 181736 | 56 | 36 | 77 | 179816 | 179601 | − | MOBF |
| pCHQ1 | NC_014007 | Sphingobium japonicum UT26S | 190974 | 63 | 36 | 90 | 63111 | 63377 | − | |
| pGLOV01 | NC_010815 | Geobacter lovleyi SZ | 77113 | 53 | 38 | 92 | 41196 | 41468 | − | |
| pM44601 | NC_010373 | Methylobacterium sp. 4-46 | 57951 | 65 | 35 | 97 | 7806 | 7534 | − | |
| pMPOP01 | NC_010727 | Methylobacterium populi BJ001 | 25164 | 65 | 49 | 93 | 10635 | 10375 | − | |
| pMRAD03 | NC_010514 | Methylobacterium radiotolerans JCM 2831 | 42985 | 63 | 38 | 94 | 26778 | 26515 | − | MOBF |
| pMRAD04 | NC_010517 | Methylobacterium radiotolerans JCM 2831 | 37743 | 64 | 38 | 94 | 10763 | 10500 | − | |
| pPRO1 | NC_008607 | Pelobacter propionicus DSM 2379 | 202397 | 48 | 41 | 94 | 129679 | 129957 | − | |
| pRSPA01 | NC_009429 | Rhodobacter sphaeroides ATCC 17025 | 877879 | 68 | 49 | 97 | 783519 | 783791 | − | |
| pSWIT01 | NC_009507 | Sphingomonas wittichii RW1 | 310228 | 64 | 40 | 95 | 106554 | 106820 | − | MOBF |
| | | | | | 36 | 92 | 35341 | 35069 | | |
| pTcM1 | NC_010600 | Acidithiobacillus caldus | 65158 | 57 | 56 | 89 | 25186 | 25449 | − | MOBP, MOBQ |
| pXCV183 | NC_007507 | Xanthomonas campestris pv. vesicatoria str. 85-10 | 182572 | 60 | 33 | 95 | 138753 | 138490 | − | |
| Ti | NC_002377 | Agrobacterium tumefaciens | 194140 | 55 | 43 | 97 | 180164 | 180436 | − | MOBQ |
| Ti plasmid pTiBo542 | NC_010929 | Agrobacterium tumefaciens | 244978 | 55 | 36 | 86 | 209743 | 209489 | − | MOBQ |
| | | | | | 45 | 98 | 187204 | 187479 | | |

^a This list is the result of a TBLASTN analysis using the amino acid sequence of IHFα or IHFβ as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
^b Average G+C content of the plasmid.
^c Reported TBLASTN identity to IHF.
^d Plasmid classification according to its source organism (−, Gram-negative plasmid; +, Gram-positive plasmid).
^e Plasmid classification according to its relaxase gene sequence as described by Garcillán-Barcia et al. [16].
Table 4: Plasmids containing the gene encoding Lrp homolog^a.

| Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | Identity (%)^c | Query coverage (%) | Subject start | Subject end | Classification^d | MOB family^e |
| 1 | NC_008688 | Paracoccus denitrificans PD1222 | 653815 | 67 | 41 | 92 | 252075 | 251623 | − | |
| | | | | | 42 | 93 | 464218 | 464673 | | |
| | | | | | 36 | 96 | 639341 | 639811 | | |
| | | | | | 37 | 85 | 110140 | 109724 | | |
| A | NC_009007 | Rhodobacter sphaeroides 2.4.1 | 114045 | 69 | 39 | 93 | 30241 | 29789 | − | MOBF |
| B | NC_007488 | Rhodobacter sphaeroides 2.4.1 | 114178 | 70 | 43 | 96 | 81861 | 81385 | − | |
| bglu_1p | NC_012723 | Burkholderia glumae BGR1 | 133591 | 61 | 36 | 90 | 124017 | 123577 | − | |
| Megaplasmid | NC_008043 | Ruegeria sp. TM1040 | 821788 | 59 | 41 | 84 | 143820 | 144233 | − | |
| | | | | | 41 | 91 | 687257 | 687706 | | |
| | | | | | 36 | 91 | 734136 | 733690 | | |
| Megaplasmid | NC_007974 | Cupriavidus metallidurans CH34 | 2580084 | 64 | 44 | 88 | 1171245 | 1170814 | − | MOBV |
| | | | | | 40 | 91 | 1169702 | 1169256 | | |
| | | | | | 38 | 97 | 1586726 | 1586250 | | |
| Megaplasmid | NC_006569 | Ruegeria pomeroyi DSS-3 | 491611 | 63 | 36 | 88 | 356303 | 355869 | − | MOBC |
| Megaplasmid | NC_007336 | Ralstonia eutropha JMP134 | 634917 | 61 | 35 | 93 | 377503 | 377045 | − | |
| p42e | NC_007765 | Rhizobium etli CFN 42 | 505334 | 62 | 34 | 71 | 255037 | 255384 | − | |
| p42f | NC_007766 | Rhizobium etli CFN 42 | 642517 | 61 | 45 | 88 | 436907 | 437341 | − | |
| | | | | | 43 | 91 | 406350 | 405901 | | |
| | | | | | 41 | 85 | 491383 | 491799 | | |
| | | | | | 39 | 95 | 210634 | 211098 | | |
| | | | | | 39 | 96 | 199426 | 199899 | | |
| pAB510a | NC_013855 | Azospirillum sp. B510 | 1455109 | 68 | 57 | 88 | 274908 | 275342 | − | |
| | | | | | 44 | 95 | 979549 | 980013 | | |
| | | | | | 32 | 94 | 1180335 | 1179874 | | |
| pAB510b | NC_013856 | Azospirillum sp. B510 | 723779 | 67 | 44 | 84 | 471830 | 472243 | − | |
| | | | | | 32 | 94 | 318139 | 318600 | | |
| pAB510c | NC_013857 | Azospirillum sp. B510 | 681723 | 67 | 45 | 85 | 408064 | 407645 | − | |
| | | | | | 34 | 91 | 36385 | 36834 | | |
| pAB510d | NC_013858 | Azospirillum sp. B510 | 628837 | 68 | 44 | 79 | 472768 | 472379 | − | |
| | | | | | 40 | 90 | 323184 | 322741 | | |
| | | | | | 37 | 87 | 281438 | 281866 | | |
| | | | | | 30 | 85 | 619027 | 618623 | | |
| pAtS4e | NC_011981 | Agrobacterium vitis S4 | 631775 | 57 | 30 | 87 | 460443 | 460871 | − | MOBQ |
| | | | | | 34 | 74 | 425247 | 424888 | | |
| pBPHY01 | NC_010625 | Burkholderia phymatum STM815 | 1904893 | 62 | 46 | 85 | 1153608 | 1154027 | − | |
| pBPHY02 | NC_010627 | Burkholderia phymatum STM815 | 595108 | 59 | 41 | 91 | 271795 | 271346 | − | |
| pC | NC_010997 | Rhizobium etli CIAT 652 | 1091523 | 61 | 46 | 88 | 617696 | 618130 | − | MOBQ |
| | | | | | 42 | 90 | 609059 | 608619 | | |
| | | | | | 39 | 95 | 417738 | 418202 | | |
| | | | | | 42 | 79 | 714804 | 715193 | | |
| | | | | | 39 | 93 | 406570 | 407025 | | |
| pCAUL01 | NC_010335 | Caulobacter sp. K31 | 233649 | 67 | 34 | 89 | 182479 | 182042 | − | MOBQ |
| pEST4011 | NC_005793 | Achromobacter denitrificans | 76958 | 62 | 58 | 88 | 41224 | 40793 | − | MOBP |
| | | | | | 58 | 88 | 34233 | 33802 | | |
| pGMI1000MP | NC_003296 | Ralstonia solanacearum GMI1000 | 2094509 | 67 | 43 | 98 | 1737958 | 1738437 | − | |
| | | | | | 46 | 93 | 822030 | 821572 | | |
| pHV4 | NC_013966 | Haloferax volcanii DS2 | 635786 | 62 | 33 | 71 | 401763 | 401410 | Archaea | |
| pIJB1 | NC_013666 | Burkholderia cepacia | 99448 | 63 | 58 | 88 | 74907 | 75338 | − | MOBP |
| pK2044 | NC_006625 | Klebsiella pneumoniae NTUH-K2044 | 224152 | 50 | 33 | 90 | 194643 | 195086 | − | |
| pLVPK | NC_005249 | Klebsiella pneumoniae | 219385 | 50 | 33 | 90 | 46236 | 46679 | − | |
| pMLa | NC_002679 | Mesorhizobium loti MAFF303099 | 351911 | 59 | 32 | 93 | 185603 | 185148 | − | |
| | | | | | 30 | 89 | 207314 | 206877 | | |
| pMLb | NC_002682 | Mesorhizobium loti MAFF303099 | 208315 | 60 | 37 | 93 | 24632 | 24177 | − | |
| pNGR234a | NC_000914 | Rhizobium sp. NGR234 | 536165 | 58 | 41 | 70 | 197189 | 196845 | − | MOBQ |
| | | | | | 30 | 89 | 188867 | 188430 | | |
| pNGR234b | NC_012586 | Rhizobium sp. NGR234 | 2430033 | 62 | 46 | 90 | 656547 | 656107 | − | MOBQ |
| | | | | | 45 | 85 | 667494 | 667913 | | |
| | | | | | 43 | 90 | 1038020 | 1038463 | | |
| | | | | | 44 | 85 | 682796 | 683215 | | |
| | | | | | 38 | 96 | 2400849 | 2401319 | | |
| | | | | | 44 | 79 | 709104 | 708715 | | |
| | | | | | 41 | 89 | 28336 | 28761 | | |
| | | | | | 33 | 89 | 1108900 | 1109337 | | |
| | | | | | 36 | 90 | 703213 | 702764 | | |
| | | | | | 32 | 77 | 1112953 | 1112582 | | |
| pPNAP04 | NC_008760 | Polaromonas naphthalenivorans CJ2 | 143747 | 59 | 35 | 90 | 142511 | 142068 | − | |
| pR132501 | NC_012848 | Rhizobium leguminosarum bv. trifolii WSM1325 | 828924 | 60 | 47 | 88 | 234905 | 234471 | − | MOBQ |
| | | | | | 44 | 86 | 386338 | 386760 | | |
| | | | | | 39 | 93 | 645542 | 645087 | | |
| | | | | | 42 | 79 | 147165 | 146776 | | |
| pRALTA | NC_010529 | Cupriavidus taiwanensis | 557200 | 60 | 38 | 91 | 465839 | 465393 | − | |
| pRHL1 | NC_008269 | Rhodococcus jostii RHA1 | 1123075 | 65 | 36 | 91 | 854207 | 854656 | + | |
| | | | | | 33 | 84 | 783666 | 783253 | | |
| pRL12 | NC_008378 | Rhizobium leguminosarum bv. viciae 3841 | 870021 | 61 | 46 | 88 | 599116 | 598682 | − | MOBQ |
| | | | | | 43 | 88 | 658287 | 658718 | | |
| | | | | | 39 | 93 | 45601 | 45146 | | |
| | | | | | 42 | 79 | 450080 | 449691 | | |
| pRL8 | NC_008383 | Rhizobium leguminosarum bv. viciae 3841 | 147463 | 59 | 33 | 87 | 70763 | 70344 | − | MOBQ |
| pRLG201 | NC_011368 | Rhizobium leguminosarum bv. trifolii WSM2304 | 1266105 | 60 | 45 | 89 | 917573 | 917136 | − | MOBQ |
| | | | | | 44 | 85 | 41998 | 42417 | | |
| | | | | | 44 | 79 | 473039 | 472650 | | |
| | | | | | 40 | 93 | 1162146 | 1161691 | | |
| | | | | | 40 | 93 | 1150939 | 1150484 | | |
| | | | | | 32 | 88 | 707587 | 707162 | | |
| pRSKD131A | NC_011962 | Rhodobacter sphaeroides KD131 | 157345 | 70 | 42 | 96 | 148295 | 147819 | − | |
| pRSKD131B | NC_011960 | Rhodobacter sphaeroides KD131 | 103355 | 70 | 39 | 93 | 98400 | 97948 | − | |
| pRSPA01 | NC_009429 | Rhodobacter sphaeroides ATCC 17025 | 877879 | 68 | 40 | 90 | 31309 | 30866 | − | |
| | | | | | 39 | 88 | 659383 | 658952 | | |
| pRSPH01 | NC_009040 | Rhodobacter sphaeroides ATCC 17029 | 122606 | 70 | 39 | 93 | 118088 | 118540 | − | |
| pSMED01 | NC_009620 | Sinorhizobium medicae WSM419 | 1570951 | 61 | 40 | 77 | 143180 | 143557 | − | MOBQ |
| | | | | | 34 | 89 | 574284 | 573847 | | |
| pSMED02 | NC_009621 | Sinorhizobium medicae WSM419 | 1245408 | 60 | 42 | 91 | 556486 | 556932 | − | MOBQ |
| | | | | | 40 | 91 | 842324 | 842758 | | |
| | | | | | 31 | 87 | 22345 | 21917 | | |
| pSMED03 | NC_009622 | Sinorhizobium medicae WSM419 | 219313 | 60 | 46 | 95 | 105044 | 105508 | − | |
| pSmeSM11a | NC_013545 | Sinorhizobium meliloti | 144170 | 60 | 46 | 96 | 70449 | 70922 | − | MOBQ |
| pSymA | NC_003037 | Sinorhizobium meliloti 1021 | 1354226 | 60 | 43 | 89 | 1060699 | 1060262 | − | MOBQ |
| pSymB | NC_003078 | Sinorhizobium meliloti 1021 | 1683333 | 62 | 38 | 90 | 440778 | 440335 | − | MOBQ |
| | | | | | 36 | 89 | 29555 | 29992 | | |
| pTiS4 | NC_011982 | Agrobacterium vitis S4 | 258824 | 57 | 42 | 79 | 96920 | 97309 | − | MOBQ |
| Unnamed | NC_011961 | Thermomicrobium roseum DSM 5159 | 917738 | 66 | 30 | 85 | 736739 | 737146 | − | MOBP |

^a This list is the result of a TBLASTN analysis using the amino acid sequence of Lrp as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
^b Average G+C content of the plasmid.
^c Reported TBLASTN identity to Lrp.
^d Plasmid classification according to its source organism (−, Gram-negative plasmid; +, Gram-positive plasmid).
^e Plasmid classification according to its relaxase gene sequence as described by Garcillán-Barcia et al. [16].
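
Table 4 lists additional hits on the same plasmid as rows without a plasmid name, so a plasmid with more than one Lrp gene homolog is one with more than one hit row. The Python sketch below shows one way such plasmids could be tallied once every hit is paired with its accession number. It is only an illustration: the function names are hypothetical, and the `hits` list is a small subset of Table 4, not the full data set.

```python
from collections import Counter

# (accession, identity %, query coverage %) for a few hits copied from Table 4
hits = [
    ("NC_008688", 41, 92), ("NC_008688", 42, 93),
    ("NC_008688", 36, 96), ("NC_008688", 37, 85),
    ("NC_009007", 39, 93),
    ("NC_007488", 43, 96),
    ("NC_008043", 41, 84), ("NC_008043", 41, 91), ("NC_008043", 36, 91),
]


def hits_per_plasmid(hits):
    """Count how many hits were recorded for each accession."""
    return Counter(hit[0] for hit in hits)


def multi_copy_plasmids(hits):
    """Accessions with more than one hit, in sorted order."""
    return sorted(acc for acc, n in hits_per_plasmid(hits).items() if n > 1)


counts = hits_per_plasmid(hits)
multi = multi_copy_plasmids(hits)
print(multi, f"{len(multi)} of {len(counts)} plasmids in this subset")

# The text reports 23 of 48 plasmids with more than one Lrp gene homolog.
print(round(100 * 23 / 48), "%")
```

Applied to all rows of Table 4, the same tally should reproduce the 23 of 48 plasmids (48%) reported in Section 3.2.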
Table 5: Plasmids containing the gene encoding MvaT or NdpA homolog^a.

| Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | Identity (%)^c | Query coverage (%) | Subject start | Subject end | Classification^d | MOB family^e |
MvaT
| pCAR1 | NC_004444 | Pseudomonas resinovorans | 199035 | 56 | 61 | 98 | 77640 | 77993 | − | MOBH |
| pQBR103 | NC_009444 | Pseudomonas fluorescens SBW25 | 425094 | 53 | 61 | 96 | 98076 | 97717 | − | |
| pWW53 | NC_008275 | Pseudomonas putida | 107929 | 57 | 61 | 98 | 8415 | 8768 | − | |
NdpA
| p0908 | NC_010113 | Vibrio sp. 0908 | 81413 | 49 | 51 | 99 | 79731 | 78736 | − | |
| pCAR1 | NC_004444 | Pseudomonas resinovorans | 199035 | 56 | 36 | 98 | 95390 | 94395 | − | MOBH |
| pQBR103 | NC_009444 | Pseudomonas fluorescens SBW25 | 425094 | 53 | 31 | 99 | 161413 | 160400 | − | |

^a This list is the result of a TBLASTN analysis using the amino acid sequence of MvaT or NdpA as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
^b Average G+C content of the plasmid.
^c Reported TBLASTN identity to MvaT or NdpA.
^d Plasmid classification according to its source organism (−, Gram-negative plasmid).
^e Plasmid classification according to its relaxase gene sequence as described by Garcillán-Barcia et al. [16].
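
The Subject start and Subject end columns also carry orientation information: in TBLASTN output, a subject start larger than the subject end indicates a hit on the reverse strand of the plasmid sequence. The Python sketch below derives the strand and the span of a hit from those two coordinates. The function names are hypothetical, and the two example hits are the pCAR1 rows of Table 5.

```python
def strand(subject_start: int, subject_end: int) -> str:
    """Return '+' for a forward-strand hit and '-' for a reverse-strand hit."""
    return "+" if subject_start <= subject_end else "-"


def span_nt(subject_start: int, subject_end: int) -> int:
    """Number of nucleotides covered by the hit (coordinates are inclusive)."""
    return abs(subject_end - subject_start) + 1


def span_codons(subject_start: int, subject_end: int) -> int:
    """Number of whole codons spanned by the hit on the plasmid."""
    return span_nt(subject_start, subject_end) // 3


# pCAR1 hits from Table 5: query name -> (subject start, subject end)
pcar1_hits = {"MvaT": (77640, 77993), "NdpA": (95390, 94395)}

for query, (start, end) in pcar1_hits.items():
    print(query, strand(start, end), span_nt(start, end), span_codons(start, end))
```

For these two rows the sketch reports the MvaT hit on the forward strand and the NdpA hit on the reverse strand of pCAR1.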
Table 6: Gram-negative plasmids containing the gene encoding Hha-like protein^a.

| Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | NAP gene homolog | Identity (%)^c | Query coverage (%) | Subject start | Subject end | MOB family^d |
| 55989p | NC_011752 | Escherichia coli 55989 | 72482 | 46 | | 53 | 92 | 10025 | 9828 | |
| NR1 | NC_009133 | Escherichia coli | 94289 | 52 | | 53 | 92 | 87193 | 87390 | MOBF |
| p1658/97 | NC_004998 | Escherichia coli | 125491 | 51 | | 55 | 92 | 36419 | 36616 | MOBF |
| p1ESCUM | NC_011749 | Escherichia coli UMN026 | 122301 | 50 | | 53 | 92 | 53508 | 53311 | MOBF |
| p2ESCUM | NC_011739 | Escherichia coli UMN026 | 33809 | 42 | | 62 | 90 | 7682 | 7488 | MOBQ |
| p53638_226 | NC_010719 | Escherichia coli 53638 | 225683 | 48 | | 55 | 92 | 67615 | 67418 | MOBF |
| pAPEC-O1-R | NC_009838 | Escherichia coli APEC O1 | 241387 | 46 | | 50 | 92 | 61389 | 61586 | MOBH |
| pAPEC-O2-ColV | NC_007675 | Escherichia coli | 184501 | 49 | | 55 | 92 | 3882 | 3685 | MOBF |
| pAPEC-O2-R | NC_006671 | Escherichia coli | 101375 | 53 | | 53 | 92 | 4856 | 4659 | MOBF |
| pBS512_211 | NC_010660 | Shigella boydii CDC 3083-94 | 210919 | 46 | | 55 | 89 | 190719 | 190910 | MOBF |
| pBS512_33 | NC_010657 | Shigella boydii CDC 3083-94 | 33103 | 41 | | 62 | 90 | 2894 | 3088 | |
| pC15-1a | NC_005327 | Escherichia coli | 92353 | 53 | | 53 | 92 | 87490 | 87687 | MOBF |
| pCP301 | NC_004851 | Shigella flexneri 2a str. 301 | 221618 | 46 | | 55 | 92 | 207828 | 208025 | MOBF |
| pCROD1 | NC_013717 | Citrobacter rodentium ICC168 | 54449 | 47 | | 56 | 92 | 53220 | 53417 | |
| pCROD2 | NC_013718 | Citrobacter rodentium ICC168 | 39265 | 42 | | 62 | 90 | 15526 | 15332 | |
| pCT02021853_74 | NC_011204 | Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 | 74551 | 49 | | 62 | 90 | 48482 | 48288 | MOBQ |
| pCTX-M3 | NC_004464 | Citrobacter freundii | 89468 | 51 | | 38 | 71 | 26136 | 26294 | MOBP |
| | | | 89468 | | | 31 | 96 | 40648 | 40439 | |
| pCTXM360 | NC_011641 | Klebsiella pneumoniae | 68018 | 51 | | 38 | 71 | 64551 | 64709 | MOBP |
| | | | 68018 | | | 31 | 96 | 10927 | 10718 | |
| pCVM29188_146 | NC_011076 | Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 | 146811 | 49 | | 53 | 92 | 18755 | 18558 | MOBF |
| pEC14_114 | NC_013175 | Escherichia coli | 114222 | 51 | | 53 | 92 | 113985 | 114182 | MOBF |
| pEC-IMP | NC_012555 | Enterobacter cloacae | 318782 | 48 | H-NS | 50 | 92 | 60491 | 60688 | MOBH |
| pEC-IMPQ | NC_012556 | Enterobacter cloacae | 324503 | 48 | H-NS | 50 | 92 | 60491 | 60688 | MOBH |
| pEG356 | NC_013727 | Shigella sonnei | 70275 | 52 | | 53 | 92 | 69444 | 69641 | MOBF |
| pEK499 | NC_013122 | Escherichia coli | 117536 | 53 | | 53 | 92 | 41985 | 42182 | |
| pEK516 | NC_013121 | Escherichia coli | 64471 | 53 | | 53 | 92 | 31410 | 31213 | |
| pEL60 | NC_005246 | Erwinia amylovora | 60145 | 51 | | 38 | 71 | 23187 | 23345 | MOBP |
| | | | 60145 | | | 31 | 96 | 37863 | 37654 | |
| pEntH10407 | NC_013507 | Escherichia coli ETEC H10407 | 67094 | 51 | | 55 | 78 | 43421 | 43254 | MOBF |
| pHCM1 | NC_003384 | Salmonella enterica subsp. enterica serovar Typhi str. CT18 | 218160 | 48 | H-NS | 47 | 100 | 105911 | 106117 | MOBH |
| pK2044 | NC_006625 | Klebsiella pneumoniae NTUH-K2044 | 224152 | 50 | H-NS, Lrp | 45 | 85 | 143331 | 143528 | |
| pK29 | NC_010870 | Klebsiella pneumoniae | 269674 | 46 | | 50 | 92 | 59322 | 59519 | MOBH |
| pKF3-70 | NC_013542 | Klebsiella pneumoniae | 70057 | 52 | | 53 | 92 | 15967 | 15770 | MOBF |
| pKF3-94 | NC_013950 | Klebsiella pneumoniae | 94219 | 52 | | 58 | 96 | 9596 | 9390 | MOBF |
| pKP187 | NC_011282 | Klebsiella pneumoniae 342 | 187922 | 47 | | 64 | 96 | 110083 | 109877 | |
| | | | 187922 | | | 42 | 89 | 1550 | 1344 | |
| pKPN3 | NC_009649 | Klebsiella pneumoniae subsp. pneumoniae MGH 78578 | 175879 | 52 | | 59 | 97 | 56930 | 56721 | MOBF |
| plasmid 153 kb | NC_009705 | Yersinia pseudotuberculosis IP 31758 | 153140 | 40 | H-NS | 69 | 93 | 63342 | 63542 | |
| | | | 153140 | | | 56 | 92 | 49734 | 49931 | |
| pLVPK | NC_005249 | Klebsiella pneumoniae | 219385 | 50 | H-NS, Lrp | 61 | 97 | 148056 | 147847 | |
| | | | 219385 | | | 45 | 85 | 214828 | 215025 | |
| pMAK1 | NC_009981 | Salmonella enterica subsp. enterica serovar Choleraesuis | 208409 | 47 | H-NS | 47 | 100 | 49208 | 49414 | MOBH |
| pMAS2027 | NC_013503 | Escherichia coli | 42644 | 43 | | 62 | 90 | 19685 | 19491 | MOBQ |
| pO103 | NC_013354 | Escherichia coli O103:H2 str. 12009 | 75546 | 49 | | 55 | 92 | 51727 | 51924 | MOBF |
| pO111_1 | NC_013365 | Escherichia coli O111:H- str. 11128 | 204604 | 47 | H-NS | 47 | 100 | 66925 | 67131 | MOBH |
| pO111_3 | NC_013366 | Escherichia coli O111:H- str. 11128 | 77690 | 50 | | 55 | 92 | 11975 | 12172 | MOBF |
| pO157 | NC_013010 | Escherichia coli O157:H7 str. TW14359 | 94601 | 48 | | 55 | 92 | 70792 | 70989 | |
| pO157 | NC_011350 | Escherichia coli O157:H7 str. EC4115 | 94644 | 48 | | 55 | 92 | 54735 | 54932 | |
| pO157 | NC_007414 | Escherichia coli O157:H7 EDL933 | 92077 | 48 | | 55 | 92 | 1667 | 1864 | |
| pO157 | NC_002128 | Escherichia coli O157:H7 str. Sakai | 92721 | 48 | | 55 | 92 | 71183 | 71380 | |
| pO26I | NC_011812 | Escherichia coli | 72946 | 51 | | 53 | 92 | 66608 | 66805 | MOBF |
| pO86A1 | NC_008460 | Escherichia coli | 120730 | 49 | | 55 | 92 | 101598 | 101795 | MOBF |
| pOLA52 | NC_010378 | Escherichia coli | 51602 | 46 | | 62 | 90 | 12114 | 11920 | MOBQ |
| pOU1114 | NC_010421 | Salmonella enterica subsp. enterica serovar Dublin | 34595 | 41 | | 62 | 90 | 5446 | 5252 | MOBQ |
| pOU1115 | NC_010422 | Salmonella enterica subsp. enterica serovar Dublin | 74589 | 49 | | 62 | 90 | 37246 | 37052 | MOBQ |
| pSB4_227 | NC_007608 | Shigella boydii Sb227 | 126697 | 47 | | 55 | 92 | 110688 | 110885 | MOBF |
| pSE11-1 | NC_011419 | Escherichia coli SE11 | 100021 | 50 | | 56 | 92 | 58407 | 58210 | MOBP |
| pSE34 | NC_010860 | Salmonella enterica subsp. enterica serovar Enteritidis | 32950 | 41 | | 62 | 90 | 21875 | 22069 | MOBQ |
| pSFO157 | NC_009602 | Escherichia coli | 121239 | 50 | | 52 | 75 | 1709 | 1870 | MOBF |
| pSG1 | NC_007713 | Sodalis glossinidius str. “morsitans” | 83306 | 49 | H-NS | 48 | 92 | 2294 | 2491 | |
| pSG1 | NC_007182 | Sodalis glossinidius | 81553 | 49 | | 48 | 92 | 56217 | 56414 | |
| pSMS35_130 | NC_010488 | Escherichia coli SMS-3-5 | 130440 | 51 | | 55 | 92 | 3364 | 3167 | MOBF |
| pSS_046 | NC_007385 | Shigella sonnei Ss046 | 214396 | 45 | | 55 | 92 | 178363 | 178560 | MOBF |
| pUTI89 | NC_007941 | Escherichia coli UTI89 | 114230 | 51 | | 53 | 92 | 113993 | 114190 | MOBF |
| pWR501 | NC_002698 | Shigella flexneri | 221851 | 46 | | 55 | 92 | 207534 | 207731 | MOBF |
| R100 | NC_002134 | Shigella flexneri 2b str. 222 | 94281 | 52 | | 53 | 92 | 87185 | 87382 | MOBF |
| R27 | NC_002305 | Salmonella enterica subsp. enterica serovar Typhi | 180461 | 46 | H-NS | 47 | 100 | 159402 | 159196 | MOBH |
| R478 | NC_005211 | Serratia marcescens | 274762 | 46 | H-NS | 50 | 92 | 59426 | 59623 | MOBH |
| R721 | NC_002525 | Escherichia coli | 75582 | 43 | | 66 | 90 | 35285 | 35091 | |
| Unnamed | NC_011148 | Salmonella enterica subsp. enterica serovar Agona str. SL483 | 37978 | 41 | H-NS | 42 | 93 | 1363 | 1163 | |

^a This list is the result of a TBLASTN analysis using the amino acid sequence of Hha as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
^b Average G+C content of the plasmid.
^c Reported TBLASTN identity to Hha.
^d Plasmid classification according to its relaxase gene sequence as described by Garcillán-Barcia et al. [16].
Table 7: MOBH -family plasmids of Gram-negative origina .


G+C content NAP gene Query
Plasmid name Accession no. Source organism Length (nt) Identity (%)c Subject start Subject end
(%)b homolog coverage (%)
1 NC 007949 Polaromonas sp. JS666 360405 57 HU 52 99 61052 60786
1 NC 008573 Shewanella sp. ANA-3 278942 46
2 NC 007950 Polaromonas sp. JS666 338007 60
ICEhin1056 NC 011409 Haemophilus influenzae 59393 39
pAM04528 NC 012693 Salmonella enterica 158213 52 HU 57 99 14067 14333
pAPEC-O1-R NC 009838 Escherichia coli APEC O1 241387 46
pAR060302 NC 012692 Escherichia coli 166530 53 HU 57 99 15755 16021
pAsa4 NC 009349 Aeromonas salmonicida subsp. salmonicida A449 166749 53 HU 60 99 26844 26578
pCAR1 NC 004444 Pseudomonas resinovorans 199035 56 MvaT 61 98 77640 77993
NdpA 36 98 95390 94395
HU 42 99 97809 98075
pEC-IMP NC 012555 Enterobacter cloacae 318782 48 H-NS 64 99 109370 108969
pEC-IMPQ NC 012556 Enterobacter cloacae 324503 48 H-NS 64 99 109370 108969
peH4H NC 012690 Escherichia coli 148105 53 HU 57 99 14067 14333
pHCM1 NC 003384 Salmonella enterica subsp. enterica serovar Typhi str. CT18 218160 48 H-NS 61 99 131861 131460
pIP1202 NC 009141 Yersinia pestis bv. Orientalis str. IP275 182913 53 HU 57 99 14067 14333
pK29 NC 010870 Klebsiella pneumoniae 269674 46
plasmid1 NC 007901 Rhodoferax ferrireducens T118 257447 54
pMAK1 NC 009981 Salmonella enterica subsp. enterica serovar Choleraesuis 208409 47 H-NS 61 99 60046 59645
pMAQU02 NC 008739 Marinobacter aquaeolei VT8 213290 53
pO111 1 NC 013365 Escherichia coli O111:H- str. 11128 204604 47 H-NS 61 99 80175 79774
pP91278 NC 008613 Photobacterium damselae subsp. piscicida 131520 52 HU 57 99 125918 126184
pP99-018 NC 008612 Photobacterium damselae subsp. piscicida 150157 51 HU 57 99 133314 133580
pRA1 NC 012885 Aeromonas hydrophila 143963 51 HU 58 99 15573 15839
pRp12D01 NC 012855 Ralstonia pickettii 12D 389779 58 HU 37 99 321346 321080
pSN254 NC 009140 Salmonella enterica subsp. enterica serovar Newport str. SL254 176473 53 HU 57 99 14067 14333
pTK9001 NC 013930 Thioalkalivibrio sp. K90mix 240256 62
pYR1 NC 009139 Yersinia ruckeri 158038 51 HU 57 99 15070 15336
R27 NC 002305 Salmonella enterica subsp. enterica serovar Typhi 180461 46 H-NS 61 99 148225 148626
R478 NC 005211 Serratia marcescens 274762 46 H-NS 64 99 111747 111346
Rts1 NC 003905 Proteus vulgaris 217182 46
a This list is the result of a TBLASTN analysis using the N-terminal 300 amino acids of protein TraI_R27 as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
b Average G+C content of the plasmid.
c Reported TBLASTN identity to each NAP.
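Footnote a describes a query built from the N-terminal 300 residues of a prototype relaxase. A minimal sketch of preparing such a truncated query as a FASTA record is shown here; the function name `n_terminal_fasta`, the record naming, and the placeholder sequence are hypothetical, and the placeholder is not the real TraI sequence.

```python
def n_terminal_fasta(name, sequence, n_residues=300, width=60):
    """Return a FASTA record holding the first n_residues of a protein sequence."""
    # Drop whitespace and line breaks, then normalise to upper case.
    cleaned = "".join(sequence.split()).upper()
    head = cleaned[:n_residues]
    # Wrap the truncated sequence at a fixed line width.
    lines = [head[i:i + width] for i in range(0, len(head), width)]
    return ">{}_Nterm{}\n{}\n".format(name, len(head), "\n".join(lines))


# Placeholder 450-residue sequence used only to exercise the function.
PLACEHOLDER_PROTEIN = "MKTAYIAKQR" * 45

record = n_terminal_fasta("relaxase", PLACEHOLDER_PROTEIN)
print(record)
```

The resulting record could then serve as the protein query for a translated search against plasmid nucleotide sequences.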

Table 8: MOBQ-family plasmids of Gram-negative origin^a.

Plasmid name | Accession no. | Source organism | Length (nt) | G+C content (%)^b | NAP gene homolog | Identity (%)^c | Query coverage (%) | Subject start | Subject end
1 NC 008242 Chelativorans sp. BNC1 343931 62 HU 41 94 133932 133678
3 NC 007617 Nitrosospira multiformis ATCC 25196 14159 50
3 NC 007961 Nitrobacter hamburgensis X14 121408 62
At NC 003064 Agrobacterium tumefaciens str. C58 542868 57 IHF 36 82 112654 112412
C NC 010542 Cyanothece sp. ATCC 51142 14685 38
ColE9-J NC 011977 Escherichia coli 7577 50
DN1 NC 002636 Dichelobacter nodosus 5112 62
F plasmid NC 008036 Sphingopyxis alaskensis RB2256 28543 60
p11745 NC 013546 Actinobacillus pleuropneumoniae 5486 38
p12494 NC 010889 Actinobacillus pleuropneumoniae 14393 33
p1ABAYE NC 010401 Acinetobacter baumannii AYE 5644 35
p1META1 NC 012807 Methylobacterium extorquens AM1 44195 68
p1METDI NC 012987 Methylobacterium extorquens DM4 141504 65
p2007057 NC 011897 Salmonella enterica subsp. enterica serovar Bovismorbificans 4270 47
p2ABSDF NC 010396 Acinetobacter baumannii SDF 25014 35
p2ESCUM NC 011739 Escherichia coli UMN026 33809 42
p2META1 NC 012809 Methylobacterium extorquens AM1 37858 65 IHF 44 95 28369 28635
p3ABSDF NC 010398 Acinetobacter baumannii SDF 24922 34
p42a NC 007762 Rhizobium etli CFN 42 194229 58
p49879.1 NC 006907 Leptospirillum ferrooxidans 28878 58 HU 47 99 3281 3015
p49879.2 NC 006909 Leptospirillum ferrooxidans 28012 55 HU 48 99 15858 15592
pAb5S9 NC 009476 Aeromonas bestiarum 24716 54
pACRY07 NC 009473 Acidiphilium cryptum JF-5 5629 58
pAgK84 NC 011994 Agrobacterium radiobacter K84 44420 54
pAM5 NC 008691 Acidiphilium multivorum 5161 58
pAMI2 NC 010847 Paracoccus aminophilus 18563 62
pAMI3 NC 013513 Paracoccus aminophilus 5575 61
pAPA01-030 NC 013212 Acetobacter pasteurianus IFO 3283-01 49961 54
pAPA01-040 NC 013213 Acetobacter pasteurianus IFO 3283-01 3204 54
pAtK84b NC 011990 Agrobacterium radiobacter K84 184668 59 IHF 38 86 54109 53855
pAtS4b NC 011991 Agrobacterium vitis S4 130435 56 IHF 47 97 44880 45152
pAtS4c NC 011984 Agrobacterium vitis S4 211620 59 HU 45 94 141245 140991
pAtS4e NC 011981 Agrobacterium vitis S4 631775 57 HU 41 94 40476 40222
Lrp 30 87 460443 460871
Lrp 34 74 425247 424888
pAV2 NC 010310 Acinetobacter venetianus 15135 36
pB NC 010996 Rhizobium etli CIAT 652 429111 58
pBGR3 NC 012847 Bartonella grahamii as4aup 28192 36
pBS512 5 NC 010659 Shigella boydii CDC 3083-94 5114 46
pC NC 010997 Rhizobium etli CIAT 652 1091523 61 Lrp 46 88 617696 618130
Lrp 42 90 609059 608619
Lrp 39 95 417738 418202
Lrp 42 79 714804 715193
Lrp 39 93 406570 407025
pCAUL01 NC 010335 Caulobacter sp. K31 233649 67 HU 44 99 97598 97329
Lrp 34 89 182479 182042
pCAUL02 NC 010333 Caulobacter sp. K31 177878 64
pCCK1900 NC 011378 Pasteurella multocida 10226 61
pCCK381 NC 006994 Pasteurella multocida 10874 61
pCFPG4 NC 011563 Candidatus Azobacteroides pseudotrichonympha genomovar. CFP2 4149 44
pCHE-A NC 012006 Enterobacter cloacae 7560 60
pColE8 NC 012882 Escherichia coli 6751 51
pCROD3 NC 013719 Citrobacter rodentium ICC168 3910 51
pCT02021853 74 NC 011204 Salmonella enterica subsp. enterica serovar Dublin str. CT 02021853 74551 49
pCVM19633 4 NC 011093 Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 4585 48
pDSHI01 NC 009955 Dinoroseobacter shibae DFL 12 190506 60
pET09 NC 010695 Erwinia tasmaniensis Et1/99 9299 47
pGDIA01 NC 011367 Gluconacetobacter diazotrophicus PAl 5 27455 59
pGOX3 NC 006674 Gluconobacter oxydans 621H 14547 56
pHCG3 NC 005873 Oligotropha carboxidovorans OM5 133058 61
pHRM2a NC 012109 Desulfobacterium autotrophicum HRM2 68709 42
pIGJC156 NC 009781 Escherichia coli 5146 47
pIGMS5 NC 010883 Escherichia coli 6750 51
pIGWZ12 NC 010885 Escherichia coli 4072 50
pISP3 NC 013970 Sphingomonas sp. MM-1 43398 63
pJD4 NC 002098 Neisseria gonorrhoeae 7426 38
plasmid1 NC 007801 Jannaschia sp. CCS1 86072 58
pLD-TEX-KL NC 009966 Fluoribacter dumoffii 66512 39
pMAC NC 006877 Acinetobacter baumannii 9540 35
pMAS2027 NC 013503 Escherichia coli 42644 43
pMbo4.6 NC 013500 Moraxella bovis 4658 39
pMCHL01 NC 011758 Methylobacterium chloromethanicum CM4 380207 66
pMG160 NC 004527 Rhodobacter blasticus 3431 67
pMG828-2 NC 008487 Escherichia coli 4091 50
pMG828-4 NC 008489 Escherichia coli 7462 48
pMMCU1 NC 013056 Acinetobacter calcoaceticus 8771 35
pMMCU2 NC 013506 Acinetobacter baumannii 10270 36
pMRAD01 NC 010510 Methylobacterium radiotolerans JCM 2831 586164 70
pMS260 NC 005312 Actinobacillus pleuropneumoniae 8124 61
pNGR234a NC 000914 Rhizobium sp. NGR234 536165 59 Lrp 41 70 197189 196845
Lrp 30 89 188867 188430
pNGR234b NC 012586 Rhizobium sp. NGR234 2430033 62 Lrp 46 90 656547 656107
Lrp 45 85 667494 667913
Lrp 43 90 1038020 1038463
Lrp 44 85 682796 683215
Lrp 38 96 2400849 2401319
Lrp 44 79 709104 708715
Lrp 41 89 28336 28761
Lrp 33 89 1108900 1109337
Lrp 36 90 703213 702764
Lrp 32 77 1112953 1112582
pNL2 NC 009427 Novosphingobium aromaticivorans DSM 12444 487268 66
pO111 4 NC 013367 Escherichia coli O111:H- str. 11128 8140 50
pO26-S4 NC 011228 Escherichia coli 6758 51
pOLA52 NC 010378 Escherichia coli 51602 46
pOU1114 NC 010421 Salmonella enterica subsp. enterica serovar Dublin 34595 42
pOU1115 NC 010422 Salmonella enterica subsp. enterica serovar Dublin 74589 49
pP NC 003455 Salmonella enterica subsp. enterica serovar Enteritidis 4301 50
pP742405 NC 011733 Cyanothece sp. PCC 7424 18083 38
pP742406 NC 011734 Cyanothece sp. PCC 7424 15219 40
pPMA4326C NC 005921 Pseudomonas syringae pv. maculicola 8244 53
pPNAP07 NC 008763 Polaromonas naphthalenivorans CJ2 9898 57
pPRO2 NC 008608 Pelobacter propionicus DSM 2379 30722 56
pPT1 NC 002143 Comamonas testosteroni 15398 56
pR132501 NC 012848 Rhizobium leguminosarum bv. trifolii WSM1325 828924 60 Lrp 47 88 234905 234471
Lrp 44 86 386338 386760
Lrp 39 93 645542 645087
Lrp 42 79 147165 146776
pR132502 NC 012858 Rhizobium leguminosarum bv. trifolii WSM1325 660973 61
pR132503 NC 012853 Rhizobium leguminosarum bv. trifolii WSM1325 516088 59 HU 47 94 300662 300916
pR132504 NC 012852 Rhizobium leguminosarum bv. trifolii WSM1325 350312 61
pR132505 NC 012854 Rhizobium leguminosarum bv. trifolii WSM1325 294782 60
pRF NC 007110 Rickettsia felis URRWXCal2 62829 34
pRFdelta NC 007111 Rickettsia felis URRWXCal2 39263 33
pRi1724 NC 002575 Agrobacterium rhizogenes 217594 57
pRi2659 NC 010841 Agrobacterium rhizogenes 185462 58
pRL10 NC 008381 Rhizobium leguminosarum bv. viciae 3841 488135 60
pRL11 NC 008384 Rhizobium leguminosarum bv. viciae 3841 684202 61
pRL12 NC 008378 Rhizobium leguminosarum bv. viciae 3841 870021 61 Lrp 46 88 599116 598682
Lrp 43 88 658287 658718
Lrp 39 93 45601 45146
Lrp 42 79 450080 449691
pRL7 NC 008382 Rhizobium leguminosarum bv. viciae 3841 151564 58 HU 48 94 20484 20230
pRL8 NC 008383 Rhizobium leguminosarum bv. viciae 3841 147463 59 Lrp 33 87 70763 70344
pRLG201 NC 011368 Rhizobium leguminosarum bv. trifolii WSM2304 1266105 60 Lrp 45 89 917573 917136
Lrp 44 85 41998 42417
Lrp 44 79 473039 472650
Lrp 40 93 1162146 1161691
Lrp 40 93 1150939 1150484
Lrp 32 88 707587 707162
pRM NC 010927 Rickettsia monacensis 23486 32
pSC101 NC 002056 Salmonella enterica subsp. enterica serovar Typhimurium 9263 51
pSE11-6 NC 011411 Escherichia coli SE11 4082 49
pSE34 NC 010860 Salmonella enterica subsp. enterica serovar Enteritidis 32950 41
pSMED01 NC 009620 Sinorhizobium medicae WSM419 1570951 62 Lrp 40 77 143180 143557
Lrp 34 89 574284 573847
pSMED02 NC 009621 Sinorhizobium medicae WSM419 1245408 60 Lrp 42 91 556486 556932
Lrp 40 91 842324 842758
Lrp 31 87 22345 21917
pSmeSM11a NC 013545 Sinorhizobium meliloti 144170 60 Lrp 46 96 70449 70922
pSmeSM11b NC 010865 Sinorhizobium meliloti SM11 181251 59
pSMS35 4 NC 010486 Escherichia coli SMS-3-5 4074 50
pSx-Qyy NC 006826 Sphingobium xenophagum 5683 56
pSymA NC 003037 Sinorhizobium meliloti 1021 1354226 60 Lrp
pSymB NC 003078 Sinorhizobium meliloti 1021 1683333 62 Lrp 38 90 440778 440335
Lrp 36 89 29555 29992
pTB3 NC 008388 Roseobacter denitrificans OCh 114 16575 55
pTcM1 NC 010600 Acidithiobacillus caldus 65158 57 IHF 56 89 25186 25449
pTiS4 NC 011982 Agrobacterium vitis S4 258824 57 HU 41 94 27356 27102
HU 40 94 83408 83154
Lrp 42 79 96920 97309
pTi-SAKURA NC 002147 Agrobacterium tumefaciens 206479 56 HU 44 94 95763 95509
pUT1 NC 014005 Sphingobium japonicum UT26S 31776 64
pUT2 NC 014009 Sphingobium japonicum UT26S 5398 61
pXAUT01 NC 009717 Xanthobacter autotrophicus Py2 316164 65
pXCV19 NC 007505 Xanthomonas campestris pv. vesicatoria str. 85-10 19146 60
pXF51 NC 002490 Xylella fastidiosa 9a5c 51158 50
pYAN-1 NC 008246 Sphingobium yanoikuyae 5182 62
pYAN-2 NC 008247 Sphingobium yanoikuyae 4924 64
RSF1010 NC 001740 Escherichia coli 8684 61
Symbiotic plasmid p42d NC 004041 Rhizobium etli CFN 42 371254 58
Ti NC 002377 Agrobacterium tumefaciens 194140 55 IHF 43 97 180164 180436
Ti NC 003065 Agrobacterium tumefaciens str. C58 214233 57 HU 44 94 139735 139481
Ti plasmid pTiBo542 NC 010929 Agrobacterium tumefaciens 244978 55 IHF 36 86 209743 209489
IHF 45 98 187204 187479
Unnamed NC 011143 Phenylobacterium zucineum HLK1 382976 69
a This list is the result of a TBLASTN analysis using the N-terminal 300 amino acids of protein MobA_RSF1010 as a query under strict conditions (i.e., thresholds of 30% identity and 70% query coverage).
b Average G+C content of the plasmid.
c Reported TBLASTN identity to each NAP.
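The per-family percentages quoted in Section 3.5 for Tables 7 and 8 can be recomputed from the counts given in the text. The short sketch below does that arithmetic; the dictionary layout and the helper name `percent` are illustrative choices, while the counts themselves are those stated in the paper.

```python
# For each MOB family: (plasmids with a relaxase gene, those also carrying a NAP gene homolog).
# Counts are the ones reported in Section 3.5.
MOB_COUNTS = {
    "MOBC": (13, 1),
    "MOBF": (128, 11),
    "MOBH": (29, 20),
    "MOBP": (86, 8),
    "MOBQ": (131, 30),
    "MOBV": (26, 2),
}


def percent(part, whole):
    """Percentage of part in whole, rounded to the nearest integer."""
    return round(100 * part / whole)


nap_share = {family: percent(with_nap, total)
             for family, (total, with_nap) in MOB_COUNTS.items()}

for family, share in nap_share.items():
    print(f"{family}: {share}% of plasmids carry a NAP gene homolog")
```

Rounded to whole percentages, the values agree with the text: 69% for MOBH, 23% for MOBQ, and below 10% for the other four families.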

[Figure 1: histograms of plasmid frequency (y axis) against plasmid length (x axis, log2 scale, with 10 kb, 100 kb, and 1 Mb marked) in panels (a) and (b); panel (b) legend: H-NS, HU, IHF, Lrp, MvaT, NdpA.]

Figure 1: Size comparison of the Gram-negative plasmids with and without NAP gene homologs. (a) A total of 136 Gram-negative plasmids
with one or more NAP gene homologs and 1246 Gram-negative plasmids without NAP gene homologs are shown by black and white bars,
respectively. (b) Gram-negative plasmids with each NAP gene homolog are as follows: H-NS, red; HU, blue; IHF, green; Lrp, purple; MvaT,
yellow; and NdpA, orange.
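A length histogram on a log2 axis, as in Figure 1, can be sketched by assigning each plasmid to a power-of-two bin. The bin width used for the published figure is not stated, so one bin per power of two is an assumption here, and the function name `log2_bin` is hypothetical. The four lengths are taken from Tables 7 and 8.

```python
from collections import Counter


def log2_bin(length_nt):
    """Return k such that 2**k <= length_nt < 2**(k + 1)."""
    if length_nt <= 0:
        raise ValueError("length must be a positive number of nucleotides")
    # For a positive integer, bit_length() - 1 is the exact floor of log2.
    return length_nt.bit_length() - 1


# Plasmid lengths (nt) as listed in Tables 7 and 8.
LENGTHS = {
    "RSF1010": 8684,
    "R27": 180461,
    "pCAR1": 199035,
    "pNGR234b": 2430033,
}

histogram = Counter(log2_bin(n) for n in LENGTHS.values())

for exponent in sorted(histogram):
    print(f"2^{exponent} to 2^{exponent + 1} nt: {histogram[exponent]} plasmid(s)")
```

Using the integer bit length avoids floating-point rounding at exact powers of two, which matters only for bin edges but keeps the binning deterministic.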

(83 kb). These results suggest that larger plasmids, especially those >100 kb, frequently have NAP gene homologs. Carrying large plasmids may reduce host fitness more than carrying small plasmids because the former have more genes that can disrupt transcriptional networks in the host cell. In addition, large plasmids may have more binding sites for NAPs than small plasmids. Because chromosome-encoded NAPs bind to both chromosomes and plasmids, carrying large plasmids may also result in a reduction in the binding of NAPs to the host chromosome, causing undesirable effects on the host cell. Plasmid-encoded NAP homologs may interact with chromosome-encoded NAPs, coordinately sustain the structure of both chromosome and plasmid, and regulate the transcriptional network [23]. In fact, recent studies have shown that some plasmid-encoded NAP homologs can complement the depletion of chromosomal NAPs and optimize gene transcription both on plasmids and in the host chromosome [14, 15, 24]. Thus, larger plasmids may have NAP gene homologs to maintain host cell fitness. In addition, the average size of the 38 plasmids containing more than one NAP gene homolog was larger (790 kb) than that of the 98 plasmids containing only one NAP gene homolog (199 kb). This suggests that particularly large plasmids have many NAP gene homologs to maintain themselves in the host cell.

Distributions of the NAP genes on proteobacterial genomes were also surveyed using the TBLASTN program. The average size of the completely sequenced bacterial genomes was 3.25 Mb, and 1054 NAP genes (100, Fis; 125, H-NS; 236, HU; 247, IHF; 127, Lrp; 119, MvaT; and 100, NdpA) were found in 588 proteobacterial genomes. The frequency of NAP genes in plasmids was higher (1 per 236 kb) than that in proteobacterial genomes (1 per 1.8 Mb), also suggesting that larger plasmids frequently have NAP gene homologs to minimize their negative effects on the host cell.

Of the plasmids with a NAP gene homolog, the average size of those with the H-NS gene homolog was relatively small (132 kb), while that of those with the Lrp gene homolog was relatively large (725 kb). The average sizes of those with the other NAP gene homologs were as follows: HU (301 kb), IHF (230 kb), MvaT (244 kb), and NdpA (235 kb) (Figure 1(b)). H-NS exists in an oligomeric form and binds to DNA, especially A+T-rich regions, by bridging it [25]. This function may be important for regulating gene expression on relatively small plasmids among those with a NAP gene homolog. The activity of H-NS can also be modulated by Hha-like proteins [26]. Intriguingly, TBLASTN analysis showed that 12 (55%) of 22 plasmids with the H-NS gene homolog also carried a gene encoding an Hha-like protein, although only 65 (5%) of all 1382 plasmids carried an Hha-like protein gene (Table 6). This suggests a close relationship between H-NS and Hha-like proteins. On the other hand, Lrp exists in dimeric, octameric, and hexadecameric forms and compacts DNA by wrapping it [27]. This distinctive DNA-binding ability may be essential for maintaining the structure of particularly large plasmids.

3.4. Relationships between Plasmid G+C Content and NAP Gene Homolog Distributions. Next, we surveyed the G+C content of the Gram-negative group plasmids with and
[Figure 2: histograms of plasmid frequency (y axis) against G+C content (%) (x axis, 0 to 100) in panels (a) and (b); panel (b) legend: H-NS, HU, IHF, Lrp, MvaT, NdpA.]

Figure 2: G+C content comparison of the Gram-negative plasmids with and without NAP gene homologs. (a) A total of 136 Gram-negative
plasmids with one or more NAP gene homologs and 1246 Gram-negative plasmids without NAP gene homologs are shown by black and
white bars, respectively. (b) Gram-negative plasmids with each NAP gene homolog are as follows: H-NS, red; HU, blue; IHF, green; Lrp,
purple; MvaT, yellow; and NdpA, orange.

without NAP gene homologs. The average G+C content of the 136 plasmids with NAP gene homologs was higher (56.4%) than that of all 1382 plasmids (44.8%) (Figure 2(a)). Note that the average G+C content of large and mega plasmids (55.0% and 62.9%, resp.) was higher than that of small and intermediate plasmids (44.8% and 40.4%). Considering that larger plasmids frequently had NAP gene homologs, this seems reasonable. Nevertheless, plasmids with H-NS gene homologs had a lower G+C content (48.3%) than did those with other NAP gene homologs, including HU (54.2%), IHF (58.7%), Lrp (62.3%), MvaT (55.6%), and NdpA (52.9%) (Figure 2(b)). H-NS family proteins bind A+T-rich regions not only on chromosomes but also on plasmids [15]. Acquisition of a large A+T-rich plasmid with many H-NS binding sites may reduce the binding of H-NS to the host chromosome and thereby reduce host cell fitness [14]. It is therefore possible that large A+T-rich plasmids may have to supply another H-NS encoded on themselves to minimize the effect on the host cell. On the other hand, although MvaT-family proteins are functional homologs of H-NS [10, 15], plasmids containing the MvaT gene homolog were not particularly low in G+C content. Only three plasmids contained the MvaT gene homolog, so we cannot discuss this interesting phenomenon in detail, but the difference between H-NS and MvaT may derive from their different origins or host bacteria.

3.5. Relationships between Plasmid Transferability and NAP Gene Homolog Distributions. Conjugative transfer is an essential function of plasmids, through which they play an important role in bacterial evolution and host cell behavior [11, 12]. Relaxase is an essential protein for plasmid transmission, involved in the cleavage of the transferring DNA at the origin of transfer (oriT) site, and plasmids with relaxase genes are thought to be transmissible. Garcillán-Barcia et al. [16] proposed that transmissible plasmids can be classified into 6 MOB families (MOBC, MOBF, MOBH, MOBP, MOBQ, and MOBV) according to the amino acid sequences of 6 prototype relaxase proteins. The MOBF and MOBH families are predominantly composed of conjugative plasmids, also called self-transmissible plasmids, and the other 4 families are composed of both mobilizable and conjugative plasmids. Recent studies have reported that plasmid-encoded H-NS family proteins have a “stealth” function and aid horizontal transfer of plasmids [14, 15]. Other NAPs also act as global transcriptional regulators and may regulate expression of genes involved in plasmid transmission. To discuss the relationship between NAP gene homolog distribution and plasmid transferability, we determined the distribution of genes encoding relaxase proteins in Gram-negative plasmids according to the classification by Garcillán-Barcia et al. [16]. Four hundred and nine (30%) of 1382 Gram-negative plasmids carried relaxase genes, and 71 (17%) of those 409 plasmids carried NAP gene homologs. Note that 71 (52%) of 136 plasmids with NAP gene homologs carried relaxase genes. This indicates that plasmids with NAP gene homologs carried relaxase genes more frequently than did those without NAP gene homologs. This phenomenon may be related to the average size of the plasmids. That of the 409 plasmids with

relaxase genes was relatively larger (145 kb) than that of all 1382 plasmids (83 kb), corresponding to the fact that larger plasmids frequently had NAP gene homologs.

Four hundred and nine plasmids were classified into each MOB family (13, MOBC; 128, MOBF; 29, MOBH; 86, MOBP; 131, MOBQ; and 26, MOBV). Plasmid 1 (NC 008545) was classified into both the MOBC and MOBF families. In addition, the MOBP, MOBQ, and MOBV families partially overlapped, as described by Garcillán-Barcia et al. [16]. Seventy-one plasmids with NAP gene homologs were contained in each MOB family (1, MOBC; 11, MOBF; 20, MOBH; 8, MOBP; 30, MOBQ; and 2, MOBV). Intriguingly, 20 (69%) of 29 MOBH-family plasmids encoded some NAP homologs, and most of them were H-NS or HU (Table 7). The MOBH family was composed of predominantly large conjugative plasmids, such as the IncHI1 group of plasmids, suggesting that HU may also contribute to plasmid transmission, as does H-NS. Furthermore, 30 (23%) of 131 MOBQ-family plasmids also contained some NAP gene homologs, and 15 (50%) of those carried Lrp gene homologs (Table 8). The MOBQ family was composed of both mobilizable and conjugative plasmids, such as those of Rhizobium and Agrobacterium, implying that Lrp may also affect plasmid conjugation. In the other MOB families, plasmids containing NAP gene homologs accounted for less than 10% (8%, MOBC; 9%, MOBF; 9%, MOBP; and 8%, MOBV). This phenomenon may also be related to the average size of the plasmids contained in each MOB family. MOBH (220 kb) and MOBQ (198 kb) plasmids were larger than MOBC (78 kb), MOBF (117 kb), MOBP (87 kb), and MOBV (149 kb) plasmids. On the other hand, the average G+C content of all plasmids belonging to each MOB family was as follows: MOBC (52%), MOBF (52%), MOBH (51%), MOBP (53%), MOBQ (54%), and MOBV (46%). No relationship between the distribution of NAP gene homologs of each MOB family and the G+C content of plasmids was found.

3.6. Conclusions. We compared the distribution of NAP gene homologs among plasmids with plasmid features. Larger plasmids frequently had NAP gene homologs, possibly to maintain themselves and host cell fitness. Plasmids with NAP gene homologs also frequently carried relaxase genes. Although this may be related to their relatively larger sizes, together with the fact that NAPs affect global gene regulation, it is likely that NAPs contribute to plasmid transmission. Considering the fact that NAPs encoded on plasmids actually help the host cell to integrate newly acquired genes into host regulatory networks [14, 15], large plasmids with NAP gene homologs may be generally more beneficial not only for the host cell, but also for their own existence.

NAP homologs encoded on plasmids can interact with different types of NAPs encoded on the host chromosome and cooperatively regulate host transcriptional networks. Understanding these mechanisms in more detail will shed light on the significance of the distributions of NAPs on plasmids and chromosomes. Comprehensive analysis of their binding sites in the host and plasmid genomes will help us to understand the relationships between G+C content and the presence of NAPs. Such information will explain how bacteria adapt and evolve by acquiring foreign genes by HGT.

References
[1] C. J. Dorman, “Chapter 2 nucleoid-associated proteins and bacterial physiology,” Advances in Applied Microbiology, vol. 67, pp. 47–64, 2009.
[2] S. C. Dillon and C. J. Dorman, “Bacterial nucleoid-associated proteins, nucleoid structure and gene expression,” Nature Reviews Microbiology, vol. 8, no. 3, pp. 185–195, 2010.
[3] M. D. Bradley, M. B. Beach, A. P. J. de Koning, T. S. Pratt, and R. Osuna, “Effects of Fis on Escherichia coli gene expression during different growth stages,” Microbiology, vol. 153, no. 9, pp. 2922–2940, 2007.
[4] W. W. Navarre, S. Porwollik, Y. Wang et al., “Selective silencing of foreign DNA with low GC content by the H-NS protein in Salmonella,” Science, vol. 313, no. 5784, pp. 236–238, 2006.
[5] J. Oberto, S. Nabti, V. Jooste, H. Mignot, and J. Rouviere-Yaniv, “The HU regulon is composed of genes responding to anaerobiosis, acid stress, high osmolarity and SOS induction,” PLoS One, vol. 4, no. 2, article e4367, 2009.
[6] M. W. Mangan, S. Lucchini, V. Danino, T. Ó. Cróinín, J. C. D. Hinton, and C. J. Dorman, “The integration host factor (IHF) integrates stationary-phase and virulence gene expression in Salmonella enterica serovar Typhimurium,” Molecular Microbiology, vol. 59, no. 6, pp. 1831–1847, 2006.
[7] K. K. Swinger and P. A. Rice, “IHF and HU: flexible architects of bent DNA,” Current Opinion in Structural Biology, vol. 14, no. 1, pp. 28–35, 2004.
[8] B. K. Cho, C. L. Barrett, E. M. Knight, Y. S. Park, and B. Ø. Palsson, “Genome-scale reconstruction of the Lrp regulatory network in Escherichia coli,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 49, pp. 19462–19467, 2008.
[9] L. D. Murphy, J. L. Rosner, S. B. Zimmerman, and D. Esposito, “Identification of two new proteins in spermidine nucleoids isolated from Escherichia coli,” Journal of Bacteriology, vol. 181, no. 12, pp. 3842–3844, 1999.
[10] C. Tendeng, O. A. Soutourina, A. Danchin, and P. N. Bertin, “MvaT proteins in Pseudomonas spp.: a novel class of H-NS-like proteins,” Microbiology, vol. 149, no. 11, pp. 3047–3050, 2003.
[11] L. S. Frost, R. Leplae, A. O. Summers, and A. Toussaint, “Mobile genetic elements: the agents of open source evolution,” Nature Reviews Microbiology, vol. 3, no. 9, pp. 722–732, 2005.
[12] C. M. Thomas and K. M. Nielsen, “Mechanisms of, and barriers to, horizontal gene transfer between bacteria,” Nature Reviews Microbiology, vol. 3, no. 9, pp. 711–721, 2005.
[13] A. Carattoli, “Plasmid-mediated antimicrobial resistance in Salmonella enterica,” Current Issues in Molecular Biology, vol. 5, no. 4, pp. 113–122, 2003.
[14] M. Doyle, M. Fookes, AL. Ivens, M. W. Mangan, J. Wain, and C. J. Dorman, “An H-NS-like stealth protein aids horizontal DNA transmission in bacteria,” Science, vol. 315, no. 5809, pp. 251–252, 2007.
[15] C.-S. Yun, C. Suzuki, K. Naito et al., “Pmr, a histone-like protein H1 (H-NS) family protein encoded by the IncP-7 plasmid pCAR1, is a key global regulator that alters host function,” Journal of Bacteriology, vol. 192, no. 18, pp. 4720–4731, 2010.

[16] M. P. Garcillán-Barcia, M. V. Francia, and F. de la Cruz, “The diversity of conjugative relaxases and its application in plasmid classification,” FEMS Microbiology Reviews, vol. 33, no. 3, pp. 657–687, 2009.
[17] J. Wei, M. B. Goldberg, V. Burland et al., “Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T,” Infection and Immunity, vol. 71, no. 5, pp. 2775–2786, 2003.
[18] C. K. Sherburne, T. D. Lawley, M. W. Gilmour et al., “The complete DNA sequence and analysis of R27, a large IncHI plasmid from Salmonella typhi that is temperature sensitive for transfer,” Nucleic Acids Research, vol. 28, no. 10, pp. 2177–2186, 2000.
[19] J. Wain, L. T. D. Nga, C. Kidgell et al., “Molecular analysis of incHI1 antimicrobial resistance plasmids from Salmonella serovar Typhi strains associated with typhoid fever,” Antimicrobial Agents and Chemotherapy, vol. 47, no. 9, pp. 2732–2739, 2003.
[20] A. Tett, A. J. Spiers, L. C. Crossman et al., “Sequence-based analysis of pQBR103; a representative of a unique, transfer-proficient mega plasmid resident in the microbial community of sugar beet,” ISME Journal, vol. 1, no. 4, pp. 331–340, 2007.
[21] K. Maeda, H. Nojiri, M. Shintani, T. Yoshida, H. Habe, and T. Omori, “Complete nucleotide sequence of carbazole/dioxin-degrading plasmid pCAR1 in Pseudomonas resinovorans strain CA10 indicates its mosaicity and the presence of large catabolic transposon Tn4676,” Journal of Molecular Biology, vol. 326, no. 1, pp. 21–33, 2003.
[22] Y. Takahashi, M. Shintani, H. Yamane, and H. Nojiri, “The complete nucleotide sequence of pCAR2: pCAR2 and pCAR1 were structurally identical incP-7 carbazole degradative plasmids,” Bioscience, Biotechnology and Biochemistry, vol. 73, no. 3, pp. 744–746, 2009.
[23] P. Deighan, C. Beloin, and C. J. Dorman, “Three-way interactions among the Sfh, StpA and H-NS nucleoid-structuring proteins of Shigella flexneri 2a strain 2457T,” Molecular Microbiology, vol. 48, no. 5, pp. 1401–1416, 2003.
[24] S. C. Dillon, A. D. S. Cameron, K. Hokamp, S. Lucchini, J. C. D. Hinton, and C. J. Dorman, “Genome-wide analysis of the H-NS and Sfh regulatory networks in Salmonella Typhimurium identifies a plasmid-encoded transcription silencing mechanism,” Molecular Microbiology, vol. 76, no. 5, pp. 1250–1265, 2010.
[25] R. T. Dame, M. C. Noom, and G. J. L. Wuite, “Bacterial chromatin organization by H-NS protein unravelled using dual DNA manipulation,” Nature, vol. 444, no. 7117, pp. 387–390, 2006.
[26] C. Madrid, C. Balsalobre, J. García, and A. Juárez, “The novel Hha/YmoA family of nucleoid-associated proteins: use of structural mimicry to modulate the activity of the H-NS family of proteins,” Molecular Microbiology, vol. 63, no. 1, pp. 7–14, 2007.
[27] S. de los Rios and J. J. Perona, “Structure of the Escherichia coli leucine-responsive regulatory protein Lrp reveals a novel octameric assembly,” Journal of Molecular Biology, vol. 366, no. 5, pp. 1589–1602, 2007.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 250154, 9 pages
doi:10.4061/2011/250154

Research Article
New Insights on the Evolutionary History of Aphids
and Their Primary Endosymbiont Buchnera aphidicola

Vicente Pérez-Brocal,1, 2 Rosario Gil,2, 3 Andrés Moya,1, 2, 3 and Amparo Latorre1, 2, 3


1 Área de Genómica y Salud, Centro Superior de Investigación en Salud Pública (CSISP), Avenida de Cataluña 21, 46020 Valencia, Spain
2 CIBER Epidemiología y Salud Pública (CIBERESP), Spain
3 Departament de Genètica, Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Apartado Postal 22085, 46071 Valencia, Spain

Correspondence should be addressed to Vicente Pérez-Brocal, perez vicbro@gva.es

Received 13 October 2010; Accepted 24 December 2010

Academic Editor: Hiromi Nishida

Copyright © 2011 Vicente Pérez-Brocal et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Since the establishment of the symbiosis between the ancestor of modern aphids and their primary endosymbiont, Buchnera
aphidicola, insects and bacteria have coevolved. Due to this parallel evolution, the analysis of bacterial genomic features constitutes
a useful tool to understand their evolutionary history. Here we report, based on data from B. aphidicola, the molecular evolutionary
analysis, the phylogenetic relationships among lineages and a comparison of sequence evolutionary rates of symbionts of four
aphid species from three subfamilies. Our results support previous hypotheses of divergence of B. aphidicola and their host lineages
during the early Cretaceous and indicate a closer relationship between subfamilies Eriosomatinae and Lachninae than with the
Aphidinae. They also reveal a general evolutionary pattern among strains at the functional level. We also point out the effect of
lifecycle and generation time as a possible explanation for the accelerated rate in B. aphidicola from the Lachninae.

1. Introduction

Aphids constitute a diversified group of insects, widespread and of economic relevance as crop pests. The underlying
reason for their ecological success is their novel capability to exploit ecological niches with few competitors, mainly
due to their diet based on phloem, which is abundant and of easy access but represents an unbalanced source of nutrients,
rich in sugars and poor in amino acids [1]. The key to the use of new resources lies in the establishment of an obligate
endosymbiotic relationship between the ancestor of aphids and a gamma-proteobacterium, the ancestor of Buchnera
aphidicola. This single event of infection has been dated at least 150–200 million years ago (MYA) [2] according to the
fossil record or to 80–150 MYA based on molecular data [3]. As a result of millions of years of cospeciation of host and
endosymbiont, the current species of aphids carrying their specific strains of B. aphidicola emerged.

The vertical mode of transmission of B. aphidicola, from mother to eggs and embryos, together with the location in
specific host cells (the bacteriocytes), determines a population scenario for this bacterium characterized by a low
effective population size, with frequent bottlenecks and little chance of genetic recombination with other bacteria.
As a result, the genome reductive process undergone by B. aphidicola encompasses a decrease in genome size due to the
loss of genes that are unnecessary in the new intracellular context, an increase in A+T content compared to its
free-living relatives, a significant acceleration in evolutionary rates, mainly due to the accumulation of
nonsynonymous substitutions, the loss of codon bias, the loss of many regulatory proteins and functions, and the
retention of genes linked to its symbiotic role [4–9].

This particular history of genome reduction is pertinent to understanding the coevolution between particular aphid
hosts and B. aphidicola. Many of the genes that are involved
in recombination and/or genetic transfer were lost at the beginning of the symbiotic association and, consequently,
the B. aphidicola clones have evolved independently in each particular host with little or no chance of gene exchange
among B. aphidicola from different aphid hosts [10].

The comparison of the topology of phylogenetic trees based on aphid genes and those from B. aphidicola reveals
a perfect match [2, 11]. As a result of this parallel evolutionary pattern, B. aphidicola can be regarded as an
excellent marker to elucidate the evolutionary relationships of the aphids harboring particular B. aphidicola strains.

The analysis of B. aphidicola genes that follow an evolutionary pattern that agrees with the molecular clock
hypothesis [12, 13] can be used to estimate the divergence time between pairs of aphids. This is possible because two
aphid species, Acyrthosiphon pisum and Schizaphis graminum, belonging to two tribes of the subfamily Aphidinae, have an
estimated divergence time calibrated from their fossil record of 50 to 70 MY [14]. In addition, using molecular data
from the complete B. aphidicola genomes available, Pérez-Brocal and coworkers calculated the divergence time of aphids
belonging to subfamilies Eriosomatinae (Baizongia pistaciae) and Lachninae (Cinara cedri) [15]. Based on morphological
traits, the subfamilies Eriosomatinae and Lachninae have traditionally been considered very divergent. In fact, most
phylogenetic hypotheses based on both morphological and molecular data consider the Lachninae as a sister group
of the Aphidinae [11, 16]. However, the position of this subfamily remains controversial, as recent phylogenies based
on molecular sequences located the subfamily in a basal position [17–19]. Here, we follow a genomic approach to
deepen the evolutionary analyses and propose a phylogeny of the three subfamilies of aphids based on the genome
sequences of their primary endosymbiont B. aphidicola. In addition, in order to detect whether there is any selective
effect related to the specific role of the genes, we also took a closer look at the acceleration pattern of each
functional category.

2. Materials and Methods

2.1. Genome Sequences Used in This Study. The genome sequences used in this study were retrieved from the GenBank
database. The four B. aphidicola strains are B. aphidicola Acyrthosiphon pisum str. APS (BAp, Accession no. BA000003
[20]), B. aphidicola Schizaphis graminum (BSg, Accession no. AE013218 [21]), B. aphidicola Baizongia pistaciae (BBp,
Accession no. AE016826 [22]), and B. aphidicola Cinara cedri (BCc, Accession no. CP000263 [15]). Escherichia coli was
used as out-group in all comparisons: E. coli str. K12 substr. MG1655 (Eco, Accession no. U00096).

2.2. Sequence Alignments. For protein-coding genes, nucleotide sequences were translated into amino acids using the
ClustalW tool implemented in the MEGA4 package [23]. The generated amino acid sequences were used, in turn,
as a template to align the corresponding nucleotides with MUSCLE v3.6 [24], to reduce ambiguities.

2.3. Estimate of Strain-Specific Evolutionary Rates. B. aphidicola BCc was used as a reference strain since it is the
one with the lowest gene complement of those analyzed. For each of the genes present in B. aphidicola BCc having
an ortholog in at least one of the other B. aphidicola strains, an analysis of relative substitution rates between
pairs of B. aphidicola strains was carried out, using E. coli as out-group. Specifically, we applied Tajima’s relative
rate test [25] with MEGA4, generating six comparisons for each of the aligned genes. Genes showing accelerated rates
were grouped according to a nonredundant classification of categories based on that used in the sequencing work on
Aquifex aeolicus [26], with some modifications [27].

2.4. Estimate of Evolutionary Acceleration among Genomes. The sequences of the 338 protein-coding genes shared
by the four B. aphidicola strains plus E. coli were used to quantify the relative degree of evolutionary acceleration
among strains. To do this, nucleotide sequences were concatenated with BioEdit and aligned using the ClustalW
tool implemented in MEGA4 [23]. Three different estimates of substitution rates per site between species i
and j (Kij) were carried out with MEGA4, using (a) the total and (b) nonsynonymous nucleotide positions, under
the Kimura 2-parameter and the modified Nei-Gojobori methods, respectively, and (c) amino acid sequences, using
the JTT substitution matrix. K01 and K02 were calculated according to Moran [28], taxon 0 being the last common
ancestor of the endosymbiont strains compared in each test (taxa 1 and 2). The calculation of total and nonsynonymous
substitutions allowed us to account for the phenomenon of saturation. To check for saturation, the “transition and
transversion versus divergence” plot was implemented by DAMBE v4.2.13 for the concatenation of shared genes using
the first and second positions as well as the third one [29]. This method has been successfully used previously to
estimate saturation due to divergence [30–33].

Additionally, for each protein-coding gene under study, the values of both synonymous (dS) and nonsynonymous
(dN) nucleotide substitutions were calculated, using a modified Nei-Gojobori model (Jukes-Cantor) implemented by
MEGA4 [23]. To calculate the synonymous (λS) and nonsynonymous (λN) nucleotide substitutions per million years,
we used the expression λ = K/2T, where K is the number of nucleotide differences per site and T the estimated
divergence time. The T values used in these analyses were 107 MY for (B. aphidicola BAp-BSg-)BBp, 111 MY for
(B. aphidicola BAp-BSg-)BCc, and 112 MY for B. aphidicola BBp-BCc. These are the previously determined lowest values
for each range of estimated divergence times among strains [15], based on the range of 50 to 70 MY since the strains
used for calibration (B. aphidicola BAp and BSg) diverged as estimated from the fossil record [14]. The global average
λS and λN values for each pair of B. aphidicola strains were calculated, as well as the partial average λS and λN
values for each functional category [26, 27] between all the strain pairs.

2.5. Phylogenetic Analyses. Since saturation was reached at the third position in all comparisons but BAp and BSg, in
order to reduce the loss of phylogenetic signal we excluded this position when working with nucleotides to perform
our phylogenetic analyses. The concatenated sequence of the 338 protein-coding genes shared by the four B. aphidicola
strains was used to reconstruct the phylogenetic relationships among them. Maximum Likelihood (ML) analyses
were carried out with PAUP4.0b10 [34] for nucleotides, and Phyml v2.4.5 [35] for amino acids, according to the
best models of nucleotide (GTR+I+G) and amino acid (CpREV+I+G+F) substitutions for those genes derived
from jModelTest [36] and ProtTest 1.4 [37], respectively. Nucleotides and amino acids were also used for Bayesian
analysis, with MrBayes v3.1.2 [38], using four MCMC chains, 1,000,000 generations, with trees sampled every 100
generations. Consensus trees were produced after excluding an initial burn-in of 25% of the samples, as recommended.

In a previous study, the evolutionary analyses of the four B. aphidicola strains showed that only 21 genes fulfill
the molecular clock hypothesis [15]. The topologies of the 21 phylogenetic trees based on these genes were obtained
by ML using PAUP 4.0b10 [34], in order to determine the most plausible evolutionary relationships among strains and
compare them with the phylogenetic reconstruction.

2.6. Statistical Analyses. All statistical analyses were performed using the software package R
(http://www.r-project.org) [39]. A chi-square analysis was applied to the global distribution of the accelerated genes
among B. aphidicola strains compared to the distribution within functional families, to test whether any particular
functional category contains a significantly increased or reduced number of accelerated genes. Twelve comparisons with
Yates’ correction were carried out, at a significance level α = 0.05.

The average rates of synonymous (λS) and nonsynonymous (λN) substitutions per site per million years of the
six possible comparisons among B. aphidicola strains were compared using a one-way ANOVA followed by
Tukey’s range tests to find which means are significantly different from one another.

3. Results

3.1. Comparison of the Evolutionary Rates in B. aphidicola Strains at a Genome Level. The relative rate test on the 338
concatenated protein-coding genes (Table 1) reveals that, since the last common ancestor of each pair of strains, the
accumulation of both nucleotide and amino acid substitutions, as well as the nonsynonymous substitution rates,
follows different rates in the different strains, but the values obtained using all three parameters are equivalent
for any given strain pair. Thus, for the nucleotide sequences, a similar pattern of relative evolutionary rates was
observed when total and nonsynonymous substitution rates are considered. B. aphidicola BSg and BAp show a similar rate
(1.12 : 1), the one in B. aphidicola BBp being slightly higher (1.3–1.4-fold that of B. aphidicola BSg and BAp) and
B. aphidicola BCc being the one with the most accelerated rates (1.7-fold that of B. aphidicola BBp and more than
2-fold that of B. aphidicola BAp and BSg). As for the amino acid sequences, the relative acceleration shows a similar
pattern to the one observed for the nucleotides, but with values in B. aphidicola BCc of 2 to 3-fold those of
B. aphidicola BBp and BAp-BSg, respectively.

The evolutionary acceleration among genomes was also determined through the analysis of the synonymous (λS)
and nonsynonymous (λN) nucleotide substitutions per million years. The results show that the two rates exhibit
opposite patterns (Figure 1). Differences in both λS and λN are statistically significant (ANOVA test, significance
level 0.05), clustering into three separate groups for λS and two groups for λN, according to Tukey’s range tests.
When synonymous substitutions (Figure 1(a)) are considered, the most accelerated rate is found in the comparison
between strains B. aphidicola BAp and BSg, a second group includes the comparisons of B. aphidicola BBp with the two
aforementioned strains, and the least accelerated one includes all rates in which B. aphidicola BCc is involved.
A different pattern is found for nonsynonymous substitutions (Figure 1(b)), where the more accelerated
group includes all the comparisons involving B. aphidicola BCc, and the other one includes the remaining three
comparisons.

3.2. Analyses of the Evolutionary Rates at a Functional Level. The general pattern identified at a genomic level is
reproduced in every functional category (see Section 2), with the same three and two groups of B. aphidicola strain
pairs found in λS and λN, respectively (Figure 2). On the other hand, no significant differences are found among
functional categories in any strain for λS (Figure 2(a)). However, a significant increase in λN is found for the genes
involved in cell envelope in all the strains (P < .05) and to a lesser extent in the category of poorly characterized
genes (Figure 2(b)). This could be due to a significant acceleration of the flagellar genes still remaining in
B. aphidicola, especially in BCc, the strain which has undergone the most drastic reduction in the flagellar machinery.

In a previous study, we determined the global relative distribution of accelerated genes displayed by the strains,
using Tajima’s relative rate test [15]. According to this test, B. aphidicola BCc presents the highest number of
accelerated genes (56%–83%), while B. aphidicola BBp presents intermediate values (0.6%–35%), and the fewest appear in
B. aphidicola BSg and especially BAp. This trait is observed in each functional category with no significant
differences (Figure 3). This homogeneous distribution of the accelerated genes across functional categories was tested
by the application of χ² tests, based on the observed number of accelerated genes for each category and the expected
number of genes based on the totality of them for each pair of strains. None of the tests was statistically
significant at P < .05 (Table 2).

3.3. Phylogenetic Analyses Show an Evolutionary Radiation Pattern. According to the molecular clock hypothesis, two
taxa sharing a common ancestor should have accumulated the same number of substitutions since they diverged. In the
B. aphidicola case, only 21 genes do not reject the molecular clock hypothesis [15]. These genes can be used to
identify the phylogenetic relationships among the strains under study, which will also reflect the relationships among
their insect hosts. However, three different tree topologies appear in

Table 1: Relative rate tests for the 338 concatenated protein-coding genes shared by the four B. aphidicola strains included in this study plus
E. coli^a: (a) nonsynonymous sites, (b) all nucleotides, and (c) amino acids.

(a) Nonsynonymous sites

Taxon 1   Taxon 2   Taxon 3   K12     K13     K23     K13 − K23   K01/K02
BAp       BSg       Eco       0.152   0.339   0.348   −0.009      0.89
BAp       BBp       Eco       0.319   0.339   0.395   −0.056      0.70
BAp       BCc       Eco       0.380   0.339   0.494   −0.155      0.42
BSg       BBp       Eco       0.319   0.348   0.395   −0.047      0.74
BSg       BCc       Eco       0.377   0.348   0.494   −0.146      0.44
BBp       BCc       Eco       0.392   0.395   0.494   −0.099      0.60

(b) All nucleotides

Taxon 1   Taxon 2   Taxon 3   K12     K13     K23     K13 − K23   K01/K02
BAp       BSg       Eco       0.242   0.617   0.630   −0.013      0.89
BAp       BBp       Eco       0.421   0.617   0.685   −0.068      0.72
BAp       BCc       Eco       0.452   0.617   0.791   −0.174      0.50
BSg       BBp       Eco       0.417   0.630   0.685   −0.055      0.77
BSg       BCc       Eco       0.445   0.630   0.791   −0.161      0.47
BBp       BCc       Eco       0.463   0.685   0.791   −0.106      0.62

(c) Amino acids

Taxon 1   Taxon 2   Taxon 3   K12     K13     K23     K13 − K23   K01/K02
BAp       BSg       Eco       0.350   0.814   0.842   −0.028      0.85
BAp       BBp       Eco       0.845   0.814   1.001   −0.187      0.64
BAp       BCc       Eco       1.126   0.814   1.410   −0.596      0.30
BSg       BBp       Eco       0.850   0.842   1.001   −0.159      0.68
BSg       BCc       Eco       1.180   0.842   1.410   −0.568      0.33
BBp       BCc       Eco       1.186   1.001   1.410   −0.409      0.48

^a In each test, taxa 1 and 2 represent B. aphidicola strains, taxon 3 represents E. coli, and taxon 0 represents the last common ancestor of taxa 1 and 2.
Kij is the estimate of substitutions per site between taxon i and taxon j.
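
The K01/K02 column can be illustrated with a short sketch. It assumes the usual three-taxon relative-rate decomposition, in which the out-group distances are used to split K12 into the two branch lengths leading from the last common ancestor (taxon 0): K01 = (K12 + K13 − K23)/2 and K02 = (K12 − K13 + K23)/2. The source states only that K01 and K02 were calculated according to Moran [28], so this formula and the function names below are assumptions; under them, the ratios of panel (a) are reproduced to two decimals.

```python
def branch_lengths(k12, k13, k23):
    """Split K12 into the branches 0->1 and 0->2 (assumed three-taxon formula)."""
    k01 = (k12 + k13 - k23) / 2
    k02 = (k12 - k13 + k23) / 2
    return k01, k02


def branch_ratio(k12, k13, k23):
    """Return K01/K02 for one relative rate comparison."""
    k01, k02 = branch_lengths(k12, k13, k23)
    return k01 / k02


# (K12, K13, K23) values copied from Table 1, panel (a), nonsynonymous sites.
PANEL_A = {
    ("BAp", "BSg"): (0.152, 0.339, 0.348),
    ("BAp", "BBp"): (0.319, 0.339, 0.395),
    ("BAp", "BCc"): (0.380, 0.339, 0.494),
    ("BSg", "BBp"): (0.319, 0.348, 0.395),
    ("BSg", "BCc"): (0.377, 0.348, 0.494),
    ("BBp", "BCc"): (0.392, 0.395, 0.494),
}

for pair, distances in PANEL_A.items():
    print(pair, round(branch_ratio(*distances), 2))
```

A ratio below 1 means that taxon 2 has accumulated more substitutions than taxon 1 since their last common ancestor, which is how the fold differences quoted in Section 3.1 can be read from the table (for example, 1/0.89 ≈ 1.12 for BAp and BSg).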

[Figure 1 plots: (a) λS (y-axis 0.003–0.008) and (b) λN (y-axis 0.0014–0.002) for the B. aphidicola strain pairs
BAp-BSg, BAp-BBp, BAp-BCc, BSg-BBp, BSg-BCc, and BBp-BCc.]

Figure 1: Global average values (and confidence interval of 95%) of (a) synonymous (λS ) and (b) nonsynonymous (λN ) nucleotide sub-
stitutions per site per million years. The divergence times among strains are 50 (BAp-BSg), 107 (BAp-BBp and BSg-BBp), 111 (BAp-BCc
and BSg-BCc), and 112 (BBp-BCc) MY, respectively. The numbers of shared protein-coding genes are 348 (BAp-BSg), 347 (BAp-BBp), 354
(BAp-BCc), 343 (BSg-BBp), 350 (BSg-BCc), and 350 (BBp-BCc), respectively.
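
The rates plotted in Figure 1 follow the expression λ = K/2T from Section 2.4, with T taken from the divergence times listed in the caption. The factor 2 reflects that both lineages have been accumulating substitutions since they diverged. A minimal sketch is given below; the function name and the example K value are hypothetical and are not data from this study.

```python
# Divergence times in million years (MY), as listed in the Figure 1 caption.
DIVERGENCE_MY = {
    ("BAp", "BSg"): 50,
    ("BAp", "BBp"): 107,
    ("BSg", "BBp"): 107,
    ("BAp", "BCc"): 111,
    ("BSg", "BCc"): 111,
    ("BBp", "BCc"): 112,
}


def rate_per_site_per_my(k, pair):
    """lambda = K / (2T): substitutions per site per million years."""
    t = DIVERGENCE_MY[pair]
    return k / (2 * t)


# Hypothetical example: K = 0.5 substitutions per site between BAp and BSg.
print(rate_per_site_per_my(0.5, ("BAp", "BSg")))
```

Because T enters the denominator, a pair with a modest number of substitutions per site but a short divergence time (BAp-BSg, 50 MY) can still show the highest rate per million years.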

Table 2: Yates’ chi-square tests for the accelerated genes classified by functional category in four B. aphidicola strains compared in pairs.
Acceleration is based on Tajima’s relative rate tests. The total number of comparisons for each particular category and pair of strains is shown
in brackets ( ). A/B: number of accelerated genes in A compared to B and in B compared to A, respectively.

Observed (pairs of strains)

Functional category                             BAp/BSg        BAp/BBp        BAp/BCc        BSg/BBp        BSg/BCc        BBp/BCc
(1) Information storage and processing          5/12 (160)     0/54 (160)     0/137 (162)    0/47 (158)     0/122 (160)    1/86 (160)
(2) Protein processing, folding, and secretion  1/1 (25)       0/11 (24)      0/19 (25)      0/10 (24)      0/18 (25)      0/14 (24)
(3) Cellular processes                          0/0 (10)       0/5 (10)       0/7 (10)       0/3 (10)       0/7 (10)       1/7 (10)
(4) Metabolism                                  2/9 (103)      3/34 (103)     0/86 (104)     2/32 (104)     0/88 (105)     0/61 (106)
(5) Cell envelope                               0/0 (14)       1/7 (13)       0/12 (14)      1/7 (13)       0/12 (14)      0/10 (13)
(6) Poorly characterized                        1/1 (33)       1/10 (34)      0/29 (35)      0/7 (32)       0/23 (33)      0/15 (34)
Total                                           9/23 (345)     5/121 (344)    0/290 (350)    3/106 (341)    0/270 (347)    2/193 (347)

Expected (pairs of strains)

Functional category                             BAp/BSg        BAp/BBp        BAp/BCc        BSg/BBp        BSg/BCc        BBp/BCc
(1) Information storage and processing          4.17/10.67     2.33/56.28     0.00/134.23    1.39/49.11     0.00/124.50    0.92/88.99
(2) Protein processing, folding, and secretion  0.65/1.67      0.35/8.44      0.00/20.71     0.21/7.46      0.00/19.45     0.14/13.35
(3) Cellular processes                          0.26/0.67      0.15/3.52      0.00/8.29      0.09/3.11      0.00/7.78      0.06/5.56
(4) Metabolism                                  2.69/6.87      1.50/36.23     0.00/86.17     0.91/32.33     0.00/81.70     0.61/58.96
(5) Cell envelope                               0.36/0.93      0.19/4.57      0.00/11.60     0.11/4.04      0.00/10.89     0.07/7.23
(6) Poorly characterized                        0.86/2.20      0.49/11.96     0.00/29.00     0.28/9.95      0.00/25.68     0.20/18.91
χ² (with Yates’ correction, 5 d.f.)             0.501/0.933    3.491/1.908    0.000/0.195    4.776/2.762    0.000/0.720    7.455/1.598
P value                                         .992/.968      .625/.862      1.000/.999     .444/.737      1.000/.982     .189/.902
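
The expected values in Table 2 are consistent with distributing the total number of accelerated genes of a strain across the categories in proportion to the number of comparisons in each category. The sketch below reproduces, to within rounding, the BAp column of the BAp/BSg pair under that reading, and adds one common form of the Yates-corrected statistic, Σ(|O − E| − 0.5)²/E. The function names are illustrative, and the exact correction convention used by the authors is not stated, so the statistic computed here may not match the published χ² values exactly.

```python
def expected_counts(observed, comparisons):
    """Spread the total of accelerated genes proportionally to category size."""
    total_observed = sum(observed)
    total_comparisons = sum(comparisons)
    return [total_observed * n / total_comparisons for n in comparisons]


def yates_chi_square(observed, expected):
    """Sum of (|O - E| - 0.5)^2 / E; categories with E == 0 are skipped."""
    return sum(
        (abs(o - e) - 0.5) ** 2 / e
        for o, e in zip(observed, expected)
        if e > 0
    )


# Observed accelerated genes in BAp (versus BSg) and comparisons per category,
# copied from the BAp/BSg column of Table 2.
observed_bap = [5, 1, 0, 2, 0, 1]
comparisons = [160, 25, 10, 103, 14, 33]

expected_bap = expected_counts(observed_bap, comparisons)
print([round(e, 2) for e in expected_bap])
print(round(yates_chi_square(observed_bap, expected_bap), 3))
```

With six categories the test has 5 degrees of freedom, as indicated in the table.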

a similar number of cases for these 21 genes (see Figure 4). Six genes generated topology a (B. aphidicola BCc basal),
seven topology b (B. aphidicola BBp basal), and eight topology c (B. aphidicola BCc and BBp clustered). Therefore,
the analysis of these genes, individually considered, does not resolve the position of the B. aphidicola BCc and BBp
strains. This result points to the possibility of a radiation within a relatively short period of time, giving rise
to the subfamilies. To confirm this point, and in order to resolve the deepest relationships among subfamilies, a more
exhaustive phylogenetic reconstruction was carried out, based on all the concatenated protein-coding genes shared by
the four B. aphidicola strains. The resulting phylogenetic tree (Figure 5) shows the same topology as tree c in
Figure 4, that is, a well-supported clade consisting of both members of the subfamily Aphidinae, as expected, and
another clade that shows a clustering of B. aphidicola BBp and BCc, also with the maximum statistical support. The
uneven branch lengths, that of B. aphidicola BCc being significantly longer, indicate the evolutionary acceleration
experienced by this strain. The topology obtained using amino acid sequences is identical, but the relative length of
the B. aphidicola BCc branch is even longer, reflecting a higher value of nonsynonymous substitutions.

4. Discussion

4.1. Reconstruction of the Evolutionary History of Aphids Belonging to Subfamilies Aphidinae, Eriosomatinae and
Lachninae. Aphids emerged as a monophyletic group of viviparous insects about 250 MYA as a divergent group from
the oviparous Adelgidae and Phylloxeridae [11]. The basal radiation of the family Aphididae was dated by molecular
data to the Cretaceous, 80 to 150 MYA [3]. Although the initial development of aphids took place on gymnosperms
during the Mesozoic, most of their current diversity is linked to angiosperms, especially to grass [40]. The
extraordinary diversity of aphids found today, affecting especially the subfamily Aphididae, started during the
Tertiary (Miocene), as a consequence of the proliferation of herbaceous angiosperms [41, 42].

The phylogenetic position of the subfamily Lachninae within the Aphididae is controversial. Traditionally,
phylogenies based on both morphological characters [11, 16] and on mitochondrial rDNA [3] have placed them as a
monophyletic group clustering with the Aphidinae. However, phylogenies based on sequences from both nuclear and
mitochondrial aphid genes (long-wavelength opsin gene, the elongation factor 1α gene, and mitochondrial genes encoding
ATPase 6 subunit and the subunit II of the cytochrome oxidase), as well as those based on their primary endosymbiont
B. aphidicola (16S rDNA and the β subunit of the F-ATPase complex) [17–19] place them as a basal group apart from
the Aphidinae. This has implications for whether aphids feeding on conifers (such as most members of the subfamily
Lachninae, including C. cedri) are regarded as ancestral to groups feeding on angiosperms or, alternatively, as more
recent, secondarily derived conifer suckers.

Our phylogenetic analysis supports the presence of one clade clustering B. aphidicola BBp and BCc, and another
clade consisting of B. aphidicola BAp and BSg. This result is consistent with a scenario of rapid evolutionary
radiation of the main subfamilies of aphids during the early Cretaceous (144–100 MYA), which seems concordant
with previous proposals [3]. In addition, our evolutionary molecular data from B. aphidicola indicate that aphids
belonging to subfamilies Eriosomatinae and Lachninae share a more recent common ancestor with each other than with
the members of subfamily Aphidinae. If true, our data refute the traditional phylogenetic reconstructions that placed
Aphidinae and Lachninae as a monophyletic group [11]. However, we do not have evidence to conclude whether,
within the subfamily Lachninae, tribes feeding on conifers are ancestral or more recent than those living on
herbaceous angiosperms, since our analysis does not resolve which strain (and thus which host aphid) is basal compared
to the others. To solve this point, it would be necessary to sequence the genomes of a greater number of B. aphidicola
strains, including members of the different tribes from the subfamily Lachninae (work in progress). This would allow
us to establish the date of divergence between those tribes and, thus, try to relate this fact to the change of plant
host in either direction.

[Figure 2 plots: (a) λS (y-axis 0–0.009) and (b) λN (y-axis 0–0.004) for functional categories (1)–(6); legend:
BAp-BSg, BAp-BBp, BAp-BCc, BSg-BBp, BSg-BCc, BBp-BCc.]

Figure 2: Average values (and confidence interval of 95%) of (a) synonymous (λS) and (b) nonsynonymous (λN) nucleotide
substitutions per site per million years for each functional category. The numbers of shared protein-coding genes are
348 (BAp-BSg), 347 (BAp-BBp), 354 (BAp-BCc), 343 (BSg-BBp), 350 (BSg-BCc), and 350 (BBp-BCc), respectively.
(1) Information storage and processing; (2) protein processing, folding, and secretion; (3) cellular processes;
(4) metabolism; (5) cell envelope; (6) poorly characterized. Each given comparison is colored as illustrated above.

4.2. Accelerated Evolutionary Rates in B. aphidicola within the Subfamily Lachninae. From an evolutionary perspective,
the protein-coding genes of B. aphidicola show higher ratios of nonsynonymous versus synonymous substitutions (dN/dS)
than those of free-living bacteria, due to an accelerated rate of nonsynonymous substitutions, a characteristic of
bacterial endosymbionts [14, 28], where mutations with amino acid replacement are not efficiently eliminated because
of relaxed purifying selection, leading to a greater accumulation of amino acid changes than in free-living bacteria.
These nonsynonymous substitutions end up fixed by genetic drift, due to the mode of transmission and the population
dynamics of B. aphidicola. This acceleration of evolutionary rates is particularly evident in B. aphidicola BCc,
presumably because factors promoting the accumulation of nonsynonymous substitutions are more intense in this strain.
One of those factors is the extreme reduction of the repair machinery, barely able to counterbalance the accumulation
of slightly deleterious mutations. In addition, there is a stronger effect of genetic drift that promotes the fixation
of slightly deleterious mutations, probably imposed by its coexistence within the aphid with a secondary symbiont,
Serratia symbiotica, and its larger size compared to other B. aphidicola lineages [43]. A closer look at the
particular genes that contribute to this acceleration observed in B. aphidicola BCc allows us to conclude that they
are distributed among different functional categories, with none of them accumulating significant differences in the
proportion of accelerated genes (as seen in Figure 3 and Table 2). This fact reveals that the process of gene
degradation acts on any type of gene independently of its functional role. However, our results indicate that even
if the accelerated genes are scattered homogeneously across all the functional categories in all B. aphidicola
strains, genes of some functional categories, such as cell envelope, are significantly more accelerated within all the
lineages. That points to the ongoing action of selective constraints affecting nonsynonymous substitution rates.

Regarding synonymous substitutions, when pairs of strains of B. aphidicola were compared based on the average
number of synonymous substitutions per site (dS), a greater accumulation was observed in the B. aphidicola BBp strain
compared to bacteria from aphids of the subfamily Aphidinae (B. aphidicola BAp and BSg), while the smallest value
is found between the B. aphidicola BAp and BSg strains [15]. However, if the time factor is considered, the
rates of synonymous nucleotide substitutions per site per million years are greater in the endosymbionts from the
Aphidinae (B. aphidicola BAp and BSg strains), with the B. aphidicola BCc strain registering the smallest values.
These results demonstrate that the synonymous substitution rate in B. aphidicola is a variable character, yet the
explanation for

[Figure 3 bar chart: percentage of accelerated genes (y-axis, 0–100%) for each functional category, (1) information
storage and processing, (2) protein processing, folding, and secretion, (3) cellular processes, (4) metabolism,
(5) cell envelope, (6) poorly characterized, and globally (x-axis). Legend: BAp > BSg, BSg > BAp, BAp > BBp,
BBp > BAp, BAp > BCc, BCc > BAp, BSg > BBp, BBp > BSg, BSg > BCc, BCc > BSg, BBp > BCc, BCc > BBp.]

Figure 3: Relative distribution of the accelerated genes based on their functional category, between pairs of B. aphidicola strains. Accelerated
genes were calculated by Tajima’s relative rate tests. A > B indicates a significantly higher accumulation of substitutions in strain A than in
strain B (P < .05).
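
For reference, Tajima’s relative rate test in its simplest form compares, over an alignment of two ingroup sequences and an out-group, the number of sites at which only sequence 1 differs (m1) with the number at which only sequence 2 differs (m2). Under equal rates, (m1 − m2)²/(m1 + m2) is approximately χ²-distributed with one degree of freedom. The sketch below is a plain-Python illustration with made-up toy sequences; it is not the MEGA4 implementation used in the study, and skipping gapped sites is an assumption.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup, gap="-"):
    """Return (m1, m2, chi2, p) for three aligned sequences of equal length."""
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, c in zip(seq1, seq2, outgroup):
        if gap in (a, b, c):
            continue  # assumption: sites containing a gap are ignored
        if a != b and b == c:
            m1 += 1  # only sequence 1 differs at this site
        elif a != b and a == c:
            m2 += 1  # only sequence 2 differs at this site
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of the chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p


# Toy alignment (hypothetical): sequence 1 carries six unique changes, sequence 2 one.
toy_out = "AAAAAAAAAA"
toy_seq1 = "CCCCCCAAAA"
toy_seq2 = "AAAAAAAAAC"
print(tajima_relative_rate(toy_seq1, toy_seq2, toy_out))
```

In the notation of Figure 3, a comparison would be scored as A > B when m1 exceeds m2 and the resulting P value falls below .05.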

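[Figure 4 tree diagrams, with E. coli (Eco) as out-group: (a) B. aphidicola BCc basal, leaf order BAp, BSg, BBp, BCc,
Eco; (b) B. aphidicola BBp basal, leaf order BAp, BSg, BCc, BBp, Eco; (c) B. aphidicola BBp and BCc clustered, leaf
order BAp, BSg, BBp, BCc, Eco.]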

Figure 4: Topologies of the phylogenetic trees for the 21 genes that follow the hypothesis of molecular clock [15]. The trees were obtained
by maximum likelihood, with the program PAUP 4.0b10.

these divergent patterns is not obvious. As stated elsewhere [7, 14, 44], these differences can be attributed to
differences in the host’s life cycle, as well as ecological factors such as host alternation and variations in the
effective population size shown by the two members of the Aphidinae subfamily compared to the other two aphid
lineages. Additionally, a differential mutation rate per generation cannot be ruled out. For example, endosymbionts
from aphids with short generation times can accumulate more synonymous mutations per million years (case of the
Aphidinae) than those
with longer generation times, such as the Eriosomatinae and the Lachninae. Future studies are required to understand
the evolutionary processes driving these patterns.

[Figure 5 tree: BAp and BSg (Aphidinae) form one clade (support 100/100/1); BBp (Eriosomatinae) and BCc (Lachninae)
form a second clade (support 100/100/1); Eco is the out-group. Scale bar: 0.1.]

Figure 5: Phylogenetic tree obtained by maximum likelihood using PAUP4.0b10 on nucleotide sequences and the GTR+I+G
evolutionary model. Topologies obtained from amino acid sequences, using Phyml v2.4.5 and MrBayes v3.1.2, are
identical. Trees are based on the concatenated sequence of the 338 protein-coding genes shared by the four
B. aphidicola strains and E. coli. Numbers beside the internal nodes are the maximum likelihood bootstrap values
from 300 resamplings obtained with PAUP4.0b10, Phyml and the Bayesian MCMC posterior probability, respectively. The
scale bar represents the number of nucleotide substitutions per site.

Acknowledgments

Financial support was provided by Grant BFU2009-12895-C02-01/BMC (Ministerio de Educación y Ciencia, Spain) to
A. Latorre and European Community’s Seventh Framework Programme (FP7/2007–2013) under Grant Agreement number 212894
and Prometeo/2009/092 (Conselleria d’Educació, Generalitat Valenciana, Spain) to A. Moya.

References

[6] J. J. Wernegreen, A. O. Richardson, and N. A. Moran, “Parallel acceleration of evolutionary rates in symbiont
genes underlying host nutrition,” Molecular Phylogenetics and Evolution, vol. 19, no. 3, pp. 479–485, 2001.
[7] T. Itoh, W. Martin, and M. Nei, “Acceleration of genomic evolution caused by enhanced mutation rate in
endocellular symbionts,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99,
no. 20, pp. 12944–12948, 2002.
[8] J. J. Wernegreen, “Genome evolution in bacterial endosymbionts of insects,” Nature Reviews Genetics, vol. 3,
no. 11, pp. 850–861, 2002.
[9] A. Mira and N. A. Moran, “Estimating population size and transmission bottlenecks in maternally transmitted
endosymbiotic bacteria,” Microbial Ecology, vol. 44, no. 2, pp. 137–143, 2002.
[10] The International Aphid Genomics Consortium, “Genome sequence of the pea aphid Acyrthosiphon pisum,” PLoS Biology,
vol. 8, no. 2, Article ID e1000313, 2010.
[11] O. E. Heie, “Palaeontology and phylogeny,” in Aphids: Their Biology, Natural Enemies and Control, A. K. Minks and
P. Harrewijn, Eds., vol. 2A, pp. 367–391, Elsevier, Amsterdam, The Netherlands, 1987.
[12] E. Zuckerkandl and L. Pauling, “Molecular disease, evolution, and genetic heterogeneity,” in Horizons in
Biochemistry, M. Kasha and B. Pullman, Eds., pp. 189–225, Academic Press, New York, NY, USA, 1962.
[13] E. Zuckerkandl and L. Pauling, “Evolutionary divergence and convergence in proteins,” in Evolving Genes and
Proteins, V. Bryson and H. J. Vogel, Eds., pp. 97–166, Academic Press, New York, NY, USA, 1965.
[14] M. A. Clark, N. A. Moran, and P. Baumann, “Sequence evolution in bacterial endosymbionts having extreme base
compositions,” Molecular Biology and Evolution, vol. 16, no. 11, pp. 1586–1598, 1999.
[15] V. Pérez-Brocal, R. Gil, S. Ramos et al., “A small microbial genome: the end of a long symbiotic relationship?”
Science, vol. 314, no. 5797, pp. 312–313, 2006.
[16] W. Wojciechowski, Studies on the Systematic System of Aphids (Homoptera, Aphidinea), Uniwersytet Slaski, Katowice,
Poland, 1992.
[17] D. Martinez-Torres, C. Buades, A. Latorre, and A. Moya, “Molecular systematics of aphids and their primary
endosymbionts,” Molecular Phylogenetics and Evolution, vol. 20, no. 3,
[1] J. Sandstrom and J. Pettersson, “Amino acid composition of pp. 437–449, 2001.
phloem sap and the relation to intraspecific variation in pea [18] B. Ortiz-Rivas, A. Moya, and D. Martı́nez-Torres, “Molecular
aphid (Acyrthosiphon pisum) performance,” Journal of Insect systematics of aphids (Homoptera: Aphididae): new insights
Physiology, vol. 40, no. 11, pp. 947–955, 1994. from the long-wavelength opsin gene,” Molecular Phylogenetics
[2] N. A. Moran, M. A. Munson, P. Baumann, and H. Ishikawa, “A and Evolution, vol. 30, no. 1, pp. 24–37, 2004.
molecular clock in endosymbiotic bacteria is calibrated using [19] B. Ortiz-Rivas and D. Martı́nez-Torres, “Combination of
the insect hosts,” Proceedings of the Royal Society B, vol. 253, molecular data support the existence of three main lineages in
no. 1337, pp. 167–171, 1993. the phylogeny of aphids (Hemiptera: Aphididae) and the basal
[3] C. D. von Dohlen and N. A. Moran, “Molecular data support position of the subfamily Lachninae,” Molecular Phylogenetics
a rapid radiation of aphids in the Cretaceous and multiple and Evolution, vol. 55, no. 1, pp. 305–317, 2010.
origins of host alternation,” Biological Journal of the Linnean [20] S. Shigenobu, H. Watanabe, M. Hattori, Y. Sakaki, and H.
Society, vol. 71, no. 4, pp. 689–717, 2000. Ishikawa, “Genome sequence of the endocellular bacterial
[4] J. J. Wernegreen and N. A. Moran, “Evidence for genetic symbiont of aphids Buchnera sp. APS,” Nature, vol. 407, no.
drift in endosymbionts (Buchnera): analyses of protein-coding 6800, pp. 81–86, 2000.
genes,” Molecular Biology and Evolution, vol. 16, no. 1, pp. 83– [21] I. Tamas, L. Klasson, B. Canbäck et al., “50 million years of
97, 1999. genomic stasis in endosymbiotic bacteria,” Science, vol. 296,
[5] L. Klasson and S. G. E. Andersson, “Evolution of minimal- no. 5577, pp. 2376–2379, 2002.
gene-sets in host-dependent bacteria,” Trends in Microbiology, [22] R. C. H. J. van Ham, J. Kamerbeek, C. Palacios et al., “Reduc-
vol. 12, no. 1, pp. 37–43, 2004. tive genome evolution in Buchnera aphidicola,” Proceedings
International Journal of Evolutionary Biology 9

of the National Academy of Sciences of the United States of [41] O. E. Heie, “Aphid ecology in the past and a new view on
America, vol. 100, no. 2, pp. 581–586, 2003. the evolution of Macrosiphini,” in Individuals, Populations and
[23] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: Patterns in Ecology, S. R. Leather, A. D. Watt, N. J. Mills, and K.
Molecular Evolutionary Genetics Analysis (MEGA) software F. A. Walters, Eds., pp. 409–418, Intercept, Andover, UK, 1994.
version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, [42] O. E. Heie, “The evolutionary history of aphids and a hypoth-
pp. 1596–1599, 2007. esis on the coevolution of aphids and plants,” Bollettino di
[24] R. C. Edgar, “MUSCLE: multiple sequence alignment with Zoologia Agraria e di Bachicoltura, vol. 28, pp. 149–155, 1996.
high accuracy and high throughput,” Nucleic Acids Research, [43] L. Gómez-Valero, A. Latorre, and F. J. Silva, “The evolutionary
vol. 32, no. 5, pp. 1792–1797, 2004. fate of nonfunctional DNA in the bacterial endosymbiont
[25] F. Tajima, “Simple methods for testing the molecular evolu- Buchnera aphidicola,” Molecular Biology and Evolution, vol.
tionary clock hypothesis,” Genetics, vol. 135, no. 2, pp. 599– 21, no. 11, pp. 2172–2181, 2004.
607, 1993. [44] H. Ochman, S. Elwyn, and N. A. Moran, “Calibrating bacterial
[26] G. Deckert, P. V. Warren, T. Gaasterland et al., “The complete evolution,” Proceedings of the National Academy of Sciences of
genome of the hyperthermophilic bacterium Aquifex aeolicus,” the United States of America, vol. 96, no. 22, pp. 12638–12643,
Nature, vol. 392, no. 6674, pp. 353–358, 1998. 1999.
[27] R. Gil, F. J. Silva, E. Zientz et al., “The genome sequence
of Blochmannia floridanus: comparative analysis of reduced
genomes,” Proceedings of the National Academy of Sciences of
the United States of America, vol. 100, no. 16, pp. 9388–9393,
2003.
[28] N. A. Moran, “Accelerated evolution and Muller’s rachet in
endosymbiotic bacteria,” Proceedings of the National Academy
of Sciences of the United States of America, vol. 93, no. 7, pp.
2873–2878, 1996.
[29] X. Xia and Z. Xie, “DAMBE: software package for data analysis
in molecular biology and evolution,” Journal of Heredity, vol.
92, no. 4, pp. 371–373, 2001.
[30] A. T. Marques, A. Antunes, P. A. Fernandes, and M. J.
Ramos, “Comparative evolutionary genomics of the HADH2
gene encoding Aβ-binding alcohol dehydrogenase/17β-
hydroxysteroid dehydrogenase type 10 (ABAD/HSD10),”
BMC Genomics, vol. 7, article 202, 2006.
[31] M. G. Fain and P. Houde, “Multilocus perspectives on the
monophyly and phylogeny of the order Charadriiformes
(Aves),” BMC Evolutionary Biology, vol. 7, article 35, 2007.
[32] M. Farfán, D. Miñana-Galbis, M. C. Fusté, and J. G. Lorén,
“Divergent evolution and purifying selection of the flaA gene
sequences in Aeromonas,” Biology Direct, vol. 4, article 23,
2009.
[33] M. Daly, L. C. Gusmão, A. J. Reft, and E. Rodrı́guez, “Phy-
logenetic signal in mitochondrial and nuclear markers in sea
anemones (cnidaria, Actiniaria),” Integrative and Comparative
Biology, vol. 50, no. 3, pp. 371–388, 2010.
[34] D. L. Swofford, PAUP∗. Phylogenetic analysis using parsimony
(∗and other methods). Version 4, Sinauer Associates, Sunder-
land, Mass, USA, 2002.
[35] S. Guindon and O. Gascuel, “A simple, fast, and accurate algo-
rithm to estimate large phylogenies by maximum likelihood,”
Systematic Biology, vol. 52, no. 5, pp. 696–704, 2003.
[36] D. Posada, “jModelTest: phylogenetic model averaging,”
Molecular Biology and Evolution, vol. 25, no. 7, pp. 1253–1256,
2008.
[37] F. Abascal, R. Zardoya, and D. Posada, “ProtTest: selection of
best-fit models of protein evolution,” Bioinformatics, vol. 21,
no. 9, pp. 2104–2105, 2005.
[38] J. P. Huelsenbeck and F. Ronquist, “MRBAYES: Bayesian
inference of phylogenetic trees,” Bioinformatics, vol. 17, no. 8,
pp. 754–755, 2001.
[39] R Development Core Team, R: A Language and Environment
for Statistical Computing, R Foundation for Statistical Com-
puting, Vienna, Austria, 2010, http://www.R-project.org.
[40] V. F. Eastop, “Biotypes of aphids,” in Perspectives in Applied
Biology, A. D. Lowe, Ed., vol. 51 of Bulletin of the Entomological
Society of New Zealand, pp. 40–51, 1973.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 781642, 10 pages
doi:10.4061/2011/781642

Research Article
Parallel Evolution and Horizontal Gene Transfer of the pst
Operon in Firmicutes from Oligotrophic Environments

Alejandra Moreno-Letelier,1,2 Gabriela Olmedo,2 Luis E. Eguiarte,1 Leon Martinez-Castilla,3 and Valeria Souza1

1 Departamento de Ecologia Evolutiva, Instituto de Ecologia, Universidad Nacional Autónoma de México, Apdo. Postal 70-275, Ciudad Universitaria, 04510 México D. F., Mexico
2 Departamento de Ingeniería Genética, CINVESTAV Campus Guanajuato, Apdo. Postal 629, 36500 Irapuato, Mexico
3 Departamento de Bioquimica, Facultad de Quimica, Universidad Nacional Autonoma de México, Apdo. Postal 70-275, Ciudad Universitaria, 04510 México D. F., Mexico

Correspondence should be addressed to Valeria Souza, souza@servidor.unam.mx

Received 22 October 2010; Accepted 22 December 2010

Academic Editor: Hiromi Nishida

Copyright © 2011 Alejandra Moreno-Letelier et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

The high affinity phosphate transport system (pst) is crucial for phosphate uptake in oligotrophic environments. Cuatro Cienegas Basin (CCB) has extremely low P levels and its endemic Bacillus are closely related to oligotrophic marine Firmicutes. Thus, we expected the pst operon of CCB Bacillus to share the evolutionary history of the pst operon of marine Firmicutes and to encode similar proteins. Orthologs of the pst operon were searched for in 55 genomes of Firmicutes and 13 outgroups. Phylogenetic reconstructions were performed for the pst operon and 14 concatenated housekeeping genes using maximum likelihood methods. Conserved domains and 3D structures of the phosphate-binding protein (PstS) were also analyzed. The pst operon of Firmicutes shows two highly divergent clades that correlate neither with the type of habitat nor with the species phylogeny, suggesting horizontal gene transfer. Despite sequence divergence, the PstS protein had a similar 3D structure, which could be due to parallel evolution after horizontal gene transfer events.

1. Introduction

Phosphorus is an essential nutrient for multiple processes such as the synthesis of DNA, RNA, ATP, and many other pathways involving phosphorylation [1]. However, it is not an abundant element on the planet and can only be obtained from organic detritus or from tectonics and volcanism [2, 3], so its availability is a limiting factor for all life forms. As growth rate and primary productivity are highly dependent on phosphorus [4–6], bacteria have different mechanisms for the uptake and storage of phosphates to cope with this limitation [1, 7–9].

Some of the genes involved in phosphorus metabolism belong to the pho regulon, which is induced under phosphorus starvation by a two-component regulatory system in several bacteria such as Escherichia coli, Bacillus subtilis, and Cyanobacteria [8, 10–13]. The pho regulon comprises about 20 genes that include phosphatases, phosphate transport systems, and other enzymes used to assimilate phosphorus from other sources such as phosphonates [8]. Even though the pho regulon is found in both Eubacteria and Archaea, the number and identity of the genes are highly variable and not always congruent with the 16S rRNA gene phylogeny of the organisms [11, 14]. It is also to be expected that the genes involved in phosphate uptake and metabolism would be under strong selection.

Among the genes of the pho regulon, the high affinity phosphate transport system (pst) is thought to be responsible for phosphate uptake under nutrient stress [8, 10]. Pst is a typical ABC transport system encoded by 4 to 6 genes in a single operon [10, 15–17]. As an ABC transporter, the pst operon belongs to one of the largest gene families and is
found in all Eubacteria and Archaea, and the level of sequence divergence indicates an ancient origin of each lineage of transporters [18, 19].

The genes of the pst operon are arranged in the following way: the pstS gene, coding for a periplasmic protein that binds phosphate with high affinity; pstC and pstA, coding for the two proteins proposed to form the inner membrane channel; pstB, coding for an ATPase that energizes the transport [18]. However, some variation exists in the number of genes in the operon. In Escherichia coli and Clostridium acetobutylicum, the gene phoU, coding for a repressor of the pho regulon, is also located in the operon [15, 17], while in B. subtilis and its close relatives there are no phoU orthologs. Also, the gene pstB is duplicated (pstBA and pstBB; [10]). The pst operon presents further variation in Cyanobacteria, where the genes pstS or pstB may be missing from the operon depending on the strain and environmental conditions [11], or additional pstS copies may be present although not associated with the operon [11, 20].

The pst phosphate uptake system is particularly crucial in oligotrophic environments such as the North Pacific, North Atlantic and the Eastern Mediterranean Sea [6, 21]. Metagenomic studies have shown that there are some functional adaptations for P uptake in such oligotrophic waters [7, 8, 20, 22]. Another example of an extreme oligotrophic environment is the Cuatro Cienegas Basin (CCB), which has very low levels of P in the ecosystem [4, 23, 24]. Phosphate concentrations range from 0.008 to 0.6 μM, in Pozas Azules and Rio Mezquites, respectively (E. Rebollar and F. García-Oliva pers. com.; [4]), but for most water systems P concentrations lie below the threshold concentration for the expression of the pho regulon in B. subtilis (0.1 mM; [10]).

CCB is an isolated oasis in the center of the Chihuahuan Desert, with water systems rich in microbial mats and stromatolites, and its microbiota exhibits ancestral marine affinities [9, 25–29]. Despite the extreme oligotrophy of the ecosystem, CCB has a high level of diversity and species endemism at both the macro- and microscopic levels [24, 25, 30–33]. We believe that this high rate of diversification is a consequence of the extreme oligotrophy of the ecosystem [24], where the lack of available P promotes both reproductive and geographic isolation by limiting replication and the frequency of genetic exchange [24, 34–36]. Moreover, two of the newly sequenced taxa, Bacillus coahuilensis and Bacillus sp. m3-13, have particular adaptations to low P environments. Unlike Escherichia coli or B. subtilis, CCB and marine Bacillus lack the low affinity phosphate uptake system, so they must rely on the high affinity transport system [9, 27].

There are some comparative studies of genes involved in phosphorus uptake in Cyanobacteria [8, 11, 20], but as far as we know, no studies exist in other bacterial groups. Hence, we believe that an analysis of phosphorus uptake in the Firmicutes from CCB in comparison to sequenced Firmicutes from different environments could help us understand the evolution of the high affinity phosphate transport system. Firmicutes is a cosmopolitan and ancient lineage [37], and its diversification happened during a time in the Earth’s history when P was very scarce [3, 5]. We expected the pst operon of the Firmicutes from CCB to have a marine affinity and to be related (both in sequence and structure) to the pst operons of other marine Firmicutes that live in oligotrophic waters.

In this study we analyzed for the first time the evolutionary relationships and gene architecture of the pst operons of 55 complete genomes of the main lineages of Firmicutes [38], with special emphasis on CCB and marine taxa, as well as the protein structure of PstS from a few Bacillus. To evaluate phylogenetic congruity between the phosphate uptake genes and housekeeping genes, which are expected to reflect vertical descent, we performed a phylogenetic reconstruction of the genes of the pst operon and of 14 proteins of the core genome of Firmicutes. We also compared the structure of the phosphate-binding protein PstS of Bacillus from oligotrophic and eutrophic environments, to evaluate any association of protein sequence and structure with the environment in which the members of Firmicutes live.

2. Materials and Methods

2.1. Phylogenetic Reconstructions. We used the amino acid sequence of the substrate-binding protein gene pstS of Bacillus coahuilensis and Bacillus subtilis [9, 39] to identify the orthologs of the pst operon in the draft and complete genomes of the main lineages of Firmicutes (for accession numbers see Table S1 of Supplementary Material available online at doi: 10.4061/2011/781642). Searches were performed using psi-Blast, and the sequences identified with at least 30% identity over at least 70% of the length and an e-value < 10^-35 were considered orthologs [27]. As the pstS gene can be duplicated in some genomes, the operon structure was manually checked in all cases, and only genes with the highest bit scores and lowest e-values were considered in cases of multiple hits in the same genome. For the cases with multiple hits of the entire operon, all those extra copies of the operon were also included in the analysis. We also included 11 sequences of the pst operon of non-Firmicutes that had Blast scores better than our threshold. As outgroups, we included 2 genomes of non-Firmicutes, Thermotoga maritima (Thermotogae) and Pelobacter carbinolicus (δ-Proteobacteria), which gave the next best scores below our threshold.

Due to the high sequence variation of the pstS gene and the different number of copies of the other genes in the operon, the reconstruction was done with genes pstC, pstA, and pstB in a concatenated matrix. The phoU gene was excluded because it was missing in several Bacillus species. When both pstBA and pstBB were present, pstBB was used in the analysis as it is the ortholog of pstB; pstBA was not included. The pstBB gene was identified on the basis of genomic context, as it is the second gene coding for an ATP-binding protein in the operon. We compared the topology of the pst operon phylogeny with a phylogeny reconstructed from 14 concatenated amino acid sequences from genes of the core genome of Firmicutes. Those 14 genes were chosen from a set of genes already identified by Maughan [38] and Alcaraz et al. ([27]; GI from B. subtilis: 2632976, 2632269, 2632399, 2634021, 2636597, 16079910, 50812244, 50812227,
16079600, 16077523, 16080084, 16077081, 16077661, and 2635239). Four Cyanobacteria, Chloroflexus aurantiacus and Thermotoga maritima were used as outgroups (list of strain names and genome accession numbers in Table S1 of supplementary material). To establish a temporal frame of events, we dated the 14 gene phylogeny using a penalized likelihood method implemented by r8s [40]. The calibration of the tree was done using dates of geological events: the divergence of aerobic firmicutes was fixed at 2300 million years ago, a conservative date for the Great Oxidation Event [3, 37]. The divergence of CCB Firmicutes and their closest relatives was constrained to have a minimum age of 35 million years, which corresponds to the uplift of the Sierra Madre Oriental that finally isolated CCB from the Gulf of Mexico [41].

All reconstructions were done using amino acid sequences aligned using MUSCLE [42] and a Maximum Likelihood approach implemented by Raxml v.7.0.4 [43] (CIPRES portal: http://www.phylo.org/), with a LG substitution model chosen using ProtTest v 2.1 [44] with the Akaike Information Criterion, 4 substitution categories, and allowing Raxml to estimate the proportion of invariant sites and the gamma shape parameter. For both datasets 100 bootstrap replicates were performed.

2.2. PstS Protein Motifs and 3D Structure. The main conserved motifs of the substrate-binding protein PstS were detected using the MEME suite ([45]; http://meme.sdsc.edu/meme/) with the default parameters and an alignment of the PstS amino acid sequences from all the Firmicutes, but including in this alignment only one representative sequence of the Bacillus cereus group.

The 3D structure of the PstS protein was modeled based on the 3D structures of PstS from Yersinia pestis (PDB ID: 2z22) and PstS-1 from Mycobacterium tuberculosis (PDB ID: 1pc3; [46]). Only the PstS from B. subtilis, B. coahuilensis, B. sp. m3-13, and B. sp. NRRL-14911 were modeled, using the web-based module of MODELLER with the default settings (http://modbase.compbio.ucsf.edu/ModWeb20-html/modweb.html; [47]). Comparisons of 3D models were performed with TOPOFIT. This method only takes into account the geometric attributes of the proteins and not the sequence similarity, so it is able to find structure homology in highly variable proteins [48]. The quality of the models was evaluated with the r.m.s.d. value (root mean squared deviation) and the z-score (a measure of the energy separation between two protein folds). Images were prepared with CHIMERA (http://www.cgl.ucsf.edu/chimera).

3. Results

Using a psi-Blast search we were able to find orthologs of the pst operon in all members of Firmicutes, several Cyanobacteria and Archaea as well as in some Bacteroidetes, Fusobacteria, Actinobacteria, and Planctomycetes. In the search we observed that the gene architecture of the operon showed variation within and outside Firmicutes (Figure 1). Several taxa had a duplication in tandem of the gene pstB, as was the case for Bacillus subtilis, Listeria monocytogenes, and Streptococcus pneumoniae. Cyanobacteria generally lacked the gene phoU in the same operon, and it was missing in the B. subtilis group and Clostridium tetani. Synechococcus sp. 7002 also lacked the pstB gene in the operon (Figure 1). Duplications of the pstS gene were found in the Bacillus cereus group, Exiguobacterium spp., Brevibacillus brevis, Bacillus sp. B14905, Enterococcus faecalis, Lactobacillus plantarum, and Geobacillus kaustophilus, but the entire operon was only duplicated in Streptococcus pneumoniae and Symbiobacterium thermophilum. In the B. cereus group, an incomplete copy (pstSCA, lacking pstB) of the pst operon was found, similar to that of B. subtilis; thus it was not used for the concatenated phylogeny, but only for the PstS phylogeny (see Figure S1 in the supplementary material). All Bacillus from CCB and most marine Bacillus had just one copy of the pstS gene (the exception being Bacillus sp. B14905).

A: pstS pstC pstA pstB phoU
B: pstS pstC pstA pstBA pstBB phoU
C: pstS pstC pstA pstBA pstBB phoU
D: pstS pstC pstA pstB phoU
E: pstS pstC pstA phoU

Figure 1: Gene architecture of the pst operon in different groups of bacteria. A: Bacillus cereus group, marine Bacillus and Clostridium; B: Listeria and Streptococcus; C: Bacillus subtilis group, Bacillus marisflavi, and Bacillus sp. CH108; D: Brevibacillus, Oceanobacillus, Desulfitobacterium, Acaryochloris; E: Synechococcus, Geobacillus kaustophilus, Sebaldella termitidis, B. cereus group. In C–E, the regulatory gene phoU is found elsewhere in the genome, not as part of the operon or entirely missing.

The phylogenetic reconstruction of the concatenated PstC, PstA, and PstB protein sequences showed two distinct and highly supported clades (Figure 2) that bear no relation to either the type of habitat or the phylogenetic relationships obtained with the amino acid sequences from housekeeping genes (Figure 3). Reconstructions made with each sequence independently yielded the same basic topology, with minor differences in branch length (data not shown), so the phylogenetic signal was present in all three genes. We named one of the clades “cereus-like”, which consists of the pstSCABU operon structure (operon architecture A, in Figure 1), and it includes all members of the B. cereus group, most of Bacillus and Staphylococcus, Exiguobacterium, an anaerobic soil firmicute Desulfitobacterium hafniense, and most noteworthy, several Cyanobacteria and Archaea (Figure 2). None of the members of that clade have the duplication of gene pstB, and only two taxa (Desulfitobacterium and Oceanobacillus) lack the gene phoU in the operon (for the operon structure of all taxa in the dataset see Table S1 of Supplementary Material).

The other highly supported clade was named “subtilis-like” (operon architecture C, in Figure 1) and it included the members of the B. subtilis group, a marine Bacillus, Bacillus marisflavi and its sister species Bacillus sp. CH108 from CCB, Listeria, Clostridium, some host-associated firmicutes, and
[Figure 2: phylogenetic tree; image not reproduced. Legend: ∗ after a strain name marks CCB isolates; tag colors distinguish CCB, other oligotrophic environments, marine, and eutrophic habitats; branch colors distinguish the subtilis-like and cereus-like operons. Scale bar: 0.4.]

Figure 2: Maximum likelihood phylogenetic reconstruction of the concatenated PstC, PstA, and PstB (PstBB) protein sequences encoded by the pst operon. Branch colors indicate the two divergent clades: subtilis-like and cereus-like. Tag colors indicate the type of habitat where each species is found. Bootstrap values above 70% are indicated with an asterisk. The phylogeny of the individual proteins has a very similar topology (data not shown).
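
The concatenated matrix behind a tree such as the one in Figure 2 can be sketched as follows. This is a minimal Python illustration and not the authors' pipeline: the function name concatenate_alignments, the toy sequences and taxon labels, and the choice to pad a taxon that lacks a gene with gap characters are all assumptions made here for illustration.

```python
from typing import Dict, List


def concatenate_alignments(
    gene_alignments: Dict[str, Dict[str, str]],
    gene_order: List[str],
    gap: str = "-",
) -> Dict[str, str]:
    """Join per-gene alignments into one matrix keyed by taxon.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    A taxon missing from a gene is padded with gaps (an assumption
    of this sketch, not a procedure stated in the paper).
    """
    taxa = sorted({t for g in gene_order for t in gene_alignments[g]})
    parts: Dict[str, List[str]] = {t: [] for t in taxa}
    for gene in gene_order:
        aln = gene_alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy alignments; real input would be the MUSCLE output per protein.
toy = {
    "PstC": {"taxon1": "MK-L", "taxon2": "MKAL", "taxon3": "MRAL"},
    "PstA": {"taxon1": "GTV", "taxon2": "GSV"},
    "PstB": {"taxon1": "AT", "taxon2": "AT", "taxon3": "GT"},
}
supermatrix = concatenate_alignments(toy, ["PstC", "PstA", "PstB"])
```

Keeping the gene order fixed means every column block in the matrix can be traced back to one protein, which is useful if per-gene trees are later compared with the concatenated tree.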
an Actinobacteria (Collinsella intestinalis; Figure 2). The gene architecture of the operon in the members of this clade is more variable. The members of the genus Bacillus have the gene pstB duplicated and lack the phoU gene in the operon or entirely (Figure 1). Listeria, Enterococcus and Streptococcus also have the pstB gene duplicated but the gene phoU is in the operon, and although the pst operon in Clostridium has an architecture similar to that of B. cereus, it is very different at sequence level, as seen from the fact that these two are located in different clades (Figure 2).

The high variation at the amino acid sequence level observed for PstS is common for substrate-binding proteins of ABC transporters [18]. In our case, PstS had a shape parameter of the Gamma distribution for site rates of 3.4745, while the PstC, PstA, and PstB proteins had a shape parameter of 1.0559 and the proteins used for the housekeeping gene phylogeny had a shape parameter of 0.7002 (Mega 4; [49]).

The pst operon was not monophyletic for marine Bacillus, even though marine Bacillus are mostly monophyletic, as determined from housekeeping genes (Figure 3) and from other reconstructions (Figure 3; [27]). The main incongruence observed in the tree obtained from the amino acid sequence of the proteins encoded in the pst operon is the position of the B. subtilis group in a clade with the Listeria and Streptococcus sequences, instead of grouping with the rest of Bacillus taxa (Figure 2). This contrasts with the housekeeping gene phylogeny, where B. subtilis and its close relatives are found well within the Bacillus clade (Figure 3). Also, the B. marisflavi and Bacillus sp. CH108 clade, which groups with other marine Bacillus in the housekeeping gene phylogeny (Figure 3), appears as a sister group of B. subtilis and close taxa in the pst operon reconstruction (Figure 2). Also, Bacillus sp. m3-13 from CCB appears within the B. subtilis clade in the housekeeping gene phylogeny, but is sister to Bacillus sp. SG-1 from the Gulf of Mexico in the pst phylogeny. Another main topological incongruence of the pst phylogeny compared to the one obtained with housekeeping genes is that the sister taxa Bacillus halodurans and Bacillus clausii are found in different clades: B. clausii is found in a clade with B. subtilis, B. marisflavi, and Bacillus sp. CH108, while B. halodurans forms a monophyletic clade with some marine Bacillus and, in turn, is sister to the clade of Desulfitobacterium hafniense, an anaerobic species that is found in a basal position in the Firmicutes clade obtained from housekeeping genes (Figure 3).

Regarding the motifs found in protein PstS (Figure 4), we observed a marked difference between the PstS of the cereus-group and that from the subtilis-group. Motifs 4 and 5 are located in the same region of the protein but are markedly different, while motif 3 is found in both sets of sequences, but in the subtilis-like PstS it had a lower e-value (Figure 4). Despite the marked difference at the sequence level, the 3D structures of the PstS proteins of B. subtilis, B. sp. m3-13, and B. sp. NRRL-14911 were similar in geometry-based alignments (low r.m.s.d. and high z-scores; Figures 5(a)–5(c)), with the exception of B. coahuilensis, which showed the worst fit values of the three comparisons (Figure 5). This could be a product of the initial 3D model based on a more distantly related PstS, because the PstS from B. coahuilensis also had poorly fitting r.m.s.d. and z-score values with the PstS of B. sp. NRRL-14911 despite having high sequence similarity (74% identity). Thus, the 3D model of the PstS of B. coahuilensis can still be improved.

Despite the structure similarities among the PstS of B. subtilis, B. sp. m3-13, and B. sp. NRRL-14911, the active site showed some striking differences in amino acid composition. B. subtilis and most of the taxa that grouped in the subtilis-like clade have an arginine as the first residue of the active site, just like Y. pestis and M. tuberculosis, while the firmicutes of the cereus-like clade have a proline in the same position (Figure 5(d)). Also, some members of the cereus-like clade, like B. sp. m3-13, B. halodurans, Desulfitobacterium spp., O. iheyensis, B. sp. SG-1, and the Staphylococcus group, also had a histidine in the second residue of the active site, while all the other taxa had a serine (Figure 5(d)). In view of these changes in amino acid composition, an additional codon-based Z-test of selection was made for the pstS gene with Mega 4 (Tables S2 and S3 of supplementary material; [49]). In all cases, dS (synonymous substitutions) was significantly higher than dN (non-synonymous substitutions), suggesting purifying selection.

4. Discussion

As expected, the pst operon was found in all Firmicutes. Less expected was the finding of two types of operons in these Gram positives. We describe them as subtilis-like and cereus-like operons, after the best known members of Bacillus. Even more interesting, these operons were not shared by descent in the monophyletic groups of Bacillus, nor were they related to the particular habitat of the strains. Both operons were very divergent from each other at the amino acid sequence level, suggesting independent parallel evolution.

The high divergence of the two types of pst operons in Firmicutes and their incongruence with the species phylogeny are most noteworthy. Contrary to what was expected, the pst operon of marine Bacillus is not monophyletic, even though marine and CCB Bacillus are resolved as a monophyletic group in the 14 housekeeping gene reconstruction. Therefore, we cannot argue for a common origin due to shared environmental conditions. The patchy distribution of either type of operon within the phylogeny of Firmicutes suggests horizontal gene transfer, especially considering closely related species with entirely different pst operons (B. clausii and B. halodurans, B. subtilis clade, and B. sp. m3-13; Figure 2). Another possible explanation of these divergent operons would be an ancient duplication. However, this hypothesis is difficult to test, since only very few taxa have both kinds of operons or at least partial copies. At least in the case of the B. cereus group, the subtilis-like pstS copy is more closely related to that of Clostridium, while the pstS of B. subtilis is closely related to Listeria, suggesting an independent acquisition (see Figure S1 in supplementary material). If this partial operon or any of the extra copies of pstS found in different organisms are functional and

[Figure 3 image: dated phylogenetic tree. Tips include Thermotoga maritima, Chloroflexus aurantiacus,
three Cyanobacteria, and Firmicutes strains; clades are labeled I to IV. Horizontal axis: million years
ago, from 3500 to 0. Branch-color legend: subtilis-like operon, cereus-like operon, both operons.
Tag-color legend: CCB, marine, eutrophic, other oligotrophic environments.]

Figure 3: Maximum likelihood phylogeny of Firmicutes based on the concatenated amino acid sequence of 14 housekeeping genes and
dated with a penalized likelihood method. The branch colors indicate the type of operon present in each taxon. Tag colors refer to the type
of habitat. Clade I corresponds to aerobic Firmicutes and clade II includes CCB and marine Bacillus. Clade I had a fixed age of 2300 my and
clades III and IV had a fixed minimum age of 35 my. Bootstrap values above 70% are denoted with an asterisk. Clades with branch lengths
of 0 were collapsed (D. hafniense DCB-2-D. hafniense Y51 and G. thermodenitrificans NG80-2-G.sp. G11MC16).

expressed, is still not known and would require experimental validation.
The relative conservation of gene architecture in Firmicutes as opposed to what is seen in
Cyanobacteria suggests fewer rearrangements due to phages or some sort of constraint for the
transcription and/or regulation of the pst operon that has kept the gene architecture fairly constant
since the divergence of Cyanobacteria and Firmicutes around

[Figure 4 image: MEME sequence logos for PstS motifs 1 to 5 (vertical axis: bits; horizontal axis:
residue position within each motif). Motif order along the protein: cereus-like pstS, motifs 1, 4, 2,
3; subtilis-like pstS, motifs 1, 5, 2, 3.]

Figure 4: Conserved motifs of the PstS protein for both cereus-like and subtilis-like operons. Red stars on top of residues on motif 2 indicate
the binding site of phosphate. Motifs 4 and 5 are aligned (black lines) to show homologous amino acid positions. The height of the blocks is
proportional to the e-value of each motif. Motif 3 for subtilis-like PstS has an e-value < 10^−10.

1 2 34
Yersinia pestis R ADG S G T
Mycobacterium tuberculosis R SDG S GD
Bacillus subtilis R PDS S GT
Bacillus sp. m3-13 P NEN H G T
Bacillus coahuilensis m4-4 P GTD S G T
Bacillus sp. NRRL-14911 P GTD S G T

(a) (b) (c) (d)

Figure 5: Comparison of 3D structures with TOPOFIT of PstS from B. subtilis (red), with (a) B. sp. NRRL-14911 (cyan; r.m.s.d. = 1.02,
z-score = 39.89), (b) Bacillus sp. m3-13 (green; r.m.s.d. = 1.53, z-score = 20.24) and (c) B. coahuilensis m4-4 (yellow; r.m.s.d. = 1.49, z-score
= 9.65). Residues involved in phosphate binding are highlighted in purple. (d) shows an alignment of the active site of PstS and the different
amino acids involved in phosphate binding are highlighted in gray.
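
The r.m.s.d. values quoted in the Figure 5 caption summarize how far apart matched atoms lie once two structures have been superimposed. The following is a minimal sketch of that quantity only; it is not the TOPOFIT algorithm, it assumes the atoms are already paired and superimposed, and the function name and toy coordinates are hypothetical rather than taken from any PstS structure.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    points that are assumed to be already paired and superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between one matched pair of atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical toy coordinates, for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]

print(rmsd(reference, reference))  # 0.0
print(rmsd(reference, shifted))    # 1.0
```

A lower value means the matched atoms sit closer together, which is why the caption reports a low r.m.s.d. together with a high z-score as evidence of structural similarity.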

3 billion years ago ([37]; this study). Even though we observed a constant operon architecture, we
also observed two taxa with both types of operon (S. thermophilum and S. pneumoniae), one early
divergent and the other more derived (Figure 3); however, the protein sequence of either of them is
very divergent from the other taxa.
Even if the general architecture of the operon suggests less recombination than in the Cyanobacteria
lineages, in Firmicutes several other taxa had more than one copy of the pstS gene or presented an
incomplete extra copy of the operon. A phylogenetic analysis of PstS (see Figure S1 in Supplementary
Material available online at doi:10.4061/2011/781642)

suggests that these extra copies of the pstS gene were acquired later by HGT rather than by
duplications, since different copies of the gene belonging to the same organism are found in different
places in the phylogeny. Even though the high sequence variation present in pstS at both nucleotide
and protein levels makes it hard to obtain a well-supported phylogenetic reconstruction, from our
analysis it is evident that the pstS gene is evolving at a faster rate than the rest of the genes of
the operon, where purifying selection maintains the folding of the protein instead of the amino acid
sequence [19]. Since we found no significant positive selection for pstS, the high level of sequence
divergence could be due to the accumulation of repeated mutations after the ancient split between the
cereus-like and subtilis-like operon [50].
Interestingly, despite the marked sequence divergence of the genes of the pst operon of Firmicutes and
particularly the pstS gene (Figure 4), the 3D structures of PstS from the subtilis-like and cereus-like
operons were surprisingly similar (Figures 5(a)–5(c)). However, it is still to be determined how the
changes of particular amino acids in the phosphate binding site (Figure 5(d)) would affect the
formation of the hydrogen bonds necessary for phosphate uptake [51]. In particular, it is possible
that the presence of a proline (P) instead of serine in the active site of the cereus-like PstS
(Figure 5(d)) could have some relevance in the discrimination between the mono- and dibasic forms of
phosphate, since this amino acid only acts as a hydrogen bond acceptor but not as a donor [46]. The
difference of affinity to phosphate and the potential selectivity for either of the phosphate species
should be investigated experimentally, as the protein backbone is also involved in the hydrogen bond
formation and not only the side chains of the residues [46, 51]. The amino acid changes in the active
site of the PstS protein can have an effect on the efficiency of the acceptor under different pH
conditions [46, 51], and maybe some of those substitutions could be related to habitat. Nevertheless,
this idea is not sustained within the cereus-like clade, where it can be observed that the lineage
with B. sp. m3-13 (Figure 2), the Staphylococcus group, and the clade with Anoxybacillus flavithermus
as well as two species of Geobacillus have a histidine instead of a serine in the second residue of
the active site (Figure 5(d)), and all those taxa actually live in environments with a wide range of
pH conditions, as is the case in the rest of Firmicutes.
It has been previously noted that protein structure is fairly conserved in nature, and that proteins
with only 8% similarity at sequence level can have a much higher similarity in their structural
features [52, 53]. In our case, some of the PstS proteins had a sequence similarity as low as 17% when
comparing those from the subtilis- or cereus-like operons, yet their structure was fairly conserved
(Figures 5(a)–5(c)). This could be due to natural selection acting on protein structure, thus allowing
for changes in amino acids that would not alter the basic features of the protein [54]. The fact that
these similar protein structures occur in lineages that have such deep divergences as Cyanobacteria
and Firmicutes (ca. 3 billion years ago) favors the idea of parallel evolution from a common ancestor
[37, 54, 55]. This very ancient divergence produced one pst operon mostly found in anaerobic
Firmicutes and some pathogenic groups (Listeria), while a pst operon similar to that of Cyanobacteria
is found mostly in oligotrophic Firmicutes and in some other pathogenic groups (Staphylococcus), with
various cases of ancient HGT between either group (i.e., B. halodurans, B. clausii, and D. hafniense).
Therefore, we should reconsider the environmental constraint hypothesis: Bacillus sp. CH108 from the
Churince water system at present shares similar environmental conditions with other Bacillus from CCB,
but has a subtilis-like pst operon instead of the cereus-like operon that is common to other CCB
species. The sister species of B. sp. CH108, B. marisflavi from the Yellow Sea of Korea [56], also has
a subtilis-like operon. This suggests that acquisition of the operon predates the divergence of the
two taxa (∼92 my ago; Figure 3), which in turn is older than the last time CCB was connected to the
ocean, ca. 45 my ago. This may lead to the idea that this particular arrangement is not specific to
the current oligotrophic conditions in Cuatro Cienegas [41, 57] but is an adaptation to an ancient sea.

5. Conclusions

The pst operon in Firmicutes showed a very high sequence divergence that is not correlated with the
phylogenetic relationships among taxa, the type of habitat, or the phosphorus availability where these
organisms currently live. Thus, it is likely that the current distribution of the pst operon was
determined by a very early divergence and repeated events of HGT of the phosphate transporter genes,
followed by parallel evolution that led to similar 3D structures. Unlike what was observed in
Cyanobacteria, most Firmicutes only have one or a couple of copies of the PstS protein, so it is
crucial for phosphate uptake that both function and affinity are conserved in the substrate binding
protein.

Acknowledgments

This project was funded by CONACYT-SEP Grant no. 57507 and CONACyT-Semarnat 2006-C01-23459 awarded to
V. Souza and a CINVESTAV Multidisciplinary awarded to G. Olmedo. V. Souza and L.E. Eguiarte worked on
this paper during a sabbatical leave supported by DGAPA and UC-Mexus, respectively. Thanks are due to
L. D. Alcaraz, E. Rebollar, F. García-Oliva, D. Ortega-Del Vecchyo, I. Hernandez, V. Lopez, G.
Moreno-Hagelsieb, B.S. Gaut, M. Tenaillon, O. Tenaillon, A. Vázquez-Lobo, and D. Piñero for all their
insights and comments.

References
[1] S. G. Tetu, B. Brahamsha, D. A. Johnson et al., “Microarray analysis of phosphate regulation in
the marine cyanobacterium Synechococcus sp. WH8102,” ISME Journal, vol. 3, no. 7, pp. 835–849, 2009.
[2] P. G. Falkowski, T. Fenchel, and E. F. Delong, “The microbial engines that drive earth’s
biogeochemical cycles,” Science, vol. 320, no. 5879, pp. 1034–1039, 2008.
[3] D. Papineau, “Global biogeochemical changes at both ends of the Proterozoic: insights from
phosphorites,” Astrobiology, vol. 10, no. 2, pp. 165–181, 2010.

[4] J. J. Elser, J. Watts, J. H. Schampel, and J. Farmer, “Early Cambrian food webs on a trophic
knife-edge? A hypothesis and preliminary data from a modern stromatolite-based ecosystem,” Ecology
Letters, vol. 9, no. 3, pp. 295–303, 2006.
[5] J. J. Elser and A. Hamilton, “Stoichiometry and the new biology: the future is now,” PLoS Biology,
vol. 5, no. 7, article e181, 2007.
[6] M. V. Zubkov, I. Mary, E. M. S. Woodward et al., “Microbial control of phosphate in the
nutrient-depleted North Atlantic subtropical gyre,” Environmental Microbiology, vol. 9, no. 8, pp.
2079–2089, 2007.
[7] D. B. Rusch, A. L. Halpern, G. Sutton et al., “The Sorcerer II Global Ocean Sampling expedition:
northwest Atlantic through eastern tropical Pacific,” PLoS Biology, vol. 5, no. 3, article e77, 2007.
[8] M. M. Adams, M. R. Gómez-García, A. R. Grossman, and D. Bhaya, “Phosphorus deprivation responses
and phosphonate utilization in a thermophilic Synechococcus sp. from microbial mats,” Journal of
Bacteriology, vol. 190, no. 24, pp. 8171–8184, 2008.
[9] L. D. Alcaraz, G. Olmedo, G. Bonilla et al., “The genome of Bacillus coahuilensis reveals
adaptations essential for survival in the relic of an ancient marine environment,” Proceedings of the
National Academy of Sciences of the United States of America, vol. 105, no. 15, pp. 5803–5808, 2008.
[10] Y. Qi, Y. Kobayashi, and F. M. Hulett, “The pst operon of Bacillus subtilis has a
phosphate-regulated promoter and is involved in phosphate transport but not in regulation of the Pho
regulon,” Journal of Bacteriology, vol. 179, no. 8, pp. 2534–2539, 1997.
[11] A. C. Martiny, M. L. Coleman, and S. W. Chisholm, “Phosphate acquisition genes in Prochlorococcus
ecotypes: evidence for genome-wide adaptation,” Proceedings of the National Academy of Sciences of the
United States of America, vol. 103, no. 33, pp. 12552–12557, 2006.
[12] M. Aguena and B. Spira, “Transcriptional processing of the pst operon of Escherichia coli,”
Current Microbiology, vol. 58, no. 3, pp. 264–267, 2009.
[13] A. C. Martiny, A. P. K. Tai, D. Veneziano, F. Primeau, and S. W. Chisholm, “Taxonomic resolution,
ecotypes and the biogeography of Prochlorococcus,” Environmental Microbiology, vol. 11, no. 4, pp.
823–832, 2009.
[14] M. Sebastian and J. W. Ammerman, “The alkaline phosphatase PhoX is more widely distributed in
marine bacteria than the classical PhoA,” ISME Journal, vol. 3, no. 5, pp. 563–572, 2009.
[15] M. Aguena, E. Yagil, and B. Spira, “Transcriptional analysis of the pst operon of Escherichia
coli,” Molecular Genetics and Genomics, vol. 268, no. 4, pp. 518–524, 2002.
[16] N. E. E. Allenby, N. O’Connor, Z. Prágai et al., “Post-transcriptional regulation of the Bacillus
subtilis pst operon encoding a phosphate-specific ABC transporter,” Microbiology, vol. 150, no. 8, pp.
2619–2628, 2004.
[17] R. J. Fischer, S. Oehmcke, U. Meyer et al., “Transcription of the pst operon of Clostridium
acetobutylicum is dependent on phosphate concentration and pH,” Journal of Bacteriology, vol. 188, no.
15, pp. 5469–5478, 2006.
[18] K. Tomii and M. Kanehisa, “A comparative analysis of ABC transporters in complete microbial
genomes,” Genome Research, vol. 8, no. 10, pp. 1048–1059, 1998.
[19] A. L. Davidson, E. Dassa, C. Orelle, and J. Chen, “Structure, function, and evolution of
bacterial ATP-binding cassette systems,” Microbiology and Molecular Biology Reviews, vol. 72, no. 2,
pp. 317–364, 2008.
[20] A. C. Martiny, Y. Huang, and W. Li, “Occurrence of phosphate acquisition genes in Prochlorococcus
cells from different ocean regions,” Environmental Microbiology, vol. 11, no. 6, pp. 1340–1347, 2009.
[21] R. Feingersch, M. T. Suzuki, M. Shmoish et al., “Microbial community genomics in eastern
Mediterranean Sea surface waters,” ISME Journal, vol. 4, no. 1, pp. 78–87, 2010.
[22] B. A. S. Van Mooy, H. F. Fredricks, B. E. Pedler et al., “Phytoplankton in the ocean use
non-phosphorus lipids in response to phosphorus scarcity,” Nature, vol. 458, no. 7234, pp. 69–72, 2009.
[23] J. J. Elser, J. H. Schampel, F. Garcia-Pichel et al., “Effects of phosphorus enrichment and
grazing snails on modern stromatolitic microbial communities,” Freshwater Biology, vol. 50, no. 11,
pp. 1808–1825, 2005.
[24] V. Souza, L. E. Eguiarte, J. Siefert, and J. J. Elser, “Microbial endemism: does phosphorus
limitation enhance speciation?” Nature Reviews Microbiology, vol. 6, no. 7, pp. 559–564, 2008.
[25] V. Souza, L. Espinosa-Asuar, A. E. Escalante et al., “An endangered oasis of aquatic microbial
biodiversity in the Chihuahuan desert,” Proceedings of the National Academy of Sciences of the United
States of America, vol. 103, no. 17, pp. 6565–6570, 2006.
[26] C. Desnues, B. Rodriguez-Brito, S. Rayhawk et al., “Biodiversity and biogeography of phages in
modern stromatolites and thrombolites,” Nature, vol. 452, no. 7185, pp. 340–343, 2008.
[27] L. D. Alcaraz, G. Moreno-Hagelsieb, L. E. Eguiarte, V. Souza, L. Herrera-Estrella, and G. Olmedo,
“Understanding the evolutionary relationships and major traits of Bacillus through comparative
genomics,” BMC Genomics, vol. 11, no. 1, article no. 332, 2010.
[28] M. Breitbart, A. Hoare, A. Nitti et al., “Metagenomic and stable isotopic analyses of modern
freshwater microbialites in Cuatro Ciénegas, Mexico,” Environmental Microbiology, vol. 11, no. 1, pp.
16–34, 2009.
[29] B. M. Winsborough, E. Theriot, and D. B. Czarnecki, Diatoms on a Continental Island: Lazarus
Species, Marine Disjuncts and other Endemic Diatoms of the Cuatro Ciénegas basin, Coahuila, México,
University of Texas, Austin Tex, USA, 2008.
[30] W. L. Minkley, “Cuatro Cienegas fishes: research reviewed and a local test of diversity versus
habitat size,” Journal of the Arizona-Nevada Academy of Science, vol. 19, pp. 13–21, 1984.
[31] E. W. Carson and T. E. Dowling, “Influence of hydrogeographic history and hybridization on the
distribution of genetic variation in the pupfishes Cyprinodon atrorus and C. bifasciatus,” Molecular
Ecology, vol. 15, no. 3, pp. 667–679, 2006.
[32] S. G. Johnson, “Age, phylogeography and population structure of the microendemic banded spring
snail, Mexipyrgus churinceanus,” Molecular Ecology, vol. 14, no. 8, pp. 2299–2311, 2005.
[33] M. Tobler and E. W. Carson, “Environmental variation, hybridization, and phenotypic
diversification in Cuatro Ciénegas pupfishes,” Journal of Evolutionary Biology, vol. 23, no. 7, pp.
1475–1489, 2010.
[34] R. Cerritos, P. Vinuesa, L. E. Eguiarte et al., “Bacillus coahuilensis sp. nov., a moderately
halophilic species from a desiccation lagoon in the Cuatro Ciénegas Valley in Coahuila, Mexico,”
International Journal of Systematic and Evolutionary Microbiology, vol. 58, no. 4, pp. 919–923, 2008.
[35] R. Cerritos, L. E. Eguiarte, M. Avitia et al., “Diversity of culturable thermo-resistant aquatic
bacteria along an environmental gradient in Cuatro Ciénegas, Coahuila, Mexico,” Antonie Van
Leeuwenhoek, vol. 99, no. 2, pp. 303–318, 2010.

[36] A. E. Escalante, L. E. Eguiarte, L. Espinosa-Asuar, L. J. Forney, A. M. Noguez, and V. Souza
Saldivar, “Diversity of aquatic prokaryotic communities in the Cuatro Cienegas basin,” FEMS
Microbiology Ecology, vol. 65, no. 1, pp. 50–60, 2008.
[37] F. U. Battistuzzi, A. Feijao, and S. B. Hedges, “A genomic timescale of prokaryote evolution:
insights into the origin of methanogenesis, phototrophy, and the colonization of land,” BMC
Evolutionary Biology, vol. 4, article no. 44, 2004.
[38] H. Maughan, “Rates of molecular evolution in bacteria are relatively constant despite spore
dormancy,” Evolution, vol. 61, no. 2, pp. 280–288, 2007.
[39] F. Kunst, N. Ogasawara, I. Moszer et al., “The complete genome sequence of the gram-positive
bacterium Bacillus subtilis,” Nature, vol. 390, no. 6657, pp. 249–256, 1997.
[40] M. J. Sanderson, “Estimating absolute rates of molecular evolution and divergence times: a
penalized likelihood approach,” Molecular Biology and Evolution, vol. 19, no. 1, pp. 101–109, 2002.
[41] I. Ferrusquía-Villafranca, “Geología de México: una sinopsis,” in Diversidad biológica de México:
orígenes y distribución, T. P. Ramamoorthy et al., Ed., Instituto de Biología UNAM, México D.F., 1998.
[42] R. C. Edgar, “MUSCLE: a multiple sequence alignment method with reduced time and space
complexity,” BMC Bioinformatics, vol. 5, article no. 113, 2004.
[43] A. Stamatakis, P. Hoover, and J. Rougemont, “A rapid bootstrap algorithm for the RAxML web
servers,” Systematic Biology, vol. 57, no. 5, pp. 758–771, 2008.
[44] F. Abascal, R. Zardoya, and D. Posada, “ProtTest: selection of best-fit models of protein
evolution,” Bioinformatics, vol. 21, no. 9, pp. 2104–2105, 2005.
[45] T. L. Bailey and C. Elkan, “Fitting a mixture model by expectation maximization to discover
motifs in biopolymers,” in Proceedings of the 2nd International Conference on Intelligent Systems for
Molecular Biology, vol. 2, pp. 28–36, 1994.
[46] M. Tanabe, O. Mirza, T. Bertrand et al., “Structures of OppA and PstS from Yersinia pestis
indicate variability of interactions with transmembrane domains,” Acta Crystallographica Section D,
vol. 63, no. 11, pp. 1185–1193, 2007.
[47] N. Eswar, B. Webb, M. A. Marti-Renom et al., “Comparative protein structure modeling using
Modeller,” Current Protocols in Bioinformatics, vol. 5, pp. 5.6.1–5.6.30, 2006.
[48] V. A. Ilyin, A. Abyzov, and C. M. Leslin, “Structural alignment of proteins by a novel TOPOFIT
method, as a superimposition of common volumes at a topomax point,” Protein Science, vol. 13, no. 7,
pp. 1865–1874, 2004.
[49] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: Molecular Evolutionary Genetics Analysis
(MEGA) software version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1596–1599, 2007.
[50] A. L. Hughes, “Looking for Darwin in all the wrong places: the misguided quest for positive
selection at the nucleotide sequence level,” Heredity, vol. 99, no. 4, pp. 364–373, 2007.
[51] H. Luecke and F. A. Quiocho, “High specificity of a phosphate transport protein determined by
hydrogen bonds,” Nature, vol. 347, no. 6291, pp. 402–406, 1990.
[52] E. V. Koonin, Y. I. Wolf, and G. P. Karev, “The structure of the protein universe and genome
evolution,” Nature, vol. 420, no. 6912, pp. 218–223, 2002.
[53] R. A. Goldstein, “The structure of protein evolution and the evolution of protein structure,”
Current Opinion in Structural Biology, vol. 18, no. 2, pp. 170–177, 2008.
[54] A. Sánchez-Flores, E. Pérez-Rueda, and L. Segovia, “Protein homology detection and fold inference
through multiple alignment entropy profiles,” Proteins: Structure, Function and Genetics, vol. 70, no.
1, pp. 248–256, 2008.
[55] R. Woods, D. Schneider, C. L. Winkworth, M. A. Riley, and R. E. Lenski, “Tests of parallel
molecular evolution in a long-term experiment with Escherichia coli,” Proceedings of the National
Academy of Sciences of the United States of America, vol. 103, no. 24, pp. 9107–9112, 2006.
[56] J. H. Yoon, I. G. Kim, K. H. Kang, T. K. Oh, and Y. H. Park, “Bacillus marisflavi sp. nov. and
Bacillus aquimaris sp. nov., isolated from sea water of a tidal flat of the Yellow Sea in Korea,”
International Journal of Systematic and Evolutionary Microbiology, vol. 53, no. 5, pp. 1297–1303, 2003.
[57] F. Vega, T. Nyborg, M. Perrilliat, M. Montellanos-Ballesteros, S. R. S. Cevallos-Ferriz, and S.
A. Quiroz-Barroso, Studies on Mexican Paleontology, Springer, Dordrecht, The Netherlands, 2006.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 970768, 11 pages
doi:10.4061/2011/970768

Research Article
Ectopic Gene Conversions in the Genome of Ten Hemiascomycete
Yeast Species

Robert T. Morris and Guy Drouin


Département de Biologie et Centre de Recherche Avancée en Génomique Environnementale, Université d’Ottawa, 30 Marie Curie,
Ottawa, ON, Canada K1N 6N5

Correspondence should be addressed to Guy Drouin, gdrouin@science.uottawa.ca

Received 21 July 2010; Revised 18 September 2010; Accepted 15 October 2010

Academic Editor: Hiromi Nishida

Copyright © 2011 R. T. Morris and G. Drouin. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

We characterized ectopic gene conversions in the genome of ten hemiascomycete yeast species. Of the ten species, three diverged
prior to the whole genome duplication (WGD) event present in the yeast lineage and seven diverged after it. We analyzed gene
conversions from three separate datasets: paralogs from the three pre-WGD species, paralogs from the seven post-WGD species,
and common ohnologs from the seven post-WGD species. Gene conversions have similar lengths and frequency and occur between
sequences having similar degrees of divergence, in paralogs from pre- and post-WGD species. However, the sequences of ohnologs
are both more divergent and less frequently converted than those of paralogs. This likely reflects the fact that ohnologs are more
often found on different chromosomes and are evolving under stronger selective pressures than paralogs. Our results also show
that ectopic gene conversions tend to occur more frequently between closely linked genes. They also suggest that the mechanisms
responsible for the loss of introns in S. cerevisiae are probably also involved in the gene 3′-end gene conversion bias observed
between the paralogs of this species.

1. Introduction

The repair of double-strand DNA breaks is a critical biological process which maintains genome
stability. The primary process whereby double-strand DNA breaks are repaired is via homologous
recombination; this process requires the use of a repair template gene which provides a copy of the
information lost at the double-strand DNA break. The repair template can either be an allele (allelic
recombination) or a paralog (ectopic recombination). An end product of the homologous recombination
pathway is the replacement of the broken part of the damaged gene by a homologous portion of the
repair template gene. The damaged gene is therefore converted by the template gene (reviewed in [1]).
The factors affecting, and the characteristics of, ectopic and allelic gene conversions have been the
focus of many studies, and sequence similarity has been shown to have a profound effect on gene
conversion propensity between paralogs. In Escherichia coli, a 2%–4% decrease in sequence similarity
between a damaged gene and its repair template can cause a 10- to 40-fold decrease in recombination
frequency [2, 3]. Similarly, in Saccharomyces cerevisiae, larger gene conversions are limited to more
similar sequences [4]. Chromosomally linked genes are converted more frequently than dispersed genes
in Drosophila and humans [5, 6]. In S. cerevisiae, increasing distance between paralogs located on the
same chromosome tends to decrease their conversion frequency [4, 7, 8]. In some genomes, different
regions of genes are converted at different rates. For example, in S. cerevisiae, gene conversions
between dispersed paralogs are more frequent at their 3′-ends [4]. This 3′-bias is likely the result
of gene conversion with incomplete cDNA molecules [9].
The availability of ten hemiascomycete genomes provides the opportunity to study ectopic gene
conversions within a clade with as much sequence divergence as the entire Chordate phylum [10]. The
evolution of several hemiascomycete species was affected by a whole genome duplication event (WGD)
which occurred some 150 million

years ago (MYA; [11–14]). The genomes of Kluyveromyces lactis, Debaryomyces hansenii, and Yarrowia lipolytica all diverged before the whole genome duplication event that occurred in the yeast lineage (pre-WGD species; [10]). The S. cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces bayanus, Saccharomyces kudriavzevii, Saccharomyces castellii, and Candida glabrata genomes all diverged after this whole genome duplication event (post-WGD species; [15–17]).
The advantage of separating these genomes into two groups is that we are able to perform two comparisons. The first compares the characteristics of ectopically converted ohnologs and paralogs between the post-WGD species. The post-WGD ohnologs are composed of the duplicated gene pairs that resulted from the whole genome duplication [11, 18]. The post-WGD paralogs data set is composed of the genes from multigene families containing at least three members in the genome of the seven post-WGD species but excluding all ohnologs (Figure 1). The second comparison involves the contrast of the characteristics of ectopically converted paralogs between pre- and post-WGD species. The pre-WGD paralogs data set is composed of the genes from multigene families containing at least three members in the genome of the three pre-WGD species.

Figure 1: Schematic representation of ohnologs and paralogs. Genes A and A′ represent ohnologs created by a genome duplication. These genes are therefore located on different chromosomes. Genes B and C represent paralogs created by tandem duplications of gene A. These genes are therefore on the same chromosome as gene A.

Previous studies have shown that many ohnologs are still found in yeast genomes because they provide a selective advantage [19, 20]. Ohnologs are maintained by selection either because they carry out a subset of the functions that were previously assumed by their preduplication ancestor (subfunctionalization), assume new functions (neofunctionalization), or provide increased gene product dosage. We therefore expect that most ectopic gene conversions between ohnolog genes will be deleterious and removed by selection. If so, ectopic gene conversions between ohnologs should be less frequent than those between paralogs. In addition, based on previous studies, we expect that gene conversion frequency should decrease as the distance between related genes increases (and be least frequent for genes situated on different chromosomes), that the length of gene conversion tracts should be positively correlated with sequence similarity, and that converted regions should be more frequent at the 3′-end of genes [4].

2. Materials and Methods

2.1. Genome Sequences. The S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, S. kudriavzevii, and S. castellii genome sequences were retrieved from the Saccharomyces Genome Database (SGD; ftp://genome-ftp.stanford.edu/pub/yeast/sequence/). The C. glabrata, K. lactis, D. hansenii, and Y. lipolytica genome sequences and distance files (*.ptt files) were retrieved from the NCBI ftp website (ftp://ftp.ncbi.nih.gov/).

2.2. Gene Family Data Sets. We used three different data sets of protein coding genes. To retrieve the post-WGD ohnologs from the seven post-WGD species, we used the 551 S. cerevisiae duplicated gene pairs (1102 ohnologs) identified by Byrne and Wolfe [21] as queries. Sequences from C. glabrata and S. castellii were retrieved using the Yeast Gene Order Browser (http://wolfe.gen.tcd.ie/ygob/), and those from the other 4 species were retrieved from the Saccharomyces Genome Database (ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal genomes/Multiple species align/other/fungalAlignCorrespondance.txt). Our data set of ohnologs in post-WGD species is therefore only composed of the ohnolog pairs also found in S. cerevisiae. We used this subset of ohnologs because the efficient detection of gene conversion events using the GENECONV method requires that at least three sequences be available [4]. To detect gene conversions in ohnologs, we therefore needed ohnologs from at least two species, and we used the ohnologs of S. cerevisiae to retrieve ohnolog pairs from the other 6 post-WGD species. Retrieving common ohnologs also allowed us to study gene conversions between similar genes in seven different genomes.
The post-WGD paralog data set was constructed using the BLASTCLUST program available at the NCBI FTP site. Gene families were defined as being composed of sequences having at least 60% amino acid identity over at least 50% of their length. If genes previously identified as ohnologs were grouped into paralog multigene families, then these genes were removed from the family to ensure that there was no redundancy between the ohnolog and paralog data sets (see Figure 1). The pre-WGD paralog data set was also constructed using the BLASTCLUST program, and gene families were also defined as being composed of sequences having at least 60% amino acid identity over at least 50% of their length.

2.3. Sequence Alignments and Gene Conversion Detection. ClustalW was used to align the protein sequences of the members of each multigene family [22]. DNA sequences were then fitted to the protein alignments using a PERL script.
Gene conversions were detected using the GENECONV method [23]. Redundant gene conversions within a multigene family were detected by examining the phylogenetic tree of each family and removed from the analysis [4]. If the same gene conversion was detected at the same location in the multigene family alignment in closely related descendents of a common ancestor then the most parsimonious

explanation is that the conversion event occurred within the common ancestor, therefore only one of the conversions detected in the set of descendents was retained for further analysis. To control for false positives, gene conversions between sequences having less than 80% maximum flanking similarity were removed from the analysis [24].

Table 1: Number of ohnologs and paralogs in the pre- and post-WGD genomes.

Genome            Number of ohnolog families   Number of paralog families
Post-WGD
S. cerevisiae     551 (2)                      30 (3–40)
S. paradoxus      436 (2)                      80 (3–68)
S. mikatae        412 (2)                      86 (3–37)
S. kudriavzevii   226 (2)                      13 (3–20)
S. bayanus        462 (2)                      75 (3–23)
C. glabrata       300 (2)                      16 (3–7)
S. castellii      398 (2)                      17 (3–10)
Pre-WGD
K. lactis         n.a.                         15 (3–9)
D. hansenii       n.a.                         43 (3–9)
Y. lipolytica     n.a.                         60 (3–26)
Notes. The range of multigene family sizes is provided in brackets. n.a.: not applicable.

2.4. Gene Conversion Characteristics. The gene conversion frequency for each species was calculated using two different methods. The first method calculates the conversion frequency as the number of conversions divided by the total number of gene comparisons between multigene family members. The second method calculates the frequency as the number of gene conversions divided by the total number of multigene family members. Intra- and interchromosomal gene conversion frequencies were calculated for the S. cerevisiae and C. glabrata ohnolog and paralog multigene families. In addition, intra- and interchromosomal conversion frequencies were calculated for the paralog multigene families of the K. lactis, D. hansenii, and Y. lipolytica genomes. These frequencies are calculated as the number of intra- (or inter-) chromosomal conversions divided by the total number of intra- (or inter-) chromosomal gene comparisons. The gene conversion length was obtained from the GENECONV output. The maximum similarity for the flanking 100 nucleotides was calculated for each converted gene pair using an in-house PERL script. The locations of the converted regions were calculated as the correlation between the positions of each conversion with respect to the length of the converted genes. A positive correlation indicates a bias towards the 3′-end of genes, and a negative correlation indicates a bias towards the 5′-end of genes. The distance between converted genes was calculated only for conversions detected within S. cerevisiae, C. glabrata, K. lactis, D. hansenii, and Y. lipolytica because position data for the other five species were not available. Data tabulation and analysis were done using Microsoft Excel (Microsoft, Redmond, WA, USA) and S-plus v 7.0 (Insightful, Seattle, WA, USA). The G-Power program was used to calculate the power of the ANOVA tests [25]. Power calculations for correlation tests were done using an online application (http://calculators.stat.ucla.edu/powercalc/correlation/) and SAS 9.1.3 (SAS Institute Inc., Cary, NC, USA).

2.5. Numbers of Substitutions per Site and Gene Ontology. The number of nonsynonymous substitutions per nonsynonymous site (Ka) and synonymous substitutions per synonymous site (Ks) and their ratio (Ka/Ks) were calculated for the protein coding regions (excluding the converted regions) of each pair of converted genes using the YN00 program from the PAML software [26, 27].
The processes in which the S. cerevisiae ohnologs and paralogs are involved were analyzed using the gene ontology annotations of the Saccharomyces Genome Database at http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.pl [28].

Figure 2: The distribution of the average number of paralog gene families (mean ± S.D.) within the seven postduplication genomes and three preduplication genomes is shown. Five outlier families including two families of size 63 and 68 from S. paradoxus, two families of size 32 and 40 from S. cerevisiae, and a single family of 38 genes from S. mikatae are not shown in the figure to improve the visual clarity of the data. (Axes: family size versus number of families; series: Post-WGD and Pre-WGD.)

3. Results

3.1. Ohnolog and Paralog Multigene Families. Ohnolog and paralog multigene families were analyzed to determine whether the number and size of these two types of families were different in different yeast genomes (Table 1, Figure 2). The genomes of the six post-WGD species from which we retrieved ohnolog pairs using the S. cerevisiae ohnologs contain an average of 372 ± 90 ohnolog pairs. The number of ohnolog pairs found in each of these six different genomes is not significantly different from this average when using a

Bonferroni-corrected α-value of 0.0083 (Wilcoxon rank sum test; [29]).
For post-WGD paralogs, only the S. mikatae genome has significantly more paralog families than average (45.28 ± 33.36; Wilcoxon rank sum test, P = 0.009) and only the S. kudriavzevii genome has significantly fewer paralog families than average (P = 0.009). The mean size of the paralog families (5.7 ± 5.2 genes/family) is similar in all post-WGD genomes except that of C. glabrata, which has significantly smaller paralog families than average (3.3 ± 0.99 genes/family, Wilcoxon rank sum test, P = 0.003).
For the pre-WGD paralogs, the numbers of paralog families in the three pre-WGD genomes are not significantly different from the population mean (39.33 ± 22.72; Wilcoxon rank sum test, P ≥ 0.27). The mean size of all paralog families in these three genomes (4.41 ± 2.59 paralogs per family) is similar to the mean family size of each pre-WGD genome (Wilcoxon rank sum test, P ≥ 0.09). Finally, there is no statistical difference between pre- and post-WGD species in either the number (Wilcoxon rank sum test, P = 0.83) or the mean size (Wilcoxon rank sum test, P = 0.17) of paralog families.

3.2. Organization of Gene Families. The organization of multigene families can be measured as the proportion of multigene family members located on the same chromosome. Since most paralogs originate from unequal crossover events, they are expected to be most often found on the same chromosome. In contrast, since ohnologs are remnants of ancient genome duplication events, they are expected to be most often found on different chromosomes. The higher percentage of paralogs found on the same chromosome is therefore consistent with the likely mode of origin of these two types of duplicated genes (Table 2). The percentage of paralogs found on the same chromosome is also similar between pre- and post-WGD genomes (Table 2).

Table 2: Percentage of gene comparisons between multigene family members located on the same chromosome.

Genome            Ohnologs          Paralogs
Post-WGD
S. cerevisiae     4.0% (22/551)     8.4% (163/1930)
C. glabrata       5.0% (15/300)     38.6% (29/75)
Pre-WGD
K. lactis         n.a.              21% (26/124)
D. hansenii       n.a.              31% (86/270)
Y. lipolytica     n.a.              18% (158/884)
Notes. The ratios in brackets are the number of gene comparisons between genes found on the same chromosome divided by the total number of gene comparisons. n.a.: not applicable.

3.3. Gene Conversion Frequency and Distance between Converted Genes. In post-WGD genomes, intrachromosomal gene conversions tend to occur more frequently than interchromosomal conversions. In the paralog families of S. cerevisiae and C. glabrata, genes located on the same chromosome are converted 2 to 10 times more frequently than genes found on different chromosomes (Table 3). Similarly, in the ohnolog families of S. cerevisiae, genes located on the same chromosome are converted 4 times more frequently than genes found on different chromosomes (Table 3). In contrast, there is an almost complete absence of gene conversions between the ohnologs found within the C. glabrata genome (Table 3).
In pre-WGD genomes, the paralogs found on the same chromosomes of K. lactis and Y. lipolytica are not converted more frequently than paralogs found on different chromosomes, but the D. hansenii paralogs found on the same chromosomes are converted roughly 3 times more frequently than those found on different chromosomes (Table 3).
The mean number (±S.D.) of conversions detected within the paralog gene families of the pre- (30 ± 16) and post-WGD (38 ± 33) genomes is not statistically different (t-test, P = 0.67; Table 4). Although the ohnolog families of post-WGD genomes contain only an average of 7 ± 5 conversions, this number is also not significantly different from the average number of conversions found in post-WGD paralog families (t-test, P = 0.06).
When considering gene conversion frequencies with respect to the total number of comparisons, gene conversions of post-WGD species are either equally frequent in paralog and ohnolog families (in the S. paradoxus, S. mikatae, and S. bayanus genomes) or significantly more frequent in paralog than in ohnolog families in the four other post-WGD genomes (t-test, P = 0.046; Table 4).
When considering gene conversion frequencies with respect to the total number of multigene family members, the mean conversion frequency for paralogs (19.03 ± 16.29%) is significantly larger than the frequency for ohnologs (0.74 ± 0.46%; Wilcoxon two sample test, P = 0.0006).
We believe that using gene conversion frequencies with respect to the total number of multigene family members is more appropriate to compare gene conversion frequencies between ohnologs and paralogs because it better reflects the much larger number of conversions found in paralogs when compared to ohnologs. For example, in the case of S. cerevisiae with 13 conversions between ohnologs and 110 conversions between paralogs (Table 4), the conversion frequency is 2.35% for ohnologs (13/551) and 5.71% for paralogs (110/1930) when frequencies are calculated with respect to the total number of comparisons. However, these frequencies do not take into account the fact that 1102 ohnolog sequences were compared (551 pairs) whereas only 212 paralog sequences (i.e., less than one fifth of the number of ohnolog sequences) were compared (for a total of 1930 pairwise comparisons) to obtain the 5.71% frequency of paralogs. In contrast, if one compares the frequencies calculated with respect to the number of genes, the frequency of conversions is 1.17% (13/1102) for ohnologs and 51.40% for paralogs (110/212). The large difference between the two ways of calculating frequencies is due to the fact that frequencies calculated with respect to the total number of comparisons have a much larger denominator, which biases the comparisons between ohnologs and paralogs. For example, for a family with 10 paralogous sequences, the number of pairwise comparisons

Table 3: Intra- and interchromosomal gene conversion frequencies for pre- and post-WGD genomes.

                  Ohnologs                                Paralogs
Genome            Intrachromosomal   Interchromosomal     Intrachromosomal   Interchromosomal
                  frequency          frequency            frequency          frequency
Post-WGD
S. cerevisiae     9.1% (2/22)        2.1% (11/529)        9.2% (15/163)      5.4% (95/1767)
C. glabrata       0% (0/15)          0.7% (2/285)         24.1% (7/29)       2.2% (1/46)
Pre-WGD
K. lactis         n.a.               n.a.                 11.5% (3/26)       12.2% (12/98)
D. hansenii       n.a.               n.a.                 36% (31/86)        11.4% (21/184)
Y. lipolytica     n.a.               n.a.                 1.9% (3/158)       2.9% (21/726)
Notes. Values in brackets indicate the ratio of the number of gene conversions divided by the number of gene comparisons. Data for S. paradoxus, S. mikatae,
S. kudriavzevii, S. bayanus, and S. castellii are not provided because position data was not available for the genes of these genomes. n.a.: not applicable.

Table 4: The number and frequency of gene conversions in ohnologs and paralogs.

                  Ohnologs                                                     Paralogs
Genomes           Number   Frequency (%) with    Frequency (%) with           Number   Frequency (%) with    Frequency (%) with
                           respect to total      respect to total number               respect to total      respect to total number
                           number of             of multigene family                   number of             of multigene family
                           comparisons           members                               comparisons           members
Post-WGD
S. cerevisiae     13       2.35                  1.17                         110      5.71                  51.40
S. paradoxus      7        1.60                  0.80                         44       1.54                  9.20
S. mikatae        6        1.45                  0.73                         26       1.50                  4.80
S. kudriavzevii   2        0.88                  0.44                         20       7.96                  29.80
S. bayanus        14       3.03                  1.51                         50       3.60                  12.40
C. glabrata       2        0.67                  0.33                         8        10.67                 14.80
S. castellii      2        0.50                  0.25                         8        5.06                  10.80
Pre-WGD
K. lactis         n.a.     n.a.                  n.a.                         15       12.09                 23.80
D. hansenii       n.a.     n.a.                  n.a.                         52       19.25                 31.70
Y. lipolytica     n.a.     n.a.                  n.a.                         24       2.71                  8.20
Notes. n.a.: not applicable.
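
The two frequency definitions reported in Table 4, together with the n(n − 1)/2 count of pairwise comparisons within a family, can be illustrated with a short Python sketch. The function names are hypothetical and are not part of the authors' pipeline; the counts are the S. cerevisiae values quoted in Section 3.3.

```python
def pairwise_comparisons(family_size):
    """Number of unordered gene pairs in a family of n members: n(n - 1)/2."""
    return family_size * (family_size - 1) // 2


def conversion_frequencies(conversions, comparisons, members):
    """Return (% per pairwise comparison, % per multigene family member)."""
    return 100 * conversions / comparisons, 100 * conversions / members


# S. cerevisiae counts as quoted in the text (hypothetical variable names).
ohnolog_freqs = conversion_frequencies(conversions=13, comparisons=551, members=1102)
paralog_freqs = conversion_frequencies(conversions=110, comparisons=1930, members=212)

print(pairwise_comparisons(10))  # 45 comparisons for a 10-member paralog family
print(ohnolog_freqs)
print(paralog_freqs)
```

Values recomputed this way can differ somewhat from the printed entries of Table 4, so the sketch is meant to show the two denominators rather than to reproduce the table.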

will be 45 ([10(10−1)]/2) whereas it will only be 5 for 10 ohnologs.
Ectopic gene conversions between paralogs are equally frequent in both pre- and post-WGD genomes. Median gene conversion frequencies relative to both total number of comparisons and number of multigene family members are not statistically different between pre-WGD (12.09%, 23.8%) and post-WGD (5.06%, 12.4%) paralogs (Table 4; Wilcoxon two sample test, P = 0.26 with respect to the number of gene comparisons and P = 0.82 with respect to the number of multigene family members).
There is a significant negative correlation (Spearman rank correlation test) between gene conversion frequency and distance between paralogs located on the same chromosomes in the genomes of S. cerevisiae (r = −0.54; P = 0.008), C. glabrata (r = −0.74; P = 0.048), and D. hansenii (r = −0.45; P = 0.008). Correlations could not be calculated for the other paralog and/or ohnolog data sets either because gene distance information was not available for some species (see above) or because fewer than four genes were found on the same chromosomes (statistical power analyses require at least 4 data points).

3.4. Gene Conversion Length and Flanking Similarity. The median lengths of the gene conversions between ohnologs do not differ significantly among the seven post-WGD genomes (Table 5; multiple comparison ANOVA test, P = 0.86, α = 0.05). The median lengths of gene conversions between the paralogs of pre-WGD genomes also do not differ significantly (P = 0.34). However, the median lengths of the gene conversions between paralogs are significantly longer in S. cerevisiae than in S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus (multiple comparison ANOVA, P < 0.0001). In post-WGD genomes, the median lengths of gene conversions in paralogs and ohnologs (182 and 186.5 bp, resp.) are not significantly different (pairwise Wilcoxon rank tests, Table 5). Finally, the median lengths of gene conversions are significantly different from each other between pre-WGD (150 bp) and post-WGD (182 bp) paralogs (Wilcoxon two sample test, P = 0.02,

Table 5: Gene conversion lengths of pre- and post-WGD species.

Ohnologs (bp) Paralogs (bp) Wilcoxon test
Genome Median 1st quartile 3rd quartile Min Max Median 1st quartile 3rd quartile Min Max P-value
Post-WGD
S. cerevisiae 272 107 465 60 773 382.5 141 869 8 2642 0.22
S. paradoxus 235 98 354 50 531 106 51.5 232 14 1060 0.17
S. mikatae 165.5 95 431 68 568 167 83 366 14 535 0.64
S. kudriavzevii 270.5 146 395 146 395 136 85 172 25 391 0.19
S. bayanus 149.5 71 315 45 905 126 76 203 21 724 0.50
C. glabrata 83.5 27 140 27 140 130 83.5 386 59 668 0.36
S. castellii 144 118 170 118 170 226 73.5 581.5 44 862 0.69
Pre-WGD
K. lactis n.a. n.a. n.a. n.a. n.a. 99 40 236 32 1127 n.a.
D. hansenii n.a. n.a. n.a. n.a. n.a. 183 104.5 310.5 18 1309 n.a.
Y. lipolytica n.a. n.a. n.a. n.a. n.a. 83 27.5 196 16 1770 n.a.
Note. Wilcoxon two-sample tests were used to detect differences between the median gene conversion lengths of ohnologs and paralogs. n.a.: not applicable.
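
As a rough illustration of the two-sample comparison summarized in the last column of Table 5, the sketch below computes a Wilcoxon rank-sum (Mann-Whitney) statistic with a normal approximation. It is a minimal sketch under stated assumptions: it applies no tie or continuity correction, the function name is hypothetical, and the tract lengths are invented for illustration rather than taken from the study, whose analyses were run in S-plus.

```python
from statistics import NormalDist


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test in its Mann-Whitney U form.

    Uses the normal approximation without tie or continuity correction,
    so it is only indicative for very small samples.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # U counts the pairs in which the value from sample_a is larger; ties count one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in sample_a for b in sample_b)
    mean_u = n_a * n_b / 2
    sd_u = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, z, p_two_sided


# Invented conversion tract lengths (bp), for illustration only.
ohnolog_lengths = [60, 107, 272, 465, 773]
paralog_lengths = [8, 141, 380, 385, 869, 2642]

u, z, p = rank_sum_test(ohnolog_lengths, paralog_lengths)
print(u, z, p)
```

With only two conversions in some ohnolog data sets (Table 4), an exact test would be preferable to this approximation.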

Table 6: Maximum flanking similarity of gene conversions in pre and post-WGD species.

Ohnolog maximum flanking similarity (%) Paralog maximum flanking similarity (%) Wilcoxon test
Genome Median 1st quartile 3rd quartile Min Max Median 1st quartile 3rd quartile Min Max P-value
Post-WGD
S. cerevisiae 88 84 94 80 97 95.6 91 99 80 100 0.001
S. paradoxus 89 83 94 82 97 90.3 87 97.5 80 100 0.24
S. mikatae 87.5 82 92 81 96 91.7 86.6 95.6 81 100 0.15
S. kudriavzevii 86.8 85.7 88 85.7 88 94 93 97 85 99 0.07
S. bayanus 87.6 85 92 80 98 92.9 86 99 81 100 0.04
C. glabrata 84.5 83 86 83 86 92.6 90 99.5 86 100 0.08
S. castellii 87 86 88 86 88 93 87 97 85 100 0.35
Pre-WGD
K. lactis n.a. n.a. n.a. n.a. n.a. 90 86 97 81 98 n.a.
D. hansenii n.a. n.a. n.a. n.a. n.a. 93 86.3 97 80 100 n.a.
Y. lipolytica n.a. n.a. n.a. n.a. n.a. 86.5 83.3 93.5 80 100 n.a.
Note. Wilcoxon two-sample tests were used to detect differences between the median flanking similarities of ohnologs and paralogs. n.a.: not applicable.

α = 0.05). These median lengths are similar to the average length of the S. cerevisiae conversions observed in a previous study (173 bp, [4]).
The median sequence similarities of regions flanking gene conversions between ohnologs do not differ significantly among the seven post-WGD genomes (Table 6; multiple ANOVA tests, P = 0.97, α = 0.05). Furthermore, the median sequence similarities of regions flanking gene conversions between paralogs do not differ significantly among the seven genomes (multiple comparison ANOVA test, P = 0.18, α = 0.05).
Although the median flanking similarity of the converted paralogs of post-WGD species is always higher than that of their ohnologs, this difference is only significant in the genomes of S. cerevisiae and S. bayanus (Table 6). However, this lack of statistical significance is likely the result of the relatively low power of these statistical tests because the power of each test was ≤61% (results not shown).
The median sequence similarities of regions flanking gene conversions between the paralogs of pre-WGD genomes do not differ significantly (Table 6; multiple ANOVA tests, P = 0.21, α = 0.05). However, converted genes within pre-WGD paralogs have significantly less flanking similarity (pooled median of 90%) than converted paralogs in post-WGD genomes (pooled median of 94%; Wilcoxon two sample test, P = 0.0004, α = 0.05, Table 6). We do not know whether this difference has any biological significance.
Analysis of the relationship between the length of gene conversions and flanking similarity indicates a significant positive correlation within the ohnologs of the seven post-WGD genomes (Spearman rank correlation test, r = 0.44, P = 0.005; Figure 3(a)), the paralogs of the seven post-WGD genomes (r = 0.36, P = 0; Figure 3(b)), and the paralogs of the three pre-WGD genomes (r = 0.35, P = 0; Figure 3(c)).
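
The Spearman rank correlation used for the length versus flanking-similarity analysis can be sketched as the Pearson correlation of average ranks. This is a minimal sketch: the function names are hypothetical, the data points are invented for illustration and are not values from the study, and no P-value is computed.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented data: maximum flanking similarity (%) and conversion length (bp).
flanking_similarity = [80, 85, 90, 95, 99]
conversion_length = [60, 150, 120, 400, 900]

r = spearman_r(flanking_similarity, conversion_length)
print(r)
```

A positive r, as reported for all three data sets, means that longer conversion tracts tend to occur between genes with more similar flanking sequence.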

3.5. Ka, Ks, Ka/Ks Ratios and Ontology of Ohnolog and Paralog Converted Genes. In post-WGD genomes, the fact that synonymous substitutions (Ks) are lower for converted paralogs than for converted ohnologs suggests that paralogs have a more recent origin (Table 7). Therefore, the higher Ka/Ks ratio of paralogs clearly indicates that paralogs are under weaker selective constraints than ohnologs. Furthermore, the similar Ka/Ks ratios of pre- and post-WGD paralogs indicate that the paralogs of pre- and post-WGD species evolve under similar selective constraints (Table 7).
The ohnologs and paralogs of S. cerevisiae are involved in different processes. Although many of the GO terms shown in Table 8 are not mutually exclusive (e.g., “transposition” and “transposition, RNA-mediated”), analyses of the processes in which these genes are involved show that ohnologs are involved in regulation, essential biosynthetic processes, and metabolic processes whereas paralogs are involved in transposition, transport, and nonessential biosynthetic processes.

3.6. Location of Converted Regions. When considering pre-WGD paralogs, post-WGD paralogs, and post-WGD ohnologs, only the post-WGD paralogs of S. cerevisiae show a significant bias of gene conversions towards the 3′-end of genes (Table 9). However, the fact that the power of all nonsignificant tests is smaller than 15% suggests that this bias might also exist in the data sets where it was not detected but that our data are not sufficient to detect it (Table 9).

Figure 3: Correlation between gene conversion length and maximum flanking sequence similarity. (a) Conversions detected between the ohnologs of the six Saccharomyces species and C. glabrata. There are 107 conversions, 46 of which have ≥80% flanking similarity. (b) Conversions detected between the paralogs of the six Saccharomyces species and C. glabrata. There are 401 conversions, 311 of which have ≥80% flanking similarity. (c) Conversions detected between the paralogs of the three pre-WGD genomes. There are 147 conversions, 91 of which have ≥80% flanking similarity. (Axes in each panel: maximum flanking similarity versus conversion length in bp.)

4. Discussion

Using S. cerevisiae ohnologs as queries allowed us to retrieve an average of 372 ohnolog pairs from the other six post-WGD genomes (Table 1). Although these seven species are phylogenetically related (see [13] for a phylogenetic tree of these fungal species), and therefore did not evolve independently, it is very unlikely that species as divergent as S. cerevisiae and C. glabrata (which diverged soon after the whole genome duplication, some 150 MYA) would have kept 300 pairs of common ohnologs by chance. In fact, assuming that the ancestral pre-WGD genome had 5000 genes and that current post-WGD genomes have 5500 genes [13], one would expect them to have kept only 50 ohnologs in common (0.1 × 0.1 × 5000) by chance alone. As we discuss further below, this suggests that common ohnologs provide a selective advantage and evolve under strong selective constraints.
Since the number and the mean size of paralog multigene families are not significantly different between pre- and post-WGD species, the genome duplication event in the post-WGD genome ancestor did not significantly increase the number or mean size of paralog multigene families in post-WGD species (Table 1, Figure 2). The small number and size of gene families in C. glabrata have already been noticed and are likely the result of reductive evolution and gene loss through relatively high genome instability [10, 12, 30].
The chromosomal distribution of ohnologs and paralogs is very different. Whereas, on average, 23.4% of paralogs are found on the same chromosomes, only 4.5% of ohnologs are

Table 7: Nonsynonymous substitutions per nonsynonymous site (Ka), synonymous substitutions per synonymous site (Ks), and Ka/Ks
ratios (± standard deviations) for pairs of converted genes in pre- and post-WGD species.

Ka Ks Ka/Ks
Genome Ohnologs Paralogs Ohnologs Paralogs Ohnologs Paralogs
Post-WGD
S. cerevisiae 0.04 ± 0.03 0.09 ± 0.08 0.96 ± 0.49 0.37 ± 0.44 0.04 ± 0.02 0.38 ± 0.27
S. paradoxus 0.09 ± 0.11 0.18 ± 0.20 0.91 ± 0.76 0.56 ± 0.40 0.10 ± 0.05 0.46 ± 0.57
S. mikatae 0.09 ± 0.11 0.17 ± 0.19 1.87 ± 1.06 0.56 ± 0.31 0.04 ± 0.04 0.34 ± 0.45
S. kudriavzevii 0.06 ± 0.04 0.08 ± 0.04 0.95 ± 0.46 0.47 ± 0.59 0.06 ± 0.01 0.38 ± 0.34
S. bayanus 0.11 ± 0.09 0.13 ± 0.12 1.91 ± 1.68 0.40 ± 0.46 0.07 ± 0.05 0.40 ± 0.28
C. glabrata 0.25 ± 0.17 0.04 ± 0.04 1.32 ± 0.08 0.36 ± 0.55 0.18 ± 0.12 0.37 ± 0.45
S. castellii 0.18 ± 0.09 0.13 ± 0.07 2.80 ± 1.18 0.29 ± 0.12 0.06 ± 0.01 0.61 ± 0.46
Pre-WGD
K. lactis n.a. 0.20 ± 0.26 n.a. 0.61 ± 0.40 n.a. 0.49 ± 0.58
D. hansenii n.a. 0.10 ± 0.07 n.a. 0.50 ± 0.40 n.a. 0.31 ± 0.17
Y. lipolytica n.a. 0.25 ± 0.19 n.a. 1.12 ± 0.38 n.a. 0.46 ± 0.38
n.a.: not applicable.

Table 8: GO terms associated with biological processes for the ohnologs and paralogs of S. cerevisiae.

GO term Cluster frequency Background frequency P-value
Ohnologs
Biological regulation 21.9% 13.8% 5.2 × 10^−13
Regulation of biological process 18.0% 11.3% 3.9 × 10^−10
Regulation of cellular process 16.8% 10.5% 2.6 × 10^−09
External encapsulating structure organization and biogenesis 6.4% 2.8% 6.8 × 10^−09
Cell wall organization and biogenesis 6.4% 2.8% 6.8 × 10^−09
Protein amino acid phosphorylation 4.0% 1.4% 2.1 × 10^−08
Cellular polysaccharide biosynthetic process 2.0% 0.5% 4.1 × 10^−08
Polysaccharide biosynthetic process 2.0% 0.5% 9.9 × 10^−08
Carbohydrate biosynthetic process 2.7% 0.9% 9.3 × 10^−07
Cellular carbohydrate metabolic process 5.2% 2.3% 1.0 × 10^−06
Carbohydrate metabolic process 5.5% 2.5% 1.7 × 10^−06
Paralogs
Transposition 32.8% 1.3% 9.7 × 10^−109
Transposition, RNA-mediated 32.8% 1.3% 9.7 × 10^−109
Carbohydrate transport 5.2% 0.5% 9.5 × 10^−09
Monosaccharide transport 4.0% 0.3% 4.0 × 10^−07
Hexose transport 4.0% 0.3% 4.0 × 10^−07
Thiamin and derivative metabolic process 3.2% 0.3% 4.0 × 10^−05
Thiamin biosynthetic process 2.8% 0.2% 2.0 × 10^−4
Thiamin and derivative biosynthetic process 2.8% 0.3% 3.1 × 10^−4
Thiamin metabolic process 2.8% 0.3% 3.1 × 10^−4
Telomere maintenance via recombination 2.8% 0.3% 4.8 × 10^−4
Amino acid catabolic process 3.6% 0.5% 1.0 × 10^−3
Cellular response to nitrogen levels 1.6% 0.1% 1.6 × 10^−3
Notes. Frequencies were calculated from 1100 ohnologs, 250 paralogs, and 7159 background genes. Only the twelve most significant results for each type of
genes are shown.
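
GO term enrichment of the kind shown in Table 8 is commonly assessed with a hypergeometric test that compares the cluster frequency of a term with its background frequency. The sketch below assumes that model and does not claim to reproduce the GO Term Finder calculation or any multiple-testing correction it applies. The function name is hypothetical, and the gene counts are back-calculated from the rounded percentages in Table 8, so they are approximate.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated with the GO term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


# "Transposition" among S. cerevisiae paralogs; counts are approximations
# derived from 32.8% of 250 cluster genes and 1.3% of 7159 background genes.
cluster_size = 250
cluster_hits = round(0.328 * cluster_size)
background_size = 7159
background_hits = round(0.013 * background_size)

p_value = hypergeom_upper_tail(cluster_hits, cluster_size, background_hits, background_size)
print(p_value)
```

Because the inputs are rounded and uncorrected, the result indicates only the order of magnitude of the enrichment and need not match the tabulated P-value.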

Table 9: Correlations between the location of the converted regions mechanistic differences in the repair of double-stranded-
and their position in the converted genes in pre- and post-WGD breaks between pre- and post-WGD species because the
genomes. majority of repair genes have been maintained throughout
the evolution of the hemiascomycetes [33].
Ohnolog Paralog
Genome The previous studies have demonstrated a negative
R-value Power R-value Power
correlation between gene conversion frequency and physical
Post WGD distance on the same chromosome [4, 7]. We also observed
S. cerevisiae −0.07 0.036 0.73∗ n.a. such a negative correlation in the genomes of S. cerevisiae,
S. paradoxus 0.12 0.049 −0.19 0.072 C. glabrata, and D. hansenii (see above). However, a lack
S. mikatae 0.00 0.025 −0.19 0.076 of data (statistical power) prevented the detection of such a
S. kudriavzevii −0.17 0.065 −0.09 0.043 relationship in the paralogs of K. lactis and Y. lipolytica and
S. bayanus 0.24 0.095 0.11 0.047 the ohnologs of S. cerevisiae and C. glabrata. This correlation
C. glabrata 0.00 0.025 0.06 0.034 could result from the fact that the DNA repair mechanisms
S. castellii 0.17 0.066 −0.09 0.043 preferentially search for suitable repair templates close to
the damaged gene. Since ohnologs are more often found
Pre WGD
on different chromosomes (Table 2), this would also explain
K. lactis n.a. n.a. −0.32 0.14
why conversions are less frequent between ohnologs than
D. hansenii n.a. n.a. 0.02 0.028 between paralogs. On the other hand, our recent analyses
Y. lipolytica n.a. n.a. 0.14 0.055 of the human genome [6] has shown that, in the human
The R-values indicate correlation values. Significant correlations (Spearman genome, the negative correlation between gene conversion
rank correlation test P < 0.05) are labeled with ∗ . The power of each frequency and physical distance is simply the result of the fact
correlation test is provided except for S. cerevisiae paralogs, where the null
hypothesis was rejected, and for ohnologs for which a power test could not
that most duplicated genes are found next to one another.
be performed. n.a.: not applicable. Thus the negative correlation we observed in some yeast
species might also disappear if we normalized our data to
take into account the fact that most paralogs are located next
to one another on the same chromosome [10].
found on the same chromosomes (Table 2). A likely explanation for this difference is that paralogs often originate from unequal crossing over or replication slippage events whereas ohnologs originate from whole genome duplication events (page 250 of [31], [18], pages 199–202 of [32]). Since gene conversions tend to be more frequent between genes found on the same chromosomes than between genes located on different chromosomes (Table 3), this explains, in part, why gene conversions tend to be more frequent between paralogs than between ohnologs (Table 4). In fact, on average, when comparing gene conversion using total numbers, frequency calculated using the number of multigene family members, or frequency based on the number of gene comparisons, gene conversions are more frequent in the paralogs of pre- and post-WGD genes than in the ohnologs of the post-WGD genomes (Table 4).
Previous work on yeast, Drosophila, and humans has shown that intrachromosomal gene conversions are more frequent than interchromosomal gene conversions [4–6]. A possible explanation for the relatively high frequency of intrachromosomal conversions in D. hansenii (36%, Table 3) is that multiple tandem duplication events have been identified within this genome and, therefore, most paralogs are still located on the same chromosomes [10]. In contrast, in K. lactis and Y. lipolytica, gene conversions between intra- and interchromosomal paralogs are equally frequent (Table 3). The highly redundant Y. lipolytica genome has been shown to have undergone a high degree of map dispersion [10]. The low frequency of intrachromosomal conversions observed in this genome might therefore be the result of the dispersion of tandemly duplicated paralogs to other chromosomes. A similar phenomenon might be present in K. lactis. It is unlikely that these exceptions are due to

Sequence similarity requirements for ectopic conversions and the amount of negative selection are very similar between pre- and post-WGD paralogs. Several pieces of information support these conclusions. The fact that the frequency (Table 4), length (Table 5), and flanking sequence similarities (Table 6) of gene conversions of the paralogs within pre- and post-WGD species are similar indicates that mechanistic similarities are present between these genomes. In addition, the fact that the mean Ka/Ks values for the paralog families of pre- and post-WGD species are alike (Table 7) suggests that their genes are under similar selective pressures and have similar gene conversion constraints. This suggests that, despite the different ecological niches of the yeast species, these paralogs evolve in similar ways.
Surprisingly, the sequence similarity flanking conversions between post-WGD ohnologs is always lower than that flanking post-WGD paralogs (Table 6). This is likely due to the fact that ohnologs are much older than paralogs (i.e., they have larger Ks values; Table 7), which gave them time to accumulate more substitutions, and are under more selective constraints (i.e., they have larger Ka/Ks ratios; Table 7). Stronger selective constraints are expected to select against conversions which would homogenize ohnologs because such homogenization would erase the functional differences that each member of a pair of ohnologs has acquired during evolution. As mentioned above, the different function each member of a pair of ohnologs has acquired (neofunctionalization) also likely explains why different yeast genomes have so many common ohnologs (Table 1; [20]). Conversely, one of the effects of repeated gene conversion due to less negative selective pressure on paralogs is that the sequence similarity between them will increase. Thus, the observation that ectopic gene conversions occur more frequently between
paralogs than ohnologs (Table 4) might not only be due to the fact that ohnologs are more often found on different chromosomes (Table 2) but also due to ohnologs being under stronger selective constraints than paralogs (Table 7). These stronger selective constraints are due to the fact that ohnologs are involved in essential processes (regulation, essential biosynthetic processes and metabolic processes) whereas paralogs are involved in nonessential processes (transposition, transport and nonessential biosynthetic processes; Table 8). This is similar to the situation within genes where gene conversions have been shown to be less frequent in more functionally important regions [34, 35].
Previous studies on S. cerevisiae have found that gene conversions are biased toward the 3′-end of converted genes. This has been attributed to ectopic gene conversion via cDNA intermediates [4]. Our results confirm that conversions are biased toward the 3′-end of genes within the S. cerevisiae paralog dataset [4, Table 9]. The fact that no significant bias was detected within any other species is likely a result of the low statistical power due to the small amount of data available for each of these species (Table 9). This low statistical power for the distribution of gene conversions other than those between S. cerevisiae paralogs likely reflects the fact that, whereas there were 110 conversions between S. cerevisiae paralogs, there were only between 8 and 52 gene conversions between the paralogs of the other nine yeast species (Table 4). There were also only between 2 and 14 gene conversions between the ohnologs of the 7 post-WGD species. These low numbers of gene conversions are therefore not sufficient to ascertain whether their distribution is significantly biased.
The suggestion that the 3′-end bias of the gene conversions between S. cerevisiae paralogs is due to ectopic gene conversions with cDNA intermediates is consistent with the low number of introns present in this species as well as their 5′-position bias [4, 36, 37]. The genome of this species contains only 286 introns, and most of these introns are located at the 5′-end of the genes in which they are present [37]. This contrasts with the 139,418 introns found in the human genome and with the absence of intron position bias in human genes [37]. The model proposed by Fink to explain both the paucity and 5′-position bias of S. cerevisiae introns posits that incomplete cDNA molecules can recombine with their genomic copies, leading to both intron loss and a 5′-position bias of the remaining introns [36, 37]. This model was later supported by the experimental demonstration that cDNA molecules can recombine with their genomic copies [9]. Since the genomes of C. glabrata, D. hansenii, K. lactis, and Y. lipolytica all have few introns and their introns have a 5′-position bias [38], one would also expect to observe a 3′-end bias for their gene conversions if they often occur with cDNA copies. As discussed above, the fact that we did not observe such a bias in these four species could be due to the low statistical power of our tests. Alternatively, it could reflect recombination differences between S. cerevisiae and these four species.
In summary, our results show that the number and mean size of multigene families composed of paralogous sequences are not significantly different between pre- and post-WGD species (Table 1, Figure 2), that paralogs are more often found on the same chromosomes than ohnologs (Table 2), that gene conversions tend to be more frequent between genes found on the same chromosomes than between genes located on different chromosomes (Table 3), that gene conversions tend to be more frequent between paralogs than between ohnologs (Table 4), that the frequency (Table 4), length (Table 5), and flanking sequence similarities (Table 6) of the gene conversions between the paralogs of pre- and post-WGD species are similar, that there is a positive correlation between the length of gene conversions and flanking similarity in all converted genes (Figure 3), that ohnologs are under stronger selective constraints than paralogs (Table 7), that these stronger selective constraints are due to the fact that ohnologs are involved in essential processes whereas paralogs are involved in nonessential processes (Table 8), and that conversions are biased toward the 3′-end of the S. cerevisiae paralogs (Table 9). In the future, since it has recently been shown that the expression levels of duplicated genes influence their rate of sequence divergence [39], it would be interesting to test whether the increased ectopic gene conversion frequency we observed in C. glabrata, D. hansenii, and K. lactis (Table 3) is due to conversions between highly expressed genes.

Acknowledgments
The authors thank the two anonymous referees for their useful and constructive comments on a previous version of this paper. This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada to G. Drouin.

References
[1] Y. Aylon and M. Kupiec, “DSB repair: the yeast paradigm,” DNA Repair, vol. 3, no. 8-9, pp. 797–815, 2004.
[2] V. M. Watt, C. J. Ingles, M. S. Urdea, and W. J. Rutter, “Homology requirements for recombination in Escherichia coli,” Proceedings of the National Academy of Sciences of the United States of America, vol. 82, no. 14, pp. 4768–4772, 1985.
[3] P. Shen and H. V. Huang, “Homologous recombination in Escherichia coli: dependence on substrate length and homology,” Genetics, vol. 112, no. 3, pp. 441–457, 1986.
[4] G. Drouin, “Characterization of the gene conversions between the multigene family members of the yeast genome,” Journal of Molecular Evolution, vol. 55, no. 1, pp. 14–23, 2002.
[5] W. R. Engels, C. R. Preston, and D. M. Johnson-Schlitz, “Long-range cis preference in DNA homology search over the length of a Drosophila chromosome,” Science, vol. 263, no. 5153, pp. 1623–1625, 1994.
[6] D. Benovoy and G. Drouin, “Ectopic gene conversions in the human genome,” Genomics, vol. 93, no. 1, pp. 27–32, 2009.
[7] A. S. H. Goldman and M. Lichten, “The efficiency of meiotic recombination between dispersed sequences in Saccharomyces cerevisiae depends upon their chromosomal location,” Genetics, vol. 144, no. 1, pp. 43–55, 1996.
[8] G. Achaz, E. Coissac, A. Viari, and P. Netter, “Analysis of intrachromosomal duplications in yeast Saccharomyces

cerevisiae: a possible model for their origin,” Molecular Biology and Evolution, vol. 17, no. 8, pp. 1268–1275, 2000.
[9] L. K. Derr and J. N. Strathern, “A role for reverse transcripts in gene conversion,” Nature, vol. 361, no. 6408, pp. 170–173, 1993.
[10] B. Dujon, D. Sherman, G. Fischer et al., “Genome evolution in yeasts,” Nature, vol. 430, no. 6995, pp. 35–44, 2004.
[11] K. H. Wolfe and D. C. Shields, “Molecular evidence for an ancient duplication of the entire yeast genome,” Nature, vol. 387, no. 6634, pp. 708–713, 1997.
[12] M. Kellis, B. W. Birren, and E. S. Lander, “Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae,” Nature, vol. 428, no. 6983, pp. 617–624, 2004.
[13] K. Wolfe, “Evolutionary genomics: yeasts accelerate beyond BLAST,” Current Biology, vol. 14, no. 10, pp. R392–R394, 2004.
[14] D. R. Scannell, K. P. Byrne, J. L. Gordon, S. Wong, and K. H. Wolfe, “Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts,” Nature, vol. 440, no. 7082, pp. 341–345, 2006.
[15] A. Goffeau, G. Barrell, H. Bussey et al., “Life with 6000 genes,” Science, vol. 274, no. 5287, pp. 546–567, 1996.
[16] P. Cliften, P. Sudarsanam, A. Desikan et al., “Finding functional features in Saccharomyces genomes by phylogenetic footprinting,” Science, vol. 301, no. 5629, pp. 71–76, 2003.
[17] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander, “Sequencing and comparison of yeast species to identify genes and regulatory elements,” Nature, vol. 423, no. 6937, pp. 241–254, 2003.
[18] K. H. Wolfe, “Yesterday’s polyploids and the mystery of diploidization,” Nature Reviews Genetics, vol. 2, no. 5, pp. 333–341, 2001.
[19] A. van Hoof, “Conserved functions of yeast genes support the duplication, degeneration and complementation model for gene duplication,” Genetics, vol. 171, no. 4, pp. 1455–1461, 2005.
[20] S. Wong and K. H. Wolfe, “Duplication of genes and genomes in yeasts,” in Comparative Genomics, P. Sunnerhagen and J. Piskur, Eds., vol. 15, pp. 78–99, Springer, Heidelberg, Germany, 2005.
[21] K. P. Byrne and K. H. Wolfe, “The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species,” Genome Research, vol. 15, no. 10, pp. 1456–1461, 2005.
[22] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[23] S. Sawyer, GENECONV molecular biology computer program, 1999, http://www.math.wustl.edu/~sawyer/geneconv.
[24] D. Posada and K. A. Crandall, “Evaluation of methods for detecting recombination from DNA sequences: computer simulations,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 24, pp. 13757–13762, 2001.
[25] E. Erdfelder, F. Faul, and A. Buchner, “GPOWER: a general power analysis program,” Behavior Research Methods, Instruments, & Computers, vol. 28, no. 1, pp. 1–11, 1996.
[26] Z. Yang, “PAML: a program package for phylogenetic analysis by maximum likelihood,” CABIOS, vol. 13, no. 5, pp. 555–556, 1997.
[27] Z. Yang and R. Nielsen, “Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models,” Molecular Biology and Evolution, vol. 17, no. 1, pp. 32–43, 2000.
[28] E. L. Hong, R. Balakrishnan, Q. Dong et al., “Gene Ontology annotations at SGD: new data sources and annotation methods,” Nucleic Acids Research, vol. 36, no. 1, pp. D577–D581, 2008.
[29] W. P. Rice, “Analysing tables of statistical tests,” Evolution, vol. 43, no. 1, pp. 223–225, 1989.
[30] G. Fischer, E. P. Rocha, F. Brunet, M. Vergassola, and B. Dujon, “Highly variable rates of genome rearrangements between hemiascomycetous yeast lineages,” PLoS Genetics, vol. 2, no. 3, article e32, 2006.
[31] D. Graur and W.-H. Li, Fundamentals of Molecular Evolution, Sinauer Associates, Sunderland, Mass, USA, 2nd edition, 2000.
[32] M. Lynch, The Origins of Genome Architecture, Sinauer Associates, Sunderland, Mass, USA, 2007.
[33] G.-F. Richard, A. Kerrest, I. Lafontaine, and B. Dujon, “Comparative genomics of hemiascomycete yeasts: genes involved in DNA replication, repair, and recombination,” Molecular Biology and Evolution, vol. 22, no. 4, pp. 1011–1023, 2005.
[34] Z. Zhao, D. Hewett-Emmett, and W.-H. Li, “Frequent gene conversion between human red and green opsin genes,” Journal of Molecular Evolution, vol. 46, no. 4, pp. 494–496, 1998.
[35] J. P. Noonan, J. Grimwood, J. Schmutz, M. Dickson, and R. M. Myers, “Gene conversion and the evolution of protocadherin gene cluster diversity,” Genome Research, vol. 14, no. 3, pp. 354–366, 2004.
[36] G. R. Fink, “Pseudogenes in yeast?” Cell, vol. 49, no. 1, pp. 5–6, 1987.
[37] T. Mourier and D. C. Jeffares, “Eukaryotic intron loss,” Science, vol. 300, no. 5624, p. 1393, 2003.
[38] D.-K. Niu, W.-R. Hou, and S.-W. Li, “mRNA-mediated intron losses: evidence from extraordinarily large exons,” Molecular Biology and Evolution, vol. 22, no. 6, pp. 1475–1481, 2005.
[39] S. Pyne, S. Skiena, and B. Futcher, “Copy correction and concerted evolution in the conservation of yeast genes,” Genetics, vol. 170, no. 4, pp. 1501–1513, 2005.

SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 921312, 11 pages
doi:10.4061/2011/921312

Research Article
Sequence Analysis of SSR-Flanking Regions Identifies Genome
Affinities between Pasture Grass Fungal Endophyte Taxa

Eline van Zijll de Jong,1,2,3 Kathryn M. Guthridge,1,2,4 German C. Spangenberg,1,2,4,5 and John W. Forster1,2,4,5

1 Department of Primary Industries, Biosciences Research Division, Victorian AgriBiosciences Centre, 1 Park Drive, La Trobe University Research and Development Park, Bundoora, VIC 3083, Australia
2 Molecular Plant Breeding Cooperative Research Centre, La Trobe University Research and Development Park, Bundoora, VIC 3083, Australia
3 Bioprotection Research Centre, P.O. Box 84, Lincoln University, Lincoln 7647, Canterbury, New Zealand
4 Dairy Futures Cooperative Research Centre, La Trobe University Research and Development Park, Bundoora, VIC 3083, Australia
5 La Trobe University, Bundoora, VIC 3086, Australia

Correspondence should be addressed to John W. Forster, john.forster@dpi.vic.gov.au

Received 14 October 2010; Accepted 10 December 2010

Academic Editor: Hiromi Nishida

Copyright © 2011 Eline van Zijll de Jong et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

Fungal species of the Neotyphodium and Epichloë genera are endophytes of pasture grasses showing complex differences of life-cycle
and genetic architecture. Simple sequence repeat (SSR) markers have been developed from endophyte-derived expressed sequence
tag (EST) collections. Although SSR array size polymorphisms are appropriate for phenetic analysis to distinguish between taxa,
the capacity to resolve phylogenetic relationships is limited by both homoplasy and heteroploidy effects. In contrast, nonrepetitive
sequence regions that flank SSRs have been effectively implemented in this study to demonstrate a common evolutionary origin
of grass fungal endophytes. Consistent patterns of relationships between specific taxa were apparent across multiple target loci,
confirming previous studies of genome evolution based on variation of individual genes. Evidence was obtained for the definition
of endophyte taxa not only through genomic affinities but also by relative gene content. Results were compatible with the
current view that some asexual Neotyphodium species arose following interspecific hybridisation between sexual Epichloë ancestors.
Phylogenetic analysis of SSR-flanking regions, in combination with the results of previous studies with other EST-derived SSR
markers, further permitted characterisation of Neotyphodium isolates that could not be assigned to known taxa on the basis of
morphological characteristics.

1. Introduction

Fungal endophytes of the genera Neotyphodium and Epichloë are widespread in temperate grasses of the Poaceae subfamily Pooideae [1, 2]. In agronomically important pasture grasses such as perennial ryegrass (Lolium perenne L.) and tall fescue (Festuca arundinacea Schreb. [Darbysh.] syn. L. arundinaceum), the respective symbionts (N. lolii [Latch, Christensen, and Samuels] Glenn, Bacon, and Hanlin and N. coenophialum [Morgan-Jones and Gams] Glenn, Bacon, and Hanlin) confer both beneficial and detrimental agronomic traits [3–7]. Molecular genetic marker-based studies have contributed to knowledge of endophyte genetics, taxonomy, and phylogeny. Neotyphodium species were originally placed in the form (asexual) genus Acremonium [8], but were reclassified into the new form genus Neotyphodium when sequence analysis of ribosomal RNA-encoding (rDNA) genes indicated a monophyletic group with the sexual Epichloë species [1, 9]. Phylogenetic analysis of genes for conserved proteins such as the β-tubulin gene (tub2), translation elongation factor 1-α (tefA), and actin (actG) also provided evidence for close relationships between Neotyphodium and Epichloë species [10–17].
Although taxa such as N. lolii are haploid in nature, other Neotyphodium species were shown to contain multiple gene copies and conform to heteroploid genomic constitutions [17]. The single or multiple gene copies of asexual Neotyphodium species appear to correspond to those of specific haploid Epichloë species. This observation has been interpreted to support a hybrid origin for heteroploid taxa: for instance, N. coenophialum has been proposed to have arisen through hybridisation and subsequent nuclear fusion events involving the extant taxa E. typhina, E. baconii, and E. festucae. The relative genome sizes of haploid and heteroploid endophytes (c. 30 Mb for N. lolii; c. 60 Mb for N. coenophialum) lend some support to this hypothesis [18], subject to the possibility of selective gene loss subsequent to hybridisation events. Phylogenetic relationships between endophyte taxa are hence complex and reticulated. Sequence analysis of individual gene loci may be used to infer such relationships based on affinities between shared genomes. However, performance differences between individual genes have been observed. The resolution capacity provided by rDNA and actA genes was low in comparison to other genes [12–15], possibly due to homoplasy effects [13]. Heteroploid-like Neotyphodium species also display aneuploidy for some loci, such as the rDNA gene, limiting resolution of complete phylogenies [19]. A broader survey of gene classes is hence desirable to further clarify affinities between endophyte taxa.
Simple sequence repeats (SSRs) or microsatellites [20] have been widely used for analysis of genetic variation within and between closely related species [21]. A high rate of mutation [22] renders SSR array length polymorphism particularly useful for intraspecific genetic studies. However, sequence analysis has revealed complex mechanisms controlling allele size variation, limiting the efficiency of interspecific phylogenetic analysis. Repeat number variation is thought to arise from polymerase slippage during replication [23], but constraints on threshold size for allele expansion [24] and on allele size range [25] are evident. In addition, interruptions of the repeat structure tend to stabilise SSR loci [26]. Constraints on allele size may consequently lead to inaccurate assessment of phylogenetic divergence between taxa. Size homoplasy of distinct alleles arising from insertions, deletions, and base substitutions in the SSR flanking regions is also common [27–30]. Changes in flanking regions appear to occur independently of changes in the SSR repeat array [28, 30]. Due to these factors, allelic variation of SSR loci, as assessed by amplicon size variation, is appropriate only for phenetic analysis and not suitable for phylogenetic reconstruction.
In contrast, several studies have performed phylogenetic interpretation through analysis of SSR-flanking sequence regions. The resolving power of evolutionary studies using individual structural genes may be constrained by limited divergence [31], and studies of a small number of gene loci may not be representative of whole-genome variation [32]. However, the abundant genomic distribution of SSRs [33–35] permits phylogenetic assessment across the transcriptional units of multiple gene classes. SSR-flanking regions have been used for phylogenetic analysis of multiple organisms [31, 32, 36, 37], resolving relationships to otherwise inaccessible levels [31, 32, 37].
Consistent with these previous studies, gene-associated SSR loci have previously been shown to discriminate endophyte taxa based on size polymorphism [38], but did not permit phylogenetic analysis. The present study describes the comparison of sequences that flank the SSR array in 5 independently selected gene loci, across 23 distinct fungal endophyte isolates. The derived data have determined the extent of molecular variation underlying SSR size polymorphism, confirmed current models for genome affinities, inferred phylogenetic relationships and models of genome evolution (including a role for selective gene loss), and elucidated the genomic origin of several previously unclassified Neotyphodium taxa.

2. Materials and Methods

2.1. Endophyte Isolates. Phylogenetic analysis was performed on 20 endophyte isolates representing three Neotyphodium and five Epichloë taxa, as well as three Neotyphodium isolates which could not be assigned to known taxa based on their morphological characters (A. Leuchtmann, pers. comm.), and a tall fescue endophyte taxon (FaTG-2) which has yet to be allocated a Linnean name (Table 1). Endophyte isolates were cultured and DNA was extracted as described previously [38].

2.2. DNA Sequence Analysis of EST-SSR Amplicons. Genomic amplicons obtained with primer pairs designed to five EST-SSR loci (NCESTA1AB04, NCESTA1FH03, NCESTA1AG07, NLESTA1GF09, and NLESTA1NF04) were analysed. Amplicons were obtained as described previously [38]. Amplicons from haploid taxa were analysed by direct sequencing, with the exception of locus NLESTA1NF04 for which sequencing was performed on purified plasmids containing the cloned amplicon [38]. Sequencing reactions were performed in 10 μl reaction volumes containing 4 μl Sequencing Reagent Premix from the DYEnamic ET Terminator Cycle Sequencing kit (Amersham Biosciences, Little Chalfont, UK), 0.5 μM of forward or reverse primer for the locus of interest and 5 μl of amplicon in a thermocycler (GeneAmp; PE Applied Biosystems, Foster City, California, USA) programmed for 20 s at 92°C followed by 30 cycles of 20 s at 95°C, 15 s at 50°C, 2 min at 60°C, then 10 min at 60°C. Sequencing products were purified using Autoseq 96 plates (Amersham Biosciences), dried at 80°C for 30 min and resuspended in 5 μl of sterile Milli-Q water before analysis on an ABI Prism 3700 automated sequencer (PE Applied Biosystems). For multiple products amplified in a single reaction from nonhaploid taxa, cloning was used to separate the different amplicons. Following amplification, the products were purified using a Microspin S-300 HR Column (Amersham Biosciences). The purified products were cloned into pGEM-T Easy Vector (Promega, Madison, Wisconsin,

Table 1: Endophyte isolates used for phylogenetic analysis.

Species or taxon Isolate Host species Origin Source


N. coenophialum 9309 Festuca arundinacea France ETH Zürich1
N. coenophialum 9920/1 F. arundinacea U.S.A. ETH Zürich
N. coenophialum 9920/2 F. arundinacea U.S.A. ETH Zürich
N. coenophialum 9920/3 F. arundinacea U.S.A. ETH Zürich
FaTG-2 8907 F. arundinacea U.S.A. ETH Zürich
N. lolii 9601 Lolium perenne Belgium ETH Zürich
N. lolii Ellett H5837 L. perenne New Zealand DPI-Hamilton2
N. lolii North African 6 L. perenne Morocco DPI-Hamilton
N. lolii Victorian 2 L. perenne Australia DPI-Hamilton
N. uncinatum 9414 F. pratensis Germany ETH Zürich
Neotyphodium sp. 9303/2 Elymus europaeus Switzerland ETH Zürich
Neotyphodium sp. 9727 F. arizonica U.S.A. ETH Zürich
Neotyphodium sp. 9728 L. perenne New Zealand ETH Zürich
E. baconii 9707 Agrostis tenuis Switzerland ETH Zürich
E. bromicola 9630 Bromus erectus Switzerland ETH Zürich
E. clarkii 9401 Holcus lanatus Switzerland ETH Zürich
E. festucae 9412 F. gigantea Switzerland ETH Zürich
E. festucae 9436 F. pratensis Switzerland ETH Zürich
E. festucae 9713 F. rubra Switzerland ETH Zürich
E. festucae 9718 F. gigantea Switzerland ETH Zürich
E. festucae 9722 F. rubra England ETH Zürich
E. sylvatica 9301 Brachypodium sylvaticum Switzerland ETH Zürich
E. typhina 9635 Dactylis glomerata Switzerland ETH Zürich
1 ETH Zürich: Geobotanisches Institut, ETH, Zürich, Switzerland.
2 DPI-Hamilton: Department of Primary Industries, Primary Industries Research Victoria, Hamilton, Victoria, Australia.
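
The isolate panel in Table 1 can be summarised by genus with a short tally. The sketch below is illustrative only: the taxon and isolate labels are transcribed from Table 1, whereas the variable and function names are hypothetical and not part of the original study.

```python
from collections import Counter

# (species or taxon, isolate) pairs transcribed from Table 1.
TABLE1 = [
    ("N. coenophialum", "9309"), ("N. coenophialum", "9920/1"),
    ("N. coenophialum", "9920/2"), ("N. coenophialum", "9920/3"),
    ("FaTG-2", "8907"),
    ("N. lolii", "9601"), ("N. lolii", "Ellett H5837"),
    ("N. lolii", "North African 6"), ("N. lolii", "Victorian 2"),
    ("N. uncinatum", "9414"),
    ("Neotyphodium sp.", "9303/2"), ("Neotyphodium sp.", "9727"),
    ("Neotyphodium sp.", "9728"),
    ("E. baconii", "9707"), ("E. bromicola", "9630"), ("E. clarkii", "9401"),
    ("E. festucae", "9412"), ("E. festucae", "9436"), ("E. festucae", "9713"),
    ("E. festucae", "9718"), ("E. festucae", "9722"),
    ("E. sylvatica", "9301"), ("E. typhina", "9635"),
]


def genus_of(taxon):
    """Map a Table 1 taxon label to its genus; FaTG-2 is kept as its own group."""
    if taxon.startswith(("N. ", "Neotyphodium")):
        return "Neotyphodium"
    if taxon.startswith("E. "):
        return "Epichloë"
    return taxon


isolates_per_genus = Counter(genus_of(taxon) for taxon, _ in TABLE1)
```

Tallied this way, the 23 rows of Table 1 comprise 12 Neotyphodium isolates, 10 Epichloë isolates, and the single FaTG-2 isolate.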

U.S.A.) and transformed into competent cells. Inserts were amplified from transformed colonies and sequenced.
Consensus sequences were derived through analysis of several independently isolated clones or direct sequencing of both strands. Sequences were compared using Sequencher (version 4.0) (Gene Codes Corporation, Ann Arbor, Michigan, U.S.A.). BLASTX (versions 2.2.1 and 2.2.6) [39] was used to search for similarities between the EST sequence of SSR loci and protein sequences in the protein databases available from the National Center for Biotechnology Information (NR, PDB and SwissProt; http://www.ncbi.nlm.nih.gov/BLAST/).

2.3. Phylogenetic Analysis of EST-SSR Amplicons. The DNA sequences of unique amplicons were prepared for phylogenetic analysis by compilation in FastA format into a single file for sequence alignment in ClustalX (version 1.8) [40]. Manual realignment of sequences removed primer termini and polymorphic SSR arrays and converted insertion-deletion (indel) regions into single multistate characters. Sequence alignments were analysed by clustering or tree searching methods available in PHYLIP (version 3.6a3) (J. Felsenstein, University of Washington, Seattle, Washington, U.S.A., available from http://evolution.gs.washington.edu/phylip.html). For parsimony analysis, sequences were analysed with indels either removed, or coded as single multistate characters. Where multiple trees were resolved, the Kishino-Hasegawa-Templeton (KHT) test [41, 42] or Shimodaira-Hasegawa (SH) test [43] was used to test for significant differences. The robustness of the trees was measured by the Bootstrap method [44] with 1000 replicates. A bootstrap value of 70% or greater was considered to be well supported. For Maximum Likelihood (ML) and distance-based analysis, sequences were analysed with indels removed. Distance matrices were obtained using the F84 model [42, 45] and clustered using the Fitch-Margoliash (FM) method [46] or Neighbor-Joining (NJ) method [47]. The transition/transversion ratio was estimated using the Tree-Puzzle program (version 5.0) [48] or by the ML method. To estimate the transition/transversion ratio by the ML method, different possible values for the transition/transversion ratio were evaluated in multiple runs to find the value with the maximum likelihood estimate. The same approach was used to estimate the among-site rate heterogeneity by the ML method. To estimate the among-site rate heterogeneity by the Minimum Evolution (ME) method, distance matrices generated for each site were analysed and the total branch length was taken as the estimated value of the rate of change for each of the sites.

3. Results

3.1. Characterisation of Endophyte EST-SSR Amplicons. Genomic amplicons from five EST-SSR loci, ranging in length from 181 to 385 bp, were characterised from 12

Table 2: Characteristics of EST-derived SSR loci used for phylogenetic analysis of endophyte isolates.

EST-SSR locus                                   NCESTA1AB04      NCESTA1FH03  NCESTA1GA07       NLESTA1GF09      NLESTA1NF04
Product size (bp)                               241–257          217–252      181–206           364–385          242–265
Number of indels                                5                7            3                 6                2
Number of unique SSR array variants             2                13           11                1                19
Number of unique gene variants                  14               13           13                11               15
Composition of products                         coding + 5′-UTR  unknown      intron + unknown  coding + intron  unknown
Length of unit for phylogenetic analysis (bp)1  195              163          116               321              166
  CDS or exon                                   151              -            46                238              -
  UTR or intron                                 44               -            70                83               -
Number of informative characters1               40               37           23                44               51
  CDS or exon                                   30               -            10                30               -
  UTR or intron                                 10               -            13                14               -
Percent informative characters1                 21               21           20                14               31
  CDS or exon                                   20               -            22                13               -
  UTR or intron                                 23               -            19                17               -
Number of trees resolved                        3                1            4                 2                2

1 Each indel was coded as a single multistate character.
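
For the three loci whose products could be partitioned into coding and noncoding sequence, the percentages in Table 2 follow from dividing the number of informative characters by the length of the corresponding unit. The sketch below is a worked check under that assumption, using values transcribed from Table 2; the dictionary layout and function names are hypothetical.

```python
# (length in bp, number of informative characters), transcribed from Table 2
# for the three loci with a known coding/noncoding composition.
PARTITIONS = {
    "NCESTA1AB04": {"CDS or exon": (151, 30), "UTR or intron": (44, 10)},
    "NCESTA1GA07": {"CDS or exon": (46, 10), "UTR or intron": (70, 13)},
    "NLESTA1GF09": {"CDS or exon": (238, 30), "UTR or intron": (83, 14)},
}


def percent_informative(length_bp, informative):
    """Informative characters as a whole-number percentage of the unit length."""
    return round(100 * informative / length_bp)


def summarise(locus):
    """Return total length, total informative characters, and per-partition percentages."""
    parts = PARTITIONS[locus]
    total_length = sum(length for length, _ in parts.values())
    total_informative = sum(info for _, info in parts.values())
    percentages = {name: percent_informative(*values) for name, values in parts.items()}
    return total_length, total_informative, percentages


for locus in PARTITIONS:
    print(locus, summarise(locus))
```

For locus NCESTA1AB04, for example, the partitions sum to 195 bp and 40 informative characters, and give 20% for the coding portion and 23% for the 5′-UTR portion, matching the tabulated values.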

Neotyphodium and 10 Epichloë isolates, as well as FaTG-2 (Table 2). The sequences of two loci (NCESTA1AB04 and NLESTA1GF09) shared amino acid sequence similarity with hypothetical or predicted proteins of Neurospora crassa and Magnaporthe grisea and were mainly composed of coding sequence (CDS) as well as 5′-untranslated region (5′-UTR) and intron sequences, respectively. The sequences of the remaining three loci did not show similarity with any proteins in public databases. Amplification products from locus NCESTA1GA07 were predominantly composed of intron sequence, based on comparison of EST and genomic DNA sequences.

Size polymorphisms between taxa for the selected loci resulted from variation at a number of indel sites (Table 2). Differences also occurred in the repeat unit number of the SSR array for three loci (NCESTA1FH03, NCESTA1GA07, and NLESTA1NF04; data not shown), accounting for the majority of the observed size polymorphisms. A number of sequence haplotypes (defined here as amplicons with multiple sequence variant content, but clearly related to a single common reference) for each locus were identified across the sample set, with identity observed between those of several Neotyphodium and Epichloë isolates. Single haplotypes for individual genes were observed for N. lolii, unclassified Neotyphodium isolate 9727, and the different Epichloë species, while N. coenophialum, FaTG-2, N. uncinatum, and unclassified Neotyphodium isolates 9303/2 and 9728 generated multiple haplotypes. The number of haplotypes present in these species varied between loci, with a maximum of three for N. coenophialum and Neotyphodium isolates 9303/2 and 9728, and two for FaTG-2 and N. uncinatum. Variation was observed in cloning efficiency of different PCR products for those species possessing multiple haplotypes. Aberrant haplotypes, which were likely to be chimeras generated by PCR-mediated recombination, were also obtained.

Pairwise comparisons identified between 23 and 51 informative characters for the different loci (Table 2). The proportion of informative characters ranged from 14% (locus NLESTA1GF09) to 31% (locus NLESTA1NF04), but was c. 20% for the other loci. A similar proportion of informative characters occurred in both the coding and noncoding sequences of the eligible loci.

3.2. Phylogenetic Analysis of Endophyte EST-SSR Amplicons. Loci were analysed through sequence alignment (Supplementary Material: Appendices 1–5 available at doi:10.4061/2011/921312) individually, rather than as a combined dataset, due to variation of both inferred ploidy level and number of observed haplotypes between different SSR loci from heteroploid isolates. Between one and four trees were resolved for the different loci using the Parsimony method (Figures 1–5). A single tree was resolved for locus NCESTA1FH03 (Figure 2). The multiple trees obtained for loci NCESTA1AB04 (Figure 1), NLESTA1GF09 (Figure 4), and NLESTA1NF04 (Figure 5) only differed in the placement of one or two species. More variation was evident in the branching of the multiple trees identified for the locus NCESTA1GA07 (Figure 3). The trees, however, were not found to be significantly different in the KHT or SH tests. The majority of branches in the trees were supported by bootstrap analysis. Similar trees were resolved for the different loci using the ML, FM, and NJ methods (data not shown). In tests performed using the ML and ME methods, no significant differences were detected in the rate of change between coding and non-coding sequences or between exon and intron sequences for eligible loci (data not shown).
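
The proportions of informative characters quoted from Table 2 can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the counts are transcribed correctly from Table 2 for the three loci that have annotated coding and non-coding partitions; the dictionary layout and function names are illustrative and are not part of the original analysis.

```python
# Counts transcribed from Table 2 as (unit length in bp, informative characters).
# Only the three loci with annotated partitions are included.
TABLE2 = {
    "NCESTA1AB04": {"CDS or exon": (151, 30), "UTR or intron": (44, 10)},
    "NCESTA1GA07": {"CDS or exon": (46, 10), "UTR or intron": (70, 13)},
    "NLESTA1GF09": {"CDS or exon": (238, 30), "UTR or intron": (83, 14)},
}


def percent_informative(length_bp, informative):
    """Informative characters as a whole-number percentage of unit length."""
    return round(100 * informative / length_bp)


def summarise(table):
    """Return per-partition and whole-locus percentages for each locus."""
    summary = {}
    for locus, partitions in table.items():
        row = {name: percent_informative(*counts)
               for name, counts in partitions.items()}
        total_length = sum(counts[0] for counts in partitions.values())
        total_informative = sum(counts[1] for counts in partitions.values())
        row["total"] = percent_informative(total_length, total_informative)
        summary[locus] = row
    return summary


if __name__ == "__main__":
    for locus, row in summarise(TABLE2).items():
        print(locus, row)
```

For these three loci the recomputed percentages match the rounded values listed in Table 2, and the coding and non-coding partitions of each locus differ by only a few percentage points.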

[Figure 1: parsimony tree diagram with panels (a), (b), and (c)]

Figure 1: Parsimony analysis of sequence haplotypes derived from reference Neotyphodium and Epichloë isolates for the EST-SSR locus
NCESTA1AB04. (a) One of three most parsimonious trees obtained. The left edge is the inferred midpoint root. Branches with bootstrap
values of greater than 70% from 1000 bootstrap replications are marked. The number of isolates with an identical haplotype is indicated in
the brackets following the species name. In the instances when multiple haplotypes were identified, the relevant variant number is indicated
by the number of asterisks following the species name or isolate number. The boxed area indicated by the dotted line indicates the regions
representing subtrees. ((b) and (c)) Subtrees showing the alternative branching of E. sylvatica and E. baconii in the other trees.
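
The bootstrap values in the figure legends come from resampling alignment columns with replacement and counting how often a grouping recurs. The following is a minimal sketch of that resampling step, not the software used in the study: the taxon labels, the toy alignment, and the `ab_grouped` check (a simple Hamming-distance stand-in for rebuilding a parsimony tree on each replicate) are all hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by sampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}


def bootstrap_support(alignment, clade_test, replicates=1000, seed=0):
    """Percentage of pseudoreplicates in which clade_test returns True."""
    rng = random.Random(seed)
    hits = sum(bool(clade_test(bootstrap_alignment(alignment, rng)))
               for _ in range(replicates))
    return 100.0 * hits / replicates


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def ab_grouped(alignment):
    """Hypothetical stand-in for tree building: are A and B closest?"""
    d_ab = hamming(alignment["A"], alignment["B"])
    return (d_ab < hamming(alignment["A"], alignment["C"])
            and d_ab < hamming(alignment["B"], alignment["C"]))


if __name__ == "__main__":
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTCCGA"}
    print(bootstrap_support(toy, ab_grouped, replicates=1000))
```

In a real analysis the `clade_test` callable would rebuild the tree for each pseudoreplicate and report whether the branch of interest is present; branches recovered in more than 70% of 1000 replicates are the ones marked in the figures.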

Phylogenetic analysis of the different loci revealed similar genomic relationships among endophyte species, as summarised in a network format (Figure 6). Close relationships between haplotypes from different taxa were deduced to indicate partial or complete commonality of genome content. Some locus-dependent differences in tree topology (Figures 1–5) were observed, but specific taxa consistently grouped together. In most instances, the Epichloë species were separated into two groups, separation being supported by bootstrap analysis. The first contained E. festucae, E. baconii, and E. bromicola, and the second contained E. typhina, E. clarkii, and E. sylvatica, Neotyphodium species being included within both groups. Group 1 Epichloë species were further divided into distinct branches according to their taxonomic classification, while Group 2 endophytes showed higher levels of genetic similarity. Close genetic relationships were evident between several Neotyphodium and Epichloë species. These relationships were observed in most of the trees and were also supported by bootstrap analysis. The single haplotypes derived from both N. lolii and E. festucae were grouped together in all trees and were identical in structure for two of the loci (NCESTA1FH03 and NLESTA1GF09).

Multiple haplotypes from N. coenophialum, FaTG-2, and N. uncinatum were consistently associated with counterparts from specific Neotyphodium and Epichloë species. N. coenophialum and N. uncinatum shared identical or very closely related haplotypes for all five loci. These variants also grouped with those from Group 2 Epichloë species. N. coenophialum also shared common haplotypes with FaTG-2 and E. baconii. The remaining haplotypes common to N. coenophialum and FaTG-2 grouped with the corresponding single haplotypes from E. festucae and N. lolii. For a subset of the target loci, N. uncinatum-derived sequences grouped with the corresponding haplotypes from E. bromicola.

Unclassified Neotyphodium isolates also displayed close genetic relationships with known taxa. Neotyphodium isolate 9727 produced single haplotypes from each locus that were either identical or very similar to the haplotypes common between N. coenophialum and N. uncinatum, and grouped most closely with those derived from E. sylvatica. Neotyphodium isolates 9303/2 and 9728 were closely related, all locus-specific haplotypes showing a high degree of sequence similarity. One of three subclasses of derived haplotypes grouped to form a distinct well-supported group with those from E. festucae, N. lolii, N. coenophialum, and FaTG-2. Isolates 9303/2 and 9728 exhibited a second haplotype sub-class that grouped with counterparts from E. bromicola and N. uncinatum, and the same class was present in isolate over two loci. The remaining haplotype sub-class (observed in isolate 9303/2 for four loci, and in isolate 9728 for one locus)

[Figure 2: parsimony tree diagram]

Figure 2: Parsimony analysis of sequence haplotypes derived from reference Neotyphodium and Epichloë isolates for the EST-SSR locus NCESTA1FH03. Diagram properties are as described in the legend to Figure 1. Note that the nucleotide sequence of the amplification product from N. lolii isolate North African 6 detected at this locus by autoradiography [38] was not obtained.

grouped with the equivalent haplotypes from E. typhina and E. clarkii.

4. Discussion

4.1. Application of EST-SSR Loci to Endophyte Phylogenetic Analysis. The application of SSR markers for phylogenetic analysis is limited by two main factors: complex molecular evolution of SSR loci and the occurrence of size homoplasy between distinct SSR alleles. Sequence analysis of selected SSR loci in the current study has demonstrated that these factors contribute to the general inability of SSR markers to resolve phylogenetic relationships among endophyte species [38]. Changes in the SSR array repeat number appeared to be independent of flanking region changes: some of the locus NLESTA1NF04-derived haplotypes from multiple E. festucae isolates differed for SSR array number, but exhibited identity for flanking sequence, while others showed the converse relationship. SSR allele size homoplasy occurred between different endophyte taxa of distinct origins as a result of insertions, deletions, and base substitutions in both the SSR motif and flanking sequences, as observed for the locus NCESTA1FH03-specific E. festucae and E. baconii-related N. coenophialum haplotypes. Endophyte SSR locus arrays were highly variable, and differences in repeat unit number generally accounted for allele size variation between closely related endophyte species, while indel and base substitution incidence increased when comparisons were made between more distantly related taxa.

The results of this and other studies [30, 31] suggest that size variation may provide a relatively accurate measure of genetic variation between closely related species. Although homoplasy was not taken into account, SSRs have previously proven useful for genetic discrimination within and between endophyte species [38]. Presumably, the inherently variable nature of SSRs and the large number of loci analysed reduced the potential biasing effects of individual loci. The complex nature of SSR loci, however, demonstrates the critical value of sequence level analysis for phylogenetic inference.

The flanking regions of gene-associated SSRs were highly conserved within and between endophyte taxa (80%–100% sequence identity across coding and non-coding regions), supporting a common origin for these species [1, 9]. Despite this level of sequence conservation, SSR-flanking regions were informative for studying genetic relationships. The different individual loci yielded similar genetic relationships, consistent with previous studies of other genes. Differences in the power of the individual loci to resolve relationships were identified due to variation in number of informative characters and composition (exon, intron, coding, or non-coding) of amplicons. In other studies, SSR-flanking sequences from different loci have been aggregated to increase the number of informative characters and improve resolution of phylogenetic relationships [31, 32, 36, 37]. However, variation of inferred ploidy level and of number of haplotypes derived from different loci in heteroploids would potentially bias such aggregation studies. As a consequence, each locus was analysed separately in this study.

4.2. Genome Affinities between Neotyphodium and Epichloë Species. The close relationships between taxa were in accordance with those predicted from genes more commonly used for phylogenetic analysis such as rDNA, tubB, tefA, and actG. Single locus-specific haplotypes were obtained from all Epichloë species and from N. lolii, the latter being closely related to E. festucae. Other Neotyphodium species contained multiple haplotypes that were similar to those from different Epichloë species. Occurrence of different haplotype subclasses in N. coenophialum, FaTG-2, and N. uncinatum is consistent with the heteroploid or nonhaploid nature of these species [11, 14] and indicates the presence of multiple genomes (Figure 6).

4.3. Relationships with Unclassified Neotyphodium Isolates. Neotyphodium isolates that could not be assigned to known morphological classes also appear to differ from characterised taxa at the molecular level [17]. Isolate 9303/2 and isolate 9728 have been assigned to taxonomic groupings HeuTG-2 and LpTG-2, respectively, based on phylogenetic analysis of the tefA and tubB genes (A. Leuchtmann, pers. comm.; [17]). Analysis of SSR-flanking regions in this study, however, suggests that the isolates show closer affinities than formerly predicted. Moon et al. [17] reported the detection of two haplotype classes for each isolate, closely related to those from E. bromicola and E. typhina (HeuTG-2) and from E. festucae and E. typhina (LpTG-2). These phylogenetic affinities were also detected in the current study. However,

[Figure 3: parsimony tree diagram with panels (a), (b), (c), and (d)]

Figure 3: Parsimony analysis of sequence haplotypes derived from reference Neotyphodium and Epichloë isolates for the EST-SSR locus
NCESTA1GA07. (a) One of four most parsimonious trees found. Diagram properties are as described in the legend to Figure 1. ((b), (c), and
(d)) Subtrees showing the alternative topologies of the E. festucae, E. baconii, and E. bromicola clade in the other trees.

[Figure 4: parsimony tree diagram with panels (a) and (b)]

Figure 4: Parsimony analysis of sequence haplotypes derived from reference Neotyphodium and Epichloë isolates for the EST-SSR locus
NLESTA1GF09. (a) One of two most parsimonious trees found. Diagram properties are as described in the legend to Figure 1. (b) Subtree
showing the alternative topology of the E. typhina, E. clarkii, and E. sylvatica clade in the other tree.

both flanking sequence analysis, as well as phenetic studies based on a larger number of SSR loci [38], detected a third haplotype sub-class for both isolates and suggested common affinities with E. festucae, E. bromicola, and E. typhina, respectively. Accurate inference of phylogenetic relationships among Neotyphodium and Epichloë species consequently requires characterisation of a number of different genomic loci.

Although isolates 9303/2 and 9728 share common affinities, DNA-based phylogenetic and phenetic analyses suggest mutual genetic divergence and placement in taxonomic groups with different relative gene content. Although similar-sized haplotypes were detected, SSR polymorphism between these isolates was greater than that detected within N. coenophialum, N. lolii, and E. festucae [38]. In addition,

[Figure 5: parsimony tree diagram with panels (a) and (b)]

Figure 5: Parsimony analysis of sequence haplotypes derived from reference Neotyphodium and Epichloë isolates for the EST-SSR locus
NLESTA1NF04. (a) One of two most parsimonious trees found. Diagram properties are as described in the legend to Figure 1. (b) Sub-tree
showing the alternative branching of E. sylvatica in the other tree. Note that the nucleotide sequence of the second amplification product
from FaTG-2 and the third amplification product from unidentified Neotyphodium isolate 9728 detected at this locus by autoradiography
[38] was not obtained.

[Figure 6: network diagram of genomic affinities, with connecting lines labelled by the supporting loci (A–E)]

Figure 6: Summary of genomic affinities between Neotyphodium and Epichloë species predicted from phylogenetic analysis of the flanking
regions of five SSR loci: NCESTA1AB04 (a), NCESTA1FH03 (b), NCESTA1GA07 (c), NLESTA1GF09 (d), and NLESTA1NF04 (e). Predicted
partial or complete genomes, based on variant haplotype sub-classes, are indicated as black circles. Lines that connect putative common
genomes, and the level of support for each inference are indicated in terms of the number of loci providing confirmatory data. This
information relates to the next most adjacent taxon in the topology of the diagram. An asterisk indicates a locus-specific variant obtained by
PCR, but for which the nucleotide sequence was not obtained. The dotted line defines the inferred division between the two major groups
of Epichloë species.

9303/2 and 9728 failed to cluster together in an AFLP-derived phenogram [38], which represents a genome-wide assessment of genetic polymorphism. SSR polymorphism analysis also detected substantial differences in the number of locus-specific haplotypes: isolate 9728 produced a higher proportion of single haplotype classes. DNA sequence analysis further demonstrated differences in both number and type of haplotype. E. festucae-related haplotypes were detected in both isolates for all five loci, while E. bromicola-like and E. typhina-like sequence variants were observed more frequently in isolate 9303/2 than isolate 9728 and were not represented between all loci. Neotyphodium species with common phylogenetic affinities are known to occur in several different grass species. The LpTG-2, N. tembladerae, and N. australiense endophytes, which are resident in L. perenne, Poa huecu, and Echinopogon ovatus, respectively, all appear to be phylogenetically related to E. festucae and E. typhina [10, 16]. However, these endophytes appear to be related to different E. typhina strains and also differ in their genome structure [10, 13, 16]. Differences in transcript levels associated with different gene-specific sequence variants, as observed for the 60S ribosomal protein-encoding gene in this study, may also contribute to phenotypic trait variation between different heteroploid endophyte taxa.

Two asexual endophyte species, N. huerfanum and N. tembladerae, are known to occur in Festuca arizonica [17]. Phylogenetic analysis of the third unclassified Neotyphodium isolate (9727), which was also derived from F. arizonica, suggests that it may belong to the former taxon. Isolate 9727 produced single haplotypes and these sequences, like those of the N. huerfanum tefA and tubB loci [17], are closely related to the inferred E. typhina-related haplotypes from N. coenophialum and N. uncinatum (Section 4.2). These results were also supported by SSR polymorphism-based phenetic analysis, in which 9727 clustered with E. typhina, E. clarkii, and E. sylvatica, while in AFLP analysis the isolate clustered with N. uncinatum [38].

4.4. Origins of Neotyphodium and Epichloë Species. Due to their close phylogenetic relationships with specific Epichloë species, Neotyphodium species have been proposed to have originated from these sexual endophyte taxa either directly through the loss of the sexual state, or through interspecific hybridisation of distinct Epichloë and Neotyphodium species. The first process is proposed to have given rise to haploid Neotyphodium species such as N. lolii, while the heteroploid Neotyphodium species such as N. coenophialum, FaTG-2, and N. uncinatum may have arisen through the second evolutionary process. Because Epichloë species form unique mating populations [49–51] and Neotyphodium species are not known to sporulate in vivo [52], this second mode of evolution is thought to have been a parasexual process involving somatic fusion of endophyte hyphae. This hypothesis does, however, require physical colocation between endophyte taxa that generally occur in distinct host species. In addition, mechanisms of gene loss following nuclear fusion are necessary to account for the observed genomic composition of contemporary heteroploid taxa, as a range of studies [2, 10, 11, 13, 14, 16, 17] have shown that extant Neotyphodium species do not appear to have the full complement of genes present in phylogenetically related Epichloë species. Loss of genes involved in sexual reproduction and pathogenicity would be a prerequisite for such genomic rearrangement events, as well as genes vulnerable to dosage-dependent effects. It is also formally possible that sexual Epichloë species may have arisen from asexual Neotyphodium species in response to selective environmental pressures, a mechanism requiring both gene loss and gene gain, possibly through horizontal gene transfer. Mechanisms for both processes have been inferred through comparisons of different fungal taxa at the whole genome level [53] and may have been facilitated by structural features such as presence of conserved repetitive elements.

In conclusion, this study demonstrates the application of SSR-flanking sequences to studies of genome affinities between pasture grass fungal endophyte species for clarification of novel modes of genome evolution. The inferred affinities were consistent with those obtained from gene loci that are more commonly used in molecular phylogenetics, but provided a more extensive survey of genomic loci that may be ultimately extended to whole genome comparisons based on second-generation sequencing technologies.

Acknowledgments

Endophyte isolates were kindly provided by Dr. Adrian Leuchtmann (ETH Zürich, Switzerland) and Nola McFarlane (DPI-Hamilton). This work was supported by the Victorian Department of Primary Industries and the Molecular Plant Breeding Cooperative Research Centre. All experiments conducted during this study comply with current Australian laws.

References

[1] G. A. Kuldau, J. S. Liu, J. F. White, M. R. Siegel, and C. L. Schardl, “Molecular systematics of Clavicipitaceae supporting monophyly of genus Epichloë and form genus Ephelis,” Mycologia, vol. 89, no. 3, pp. 431–441, 1997.
[2] K. Clay and C. Schardl, “Evolutionary origins and ecological consequences of endophyte symbiosis with grasses,” American Naturalist, vol. 160, supplement S, pp. S99–S127, 2002.
[3] R. T. Gallagher, E. P. White, and P. H. Mortimer, “Ryegrass staggers: isolation of potent neurotoxins lolitrem A and lolitrem B from staggers-producing pastures,” New Zealand Veterinary Journal, vol. 29, no. 10, pp. 189–190, 1981.
[4] S. G. Yates, R. D. Plattner, and G. B. Garner, “Detection of ergopeptine alkaloids in endophyte infected, toxic Ky-31 tall fescue by mass spectrometry/mass spectrometry,” Journal of Agricultural and Food Chemistry, vol. 33, no. 4, pp. 719–722, 1985.
[5] D. D. Rowan and D. L. Gaynor, “Isolation of feeding deterrents against argentine stem weevil from ryegrass infected with the endophyte Acremonium loliae,” Journal of Chemical Ecology, vol. 12, no. 3, pp. 647–658, 1986.
[6] M. Arachevaleta, C. W. Bacon, C. S. Hoveland et al., “Effect of the tall fescue endophyte on plant response to environmental stress,” Agronomy Journal, vol. 81, no. 1, pp. 83–90, 1989.
[7] C. Ravel, C. Courty, A. Coudret, and G. Charmet, “Beneficial effects of Neotyphodium lolii on the growth and the water status in perennial ryegrass cultivated under nitrogen deficiency or drought stress,” Agronomie, vol. 17, no. 3, pp. 173–181, 1997.
[8] G. Morgan-Jones and W. Gams, “Notes on Hyphomycetes. XLI. An endophyte of Festuca arundinacea and the anamorph of Epichloë typhina, new taxa in one of two new sections of Acremonium,” Mycotaxon, vol. 15, no. 1, pp. 311–318, 1982.
[9] A. E. Glenn, C. W. Bacon, R. Price, and R. T. Hanlin, “Molecular phylogeny of Acremonium and its taxonomic implications,” Mycologia, vol. 88, no. 3, pp. 369–383, 1996.
[10] C. L. Schardl, A. Leuchtmann, H. F. Tsai, M. A. Collett, D. M. Watt, and D. B. Scott, “Origin of a fungal symbiont of perennial ryegrass by interspecific hybridization of a mutualist with the ryegrass choke pathogen, Epichloë typhina,” Genetics, vol. 136, no. 4, pp. 1307–1317, 1994.
[11] H. F. Tsai, J. S. Liu, C. Staben et al., “Evolutionary diversification of fungal endophytes of tall fescue grass by hybridization with Epichloë species,” Proceedings of the National Academy of Sciences of the United States of America, vol. 91, no. 7, pp. 2542–2546, 1994.
[12] C. L. Schardl, A. Leuchtmann, K. R. Chung, D. Penny, and M. R. Siegel, “Coevolution by common descent of fungal symbionts (Epichloë spp.) and grass hosts,” Molecular Biology and Evolution, vol. 14, no. 2, pp. 133–143, 1997.
[13] C. D. Moon, B. Scott, C. L. Schardl, and M. J. Christensen, “The evolutionary origins of Epichloë endophytes from annual ryegrasses,” Mycologia, vol. 92, no. 1–6, pp. 1103–1118, 2000.
[14] K. D. Craven, J. D. Blankenship, A. Leuchtmann, K. Hignight, and C. L. Schardl, “Hybrid fungal endophytes symbiotic with the grass Lolium pratense,” Sydowia, vol. 53, no. 1, pp. 44–73, 2001.
[15] K. D. Craven, P. T. W. Hsiau, A. Leuchtmann, W. Hollin, and C. L. Schardl, “Multigene phylogeny of Epichloë species, fungal symbionts of grasses,” Annals of the Missouri Botanical Garden, vol. 88, no. 1, pp. 14–34, 2001.
[16] C. D. Moon, C. O. Miles, U. Järlfors, and C. L. Schardl, “The evolutionary origins of three new Neotyphodium endophyte species from grasses indigenous to the Southern Hemisphere,” Mycologia, vol. 94, no. 4, pp. 694–711, 2002.
[17] C. D. Moon, K. D. Craven, A. Leuchtmann, S. L. Clement, and C. L. Schardl, “Prevalence of interspecific hybrids amongst asexual fungal endophytes of grasses,” Molecular Ecology, vol. 13, no. 6, pp. 1455–1467, 2004.
[18] G. A. Kuldau, H. F. Tsai, and C. L. Schardl, “Genome sizes of Epichloë species and anamorphic hybrids,” Mycologia, vol. 91, no. 5, pp. 776–782, 1999.
[19] C. L. Schardl, J. S. Liu, J. F. White, R. A. Finkel, Z. An, and M. R. Siegel, “Molecular phylogenetic relationships of nonpathogenic grass mycosymbionts and clavicipitaceous plant pathogens,” Plant Systematics and Evolution, vol. 178, no. 1-2, pp. 27–41, 1991.
[20] D. Tautz, “Hypervariability of simple sequences as a general source for polymorphic DNA markers,” Nucleic Acids Research, vol. 17, no. 16, pp. 6463–6471, 1989.
[21] J. L. Weber and P. E. May, “Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction,” American Journal of Human Genetics, vol. 44, no. 3, pp. 388–396, 1989.
[22] M. D. Schug, C. M. Hutter, K. A. Wetterstrand, M. S. Gaudette, T. F. C. Mackay, and C. F. Aquadro, “The mutation rates of di-, tri- and tetranucleotide repeats in Drosophila melanogaster,” Molecular Biology and Evolution, vol. 15, no. 12, pp. 1751–1760, 1998.
[23] G. Levinson and G. A. Gutman, “Slipped-strand mispairing: a major mechanism for DNA sequence evolution,” Molecular Biology and Evolution, vol. 4, no. 3, pp. 203–221, 1987.
[24] O. Rose and D. Falush, “A threshold size for microsatellite expansion,” Molecular Biology and Evolution, vol. 15, no. 5, pp. 613–615, 1998.
[25] J. C. Garza, M. Slatkin, and N. B. Freimer, “Microsatellite allele frequencies in humans and chimpanzees, with implications for constraints on allele size,” Molecular Biology and Evolution, vol. 12, no. 4, pp. 594–603, 1995.
[26] D. B. Goldstein and A. G. Clark, “Microsatellite variation in North American populations of Drosophila melanogaster,” Nucleic Acids Research, vol. 23, no. 19, pp. 3882–3886, 1995.
[27] M. C. Grimaldi and B. Crouau-Roy, “Microsatellite allelic homoplasy due to variable flanking sequences,” Journal of Molecular Evolution, vol. 44, no. 3, pp. 336–340, 1997.
[28] I. Colson and D. B. Goldstein, “Evidence for complex mutations at microsatellite loci in Drosophila,” Genetics, vol. 152, no. 2, pp. 617–627, 1999.
[29] I. Clisson, M. Lathuilliere, and B. Crouau-Roy, “Conservation and evolution of microsatellite loci in primate taxa,” American Journal of Primatology, vol. 50, no. 3, pp. 205–214, 2000.
[30] X. Chen, Y. G. Cho, and S. R. McCouch, “Sequence divergence of rice microsatellites in Oryza and other plant species,” Molecular Genetics and Genomics, vol. 268, no. 3, pp. 331–343, 2002.
[31] S. R. Santos, T. L. Shearer, A. R. Hannes, and M. A. Coffroth, “Fine-scale diversity and specificity in the most prevalent lineage of symbiotic dinoflagellates (Symbiodinium, Dinophyceae) of the Caribbean,” Molecular Ecology, vol. 13, no. 2, pp. 459–469, 2004.
[32] T. Asahida, A. K. Gray, and A. J. Gharrett, “Use of microsatellite locus flanking regions for phylogenetic analysis? A preliminary study of Sebastes subgenera,” Environmental Biology of Fishes, vol. 69, no. 1-4, pp. 461–470, 2004.
[33] D. Field and C. Wills, “Abundant microsatellite polymorphism in Saccharomyces cerevisiae, and the different distributions of microsatellites in eight prokaryotes and S. cerevisiae, result from strong mutation pressures and a variety of selective forces,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 4, pp. 1647–1652, 1998.
[34] M. S. Röder, V. Korzun, K. Wendehake et al., “A microsatellite map of wheat,” Genetics, vol. 149, no. 4, pp. 2007–2023, 1998.
[35] M. Morgante, M. Hanafey, and W. Powell, “Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes,” Nature Genetics, vol. 30, no. 2, pp. 194–200, 2002.
[36] M. C. Fisher, G. Koenig, T. J. White, and J. W. Taylor, “A test for concordance between the multilocus genealogies of genes and microsatellites in the pathogenic fungus Coccidioides immitis,” Molecular Biology and Evolution, vol. 17, no. 8, pp. 1164–1174, 2000.
[37] M. Rossetto, J. McNally, and R. J. Henry, “Evaluating the potential of SSR flanking regions for examining taxonomic relationships in the Vitaceae,” Theoretical and Applied Genetics, vol. 104, no. 1, pp. 61–66, 2002.
[38] E. Van Zijll De Jong, K. M. Guthridge, G. C. Spangenberg, and J. W. Forster, “Development and characterization of EST-derived simple sequence repeat (SSR) markers for pasture grass endophytes,” Genome, vol. 46, no. 2, pp. 277–290, 2003.
[39] S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[40] J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, “The CLUSTAL X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools,” Nucleic Acids Research, vol. 25, no. 24, pp. 4876–4882, 1997.
[41] A. R. Templeton, “Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes,” Evolution, vol. 37, no. 2, pp. 221–244, 1983.
[42] H. Kishino and M. Hasegawa, “Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea,” Journal of Molecular Evolution, vol. 29, no. 2, pp. 170–179, 1989.
[43] H. Shimodaira and M. Hasegawa, “Multiple comparisons of
log-likelihoods with applications to phylogenetic inference,”
Molecular Biology and Evolution, vol. 16, no. 8, pp. 1114–1116,
1999.
[44] J. Felsenstein, “Confidence limits on phylogenies: an approach
using the bootstrap,” Evolution, vol. 39, no. 4, pp. 783–791,
1985.
[45] J. Felsenstein and G. A. Churchill, “A Hidden Markov Model
approach to variation among sites in rate of evolution,”
Molecular Biology and Evolution, vol. 13, no. 1, pp. 93–104,
1996.
[46] W. M. Fitch and E. Margoliash, “Construction of phylogenetic
trees,” Science, vol. 155, no. 760, pp. 279–284, 1967.
[47] N. Saitou and M. Nei, “The neighbor-joining method: a
new method for reconstructing phylogenetic trees,” Molecular
Biology and Evolution, vol. 4, no. 4, pp. 406–425, 1987.
[48] H. A. Schmidt, K. Strimmer, M. Vingron, and A. Von Haeseler,
“TREE-PUZZLE: maximum likelihood phylogenetic analysis
using quartets and parallel computing,” Bioinformatics, vol.
18, no. 3, pp. 502–504, 2002.
[49] A. Leuchtmann, C. L. Schardl, and M. R. Siegel, “Sexual com-
patibility and taxonomy of a new species of Epichloë symbiotic
with fine fescue grasses,” Mycologia, vol. 86, no. 6, pp. 802–812,
1994.
[50] A. Leuchtmann and C. L. Schardl, “Mating compatibility and
phylogenetic relationships among two new species of Epichloë
and other congeneric European species,” Mycological Research,
vol. 102, no. 10, pp. 1169–1182, 1998.
[51] C. L. Schardl and A. Leuchtmann, “Three new species of
Epichloë symbiotic with North American grasses,” Mycologia,
vol. 91, no. 1, pp. 95–107, 1999.
[52] J. F. White, “Endophyte-host associations in forage grasses. XI.
A proposal concerning origin and evolution,” Mycologia, vol.
80, no. 4, pp. 442–446, 1988.
[53] E. L. Braun, A. L. Halpern, M. A. Nelson, and D. O. Natvig,
“Large-scale comparison of fungal sequence information:
mechanisms of innovation in Neurospora crassa and gene loss
in Saccharomyces cerevisiae,” Genome Research, vol. 10, no. 4,
pp. 416–430, 2000.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 423821, 7 pages
doi:10.4061/2011/423821

Research Article
Evolutionary Origins of the Fumonisin Secondary Metabolite
Gene Cluster in Fusarium verticillioides and Aspergillus niger

Nora Khaldi1 and Kenneth H. Wolfe2


1 UCD Conway Institute of Biomolecular and Biomedical Research, UCD School of Medicine and Medical Sciences,
and UCD Complex and Adaptive Systems Laboratory, University College Dublin, Dublin 4, Ireland
2 Smurfit Institute of Genetics, Trinity College Dublin, Dublin 2, Ireland

Correspondence should be addressed to Nora Khaldi, khaldin@tcd.ie

Received 15 October 2010; Revised 10 January 2011; Accepted 15 March 2011

Academic Editor: Hiromi Nishida

Copyright © 2011 N. Khaldi and K. H. Wolfe. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

The secondary metabolite gene clusters of euascomycete fungi are among the largest known clusters of functionally related genes
in eukaryotes. Most of these clusters are species specific or genus specific, and little is known about how they are formed during
evolution. We used a comparative genomics approach to study the evolutionary origins of a secondary metabolite cluster that
synthesizes a polyketide derivative, namely, the fumonisin (FUM) cluster of Fusarium verticillioides, and that of Aspergillus niger,
another species that produces fumonisin (fumonisin B). We identified homologs in other euascomycetes of the Fusarium verticillioides
FUM genes and their flanking genes. We discuss four models for the origin of the FUM cluster in Fusarium verticillioides and argue
that two of these are plausible: (i) assembly by relocation of initially scattered genes in a recent ancestor of Fusarium verticillioides; or (ii)
horizontal transfer of the FUM cluster from a distantly related Sordariomycete species. We also propose that the FUM cluster was
horizontally transferred into Aspergillus niger, most probably from a Sordariomycete species.

1. Introduction

The order of genes along eukaryotic chromosomes is often assumed to be random, but there is growing evidence that the chromosomal position of some genes is maintained by natural selection [1–3], and that selection can sometimes operate to move genes to new locations [4–6]. Among the eukaryotes, many of the most notable examples of the physical clustering of genes with related functions occur in the filamentous fungi [7–10]. The most striking fungal gene clusters are those involved in the synthesis of secondary metabolites such as sterigmatocystin (a 25-gene cluster in Aspergillus nidulans [11]), fumonisin (a 17-gene cluster in Fusarium verticillioides [12, 13]), and trichothecene (a 12-gene cluster in Fusarium sporotrichioides [14]). Secondary metabolites are organic molecules that are not essential for the normal growth of the fungus but which function in host/pathogen interactions or other forms of communication or warfare between organisms. They are typically modified polyketides, terpenes, alkaloids, or nonribosomal peptides [15]. Synthesis of these molecules requires many successive enzymatic steps, and the genes for these enzymes are almost invariably clustered together at a single genomic location. It has been suggested that physical clustering may allow the genes to be coregulated by means of chromatin modification [15], or that it may facilitate horizontal transfer of intact clusters between species [16].

Since the discovery of secondary metabolite gene clusters, their mechanism of origin and assembly has remained a matter of speculation. The growing number of available euascomycete genome sequences now enables us to both predict new secondary metabolite clusters [17] and take a phylogenomic approach to the evolutionary origins of these clusters. In this study we focus on one of the largest known secondary metabolite clusters “Fumonisin”, a polyketide synthase type cluster. Fumonisins (FB/FC) are mycotoxins produced by some species in the Gibberella fujikuroi species complex, of which F. verticillioides and F. proliferatum are the most studied (Gibberella is a teleomorphic form of the genus Fusarium). The FUM genes are organized as a
cluster of 17 genes in both species (including FUM20 and FUM21, two additional genes recently revealed in the cluster [13]), though the location of the cluster differs between the two [12, 18]. The products of the FUM cluster genes include a polyketide synthase, fatty acyl-CoA synthases, and cytochrome P450 monooxygenases. Expression of the genes in the cluster, but not the neighboring genes, is induced under conditions when fumonisin (FB/FC) is synthesized [12]. The ability to synthesize fumonisin (FB/FC) has a patchy phylogenetic distribution across the genus Fusarium, due to the variable presence or absence of the FUM cluster among different isolates [19]. F. graminearum, for instance, does not synthesize fumonisin.

Recently, fumonisin (FB) production has also been reported in Aspergillus niger [20]. The genes in this species are also clustered and homologous to the FUM genes of F. verticillioides [21]. This is surprising given the large evolutionary distance between A. niger and F. verticillioides.

We combined comparative genomics with phylogenetic analysis to investigate whether genes in the FUM cluster have homologs in filamentous fungi that do not synthesize the mycotoxins, and so to study how this cluster was formed.

2. Methods

Our analysis was done using the completely sequenced genomes of the euascomycetes A. nidulans [22], F. graminearum [23], M. grisea [24], and N. crassa [25]. A set of 87,000 expressed sequence tags from F. verticillioides [26] was used for analysis of rates of sequence evolution in this species compared to F. graminearum.

To identify homologs of genes in the FUM clusters, we first used each protein as a query in a BLASTP search against the NCBI nonredundant protein sequence database. Because the F. verticillioides genome is not present in this database, we made a local BLAST database for expressed sequence tags from this species. Sequences giving hits with Expect (E) values of less than 1e-4 were retained and used for phylogenetic analysis. Each set of proteins was aligned using ClustalW [27] and poorly aligned regions were removed using Gblocks [28]. Maximum likelihood trees were constructed using PHYML [29] with the JTT amino acid substitution matrix and four categories of substitution rates. Bootstrapping was done using the default options in PHYML with 100 replicates per run. The trees were inspected by eye, the distant sequences were removed, and the steps above (ClustalW, Gblocks, and PHYML) were repeated on the remaining sequences.

3. Results

In the following we are interested in identifying scenarios that can explain the origins of the current FUM gene cluster. We do this by listing all possible scenarios that might have taken place in evolution, and comparing their plausibility. Building an evolutionary scenario is not straightforward because many of the events took place in extinct species and only a few clues remain in the current organisms. Almost any scenario is formally possible, but what makes one scenario more likely than another is parsimony: consideration of the number of separate events that are required to have taken place in order to account for it. In other words, a scenario involving fewer events is a more likely explanation of the observed data.

3.1. Origins of the FUM Gene Cluster in F. verticillioides. Gene-by-gene phylogenetic analyses were carried out to decipher the evolutionary history of the FUM genes using homologs of these genes in four euascomycetes: Fusarium graminearum, Neurospora crassa, Magnaporthe grisea, and Aspergillus nidulans (we use the word “homologs” for convenience in situations where we are unsure whether genes are orthologs or paralogs). We also constructed individual phylogenies for the genes located on either side of the F. verticillioides FUM cluster. Homology relationships between the genes in or near the F. verticillioides FUM cluster and other euascomycete genes are summarized in Figure 1. We identified probable orthologs of 13 of the 17 FUM genes (Figure 1). These genes are not arranged in clusters in the other genomes.

A region of F. graminearum chromosome 1 (genes FG00269–FG00276; Figure 1) contains orthologs of the five genes to the left of the FUM cluster (F. verticillioides NPT1, WDR1, PNG1, ZNF1, and ZBD1) immediately adjacent to orthologs of the two genes to the right of the cluster (F. verticillioides ORF21 and MPU1), with nothing in between them in F. graminearum. Similarly, in M. grisea the ZBD1 ortholog is beside the ORF21 ortholog, and in A. nidulans the PNG1 ortholog is beside the ORF21 ortholog. Thus, chromosomal sites orthologous to the F. verticillioides FUM cluster-flanking regions exist adjacent to one another in F. graminearum, M. grisea, and A. nidulans, but no orthologs of the FUM genes themselves are found at these sites (Figure 1). Instead, homologs of the FUM genes are scattered on different chromosomes of the genomes of these fumonisin-nonproducing species. In F. graminearum, for instance, the 10 homologs of FUM genes are dispersed across four chromosomes and none of them is located close to another or to the FUM-flanking genes. Thus the FUM cluster genes appear to have been inserted into a pre-existing genomic locus between ZBD1 and ORF21.

Further, we examined the genomic contexts around each of the FUM cluster homologs in other species. For example, the F. verticillioides FUM cluster gene FUM11 is homologous to FG07875 in F. graminearum and to MG03479 in M. grisea (Figure 1). We will refer to FG07875 and MG03479 as focal genes. When we examine the regions around these focal genes in the F. graminearum and M. grisea genomes, we find that some of the neighboring genes near them are also orthologs of each other. On one side FG07874 (1 gene away from the focal gene) is an ortholog of MG03474 (5 genes away), and on the other side FG07864 (11 genes away from the focal gene) is an ortholog of MG03480 (1 gene away from the focal gene). In Figure 1, only the focal genes are shown but these similarities of context are indicated by the
[Figure 1 graphic. Top row (FV), in order: NPT1, WDR1, PNG1, ZNF1, ZBD1, FUM21, FUM1, FUM6, FUM7, FUM8, FUM9, FUM10, FUM11, FUM12, FUM20, FUM13, FUM14, FUM15, FUM16, FUM17, FUM18, FUM19, ORF20, ORF21, MPU1. Rows FG, NCU, MG, and AN list the gene numbers of the corresponding homologs and orthologs, with signed neighbor offsets shown in symbols.]

Figure 1: Representation of the fumonisin cluster and its flanking genes in F. verticillioides (FV). Columns are homologs of the FUM genes
and orthologs of the flanking genes identified by phylogenetic analysis in F. graminearum (FG), N. crassa (NCU), M. grisea (MG), and A.
nidulans (AN). Genes in the latter four species are identified by gene numbers from their genome projects. Different colors represent different
chromosomes. The long lines in F. graminearum, M. grisea, and A. nidulans show that in those species, there is a site in the genome that
corresponds to the FUM cluster location, but no FUM genes are present at that locus. Curved lines and numbers in orange symbols indicate
conservation of the neighboring genes around FUM homologs in F. graminearum, N. crassa, M. grisea, and A. nidulans. For each gene shown
in the figure (the focal genes) we considered the two genes immediately next to it. If these genes have orthologs located <20 genes away from
the focal gene’s ortholog in another species, a symbol indicates this fact. For example, the numbers −10 (in a circle) and −7 (in a diamond)
connected to gene NCU08935 indicate that the gene immediately after NCU08935 (i.e., NCU08936) has an ortholog in M. grisea that is 10
genes away from MG06199 (i.e., MG06189), and an ortholog in A. nidulans that is 7 genes away from AN04397 (i.e., AN04392). Triangles,
squares, circles, and diamonds indicate relationships to F. graminearum, N. crassa, M. grisea, and A. nidulans, respectively.
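
The neighbor-conservation rule stated in the caption (take the two genes immediately next to a focal gene and ask whether their orthologs lie fewer than 20 genes from the focal gene's ortholog in another species) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each species is reduced to one ordered gene list, the ortholog table is given in advance, and the gene names `a1`, `b3`, and so on are hypothetical placeholders rather than genes from the paper.

```python
from typing import Dict, List, Tuple

# Caption rule: an ortholog must lie <20 genes from the focal gene's ortholog.
MAX_GENE_DISTANCE = 20


def conserved_neighbors(
    focal: str,
    focal_ortholog: str,
    order_a: List[str],
    order_b: List[str],
    orthologs: Dict[str, str],
) -> List[Tuple[str, str, int]]:
    """Return (neighbor, neighbor_ortholog, signed_offset) for each gene
    immediately next to `focal` in species A whose ortholog in species B
    lies fewer than MAX_GENE_DISTANCE genes from `focal_ortholog`."""
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    focal_idx = pos_a[focal]
    anchor_idx = pos_b[focal_ortholog]

    hits: List[Tuple[str, str, int]] = []
    for idx in (focal_idx - 1, focal_idx + 1):  # the two immediate neighbors
        if not 0 <= idx < len(order_a):
            continue  # focal gene sits at the end of the gene list
        neighbor = order_a[idx]
        partner = orthologs.get(neighbor)
        if partner is None or partner not in pos_b:
            continue  # no ortholog on this gene list of species B
        offset = pos_b[partner] - anchor_idx
        if abs(offset) < MAX_GENE_DISTANCE:
            hits.append((neighbor, partner, offset))
    return hits


# Hypothetical toy data (placeholder names, not genes from the paper).
species_a = ["a1", "a2", "a3"]
species_b = [f"b{i}" for i in range(1, 13)]
toy_orthologs = {"a1": "b3", "a3": "b9"}

# Focal gene a2 has ortholog b8; its neighbors map 5 genes before and 1 gene after b8.
print(conserved_neighbors("a2", "b8", species_a, species_b, toy_orthologs))
```

The signed offsets play the role of the small numbers drawn in the figure symbols; a real analysis would also need to track chromosomes and gene orientation, which this sketch ignores.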

small numbers at the ends of the curved lines attached to the symbols for the focal genes FG07875 and MG03479.

Overall, the homologs of five FUM cluster genes (FUM10, FUM11, FUM16, FUM17, and FUM18) show some degree of local synteny conservation among the four species that do not contain FUM clusters. In other words, each of these genes is in a conserved location among some of the four species, and these locations are not close to one another.

The phylogenetic trees obtained from individual FUM genes are shown in Figure 2 (for FUM6 and FUM15) and Supplemental Figure 1, available online at doi:10.4061/2011/423821 (for the other genes). The trees present a diversity of topologies, such that no single simple story can explain the origin of the FUM cluster. One cannot expect all the FUM gene trees to have identical topologies, especially given the possibility that different genes have been subject to very different evolutionary constraints, but even so the diversity of topologies is surprising. To interpret these trees, we consider four possible scenarios for the origin of the cluster (Figure 3), and what tree topologies they would predict. To evaluate these scenarios, we concentrate on the FUM genes that are present in both F. verticillioides and A. niger (FUM1, FUM6, FUM7, FUM8, FUM9, FUM10, FUM13, FUM14, FUM15, and FUM19). Below, we discuss these four scenarios.

Scenario 1 (vertical inheritance of an ancestral cluster). This scenario is illustrated schematically in Figure 3(a). According to it, a cluster existed in the common ancestor of Sordariomycetes (i.e., F. verticillioides, F. graminearum, N. crassa, and M. grisea). This cluster became duplicated in this ancestor, and then one copy disintegrated, dispersing its genes around the genome. F. verticillioides retained both the cluster and the scattered genes, whereas the other Sordariomycete species retained only the scattered genes. Support for this scenario comes from some trees (Fum6, Fum10, Fum15, and Fum19) which show that the FUM genes have duplicates in F. verticillioides, but are single copy in fumonisin-nonproducing species. A duplication in the common ancestor of Sordariomycetes is suggested by the trees for some genes (Fum6, Fum13, and Fum15), whereas an older duplication in the common ancestor of Sordariomycetes plus Eurotiomycetes (not as shown in Figure 3(a)) is suggested by the trees for other genes (Fum7, Fum8, Fum10, and Fum19).

Scenario 2 (ancient duplications of scattered genes, followed by recent assembly of a cluster in F. verticillioides). This scenario is illustrated in Figure 3(b). Scenarios 1 and 2 both require that through evolution numerous independent events of loss have occurred in very distantly related species (for illustration purposes all the losses in Figures 3(a) and 3(b) are placed on the F. graminearum branch and in the common ancestor of N. crassa and M. grisea, but other combinations or losses on other branches are also possible). One problem with Scenarios 1 and 2 is that, according to the phylogenies of Fum7, Fum10, Fum15, and Fum19,
[Figure 2 graphic: two maximum likelihood trees, panels (a) and (b), each with a 0.2 scale bar. Tip labels give species names with NCBI IDs (F. verticillioides genes by FVEG ID), and node labels give bootstrap percentages.]

Figure 2: Maximum likelihood trees for FUM6, FUM15, and their homologs. (a) FUM6; (b) FUM15. In each tree, genes that appear in Figure 1
are named in red. The species name and the NCBI ID are provided on each branch. Bootstrap percentages are shown for all nodes. Trees
were constructed from amino acid sequences as described in Section 2 using PHYML after alignment with ClustalW.
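
The sequence sets behind these trees were chosen as described in Section 2, by retaining BLASTP hits with Expect values below 1e-4. A minimal Python sketch of such a filter is given here for orientation. It assumes 12-column tab-separated BLAST output (query ID in column 1, subject ID in column 2, E value in column 11); the function name `retain_hits` and the sample rows are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

E_VALUE_CUTOFF = 1e-4  # threshold stated in Section 2


def retain_hits(rows: Iterable[str], cutoff: float = E_VALUE_CUTOFF) -> Dict[str, List[str]]:
    """Group subject IDs by query, keeping hits with an E value below `cutoff`.

    Assumes tab-separated rows with the query ID in column 1, the subject ID
    in column 2, and the E value in column 11.
    """
    kept: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query, subject, e_value = fields[0], fields[1], float(fields[10])
        # Strict inequality: "less than 1e-4", so a hit at exactly 1e-4 is dropped.
        if e_value < cutoff and subject not in kept[query]:
            kept[query].append(subject)
    return dict(kept)


# Hypothetical sample rows (placeholder IDs and scores).
sample_rows = [
    "queryA\tsubj1\t62.1\t410\t150\t3\t1\t408\t5\t412\t3e-120\t350",
    "queryA\tsubj2\t28.4\t120\t80\t4\t10\t125\t30\t148\t0.002\t38.1",
    "queryB\tsubj3\t35.0\t200\t120\t5\t1\t198\t2\t200\t1e-4\t45.0",
]
print(retain_hits(sample_rows))
```

Each retained set would then go through alignment, trimming, and tree building; those steps depend on external programs and are not sketched here.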

the duplicates retained in all the fumonisin-nonproducing Sordariomycetes are coincidentally always the same copy (as illustrated by the parallel losses of multiple green genes, but not pink ones, in Figures 3(a) and 3(b)). This can be visualized in the trees by the fact that the homologs of the FUM genes in fumonisin-nonproducing species are orthologs of each other, showing a more or less typical species phylogeny. If the second copy in F. graminearum, M. grisea, or N. crassa (the one represented by green dots in Figures 3(a) and 3(b)) had been retained, this copy would be closer to the FUM gene than to genes in the fumonisin-nonproducing species, which is not what we observe. Together these observations make both Scenarios 1 and 2 very unlikely.

Scenario 3 (FUM gene duplication and cluster assembly specifically on the branch leading to F. verticillioides and the GFSC). The specificity of the FUM cluster to the Gibberella fujikuroi species complex (GFSC, which includes F. verticillioides, F. oxysporum, and F. proliferatum) points towards a complex-specific cluster. This scenario is shown in Figure 3(c). It proposes that the FUM cluster was built in an ancestor of F. verticillioides after its speciation from F. graminearum. Most of the gene trees do not support this model, because with the exception of two genes (Fum10 and Fum19) no homolog of a FUM gene has remained in F. verticillioides. Although the trees for Fum6 and Fum15 show homologs in F. verticillioides, the duplications in these cases greatly precede the origin of the GFSC clade.

The numbers of steps required for the different scenarios in Figures 3(a), 3(b), and 3(c) make it more parsimonious to argue that the FUM cluster became assembled in an ancestor of F. verticillioides (Figure 3(c)) than to argue that either the cluster or the individual genes underwent early duplication and then were lost multiple times (Figures 3(a) and 3(b)). Additionally, as explained above, Figures 3(a) and 3(b) would also imply that the same copy of the duplicates in the fumonisin-nonproducing species was independently retained (a minimum of two independent retentions of the same copy are required: one on the F. graminearum lineage, and one in the common ancestor of N. crassa and M. grisea, pink genes in Figures 3(a) and 3(b)).

The hypothesis of assembly requires that each gene transposed once, from an ancestral location, to its current location in F. verticillioides. The genes must have been sequentially relocated, with selection for each step.
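
The parsimony argument used here amounts to tallying the separate events each scenario requires and preferring the scenario with the fewest. A minimal Python sketch of that bookkeeping follows. The event lists are illustrative placeholders chosen only to show the mechanics, not counts reported in the paper, and `rank_scenarios` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def rank_scenarios(events: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (scenario, number_of_events) pairs, fewest events first."""
    counts = [(name, len(steps)) for name, steps in events.items()]
    # sorted() is stable, so scenarios with equal counts keep their input order.
    return sorted(counts, key=lambda item: item[1])


# Placeholder event lists (assumed for illustration, not taken from the paper).
toy_events = {
    "early duplication with repeated loss": ["duplication", "disassembly", "loss", "loss"],
    "recent duplication and assembly": ["duplication", "assembly"],
}

ranking = rank_scenarios(toy_events)
print(ranking[0])  # the scenario requiring the fewest events
```

Counting every event equally is itself an assumption; a weighted scheme, in which a horizontal transfer costs more than a gene loss for example, would need the weights to be justified separately.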
[Figure 3 graphic: four panels, (a) to (d), each a species tree with tips FV, FG, NC, MG, and AN; a "?" label appears in the lower row of panels.]

Figure 3: The four most likely scenarios giving rise to the current fumonisin cluster in F. verticillioides. The species represented on all four trees are: F. verticillioides (FV), F. graminearum (FG), N. crassa (NC), M. grisea (MG), and A. nidulans (AN). A red circle represents a duplication event. A blue star represents an assembly and clustering event, while a disassembly is shown using a black square. Red crosses indicate a loss event. When the FUM genes, or their ancestral genes, are clustered they are represented by a short line. The line is dashed if the genes are not clustered. Two colors are assigned to the duplicated genes: green genes are ancestors of the FUM genes (or the current FUM genes), while pink genes are the paralogs of FUM genes (ones found in all other Sordariomycetes). (a) represents vertical inheritance, where the ancestor possessed a version of the FUM cluster. (b) The ancestor in this scenario contained the ancestral genes of the FUM cluster (scattered). (c) represents a recent event of duplication and assembly of the FUM cluster in an ancestor of F. verticillioides. Finally, (d) represents the horizontal gene transfer scenario.

Scenario 4 (origin of the F. verticillioides FUM cluster by horizontal transfer from a distantly related fungus). This scenario is shown in Figure 3(d). Under this scenario the donor could be a Sordariomycete (as suggested by the trees for Fum6, Fum13, and Fum15), or a more distant species that is an outgroup to both Sordariomycetes and Eurotiomycetes (as suggested by Fum7, Fum9, Fum10, and Fum19). Although we cannot identify a specific donor, we cannot rule out this possibility. Indeed this scenario reduces the number of events leading to the current trees. Under this scenario, we would not have multiple losses, nor the unexpected retention of the same copy in many of the trees (represented in pink in Figure 3(b); Fum7, Fum10, Fum15, and Fum19). This scenario may explain the unexpected phylogenetic positioning of some FUM genes outside the expected class of species. For example, in the case of Fum7, Fum9, Fum10, Fum13, Fum15, and Fum19 a horizontal gene transfer of the FUM genes into F. verticillioides would explain such topologies.

Figure 3(d) illustrates how the horizontal transfer scenario, like the recent assembly hypothesis (Figure 3(c)), reduces the number of independent events necessary to explain the FUM cluster. However, the horizontal transfer scenario does not posit any mechanism for the assembly of the cluster; it just shifts the question to how the cluster became assembled in the donor species.

3.2. The Fumonisin Cluster in A. niger Results from Horizontal Gene Transfer. A. niger has been shown to produce fumonisin [13] and contains clustered homologs of many of the F. verticillioides FUM genes [21] (Figure 2). Our phylogenetic analysis illuminates the origin of this cluster in A. niger and how it relates to the cluster in F. verticillioides (phylogenetic trees in Figure 2 and Supplemental Figure 1).

The first trend evident from this phylogenetic analysis is that genes from the FUM cluster in F. verticillioides, F. oxysporum, and A. niger define clades supported by high bootstrap values (>90%, Figure 2), to the exclusion of homologous genes from Sordariomycetes and Eurotiomycetes.

Because we extended our analysis to many species, we were faced with the problem of low bootstrap values for many of the FUM gene trees. The two trees shown in Figure 2 (FUM6 and FUM15) are the ones with high bootstrap support for relevant branches, and a reasonably correct species phylogeny. Our analysis shows that both these genes in A. niger clearly group with genes from the Sordariomycetes, rather than with genes in the more closely related (Eurotiomycetes) species including other A. niger genes. Bootstrap values for grouping the A. niger FUM genes with the Sordariomycete homologs are 98–100% (Figure 2).

The disagreement of this result with the expected A. niger species relationships is suggestive of horizontal gene transfer between A. niger and an ancestor existing prior to the divergence of F. verticillioides and F. oxysporum. More importantly, it is more likely that the transfer occurred from Sordariomycetes to A. niger (or an ancestor of this species), rather than the opposite. Indeed, the opposite would result in the FUM genes (from F. verticillioides, F. oxysporum, and A. niger) clustering in the Eurotiomycetes subphylum as opposed to the Sordariomycetes as seen in Figure 2. For the FUM6 and FUM15 trees, we used the likelihood ratio test (LRT) to test whether the topologies shown (Figure 2) have significantly higher likelihoods than alternative trees where the A. niger genes were placed in the Eurotiomycetes and constrained to form a monophyletic group. In both cases the topology shown in Figure 2 is significantly more likely than the tree expected if genes were inherited vertically (P < .001 for each).
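
The likelihood ratio comparison mentioned above can be illustrated with a short Python sketch. It computes the generic statistic, twice the difference in log-likelihood between two trees, and converts it to a P value using a chi-square distribution with one degree of freedom. That reference distribution is an assumption made only for illustration: the text does not state which one was used, and because competing topologies are not nested hypotheses, dedicated topology tests are often preferred in practice. The log-likelihood values below are hypothetical.

```python
import math


def lrt_statistic(lnl_best: float, lnl_alt: float) -> float:
    """Twice the log-likelihood difference between two trees."""
    return 2.0 * (lnl_best - lnl_alt)


def chi2_sf_1df(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    if statistic <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods (not values from the paper).
lnl_observed_topology = -1000.0     # tree as inferred
lnl_constrained_topology = -1010.0  # tree forced to match vertical inheritance

stat = lrt_statistic(lnl_observed_topology, lnl_constrained_topology)
p_value = chi2_sf_1df(stat)
print(stat, p_value < 0.001)
```

With these placeholder numbers the statistic is 20, which falls far in the tail of the assumed distribution, so the constrained tree would be rejected at the .001 level.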
Because the clusters in A. niger and in F. verticillioides share only 11 of the 17 known FUM genes (including FUM21), these two types of cluster have probably had a long history of independent evolution, although they certainly share a common ancestor. We conclude that the cluster in A. niger originated by horizontal transfer from an ancestor of F. verticillioides and F. oxysporum.

4. Discussion

In the literature three scenarios for the creation of a gene cluster have been described: horizontal transfer of an existing cluster from one genome to another [30, 31]; the duplication of an ancestral cluster [31]; the de novo creation of a cluster from initially scattered genes that become relocated into one locus [4]. We find that the FUM genes are apparent duplicates of conserved genes in Sordariomycetes (Figures 1 and 2). We think that two of the scenarios we discussed could plausibly account for the observed data. First, the FUM cluster could be the result of horizontal cluster transfer into an ancestor of F. verticillioides (Scenario 4). This scenario is similar to our observation of the ACE1 cluster in A. clavatus [31], and the more recent observation of the horizontal gene transfer of the sterigmatocystin cluster [32]. Secondly, the FUM cluster may have been assembled after recent gene duplication in an ancestor of F. verticillioides (Scenario 3). The latter scenario resembles our previous observations on the DAL gene cluster of S. cerevisiae [4], though it should be noted that the DAL genes code for a catabolic pathway (degradation of allantoin, a secondary nitrogen source), whereas the FUM genes are part of an anabolic pathway (secondary metabolite biosynthesis).

On the other hand we propose that the fumonisin cluster in A. niger was acquired via horizontal gene transfer. It has been shown in recent years that horizontal gene transfer between filamentous fungi is more common than was originally thought. Many independent genes can transfer between distantly related species, such as that observed between an ancestor of A. oryzae and Sordariomycetes [33]; also an entire secondary metabolite cluster has been shown to have horizontally transferred from a relative of M. grisea into an ancestor of A. clavatus [31]. Here again this finding adds to the repertoire of horizontally transferred genes between fungal species and shows that this exchange mechanism is not so uncommon after all. Moreover it shows how an entire cluster can transfer between distantly related species and remain functional in the new species. In addition, the differences between the A. niger and F. verticillioides FUM clusters highlight how a cluster can diverge by adding, removing, or reshuffling the genes.

Our lack of knowledge about what benefit the metabolite confers on the organism hampers our understanding of the selective purpose of this clustering. However, we can be almost certain that the reason behind the clustering is not simply to synthesize the metabolite, which is possible with scattered genes. It is more likely that the selective force involves selection for a tight coregulation of gene expression, perhaps mediated by a LaeA-type universal regulator.

Competition, either between one fungal species and another, or between a fungus and a host species, is likely to result in strong selection on the secondary metabolite repertoire of filamentous fungal species. This arms race between organisms pressures the organism to create new chemical weapons, which are the products of new secondary metabolite gene clusters. It is relatively easy to envisage that neofunctionalization after gene duplication, or partial cluster duplication as appears to have happened in the origins of the ACE1 cluster [31], could result in the production of a new secondary metabolite and so could be selectively advantageous. It is harder to understand why relocating genes, as has happened in the FUM cluster, can be evolutionarily advantageous. One possibility is that the mere act of relocating a gene can have the consequence of changing the end product of a pathway, because the expression of all the genes in a cluster is coordinated. For example, imagine that we have two secondary metabolite biosynthesis pathways, 1 and 2. If a cytochrome P450 oxidoreductase gene that originally functioned in pathway 1 is suddenly relocated so that it becomes coexpressed with the genes in pathway 2 (and no longer coexpressed with pathway 1), it is possible that its product could begin to act on one of the intermediate molecules in pathway 2. The result would be that the products of pathways 1 and 2 are both changed. Alternative possibilities include that there is selection for tighter regulation (e.g., if an intermediate molecule in the pathway is toxic), or that there is epistatic selection for tight linkage between interacting alleles [34].

Acknowledgment

The authors would like to thank the reviewers for their valuable comments and suggestions.

References

[1] L. D. Hurst, C. Pál, and M. J. Lercher, “The evolutionary
dynamics of eukaryotic gene order,” Nature Reviews Genetics,
vol. 5, no. 4, pp. 299–310, 2004.
[2] L. D. Hurst, E. J. B. Williams, and C. Pál, “Natural selection
promotes the conservation of linkage of co-expressed genes,”
Trends in Genetics, vol. 18, no. 12, pp. 604–606, 2002.
[3] G. A. C. Singer, A. T. Lloyd, L. B. Huminiecki, and K. H. Wolfe,
“Clusters of co-expressed genes in mammalian genomes
are conserved by natural selection,” Molecular Biology and
Evolution, vol. 22, no. 3, pp. 767–775, 2005.
[4] S. Wong and K. H. Wolfe, “Birth of a metabolic gene cluster in
yeast by adaptive gene relocation,” Nature Genetics, vol. 37, no.
7, pp. 777–782, 2005.
[5] B. Field and A. E. Osbourn, “Metabolic diversification:
independent assembly of operon-like gene clusters in different
plants,” Science, vol. 320, no. 5875, pp. 543–547, 2008.
[6] X. Qi, S. Bakht, M. Leggett, C. Maxwell, R. Melton, and A.
Osbourn, “A gene cluster for secondary metabolism in oat:
implications for the evolution of metabolic diversity in plants,”
Proceedings of the National Academy of Sciences of the United
States of America, vol. 101, no. 21, pp. 8233–8238, 2004.
[7] N. H. Giles, M. E. Case, and J. Baum, “Gene organization and
regulation in the qa (quinic acid) gene cluster of Neurospora
crassa,” Microbiological Reviews, vol. 49, no. 3, pp. 338–358, 1985.
[8] E. P. Hull, P. M. Green, H. N. Arst, and C. Scazzocchio, “Cloning and physical characterization of the L-proline catabolism gene cluster of Aspergillus nidulans,” Molecular Microbiology, vol. 3, no. 4, pp. 553–559, 1989.
[9] N. P. Keller and T. M. Hohn, “Metabolic pathway gene clusters in filamentous fungi,” Fungal Genetics and Biology, vol. 21, no. 1, pp. 17–29, 1997.
[10] J. W. Cary, P.-K. Chang, and D. Bhatnagar, “Clustered metabolic pathway genes in filamentous fungi,” in Applied Mycology and Biotechnology, G. G. Khachatourians and D. K. Arora, Eds., pp. 165–198, Elsevier, Amsterdam, The Netherlands, 2001.
[11] D. W. Brown, J. H. Yu, H. S. Kelkar et al., “Twenty-five coregulated transcripts define a sterigmatocystin gene cluster in Aspergillus nidulans,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 4, pp. 1418–1422, 1996.
[12] R. H. Proctor, D. W. Brown, R. D. Plattner, and A. E. Desjardins, “Co-expression of 15 contiguous genes delineates a fumonisin biosynthetic gene cluster in Gibberella moniliformis,” Fungal Genetics and Biology, vol. 38, no. 2, pp. 237–249, 2003.
[13] D. W. Brown, R. A. E. Butchko, M. Busman, and R. H. Proctor, “The Fusarium verticillioides FUM gene cluster encodes a Zn(II)2Cys6 protein that affects FUM gene expression and fumonisin production,” Eukaryotic Cell, vol. 6, no. 7, pp. 1210–1218, 2007.
[14] D. W. Brown, R. B. Dyer, S. P. McCormick, D. F. Kendra, and R. D. Plattner, “Functional demarcation of the Fusarium core trichothecene gene cluster,” Fungal Genetics and Biology, vol. 41, no. 4, pp. 454–462, 2004.
[15] N. P. Keller, G. Turner, and J. W. Bennett, “Fungal secondary metabolism—from biochemistry to genomics,” Nature Reviews Microbiology, vol. 3, no. 12, pp. 937–947, 2005.
[16] U. L. Rosewich and H. C. Kistler, “Role of horizontal gene transfer in the evolution of fungi,” Annual Review of Phytopathology, vol. 38, pp. 325–363, 2000.
[17] N. Khaldi, F. T. Seifuddin, G. Turner et al., “SMURF: genomic mapping of fungal secondary metabolite clusters,” Fungal Genetics and Biology, vol. 47, no. 9, pp. 736–741, 2010.
[18] C. Waalwijk, T. Van Der Lee, I. De Vries, T. Hesselink, J. Arts, and G. H. J. Kema, “Synteny in toxigenic Fusarium species: the fumonisin gene cluster and the mating type region as examples,” European Journal of Plant Pathology, vol. 110, no. 5-6, pp. 533–544, 2004.
[19] R. H. Proctor, R. D. Plattner, D. W. Brown, J. A. Seo, and Y. W. Lee, “Discontinuous distribution of fumonisin biosynthetic genes in the Gibberella fujikuroi species complex,” Mycological Research, vol. 108, no. 7, pp. 815–822, 2004.
[20] J. C. Frisvad, J. Smedsgaard, R. A. Samson, T. O. Larsen, and U. Thrane, “Fumonisin B production by Aspergillus niger,” Journal of Agricultural and Food Chemistry, vol. 55, no. 23, pp. 9727–9732, 2007.
[21] S. E. Baker, “Aspergillus niger genomics: past, present and into the future,” Medical Mycology, vol. 44, no. 1, pp. 17–21, 2006.
[22] J. E. Galagan, S. E. Calvo, C. Cuomo et al., “Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae,” Nature, vol. 438, no. 7071, pp. 1105–1115, 2005.
[23] C. A. Cuomo, U. Güldener, J.-R. Xu et al., “The Fusarium graminearum genome reveals a link between localized polymorphism and pathogen specialization,” Science, vol. 317, no. 5843, pp. 1400–1402, 2007.
[24] R. A. Dean, N. J. Talbot, D. J. Ebbole et al., “The genome sequence of the rice blast fungus Magnaporthe grisea,” Nature, vol. 434, no. 7036, pp. 980–986, 2005.
[25] J. E. Galagan, S. E. Calvo, K. A. Borkovich et al., “The genome sequence of the filamentous fungus Neurospora crassa,” Nature, vol. 422, no. 6934, pp. 859–868, 2003.
[26] D. W. Brown, F. Cheung, R. H. Proctor et al., “Comparative analysis of 87,000 expressed sequence tags from the fumonisin-producing fungus Fusarium verticillioides,” Fungal Genetics and Biology, vol. 42, no. 10, pp. 848–861, 2005.
[27] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[28] J. Castresana, “Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis,” Molecular Biology and Evolution, vol. 17, no. 4, pp. 540–552, 2000.
[29] S. Guindon and O. Gascuel, “A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood,” Systematic Biology, vol. 52, no. 5, pp. 696–704, 2003.
[30] N. J. Patron, R. F. Waller, A. J. Cozijnsen et al., “Origin and distribution of epipolythiodioxopiperazine (ETP) gene clusters in filamentous ascomycetes,” BMC Evolutionary Biology, vol. 7, article 174, 2007.
[31] N. Khaldi, J. Collemare, M. H. Lebrun, and K. H. Wolfe, “Evidence for horizontal transfer of a secondary metabolite gene cluster between fungi,” Genome Biology, vol. 9, no. 1, article R18, 2008.
[32] J. C. Slot and A. Rokas, “Horizontal transfer of a large and highly toxic secondary metabolic gene cluster between fungi,” Current Biology, vol. 21, no. 2, pp. 134–139, 2011.
[33] N. Khaldi and K. H. Wolfe, “Elusive origins of the extra genes in Aspergillus oryzae,” PLoS ONE, vol. 3, no. 8, article e3036, 2008.
[34] M. Nei, “Modification of linkage intensity by natural selection,” Genetics, vol. 57, no. 3, pp. 625–641, 1967.

SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 143498, 11 pages
doi:10.4061/2011/143498

Research Article
Computational Analysis Suggests That Lyssavirus Glycoprotein
Gene Plays a Minor Role in Viral Adaptation

Kevin Tang1 and Xianfu Wu2


1 BCFB, DSR, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
2 Rabies, PRB, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA

Correspondence should be addressed to Xianfu Wu, xwu@cdc.gov

Received 15 October 2010; Revised 15 December 2010; Accepted 3 January 2011

Academic Editor: Hiromi Nishida

Copyright © 2011 K. Tang and X. Wu. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Lyssavirus glycoprotein (G) is a membrane protein responsible for virus entry and protective immune responses. To explore
possible roles of the glycoprotein in host shift or adaptation of Lyssavirus, we retrieved 53 full-length glycoprotein gene sequences
from NCBI GenBank. The sequences were from different host isolates over a period of 70 years in 21 countries. Computational
analyses detected 1 recombinant (AY987478, a dog isolate of CHAND03, genotype 1 in India) with incongruent phylogenetic
support. No recombination was detected when AY987478 was excluded from the analyses. We applied different selection models to
identify selection pressure on the glycoprotein gene. One codon at amino acid residue 483 was found to be under weak positive
selection with marginal probability of 95% by using the maximum likelihood method. We found no significant evidence of positive
selection on any site of the glycoprotein gene when the putative recombinant AY987478 was excluded. The computational analyses
suggest that the G gene has been under purifying selection and that the evolution of the G gene may not play a significant role in
Lyssavirus adaptation.

1. Introduction

Positive selection and recombination are important mechanisms in microbial pathogen adaptation to new hosts, resistance to antibiotics, and evasion of immune responses [1]. RNA viruses have high mutation rates because RNA replicases and reverse transcriptases lack both proofreading and postreplicative repair activities [2], which helps RNA viruses adapt to a changing environment. Recombination is a general phenomenon in evolution and plays a significant role in viral fitness [3, 4]. Rabies virus is a single-stranded negative-sense RNA virus belonging to the order Mononegavirales, family Rhabdoviridae, genus Lyssavirus, which causes rabies in all warm-blooded mammals. Host shift and spillover events are frequently reported in rabies [5–9]. The nucleotide substitution rate of lyssaviruses is estimated to be around 10^-4 per site per year [7]. The RNA-dependent RNA polymerase (RdRp or L), together with the phosphoprotein (P), functions as the transcriptase and replicase complex. The glycoprotein (G) is the only outer membrane protein responsible for virus entry and inducing protective immune responses [10, 11]. The role of the G gene in rabies spillover, host shift, and adaptation has not been analyzed thoroughly. Such an analysis could help in understanding viral pathogenesis and in developing a vaccine for a broad spectrum of lyssavirus infections.

Here, we used newly developed computational algorithms as well as traditional methods to investigate potential recombination events and selection pressures in the G gene of lyssaviruses. The dataset comprised 53 full-length glycoprotein gene sequences isolated from different hosts in 21 countries over a period of 70 years. We hypothesized that if rabies infections of different hosts over decades did not lead to positive selection or recombination events in the G gene, the gene does not play a significant role in lyssavirus adaptation.

2. Methods

2.1. Dataset. We chose a dataset that covers lyssavirus isolates from a wide geographic range, over a long period of time, and from various animal hosts. Fifty-three full-length G sequences
from 21 countries isolated over a period of 70 years were retrieved from NCBI GenBank. The sequences were aligned using fast statistical alignment (FSA) [12]. Briefly, FSA is a probabilistic multiple-sequence alignment algorithm that uses a “distance-based” approach to aligning homologous protein, RNA, or DNA sequences. It produces superior alignments of homologous sequences that are subject to very different evolutionary constraints. The nucleotide (nt) sequence alignment of the lyssavirus G genes was corrected manually by visual inspection using the amino acid sequence alignment. Gaps were removed if they existed in the majority of the sequences.

2.2. Phylogenetic Analyses. A phylogenetic tree was reconstructed using the neighbor-joining algorithm in the MEGA 4 package [13]. The maximum composite likelihood model was used, with the pairwise deletion option for gaps. The statistical significance of the phylogeny was measured by bootstrap with 1,000 replicates.

2.3. Recombination Detection. We first applied the PHI [14], NSS [15], and Max χ² [16] tests (implemented in PhiPack [14]) with 1,000 permutations to detect recombination. Sequences involved in the recombination and breakpoints were determined using 3SEQ [17] and GARD, implemented in the Datamonkey web interface [18, 19]. The recombination was further verified by bootscanning and phylogenetic incongruence analysis. Bootscanning was performed using SimPlot software version 3.5.1 [20]. The parameters for bootscanning were window size, 200 bp; step, 10 bp; GapStrip, on; bootstrap replicates, 1,000; distance model, Kimura (2-parameter); tree algorithm, neighbor-joining.

2.4. Selection Analyses. To test for positive selection on sites of the G gene in lyssaviruses, the Codeml program in the PAML software package version 4.4 was employed [21]. Codeml implements the maximum likelihood method to test whether positive selection has taken place at sites within a gene. This method uses different codon substitution models to estimate the number of nonsynonymous (dN) and synonymous (dS) substitutions per site among codons, since different amino acids in a protein could be under different selective pressures, thus creating different ω (dN/dS) ratios. The models in our analyses were M0 (one-ratio), M1 (nearly neutral), M2 (positive selection), M7 (β distribution), and M8 (β + ω > 1) [22]. The M0 model estimates an overall ω for the data. The M1 model estimates a proportion p0 of codon sites with ω0 < 1 and a proportion p1 (p1 = 1 − p0) with ω1 = 1. The M2 model allows an additional class of positively selected sites with proportion p2 (p2 = 1 − p1 − p0) and with ω2 estimated from the data. The M7 model specifies that ω follows a beta distribution, with the value of ω allowed to vary between 0 and 1. Parameters p and q of the beta distribution are estimated from the data in the M7 model. In the M8 model, a proportion p0 of sites has ω drawn from the beta distribution, and the remaining proportion p1 of sites is assumed to be positively selected. Two sets of comparisons (M2 versus M1, M8 versus M7) were made to test the hypothesis of selection. Within each comparison, the likelihood ratio test statistic used to determine the level of significance was calculated as twice the difference of the likelihood scores (2Δl) estimated by each model. The significance was determined under the χ² distribution. The degrees of freedom for the M1 versus M2 and M7 versus M8 tests are 2 [22]. If M8 or M2 is significantly favored and it contains codons with ω > 1, positive selection is significantly evident. Posterior probabilities of the inferred positively selected sites were estimated by the Bayes empirical Bayes (BEB) approach [23].

We also applied single-likelihood ancestor counting (SLAC), fixed-effects likelihood (FEL), and random-effects likelihood (REL) [18] to identify selection pressure on individual codons of the G gene in lyssaviruses.

3. Results

3.1. Recombination Analyses. Our dataset covered lyssaviruses isolated over a period of 70 years from 21 countries (Table 1), including the New and Old Worlds. The hosts included bats, cows, dogs, foxes, humans, raccoons, sheep, and skunks.

The PHI and Max χ² tests suggested significant evidence of recombination in the G gene. With 1,000 permutations, the P-values of the PHI and Max χ² tests were .006 and 0, respectively. However, no significant evidence (P = .796) of recombination was detected using the NSS test.

Using 3SEQ, 6 long recombinant sequences (>100 bp) were detected: AF233275, AY237121, AY987478, DQ074978, DQ849071, and L04523 (Table 2). Two breakpoints were identified in all recombinants. The first breakpoint was at a nucleotide position between 400 and 800. The second breakpoint was around nucleotide position 1080. However, the two breakpoints for DQ849071 and L04523 were at the very beginning and around nucleotide position 109, respectively.

The GARD analysis also suggested evidence of recombination, with significant topological incongruence at the 2 breakpoints (Table 3). The first breakpoint was at nucleotide position 441 and the second at nucleotide position 1089. The significance value for the 2 breakpoints was 0.01. The left-hand side (LHS) and right-hand side (RHS) P-values for the 2 breakpoints were .0004.

We analyzed the recombination events by bootscanning as implemented in SimPlot. Sequence AY987478 was used as the query sequence in all four cases (Figures 1(a)–1(d)). The analysis confirmed the recombination event in the G gene of lyssavirus. High bootstrap values support clustering of sequence AY987478 with AF325489 (Figures 1(a) and 1(b)) and with AY237121 (Figures 1(c) and 1(d)) at positions from 1 to around 440 and from around 1130 to the end of the sequences. The bootstrap values are also high for clustering AY987478 with AF233275 (Figures 1(a) and 1(c)) and DQ074978 (Figures 1(b) and 1(d)) at positions from around 540 to 1000. The switches of the high bootstrap values at nucleotide positions from around 440 to 540 and from 1000 to 1130 indicate two possible breakpoints for the recombination.
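
The likelihood ratio test described in Section 2.4 can be sketched in a few lines. This is a minimal illustration, not the Codeml implementation: the function names are hypothetical, the likelihood scores are the rounded values reported in Table 4, and the P-value uses the closed form of the χ² survival function for 2 degrees of freedom, exp(−x/2).

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Twice the log-likelihood difference (2*delta_l) between nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df2(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 d.f.: exp(-x/2)."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


# Likelihood scores (l) as reported, rounded, in Table 4.
lnl = {"M1": -24010.40, "M2": -24010.40, "M7": -23443.16, "M8": -23434.07}

for null, alt in (("M1", "M2"), ("M7", "M8")):
    stat = lrt_statistic(lnl[null], lnl[alt])
    print(f"{alt} versus {null}: 2*delta_l = {stat:.2f}, P = {chi2_sf_df2(stat):.4f}")
```

Because the scores in Table 4 are rounded to two decimals, small differences from the reported P-values are to be expected, particularly for the M2 versus M1 comparison, where the two scores are identical at this precision.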
Table 1: Sequences of glycoprotein gene used in this study.

Accession no.  Country        Host            Year of isolation  Strain/isolate        Genotype  References
AB115921       Indonesia      Dog             2001               SN01-23               GT1       Unpublished
AF233275       India          Sheep           -                  PV11                  GT1       Unpublished
AF298141       USA            Bat             1979               USA7-BT               GT1       Badrane et al. [24]
AF298142       Poland         Bat             1985               EBL1POL               GT5       Badrane et al. [24]
AF298143       France         Bat             1989               EBL1FRA               GT5       Badrane et al. [24]
AF298144       Finland        Bat             1986               EBL2FIN               GT6       Badrane et al. [24]
AF298145       Holland        Bat             1986               EBL2HOL               GT6       Badrane et al. [24]
AF298146       S. Africa      Bat             1970               DuvSAF1               GT4       Badrane et al. [24]
AF298147       S. Africa      Bat             1981               DuvSAF2               GT4       Badrane et al. [24]
AF325487       Malaysia       Human           1985               MAL1-HM               GT1       Badrane and Tordo [7]
AF325489       Nepal          Dog             1989               NEP1-DG               GT1       Badrane and Tordo [7]
AF325490       French         Bovine          1985               GUY1-BV               GT1       Badrane and Tordo [7]
AF325491       Brazil         Bovine          1986               BRA1-BV               GT1       Badrane and Tordo [7]
AF325492       Mexico         Bat             1987               MEX2-VP               GT1       Badrane and Tordo [7]
AF325494       USA            Bat             1981               USA8-BT               GT1       Badrane and Tordo [7]
AF325495       USA            Bat             1982               USA9-BT               GT1       Badrane and Tordo [7]
AF401285       Thailand       -               -                  8743THA               GT1       Unpublished
AF426297       Australia      Bat             1997               ABLSF12NB             GT7       Guyatt et al. [25]
AF426298       Australia      Bat             1997               ABLSF11KW             GT7       Guyatt et al. [25]
AJ871962       China          Vaccine         -                  PM                    GT1       Unpublished
AY009098       China          Human           1986               CNX8601               GT1       Tang et al. [26]
AY009099       China          Human           1986               CNX8511               GT1       Tang et al. [26]
AY009100       China          Dog (Vaccine)   1955               CTN                   GT1       Tang et al. [26]
AY237121       India          Dog             -                  RVD                   GT1       Unpublished
AY257980       Thailand       Human           -                  HM65                  GT1       Hemachudha et al. [27]
AY257982       Thailand       Human           -                  HM88                  GT1       Hemachudha et al. [27]
AY257983       Thailand       Human           -                  HM208                 GT1       Hemachudha et al. [27]
AY987478       India          Dog             1999               CHAND03               GT1       Unpublished
D14873         Japan          Vaccine         -                  RC-HL                 GT1       Unpublished
D16330         Japan          Vaccine         -                  RC-HL                 GT1       Ito et al. [28]
DQ074978       India          Dog             -                  -                     GT1       Agrawal et al. [29]
DQ076097       S. Korea       Bovine          -                  SKRBV0404HC           GT1       Hyun et al. [30]
DQ076099       S. Korea       Dog             -                  SKRRD9903YG           GT1       Hyun et al. [30]
DQ767897       China          Vaccine         -                  CTN-35                GT1       Unpublished
DQ849071       China          Dog             1994               GX4                   GT1       Meng et al. [31]
DQ849072       China          Dog             1992               CQ92                  GT1       Meng et al. [31]
L04522         China          Vaccine (Dog)   1931               3aG                   GT1       Bai et al. [32]
L04523         China          Vaccine (dog)   1993               CGX89-1               GT1       Bai et al. [32]
L40426         -              -               -                  CVS                   GT1       Yelverton et al. [33]
M81058         Algeria        Dog             -                  ALG1-DG               GT1       Benmansour et al. [34]
M81059         Algeria        Human           -                  -                     GT1       Benmansour et al. [34]
M81060         Algeria        Human           -                  -                     GT1       Benmansour et al. [34]
U03765         Canada         Vulpes          -                  8480FX                GT1       Nadin-Davis et al. [35]
U03766         Arctic Circle  Dog             1992               Arctic A1-1090DG      GT1       Nadin-Davis et al. [35]
U03767         Canada         Dog             1993               Hudson Bay-4055DG     GT1       Nadin-Davis et al. [35]
U11736         Canada         Arctic Fox      -                  91RABN1035            GT1       Nadin-Davis et al. [36]
U11755         Canada         Skunk           -                  91RABN1578            GT1       Nadin-Davis et al. [36]
U27214         USA            Raccoon         -                  NY 516                GT1       Nadin-Davis et al. [37]
U27215         USA            Raccoon         -                  NY 771                GT1       Nadin-Davis et al. [37]
U27216         USA            Raccoon         -                  FLA 125               GT1       Nadin-Davis et al. [37]
U27217         USA            Raccoon         -                  PA R89                GT1       Nadin-Davis et al. [37]
U52946         USA            Bat             1994               SHBRV                 GT1       Morimoto et al. [38]
X69122         India          Vaccine         -                  Flury                 GT1       Unpublished
Table 2: Recombination detection in glycoprotein gene of lyssavirus by using 3SEQ.

P Q C P-value Dunn-Sidak corrected P-value Breakpoints
M81058 AY987478 AF233275 0 2.08E − 08 432–440, 1080–1089 456–496, 1080–1089
M81060 AY987478 AF233275 1E − 12 1.31E − 07 432–440, 1080–1089 456–496, 1080–1089
AY987478 M81059 AY237121 0 2.81E − 11 441–455, 1077–1079
AY987478 M81058 AY237121 0 2.13E − 13 441–455, 1077–1079
AY987478 M81060 AY237121 0 1.13E − 13 441–455, 1077–1079
AY987478 AF233275 AY237121 1.3E − 10 1.88E − 05 432–455, 1068–1089 465–518, 1068–1089
AY987478 L04522 AY237121 1.1E − 08 1.48E − 03 627–638, 1077–1089 663–666, 1077–1089
AY987478 AF325489 AY237121 0 2.71E − 15 700-701, 1077–1097
AY987478 U11755 AY237121 3.2E − 10 4.42E − 05 717–719, 1077–1082 729–734, 1077–1082
AY987478 U11736.2 AY237121 3.3E − 09 4.61E − 04 717–719, 1077–1082 729–734, 1077–1082
AY987478 DQ849071 AY237121 6.1E − 11 8.61E − 06 736-737, 1077–1079
AY987478 DQ076097 AY237121 1.2E − 10 1.69E − 05 630–638, 1077–1089 699–701, 1077–1089
AY987478 DQ076099 AY237121 9E − 12 1.31E − 06 700-701, 1077–1089 714–719, 1077–1089
AY987478 L04523 AY237121 2.2E − 09 3.04E − 04 736-737, 1077–1079
AY987478 X69122 AY237121 4E − 12 6.00E − 07 666–669, 1032–1049 666–669, 1077–1089
AY987478 AY009098 AY237121 4E − 12 4.99E − 07 693–701, 1077–1079 705–711, 1077–1079
AY987478 AY009099 AY237121 4E − 12 4.99E − 07 693–701, 1077–1079 705–711, 1077–1079
AY987478 DQ849072 AY237121 2.1E − 11 3.02E − 06 693–701, 1077–1079 705–711, 1077–1079
AY987478 AJ871962 AY237121 1E − 12 7.29E − 08 750–794, 1077–1089
AY987478 AF325487 AY237121 0 1.36E − 08 780–794, 1077–1079
AY987478 L40426 AY237121 4.8E − 11 6.71E − 06 750–794, 1077–1089
AY987478 AF401285 AY237121 0 9.99E − 10 780–795, 1077–1079
AY987478 AY257983 AY237121 2.3E − 11 3.27E − 06 780–795, 1077–1079
AY987478 AY257980 AY237121 0 5.70E − 09 750–761, 1077–1079 780–795, 1077–1079
AY987478 AY257982 AY237121 5.9E − 11 8.33E − 06 780–795, 1032–1043 780–795, 1077–1079
AY987478 DQ767897 AY237121 1E − 07 1.46E − 02 759–767, 972–974
AY987478 U52946 AY237121 5.3E − 08 7.43E − 03 741–748, 900-901 741–748, 918–938
AY987478 U03766 AY237121 2.4E − 07 3.27E − 02 717–719, 876–889 717–719, 894–914
AY987478 U03765 AY237121 2.6E − 07 3.55E − 02 717–719, 876–889 717–719, 894–914
AY237121 AF233275 AY987478 0 1.38E − 39 432–452, 1077–1089
AY237121 DQ074978 AY987478 0 1.16E − 38 432–452, 1077–1089
AY237121 L04522 AY987478 0 7.15E − 25 627–647, 1065–1089
AY237121 DQ076097 AY987478 0 1.47E − 08 630–647, 1056–1058 630–647, 1065–1089
AY237121 U03767 AY987478 1E − 11 1.42E − 06 630–638, 1041–1058 630–638, 1065–1079
AY237121 AJ871962 AY987478 0 6.36E − 20 642–647, 1041–1058 642–647, 1065–1089
AY237121 X69122 AY987478 0 3.16E − 26 642–647, 1041–1049 654–659, 1041–1049
AY237121 L40426 AY987478 0 1.38E − 17 642–647, 1041–1058 642–647, 1065–1089
AY237121 M81058 AY987478 0 9.60E − 21 441–452, 1065–1079 618–710, 1065–1079
AY237121 M81060 AY987478 0 6.49E − 23 441–452, 1065–1079 618–710, 1065–1079
AY237121 D14873 AY987478 0 6.82E − 17 685–701, 1065–1085 705–710, 1065–1085
AY237121 D16330 AY987478 0 5.23E − 17 685–701, 1065–1085 705–710, 1065–1085
AY237121 AY257980 AY987478 5.3E − 08 7.49E − 03 705–710, 1041–1046
AY237121 DQ849071 AY987478 1.9E − 09 2.68E − 04 708–710, 1041–1046
AY237121 L04523 AY987478 1.3E − 07 1.85E − 02 708–710, 1041–1046
AY237121 DQ076099 AY987478 0 1.07E − 08 634–647, 1056–1058 634–647, 1065–1089
AY237121 U11755 AY987478 1E − 12 1.60E − 07 630–647, 1056–1058 630–647, 1065–1082
AY237121 U11736.2 AY987478 0 2.29E − 08 630–647, 1056–1058 630–647, 1065–1082
AY237121 DQ767897 AY987478 1.7E − 09 2.37E − 04 708–710, 1017–1022 708–710, 1041–1046
AY237121 AY009098 AY987478 1.8E − 08 2.58E − 03 705–710, 1041–1046 736-737, 1041–1046
AY237121 AY009099 AY987478 1.8E − 08 2.58E − 03 705–710, 1041–1046 736-737, 1041–1046
AY237121 AF325487 AY987478 7.5E − 10 1.06E − 04 705–710, 1041–1046 732–734, 1041–1046
AY237121 U03766 AY987478 1.2E − 09 1.75E − 04 630–638, 1041–1058 630–638, 1065–1079
AY237121 U03765 AY987478 2.3E − 10 3.26E − 05 630–638, 1041–1058 630–638, 1065–1079
AY237121 M81059 AY987478 0 1.27E − 18 441–452, 993–998 441–452, 1017–1034
AY237121 AF325490 AY987478 9.2E − 09 1.30E − 03 705–710, 993–995 705–710, 1017–1019
AY237121 AF325491 AY987478 7E − 12 9.93E − 07 705–710, 993–995
AY237121 AF325492 AY987478 9.6E − 09 1.34E − 03 700-701, 993–995 705–710, 993–995
AY237121 DQ849072 AY987478 1.3E − 07 1.83E − 02 705–710, 924–935 705–710, 945–950
AY237121 AY009100 AY987478 3.4E − 07 4.62E − 02 708–710, 885–887 708–710, 924–938
AY237121 AF401285 AY987478 6.4E − 09 8.95E − 04 736-737, 883–887
AY237121 AY257983 AY987478 9.9E − 08 1.38E − 02 732–734, 883–887 732–734, 1041–1046
M81059 AY987478 DQ074978 0 9.05E − 10 519–522, 1080–1089
M81058 AY987478 DQ074978 0 4.12E − 09 519–522, 1080–1089
M81060 AY987478 DQ074978 0 2.80E − 08 519–522, 1080–1089
AY009100 M81059 DQ849071 2E − 08 2.80E − 03 0–3, 108–119
AY009100 M81058 DQ849071 3E − 08 4.20E − 03 0–3, 108–119
AY009100 M81060 DQ849071 1.2E − 07 1.70E − 02 0–3, 108–119
AY009100 AJ871962 DQ849071 2.2E − 07 3.02E − 02 0–3, 108–110
AY009100 M81059 L04523 9.6E − 09 1.34E − 03 0–3, 108–119
AY009100 M81058 L04523 1.5E − 08 2.04E − 03 0–3, 108–119
AY009100 M81060 L04523 6.7E − 08 9.37E − 03 0–3, 108–119 0–3, 139–161
AY009100 AJ871962 L04523 1.3E − 07 1.88E − 02 0–3, 108–110
Note: P and Q are putative parent sequences, and C is the putative child sequence in the recombination.
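
The relationship between the two P-value columns of Table 2 can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: that the Dunn Sidak column is the usual correction 1 − (1 − p)^m, and that the number of comparisons m counts every ordered parent-parent-child triple among the 53 sequences. The function name is hypothetical.

```python
import math


def dunn_sidak(p: float, m: int) -> float:
    """Dunn-Sidak corrected P-value, 1 - (1 - p)**m, computed stably for tiny p."""
    return -math.expm1(m * math.log1p(-p))


n_sequences = 53
# Assumption: each ordered (P, Q, C) triple of distinct sequences is one comparison.
n_comparisons = n_sequences * (n_sequences - 1) * (n_sequences - 2)

# Uncorrected P-values taken from a few rows of Table 2 (rounded as printed there).
for p in (1e-12, 2e-8, 1.2e-7):
    print(f"P = {p:.1e} -> corrected P = {dunn_sidak(p, n_comparisons):.2e}")
```

Under these assumptions the corrected values have the same magnitude as those listed in Table 2; exact agreement is not expected, because the uncorrected P-values in the table are rounded.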

Table 3: KH tests verify the significance of breakpoints estimated by GARD analysis.

Breakpoint   LHS P-value   RHS P-value   Significance
441          .00040        .00040        0.01
1089         .00040        .00040        0.01

Since recombination with 2 breakpoints was predicted by 3SEQ, GARD, and bootscanning, we constructed one phylogenetic tree from the sequences spanning the beginning to the first breakpoint together with those from the second breakpoint to the end (Figure 2(a)), and a second phylogenetic tree from the sequences between the two breakpoints (Figure 2(b)). The reconstructed trees placed the putative recombinant AY987478 in conflicting topological positions: it clustered with AY237121 and AF325489 in Figure 2(a) but with DQ074978 and AF233275 in Figure 2(b). None of the other 5 putative recombinants presented phylogenetic incongruence. The same result was also verified by GARD (data not shown). When AY987478 was excluded from the dataset, the P-values of PHI, Max χ², and NSS were .121, .209, and .791, respectively, suggesting no evidence of recombination. The GARD analysis did not indicate evidence of recombination either.

3.2. Selection Pressure Analyses. The selection pressure analysis of the glycoprotein gene using PAML is presented in Table 4. The likelihood ratio test statistic (2Δl) estimated by M2 and M1 was 0. The corresponding P-value was .99, which is not significant and does not reject the null hypothesis of nearly neutral selection in M1. In the comparison between the null neutral site model (M7) and the selection model (M8), the 2Δl was 18.18 and the corresponding P-value was .0001, indicating that the positive selection model was significantly favored over the null neutral site model. Posterior probabilities of the inferred positively selected sites estimated by the BEB approach are shown in Table 5. Four amino acid sites at 466, 483, 486, and 490 were identified to be under positive selection. However, only the site at position 483 had marginally significant support, with a posterior probability of 95% and weak positive selection pressure with ω of 1.466. The corresponding posterior probabilities for sites 466, 486, and 490 were 68%, 56%, and 82%, respectively.

To test the effect of recombination on the positive selection analysis, we excluded the putative recombinant AY987478 from the dataset. Similar results were observed, and the BEB posterior probability supports for amino acid sites under positive selection were nonsignificant (Table 5). When all six putative recombinants were excluded in our analysis, no evidence was found to support positive selection in either the M2 versus M1 or the M8 versus M7 comparison (data not shown). In all cases, the ω in M0 was either 0.07 or 0.08. Overall, 87% of the sites in the G gene had a very low ω value of 0.05 in M1 and M2, indicating strong selective constraints on those sites.

To study the effect of viral passages and possible genetic bottlenecks on the results, we repeated the analysis with a dataset excluding six vaccine sequences and the sequence
[Figure 1, panels (a)–(d): bootscan plots with query sequence AY987478. x-axis: nucleotide position (100–1500); y-axis: permuted trees (%), 0–100. Reference sequences: (a) AF325489, AF233275, M81060; (b) AF325489, DQ074978, M81060; (c) AY237121, AF233275, M81060; (d) AY237121, DQ074978, M81060.]

Figure 1: Bootscanning analysis of recombination in glycoprotein gene of lyssavirus by using the SimPlot program with a window size of
200 nucleotides and a step size of 10 nucleotides.
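
The window layout behind such a bootscan can be sketched as follows. This is only an illustration of a 200-nucleotide window moved in 10-nucleotide steps, not SimPlot's implementation: the function name is hypothetical, the alignment length of 1,572 is taken from the coordinates in the Figure 2 caption, and the handling of the final partial window (dropped here) and the use of window midpoints as plotting positions are assumptions.

```python
def sliding_windows(alignment_length: int, window: int = 200, step: int = 10):
    """Yield (start, end) 1-based inclusive coordinates of each full window."""
    for start in range(1, alignment_length - window + 2, step):
        yield start, start + window - 1


# Assumed alignment length of 1,572 nucleotides (see the Figure 2 caption).
windows = list(sliding_windows(1572))

# Assumption: each window's bootstrap support is plotted at the window midpoint.
midpoints = [(start + end) // 2 for start, end in windows]

print(len(windows), windows[0], windows[-1])
```

In a bootscan, a tree is built with bootstrap resampling for each such window, and the percentage of trees grouping the query with each reference is plotted along the alignment.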

AF233275 (PV11), which came from lyssaviruses under intensive cell culture. We found no significant evidence for positive selection pressure on any site of the G gene.

Analyses using SLAC, REL, and FEL found no evidence of any amino acid in the G gene under positive selection; instead, most of the amino acids were found to be under negative selection (Table 6). One site at position 416 was under marginal positive selection by FEL with a P-value of .0999, narrowly passing the significance level of 0.1. However, this result was not supported by SLAC and REL.

4. Discussion

Lyssaviruses can infect all warm-blooded mammals, and spillover events and host shifts have been well documented [5–9]. The molecular mechanism of rabies infection and transmission is still not completely understood, and these phenomena are usually linked to the rabies virus G protein, since G is the only membrane protein responsible for virus entry both in vitro and in vivo. Therefore, it is a reasonable assumption that rabies virus
[Figure 2, panels (a) and (b): neighbor-joining trees of the 53 glycoprotein gene sequences, with bootstrap values on the branches; scale bar, 0.05.]

Figure 2: (a) NJ phylogenetic tree of 53 glycoprotein gene sequences with regions concatenated from position of 1 to 441 and position of
1090 to 1572. Bootstrap values of 1000 replicates are shown above the branches. The red marker represents the putative recombinant. (b) NJ
phylogenetic tree of 53 glycoprotein gene sequences with region from position of 441 to 1089. Bootstrap values of 1000 replicates are shown
above the branches. The red marker represents the putative recombinant.

Table 4: Parameter estimates, dN/dS ratio, likelihood score, and test statistics under models of variable ω ratios among sites for the glycoprotein gene in lyssavirus.

Model | Parameter estimates | dN/dS | Likelihood score (l) | Model comparison (2Δl, d.f., P) | Positive selection
M0: one ratio | ω = 0.08 | 0.08 | −24586.10 | | None
M1: Nearly neutral | ω0 = 0.05, ω1 = 1, (p0 = 0.87, p1 = 0.13) | 0.17 | −24010.40 | | Not allowed
M2: Positive selection | ω0 = 0.05, ω1 = 1, ω2 = 1, (p0 = 0.87, p1 = 0.06, p2 = 0.07) | 0.17 | −24010.40 | M2 versus M1: 0, d.f. = 2, P = .99 | None
M7: β, Neutral | p = 0.26, q = 2.11 | 0.10 | −23443.16 | | Not allowed
M8: β + ω > 1, Selection | p0 = 0.98, p = 0.28, q = 2.92, (p1 = 0.02), ω = 1.0 | 0.10 | −23434.07 | M7 versus M8: 18.18, d.f. = 2, P = .0001 | See Table 6
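
The model comparisons in Table 4 are likelihood ratio tests, and the reported statistics can be checked from the likelihood scores alone. The Python sketch below is a minimal illustration, not the PAML implementation; the function name is an assumption introduced here. It relies on the fact that the chi-square survival function with 2 degrees of freedom reduces to exp(−x/2).

```python
import math


def lrt_two_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by 2 parameters.

    Returns (2 * delta lnL, P-value), where the P-value is the chi-square
    upper tail with 2 degrees of freedom, i.e. exp(-statistic / 2).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Likelihood scores copied from Table 4.
m7_vs_m8 = lrt_two_df(lnl_null=-23443.16, lnl_alt=-23434.07)
m1_vs_m2 = lrt_two_df(lnl_null=-24010.40, lnl_alt=-24010.40)

print("M7 versus M8: 2*dl = %.2f, P = %.4f" % m7_vs_m8)
print("M1 versus M2: 2*dl = %.2f, P = %.2f" % m1_vs_m2)
```

With identical likelihood scores for M1 and M2 the statistic is 0, so M2 offers no improvement; for M7 versus M8 the statistic of 18.18 gives a P-value of roughly .0001, in line with the table.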

Table 5: Positive selection sites in the glycoprotein gene predicted by using Bayes empirical analysis under different PAML models.

Codon (Dataset I) | Codon (Dataset II) | Amino acid (Dataset I) | Amino acid (Dataset II) | Posterior probability (Dataset I) | Posterior probability (Dataset II) | Post mean ± S.E. (Dataset I) | Post mean ± S.E. (Dataset II)
466 | 466 | A | A | 0.68 | 0.72 | 1.27 ± 0.35 | 1.29 ± 0.34
483 | 483 | V | V | 0.95 | 0.84 | 1.46 ± 0.16 | 1.39 ± 0.26
486 | 486 | T | T | 0.56 | 0.53 | 1.19 ± 0.36 | 1.16 ± 0.36
490 | 490 | Q | Q | 0.82 | 0.80 | 1.38 ± 0.27 | 1.36 ± 0.29

Dataset I: all 53 nucleotide sequences. Dataset II: AY987478 was excluded.

adaptation is due to the G gene. Positive selection is an important evolutionary force that drives adaptation. It is not surprising that evolutionary scientists first applied selection analysis to the G gene of lyssaviruses [39]. One notable difference between the previous investigations and our study was the dataset. The previous dataset of 55 complete G gene sequences came from isolates of natural rabies infections, excluding passaged and vaccine strains. Our dataset included 53 sequences from street rabies isolates and vaccine strains collected over a period of 70 years from 21 countries. The neutrality tests on the G gene in lyssavirus indicated that the protein was under negative selection. Analysis of heterogeneous selective pressures on the amino acid sites across the gene found no evidence for positive selection on any site when the putative recombinant AY987478 was excluded. Instead, most of the sites were under strong negative selection, which was consistent with previous investigations using only street rabies isolates [39, 40]. The only weak positive selection identified by our analyses was at amino acid residue 483 (not in the ectodomain). No positive selection was detected in the main epitope II or III, the sites of virus escape identified by monoclonal antibody binding selections in vitro. It is possible that the results were confounded by the sequences from isolates under intensive cell culture. Repeated passages of an RNA virus resulted in loss of fitness due to Muller’s ratchet [41]. Serial virus passages severely reduce population size when a small founder population is reintroduced into an identical unpopulated environment, which may lead to the stochastic loss of certain genotypes, especially the rare genotypes [42, 43]. However, exclusion of sequences of passaged lyssaviruses from the dataset in this study did not affect the readout of the analyses. It appears that rabies spillover, naturally occurring host shift, virus escape by monoclonal antibody selection, and vaccine strains (under various in vitro and in vivo conditions) are not the result of positive selection in the G gene.

Recombination is another important evolutionary driving force in adaptation, and it is a mechanism that prevents the accumulation of deleterious substitutions [44]. It allows the acquisition of multiple genetic changes in a single step and can combine genetic information to produce advantageous genotypes. It may be important for incremental host adaptation after switching to a new host has occurred [45]. Recombination in rabies viruses had been proposed, but it was not thoroughly inspected [46, 47]. Our study suggested one recombinant event. The recombinant sequence AY987478 was from a dog isolate (CHAND03, genotype 1), and the possible parental sequences were isolated from dogs and sheep from the same geographic area (India and Nepal). However, the putative recombinant AY987478 could be an artifact from sequencing or sample contamination. Generation of recombinants in the course of reverse transcription of RNA and subsequent PCR is a well-known phenomenon [48–50]. In the bootscanning analysis in this research, the 3 prime and 5 prime regions of AY987478 clustered with the putative parents with a bootstrap value of 100%, indicating little difference between the two sequences in the two regions. On inspection of the sequences, there are regions about 450 bases long that are identical between the recombinant and the corresponding parent, which is

Table 6: Detection of selection pressure on glycoprotein gene using methods implemented in the Datamonkey website.

Dataset | Mean dN/dS | Positive selection sites (SLAC / FEL / REL) | Negative selection sites (SLAC / FEL / REL) | Codon (P-value)
Dataset I | 0.1226; 0.1278 | 0 / 0 / 0 | 397 / 418 / 0 |
Dataset II | 0.1231; 0.1274 | 0 / 0 / 0 | 391 / 417 / 0 |
Dataset III | 0.1214; 0.1233 | 0 / 1 / 0 | 386 / 416 / 0 | 416 (.0999)

Dataset I: all 53 nucleotide sequences. Dataset II: AY987478 was excluded. Dataset III: the six putative recombinants were excluded.

rare considering the high mutation rate in RNA viruses. The homologous recombination rate in negative-sense RNA viruses was found to be low [46], which is supported by a recent report that homologous recombination is very rare or absent in influenza A virus [17]. Further experimentation is needed to prove that the recombinant AY987478 is not an artifact.

In summary, we did not find significant support for positive selection pressure on the G gene in lyssavirus isolates from different rabies hosts and vaccine strains that cover 70 years of evolution in 21 countries. The recombination analysis suggested an orphan event that needs further investigation. It appears that evolution of the G gene may not play a major role in lyssavirus adaptation. This is surprising considering the functions of the glycoprotein in lyssavirus infection. It has been reported that host switching from chiropters to carnivores has occurred in lyssavirus evolutionary history [7, 9]. Spillovers of lyssaviruses from chiropters to other animals may have happened repeatedly and still occur [8]. Transmission of European bat lyssavirus 1 (EBLV-1) was reported in sheep [51], stone marten [52], and cats [53]. For a successful spillover and subsequent adaptation, there must be effective cross-species viral exposure and compatibility between the virus and the new host to allow replication and transmission. Lyssavirus infections are typically transmitted by the virus-laden saliva of a rabid animal via a bite or scratch, which can facilitate cross-species viral exposures. The initial viral interaction with cells of a new host plays a critical role in determining host specificity and host shift [45]. For example, a feline parvovirus acquired the ability to infect dogs through changes in its capsid protein that binds to the canine transferrin receptor on canine cells [54]. Lyssavirus G is a surface glycoprotein responsible for receptor recognition and membrane fusion [7–9, 55]. It is reasonable to expect that the protein is under positive selection pressure during viral adaptation to the new host. The lack of positive selection in the G glycoprotein suggests that the virus is not subject to strong immune selection [25]. The G gene may escape the immunity of the host since lyssaviruses migrate from the peripheral to the central nervous system [7]. A recent investigation demonstrated that frequencies of both cross-species transmission and host shifts diminished with increasing phylogenetic distance between bat species [9], indicating that the virus, and thus the G gene, is subject to less selection pressure in a similar host and cellular environment [7, 25]. However, the G gene might have been under relatively low positive selection that was not detected by current computational methods. More sensitive methods, or properly relaxed statistical significance stringency together with experimental verification, may help identify the role of the G gene in lyssavirus adaptation.

Acknowledgments

The authors thank Jan Pohl, Elizabeth Neuhaus, and Charles E. Rupprecht for support in this investigation. They also thank Kathryn Kellar and Scott Sammons for helpful suggestions to the paper. Use of trade names and commercial sources is for identification only and does not imply endorsement by the U.S. Department of Health and Human Services. The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the funding agency.

References

[1] T. Lefébure and M. J. Stanhope, “Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition,” Genome Biology, vol. 8, no. 5, pp. R71.1–R71.16, 2007.
[2] D. A. Steinhauer, E. Domingo, and J. J. Holland, “Lack of evidence for proofreading mechanisms associated with an RNA virus polymerase,” Gene, vol. 122, no. 2, pp. 281–288, 1992.
[3] K. Kirkegaard and D. Baltimore, “The mechanism of RNA recombination in poliovirus,” Cell, vol. 47, no. 3, pp. 433–443, 1986.
[4] M. Worobey and E. C. Holmes, “Evolutionary aspects of recombination in RNA viruses,” Journal of General Virology, vol. 80, no. 10, pp. 2535–2543, 1999.
[5] D. M. Pfukenyi, D. Pawandiwa, P. V. Makaya, and U. Ushewokunze-Obatolu, “A retrospective study of wildlife rabies in Zimbabwe,” Tropical Animal Health and Production, vol. 41, no. 4, pp. 565–572, 2009.
[6] A. I. Wandeler, S. A. Nadin-Davis, R. R. Tinline, and C. E. Rupprecht, “Rabies epidemiology: some ecological and evolutionary perspectives,” in Lyssaviruses, C. E. Rupprecht, B. Dietzschold, and H. Koprowski, Eds., pp. 297–324, Springer, Berlin, Germany, 1994.
[7] H. Badrane and N. Tordo, “Host switching in Lyssavirus history from the chiroptera to the carnivora orders,” Journal of Virology, vol. 75, no. 17, pp. 8096–8104, 2001.
[8] L. K. Crawford-Miksza, D. A. Wadford, and D. P. Schnurr, “Molecular epidemiology of enzootic rabies in California,” Journal of Clinical Virology, vol. 14, no. 3, pp. 207–219, 1999.
[9] D. G. Streicker, A. S. Turmelle, M. J. Vonhof, I. V. Kuzmin, G. F. McCracken, and C. E. Rupprecht, “Host phylogeny constrains cross-species emergence and establishment of rabies virus in bats,” Science, vol. 329, no. 5992, pp. 676–679, 2010.

[10] B. Dietzschold, W. H. Wunner, T. J. Wiktor et al., “Characterization of an antigenic determinant of the glycoprotein that correlates with pathogenicity of rabies virus,” Proceedings of the National Academy of Sciences of the United States of America, vol. 80, no. 1, pp. 70–74, 1983.
[11] X. Yan, P. S. Mohankumar, B. Dietzschold, M. J. Schnell, and Z. F. Fu, “The rabies virus glycoprotein determines the distribution of different rabies virus strains in the brain,” Journal of Neurovirology, vol. 8, no. 4, pp. 345–352, 2002.
[12] R. K. Bradley, A. Roberts, M. Smoot et al., “Fast statistical alignment,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000392, 2009.
[13] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1596–1599, 2007.
[14] T. C. Bruen, H. Philippe, and D. Bryant, “A simple and robust statistical test for detecting the presence of recombination,” Genetics, vol. 172, no. 4, pp. 2665–2681, 2006.
[15] I. B. Jakobsen and S. Easteal, “A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences,” Computer Applications in the Biosciences, vol. 12, no. 4, pp. 291–295, 1996.
[16] J. M. Smith, “Analyzing the mosaic structure of genes,” Journal of Molecular Evolution, vol. 34, no. 2, pp. 126–129, 1992.
[17] M. F. Boni, D. Posada, and M. W. Feldman, “An exact nonparametric method for inferring mosaic structure in sequence triplets,” Genetics, vol. 176, no. 2, pp. 1035–1047, 2007.
[18] S. L. Kosakovsky Pond, D. Posada, M. B. Gravenor, C. H. Woelk, and S. D. W. Frost, “GARD: a genetic algorithm for recombination detection,” Bioinformatics, vol. 22, no. 24, pp. 3096–3098, 2006.
[19] S. L. Kosakovsky Pond and S. D. W. Frost, “Datamonkey: rapid detection of selective pressure on individual sites of codon alignments,” Bioinformatics, vol. 21, no. 10, pp. 2531–2533, 2005.
[20] K. S. Lole, R. C. Bollinger, R. S. Paranjape et al., “Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination,” Journal of Virology, vol. 73, no. 1, pp. 152–160, 1999.
[21] Z. Yang, “PAML 4: phylogenetic analysis by maximum likelihood,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1586–1591, 2007.
[22] Z. Yang, R. Nielsen, N. Goldman, and A. M. K. Pedersen, “Codon-substitution models for heterogeneous selection pressure at amino acid sites,” Genetics, vol. 155, no. 1, pp. 431–449, 2000.
[23] Z. Yang, W. S. W. Wong, and R. Nielsen, “Bayes empirical Bayes inference of amino acid sites under positive selection,” Molecular Biology and Evolution, vol. 22, no. 4, pp. 1107–1118, 2005.
[24] H. Badrane, C. Bahloul, P. Perrin, and N. Tordo, “Evidence of two Lyssavirus phylogroups with distinct pathogenicity and immunogenicity,” Journal of Virology, vol. 75, no. 17, pp. 3268–3276, 2001.
[25] K. J. Guyatt, J. Twin, P. Davis et al., “A molecular epidemiology study of Australian bat lyssavirus,” Journal of General Virology, vol. 84, no. 2, pp. 485–496, 2003.
[26] Q. Tang, L. A. Orciari, C. E. Rupprecht, and X. Zhao, “Sequencing and positional analysis of the glycoprotein gene of four Chinese rabies viruses,” Zhongguo Bingduxue, vol. 15, no. 1, pp. 22–33, 2000 (Chinese).
[27] T. Hemachudha, S. Wacharapluesadee, B. Lumlertdaecha et al., “Sequence analysis of rabies virus in humans exhibiting encephalitic or paralytic rabies,” Journal of Infectious Diseases, vol. 188, no. 7, pp. 960–966, 2003.
[28] H. Ito, N. Minamoto, T. Watanabe et al., “A unique mutation of glycoprotein gene of the attenuated RC-HL strain of rabies virus, a seed virus used for production of animal vaccine in Japan,” Microbiology and Immunology, vol. 38, no. 6, pp. 479–482, 1994.
[29] S. Agrawal, A. K. Shasany, and S. P. S. Khanuja, “Plant transformation vectors having street rabies virus (Indian strain) glycoprotein gene,” Journal of Plant Biochemistry and Biotechnology, vol. 14, no. 2, pp. 81–87, 2005.
[30] B. H. Hyun, K. K. Lee, IN. J. Kim et al., “Molecular epidemiology of rabies virus isolates from South Korea,” Virus Research, vol. 114, no. 1-2, pp. 113–125, 2005.
[31] S. L. Meng, J. X. Yan, GE. L. Xu et al., “A molecular epidemiological study targeting the glycoprotein gene of rabies virus isolates from China,” Virus Research, vol. 124, no. 1-2, pp. 125–138, 2007.
[32] X. Bai, C. K. Warner, and M. Fekadu, “Comparisons of nucleotide and deduced amino acid sequences of the glycoprotein genes of a Chinese street strain (CGX89-1) and a Chinese vaccine strain (3aG) of rabies virus,” Virus Research, vol. 27, no. 2, pp. 101–112, 1993.
[33] E. Yelverton, S. Norton, J. F. Obijeski, and D. V. Goeddel, “Rabies virus glycoprotein analogs: biosynthesis in Escherichia coli,” Science, vol. 219, no. 4585, pp. 614–619, 1983.
[34] A. Benmansour, M. Brahimi, C. Tuffereau, P. Coulon, F. Lafay, and A. Flamand, “Rapid sequence evolution of street rabies glycoprotein is related to the highly heterogeneous nature of the viral population,” Virology, vol. 187, no. 1, pp. 33–45, 1992.
[35] S. A. Nadin-Davis, G. Allen Casey, and A. I. Wandeler, “A molecular epidemiological study of rabies virus in central Ontario and western Quebec,” Journal of General Virology, vol. 75, no. 10, pp. 2575–2583, 1994.
[36] S. A. Nadin-Davis, M. I. Sampath, G. A. Casey, R. R. Tinline, and A. I. Wandeler, “Phylogeographic patterns exhibited by Ontario rabies virus variants,” Epidemiology and Infection, vol. 123, no. 2, pp. 325–336, 1999.
[37] S. A. Nadin-Davis, W. Huang, and A. I. Wandeler, “The design of strain-specific polymerase chain reactions for discrimination of the racoon rabies virus strain from indigenous rabies viruses of Ontario,” Journal of Virological Methods, vol. 57, no. 1, pp. 1–14, 1996.
[38] K. Morimoto, M. Patel, S. Corisdeo et al., “Characterization of a unique variant of bat rabies virus responsible for newly emerging human cases in North America,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 11, pp. 5653–5658, 1996.
[39] E. C. Holmes, C. H. Woelk, R. Kassis, and H. Bourhy, “Genetic constraints and the adaptive evolution of rabies virus in nature,” Virology, vol. 292, no. 2, pp. 247–257, 2002.
[40] H. Bourhy, J. M. Reynes, E. J. Dunham et al., “The origin and phylogeography of dog rabies virus,” Journal of General Virology, vol. 89, no. 11, pp. 2673–2681, 2008.
[41] D. K. Clarke, E. A. Duarte, A. Moya, S. F. Elena, E. Domingo, and J. Holland, “Genetic bottlenecks and population passages cause profound fitness differences in RNA viruses,” Journal of Virology, vol. 67, no. 1, pp. 222–228, 1993.
[42] M. Oberle, O. Balmer, R. Brun, and I. Roditi, “Bottlenecks and the maintenance of minor genotypes during the life cycle of Trypanosoma brucei,” PLoS Pathogens, vol. 6, no. 7, Article ID e1001023, 2010.

[43] L. M. Wahl, P. J. Gerrish, and I. Saika-Voivod, “Evaluating the impact of population bottlenecks in experimental evolution,” Genetics, vol. 162, no. 2, pp. 961–971, 2002.
[44] M. Poss, A. Idoine, H. A. Ross, J. A. Terwee, S. VandeWoude, and A. Rodrigo, “Recombination in feline lentiviral genomes during experimental cross-species infection,” Virology, vol. 359, no. 1, pp. 146–151, 2007.
[45] C. R. Parrish, E. C. Holmes, D. M. Morens et al., “Cross-species virus transmission and the emergence of new epidemic diseases,” Microbiology and Molecular Biology Reviews, vol. 72, no. 3, pp. 457–470, 2008.
[46] E. R. Chare, E. A. Gould, and E. C. Holmes, “Phylogenetic analysis reveals a low rate of homologous recombination in negative-sense RNA viruses,” Journal of General Virology, vol. 84, no. 10, pp. 2691–2703, 2003.
[47] L. Geue, S. Schares, C. Schnick et al., “Genetic characterisation of attenuated SAD rabies virus strains used for oral vaccination of wildlife,” Vaccine, vol. 26, no. 26, pp. 3227–3235, 2008.
[48] A. Flockerzi, J. Maydt, O. Frank et al., “Expression pattern analysis of transcribed HERV sequences is complicated by ex vivo recombination,” Retrovirology, vol. 4, no. 39, pp. 1–12, 2007.
[49] G. Luo and J. Taylor, “Template switching by reverse transcriptase during DNA synthesis,” Journal of Virology, vol. 64, no. 9, pp. 4321–4328, 1990.
[50] A. Meyerhans, J. P. Vartanian, and S. Wain-Hobson, “DNA recombination during PCR,” Nucleic Acids Research, vol. 18, no. 7, pp. 1687–1691, 1990.
[51] H. Bourhy, L. Dacheux, C. Strady, and A. Mailles, “Rabies in Europe in 2005,” Euro Surveillance, vol. 10, no. 11, pp. 213–216, 2005.
[52] K. Tjørnehøj, A. R. Fooks, J. S. Agerholm, and L. Rønsholt, “Natural and experimental infection of sheep with European bat lyssavirus type-1 of Danish bat origin,” Journal of Comparative Pathology, vol. 134, no. 2-3, pp. 190–201, 2006.
[53] L. Dacheux, F. Larrous, A. Mailles et al., “European bat lyssavirus transmission among cats, Europe,” Emerging Infectious Diseases, vol. 15, no. 2, pp. 280–284, 2009.
[54] K. Hueffer, J. S. L. Parker, W. S. Weichert, R. E. Geisel, J. Y. Sgro, and C. R. Parrish, “The natural host range shift and subsequent evolution of canine parvovirus resulted from virus-specific binding to the canine transferrin receptor,” Journal of Virology, vol. 77, no. 3, pp. 1718–1726, 2003.
[55] P. Durrer, Y. Gaudin, R. W. H. Ruigrok, R. Graf, and J. Brunner, “Photolabeling identifies a putative fusion domain in the envelope glycoprotein of rabies and vesicular stomatitis viruses,” Journal of Biological Chemistry, vol. 270, no. 29, pp. 17575–17581, 1995.
SAGE-Hindawi Access to Research
International Journal of Evolutionary Biology
Volume 2011, Article ID 379424, 15 pages
doi:10.4061/2011/379424

Review Article
Baculovirus: Molecular Insights on
Their Diversity and Conservation

Solange Ana Belen Miele, Matías Javier Garavaglia, Mariano Nicolás Belaich,
and Pablo Daniel Ghiringhelli
LIGBCM (Laboratorio de Ingeniería Genética y Biología Celular y Molecular), Departamento de Ciencia y Tecnología,
Universidad Nacional de Quilmes, Roque Saenz Peña 352, Bernal, Argentina

Correspondence should be addressed to Pablo Daniel Ghiringhelli, pdghiringhelli@gmail.com

Received 15 October 2010; Revised 4 January 2011; Accepted 14 February 2011

Academic Editor: Kenro Oshima

Copyright © 2011 Solange Ana Belen Miele et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

The Baculoviridae is a large group of insect viruses containing circular double-stranded DNA genomes of 80 to 180 kbp. In
this study, genome sequences from 57 baculoviruses were analyzed to reevaluate the number and identity of core genes and to
understand the distribution of the remaining coding sequences. Thirty-one core genes with orthologs in all genomes were identified,
along with 895 other genes differing in their degrees of representation among reported genomes. Many of these latter genes are
common to well-defined lineages, whereas others are unique to one or a few of the viruses. Phylogenetic analyses based on core
gene sequences and the gene composition of the genomes supported the current division of the Baculoviridae into 4 genera:
Alphabaculovirus, Betabaculovirus, Gammabaculovirus, and Deltabaculovirus.

1. Background

Baculoviruses are arthropod-specific viruses containing large double-stranded circular DNA genomes of 80,000–180,000 bp. The progeny generation is biphasic, with two different phenotypes during virus infection: budded viruses (BVs), during the initial stage of the multiplication cycle, and occlusion-derived viruses (ODVs), at the final stages of replication [1, 2]. In general, primary infection takes place in the insect midgut cells after ingestion of occlusion bodies (OBs). Following this stage, systemic infection is caused by the initial BV progeny [3, 4]. Finally, OBs are produced during the last stage of the infection. These OBs comprise virions embedded in a protein matrix which protects them from the environment [5, 6].

Baculoviruses have been used extensively in many biological applications such as protein expression systems, models of genetic regulatory networks and genome evolution, putative nonhuman viral vectors for gene delivery, and biological control agents against insect pests [7–17].

The Baculoviridae family is divided into four genera according to common biological and structural characteristics: Alphabaculovirus, which includes lepidopteran-specific baculoviruses and is subdivided into Group I or Group II based on the type of fusogenic protein; Betabaculovirus, comprising lepidopteran-specific granuloviruses; Gammabaculovirus, which includes hymenopteran-specific baculoviruses; and finally Deltabaculovirus which, to date, comprises only CuniNPV and possibly the still undescribed dipteran-specific baculoviruses [1, 18–20].

The comparison between known genome sequences of all baculoviruses has been the source for identifying a common set of genes, the baculovirus core genes. However, there are probably more orthologous sequences that may not be identified due to the accumulation of many mutations throughout evolution. Thus, core genes seem to be a key factor for some of the main biological functions, such as those necessary to transcribe viral late genes, produce virion structure, infect gut cells, abrogate host metabolism, and establish infections [21–24].

[Figure 1 panels: (a) Alphabaculovirus Group I, mean 44.9; (b) Alphabaculovirus Group II, mean 41.6; (c) Betabaculovirus, mean 37.7; (d) Gammabaculovirus, mean 33.5; (e) Deltabaculovirus; (f) Baculoviridae, mean 41.3. x-axis: range of GC content (%), in bins 30–32.99, 33–35.99, 36–38.99, 39–40.99, 41–43.99, 44–46.99, 47–49.99, 50–52.99, 53–55.99, 56–58.99; y-axis: genome quantity.]
Figure 1: GC content in baculovirus genomes. The different histograms contain the distribution of baculovirus genomes according to their
GC content and their genus classification. Black bars highlight genomes with a GC content higher than 50%.
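
The quantities summarized in Figure 1 can be reproduced with a few lines of Python. The sketch below is only an illustration: the function names and the bin-edge list are assumptions introduced here (the edges follow the bin labels on the x-axis), while the Group I values are the GC% entries listed for that group in Table 1.

```python
from bisect import bisect_right

# Lower edges of the histogram bins, plus the upper limit of the last bin.
BIN_EDGES = [30, 33, 36, 39, 41, 44, 47, 50, 53, 56, 59]


def gc_percent(sequence):
    """GC content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


def gc_bin(gc):
    """Return (lower edge, upper edge) of the bin holding `gc`, or None."""
    i = bisect_right(BIN_EDGES, gc) - 1
    if i < 0 or i >= len(BIN_EDGES) - 1:
        return None
    return BIN_EDGES[i], BIN_EDGES[i + 1]


# GC% values of the Alphabaculovirus Group I genomes as listed in Table 1.
group_one_gc = [53.5, 53.5, 44.5, 40.7, 40.4, 40.2, 45.8,
                50.1, 40.7, 45.5, 38.6, 55.1, 40.7, 39.1]
group_one_mean = sum(group_one_gc) / len(group_one_gc)
high_gc = [value for value in group_one_gc if value > 50.0]

print("Group I mean GC%%: %.1f" % group_one_mean)
print("Genomes above 50%% GC: %d" % len(high_gc))
```

The Group I mean computed this way rounds to 44.9, the value shown in panel (a).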

[Figure 2: Venn diagram of the four genera (α, β, γ, δ). Figure 3: bar chart; y-axis: ORFs quantity (0 to 1000); categories include Baculoviridae, Alphabaculovirus + Betabaculovirus, Alphabaculovirus, Alphabaculovirus Group I, Alphabaculovirus Group II, Betabaculovirus, and Gammabaculovirus.]

Figure 2: Baculovirus core genes. The different circles represent the 4 baculovirus genera (in yellow Alphabaculovirus; in green Betabaculovirus; in red Gammabaculovirus; in blue Deltabaculovirus). The numbers contained within the overlapping regions indicate the amount of shared genes between all members of the genera. The numbers within the circles but outside the overlapping regions indicate the amount of genes shared by all members of that genus but with the absence of orthologous sequences in the remaining genera. These estimations were inferred by Blast P algorithm (http://www.ncbi.nlm.nih.gov/) considering E = 0.001 as cutoff value and comparing all reported baculovirus ORFs between them. The identity of common genes is provided in the Supplementary data available at doi:10.4061/2011/379424.

Figure 3: Whole baculovirus gene content. The histogram shows the amount of different reported genes in each baculovirus genus or recognized lineage (bars in pink color), and the subset of shared genes for all members of the corresponding phylogenetic clade (bars in green color). This bar graph was prepared using the information resulting from the comparison of all ORFs reported in the 57 baculoviruses with known genomes, analyzing all against all by Blast P algorithm (http://www.ncbi.nlm.nih.gov/) considering E = 0.001 as cutoff value.
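
The all-against-all comparison described in the captions of Figures 2 and 3 can be sketched in Python as a filter on pairwise hits followed by a set comparison. This is a minimal illustration under stated assumptions, not the authors' pipeline: the tuple layout of a hit, the function name, and the genome and gene labels are all hypothetical; only the E = 0.001 cutoff comes from the text.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # cutoff stated in the figure captions


def core_genes(hits, reference, genomes):
    """Genes of `reference` with an accepted hit in every other genome.

    `hits` is an iterable of (query_genome, query_gene, subject_genome,
    e_value) tuples, a hypothetical summary of pairwise protein searches.
    """
    matched = defaultdict(set)
    for query_genome, query_gene, subject_genome, e_value in hits:
        if (query_genome == reference
                and subject_genome != reference
                and e_value <= E_VALUE_CUTOFF):
            matched[query_gene].add(subject_genome)
    others = set(genomes) - {reference}
    # A gene is kept only when its matched genomes cover all the others.
    return sorted(gene for gene, found in matched.items() if found >= others)


# Hypothetical toy data: three genomes, two genes in the reference genome.
toy_hits = [
    ("G1", "geneA", "G2", 1e-30),
    ("G1", "geneA", "G3", 1e-12),
    ("G1", "geneB", "G2", 1e-8),
    ("G1", "geneB", "G3", 0.5),   # above the cutoff, so not counted
]
toy_core = core_genes(toy_hits, reference="G1", genomes=["G1", "G2", "G3"])
```

Restricting `genomes` to the members of one genus gives the genus-level shared set in the same way.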

For this report, previous data as well as bioinformatic studies conducted on currently available sets of completely sequenced baculovirus genomes were taken into account, resulting in a summary of gene content and phylogenetic analyses that validates the classification of this important viral family.

2. Baculovirus Ancestral Genes

There are currently 57 complete baculovirus genomes deposited in GenBank (Table 1). These include 41 Alphabaculoviruses, 12 Betabaculoviruses, 3 Gammabaculoviruses, and 1 Deltabaculovirus.

As a first approach to a comparative analysis, the GC content of the genomes was calculated (Figure 1). The histogram revealed that many baculoviruses have a GC content of about 41%, although several of them have markedly higher values (CfMNPV at 50.1%, CuniNPV at 50.9%, AnpeNPV-L2 at 53.5%, AnpeNPV-Z at 53.5%, LyxyMNPV at 53.5%, OpMNPV at 55.1%, and LdMNPV at 57.5%). A detailed analysis of DNA content did not show a clear pattern of GC content that could be associated with each genus.

Further characterization of the patterns of gene content and organization may prove useful for establishing evolutionary relationships among members of Baculoviridae. The high variability observed in the number of coding sequences is a key feature of viruses with large DNA genomes that infect eukaryotic cells [18]. Insertions, deletions, duplication events, and/or sequence reorganizations by recombination or transposition processes seem to be the main forces of macroevolution in this particular kind of biological entity. For example, the loss or gain of genetic material could provide important new abilities for colonization of new hosts, or it could improve performance within established hosts. However, there seems to be a set of core genes whose absence would imply the loss of basic biological functions, and that could be typical of the viral family. In view of this, and considering previous reports [1, 19, 22, 23], the amount and identity of baculovirus common genes were reevaluated (Table 2). As a result, P6.9 and Desmoplakin were recognized in this work as core proteins by using sequence analysis complementary to the standard ones (see Supplementary files available at doi:10.4061/2011/379424).

The group of conserved sequences found in all baculovirus genomes is consistently estimated at about 30 shared genes, regardless of the increasing number of genomes analyzed [22, 148]. Meanwhile, the role or function assigned to several sequences has been revised according to new studies. In particular, it has been found that the 38k (Ac98) gene encodes a protein which is part of the capsid structure [121, 122]; P33 (Ac92) is a sulfhydryl oxidase which could be related to the proper production of virions in the infected cell nucleus [123–125]; ODV-EC43 (Ac109) is a structural component which would be involved in BV and ODV generation [126]; P49 (Ac142) is a capsid

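Table 1: Baculovirus complete genomes.

Genus | Name | Abbreviation | Code | Accession number | Genome (bp) | Annotated ORFs | GC% | Ref.
Alphabaculovirus Group I | Antheraea pernyi NPV-Z | AnpeNPV-Z | APN | NC_008035 | 126629 | 145 | 53.5 | [27]
Alphabaculovirus Group I | Antheraea pernyi NPV-L2 | AnpeNPV-L2 | AP2 | EF207986 | 126246 | 144 | 53.5 | [28]
Alphabaculovirus Group I | Anticarsia gemmatalis MNPV-2D | AgMNPV-2D | AGN | NC_008520 | 132239 | 152 | 44.5 | [29]
Alphabaculovirus Group I | Autographa californica MNPV-C6 | AcMNPV-C6 | ACN | NC_001623 | 133894 | 154 | 40.7 | [30]
Alphabaculovirus Group I | Bombyx mori NPV | BmNPV | BMN | NC_001962 | 128413 | 137 | 40.4 | [31]
Alphabaculovirus Group I | Bombyx mandarina NPV | BomaNPV | BON | NC_012672 | 126770 | 141 | 40.2 | [32]
Alphabaculovirus Group I | Choristoneura fumiferana DEF MNPV | CfDEFMNPV | CDN | NC_005137 | 131160 | 149 | 45.8 | [33]
Alphabaculovirus Group I | Choristoneura fumiferana MNPV | CfMNPV | CFN | NC_004778 | 129593 | 145 | 50.1 | [34]
Alphabaculovirus Group I | Epiphyas postvittana NPV | EppoNPV | EPN | NC_003083 | 118584 | 136 | 40.7 | [35]
Alphabaculovirus Group I | Hyphantria cunea NPV | HycuNPV | HCN | NC_007767 | 132959 | 148 | 45.5 | [36]
Alphabaculovirus Group I | Maruca vitrata MNPV | MaviMNPV | MVN | NC_008725 | 111953 | 126 | 38.6 | [37]
Alphabaculovirus Group I | Orgyia pseudotsugata MNPV | OpMNPV | OPN | NC_001875 | 131995 | 152 | 55.1 | [38]
Alphabaculovirus Group I | Plutella xylostella MNPV | PlxyMNPV | PXN | NC_008349 | 134417 | 149 | 40.7 | U
Alphabaculovirus Group I | Rachiplusia ou MNPV | RoMNPV | RON | NC_004323 | 131526 | 146 | 39.1 | [39]
Alphabaculovirus Group II | Adoxophyes honmai NPV | AdhoNPV | AHN | NC_004690 | 113220 | 125 | 35.6 | [40]
Alphabaculovirus Group II | Adoxophyes orana NPV | AdorNPV | AON | NC_011423 | 111724 | 121 | 35.0 | [41]
Alphabaculovirus Group II | Agrotis ipsilon NPV | AgipNPV | AIN | NC_011345 | 155122 | 163 | 48.6 | U
Alphabaculovirus Group II | Agrotis segetum NPV | AgseNPV | ASN | NC_007921 | 147544 | 153 | 45.7 | [42]
Alphabaculovirus Group II | Apocheima cinerarium NPV | ApciNPV | APO | FJ914221 | 123876 | 118 | 33.4 | U
Alphabaculovirus Group II | Chrysodeixis chalcites NPV | ChChNPV | CCN | NC_007151 | 149622 | 151 | 39.0 | [43]
Alphabaculovirus Group II | Clanis bilineata NPV | ClbiNPV | CBN | NC_008293 | 135454 | 129 | 37.7 | [44]
Alphabaculovirus Group II | Ectropis obliqua NPV | EcobNPV | EON | NC_008586 | 131204 | 126 | 37.6 | [45]
Alphabaculovirus Group II | Euproctis pseudoconspersa NPV | EupsNPV | EUN | NC_012639 | 141291 | 139 | 40.4 | [46]
Alphabaculovirus Group II | Helicoverpa armigera NPV-C1 | HearNPV-C1 | HA1 | NC_003094 | 130759 | 135 | 38.9 | [47]
Alphabaculovirus Group II | Helicoverpa armigera NPV-G4 | HearNPV-G4 | HA4 | NC_002654 | 131405 | 135 | 39.0 | [47]
Alphabaculovirus Group II | Helicoverpa armigera MNPV | HearMNPV | HAN | NC_011615 | 154196 | 162 | 40.1 | [48]
Alphabaculovirus Group II | Helicoverpa armigera SNPV-NNg1 | HearSNPV-NNg1 | HAS | NC_011354 | 132425 | 143 | 39.2 | [49]
Alphabaculovirus Group II | Helicoverpa zea SNPV | HzSNPV | HZN | NC_003349 | 130869 | 139 | 39.1 | U
Alphabaculovirus Group II | Leucania separata NPV-AH1 | LeseNPV-AH1 | LSN | NC_008348 | 168041 | 169 | 48.6 | [50]
Alphabaculovirus Group II | Lymantria dispar MNPV | LdMNPV | LDN | NC_001973 | 161046 | 163 | 57.5 | [51]
Alphabaculovirus Group II | Lymantria xylina MNPV | LyxyMNPV | LXN | NC_013953 | 156344 | 157 | 53.5 | [52]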
International Journal of Evolutionary Biology 5

Table 1: Continued.
Accesion Genome Annotated
Genus Name Abbreviation Code GC% Ref.
number (bp) ORFs
Mamestra configurata
MacoNPV-90-2 MCN NC 003529 155060 169 41.7 [53]
NPV-90-2
Mamestra configurata
MacoNPV-90-4 MC4 AF539999 153656 168 41.7 [54]
NPV-90-4
Mamestra configurata
MacoNPV-B MCB NC 004117 158482 169 40.0 [55]
NPV-B
Orgyia leucostigma
OrleNPV OLN NC 010276 156179 135 39.9 U
NPV
Spodoptera exigua
SeMNPV SEN NC 002169 135611 142 43.8 U
MNPV
Spodoptera frugiperda
SfMNPV-3AP2 SF2 NC 009011 131330 143 40.2 [56]
MNPV-3AP2
Spodoptera frugiperda
SfMNPV-19 SF9 EU258200 132565 141 40.3 [57]
MNPV-19
Spodoptera litura
SpliNPV-II SLN NC 011616 148634 147 45.0 U
NPV-II
Spodoptera litura
SpliNPV-G2 SL2 NC 003102 139342 141 42.8 [58]
NPV-G2
Trichoplusia ni SNPV TnSNPV TNN NC 007383 134394 144 39.0 [59]
Adoxophyes orana GV AdorGV AOG NC 005038 99657 119 34.5 [60]
Agrotis segetum GV AgseGV ASG NC 005839 131680 132 37.3 U
Choristoneura
ChocGV COG NC 008168 104710 116 32.7 [61]
occidentalis GV
Cryptophlebia
CrleGV CLG NC 005068 110907 129 32.4 [62]
leucotreta GV
Betabaculovirus Cydia pomonella GV CpGV CPG NC 002816 123500 143 45.3 [63]
Helicoverpa armigera
HearGV HAG NC 010240 169794 179 40.8 [64]
GV
Phthorimea operculella
PhopGV POG NC 004062 119217 130 35.7 [65]
GV
Plutella xylostella GV PlxyGV PXG NC 002593 100999 120 40.7 [66]
Pieris rapae GV PiraGV PRG GQ884143 108592 120 33.2 U
Pseudaletia unipuncta
PsunGV PUG EU678671 176677 183 39.8 U
GV-Hawaiin
Spodoptera litura
SpliGV SLG NC 009503 124121 136 38.8 [67]
GV-K1
Xestia c-nigrum GV XnGV XCG NC 002331 178733 181 40.7 [68]
Neodiprion abietis NPV NeabNPV NAN NC 008252 84264 93 33.4 [69]
Gamma Neodiprion lecontei
NeleNPV NLN NC 005906 81755 93 33.3 [70, 71]
NPV
Neodiprion sertifer
NeseNPV NSN NC 005905 86462 90 33.8 [71, 72]
NPV
Delta Culex nigripalpus NPV CuniNPV CNN NC 003084 108252 109 50.9 [73]
This table contains all of baculoviruses used in bioinformatic studies, sorted by genus (and within them by alphabetical order). MNPV is the abbreviation
of multicapsid nucleopolyhedrovirus; NPV is the abbreviation of nucleopolyhedrovirus; SNPV is the abbreviation of single nucleopolyhedrovirus; GV is the
abbreviation of granulovirus. The accession numbers are from National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/) and
correspond to the sequences of complete genomes. Code is an acronym used for practicality. U: unpublished.

protein important in DNA processing, packaging, and capsid processes from virogenic stroma to cytoplasm [132]; PIF-4
morphogenesis [129]; Ac81 interacts with Actin 3 in the (Ac96) and PIF-5 (ODV-56, Ac148) are ODV envelope
cytoplasm but does not appear in BVs or in ODVs [135]; proteins with an essential role in per os infection route [145,
ODV-E18 (Ac143) would mediate BV production [131]; 147]; Ac68 may be involved in polyhedron morphogenesis
desmoplakin (Ac66) seems to be essential in releasing [130].
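The GC% values reported in Table 1 can be recomputed from a genome sequence. The following is a minimal sketch rather than the authors' procedure: it assumes the genome is available as FASTA text, and the function names `fasta_to_sequence` and `gc_content` are illustrative. Excluding ambiguous symbols such as N from the denominator is also an assumption.

```python
def fasta_to_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string (header lines start with '>')."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if line and not line.startswith(">"))


def gc_content(sequence):
    """Return the GC percentage of a DNA sequence.

    Only A, C, G, and T are counted; ambiguous symbols (for example N) are
    ignored, which is an assumption of this sketch.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / total


if __name__ == "__main__":
    # Toy record, not a real baculovirus sequence.
    toy = ">toy_record\nATGC\nGGCC\n"
    print(round(gc_content(fasta_to_sequence(toy)), 1))
```

Applied to a downloaded genome record, the result should be comparable to the GC% column, up to rounding and the treatment of ambiguous bases.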

Table 2: Core genes.

Gene [Ref.] | ACN | LDN | CPG | NSN | CNN

Replication
lef-1 [74] | 14 | 123 | 74 | 68 | 45
lef-2 [74] | 6 | 137 | 41 | 57 | 25
DNA pol [75–78] | 65 | 83 | 111 | 28 | 91
Helicase [79–90] | 95 | 97 | 90 | 61 | 89

Transcription
lef-4 [91–95] | 90 | 93 | 95 | 62 | 96
lef-8 [91, 96] | 50 | 51 | 131 | 81 | 26
lef-9 [95, 97] | 62 | 64 | 117 | 40 | 59
p47 [91, 98] | 40 | 48 | 68 | 49 | 73
lef-5 [98–101] | 99 | 100 | 87 | 58 | 88

Packaging, assembly, and release
p6.9 [102–104] | 100 | 101 | 86 | 36 | 23
vp39 [105–108] | 89 | 92 | 96 | 89 | 24
vlf-1 [100, 109–113] | 77 | 86 | 106 | 45 | 18
alk-exo [114–116] | 133 | 157 | 125 | 31 | 53
vp1054 [117] | 54 | 57 | 138 | 85 | 8
vp91/p95 [118] | 83 | 91 | 101 | 84 | 35
gp41 [119, 120] | 80 | 88 | 104 | 47 | 33
38 k [121, 122] | 98 | 99 | 88 | 59 | 87
p33 [123–125] | 92 | 94 | 93 | 24 | 14
odv-ec43 [126–128] | 109 | 107 | 55 | 70 | 69
p49 [129] | 142 | 20 | 15 | 63 | 30
odv-nc42 [130] | 68 | 80 | 114 | 41 | 58
odv-e18 [131] | 143 | 19 | 14 | 65 | 31
desmoplakin [132] | 66 | 82 | 112 | 29 | 92

Cell cycle arrest and/or interaction with host proteins
odv-e27 [133, 134] | 144 | 18 | 97 | 66 | 32
ac81 [135] | 81 | 89 | 103 | 48 | 106

Oral infectivity
pif-0/p74 [136–141] | 138 | 27 | 60 | 50 | 74
pif-1 [142–144] | 119 | 155 | 75 | 79 | 29
pif-2 [136, 142] | 22 | 119 | 48 | 55 | 38
pif-3 [142] | 115 | 143 | 35 | 69 | 46
pif-4/19k/odv-e28 [145] | 96 | 98 | 89 | 60 | 90
pif-5/odv-e56 [146, 147] | 148 | 14 | 18 | 38 | 102

Virus names are given as the three-letter codes established in Table 1. Numbers in the columns indicate the corresponding ORF in each genome.
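The notion of a core gene used in Table 2, a gene with an ortholog in every genome, can be expressed as a set intersection, and the same idea extends to genes shared by all members of some genera but not others. The sketch below is illustrative only: it assumes orthology has already been established and is encoded as one set of gene identifiers per genome. The genome names, gene identifiers, and function names are hypothetical and are not taken from the paper's data.

```python
from collections import defaultdict


def core_genes(genomes):
    """Return the genes present in every genome (genomes: name -> set of gene ids)."""
    gene_sets = [set(genes) for genes in genomes.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def genes_shared_within_genus(genomes, genus_of):
    """For each genus, return the genes present in all of its member genomes."""
    by_genus = defaultdict(list)
    for name, genes in genomes.items():
        by_genus[genus_of[name]].append(set(genes))
    return {genus: set.intersection(*sets) for genus, sets in by_genus.items()}


def classify_by_genus_pattern(genomes, genus_of):
    """Group genes by the set of genera in which they are conserved in every member."""
    shared = genes_shared_within_genus(genomes, genus_of)
    all_genes = set().union(*shared.values())
    pattern = defaultdict(set)
    for gene in all_genes:
        key = frozenset(genus for genus, genes in shared.items() if gene in genes)
        pattern[key].add(gene)
    return dict(pattern)


if __name__ == "__main__":
    # Hypothetical toy data: four genomes from three genera.
    genomes = {
        "alpha1": {"g1", "g2", "g3"},
        "alpha2": {"g1", "g2", "g4"},
        "beta1": {"g1", "g2", "g5"},
        "delta1": {"g1", "g6"},
    }
    genus_of = {"alpha1": "Alpha", "alpha2": "Alpha", "beta1": "Beta", "delta1": "Delta"}
    print(sorted(core_genes(genomes)))
    for genera, genes in classify_by_genus_pattern(genomes, genus_of).items():
        print(sorted(genera), sorted(genes))
```

Genes that are present in only some members of a genus (g3 and g4 in the toy data) fall outside every pattern, which mirrors the distinction between genes shared by all members of a group and the wider gene content of that group.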

The number and identity of shared orthologous genes in every accepted member of each genus were investigated, and the unique sequences typical of each clade as well as those shared between different phylogenetic groups were identified (Figure 2).

This analysis shows that the four accepted baculovirus genera have accumulated a large number of genes during evolution. Many of these sequences were probably incorporated into viral genomes prior to diversification, since they are found in members of different genera. In contrast, other genes are unique to each genus, suggesting that they were incorporated more recently, after diversification (Table 3). The possibility that genes found in only one genus represent ancestral baculovirus sequences that were deleted in the other lineages should also be considered. In any case, this analysis yielded a set of particular genes that could help assign new baculoviruses with partial sequence information to the appropriate genus.

Table 3: Shared genes*.

Core genes
lef-2 (ACN6), lef-1 (ACN14), pif-2 (ACN22), p47 (ACN40), lef-8 (ACN50), vp1054 (ACN54), lef-9 (ACN62), DNA polymerase (ACN65), Desmoplakin (ACN66), ACN68, vlf-1 (ACN77), gp41 (ACN80), ACN81, vp91/p95 (ACN83), vp39 (ACN89), lef-4 (ACN90), p33 (ACN92), helicase (ACN95), 19K (ACN96), 38 K (ACN98), lef-5 (ACN99), p6.9 (ACN100), odv-ec43 (ACN109), PIF-3 (ACN115), pif-1 (ACN119), alkaline exonuclease (ACN133), p74 (ACN138), p49 (ACN142), odv-e18 (ACN143), odv-e27 (ACN144), odv-e56 (ACN148)

Alpha + Beta + Gamma
Polh (ACN8), dbp (ACN25), p48 (ACN103), ACN145, pp34/PEP (ACN131), odv-e25 (ACN94), p40 (ACN101), ACN106/107

Alpha + Beta + Delta
F-protein (ACN23)

Alpha + Beta
pk-1 (ACN10), 38,7 kDa (ACN13), lef-6 (ACN28), pp31/39K (ACN36), ACN38, ACN53, 25K FP (ACN61), LEF-3 (ACN67), ACN75, ACN76, tlp20 (ACN82), p18 (ACN93), P12 (ACN102), ACN108, p24 (ACN129), me53 (ACN139), ACN146, ie-1 (ACN147)

Alpha
orf1629 capsid (ACN9), ACN19, pkip-1 (ACN24), ACN34, ACN51, iap-2 (ACN58/59), ACN104, p87/vp80 (ACN141), ie-0 (ACN71)

Alpha Group I
ptp-1/bvp (ACN1), ACN5, odv-e26 (ACN16), iap-1 (ACN27), ACN30, ACN72, ACN73, ACN114, ACN124, gp64 (ACN128), p25 (ACN132), ie-2 (ACN151)

Beta
CPG4, CPG5, CPG20, CPG23, CPG29, CPG33, CPG39, CPG45, Metalloproteinase (CPG46), CPG62, FGF-1 (CPG76), CPG79, CPG99, CPG100, CPG115, IAP-5 (CPG116), CPG123, CPG135, FGF-3 (CPG140)

Gamma
NSN3, NSN9, NSN11, NSN12, NSN13, NSN16, NSN18, NSN19, NSN20, NSN26, NSN29, NSN34, NSN37, NSN39, NSN42, NSN43, NSN44, NSN51, NSN52, NSN53, NSN54, NSN56, NSN64, NSN72, NSN74, NSN76, NSN77, NSN79, NSN82, NSN85, NSN86, NSN89

Delta
CNN2, CNN3, CNN6, CNN7, CNN9, CNN10, CNN11, CNN12, CNN13, CNN15, CNN16, CNN17, CNN20, CNN21, CNN22, CNN27, CNN28, CNN31, CNN36, CNN37, CNN39, CNN40, CNN41, CNN42, CNN43, CNN44, CNN47, CNN48, CNN49, CNN50, CNN51, CNN52, CNN53, CNN55, CNN56, CNN57, CNN60, CNN61, CNN62, CNN63, CNN64, CNN65, CNN66, CNN67, CNN68, CNN70, CNN71, CNN72, CNN75, CNN76, CNN77, CNN78, CNN79, CNN80, CNN81, CNN82, CNN83, CNN84, CNN85, CNN86, CNN93, CNN94, CNN97, CNN98, CNN99, CNN100, CNN101, CNN103, CNN105, CNN107

* Shared genes are indicated for only one selected species. See the supplementary tables for the respective ORF numbers in each species.

3. Whole Baculovirus Gene Content

The study of all genes reported in the 57 completely sequenced viral genomes revealed the existence of about 895 different ORFs, a set of sequences that might be called the whole baculovirus gene content. This high number of potential coding sequences contrasts with the gene content of individual family members, which ranges from 90 to 181 genes (Alphabaculovirus: 118–169; Betabaculovirus: 116–181; Gammabaculovirus: 90–93; Deltabaculovirus: 109), as well as with the proportion of core genes, which represents only about 3% of the total (31 of about 895 ORFs). This feature supports the hypothesis that structural mutations are of great importance in the macroevolution of viruses with large DNA genomes. With this in mind, the set of genes shared by all members of each baculovirus genus was compared with the whole gene content of that genus (Figure 3).

The analysis shows that Group I alphabaculoviruses and gammabaculoviruses have a lower diversity of gene content than the other lineages. This information, coupled with the significant number of genome sequences obtained from Group I alphabaculoviruses, suggests that this lineage constitutes the newest clade in baculovirus evolutionary history [149]. This is based on the assumption that Group I alphabaculoviruses have had less time to incorporate new sequences from different sources (host genomes, other viral genomes, bacterial genomes, etc.) since the appearance of their common ancestor.

4. Baculovirus Core Gene Phylogeny

Traditional attempts to infer relationships between baculoviruses relied on amino acid or nucleotide sequence analyses of single genes encoding proteins such as polyhedrin/granulin (the major component of OBs), the envelope fusion polypeptides known as F protein and GP64, or DNA polymerase, among many other examples [149–152].

[Figure 4 (image not reproduced): cladogram of the 57 genomes, with tips labeled by the three-letter codes of Table 1 and node values ranging from 58 to 100. Labeled groupings: Alphabaculovirus (Group I, with clades a and b, and Group II), Betabaculovirus (clades a and b), Gammabaculovirus (γ), and Deltabaculovirus (δ).]

Figure 4: Baculovirus genome phylogeny. Cladogram based on the amino acid sequences of core genes. The 31 identified core genes of the Baculoviridae family were independently aligned using the MEGA 4 program [25] with gap open penalty = 10, gap extension penalty = 1, and the Dayhoff matrix [26]. Then, a concatemer was generated and the phylogeny inferred using the same software (UPGMA; bootstrap with 1000 replicates; gap/missing data = complete deletion; model = amino (Dayhoff matrix); patterns among sites = same (homogeneous); rates among sites = different (gamma distributed); gamma parameter = 2.25). Baculoviruses are identified by the acronyms given in Table 1, and the accepted distribution in lineages and genera is also indicated. Gammabaculovirus and Deltabaculovirus are referenced by Greek letters. The proposed clades of Betabaculovirus are shown in bold letters.
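The pipeline summarized in the caption of Figure 4 (per-gene alignments, concatenation, distance estimation, UPGMA clustering) can be sketched in a few lines. This is a simplified illustration and not a reimplementation of the MEGA 4 analysis: it assumes the per-gene alignments already exist, it substitutes the plain proportion of differing sites (p-distance) for the Dayhoff model with gamma-distributed rates, it skips gapped sites pairwise rather than by complete deletion, and it omits bootstrapping. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def concatenate(alignments):
    """Concatenate per-gene alignments (each a dict taxon -> aligned sequence)."""
    taxa = sorted(alignments[0])
    return {taxon: "".join(aln[taxon] for aln in alignments) for taxon in taxa}


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(dist):
    """Cluster taxa by UPGMA.

    dist maps frozenset({a, b}) -> distance for every pair of taxa.
    Returns the tree as nested 2-tuples of taxon names.
    """
    sizes = {name: 1 for name in sorted(set().union(*dist))}
    d = dict(dist)
    while len(sizes) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in sizes:
            if c not in (a, b):
                # Size-weighted average of the distances to the merged clusters.
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


def build_tree(alignments):
    """Concatenate the alignments, compute pairwise p-distances, and run UPGMA."""
    concat = concatenate(alignments)
    dist = {
        frozenset((x, y)): p_distance(concat[x], concat[y])
        for x, y in combinations(sorted(concat), 2)
    }
    return upgma(dist)


if __name__ == "__main__":
    # Hypothetical toy alignments for two "genes" and four taxa.
    gene1 = {"A": "MKTL", "B": "MKTL", "C": "MRTL", "D": "QRSV"}
    gene2 = {"A": "AAGG", "B": "AAGC", "C": "ATGC", "D": "TTCC"}
    print(build_tree([gene1, gene2]))
```

UPGMA assumes a roughly constant rate of change across lineages; the sketch inherits that assumption and reports topology only, without branch lengths or support values.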

[Figure 5 (image not reproduced): bar chart of average PAM distances (y-axis, ticks at 0 and 4) for each core protein (x-axis), in the order Desmoplakin, Helicase, LEF-2, P6.9, VP91, ODV-EC27, VP39, ODV-NC42, ALK-EXO, VP1054, LEF-1, P49, LEF-4, P47, ODV-EC43, DNApol, P74, PIF-1, P33, GP41, PIF-4, PIF-3, 38K, VLF-1, LEF-5, PIF-5, Ac81, LEF-8, ODV-E18, LEF-9, PIF-2.]

Figure 5: Baculovirus core gene variability. The histogram shows the average PAM250 distance for each core gene with its standard deviation. These values were calculated using the MEGA 4 program (UPGMA; bootstrap with 1000 replicates; gap/missing data = complete deletion; model = amino (Dayhoff matrix); patterns among sites = same (homogeneous); rates among sites = different (gamma distributed; gamma parameter = 2.25)). PAM (point accepted mutation) matrices refer to the evolutionary distance between pairs of sequences. Given the weak similarity between several core proteins, the PAM250 matrix was selected. The divergence considered in this matrix is 250 mutations per 100 amino acids, and it was calculated to analyze more distantly related sequences. PAM250 is considered a good general matrix for protein similarity searches.
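The per-gene summary plotted in Figure 5 (an average pairwise distance with its standard deviation, used to order the genes from most to least variable) can be illustrated as follows. This is a sketch under stated assumptions: it uses the uncorrected proportion of differing sites instead of PAM250 distances, it takes one precomputed alignment per gene as input, and the gene names, sequences, and function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def gene_variability(alignment):
    """Return (mean, standard deviation) of all pairwise distances in one alignment."""
    distances = [
        p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    ]
    spread = stdev(distances) if len(distances) > 1 else 0.0
    return mean(distances), spread


def rank_genes(alignments):
    """Order genes from most to least variable by mean pairwise distance."""
    stats = {gene: gene_variability(aln) for gene, aln in alignments.items()}
    return sorted(stats.items(), key=lambda item: item[1][0], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy alignments, one per gene, for three taxa.
    alignments = {
        "conserved_gene": {"A": "MKTL", "B": "MKTL", "C": "MKTV"},
        "variable_gene": {"A": "MKTL", "B": "QRTL", "C": "QRSV"},
    }
    for gene, (avg, sd) in rank_genes(alignments):
        print(gene, round(avg, 3), round(sd, 3))
```

Because the p-distance saturates for divergent proteins, a model-corrected distance such as the one used in the paper would separate highly variable genes more clearly than this sketch does.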

Mostly, these evolutionary inferences were in agreement with the much stronger subsequent studies based on sequence analyses of sets of genes with homologous sequences in all baculoviruses. These newer approaches were based on the construction of common-protein concatemers, which were used to propose evolution patterns for baculoviruses [149].

A viral family consists of members that share a common pattern of genes and functions and whose proliferation cycle continuously challenges viral viability, which makes it essential to take into account their greater or lesser tolerance to molecular changes. Molecular constraints regarding tolerance to changes in core genes are different from those of other genes. Therefore, core genes should be considered the most ancestral genes, which may have diverged to a greater or lesser degree. Accordingly, a phylogenetic study was performed based on concatemers obtained from multiple alignments of the 31 proteins recognized in this work as core genes for the 57 available baculoviruses with sequenced genomes (Figure 4).

The obtained cladogram reproduces the current baculovirus classification based on 4 genera. Additionally, this approach consistently separates the alphabaculoviruses into two lineages: Group I and Group II. The same can be observed within Group I, where the presence of two different clades can be clearly inferred (clade a and clade b). These groupings are in accordance with previous reports [20, 150]. In Group II alphabaculoviruses, no clear clustering could be identified that would support a subdivision.

In contrast, in the Betabaculovirus genus, it is possible to propose a separation into two different clades: clade a (XnGV, HearGV, PsunGV, SpliGV, AgseGV, and PlxyGV) and clade b (AdorGV, PhopGV, CpGV, CrleGV, PiraGV, and ChocGV).

Despite the evolutionary inference based on core genes, a question remained: “is the tolerance to changes the same in all core genes?” The answer was sought through an individual core gene variability analysis, for which the sequence distances for each baculovirus core gene were studied (Figure 5).

The resulting order of core genes shows that pif-2 was the most conserved baculovirus ancestral sequence, whereas desmoplakin was the gene with evidence of greatest variability. This analysis reveals that genomes can be evolutionarily constrained in different ways depending on the proteins they encode.

Gaining access to new hosts might be an important force for gene evolution. During an infection process, genome variants carrying mutations introduced by errors in the replication/repair machinery could be quickly incorporated into the virus population if the nucleotide changes offered a better biological performance when the proteins were translated. The DNA helicase gene has been considered an important host range factor and, in this study, was the core sequence with the second-highest variability [87]. However, other sequences such as the pif-2 gene would not accumulate mutations because the encoded protein might lose vital functions not necessarily associated with the nature of the host.

5. Conclusions

Baculoviridae is a large family of viruses which infect and kill insect species from different orders. The valuable applications of these viruses in several fields of life sciences encourage their constant study with the goal of understanding the molecular mechanisms involved in the generation of progeny in the appropriate cells as well as the processes by which they evolve. The establishment of solid bases to recognize their phylogenetic relationships is necessary to facilitate the generation of new knowledge and the development of better methodologies.

In view of this, many researchers have proposed and used different bioinformatic methodologies to identify genes as well as related baculoviruses. Some of them were based on gene sequences [150], gene content [17], or genome rearrangements [152]. In this work, a combination of core gene sequence and gene content analyses was applied to reevaluate Baculoviridae classification. To our knowledge, this report is the first work which identifies the whole baculovirus gene content and the shared genes that are unique in different genera and subgenera. All this information should be taken into account to group and classify new virus isolates and to propose molecular methodologies to diagnose baculoviruses based on proper gene targets according to gene variability and gene content.

Acknowledgments

This work was supported by research funds from Agencia Nacional de Promoción Científica y Técnica (ANPCyT) and Universidad Nacional de Quilmes. P. D. Ghiringhelli is a member of the Research Career of CONICET (Consejo Nacional de Investigaciones Científicas y Técnicas), M. N. Belaich holds a postdoctoral fellowship of CONICET, S. A. B. Miele holds a fellowship of CONICET, and M. J. Garavaglia holds a fellowship of CIC-PBA (Comisión de Investigaciones Científicas de la Provincia de Buenos Aires). The authors thank Lic. Javier A. Iserte, Lic. Betina I. Stephan, and Lic. Laura Esteban for their help with the paper. S. A. B. Miele and M. Javier Garavaglia contributed equally to this work.

References

[1] J. A. Jehle, G. W. Blissard, B. C. Bonning et al., “On the classification and nomenclature of baculoviruses: a proposal for revision,” Archives of Virology, vol. 151, no. 7, pp. 1257–1266, 2006.
[2] G. W. Blissard and G. F. Rohrmann, “Baculovirus diversity and molecular biology,” Annual Review of Entomology, vol. 35, no. 1, pp. 127–155, 1990.
[3] E. A. Kozlov, T. L. Levitina, and N. M. Gusak, “The primary structure of baculovirus inclusion body proteins. Evolution and structure-function aspects,” Current Topics in Microbiology and Immunology, vol. 131, pp. 135–164, 1986.
[4] G. F. Rohrmann, “Baculovirus structural proteins,” Journal of General Virology, vol. 73, no. 4, pp. 749–761, 1992.
[5] T. Ohkawa, J. O. Washburn, R. Sitapara, E. Sid, and L. E. Volkman, “Specific binding of Autographa californica M nucleopolyhedrovirus occlusion-derived virus to midgut cells of heliothis virescens larvae is mediated by products of pif genes Ac119 and Ac022 but not by Ac115,” Journal of Virology, vol. 79, no. 24, pp. 15258–15264, 2005.
[6] R. Jackes, E. Maromorosch, and K. Sherman, Stability of Insect Viruses in the Environment. Viral Insecticides for Biological Control, Academic Press, New York, NY, USA, 1985.
[7] G. Zhang, “Research, development and application of Heliothis viral pesticide in China,” Resources and Environment in the Yangtze Valley, vol. 3, pp. 1–6, 1994.
[8] R. D. Possee, “Baculoviruses as expression vectors,” Current Opinion in Biotechnology, vol. 8, no. 5, pp. 569–572, 1997.
[9] F. Moscardi, “Assessment of the application of baculoviruses for control of lepidoptera,” Annual Review of Entomology, vol. 44, pp. 257–289, 1999.
[10] T. A. Kost and J. P. Condreay, “Recombinant baculoviruses as expression vectors for insect and mammalian cells,” Current Opinion in Biotechnology, vol. 10, no. 5, pp. 428–433, 1999.
[11] A. B. Inceoglu, S. G. Kamita, A. C. Hinton et al., “Recombinant baculoviruses for insect control,” Pest Management Science, vol. 57, no. 10, pp. 981–987, 2001.
[12] T. A. Kost, J. P. Condreay, and D. L. Jarvis, “Baculovirus as versatile vectors for protein expression in insect and mammalian cells,” Nature Biotechnology, vol. 23, no. 5, pp. 567–575, 2005.
[13] M. D. Summers, “Milestones leading to the genetic engineering of baculoviruses as expression vector systems and viral pesticides,” Advances in Virus Research, vol. 68, pp. 3–73, 2006.
[14] A. B. Inceoglu, S. G. Kamita, and B. D. Hammock, “Genetically modified baculoviruses: a historical overview and future outlook,” Advances in Virus Research, vol. 68, pp. 323–360, 2006.
[15] X. Shi and D. L. Jarvis, “Protein N-glycosylation in the baculovirus-insect cell system,” Current Drug Targets, vol. 8, no. 10, pp. 1116–1125, 2007.
[16] J. P. Condreay and T. A. Kost, “Baculovirus expression vectors for insect and mammalian cells,” Current Drug Targets, vol. 8, no. 10, pp. 1126–1131, 2007.
[17] X. L. Sun and H. Y. Peng, “Recent advances in biological control of pest insect by using viruses in China,” Virologica Sinica, vol. 22, no. 2, pp. 158–162, 2007.
[18] E. A. Herniou, T. Luque, X. Chen et al., “Use of whole genome sequence data to infer baculovirus phylogeny,” Journal of Virology, vol. 75, no. 17, pp. 8117–8126, 2001.
[19] E. A. Herniou, J. A. Olszewski, J. S. Cory, and D. R. O’Reilly, “The genome sequence and evolution of baculoviruses,” Annual Review of Entomology, vol. 48, pp. 211–234, 2003.
[20] J. A. Jehle, M. Lange, H. Wang, Z. Hu, Y. Wang, and R. Hauschild, “Molecular identification and phylogenetic analysis of baculoviruses from Lepidoptera,” Virology, vol. 346, no. 1, pp. 180–193, 2006.
[21] M. M. van Oers and J. M. Vlak, “Baculovirus genomics,” Current Drug Targets, vol. 8, no. 10, pp. 1051–1068, 2007.
[22] G. F. Rohrman, Baculovirus Molecular Biology, National Library of Medicine (US), NCBI, Bethesda, Md, USA, 2008.
[23] T. Hayakawa, G. F. Rohrmann, and Y. Hashimoto, “Patterns of genome organization and content in lepidopteran baculoviruses,” Virology, vol. 278, no. 1, pp. 1–12, 2000.
[24] C. B. McCarthy and D. A. Theilmann, “AcMNPV ac143 (odv-e18) is essential for mediating budded virus production and is the 30th baculovirus core gene,” Virology, vol. 375, no. 1, pp. 277–291, 2008.
[25] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0,” Molecular Biology and Evolution, vol. 24, no. 8, pp. 1596–1599, 2007.

[26] R. M. Schwartz and M. O. Dayhoff, “Matrices for detecting distant relationships,” in Atlas of Protein Sequences, M. O. Dayhoff, Ed., pp. 353–358, National Biomedical Research Foundation, 1979.
[27] Q. Fan, S. Li, L. Wang et al., “The genome sequence of the multinucleocapsid nucleopolyhedrovirus of the Chinese oak silkworm Antheraea pernyi,” Virology, vol. 366, no. 2, pp. 304–315, 2007.
[28] Z. M. Nie, Z. F. Zhang, D. Wang et al., “Complete sequence and organization of Antheraea pernyi nucleopolyhedrovirus, a dr-rich baculovirus,” BMC Genomics, vol. 8, Article ID 248, 2007.
[29] J. V. de Castro Oliveira, J. L. C. Wolff, A. Garcia-Maruniak et al., “Genome of the most widely used viral biopesticide: anticarsia gemmatalis multiple nucleopolyhedrovirus,” Journal of General Virology, vol. 87, no. 11, pp. 3233–3250, 2006.
[30] M. D. Ayres, S. C. Howard, J. Kuzio, M. Lopez-Ferber, and R. D. Possee, “The complete DNA sequence of Autographa californica nuclear polyhedrosis virus,” Virology, vol. 202, no. 2, pp. 586–605, 1994.
[31] S. Gomi, K. Majima, and S. Maeda, “Sequence analysis of the genome of Bombyx mori nucleopolyhedrovirus,” Journal of General Virology, vol. 80, no. 5, pp. 1323–1337, 1999.
[32] Y. P. Xu, Z. P. Ye, C. Y. Niu et al., “Comparative analysis of the genomes of Bombyx mandarina and Bombyx mori nucleopolyhedroviruses,” Journal of Microbiology, vol. 48, no. 1, pp. 102–110, 2010.
[33] J. G. de Jong, H. A. M. Lauzon, C. Dominy et al., “Analysis of the Choristoneura fumiferana nucleopolyhedrovirus genome,” Journal of General Virology, vol. 86, no. 4, pp. 929–943, 2005.
[34] H. A. M. Lauzon, P. B. Jamieson, P. J. Krell, and B. M. Arif, “Gene organization and sequencing of the Choristoneura fumiferana defective nucleopolyhedrovirus genome,” Journal of General Virology, vol. 86, no. 4, pp. 945–961, 2005.
[35] O. Hyink, R. A. Dellow, M. J. Olsen et al., “Whole genome analysis of the Epiphyas postvittana nucleopolyhedrovirus,” Journal of General Virology, vol. 83, no. 4, pp. 957–971, 2002.
[36] M. Ikeda, M. Shikata, N. Shirata, S. Chaeychomsri, and M. Kobayashi, “Gene organization and complete sequence of the Hyphantria cunea nucleopolyhedrovirus genome,” Journal of General Virology, vol. 87, no. 9, pp. 2549–2562, 2006.
[37] Y. R. Chen, C. Y. Wu, S. T. Lee et al., “Genomic and host range studies of Maruca vitrata nucleopolyhedrovirus,” Journal of General Virology, vol. 89, no. 9, pp. 2315–2330, 2008.
[38] C. H. Ahrens, R. L. Q. Russell, C. J. Funk, J. T. Evans, S. H. Harwood, and G. F. Rohrmann, “The sequence of the Orgyia pseudotsugata multinucleocapsid nuclear polyhedrosis virus genome,” Virology, vol. 229, no. 2, pp. 381–399, 1997.
[39] R. L. Harrison and B. C. Bonning, “The nucleopolyhedroviruses of Rachiplusia ou and Anagrapha falcifera are isolates of the same virus,” Journal of General Virology, vol. 80, no. 10, pp. 2793–2798, 1999.
[40] M. Nakai, C. Goto, W. Kang, M. Shikata, T. Luque, and Y. Kunimi, “Genome sequence and organization of a nucleopolyhedrovirus isolated from the smaller tea tortrix, Adoxophyes honmai,” Virology, vol. 316, no. 1, pp. 171–183, 2003.
[41] S. Hilton and D. Winstanley, “Genomic sequence and biological characterization of a nucleopolyhedrovirus isolated from the summer fruit tortrix, Adoxophyes orana,” Journal of General Virology, vol. 89, no. 11, pp. 2898–2908, 2008.
[42] A. K. Jakubowska, S. A. Peters, J. Ziemnicka, J. M. Vlak, and M. M. van Oers, “Genome sequence of an enhancin gene-rich nucleopolyhedovirus (NPV) from Agrotis segetum: collinearity with Spodoptera exigua multiple NPV,” Journal of General Virology, vol. 87, no. 3, pp. 537–551, 2006.
[43] M. M. van Oers, M. H. C. Abma-Henkens, E. A. Herniou, J. C. W. de Groot, S. Peters, and J. M. Vlak, “Genome sequence of Chrysodeixis chalcites nucleopolyhedrovirus, a baculovirus with two DNA photolyase genes,” Journal of General Virology, vol. 86, no. 7, pp. 2069–2080, 2005.
[44] S. Y. Zhu, J. P. Yi, W. D. Shen et al., “Genomic sequence, organization and characteristics of a new nucleopolyhedrovirus isolated from Clanis bilineata larva,” BMC Genomics, vol. 10, Article ID 91, 9 pages, 2009.
[45] X. C. Ma, J. Y. Shang, Z. N. Yang, Y. Y. Bao, Q. Xiao, and C. X. Zhang, “Genome sequence and organization of a nucleopolyhedrovirus that infects the tea looper caterpillar, Ectropis obliqua,” Virology, vol. 360, no. 1, pp. 235–246, 2007.
[46] X. D. Tang, Q. Xiao, X. C. Ma, Z. R. Zhu, and C. X. Zhang, “Morphology and genome of Euproctis pseudoconspersa nucleopolyhedrovirus,” Virus Genes, vol. 38, no. 3, pp. 495–506, 2009.
[47] C. X. Zhang, X. C. Ma, and Z. J. Guo, “Comparison of the complete genome sequence between C1 and G4 isolates of the Helicoverpa armigera single nucleocapsid nucleopolyhedrovirus,” Virology, vol. 333, no. 1, pp. 190–199, 2005.
[48] J. G. Ogembo, S. Chaeychomsri, K. Kamiya et al., “Cloning and comparative characterization of nucleopolyhedroviruses isolated from African bollworm, Helicoverpa armigera, (Lepidoptera: Noctudiae) in different geographic regions,” Journal of Insect Biotechnology and Sericology, vol. 76, no. 1, pp. 39–49, 2007.
[49] X. Chen, W. F. J. Ijkel, C. Dominy et al., “Identification, sequence analysis and phylogeny of the lef-2 gene of Helicoverpa armigera single-nucleocapsid baculovirus,” Virus Research, vol. 65, no. 1, pp. 21–32, 1999.
[50] H. Xiao and Y. Qi, “Genome sequence of Leucania seperata nucleopolyhedrovirus,” Virus Genes, vol. 35, no. 3, pp. 845–856, 2007.
[51] J. Kuzio, M. N. Pearson, S. H. Harwood et al., “Sequence and analysis of the genome of a baculovirus pathogenic for Lymantria dispar,” Virology, vol. 253, no. 1, pp. 17–34, 1999.
[52] Y. S. Nai, C. Y. Wu, T. C. Wang et al., “Genomic sequencing and analyses of Lymantria xylina multiple nucleopolyhedrovirus,” BMC Genomics, vol. 11, no. 1, Article ID 116, 2010.
[53] S. Li, M. Erlandson, D. Moody, and C. Gillott, “A physical map of the Mamestra configurata nucleopolyhedrovirus genome and sequence analysis of the polyhedrin gene,” Journal of General Virology, vol. 78, no. 1, pp. 265–271, 1997.
[54] L. Li, Q. Li, L. G. Willis, M. Erlandson, D. A. Theilmann, and C. Donly, “Complete comparative genomic analysis of two field isolates of Mamestra configurata nucleopolyhedrovirus-A,” Journal of General Virology, vol. 86, no. 1, pp. 91–105, 2005.
[55] L. Li, C. Donly, Q. Li et al., “Identification and genomic analysis of a second species of nucleopolyhedrovirus isolated from Mamestra configurata,” Virology, vol. 297, no. 2, pp. 226–244, 2002.
[56] R. L. Harrison, B. Puttler, and H. J. R. Popham, “Genomic sequence analysis of a fast-killing isolate of Spodoptera frugiperda multiple nucleopolyhedrovirus,” Journal of General Virology, vol. 89, no. 3, pp. 775–790, 2008.

[57] J. L. C. Wolff, F. H. Valicente, R. Martins, J. V. Oliveira, and P. M. Zanotto, “Analysis of the genome of Spodoptera frugiperda nucleopolyhedrovirus (SfMNPV-19) and of the high genomic heterogeneity in group II nucleopolyhedroviruses,” Journal of General Virology, vol. 89, no. 5, pp. 1202–1211, 2008.
[58] Y. Pang, J. Yu, L. Wang et al., “Sequence analysis of the Spodoptera litura multicapsid nucleopolyhedrovirus genome,” Virology, vol. 287, no. 2, pp. 391–404, 2001.
[59] L. G. Willis, R. Siepp, T. M. Stewart, M. A. Erlandson, and D. A. Theilmann, “Sequence analysis of the complete genome of Trichoplusia ni single nucleopolyhedrovirus and the identification of a baculoviral photolyase gene,” Virology, vol. 338, no. 2, pp. 209–226, 2005.
[60] S. Wormleaton, J. Kuzio, and D. Winstanley, “The complete sequence of the Adoxophyes orana granulovirus genome,” Virology, vol. 311, no. 2, pp. 350–365, 2003.
[61] S. R. Escasa, H. A. M. Lauzon, A. C. Mathur, P. J. Krell, and B. M. Arif, “Sequence analysis of the Choristoneura occidentalis granulovirus genome,” Journal of General Virology, vol. 87, no. 7, pp. 1917–1933, 2006.
[62] M. Lange and J. A. Jehle, “The genome of the Cryptophlebia leucotreta granulovirus,” Virology, vol. 317, no. 2, pp. 220–236, 2003.
[63] T. Luque, R. Finch, N. Crook, D. R. O’Reilly, and D. Winstanley, “The complete sequence of the Cydia pomonella granulovirus genome,” Journal of General Virology, vol. 82, no. 10, pp. 2531–2547, 2001.
[72] A. Garcia-Maruniak, J. E. Maruniak, P. M. A. Zanotto et al., “Sequence analysis of the genome of the Neodiprion sertifer nucleopolyhedrovirus,” Journal of Virology, vol. 78, no. 13, pp. 7036–7051, 2004.
[73] C. L. Afonso, E. R. Tulman, Z. Lu et al., “Genome sequence of a baculovirus pathogenic for Culex nigripalpus,” Journal of Virology, vol. 75, no. 22, pp. 11157–11165, 2001.
[74] J. T. Evans, D. J. Leisy, and G. F. Rohrmann, “Characterization of the interaction between the baculovirus replication factors LEF-1 and LEF-2,” Journal of Virology, vol. 71, no. 4, pp. 3114–3119, 1997.
[75] A. L. Vanarsdall, K. Okano, and G. F. Rohrmann, “Characterization of the replication of a baculovirus mutant lacking the DNA polymerase gene,” Virology, vol. 331, no. 1, pp. 175–180, 2005.
[76] J. Huang and D. B. Levin, “Expression, purification and characterization of the Spodoptera littoralis nucleopolyhedrovirus (SpliNPV) DNA polymerase and interaction with the SpliNPV non-hr origin of DNA replication,” Journal of General Virology, vol. 82, no. 7, pp. 1767–1776, 2001.
[77] V. V. McDougal and L. A. Guarino, “Autographa californica nuclear polyhedrosis virus DNA polymerase: measurements of processivity and strand displacement,” Journal of Virology, vol. 73, no. 6, pp. 4908–4918, 1999.
[78] X. Hang and L. A. Guarino, “Purification of Autographa californica nucleopolyhedrovirus DNA polymerase from infected insect cells,” Journal of General Virology, vol. 80, no. 9, pp. 2519–2526, 1999.
[64] R. L. Harrison and H. J. R. Popham, “Genomic sequence
[79] J. G. M. Heldens, Y. Liu, D. Zuidema, R. W. Goldbach, and
analysis of a granulovirus isolated from the Old World
J. M. Vlak, “Characterization of a putative Spodoptera exigua
bollworm, Helicoverpa armigera,” Virus Genes, vol. 36, no. 3,
multicapsid nucleopolyhedrovirus helicase gene,” Journal of
pp. 565–581, 2008.
General Virology, vol. 78, no. 12, pp. 3101–3114, 1997.
[65] A. Taha, A. Nour-el-Din, L. Croizier, M. López Ferber, and
G. Croizier, “Comparative analysis of the granulin regions [80] S. Maeda, S. G. Kamita, and A. Kondo, “Host range
of the Phthorimaea operculella and Spodoptera littoralis expansion of Autographa californica nuclear polyhedrosis
virus (NPV) following recombination of a 0.6-kilobase-pair
granuloviruses,” Virus Genes, vol. 21, no. 3, pp. 147–155,
DNA fragment originating from Bombyx mori NPV,” Journal
2000.
of Virology, vol. 67, no. 10, pp. 6234–6238, 1993.
[66] Y. Hashimoto, T. Hayakawa, Y. Ueno, T. Fujita, Y. Sano, and
T. Matsumoto, “Sequence analysis of the Plutella xylostella [81] E. Ito, D. Sahri, R. Knippers, and E. B. Carstens, “Baculovirus
granulovirus genome,” Virology, vol. 275, no. 2, pp. 358–372, proteins IE-1, LEF-3, and P143 interact with DNA in vivo: a
2000. formaldehyde cross-linking study,” Virology, vol. 329, no. 2,
pp. 337–347, 2004.
[67] Y. Wang, J. Y. Choi, J. Y. Roh, S. D. Woo, B. R. Jin, and Y. H. Je,
“Molecular and phylogenetic characterization of Spodoptera [82] V. V. Mcdougal and L. A. Guarino, “The Autographa
litura granulovirus,” Journal of Microbiology, vol. 46, no. 6, californica nuclear polyhedrosis virus p143 gene encodes a
pp. 704–708, 2008. DNA helicase,” Journal of Virology, vol. 74, no. 11, pp. 5273–
[68] T. Hayakawa, R. Ko, K. Okano, S. I. Seong, C. Goto, 5279, 2000.
and S. Maeda, “Sequence analysis of the Xestia c-nigrum [83] D. K. Bideshi and B. A. Federici, “DNA-independent ATPase
granulovirus genome,” Virology, vol. 262, no. 2, pp. 277–297, activity of the Trichoplusia ni granulovirus DNA helicase,”
1999. Journal of General Virology, vol. 81, no. 6, pp. 1601–1604,
[69] S. P. Duffy, A. M. Young, B. Morin, C. J. Lucarotti, B. F. Koop, 2000.
and D. B. Levin, “Sequence analysis and organization of the [84] GE. Liu and E. B. Carstens, “Site-directed mutagenesis of
Neodiprion abietis nucleopolyhedrovirus genome,” Journal the AcMNPV p 143 gene: effects on baculovirus DNA
of Virology, vol. 80, no. 14, pp. 6952–6963, 2006. replication,” Virology, vol. 253, no. 1, pp. 125–136, 1999.
[70] H. A. M. Lauzon, C. J. Lucarotti, P. J. Krell, Q. Feng, A. [85] D. K. Bideshi and B. A. Federici, “The Trichoplusia ni
Retnakaran, and B. M. Arif, “Sequence and organization granulovirus helicase in unable to support replication of
of the Neodiprion lecontei nucleopolyhedrovirus genome,” Autographa californica multicapsid nucleopolyhedrovirus in
Journal of Virology, vol. 78, no. 13, pp. 7023–7035, 2004. cells and larvae of T. ni,” Journal of General Virology, vol. 81,
[71] H. A. M. Lauzon, A. Garcia-Maruniak, P. M. de A. Zanotto no. 6, pp. 1593–1599, 2000.
et al., “Genomic comparison of Neodiprion sertifer and [86] J. T. Evans, G. S. Rosenblatt, D. J. Leisy, and G. F.
Neodiprion lecontei nucleopolyhedroviruses and identifica- Rohrmann, “Characterization of the interaction between the
tion of potential hymenopteran baculovirus-specific open baculovirus ssDNA-binding protein (LEF 3) and putative
reading frames,” Journal of General Virology, vol. 87, no. 6, helicase (P143),” Journal of General Virology, vol. 80, no. 2,
pp. 1477–1489, 2006. pp. 493–500, 1999.
[87] O. Argaud, L. Croizier, M. López-Ferber, and G. Croizier, “Two key mutations in the host-range specificity domain of the p143 gene of Autographa californica nucleopolyhedrovirus are required to kill Bombyx mori larvae,” Journal of General Virology, vol. 79, no. 4, pp. 931–935, 1998.
[88] S. G. Kamita and S. Maeda, “Abortive infection of the baculovirus Autographa californica nuclear polyhedrosis virus in Sf-9 cells after mutation of the putative DNA helicase gene,” Journal of Virology, vol. 70, no. 9, pp. 6244–6250, 1996.
[89] G. Croizier, L. Croizier, O. Argaud, and D. Poudevigne, “Extension of Autographa californica nuclear polyhedrosis virus host range by interspecific replacement of a short DNA sequence in the p143 helicase gene,” Proceedings of the National Academy of Sciences of the United States of America, vol. 91, no. 1, pp. 48–52, 1994.
[90] G. Liu and E. B. Carstens, “Site-directed mutagenesis of the AcMNPV p143 gene: effects on baculovirus DNA replication,” Virology, vol. 253, no. 1, pp. 125–136, 1999.
[91] L. A. Guarino, B. Xu, J. Jin, and W. Dong, “A virus-encoded RNA polymerase purified from baculovirus-infected cells,” Journal of Virology, vol. 72, no. 10, pp. 7985–7991, 1998.
[92] L. A. Guarino, J. Jin, and W. Dong, “Guanylyltransferase activity of the LEF-4 subunit of baculovirus RNA polymerase,” Journal of Virology, vol. 72, no. 12, pp. 10003–10010, 1998.
[93] C. H. Gross and S. Shuman, “RNA 5’-triphosphatase, nucleoside triphosphatase, and guanylyltransferase activities of baculovirus LEF-4 protein,” Journal of Virology, vol. 72, no. 12, pp. 10020–10028, 1998.
[94] J. Jin, W. Dong, and L. A. Guarino, “The LEF-4 subunit of baculovirus RNA polymerase has RNA 5’-triphosphatase and ATPase activities,” Journal of Virology, vol. 72, no. 12, pp. 10011–10019, 1998.
[95] C. H. Gross and S. Shuman, “Characterization of a baculovirus-encoded RNA 5’-triphosphatase,” Journal of Virology, vol. 72, no. 9, pp. 7057–7063, 1998.
[96] J. S. Titterington, T. K. Nun, and A. L. Passarelli, “Functional dissection of the baculovirus late expression factor-8 gene: sequence requirements for late gene promoter activation,” Journal of General Virology, vol. 84, no. 7, pp. 1817–1826, 2003.
[97] C. Iorio, J. E. Vialard, S. McCracken, M. Lagacé, and C. D. Richardson, “The late expression factors 8 and 9 and possibly the phosphoprotein p78/83 of Autographa californica multicapsid nucleopolyhedrovirus are components of the virus-induced RNA polymerase,” Intervirology, vol. 41, no. 1, pp. 35–46, 1998.
[98] J. R. McLachlin and L. K. Miller, “Identification and characterization of vlf-1, a baculovirus gene involved in very late gene expression,” Journal of Virology, vol. 68, no. 12, pp. 7746–7756, 1994.
[99] A. Lu and L. K. Miller, “The roles of eighteen baculovirus late expression factor genes in transcription and DNA replication,” Journal of Virology, vol. 69, no. 2, pp. 975–982, 1995.
[100] J. W. Todd, A. L. Passarelli, A. Lu, and L. K. Miller, “Factors regulating baculovirus late and very late gene expression in transient-expression assays,” Journal of Virology, vol. 70, no. 4, pp. 2307–2317, 1996.
[101] A. L. Passarelli and L. K. Miller, “Identification of genes encoding late expression factors located between 56.0 and 65.4 map units of the Autographa californica nuclear polyhedrosis virus genome,” Virology, vol. 197, no. 2, pp. 704–714, 1993.
[102] M. E. Wilson and L. K. Miller, “Changes in the nucleoprotein complexes of a baculovirus DNA during infection,” Virology, vol. 151, no. 2, pp. 315–328, 1986.
[103] M. E. Wilson, T. H. Mainprize, P. D. Friesen, and L. K. Miller, “Location, transcription, and sequence of a baculovirus gene encoding a small arginine-rich polypeptide,” Journal of Virology, vol. 61, no. 3, pp. 661–666, 1987.
[104] M. Wang, E. Tuladhar, S. Shen et al., “Specificity of baculovirus P6.9 basic DNA-binding proteins and critical role of the C terminus in virion formation,” Journal of Virology, vol. 84, no. 17, pp. 8821–8828, 2010.
[105] S. M. Thiem and L. K. Miller, “Identification, sequence, and transcriptional mapping of the major capsid protein gene of the baculovirus Autographa californica nuclear polyhedrosis virus,” Journal of Virology, vol. 63, no. 5, pp. 2008–2018, 1989.
[106] M. N. Pearson, R. L. Q. Russell, G. F. Rohrmann, and G. S. Beaudreau, “p39, a major baculovirus structural protein: immunocytochemical characterization and genetic location,” Virology, vol. 167, no. 2, pp. 407–413, 1988.
[107] G. W. Blissard, R. L. Quant-Russell, G. F. Rohrmann, and G. S. Beaudreau, “Nucleotide sequence, transcriptional mapping, and temporal expression of the gene encoding p39, a major structural protein of the multicapsid nuclear polyhedrosis virus of Orgyia pseudotsugata,” Virology, vol. 168, no. 2, pp. 354–362, 1989.
[108] S. Lu, G. Ge, and Y. Qi, “Ha-VP39 binding to actin and the influence of F-actin on assembly of progeny virions,” Archives of Virology, vol. 149, no. 11, pp. 2187–2198, 2004.
[109] J. R. McLachlin and L. K. Miller, “Identification and characterization of vlf-1, a baculovirus gene involved in very late gene expression,” Journal of Virology, vol. 68, no. 12, pp. 7746–7756, 1994.
[110] S. Yang and L. K. Miller, “Control of baculovirus polyhedrin gene expression by very late factor 1,” Virology, vol. 248, no. 1, pp. 131–138, 1998.
[111] S. Yang and L. K. Miller, “Expression and mutational analysis of the baculovirus very late factor 1 (vlf-1) gene,” Virology, vol. 245, no. 1, pp. 99–109, 1998.
[112] V. S. Mikhailov and G. F. Rohrmann, “Binding of the baculovirus very late expression factor 1 (VLF-1) to different DNA structures,” BMC Molecular Biology, vol. 3, Article ID 14, 2002.
[113] A. L. Vanarsdall, K. Okano, and G. F. Rohrmann, “Characterization of the role of very late expression factor 1 in baculovirus capsid structure and DNA processing,” Journal of Virology, vol. 80, no. 4, pp. 1724–1733, 2006.
[114] V. S. Mikhailov, K. Okano, and G. F. Rohrmann, “Baculovirus alkaline nuclease possesses a 5’→3’ exonuclease activity and associates with the DNA-binding protein LEF-3,” Journal of Virology, vol. 77, no. 4, pp. 2436–2444, 2003.
[115] V. S. Mikhailov, K. Okano, and G. F. Rohrmann, “Specificity of the endonuclease activity of the baculovirus alkaline nuclease for single-stranded DNA,” Journal of Biological Chemistry, vol. 279, no. 15, pp. 14734–14745, 2004.
[116] K. Okano, A. L. Vanarsdall, and G. F. Rohrmann, “Characterization of a baculovirus lacking the alkaline nuclease gene,” Journal of Virology, vol. 78, no. 19, pp. 10650–10656, 2004.
[117] J. Olszewski and L. K. Miller, “Identification and characterization of a baculovirus structural protein, VP1054, required for nucleocapsid formation,” Journal of Virology, vol. 71, no. 7, pp. 5040–5050, 1997.
[118] R. L. Q. Russell and G. F. Rohrmann, “Characterization of P91, a protein associated with virions of an Orgyia pseudotsugata baculovirus,” Virology, vol. 233, no. 1, pp. 210–223, 1997.
[119] M. Whitford and P. Faulkner, “A structural polypeptide of the baculovirus Autographa californica nuclear polyhedrosis virus contains O-linked N-acetylglucosamine,” Journal of Virology, vol. 66, no. 6, pp. 3324–3329, 1992.
[120] J. Olszewski and L. K. Miller, “A role for baculovirus GP41 in budded virus production,” Virology, vol. 233, no. 2, pp. 292–301, 1997.
[121] W. Wu, T. Lin, L. Pan et al., “Autographa californica multiple nucleopolyhedrovirus nucleocapsid assembly is interrupted upon deletion of the 38K gene,” Journal of Virology, vol. 80, no. 23, pp. 11475–11485, 2006.
[122] W. Wu, H. Liang, J. Kan et al., “Autographa californica multiple nucleopolyhedrovirus 38K is a novel nucleocapsid protein that interacts with VP1054, VP39, VP80, and itself,” Journal of Virology, vol. 82, no. 24, pp. 12356–12364, 2008.
[123] C. M. Long, G. F. Rohrmann, and G. F. Merrill, “The conserved baculovirus protein p33 (Ac92) is a flavin adenine dinucleotide-linked sulfhydryl oxidase,” Virology, vol. 388, no. 2, pp. 231–235, 2009.
[124] W. Wu and A. L. Passarelli, “Autographa californica multiple nucleopolyhedrovirus Ac92 (ORF92, P33) is required for budded virus production and multiply enveloped occlusion-derived virus formation,” Journal of Virology, vol. 84, no. 23, pp. 12351–12361, 2010.
[125] Y. Nie, M. Fang, and D. A. Theilmann, “Autographa californica multiple nucleopolyhedrovirus core gene ac92 (p33) is required for efficient budded virus production,” Virology, vol. 409, no. 1, pp. 38–45, 2011.
[126] M. Fang, H. Wang, H. Wang et al., “Open reading frame 94 of Helicoverpa armigera single nucleocapsid nucleopolyhedrovirus encodes a novel conserved occlusion-derived virion protein, ODV-EC43,” Journal of General Virology, vol. 84, no. 11, pp. 3021–3027, 2003.
[127] F. Deng, R. Wang, M. Fang et al., “Proteomics analysis of Helicoverpa armigera single nucleocapsid nucleopolyhedrovirus identified two new occlusion-derived virus-associated proteins, HA44 and HA100,” Journal of Virology, vol. 81, no. 17, pp. 9377–9385, 2007.
[128] K. Peng, M. Wu, F. Deng et al., “Identification of protein-protein interactions of the occlusion-derived virus-associated proteins of Helicoverpa armigera nucleopolyhedrovirus,” Journal of General Virology, vol. 91, no. 3, pp. 659–670, 2010.
[129] A. L. Vanarsdall, M. N. Pearson, and G. F. Rohrmann, “Characterization of baculovirus constructs lacking either the Ac 101, Ac 142, or the Ac 144 open reading frame,” Virology, vol. 367, no. 1, pp. 187–195, 2007.
[130] G. Li, J. Wang, R. Deng, and X. Wang, “Characterization of AcMNPV with a deletion of ac68 gene,” Virus Genes, vol. 37, no. 1, pp. 119–127, 2008.
[131] C. B. McCarthy and D. A. Theilmann, “AcMNPV ac143 (odv-e18) is essential for mediating budded virus production and is the 30th baculovirus core gene,” Virology, vol. 375, no. 1, pp. 277–291, 2008.
[132] J. Ke, J. Wang, R. Deng, and X. Wang, “Autographa californica multiple nucleopolyhedrovirus ac66 is required for the efficient egress of nucleocapsids from the nucleus, general synthesis of preoccluded virions and occlusion body formation,” Virology, vol. 374, no. 2, pp. 421–431, 2008.
[133] M. Belyavskyi, S. C. Braunagel, and M. D. Summers, “The structural protein ODV-EC27 of Autographa californica nucleopolyhedrovirus is a multifunctional viral cyclin,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 19, pp. 11205–11210, 1998.
[134] S. C. Braunagel, H. He, P. Ramamurthy, and M. D. Summers, “Transcription, translation, and cellular localization of three Autographa californica nuclear polyhedrosis virus structural proteins: ODV-E18, ODV-E35, and ODV-EC27,” Virology, vol. 222, no. 1, pp. 100–114, 1996.
[135] H. Q. Chen, K. P. Chen, Q. Yao, Z. J. Guo, and L. L. Wang, “Characterization of a late gene, ORF67 from Bombyx mori nucleopolyhedrovirus,” FEBS Letters, vol. 581, no. 30, pp. 5836–5842, 2007.
[136] O. Simón, S. Gutiérrez, T. Williams, P. Caballero, and M. López-Ferber, “Nucleotide sequence and transcriptional analysis of the pif gene of Spodoptera frugiperda nucleopolyhedrovirus (SfMNPV),” Virus Research, vol. 108, no. 1-2, pp. 213–220, 2005.
[137] G. P. Pijlman, A. J. P. Pruijssers, and J. M. Vlak, “Identification of pif-2, a third conserved baculovirus gene required for per os infection of insects,” Journal of General Virology, vol. 84, no. 8, pp. 2041–2049, 2003.
[138] P. Faulkner, J. Kuzio, G. V. Williams, and J. A. Wilson, “Analysis of p74, a PDV envelope protein of Autographa californica nucleopolyhedrovirus required for occlusion body infectivity in vivo,” Journal of General Virology, vol. 78, no. 12, pp. 3091–3100, 1997.
[139] W. Zhou, L. Yao, H. Xu, F. Yan, and Y. Qi, “The function of envelope protein p74 from Autographa californica multiple nucleopolyhedrovirus in primary infection to host,” Virus Genes, vol. 30, no. 2, pp. 139–150, 2005.
[140] E. J. Haas-Stapleton, J. O. Washburn, and L. E. Volkman, “P74 mediates specific binding of Autographa californica M nucleopolyhedrovirus occlusion-derived virus to primary cellular targets in the midgut epithelia of Heliothis virescens larvae,” Journal of Virology, vol. 78, no. 13, pp. 6786–6791, 2004.
[141] L. Yao, W. Zhou, H. Xu, Y. Zheng, and Y. Qi, “The Heliothis armigera single nucleocapsid nucleopolyhedrovirus envelope protein P74 is required for infection of the host midgut,” Virus Research, vol. 104, no. 2, pp. 111–121, 2004.
[142] S. C. Braunagel, W. K. Russell, G. Rosas-Acosta, D. H. Russell, and M. D. Summers, “Determination of the protein composition of the occlusion-derived virus of Autographa californica nucleopolyhedrovirus,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 17, pp. 9797–9802, 2003.
[143] I. Kikhno, S. Gutiérrez, L. Croizier, G. Croizier, and M. López Ferber, “Characterization of pif, a gene required for the per os infectivity of Spodoptera littoralis nucleopolyhedrovirus,” Journal of General Virology, vol. 83, no. 12, pp. 3013–3022, 2002.
[144] T. Ohkawa, J. O. Washburn, R. Sitapara, E. Sid, and L. E. Volkman, “Specific binding of Autographa californica M nucleopolyhedrovirus occlusion-derived virus to midgut cells of Heliothis virescens larvae is mediated by products of pif genes Ac119 and Ac022 but not by Ac115,” Journal of Virology, vol. 79, no. 24, pp. 15258–15264, 2005.
[145] M. Fang, Y. Nie, S. Harris, M. A. Erlandson, and D. A. Theilmann, “Autographa californica multiple nucleopolyhedrovirus core gene ac96 encodes a per os infectivity factor (pif-4),” Journal of Virology, vol. 83, no. 23, pp. 12569–12578, 2009.
[146] S. C. Braunagel, D. M. Elton, H. Ma, and M. D. Summers, “Identification and analysis of an Autographa californica nuclear polyhedrosis virus structural protein of the occlusion-derived virus envelope: ODV-E56,” Virology, vol. 217, no. 1, pp. 97–110, 1996.
[147] W. O. Sparks, R. L. Harrison, and B. C. Bonning, “Autographa californica multiple nucleopolyhedrovirus ODV-E56 is a per os infectivity factor, but is not essential for binding and fusion of occlusion-derived virus to the host midgut,” Virology, vol. 409, no. 1, pp. 69–76, 2011.
[148] D. P. A. Cohen, M. Marek, B. G. Davies, J. M. Vlak, and M. M. van Oers, “Encyclopedia of Autographa californica nucleopolyhedrovirus genes,” Virologica Sinica, vol. 24, no. 5, pp. 359–414, 2009.
[149] Y. Jiang, F. Deng, S. Rayner, H. Wang, and Z. Hu, “Evidence of a major role of GP64 in group I alphabaculovirus evolution,” Virus Research, vol. 142, no. 1-2, pp. 85–91, 2009.
[150] E. A. Herniou and J. A. Jehle, “Baculovirus phylogeny and evolution,” Current Drug Targets, vol. 8, no. 10, pp. 1043–1050, 2007.
[151] P. M. de Andrade Zanotto and D. C. Krakauer, “Complete genome viral phylogenies suggests the concerted evolution of regulatory cores and accessory satellites,” PLoS ONE, vol. 3, no. 10, Article ID e3500, 2008.
[152] D. Goodman, N. Ollikainen, and C. Sholley, “Baculovirus phylogeny based on genome rearrangements,” in Proceedings of the International Conference on Comparative Genomics, vol. 4751 of Lecture Notes in Computer Science, pp. 69–82, 2007.
