10201148
BY
ARSHAN NASIR
THESIS
Urbana, Illinois
Adviser:
ABSTRACT
machinery and virus-specific parasites have raised important questions about their origin.
Evidence advocates for their inclusion into global phylogenomic studies and their consideration
as a distinct and ancient form of life. Here we reconstruct phylogenies describing the evolution
of proteomes and protein domain structures of viruses and cells that define viruses as a ‘fourth
supergroup’ along with cellular superkingdoms Archaea, Bacteria, and Eukarya. Universal trees
of life (uToLs) place viruses at their root and trees of domains indicate they have evolved via
massive reductive evolutionary processes. Since viral domains are widespread among cellular
proteomes we propose that viruses mediate gene transfer between cellular species and crucially
enhance biodiversity. Results call for a change in the way viruses are perceived. They likely
represent a distinct and most ancient form of life and a very crucial part of our planet’s
ACKNOWLEDGEMENTS
I am grateful to many people for their continuous support, understanding and endless
love. First and foremost, I am grateful to my late father Muhammad Nasir Qureshi and my
mother Shaheen Nasir. My parents, brother and sister have given me their unequivocal support
throughout for which my mere expression of thanks does not suffice. It is only due to their
efforts and prayers that I am able to accomplish this much. This thesis would not have been
possible without the input of my principal adviser, Professor Gustavo Caetano-Anolles. I wish to
express my gratitude for his kindness, invaluable assistance, support and guidance throughout the
duration of my degree. I consider myself fortunate to start my research career under his
supervision. The good advice and friendship of my second supervisor, Dr. Kyung Mo Kim has
been invaluable on both an academic and a personal level, for which I am extremely grateful. I also
wish to thank Drs. Sandra Rodriguez-Zas and James Whitfield for serving on my graduate
committee and for valuable comments. Thanks are also extended to Liudmila Yafremava and
Kyung Mo Kim for information regarding the organism lifestyles, which proved useful in my
Pakistan for providing fellowship to support my education. Last, but by no means least, I
acknowledge my friends Zohaib, Afraz, Yasir, Masood, Waqar, Nadeem, Ahmed, Tahir, Abbas,
Zeeshan, Tayyab, and Rafia for their encouragement. In the end, I am happy to have worked with
the most amazing group of young scientists at the Department of Crop Science, UIUC and the
TABLE OF CONTENTS
REFERENCES
APPENDIX A
APPENDIX B
CHAPTER 1: BACKGROUND
Viruses are tiny infectious agents that depend upon a cellular host to reproduce. They
have a very simple architecture and consist of genetic material (either DNA or RNA) that is
enclosed inside a protein coat called a capsid (1). Some viruses (e.g., influenza viruses)
additionally have lipid membranes that cover these capsids. These outer membranes are
recognized as viral envelopes (2). Envelopes mediate entry of the viral genome into host cells by
means of interaction between the surface glycoproteins present on viral envelopes and receptors
Viruses infect all cellular (Archaea, Bacteria and Eukarya) (4) and acellular life forms
(virophages i.e., parasites of viruses) (5). Viruses that infect Bacteria are known as
serving as model systems (6). Viruses isolated from Archaea are highly diverse and
morphologically complex (7). Many have been isolated from extreme thermophilic environments
raising intriguing questions about their origins (7). Eukaryal viruses are responsible for
numerous diseases such as Hepatitis, AIDS, and flu. Additionally, there is now evidence for the
existence of virophages, i.e., viruses that infect other viruses (5, 8, 9). This ubiquitous nature of
viruses suggests they are biologically active and able to affect cellular life (8).
Collectively, the group of all viruses is referred to as the ‘virosphere’ (10). The
virosphere is the most abundant and diverse group of biological entities on planet Earth (10).
Their abundance is noteworthy especially in marine environments where they are highly
concentrated (11). In seawater, there are 5-10 virus-like particles (mostly bacteriophages) per
bacterium (12) and nearly 10 million bacteriophages per milliliter (13). More than 5,000
genotypes of viruses can be found in 100 L of seawater (14) and nearly 94% of the oceanic
In addition, the virosphere exhibits remarkable levels of diversity in terms of physical and
genome size, morphology and other characteristics. Viruses range from being very small (~200
nm in diameter) (1) to giant viruses, with some viruses being as large as numerous bacteria (e.g.,
mimiviruses and megaviruses) (16, 17). The number of genes encoded by the viral genome
ranges from 2-3 genes (e.g., retroviruses) to nearly 1,000 for mimiviruses and megaviruses (16,
17). Morphologically, they form various structures (e.g., helical, polyhedral, complex etc.).
Genetic material can either be DNA (e.g., adenoviruses) or RNA (e.g., human immunodeficiency
virus) that in turn can either be single stranded or double stranded (1). As such, the remarkable
abundance and diversity of the virosphere has led to its acceptance as an integral component of
The question about the origin of viruses is fundamental to our understanding of life.
Unfortunately, studying viral origins is challenging because first, unlike cellular organisms, they
do not have any fossil record. We are therefore left to deduce their evolutionary trajectories by
studying modern day viruses. Second, the virosphere is highly diverse (as explained above).
Such high levels of diversity, and the knowledge that no single gene unites all the viruses,
suggest they had multiple evolutionary origins (i.e., their origin was polyphyletic) (18). In
contrast, living cellular organisms are united by the presence of the ribosome and a set of genes
that are shared by all the organisms (19). There are however groups of viruses that share
conserved sets of genes and appear to have a common evolutionary origin (i.e., monophyletic)
(see below). However, the virosphere as a whole has been suggested not to be monophyletic
(20). Given these realizations, three theories have been proposed to explain the origins of
viruses.
Proponents of this theory argue that viruses are more ancient than the cellular organisms
and predate the ‘Last Universal Common Ancestor’ (LUCA). This theory implies that ancient
viruses had the ability to self-sustain and replicate on their own (18, 21, 22). This theory has
received a fair amount of criticism because all modern day viruses need a cellular host to
This theory states that viruses are degenerates (reduced forms) of modern day parasitic
organisms and evolved by reductive evolutionary processes. In doing so, they lost all the genes
required for self-sustenance (e.g., metabolic genes) and maintained only those required to
The reduction hypothesis has been contested on grounds of lack of intermediates between
viruses and parasitic organisms (23). Recently, giant viruses (e.g., mimiviruses and megaviruses)
have been shown to cross the barriers between viruses and small parasitic organisms (16),
The escape hypothesis considers viruses to have evolved from genetic material of the
host cell that escaped from its control and developed autonomous evolving abilities (25).
Support for this theory comes from the observation that modern day viruses can act as mediators
of gene transfer and pick-pocket genes from their cellular hosts, a mechanism known as
horizontal gene transfer (HGT). Some scientists believe that HGT has been a dominant force in
shaping the viral proteomes (26, 27). This theory, however, fails to explain the presence of
structures and genetic material in viruses that have no homologs in the cellular world (10, 28).
The theories presented above have been highly debated and questioned (23). Some
scientists believe that a composite or new explanation may be more accurate and sound. Raoult
and Forterre (2008) proposed the division of life into two forms where cellular organisms are
united by the presence of ribosomes (ribosome encoding organisms, REOs) and viruses united by
Viruses are generally regarded as neither living nor non-living (29). They are entirely
dependent upon the host cells for reproduction and for this reason are considered to be obligate
intracellular parasites (29). Viruses, however, follow both biologically active and inactive
life cycles. Once inside the cells, they either follow the ‘lytic life cycle’ during which they lyse
the cell and release viral progeny or the ‘lysogenic life cycle’ during which they integrate their
genome into the host DNA and remain dormant (30). One of the striking findings to come out of
the Human Genome Project was the realization that endogenous retroviruses
account for nearly 8% of the human genome (31). These endogenous retroviruses resemble
modern day retroviruses and are considered to be the remnants of ancient viral infections (31).
Outside the host cells, viruses are metabolically inactive and survive long periods of time as
crystals (chemical objects) (32). Based on these realizations and dual life cycles (both biological
and chemical), viruses are regarded as intermediates between living and non-living forms (29)
One important realization is that not all viruses have minute genomes. In particular, a
large group of dsDNA viruses that infect Eukarya have large genomes with hundreds of genes
(33, 34). This group is known as Nucleo-Cytoplasmic Large DNA Viruses (NCLDV) and is
apparently a monophyletic group that includes five major families of viruses including
Unlike other viruses that are strictly dependent on the host cells for replication, NCLDV
are not exclusively dependent (35). They encode several key proteins involved in the processes
of DNA replication, translation, and DNA repair (20). Additionally, all the families of NCLDV
share a conserved set of 9 genes that are present in all the families and 22 genes present in at
least 3 out of 4 families (34). Presence of this conserved core suggests that NCLDV originated
Mimiviridae
The most exciting addition to the NCLDV was the introduction of Mimiviridae. The first
by researchers in France (36). APMV (simply mimivirus) is one of the largest and most complex
viruses known to date and, for that reason, was initially misidentified as a Gram-positive
bacterium until its viral nature was described in 2003 (16, 36). Mimivirus infects amoeba and
indirect evidence (presence of antibodies) suggests it may be responsible for human pneumonia
(37). Mimivirus is a ‘Gulliver among Lilliputians’ (38) as its sheer physical (750 nm diameter)
and genome size (>1Mb) outstrips almost all the known viruses and numerous parasitic bacteria
(16). Mimivirus has many other additional ‘unviral’ features as well, such as the presence of
genes related to DNA repair and protein translation. Mimivirus enters the host cell by a unique
entry mechanism called ‘Stargate’ (39) and encodes promoter-like elements that share structural
resemblance with the eukaryotic promoters (40). Mimivirus relatives are highly abundant in
marine environments (41) suggesting it is not the only known giant virus. Recently, the
discovery of Megavirus chilensis (megavirus) and description of its genome identified megavirus
as the largest virus known to date (17). Both the particle and genome size of megavirus exceed
Our knowledge about the viral world has increased dramatically since the discovery of
viruses harboring large genomes (16, 17, 36). These viruses defy the definition of virus,
overlapping numerous bacteria both in genomic and physical features. Their existence has raised
interesting questions about the status of viruses in the living world and called for a redefinition
(19).
Some scientists believe that the ability to release viral progeny and the overlap between
giant viruses and cellular organisms supports their identification as a distinct form of life (42).
Others believe that because most viruses have a simpler organization and lack ribosomes
(essential components of translation machinery) they cannot and should not be compared to the
cellular organisms (27, 29). The question remains one of the most debated ones in the field of
study the evolutionary relationships between viruses and cellular organisms on firmer grounds.
Sequencing of viral genomes with genome sizes comparable to bacteria gives us the opportunity
to reconstruct their phylogenetic history by comparing them to other viruses and/or cellular
organisms.
The concept of a Tree of Life (ToL)
The unusual genomic and physical features of giant viruses argue for the inclusion of
viruses, at least the ones with large genomes, into the universal tree of life (ToL) together with
the cellular proteomes of Archaea, Bacteria and Eukarya. The ToL is a model that explains
diversification of living organisms into three distinct superkingdoms i.e., Archaea, Bacteria and
Eukarya. It was first described by Carl Woese who identified Archaea as a distinct domain of life
and proposed the three-domain classification system (44, 45). Prior to Woese’s work, the
organisms of the ToL were divided into prokaryotic and eukaryotic groups.
trees are built from data that are either morphological or molecular features (characters),
expressing different values or properties (character states), and describe the history of a
biological system (taxa) (46). These trees are network representations with branches, nodes
connecting branches and leaves (taxa). Prior to the advancements in molecular biology,
phylogenies were restricted to the use of morphological data and fossil records. These
phylogenies were limited due to the lack of comparable characters between the many lineages of
The advances in molecular biology enabled the use of molecular characters for
phylogenetic reconstruction, such as gene sequence, gene order, gene content and structural
information (47). This led to many studies that grouped organisms into kingdoms and
however were not included in these analyses (44, 45, 48, 49).
Discovery of giant viruses encouraged scientists to place viruses on the global
evolutionary scale. Initial phylogenomic analysis based on the concatenation of seven universally
polymerase II largest subunit, RNA polymerase II second largest subunit, Proliferating cell
nuclear antigen, and 5’-3’ exonuclease) in the proteomes of mimivirus, Archaea, Bacteria and
Eukarya placed mimivirus at the base of Eukarya and identified mimivirus as a distinct
superkingdom in the ToL (16). This study was challenged when a later analysis of
individual proteins, rather than the concatenated set, consistently nested mimivirus within
Eukarya and close to amoeba (the cellular host of mimivirus) (50). The authors proposed that
the previous analysis was flawed due to the use of concatenated protein sequences that could
produce misleading trees if proteins evolved by different evolutionary mechanisms (50). Nesting
of mimivirus within Eukarya suggested that the gene repertoire of mimivirus was acquired from
amoeba by HGT (50). These experiments highlight the differences in the methods used to build
phylogenetic trees (e.g., concatenated set vs. individual sequences) and suggest that HGT has
tailored the proteome of mimiviruses. However, both these analyses used only a limited number
of species and a very small set of proteins. Increased sampling of species from Archaea,
Bacteria, Eukarya and a large number of giant viruses coupled with a phylogenetic model that is
robust against the artifacts described above appears to be the most likely solution to study the
In general, there are two main approaches: sequence-based approaches that utilize
nucleic acid or protein sequence information and structure-based approaches that focus on
Sequence-based approaches deal with gene sequence(s) and use an alignment file to
protein sequences such that homologous regions are aligned and identified easily. The aligned
regions are of particular interest and reflect structural and evolutionary importance whereas gaps
in alignment represent past insertion or deletion events (i.e., indels) (51). Multiple genes can be
aligned simultaneously (52) and large sampling of taxa is advised to increase the power of
phylogenies. Other features such as gene content and gene order (whole-genome features) have
been utilized in the past and do not require alignment computation (53, 54). These features compare
genomes simultaneously and are considered superior to sequence alignment because
changes in gene content and gene order in genomes result in billions of possible states as
opposed to only four states for nucleotide alignment and twenty for protein alignment (46). As
such, they are relatively robust against the effects of HGT and problems resulting from the use of
sequence alignment (read below) (46). Both approaches are computationally intensive and
depend upon the phylogenetic model, data and assumptions that can be influential when
2. Structure-based approaches
reap the benefit of processes occurring at higher and more conserved levels of the structural
hierarchy (read below). These processes are responsible for the redundant appearance and
accumulation of modules in the structure of living organisms [e.g., how many times particular
protein domains are present in proteomes (i.e., genomic abundance)]. Domain structure is
maintained in proteins and remains conserved as opposed to the amino acid sequence that is
dynamic and variable (48, 55-58). For this reason, protein domains are also considered
evolutionary units (48, 59, 60) and make useful ‘phylogenetic characters’ when recovering deep
evolutionary relatedness (59, 62, 63) such as in the well-established ‘Structural Classification of
Proteins’ (SCOP) (59, 63). In SCOP, domains having high sequence conservation (>30%) are
grouped into fold families (FFs), FFs with evidence of common ancestry, based on structural and
functional relatedness, into FSFs, FSFs with common secondary structure in similar
arrangements into folds (Fs), and Fs sharing a common secondary structure into classes (58, 59,
63). Domains are identified using concise classification strings (css) (e.g., d.211.1.1, where d
represents the protein class, 211 the F, 1 the FSF and 1 the FF). The 110,800 domains indexed in
SCOP 1.75 (corresponding to 38,221 PDB entries) are classified into 1,195 F, 1,962 FSFs, and
3,902 FFs. Compared to the number of protein entries in UniProt (534,695 total entries as of
February 22, 2012), the number of domain structural designs is considerably smaller and suggests that the
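
The concise classification string described above can be unpacked mechanically. The following is a minimal sketch, assuming a full four-part css such as d.211.1.1; the function and type names are illustrative and are not part of SCOP or any SCOP tooling.

```python
from typing import NamedTuple


class ScopId(NamedTuple):
    """Levels encoded in a SCOP concise classification string (css)."""
    protein_class: str  # e.g. "d"
    fold: str           # F,   e.g. "d.211"
    superfamily: str    # FSF, e.g. "d.211.1"
    family: str         # FF,  e.g. "d.211.1.1"


def parse_css(css: str) -> ScopId:
    """Split a four-part css into its class, F, FSF and FF identifiers."""
    parts = css.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected class.fold.superfamily.family, got {css!r}")
    cls, fold, fsf, ff = parts
    return ScopId(
        protein_class=cls,
        fold=f"{cls}.{fold}",
        superfamily=f"{cls}.{fold}.{fsf}",
        family=f"{cls}.{fold}.{fsf}.{ff}",
    )


if __name__ == "__main__":
    print(parse_css("d.211.1.1"))
```

Truncating the css in this way is what allows domain counts to be pooled at the FSF level rather than at the FF level.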
Because domains defined at higher levels of SCOP classification (i.e., FSFs) are more
conserved than domains defined at lower levels (i.e., FFs and sequences) and less likely to be
affected by HGT (61, 64-67), the genomic abundance of FSFs in proteomes has been
successfully utilized as the phylogenomic characters to reconstruct the trees of proteomes (ToPs)
and trees of protein domains (ToDs) (48, 49). This focus on structure as a general evolutionary
principle of biology offers several advantages over standard phylogenetic methods and
overcomes important limitations imposed by the violation of assumptions that occur when
attempting to extract deep phylogenetic signal present in molecular sequence data. The top ten of
1. ToDs and ToPs are derived from non-parametric models of genomic abundance that
are free from problems of homology in the alignments of sequences (68, 69). Once structural and
SCOP (59, 63), homology is established. In contrast, sequence alignment remains problematic
because there is still not an objective function in bioinformatics that can describe homology in
2. Domain structure is more conserved than amino acid sequence and valuable when
3. ToDs and ToPs are not affected by the serious problem raised by Maddison (70) of
characters that are not applicable to all taxa in a data set [e.g., insertion/deletion (indel) sites in
affecting the validity of deep phylogenetic inference (72). This problem does not apply to
domain abundance, which is evolutionarily highly conserved, increases with time, and in doing
5. ToDs and ToPs are appropriately based on a historical analysis of molecular units of
evolution, function and structure, the protein domains (59, 73). In contrast, trees of genes
(ToGs) generally consider genes as evolutionary units. However, a substantial number of genes
code for proteins that have multiple domains [55% in Archaea, 72% in Bacteria, and 84% in
evolutionary histories (75). Sequence analysis fails to take into consideration the historical
relationships and evolutionary heterogeneities that exist between subsets of sequence sites. In
contrast, the study of molecular domains is impervious to the history of domain make up. It
speciation event) and paralogy (duplicated genes) in sequence analysis (76) is inapplicable to
domains at any level of structural abstraction, which by definition include all domain sequence
variants (59). According to SCOP definitions, protein domains that belong for example to an FF
8. Evolutionary processes such as convergent evolution and horizontal gene transfer can
confound phylogenetic analysis leading to erratic interpretations when dealing with molecular
sequence data. However, the effect of these processes on ToLs built from domain structures
defined at conserved and higher levels of SCOP hierarchy appears to be very limited (49, 64-67,
73). Even the lower Pfam hierarchical level of structural organization showed limited influence
of HGT (<10%) (65). Phylogenetic and statistical analyses have revealed that indeed the
convergent evolution of domain structures is rare (49, 64-67, 73) and the diversity of protein
(74). Additionally, the hypothesis suggesting multiple origins for proteins is not well supported
(77).
and phylogenetic error (78). ToDs are refractory to the problem since they sample the set of all
known domains (i.e., they portray history of an operationally finite set of taxa).
10. Finally, the most fundamental principle of phylogenetic analysis, character
independence, states that each character must serve as an independent hypothesis of evolution
(79). Violation of character independence is serious and results in phylogenies that do not reflect
true evolutionary history (80). Molecular structure is defined by interactions between amino acid
sites in a protein sequence. This fact alone violates character independence in sequence analysis,
especially when ToGs include sequences with structures that are divergent. In contrast, ToDs and
ToPs are free from violation of character independence as long as individual domains
has been used previously to reconstruct the ToLs proposing an early divergence of Archaea
relative to Bacteria and Eukarya (67, 81, 82), to reveal the reductive trends in the proteomes of
cellular organisms (48, 83), to reconstruct the structural make up of the last universal
common ancestor (LUCA) of life (49), to link molecules and planetary events (inferred from
For these reasons, global evolutionary statements considering viruses should be tested
with a phylogenetic model that is robust against the artifacts that result from the use of sequence-
based approaches. The genomic abundance counts of protein domains (defined into FSFs) are
useful phylogenetic character states that have been successfully utilized in the past to describe
the evolution of protein domains and organisms (see chapter 2 for description of methodology).
SUPERFAMILY hosts a library of hidden Markov models (HMMs) that are profiles
based on multiple sequence alignments, representing a protein superfamily (or family). These
models represent all the proteins with known structure and are generated by the iterative
Sequence Alignment and Modeling (SAM) algorithm (85). SAM takes as input a protein
sequence with known 3D structure and searches for its homologs in the sequence databases.
Once the homologs have been retrieved, a multiple sequence alignment file (MSA) is generated
that is used by SAM to generate a SUPERFAMILY model which best describes the features of
this alignment. SAM is considered to be one of the best algorithms for the detection of remote
protein homologies (86, 87). Once the models have been generated, any protein sequence with an
unknown structure can be searched against the entire library of these models and the model with
the best hit is returned. This model represents a FSF and that FSF is assigned to the query protein
sequence with high accuracy (error rate <1%) (85). SUPERFAMILY currently provides FSF
structural assignments for a total of 1,258 model organisms including 96 Archaea, 862 Bacteria
annotation contributes to the description of the functional make up of proteomes (58). Assigning
molecular functions to FSFs is a difficult task since the majority of FSFs (~80%) perform
multiple functions and are quite diverse (49). For example, most of the ancient FSFs, such as the
P-loop-containing NTP hydrolase FSF (c.37.1), are highly abundant in nature and include many
FFs (20 in case of c.37.1) (58). Each of those families may participate in multiple pathways and
perform related but different functions. The SUPERFAMILY functional annotation scheme
introduced by Vogel and Chothia is a one-to-one mapping between FSFs and molecular
functions and is based on information from various resources, including the Clusters of
Orthologous Groups (COG) and Gene Ontology (GO) databases and manual surveys (88-91).
When a FSF is involved in multiple functions, the most predominant function is assigned to that
multi-functional FSF under the assumption that it is the most ancient and predominantly present
in all proteomes. The error rate in assignments is estimated to be <10% for large FSFs and <20%
Intracellular processes (ICP), Extracellular processes (ECP), Regulation, General, and Other.
The major categories are subdivided into 50 minor categories and the mapping between FSFs
(defined in SCOP 1.73) and molecular functions corresponding to the major and minor is
In this study, we take advantage of both the SUPERFAMILY structural and functional
annotation schemes for domains defined at FSF level and make evolutionary statements about
the origin of viruses. For this exercise, we consider viruses with medium-to-very-large genomes
(mainly NCLDV) since their lineages apportion considerable diversity to the virosphere. We also
explore the functional make up of cellular proteomes and discover that it is remarkably
conserved with the exception of parasitic organisms. Results yield significant insights into the
evolution and origins of viruses and describe the mapping between structures and functions for
Introduction
Ever since the discovery of viruses, scientists have wondered about their origins and roles
in cellular evolution. There is considerable abundance and diversity in the virosphere (the group
of all viruses) for it to be regarded as an integral component of the biosphere (15). The last few
decades have seen an increase in our knowledge about the viruses, particularly boosted by the
discovery of viruses with large genomes (17, 36). These viruses question the way we have
perceived viruses in the past and call for a redefinition (9, 19, 92). In this study, we take
advantage of the advancements in the fields of genomics and bioinformatics to study the
evolution of viruses on a scale comparable to the cellular organisms. We sample viruses with
virospheric diversity (42). This study offers valuable insights into the origin and evolution of
Cellular dataset
at FSF level for publicly available sequenced genomes (85, 93). We downloaded the FSF
assignments for a total of 981 organisms (70 Archaea, 652 Bacteria, and 259 Eukarya) from the
SUPERFAMILY web server (date: August 29th, 2010). These 981 proteomes constitute the
cellular dataset.
Viral dataset
NCLDV and 5 viruses from Archaea, Bacteria and Eukarya from NCBI viral genome resource
SCOP 1.75) to the viral dataset using HMMs of structural recognition in SUPERFAMILY at
an E-value cutoff of 10⁻⁴. The cellular and viral datasets define a total dataset of 1,037
proteomes (56 Viruses, 70 Archaea, 652 Bacteria, and 259 Eukarya) with a total FSF repertoire
of 1,739 FSFs (91 FSFs had no representation in our datasets and were excluded from the
analysis).
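
The E-value filter, and the per-proteome tally of FSF occurrences that the phylogenomic analysis relies on, could be sketched as follows. This is a minimal illustration, assuming domain hits are available as (proteome, FSF, E-value) tuples; that layout and the function name are hypothetical and do not reflect the actual SUPERFAMILY output format. The strict inequality at the cutoff is also an assumption.

```python
from collections import Counter, defaultdict

E_VALUE_CUTOFF = 1e-4  # cutoff stated in the text


def count_abundance(hits, cutoff=E_VALUE_CUTOFF):
    """Tally how many times each FSF is detected in each proteome.

    hits: iterable of (proteome_id, fsf_id, e_value) tuples
          (hypothetical layout, for illustration only).
    Returns a mapping proteome_id -> Counter of fsf_id -> count.
    """
    abundance = defaultdict(Counter)
    for proteome, fsf, e_value in hits:
        # Keep only assignments that pass the E-value cutoff
        # (strictly below the cutoff is an assumption).
        if e_value < cutoff:
            abundance[proteome][fsf] += 1
    return abundance


if __name__ == "__main__":
    example_hits = [
        ("virus_1", "c.37.1", 1e-10),
        ("virus_1", "c.37.1", 1e-6),
        ("virus_1", "a.4.5", 1e-2),   # rejected by the cutoff
        ("archaeon_1", "c.37.1", 1e-5),
    ]
    for proteome, counts in count_abundance(example_hits).items():
        print(proteome, dict(counts))
```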
Phylogenomic analysis
We generated phylogenomic trees that describe the evolution of protein domains (ToDs)
and proteomes (ToPs) using the genomic abundance of FSFs as phylogenomic characters. We
began by counting the number of times each FSF was represented in each proteome for the total
dataset. We defined this count as the genomic abundance value (g) and presented these values in a
1,037 × 1,733 ([total number of proteomes] × [total number of FSFs]) matrix. Because large
genomes are expected to have higher counts of FSFs compared to smaller genomes, g values can
range from 0 (absent) to thousands (82, 87). In order to account for the unequal genome sizes
and large variances, we normalized the g values to a 0-23 scale in an alphanumeric format (0-9
In this formula, a and b represent a FSF and a proteome, respectively; g_ab describes the g
value of FSF a in proteome b; g_max is the maximum g value in the matrix; and the Round
function normalizes the g_ab value taking into account g_max and standardizes the values to a
0-23 scale (g_ab_norm) (82, 87). The normalization results in 24 transformed values that are
compatible with PAUP* ver. 4.0b10 (phylogenomic reconstruction software) (94) and represent
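
The normalization formula itself did not survive in this copy of the text. The sketch below therefore assumes a log-transformed form, Round[ln(g_ab + 1) / ln(g_max + 1) × 23], which matches the symbols defined above but should be checked against the cited sources (82, 87). The 24 states are written as 0-9 followed by A-N, consistent with N being the maximum character state. Function names are illustrative.

```python
import math

# 24 character states: 0-9 followed by A-N (N is the maximum state).
SYMBOLS = "0123456789ABCDEFGHIJKLMN"


def normalize(g_ab: int, g_max: int, scale: int = 23) -> int:
    """Rescale a raw abundance value to an integer on a 0-23 scale.

    Assumed form: Round[ln(g_ab + 1) / ln(g_max + 1) * scale].
    Note that Python's round() sends exact .5 ties to the nearest even
    integer, which may differ from the rounding used originally.
    """
    if g_max <= 0 or not 0 <= g_ab <= g_max:
        raise ValueError("require 0 <= g_ab <= g_max and g_max > 0")
    return round(math.log(g_ab + 1) / math.log(g_max + 1) * scale)


def encode(g_ab: int, g_max: int) -> str:
    """Return the alphanumeric character state for a raw abundance value."""
    return SYMBOLS[normalize(g_ab, g_max)]


if __name__ == "__main__":
    for g in (0, 1, 9, 99, 999):
        print(g, encode(g, g_max=999))
```

The logarithm compresses the large counts found in big genomes, so that a handful of highly abundant FSFs does not dominate the character matrix.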
For the trees describing evolution of FSF domains (i.e., ToDs), we declared N (maximum
value) as the ancestral character state under the assumption that the most abundant FSF
(character N) appeared first in evolution. For this reconstruction, FSFs were treated as taxa and
proteomes as characters. For the trees describing evolution of proteomes (i.e., ToPs), we declared
0 (minimal value) as the ancestral character state under the assumption that the ancestral
proteome had a simpler architecture and there was a progressive trend towards organismal
complexity. For this reconstruction, the data matrix was transposed to represent proteomes as taxa
and FSFs as characters. Ancestral states were declared using the ANCSTATE command and
trees were rooted using the Lundberg method, which does not require outgroup taxa to be specified.
Maximum Parsimony (MP) was used to search for the best possible tree. To evaluate the
reliability of phylogenetic trees, we carried out a bootstrap analysis with 1,000 replicates. From
the ToDs, we calculated the relative age of each FSF defined as the node distance (nd) using a
PERL script that counts the number of nodes from a hypothetical ancestral FSF at the base of the
tree to each leaf and reports it on a relative 0-1 scale (48). In order to evaluate homoplasy (i.e.,
conflict between data and tree) affecting the ToPs, retention indexes (ri) were calculated for
individual FSF characters using the ‘DIAG’ option in PAUP* (87, 95).
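
The node distance calculation can be illustrated in a few lines. This is a sketch only and not the PERL script used in the study. It assumes the rooted tree is given as nested tuples with string leaves, and that nd counts the internal nodes crossed after the root, so that a leaf attached directly to the root receives nd = 0 and the most derived leaf receives nd = 1. The leaf labels and topology are made up.

```python
def node_distances(tree):
    """Compute a relative node distance (0-1) for every leaf of a rooted tree.

    tree: nested tuples whose leaves are strings (hypothetical representation).
    """
    raw = {}
    # Iterative traversal avoids recursion limits on large, comb-like trees.
    stack = [(child, 0) for child in tree]
    while stack:
        node, crossed = stack.pop()
        if isinstance(node, str):
            raw[node] = crossed
        else:
            stack.extend((child, crossed + 1) for child in node)
    deepest = max(raw.values())
    if deepest == 0:
        return {leaf: 0.0 for leaf in raw}
    return {leaf: crossed / deepest for leaf, crossed in raw.items()}


if __name__ == "__main__":
    # Toy topology with placeholder labels; not taken from the actual ToDs.
    toy_tree = ("FSF1", ("FSF2", ("FSF3", "FSF4")))
    for leaf, nd in sorted(node_distances(toy_tree).items()):
        print(leaf, nd)
```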
superkingdoms, Archaea (A), Bacteria (B) and Eukarya (E), and viruses (V) (which we
ABEV category, those present in all but one superkingdom to the ABE, AEV, BEV, and ABV
categories, those present in two superkingdoms to the AB, AE, AV, BE, BV, and EV categories,
We used a previously reported distribution index (f) to describe the popularity of FSFs
across all the proteomes. This index ranges from 0 to 1 and represents the fraction of proteomes
harboring a particular architecture (48). An f value of 1 indicates that a particular FSF is present
in all the proteomes and a value close to zero indicates that it is present in only very few
proteomes.
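As a minimal sketch of the distribution index, the following Python snippet computes f as the fraction of proteomes that harbor each FSF. The proteome names and FSF labels are hypothetical placeholders, and the presence/absence input format is an assumption.

```python
def distribution_index(proteomes):
    """Return {fsf: f}, with f = proteomes harboring the FSF / all proteomes.

    proteomes: dict mapping a proteome name to the FSFs detected in it.
    """
    total = len(proteomes)
    counts = {}
    for fsfs in proteomes.values():
        for fsf in set(fsfs):  # presence/absence only, ignore copy number
            counts[fsf] = counts.get(fsf, 0) + 1
    return {fsf: hits / total for fsf, hits in counts.items()}


# Hypothetical toy data: four proteomes and three FSFs.
TOY_PROTEOMES = {
    "proteome_1": {"fsf_a", "fsf_b", "fsf_c"},
    "proteome_2": {"fsf_a", "fsf_b"},
    "proteome_3": {"fsf_a"},
    "proteome_4": {"fsf_a", "fsf_c"},
}

f = distribution_index(TOY_PROTEOMES)
```

Here fsf_a is present in every toy proteome (f = 1), whereas fsf_b and fsf_c are each present in half of them (f = 0.5).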
Estimating the bias in the spread of viral FSFs among cellular proteomes
The total repertoire of FSFs present in each cellular superkingdom was split into
two components: (i) the set of FSFs shared with viruses, and (ii) the set of cellular FSFs of that
superkingdom (shared or not shared with other superkingdoms). The f index for both components
was represented in a boxplot generated using the R programming language (http://www.r-project.org/).
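The split into the two components can be sketched as follows. Python is used here in place of R, and medians are reported instead of drawing a boxplot; the FSF labels and f values are hypothetical.

```python
from statistics import median


def split_f_by_viral_sharing(f_values, viral_fsfs):
    """Split the f values of one superkingdom into two components.

    f_values: {fsf: f} for the FSFs found in that superkingdom.
    viral_fsfs: set of FSFs that are also present in viruses.
    """
    shared = [f for fsf, f in f_values.items() if fsf in viral_fsfs]
    cellular = [f for fsf, f in f_values.items() if fsf not in viral_fsfs]
    return shared, cellular


# Hypothetical toy values for a single superkingdom.
TOY_F = {"fsf_1": 1.0, "fsf_2": 0.9, "fsf_3": 0.3, "fsf_4": 0.5, "fsf_5": 0.1}
TOY_VIRAL = {"fsf_1", "fsf_2"}

shared, cellular = split_f_by_viral_sharing(TOY_F, TOY_VIRAL)
median_shared = median(shared)
median_cellular = median(cellular)
```

The two lists returned by the function correspond to the two boxes drawn for each superkingdom.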
In order to estimate which taxonomic groups are enriched with viral FSFs, we compared
the counts of FSFs in cellular taxonomic groups (background) against the viral groups (sample).
The probability of enrichment of a particular taxonomic group was calculated using the
In this equation, M indicates the number of FSFs in the background; k indicates the
number of FSFs in the sample (i.e., viral taxonomic group); N is the total number of FSFs of the
two sets; n represents the total number of FSFs in the sample; and P(X = k) is the probability that
a random variable X takes the value k for a given taxonomic group (49, 66).
Referring to the equation above and previous literature (49, 66), we calculated P values for the
individual viral taxonomic groups having k/n larger (i.e., overrepresented) or smaller (i.e.,
underrepresented) than M/N, and evaluated statistical significance with 95% confidence level (P
< 0.05).
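A minimal sketch of the enrichment test is given below, assuming the standard form of the hypergeometric distribution, P(X = k) = C(M, k) C(N - M, n - k) / C(N, n). The function names and the toy numbers are hypothetical, and the choice of a one-tailed sum over the over- or under-represented tail is an assumption about how the P values were obtained.

```python
from math import comb


def hypergeom_pmf(k, n, M, N):
    """P(X = k): k marked items in a sample of n drawn without replacement
    from N items of which M are marked (standard hypergeometric form)."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def enrichment_p_value(k, n, M, N):
    """One-tailed P value; the tail is chosen by comparing k/n with M/N."""
    lowest = max(0, n - (N - M))   # smallest k that is possible
    highest = min(n, M)            # largest k that is possible
    if k / n > M / N:              # over-represented: P(X >= k)
        tail = range(k, highest + 1)
    else:                          # under-represented: P(X <= k)
        tail = range(lowest, k + 1)
    return sum(hypergeom_pmf(i, n, M, N) for i in tail)


# Toy numbers (hypothetical, not taken from Table 2.2).
p_point = hypergeom_pmf(k=2, n=3, M=4, N=10)
p_over = enrichment_p_value(k=2, n=3, M=4, N=10)
```

With these toy numbers k/n (0.67) exceeds M/N (0.40), so the group is treated as over-represented and the upper tail is summed.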
We used the functional annotation scheme described by Vogel and Chothia (88-90) to
assign molecular functions to the viral dataset. Mapping between the categories is described in
chapter 3 (58).
Results
distinct FSFs) were detected by the SUPERFAMILY HMMs in the sampled viral proteomes
(Table 2.1). Six out of these FSFs were unique to viruses (had no cellular representation) and
were part of a small subset of viral specific FSFs that are responsible for functions unique to
viruses, such as attachment to the host cell receptors and DNA (b.126.1, b.21.1, a.54.1, g.51.1),
inhibiting caspases to trigger anti-apoptosis (b.28.1), and acting as major capsid proteins
(b.121.2). On average, ~34% (total FSFs/total proteins) of proteins in viral proteomes received
SCOP structural assignments (Table 2.1). This was higher than expected, given the simplistic
and reduced nature of viral proteomes. As expected, mimivirus had the largest number of
to only 163 distinct FSFs, a rather poor repertoire when compared for example with FSFs present
in the proteomes of riboorganisms exhibiting free-living (FL) lifestyle (ranging from 407 FSFs
in Staphylothermus marinus to 1,084 in Capitella sp.). FSF number was however comparable to
theta (189 distinct FSFs), Nanoarchaeum equitans (211 distinct FSFs) and Candidatus
Hodgkinia (115 distinct FSFs)]. The average reuse level of mimiviral FSFs was quite low as well
(total FSFs/distinct FSFs = 3.25) but still comparable to that of organisms with similar genome
size or lifestyles (e.g., 3.03 in Staphylothermus marinus, 1.42 in Candidatus Hodgkinia, 2.01 in
Nanoarchaeum equitans, and 2.48 in Guillardia theta). Thus, while mimiviruses have a genome
size comparable to numerous small bacteria they also seem to share with them a very simple
proteome. Both the genomic and proteomic features of mimiviruses overlap significantly with
the obligate parasitic unicellular organisms and exhibit a level of complexity that was never
The sharing of FSF domains between cellular and viral taxonomic groups
FSFs are not equally shared by proteomes of the three cellular superkingdoms, Archaea
(A), Bacteria (B) and Eukarya (E), and viruses (V), which we here refer to collectively as
supergroups. In turn, FSFs exist that are uniquely present (groups A, B, E or V) or are shared by
two (AB, AE, AV, BE, BV, EV), three (ABE, ABV, AEV, BEV) or all superkingdoms (ABEV).
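The assignment of FSFs to the taxonomic groups listed above can be sketched in a few lines of Python. The FSF labels and their occurrences are hypothetical; only the group naming (letters in the order A, B, E, V) follows the text.

```python
from collections import Counter

SUPERGROUPS = "ABEV"  # Archaea, Bacteria, Eukarya, Viruses


def taxonomic_group(present_in):
    """Name the group of an FSF from the supergroups it occurs in,
    e.g. {"E", "B"} -> "BE" and {"A", "B", "E", "V"} -> "ABEV"."""
    return "".join(s for s in SUPERGROUPS if s in present_in)


# Hypothetical occurrences of five FSFs across the four supergroups.
TOY_OCCURRENCE = {
    "fsf_1": {"A", "B", "E", "V"},
    "fsf_2": {"B", "E"},
    "fsf_3": {"E"},
    "fsf_4": {"V", "E"},
    "fsf_5": {"E", "B", "A"},
}

groups = {fsf: taxonomic_group(where) for fsf, where in TOY_OCCURRENCE.items()}
tally = Counter(groups.values())  # number of FSFs per Venn group
```

The tally corresponds to the counts shown in each region of a four-set Venn diagram.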
A Venn diagram (Figure 2.1) describes FSF distributions in supergroups and highlights
the differential enrichment of viral FSFs within the cellular taxonomic groups. All cellular
taxonomic groups share FSFs with the viral superkingdom. ABE is the most populated group
with 557 FSFs, BE is the second largest with 291 FSFs and ABEV makes the third largest group
with 229 FSFs. Eukaryotes have the highest number of supergroup-specific FSFs with 335
(~19% of the total FSFs present only in eukaryotes). Bacteria have the second highest number
with 163 (9.37%) bacteria-only FSFs, followed by Archaea with 22 (1.26%) and viruses with 6
(0.345%) supergroup-specific FSFs, respectively. The relatively low number of viral specific
FSFs in our dataset can be explained by the fact that the current sequencing trend is biased
towards the sequencing of viruses with medical or economic importance (10). Sequencing of
more viral genomes, especially the large genomes of giant viruses (like mimiviruses and
megaviruses) will lead to the identification of more viral-specific FSFs. However, we expect that
patterns of FSF sharing between viral and cellular taxonomic groups will remain the same.
Primordial reductive evolutionary processes explain viral make up
We generated ToDs using information in the total dataset of 1,037 proteomes that
included free-living and parasitic riboorganisms and viruses and embodied 1,739 FSFs (Figure
2.2). These trees describe the appearance of protein domains in proteomes on an evolutionary
scale and represent structural chronologies that unfold directly from the trees by counting the
number of nodes along a lineage from the root to an individual leaf (i.e., an FSF). This node
distance (nd) defines the age of each FSF on a relative 0-1 scale, with nd = 0 (48) representing
the origin of proteins and nd = 1 the present, and is linearly proportional to geological time and
thus can be used to date domains defined at FSF level accurately (48, 84, 87). Previous studies
show that these chronologies uncovered unprecedented details of protein and proteome
diversification at fold, FSF and FF levels of structural complexity (48, 74). Here, trees of FSFs
highlight the distribution of FSFs present in viruses at an evolutionary time scale when nd values
are plotted against the relative number of riboorganisms and viruses using each FSF (distribution
index; f) (Figure 2.3). We find that most viral FSFs originated either very early or very late,
showing a clear bimodal pattern of proteome diversification. For example, both the oldest
(c.37.1) and youngest (d.211.1) FSFs are present in viruses (Figure 2.2).
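A simple way to inspect such a bimodal pattern is to bin FSFs by their nd values. The sketch below uses the cut-offs applied elsewhere in this chapter (nd < 0.4 for early and nd > 0.6 for late appearance). The nd values of c.37.1, d.104.1 and d.2.1 are those reported in this chapter; the other two entries are hypothetical placeholders.

```python
from collections import Counter


def nd_epoch(nd, early_cutoff=0.4, late_cutoff=0.6):
    """Classify a relative age (nd, 0-1 scale) into one of three epochs."""
    if nd < early_cutoff:
        return "early"
    if nd > late_cutoff:
        return "late"
    return "intermediate"


TOY_ND = {
    "c.37.1": 0.0,      # most ancient FSF
    "d.104.1": 0.0516,
    "d.2.1": 0.185,
    "fsf_mid": 0.5,     # hypothetical
    "fsf_late": 0.9,    # hypothetical
}

epochs = {fsf: nd_epoch(value) for fsf, value in TOY_ND.items()}
epoch_counts = Counter(epochs.values())
```

A bimodal distribution shows up as large counts in the early and late bins and a small count in the intermediate bin.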
The distribution of FSFs in the total dataset revealed that the most ancient FSF, the P-
loop hydrolase fold FSF (c.37.1), was omnipresent in all proteomes (f = 1), including the viral
proteomes (Figure 2.3:Total). In total, 28 ancient FSFs had f >0.947 and were present in almost
all cellular and viral proteomes. However, the representation of FSFs decreased in the timeline
with increasing nd until f approaches 0 at about nd = 0.58 (Figure 2.3:Total). From this point
onwards, an opposite but interesting trend takes place and the FSFs increase their representation
in proteomes with increasing nd. Distribution plots for the individual superkingdoms (Figure 2.3)
confirmed that, in general, the most ancient FSFs (nd = 0-0.4) are shared by most proteomes.
However, the representation of ancient FSFs decreases in time, first in the viral proteomes
(Figure 2.3:Viruses), and then in the cellular proteomes, starting with Archaea (Figure
2.3:Archaea), then Bacteria (Figure 2.3:Bacteria) and finally Eukarya (Figure 2.3:Eukarya).
However, FSF distributions approached a minimum first in viruses, and then in Archaea,
Eukarya and Bacteria, in that order, while in general, viruses, Archaea, and Bacteria maintained
small representation of younger FSFs in proteomes while FSF representation increased in the
In general, loss of ancient architectures was abrupt and massive for viruses. It started very
early in evolution but substantially dropped in the nd = 0.4-0.6 range (Figure 2.3:Viruses). The
class II aminoacyl-tRNA synthetases (aaRS) and biotin synthetases FSF (d.104.1) was the first
FSF to be completely lost in our viral proteomic set, at nd = 0.0516 (Figure 2.3:Viruses and boxplot
for the ABE taxonomic group in Figure 2.4). This FSF includes the class II aaRS enzymes that,
along with class I aaRS enzymes, are responsible for charging tRNAs with the correct amino acids
and are central components of the translational machinery (96, 97). Megavirus has been
shown to encode a class II aaRS (AsnRS) in addition to 6 class I enzymes (TyrRS, MetRS,
ArgRS, CysRS, TrpRS, and IleRS), which makes it the only virus known to possess both class I
and class II aaRS enzymes (17). While d.104.1 is absent in the viral dataset, we find that
mimiviruses maintain a small representation of class I aaRS (TyrRS, MetRS, ArgRS, CysRS;
c.26.1, f = 0.035). This indicates that loss of class II enzymes from the mimiviral lineage
occurred very early in evolution. These enzymes are central components of the
translational apparatus (98). The mimivirus genome encodes 4 functionally active class I aaRSs,
and phylogenetic and statistical evidence (see below) suggests they were not transferred from
its cellular host (amoeba) via HGT (17). In addition, c.26.1 has a higher character retention index
(ri) value (0.84) relative to the mean ri values of FSFs in Archaea (0.72), Bacteria (0.75), and
Eukarya (0.76) (Figure 2.5). The ri measures levels of homoplasy (conflict in how data matches
the reconstructed tree) that portray processes other than the vertical inheritance of characters
(FSFs) on a relative scale of 0-1 and is sensitive against the large number of taxa (95). Thus, the
existence of a partial translational apparatus in the mimiviral genome does not owe its presence
to HGT but rather to reductive evolution of the proteome of a more complex ancestral virus,
which included functional translational machinery (17). In conclusion, the substantial loss of
domain structures that materializes very early in our timeline supports the concept that viruses
are reductive evolutionary variants of a primordial form that coevolved with the ancestor of
diversified cellular life (Figure 2.3). Mimivirus is the least reduced form of that ancestral virus
The distributions of FSFs in the proteomes of Archaea (Figure 2.3:Archaea) confirm the
first reductive tendencies in the emerging world of cellular organisms (48, 83). Our results show
this occurred after viral reductive tendencies were already in place (Figure 2.3:Viruses).
Archaeal f values started to drop at nd = 0.185 and the first domain structure completely lost in
Archaea was the lysozyme-like FSF (d.2.1, nd = 0.185). In contrast, Bacteria and Eukarya
maintained higher representation of FSFs in their proteomes and the total loss of an FSF is seen
f.16.1) and congruently in viruses (b.28.1; b.126.1), Archaea (h.1.19) and Eukarya (g.71.1) at nd
reductive evolutionary processes and of eukaryotes by genomic expansion (48, 83). Both viruses
and Archaea, the most ancient organismal groups, suffered an extended history of genomic
reduction and diversified into superkingdoms much later and concurrently with Eukarya (nd =
0.5867). These reductive processes and diversification patterns define the appearance of
taxonomic groups of FSFs and their relative numbers (Figure 2.4). The ABEV is the most
ancient group with nd ranging between 0-1 and a median nd of 0.22 with a total of 229 FSFs (the
third largest after ABE and BE) that are common between the cellular domains and viruses. Its
appearance was followed by the ABE, BEV, and BE taxonomic groups, in that order.
Superkingdom-specific groups appear much later. Interestingly, the appearance of BV, EV and
AV FSFs (Figure 2.4) occurs soon after the appearance of the respective superkingdom-specific
FSFs or after the diversification of the respective superkingdoms. We hypothesize that these are
the FSFs that were discovered when viruses began to infect their hosts and adapted to a parasitic
lifestyle. This occurred when lineages of diversified riboorganisms were already in existence
(see below).
We generated uToLs that describe the evolution of proteomes in our total dataset of 1,037
riboorganisms and viruses. Intrinsically rooted phylogenomic trees were built using FSFs (total
set: 1,739 FSFs) as distinguishable linearly ordered multistate phylogenomic characters. The
most parsimonious tree reconstruction showed that organisms in Archaea, Bacteria, Eukarya, and
Viruses formed 4 distinct groups, some monophyletic, and that viruses occupied the most basal
position of the uToL (Figure B.1). Inclusion of parasitic (P) and obligate parasitic (OP)
riboorganisms induced a topology of the cellular superkingdoms that favored the canonical
rooting of the ToL (44, 45) with the early divergence of Bacteria relative to Archaea and
Eukarya (Figure B.1). Exclusion of P and OP organisms in the total dataset and in equally
sampled sets of riboorganisms (50 Archaea, 50 Bacteria, and 50 Eukarya) and capsid-encoding
organisms (50 Viruses) from superkingdoms produced trees that retained viruses as the most
ancient monophyletic group but placed Archaea as the second oldest (Figure 2.6). The viral
monophyletic group has very little diversity (small branches) compared to the cellular
proteomes. We attribute the lack of diversity in viral proteomes to their prolonged history of
reductive evolution and to the fact that the set of FSFs that distinguishes viruses is unevenly
distributed in the virosphere. For example, the translation related enzymes are only present in
mimiviruses (and megaviruses). Similarly, other viral specific FSFs are unevenly distributed.
Thus, the virosphere as a group makes the most diverse group on the planet but sets of single
inheritance in the evolution of proteomes, we calculated ri for each phylogenetic character (ri for
each FSF) used to build the uToL as a relative measure of homoplasy. It has often been proposed
that viruses pickpocket genes from their cellular hosts via HGT (27, 29). Therefore, we
compared the distribution of FSFs in viruses with the distribution of cellular FSFs by plotting
them against their ri values. Both viral and cellular FSFs follow a similar bimodal
distribution and do not appear significantly different from each other (Figure 2.5). In contrast,
viral characters (FSFs) are distributed with relatively higher ri values, supporting a better fit of
viral characters to the phylogeny. We thus conclude that the better fit of viral characters to the
phylogeny indicates that FSFs in viruses were not acquired horizontally from their hosts and that both
viral and cellular FSFs are subject to similar levels of HGT.
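For reference, the retention index of a character is commonly defined as ri = (g - s) / (g - m), where g is the maximum number of steps the character could require on any tree, s is the number of steps observed on the reconstructed tree, and m is the minimum number of steps possible. The sketch below assumes this standard definition; the function name and the example step counts are hypothetical, and the values reported here were obtained from PAUP*, not from this code.

```python
def retention_index(g, s, m):
    """Retention index of one character: (g - s) / (g - m).

    g: maximum possible steps, s: observed steps, m: minimum possible steps.
    Returns None for characters with g == m, for which ri is undefined.
    """
    if g == m:
        return None
    return (g - s) / (g - m)


# Hypothetical step counts for two characters.
ri_good_fit = retention_index(g=10, s=4, m=2)   # few extra steps
ri_poor_fit = retention_index(g=10, s=9, m=2)   # many extra steps
```

Values close to 1 indicate little homoplasy (a good fit of the character to the tree), while values close to 0 indicate extensive homoplasy.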
A plot that describes the diversity (use) and abundance (reuse) of FSFs (total number of
distinct FSFs versus the total number of FSF domains that are encoded in a proteome) shows that
viruses have the simplest proteomes, followed progressively by Archaea, Bacteria and Eukarya,
in that order (Figure 2.7). Remarkably, organisms follow a congruent trend towards structural
diversity and organismal complexity. This trend confirms our initial evolutionary model of
proteome growth and again reveals the ancestrality of viruses and Archaea (48, 84).
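The two quantities plotted against each other can be sketched as follows, assuming that the input is the list of FSF assignments (one entry per domain) of a single proteome. The labels are hypothetical, and the reuse level is taken as total FSFs divided by distinct FSFs, as defined earlier in this chapter.

```python
from collections import Counter


def use_and_reuse(domain_assignments):
    """Return (use, reuse, reuse_level) for one proteome.

    use: number of distinct FSFs (diversity)
    reuse: total number of FSF domains encoded (abundance)
    reuse_level: reuse / use
    """
    counts = Counter(domain_assignments)
    if not counts:
        return 0, 0, 0.0
    use = len(counts)
    reuse = sum(counts.values())
    return use, reuse, reuse / use


# Hypothetical proteome with six domains belonging to three FSFs.
TOY_ASSIGNMENTS = ["fsf_a", "fsf_a", "fsf_b", "fsf_c", "fsf_a", "fsf_b"]
use, reuse, reuse_level = use_and_reuse(TOY_ASSIGNMENTS)
```

Plotting use against reuse on logarithmic axes for every proteome gives the kind of scatter described for Figure 2.7.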
The spread (f) of viral FSFs relative to cellular FSFs in individual proteomes of Archaea,
Bacteria and Eukarya appeared significantly biased (Figure 2.8). When compared to cell-specific
FSFs, FSFs shared by viruses and cells were significantly more widespread in the proteomes of a
superkingdom. Viruses hold 294, 265, and 239 FSFs in common with Eukarya, Bacteria and
Archaea, respectively. Median f values of these FSFs were considerably higher than those of
corresponding cellular FSFs in Eukarya (0.978 vs. 0.416), Bacteria (0.8826 vs. 0.329), and
Archaea (0.742 vs. 0.514). This bias is remarkable in the case of Eukarya where ~98% of
proteomes are enriched with viral FSFs (Figure 2.8). Archaeal and bacterial proteomes are also
enriched with viral FSFs but at lower levels. Remarkably, patterns of enrichment follow patterns
The popularity and abundance of viral FSFs in cellular proteomes suggests that viruses
have been a very active and crucial factor in mediating domain transfer between the individual
proteomes and enhancing biodiversity. These domains are present in a wide array of hosts with
remarkable diversity, ranging from very small to very complex organisms, and provide further
support to the ancient and primordial nature of viruses (18). The transfer of domains from virus
to host is a relatively new concept and needs to be explored further. A significant proportion of
human and all vertebrate genomes harbor endogenous retroviruses (31, 99). The capsid proteins
widespread homologs in the eukaryal superkingdom (100). Together these observations and our
findings support the view that viral-to-host gene transfers are common events and viruses play an
We studied the molecular functions of 293 (out of 304) FSFs that are present in the viral
dataset using the functional annotation scheme described by Vogel and Chothia (88-90). For the
remaining 11 FSFs, functional annotation was not available. When plotted against time (nd), we note
that the majority of the viral FSFs either appeared very early (nd<0.4) or very late (0.6<nd<1.0)
(Figure 2.9) supporting previous results. We find that most of the viral FSFs perform metabolic
and Extracellular processes, in that order. This order matches the functional distribution
described previously for the cellular superkingdoms (58) and provides support for the hypothesis
that viruses coevolved with the cellular ancestors. A significant drop in the number of
FSFs/functions is seen in the nd range 0.4-0.6 which is the period marked by massive gene loss
in both viruses and cellular organisms (Figure 2.3). In contrast, a relatively even distribution of
functions is seen in the nd range 0.6-1.0 which is the period marked by superkingdom
diversification and genome expansion in Eukarya (Figure 2.3) (48). The functions acquired by
viruses during this late period include those related to Extracellular processes (toxin/defense,
immune response, cell adhesion), General (protein interaction, general, ion binding, small
molecule binding) and Other (viral proteins, and proteins with unknown functions) functional
categories (Figure 2.10). We hypothesize that viruses acquired these functions to adapt to the
parasitic lifestyle after suffering massive gene loss between nd 0.4-0.6. This is also evident from
the appearance of superkingdom specific taxonomic groups (AV, BV, and EV) after the
appearance of the respective superkingdoms (Figure 2.4). In contrast, the number of FSFs
Effect of HGT
(49), significant enrichment (over-representation) of FSFs in viruses (in the ABEV, BEV, AEV,
ABV, AV, BV, and EV taxonomic groups) is taken as an indication that they have acquired FSFs
from their cellular hosts through lateral transfer. We calculated the probability of enrichment of a
particular taxonomic group using the hypergeometric distribution and found that only the ABEV
FSF group was significantly over-represented (P<0.05). The AEV FSF group was also over-
represented but it was not statistically significant (P = 0.29). In contrast, all the other taxonomic
groups were under-represented (Table 2.2). Results suggest viral FSFs did not originate in the
cellular superkingdoms and then transfer to viruses. Significant overrepresentation of the ABEV
group, which (as shown above) represents the most ancient FSF group, and the under-
representation of viral FSFs in all the other (more derived) taxonomic groups suggests common
Viruses have a negative connotation linked to them. They are ubiquitous parasites (20)
and infect both cellular and acellular life forms. On a planetary scale, they make up the
virosphere, the most abundant and diverse part of our biosphere (13, 15, 101). However, as our
knowledge continues to grow and evidence is presented for the existence of giant viruses (17,
36), the overlap between capsid-encoding and ribosome-encoding organisms grows. Giant
viruses now defy the word ‘virus’ by definition and call for a change in the way we have
Consequently, there is an ongoing debate whether to redefine viruses, give them the
status of living beings, and include them in uToL (19, 27, 29, 42, 102). Proponents for the
inclusion argue viruses are a distinct form of life based on their resemblance with numerous
parasitic bacteria. Opponents of this view do not find the unique nature of mimiviral genomes
compelling enough but instead argue that viruses have been ‘pick pocketing’ genes from cellular
species and thus are gene robbers (27, 29). In short, they contend that the set of genes of
mimiviruses could be explained by HGT from host cell to viral form because they share
sequence homology with the cellular species. However, the majority of the mimiviral genes lack
clear cellular homologs and only ~10% of the genes are shared between mimiviruses and the
cellular world (42). If viruses are considered gene robbers, then so are numerous parasitic
bacteria with reduced proteomes (42). More importantly, the origin of viral specific genes/FSFs
(our results) remains unanswered if viruses are gene robbers and not worthy of a living status. In
contrast, the popularity of viral FSFs in hosts ranging from very simple to complex organisms
(our results) together with the abundance of VHGs in a very diverse set of viruses support the
existence of an ancient viral world (18). In summary, a significant number of FSFs exist in the
virosphere, including a lower than expected count for viral specific FSFs. Sequencing of more
viral genomes, especially larger genomes, will probably lead to the identification of more viral
specific FSFs. However, we expect that the proportion and distribution of FSFs in the rest of the
structurally equivalent to the eukaryotic TATA-box is consistent with the reductive mode of
evolution for viral proteomes. Reductive tendencies in proteomes have been explained in detail
previously and numerous obligate intracellular parasitic bacteria are known to have gone this
route (48, 83). Our results reveal a massive trend of FSF loss in viruses that started very
early in evolution. This trend supports the reductive evolutionary model. Archaea followed the
route soon afterwards and then later on were joined by Bacteria and Eukarya. In light of our
results, mimiviruses should be viewed as the least reduced form of an ancestral virus that
coevolved with the universal common cellular ancestor. Mimivirus thus represents a living fossil.
We have recently shown that the phylogenomic method used for the reconstruction of
proteome trees is robust against unequal sampling and thus the selection of a significantly large
number of bacterial proteomes (in our case 652 bacterial lineages) does not affect the overall
topology of the tree, as long as organisms included exhibit a free-living lifestyle (49). However,
the presence of highly reduced proteomes, such as small obligate intracellular parasitic bacteria
and viruses adds a bias to the analysis of trees because the species with simpler repertoires of
FSFs (i.e., absence of domains) generally occupy the most basal branches of the tree (48, 49).
defined at the FSF level represent a much higher level of structural organization and the
evolutionary impact of HGT is very limited at such higher levels of structural organization (64-67, 87).
Furthermore, the extent of homoplasy in our tree measured with ri values (95) for all
FSFs suggests that domain structures present in viruses were not acquired from their hosts via HGT.
Moreover, FSFs acquired via HGT are expected to be over-represented in genomes but our
results show that taxonomic groups of FSFs that include viruses are highly under-represented
(49).
While FSFs were not acquired from their hosts, our data suggest virus-to-host HGT has
been pervasive since very ancient times. We show that viral FSFs (common to viruses and
cellular superkingdoms) are significantly more widespread across proteomes than cellular FSFs,
suggesting viruses mediated transfer of these FSFs into cellular hosts. Furthermore, the fact that
the viral FSFs are also universally present in the proteomes of organisms that range from very
simple to very complex, advocates for the primordial and ancient nature of viruses and their
continued impact in evolution of cellular life. Remarkably, one of the most surprising findings to
come out of the human genome project was the discovery that a significant proportion of the
human genome represents remnants of ancient viral infections (103). Viruses are thus a
previously disregarded factor in enhancing biodiversity. They are capable of leaving imprints
We show that viruses harbor a significant number of FSFs and suggest that they have
evolved via massive reductive evolutionary processes. In addition, viruses occupy the most basal
branches of the uToL, as a monophyletic and distinct group along with the three other generally
accepted superkingdoms: Archaea, Bacteria, and Eukarya. This placement and the unique and
complex taxonomic distribution of FSFs suggest viruses truly represent a genuine superkingdom.
Based on these findings, we propose that mimiviruses are living fossils of ancestral viruses that
coevolved with primordial cells. We finally highlight the crucial contribution of the virosphere to
biodiversity by mediating HGT between cellular species. We contend that the virosphere has
been playing a pivotal role in the evolution of cellular superkingdoms since the origins of
cellular life. Our analyses provide evidence for the existence of viruses as a distinct form of life
that lacks ribosomes. We propose that viruses should be regarded as symbionts and not parasites
of cellular species, and that parasitic relationships with their hosts are late adaptations. Our
results call for a change in the way we perceive viruses in our world, highlighting the central role
Figure 2.1 The Venn diagram highlights the distribution of FSFs in the taxonomic groups.
Figure 2.2 A phylogenomic tree of protein domain structures describing the evolution of 1,739 FSFs in 1,037
proteomes (463,915 steps; CI = 0.051; RI = 0.795; g1 = -0.127). Taxa are FSFs and characters are proteomes.
Terminal leaves of viruses and cellular FSFs were labeled in red and blue respectively.
Figure 2.3 Distribution index (f, the number of species using an FSF/total number of species) of each FSF plotted
against relative age (nd, number of nodes from the root/total number of nodes) for the four supergroups (Total) and
(nd). Vertical lines within each distribution represent medians. Dotted vertical lines represent important evolutionary
FSFs are represented in blue. Both groups of FSFs follow a similar distribution and generally the viral FSFs are
distributed with higher ri values supporting a better fit of viral characters to the phylogeny.
Figure 2.6 One optimal (P<0.01) most parsimonious phylogenomic tree describing the evolution of 150 FL
riboorganisms (50 each from Archaea, Bacteria, and Eukarya) and 50 viruses generated using the census of
abundance of 1,739 FSFs (1,517 parsimoniously informative sites; 62,061 steps; CI = 0.156; RI = 0.804; g1 = -
0.325). Terminal leaves of Viruses (V), Archaea (A), Eukarya (E) and Bacteria (B) were labeled in red, blue, black
FSF domains that are encoded) for riboorganisms and viruses (50 for each supergroup). Both axes are in logarithmic
scale.
Figure 2.8 Viral FSF enrichment. Boxplots comparing the distribution index (f) of viral FSFs with cellular-only
FSFs (shared or not shared with other cellular superkingdom) for each cellular superkingdom. Pie-charts above each
number of viral FSFs corresponding to major functional categories plotted against time (nd).
Figure 2.10 Functional distributions of viral FSFs in minor functional categories. Histograms comparing the number
of viral FSFs corresponding to each of the minor categories within each major functional category.
Tables
Table 2.1 List of dsDNA viruses sampled along with the statistics on structural assignments.
3e
ascovirus 1a
PM2
necrosis virus
isolate China
iridescent virus
mimivirus
Chlorella virus 1
Chlorella virus 1
virus
entomopoxvirus 'L'
2490
entomopoxvirus
subtype 1
16
k, number of FSFs in viral taxonomic group; n, total number of FSFs in viral supergroup; M, number of FSFs
CONSERVED1
Introduction
The functional repertoire of a cell is largely embodied in its proteome, the collection of
proteins encoded in the genome of an organism. The molecular functions of proteins are the
direct consequence of their structure and structure can be inferred from sequence using hidden
Markov models of structural recognition (85, 93, 104). Here we analyze the functional
annotation of protein domain structures in almost a thousand sequenced genomes, exploring the
distribution of domains with respect to the molecular functions they perform in the three
superkingdoms of life. In general, most of the protein repertoire is spent in functions related to
metabolic processes but there are significant differences in the usage of domains for regulatory
and extra-cellular processes both within and between superkingdoms. Our results support the
hypotheses that the proteomes of superkingdom Eukarya evolved via genome expansion
mechanisms that were directed towards innovating new domain architectures for regulatory and
multicellular structure or to interact with environmental biotic and abiotic factors (e.g., cell
signaling and adhesion, immune responses, and toxin production). Proteomes of microbial
superkingdoms, Archaea and Bacteria, retained fewer domains and maintained simple
and smaller protein repertoires. Viruses appear to play an important role in the evolution of
1
This chapter has been published as a manuscript in the open access journal Genes (58). The rights to reprint were
retained by the authors and transferred to IEEE for the parts of the chapter that were communicated as a poster paper at a
conference organized by IEEE. The copyright owner, IEEE, has given permission to reuse the text in this thesis/dissertation.
© 2011 IEEE. Reprinted, with permission, from Nasir, A.; Naeem, A.; Khan, M.J; Lopez-Nicora, H.D.; Caetano-
Anolles, G, The functional make up of proteomes is remarkably conserved, Nov. 2011. (105).
superkingdoms. We finally identify a few genomic outliers that deviate significantly from the
of insects with extremely reduced genomes, Tenericutes and Guillardia theta. These organisms
spend most of their domains on information functions, including translation and transcription,
rather than on metabolism and harbor a domain repertoire characteristic of parasitic organisms.
Chlamydiae (PVC) superphyla was no different than the rest of bacteria, failing to support claims
of them representing a separate superkingdom. In turn, Protista and Bacteria shared similar
functional distribution patterns suggesting an ancestral evolutionary link between these groups.
Our results yield a global picture of the functional organization of proteomes. Results
suggest that the functional structure of proteomes is remarkably conserved across all organisms,
ranging from small bacteria to complex eukaryotes. There is also evidence for the existence of
genomic outliers that deviate from global trends. Here we explore what makes these proteomes
distinct.
Materials and Methods
Data retrieval
We used the protein architecture assignments for a total of 965 organisms including 70
Archaea, 651 Bacteria and 244 Eukarya from the total dataset described in chapter 2. Remaining
their lifestyles was done manually and resulted in 592 free living (FL), 153 parasitic (P), and 158
The most recent domain functional annotation file for SCOP 1.73 was downloaded from
the SUPERFAMILY web server. This file describes the one-to-one mapping between FSFs and
molecular functions based on the scheme described by Vogel and Chothia (Table 3.1) (88-90).
For each genome we extracted the set of unique FSFs encoded and then mapped them to the 7
major and 50 minor functional categories (Table 3.1). We calculated both the percentage and
(http://www.python.org/download/).
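The per-genome step described above (collect the unique FSFs, map each to a functional category, then report counts and percentages) can be sketched in Python as follows. This is a minimal illustration rather than the script used in the study: the FSF identifiers, the mapping, and the function name functional_profile are all hypothetical.

```python
from collections import Counter

# Illustrative mapping only: these FSF identifiers and assignments are made up.
FSF_TO_CATEGORY = {
    "fsf_a": "Metabolism",
    "fsf_b": "Metabolism",
    "fsf_c": "Information",
    "fsf_d": "Regulation",
}


def functional_profile(fsf_assignments, fsf_to_category):
    """Return (counts, percentages) of unique FSFs per functional category."""
    unique_fsfs = set(fsf_assignments)  # each FSF is counted once per proteome
    counts = Counter(
        fsf_to_category[fsf] for fsf in unique_fsfs if fsf in fsf_to_category
    )
    total = sum(counts.values())
    percentages = {}
    if total:
        percentages = {cat: 100.0 * n / total for cat, n in counts.items()}
    return dict(counts), percentages


# Toy proteome: "fsf_a" occurs twice and "fsf_x" has no annotation.
counts, percentages = functional_profile(
    ["fsf_a", "fsf_a", "fsf_b", "fsf_c", "fsf_x"], FSF_TO_CATEGORY
)
```

In this sketch FSFs without an annotation are simply skipped, which mirrors the exclusion of non-annotated FSFs reported in the Results.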
Statistical analysis
which is the appropriate test to detect differences between means for groups having unequal
variances (106). We excluded organisms with P and OP lifestyles in order to remove noise from
the data. Additionally, in order to meet asymptotic normality, we used the Log10 transformation
Nmax is the largest value in the matrix; and Nnormal is the normalized and scaled score for FSF x in
y superkingdom.
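As an illustration of the statistical test named above, the following Python sketch computes the Welch's ANOVA statistic for log10-transformed counts using only the standard library. It is a sketch under stated assumptions, not the analysis code of the study: the function names and the toy FSF counts are hypothetical, and the p-value (which requires the F distribution) is left to a statistics package.

```python
import math
from statistics import mean, variance


def log10_transform(values):
    """Log10-transform strictly positive counts."""
    return [math.log10(v) for v in values]


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA (unequal variances)."""
    groups = [[float(x) for x in g] for g in groups]
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight of each group: n_i / s_i^2 (every group needs non-zero variance).
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_w
    between = sum(
        w * (m - grand_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Toy metabolic FSF counts for three groups (invented values).
toy_counts = [[200, 220, 260], [130, 250, 340], [270, 300, 350]]
f_stat, df1, df2 = welch_anova([log10_transform(g) for g in toy_counts])
```

For two groups the statistic reduces to the square of Welch's t, which gives a convenient sanity check.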
Results
We studied the molecular functions of 1,646 domains defined at the FSF level of
structural abstraction (SCOP 1.73) that are present in the proteomes of a total of 965 organisms
spanning the three superkingdoms. A total of 135 FSFs that could not be annotated were
excluded from the analysis. Out of the 1,646 FSFs included, approximately one-third (32.38%)
processes (ICP) (12.63%), Regulation (12.45%), and Information (12.21%) are uniformly
distributed within proteomes. In contrast, General (7.96%) and Extracellular processes (ECP)
(5.77%) are significantly underrepresented compared to the rest (Figure 3.1:A). The total number
of FSFs in each category exhibits the following decreasing trend: Metabolism > Other > ICP >
Regulation > Information > General > ECP. These patterns of FSF number and relative
proteome content are for the most part maintained when studying the functional annotation of
FSFs belonging to each superkingdom (Figure 3.1:B). However, the number of FSFs in each
superkingdom varies considerably and increases in the order Archaea, Bacteria and Eukarya, as
given the central importance of metabolic networks. However, the much larger number of FSFs
corresponding to functional category Other is quite unexpected. The 273 FSFs belonging to this
category include 200 and 73 FSFs in sub-categories unknown functions and viral proteins,
respectively (Table 3.1). Viruses are generally defined as gene poor biological entities. However,
the number of domains belonging to viral proteins that are present in cellular organisms makes a
noteworthy contribution to the total pool of FSFs (4.43%). Thus, viruses have a much richer
and diverse repertoire of domain structures than previously thought and their association with
cellular life has contributed considerable structural diversity to the proteomic make up.
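The shares quoted in this paragraph follow directly from the FSF counts given in the text. A short Python check, using only numbers stated above (the constant and helper names are illustrative):

```python
TOTAL_ANNOTATED_FSFS = 1646   # FSFs included in the analysis
OTHER_FSFS = 273              # major category Other
UNKNOWN_FUNCTION_FSFS = 200   # sub-category unknown functions
VIRAL_PROTEIN_FSFS = 73       # sub-category viral proteins


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# The two sub-categories account for the whole of category Other.
assert UNKNOWN_FUNCTION_FSFS + VIRAL_PROTEIN_FSFS == OTHER_FSFS

viral_share = share(VIRAL_PROTEIN_FSFS, TOTAL_ANNOTATED_FSFS)  # 4.43, as quoted
other_share = share(OTHER_FSFS, TOTAL_ANNOTATED_FSFS)
```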
The numbers of FSFs belonging to categories Regulation, Information, and ICP are
uniformly distributed in proteomes. However, the ECP category is the least represented, perhaps
because this category is the last to appear in evolution (48, 57). Extracellular processes are more
Multicellular organisms need efficient communication, such as signaling and cell adhesion. They
also trigger immune responses and produce toxins when defending against parasites and pathogens.
These ECP processes, which are depicted in the minor categories of cell adhesion, immune
response, blood clotting and toxins/defense, are needed when interacting with environmental
biotic and abiotic factors and for maintaining the integrity of multicellular structure. These
categories are also present in the microbial superkingdoms but their functional role may be
We note that current genomic research is heavily skewed towards the sequencing of
microbial genomes, especially those that have parasitic lifestyles and are of bacterial origin. In
fact, 67% of proteomes in our dataset belong to Bacteria. This bias can affect conclusions drawn
from global trends such as those in Figure 3.1:A, including the under-representation of ECP
differs in organisms belonging to the three superkingdoms, we analyzed proteomes at the species
level and calculated both the percentage and actual number of FSFs corresponding to different
counts of FSFs, and do so consistently for the three superkingdoms: Metabolism > Information >
ICP > Regulation > Other > General > ECP. Note that trend lines across proteomes seldom
overlap or cross (Figure 3.2). It is noteworthy, however, that this trend differs from the
decreasing total numbers of FSFs we described above (Figure 3.1). Thus, no correlation should
be expected between the numbers of FSFs for individual proteomes and the total set for each
functional distributions of FSFs (Figure 3.2:A). The only exception appears to be the slight
overrepresentation of Regulation FSFs (green trend lines) and underrepresentation of ICP (black
trend lines) in Archaea compared to Bacteria (especially Proteobacteria). These distributions are
Metabolism and Information are decreased while those of all other five functional categories are
significantly and consistently increased (Figure 3.2:A). There is also more variation evident in
Eukarya; large groups of proteomes exhibit different patterns of functional use (clearly evident in
Metabolism ranges 30-50% of proteomic content (100-350 FSFs). This variation is not present in
small increases in the representation of the other six repertoires, with the notable exception of
Information. In this particular case, when Metabolism goes down Information goes up. For
example, bacterial proteomes with metabolic FSF repertoires of <45% offset their decrease by a
corresponding increase in Information FSFs (generally from ~20% to ~35%, Figure 3.2:A). In all
superkingdoms, we identify groups of proteomes or a few outliers that deviate from the global
trends (vertical dotted lines in Figure 3.2:A). As we will discuss below, this is generally a
below). Outliers are particularly evident in Bacteria and harbor sharp increases in Information
decreases of Metabolism are generally offset by increases of the Regulation category, with an
hand with decreases in Information, and are correspondingly offset mostly by increases in
Regulation and ECP. Apparently, the advantages of regulatory control (e.g., signal transduction
When we look at the actual number of FSFs within each functional repertoire (Figure
3.2:B), we observe a clear trend in domain use that matches the total trend for superkingdoms
described above (Figure 3.1). In most cases, the functional repertoires of Archaea are smaller
than those of Bacteria, and bacterial repertoires are generally smaller than those of Eukarya
(Figure 3.2:B). This holds true for all functional categories. However, the numbers of metabolic
FSFs vary 1.5-4 fold in proteomes of superkingdoms, the change being maximal in Bacteria.
While proteomes in both Eukarya and Bacteria show similar ranges of metabolic FSFs, the
repertoire of Archaea is more constrained. Furthermore, FSFs belonging to categories Other and
ECP are significantly higher in Eukarya than in the microbial superkingdoms. These remarkable
observations suggest high conservation in the make up of proteomes of superkingdoms and at the
same time considerable levels of flexibility in the metabolic make-up of organisms. Results also
support the evolution of the protein complements of Archaea and Bacteria via reductive
evolutionary processes and Eukarya by genome expansion mechanisms (48, 83). Reductive
tendencies in microbial superkingdoms do not show bias in favor of any functional category.
Furthermore, enrichment of eukaryal proteomes with viral proteins supports theories that state
that viruses have played an important role in the evolution of Eukarya (107).
We also explored the functional distribution of FSFs at the phylum/kingdom level for each
superkingdom (Figure 3.2). Plots describing the percentages (Figure 3.2:A) and actual number of
FSFs in proteomes (Figure 3.2:B) highlight the existence of ‘outliers’ (vertical dotted lines in
Figure 3.2:A) that deviate from the global functional trends that are typical of each
superkingdom.
with each other. Only N. equitans could be considered an outlier (insets of Figure 3.2). Its
proteome deviates from the global archaeal signature by reducing its proteomic make up (it has
only 200 distinct FSFs) and by exchanging Information for metabolic FSFs. N. equitans is an
obligate intracellular parasite (108) that is part of a new phylum of Archaea, the Nanoarchaeota
(109). N. equitans has many atypical features, including the almost complete absence of operons
and presence of split genes (110), tRNA genes that code for only half of the tRNA molecule
(111), and the complete absence of the nucleic acid processing enzyme RNAse P (112). Some of
these features were used to propose that N. equitans is a living fossil (113), represents the root of
superkingdom Archaea and the ToL (114), and is part of a very ancient and yet to be described
most ancient superkingdom (48, 49) and has placed N. equitans at the base of the ToL together
with other archaeal species. Its ancestral nature is therefore in line with the evolutionary and
functional uniqueness of N. equitans and the very distinct functional repertoire we here report.
In Bacteria, the functional repertoires of bacterial phyla were also remarkably conserved.
Only Information and Metabolism showed significantly distinct patterns and considerable
variation in the use of FSFs. Again, decreases in representation of metabolic FSFs were generally
offset by increases in informational FSFs (Figure 3.2:A). Notable outliers include the Tenericutes
and the Spirochetes. As groups, they have the highest relative usage of Information FSFs, which
is clearly offset by a decrease in metabolic FSFs. The Tenericutes is a phylum of Bacteria that
includes class Mollicutes. Members of the Mollicutes are typical obligate parasites of animals
and plants (some of medical significance such as Mycoplasma) that lack cell walls and have
gliding motility. These organisms are characterized by small genome sizes (115) considered to
have evolved via reductive evolutionary processes (116). Because of their unique properties and
history, mycoplasmas have been used recently to produce a completely synthetic genome (117).
There were also clear outliers in the Proteobacteria. These included Candidatus Blochmannia
and Candidatus Hodgkinia cicadicola (symbiont of cicadas). These bacteria are generally
endosymbionts of insects (e.g., ants, sharpshooters, psyllids, cicadas) that have undergone
irreversible specialization to an intracellular lifestyle. Candidatus Carsonella ruddii has the
smallest genome of any bacterium (118). There were also bacterial proteome groups that were
expected to be outliers but were no different from the rest. Bacteria belonging to the superphylum
because they have a ‘eukaryotic touch’ (119). Indeed, PVC bacteria display genetic and cellular
features that are characteristic of Eukarya and Archaea, including the presence of Histone H1,
condensed DNA surrounded by membrane, alpha-helical repeat domains and beta-propeller folds
that make up eukaryotic-like membrane coats, reproduction by budding, ether lipids and lack of
cell walls (120-122). Due to the unique nature of the PVC superphylum, it was proposed that
Eukarya and Archaea (122). However, ToLs generated from domain structures in hundreds of
proteomes did not dissect the PVC superphylum into a separate group (48, 49). Functional
distributions of FSFs now show that PVC proteomes appear no different from the rest of bacteria
(Figure 3.2). These results do not support PVC-inspired theories that explain the diversification
belonging to individual kingdoms in Eukarya had functional signatures that were highly
conserved (Figure 3.2:A). However, these signatures differed between groups. Plants and Fungi
had functional representations that were very similar and showed little diversity. In contrast,
Metazoa functional distributions increased the representation of ECP and Regulation FSFs in
exchange for FSFs in Metabolism and Information. Protista had patterns that resemble those of
Plants and Fungi but had widely varying metabolic repertoires, very much like Bacteria. This
possible link between basal eukaryotes and bacteria revealed by our comparative analysis is
consistent with the existence of an ancestor of Bacteria and Eukarya and the early rise of
Archaea (48). Only a few outliers belonging to kingdoms Fungi (Encephalitozoon cuniculi and
Encephalitozoon intestinalis) and Protista (Guillardia theta) were identified. E. cuniculi and E.
intestinalis are amitochondriate microsporidian parasites with highly reduced genomes (123).
Similarly, Guillardia theta is a nucleomorph that has a highly compact and reduced genome with
When we look at the actual number of FSFs in proteomes of phyla and kingdoms (Figure
3.2:B) we observe that while the overall patterns match those of FSF representation (Figure
3.2:A), FSF number revealed considerable variation in the metabolic repertoire of Protista and
Bacteria. FSFs in these groups typically ranged 130-340, with PVC and Spirochetes exhibiting
the smallest range (130-300 FSFs). In contrast, metabolic repertoires of Archaea and the other
eukaryotic kingdoms typically ranged 200-260 FSFs and 270-350 FSFs, respectively. This
link of phyla within superkingdoms Eukarya and Bacteria. Plots of FSF number also clarified
functional patterns in outliers, revealing that they did not have larger numbers of FSFs in Information
but rather had reduced metabolic repertoires. This shows that parasitic outliers get rid of metabolic
The analysis thus far revealed the existence of a small group of outliers within each
superkingdom. Manual inspection of lifestyles of these organisms showed that all of these
organisms are united by a parasitic or symbiotic lifestyle. For example, N. equitans is the
smallest archaeal genome ever sequenced and represents a new phylum, the Nanoarchaeota
(109). This organism interacts with Ignicoccus hospitalis, establishing the only known
parasite/symbiont relationship of Archaea, and harbors a highly reduced genome (110).
Parasitic/symbiotic relationships with various plants and animals can be found in Tenericutes and
species are eukaryotic parasites that lack mitochondria and have highly reduced genomes (123).
E. cuniculi even has a chromosomal dispersion of its ribosomal genes, very much like N.
equitans, and the rRNA of the large ribosomal subunit is reduced to its universal core (125). It
also has reduced intergenic spacers and shorter proteins. Similarly, Guillardia theta is a nucleomorph
that has a highly compact and reduced genome with loss of nearly all the metabolic genes (124).
general tendencies that resemble those of the outliers, we classified organisms into three different
lifestyles: free living (FL) (592 proteomes), facultative parasitic (P) (153 proteomes), and
obligate parasitic (OP) (158 proteomes). Functional distributions for the 7 major functional
categories for these proteomic sets explained the role of parasitic life on proteomic constitution
(Figure 3.3). Plots of percentages (Figure 3.3:A) and actual number of FSFs in proteomes
(Figure 3.3:B) showed that FSF distributions in FL organisms were remarkably homogeneous and that
the vast majority of variability within superkingdoms was ascribed to the P and OP lifestyles.
This variability was for the most part explained by a sharp decline in the number of FSFs
assigned to the major category Metabolism (Figure 3.3:B). Plots also support the
hypothesis that parasitic organisms have gone the route of massive genome reduction in a
tendency to lose all of their metabolic genes. This tendency makes them more and more
dependent on host cells for metabolic functions and survival (126, 127).
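The comparison of variability across lifestyles can be illustrated with a small Python sketch that groups proteomes by lifestyle label and summarizes the spread of their metabolic FSF counts. The records below are invented for illustration and the function name spread_by_lifestyle is hypothetical; the sketch only shows the kind of summary that underlies a statement such as "most variability lies in the P and OP groups".

```python
from statistics import pstdev

# Toy records: (lifestyle, number of metabolic FSFs). Values are invented.
PROTEOMES = [
    ("FL", 300), ("FL", 310), ("FL", 305),
    ("P", 180), ("P", 290),
    ("OP", 60), ("OP", 250),
]


def spread_by_lifestyle(records):
    """Group counts by lifestyle and return (min, max, stdev) per group."""
    groups = {}
    for lifestyle, count in records:
        groups.setdefault(lifestyle, []).append(count)
    return {
        lifestyle: (min(vals), max(vals), pstdev(vals))
        for lifestyle, vals in groups.items()
    }


summary = spread_by_lifestyle(PROTEOMES)
# With these toy values, FL proteomes show a far narrower range than OP proteomes.
```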
Since maximum variability lies within the proteome repertoires of P and OP organisms
(Figure 3.3) and parasitism/symbiosis in these organisms is the result of secondary adaptations,
the analysis of proteomic diversity in FL organisms allows us to test whether differences in the functional
the order Archaea, Bacteria and Eukarya. When compared to the total proteomic set (Figure 3.2),
Metabolism remains the predominant functional category and a large number of domains in all
the proteomes perform metabolic functions. Again, the proteomes of Eukarya have the richest
FSF repertoires, and those of Archaea the simplest. Analysis of variance showed that the
number of FSFs for each functional repertoire was consistently different between superkingdoms
(P<0.0001; Table 3.2). This supports the conclusions drawn from earlier analyses that the
microbial superkingdoms followed a genome reduction path while Eukarya expanded their
The seven major categories of molecular functions map to 50 minor categories (Table
superkingdoms (Figure 3.4). Only category ‘not annotated’ (NONA) was excluded from analysis.
In terms of percentage (Figure 3.4:A), the overall functional signature is split into two
components: prokaryotic and eukaryotic. Prokaryotes spend most of their domain repertoire on
Metabolism and Information whereas Eukarya stand out in ECP (particularly cell adhesion,
immune response), Regulation (DNA binding, signal transduction), and all the minor functional
functional repertoires with a significantly large number of FSFs devoted to each minor
functional category. Bacteria and Archaea work with a small number of domains. However, the
Figures 3.1 and 3.2). These results are consistent with the evolutionary trends in proteomes
described previously (48, 83). Our results support the complex nature of the LUCA (49) and are
consistent with the evolution of microbial superkingdoms via reductive evolutionary processes
and the evolution of eukaryal proteomes by genome expansion (48, 83). It appears that Archaea
went on the route of genome reduction very early in evolution and was followed by Bacteria and
finally Eukarya (48). Late in evolution, the eukaryal superkingdom increased the representation
of FSFs and developed a rich proteome (48). This can explain the relatively huge and diverse
between Bacteria and Eukarya except for minor category ‘Translation’ (green trend lines in
Figure 3.4) that is significantly higher in Eukarya compared to Bacteria. This shows that Bacteria
exhibit incredible metabolic and informational diversity despite their reduced genomic
complements. We conclude that the genome expansion in Eukarya occurred primarily for
Our analysis depends upon the accuracy of assigning structures to protein sequences and
the SCOP protein classification and SUPERFAMILY functional annotation schemes. Databases
such as SCOP and SUPERFAMILY are continuously updated with more and more genomes and
new assignments. We therefore ask the reader to focus on the general trends in the data such as
the exact percentage or numbers of FSFs in each functional repertoire. Trends related to the
number of domains in Archaea relative to Bacteria and Eukarya and the reduction of metabolic
repertoires in parasitic organisms should be considered robust since these have been reliably
observed in previous studies with more limited datasets (48, 83). Biases in sampling of
proteomes in the three superkingdoms are not expected to over or underestimate the remarkably
conserved nature of the functional make up. We show that the conservation of molecular
functions in proteomes is only broken in genomic outliers that are united by parasitic lifestyles.
Thus equal sampling will not significantly alter the global trends described for individual
superkingdoms. In light of our results, organism lifestyle is the only factor affecting the
conserved nature of proteomes. Finally, we propose that lower or higher than expected numbers
of FSFs in any category (subcategory) can be explained either by possible limitations of the
scheme used to annotate molecular functions of FSFs or the simple nature of the functional
repertoire. For example, the number of FSFs in minor category structural proteins (major
category General) is only 7 (Table 3.3) despite the importance of structural proteins in cellular
organization. Table 3.3 lists the description of these FSFs and shows that indeed these FSF
domains play important structural roles. Their limited number indicates that the structural and
functional organization is quite limited and only very few folds are utilized for important
structural roles. Another possibility is the ‘hidden’ overlap between FSFs and molecular
annotation scheme. Most of the large FSFs include many FFs and participate in multiple
pathways; for a few FSFs a complete functional profile may not be intuitively obvious. This may
be one of the shortcomings of using this functional annotation scheme but dissection of such
detailed functions and pathways is a difficult task and is not described in this study. In summary,
we do not believe that the classification or annotation schemes, despite their limitations, would
domains in superkingdoms for proteomes for which we have structural assignments. The average
distribution of FSFs in phyla, kingdoms, and superkingdoms reveals that the biggest proportion
of each proteome is devoted in all cases to functions related to metabolism (Figure 3.5).
Phylogenomic analysis has shown that metabolism appeared earlier than other functional groups
and that its structures were the first to spread in life (57). This would explain the relatively large
representation of metabolism in the functional toolkit of cells. Usage of domains related to ECP
and Regulation is significantly higher in Metazoa compared to the rest. This showcases the
importance of regulation and signal transduction mechanisms for eukaryotic organisms (128, 129).
Our results support the view that prokaryotes evolved via reductive evolutionary processes
whereas genome expansion was the route taken by eukaryotic organisms. Genome expansion in
Eukarya seems to be directed towards innovation of FSF architectures, especially those linked to
Regulation, ECP and General. Finally, viral structures make up a substantial proportion of
cellular proteomes and appear to have played an important role in the evolution of cellular life.
Organisms with parasitic lifestyles have simple and reduced proteomes and rely on host cells for
metabolic functions. Tenericutes are unique in this regard. They spend most of their proteomic
that the conservation of molecular functions in proteomes is only broken in ‘outliers’ with
parasitic lifestyles that do not obey the global trends. We conclude that organism lifestyle is a
Figure 3.1 Number of FSF domains annotated for each major functional category defined in SCOP 1.73 (A) and in
the three superkingdoms (B). The functional distributions show that coarse-grained functions are conserved across
cellular proteomes and metabolism is the most dominant functional category. Numbers in parentheses indicate the
total number of FSFs annotated in each dataset. The number of FSFs increases in the order Archaea, Bacteria and
Eukarya.
Figure 3.2 The functional distribution of FSFs in individual proteomes of the three superkingdoms. Both the
percentage (A) and actual FSF numbers (B) indicate conservation of functional distributions in proteomes and the
existence of considerable functional flexibility between superkingdoms. Dotted vertical lines indicate genomic
outliers. Insets highlight the interplay between Metabolism (yellow trend lines) and Information (red trend lines) in
N. equitans.
Figure 3.3 The functional distribution of FSFs with respect to organism lifestyle. Both the percentage (A) and actual
FSF numbers (B) indicate that obligate parasitic (OP) and facultative parasitic (P) organisms exhibit considerable
variability in their metabolic repertoires (yellow trend lines) that is offset by corresponding increases in the
Archaea (A) and Bacteria (B) spend most of their proteomes in functions related to Metabolism and Information
whereas Eukarya (E) stands out in the minor categories of Regulation, General, ICP and ECP. In turn, the number
of FSFs increases in the order Archaea, Bacteria and Eukarya. Eukaryal proteomes have the richest functional
repertoires for Regulation, Other, General, ICP and ECP. The number of metabolic FSFs in Bacteria appears not to
be significantly different from Eukarya but still greater than Archaea. Translation (green trend lines) is the only
design in proteomes. Numbers in parentheses indicate total number of proteomes analyzed for each phylum/kingdom.
Tables
Table 3.1 Mapping between the major and minor functional categories for 1,781 protein domains defined in SCOP
1.73 and the number of FSFs corresponding to each minor category. m/tr, metabolism and transport.
Minor category              Number of FSFs
Photosynthesis              20
E- transfer                 31
Nitrogen m/tr               1
Nucleotide m/tr             30
Carbohydrate m/tr           30
Polysaccharide m/tr         21
Storage                     0
Coenzyme m/tr               50
Lipid m/tr                  17
Secondary metabolism        11
Redox                       55
Transferases                29
Ion binding                 13
Lipid/membrane binding      4
Ligand binding              3
General                     28
Protein interaction         49
Structural protein          7
Translation                 92
Transcription               24
DNA replication/repair      68
RNA processing              10
Nuclear structure           0
Viral proteins              73
Immune response             19
Blood clotting              5
Toxins/defense              40
Phospholipid m/tr           6
Cell motility               20
Trafficking/secretion       0
Protein modification        35
Proteases                   52
Ion m/tr                    21
Transport                   54
DNA-binding                 66
Kinases/phosphatases        15
Signal transduction         53
Receptor activity           18
Table 3.2 The comparison of functional categories across superkingdoms using Welch’s ANOVA.
category General.
reconstruct the phylogenomic history of protein domains and organisms. We included viruses
with medium-to-very-large proteomes into our analyses and compared them to the cellular
organisms. To our knowledge, this is the first exercise that makes extensive use of molecular
data to study the evolution of viruses on a scale that is comparable to the cellular organisms.
Additionally, we assigned molecular functions to protein domains (grouped into FSFs) and
studied the proteomic make up. While structural information proved useful for the reconstruction
composite exercise of linking structure and function proved highly useful and yielded significant
insights into the evolution of organisms, highlighting the conserved nature of proteomes. The
1. The virosphere harbors a significant number of protein domains including many that are
2. Viruses evolved via massive reductive evolutionary processes that explain their highly
reduced genomes.
3. Viruses are more ancient than the cellular superkingdoms and predate the LUCA.
4. Viruses mediate gene transfer between cellular organisms and enhance planetary
biodiversity.
5. The parasitic lifestyle of viruses is a late adaptation and results from massive gene loss
7. Parasitic organisms deviate from the conserved trends and harbor reduced genomes with
8. Prokaryotes (Archaea & Bacteria) evolve by genome reduction (like viruses) whereas
9. Functional distributions in the genomes of Bacteria and Eukarya suggest an ancestral link
We used molecular structure to generate trees that are rooted, i.e., they are statements that
impose an evolutionary arrow, from origins of change in their root branches to very recent
changes in their leaves. These trees describe the evolution of protein domain structures (i.e., ToDs)
and proteomes (i.e., ToPs). They are not phenetic statements (48, 49, 74, 81, 82, 87). While they
are built from multistate or quantitative valued characters, speciation in trees fulfills a molecular
clock that is compatible with paleobiology and the geological record (84). ToPs produce ToLs
We assigned a single molecular function to each FSF. Because large FSFs include many
FFs and participate in multiple pathways, this assignment does not present a complete profile of
molecular functions (49, 58, 105). However, dissection of such detailed annotations is a difficult
task that is not considered in our study. The focus on one-to-one mapping between structures and
functions enables us to compare the distributions of functions across hundreds of proteomes and
when linked with evolutionary information embedded in the rooted trees reveals ancestral and
derived functions.
Our analyses depend upon the schemes used for the assignment of structures and functions to
proteomes. These results should be considered robust unless the aforementioned schemes
REFERENCES
1. Wessner DR (2010) Discovery of the giant mimivirus. Nature Education 3: 61.
2. Gibbs AJ, Calisher CH & García-Arenal F (1995) Molecular basis of virus evolution (Cambridge University Press,
London).
3. Lin CL, Chung CS, Heine HG & Chang W (2000) Vaccinia virus envelope H3L protein binds to cell surface
heparan sulfate and is important for intracellular mature virion morphogenesis and virus infection in vitro and in
4. Pagaling E, et al (2007) Sequence analysis of an archaeal virus isolated from a hypersaline lake in inner
5. La Scola B, et al (2008) The virophage as a unique parasite of the giant mimivirus. Nature 455: 100-104.
6. Mc Grath S, Fitzgerald GF & van Sinderen D (2007) Bacteriophages in dairy products: Pros and cons. Biotechnol
J 2: 450-455.
7. Prangishvili D, Forterre P & Garrett RA (2006) Viruses of the archaea: A unifying view. Nat Rev Microbiol 4:
837-848.
8. Pearson H (2008) 'Virophage' suggests viruses are alive. Nature 454: 677.
9. Claverie JM & Abergel C (2009) Mimivirus and its virophage. Annu Rev Genet 43: 49-66.
10. Abroi A & Gough J (2011) Are viruses a source of new protein folds for organisms? - virosphere structure space
11. Anderson NG, Cline GB, Harris WW & Green JG (1967) in Transmission of viruses by the water route. ed Berg
12. Breitbart M & Rohwer F (2005) Here a virus, there a virus, everywhere the same virus? Trends Microbiol 13:
278-284.
13. Bergh O, Borsheim KY, Bratbak G & Heldal M (1989) High abundance of viruses found in aquatic
14. Rohwer F & Thurber RV (2009) Viruses manipulate the marine environment. Nature 459: 207-212.
15. Suttle CA (2007) Marine viruses--major players in the global ecosystem. Nat Rev Microbiol 5: 801-812.
16. Raoult D, et al (2004) The 1.2-megabase genome sequence of mimivirus. Science 306: 1344-1350.
17. Arslan D, Legendre M, Seltzer V, Abergel C & Claverie JM (2011) Distant mimivirus relative with a larger
genome highlights the fundamental features of megaviridae. Proc Natl Acad Sci U S A 108: 17486-17491.
18. Koonin EV, Senkevich TG & Dolja VV (2006) The ancient virus world and evolution of cells. Biol Direct 1: 29.
19. Raoult D & Forterre P (2008) Redefining viruses: Lessons from mimivirus. Nat Rev Microbiol 6: 315-319.
20. Koonin EV & Yutin N (2010) Origin and evolution of eukaryotic large nucleo-cytoplasmic DNA viruses.
21. Prangishvili D, Stedman K & Zillig W (2001) Viruses of the extremely thermophilic archaeon sulfolobus.
22. Koonin EV & Dolja VV (2006) Evolution of complexity in the viral world: The dawn of a new vision. Virus Res
117: 1-4.
23. Forterre P (2006) The origin of viruses and their possible roles in major evolutionary transitions. Virus Res 117:
5-16.
24. Bandea CI (1983) A new theory on the origin and the nature of viruses. J Theor Biol 105: 591-602.
25. Hendrix RW, Lawrence JG, Hatfull GF & Casjens S (2000) The origins and ongoing evolution of viruses.
26. Moreira D (2000) Multiple independent horizontal transfers of informational genes from bacteria to plasmids
and phages: Implications for the origin of bacterial replication machinery. Mol Microbiol 35: 1-5.
27. Moreira D & Brochier-Armanet C (2008) Giant viruses, giant chimeras: The multiple evolutionary histories of
28. Koonin EV, Senkevich TG & Dolja VV (2009) Compelling reasons why viruses are relevant for the origin of
29. Moreira D & Lopez-Garcia P (2009) Ten reasons to exclude viruses from the tree of life. Nat Rev Microbiol 7:
306-311.
31. Griffiths DJ (2001) Endogenous retroviruses in the human genome sequence. Genome Biol 2: REVIEWS1017.
32. Stanley WM (1935) Isolation of a crystalline protein possessing the properties of tobacco-mosaic virus. Science
81: 644-645.
33. Iyer LM, Aravind L & Koonin EV (2001) Common origin of four diverse families of large eukaryotic DNA
34. Iyer LM, Balaji S, Koonin EV & Aravind L (2006) Evolutionary genomics of nucleo-cytoplasmic large DNA
35. Van Etten JL (2003) Unusual life style of giant chlorella viruses. Annu Rev Genet 37: 153-195.
37. La Scola B, Marrie TJ, Auffray JP & Raoult D (2005) Mimivirus in pneumonia patients. Emerg Infect Dis 11:
449-452.
38. Koonin EV (2005) Virology: Gulliver among the lilliputians. Curr Biol 15: R167-9.
39. Zauberman N, et al (2008) Distinct DNA exit and packaging portals in the virus Acanthamoeba polyphaga
40. Suhre K, Audic S & Claverie JM (2005) Mimivirus gene promoters exhibit an unprecedented conservation
41. Ghedin E & Claverie JM (2005) Mimivirus relatives in the Sargasso Sea. Virol J 2: 62.
42. Claverie JM & Ogata H (2009) Ten good reasons not to exclude giruses from the evolutionary picture. Nat Rev
43. Ludmir EB & Enquist LW (2009) Viral genomes are part of the phylogenetic tree of life. Nat Rev Microbiol 7:
44. Woese CR & Fox GE (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc
45. Woese C (1998) The universal ancestor. Proc Natl Acad Sci U S A 95: 6854-6859.
46. Delsuc F, Brinkmann H & Philippe H (2005) Phylogenomics and the reconstruction of the tree of life. Nat Rev
Genet 6: 361-375.
47. Zuckerkandl E & Pauling L (1965) Molecules as documents of evolutionary history. J Theor Biol 8: 357-366.
48. Wang M, Yafremava LS, Caetano-Anolles D, Mittenthal JE & Caetano-Anolles G (2007) Reductive evolution of
architectural repertoires in proteomes and the birth of the tripartite world. Genome Res 17: 1572-1585.
49. Kim KM & Caetano-Anolles G (2011) The proteomic complexity and rise of the primordial ancestor of
51. Mount DW (2004) Bioinformatics: Sequence and Genome Analysis, (Cold Spring Harbor Laboratory Press, Cold
52. Rokas A, Williams BL, King N & Carroll SB (2003) Genome-scale approaches to resolving incongruence in
53. Tekaia F, Lazcano A & Dujon B (1999) The genomic tree as revealed from whole proteome comparisons.
54. Clarke GD, Beiko RG, Ragan MA & Charlebois RL (2002) Inferring genome trees by using a filter to eliminate
phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol
184: 2072-2080.
55. Gerstein M & Hegyi H (1998) Comparing genomes in terms of protein structure: Surveys of a finite parts list.
56. Chothia C, Gough J, Vogel C & Teichmann SA (2003) Evolution of the protein repertoire. Science 300: 1701-
1703.
57. Caetano-Anolles G, Wang M, Caetano-Anolles D & Mittenthal JE (2009) The origin, evolution and structure of
58. Nasir A, Naeem A, Khan MJ, Lopez-Nicora HD & Caetano-Anolles G (2011) Functional annotation of protein
domains reveals remarkable conservation in the functional make up of proteomes across superkingdoms. Genes 2:
869-911.
59. Murzin AG, Brenner SE, Hubbard T & Chothia C (1995) SCOP: A structural classification of proteins database
for the investigation of sequences and structures. J Mol Biol 247: 536-540.
60. Riley M & Labedan B (1997) Protein evolution viewed through Escherichia coli protein sequences: Introducing
the notion of a structural segment of homology, the module. J Mol Biol 268: 857-868.
61. Caetano-Anolles D, Kim KM, Mittenthal JE & Caetano-Anolles G (2011) Proteome evolution and the metabolic
62. Ponting CP & Russell RR (2002) The natural history of protein domains. Annu Rev Biophys Biomol Struct 31:
45-71.
63. Andreeva A, et al (2008) Data growth and its impact on the SCOP database: New developments. Nucleic Acids
64. Gough J (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21: 1464-1471.
65. Choi IG & Kim SH (2007) Global extent of horizontal gene transfer. Proc Natl Acad Sci U S A 104: 4489-4494.
66. Forslund K, Henricson A, Hollich V & Sonnhammer EL (2008) Domain tree-based analysis of protein
67. Yang S & Bourne PE (2009) The evolutionary history of protein domains viewed by species phylogeny. PLoS
One 4: e8378.
68. Morrison DA (2009) Why would phylogeneticists ignore computerized sequence alignment? Syst Biol 58: 150-
158.
69. Anisimova M, Cannarozzi GM & Liberles DA (2010) Finding the balance between the mathematical and
70. Maddison WP (1993) Missing data versus missing characters in phylogenetic analysis. Syst Biol 42: 576-581.
71. De Laet J (2005) in Parsimony, phylogeny and genomics. ed Albert VA (Oxford University Press, Oxford), pp
81-116.
72. Sober E & Steel M (2002) Testing the hypothesis of common ancestry. J Theor Biol 218: 395-408.
73. Caetano-Anolles G, et al (2009) The origin and evolution of modern metabolism. Int J Biochem Cell Biol 41:
285-297.
74. Wang M & Caetano-Anolles G (2009) The evolutionary mechanics of domain organization in proteomes and the
75. Sun FJ & Caetano-Anolles G (2008) The origin and evolution of tRNA inferred from phylogenetic analysis of
76. Kim KM, Sung S, Caetano-Anolles G, Han JY & Kim H (2008) An approach of orthology detection from
homologous sequences under minimum evolution. Nucleic Acids Res 36: e110.
77. Theobald DL (2010) A formal test of the theory of universal common ancestry. Nature 465: 219-222.
78. Zwickl DJ & Hillis DM (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51: 588-
598.
79. Kluge AG & Farris JS (1969) Quantitative phyletics and the evolution of anurans. Syst Zool 18: 1-32.
80. Huelsenbeck JP & Nielsen R (1999) Effect of nonindependent substitution on phylogenetic accuracy. Syst Biol
48: 317-328.
81. Caetano-Anolles G & Caetano-Anolles D (2003) An evolutionarily structured universe of protein architecture.
82. Wang M & Caetano-Anolles G (2006) Global phylogeny determined by the combination of protein domains in
83. Wang M, Kurland CG & Caetano-Anolles G (2011) Reductive evolution of proteomes and protein structures.
84. Wang M, et al (2011) A universal molecular clock of protein folds and its power in tracing the early history of
aerobic metabolism and planet oxygenation. Mol Biol Evol 28: 567-582.
85. Gough J & Chothia C (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP
sequence searches, alignments and genome assignments. Nucleic Acids Res 30: 268-272.
86. Karplus K (2009) SAM-T08, HMM-based protein structure prediction. Nucleic Acids Res 37: W492-497.
87. Caetano-Anolles D, Kim KM, Mittenthal JE & Caetano-Anolles G (2011) Proteome evolution and the metabolic
88. Vogel C, Berzuini C, Bashton M, Gough J & Teichmann SA (2004) Supra-domains: Evolutionary units larger
89. Vogel C, Teichmann SA & Pereira-Leal J (2005) The relationship between domain duplication and
90. Vogel C & Chothia C (2006) Protein family expansions and biological complexity. PLoS Comput Biol 2: e48.
91. Tatusov RL, et al (2003) The COG database: An updated version includes eukaryotes. BMC Bioinformatics 4:
41.
92. Claverie JM, et al (2006) Mimivirus and the emerging concept of "giant" virus. Virus Res 117: 133-144.
93. Gough J, Karplus K, Hughey R & Chothia C (2001) Assignment of homology to genome sequences using a
library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313: 903-919.
94. Swofford DL (2002) Phylogenetic Analysis Using Parsimony (*and Other Methods) (PAUP*) Ver 4.0b10.
95. Farris JS (1989) The retention index and homoplasy excess. Syst Zool 38: 406-407.
96. Eriani G, Delarue M, Poch O, Gangloff J & Moras D (1990) Partition of tRNA synthetases into two classes
97. Ibba M, Curnow AW & Soll D (1997) Aminoacyl-tRNA synthesis: Divergent routes to a common goal. Trends
98. O'Donoghue P & Luthey-Schulten Z (2003) On the evolution of structure in aminoacyl-tRNA synthetases.
99. Boeke JD & Stoye JP (1997) in Retroviruses, eds Coffin JM, Hughes SH & Varmus HE (Cold Spring Harbor
100. Liu H, et al (2010) Widespread horizontal gene transfer from double-stranded RNA viruses to eukaryotic
101. Edwards RA & Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504-510.
102. Ludmir EB & Enquist LW (2009) Viral genomes are part of the phylogenetic tree of life. Nat Rev Microbiol 7:
103. Belshaw R, et al (2004) Long-term reinfection of the human genome by endogenous retroviruses. Proc Natl
104. Wilson D, et al (2009) SUPERFAMILY--sophisticated comparative genomics, data mining, visualization and
105. Nasir A, Naeem A, Khan MJ, Lopez-Nicora HD & Caetano-Anolles G (2011) The functional make up of
proteomes is remarkably conserved. IEEE International Conference on Bioinformatics and Biomedicine Workshops,
106. Welch BL (1938) The significance of the difference between two means when the population variances are
107. Koonin EV, Wolf YI, Nagasaki K & Dolja VV (2008) The big bang of picorna-like virus evolution antedates
108. Das S, Paul S, Bag SK & Dutta C (2006) Analysis of Nanoarchaeum equitans genome and proteome
composition: Indications for hyperthermophilic and parasitic adaptation. BMC Genomics 7: 186.
109. Huber H, et al (2002) A new phylum of archaea represented by a nanosized hyperthermophilic symbiont.
111. Randau L, Munch R, Hohn MJ, Jahn D & Soll D (2005) Nanoarchaeum equitans creates functional tRNAs
from separate genes for their 5'- and 3'-halves. Nature 433: 537-541.
112. Randau L, Schroder I & Soll D (2008) Life without RNase P. Nature 453: 120-123.
113. Di Giulio M (2006) Nanoarchaeum equitans is a living fossil. J Theor Biol 242: 257-260.
114. Di Giulio M (2007) The tree of life might be rooted in the branch leading to Nanoarchaeota. Gene 401: 108-
113.
115. Woese CR, Maniloff J & Zablen LB (1980) Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci U S
A 77: 494-498.
116. Chambaud I, et al (2001) The complete genome sequence of the murine respiratory pathogen Mycoplasma
117. Gibson DG, Smith HO, Hutchison CA,3rd, Venter JC & Merryman C (2010) Chemical synthesis of the mouse
118. Nakabachi A, et al (2006) The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314:
267.
119. Forterre P & Gribaldo S (2010) Bacteria with a eukaryotic touch: A glimpse of ancient evolution? Proc Natl
120. Devos DP & Reynaud EG (2010) Evolution. Intermediate steps. Science 330: 1187-1188.
121. Kamneva OK, Liberles DA & Ward NL (2010) Genome-wide influence of indel substitutions on evolution of
bacteria of the PVC superphylum, revealed using a novel computational method. Genome Biol Evol 2: 870-886.
123. Katinka MD, et al (2001) Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon
124. Douglas S, et al (2001) The highly reduced genome of an enslaved algal nucleus. Nature 410: 1091-1096.
125. Peyretaillade E, et al (1998) Microsporidian Encephalitozoon cuniculi, a unicellular eukaryote with an unusual
chromosomal dispersion of ribosomal genes and a LSU rRNA reduced to the universal core. Nucleic Acids Res 26:
3513-3520.
126. Martin W & Herrmann RG (1998) Gene transfer from organelles to the nucleus: How much, what happens, and
127. Keeling PJ & Slamovits CH (2005) Causes and effects of nuclear genome reduction. Curr Opin Genet Dev 15:
601-608.
128. Burglin TR (2008) Evolution of hedgehog and hedgehog-related genes, their origin from hog proteins in
ancestral eukaryotes and discovery of a novel hint motif. BMC Genomics 9: 127.
129. Ingham PW, Nakano Y & Seger C (2011) Mechanisms and functions of hedgehog signalling across the
APPENDIX A
Table A.1 Top ten technical and conceptual advantages of trees from protein domain abundance over standard
presence of gene)
without outgroups
alignment
Problem of inapplicables in phylogenetic analysis    Yes    No    No
loss
units)
False assumption of Yes (e.g., nucleotide No (when used with No (the tree reveals
homogeneity of parts
Orthology/paralogy Yes No No
of taxon sampling
independence independently)
APPENDIX B
Figure B.1 uToL reconstructed from the total proteome dataset. One optimal (P<0.01) most parsimonious
phylogenomic tree describing the evolution of 981 riboorganisms (652 Bacteria, 70 Archaea, and 259 Eukarya) and
56 viruses (51 NCLDV and 5 viruses from Archaea, Bacteria and Eukarya) generated using the census of abundance
of 1,739 FSFs (1,696 parsimoniously informative sites; 263,635 steps; CI = 0.043; RI = 0.800; g1 = -0.112).
Terminal leaves of Viruses (V), Archaea (A), Bacteria (B), and Eukarya (E) were colored red, blue, green and black
respectively.