Genomic Analysis of Bacterial Outbreaks
Outbreaks of infectious diseases often produce social alarms. These can be very local or reach
every corner of every village and city on Earth. But they all share a need for quick control and
remediation that ensures the safety of the population. The identification and control of the
source of an outbreak becomes a health priority and many efforts are devoted to these activities
in the first days and weeks after the detection and/or declaration of an outbreak (Mortimer,
Outbreaks come in many shapes and flavors. For epidemiologists, an outbreak is simply an
unusual increase in the prevalence of a disease in time and space. Hence, some outbreaks may
be declared and last for years while others are reduced to a few days or weeks; similarly, there
might be an outbreak in a school or nursing home, but we talked a few years ago about an
epidemic outbreak of “swine influenza” (Fraser et al, 2009;General Directorate of Epidemiology
et al, 2009) and the WHO and other health organizations are currently worried about the spread
of Zika virus. In some cases, the spread of the infectious pathogen occurs in a series of successive
infections from one host to another thus producing transmission chains or networks, depending
on the topology of the resulting connections among infected persons.
One of the first tasks when an outbreak is suspected is to establish the basic parameters for
controlling it. This can depend on the detection of a source, and the application of actions that
prevent it from spreading the pathogen, or the characterization of the vector, so it can be
controlled with chemical or biological agents, or the identification of the hereditary factors that
allow the pathogen to elude previously successful treatments and originate nosocomial outbreaks
of multi-resistant strains. The advent of faster and cheaper gene sequencing techniques led to
the first systematic and general proposal of using a universal typing scheme that was
reproducible, cheap, objective and easily exchangeable among laboratories, known as MultiLocus Sequence Typing or MLST (Maiden et al, 1998). In this method, the nucleotide sequence
of 6-7 loci is determined and used to derive an array of allele profiles in these loci. A new
combination of allele profiles corresponds to a new sequence type which is uploaded to a web-server for easy access. Typing schemes, with detailed laboratory protocols, proficiency tests, and
full information on identified sequence types are available for tens of bacterial species in
general and specific web-servers (see, for instance,
For many pathogens, the availability of a MLST scheme represented a more than significant
change in the analysis of outbreaks. This method quickly became the new “gold-standard” for
typing pathogens and replaced previous methods. However, for a few but important pathogens
no MLST scheme revealing enough genetic variation for effectively distinguishing among non-epidemiologically linked isolates could be designed. These pathogens include the causative
agents of plague (Yersinia pestis), anthrax (Bacillus anthracis), tuberculosis (Mycobacterium
tuberculosis) and leprosy (Mycobacterium leprae), among others, and are collectively known as
“genetically monomorphic bacteria” (Achtman, 2012). Specific typing methods such as insertion
sequence RFLP and MIRU-VNTR were applied to M. tuberculosis, the pathogenic bacterium with
the highest incidence and the one that has caused the most deaths in the history of humankind. In these
and other cases, the solutions adopted relied on very fast evolving markers, which are usually
prone to homoplastic changes, thus resulting in some false positive identifications of phenotypic
identities as indicative of very recent ancestry. Although this is not a problem in most settings,
it became evident that the same logic applied in using MLST could be extended to the complete
genome sequences to attain “perfect” accuracy by using all the genetic information in the
isolates and not only a small sample from it.
This approach was first used in an outbreak setting in the investigation of the letters covered
with anthrax spores in the aftermath of the 9/11 attacks in the USA. Complete genome
sequences were obtained from a B. anthracis isolate derived from one of the victims and one
reference strain, providing 60 SNPs that could be used subsequently to probe the common origin
of the strain used in the bioterrorist attacks (Read et al, 2002). This work clearly showed that
using the complete genome sequence was a more effective method for comparing isolates even
in almost completely monomorphic species. However, Sanger sequencing is rather slow and
painstaking as a result of the need to cut or amplify the genome in small pieces that are
subsequently sequenced and assembled into a complete genome sequence. This situation
changed dramatically with the introduction of new sequencing methods, then known as “next-generation sequencing” technologies. They offered several advantages over the traditional
Sanger method (Medini et al, 2008). At the same time, other problems arose, such as the
difficulties in handling and analyzing very large volumes of data, a myriad of programs and
methods to analyze them, and new conceptual challenges in the interpretation of the results.
In this chapter we provide a brief overview of the different next-generation sequencing
platforms and methods currently available for deriving complete genome sequences from
bacteria, the main results in terms of the epidemiological and evolutionary advances that have
resulted from their application to bacterial outbreaks and transmission networks, and provide a
more detailed analysis of two cases, the analysis of Legionella pneumophila outbreaks and of M.
tuberculosis transmission networks.
High throughput sequencing technologies in outbreak investigations
Several high throughput sequencing platforms have been applied to the genomic study of both
bacterial and viral pathogens. Encouraged by the increasing need to sequence human
genomes, three technologies were released almost simultaneously by different companies:
454 (Roche, introduced in 2005 and discontinued in 2016), Solexa (Illumina, introduced in 2006),
and SOLiD (Life Technologies, introduced in 2006). These platforms share a general workflow,
based on the idea of performing billions of sequencing reactions simultaneously. These are
produced through molecular amplification of DNA fragments that are previously attached to a
solid surface. These have been enhanced in their subsequent updates to increase both
sequencing quality and throughput (Figure 1).
Although 454 was the first released platform, its use has mainly been relegated to metagenomic
studies (Schlüter et al, 2008b;Schlüter et al, 2008a;Ghai et al, 2010) because of its long reads
and relatively high error rates, which complicate the study of transmission chains or related
cases during outbreak investigations. However, it has been used as the main technology in
several studies (Lewis et al, 2010;Kennemann et al, 2011;Loman and Constantinidou, 2013) and
also in mixed strategies that use 454 reads as scaffolds with subsequent error
correction using Illumina (McAdam et al, 2012;Hasan et al, 2012). SOLiD has been the least used
for outbreak investigations due to its shorter and lower quality reads. As an example, it has been
occasionally applied in the investigation of L. pneumophila outbreaks in an endemic locality in
Spain (Sánchez-Busó et al, 2014), Mycobacterium abscessus subsp. bolletii in Brazil and UK
outbreaks (Davidson et al, 2013) or Coccidioides immitis producing coccidioidomycosis in
transplanted patients in Los Angeles (Engelthaler et al, 2011). By far, Illumina has been the most
widely used platform because of its high quality and reasonably sized reads, which allow more
accurate mapping and SNP calling. A thorough summary of the application of different
sequencing technologies to the analysis of different, mainly bacterial, outbreaks is shown in Table 1.
In 2010, the Ion Torrent (Life Technologies) platform, a new benchtop device with a different
sequencing strategy, was commercialized. This technology is based on monitoring pH changes in
multi-well plates. A single reaction occurs per well, so that when a hydrogen ion is released
after the incorporation of each nucleotide during amplification, the pH in the medium changes in
a nucleotide-specific manner and the system is able to translate chemical into digital
information. Reads produced by the Ion Torrent were of relatively good quality, and the platform was
occasionally applied to the study of outbreaks of Escherichia coli (Mellmann et al, 2011;Holmes et
al, 2015) and Pseudomonas aeruginosa (Snyder et al, 2013;Witney et al, 2014).
In early 2011, the PacBio RS system was also released, being the first platform performing Single
Molecule Real Time (SMRT) sequencing, which is being increasingly applied to complete
microbial genomes because of the long read lengths (Mutreja et al, 2011). But the definite
current revolution in sequencing technologies with an impact on public health has been the
release of the Oxford Nanopore MinION platform, currently in test mode, and scalable in the
form of the GridION platform. These contain a membrane with millions of embedded nanopores
coupled with a polymerase. Changes in the electrical conductivity in the membrane as the
four different bases pass through the nanopore are measured, allowing sequencing in real time.
Specifically, the MinION platform is a USB-like device which can be connected directly to a
computer and provides the sequences from extracted DNA in real time after a very simple library
preparation. The portable MinION platform has been shown to be useful in real-time outbreak
investigations, such as the 2015 Ebola virus disease epidemic in West Africa (Quick et al, 2016).
The different platforms differ in their sequencing strategy, which yields different throughputs
and sequence qualities. Currently, the highest throughput can be achieved with the HiSeq X Ten
Illumina platform, which can yield up to 3 billion paired-end 150 bp sequences. This high level
throughput is mainly directed to population-scale human genome sequencing projects. In the
case of microorganism sequencing, because their genomes are much smaller, sequencing
throughput must depend on the depth of coverage required for each specific study. However,
large-scale microbial sequencing projects can benefit from these high throughput platforms by
multiplexing different strains in the same run. Coverage depths of 50X-100X are usually sought
for base call error correction, minimizing the rate of false positive SNPs. Currently, the
technologies with the lowest error rates are Illumina platforms, and the highest error rate from
raw data is provided by Oxford Nanopore and PacBio platforms. However, bioinformatics
pipelines for error correction during the post-processing of reads improve these rates, especially
in the second case, in which the current final error rate can get as low as 1E-05. Multiple reviews
on the characteristics of the different sequencing technologies, applications, advantages and
drawbacks have been published in the literature up to now (Metzker, 2010;Casey et al,
2013;Ekblom and Wolf, 2014).
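The relation between run throughput, genome size and depth of coverage described above can be made concrete with a short calculation. The following is only an illustrative sketch: the function names are hypothetical, the 3.4 Mb genome size is an assumed round figure for a bacterial genome and is not taken from the text, each of the 3 billion sequences is treated as a single 150 bp read, and losses due to quality filtering or uneven library representation are ignored.

```python
def mean_depth(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def max_multiplexed_genomes(n_reads, read_length, genome_size, target_depth):
    """Number of genomes of a given size that fit in one run at the target depth."""
    reads_per_genome = target_depth * genome_size / read_length
    return int(n_reads // reads_per_genome)


if __name__ == "__main__":
    run_reads = 3_000_000_000   # upper bound quoted in the text for one run
    read_length = 150           # bp, as quoted in the text
    genome_size = 3_400_000     # bp, ASSUMED bacterial genome size (illustrative)

    for depth in (50, 100):     # coverage range usually sought, per the text
        n = max_multiplexed_genomes(run_reads, read_length, genome_size, depth)
        print(f"{depth}X target: about {n} genomes could share one run")
```

In practice the number of multiplexed strains is also limited by the number of available index sequences and by the run-to-run variation in yield, so the figures produced by this sketch should be read as upper bounds.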
Choosing the most appropriate sequencing technology depends on the scope of the study. High
throughput technologies can be applied in different steps during an outbreak investigation
(Köser et al, 2012): from the detection and identification of the pathogen directly in uncultured
samples (e.g. blood, sputum), through epidemiological typing and detection of mutations associated
with drug susceptibility, to the study of transmission chains and potential super-spreaders.
Achievements and limitations of NGS in outbreak investigations
Initial results. Although NGS techniques and devices became available around 2005 (Loman and
Pallen, 2015), it took a few more years until the new technologies were first applied to analyze
an outbreak. This corresponded to an outbreak of methicillin-resistant Staphylococcus aureus
(MRSA) (Harris et al, 2010). The authors analyzed a set of 63 isolates from two origins, a global collection
of 43 samples collected between 1982 and 2003, and 20 isolates from a Thai hospital sampled
in a very short time period (months), suspected to correspond to a transmission chain. Their
results provided evidence for the international spread of the resistant clone of S. aureus and the
single origin of the samples from the hospital. But they also showed that bacteria can and do
evolve rapidly. They estimated that in the core genome, the set of shared positions among all
the studied isolates, the rate of divergence was about 1 SNP every 6 weeks. This explained the
lack of identity among most hospital isolates, which differed in a few SNPs from each other, but
it also revealed differences from the patterns of evolution revealed by other markers, such as
spa and PFGE. Of note, the analysis of complete genomes showed that over a quarter of the
homoplasies found among the isolates were directly related to the evolution of resistance to
At about the same time, Lewis et al. (2010) used complete genome sequences to establish
relationships among otherwise indistinguishable strains of Acinetobacter baumannii which had
caused a small outbreak at a British hospital. The SNPs found by WGS allowed the investigators
to discriminate among alternative epidemiological hypotheses. These pioneering studies have
been followed by many others (Table 1) which have dealt with outbreaks and transmission
networks of over 30 different bacterial species infecting humans. An even larger number of
works have been published about viral infections (not included in this review) and a few have
dealt with fungal infections. Two particular bacteria, M. tuberculosis and L. pneumophila, the
main etiological agents of tuberculosis and legionellosis, respectively, are analyzed in more
detail below, but some general patterns and conclusions have started to emerge from the
analysis of more than 30 pathogenic bacteria, that we briefly review next.
From retrospective to real time analysis of outbreaks. We have previously commented that the
molecular analysis of outbreaks and transmission networks is necessarily a complement to the
epidemiological investigations leading to the identification and control of the source(s), vectors
or routes so as to put a fast stop to ongoing processes. Hence, it is very important that the
information obtained from the molecular analyses can be shared with the epidemiology team
for a better evaluation of the total evidence available thus far, so that more appropriate and
accurate decisions can be adopted. The initial methodologies available for WGS were very labor
intensive and the shortest time from obtaining a sample to determining its complete sequence
was in the order of weeks, too long when action is urgently needed. However,
the advent of new technologies, such as Ion Torrent PGM and, more recently, MinION, has
changed this situation. Both methods can deliver sequence information within a few hours of
gaining access to the sample, thus allowing a very rapid communication of results to field
The first case in which these new technologies were applied during the investigation of the
source of an outbreak was that of an enteroaggregative Escherichia coli O104:H4 strain that
affected several European countries in the spring of 2011 (Mellmann et al, 2011). In just 62 hours, complete
genome sequences were obtained from a representative isolate of the outbreak and from a reference
strain which produced similar clinical features. The comparison revealed key
differences in plasmid and gene contents between the strains, indicating that the outbreak was
due to a new and not a previously circulating strain of the bacterium. It also allowed the design
of a test that could be applied for quick diagnosis in any lab.
Loss of identity as hallmark of relatedness. One consequence of using complete genome
sequences for the analysis of outbreaks and transmission chains is the necessary dismissal of
complete identity as the decisive criterion for considering two or more isolates as linked to the
same transmission event or episode. Identity was the usual criterion for most previous markers, which
explored only a minor fraction of the nucleotides in the genome of the pathogenic bacteria.
Except for a few rapidly evolving markers, usually associated with tandem repeats, the number of
differences expected between two isolates depends on three factors: the mutation rate per site,
the number of sites being compared and the time since they diverged from their last common
ancestor. When the number of generations since divergence is relatively small, as in outbreaks
and most transmission networks, and the number of sites being sampled is also small, the
probabilities of finding a SNP (or a different allele in the case of MLST) are also very small.
However, using complete genome sequences, and assuming that the other factors
remain identical, will increase those probabilities by three orders of magnitude or more, because the
number of sites interrogated is now in the order of millions instead of tens or hundreds.
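The dependence of the expected number of differences on the three factors listed above can be illustrated with a simple calculation. This is a minimal sketch under stated assumptions, not a method taken from the studies cited: mutations are assumed to accumulate as a Poisson process along the two lineages since their last common ancestor, and the rate, divergence time and numbers of sites used below are hypothetical round values chosen only for illustration.

```python
import math


def expected_differences(rate_per_site_per_year, n_sites, years_since_divergence):
    """Expected number of differences between two isolates.

    The factor 2 accounts for the two lineages, each of which accumulates
    mutations independently since the last common ancestor.
    """
    return 2 * rate_per_site_per_year * n_sites * years_since_divergence


def prob_at_least_one_difference(rate_per_site_per_year, n_sites, years_since_divergence):
    """Probability of observing one or more differences under a Poisson model."""
    expected = expected_differences(rate_per_site_per_year, n_sites, years_since_divergence)
    return 1.0 - math.exp(-expected)


if __name__ == "__main__":
    rate = 1e-6    # substitutions per site per year (ASSUMED, illustrative)
    years = 0.5    # time since the common ancestor (ASSUMED, illustrative)

    # ASSUMED numbers of sites: a few loci versus a complete genome.
    for label, sites in (("MLST-like scheme", 3_000), ("complete genome", 3_000_000)):
        e = expected_differences(rate, sites, years)
        p = prob_at_least_one_difference(rate, sites, years)
        print(f"{label}: expected differences = {e:.3f}, P(at least one) = {p:.3f}")
```

With the same rate and divergence time, the expected number of differences scales linearly with the number of sites compared, which is why identity is the expected outcome for a handful of loci but not for complete genomes.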
Within-host evolution. In addition, the exploration of complete genome sequences of long- or
chronically-infecting bacteria has shown that evolution does occur within hosts at rates high enough
to be reflected in some nucleotide changes (Didelot et al, 2016). Even for pathogens that
produce acute infections, a low per site mutation rate is compensated by the large number of
nucleotides present in a genome and the different random and directional processes that occur
in an infected individual, thus leading to some new mutations arising in many newly replicated
genomes (Kennemann et al, 2011;Mathers et al, 2015). If the infection lasts longer or becomes
chronic, the chances that changes occur in the pathogen are very high and additional
evolutionary processes such as compartmentalization may contribute to within patient
differentiation of bacterial sub-populations.
These processes have important consequences at different levels. On the one hand, a variable
population can adapt more rapidly to new environmental conditions which might include new
treatments or an adaptive immune response by the host (Mwangi et al, 2007). On the other
hand, a variable population will result in different initial compositions in successive transmission
events, which will be reflected in differences among the populations established in the new
hosts. The analysis of transmission networks becomes more complicated because using a single
genome sequence per host cannot reveal the whole range of variation present within it (Worby
et al, 2014). Under these circumstances, the use of evolutionary methods to reveal the common
ancestry of isolates derived from patients presumably included in the same network becomes
an absolute necessity.
Mutation patterns and processes. Apart from revealing larger amounts of variation than
anticipated from previous studies with just a few gene sequences, whole genome sequences
have also informed about the types and distribution of mutational changes occurring at different
time-scales. A few years ago, the contribution of homologous recombination and horizontal
gene transfer to genetic variation in bacterial genomes was found to be considerably more
important than previously thought (Doolittle, 1998). But this was thought to be the result of
millions of generations in which a generally rare process might have been acting. In shorter time-scales, months or years, the impact of processes generating variation other than point mutation
was thought to be negligible except for loci including repeat units, such as MIRU-VNTRs in M.
tuberculosis, in which slippage-and-mispairing during replication often leads to new alleles.
Recent analyses at the complete genome level have shown that this view is incorrect, at least
for some bacteria such as Neisseria gonorrhoeae, Salmonella enterica or L. pneumophila (Didelot
and Maiden, 2010;Sánchez-Busó et al, 2014). In fact, a comparison of the relative effects of
recombination and point mutation in almost 50 bacterial species revealed variation of three
orders of magnitude (Vos and Didelot, 2009). Although there are no quantitative estimates yet,
horizontal gene transfer, with or without final stabilization in the receiving genome, is also
known to play a significant role in the short term evolution of many bacteria, as unfortunately
shown by the ease of spread of many antibiotic resistance genes across species. The additional
variation introduced by these processes has to be considered when analyzing large transmission
networks or long-lasting outbreaks, because the incorporation of these new variants may
confound inferences of recent ancestry based on overall similarity or on a few loci.
Rates of evolution. The increased availability of complete genome sequences from bacteria with
a more or less direct epidemiological link has also provided an opportunity for a more detailed
study of evolutionary processes at the population genomic level. Apart from the different types
of variants introduced in these populations, the access to asynchronously sampled isolates
allows the application of Bayesian methods to estimate evolutionary rates (Drummond et al,
2006). These methods can accommodate strict and relaxed clock models, different demographic
regimes, as well as variation in rates among lineages, thus allowing the estimation of relevant
evolutionary parameters from organisms with different natural and evolutionary histories. Most
often they are applied to rapidly evolving organisms, collectively known as measurably evolving
populations (Drummond et al, 2003;Biek et al, 2015), which mainly include viruses along with
some bacteria. But the methods are also valid for more slowly evolving organisms, provided that sampling
dates are different enough to yield estimates of the evolutionary rate. Recently, this approach
has been used with bacterial genomes obtained from ancient samples (Schuenemann et al,
2013;Bos et al, 2014;Mendum et al, 2014;Rasmussen et al, 2015;Bos et al, 2016;Maixner et al,
One apparent feature of the estimates of bacterial evolutionary rates is the negative correlation
between the time to the most recent common ancestor of the sample studied and the inferred
evolutionary rate (Figure 1). Higher evolutionary rates at short times can be explained by the
relative inefficiency of natural selection and/or genetic drift in the removal of neutral or quasi-neutral polymorphisms which are continuously arising in bacterial populations. Hence,
transient polymorphisms contribute significantly to the apparent acceleration of evolutionary
rates in short time-scales. At the same time, they also provide a wealth of variation that might
have an adaptive value if the circumstances are appropriate. In the long run, many of these
transient variants will have disappeared and evolutionary rates are reduced correspondingly.
This negative correlation has to be taken into account when comparing rates across studies,
even for the same species, and in the inference of other evolutionary parameters (Biek et al,
The analysis of (almost) complete genome data. One of the main advantages of MLST or SBT
over alternative methods for the analysis of pathogenic bacteria in the context of outbreaks and
transmission chains is the objectivity and simplicity in the specification of the variants found in
any isolate. The nucleotide sequences obtained for each locus are compared to a predetermined
database in which previous homologous sequences have been deposited. If there is a perfect
match, the newly determined variant receives the same identifier as the pre-existing one. If that
is not the case, curators of the database will assign a new code to the variant. The combination
of allele codes in the loci included in the typing scheme is summarized in a sequence type (ST),
with a different number for each combination of variants. This procedure is easily communicated
because it requires the identification of nucleotide variants, usually through Sanger sequencing,
in just 6 or 7 loci. However, the advent of NGS and the determination of complete genome
sequences make this procedure of denoting the variants impractical.
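The allele and sequence type assignment just described amounts to two exact-match lookups, which the following sketch illustrates. It is a toy example under stated assumptions: the locus names, sequences, allele numbers and ST numbers are all hypothetical, only three loci are used instead of the 6 or 7 of a real scheme, and the function names do not correspond to any existing typing software or database interface.

```python
def allele_profile(locus_sequences, allele_db):
    """Look up each locus sequence in the allele database (exact match only).

    Returns a tuple of allele numbers in sorted locus order; None marks a
    sequence not yet in the database, which curators would have to number.
    """
    profile = []
    for locus in sorted(allele_db):
        known_alleles = allele_db[locus]
        profile.append(known_alleles.get(locus_sequences[locus]))
    return tuple(profile)


def sequence_type(profile, st_db):
    """Return the ST for a complete profile, or None if it is not yet defined."""
    if None in profile:
        return None
    return st_db.get(profile)


# HYPOTHETICAL toy database: real schemes store full-length locus sequences.
ALLELE_DB = {
    "locusA": {"ACGT": 1, "ACGA": 2},
    "locusB": {"TTGA": 1},
    "locusC": {"GGCC": 1, "GGCA": 2},
}
ST_DB = {(1, 1, 1): 1, (2, 1, 1): 2}

if __name__ == "__main__":
    isolate = {"locusA": "ACGA", "locusB": "TTGA", "locusC": "GGCC"}
    profile = allele_profile(isolate, ALLELE_DB)
    print("allele profile:", profile, "sequence type:", sequence_type(profile, ST_DB))
```

A result of None at either step corresponds to the situation in which database curators assign a new allele code or a new ST number.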
Several alternatives have already been proposed for the identification of complete genome
sequences for epidemiological analysis. One method consists in extending the MLST naming
scheme to more loci, eventually all the loci in the genome of the corresponding species, thus
leading to “whole genome MLST” (wgMLST) schemes (Cody et al, 2013). The first proposal of
wgMLST was made for Campylobacter isolates, and the initial MLST scheme based on 7 loci was
extended to 1667 loci, although this number was reduced to 1026 when only those present in
all the isolates analyzed were considered. This represents the “core genome” of the species,
which is complemented by the “auxiliary genome”, the set of loci which are present in some but
not all the isolates of a species. In light of the very large genome plasticity of many bacterial
species, fixed compositions of the core and auxiliary genomes are almost impossible, which
creates an additional problem for the stability of the scheme. Nevertheless, this approach has
gained some popularity and cgMLST (“core genome MLST”, a reduced version of wgMLST as
described above) schemes are now available for several pathogens including S. aureus, Listeria
monocytogenes, Enterococcus faecium (de Been et al, 2015), and S. enterica (Taylor et al, 2015),
among others.
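The distinction between core and auxiliary genome, and the reason why their composition is not stable, can be sketched with simple set operations. The example below is illustrative only: the isolate and gene names are hypothetical, the function name is not taken from any published pipeline, and a real analysis would first have to decide which sequences in different isolates count as the same locus.

```python
def split_core_auxiliary(loci_by_isolate):
    """Split loci into core (present in every isolate) and auxiliary (the rest)."""
    locus_sets = [set(loci) for loci in loci_by_isolate.values()]
    core = set.intersection(*locus_sets)
    all_loci = set.union(*locus_sets)
    return core, all_loci - core


if __name__ == "__main__":
    # HYPOTHETICAL presence lists; names are placeholders, not real genes.
    isolates = {
        "iso1": ["geneA", "geneB", "geneC", "geneD"],
        "iso2": ["geneA", "geneB", "geneC"],
        "iso3": ["geneA", "geneB", "geneE"],
    }
    core, auxiliary = split_core_auxiliary(isolates)
    print("core:", sorted(core), "auxiliary:", sorted(auxiliary))

    # Adding one more isolate can shrink the core, which is the stability
    # problem for fixed cgMLST schemes mentioned in the text.
    isolates["iso4"] = ["geneA", "geneC"]
    core, auxiliary = split_core_auxiliary(isolates)
    print("core after adding iso4:", sorted(core))
```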
To prevent the proliferation of STs which inevitably accompanies wgMLST or cgMLST, a first level
classification of STs into clusters or clonal groups is usually performed (Cody et al, 2013;Qin et
al, 2016). These can be based on an extension of the BURST method (Feil et al, 2001;Feil et al,
2004), which assigns to the same clonal group those variants that differ in one single
locus of the original MLST scheme, or use more sophisticated approaches based on the
population genetic analysis of the actual SNPs detected in the loci included in the wgMLST or
cgMLST (Qin et al, 2016) with different molecular population methods such as BAPS (Corander
and Tang, 2007) or STRUCTURE (Rosenberg et al, 2002). These methods share the advantage of
portability thus allowing comparisons among different laboratories and needs. However, they
also discard important information, eventually crucial, contained in the auxiliary genome.
Hence, although standard typing schemes are useful, whole genome sequence information
should not be reduced to a ST number or complex under a wgMLST and the complete data
should still be available for future use by the scientific community.
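The first-level grouping of STs into clonal groups by single-locus differences can be sketched as follows. This is a deliberately simplified illustration of the idea and not the BURST or eBURST implementation: STs are linked whenever their allele profiles differ in at most one locus and groups are taken as the connected components of those links, without any inference of founder genotypes. The ST labels and allele profiles are hypothetical.

```python
from itertools import combinations


def n_differing_loci(profile_a, profile_b):
    """Count the loci at which two allele profiles carry different alleles."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def clonal_groups(profiles, max_diff=1):
    """Group STs linked by chains of profiles differing in <= max_diff loci."""
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent links to the group representative (with path halving).
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for st_a, st_b in combinations(profiles, 2):
        if n_differing_loci(profiles[st_a], profiles[st_b]) <= max_diff:
            parent[find(st_a)] = find(st_b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    # Largest groups first; ties broken by the smallest label in the group.
    return sorted(groups.values(), key=lambda g: (-len(g), min(g)))


if __name__ == "__main__":
    # HYPOTHETICAL seven-locus allele profiles.
    profiles = {
        "ST1": (1, 1, 1, 1, 1, 1, 1),
        "ST2": (1, 1, 1, 1, 1, 1, 2),
        "ST3": (1, 1, 1, 1, 1, 3, 2),
        "ST4": (5, 6, 7, 8, 9, 9, 9),
    }
    for group in clonal_groups(profiles):
        print(sorted(group))
```

Note that chaining places ST1 and ST3 in the same group even though they differ in two loci, because both are single-locus variants of ST2.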
Outbreak investigation in Mycobacterium tuberculosis: the genome as an
epidemiological marker
Mycobacterium tuberculosis is the main causative agent of human tuberculosis in the world.
Every year more than 1.5 million persons die of tuberculosis, more than of any other infectious
disease (WHO, 2014). The epidemiology of the disease has to take into account the natural
history of the bacteria. It is an obligate human pathogen with very effective airborne
transmission and that typically infects the lungs. It is estimated that one third of the human
population is infected by the bacilli and this explains why every year around 9 million new cases
are declared. In most cases the initial infection results in an asymptomatic state called latency
in which the bacteria have not been eliminated but are controlled by the immune system. In
5-10% of the latent cases the disease progresses to an active state in which the bacteria actively
replicate and cause pulmonary disease. Only an active tuberculosis case can transmit the disease
and thus in tuberculosis, disease and transmission are linked. The typical window of progress to
active disease after infection is two years but the bacteria may remain latent for years or even
Mycobacterium tuberculosis has been traditionally regarded as a monomorphic organism due
to the low genetic diversity found among representative strain datasets (Achtman, 2008). Thus
epidemiological tools were developed based on fast evolving genetic elements (Barnes and
Cave, 2013). Typing of the insertion sequence IS6110 by RFLP and of minisatellites, called MIRU-VNTR, are the two gold standards in tuberculosis molecular epidemiology and, together with
spoligotyping, based on the CRISPR region of the bacteria, have allowed the definition of successful M.
tuberculosis clones. Among these clones, the identification of a hypervirulent clade, called the
Beijing family, has attracted much attention (Parwati et al., 2010). Strains from the Beijing family
are more common in East Asia but can be identified across the globe. Experimental and
epidemiological research has identified Beijing strains as hypervirulent in the mouse model of
infection and as frequently associated with drug resistance in humans. In South Africa, Beijing
strains have been on the rise for the last 40 years (Cowley et al., 2008). Beijing strains belong to
one of the seven lineages of human tuberculosis strains (Comas et al., 2013). The most common
is lineage 4, which is highly frequent in Africa, Europe and America. There is a strong association
between lineages and their geographic origin, the most extreme cases being the two lineages of
Mycobacterium africanum, which can only be found in West Africa (De Jong et al, 2010), and
Lineage 7, recently described in Ethiopia (Comas et al., 2013). Regardless of the lineage,
resistance to first and second line treatments has been identified (Farhat et al., 2013). The
mutations responsible for drug resistance are always chromosomal mutations because there is
no ongoing horizontal gene transfer in M. tuberculosis. Although ecological theory predicts that
drug resistance mutations have a fitness cost, experimental evolution and molecular
epidemiology have shown that different drug resistance mutations have different fitness costs
(Comas et al., 2012). As a consequence, multidrug-resistance cases (MDR-TB) among people
never treated before, and therefore due to transmission, are on the rise and in some particular
areas represent more than 50% of the tuberculosis burden of the region. Although not part of
this review, whole genome sequencing is making it possible to define not only the set of mutations associated with
resistance to the different antibiotics but also the genotype of highly successful MDR-TB strains.
The first study that showed the potential of the genome as an epidemiological marker dates
back to 2009 (Niemann et al, 2009). In this study, three strains which looked almost identical
using traditional molecular epidemiology markers such as restriction fragment length
polymorphisms (RFLP) and minisatellite (MIRU-VNTR) were shown to differ in more than 100
SNPs. Later on, Jennifer Gardy and collaborators (2011) used genome comparison techniques to
solve an on-going outbreak in British Columbia suspected to have started in the early 1990s. By
combining genomic, epidemiological and social contact data, the authors showed that a better
resolution of the transmission events within transmission clusters can be gained. Such
events are very difficult to identify with traditional molecular epidemiology markers. This work
already defined index cases associated with multiple secondary cases, also denoted as super-spreaders. Super-spreaders are becoming a common topic when analyzing large transmission
clusters (Walker et al, 2013b) instead of the traditional view of a stepwise "chain" of
Since 2010, NGS has been successfully applied to resolve tuberculosis outbreaks in depth.
Considerable attention has been paid to understanding those outbreaks that have been on-going
for years. For example, a large outbreak in Hamburg, Germany, was identified by classical
genotyping data in 1996 (Roetzer et al, 2013). However, clustering data did not always correlate
with epidemiological and geographical information leading to the suspicion that the outbreak
was more complex than previously anticipated. By whole genome sequencing of 86 strains from
the outbreak (1996-2011), Roetzer et al. (2013) were able to identify an independent
transmission network, thus confirming the non-clonality of the outbreak. Two clusters were
determined, one starting in 1997 and the other starting in 2010, much more in agreement with
epidemiological investigations. Therefore, one important application of whole genome
sequencing to the investigation of tuberculosis outbreaks is the ability to assign cases to the
outbreak with higher confidence and to exclude those that, albeit genetically close, correspond to a different
chain of events.
Similarly, in Bern, Switzerland, a genotype detected by RFLP profiling caused a large number of
tuberculosis cases during the 1990s (Stucki et al, 2015). The cases were associated with the
risk factors typical of local populations in European cities, such as HIV infection or alcoholism.
Stucki et al. (2015) sequenced the complete genome of strains belonging to the original outbreak
along with local control strains. By comparing outbreak and control strains they designed a real-time SNP typing assay based on the detection of a genome position carrying a polymorphism specific
to the outbreak strains. Next, they typed a retrospective collection of isolates from the Canton of
Bern from 1993 to 2011. They were able to identify 68 additional cases of the outbreak based
on the presence of this SNP, including cases from 2011. Therefore, the combination
of whole genome sequencing and SNP typing allowed them to identify cases associated with the
outbreak and to find that the outbreak that started in the early nineties was still on-going at the time
of investigation. In addition, they obtained the whole genome sequence of all the isolates
assigned to the outbreak. With this information, they were able to resolve the individual
transmission patterns for 75% of the strains. Importantly, 66 out of the 68 strains had exactly
the same RFLP pattern. Furthermore, the analysis of the transmission network together with the
epidemiological information revealed two different sub-outbreaks initiated by two different
Therefore, next generation sequencing of the Hamburg (Roetzer et al, 2013), Bern
(Stucki et al, 2015) and other outbreaks (Török and Peacock, 2012;Smit et al, 2015;Lee et al, 2015) has
revealed the complexity of tuberculosis outbreaks. Given that tuberculosis is not an acute
disease and that a tuberculosis infection can remain latent and asymptomatic for years, the true extent of
tuberculosis outbreaks can only be revealed by sustained genotyping efforts over years.
Furthermore, as in the case of the Bern outbreak, whole genome sequence data can be used to
design new diagnostics and/or surveillance tools. A similar approach has been used to
prospectively identify new outbreak-associated cases in sputum samples (Pérez-Lago et al,
Apart from specific outbreaks, genomic epidemiology has been applied on a population-wide scale
to evaluate its utility for surveillance and diagnostics. In a series of publications starting in 2012,
Public Health England has applied next generation sequencing with the aim of incorporating whole genome
sequencing as the default typing method for Mycobacterium tuberculosis in the United Kingdom
(Walker and Beatson, 2012;Walker et al, 2014). They have shown that genome data allow
outbreaks to be delineated better than MIRU-VNTR analyses. Furthermore, in an attempt to derive
a rule of thumb to identify a transmission event between two cases, they also sequenced several
isolates from the same patient and from known household contacts. They identified a
threshold of five SNPs for cases with a confirmed epidemiological link and proposed
a threshold of up to 12 SNPs for casual transmission in the community (Walker and Beatson,
2012). Other studies have found a similar distribution of SNPs when analyzing transmission
events in populations (Bryant et al, 2013a;Casali et al, 2014).
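The rule of thumb above can be illustrated as a simple classifier over pairwise SNP distances. The following is a minimal sketch, assuming that each isolate is already represented as an aligned string of base calls at variable genome positions. Only the thresholds of 5 and 12 SNPs come from the text; the function names, category labels and the handling of ambiguous calls are hypothetical choices made for illustration.

```python
from itertools import combinations

# Thresholds quoted in the text; all other names here are illustrative assumptions.
LINKED_MAX_SNPS = 5       # cases with a confirmed epidemiological link
COMMUNITY_MAX_SNPS = 12   # casual transmission in the community


def snp_distance(seq_a, seq_b):
    """Count differing positions between two aligned sequences.

    Positions where either isolate has an ambiguous call ('N') or a gap ('-')
    are ignored (an assumption of this sketch, not a rule from the text).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in "N-" and b not in "N-"
    )


def classify_pair(distance):
    """Map a pairwise SNP distance onto the categories suggested by the thresholds."""
    if distance <= LINKED_MAX_SNPS:
        return "consistent with direct transmission"
    if distance <= COMMUNITY_MAX_SNPS:
        return "possible community transmission"
    return "transmission unlikely"


def classify_isolates(alignment):
    """alignment: dict of isolate name -> aligned sequence.

    Returns a dict mapping each pair of names to (distance, category).
    """
    results = {}
    for name_a, name_b in combinations(sorted(alignment), 2):
        d = snp_distance(alignment[name_a], alignment[name_b])
        results[(name_a, name_b)] = (d, classify_pair(d))
    return results
```

A call such as `classify_isolates({"case1": "ACGT", "case2": "ACGA"})` returns one entry per pair of isolates. In practice such labels would be only a starting point, to be interpreted together with the epidemiological information.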
However, we still do not know how these thresholds apply to clinical settings other than
the low-burden countries of Europe. In high-burden countries, delineating transmission clusters
should be more difficult if public health interventions cannot stop transmission events (Yates et
al, 2016). Thus, the circulating strains may be part of several clusters at the same time.
The only population-based study published in a high-burden country shows that the threshold
described in (Pérez-Lago et al, 2015) may be useful, although more work will be needed to
generalize the results to, for example, large urban areas.
There are several factors that may distort the proposed threshold values. One of these factors
is mixed infections. The true extent of co-infections in high-burden countries is not clear and
there is hope that whole genome data can distinguish between relapses and re-infections
(Bryant et al, 2013a;Guerra-Assunção et al, 2015). This issue is critical for delineating transmission
in high-burden countries but also for clinical trials, because relapse is one of the
end points of those investigations. However, it is the diversity that can arise during infection
with a single strain that is attracting the most research and attention. From clinical drug susceptibility
data, it has been clear for decades that several populations may co-exist in the same
patient. These sub-populations were flagged because of inconsistent results in drug
susceptibility tests between isolates from the same patient (Rinder et al, 2001). Whole genome
sequencing has confirmed that this is the case and that what is recovered from a sputum sample
is often a mix of different sub-populations (Sun et al, 2012). These sub-populations can be
revealed by looking at positions in which a mutant and a wild-type allele can be identified at the
same time. In the context of drug resistance, it has been shown that several drug-resistant sub-populations may co-exist and compete and that their frequencies may change over time (Liu et
al, 2015). A similar phenomenon has been shown outside the context of drug resistance. The
issue of within-patient diversity has implications beyond the clinic and diagnostics. If several sub-populations co-exist and accumulate different numbers of SNPs, then the
epidemiological investigation of outbreaks may be distorted by the isolate chosen for the
analysis (Walker et al, 2013a;Walker et al, 2013b). An analysis of cases in which higher than
expected diversity was observed confirmed that, although the thresholds proposed to delineate
a transmission event are generally valid, there are epidemiologically linked cases in which a larger than
expected number of SNPs can be found (Pérez-Lago et al, 2014). How frequent those
"outliers" are is a matter of on-going investigation.
High throughput investigation of Legionella pneumophila outbreaks
High throughput sequencing can also be used to study organisms that, unlike Mycobacterium tuberculosis, show higher levels of
polymorphism and are strictly environmental. This is the
case of L. pneumophila, the causative agent of legionellosis, for which there is only one report
of possible person-to-person transmission (Correia et al, 2016) to date. This opportunistic
pathogen can cause pneumonia after inhalation of aerosols carrying a sufficient bacterial load, with
the highest burden in warm water-related environments. The first reported outbreak dates from
1976, when more than a hundred legionnaires were infected at a convention in Philadelphia
(Fraser et al, 1977). A legionellosis outbreak is defined as a cluster of more than three cases
occurring at the same place and time, and the epidemiological investigation is crucial to find the
environmental sources.
The investigation of legionellosis outbreaks has traditionally been conducted using
biochemical or molecular methods that allow clinical isolates to be compared with the strains
obtained from the environment (Fields et al, 2002). Broad techniques such as serogrouping
were complemented by genetic methods that provided improved resolution, notably the so-called Sequence-Based Typing (SBT) (Gaia et al, 2003;Gaia et al, 2005), based on the Multi-Locus Sequence Typing
(MLST) approach (Urwin and Maiden, 2003) but incorporating virulence genes in the scheme to
increase the discrimination power among strains.
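In MLST-like schemes such as SBT, each locus receives an allele number and the combination of numbers across loci defines the sequence type. The sketch below illustrates that lookup under stated assumptions: the locus list is the seven-gene set commonly associated with the L. pneumophila scheme (treat it as an assumption here), while the profile table, the ST labels and the function names are entirely hypothetical and do not correspond to real database entries.

```python
# Loci assumed for illustration; check the scheme definition in use before relying on them.
LOCI = ("flaA", "pilE", "asd", "mip", "mompS", "proA", "neuA")

# Hypothetical profile table: allele numbers per locus -> made-up ST label.
PROFILES = {
    (1, 2, 3, 1, 1, 1, 1): "ST-A",
    (1, 2, 3, 1, 5, 1, 1): "ST-B",
}


def assign_st(alleles, profiles=PROFILES):
    """alleles: dict of locus name -> allele number for one isolate.

    Returns the matching label, or 'novel profile' if the combination is unknown.
    """
    try:
        profile = tuple(alleles[locus] for locus in LOCI)
    except KeyError as missing:
        raise ValueError(f"missing allele call for locus {missing}") from None
    return profiles.get(profile, "novel profile")


def loci_differing(alleles_a, alleles_b):
    """List the loci at which two isolates carry different allele numbers."""
    return [locus for locus in LOCI if alleles_a[locus] != alleles_b[locus]]
```

For example, a clinical and an environmental isolate can be typed with `assign_st` and then compared with `loci_differing`. A shared label only says that the isolates agree at the typed loci, which is why the scheme has limited resolution compared with whole genome data.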
However, although SBT provided researchers with a tool that allowed the classification of strains
into groups (Sequence Types, STs), the introduction of high-throughput sequencing techniques
for microbial analysis and outbreak investigations in other species led to their application to
legionellosis outbreaks because of their greater discrimination power. The first published work
was a pilot study to test the potential of whole-genome sequencing (WGS) to
discriminate between isolates from an outbreak that occurred in the UK in 2003 and non-outbreak
strains (Reuter et al, 2013b). Since then, a number of other outbreaks have been
analyzed using WGS, for example an outbreak of ST62 associated with a cooling tower in Quebec
City in 2012 (Lévesque et al, 2014) and a massive outbreak that occurred in Edinburgh (UK, 2012)
involving multiple STs and including mixed infections (McAdam et al, 2014). WGS has also been
used to investigate the persistent infection history of ST23 in a hotel in Spain in 2012 (Sánchez-Busó et al, 2016) and the eradication of L. pneumophila associated with a hospital in Australia that
had been responsible for nosocomial cases (Bartley et al, 2016).
The environmental source of legionellosis cases has historically been difficult to trace and,
because of the high social and economic impact of this kind of outbreak on the affected
populations, public health interventions need to be rapid and accurate. WGS has revealed
further variability within many STs (Underwood et al, 2013a;Sánchez-Busó et al, 2014),
evidence that at least some of them are not clonal. This observation complicates the study of
legionellosis outbreaks and was the main motivation for the study by Sánchez-Busó et al. (2014). In
this work, 69 isolates, including strains associated with 13 different outbreaks and sporadic cases
that occurred in a single locality (Alcoy, Spain) over more than 10 years (1999-2010), were analyzed
by high throughput sequencing. Different STs were included, with special interest in ST578,
which had been recurrently reported as the ST causing most of those outbreaks
(Coscollá et al, 2010).
The analysis showed two main lineages within the endemic ST578, more than 1,000 SNPs apart
from each other. Not all the strains from the same outbreak clustered together, revealing the
non-clonality of the isolates, as these were phylogenetically grouped independently of their
source (clinical or environmental), sampling date or outbreak. Because ST578 is known to be
endemic in the area of Alcoy, these results suggest that it is very difficult to find an
infectious source in endemic areas using molecular data alone. Molecular data should be used together
with the epidemiological investigation in order to draw the accurate conclusions that public
health interventions require.
Another interesting finding of this work is that genomic data can reflect public health
actions over time. For example, an estimate of the ST578
population dynamics obtained by Bayesian inference revealed a decrease in population size between 2006 and 2008, which
coincided with a period in which public health measures were taken in the city by removing
high-risk installations from the city center.
For organisms in which person-to-person transmission is very rare or even nonexistent,
whole genome sequencing can provide the most discriminating tool to link clinical cases with
environmental sources, providing the accuracy that public health interventions require in these
cases. Moreover, it can help us understand how outbreaks occur, which is the starting point for
predicting and even preventing their occurrence.
Complete genome analysis of bacterial pathogens is still far from being the usual method for
analyzing outbreaks and transmission networks, although it will not take long before it becomes so.
The increasing speed, ease and reliability, as well as the reduced costs, of new high-throughput sequencing technologies point in that direction. But gaining information is only
part of the process. More data also mean an increased need for interpretative tools at all levels,
from the mere analysis of reads to the inference of the evolutionary and genealogical
relationships among the isolates. Progress is still pending at all levels: in the technology to
obtain, quickly and cheaply, complete genome sequence data of a specific pathogen from an infected
individual or a potential vector or source; in analytical tools capable of extracting the relevant
information from the deluge of data generated by high-throughput sequencers; and in the
integration of this information with the clinical, epidemiological and evolutionary information
needed to interpret it in the appropriate context.
We thank Dr. Pierre Pontarotti for his kind invitation to write this chapter. This work has been
funded by project BFU2014-58656-R from MINECO (Spanish Government) to FGC. IC is
supported by Ramón y Cajal Spanish research grant RYC-2012-10627, MINECO research grant
SAF2013-43521-R, and the European Research Council (ERC) (638553-TB-ACCELERATE). BB has
been the recipient of a Beca de Colaboración from the Spanish Ministerio de Educación y Cultura.
