The Next Generation of Genomic Research

Genome Sequencer GS-FLX

Patrick Ng, Ph.D.
Scientific Liaison Manager, Asia-Pacific
Roche Diagnostics
patrick.ng@roche.com
May 2007 June 2007
Roche Genomics Solutions
Real-time qPCR
LightCycler 2.0; LC480
• HRM for SNP validation
• qPCR for quantitative validation
Quick genome scans
Whole genome tiling microarrays
• aCGH
• ChIP-chip
• Epigenetics
• Gene expression
• Targeted sequence capture (SeqCap)
High-throughput sequencing
Genome Sequencer GS-FLX
• Ultra-Broad sequencing
• Ultra-Deep sequencing
• Whole genome de novo sequencing
• Whole genome re-sequencing
Sample Prep
Roche reagents
• RNA/DNA isolation
• RNA/DNA purification
• DNA amplification
• DNA labeling
[Figure: GS-FLX instrument, showing the optics subsystem, fluidics subsystem, computer subsystem, and keyboard and mouse (in drawer)]
Genome Sequencing – 1st Generation
Sanger dideoxy sequencing on an ABI capillary sequencer
• Sample prep
– Bacterial cloning of DNA, colony picking,
culturing, plasmid DNA extraction
– Typical time needed: weeks/ months
• DNA Sequencing
– State of the art capillary sequencer
enables maximum ~1-2 Mb/ 24hr
– Typical sequencing time needed for a
whole genome project: months/ years
• Personnel requirements
– Production sequencing facility required;
manpower needed for sample prep and
sequencing, typically 5-10 full-time
staff
Genome Sequencing – Next (2nd) Generation
Massively parallel pyrosequencing (454-sequencing™) on the GS-FLX
• Sample prep
– No bacterial cloning; all cloning is done in vitro
– time needed: ~15 hrs including emulsion PCR
amplification
• DNA Sequencing
– GS-FLX output/run ~100 Mb in 7.5hr
– Daily throughput ~300 Mb/ 24hrs theoretical
– Typical sequencing time needed for a
whole genome project: much shortened
• Personnel requirements
– Manpower needed for sample prep and
sequencing, typically 1-2 full-time staff
Sequencing of Corynebacterium kroppenstedtii strain
From sequencing to manuscript submission = 1 week
Tauch, A. et al. “Ultrafast pyrosequencing of Corynebacterium kroppenstedtii DSM44385 revealed
insights into the physiology of a lipophilic corynebacterium that lacks mycolic acids.”
Journal of Biotechnology, available online 20 March 2008
Number of GS-FLX runs 1

Number of reads 560,248
Mean read length 196 bp
Total no. of bases 110,018,974
Coverage depth 45.8 x
Number of assembled contigs 6
Assembled bases 2,434,342
Mean G+C content 57.5%
Coding sequences (pred.) 2,119
Average gene length 1,016 bp
Average intergenic region 163 bp
7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.
Dr. Andreas Tauch, University of Bielefeld:
7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.
Sequencing Time:
C. glutamicum (2000)       Clone-by-clone, Sanger    2 years
C. jeikeium (2003)         Whole genome shotgun      3 months
C. urealyticum (2006)      GS-20 shotgun             4 days
C. kroppenstedtii (2007)   GS-FLX shotgun            7.5 hrs
(XLR_HD):
Expect >10 C. kroppenstedtii-like genomes to be
sequenced in 1 run
(~500Mb / (2.4Mb*20x oversampling) = 10.4)
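The estimate above is plain coverage arithmetic. A minimal sketch in Python, using only figures quoted on these slides; the function names are illustrative and not part of any Roche software:

```python
def coverage_depth(total_bases: float, genome_size: float) -> float:
    """Fold coverage (oversampling) = sequenced bases / genome size."""
    return total_bases / genome_size


def genomes_per_run(run_output: float, genome_size: float, oversampling: float) -> float:
    """Number of genomes of a given size that fit in one run at a target oversampling."""
    return run_output / (genome_size * oversampling)


# C. kroppenstedtii run: 110,018,974 bases over a ~2.4 Mb genome
print(round(coverage_depth(110_018_974, 2.4e6), 1))   # 45.8
# Projected ~500 Mb run, 2.4 Mb genomes, 20x oversampling
print(round(genomes_per_run(500e6, 2.4e6, 20), 1))    # 10.4
```

Dividing by the 2,434,342 assembled bases instead of a nominal 2.4 Mb gives about 45.2x, so the quoted 45.8x appears to use the rounded genome size.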
Cloning Bias in Conventional ABI Sequencing
(500 kb stretch in Listeria monocytogenes)
[Figure: GS-FLX coverage and ABI coverage plotted against reference position (bp)]
Courtesy of Drs. Nusbaum and Young of the Broad Institute
GS-FLX Sequencing Workflow
Overview
Sample input: Genomic DNA, BACs, amplicons, cDNA
• Generation of small DNA fragments via nebulization (starting DNA to fragments)
• Ligation of A/B-Adaptors flanking single-stranded DNA fragments
• Emulsification of beads and fragments in water-in-oil microreactors
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling
One Fragment = One Bead = One Read
400,000+ reads per run
GS-FLX Process Steps
1. Shotgun DNA library preparation
Timeline: DNA Library Preparation and Titration (4.5 h and 10.5 h); emPCR (8 h); Sequencing (7.5 h)
a. Genomic DNA fragmented by nebulization
b. Adaptors A and Biot-B ligated to fragments
c. Immobilize repaired, adapted DNA to paramagnetic Streptavidin beads
d. Select for only A-fragment-B and B-fragment-A sstDNA molecules in supernatant (not Biot)
e. Functional validation of sstDNA by titration run (do emPCR and GS-FLX sequencing run to determine
best number of sstDNA molecules per Capture Bead)
gDNA -> sstDNA library

Process Steps
2. emPCR
• Anneal sstDNA to an excess of DNA capture beads (choice of no. of average molecules/bead is based on
titration results)*. Capture beads are non-paramagnetic.
• Emulsify beads and PCR reagents in water-in-oil microreactors (using a TissueLyser). Most of the
microreactors that contain DNA will contain only 1 DNA molecule and 1 bead.
• Clonal amplification occurs inside microreactors. Typically, 40 PCR cycles are performed.
• Break microreactors, and enrich for DNA-positive beads (using magnetic streptavidin beads that bind to
the biotinylated emPCR products). Convert to bead-bound sstDNA.
sstDNA library -> Clonally-amplified sstDNA attached to capture bead
*Titration is required to avoid excessive “empty” or else “multi-template” beads in the emPCR
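Why titration matters can be illustrated with a simple loading model. The sketch below assumes that template molecules land on beads independently, so the number per bead follows a Poisson distribution; that model is an assumption made here for illustration and is not stated on the slide.

```python
import math


def bead_loading(mean_molecules_per_bead: float) -> dict:
    """Poisson fractions of beads carrying 0, exactly 1, or more than 1 template."""
    lam = mean_molecules_per_bead
    empty = math.exp(-lam)
    single = lam * math.exp(-lam)
    return {"empty": empty, "single": single, "multi": 1.0 - empty - single}


# Too few molecules per bead wastes beads; too many gives mixed (multi-template) beads.
for lam in (0.1, 0.5, 1.0, 2.0):
    f = bead_loading(lam)
    print(f"{lam:>4} molecules/bead: empty={f['empty']:.2f} "
          f"single={f['single']:.2f} multi={f['multi']:.2f}")
```

Under this model the titration run is a search for the ratio that keeps the multi-template fraction low without leaving most beads empty.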
Process Steps
3. Sequencing
PTP well diameter: average of 44 µm
Capture bead diameter: 27-32 µm
Enzyme bead diameter: 2.8 µm
Packing bead diameter: 0.8 µm
Wells per PTP: 1.6 Million
A single, clonally amplified sstDNA bead (after enrichment) is deposited per well.
Load PicoTiterPlate (PTP) into sequencer, begin run
Amplified sstDNA library beads -> Quality reads

Process Steps
Sequencing
Pyrosequencing details (Sequencing-by-synthesis)
• DNA capture bead containing ~10-30 million copies of a single clonal fragment (sstDNA templates)
• 4 unlabeled nt’s (TACG) are added sequentially (flowed), 1 nt at a time. Cycled 100 times for large PTP run
• Chemiluminescent signal generation (based on Pyrosequencing™)
• Pyrophosphate released upon nt incorporation is converted to ATP (using adenosine 5´ phosphosulfate),
which drives the luciferase reaction and light output
• Light signal captured on CCD camera
• Signal processing to determine base sequence and quality score
Amplified sstDNA library beads -> Quality reads
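The flow-based readout can be sketched as a toy model. The code below assumes an ideal, noise-free signal equal to the number of bases incorporated in each flow, and works on the synthesized strand rather than the template; the function names are illustrative and this is not the instrument's signal-processing software.

```python
FLOW_ORDER = "TACG"  # nucleotide flow order quoted on the slide


def flowgram(read: str, cycles: int = 100) -> list:
    """Ideal signal per flow: the homopolymer length incorporated in that flow."""
    signals, pos = [], 0
    for _ in range(cycles):
        for nt in FLOW_ORDER:
            run = 0
            while pos < len(read) and read[pos] == nt:
                run += 1
                pos += 1
            signals.append(run)
            if pos == len(read):
                return signals
    return signals


def basecall(signals: list) -> str:
    """Convert per-flow signals back to bases by rounding to the nearest integer."""
    return "".join(FLOW_ORDER[i % 4] * round(s) for i, s in enumerate(signals))


print(flowgram("TTACG"))                # [2, 1, 1, 1]
print(basecall([2.1, 0.9, 1.0, 1.2]))   # TTACG
```

Because a homopolymer is read as one signal whose height must be rounded, errors in this chemistry tend to be insertions or deletions within homopolymers rather than substitutions.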

Software
Image acquisition -> Image Processing -> Signal processing (FASTA basecalls + Quality scores)
-> Applications software
[Figures: metric and image viewing software; signal output from a single well (flowgram)]
On current GS-FLX system, raw image -> FASTA basecalls = 8-9 hours
GS FLX Sequencing
Bioinformatics
Pipeline: Image capture -> Image processing -> Signal processing
Applications:
• Reference Mapper (assembly using ref. seq.)
• De novo Assembler (assembly from scratch)
• Amplicon Variant Analyzer
Features:
• Small dataset size = greater convenience. 13.2 GB including raw images, allows future re-analysis if desired
(e.g. post software upgrade)
• Useful software (currently 3 applications) available out-of-the-box
• Long (250bp) and accurate (99.5% single-read) reads = No filtering of reads against known ref needed
• Software recognizes molecular barcodes (MIDs) for greater multiplexing (and economy); also able to
assemble using various types: 454-shotgun, Sanger and paired-end reads (singly/ in combo) for best results
The Genome Sequencer FLX System
Technical Specifications
Current system                                        After “Titanium” kit upgrade, ~Q3, 2008
≥ 400,000 sequence reads per run                      ≥ 1,000,000 sequence reads per run
200 - 300 bases per read                              400 - 500 bases per read
1 run = 7.5 hours = ~100 Mb                           1 run = 10 hours = ~500 Mb
2-3 runs per 24hr day                                 2 runs per 24hr day
Theoretical 1 Gb in 3-4 days                          Theoretical 1 Gb in 1 day
Accuracy: ~99.5% over 200 bases                       Accuracy: ~99.0% over 400 bases
Image & signal processing: 8-9 hrs post-sequencing    Image & signal processing: 12-20 hrs post-sequencing (upgraded Unix cluster)
Titanium upgrade: no change in sequencer; new, metallized PTP
[Figure: current image vs improved image]
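The "theoretical 1 Gb" rows follow from run output and runs per day. A short sketch with the slide's figures (function names are illustrative):

```python
def daily_throughput_mb(mb_per_run: float, runs_per_day: int) -> float:
    """Theoretical output per 24 h, ignoring set-up and data-processing time."""
    return mb_per_run * runs_per_day


def days_to_reach(target_mb: float, mb_per_run: float, runs_per_day: int) -> float:
    """Days of back-to-back runs needed to reach a target output."""
    return target_mb / daily_throughput_mb(mb_per_run, runs_per_day)


print(round(days_to_reach(1000, 100, 3), 1))  # current system: 3.3 days
print(days_to_reach(1000, 500, 2))            # after Titanium: 1.0 day
```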

Data Analysis Server for XLR HD
For an out-of-the-box solution we have qualified a provider who will supply an
integrated system. This purchase is made through Roche and is delivered as a
one-box solution.
Support concept in place, deck available
Server specifications are available for a do-it-yourself option.
GS-FLX or not?
You need a GS-FLX if:
•You require large scale, very high throughput DNA sequencing
•You intend to study any of the following: de novo sequencing & assembly (whole/partial genome);
whole genome resequencing; targeted resequencing/ amplicon resequencing; metagenomics;
transcriptomics.
You do not need a GS-FLX if:

•You only intend to sequence small numbers of samples each time (tens of thousands of bases)
•Your applications only involve simple clone sequence verifications.
There are other Next-Generation Sequencers available that seem to be cheaper; they promise
to do many things. Why don’t I buy those instead?
• Keep in mind the following: other NGS platforms give many short, lower-quality reads. These are
suitable only for specific applications (mapping tags to a reference, and counting them). Their short
reads result in poor mapping specificity and short contigs (more gaps). Few or no publications.
Side-by-Side Comparison
Sanger dideoxy sequencing vs 454-sequencing ™
                                          Sanger dideoxy              GS-FLX
Read-length (bases)                       750-1,000                   ~200
Throughput (bases per 24 hrs)             1-2 million                 ~300 million (3 runs)
Raw cost-per-base (US cents)              ~0.2¢                       ~0.008¢
Accuracy (Single-read/ 20x consensus)     99.3-99.6% / >99.99%        ~99.5% / >99.995%
Manpower (bact. genome project)           5-10 FTE                    1-2 FTE
Sample prep time (bact. genome project)   Months                      Hours
Sequencing time (bact. genome project)    Months – Years              Hours – Days
Sample characteristics                    In vivo clones: biased      In vitro: unbiased
                                          Errors: homopolymer         Errors: homopolymer indels
                                          slippage; GC-rich hardstops
                                          Direct PCR: averaged        Amplicon sequencing: each molecule
                                          signals                     individually sequenced
AT- or GC-rich genomes not a problem for GS-FLX
Depending on the organism, read lengths are in the range of 200 – 300 high quality bases.
Genomes that are more AT- or GC-rich typically yield a longer read length
distribution as compared to an AT/GC neutral genome.
[Figure: read length distributions]
Long reads matter
Short reads do not provide genome mapping uniqueness
[Figure panels: whole human genome; human chr 1 only]
(Whiteford, N. et al. Nucl. Acids Res. 2005)
Modified from Figure 2. Uniqueness as a function of read length; human genomic DNA.
25 to 35-bp reads: ~ 80 - 87% uniqueness
100-bp reads: > 90% uniqueness
High genome-mapping uniqueness is important for genome annotation and
transcriptome profiling experiments
Transcriptomics
Transcriptome profiling of Plants

Short (singleton) reads do not provide sufficient transcriptome
mapping uniqueness
Read length 25 36 72 250
Arabidopsis thaliana 49% 55% 65% 83%
Brassica rapus 66% 71% 80% 91%
Glycine max 51% 56% 64% 82%
Lotus japonicus 75% 79% 86% 95%
Medicago truncatula 67% 69% 74% 86%
Oryza sativa 48% 53% 61% 79%
Populus trichocarpa 71% 75% 81% 91%
Solanum tuberosum 51% 56% 65% 82%
Zea mays 46% 53% 62% 78%
Table shows uniqueness of transcriptomic singletons. Data are extracted from a 2008
study modeling all possible reads in both directions of specified lengths across 20
different plant species. Reference transcript assemblies were from http://plantta.tigr.org/
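The kind of modeling described above can be sketched on toy data. The code below enumerates every possible read of a given length on both strands and reports the fraction whose sequence occurs exactly once; it illustrates the idea only and is not the method or code of the cited study, and the example sequence is made up.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def unique_fraction(transcripts: list, read_length: int) -> float:
    """Fraction of all possible reads (both strands) whose sequence occurs exactly once."""
    counts = Counter()
    for t in transcripts:
        for strand in (t, reverse_complement(t)):
            for i in range(len(strand) - read_length + 1):
                counts[strand[i:i + read_length]] += 1
    total = sum(counts.values())
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / total if total else 0.0


toy = ["GATTACAGATTACACC"]  # contains a repeated GATTACA motif
for k in (3, 8):
    print(k, unique_fraction(toy, k))
```

On the toy sequence, 3-base reads are mostly ambiguous while 8-base reads are all unique, the same trend the table shows for 25 to 250 bases on real transcriptomes.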
Long reads matter
Short reads result in short contigs and poor coverage
[Figure: percentage of chr 1 covered by contigs of at least 10,000 bp, as a function of read length]
Note: Average gene size in humans ~10-15 kb
(Whiteford, N. et al. Nucl. Acids Res. 2005)
Modified from Figure 2. Genome coverage (at specific contig sizes), as a fn. of read length
25 to 35-bp reads: Only 7% can form contigs of 10,000bp and larger (gaps!!!)
100-bp reads: 90% can form contigs of 10,000bp and larger
Read length is important in genome assembly
Why long reads are needed for genome assembly
1. Genomes (especially complex ones) contain large numbers of repeats

- Repeats can be a few bases, to thousands of bases long
2. It is very difficult to completely assemble a genome, if the read length is shorter
than the repeats (see next slide)
- The reads need to bridge the repetitive DNA regions
3. Paired-end sequences can help assembly, but the end-tags still need to map
specifically to the contigs
- So the end-tags themselves also need to be longer
4. In summary, short reads = short contigs, more gaps and poor assemblies
5. GS-FLX gives longer reads, fewer gaps, good genome assemblies
* NOTE: The same situation exists even for resequencing: the mapping uniqueness of short
singleton reads to a reference genome is greatly improved by increased read length. Also,
the structures of splice variants are much clearer when long reads are used.
Long reads matter for de novo assembly
Short reads cannot bridge repetitive regions; a gap remains
[Figure: a GS-FLX long read spans the repetitive region below; short reads that fall inside the repeat cannot be placed (?)]
CGTAGGCTAGATGCATGCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATATAGCGATCTCGACATGCT
Unique DNA sequence | Repetitive DNA sequence | Unique DNA sequence
If the read does not span the repeats, no amount of increased sequencing
coverage (depth) will allow either de novo genome assembly, or high-quality
resequencing (there will be gaps)
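The bridging condition can be written down directly. In the sketch below the 10-base unique anchor required on each side is a hypothetical value chosen for illustration, and the helper names are not from any assembler.

```python
import re


def longest_tandem_repeat(seq: str, unit: str) -> int:
    """Length in bases of the longest uninterrupted run of `unit` in `seq`."""
    runs = re.finditer(f"(?:{re.escape(unit)})+", seq)
    return max((m.end() - m.start() for m in runs), default=0)


def can_bridge(read_length: int, repeat_length: int, anchor: int = 10) -> bool:
    """A read bridges a repeat if it covers it plus `anchor` unique bases on each side."""
    return read_length >= repeat_length + 2 * anchor


repeat = longest_tandem_repeat("TTAGAGAGCC", "AG")
print(repeat)                   # 6
print(can_bridge(250, 30))      # True
print(can_bridge(36, 30))       # False
```

More reads of the same short length do not change the outcome of `can_bridge`, which is the point the slide makes about coverage depth.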
GS-FLX sequencing accuracy
[Figure: cumulative read error (0.0% to 4.0%) versus base position (0 to 250). Series: E. coli run #1 (09_29A), run #2 (09_29B), run #3 (09_14+09_18A), run #4 (09_18B+09_25), T. thermophilus, and C. jejuni. Curves are labelled “Reported in Nature, 2005”, “GS20, Q2, 2006”, and “Currently (GS-FLX)”.]
The very high GS-FLX single-read accuracy avoids the need for “quality-filtering” against a
reference sequence (used by other sequencing platforms).
• GS-FLX Single-Read accuracy > 99.5% (includes all homopolymer errors)

– Sanger Single-Read Accuracy = 99.3% to 99.6%
• GS-FLX (22x) Consensus-Read Accuracy > 99.995%
Cumulative Read Error by Length
Comparing short-read system to Roche GS-FLX
[Figure: cumulative read error by read length; short-read sequencer (Competitor I) data* compared with GS-FLX]
• At 36 bases, the short-read system gives ~2 bases wrong (5% error)
• Up to 200 bases, GS-FLX gives < 1 base wrong (0.36%)
Based on data downloaded from Sanger Institute website
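The "bases wrong" figures are read length multiplied by cumulative error rate. A one-function sketch using the slide's numbers:

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in one read."""
    return read_length * error_rate


print(round(expected_errors(36, 0.05), 2))     # 1.8, i.e. ~2 bases
print(round(expected_errors(200, 0.0036), 2))  # 0.72, i.e. < 1 base
```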

System Performance: Homopolymers
Individual Reads versus Consensus Reads in E.coli
[Figure: accuracy (0% to 100%) versus homopolymer length (1 to 9) for single reads and consensus; Genome Sequencer FLX data]
1. Single-read accuracy here in parsing homopolymers > 90% up to n = 5

2. Consensus accuracy provided by 22x oversampling is ~100% even at n = 9
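Why oversampling rescues accuracy can be seen with a simple voting model. The sketch assumes independent errors between reads and a strict-majority consensus; both are simplifying assumptions made here, not a description of the actual consensus caller.

```python
from math import comb


def consensus_accuracy(per_read_accuracy: float, depth: int) -> float:
    """P(strict majority of `depth` independent reads is correct), binomial model."""
    p = per_read_accuracy
    need = depth // 2 + 1  # strict majority
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(need, depth + 1))


for acc in (0.6, 0.8, 0.9):
    print(acc, round(consensus_accuracy(acc, 22), 6))
```

Even a modest per-read accuracy at a long homopolymer is pushed close to 100% at 22x under this model, which is consistent with the trend in the figure.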
Distribution of homopolymers in the human genome
[Figure: homopolymer frequency as percent of genome (log scale, 0.00001% to 100%) for each nucleotide (A, G, C, T), with series for 4-mer through 16-mer homopolymers]
The incidence of homopolymeric stretches long enough to cause problems for
consensus GS-FLX sequencing (> 8-mer) in the human genome is low, < 0.05%
What can the GS-FLX do for you?
• May 2008: > 150 publications related to 454-sequencing

• You can request updated bibliographies (or access from www.454.com)
GS-FLX applications
UDS: Ultra Deep Amplicon resequencing of specific regions
• Medical resequencing (targeted)
- Oncogenomics
• Microbial diversity (16S rRNA Metagenomics)
• Virus quasispecies studies
• Epigenetics and Gene Regulation
- DNA methylation
UBS: Ultra Broad Sequencing
• Metagenomics (whole genome)
• Transcriptomics
• Small ncRNA identification
• Novel pathogen discovery (Medical metagenomics)
• Ancient DNA studies
WGS: Whole Genome Sequencing
• de novo sequencing and assembly (with or without paired-end data)
• Whole genome re-sequencing and assembly to identify variations (comparative genomics)
• Whole genome Paired-End Mapping to identify structural variations
Amplicon sequencing (Ultradeep)
• Medical resequencing for variant detection
• Virus genotyping and drug-resistance
• DNA methylation characterization
Ultradeep sequencing
How does Amplicon Sequencing detect rare variants?
PCR to obtain target locus
Mutation present at low frequency in heterogeneous sample
GS-FLX sequencing: every species is sequenced individually, allowing every mutation to be quantified
Direct PCR sequencing (Sanger): result is averaged across all species present and may be undetectable
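The contrast between per-molecule reads and an averaged signal can be sketched on toy data. The 20% visibility limit used for the averaged trace is an illustrative assumption, not a figure from the slides.

```python
from collections import Counter


def allele_frequencies(read_bases: list) -> dict:
    """Per-allele frequency when every molecule is read individually."""
    counts = Counter(read_bases)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def averaged_signal_call(read_bases: list, detection_limit: float = 0.2) -> set:
    """Alleles visible in a population-averaged trace (detection_limit is hypothetical)."""
    return {b for b, f in allele_frequencies(read_bases).items() if f >= detection_limit}


reads = ["C"] * 97 + ["T"] * 3       # a 3% variant at one position
print(allele_frequencies(reads))     # {'C': 0.97, 'T': 0.03}
print(averaged_signal_call(reads))   # {'C'}
```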
Targeted resequencing (Medical resequencing)

Ultradeep amplicon resequencing to detect rare variations
Workflow (panels A to D):
• Isolate genomic DNA from each sample
• Generate long-range PCR amplicons across desired target region
(disease-associated region > 1000s bp; long-range PCR amplicons: 3-15 kb each)
• Pool amplicons (in equimolar amounts)
• Shotgun sequence on GS-FLX
Analysis is by Ref Mapper (AVA is for direct amplicon analysis)

Resequencing example: Long range PCR on EGFR
No gaps, analysis of the whole gene possible
Shotgun library created from 10 overlapping, long range PCR Amplicons

Mapped into a single contig of length 70,107 nt (amplified region: 70.1 kb)
Sample H441: 80 High Confidence variations detected (62 SNPs known from db)
Sample H1975: 95 High Confidence variations detected (73 SNPs known from db)
Short InDels in H441
Detectable – no quality filtering based on comparison
with reference sequence needed
Variant type               # of Variants   Sequence Change
Single Base Substitution   12              Various
2 Base Deletion            1               AT/--
2 Base Insertion           1               --/CA
3 Base Deletion            1               GCT/---
4 Base Deletion            2               ACAC/----, TGTG/----
4 Base Insertion           2               ----/CACA, ----/TTCA
All new variants have been confirmed by Sanger Sequencing;

Detecting variations > 2nt not possible using competing platform (Manufacturer I)
Whole Genome Re-Sequencing
Benefits of using GS-FLX
250bp reads effectively span many sequences that micro-reads interpret as
repeats, and minimize the number of gaps.
Bias-free sequencing (coverage does not vary due to GC/AT content).
Significantly higher genome coverage and much more efficient identification of
more variations.
InDels can be discovered efficiently/easily (no filtering against reference
sequence needed).
Targeted resequencing
Ultradeep sequencing for retrospective analysis of relapse in
NSCLC patient R.K. Thomas et al., Nature Medicine (July 2006) 12: 852-855
• Lung cancer patient had strong initial
response to Erlotinib (TKI) before relapsing
12.5 mo later with pleural effusions.
• Histological specimen post-relapse had
only 1-10% tumor content. Sanger
sequencing showed wt EGFR only.
• 454-sequencing (amplicon sequencing of
exons 18-22) showed 2 mutations:
(i) 18-bp deletion in exon 19 (Del4) (3% of
11,367 reads), and
(ii) C->T substitution resulting in T790M
mutation (2% of 136,776 reads).
• Validation: cells overexpressing EGFR with
Del4 were sensitive to Erlotinib, but
combined Del4/T790M mutation rendered
cells resistant.
Ultradeep sequencing/ haplotype analysis
For virus genotyping, long read lengths are of paramount importance in HAPLOTYPING
3 mutations are frequently observed in multidrug resistant HIV viruses:
M46I/L, V82A/F/S/T, and L90M.
The combination confers resistance to all protease inhibitors currently
in use.
If all three were found on a single major species before treatment, protease
inhibitors would have minimal effect.
However, if they were on distinct species, they could be suppressed by
different inhibitors.
A single GS-FLX read (250bp) covers all 3 mutations (135bp apart)
Provides valuable information on drug resistance and productive treatment
options etc…
Short reads (25-50bp) cannot distinguish between the various subtypes
See also: Wensing et al. (2005) JID 192:958; Shafer et al. (2006) JID 194 (Suppl 1):551; Bonaventura et al. “New
developments in HIV drug resistance and options for treatment-experienced patients”
Subspecies identification in HIV-1
Sequencing of 207 bp amplicon from virus protease gene
Individual sequencing of each template allows identification and quantification
of distinct virus subspecies within a mixed population, including “haplotypes”
[Figure: alignments of three virus subspecies resolved by GS-FLX (39%, 34%, and 21% of reads), shown with mutation frequency and coverage tracks, next to the Sanger direct-PCR result, which contains an unresolvable region (GAAATGGNTTTGCC)]
In collaboration with Dr. M. Kozal, Yale VA Hospital;
See also Wang et al. (2007) Genome Res, for a good look
at coverage needed in HIV-1 variant analysis
Transcriptomics
• Based on existing (some partial) reference genomes: microbes,
drosophila, human cell lines, maize, arabidopsis, medicago,
salmon, sheep.
• de novo transcriptome characterization: Paper wasp (ref
honeybee), Glanville Fritillary butterfly (ref Bombyx mori)
Transcriptomics
Transcriptome characterization
Overview
cDNA synthesis from RNA, then fragmentation and 454-sequencing

- options: oligo-dT, random priming, RNA fragmentation, cDNA normalization
Map the ESTs to reference genome or other suitable database
o Compare gene expression levels across samples by EST counting

o Perform genome annotation to improve existing gene models
Transcriptomics
Advantages of using GS-FLX
Long, accurate reads allow excellent mapping of ESTs to genome (Table)
- Torres et al., Genome Research (2008) Jan;18(1):172-7 (GS-20, Drosophila)
Transcriptomics
Advantages of using GS-FLX
Essentially unbiased representation regardless of transcript length (Figure)
or expression level
- Weber et al., Plant Physiology (2007) May; 144:32-42 (GS-20, Arabidopsis)
• ESTs cover every part of every cDNA. Medium length cDNAs are shown here, but
distribution pattern similar for Short and Long cDNAs also
• Some favoring of 5’ and 3’ ends, possibly indicating incomplete nebulization
• Total number of (+) direction ESTs (55-60%) is slightly > (-) direction ESTs. Why?
[Figure: EST distribution from 5’ to 3’ (position relative to 5’ end of cDNA)]
Here we see 154,379 GS-20 ESTs corresponding to 1,053 transcripts (flcDNAs) of 1,000-2,000nt in size (medium length).
GS-FLX sequencing of the paper wasp
transcriptome
Conventional cDNA library preparation from wasp was sequenced using 454-Sequencing.
391,157 brain cDNA reads generated
3,017 genes hit in honey bee genome
No wasp genome available
32 behavioral gene orthologs further characterized to demonstrate the link between
maternal behavior and the development of social behavior
Study also demonstrated the ability to use a known, related genome (Bee in this case)
as a hub to successfully generate assemblies
Transcriptomics
de novo whole transcriptome characterization using 454-sequencing (GS-20)
Glanville fritillary butterfly (Melitaea cinxia). Molecular Ecology (2008)
•No reference genome available; this is 1st report of a de novo
transcriptome assembly using NGS data.
•2 cDNA libraries, normalized (Evrogen); 2 GS20 runs done;
SeqMan Pro assembly of ESTs.
• 518,079 high-quality ESTs (88% of raw) obtained, assembled
into 48,354 contigs + 59,943 singletons, thus 108,297 unigenes.
• Microarray made using assembled transcripts: high reproducibility.
• Issues: Cf B.mori database, inferred that ideally, 4x more sequencing
would be needed for complete flcDNA coverage. Also, assembly of
splice variants was “complex”, and needed human annotation. We can
expect that with GS-FLX and XLR-HD, results will be vastly improved.
• Interestingly, 618 reads were non-metazoan (mainly cryptosporidiae),
hence 454-seq can possibly be used for xenobiont detection
Metagenomics (& Microbial Diversity)
Metagenomics on the GS-FLX
Benefits of long, accurate reads
• Metagenomics is the shotgun sequencing of
mixed DNA isolated from environmental samples.
For assessing bacterial diversity, the focus is
usually on 16S rRNA
• Read length of >200 bases allows:
– accurate assessment of diversity (low
rate of ambiguous mapping results)
– unambiguous identification of an organism or
gene in an unknown complex environmental
sample.
See: R Edwards et al. BMC Genomics, 7:57 (2006)

Metagenomics on the GS-FLX
Why long reads are needed
[Figure: isolation of environmental DNA (environment, e.g. deep sea; tens of thousands of different species) and shotgun sequencing]
FLX long reads: reads map uniquely
Microreads: reads map everywhere. You can´t use them
(Paired end reads might help, but even then,
higher specificity with longer end-tags)
Ultrabroad sequencing- Metagenomics
Elucidation of symbiotic interdependence between
insect host & 2 internal bacteria
• Tripartite symbiotic relationship between insect host
(H. coagulata, Glassy-Winged Sharpshooter) that
feeds on xylem sap, and 2 internal bacteria:
Baumannia sp. (provides vitamins and cofactors)
and Sulcia sp. (provides essential aa.’s).
• 23 Newbler contigs were assembled into a
complete circular Sulcia genome
• Illumina-Solexa 1G was used to resolve 155
homopolymeric uncertainties (Sulcia genome is
245,000 bp; homopolymer errors = 0.06%).
• As with combined Sanger/454-sequencing, perhaps
dual-platform experiments may be the most
sensible approach for highest-quality assemblies, for
genome centers that can afford them.
Novel pathogen discovery/ Metagenomics
Novel pathogen discovery

Medical Metagenomics
Palacios, G, et al.; N Engl J Med 2008:358

The same group that did the Honey Bee
Colony Collapse Disorder metagenomics work
Novel pathogen discovery/ Metagenomics
Novel pathogen discovery – Medical Metagenomics?
Palacios, G, et al.; N Engl J Med 2008:358
• 3 women in Australia received various transplanted organs (liver, kidney) from same male donor;
donor had died from cerebral hemorrhage.
• 4-6 wks post-op, recipients all died from “febrile illness with varying degrees of encephalopathy”.
• Tested negative: bacterial/ viral cultures; PCR for various viruses; microarray analysis using
panmicrobial and viral arrays
• Methods: GS-FLX sequencing performed on RNA extracted (various source tissues) from 2 deceased
patients; RNA was DNaseI-treated, then RT-PCR using random primers. No further nebulization done.
• Results: Sequences filtered bioinformatically to remove repetitive DNA; human (host) DNA subtracted;
non-human sequences were clustered with Cd-hit, then CAP3 assembled. BLASTX and BLASTN
against Genbank performed. Of 103,632 sequences (mean size: 162bp), 14 fragments had
homology to arenavirus, closest relationship to LCMV. Validation of novel LCMV done by RT-PCR:
22 of 30 samples (from 3 patients) positive. Other validations done, on Vero E6 cells, and patient
samples.
• Conclusions: GS-FLX sequencing has been used to identify a novel pathogen present against a
massive background of known host gDNA; “Medical Metagenomics”?
14 out of ~100K reads hit LCMV
Maximum contiguous
match to known LCMV:
Only 14 bp
Small noncoding RNA (sncRNA)
characterization
Ultrabroad seq- MicroRNA
Small non-coding RNA (sncRNA) analysis
• Transcripts expressed from the genome can be protein-coding, or non-coding
• Non-coding RNAs have important regulatory functions; small ncRNAs include:
– Small interfering RNAs (siRNAs) ~21-25nt
• NAT-siRNAs (antisense; 21-24nt)
– microRNAs (miRNAs) ~22nt
Recently, some larger sncRNAs identified:
•Piwi-interacting RNAs (piRNAs) ~29-31nt
•Small-scan RNAs (scnRNAs) 27-31nt
•Repeat assd siRNAs (rasiRNAs) 24-29nt
•Long siRNAs (lsiRNAs) 30-40nt
–small nucleolar RNAs (snoRNAs) ~60-300nt
–“short RNAs” (sRNAs) <200nt (mean 35nt)
•Promoter-associated PASRs
•3’-Termini associated TASRs
–“long RNAs” (lRNAs) >200nt (mean 100nt)
• GS-FLX’s long read length enables the complete sequencing of almost all small
RNA classes, in just one read
• Coupled with the high depth of coverage, GS-FLX is well-placed to not just
characterize existing sncRNAs, but also identify and quantify those that may be
very rare, or novel, with a high degree of confidence.
Whole genome sequencing
• de novo sequencing and assembly
- bacteria, fungi (including mushroom), viruses, barley (4 BACs as
proof of principle), pinot noir grape (Sanger plus 454)
• Genome resequencing
- bacteria, viruses, Pea, monkey, human
• Paired-end reads and their utility in de novo assembly
& structural variation detection
De novo sequencing and assembly
de novo Sequencing and Assembly

Introduction
A. Random (shotgun) fragments
B. Clustering (overlapping)
C. Consensus merging into CONTIG 1, CONTIG 2, CONTIG 3 (contigs are not ordered)
D. Contigs that are oriented w.r.t. their immediate neighbours are gradually ordered to form a scaffold.
E. Scaffold (supercontig) formation (ordering and orienting contigs, with paired end data).
Scaffolds can still contain gaps. Final step is “gap-filling” or “finishing”.
Comparative Genomics
Usefulness of Paired-End sequences (1)

1. Paired-end sequences are very useful for de novo assembly, and also for detecting variations.
2. Paired-end sequences can span repeats, to better orientate shotgun contigs.
3. The choice of what length paired-end span to use, depends on the specific genome being studied.
[Figure: 454 paired-end reads (Biot = biotinylated linker; shear and select) link Contig 1 and Contig 2 into Scaffold 1 (scaffold generation)]
Paired-End; de novo assembly
de novo genome assembly of bacteria

GS-20 shotgun reads, assisted by Paired-End reads (2kb span)
                         E. coli    B. licheniformis    S. cerevisiae
Genome size (Mb)         4.6        4.2                 12.2
Number of runs           3          3                   9
Fold oversampling        22         27                  23
Assembly Contigs         140        98                  821
PE library runs          1          1                   2
Number of paired reads   112000     255000              395000
Supercontigs             24         9                   153
Genome Coverage          98.60%     99.20%              93.20%
Roche in-house data

Roche-GIS collaboration
Results of de novo assembly
B.pseudomallei #22
Expected Genome size (Mb) 7
Number of chromosomes 2
Number of runs (GS-20) 6
Fold oversampling 22
Assembly Contigs (1221 - 79098 bp) 940
PE library runs 1
Supercontigs using 2kb PE library 50
using 5kb PE library 11
using 10kb PE library 4
Genome Coverage (w.r.t. Sanger ref) 93.04%
Choice of Paired-End library target size depends on the particular genome
Paired-End; de novo assembly
Improved Paired-End Protocol (16-20kb span)

Protocol to be released with Titanium upgrade New!
GS-FLX Read Type    Genome Coverage    Number of Contigs/Scaffolds
Shotgun             15×                98
+ 3 kb span         18×                7
+ 16-20 kb span     20×                1
[Figure: histogram of pair distance (bp, 0 to 30,000) versus count, with the peak labelled 16 kb]
Assembly of 4.6Mb E.coli genome into 1 scaffold
(Consensus accuracy ~99.999%)
Paired-End; Comparative Genomics
Usefulness of Paired-End sequencing (2)

1. Paired-end sequences are very useful for de novo assembly, and for detecting variations.
2. Paired-end sequences can span repeats, to better orientate shotgun contigs.
3. The choice of what length paired-end span to use, depends on the specific genome being studied.
[Figure: paired-end tags (Biot = biotinylated linker) from the sample mapped onto the reference]
Insertion (in sample): Mapped Span (sample to ref) < [mean - 3SD]
Deletion (in sample): Mapped Span > [mean + 3SD]
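The insertion/deletion rule on this slide (mapped span outside mean ± 3 SD of the library span) can be sketched as follows. The span values are made up for illustration, and the function is not part of any Roche or published pipeline.

```python
from statistics import mean, stdev


def classify_pairs(mapped_spans: list, library_spans: list) -> list:
    """Label each mapped paired-end span relative to mean +/- 3 SD of the library."""
    mu, sd = mean(library_spans), stdev(library_spans)
    labels = []
    for span in mapped_spans:
        if span < mu - 3 * sd:
            labels.append("insertion")   # extra sequence in sample: ends map closer on the reference
        elif span > mu + 3 * sd:
            labels.append("deletion")    # sequence missing in sample: ends map further apart
        else:
            labels.append("concordant")
    return labels


library = [2900, 3000, 3100, 3000, 2950, 3050]     # hypothetical ~3 kb library
print(classify_pairs([3000, 1500, 6000], library))
# ['concordant', 'insertion', 'deletion']
```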
Comparative Genomics
Roche-Yale publishes study in Science

Human structural variations identified
J. Korbel et al. Science, 13 September 2007
• Same principle as used in GIS, but here the Roche-Yale group used only 3kb PE spans for higher
resolution, and ~100bp tags (for best human genome mapping specificity)
• For SV identification, Yale sequenced between 10M – 21M of 3kb-PE reads (thus 10x- 21x coverage)
– GIS used 323,632 of 10kb-PE reads (thus ~ 450x coverage of Bp genome)
• Between NA15510 (putative European female) and NA18505 (Yoruban female),
– A total of 1,297 SVs were identified (1,175 indels, 122 inversions)
• PCR validation on 40 randomly-selected SVs: 97% validation success
• Note: no actual genome-genome alignment was done in this Roche-Yale study
Resequencing
Whole Human Genome Resequencing

The first Next Gen Sequencer-based resequencing of the
human genome
David Wheeler et al., Nature 452:872 (17 April 2008)
• Blood sample provided by Dr. James Watson in 2005
• 454 Life Sciences/Baylor Human Genome Sequencing
Center (HGSC) collaboration
• Browser at http://jimwatsonsequence.cshl.edu/
• Completed May 2007

• On a GS-FLX: 2 months, US$ 1M, 78.5 Million reads, 19.7 Billion bases (6.5x oversampling).
• Cf. HGP: 10-15 years, US$ 4B.
• Only 3% of the total reads could not be mapped to reference human genome
(UCSC and Celera assembly). Of the 3%, ~1.1% completely unknown; remainder were
unmapped because of DNA repeats.
• Identified 177,181 InDels ranging from 3 to >7,000bp.
• 1.8 Million known (in dbSNP) SNPs observed; 200,000 novel SNPs identified
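As a quick check on the run statistics above, the mean read length and fold oversampling can be derived from the read and base counts. The genome size of 3.0 Gb is an assumption made for this sketch; the variable names are illustrative.

```python
TOTAL_READS = 78.5e6     # from the slide
TOTAL_BASES = 19.7e9     # from the slide
GENOME_BP = 3.0e9        # assumption: approximate haploid human genome size

mean_read_length = TOTAL_BASES / TOTAL_READS   # bases per read
fold_oversampling = TOTAL_BASES / GENOME_BP    # sequenced bases per genome base
unmapped_reads = 0.03 * TOTAL_READS            # 3% of reads were not mapped

print(f"Mean read length: {mean_read_length:.0f} bases")
print(f"Oversampling: {fold_oversampling:.1f}x")
print(f"Unmapped reads: {unmapped_reads / 1e6:.2f} million")
```

With these inputs the oversampling comes out close to the 6.5x quoted on the slide; the small difference depends on the genome size assumed.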
De Novo sequencing of potato BACs
Long accurate reads – few contigs
Pilot study for a member of the potato consortium
56 plant BACs sequenced using MIDs, and assembled
8 in milestone I, 48 on two LR70 runs in milestone II
Average BAC insert size: 136 kb
Average number of contigs >500 nt (N=56): 16.6
Average N50 contigs (N=56): 39,808 nt
New sequence that could not be obtained with a
Sanger capillary sequencer was detected
Comparison with Sanger sequence often revealed several
kb of new information per BAC, because GS-FLX has no
cloning bias
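For readers unfamiliar with the N50 statistic reported above, here is a minimal sketch of how it is commonly computed: the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly size. The contig lengths below are hypothetical and are not taken from the potato BAC data.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly: the contig length at which the
    cumulative length (longest contigs first) reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one contig")


# Hypothetical contigs for a single 136 kb BAC (illustrative values only).
example_contigs = [50_000, 40_000, 20_000, 15_000, 11_000]
example_n50 = n50(example_contigs)
print(example_n50)
```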
Technological innovations
• Sequence Capture (SeqCap) microarrays
• Multiplex Identifiers (MIDs)
Sample preparation is the new bottleneck
for resequencing applications
Labor,
Infrastructure,
Throughput
Sample prep by PCR and cloning
Conventional Sanger dideoxy sequencing
Next Generation Sequencing: ultra-high throughput
Sample prep by NimbleGen Microarray-based Sequence Capture (SeqCap)
Capture of specific DNA regions at kb to Mb scale will enable the full potential of Next Generation Sequencing to be exploited.
Targeted Resequencing
NimbleGen SeqCap arrays and GS-FLX

An easier way to do targeted resequencing
1. Fragment
2. Hybridize to SeqCap array
3. Wash and elute, perform PCR
4. Sequence on GS-FLX
5. Analyze sequences of captured exons
Targeted Resequencing
NimbleGen SeqCap arrays and GS-FLX

Exon Capture Results
454-sequencing reads, BLAST hits
(Figure tracks: Replicate 1, Replicate 2, Replicate 3, Array probe positions)
• Probes >60mer
• 1 probe every 10 bases
• Highly reproducible
• Accurately captures targets
• Mean enrichment ~378-fold
Chr16 exon capture
(T.J. Albert et al. Nature Methods. October 2007)
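One common way to express fold enrichment is the fraction of reads landing on target divided by the fraction of the genome that the target represents. The sketch below uses that definition as an assumption; the slide does not state how the ~378-fold figure was calculated, and the read counts and target size here are hypothetical.

```python
def fold_enrichment(on_target_reads, total_reads, target_bp, genome_bp):
    """Observed on-target read fraction relative to the fraction expected
    if reads were spread uniformly across the genome."""
    observed_fraction = on_target_reads / total_reads
    expected_fraction = target_bp / genome_bp
    return observed_fraction / expected_fraction


# Hypothetical capture experiment (illustrative values only).
enrichment = fold_enrichment(
    on_target_reads=70_000,
    total_reads=100_000,
    target_bp=5e6,       # assumed target size
    genome_bp=3.0e9,     # assumed human genome size
)
print(f"{enrichment:.0f}-fold enrichment")
```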
NimbleGen Sequence Capture Array Targeting
11p12 Diabetes Locus
(Figure tracks: mapped reads, base coverage depth, targeted region, genomic region, repetitive regions)
• 385,000 probes targeted ~3 Mb of 11p12 locus

• 72% of the region covered by probes (window masking of repeats)
High Coverage and Specificity of
Sequence Capture at 11p12
Table rows:
• HapMap samples
• Initial target bases
• Total reads
• Number of reads in target regions
• Percent of reads in target regions
• Target bases covered
• Percent target bases covered
• Average coverage depth
• Median coverage depth
• HapMap SNPs classified correctly
NimbleGen Sequence Capture 385K
Custom Service End-April 2008 (estimated)
Step 1: Array Design.
NimbleGen will design probes against regions provided by the researcher. Repetitive regions will not
be covered by the design, and researchers will approve the design before Step 2 starts.
Step 2: Sequence Capture
The researcher ships genomic DNA samples to the Roche NimbleGen Service Lab. Roche
NimbleGen will manufacture the array from Step 1 and perform sequence capture on the samples.
The enriched DNA will be amplified, tested for enrichment level, and shipped back to the researcher.
Researchers will provide:

• High-quality genomic DNA (human or mouse only; > 21 µg/sample).
- WGA samples are currently not acceptable.
• Sequence information on regions to target in the genome
- from hg18 or mm9, currently up to 5Mb max.
Researchers will get:
• Captured DNA (10 µg amplified DNA/sample) with report on yield and level of enrichment
(qPCR, referenced against control loci).
• List of regions targeted by the design and visualization software.
• User’s Guides, including how to sequence the captured DNA with GS-FLX.
Molecular Barcoding concept
MIDs (Multiplex Identifiers)
MID  Sequence
A    ACGAGTGCGT
B    ACGCTCGACA
C    AGACGCACTC
D    AGCACTGTAG
E    ATCAGACACG
F    CGTGTCTCTA
G    CTCGCGTGTC
H    TAGTATCAGC
I    TCTCTATGCG
J    TGATACGTCT
K    TACTGAGCTA
L    ATATCGCGAG
• MIDs allow barcoding of up to 12 different samples, for mixing, emPCR and
pooling into each region for sequencing; maximum possible = 16 regions x 12
samples / region = 192 samples per PTP (1000 reads per sample)
Read structure (with sequencing primer): Primer A | Key | MID | Library fragment | Primer B
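The barcoding concept can be illustrated with a small demultiplexing sketch: each read is expected to begin with the key, followed by a 10-base MID, followed by the library fragment. The MID sequences are those listed on this slide. The key sequence `TCAG`, the exact-match rule, and the function name `demultiplex` are assumptions made for illustration and do not describe the GS-FLX software.

```python
MIDS = {
    "A": "ACGAGTGCGT", "B": "ACGCTCGACA", "C": "AGACGCACTC",
    "D": "AGCACTGTAG", "E": "ATCAGACACG", "F": "CGTGTCTCTA",
    "G": "CTCGCGTGTC", "H": "TAGTATCAGC", "I": "TCTCTATGCG",
    "J": "TGATACGTCT", "K": "TACTGAGCTA", "L": "ATATCGCGAG",
}
KEY = "TCAG"       # assumption: 4-base key preceding the MID
MID_LENGTH = 10    # all MIDs on the slide are 10 bases long


def demultiplex(reads, key=KEY, mids=MIDS):
    """Sort reads into per-sample bins by exact match of the MID after the key.

    Returns a dict mapping MID name to a list of trimmed library fragments;
    reads without a recognizable key + MID go to 'unassigned' untrimmed.
    """
    name_by_sequence = {seq: name for name, seq in mids.items()}
    bins = {name: [] for name in mids}
    bins["unassigned"] = []
    for read in reads:
        if read.startswith(key):
            start = len(key)
            tag = read[start:start + MID_LENGTH]
            name = name_by_sequence.get(tag)
            if name is not None:
                bins[name].append(read[start + MID_LENGTH:])
                continue
        bins["unassigned"].append(read)
    return bins


# Hypothetical reads (illustrative only).
example_reads = [
    "TCAG" + "ACGAGTGCGT" + "GATTACA",   # MID A
    "TCAG" + "TACTGAGCTA" + "CCGG",      # MID K
    "TCAG" + "AAAAAAAAAA" + "TT",        # unknown MID
]
example_bins = demultiplex(example_reads)

# Capacity from the slide: 16 regions x 12 MIDs per region.
SAMPLES_PER_PTP = 16 * len(MIDS)
print(SAMPLES_PER_PTP)
```

A production demultiplexer would usually tolerate one or two sequencing errors in the MID; exact matching is used here only to keep the sketch short.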

Number of GS-FLX runs 1

Sample input: Genomic DNA, BACs, amplicons, cDNA

Generation of small DNA fragments via nebulization Fragments

Ligation of A/B-Adaptors flanking single-stranded One Fragment

Emulsification of beads and fragments in water-in-oil

Sequencing and base calling One Read

gDNA sstDNA library

sstDNA library Clonally-amplified sstDNA attached to capture bead

PTP well diameter: average of 44 µm

A single, clonally amplified sstDNA bead

Amplified sstDNA library beads Quality reads

4 unlabeled nt’s (TACG) are added

Amplicon Variant Analyzer

No change in sequencer New, metallized PTP Current image Improved image

For an out-of-the-box solution we have

Server specifications are available for a

You do not need a GS-FLX if:

                                          Sanger capillary         GS-FLX
Read-length (bases)                       750-1,000                ~200
Throughput (bases per 24 hrs)             1-2 million              ~300 million (3 runs)
Raw cost-per-base (US cents)              ~0.2¢                    ~0.008¢
Accuracy (Single-read/ 20x consensus)     99.3-99.6% / >99.99%     ~99.5% / >99.995%
Manpower (bact. genome project)           5-10 FTE                 1-2 FTE
Sample prep time (bact. genome project)   Months                   Hours
Sequencing time (bact. genome project)    Months – Years           Hours – Days
Sample characteristics                    In vivo clones: biased   In vitro: unbiased
                                          Errors: homopolymer      Errors: homopolymer

Depending on the organism,

Whole human genome

(Whiteford, N. et al. Nucl. Acids Res. 2005)

Transcriptome profiling of Plants

Table shows uniqueness of transcriptomic singletons. Data is extracted from a 2008

1. Genomes (especially complex ones) contain large numbers of repeats

GS-FLX Long read

Unique DNA Repetitive DNA Unique DNA

4.0% E. coli run #1

3.5% FLX single-read

• GS-FLX Single-Read accuracy > 99.5% (includes all homopolymer errors)

• At 36 bases, Short-read system gives ~2 bases wrong (5% error)

Based on data downloaded from Sanger Institute website

1. Single-read accuracy here in parsing homopolymers > 90% up to n = 5

The incidence of homopolymeric stretches long enough to cause problems for

• May 2008: > 150 publications related to 454-sequencing

WGS • Whole genome re-sequencing and

How does Amplicon Sequencing detect rare variants?

PCR to obtain target locus

Mutation present at low frequency GS-FLX Direct PCR

Targeted resequencing (Medical resequencing)

Shotgun Sequence on GS-FLX

Analysis is by Ref Mapper (AVA is for direct amplicon analysis)

Shotgun library created from 10 overlapping, long range PCR Amplicons

Single Base Substitution 12 Various

2 Base Deletion 1 AT/--

2 Base Insertion 1 --/CA

3 Base Deletion 1 GCT/---