The Next Generation of Genomic Research
Real-time qPCR
LightCycler 2.0; LC480
• HRM for SNP validation
• qPCR for quantitative validation
Quick genome scans
Whole genome tiling microarrays
• aCGH
• ChIP-chip
• Epigenetics
• Gene expression
• Targeted sequence capture (SeqCap)
High-throughput sequencing
Genome Sequencer GS-FLX
• Ultra-Broad sequencing
• Ultra-Deep sequencing
• Whole genome de novo sequencing
• Whole genome re-sequencing
Sample Prep
Roche reagents
• RNA/DNA isolation
• RNA/DNA purification
• DNA amplification
• DNA labeling
Optics subsystem
Fluidics subsystem
• Sample prep
– Bacterial cloning of DNA, colony picking,
culturing, plasmid DNA extraction
– Typical time needed: weeks/ months
• DNA Sequencing
– State of the art capillary sequencer
enables maximum ~1-2 Mb/ 24hr
– Typical sequencing time needed for a
whole genome project: months/ years
• Personnel requirements
– Production sequencing facility required;
manpower needed for sample prep and
sequencing, typically 5-10 full-time
staff
Genome Sequencing – Next (2nd) Generation
Massively parallel pyrosequencing (454-sequencing™) on the GS-FLX
• Sample prep
– No bacterial cloning; all cloning is done in vitro
– time needed: ~15 hrs including emulsion PCR
amplification
• DNA Sequencing
– GS-FLX output/run ~100 Mb in 7.5hr
– Daily throughput ~300 Mb/ 24hrs theoretical
– Typical sequencing time needed for a
whole genome project: much shortened
• Personnel requirements
– Manpower needed for sample prep and
sequencing, typically 1-2 full-time staff
Sequencing of Corynebacterium kroppenstedtii strain
From sequencing to manuscript submission = 1 week
Tauch, A. et al. “Ultrafast pyrosequencing of Corynebacterium kroppenstedtii DSM44385 revealed
insights into the physiology of a lipophilic corynebacterium that lacks mycolic acids.”
Journal of Biotechnology, Available online 20 March 2008
7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.
Dr. Andreas Tauch, University of Bielefeld:
Sequencing Time:
Organism (year)            Method                    Sequencing time
C. glutamicum (2000)       Clone-by-clone, Sanger    2 years
C. jeikeium (2003)         Whole genome shotgun      3 months
C. urealyticum (2006)      GS-20 shotgun             4 days
C. kroppenstedtii (2007)   GS-FLX shotgun            7.5 hrs
(XLR_HD):
Expect >10 C. kroppenstedtii-like genomes to be
sequenced in 1 run
(~500Mb / (2.4Mb*20x oversampling) = 10.4)
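The estimate above is plain coverage arithmetic. A minimal Python sketch, using only the figures quoted on the slide (the function and parameter names are illustrative, not part of any Roche software):

```python
def genomes_per_run(run_output_mb: float, genome_size_mb: float, oversampling: float) -> float:
    """Number of genomes one run can cover at the requested fold oversampling."""
    return run_output_mb / (genome_size_mb * oversampling)


# Figures from the slide: ~500 Mb per run, 2.4 Mb genome, 20x oversampling.
n = genomes_per_run(run_output_mb=500, genome_size_mb=2.4, oversampling=20)
print(f"{n:.1f} genomes per run")  # 10.4 genomes per run
```

The same function can be reused for other genome sizes or coverage targets when planning a run.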
Cloning Bias in Conventional ABI Sequencing
(500 kb stretch in Listeria monocytogenes)
[Figure: coverage vs. reference position (bp); GS-FLX coverage axis 0-50, ABI coverage axis 0-18]
Courtesy of Drs. Nusbaum and Young of the Broad Institute
GS-FLX Sequencing Workflow
Overview
a. Genomic DNA fragmented by nebulization
b. Adaptors A and Biot-B ligated to fragments
c. Immobilize repaired, adapted DNA to paramagnetic Streptavidin beads
d. Select for only A-fragment-B and B-fragment-A sstDNA molecules in supernatant (not Biot)
e. Functional validation of sstDNA by titration run (do emPCR and GS-FLX sequencing run to determine best number of sstDNA molecules per Capture Bead)
• Anneal sstDNA to an excess of DNA capture beads (choice of no. of average molecules/bead is based on titration results)*. Capture beads are non-paramagnetic.
• Emulsify beads and PCR reagents in water-in-oil microreactors (using a TissueLyser). Most of these microreactors that contain DNA will contain only 1 DNA molecule and 1 bead.
• Clonal amplification occurs inside microreactors. Typically, 40 PCR cycles are performed.
• Break microreactors, and enrich for DNA-positive beads (using magnetic streptavidin beads that bind to the biotinylated emPCR products). Convert to bead-bound sstDNA.
*Titration is required to avoid excessive “empty” or else “multi-template” beads in the emPCR
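Why titration matters can be illustrated with a simple loading model. The sketch below assumes that the number of sstDNA templates captured per bead follows a Poisson distribution. That model is an assumption made here for illustration and is not stated in the slides; the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k templates on a bead under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def bead_fractions(mean_templates_per_bead: float) -> dict:
    """Expected fractions of empty, single-template and multi-template beads."""
    empty = poisson_pmf(0, mean_templates_per_bead)
    single = poisson_pmf(1, mean_templates_per_bead)
    return {"empty": empty, "single": single, "multi": 1.0 - empty - single}


# Assumed (illustrative) template-to-bead ratios, as a titration series might test.
for ratio in (0.1, 0.5, 1.0, 2.0):
    f = bead_fractions(ratio)
    print(f"ratio {ratio}: empty {f['empty']:.2f}, "
          f"single {f['single']:.2f}, multi {f['multi']:.2f}")
```

Under this model, low ratios waste beads as "empty" and high ratios produce "multi-template" beads, which is the trade-off the titration run is meant to balance.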
Process Steps
1. DNA Library Preparation and Titration: 4.5 h and 10.5 h
2. emPCR: 8 h
3. Sequencing: 7.5 h
Metric and image viewing software
Signal output from a single well (flowgram)
On current GS-FLX system, raw image -> FASTA basecalls = 8-9 hours
GS FLX Sequencing
Bioinformatics
Image capture
Image processing
Signal processing
Reference Mapper (assembly using ref. seq.)
De novo Assembler (assembly from scratch)
Features:
• Small dataset size = greater convenience. 13.2 GB including raw images, allows future re-analysis if desired
(e.g. post software upgrade)
• Useful software (currently 3 applications) available out-of-the-box
• Long (250bp) and accurate (99.5% single-read) reads = No filtering of reads against known ref needed
• Software recognizes molecular barcodes (MIDs) for greater multiplexing (and economy); it can also assemble
several read types (454-shotgun, Sanger and paired-end), singly or in combination, for best results
The Genome Sequencer FLX System
Technical Specifications
                             Current system            After “Titanium” kit upgrade (~Q3 2008)
Sequence reads per run       ≥ 400,000                 ≥ 1,000,000
Bases per read               200 - 300                 400 - 500
Run time and output          7.5 hours, ~100 Mb        10 hours, ~500 Mb
Runs per 24-hr day           2-3                       2
Theoretical time to 1 Gb     3-4 days                  1 day
Accuracy                     ~99.5% over 200 bases     ~99.0% over 400 bases
Image & signal processing    8-9 hrs post-sequencing   12-20 hrs post-sequencing (upgraded Unix cluster)
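The "time to 1 Gb" figures follow from output per run and runs per day. A short Python sketch of that arithmetic, taking 1 Gb as 1,000 Mb and ignoring library prep and processing time (both simplifications made here):

```python
import math


def days_to_target(target_mb: float, mb_per_run: float, runs_per_day: float) -> float:
    """Days of back-to-back runs needed to reach target_mb of raw sequence."""
    runs_needed = math.ceil(target_mb / mb_per_run)
    return runs_needed / runs_per_day


# Current system: ~100 Mb per run, 2-3 runs per day.
print(days_to_target(1000, 100, 3))  # about 3.3 days
print(days_to_target(1000, 100, 2))  # 5.0 days
# After the Titanium upgrade: ~500 Mb per run, 2 runs per day.
print(days_to_target(1000, 500, 2))  # 1.0 day
```

The quoted "3-4 days" for the current system corresponds to the upper end of the 2-3 runs per day range.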
There are other Next-Generation Sequencers available that seem to be cheaper, and they promise
to do many things. Why don’t I buy those instead?
• Keep in mind the following: other NGS platforms give many short, lower-quality reads. These are
suitable only for specific applications (mapping tags to a reference and counting them). Their short
reads result in poor mapping specificity and short contigs (more gaps). Few or no publications.
Side-by-Side Comparison
Sanger dideoxy sequencing vs 454-sequencing ™
Sanger dideoxy GS-FLX
Read length
Long reads matter
Short reads do not provide genome mapping uniqueness
Modified from Figure 2. Uniqueness as a function of read length; human genomic DNA.
25 to 35-bp reads: ~80 - 87% uniqueness
100-bp reads: > 90% uniqueness
High genome-mapping uniqueness is important for genome annotation and transcriptome profiling experiments
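The effect of read length on mapping uniqueness can be reproduced on a toy sequence. The sketch below counts how many read-length windows occur exactly once in a made-up "genome" containing a duplicated 12-bp repeat. The sequence, the function name and the exact-match definition of uniqueness are all assumptions for illustration; real analyses also consider both strands and sequencing errors.

```python
from collections import Counter


def unique_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose read_length-mer occurs exactly once."""
    reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


# Hypothetical toy genome: the same 12-bp repeat placed between different flanks.
repeat = "ACGTACGGTCAT"
genome = "TTGACCA" + repeat + "GGCTAAGC" + repeat + "CATTGAGG"

for k in (4, 8, 16):
    print(k, round(unique_fraction(genome, k), 2))
```

Windows that fall entirely inside the repeat can never be placed uniquely, so uniqueness only recovers once the read is long enough to reach into the flanking sequence.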
Transcriptomics
[Figure: percentage of chr 1 covered, at contig size 10000]
Note: Average gene size in humans ~10-15 kb
Modified from Figure 2. Genome coverage (at specific contig sizes), as a function of read length
25 to 35-bp reads: Only 7% can form contigs of 10,000bp and larger (gaps!!!)
100-bp reads: 90% can form contigs of 10,000bp and larger
Read length is important in genome assembly
Why long reads are needed for genome assembly
* NOTE: The same situation exists even for resequencing: the mapping uniqueness of short
singleton reads to a reference genome is greatly improved by increased read length. Also,
the structures of splice variants are much clearer when long reads are used.
Long reads matter for de novo assembly
Short reads cannot bridge repetitive regions; a gap remains
If the read does not span the repeats, no amount of increased sequencing
coverage (depth) will allow either de novo genome assembly, or high-quality
resequencing (there will be gaps)
GS-FLX sequencing accuracy
[Figure: accuracy (0-100%) vs. homopolymer length (1-9); series include a short-read sequencer (Competitor I)]
HomoPolymer Frequency
[Figure: percent of genome (log scale, 0.00001% to 100%) for homopolymers from 4-mer to 16-mer, by nucleotide (A, G, C, T)]
Every species is sequenced individually, allowing every mutation to be quantified
Result is averaged across all species present and may be undetectable
Ultradeep sequencing
Genomic DNA
Disease Associated Region > 1000s bp
Generate long-range PCR amplicons across desired target region
Long Range PCR Amplicons: 3-15 kb each
Pool amplicons (in equimolar amounts)
Targeted resequencing
Ultradeep sequencing for retrospective analysis of relapse in NSCLC patient
R.K. Thomas et al., Nature Medicine (July 2006) 12: 852-855
See also: Wensing et al. (2005) JID 192:958; Shafer et al. (2006) JID 194 (Suppl 1):551; Bonaventura et al. “New
developments in HIV drug resistance and options for treatment-experienced patients”
Subspecies identification in HIV-1
Sequencing of 207 bp amplicon from virus protease gene
Individual sequencing of each template allows identification and quantification
of distinct virus subspecies within a mixed population, including “haplotypes”
[Figure: Mutation Freq. and Coverage tracks, with three read alignments]
39% of reads:
GAAATG-GCTTTGCC
|||||| | ||||||
GAAATGAG-TTTGCC
34% of reads:
GACATGAATTTG
|| |||| ||||
GAAATGAGTTTG
21% of reads:
GAAATGCAGTT-GCCAGG
|||||| |||| ||||||
GAAATG-AGTTTGCCAGG
Sanger, direct PCR: GAAATGGNTTTGCC (unresolvable region)
In collaboration with Dr. M. Kozal, Yale VA Hospital.
See also Wang et al. (2007) Genome Res, for a good look at coverage needed in HIV-1 variant analysis
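Because each read comes from a single template, variant quantification reduces to counting identical reads. A minimal Python sketch of that idea, using made-up reads and counts (not the data from the HIV-1 study) and a hypothetical function name; real pipelines align reads and account for sequencing error before counting:

```python
from collections import Counter


def variant_frequencies(reads, min_fraction=0.0):
    """Fraction of reads supporting each distinct sequence, most common first."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n / total
            for seq, n in counts.most_common()
            if n / total >= min_fraction}


# Hypothetical amplicon reads: three variants in a mixed population.
reads = ["ACGTAC"] * 6 + ["ACGAAC"] * 3 + ["ACTTAC"] * 1

for seq, frac in variant_frequencies(reads).items():
    print(seq, f"{frac:.0%}")
```

The `min_fraction` cutoff is one simple way to ignore variants seen too rarely to distinguish from error; the right threshold depends on coverage.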
Transcriptomics
• Based on existing (some partial) reference genomes: microbes,
Drosophila, human cell lines, maize, Arabidopsis, Medicago,
salmon, sheep.
• de novo transcriptome characterization: Paper wasp (ref
honeybee), Glanville Fritillary butterfly (ref Bombyx mori)
Transcriptomics
Transcriptome characterization
Overview
Transcriptome characterization
Advantages of using GS-FLX
Long, accurate reads allow excellent mapping of ESTs to genome (Table)
- Torres et al., Genome Research (2008) Jan;18(1):172-7 (GS-20, Drosophila)
Essentially unbiased representation regardless of transcript length (Figure)
or expression level
- Weber et al., Plant Physiology (2007) May; 144:32-42 (GS-20, Arabidopsis)
Here we see 154,379 GS-20 ESTs corresponding to 1,053 transcripts (flcDNAs) of 1,000-2,000nt in size (medium length).
GS-FLX sequencing of the paper wasp transcriptome
A conventionally prepared wasp cDNA library was sequenced using 454-Sequencing.
391,157 brain cDNA reads generated
3,017 genes hit in honey bee genome
No wasp genome available
32 behavioral gene orthologs further characterized to demonstrate the link between
maternal behavior and the development of social behavior
Study also demonstrated the ability to use a known, related genome (Bee in this case)
as a hub to successfully generate assemblies
Transcriptomics
• 3 women in Australia received various transplanted organs (liver, kidney) from same male donor;
donor had died from cerebral hemorrhage.
• 4-6 wks post-op, recipients all died from “febrile illness with varying degrees of encephalopathy”.
• Tested negative: bacterial/ viral cultures; PCR for various viruses; microarray analysis using
panmicrobial and viral arrays
• Methods: GS-FLX sequencing performed on RNA extracted (various source tissues) from 2 deceased
patients; RNA was DNaseI-treated, then RT-PCR using random primers. No further nebulization done.
• Results: Sequences were filtered bioinformatically to remove repetitive DNA, and human (host) sequences
were subtracted; non-human sequences were clustered with Cd-hit, then assembled with CAP3. BLASTX and
BLASTN searches against GenBank were performed. Of 103,632 sequences (mean size: 162bp), 14 fragments had
homology to arenavirus, with the closest relationship to LCMV. Validation of novel LCMV was done by RT-PCR:
22 of 30 samples (from 3 patients) were positive. Other validations were done on Vero E6 cells and patient
samples.
• Conclusions: GS-FLX sequencing has been used to identify a novel pathogen present against a
massive background of known host gDNA; “Medical Metagenomics”?
14 out of ~100K reads hit LCMV
Maximum contiguous match to known LCMV: only 14 bp
Small noncoding RNA (sncRNA) characterization
Ultrabroad seq- MicroRNA
[Figure: CONTIG 1 and CONTIG 2]
Scaffold (supercontig) formation (ordering and orienting contigs, with paired-end data)
Contigs that are oriented w.r.t. their immediate neighbours are gradually ordered to form a scaffold.
Scaffolds can still contain gaps. Final step is “gap-filling” or “finishing”.
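The ordering step can be sketched in a few lines of Python. This is a toy illustration, not the algorithm used by the GS-FLX assembler: it assumes each read pair whose ends fall in different contigs has already been reduced to an oriented (left contig, right contig) link, and it chains contigs joined by at least `min_pairs` links. All names and data are hypothetical.

```python
from collections import Counter


def build_scaffolds(links, min_pairs=2):
    """Chain contigs into ordered scaffolds using well-supported paired-end links."""
    support = Counter(links)
    # Keep only links seen at least min_pairs times (simplification: if a contig
    # has several supported successors, only one of them is kept).
    successor = {a: b for (a, b), n in support.items() if n >= min_pairs}
    starts = sorted(set(successor) - set(successor.values()))
    scaffolds = []
    for start in starts:
        chain, seen = [start], {start}
        while chain[-1] in successor and successor[chain[-1]] not in seen:
            chain.append(successor[chain[-1]])
            seen.add(chain[-1])
        scaffolds.append(chain)
    return scaffolds


# Hypothetical links: c3 -> c4 is seen only once, so it is not trusted.
links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c3", "c4")]
print(build_scaffolds(links))  # [['c1', 'c2', 'c3']]
```

Gaps between neighbouring contigs in each chain are what the later "gap-filling" or "finishing" step addresses.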
Comparative Genomics
[Figure: Scaffold Generation; labels: Contig 1, Contig 2, Scaffold 1, Biot, Shear & select]
Paired-End; de novo assembly
B. pseudomallei #22
Expected genome size (Mb)             7
Number of chromosomes                 2
Number of runs (GS-20)                6
Fold oversampling                     22
Assembly contigs (1221 - 79098 bp)    940
PE library runs                       1
Supercontigs using 2kb PE library     50
Supercontigs using 5kb PE library     11
Supercontigs using 10kb PE library    4
Genome coverage (w.r.t. Sanger ref)   93.04%
Choice of Paired-End library target size depends on the particular genome
Paired-End; de novo assembly
[Table: GS-FLX read type, genome coverage, number of contigs/scaffolds. Surviving row: 3 kb span, 18×, 7]
[Figure: count histogram, axis ticks 500, 1000, 2000; label 16 kb]
[Figure: insertion and deletion, sample vs. reference, with Biot-labeled paired ends]
Comparative Genomics
• Same principle as used in GIS, but here the Roche-Yale group used only 3kb PE spans for higher
resolution, and ~100bp tags (for best human genome mapping specificity)
• For SV identification, Yale sequenced between 10M and 21M 3kb-PE reads (thus 10x-21x coverage)
– GIS used 323,632 10kb-PE reads (thus ~450x coverage of Bp genome)
• Between NA15510 (putative European female) and NA18505 (Yoruban female),
– A total of 1,297 SVs were identified (1,175 indels, 122 inversions)
• PCR validation on 40 randomly-selected SVs: 97% validation success
• Note: no actual genome-genome alignment was done in this Roche-Yale study
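The coverage figures quoted for the paired-end libraries are span (physical) coverage: number of pairs times insert span, divided by genome size. A small Python sketch, assuming a human genome of ~3 Gb (a commonly used round figure, not stated on the slide) and the 7 Mb B. pseudomallei genome size given earlier:

```python
def span_coverage(n_pairs: int, span_bp: int, genome_bp: int) -> float:
    """Physical coverage from paired ends: total span covered / genome size."""
    return n_pairs * span_bp / genome_bp


HUMAN_GENOME_BP = 3_000_000_000  # assumed round figure
BP_GENOME_BP = 7_000_000         # expected B. pseudomallei genome size (7 Mb)

print(span_coverage(10_000_000, 3_000, HUMAN_GENOME_BP))   # 10.0
print(span_coverage(21_000_000, 3_000, HUMAN_GENOME_BP))   # 21.0
print(round(span_coverage(323_632, 10_000, BP_GENOME_BP)))  # 462
```

With a 7 Mb genome the GIS figure comes out near 460x, in line with the "~450x" quoted above; the exact value depends on the genome size used.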
Resequencing
1. Fragment
2. Hybridize to SeqCap array
3. Wash and elute, perform PCR
4. Sequence on GS-FLX
5. Analyze sequences of captured exons
Targeted Resequencing
Replicate 3
• Accurately captures targets
• Mean enrichment ~378-fold
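One common way to express fold enrichment is the on-target read fraction divided by the fraction of the genome that the target represents. The slide does not say how its ~378-fold figure was calculated, so the definition below is an assumption, and the input numbers are hypothetical round values chosen only to show the arithmetic.

```python
def fold_enrichment(reads_on_target: int, total_reads: int,
                    target_bp: float, genome_bp: float) -> float:
    """On-target read fraction relative to the target's share of the genome."""
    on_target_fraction = reads_on_target / total_reads
    target_fraction = target_bp / genome_bp
    return on_target_fraction / target_fraction


# Hypothetical example: 5 Mb target in a 3 Gb genome, half of all reads on target.
fold = fold_enrichment(reads_on_target=500, total_reads=1000,
                       target_bp=5e6, genome_bp=3e9)
print(f"{fold:.0f}-fold enrichment")  # 300-fold enrichment
```

Without capture, the expected on-target fraction equals the target's share of the genome, which corresponds to 1-fold on this scale.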
[Figure tracks: mapped reads, targeted region, genomic region, repetitive regions; label: high coverage depth]
[Table: HapMap samples. Rows: initial target bases; total reads; number of reads in target regions; percent of reads in target regions; target bases covered; percent target bases covered; average coverage depth; median coverage depth; HapMap SNPs classified correctly]
NimbleGen Sequence Capture 385K
Custom Service End-April 2008 (estimated)
Step 1: Array Design.
NimbleGen will design probes against regions provided by the researcher. Repetitive regions will not
be covered by the design, and researchers will approve the design before Step 2 starts.
Step 2: Sequence Capture
The researcher ships genomic DNA samples to the Roche NimbleGen Service Lab. Roche
NimbleGen will manufacture the array from Step 1 and perform sequence capture on the samples.
The enriched DNA will be amplified, tested for enrichment level, and shipped back to the researcher.
Sequencing primer