Gene Expression RNA Sequence
CUBT401: RNA-seq
Zedias Chikwambi
2017
23
Assignment
http://combine-australia.github.io/RNAseq-R/
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 1
Introduction to RNA sequencing (lecture)
www.malachigriffith.org
mgriffit@genome.wustl.edu
www.obigriffith.org
ogriffit@genome.wustl.edu
Learning objectives of the course
Tutorials
Provide a working example of an RNA-seq analysis pipeline
Run in a ‘reasonable’ amount of time with modest computer resources
Self contained, self explanatory, portable
Learning objectives of module 1
Figure: Condition 1 (normal colon) vs. Condition 2 (colon tumor): sequence ends, then map to genome, transcriptome, and predicted exon junctions
http://www.alexaplatform.org/courses/2013/cbw/ENCODE_RNAseq_standards_v1.0.pdf
Replicates
Technical Replicate
Multiple instances of sequence generation
Flow Cells, Lanes, Indexes
Biological Replicate
Multiple isolations of cells showing the same phenotype, stage or other experimental condition
Some example concerns/challenges: Environmental Factors, Growth Conditions, Time
Correlation Coefficient 0.92-0.98
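One way to check replicates against the 0.92-0.98 range is to correlate per-gene expression values between two replicates. The sketch below assumes a Pearson correlation on untransformed values; the slide does not say which coefficient or transformation is intended, and the expression values are hypothetical, made up only to exercise the function.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-gene expression values for two replicates (not real data).
rep1 = [10.0, 250.0, 32.0, 0.0, 1200.0, 75.0]
rep2 = [12.0, 240.0, 30.0, 1.0, 1150.0, 80.0]

r = pearson(rep1, rep2)
print(f"replicate correlation: {r:.3f}")
```

In practice the comparison would run over all genes, and a log transform or a rank-based (Spearman) coefficient may be preferable when a few highly expressed genes dominate.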
Common analysis goals of RNA-Seq analysis (what can you ask of the data?)
Gene expression and differential expression
Alternative expression analysis
Transcript discovery and annotation
Allele specific expression
Relating to SNPs or mutations
Mutation discovery
Fusion detection
RNA editing
General themes of RNA-seq workflows
Each type of RNA-seq analysis has distinct requirements and
challenges but also a common theme:
1. Obtain raw data (convert format)
2. Align/assemble reads
3. Process alignment with a tool specific to the goal
• e.g. ‘cufflinks’ for expression analysis, ‘defuse’ for fusion detection, etc.
4. Post process
• Import into downstream software (R, Matlab, Cytoscape, Ingenuity, etc.)
5. Summarize and visualize
• Create gene lists, prioritize candidates for validation, etc.
Tool recommendations
Alignment
BWA (PMID: 20080505): align to genome + junction database
Tophat (PMID: 19289445), STAR (PMID: 23104886), MapSplice (PMID: 20802226), hmmSplicer (PMID: 21079731): spliced alignment to genome
Fusion detection
Tophat-fusion (PMID: 21835007), ChimeraScan (PMID: 21840877), Defuse (PMID: 21625565), Comrad (PMID: 21478487)
Transcript assembly
Trinity (PMID: 21572440), Oases (PMID: 22368243), Trans-ABySS (PMID: 20935650)
Visit the ‘SeqAnswers’ or ‘BioStar’ forums for more recommendations and discussion
http://seqanswers.com/
http://www.biostars.org/
SeqAnswers exercise
Go to:
http://seqanswers.com/
> 50 bp reads
Spliced aligner such as Bowtie/TopHat
Visualization of spliced alignment of RNA-seq data
IGV screenshot: acceptor site mutation, with tracks for Normal WGS, Tumor WGS, and Tumor RNA-seq
Common questions: how reliable are expression predictions from RNA-seq?
Are novel exon-exon junctions real?
What proportion validate by RT-PCR and Sanger sequencing?
Are differential/alternative expression changes observed between tissues accurate?
How well do DE values correlate with qPCR?
384 validations
qPCR, RT-PCR, Sanger sequencing
See ALEXA-Seq publication for details:
Also includes comparison to microarrays
Griffith et al.
Alternative expression analysis by RNA sequencing.
Nature Methods. 2010 Oct;7(10):843-847.
Validation (qualitative)
qPCR of 192 exons identified as alternatively expressed by ALEXA-Seq
Figure: RNA-seq reads (2 x 100 bp) -> Bowtie/TopHat alignment (genome) -> Cufflinks -> Cufflinks (cuffmerge) -> Cuffdiff (A:B comparison) -> Visualization; the pipeline inputs are marked on the diagram
Bowtie/Tophat/Cufflinks/Cuffdiff RNA-seq Pipeline
Sequencing -> Read alignment -> Transcript compilation -> Gene identification -> Differential expression
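The pipeline stages can be sketched as a list of command lines, one per stage. This is only a sketch under stated assumptions: the flags (`-o`, `-G`, `-g`, `-s`, `-L`) and default output names (`accepted_hits.bam`, `merged.gtf`) are given as recalled from the TopHat and Cufflinks manuals and should be checked against the installed versions. Every path, sample label, and the helper name `tuxedo_commands` is hypothetical. Nothing is executed; the commands are only built and printed.

```python
import shlex

def tuxedo_commands(bowtie_index, gtf, genome_fa, samples, out="out"):
    """Build (not run) TopHat/Cufflinks/Cuffmerge/Cuffdiff command lines.

    samples maps a condition label to a (reads_1, reads_2) pair of FASTQ paths.
    """
    commands = []
    bams = {}
    for label, (reads_1, reads_2) in samples.items():
        tophat_dir = f"{out}/tophat_{label}"
        # Spliced alignment of paired-end reads to the genome.
        commands.append(["tophat", "-o", tophat_dir, "-G", gtf,
                         bowtie_index, reads_1, reads_2])
        bams[label] = f"{tophat_dir}/accepted_hits.bam"
        # Transcript assembly for each sample.
        commands.append(["cufflinks", "-o", f"{out}/cufflinks_{label}",
                         bams[label]])
    # Merge per-sample assemblies; assemblies.txt must list each sample's
    # transcripts.gtf (writing that file is left out of this sketch).
    commands.append(["cuffmerge", "-o", f"{out}/merged", "-g", gtf,
                     "-s", genome_fa, f"{out}/assemblies.txt"])
    labels = list(samples)
    # A:B differential expression comparison.
    commands.append(["cuffdiff", "-o", f"{out}/cuffdiff",
                     "-L", ",".join(labels), f"{out}/merged/merged.gtf"]
                    + [bams[label] for label in labels])
    return commands

# Hypothetical inputs, for illustration only.
commands = tuxedo_commands(
    bowtie_index="ref/genome",
    gtf="ref/genes.gtf",
    genome_fa="ref/genome.fa",
    samples={
        "normal": ("normal_1.fastq", "normal_2.fastq"),
        "tumor": ("tumor_1.fastq", "tumor_2.fastq"),
    },
)
for command in commands:
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to hand to `subprocess.run` later without shell quoting problems.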
Module 1
We are on a coffee break &
networking session
Translation
Overview of RNA-Seq
From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html
Common Data Formats for RNA-Seq
FASTA format:
>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
FASTQ format:
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
You might get the data in .sff or .bam format. FASTQ reads are easy to extract from both of these binary (compressed) formats!
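The relationship between the two formats can be shown with a small converter: a FASTQ record is four lines (header, sequence, `+`, qualities), and the FASTA record keeps only the first two. This is a minimal sketch that assumes unwrapped four-line records; the function names are illustrative and not from any particular library.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual

def fastq_to_fasta(handle):
    """Return FASTA text for every record in a FASTQ handle (qualities dropped)."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in read_fastq(handle))

# The FASTQ record shown on the slide.
example = io.StringIO(
    "@61DFRAAXX100204:1:100:10494:3070/1\n"
    "AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT\n"
    "+\n"
    "ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA\n"
)
fasta = fastq_to_fasta(example)
print(fasta, end="")
```

The printed result is the FASTA record from the slide: the `@` becomes `>` and the quality line is discarded.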
Paired-End
Figure: a DNA-fragment flanked by adapter+primer sequences, with Read 1 and Read 2 at its two ends and the insert size marked
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
New: @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>
<read>:<is filtered>:<control number>:<sample number>
Example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
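The header layout above can be split into named fields, and the quality line decoded into per-base Phred scores. This sketch assumes the eleven fields exactly as listed on the slide and a Phred+33 quality encoding; the offset is an assumption (older Illumina pipelines used +64), and the function names are illustrative.

```python
def parse_illumina_header(header):
    """Split a new-style Illumina FASTQ header into its named fields."""
    left, right = header.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, is_filtered, control, sample = right.split(":")
    return {
        "instrument": instrument,
        "run_number": int(run),
        "flowcell_id": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read": int(read),
        "is_filtered": is_filtered == "Y",  # "Y" = filtered out, "N" = kept
        "control_number": int(control),
        "sample_number": sample,
    }

def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (offset is assumed)."""
    return [ord(char) - offset for char in quality]

info = parse_illumina_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
scores = phred_scores("<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=")
print(info["instrument"], info["lane"], info["tile"], info["read"])
print(min(scores), max(scores))
```

Under the +33 assumption, `#` decodes to a score of 2 and `A` to 32, which is why runs of `#` usually mark low-confidence bases.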
Transcript Reconstruction from RNA-Seq Reads
The Tuxedo Suite: End-to-end Genome-based RNA-Seq Analysis Software Package
TopHat
Cufflinks
End-to-end Transcriptome-based
RNA-Seq Analysis
Software Package
Trinity
GMAP
Basic concepts of mapping-based RNA-seq - Spliced reads
Figure: DNA (exons separated by introns that begin with GT and end with AG, flanked by UTRs; start codon ATG; stop codons TAG, TAA, TGA) -> Transcription -> Pre-mRNA (with poly-A tail) -> Splicing -> mRNA (UTRs and poly-A tail) -> Translation
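The reason spliced reads need a splice-aware aligner can be shown with a toy gene. The sequence and coordinates below are hypothetical, chosen only so that the intron starts with GT and ends with AG as in the diagram; the mRNA is written in the DNA alphabet (T rather than U) for simplicity.

```python
def introns(exons):
    """Return intron intervals lying between consecutive exons.

    Exons are (start, end) pairs, 0-based and end-exclusive, in order.
    """
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]

def has_canonical_splice_sites(dna, intron):
    """True if the intron begins with GT and ends with AG."""
    start, end = intron
    return dna[start:start + 2] == "GT" and dna[end - 2:end] == "AG"

def splice(dna, exons):
    """Join the exon sequences, dropping introns."""
    return "".join(dna[start:end] for start, end in exons)

# Hypothetical two-exon gene: exon 1, one intron, exon 2.
exon1, intron1, exon2 = "ATGGCC", "GTAAGTTTCAG", "GATTAA"
dna = exon1 + intron1 + exon2
exons = [(0, 6), (17, 23)]

mrna = splice(dna, exons)
junction_read = "GCCGAT"  # last 3 bases of exon 1 + first 3 bases of exon 2

print(mrna)
print(all(has_canonical_splice_sites(dna, i) for i in introns(exons)))
print(junction_read in mrna, junction_read in dna)
```

The junction read is a contiguous substring of the mRNA but not of the genomic DNA, so an aligner working against the genome has to split it across the intron.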
RNA-seq - Spliced reads
Figure: the gene model again (DNA with exons, introns, UTRs, GT splice sites, ATG start codon and TAG/TAA/TGA stop codons; transcription to pre-mRNA; splicing to mRNA; translation), with pre-mRNA labeled alongside it
Stranded RNA-seq
Overview of the Tuxedo Software Suite
Example SAM alignment record, one field per line (field index, then value):
0 61G9EAAXX100520:5:100:10095:16477
1 83
2 chr1
3 51986
4 38
5 46M
6 =
7 51789
8 -264
9 CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA
10 ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG
11 MD:Z:67
12 NH:i:1
13 HI:i:1
14 NM:i:0
15 SM:i:38
16 XQ:i:40
17 X2:i:0
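The numbered values above are the tab-separated columns of one SAM alignment line. The sketch below names the first eleven columns as in the SAM specification, collects the optional tags, and decodes the FLAG bit field. It handles only what this one record needs (for example, only `i` integer tags are converted), and the function names are illustrative.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse_strand",
    0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}

def parse_sam_line(line):
    """Parse one SAM alignment line into a dict of named fields plus tags."""
    cols = line.rstrip("\r\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    tags = {}
    for tag in cols[11:]:
        name, tag_type, value = tag.split(":", 2)
        tags[name] = int(value) if tag_type == "i" else value
    record["tags"] = tags
    return record

def decode_flag(flag):
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# The record from the slide, joined back into one tab-separated line.
line = "\t".join([
    "61G9EAAXX100520:5:100:10095:16477", "83", "chr1", "51986", "38", "46M",
    "=", "51789", "-264",
    "CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA",
    "###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG",
    "MD:Z:67", "NH:i:1", "HI:i:1", "NM:i:0", "SM:i:38", "XQ:i:40", "X2:i:0",
])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"])
print(decode_flag(record["FLAG"]))
```

FLAG 83 is 1 + 2 + 16 + 64: the read is paired, in a proper pair, aligned to the reverse strand, and is the first read of its pair.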
De novo transcriptome assembly
No genome required
Genome
More Counts = More Statistical Power
Example: 5000 total reads per sample. Observed 2-fold differences in read counts.
Gene A: 1 read vs. 2 reads, 2-fold, significant: No. No confidence in 2-fold difference. Likely observed by chance.
geneB: 10 reads vs. 20 reads, p-value 0.098
From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for
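The effect of read depth on a 2-fold difference can be checked with an exact test on the counts. The sketch below assumes the 0.098 shown for geneB is a two-sided Fisher's exact test p-value; the slide does not name the test, so that is an assumption. The 100 vs. 200 case is a hypothetical extra row added only to show the trend.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(count_a, count_b, total_a, total_b):
    """Two-sided Fisher's exact test for a gene's reads in two samples.

    count_a / count_b: reads assigned to the gene in sample A / B.
    total_a / total_b: total reads in sample A / B.
    """
    hits = count_a + count_b          # gene reads across both samples
    total = total_a + total_b         # all reads across both samples
    denom = comb(total, total_a)

    def prob(k):
        # Hypergeometric probability that k of the gene's reads fall in A.
        return Fraction(comb(hits, k) * comb(total - hits, total_a - k), denom)

    observed = prob(count_a)
    low = max(0, total_a - (total - hits))
    high = min(hits, total_a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(p for p in (prob(k) for k in range(low, high + 1))
                  if p <= observed)
    return float(p_value)

p_gene_a = fisher_exact_two_sided(1, 2, 5000, 5000)
p_gene_b = fisher_exact_two_sided(10, 20, 5000, 5000)
p_hypothetical = fisher_exact_two_sided(100, 200, 5000, 5000)  # not on the slide

print(f"1 vs 2 reads:     p = {p_gene_a:.3f}")
print(f"10 vs 20 reads:   p = {p_gene_b:.3f}")
print(f"100 vs 200 reads: p = {p_hypothetical:.2e}")
```

The same 2-fold ratio moves from indistinguishable from chance at 1 vs. 2 reads, to roughly 0.1 at 10 vs. 20, to well below 0.001 at 100 vs. 200, which is the point of the slide title.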
+ other (not-R)
including CuffDiff
See: http://www.biomedcentral.com/1471-2105/14/91
Use of transcripts
Transcripts can be assembled de novo or from
mapped reads and then used in gene
expression/differential expression studies
Can be functionally annotated
Functional annotation
Take transcripts from Cufflinks or Trinity
Annotate the sequences functionally in
Blast2GO
Blast2GO
KEGG-mapping