Gene Expression RNA Sequence

Gene expression
CUBT401: RNA-seq
Zedias Chikwambi
2017
23
Assignment
 http://combine-australia.github.io/RNAseq-R/
Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module 1
Introduction to RNA sequencing (lecture)
www.malachigriffith.org
mgriffit@genome.wustl.edu
www.obigriffith.org
ogriffit@genome.wustl.edu
Learning objectives of the course
 Module 1: Introduction to RNA sequencing

 Module 2: RNA-seq alignment and visualization
 Module 3: Expression and Differential Expression
 Module 4: Isoform discovery and alternative expression
 Module 5: Gene fusion discovery
 Tutorials
 Provide a working example of an RNA-seq analysis pipeline
 Run in a ‘reasonable’ amount of time with modest computer
resources
 Self contained, self explanatory, portable
Learning objectives of module 1
 Introduction to the theory and practice of RNA sequencing (RNA-seq) analysis
 Rationale for sequencing RNA
 Challenges specific to RNA-seq
 General goals and themes of RNA-seq analysis work flows
 Common technical questions related to RNA-seq analysis
 Getting help outside of this course
 Introduction to the RNA-seq hands on tutorial
Gene
expression
RNA sequencing
RNA-seq workflow (figure):
1. Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)
2. Isolate RNAs
3. Generate cDNA, fragment, size select, add linkers
4. Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence
5. Map to genome, transcriptome, and predicted exon junctions
6. Downstream analysis
Why sequence RNA (versus DNA)?
 Functional studies
 Genome may be constant but an experimental condition has a
pronounced effect on gene expression
 e.g. Drug treated vs. untreated cell line
 e.g. Wild type versus knock out mice
 Some molecular features can only be observed at the RNA level
 Alternative isoforms, fusion transcripts, RNA editing
 Predicting transcript sequence from genome sequence is difficult
 Alternative splicing, RNA editing, etc.
Why sequence RNA (versus DNA)?
 Interpreting mutations that do not have an obvious effect on
protein sequence
 ‘Regulatory’ mutations that affect what mRNA isoform is expressed
and how much
 e.g. splice sites, promoters, exonic/intronic splicing motifs, etc.
 Prioritizing protein coding somatic mutations (often
heterozygous)
 If the gene is not expressed, a mutation in that gene would be less
interesting
 If the gene is expressed but only from the wild type allele, this might
suggest loss-of-function (haploinsufficiency)
 If the mutant allele itself is expressed, this might suggest a candidate
drug target
Challenges
 Sample
 Purity?, quantity?, quality?
 RNAs consist of small exons that may be separated by large introns
 Mapping reads to genome is challenging
 The relative abundance of RNAs varies widely
 10^5 – 10^7 range (five to seven orders of magnitude)
 Since RNA sequencing works by random sampling, a small fraction of highly
expressed genes may consume the majority of reads
 Ribosomal and mitochondrial genes
 RNAs come in a wide range of sizes
 Small RNAs must be captured separately
 PolyA selection of large RNAs may result in 3’ end bias
 RNA is fragile compared to DNA (easily degraded)
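The random-sampling point above can be illustrated with a small calculation. This is only a sketch: the abundance values are invented for illustration, and transcript length is ignored, so the expected share of reads for a gene is simply its abundance divided by the total.

```python
def expected_read_share(abundances, top_n):
    """Expected fraction of reads taken by the top_n most abundant genes.

    Assumes reads are sampled in proportion to abundance only
    (a simplification: transcript length is ignored).
    """
    ranked = sorted(abundances, reverse=True)
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical transcriptome: 10 very highly expressed genes
# (e.g. ribosomal or mitochondrial) and 9,990 lowly expressed genes.
abundances = [100_000] * 10 + [10] * 9_990

share = expected_read_share(abundances, top_n=10)
print(f"Top 10 of {len(abundances)} genes take {share:.1%} of the reads")
```

In this made-up case 0.1% of the genes account for roughly nine tenths of the reads, which is why rRNA depletion or polyA selection matters for the remaining genes.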
Agilent example / interpretation
 http://www.alexaplatform.org/courses/2013/cbw/Agilent_Trace_Examples.pdf
 ‘RIN’ = RNA integrity number
 0 (bad) to 10 (good)
[Figure: example Agilent traces, RIN = 6.0 and RIN = 10]

Design considerations
 Standards, Guidelines and Best Practices for
RNA-seq
 The ENCODE Consortium
 Download from the Course Wiki
 Meta data to supply, replicates, sequencing depth,
control experiments, reporting standards, etc.
 http://www.alexaplatform.org/courses/2013/cbw/ENCODE_RNAseq_standards_v1.0.pdf
Replicates
 Technical Replicate
 Multiple instances of
sequence generation
 Flow Cells, Lanes, Indexes
 Biological Replicate
 Multiple isolations of cells
showing the same
phenotype, stage or other
experimental condition
 Some example
concerns/challenges:
 Environmental Factors,
Growth Conditions, Time
 Correlation Coefficient 0.92-0.98
Common analysis goals of RNA-Seq
analysis (what can you ask of the
data?)
 Gene expression and differential expression
 Alternative expression analysis
 Transcript discovery and annotation
 Allele specific expression
 Relating to SNPs or mutations
 Mutation discovery
 Fusion detection
 RNA editing
General themes of RNA-seq
workflows
 Each type of RNA-seq analysis has distinct requirements and
challenges but also a common theme:
1. Obtain raw data (convert format)
2. Align/assemble reads
3. Process alignment with a tool specific to the goal
• e.g. ‘cufflinks’ for expression analysis, ‘defuse’ for fusion detection, etc.
4. Post process
• Import into downstream software (R, Matlab, Cytoscape, Ingenuity,
etc.)
5. Summarize and visualize
• Create gene lists, prioritize candidates for validation, etc.
Tool recommendations
 Alignment
 BWA (PMID: 20080505)
 Align to genome + junction database
 Tophat (PMID: 19289445), STAR (PMID: 23104886), MapSplice (PMID: 20802226), hmmSplicer
(PMID: 21079731)
 Spliced alignment to genome
 Expression, differential expression, alternative expression
 Cufflinks/Cuffdiff (PMID: 20436464), ALEXA-seq (PMID: 20835245), RUM (PMID: 21775302)
 Fusion detection
 Tophat-fusion (PMID: 21835007), ChimeraScan (PMID: 21840877), Defuse (PMID: 21625565), Comrad
(PMID: 21478487)
 Transcript assembly
 Trinity (PMID: 21572440), Oases (PMID: 22368243), Trans-ABySS (PMID: 20935650)
 Visit the ‘SeqAnswers’ or ‘BioStar’ forums for more recommendations and discussion
 http://seqanswers.com/
 http://www.biostars.org/
SeqAnswers exercise
 Go to:
 http://seqanswers.com/
 Click the ‘Wiki’ link

 http://seqanswers.com/wiki/SEQanswers
 Visit the ‘Software Hub’

 http://seqanswers.com/wiki/Software
 Browse the software that has been added

 http://seqanswers.com/wiki/Special:BrowseData
 Use the tag cloud to identify tools related to your area of interest, e.g. RNA-seq alignment
Common questions: Should I remove
duplicates for RNA-seq?
 Maybe… more complicated question than for DNA
 Concern.
 Duplicates may correspond to biased PCR amplification of particular fragments
 For highly expressed, short genes, duplicates are expected even if there is no
amplification bias
 Removing them may reduce the dynamic range of expression estimates
 Assess library complexity and decide…
 If you do remove them, assess duplicates at the level of paired-end reads
(fragments) not single end reads
Common questions: How much
library depth is needed for RNA-seq?
 Depends on a number of factors:
 Question being asked of the data. Gene expression? Alternative
expression? Mutation calling?
 Tissue type, RNA preparation, quality of input RNA, library
construction method, etc.
 Sequencing type: read length, paired vs. unpaired, etc.
 Computational approach and resources
 Identify publications with similar goals
 Pilot experiment
 Good news: 1-2 lanes of recent Illumina HiSeq data should
be enough for most purposes
Common questions: What mapping
strategy should I use for RNA-seq?
 Depends on read length
 < 50 bp reads
 Use aligner like BWA and a genome + junction database
 Junction database needs to be tailored to read length
 Or you can use a standard junction database for all read lengths
and an aligner that allows substring alignments for the junctions
only (e.g. BLAST … slow).
 Assembly strategy may also work (e.g. Trans-ABySS)
 > 50 bp reads
 Spliced aligner such as Bowtie/TopHat
Visualization of spliced alignment of RNA-seq data
[IGV screenshot: Normal WGS, Tumor WGS and Tumor RNA-seq tracks at an acceptor site mutation]
Common questions: how reliable are
expression predictions from RNA-seq?
 Are novel exon-exon junctions real?
 What proportion validate by RT-PCR and Sanger sequencing?
 Are differential/alternative expression changes observed
between tissues accurate?
 How well do DE values correlate with qPCR?
 384 validations
 qPCR, RT-PCR, Sanger sequencing
 See ALEXA-Seq publication for details:
 Also includes comparison to microarrays
 Griffith et al.
Alternative expression analysis by RNA sequencing.
Nature Methods. 2010 Oct;7(10):843-847.
Validation (qualitative)
[Figure: 33 of 192 assays shown.] Overall validation rate = 85%

Validation (quantitative)
[Figure: qPCR of 192 exons identified as alternatively expressed by ALEXA-Seq.] Validation rate = 88%

BioStar exercise
 Go to the BioStar website:
 http://www.biostars.org/
 If you do not already have an OpenID (e.g. Google, Yahoo,
etc.)
 Login -> ‘get one’
 Login and set up your user profile
 Tasks:
 Find a question that seems useful and ‘vote it up’
 Answer a question [optional]
 Search for a topic area of interest and ask a question that has
not already been asked [optional]
Introduction to tutorial
(Module 1)
Module 1 – Introduction to RNA sequencing bioinformatics.ca

Bowtie/Tophat/Cufflinks/Cuffdiff
RNA-seq Pipeline
Inputs: raw sequence data (.fastq files), reference genome (.fa file), gene annotation (.gtf file)
1. Sequencing: RNA-seq reads (2 x 100 bp)
2. Read alignment: Bowtie/TopHat alignment (genome)
3. Transcript compilation: Cufflinks
4. Gene identification: Cufflinks (cuffmerge)
5. Differential expression: Cuffdiff (A:B comparison)
6. Visualization: CummeRbund
RNA sequencing,
transcriptome and
expression quantification
Henrik Lantz, BILS/SciLifeLab

Lecture synopsis
 What is RNA-seq?
 Basic concepts
 Mapping-based transcriptomics (genome-based)
 De novo based transcriptomics (genome-free)
 Expression counts and differential expression
 Transcript annotation
RNA-seq
[Figure: gene structure and expression]
DNA: UTR, Exon, Intron, Exon, Intron, Exon, Intron, Exon, UTR; each intron starts with GT and ends with AG; start codon ATG; stop codon TAG, TAA or TGA
Transcription
Pre-mRNA: UTR, start codon ATG, stop codon TAG, TAA or TGA, UTR, poly-A tail
Splicing
mRNA: UTR, ATG, TAG/TAA/TGA, UTR, AAAAAAAAA
Translation
Overview of RNA-Seq
From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html
Common Data Formats for RNA-Seq
FASTA format:
>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
FASTQ format:
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
Quality values in increasing order:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
You might get the data in a .sff or .bam format. Fastq-reads are easy to
extract from both of these binary (compressed) formats!
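The quality line can be turned into numbers as sketched below. The sketch assumes the Phred+33 encoding (each character's ASCII code minus 33); older Illumina files used an offset of 64, so the offset is a parameter. The function names are made up for this example.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / recent Illumina encoding);
    pass offset=64 for older Illumina files.
    """
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


quality = "ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA"
scores = phred_scores(quality)
print("lowest score:", min(scores), "highest score:", max(scores))
print("error probability at Q30:", error_probability(30))
```

A Phred score of 30 corresponds to an expected error rate of 1 in 1000 base calls.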
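Paired-End
[Figure: a DNA-fragment flanked by adapter+primer, with Read 1 and Read 2 sequenced from opposite ends; labelled spans: insert size (drawn twice) and inner mate distance between the two reads]
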
Paired-end gives you two files
FASTQ format (old):
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
New: @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
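A minimal sketch of splitting the new-style header into its named fields, using the example above. The class and function names are invented for this illustration, and the last field is kept as text on the assumption that it may hold something other than a plain number in real files.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int            # 1 or 2 for paired-end data
    is_filtered: bool    # True when the field is "Y"
    control_number: int
    sample_number: str   # kept as text (assumption, see above)


def parse_illumina_header(line):
    """Split a new-style FASTQ header line into its named fields."""
    name, _, comment = line.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = name.split(":")
    read, is_filtered, control, sample = comment.split(":")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), is_filtered == "Y",
        int(control), sample,
    )


header = parse_illumina_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(header)
```

In the new format the read number (1 or 2) moves from the old "/1" and "/2" suffix into the first field after the space.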
Transcript Reconstruction from RNA-Seq Reads
Nature Biotech, 2010

The Tuxedo Suite (TopHat, Cufflinks): End-to-end Genome-based RNA-Seq Analysis Software Package
Trinity (with GMAP): End-to-end Transcriptome-based RNA-Seq Analysis Software Package
Basic concepts of mapping-based RNA-seq - Spliced reads
[Figure: DNA (UTRs, GT/AG intron boundaries, start codon ATG, stop codon TAG, TAA or TGA) is transcribed to pre-mRNA, spliced to mRNA with a poly-A tail (AAAAAAAAA), then translated]
RNA-seq - Spliced reads
[Figure: the same DNA, pre-mRNA and mRNA diagram without the poly-A tail, with reads labelled "Pre-mRNA"]
Stranded RNA-seq
Overview of the Tuxedo Software Suite
Bowtie (fast short-read alignment)
TopHat (spliced short-read alignment)
Cufflinks (transcript reconstruction from alignments)
Cuffdiff (differential expression analysis)
CummeRbund (visualization & analysis)

Slide courtesy of Cole Trapnell
Tophat-mapped reads
Alignments are reported in a compact representation: SAM format
0 61G9EAAXX100520:5:100:10095:16477
1 83
2 chr1
3 51986
4 38
5 46M
6 =
7 51789
8 -264
9 CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA
10 ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG
11 MD:Z:67
12 NH:i:1
13 HI:i:1
14 NM:i:0
15 SM:i:38
16 XQ:i:40
17 X2:i:0
SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

0 61G9EAAXX100520:5:100:10095:16477 (read name)
1 83 (FLAGS stored as bit fields; 83 = 00001010011)
2 chr1 (alignment target)
3 51986 (position alignment starts)
4 38
5 46M (Compact description of the alignment in CIGAR format)
6 =
7 51789
8 -264
9 CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA (read sequence, oriented according to the forward alignment)
10 ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG (base quality values)
11 MD:Z:67 (Metadata, fields 11-17)
12 NH:i:1
13 HI:i:1
14 NM:i:0
15 SM:i:38
16 XQ:i:40
17 X2:i:0
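The FLAG and CIGAR fields of the record above can be decoded with a few lines of code. This is a sketch rather than a replacement for samtools: the bit meanings follow the SAM specification, the short names in the table are this example's own labels, and the spliced CIGAR string "20M1000N26M" is a hypothetical example, not taken from the record shown.

```python
import re

# Bit values from the SAM specification; the labels are this sketch's own.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered, counting N (skipped, e.g. intron)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(format(83, "011b"), decode_flag(83))
print(parse_cigar("46M"), reference_span("46M"))
# Hypothetical spliced read: 20 bases, a 1000 bp intron, 26 more bases.
print(parse_cigar("20M1000N26M"), reference_span("20M1000N26M"))
```

Flag 83 therefore describes the first read of a properly paired fragment, aligned to the reverse strand. In TopHat output, the N operation is how a spliced read records the intron it skips.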

Still not compact enough…
Millions to billions of reads take up a lot of space!!
Convert SAM to binary – BAM format.

Samtools
 Tools for
 converting SAM <-> BAM
 Viewing BAM files (eg. samtools view file.bam | less )
 Sorting BAM files, and lots more:
There is also CRAM…
 CRAM compression rate
File format              File size (GB)
SAM                      7.4
BAM                      1.9
CRAM lossless            1.4
CRAM 8 bins              0.8
CRAM no quality scores   0.26
Visualizing Alignments
of RNA-Seq reads
Text-based Alignment Viewer
% samtools tview alignments.bam target.fasta
IGV
IGV: Viewing Tophat Alignments
Transcript Reconstruction Using Cufflinks
From Martin & Wang. Nature Reviews Genetics. 2011



GFF file format
GFF3 file format
Seqid  source  type  start  end   score  strand  phase  attributes
Chr1   Snap    gene  234    3657  .      +       .      ID=gene1; Name=Snap1;
Chr1   Snap    mRNA  234    3657  .      +       .      ID=gene1.m1; Parent=gene1;
Chr1   Snap    exon  234    1543  .      +       .      ID=gene1.m1.exon1; Parent=gene1.m1;
Chr1   Snap    CDS   577    1543  .      +       0      ID=gene1.m1.CDS1; Parent=gene1.m1;
Chr1   Snap    exon  1822   2674  .      +       .      ID=gene1.m1.exon2; Parent=gene1.m1;
Chr1   Snap    CDS   1822   2674  .      +       2      ID=gene1.m1.CDS2; Parent=gene1.m1;
Other types: start_codon, stop_codon
Other attributes: Alias, note, ontology_term …
GTF file format
GTF file format
Seqid  source  type  start  end   score  strand  phase  attributes
Chr1   Snap    exon  234    1543  .      +       .      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    CDS   577    1543  .      +       0      gene_id "gene1";
Chr1   Snap    exon  1822   2674  .      +       .      gene_id "gene1";
Chr1   Snap    CDS   1822   2674  .      +       2      gene_id "gene1";
Other types: start_codon, stop_codon
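The main practical difference between the two formats is the attributes column: GFF3 uses key=value pairs, GTF uses key "value" pairs. A minimal sketch of reading both is given below. The function names are made up for this example, and the parsing is simplified (it assumes no semicolons inside values and no URL-escaped characters).

```python
def parse_gff3_attributes(field):
    """Parse a GFF3 attributes column such as 'ID=gene1.m1; Parent=gene1;'."""
    attributes = {}
    for item in field.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            attributes[key.strip()] = value.strip()
    return attributes


def parse_gtf_attributes(field):
    """Parse a GTF attributes column such as 'gene_id "gene1";'."""
    attributes = {}
    for item in field.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip().strip('"')
    return attributes


gff3 = parse_gff3_attributes("ID=gene1.m1.exon1; Parent=gene1.m1;")
gtf = parse_gtf_attributes('gene_id "gene1"; transcript_id "transcript1";')
print(gff3)
print(gtf)
```

GFF3 links features through ID and Parent, while GTF repeats gene_id and transcript_id on every line.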
De novo transcriptome assembly
No genome required
Empower studies of non-model organisms

 expressed gene content
 transcript abundance
 differential expression
The General Approach to
De novo RNA-Seq
Assembly
Using De Bruijn Graphs
Sequence Assembly via De Bruijn Graphs
From Martin & Wang, Nat. Rev. Genet. 2011
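As a rough illustration of the idea, a De Bruijn graph can be built by cutting every read into overlapping k-mers and linking each k-mer's prefix to its suffix. This is a toy sketch with made-up reads and a tiny k. It is not how Trinity or any other assembler is implemented, and it ignores sequencing errors, coverage and reverse complements.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy reads sharing a prefix and then diverging, as two isoforms
# of the same gene might.
reads = ["ACGTAC", "ACGTTG"]
graph = build_de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The node CGT has two outgoing edges here: a branch point of the kind that alternative splicing (or a sequencing error) produces in a real graph.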

Contrasting Genome and Transcriptome Assembly
Genome Assembly Transcriptome Assembly
• Uniform coverage • Exponentially distributed coverage levels

• Single contig per locus • Multiple contigs per locus (alt splicing)
• Double-stranded • Strand-specific
Trinity Aggregates Isolated Transcript Graphs
Genome Assembly Trinity Transcriptome Assembly

Single Massive Graph Many Thousands of Small Graphs
Entire chromosomes represented. Ideally, one graph per expressed gene.

Trinity – How it works:
RNA-Seq reads → Linear contigs → de-Bruijn graphs → Transcripts + Isoforms
Thousands of disjoint graphs

Trinity output: A multi-fasta file
Can align Trinity transcripts to genome scaffolds to examine intron/exon structures
(Trinity transcripts aligned using GMAP)
An alternative: Pacific Biosciences (PacBio)
 Pros: Long reads (average 4.5 kbp), can give you full
length transcripts in one read
 Cons: High error rate on longer fragments (15%),
expensive
Abundance Estimation
(Aka. Computing Expression Values)
Expression Value

Normalized Expression Values
• Transcript-mapped read counts are normalized for both length of the transcript and total depth of sequencing.
• Reported as: Number of RNA-Seq Fragments Per Kilobase of transcript per total Million fragments mapped (FPKM)
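The definition above can be written as a short calculation. This is a sketch of the plain FPKM formula only; the numbers are invented, and tools such as Cufflinks estimate fragment counts with more elaborate models than a simple count.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical example: 100 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments.
value = fpkm(fragments=100, transcript_length_bp=2_000,
             total_mapped_fragments=10_000_000)
print(value)
```

Doubling the transcript length or the sequencing depth at the same fragment count halves the FPKM, which is what makes values comparable across transcripts and libraries.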
Differential Expression Analysis
Using RNA-Seq
Differential expression
Mapped reads - condition 1
Genome
Mapped reads - condition 2

Diff. Expression Analysis Involves
 Counting reads
 Statistical significance testing
        Sample_A  Sample_B  Fold_Change  Significant?
Gene A  1         2         2-fold       No
Gene B  100       200       2-fold       Yes

Beware of concluding fold change
from small numbers of counts
[Figure: Poisson distributions for counts based on 2-fold expression differences]
Low counts: no confidence in 2-fold difference. Likely observed by chance.
High counts: high confidence in 2-fold difference. Unlikely observed by chance.
From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for
More Counts = More Statistical Power
Example: 5000 total reads per sample.
Observed 2-fold differences in read counts.
        SampleA  SampleB  Fisher’s Exact Test (P-value)
geneA   1        2        1.00
geneB   10       20       0.098
geneC   100      200      < 0.001
Tools for DE analysis with RNA-Seq
ShrinkSeq
NOISeq
baySeq
vst
voom
SAMseq
TSPM
DESeq
EBSeq
NBPSeq
edgeR
+ other (not-R)
including Cuffdiff
See: http://www.biomedcentral.com/1471-2105/14/91
Use of transcripts
 Transcripts can be assembled de novo or from
mapped reads and then used in gene
expression/differential expression studies
 Can be functionally annotated
Functional annotation
 Take transcripts from Cufflinks or Trinity
 Annotate the sequences functionally in
Blast2GO
Blast2GO
KEGG-mapping
