Gene Expression RNA Sequence

The document discusses the main concepts and steps in RNA sequencing analysis including sample preparation, mapping reads, abundance estimation, differential expression analysis, and functional annotation of transcripts.

The main steps discussed are isolating RNA, sequencing, mapping reads to genomes and transcriptomes, and performing downstream analyses such as expression analysis, differential expression, and functional annotation.

The challenges mentioned include sample purity, quantity, and quality. RNA is fragile and easily degraded, and transcripts consist of small exons separated by large introns, which makes mapping reads to the genome difficult. Long-read sequencing technologies are also noted to have a high error rate.

Gene expression

CUBT401: RNA-seq

Zedias Chikwambi
2017
Assignment: http://combine-australia.github.io/RNAseq-R/
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 1
Introduction to RNA sequencing (lecture)

www.malachigriffith.org
mgriffit@genome.wustl.edu

www.obigriffith.org
ogriffit@genome.wustl.edu
Learning objectives of the course

 Module 1: Introduction to RNA sequencing


 Module 2: RNA-seq alignment and visualization
 Module 3: Expression and Differential Expression
 Module 4: Isoform discovery and alternative expression
 Module 5: Gene fusion discovery

 Tutorials
 Provide a working example of an RNA-seq analysis pipeline
 Run in a ‘reasonable’ amount of time with modest computer resources
 Self contained, self explanatory, portable
Learning objectives of module 1

 Introduction to the theory and practice of RNA sequencing (RNA-seq) analysis
 Rationale for sequencing RNA
 Challenges specific to RNA-seq
 General goals and themes of RNA-seq analysis work flows
 Common technical questions related to RNA-seq analysis
 Getting help outside of this course
 Introduction to the RNA-seq hands on tutorial
Gene expression
RNA sequencing
[Figure: RNA-seq workflow]
1. Samples of interest: Condition 1 (normal colon) and Condition 2 (colon tumor)
2. Isolate RNAs
3. Generate cDNA, fragment, size select, add linkers
4. Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence
5. Map to genome, transcriptome, and predicted exon junctions
6. Downstream analysis
Why sequence RNA (versus DNA)?
 Functional studies
 Genome may be constant but an experimental condition has a
pronounced effect on gene expression
 e.g. Drug treated vs. untreated cell line
 e.g. Wild type versus knock out mice

 Some molecular features can only be observed at the RNA level
 Alternative isoforms, fusion transcripts, RNA editing
 Predicting transcript sequence from genome sequence is difficult
 Alternative splicing, RNA editing, etc.
Why sequence RNA (versus DNA)?
 Interpreting mutations that do not have an obvious effect on protein sequence
 ‘Regulatory’ mutations that affect what mRNA isoform is expressed and how much
 e.g. splice sites, promoters, exonic/intronic splicing motifs, etc.
 Prioritizing protein coding somatic mutations (often heterozygous)
 If the gene is not expressed, a mutation in that gene would be less interesting
 If the gene is expressed but only from the wild type allele, this might suggest loss-of-function (haploinsufficiency)
 If the mutant allele itself is expressed, this might suggest a candidate drug target
Challenges
 Sample
 Purity?, quantity?, quality?
 RNAs consist of small exons that may be separated by large introns
 Mapping reads to genome is challenging
 The relative abundance of RNAs varies widely
 A range of 10^5 to 10^7 (5 to 7 orders of magnitude)
 Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads
 Ribosomal and mitochondrial genes
 RNAs come in a wide range of sizes
 Small RNAs must be captured separately
 PolyA selection of large RNAs may result in 3’ end bias
 RNA is fragile compared to DNA (easily degraded)
Agilent example / interpretation
 http://www.alexaplatform.org/courses/2013/cbw/Agilent_Trace_Examples.pdf
 ‘RIN’ = RNA integrity number
 1 (bad) to 10 (good)

[Figure: example Agilent traces at RIN = 6.0 and RIN = 10]


Design considerations
 Standards, Guidelines and Best Practices for RNA-seq
 The ENCODE Consortium
 Download from the Course Wiki
 Meta data to supply, replicates, sequencing depth, control experiments, reporting standards, etc.

 http://www.alexaplatform.org/courses/2013/cbw/ENCODE_RNAseq_standards_v1.0.pdf
Replicates
 Technical Replicate
 Multiple instances of sequence generation
 Flow Cells, Lanes, Indexes
 Biological Replicate
 Multiple isolations of cells showing the same phenotype, stage or other experimental condition
 Some example concerns/challenges:
 Environmental Factors, Growth Conditions, Time
 Correlation Coefficient 0.92-0.98
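One way to check replicate concordance is to correlate per-gene expression between two replicates. The following is a minimal sketch, assuming per-gene counts are already available as two equal-length lists. The log2(count + 1) transform, the function names, and the example counts are illustrative assumptions and do not come from the course material.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def replicate_correlation(counts_a, counts_b):
    """Correlate two replicates on a log2(count + 1) scale (an assumed transform)."""
    log_a = [math.log2(c + 1) for c in counts_a]
    log_b = [math.log2(c + 1) for c in counts_b]
    return pearson(log_a, log_b)

if __name__ == "__main__":
    # Hypothetical per-gene counts for two replicates of the same condition.
    rep1 = [0, 10, 100, 1000]
    rep2 = [1, 12, 90, 1100]
    print(round(replicate_correlation(rep1, rep2), 3))
```

The log transform keeps a handful of very highly expressed genes from dominating the coefficient.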
Common analysis goals of RNA-Seq analysis (what can you ask of the data?)
 Gene expression and differential expression
 Alternative expression analysis
 Transcript discovery and annotation
 Allele specific expression
 Relating to SNPs or mutations
 Mutation discovery
 Fusion detection
 RNA editing
General themes of RNA-seq workflows
 Each type of RNA-seq analysis has distinct requirements and challenges but also a common theme:
1. Obtain raw data (convert format)
2. Align/assemble reads
3. Process alignment with a tool specific to the goal
• e.g. ‘cufflinks’ for expression analysis, ‘defuse’ for fusion detection, etc.
4. Post process
• Import into downstream software (R, Matlab, Cytoscape, Ingenuity, etc.)
5. Summarize and visualize
• Create gene lists, prioritize candidates for validation, etc.
Tool recommendations
 Alignment
 BWA (PMID: 20080505)
 Align to genome + junction database
 Tophat (PMID: 19289445), STAR (PMID: 23104886), MapSplice (PMID: 20802226), hmmSplicer (PMID: 21079731)
 Spliced alignment to genome

 Expression, differential expression, alternative expression
 Cufflinks/Cuffdiff (PMID: 20436464), ALEXA-seq (PMID: 20835245), RUM (PMID: 21775302)

 Fusion detection
 Tophat-fusion (PMID: 21835007), ChimeraScan (PMID: 21840877), Defuse (PMID: 21625565), Comrad (PMID: 21478487)

 Transcript assembly
 Trinity (PMID: 21572440), Oases (PMID: 22368243), Trans-ABySS (PMID: 20935650)

 Visit the ‘SeqAnswers’ or ‘BioStar’ forums for more recommendations and discussion
 http://seqanswers.com/
 http://www.biostars.org/
SeqAnswers exercise
 Go to:
 http://seqanswers.com/

 Click the ‘Wiki’ link


 http://seqanswers.com/wiki/SEQanswers

 Visit the ‘Software Hub’


 http://seqanswers.com/wiki/Software

 Browse the software that has been added


 http://seqanswers.com/wiki/Special:BrowseData

 Use the tag cloud to identify tools related to your area of interest. e.g. RNA-seq alignment
Common questions: Should I remove duplicates for RNA-seq?
 Maybe… more complicated question than for DNA
 Concern:
 Duplicates may correspond to biased PCR amplification of particular fragments
 For highly expressed, short genes, duplicates are expected even if there is no amplification bias
 Removing them may reduce the dynamic range of expression estimates
 Assess library complexity and decide…
 If you do remove them, assess duplicates at the level of paired-end reads (fragments), not single-end reads
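The difference between fragment-level and single-end duplicate assessment can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of any particular duplicate-marking tool: the `Fragment` record, its field names, and the example coordinates are hypothetical, and a fragment counts as a duplicate only when both mates share chromosome, start position, and strand with an earlier fragment.

```python
from collections import Counter
from typing import NamedTuple

class Fragment(NamedTuple):
    """Hypothetical summary of one aligned read pair."""
    chrom: str
    read1_start: int
    read1_reverse: bool
    read2_start: int
    read2_reverse: bool

def fragment_duplication_rate(fragments):
    """Fraction of fragments that repeat an earlier fragment on BOTH mates."""
    counts = Counter(fragments)
    total = sum(counts.values())
    return 0.0 if total == 0 else (total - len(counts)) / total

def single_end_duplication_rate(fragments):
    """Fraction flagged when only read 1 is considered (tends to over-call)."""
    counts = Counter((f.chrom, f.read1_start, f.read1_reverse) for f in fragments)
    total = sum(counts.values())
    return 0.0 if total == 0 else (total - len(counts)) / total

if __name__ == "__main__":
    frags = [
        Fragment("chr1", 100, False, 300, True),
        Fragment("chr1", 100, False, 350, True),  # same read 1 start, different mate
        Fragment("chr1", 100, False, 300, True),  # true fragment-level duplicate
        Fragment("chr1", 500, False, 700, True),
    ]
    print(fragment_duplication_rate(frags))    # fragment-level estimate
    print(single_end_duplication_rate(frags))  # single-end estimate
```

In this toy input, the single-end view flags twice as many reads as the fragment-level view, which is the over-calling the slide warns about for highly expressed, short genes.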
Common questions: How much library depth is needed for RNA-seq?
 Depends on a number of factors:
 Question being asked of the data. Gene expression? Alternative expression? Mutation calling?
 Tissue type, RNA preparation, quality of input RNA, library construction method, etc.
 Sequencing type: read length, paired vs. unpaired, etc.
 Computational approach and resources
 Identify publications with similar goals
 Pilot experiment
 Good news: 1-2 lanes of recent Illumina HiSeq data should be enough for most purposes
Common questions: What mapping strategy should I use for RNA-seq?
 Depends on read length
 < 50 bp reads
 Use aligner like BWA and a genome + junction database
 Junction database needs to be tailored to read length
 Or you can use a standard junction database for all read lengths and an aligner that allows substring alignments for the junctions only (e.g. BLAST … slow).
 Assembly strategy may also work (e.g. Trans-ABySS)

 > 50 bp reads
 Spliced aligner such as Bowtie/TopHat
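To illustrate why a junction database is tailored to read length, here is a small sketch that builds one exon-exon junction sequence. It assumes the flank on each side is the read length minus a minimum overhang, so that every read fitting inside the junction sequence must cross the junction with at least that many bases on each side. The function name, the `min_overhang` parameter, and its default are hypothetical choices for illustration.

```python
def junction_sequence(upstream_exon, downstream_exon, read_length, min_overhang=4):
    """Join the end of one exon to the start of the next for a junction database.

    Each flank is (read_length - min_overhang) bases long, so a read of
    read_length bases cannot align to this sequence without spanning the
    junction by at least min_overhang bases on both sides.
    """
    if not 0 < min_overhang < read_length:
        raise ValueError("min_overhang must be between 1 and read_length - 1")
    flank = read_length - min_overhang
    left = upstream_exon[-flank:]     # last `flank` bases (whole exon if shorter)
    right = downstream_exon[:flank]   # first `flank` bases
    return left + right

if __name__ == "__main__":
    # Toy exons; real exon sequences would come from a genome plus annotation.
    print(junction_sequence("AAAAACCCCC", "GGGGGTTTTT", read_length=6, min_overhang=2))
```

Longer flanks would let reads align entirely within one exon, duplicating genome alignments. Shorter flanks would miss reads that barely cross the junction. This is why the flank length follows the read length.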
Visualization of spliced alignment of RNA-seq data

[Figure: IGV screenshot of an acceptor site mutation, with tracks for Normal WGS, Tumor WGS, and Tumor RNA-seq]
Common questions: how reliable are expression predictions from RNA-seq?
 Are novel exon-exon junctions real?
 What proportion validate by RT-PCR and Sanger sequencing?
 Are differential/alternative expression changes observed between tissues accurate?
 How well do DE values correlate with qPCR?
 384 validations
 qPCR, RT-PCR, Sanger sequencing
 See ALEXA-Seq publication for details:
 Also includes comparison to microarrays
 Griffith et al. Alternative expression analysis by RNA sequencing. Nature Methods. 2010 Oct;7(10):843-847.
Validation (qualitative)

33 of 192 assays shown. Overall validation rate = 85%


Validation (quantitative)

qPCR of 192 exons identified as alternatively expressed by ALEXA-Seq

Validation rate = 88%


BioStar exercise
 Go to the BioStar website:
 http://www.biostars.org/
 If you do not already have an OpenID (e.g. Google, Yahoo, etc.)
 Login -> ‘get one’
 Login and set up your user profile
 Tasks:
 Find a question that seems useful and ‘vote it up’
 Answer a question [optional]
 Search for a topic area of interest and ask a question that has not already been asked [optional]
Introduction to tutorial
(Module 1)

Module 1 – Introduction to RNA sequencing bioinformatics.ca


Bowtie/Tophat/Cufflinks/Cuffdiff RNA-seq Pipeline
[Figure: pipeline stages and the tool used at each]
Sequencing: RNA-seq reads (2 x 100 bp)
Read alignment: Bowtie/TopHat alignment (genome)
Transcript compilation: Cufflinks
Gene identification: Cufflinks (cuffmerge)
Differential expression: Cuffdiff (A:B comparison)
Visualization: CummeRbund
Inputs: Raw sequence data (.fastq files), Reference genome (.fa file), Gene annotation (.gtf file)


RNA sequencing, transcriptome and expression quantification

Henrik Lantz, BILS/SciLifeLab


Lecture synopsis
 What is RNA-seq?
 Basic concepts
 Mapping-based transcriptomics (genome-based)
 De novo based transcriptomics (genome-free)
 Expression counts and differential expression
 Transcript annotation
RNA-seq
[Figure: from DNA to mRNA. DNA: exons separated by introns (each intron begins with GT and ends with AG), flanked by UTRs, with a start codon (ATG) and a stop codon (TAG, TAA, TGA). Transcription gives the pre-mRNA; splicing removes the introns to give the polyadenylated mRNA (UTR, ATG ... TAG/TAA/TGA, UTR, AAAAAAAAA), which is then translated.]
Overview of RNA-Seq

From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html
Common Data Formats for RNA-Seq

FASTA format:
>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT

FASTQ format:

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA

Quality values in increasing order:


!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

You might get the data in a .sff or .bam format. Fastq-reads are easy to
extract from both of these binary (compressed) formats!
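The quality line encodes one Phred score per base as an ASCII character. A minimal decoding sketch follows, assuming the Phred+33 offset used by Sanger-style and recent Illumina FASTQ files. Older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    scores = [ord(ch) - offset for ch in quality_string]
    if any(q < 0 for q in scores):
        raise ValueError("character below the chosen offset; wrong encoding?")
    return scores

def error_probability(phred):
    """Probability that a base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)

if __name__ == "__main__":
    quals = phred_scores("ACCCCCCCCCBC?C")  # prefix-like sample of a quality line
    print(quals)
    print([round(error_probability(q), 5) for q in quals[:3]])
```

A score of 20 corresponds to a 1 in 100 chance of an incorrect base call, and a score of 30 to 1 in 1000.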
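Paired-End
[Figure: paired-end fragment layout showing Read 1 and Read 2 sequenced from opposite ends of the DNA fragment, the adapter+primer regions, the insert size, and the inner mate distance]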


Paired-end gives you two files
FASTQ format (old):

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA

@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
New: @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
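The new-style header can be split into its named fields with a few string operations. The sketch below follows the field layout given on the slide. The `IlluminaHeader` type and `parse_header` function are illustrative names, and the last field is kept as a string as a precaution, since it may not always be a plain number.

```python
from typing import NamedTuple

class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    sample_number: str  # kept as text; an assumption made for safety

def parse_header(line):
    """Parse '@<instrument>:...:<y-pos> <read>:<is filtered>:<control>:<sample>'."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    location, description = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, filtered, control, sample = description.split(":")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y", int(control), sample,
    )

if __name__ == "__main__":
    print(parse_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2"))
```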
Transcript Reconstruction from RNA-Seq Reads

Nature Biotech, 2010

[Figure, built up over several slides: two routes from RNA-seq reads to transcripts]
The Tuxedo Suite (TopHat, Cufflinks): End-to-end Genome-based RNA-Seq Analysis Software Package
Trinity, GMAP: End-to-end Transcriptome-based RNA-Seq Analysis Software Package
Basic concepts of mapping-based RNA-seq - Spliced reads

[Figure: gene structure from DNA (exons, introns with GT/AG boundaries, UTRs, start and stop codons) through pre-mRNA to spliced, polyadenylated mRNA, followed by translation]
RNA-seq - Spliced reads
[Figure: DNA, pre-mRNA, and mRNA diagram annotated with ‘Pre-mRNA’ labels]
Stranded RNA-seq
Overview of the Tuxedo Software Suite

Bowtie (fast short-read alignment)

TopHat (spliced short-read alignment)

Cufflinks (transcript reconstruction from alignments)

Cuffdiff (differential expression analysis)

CummeRbund (visualization & analysis)


Slide courtesy of Cole Trapnell
Tophat-mapped reads
Alignments are reported in a compact representation: SAM format

0 61G9EAAXX100520:5:100:10095:16477
1 83
2 chr1
3 51986
4 38
5 46M
6 =
7 51789
8 -264
9 CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA
10 ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG
11 MD:Z:67
12 NH:i:1
13 HI:i:1
14 NM:i:0
15 SM:i:38
16 XQ:i:40
17 X2:i:0

SAM format specification: http://samtools.sourceforge.net/SAM1.pdf


Alignments are reported in a compact representation: SAM format

0 61G9EAAXX100520:5:100:10095:16477 (read name)
1 83 (FLAGS stored as bit fields; 83 = 00001010011)
2 chr1 (alignment target)
3 51986 (position alignment starts)
4 38 (mapping quality)
5 46M (compact description of the alignment in CIGAR format)
6 = (mate's alignment target; ‘=’ means the same as field 2)
7 51789 (mate's alignment position)
8 -264 (observed template length)
9 CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA (read sequence, oriented according to the forward alignment)
10 ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG (base quality values)
11 MD:Z:67 (fields 11 to 17: metadata)
12 NH:i:1
13 HI:i:1
14 NM:i:0
15 SM:i:38
16 XQ:i:40
17 X2:i:0

SAM format specification: http://samtools.sourceforge.net/SAM1.pdf
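The record above can be taken apart programmatically. The following sketch splits one tab-separated SAM line into its mandatory fields and optional tags, decodes the FLAG bit field, and tokenises the CIGAR string. It is a simplified illustration with assumed function names and dictionary keys. In practice a dedicated library would normally be used.

```python
import re

# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string such as '46M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def parse_sam_line(line):
    """Parse one alignment line into a dictionary (header lines not handled)."""
    f = line.rstrip("\n").split("\t")
    record = {
        "qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
        "mapq": int(f[4]), "cigar": parse_cigar(f[5]), "rnext": f[6],
        "pnext": int(f[7]), "tlen": int(f[8]), "seq": f[9], "qual": f[10],
        "tags": {},
    }
    for tag in f[11:]:
        name, value_type, value = tag.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    return record

if __name__ == "__main__":
    line = "\t".join([
        "61G9EAAXX100520:5:100:10095:16477", "83", "chr1", "51986", "38", "46M",
        "=", "51789", "-264",
        "CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA",
        "###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG",
        "MD:Z:67", "NH:i:1", "HI:i:1", "NM:i:0", "SM:i:38", "XQ:i:40", "X2:i:0",
    ])
    rec = parse_sam_line(line)
    print(decode_flag(rec["flag"]), rec["cigar"], rec["tags"]["NH"])
```

For FLAG 83 the set bits are 1, 2, 16, and 64: the read is paired, mapped in a proper pair, aligned to the reverse strand, and is the first read of its pair.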


Still not compact enough…
Millions to billions of reads take up a lot of space!!
Convert SAM to binary – BAM format.


Samtools
 Tools for
 converting SAM <-> BAM
 Viewing BAM files (eg. samtools view file.bam | less )
 Sorting BAM files, and lots more:
There is also CRAM…
CRAM compression rate

File format               File size (GB)
SAM                       7.4
BAM                       1.9
CRAM lossless             1.4
CRAM 8 bins               0.8
CRAM no quality scores    0.26
Visualizing Alignments
of RNA-Seq reads
Text-based Alignment Viewer
% samtools tview alignments.bam target.fasta
IGV
IGV: Viewing Tophat Alignments
Transcript Reconstruction Using Cufflinks

From Martin & Wang. Nature Reviews Genetics. 2011


GFF3 file format
Seqid  source  type  start  end   score  strand  phase  attributes
Chr1   Snap    gene  234    3657  .      +       .      ID=gene1; Name=Snap1;
Chr1   Snap    mRNA  234    3657  .      +       .      ID=gene1.m1; Parent=gene1;
Chr1   Snap    exon  234    1543  .      +       .      ID=gene1.m1.exon1; Parent=gene1.m1;
Chr1   Snap    CDS   577    1543  .      +       0      ID=gene1.m1.CDS1; Parent=gene1.m1;
Chr1   Snap    exon  1822   2674  .      +       .      ID=gene1.m1.exon2; Parent=gene1.m1;
Chr1   Snap    CDS   1822   2674  .      +       2      ID=gene1.m1.CDS2; Parent=gene1.m1;
Other types: start_codon, stop_codon
Other attributes: Alias, note, ontology_term, …
GTF file format
Seqid  source  type  start  end   score  strand  phase  attributes
Chr1   Snap    exon  234    1543  .      +       .      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    CDS   577    1543  .      +       0      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    exon  1822   2674  .      +       .      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    CDS   1822   2674  .      +       2      gene_id "gene1"; transcript_id "transcript1";
Other types: start_codon, stop_codon
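A GTF record is nine tab-separated columns, with the last column holding `key "value";` pairs. The sketch below parses one such line into a dictionary. It is a simplified illustration with an assumed function name and output layout. It does not handle comment lines, repeated attribute keys, or unquoted attribute values.

```python
import re

def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary of typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF feature line has exactly 9 tab-separated columns")
    seqid, source, feature_type, start, end, score, strand, phase, attr_text = cols
    # Attributes look like: gene_id "gene1"; transcript_id "transcript1";
    attributes = dict(re.findall(r'(\S+)\s+"([^"]*)"\s*;', attr_text))
    return {
        "seqid": seqid, "source": source, "type": feature_type,
        "start": int(start), "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }

if __name__ == "__main__":
    line = 'Chr1\tSnap\texon\t234\t1543\t.\t+\t.\tgene_id "gene1"; transcript_id "transcript1";'
    feature = parse_gtf_line(line)
    # Coordinates are 1-based and inclusive, so length is end - start + 1.
    print(feature["attributes"], feature["end"] - feature["start"] + 1)
```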
Transcript Reconstruction from RNA-Seq Reads

The Tuxedo Suite (TopHat, Cufflinks): End-to-end Genome-based RNA-Seq Analysis Software Package
Trinity, GMAP: End-to-end Transcriptome-based RNA-Seq Analysis Software Package
De novo transcriptome assembly
No genome required

Empower studies of non-model organisms


 expressed gene content
 transcript abundance
 differential expression
The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs
Sequence Assembly via De Bruijn Graphs

From Martin & Wang, Nat. Rev. Genet. 2011
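The idea behind De Bruijn graph assembly can be shown on a toy example. Each k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a contig is read off by following nodes that have exactly one successor. This is a generic teaching sketch with made-up reads and assumed function names. It ignores sequencing errors, reverse complements, and coverage, and it is not how Trinity or any other assembler is actually implemented.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on cycles
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig

if __name__ == "__main__":
    # Three overlapping toy reads drawn from the sequence ATGGCGTGCA.
    reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
    graph = de_bruijn_graph(reads, k=4)
    print(walk_unambiguous_path(graph, "ATG"))
```

Where a node has more than one successor the walk stops. In a transcriptome such branch points can reflect alternative splicing, which is one reason transcript assembly yields several contigs per locus.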
Contrasting Genome and Transcriptome Assembly

Genome Assembly              Transcriptome Assembly
• Uniform coverage           • Exponentially distributed coverage levels
• Single contig per locus    • Multiple contigs per locus (alt splicing)
• Double-stranded            • Strand-specific
Trinity Aggregates Isolated Transcript Graphs

Genome Assembly                    Trinity Transcriptome Assembly
Single Massive Graph               Many Thousands of Small Graphs
Entire chromosomes represented.    Ideally, one graph per expressed gene.


Trinity – How it works:

RNA-Seq reads -> Linear contigs -> de Bruijn graphs -> Transcripts + Isoforms

Thousands of disjoint graphs


Trinity output: A multi-fasta file
Can align Trinity transcripts to genome scaffolds to examine intron/exon structures
(Trinity transcripts aligned using GMAP)
An alternative: Pacific Biosciences (PacBio)
 Pros: Long reads (average 4.5 kbp), can give you full length transcripts in one read
 Cons: High error rate on longer fragments (15%), expensive
Abundance Estimation
(Aka. Computing Expression Values)
Expression Value

Slide courtesy of Cole Trapnell


Normalized Expression Values
• Transcript-mapped read counts are normalized for both length of the transcript and total depth of sequencing.
• Reported as FPKM: number of RNA-Seq Fragments Per Kilobase of transcript per total Million fragments mapped
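The FPKM definition translates directly into arithmetic: divide the fragment count by the transcript length in kilobases and by the total mapped fragments in millions. A minimal sketch follows. The function name and the example numbers are illustrative, and real tools such as Cufflinks estimate the fragment count per transcript with a statistical model rather than a raw count.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    FPKM = fragments / (length_bp / 1e3) / (total_mapped_fragments / 1e6)
         = fragments * 1e9 / (length_bp * total_mapped_fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

if __name__ == "__main__":
    # Hypothetical transcript: 500 fragments, 2,000 bp long, 10 million mapped fragments.
    print(fpkm(500, 2000, 10_000_000))
    # The same count on a transcript twice as long gives half the FPKM.
    print(fpkm(500, 4000, 10_000_000))
```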
Differential Expression Analysis
Using RNA-Seq
Differential expression

Mapped reads - condition 1

Genome

Mapped reads - condition 2


Diff. Expression Analysis Involves
 Counting reads
 Statistical significance testing

          Sample_A   Sample_B   Fold_Change   Significant?
Gene A    1          2          2-fold        No
Gene B    100        200        2-fold        Yes


Beware of concluding fold change
from small numbers of counts
Poisson distributions for counts based on 2-fold expression differences

No confidence in 2-fold difference. Likely observed by chance.

High confidence in 2-fold difference. Unlikely observed by chance.

From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for
More Counts = More Statistical Power
Example: 5000 total reads per sample. Observed 2-fold differences in read counts.

         Sample A   Sample B   Fisher’s Exact Test (P-value)
geneA    1          2          1.00
geneB    10         20         0.098
geneC    100        200        < 0.001
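The P-values in the table can be reproduced with a two-sided Fisher's exact test on a 2x2 table of gene reads versus all other reads in each sample. The sketch below computes it from the hypergeometric distribution using only the standard library. The function name and argument layout are illustrative, and the two-sided rule used (summing all outcomes no more likely than the observed one) is one common convention, stated here as an assumption.

```python
from math import comb

def fisher_exact_two_sided(count_a, count_b, total_a, total_b):
    """Two-sided Fisher's exact test for gene counts in two samples.

    The 2x2 table is [[count_a, total_a - count_a], [count_b, total_b - count_b]].
    """
    hits = count_a + count_b
    denominator = comb(total_a + total_b, hits)

    def probability(k):
        # Hypergeometric probability that sample A holds k of the `hits` reads.
        return comb(total_a, k) * comb(total_b, hits - k) / denominator

    lowest = max(0, hits - total_b)
    highest = min(hits, total_a)
    observed = probability(count_a)
    # Sum every outcome that is no more likely than the observed one.
    p_value = sum(
        p for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

if __name__ == "__main__":
    for gene, a, b in [("geneA", 1, 2), ("geneB", 10, 20), ("geneC", 100, 200)]:
        print(gene, fisher_exact_two_sided(a, b, 5000, 5000))
```

The fold change is the same in all three rows, but only the larger counts give enough evidence to rule out chance.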


Tools for DE analysis with RNA-Seq
ShrinkSeq
NOISeq
baySeq
vst
voom
SAMseq
TSPM
DESeq
EBSeq
NBPSeq
edgeR

+ other (not-R) including CuffDiff
See: http://www.biomedcentral.com/1471-2105/14/91
Use of transcripts
 Transcripts can be assembled de novo or from mapped reads and then used in gene expression/differential expression studies
 Can be functionally annotated
Functional annotation
 Take transcripts from Cufflinks or Trinity
 Annotate the sequences functionally in Blast2GO
Blast2GO
KEGG-mapping
