

Alternative mRNA transcription, processing, and translation: Insights from RNA sequencing

Article in Trends in Genetics · January 2015

DOI: 10.1016/j.tig.2015.01.001



2 authors, including:

Eleonora De Klerk
UCSF University of California, San Francisco



Feature Review

Alternative mRNA transcription, processing, and translation: insights from RNA sequencing
Eleonora de Klerk and Peter A.C. ‘t Hoen
Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands

The human transcriptome comprises >80 000 protein-coding transcripts and the estimated number of proteins synthesized from these transcripts is in the range of 250 000 to 1 million. These transcripts and proteins are encoded by less than 20 000 genes, suggesting extensive regulation at the transcriptional, post-transcriptional, and translational level. Here we review how RNA sequencing (RNA-seq) technologies have increased our understanding of the mechanisms that give rise to alternative transcripts and their alternative translation. We highlight four different regulatory processes: alternative transcription initiation, alternative splicing, alternative polyadenylation, and alternative translation initiation. We discuss their transcriptome-wide distribution, their impact on protein expression, their biological relevance, and the possible molecular mechanisms leading to their alternative regulation. We conclude with a discussion of the coordination and the interdependence of these four regulatory layers.

Corresponding author: ‘t Hoen, P.A.C. (
Keywords: gene expression; transcriptome; RNA sequencing; alternative polyadenylation; alternative splicing; translation.
© 2015 Elsevier Ltd. All rights reserved.

Regulatory layers defining gene expression
The diversification of cellular and organismal functions observed in higher eukaryotes cannot be explained by the sheer number of genes but is mostly due to the expression of different transcripts and proteins from the same genes. Variation in the expression of coding genes is controlled at multiple levels, from transcription to RNA processing and translation. Alternative transcripts and proteins may arise from alternative transcription initiation, alternative splicing, alternative polyadenylation (APA), and alternative translation initiation. These co- and post-transcriptional regulatory mechanisms expand the genome’s coding capacity, modifying protein function, stability, localization, and expression levels. In this review, we discuss how high-throughput RNA-seq has helped us to understand these four regulatory processes. We describe their transcriptome-wide abundance in mammalian cells, their impact on protein expression, their biological relevance, and the molecular mechanisms underlying these processes. Finally, we highlight how the interdependence between transcription, RNA processing, and translation restricts the number of combinations of possible alternative transcripts and proteins.

Initiation of transcription: alternative promoters
During the biogenesis of mRNAs, regulation of transcription initiation represents the first layer in the control of gene expression [1–4]. Alternative transcription initiation leads to the formation of transcripts differing in their first exon or in the length of the 5′ untranslated region (5′-UTR). The use of alternative first exons leads to transcripts with different open reading frames (ORFs) and diversifies the repertoire of encoded proteins giving rise to protein isoforms with alternative N termini [5] (Figure 1A). Alternatively, transcripts sharing the same coding region but a different 5′-UTR can be subject to differential translational regulation (Figure 1B) [6] through short upstream ORFs (uORFs) involved in translational control [7–9] or in the production of biologically relevant peptides [10–12].
The use of alternative promoters and transcription start sites (TSSs) in protein coding transcripts was established before the development of transcriptome-wide approaches, through studies based on a method called cap analysis of gene expression (CAGE) [13]. CAGE still represents the basic technology for the detection of TSSs. Recently, several high-throughput CAGE methods, such as DeepCAGE, have been developed [14]. These transcriptome-wide studies suggest that TSS use is highly tissue specific [4,15–18] and that the number of alternative TSSs differs by tissue type, with the hippocampus accounting for a larger number of TSSs than any other tissue [18,19]. To what extent alternative TSSs lead to alternative 5′ noncoding regions or translate into novel protein isoforms is virtually impossible to determine from DeepCAGE reads, which consist of 25 or 26 nucleotides. To assess the potential for novel ORFs arising from the use of alternative TSSs, it is essential to integrate DeepCAGE data with RNA-seq, ribosome profiling, and proteomics.
The FANTOM Consortium is leading most of the research in the field of promoters and TSSs. In their most recent TSS survey [4], which includes approximately 200 human primary cell types, 150 human tissues, and 250 human cancer cell lines, it was shown that on average there are four TSSs per gene, but the number of TSSs reported strictly relies on the filtering method used. An estimate of the transcriptome-wide distribution of alternative TSSs can indeed be complicated by the presence of CAGE peaks marking enhancer regions [4], 3′-UTRs
Trends in Genetics xx (2015) 1–12 1

[Figure 1 panels: (A) Alternative first exons (Tpm3; DeepCAGE tracks for Prol cells and Diff cells; RefSeq genes; Ensembl gene predictions). (B) Alternative 5′-UTR (Cryab; uORF, stop codon, pORF). (C) Promoters and enhancers (TSS 1, TSS 2, TSS 3; P1, P2, P3; E; Exons 1 to 3). (D) Long-range transcriptional control (Lmbr1, Rnf32, Shh; 850 kb). Key: limb-specific, brain-specific, epithelial lining-specific, floorplate-specific.]

Figure 1. Alternative transcription initiation. (A) Data from a deep cap analysis of gene expression (DeepCAGE) experiment showing alternative transcription start sites
(TSSs) used during muscle differentiation in proliferating myoblasts and differentiated myotubes [16]. In the Tpm3 gene, different promoters lead to the formation of
transcripts with different first exons. One alternative TSS (TSS3) is specifically used in differentiated cells. (B) In the Cryab gene, proliferating cells make use of an alternative
TSS to extend their 5′ untranslated region (5′-UTR). The sequence of the 5′-UTR is shown below the reference track. The extension of the 5′-UTR leads to the transcription of
a potential upstream open reading frame (uORF) starting at a canonical AUG codon and ending before the start codon of the primary ORF (pORF). (C) An illustrative
example of cell- and tissue-specific alternative TSSs regulated by the binding of transcription factors (TFs) to promoters and enhancer regions. While TF1 and TF2 bind to
promoters (P1, P2) surrounding the TSS, TF3 binds to a distal upstream sequence corresponding to an enhancer region (E), which enhances transcription from a third TSS
(TSS3). Some TFs are present in multiple tissues (TF1) whereas others are tissue specific (TF2, TF3), and their transcription can also be regulated during cell differentiation
(TF1 regulates transcription in undifferentiated cells and TF2 in differentiated cells). (D) Long-range transcriptional control mediated by enhancers. Transcriptional
regulation of the Shh gene is tightly controlled during development by enhancer regions located up to 850 kb from the gene. Whereas some enhancers are located within
the coding region of Shh, others are located in intergenic regions or within intronic regions of the Lmbr1 and Rnf32 genes. Genes are depicted as gray boxes. Known
enhancer regions in the mouse are marked in different colors according to their tissue specificity.

[4,20,21], coding regions (a phenomenon called exon painting [16,22,23]), and promoter-associated short RNAs (PASRs) [20]. Whereas exon painting may arise as a consequence of recapping of degradation products, many other CAGE peaks represent short capped transcripts whose functions remain largely unknown. A striking recent finding from this large TSS survey [4] is that most genes are regulated in a tissue-specific manner and only a small percentage can be considered to be truly housekeeping. The use of alternative tissue-specific TSSs seems to be regulated by the presence of enhancer regions more than by alternative core promoters. Half of all detected CpG island promoters and more than 90% of all promoters lacking both CpG islands and a TATA box exhibit cell type-restricted expression due to the presence of proximal enhancers [4].
The molecular mechanisms responsible for the choice of alternative promoters and TSSs can be divided into two categories: alteration of the chromatin state and regulation mediated by cell- and tissue-specific transcription factors (Figure 1C). Understanding the biological importance of alternative and tissue-specific TSSs requires learning how the choice of a specific TSS is made and which transcription factor and regulatory networks are involved. This can be achieved by making inferences on transcriptional networks. In a DeepCAGE time-course study on the differentiation of human monocytic leukemia cells [17], the authors predicted transcription factor binding sites around the

TSSs identified in each condition and subsequently built a network model of gene expression using motif activity response analysis. This provided important insights into the key regulators active in transcriptional control in distinct phases of differentiation. Similarly, another study [24] inferred transcriptional regulatory networks after the perturbation of specific transcription factors (PU.1, IRF8, MYB and SP1) in the same cells. This led to the discovery of target genes for each transcription factor and to the identification of de novo binding site motifs.
Many studies focusing on single genes have shown that the choice of a specific TSS has critical roles during development [25–27] and cell differentiation [28] and that aberrations in alternative promoter and TSS use lead to various diseases including cancer [29,30], neuropsychiatric disorders [31], and developmental disorders [32]. Whereas some disorders are caused by epigenetic changes or genetic aberrations in the promoter region, others are caused by genetic changes in distal elements affecting long-range transcriptional regulation. The ENCODE project has shown the presence of more than 1000 long-range interactions between TSSs and distal elements within a range of 120 kb [3]. An example of such a long-range interaction is Shh [32] (Figure 1D), a gene that is spatially and temporally regulated during development. To date, ten Shh enhancers have been identified, located within a region of 1 Mb in humans and 850 kb in mice (Figure 1D). These enhancers play a key role during development, as indicated by mutations in the limb-specific enhancer that lead to various skeletal limb abnormalities.

Splicing: alternative exons
During and after transcription, almost all mRNAs are spliced. Alternatively spliced transcripts result from the differential inclusion of subsets of exons (Figure 1A and Box 1). Of the regulatory mechanisms discussed in this review, alternative splicing is the most prevalent event, affecting approximately 95% of mammalian genes [33]. RNA-seq has the potential to elucidate the number, structure, and abundance of alternative transcripts and the molecular mechanisms responsible for their formation.

Box 1. Alternative splicing events
Five major alternative splicing events are distinguished: exon skipping (also called cassette exon), use of alternative acceptor and/or donor sites, intron retention, and mutually exclusive exons. Exon skipping appears to be the most common, occurring in 38% of mouse and human genes, whereas intron retention is less common (3%) [135]. How the spliceosome recognizes alternative exons and decides which exons to include is still not fully understood. Before the advent of RNA-seq, studies revealed some general characteristics in conserved alternative cassette exons: they tend to be smaller in size compared with constitutive exons [136] and their length is divisible by three, thus maintaining the same reading frame when the alternative exon is skipped or included [137]. Non-conserved cassette exons do not show these characteristics. In addition, alternative exons seem to contain weaker splice sites (the exon–intron junctions at the 5′ and 3′ ends of introns; i.e., donor and acceptor sites), although the other primary cis-acting elements used to define the intron (the branch site and the polypyrimidine tract located upstream of the acceptor site) are generally similar to those found in constitutive exons [138].

From analysis of the transcriptomes of 15 different human cell lines [1], it appears that up to 25 different transcripts can be produced from a single gene and that up to 12 alternative transcripts may be expressed in a particular cell. Alternative transcripts are not expressed at the same level, but one transcript is usually dominant [34]. According to the latest GENCODE release (version 20), there are almost 80 000 transcript variants encoded by about 20 000 protein-coding genes in humans, an average of four transcripts per gene. A previous GENCODE release (version 7) reported an average of six transcripts per gene, while RefSeq, the University of California, Santa Cruz (UCSC), and the Collaborative Consensus Coding Sequence (CCDS) project [35] report a much lower average. These discordances suggest that variations in the number of transcripts per gene reported are due to the different methods used to annotate RNA sequences, highlighting the current limitations in fully characterizing transcriptomes.
It remains challenging to predict which transcripts are present in a specific cell type. Splice site selection depends on multiple parameters including the presence of splicing regulators, the strength of splice sites, the structure of exon–intron junctions, and the process of transcription. So far, various molecular mechanisms have been shown to regulate alternative splicing.
Next to conserved cis elements such as the splice donor and acceptor sites, branch sites, and polypyrimidine tracts, a range of other sequence motifs are recognized by various auxiliary splicing factors. These auxiliary RNA-binding proteins (RBPs) are not part of the spliceosomal machinery but can enhance or suppress alternative splicing by interfering with it [36–39]. Various crosslinking and RNA immunoprecipitation techniques, followed by next-generation sequencing, have been developed to map RNA–protein interactions in vivo [14]. An early goal of these studies was the identification of RNA-binding sites. Many of these studies have shown that RBPs recognize short (3–7 nt) degenerate motifs, have multiple RNA-binding domains, and display variable efficiency when multiple motifs cluster together [40,41]. Moreover, many RBPs regulate the expression of other auxiliary factors. The differing cellular and temporal localization of RBPs [42,43] may explain the different dynamics regulating alternative and constitutive splicing: whereas constitutive splicing mainly occurs cotranscriptionally, alternative splicing mainly occurs post-transcriptionally [44]. For recent mechanistic models of splicing regulation through RBPs, see [45]. Alternative splicing can also be regulated in a manner totally independent of auxiliary splicing factors [46]. Splicing silencer sequences regulate alternative splicing when competing 5′ splice sites are present in the same RNA molecule (Figure 2B). The competing 5′ splice sites are equally well recognized by the U1 small nuclear ribonucleoprotein (snRNP), but silencer sequences alter the configuration in which U1 binds to the 5′ splice sites, leading to silencing of the 5′ splice site. This can change the efficiency of a splice site: weak 5′ splice sites can be recognized and used instead of stronger 5′ splice sites. RNA-seq datasets can be used to computationally identify common and tissue-specific

splicing regulatory sequences. These studies have shown that the same sequence can act as an enhancer or a silencer in different tissues, but experimental validations of these predicted regulatory sequences are needed to confirm these observations [47].
Alternative splicing can also be regulated by RNA secondary structures (Figure 2C). Short-range RNA secondary structures can mask primary cis elements such as the acceptor and donor sites or the polypyrimidine tract [48,49]. They have been associated with alternative splicing at alternative 5′ splice sites. For example, the RBP MBNL1 forms a secondary structure upstream of exon 5 of human TNNT2 and upstream of the fetal exon of mouse Tnnt3, blocking U2AF65 binding to the polypyrimidine tract [50,51]. Long-range secondary structures bring distant splice sites into closer proximity, facilitating alternative splicing, and are associated with weak alternative 3′ splice sites [49]. Computational studies based on RNA-seq datasets suggest that the splicing of thousands of mammalian genes is dependent on RNA structures, both short and long range [49]. Recently developed high-throughput techniques combine nuclease digestion [52] or chemical probing [53] with next-generation sequencing to provide transcriptome-wide RNA structural information. Two studies have recently shown a transcriptome-wide relationship between secondary structures and alternative splicing [54,55], by reporting the presence of strong secondary structures at 5′ splice sites that correlate with unspliced exons. The question that remains unsolved by RNA-seq studies is whether the plethora of transcript variants produced affect protein expression. This question has been recently addressed by studies using ribosome profiling, discussed further below. A general observation from transcriptome-wide studies is that alternative splicing is essential for development [56,57] and cell, tissue [58], and species specificity [59]. A plausible explanation of how alternative exons can confer such specificity is the inclusion or exclusion of binding motifs and post-translational modification sites, as shown in a study where the authors investigated the structural and functional properties of alternative exons [60].
Due to the widespread role of alternative splicing, it is unsurprising that errors in this process lead to various diseases, from neurodegenerative disorders to muscle dystrophies and cancer; we refer the reader to recent detailed reviews [61,62].

3′ End maturation: APA
Another step in mRNA processing is the process of polyadenylation [63]. The use of APA sites represents an extra regulatory layer during gene expression that results in the formation of transcripts differing in their 3′ ends. Transcripts arising from APA may differ in their coding region (if APA sites are located in a different exon or intron) (Figure 3A) or in the length of their 3′-UTRs [tandem polyadenylation sites (PASs)] (Figure 3B). The impact of APA on the regulation of gene expression can be extended

[Figure 2 panels: (A) Alternative exons (RNA-seq tracks for muscle and brain; RefSeq genes; alternative skipped exons). (B) Splicing silencer sequence (I, II, III: weak 5′ ss, strong 5′ ss, 3′ ss; U1; SSS). (C) Short- and long-range RNA structures (I, II).]

Figure 2. Alternative splicing. (A) Data from an RNA sequencing (RNA-seq) experiment showing tissue-specific alternative splicing [139]. The SLC25A3 gene is differentially spliced in brain and muscle tissues through exon skipping. (B) Alternative splicing regulated by silencer sequences. In (I) the U1 small nuclear ribonucleoprotein (snRNP) splicing factor recognizes both strong and weak 5′ splice sites (5′ ss) but splicing occurs only at the strong 5′ ss. In (II) a splicing silencer sequence (sss) is located downstream of the strong 5′ ss. U1 binds both the weak and the strong 5′ ss, but the conformation in which it binds the strong 5′ ss is suboptimal for splicing; therefore, only the weak 5′ ss is used for splicing. In (III) the sss is located downstream of both the weak and the strong 5′ ss. U1 binds both with suboptimal conformation, but only the strong 5′ ss is used for splicing. (C) Alternative splicing regulated by RNA secondary structures. Example of short- and long-range RNA secondary structures. (I) The short-range RNA secondary structure masks a strong 5′ ss, leading to the recognition of a weaker 5′ ss located upstream. (II) The long-range RNA secondary structure brings together a strong 5′ ss and a weak 3′ ss, causing the loss of a complete exon (in green) and a region of the last exon (in purple).
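
Box 1 notes that conserved cassette exons tend to have lengths divisible by three, so the downstream reading frame is the same whether the exon is skipped or included. The following minimal Python sketch illustrates that arithmetic only; it is not a method from the cited studies. The function names, the toy exon lengths, and the simplifying assumption that the coding sequence starts at the first nucleotide of the first exon are all hypothetical.

```python
def downstream_phase(exon_lengths, skipped_index=None):
    """Return the codon phase (0, 1, or 2) at which the last exon begins.

    Assumption (illustrative): the coding sequence starts at the first
    nucleotide of the first exon, so phase = total upstream length mod 3.
    """
    upstream = [
        length
        for i, length in enumerate(exon_lengths[:-1])
        if i != skipped_index
    ]
    return sum(upstream) % 3


def skipping_preserves_frame(exon_lengths, skipped_index):
    """True if skipping the given exon leaves the last exon in the same frame."""
    return downstream_phase(exon_lengths) == downstream_phase(
        exon_lengths, skipped_index
    )


# Hypothetical exon lengths in nucleotides (not taken from any real gene).
symmetric = [120, 90, 150]   # cassette exon length divisible by three
asymmetric = [120, 91, 150]  # cassette exon length not divisible by three

print(skipping_preserves_frame(symmetric, 1))   # frame kept
print(skipping_preserves_frame(asymmetric, 1))  # frame shifted
```

Under these assumptions, skipping a cassette exon whose length is not a multiple of three changes the phase of every downstream codon, which is consistent with the frame-preservation property described for conserved cassette exons.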


[Figure 3 panels: (A) Intronic alternative polyadenylation (distal PAS, intronic PAS; RefSeq genes; alternative 3′ terminal exon). (B) Tandem alternative polyadenylation in (I) Arih2 and (II) Ccnd1 (OPMD and control tracks; distal and proximal PAS; full and truncated 3′-UTR; loss/gain of miRNA binding site; loss/gain of RBP binding sites; HuR motif: uukruuu; HNRNPL motifs: amayama, acacrav; HNRNPK motif: ccawmcc; HNRNPU motif: uguauug). (C) Polyadenylation site selection (Pol II, CPSF, CstF, CFIm; canonical and noncanonical PA signals; proximal and distal PAS).]

Figure 3. Alternative polyadenylation (APA). (A) Data from a poly(A)-sequencing experiment showing APA in the intron of the Luc7l2 gene [71], leading to an intronic
proximal polyadenylation site (PAS) located in a different terminal exon giving rise to transcript variants with different open reading frames (ORFs). (B) Two examples of
tandem APA in muscle tissue from a mouse model for oculopharyngeal muscle dystrophy (OPMD) [71]. In the Arih2 gene (I), both the distal and the proximal PASs can be
used in the disease state. Recognition of a proximal PAS leads to shortening of the 3′ untranslated region (3′-UTR) and loss of a miRNA binding site, causing an increase in
transcript levels. In the Ccnd1 gene (II), shortening of the 3′-UTR leads to the loss of many recognition sites for RNA-binding proteins (RBPs) that stabilize the transcript. Loss
of stability leads to a decrease in transcript level. (C) Model mechanisms regulating tandem APA. Common sequences in the 3′-UTR that regulate polyadenylation are the
upstream sequence element (USE), the UGUU sequence recognized by cleavage factor I (CFIm), the polyadenylation (PA) signal recognized by cleavage and
polyadenylation specific factor (CPSF), and the downstream sequence element (DSE) recognized by cleavage stimulation factor (CstF). CPSF and CstF are brought to the
RNA by RNA polymerase II (Pol II), together with poly(A)-binding protein nuclear 1 (PABPN1), through its C-terminal domain (CTD). Generally, CPSF recognizes the
canonical PA signal and cuts at a distal PAS, at a CA dinucleotide (I). If PABPN1 or CFIm is present at a lower concentration, the CPSF recognizes noncanonical (weaker) PA
signals (II) and cuts at proximal PASs, leading to the formation of transcripts with truncated 3′-UTRs.
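
The distinction between canonical and noncanonical PA signals drawn in this caption can be made concrete with a short sequence scan. The Python sketch below searches a 3′-UTR for the canonical motif A(A/U)UAAA given in the main text and asks whether a cleavage site has such a signal upstream. It is an illustration, not the pipeline used in the cited studies; the function names, the example sequence, and the default 40-nt search window are hypothetical choices.

```python
import re

# Canonical PA signal A(A/U)UAAA; the lookahead allows overlapping matches.
CANONICAL_PA_SIGNAL = re.compile(r"(?=A[AU]UAAA)")


def find_canonical_pa_signals(utr3):
    """Return 0-based start positions of canonical PA signals in a 3'-UTR."""
    rna = utr3.upper().replace("T", "U")  # accept DNA or RNA alphabets
    return [match.start() for match in CANONICAL_PA_SIGNAL.finditer(rna)]


def has_canonical_signal_upstream(utr3, cleavage_pos, window=40):
    """True if a complete canonical PA signal lies within `window` nt
    upstream of the cleavage position (window size is an assumption)."""
    rna = utr3.upper().replace("T", "U")
    region = rna[max(0, cleavage_pos - window):cleavage_pos]
    return bool(find_canonical_pa_signals(region))


# Hypothetical 3'-UTR fragment carrying one AAUAAA and one AUUAAA signal.
utr = "GGCAAUAAAGGCUUCCGGAUUAAAGG"
print(find_canonical_pa_signals(utr))
print(has_canonical_signal_upstream(utr, cleavage_pos=17, window=5))
```

A PAS with no hit in its upstream window would fall into the noncanonical class in this toy classification; real analyses would also need to account for signal variants and the auxiliary elements (USE, DSE) named above.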

through effects on transcript localization [64], stability, and translation efficiency [65] and on the nature of the encoded protein. Numerous RNA-seq methods have contributed to our understanding of APA, ranging from RNA-seq studies able to detect overall changes in polyadenylation, to serial analysis of gene expression (SAGE)-based methods able to specifically quantify and characterize the 3′ ends of transcripts, to a series of dedicated protocols for the

accurate detection and quantification of PASs [14]. These transcriptome-wide studies have deepened our understanding of APA, providing information on newly discovered PASs, elucidating the impact of APA on gene expression, and discovering new APA regulatory mechanisms.
Although the number of alternative PASs detected differs greatly between studies [66–68], these studies contribute to the notion of the ubiquity of APA events, which involve approximately 70% of human genes. According to a study conducted on 15 human cell lines, there are on average two PASs per gene [1]. APA within the same last exon (tandem 3′-UTRs) is the most abundant type of APA [68]. Intronic APA events are reported less frequently and thousands of intronic PASs are usually suppressed [69]. APA is generally linked to changes in gene expression levels and, ultimately, to protein abundance. Studies have shown an inverse correlation between 3′-UTR length and protein expression levels [70,71]. Some human tissues (such as brain, testis, lung, and breast) are enriched for highly abundant transcripts with short 3′-UTRs, whereas others (such as heart and skeletal muscle) contain many low-abundance transcripts with long 3′-UTRs [72]. Increased expression of transcripts with shortened 3′-UTRs can be explained by loss of miRNA target sequences, loss of UPF1-binding sites, which otherwise lead to RNA decay [73], or loss of AU-rich elements (AREs), which otherwise lead to ARE-directed mRNA degradation [71]. However, there are many exceptions to the general rule, as proteins that bind to the 3′-UTR can also stabilize mRNAs [74–76].
Transcriptome-wide studies have been undertaken to elucidate the dynamics of APA regulation. In general, disruption of the polyadenylation machinery leads to loss of fidelity in the choice of PAS and shortening of the 3′-UTRs. There are numerous 3′ processing factors involved in polyadenylation; nevertheless, changes in the expression levels of a single specific factor are sufficient to influence the choice of PAS. For example, decreased levels of cleavage factor I (CFIm) 68 or poly(A)-binding protein nuclear 1 (PABPN1) lead to transcriptome-wide shortening of 3′-UTRs, corresponding to an increased preference for noncanonical polyadenylation signals (Figure 3C) [70,77,78].
Many recent transcriptome-wide studies have confirmed that distal PASs generally have a strong canonical signal motif [A(A/U)UAAA], whereas proximal PASs diverge from the canonical sequence [68,79–81]. Interestingly, tissue-specific regulated PASs can be depleted of the canonical motif. For example, APA in brain seems to be regulated by an A-rich motif starting just downstream of the PAS [82]. A-rich sequences have also been reported upstream of cleavage sites for transcripts lacking canonical motifs [83].
Numerous studies based on expressed sequence tags and microarrays have previously shown the biological relevance of APA (Box 2) [84,85]. APA profiles are tissue specific and appear to be tightly regulated during development and cell differentiation. Most of the findings achieved by recent transcriptome-wide approaches confirm at a larger scale what was previously observed. The tissue specificity of APA and the correlation between tissue and 3′-UTR length seem to be highly conserved between different species and APA profiles from different species are similar for the same tissues [80,81,86]. Modulation of APA has also been widely observed during proliferation, differentiation, and development [68,87–89].

Box 2. The biological relevance of APA
A study based on expressed sequence tags comprising 42 human tissues [140] showed that certain tissues preferentially produce mRNAs of a certain length. Brain, pancreatic islet, ear, bone marrow, and uterus showed a preference for distal PASs, leading to longer 3′-UTRs. Retina, placenta, ovary, and blood showed a preference for proximal PASs. This classification might change when considering the levels at which these mRNAs are expressed. Although most of the transcripts detected in the brain contain distal PASs, the transcripts that are highly abundant generally show a preference for proximal PASs and have short 3′-UTRs [72]. Other studies showed that the choice between a distal and a proximal PAS was modulated during differentiation and development. Progressive lengthening of 3′-UTRs was shown for most of the transcripts during cell differentiation and during embryonic development [141]. By contrast, shortening was observed during proliferation [142] and during reprogramming of somatic cells [143].

Widespread alteration of APA profiles has been observed in several diseases. Many studies have reported shortening of 3′-UTRs in cancer [90–92], linked to extensive upregulation and activation of oncogenes. However, shortening of 3′-UTRs poorly correlates with breast, lung, and colorectal cancer prognosis [93,94], suggesting that the relationship between APA and cancer is not straightforward. More recently, altered APA profiles have been linked to muscle disorders such as myotonic dystrophy [95] and oculopharyngeal muscular dystrophy [70].

From mRNA to protein: alternative translation initiation
In addition to the regulation of transcription and processing, the translation of transcripts is also tightly regulated. Regulation of translation defines not only the abundance of a protein but also its amino acid composition through the use of different start codons [96], as translation may start at uORFs or at alternative ORFs (aORFs) (Box 3 and Figure 4).

Box 3. Alternative translation initiation
uORFs are located in the 5′-UTR of a transcript. Depending on the presence or absence of stop codons and their coding frame, a uORF can overlap with the pORF or not. Overlapping and in-frame uORFs lead to N-terminal extended protein isoforms [8], whereas non-overlapping uORFs affect the translation of pORFs in various ways [144]: they can block the translation of the pORFs, reducing protein production; they can promote reinitiation of translation at downstream start codons; or they can enhance translation of the main pORFs. aORFs are located downstream of the annotated start codon. In-frame aORFs give rise to N-terminal truncated isoforms [145]. uORFs and aORFs can also be out of frame with respect to the pORFs and lead to the production of different peptides. The sequences translated in more than one reading frame are called dual coding regions [103]. We also note that uORFs and aORFs are not the only events that increase the diversity of the translated mRNAs and affect protein production. The genetic code can be read in alternative ways, leading to frameshifting, hopping, stop codon read-through, recoding, and codon reassignment [146,147], topics beyond the scope of this review.

In the past, changes in protein synthesis were measured exclusively based on proteomic approaches or estimated based on total mRNA levels. More recently, they have been

assessed via ribosome profiling [97]. Deep sequencing of RNA fragments protected by ribosomes determines the position of the ribosomes on the RNA molecule at nucleotide resolution, allowing exact characterization of the translation initiation site (TIS) and quantification of levels of translation. Ribosome profiling studies in combination with RNA-seq have assessed the extent of alternative translation initiation, provided insights into the regulatory mechanisms of this process, and shed light on how it impacts gene expression.
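As a minimal sketch of how footprint positions translate into initiation-site positions, the Python snippet below shifts the 5′ end of each mapped read by a fixed offset. The 12-nt value is the one given in the Figure 4 caption; treating it as a constant is an assumption of this sketch (in practice the offset is often calibrated per read length), and the function names and the `(position, strand)` input format are hypothetical.

```python
from collections import Counter

# Assumed fixed distance (nt) between a footprint's 5' end and the codon
# being decoded; 12 is the value mentioned in the Figure 4 caption.
P_SITE_OFFSET = 12


def p_site_position(five_prime_end, strand="+", offset=P_SITE_OFFSET):
    """Shift a read's 5' end downstream by `offset` nucleotides."""
    if strand == "+":
        return five_prime_end + offset
    if strand == "-":
        # On the minus strand, downstream means a smaller genomic coordinate.
        return five_prime_end - offset
    raise ValueError("strand must be '+' or '-'")


def p_site_profile(reads, offset=P_SITE_OFFSET):
    """Count footprints per inferred position.

    reads -- iterable of (five_prime_end, strand) tuples (hypothetical format)
    """
    return Counter(p_site_position(pos, strand, offset) for pos, strand in reads)
```

A peak in the resulting profile marks a candidate TIS at the shifted coordinate rather than at the coordinate where the read 5′ ends pile up.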
A common finding of many recent ribosome profiling studies is the widespread use of alternative TISs. Initiation of translation at alternative TISs may be caused by various forms of stress but is also observed under normal physiological conditions. Between 50% and 65% of transcripts contain more than one TIS [7,98,99]. Most of the detected TISs are located upstream of the annotated start codons (50–60%), leading to potential uORFs. A minority are located downstream of the annotated start codons (20%) and lead to N-terminally truncated proteins or out-of-frame ORFs. However, some ribosome profiling peaks detected as alternative TISs may represent cases of ribosomal stalling. To distinguish these from genuine TISs, proteomic data are essential. These are often difficult to obtain because the peptides are usually short and unstable. Moreover, the study of the proteome in a high-throughput fashion presents certain technical limitations, especially for low-abundance proteins, which are difficult to detect among a diverse pool of proteins [100].
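The summary statistics quoted above (the share of transcripts with more than one TIS, and the share of TISs upstream or downstream of the annotated start codon) could be tallied along the following lines. This is a hedged sketch: the input dictionaries, the function name `summarize_tis`, and the toy values in the usage note are hypothetical and do not reproduce any dataset from the cited studies.

```python
from collections import Counter


def summarize_tis(tis_by_transcript, annotated_start):
    """Summarize TIS usage across transcripts.

    tis_by_transcript -- dict: transcript id -> list of detected TIS positions
    annotated_start   -- dict: transcript id -> annotated start codon position
    Positions are transcript coordinates (hypothetical input format).
    """
    if not tis_by_transcript:
        raise ValueError("no transcripts supplied")

    multi = 0
    location = Counter()
    for tx, sites in tis_by_transcript.items():
        unique_sites = set(sites)
        if len(unique_sites) > 1:
            multi += 1
        ref = annotated_start[tx]
        for pos in unique_sites:
            if pos < ref:
                location["upstream"] += 1
            elif pos > ref:
                location["downstream"] += 1
            else:
                location["annotated"] += 1

    total = sum(location.values())
    return {
        "frac_multi_tis": multi / len(tis_by_transcript),
        "frac_by_location": {k: v / total for k, v in location.items()},
    }
```

With two toy transcripts, one using sites at 10, 50, and 80 (annotated start 50) and one using only its annotated start, half of the transcripts have more than one TIS and a quarter of all sites fall upstream.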
Insights into the mechanisms regulating the choice of a uORF or aORF over a primary ORF are starting to emerge. Initiation of translation at near-cognate codons and non-AUG codons, previously reported for a small number of mRNAs, appears to be common, as approximately 50% of translation is initiated at noncanonical codons [98,99]. These noncanonical start codons are enriched in uORFs. By contrast, TISs located downstream of annotated TISs comprise mainly AUG codons. The use of near-cognate and non-AUG start codons has been confirmed by mass spectrometry [101]. Interestingly, these codons are recoded to regular methionines, as all of the produced proteins seem to contain an N-terminal methionine.
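To make the codon categories concrete, the Python sketch below enumerates near-cognate codons, taking the term in its usual sense of a codon that differs from AUG at exactly one position. That definition is an assumption here, since the text does not spell it out, and the function names are hypothetical.

```python
BASES = "ACGU"


def near_cognate_codons(reference="AUG"):
    """Return all codons that differ from `reference` at exactly one position."""
    codons = set()
    for i, ref_base in enumerate(reference):
        for base in BASES:
            if base != ref_base:
                codons.add(reference[:i] + base + reference[i + 1:])
    return codons


def classify_start_codon(codon):
    """Label a start codon as AUG, near-cognate, or other non-AUG."""
    rna = codon.upper().replace("T", "U")  # accept DNA or RNA alphabet
    if rna == "AUG":
        return "AUG"
    if rna in near_cognate_codons():
        return "near-cognate"
    return "other non-AUG"
```

Under this definition there are nine near-cognate codons (three positions, three alternative bases each), including CUG, GUG, and UUG.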

[Figure 4 (image not reproduced). Panel labels: (A) Alternative open reading frame, with (I) aORF in frame, truncated isoform (Rps20) and (II) aORF out of frame, novel protein; (B) Upstream open reading frame (Cryab). Tracks: Prol cells, Diff cells, RefSeq genes (mm10), with TIS 1 (pORF) and TIS 2 (aORF) marked.]

Figure 4. Alternative translation initiation. Alternative translation initiation sites (TISs) detected by ribosome profiling (PRJEB7207). (A) Examples of alternative TISs leading to alternative open reading frames (aORFs) in frame (I) or out of frame (II) with the primary ORF (pORF). In the Rps20 gene (I), a switch in TIS use occurs during cell differentiation. Proliferating cells use two TISs, one corresponding to the annotated start codon and the other corresponding to an aORF, the latter of which leads to a truncated protein isoform. The alternative TIS is shown in the highlighted box. The top part (gray) shows the three possible frames and the blue bar shows the frame of the pORF. Because ribosome profiling peaks are usually displayed using only the 5′ end of each mapped read, the black line indicates the actual TIS location of the aORF, located 12 bp downstream of the mapped peak. In the Crip1 gene (II), only one transcription start site (TSS) is present (top track, deep cap analysis of gene expression (DeepCAGE) [16]) but two different TISs are used (bottom track, ribosome profiling), one corresponding to the annotated start codon and one located downstream of the annotated start codon, leading to an aORF. The alternative TIS is shown in the highlighted box. The alternative TIS corresponds to an AUG start codon that is out of frame compared with the pORF, indicating the presence of a dual coding region. (B) Examples of alternative TISs leading to an upstream ORF (uORF) in the Cryab gene. Proliferating cells use two TISs, one located in the 5′ untranslated region (5′-UTR) and one corresponding to the annotated start codon. The sequence of the 5′-UTR incorporated by the alternative TIS is shown below the reference track. Extension of the 5′-UTR leads to the translation of a uORF, with a canonical AUG codon and ending before the start codon of the pORF, negatively regulating translation.

Recent studies support the leaky scanning theory [102], according to which the choice of a downstream TIS depends on the strength of the Kozak consensus sequence. It was shown on a transcriptome-wide scale that initiation at downstream TISs usually occurs when the Kozak sequence in the annotated start codon is suboptimal. A similar mechanism applies for initiation at uORFs. uORFs are translated in parallel to their downstream primary ORFs (pORFs) if the start codon used in the uORF is a non-AUG,


but translation of pORFs is usually repressed if the uORFs contain an AUG start codon and a strong Kozak sequence [99].

Both aORFs and uORFs can give rise to ORFs with reading frames different from the pORFs, a phenomenon known as dual coding [103]. The triplet periodicity observed in ribosome profiling data enables the detection of dually decoded regions. Although the extent of dual coding observed in the human genome in ribosome profiling studies is only approximately 1%, it has been suggested that this might be an underestimate due to technical and analytical limitations (low coverage and the assumption that the two frames must be translated at the same rate) [103].

The extent to which mRNA levels explain differences in protein abundance is still debated. Although some studies have reported a poor correlation [104] (in the range of approximately 40% of protein levels explained by mRNA levels [105–108], or even less than 20% [109]), others claim a much higher correlation of up to approximately 80% [110]. Ribosome-associated RNA levels seem to be a good proxy for protein levels, as the correlations between mRNA and protein observed are between 60% and 90% [109,111]. Nevertheless, a study that compared changes at mRNA levels and ribosome-bound mRNAs showed profound uncoupling between transcription and translation in several different experiments after treatments with extracellular stimuli or during cell and tissue differentiation [112]. Therefore, it remains unclear whether regulation at the translational level has a major influence on global protein abundance or whether it is restricted to a subset of genes.

Transcription, RNA processing, and translation: interdependent processes
The molecular machineries involved in transcription and RNA processing are spatiotemporally coupled. Several reviews have extensively described cotranscriptional regulation of capping, splicing, and polyadenylation [113,114]. RNA polymerase II (Pol II) is an important player in the regulation of this coupling, as its C terminus recruits proteins involved in capping, splicing, and polyadenylation [115]. There is ample support for the coupling between transcription and splicing. Splicing predominantly occurs during transcription [1,44], as indicated by the following three observations: many introns are already spliced in chromatin-associated RNAs; there is enrichment of spliceosomal small nuclear RNAs in chromatin-associated RNAs; and exons that are spliced are enriched for epigenetic chromatin marks [116]. Nevertheless, splicing events at the 3′ end of a transcript might occur post-transcriptionally, giving a general 5′–3′ trend in splicing completion.

Transcription and splicing are coupled not simply in space and time but are also jointly responsible for the formation of alternative transcripts. The interdependence of different RNA-processing events restricts the number of combinations of alternative TSSs, exons, and PASs. Splicing and polyadenylation might be influenced not only by the transcription elongation rate but also by transcription initiation: a lower elongation rate is linked to slower splicing and polyadenylation and therefore to an increased chance of recognizing alternative exons [117] or proximal PASs [118,119] and the choice of TSS is linked to a specific splicing pattern [120,121] or to the use of specific PASs [71,122,123].

In addition to links between transcription and mRNA processing, alternative splicing and APA also appear to be interdependent. Twenty years ago, it was shown that splicing of the last intron requires definition of the last exon (at least in mammals [124]) and this occurs through the cooperation of splicing and polyadenylation factors that interact across the last exon, leading to mutual enhancement of both splicing and polyadenylation [125]. The snRNPs U1 and U2 and the U2 auxiliary factor 65 kDa subunit (U2AF65), all spliceosome components, are also part of the human pre-mRNA 3′ processing complex [126]. These spliceosome components directly interact with cleavage and polyadenylation specific factor (CPSF) and with CFIm. Splicing factors can also play a role in premature cleavage and polyadenylation, as shown by the spliceosomal factor TRAP150 [127].

Recent transcriptome-wide studies further support the links between splicing and polyadenylation. Alteration of the splicing factor hnRNP H has been shown to have widespread effects on tandem APA, with increased 3′-UTR shortening in the presence of hnRNP H and lengthening in its absence (Figure 5A, top). Changes in APA were accompanied by changes in alternative splicing. A direct link between hnRNP H and the choice of a specific PAS was shown by crosslinking immunoprecipitation sequencing (CLIP-seq) analysis, by the presence of a higher CLIP tag density next to the proximal PAS [128]. An increase in proximal PAS use was also observed after alteration of Nova, an RBP involved in alternative splicing [36].

High CLIP tag density surrounding proximal PASs has also been observed for the RBPs MBNL1 and MBNL2 (Figure 5A, bottom), which are known to regulate splicing [38], and a direct link between MBNL proteins and APA was recently explained by the competition of MBNL with CFIm68, a component of the polyadenylation machinery [95].

Whether alternative splicing is also coupled to non-tandem APA remains unclear. A few studies have specifically investigated the interdependency between intronic polyadenylation and splicing. Cryptic intronic PASs are mainly located in large introns with weak 5′ splice sites. This suggests that intronic polyadenylation can be inhibited if there are splicing enhancers that recognize the 5′ splice site, as shown for U1 [129], or enhanced in the case of suboptimal splicing [130]. The coupling observed in this case represents kinetic competition between splicing and polyadenylation [131].

Finally, coupling is not restricted to processes connected in space and time. Interdependency has also been shown between processes occurring in different subcellular compartments; for example, between APA and translation. Cytoplasmic polyadenylation element-binding protein 1 (CPEB1), which shuttles between the nucleus and the cytoplasm, has been shown to play a dual role in APA and translation [132] (Figure 5B). Interestingly, CPEB1 can also regulate alternative splicing. CPEB1 prevents recruitment of the splicing factor U2AF65 to the 3′ splice

site, but simultaneously recruits the polyadenylation machinery. The RBP CPEB1 is an example of a master regulator that affects three layers of gene expression: splicing, polyadenylation, and translation.

[Figure 5 (image not reproduced). Panels A (I)–(IV) and B (I)–(II) show noncanonical and canonical PA signals with the proximal (PAS 1) and distal (PAS 2) polyadenylation sites; panel B spans the nucleus and the cytoplasm.]

Figure 5. Coupled regulatory mechanisms. (A) Tandem alternative polyadenylation (APA) regulated by splicing factors. The RNA-binding proteins hnRNP H and MBNL regulate APA in opposing ways. In the presence of hnRNP H (I), cleavage and polyadenylation specific factor (CPSF) binds weaker noncanonical polyadenylation (PA) signals and cuts at the proximal polyadenylation site (PAS 1) leading to shortening of the 3′ untranslated region (3′-UTR), while in its absence (II) only the canonical PA signal is recognized and cleavage occurs in the distal PAS (PAS 2). (III) MBNL masks the region upstream of weak noncanonical PA signals, blocking the binding of cleavage factor I (CFIm). This leads to binding of CFIm to a more distal UGUU sequence, followed by binding of CPSF to the distal canonical PA signal and use of the distal PAS (PAS 2). In the absence of MBNL (IV), CFIm can bind proximal UGUU regions and bring the CPSF to weaker PA signals, causing cleavage at the proximal PAS (PAS 1) and shortening of the 3′-UTR. (B) Coupling of APA and translation. In the nucleus, in the absence of cytoplasmic polyadenylation element-binding protein 1 (CPEB1) (I), CPSF binds the canonical PA signal and cleaves the RNA at a distal PAS (PAS 2). In the presence of CPEB1 (II), CPEB1 binds the cytoplasmic polyadenylation element (CPE) located upstream of weak noncanonical PA signals. CPEB1 directly interacts with CPSF, bringing it to regions proximal to the weak PA signal. This leads to their recognition by CPSF and cleavage at the proximal PAS (PAS 1). When CPEB1 shuttles to the cytoplasm, it again binds to the CPE, but this time to promote lengthening of the poly(A) tail by poly(A) polymerase (PAP), which results in increased translation efficiency. Lengthening of the poly(A) tails of transcripts bearing proximal PASs (PAS 1) (II) is enhanced by the fact that the CPE, PAP, and the polyadenylation site are in close proximity, whereas this enhancement is disrupted when the distance is greater due to the 3′-UTR lengthening in transcripts bearing a distal PAS (PAS 2).

Concluding remarks
RNA-seq technologies are elucidating the mechanisms that expand the genome’s coding capacity and are quickly redefining the concept of gene expression regulation.

Although there is a continuing increase in the number of transcripts identified, and in the understanding of the molecular mechanisms that coordinate their formation during transcription and mRNA processing, we still face technical limitations due to the short read length of next-generation sequencing data and reliance on statistical and computational approaches to reconstruct transcript structure. This represents an obstacle when trying to link different events occurring in the same RNA molecule. The only way to specifically determine the exact transcript structure for each detected RNA molecule is by sequencing full-length RNAs, an option that is currently becoming more feasible [133,134] and that is opening a new era in the field of RNA-seq.
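As an illustration of what linking events within the same RNA molecule could look like once full-length reads are available, the Python sketch below tabulates how often each TSS is observed together with each PAS and compares that with the frequency expected if the two choices were independent. It is a toy example under stated assumptions: each read is assumed to be already annotated with a TSS label and a PAS label, and the function name, labels, and counts are hypothetical.

```python
from collections import Counter


def coupling_table(reads):
    """Compare observed and expected joint usage of TSSs and PASs.

    reads -- iterable of (tss_id, pas_id) tuples, one per full-length read
             (hypothetical input format)
    Returns a dict mapping (tss_id, pas_id) to (observed, expected) frequencies,
    where `expected` is the product of the marginal frequencies.
    """
    joint = Counter(reads)
    n = sum(joint.values())
    if n == 0:
        raise ValueError("no reads supplied")

    tss_counts = Counter()
    pas_counts = Counter()
    for (tss, pas), count in joint.items():
        tss_counts[tss] += count
        pas_counts[pas] += count

    table = {}
    for tss in tss_counts:
        for pas in pas_counts:
            observed = joint.get((tss, pas), 0) / n
            expected = (tss_counts[tss] / n) * (pas_counts[pas] / n)
            table[(tss, pas)] = (observed, expected)
    return table
```

An observed frequency well above the expected one for a given pair would be consistent with coupling between that TSS and that PAS; a formal test of significance is left out of this sketch.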

