Cross-Mapping MicroRNA Identification

Methods
Cross-mapping and the identification of editing

sites in mature microRNAs in high-throughput
sequencing libraries
Michiel J.L. de Hoon,1 Ryan J. Taft,2 Takehiro Hashimoto,1
Mutsumi Kanamori-Katayama,1 Hideya Kawaji,1 Mitsuoki Kawano,1
Mami Kishima,1 Timo Lassmann,1 Geoffrey J. Faulkner,3 John S. Mattick,2
Carsten O. Daub,1 Piero Carninci,1 Jun Kawai,1 Harukazu Suzuki,1
and Yoshihide Hayashizaki1,4
1
Omics Science Center, RIKEN Yokohama Institute, Tsurumi, Yokohama 230-0045, Japan; 2Institute for Molecular Bioscience,
The University of Queensland, St. Lucia QLD 4072, Australia; 3Roslin Institute and Royal (Dick) School of Veterinary Studies,
The University of Edinburgh, Roslin EH25 9PS, United Kingdom
MicroRNAs (miRNAs) are short (20–23 nt) RNAs that are sequence-specific mediators of transcriptional and post-
transcriptional regulation of gene expression. Modern high-throughput technologies enable deep sequencing of such
RNA species on an unprecedented scale. We find that the analysis of small RNA deep-sequencing libraries can be affected
by cross-mapping, in which RNA sequences originating from one locus are inadvertently mapped to another. Similar to
cross-hybridization on microarrays, cross-mapping is prevalent among miRNAs, as they tend to occur in families, are
similar or derived from repeat or structural RNAs, or are post-transcriptionally modified. Here, we develop a strategy to
correct for cross-mapping, and apply it to the analysis of RNA editing in mature miRNAs. In contrast to previous reports,
our analysis suggests that RNA editing in mature miRNAs is rare in animals.
[Supplemental material is available online at http://www.genome.org.]
In recent years, the use of high-throughput deep-sequencing tech- Several recent studies have reported evidence of adenosine to
nologies has revolutionized transcriptomics research (Kahvejian inosine (A-to-I) editing in pri-miRNAs (Luciano et al. 2004; Blow
et al. 2008; Morozova and Marra 2008; Schuster 2008) and facili- et al. 2006; Kawahara et al. 2008). Such editing can affect the ef-
tated the evolutionary and genome-wide assessment of microRNA ficacy of DICER1 and RNASEN processing or lead to transcript
(miRNA) expression (Ruby et al. 2006; Landgraf et al. 2007; Babiarz degredation (Gottwein et al. 2006; Obernosterer et al. 2006;
et al. 2008; Kuchenbauer et al. 2008; Morin et al. 2008), These short Thomson et al. 2006; Yang et al. 2006; Kawahara et al. 2007b;
(20–23 nt) RNA species, which play important roles in transcrip- Viswanathan et al. 2008). The discovery of A-to-I editing sites
tional and post-transcriptional gene regulation, are generally pro- within the region of the pri-miRNA comprising the mature miRNA
cessed from long primary transcripts in two steps. First, short led to the enticing hypothesis that A-to-I editing may generally
hairpin structures are excised from long primary miRNA (pri- affect miRNA target selection and stability (Luciano et al. 2004;
miRNA) transcripts in the nucleus by the microprocessor complex, Blow et al. 2006). Indeed, tissue-specific editing of bases in the seed
which is composed of the ribonuclease RNASEN (DROSHA) and region of the miRNA miR-376a alters the set of genes it silences
the RNA-binding protein DGCR8 (Winter et al. 2009). After export (Kawahara et al. 2007a). Similarly, deep sequencing of small RNAs
to the cytoplasm by XPO5 (Exportin 5), DICER1 cleaves the hair- from three mouse tissues showed extensive nucleotide heteroge-
pin to yield an RNA duplex with a characteristic two-base 39 neity and modifications in mature miRNAs of the let-7 family (Reid
overhang (Winter et al. 2009). One strand of this duplex, the ma- et al. 2008), including internal insertions, deletions, and nucleotide
ture miRNA, is loaded into a regulatory protein complex, RISC (the substitutions. However, most of these sequence variations are dis-
RNA-induced silencing complex), while the other strand is usually, tinct from the A-to-I editing events, suggesting widespread editing
although not exclusively, degraded (Winter et al. 2009). Mature of mature miRNAs by currently unknown RNA processing enzymes.
miRNAs typically regulate gene expression by binding to messen- In this study, we analyze short RNA libraries produced during
ger RNA 39 untranslated regions (UTRs) via the ‘‘seed sequence’’ a time course of differentiation of THP-1 cells in response to
(nucleotides 2–8) and inducing either transcript degradation or phorbol-mysterate-acetate (PMA), which mimics macrophage mat-
translation inhibition, although there are also examples of miRNAs uration (The FANTOM Consortium and the Riken Omics Science
interacting with 59 UTRs and promoter regions (Lee et al. 2009). Center 2009; Taft et al. 2009). These libraries are of sufficient se-
quencing depth to make an accurate assessment of the prevalence
4
Corresponding author. of editing in mature miRNAs. Similar to cross-hybridization in
E-mail rgscerg@gsc.riken.jp; fax 81-45-5039216. microarray experiments, short RNA sequences obtained from
Article published online before print. Article and publication date are at
http://www.genome.org/cgi/doi/10.1101/gr.095273.109. Freely available deep-sequencing experiments can be inadvertently mapped to
online through the Genome Research Open Access option. multiple, and sometimes incorrect, loci if there are many similar
20:257–264 Ó 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org Genome Research 257
www.genome.org
de Hoon et al.
sequences in the genome or if sequencing errors or post-tran-

scriptional modifications occur. Such cross-mapping events can
lead to overrepresented mismatches at specific genome locations
between the genome sequence and the RNA sequence, giving the
appearance of RNA editing. Using a novel strategy to avoid cross-
mapping artifacts, we find that editing of mature miRNAs is rare.
Indeed, reanalysis of murine deep-sequencing data (Reid et al.
2008) suggests that cross-mapping rather than RNA editing is re-
sponsible for the previously observed sequence heterogeneity in the
murine let-7 miRNA family.
Results
The FANTOM4 short RNA libraries
The FANTOM4 project sought to detail the entire genetic network
of a model human monocytic leukemia cell line, THP-1, as it dif-
ferentiated from a monoblast-like to a monocyte-like cell after
Figure 1. Number of mapping locations. The number of mapping
stimulation with PMA (The FANTOM Consortium and the Riken
locations for all FANTOM4 THP-1 short RNA sequences. More than half of
Omics Science Center 2009). As part of this effort, short RNAs, 10 the short RNAs in these libraries map to more than one genome location.
to 26 nt in size, were harvested at eight time points over 96 h and
were reverse-transcribed to produce cDNAs, which were sequenced
using a 454 Life Sciences (Roche) sequencer (Taft et al. 2009). To shows an example of cross-mapping between the let-7b and let-7c
analyze editing sites in mature miRNAs, we combined sequences miRNAs, whose mature miRNAs differ by only one nucleotide. Let-
from all time points into a data set of ;3.5 million short RNAs. To 7b with an additional 39 adenosine maps equally well to the let-7b
ensure that miRNA editing could be accurately assessed in this locus and the let-7c locus. The sequence counts of such multi-
model, we examined the expression of the adenosine deaminase mapping RNAs are usually divided equally between the candidate
(ADAR) and the cytodine deaminase (APOBEC) families of proteins genomic loci, assigned arbitrarily to one of the mapping locations,
by microarray and CAGE profiling (The FANTOM Consortium and or discarded altogether (Ruby et al. 2006; Landgraf et al. 2007;
the Riken Omics Science Center 2009) and confirmed that they are Babiarz et al. 2008; Morin et al. 2008; Taft et al. 2009). In this ex-
expressed throughout the PMA time course (Supplemental Fig. 1). ample, short RNAs produced at the let-7b locus, but cross-mapped
to let-7c, will lead to overrepresented mismatches at bases where
let-7c differs from let-7b, resulting in an alignment that can be
Cross-mapping in short RNA sequencing libraries incorrectly interpreted as an RNA editing site in let-7c.
We mapped THP-1 small RNAs to the human genome and iden- To counter the effects of cross-mapping, we have developed
tified all loci where they aligned with the fewest errors. An error an expectation-maximization algorithm, described in detail in the
can be a mismatch, an insertion, or a deletion (for details, see Methods, which assigns weights to the candidate loci of each short
Methods). Small RNAs do not necessarily map uniquely. As shown RNA using both the expression level and the alignment between
in Figure 1, more than half of the short RNAs in the FANTOM4 the RNA and the genome sequence (Fig. 3). At each iteration of the
library map to multiple loci. The ambiguity in the genomic origin algorithm, we calculate the expression level from the weighted
of these RNAs may lead to cross-mapping,
in which a short RNA originating from
one locus is partially or completely as-
signed to a different location. Cross-
mapping complicates an accurate iden-
tification of RNA editing sites, which is
based on an analysis of the mismatches
between the sequenced RNA (i.e., the
deep sequencing tag) and the genome.
However, as shown in Figure 1, 66% of
short RNAs with one mismatch map to
multiple genome locations, making this
analysis nontrivial.
MicroRNAs are particularly suscep-
tible to cross-mapping, as they are short
and tend to occur in families with highly Figure 2. Cross-mapping in short RNA sequencing libraries. Sequences in green and blue represent
similar sequences (Bartel 2004). A further the genome sequence and the RNA sequence, respectively. Mismatches between the genome se-
complication is that many mature miRNAs quence and the RNA sequence are shown in red. A miRNA sequence with an additional 39 adenosine,
have single-nucleotide non-template 39 either due to post-transcriptional addition (as discussed in the text) or a sequencing error, maps equally
well to the genome loci encoding the human miRNAs let-7b and let-7c. This miRNA sequence was read
adenosine or uracil additions (Landgraf 20 times in the FANTOM4 time course short RNA libraries. Dividing the sequence counts equally be-
et al. 2007; Kuchenbauer et al. 2008; Jones tween these two genome loci leads to a spurious RNA editing site in let-7c. Alternatively, cross-mapping
et al. 2009; Katoh et al. 2009). Figure 2 to unannotated genome regions may give rise to spurious novel noncoding RNA loci.
258 Genome Research

www.genome.org
Cross-mapping and editing in mature microRNAs
weight ratios. For more than 75% of the mappings, the cross-map
corrected weight differs from the corresponding equal weight by
more than 25%. Most of the weight ratios are less than one,
whereas a few mappings have a weight ratio much larger than one.
This implies that the cross-mapping correction based on this
weighting scheme establishes a clear preference between the can-
didate mapping locations, which we found to be independent of
the expression level of the short RNA (Fig. 4B).
Identification of editing sites in mature miRNAs

Almost half (43.5%) of the small RNAs in the FANTOM4 time
course can be mapped to loci encoding miRNAs, and 19% can be
aligned to transfer RNAs (tRNAs) (Supplemental Fig. 2). In total,
236 mature miRNAs with a total count of at least 10 were identified
(see Supplemental Table 1). Cross-mapping did not affect the rel-
ative expression values of the majority of miRNAs (Fig. 5; Supple-
mental Table 2).
To search for putative RNA editing sites, each position along
the length of mature miRNAs was analyzed to find the total number
of RNA reads, as well as the number of mismatches. Under the null
hypothesis, mismatches are caused by experimental errors rather
than by miRNA editing. We calculated the tail probability of finding
the mismatches associated with any particular locus using the rate
of experimental errors given by the error profile, and applied the
Bonferroni correction to correct for multiple testing. We rejected the
null hypothesis if the Bonferroni-corrected tail probability at a par-
ticular genome position is less than 0.05, suggesting that the over-
represented mismatches at that position are due to editing (for de-
tails, see Methods). For a typical case in which 100 transcripts of
a given miRNA are sequenced, an editing site can be detected if at
least 10% of the transcripts contain the modified nucleotide.
Table 1 shows all potential RNA editing sites found in the
FANTOM4 short RNA data in mature miRNA sequences, excluding
Figure 3. Cross-mapping correction strategy. For each short RNA that overrepresented mismatches at the 39 terminal position, which can
can be aligned to multiple genome regions with an equal number of be affected by enzymatic nucleotide additions to the 39 end ( Jones
errors, our strategy to correct for cross-mapping assigns weights to each et al. 2009; Katoh et al. 2009). Without correcting for cross-mapping,
candidate mapping location based on the local expression level as well as
we find 10 genome locations in nine miRNA species in which the
the alignment errors. The latter is based on the error profile describing the
probability of an alignment error as a function of the position along number of mismatches between the RNA sequence and the ge-
the alignment. Both the local expression level and the error profile are nome is statistically significant (P < 0.05). However, only three
calculated from the mapped RNAs themselves using an expectation- mismatch sites remain after correcting for cross-mapping.
maximization algorithm. First, we assign equal weights to all candidate
To verify that sites of overrepresented mismatches are indeed
mapping locations. We then calculate the error profile and the expression
level of each genome location from the complete set of mapped RNA caused by cross-mapping rather than miRNA editing, we calculated
sequences. This allows us then to recalculate the error profile and the the Spearman correlation along the PMA time course between
expression levels. This process is iterated until convergence. RNAs that map to the genome with mismatches and highly
expressed RNAs that map unambiguously at each of the candidate
mapping loci. This is an independent verification of the cross-
number of RNAs that map at each locus, and assign a probability to mapping correction, since the correlation along the time course
the sequence alignment between the RNA and the genome se- depends on the relative expression variation between time points,
quence. The latter is calculated using an error profile, representing but not on the absolute counts of each RNA, whereas the cross-
the overall prevalence of alignment errors as a function of position mapping correction depends on the total count of each RNA sum-
along the alignment. We then calculate the number of RNAs med over the eight time points, but not on the RNA counts at each
expected at each candidate mapping location, and choose the time point separately. For each of the potential editing sites, we
mapping weights in proportion to this value. Upon convergence of identified the most highly expressed perfectly mapping short RNA
the expectation-maximization algorithm, the sequence counts of at the miRNA site as well as the proposed cross-mapping origin and
multimapping RNAs are assigned to their candidate mapping lo- calculated the Spearman correlation between their sequence counts
cations according to the mapping weights. To assess the impact of and the counts of the most highly expressed sequence containing
this approach, we contrast it to an equal-weight strategy. We cal- a mismatch at the potential miRNA editing site. As shown in Figure
culate a weight ratio of an RNA mapping at a candidate mapping 6, we found that the correlation at the proposed cross-mapping
site by dividing the value of the weight obtained in our weighting origin was higher than the correlation at the miRNA locus, in
strategy by its value under an equal-weight approach (for details, agreement with our conclusion that the sites with overrepresented
see Methods). Figure 4A shows the cumulative distribution of these mismatches are caused by cross-mapping rather than RNA editing.
Genome Research 259

www.genome.org
de Hoon et al.
Figure 4. Effect of the cross-mapping correction on mapping weights. The weight ratio is defined as the weight calculated by the cross-mapping
correction strategy divided by the corresponding weight under an equal-weight strategy (see text for more detail). The more a weight ratio differs from
unity, the larger the cross-mapping correction. (A) The cumulative distribution of the weight ratios, revealing that most weights are reduced in comparison
to the equal-weight strategy, while a few weights are greatly increased. (B) A two-dimensional binning plot of the count of each short RNA sequence and
the weight ratio, using a logarithmic color scheme to represent the number of mapping events in each bin.
As shown in Table 1, cross-mapping typically occurs between that purportedly showed evidence of editing at the ninth position
miRNAs of the same family (e.g., miR-27a and miR-27b or miR-18a of murine mature let-7a. Using the same methodology we used to
and miR-18b), although in miR-1274a, a spurious editing site is analyze the FANTOM4 data, we mapped these short RNAs to the
caused by cross-mapping of RNA fragments from a lysine tRNA. mouse genome and applied the cross-mapping correction strategy.
Indeed, this miRNA was previously discovered by deep RNA se- Supplemental Table 4 shows the alignments between the se-
quencing (Morin et al. 2008), overlaps an annotated ERVK long quenced short RNA and members of the murine let-7 family. The
terminal repeat, and may itself be a cross-mapping artifact. Simi- short RNA sequences that map to the let-7a locus with a mismatch
larly, cross-mapping is likely to be responsible for a spurious edit- at the proposed editing position can alternatively be mapped to
ing site in miR-1260. Small RNAs that map to miR-1260 with miRNA let-7e, with a single mismatch at the 39 terminal nucleotide.
a single mismatch can alternatively be mapped to leucine tRNA, Our cross-mapping analysis finds an 80% probability that these se-
which is expressed almost 80 times higher than miR-1260. As quences are produced from the let-7e locus instead of the let-7a lo-
shown in Supplemental Figure 3, the expression level of leucine cus. This suggests that these sequences originate from the let-7e
tRNA more than compensates for the alignment with fewer errors
to miR-1260. Indeed, the odds-ratio that this sequence originates
from leucine tRNA instead of miR-1260 is more than 36, strongly
suggesting that the overrepresented mismatches at the putative
editing site in miR-1260 are due to cross-mapping leucine tRNA
sequences rather than miRNA editing.
G-to-C mismatches were found at the tenth residue in 187 of
250 RNAs that map to miR-629. Since no enzyme has been identi-
fied that mediates such editing events, we considered the possibility
that these overrepresented mismatches were due to a nucleotide
polymorphism in the THP-1 genome. Indeed, resequencing of this
genomic region revealed a novel single nucleotide polymorphism
(SNP): A cytosine nucleotide instead of a guanine can be found at
the mismatch location in THP-1 cells (Supplemental Fig. 4).
In miR-376c we found A-to-G mismatches at position 6 of the
mature miRNA in eight out of 10 RNA sequences. This position was
previously identified as an A-to-I editing site (Kawahara et al. 2007a),
and was the only editing site validated by the cross-mapping cor-
rection in mature miRNAs in the FANTOM4 libraries.
In contrast to mature miRNAs, we found 385 genome sites
Figure 5. Comparison of miRNA expression before and after cross-
with overrepresented mismatches in tRNA sequences after cor- mapping correction. This scatter plot shows the expression of miRNAs
recting for cross-mapping. Most (61%) of these sites are known to before and after correcting for cross-mapping. With a Spearman corre-
be post-transcriptionally modified (Supplemental Table 3), similar lation coefficient of 0.99, the cross-mapping correction has a minor effect
to what was recently found in an analysis of high-throughput on the estimated expression of most miRNAs. However, for 14 miRNAs
the relative difference between the estimated expression with and with-
short RNA sequencing libraries in plants (Ebhardt et al. 2009).
out the cross-mapping correction was larger than 50%. These miRNAs are
Given the apparent paucity of bona fide editing sites in ma- circled and shown together with their short RNA counts before and after
ture miRNAs, we reanalyzed the short RNA data (Reid et al. 2008) the cross-mapping correction.
260 Genome Research

www.genome.org
Table 1. Overrepresented mismatches in mature miRNA sequences
Without correcting After correcting for

for cross-mapping cross-mapping
Mismatch Origin of
miRNA DNA!RNA Total Mismatched P Total Mismatched P cross-mapping
miR-27b 19:T!C 7707 5586 0 2148 62 1 miR-27a

miR-629 10:G!C 235 172 3 3 10244 250 187 4 3 10288 SNP
miR-106a 12:C!T 1528 327 6 3 10242 207 7 1 miR-20a
let-7c 17:A!G 174 132 1 3 10195 25 4 1 let-7b
miR-1260 9:T!G 107 105 6 3 10178 160 148 2 3 10256 tRNA-Leu (see text)
miR-1274a 11:A!G 54 39 1 3 1053 8 1 1 tRNA-Lys
miR-20b 10:C!T 151 32 2 3 1021 86 0 1 miR-17, miR-20a
miR-18b 20:T!A 211 96 2 3 1013 40 3 1 miR-18a
miR-20b 12:T!C 151 23 3 3 1011 85 0 1 miR-17
miR-376c 6:A!G 10 8 3 3 108 10 8 1 3 109 Known
This table shows all overrepresented mismatches found in mature miRNAs before and after correcting for cross-mapping, except those occurring at the
39 terminal nucleotide of the mature miRNA, together with the corresponding sequence counts, statistical significance after Bonferroni correction (see
Methods), and the origin of cross-mapping RNAs if applicable. Before correcting for cross-mapping, 10 such potential miRNA editing sites are found. As
shown in this table, eight of these are due to cross-mapping; one additional apparent editing site is due to a single nucleotide polymorphism in the THP-1
genome. The remaining editing site, in the mature miR-376c, has been identified previously (Kawahara et al. 2007a).
locus with a single 39 nucleotide extension, rather than the let-7a didate mapping location. Applying this method to our short RNA
locus with a U-to-G editing event, and that the overrepresented data set resulted in three to eight spurious cross-mapping loci,
mismatches in murine let-7a are caused by cross-mapping rather depending on which alignment program was used to find the
than RNA editing. candidate mapping locations.
Current mapping strategies typically use a hierarchical ap-
proach, which first attempts to map short RNAs to the genome
Discussion without any mismatches, then to map the remaining short RNAs
Small RNA sequences that cross-map to multiple genome locations with one mismatch, and continue this process up to some upper
can give rise to spurious alignments that can be interpreted as
novel editing events. miRNAs are particularly likely to cross-map,
since they are short, occur in families with highly similar se-
quences, and may contain nontemplate single-nucleotide addi-
tions on their 39 end. Previous miRNA microarray probe design
algorithms have attempted to ameliorate the chance of cross-
hybridization between miRNAs with single-base differences (Lee
et al. 2008). Similarly, our strategy aims to mitigate the occurrence
of cross-mapping artifacts in high-throughput short RNA se-
quencing experiments by carefully weighting the assignment of
RNAs to genome loci. Similar strategies have been developed pre-
viously for mapping short sequences obtained from CAGE and
RNA-seq (Jurka 2000; Faulkner et al. 2008).
Alignment errors preferentially occur near the 59 and 39 ends
of short RNAs and are likely due to imperfections in the linker li-
gation during the construction of the cDNA library used for se-
quencing and enzymatic nucleotide additions to the 39 end of
miRNAs ( Jones et al. 2009; Katoh et al. 2009), rather than to true
sequencing errors. Cross-mapping caused by such alignment er-
rors, therefore, cannot be resolved by using the per-nucleotide
quality scores of the sequences reads as used in the mapping Figure 6. Verification of the cross-mapping correction at spurious
algorithms implemented in MAQ (Li et al. 2008) and Bowtie editing sites. Our analysis showed that eight out of 10 sites with over-
represented mismatches in mature miRNAs are due to cross-mapping
(Langmead et al. 2009). Mapping our short RNA data set by MAQ rather than true RNA editing sites (Table 1). For each of these miRNAs, we
resulted in three new spurious miRNA editing sites due to cross- identified the most abundant RNA that mapped exactly (i.e., without
mapping, in addition to those shown in Table 1, while the true mismatches) to the miRNA locus and the most abundant RNA that map-
miRNA editing site in miR-376c was missed. Using Bowtie resulted ped to the putative origin of cross-mapping. We also identified the most
abundant RNA mapping to the miRNA locus with one mismatch at the site
in eight new spurious miRNA editing sites. of overrepresented mismatches. For this RNA, we calculated the Spearman
Our cross-mapping correction uses both the expression level correlation along the time course of its counts with the counts of the RNA
and the error profile to assign weights to candidate mapping lo- derived from the miRNA locus (light gray), and the Spearman correlation
cations. Mortazavi et al. (2008) previously developed a weighting along the time course of the RNA derived from the putative origin of cross-
mapping (dark gray). In all cases, we find that the putative origin of cross-
scheme for RNA-seq and SNP detection, in which sequences mapping yields a stronger correlation than the miRNA site, supporting our
mapping to multiple locations are weighted based on the counts of conclusion that the mismatched RNA originates from the cross-mapping
uniquely mapping sequences in the neighborhood of each can- locus rather than the miRNA locus.
Genome Research 261

www.genome.org
de Hoon et al.
limit on the allowable number of mismatches. Due to the large of editing in mature miRNAs supports the idea that the primary
dynamic range in deep-sequencing libraries, this strategy may not biological function of miRNA editing in animals is the regulation
be optimal. For example, we can reasonably expect the absolute of the processing pathway, rather than the targeting of miRNAs
number of sequences with errors to increase with the sequences (Kawahara et al. 2007b, 2008).
produced at a given locus. Indeed, the most highly expressed short
RNA in the FANTOM4 libraries, which mapped to miR-27a, was
sequenced 107,777 times. With errors preferentially occurring at Methods
the 59 and 39 end of the RNA, even a 1% error rate can result in THP-1 cell culture, RNA extraction, small RNA library preparation,
thousands of short RNAs cross-mapping to a locus elsewhere in the and sequencing were described previously (Taft et al. 2009). As the
genome, confounding expression, annotation, or editing analyses, presence of more than one bead within one droplet in the water/oil
as illustrated by the example of miR-1260 above. Hence, particularly emulsion used for PCR amplification gives rise to multiple counts
for sequences originating from highly expressed loci, the candidate of identical reads, only one count was retained for identical reads
genome locus with the fewest alignment errors is not necessarily the to avoid emulsion PCR biases. Potential editing sites in mature
best. In general, these kinds of artifacts can be avoided by including murine miRNAs (Reid et al. 2008) were analyzed using sequences
more candidate mapping locations in the analysis, even though the that were kindly provided by the Gunaratne laboratory (Depart-
alignment quality between the transcript and the genome sequence ment of Biology and Biochemistry, University of Houston, TX).
may be lower. For example, the ELAND mapping software (AJ Cox,
unpubl.) reports all candidate mapping loci with up to two mis-
matches, instead of only the locus with simply the best sequence Finding candidate mapping sites of small RNAs
alignment. Such a strategy could particularly be useful if combined We align each sequenced RNA to the genome by minimizing the
with a cross-mapping correction strategy, as RNAs would be as- number of substitutions, insertions, and deletions in the align-
signed a very low weight at candidate mapping loci where they align ment. To enable an unbiased assessment of the occurrence of RNA
poorly except if they are highly expressed. editing, we allowed any number of alignment errors between the
The recent revelation that tRNAs can be processed into dis- RNA and the genome.
crete small RNA species (Kawaji et al. 2008; Elbarbary et al. 2009; We used Nexalign (T Lassmann, unpubl.; http://genome.gsc.
Thompson and Parker 2009; Yamasaki et al. 2009) further com- riken.jp/osc/english/software) for alignments with up to one
plicates the analysis of small RNA deep-sequencing libraries. tRNA- mismatch, insertion, or deletion. Nexalign is a suffix array-based
derived small RNAs are highly expressed and are therefore partic- method (Manber and Myers 1990) that guarantees to find all full-
ularly prone to produce cross-mapping artifacts, suggesting that length exact matches of a query to a target database. To allow for-
matching with errors, Nexalign generates a library of all sequences
cross-mapping correction is an essential step of deep-sequencing data
that can be generated from the queries using the user-specified
analysis. Indeed, tRNAs are also characterized by post-transcriptional
number of mismatches, insertions, or deletions. All sequences in the
modifications, including CCA additions to the 39 end of tRNAs and
libraries are searched one by one, thus guaranteeing a complete set
site-specific nucleotide modifications (Ebhardt et al. 2009), which
of matches within the given number of mismatches, insertions, or
cause mismatches between the genome and the sequenced RNA. deletions. Per-base quality values are not taken into consideration.
An example of the latter is the methylated adenine in the TCC arm Using Nexalign, we first map the extracted small RNA reads,
of tRNAs, which is frequently read as a uridine in our RNA se- using exact matches only, to the human genome (assembly hg18;
quencing libraries. Whereas our mapping strategy already takes the NCBI build 36.1) (The Genome Sequencing Consortium 2001;
39 CCA sequence tags into account (see Methods), other post- Kuhn et al. 2009), ribosomal sequences contained in the GenBank
transcriptional modifications are more difficult to include, as such (Benson et al. 2008) record with accession number U13369, and
modifications and their potential manifestation in short RNA se- a collection of human mature tRNAs, which was created by
quencing are generally unknown. Examples of miRNAs that align appending the CCA sequence tag to the 39 end of each tRNA se-
to mature tRNAs, suggesting that they may be cross-mapping ar- quence as encoded in the genome (see Supplemental Table 5). This
tifacts, are the human miRNAs miR-720, miR-1274a, miR-1274b, sequence, which is added post-transcriptionally during the matu-
and miR-1308, and the mouse miRNAs miR-720, miR-1937a, and ration process of tRNAs, is usually not present in the genome. Short
miR-1937b. RNA sequences mapping to ribosomal sequences were removed
from the further analysis, while the coordinates of short RNAs
mapping to mature tRNAs were converted to genome coordinates.
Few editing events in mature miRNAs in animals The remaining short RNAs were then mapped, using Nexalign, to
The FANTOM4 time course short RNA libraries are dominated by the human genome, ribosomal sequences, and mature tRNAs,
mature miRNAs, which occur in miRNA families with highly allowing for one mismatch, insertion, or deletion, again convert-
ing the coordinates of short RNAs mapping to mature tRNAs and
similar mature miRNA sequences. Cross-mapping between differ-
discarding short RNAs mapping to ribosomal sequences. The
ent miRNAs may lead to the artifactual identification of editing or
remaining short RNAs were mapped to the genome, ribosomal
modification sites, since cross-mapping RNAs originating from
sequences, and mature tRNA sequences using BLAST (Altschul
a particular locus will, at the cross-mapped locus, give rise to se- et al. 1990) with the command-line options -p blastn -m 7 -F F -G
quence mismatches at specific genome positions. 10 -E 6 -q -4 -r 5 -W 7 -b 1000000 -v 1000000 -J T. The short RNAs
We found few bona-fide editing sites in mature miRNAs se- were then aligned to the regions identified by Blast, extended by
quenced in the FANTOM4 libraries. Overrepresented mismatches 10 bp on both sides, using an in-house global alignment algorithm
at specific genome positions were, in the vast majority of cases, to obtain an alignment over the full length of the RNA. Alignments
caused by cross-mapping short RNAs originating from other miRNAs with the minimum number of errors were retained, where each
or from tRNAs. Similarly, previously identified sites of overrep- mismatch, insertion, or deletion in the alignment was counted as
resented mismatches in mature miRNAs (Reid et al. 2008) also one error. Without correcting for cross-mapping, the counts of
seem to be due to cross-mapping rather than RNA editing. The lack RNAs mapping to more than one locus with an equal number of
262 Genome Research

www.genome.org
errors were divided equally between these loci. RNAs mapping to weights using Equation 4. We repeat this process until conver-
more than 100 locations were removed from the analysis. gence.
For lowly-expressed regions, where the expression level ni
may be due to a single transcript variant, the presence of one
Cross-mapping correction uniquely mapping transcript in any of the candidate mapping
The mapping procedure described above finds candidate mapping regions will cause all multimapping transcripts to be assigned there
locations in the genome for each of the RNAs in the data set. For at the exclusion of the other candidate mapping regions. To avoid
short RNAs with multiple candidate mapping sites, we apply the such instabilities
rffiffiffiffiffiffiffiffiffi in the iteration process, we added pseudocounts
cross-mapping correction to assign weights to each candidate equal to + nj , equally divided between the candidate mapping
mapping site. j
The cross-mapping correction calculates weights using both regions, to all expression levels ni in Equation 4.
the expression level of each candidate mapping site and the errors Convergence is reached when the mapping weights wi(r) no
in the alignment between the short RNA and the genome sequence longer change. We stop the iteration as soon as the absolute change
at each candidate mapping site. The expression level of each can- in the weights wi(r) summed over all RNAs r and all candidate
didate mapping site depends on the number of short RNAs map- mapping locations i is less than 1. In practice, a much smaller
ping there, whereas the alignment errors are assessed using an error number of iterations will be adequate. As a case in point, in our
profile describing the prevalence of errors as a function of position analysis of editing in mature miRNAs, a single iteration is sufficient
along the RNA. Since both the expression level and the error profile to identify all spurious editing sites as such.
depend on the weights assigned to each short RNA, the cross- The weight ratios shown in Figure 4 were calculated by di-
mapping correction is necessarily an iterative process. viding the mapping weights wi(r) by their value under an equal-
As the first step of the cross-mapping correction, for each weight strategy, resulting in mr wi(r).
short RNA r we assign an equal weight wi(r) = 1/mr to each of its mr
candidate mapping locations. The expression level of a candidate
mapping site i is then calculated as
Analysis of potential miRNA editing sites
We calculate the number n of RNA sequences for each genome
ni = + wi ðr Þcðr Þ; ð1Þ
r
position in mature miRNAs as well as the number K of the most
prevalent mismatch nucleotides at that position. Under the null
where the sum extends over all RNAs r that overlap the candidate hypothesis, we assume that the mismatches are caused by errors at
mapping site i, and wi(r) is the weight of RNA r mapping to can- a background probability p obtained from the error profiles cal-
didate mapping site i. culated by the cross-mapping correction. The tail probability is
We also calculate the error profile, defined as the relative then calculated as:
frequency of alignment errors as a function of the position k along
n
the RNA: n i
tail probability = + p ð1 pÞni : ð5Þ
i=k
i
+ wi ðr Þcðr Þai ðr; kÞ
r
pk = ; ð2Þ For non-integer K, n, this can be generalized as
+ wi ðr Þcðr Þ
r
tail probability = I p ðk; n K + 1Þ; ð6Þ
where ai(r; k) equals 1 if there is an alignment error at position k
along the RNA in the alignment of RNA r to candidate mapping where I is the regularized incomplete beta function.
location i, and 0 otherwise. In practice, we found that the error We apply Bonferroni’s correction for multiple hypothesis
profile is larger toward the 59 and 39 ends of the RNA, and essentially testing based on the number of sites N in mature miRNAs that were
flat toward the center of the RNA. We therefore calculate the error evaluated for the presence of overrepresented mismatches:
profile at the first, second, and third nucleotide at the 59 end; the Bonferroni-corrected tail probability = 1
first, second, and third nucleotide at the 39 end; and an average p at N
1 I p ðK; n K + 1Þ : ð7Þ
all intermediate positions. Supplemental Figure 5 shows the error
profile as calculated for the FANTOM4 short RNA time course data.
Using the error profile, we can now calculate the probability
Pr(r|si) that transcription of the candidate mapping region i with Capillary sequencing of the miR-629 region in the THP-1
genome sequence si will result in a sequenced transcript r: cell line
Y Y Total genome DNA from THP-1 cells was extracted with a DNeasy
Prðrjsi Þ = pk ð1 pk Þ; ð3Þ
k:ai ðr;kÞ = 1 k:ai ðr;kÞ = 0
Blood & Tissue Kit (Qiagen) according to the manufacturer’s
instructions. The DNA fragments of the miR-629 region were
where the product extends over all positions k along the alignment. amplified by PCR using the following two primers: SNP2_F:
By multiplying Equations 1 and 3, we find the number ni 3 59-AGACAGGACTGTGTAGGGTTGAG-39 and SNP2_R: 59-CCAA
Pr(r|si) of expected sequences r originating from candidate map- GAGGGTACTAGCAGATGATG-39. PCRs were performed in a total
ping site i. We assign weights to each candidate mapping site in volume of 50 mL, using 0.5 mg of THP-1 genome DNA/1 mL of DNA
proportion to this expected number of sequences: polymerase (5 U/mL)/5 mL of 103 reaction buffer/1mL of 2.5 mM
ni 3 Prðrjsi Þ dNTPs/3 mL of 1 mM SNP2_F primer/3 mL of 1 mM SNP2_R primer/
wi ðr Þ = ; ð4Þ water up to 50 mL. Samples were incubated at 94°C for 7 min, 45
+ nj 3 Pr rjsj
j cycles at 94°C for 30 sec/63°C for 30 sec/73°C for 1 min, and at
72°C for 1 min were performed. A total of 10 mL of PCR samples
where j iterates over all candidate mapping locations of sequence r. were separated in 3% agarose gel, and the target band (188 bp) was
Using these weights, we recalculate the expression level of cut out and extracted by Gelextraction kit (Qiagen). Capillary se-
each genome location i using Equation 1, the error profile using quencing was performed using the ABI3100 sequencer according
Equation 2, the probabilities Pr(r|si) using Equation 3, and the to the manufacturer’s instructions.
Genome Research 263

www.genome.org
de Hoon et al.
Acknowledgments Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead
B, Raney BJ, Pohl A, Pheasant M, et al. 2009. The UCSC Genome Browser
This work was supported by a Research Grant for the RIKEN Omics Database: Update 2009. Nucleic Acids Res 37: D755–D761.
Science Center from the Ministry of Education, Culture, Sports, Landgraf P, Rusu M, Sheridan R, Sewer A, Iovino N, Aravin A, Pfeffer S, Rice
A, Kamphorst AO, Landthaler M, et al. 2007. A mammalian microRNA
Science, and Technology, Japan to Y.H., and a grant of the Genome expression atlas based on small RNA library sequencing. Cell 129: 1401–
Network Project from the Ministry of Education, Culture, Sports, 1414.
Science and Technology, Japan to Y.H. (http://genomenetwork.nig. Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-
ac.jp/index_e.html). efficient alignment of short DNA sequences to the human genome.
Genome Biol 10: R25. doi: 10.1186/gb-2009-10-3-r25.
Lee I, Ajay SS, Chen H, Maruyama A, Wang N, McInnis MG, Athey BD. 2008.
Discriminating single-base difference miRNA expressions using
References microarray Probe Design Guru (ProDeG). Nucleic Acids Res 36: e27. doi:
10.1093/nar/gkm1165.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local Lee I, Ajay SS, Yook JI, Kim HS, Hong SH, Kim NH, Dhanasekaran SM,
alignment search tool. J Mol Biol 215: 403–410. Chinnaiyan A, Athey BD. 2009. New class of microRNA targets
Babiarz JE, Ruby JG, Wang Y, Bartel DP, Blelloch R. 2008. Mouse ES cells containing simultaneous 59-UTR and 39-UTR interaction sites. Genome
express endogenous shRNAs, siRNAs, and other Microprocessor- Res 19: 1175–1183.
independent, Dicer-dependent small RNAs. Genes & Dev 22: 2773– Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads
2785. and calling variants using mapping quality scores. Genome Res 18:
Bartel DP. 2004. MicroRNAs: Genomics, biogenesis, mechanism, 1851–1858.
and function. Cell 116: 281–297. Luciano DJ, Mirsky H, Vendetti NJ, Maas S. 2004. RNA editing of a miRNA
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. 2008. precursor. RNA 10: 1174–1177.
GenBank. Nucleic Acids Res 36: D25–D30. Manber U, Myers G. 1990. Suffix arrays: A new method for on-line string
Blow MJ, Grocock RJ, Van Dongen S, Enright AJ, Dicks E, Futreal PA, Wooster searches. In Proceedings of the first annual ACM-SIAM symposium on
R, Stratton MR. 2006. RNA editing of human microRNAs. Genome Biol 7: discrete algorithms. pp. 319–327. Society for Industrial and Applied
R27. doi: 10.1186/gb-2006-7-4-r27. Mathematics, Philadelphia.
Ebhardt HA, Tsang HH, Dai DC, Liu Y, Bostan B, Fahlman RP. 2009. Meta- Morin RD, O’Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu
analysis of small RNA-sequencing errors reveals ubiquitous post- AL, Zhao Y, McDonald H, Zeng T, Hirst M, et al. 2008. Application of
transcriptional RNA modifications. Nucleic Acids Res 37: 2461–2470. massively parallel sequencing to microRNA profiling and discovery in
Elbarbary RA, Takaku H, Uchiumi N, Tamiya H, Abe M, Takahashi M, human embryonic stem cells. Genome Res 18: 610–621.
Nishida H, Nashimoto M. 2009. Modulation of gene expression by Morozova O, Marra MA. 2008. Applications of next-generation sequencing
human cytosolic tRNase ZL through 59-half-tRNA. PLoS One 4: e5908. technologies in functional genomics. Genomics 92: 255–264.
doi: 10.1371/journal.pone.0005908. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping
The FANTOM Consortium and the Riken Omics Science Center. 2009. The and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods
transcriptional network that controls growth arrest and differentiation 5: 621–628.
in a human myeloid leukemia cell line. Nat Genet 41: 553–562. Obernosterer G, Leuschner PJF, Alenius M, Martinez J. 2006. Post-
Faulkner GJ, Forrest ARR, Chalk AM, Schroder K, Hayashizaki Y, Carninci P, transcriptional regulation of microRNA expression. RNA 12: 1161–1167.
Hume DA, Grimmond SM. 2008. A rescue strategy for multimapping Reid JG, Nagaraja AK, Lynn FC, Drabek RB, Muzny DM, Shaw CA, Weiss MK,
short sequence tags refines surveys of transcriptional activity by CAGE. Naghavi AO, Khan M, Zhu H, et al. 2008. Mouse let-7 miRNA
Genomics 91: 281–288. populations exhibit RNA editing that is constrained in the 59-seed/
The Genome Sequencing Consortium. 2001. Initial sequencing and analysis cleavage/anchor regions and stabilize predicted mmu-let-7a:mRNA
of the human genome. Nature 409: 860–921. duplexes. Genome Res 18: 1571–1581.
Gottwein E, Cai X, Cullen BR. 2006. A novel assay for viral microRNA Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP.
function identifies a single nucleotide polymorphism that affects 2006. Large-scale sequencing reveals 21U-RNAs and additional
Drosha processing. J Virol 80: 5321–5326. microRNAs and endogenous siRNAs in C. elegans. Cell 127: 1193–1207.
Jones MR, Quinton LJ, Blahna MT, Neilson JR, Fu S, Ivanov AR, Wolf DA, Schuster SC. 2008. Next-generation sequencing transforms today’s biology.
Mizgerd JP. 2009. Zcchc11-dependent uridylation of microRNA directs Nat Methods 5: 16–18.
cytokine expression. Nat Cell Biol 11: 1157–1163. Taft RJ, Glazov EA, Cloonan N, Simons C, Stephen S, Faulkner GJ, Lassmann
Jurka J. 2000. Repbase update: A database and an electronic journal of T, Forrest ARR, Grimmond SM, Schroder K, et al. 2009. Tiny RNAs
repetitive elements. Trends Genet 16: 418–420. associated with transcription start sites in animals. Nat Genet 41: 572–
Kahvejian A, Quackenbush J, Thompson JF. 2008. What would you do if you 578.
could sequence everything? Nat Biotechnol 26: 1125–1133. Thompson DM, Parker R. 2009. The RNase Rny1p cleaves tRNAs and
Katoh T, Sakaguchi Y, Miyauchi K, Suzuki T, Kashiwabara S-I, Baba T, Suzuki promotes cell death during oxidative stress in Saccharomyces cerevisiae.
T. 2009. Selective stabilization of mammalian microRNAs by 39 J Cell Biol 185: 43–50.
adenylation mediated by the cytoplasmic poly(A) polymerase GLD-2. Thomson JM, Newman M, Parker JS, Morin-Kensicki EM, Wright T,
Genome Res 23: 433–438. Hammond SM. 2006. Extensive post-transcriptional regulation of
Kawahara Y, Zinshteyn B, Sethupathy P, Iizasa H, Hatzigeorgiou AG, microRNAs and its implications for cancer. Genes & Dev 20: 2202–2207.
Nishikura K. 2007a. Redirection of silencing targets by adenosine-to- Viswanathan SR, Daley GQ, Gregory RI. 2008. Selective blockade of
inosine editing of miRNAs. Science 315: 1137–1140. microRNA processing by Lin28. Science 320: 97–100.
Kawahara Y, Zinshteyn B, Chendrimada TP, Shiekhattar R, Nishikura K. Winter J, Jung S, Keller S, Gregory RI, Diederichs S. 2009. Many roads to
2007b. RNA editing of the microRNA-151 precursor blocks cleavage by maturity: microRNA biogenesis pathways and their regulation. Nat Cell
the Dicer–TRBP complex. EMBO Rep 8: 763–769. Biol 11: 228–234.
Kawahara Y, Megraw M, Kreider E, Iizasa H, Valente L, Hatzigeorgiou AG, Yamasaki S, Ivanov P, Hu GF, Anderson P. 2009. Angiogenin cleaves tRNA
Nishikura K. 2008. Frequency and fate of microRNA editing in human and promotes stress-induced translational repression. J Cell Biol 185:
brain. Nucleic Acids Res 36: 5270–5280. 35–42.
Kawaji H, Nakamura M, Takahashi Y, Sandelin A, Katayama S, Fukuda S, Yang W, Chendrimada TP, Wang Q, Higuchi M, Seeburg PH, Shiekhattar R,
Daub CO, Kai C, Kawai J, Yasuda J, et al. 2008. Hidden layers of human Nishikura K. 2006. Modulation of microRNA processing and expression
small RNAs. BMC Genomics 9: 157. doi: 10.1186/1471-2164-9-157. through RNA editing by ADAR deaminases. Nat Struct Mol Biol 13: 13–21.
Kuchenbauer F, Morin RD, Argiropoulos B, Petriv OI, Griffith M, Heuser M,
Yung E, Piper J, Delaney A, Prabhu AL, et al. 2008. In-depth
characterization of the microRNA transcriptome in a leukemia
progression model. Genome Res 18: 1787–1797. Received July 1, 2009; accepted in revised form November 24, 2009.
264 Genome Research

www.genome.org

Cross-Mapping MicroRNA Identification

Uploaded by

Copyright:

Available Formats

Cross-Mapping MicroRNA Identification

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cross-Mapping MicroRNA Identification

Uploaded by

Copyright:

Available Formats

Methods

Cross-mapping and the identification of editing

sequences in the genome or if sequencing errors or post-tran-

258 Genome Research

Identification of editing sites in mature miRNAs

Genome Research 259

260 Genome Research

Table 1. Overrepresented mismatches in mature miRNA sequences

Without correcting After correcting for

miR-27b 19:T!C 7707 5586 0 2148 62 1 miR-27a

Genome Research 261

262 Genome Research

Genome Research 263

264 Genome Research

You might also like