Homer: Mapping Reads To The Genome

HOMER
Software for motif discovery and next-gen sequencing analysis
Mapping reads to the genome

Once you have checked your FASTQ files and have removed all adapter sequences that might be present, you are ready
to map them to a reference genome. While tools like BLAST and BLAT are powerful methods, they are not specialized for
the vast amount of data generated by next-generation sequencers. It is highly recommended that you use a next-gen
specific read alignment program.
Note: While BWA, Bowtie, and Tophat have received the most attention as short read alignment algorithms, new methods
such as STAR are significantly faster and in some cases more accurate. More like it are likely to come along soon (if not
already available...)
Selecting a reference genome

Both the organism and the exact version (i.e. hg18, hg19) are very important when mapping sequencing reads.
Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this
recommendation list when choosing a genome (Obviously try to match species or sub species when selecting a
genome):
1. Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it? Use that one.
2. Do any of your collaborators have a favorite genome/version?
3. Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily
visualize your data later using the UCSC Genome Browser. (i.e. mm9, hg19)
Mapping to a normal genomic reference
You want to map your reads directly to the genome if you are performing:
ChIP-Seq
GRO-Seq
DNase-Seq
MNase-Seq
Hi-C
Popular short read aligners
Full List
Most Popular:
bowtie : fast, works well
bowtie2 : fast, can perform local alignments too
BWA - Fast, allows indels, commonly used for genome/exome resequencing
Subread - Very fast, (also does splice alignment)
STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory)
To be honest, I would probably recommend STAR for almost any application at this point if you have
the memory (see below)
Example of alignment with bowtie2:
Step 1 - Build Index (takes a while, but only do this once):
After installing bowtie2, the reference genome must first be "indexed" so that reads may be
quickly aligned. You can download pre-made indices from the bowtie website (check for those
here first). Please be aware that bowtie2 indexes are different than "bowtie" indexes. To
perform make your own from FASTA files, do the following:
1. Download FASTA files for the unmasked genome of interest if you haven't already (i.e.
from UCSC). Do NOT use masked sequences. I also tend to remove the "*_random.fa"
chromosomes. These often contain part of the canonical chromosomes in addition to
regions that cannot be placed in the assembly. The problem with these regions is that
the part shared with the canonical chromosome will be present twice, making it difficult to
map the reads to a unique location.
2. Concatenate FASTA files into a single file. We can do this using the UNIX cat command,
which merges files together
cat *.fa > genome.fa
3. From the directory containing the genome.fa file, run the "bowtie2-build" command. The
default options usually work well for most genomes. For example, for hg19:
/path-to-bowtie-programs/bowtie2-build genome.fa hg19
This command will create 6 files with a *.bt2 file extension. These will then be used by
bowtie2 or newer versions of tophat to map data.
4. Copy the *.bt2 files to the bowtie2 indexes directory (or somewhere you can store them)
so that bowtie2 knows where to find them later:
cp *.bt2 /path-to-bowtie-programs/indexes/
Step 2 - Align sequences with bowtie (perform for each experiment):
The most common output format for high-throughput sequencing is FASTQ format, which
contains information about the sequence (A,C,G,Ts) and quality information which describes
how certain the sequencer is of the base calls that were made. In the case of Illumina
sequencing, the output is usually a "s_1_sequence.txt" file. More recently the Illumina
pipeline will output a file that is debarcoded with your sample name such as
"Experiment1.fastq". In addition, much of the data available in the SRA, the primary archive of
high-throughput sequencing data, is in this format.
To use bowtie2 to map this data, run the following command:
/path-to-bowtie-programs/bowtie2 -p <# cpu> -x <genome index prefix> <fastq file>

> <output filename>
i.e.
/programs/bowtie2 -p 8 -x hg19 Experiment1.fastq > Experiment1.sam
Where <genome index prefix> is the common prefix for the *.bt2 files that were created using the
bowtie2-build command in step 1, or from a downloaded index. If the *.bt2 files are stored int the
"/path-to-bowtie2-program/indexes/" directory, you only need to specify the name of the index. If the
index files are located elsewhere, you can specify the full path names of the index files (in the
examples above this would be "-x /programs/indexes/hg19").
In the example above, we use 8 processors/threads. The bowtie2 program is very parallel in nature,
with near linear speed up with additional processors.
The default output is a SAM file. To learn more about SAM alignment files, go to the next section on
SAM/BAM files.
There are many other options for bowtie2 that may be important for your study, but most ChIP-Seq
data can be mapped using the default options.
NOTE: Usually, the process of removing duplicate reads or removing non-unique alignments is
handled by the downstream analysis program.
Example of alignment with bwa:
bwa index -a bwtsw genome.fa
Step 2 - Align reads to the index (perform for each experiment):
# where the genome.fa is in the same directory with your index from the first step
bwa mem -t #cpus genome.fa reads.fq > aln-se.sa
#paired end
bwa mem -t #cpus genome.fa reads1.fq reads2.fq > aln-pe.sam
Mapping to a genome while allowing splicing
Usually, any kind of RNA-seq method will benefit from looking for splicing junctions in addition to genomic
mapping:
RNA-Seq (polyA+, total)

CLIP-Seq/PAR-CLIP
RIP-Seq
ChIRP-Seq
Ribo-Seq
Popular splice read aligners
Tophat (most popular)

Subread - Very fast, (also does splice alignment)
STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory)
lots of others...
I would probably recommend STAR for RNA-Seq is you have enough RAM (see below)
Example of aligning RNA-Seq data with STAR (very very fast)
STAR is one of a growing number of short read aligners that takes advantage of advances in
computational power to optimize the short read mapping process (original publication: Dobin et al.)
The key limitation with STAR is computer RAM - STAR requires at least 30 Gb to align to the human
or mouse genomes.
To install STAR, visit the website and follow their instructions.
Step 1 - Build a genome index
Like all aligners, you need to build the genome index first. The STAR website has links to
the hg19 genome index if you want to skip this step. First, make a directory for the index
(i.e. mm9-starIndex/). Then copy the genome FASTA file it the directory and cd into it to
make that directory your current directory. Then, to build the index, the command is:
STAR --runMode genomeGenerate --runThreadN <# cpus> --genomeDir

<genome output directory> --genomeFastaFiles <input Genome FASTA
file>
i.e. STAR --runMode genomeGenerate --runThreadN 24 --genomeDir ./ --

genomeFastaFiles mm9.fa
Note: For small genomes, you may need to add the following to create a proper index: "--
genomeSAindexNbases 6"
Step 2 - Align RNA-Seq Reads to the genome with STAR
To align RNA-Seq reads with STAR, run the following command:
STAR --genomeDir <Directory with the Genome Index> --runThreadN <#

cpus> --readFilesIn <FASTQ file> --outFileNamePrefix <OutputPrefix>
i.e. STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn

Experiment1.fastq --outFileNamePrefix Experiment1Star
# paried-end data:
STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn
read1.fastq read2.fastq --outFileNamePrefix Experiment1Star
STAR will create several output files - the most important of which is the
"*.Aligned.out.sam" - you will likely want to convert this to a bam file and sort it to use it
with other programs. The default output is a SAM file. To learn more about SAM
alignment files, go to the next section on SAM/BAM files.
Notes on STAR
STAR is very very fast - it will rip through 20 million reads in a matter of minutes if you
have 20 cpus working for you. In fact, the longest part of the program is loading the
index into memory. If you are aligning several experiments in a row, add the option "--
genomeLoad LoadAndKeep" and STAR will load the genome index into shared
memory so that it can use it for the next time you run the program. Also, converting the
sam file into a sorted bam file will take much longer than aligning the data in the first
place.
Example of Alignment with Tophat (not recommended)
Tophat is basically a specialized wrapper for bowtie2 - it manipulates your reads and aligns them
with bowtie2 in order to identify novel splice junctions. It can also use given splice junctions/gene
models to map your data across known splice junctions.
This part is exactly the same as for bowtie2 - if you already made or downloaded an
index for bowtie2, you can skip this step.
Step 2 - Align your RNA-seq data to the genome using Tophat
To use tophat to map this data, run the following command:
/path-to-bowtie-programs/tophat -o <output directory> -p <# cpu>

</path-to-genome-index/prefix> <fastq file>
For example:
/programs/tophat -o TophatOutput/ -p 8 /programs/indexes/hg19
Experiment1.fastq
Paired-end Example:
/programs/tophat -o TophatOutputPE/ -p 8 /programs/indexes/hg19
Experiment1.r1.fastq Experiment1.r2.fastq
In the example above, we use 8 processors/threads. The tophat2 program contains a

mix of serial and parallel routines, so more processors doesn't necessarily mean it will be
finished much faster. In particular, if you are performing coverage based searchers, it
may take a long time (multiple processors will not help). In general, if you have multiple
samples to map, it's better to map them at the same time with separate commands rather
than mapping them one at a time with mapping processors.
Tophat places several output files in an output directory. The most important is the
"accepted_hits.bam" file - this is the alignment file that will be used for future analysis
(more info here). There are additional files that can be useful, including a "junctions.bed"
file with records all of the splice junctions discovered in the data.
Important Tophat Parameters:
--library-type <fr-unstranded | fr-firststrand | fr-secondstrand>
Describes which method was used to generate your RNA-seq library. If you
used a method that doesn't produce strand specific signal (i.e. just
sequencing a cDNA library), then select the fr-unstranded. If you use a
stranded method that sequences the first DNA strand made (like a dUTP
method), then use fr-firststrnad. If you use direct ligation methods, then fr-
secondstrand. Correctly specifying the library type will help Tophat identify
splice junctions.
-G <GTF file>
Use this option to specify a known transcriptome to map the reads against.
By default, tophat will also search for de novo splice events, but this will help
it known were the known ones are so that you don't miss them. A GTF files
are called Gene Transfer Files, and a description of the format can be found
here. To get a GTF file for your organism, you can usually get one from
UCSC Table Browser:
In the output format, be sure to select GTF file - the file you download from
here should work with tophat.
Tophat Mapping Strategy
If your goal is to identify novel transcripts and you have several separate experiments, I
would recommend pooling all of your data together into a single expeirment/FASTQ file
and mapping your data in one run. One of the ways Tophat tries to identify novel
junctions is by first identifying "exons" by mapping segments of reads to the genome
using bowtie2. The more segments it's able to map, the more confident it is about
putative exons and the greater the chance it will identify unannotated splice sites.
You can then go back with your novel splice sites and remap the original experiments
(not pooled) to get reads for each individual experiment using the "-j <raw junction
file>". This is most useful for short reads. This general strategy is also useful if running
cufflinks...
Mapping bisulfite-treated DNA
MethylC-Seq, BS-Seq, or RBBS-Seq data requires a special mapping strategy. In these experiments the
genomic DNA is bisulfite treated, causing all unmethylated cytosines to be converted to uracil, which will
utimately show up as thymine after sequencing. This is a clever way to figure out which cytosines are
methylated in the genome, but requires a clever mapping strategy to avoid bias detection. I'll try to include an
example of mapping this type of data in the near future, for now consider this list of BS-aligners.
De novo Assembly
Sometimes it makes more sense to perform de novo assembly instead of mapping reads to a reference
genome. Assembly is well beyond the scope of this tutorial.
Genomics assembly from short reads: Velvet, SOAPdenovo
Transcript assembly: Trinity
Can't figure something out? Questions, comments, concerns, or other feedback:

cbenner@salk.edu

Homer: Mapping Reads To The Genome

Uploaded by

Copyright:

Available Formats

Homer: Mapping Reads To The Genome

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Homer: Mapping Reads To The Genome

Uploaded by

Copyright:

Available Formats

HOMER

Software for motif discovery and next-gen sequencing analysis

Mapping reads to the genome

Selecting a reference genome

Mapping to a normal genomic reference

Popular short read aligners

Example of alignment with bowtie2:

Step 1 - Build Index (takes a while, but only do this once):

cat *.fa > genome.fa

/path-to-bowtie-programs/bowtie2-build genome.fa hg19

Step 2 - Align sequences with bowtie (perform for each experiment):

To use bowtie2 to map this data, run the following command:

/path-to-bowtie-programs/bowtie2 -p <# cpu> -x <genome index prefix> <fastq file>

Example of alignment with bwa:

Step 1 - Build Index (takes a while, but only do this once):

bwa index -a bwtsw genome.fa

Step 2 - Align reads to the index (perform for each experiment):

bwa mem -t #cpus genome.fa reads.fq > aln-se.sa

RNA-Seq (polyA+, total)

Popular splice read aligners

Tophat (most popular)

Example of aligning RNA-Seq data with STAR (very very fast)

To install STAR, visit the website and follow their instructions.

Step 1 - Build a genome index

STAR --runMode genomeGenerate --runThreadN <# cpus> --genomeDir

i.e. STAR --runMode genomeGenerate --runThreadN 24 --genomeDir ./ --

Step 2 - Align RNA-Seq Reads to the genome with STAR

To align RNA-Seq reads with STAR, run the following command:

STAR --genomeDir <Directory with the Genome Index> --runThreadN <#

i.e. STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn

Example of Alignment with Tophat (not recommended)

Step 1 - Build Index (takes a while, but only do this once):

Step 2 - Align your RNA-seq data to the genome using Tophat

To use tophat to map this data, run the following command:

/path-to-bowtie-programs/tophat -o <output directory> -p <# cpu>

In the example above, we use 8 processors/threads. The tophat2 program contains a

Important Tophat Parameters:

--library-type <fr-unstranded | fr-firststrand | fr-secondstrand>

Tophat Mapping Strategy

Mapping bisulfite-treated DNA

Genomics assembly from short reads: Velvet, SOAPdenovo

Transcript assembly: Trinity

Can't figure something out? Questions, comments, concerns, or other feedback:

You might also like