Newman2018 Protocol TheEnsemblGenomeBrowserStrateg
Abstract
The Ensembl Genome Browser provides a wealth of freely available genomic data that can be accessed for
many purposes by genetics, genomics, and molecular biology researchers. Herein we present two protocols
for exploring different aspects of these data: a phenotype and its associated variants and genes, and a promoter
and the epigenetic marks and protein-binding activity associated with it. These workflows illustrate
a subset of the data types available through the Ensembl Browser, and can be considered a springboard for
further exploration.
1 Introduction
Martin Kollmar (ed.), Eukaryotic Genomic Databases: Methods and Protocols, Methods in Molecular Biology, vol. 1757,
https://doi.org/10.1007/978-1-4939-7737-6_6, © The Author(s) 2018
116 Victoria Newman et al.
Fig. 1 Ensembl features. Ensembl integrates gene annotation, genetic variation, gene regulation data,
and comparative genomics onto a single genomic platform. Gene annotation is carried out in house, annotating
the full intron–exon structure of coding and noncoding transcripts. Short variants, such as SNPs and indels, are
pulled into Ensembl from external databases, alongside structural variants and copy-number variants. ChIP-seq
and DNase-seq data are used for in-house prediction of regions of open chromatin and regulatory elements such
as promoters, enhancers and CTCF binding sites on the genome and their activity in different cell types. Whole
genome alignments and gene tree analyses are carried out in house to compare species in Ensembl. These data
are presented alongside each other on the genome in the Ensembl browser, and can also be accessed for bulk
export through BioMart, programmatically through APIs and as flat files on the Ensembl FTP site
Fig. 2 The Ensembl homepage. The Ensembl homepage provides access to a search function, which can
retrieve information associated with, for example, genes, transcripts, proteins, variants, phenotypes, and ontology
terms. In addition, links are available to Ensembl’s sister site, Ensembl Genomes, as well as to the most-searched
genomes and a complete list of annotated genomes. Fully annotated genomes are available on the
main Ensembl site, while genomes whose annotation is in progress can be browsed on the Ensembl Pre! site.
Ensembl maintains web interfaces of archived versions for 5 years. These can be accessed from a link in the
lower right-hand corner. Documentation and help pages can be accessed from the homepage, as well as in-house
and external tools integrated into the Ensembl web interface. A dedicated page describing data-download
strategies is also available and presents links to the point-and-click tool BioMart, which permits bulk
download of Ensembl datasets with no requirement for programming expertise, as well as to the APIs and FTP site
of Ensembl can be retrieved from the FTP site, or from our databases
via the Perl APIs, in perpetuity.
Beyond providing access to data related to publicly available
genome annotation, Ensembl integrates a number of tools designed
to process or analyze your own data. The ID History Converter
2 Materials
3 Methods
3.1 WF1: Phenotype-Based Searches and Identification of Associated Genetic Variation
The Ensembl browser can be searched using a variety of terms, including
genomic regions, genes, variants, or phenotypes; the following
workflow describes a phenotype-based search that highlights data and
annotations collated in the Phenotype, Variant, Gene, and Transcript
tabs.
Non-melanoma skin cancer (principally basal cell and squamous
cell carcinomas) is a relatively common pathology associated
with variants in several genes [21].
1. Getting started: To explore the phenotype in more detail, type
“non-melanoma skin cancer” into the search box on the
Ensembl home page, www.ensembl.org, and click the “Go”
button. The search autocomplete may retrieve direct links to
The Ensembl Genome Browsers 119
Fig. 3 The Ensembl phenotype tab. The Ensembl phenotype tab allows you to explore the phenotype ontology
associated with a phenotype and any loci (variants, QTLs, or genes) linked to the phenotype. Loci associated
with the phenotype are shown in a table on the Associated loci page. The buttons above the table allow filtering.
Links take you to the database and/or paper where the link between locus and phenotype was made
Fig. 4 The Ensembl variation tab. The Ensembl variation tab provides a wealth of information about a particular
variant, such as a SNP or indel. (A) A variant summary is shown on all pages in the variant tab, including alleles,
MAF, and evidence status. The menu at the left-hand side provides links to all the pages providing information
on the variant. (B) Pie charts from the Population Genetics page, showing the allele frequencies for the variant
in the 1000 Genomes populations. (C) The Genes and Regulation table, listing all genes affected by the variant
with details of sequence ontology consequences, position in the gene and protein, and SIFT and PolyPhen
scores for amino acid changes (where relevant)
Below the consequence, you can see that the reference allele of
rs1805007 at the genomic position 16:89919709 is C, and one
alternative allele, T, has been observed. Minor allele frequency
(MAF) has been calculated for the alternative allele, which was
observed in 1000 Genomes Project participants: it was identified
in 2% of participants in that study [2, 26].
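The same summary fields can also be retrieved programmatically. The following is a minimal sketch, assuming the public Ensembl REST server at rest.ensembl.org and its variation endpoint; the server address, the JSON field names (mappings, allele_string, MAF, minor_allele) and the helper function names are assumptions that are not taken from this chapter and should be checked against the current REST documentation.

```python
import json
import urllib.request

# Assumption: public Ensembl REST server; not described in this chapter.
REST_SERVER = "https://rest.ensembl.org"


def variation_url(rsid, species="human"):
    """Build the REST URL for a single variant record."""
    return f"{REST_SERVER}/variation/{species}/{rsid}"


def fetch_variation(rsid, species="human"):
    """Download and decode the variant record (needs network access)."""
    request = urllib.request.Request(
        variation_url(rsid, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_variation(record):
    """Keep only the fields displayed in the browser's variant summary."""
    mapping = record["mappings"][0]  # first genomic mapping only
    # Assumed format: reference allele first, alternatives after it ("C/T").
    reference, *alternatives = mapping["allele_string"].split("/")
    return {
        "name": record["name"],
        "location": mapping["location"],
        "reference_allele": reference,
        "alternative_alleles": alternatives,
        "minor_allele": record.get("minor_allele"),
        "maf": record.get("MAF"),
    }
```

Calling summarize_variation(fetch_variation("rs1805007")) would be expected to return the location, alleles, and MAF discussed above, although the exact values depend on the Ensembl release being queried.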
Navigating to the Variant tab from the Phenotype tab automatically
loads a table containing the phenotype data relating to
this variant, as mentioned above. Tanning ability, sensitivity to
sun, and fair hair and skin color have all been associated with the
variant, as has basal cell carcinoma, a form of non-melanoma
skin cancer. Collectively, these phenotypes are consistent with the
observed linkage between fair complexions and sensitivity to sun
exposure.
4. The menu on the left presents additional options. Click
“Population genetics” to view allele frequencies in global
populations.
On this page, data from the 1000 Genomes [26], HapMap
[27], and NHLBI Exome Sequencing [28] Projects and the
Exome Aggregation Consortium (ExAC) [29] are displayed.
The data from the 1000 Genomes Project are shown at the top
(Fig. 4B); the pie-charts represent allele frequencies for different
superpopulations. Allele frequencies for subpopulations within
each superpopulation can be viewed by clicking the “Sub-
populations” link beneath the corresponding superpopulation.
Allele and genotype frequencies among 1000 Genomes Project
participants can also be found in tabular form immediately
below the graphical views.
The frequency of the T variant allele in 1000 Genomes Project
participants is highest among European subgroups, and individuals
homozygous for the variant also occur only in these subgroups.
This is expected given the phenotypes associated with the variant
(Fig. 4B).
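The comparison of allele frequencies across populations can be reproduced outside the browser. The sketch below assumes that the Ensembl REST variation endpoint accepts a pops=1 parameter and returns a populations list whose entries carry population, allele, and frequency keys, and that 1000 Genomes populations are named with the prefix 1000GENOMES:phase_3:. These names are assumptions rather than details given in this chapter.

```python
import json
import urllib.request

# Assumptions: public REST server and 1000 Genomes population naming scheme.
REST_SERVER = "https://rest.ensembl.org"
KG_PREFIX = "1000GENOMES:phase_3:"


def fetch_with_populations(rsid, species="human"):
    """Download a variant record including population frequencies."""
    request = urllib.request.Request(
        f"{REST_SERVER}/variation/{species}/{rsid}?pops=1",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def allele_frequencies(record, allele, prefix=KG_PREFIX):
    """Return (population, frequency) pairs for one allele, highest first."""
    rows = [
        (entry["population"][len(prefix):], entry["frequency"])
        for entry in record.get("populations", [])
        if entry["population"].startswith(prefix) and entry["allele"] == allele
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

For example, allele_frequencies(fetch_with_populations("rs1805007"), "T") would list the populations in which the T allele is most frequent first, which offers a quick check of the pattern seen in the pie charts.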
5. To explore genes and transcripts with which the variant is associated,
click “Genes and regulation” in the menu on the left.
As we saw previously, the variant lies within the MC1R gene;
the summary table here indicates that it overlaps two independent
transcripts of this gene as a missense variant and is a downstream
gene variant of a third transcript. Other genes and
transcripts affected by the variant, as well as the associated consequences,
are also shown (Fig. 4C).
In a second table, called “Gene expression correlations,” you
can find a list of genes whose expression has been found by the
GTEx Project to be affected by the variant of interest [30].
Finally, any regulatory features or motifs in which the variant
falls will be listed in two separate tables at the bottom of the page.
There are no regulatory features or motifs that overlap the variant
rs1805007.
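A per-transcript consequence table similar to the one on this page can be assembled from the Variant Effect Predictor. This is a hedged sketch that assumes the Ensembl REST vep/:species/id/:id endpoint and the field names transcript_consequences, gene_symbol, transcript_id, consequence_terms, sift_prediction, and polyphen_prediction; none of these are specified in this chapter, and the function names are illustrative only.

```python
import json
import urllib.request

# Assumption: public Ensembl REST server exposing the VEP endpoint.
REST_SERVER = "https://rest.ensembl.org"


def fetch_vep(rsid, species="human"):
    """Run VEP on a known variant identifier (needs network access)."""
    request = urllib.request.Request(
        f"{REST_SERVER}/vep/{species}/id/{rsid}",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def transcript_consequences(vep_results):
    """Flatten VEP output into one row per affected transcript."""
    rows = []
    for result in vep_results:
        for consequence in result.get("transcript_consequences", []):
            rows.append({
                "gene": consequence.get("gene_symbol"),
                "transcript": consequence["transcript_id"],
                "consequences": ", ".join(consequence["consequence_terms"]),
                # Predictions are only present for amino acid changes.
                "sift": consequence.get("sift_prediction"),
                "polyphen": consequence.get("polyphen_prediction"),
            })
    return rows
```

Rows whose sift and polyphen entries are empty correspond to consequences, such as downstream gene variants, for which no amino acid change is predicted.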
Fig. 5 The Ensembl gene tab. The Ensembl gene tab provides a number of views to look at different aspects of
a gene. (A) The gene summary page includes a graphical depiction of the transcripts of the gene, shown
against the genome. The central contig indicates the genome. Positive stranded genes, such
as MC1R, are depicted above the contig. Strand is also indicated by an arrow alongside the transcript name
indicating the direction of transcription, and by introns, which are shown pointing upward on positive stranded
genes and downward on negative stranded genes. Some transcripts have been removed from this image for
size. On all pages in the gene tab, a menu on the left-hand side lists all the pages available for looking at a
gene. (B) Three pages are available for looking at the GO terms associated with a gene, conforming to the three
categories of terms, Biological process, Molecular function, and Cellular component. These are listed for each
gene, including which transcript they are associated with and how they were annotated
Fig. 6 The Ensembl transcript tab. The transcript tab contains all views for looking at a transcript and its associated
protein, where relevant. (A) The left-hand menu on the transcript tab lists all the pages for looking at
transcripts and proteins, and differs subtly from the gene tab menu. It has three different sequence views,
allowing you to view the exon and intron sequences in a table, an alignment of the cDNA, CDS and peptide
sequences, and the protein sequence only. As you open different features, such as genes, transcripts, and
variants, tabs appear in the top bar, allowing easy navigation between the different features you’ve been looking
at. (B) The Supporting evidence page shows which cDNA and protein evidence was used to annotate the
transcript model
15. Filter the table to view missense variants between amino acid
coordinates 150–160.
(a) Filter the table for missense variants by clicking
“Consequence” in the Filter section, then “Turn All Off”
and “Missense variant.”
(b) Filter the table to view variants at a specific amino acid
coordinate within the translated sequence of the transcript
by clicking on “Filter Other Columns,” then “AA Coord.”
Use the sliders to restrict the area for which variants are
shown to 150–160.
You can filter this table in numerous ways, including by consequence,
source, and genomic or amino-acid coordinates (Fig. 7B).
For missense variants, there are also options to filter by predicted
pathogenicity score, as determined by SIFT [39] and/or PolyPhen
[40] (PolyPhen calculations are available only for human variants).
SIFT and PolyPhen pathogenicity predictions have been
calculated for rs1805007 and the amino acid substitution is considered
deleterious. (An additional variant, rs149922657, has
been observed at the same position of MC1R, but has not been
associated with any phenotype.)
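The two filters applied in step 15 amount to a consequence test combined with a coordinate range test. The sketch below illustrates that logic on an exported copy of the variant table; the row layout (id, consequences, and aa_coord keys) is hypothetical and would need to be adapted to the column names of the actual export.

```python
def filter_variants(rows, consequence="missense_variant", aa_range=(150, 160)):
    """Keep rows with the given consequence inside an amino acid window.

    Each row is a dict with hypothetical keys:
      "id"           variant identifier
      "consequences" list of sequence ontology consequence terms
      "aa_coord"     amino acid coordinate, or None for non-coding variants
    """
    low, high = aa_range
    return [
        row for row in rows
        if consequence in row["consequences"]
        and row.get("aa_coord") is not None
        and low <= row["aa_coord"] <= high  # window is inclusive at both ends
    ]
```

Both ends of the window are inclusive, which matches the idea of setting the sliders to 150 and 160; variants without an amino acid coordinate are dropped because they cannot fall within a protein window.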
16. Click “Haplotypes” in the left-hand menu.
This page allows you to view linked variants that tend to be
coinherited. By default, the amino acid identities and coordinates
of each haplotype are shown, along with their frequencies in
different 1000 Genomes Project populations [26]; however, clicking
“switch to CDS view” at the top of the table will show nucleotide
sequences instead (Fig. 8A). The fifth haplotype listed in the
protein-haplotype table represents our variant of interest. The frequency
of this haplotype is, as already seen, higher in the European
subgroup. Lower in the table is the 151R>H haplotype
corresponding to rs149922657, the other variant observed at position
151; this variant was recovered in only two 1000 Genomes
Project participants.
Clicking on any haplotype will load a table indicating its frequencies
in different 1000 Genomes populations in more detail
(Fig. 8B), as well as a sequence view highlighting the nucleotide
and amino-acid positions altered, if applicable (Fig. 8C).
17. Exporting Ensembl variation data: Data can be exported from
Ensembl at multiple scales. A link to the BioMart tool, which
permits the download of customized datasets at intermediate
scale, can be found in the navigation bar at the top of all
Ensembl pages (Fig. 2) [12]. In the BioMart interface, select
the Dataset “Ensembl Variation” (this will also include the
release number, which is 88 at the time of writing), then
“Human Short Variants (SNPs and indels excluding flagged
variants).” To download all variants of ≤50 bp lying within
Fig. 7 Table of short variants found within a transcript. The variant table lists all the variants found within a
transcript. A similar page can be found in the gene tab listing all the variants in a gene. The table lists the variants,
which are links to the variant tab, with their positions, alleles, SO consequences, and predicted protein
effects. Buttons above the table allow you to filter the table to only show variants of interest. (A) The unfiltered
table for MC1R-001. (B) The same table, filtered to only show missense variants between residues 150 and
160. The applied filters are shown above the table and can be easily removed
Fig. 8 Representation of protein haplotypes found in 1000 Genomes individuals. For each of the individuals in
the 1000 Genomes population, the complete protein and CDS sequences were calculated. Sets of cosegregating
variants were defined as protein and transcript haplotypes, their frequencies determined and listed in the
Transcript haplotype page. (A) The table lists all the haplotypes found by the amino acid change. Click on the
haplotype for more details (shown in panels B and C). By default, the page shows the protein haplotypes, but
can be switched to show the CDS haplotypes. (B) The frequency of the selected haplotype across 1000
Genomes subpopulations. (C) An alignment of the reference and haplotype protein and CDS sequences
3.2 WF2: Gene-Based Searches and Identification of Regulatory Features in a Genomic Region
The following workflow describes a gene-based search and indicates
some of the data and annotations collated in the Gene tab and
Regulation tab.
The POU5F1 gene, formerly known as OCT4, encodes one of
the so-called “Yamanaka factors” implicated in cellular
de-differentiation and induction of pluripotency [41, 42]. We can
search Ensembl to view the POU5F1 gene model and associated
annotation, including predicted regulatory features.
1. Getting started: Type “POU5F1” into the search box on the
homepage, www.ensembl.org, or in the upper right corner of
any browser page, and click “Go.” This will generate a search-
results page with “POU5F1 (Human Gene)” as the top hit.
Click the title link to navigate directly to the POU5F1 Gene
tab.
The gene “Summary” containing a graphical representation
of the gene model loads by default following navigation from the
search results.
2. Downloading gene sequences: The sequence of the gene and
flanking regions can be downloaded from the Gene tab in two
ways.
(a) To download the sequence in FASTA format for processing
in an external tool, simply click the “Export data” button
below the left-hand menu.
This will open a pop-up window that presents customization
options.
(b) To view the POU5F1 sequence in the browser, click
“Sequence” in the left-hand menu.
This opens a display in FASTA format; buttons to download
and to BLAST the sequence are shown on this page, and download
customization options are similarly available (Fig. 10).
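For scripted use, a gene sequence with flanking regions can also be requested in FASTA format. The following is a minimal sketch, assuming the Ensembl REST sequence/id endpoint with its type, expand_5prime, and expand_3prime parameters; the endpoint, parameter names, and helper names are assumptions not taken from this chapter. The gene's stable identifier is left as an argument and can be copied from the gene Summary page.

```python
import urllib.parse
import urllib.request

# Assumption: public Ensembl REST server; not described in this chapter.
REST_SERVER = "https://rest.ensembl.org"


def sequence_url(stable_id, flank_5prime=0, flank_3prime=0):
    """Build the URL for a genomic sequence with optional flanks (in bp)."""
    query = urllib.parse.urlencode({
        "type": "genomic",
        "expand_5prime": flank_5prime,
        "expand_3prime": flank_3prime,
    })
    return f"{REST_SERVER}/sequence/id/{stable_id}?{query}"


def fetch_fasta(stable_id, flank_5prime=0, flank_3prime=0):
    """Download the sequence as FASTA text (needs network access)."""
    request = urllib.request.Request(
        sequence_url(stable_id, flank_5prime, flank_3prime),
        headers={"Content-Type": "text/x-fasta"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    return lines[0][1:], "".join(lines[1:])
```

The parser joins the wrapped sequence lines into one string, which is usually the form needed by downstream tools; it handles a single record only.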
3. Exploring regulatory features: Select “Summary” in the left-
hand menu. Scroll down to the graphical view of the gene
model and locate the Regulatory Build track.
The Regulatory Build depicts regulatory features that have
been annotated based on epigenome-scale data imported from
sources such as ENCODE [43], Roadmap Epigenomics [44] and
Fig. 9 The BioMart interface. BioMart allows easy export of tables of gene, variant, or regulatory feature data.
A video tutorial for BioMart is available at
https://www.youtube.com/watch?v=QvGT2G0-hYA&ab_channel=EnsemblHelpdesk
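BioMart queries can also be submitted without the web interface, as an XML document sent to the BioMart service. The sketch below assumes the service address www.ensembl.org/biomart/martservice, the dataset name hsapiens_snp for the human short variants dataset, the filter names chr_name, start, and end, and the attribute names shown in the code. All of these are assumptions that are not given in this chapter and should be confirmed in the BioMart interface before use.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumptions: service URL, dataset, filter and attribute names (see lead-in).
BIOMART_URL = "https://www.ensembl.org/biomart/martservice"
DEFAULT_ATTRIBUTES = ("refsnp_id", "chr_name", "chrom_start", "allele")


def build_variant_query(chromosome, start, end, attributes=DEFAULT_ATTRIBUTES):
    """Return BioMart query XML for short variants in a genomic region."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset = ET.SubElement(
        query, "Dataset", {"name": "hsapiens_snp", "interface": "default"}
    )
    for name, value in (("chr_name", chromosome), ("start", start), ("end", end)):
        ET.SubElement(dataset, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(dataset, "Attribute", {"name": name})
    prologue = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prologue + ET.tostring(query, encoding="unicode")


def run_query(query_xml):
    """Submit the query and return tab-separated text (needs network access)."""
    url = BIOMART_URL + "?" + urllib.parse.urlencode({"query": query_xml})
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

Building the XML with ElementTree rather than string formatting keeps attribute values correctly escaped. The region is passed in as arguments, so the same function can be reused for any window of interest.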
Fig. 10 Exporting gene sequence from Ensembl. All sequence views in Ensembl allow download in either plain
FASTA or annotated rich text format (RTF)
Fig. 12 Evidence for and activity of regulatory features in different cell types. The Details by cell type page in
the regulatory feature tab can be manipulated to show the activity of the feature in cell types of interest using
the buttons at the top. For each cell type, the feature is shown colour-coded to indicate its activity, with the
evidence shown below. The evidence is ChIP-seq and DNase-seq data, and is shown as peaks of significant
activity and as signal giving the number of reads. The top of the peak is indicated in the peak bar by pairs of
black arrows. Black blocks in the regulatory features indicate the position of transcription factor binding
motifs, which are listed in a pop-up when clicked on
Fig. 13 Adding regulation tracks to a region view. The Region in detail view displays a genomic region and can
be customized to show tracks of interest using the Configure this page button. This opens a detailed menu
listing all the available tracks, using categories on the left, including regulatory features and evidence.
Regulatory evidence can be added using a matrix selector, listing the cell type along the top and type of evidence
down the left
4 Discussion
Funding/Acknowledgments
References
1. Aken BL, Ayling S, Barrell D et al (2016) The Ensembl gene annotation system. Database (Oxford) 2016. https://doi.org/10.1093/database/baw093
2. Chen Y, Cunningham F, Rios D et al (2010) Ensembl variation resources. BMC Genomics 11:293. https://doi.org/10.1186/1471-2164-11-293
3. Herrero J, Muffato M, Beal K et al (2016) Ensembl comparative genomics resources. Database (Oxford) 2016. https://doi.org/10.1093/database/baw053
4. Zerbino DR, Johnson N, Juetteman T et al (2016) Ensembl regulation resources. Database (Oxford) 2016. https://doi.org/10.1093/database/bav119
5. Kent WJ, Sugnet CW, Furey TS et al (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006. https://doi.org/10.1101/gr.229102. Article published online before print in May 2002
6. Robinson JT, Thorvaldsdottir H, Winckler W et al (2011) Integrative genomics viewer. Nat Biotechnol 29(1):24–26. https://doi.org/10.1038/nbt.1754
7. Hubbard T, Barker D, Birney E et al (2002) The Ensembl genome database project. Nucleic Acids Res 30(1):38–41
8. The Ensembl Browser. http://www.ensembl.org
9. Kersey PJ, Allen JE, Armean I et al (2016) Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Res 44(D1):D574–D580. https://doi.org/10.1093/nar/gkv1209
10. The Ensembl Genomes Browser. http://www.ensemblgenomes.org
11. Aken BL, Achuthan P, Akanni W et al (2017) Ensembl 2017. Nucleic Acids Res 45(D1):D635–D642. https://doi.org/10.1093/nar/gkw1104
39. Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31(13):3812–3814
40. Adzhubei IA, Schmidt S, Peshkin L et al (2010) A method and server for predicting damaging missense mutations. Nat Methods 7(4):248–249. https://doi.org/10.1038/nmeth0410-248
41. Takahashi K, Yamanaka S (2006) Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126(4):663–676. https://doi.org/10.1016/j.cell.2006.07.024
42. Okita K, Ichisaka T, Yamanaka S (2007) Generation of germline-competent induced pluripotent stem cells. Nature 448(7151):313–317. https://doi.org/10.1038/nature05934
43. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57–74. https://doi.org/10.1038/nature11247
44. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W et al (2015) Integrative analysis of 111 reference human epigenomes. Nature 518(7539):317–330. https://doi.org/10.1038/nature14248
45. Fernandez JM, de la Torre V, Richardson D et al (2016) The BLUEPRINT data analysis portal. Cell Syst 3(5):491–495.e495. https://doi.org/10.1016/j.cels.2016.10.021
46. Fantom Consortium, Forrest AR, Kawaji H et al (2014) A promoter-level mammalian expression atlas. Nature 507(7493):462–470. https://doi.org/10.1038/nature13182
47. Bryne JC, Valen E, Tang MH et al (2008) JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res 36(Database issue):D102–D106. https://doi.org/10.1093/nar/gkm955
48. The Track Hub Registry. https://trackhubregistry.org
49. Data formats compatible with Ensembl. http://www.ensembl.org/info/website/upload/index.html#formats
50. The Ensembl Training Site. http://training.ensembl.org
51. EMBL-EBI’s Train Online Platform. https://www.ebi.ac.uk/training/online/
52. Hosting an Ensembl Workshop. http://www.ensembl.info/blog/2017/01/05/so-you-want-to-run-an-ensembl-workshop/
53. The Ensembl Helpdesk YouTube channel. https://www.youtube.com/user/EnsemblHelpdesk
54. The Ensembl Helpdesk Youku channel. http://i.youku.com/i/UMzM1NjkzMTI0?spm=a2h0j.8191423.subscription_wrap.DD~A
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons License and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
License, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons License and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.