Genomes 4 (Epigenomics + ENCODE) 2022
• Nucleosomes can form higher-order folded structures that compact the DNA more
effectively into heterochromatic or euchromatic structures. Actively transcribed genes
are generally found in areas of euchromatin, whereas those that are normally
repressed are generally found in areas of heterochromatin.
As cells leave mitosis, large regions of each chromosome decondense and disperse in the nucleus: this is euchromatin,
which contains most of the transcriptionally active genes. Some chromosomal regions, and in certain cases whole chromosomes,
remain condensed: this is heterochromatin, which commonly includes the regions surrounding the centromeres and telomeres.
The Barr body is an example of
facultative heterochromatin
(heterochromatic in some cells,
euchromatic in others)
Each chromosome has its own territory within the nucleus
• The chromatin contained within each chromosome is not randomly distributed in the nucleus
but occupies a specific location known as the chromosome territory.
• The chromosome territories are formed by the interaction of chromatin with the nuclear lamina.
Chromatin is placed into the chromosome territories according to its state of
activation. Heterochromatin is more likely to be held near the nuclear periphery
through such interactions, and this tends to segregate inactive genes in such regions.
• Genes required for active transcription are often found on “loops” of DNA that
protrude from the chromosomes held near the nuclear periphery into so-called
“transcription factories” where RNA synthesis takes place. Transcription factories have the
potential to bring genes that are co-expressed into direct contact with one another. In
fact, the gene loci move to an immobilized polymerase already present in a factory,
rather than the transcriptional machinery being recruited to and moving along the chromatin template to
a gene.
The mammalian cell nucleus
contains chromatin in the form
of chromosome territories (CTs).
• The boundaries of topologically associating domains (TADs) are marked by insulators, which are involved in 3D genome organization at
multiple spatial scales and are important for dynamic reorganization of chromatin structure during
reprogramming and differentiation.
• An insulator is typically 300 bp to 2000 bp in length and contains clustered binding sites
for sequence specific DNA-binding proteins that mediate intra- and inter-chromosomal
interactions.
• Insulators maintain the independence of each TAD, preventing cross-talk between
adjacent domains. Insulators prevent the genes within a domain from being influenced by
the regulatory modules present in an adjacent domain
Insulators Genomes IV, chapter 10
Insulator = 1-2Kb
Insulators prevent the genes within a domain from being influenced by the regulatory
modules present in an adjacent domain (see fig B). If an insulator is excised from its normal
location and reinserted between a gene and the upstream regulatory modules that control
expression of that gene, then the gene no longer responds to its regulatory modules
Positional effect of Insulators
(A) A cloned gene that is inserted into a region of highly packaged chromatin will be inactive, but one inserted into open
chromatin will be expressed.
(B) The results of cloning experiments without (red) and with (blue) insulator sequences. When insulators are absent, the
expression level of the cloned gene is variable, depending on whether it is inserted into packaged or open chromatin.
When flanked by insulators, the expression level is consistently high because the insulators establish a TAD at the
insertion site.
• DNA that is incorporated into constitutive heterochromatin should not be required for
any transcriptional processes because for the most part these regions contain only
satellite repeats, telomeric DNA etc., no genes.
How does the cell access the genetic information of the DNA when it needs to make
messenger RNA?
Key Concepts
Modifying the structure of chromatin
The transcriptional machinery needs to access the DNA sequence. Therefore, the
structure of chromatin has to change in order to expose the DNA required for interaction
with proteins controlling transcription.
Histone modifications include acetylation, phosphorylation, methylation, ADP ribosylation, and ubiquitination. Multiple
residues on each of the four core histones have been identified as potential modification sites and some lysine side chains can
be either methylated or acetylated. There is strong evidence that histone acetylation promotes the disruption of nucleosomes
at promoters in advance of initiation, whereas histone hypermethylation is often related to transcriptional repression
Effects of DNA methylation on transcription
Three models:
DNA methylation is a common epigenetic modification and has been implicated in gene expression control (usually
repression) in several species, although it is by no means a mechanism universal to all eukaryotes.
Methylated CpG islands are the target sites
for attachment of histone deacetylase
complexes (HDAC) that modify the
surrounding chromatin in order to silence
the adjacent genes
The Human Epigenome
200 different cell types, only one genome
The term epigenetics refers to heritable changes in gene expression (active versus
inactive genes) that do not involve changes to the underlying DNA sequence; a change
in phenotype without a change in genotype. An epigenetic system should be heritable,
self-perpetuating, and reversible
Chromatin states vary from cell type to cell type and along chromosomes. A multi-
cellular organism will be characterized by one genome, but by as many epigenomes as
there are cell types.
DNA methylation
• DNA methylation is a common epigenetic modification and has been implicated in gene
expression control (usually repression) in several species
• The main methylation event is the formation of 5-methylcytosine.
• DNA methylation is not uniform across the human genome: in mammals it occurs mainly at CpG
dinucleotides, whereas the CpG islands usually found in gene promoter regions tend to remain
unmethylated when the gene is active.
• DNA methylation is performed by DNA methyltransferases. These can establish de
novo DNA methylation or maintain existing methylation patterns during DNA replication.
• DNA methyltransferases function as parts of protein complexes that modify chromatin
structure.
• Histone modification patterns, transcription factors, and miRNAs direct the DNA
methyltransferases and their associated chromatin- structure-modifying complexes to
specific locations in the genome
CpG-rich and CpG-poor islands
Nearly one-quarter of all methylation identified in embryonic stem cells is in a non-CG
context. Methylation in non-CG contexts shows enrichment in gene bodies and depletion
in protein binding sites and enhancers.
Key Concepts
Human DNA methylome in stem cells and
fibroblasts (2)
There is a positive correlation between gene expression and mCHG or mCHH density.
Furthermore, highly expressed genes have a lower promoter methylation level than
lowly expressed genes.
non-CG methylation is a fundamental
characteristic of stem cells.
• H1 and H9, two human embryonic
stem cell lines, show non-CG
methylation in conserved positions
In the H1 stem cells we detected abundant DNA methylation in non-CG contexts (mCHG and mCHH,
where H = A, C or T).
DNA methylation sequence context is displayed according to the key and the percentage methylation at each
position is represented by the fill of each circle.
Distribution of the methylation level in each sequence context. The y axis indicates the
fraction of all methylcytosines that display each methylation level (x axis), where methylation
level is the mC/C ratio at each reference cytosine.
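The two definitions used on these slides (sequence context CG / CHG / CHH, with H = A, C or T, and methylation level as the mC/C ratio) can be illustrated with a minimal Python sketch. The function names and the toy sequence are hypothetical and only serve the illustration; the sketch assumes an upper-case A/C/G/T sequence read 5' to 3'.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index i of a 5'->3' sequence as CG, CHG or CHH."""
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    nxt = seq[i + 1:i + 3]              # the one or two bases downstream
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) == 2 and nxt[1] == "G":
        return "CHG"                    # H is any base other than G
    if len(nxt) == 2:
        return "CHH"
    return None                         # too close to the sequence end to classify


def methylation_level(methylated_reads, total_reads):
    """mC/C ratio at one reference cytosine."""
    if total_reads == 0:
        raise ValueError("no coverage at this cytosine")
    return methylated_reads / total_reads


toy = "ACGTCAGTCTTC"                    # hypothetical sequence
contexts = {i: cytosine_context(toy, i) for i, b in enumerate(toy) if b == "C"}
print(contexts)
print(methylation_level(3, 4))
```

In this toy sequence the cytosine at index 1 is in a CG context, the one at index 4 in CHG and the one at index 8 in CHH; the last cytosine cannot be classified because no downstream bases are available.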
H1 has both mCG and mCHG + mCHH, whereas IMR90 has only mCG
Relative methylation density within
gene bodies as a function of gene expression, in H1
http://learn.genetics.utah.edu/content/epigenetics/nutrition/
The Avy metastable epiallele resulted from the insertion of an intracisternal A particle (IAP),
an endogenous retroviral element, upstream of the transcription start site of the Agouti gene.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2822875/
Metastable epialleles
Metastable epialleles are identical alleles that are variably expressed due to
epigenetic modifications that are established very early in development.
Epigenetic phenomenon
A stable and heritable (through cell divisions) change in gene expression that is
independent of DNA sequence changes and is, in principle, reversible.
Epigenetic modification
A term commonly used to describe a change in nucleosome structure caused by histone
modifications, histone variants, or modification (methylation) of the DNA. These
changes are not necessarily epigenetic (see ‘epigenetic phenomenon’) in the sense that
they are stable through cell divisions, but some (such as symmetrical DNA methylation)
might be.
Dynamics of 5mC and its oxidation products in pre-implantation embryos. Although the
maternal DNA goes through passive demethylation, the paternal genome is demethylated
in two steps. Tet3 first oxidizes the 5mC in the paternal genome, and the oxidation products
are then diluted through a replication-dependent process.
DNA methylation is erased in the paternal and maternal genomes after fertilization and is deposited again at later developmental stages, in serum embryonic stem cells (ESCs) and in
primed cells (epiblast-derived stem cells (EpiSCs) and epiblast-derived stem cell-like cells (EpiSCLCs)). DNA methylation is reduced during primordial germ cell (PGC) development.
Similarly, the silencing histone marks (for example, trimethylation of histone H3 at lysine 27 (H3K27me3) and H3K9me2) increase post-implantation and are reduced in 2i-cultured ESCs
and PGCs. H3K4me3 is present in broad domains in the oocyte but is restricted to transcription start sites after fertilization.
The interplay of epigenetic marks during stem cell differentiation and development. Nature Reviews Genetics 18, 643–658 (2017)
Imprinted Genes Bypass
Epigenetic Reprogramming
The epigenetic basis of gene imprinting
Genomic imprinting refers to monoallelic gene expression that occurs in a manner that
is specific to the parent of origin.
A small percentage (less than 1%) of genes are expressed from only one allele. Because
each of these alleles is derived from a different parent, the imprinted genes are
sometimes described as being maternally or paternally imprinted.
Imprinted genes hold a special status in the pre-implantation embryo in that they seem
to resist the genome-wide demethylation events that take place shortly after
fertilization.
Many - but not all - imprinted genes are found in clusters throughout the genome and in
most cases these genes, which can be either maternally or paternally expressed, are
jointly regulated through an imprinting control region (ICR). Indeed, removal of ICRs
from known imprinted genes leads to the loss of imprinting, that is, to expression
from both alleles.
Methylation is involved in genomic imprinting and X
inactivation
For example, in those rare individuals that possess just a single X chromosome, no inactivation occurs,
and in those individuals with an XXX karyotype, two of the three X chromosomes are inactivated
Reprogramming of DNA methylation during pre-implantation embryonic development.
Methylation levels throughout pre-implantation development of embryos. The paternal
genome (blue) undergoes rapid active demethylation, whereas the maternal genome
(red) undergoes passive demethylation until the morula stage of pre-implantation
development, when de novo methylation commences.
The formation of pronuclei allows the maternal and paternal gamete genomes to be
kept separate during the first few hours after fertilization. After entry of the
spermatozoon during fertilization, its DNA content is maintained in a separate
membrane-enclosed structure distinct from the female nuclear material. These
structures are referred to as pronuclei, and DNA demethylation proceeds at different
rates in these regions.
Ligers and Tigons
Mules and Hinnies
Lions and tigers don't normally meet in nature. But they can get along very well in
captivity, where they sometimes produce hybrid offspring. The offspring look different,
depending on who the mother is. A male lion and a female tiger produce a liger - the
biggest of the big cats. A male tiger and a female lion produce a tigon, a cat that is about
the same size as its parents.
The difference in size and appearance between ligers and tigons is due to the parents'
differently imprinted genes. Other animals can also hybridize, with similar results. For
example, a horse and a donkey can produce a mule or a hinny.
Evidence of the existence of imprinting
Pronucleus transplantation experiments in mice: creation of androgenetic
and gynogenetic zygotes
• Gynogenetic zygotes (2n chromosomes, all of female origin): ALL embryos aborted
• Androgenetic zygotes (2n chromosomes, all of male origin): ALL embryos aborted
• Controls, zygotes obtained by transfer of pronuclei (2n chromosomes, n supplied by a male and n
by a female): normal embryos
Maternal imprinting
[Figure: three pedigree diagrams showing the P and M alleles of the male parent, the female parent and the child]
Beckwith-Wiedemann Syndrome (BWS)
Disease involving a maternally imprinted gene, caused by gain of function. The gene maps to 11p15.
Duplication on the maternal chromosome is without consequences because the
supernumerary copy is not expressed.
The Genetic Conflict hypothesis supposes that imprinting grew out of a competition between males for
maternal resources.
In some species, more than one male can father offspring from the same litter. A house cat, for example, can
mate more than once during a heat and have a litter of kittens with two or more fathers. If one father's kittens
grow larger than the rest, his offspring will be more likely to survive to adulthood and pass along their
genes. So it's in the interest of the father's genes to produce larger offspring. The larger kittens will be able to
compete for maternal resources at the expense of the other father's kittens.
On the other hand, a better outcome for the mother's genes would be for all of her kittens to survive to
adulthood and reproduce. The mother alone will provide nutrients and protection for her kittens throughout
pregnancy and after birth. She needs to be able to divide her resources among several kittens, without
compromising her own needs.
It turns out that many imprinted genes are involved in growth and metabolism. Paternal imprinting favors
the production of larger offspring, and maternal imprinting favors smaller offspring. Often maternally and
paternally imprinted genes work in the very same growth pathways. This conflict of interest sets up an
epigenetic battle between the parents -- a sort of parental tug-of-war.
http://www.geneimprint.com/site/genes-by-species
The first assisted reproductive technology (ART)-conceived human precedes the
discovery of genomic imprinting in mammals by several years. This is just one
example of how medical technologies often outpace our basic understanding of
biological processes. Furthermore, this underscores the importance of continuously
reassessing and improving existing methods. Increased safety and reduced
epigenetic abnormalities from ART procedures can be achieved through the
knowledge gained from animal and stem cell-based studies, which can ultimately
lead to better health outcomes for ART patients.
Mouse Functional
ENCODE Characterization
06/09/2012
The Encyclopedia of DNA Elements (ENCODE) project aims to delineate all functional elements
encoded in the human genome.
The 30 ENCODE-related papers that make up the project, published across different journals,
can be explored through the 13 different threads on www.nature.com/ENCODE
The Functional Elements of the genome
ENCODE: genomes contain discrete, linearly ordered units that can be connected with
specific functional features or processes.
In part, this represents a departure from the widely accepted reductionist approach to
genome function, in which iterative dissection by truncation or editing of larger
sequences that encompass a given functional activity was coupled to an experimental
read-out of that activity.
Reductionism and Holism in biology
Holism is an approach to research that emphasizes the study of complex systems. Systems
are approached as coherent wholes whose component parts are best understood in
context and in relation to one another and to the whole.
This practice is in contrast to a purely analytic tradition (reductionism) which aims to gain
understanding of systems by dividing them into smaller composing elements and through
understanding their elemental properties. The holism-reductionism dichotomy is often
evident in conflicting interpretations of experimental findings and in setting priorities for
future research.
ENCODE’s functional elements
Previous studies suggested that 3-8% of bases are under purifying (negative) selection and
therefore may be functional.
The Functional Elements of the genome
Example:
Active promoters are marked by alterations in chromatin structure that give rise to nuclease hypersensitivity
of the underlying DNA.
Main criticisms
• [...] the aim of the ENCODE Consortium to identify functions experimentally is, in
principle, a worthy one. We have already seen that ENCODE uses an evolution-free
definition of “functionality.” [...]
• [...] According to ENCODE, for a DNA segment to be ascribed functionality it needs to (1)
be transcribed or (2) associated with a modified histone or (3) located in an open-chromatin
area or (4) to bind a transcription factor or (5) to contain a methylated CpG
dinucleotide. We note that most of these properties of DNA do not describe a
function; some describe a particular genomic location or a feature related to nucleotide
composition [...] (example: Transcription-factor-binding region promoting a
pseudogene)
• Doolittle, W.F. (2013). Is junk DNA bunk? A critique of ENCODE. Proceedings of the National Academy of
Sciences USA 110: 5294–5300.
• Eddy, S.R. (2012). The C-value paradox, junk DNA and ENCODE. Current Biology 22: R898–R899.
• Eddy, S.R. (2013). The ENCODE project: Missteps overshadowing a success. Current Biology 23: R259–R261.
• Graur, D., Y. Zheng, N. Price, R.B.R. Azevedo, R.A. Zufall, and E. Elhaik. (2013). On the immortality of
television sets: “Function” in the human genome according to the evolution-free gospel of ENCODE.
Genome Biology and Evolution 5: 578-590.
• Niu, D.-K. and L. Jiang. (2013). Can ENCODE tell us how much junk DNA we carry in our genome?
Biochemical and Biophysical Research Communications 430: 1340-1343.
• Hurst, L.D. (2013). Open questions: A logic (or lack thereof) of genome organization. BMC Biology 11: 58.
• …
ENCODE findings
The function of the vast majority of the human genome is unknown. The ENCODE
project has systematically mapped regions of transcription, transcription factor
association, chromatin structure and histone modification.
The vast majority (> 80%) of the human genome participates in at least one
biochemical RNA- and/or chromatin-associated event in at least one cell type.
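A figure such as "> 80% of the genome" is, in essence, the fraction of bases covered by the union of all annotated intervals across assays and cell types. The sketch below shows that calculation on toy data; it is not the ENCODE pipeline, and the track names, coordinates and genome length are made up for illustration. Intervals are assumed to be half-open (start, end) pairs on a single chromosome.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(tracks, genome_length):
    """Fraction of bases covered by at least one interval in at least one track."""
    all_intervals = [iv for track in tracks for iv in track]
    covered = sum(e - s for s, e in merge_intervals(all_intervals))
    return covered / genome_length


# Hypothetical tracks on a 100-bp toy "genome"
rna = [(0, 30), (50, 60)]
dnase = [(20, 40)]
histone = [(55, 80)]
print(covered_fraction([rna, dnase, histone], 100))
```

Because a base counts as soon as any assay in any cell type touches it, the union grows quickly as tracks are added, which is one reason the 80% figure attracted the criticisms listed above.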
In 2007, the pilot phase of the ENCODE project searched for functional elements in 1% of the genome in a few human
cell lines. The consortium catalogued two types of elements:
1. DNA regions that are transcribed into RNA (both protein-coding and non-protein-coding)
2. DNA regions that regulate gene transcription, known as cis-regulatory elements (CREs). These regions can be
identified by their accessibility to DNase I, by DNA-binding proteins such as transcription factors, or by
modifications on histone proteins.
In 2012, the second phase of the ENCODE project extended the search to the whole genome in more human cell lines.
Similar efforts were extended to the mouse genome in 2014.
In the third phase of the project (2017) the consortium moved from cell lines to cells taken directly from human and
mouse tissues.
Initial set of Materials (2012)
147 different cell types
1,640 datasets (different tissues, condition and assays)
Tier 1. Tier 1 (first priority) cell types comprised three widely studied cell lines: K562
erythroleukaemia cells; GM12878, a B-lymphoblastoid cell line that is also part of the 1000
Genomes project; and the H1 embryonic stem cell (H1 hESC) line.
Tier 2. The second-priority set included HeLa-S3 cervical carcinoma cells, HepG2
hepatoblastoma cells and primary (non-transformed) human umbilical vein endothelial cells
(HUVECs).
6. The protein-RNA interactions of RNA binding proteins (RBPs) and the RNA elements they
bind to across the transcriptome. These RNA elements, when expressed, form the basis of co- and post-
transcriptional regulation of human genes.
Gene annotation and expression
Methods: RNA-Seq
• Poly(A): 3' end polyadenylated tail is targeted in order to ensure that coding RNA is
separated from noncoding RNA.
• CAGE: Capture of the methylated cap at the 5’ end of RNA, followed by high-throughput
sequencing of a small tag adjacent to the 5’ methylated caps.
• RNA-PET: simultaneous poly(A) and CAGE captures
To create a gold standard reference annotation the HAVANA team uses tools developed in-house
to manually annotate human, mouse and zebrafish genomes.
The group’s approach to manual gene annotation is to annotate transcripts aligned to the
genome and take the genomic sequences as the reference rather than the cDNAs.
Manual annotation is especially important in areas that are not well catered for by automated
annotation systems, such as splice variation, pseudogenes, conserved gene families,
duplications and non-coding genes.
GENCODE annotation pipeline
Within the ENCODE Consortium, GENCODE aims to accurately annotate all protein-coding
genes, pseudogenes, and noncoding transcribed loci in the human genome
through manual curation and computational methods. Annotated transcript
structures are assessed, and less well-supported loci are systematically,
experimentally validated. Predicted exon-exon junctions are evaluated by RT-PCR
amplification followed by highly multiplexed sequencing readout, a method called
RT-PCR-seq. Seventy-nine percent of all assessed junctions were confirmed by this
evaluation procedure, demonstrating the high quality of the GENCODE gene set.
This RT-PCR-seq targeted approach also has the advantage of identifying novel exons
of known genes, as unannotated exons were discovered in ~11% of assessed
introns. We thus estimate that at least 18% of known loci have yet-unannotated
exons.
http://europepmc.org/article/MED/22955982
GENCODE annotation levels
Level 1 - validated
Pseudogene loci that were jointly predicted by the Yale Pseudopipe and
UCSC Retrofinder pipelines as well as by Havana manual annotation;
other transcripts that were verified experimentally by RT-PCR and
sequencing through the GENCODE experimental pipeline.
The aims of GENCODE Phase 2, which ran from 2013 to 2017, were:
• To continue to improve the coverage and accuracy of the GENCODE human gene set
• To create a mouse GENCODE gene set that includes protein-coding regions with associated
alternative splice variants, non-coding loci which have transcript evidence, and
pseudogenes.
The mouse annotation data allow comparative studies between human and mouse and likely
improve annotation quality in both genomes.
Improvement of gene annotations by GENCODE
[Figure: antigen receptor (immunoglobulin (IG) and T-cell receptor (TR)) annotation, 10/2020 versus 11/2021]
Expression levels of genes and transcripts annotated
by GENCODE, which can be visualized on SCREEN:
Search Candidate cis-Regulatory Elements by
ENCODE (https://screen.encodeproject.org/)
What is the difference between GENCODE and Ensembl
annotation?
In the curated canonical transcripts set (UCSC, RefSeq, Ensembl), there are some
discrepancies where there are multiple transcripts for a given gene. The canonical
transcript is defined as either the longest CDS, if the gene has translated transcripts,
or the longest cDNA.
The canonical transcript for a gene is set according to the following hierarchy:
1. Longest Consensus CDS (see CCDS project) translation with no stop codons.
2. If no (1), choose the longest Ensembl/Havana merged translation with no
stop codons.
3. If no (2), choose the longest translation with no stop codons.
4. If no translation, choose the longest non-protein-coding transcript.
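The four-step hierarchy above can be written as an ordered list of filters, taking the longest transcript from the first non-empty tier. This is a minimal sketch and not Ensembl's implementation: the `Transcript` fields and the example transcripts are hypothetical, and "length" is assumed to be the CDS length for translated transcripts and the transcript length otherwise.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    length: int                 # CDS length if translated, else transcript length
    translated: bool = False    # has a translation with no stop codons
    ccds: bool = False          # translation is part of the CCDS set
    merged: bool = False        # Ensembl/Havana merged model


def canonical_transcript(transcripts):
    """Return the longest transcript from the first tier that has candidates."""
    tiers = [
        lambda t: t.translated and t.ccds,      # rule 1
        lambda t: t.translated and t.merged,    # rule 2
        lambda t: t.translated,                 # rule 3
        lambda t: not t.translated,             # rule 4
    ]
    for accept in tiers:
        candidates = [t for t in transcripts if accept(t)]
        if candidates:
            return max(candidates, key=lambda t: t.length)
    return None


gene = [
    Transcript("tx_a", 900, translated=True),
    Transcript("tx_b", 600, translated=True, ccds=True),
    Transcript("tx_c", 2000),
]
print(canonical_transcript(gene).name)
```

Here the shorter CCDS transcript wins over the longer ones because tier membership is checked before length, which is exactly where resources that rank only by length would disagree.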
Gene Annotations
Annotating the locations of genes and other genetic control elements of the genome
Annotation of genes is provided by multiple public resources, using different methods, and
resulting in information that is similar but not always identical. Here is an alphabetical
listing of the most relevant databases for genome annotation:
The Ensembl project was started in 1999, before the draft human genome was
completed. The goal of Ensembl was to automatically annotate the genome, integrate
this annotation with other available biological data, such as GENCODE, and make all this
publicly available via the web.
Ensembl has gained an ardent following as a data source because they include data
patches and they release scheduled updates to their genome annotations.
The UCSC Genome Browser is an on-line genome browser hosted by the University of
California, Santa Cruz (UCSC).
The Consensus CDS (CCDS)
The human and mouse genome sequences are now sufficiently stable to start
identifying those gene placements that are identical, and to make those data public
and supported as a core set by the three major public genome browsers. The
Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human
and mouse protein coding regions that are consistently annotated and of high quality.
The long term goal is to support convergence towards a standard set of gene
annotations.
The NCBI, Ensembl, and Havana annotation of the GRCh38 reference genome is
analyzed to identify additional coding sequences (CDS) that are consistently
annotated. CCDS data are available on the CCDS web site and FTP site.
MANE Project
Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration between the National Center
for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory-European
Bioinformatics Institute (EMBL-EBI). The goal of this project is to provide a minimal set of matching
RefSeq and Ensembl-GENCODE transcripts of human protein-coding genes, where the transcripts
from a matched pair are identical (5’ UTR, coding region and 3’ UTR), but retain their respective
identifiers.
The MANE project is only being completed for human genes on GRCh38.
https://www.ncbi.nlm.nih.gov/refseq/MANE/
https://www.youtube.com/watch?v=Pm0H32gcKeE
https://www.youtube.com/watch?v=SbQ8mB1v85c
Histone modifications
Regions of histone modifications
Histone proteins undergo post-translational modification in different ways, which impacts their
interactions with DNA.
Some modifications disrupt histone-DNA interactions, causing nucleosomes to unwind. In
this open chromatin conformation (euchromatin) DNA is accessible to binding of
transcriptional machinery and subsequent gene activation.
Other modifications strengthen histone-DNA interactions creating a tightly packed chromatin
structure (heterochromatin) where transcriptional machinery cannot access DNA, resulting in
gene silencing. In this way, modification of histones by chromatin remodeling complexes
changes chromatin architecture and gene activation.
The global patterns of histone modification (acetylation, methylation, phosphorylation,
citrullination) are highly variable across cell types, in accordance with changes in
transcriptional activity.
Integration of the different histone modification information can be used systematically to
assign functional attributes to genomic regions.
Chromatin modification based on histone
covalent modifications
Methods: ChIP-seq
Chromatin immunoprecipitation followed by
sequencing.
Methods: Bisulfite sequencing
Treatment of DNA with bisulfite converts cytosine residues to uracil, but leaves 5-methylcytosine
residues unaffected. Bisulfite sequencing is the use of bisulfite treatment of
DNA before routine sequencing to determine the pattern of methylation. After bisulfite
treatment, only methylated cytosines are still read as cytosines; unmethylated cytosines are
read as thymines after amplification. The analysis is therefore reduced to telling apart the
single-nucleotide differences (cytosine versus thymine) that result from bisulfite conversion.
2 WGS: 1 normal + 1 subjected
to Bisulfite treatment
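The logic of comparing a normal read-out with a bisulfite-treated one can be sketched in a few lines of Python. This is a toy model that makes several assumptions: only the top strand is considered, conversion is complete, uracil is read as T after amplification, and there are no sequencing errors. The function names and sequences are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus amplification on the top strand:
    unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """Compare the untreated reference with the converted read, base by base."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted)):
        if ref == "C":
            if obs == "C":
                calls[i] = "methylated"
            elif obs == "T":
                calls[i] = "unmethylated"
            else:
                calls[i] = "ambiguous"   # e.g. a true SNP or a sequencing error
    return calls


reference = "ACGTCCGA"                       # hypothetical untreated sequence
treated = bisulfite_convert(reference, {1})  # only the C at index 1 is methylated
print(treated)
print(call_methylation(reference, treated))
```

The untreated sequence is what makes the call possible: a T in the treated read only means "unmethylated C" if the reference shows a C at that position, otherwise it is simply a genomic T.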
Methods: Reduced Representation Bisulphite Sequencing
In Reduced Representation Bisulphite Sequencing (RRBS), in order to save costs the sample is first
enriched in CpGs by digestion with the methylation-insensitive restriction enzyme MspI, which
targets 5’-CCGG-3’ sites. The resulting fragments are then treated with bisulphite to convert
unmethylated cytosines to uracil, and sequenced.
[Figure: DNA fragment flanked by two MspI restriction sites]
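An in-silico MspI digest shows why RRBS enriches for CpGs: MspI cuts within CCGG (C^CGG), so every internal fragment starts with a CG dinucleotide. The sketch below is illustrative only; the function names, the toy sequence and the size-selection window are assumptions, not protocol values.

```python
import re


def mspi_fragments(seq, site="CCGG", cut_offset=1):
    """Cut seq at every MspI site (C^CGG) and return the fragments in order."""
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the chosen window (inclusive)."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


toy = "AACCGGTTTTCCGGAA"                 # hypothetical sequence with two MspI sites
frags = mspi_fragments(toy)
print(frags)
print(size_select(frags, 6, 10))         # hypothetical size window
```

Only the fragment lying between two MspI sites survives size selection, and it begins with CGG, so sequencing effort is concentrated on CpG-containing regions instead of the whole genome.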
DNA Methylation
• 96% of CpGs exhibit differential methylation in at least one cell type or tissue, and levels of
DNA methylation correlate with chromatin accessibility.
Active genes are more likely to have altered nucleosome state, which makes
DNase I digestion a great reference measure for mapping genomic
regulatory elements
Methods: FAIRE-seq
Formaldehyde Assisted Isolation of Regulatory Elements (FAIRE) isolates nucleosome-depleted
genomic regions.
Cells are subjected to cross-linking, ensuring that the interactions
between the nucleosomes and DNA are fixed. After sonication, the
fragmented and fixed DNA is separated using a phenol-chloroform
extraction. This method creates two phases, an organic and an
aqueous phase. Due to their biochemical properties, the DNA
fragments cross-linked to nucleosomes will preferentially sit in the
organic phase. Nucleosome depleted or ‘open’ regions on the other
hand will be found in the aqueous phase. By specifically extracting the
aqueous phase, only nucleosome-depleted regions will be purified
and enriched.
In contrast to DNase-Seq, the FAIRE-Seq protocol doesn't require the
permeabilization of cells or isolation of nuclei, and can analyse any
cell type.
NOTE:
Maps of open chromatin can be constructed using both DNase-seq and FAIRE-seq
Differences in DNase-seq and FAIRE-seq may be due to the specific regulatory complexes
bound at each site, which could affect the ability of DNaseI to cut or formaldehyde to
crosslink but generally:
→ DNase-only sites tended to occur at transcription start sites
→ FAIRE-only sites were more often found in distal regions
Methods: ATAC-seq
Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq) probes DNA accessibility with Tn5
transposase, which inserts sequencing adapters into accessible regions of chromatin. Sequencing
reads can then be used to infer regions of increased accessibility, as well as to map regions of
transcription factor binding and nucleosome position.
The method is a fast and sensitive alternative to DNase-Seq for assaying chromatin accessibility
genome-wide.
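Inferring accessibility from reads can be sketched as counting Tn5 insertion events in fixed-width bins. This is a simplified illustration, not a peak caller; the insertion positions, the bin size and the function name are assumptions made for the example.

```python
from collections import Counter


def insertion_counts(positions, bin_size):
    """Count insertion events per (chromosome, bin index)."""
    counts = Counter()
    for chrom, pos in positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts


# Toy insertion sites (invented): three fall in the same 100 bp bin on chr1
insertions = [("chr1", 105), ("chr1", 130), ("chr1", 180), ("chr1", 950), ("chr2", 20)]
counts = insertion_counts(insertions, bin_size=100)
top_bin, top_count = counts.most_common(1)[0]
print(top_bin, top_count)
```

Bins with many insertions are candidates for accessible regions; real pipelines additionally model background and fragment length.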
Chromatin accessibility is the hallmark of regulatory DNA regions. It is the degree to which
nuclear macromolecules are able to physically contact chromatinized DNA and is determined
by the occupancy and topological organization of nucleosomes as well as other chromatin-
binding factors that occlude access to DNA.
This landscape of open chromatin regions (OCRs) broadly reflects regulatory capacity - rather
than a static biophysical state - and is a critical determinant of chromatin organization and
function.
• More than 200,000 OCRs were identified per cell type. On average, 98.5% of the occupancy sites of
transcription factors mapped by ENCODE ChIP-seq lie within OCRs.
• The majority of OCRs lie distal to TSSs.
The Majority of Primate-Specific Regulatory Sequences Are
Derived from Transposable Elements
Nearly half of the human genome is composed of repetitive sequences, most of which were derived
from transposable elements. There is growing evidence showing that some of these transposon-derived
sequences have been a source of new binding sites for various mammalian transcription factors.
44% of open chromatin regions are in TEs. Distinct subfamilies of endogenous retroviruses (ERVs)
contributed significantly more accessible regions than expected by chance, with up to 80% of their
instances in open chromatin.
TEs contributing to open chromatin had higher levels of sequence conservation, and thousands of
ERV–derived sequences are activated in a cell type–specific manner, especially in embryonic and
cancer cells, and this activity is associated with cell type–specific expression of neighboring genes.
These results demonstrate that TEs, and in particular ERVs, have contributed hundreds of thousands
of novel regulatory elements to the primate lineage and reshaped the human transcriptional
landscape.
Candidate cis-regulatory elements (cCREs) are classified as: promoter-like signatures (PLS); proximal enhancer-like signatures (pELS);
distal enhancer-like signatures (dELS); elements with high DNase, high H3K4me3 and low H3K27ac signals (DNase-H3K4me3); and elements bound by CTCF.
Binding activity and location of transcription-factors
Transcription factors binding regions
A key characteristic of each transcription factor (TF) protein is its DNA binding domain,
which recognizes a specific DNA motif. These motifs tend to be short and degenerate, so
even when the DNA binding motif is known, one cannot generally predict where a given
transcription factor may bind.
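Why a short, degenerate motif predicts binding poorly can be sketched with a consensus scan. This is a minimal illustration, assuming an IUPAC consensus string; the motif WGATAR and the toy sequence are chosen only for the example, and the helper names are hypothetical.

```python
import re

# IUPAC nucleotide codes used here, mapped to the bases they allow
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "N": "ACGT"}
COMPLEMENT = str.maketrans("ACGTRYSWKMN", "TGCAYRSWMKN")


def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]


def motif_hits(seq, motif):
    """Start positions of all (possibly overlapping) matches on the given strand."""
    body = "".join("[" + IUPAC[s] + "]" for s in motif)
    return [m.start() for m in re.finditer("(?=(" + body + "))", seq)]


def match_probability(motif):
    """Chance that a random position matches, assuming equal base frequencies."""
    p = 1.0
    for s in motif:
        p *= len(IUPAC[s]) / 4
    return p


motif = "WGATAR"                               # illustrative degenerate consensus
seq = "CCAGATAAGGTGATAGCCTTATCTAA"             # made-up sequence
forward = motif_hits(seq, motif)
reverse = motif_hits(seq, reverse_complement(motif))
chance = match_probability(motif)
print(forward, reverse, chance)
```

Under these assumptions a match is expected about once per 1024 positions per strand, so a genome of billions of bases contains millions of chance matches, far more than the sites a factor actually occupies.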
ENCODE discovers many new transcription-factor-binding-site motifs and explores their
properties. ENCODE TF tracks contain transcription factor binding sites determined by
ChIP-seq.
TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions,
flanked by well-positioned nucleosomes, and many of these features show cell type
specificity.
Transcription factors binding regions (2012)
To identify regulatory regions directly, the binding locations of 119 different transcription
factors were mapped in 72 cell types using ChIP-seq.
Overall, more than 600,000 binding regions covering 231 Mb (8.1%) of the genome are enriched for
regions bound by DNA-binding proteins across all cell types.
All the information associated with each transcription factor - including the ChIP-seq peaks,
discovered motifs and associated histone modification patterns - is organised in FactorBook
(http://www.factorbook.org).
CTCF
https://doi.org/10.1038/s41586-020-2077-3
eCLIP
eCLIP is an enhanced version of the crosslinking and immunoprecipitation (CLIP) assay used to
identify the binding sites of RNA binding proteins (RBPs) in vivo.
RNA and the protein of interest are UV-crosslinked, followed by cell lysis and RNase I
digestion. Next, the protein-RNA complexes are immunoprecipitated and ligated to an RNA
adapter on the 3' end of the target RNA. The bound protein is removed by proteinase K
digestion, and the RNA is reverse-transcribed. The resulting cDNA fragments are then
amplified and sequenced.
RBP2GO (https://RBP2GO.DKFZ.de) is a comprehensive database of all currently available proteome-wide datasets
for RBPs across 13 species from 53 studies including 105 datasets identifying altogether 22 552 RBP candidates.
These are combined with the information on RBP interaction partners and on the related biological processes,
molecular functions and cellular compartments. RBP2GO offers a user-friendly web interface with an RBP scoring
system and powerful advanced search tools allowing forward and reverse searches connecting functions and RBPs to
stimulate new research directions.
Three dimensional chromatin interactions
Chromosome conformation capture
The identification and mapping of all the
genes and of their functional elements is
complicated by the fact that the genomic
positions of genes and elements do not
provide direct information about functional
relationships between them.
(A) Two fragments, A (red) and B (blue), are either spatially separated in the linear genome (gray dotted line) or neighboring (red and blue to gray fading). (B) If fragments A and B are in close
spatial proximity they can become cross-linked and ligated during the Hi-C procedure (1). Partial digests result from undigested neighboring fragments that were biotinylated (2). Other
possible, non-valid products can be derived from non-ligated DNA (dangling ends; 3) or single fragments that have become circularized after ligation (self-circles; 4). The gray arrows indicate
the orientation of the paired-end reads in the Hi-C library. (C) Dangling ends can be removed from the Hi-C library prior to sequencing, as described in this protocol. Any remaining dangling
ends and self-circles can be filtered out from the sequenced library computationally after mapping and assessing the orientation of the DNA reads. After mapping, valid reads locate to
different fragments in the reference genome and are either inward or outward oriented, or point in the same direction (both pointing left or both pointing right) (1). Unligated partial
digestion products cannot be distinguished from valid reads because the two reads map to two (neighboring) restriction fragments. This category is characterized by an inward read
orientation (2). Invalid reads map to the same fragment in the reference genome and can be either inward (dangling ends; 3), outward (self-circles; 4) or same direction (error; 5).
Gray arrows indicate the read orientation in the reference genome.
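The computational filter described in this caption can be sketched as a small classifier. This is a minimal illustration, assuming each mapped read is reduced to a restriction fragment index, a position and a strand; the dictionary layout and function names are hypothetical.

```python
def read(fragment, pos, strand):
    """Hypothetical minimal record for one mapped read of a pair."""
    return {"fragment": fragment, "pos": pos, "strand": strand}


def orientation(read1, read2):
    """Relative orientation of two reads mapped to the same chromosome."""
    left, right = sorted([read1, read2], key=lambda r: r["pos"])
    if left["strand"] == right["strand"]:
        return "same"
    return "inward" if left["strand"] == "+" else "outward"


def classify_pair(read1, read2):
    """Apply the rules from the caption: pairs on different fragments are kept."""
    if read1["fragment"] != read2["fragment"]:
        # Inward pairs on neighboring fragments may be undigested partial
        # digests, but they cannot be told apart from valid pairs.
        return "valid"
    labels = {"inward": "dangling end", "outward": "self-circle", "same": "error"}
    return labels[orientation(read1, read2)]


print(classify_pair(read(1, 100, "+"), read(7, 90000, "-")))   # different fragments
print(classify_pair(read(3, 500, "+"), read(3, 800, "-")))     # same fragment, inward
```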
Chromosome conformation capture
analysis revealed numerous features
of genomic organization, such as the
presence of chromosome territories
and the preferential association of
small gene-rich chromosomes.
Figure: Hi-C interaction frequency map (axis: Chr)
Topologically Associating Domains
A topologically associating domain (TAD) is a sub-megabase region of high
self-interaction that displays limited interaction outside the domain, meaning that DNA
sequences within a TAD physically interact with each other more frequently than with
sequences outside the TAD.
Boundaries on both sides of these domains are conserved between different
mammalian cell types and even across species, and are highly enriched with
CCCTC-binding factor (CTCF) and cohesin binding sites. In addition, some types of genes (such
as transfer RNA genes and housekeeping genes) appear near TAD boundaries more often
than would be expected by chance. In humans and mice there are 2000–3000 domains,
with an average size of about 1 Mb.
TADs were discovered in 2012 using chromosome conformation capture techniques.
They have been shown to be present in multiple species, including Drosophila, mouse,
plants, fungi and human genomes. In bacteria, they are referred to as Chromosomal
Interacting Domains (CIDs).
TADs were originally defined algorithmically in low-resolution (40 kb) mammalian Hi-C
matrices as megabase-scale genomic blocks in which DNA sequences exhibit
significantly higher interaction frequency with other DNA sequences within the
domain than with those outside of the block (Fig. 1a).
The most salient feature of TADs is that TADs are demarcated by boundaries.
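How a boundary shows up in a contact matrix can be sketched with a simplified insulation-style score. This is a different and much cruder technique than the algorithm used in the original definition; the toy matrix, window size and function name are assumptions made for the example.

```python
def insulation(matrix, window):
    """Mean contact frequency across each candidate boundary.

    Position i is the boundary between bin i-1 and bin i; the score averages
    contacts between the `window` bins to its left and the `window` bins to its right.
    """
    n = len(matrix)
    scores = {}
    for i in range(window, n - window + 1):
        total = sum(matrix[a][b]
                    for a in range(i - window, i)
                    for b in range(i, i + window))
        scores[i] = total / (window * window)
    return scores


# Toy 8-bin matrix with two domains (bins 0-3 and 4-7): high contact inside, low between
matrix = [[10 if (a < 4) == (b < 4) else 1 for b in range(8)] for a in range(8)]
scores = insulation(matrix, window=2)
boundary = min(scores, key=scores.get)
print(scores, boundary)
```

The score drops where few contacts cross a position, so the minimum falls between bins 3 and 4, the boundary built into the toy matrix.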
https://www.nature.com/articles/s41588-019-0561-1?proof=tNature
Transcriptomic data
TAD disruption can lead to pathological conditions
Figure (health vs. disease): a 660 kb deletion that removes a TAD boundary and enhancer A (Enh-A) leads to adoption of enhancer B (Enh-B)
by the lamin B1 gene (LMNB1), causing its misexpression and, subsequently, autosomal dominant
adult-onset leukodystrophy. Deletion breakpoints are depicted by magenta and yellow dots.
The ChIA-PET experimental protocol includes chromatin preparation, ChIP, linker ligation, proximity ligation, MmeI
restriction digestion, and DNA sequencing. This figure is from Genome Biol. 2010, 11(2): R22, doi:10.1186/gb-2010-11-2-r22.
Transcription models based on chromatin interactions
In addition to promoter-enhancer and enhancer-enhancer interactions, ChIA-PET revealed
that promoter-promoter interactions are also pervasive in human cells. Among all the
promoter-nonpromoter interactions, more than 40% of the non-promoter regulatory elements did not
interact with their nearest promoters. This means that the common assumption that
transcription factor binding sites regulate their nearest genes does not always hold.
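The comparison behind this statement can be sketched as follows. This is a toy illustration, assuming each interaction links an element position to a target gene and each gene has one TSS coordinate; all coordinates and gene names are invented.

```python
def nearest_promoter(position, promoters):
    """Name of the promoter whose TSS is closest to the element position."""
    return min(promoters, key=lambda name: abs(promoters[name] - position))


# Hypothetical TSS coordinates on one chromosome
promoters = {"geneA": 1000, "geneB": 50000, "geneC": 200000}

# Hypothetical (element position, interacting gene) pairs from an interaction map
interactions = [(3000, "geneA"), (45000, "geneC"), (150000, "geneC"),
                (60000, "geneA"), (10000, "geneA")]

skipping = [(pos, gene) for pos, gene in interactions
            if nearest_promoter(pos, promoters) != gene]
fraction = len(skipping) / len(interactions)
print(skipping, fraction)
```

Elements in the `skipping` list contact a promoter other than the closest one, which is the situation reported for more than 40% of non-promoter elements.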
Three types of transcription models have been proposed: 1) basal promoter models, in
which there are no chromatin interactions; 2) single-gene interaction models, in which one
gene is involved with one or more promoter-nonpromoter interactions; and 3) multi-gene
interaction models, in which multiple genes are linked together by chromatin interactions to
form a transcription factory for potential correlated transcription.
Chromosome interacting regions
Minor allele frequency (MAF) refers to the frequency at which the second
most common allele occurs in a given population.
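The definition can be sketched directly from allele counts. This is a minimal illustration; the function name and the sample of ten chromosomes are invented.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most common allele among the sampled chromosomes."""
    counts = Counter(alleles).most_common()   # sorted from most to least common
    if len(counts) < 2:
        return 0.0                            # monomorphic site
    return counts[1][1] / len(alleles)


sample = list("AAAAAAGGGT")                   # 6 x A, 3 x G, 1 x T
maf = minor_allele_frequency(sample)
print(maf)
```

At a site with three alleles, as here, the MAF refers to the second most common allele (G), not the rarest one.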
Negative selection can be examined using two measures that highlight different periods of
selection in the human genome: pan-mammalian constraint (GERP scores) and diversity within
the human population.
GERP deficits represent substitutions that would have occurred if the element were neutral
DNA but did not occur because the element has been under functional constraint.
Positive scores represent a substitution deficit (i.e., fewer substitutions than the
average neutral site) and thus indicate that a site may be under evolutionary
constraint. Negative scores indicate that a site is probably evolving neutrally.
The impact of selection on functional elements
The next slide reports the levels of pan-mammalian constraint (higher GERP score = higher conservation =
higher constraint) compared to diversity, a measure of negative selection in the human population (mean
expected heterozygosity, inverted scale, y axis). Each point is an average for a single data set.
The top-right corners have the strongest evolutionary constraint and lowest diversity.
Coding (C), UTR (U), genomic (G), intergenic (IG) and intronic (IN) averages are shown as filled squares. In
each case the vertical and horizontal cross hairs show representative levels for the neutral expectation for
mammalian conservation and human population diversity, respectively. Each graph also shows genomic
background levels and measures of coding-gene constraint for comparison. Because human population
diversity is plotted on an inverted scale, elements that are more constrained by negative selection will tend
to lie in the upper and right-hand regions of the plot.
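The diversity axis can be sketched as follows. This is a minimal illustration, assuming expected heterozygosity at a site is one minus the sum of squared allele frequencies; the allele frequencies are invented.

```python
def expected_heterozygosity(allele_freqs):
    """Probability that two randomly drawn chromosomes carry different alleles."""
    return 1.0 - sum(p * p for p in allele_freqs)


def mean_expected_heterozygosity(sites):
    """Average over the sites belonging to one class of elements."""
    return sum(expected_heterozygosity(f) for f in sites) / len(sites)


# Toy allele frequencies at three sites (invented): common variant, rare variant, invariant
sites = [[0.5, 0.5], [0.9, 0.1], [1.0]]
mean_h = mean_expected_heterozygosity(sites)
print(mean_h)
```

Elements under stronger negative selection carry fewer and rarer variants, which lowers this mean and moves them upward on the inverted y axis.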
GERP: Positive scores represent a substitution deficit (i.e., fewer substitutions than the
average neutral site) and thus indicate that a site may be under evolutionary constraint.
Negative scores indicate that a site is probably evolving neutrally.
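The sign convention can be sketched as a difference between expected and observed substitutions. This is a minimal illustration, assuming the score is the neutral expectation minus the observed count at an aligned site; the numbers and function names are invented.

```python
def rejected_substitutions(expected_neutral, observed):
    """Substitution deficit: expected under neutrality minus observed."""
    return expected_neutral - observed


def interpret(score):
    """Reading of the sign, following the description above."""
    return "possibly constrained" if score > 0 else "probably neutral"


# Toy values: 4.0 substitutions expected across the alignment under neutrality
conserved_site = rejected_substitutions(4.0, 0.5)
variable_site = rejected_substitutions(4.0, 4.5)
print(conserved_site, interpret(conserved_site))
print(variable_site, interpret(variable_site))
```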
DHS sites show enrichment in pan-mammalian constraint (are under negative selection) and decreased human population diversity.
Bound transcription factor motifs show both more mammalian constraint and higher suppression of human diversity.
Levels of pan-mammalian constraint (higher GERP score = higher conservation = higher constraint) compared to
diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y axis).
The top-right corner has the strongest evolutionary constraint and lowest diversity. Genomic averages (G) are shown.
No evidence for pan-mammalian selection of intronic RNA (dark green) and intergenic RNA (light green).
Evidence of negative selection of intronic RNA in humans.
Levels of pan-mammalian constraint (higher GERP score = higher conservation = higher constraint) compared to
diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y axis).
The top-right corner has the strongest evolutionary constraint and lowest diversity. Intergenic (IG) and intronic (IN)
averages are shown.
Derived allele frequency (DAF) is the
frequency, in the human population, of the
allele that differs from the ancestral base.
High derived allele frequency (DAF) means that a mutation likely occurred somewhere
on the human lineage and is now found in about 95% of humans. The underlying
mechanism is unknown.
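The calculation can be sketched as follows. This is a minimal illustration, assuming the ancestral base is already known (for example from an outgroup alignment); the sample of 20 chromosomes is invented.

```python
def derived_allele_frequency(alleles, ancestral):
    """Fraction of sampled chromosomes carrying a non-ancestral allele."""
    derived = sum(1 for allele in alleles if allele != ancestral)
    return derived / len(alleles)


# Toy sample: ancestral base C, derived base T on 19 of 20 chromosomes
sample = ["T"] * 19 + ["C"]
daf = derived_allele_frequency(sample, ancestral="C")
print(daf)
```

A value like this corresponds to the high-DAF case described above, where the derived allele is carried by about 95% of the sample.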
• Many primate-specific elements are due to retrotransposon activity, but an appreciable proportion is
non-repetitive primate-specific sequence. Examination of these primate-specific regions
revealed that all classes of elements show depressed derived allele frequencies,
consistent with recent negative selection occurring in at least some of these regions (Fig. 1e).
This indicates that an appreciable proportion of the unconstrained elements are
lineage-specific elements required for organismal function.
Exons exhibit by far the strongest levels of constraint.
Over 94% of the coding exons in the human genome overlap at least one
predicted constrained element (CE); conversely, only about 16% of
constrained elements overlap a coding exon. 3′ UTR regions show
noticeable constraint levels.
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1001025
Summary of ENCODE-identified elements (2012)
A surprisingly large amount of the human genome, 80.4%, is covered by at least one
ENCODE-identified element.
• The broadest element class represents the different RNA types, covering 62% of the
genome (although the majority is inside of introns or near genes).
• Regions highly enriched for histone modifications form the next largest class (56.1%).
• Smaller proportions of the genome are occupied by regions of open chromatin (15.2%) or
sites of transcription factor binding (8.1%).
• Using the most conservative assessment, 8.5% of bases are covered by either a
transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%). This, however, is
still about 4.5-fold higher than the amount of protein-coding exons, and about twofold
higher than the estimated amount of pan-mammalian constraint.
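The arithmetic behind the conservative estimate can be checked by inclusion-exclusion. This is a back-calculation from the percentages quoted above, not an independent measurement; the variable names are hypothetical.

```python
motif_pct = 4.6       # bases in transcription-factor-binding-site motifs
footprint_pct = 5.7   # bases in DHS footprints
union_pct = 8.5       # bases covered by either class

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
overlap_pct = round(motif_pct + footprint_pct - union_pct, 1)

# Exon coverage implied by the "about 4.5-fold" statement
implied_exon_pct = round(union_pct / 4.5, 1)

print(overlap_pct, implied_exon_pct)
```

Under these figures, roughly 1.8% of bases lie in both a motif and a footprint, and the 4.5-fold comparison implies protein-coding exons at about 1.9% of the genome.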
ENCODE data integration with known genomic features
Gene regulation at functional elements is
governed by interplay of nucleosome remodeling,
histone modifications and TF binding.
http://www.sciencedirect.com/science/article/pii/S167202291300048X
ENCODE data integration with known genomic features:
Promoters
Hypothesis: RNA expression (output) can be
effectively predicted from patterns of chromatin
modification or transcription factor binding (input).
These associations are dependent on the genomic context, meaning that once the
genome is separated into promoter proximal and distal regions, the overall levels of
co-association decrease, but more specific relationships are uncovered.
a, Significant coassociations of transcription
factor pairs using the GSC statistic across the
entire genome in K562 cells.
• Three active proximal states, with a distinct core promoter region (TSS and promoter
flanking, PF, states), leading to active gene bodies (transcribed state, T).
• Three ‘active’ distal states, two of which are tentatively labelled as enhancers (predicted enhancers, E,
and predicted weak enhancers, WE) due to their occurrence in regions of open chromatin
with high H3K4me1. The other active state (CTCF) has high CTCF binding.
• The repressed state (R) summarizes sequences split between different classes of actively
repressed or inactive, quiescent chromatin.
CTCF encodes a transcriptional regulator protein with 11 highly conserved zinc finger (ZF) domains. This nuclear protein is able to use different combinations of the ZF domains to bind
different DNA target sequences and proteins. Depending upon the context of the site, the protein can bind a histone acetyltransferase (HAT)-containing complex and function as a
transcriptional activator or bind a histone deacetylase (HDAC)-containing complex and function as a transcriptional repressor. Mutations in this gene have been associated with invasive
breast cancers, prostate cancers, and Wilms' tumors.
Active, proximal genome states
The E and T states have substantial cell-specific behaviour, whereas the TSS state
has a bimodal behaviour, with similar numbers of cell-invariant and cell-specific occurrences.
Variability of states between cell lines, showing the distribution of occurrences of the state in the six cell lines (Tier 1 + Tier 2) at
specific genome locations: from unique to one cell line (1) to ubiquitous in all six cell lines (6) for five states (CTCF, E, T, TSS and
R).
Distribution of transcription factors across genome state segments
88% of GWAS-associated SNPs are either intronic or intergenic. ENCODE found that 12% of
these SNPs overlap transcription-factor-occupied regions, whereas 34% overlap DHSs.
The control SNP sets are (from left to right): SNPs on the Illumina 2.5M chip, as an example of a widely used GWAS SNP typing panel; SNPs from the 1000 Genomes
project; and SNPs extracted from 24 personal genomes (see personal genome variants track at http://main.genome-browser.bx.psu.edu (ref. 80)), all shown as blue bars.
Eight SNPs associated with Crohn’s disease and other inflammatory diseases reside in a large
gene desert on chromosome 5, along with some epigenetic features indicative of function.
No genes here!
The SNP (rs11742570) strongly associated with Crohn’s disease overlaps a GATA2 transcription-factor-binding signal determined
in HUVECs. This region is also DNase I hypersensitive in HUVECs and T-helper TH1 and TH2 cells.
“Nonrandom association of phenotypes with ENCODE cell
types strengthens the argument that at least some of the
GWAS lead SNPs are functional or extremely close to
functional variants.”