
Genomes 4 (Epigenomics + ENCODE) 2022


Inside the nucleus

(accessing the genome)


Key Concepts

DNA packaging and chromatin modification


• DNA is packaged by wrapping it around histone octamers to form nucleosomes. Four
histone proteins (H2A, H2B, H3, and H4) form the nucleosome structure, and part of the
N-terminus of each of these proteins protrudes out of the nucleosome. This is very
important because those amino acids can be subjected to many post-translational
modifications.

• Nucleosomes can form higher-order folded structures that compact the DNA more
effectively into heterochromatic or euchromatic structures. Actively transcribed genes
are generally found in areas of euchromatin, whereas those that are normally
repressed are generally found in areas of heterochromatin.

• Although nucleosome packaging is clearly essential, it does present something of a
problem to the cell because DNA wrapped in nucleosomes is sterically occluded from
many protein complexes that must act on it.
Each DNA molecule is packaged into a chromosome that is 50,000 times shorter than its
extended length.
The fundamental unit of chromatin is the nucleosome core particle (NCP): 147 bp of DNA wrapped in
a left-handed superhelix around an octamer of the four core histones H2A, H2B, H3, and
H4. Linker DNA serves to join the nucleosomes together and is associated with the
linker histone H1.
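As a rough illustration of these numbers, the sketch below estimates how many nucleosomes a stretch of DNA could carry and how long the same DNA would be if fully extended. Only the 147 bp core length comes from the text; the 50 bp linker length and the 0.34 nm rise per base pair (B-form DNA) are assumptions chosen for illustration, and the function names are hypothetical.

```python
CORE_BP = 147           # DNA wrapped around one histone octamer (from the text)
LINKER_BP = 50          # ASSUMED average linker length; it varies between species and cell types
RISE_NM_PER_BP = 0.34   # approximate rise per base pair of B-form DNA


def nucleosome_count(dna_bp, linker_bp=LINKER_BP):
    """Number of complete core-plus-linker repeats that fit in dna_bp base pairs."""
    return dna_bp // (CORE_BP + linker_bp)


def extended_length_um(dna_bp):
    """Length of the naked, fully extended double helix in micrometres."""
    return dna_bp * RISE_NM_PER_BP / 1000.0


if __name__ == "__main__":
    one_megabase = 1_000_000
    print(nucleosome_count(one_megabase), "nucleosomes per Mb (under the assumed linker length)")
    print(extended_length_um(one_megabase), "micrometres of extended DNA per Mb")
```

With a different linker length the nucleosome count changes accordingly, so the output should be read as an order-of-magnitude estimate only.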
Nucleosome assembly and disassembly
In the second level of folding, a string of nucleosomes folds
into a fiber with an approximate diameter of 30 nm
Chromatin fibers are further organized into euchromatin and
heterochromatin

As cells leave mitosis, large regions of each chromosome become decondensed and disperse in the nuclei: euchromatin,
which contains most of the transcriptionally active genes. Some chromosomal regions, and whole chromosomes in certain
cases, remain condensed: heterochromatin, commonly includes regions surrounding the centromeres and telomeres.
The Barr body is an example of
facultative heterochromatin
(heterochromatic in some cells,
euchromatic in others)
Each chromosome has its own territory within the nucleus

Initially it was thought that chromosomes are distributed randomly within a eukaryotic
nucleus. We now know that this view is incorrect and that each chromosome occupies
its own space or territory.

Human chromosome territories in false color. Each chromosome is numbered. Below:
three-dimensional simulation of the chromosome arrangement.
Genomes IV, chapter 10
• Chromosome territories appear to be fairly static within an individual nucleus.

• The relative positioning of territories is not retained after cell division, as different
patterns are observed in the nuclei of daughter cells.

• There may, however, be certain constraints on territory locations, as chromosome
translocations are more frequent between certain pairs than others.

Philadelphia chromosome, CML (chronic myeloid leukemia):
the translocation site generates the BCR-ABL oncogene (fusion protein).
Chromosome territories and transcriptional factories

• The chromatin contained within each chromosome is not randomly distributed in the nucleus
but occupies a specific location known as the chromosome territory.
• The chromosome territories are formed by the interaction of chromatin with the nuclear lamina.
Chromatin is placed into the chromosome territories according to its state of
activation. Heterochromatin is more likely to be held near the nuclear periphery
through such interactions, and this tends to segregate inactive genes in such regions.
• Genes required for active transcription are often found on “loops” of DNA that
protrude from the chromosomes held near the nuclear periphery into so-called
“transcription factories” where RNA synthesis takes place. Transcription factories have the
potential to bring genes that are co-expressed into direct contact with one another. In
fact, the gene loci move to an immobilized polymerase already present in a factory,
rather than the transcriptional machinery being recruited to and moving along the chromatin template to
a gene.
The mammalian cell nucleus
contains chromatin in the form
of chromosome territories (CTs).

Constitutive heterochromatin (dark
brown) is mainly found as
pericentromeric chromatin in
patches throughout the nuclear
volume, at the nuclear periphery,
and around nucleoli. Transcription
(green), replication (yellow), and
DNA repair processes (light blue)
usually occur in small domains with
a diameter of less than 100 nm. A
diverse set of nuclear bodies, such
as speckles, paraspeckles, the
perinucleolar compartment, Cajal
bodies, and PML bodies, are found
in the interchromatin space.
CTs may overlap at their touching
borders (intermingling) or create the
so-called interchromatin space.
The majority of genes prefer to occupy the
surface of the territory, others loop apart
and associate with nuclear bodies involved
in transcription (activation or repression).
Gene-rich, early replicating chromosomal
regions are generally clustered in the
internal areas, whereas gene-poor, late-
replicating regions seem to be located at
the periphery.
Nuclear speckles
Higher eukaryotic cell nuclei are highly compartmentalized into bodies and structural assemblies of specialized
functions. Nuclear speckles are one of the most prominent nuclear bodies, yet their functional significance
remains largely unknown.

The ‘nuclear speckle’ name is now reserved for one particular nuclear body, ranging in
size from 0.3 to 3 µm and in number from several to tens per nucleus. To date, two
models have been proposed for the possible physiological functions of nuclear
speckles:
• Model 1 suggests nuclear speckles are storage sites for RNA
processing factors.
• Model 2 instead suggests nuclear speckles are ‘gene
expression hubs’, enhancing transcription, post-
transcriptional splicing, and/or other RNA processing
activities for a subset of genes through rapid recycling of
components between speckles and genes
Nuclear speckles (handwritten ‘a’) using different histochemical stains,
including silver staining; also shown are nucleoli (handwritten ‘b’).
Original drawing at the Cajal Institute, CSIC, Madrid
Topologically Associated Domains (TADs)

Each chromosome is organized in topologically associated domains (TADs): regions that are
folded in 3D and contain strongly interacting sequences.
Each TAD contains a set of genes that are subject to a similar expression pattern, so a TAD
is a FUNCTIONAL DOMAIN.
• Open TADs = active transcription
• Closed TADs = genes not expressed

• Most interactions occur within the same chromosome; only a few interactions occur
between different chromosomes.

• Domain architecture is conserved:
– in different cell types within the same organism
– in related species

Each of the four domains is a contiguous segment of chromatin. In this example, the
second and third domains are linked by an unstructured chromatin segment.
Topologically Associating Domains
A topologically associating domain (TAD) is a sub-megabase region of high
self-interaction that displays limited interaction outside the domain, meaning that DNA
sequences within a TAD physically interact with each other more frequently than with
sequences outside the TAD. Boundaries on both sides of these domains are conserved
between different mammalian cell types and even across species, and are highly
enriched in CCCTC-binding factor (CTCF) and cohesin binding sites. In addition, some
types of genes (such as transfer RNA genes and housekeeping genes) appear near TAD
boundaries more often than would be expected by chance. In humans and mice there
are 2000–3000 domains, with an average size of about 1 Mb.
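The defining property of a TAD, more contacts inside the domain than across its boundary, can be made concrete with a toy contact matrix. The following is a minimal sketch, not a TAD-calling method: the 6-bin matrix, the domain assignment, and the function names are all made up for illustration, and real Hi-C analysis uses normalized genome-wide matrices and dedicated tools.

```python
def mean_contact(matrix, rows, cols):
    """Mean contact count between two sets of bins, ignoring the diagonal."""
    values = [matrix[i][j] for i in rows for j in cols if i != j]
    return sum(values) / len(values)


def intra_inter_ratio(matrix, domain, outside):
    """Ratio of average within-domain contacts to average contacts leaving the domain."""
    intra = mean_contact(matrix, domain, domain)
    inter = mean_contact(matrix, domain, outside)
    return intra / inter


# Hypothetical symmetric contact matrix: bins 0-2 form one domain, bins 3-5 another.
toy_contacts = [
    [0, 9, 8, 1, 1, 0],
    [9, 0, 9, 2, 1, 1],
    [8, 9, 0, 2, 2, 1],
    [1, 2, 2, 0, 9, 8],
    [1, 1, 2, 9, 0, 9],
    [0, 1, 1, 8, 9, 0],
]

ratio = intra_inter_ratio(toy_contacts, domain=[0, 1, 2], outside=[3, 4, 5])

if __name__ == "__main__":
    print(f"intra/inter contact ratio for the first domain: {ratio:.2f}")
```

A ratio well above 1 is what "high self-interaction with limited interaction outside the domain" means in practice; a ratio near 1 would indicate no domain structure in that region.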

TADs were discovered in 2012 using chromosome conformation capture techniques
including Hi-C. They have been shown to be present in multiple species, including fruit
flies (Drosophila), mice, plants, fungi, and humans. In bacteria, they are
referred to as Chromosomal Interacting Domains (CIDs).
Insulators (Genomes IV, chapter 10)

• The boundaries of TADs are marked by insulators, which are involved in 3D genome organization at
multiple spatial scales and are important for dynamic reorganization of chromatin structure during
reprogramming and differentiation.
• An insulator is typically 300 bp to 2000 bp in length and contains clustered binding sites
for sequence specific DNA-binding proteins that mediate intra- and inter-chromosomal
interactions.
• Insulators maintain the independence of each TAD, preventing cross-talk between
adjacent domains. Insulators prevent the genes within a domain from being influenced by
the regulatory modules present in an adjacent domain
Insulator = 1–2 kb

Insulators prevent the genes within a domain from being influenced by the regulatory
modules present in an adjacent domain (see fig B). If an insulator is excised from its normal
location and reinserted between a gene and the upstream regulatory modules that control
expression of that gene, then the gene no longer responds to its regulatory modules
Positional effect of Insulators

(A) A cloned gene that is inserted into a region of highly packaged chromatin will be inactive, but one inserted into open
chromatin will be expressed.
(B) The results of cloning experiments without (red) and with (blue) insulator sequences. When insulators are absent, the
expression level of the cloned gene is variable, depending on whether it is inserted into packaged or open chromatin.
When flanked by insulators, the expression level is consistently high because the insulators establish a TAD at the
insertion site.

How Insulators Work

The functional component of an insulator is not the insulator sequence itself but the
proteins that bind to it:

• DNA-binding proteins, such as Su(Hw) in Drosophila and CTCF (CCCTC-binding factor) in
mammals, attach specifically to insulators.
• These proteins act as a molecular glue that holds pairs of insulators together, enabling the DNA
between the insulators to loop out and form the TAD.
Chromatin compaction restricts access to the information
content of DNA

• DNA that is incorporated into constitutive heterochromatin should not be required for
any transcriptional processes because for the most part these regions contain only
satellite repeats, telomeric DNA etc., no genes.

• Genes are incorporated into euchromatic regions of chromosomes. DNA wrapped in
nucleosomes is sterically occluded from many protein complexes that must act on it.

How does the cell access the genetic information of the DNA when it needs to make
messenger RNA?
Key Concepts
Modifying the structure of chromatin

The transcriptional machinery needs to access the DNA sequence. Therefore, the
structure of chromatin has to change in order to expose the DNA required for interaction
with proteins controlling transcription.

How to increase the accessibility of the genome to DNA-binding proteins:

• ATP-dependent chromatin remodeling activities that alter the DNA-histone
interaction noncovalently.

• Chromatin-modifying activities based on covalent modifications of histone tails or
histone core proteins, such as acetylation or methylation of key amino acids on the
N-terminal tails, that cause changes in nucleosome occupancy.
Nucleosome Remodeling Factor (NURF)
Chromatin remodeling involves ATP-dependent nucleosome remodeling
factors, whose global function is to
modulate the access of transcription
factors to chromosomal DNA.

They do so by hydrolyzing ATP to
noncovalently restructure, mobilize, or
eject nucleosomes, disassembling them or
redistributing their locations in response
to the binding of proteins such as
transcription factors that control gene
activity.
Chromatin modification based on histone
covalent modifications

Histone modifications include acetylation, phosphorylation, methylation, ADP ribosylation, and ubiquitination. Multiple
residues on each of the four core histones have been identified as potential modification sites and some lysine side chains can
be either methylated or acetylated. There is strong evidence that histone acetylation promotes the disruption of nucleosomes
at promoters in advance of initiation, whereas histone hypermethylation is often related to transcriptional repression
Effects of DNA methylation on transcription

Three models:

• Methylated and unmethylated DNA lead to quite different chromatin
structures. Unmethylated DNA: open chromatin; methylated DNA: closed chromatin.
• Binding of a transcription factor to its recognition site may be specifically
inhibited by the presence of methylated DNA at that site. Several transcription
factors recognize sequences that contain CpG residues, and this recognition has been
shown to be inhibited by methylation.
• Direct binding of specific transcriptional repressors to methylated DNA.
Two factors have been identified that are good candidates for this type of
repression, namely methyl-CpG-binding proteins 1 and 2 (MeCP1 and MeCP2). These
repressors bind to the 5-methylcytosine residues of CpG dinucleotides.
DNA methylation in eukaryotes involves the addition of a methyl group to carbon 5
of the pyrimidine ring of cytosine, in a reaction that is catalyzed by DNA
methyltransferases.

DNA methylation is a common epigenetic modification and has been implicated in gene expression control (usually
repression) in several species, although it is by no means a mechanism universal to all eukaryotes.
Methylated CpG islands are the target sites
for attachment of histone deacetylase
complexes (HDAC) that modify the
surrounding chromatin in order to silence
the adjacent genes
The Human Epigenome
200 different cell types, only one genome

How does one type of cell, for example a fibroblast, “know” that it is different from a
neuron or a muscle cell, given that all these cells have essentially the same information
about protein synthesis contained in their genomes?
The epigenetic memory is a natural mechanism by which the identity
of a cell is maintained through successive cell cycles during
development and differentiation. The states that define cell identity
are indeed heritable and maintained.

The identities of cells and tissues in multicellular organisms can be
maintained by their particular epigenome.
Epigenetics and epigenomics

The term epigenetics refers to heritable changes in gene expression (active versus
inactive genes) that do not involve changes to the underlying DNA sequence; a change
in phenotype without a change in genotype. An epigenetic system should be heritable,
self-perpetuating, and reversible.

In eukaryotes, chromatin is at the heart of most epigenetic processes. The epigenome
refers to these states at the whole genome level.

Chromatin states vary from cell type to cell type and along chromosomes. A
multicellular organism will be characterized by one genome, but by as many epigenomes as
there are cell types.
3-1 Epigenome.pdf

Self-propagating transcriptional states that are
maintained through feedback loops and
networks of transcription factors (TFs) are the
most common type of trans epigenetic states.

If a TF activates its own transcription, it yields
an epigenetic state that is self-sustaining after
the originating stimulus is removed.

After each cell division, inherited TFs resume
their trans function on regulatory DNA
sequences.
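The idea that a self-activating TF keeps its own expression on after the stimulus is gone can be sketched with a very small simulation. This is a toy model under stated assumptions, not something taken from the slides: the Hill-type self-activation term, all parameter values, and the function name `simulate` are hypothetical choices that happen to give two stable states (low and high).

```python
def simulate(stimulus_duration, t_end=50.0, dt=0.01):
    """Euler integration of dx/dt = basal + stimulus + self-activation - decay.

    x is the (dimensionless) TF level. The stimulus is on from t = 0 until
    t = stimulus_duration and off afterwards. All parameters are assumptions.
    """
    basal = 0.01   # leaky production
    beta = 1.0     # maximal self-activated production
    k = 0.5        # TF level giving half-maximal self-activation
    n = 4          # Hill coefficient (cooperativity)
    gamma = 1.0    # decay/dilution rate
    x = 0.0
    steps = int(t_end / dt)
    for step in range(steps):
        t = step * dt
        stimulus = 1.0 if t < stimulus_duration else 0.0
        production = basal + stimulus + beta * x**n / (k**n + x**n)
        x += dt * (production - gamma * x)
    return x


if __name__ == "__main__":
    print("never stimulated:      ", round(simulate(0.0), 3))
    print("transiently stimulated:", round(simulate(5.0), 3))
```

In this sketch the transiently stimulated cell settles in the high state long after the stimulus has ended, while the unstimulated cell stays low; the two cells differ in expression state without differing in "sequence", which is the point of the slide.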
The epigenome modules

The epigenome is composed of two modules:
i) a noncovalent module: the chromatin and its associated
chromatin-modifying and remodeling activities;
ii) a module that is part of the covalent structure of DNA: the methylated
cytosines located in the dinucleotide sequence CG.
More on chromatin modification based on
histone covalent modifications in the
ENCODE section
Key Concepts

DNA methylation
• DNA methylation is a common epigenetic modification and has been implicated in gene
expression control (usually repression) in several species
• The main methylation event is the formation of 5-methylcytosine.
• DNA methylation is not uniform across the human genome and tends to be enriched in
CpG islands usually found in gene promoter regions.
• DNA methylation is performed by DNA methyltransferases. These can establish de
novo DNA methylation or maintain existing methylation patterns during DNA replication.
• DNA methyltransferases function as parts of protein complexes that modify chromatin
structure.
• Histone modification patterns, transcription factors, and miRNAs direct the DNA
methyltransferases and their associated chromatin-structure-modifying complexes to
specific locations in the genome.
CpG-rich and CpG-poor islands

• CpG-rich islands (approximately a tenfold enrichment over the rest of the
genome) are infrequently methylated. They remain consistently devoid of
methylation throughout the development of the embryo and the life of the adult
organism. They typically occur at or near the transcription start site of
housekeeping genes.
• CpG-poor islands (approximately one CpG per 100 nucleotides) are frequently
methylated. CpG-poor promoters are typical of tissue-specific genes that show
precise spatio-temporal expression control during development.
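CpG density figures such as "one CpG per 100 nucleotides" can be computed directly from a sequence. The sketch below counts CpG dinucleotides on one strand and also reports the observed/expected CpG ratio, a measure commonly used to describe CpG islands; that ratio is not mentioned in the slides and is included here as an assumption about what a reader might want to compute. The function name and the example sequences are hypothetical.

```python
def cpg_stats(seq):
    """Return CpG density (per 100 nt) and the observed/expected CpG ratio.

    observed/expected = (number of CpG * sequence length) / (number of C * number of G)
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return {"cpg_per_100nt": 0.0, "obs_exp": 0.0}
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so a plain count is exact
    c_count = seq.count("C")
    g_count = seq.count("G")
    obs_exp = (cpg * length) / (c_count * g_count) if c_count and g_count else 0.0
    return {"cpg_per_100nt": 100.0 * cpg / length, "obs_exp": obs_exp}


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCGCG"))      # made-up CpG-dense sequence
    print(cpg_stats("ATGCATTTAGGCAT"))  # made-up sequence without any CpG
```

Real CpG island annotation applies thresholds over sliding windows of genomic sequence; this function only shows the per-sequence arithmetic.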
Maintenance and de novo methylation
Global cytosine methylation in mammals is established by three independently encoded
DNA methyltransferases (DNMTs): DNMT1, DNMT3A, and DNMT3B.

De novo methylation is the establishment of 5-methylcytosine at genomic loci where none
has previously existed. It is responsible for the increase in global DNA methylation that
occurs early in embryonic development.

Maintenance methylation preserves existing patterns of DNA methylation and thereby
maintains cellular identity. This function is performed by the DNA methyltransferase
DNMT1, which methylates the cytosines of hemi-methylated CpG sites on the newly
synthesized DNA strands of dividing cells.
..ACGACTACG..
..TGCTGATGC..

Symmetrical (CG) DNA methylation has a straightforward mechanism of
inheritance through DNA replication: replication results in two
hemi-methylated daughter duplexes, and a DNA methyltransferase can be recruited to
these sites to fill in the missing methylation mark on the newly replicated
daughter strand.

Owing to this faithful mode of mitotic inheritance, symmetrical DNA
methylation is often referred to as an epigenetic mark.
Principle of mDNMT1-mediated maintenance methylation
Representation of maintenance

Each of the daughter strands
synthesized by the replicative DNA polymerase is
a faithful copy only of the sequence
of bases found on the parent strands.
The DNA polymerase cannot
distinguish 5-methylcytosine from
cytosine; DNMT1 is therefore required
to bind to the sites where
5-methylcytosine is located opposite its
unmethylated cytosine partner
(hemi-methylated DNA) and add the missing
methyl group.
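The replicate-then-restore logic of maintenance methylation can be written out as a small simulation. This is a minimal sketch under simplifying assumptions: a duplex is represented only by the set of methylated CpG sites on each strand (indexed by the position of the C on the top strand), maintenance is treated as perfectly efficient, and the function names are hypothetical. The example sequence is the top strand shown in the slides.

```python
def cpg_sites(top_strand):
    """Positions (0-based, top strand) of the C in every CpG dinucleotide."""
    return [i for i in range(len(top_strand) - 1) if top_strand[i:i + 2] == "CG"]


def replicate(duplex):
    """Semi-conservative replication.

    Each daughter duplex keeps one parental strand with its marks; the newly
    synthesized partner strand starts with no methylation at all.
    """
    top_marks, bottom_marks = duplex
    return [(set(top_marks), set()), (set(), set(bottom_marks))]


def maintain(duplex):
    """DNMT1-like step: at hemi-methylated sites, methylate the unmarked strand."""
    top_marks, bottom_marks = duplex
    restored = top_marks | bottom_marks
    return (set(restored), set(restored))


if __name__ == "__main__":
    sites = cpg_sites("ACGACTACG")
    print("CpG sites:", sites)
    parent = ({sites[0]}, {sites[0]})   # only the first CpG is methylated, on both strands
    for daughter in replicate(parent):
        print("hemi-methylated:", daughter, "-> after maintenance:", maintain(daughter))
```

Because the maintenance step only copies marks that already exist on the parental strand, a CpG that was unmethylated in the parent stays unmethylated in both daughters, which is why the pattern, and not just the overall level, is inherited.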
A key feature of histone N-terminal tails is that their individual amino acids can
be subjected to a wide range of post-translational modifications that can provide the
cell with information for controlling gene transcription that goes beyond the information
contained in the underlying sequence of the DNA itself.

One of the main histone post-translational modifications is methylation of the
side-chain amino group of lysine residues located at specific positions along the length of the
N-terminal tail.
Methylation of the fourth lysine residue of the N-terminal tail of
histone H3 (H3K4)

H3K4 methylation is frequently present
on the nucleosomes at the promoters of
genes that are either undergoing active
transcription or at least may be expected
to up-regulate their transcription very
rapidly when required. Methylation of
H3K4 has been suggested to protect gene
promoters from de novo DNA
methylation in somatic cells.
Epigenetics and MicroRNAs

Noncoding RNA may control DNA
methyltransferase activity and may
have a decisive role in the control of
chromatin structure by targeting the
post-transcriptional control of several
chromatin-modifying enzymes.
Key Concepts
Human DNA methylome in stem cells and
fibroblasts (1)
The prevailing assumption is that mammalian DNA methylation is located almost
exclusively in the CG context. However, in embryonic stem cells, a handful of studies have
detected non-CG methylation (mCHG and mCHH, where H = A, C or T) that comprises
almost 25% of all cytosines at which DNA methylation is identified.

There are widespread differences in the composition and patterning of cytosine
methylation between human embryonic stem cells and fetal fibroblasts (fibroblasts are the most
common cells of connective tissue in animals).

Nearly one-quarter of all methylation identified in embryonic stem cells is in a non-CG
context. Methylation in non-CG contexts shows enrichment in gene bodies and depletion
in protein binding sites and enhancers.
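The three sequence contexts (CG, CHG, CHH, with H = A, C or T) are defined by the two bases that follow the cytosine. The following minimal sketch classifies a cytosine on the given strand only; it assumes the sequence contains just A, C, G and T, and the function name is hypothetical.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at 0-based index i as 'CG', 'CHG' or 'CHH'.

    Returns None when the sequence ends before the context can be decided.
    Only the strand given in seq is considered.
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    if i + 1 >= len(seq):
        return None
    if seq[i + 1] == "G":
        return "CG"
    if i + 2 >= len(seq):
        return None
    return "CHG" if seq[i + 2] == "G" else "CHH"


if __name__ == "__main__":
    example = "ACGTCAGCTTC"  # made-up sequence
    for index, base in enumerate(example):
        if base == "C":
            print(index, cytosine_context(example, index))
```

Cytosines on the opposite strand would be classified the same way after taking the reverse complement.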
Key Concepts
Human DNA methylome in stem cells and
fibroblasts (2)

Non-CG methylation disappears upon induced differentiation of the embryonic stem
cells, and is restored in induced pluripotent stem cells.

There are hundreds of differentially methylated regions proximal to genes involved in
pluripotency and differentiation, and widespread reduced methylation levels in
fibroblasts are associated with lower transcriptional activity.

There is a positive correlation between gene expression and mCHG or mCHH density.
Furthermore, highly expressed genes have a lower promoter methylation level than
lowly expressed genes.
Non-CG methylation is a fundamental
characteristic of stem cells.
• H1 and H9, two human embryonic
stem cell lines, show non-CG
methylation in conserved positions

• IMR90 induced pluripotent stem
cells (iPS) show restoration of non-CG
methylation

• IMR90: fetal lung fibroblasts

• H1 cells induced to differentiate
with BMP4 (bone morphogenetic
protein 4) lose non-CG methylation,
like the IMR90

In the H1 stem cells we detected abundant DNA methylation in non-CG contexts (mCHG and mCHH,
where H = A, C or T).
DNA methylation sequence context is displayed according to the key and the percentage methylation at each
position is represented by the fill of each circle.
Distribution of the methylation level in each sequence context. The y axis indicates the
fraction of all methylcytosines that display each methylation level (x axis), where methylation
level is the mC/C ratio at each reference cytosine.
H1 has both mCG and mCHG + mCHH, whereas IMR90 has only mCG
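The "methylation level" used in these plots is a simple ratio per reference cytosine, and the plotted distribution is the fraction of sites falling into each level bin. The sketch below shows both steps. It assumes that the mC/C ratio is estimated from sequencing read counts (reads supporting a methylated C over all reads covering that C); the counts, the number of bins, and the function names are hypothetical.

```python
def methylation_level(methylated_reads, total_reads):
    """mC/C ratio at one reference cytosine; None if the site has no coverage."""
    if total_reads == 0:
        return None
    return methylated_reads / total_reads


def level_distribution(levels, n_bins=5):
    """Fraction of sites falling into each of n_bins equal-width level bins on [0, 1]."""
    if not levels:
        return [0.0] * n_bins
    counts = [0] * n_bins
    for level in levels:
        index = min(int(level * n_bins), n_bins - 1)  # a level of 1.0 goes into the last bin
        counts[index] += 1
    return [count / len(levels) for count in counts]


if __name__ == "__main__":
    # Hypothetical (methylated reads, total reads) pairs for four methylcytosines.
    sites = [(1, 10), (5, 10), (9, 10), (10, 10)]
    levels = [methylation_level(m, t) for m, t in sites]
    print(levels)
    print(level_distribution(levels))
```

A context in which most sites are methylated in nearly every read piles up in the last bin, whereas a context that is methylated in only a minority of reads spreads toward the lower bins.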
Relative methylation density within
gene bodies as a function of gene expression, in H1
(figure panels: high expression, low expression)

• Positive correlation between gene expression and mCHG or mCHH density.
• Highly expressed genes contain threefold higher non-CG methylation
density than non-expressed genes.
Promoter methylation and expression

Promoter methylation levels are
inversely related to the level of
gene expression.

Highly expressed genes have a
lower promoter methylation
(mCG, mCHG or mCHH) level
than lowly expressed genes.

(figure panels: highly expressed genes, lowly expressed genes)
a Methylation atlas composed of 25 tissues and
cell types (columns) across ~8000 CpGs (rows).
For each cell type, we selected the top 100
uniquely hypermethylated (top) and 100 most
hypomethylated (bottom) CpG sites, giving a total
of 5000 tissue-discriminating individual CpGs. We
then added neighboring (up to 50 bp) CpGs, as
well as 500 CpGs that are differentially
methylated across pairs of otherwise similar
tissues. Overall, we used 7890 CpGs that are
located in 4039 500 bp genomic blocks.
Since the human genome was sequenced, the term
‘‘epigenetics’’ is increasingly being associated with
the hope that we are more than just the sum of our
genes.
Might what we eat, the air we breathe, or even the
emotions we feel influence not only our genes but
those of descendants? The environment can
certainly influence gene expression and can lead to
disease, but transgenerational consequences are
another matter.
Although the inheritance of epigenetic characters can
certainly occur—particularly in plants—how much is
due to the environment and the extent to which it
happens in humans remain unclear

(3-4 Epigenetics and progeny.pdf)


Nutrition and DNA methylation

The nutrients we extract from food enter metabolic
pathways where they are manipulated, modified, and
molded into molecules the body can use. One such
pathway is responsible for making methyl groups,
important epigenetic tags that silence genes.
Familiar nutrients like folic acid, B vitamins, and SAM-e
(S-Adenosyl methionine, a popular over-the-counter
supplement) are key components of this methyl-making
pathway. Diets high in these methyl-donating nutrients
can rapidly alter gene expression (!), especially during
early development when the epigenome is first being
established

http://learn.genetics.utah.edu/content/epigenetics/nutrition/
The Avy metastable epiallele resulted from the insertion of an intracisternal A particle (IAP),
an endogenous retroviral element, upstream of the transcription start site of the Agouti gene.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2822875/
Metastable epialleles

Metastable epialleles are identical alleles that are variably expressed due to
epigenetic modifications that are established very early in development.

They are most often associated with retroelements and transgenesis.

They are exceptions that prove the rule!


For every case of stress/food/environment memory, the possibility of an
epigenetic basis must be confirmed. By definition, this requires that the
phenomenon is both stable and heritable (through cell divisions), yet
independent of DNA sequence change and thus at least in principle reversible.
A truly transgenerational stress memory is very likely to be epigenetic, but this
may not hold for somatic stress memory because of the shorter duration.

Lämke and Bäurle Genome Biology (2017) 18:124


The term ‘epigenetic mechanisms’ has been adopted by the scientific
literature to encompass all of the parameters that impact on the structure
of chromatin, including DNA methylation, whether or not they are stably
inheritable. This term provides a convenient label for chromatin
modifications (both on histones and DNA) and thus is hard to eradicate, but
this wide definition has caused considerable confusion.
Consequently, in the scientific field, the view has gained acceptance that
the term ‘epigenetic mechanisms’ should only be used when referring to
truly epigenetic phenomena.

Lämke and Bäurle Genome Biology (2017) 18:124


Epigenetic phenomenon/modification

Epigenetic phenomenon
A stable and heritable (through cell divisions) change in gene expression that is
independent of DNA sequence changes and is, in principle, reversible.

Epigenetic modification
A term commonly used to describe a change in nucleosome structure caused by histone
modifications, histone variants, or modification (methylation) of the DNA. These
changes are not necessarily epigenetic (see ‘epigenetic phenomenon’) in the sense that
they are stable through cell divisions, but (such as symmetrical DNA methylation) some
might be.

Lämke and Bäurle Genome Biology (2017) 18:124


https://www.nature.com/nature/journal/v502/n7472/pdf/nature12750.pdf

Dynamics of 5mC and its oxidation products in pre-implantation embryos. Although the
maternal DNA goes through passive demethylation, the paternal genome is demethylated
in two steps. Tet3 first oxidizes the 5mC in the paternal genome, and the oxidation products
are then diluted through a replication-dependent process.

DNA methylation patterns are re-established by de novo DNA methyltransferase (DNMT)
enzymes at the blastocyst stage.
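Replication-dependent (passive) dilution has simple arithmetic behind it: if new strands are not re-methylated, the share of strands carrying the original mark halves with every round of replication. The following minimal sketch works under that assumption; the `maintenance_efficiency` parameter and the function name are hypothetical additions used to contrast dilution with fully maintained methylation.

```python
def marked_strand_fraction(rounds, maintenance_efficiency=0.0):
    """Fraction of DNA strands carrying the mark after a number of replication rounds.

    After each replication half of all strands are newly synthesized. A new strand
    is marked only if its template strand was marked AND maintenance acted on it
    (probability maintenance_efficiency), so the marked fraction f becomes
    f * (1 + maintenance_efficiency) / 2.
    """
    fraction = 1.0
    for _ in range(rounds):
        fraction *= (1.0 + maintenance_efficiency) / 2.0
    return fraction


if __name__ == "__main__":
    for rounds in range(5):
        print(rounds, marked_strand_fraction(rounds), marked_strand_fraction(rounds, 1.0))
```

With no maintenance the mark is reduced to one eighth of the strands after three divisions, which is why a few cleavage divisions are enough to erase most of it.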
KEY POINTS
1. DNA methylation patterns change
through development: naive
pluripotency, pre-implantation
epiblasts and primordial germ cells
are associated with global DNA
demethylation, whereas post-
implantation epiblasts and epiblast-
derived stem cells (EpiSCs) have high
levels of DNA methylation.
2. Exit from naive pluripotency and
embryonic stem cell differentiation
is accompanied by progressive
restriction of chromatin accessibility
and histone acetylation.

DNA methylation is erased in the paternal and maternal genomes after fertilization and is deposited again at later developmental stages, in serum embryonic stem cells (ESCs) and in
primed cells (epiblast-derived stem cells (EpiSCs) and epiblast-derived stem cell-like cells (EpiSCLCs)). DNA methylation is reduced during primordial germ cell (PGC) development.
Similarly, the silencing histone marks (for example, trimethylation of histone H3 at lysine 27 (H3K27me3) and H3K9me2) increase post-implantation and are reduced in 2i-cultured ESCs
and PGCs. H3K4me3 is present in broad domains in the oocyte but is restricted to transcription start sites after fertilization.

The interplay of epigenetic marks during stem cell differentiation and development. Nature Reviews Genetics 18, 643–658 (2017)
Imprinted Genes Bypass
Epigenetic Reprogramming
The epigenetic basis of gene imprinting
Genomic imprinting refers to monoallelic gene expression that occurs in a manner that
is specific to the parent of origin.
A small percentage (less than 1%) of genes are expressed from only one allele. Because
each of these alleles is derived from a different parent, the imprinted genes are
sometimes described as being maternally or paternally imprinted.
Imprinted genes hold a special status in the pre-implantation embryo in that they seem
to resist the genome-wide demethylation events that take place shortly after
fertilization.
Many, but not all, imprinted genes are found in clusters throughout the genome and in
most cases these genes, which can be either maternally or paternally expressed, are
jointly regulated through an imprinting control region (ICR). Indeed, removal of ICRs
from known imprinted genes leads to the loss of imprinting, that is, to expression
from both alleles.
Methylation is involved in genomic imprinting and X
inactivation

For example, in those rare individuals that possess just a single X chromosome, no inactivation occurs,
and in those individuals with an XXX karyotype, two of the three X chromosomes are inactivated
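The two examples follow the general counting rule that all X chromosomes but one are inactivated. A minimal sketch of that rule, with a hypothetical function name and karyotypes written as plain strings of sex chromosomes:

```python
def inactivated_x_count(sex_chromosomes):
    """Number of inactivated X chromosomes: every X except one."""
    x_count = sex_chromosomes.upper().count("X")
    return max(x_count - 1, 0)


if __name__ == "__main__":
    for karyotype in ("X", "XX", "XY", "XXX"):
        print(karyotype, inactivated_x_count(karyotype))
```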
Reprogramming of DNA methylation during pre-implantation embryonic development.
Methylation levels throughout pre-implantation development of embryos. The paternal
genome (blue) undergoes rapid active demethylation, whereas the maternal genome
(red) undergoes passive demethylation until the morula stage of pre-implantation
development, when de novo methylation commences.
The formation of pronuclei allows the maternal and paternal gamete genomes to be
kept separate during the first few hours after fertilization. After entry of the
spermatozoon during fertilization, its DNA content is maintained in a separate
membrane-enclosed structure distinct from the female nuclear material. These
structures are referred to as pronuclei, and DNA demethylation proceeds at different
rates in these regions.
Ligers and Tigons
Mules and Hinnies

Lions and tigers don't normally meet in nature. But they can get along very well in
captivity, where they sometimes produce hybrid offspring. The offspring look different,
depending on who the mother is. A male lion and a female tiger produce a liger - the
biggest of the big cats. A male tiger and a female lion produce a tigon, a cat that is about
the same size as its parents.
The difference in size and appearance between ligers and tigons is due to the parents'
differently imprinted genes. Other animals can also hybridize, with similar results. For
example, a horse and a donkey can produce a mule or a hinny.
Evidence of the existence of imprinting
Pronucleus transplantation experiments in mice: creation of androgenetic
and gynogenetic zygotes

• Gynogenetic zygotes (2n chromosomes, all of female origin): all embryos aborted
• Androgenetic zygotes (2n chromosomes, all of male origin): all embryos aborted
• Controls (zygotes obtained by transfer of pronuclei; 2n chromosomes, n supplied by a male and n by a female): normal embryos
Maternal imprinting
[Figure: paternal (P) and maternal (M) alleles in the male parent, the female parent, and the child]
Beckwith-Wiedemann Syndrome (BWS)
Disease caused by gain of function of a maternally imprinted gene. The gene maps to 11p15.

• In normal subjects only the paternal copy is expressed.
• Duplication on the paternal chromosome results in the doubling of the gene product and the onset of the disease.
• Duplication on the maternal chromosome is without consequences because the supernumerary copy is not expressed.
Beckwith-Wiedemann Syndrome (BWS)

A mutation in the imprinting center prevents gene silencing in cis.

• The mutation is on the paternal chromosome → there are no phenotypic consequences because the copy that cannot be turned off is expressed in any case.
• The mutation is on the maternal chromosome → the person is affected because he/she has two active copies of the gene.
Genomic Imprinting
(Prader-Willi and Angelman Syndromes)
Prader-Willi Syndrome (PWS) - disease due to the absence of the PWS 'gene' function
(these are various genes which for simplicity are considered here as a single gene),
subject to maternal imprinting (only the copy provided by the father is expressed),
which maps to 15q11-13
Angelman Syndrome (AS) - disease due to absence of function of the AS gene, subject
to paternal imprinting (only the copy provided by the mother is expressed), which maps
to 15q11-13
Both diseases can be due to:
• deletion of the entire chromosomal region 15q11-13;
• uniparental disomy (UPD) (maternal in PWS, paternal in AS);
• imprinting error;
• only for Angelman syndrome: mutation in the maternal copy of the AS gene
• Expression profile in the normal subject: the PWS 'gene' of the paternal chromosome and the AS gene of the maternal chromosome are expressed.
• The deletion is on the paternal chromosome: absence of the PWS 'gene' function → Prader-Willi syndrome
• The deletion is on the maternal chromosome: absence of the function of the AS gene → Angelman syndrome
• Paternal uniparental disomy (UPD): functional absence of the AS gene → Angelman syndrome
• Maternal uniparental disomy (UPD): functional absence of the PWS 'gene' → Prader-Willi syndrome
• Mutation in the imprinting center on the paternal chromosome, which cannot be reprogrammed and is transmitted with a maternal-type imprint: functional absence of the PWS 'gene' → Prader-Willi syndrome
• Mutation in the imprinting center on the maternal chromosome, which cannot be reprogrammed and is transmitted with a paternal-type imprint: functional absence of the AS gene → Angelman syndrome
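The cases above all follow from one rule: a chromosome 15 copy carrying a paternal-type imprint expresses only the PWS 'gene', and a copy carrying a maternal-type imprint expresses only the AS gene. The following is a minimal sketch of that rule, not a clinical model. The data representation (pairs of imprint type and deletion flag) and the function names are assumptions made for illustration.

```python
# Hypothetical representation: each chromosome 15 copy is a pair
# (imprint, deleted). imprint is "P" (paternal-type) or "M" (maternal-type);
# deleted is True when the 15q11-13 region is missing from that copy.

def expressed(copies):
    """Return the set of genes expressed from the given chromosome copies."""
    genes = set()
    for imprint, deleted in copies:
        if deleted:
            continue  # a deleted region expresses neither gene
        # Paternal-type imprint: PWS 'gene' on, AS gene off.
        # Maternal-type imprint: AS gene on, PWS 'gene' off.
        genes.add("PWS" if imprint == "P" else "AS")
    return genes


def phenotype(copies):
    """Map the expressed genes to the outcomes listed above."""
    genes = expressed(copies)
    if genes == {"PWS", "AS"}:
        return "normal"
    if genes == {"AS"}:
        return "Prader-Willi syndrome"
    if genes == {"PWS"}:
        return "Angelman syndrome"
    return "no functional copy of either gene"


print(phenotype([("P", False), ("M", False)]))  # normal expression profile
print(phenotype([("P", True), ("M", False)]))   # deletion on the paternal chromosome
print(phenotype([("P", False), ("P", False)]))  # paternal uniparental disomy
print(phenotype([("M", False), ("M", False)]))  # maternal UPD, or a paternal copy with a maternal-type imprint
```

An imprinting-center mutation is modeled here simply as a copy whose imprint type does not match its parent of origin, which is why it gives the same result as the corresponding uniparental disomy.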
Some species of Cuckoo are brood parasites, laying their eggs in the nests of other species. The cuckoo egg hatches
earlier than the host's, and the cuckoo chick grows faster; in most cases the chick evicts the eggs or young of the host
species. The chick has no time to learn this behavior, so it must be an instinct passed on genetically. The chick
encourages the host to keep pace with its high growth rate with its rapid begging call and the chick's open mouth which
serves as a sign stimulus.
The Genetic Conflict hypothesis

The Genetic Conflict hypothesis supposes that imprinting grew out of a competition between males for
maternal resources.

In some species, more than one male can father offspring from the same litter. A house cat, for example, can
mate more than once during a heat and have a litter of kittens with two or more fathers. If one father's kittens
grow larger than the rest, his offspring will be more likely to survive to adulthood and pass along their
genes. So it's in the interest of the father's genes to produce larger offspring. The larger kittens will be able to
compete for maternal resources at the expense of the other father's kittens.
On the other hand, a better outcome for the mother's genes would be for all of her kittens to survive to
adulthood and reproduce. The mother alone will provide nutrients and protection for her kittens throughout
pregnancy and after birth. She needs to be able to divide her resources among several kittens, without
compromising her own needs.
It turns out that many imprinted genes are involved in growth and metabolism. Paternal imprinting favors
the production of larger offspring, and maternal imprinting favors smaller offspring. Often maternally and
paternally imprinted genes work in the very same growth pathways. This conflict of interest sets up an
epigenetic battle between the parents -- a sort of parental tug-of-war.
http://www.geneimprint.com/site/genes-by-species
The first assisted reproductive technology (ART)-conceived human precedes the
discovery of genomic imprinting in mammals by several years. This is just one
example of how medical technologies often outpace our basic understanding of
biological processes. Furthermore, this underscores the importance of continuously
reassessing and improving existing methods. Increased safety and reduced
epigenetic abnormalities from ART procedures can be achieved through the
knowledge gained from animal and stem cell-based studies, which can ultimately
lead to better health outcomes for ART patients.

Induced pluripotent stem cells (iPSCs)


The Encyclopedia of DNA Elements (ENCODE)
https://www.youtube.com/watch?v=TwXXgEz9o4w
https://www.youtube.com/watch?v=yjpW30z-SB8
https://www.youtube.com/watch?v=PsV_sEDSE2o
Non-coding DNA Is Important For Disease And Gene Regulation

• Many common disease associations and heritability appear to lie outside of protein-coding regions
• Non-coding DNA variants are also known to alter human traits (polydactyly)
• BUT: Reading the non-coding genome is difficult
Importance of Non-coding DNA

Non-coding DNA carries out regulatory functions of the genome and is important for disease and gene regulation:
• Genes are regulated throughout time (e.g., development) and space (e.g.,
different tissues), health and disease, and in response to environmental
exposures
• Many regulatory elements are only active in specific cells/tissues
• Many common disease-associated genetic variants lie outside of protein-
coding regions
Functional information is needed to interpret the role of genetic variation in
human disease, and to apply genomics in the clinic
ENCODE goals

• Catalog all candidate functional elements in the genome
• Make the resource freely available to the community
ENCODE Timeline
[Figure: ENCODE timeline across 2003-2007, 2008-2012, 2013-2016, and 2017-2021, from the Pilot Phase (1% of the genome) through Phase 4. Tracks shown: Data Production (human only in the early phases, human and mouse in the later phases); modENCODE (fly and worm); Mouse ENCODE; Computational Analysis 1 and 2; Functional Characterization; Technology Development 1 & 2, 3, and 4.]
The Encyclopedia of DNA Elements (ENCODE)

06/09/2012

The Encyclopedia of DNA Elements (ENCODE) project aims to delineate all functional elements
encoded in the human genome.
The results were published as 30 ENCODE-related papers across different journals, which
can be explored through the 13 different threads on www.nature.com/ENCODE
The Functional Elements of the genome

ENCODE: genomes contain discrete, linearly ordered units that can be connected with
specific functional features or processes.

A cornerstone of ENCODE has been the use of biochemical signatures to identify
functional elements specified by the genomic sequence.

In part, this represents a departure from the widely accepted reductionist approach to
genome function, in which iterative dissection by truncation or editing of larger
sequences that encompass a given functional activity was coupled to an experimental
read-out of that activity.
Reductionism and Holism in biology

Holism is an approach to research that emphasizes the study of complex systems. Systems
are approached as coherent wholes whose component parts are best understood in
context and in relation to one another and to the whole.

This practice is in contrast to a purely analytic tradition (reductionism) which aims to gain
understanding of systems by dividing them into smaller composing elements and through
understanding their elemental properties. The holism-reductionism dichotomy is often
evident in conflicting interpretations of experimental findings and in setting priorities for
future research.
ENCODE’s functional elements

ENCODE defines a functional element as a discrete genome segment that:

• Encodes a defined product (e.g., protein or non-coding RNA)
• Displays a reproducible biochemical signature (e.g., protein binding, or a specific chromatin structure)

Previous studies suggested that 3-8% of bases are under purifying (negative) selection and
therefore may be functional.
The Functional Elements of the genome

The biochemical signature strategy was motivated by the recognition of common
biochemical or biophysical events that invariably attended certain types of noncoding
functional elements.

Example:

Active promoters are marked by alterations in chromatin structure that give rise to nuclease hypersensitivity
of the underlying DNA.
Main criticisms

• [...] the aim of the ENCODE Consortium to identify functions experimentally is, in
principle, a worthy one. We have already seen that ENCODE uses an evolution-free
definition of “functionality.” [...]

• [...] According to ENCODE, for a DNA segment to be ascribed functionality it needs to (1)
be transcribed or (2) associated with a modified histone or (3) located in an open-
chromatin area or (4) to bind a transcription factor or (5) to contain a methylated CpG
dinucleotide. We note that most of these properties of DNA do not describe a
function; some describe a particular genomic location or a feature related to nucleotide
composition [...] (example: Transcription-factor-binding region promoting a
pseudogene)
• Doolittle, W.F. (2013). Is junk DNA bunk? A critique of ENCODE. Proceedings of the National Academy of
Sciences USA 110: 5294–5300.
• Eddy, S.R. (2012). The C-value paradox, junk DNA and ENCODE. Current Biology 22: R898–R899.
• Eddy, S.R. (2013). The ENCODE project: Missteps overshadowing a success. Current Biology 23: R259–R261.
• Graur, D., Y. Zheng, N. Price, R.B.R. Azevedo, R.A. Zufall, and E. Elhaik. (2013). On the immortality of
television sets: “Function” in the human genome according to the evolution-free gospel of ENCODE.
Genome Biology and Evolution 5: 578-590.
• Niu, D.-K. and L. Jiang. (2013). Can ENCODE tell us how much junk DNA we carry in our genome?
Biochemical and Biophysical Research Communications 430: 1340-1343.
• Hurst, L.D. (2013). Open questions: A logic (or lack thereof) of genome organization. BMC Biology 11: 58.
• …
ENCODE findings

The function of the vast majority of the human genome is unknown. The ENCODE
project has systematically mapped regions of transcription, transcription factor
association, chromatin structure and histone modification.

The vast majority (> 80%) of the human genome participates in at least one
biochemical RNA- and/or chromatin-associated event in at least one cell type.
In 2007, the pilot phase of the ENCODE project searched for functional elements in 1% of the genome in a few human
cell lines. The consortium catalogued two types of elements:
1. DNA regions that are transcribed into RNA (both protein-coding and non-protein-coding)
2. DNA regions that regulate gene transcription, known as cis-regulatory elements (CREs). These regions can be
identified by their accessibility to DNase I, by DNA-binding proteins such as transcription factors, or by
modifications on histone proteins.

In 2012, the second phase of the ENCODE project extended the search to the whole genome in more human cell lines.
Similar efforts were extended to the mouse genome in 2014.

In the third phase of the project (2017) the consortium moved from cell lines to cells taken directly from human and
mouse tissues.
Initial set of Materials (2012)
147 different cell types
1,640 datasets (different tissues, conditions and assays)

Tier 1. Tier 1 (first priority) cell types comprised three widely studied cell lines: K562
erythroleukaemia cells; GM12878, a B-lymphoblastoid cell line that is also part of the 1000
Genomes project; and the H1 embryonic stem cell (H1 hESC) line.

Tier 2. The second-priority set included HeLa-S3 cervical carcinoma cells, HepG2
hepatoblastoma cells and primary (non-transformed) human umbilical vein endothelial cells
(HUVECs).

Tier 3. Many different cell types not in tier 1 or tier 2.


2012
https://www.nature.com/collections/dggcchgghg
Accumulations of assays over the three phases of ENCODE.
2020
ENCODE 4

To study the biological function of candidate regulatory elements already compiled by
ENCODE, a new component, functional element characterization, has been added in
ENCODE 4.

ENCODE 4 includes the following components:

• Functional Element Mapping Centers: conduct high-throughput experiments that map biochemical activities to
identify candidate functional elements in human and mouse.
• Functional Element Characterization Centers: develop and apply generalizable approaches to characterize the role of
candidate functional elements in specific biological contexts.
Methods
• RNA-seq. Isolation of RNA sequences, often with different purification techniques to isolate different fractions of RNA followed by high
throughput sequencing.
• CAGE. Capture of the methylated cap at the 5’ end of RNA, followed by high-throughput sequencing of a small tag adjacent to the
5’methylated caps. 5’ methylated caps are formed at the initiation of transcription, although other mechanisms also methylate 5’ ends of
RNA.
• RNA-PET. Simultaneous capture of RNAs with both a 5’ methyl cap and a poly(A) tail, which is indicative of a full-length RNA. This is then
followed by sequencing a short tag from each end by high-throughput sequencing.
• ChIP-seq. Chromatin immunoprecipitation followed by sequencing. Specific regions of crosslinked chromatin, which is genomic DNA in
complex with its bound proteins, are selected by using an antibody to a specific epitope. The enriched sample is then subjected to high
throughput sequencing to determine the regions in the genome most often bound by the protein to which the antibody was directed. Most
often used are antibodies to any chromatin-associated epitope, including transcription factors, chromatin binding proteins and specific
chemical modifications on histone proteins.
• DNase-seq. The DNase I enzyme will preferentially cut live chromatin preparations at sites where nearby there are specific (non histone)
proteins. The resulting cut points are then sequenced using high-throughput sequencing to determine those sites ‘hypersensitive’ to DNase
I, corresponding to open chromatin.
• FAIRE-seq. Formaldehyde assisted isolation of regulatory elements. FAIRE isolates nucleosome-depleted genomic regions by exploiting the
difference in crosslinking efficiency between nucleosomes (high) and sequence-specific regulatory factors (low). FAIRE consists of
crosslinking, phenol extraction, and sequencing the DNA fragments in the aqueous phase.
• RRBS. Reduced representation bisulphite sequencing. Bisulphite treatment of DNA sequence converts unmethylated cytosines to uracil. To
focus the assay and save costs, specific restriction enzymes that cut around CpG dinucleotides can reduce the genome to a portion
specifically enriched in CpGs. This enriched sample is then sequenced to determine the methylation status of individual cytosines
quantitatively.
• 3C, 4C, 5C. Chromosome conformation capture (3C)-based technologies probe physical interactions between distinct chromosome regions.
These regions can be separated by hundreds of kilobases, and their interactions are thought to be important in the regulation of gene expression.
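The RRBS entry above rests on one rule: bisulphite treatment converts unmethylated cytosines to uracil (read as T after amplification), while methylated cytosines remain C. The following is a minimal sketch of how that rule yields a quantitative methylation level per cytosine. It assumes gap-free top-strand reads already aligned to the reference; the function names and toy sequences are illustrative and not taken from any ENCODE pipeline.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulphite treatment of one DNA molecule (top strand only).

    Unmethylated C is read as T after conversion and amplification;
    methylated C is protected and still reads as C.
    """
    return "".join(
        base if base != "C" or i in methylated_positions else "T"
        for i, base in enumerate(seq)
    )


def methylation_levels(reference, reads):
    """For each reference C, return the fraction of reads that still show C."""
    levels = {}
    for i, base in enumerate(reference):
        if base != "C":
            continue
        calls = [read[i] for read in reads if read[i] in "CT"]
        if calls:
            levels[i] = calls.count("C") / len(calls)
    return levels


reference = "ACGTCGA"  # cytosines at positions 1 and 4 (0-based)
# Four molecules with different methylation states (toy data).
reads = [
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, {1, 4}),
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, set()),
]
print(methylation_levels(reference, reads))
```

Real pipelines also handle the bottom strand (where the conversion appears as G to A), incomplete conversion, and sequencing errors, all of which are left out here.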
ENCODE Ground Level Annotations

1) Gene expression (RNA-seq)
Expression levels of genes and transcripts annotated by GENCODE, which can be visualized on SCREEN
(https://screen.encodeproject.org/).

2) Histone mark enrichment (ChIP-seq)
Peaks (enriched genomic regions) of a variety of histone marks computed from ChIP-seq experiments.
[Figure: H3K27ac from mouse e11.5 hindbrain]

3) DNA methylation (RRBS, WGBS)
Genome-wide methylation state of cytosines in CpG, CHH, and CHG contexts.
[Figure: RRBS analysis in GM12878]


ENCODE Ground Level Annotations

4) Open chromatin (DNase-seq, FAIRE-seq, ATAC-seq)
DNase I hypersensitive sites (DHSs) computed from DNase-seq and FAIRE-seq experiments, and ATAC-seq peaks
(enriched genomic regions).
[Figure: CTCF DHS Profile]

5) Transcription factor binding (TF ChIP-seq)
Peaks (enriched genomic regions) of TFs computed from ChIP-seq experiments. Visualize sequence motifs and
other information on Factorbook.
[Figure: CTCF Motif from Factorbook]

6) RNA binding protein occupancy (eCLIP-seq)
Peaks (enriched genomic regions) computed from eCLIP-seq data in human cell lines K562 and HepG2 for
RNA Binding Proteins (RBPs).
ENCODE Ground Level Annotations

7) Three dimensional chromatin interactions (ChIA-PET)
3D interactions between genomic loci such as promoters and distal enhancers computed from ChIA-PET experiments.

8) Topologically Associating Domains - TADs (Hi-C)
TADs and compartments computed from Hi-C experiments.
[Figure: K562 Interaction Matrix]


The ENCODE datasets
The functional genomic elements are:
1. the sequences and quantities of RNA transcripts, from both non-coding and protein-coding regions.
2. the degree of DNA methylation and chemical modifications to histones that can influence the
rate of transcription of DNA into RNA molecules.
3. the long-range chromatin interactions, such as looping, that alter the relative proximities of different
chromosomal regions in three dimensions and also affect transcription.
4. the binding activity and location of transcription factors.
5. the sensitivity to DNA cleavage. These sites are thought to indicate specific sequences at which the
binding of transcription factors and transcription-machinery proteins has caused nucleosome displacement.
6. the protein-RNA interactions of RNA binding proteins (RBPs) and the RNA elements they
bind to across the transcriptome. These RNA elements, when expressed, form the basis of co- and post-
transcriptional regulation of human genes.
Gene annotation and expression
Methods: RNA-Seq
• Poly(A): 3' end polyadenylated tail is targeted in order to ensure that coding RNA is
separated from noncoding RNA.
• CAGE: Capture of the methylated cap at the 5’ end of RNA, followed by high-throughput
sequencing of a small tag adjacent to the 5’ methylated caps.
• RNA-PET: simultaneous poly(A) and CAGE captures

Poly-A capture CAGE


Methods: CAGE
Cap Analysis of Gene Expression (CAGE) is a high-throughput method for transcriptome analysis that
utilizes cap trapping, a technique based on the biotinylation of the 7-methylguanosine cap of Pol II
transcripts, to pull down the 5′-complete cDNAs reverse-transcribed from the captured transcripts.

A linker sequence is ligated to the 5′ end of the cDNA and a specific restriction enzyme is used to cleave off a
short fragment from the 5′ end. The resulting fragments are then amplified and sequenced using massively parallel
high-throughput sequencing technology, which results in a large number of short sequenced tags that can be
mapped back to the reference genome to infer the exact position of the transcription start sites (TSSs) used for
transcription of the captured RNAs.
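The last step, inferring TSS positions from mapped tags, can be sketched as counting the 5′ end of each tag. This is a minimal illustration under stated assumptions: tags are given as (chromosome, start, end, strand) tuples in 0-based, half-open coordinates, and the function name and toy data are hypothetical. It is not the procedure of any particular CAGE analysis tool.

```python
from collections import Counter


def tss_counts(tags):
    """Count CAGE tags by the genomic position of their 5' end.

    Each tag is (chrom, start, end, strand) in 0-based, half-open
    coordinates (an assumption of this sketch). On the plus strand the
    5' end is the first base of the tag; on the minus strand it is the last.
    """
    counts = Counter()
    for chrom, start, end, strand in tags:
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, five_prime, strand)] += 1
    return counts


# Toy mapped tags: three on the plus strand, one on the minus strand.
tags = [
    ("chr1", 100, 127, "+"),
    ("chr1", 100, 125, "+"),
    ("chr1", 102, 129, "+"),
    ("chr1", 50, 77, "-"),
]
counts = tss_counts(tags)
# The most frequently used 5' end is the best-supported candidate TSS.
print(counts.most_common(1))
```

In practice nearby 5′ ends are usually grouped into clusters rather than treated as independent single-base TSSs; that clustering step is left out here.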
Transcribed and protein-coding regions: (GENCODE)
The GENCODE reference gene set is a combination of manual gene annotation from the
Human and Vertebrate Analysis and Annotation (HAVANA) group and automatic gene
annotation from Ensembl. It is updated with every Ensembl release (approximately every 3 months).

The group’s approach to manual gene annotation is to annotate transcripts aligned to the
genome and take the genomic sequences as the reference rather than the cDNAs.

GENCODE: the reference human genome annotation for the ENCODE project
(Harrow, J. et al.) Genome Res. (6 September 2012)
https://www.ensembl.org/index.html
The Human and Vertebrate Analysis and Annotation
(HAVANA) Project

The value of a genome is only as good as its annotation

To create a gold standard reference annotation the HAVANA team uses tools developed in-
house to manually annotate human, mouse and zebrafish genomes.

Manual annotation is especially important in areas that are not well catered for by automated
annotation systems, such as splice variation, pseudogenes, conserved gene families,
duplications and non-coding genes.
GENCODE annotation pipeline
Within the ENCODE Consortium, GENCODE aims to accurately annotate all protein-coding
genes, pseudogenes, and noncoding transcribed loci in the human genome
through manual curation and computational methods. Annotated transcript
structures are assessed, and less well-supported loci are systematically,
experimentally validated. Predicted exon-exon junctions are evaluated by RT-PCR
amplification followed by highly multiplexed sequencing readout, a method called
RT-PCR-seq. Seventy-nine percent of all assessed junctions were confirmed by this
evaluation procedure, demonstrating the high quality of the GENCODE gene set.
This RT-PCR-seq targeted approach also has the advantage of identifying novel exons
of known genes, as unannotated exons were discovered in ~11% of assessed
introns. We thus estimate that at least 18% of known loci have yet-unannotated
exons.

http://europepmc.org/article/MED/22955982
GENCODE annotation levels

Level 1 - validated
Pseudogene loci that were jointly predicted by the Yale Pseudopipe and
UCSC Retrofinder pipelines as well as by Havana manual annotation;
other transcripts that were verified experimentally by RT-PCR and
sequencing through the GENCODE experimental pipeline.

Level 2 - manual annotation
Havana manual annotation (and Ensembl annotation where it is identical
to Havana).

Level 3 - automated annotation
Ensembl loci where they are different from the Havana annotation or
where no Havana annotation can be found.
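GENCODE GTF files record these levels in the attribute column as a `level` key. The following is a minimal sketch of tallying gene records by level. It assumes tab-separated GTF lines whose ninth column contains `key "value";` pairs plus an unquoted `level N;` entry. The sample records are made up for illustration and are not real GENCODE entries.

```python
import re
from collections import Counter

LEVEL_NAMES = {1: "validated", 2: "manual annotation", 3: "automated annotation"}


def parse_attributes(field):
    """Parse the ninth GTF column into a dict (first value kept for repeated keys)."""
    attrs = {}
    for match in re.finditer(r'(\S+) ("[^"]*"|\S+);', field):
        attrs.setdefault(match.group(1), match.group(2).strip('"'))
    return attrs


def count_gene_levels(gtf_lines):
    """Count 'gene' records per annotation level."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "gene":
            continue
        level = int(parse_attributes(columns[8])["level"])
        counts[LEVEL_NAMES[level]] += 1
    return counts


# Made-up records for illustration only.
SAMPLE = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_type "protein_coding"; level 2;',
    'chr1\tENSEMBL\tgene\t2000\t2600\t.\t-\t.\tgene_id "GENE_B"; gene_type "lncRNA"; level 3;',
    'chr2\tHAVANA\tgene\t50\t400\t.\t+\t.\tgene_id "GENE_C"; gene_type "processed_pseudogene"; level 1;',
    'chr2\tHAVANA\tgene\t700\t990\t.\t+\t.\tgene_id "GENE_D"; gene_type "protein_coding"; level 2;',
]
level_counts = count_gene_levels(SAMPLE)
print(level_counts)
```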
GENCODE initial annotation (2012)
GENCODE annotates all protein-coding loci with alternatively
transcribed variants, non-coding loci with transcript evidence, and
pseudogenes.
The main data set combines the results of HAVANA manual annotation with the
Ensembl automatic annotation pipelines and uses experimental evidence to
validate the annotation.
• 20,687 protein-coding genes spanning 40% of the genome (promoter to
poly(A))
• In total, GENCODE-annotated exons of protein coding genes cover 1.22% of
the genome.
• > 6 alternatively spliced transcripts per gene
• > 8,800 small RNAs and > 9,600 long non-coding RNA (lncRNA) loci
• > 11,000 pseudogenes, 863 of which were transcribed and associated with
active chromatin
Phase 2 GENCODE Goals

The aims of GENCODE Phase 2, which ran from 2013 to 2017, were:

• To continue to improve the coverage and accuracy of the GENCODE human gene set

• To create a mouse GENCODE gene set that includes protein-coding regions with associated
alternative splice variants, non-coding loci which have transcript evidence, and
pseudogenes.

The mouse annotation data allow comparative studies between human and mouse and likely
improve annotation quality in both genomes.
Improvement of gene annotations by GENCODE

Antigen receptor (immunoglobulin (IG) and T-cell receptor (TR))
10/2020
11/2021
Expression levels of genes and transcripts annotated
by GENCODE, which can be visualized on SCREEN:
Search Candidate cis-Regulatory Elements by
ENCODE (https://screen.encodeproject.org/)
What is the difference between GENCODE and Ensembl annotation?

The GENCODE annotation is made by merging the manual gene annotation
produced by the Ensembl-Havana team and the Ensembl-genebuild
automated gene annotation.

The GENCODE annotation is the default gene annotation displayed in the
Ensembl browser. In practical terms, the GENCODE annotation is
essentially identical to the Ensembl annotation.
The canonical transcript

In the curated canonical transcript sets (UCSC, RefSeq, Ensembl), there are some
discrepancies where there are multiple transcripts for a given gene. The canonical
transcript is defined as either the longest CDS, if the gene has translated transcripts,
or the longest cDNA.

The canonical transcript for a gene is set according to the following hierarchy:
1. Longest Consensus CDS (see CCDS project) translation with no stop codons.
2. If no (1), choose the longest Ensembl/Havana merged translation with no
stop codons.
3. If no (2), choose the longest translation with no stop codons.
4. If no translation, choose the longest non-protein-coding transcript.
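The hierarchy above can be sketched as a tiered selection. This is an illustration only, not Ensembl's implementation: the `Transcript` fields and the `canonical` function are hypothetical, and a transcript with `cds_length > 0` is assumed to have a translation with no stop codons.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    length: int              # transcript (cDNA) length in bases
    cds_length: int = 0      # 0 means the transcript has no translation
    is_ccds: bool = False    # member of the Consensus CDS set
    is_merged: bool = False  # Ensembl/Havana merged model


def canonical(transcripts):
    """Pick the canonical transcript following the four-step hierarchy."""
    translated = [t for t in transcripts if t.cds_length > 0]
    tiers = [
        [t for t in translated if t.is_ccds],    # 1. longest CCDS translation
        [t for t in translated if t.is_merged],  # 2. longest merged translation
        translated,                              # 3. longest translation
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: t.cds_length)
    # 4. no translation at all: longest non-protein-coding transcript
    return max(transcripts, key=lambda t: t.length)


coding_gene = [
    Transcript("tx1", length=3000, cds_length=1500),
    Transcript("tx2", length=2500, cds_length=1200, is_ccds=True),
    Transcript("tx3", length=2800, cds_length=1400, is_merged=True),
]
noncoding_gene = [Transcript("nc1", length=800), Transcript("nc2", length=1200)]

print(canonical(coding_gene).name)     # CCDS membership outranks CDS length
print(canonical(noncoding_gene).name)  # longest non-coding transcript
```

Note that the CCDS transcript wins even though another transcript has a longer CDS, because each tier is consulted only when the previous one is empty.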
Gene Annotations

Annotating the locations of genes and other genetic control elements of the genome

Annotation of genes is provided by multiple public resources, using different methods, and
resulting in information that is similar but not always identical. Here is a listing of the
most relevant databases for genome annotation:

• RefSeq genes http://www.ncbi.nlm.nih.gov/refseq/
• Ensembl genes http://www.ensembl.org
• Encode/Gencode http://encodeproject.org/ENCODE/
• UCSC genes https://genome.ucsc.edu/
• CCDS https://www.ncbi.nlm.nih.gov/projects/CCDS
• Matched Annotation from NCBI and EMBL-EBI (MANE) https://www.ncbi.nlm.nih.gov/refseq/MANE/
RefSeq

NCBI’s Reference Sequence (RefSeq) provides a comprehensive, integrated, non-redundant,
well-annotated set of sequences, including genomic DNA, transcripts, and
proteins.
The RefSeq collection aims to provide, for each included species, a complete set of non-
redundant, extensively cross-linked, and richly annotated nucleic acid and protein
records. The non-redundant nature of the RefSeq collection facilitates database inquiries
based on genomic location, sequence, or text annotation. The RefSeq collection does
include alternatively spliced transcripts encoding the same protein or distinct protein
isoforms, in addition to orthologs, paralogs, and alternative haplotypes for some
organisms, which will affect the outcome of a database query.
The RefSeq collection differs from the archival databases in the same way that a review
article differs from a related collection of primary research articles on the same subject.
Each RefSeq record represents a synthesis, by a person or group, of the primary
information that was generated and submitted by others.
Ensembl

The Ensembl project was started in 1999, before the draft human genome was
completed. The goal of Ensembl was to automatically annotate the genome, integrate
this annotation with other available biological data, such as GENCODE, and make all this
publicly available via the web.

Ensembl has gained an ardent following as a data source because it includes data
patches and releases scheduled updates to its genome annotations.

Ensembl is based at the European Molecular Biology Laboratory's European
Bioinformatics Institute (EMBL-EBI), which is located on the Wellcome Genome Campus
in Hinxton, south of the city of Cambridge, United Kingdom.
https://www.ensembl.org/index.html
GENCODE
The GENCODE reference gene set is a combination of manual gene annotation from the
Human and Vertebrate Analysis and Annotation (HAVANA) group and automatic gene
annotation from Ensembl. It is updated with every Ensembl release (approximately every 3 months).

The group’s approach to manual gene annotation is to annotate transcripts aligned to the
genome and take the genomic sequences as the reference rather than the cDNAs.

GENCODE: the reference human genome annotation for the ENCODE project
(Harrow, J. et al.) Genome Res. (6 September 2012)
UCSC Genome (Browser)

It is an interactive website offering access to genome sequence data from a variety of
vertebrate and invertebrate species and major model organisms, integrated with a
large collection of aligned annotations. Beginning with the GRCh38 assembly, the
UCSC Genome Browser version numbers for the human assemblies match those
of the GRC to minimize version confusion.
Hence, the GRCh38 assembly is referred to as "hg38" in the Genome Browser
datasets and documentation. The previous release (hg19) was based on GRCh37.

The UCSC Genome Browser is an on-line genome browser hosted by the University of
California, Santa Cruz (UCSC).
The Consensus CDS (CCDS)

The human and mouse genome sequence is now sufficiently stable to start
identifying those gene placements that are identical, and to make those data public
and supported as a core set by the three major public genome browsers. The
Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human
and mouse protein coding regions that are consistently annotated and of high quality.
The long term goal is to support convergence towards a standard set of gene
annotations.
The NCBI, Ensembl, and Havana annotation of the GRCh38 reference genome is
analyzed to identify additional coding sequences (CDS) that are consistently
annotated. CCDS data are available from the CCDS website and FTP site.
MANE Project
Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration between the National Center
for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory-European
Bioinformatics Institute (EMBL-EBI). The goal of this project is to provide a minimal set of matching
RefSeq and Ensembl-GENCODE transcripts of human protein-coding genes, where the transcripts
from a matched pair are identical (5’ UTR, coding region and 3’ UTR), but retain their respective
identifiers.

The MANE transcript set is classified into three groups:

1. MANE Select: One high-quality representative transcript per protein-coding gene that is well-
supported by experimental data and represents the biology of the gene.
2. MANE Plus: A set of well-supported transcripts that have additional characteristics, such as
significant novel exons, which are not included in the MANE Select transcript.
3. MANE: All other matched transcripts that are not included in the Select and Plus sets.

The MANE project is only being completed for human genes on GRCh38.

https://www.ncbi.nlm.nih.gov/refseq/MANE/
https://www.youtube.com/watch?v=Pm0H32gcKeE
https://www.youtube.com/watch?v=SbQ8mB1v85c
Histone modifications
Regions of histone modifications
Histone proteins undergo post-translational modification in different ways, which impacts their
interactions with DNA.
Some modifications disrupt histone-DNA interactions, causing nucleosomes to unwind. In
this open chromatin conformation (euchromatin) DNA is accessible to binding of
transcriptional machinery and subsequent gene activation.
Other modifications strengthen histone-DNA interactions creating a tightly packed chromatin
structure (heterochromatin) where transcriptional machinery cannot access DNA, resulting in
gene silencing. In this way, modification of histones by chromatin remodeling complexes
changes chromatin architecture and gene activation.
The global patterns of histone modification (acetylation, methylation, phosphorylation,
citrullination) are highly variable across cell types, in accordance with changes in
transcriptional activity.
Integration of the different histone modification information can be used systematically to
assign functional attributes to genomic regions.
Chromatin modification based on histone covalent modifications

Histone modifications include acetylation, phosphorylation, methylation, ADP ribosylation, and ubiquitination. Multiple
residues on each of the four core histones have been identified as potential modification sites and some lysine side chains can
be either methylated or acetylated. There is strong evidence that histone acetylation promotes the disruption of nucleosomes
at promoters in advance of initiation, whereas histone hypermethylation is often related to transcriptional repression.
Methods: ChIP-seq
Chromatin immunoprecipitation followed by
sequencing.

Specific regions of crosslinked chromatin (DNA in complex with its
bound proteins) are selected by using an antibody to a specific
epitope.

This makes it possible to identify the regions of the genome most
often bound by that protein (e.g., transcription factors,
chromatin-binding proteins, specific chemical modifications on
histone proteins).
Methods: Histone modification ChIP-seq

Chromatin immunoprecipitation (ChIP) uses antibodies to isolate
a protein or modification of interest, along with the DNA to which
it is bound. The DNA is then sequenced and mapped to the genome
to identify the protein or modification’s location and abundance.
• If the function of a histone modification is known, ChIP can
identify specific genes and regions with this histone
modification signature and the corresponding function across
the genome. Using ChIP against H3K4me1, for example, will
reveal the locations and sequences of active enhancers
throughout the genome, pointing to genes and genetic programs
of interest.
• If the function of the histone modification is not known, ChIP can
identify sequences, genes, and locations with this signature,
which can then be used to infer the function of the
modification.
The common nomenclature of histone modifications is:
• The name of the histone (e.g., H3)
• The single-letter amino acid abbreviation (e.g., K for Lysine) and the amino acid position in
the protein
• The type of modification (Me: methyl, P: phosphate, Ac: acetyl, Ub: ubiquitin)
• The number of modifications (only Me is known to occur in more than one copy per residue.
The lysine (K) ε-amino group of protein substrates can indeed accept up to three methyl
groups, resulting in either mono-, di-, or trimethyl lysine)
So H3K4me1 denotes the monomethylation of the 4th residue (a lysine) from the start (i.e., the
N-terminal) of the H3 protein.
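The naming rules above can be sketched as a small parser. This is only an illustration: the `HistoneMark` type, the `parse_mark` function and the regular expression are hypothetical names introduced here, and the pattern assumes names written without separators (H3K4me1, H3K27Ac, H3S10P).

```python
import re
from typing import NamedTuple

class HistoneMark(NamedTuple):
    histone: str       # e.g. "H3"
    residue: str       # single-letter amino acid, e.g. "K"
    position: int      # residue position counted from the N-terminus
    modification: str  # "me", "ac", "p" or "ub" (lower-cased)
    copies: int        # number of groups; only methylation exceeds 1

# Assumed pattern: histone, residue letter, position, modification, optional copy number.
_MARK = re.compile(r"^(H2A|H2B|H3|H4)([A-Z])(\d+)(me|ac|p|ub)([1-3]?)$", re.IGNORECASE)

def parse_mark(name: str) -> HistoneMark:
    match = _MARK.match(name)
    if match is None:
        raise ValueError(f"unrecognised histone mark: {name!r}")
    histone, residue, position, modification, copies = match.groups()
    modification = modification.lower()
    if copies and modification != "me":
        raise ValueError("only methylation takes a copy number")
    return HistoneMark(histone.upper(), residue.upper(), int(position),
                       modification, int(copies) if copies else 1)

mark = parse_mark("H3K4me1")  # monomethylation of lysine 4 of histone H3
```

Names that carry a prefix, such as "Gamma H2A.X", fall outside this simple pattern and would need their own rule.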
Histone modification   Function          Location
H3K4me1                Activation        Enhancers
H3K4me3                Activation        Promoters
H3K36me3               Activation        Gene bodies
H3K79me2               Activation        Gene bodies
H3K9Ac                 Activation        Enhancers, promoters
H3K27Ac                Activation        Enhancers, promoters
H4K16Ac                Activation        Repetitive sequences
H3K27me3               Repression        Promoters, gene-rich regions
H3K9me3                Repression        Satellite repeats, telomeres, pericentromeres
Gamma H2A.X            DNA damage        DNA double-strand breaks
H3S10P                 DNA replication   Mitotic chromosomes
Candidate cis-regulatory elements (cCREs) are classified as: promoter-like signatures (PLS), proximal enhancer-like signatures (pELS), distal
enhancer-like signatures (dELS), elements with high DNase, high H3K4me3 and low H3K27ac signals (DNase-H3K4me3), and elements bound by CTCF.
DNA methylation
Methylome Sequencing
Genome-wide measurement of DNA methylation is possible using:
• whole genome bisulfite sequencing (WGBS, MethylC-seq or BS-seq);
• reduced-representation bisulfite sequencing (RRBS);
• enrichment-based methods such as MeDIP-seq, MBD-seq, and MRE-seq;
• single-CpG resolution DNA methylome analysis such as methylCRF.
Analysis of the data generated from the sequence-based 5-methylcytosine
assays largely consists of sequence alignment and segmentation.
Methods: Bisulfite sequencing

Treatment of DNA with bisulfite converts cytosine residues to uracil, but leaves
5-methylcytosine residues unaffected. Bisulfite sequencing is the use of bisulfite treatment of
DNA before routine sequencing to determine the pattern of methylation. After treatment,
only methylated cytosines are still read as cytosine, while unmethylated cytosines are read
as thymine. The analysis is therefore reduced to distinguishing cytosine from thymine at
each reference cytosine position: a C-to-T change introduced by bisulfite conversion marks
an unmethylated cytosine.
2 WGS: 1 normal + 1 subjected
to Bisulfite treatment
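The logic of the comparison between the untreated and the bisulfite-treated sequence can be sketched as follows. This is a toy illustration on a single strand with no sequencing errors; the function names and the example sequence are hypothetical.

```python
def bisulfite_convert(seq: str, methylated: set[int]) -> str:
    """Convert every unmethylated C to T (uracil is read as T after amplification)."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference: str, converted: str) -> dict[int, bool]:
    """For each reference C, report True if it survived conversion (methylated)."""
    return {
        i: converted[i] == "C"
        for i, base in enumerate(reference)
        if base == "C"
    }

reference = "ACGTCCGATCG"                       # untreated sequence
treated = bisulfite_convert(reference, {1, 5})  # Cs at positions 1 and 5 are methylated
calls = call_methylation(reference, treated)
```

A genuine C-to-T polymorphism would look identical to an unmethylated cytosine in the treated sample alone, which is one reason to sequence the untreated sample as well.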
Methods: Reduced Representation Bisulphite Sequencing
In Reduced Representation Bisulphite Sequencing (RRBS), in order to save costs the sample is
first enriched in CpGs by digestion with the methylation-insensitive restriction enzyme MspI,
which targets 5’-CCGG-3’ sites. The resulting fragments are then treated with bisulphite to
convert unmethylated cytosines to uracil, and sequenced.
[Figure: two MspI restriction sites marked on the DNA]
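The MspI digestion step can be sketched in silico. MspI cleaves after the first C of CCGG; the size window used below (40 to 220 bp) is an assumed, commonly quoted RRBS selection range and not a value taken from these slides. The function name is hypothetical.

```python
def mspi_fragments(seq: str, min_len: int = 40, max_len: int = 220) -> list[str]:
    """Cut at every C^CGG site and keep fragments inside the size window."""
    cuts, start = [0], 0
    while (site := seq.find("CCGG", start)) != -1:
        cuts.append(site + 1)  # MspI cleaves between the first C and CGG
        start = site + 1
    cuts.append(len(seq))
    # The first and last fragments are bounded by the sequence ends, not by two sites.
    fragments = [seq[a:b] for a, b in zip(cuts, cuts[1:])]
    return [f for f in fragments if min_len <= len(f) <= max_len]

# Toy sequence with a narrow window so that the short fragments are visible.
selected = mspi_fragments("AAACCGGTTTTCCGGAA", min_len=5, max_len=10)
```

Because every internal fragment starts with CGG and ends with C, each selected fragment carries CpG sites at its ends, which is the source of the enrichment.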
DNA Methylation

• Promoter methylation is typically associated with repression, whereas genic methylation
correlates with transcriptional activity.

• 96% of CpGs exhibit differential methylation in at least one cell type or tissue, and levels of
DNA methylation correlate with chromatin accessibility.

• Reproducible cytosine methylation was detected outside CpG dinucleotides in adult tissues,
providing further support that this non-canonical methylation event may have important
roles in human biology.
Open chromatin
Methods: DNase-seq
DNase-seq is primarily used to identify nucleosome-depleted DNase I hypersensitive (DHS) sites
that correspond to active regulatory elements.
Nucleosomes block DNase I from nicking. DNase I therefore cuts native chromatin
preparations preferentially at nucleosome-depleted sites, close to where specific (non-histone)
proteins are bound. The resulting cut points are then sequenced to determine the sites
corresponding to open chromatin.

Active genes are more likely to have an altered nucleosome state, which makes
DNase I digestion a useful reference measure for mapping genomic
regulatory elements.
Methods: FAIRE-seq
Formaldehyde Assisted Isolation of Regulatory Elements (FAIRE) isolates nucleosome-depleted
genomic regions.
Cells are subjected to cross-linking, ensuring that the interactions
between the nucleosomes and DNA are fixed. After sonication, the
fragmented and fixed DNA is separated using a phenol-chloroform
extraction. This method creates two phases, an organic and an
aqueous phase. Due to their biochemical properties, the DNA
fragments cross-linked to nucleosomes will preferentially sit in the
organic phase. Nucleosome depleted or ‘open’ regions on the other
hand will be found in the aqueous phase. By specifically extracting the
aqueous phase, only nucleosome-depleted regions will be purified
and enriched.
In contrast to DNase-Seq, the FAIRE-Seq protocol doesn't require the
permeabilization of cells or isolation of nuclei, and can analyse any
cell type.
Methods: FAIRE-seq

NOTE:

Maps of open chromatin can be constructed using both DNase-seq and FAIRE-seq.
Differences between DNase-seq and FAIRE-seq may be due to the specific regulatory complexes
bound at each site, which could affect the ability of DNase I to cut or of formaldehyde to
crosslink, but in general:
→ DNase-only sites tend to occur at transcription start sites
→ FAIRE-only sites are more often found in distal regions
Methods: ATAC-seq

Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq) probes DNA accessibility with Tn5
transposase, which inserts sequencing adapters into accessible regions of chromatin. Sequencing
reads can then be used to infer regions of increased accessibility, as well as to map regions of
transcription factor binding and nucleosome position.
The method is a fast and sensitive alternative to DNase-Seq for assaying chromatin accessibility
genome-wide
Methods: ATAC-seq

Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq)
builds on a process called tagmentation: the simultaneous
fragmentation and tagging of a genome with sequencing adaptors.
The key component of this process is a mutant hyperactive Tn5
transposase, preloaded with DNA adapters, that cuts accessible
genomic DNA and tags the resulting fragment ends with the adapters
in a single step.
Tagmentation was initially designed for next-generation sequencing
(NGS) library preparation, but has now been successfully adapted to
efficiently identify open chromatin, and ATAC-seq is becoming part
of the standard epigenetic analysis.
Open chromatin regions

Chromatin accessibility is the hallmark of regulatory DNA regions. It is the degree to which
nuclear macromolecules are able to physically contact chromatinized DNA and is determined
by the occupancy and topological organization of nucleosomes as well as other chromatin-
binding factors that occlude access to DNA.
This landscape of open chromatin regions (OCRs) broadly reflects regulatory capacity - rather
than a static biophysical state - and is a critical determinant of chromatin organization and
function.
• More than 200,000 OCRs were identified per cell type. On average, 98.5% of the occupancy
sites of transcription factors mapped by ENCODE ChIP-seq lie within OCRs.
• The majority of OCRs lie distal to TSSs.
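A statistic such as "the fraction of transcription factor occupancy sites that lie within OCRs" comes down to an interval-overlap count. The sketch below assumes half-open intervals on a single chromosome and counts any overlap as "within"; the function names and coordinates are hypothetical, and the quadratic scan is only suitable for toy inputs.

```python
Interval = tuple[int, int]  # half-open [start, end) on one chromosome

def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def fraction_in_ocrs(peaks: list[Interval], ocrs: list[Interval]) -> float:
    """Fraction of ChIP-seq peaks that overlap at least one open chromatin region."""
    if not peaks:
        return 0.0
    hits = sum(any(overlaps(peak, ocr) for ocr in ocrs) for peak in peaks)
    return hits / len(peaks)

ocrs = [(100, 200), (500, 650)]
peaks = [(120, 140), (190, 210), (300, 320), (640, 700)]
share = fraction_in_ocrs(peaks, ocrs)  # three of the four toy peaks overlap an OCR
```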
The Majority of Primate-Specific Regulatory Sequences Are
Derived from Transposable Elements
Nearly half of the human genome is composed of repetitive sequences, most of which were derived
from transposable elements. There is growing evidence showing that some of these transposon-derived
sequences have been a source of new binding sites for various mammalian transcription factors.
44% of open chromatin regions are in TEs. Distinct subfamilies of endogenous retroviruses (ERVs)
contributed significantly more accessible regions than expected by chance, with up to 80% of their
instances in open chromatin.
TEs contributing to open chromatin had higher levels of sequence conservation, and thousands of
ERV–derived sequences are activated in a cell type–specific manner, especially in embryonic and
cancer cells, and this activity is associated with cell type–specific expression of neighboring genes.
These results demonstrate that TEs, and in particular ERVs, have contributed hundreds of thousands
of novel regulatory elements to the primate lineage and reshaped the human transcriptional
landscape.
Binding activity and location of transcription-factors
Transcription factors binding regions
A key characteristic of each transcription factor (TF) protein is its DNA binding domain,
which recognizes a specific DNA motif. These motifs tend to be short and degenerate, so
even when the DNA binding motif is known, one cannot generally predict where a given
transcription factor may bind.
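Why a short, degenerate motif predicts binding poorly can be seen by scanning a sequence for it. The sketch below uses the E-box consensus CANNTG as an example motif (an assumption made for illustration, it is not discussed in these slides) and expands IUPAC codes into a regular expression; only the forward strand is scanned.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_hits(motif: str, seq: str) -> list[int]:
    """Start positions of all (possibly overlapping) motif matches in seq."""
    body = "".join(IUPAC[code] for code in motif)
    pattern = re.compile(f"(?=({body}))")  # lookahead keeps overlapping matches
    return [m.start() for m in pattern.finditer(seq)]

hits = motif_hits("CANNTG", "TTCAGCTGAACACGTGCC")
```

With four fixed positions, CANNTG matches a random position with probability 1/256 under uniform base composition, so a genome of a few billion bases contains on the order of ten million matches, far more than the number of sites actually bound.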
ENCODE discovers many new transcription-factor-binding-site motifs and explores their
properties. ENCODE TF tracks contain transcription factor binding sites determined by ChIP-
seq.
TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions,
flanked by well-positioned nucleosomes, and many of these features show cell type
specificity.
Transcription factors binding regions (2012)

To identify regulatory regions directly, the binding locations of 119 different transcription
factors were mapped in 72 cell types using ChIP-seq

Overall, > 600,000 binding regions covering 231 Mb (8.1%) of the genome are enriched for
regions bound by DNA-binding proteins across all cell types.

All the information associated with each transcription factor - including the ChIP-seq peaks,
discovered motifs and associated histone modification patterns - is organised in FactorBook
(http://www.factorbook.org).
CTCF

CTCF (or CCCTC-binding factor) encodes a transcriptional regulator protein with
11 highly conserved zinc finger (ZF) domains (ZF domains define DNA
recognition).
This nuclear protein is able to use different combinations of the ZF domains to
bind different DNA target sequences and proteins. Depending upon the context
of the site, the protein can bind a histone acetyltransferase (HAT)-containing
complex and function as a transcriptional activator or bind a histone
deacetylase (HDAC)-containing complex and function as a transcriptional
repressor. If the protein is bound to a transcriptional insulator element, it can
block communication between enhancers and upstream promoters, thereby
regulating imprinted expression. Mutations in this gene have been associated
with invasive breast cancers, prostate cancers, and Wilms' tumors. Alternatively
spliced transcript variants encoding different isoforms have been found for this
gene.
RNA elements recognized by RNA-binding proteins
RNA elements recognized by RNA-binding proteins (RBPs)
RNA-binding proteins (RBPs) are involved in regulating gene expression. They interact
with RNA to form ribonucleoprotein complexes, which control post-transcriptional
processes such as splicing, cleavage and polyadenylation, and the editing, localization,
stability and translation of mRNAs.
Genes that encode RBPs are one of the largest gene families in the human genome,
comprising approximately 10% of all protein-coding genes. Their roles are essential for
normal human physiology, as defects in RBP function are associated with genetic and
somatic disorders, such as neurodegeneration, autoimmunity and cancer.
The RNA sequences and structures recognized by RBPs are encoded by the underlying
genomic sequence, and thus represent a class of functional sequence elements. ENCODE
3 introduces a new data set of RNA elements in the human genome that are recognized by
RBPs. This class of regulatory elements functions only when transcribed into RNA, as they
serve as the binding sites for RBPs.
ENCODE systematically mapped and studied the functions of 356 human RBPs using
integrative approaches consisting of 5 assays that focus on different aspects of RBP activity.

https://doi.org/10.1038/s41586-020-2077-3
eCLIP

eCLIP is an enhanced version of the crosslinking and immunoprecipitation (CLIP) assay used to
identify the binding sites of RNA binding proteins (RBPs) in vivo.

RNA and the protein of interest are UV-crosslinked, followed by cell lysis and RNase I
digestion. Next, the protein-RNA complexes are immunoprecipitated and ligated to an RNA
adapter on the 3' end of the target RNA. The bound protein is removed by proteinase K
digestion, and the RNA is reverse-transcribed. The resulting cDNA fragments are then
amplified and sequenced.
RBP2GO (https://RBP2GO.DKFZ.de) is a comprehensive database of all currently available proteome-wide datasets
for RBPs across 13 species from 53 studies including 105 datasets identifying altogether 22 552 RBP candidates.
These are combined with the information on RBP interaction partners and on the related biological processes,
molecular functions and cellular compartments. RBP2GO offers a user-friendly web interface with an RBP scoring
system and powerful advanced search tools allowing forward and reverse searches connecting functions and RBPs to
stimulate new research directions.
Three dimensional chromatin interactions
Chromosome conformation capture
The identification and mapping of all the
genes and of their functional elements is
complicated by the fact that the genomic
positions of genes and elements do not
provide direct information about functional
relationships between them.

A well-known example is provided by
enhancers that can regulate multiple target
genes that are located at large genomic
distances or even on different chromosomes
without affecting genes immediately next to
them.
Methods: Chromosome conformation capture
The genome is organized as a complex three-dimensional network that is determined by
physical interactions between genes and elements.
3C uses formaldehyde cross-linking to covalently trap interacting chromatin segments
throughout the genome. Interacting elements are then restriction-enzyme-digested and
intramolecularly ligated. The frequency with which two restriction fragments become ligated
is a measure of the frequency by which they interact in the nucleus.
3C uses PCR to detect individual chromatin interactions, which is particularly suited for
relatively small-scale studies focused on the analysis of interactions between a set of candidate
elements.
3C-Carbon Copy, or “5C”, uses highly multiplexed ligation-mediated amplification (LMA) to
first “copy” and then amplify parts of the 3C library, followed by detection on microarrays or
by quantitative DNA sequencing.
Hi-C uses high-throughput paired-end sequencing to find the nucleotide sequence of
fragments. Hence, all possible pairwise interactions between fragments are tested.
Methods: 3C
Chromosome Conformation Capture (3C) is a technique used to analyze the spatial organization
of chromosomes in a cell's natural state studying regions in remote locations that are in very
close physical proximity using formaldehyde cross-linked chromatin.
Methods: 3C
Chromosome Conformation Capture (3C) is a technique used to analyze the spatial
organization of chromosomes in a cell's natural state studying regions in remote locations
that are in very close physical proximity using formaldehyde cross-linked chromatin.
First, the cell genomes are cross-linked with formaldehyde which introduces bonds that
"freeze" interactions between genomic loci. The genome is then cut into fragments with a
restriction endonuclease. The next step is proximity-based ligation. This takes place at low
DNA concentrations or within intact, permeabilized nuclei in the presence of T4 DNA ligase,
such that ligation between cross-linked interacting fragments is favored over ligation
between fragments that are not cross-linked. Subsequently, interacting loci are quantified
by amplifying ligated junctions by PCR methods.
Because the interaction frequencies between any two fragments are analyzed in a pairwise
manner (one by one) by PCR using specific primers for each fragment, researchers are
limited to analyzing only a few loci or a relatively small genomic region (10 kb
to 1 Mb). Thus, 3C is considered a “hypothesis driven” technique, as a priori knowledge
about the genomic locations of the elements to be tested is required.
Methods: 3C-carbon copy (5C)

The 5C method begins with preparation of a 3C library. Then, several hundred 5C
primers are designed to span a large genomic region of interest such that the primers
will anneal precisely at the ligation junctions of the restriction fragments in the 3C
library. Next, the fragments are subjected to ligation-mediated amplification (LMA), to
simultaneously amplify thousands of 3C junctions in a single reaction. The resulting
PCR amplicons are detected by either microarray analysis or deep sequencing.

The 5C technique extends 3C to map the chromatin organization of a selected
genomic region, but it is unsuitable for profiling interactions genome-wide,
since that would require millions of 5C primers.
Methods: Hi-C

Hi-C is very similar to 3C in terms of methodology, except that,
after the restriction digestion, the digested ends are treated to
incorporate biotin prior to the diluted ligation step. After ligation,
all chromosome interactions can be captured genome-wide in an
unbiased manner by recovering ligated fragments using
streptavidin.
In addition to being genome-wide, a major advantage of Hi-C is
that interactions can be detected even over relatively large
genomic distances. Hi-C can detect in-cis interactions many
megabases away, as well as trans-interactions. The Hi-C resolution
achieved depends on the sequencing depth. With sufficient
replicates and deep sequencing, an interaction map for the whole
genome can be obtained at restriction-fragment-length resolution.
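The interaction map mentioned above is usually stored as a binned contact matrix. A minimal sketch, assuming read pairs that have already been mapped to positions on a single chromosome (the function name, coordinates and bin size are hypothetical):

```python
def contact_matrix(pairs: list[tuple[int, int]], chrom_length: int,
                   bin_size: int) -> list[list[int]]:
    """Count ligation junctions per pair of fixed-size genomic bins."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix

pairs = [(5, 95), (12, 18), (40, 90), (91, 8)]
matrix = contact_matrix(pairs, chrom_length=100, bin_size=25)
```

Smaller bins give finer resolution but need more read pairs per cell of the matrix, which is why the achievable resolution depends on sequencing depth.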
Methodologies to study 3D DNA structure
https://europepmc.org/article/PMC/5522765

(A) Two fragments: A (red) and B (blue), are spatially separated in the linear genome (gray dotted line) or neighboring (red and blue to gray fading). (B) If fragment A and B are in close
spatial proximity they can become cross-linked and ligated during the Hi-C procedure (1). Partial digests result from undigested neighboring fragments that were biotinylated (2). Other
possible, non-valid products can be derived from non-ligated DNA (dangling-end; 3) or single fragments that have become circularized after ligation (self-circles; 4). The gray arrows indicate
the orientation of the paired-end reads in the Hi-C library. (C) Dangling ends can be removed from the Hi-C library prior to sequencing, as described in this protocol. Any remaining
dangling ends and self-circles can be filtered out from the sequenced library computationally after mapping and assessing the orientation of the DNA reads. After mapping, valid reads locate to
different fragments in the reference genome and are either inward or outward oriented, or directed in the same direction (both pointing left or both pointing right) (1). Unligated partial
digestion products cannot be distinguished from valid reads because the two reads will map to two (neighboring) restriction fragments. This category is characterized by an inward read
orientation (2). Invalid reads have mapped to the same fragment in the reference genome and can be either inward (dangling ends; 3), outward (self-circles; 4) or same direction (error; 5).
Gray arrows indicate the read orientation in the reference genome.
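The computational filter described in this caption can be sketched as a classification of each mapped read pair by restriction fragment and strand. The function name, the argument layout and the strand encoding ("+" / "-") are assumptions made for illustration.

```python
def classify_pair(frag1: int, strand1: str, pos1: int,
                  frag2: int, strand2: str, pos2: int) -> str:
    """Classify a mapped Hi-C read pair following the categories in the caption."""
    if frag1 != frag2:
        # Different fragments: treated as valid whatever the orientation.
        # Unligated partial digests also land here and cannot be told apart.
        return "valid"
    if strand1 == strand2:
        return "error"  # same fragment, same direction
    # Same fragment, opposite strands: look at the strand of the leftmost read.
    left_strand = strand1 if pos1 <= pos2 else strand2
    # Leftmost read on "+" means the reads point toward each other (inward).
    return "dangling-end" if left_strand == "+" else "self-circle"

labels = [
    classify_pair(1, "+", 100, 2, "-", 900),
    classify_pair(1, "+", 100, 1, "-", 300),
    classify_pair(1, "-", 100, 1, "+", 300),
    classify_pair(1, "+", 100, 1, "+", 300),
]
```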
Chromosome conformation capture
analysis revealed numerous features
of genomic organization, such as the
presence of chromosome territories
and the preferential association of
small gene-rich chromosomes.
HiC - Interaction frequency

The contact frequency between a
pair of loci strongly correlates with
the one-dimensional distance
between them.

More intra-chromosomal interactions
than inter-chromosomal ones.
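The dependence of contact frequency on linear distance can be read off a binned contact matrix by averaging each diagonal. A minimal sketch on a toy symmetric matrix (the function name and the values are hypothetical):

```python
def mean_contacts_by_distance(matrix: list[list[float]]) -> list[float]:
    """Average the k-th diagonal of a square contact matrix for each bin distance k."""
    n = len(matrix)
    return [
        sum(matrix[i][i + k] for i in range(n - k)) / (n - k)
        for k in range(n)
    ]

toy = [
    [10, 4, 1],
    [4, 10, 4],
    [1, 4, 10],
]
decay = mean_contacts_by_distance(toy)  # index = distance in bins
```

In this toy matrix the mean contact count falls from 10 to 4 to 1 as the distance grows from 0 to 2 bins.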
Topologically Associating Domains
A topologically associating domain (TAD) is a sub megabase region of high self-
interaction that displays limited interaction outside the domain, meaning that DNA
sequences within a TAD physically interact with each other more frequently than with
sequences outside the TAD.
Boundaries on both sides of these domains are conserved between different
mammalian cell types and even across species and are highly enriched with
CCCTC-binding factor (CTCF) and cohesin binding sites. In addition, some types of genes (such
as transfer RNA genes and housekeeping genes) appear near TAD boundaries more often
than would be expected by chance. In humans and mice there are 2000–3000 domains,
with an average size of about 1 Mb.
TADs were discovered in 2012 using chromosome conformation capture techniques.
They have been shown to be present in multiple species, including Drosophila, mouse,
plants, fungi and human genomes. In bacteria, they are referred to as Chromosomal
Interacting Domains (CIDs)
TADs were originally defined algorithmically in low-resolution (40 kb) mammalian Hi-C
matrices as megabase-scale genomic blocks in which DNA sequences exhibit
significantly higher interaction frequency with other DNA sequences within the
domain than with those outside of the block (Fig. 1a).
The most salient feature of TADs is that TADs are demarcated by boundaries.
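One simple way to locate such boundaries in a contact matrix is an insulation score: for each candidate boundary, average the contacts that cross it within a small window, and look for local minima. This is a swapped-in illustration, not necessarily the algorithm used in the work cited on these slides; the function name, window size and toy matrix are hypothetical.

```python
def insulation_scores(matrix: list[list[float]], window: int) -> dict[int, float]:
    """Mean contact count crossing the boundary in front of each bin b."""
    n = len(matrix)
    scores = {}
    for b in range(window, n - window + 1):
        upstream = range(b - window, b)    # bins just before the boundary
        downstream = range(b, b + window)  # bins just after the boundary
        total = sum(matrix[i][j] for i in upstream for j in downstream)
        scores[b] = total / (window * window)
    return scores

# Toy matrix: two domains of three bins each, 10 contacts within a domain, 1 between.
toy = [[10 if i // 3 == j // 3 else 1 for j in range(6)] for i in range(6)]
scores = insulation_scores(toy, window=2)
boundary = min(scores, key=scores.get)  # bin in front of which insulation is strongest
```

Few contacts cross a strong boundary, so the lowest score marks the position where one domain ends and the next begins.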

Heat-map representations (top) and schematized
globular interactions (bottom) of TADs (a,b) and
nested subTADs (c,d).
Smaller, sub-megabase-scale chromatin domains
(subTADs) resemble the domain-like structure of
TADs and are also demarcated by boundaries.
However, subTAD boundaries exhibit weaker
insulation strength, as evidenced by their relatively
lower capacity to attenuate long-range contacts
between domains, and they are also significantly
more likely than TADs to exhibit cell-type-dynamic
folding.
https://www.nature.com/articles/s41588-019-0561-1?proof=tNature
Identification of contact-domain classes from the previous slide, binned at 10-kb resolution

https://www.nature.com/articles/s41588-019-0561-1?proof=tNature
Topologically Associating Domains

• TADs with repressed transcriptional
activity tend to be associated with
the nuclear lamina

• Active TADs tend to reside more in
the nuclear interior

• An active TAD has several
interactions between distal
regulatory elements and genes
within it

By identifying the proteins
in the TADs (e.g., by ChIP-seq) it
is possible to identify TAD
activity.
[Figure tracks: insulator binding proteins and sequences; active chromatin mark; inactive chromatin mark; transcriptomic data]
TAD disruption can lead to pathological conditions

[Figure panels: health, disease]
A 660 kb deletion that removes a TAD boundary and enhancer A (Enh-A) leads to adoption of enhancer B (Enh-B)
by the lamin B1 gene (LMNB1), leading to its misexpression and, subsequently, to autosomal dominant
adult-onset leukodystrophy. Deletion breakpoints are depicted by magenta and yellow dots.

Autosomal dominant adult-onset leukodystrophy (ADLD)

ADLD is a rare, slowly progressive neurological disorder involving central nervous
system demyelination, leading to autonomic dysfunction, ataxia and mild cognitive impairment.
All of the techniques explained thus far are used to study genome
organization from a “DNA-centric” point of view. Because DNA organization
is established and maintained by protein and RNA complexes, several
methods have been devised to study genome structure from the protein
perspective.
Method: Chromatin Interaction Analysis by Paired-End Tag
Sequencing (ChIA-PET)

ChIA-PET combines ChIP-based methods and Chromosome conformation capture (3C) to
extend the capabilities of both approaches. ChIP-seq is typically used for genome-wide
identification of TF binding sites, but it provides only linear information on protein binding
sites along the chromosomes (not on the interactions between them). Compared with Hi-C,
ChIA-PET offers higher resolution for the interactions associated with a protein of interest,
which makes it better suited to functional studies.
DNA-protein complexes are crosslinked and fragmented. Specific antibodies are used to
immunoprecipitate the proteins of interest. The DNA fragments are then ligated based on
proximity, precipitated, digested with restriction enzymes, and sequenced in
paired-end mode. Deep sequencing provides base-pair resolution of the ligated fragments.
Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET)
sequencing technology and application

The ChIA-PET experimental protocol, which includes chromatin preparation, ChIP, linker ligation, proximity ligation, Mme I
restriction digestion, and DNA sequencing. This figure is from Genome Biol. 2010, 11 (2): R22-10.1186/gb-2010-11-2-r22.
Transcription models based on chromatin interactions
In addition to promoter-enhancer and enhancer-enhancer interactions, ChIA-PET revealed
that promoter-promoter interactions are also pervasive in human cells. Among the
promoter-nonpromoter interactions, more than 40% of the non-promoter regulatory elements did
not interact with their nearest promoter. This means that the common assumption that
transcription factor binding sites regulate their nearest genes is not generally valid.
Three types of transcription models have been proposed: 1) basal promoter models, in
which there are no chromatin interactions; 2) single-gene interaction models, in which one
gene is involved with one or more promoter-nonpromoter interactions; and 3) multi-gene
interaction models, in which multiple genes are linked together by chromatin interactions to
form a transcription factory for potential correlated transcription.
Chromosome interacting regions

Physical interaction between distinct chromosome regions that can be separated by
hundreds of kilobases is thought to be important in the regulation of gene expression.
• The average number of distal elements interacting with a promoter was 3.9, and the
average number of promoters interacting with a distal element was 2.5, indicating a
complex network of interconnected chromatin
• Whereas promoter regions of 2,324 genes were involved in ‘single-gene’ enhancer-
promoter interactions, those of 19,813 genes were involved in ‘multi-gene’
interaction complexes
• 50-60% of long range interactions occurred in only one of the four cell lines,
indicative of a high degree of tissue specificity for gene-element connectivity
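Averages such as "distal elements per promoter" and "promoters per distal element" can be computed from a list of interaction pairs. The sketch below uses hypothetical identifiers and a toy interaction list, and assumes the list is non-empty.

```python
from collections import defaultdict

def mean_partners(interactions: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (mean distal elements per promoter, mean promoters per distal element)."""
    by_promoter: dict[str, set[str]] = defaultdict(set)
    by_element: dict[str, set[str]] = defaultdict(set)
    for promoter, element in interactions:
        by_promoter[promoter].add(element)
        by_element[element].add(promoter)

    def mean_size(groups: dict[str, set[str]]) -> float:
        return sum(len(members) for members in groups.values()) / len(groups)

    return mean_size(by_promoter), mean_size(by_element)

interactions = [("P1", "E1"), ("P1", "E2"), ("P1", "E3"), ("P2", "E3")]
per_promoter, per_element = mean_partners(interactions)
```

The two averages differ whenever the numbers of distinct promoters and distal elements differ, as they do in the figures quoted above.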
The impact of selection on functional
elements
The impact of selection on functional elements

From comparative genomic studies, at least 3-8% of bases
are under purifying (negative) selection, indicating that these
bases may potentially be functional.
Darwin’s 5 points

1. Population has variations
2. Some variations are favorable
3. More offspring are produced than survive
4. Those that survive have favorable traits
5. A population will change over time
Negative selection and positive selection

• Negative selection or purifying selection is the
selective removal of alleles that are deleterious.

• Positive selection is selection in favor of a particular trait, which
increases the frequency of the underlying allele in a population.

• Deleterious variants are removed (negative selection)
• Favourable variants are fixed (positive selection)
Minor allele frequency (MAF)

Minor allele frequency (MAF) refers to the frequency at which the second
most common allele occurs in a given population.

It is widely used in population genetics studies because it provides information
to differentiate between common and rare variants in the population.

• Variation: MAF ≤1%
• Polymorphism: MAF >1%

Rare variants: <1%
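The definition above translates directly into a count of alleles. A minimal sketch, assuming a flat list of observed alleles (one entry per chromosome sampled) and using the 1% threshold from this slide; the function names are hypothetical.

```python
from collections import Counter

def minor_allele_frequency(alleles: list[str]) -> float:
    """Frequency of the second most common allele (0.0 if only one allele is seen)."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(alleles)

def classify(maf: float, threshold: float = 0.01) -> str:
    """Label a site using the MAF threshold given on the slide."""
    return "polymorphism" if maf > threshold else "variation"

common_site = ["A"] * 197 + ["G"] * 3  # 3 of 200 chromosomes carry G
rare_site = ["A"] * 199 + ["G"]        # 1 of 200 chromosomes carries G
maf_common = minor_allele_frequency(common_site)
maf_rare = minor_allele_frequency(rare_site)
```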


Measurement of negative selection

Negative selection can be examined using two measures that highlight different periods of
selection in the human genome:

1. Inter-species, pan-mammalian constraint (GERP-based scores; 33 mammals) addresses
selection during mammalian evolution.
2. Intra-species constraint, estimated from the numbers of variants discovered in human
populations, covers selection over human evolution.
Inter-species pan-mammalian Conservation score
The conservation score summarizes how well sequence at a given site is conserved
among 33 mammal taxa. Substitutions at Evolutionarily Conserved (EC) positions are
more deleterious than those at Evolutionarily Unconserved (EU) positions

• Position: GERP, PhyloP
– Genomic Evolutionary Rate Profiling
(GERP) measures base conservation
– PhyloP assigns conservation P-values

• Small regions: PhastCons
– PhastCons fits a Hidden Markov Model
GERP

Genomic Evolutionary Rate Profiling identifies constrained elements in multiple
alignments by quantifying substitution deficits.

Deficits represent substitutions that would have occurred if the element were neutral
DNA but did not occur because the element has been under functional constraint.

Positive scores represent a substitution deficit (i.e., fewer substitutions than the
average neutral site) and thus indicate that a site may be under evolutionary
constraint. Negative scores indicate that a site is probably evolving neutrally.
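The idea of a substitution deficit can be written as a single subtraction: the number of substitutions expected at a neutral site minus the number actually observed in the alignment. The sketch below only illustrates the sign convention described above; the function names and the numbers are hypothetical, and estimating the two inputs from a phylogeny is outside its scope.

```python
def substitution_deficit(expected: float, observed: float) -> float:
    """Expected substitutions under neutrality minus observed substitutions."""
    return expected - observed

def interpret(score: float) -> str:
    """Apply the sign convention stated on the slide."""
    return "possibly constrained" if score > 0 else "probably neutral"

# Toy alignment columns: (expected under neutrality, observed).
columns = [(5.0, 1.0), (5.0, 5.5)]
labels = [interpret(substitution_deficit(e, o)) for e, o in columns]
```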
The impact of selection on functional elements
The next slide reports the levels of pan-mammalian constraint (higher GERP score = higher conservation =
higher constraint) compared to diversity, a measure of negative selection in the human population (mean
expected heterozygosity, inverted scale, y axis). Each point is an average for a single data set.
The top-right corners have the strongest evolutionary constraint and lowest diversity.
Coding (C), UTR (U), genomic (G), intergenic (IG) and intronic (IN) averages are shown as filled squares. In
each case the vertical and horizontal cross hairs show representative levels for the neutral expectation for
mammalian conservation and human population diversity, respectively. Each graph also shows genomic
background levels and measures of coding-gene constraint for comparison. Because human population
diversity is plotted on an inverted scale, elements that are more constrained by negative selection will tend
to lie in the upper and right-hand regions of the plot.

DHS sites show enrichment in pan-mammalian constraint (are under negative selection) and
decreased human population diversity.

Bound transcription factor motifs show both more mammalian constraint and higher
suppression of human diversity.

Derived allele frequency (DAF) is the extent of deviation in human population from the
ancestral base.

Examination of variants segregating in primate-specific regions revealed that all classes of
elements (RNA and regulatory) show depressed derived allele frequencies, consistent with
recent negative selection occurring in at least some of these regions, as the increase in
low-DAF in primate specific sequences compared to background is indicative of negative
selection occurring in the set of variants annotated by the ENCODE data.

No evidence for pan-mammalian selection of novel RNA sequences
(intronic RNA, dark green and intergenic RNA, light green)
DHS sites show enrichment in
pan-mammalian constraint (are
under negative selection ) and
decreased human population
diversity

Levels of pan-mammalian constraint (higher GERP score = higher conservation = higher constraint) compared to
diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y axis).
The top-right corner has the strongest evolutionary constraint and lowest diversity. Genomic averages (G) are shown.
Bound transcription
factor motifs show
both more mammalian
constraint and higher
suppression of human
diversity

Levels of pan-mammalian constraint (higher GERP score = higher conservation = higher constraint) compared to
diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y axis).
The top-right corner has the strongest evolutionary constraint and lowest diversity. Genomic averages (G) are shown.
No evidence for pan-
mammalian selection of
intronic RNA (dark green) and
intergenic RNA (light green)
Evidence of negative selection
of intronic RNA in humans

Levels of pan-mammalian constraint (higher GERP score = higher conservation = higher constraint) compared to
diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y axis).
The top-right corner has the strongest evolutionary constraint and lowest diversity. Intergenic (IG) and intronic (IN)
averages are shown.
Derived allele frequency (DAF) is the
extent of deviation in human population
from the ancestral base.

The increase in 0-DAF in primate specific sequences compared to background is
indicative of negative selection occurring in the set of variants annotated by the
ENCODE data.
Derived alleles
Derived alleles are the new mutations that have arisen in the population: variants that
arose since the last common ancestor.

A high derived allele frequency (DAF), for example 95%, means that a mutation likely occurred
somewhere on the human lineage and is now found in about 95% of humans. The underlying
mechanism is unknown.

A DAF of 95% could be due to:

1. chance: alleles will fix or be weeded out by chance, especially if the effective
   population size is low. Chimps have a greater effective population size but could
   still fix or weed out variants by chance.
2. positive selection
3. weak background selection
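
For concreteness, a minimal sketch of how a DAF could be computed at one SNP. It assumes the ancestral base is already known (for example from an outgroup); the genotypes and the function name are hypothetical.

```python
def derived_allele_frequency(genotypes, ancestral):
    """Fraction of observed allele copies that differ from the ancestral base."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    derived = sum(1 for allele in alleles if allele != ancestral)
    return derived / len(alleles)

# Hypothetical diploid genotypes at one SNP; the ancestral base is assumed to be "A"
genotypes = ["AG", "GG", "AG", "AA", "GG"]
daf = derived_allele_frequency(genotypes, ancestral="A")
print(daf)
```

Here 6 of the 10 sampled allele copies carry the derived base G.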
• There are also a large number of elements without mammalian constraint: many
transcription-factor-binding regions as well as DHSs and FAIRE regions.

• Many of them are due to retrotransposon activity, but an appreciable proportion is non-
repetitive primate-specific sequence. Examination of these primate-specific regions
revealed that all classes of elements show depressed derived allele frequencies,
consistent with recent negative selection occurring in at least some of these regions (Fig.
1e).

This indicates that an appreciable proportion of the unconstrained elements are lineage-
specific elements required for organismal function
Exons exhibit by far the strongest levels of constraint.
Over 94% of the coding exons in the human genome overlap at least one
predicted constrained element (CE); conversely, only about 16% of
constrained elements overlap a coding exon. 3′ UTR regions show
noticeable constraint levels.

Introns on average have slightly lower GERP scores than the overall genomic baseline.
However, a nontrivial fraction of introns does exhibit evidence of constraint, as nearly 7%
of intron positions lie in constrained elements and make up a large fraction of constrained
element bases.

(A) Mean rejected substitution scores for entire human genome, constrained elements predicted by GERP++.

(B) Breakdown of constrained element positions by region type.

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1001025
Summary of ENCODE-identified elements (2012)
A surprisingly large amount of the human genome, 80.4%, is covered by at least one
ENCODE-identified element.

• The broadest element class represents the different RNA types, covering 62% of the
genome (although the majority is inside of introns or near genes).

• Regions highly enriched for histone modifications form the next largest class (56.1%).

• Smaller proportions of the genome are occupied by regions of open chromatin (15.2%) or
sites of transcription factor binding (8.1%).

• Using our most conservative assessment, 8.5% of bases are covered by either a
transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%). This, however, is
still about 4.5-fold higher than the amount of protein-coding exons, and about twofold
higher than the estimated amount of pan-mammalian constraint.
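
The "either/or" figure follows from inclusion-exclusion: 4.6% + 5.7% − 8.5% = 1.8% of bases would be covered by both a motif and a footprint. A minimal sketch of how such coverage could be computed from interval lists is given below; the coordinates and the genome length are hypothetical toy values, and intervals are assumed to be half-open [start, end).

```python
def covered_bases(intervals):
    """Count the bases covered by the union of half-open [start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and start a new one
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Hypothetical toy annotations on a 200-base "genome"
motifs = [(10, 30), (50, 60)]
footprints = [(20, 40), (55, 70), (90, 100)]
genome_length = 200

either = covered_bases(motifs + footprints)
both = covered_bases(motifs) + covered_bases(footprints) - either
print(either / genome_length, both / genome_length)
```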
ENCODE data integration with known genomic features
Gene regulation at functional elements is
governed by interplay of nucleosome remodeling,
histone modifications and TF binding.

Binding of TFs to regulatory DNA regions triggers chromatin remodeling.
• Such binding and remodeling lead to nuclease
hypersensitivity, which forms an open
chromatin environment and in turn facilitates
the interplay of functional elements.
• Binding of regulatory factors to genomic DNA
protects the underlying sequences from
cleavage by DNase I, thus leaving nucleotide-
resolution ‘footprints’.
http://www.sciencedirect.com/science/article/pii/S167202291300048X
ENCODE data integration with known genomic features

Many of the motifs detected in footprints display highly cell-type-selective occupancy
patterns that are similar to those of major developmental and tissue-specific regulators.
Integrating analyses of these functional elements allows gene regulation to be interpreted
more accurately.
-> Both TF binding and histone modification are predictive of gene expression levels

http://www.sciencedirect.com/science/article/pii/S167202291300048X
ENCODE data integration with known genomic features:
Promoters
Hypothesis: RNA expression (output) can be
effectively predicted from patterns of chromatin
modification or transcription factor binding (input).

Predictive models have been developed to explore the interaction between histone
modifications and measures of transcription at promoters.

Activating acetylation marks (H3K27ac and H3K9ac) are roughly as informative as activating
methylation marks (H3K4me3 and H3K4me2).

In many cases, chromatin marks are sufficient to ‘explain’ transcription.
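
To make the input/output idea concrete, here is a minimal sketch that regresses expression on a single chromatin mark. It is a plain one-variable least-squares fit, swapped in for illustration and not the model used by ENCODE; the signal and expression values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus R squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical promoters: H3K4me3 signal (input) and log expression (output)
h3k4me3_signal = [1.0, 2.0, 3.0, 4.0, 5.0]
log_expression = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = fit_line(h3k4me3_signal, log_expression)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.3f}")
```

An R squared close to 1 is what "the mark explains transcription" means in this correlative sense; it does not by itself show causation.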
ENCODE data integration with known genomic features:
Transcription Factors
ENCODE examined the predictive capacity of
transcription factor-binding signals for the
expression levels of promoters.

Most transcription factors show enriched binding signals in a narrow DNA region near the
TSS, with relatively higher binding signals in promoters with higher CpG content.

These correlation models indicate both that a limited set of chromatin marks are sufficient
to ‘explain’ transcription and that a variety of transcription factors might have broad
roles in general transcription levels across many genes.
Distribution of transcription factor binding regions
Transcription factor binding regions are nonrandomly distributed across the
genome, with respect to both other features (for example, promoters) and other
transcription-factor-binding regions.

They can be detected as peaks in transcription factor ChIP-seq data.

Most transcription factors have a nonrandom association to other transcription factors.
Some associations were expected, such as Jun and Fos, but many were novel associations,
such as TCF7L2 with HNF4alpha and FoxA2.

These associations are dependent on the genomic context, meaning that once the
genome is separated into promoter proximal and distal regions, the overall levels of
co-association decrease, but more specific relationships are uncovered
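
A naive way to look at co-association is to ask what fraction of one factor's peaks overlap a peak of another factor. The sketch below does only that; it is not the GSC statistic, it performs no significance test, and the peak coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in A that overlap at least one peak in B."""
    hits = sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
    return hits / len(peaks_a)

# Hypothetical ChIP-seq peaks (start, end) on one chromosome
jun_peaks = [(100, 200), (500, 600), (900, 1000), (1500, 1600)]
fos_peaks = [(150, 250), (550, 650), (2000, 2100)]

print(fraction_overlapping(jun_peaks, fos_peaks))
```

Running the same calculation separately on promoter-proximal and distal peaks is one simple way to see the context dependence described above.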
a, Significant coassociations of transcription
factor pairs using the GSC statistic across the
entire genome in K562 cells.

Most transcription factors have a nonrandom association to other transcription factors
These associations are dependent on the genomic context

Once the genome is separated into promoter proximal and distal regions, the overall levels
of co-association decrease, but more specific relationships are uncovered.
Three classes of behaviour are shown.

The first column shows a set of associations for which strength is independent of location
in promoter and distal regions.

The second column shows a set of transcription factors that have stronger associations in
promoter-proximal regions.
ENCODE data integration independent of genomic landmarks
Genome states

Defined a consensus set of 7 major classes of genome states:

• Three active proximal states, with a distinct core promoter region (TSS and promoter
flanking, PF, states), leading to active gene bodies (transcribed state, T).

• Three ‘active’ distal states: two are tentatively labelled as enhancers (predicted enhancers, E,
and predicted weak enhancers, WE) due to their occurrence in regions of open chromatin
with high H3K4me1. The other active state (CTCF) has high CTCF binding.

• The repressed state (R) summarizes sequences split between different classes of actively
repressed or inactive, quiescent chromatin.

CTCF encodes a transcriptional regulator protein with 11 highly conserved zinc finger (ZF) domains. This nuclear protein is able to use different combinations of the ZF domains to bind
different DNA target sequences and proteins. Depending upon the context of the site, the protein can bind a histone acetyltransferase (HAT)-containing complex and function as a
transcriptional activator or bind a histone deacetylase (HDAC)-containing complex and function as a transcriptional repressor. Mutations in this gene have been associated with invasive
breast cancers, prostate cancers, and Wilms' tumors.
Active, proximal genome states

• TSS: Predicted promoter region including TSS. Found close to or overlapping GENCODE TSS
sites. Enriched for H3K4me3. Sites of open chromatin. Enriched for transcription factors
known to act close to promoters and polymerases Pol II and Pol III. Short RNAs are most
enriched in these segments.
• PF: Predicted promoter flanking region. Regions that generally surround TSS
segments.
• T: Predicted transcribed region. Overlap gene bodies with H3K36me3
transcriptional elongation signal. Enriched for phosphorylated form of Pol II
signal (elongating polymerase) and poly(A)+ RNA, especially cytoplasmic.
Active, distal genome states

• E: Predicted enhancer. Regions of open chromatin associated with H3K4me1 signal.
Enriched for other enhancer associated marks, including transcription factors known to act
at enhancers.
• WE: Predicted weak enhancer or open chromatin cis-regulatory element.
Similar to the E state, but weaker signals and weaker enrichments
• CTCF: CTCF-enriched element. Sites of CTCF signal lacking histone
modifications, often associated with open chromatin. Many probably have a
function in insulator assays, but because of the multifunctional nature of CTCF,
we are conservative in our description.
Repressed genome state
R: Predicted repressed or low-activity region.
Silencers have been shown to exist in the human genome but are less well
characterized than enhancers. To date, only a few experimentally validated silencers
have been demonstrated to repress target genes in vitro.
Unlike H3K9me3, which marks regions that remain silenced all the time, H3K27me3 is
associated with repression of cell type-specific genes. H3K27me3 is known to
be a characteristic of silencers: H3K27me3-rich genomic regions can function as
silencers to repress gene expression via chromatin interactions.
Histone mark   Effect       Typical genomic location
H3K27me3       Repression   Promoters, gene-rich regions
H3K9me3        Repression   Satellite repeats, telomeres, pericentromeres
Variability of states between cell lines
The CTCF-binding-associated state is relatively invariant across all six cell types.

Conversely, the E and T states have substantial cell-specific behaviour, whereas the TSS state
has a bimodal behaviour with similar numbers of cell-invariant and cell-specific occurrences

Variability of states between cell lines, showing the distribution of occurrences of the state in the six cell lines (Tier 1 + Tier 2) at
specific genome locations: from unique to one cell line (1) to ubiquitous in all six cell lines (6) for five states (CTCF, E, T, TSS and
R).
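
The distribution described in this caption can be sketched as a simple count: for each genomic location, in how many cell lines is the state called? The cell-line names and bin numbers below are hypothetical placeholders.

```python
from collections import Counter

def occurrence_histogram(calls):
    """Map 'number of cell lines sharing a location' to 'number of such locations'."""
    lines_per_bin = Counter(b for bins in calls.values() for b in bins)
    return Counter(lines_per_bin.values())

# Hypothetical calls: cell line -> set of genomic bins labelled with one state
state_calls = {
    "line1": {1, 2, 3},
    "line2": {1, 2},
    "line3": {1, 4},
    "line4": {1},
    "line5": {1},
    "line6": {1, 5},
}

histogram = occurrence_histogram(state_calls)
for n_lines in range(1, 7):
    print(n_lines, histogram.get(n_lines, 0))
```

A state that is mostly cell-invariant piles up at 6, a cell-specific state piles up at 1, and a bimodal state such as TSS would show peaks at both ends.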
Distribution of transcription factors across genome state segments

A striking pattern is the concentration of transcription factors in the TSS-associated
state.
Distribution of Methylation across genome state segments

Similarly, DNA methylation shows marked distinctions between segments, recapitulating
the known biology of predominantly unmethylated active promoters (TSS states)
followed by methylated gene bodies (T state).

The two enhancer-enriched states show distinct patterns of DNA methylation, with the
less active enhancer state showing higher methylation (R, WE).
Insights into human genomic variation
Explored the potential impact of sequence variation on ENCODE functional elements by
examining allele-specific variation (Trio design).
Comparison of the ChIP-seq results in the region of NACC2 shows a strong paternal bias for
H3K79me2 and POLR2A and a strong maternal bias for H3K27me3, indicating differential
activity for the maternal and paternal alleles.

Transcription signal is shown in green. The purple signal shows the signal for all sequence
reads, whereas the blue and red signals show sequence reads specifically assigned to either
the paternal or maternal copies of the genome, respectively.

NACC2 has a statistically significant:

• paternal bias for RNA Polymerase II (POLR2A) and the transcription-associated mark H3K79me2
• maternal bias for the repressive mark H3K27me3.
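
One common way to ask whether an allelic bias is statistically significant is an exact binomial test of the paternal read count against an expected 50:50 split. The sketch below is an illustration under that assumption, not necessarily the test used by ENCODE; the read counts are hypothetical.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))

# Hypothetical allele-specific read counts at a heterozygous site
paternal_reads, maternal_reads = 18, 2
p_value = binomial_two_sided_p(paternal_reads, paternal_reads + maternal_reads)
print(f"{p_value:.6f}")
```

With 18 of 20 reads from one parent the p-value is well below 0.05, whereas a 10:10 split gives a p-value of 1.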
GWAS
A genome-wide association study (GWA study, or GWAS), is an examination of a genome-wide
set of genetic variants in different individuals to see if any variant is associated with a trait.
GWASs typically focus on associations between single-nucleotide polymorphisms (SNPs) and
traits like major human diseases, but can equally be applied to any other organism.

The allele count of each measured SNP is evaluated, in this case with a chi-squared test, to
identify variants associated with the trait in question. The numbers in this example are
taken from a 2007 study of coronary artery disease (CAD) that showed that individuals with
the G-allele of SNP1 were overrepresented amongst CAD patients.
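
A minimal sketch of the chi-squared test on a 2×2 table of allele counts (cases versus controls, G versus the other allele). The counts below are hypothetical and are not the numbers from the 2007 CAD study; no continuity correction is applied.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 d.f.) and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-squared with 1 degree of freedom
    return chi2, p_value

# Hypothetical allele counts: rows = cases / controls, columns = G allele / other allele
chi2, p_value = chi2_2x2(60, 40, 40, 60)
print(f"chi2={chi2:.1f} p={p_value:.4f}")
```

In a real GWAS this test is repeated for every SNP, so the significance threshold must be corrected for the very large number of tests.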
Common variants associated with disease
GWAS have greatly extended our knowledge of SNPs associated with human disease risk and other phenotypes, although they
are not necessarily the functional variants -> linkage disequilibrium?

88% of GWAS associated SNPs are either intronic or intergenic. ENCODE found that 12% of
these SNPs overlap transcription-factor-occupied regions whereas 34% overlap DHSs.

Overlap of lead SNPs in the NHGRI GWAS SNP catalogue (June 2011) with DHSs (left) or
transcription-factor-binding sites (right) as red bars compared with various control SNP
sets in blue.

The control SNP sets are (from left to right): SNPs on the Illumina 2.5M chip as an example of a widely used GWAS SNP typing panel; SNPs from the 1000 Genomes
project; SNPs extracted from 24 personal genomes (see personal genome variants track at http://main.genome-browser.bx.psu.edu (ref. 80)), all shown as blue bars.
8 SNPs associated with Crohn’s disease and other inflammatory diseases reside in a large
gene desert on chromosome 5, along with some epigenetic features indicative of function.

No genes here!

The SNP (rs11742570) strongly associated with Crohn’s disease overlaps a GATA2 transcription-factor-binding signal determined
in HUVECs. This region is also DNase I hypersensitive in HUVECs and T-helper TH1 and TH2 cells.
“Nonrandom association of phenotypes with ENCODE cell
types strengthens the argument that at least some of the
GWAS lead SNPs are functional or extremely close to
functional variants.

Each of the associations between a lead SNP and an ENCODE annotation remains a credible
hypothesis of a particular functional element class or cell type to explore with future
experiments”
«ENCODE and similar studies provide a first step towards interpreting the
rest of the genome —beyond protein-coding genes— thereby
augmenting common disease genetic studies with testable hypotheses.

Such information justifies performing whole-genome sequencing (rather than exome only,
1.2% of the genome) on rare diseases and investigating somatic variants in non-coding
functional elements, for instance, in cancer.»
ENCODE major findings

1. Primate-specific elements as well as elements without detectable mammalian constraint
show evidence of negative selection
2. The genome can be classified into several chromatin states
3. It is possible to correlate quantitatively RNA transcription with both chromatin marks and
transcription factor binding at promoters
4. Many non-coding variants in individual genome sequences lie in ENCODE-annotated
functional regions
5. SNPs associated with disease by GWAS are enriched within non-coding functional
elements, with a majority residing in or near ENCODE-defined regions that are outside of
protein-coding genes
Integration of ENCODE data
Sex reversal following deletion of a single distal enhancer of Sox9 located within the XY SR region

Nitzan Gonen et al. Science 2018;science.aas9408


Fig. 1 Enh13 is a testis-positive enhancer of Sox9 located within the XY SR region.(A) A schematic representation of the gene desert upstream of the mSox9 gene and the locations
of the putative enhancers identified by ATAC-seq and DNaseI-seq that were screened in vivo using transgenic reporter mice. Enhancers that did not drive gonad expression of LacZ
are shown in gray. Enhancers that drove testis-specific and ovary-specific LacZ expression are shown in blue and pink, respectively. The mouse regions that show conserved synteny
with the human XY SR and REV SEX are depicted in green and purple boxes, respectively. (B) Enh13 (gray box) is located at the 5′ side of the 25.7-kb mouse equivalent XY SR locus
(black rectangular box). DNaseI-seq (black) on E15.5 and E13.5 XY sorted Sertoli cells and ATAC-seq on E13.5 sorted Sertoli cells (blue) and granulosa cells (purple), as well as E10.5
sorted somatic cells, at Enh13 genomic region are presented. Peaks correspond to nucleosome depleted regions, and are marked by black box if they are significantly enriched
compared to flanking regions as determined by MACS, and present in at least two biological replicates. The gray box overlaying each peak indicates the cloned fragment. Green lines
represent sequence conservation between mouse, human, rhesus, cow, and chicken (sequence conservation tracks obtained from UCSC). (C) β-Gal staining (blue) of E13.5 testes and
ovaries from two representative independent stable Enh13 transgenic lines. Scale bars, 100 μm.
https://www.youtube.com/watch?v=3uWL0vM8qSo
