Chloroplast

The inside of a chloroplast

Chloroplasts are organelles found in plant cells and other eukaryotic organisms that conduct photosynthesis.
Chloroplasts capture light energy to conserve free energy in the form of ATP and reduce NADP to NADPH
through a complex set of processes called photosynthesis.[1]

The word chloroplast is derived from the Greek words chloros which means green and plast which means form
or entity. Chloroplasts are members of a class of organelles known as plastids.

Evolutionary origin

Plant cells with visible chloroplasts.

Chloroplasts are one of the many different types of organelles in the cell. They are generally considered to have
originated as endosymbiotic cyanobacteria (previously known as blue-green algae). This was first suggested by
Mereschkowsky in 1905 [2] after an observation by Schimper in 1883 that chloroplasts closely resemble
cyanobacteria. [3] All chloroplasts are thought to derive directly or indirectly from a single endosymbiotic event
(in the Archaeplastida), except for Paulinella chromatophora, which has recently acquired a photosynthetic
cyanobacterial endosymbiont which is not closely related to chloroplasts of other eukaryotes. [4] In that they
derive from an endosymbiotic event, chloroplasts are similar to mitochondria, but chloroplasts are found only in
plants and protists. The chloroplast is surrounded by a double-layered composite membrane with an
intermembrane space; further, it has reticulations, or many infoldings, filling the inner spaces. The chloroplast
has its own DNA, termed the plastome, which codes for redox proteins involved in electron transport in
photosynthesis.[5]

In green plants, chloroplasts are surrounded by two lipid-bilayer membranes. They are believed to correspond to
the outer and inner membranes of the ancestral cyanobacterium.[6] Chloroplasts have their own genome, which is
considerably reduced compared to that of free-living cyanobacteria, but the parts that are still present show clear
similarities with the cyanobacterial genome. Plastids may contain 60-100 genes whereas cyanobacteria often
contain more than 1500 genes.[7] Many of the missing genes are encoded in the nuclear genome of the host. The
rate of transfer of chloroplast DNA to the nucleus has been estimated in tobacco plants at one event for every
16,000 pollen grains.[8]

In some algae (such as the heterokonts and other protists such as Euglenozoa and Cercozoa), chloroplasts seem
to have evolved through a secondary event of endosymbiosis, in which a eukaryotic cell engulfed a second
eukaryotic cell containing chloroplasts, forming chloroplasts with three or four membrane layers. In some cases,
such secondary endosymbionts may have themselves been engulfed by still other eukaryotes, thus forming
tertiary endosymbionts. In the alga Chlorella, there is only one chloroplast, which is bell shaped.

In some groups of mixotrophic protists such as the dinoflagellates, chloroplasts are separated from a captured
alga or diatom and used temporarily. These klepto chloroplasts may only have a lifetime of a few days and are
then replaced.[9]

Structure

Chloroplasts are observable morphologically as flat discs usually 2 to 10 micrometers in diameter and 1
micrometer thick. In land plants they are generally 5 μm in diameter and 2.3 μm thick. The chloroplast is
contained by an envelope that consists of an inner and an outer phospholipid membrane. Between these two
layers is the intermembrane space. A typical parenchyma cell contains about 10 to 100 chloroplasts.

Chloroplast ultrastructure:
1. outer membrane
2. intermembrane space
3. inner membrane (1+2+3: envelope)
4. stroma (aqueous fluid)
5. thylakoid lumen (inside of thylakoid)
6. thylakoid membrane
7. granum (stack of thylakoids)
8. thylakoid (lamella)
9. starch
10. ribosome
11. plastidial DNA
12. plastoglobule (drop of lipids)

The material within the chloroplast is called the stroma, corresponding to the cytosol of the original bacterium,
and contains one or more molecules of small circular DNA. It also contains ribosomes, although most of its
proteins are encoded by genes contained in the host cell nucleus, with the protein products transported to the
chloroplast.

Within the stroma are stacks of thylakoids, the sub-organelles which are the site of photosynthesis. The
thylakoids are arranged in stacks called grana (singular: granum).[10] A thylakoid has a flattened disk shape.
Inside it is an empty area called the thylakoid space or lumen. Photosynthesis takes place on the thylakoid
membrane; as in mitochondrial oxidative phosphorylation, it involves the coupling of cross-membrane fluxes
with biosynthesis via the dissipation of a proton electrochemical gradient.

In the electron microscope, thylakoid membranes appear as alternating light-and-dark bands, each 0.01 μm
thick. Embedded in the thylakoid membrane is the antenna complex, which consists of the light-absorbing
pigments, including chlorophyll and carotenoids, as well as proteins that bind the pigments. This complex both
increases the surface area for light capture, and allows capture of photons with a wider range of wavelengths.
The energy of the incident photons is absorbed by the pigments and funneled to the reaction centre of this
complex through resonance energy transfer. Two chlorophyll molecules are then ionised, producing an excited
electron which then passes onto the photochemical reaction centre.

Recent studies have shown that chloroplasts can be interconnected by tubular bridges called stromules, formed
as extensions of their outer membranes.[11][12] Chloroplasts appear to be able to exchange proteins via
stromules,[13] and thus function as a network.

Transplastomic plants

Recently, chloroplasts have attracted the attention of developers of genetically modified plants. In most flowering
plants, chloroplasts are not inherited from the male parent,[14][15] although in plants such as pines, chloroplasts are
inherited from males.[16] Where chloroplasts are inherited only from the female, transgenes in these plastids
cannot be disseminated by pollen. This makes plastid transformation a valuable tool for the creation and
cultivation of genetically modified plants that are biologically contained, thus posing significantly lower
environmental risks. This biological containment strategy is therefore suitable for establishing the coexistence of
conventional and organic agriculture. While the reliability of this mechanism has not yet been studied for all
relevant crop species, recent results in tobacco plants are promising, showing a failed containment rate of
transplastomic plants at 3 in 1,000,000.[15]

Chloroplast Genes

Chloroplasts (as well as mitochondria) have their own genome.

The diagram (based on the work of Ohyama, K. et al., Nature 322:572, 1986 and Linda A. Raubeson
and R. K. Jansen, Science 255:1697, 1992) shows the first chloroplast genome to be
sequenced, that of the liverwort Marchantia polymorpha. It contains 121,024 base pairs
encoding 128 genes. The short lines indicate a few of the tRNA genes, some of which are labeled.

The order of the genes between the arrows (~6:30 to ~10:00) is also found in the lycopsids. But in all
other vascular plants, this region is inverted and the order of the genes is precisely reversed. This
provides further evidence that the other vascular plants we shall examine below, the

• horsetails
• ferns
• gymnosperms and
• angiosperms

belong to a separate clade.

What genes are present in the chromophytic algal chloroplast genome?

Stramenopiles are a "crown" taxon that evolved about 300 million years ago and radiated after the Cretaceous
Period. Photosynthetic members of this taxon vary in morphology from simple unicells to highly complex
parenchymatous seaweeds with intricate reproductive structures. These autotrophic eukaryotes impact many of the
earth's biogeochemical cycles (e.g. sulfur and nitrogen loading) and serve as primary producers that fix a
significant portion of the total CO2 processed on earth. The stramenopiles represent a major eukaryotic group,
comprised of an estimated 500,000 to one million species that is taxonomically distinct from the chlorophytic or
rhodophytic lineages of autotrophs.

To date, of the thousands of species, only two chloroplast genomes are available for the entire stramenopile
assemblage (both are from diatoms). In our newly funded work, chloroplast DNA of selected representatives
from this taxon is being sequenced. Morphologically diverse organisms that are ecologically important and
economically valuable have been targeted. We have completed the sequence of the chloroplast genomes from
two Heterosigma strains (East Pacific and West Atlantic). Analysis of the chloroplast genomes from 30
additional stramenopiles as part of an NSF Tree of Life project is ongoing. We are using our newly developed
Fosmid sequencing approach in these studies. This method completely eliminates the need to isolate chloroplast
DNA (an impossible task in 1 μm-sized algal representatives). I am collaborating with G. Rocap (U of W; Ocean
Sciences) and M. Jacobs (Genome Center) in this endeavor.
Fig. 4: Sequenced Heterosigma akashiwo (strain CCMP 452) chloroplast genome

What chloroplast genes are expressed during Heterosigma's various life history phases?

Heterosigma is an obligate photoauxotroph that not only survives as a free swimming vegetative cell, but also
exists for long periods in stasis (in the dark and cold). We have shown that mRNA abundance in synchronously
maintained Heterosigma vegetative cells is transcriptionally regulated. This response is determined by
photoperiod. Having a large number of chloroplast gene sequences, through our chloroplast genome sequencing
project, will allow us to probe selective gene expression profiles at a given life-history phase using quantitative
RT-PCR and microarray technologies.

Fig. 5: (A) Steady-state analysis of psbA and rbcL, nbcS mRNAs during a 12L:12D cycle shows genes to be
expressed in an oscillatory manner. (B) Analysis of psbA transcript abundance in 24L:0D and 12L:12D regimes
shows that transcription of this gene is light (not circadian) regulated.

How do chromophytic algae regulate chloroplast gene expression?


Sequence analysis of the Heterosigma chloroplast genome revealed the presence of a single two-component
His-to-Asp (tsg1/Trg1) pair in this golden-brown alga. Our data represent the first documentation of His-to-Asp
arrays in chromophytic algae and counter previous reports suggesting that such regulatory proteins are lacking
in the stramenopile taxonomic cluster. Molecular modeling of the 27 kDa Heterosigma Trg1 regulator protein is
consistent with a winged helix-turn-helix identity, a class of proteins that is known to impact gene expression at
the level of transcription. Our data support the hypothesis that the response regulators of chromophytes,
rhodophytes, and glaucophytes interact with a sigma 70 subunit (encoded by rpoD) of a eubacterial-type
polymerase. rpoD has now been sequenced in Heterosigma. Both transcriptional and western analyses verify
that tsg1 is not a pseudogene. We hypothesize that tsg1 is redox regulated. The tsg1/trg1 pair is phylogenetically
constrained in its distribution. Some chloroplast genomes contain no identifiable tsg1 orthologue, while others
contain multiple trg1-like genes.

Fig. 6: Comparison of Heterosigma trg1 DNA and RNA polymerase binding domain to bacterial response
regulators

How is carbon metabolized via the Calvin cycle? What are the evolutionary origin(s) of this cycle?

Stramenopiles serve as large-scale bioconverters in CO2 processing via the Calvin-Benson cycle. The dogma,
present in all textbooks, that the Rubisco small subunit is universally nuclear-encoded was disproven by this
laboratory (all "red" and "brown" algae have this gene localized to their plastid genome). We have also
published some of the only available Km values and enzyme analysis for Rubisco and phosphoribulokinase in
these algae. Our present work (in collaboration with Dr. Michael Salvucci; USDA laboratory) is focused on
analyzing the molecular biology and biochemistry of a putative Rubisco activase (our target protein is very
unlike the Rubisco activase of land plants and green algae). Sequencing of this putative Rubisco activase from 24
Heterosigma strains revealed the presence of six naturally occurring variants in the protein. We propose to test
the functional significance of these altered proteins.
Fig. 7: Protein model of a putative rubisco activase found in the Heterosigma akashiwo chloroplast genome

How do chloroplasts replicate?

The survival of an autotrophic cell requires that multiple copies of the plastid genome are reproduced with high
fidelity and that these replicated chromosomes are efficiently distributed to the next chloroplast generation.
Though Ris and Plaut reported the presence of DNA in chloroplasts over 40 years ago, we still know very little
about plastid DNA synthesis. The goal of our study is to analyze the events associated with chloroplast DNA
replication and its dispersal to daughter plastids. We are using qPCR to probe the temporal expression of
chloroplast-encoded genes that are implicated in DNA replication and division in synchronous Heterosigma
cultures. Future studies will use confocal microscopy to document the replication of DNA "beads" on the ring-
shaped nucleoid of this organelle and the localization of division proteins.

Chloroplast Genome - The chloroplasts of green plants are cytoplasmic organelles that house the various
pigments and enzymes of the light-harvesting photosynthetic apparatus. Even before the turn of the twentieth century it
was clear that green pigmentation was one of the easiest traits to observe in plant breeding experiments.
Although some pigmentation traits obeyed Mendel's laws, other colour traits were only transmitted through the
female parent that provided the cytoplasm of the zygote.

These observations of cytoplasmic or maternal inheritance eventually led to the hypothesis that chloroplasts
must carry genes. We know that chloroplasts contain a unique circular DNA genome that is completely different
from the nuclear genome. The presence of a genetic system within chloroplasts had already been inferred from
studies on non Mendelian inheritance in 1909, but the presence of organellar DNA and ribosomes was
demonstrated only in 1962.

Since then it has been shown that chloroplasts and other plastids contain all the machinery necessary for gene
expression. The chloroplast genetic components form a large proportion of those in the leaf, comprising up to
15% of the total DNA and up to 60% of the total ribosomes. The chloroplast genome has been extensively
characterized from a variety of species and cooperation between the chloroplast and nuclear genome in
chloroplast biogenesis is currently under investigation.

Electron micrographs indicate that the chloroplast DNA is some 10 to 20 times smaller than the E. coli
chromosome. For example, the chloroplast genome of maize (corn) contains 140,000 base pairs of DNA. Such
genomes are much too small to encode the approximately 1,000 different proteins found in chloroplasts. Instead,
biosynthesis of the chloroplast involves an intimate collaboration between the nuclear and chloroplast genomes.

In fact, every known multimeric protein component of chloroplasts is a mixture of the products of both nuclear
and chloroplast genes. Most chloroplast proteins are encoded by nuclear DNA, translated in the cytoplasm, and
imported into the chloroplast by a specific transport mechanism that enables polypeptides to cross the outer
membrane of the organelle.

However, some 100 chloroplast specific proteins are synthesized within the chloroplast itself. These proteins are
encoded by chloroplast DNA, transcribed by the chloroplast specific RNA polymerase, and translated by the
chloroplast specific protein-synthesizing machinery. Since RNA cannot cross the outer membrane of the
chloroplast, chloroplast ribosomal RNAs and tRNAs must be encoded in chloroplast DNA.

Chloroplasts are not static organelles but can adapt to different physiological conditions, such as high or low
levels of light. For example, when plants are grown entirely in the dark, their plastids lack chlorophyll but retain
carotenoid pigments. Thus many chloroplast genes are light-regulated, in certain cases by light-sensitive promoters.

Mitochondrial DNA

Mitochondrial DNA (mtDNA) is the DNA located in organelles called mitochondria. Most other DNA present
in eukaryotic organisms is found in the cell nucleus. Mitochondrial DNA was discovered by Margit M. K. Nass
and Sylvan Nass by electron microscopy as DNase-sensitive threads inside mitochondria,[1] and by Ellen
Haslbrunner, Hans Tuppy and Gottfried Schatz by biochemical assays on highly purified mitochondrial
fractions.[2]

Nuclear and mitochondrial DNA are thought to be of separate evolutionary origin, with the mtDNA being
derived from the circular genomes of the bacteria that were engulfed by the early ancestors of today's eukaryotic
cells. Each mitochondrion is estimated to contain 2-10 mtDNA copies.[3] In the cells of extant organisms, the
vast majority of the proteins present in the mitochondria (numbering approximately 1500 different types in
mammals) are coded for by nuclear DNA, but the genes for some of them, if not most, are thought to have
originally been of bacterial origin, having since been transferred to the eukaryotic nucleus during evolution. In
most multicellular organisms, mtDNA is inherited from the mother (maternally inherited). Mechanisms for this
include simple dilution (an egg contains 100,000 to 1,000,000 mtDNA molecules, whereas a sperm contains
only 100 to 1000), degradation of sperm mtDNA in the fertilized egg, and, at least in a few organisms, failure of
sperm mtDNA to enter the egg. Whatever the mechanism, this single parent (uniparental) pattern of mtDNA
inheritance is found in most animals, most plants and in fungi as well. mtDNA is particularly susceptible to
reactive oxygen species generated by the respiratory chain due to its close proximity. Though mtDNA is
packaged by proteins and harbors significant DNA repair capacity, these protective functions are less robust
than those operating on nuclear DNA and therefore thought to contribute to enhanced susceptibility of mtDNA
to oxidative damage. Mutations in mtDNA can in some cases cause maternally inherited diseases and some
evidence suggests that they might be major contributors to the aging process and age-associated pathologies. [4]
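
The "simple dilution" mechanism can be made concrete with the copy numbers quoted above. The following is a minimal sketch, assuming (as a deliberate simplification) that every sperm mtDNA molecule enters the egg and none is degraded; the function name `paternal_fraction` is illustrative and not taken from any source.

```python
def paternal_fraction(egg_copies: int, sperm_copies: int) -> float:
    """Fraction of mtDNA in the zygote that is paternal, by dilution alone."""
    return sperm_copies / (egg_copies + sperm_copies)


# Extremes of the ranges quoted in the text:
# egg 100,000 to 1,000,000 copies, sperm 100 to 1000 copies.
lowest = paternal_fraction(egg_copies=1_000_000, sperm_copies=100)
highest = paternal_fraction(egg_copies=100_000, sperm_copies=1_000)

print(f"paternal share ranges from {lowest:.4%} to {highest:.2%}")
```

Even at the upper extreme the paternal share stays near one percent, which is why the other listed mechanisms (degradation, failure to enter the egg) are needed to account for strictly uniparental inheritance.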

In humans (and probably in metazoans in general), 100-10,000 separate copies of mtDNA are usually present
per cell (egg and sperm cells are exceptions). In mammals, each double-stranded circular mtDNA molecule
consists of 15,000-17,000 base pairs. The two strands of mtDNA are differentiated by their nucleotide content
with the guanine rich strand referred to as the heavy strand, and the cytosine rich strand referred to as the light
strand. The heavy strand encodes 28 genes, and the light strand encodes 9 genes for a total of 37 genes. Of the
37 genes, 13 are for proteins (polypeptides), 22 are for transfer RNA (tRNA) and two are for the small and large
subunits of ribosomal RNA (rRNA). This pattern is also seen among most metazoans, although in some cases
one or more of the 37 genes is absent and the mtDNA size range is greater. Even greater variation in mtDNA
gene content and size exists among fungi and plants, although there appears to be a core subset of genes that are
present in all eukaryotes (except for the few that have no mitochondria at all). Some plant species have
enormous mtDNAs (as many as 2,500,000 base pairs per mtDNA molecule) but, surprisingly, even those huge
mtDNAs contain the same number and kinds of genes as related plants with much smaller mtDNAs.

Use in identification

Mitochondria are structures within cells that convert the energy from food into a form that cells can use.
Although most DNA is packaged in chromosomes within the nucleus, mitochondria also have a small amount of
their own DNA. This genetic material is known as mitochondrial DNA or mtDNA. In humans, mitochondrial
DNA spans about 16,500 DNA building blocks (base pairs), representing a fraction of the total DNA in cells.
Unlike nuclear DNA, which is inherited from both parents and in which genes are rearranged in the process of
recombination, there is usually no change in mtDNA from parent to offspring. Although mtDNA also
recombines, it does so with copies of itself within the same mitochondrion. Because of this and because the
mutation rate of animal mtDNA is higher than that of nuclear DNA,[5] mtDNA is a powerful tool for tracking
ancestry through females (matrilineage) and has been used in this role to track the ancestry of many species
back hundreds of generations.

Human mtDNA can also be used to identify individuals.[6] Forensic laboratories occasionally use mtDNA
comparison to identify human remains, and especially to identify older unidentified skeletal remains. Although
unlike nuclear DNA, mtDNA is not specific to one individual, it can be used in combination with other evidence
(anthropological evidence, circumstantial evidence, and the like) to establish identification. mtDNA is also used
to exclude possible matches between missing persons and unidentified remains.[7] Many researchers believe that
mtDNA is better suited to identification of older skeletal remains than nuclear DNA because the greater number
of copies of mtDNA per cell increases the chance of obtaining a useful sample, and because a match with a
living relative is possible even if numerous maternal generations separate the two. American outlaw Jesse
James's remains were identified using a comparison between mtDNA extracted from his remains and the
mtDNA of the son of the female-line great-granddaughter of his sister.[8] Similarly, the remains of Alexandra
Feodorovna (Alix of Hesse), last Empress of Russia, and her children were identified by comparison of their
mitochondrial DNA with that of Prince Philip, Duke of Edinburgh, whose maternal grandmother was
Alexandra’s sister Victoria of Hesse.[9]

The low effective population size and rapid mutation rate (in animals) make mtDNA useful for assessing
genetic relationships of individuals or groups within a species and also for identifying and quantifying the
phylogeny (evolutionary relationships; see phylogenetics) among different species, provided they are not too
distantly related. To do this, biologists determine and then compare the mtDNA sequences from different
individuals or species. Data from the comparisons is used to construct a network of relationships among the
sequences, which provides an estimate of the relationships among the individuals or species from which the
mtDNAs were taken. This approach has limits that are imposed by the rate of mtDNA sequence change. In
animals, the rapid rate of change makes mtDNA most useful for comparisons of individuals within species and
for comparisons of species that are closely or moderately-closely related, among which the number of sequence
differences can be easily counted. As the species become more distantly related, the number of sequence
differences becomes very large; changes begin to accumulate on changes until an accurate count becomes
impossible.

Mitochondrial inheritance

Female inheritance

In sexually reproducing organisms, mitochondria are normally inherited exclusively from the mother. The
mitochondria in mammalian sperm are usually destroyed by the egg cell after fertilization. Also, most
mitochondria are present at the base of the sperm's tail, which is used for propelling the sperm cells. Sometimes
the tail is lost during fertilization. In 1999 it was reported that paternal sperm mitochondria (containing mtDNA)
are marked with ubiquitin to select them for later destruction inside the embryo.[10] Some in vitro fertilization
techniques, particularly injecting a sperm into an oocyte, may interfere with this.

The fact that mitochondrial DNA is maternally inherited enables researchers to trace maternal lineage far back
in time. (Y chromosomal DNA, paternally inherited, is used in an analogous way to trace the agnate lineage.)
This is accomplished in humans by sequencing one or more of the hypervariable control regions (HVR1 or
HVR2) of the mitochondrial DNA. HVR1 consists of about 440 base pairs. These 440 base pairs are then
compared to the control regions of other individuals (either specific people or subjects in a database) to
determine maternal lineage. Most often, the comparison is made to the revised Cambridge Reference Sequence. Vilà et al. have published studies
tracing the matrilineal descent of domestic dogs to wolves.[11] The concept of the Mitochondrial Eve is based on
the same type of analysis, attempting to discover the origin of humanity by tracking the lineage back in time.

Because mtDNA is not highly conserved and has a rapid mutation rate, it is useful for studying the evolutionary
relationships - phylogeny - of organisms. Biologists can determine and then compare mtDNA sequences among
different species and use the comparisons to build an evolutionary tree for the species examined.

Male inheritance

It has been reported that mitochondria can occasionally be inherited from the father in some species such as
mussels.[12][13] Paternally inherited mitochondria have also been reported in some insects such as fruit flies,[14]
honeybees,[15] and periodical cicadas.[16]

Evidence supports rare instances of male mitochondrial inheritance in some mammals as well. Specifically,
documented occurrences exist for mice,[17][18] where the male-inherited mitochondria were subsequently rejected.
It has also been found in sheep,[19] and in cloned cattle.[20] It has been found in a single case in a human male and
was linked to infertility.[21]

While many of these cases involve cloned embryos or subsequent rejection of the paternal mitochondria, others
document in vivo inheritance and persistence under lab conditions.

Genes

Transport chain

Many of the genes encode components of the electron transport chain:

Category                                                         Genes
NADH dehydrogenase (complex I)                                   MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6
Coenzyme Q - cytochrome c reductase/Cytochrome b (complex III)   MT-CYB
Cytochrome c oxidase (complex IV)                                MT-CO1, MT-CO2, MT-CO3
ATP synthase                                                     MT-ATP6, MT-ATP8

rRNA

Mitochondrial rRNA is encoded by MT-RNR1 (12S) and MT-RNR2 (16S).

tRNA

The following genes encode tRNA:

Amino Acid      3-Letter   1-Letter   MT DNA
Alanine         Ala        A          MT-TA
Arginine        Arg        R          MT-TR
Asparagine      Asn        N          MT-TN
Aspartic acid   Asp        D          MT-TD
Cysteine        Cys        C          MT-TC
Glutamic acid   Glu        E          MT-TE
Glutamine       Gln        Q          MT-TQ
Glycine         Gly        G          MT-TG
Histidine       His        H          MT-TH
Isoleucine      Ile        I          MT-TI
Leucine         Leu        L          MT-TL1, MT-TL2
Lysine          Lys        K          MT-TK
Methionine      Met        M          MT-TM
Phenylalanine   Phe        F          MT-TF
Proline         Pro        P          MT-TP
Serine          Ser        S          MT-TS1, MT-TS2
Threonine       Thr        T          MT-TT
Tryptophan      Trp        W          MT-TW
Tyrosine        Tyr        Y          MT-TY
Valine          Val        V          MT-TV
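
The gene lists above can be cross-checked against the earlier statement that human mtDNA carries 37 genes (13 protein-coding, 22 tRNA and 2 rRNA). The sketch below simply tallies the names given in this section; the dictionary layout and the helper name `count_genes` are illustrative choices, not part of any standard tool.

```python
# Gene names are copied from the lists in this section; the grouping is illustrative.
PROTEIN_GENES = {
    "complex I": ["MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4",
                  "MT-ND4L", "MT-ND5", "MT-ND6"],
    "complex III": ["MT-CYB"],
    "complex IV": ["MT-CO1", "MT-CO2", "MT-CO3"],
    "ATP synthase": ["MT-ATP6", "MT-ATP8"],
}
RRNA_GENES = ["MT-RNR1", "MT-RNR2"]

# One tRNA gene per amino acid, except leucine and serine, which have two each.
TRNA_SUFFIXES = "A R N D C E Q G H I L1 L2 K M F P S1 S2 T W Y V".split()
TRNA_GENES = ["MT-T" + suffix for suffix in TRNA_SUFFIXES]


def count_genes() -> dict:
    """Tally the mitochondrial genes by class and in total."""
    protein = sum(len(genes) for genes in PROTEIN_GENES.values())
    counts = {"protein": protein, "tRNA": len(TRNA_GENES), "rRNA": len(RRNA_GENES)}
    counts["total"] = sum(counts.values())
    return counts


print(count_genes())
```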

Genetic influence

Genetic illness

Mutations of mitochondrial DNA can lead to a number of illnesses including exercise intolerance and Kearns-
Sayre syndrome (KSS), which causes a person to lose full function of their heart, eye, and muscle movements.
(See also Mitochondrial disease).
Simplified structure of mitochondrion

C-value

From Wikipedia, the free encyclopedia

The term C-value refers to the amount of DNA contained within a haploid nucleus (e.g., in a gamete or one half
the amount in a diploid somatic cell) of a eukaryotic organism. In some cases (notably among diploid
organisms), the terms C-value and genome size are used interchangeably; however, in polyploids the C-value
may represent two genomes contained within the same nucleus. Greilhuber et al. [1] have suggested some new
layers of terminology and associated abbreviations to clarify this issue, but these somewhat complex additions
have yet to be used by other authors. C-values are reported in picograms.

Origin of the term


Many authors have incorrectly assumed that the "C" in "C-value" refers to "characteristic", "content", or
"complement". Even among authors who have attempted to trace the origin of the term, there had been some
confusion because Hewson Swift did not define it explicitly when he coined it in 1950.[2] In his original paper,
Swift appeared to use the designation "1C value", "2C value", etc., in reference to "classes" of DNA content
(e.g., Gregory 2001[3], 2002[4]); however, Swift explained in personal correspondence to Prof. Michael D.
Bennett in 1975 that "I am afraid the letter C stood for nothing more glamorous than 'constant', i.e., the amount
of DNA that was characteristic of a particular genotype" (quoted in Bennett and Leitch 2005[5]). This is in
reference to the report in 1948 by Vendrely and Vendrely of a "remarkable constancy in the nuclear DNA
content of all the cells in all the individuals within a given animal species" (translated from the original French).[6]
Swift's study of this topic related specifically to variation (or lack thereof) among chromosome sets in
different cell types within individuals, but his notation evolved into "C-value" in reference to the haploid DNA
content of individual species and retains this usage today.

Variation among species

C-values vary enormously among species. In animals they range more than 3,300-fold, and in land plants they
differ by a factor of about 1,000.[5][7] Protist genomes have been reported to vary more than 300,000-fold in size,
but the high end of this range (Amoeba) has been called into question. Variation in C-values bears no
relationship to the complexity of the organism or the number of genes contained in its genome, an observation
that was deemed wholly counterintuitive before the discovery of non-coding DNA and which became known as
the C-value paradox as a result. However, although there is no longer any paradoxical aspect to the discrepancy
between C-value and gene number, this term remains in common usage. For reasons of conceptual clarification,
the various puzzles that remain with regard to genome size variation instead have been suggested to more
accurately comprise a complex but clearly defined puzzle known as the C-value enigma. C-values correlate with
a range of features at the cell and organism levels, including cell size, cell division rate, and, depending on the
taxon, body size, metabolic rate, developmental rate, organ complexity, geographical distribution, and/or
extinction risk (for recent reviews, see Bennett and Leitch 2005[5]; Gregory 2005[7]).

Calculating C-values

Table 1: Relative Molecular Weights of Nucleotides†


Nucleotide                           Chemical formula   Relative molecular weight
2′-deoxyadenosine 5′-monophosphate   C10H14N5O6P        331.2213
2′-deoxythymidine 5′-monophosphate   C10H15N2O8P        322.2079
2′-deoxyguanosine 5′-monophosphate   C10H14N5O7P        347.2207
2′-deoxycytidine 5′-monophosphate    C9H14N3O7P         307.1966

†Source of table: Doležel et al., 2003[8]

By using the data in Table 1, relative weights of nucleotide pairs can be calculated as follows: AT = 615.3830
and GC = 616.3711. Provided the ratio of AT to GC pairs is 1:1, the mean relative weight of one nucleotide pair
is 615.8771 (±1%).[8]

The relative molecular weight may be converted to an absolute value by multiplying it by the atomic mass unit
(1 u), which equals one-twelfth of the mass of a carbon-12 atom, i.e., 1.660539 × 10^-27 kg. Consequently, the mean weight of
one nucleotide pair would be 1.023 × 10^-9 pg, and 1 pg of DNA would represent 0.978 × 10^9 base pairs.[8]
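
The arithmetic in the two paragraphs above can be reproduced in a few lines. This is a minimal sketch that starts from the AT and GC pair weights quoted in the text. The `at_fraction` parameter is an added assumption that generalizes the 1:1 ratio used by the source, and the function names are illustrative.

```python
# Relative molecular weights of nucleotide pairs, as quoted in the text.
AT_PAIR = 615.3830
GC_PAIR = 616.3711

ATOMIC_MASS_UNIT_KG = 1.660539e-27  # 1 u in kilograms, as quoted in the text
PG_PER_KG = 1e15                    # 1 kg = 10^15 picograms


def mean_pair_weight(at_fraction: float = 0.5) -> float:
    """Mean relative weight of one nucleotide pair (default: AT:GC = 1:1)."""
    return at_fraction * AT_PAIR + (1.0 - at_fraction) * GC_PAIR


def pair_mass_pg(at_fraction: float = 0.5) -> float:
    """Absolute mass of one nucleotide pair in picograms."""
    return mean_pair_weight(at_fraction) * ATOMIC_MASS_UNIT_KG * PG_PER_KG


def base_pairs_per_pg(at_fraction: float = 0.5) -> float:
    """Number of base pairs contained in 1 pg of DNA."""
    return 1.0 / pair_mass_pg(at_fraction)


print(f"mean pair weight: {mean_pair_weight():.4f}")
print(f"mass of one pair: {pair_mass_pg():.3e} pg")
print(f"base pairs per pg: {base_pairs_per_pg():.3e}")
```

Because the AT and GC pair weights differ by less than 0.2%, changing `at_fraction` moves the result only slightly, which is consistent with the ±1% tolerance stated in the text.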

The formulas for converting the number of nucleotide pairs (or base pairs) to picograms of DNA and vice-versa
are:[8]

genome size (bp) = (0.978 × 10^9) × DNA content (pg)

DNA content (pg) = genome size (bp) / (0.978 × 10^9)

1 pg = 978 Mb

The current estimates for human female and male diploid genome sizes are 6.406 × 10⁹ bp and 6.294 × 10⁹ bp,
respectively.[9] By using the conversion formulas given above, diploid human female and male nuclei in G1
phase of the cell cycle should contain 6.550 and 6.436 pg of DNA, respectively.
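The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are invented here, and the factor is derived from the mean base-pair weight and atomic mass unit quoted in the text rather than taken from any library.

```python
# Convert between DNA content (pg) and genome size (bp),
# using the constants quoted in the text.
ATOMIC_MASS_UNIT_KG = 1.660539e-27   # 1 u in kilograms
MEAN_BP_RELATIVE_WEIGHT = 615.8771   # mean of AT and GC pairs at a 1:1 ratio
PG_PER_KG = 1e15                     # 1 kg = 10^15 pg

# Mean mass of one nucleotide pair in picograms (about 1.023e-9 pg).
PG_PER_BP = MEAN_BP_RELATIVE_WEIGHT * ATOMIC_MASS_UNIT_KG * PG_PER_KG
# Number of base pairs in 1 pg of DNA (about 0.978e9 bp).
BP_PER_PG = 1.0 / PG_PER_BP


def bp_to_pg(genome_size_bp):
    """DNA content in picograms for a genome of the given size in bp."""
    return genome_size_bp * PG_PER_BP


def pg_to_bp(dna_content_pg):
    """Genome size in base pairs for the given DNA content in pg."""
    return dna_content_pg * BP_PER_PG


if __name__ == "__main__":
    # Human diploid genome sizes quoted in the text.
    for label, size_bp in [("female", 6.406e9), ("male", 6.294e9)]:
        print(f"{label}: {bp_to_pg(size_bp):.2f} pg")
```

Because the sketch keeps the unrounded factor, its results can differ from the rounded values in the text in the third decimal place.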
Genomes 2 1. Genomes, Transcriptomes and Proteomes 2. Genome Anatomies

2.1. An Overview of Genome Anatomies

Biologists recognize that the living world comprises two types of organism (Figure 2.1):

1. Eukaryotes, whose cells contain membrane-bound compartments, including a nucleus and organelles such as
mitochondria and, in the case of plant cells, chloroplasts. Eukaryotes include animals, plants, fungi and
protozoa.

2. Prokaryotes, whose cells lack extensive internal compartments. There are two very different groups of
prokaryotes, distinguished from one another by characteristic genetic and biochemical features:
a. the bacteria, which include most of the commonly encountered prokaryotes such as the gram-negatives (e.g.
E. coli), the gram-positives (e.g. Bacillus subtilis), the cyanobacteria (e.g. Anabaena) and many more;

b. the archaea, which are less well-studied, and have mostly been found in extreme environments such as hot
springs, brine pools and anaerobic lake bottoms.

Eukaryotes and prokaryotes have quite different types of genome and we must therefore consider them
separately.

2.1.1. Genomes of eukaryotes

Humans are fairly typical eukaryotes and the human genome is in many respects a good model for eukaryotic
genomes in general. All of the eukaryotic nuclear genomes that have been studied are, like the human version,
divided into two or more linear DNA molecules, each contained in a different chromosome; all eukaryotes also
possess smaller, usually circular, mitochondrial genomes. The only general eukaryotic feature not illustrated by
the human genome is the presence in plants and other photosynthetic organisms of a third genome, located in the
chloroplasts.

Although the basic physical structures of all eukaryotic nuclear genomes are similar, one important feature is
very different in different organisms. This is genome size, the smallest eukaryotic genomes being less than 10
Mb in length, and the largest over 100 000 Mb. As can be seen in Table 2.2, this size range coincides to a
certain extent with the complexity of the organism, the simplest eukaryotes such as fungi having the smallest
genomes, and higher eukaryotes such as vertebrates and flowering plants having the largest ones. This might
appear to make sense as one would expect the complexity of an organism to be related to the number of genes in
its genome - higher eukaryotes need larger genomes to accommodate the extra genes. However, the correlation
is far from precise: if it were, then the nuclear genome of the yeast S. cerevisiae, which at 12 Mb is 0.004 times
the size of the human nuclear genome, would be expected to contain 0.004 × 35 000 genes, which is just 140. In
fact the S. cerevisiae genome contains about 5800 genes.
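The proportionality argument can be written out as a short calculation. The sketch below simply restates the numbers from the paragraph above; the function name is hypothetical.

```python
def expected_gene_count(size_ratio, reference_gene_count):
    """Gene count predicted if gene number scaled directly with genome size."""
    return size_ratio * reference_gene_count


# Figures quoted in the text.
YEAST_TO_HUMAN_SIZE_RATIO = 0.004   # 12 Mb yeast genome relative to the human genome
HUMAN_GENES = 35_000
YEAST_GENES_OBSERVED = 5_800

expected = expected_gene_count(YEAST_TO_HUMAN_SIZE_RATIO, HUMAN_GENES)
excess = YEAST_GENES_OBSERVED / expected

if __name__ == "__main__":
    print(f"expected under strict proportionality: {expected:.0f} genes")
    print(f"observed / expected: about {excess:.0f}-fold")
```

The observed yeast gene count is roughly forty times the proportional prediction, which is the discrepancy the text goes on to explain.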

For many years the lack of precise correlation between the complexity of an organism and the size of its genome
was looked on as a bit of a puzzle, the so-called C-value paradox. In fact the answer is quite simple: space is
saved in the genomes of less complex organisms because the genes are more closely packed together. The S.
cerevisiae genome, the sequence of which was completed in 1996, illustrates this point, as we can see from the
top two parts of Figure 2.2, where the 50-kb segment of the human genome that we looked at in Chapter 1 is
compared with a 50-kb segment of the yeast genome. The yeast genome segment, which comes from
chromosome III (the first eukaryotic chromosome to be sequenced; Oliver et al., 1992), has the following
distinctive features:

• It contains more genes than the human segment. This region of yeast chromosome III contains 26
genes thought to code for proteins and two that code for transfer RNAs (tRNAs), short non-coding
RNA molecules involved in reading the genetic code during protein synthesis (Section 3.2.1).
• Relatively few of the yeast genes are discontinuous. In this segment of chromosome III none of the
genes are discontinuous. In the entire yeast genome there are only 239 introns, compared with over
300 000 in the human genome.
• There are fewer genome-wide repeats. This part of chromosome III contains a single long terminal
repeat (LTR) element, called Ty2, and four truncated LTR elements called delta sequences. These five
genome-wide repeats make up 13.5% of the 50-kb segment, but this figure is not entirely typical of the
yeast genome as a whole. When all 16 yeast chromosomes are considered, the total amount of sequence
taken up by genome-wide repeats is only 3.4% of the total. In humans, the genome-wide repeats make
up 44% of the genome.

The picture that emerges is that the genetic organization of the yeast genome is much more economical than that
of the human version. The genes themselves are more compact, having fewer introns, and the spaces between
the genes are relatively short, with much less space taken up by genome-wide repeats and other non-coding
sequences.

The hypothesis that more complex organisms have less compact genomes holds when other species are
examined. The third part of Figure 2.2 shows a 50-kb segment of the fruit-fly genome (Adams et al., 2000). If
we agree that a fruit fly is more complex than a yeast cell but less complex than a human then we would expect
the organization of the fruit-fly genome to be intermediate between that of yeast and humans. This is what we
see in Figure 2.2C, this 50-kb segment of the fruit-fly genome having 11 genes, more than in the human
segment but fewer than in the yeast sequence. All of these genes are discontinuous, but seven have just one
intron each. The picture is similar when the entire genome sequences of the three organisms are compared
(Table 2.3). The gene density in the fruit-fly genome is intermediate between that of yeast and humans, and the
average fruit-fly gene has many more introns than the average yeast gene but still three times fewer than the
average human gene.

The comparison between the yeast, fruit-fly and human genomes also holds true when we consider the
genome-wide repeats (see Table 2.3). These make up 3.4% of the yeast genome, about 12% of the fruit-fly genome, and
44% of the human genome. It is beginning to become clear that the genome-wide repeats play an intriguing role
in dictating the compactness or otherwise of a genome. This is strikingly illustrated by the maize genome, which
at 5000 Mb is larger than the human genome but still relatively small for a flowering plant. Only a few limited
regions of the maize genome have been sequenced, but some remarkable results have been obtained, revealing a
genome dominated by repetitive elements. Figure 2.2D shows a 50-kb segment of this genome, either side of
one member of a family of genes coding for the alcohol dehydrogenase enzymes (SanMiguel et al., 1996). This
is the only gene in this 50-kb region, although there is a second one, of unknown function, approximately 100
kb beyond the right-hand end of the sequence shown here. Instead of genes, the dominant feature of this genome
segment is the genome-wide repeats. The majority of these are of the LTR element type, which comprise
virtually all of the non-coding part of the segment, and on their own are estimated to make up approximately
50% of the maize genome. It is becoming clear that one or more families of genome-wide repeats have
undergone a massive proliferation in the genomes of certain species. This may provide an explanation for the
most puzzling aspect of the C-value paradox, which is not the general increase in genome size that is seen in
increasingly complex organisms, but the fact that similar organisms can differ greatly in genome size. A good
example is provided by Amoeba dubia which, being a protozoan, might be expected to have a genome of
100–500 Mb, similar to other protozoa such as Tetrahymena pyriformis (see Table 2.2). In fact the Amoeba genome is
over 200 000 Mb. Similarly, we might guess that the genomes of crickets are similar in size to those of other
insects, but crickets have genomes of approximately 2000 Mb, 11 times that of the fruit fly.

2.1.2. Genomes of prokaryotes

Prokaryotic genomes are very different from eukaryotic ones. There is some overlap in size between the largest
prokaryotic and smallest eukaryotic genomes, but on the whole prokaryotic genomes are much smaller. For
example, the E. coli K12 genome is just 4639 kb, two-fifths the size of the yeast genome, and has only 4405
genes. The physical organization of the genome is also different in eukaryotes and prokaryotes. The traditional
view has been that an entire prokaryotic genome is contained in a single circular DNA molecule. As well as this
single 'chromosome', prokaryotes may also have additional genes on independent smaller, circular or linear
DNA molecules called plasmids (Figure 2.3). Genes carried by plasmids are useful, coding for properties such
as antibiotic resistance or the ability to utilize complex compounds such as toluene as a carbon source, but
plasmids appear to be dispensable - a prokaryote can exist quite effectively without them. We now know that
this traditional view of the prokaryotic genome has been biased by the extensive research on E. coli, which has
been accompanied by the mistaken assumption that E. coli is a typical prokaryote. In fact, prokaryotes display a
considerable diversity in genome organization, some having a unipartite genome, like E. coli, but others being
more complex. Borrelia burgdorferi B31, for example, has a linear chromosome of 911 kb, carrying 853 genes,
accompanied by 17 or 18 linear and circular molecules, which together contribute another 533 kb and at least
430 genes (Fraser et al., 1997). Multipartite genomes are now known in many other bacteria and archaea.

In one respect, E. coli is fairly typical of other prokaryotes. After our discussion of eukaryotic gene
organization, it will probably come as no surprise to learn that prokaryotic genomes are even more compact than
those of yeast and other lower eukaryotes. We can see this fact illustrated in Figure 2.2E, which shows a 50-kb
segment of the E. coli K12 genome. It is immediately obvious that there are more genes and less space between
them, with 43 genes taking up 85.9% of the segment. Some genes have virtually no space between them: thrA
and thrB, for example, are separated by a single nucleotide, and thrC begins at the nucleotide immediately
following the last nucleotide of thrB. These three genes are an example of an operon, a group of genes involved
in a single biochemical pathway (in this case, synthesis of the amino acid threonine) and expressed in
conjunction with one another. Operons have been used as model systems for understanding how gene
expression is regulated (Section 9.3.1). In general, prokaryotic genes are shorter than their eukaryotic
counterparts, the average length of a bacterial gene being about two-thirds that of a eukaryotic gene, even after
the introns have been removed from the latter (Zhang, 2000). Bacterial genes appear to be slightly longer than
archaeal ones.
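The gene counts quoted for the 50-kb segments in this section and the previous one can be turned into gene densities for a side-by-side comparison. The sketch below uses only the counts stated in the text (yeast 26 protein-coding plus 2 tRNA genes, fruit fly 11, maize 1, E. coli 43). The human segment is left out because its gene count is not given here, and the variable names are illustrative.

```python
SEGMENT_KB = 50  # every segment discussed in the text is 50 kb long

# Number of genes reported for each 50-kb segment.
genes_in_segment = {
    "maize": 1,
    "fruit fly": 11,
    "yeast": 26 + 2,       # 26 protein-coding genes plus 2 tRNA genes
    "E. coli K12": 43,
}

# Gene density in genes per kb, and the mean kb of sequence per gene.
density = {name: n / SEGMENT_KB for name, n in genes_in_segment.items()}
kb_per_gene = {name: SEGMENT_KB / n for name, n in genes_in_segment.items()}

# Organisms ordered from least to most gene-dense segment.
ranked = sorted(genes_in_segment, key=density.get)

if __name__ == "__main__":
    for name in ranked:
        print(f"{name:12s} {density[name]:.2f} genes/kb, "
              f"{kb_per_gene[name]:.1f} kb per gene")
```

On these figures the ordering runs from maize through fruit fly and yeast to E. coli, matching the trend towards more compact genomes described in the text.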

Two other features of prokaryotic genomes can be deduced from Figure 2.2E. First, there are no introns in the
genes present in this segment of the E. coli genome. In fact E. coli has no discontinuous genes at all, and it is
generally believed that this type of gene structure is virtually absent in prokaryotes, the few exceptions
occurring mainly among the archaea. The second feature is the infrequency of repetitive sequences. Most
prokaryotic genomes do not have anything equivalent to the high-copy-number genome-wide repeat families
found in eukaryotic genomes. They do, however, possess certain sequences that might be repeated elsewhere in
the genome, examples being the insertion sequences IS1 and IS186 that can be seen in the 50-kb segment shown
in Figure 2.2E. These are further examples of transposable elements, sequences that have the ability to move
around the genome and, in the case of insertion elements, to transfer from one organism to another, even
sometimes between two different species (see page 64). The positions of the IS1 and IS186 elements shown in
Figure 2.2E refer only to the particular E. coli isolate from which this sequence was obtained: if a different
isolate is examined then the IS sequences could well be in different positions or might be entirely absent from
the genome. Most other prokaryotic genomes have very few repeat sequences - there are virtually none in the
1.64 Mb genome of Campylobacter jejuni NCTC11168 (Parkhill et al., 2000b) - but there are exceptions,
notably the meningitis bacterium Neisseria meningitidis Z2491, which has over 3700 copies of 15 different
types of repeat sequence, collectively making up almost 11% of the 2.18 Mb genome (Parkhill et al., 2000a).

Chromatin organisation

The human genome consists of approximately 3.0 × 10⁹ nucleotide pairs, which would have a length of more than 1
meter if fully stretched, yet it is packaged within the nucleus of each individual cell, which
measures approximately 5 µm in diameter. The human genome is made up of 23 pairs of chromosomes
(diploid number 46), comprising 22 pairs of autosomes and a pair of sex chromosomes (XX in females, XY in males).
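The stretched length quoted above can be checked with a short calculation. The sketch assumes a rise of 0.34 nm per base pair, the standard figure for B-form DNA, which is not stated in the text; the variable names are illustrative.

```python
GENOME_BP = 3.0e9            # nucleotide pairs, as quoted in the text
RISE_PER_BP_NM = 0.34        # assumption: axial rise per base pair in B-form DNA
NUCLEUS_DIAMETER_UM = 5.0    # approximate nuclear diameter quoted in the text

# Contour length of the fully stretched DNA, in metres.
length_m = GENOME_BP * RISE_PER_BP_NM * 1e-9

# How many times longer the DNA is than the nucleus is wide.
length_to_diameter = length_m / (NUCLEUS_DIAMETER_UM * 1e-6)

if __name__ == "__main__":
    print(f"stretched length: {length_m:.2f} m")
    print(f"length / nuclear diameter: about {length_to_diameter:,.0f}")
```

Under this assumption the DNA is a little over 1 m long, some two hundred thousand times the diameter of the nucleus that contains it.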

DNA is packed inside the nucleus in association with a number of proteins, around which it is extensively coiled and
folded, forming nucleosomes. Each nucleosome is built on a histone octamer containing two copies each of histones H2A,
H2B, H3 and H4. Histones contain large amounts of positively charged amino acids, mainly lysine and
arginine, that bind electrostatically to the negatively charged phosphate groups of the DNA backbone. The
DNA makes about 1.65 left-handed turns around each histone octamer, covering a total of 146 bp of
double-stranded DNA. The next 50 bp link one nucleosome to another and also interact with another histone (H1),
forming a thicker fibre, known as the solenoid, with six nucleosomes per turn. Besides histones there are other
proteins that make up what is known as the nuclear scaffold. One of these proteins is the enzyme
topoisomerase type II. Different solenoids in turn form chromatin fibres of approximately
200 nm in diameter and eventually make up the chromatids, which are 600-700 nm in diameter. Histones do
not dissociate from DNA during replication in S phase of the cell cycle, but new histones assemble on the
lagging strand.
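As a rough illustration of the scale involved, the figures above give an estimate of how many nucleosomes are needed to package the genome. The sketch assumes every nucleosome repeat is exactly 146 bp of core DNA plus 50 bp of linker, and uses the 3.0 × 10⁹ bp genome size quoted earlier in this section; real repeat lengths vary, so the result is an order-of-magnitude figure only.

```python
GENOME_BP = 3.0e9             # genome size quoted earlier in the section
CORE_DNA_BP = 146             # DNA wrapped around one histone octamer
LINKER_DNA_BP = 50            # DNA linking one nucleosome to the next
NUCLEOSOMES_PER_TURN = 6      # nucleosomes per turn of the solenoid

repeat_bp = CORE_DNA_BP + LINKER_DNA_BP

# Estimated number of nucleosomes and of solenoid turns for one genome copy.
nucleosomes = GENOME_BP / repeat_bp
solenoid_turns = nucleosomes / NUCLEOSOMES_PER_TURN

if __name__ == "__main__":
    print(f"repeat length: {repeat_bp} bp")
    print(f"nucleosomes: about {nucleosomes / 1e6:.1f} million")
    print(f"solenoid turns: about {solenoid_turns / 1e6:.1f} million")
```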

Histones are encoded by clusters of genes that are repeated many times in the genome and that are highly
conserved across species. Histone genes also lack introns. Histone proteins are replaced by
protamines in sperm heads.

Figure 1. Structure of nucleosome

Satellite DNA

Re-association kinetics and sedimentation equilibrium centrifugation showed that when eukaryotic DNA was
sheared and analysed, a main band of DNA and one or two satellite peaks were observed. Re-association kinetics
also showed that the satellites observed in eukaryotic genomes result from the re-association of
repetitive DNA sequences, of which there are two main types: moderately repetitive and highly repetitive.
Only 10% of the human genome is thought to behave as single copy DNA.

Highly repetitive sequences are short sequences that are repeated a large number of times, usually occurring
as tandem repeats. Satellite DNA is found in specific areas of the chromosomes known as
heterochromatin. Such areas include those around the centromere. The centromere is the region of the
chromosome to which the spindle fibres attach during mitosis and meiosis, helping the chromosome to
move to one of the poles during anaphase. This region is known as the CEN and in yeasts it consists of
approximately 225 bp divided into three regions. Region CEN III is the largest region and is 95% AT rich and is
thought to be the most important region for centromere function since the sequences at this region are
important to bind spindle fibres. In humans, the alphoid family of repetitive sequences is found at the
centromere; its repeat units are about 170 bp in length and are present in tandem arrays of up to 1 million base pairs.

Another very important structure of the chromosome is the telomere, which also consists of repetitive
sequences of DNA. Telomeres are found at the tips of linear chromosomes. There are telomeric sequences that
consist of short tandem repeats while there are telomere associated sequences found adjacent to and within
the telomere. With each cycle of DNA replication the telomeres become shorter, so they serve as an
internal biological clock for the cell and thus reflect its age. In germ cells (but not in somatic cells)
telomeres are protected by the presence of an RNA-containing enzyme known as telomerase. In immortalised
human cancer cells, the activation of telomerase is a very important step in the transition to malignancy.

Repetitive DNA

Moderately repetitive DNA can be found either interspersed or in tandem across the genome. There are
two main types of interspersed repetitive elements, short and long. The short interspersed elements, or
SINEs, are less than 500 bp long and can be found as many as 500,000 times in the genome. An example of a
SINE is the Alu element found in mammals. The long interspersed elements, or LINEs, are about 6400 bp long
and can be found as many as 40,000 times. Moderately repetitive DNA can also be clustered, and some functional
genes fall within this category, including those coding for 5.8S, 18S and 28S rRNA in humans, which are clustered
on the p arms of chromosomes 13, 14, 15, 21, and 22. There are also tandem repeats such as the variable
number tandem repeats (VNTRs), which consist of repeats of 15 to 100 bp and have been very useful for forensic
work. Another type of tandem repeat is the short tandem repeat (STR), which can be a di-, tri-, tetra- or
even pentanucleotide repeat. These repeats are also used for genetic identification in forensic DNA analysis.
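The copy numbers and lengths quoted for interspersed repeats give a rough upper bound on how much of the genome they occupy. The sketch below multiplies the maximum figures from the paragraph above and compares them with the 3.0 × 10⁹ bp genome size given in the chromatin section. Since the text states only upper limits, the results are ceilings rather than measurements.

```python
GENOME_BP = 3.0e9          # genome size quoted in the chromatin section

SINE_MAX_LENGTH_BP = 500   # SINEs are described as shorter than this
SINE_MAX_COPIES = 500_000
LINE_LENGTH_BP = 6_400
LINE_MAX_COPIES = 40_000

# Upper bounds on total sequence contributed by each class of element.
sine_bp = SINE_MAX_LENGTH_BP * SINE_MAX_COPIES
line_bp = LINE_LENGTH_BP * LINE_MAX_COPIES

sine_fraction = sine_bp / GENOME_BP
line_fraction = line_bp / GENOME_BP

if __name__ == "__main__":
    print(f"SINEs: up to {sine_bp / 1e6:.0f} Mb ({sine_fraction:.1%} of the genome)")
    print(f"LINEs: up to {line_bp / 1e6:.0f} Mb ({line_fraction:.1%} of the genome)")
```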
 

Structure of the Eukaryotic gene

It is estimated that there are about 20,000 to 50,000 genes in the human genome that code for proteins, which
is less than twice the number found in much simpler organisms. The structure of the human protein-coding
gene is quite complex. Eukaryotic genes vary greatly in size, ranging from less than 1 kb
(histones) to as much as 2500 kb for the dystrophin gene. A typical gene consists of coding and non-coding
sequences known as exons and introns, respectively. The exon (coding part) is the sequence that is retained
in the mature mRNA and eventually translated into protein. An exon is usually small and often codes for a
single protein domain, averaging 150 nucleotides encoding about 50 amino acids. Each amino acid is encoded
by a triplet of nucleotides known as a codon, and most amino acids are encoded by more than one codon. On the other
hand, the non-coding intervening introns are relatively large and can be as long as 20,000 bp. The sequence
within introns is largely random, but it can contain regulatory sequences that affect the splicing mechanism. Introns
are transcribed into the primary RNA but are eventually removed (spliced out) and so do not form part
of the mature mRNA molecule. The number of introns and exons varies greatly between genes: a gene can
consist of just two or three exons, or of more than 20.

Figure 2. Typical Structure of a Eukaryotic Gene

Besides the introns and coding exons, genes also have other regulatory elements that mainly affect how
the gene itself is expressed and regulated. The 5' and 3' untranslated regions usually consist of
sequences that serve this purpose. The 5' region ahead of the transcriptional start site usually makes up what
is known as the promoter region. The promoter region contains sequences such as the TATA box, where RNA
polymerase binds to initiate transcription. Further upstream there is the CCAAT box, which also plays a part in
the regulation of transcription. Usually there are a number of other consensus sequences to which
proteins known as transcription factors bind to control transcription. A number of enhancers and/or silencers,
which can be found close to or sometimes quite distant from the gene itself, are involved in the regulation of
gene expression. Sequences at the 3' end of the gene also act as regulators and terminators of transcription and
direct polyadenylation of the mRNA molecules.

The Genetic Code

The sequence of nucleotides found in exons codes for the sequence of amino acids synthesised during
translation, forming different protein domains. It was shown that a triplet of bases specifies the ribosomal
translation of a given amino acid. All amino acids are coded by more than one codon (degenerate code) with
the exceptions of tryptophan and methionine. In each codon the last base has reduced specificity, so for many
amino acids four codons differing only in the last base encode the same amino acid. This means that many random
mutations at this base do not alter the amino acid sequence. The code also has three codons
that are termination signals and a start codon, AUG, which codes for methionine, so that the
first amino acid incorporated into a protein is normally methionine. This code is shared by all living organisms although some
variations exist in the mitochondrial genome.
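The degeneracy described above can be checked directly against the standard genetic code. The sketch below builds the standard codon table from the conventional 64-character amino-acid string (bases ordered U, C, A, G, with the third base varying fastest) and counts how many codons specify each amino acid. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each RNA codon (e.g. "AUG") to its one-letter amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons specifying each amino acid (or stop).
degeneracy = Counter(CODON_TABLE.values())

# Amino acids with exactly one codon.
single_codon = sorted(aa for aa, n in degeneracy.items() if n == 1 and aa != "*")

# Termination codons.
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# First-two-base combinations where the third base makes no difference.
fourfold_boxes = [
    a + b
    for a, b in product(BASES, repeat=2)
    if len({CODON_TABLE[a + b + c] for c in BASES}) == 1
]

if __name__ == "__main__":
    print("single-codon amino acids:", single_codon)
    print("stop codons:", stop_codons)
    print("fully degenerate third-base boxes:", len(fourfold_boxes), "of 16")
```

Only methionine (M) and tryptophan (W) have a single codon, and in half of the sixteen first-two-base combinations the third base does not affect the amino acid.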

Structure of prokaryotic genes

Transcription start signal (promoter)

The sequence in DNA which defines the start of a gene is called a PROMOTER. There are two regions which
are important in the definition of a prokaryotic promoter (the -35 region and the -10 region, also known as the
Pribnow box).

• The -35 region. This sequence is centred about 35 bp before the start (UPSTREAM) of a bacterial gene.
It functions in the initial recognition of a gene by RNA polymerase. The CONSENSUS sequence is
TTGACA.
• The Pribnow box (-10 region). This has the consensus sequence TATAAT and is centred about 10 bp
before the start of a bacterial gene, i.e. there are typically about 17 bp between the -35 region and the Pribnow box.
Since three hydrogen bonds hold a G-C base pair together while only two are present between A and T,
the strands of an AT rich region are separated more easily than those of a GC rich region. It is thought that
the enzyme DNA dependent RNA polymerase initially makes contact with the -35 region and then
moves along the DNA until it finds an AT rich region (the Pribnow box). At this point the enzyme
separates the two DNA strands and the RNA polymerase initiates RNA synthesis approximately 7 bp
further along the DNA (DOWNSTREAM of the TATAAT site). It then TRANSCRIBES the DNA
using the TEMPLATE strand to produce an RNA molecule which has, by definition, the sequence of the SENSE strand.
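A promoter of this kind can be searched for with a very simple pattern scan. The sketch below is a toy illustration, not a validated promoter-prediction method. It assumes the commonly quoted hexamers TTGACA (-35) and TATAAT (-10), allows a configurable number of mismatches, and assumes a spacer of 15 to 19 bp between the two elements. The function names, the thresholds and the test sequence are all invented for the example.

```python
MINUS_35 = "TTGACA"   # assumed -35 consensus hexamer
MINUS_10 = "TATAAT"   # Pribnow box consensus


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, max_mismatch=1, spacers=range(15, 20)):
    """Return (start of -35 element, start of -10 element) index pairs.

    A hit needs both hexamers to match their consensus with at most
    `max_mismatch` differences, separated by an allowed spacer length.
    """
    hits = []
    for i in range(len(seq) - 5):
        if mismatches(seq[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            if mismatches(seq[j:j + 6], MINUS_10) <= max_mismatch:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Synthetic sequence: -35 element, 17-bp spacer, -10 element.
    demo = "GGC" + MINUS_35 + "A" * 17 + MINUS_10 + "GCGC"
    print(find_promoters(demo))
```

Real promoters deviate from the consensus in ways a fixed mismatch threshold captures poorly, so hits from a scan like this are candidates at best.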

Coding sequence organisation

The part of the DNA which is to be copied into RNA is bracketed by the start (promoter) and stop (terminator)
signals. Nearly all sequences between these points are destined to become RNA. Some of these stretches of
DNA code for RNA which has a structural role (ribosomal RNA; rRNA) and some code for RNA which in turn codes
for protein (messenger RNA; mRNA). mRNA is an ephemeral copy of the information present in DNA. The
DNA remains unchanged while protein expression can be altered by degrading one set of RNA molecules and
replacing them with another which codes for different proteins. Prokaryotes have very well organised genomes,
with the result that genes involved in the metabolism of related compounds, e.g. those for the enzymes involved in the
metabolism of lactose, are often positioned adjacent to one another on the DNA between a promoter and a terminator.
The result of this organisation is that one single binding event by the enzyme RNA polymerase is sufficient to
cause the synthesis of all relevant proteins, i.e. the genes are co-ordinately regulated. This makes the genes easier
to regulate. Such an organised set of genes is called an OPERON.

Transcription stop signal (terminator)

At the end of the coding sequences a signal exists (the terminator) which says 'stop making RNA here'. It is
composed of a sequence which is rich in the bases G and C and which can form a HAIRPIN LOOP. This
structure is strongly hydrogen bonded (G-C base pairs are held together by three hydrogen bonds), causing
the RNA polymerase to slow down (PAUSE). In some cases the G-C rich hairpin is followed by a stretch of A's
in the template strand. If the terminator lacks the stretch of A's then a protein (called RHO) is required to help
with termination. If the stretch of A's is present, termination is RHO INDEPENDENT. Prokaryotes lack a
nucleus and so as soon as an RNA molecule coding for protein has been synthesised ribosomes in the cytoplasm
are able to bind to it and to start translating it into a protein.
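The two features of a RHO-independent terminator described above, a G-C rich hairpin followed by a run of U's in the RNA (A's in the template strand), can be sketched as a simple pattern check. The code below is a deliberately simplified illustration. It assumes a perfectly base-paired stem of fixed length, a fixed loop length and an arbitrary GC threshold, none of which come from the text, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def looks_rho_independent(rna, stem=6, loop=4, u_run=4, min_gc=0.6):
    """True if the RNA contains a GC-rich hairpin followed by a run of U's.

    The hairpin is modelled as a stem of `stem` perfectly paired bases
    around a loop of `loop` bases; all thresholds are illustrative.
    """
    total = 2 * stem + loop + u_run
    for i in range(len(rna) - total + 1):
        left = rna[i:i + stem]
        right = rna[i + stem + loop:i + 2 * stem + loop]
        tail = rna[i + 2 * stem + loop:i + total]
        gc = (left.count("G") + left.count("C")) / stem
        if right == reverse_complement(left) and gc >= min_gc and tail == "U" * u_run:
            return True
    return False


if __name__ == "__main__":
    # Synthetic transcript end: stem, loop, complementary stem, U-run.
    demo = "AA" + "GCCGCC" + "UUCG" + "GGCGGC" + "UUUUUU" + "AA"
    print(looks_rho_independent(demo))
```

Natural terminators have imperfect stems and variable loops, so a practical tool would score candidate structures rather than demand an exact match.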

A 'typical' bacterial operon


Nucleotides (colored in blue) are sequentially added to form the complementary strand of the single-stranded
PSQ template, to which a sequencing primer has been annealed (top left of diagram). This is carried out in the
presence of polymerase, sulfurylase, luciferase and apyrase enzymes. One molecule of pyrophosphate (PPi) is
released for every nucleotide incorporated into the growing strand by the DNA polymerase, and is converted to
ATP by sulfurylase. Visible light is produced from luciferin in a luciferase-catalyzed reaction that utilizes the
ATP produced above, and unincorporated nucleotides are degraded by apyrase between each cycle. A pyrogram
(bottom panel) displays peaks representing the amount of generated light, which is proportional to the amount of
incorporated nucleotides, at each nucleotide dispensation. This pyrogram shows the genotyping of a [G>A] SNP
(highlighted with a box) in a heterozygous DNA sample.

Model organism

Escherichia coli is a model gram-negative prokaryotic organism

Drosophila melanogaster, one of the most famous subjects for experiments

A model organism is a species that is extensively studied to understand particular biological phenomena, with
the expectation that discoveries made in the model organism will provide insight into the workings of other
organisms.[1] In particular, model organisms are widely used to explore potential causes and treatments for
human disease when human experimentation would be unfeasible or unethical. This strategy is made possible by
the common descent of all living organisms, and the conservation of metabolic and developmental pathways and
genetic material over the course of evolution.[2] Studying model organisms can be informative, but care must be
taken when generalizing from one organism to another.

Selecting a model organism

Models are those organisms with a wealth of biological data that make them attractive to study as examples for
other species – including humans – that are more difficult to study directly. These can be classed as genetic
models (with short generation times, such as the fruitfly and nematode worm), experimental models, and
genomic models, with a pivotal position in the evolutionary tree.[3] Historically, model organisms include a
handful of species with extensive genomic research data, such as the NIH model organisms.[4]

Often, model organisms are chosen on the basis that they are amenable to experimental manipulation. This
usually will include characteristics such as short life-cycle, techniques for genetic manipulation (inbred strains,
stem cell lines, and methods of transformation) and non-specialist living requirements. Sometimes, the genome
arrangement facilitates the sequencing of the model organism's genome, for example, by being very compact or
having a low proportion of junk DNA (e.g. yeast, Arabidopsis, or pufferfish).

When researchers look for an organism to use in their studies, they look for several traits. Among these are size,
generation time, accessibility, manipulation, genetics, conservation of mechanisms, and potential economic
benefit. As comparative molecular biology has become more common, some researchers have sought model
organisms from a wider assortment of lineages on the tree of life.

Use of model organisms

There are many model organisms. One of the first model systems for molecular biology was the bacterium
Escherichia coli, a common constituent of the human digestive system. Several of the bacterial viruses
(bacteriophage) that infect E. coli also have been very useful for the study of gene structure and gene regulation
(e.g. phages Lambda and T4). However, bacteriophages are not organisms because they lack metabolism and
depend on functions of the host cells for propagation.

In eukaryotes, several yeasts, particularly Saccharomyces cerevisiae ("baker's" or "budding" yeast), have been
widely used in genetics and cell biology, largely because they are quick and easy to grow. The cell cycle in a
simple yeast is very similar to the cell cycle in humans and is regulated by homologous proteins. The fruit fly
Drosophila melanogaster is studied, again, because it is easy to grow for an animal, has various visible
congenital traits and has polytene (giant) chromosomes in its salivary glands that can be examined under a light
microscope. The roundworm Caenorhabditis elegans is studied because it has well-defined development
patterns involving fixed numbers of cells, and it can be rapidly assayed for abnormalities.

Electron microphotograph of tobacco mosaic virus (TMV) particles

Yeast in biological research

A model organism

When researchers look for an organism to use in their studies, they look for several traits. Among these are size,
generation time, accessibility, manipulation, genetics, conservation of mechanisms, and potential economic
benefit. The yeast species S. pombe and S. cerevisiae are both well studied; these two species diverged
approximately 300 to 600 million years before present, and are significant tools in the study of DNA damage
and repair mechanisms.[1]

The alpha-factor of S. cerevisiae has been compared to the lipophilic peptide created by the fungus Tremella
mesenterica.[2]

Saccharomyces cerevisiae has developed as a model organism because it scores favorably on a number of these
criteria.

• As a single-celled organism, S. cerevisiae is small with a short generation time (doubling time 1.5–2
hours at 30 °C) and can be easily cultured. These are all positive characteristics in that they allow for
the swift production and maintenance of multiple specimen lines at low cost.
• S. cerevisiae can be transformed, allowing for either the addition of new genes or deletion through
homologous recombination. Furthermore, the ability to grow S. cerevisiae as a haploid simplifies the
creation of gene knockout strains.
• As a eukaryote, S. cerevisiae shares the complex internal cell structure of plants and animals without
the high percentage of non-coding DNA that can confound research in higher eukaryotes.
• S. cerevisiae research had a strong economic driver, at least initially, as a result of its established use in
industry (e.g. beer, bread and wine fermentation).
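The practical effect of the short doubling time quoted in the list above can be illustrated with a simple exponential growth calculation. The sketch assumes unrestricted exponential growth from a single cell, which only holds while nutrients are not limiting; the function name is hypothetical.

```python
def cells_after(hours, doubling_time_h, start_cells=1):
    """Cell count after `hours` of unrestricted exponential growth."""
    return start_cells * 2 ** (hours / doubling_time_h)


if __name__ == "__main__":
    # Doubling times of 1.5 and 2 hours, as quoted for S. cerevisiae at 30 °C.
    for doubling_time in (1.5, 2.0):
        n = cells_after(24, doubling_time)
        print(f"doubling time {doubling_time} h: {n:,.0f} cells after 24 h")
```

A single cell yields 4,096 to 65,536 descendants within a day under these idealised conditions, which is why large cultures can be raised quickly and cheaply.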

Genome sequencing

S. cerevisiae was the first eukaryotic genome that was completely sequenced.[3] The genome sequence was
released in the public domain on April 24, 1996. Since then, regular updates have been maintained at the
Saccharomyces Genome Database (SGD). This database is a highly annotated and cross-referenced database for
yeast researchers. Another important S. cerevisiae database is maintained by the Munich Information Center for
Protein Sequences (MIPS). The genome is composed of about 12,156,677 base pairs and 6,275 genes,
compactly organised on 16 chromosomes. Only about 5,800 of these are believed to be true functional genes. It
is estimated that yeast shares about 23% of its genome with that of humans.
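
As a rough illustration of how compact this genome is, the figures quoted above can be turned into an average spacing per gene. The Python sketch below uses only the numbers given in this section; the variable and function names are illustrative, and the result is a plain average that ignores gene length and intergenic structure.

```python
# Figures quoted in the text above.
GENOME_BP = 12_156_677
ANNOTATED_GENES = 6_275
LIKELY_FUNCTIONAL_GENES = 5_800
CHROMOSOMES = 16


def bp_per_gene(genome_bp: int, genes: int) -> float:
    """Average stretch of genome per gene, in base pairs."""
    return genome_bp / genes


all_genes_spacing = bp_per_gene(GENOME_BP, ANNOTATED_GENES)
functional_spacing = bp_per_gene(GENOME_BP, LIKELY_FUNCTIONAL_GENES)
mean_chromosome_bp = GENOME_BP / CHROMOSOMES

print(f"{all_genes_spacing:.0f} bp per annotated gene")
print(f"{functional_spacing:.0f} bp per likely functional gene")
print(f"{mean_chromosome_bp:.0f} bp per chromosome on average")
```

On these figures there is roughly one gene for every 2,000 base pairs.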

Other tools in yeast research

The availability of the S. cerevisiae genome sequence and the complete set of deletion mutants has further
enhanced the power of S. cerevisiae as a model for understanding the regulation of eukaryotic cells. A project
underway to analyze the genetic interactions of all double deletion mutants through Synthetic genetic array
analysis will take this research one step further.

Yeast scientists have developed approaches that can be applied in many different fields of biological
and medical science. These include the yeast two-hybrid system for studying protein interactions, and tetrad analysis.

Laboratory uses

C. elegans is studied as a model organism for a variety of reasons. Strains are cheap to breed and can be frozen.
When subsequently thawed they remain viable, allowing long-term storage. Because the complete cell lineage
of the species has been determined, C. elegans has proven especially useful for studying cellular differentiation.

From a research perspective, C. elegans has the advantage of being a multicellular eukaryotic organism that is
simple enough to be studied in great detail. In addition, it is transparent, facilitating the study of developmental
processes in the intact organism. The developmental fate of every single somatic cell (959 in the adult
hermaphrodite; 1031 in the adult male) has been mapped out. These patterns of cell lineage are largely invariant
between individuals, in contrast to mammals, where cell development from the embryo depends more heavily
on cellular cues. In both sexes, a large number of additional cells (131 in the hermaphrodite, most of
which would otherwise become neurons) are eliminated by programmed cell death (apoptosis).

In addition, C. elegans is one of the simplest organisms with a nervous system. In the hermaphrodite, this
comprises 302 neurons whose pattern of connectivity has been completely mapped out, and shown to be a
small-world network.[7] Research has explored the neural mechanisms responsible for several of the more
interesting behaviors shown by C. elegans, including chemotaxis, thermotaxis, mechanotransduction, and male
mating behavior.

A useful feature of C. elegans is that it is relatively straightforward to disrupt the function of specific genes by
RNA interference (RNAi). Silencing the function of a gene in this way can sometimes allow a researcher to
infer what the function of that gene may be. The nematode can either be soaked in (or injected with) a solution
of double stranded RNA, the sequence of which is complementary to the sequence of the gene that the
researcher wishes to disable. Alternatively, worms can be fed on genetically transformed bacteria which express
the double stranded RNA of interest.
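
The notion of a complementary sequence can be made concrete with a few lines of Python. This is a minimal sketch: the ten-base target fragment is hypothetical and far shorter than any real RNAi trigger, and the function names are assumptions made for illustration.

```python
# Pairing rules for building an RNA strand against a DNA strand.
DNA_TO_COMPLEMENTARY_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def antisense_rna(coding_dna: str) -> str:
    """Return the RNA strand complementary to a DNA coding sequence, written 5' to 3'."""
    return "".join(DNA_TO_COMPLEMENTARY_RNA[base] for base in reversed(coding_dna.upper()))


def sense_rna(coding_dna: str) -> str:
    """Return the RNA copy of the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


target_fragment = "ATGGCTAAGC"  # hypothetical fragment of the gene to be silenced
print("sense strand:    ", sense_rna(target_fragment))
print("antisense strand:", antisense_rna(target_fragment))
```

The two printed strands together represent the double-stranded RNA; the antisense strand is the one that pairs with the messenger RNA of the targeted gene.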

C. elegans has also been useful in the study of meiosis. As sperm and egg nuclei move down the length of the
gonad, they undergo a temporal progression through meiotic events. This progression means that every nucleus
at a given position in the gonad will be at roughly the same step in meiosis, eliminating the difficulties of
heterogeneous populations of cells.

The organism has also been identified as a model for nicotine dependence as it has been found to experience the
same symptoms humans experience when they quit smoking.[8]

As for most model organisms, there is a dedicated online database for the species that is actively curated by
scientists working in this field. The WormBase database attempts to collate all published information on C.
elegans and other related nematodes. A reward of $5000 has been advertised on their website, for the finder of a
new species of closely related nematode.[9] Such a discovery would broaden research opportunities with the
worm.[10]

Genome

C. elegans was the first multicellular organism to have its genome completely sequenced. The finished genome
sequence was published in 1998,[11] although a number of small gaps were present (the last gap was finished by
October 2002). The C. elegans genome sequence is approximately 100 million base pairs long and contains
approximately 18,400 genes. The vast majority of these genes encode proteins but there are likely to be as
many as 1,000 RNA genes. Scientific curators continue to appraise the set of known genes, such that new gene
predictions continue to be added and incorrect ones modified or removed.

In 2003, the genome sequence of the related nematode C. briggsae was also determined, allowing researchers to
study the comparative genomics of these two organisms.[12] Work is now ongoing to determine the genome
sequences of more nematodes from the same genus such as C. remanei,[13] C. japonica[14] and C. brenneri.[15]
These newer genome sequences are being determined by using the whole genome shotgun technique, which
means that the resulting genome sequences are likely to be less complete or accurate than that of C. elegans (which was
sequenced using the 'hierarchical' or clone-by-clone approach).

The official version of the C. elegans genome sequence continues to change as and when new evidence reveals
errors in the original sequencing (DNA sequencing is not an error-free process). Most changes are minor, adding
or removing only a few base pairs (bp) of DNA. For example, the WS169 release of WormBase (December 2006) lists a
net gain of 6 bp to the genome sequence.[16] Occasionally more extensive changes are made, e.g. the WS159
release of May 2006 added over 300 bp to the sequence.[1

Two-hybrid screening

Overview of two-hybrid assay, checking for interactions between two proteins, called here Bait and Prey.
A. The Gal4 transcription factor gene produces a two-domain protein (BD and AD), which is essential for transcription
of the reporter gene (LacZ).
B, C. Two fusion proteins are prepared: Gal4BD+Bait and Gal4AD+Prey. Neither is usually sufficient to
initiate transcription of the reporter gene alone.
D. When both fusion proteins are produced and the Bait part of the first interacts with the Prey part of the second,
transcription of the reporter gene occurs.

Two-hybrid screening (also known as the yeast two-hybrid system or Y2H) is a molecular biology technique
used to discover protein-protein interactions[1] and protein-DNA interactions[2][3] by testing for physical
interactions (such as binding) between two proteins or a single protein and a DNA molecule, respectively.

The premise behind the test is the activation of downstream reporter gene(s) by the binding of a transcription
factor onto an upstream activating sequence (UAS). For the purposes of two-hybrid screening, the transcription
factor is split into two separate fragments, called the binding domain (BD) and activating domain (AD). The BD
is the domain responsible for binding to the UAS and the AD is the domain responsible for activation of
transcription.[1][2]

History

Pioneered by Fields and Song in 1989, the technique was originally designed to detect protein-protein
interactions using the GAL4 transcriptional activator of the yeast Saccharomyces cerevisiae. The GAL4 protein
activated transcription of a gene involved in galactose utilization, which formed the basis of selection.[4] Since
then, the same principle has been adapted to describe many alternative methods, including some that detect
protein-DNA or DNA-DNA interactions and some that use Escherichia coli instead of yeast.[3]

Basic premise

The key to the two-hybrid screen is that in most eukaryotic transcription factors, the activating and binding
domains are modular and can function in close proximity to each other without direct binding.[5] This means that
even though the transcription factor is split into two fragments, it can still activate transcription when the two
fragments are indirectly connected.

The most common screening approach is the yeast two-hybrid assay.[6] This system often utilizes a genetically
engineered strain of yeast in which the biosynthesis of certain nutrients (usually amino acids or nucleic acids) is
lacking. When grown on media that lack these nutrients, the yeast fail to survive. This mutant yeast strain can
be made to incorporate foreign DNA in the form of plasmids. In yeast two-hybrid screening, separate bait and
prey plasmids are simultaneously introduced into the mutant yeast strain.

Plasmids are engineered to produce a protein product in which the DNA-binding domain (BD) fragment is fused
onto a protein while another plasmid is engineered to produce a protein product in which the activation domain
(AD) fragment is fused onto another protein. The protein fused to the BD may be referred to as the bait protein
and is typically a known protein that the investigator is using to identify new binding partners. The protein fused
to the AD may be referred to as the prey protein and can be either a single known protein or a library of known
or unknown proteins. In this context, a library may consist of a collection of protein-encoding sequences that
represent all the proteins expressed in a particular organism or tissue or may be generated by synthesising
random DNA sequences.[3] Regardless of the source, they are subsequently incorporated into the protein-encoding
sequence of a plasmid, which is then transfected into the cells chosen for the screening method.[3] This
technique, when using a library, assumes that each cell is transfected with no more than a single plasmid and
that, therefore, each cell ultimately expresses no more than a single member of the protein library.

If the bait and prey proteins interact (i.e. bind), then the AD and BD of the transcription factor are indirectly
connected, bringing the AD in proximity to the transcription start site and transcription of reporter gene(s) can
occur. If the two proteins do not interact, there is no transcription of the reporter gene. In this way, a successful
interaction between the fused proteins is linked to a change in the cell phenotype.[1]
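
The logic of the assay can be summarised as a small truth function. The Python sketch below is a deliberately simplified model, not a description of any real software: the class and function names are illustrative, the proteins X, Y and Z are hypothetical, and interactions are treated as all-or-nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fusion:
    """One hybrid protein: a transcription-factor fragment fused to a partner."""
    domain: str   # "BD" (binding domain) or "AD" (activating domain)
    partner: str  # name of the fused bait or prey protein


def reporter_transcribed(bait: Fusion, prey: Fusion, interactions: set) -> bool:
    """Return True when the bait-prey contact brings the AD to the BD bound at the UAS."""
    if bait.domain != "BD" or prey.domain != "AD":
        return False  # both fragments of the transcription factor are needed
    return frozenset((bait.partner, prey.partner)) in interactions


# Hypothetical interaction data: only proteins X and Y bind one another.
known_interactions = {frozenset(("X", "Y"))}

bait = Fusion("BD", "X")
print(reporter_transcribed(bait, Fusion("AD", "Y"), known_interactions))  # interacting pair
print(reporter_transcribed(bait, Fusion("AD", "Z"), known_interactions))  # non-interacting pair
```

In this model the reporter is switched on only when a BD fusion and an AD fusion are both present and their fused partners bind each other.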

The challenge of separating those cells which express proteins which happen to interact with their counterpart
fusion proteins, from those which do not, is addressed in the following section.

Reporter and selection genes

In order to link the interaction to a change in observable phenotype, a reporter gene is provided with the
upstream activation sequence (UAS) to which the binding domain binds, resulting in gene expression in
successful cases of interaction. Since its inception in 1989, the technique has been combined with a number of
different reporter genes which can allow selection through a simple colour change or through automatic death of
cells in which the interaction does or does not take place.[1]

The lacZ reporter gene allows the highlighting of cells in which the UAS-BD-AD interaction is taking place.[1]
β-galactosidase, the protein product of the lacZ gene, produces a blue colouration through the metabolism of
X-gal (5-bromo-4-chloro-3-indolyl-β-D-galactoside), which allows the experimenter to manually choose the
individuals which host proteins displaying the required [level of] interaction.[2]

Manual differentiation of cells according to colour may be acceptable for small numbers of cells, as when
investigating a small number of proteins, but when a large library of proteins must be screened, some
automation is necessary.[2] This automation is provided by a number of genes and gene systems which either
cause the death of cells that aren't hosting interactions or vice versa, leaving only the cells expressing the
proteins of interest.[2]

For more autonomy, alternative reporter genes exist that enable automatic deletion of individuals that fail to
express the reporter or vice versa.

Some projects utilise a dual reporter system in which two reporter genes are used in order to help substantiate
the interaction.[1][3] When developing their HIS3 selection system, Joung et al. (2000) included an aadA gene,
downstream of the HIS3 gene such that it may be expressed at the same time as the HIS3 gene. The aadA gene
confers spectinomycin resistance and although insufficient as a selection gene in itself, was used to maintain
selection pressure and provide greater stringency.

Positive selection

In positive selection, the individuals that host a successful interaction are able to live on the provided media,
while those failing to host this interaction die. This can be achieved by combining a medium lacking an
essential nutrient with a strain of an organism which depends on expression of the reporter gene in order to
produce this nutrient. Examples include the HIS3 gene, encoding a protein required for histidine synthesis, the
LEU2 gene, encoding a protein required for leucine synthesis and the URA3 gene, encoding a protein required
for uracil synthesis.[1]

The positive selection method is used in investigations in which the investigators are aiming to discover proteins
which interact. Applications include the discovery of homologues in other species and the discovery of members
of a protein family (containing a particular domain and therefore binding to a compatible domain).[1] The HIS3
gene, for example, can be used with E. coli cells bearing a deletion in their homologous hisB gene (ΔhisB cells).[2]

Counter-selection

Alternatively, reporter genes conferring sensitivity to an agent supplied in the media will kill cells expressing
the reporter, thereby removing cells that host an interaction. Examples of genes used in the counter-selection
method include CYH2 (confers sensitivity to cycloheximide), CAN1 and URA3.[1] Alternatively, using repressor
domains or other negative regulators in place of activation domains, counter-selection may be achieved using
the same reporter genes used in positive selection methods. One such negative regulator is the yeast GAL80
domain which binds and inactivates the transcriptional activator region of GAL4.[1]

Such counter-selection methods may be used in investigations aiming to discover a mutation, chemical or
protein that will interfere with the interaction.
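
The difference between the two selection modes can be sketched in Python. The model below is a simplification made for illustration: the function name, the clone names and the all-or-nothing treatment of interactions are assumptions, and a real screen would also have to contend with false positives and negatives.

```python
def survives(hosts_interaction: bool, mode: str) -> bool:
    """Return True if a cell grows under the given selection mode.

    'positive': the reporter supplies an essential nutrient (e.g. HIS3 on
    histidine-free medium), so only cells hosting an interaction grow.
    'counter': the reporter confers sensitivity to an agent in the medium
    (e.g. CYH2 with cycloheximide), so only cells without an interaction grow.
    """
    if mode == "positive":
        return hosts_interaction
    if mode == "counter":
        return not hosts_interaction
    raise ValueError(f"unknown selection mode: {mode!r}")


# Hypothetical library: clone name -> whether its prey binds the bait.
library = {"clone_1": True, "clone_2": False, "clone_3": False, "clone_4": True}

positive_survivors = sorted(c for c, hit in library.items() if survives(hit, "positive"))
counter_survivors = sorted(c for c, hit in library.items() if survives(hit, "counter"))

print("positive selection keeps:", positive_survivors)
print("counter-selection keeps: ", counter_survivors)
```

The same library thus yields complementary sets of survivors, which is why counter-selection suits searches for anything that disrupts a known interaction.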

Fixed domains

In any study, some of the protein domains, those under investigation, will be varied according to the goals of the
study whereas other domains, those that are not themselves being investigated, will be kept constant. For
example in a two-hybrid study to select DNA-binding domains, the DNA-binding domain, BD, will be varied
whilst the two interacting proteins, the bait and prey, will need to be kept constant in order to maintain a strong
binding between the BD and AD. There are a number of domains from which to choose the BD, bait and prey
and AD, if these are to remain constant. In protein-protein interaction investigations, the BD may be chosen
from any of many strong DNA-binding domains such as Zif268.[2] A frequent choice of bait and prey domains
is residues 263-352 of yeast Gal11P with an N342V mutation[2] and residues 58-97 of yeast Gal4,[2] respectively.
These domains can be used in both yeast- and bacterial-based selection techniques and are known to bind
together strongly.[1][2]

The AD chosen must be able to activate transcription of the reporter gene, using the cell's own transcription
machinery. Thus, the variety of ADs available for use in yeast-based techniques may not be suited to use in their
bacterial-based analogues. The herpes simplex virus-derived AD, VP16 and yeast Gal4 AD have been used with
success in yeast[1] whilst a portion of the α-subunit of E. coli RNA polymerase has been utilised in E. coli-based
methods.[2][3]

Whilst powerfully-activating domains may allow greater sensitivity towards weaker interactions, conversely, a
weaker AD may provide greater stringency.

Construction of expression plasmids

There are a number of engineered genetic sequences which must be incorporated into the host cell in order to
perform a two-hybrid analysis or one of its derivative techniques. The considerations and methods used in the
construction and delivery of these sequences differ according to the needs of the assay and the organism chosen
as the experimental background.

There are two broad categories of hybrid library: random libraries and cDNA-based libraries. A cDNA library is
constituted by the cDNA produced through reverse transcription of mRNA collected from specific cells or types
of cell. This library can be ligated into a construct so that it is attached to the BD or AD being used in the assay.[1]
A random library uses lengths of DNA of random sequence in place of these cDNA sections. A number of
methods exist for the production of these random sequences, including cassette mutagenesis.[2] Regardless of the
source of the DNA library, it is ligated into the appropriate place in the relevant plasmid/phagemid using the
appropriate restriction endonucleases.[2]

E. coli-specific considerations

By placing the hybrid proteins under the control of IPTG-inducible lac promoters, they are expressed only on
media supplemented with IPTG. Further, by including different antibiotic resistance genes in each genetic
construct, the growth of non-transformed cells is easily prevented through culture on media containing the
corresponding antibiotics. This is particularly important for counter selection methods in which a lack of
interaction is needed for cell survival.[2]

The reporter gene may be inserted into the E. coli genome by first inserting it into an episome, a type of plasmid
with the ability to incorporate itself into the bacterial cell genome[2] with a copy number of approximately one
per cell.[7]

The hybrid expression phagemids can be electroporated into E. coli XL-1 Blue cells which, after amplification
and infection with VCS-M13 helper phage, will yield a stock of library phage. These phage will each contain
one single-stranded member of the phagemid library.[2]

Recovery of protein information

Once the selection has been performed, the primary structure of the proteins which display the appropriate
characteristics must be determined. This is achieved by retrieval of the protein-encoding sequences (as
originally inserted) from the cells showing the appropriate phenotype.

E. coli

The phagemid used to transform E. coli cells may be "rescued" from the selected cells by infecting them with
VCS-M13 helper phage. The resulting phage particles that are produced contain the single-stranded phagemids
and are used to infect XL-1 Blue cells.[2] The double-stranded phagemids are subsequently collected from these
XL-1 Blue cells, essentially reversing the process used to produce the original library phage. Finally, the DNA
sequences are determined through dideoxy sequencing.[2]

Controlling sensitivity

The Escherichia coli-derived Tet-R repressor can be used in line with a conventional reporter gene and can be
controlled by tetracycline or doxycycline (Tet-R inhibitors). Thus the expression of Tet-R is controlled by the
standard two-hybrid system but the Tet-R in turn controls (represses) the expression of a previously mentioned
reporter such as HIS3, through its Tet-R promoter. Tetracycline or its derivatives can then be used to regulate
the sensitivity of a system utilising Tet-R.[1]

Sensitivity may also be controlled by varying the dependency of the cells on their reporter genes. For example,
this may be effected by altering the concentration of histidine in the growth medium for his3-dependent cells and
altering the concentration of streptomycin for aadA-dependent cells.[2][3] Selection-gene-dependency may also be
controlled by applying an inhibitor of the selection gene at a suitable concentration. 3-Amino-1,2,4-triazole (3-
AT) for example, is a competitive inhibitor of the HIS3-gene product and may be used to titrate the minimum
level of HIS3 expression required for growth on histidine-deficient media.[2]
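
How an inhibitor such as 3-AT raises the bar for survival can be illustrated with a toy calculation in Python. Everything numerical here is an assumption made for illustration: expression and activity are in arbitrary units, the inhibition constant `ki_mM` is hypothetical, and the simple competitive-inhibition scaling 1 / (1 + [I]/Ki) stands in for the real enzyme kinetics.

```python
def grows_without_histidine(his3_expression: float, at_mM: float,
                            ki_mM: float = 1.0, required_activity: float = 1.0) -> bool:
    """Toy model: growth needs enough uninhibited HIS3 activity.

    The inhibitor scales the effective activity by 1 / (1 + [3-AT] / Ki),
    so more 3-AT demands more reporter expression for growth.
    """
    effective_activity = his3_expression / (1.0 + at_mM / ki_mM)
    return effective_activity >= required_activity


weak_interaction = 2.0     # hypothetical reporter expression, arbitrary units
strong_interaction = 10.0  # hypothetical reporter expression, arbitrary units

for at_mM in (0.0, 3.0, 12.0):
    print(f"3-AT {at_mM:>4} mM:",
          "weak grows" if grows_without_histidine(weak_interaction, at_mM) else "weak fails",
          "|",
          "strong grows" if grows_without_histidine(strong_interaction, at_mM) else "strong fails")
```

In this sketch, raising the inhibitor concentration first eliminates cells with weak reporter expression and eventually those with strong expression too, which is the sense in which 3-AT titrates the stringency of the screen.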

Sensitivity may also be modulated by varying the number of operator sequences in the reporter DNA.

Non-fusion proteins

A third, non-fusion protein may be co-expressed with the two fusion proteins. Depending on the investigation,
the third protein may modify one of the fusion proteins or mediate or interfere with their interaction. [1]

Coexpression of the third protein may be necessary for modification or activation of one or both of the fusion
proteins. For example, S. cerevisiae possesses no endogenous tyrosine kinase. If an investigation involves a
protein that requires tyrosine phosphorylation, the kinase must be supplied in the form of a tyrosine kinase gene.[1]

The non-fusion protein may mediate the interaction by binding both fusion proteins simultaneously, as in the
case of ligand-dependent receptor dimerisation.[1]

For a protein with an interacting partner, its functional homology to other proteins may be assessed by supplying
the third protein in non-fusion form which then may or may not compete with the fusion-protein for its binding
partner. Binding between the third protein and the other fusion protein will interrupt the formation of the
reporter expression activation complex and thus reduce reporter expression, leading to the distinguishing change
in phenotype.[1]

One-hybrid, three-hybrid and one-two-hybrid variants

One-hybrid

The one-hybrid variation of this technique is designed to investigate protein-DNA interactions and uses a single
fusion protein in which the AD is linked directly to the binding domain. The binding domain in this case
however is not necessarily of fixed sequence as in two-hybrid protein-protein analysis but may be constituted by
a library. This library can be selected against the desired target sequence which is inserted in the promoter
region of the reporter gene construct. In a positive-selection system, a binding domain which successfully binds
the UAS and allows transcription will thus be selected.[1]
Note that selection of DNA-binding domains isn't necessarily performed using a one-hybrid system, but may
also be performed using a two-hybrid system in which the binding domain is varied and the bait and prey
proteins are kept constant.[2][3]

Three-hybrid

Overview of three-hybrid assay.

RNA-protein interactions have been investigated through a three-hybrid variation of the two-hybrid technique.
In this case, a hybrid RNA molecule serves to join the two protein fusion domains, which are not
intended to interact with each other but rather with the intermediary RNA molecule (through their RNA-binding
domains).[1] Techniques involving non-fusion proteins that perform a similar function, as described in the 'non-
fusion proteins' section above, may also be referred to as three-hybrid methods.

One-two-hybrid

Simultaneous use of the one- and two-hybrid methods (that is, simultaneous protein-protein and protein-DNA
interaction) is known as a one-two-hybrid approach and is expected to increase the stringency of the screen.[1]

Host organism

Although theoretically, any living cell might be used as the background to a two-hybrid analysis, there are
practical considerations that dictate which will be chosen. The chosen cell line should be relatively cheap and
easy to culture and sufficiently robust to withstand application of the investigative methods and reagents. [1]

Yeast

S. cerevisiae was the model organism used during the two-hybrid technique's inception. It has several
characteristics that make it a robust organism to host the interaction, including the ability to form tertiary protein
structures, neutral internal pH, enhanced ability to form disulfide bonds and reduced-state glutathione among
other cytosolic buffer factors, to maintain a hospitable internal environment.[1] The yeast model can be
manipulated through non-molecular techniques and its complete genome sequence is known.[1] Yeast systems
are tolerant of diverse culture conditions and harsh chemicals which could not be applied to mammalian tissue
cultures.[1]

Proteins from as small as eight to as large as 750 amino acids have been studied using yeast.[1]

E. coli

E. coli-based methods have several characteristics which may make them preferable to yeast-based counterparts.
The higher transformation efficiency and faster rate of growth lend E. coli to the use of larger libraries (in
excess of 10⁸).[2] A low false positive rate of approximately 3 × 10⁻⁸, the absence of requirement for a nuclear
localisation signal to be included in the protein sequence and the ability to study proteins that would be toxic to
yeast may also be major factors to consider when choosing an experimental background organism.[2]
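
A back-of-the-envelope calculation shows what these figures imply for a large screen. The Python sketch below reads the numbers above as a library of 10^8 clones and a false positive rate of 3 × 10^-8 per cell screened; both readings, and the variable names, are assumptions made for illustration.

```python
# Assumed readings of the figures quoted above.
library_size = 1e8          # clones screened
false_positive_rate = 3e-8  # false positives per cell screened

# Expected number of false positive colonies if every clone is screened once.
expected_false_positives = library_size * false_positive_rate

print(f"expected false positives: about {expected_false_positives:.0f}")
```

Under these assumptions, even a screen of 10^8 clones would be expected to yield only a handful of false positive colonies.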

It may be of note that the methylation activity of certain E. coli DNA methyltransferase proteins may interfere
with some DNA-binding protein selections. If this is anticipated, the use of an E. coli strain that is defective for
a particular methyltransferase may be an obvious solution.[2]

Applications

Determination of sequences crucial for interaction

By changing specific amino acids through mutation of the corresponding DNA base pairs in the plasmids used, the
importance of those amino acid residues in maintaining the interaction can be determined.[1]

After using a bacterial cell-based method to select DNA-binding proteins, it is necessary to check the specificity
of these domains as there is a limit to the extent to which the bacterial cell genome can act as a sink for domains
with an affinity for other sequences (or indeed, a general affinity for DNA). [2]

Drug and poison discovery

Protein-protein signalling interactions pose suitable therapeutic targets due to their specificity and
pervasiveness. The random drug discovery approach utilises compound banks that comprise random chemical
structures and demands a high-throughput method in order to test these structures in their intended target. [1]

The cell chosen for the investigation can be specifically engineered to mirror the molecular aspect that the
investigator intends to study and then used to identify new human or animal therapeutics or anti-pest agents. [1]

Determination of protein function

By determination of the interaction partners of unknown proteins, the possible functions of these new proteins
may be inferred.[1] This can be done using a single known protein against a library of unknown proteins or
conversely, by selecting from a library of known proteins using a single protein of unknown function.[1]

Zinc finger protein selection

To select zinc finger proteins (ZFPs) for protein engineering, methods adapted from the two-hybrid screening
technique have been used with success.[2][3] A ZFP is itself a DNA-binding protein used in the construction of
custom DNA-binding domains that bind to a desired DNA sequence. [8]

By using a selection gene with the desired target sequence included in the UAS and randomising the relevant
amino acid sequences to produce a ZFP library, cells that host a DNA-ZFP interaction with the required
characteristics can be selected. Each ZFP typically recognises only 3-4 base pairs, so to prevent recognition of
sites outside the UAS, the randomised ZFP is engineered into a 'scaffold' consisting of another two ZFPs of
constant sequence. The UAS is thus designed to include the target sequence of the constant scaffold in addition
to the sequence for which a ZFP is being selected.[2][3]
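
The reason for the scaffold can be seen from a simple count of how often a short site occurs by chance. The Python sketch below assumes 3 base pairs per finger, a random sequence with equal base frequencies, and a host genome of 4.6 million base pairs (roughly the size of an E. coli genome); these values and the function name are assumptions made for illustration.

```python
def expected_chance_sites(site_length_bp: int, genome_bp: int) -> float:
    """Expected occurrences of one specific site on one strand of random sequence."""
    return genome_bp / 4 ** site_length_bp


GENOME_BP = 4_600_000  # assumed host genome size
BP_PER_FINGER = 3      # lower end of the 3-4 bp range given above

single_finger = expected_chance_sites(BP_PER_FINGER, GENOME_BP)
three_fingers = expected_chance_sites(3 * BP_PER_FINGER, GENOME_BP)

print(f"1 finger  ({BP_PER_FINGER} bp site): about {single_finger:,.0f} chance sites")
print(f"3 fingers ({3 * BP_PER_FINGER} bp site): about {three_fingers:,.1f} chance sites")
```

A lone finger would find tens of thousands of matching sites elsewhere in such a genome, whereas the randomised finger embedded in a two-finger scaffold recognises a 9 bp site that occurs by chance only a few times.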

A number of other DNA-binding domains may also be investigated using this system.[2]

Strengths and weaknesses

Two-hybrid screens are now routinely performed in many labs. They can provide an important first hint for the
identification of interaction partners. Moreover, the assay is scalable which makes it possible to screen for
interactions among many proteins.

The main criticism applied to the yeast two-hybrid screen of protein-protein interactions is the possibility of a
high number of false positive (and false negative) identifications. The exact rate of false positive results is not
known, but estimates are as high as 50%.[9] The reason for this high error rate lies in the principle of the screen:
The assay investigates the interaction between (i) overexpressed (ii) fusion proteins in the (iii) yeast (iv)
nucleus. Each of these points (i-iv) alone can give rise to false results. For example, overexpression can result in
non-specific interactions. Moreover, a mammalian protein is sometimes not correctly modified in yeast (e.g.
missing phosphorylation), which can also lead to false results. Finally, some proteins might specifically interact
when they are co-expressed in the yeast, although in reality they are never present in the same cell at the same
time. Due to the combined effects of all error sources the overall confidence of the yeast two-hybrid assay is
rather low. However, yeast two-hybrid data is shown to be of similar quality to data generated by the alternative
approach of coaffinity purification followed by mass spectrometry (AP/MS).[10] The probability of generating
false positives means that all interactions should be confirmed by a high confidence assay, for example co-
immunoprecipitation of the endogenous proteins, which is difficult for large scale protein-protein interaction
data.

Two-hybrid system for detecting protein-protein interactions

Biochemistry/MCB 568 -- Fall 2007


John W. Little--University of Arizona

The two-hybrid system is a useful way to detect proteins that interact with a protein you are studying. In general,
it is used primarily for initial identification of interacting proteins, not for detailed characterization of the
interaction.

This system is based on the modular organization of many transcription factors (see figure). Many such proteins
have two or more discrete structural and functional units, or domains (see here for caution about this way of
thinking). The first protein to be used for this was the yeast protein GAL4; many later studies use the DNA-
binding domain of the E. coli protein LexA. GAL4 has a DNA-binding domain (or DBD) and an activation
domain (or AD). The structure of GAL4 complexed to its specific site (PDB file) is only of the first 65 amino
acids, and comprises a minimal DBD; usually residues 1-100 or so are used.

When GAL4 binds to its cognate binding site, the activation domain is brought close to the promoter, allowing
the activation domain to interact with the transcription machinery and resulting in activation of transcription.
Typically a reporter gene, often lacZ, is used. Hence, there are standard reporter constructs, with variable
numbers of binding sites and a reporter gene.

Now consider how these elements can be used to detect protein-protein interactions. Two types of hybrids are
made:

DBD Hybrid: This hybrid contains the DBD fused to a protein of interest (often termed the "bait"). This fusion
protein can bind to the DNA, but cannot activate transcription because the bait does not contain an activation
function (if it does, this procedure will not work).

AD Hybrid: This hybrid contains the AD fused to another protein (often termed the "prey"). Usually, a
recombinant DNA "library" is prepared in which genes for many different proteins are fused to the AD. Then
both hybrid proteins are expressed in the same cell. Those expressing the reporter gene are identified and
purified for further characterization.

Typically, libraries will contain large numbers of different clones (>10⁶ different ones); a few of them will be
able to interact with the bait. These few can then be recognized by their ability to turn on the reporter gene.