Gene Hunting
Lecture 1
Strategies for Identifying Disease Genes
If you have some idea of the pathological basis of the disease, or if there is a similar
animal or human disease whose basis is known, it may be possible to guess what the
gene might be and sequence that gene directly in patients.
Positional Approach
Positional Cloning
Identification of a disease gene through its location in the human genome, without prior
knowledge of its function.
This would be followed by the analysis of candidate genes within the disease-associated genomic
region.
Polymorphisms are DNA variations between individuals; most of them are not themselves
disease-causing.
Known, characterised human genetic polymorphisms (markers) may lie close to unknown,
uncharacterised polymorphisms that cause disease (mutations).
Genetic Markers: Useful DNA variants at known chromosomal locations that can be genotyped
easily, using simple lab techniques (e.g. PCR).
Meiotic Recombination
Closely linked genes are unlikely to experience a crossover between them during meiosis.
A genetic distance of 1 centimorgan (1 cM) is the distance between 2 genes that show a 1%
rate of recombination.
In 99% of meioses, the alleles of the two genes are inherited together. In mammals, 1 cM
corresponds to roughly 1-2 Mb of DNA.
Linkage analysis uses easily assayed genetic markers scattered throughout the genome and looks
at the pattern of inheritance of alleles of these markers.
The markers whose alleles are most commonly co-inherited with disease are those that
are most closely linked to a disease gene.
In a specific genetic disease, polymorphisms are studied in family members to find genetic
linkage.
If the polymorphism is close to (or in) the disease gene on the chromosome, there is a low
chance of recombination between the marker and the disease-causing mutation at meiosis,
and linkage is observed.
If the polymorphism and disease gene are far apart or on different chromosomes, linkage is
not observed.
Recombination Fractions
The recombination fraction ‘θ’ is calculated by dividing the number of offspring with the
recombinant allele combination (R) by the total number of offspring, recombinant plus
non-recombinant (R + NR): θ = R/(R+NR)
The closer together the genes are, the smaller the recombination fraction (RF).
θ of 0.01 = 1 cM = 1 Mapping Unit (m.u.)
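The calculation θ = R/(R+NR) can be sketched in a few lines of Python. This is only an illustration: the function names and the offspring counts are made up, and converting θ directly to centimorgans is assumed to be reasonable only for small values of θ (closely linked loci).

```python
def recombination_fraction(recombinants, non_recombinants):
    """theta = R / (R + NR), as defined in the notes."""
    total = recombinants + non_recombinants
    if total == 0:
        raise ValueError("need at least one offspring")
    return recombinants / total


def map_distance_cm(theta):
    """Approximate map distance: theta of 0.01 = 1 cM (small theta only)."""
    return theta * 100


# Hypothetical cross: 5 recombinant offspring out of 100 scored.
theta = recombination_fraction(5, 95)
print(theta, map_distance_cm(theta))
```

With the made-up counts above, 5 recombinants in 100 offspring gives θ = 0.05, i.e. the two loci would be about 5 cM apart.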
Most polymorphisms are in non-coding DNA, as there is more of it, and mutations there are not
selected against.
Genetic markers is the name given to polymorphic sites that can be easily genotyped.
Satellite DNA
Satellite DNA was identified in the early days when genomic DNA fragments were
centrifuged in sucrose or CsCl gradients.
We now know it consists of DNA with long stretches of tandem repeat sequence.
Minisatellites
‘Hypervariable minisatellite DNA’: the number of repeat motifs at any one locus is very
variable and there are many alleles in the population.
Southern blotting using probes against minisatellite core sequence was the original basis
of DNA fingerprinting.
Shared minisatellite bands between parents and children can be used to confirm or exclude
paternity.
Microsatellites
Microsatellites are currently the most significant and widely used genetic markers for most
species. They consist of stretches of (usually) di- or trinucleotide repeats, e.g.
‘CTCTCTCTCTCTCTCTCT’.
The length of each microsatellite is very variable: they appear to be very prone to ‘slippage’
during DNA replication, and possibly also to unequal crossing over of longer stretches.
Well over 100000 microsatellites have been identified and mapped throughout the human genome,
and their length can be assayed by PCR using unique primers that flank each microsatellite
sequence. Different alleles produce different sized bands after PCR.
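To make the link between repeat number and band size concrete, here is a minimal Python sketch. The function names are hypothetical, and the flanking lengths (the primers plus unique sequence on either side of the repeat) are invented numbers chosen purely for illustration.

```python
import re


def longest_repeat_run(sequence, motif):
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def pcr_product_length(flank_5_bp, flank_3_bp, motif, repeat_count):
    """PCR product size = 5' flank + repeat tract + 3' flank (all in bp)."""
    return flank_5_bp + len(motif) * repeat_count + flank_3_bp


# The dinucleotide tract from the notes contains 9 copies of 'CT'.
print(longest_repeat_run("CTCTCTCTCTCTCTCTCT", "CT"))

# Two hypothetical alleles that differ only in repeat number.
print(pcr_product_length(80, 84, "CT", 10))
print(pcr_product_length(80, 84, "CT", 18))
```

Under these assumed flank sizes, an allele with 10 repeats gives a 184bp product and one with 18 repeats gives a 200bp product, so the two alleles separate as different sized bands.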
Whatever microsatellite you want to use as a marker, you can go online and buy the primers you
need.
Primers are specific so they only amplify the one microsatellite locus.
Any particular microsatellite is likely to have many different alleles (i.e. length polymorphisms)
within the wider population.
After PCR, these alleles can be assayed as different sized bands on a gel, or by running through
capillary tubes.
Similar to running sequence reactions: get the exact length as a peak coming through the
machine.
Most microsatellites lie outside genes or in introns and are used purely as markers. Some lie
within genes, e.g. the trinucleotide repeats underlying Huntington's disease or myotonic
dystrophy are microsatellites.
Example
A microsatellite marker is closely linked to the gene and can be assayed by PCR.
Affected dad is heterozygous for the microsatellite (184bp and 200bp) and the mutant
disease gene allele segregates with the closely linked 200bp microsatellite allele.
The disease allele of the gene is linked to the 200bp allele of the microsatellite (this can then
be used for prenatal screening of the affected daughter's children in this family).
Microsatellites can be used on a much larger scale as linkage markers to find disease genes.
Autozygous Mapping: Autosomal Recessive Disease
As humans, the vast majority (99.9%) of our DNA sequence is identical, even between different
ethnic populations.
There are many loci where differences exist: bits of sequence where different people have a
different base, for example, 85% of alleles have an A and 15% of alleles have a C at a
particular base pair.
This would be a single nucleotide polymorphism (SNP). Over 5000000 have been identified
and characterised in the human genome.
A single SNP may have different allele frequencies in different human populations: 85% of
people may have A/A in western Europe, but 90% in Africa, and 65% in Asia.
SNPs can also be used as genetic markers for gene mapping and disease association.
We have known about and used SNPs since before they were named: they are the basis of
‘Restriction Fragment Length Polymorphisms’ (RFLPs).
Some SNPs by chance create or destroy restriction sites, so that when the DNA of interest is
digested, you get different sized bands depending on which SNP allele is present.
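A small Python sketch of the RFLP idea follows. The two allele sequences are invented for illustration, the function name is hypothetical, and the digest is simplified to a single strand cut at a fixed offset within each site. EcoRI (recognition site GAATTC, cutting after the G) is used as the example enzyme.

```python
def fragment_lengths(sequence, site, cut_offset):
    """Lengths of the fragments left after cutting at every copy of `site`.

    `cut_offset` is how many bases into the site the enzyme cuts.
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cut_positions + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Hypothetical alleles differing at one base inside an EcoRI site.
allele_with_site = "A" * 10 + "GAATTC" + "T" * 10
allele_without_site = "A" * 10 + "GAGTTC" + "T" * 10

print(fragment_lengths(allele_with_site, "GAATTC", 1))
print(fragment_lengths(allele_without_site, "GAATTC", 1))
```

The allele carrying the intact site is cut into two fragments, while the allele in which the SNP has destroyed the site stays as one uncut fragment, which is what produces the different band patterns.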
Microsatellites vs SNPs
Many more SNPs than microsatellites have been identified, which allows for much finer mapping.
The majority of SNPs have only 2 alleles, whereas a microsatellite can have multiple alleles
(C.F).
Genome Scans
The problem with human genetics is you can't set up the crosses and control who mates with
whom.
All you have is pedigrees, which may be incomplete, and genetic data may not be
forthcoming from all individuals in the pedigree.
A genome scan is the genotyping of a collection of families with the genetic disease using
hundreds of genetic markers from all over the genome.
Using hundreds of markers ensures the unknown gene will be close enough to one or two of
them to show genetic linkage.
The aim may be to find linkage with several markers. Then you would know that the disease
gene must be in the candidate region of the genome around those markers.
Experimental animals such as mice or fruit flies produce very large numbers of offspring, so the
recombination frequency ‘θ’ can be estimated very accurately.
Human families only produce small numbers of children, so to get statistically significant
evidence for linkage you have to combine evidence from different families.
The LOD score is a statistic that describes the strength of evidence for linkage, at any
chosen value of θ, given the family data available.
The recombination fraction ‘θ’ can be calculated for any two loci by linkage analysis. θ varies
from 0 (100% linkage) to 0.5 (no linkage).
The likelihood ratio at a given value of ‘θ’ equals the likelihood of the observed data if the
loci are linked at recombination fraction θ (Lθ), divided by the likelihood of the observed data
if the loci are not linked (θ = 0.5) (L0.5). The LOD score (Z) is log10 of this ratio:
Z = log10(Lθ/L0.5).
If a particular disease gene is linked to a DNA marker with a LOD score (Z) of 4 at
recombination fraction θ = 0.05, it means that in the families studied it is 10000 times more
likely that the disease and marker are linked roughly 5cM apart than that they are not linked.
Produce a plot of LOD scores for different values of θ, based on the observed data.
A LOD score of 3 or more is considered good evidence for linkage.
A LOD score of -2 or less is evidence against linkage.
Values between -2 and 3 are inconclusive and indicate that more data must be obtained.
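The thresholds above, together with the idea of combining evidence from several families, can be sketched as follows. This assumes the families are independent and that their LOD scores were all computed at the same value of θ, in which case the scores simply add (they are logs of likelihood ratios). The function names and the per-family scores are hypothetical.

```python
def combined_lod(family_lods):
    """Total LOD at one theta: scores from independent families add up."""
    return sum(family_lods)


def classify_lod(z):
    """Apply the conventional thresholds from the notes."""
    if z >= 3:
        return "evidence for linkage"
    if z <= -2:
        return "evidence against linkage"
    return "inconclusive"


# Hypothetical LOD scores from three small families at the same theta.
family_lods = [1.2, 0.9, 1.1]
total = combined_lod(family_lods)
print(total, classify_lod(total))
```

Each made-up family on its own is inconclusive, but together they pass the threshold of 3, which is why human linkage studies pool families.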
LOD Graph
A LOD score of 3 implies a 1000x likelihood of linkage over chance; however, we have 46
chromosome arms (23 chromosomes with p and q arms).
Take any two random loci in the human genome: the prior probability of them being on
different chromosome arms (unlinked) is around 50x more than the probability that they are
linked.
Non-linkage is 50x more likely by chance than linkage, so a LOD score of 3 (1000:1) gives odds
of only about 20:1 in favour of linkage (1000/50), i.e. it will be spurious roughly 1 in 20
times.
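The arithmetic behind the "1 in 20" figure can be written out as a short Python sketch. The function names are hypothetical, and the prior odds of 1:50 for linkage are simply the rough figure quoted in the notes.

```python
def posterior_odds_of_linkage(lod, prior_odds_linked=1 / 50):
    """Posterior odds = likelihood ratio (10 ** LOD) x prior odds of linkage."""
    return (10 ** lod) * prior_odds_linked


def probability_spurious(lod, prior_odds_linked=1 / 50):
    """Chance that linkage reported at this LOD score is a false positive."""
    odds = posterior_odds_of_linkage(lod, prior_odds_linked)
    return 1 / (1 + odds)


print(posterior_odds_of_linkage(3))  # odds in favour of linkage at LOD 3
print(probability_spurious(3))       # roughly 1 in 20
```

Odds of 20:1 correspond to a false positive probability of 1/21, which is the "roughly 1 in 20" (about 5%) that makes a LOD score of 3 the conventional threshold.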
A and B are alleles of a locus that is suspected of being linked to the disease.
It looks like the disease is carried with the B allele. So for the 4 progeny in generation III
from the pairing in generation II, if the loci are linked at θ = 0.05, the probability that the
observed [B and Disease] will occur in 2 individuals and [A and Healthy] in the other 2 is
(1 − 0.05)^4.
If the loci are not linked (θ = 0.5), the probability that the observed [B and Disease] will
occur in 2 individuals, and [A and Healthy] in the other 2, is (0.5)^4.
The LOD score for the family at θ = 0.05 is therefore log10[(0.95)^4/(0.5)^4] = 1.12.
The LOD score for the family at θ = 0 is log10[(1)^4/(0.5)^4] = 1.2.
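The two LOD values for this family can be reproduced with a short Python sketch. It assumes, as the worked example does, that phase is known so every child can be scored as recombinant or non-recombinant; the function name is hypothetical.

```python
import math


def lod_score(recombinants, non_recombinants, theta):
    """log10 of L(theta) / L(0.5) for phase-known meioses."""
    if theta == 0 and recombinants > 0:
        # Any recombinant is impossible at theta = 0, so the likelihood is 0.
        return float("-inf")
    n = recombinants + non_recombinants
    likelihood_linked = (theta ** recombinants) * ((1 - theta) ** non_recombinants)
    likelihood_unlinked = 0.5 ** n
    return math.log10(likelihood_linked / likelihood_unlinked)


# The family above: 4 children, all non-recombinant.
print(round(lod_score(0, 4, 0.05), 2))  # 1.12
print(round(lod_score(0, 4, 0.0), 2))   # 1.2
```

Calling the same function over a range of θ values from 0 to 0.5 gives the points for a LOD plot; a single family of 4 children can never reach the threshold of 3 on its own.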
Lecture 2
Genome Wide Association Studies
Powerful use of SNPs to identify the location of disease susceptibility genes.
Population level study: get 100s or 1000s of affected and unaffected individuals and screen
alleles of 500000 - 1000000 SNPs in all of them.
Statistical analysis may show that a particular allele of a certain SNP is over-represented in
patients suffering from a disease.
E.g. a SNP somewhere in the genome which in the general population shows 90% of alleles as
‘T’ and 10% as ‘C’, but in patients suffering from the disease, 80% are ‘C’.
This suggests that the disease gene is somewhere around that SNP.
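As a rough sketch of the kind of statistic involved, the Python below compares allele frequencies between patients and the general population for one SNP. The allele counts (1000 alleles per group) are invented to match the 80% vs 10% ‘C’ example, the function names are hypothetical, and a simple allelic odds ratio and 2x2 chi-square are used as one possible choice of test, not necessarily what any particular study used.

```python
def allelic_odds_ratio(case_freq, control_freq):
    """Odds of carrying the allele in cases relative to controls."""
    case_odds = case_freq / (1 - case_freq)
    control_odds = control_freq / (1 - control_freq)
    return case_odds / control_odds


def chi_square_2x2(case_c, case_t, control_c, control_t):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    n = case_c + case_t + control_c + control_t
    row_cases, row_controls = case_c + case_t, control_c + control_t
    col_c, col_t = case_c + control_c, case_t + control_t
    observed = [case_c, case_t, control_c, control_t]
    expected = [
        row_cases * col_c / n,
        row_cases * col_t / n,
        row_controls * col_c / n,
        row_controls * col_t / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 'C' allele: 80% of patient alleles vs 10% of general-population alleles.
print(allelic_odds_ratio(0.80, 0.10))
# Hypothetical counts: 1000 alleles typed in each group.
print(chi_square_2x2(800, 200, 100, 900))
```

A large odds ratio with a very large chi-square value is what flags this SNP as associated with the disease; in a real scan the same comparison is repeated for every SNP typed.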
Why does GWAS work?
It is possible, though less likely, that a SNP associated with a disease is itself the disease
mutation, e.g. by inactivating a gene.
It is more likely that the actual disease mutation is somewhere near the SNP.
The human genome is young, and within populations closely linked SNPs are in linkage
disequilibrium.
When a new mutation arises, it will remain associated with the surrounding SNP alleles of the
person in whom the mutation arose.
1. Detection: Find potentially pathogenic mutations in the gene that segregate with the disease.
2. Validation: Confirm that the mutation(s) you found are pathogenic, e.g. by molecular
analysis in cells or recapitulation of the disease in an animal model.
Aniridia (“absence of iris”) is an eye condition caused by a heterozygous mutation in the PAX6
gene, which is expressed in the iris, cornea, lens and retina.
Study 1:
LOD Scores:
For catalase gene and unknown disease gene (AN2 locus): Z=7.27 at θ=0.00,
For microsatellite D11S151 and unknown disease gene (AN2 locus): Z=3.86, at θ=0.10,
Enabled identification of candidate region within which the aniridia gene was presumed to
lie.
Study 2:
Wilms' tumour is a genetic childhood kidney cancer. You can get Wilms' tumour without aniridia
and vice versa, so 2 genes are involved.
Chromosome jumping: screened libraries to clone random bits of genomic DNA in the
candidate region.
Got random clones of gDNA from the candidate region and looked for ones that had CpG islands
and were highly conserved between different species: potential genes.
Took gDNA from the potential genes and screened an embryonic cDNA library to see which
clones represent expressed sequence.
One clone lit up: it encoded a paired domain-containing DNA-binding transcription factor,
PAX6.
Genomic DNA from WAGR and aniridia (non-WAGR) families did not have this bit of
gDNA, meaning that PAX6 was missing.
Study 3:
The mouse small eye mutation, which causes aniridia, cataracts etc., is also a mutation in Pax6,
i.e. knocking out the gene in an animal model replicates the human disease.
They are often used now in a clinical diagnostic capacity to screen for mutations in known
disease-associated genes.
E.g. the 3bp deletion (ΔF508) mutation that causes CF in many patients:
PCR using forward and reverse primers that span the mutation gives a 98bp product from the
normal allele and a 95bp product from the ΔF508 allele.
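Reading a genotype off the PCR bands can be sketched as below. The 98bp and 95bp sizes come from the example above; the function name and the returned labels are hypothetical, and real band sizing would need some tolerance for measurement error, which this sketch ignores.

```python
def genotype_from_bands(band_sizes_bp, normal_bp=98, deletion_bp=95):
    """Infer the genotype from the set of PCR product sizes seen for one person."""
    bands = set(band_sizes_bp)
    if bands == {normal_bp}:
        return "homozygous normal"
    if bands == {deletion_bp}:
        return "homozygous deltaF508"
    if bands == {normal_bp, deletion_bp}:
        return "heterozygous carrier"
    return "unexpected band pattern"


print(genotype_from_bands([98]))      # only the normal product
print(genotype_from_bands([98, 95]))  # both products: carrier
print(genotype_from_bands([95]))      # only the deleted product
```

Because CF is recessive, only the individual showing the 95bp band alone would be expected to be affected by this mutation; the two-band pattern identifies carriers.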
Detection of Copy Number Variation (CNV)
Diseases may be caused by deletions or duplications that increase or decrease the number of
copies of the disease gene,
Chip, next-gen sequencing or qPCR-based techniques for detecting changes in gene copy
number,
Validation
When you think you have found the disease-causing mutation, how do you know if it is pathogenic
and not just harmless genetic variation?
2. If you have a pedigree, check that the disease allele is carried by all affected individuals,
3. Recreate the mutation in wild-type cells or models and show that it causes problems that
explain the human disease state (e.g. CFTR mouse above)
DISC1 + Schizophrenia
Schizophrenia: a mental illness with strong evidence for a genetic component.
A lab in Scotland studied a Scottish family with a high incidence of schizophrenia and related
mental disorders.
They had a balanced chromosome 1/11 translocation that segregated with the disease.
A maximum LOD score of 6 suggested the breakpoint was very closely linked to the disease gene.
The breakpoint falls in a region of chromosome 11 that has no genes (gene desert), but falls
within an intron of a (then) novel gene on chromosome 1, which they called ‘Disrupted in
Schizophrenia 1’ (DISC1).
Together, the fact that the breakpoint cuts DISC1 in two (therefore inactivating it) and the
linkage analysis showing that the breakpoint is likely to be the pathological mutation indicate
that the mutation in DISC1 causes schizophrenia.
A DISC1 knockout was generated in mice; the mice were viable and fertile (like human
schizophrenia patients).
The mice developed abnormal emotional behaviour as assessed by the elevated plus-maze and
cliff-avoidance tests, therefore suggesting that a deficiency of full-length DISC1 may result in
lower anxiety and/or higher impulsivity,
A small (e.g. point) mutation in non-coding DNA that underlies a human disease,
May for example mutate a transcription factor binding sequence in a gene regulatory element,
From 1000+ genome projects, there are 1000000s of harmless neutral SNPs. If you find a new
SNP close to an identified disease gene which has no mutation in its coding region, is it
affecting gene expression, or is it a neutral polymorphism?
2. Does your new SNP disrupt the consensus binding sequence of a transcription factor?
rVista will predict whether a particular DNA sequence is likely to be bound by any of a
panel of TFs for which it knows the consensus DNA binding sequence,
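The idea of checking a SNP against a consensus binding sequence can be illustrated with a very small Python sketch. This is not how rVista works internally; it is only a plain string match against an IUPAC-coded consensus. The consensus used (CANNTG, an E-box-like motif), the two allele sequences and the function name are all chosen purely for illustration.

```python
# IUPAC nucleotide codes: which bases each consensus letter allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_consensus(site, consensus):
    """True if every base of `site` is allowed at that position of `consensus`."""
    if len(site) != len(consensus):
        return False
    return all(
        base in IUPAC[code]
        for base, code in zip(site.upper(), consensus.upper())
    )


consensus = "CANNTG"           # illustrative consensus only
wild_type_site = "CAGCTG"      # hypothetical wild-type allele
snp_site = "CAGCTA"            # hypothetical SNP allele (last base G -> A)

print(matches_consensus(wild_type_site, consensus))
print(matches_consensus(snp_site, consensus))
```

Here the wild-type allele matches the consensus and the SNP allele does not, which would make the SNP a candidate for disrupting binding; a prediction like this still needs the experimental follow-up described in these notes.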
If your SNP is in a putative evolutionarily conserved gene enhancer region, check if it affects
the ability of the enhancer to drive gene expression.
Direct in vitro or in vivo assays for DNA-protein binding (ChIP, SW blot).
Make reporter DNA constructs with wild-type and mutant alleles of the enhancer.