Tassel Working For GWAS

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 140

Genome Wide Association Studies

(GWAS)
BY USING TASSEL 5.0

Dr. Amit Joshi

HOD-Department of Biochemistry
Kalinga University
GWAS INTRODUCTION
• In genomics, a genome-wide association study (GWA study, or
GWAS), is an observational study of a genome-wide set of genetic
variants in different individuals to see if any variant is associated
with a trait.
• GWA studies typically focus on associations between single-
nucleotide polymorphisms (SNPs) and traits like major human
diseases, but can equally be applied to any other genetic variants
and any other organisms.
• Aware about some software's for conducting GWAS are: Structure,
Plink, Tassel, R-Studio based r-commands….etc etc.
Download & Install Tassel 5.0
• https://tassel.bitbucket.io/ Webpage link

Webpage view
Preparing the Input files
A. Phenotype file
• Prepare the phenotype file as shown below in the figure
• Note: Please remember if your data has covariates such as sex, age or treatment,
then, please categories them with header name factor.
B. Genotype file
• TASSEL allows various genotype file formats such as VCF (variant call
format), .hmp.txt, and plink. In this tutorial, I am using the hmp.txt version of the
genotype file. The below is the screenshot of the hmp.txt genotype file.
Importing phenotype and genotype files
• Import the files by following the steps shown below. Tip! Both files
can be opened at same time holding CTRL and clicking the file names.
Phenotype Data
Phenotype distribution plot
Genotype summary analysis

• Next crucial step is to look at the genotype data by simply following


the steps shown. Couple of keys things to look at are:
• Minor allele frequency distribution
• Missing genotypic data to see if it requires to be imputed
• Proportion of heterozygous in the samples to check for self-ed samples
Filter genotypes based on call rate

• Steps to filter the genotypes based on call rate and heterozygosity


level are shown below:

• In the video, genotypes were filtered based on listed parameters:

• 90% minimum sites persent


• 5% minimum heterozygosity
• 100% maxmimum heterozygosity
Filter Markers based on read depth, Minor and Major
allele frequency (MAF)

• Steps to filter markers based on read depth, Minor and Major allele
frequency (MAF) are shown below:
• In the video, markers were filtered based on listed parameters:
• 100 minimum count of 545 sequences (Its the number of times a
particular allele was seen for that locus)
• 0.05 Minor allele Frequency (set filter thresholds for rare alleles)
• 0.95 Major allele frequency (set filter thresholds to remove
monomorphic markers)
Conduct GWAS analysis
: Multidimensional scaling (MDS)
MDS output can be used as the covariate in the GLM or MLM
to correct for population structure. Please follow the steps
shown below:
Intersecting the files
• Intersect the genotype, phenotype and MDS files by following the
steps below:
running General Linear Model (GLM)
• Run the GLM analysis by selecting the intersected files following the
steps below:
The output of the GLM analyis is produced ubder the
Result node. The GLM association test can be evaluated
by plotting Q-Q plot and the Manhattan plot as shown
below.
Mixed Linear Model (MLM)
• Calculating Kinship matrix
• Follow the below steps to calcuate the kinship matrix:
MLM
running Mixed Linear Model (MLM)
• MLM model includes the PCA and the kinship matrix i.e.
MLM(PCA+K).
• Therefore, once the Kinship matrix has been calculated, MLM can be
now be conducted by following below steps:
Exporting results

• One may export the results in .txt format Results- Table


Determine GWAS Significance Threshold

• Bonferroni threshold can be determined to identify significantly


markers associated with the trait by using the below equation:

• where, N is the total number of markers tested in association analysis)


was used to identify the most significantly markers associated with
the trait. Similarly, another way is to perform FDR (False Discoveyy
Rate) correction method, which is a less stringent than the family-
wise error rate.
Adjust P-Values For Multiple Comparisons: Bonferroni and
False Discovery Rate

• Give the output from GLM and or MLM analysis, one calcuate the
adjusted p-values using one of the frequenlty comparison methods:
Bonferroni and False Discovery Rate (FDR)in R using the below code.
# Import GLM or MLM stats file.

glm_stats <- read.table("GLMstats.txt", header = T, sep = "\t")

# Check data
head(glm_stats)

# Import R library
library(dplyr)

# Calculate Bonferroni Correction and False Discovery Rate

adj_glm <- glm_stats %>%


transmute(Marker, Chr, Pos, p,
p_Bonferroni = p.adjust(glm_stats$p,"bonferroni"),
p_FDR = p.adjust(glm_stats$p,"fdr")
)

View(adj_glm)

# Save the result to a file


write.csv(adj_glm, file="adj_p_GLM.csv", quote = T, eol = "\n", na= "NA")

# QQ plot
library(qqman)

# import data
adj_glm_KRN_4 <- read.csv("adj_p_GLM.csv", header = T)

#plot qq plot GLM(PCA)

par(mfrow=c(1,3))
qq(adj_glm_KRN_4$p, main = "non-adjusted P-value")
qq(adj_glm_KRN_4$p_Bonferroni, main = "Bonferroni")
qq(adj_glm_KRN_4$p_FDR, main = "FDR")
par(mfrow=c(1,1))
The Hardy–Weinberg (HD) principle
• Allele and genotype frequencies in a population will remain
constant from generation to generation in the absence of other
evolutionary influences.
• These influences include non-random mating, mutation,
selection, genetic drift, gene flow and meiotic drive.
• Allele frequency: f(A)=p, f(a)=q
• Genotype frequency: f(AA)=p2, f(aa)=q2, f(Aa)=2pq
• Both allele and genotype frequency remain unchanged: Hardy-
Weinberg equilibrium
HD principle for two loci
• First locus: A and a alleles; Second locus: B and b alleles
• Allele frequency: PA+Pa = 1, PB+Pb=1
• Haplotype frequency: PAB=PAPB, Pab=PaPb, so on so forth
• Haplotype frequency reaches the equilibrium stage with one generation of
random matting if the two loci are on different chromosomes
• It takes multiple generation to reach the the equilibrium stage if the two
loci are on the same chromosome
• It takes more generation to move out the linkage disequilibrium stage
with lower recombination rate between the two loci
Linkage equilibrium and Disequilibrium

Linkage equilibrium: haplotype frequencies in a population


have the same value that they would have if the genes at each
locus were combined at random.
Linkage disequilibrium: Non-random association of alleles at
different loci in a given population
Linkage Disequilibrium Quantification

• Linkage equilibrium: PAB=PAPB


D(ifference)=PAB-PAPB=
Linkage Disequilibrium (LD)
Loci and
A a B b
allele
Frequency .6 .4 .7 .3

Gametic type AB Ab aB ab

Observed 0.5 0.1 0.2 0.2

Frequency 0.42 0.18 0.28 0.12


equilibrium

Difference 0.08 -0.08 -0.08 0.08

D = PAB-PAPB =Pab-PaPb
=-(PAb-PAPb) =-(PaB-PaPB )
Lemma

D=PABPab-PAbPaB

Proof
(1): PABPab=(PAPB+D) (PaPb+D)= PAPB PaPb + PAPB D + PaPb D + D2
(2): PAbPaB=(PAPb-D) (PaPB-D)= PAPb PaPB - PAPb D - PaPB D + D2
Subtracting (2) from (1): PABPab-PAbPaB=D(PAPB + PaPb + PAPb + PaPB )=D
D depends on allele frequency
• Vary even with complete LD
• PAb=PaB=0
• PAB=1-Pab=PA=PB
• D=PA-PAPA
Property of D

• Deviation between observed and expected


• Extreme values: -0.25 and 0.25
• Non LD (equilibrium): D=0
• Dependency on allele frequency
Modification of D: D’

• Lewontin (1964) proposed standardizing D to the


maximum possible value it can take:
• D’=D/DMax
• Dmax: =
• Range of D’: 0 to 1
Example
Loci and
A a B b
allele
Frequency .6 .4 .7 .3

Gametic type AB Ab aB ab

Observed 0.5 0.1 0.2 0.2

Frequency 0.42 0.18 0.28 0.12


equilibrium

Difference 0.08 -0.08 -0.08 0.08

• D =PAB-PAPB = 0.08 • D’=D/


• Dmax=min (PAPb, PaPB) Dmax=0.08/0.18=0.44
• =min(.6x.3, .4x.7)
• =0.18
Example (switch A and a)
Loci and
a A B b
allele
Frequency .6 .4 .7 .3

Gametic type aB ab AB Ab

Observed 0.5 0.1 0.2 0.2

Frequency 0.42 0.18 0.28 0.12


equilibrium

Difference 0.08 -0.08 -0.08 0.08

• D =PAB-PAPB = -0.08 • D’=D/Dmax=-0.08/-0.18=0.44


• Dmax=max (-PAPB, -PaPb)
• =max(-.4x.7, -.6x.3)
• =-0.18
R2

• Hill and Robertson (1968) proposed the following


measure of linkage disequilibrium:
• r2 (Δ2)=D2/(PAPBPaPb)
• Square makes positive
• The product of allele frequency creates penalty for
50% allele frequency.
• Range: 0 to 1
Summary of LD statistics

P values D D’ R2
Definition Statistical test PAB-PAPB D/DMax D2/(PAPBPaPb)
(e.g. X2)
Value at 1 0 0 0
equilibrium
Value at 0 -0.25 or 0.25 1 1
complete LD
Disadvantage Dependency Penalty on
on allele neutral loci
frequency
Causes of LD

• Linkage
True association
• Mutation
• Selection
• Inbreeding
Spurious association
• Genetic drift
• Gene flow/admixture
Mutation and selection
Generation 1 A____q A____Q
A____q mutation
A____q A____q
A____q A____q

Generation 2 A____q A____Q


A____Q A____q Selection
A____q A____q

Generation 3 A____Q A____Q


A____Q A____q Selection
A____Q A____q
Change in D over time
• c: recombination rate
• Dt=D0(1-c)t
• t=log(Dt/D0)/log(1-c)
• if c=10%, it takes 6.5 generation for D to be cut in half
• 1Mb=1cM,
• if two SNPs 100kb apart,
• c=1% / 10 = 0.001
• It takes 693 generations for D to be cut in half
Change in D over time

c=.01

c=.05

c=.1
c=.25
Human out of Africa

https://arstechnica.com/science/2015/12/the-human-migration-out-of-africa-left-its-mark-in-mutations/
LD decay over distance
HW equilibrium, Linkage equilibrium and
Linkage disequilibrium

HWE LE LD

Single locus PAA=P2

Association
LD Decay
Multiple locus PAB=PAPB PAB=PAPB PAB!=PAPB
Same chromosome
different chromosome
Estimation and plotting of LD decay over distance from the LD
results from TASSEL
Code to plot in R using GGplot:
Principal Components Analysis ( PCA)
• An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
• Can be used to:
• Reduce number of dimensions in data
• Find patterns in high-dimensional data
• Visualize data of high dimensionality
• Example applications:
• Face recognition
• Image compression
• Gene expression analysis

84
Principal Components Analysis Ideas ( PCA)

• Does the data set ‘span’ the whole of d dimensional space?


• For a matrix of m samples x n genes, create a new covariance matrix
of size n x n.
• Transform some large number of variables into a smaller number of
uncorrelated variables called principal components (PCs).
• developed to capture as much of the variation in data as possible

85
Principal Component Analysis
 See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_tutorials/princi
pal_components.pdf
X2

Y1

Y2
x
x
x
Note: Y1 is the x xx
x x
first eigen vector, x
x x
x
x
Y2 is the second. x x
Y2 ignorable. x
x x x X1
x x Key observation:
x x
x x variance = largest!

86
Principal Component Analysis: one
attribute first Temperature
42
40
24
• Question: how much 30
spread is in the data 15
along the axis? 18
(distance to the mean) 15
30
• Variance=Standard 15
deviation^2 n
30

(X
35
i  X) 2
30
s2  i 1 40
(n  1) 30

87
Now consider two dimensions
X=Temperature Y=Humidity
Covariance: measures the 40 90
correlation between X and Y 40 90
• cov(X,Y)=0: independent
40 90
•Cov(X,Y)>0: move same dir
•Cov(X,Y)<0: move oppo dir 30 90
15 70
15 70
15 70
30 90
15 70
n 30 70
(X
i 1
i  X )(Yi  Y ) 30 70
cov( X , Y )  30 90
(n  1)
40 70
30 90
88
More than two attributes: covariance matrix
• Contains covariance values between all possible
dimensions (=attributes):

C nxn
 (cij | cij  cov( Dimi , Dim j ))
• Example for three attributes (x,y,z):

 cov( x, x) cov( x, y ) cov( x, z ) 


 
C   cov( y, x) cov( y, y ) cov( y, z ) 
 cov( z , x) cov( z , y ) cov( z , z ) 
 
89
Eigenvalues & eigenvectors
• Vectors x having same direction as Ax are called
eigenvectors of A (A is an n by n matrix).
• In the equation Ax=x,  is called an eigenvalue of A.

 2 3   3  12   3
  x      4 x 
 2 1  2  8   2

90
Eigenvalues & eigenvectors

• Ax=x  (A-I)x=0
• How to calculate x and :
• Calculate det(A-I), yields a polynomial (degree n)
• Determine roots to det(A-I)=0, roots are eigenvalues 
• Solve (A- I) x=0 for each  to obtain eigenvectors x

91
Principal components
• 1. principal component (PC1)
• The eigenvalue with the largest absolute value will indicate
that the data have the largest variance along its
eigenvector, the direction along which there is greatest
variation
• 2. principal component (PC2)
• the direction with maximum variation left in data,
orthogonal to the 1. PC
• In general, only few directions manage to capture
most of the variability in the data.

92
Steps of PCA
• Let Xbe the mean
• For matrix C, vectors e
vector (taking the mean (=column vector) having
of all rows) same direction as Ce :
• Adjust the original data • eigenvectors of C is e such
that Ce=e,
by the mean •  is called an eigenvalue of
X’ = X – C.
X
• Compute the covariance • Ce=e  (C-I)e=0
matrix C of adjusted X
• Most data mining packages
• Find the eigenvectors do this for you.
and eigenvalues of C.

93
Eigenvalues
• Calculate eigenvalues  and eigenvectors x for
covariance matrix:
• Eigenvalues j are used for calculation of [% of total variance]
(Vj) for each component j:

j n
V j  100  n  x n
 x
x 1
x 1

94
Principal components - Variance

25

20

Variance (%) 15

10

0
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10

95
Transformed Data
• Eigenvalues j corresponds to variance on each
component j
• Thus, sort by j
• Take the first p eigenvectors ei; where p is the number of
top eigenvalues
• These are the directions with the largest variances
 yi1   e1  xi1  x1 
    
 yi 2   e2  xi 2  x2 
 ...    ...  
    ... 
 y   e  x  x 
 ip   p  in n  96
An Example Mean1=24.1
Mean2=53.8

X1 X2 X1' X2' 100


90
80
70

19 63 -5.1 9.25 60
50 Series1
40
30

39 74 14.9 20.25 20
10
0
0 10 20 30 40 50
30 87 5.9 33.25
40
30 23 5.9 -30.75 30

20

10
15 35 -9.1 -18.75 0 Series1
-15 -10 -5 0 5 10 15 20
-10

15 43 -9.1 -10.75 -20

-30

-40
15 32 -9.1 -21.75

97
30 73 5.9 19.25
Covariance Matrix
75 106
• C=
106 482

• Using MATLAB, we find out:


• Eigenvectors:
• e1=(-0.98,-0.21), 1=51.8
• e2=(0.21,-0.98), 2=560.2
• Thus the second eigenvector is more important!

98
If we only keep one dimension: e2
0.5
yi
0.4
-10.14
0.3
-16.72
• We keep the dimension 0.2
0.1 -31.35
of e2=(0.21,-0.98) 0 31.37
-40 -20 -0.1 0 20 40 4
• We can obtain the final -0.2 16.46
4
data as
-0.3
-0.4 8.624
-0.5
19.40
4
-17.63

 xi1 
yi  0.21  0.98   0.21* xi1  0.98 * xi 2
 xi 2 

99
100
101
102
PCA –> Original Data
• Retrieving old data (e.g. in data compression)
• RetrievedRowData=(RowFeatureVectorT x FinalData)
+OriginalMean
• Yields original data using the chosen components

103
Principal components
• General about principal components
• summary variables
• linear combinations of the original variables
• uncorrelated with each other
• capture as much of the original variance as possible

104
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned : genes may correlate with multiple
clusters
• Alternative: determination of gene’s closest neighbours

105
Two Way (Angle) Data Analysis
Genes 103–104 Conditions 101–102

Genes 103-104
Samples 101-102

Gene expression Gene expression


matrix matrix

Sample space analysis Gene space analysis

106
PCA - example

107
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes
reduced to 2

108
PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes
reduced to 2

109
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2

110
PCA Analysis
For genotype & Phenotype practice data
• https://
avikarn.com/image/gwas/sample_Genotype-Chr-1_data_forTASSEL.h
mp.txt
• https://
avikarn.com/image/gwas/sample_Phenotype_data_forTASSEL.txt
• https://www.panzea.org/
• http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip
Another Way for running MLM:
Structure tool to Tassel tool
Generally burnin period set to 10000, and MOMC Reps= 100000
For which K-Value?
Inferred ancestory for Q matrix
Q-Matrix from structure
From Sequence data we make Kinship
centered_IBS_Filtered file
We do intersection join for genotype data+
phenotype data+ q-matrix
Result MLM-Statistics file can be analysed
for Manhattan Plot and QQ-Plot
QQ Plot Interpretation
• This plot provide information on two main aspects of
GWAS data: whether the statistical testing is well
controlled for challenges such as population
stratification & whether there is any association.
• QQ-Plots measures and compares the p-values expected
to be seen when testing for association & those actually
observed.
• Each dot represent SNP
• X-Axis shows: Expected –log10(p)
• Y-Axis shows: Observed –log10(p)
• The red line shows pattern of –log10(p) value if no SNP
have significant genetic association with the trait.
• When there are significant associations between SNP
markers and the traits then SNP Dots (Blue color) rise
off the line
Manhattan plot Interpretation
• A scatter plot used to show on
which chromosomes have any
significantly associated SNPs
based on their p-value.
• In this genomic coordinates are
displayed on x-axis and negative
logarithm of association p-values
for each SNP on y-axis.
• Each dot signifies a SNP
• Strongest association have
smaller p-values (10-15)
•Thank You

You might also like