Tassel Working For GWAS
Tassel Working For GWAS
Tassel Working For GWAS
(GWAS)
BY USING TASSEL 5.0
HOD-Department of Biochemistry
Kalinga University
GWAS INTRODUCTION
• In genomics, a genome-wide association study (GWA study, or
GWAS), is an observational study of a genome-wide set of genetic
variants in different individuals to see if any variant is associated
with a trait.
• GWA studies typically focus on associations between single-
nucleotide polymorphisms (SNPs) and traits like major human
diseases, but can equally be applied to any other genetic variants
and any other organisms.
• Aware about some software's for conducting GWAS are: Structure,
Plink, Tassel, R-Studio based r-commands….etc etc.
Download & Install Tassel 5.0
• https://tassel.bitbucket.io/ Webpage link
Webpage view
Preparing the Input files
A. Phenotype file
• Prepare the phenotype file as shown below in the figure
• Note: Please remember if your data has covariates such as sex, age or treatment,
then, please categories them with header name factor.
B. Genotype file
• TASSEL allows various genotype file formats such as VCF (variant call
format), .hmp.txt, and plink. In this tutorial, I am using the hmp.txt version of the
genotype file. The below is the screenshot of the hmp.txt genotype file.
Importing phenotype and genotype files
• Import the files by following the steps shown below. Tip! Both files
can be opened at same time holding CTRL and clicking the file names.
Phenotype Data
Phenotype distribution plot
Genotype summary analysis
• Steps to filter markers based on read depth, Minor and Major allele
frequency (MAF) are shown below:
• In the video, markers were filtered based on listed parameters:
• 100 minimum count of 545 sequences (Its the number of times a
particular allele was seen for that locus)
• 0.05 Minor allele Frequency (set filter thresholds for rare alleles)
• 0.95 Major allele frequency (set filter thresholds to remove
monomorphic markers)
Conduct GWAS analysis
: Multidimensional scaling (MDS)
MDS output can be used as the covariate in the GLM or MLM
to correct for population structure. Please follow the steps
shown below:
Intersecting the files
• Intersect the genotype, phenotype and MDS files by following the
steps below:
running General Linear Model (GLM)
• Run the GLM analysis by selecting the intersected files following the
steps below:
The output of the GLM analyis is produced ubder the
Result node. The GLM association test can be evaluated
by plotting Q-Q plot and the Manhattan plot as shown
below.
Mixed Linear Model (MLM)
• Calculating Kinship matrix
• Follow the below steps to calcuate the kinship matrix:
MLM
running Mixed Linear Model (MLM)
• MLM model includes the PCA and the kinship matrix i.e.
MLM(PCA+K).
• Therefore, once the Kinship matrix has been calculated, MLM can be
now be conducted by following below steps:
Exporting results
• Give the output from GLM and or MLM analysis, one calcuate the
adjusted p-values using one of the frequenlty comparison methods:
Bonferroni and False Discovery Rate (FDR)in R using the below code.
# Import GLM or MLM stats file.
# Check data
head(glm_stats)
# Import R library
library(dplyr)
View(adj_glm)
# QQ plot
library(qqman)
# import data
adj_glm_KRN_4 <- read.csv("adj_p_GLM.csv", header = T)
par(mfrow=c(1,3))
qq(adj_glm_KRN_4$p, main = "non-adjusted P-value")
qq(adj_glm_KRN_4$p_Bonferroni, main = "Bonferroni")
qq(adj_glm_KRN_4$p_FDR, main = "FDR")
par(mfrow=c(1,1))
The Hardy–Weinberg (HD) principle
• Allele and genotype frequencies in a population will remain
constant from generation to generation in the absence of other
evolutionary influences.
• These influences include non-random mating, mutation,
selection, genetic drift, gene flow and meiotic drive.
• Allele frequency: f(A)=p, f(a)=q
• Genotype frequency: f(AA)=p2, f(aa)=q2, f(Aa)=2pq
• Both allele and genotype frequency remain unchanged: Hardy-
Weinberg equilibrium
HD principle for two loci
• First locus: A and a alleles; Second locus: B and b alleles
• Allele frequency: PA+Pa = 1, PB+Pb=1
• Haplotype frequency: PAB=PAPB, Pab=PaPb, so on so forth
• Haplotype frequency reaches the equilibrium stage with one generation of
random matting if the two loci are on different chromosomes
• It takes multiple generation to reach the the equilibrium stage if the two
loci are on the same chromosome
• It takes more generation to move out the linkage disequilibrium stage
with lower recombination rate between the two loci
Linkage equilibrium and Disequilibrium
Gametic type AB Ab aB ab
D = PAB-PAPB =Pab-PaPb
=-(PAb-PAPb) =-(PaB-PaPB )
Lemma
D=PABPab-PAbPaB
Proof
(1): PABPab=(PAPB+D) (PaPb+D)= PAPB PaPb + PAPB D + PaPb D + D2
(2): PAbPaB=(PAPb-D) (PaPB-D)= PAPb PaPB - PAPb D - PaPB D + D2
Subtracting (2) from (1): PABPab-PAbPaB=D(PAPB + PaPb + PAPb + PaPB )=D
D depends on allele frequency
• Vary even with complete LD
• PAb=PaB=0
• PAB=1-Pab=PA=PB
• D=PA-PAPA
Property of D
Gametic type AB Ab aB ab
Gametic type aB ab AB Ab
P values D D’ R2
Definition Statistical test PAB-PAPB D/DMax D2/(PAPBPaPb)
(e.g. X2)
Value at 1 0 0 0
equilibrium
Value at 0 -0.25 or 0.25 1 1
complete LD
Disadvantage Dependency Penalty on
on allele neutral loci
frequency
Causes of LD
• Linkage
True association
• Mutation
• Selection
• Inbreeding
Spurious association
• Genetic drift
• Gene flow/admixture
Mutation and selection
Generation 1 A____q A____Q
A____q mutation
A____q A____q
A____q A____q
c=.01
c=.05
c=.1
c=.25
Human out of Africa
https://arstechnica.com/science/2015/12/the-human-migration-out-of-africa-left-its-mark-in-mutations/
LD decay over distance
HW equilibrium, Linkage equilibrium and
Linkage disequilibrium
HWE LE LD
Association
LD Decay
Multiple locus PAB=PAPB PAB=PAPB PAB!=PAPB
Same chromosome
different chromosome
Estimation and plotting of LD decay over distance from the LD
results from TASSEL
Code to plot in R using GGplot:
Principal Components Analysis ( PCA)
• An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
• Can be used to:
• Reduce number of dimensions in data
• Find patterns in high-dimensional data
• Visualize data of high dimensionality
• Example applications:
• Face recognition
• Image compression
• Gene expression analysis
84
Principal Components Analysis Ideas ( PCA)
85
Principal Component Analysis
See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_tutorials/princi
pal_components.pdf
X2
Y1
Y2
x
x
x
Note: Y1 is the x xx
x x
first eigen vector, x
x x
x
x
Y2 is the second. x x
Y2 ignorable. x
x x x X1
x x Key observation:
x x
x x variance = largest!
86
Principal Component Analysis: one
attribute first Temperature
42
40
24
• Question: how much 30
spread is in the data 15
along the axis? 18
(distance to the mean) 15
30
• Variance=Standard 15
deviation^2 n
30
(X
35
i X) 2
30
s2 i 1 40
(n 1) 30
87
Now consider two dimensions
X=Temperature Y=Humidity
Covariance: measures the 40 90
correlation between X and Y 40 90
• cov(X,Y)=0: independent
40 90
•Cov(X,Y)>0: move same dir
•Cov(X,Y)<0: move oppo dir 30 90
15 70
15 70
15 70
30 90
15 70
n 30 70
(X
i 1
i X )(Yi Y ) 30 70
cov( X , Y ) 30 90
(n 1)
40 70
30 90
88
More than two attributes: covariance matrix
• Contains covariance values between all possible
dimensions (=attributes):
C nxn
(cij | cij cov( Dimi , Dim j ))
• Example for three attributes (x,y,z):
2 3 3 12 3
x 4 x
2 1 2 8 2
90
Eigenvalues & eigenvectors
• Ax=x (A-I)x=0
• How to calculate x and :
• Calculate det(A-I), yields a polynomial (degree n)
• Determine roots to det(A-I)=0, roots are eigenvalues
• Solve (A- I) x=0 for each to obtain eigenvectors x
91
Principal components
• 1. principal component (PC1)
• The eigenvalue with the largest absolute value will indicate
that the data have the largest variance along its
eigenvector, the direction along which there is greatest
variation
• 2. principal component (PC2)
• the direction with maximum variation left in data,
orthogonal to the 1. PC
• In general, only few directions manage to capture
most of the variability in the data.
92
Steps of PCA
• Let Xbe the mean
• For matrix C, vectors e
vector (taking the mean (=column vector) having
of all rows) same direction as Ce :
• Adjust the original data • eigenvectors of C is e such
that Ce=e,
by the mean • is called an eigenvalue of
X’ = X – C.
X
• Compute the covariance • Ce=e (C-I)e=0
matrix C of adjusted X
• Most data mining packages
• Find the eigenvectors do this for you.
and eigenvalues of C.
93
Eigenvalues
• Calculate eigenvalues and eigenvectors x for
covariance matrix:
• Eigenvalues j are used for calculation of [% of total variance]
(Vj) for each component j:
j n
V j 100 n x n
x
x 1
x 1
94
Principal components - Variance
25
20
Variance (%) 15
10
0
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
95
Transformed Data
• Eigenvalues j corresponds to variance on each
component j
• Thus, sort by j
• Take the first p eigenvectors ei; where p is the number of
top eigenvalues
• These are the directions with the largest variances
yi1 e1 xi1 x1
yi 2 e2 xi 2 x2
... ...
...
y e x x
ip p in n 96
An Example Mean1=24.1
Mean2=53.8
19 63 -5.1 9.25 60
50 Series1
40
30
39 74 14.9 20.25 20
10
0
0 10 20 30 40 50
30 87 5.9 33.25
40
30 23 5.9 -30.75 30
20
10
15 35 -9.1 -18.75 0 Series1
-15 -10 -5 0 5 10 15 20
-10
-30
-40
15 32 -9.1 -21.75
97
30 73 5.9 19.25
Covariance Matrix
75 106
• C=
106 482
98
If we only keep one dimension: e2
0.5
yi
0.4
-10.14
0.3
-16.72
• We keep the dimension 0.2
0.1 -31.35
of e2=(0.21,-0.98) 0 31.37
-40 -20 -0.1 0 20 40 4
• We can obtain the final -0.2 16.46
4
data as
-0.3
-0.4 8.624
-0.5
19.40
4
-17.63
xi1
yi 0.21 0.98 0.21* xi1 0.98 * xi 2
xi 2
99
100
101
102
PCA –> Original Data
• Retrieving old data (e.g. in data compression)
• RetrievedRowData=(RowFeatureVectorT x FinalData)
+OriginalMean
• Yields original data using the chosen components
103
Principal components
• General about principal components
• summary variables
• linear combinations of the original variables
• uncorrelated with each other
• capture as much of the original variance as possible
104
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned : genes may correlate with multiple
clusters
• Alternative: determination of gene’s closest neighbours
105
Two Way (Angle) Data Analysis
Genes 103–104 Conditions 101–102
Genes 103-104
Samples 101-102
106
PCA - example
107
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes
reduced to 2
108
PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes
reduced to 2
109
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2
110
PCA Analysis
For genotype & Phenotype practice data
• https://
avikarn.com/image/gwas/sample_Genotype-Chr-1_data_forTASSEL.h
mp.txt
• https://
avikarn.com/image/gwas/sample_Phenotype_data_forTASSEL.txt
• https://www.panzea.org/
• http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip
Another Way for running MLM:
Structure tool to Tassel tool
Generally burnin period set to 10000, and MOMC Reps= 100000
For which K-Value?
Inferred ancestory for Q matrix
Q-Matrix from structure
From Sequence data we make Kinship
centered_IBS_Filtered file
We do intersection join for genotype data+
phenotype data+ q-matrix
Result MLM-Statistics file can be analysed
for Manhattan Plot and QQ-Plot
QQ Plot Interpretation
• This plot provide information on two main aspects of
GWAS data: whether the statistical testing is well
controlled for challenges such as population
stratification & whether there is any association.
• QQ-Plots measures and compares the p-values expected
to be seen when testing for association & those actually
observed.
• Each dot represent SNP
• X-Axis shows: Expected –log10(p)
• Y-Axis shows: Observed –log10(p)
• The red line shows pattern of –log10(p) value if no SNP
have significant genetic association with the trait.
• When there are significant associations between SNP
markers and the traits then SNP Dots (Blue color) rise
off the line
Manhattan plot Interpretation
• A scatter plot used to show on
which chromosomes have any
significantly associated SNPs
based on their p-value.
• In this genomic coordinates are
displayed on x-axis and negative
logarithm of association p-values
for each SNP on y-axis.
• Each dot signifies a SNP
• Strongest association have
smaller p-values (10-15)
•Thank You