Pathway Analysis Bioinformatics Series - Final
Main methods for pathway analysis
• Over-representation analysis
• Gene set enrichment analysis (GSEA)
• Topology-based pathway analysis
• Perturbation signature-based pathway analysis
Over-representation analysis
• Involves comparing a list of differentially expressed genes with a reference database of
pathways to determine if certain pathways are over-represented in the dataset.
• It aims to identify biological pathways that are statistically enriched with a higher number
of genes or proteins of interest than would be expected by chance.
• It provides binary results (significant or not) for each gene set without a continuous
measure of enrichment.
Over-representation analysis
KEGG/Reactome/GO
Over-representation analysis
UC patients who have failed anti-TNF therapy vs UC patients who are anti-TNF naïve
Number of DEGS = 11
[Figure: background of N genes, of which n are annotated to the term and N-n are not]
What is the probability of seeing at least 2 out of the 8 DEGs annotated to this particular
GO term, given the proportion of background genes that are annotated to that term?
What is the probability of getting at least 2 red balls when drawing a sample containing 8 balls (without
replacing the balls back into the bag), when there are 106 red balls in the total population of 14176 balls?
Use hypergeometric probability distribution.
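The urn question above maps directly onto the hypergeometric distribution. A minimal sketch in base R, using the numbers quoted above (the variable names are illustrative, not taken from any package):

```r
# Numbers from the example above
N_background <- 14176  # total background genes (balls in the bag)
n_annotated  <- 106    # background genes annotated to the GO term (red balls)
k_drawn      <- 8      # DEGs tested (balls drawn without replacement)
x_observed   <- 2      # DEGs annotated to the term (red balls drawn)

# P(X >= 2): phyper() returns P(X > q) when lower.tail = FALSE,
# so we pass q = x_observed - 1.
p_at_least_2 <- phyper(q = x_observed - 1,
                       m = n_annotated,
                       n = N_background - n_annotated,
                       k = k_drawn,
                       lower.tail = FALSE)
print(p_at_least_2)
```

This is the same quantity a one-sided Fisher's exact test would report for the corresponding 2x2 table.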
Over-representation analysis using clusterProfiler
> genelist_sig_DE_antiTNF
[1] "REG4" "MUC17" "TFPI2" "PYY" "DEFB4A" "TM4SF20" "REG1B" "C10orf99"
[9] "MS4A12" "LOC100288985" "C4orf7"
Bonferroni correction: use a p value threshold of 0.05 / n, where n is the number of tests (here 0.0000016).
The problem is that it is VERY conservative.
False Discovery Rate (FDR): the expected proportion of false positives amongst all significant results.
Typically, one accepts 5% of the significant results being false positives.
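The difference between the two corrections can be seen on a toy example. The sketch below uses base R's p.adjust() on a handful of made-up p values (the values are hypothetical and only there to illustrate the behaviour); "BH" is the Benjamini-Hochberg FDR procedure.

```r
# Hypothetical raw p values, one per tested gene set
p_raw <- c(0.0000005, 0.0004, 0.003, 0.02, 0.04, 0.3)

# Bonferroni: multiply each p value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate
p_fdr <- p.adjust(p_raw, method = "BH")

result <- data.frame(p_raw, p_bonf, p_fdr,
                     sig_bonf = p_bonf < 0.05,
                     sig_fdr  = p_fdr < 0.05)
print(result)
```

With these made-up values, Bonferroni keeps fewer gene sets than the FDR correction, which is the conservativeness described above.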
• Disadvantages:
Need to carefully select background
Disregards the vast majority of data
Assumes all genes act independently i.e. ignores interactions between genes
Only considers the number of DEGs represented in each pathway rather than their positions within the
pathway
Can lead to many false positives
Does not provide any information on pathway activity
Functional class scoring methods e.g. Gene Set Enrichment Analysis (GSEA)
• GSEA assesses whether the predefined reference set of genes associated with each
pathway from a reference database is collectively upregulated or downregulated in an
experimentally-derived ranked gene list.
• It uses an enrichment score and permutation-based statistics to determine if gene sets
are significantly enriched, providing a continuous measure of enrichment.
• GSEA is suitable for identifying subtle, coordinated changes in gene expression and
doesn't rely on arbitrary cutoffs, which are required for ORA.
• It is particularly useful when studying complex biological conditions with multifaceted
gene regulation.
GSEA
Example gene sets: T cell differentiation, cellular division, heart contraction
Which of these pathways show a non-random distribution across this sorted list?
GSEA uses a Kolmogorov-Smirnov-like running-sum statistic to test this; the p value is obtained by permutation.
GSEA: Enrichment score plot
Positive ES = enrichment at the top of the ranked list
Negative ES = enrichment at the bottom of the ranked list
Leading edge = subset of members within the gene set that contribute most to the enrichment score.
It tells you which genes of a particular gene set are most important in your data.
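To make the enrichment score and leading edge concrete, here is a simplified sketch in base R. It assumes an unweighted running sum (every gene-set member contributes equally), whereas the published GSEA method weights each step by the ranking metric and normalises the score; the gene names and the helper running_es() are hypothetical.

```r
# Toy ranked list, ordered from most up-regulated to most down-regulated
ranked_genes <- paste0("g", 1:10)
gene_set <- c("g1", "g2", "g4")  # hypothetical gene set

# Running sum: step up at each gene-set member, step down otherwise
running_es <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  cumsum(ifelse(in_set, 1 / n_hit, -1 / n_miss))
}

running <- running_es(ranked_genes, gene_set)
peak <- which.max(abs(running))  # position of maximum deviation from zero
es <- running[peak]              # enrichment score

# Leading edge: members before the peak (positive ES) or after it (negative ES)
idx <- if (es >= 0) seq_len(peak) else seq(peak, length(ranked_genes))
leading_edge <- ranked_genes[idx][ranked_genes[idx] %in% gene_set]

# Permutation null: ES of random gene sets of the same size
set.seed(1)
null_es <- replicate(1000, {
  r <- running_es(ranked_genes, sample(ranked_genes, length(gene_set)))
  r[which.max(abs(r))]
})
p_value <- mean(abs(null_es) >= abs(es))

print(es)
print(leading_edge)
print(p_value)
```

Because the three members sit near the top of the list, the running sum peaks early and the ES is positive.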
GSEA
• Disadvantages:
Assumes all genes act independently i.e. ignores interactions between genes
Analyses each pathway independently
Can also result in false positives
No information on pathway activity
Topology-based pathway analysis
• These methods go beyond simply considering the presence or absence of genes in pathways and instead take
into account the network and interaction characteristics of the genes and proteins within pathways.
• Network Structure: Topology-based methods consider the topology, or the network structure, of pathways.
They examine how genes or proteins interact within pathways, including the type and strength of interactions.
• Node Centrality: These methods often incorporate measures of node centrality, such as degree centrality (the
number of connections a node has), betweenness centrality (the importance of a node in connecting other
nodes), or other network centrality metrics to assess the significance of individual genes or proteins within
pathways.
• Pathway Cross-Talk: Topology-based analysis can identify cross-talk between pathways, showing how genes
or proteins in one pathway may also be involved in other related pathways, revealing the interconnected nature
of biological processes.
• Functional Impact: These methods aim to determine the functional impact of specific genes or proteins within
pathways, considering their position and interactions within the network.
• Pathway Rewiring: Topology-based analysis can identify instances where pathways are rewired in response
to genetic mutations or experimental conditions, helping to understand the dynamics of pathway regulation.
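As a small illustration of the node centrality idea above, the sketch below computes degree centrality from an edge list in base R. The pathway, its gene names and its edges are entirely hypothetical; dedicated graph packages offer further measures such as betweenness.

```r
# Hypothetical signalling pathway as a directed edge list
edges <- data.frame(
  from = c("Receptor", "Receptor", "KinaseA", "KinaseB", "KinaseA"),
  to   = c("KinaseA",  "KinaseB",  "TF",      "TF",      "KinaseB"),
  stringsAsFactors = FALSE
)

# Degree centrality: number of connections per node (incoming plus outgoing)
degree <- table(c(edges$from, edges$to))

# Out-degree: number of downstream targets per node
out_degree <- table(edges$from)

print(sort(degree, decreasing = TRUE))
print(out_degree)
```

In this toy network the two kinases have the most connections, so a topology-based method could give changes in them more weight than a method that only counts genes.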
Topology-based pathway analysis: Signalling Pathway Impact Analysis (SPIA)
• The most well-known topology-based pathway analysis method is SPIA.
• Two lines of evidence for differential expression of a pathway are combined:
o ORA
o Pathway topology, reflected in the perturbation factor
The authors assume that a differentially expressed gene at the beginning of a pathway topology (e.g. a receptor in a
signaling pathway) has a stronger effect on the functionality of a pathway than a differentially expressed gene at the
end of a pathway (e.g. a transcription factor in a signaling pathway).
The perturbation factors of all genes are calculated from a system of linear equations and then combined within a
pathway.
• The two lines of evidence, in the form of p-values, are combined into a global p-value,
which is used to rank the pathways.
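One way to combine two independent p-values into a global one is the product-based rule described in the SPIA publication: with c the product of the ORA p-value and the perturbation p-value, the global p-value is c - c * ln(c). The sketch below assumes that rule; the function name combine_p() and the input values are illustrative and this is not the SPIA package's own code.

```r
# Combine an ORA p-value and a perturbation p-value into a global p-value
combine_p <- function(p_nde, p_pert) {
  c_prod <- p_nde * p_pert
  # c - c * ln(c); defined as 0 when the product is 0 to avoid 0 * -Inf
  ifelse(c_prod == 0, 0, c_prod - c_prod * log(c_prod))
}

# Hypothetical pathway with moderate evidence from both sources
p_global <- combine_p(p_nde = 0.05, p_pert = 0.05)
print(p_global)
```

Two individually borderline p-values give a clearly smaller global p-value, which is why pathways supported by both lines of evidence rise in the ranking.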
Topology-based pathway analysis: ROntoTools [1]
[1] Ansari et al. 2016: A Novel Pathway Analysis Approach Based on the Unexplained Disregulation of Genes
Topology-based pathway analysis: ROntoTools
• The measured expression change of a gene in a given phenotype can be seen as the result of influences from
upstream genes superimposed on the dis-regulation incurred by that particular gene itself. We will refer to this
latter quantity as the primary dis-regulation (pDis).
• The diffusion of signals between genes in regulatory networks, called “network propagation,” can be used to
find the active genes and subnetworks as well as the function of the genes in different conditions.
• Here, we are using a similar approach that uses propagation between genes to calculate pDis in order to find
the most impacted pathways.
• We propose a pathway analysis method that focuses on this primary dis-regulation.
Topology-based pathway analysis: ROntoTools
KEGG pathway terms (pathNames)             pPert      pPert.fdr
PPAR signaling pathway                     0.004975   0.014627
MAPK signaling pathway                     0.004975   0.014627
Calcium signaling pathway                  0.004975   0.014627
Cytokine-cytokine receptor interaction     0.004975   0.014627
Chemokine signaling pathway                0.004975   0.014627
NF-kappa B signaling pathway               0.004975   0.014627
HIF-1 signaling pathway                    0.004975   0.014627
Neuroactive ligand-receptor interaction    0.004975   0.014627
Cell cycle                                 0.004975   0.014627
p53 signaling pathway                      0.004975   0.014627
Endocytosis                                0.004975   0.014627
Phagosome                                  0.004975   0.014627
PI3K-Akt signaling pathway                 0.004975   0.014627
Wnt signaling pathway                      0.004975   0.014627
Axon guidance                              0.004975   0.014627
Osteoclast differentiation                 0.004975   0.014627
Focal adhesion                             0.004975   0.014627
ECM-receptor interaction                   0.004975   0.014627
Tight junction                             0.004975   0.014627
Complement and coagulation cascades        0.004975   0.014627
Toll-like receptor signaling pathway       0.004975   0.014627
Topology-based pathway analysis: ROntoTools
Perturbation propagation of the TGFb signaling pathway
Topology-based pathway analysis
• Over 30 tools and methods fall in this category including Pathway-Express, SPIA,
NetGSA, TopoGSA, TopologyGSA, PWEA, PathOlogist, GGEA, cepaORA, cepaGSA,
PathNet, ROntoTools, BLMA etc
Perturbation signature-based pathway analysis (e.g. PROGENy)
• Most pathway approaches make use of either the set (e.g. ORA/GSEA) or infer or
incorporate structure (topology-based methods) of signaling molecules to make
statements about possible activation of a pathway, while signature-based approaches
such as PROGENy consider the genes affected by actually perturbing the pathway
• Aims to infer the activity of signalling pathways based on gene expression data in the
context of genetic or molecular perturbations
• PROGENy leverages a large compendium of pathway-responsive gene signatures
derived from a wide range of different conditions in order to identify genes that are
consistently deregulated when perturbing a particular pathway i.e. identify a common
core of Pathway RespOnsive GENes to a specified set of stimuli
• While this approach has been taken before, previous studies either focused less on
integrating responses from many different cell lines or derived their scores from a much
smaller collection of perturbation experiments.
Perturbation signature-based pathway analysis (e.g. PROGENy)
• They curated a total of 208 different submissions to ArrayExpress/GEO, spanning perturbations of the
11 pathways EGFR, MAPK, PI3K, VEGF, JAK-STAT, TGFb, TNFa, NFkB, Hypoxia, p53-mediated DNA
damage response, and Trail (apoptosis). This consisted of 568 experiments and 2652 microarrays
• They calculated z-scores of gene expression changes for each experiment
• For each pathway, they identified the 100 responsive genes that are most consistently deregulated across
experiments. Interestingly, these responsive genes are specific to the perturbed pathway and have little
overlap with the genes encoding its signaling proteins, highlighting that pathway expression and
pathway activation are distinct processes.
• They used the z-scores of those 100 pathway-responsive genes in a simple yet effective linear model,
called PROGENy, to infer pathway activity from gene expression. In essence, it assesses how closely the
gene expression in a sample matches the expected response pattern of each pathway's responsive genes.
• A scoring algorithm then computes a pathway activity score for each sample. A high score indicates a
high likelihood that the pathway is active in that sample, while a low score suggests pathway inactivity.
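The linear-model idea can be sketched in a few lines of base R: a pathway score for a sample is the weighted sum of that sample's expression values over the pathway's responsive genes. Everything below (gene names, expression values, weights) is made up for illustration and is much smaller than the real model; it is a simplified sketch, not the PROGENy implementation.

```r
# Hypothetical expression matrix (genes x samples), e.g. scaled expression
expr <- matrix(c( 2, 1, -1, 0,    # sample S1
                 -1, 0,  2, 1),   # sample S2
               nrow = 4,
               dimnames = list(c("g1", "g2", "g3", "g4"), c("S1", "S2")))

# Hypothetical weight matrix (genes x pathways); 0 = gene not responsive
weights <- matrix(c(1, 0.5, -1,   0,    # TNFa
                    0, 0,    0.5, 1),   # EGFR
                  nrow = 4,
                  dimnames = list(c("g1", "g2", "g3", "g4"), c("TNFa", "EGFR")))

# Pathway activity scores (samples x pathways): weighted sums of expression
scores <- t(expr) %*% weights
print(scores)
```

Sample S1 scores high for the hypothetical TNFa signature because its expression pattern agrees with the weights, including low expression of the negatively weighted gene g3.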
PROGENy advantages
• Advantages:
Infer pathway activity: Able to more accurately infer activity of signalling pathways by providing
continuous pathway activity scores, allowing for a more granular assessment of pathway
involvement. In contrast, GSEA and ORA only tell you whether a pathway is enriched or not
enriched.
Capturing Complex Changes: Perturbation-based analysis can capture subtle and complex
changes in gene expression patterns, which may not be easily detected by binary methods like
ORA. It provides a more nuanced understanding of how pathways are altered.
• Disadvantages:
Computationally more intense
Coverage is low (n = 14 pathways currently in PROGENy)
Perturbation-based pathway analysis (PROGENy)
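PROGENy can be accessed through decoupleR. To run decoupleR methods, we need an input matrix (mat), an input prior
knowledge network/resource (net), and the names of the columns of net that we want to use.
library(decoupleR)
# Prior knowledge network: top 500 responsive genes per pathway (human)
net <- get_progeny(organism = 'human', top = 500)
# Expression matrix: drop rows with missing values, then convert to a matrix
counts <- as.matrix(na.omit(UNIFI_counts))
# Differential expression statistic: third column of DE_TNF, kept as a one-column matrix
deg <- as.matrix(DE_TNF[, 3, drop = FALSE])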
[Figure: sample labels Patient 1 to Patient 6]
PROGENy pathway analysis in anti-TNF treated vs anti-TNF untreated UC patients
• The t value is a measure of the strength and direction of the differential expression of a gene between groups
• PROGENy assigns a weight to each responsive gene of a pathway. The weights are derived from the perturbation experiments and reflect how strongly,
and in which direction, the gene's expression responds when the pathway is perturbed. The weight represents the contribution of the gene to the
overall pathway activity score.
• A negative weight for a gene within a pathway means that the expression of that gene is negatively correlated with the activity of the pathway. In other
words, when the pathway is activated, the expression of this gene tends to fall. This alone does not show that the gene itself inhibits the pathway.
Pathway responsive genes and their strength of differential expression in anti-TNF treated vs untreated
Which pathway analysis method is the best?
• Here, for the first time, we present a comparison of the performances of 13 representative
pathway analysis methods on 86 real data sets from two species: human and mouse.
• Aimed to answer the following questions:
i. is there any difference in performance between non-topology-based (non-TB) and topology-based (TB) methods?
ii. is there a method that is consistently better than the others in terms of its ability to
identify target pathways, accuracy, sensitivity, specificity, and the area under the
receiver operating characteristic curve (AUC)?
iii. are there any specific pathways that are biased (in the sense of being more likely
or less likely to be significant across all methods)?
iv. do specific methods have a bias toward specific pathways (e.g., is pathway X likely
to be always reported as significant by method Y)?
Which pathway analysis method is the best?
• Ranking methods by how well they recover a known target pathway focuses solely on
one true positive. We do not know what other pathways are also truly impacted and
therefore cannot evaluate other criteria such as the accuracy, specificity, sensitivity,
and the AUC of a method. Here, we use data sets from knockout (KO) experiments,
where the source of the perturbation is known, i.e., the KO gene.
• We consider pathways containing the KO gene as true positives and the
others as true negatives.
• Subsequently, we calculate the accuracy, sensitivity, specificity, and AUC of
methods studied using 11 KO data sets.
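The evaluation described above can be sketched in base R. The ten pathways, their membership flags and their adjusted p values below are hypothetical; only the definitions (pathways containing the KO gene are true positives, all others true negatives) come from the text.

```r
# Hypothetical results of one method on one KO data set
results <- data.frame(
  pathway     = paste0("P", 1:10),
  contains_ko = c(TRUE, TRUE, TRUE, rep(FALSE, 7)),  # ground truth
  p_adj       = c(0.01, 0.20, 0.50, 0.03, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95)
)

# A pathway is "called" significant at an adjusted p value below 0.05
called <- results$p_adj < 0.05

tp <- sum(called & results$contains_ko)    # true positives
fn <- sum(!called & results$contains_ko)   # false negatives
fp <- sum(called & !results$contains_ko)   # false positives
tn <- sum(!called & !results$contains_ko)  # true negatives

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / nrow(results)

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```

Because most pathways do not contain the KO gene, a method can reach high accuracy and specificity while still having low sensitivity.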
Which pathway analysis method is the best?
• ROntoTools and PADOG have the
highest median value of accuracy
(0.91).
• ROntoTools also has the highest
median value of specificity (0.94).
• All methods show rather low
sensitivity. Among them, KS is the
best one with the median value of
sensitivity of 0.2.
• AUC is the most comprehensive and
important one because it combines
both the sensitivity and specificity
across all possible thresholds
• In conclusion, TB methods outperform
non-TB methods in all aspects,
namely ranks and p values of target
pathways, and the AUC. Moreover,
the results suggest that there is still
room for improvement since the ranks
of target pathways are still far from
optimal in both groups.
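The AUC mentioned above can be computed without choosing a threshold: it equals the probability that a randomly chosen true positive pathway receives a smaller p value than a randomly chosen true negative one. A minimal sketch in base R using the rank-sum formulation, on hypothetical p values:

```r
# Hypothetical p values for true positive and true negative pathways
p_pos <- c(0.01, 0.20, 0.50)
p_neg <- c(0.03, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95)

# Convert to scores where larger means "more likely impacted"
score <- 1 - c(p_pos, p_neg)
is_pos <- c(rep(TRUE, length(p_pos)), rep(FALSE, length(p_neg)))

# Rank-sum (Mann-Whitney) form of the AUC; ties would receive average ranks
r <- rank(score)
n_pos <- sum(is_pos)
n_neg <- sum(!is_pos)
auc <- (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of true positives from true negatives.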
Are some pathways
particularly biased during
pathway analysis?
• They created a true null hypothesis by using simulated data
sets that are constructed by randomly selected healthy
samples from the 75 aforementioned data sets.
• Then applied each method more than 2000 times, each
time on a different simulated data set.
• Each pathway for each method then has an empirical null
distribution of p values resulting from those 2000 runs.
• When the null hypothesis is true, p values obtained from
any sound statistical test should be uniformly distributed
between 0 and 1.
• A null distribution of p values of a pathway generated by a
method skewed to the right (biased toward 0) shows that
this method has a tendency to yield low p values and
therefore report the pathway as significantly impacted even
when it is not (false positive).
• A null distribution of p values of a pathway skewed to the
left (biased toward 1) indicates that the given method tends
to produce consistently higher p values and thus possibly reports
this pathway as insignificant when it is indeed impacted
(false negative).
[Figure: histograms of null p values (y-axis: frequency); example panel: false positive]
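The idea of an empirical null distribution can be illustrated with a small simulation in base R. As a stand-in for a pathway analysis method, the sketch below uses a two-sample t-test on random data with no real group difference; the sample sizes and number of repetitions are arbitrary choices for illustration.

```r
set.seed(42)

# Apply the test 2000 times to data where the null hypothesis is true
null_p <- replicate(2000, {
  x <- rnorm(20)                     # 20 "healthy" samples, no real effect
  t.test(x[1:10], x[11:20])$p.value  # arbitrary split into two groups
})

# For a sound test, about 5% of null p values should fall below 0.05
frac_sig <- mean(null_p < 0.05)

# Kolmogorov-Smirnov test of the null p values against Uniform(0, 1)
uniformity <- ks.test(null_p, "punif")

print(frac_sig)
print(uniformity$p.value)
hist(null_p, breaks = 20, main = "Null distribution of p values",
     xlab = "p value", ylab = "frequency")
```

A histogram piled up near 0 would indicate a tendency toward false positives, and one piled up near 1 a tendency toward false negatives, as described above.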
Which pathway analysis method is the best?
• The number of biased pathways is at least 66 for
all the methods compared in this work, except
GSEA, which has no biased pathway.
• The figure shows that performing pathway
analysis using the FE test produces the highest
number (137 out of 150 pathways) of false
positives (biased toward 0); this is followed by
the WRS test (114 out of 150 pathways) and
CePaGSA (112 out of 186 pathways). On the
other hand, GSEA and PathNet produce no false
positive pathways.
• Similarly, the numbers of pathways biased toward 1
(false negatives) produced by different methods are
shown in Fig. 6c. PathNet produces the highest
number (129 out of 130 pathways) of false
negative pathways. No false negative pathways
are identified while performing pathway analysis
using GSEA, CePaGSA, WRS test, and FE test.
[Figure panels: the numbers of pathways biased toward 0 (false positives); the numbers of pathways biased toward 1 (false negatives)]
Which pathway analysis method is the best?
• The resulting graph indicates that there is
no such "ideal" unbiased pathway. Each
pathway is biased by at least 2 out of 13
investigated methods.
• Some pathways are biased by as many as
12 methods (out of 13 methods). The
common characteristic of these most
biased pathways is that they are small in
size (less than 50 genes), except for “PPAR
signaling pathway” (259 genes) and
“Complement and coagulation cascades”
(102 genes). In contrast, all pathways in the
top 10 least biased have more than 200
genes and up to 2806 genes.
• In essence, small pathways are generally
more likely to be biased than larger ones.