Abstract
Background
Pathway enrichment analysis is extensively used in high-throughput
experimental studies to gain insight into the functional roles of
pre-defined subsets of genes, proteins and metabolites. Methods that
leverages information on the topology of the underlying pathways
outperform simpler methods that only consider pathway membership,
leading to improved performance. Among all the proposed software tools,
there’s the need to combine high statistical power together with a
user-friendly framework, making it difficult to choose the best method
for a particular experimental environment.
Results
We propose SEMgsa, a topology-based algorithm developed into the
framework of structural equation models. SEMgsa combine the SEM p
values regarding node-specific group effect estimates in terms of
activation or inhibition, after statistically controlling biological
relations among genes within pathways. We used SEMgsa to identify
biologically relevant results in a Coronavirus disease (COVID-19)
RNA-seq dataset (GEO accession: [27]GSE172114) together with a
frontotemporal dementia (FTD) DNA methylation dataset (GEO accession:
[28]GSE53740) and compared its performance with some existing methods.
SEMgsa is highly sensitive to the pathways designed for the specific
disease, showing low p values (
[MATH: <0.001 :MATH]
) and ranking in high positions, outperforming existing software tools.
Three pathway dysregulation mechanisms were used to generate simulated
expression data and evaluate the performance of methods in terms of
type I error followed by their statistical power. Simulation results
confirm best overall performance of SEMgsa.
Conclusions
SEMgsa is a novel yet powerful method for identifying enrichment with
regard to gene expression data. It takes into account topological
information and exploits pathway perturbation statistics to reveal
biological information. SEMgsa is implemented in the R package
SEMgraph, easily available at
[29]https://CRAN.R-project.org/package=SEMgraph.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12859-022-04884-8.
Keywords: Pathway enrichment analysis, Pathway topology, SEM, SEMgsa,
Sensitivity, Prioritization, Type I error, Power
Background
Biomedical research has been transformed by recent advances in
high-throughput technologies, enabling extensive monitoring of complex
biological systems. As a result, new methodological developments have
emerged, most notably the adaptation of systems perspectives to analyze
biological systems. Pathway enrichment has become a key tool in the
analytic pipeline for Omics data and has been effectively used to
generate novel biological hypotheses and determine if specific pathways
are linked to specific phenotypes. In the literature, dozens of
strategies have been developed, varying in model complexity and
effectiveness.
Earlier methodologies, such as over-representation analysis (ORA)
[[30]1] and gene set analysis (GSA) [[31]2, [32]3], treat each pathway
as a collection of biomolecules, as [[33]4] point out in their review
paper. The ORA approach used a list of differentially expressed (DE)
genes as input to determine which sets of DE genes are over-represented
or under-represented, being strongly reliant on the criteria used to
choose the DE genes, such as the statistical tests and thresholds
utilized.
A second generation of approaches was created to reduce this reliance
on gene selection criteria by taking into account all gene expression
values. The hypothesis behind these approaches is that small yet
coordinated changes in groups of functionally related genes may be
crucial in biological processes. These methods are named functional
class scoring methodologies (FCS) [[34]5]. Such methods include gene
set enrichment analysis (GSEA) [[35]2], gene set analysis (GSA) [[36]6]
and correlation adjusted mean rank gene set test (CAMERA) [[37]7] among
others.
ORA and FCS methods can be referred to as the first two generations of
pathway enrichment analysis approaches. However, when pathways are seen
as a basic unstructured and unordered collection of genes, all the
genetic connections and interactions that are supposed to capture and
characterize the actual processes at hand are simply neglected.
With the aim of including all of this additional information into the
analysis, topology-based (TB) approaches have been created. These
methods account for interactions between biomolecules and provide
better performance than standard second-generation methods [[38]2,
[39]3]. A variety of tools have been implemented, such as DEGraph
[[40]6], topologyGSA [[41]8], NetGSA [[42]9–[43]11], Pathway-Express
[[44]12, [45]13], SPIA [[46]14] among others. The common feature of
approaches in this category is that they use prior knowledge of pathway
topology information to obtain some gene-level statistic, which is then
used to produce a pathway-level statistic, which is then used to rank
the pathways.
The goal of pathway enrichment approaches is to compare the ’activity’
of interest pathways across two or more biological situations or groups
of specimens (patients, cell lines, etc.). Another technical feature
that distinguished pathway enrichment methods is the type of the
statistical null hypothesis being tested. The majority of approaches
may be divided into two categories: those that test (I) self-contained
null hypotheses and (II) competitive null hypotheses [[47]15]. A
self-contained null hypothesis examines the activity of each pathway
across biological situations (for example, normal vs. illness samples)
without comparing it to the activity of other biomolecules/pathways. On
the other hand, the activity of each pathway is compared to that of
other biomolecules/pathways in a competitive null hypothesis. Even if
the competitive null hypothesis has an interesting interpretation,
assessing the significance of the competitive null is challenging,
since tests based on it take into account a framework for gene sampling
that treats genes as independent.
The main contribution of this article is the development of a
self-contained topology-based algorithm developed into the framework of
structural equation models (SEM), called SEMgsa() [[48]16, [49]17].
Evaluation of system perturbation is incorporated in SEM [[50]18],
where the experimental condition is compared to a control one through
the use of an exogeneous group variable acting on every node of the
network. Statistical significance of specific-pathway score is obtained
combining node activation and node inhibition statistics extracted from
SEM model fitting. In addition, unlike existing methods, an overall
status of pathway perturbation of genes between case and control group
has been computed considering both node perturbation and up- or down-
regulation of genes for gaining more biological insights into the
functional roles of predetermined gene subsets.
A second objective of this study is to provide a consistent optimum
solution of any given biological situation. Many topology-based methods
that investigate distinct null hypothesis have been proposed in
literature. We compare five popular pathway analysis approaches to
SEMgsa(), starting from the most similar ones in terms of multivariate
test and self-contained hypothesis type (DEGraph, NetGSA and
topologyGSA) together with one approach of competitive hypothesis type
(PathwayExpress) and the older approach of over-representation analysis
(ORA). All the methods in this article offer a nice use interface in R.
The aforementioned methods have been applied on observed and simulated
expression data. The ultimate goal of expression data application is to
provide a meaningful comparison of gene set analysis methods in terms
of (i) sensitivity and (ii) prioritization for observed data and (i)
type I error and (ii) power within each simulation run.
The remainder of the article is organized as follows. Firstly, we
describe SEMgsa() features with regard to gene expression data both in
terms of inference procedure and user interface. Then, we outline the
experimental setup constructed to evaluate pathway enrichment methods,
including real data application and simulation design. In the end, we
provide the results together with overall discussion.
Method and implementation
SEM framework
A structural equation model (SEM) [[51]18, [52]19] is a statistical
framework for causal inference based on a system of structural
equations defining a path diagram, represented as a graph
[MATH: G=(V,E) :MATH]
, where V is the set of nodes (i.e., variables) and E is the set of
edges (i.e., connections). The set E may include both directed edges
[MATH: k→jifk∈pa(j) :MATH]
and bidirected edges
[MATH: k↔jifk∈sib(j) :MATH]
. Although in the general setting of SEM latent variables and
non-linear functions can be included [[53]18], we focus on the special
case where the parent set
[MATH: pa(j)
:MATH]
, and the siblings set
[MATH: sib(j)
:MATH]
, determine a system of linear equations, as follows:
[MATH: Yj=∑k∈pa(j)βjkYk+Ujj∈V :MATH]
1
[MATH: cov(Uj;Uk)=ψjkifj=kork∈sib(j)0otherwise :MATH]
2
where
[MATH: Yj :MATH]
and
[MATH: Uj :MATH]
are an observed variable and an unobserved error term, respectively;
[MATH: βjk :MATH]
is a regression (path) coefficient, and a covariance
[MATH: ψjk :MATH]
indicates that errors are dependent, which is assumed when there exists
an unobserved (i.e. latent) confounder between k and j.
Under such a model, the dependence structure among genes provided by
pathways in biological database with directed and/or bi-directed edges
(i.e., KEGG, Reactome, and many others) [[54]20, [55]21], interacting
with each other to generate a single biological effect, can be included
explicitly though the graph,
[MATH: G=(V,E) :MATH]
and evaluated using local and global statistics.
SEMgsa() procedure adds a binary group (treatment or disease class)
node labeled X to V, and suppose that
[MATH: X={0,1} :MATH]
directly affects the set of genes in the pathway. As bonus, adding
group node and group-genes edges, the pathway with several components
(clusters) and singleton genes induces a connected graph (see
Fig. [56]1). Thus we consider a directed graph
[MATH: G=(V∪X,E∪EX) :MATH]
with the linear structural equations:
[MATH: Yj=βjX+Ujj∈V(x)
:MATH]
3
[MATH: Yj=∑k∈pa(j)βjkYk+βjX+Ujj∈V(y) :MATH]
4
where V(x) and V(y) are the sets of exogenous (i.e., source and
singleton genes) and endogenous (i.e., connectors and sinks) genes,
respectively. The covariances,
[MATH: cov(Uj;Uk) :MATH]
are assumed to be equal to Eq. ([57]2).
Fig. 1.
[58]Fig. 1
[59]Open in a new tab
Visualisation of SEMgsa() procedure starting from Asthma KEGG pathway.
The first graph summarise Asthma network properties, showing a pathway
consisting of 31 nodes, 4 edges and 25 singletons. To maximise pathway
information, SEMgsa procedure adds a binary group node (G = {0, 1})
that directly affects the set of genes in the pathway. In this way, the
pathway with numerous singleton genes is edged with a group node and
group-genes, resulting in a linked graph
Comparing Eq. ([60]1) with Eqs. ([61]3)–([62]4), we note that the added
node X may affect the mean gene expression values, but not their
regression paths or covariances. In the R package SEMgraph these
coefficients can vary by experimental or disease group via two-group
SEM [[63]16, [64]17], but in the following we assume additive group
effects. Coefficients
[MATH: βj :MATH]
(adjusted by the parents of the j-th node) determine the effect of the
group on the j-th node, while the common path coefficients
[MATH: βjk :MATH]
represent regression coefficients, adjusted by parent set and group
effect.
This type of SEM enables the identification of differentially expressed
genes (DEGs) if genes show a statistically significant variation in
their activity (e.g., gene expression) in the experimental group
respect to the control one. In other terms, a test for the null value
of the path,
[MATH: X→Y
:MATH]
is a test of:
[MATH: H0:Yj⊥Xvs.H1:
Yj⊥̸X,j∈V(x)
:MATH]
5
[MATH: H0:Yj⊥pa(Yj),Xvs.H1:<
mi>Yj⊥̸pa(Yj),X,j∈V(y)
:MATH]
6
From Eqs. ([65]5)–([66]6) we note two different tests. Marginal tests
of conventional DEGs analysis [[67]22] for source and singleton genes,
and conditional tests, given the parents, for connectors and sink
genes. Conditioning increases power when there is a direct group
effect, and reduces gene variability. So, pathway topological structure
makes the inference more precise [[68]23].
Maximum likelihood estimates (MLEs) of the paths (
[MATH: X→Y
:MATH]
) can be easily obtained with one of the three algorithms of SEMgraph.
Specifically, the core of model fitting in SEMgsa() function relies on
the residual iterative conditional fitting (RICF), an efficient
iterative algorithm that can be implemented through least squares, with
the advantage of clear convergence properties [[69]24], and
permutation-based p values for testing null hypothesis in Eqs.
([70]5)–([71]6).
p values of group effect (
[MATH: X→Y
:MATH]
) are computed by randomization of group labels comparing the estimated
parameters by RICF with their random resampling distribution after a
sufficiently high number of case/control labels permutations. Accurate
small p value estimations are possible with no need for a large number
of permutations (SEMgsa() makes default = 1000 permutations), using the
moment based approximation proposed by [[72]25]. Once the empirical
distribution of the permuted path coefficients is obtained, the
two-sided p values are extracted from the normal distribution with mean
and standard deviation estimated from the empirical distribution.
From node-wise p values, overall group perturbation on pathway genes
can be computed based on the Brown’s method for combining
non-independent, one-sided significance tests [[73]26]. The method
computes the sum of one-sided p values:
[MATH:
X2=-2∑jlog(pj) :MATH]
, where the direction is chosen according to the alternative hypothesis
(
[MATH: H1 :MATH]
), and the overall p value is obtained from the chi-square distribution
with new degrees of freedom f and a correction factor c to take into
consideration the correlation among p values, resulting in
[MATH:
X2c∼
mo>χ2(f) :MATH]
.
The conversion of two-sided p values in one-sided p values is performed
according to the sign of the path (
[MATH: X→Y
:MATH]
) coefficient,
[MATH: βj :MATH]
:
[MATH: H1:withatleastoneβj>
0⇒pj(+)=pj/2ifβj>
01-p
mi>j/2ifβj<
0 :MATH]
7
[MATH: H1:withatleastoneβj<
0⇒pj(-)=pj/2ifβj<
01-p
mi>j/2ifβj>
0 :MATH]
8
If the overall p value
[MATH: <α :MATH]
(i.e., the significance level), we define node perturbation as
activated when the direction of the alternative hypothesis is positive.
Conversely, the status is inhibited if the direction is negative.
Node-wise p values
[MATH: <α :MATH]
(after correcting for multiple comparisons with one of several
adjustment methods, including Bonferroni or Benjamini–Hochberg
procedure), are used for DEGs identification. While, a single p value
of the two Brown’s p values (
[MATH: p(+) :MATH]
: p-activation, and
[MATH: p(-) :MATH]
: p-inhibition) combined with a Bonferroni procedure [[74]27], i.e.
[MATH: 2×min(p(+);p(-))
:MATH]
, indicates the global pathway perturbation.
In some cases, edge weights are defined in signalling pathways with
discrete values [
[MATH:
-1,0,1 :MATH]
], indicating gene-gene activity derived from biological database (e.g.
KEGG). Usually they are:
[MATH: -1 :MATH]
for repressed or inactive, 0 for neutral, and + 1 for enhanced or
activated. For gaining more biological insights into the functional
roles of prior subset of genes, the sign of the minimum p value between
node activation and inhibition has been retained to assess, in
combination with pathway weights, an overall status of pathway
perturbation of genes between case and control group. In detail, node
perturbation obtained from RICF fitting has been combined with up- or
down-regulation of genes to obtain overall pathway perturbation
classification as displayed in Table [75]1.
Table 1.
Overall pathway perturbation
Up/down regulation Node perturbation Overall perturbation
+ 1 P− (inh) Down act
− 1 P− (inh) Up inh
+ 1 P+ (act) Up act
− 1 P+ (act) Down inh
[76]Open in a new tab
* The weighted adjacency matrix of each pathway was used to determine
the up- or down-regulation of genes (taken from the KEGG database)
as the column sum of weights across each source node. The pathway
is marked as down-regulated if the total sum of the node weights is
less than 1, and otherwise as up-regulated.
* The minimum among the p values determines whether the node
perturbation is activated or inhibited; if positive, the node
perturbation is described as activated, and otherwise as inhibited.
* It is possible to determine the direction (up or down) of gene
perturbation by combining these two quantities. In cases compared
to the control group, an up- or down-regulated gene status that is
associated with node inhibition shows a decrease in activation (or
an increase in inhibition). In contrast, up- or down-regulated gene
status, associated with node activation, leads to an increase in
activation (or decrease in inhibition) in cases relative to control
group.
User interface
The example code of the function SEMgsa() is as follows.
SEMgsa(g = list(), data, group, method = "BH", alpha = 0.05, n_rep =
1000, ...)
The inputs are: a list of pathways to be examined (g); gene expression
data where rows represent subjects, and columns graph nodes (data); a
binary vector with 1 for cases and 0 for control subjects (group).
Optional inputs are the multiple testing correction method (method),
and the significance level (alpha) for DEGs selection, and the number
of randomization replicates for RICF algorithm (
[MATH: n_rep
:MATH]
, default = 1000).
The first step in SEMgsa() workflow is to compute the weighted
adjacency matrix of each pathway, obtain the sum of node weights and
flag the pathway as up- or down-regulated. This is crucial to obtain
the overall pathway perturbation status at the end. Then RICF algorithm
of R package SEMgraph, i.e. SEMrun(graph, data, group, fit = 1, algo =
"ricf"), is applied on data, considering the group binary vector and
the number of specified randomization replicates. More specifically,
SEMrun() takes as input a single graph as an igraph object and has
several additional inputs, including: a numeric value fit indicating
the SEM fitting mode, where
[MATH: fit=1
:MATH]
specifies a “common” model to evaluate group effects on graph nodes;
the MLE method algo is used for model fitting, in this case fitting is
done via RICF(algo = “ricf”).
The covariance matrix could not be semi-definite positive in the
situation of large dimensionality (n.variables > n.subjects), making it
impossible to estimate the parameters. When this occurs, regularization
of the covariance matrix is enabled. SEMrun() uses internally two
functions of the corpcor R package: the is.positive.definite() tests if
the observed covariance matrix is positive, and if the response is
equal to FALSE, the function pcor.shrink() implements the
James-Stein-type shrinkage estimator [[77]28] to regularized the
covariance matrix.
Node-wise group effect p values are extracted from model fitting object
together with the number of DEGs obtained adjusting p values with the
chosen correction method while testing the specified level of alpha.
Then, a data frame of combined SEM results is obtained putting together
node-wise p values with Brown’s method and Bonferroni’s correction.
The output of SEMgsa() is represented by a list containing two objects
with the following information for each pathway in the input list:
* gsa, a dataframe reporting size (No.nodes), DEGs number (No.DEGs),
pathway perturbation status (pert), Brown’s combined p value of
pathway node activation (pNA), Brown’s combined p value of pathway
node inhibition (pNI) and the Bonferroni combination of them
(PVAL). ADJP refer to the pathway combined p value adjusted for
multiple tests with Bonferroni correction, i.e.,
[MATH:
ADJP=m<
mi>in(K×PV
AL;1) :MATH]
, where K is the number of the input pathways.
* DEG, a list with DEGs names for each pathway, selected with p value
< alpha after the multiple correction procedure with one of the
method available in R function p.adjust(). By default, method is
set to “BH” (i.e., Benjamini–Hochberg correction) and the
significance level alpha to 0.05.
To read more about SEMgsa() function, in terms of description, usage,
function arguments and value, refer to
[78]https://rdrr.io/cran/SEMgraph/man/SEMgsa.
Experimental design
Benchmark data
Coronavirus disease (COVID-19) RNA-seq expression data from [[79]29]
(GEO accession: [80]GSE172114) together with Frontotemporal Dementia
(FTD) DNA methylation data (DNAme) from [[81]30] (GEO accession:
[82]GSE53740) have been used as benchmark data. Network information has
been retrieved from kegg.pathways object of the SEMgraph package as a
list of 225 edge weighted igraph objects corresponding to the KEGG
pathways extracted using the ROntoTools R package [[83]31]. Edge
weights are defined with discrete values
[MATH: [-1,0,1]:-1 :MATH]
for inactive gene–gene activity, 0 for neutral, and
[MATH: +1 :MATH]
for activated.
Coronavirus disease (COVID-19)
Coronavirus disease of 2019 (COVID-19) is a highly contagious
respiratory infection that is caused by the severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2). Multiple probes for each Entrez
gene ID were first eliminated. The empirical Bayes technique, as
implemented in the limma R package [[84]32], was used to fit linear
models for differential expression analysis, and p values were adjusted
for multiple testing using the method of Benjamini–Hochberg [[85]33].
This procedure results in a matrix of 69 subjects
[MATH: × :MATH]
14,000 genes. Subjects include patients in the intensive care unit with
Acute Respiratory Distress Syndrome (“critical group”, N = 46) defined
as cases, and those in a non-critical care ward under supplemental
oxygen (“non-critical group”, N = 23) defined as controls. The
expression matrix was finally matched with the corresponding
Coronavirus disease—COVID-19 (hsa05171) KEGG pathway according to its
name. The latter is a graph with 232 nodes and 208 edges, including
five components and 109 sigleton (i.e. node degree = 0). The maximum
subgraph consists of 54 nodes and 83 edges. This pathway was
subsequently labeled as target pathway and its p value and rank were
further investigated for assessing the sensitivity and prioritization
ability of the methods [[86]34, [87]35].
Frontotemporal dementia (FTD)
Frontotemporal Dementia, a neurodegenerative disorder characterized by
cognitive and behavioural impairments [[88]36]. We will use DNAme data
stored in the SEMdata package as ftdDNAme, a list of two objects: a
data matrix of 256 rows (subjects) and 16,560 columns (genes)
containing the value of the first principal component of DNAme levels,
obtained applying a principal component analysis to methylated CpG
sites within the promoter region, for each gene (genes with
unmethylated CpGs in both conditions were discarded); and a binary
group vector of 105 FTD patients (1) and 150 healthy controls (0).
Unlike COVID-19 data, FTD has not a unique KEGG pathway associated to
its name. According to KEGG BRITE database, the term Frontotemporal
lobar degeneration (an alias for FTD; KEGG ID:[89]H00078) is associated
to 6 KEGG pathways: MAPK signaling pathway (hsa04010), Protein
processing in endoplasmic reticulum (hsa04141), Endocytosis (hsa4144),
Wnt signaling pathway (hsa04310), Notch signaling pathway (hsa04330),
and Neurotrophin signaling pathway (hsa04722). We can use the SEMgsa()
function to apply gene set analysis (GSA) on a collection of networks,
exploring the 6 selected FTD pathways as target ones. Thus, the ability
of GSA methods will be investigated on 6 target pathways, combining
results in terms of median p values and ranks for readability purposes.
We refer the reader to the Additional file [90]1 for more detailed
results.
Data simulations
Synthetic data, based on realistic expression data (“[91]Coronavirus
disease (COVID-19)” section), was used to carry out simulations
following the practice in [[92]10]. A subset of pathways
[MATH:
q1<K
:MATH]
out of total K pathways has been chosen to be dis-regulated. Next, a
pre-specified number (s) of genes within each dis-regulated pathway was
chosen to be altered (up or down) according to a topological measure
(betweenness, community, neighbourhood) and different mean signals (
[MATH:
±0.5,±0.6,<
mo>±0.7 :MATH]
). Another subset of
[MATH: q0<(K-q1
msub>) :MATH]
pathways with no dis-regulated genes has been chosen to evaluate
statistical metrics. Finally, the COVID-19 benchmark data matrix is
first normalized, with mean zero and unit variance for each gene within
each group (cases and controls). Then, nine data generation procedures
are executed, according to topological measures and adding mean signal
to the pre-specified genes in the selected dis-regulated pathways. In
summary, the simulation design (
[MATH: 3×3 :MATH]
) with 100 randomization per design levels is reported in Table [93]2.
Table 2.
Summary of simulation design (
[MATH: 3×3 :MATH]
) with 100 randomization per design levels
Topology design Gene regulation Mean signal
[MATH: ± :MATH]
0.5
[MATH: ± :MATH]
0.6
[MATH: ± :MATH]
0.7
Betweenness Up/down 100 100 100
Community Up/down 100 100 100
Neighbourhood Up/down 100 100 100
[94]Open in a new tab
Pathway deregulation
Each data generation procedure starts with the definition of a list of
pathways to be tested. After recalling the list of kegg.pathways
including
[MATH: N=225 :MATH]
signaling pathways from the KEGG database, for efficiency purposes
pathways with a minimum and a maximum number of nodes equal
respectively to 30 and 300 have been filtered out for the analysis.
Then, to speed up computations, the maximum component of each igraph
[[95]37] object corresponding to each selected pathway has been
selected. Given this choice, igraph objects with maximum component
smaller than the
[MATH: 60% :MATH]
of the total graph size have not been considered. This filtering
procedure results into a list of
[MATH: K=117 :MATH]
igraph objects.
We have to alter
[MATH:
q1=10
:MATH]
pathway’s genes in order to deregulate it within a simulation.
Specifically, we consider the 9 KEGG pathways associated to Coronavirus
disease—COVID-19 (hsa05171): Vascular smooth muscle contraction
(hsa04270), Platelet activation (hsa04611), Toll-like receptor
signaling pathway (hsa04620), NOD-like receptor signaling pathway
(hsa04621), RIG-I-like receptor signaling pathway (hsa04622), JAK-STAT
signaling pathway (hsa04630), Natural killer cell mediated cytotoxicity
(hsa04650), Fc gamma R-mediated phagocytosis (hsa04666) and Leukocyte
transendothelial migration (hsa04670).
We looked at three different methods for reflecting pathway topology in
order to assign impacted genes to the deregulated pathways:
betweenness, community and neighborhood, following the practice in
[[96]10, [97]38].
The number of all shortest paths in a directed graph that pass through
a given node is known as its betweenness. The top 10 highest scoring
betweenness nodes were used to choose affected genes in the betweenness
deregulation design. According to the community deregulation design, we
located modules with dense connections between module nodes and spare
connections between module nodes. Given the division of vertices in
each community, the 10 affected genes are then randomly sampled from
the community with the highest proportion of members. In the
neighbourhood deregulation procedure, the vertices not farther than a
given limit from another fixed vertex are called the neighborhood of
the vertex. After computing the neighbourhood of order 2, we sampled
the 10 vertices from the neighbourhood with the biggest size.
Within each of the associated pathways plus the target pathway of
interest, an equal number of genes s has been selected as
‘dysregulated’. We decided to fix the number of affected genes to
[MATH: s=10 :MATH]
for each pathway, with the aim of obtaining equal contribution from
each associated disease (due to the presence of smaller pathway sizes).
However, given that overlapping genes between pathways may occur, the
unique genes out of the total
[MATH:
S≤s×q1=10×10 :MATH]
have been retained as DEGs for pathway dysregulation. Thus, the number
of total dysregulated genes S is not fixed, but changes according to
the chosen topology design. As a note, given the random sampling within
the community deregulation design, the sampled genes change according
to the specified seed. In the end, a weight representing the up- or
down-regulation of genes was kept along with the Entrez gene ID of the
affected genes. The weighted adjacency matrix of each pathway comprises
a column regarding the sum of weights over each source node, which can
be used to determine up- or down-regulation (taken from the KEGG
database). The pathway is marked as down-regulated or up-regulated
depending on whether the total sum of node weights is less than 1. Each
gene’s weight has been extracted in order to obtain an up or down mean
signal, which better reflects variations in the expression of the
impacted genes between the control and treatment groups.
After identifying the subset of DEGs according to the chosen topology
design, pathways with a number of dysregulated genes
[MATH: ≤1 :MATH]
have been selected as true negatives (
[MATH: q0 :MATH]
) to evaluate type I error from simulations. Unlike the
[MATH: q1 :MATH]
number that is fixed to 10 COVID-19 related pathways, the number of
[MATH: q0 :MATH]
pathways changes according to the chosen subset of DEGs.
To summarise, for all topology design, the total altered genes S inside
the
[MATH:
q1=10
:MATH]
pathways are differentially expressed with a mean difference varying
from (
[MATH: ±0.5 :MATH]
,
[MATH: ±0.6 :MATH]
,
[MATH: ±0.7 :MATH]
). Note that the magnitude of the mean signal is expressed relative to
the unit variance of each gene (see [[98]10]).
The simulated expression matrices were directly supplied to SEMgsa()
and DEGraph, topologyGSA, NetGSA, PathwayExpress and ORA algorithms
together with the list of igraph objects corresponding to the chosen
KEGG pathways.
Pathway enrichment methods
Table [99]3 provides an overview of the tested pathway enrichment
methods in terms of null hypothesis, input requirements, pathway
information and availability on R together with main papers for
reference. These methods differ in two main aspects: (i) the type of
null hypothesis, self-contained or competitive; (ii) input data,
expression data or thresholded gene p values. ORA and PathwayExpress
test the competitive null hypothesis of whether the genes in the set of
interest are at most as often DE as the genes not in the set, instead
SEMgsa(), DEGraph, topologyGSA and NetGSA test the self-contained null
hypothesis of no genes in the set of interest are DE. Another major
difference among these methods regards the input requirements. There is
a high sensitivity to the p value cutoff because all techniques based
on testing the competitive null hypothesis must identify DE genes based
on a pre-specified threshold of corrected p values. Without making any
arbitrary decisions on the list of DE genes, all self-contained tests
directly employ expression data.
Table 3.
Overview of tested pathway enrichment methods
Method Null hypothesis Gene p value tresholding Expression data Pathway
R/Bioconductor [References]