Abstract Background Pathway enrichment analysis is extensively used in high-throughput experimental studies to gain insight into the functional roles of pre-defined subsets of genes, proteins and metabolites. Methods that leverages information on the topology of the underlying pathways outperform simpler methods that only consider pathway membership, leading to improved performance. Among all the proposed software tools, there’s the need to combine high statistical power together with a user-friendly framework, making it difficult to choose the best method for a particular experimental environment. Results We propose SEMgsa, a topology-based algorithm developed into the framework of structural equation models. SEMgsa combine the SEM p values regarding node-specific group effect estimates in terms of activation or inhibition, after statistically controlling biological relations among genes within pathways. We used SEMgsa to identify biologically relevant results in a Coronavirus disease (COVID-19) RNA-seq dataset (GEO accession: [27]GSE172114) together with a frontotemporal dementia (FTD) DNA methylation dataset (GEO accession: [28]GSE53740) and compared its performance with some existing methods. SEMgsa is highly sensitive to the pathways designed for the specific disease, showing low p values ( [MATH: <0.001 :MATH] ) and ranking in high positions, outperforming existing software tools. Three pathway dysregulation mechanisms were used to generate simulated expression data and evaluate the performance of methods in terms of type I error followed by their statistical power. Simulation results confirm best overall performance of SEMgsa. Conclusions SEMgsa is a novel yet powerful method for identifying enrichment with regard to gene expression data. It takes into account topological information and exploits pathway perturbation statistics to reveal biological information. SEMgsa is implemented in the R package SEMgraph, easily available at [29]https://CRAN.R-project.org/package=SEMgraph. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04884-8. Keywords: Pathway enrichment analysis, Pathway topology, SEM, SEMgsa, Sensitivity, Prioritization, Type I error, Power Background Biomedical research has been transformed by recent advances in high-throughput technologies, enabling extensive monitoring of complex biological systems. As a result, new methodological developments have emerged, most notably the adaptation of systems perspectives to analyze biological systems. Pathway enrichment has become a key tool in the analytic pipeline for Omics data and has been effectively used to generate novel biological hypotheses and determine if specific pathways are linked to specific phenotypes. In the literature, dozens of strategies have been developed, varying in model complexity and effectiveness. Earlier methodologies, such as over-representation analysis (ORA) [[30]1] and gene set analysis (GSA) [[31]2, [32]3], treat each pathway as a collection of biomolecules, as [[33]4] point out in their review paper. The ORA approach used a list of differentially expressed (DE) genes as input to determine which sets of DE genes are over-represented or under-represented, being strongly reliant on the criteria used to choose the DE genes, such as the statistical tests and thresholds utilized. A second generation of approaches was created to reduce this reliance on gene selection criteria by taking into account all gene expression values. The hypothesis behind these approaches is that small yet coordinated changes in groups of functionally related genes may be crucial in biological processes. These methods are named functional class scoring methodologies (FCS) [[34]5]. Such methods include gene set enrichment analysis (GSEA) [[35]2], gene set analysis (GSA) [[36]6] and correlation adjusted mean rank gene set test (CAMERA) [[37]7] among others. ORA and FCS methods can be referred to as the first two generations of pathway enrichment analysis approaches. However, when pathways are seen as a basic unstructured and unordered collection of genes, all the genetic connections and interactions that are supposed to capture and characterize the actual processes at hand are simply neglected. With the aim of including all of this additional information into the analysis, topology-based (TB) approaches have been created. These methods account for interactions between biomolecules and provide better performance than standard second-generation methods [[38]2, [39]3]. A variety of tools have been implemented, such as DEGraph [[40]6], topologyGSA [[41]8], NetGSA [[42]9–[43]11], Pathway-Express [[44]12, [45]13], SPIA [[46]14] among others. The common feature of approaches in this category is that they use prior knowledge of pathway topology information to obtain some gene-level statistic, which is then used to produce a pathway-level statistic, which is then used to rank the pathways. The goal of pathway enrichment approaches is to compare the ’activity’ of interest pathways across two or more biological situations or groups of specimens (patients, cell lines, etc.). Another technical feature that distinguished pathway enrichment methods is the type of the statistical null hypothesis being tested. The majority of approaches may be divided into two categories: those that test (I) self-contained null hypotheses and (II) competitive null hypotheses [[47]15]. A self-contained null hypothesis examines the activity of each pathway across biological situations (for example, normal vs. illness samples) without comparing it to the activity of other biomolecules/pathways. On the other hand, the activity of each pathway is compared to that of other biomolecules/pathways in a competitive null hypothesis. Even if the competitive null hypothesis has an interesting interpretation, assessing the significance of the competitive null is challenging, since tests based on it take into account a framework for gene sampling that treats genes as independent. The main contribution of this article is the development of a self-contained topology-based algorithm developed into the framework of structural equation models (SEM), called SEMgsa() [[48]16, [49]17]. Evaluation of system perturbation is incorporated in SEM [[50]18], where the experimental condition is compared to a control one through the use of an exogeneous group variable acting on every node of the network. Statistical significance of specific-pathway score is obtained combining node activation and node inhibition statistics extracted from SEM model fitting. In addition, unlike existing methods, an overall status of pathway perturbation of genes between case and control group has been computed considering both node perturbation and up- or down- regulation of genes for gaining more biological insights into the functional roles of predetermined gene subsets. A second objective of this study is to provide a consistent optimum solution of any given biological situation. Many topology-based methods that investigate distinct null hypothesis have been proposed in literature. We compare five popular pathway analysis approaches to SEMgsa(), starting from the most similar ones in terms of multivariate test and self-contained hypothesis type (DEGraph, NetGSA and topologyGSA) together with one approach of competitive hypothesis type (PathwayExpress) and the older approach of over-representation analysis (ORA). All the methods in this article offer a nice use interface in R. The aforementioned methods have been applied on observed and simulated expression data. The ultimate goal of expression data application is to provide a meaningful comparison of gene set analysis methods in terms of (i) sensitivity and (ii) prioritization for observed data and (i) type I error and (ii) power within each simulation run. The remainder of the article is organized as follows. Firstly, we describe SEMgsa() features with regard to gene expression data both in terms of inference procedure and user interface. Then, we outline the experimental setup constructed to evaluate pathway enrichment methods, including real data application and simulation design. In the end, we provide the results together with overall discussion. Method and implementation SEM framework A structural equation model (SEM) [[51]18, [52]19] is a statistical framework for causal inference based on a system of structural equations defining a path diagram, represented as a graph [MATH: G=(V,E) :MATH] , where V is the set of nodes (i.e., variables) and E is the set of edges (i.e., connections). The set E may include both directed edges [MATH: k→jifk∈pa(j) :MATH] and bidirected edges [MATH: k↔jifk∈sib(j) :MATH] . Although in the general setting of SEM latent variables and non-linear functions can be included [[53]18], we focus on the special case where the parent set [MATH: pa(j) :MATH] , and the siblings set [MATH: sib(j) :MATH] , determine a system of linear equations, as follows: [MATH: Yj=∑k∈pa(j)βjkYk+Ujj∈V :MATH] 1 [MATH: cov(Uj;Uk)=ψjkifj=kork∈sib(j)0otherwise :MATH] 2 where [MATH: Yj :MATH] and [MATH: Uj :MATH] are an observed variable and an unobserved error term, respectively; [MATH: βjk :MATH] is a regression (path) coefficient, and a covariance [MATH: ψjk :MATH] indicates that errors are dependent, which is assumed when there exists an unobserved (i.e. latent) confounder between k and j. Under such a model, the dependence structure among genes provided by pathways in biological database with directed and/or bi-directed edges (i.e., KEGG, Reactome, and many others) [[54]20, [55]21], interacting with each other to generate a single biological effect, can be included explicitly though the graph, [MATH: G=(V,E) :MATH] and evaluated using local and global statistics. SEMgsa() procedure adds a binary group (treatment or disease class) node labeled X to V, and suppose that [MATH: X={0,1} :MATH] directly affects the set of genes in the pathway. As bonus, adding group node and group-genes edges, the pathway with several components (clusters) and singleton genes induces a connected graph (see Fig. [56]1). Thus we consider a directed graph [MATH: G=(V∪X,E∪EX) :MATH] with the linear structural equations: [MATH: Yj=βjX+Ujj∈V(x) :MATH] 3 [MATH: Yj=∑k∈pa(j)βjkYk+βjX+Ujj∈V(y) :MATH] 4 where V(x) and V(y) are the sets of exogenous (i.e., source and singleton genes) and endogenous (i.e., connectors and sinks) genes, respectively. The covariances, [MATH: cov(Uj;Uk) :MATH] are assumed to be equal to Eq. ([57]2). Fig. 1. [58]Fig. 1 [59]Open in a new tab Visualisation of SEMgsa() procedure starting from Asthma KEGG pathway. The first graph summarise Asthma network properties, showing a pathway consisting of 31 nodes, 4 edges and 25 singletons. To maximise pathway information, SEMgsa procedure adds a binary group node (G = {0, 1}) that directly affects the set of genes in the pathway. In this way, the pathway with numerous singleton genes is edged with a group node and group-genes, resulting in a linked graph Comparing Eq. ([60]1) with Eqs. ([61]3)–([62]4), we note that the added node X may affect the mean gene expression values, but not their regression paths or covariances. In the R package SEMgraph these coefficients can vary by experimental or disease group via two-group SEM [[63]16, [64]17], but in the following we assume additive group effects. Coefficients [MATH: βj :MATH] (adjusted by the parents of the j-th node) determine the effect of the group on the j-th node, while the common path coefficients [MATH: βjk :MATH] represent regression coefficients, adjusted by parent set and group effect. This type of SEM enables the identification of differentially expressed genes (DEGs) if genes show a statistically significant variation in their activity (e.g., gene expression) in the experimental group respect to the control one. In other terms, a test for the null value of the path, [MATH: X→Y :MATH] is a test of: [MATH: H0:Yj⊥Xvs.H1: Yj⊥̸X,j∈V(x) :MATH] 5 [MATH: H0:Yj⊥pa(Yj),Xvs.H1:< mi>Yj⊥̸pa(Yj),X,j∈V(y) :MATH] 6 From Eqs. ([65]5)–([66]6) we note two different tests. Marginal tests of conventional DEGs analysis [[67]22] for source and singleton genes, and conditional tests, given the parents, for connectors and sink genes. Conditioning increases power when there is a direct group effect, and reduces gene variability. So, pathway topological structure makes the inference more precise [[68]23]. Maximum likelihood estimates (MLEs) of the paths ( [MATH: X→Y :MATH] ) can be easily obtained with one of the three algorithms of SEMgraph. Specifically, the core of model fitting in SEMgsa() function relies on the residual iterative conditional fitting (RICF), an efficient iterative algorithm that can be implemented through least squares, with the advantage of clear convergence properties [[69]24], and permutation-based p values for testing null hypothesis in Eqs. ([70]5)–([71]6). p values of group effect ( [MATH: X→Y :MATH] ) are computed by randomization of group labels comparing the estimated parameters by RICF with their random resampling distribution after a sufficiently high number of case/control labels permutations. Accurate small p value estimations are possible with no need for a large number of permutations (SEMgsa() makes default = 1000 permutations), using the moment based approximation proposed by [[72]25]. Once the empirical distribution of the permuted path coefficients is obtained, the two-sided p values are extracted from the normal distribution with mean and standard deviation estimated from the empirical distribution. From node-wise p values, overall group perturbation on pathway genes can be computed based on the Brown’s method for combining non-independent, one-sided significance tests [[73]26]. The method computes the sum of one-sided p values: [MATH: X2=-2∑jlog(pj) :MATH] , where the direction is chosen according to the alternative hypothesis ( [MATH: H1 :MATH] ), and the overall p value is obtained from the chi-square distribution with new degrees of freedom f and a correction factor c to take into consideration the correlation among p values, resulting in [MATH: X2c∼χ2(f) :MATH] . The conversion of two-sided p values in one-sided p values is performed according to the sign of the path ( [MATH: X→Y :MATH] ) coefficient, [MATH: βj :MATH] : [MATH: H1:withatleastoneβj> 0⇒pj(+)=pj/2ifβj> 01-pj/2ifβj< 0 :MATH] 7 [MATH: H1:withatleastoneβj< 0⇒pj(-)=pj/2ifβj< 01-pj/2ifβj> 0 :MATH] 8 If the overall p value [MATH: <α :MATH] (i.e., the significance level), we define node perturbation as activated when the direction of the alternative hypothesis is positive. Conversely, the status is inhibited if the direction is negative. Node-wise p values [MATH: <α :MATH] (after correcting for multiple comparisons with one of several adjustment methods, including Bonferroni or Benjamini–Hochberg procedure), are used for DEGs identification. While, a single p value of the two Brown’s p values ( [MATH: p(+) :MATH] : p-activation, and [MATH: p(-) :MATH] : p-inhibition) combined with a Bonferroni procedure [[74]27], i.e. [MATH: 2×min(p(+);p(-)) :MATH] , indicates the global pathway perturbation. In some cases, edge weights are defined in signalling pathways with discrete values [ [MATH: -1,0,1 :MATH] ], indicating gene-gene activity derived from biological database (e.g. KEGG). Usually they are: [MATH: -1 :MATH] for repressed or inactive, 0 for neutral, and + 1 for enhanced or activated. For gaining more biological insights into the functional roles of prior subset of genes, the sign of the minimum p value between node activation and inhibition has been retained to assess, in combination with pathway weights, an overall status of pathway perturbation of genes between case and control group. In detail, node perturbation obtained from RICF fitting has been combined with up- or down-regulation of genes to obtain overall pathway perturbation classification as displayed in Table [75]1. Table 1. Overall pathway perturbation Up/down regulation Node perturbation Overall perturbation + 1 P− (inh) Down act − 1 P− (inh) Up inh + 1 P+ (act) Up act − 1 P+ (act) Down inh [76]Open in a new tab * The weighted adjacency matrix of each pathway was used to determine the up- or down-regulation of genes (taken from the KEGG database) as the column sum of weights across each source node. The pathway is marked as down-regulated if the total sum of the node weights is less than 1, and otherwise as up-regulated. * The minimum among the p values determines whether the node perturbation is activated or inhibited; if positive, the node perturbation is described as activated, and otherwise as inhibited. * It is possible to determine the direction (up or down) of gene perturbation by combining these two quantities. In cases compared to the control group, an up- or down-regulated gene status that is associated with node inhibition shows a decrease in activation (or an increase in inhibition). In contrast, up- or down-regulated gene status, associated with node activation, leads to an increase in activation (or decrease in inhibition) in cases relative to control group. User interface The example code of the function SEMgsa() is as follows. SEMgsa(g = list(), data, group, method = "BH", alpha = 0.05, n_rep = 1000, ...) The inputs are: a list of pathways to be examined (g); gene expression data where rows represent subjects, and columns graph nodes (data); a binary vector with 1 for cases and 0 for control subjects (group). Optional inputs are the multiple testing correction method (method), and the significance level (alpha) for DEGs selection, and the number of randomization replicates for RICF algorithm ( [MATH: n_rep :MATH] , default = 1000). The first step in SEMgsa() workflow is to compute the weighted adjacency matrix of each pathway, obtain the sum of node weights and flag the pathway as up- or down-regulated. This is crucial to obtain the overall pathway perturbation status at the end. Then RICF algorithm of R package SEMgraph, i.e. SEMrun(graph, data, group, fit = 1, algo = "ricf"), is applied on data, considering the group binary vector and the number of specified randomization replicates. More specifically, SEMrun() takes as input a single graph as an igraph object and has several additional inputs, including: a numeric value fit indicating the SEM fitting mode, where [MATH: fit=1 :MATH] specifies a “common” model to evaluate group effects on graph nodes; the MLE method algo is used for model fitting, in this case fitting is done via RICF(algo = “ricf”). The covariance matrix could not be semi-definite positive in the situation of large dimensionality (n.variables > n.subjects), making it impossible to estimate the parameters. When this occurs, regularization of the covariance matrix is enabled. SEMrun() uses internally two functions of the corpcor R package: the is.positive.definite() tests if the observed covariance matrix is positive, and if the response is equal to FALSE, the function pcor.shrink() implements the James-Stein-type shrinkage estimator [[77]28] to regularized the covariance matrix. Node-wise group effect p values are extracted from model fitting object together with the number of DEGs obtained adjusting p values with the chosen correction method while testing the specified level of alpha. Then, a data frame of combined SEM results is obtained putting together node-wise p values with Brown’s method and Bonferroni’s correction. The output of SEMgsa() is represented by a list containing two objects with the following information for each pathway in the input list: * gsa, a dataframe reporting size (No.nodes), DEGs number (No.DEGs), pathway perturbation status (pert), Brown’s combined p value of pathway node activation (pNA), Brown’s combined p value of pathway node inhibition (pNI) and the Bonferroni combination of them (PVAL). ADJP refer to the pathway combined p value adjusted for multiple tests with Bonferroni correction, i.e., [MATH: ADJP=m< mi>in(K×PV AL;1) :MATH] , where K is the number of the input pathways. * DEG, a list with DEGs names for each pathway, selected with p value < alpha after the multiple correction procedure with one of the method available in R function p.adjust(). By default, method is set to “BH” (i.e., Benjamini–Hochberg correction) and the significance level alpha to 0.05. To read more about SEMgsa() function, in terms of description, usage, function arguments and value, refer to [78]https://rdrr.io/cran/SEMgraph/man/SEMgsa. Experimental design Benchmark data Coronavirus disease (COVID-19) RNA-seq expression data from [[79]29] (GEO accession: [80]GSE172114) together with Frontotemporal Dementia (FTD) DNA methylation data (DNAme) from [[81]30] (GEO accession: [82]GSE53740) have been used as benchmark data. Network information has been retrieved from kegg.pathways object of the SEMgraph package as a list of 225 edge weighted igraph objects corresponding to the KEGG pathways extracted using the ROntoTools R package [[83]31]. Edge weights are defined with discrete values [MATH: [-1,0,1]:-1 :MATH] for inactive gene–gene activity, 0 for neutral, and [MATH: +1 :MATH] for activated. Coronavirus disease (COVID-19) Coronavirus disease of 2019 (COVID-19) is a highly contagious respiratory infection that is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Multiple probes for each Entrez gene ID were first eliminated. The empirical Bayes technique, as implemented in the limma R package [[84]32], was used to fit linear models for differential expression analysis, and p values were adjusted for multiple testing using the method of Benjamini–Hochberg [[85]33]. This procedure results in a matrix of 69 subjects [MATH: × :MATH] 14,000 genes. Subjects include patients in the intensive care unit with Acute Respiratory Distress Syndrome (“critical group”, N = 46) defined as cases, and those in a non-critical care ward under supplemental oxygen (“non-critical group”, N = 23) defined as controls. The expression matrix was finally matched with the corresponding Coronavirus disease—COVID-19 (hsa05171) KEGG pathway according to its name. The latter is a graph with 232 nodes and 208 edges, including five components and 109 sigleton (i.e. node degree = 0). The maximum subgraph consists of 54 nodes and 83 edges. This pathway was subsequently labeled as target pathway and its p value and rank were further investigated for assessing the sensitivity and prioritization ability of the methods [[86]34, [87]35]. Frontotemporal dementia (FTD) Frontotemporal Dementia, a neurodegenerative disorder characterized by cognitive and behavioural impairments [[88]36]. We will use DNAme data stored in the SEMdata package as ftdDNAme, a list of two objects: a data matrix of 256 rows (subjects) and 16,560 columns (genes) containing the value of the first principal component of DNAme levels, obtained applying a principal component analysis to methylated CpG sites within the promoter region, for each gene (genes with unmethylated CpGs in both conditions were discarded); and a binary group vector of 105 FTD patients (1) and 150 healthy controls (0). Unlike COVID-19 data, FTD has not a unique KEGG pathway associated to its name. According to KEGG BRITE database, the term Frontotemporal lobar degeneration (an alias for FTD; KEGG ID:[89]H00078) is associated to 6 KEGG pathways: MAPK signaling pathway (hsa04010), Protein processing in endoplasmic reticulum (hsa04141), Endocytosis (hsa4144), Wnt signaling pathway (hsa04310), Notch signaling pathway (hsa04330), and Neurotrophin signaling pathway (hsa04722). We can use the SEMgsa() function to apply gene set analysis (GSA) on a collection of networks, exploring the 6 selected FTD pathways as target ones. Thus, the ability of GSA methods will be investigated on 6 target pathways, combining results in terms of median p values and ranks for readability purposes. We refer the reader to the Additional file [90]1 for more detailed results. Data simulations Synthetic data, based on realistic expression data (“[91]Coronavirus disease (COVID-19)” section), was used to carry out simulations following the practice in [[92]10]. A subset of pathways [MATH: q1<K :MATH] out of total K pathways has been chosen to be dis-regulated. Next, a pre-specified number (s) of genes within each dis-regulated pathway was chosen to be altered (up or down) according to a topological measure (betweenness, community, neighbourhood) and different mean signals ( [MATH: ±0.5,±0.6,< mo>±0.7 :MATH] ). Another subset of [MATH: q0<(K-q1) :MATH] pathways with no dis-regulated genes has been chosen to evaluate statistical metrics. Finally, the COVID-19 benchmark data matrix is first normalized, with mean zero and unit variance for each gene within each group (cases and controls). Then, nine data generation procedures are executed, according to topological measures and adding mean signal to the pre-specified genes in the selected dis-regulated pathways. In summary, the simulation design ( [MATH: 3×3 :MATH] ) with 100 randomization per design levels is reported in Table [93]2. Table 2. Summary of simulation design ( [MATH: 3×3 :MATH] ) with 100 randomization per design levels Topology design Gene regulation Mean signal [MATH: ± :MATH] 0.5 [MATH: ± :MATH] 0.6 [MATH: ± :MATH] 0.7 Betweenness Up/down 100 100 100 Community Up/down 100 100 100 Neighbourhood Up/down 100 100 100 [94]Open in a new tab Pathway deregulation Each data generation procedure starts with the definition of a list of pathways to be tested. After recalling the list of kegg.pathways including [MATH: N=225 :MATH] signaling pathways from the KEGG database, for efficiency purposes pathways with a minimum and a maximum number of nodes equal respectively to 30 and 300 have been filtered out for the analysis. Then, to speed up computations, the maximum component of each igraph [[95]37] object corresponding to each selected pathway has been selected. Given this choice, igraph objects with maximum component smaller than the [MATH: 60% :MATH] of the total graph size have not been considered. This filtering procedure results into a list of [MATH: K=117 :MATH] igraph objects. We have to alter [MATH: q1=10 :MATH] pathway’s genes in order to deregulate it within a simulation. Specifically, we consider the 9 KEGG pathways associated to Coronavirus disease—COVID-19 (hsa05171): Vascular smooth muscle contraction (hsa04270), Platelet activation (hsa04611), Toll-like receptor signaling pathway (hsa04620), NOD-like receptor signaling pathway (hsa04621), RIG-I-like receptor signaling pathway (hsa04622), JAK-STAT signaling pathway (hsa04630), Natural killer cell mediated cytotoxicity (hsa04650), Fc gamma R-mediated phagocytosis (hsa04666) and Leukocyte transendothelial migration (hsa04670). We looked at three different methods for reflecting pathway topology in order to assign impacted genes to the deregulated pathways: betweenness, community and neighborhood, following the practice in [[96]10, [97]38]. The number of all shortest paths in a directed graph that pass through a given node is known as its betweenness. The top 10 highest scoring betweenness nodes were used to choose affected genes in the betweenness deregulation design. According to the community deregulation design, we located modules with dense connections between module nodes and spare connections between module nodes. Given the division of vertices in each community, the 10 affected genes are then randomly sampled from the community with the highest proportion of members. In the neighbourhood deregulation procedure, the vertices not farther than a given limit from another fixed vertex are called the neighborhood of the vertex. After computing the neighbourhood of order 2, we sampled the 10 vertices from the neighbourhood with the biggest size. Within each of the associated pathways plus the target pathway of interest, an equal number of genes s has been selected as ‘dysregulated’. We decided to fix the number of affected genes to [MATH: s=10 :MATH] for each pathway, with the aim of obtaining equal contribution from each associated disease (due to the presence of smaller pathway sizes). However, given that overlapping genes between pathways may occur, the unique genes out of the total [MATH: S≤s×q1=10×10 :MATH] have been retained as DEGs for pathway dysregulation. Thus, the number of total dysregulated genes S is not fixed, but changes according to the chosen topology design. As a note, given the random sampling within the community deregulation design, the sampled genes change according to the specified seed. In the end, a weight representing the up- or down-regulation of genes was kept along with the Entrez gene ID of the affected genes. The weighted adjacency matrix of each pathway comprises a column regarding the sum of weights over each source node, which can be used to determine up- or down-regulation (taken from the KEGG database). The pathway is marked as down-regulated or up-regulated depending on whether the total sum of node weights is less than 1. Each gene’s weight has been extracted in order to obtain an up or down mean signal, which better reflects variations in the expression of the impacted genes between the control and treatment groups. After identifying the subset of DEGs according to the chosen topology design, pathways with a number of dysregulated genes [MATH: ≤1 :MATH] have been selected as true negatives ( [MATH: q0 :MATH] ) to evaluate type I error from simulations. Unlike the [MATH: q1 :MATH] number that is fixed to 10 COVID-19 related pathways, the number of [MATH: q0 :MATH] pathways changes according to the chosen subset of DEGs. To summarise, for all topology design, the total altered genes S inside the [MATH: q1=10 :MATH] pathways are differentially expressed with a mean difference varying from ( [MATH: ±0.5 :MATH] , [MATH: ±0.6 :MATH] , [MATH: ±0.7 :MATH] ). Note that the magnitude of the mean signal is expressed relative to the unit variance of each gene (see [[98]10]). The simulated expression matrices were directly supplied to SEMgsa() and DEGraph, topologyGSA, NetGSA, PathwayExpress and ORA algorithms together with the list of igraph objects corresponding to the chosen KEGG pathways. Pathway enrichment methods Table [99]3 provides an overview of the tested pathway enrichment methods in terms of null hypothesis, input requirements, pathway information and availability on R together with main papers for reference. These methods differ in two main aspects: (i) the type of null hypothesis, self-contained or competitive; (ii) input data, expression data or thresholded gene p values. ORA and PathwayExpress test the competitive null hypothesis of whether the genes in the set of interest are at most as often DE as the genes not in the set, instead SEMgsa(), DEGraph, topologyGSA and NetGSA test the self-contained null hypothesis of no genes in the set of interest are DE. Another major difference among these methods regards the input requirements. There is a high sensitivity to the p value cutoff because all techniques based on testing the competitive null hypothesis must identify DE genes based on a pre-specified threshold of corrected p values. Without making any arbitrary decisions on the list of DE genes, all self-contained tests directly employ expression data. Table 3. Overview of tested pathway enrichment methods Method Null hypothesis Gene p value tresholding Expression data Pathway R/Bioconductor [References]