Abstract

   Genes are organized in functional modules (or pathways), thus their
   action and their dysregulation in diseases may be better understood by
   the identification of the modules most affected by the disease (aka
   disease modules, or active subnetworks). We describe how an algorithm
   based on the Core&Peel method is used to detect disease modules in
   co-expression networks of genes. We first validate Core&Peel for the
   general task of functional module detection by comparison with 42
   methods participating in the Disease Module Identification DREAM
   challenge. Next, we use four specific disease test cases (colorectal
   cancer, prostate cancer, asthma, and rheumatoid arthritis), four
   state-of-the-art algorithms (ModuleDiscoverer, Degas, KeyPathwayMiner,
   and ClustEx), and several pathway databases to validate the proposed
   algorithm. Core&Peel is the only method able to find significant
   associations of the predicted disease module with known validated
   relevant pathways for all four diseases. Moreover, for the two cancer
   datasets, Core&Peel detects further eight relevant pathways not
   discovered by the other methods used in the comparative analysis.
   Finally, we apply Core&Peel and other methods to explore the
   transcriptional response of human cells to SARS-CoV-2 infection,
   finding supporting evidence for drug repositioning efforts at a
   pre-clinical level.

   Subject terms: Computational biology and bioinformatics, Diseases

Introduction

   In a typical systems biology paradigm, large amount of molecular data
   collected via high throughput ’omics’ experiments are stored in curated
   databases then filtered and reorganized in the form of an interaction
   network among molecular species (for example, co-expression networks
   are built via measures of the co-expression of genes under a variety of
   conditions)^[26]1,[27]2. Next, such a network is analyzed to detect
   interesting phenomena from a biological point of view, potentially
   relevant for a phenotype of interest or a specific biological process.
   Biological networks have been found to have a rich modular structure
   which mediates biological processes and cellular activities. Thus
   discovering and validating modules within biological networks has
   become an activity propaedeutic to the discovery of biological
   mechanisms. Genes in a functional module should act in a highly
   correlated (or anti-correlated) way in response to cell conditions, as
   revealed by approaches integrating different levels of ’omic’
   data^[28]3.

   In a second paradigm typical of systems biology , we are not so much
   interested in finding genes that behave similarly across a variety of
   cell conditions, but in finding genes (or modules) whose behavior is
   different between two clearly stated cell conditions to find
   subnetworks that are activated differently and thus characterize at a
   functional level the differences between the two conditions (active
   subnetwork detection problem) (^[29]4–[30]8).

   Module and active module identification in biological networks is a key
   component of a full network analysis aimed at exploring issues related
   to applications such as drug target discovery^[31]9,[32]10, cancer
   subtypes classification^[33]11, finding biomarkers for cancer
   prognosis^[34]12,[35]13, detection of histone modifications^[36]14, and
   many others. Previous work on active subnetwork analysis in cancer has
   focused on finding similarities among cancer subtypes and detecting of
   prognostic discriminative subnetworks. Gaire et al.^[37]15 apply
   subnetwork analysis to Breast Cancer (BC) data looking for conserved
   circuits in BC subtypes; their finding suggest that subtypes of
   different cancers may have molecular similarities indicating that
   therapeutic approaches in different sub-types of cancer may be shared.
   Chuang et al.^[38]16 propose a subnetwork-based analysis of gene
   expression profiles to discriminate between groups of patients with
   various risks for chronic lymphocytic leukemia (CLL) progression. For
   COVID-19, active subnetwork analysis has been mainly focused on helping
   current drug repositioning efforts^[39]17–[40]19. For a general
   literature digest on COVID-19, see Harapan et al.^[41]20.

   This paper demonstrates the performance of the Core&Peel method for
   these two key problems in systems biology: (a) network module
   identification and (b) active subnetworks detection. We illustrate the
   complete workflows in Fig. [42]1a,b for the two problems, respectively.
   In both cases, we used the same gene co-expression network as Core&Peel
   input. In the first task (Fig. [43]1a), the output modules have been
   tested for association with diseases using two sets of genome-wide
   association studies (GWASs) as well as Gene Ontology (GO) enrichment
   analysis for assessing a biological relevance. In the second task
   (Fig. [44]1b), we selected only the Core&Peel modules significantly
   enriched for the differentially expressed genes to obtain active
   subnetworks in two cancer types (prostate and colorectal cancer), two
   inflammatory diseases (asthma and rheumatoid arthritis) and COVID-19
   infection.

Figure 1.

   [45]Figure 1
   [46]Open in a new tab

   Schematic workflow illustrating the main steps used in our work.
   Core&Peel was tested for two problems: (a) disease module
   identification and (b) active subnetwork detection. In both tasks, the
   DREAM challenge gene co-expression network was used as input of
   Core&Peel algorithm. In (a) we tested different values of density and
   Jaccard index. For each parameters combination, we tested the
   disease-association of each module detected by Core&Peel using the GWAS
   datasets made available by the DREAM challenge. The leaderboard set
   (including 76 GWAS datasets) was used to optimize the parameters and
   the final test (including 104 GWAS datasets) to get a final evaluation
   of them. We also tested the biological relevance of each module through
   the Gene Ontology (GO) enrichment analysis. In (b), we used the
   differentially expressed genes and the hypergeometric test to select
   the active Core&Peel modules in a specific disease. We used several
   transcriptomic datasets: two cancer RNA-Seq studies (prostate and
   colorectal cancer), two inflammatory disease microarray experiments
   (asthma and rheumatoid arthritis) and three COVID-19 related datasets.
   We used the R/CRAN package DiagrammeR^[47]96 to illustrate the
   workflow.

   We show that Core&Peel has a performance that matches or surpasses that
   of current state-of-the-art approaches, in several measurements, on
   benchmark diseases, in particular cancer benchmark datasets. Finally,
   we apply Core&Peel and other state-of-the-art methods to explore the
   transcriptional signature of the response of human cells to SARS-CoV-2
   infection at a modular level. The combined output of these algorithms
   uncovers several enriched pathways related to current drug
   repositioning efforts.

   The Core&Peel method has been developed in^[48]21 for predicting
   protein complexes in large protein-protein interaction networks (PPIN).
   Core&Peel follows a generalist approach, assuming that modules have the
   topological properties of quasi-cliques and ego-networks. The method
   makes a very mild assumption on the properties of the input network,
   and of the predicted modules, thus in principle, it can be applied
   quite directly to other community detection problems with a similar
   flavor. One example of such a problem is disease module identification.
   In 2016, an open DREAM challenge attempted to identify the best
   performing algorithms for disease module identification^[49]22. We
   applied our previously reported Core&Peel algorithm^[50]21 to the DREAM
   challenge data to see how well Core&Peel could perform on this new
   task. In practice, the list of 42 methods mentioned in^[51]22 for
   functional module detection, and the list of 19 methods mentioned
   in^[52]21 for protein complex prediction, have very little overlap; the
   only methods mentioned in both lists are SPICi and ClusterOne, which
   are used in the functional module detection DREAM challenge as part of
   an ensemble or hybrid computational pipelines. Thus the nature of the
   two problems, functional module identification and protein complex
   prediction, is such as to require a fresh look as it is not obvious
   that any algorithm can perform equally well in either case.

Results

Core&Peel: detection of disease modules

   The DREAM challenge proposed the problem of identifying disease modules
   from six heterogeneous networks of proteins/genes. To test the
   performance of Core&Peel in detecting disease modules, we decided to
   use only the DREAM challenge gene co-expression network. We varied the
   Core&Peel parameters for subgraph density and Jaccard coefficient to
   find optimal settings (Fig. S1). We noticed that some modules had more
   than 100 genes. We removed these modules to be consistent with the
   DREAM challenge, requiring no module to have more than 100 genes (Fig.
   S2). The scoring methodology for the DREAM challenge used GWAS datasets
   as an orthogonal validation for the submitted networks. The PASCAL tool
   was used to correct the GWAS datasets for underlying linkage, creating
   gene-fusion regions tagged by the most significant SNP. By supplying
   PASCAL with the GWAS data and a gene network, this tool can score
   whether the detected modules contain significant GWAS hits. We tested
   the different Core&Peel parameters to the DREAM leaderboard (parameter
   tuning) and final submission GWAS datasets (Fig. [53]2). We chose the
   0.7 density and 0.8 Jaccard index (more precisely
   Core&Peel-r1-nl20-d0.7-f1-j0.8 configuration) since it detected the
   largest number of enriched modules compared to other configurations in
   the leaderboard GWAS dataset. Generally when the Jaccard index is from
   0.8 and 1.0 there is not a substantial variation in the number of
   disease modules identified (Fig. [54]2), so we decided to choose the
   smallest Jaccard index into this range. However when the density is too
   high (mostly equal to 1) very few modules were identified. This is more
   evident in the leaderboard step, in fact when the final GWAS datasets
   were used a remarkable improvement can be noticed, probably due to the
   larger number of final GWAS datasets than those in the leaderboard. We
   further noted that removing modules with a size greater than 100
   slightly improves the performance (comparing Fig. [55]2 with Fig. S5).

Figure 2.

   [56]Figure 2
   [57]Open in a new tab

   Number of disease-enriched modules detected by Core&Peel using the
   Leaderboard and Final GWAS datasets. The number of enriched modules
   were calculated by the Pascal tool using a 0.05 FDR cutoff. In abscissa
   the tested configurations of parameters for Core&Peel. We used the R
   package ggplot2^[58]97 to make the figure.

Comparison of Core&Peel with DREAM methods

   The DREAM comparison methodology is fit for comparing methods producing
   non-overlapping modules, while we allow overlaps; we thus resort to
   introducing an explicit post-processing equalizing phase for Core&Peel
   (based on ranking and cut-off) when we compare the score of a method
   producing overlapping modules (Core&Peel) with a method producing
   non-overlapping modules. This is done for the sake of fairness of the
   comparison without sacrificing the advantages that overlapping modules
   have in modelling the underlying biology. In order to compare Core&Peel
   with DREAM methods, we thus applied the CRank ranking algorithm to the
   results of the configuration chosen (Core&Peel-r1-nl20-d0.7-f1-j0.8).
   We selected the highest rank communities to obtain the same number of
   modules of DREAM methods (Supplementary Fig. [59]S3). The results of
   the leaderboard and final tests for Core&Peel versus the DREAM methods
   are represented in Fig. [60]3. In most of the cases, Core&Peel can
   identify more enriched modules. In a few cases, Core&Peel finds fewer
   GWAS enriched modules, for example when compared with K5 (Fig. [61]3b),
   probably due to the low number of predicted modules (equal to 16).
   Obviously, Core&Peel still has the advantage to have overlapping
   modules and this helps to obtain more disease enriched modules for an
   equal number of predicted modules with DREAM methods. This last aspect
   highlights the importance of taking into account modules with genes in
   common since they are more realistic from a biological point of view,
   as also demonstrated in^[62]23.

Figure 3.

   [63]Figure 3
   [64]Open in a new tab

   Comparison of number of disease modules between
   Core&Peel-r1-nl20-d0.7-f1-j0.8 and all the DREAM methods using
   Leaderboard (a) and Final (b) GWAS datasets. The number of enriched
   modules were calculated by the Pascal tool using a 0.05 FDR cutoff. The
   CRank ranking was applied to Core&Peel modules in order to select the
   same number of modules detected by each DREAM method. The x-axis is in
   ascending order respect to the number of enriched modules detected by
   DREAM methods. We used the R package ggplot2^[65]97 to make the figure.

   Moreover, we performed the GO analysis to evaluate the biological
   coherence of each predicted module. As before, we applied the CRank
   ranking to Core&Peel before conducting the enrichment analysis. We
   compared the number of predicted modules with different thresholds of
   significance (Supplementary Table [66]S1), and Core&Peel enables us to
   detect on average more functional modules than DREAM methods, in
   particular when the FDR threshold is shallow.

Comparison between Core&Peel and the overlapping community detection method

   We compared Core&Peel with the method by Tripathi et al.^[67]23
   described in the Section “[68]A further method: overlapping community
   detection”, which also detects overlapping modules (i.e. a gene can
   belong to more than one module) like Core&Peel. Core&Peel predicts in
   total many more modules than this method (Supplementary Fig. [69]S4),
   moreover Core&Peel identifies many more disease-enriched modules (Fig.
   [70]4). We also report the fraction of predicted disease modules (i.e.
   number of enriched modules divided by the total number of detected
   modules) in Table [71]1. Core&Peel identifies more disease modules both
   in absolute and relative terms, so we can consider it a competitive
   algorithm in disease modules identification problem, also for the class
   of methods producing overlapping modules. In Table [72]1 we reported
   the fraction of disease modules detected by the hits and spread hubs
   algorithms after filtering them according to the same Jaccard index
   (0.8) used in Core&Peel. We added this comparison to assess if there
   are any changes in the results when the rate of the overlap is the same
   among the two methods. However, we obtained similar outcomes with or
   without the Jaccard filtering. Besides, we analyzed the biological
   relevance of the predicted modules with GO analysis and Core&Peel
   achieves the best result (Supplementary Tables [73]S2 and [74]S3).

Figure 4.

   [75]Figure 4
   [76]Open in a new tab

   Number of disease enriched modules using both Leaderboard and Final
   GWAS datasets and comparing Core&Peel-r1-nl20-d0.7-f1-j0.8 with the two
   options (hits and spread hubs) of the method by Tripathi et al.,
   explained in Section “[77]A further method: overlapping community
   detection”. The number of enriched modules were calculated by the
   Pascal tool using a 0.05 FDR cutoff. We used the R package
   ggplot2^[78]97 to make the figure.

Table 1.

   Fraction of predicted disease modules (i.e. number of enriched modules
   over the total number of detected modules) that are enriched in the
   Leaderboard and Final GWAS datasets in Core&Peel-r1-nl20-d0.7-f1-j0.8
   and in the two options (hits and spread_hubs) of the method explained
   in Section “[79]A further method: overlapping community detection”.
               Core&Peel-r1-nl20-d0.7-f1-j0.8 hits  spread_hubs hits_j0.8
   spread_hubs_j0.8
   Leaderboard 0.021                          0.002 0.003       0.004     0.005
   Final       0.037                          0.01  0.01        0.006     0.01
   [80]Open in a new tab

   The hits_j0.8 and spread_hubs_j0.8 indicate the original algorithms
   after filtering the modules according to the 0.8 Jaccard index. The
   hits and spread_hubs indicate the original algorithm without any
   filtering. The numbers in bold highlight the best value.

Core&Peel: detection of active subnetworks

   We next examined the performance of Core&Peel for detecting active
   subnetworks, i.e. finding subnetworks that are activated under
   different conditions. We used the DREAM challenge co-expression network
   and seven different datasets: two cancer RNA-Seq studies (prostate and
   colorectal cancer), two inflammatory disease microarray experiments
   (asthma and rheumatoid arthritis), and three COVID-19 related datasets
   (bronchioalveolar lavage fluid RNA-Seq, infected patient PBMCs, and
   COVID-19 infected A549 cells). We calculated differentially expressed
   genes using edgeR for the cancer data and limma for gene microarrays.
   We processed the first two COVID-19 data according to the pipeline the
   authors used; we used the summary statistics of the third dataset
   released from the authors to retrieve the differentially expressed
   genes (see Methods). We retained only modules that contained at least
   one differentially expressed gene in the comparison of interest. We ran
   Core&Peel using both nl=10 and nl=20 as the minimum subgraph size. We
   noticed that, in the active subnetwork detection problem, Core&Peel
   with nl = 10 identified more pathways with disease-associated genes
   than the case nl = 20. So we chose the Core&Peel-r1-nl10-d0.7-f1-j0.8
   since it is one of the best configurations (same density and Jaccard
   coefficient) on the DREAM challenge data. Actually, this configuration
   works better in prostate and colorectal cancer cases where the number
   of differentially expressed genes (DEGs) is pretty high, instead when
   the number of DEGs is low (Supplementary Table [81]S4), Core&Peel with
   a smaller density obtained better results. So we decided to use
   Core&Peel-r1-nl10-d0.5-f1-j0.8 in asthma and rheumatoid arthritis
   cases. In COVID-19 case studies, we took into account both Core&Peel
   configurations.

   Core&Peel detected 1270, 141, 49 and 1495 modules in prostate cancer,
   asthma, rheumatoid arthritis, and colorectal cancer datasets,
   respectively. In COVID-19 cases, Core&Peel with density 0.5 detected
   715, 558, and 109 in BALF, PBMC, and COVID-cells, respectively.
   Instead, Core&Peel with density 0.7 detected 545, 427, and 90
   significant modules. As expected, the more are the DEGs (Supplementary
   Table [82]S4), the more modules are identified. Since the significant
   modules were merged in one active subnetwork, we checked if this
   subnetwork is still significantly enriched with the DEGs; all active
   subnetworks produced got a p value
   [MATH:
   <mrow><mo><</mo><msup><mn>10</mn><mrow><mo>-</mo><mn>18</mn></mrow></ms
   up></mrow> :MATH]
   (data not shown).

Comparison with other active subnetwork detection methods

   We compared the Core&Peel modules with those detected by the four
   existing active subnetwork identification methods, namely
   ModuleDiscoverer (MD)^[83]24, KeyPathwayMiner (KPM)^[84]25,
   ClustEx^[85]26 and Degas^[86]27 (see description on Supplementary Notes
   Sect. [87]3.1). The number of genes in each active subnetwork is
   reported in Table [88]2. Generally, regulatory modules of asthma and
   rheumatoid arthritis include fewer genes than the prostate and
   colorectal cancer modules; this could be due to the lower number of
   DEGs. Degas is the method that produced the smallest modules.
   Core&Peel, ModuleDiscoverer and ClustEx produced the biggest ones,
   instead. Next, we were interested in comparing the modules detected by
   the different methods from a biological point of view. We compared the
   enriched pathways (Reactome-based) and plotted the overlap among all
   the method combinations (Fig. S6). We also conducted as baseline the
   pathway enrichment analysis using the DEGs. Globally, MD, ClustEx and
   Core&Peel detected a higher number of pathways than Degas, KPM, and
   DEGs. Degas is the method that identified the lowest number of
   pathways. In particular, in prostate and colorectal cancer cases (Fig.
   S6a,b), it does not overlap with the other methods. In both cancer
   cases, Core&Peel and MD share a significant part of their pathways
   (around 40) and a large number of pathways was identified only by
   ClustEx and MD in the prostate (Fig. S6a) and colorectal cancer (Fig.
   S6b), respectively. In the rheumatoid arthritis case, more than half of
   the pathways were shared by Core&Peel, ClustEx, and MD (Fig. S6d) and
   there is a small overlap for the other combinations. Overall a smaller
   number of enriched pathways were detected in asthma (see horizontal
   bars on Fig. S6c) than the other three case studies (horizontal bars on
   Fig. S6a,b,d) and we can also notice there are fewer overlapping
   pathways among the methods (Fig. S6c). Almost the entire set of
   enriched pathways of ClustEx is not in common with those produced by
   any other method. Similarly, more than half of the pathways identified
   by Core&Peel were not detected by the competing methods (Fig. S6c).

Table 2.

   Number of genes in each active subnetwork detected by all the methods
   for the prostate and colorectal cancer, asthma and rheumatoid arthritis
   studies.
             Prostate Asthma Colorectal Rheumatoid arthritis
   Core&Peel 3048     1305   3883       519
   MD        2708     192    2606       403
   KPM       1266     74     2965       20
   ClustEx   3000     1300   3800       500
   Degas     122      37     44         85
   [89]Open in a new tab

   Core&Peel-r1-nl10-d0.7-f1-j0.8 configuration was used for prostate and
   colorectal cancer, instead Core&Peel-r1-nl10-d0.5-f1-j0.8 for asthma
   and rheumatoid arthritis.

Detection of disease-associated pathways

   To assess the performance in detecting pathways with genes associated
   with the disease, we calculated the number of enriched pathways with at
   least two genes associated with the specific disease (annotated on
   DisGeNET database). The results are shown in Fig. [90]5. Generally,
   Core&Peel can identify a substantial number of pathways. It indicates
   to be a competitive method, and it outperforms most of the algorithms.
   In particular, it performs exceptionally well in asthma case, in which
   most of the competitors can not detect any pathway with
   disease-associated genes. Core&Peel and ClustEx are the only methods in
   able to identify some of them, with Core&Peel catching a larger number
   of pathways (Fig. [91]5b). Overall more pathways consistent with
   disease-associated genes were detected in the two cancer datasets (Fig.
   [92]5a) than the asthma and rheumatoid arthritis datasets (Fig.
   [93]5b). This finding is consistent with the larger number of genes
   annotated on DisGeNET in cancer than asthma and rheumatoid arthritis.
   Generally Degas finds the fewest disease-associated pathways in all the
   case studies (almost zero pathways with disease-associated genes in all
   cases). Core&Peel, along with ClustEx and MD, can identify more
   pathways with disease-associated genes than DEGs; this highlights how
   the combination of network analysis and gene expression data can
   increase the power to detect pathways associated to the disease, which
   otherwise would not have been identified.

Figure 5.

   [94]Figure 5
   [95]Open in a new tab

   Number of enriched pathways with at least two disease-associated genes.
   The pathway enrichment analysis was performed for each subnetwork
   detected by Core&Peel and by the four competitive methods (ClustEx,
   Degas, KPM and ModuleDiscoverer) using the Reactome database. The
   pathway analysis using the differentially expressed genes was conducted
   as baseline. Pathways with an adjusted p value
   [MATH: <mrow><mo><</mo><mn>0.05</mn></mrow> :MATH]
   were selected and the number of disease-associated genes in each
   enriched pathway was calculated using the DisGeNET database. (a)
   Prostate and Colorectal cancer cases; Core&Peel-r1-nl10-d0.7-f1-j0.8
   was used. (b) Asthma and rheumatoid arthritis cases where
   Core&Peel-r1-nl10-d0.5-f1-j0.8 configuration was used. We used the R
   package ggplot2^[96]97 to make the figure.

   Finally, we conducted further enrichment analyses using different
   databases to investigate the enrichment with specific-disease pathways.
   The results are shown in Fig. [97]6. Modules identified by Core&Peel,
   KPM, ClustEx, and MD (along with DEGs) are significantly enriched for
   the Carcinoma pathway in both cancer cases (Fig. [98]6a), suggesting
   their relevance to cancer in general. More specifically, the Core&Peel
   module is the only one to be significantly enriched for prostate cancer
   specific pathway (Fig. [99]6d). Moreover, also pathways involved in the
   prostate morphogenesis (prostate gland morphogenesis, prostate gland
   epithelium morphogenesis and branching involved in prostate gland
   morphogenesis) are significantly enriched and each of them contains
   some prostate cancer-associated genes, suggesting a possible
   involvement of these pathways in the prostate cancer (Fig. [100]6d).
   The epithelial to mesenchymal transition in colorectal cancer pathway
   is significantly enriched in Core&Peel module (Fig. [101]6c), giving a
   colorectal cancer specificity to the Core&Peel module. Moreover, the
   Core&Peel showed enrichment for the asthma pathway even if slightly
   less significant than ClustEx (Fig. [102]6b). No significantly enriched
   pathways were found for the rheumatoid arthritis case. Finally, in
   Supplementary notes (Sect. [103]3.2.2) we reported the same analysis
   using Core&Peel-r1-nl20-d0.8-f1-j0.8. Also in this case, Core&Peel can
   detect modules enriched for specific prostate and colorectal cancer
   pathways (neoplasm of the colon, abnormal prostate morphology, prostate
   cancer and prostate neoplasm) along with Integrated Cancer Pathway in
   colorectal cancer case. Besides, we analyzed the Reactome-based
   pathways detected only by Core&Peel for cancer along with asthma and
   rheumatoid arthritis cases. This analysis highlighted some significant
   pathways (with annotation in the literature), which could affect the
   studied diseases (see “ClueGO” section in Supplementary Material).
   Overall, these analyses have highlighted how Core&Peel can find modules
   specific to the disease.

Figure 6.

   [104]Figure 6
   [105]Open in a new tab

   Results of gProfileR and enrichR enrichment analyses. The bars
   represent the adjusted p value in log10-scale. Only the methods which
   reached a significant p value (
   [MATH: <mrow><mo><</mo><mn>0.05</mn></mrow> :MATH]
   ) are showed. (a) The adjusted p-values of Carcinoma pathway annotated
   in DISEASES database in colorectal and prostate cancer. (b) Adjusted
   p-values of asthma pathway annotated on OMIM_Disease database. (c)
   Adjusted p-values of epithelial to mesenchymal transition in colorectal
   cancer pathway annotated on WikiPathways database. (d) Adjusted
   p-values of Prostate cancer pathway (annotated on DISEASES database)
   and prostate gland morphogenesis, prostate gland epithelium
   morphogenesis and branching involved in prostate gland morphogenesis
   (annotated on GO). These pathways are enriched only in the Core&Peel
   active subnetwork. We used the R package ggplot2^[106]97 to make the
   figure.

Comparison with other active subnetwork detection methods on COVID-19 data

   The numbers of genes in each module detected by all methods are
   reported in Supplementary Table [107]S5. Both Core&Peel configurations
   (Core&Peel-r1-nl10-d0.5-f1-j0.8 and Core&Peel-r1-nl10-d0.7-f1-j0.8)
   have been tested to have a broader exploration of the enriched
   pathways. Consequently, two versions of ClustEx have been generated to
   get modules with size comparable with Core&Peel. In the three COVID-19
   cases, the biological similarities between methods are different case
   by case. In BALF case (Supplementary Fig. [108]S7), Degas has detected
   many pathways more than the other methods, and most of them are not in
   common among other methods. Core&Peel and ClustEx share a large part of
   the pathways. In the PBMC case (Supplementary Fig. [109]S8), many
   pathways were detected only by ClustEx, and a large number of pathways
   are in common among Core&Peel, MD, and Degas. Finally, in
   COVID-19-infected cells case (Supplementary Fig. [110]S9), Core&Peel
   detected the highest number of pathways, and just a few are in common
   with the other methods.

Analysis of cellular transcriptional response to SARS-CoV-2 infection

   Current research on network-based analysis of COVID-19 focuses mainly
   in analyzing host-pathogen protein interaction
   maps^[111]18,[112]19,[113]28 in order to find druggable target
   interactions. A second approach by Gysi et al.^[114]17 uses
   host-pathogens protein interaction to extract a subnetwork of a large
   human PPI network, which is further analyzed for possible drug targets.
   Here we change focus, analyzing the modularity and functional
   enrichment of active subnetworks in the host’s response to the
   infection, within a global host co-expression network. We applied
   active subnetwork detection methodologies to three case/control
   differential expression data sets on the human response to COVID-19
   versus corresponding normal tissues/cell lines, using several active
   subnetwork detection algorithms. The complete list of significant GO BP
   functional annotations enriched in such active subnetworks is available
   at [115]https://github.com/martaluc/CorePeel_activeSubnetwork. Next in
   this section, we comment on some of these annotations.

Broad biological processes

   The first group of annotations comprises known pathways that are
   activated in viral infections, as well as in cellular apoptotic
   processes. Blanco-Melo et al.^[116]29 using DEA analysis noted that
   infection by COVID-19 has a molecular fingerprint characterized by
   “exuberant inflammatory cytokine production as a defining and driving
   feature of COVID-19”, and this is also noted by Xiong et al.^[117]30.
   Our analysis of active subnetworks confirms this finding at the level
   of enriched pathways in the active subnetworks: positive regulation of
   cytokine production involved in inflammatory response (by Core&Peel),
   regulation of cytokine production involved in inflammatory response (by
   Core&Peel), cytokine production involved in inflammatory response (by
   Core&Peel), and Cytokines and Inflammatory Response WP530 (by Core&Peel
   and ClustEx) in Table [118]S8. Pathway Cytokines and Inflammatory
   Response WP530 is found enriched also in Supplementary Tables [119]S6
   and [120]S7. Blanco-Melo et al.^[121]29 also noted a high expression of
   IL-6 as part of the characteristic COVID-19 signature. The importance
   of the interleukin activation is also shown in our analysis, as the
   pathway IL-10 Anti-inflammatory Signaling Pathway WP4495, which has
   regulatory effects on IL-6, is found enriched in Supplementary Tables
   [122]S6 and [123]S7, by ClustEx, and in Supplementary Table [124]S8, by
   Core&Peel. Consistently with known mechanisms involved in SARS-CoV-2
   progression^[125]31, we have found several apoptosis-related and
   lymphocite-related pathways in all three data sets. These observations
   are meant as a sanity check; their purpose is to recover in the
   functional enrichment analysis biological processes that are somewhat
   expected in any viral infection. These pathways per se are however
   usually aspecific and too broad to give useful hints for drug
   re-purposing efforts.

Specific biological processes hinting at candidate drugs for repositioning

   Next we turn to discuss specific pathways for which it is more likely
   that existing drugs may alter their behaviour and have potential
   therapeutic effects. We keep our discussion at a pre-clinical level
   since the best we can hope from using the active network analysis is to
   find supporting evidence suggesting that some drugs (or families of
   drugs) should be shortlisted for further in vitro and in vivo
   experimental analysis, or finding additional evidence supporting drugs
   already shortlisted for clinical trials. Other sources and experiments
   may reasonably contradict the type of evidence we can hope to provide,
   therefore all available evidence should always be weighted in before
   moving to the next steps of a drug repurposing pipeline.

   Functional annotations relative to Ebola virus (EBOV) infection is
   found enriched in Supplementary Table [126]S8 (by all seven methods),
   S6 and S7, with strong p-values. This finding is of interest as it
   supports current attempts of repositioning drugs developed to cure
   Ebola infections in patients affected by COVID-19. One of the few drugs
   known to be acting on Ebola (at a pre-clinical level) is remdesivir
   (https://it.wikipedia.org/wiki/Remdesivir). Historical accounts of the
   development of the drug remdesivir^[127]32,[128]33 show that interest
   on this drug in pre-clinical COVID-19 studies stemmed from its in vitro
   capability of inhibiting COVID-19/SARS-CoV2 infection^[129]34, and its
   known safety profile obtained in clinical trials for EBOV in
   2014^[130]35. At the time of the study of Wang et al.^[131]34
   remdesivir was already known, since 2017, to be effective in vitro and
   in vivo experiments involving members of the coronavirus family of
   viruses in a study by Sheahan et al.^[132]36, and this might be one of
   the reasons for shortlisting this drug in 2020^[133]34, along with
   other candidate drugs. Interestingly, our study of the patient’s
   transcriptional response to COVID-19 detects a direct link between the
   human response to EBOV and the response to COVID-19, at a modular
   level. In a hypothetical scenario in which one wishes to shortlist
   candidate drugs for an essay similar to Wang et al.^[134]34, our
   techniques would provide evidence that complements the supporting
   evidence provided by an essay identical to that in Sheahan et
   al.^[135]36.

   A second set of enriched functional modules is HIV-related (annotations
   in Supplementary Tables [136]S6, [137]S7, and [138]S8). This
   observation is consistent with the efforts in repositioning of AIDS
   therapeutic agents for COVID-19^[139]37. The NIH website
   ([140]https://aidsinfo.nih.gov/understanding-hiv-aids/fact-sheets/21/58
   /fda-approved-hiv-medicines) lists about 23 FDA approved drugs
   effective in HIV/AIDS therapy (excluding combinations). One of these,
   lopinair, was found by in vitro experiments to be active against
   SARS^[141]38 and the Middle East respiratory syndrome coronavirus
   (MERS-CoV)^[142]39,[143]40. Probably, for this reason, it has been
   selected for in vitro testing on cells infected by SARS-CoV2^[144]41.
   As in the previous case of the Ebola enriched module, in a hypothetical
   scenario, finding such HIV-enriched modules in the human
   transcriptional response to COVID-19 in late 2019/early 2020 would
   further support the choice of testing lopinavir by Cao et al. in early
   2020. A few more drugs (or pre-drugs) effective in HIV therapy have
   been tested by Jockusch et al.^[145]42 and demonstrate potency in
   inhibiting SARS-CoV2 viral replication in cell cultures.

   The Staphylococcus aureus infection, and T-Cell antigen Receptor (TCR)
   pathway during Staphylococcus aureus infection WP3863 pathways are
   detected in Supplementary Tables [146]S6, [147]S7 and [148]S8; while
   Staphylococcus aureus infection is detected in Table [149]S6.
   Staphylococcus aureus is known to produce pulmonary infections^[150]43
   thus it is a reasonable hypothesis that human response to S.aureus
   could share similarities at a molecular level with the COVID-19
   response. Moreover, since several strands of S.aureus have developed
   resistance to traditional antibiotics, a new generation of antibiotic
   drugs have been introduced recently to cope with it. A listing of
   antibiotic drugs under consideration for repurposing on COVID-19 is
   reported in
   https://en.wikipedia.org/wiki/COVID-19_drug_repurposing_research and
   lists five antibiotics, among these four: Teicoplanin, Oritavancin,
   Dalbavancin, and Monensin are effective against Methicillin-Resistant
   Staphylococcus Aureus (MRSA)^[151]44–[152]47. Baron et al.^[153]48
   shortlisted Teicoplanin in in vitro testing for activity against
   COVID-19. Oritavancin, Dalbavancin, and Monensin have been found to
   inhibit the entry of Ebola, MERS, and SARS viruses in in vitro
   tests^[154]49. Again, finding S.aureus enriched pathways in human
   transcriptomic response would indicate the opportunity for shortlisting
   modern antibiotics active in severe S.aureus infections for use in
   COVID-19 in vitro testing.

   We could find the pathway Inflammatory mediator regulation of TRP
   channels enriched in Supplementary Tables [155]S8 and [156]S6.
   Transient receptor potential channels have been studied for their role
   in inflammatory processes and as therapeutic targets^[157]50–[158]53.
   This observation supports another possible line of attack in the search
   for COVID-19 treatments^[159]54,[160]55.

   Two tables in supplementary material show enriched pathways for
   response to Cytomegalovirus (CMV). Jockusch et al.^[161]42,[162]56
   report that three drugs known to be active for AIDS-related
   cytomegalovirus (CMV) infections inhibit the SARS-CoV-2 RNA-dependent
   RNA polymerase (RdRp), and have potential to evade COVID-19 exonuclease
   activity. These drugs belong to a class of drugs known to inhibit SARS
   coronavirus infections in in vitro tests^[163]57. Thus also in this
   case, the detection of enriched pathways typical of the response to CMV
   and HIV in our analysis, would support shortlisting of the family of
   drugs selected in by Jockusch et al.^[164]42,[165]56.

Other Specific biological processes

   The third group of enriched pathways points at specific transcriptional
   response to pathogens. However, it is not clear whether indication of a
   candidate drug (or class of drugs) emerges at this stage. Enriched
   pathways for the response to Epstein-Barr virus (EBV) are found in
   Supplementary Tables [166]S6, [167]S7, and [168]S8. Cases of patients
   co-infected by COVID-19 and EBV have been reported^[169]58, however,
   our active subnetwork analysis suggests that also these two viruses may
   provoke similar response profiles in patients at a molecular level.
   Other pathogen-response modules enriched in the active subnetworks
   detected in all three data sets (Supplementary Tables [170]S6, [171]S7,
   and [172]S8) are relative to Escherichia coli and Human papillomavirus.
   Two tables in supplementary material show enriched pathways for the
   response to Herpesvirus, Helicobacter pylori, Salmonella, Yersinia, and
   Cholera. We cannot rule out that a case of co-infection by a variety of
   pathogens could explain some statistical associations found in the
   data^[173]59. This event is however, less likely for the data obtained
   through cellular lines rather than patient samples. A second
   possibility is that the systemic human responses to these pathogens are
   much overlapping among themselves, besides being enriched in COVID-19
   data, thus the statistical associations may not be independent of each
   other.

   Chloroquine (Hydroxychloroquine) was one of the first drugs to move
   into the clinical trial stage for COVID-19 rising much expectation and
   controversy (see e.g. Pastick et al.^[174]60, Hoffmann et al.^[175]61).
   Limiting the discussion to the pre-clinical level, several in vitro
   experiments on SARS and SARS-CoV2 showed activity of
   Chloroquine^[176]34,[177]62–[178]65. Chloroquine is mostly known as an
   anti-malaria drug, while in our data, we could not spot specific
   pathways related to malaria, and therefore no indication for
   shortlisting it. With hindsight, though, a very indirect link can be
   spotted by combining a rarely cited result of 1990^[179]66 that reports
   on Chloroquine’s activity of on Vibrio cholerae, and our finding of a
   cholera-related enriched pathway. Recent results^[180]61 show however,
   that the action of Chloroquine in in vitro experiments may be
   tissue-dependent, thus making it harder to generalize such findings.

Discussion

   In this work, we explore the potential value of the Core&Peel method
   for solving two related problems. The first problem is functional
   module detection within a biological co-expression (co-expr) network of
   genes. The second is the problem of active functional module detection,
   that is, finding modules within a biological network of genes, that are
   most affected in a specific disease vs. a normal background state. We
   initially discuss the two problems separately as the validation
   methodology is different in either case, however, the two aspects of
   the issue often interact with each other.

   Firstly, we compare Core&Peel with 42 methods producing non-overlapping
   modules that participated in the DREAM challenge project, as well as
   some recent methods of Tripathi et al. 2019, which allow substantial
   module overlap. Module overlap is an issue that often arises in the
   context of network partitions and modularization (as also in
   graph-based community detection). The DREAM challenge did require
   competing methods to report non-overlapping modules. This choice has
   the first effect since there is no need in the DREAM challenge of a
   requirement on the maximum number of produced modules. The number of
   nodes in a network is a natural upper bound on the number of reported
   modules. It is also somewhat easier to evaluate the enrichment of
   disjoint modules since one does not have to cope with inflated results,
   where possibly many high quality overlapping modules differ by just one
   or two genes, collectively representing quasi-duplicates of the same
   biological module. On the other hand, there is ample
   evidence^[181]67–[182]70 (see also the discussion in the DREAM
   paper^[183]22) that functional biological modules often share
   sub-modules and that this sharing has precise implications for the
   biological interpretation of the processes under study. When we compare
   Core&Peel with the output of the DREAM challenge, we mitigate the
   potential bias in the measurements due to overlapping modules by
   imposing a maximum value of the Jaccard coefficient between any pair of
   modules, thus avoiding to report quasi-duplicates. Moreover, for the
   sake of comparison, we rank the modules by intrinsic topological
   quality (see^[184]71) and we limit Core&Peel to the top k modules in
   this ranking when we compare it to a method reporting k modules (for
   the same input graph).

   Within these caveats, we strictly follow the performance measurement
   methodology of DREAM. We use the GWAS leaderboard and the final
   datasets separately (see Fig. [185]2). The results on the GWAS
   leaderboard are used to optimize the choice of the parameters of the
   Core&Peel method. Figure [186]2 reports good qualitative concordance of
   the relative performance of Core&Peel between leaderboard and final
   GWAS datasets for a wide range of parameters. There is a discrepancy
   for the density value
   [MATH: <mrow><mi>d</mi><mo>=</mo><mn>1.0</mn></mrow> :MATH]
   , which corresponds to detecting full cliques in the input graph. Since
   even the best co-expression networks are approximations to the true
   network of interactions, missing edges (false negatives) are expected
   in the input, thus implying that quasi-cliques may be more relevant to
   functional module detection than full-cliques.

   Figure [187]3 reports the comparison of the absolute number of enriched
   modules found by Core&Peel (with the selected parameters) versus the
   DREAM methods. Both for the leaderboard and the final GWAS data,
   Core&Peel restricted to the top k modules finds more enriched modules
   most of the times (for final GWAS 39/42 times, for leaderboard GWAS
   35/42 times). Comparing the method of Tripathi et al. 2019^[188]23,
   results in Table [189]1 and Fig. [190]4 show that Core&Peel can detect
   many more significant modules (with a common Jaccard coefficient
   maximum threshold of 0.8), and that Core&Peel has a larger fraction of
   the reported modules enriched for GWAS data. We also performed a second
   analysis in which we count the number of modules enriched within one
   (or more) Gene Ontology annotation at an FDR below a fixed threshold
   (see Supplementary Tables [191]S1, [192]S2, and [193]S3). Supplementary
   Table [194]S1 reports the number of enriched modules for the DREAM
   methods, and the average number overall DREAM methods. The average
   number of GO:BP enriched modules for the DREAM methods is comparable to
   those found by Core&Peel for thresholds
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>2</mn></mrow></msup> :MATH]
   , and
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>3</mn></mrow></msup> :MATH]
   , however, for thresholds from
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>4</mn></mrow></msup> :MATH]
   , down to
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>7</mn></mrow></msup> :MATH]
   Core&Peel has a substantially higher number of enriched modules. At
   threshold
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>7</mn></mrow></msup> :MATH]
   Core&Peel reports more enriched modules than 39/42 of the DREAM
   methods. Comparing the method of Tripathi et al.^[195]23, results in
   Supplementary Table [196]S2 and [197]S3 show superior performance of
   Core&Peel, both in terms of enriched modules for GO categories, and in
   terms of the fraction of enriched modules over the total number of
   returned modules. Both the validation with GWAS data and that with GO
   data give the same picture, showing that Core&Peel is competitive with
   the best methods in the field on many measured quality functions, even
   when discounting for the difference in the number of reported modules.
   Moreover as noted in^[198]72 (page 8) the validation via GWAS
   enrichment uses data derived from association studies that are not
   normally used to define pathways and functional categories found in
   databases.

   Supplementary Tables [199]S2 and [200]S3 show that at threshold
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>4</mn></mrow></msup> :MATH]
   Core&Peel reports 6285 modules, and more than 53% of these (3343) are
   enriched with one or more GO annotations. This is a very rich
   structure, and we can exploit this richness to attack the problem of
   active module detection. Similarly to other methods, like
   ModuleDiscoverer, we use the list of differentially expressed genes
   (DEGs) in case-control experiments to compute the enrichment of each
   module in DEGs, and we report as active modules those with an
   enrichment below a given FDR threshold (typically 0.01). Other methods
   like Degas^[201]27 (which is available the suite Matisse^[202]73) take
   a different approach, as the original input graph is annotated with
   DEGs, and the subnetwork induced by the DEG nodes is used as a seed to
   discover a minimally connected subgraph that joins all DEGs.

   The main validation methodology for the active subnetwork detection
   methods involves finding validated disease-related pathways and testing
   the enrichment of such pathways in the active subnetworks. For this
   type of analysis (over the four test cases), we use several curated
   databases of pathways. The association of the pathway to the disease
   can be either obvious from the pathway textual description, or it can
   be inferred by the presence of disease-related genes (at least two).
   Figure [203]5a and b report the number of affected pathways, obtained
   by combining the Reactome and DisGeNET information, enriched in the
   predicted active subnetworks. We note that only two methods (Core&Peel
   and ClustEx) can detect significant pathways on all the four test
   cases. ModuleDiscoverer finds at least one pathway in three test cases,
   while DEGs and KPM find pathways in two test cases and Degas in only
   one. Although Core&Peel and ClustEx do report enriched pathways on all
   four datasets, Core&Peel does find more such pathways (and for prostate
   cancer, substantially more). Although any of the competing methods
   (ClustEx, KPM, ModuleDiscoverer, and Degas) may perform better than
   Core&Peel in some particular case, Core&Peel can perform uniformly well
   across all four test cases, and it is the only method in the
   comparative evaluation that could find significant enrichment for many
   validated cancer pathways.

   All methods mentioned in this paper require in input a biological
   network in which nodes represent genes, and edges represent
   interactions between pairs of genes. Many active subnetwork detection
   methods have been developed and tested using as input PPI networks
   (protein-protein interaction networks)^[204]5. Instead for this study
   we chose as target (large) gene co-expression network developed within
   the DREAM challenge project. There are advantages and supporting
   evidence that using a co-expr network is a sensible choice. (a) an edge
   in co-expression network may be due both to direct protein-protein
   interactions of the proteins associated with the genes, as measured in
   large scale proteomic essays. However it may also be due to indirect
   co-regulation in which the two genes show correlation because their
   expression is modulated by a common Transcription Factor or an ncRNA
   (miRNA, lncRNA, etc..). Thus a richer array of phenomena is captured in
   a co-expr network. (b) Co-expr networks tend to have more edges and
   thus to be denser than most PPIN. This is an advantage for density
   based-methods, as modules tend to be more compact (in other words the
   average path length within a module is shorter for the same module
   detected in a denser network). (c) Co-expr networks are usually based
   on the correlation between expression vectors of pairs of genes, across
   thousands of measurements of gene panels (via microarray or RNAseq
   technologies). This set up integrates naturally data from different
   experiments, and different platforms, via standard normalization. The
   resulting correlation values depend very little on outlier conditions
   or variations across experimental conditions, thus representing a
   robust datum. In contrast, the integration of PPI data is somewhat more
   affected by outlier experiments and heterogeneous experimental
   conditions. (d) Huang et al.^[205]74 compare the performance of 21
   networks in uncovering pathways associated with a particular disease
   phenotype, by using 50 literature-based validated pathways. Huang et
   al. note that larger network size gives a higher absolute performance
   score. Moreover, among seven networks integrating co-expression data,
   six rank in the top seven positions by absolute performance score. (e)
   The findings in^[206]74 are also consistent with the observations
   in^[207]22 as the largest average number of disease-related modules is
   detected (averaging among methods) in the co-expression network (of 1M
   edges). Slightly less performant is a PPI network which, however, has
   more than twice as many edges (about 2.2 M). On balance, although there
   is evidence pointing at co-expression networks as an input of choice,
   both^[208]22 and^[209]74 indicate that usually, different types of
   networks reveal various aspects of the active subnetworks. Thus a
   multi-facet approach should always be considered for specific
   investigations.

   In our active subnetwork detection approach, we have as input both a
   biological gene interaction network (called the “bio-network” data) and
   a list of gene expression data on cohorts of cases and controls for a
   disease under study (called the “perturbation” data). The bio-network
   represents a global and essentially static picture of the cell’s
   machinery, while the perturbation represents the dynamic element we
   wish to study within the context of the bio-network. All the methods we
   compare with imply these two roles (static/dynamic) for the two inputs
   (network/perturbation), although they may differ in many other aspects.
   A limitation of this view on the data is that the phenomenon of network
   rewiring in response to the stimulus is captured somewhat imperfectly.
   Rare novel gene interactions (i.e. new edges in the network) arising as
   the result of the stimulus are not included in this model, and cannot
   be predicted. There is a different family of approaches (see
   e.g.^[210]75) that focus instead on the network rewiring phenomenon, by
   a more direct experimental assessment of dynamics of physical and
   genetic networks, through experimental mapping of networks. In this
   setting typically one would produce and analyze two full networks, one
   representing the system’s status before and one after the stimulus, to
   capture the variations. Here there is a different trade-off between the
   complexity of the experimental setup and the depth of the downstream
   ’in silico’ analysis.

Conclusion

   In the first part of this research, we aim at comparing Core&Peel with
   competing methods on benchmark data. Adapting the DREAM challenge
   evaluation methodology to the case of overlapping functional modules we
   conclude that, in terms of the number of enriched detected modules,
   Core&Peel is competitive with 42 functional module detection methods
   (which report non-overlapping modules) participating in the DREAM
   Challenge, and with the recent method by Raman’s group^[211]23 (which
   allows overlapping modules). Naturally, the number of enriched modules
   detected is just one easy-to-measure parameter, and other features may
   be important in a comparative assessment of several methods for the
   same task. On a more qualitative level, comparison of Core&Peel with
   several active subnetwork detection algorithms on four benchmark test
   cases show that Core&Peel has more uniform performance across all these
   four benchmark datasets, and for two cancer data sets (prostate, and
   colorectal) it produces modules enriched with more specific pathways,
   overlooked by the competing methods. COVID-19 is a recent infectious
   disease caused by severe acute respiratory syndrome coronavirus 2
   (SARS-CoV-2) first isolated in December 2019. As relatively little is
   known of this contagious disease at this time, we adopt an exploratory
   approach using Core&Peel along with several other methods to uncover
   active subnetworks of human response to COVID-19. Pathway enrichment
   analysis of such active subnetworks uncovers some broad biological
   processes common to many infections, as well as more specific pathways
   that show similarities between human transcriptional response to
   SARS-CoV-2 and response to other pathogens, and thus provide evidence
   useful in ongoing drug repositioning efforts at a pre-clinical level.

Availability and usage

   Core&Peel algorithm is available here
   [212]http://bioalgo.iit.cnr.it/index.php?pg=ppin. The active subnetwork
   detection script is available as Docker image
   https://hub.docker.com/repository/docker/ma10r02t90a/corepeel_active-su
   bnet/general (see also GitHub repository
   [213]https://github.com/martaluc/CorePeel_activeSubnetwork). The Github
   repository includes a user manual to help users applying the pipeline
   to custom data.

Methods

DREAM challenge

   The “Disease Module Identification DREAM challenge”^[214]22,[215]72 is
   an open competition to assess disease-module identification methods
   across heterogeneous biological networks for homo sapiens. In this
   section, we summarize the data and the methods that were made available
   by the challenge. We mention a further recent method developed for the
   module detection problem, which uses data from the challenge by
   Tripathi et al. 2019^[216]23. We introduce the strategy used to test
   for the association of the predicted modules with complex traits and
   diseases using a collection of Genome-Wide Association Study (GWAS)
   datasets and a tool for module scoring called PASCAL^[217]76. Finally,
   we introduce the Gene Ontology (GO) analysis we performed to evaluate
   the biological coherence of each predicted module.

Co-expression network

   Among the six different biological networks available from the DREAM
   challenge, we decided to focus our attention on the co-expression
   network. This network is based on Affymetrix HG-U133 Plus 2 arrays
   extracted from the Gene Expression Omnibus (GEO)^[218]77. A gene
   expression matrix of 22268 genes by 19019 samples was obtained, and the
   pairwise Spearman correlation of genes across samples was computed. The
   network was created with genes as nodes and the correlation
   coefficients as edge weights. In order to reduce the noise, we removed
   all the edges with a weight lower than 0.25, obtaining a network with
   12477 nodes and 665187 edges.

DREAM methods

   Forty-two methods took part in the DREAM challenge, and we compared our
   results with every algorithm (called DREAM methods in this manuscript).
   These methods are classified into different categories: (i) kernel
   clustering, (ii) modularity optimization, (iii) random-walk based, (iv)
   local methods, (v) ensemble clustering, (vi) hybrid methods and (vii)
   others. For more details see Table [219]S1 in^[220]22,[221]72. The
   common characteristic of DREAM methods is to identify non-overlapping
   modules.

A further method: overlapping community detection

   Apart from comparing the Core&Peel results with DREAM ones, we took
   into account a further method developed by the Raman group^[222]23.
   This method has been designed to detect disease modules starting from
   heterogeneous biological networks. Like Core&Peel, it uses the data
   from the DREAM challenge and can detect overlapping modules. Basically
   this method involves two steps: (i) seed nodes selection and (ii) seed
   expansion. The seed nodes selection relies on the observation that
   disease genes have a higher degree than the non-disease genes.
   Consequently, the authors used HITS and spread hubs, which are based on
   the degree of a node, as a seed selection mechanism to select some
   important nodes from the network. After that, for each seed node, they
   applied a seed expansion algorithm that uses the Personalized PageRank
   score to rank the nodes in the neighborhood of a seed node. The nodes
   were added to the module one by one based on their ranking and the
   modularity score was re-calculated after adding every node. The set of
   nodes with the maximum modularity formed a module.

   We used their code (seed selection and seed expansion) released on
   GitHub: [223]https://github.com/RamanLab/DiseaseModuleIdentification.
   We applied the same co-expression network used in Core&Peel as input
   data. The only tunable parameter of this method is the number of the
   seed nodes to set in the seed nodes selection script. The authors
   tested different values of this parameter (see ’Co-expression network’
   Table [224]S4 in the Supplementary Material of the original paper
   [225]https://www.frontiersin.org/articles/10.3389/fgene.2019.00164/full
   #supplementary-material) but we chose the maximum one (4099) since the
   smaller the value, the fewer disease-enriched modules were detected.
   Using this value, we generated the maximum number of HITS and spread
   hubs seed nodes; in particular the algorithm found 1423 and 1279 HITS
   and spread hubs, respectively. Additionally we filtered the modules
   detected by this method according to the same Jaccard index used in
   Core&Peel to have the same rate of overlap between the two methods.

Genome-wide association study (GWAS)

   A Genome-Wide Association Study (GWAS) is an approach used in genetics
   to identify genetic variants associated with the risk of disease or a
   particular trait. In particular, this method searches the genome for
   small variations, called Single Nucleotide Polymorphisms (SNPs), which
   consist of a single nucleotide alteration. Single SNPs usually do not
   have large effects on disease risk of an individual, but they can be
   associated with increased risk at population scale. The DREAM challenge
   provides a collection of 180 GWAS datasets from public sources. This
   collection was split into two sets of 76 and 104 GWASs used for the
   leaderboard step (parameter optimization) and the final evaluation,
   respectively. We used both sets separately to test the performance of
   the methods.

Gene and module scoring using Pascal

   PASCAL^[226]76 (PAthway SCoring ALgorithm) is a powerful tool for
   computing gene and pathway scores from SNP-phenotype association
   p-values. Pascal uses analytic and numerical solutions to calculate
   gene and module scores from the SNP p-values correcting for linkage
   disequilibrium (LD) correlation structure. PASCAL was used to get a
   score for each method. This score was defined as the number of modules
   with significant Pascal p-values in at least one GWAS (if a module is
   significant for multiple GWAS traits, it was counted once). The Pascal
   p-values were adjusted to control the FDR via the Benjamini-Hochberg
   procedure, and a 5% FDR cutoff was applied. This methodology is also
   used in^[227]78.

Gene ontology (GO) analysis

   To assess the biological coherence and relevance of each predicted
   module, we performed the GO analysis. Basically, we computed the
   hypergeometric p-value of the association of each module to each GO
   class. We used the hypergeom method implemented in the scipy.stats
   package of Python. For each module, we assigned the GO class of lowest
   p value. In order to correct for multiple comparisons, we adjusted the
   p-values using the p.adjust R function with fdr method which is the
   Benjamini-Hochberg FDR estimation method. The GO database was
   downloaded by
   [228]ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz
   and the Biological Processes (BP) category was selected. Additionally,
   only the genes in the co-expression network were retained as
   background.

Core&Peel

   In this section we briefly summarize the algorithm’s main features; for
   more details refer to^[229]21. Basically, Core&Peel aims at identifying
   dense subgraphs from large or medium size networks, which are both
   ego-networks and quasi-cliques. Briefly, it constructs a set of
   neighbors for each node of the graph based on its core number.
   Afterward it applies a peeling step that iteratively removes nodes of
   minimum degree in the graph. The peeling procedure stops when the
   number of nodes drops below the minimum size threshold or when the
   density is above or equal to the minimum density threshold. Finally,
   the duplicates were removed, and if two subgraphs were too similar
   according to the Jaccard coefficient, the biggest one was retained.

   For both our purposes, we assessed the Core&Peel performance,
   optimizing different control parameters. The subgraph minimum density
   (d) and the maximum Jaccard coefficient (j) varied from 0.5 to 1.0, and
   the minimum subgraph size (nl) was set to 10 or 20. Instead, the filter
   strategy (f) was fixed to 1 because it is neither too strict nor loose
   and the distance from the seed node (r) was set to 1.

CRank

   CRank^[230]71 is a general approach for prioritizing network
   communities. It takes a network and detected communities as its input
   and produces a ranked list of communities, where the high-ranking ones
   represent the promising candidates for downstream analyses. It is based
   on the evaluation of four different structural features of each
   community: (i) the likelihood of the edges, (ii) internal connectivity,
   (iii) external connectivity, and (iv) relationship with the rest of the
   network. CRank then applies a rank aggregation method to combine these
   measures in order to produce the final ranking list of communities.

   We applied this ranking method on the communities detected by Core&Peel
   to compare them with the modules detected by the algorithms of the
   DREAM challenge. This aimed at reducing the effect of the overlap of
   the Core&Peel communities when compared with DREAM method communities,
   which do not contain any common genes between them.

   Basically, we selected the first k-communities of Core&Peel with the
   highest CRank scores, where k is the number of communities detected by
   the DREAM method we want to compare our result with. In other words,
   Core&Peel results were compared with every DREAM method, taking the
   same number of communities (in most of the cases, Core&Peel detects
   more communities than DREAM methods).

Selection of active subnetworks

   The second purpose of this work is to assess the Core&Peel performance
   in active subnetwork detection. We used the same Core&Peel modules
   calculated previously for the identification of disease modules. Among
   them we selected only the subnetworks that include genes that have a
   significant change of their expression in the patients with disease
   with respect to the healthy samples. More specifically, for each case
   study, we calculated the significance of the overlap between each
   module and the differentially expressed genes (DEGs) through the
   hypergeometric test. We took into account only the modules with a p
   value of less than or equal to 0.01. The p-values are computed by the
   hypergeom method implemented in the scipy.stats (version 1.1.0) package
   of Python (version 3.7). Finally, all the significant modules are
   merged in one, taking each gene once. We made this decision to
   facilitate the comparison of our results with the other methods, which
   did not take into account modules separately.

Transcriptomic data, pre-processing and differential expression analysis

   This section summarizes the datasets used to identify the active
   subnetworks from the co-expression network and gene expression data. In
   total, seven gene expression data have been analyzed, two are from The
   Cancer Genome Atlas (TCGA), two are from GEO, and the other three from
   COVID-19 studies. We also describe the differential expression analysis
   (DEA) we performed to get the DEGs, which were used for the active
   subnetworks selection (Table [231]S4).

TCGA data

   We downloaded and pre-processed the RNA-Seq data from TCGA for prostate
   and colorectal cancer using the TCGAbiolinks R/Bioconductor
   package^[232]79–[233]81. We performed the DEA with the edgeR R
   package^[234]82. We compared the tumor with the normal samples, and a
   False Discovery Rate (FDR) cutoff was set to 0.01, along with log
   fold-change lower threshold
   [MATH: <mrow><mo
   stretchy="false">|</mo><mi>l</mi><mi>o</mi><mi>g</mi><mo
   stretchy="false">(</mo><mi>F</mi><mi>C</mi><mo
   stretchy="false">)</mo><mo
   stretchy="false">|</mo><mo>≥</mo><mn>1</mn></mrow> :MATH]
   .

GEO data

   We downloaded the microarray data from GEO for asthma ([235]GSE137268)
   and rheumatoid arthritis ([236]GSE15573). The normalized data were
   downloaded using the GEOquery R package^[237]83. Differential
   expression analyses have been carried out with limma^[238]84 setting
   the FDR threshold to 0.05 and FC lower threshold
   [MATH: <mrow><mo
   stretchy="false">|</mo><mi>l</mi><mi>o</mi><mi>g</mi><mo
   stretchy="false">(</mo><mi>F</mi><mi>C</mi><mo
   stretchy="false">)</mo><mo
   stretchy="false">|</mo><mo>≥</mo><mn>1</mn></mrow> :MATH]
   . We decided to increase the FDR threshold respect to that one chosen
   for the TCGA data to obtain a higher number of DEGs.

COVID-19 data

   We retrieved the DEGs from three COVID-19 datasets included in two
   different studies. In the first study, Xiong et al.^[239]30 carried out
   transcriptome sequencing of the RNAs isolated from the bronchoalveolar
   lavage fluid (BALF) and peripheral blood mononuclear cells (PBMCs)
   specimens of COVID-19 patients. In both cases, they compared the case
   with healthy patients, and we retrieved the DEGs using the pipeline
   released by authors. Instead, Blanco-Melo et al.^[240]29 infected the
   human adenocarcinomic alveolar basal epithelial (A549) cells with
   SARS-CoV-2 (COVID-19-cells case) and compared them with controls. The
   authors released only the statistical parameters (ex: logFC, adjusted p
   value, etc...) from the DEA for each gene. We selected the DEGs
   choosing the FDR less than 0.05 and
   [MATH: <mrow><mo
   stretchy="false">|</mo><mi>l</mi><mi>o</mi><mi>g</mi><mo
   stretchy="false">(</mo><mi>F</mi><mi>C</mi><mo
   stretchy="false">)</mo><mo
   stretchy="false">|</mo><mo>≥</mo><mn>1</mn></mrow> :MATH]
   . Additionally, the R DESeq package was used to perform DEA in all the
   three COVID-19 studies, but the authors released their data
   differently; in the first work the DEA code was available on GitHub
   ([241]https://github.com/zhouyulab/ncov/tree/master/figs), in the
   second one only the summary statistics were made available
   (Supplementary Table [242]S2 in
   [243]https://www.biorxiv.org/content/10.1101/2020.03.24.004655v1.supple
   mentary-material).

Validation of the active subnetwork

   In this section, we sum up some enrichment analyses conducted to
   evaluate each method’s power to detect pathways associated to the
   diseases studied in this work. The R software has been used in these
   analyses^[244]85.

Pathway enrichment analysis: reactome and DisGeNET

   We performed the pathway enrichment analysis based on the Reactome
   database for each active subnetwork obtained for each case study. We
   employed the enrichPathway function of the R package ReactomePA^[245]86
   to retrieve the enriched pathways with adjusted p-values below the 0.05
   cutoff. We calculated the number of disease-associated genes in each
   enriched pathway using the DisGeNET database^[246]87. Lastly, we
   calculated the number of enriched pathways with at least two
   disease-associated genes.

Enrichment analyses based on different databases

   Different databases including specific-disease pathways are available.
   In order to get a broader investigation of these pathways, we decided
   to use two R packages enrichR^[247]88 and gprofiler2^[248]89, which use
   different databases of interest. In the first case, we conducted the
   analysis using DISEASES^[249]90, OMIM_Disease^[250]91 and
   WikiPathways_2019_Human^[251]92 databases. In the second case, we took
   into account the KEGG^[252]93, GO^[253]94 (BP category) and Human
   phenotype ontology databases^[254]95. For each analysis we extracted
   enriched pathways (with adjusted p value < 0.05) whose description
   included specific key-words (example: “prostate”, “carcinoma”,
   “asthma”, “colon”, “rheumatoid”, “virus”, “infection”) and compared
   their adjusted p-values in -log10-scale.

Supplementary information

   [255]Supplementary information.^ (4.3MB, pdf)

Acknowledgements