Abstract A fundamental issue related to the understanding of the molecular mechanisms, is the way in which common pathways act across different biological experiments related to complex diseases. Using network-based approaches, this work aims to provide a numeric characterization of pathways across different biological experiments, in the prospect to create unique footprints that may characterise a specific disease under study at a pathway network level. In this line we propose PathExNET, a web service that allows the creation of pathway-to-pathway expression networks that hold the over- and under expression information obtained from differential gene expression analyses. The unique numeric characterization of pathway expression status related to a specific biological experiment (or disease), as well as the creation of diverse combination of pathway networks generated by PathExNET, is expected to provide a concrete contribution towards the individualization of disease, and further lead to a more precise personalised medicine and management of treatment. PathExNET is available at: [31]https://bioinformatics.cing.ac.cy/PathExNET and at [32]https://pathexnet.cing-big.hpcf.cyi.ac.cy/ Keywords: Pathway expression networks, Differential expression analysis, Pathway analysis 1. Introduction Analysis of differential gene expression profiles often generates top-scored gene-sets on which pathway-based enrichment analysis is routinely performed, leading to a statistically significant list of pathways, that may be related to the underlying biology of the condition being studied [33][7], [34][28], [35][51]. The challenge through these types of analyses is to find specific pathways affected by a group of related genes, namely pathways perturbed by differentially expressed genes. Although such tools may reveal significant top-scored pathways, the pathway complexity and the varying characteristics of genes do not easily allow to optimally relate these pathways to a specific biological condition being studied [36][18]. Despite the magnificent efforts of differential expression analysis pipelines, generating unbiased results is still a challenge, while common aetiologies of such failures usually vary between issues related to the experimental setup and difficulties in customization of the statistical analysis tools [37][9], [38][26]. Scoring and filtering of differentially expressed data results to a loss of a large amount of important yet not statistically significant genes, where despite their weak statistical significance, their contribution into the biology of the condition being studied remains a relatively unexplored scientific issue. Another crucial confusing issue scientists usually face through enrichment analysis, is a quite significant list of top-scored pathways that are common across a variety of diverse diseases [39][42], [40][46]. For example, the pathway of “apoptosis“ is a very common pathway that appears very often in several studies, while less generic pathways such as “N-glycosylation” or “N-glycan biosynthesis” have been also associated with a series of congenital disorders [41][13]. However a pathway’s association across different diseases, by no means suggests in biochemical and/or biological terms, that a specific pathway contributes in the same way to all types of diseases in which may be identified. Indeed, biological pathways can be considered as topological networks formed by sets of genes or molecules that interact through chemical reactions, molecule modifications or signal transduction [42][5], [43][48]. Thus their significance should be a result that derives from the integration of both gene-set analysis and topology information [44][35], [45][48]. In this line, network-based approaches have proved to be a promising Systems Bioinformatics framework of analysis, both at gene and pathway analysis level [46][22], [47][38], [48][53]. In the prospect to develop of additional tools able to enrich the outcome of the routinely performed pipelines used for this type of research, the present work aims to explore whether the overall differential expression information of the genes included in a specific pathway is adequate to give a different characterization for the same pathway across different diseases. Using network-based approaches, in this work we present a methodology and a related web-service for the numeric characterization of pathways across different differential expression datasets, able to provide unique pathway network footprints, which in turn may represent a specific biological condition (and/or disease) under study. In this line, we propose PathExNET, a web service that allows the creation of pathway expression networks that hold the over- and under-expression information obtained from differential gene expression analyses. PathExNET holds a large database of reference pathway-to-pathway networks, which have been developed through the freely available information included in the KEGG, Reactome and Wiki Pathways database repositories. Users can upload their differential gene expression statistical analysis, followed with pathways and/or genes of interest, and further chose a scoring methodology to create and explore the derived pathway-to-pathway expression networks. In order to provide a concrete set of well-evaluated differential gene expression statistical analyses and to further increase the data-availability and easy data access of PathExNET, an additional tool has been rooted in PathExNET framework that allows to search and directly import pre-processed statistic files from the Expression Atlas (EA) ([49]https://www.ebi.ac.uk/gxa) data resource of the European Bioinformatics Institute (EMBL-EBI) ([50]https://www.ebi.ac.uk). 2. Software description & methods By definition, the term Pathway Expression Networks (PENs) employed in this work refers to pathway-to-pathway networks, where: (a) the nodes are pathways, the node size and the node colour represent a specific parameter that characterizes the level of over- and under-expression statistical information of genes included in a specific pathway, and (b) the edge-weight represents the number of common genes between two pathways. PENs draw from the log-fold-change ( [MATH: logFC :MATH] ) parameter obtained through the Differential Expression Analysis (DEA) of genes. The pathway characteristic parameter is obtained by means of four diverse methodologies employed in this work. In the following we describe in detail the main components and methodologies used for the implementation of the proposed tool. 2.1. Overall design and software availability PathExNET comes with a frontend web interface that consists of the mainframe and a help page, written in HTML, PHP and JavaScript language environments. The mainframe provides 2 individual steps designed to guide the user until the end of the workflow process. The backend of PathExNET has been written in R environment, where several functionalities have been parallelised to achieve fast performance. Evaluation, testing and understanding of PathExNET functionalities can be easily performed by means of several available example datasets provided through the web interface. The proposed tool is available online at the webpage of the Bioinformatics Department, at the Cyprus Institute of Neurology and Genetics (CING) ([51]http://bioinformatics.cing.ac.cy). PathExNET is served by a Docker space at the CYTERA High Performance Computer Facility of the Cyprus Institute ([52]https://hpcf.cyi.ac.cy). PathExNET further uses parallel processing scripts to handle and pre-process large file sizes that make the use of the “doParallel” R package [53][4]. 2.2. The pathway reference network repository The pathway-to-pathway network information draws from a web-service that holds a large database of reference pathway networks, which have been developed through the freely available information included in the KEGG [54][41], Reactome [55][11] and Wiki Pathways [56][25] database repositories. Herein, the functional relation between two pathways that forms an edge in a network, is considered when a specific pathway involves or is being involved in to another pathway accordingly. In effect, this type of information which is mainly obtained from the available XML maps of the above mentioned repositories, can form an undirected-unweighted pathway-to-pathway network. In this line of thought, a large number of pathway XML maps were obtained for all the organisms included in the three above mentioned repositories, and all the available data related to the functional connections that exist between all the available pathways were retrieved. Specifically, we obtained 177 organisms from KEGG, 16 from Reactome and 38 from Wiki Pathways repositories, accordingly. The output of this data mining process was further used to construct in total 231 undirected pathway reference networks, stored in a data repository. Further information rooted in these reference pathway-to-pathway networks, involves the number of common genes between two pathways that forms the edge-weight of these networks, and the number of total genes included in a pathway that forms the node size. The underlying networks are regularly updated, constructing the main pathway repository for the services and methodologies that PathExNET draws from. It should be noticed that an initial version of this reference network repository supporting only 16 organisms from KEGG and Reactome repositories, has been recently used in PathwayConnector [57][34], [58][35], with noteworthy results to pathways related to Alzheimer’s Disease (AD) [59][53], to Huntington’s disease (HD) and Spastic Ataxia (SA) [60][23], as well as to a recent study on Breast Cancer [61][16]. 2.3. The expression Atlas searching and importing tool The Expression Atlas (EA) is a database repository that provides information about gene and protein expression in different species and contexts, namely: tissue, developmental stage, and disease or cell type. The EA web service is hosted at the European Bioinformatics Institute (EMBL-EBI) ([62]https://www.ebi.ac.uk). EA holds a large set of publicly available and controlled access datasets that at the time of writing derive from over 4,000 studies across 65 different species, including over 900 studies from plants. These datasets have been curated and re-analysed using standardized, open source pipelines and have been made available along with the analyses data for queries, download and visualization [63][40]. EA incorporates baseline expression profiles of tissues from Human Protein Atlas, GTEx and FANTOM5, as well as of cancer cell lines from ENCODE, CCLE and Genentech projects. Through the last update EA incorporates data from large-scale RNA sequencing studies including Blueprint, PCAWG, ENCODE, GTEx and HipSci. In order to provide a concrete set of well-evaluated DEA data files, a productive collaboration with EA team led to the development of an additional search tool, rooted in PathExNET framework. The underlying tool allows users to search and directly import into PathExNET pre-processed DEA files. Users can search by means of specific EA experiment accession, organism, and experiment type, or alternatively perform free text keyword search. 2.4. Creating pathway expression networks There are three input combinations where users can provide to create PENs: (a) DEA file accompanied with list of pathways of interest, (b) DEA file accompanied with list of genes of interest, and (c) DEA file accompanied with a list of pathways and a list of genes of interest. A significant differentiation in this approach is that PENs use all the genes included in the experiment, thus the DEA files should be used unfiltered without performing any specific threshold for reducing their size. The DEA file should at least include the gene-symbol and the log-fold-change value for each gene included in the file, while the p-value field is optional. These parameters should be strictly named as: “Gene.symbol”, “logFC”, and “P.Value”, accordingly. Pathways and genes of interest may derive from any type of omics data analysis that leads to significant pathways and genes accordingly. The proposed methodology for the creation of PENs reads as follows. For a specific pathway of interest, our methodology first finds all the genes included in the pathway. This type of information is obtained from the pathway reference network repository described in the previous section, which holds all the genes involved in each pathway. These genes are further matched with those that derive from the DEA files, where the [MATH: logFC :MATH] value is attached for each one of these genes. Synonyms of gene symbols are also considered in this process in the prospect to reduce the number genes that may be missed through this type of matching. The next step of our method involves the assignment of a specific numeric value to the pathway of interest by means of the following equations: The sumFC value is obtained by calculating the sum of all the log-fold-change values included in the specific pathway, as follows: [MATH: sumFC=i= 1Nlog(FC) :MATH] (1) where [MATH: N :MATH] refers to the total number of genes included in the pathway, and [MATH: logFC :MATH] is the logarithmic representation of the fold-change (FC) value. When this overall score is above zero, the specific pathway of interest is mostly considered as an over-expressed pathway. On the contrary for negative values of [MATH: sumFC :MATH] the pathway is mostly considered as an under-expressed pathway. However this approach may provide biased results since the underlying score does not take into account the balance between the number of over and under expressed genes in a sample. For example, four genes with [MATH: logFC :MATH] values of −0.2, −0.3, −0.1 and 0.8, would give a [MATH: sumFC=0.2 :MATH] , suggesting an over-expressed network, against the fact that the sample includes more under-expressed genes that over-expressed ones. To handle with this limitation we proceed with two additional equations that lead to a combined score. These read as follows: The rateFC value is the fraction of the number of over-expressed genes divided by the number of total genes included in the pathway, as follows: [MATH: rateFC=#of over-ex< mi>pressedgenes# ofgenes< /mrow> :MATH] (2) For [MATH: rateFC0.5 :MATH] , the specific pathway of interest is mostly considered as an over-expressed pathway. On the contrary for [MATH: rateFC0.5 :MATH] the pathway is considered as an under-expressed pathway. The normMeanFC value is obtained by calculating the weighted mean of the normalised histogram of the log-fold-change values. Specifically, for a given vector of log-fold-change values [MATH: VLFC=v1,v2,v< mi>n :MATH] we first apply a normalisation function in order to restrict the values within the range of [MATH: VLFC :MATH] . [MATH: Vnorm=(VLFC-min(V_LFC))/(maxVLFC -min(V_LFC)) :MATH] (3) Then the histogram of the normalised vector [MATH: Vnorm :MATH] , is calculated using a bin of 0.01, which results to [MATH: N=100 :MATH] ranges ( [MATH: x=x1,x2,x< mn>100 :MATH] ), represented by their frequencies [MATH: F(xi) :MATH] , which in turn are used to calculate the weight vector [MATH: Wxi=FxiN :MATH] . The weighted mean is then obtained by the following equation: [MATH: normMeanFC=i =1NWxixii=1NWxi :MATH] (4) Eq. [64](4) suggests that for a normal distribution of [MATH: logFC :MATH] values, those with a larger weight contribute more to the weighted mean than those with a smaller weight. In effect a numeric mean characterisation of a pathway will be based on most frequent values that exist in the vector. Herein, for [MATH: normMeanFC0.5 :MATH] , the specific pathway of interest is mostly considered as an over-expressed pathway. For [MATH: normMeanFC<0.5, :MATH] the pathway is considered as an under-expressed pathway. The underlying metric aims to slightly fix the ambiguity in the [MATH: sumFC :MATH] balancing by estimating the normalised weight of the distribution of the [MATH: logFC :MATH] values included in the sample. The combinedFC value is a combination of equations [65](2), [66](4) as follows: [MATH: combinedFC=rateFC+norm MeanFC :MATH] (5) Typically for [MATH: combinedFC1.0 :MATH] , the specific pathway of interest is mostly considered as an over-expressed pathway. For [MATH: combinedFC<1.0, :MATH] the pathway is considered as an under-expressed pathway. In order to estimate a score for the overall pathway expression network we calculate the overall expression ratio which is defined as the fraction of: [MATH: RNET=#of over-ex< mi>pressedpathways#ofpa< mi>thways :MATH] (6) For values of [MATH: RNET0.5 :MATH] , the network is considered as an over-expressed network, while for values [MATH: RNET<0.5 :MATH] , the network is considered as an under-expressed network, respectively. Herein, we clarify that there is not an optimal theory that clearly defines where is the transition line between over- and under-expression of gene sets. The most widely used methods that use the log-fold-change value, mainly examine how the [MATH: logFC :MATH] value is different from zero, without suggesting any biologically objective truth [67][33]. Thus the transitions of [MATH: 0.5 :MATH] and [MATH: 1.0 :MATH] used in the above equations have been arbitrary selected, assuming that the [MATH: logFC :MATH] values included in a gene expression dataset, which has been transformed and normalised successfully, follow a typical normal Gaussian-like distribution around zero. 2.5. Performing enrichment analysis In order to provide an indicative information that shows whether the selected by the user pathways are also significant in terms of enrichment analysis, we used the “gprofile2” R package which has been also suggested as a main analysis tool by the ELIXIR consortium [68][29]. Specifically, when users provide lists of genes, the tool performs pathway enrichment analysis by using these genes. The enrichment score (namely the p-value) obtained for each selected pathway, is now provided on the visualised networks, especially when the user puts the mouse cursor over a specific node. To further handle the large network problem we rooted into the tool the possibility to select the maximum number of top-scored pathways to be visualised in the network. The ranking is based on the p-value score obtained from enrichment analysis of the given gene-set. Herein the limitation we find in this approach is that insignificant lists of pathways provided by the user may not be included in the enrichment result. In that case the enrichment score is simply NA for those pathways. 2.6. Providing gene regulatory information Another significant issue in studying the expressional behaviour of pathways is the regulatory information in between the genes included in a specific pathway. Thus in order to provide such information through PathExNET framework, we further created a database repository that includes regulatory information in between genes, obtained from both the Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining (TRRUST) [69][20], and the SIGnaling Network Open Resource (SIGNOR) [70][31] repositories. The underlying repository includes regulation information about 95,086 pairs of genes. However, the specific information, although significant, remains limited since it is available for only three species: Homo Sapiens, Mus Musculus, and Rattus Novergicus, accordingly. Depending on the user’s input, PathExNET automatically examines the genes of interest and further provides the regulatory information to where is available, in a single table. 2.7. Exporting network statistics The mathematical content of complex networks in biological systems has become a benchmark approach towards identifying biomarkers, understanding their dynamics, their biological status and related biological mechanisms involved. In order to provide more statistics related to the complex nature of the proposed pathway expression networks, PathExNET further provides additional statistics, for network manipulation [71][8]. These indicatively include measures of median, mean and maximum values of: betweeness-centrality, degree distribution, closeness, and clustering coefficient. This attempt aims to create a concrete web-framework of analysis adequate to provide a multilevel information content, sufficient for further investigation and understanding of pathway networks. 3. Demonstration of PathExNET capabilities It should be stressed that the concept of PathExNET is not to serve as enrichment tool but as a post analysis tool that facilitates an estimation and the subsequent visualization of the collective over/under expression of the selected pathways’ gene members. As opposed to traditional gene enrichment tools and methodologies [72][30], [73][37], [74][45], PathExNET allows users to create pathway expression networks in order to evaluate specific biological conditions where pathways or genes of interest are not necessary significant, namely a high-score result of a ranking methodology. Thus the equations provided in PathExNET have been designed in a simplified manner in order to be independent of any gene or pathway rankings. In the following subsections we present two different case studies in order to support this argument and to show how same clusters of pathways behave across different gene expression datasets. 3.1. A case study on SARS-CoV-2 experimental data A common biological perspective for an effective treatment against COVID-19 and its causative virus, SARS-CoV-2, is the deciphering of the involved host pathways, as well as the related transmission and replication mechanisms [75][6], [76][19], [77][44]. In this line of thought, our approach here focuses on the examination of the over-expressed and under-expressed gene content of specific categories of pathway networks related to COVID-19. On this ground, we are using PathExNET to analyze a recently introduced high throughput sequencing expression dataset, related to the transcriptional response of human lung epithelial cells to SARS-CoV-2 infection [78][3]. The dataset includes expression profiles of two independent biological triplicates of: (i) normal human bronchial epithelial (NHBE) cells and (ii) transformed lung alveolar (A549) cells, which were both mock treated or infected with SARS-CoV-2 (USA-WA1/2020). The underlined subset has been analysed by the team of EA, who performed differential expression analysis, available on EA repository at [79]https://www.ebi.ac.uk/gxa/experiments/E-GEOD-147507. Herein, we used the PathExNET EA tool to download the analysis performed by EA, namely the unfiltered statistics required for the creation of the pathway expression networks proposed in this work. [80]Fig. 1a depicts the frequency distribution of the [MATH: logFC :MATH] parameter included in these statistics, showing that both samples are well-distributed around the zero point. In addition, [81]Fig. 1b depicts the over- and under-expressed information estimated by means of the [MATH: logFC :MATH] parameter. Both samples seem to exhibit the same behaviour between the over- and under- expressed genes, with estimated ratios [MATH: RA549=0.46 :MATH] , and [MATH: RNHBE=0.49 :MATH] , accordingly, where the ratio [MATH: R :MATH] refers to the fraction of the number of over-expressed genes, to the total number of genes included in the sample. Fig. 1. [82]Fig. 1 [83]Open in a new tab (a) Depicts the frequency distribution of [MATH: logFC :MATH] values obtained from the differential expression analysis of A549 and NHBE comparisons, (b) depicts the number of over- and under-expressed genes per comparison, estimated by means of the [MATH: logFC :MATH] parameter. Furthermore, [84]Fig. 2 depicts the Venn diagrams that represent the overlap between the two samples regarding (a) their over-expressed genes, (b) their under-expressed genes, and (c) their genes with zero [MATH: logFC :MATH] values, respectively. It is observed that both samples share almost the same amount of common over-expressed and under-expressed genes, which in effect secures that the obtained pathway expression networks will be a result of a well-balanced common genes included in these pathways. Fig. 2. [85]Fig. 2 [86]Open in a new tab Venn diagrams representing the datasets overlapping for: (a) their over-expressed genes, (b) their under-expressed genes, (c) their genes with zero [MATH: logFC :MATH] values, included in the two cell lines under study. It has been stated that for COVID-19, as well as for all infectious diseases, the host immune system, is a major component [87][2], [88][14], towards understanding the host response on the infection. In this line, 20 pathways related to the immune system, were obtained from the KEGG pathway repository, in order to be analyzed by means of the PathExNET methodology. The compiled pathway list, was further used to create pathway-to-pathway expression networks, in combination with the differential expression statistics obtained from EA, by means of the PathExNET methodology proposed in this work. Specifically, [89]Fig. 3a depicts a pathway expression network that includes the 20 candidate KEGG pathways, where the node (pathway) values have been estimated by means of the [MATH: combinedFC :MATH] score (see Eq. [90](5)) described in previous section. The [MATH: logFC :MATH] values have been obtained from the A549 analysed dataset. The edge-weights refer to the number of common genes between the two pathways that form the edge. [91]Fig. 3b depicts the same analysis for the NHBE differential expression dataset accordingly. As opposed to the colour scale provided by the web tool, here in order to show the expression difference in between these two networks, an arbitrary transition threshold was selected while the colour scale used includes only two colours. Specifically, the red-coloured pathways refer to the over-expressed pathways ( [MATH: combinedFC1.0 :MATH] ), while the blue ones are the under-expressed ones ( [MATH: combinedFC<1.0 :MATH] ). The overall network expression ratio (see Eq. [92](6)) was found [MATH: RNET=0.40 :MATH] for the A549 gene-set, and [MATH: RNET=0.05 :MATH] for the NHBE gene-set, suggesting that the first is considered as an enriched network of pathways with higher content in over-expressed genes in contrast to the second one. In consequence, the latter methodology suggests that pathway networks obtained from the given set of gene expressions, may have a unique identity in terms of the proposed scores, which in effect may contribute to a unique characterization of the specific biological condition under study. It should be stressed that the specific analysis draws from a simple biological concept that aims to examine the expression status of all the immune system pathway networks, subjected to SARS-COV-2 infection. Fig. 3. [93]Fig. 3 [94]Open in a new tab (a, b) Pathway expression networks of 20 KEGG pathways related to immune system. The node values have been estimated by means of the [MATH: combinedFC :MATH] parameter. The [MATH: logFC :MATH] values have been obtained from the A549 and NHBE datasets. The red nodes refer to the pathways with high content of over-expressed genes ( [MATH: combinedFC1.0 :MATH] ), while the blue nodes refer to the pathways with high content of under-expressed genes ( [MATH: combinedFC<1.0 :MATH] ). (For interpretation of the references to color in this figure