Abstract

   A fundamental issue related to the understanding of the molecular
   mechanisms, is the way in which common pathways act across different
   biological experiments related to complex diseases. Using network-based
   approaches, this work aims to provide a numeric characterization of
   pathways across different biological experiments, in the prospect to
   create unique footprints that may characterise a specific disease under
   study at a pathway network level. In this line we propose PathExNET, a
   web service that allows the creation of pathway-to-pathway expression
   networks that hold the over- and under expression information obtained
   from differential gene expression analyses. The unique numeric
   characterization of pathway expression status related to a specific
   biological experiment (or disease), as well as the creation of diverse
   combination of pathway networks generated by PathExNET, is expected to
   provide a concrete contribution towards the individualization of
   disease, and further lead to a more precise personalised medicine and
   management of treatment.

   PathExNET is available at:
   [31]https://bioinformatics.cing.ac.cy/PathExNET and at
   [32]https://pathexnet.cing-big.hpcf.cyi.ac.cy/

   Keywords: Pathway expression networks, Differential expression
   analysis, Pathway analysis

1. Introduction

   Analysis of differential gene expression profiles often generates
   top-scored gene-sets on which pathway-based enrichment analysis is
   routinely performed, leading to a statistically significant list of
   pathways, that may be related to the underlying biology of the
   condition being studied [33][7], [34][28], [35][51]. The challenge
   through these types of analyses is to find specific pathways affected
   by a group of related genes, namely pathways perturbed by
   differentially expressed genes. Although such tools may reveal
   significant top-scored pathways, the pathway complexity and the varying
   characteristics of genes do not easily allow to optimally relate these
   pathways to a specific biological condition being studied [36][18].
   Despite the magnificent efforts of differential expression analysis
   pipelines, generating unbiased results is still a challenge, while
   common aetiologies of such failures usually vary between issues related
   to the experimental setup and difficulties in customization of the
   statistical analysis tools [37][9], [38][26]. Scoring and filtering of
   differentially expressed data results to a loss of a large amount of
   important yet not statistically significant genes, where despite their
   weak statistical significance, their contribution into the biology of
   the condition being studied remains a relatively unexplored scientific
   issue. Another crucial confusing issue scientists usually face through
   enrichment analysis, is a quite significant list of top-scored pathways
   that are common across a variety of diverse diseases [39][42],
   [40][46]. For example, the pathway of “apoptosis“ is a very common
   pathway that appears very often in several studies, while less generic
   pathways such as “N-glycosylation” or “N-glycan biosynthesis” have been
   also associated with a series of congenital disorders [41][13]. However
   a pathway’s association across different diseases, by no means suggests
   in biochemical and/or biological terms, that a specific pathway
   contributes in the same way to all types of diseases in which may be
   identified. Indeed, biological pathways can be considered as
   topological networks formed by sets of genes or molecules that interact
   through chemical reactions, molecule modifications or signal
   transduction [42][5], [43][48]. Thus their significance should be a
   result that derives from the integration of both gene-set analysis and
   topology information [44][35], [45][48]. In this line, network-based
   approaches have proved to be a promising Systems Bioinformatics
   framework of analysis, both at gene and pathway analysis level
   [46][22], [47][38], [48][53]. In the prospect to develop of additional
   tools able to enrich the outcome of the routinely performed pipelines
   used for this type of research, the present work aims to explore
   whether the overall differential expression information of the genes
   included in a specific pathway is adequate to give a different
   characterization for the same pathway across different diseases. Using
   network-based approaches, in this work we present a methodology and a
   related web-service for the numeric characterization of pathways across
   different differential expression datasets, able to provide unique
   pathway network footprints, which in turn may represent a specific
   biological condition (and/or disease) under study. In this line, we
   propose PathExNET, a web service that allows the creation of pathway
   expression networks that hold the over- and under-expression
   information obtained from differential gene expression analyses.
   PathExNET holds a large database of reference pathway-to-pathway
   networks, which have been developed through the freely available
   information included in the KEGG, Reactome and Wiki Pathways database
   repositories. Users can upload their differential gene expression
   statistical analysis, followed with pathways and/or genes of interest,
   and further chose a scoring methodology to create and explore the
   derived pathway-to-pathway expression networks. In order to provide a
   concrete set of well-evaluated differential gene expression statistical
   analyses and to further increase the data-availability and easy data
   access of PathExNET, an additional tool has been rooted in PathExNET
   framework that allows to search and directly import pre-processed
   statistic files from the Expression Atlas (EA)
   ([49]https://www.ebi.ac.uk/gxa) data resource of the European
   Bioinformatics Institute (EMBL-EBI) ([50]https://www.ebi.ac.uk).

2. Software description & methods

   By definition, the term Pathway Expression Networks (PENs) employed in
   this work refers to pathway-to-pathway networks, where: (a) the nodes
   are pathways, the node size and the node colour represent a specific
   parameter that characterizes the level of over- and under-expression
   statistical information of genes included in a specific pathway, and
   (b) the edge-weight represents the number of common genes between two
   pathways. PENs draw from the log-fold-change (
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   ) parameter obtained through the Differential Expression Analysis (DEA)
   of genes. The pathway characteristic parameter is obtained by means of
   four diverse methodologies employed in this work. In the following we
   describe in detail the main components and methodologies used for the
   implementation of the proposed tool.

2.1. Overall design and software availability

   PathExNET comes with a frontend web interface that consists of the
   mainframe and a help page, written in HTML, PHP and JavaScript language
   environments. The mainframe provides 2 individual steps designed to
   guide the user until the end of the workflow process. The backend of
   PathExNET has been written in R environment, where several
   functionalities have been parallelised to achieve fast performance.
   Evaluation, testing and understanding of PathExNET functionalities can
   be easily performed by means of several available example datasets
   provided through the web interface. The proposed tool is available
   online at the webpage of the Bioinformatics Department, at the Cyprus
   Institute of Neurology and Genetics (CING)
   ([51]http://bioinformatics.cing.ac.cy). PathExNET is served by a Docker
   space at the CYTERA High Performance Computer Facility of the Cyprus
   Institute ([52]https://hpcf.cyi.ac.cy). PathExNET further uses parallel
   processing scripts to handle and pre-process large file sizes that make
   the use of the “doParallel” R package [53][4].

2.2. The pathway reference network repository

   The pathway-to-pathway network information draws from a web-service
   that holds a large database of reference pathway networks, which have
   been developed through the freely available information included in the
   KEGG [54][41], Reactome [55][11] and Wiki Pathways [56][25] database
   repositories. Herein, the functional relation between two pathways that
   forms an edge in a network, is considered when a specific pathway
   involves or is being involved in to another pathway accordingly. In
   effect, this type of information which is mainly obtained from the
   available XML maps of the above mentioned repositories, can form an
   undirected-unweighted pathway-to-pathway network. In this line of
   thought, a large number of pathway XML maps were obtained for all the
   organisms included in the three above mentioned repositories, and all
   the available data related to the functional connections that exist
   between all the available pathways were retrieved. Specifically, we
   obtained 177 organisms from KEGG, 16 from Reactome and 38 from Wiki
   Pathways repositories, accordingly. The output of this data mining
   process was further used to construct in total 231 undirected pathway
   reference networks, stored in a data repository. Further information
   rooted in these reference pathway-to-pathway networks, involves the
   number of common genes between two pathways that forms the edge-weight
   of these networks, and the number of total genes included in a pathway
   that forms the node size. The underlying networks are regularly
   updated, constructing the main pathway repository for the services and
   methodologies that PathExNET draws from. It should be noticed that an
   initial version of this reference network repository supporting only 16
   organisms from KEGG and Reactome repositories, has been recently used
   in PathwayConnector [57][34], [58][35], with noteworthy results to
   pathways related to Alzheimer’s Disease (AD) [59][53], to Huntington’s
   disease (HD) and Spastic Ataxia (SA) [60][23], as well as to a recent
   study on Breast Cancer [61][16].

2.3. The expression Atlas searching and importing tool

   The Expression Atlas (EA) is a database repository that provides
   information about gene and protein expression in different species and
   contexts, namely: tissue, developmental stage, and disease or cell
   type. The EA web service is hosted at the European Bioinformatics
   Institute (EMBL-EBI) ([62]https://www.ebi.ac.uk). EA holds a large set
   of publicly available and controlled access datasets that at the time
   of writing derive from over 4,000 studies across 65 different species,
   including over 900 studies from plants. These datasets have been
   curated and re-analysed using standardized, open source pipelines and
   have been made available along with the analyses data for queries,
   download and visualization [63][40]. EA incorporates baseline
   expression profiles of tissues from Human Protein Atlas, GTEx and
   FANTOM5, as well as of cancer cell lines from ENCODE, CCLE and
   Genentech projects. Through the last update EA incorporates data from
   large-scale RNA sequencing studies including Blueprint, PCAWG, ENCODE,
   GTEx and HipSci. In order to provide a concrete set of well-evaluated
   DEA data files, a productive collaboration with EA team led to the
   development of an additional search tool, rooted in PathExNET
   framework. The underlying tool allows users to search and directly
   import into PathExNET pre-processed DEA files. Users can search by
   means of specific EA experiment accession, organism, and experiment
   type, or alternatively perform free text keyword search.

2.4. Creating pathway expression networks

   There are three input combinations where users can provide to create
   PENs: (a) DEA file accompanied with list of pathways of interest, (b)
   DEA file accompanied with list of genes of interest, and (c) DEA file
   accompanied with a list of pathways and a list of genes of interest. A
   significant differentiation in this approach is that PENs use all the
   genes included in the experiment, thus the DEA files should be used
   unfiltered without performing any specific threshold for reducing their
   size. The DEA file should at least include the gene-symbol and the
   log-fold-change value for each gene included in the file, while the
   p-value field is optional. These parameters should be strictly named
   as: “Gene.symbol”, “logFC”, and “P.Value”, accordingly. Pathways and
   genes of interest may derive from any type of omics data analysis that
   leads to significant pathways and genes accordingly. The proposed
   methodology for the creation of PENs reads as follows. For a specific
   pathway of interest, our methodology first finds all the genes included
   in the pathway. This type of information is obtained from the pathway
   reference network repository described in the previous section, which
   holds all the genes involved in each pathway. These genes are further
   matched with those that derive from the DEA files, where the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   value is attached for each one of these genes. Synonyms of gene symbols
   are also considered in this process in the prospect to reduce the
   number genes that may be missed through this type of matching. The next
   step of our method involves the assignment of a specific numeric value
   to the pathway of interest by means of the following equations:

   The sumFC value is obtained by calculating the sum of all the
   log-fold-change values included in the specific pathway, as follows:
   [MATH: <mrow><mi mathvariant="italic">sumFC</mi><mo
   linebreak="goodbreak">=</mo><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=
   </mo><mn>1</mn></mrow><mi>N</mi></munderover><mi
   mathvariant="normal">l</mi><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">g</mi><mrow><mo
   stretchy="false">(</mo><mi>F</mi><mi>C</mi><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   (1)

   where
   [MATH: <mrow><mi>N</mi></mrow> :MATH]
   refers to the total number of genes included in the pathway, and
   [MATH: <mrow><mi mathvariant="normal">log</mi><mrow><mfenced open="("
   close=")"><mrow><mi>F</mi><mi>C</mi></mrow></mfenced></mrow></mrow>
   :MATH]
   is the logarithmic representation of the fold-change (FC) value. When
   this overall score is above zero, the specific pathway of interest is
   mostly considered as an over-expressed pathway. On the contrary for
   negative values of
   [MATH: <mrow><mi mathvariant="italic">sumFC</mi></mrow> :MATH]
   the pathway is mostly considered as an under-expressed pathway. However
   this approach may provide biased results since the underlying score
   does not take into account the balance between the number of over and
   under expressed genes in a sample. For example, four genes with
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values of −0.2, −0.3, −0.1 and 0.8, would give a
   [MATH: <mrow><mi mathvariant="italic">sumFC</mi><mo
   linebreak="goodbreak" linebreakstyle="after">=</mo><mn>0.2</mn></mrow>
   :MATH]
   , suggesting an over-expressed network, against the fact that the
   sample includes more under-expressed genes that over-expressed ones. To
   handle with this limitation we proceed with two additional equations
   that lead to a combined score. These read as follows:

   The rateFC value is the fraction of the number of over-expressed genes
   divided by the number of total genes included in the pathway, as
   follows:
   [MATH: <mrow><mi mathvariant="italic">rateFC</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mo>#</mo><mi>o</mi><mi>f</mi>
   <mi>o</mi><mi>v</mi><mi>e</mi><mi>r</mi><mo>-</mo><mi>e</mi><mi>x</mi><
   mi>p</mi><mi>r</mi><mi>e</mi><mi>s</mi><mi>s</mi><mi>e</mi><mi>d</mi><m
   i>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>s</mi></mrow><mrow><mo>#</mo>
   <mi>o</mi><mi>f</mi><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>s</mi><
   /mrow></mfrac></mrow> :MATH]
   (2)

   For
   [MATH: <mrow><mi
   mathvariant="italic">rateFC</mi><mo>≥</mo><mn>0.5</mn></mrow> :MATH]
   , the specific pathway of interest is mostly considered as an
   over-expressed pathway. On the contrary for
   [MATH: <mrow><mi
   mathvariant="italic">rateFC</mi><mo>≤</mo><mn>0.5</mn></mrow> :MATH]
   the pathway is considered as an under-expressed pathway.

   The normMeanFC value is obtained by calculating the weighted mean of
   the normalised histogram of the log-fold-change values. Specifically,
   for a given vector of log-fold-change values
   [MATH: <mrow><msub><mi>V</mi><mrow><mi
   mathvariant="italic">LFC</mi></mrow></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><msub><mi>v</mi><mn>1</mn></msub><mo>,</mo
   ><msub><mi>v</mi><mn>2</mn></msub><mo>,</mo><mo>⋯</mo><msub><mi>v</mi><
   mi>n</mi></msub></mrow> :MATH]
   we first apply a normalisation function in order to restrict the values
   within the range of
   [MATH: <mrow><msub><mi>V</mi><mrow><mi
   mathvariant="italic">LFC</mi></mrow></msub><mo>∈</mo></mrow> :MATH]
   .
   [MATH: <mrow><msub><mi>V</mi><mrow><mi
   mathvariant="italic">norm</mi></mrow></msub><mo
   linebreak="goodbreak">=</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>V</mi><mrow><mi
   mathvariant="italic">LFC</mi></mrow></msub><mo>-</mo><mi>m</mi><mi>i</m
   i><mi>n</mi><mrow><mo
   stretchy="false">(</mo><mi>V</mi><mi>_</mi><mi>L</mi><mi>F</mi><mi>C</m
   i><mo stretchy="false">)</mo></mrow><mo
   stretchy="false">)</mo></mrow><mo stretchy="false">/</mo><mrow><mo
   stretchy="false">(</mo><mi>m</mi><mi>a</mi><mi>x</mi><mrow><mfenced
   open="(" close=")"><mrow><msub><mi>V</mi><mrow><mi
   mathvariant="italic">LFC</mi></mrow></msub></mrow></mfenced></mrow><mo>
   -</mo><mi>m</mi><mi>i</mi><mi>n</mi><mrow><mo
   stretchy="false">(</mo><mi>V</mi><mi>_</mi><mi>L</mi><mi>F</mi><mi>C</m
   i><mo stretchy="false">)</mo></mrow><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   (3)

   Then the histogram of the normalised vector
   [MATH: <mrow><msub><mi>V</mi><mrow><mi
   mathvariant="italic">norm</mi></mrow></msub></mrow> :MATH]
   , is calculated using a bin of 0.01, which results to
   [MATH: <mrow><mi>N</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>100</mn></mrow> :MATH]
   ranges (
   [MATH: <mrow><mi>x</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><msub><mi>x</mi><mn>1</mn></msub><mo>,</mo
   ><msub><mi>x</mi><mn>2</mn></msub><mo>,</mo><mo>⋯</mo><msub><mi>x</mi><
   mn>100</mn></msub></mrow> :MATH]
   ), represented by their frequencies
   [MATH: <mrow><mi>F</mi><mo
   stretchy="false">(</mo><msub><mi>x</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow> :MATH]
   , which in turn are used to calculate the weight vector
   [MATH: <mrow><mi mathvariant="normal">W</mi><mrow><mfenced open="("
   close=")"><mrow><msub><mi>x</mi><mi>i</mi></msub></mrow></mfenced></mro
   w><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mfrac><mrow><mi
   mathvariant="normal">F</mi><mrow><mfenced open="("
   close=")"><mrow><msub><mi>x</mi><mi>i</mi></msub></mrow></mfenced></mro
   w></mrow><mi>N</mi></mfrac></mrow> :MATH]
   . The weighted mean is then obtained by the following equation:
   [MATH: <mrow><mi mathvariant="italic">normMeanFC</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msubsup><mo>∑</mo><mrow><mi>i
   </mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>W</mi><mrow><mf
   enced open="("
   close=")"><mrow><msub><mi>x</mi><mi>i</mi></msub></mrow></mfenced></mro
   w><msub><mi>x</mi><mi>i</mi></msub></mrow><mrow><msubsup><mo>∑</mo><mro
   w><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mi>W</mi><m
   row><mfenced open="("
   close=")"><mrow><msub><mi>x</mi><mi>i</mi></msub></mrow></mfenced></mro
   w></mrow></mfrac></mrow> :MATH]
   (4)

   Eq. [64](4) suggests that for a normal distribution of
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values, those with a larger weight contribute more to the weighted mean
   than those with a smaller weight. In effect a numeric mean
   characterisation of a pathway will be based on most frequent values
   that exist in the vector. Herein, for
   [MATH: <mrow><mi mathvariant="normal">n</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi
   mathvariant="normal">m</mi><mi mathvariant="normal">M</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">n</mi><mi mathvariant="normal">F</mi><mi
   mathvariant="normal">C</mi><mo>≥</mo><mn>0.5</mn></mrow> :MATH]
   , the specific pathway of interest is mostly considered as an
   over-expressed pathway. For
   [MATH: <mrow><mi mathvariant="normal">n</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi
   mathvariant="normal">m</mi><mi mathvariant="normal">M</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">n</mi><mi mathvariant="normal">F</mi><mi
   mathvariant="normal">C</mi><mo><</mo><mn>0.5</mn><mo>,</mo></mrow>
   :MATH]
   the pathway is considered as an under-expressed pathway. The underlying
   metric aims to slightly fix the ambiguity in the
   [MATH: <mrow><mi mathvariant="italic">sumFC</mi></mrow> :MATH]
   balancing by estimating the normalised weight of the distribution of
   the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values included in the sample.

   The combinedFC value is a combination of equations [65](2), [66](4) as
   follows:
   [MATH: <mrow><mi mathvariant="italic">combinedFC</mi><mo
   linebreak="goodbreak">=</mo><mi>r</mi><mi>a</mi><mi>t</mi><mi>e</mi><mi
   >F</mi><mi>C</mi><mo
   linebreak="badbreak">+</mo><mi>n</mi><mi>o</mi><mi>r</mi><mi>m</mi><mi>
   M</mi><mi>e</mi><mi>a</mi><mi>n</mi><mi>F</mi><mi>C</mi></mrow> :MATH]
   (5)

   Typically for
   [MATH: <mrow><mi
   mathvariant="italic">combinedFC</mi><mo>≥</mo><mn>1.0</mn></mrow>
   :MATH]
   , the specific pathway of interest is mostly considered as an
   over-expressed pathway. For
   [MATH: <mrow><mi
   mathvariant="italic">combinedFC</mi><mo><</mo><mn>1.0</mn><mo>,</mo></m
   row> :MATH]
   the pathway is considered as an under-expressed pathway.

   In order to estimate a score for the overall pathway expression network
   we calculate the overall expression ratio which is defined as the
   fraction of:
   [MATH: <mrow><msub><mi>R</mi><mrow><mi
   mathvariant="italic">NET</mi></mrow></msub><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mo>#</mo><mi>o</mi><mi>f</mi>
   <mi>o</mi><mi>v</mi><mi>e</mi><mi>r</mi><mo>-</mo><mi>e</mi><mi>x</mi><
   mi>p</mi><mi>r</mi><mi>e</mi><mi>s</mi><mi>s</mi><mi>e</mi><mi>d</mi><m
   i>p</mi><mi>a</mi><mi>t</mi><mi>h</mi><mi>w</mi><mi>a</mi><mi>y</mi><mi
   >s</mi></mrow><mrow><mo>#</mo><mi>o</mi><mi>f</mi><mi>p</mi><mi>a</mi><
   mi>t</mi><mi>h</mi><mi>w</mi><mi>a</mi><mi>y</mi><mi>s</mi></mrow></mfr
   ac></mrow> :MATH]
   (6)

   For values of
   [MATH: <mrow><msub><mi>R</mi><mrow><mi
   mathvariant="italic">NET</mi></mrow></msub><mo>≥</mo><mn>0.5</mn></mrow
   > :MATH]
   , the network is considered as an over-expressed network, while for
   values
   [MATH: <mrow><msub><mi>R</mi><mrow><mi
   mathvariant="italic">NET</mi></mrow></msub><mo><</mo><mn>0.5</mn></mrow
   > :MATH]
   , the network is considered as an under-expressed network,
   respectively.

   Herein, we clarify that there is not an optimal theory that clearly
   defines where is the transition line between over- and under-expression
   of gene sets. The most widely used methods that use the log-fold-change
   value, mainly examine how the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   value is different from zero, without suggesting any biologically
   objective truth [67][33]. Thus the transitions of
   [MATH: <mrow><mn>0.5</mn></mrow> :MATH]
   and
   [MATH: <mrow><mn>1.0</mn></mrow> :MATH]
   used in the above equations have been arbitrary selected, assuming that
   the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values included in a gene expression dataset, which has been
   transformed and normalised successfully, follow a typical normal
   Gaussian-like distribution around zero.

2.5. Performing enrichment analysis

   In order to provide an indicative information that shows whether the
   selected by the user pathways are also significant in terms of
   enrichment analysis, we used the “gprofile2” R package which has been
   also suggested as a main analysis tool by the ELIXIR consortium
   [68][29]. Specifically, when users provide lists of genes, the tool
   performs pathway enrichment analysis by using these genes. The
   enrichment score (namely the p-value) obtained for each selected
   pathway, is now provided on the visualised networks, especially when
   the user puts the mouse cursor over a specific node. To further handle
   the large network problem we rooted into the tool the possibility to
   select the maximum number of top-scored pathways to be visualised in
   the network. The ranking is based on the p-value score obtained from
   enrichment analysis of the given gene-set. Herein the limitation we
   find in this approach is that insignificant lists of pathways provided
   by the user may not be included in the enrichment result. In that case
   the enrichment score is simply NA for those pathways.

2.6. Providing gene regulatory information

   Another significant issue in studying the expressional behaviour of
   pathways is the regulatory information in between the genes included in
   a specific pathway. Thus in order to provide such information through
   PathExNET framework, we further created a database repository that
   includes regulatory information in between genes, obtained from both
   the Transcriptional Regulatory Relationships Unraveled by
   Sentence-based Text mining (TRRUST) [69][20], and the SIGnaling Network
   Open Resource (SIGNOR) [70][31] repositories. The underlying repository
   includes regulation information about 95,086 pairs of genes. However,
   the specific information, although significant, remains limited since
   it is available for only three species: Homo Sapiens, Mus Musculus, and
   Rattus Novergicus, accordingly. Depending on the user’s input,
   PathExNET automatically examines the genes of interest and further
   provides the regulatory information to where is available, in a single
   table.

2.7. Exporting network statistics

   The mathematical content of complex networks in biological systems has
   become a benchmark approach towards identifying biomarkers,
   understanding their dynamics, their biological status and related
   biological mechanisms involved. In order to provide more statistics
   related to the complex nature of the proposed pathway expression
   networks, PathExNET further provides additional statistics, for network
   manipulation [71][8]. These indicatively include measures of median,
   mean and maximum values of: betweeness-centrality, degree distribution,
   closeness, and clustering coefficient. This attempt aims to create a
   concrete web-framework of analysis adequate to provide a multilevel
   information content, sufficient for further investigation and
   understanding of pathway networks.

3. Demonstration of PathExNET capabilities

   It should be stressed that the concept of PathExNET is not to serve as
   enrichment tool but as a post analysis tool that facilitates an
   estimation and the subsequent visualization of the collective
   over/under expression of the selected pathways’ gene members. As
   opposed to traditional gene enrichment tools and methodologies
   [72][30], [73][37], [74][45], PathExNET allows users to create pathway
   expression networks in order to evaluate specific biological conditions
   where pathways or genes of interest are not necessary significant,
   namely a high-score result of a ranking methodology. Thus the equations
   provided in PathExNET have been designed in a simplified manner in
   order to be independent of any gene or pathway rankings. In the
   following subsections we present two different case studies in order to
   support this argument and to show how same clusters of pathways behave
   across different gene expression datasets.

3.1. A case study on SARS-CoV-2 experimental data

   A common biological perspective for an effective treatment against
   COVID-19 and its causative virus, SARS-CoV-2, is the deciphering of the
   involved host pathways, as well as the related transmission and
   replication mechanisms [75][6], [76][19], [77][44]. In this line of
   thought, our approach here focuses on the examination of the
   over-expressed and under-expressed gene content of specific categories
   of pathway networks related to COVID-19. On this ground, we are using
   PathExNET to analyze a recently introduced high throughput sequencing
   expression dataset, related to the transcriptional response of human
   lung epithelial cells to SARS-CoV-2 infection [78][3]. The dataset
   includes expression profiles of two independent biological triplicates
   of: (i) normal human bronchial epithelial (NHBE) cells and (ii)
   transformed lung alveolar (A549) cells, which were both mock treated or
   infected with SARS-CoV-2 (USA-WA1/2020). The underlined subset has been
   analysed by the team of EA, who performed differential expression
   analysis, available on EA repository at
   [79]https://www.ebi.ac.uk/gxa/experiments/E-GEOD-147507. Herein, we
   used the PathExNET EA tool to download the analysis performed by EA,
   namely the unfiltered statistics required for the creation of the
   pathway expression networks proposed in this work. [80]Fig. 1a depicts
   the frequency distribution of the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   parameter included in these statistics, showing that both samples are
   well-distributed around the zero point. In addition, [81]Fig. 1b
   depicts the over- and under-expressed information estimated by means of
   the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   parameter. Both samples seem to exhibit the same behaviour between the
   over- and under- expressed genes, with estimated ratios
   [MATH: <mrow><msub><mi>R</mi><mrow><mi
   mathvariant="normal">A</mi><mn>549</mn></mrow></msub><mo
   linebreak="goodbreak" linebreakstyle="after">=</mo><mn>0.46</mn></mrow>
   :MATH]
   , and
   [MATH: <mrow><msub><mi>R</mi><mrow><mi mathvariant="normal">N</mi><mi
   mathvariant="normal">H</mi><mi mathvariant="normal">B</mi><mi
   mathvariant="normal">E</mi></mrow></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>0.49</mn></mrow> :MATH]
   , accordingly, where the ratio
   [MATH: <mrow><mi>R</mi></mrow> :MATH]
   refers to the fraction of the number of over-expressed genes, to the
   total number of genes included in the sample.

Fig. 1.

   [82]Fig. 1
   [83]Open in a new tab

   (a) Depicts the frequency distribution of
   [MATH: <mrow><mrow><mi mathvariant="bold-italic">l</mi></mrow><mrow><mi
   mathvariant="bold-italic">o</mi></mrow><mrow><mi
   mathvariant="bold-italic">g</mi></mrow><mrow><mi
   mathvariant="bold-italic">F</mi></mrow><mrow><mi
   mathvariant="bold-italic">C</mi></mrow></mrow> :MATH]
   values obtained from the differential expression analysis of A549 and
   NHBE comparisons, (b) depicts the number of over- and under-expressed
   genes per comparison, estimated by means of the
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   parameter.

   Furthermore, [84]Fig. 2 depicts the Venn diagrams that represent the
   overlap between the two samples regarding (a) their over-expressed
   genes, (b) their under-expressed genes, and (c) their genes with zero
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values, respectively. It is observed that both samples share almost the
   same amount of common over-expressed and under-expressed genes, which
   in effect secures that the obtained pathway expression networks will be
   a result of a well-balanced common genes included in these pathways.

Fig. 2.

   [85]Fig. 2
   [86]Open in a new tab

   Venn diagrams representing the datasets overlapping for: (a) their
   over-expressed genes, (b) their under-expressed genes, (c) their genes
   with zero
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values, included in the two cell lines under study.

   It has been stated that for COVID-19, as well as for all infectious
   diseases, the host immune system, is a major component [87][2],
   [88][14], towards understanding the host response on the infection. In
   this line, 20 pathways related to the immune system, were obtained from
   the KEGG pathway repository, in order to be analyzed by means of the
   PathExNET methodology. The compiled pathway list, was further used to
   create pathway-to-pathway expression networks, in combination with the
   differential expression statistics obtained from EA, by means of the
   PathExNET methodology proposed in this work. Specifically, [89]Fig. 3a
   depicts a pathway expression network that includes the 20 candidate
   KEGG pathways, where the node (pathway) values have been estimated by
   means of the
   [MATH: <mrow><mi mathvariant="italic">combinedFC</mi></mrow> :MATH]
   score (see Eq. [90](5)) described in previous section. The
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values have been obtained from the A549 analysed dataset. The
   edge-weights refer to the number of common genes between the two
   pathways that form the edge. [91]Fig. 3b depicts the same analysis for
   the NHBE differential expression dataset accordingly. As opposed to the
   colour scale provided by the web tool, here in order to show the
   expression difference in between these two networks, an arbitrary
   transition threshold was selected while the colour scale used includes
   only two colours. Specifically, the red-coloured pathways refer to the
   over-expressed pathways (
   [MATH: <mrow><mi
   mathvariant="italic">combinedFC</mi><mo>≥</mo><mn>1.0</mn></mrow>
   :MATH]
   ), while the blue ones are the under-expressed ones (
   [MATH: <mrow><mi
   mathvariant="italic">combinedFC</mi><mo><</mo><mn>1.0</mn></mrow>
   :MATH]
   ). The overall network expression ratio (see Eq. [92](6)) was found
   [MATH: <mrow><msub><mi>R</mi><mrow><mi
   mathvariant="italic">NET</mi></mrow></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>0.40</mn></mrow> :MATH]
   for the A549 gene-set, and
   [MATH: <mrow><msub><mi>R</mi><mrow><mi
   mathvariant="italic">NET</mi></mrow></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>0.05</mn></mrow> :MATH]
   for the NHBE gene-set, suggesting that the first is considered as an
   enriched network of pathways with higher content in over-expressed
   genes in contrast to the second one. In consequence, the latter
   methodology suggests that pathway networks obtained from the given set
   of gene expressions, may have a unique identity in terms of the
   proposed scores, which in effect may contribute to a unique
   characterization of the specific biological condition under study. It
   should be stressed that the specific analysis draws from a simple
   biological concept that aims to examine the expression status of all
   the immune system pathway networks, subjected to SARS-COV-2 infection.

Fig. 3.

   [93]Fig. 3
   [94]Open in a new tab

   (a, b) Pathway expression networks of 20 KEGG pathways related to
   immune system. The node values have been estimated by means of the
   [MATH: <mrow><mi mathvariant="italic">combinedFC</mi></mrow> :MATH]
   parameter. The
   [MATH: <mrow><mi mathvariant="italic">logFC</mi></mrow> :MATH]
   values have been obtained from the A549 and NHBE datasets. The red
   nodes refer to the pathways with high content of over-expressed genes (
   [MATH: <mrow><mi
   mathvariant="italic">combinedFC</mi><mo>≥</mo><mn>1.0</mn></mrow>
   :MATH]
   ), while the blue nodes refer to the pathways with high content of
   under-expressed genes (
   [MATH: <mrow><mi
   mathvariant="italic">combinedFC</mi><mo><</mo><mn>1.0</mn></mrow>
   :MATH]
   ). (For interpretation of the references to color in this figure