Abstract
A fundamental issue related to the understanding of the molecular
mechanisms, is the way in which common pathways act across different
biological experiments related to complex diseases. Using network-based
approaches, this work aims to provide a numeric characterization of
pathways across different biological experiments, in the prospect to
create unique footprints that may characterise a specific disease under
study at a pathway network level. In this line we propose PathExNET, a
web service that allows the creation of pathway-to-pathway expression
networks that hold the over- and under expression information obtained
from differential gene expression analyses. The unique numeric
characterization of pathway expression status related to a specific
biological experiment (or disease), as well as the creation of diverse
combination of pathway networks generated by PathExNET, is expected to
provide a concrete contribution towards the individualization of
disease, and further lead to a more precise personalised medicine and
management of treatment.
PathExNET is available at:
[31]https://bioinformatics.cing.ac.cy/PathExNET and at
[32]https://pathexnet.cing-big.hpcf.cyi.ac.cy/
Keywords: Pathway expression networks, Differential expression
analysis, Pathway analysis
1. Introduction
Analysis of differential gene expression profiles often generates
top-scored gene-sets on which pathway-based enrichment analysis is
routinely performed, leading to a statistically significant list of
pathways, that may be related to the underlying biology of the
condition being studied [33][7], [34][28], [35][51]. The challenge
through these types of analyses is to find specific pathways affected
by a group of related genes, namely pathways perturbed by
differentially expressed genes. Although such tools may reveal
significant top-scored pathways, the pathway complexity and the varying
characteristics of genes do not easily allow to optimally relate these
pathways to a specific biological condition being studied [36][18].
Despite the magnificent efforts of differential expression analysis
pipelines, generating unbiased results is still a challenge, while
common aetiologies of such failures usually vary between issues related
to the experimental setup and difficulties in customization of the
statistical analysis tools [37][9], [38][26]. Scoring and filtering of
differentially expressed data results to a loss of a large amount of
important yet not statistically significant genes, where despite their
weak statistical significance, their contribution into the biology of
the condition being studied remains a relatively unexplored scientific
issue. Another crucial confusing issue scientists usually face through
enrichment analysis, is a quite significant list of top-scored pathways
that are common across a variety of diverse diseases [39][42],
[40][46]. For example, the pathway of “apoptosis“ is a very common
pathway that appears very often in several studies, while less generic
pathways such as “N-glycosylation” or “N-glycan biosynthesis” have been
also associated with a series of congenital disorders [41][13]. However
a pathway’s association across different diseases, by no means suggests
in biochemical and/or biological terms, that a specific pathway
contributes in the same way to all types of diseases in which may be
identified. Indeed, biological pathways can be considered as
topological networks formed by sets of genes or molecules that interact
through chemical reactions, molecule modifications or signal
transduction [42][5], [43][48]. Thus their significance should be a
result that derives from the integration of both gene-set analysis and
topology information [44][35], [45][48]. In this line, network-based
approaches have proved to be a promising Systems Bioinformatics
framework of analysis, both at gene and pathway analysis level
[46][22], [47][38], [48][53]. In the prospect to develop of additional
tools able to enrich the outcome of the routinely performed pipelines
used for this type of research, the present work aims to explore
whether the overall differential expression information of the genes
included in a specific pathway is adequate to give a different
characterization for the same pathway across different diseases. Using
network-based approaches, in this work we present a methodology and a
related web-service for the numeric characterization of pathways across
different differential expression datasets, able to provide unique
pathway network footprints, which in turn may represent a specific
biological condition (and/or disease) under study. In this line, we
propose PathExNET, a web service that allows the creation of pathway
expression networks that hold the over- and under-expression
information obtained from differential gene expression analyses.
PathExNET holds a large database of reference pathway-to-pathway
networks, which have been developed through the freely available
information included in the KEGG, Reactome and Wiki Pathways database
repositories. Users can upload their differential gene expression
statistical analysis, followed with pathways and/or genes of interest,
and further chose a scoring methodology to create and explore the
derived pathway-to-pathway expression networks. In order to provide a
concrete set of well-evaluated differential gene expression statistical
analyses and to further increase the data-availability and easy data
access of PathExNET, an additional tool has been rooted in PathExNET
framework that allows to search and directly import pre-processed
statistic files from the Expression Atlas (EA)
([49]https://www.ebi.ac.uk/gxa) data resource of the European
Bioinformatics Institute (EMBL-EBI) ([50]https://www.ebi.ac.uk).
2. Software description & methods
By definition, the term Pathway Expression Networks (PENs) employed in
this work refers to pathway-to-pathway networks, where: (a) the nodes
are pathways, the node size and the node colour represent a specific
parameter that characterizes the level of over- and under-expression
statistical information of genes included in a specific pathway, and
(b) the edge-weight represents the number of common genes between two
pathways. PENs draw from the log-fold-change (
[MATH: logFC :MATH]
) parameter obtained through the Differential Expression Analysis (DEA)
of genes. The pathway characteristic parameter is obtained by means of
four diverse methodologies employed in this work. In the following we
describe in detail the main components and methodologies used for the
implementation of the proposed tool.
2.1. Overall design and software availability
PathExNET comes with a frontend web interface that consists of the
mainframe and a help page, written in HTML, PHP and JavaScript language
environments. The mainframe provides 2 individual steps designed to
guide the user until the end of the workflow process. The backend of
PathExNET has been written in R environment, where several
functionalities have been parallelised to achieve fast performance.
Evaluation, testing and understanding of PathExNET functionalities can
be easily performed by means of several available example datasets
provided through the web interface. The proposed tool is available
online at the webpage of the Bioinformatics Department, at the Cyprus
Institute of Neurology and Genetics (CING)
([51]http://bioinformatics.cing.ac.cy). PathExNET is served by a Docker
space at the CYTERA High Performance Computer Facility of the Cyprus
Institute ([52]https://hpcf.cyi.ac.cy). PathExNET further uses parallel
processing scripts to handle and pre-process large file sizes that make
the use of the “doParallel” R package [53][4].
2.2. The pathway reference network repository
The pathway-to-pathway network information draws from a web-service
that holds a large database of reference pathway networks, which have
been developed through the freely available information included in the
KEGG [54][41], Reactome [55][11] and Wiki Pathways [56][25] database
repositories. Herein, the functional relation between two pathways that
forms an edge in a network, is considered when a specific pathway
involves or is being involved in to another pathway accordingly. In
effect, this type of information which is mainly obtained from the
available XML maps of the above mentioned repositories, can form an
undirected-unweighted pathway-to-pathway network. In this line of
thought, a large number of pathway XML maps were obtained for all the
organisms included in the three above mentioned repositories, and all
the available data related to the functional connections that exist
between all the available pathways were retrieved. Specifically, we
obtained 177 organisms from KEGG, 16 from Reactome and 38 from Wiki
Pathways repositories, accordingly. The output of this data mining
process was further used to construct in total 231 undirected pathway
reference networks, stored in a data repository. Further information
rooted in these reference pathway-to-pathway networks, involves the
number of common genes between two pathways that forms the edge-weight
of these networks, and the number of total genes included in a pathway
that forms the node size. The underlying networks are regularly
updated, constructing the main pathway repository for the services and
methodologies that PathExNET draws from. It should be noticed that an
initial version of this reference network repository supporting only 16
organisms from KEGG and Reactome repositories, has been recently used
in PathwayConnector [57][34], [58][35], with noteworthy results to
pathways related to Alzheimer’s Disease (AD) [59][53], to Huntington’s
disease (HD) and Spastic Ataxia (SA) [60][23], as well as to a recent
study on Breast Cancer [61][16].
2.3. The expression Atlas searching and importing tool
The Expression Atlas (EA) is a database repository that provides
information about gene and protein expression in different species and
contexts, namely: tissue, developmental stage, and disease or cell
type. The EA web service is hosted at the European Bioinformatics
Institute (EMBL-EBI) ([62]https://www.ebi.ac.uk). EA holds a large set
of publicly available and controlled access datasets that at the time
of writing derive from over 4,000 studies across 65 different species,
including over 900 studies from plants. These datasets have been
curated and re-analysed using standardized, open source pipelines and
have been made available along with the analyses data for queries,
download and visualization [63][40]. EA incorporates baseline
expression profiles of tissues from Human Protein Atlas, GTEx and
FANTOM5, as well as of cancer cell lines from ENCODE, CCLE and
Genentech projects. Through the last update EA incorporates data from
large-scale RNA sequencing studies including Blueprint, PCAWG, ENCODE,
GTEx and HipSci. In order to provide a concrete set of well-evaluated
DEA data files, a productive collaboration with EA team led to the
development of an additional search tool, rooted in PathExNET
framework. The underlying tool allows users to search and directly
import into PathExNET pre-processed DEA files. Users can search by
means of specific EA experiment accession, organism, and experiment
type, or alternatively perform free text keyword search.
2.4. Creating pathway expression networks
There are three input combinations where users can provide to create
PENs: (a) DEA file accompanied with list of pathways of interest, (b)
DEA file accompanied with list of genes of interest, and (c) DEA file
accompanied with a list of pathways and a list of genes of interest. A
significant differentiation in this approach is that PENs use all the
genes included in the experiment, thus the DEA files should be used
unfiltered without performing any specific threshold for reducing their
size. The DEA file should at least include the gene-symbol and the
log-fold-change value for each gene included in the file, while the
p-value field is optional. These parameters should be strictly named
as: “Gene.symbol”, “logFC”, and “P.Value”, accordingly. Pathways and
genes of interest may derive from any type of omics data analysis that
leads to significant pathways and genes accordingly. The proposed
methodology for the creation of PENs reads as follows. For a specific
pathway of interest, our methodology first finds all the genes included
in the pathway. This type of information is obtained from the pathway
reference network repository described in the previous section, which
holds all the genes involved in each pathway. These genes are further
matched with those that derive from the DEA files, where the
[MATH: logFC :MATH]
value is attached for each one of these genes. Synonyms of gene symbols
are also considered in this process in the prospect to reduce the
number genes that may be missed through this type of matching. The next
step of our method involves the assignment of a specific numeric value
to the pathway of interest by means of the following equations:
The sumFC value is obtained by calculating the sum of all the
log-fold-change values included in the specific pathway, as follows:
[MATH: sumFC=∑i=
1Nlog(FC) :MATH]
(1)
where
[MATH: N :MATH]
refers to the total number of genes included in the pathway, and
[MATH: logFC
:MATH]
is the logarithmic representation of the fold-change (FC) value. When
this overall score is above zero, the specific pathway of interest is
mostly considered as an over-expressed pathway. On the contrary for
negative values of
[MATH: sumFC :MATH]
the pathway is mostly considered as an under-expressed pathway. However
this approach may provide biased results since the underlying score
does not take into account the balance between the number of over and
under expressed genes in a sample. For example, four genes with
[MATH: logFC :MATH]
values of −0.2, −0.3, −0.1 and 0.8, would give a
[MATH: sumFC=0.2
:MATH]
, suggesting an over-expressed network, against the fact that the
sample includes more under-expressed genes that over-expressed ones. To
handle with this limitation we proceed with two additional equations
that lead to a combined score. These read as follows:
The rateFC value is the fraction of the number of over-expressed genes
divided by the number of total genes included in the pathway, as
follows:
[MATH: rateFC=#of
over-ex<
mi>pressedgenes#
ofgenes<
/mrow> :MATH]
(2)
For
[MATH: rateFC≥0.5 :MATH]
, the specific pathway of interest is mostly considered as an
over-expressed pathway. On the contrary for
[MATH: rateFC≤0.5 :MATH]
the pathway is considered as an under-expressed pathway.
The normMeanFC value is obtained by calculating the weighted mean of
the normalised histogram of the log-fold-change values. Specifically,
for a given vector of log-fold-change values
[MATH: VLFC=v1,v2,⋯v<
mi>n :MATH]
we first apply a normalisation function in order to restrict the values
within the range of
[MATH: VLFC∈ :MATH]
.
[MATH: Vnorm=(VLFC-min(V_LFC))/(maxVLFC
-min(V_LFC)) :MATH]
(3)
Then the histogram of the normalised vector
[MATH: Vnorm :MATH]
, is calculated using a bin of 0.01, which results to
[MATH: N=100 :MATH]
ranges (
[MATH: x=x1,x2,⋯x<
mn>100 :MATH]
), represented by their frequencies
[MATH: F(xi) :MATH]
, which in turn are used to calculate the weight vector
[MATH: Wxi=FxiN :MATH]
. The weighted mean is then obtained by the following equation:
[MATH: normMeanFC=∑i
=1NWxixi∑i=1NWxi :MATH]
(4)
Eq. [64](4) suggests that for a normal distribution of
[MATH: logFC :MATH]
values, those with a larger weight contribute more to the weighted mean
than those with a smaller weight. In effect a numeric mean
characterisation of a pathway will be based on most frequent values
that exist in the vector. Herein, for
[MATH: normMeanFC≥0.5 :MATH]
, the specific pathway of interest is mostly considered as an
over-expressed pathway. For
[MATH: normMeanFC<0.5,
:MATH]
the pathway is considered as an under-expressed pathway. The underlying
metric aims to slightly fix the ambiguity in the
[MATH: sumFC :MATH]
balancing by estimating the normalised weight of the distribution of
the
[MATH: logFC :MATH]
values included in the sample.
The combinedFC value is a combination of equations [65](2), [66](4) as
follows:
[MATH: combinedFC=rateFC+norm
MeanFC :MATH]
(5)
Typically for
[MATH: combinedFC≥1.0
:MATH]
, the specific pathway of interest is mostly considered as an
over-expressed pathway. For
[MATH: combinedFC<1.0, :MATH]
the pathway is considered as an under-expressed pathway.
In order to estimate a score for the overall pathway expression network
we calculate the overall expression ratio which is defined as the
fraction of:
[MATH: RNET=#of
over-ex<
mi>pressedpathways#ofpa<
mi>thways :MATH]
(6)
For values of
[MATH: RNET≥0.5 :MATH]
, the network is considered as an over-expressed network, while for
values
[MATH: RNET<0.5 :MATH]
, the network is considered as an under-expressed network,
respectively.
Herein, we clarify that there is not an optimal theory that clearly
defines where is the transition line between over- and under-expression
of gene sets. The most widely used methods that use the log-fold-change
value, mainly examine how the
[MATH: logFC :MATH]
value is different from zero, without suggesting any biologically
objective truth [67][33]. Thus the transitions of
[MATH: 0.5 :MATH]
and
[MATH: 1.0 :MATH]
used in the above equations have been arbitrary selected, assuming that
the
[MATH: logFC :MATH]
values included in a gene expression dataset, which has been
transformed and normalised successfully, follow a typical normal
Gaussian-like distribution around zero.
2.5. Performing enrichment analysis
In order to provide an indicative information that shows whether the
selected by the user pathways are also significant in terms of
enrichment analysis, we used the “gprofile2” R package which has been
also suggested as a main analysis tool by the ELIXIR consortium
[68][29]. Specifically, when users provide lists of genes, the tool
performs pathway enrichment analysis by using these genes. The
enrichment score (namely the p-value) obtained for each selected
pathway, is now provided on the visualised networks, especially when
the user puts the mouse cursor over a specific node. To further handle
the large network problem we rooted into the tool the possibility to
select the maximum number of top-scored pathways to be visualised in
the network. The ranking is based on the p-value score obtained from
enrichment analysis of the given gene-set. Herein the limitation we
find in this approach is that insignificant lists of pathways provided
by the user may not be included in the enrichment result. In that case
the enrichment score is simply NA for those pathways.
2.6. Providing gene regulatory information
Another significant issue in studying the expressional behaviour of
pathways is the regulatory information in between the genes included in
a specific pathway. Thus in order to provide such information through
PathExNET framework, we further created a database repository that
includes regulatory information in between genes, obtained from both
the Transcriptional Regulatory Relationships Unraveled by
Sentence-based Text mining (TRRUST) [69][20], and the SIGnaling Network
Open Resource (SIGNOR) [70][31] repositories. The underlying repository
includes regulation information about 95,086 pairs of genes. However,
the specific information, although significant, remains limited since
it is available for only three species: Homo Sapiens, Mus Musculus, and
Rattus Novergicus, accordingly. Depending on the user’s input,
PathExNET automatically examines the genes of interest and further
provides the regulatory information to where is available, in a single
table.
2.7. Exporting network statistics
The mathematical content of complex networks in biological systems has
become a benchmark approach towards identifying biomarkers,
understanding their dynamics, their biological status and related
biological mechanisms involved. In order to provide more statistics
related to the complex nature of the proposed pathway expression
networks, PathExNET further provides additional statistics, for network
manipulation [71][8]. These indicatively include measures of median,
mean and maximum values of: betweeness-centrality, degree distribution,
closeness, and clustering coefficient. This attempt aims to create a
concrete web-framework of analysis adequate to provide a multilevel
information content, sufficient for further investigation and
understanding of pathway networks.
3. Demonstration of PathExNET capabilities
It should be stressed that the concept of PathExNET is not to serve as
enrichment tool but as a post analysis tool that facilitates an
estimation and the subsequent visualization of the collective
over/under expression of the selected pathways’ gene members. As
opposed to traditional gene enrichment tools and methodologies
[72][30], [73][37], [74][45], PathExNET allows users to create pathway
expression networks in order to evaluate specific biological conditions
where pathways or genes of interest are not necessary significant,
namely a high-score result of a ranking methodology. Thus the equations
provided in PathExNET have been designed in a simplified manner in
order to be independent of any gene or pathway rankings. In the
following subsections we present two different case studies in order to
support this argument and to show how same clusters of pathways behave
across different gene expression datasets.
3.1. A case study on SARS-CoV-2 experimental data
A common biological perspective for an effective treatment against
COVID-19 and its causative virus, SARS-CoV-2, is the deciphering of the
involved host pathways, as well as the related transmission and
replication mechanisms [75][6], [76][19], [77][44]. In this line of
thought, our approach here focuses on the examination of the
over-expressed and under-expressed gene content of specific categories
of pathway networks related to COVID-19. On this ground, we are using
PathExNET to analyze a recently introduced high throughput sequencing
expression dataset, related to the transcriptional response of human
lung epithelial cells to SARS-CoV-2 infection [78][3]. The dataset
includes expression profiles of two independent biological triplicates
of: (i) normal human bronchial epithelial (NHBE) cells and (ii)
transformed lung alveolar (A549) cells, which were both mock treated or
infected with SARS-CoV-2 (USA-WA1/2020). The underlined subset has been
analysed by the team of EA, who performed differential expression
analysis, available on EA repository at
[79]https://www.ebi.ac.uk/gxa/experiments/E-GEOD-147507. Herein, we
used the PathExNET EA tool to download the analysis performed by EA,
namely the unfiltered statistics required for the creation of the
pathway expression networks proposed in this work. [80]Fig. 1a depicts
the frequency distribution of the
[MATH: logFC :MATH]
parameter included in these statistics, showing that both samples are
well-distributed around the zero point. In addition, [81]Fig. 1b
depicts the over- and under-expressed information estimated by means of
the
[MATH: logFC :MATH]
parameter. Both samples seem to exhibit the same behaviour between the
over- and under- expressed genes, with estimated ratios
[MATH: RA549=0.46
:MATH]
, and
[MATH: RNHBE=0.49 :MATH]
, accordingly, where the ratio
[MATH: R :MATH]
refers to the fraction of the number of over-expressed genes, to the
total number of genes included in the sample.
Fig. 1.
[82]Fig. 1
[83]Open in a new tab
(a) Depicts the frequency distribution of
[MATH: logFC :MATH]
values obtained from the differential expression analysis of A549 and
NHBE comparisons, (b) depicts the number of over- and under-expressed
genes per comparison, estimated by means of the
[MATH: logFC :MATH]
parameter.
Furthermore, [84]Fig. 2 depicts the Venn diagrams that represent the
overlap between the two samples regarding (a) their over-expressed
genes, (b) their under-expressed genes, and (c) their genes with zero
[MATH: logFC :MATH]
values, respectively. It is observed that both samples share almost the
same amount of common over-expressed and under-expressed genes, which
in effect secures that the obtained pathway expression networks will be
a result of a well-balanced common genes included in these pathways.
Fig. 2.
[85]Fig. 2
[86]Open in a new tab
Venn diagrams representing the datasets overlapping for: (a) their
over-expressed genes, (b) their under-expressed genes, (c) their genes
with zero
[MATH: logFC :MATH]
values, included in the two cell lines under study.
It has been stated that for COVID-19, as well as for all infectious
diseases, the host immune system, is a major component [87][2],
[88][14], towards understanding the host response on the infection. In
this line, 20 pathways related to the immune system, were obtained from
the KEGG pathway repository, in order to be analyzed by means of the
PathExNET methodology. The compiled pathway list, was further used to
create pathway-to-pathway expression networks, in combination with the
differential expression statistics obtained from EA, by means of the
PathExNET methodology proposed in this work. Specifically, [89]Fig. 3a
depicts a pathway expression network that includes the 20 candidate
KEGG pathways, where the node (pathway) values have been estimated by
means of the
[MATH: combinedFC :MATH]
score (see Eq. [90](5)) described in previous section. The
[MATH: logFC :MATH]
values have been obtained from the A549 analysed dataset. The
edge-weights refer to the number of common genes between the two
pathways that form the edge. [91]Fig. 3b depicts the same analysis for
the NHBE differential expression dataset accordingly. As opposed to the
colour scale provided by the web tool, here in order to show the
expression difference in between these two networks, an arbitrary
transition threshold was selected while the colour scale used includes
only two colours. Specifically, the red-coloured pathways refer to the
over-expressed pathways (
[MATH: combinedFC≥1.0
:MATH]
), while the blue ones are the under-expressed ones (
[MATH: combinedFC<1.0
:MATH]
). The overall network expression ratio (see Eq. [92](6)) was found
[MATH: RNET=0.40 :MATH]
for the A549 gene-set, and
[MATH: RNET=0.05 :MATH]
for the NHBE gene-set, suggesting that the first is considered as an
enriched network of pathways with higher content in over-expressed
genes in contrast to the second one. In consequence, the latter
methodology suggests that pathway networks obtained from the given set
of gene expressions, may have a unique identity in terms of the
proposed scores, which in effect may contribute to a unique
characterization of the specific biological condition under study. It
should be stressed that the specific analysis draws from a simple
biological concept that aims to examine the expression status of all
the immune system pathway networks, subjected to SARS-COV-2 infection.
Fig. 3.
[93]Fig. 3
[94]Open in a new tab
(a, b) Pathway expression networks of 20 KEGG pathways related to
immune system. The node values have been estimated by means of the
[MATH: combinedFC :MATH]
parameter. The
[MATH: logFC :MATH]
values have been obtained from the A549 and NHBE datasets. The red
nodes refer to the pathways with high content of over-expressed genes (
[MATH: combinedFC≥1.0
:MATH]
), while the blue nodes refer to the pathways with high content of
under-expressed genes (
[MATH: combinedFC<1.0
:MATH]
). (For interpretation of the references to color in this figure