Abstract

Background

   Handling the vast amount of gene expression data generated by
   genome-wide transcriptional profiling techniques is a challenging task,
   demanding an informed combination of pre-processing, filtering and
   analysis methods if meaningful biological conclusions are to be drawn.
   For example, a range of traditional statistical and computational
   pathway analysis approaches have been used to identify over-represented
   processes in microarray data derived from various disease states.
   However, most of these approaches tend not to exploit the full spectrum
   of gene expression data, or the various relationships and dependencies.
   Previously, we described a pathway enrichment analysis tool created in
   MATLAB that yields a Pathway Regulation Score (PRS) by considering
   signalling pathway topology, and the overrepresentation and magnitude
   of differentially-expressed genes (J Comput Biol 19:563–573, 2012).
   Herein, we extended this approach to include metabolic pathways, and
   described the use of a graphical user interface (GUI).

Results

   Using input from a variety of microarray platforms and species, users
   are able to calculate PRS scores, along with a corresponding z-score
   for comparison. Further pathway significance assessment may be
   performed to increase confidence in the pathways obtained, and users
   can view Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway
   diagrams marked-up to highlight impacted genes.

Conclusions

   The PRS tool provides a filter in the isolation of
   biologically-relevant insights from complex transcriptomic data.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12859-014-0358-2)
   contains supplementary material, which is available to authorized
   users.

   Keywords: ■■■

Background

   Increasingly, high-throughput transcriptional profiling techniques
   (microarrays or, increasingly, RNAseq) inform modern life-science
   research. Such techniques provide a molecular “camera” taking
   genome-wide “snap-shots” of genetic activity. However, the effective
   analysis of microarray data presents a number of challenges, in
   particular handling the large number of genes that are studied
   simultaneously.

   Analysing gene expression in the context of curated knowledge, or
   “knowledge base-driven pathway analysis”, is critical as this guides
   the reduction in search space from many thousands of genes to an subset
   of biological processes, which are much more tractable to human
   interpretation [[31]1]. According to Khatri et al [[32]2], pathway
   enrichment approaches can be divided into three generations:
     * i.
       Over-representation Analysis (ORA): This scores a pathway by
       considering the proportion of differentially-expressed genes (DEGs)
       observed in each pathway relative to the proportion of all
       microarray DEGs. This is used by several pathway analysis tools,
       including GenMAPP [[33]3], GoMiner [[34]4], Onto-Express [[35]5]
       and FatiGo [[36]6].
     * ii.
       Functional Class Scoring (FCS): FCS gives a score to each gene in a
       pathway based on its expression, from which a pathway-score is
       calculated based on the scores of all the genes in the pathway. A
       number of FCS methods have been implemented through standalone
       tools such as GSEA [[37]7], SigPathway [[38]8], and SAFE [[39]9],
       or web tools such as T-profiler [[40]10], Gazer [[41]11] and
       GeneTrail [[42]12].
     * iii.
       Pathway Topology (PT)-based approaches: These approaches exploit
       the topology of pathways by giving weights to pre-defined
       connections between genes, which inform pathway scoring. Several
       topology-based approaches have been described in the literature
       over the past few years. According to Mitrea et al [[43]13],
       PT-based approaches differ in the way they translate pathway
       topology information into a pathway score. Some methods use only
       the topology data of differentially-expressed genes (DEGs) in the
       enrichment score (for example MetaCore [[44]14] and EnrichNet
       [[45]15]), whereas others (including SPIA [[46]16] and GANPA
       [[47]17]) use expression data of DEGs along with the topology data.
       Alternatively, some methods use expression data derived from all
       microarray genes, whether they change between conditions or not,
       for example PathOlogist [[48]18], DEGraph [[49]19], and ACST
       [[50]20]. Importantly, some PT-based tools use only signalling
       pathway descriptions, such as Pathway-Express [[51]21], NetGSA
       [[52]22], ScorePAGE [[53]23], TAPPA [[54]24] MetPA [[55]25], and
       Clipper [[56]26].

   Previously, we proposed a new pathway enrichment method, in which both
   pathway topology and the magnitude of gene expression changes informed
   the creation of a Pathway Regulation Score (PRS) [[57]27].
   Specifically, by combining fold-change data for those transcripts
   exceeding a significance threshold, and by taking into account the
   potential of altered gene expression to impact upon downstream
   transcription, we identified those pathways most relevant to the
   pathophysiological process under investigation. Our approach addressed
   a number of issues that potentially compromise enrichment methods. We
   took steps to mitigate the influence of errors in ID mapping, and to
   reduce the bias introduced by highly-redundant pathways (i.e. multiple
   instances of the same gene). Topology methods also have to handle loops
   effectively, so we used a search algorithm derived from graph theory to
   resolve this problem. We also felt that arbitrarily dividing processes
   into either up- or down-regulated was artificial as changes in gene
   expression are likely to be distributed throughout pathways, thus ours
   was an overall impact assessment.

   Herein, we described the implementation of our PRS approach as a
   standalone tool that provides end users with the option of importing
   data from different microarray platforms and species. The tool yields
   both PRS and z-scores, provides statistical analysis, and allows
   browsing of pathways with impacted genes highlighted in different
   colours. An enhancement from our original report is that users are able
   to enrich both signalling and metabolic pathways.

Implementation

   The PRS approach was implemented in MATLAB. Users without access to the
   MATLAB environment can down-load the MATLAB Runtime Compiler (MRC) in
   order to deploy the software described herein, via a user-friendly GUI.
   The PRS interface (Figure [58]1) provides users with several functions:

Figure 1.

   Figure 1
   [59]Open in a new tab

   The PRS user interface showing analysis of a sample dataset.

Preprocessing microarray data

   We did not re-engineer a filter to normalise data from a variety of
   platforms, rather users must first preprocess transcriptomic data using
   one of the myriad existing tools. Data must be in the form of a simple
   Excel spreadsheet, in which the first column should be probe ID, and
   the following columns normalised replicated expression values from the
   control and test conditions. Additional information regarding species,
   sample numbers, fold-change and t-test thresholds, normalisation method
   and platform is required.

Pathway representation

   Our fundamental algorithm was described previously [[60]27]. Briefly,
   Kyoto Encyclopaedia of Genes and Genomes pathway definitions [[61]28]
   were used, in which pathways are maintained in KEGG Mark-up Language
   (KGML) format. We imported a total of 189 signalling and metabolic
   descriptions from KEGG and parsed these into MATLAB objects, which were
   then converted into directed graphs. KGML files contain three types of
   objects: entries, relations, and reactions. These can be mapped to
   graphical objects in the associated pathway map (Additional file
   [62]1). Only entries (which form nodes, represented as boxes) and
   relations (represented as edges) were used to represent signalling
   pathways where proteins (boxes) are linked by “relations”. All three
   types are used to represent the structure of metabolic pathways in
   order to capture substrate-enzyme-product relationships where enzymes
   (boxes) are linked by “relations”, and compounds (circles) are linked
   by “reactions”. To convert a metabolic pathway into a graph in a
   rational way, we represented enzymes as nodes in the graph, while
   substrates and products were used to detect the direction of relations
   (edges) between nodes (Figure [63]2). While we acknowledge that is not
   possible to predict any effect on flux by this rationale, we reasoned
   that any change in node expression in a metabolic pathway could be of
   physiological relevance, particularly if nodes were connected.

Figure 2.

   Figure 2
   [64]Open in a new tab

   Example of the conversion of a group of reactions in a metabolic
   pathway (a) into a diagraph (b) after removing redundancy.

   Representing pathways as graphs had an additional advantage as it
   reduced redundancy in that genes were only represented once in any
   pathway graph. A Depth-First Search (DFS) algorithm, derived from Graph
   Theory was used to ensure that loops were only counted once.

Pathway scoring

   Our method assigned weights to all significant nodes (i.e. DEGs) in a
   pathway to reflect their topological strength (specifically the number
   of significant downstream nodes that are pointed to, either directly or
   via other significant nodes as described previously [[65]27]). A PRS
   was calculated on the basis of fold-change value and weighting of all
   significant nodes in the pathway and normalized for pathway size. We
   also calculated a z-score [[66]29] (with an improvement over earlier
   implementations in that this was performed after removing redundant
   genes from pathway descriptions). The software outputs two lists of
   pathways ranked according to PRS and z-score, saved as both Excel and
   .mat files for later analysis.

Pathway significance assessment

   We then went on to establish the probability of achieving scores at
   least as high as the PRS score by chance using a non-parametric
   permutation method. Initially, fold-change values for all expressed
   microarray genes were permuted. These values were then mapped back onto
   pathways, and a PRS recalculated. This process was repeated n times,
   where n is provided by the user through the interface (typically
   n = 1000). The statistical significance (p-value) of each pathway score
   was estimated by a comparison between the observed score and the n
   random scores generated. To achieve more reliable statistical
   significance evaluation, p-values were adjusted for multiple-test
   correction by a False Discovery Rate (FDR) method based on a threshold
   provided by the user. This is described in more detail in our original
   report [[67]27].

Visualizing enriched pathways

   After running the analysis, results are saved as .mat format files for
   ease of retrieval. By clicking on the pathway name from the list of
   ranked pathways shown in the table and selecting the option of
   visualizing a pathway from the interface, a marked-up pathway map will
   be displayed. Technically, the software will call a pathway mapping web
   service (REST-based API service) hosted on the KEGG website and pass a
   number of parameters, including a list of all expressed genes with
   their fold- changes and specified colours to differentiate DEGs from
   non-impacted genes. Figure [68]3 shows a typical pathway map where
   significant (i.e. above threshold) genes are coloured in red and
   non-significant (i.e. unchanged or not expressed) in green.

Figure 3.

   Figure 3
   [69]Open in a new tab

   A typical marked-up pathway, in this case the KEGG “acute myeloid
   leukaemia pathway” enriched in an AML dataset (GEO accession
   #[70]GSE9476); significant genes are coloured in red and
   non-significant ones in green.

UML for modelling and software description

   Herein, we used Unified Modelling Language (UML) to describe, model and
   visualize the structure and functions of our method by diagrams. There
   are 14 types of diagrams classified in three categories in UML 2.0
   [[71]30], however, in this paper we used only two: class and sequence
   diagrams. Class diagrams represent static structures or main objects in
   the software. Figure [72]4 shows the key classes at the pathway
   analysis stage. The class “Analysis” is the main class, which provides
   an interface to run all the services provided by the tool. It has four
   main attributes:
     * ▪
       MicroarrayObject: an object of the class “Microarray_Dataset” built
       by calling initialiseMicroarray() function (see Additional file
       [73]2). This holds the normalised gene expression data, and a list
       of all genes with their fold-change values.
     * ▪
       kgmlObject: an object of the class “KGML_Parser” built by calling
       the parseKGML() function (see Additional file [74]3). This holds
       the static structure of all pathways as a list of objects of
       “KGML_Path” class that is defined by KGML format. An object of
       “KGML_Path” represents the structure of one KEGG pathway and is
       composed of entriesList, reactionsList, and relationsList (see
       Additional file [75]1).
     * ▪
       PathList: this is a list of objects of the class “Pathway” which is
       created by calling CreatePathListFromKegg() function (see
       Additional file [76]4). This object ultimately holds a list of
       pathways enriched with reference to a given microarray dataset.
     * ▪
       rankedPaths: this object is created by calling the rankPaths()
       function. It holds the same list of pathways defined by PathsList,
       but they are ranked in descending order based on PRS values.

Figure 4.

   Figure 4
   [77]Open in a new tab

   UML class diagram illustrating the main classes of the package at the
   pathway analysis stage.

   Sequence diagrams were used to represent the functions of the PRS tool
   according to different types of interactions between objects. As an
   example, Figure [78]5 represents the main PRS functions with the
   following steps:
     * i.
       Conversion of pathways into graphs by the convertPath2Graph()
       function, which requires the usage of kgmlObject that holds a list
       of entries, relations and reactions of all pathways.
     * ii.
       Using information stored in kgmlObject and PathsList for each graph
       (see Figure [79]4), a list of nodes is created (where each node
       represents one or more genes from the original pathway) and a list
       of children for each node.
     * iii.
       Removal of redundant genes, which may be represented many times in
       the same pathway. Two functions are designed to deal with node
       redundancy: checkNodeRedundancy() and handleNodeRedundancy().
     * iv.
       After building a graph for each pathway, graphs are weighted by
       calling the createWeightedGraphs() function, which uses the DFS
       algorithm to traverse the nodes of each graph and assign a weight
       for each significant node taking into account the loops in the
       graph.
     * v.
       A pathway regulation score (PRS) is assigned to each weighted graph
       using the weights of the significant nodes in the graph and other
       parameters.

Figure 5.

   Figure 5
   [80]Open in a new tab

   UML sequence diagram illustrating PRS calculation and pathway ranking.

   We implemented all these classes, functions, and DFS algorithm using
   MATLAB R2010a.

Results and discussion

   The objective evaluation of novel enrichment analysis methods is
   difficult, relying on their ability to discern biological processes
   already known to be perturbed in disease states. We and others
   previously attempted this by studying performance across a range of
   datasets derived from distinct conditions ([[81]27] and references