Abstract Background Handling the vast amount of gene expression data generated by genome-wide transcriptional profiling techniques is a challenging task, demanding an informed combination of pre-processing, filtering and analysis methods if meaningful biological conclusions are to be drawn. For example, a range of traditional statistical and computational pathway analysis approaches have been used to identify over-represented processes in microarray data derived from various disease states. However, most of these approaches tend not to exploit the full spectrum of gene expression data, or the various relationships and dependencies. Previously, we described a pathway enrichment analysis tool created in MATLAB that yields a Pathway Regulation Score (PRS) by considering signalling pathway topology, and the overrepresentation and magnitude of differentially-expressed genes (J Comput Biol 19:563–573, 2012). Herein, we extended this approach to include metabolic pathways, and described the use of a graphical user interface (GUI). Results Using input from a variety of microarray platforms and species, users are able to calculate PRS scores, along with a corresponding z-score for comparison. Further pathway significance assessment may be performed to increase confidence in the pathways obtained, and users can view Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway diagrams marked-up to highlight impacted genes. Conclusions The PRS tool provides a filter in the isolation of biologically-relevant insights from complex transcriptomic data. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0358-2) contains supplementary material, which is available to authorized users. Keywords: ■■■ Background Increasingly, high-throughput transcriptional profiling techniques (microarrays or, increasingly, RNAseq) inform modern life-science research. Such techniques provide a molecular “camera” taking genome-wide “snap-shots” of genetic activity. However, the effective analysis of microarray data presents a number of challenges, in particular handling the large number of genes that are studied simultaneously. Analysing gene expression in the context of curated knowledge, or “knowledge base-driven pathway analysis”, is critical as this guides the reduction in search space from many thousands of genes to an subset of biological processes, which are much more tractable to human interpretation [[31]1]. According to Khatri et al [[32]2], pathway enrichment approaches can be divided into three generations: * i. Over-representation Analysis (ORA): This scores a pathway by considering the proportion of differentially-expressed genes (DEGs) observed in each pathway relative to the proportion of all microarray DEGs. This is used by several pathway analysis tools, including GenMAPP [[33]3], GoMiner [[34]4], Onto-Express [[35]5] and FatiGo [[36]6]. * ii. Functional Class Scoring (FCS): FCS gives a score to each gene in a pathway based on its expression, from which a pathway-score is calculated based on the scores of all the genes in the pathway. A number of FCS methods have been implemented through standalone tools such as GSEA [[37]7], SigPathway [[38]8], and SAFE [[39]9], or web tools such as T-profiler [[40]10], Gazer [[41]11] and GeneTrail [[42]12]. * iii. Pathway Topology (PT)-based approaches: These approaches exploit the topology of pathways by giving weights to pre-defined connections between genes, which inform pathway scoring. Several topology-based approaches have been described in the literature over the past few years. According to Mitrea et al [[43]13], PT-based approaches differ in the way they translate pathway topology information into a pathway score. Some methods use only the topology data of differentially-expressed genes (DEGs) in the enrichment score (for example MetaCore [[44]14] and EnrichNet [[45]15]), whereas others (including SPIA [[46]16] and GANPA [[47]17]) use expression data of DEGs along with the topology data. Alternatively, some methods use expression data derived from all microarray genes, whether they change between conditions or not, for example PathOlogist [[48]18], DEGraph [[49]19], and ACST [[50]20]. Importantly, some PT-based tools use only signalling pathway descriptions, such as Pathway-Express [[51]21], NetGSA [[52]22], ScorePAGE [[53]23], TAPPA [[54]24] MetPA [[55]25], and Clipper [[56]26]. Previously, we proposed a new pathway enrichment method, in which both pathway topology and the magnitude of gene expression changes informed the creation of a Pathway Regulation Score (PRS) [[57]27]. Specifically, by combining fold-change data for those transcripts exceeding a significance threshold, and by taking into account the potential of altered gene expression to impact upon downstream transcription, we identified those pathways most relevant to the pathophysiological process under investigation. Our approach addressed a number of issues that potentially compromise enrichment methods. We took steps to mitigate the influence of errors in ID mapping, and to reduce the bias introduced by highly-redundant pathways (i.e. multiple instances of the same gene). Topology methods also have to handle loops effectively, so we used a search algorithm derived from graph theory to resolve this problem. We also felt that arbitrarily dividing processes into either up- or down-regulated was artificial as changes in gene expression are likely to be distributed throughout pathways, thus ours was an overall impact assessment. Herein, we described the implementation of our PRS approach as a standalone tool that provides end users with the option of importing data from different microarray platforms and species. The tool yields both PRS and z-scores, provides statistical analysis, and allows browsing of pathways with impacted genes highlighted in different colours. An enhancement from our original report is that users are able to enrich both signalling and metabolic pathways. Implementation The PRS approach was implemented in MATLAB. Users without access to the MATLAB environment can down-load the MATLAB Runtime Compiler (MRC) in order to deploy the software described herein, via a user-friendly GUI. The PRS interface (Figure [58]1) provides users with several functions: Figure 1. Figure 1 [59]Open in a new tab The PRS user interface showing analysis of a sample dataset. Preprocessing microarray data We did not re-engineer a filter to normalise data from a variety of platforms, rather users must first preprocess transcriptomic data using one of the myriad existing tools. Data must be in the form of a simple Excel spreadsheet, in which the first column should be probe ID, and the following columns normalised replicated expression values from the control and test conditions. Additional information regarding species, sample numbers, fold-change and t-test thresholds, normalisation method and platform is required. Pathway representation Our fundamental algorithm was described previously [[60]27]. Briefly, Kyoto Encyclopaedia of Genes and Genomes pathway definitions [[61]28] were used, in which pathways are maintained in KEGG Mark-up Language (KGML) format. We imported a total of 189 signalling and metabolic descriptions from KEGG and parsed these into MATLAB objects, which were then converted into directed graphs. KGML files contain three types of objects: entries, relations, and reactions. These can be mapped to graphical objects in the associated pathway map (Additional file [62]1). Only entries (which form nodes, represented as boxes) and relations (represented as edges) were used to represent signalling pathways where proteins (boxes) are linked by “relations”. All three types are used to represent the structure of metabolic pathways in order to capture substrate-enzyme-product relationships where enzymes (boxes) are linked by “relations”, and compounds (circles) are linked by “reactions”. To convert a metabolic pathway into a graph in a rational way, we represented enzymes as nodes in the graph, while substrates and products were used to detect the direction of relations (edges) between nodes (Figure [63]2). While we acknowledge that is not possible to predict any effect on flux by this rationale, we reasoned that any change in node expression in a metabolic pathway could be of physiological relevance, particularly if nodes were connected. Figure 2. Figure 2 [64]Open in a new tab Example of the conversion of a group of reactions in a metabolic pathway (a) into a diagraph (b) after removing redundancy. Representing pathways as graphs had an additional advantage as it reduced redundancy in that genes were only represented once in any pathway graph. A Depth-First Search (DFS) algorithm, derived from Graph Theory was used to ensure that loops were only counted once. Pathway scoring Our method assigned weights to all significant nodes (i.e. DEGs) in a pathway to reflect their topological strength (specifically the number of significant downstream nodes that are pointed to, either directly or via other significant nodes as described previously [[65]27]). A PRS was calculated on the basis of fold-change value and weighting of all significant nodes in the pathway and normalized for pathway size. We also calculated a z-score [[66]29] (with an improvement over earlier implementations in that this was performed after removing redundant genes from pathway descriptions). The software outputs two lists of pathways ranked according to PRS and z-score, saved as both Excel and .mat files for later analysis. Pathway significance assessment We then went on to establish the probability of achieving scores at least as high as the PRS score by chance using a non-parametric permutation method. Initially, fold-change values for all expressed microarray genes were permuted. These values were then mapped back onto pathways, and a PRS recalculated. This process was repeated n times, where n is provided by the user through the interface (typically n = 1000). The statistical significance (p-value) of each pathway score was estimated by a comparison between the observed score and the n random scores generated. To achieve more reliable statistical significance evaluation, p-values were adjusted for multiple-test correction by a False Discovery Rate (FDR) method based on a threshold provided by the user. This is described in more detail in our original report [[67]27]. Visualizing enriched pathways After running the analysis, results are saved as .mat format files for ease of retrieval. By clicking on the pathway name from the list of ranked pathways shown in the table and selecting the option of visualizing a pathway from the interface, a marked-up pathway map will be displayed. Technically, the software will call a pathway mapping web service (REST-based API service) hosted on the KEGG website and pass a number of parameters, including a list of all expressed genes with their fold- changes and specified colours to differentiate DEGs from non-impacted genes. Figure [68]3 shows a typical pathway map where significant (i.e. above threshold) genes are coloured in red and non-significant (i.e. unchanged or not expressed) in green. Figure 3. Figure 3 [69]Open in a new tab A typical marked-up pathway, in this case the KEGG “acute myeloid leukaemia pathway” enriched in an AML dataset (GEO accession #[70]GSE9476); significant genes are coloured in red and non-significant ones in green. UML for modelling and software description Herein, we used Unified Modelling Language (UML) to describe, model and visualize the structure and functions of our method by diagrams. There are 14 types of diagrams classified in three categories in UML 2.0 [[71]30], however, in this paper we used only two: class and sequence diagrams. Class diagrams represent static structures or main objects in the software. Figure [72]4 shows the key classes at the pathway analysis stage. The class “Analysis” is the main class, which provides an interface to run all the services provided by the tool. It has four main attributes: * ▪ MicroarrayObject: an object of the class “Microarray_Dataset” built by calling initialiseMicroarray() function (see Additional file [73]2). This holds the normalised gene expression data, and a list of all genes with their fold-change values. * ▪ kgmlObject: an object of the class “KGML_Parser” built by calling the parseKGML() function (see Additional file [74]3). This holds the static structure of all pathways as a list of objects of “KGML_Path” class that is defined by KGML format. An object of “KGML_Path” represents the structure of one KEGG pathway and is composed of entriesList, reactionsList, and relationsList (see Additional file [75]1). * ▪ PathList: this is a list of objects of the class “Pathway” which is created by calling CreatePathListFromKegg() function (see Additional file [76]4). This object ultimately holds a list of pathways enriched with reference to a given microarray dataset. * ▪ rankedPaths: this object is created by calling the rankPaths() function. It holds the same list of pathways defined by PathsList, but they are ranked in descending order based on PRS values. Figure 4. Figure 4 [77]Open in a new tab UML class diagram illustrating the main classes of the package at the pathway analysis stage. Sequence diagrams were used to represent the functions of the PRS tool according to different types of interactions between objects. As an example, Figure [78]5 represents the main PRS functions with the following steps: * i. Conversion of pathways into graphs by the convertPath2Graph() function, which requires the usage of kgmlObject that holds a list of entries, relations and reactions of all pathways. * ii. Using information stored in kgmlObject and PathsList for each graph (see Figure [79]4), a list of nodes is created (where each node represents one or more genes from the original pathway) and a list of children for each node. * iii. Removal of redundant genes, which may be represented many times in the same pathway. Two functions are designed to deal with node redundancy: checkNodeRedundancy() and handleNodeRedundancy(). * iv. After building a graph for each pathway, graphs are weighted by calling the createWeightedGraphs() function, which uses the DFS algorithm to traverse the nodes of each graph and assign a weight for each significant node taking into account the loops in the graph. * v. A pathway regulation score (PRS) is assigned to each weighted graph using the weights of the significant nodes in the graph and other parameters. Figure 5. Figure 5 [80]Open in a new tab UML sequence diagram illustrating PRS calculation and pathway ranking. We implemented all these classes, functions, and DFS algorithm using MATLAB R2010a. Results and discussion The objective evaluation of novel enrichment analysis methods is difficult, relying on their ability to discern biological processes already known to be perturbed in disease states. We and others previously attempted this by studying performance across a range of datasets derived from distinct conditions ([[81]27] and references