Abstract

   Although pathways are widely used for the analysis and representation
   of biological systems, their lack of clear boundaries, their dispersion
   across numerous databases, and the lack of interoperability impede the
   evaluation of the coverage, agreements, and discrepancies between them.
   Here, we present ComPath, an ecosystem that supports curation of
   pathway mappings between databases and fosters the exploration of
   pathway knowledge through several novel visualizations. We have curated
   mappings between three of the major pathway databases and present a
   case study focusing on Parkinson’s disease that illustrates how ComPath
   can generate new biological insights by identifying pathway modules,
   clusters, and cross-talks with these mappings. The ComPath source code
   and resources are available at [33]https://github.com/ComPath and the
   web application can be accessed at
   [34]https://compath.scai.fraunhofer.de/.

Introduction

   The notion of pathways enables the representation, formalization, and
   interpretation of biological events or series of interactions.
   Cataloging biological knowledge into pathways reduces complexity from
   all possible interacting molecular entities to a set of well-studied
   and validated functional relationships between molecular entities
   culminating in biological processes. Several efforts have generated
   databases of pathways with varying specificity and granularity that
   comprise signaling cascades, metabolic routes, and regulatory networks
   from precise signatures with no more than a couple of acting players to
   general pathways involving thousands of molecular players.^[35]1–[36]4

   Simplifying biology into pathways and representing them as network or
   mathematical models inevitably results in a loss of information, such
   as spatiotemporal information or even entire biological entity types.
   The network abstraction facilitates pathway visualization and
   interpretation thanks to the harmony between biological networks and
   systems: nodes correspond to molecular entities and edges to types of
   interactions occurring between them (e.g., inhibition and
   phosphorylation). Although networks can comprise a broad range of
   molecular types (e.g., proteins, chemicals, and small molecules), they
   are generally reduced to the most direct outcome of our genetic makeup
   (the genetic and protein levels) so that we can mechanistically
   understand their functionality. Thus, they are frequently simplified to
   “gene sets”, the collection of all genes/proteins that constitute the
   pathway, due to the major challenges of incorporating network topology
   and translating the variety of relationships into pathway analysis
   methods.

   While dedicated research groups and commercial entities with
   experienced curators have led a majority of the efforts to compile,
   delineate, and store biological knowledge into pathway
   databases,^[37]2,[38]5 community and crowdsourced efforts have recently
   gained traction.^[39]3,[40]6 Further, the variability in curation team
   composition, database scope (e.g., signaling pathways, gene regulatory
   networks, and metabolic processes), and curation guidelines led to the
   adoption of different (and in many ways incompatible) schemata and
   formalisms such as Biological Pathway Exchange (BioPAX;^[41]7) and
   Systems Biology Markup Language (SBML;^[42]8). These incompatibilities
   motivated the integration and harmonization of resources into pathway
   meta-databases such as Pathway Commons^[43]9 and PathCards,^[44]10
   which focus on integrating databases; iPath,^[45]11 which focuses on
   pathway visualization; and SIGNOR, which focuses on signaling
   pathways.^[46]12

   Even after integrating multiple pathway databases into a pathway
   meta-database, it is difficult to assess the agreements, discrepancies,
   redundancy, and complementarity of their contents because the original
   databases do not provide pathway mappings (e.g., pathway A from
   resource X is equivalent to pathway B from resource Y). These mappings
   are difficult to establish because of the
   arbitrary and overlapping nature of pathway boundaries as well as the
   absence of a common pathway nomenclature. Several controlled
   vocabularies have been generated as initial attempts to standardize
   pathway nomenclature,^[47]13,[48]14 but most pathway databases had
   already been established by the time these ontologies were published.
   Therefore, consolidating pathway knowledge remains an open issue, and
   pathways from different resources still need to be mapped to one
   another to improve database interoperability.

   Hierarchical clustering approaches have been presented as a way of
   grouping similar pathways based on their corresponding gene sets in
   order to propose pathway mappings.^[49]10,[50]15 Though these
   approaches can systematically cluster pathways from multiple resources,
   there are some limitations to consider: first, the usual tradeoff
   between over- and under-clustering,^[51]16 and second, the clustering
   algorithm considers neither pathway nomenclature nor biological
   context, so it often misses equivalent pathways with low similarity and
   ignores the context of the pathway (e.g., cell/disease specificity).
   Nevertheless, these limitations can be overcome by following clustering
   and prioritization methods with the manual curation required to
   interpret the abstract concepts that are inherent to pathway
   definitions (e.g., biological process, cellular location, and
   condition).

   Though numerous algorithms^[52]17 and tools^[53]4,[54]18 have been
   successfully applied to interpret experimental data through the context
   of pathway databases,^[55]19,[56]20 there has not yet been a systematic
   comparison between the contents of various pathway databases, an
   assessment of their overlaps and gaps, or an establishment of mappings.
   Previous studies have only focused on comparing a single or small set
   of well-established pathways across multiple resources.^[57]21,[58]22
   For example, a comparison focused on metabolic pathways revealed that
   five databases agreed only on a minimal core of biochemical
   knowledge.^[59]23

   These studies demonstrate the need to connect insights provided by each
   pathway database to foster a greater understanding of the underlying
   biology. Here, we present ComPath, a web application that integrates
   content from publicly accessible pathway databases, generates
   comparisons, enables exploration, and facilitates curation of
   inter-database mappings.

Results

   We developed an interactive web application that enables users to
   explore, analyze, and curate pathway knowledge. Below, we present three
   case studies illustrating how it can be used for each of these
   purposes. The figures for each were generated by interactive, dynamic
   views in the ComPath web application based on three major public
   pathway databases: KEGG, Reactome, and WikiPathways (Fig. [60]1).

Fig. 1.

   The ComPath ecosystem has three main components: the pathway database
   plugins, the ComPath framework, and the ComPath web application. The
   ComPath framework mediates the communication between the plugins
   containing the pathway database information and the web application.

Case study I: comparison of pathway databases

Assessment of gene coverage

   Analysis of the overlaps between Kyoto Encyclopedia of Genes and
   Genomes (KEGG), Reactome, and WikiPathways revealed that there are
   ~3800 common human genes shared between the three databases (Fig.
   [62]2a). While at least one common human gene was present in almost
   every pathway across each database, the number of pathways with more
   common human genes diminishes much more quickly in WikiPathways and
   Reactome (Supplementary Figure [63]S1). This may be due to database
   properties such as pathway size (e.g., on average, pathways contain 90
   genes in KEGG, 50 in Reactome, and 42 in WikiPathways) or gene
   promiscuity (i.e., genes functionally linked to many pathways) that
   might influence the results of analyses using pathway resources
   (Supplementary Table [64]2). For further investigation, the ComPath web
   application generates summary tables and several visualizations for
   exploring the distributions of pathway size and gene membership in each
   database. These views present an overview of the database properties
   and help identify effects such as gene promiscuity or differences in
   the distribution of gene set sizes (Fig. [65]2b).
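
   The two database properties discussed above, pathway size and gene
   promiscuity, can be derived directly from gene sets. The following is a
   minimal sketch and not the ComPath implementation: the pathway names,
   gene memberships, and function names are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Toy gene sets; pathway names and memberships are invented for illustration.
pathways = {
    "db_a:apoptosis": {"CASP3", "CASP9", "CYCS", "APAF1"},
    "db_a:mitophagy": {"PINK1", "PRKN", "SNCA"},
    "db_b:cell_death": {"CASP3", "CASP9", "SNCA"},
}


def pathway_sizes(gene_sets):
    """Map each pathway to the number of genes it contains."""
    return {name: len(genes) for name, genes in gene_sets.items()}


def gene_promiscuity(gene_sets):
    """Count, for each gene, how many pathways it belongs to."""
    counts = Counter()
    for genes in gene_sets.values():
        counts.update(genes)
    return counts


sizes = pathway_sizes(pathways)
promiscuity = gene_promiscuity(pathways)
print("mean pathway size:", mean(sizes.values()))
print("genes in more than one pathway:",
      sorted(g for g, n in promiscuity.items() if n > 1))
```

   Histograms of the values in `sizes` and `promiscuity` would correspond
   to the two kinds of distributions that the summary views are described
   as presenting.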

Fig. 2.

   a An Euler diagram summarizing the human gene-centric coverage of KEGG,
   Reactome, and WikiPathways compared to the universe of all genes from
   HGNC (more details in Supplementary Table 1). b Histogram views present
   gene promiscuity or pathway size distributions. c The pathway
   similarity landscape of KEGG visualized as a heatmap.

Exploration of pathways

   While the previous views produced gene-centric summaries of the
   contents of pathway databases, ComPath also enables the exploration of
   the pathway similarity landscape using Clustergrammer.js.^[68]24 Figure
   [69]2c illustrates how this view can identify clusters of pathways
   based on their similarity and then elucidate the hierarchical
   relationships between the Metabolic pathway, the largest KEGG pathway,
   and other more granular KEGG metabolic pathways (e.g.,
   alpha-Linolenic acid metabolism, Lipoic acid metabolism, and Ether
   lipid metabolism).

Case study II: identification of pathway modules, overlaps, and interplays
using pathway enrichment

   ComPath couples classic pathway enrichment
   analysis^[70]18,[71]25–[72]27 with pathway-centric visualizations to
   identify modules, investigate overlaps, and cluster pathways. This case
   study demonstrates their use to investigate the roles of the pathways
   related to established genetic associations in the context of
   Parkinson's disease (PD).

   Pathway enrichment with Fisher's exact test using a gene panel
   associated with PD reviewed by Brás et al.^[73]28 (the gene set will be
   referenced as PDgset) yielded over 300 pathways containing at least one
   of the panel's genes (Fig. [74]3a). We discarded pathways that
   contained fewer than two genes from PDgset, that were larger than 300
   genes, or that were not statistically significant (false discovery rate
   >5%) after applying multiple hypothesis testing correction with the
   Benjamini–Yekutieli method under dependency.^[75]29
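
   The enrichment and filtering steps above can be sketched in a few
   lines. This is a minimal illustration under stated assumptions, not the
   ComPath code: it assumes a one-sided (over-representation) Fisher's
   exact test, a hypothetical gene universe of 20,000 genes, and that the
   Benjamini–Yekutieli correction is applied across all pathways with at
   least one query gene before the size and hit filters. All function
   names and the toy gene sets are invented.

```python
from math import comb


def fisher_enrichment_p(hits, pathway_size, query_size, universe_size):
    """One-sided Fisher's exact test: P(X >= hits) under the hypergeometric null."""
    most_hits = min(pathway_size, query_size)
    favorable = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, query_size - i)
        for i in range(hits, most_hits + 1)
    )
    return favorable / comb(universe_size, query_size)


def benjamini_yekutieli(p_values):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(query, gene_sets, universe_size, min_hits=2, max_size=300, fdr=0.05):
    """Test pathways sharing a gene with the query, then apply the filters."""
    tested = {n: g for n, g in gene_sets.items() if query & g}
    names = list(tested)
    p_values = [
        fisher_enrichment_p(len(query & tested[n]), len(tested[n]),
                            len(query), universe_size)
        for n in names
    ]
    q_values = benjamini_yekutieli(p_values)
    return {
        n: q for n, q in zip(names, q_values)
        if len(query & tested[n]) >= min_hits
        and len(tested[n]) <= max_size
        and q <= fdr
    }


# Toy input: gene memberships are invented, filler genes are placeholders.
query = {"SNCA", "LRRK2", "PINK1", "PRKN", "ATP13A2"}
gene_sets = {
    "toy:small_hit": {"SNCA", "LRRK2", "PINK1", "PRKN"} | {f"A{i}" for i in range(6)},
    "toy:too_large": {"SNCA", "LRRK2", "PINK1"} | {f"B{i}" for i in range(397)},
    "toy:one_hit": {"ATP13A2"} | {f"C{i}" for i in range(49)},
    "toy:no_hit": {f"D{i}" for i in range(20)},
}
significant = enrich(query, gene_sets, universe_size=20000)
print(sorted(significant))
```

   In practice, library routines such as `scipy.stats.fisher_exact` and
   the `fdr_by` method of `statsmodels.stats.multitest.multipletests`
   could replace the hand-written functions; they are written out here
   only to keep the sketch free of dependencies.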

Fig. 3.

   a Results of pathway enrichment with the PDgset as input in the ComPath
   pathway enrichment wizard. Note that enrichment results might change
   over time because ComPath regularly updates its underlying pathway
   databases. To promote reproducibility, the current version of the
   databases is displayed on the ComPath overview page, and older versions
   can be provided upon request. b The Pathway Network Viewer displays the
   similarity around a selection of pathways. c The Pathway Overlap View
   depicts the overlaps and intersections of pathways enriched from the
   PDgset.

   Three views were used to assist in the interpretation of the remaining
   29 enriched pathways: a pathway network view was used to identify
   pathway modules, a pathway overlap view was used to explore the
   intersections and cross-talks between pathways, and a pathway
   dendrogram view was used for clustering.

   The pathway network view renders a pathway-to-pathway network in which
   nodes represent pathways and weighted edges represent their
   corresponding gene set similarities in a similar fashion to
   PathwayConnector.^[77]30 For the PDgset, this visualization helped us
   to define six different modules (i.e., groups of pathways) by removing
   edges with a weight lower than 0.2 (Fig. [78]3b). The largest module
   (labeled as M[1]) contained pathways related to the processes of
   endocytosis and vesicle transport, both of which are putatively
   disrupted in PD.^[79]31 M[2] comprised pathways related to PTK6
   signaling such as the Reactome pathway, PTK6 promotes HIF1A
   stabilization, whose high pathway enrichment significance
   (q-value = 0.0005), as well as its role in regulating another PDgset
   gene, ATP13A2,^[80]32 suggests that it may be linked to PD. ATP13A2 is
   directly responsible for Kufor-Rakeb syndrome,^[81]33 a rare juvenile
   form of PD, and participates in two other PD mechanisms: lysosomal iron
   storage and mitochondrial stress. Because pathways related to these two
   mechanisms (i.e., Lysosome pathway from KEGG, Pink/Parkin Mediated
   mitophagy from Reactome, and Mitophagy pathway from both KEGG and
   Reactome; M[4]) were also enriched by pathway enrichment analysis, we
   investigated the role of ATP13A2 in PD further.

   ATP13A2 is activated by phosphatidylinositol(3,5)bisphosphate, a
   particular phosphatidylinositol involved in M[3] pathways
   (phosphatidylinositol metabolism and signaling pathways). Because this
   activation leads to a reduction in mitochondrial stress and α-synuclein
   toxicity, two hallmarks of PD, ATP13A2 has been proposed as a
   therapeutic target.^[82]34 The exploration of the similarities and
   cross-talks between these three modules thus suggests further
   investigation of the candidate PD gene ATP13A2. Ultimately, this view
   complements pathway enrichment in the identification of pathway
   modules, exploration of pathway cross-talks, and prioritization of
   genes for further study.
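
   The module definition used with the pathway network view (dropping
   edges with a weight lower than 0.2 and grouping the pathways that
   remain connected) can be sketched as follows. The similarity measure is
   not specified in this section, so the sketch assumes the
   Szymkiewicz–Simpson overlap coefficient; the pathway names, gene
   memberships, and function names are invented for illustration.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson coefficient: |A & B| / min(|A|, |B|) (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def pathway_modules(gene_sets, min_weight=0.2):
    """Connected components of the pathway network after dropping weak edges."""
    neighbors = {name: set() for name in gene_sets}
    for left, right in combinations(gene_sets, 2):
        if overlap_coefficient(gene_sets[left], gene_sets[right]) >= min_weight:
            neighbors[left].add(right)
            neighbors[right].add(left)

    modules, seen = [], set()
    for start in gene_sets:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        modules.append(component)
    return modules


# Toy gene sets with placeholder gene symbols.
toy = {
    "toy:endocytosis": {"A", "B", "C", "D", "E", "F"},
    "toy:vesicle_transport": {"D", "E", "F", "G", "H", "I"},
    "toy:lysosome": {"A", "L1", "L2", "L3", "L4", "L5"},
    "toy:mitophagy": {"U", "V", "W", "X", "Y", "Z"},
    "toy:pink_parkin": {"X", "Y", "Z"},
}
modules = pathway_modules(toy)
for module in modules:
    print(sorted(module))
```

   In this toy input, `toy:lysosome` shares one of six genes with
   `toy:endocytosis` (weight below 0.2), so it ends up as a singleton,
   which corresponds to a pathway without module membership.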

   While the pathway network viewer provides an overview of the different
   modules and their cross-talks, it does not reveal information about
   their contained pathways' boundaries and intersections. Therefore, we
   implemented the pathway overlap view, an interactive Euler diagram that
   allows exploration of pathway demarcations (Fig. [83]3c). We employed
   this view to identify the set of genes common to all pathways in M[5],
   a module comprising the two Alzheimer's disease (AD) and two PD
   pathways from KEGG and WikiPathways. Subsequently, we used the ComPath
   pathway enrichment wizard to investigate in which pathways the five
   common genes identified (APAF1, CASP3, CASP9, CYCS, and SNCA)
   participate. The analysis revealed that they are predominantly involved
   in apoptosis, an important process in both AD and PD
   pathophysiology.^[84]35,[85]36
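
   The set operations behind this kind of overlap analysis are simple to
   express. The sketch below is illustrative only: the five shared genes
   are those named above, whereas every other gene membership and both
   function names are invented and do not reflect the actual database
   contents.

```python
def common_genes(gene_sets, names):
    """Genes present in every one of the named pathways."""
    return set.intersection(*(gene_sets[name] for name in names))


def exclusive_genes(gene_sets, name):
    """Genes found in one pathway and in none of the others."""
    others = set().union(*(g for n, g in gene_sets.items() if n != name))
    return gene_sets[name] - others


# The shared core comes from the text; all other memberships are toy values.
core = {"APAF1", "CASP3", "CASP9", "CYCS", "SNCA"}
module = {
    "kegg:alzheimer": core | {"APP", "PSEN1"},
    "kegg:parkinson": core | {"PINK1", "PRKN"},
    "wikipathways:alzheimer": core | {"APP", "MAPT"},
    "wikipathways:parkinson": core | {"PINK1", "LRRK2"},
}

shared = common_genes(module, list(module))
print("shared by all pathways:", sorted(shared))
print("only in kegg:alzheimer:", sorted(exclusive_genes(module, "kegg:alzheimer")))
```

   The shared set could then be passed back into an enrichment step to
   ask which pathways those genes participate in.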

   The third visualization renders the results of the hierarchical
   clustering approach described in Chen et al. in the form of a
   dendrogram, enabling deterministic pathway grouping based on gene set
   similarity. We used this view in the PDgset example to assign the
   pathways without module membership to the closest module (Supplementary
   Figure [86]S2). The dendrogram proposed merging three previously
   unassigned pathways into M[2] (i.e., Allograft Rejection, MAPK
   Signaling pathway, and Rap1 signaling pathway). Additionally, the
   resulting dendrogram from clustering revealed hierarchical
   relationships between pathways (e.g., Pink/Parkin Mediated Mitophagy is
   a subset of the Reactome Mitophagy pathway), information that can be
   used to establish pathway mappings, as we show in the following case
   study.
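
   Two of the uses described above, assigning an unassigned pathway to its
   closest module and spotting subset relationships, can be approximated
   without building a full dendrogram. The sketch below does not reproduce
   the hierarchical clustering of Chen et al.; it swaps in a simpler
   nearest-module rule based on mean overlap coefficient (an assumed
   similarity measure) and a strict-subset check. All names and gene
   memberships are invented for illustration.

```python
def overlap_coefficient(a, b):
    """|A & B| / min(|A|, |B|); an assumed similarity measure."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def subset_candidates(gene_sets):
    """(child, parent) pairs where the child's genes are a strict subset."""
    return [
        (child, parent)
        for child, child_genes in gene_sets.items()
        for parent, parent_genes in gene_sets.items()
        if child != parent and child_genes < parent_genes
    ]


def closest_module(pathway_genes, modules, gene_sets):
    """Name of the module with the highest mean similarity to the pathway."""
    def mean_similarity(members):
        total = sum(overlap_coefficient(pathway_genes, gene_sets[m]) for m in members)
        return total / len(members)
    return max(modules, key=lambda name: mean_similarity(modules[name]))


# Toy data: memberships are invented and only mimic the situation in the text.
toy = {
    "toy:mitophagy": {"PINK1", "PRKN", "SQSTM1", "ULK1", "TOMM20"},
    "toy:pink_parkin_mitophagy": {"PINK1", "PRKN", "SQSTM1"},
    "toy:signaling_a": {"MAPK1", "MAPK3", "RAF1", "HRAS"},
    "toy:signaling_b": {"MAPK1", "RAP1A", "RAF1"},
}
toy_modules = {
    "module_mito": ["toy:mitophagy", "toy:pink_parkin_mitophagy"],
    "module_signaling": ["toy:signaling_a"],
}

print(subset_candidates(toy))
print(closest_module(toy["toy:signaling_b"], toy_modules, toy))
```

   Subset pairs found this way are only candidates for hierarchical
   mappings; as noted earlier, gene set similarity ignores nomenclature
   and biological context, so manual curation would still be needed.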

Case study III: establishing mappings between pathway databases

   ComPath, as well as other tools, has demonstrated the benefits of
   integrating pathway knowledge from diverse resources to improve
   biological functional analysis.^[87]9,[88]10,[89]18 However, even after
   overcoming the technical hurdle of harmonizing the different formats
   used by different databases, these integrative approaches must be
   complemented by mappings at the pathway level that provide
   cross-references between databases and thus improve their
   interoperability.