Abstract

Although pathways are widely used for the analysis and representation of biological systems, their lack of clear boundaries, their dispersion across numerous databases, and the lack of interoperability impede the evaluation of the coverage, agreements, and discrepancies between them. Here, we present ComPath, an ecosystem that supports curation of pathway mappings between databases and fosters the exploration of pathway knowledge through several novel visualizations. We have curated mappings between three of the major pathway databases and present a case study focusing on Parkinson’s disease that illustrates how ComPath can generate new biological insights by identifying pathway modules, clusters, and cross-talks with these mappings. The ComPath source code and resources are available at https://github.com/ComPath and the web application can be accessed at https://compath.scai.fraunhofer.de/.

Introduction

The notion of pathways enables the representation, formalization, and interpretation of biological events or series of interactions. Cataloging biological knowledge into pathways reduces complexity from all possible interacting molecular entities to a set of well-studied and validated functional relationships between molecular entities culminating in biological processes. Several efforts have generated databases of pathways with varying specificity and granularity that comprise signaling cascades, metabolic routes, and regulatory networks, ranging from precise signatures with no more than a couple of acting players to general pathways involving thousands of molecular players.^1–4 Simplifying biology into pathways and representing them as network models or mathematical models inevitably results in a loss of information, such as spatiotemporal information or even entire biological entity types.
The network abstraction facilitates pathway visualization and interpretation thanks to the natural correspondence between biological networks and systems: nodes correspond to molecular entities and edges to the types of interactions occurring between them (e.g., inhibition or phosphorylation). Although networks can comprise a broad range of molecular types (e.g., proteins, chemicals, and small molecules), they are generally reduced to the most direct outcome of our genetic makeup, the genetic and protein levels, so that we can mechanistically understand their functionality. Thus, pathways are frequently viewed as and simplified to “gene sets”, the collection of all genes/proteins that constitute the pathway, because of the major challenges of incorporating network topology and translating the variety of relationships into pathway analysis methods. While dedicated research groups and commercial entities with experienced curators have led a majority of the efforts to compile, delineate, and store biological knowledge in pathway databases,^2,5 community and crowdsourced efforts have recently gained traction.^3,6 Further, the variability in curation team composition, database scope (e.g., signaling pathways, gene regulatory networks, and metabolic processes), and curation guidelines led to the adoption of different (and in many ways incompatible) schemata and formalisms such as Biological Pathway Exchange (BioPAX;^7) and Systems Biology Markup Language (SBML;^8).
These incompatibilities motivated the integration and harmonization of resources into pathway meta-databases such as Pathway Commons^9 and PathCards,^10 which focus on integrating databases; iPath,^11 which focuses on pathway visualization; and SIGNOR, which focuses on signaling pathways.^12 Even after integrating multiple pathway databases into a pathway meta-database, it is difficult to assess the agreements, discrepancies, redundancy, and complementarity of their contents because the original databases do not provide pathway mappings (e.g., pathway A from resource X is equivalent to pathway B from resource Y). These mappings are difficult to establish because of the arbitrary and overlapping nature of pathway boundaries as well as the absence of a common pathway nomenclature. Several controlled vocabularies have been generated as initial attempts to standardize pathway nomenclature,^13,14 but most pathway databases had already been established by the time these ontologies were published. Therefore, consolidating pathway knowledge remains a persistent issue, and mapping pathways from different resources to one another is still required to improve database interoperability.

Hierarchical clustering approaches have been presented as a way of grouping similar pathways based on their corresponding gene sets in order to propose pathway mappings.^10,15 Though these approaches can systematically cluster pathways from multiple resources, they have some limitations: first, the usual tradeoff between over- and under-clustering,^16 and second, the clustering algorithm considers neither pathway nomenclature nor biological context, so it often leaves out equivalent pathways with low similarity and ignores the context of the pathway (e.g., cell/disease specificity).
Nevertheless, these limitations can be overcome by following clustering and prioritization methods with the manual curation required to interpret the abstract concepts that are inherent to pathway definitions (e.g., biological process, cellular location, and condition). Though numerous algorithms^17 and tools^4,18 have been successfully applied to interpret experimental data through the context of pathway databases,^19,20 there has not yet been a systematic comparison between the contents of various pathway databases, an assessment of their overlaps and gaps, or an establishment of mappings. Previous studies have focused only on comparing a single or small set of well-established pathways across multiple resources.^21,22 For example, a comparison focused on metabolic pathways revealed that a set of five databases agreed only on a minimal core of biochemical knowledge.^23 These studies demonstrate the need to connect the insights provided by each pathway database to foster a greater understanding of the underlying biology. Here, we present ComPath, a web application that integrates content from publicly accessible pathway databases, generates comparisons, enables exploration, and facilitates curation of inter-database mappings.

Results

We developed an interactive web application that enables users to explore, analyze, and curate pathway knowledge. Below, we present three case studies illustrating how it can be used for each of these purposes. The figures for each were generated by interactive, dynamic views in the ComPath web application based on three major public pathway databases: KEGG, Reactome, and WikiPathways (Fig. 1).

Fig. 1. The ComPath ecosystem has three main components: the pathway database plugins, the ComPath framework, and the ComPath web application.
The ComPath framework mediates the communication between the plugins containing the pathway database information and the web application.

Case study I: comparison of pathway databases

Assessment of gene coverage

Analysis of the overlaps between Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and WikiPathways revealed that ~3800 human genes are shared between the three databases (Fig. 2a). While at least one of these common genes was present in almost every pathway in each database, the number of pathways containing more of them diminishes much more quickly in WikiPathways and Reactome than in KEGG (Supplementary Figure S1). This may be due to database properties such as pathway size (e.g., on average, pathways contain 90 genes in KEGG, 50 in Reactome, and 42 in WikiPathways) or gene promiscuity (i.e., genes functionally linked to many pathways) that might influence the results of analyses using pathway resources (Supplementary Table 2). For further investigation, the ComPath web application generates summary tables and several visualizations of the distributions of pathway size and gene membership for each database. These views present an overview of the database properties and help identify effects such as gene promiscuity or differences in the distribution of gene set sizes (Fig. 2b).

Fig. 2. a An Euler diagram summarizing the human gene-centric coverage of KEGG, Reactome, and WikiPathways compared to the universe of all genes from HGNC (more details in Supplementary Table 1). b Histogram views present gene promiscuity or pathway size distributions.
c The pathway similarity landscape of KEGG visualized as a heatmap.

Exploration of pathways

While the previous views produced gene-centric summaries of the contents of pathway databases, ComPath also enables the exploration of the pathway similarity landscape using Clustergrammer.js.^24 Fig. 2c illustrates how this view can identify clusters of pathways based on their similarity and then elucidate the hierarchical relationships between the Metabolic pathway, the largest KEGG pathway, and other more granular KEGG metabolic pathways (e.g., alpha-Linolenic acid metabolism, Lipoic acid metabolism, and ether lipid metabolism).

Case study II: identification of pathway modules, overlaps, and interplays using pathway enrichment

ComPath couples classic pathway enrichment analysis^18,25–27 with pathway-centric visualizations to identify modules, investigate overlaps, and cluster pathways. This case study demonstrates their use to investigate the roles of the pathways related to established genetic associations in the context of Parkinson's disease (PD). Pathway enrichment with Fisher's exact test using a gene panel associated with PD reviewed by Brás et al.^28 (hereafter referred to as PDgset) yielded over 300 pathways containing at least one of the panel's genes (Fig. 3a). We discarded pathways that contained fewer than two genes from PDgset, that were larger than 300 genes, or that were not statistically significant (false discovery rate >5%) after multiple hypothesis testing correction with the Benjamini–Yekutieli method under dependency.^29

Fig. 3. a Results of pathway enrichment using the PDgset as input to the ComPath pathway enrichment wizard. Note that enrichment results might change over time since ComPath regularly updates its underlying pathway databases.
To promote reproducibility, the current version of the databases is displayed on the ComPath overview page and older versions can be provided upon request. b The Pathway Network Viewer displays the similarity around a selection of pathways. c The Pathway Overlap View depicts the overlaps and intersection of pathways enriched from the PDgset.

Three views were used to assist in the interpretation of the remaining 29 enriched pathways: a pathway network view was used to identify pathway modules, a pathway overlap view was used to explore the intersections and cross-talks between pathways, and a pathway dendrogram view was used for clustering. The pathway network view renders a pathway-to-pathway network in which nodes represent pathways and weighted edges represent their corresponding gene set similarities, in a similar fashion to PathwayConnector.^30 For the PDgset, this visualization helped us to define six different modules (i.e., groups of pathways) by removing edges with a weight lower than 0.2 (Fig. 3b). The largest module (labeled M1) contained pathways related to the processes of endocytosis and vesicle transport, both of which are putatively disrupted in PD.^31 M2 comprised pathways related to PTK6 signaling, such as the Reactome pathway "PTK6 promotes HIF1A stabilization". The high enrichment significance of this pathway (q-value = 0.0005), together with its role in regulating another PDgset gene, ATP13A2,^32 suggests that it may be linked to PD. ATP13A2 is directly responsible for Kufor-Rakeb syndrome,^33 a rare juvenile form of PD, and participates in two other PD mechanisms: lysosomal iron storage and mitochondrial stress. Because pathways related to these two mechanisms (i.e., Lysosome pathway from KEGG, Pink/Parkin mediated mitophagy from Reactome, and Mitophagy pathway from both KEGG and Reactome; M4) were also enriched, we investigated the role of ATP13A2 in PD further.
ATP13A2 is activated by phosphatidylinositol(3,5)bisphosphate, a particular phosphatidylinositol involved in M3 pathways (phosphatidylinositol metabolism and signaling pathways). Because this activation leads to a reduction in mitochondrial stress and α-synuclein toxicity, two hallmarks of PD, ATP13A2 has been proposed as a therapeutic target.^34 Ultimately, the exploration of the similarities and cross-talks between these three modules suggests further investigation of the candidate PD gene ATP13A2. Overall, this view complements pathway enrichment in the identification of pathway modules, the exploration of pathway cross-talks, and the prioritization of genes for further study.

While the pathway network viewer provides an overview of the different modules and their cross-talks, it does not reveal information about the boundaries and intersections of the pathways they contain. Therefore, we implemented the pathway overlap view, an interactive Euler diagram that allows exploration of pathway demarcations (Fig. 3c). We employed this view to identify the set of genes common to all pathways in M5, a module comprising the two Alzheimer's disease (AD) and two PD pathways from KEGG and WikiPathways. Subsequently, we used the ComPath pathway enrichment wizard to investigate in which pathways the five common genes identified (APAF1, CASP3, CASP9, CYCS, and SNCA) participate. The analysis revealed that they are predominantly involved in apoptosis, an important process in both AD and PD pathophysiology.^35,36

The third visualization renders the results of the hierarchical clustering approach described in Chen et al. in the form of a dendrogram, enabling deterministic pathway grouping based on gene set similarity. We used this view in the PDgset example to assign the pathways without module membership to the closest module (Supplementary Figure S2).
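The grouping step behind a dendrogram of this kind can be illustrated with a minimal sketch. It assumes Jaccard similarity between gene sets and average linkage; both are illustrative choices and not necessarily those used by Chen et al. or by ComPath. The function names, pathway labels, and gene identifiers are hypothetical placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_distance(left, right, gene_sets):
    """Average-linkage distance: mean of (1 - Jaccard) over cross-cluster pairs."""
    dists = [1 - jaccard(gene_sets[p], gene_sets[q]) for p in left for q in right]
    return sum(dists) / len(dists)


def agglomerate(gene_sets):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(name,) for name in sorted(gene_sets)]  # sorted for determinism
    merges = []
    while len(clusters) > 1:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], gene_sets),
        )
        merges.append((left, right, cluster_distance(left, right, gene_sets)))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


# Toy gene sets (hypothetical, not taken from any pathway database).
toy_pathways = {
    "endocytosis_like": {"G5", "G6", "G7"},
    "vesicle_like": {"G5", "G6", "G7", "G8"},
    "mitophagy_like": {"G1", "G2", "G3", "G4"},
    "pink_parkin_like": {"G1", "G2"},
}

merges = agglomerate(toy_pathways)
for left, right, dist in merges:
    print(f"{left} + {right} merged at distance {dist:.2f}")
```

Because the procedure is deterministic, the same gene sets always produce the same merge order, and the recorded merge distances are what a dendrogram would draw as branch heights. An unassigned pathway can then be attached to whichever existing group it merges with first.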
The PDgset dendrogram proposed merging three previously unassigned pathways into M2 (i.e., Allograft Rejection, MAPK Signaling pathway, and Rasp1 signaling pathway). Additionally, the dendrogram revealed hierarchical relationships between pathways (e.g., Pink/Parkin Mediated Mitophagy is a subset of the Reactome Mitophagy pathway), information that can be used to establish pathway mappings, as we show in the following case study.

Case study III: establishing mappings between pathway databases

ComPath, as well as other tools, has demonstrated the benefits of integrating pathway knowledge from diverse resources to improve biological functional analysis.^9,10,18 However, even after overcoming the technical hurdle of harmonizing the different formats used by different databases, these integrative approaches must be complemented by mappings at the pathway level in order to provide cross-references between databases, thereby improving their interoperability.
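As an illustration of how candidate pathway-level mappings could be shortlisted for manual curation, the sketch below flags pairs of pathways from two resources as possibly equivalent (matching names or high gene set similarity) or as possibly hierarchical (one gene set strictly contained in the other). This is a hedged example under stated assumptions, not ComPath's implementation: the function names, the relation labels, the name normalization, the 0.8 threshold, and the toy data are all hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard index of two gene sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(name):
    """Crude name normalization (assumption): lowercase and drop the word 'pathway'."""
    return name.lower().replace("pathway", "").strip()


def propose_mappings(source, target, min_jaccard=0.8):
    """Return (source_pathway, target_pathway, relation) candidates for curator review."""
    candidates = []
    for (s_name, s_genes), (t_name, t_genes) in product(source.items(), target.items()):
        same_name = normalize(s_name) == normalize(t_name)
        if same_name or jaccard(s_genes, t_genes) >= min_jaccard:
            relation = "possibly equivalent"
        elif s_genes < t_genes:  # source genes are a strict subset of target genes
            relation = "possibly part of"
        elif t_genes < s_genes:  # target genes are a strict subset of source genes
            relation = "possibly contains"
        else:
            continue
        candidates.append((s_name, t_name, relation))
    return candidates


# Toy resources (hypothetical gene identifiers, not real database content).
resource_x = {
    "Mitophagy": {"G1", "G2", "G3", "G4"},
    "Toy signaling": {"G7", "G8"},
}
resource_y = {
    "Mitophagy pathway": {"G1", "G2", "G3"},
    "Toy mitophagy subprocess": {"G1", "G2"},
    "Toy unrelated": {"G9"},
}

candidates = propose_mappings(resource_x, resource_y)
for source_name, target_name, relation in candidates:
    print(f"{source_name} -> {target_name}: {relation}")
```

In the toy data, the name match catches a pair whose gene set similarity falls below the threshold, which is the kind of case that purely similarity-based clustering can miss. Every candidate produced this way would still need review by a curator, since neither names nor gene overlap capture biological context.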