Abstract

   Although knowledge of biological pathways is essential for interpreting
   results from computational biology studies, the growing number of
   pathway databases complicates efforts to efficiently perform pathway
   analysis due to high redundancies among pathways from different
   databases, and inconsistencies in how pathways are created and named.
   We introduce the PAthway Communities (PAC) framework, which reconciles
   pathways from different databases and reduces pathway redundancy by
   revealing informative groups with distinct biological functions.
   Uniquely applying the Louvain community detection algorithm to a
   network of 4847 pathways from KEGG, REACTOME and Gene Ontology
   databases, we identify 35 distinct and automatically annotated
   communities of pathways and show that they are consistent with
   expert-curated pathway categories. Further, we demonstrate that our
   pathway community network can be queried with new gene sets to provide
   biological context in terms of related pathways and communities. Our
   approach, combined with an interpretable web tool we provide, will help
   computational biologists more efficiently contextualize and interpret
   their biological findings.

INTRODUCTION

   Many computational biology studies aim to identify groups of genes that
   are associated with or differentially expressed with respect to
   phenotypes of interest ([28]1,[29]2). Performing pathway analyses to
   associate these gene-level findings with biological processes is a
   crucial step in providing both biological context for genes and a
   systems perspective to the analysis. Researchers have constructed
   pathways (i.e. networks of genes that interact in various ways to
   perform certain biological tasks ([30]3)) using a variety of methods;
   these range from large-scale computational analysis over literature
   (e.g. Gene Ontology ([31]4)) to hand-curation by experts (e.g. KEGG
   ([32]5–7) and REACTOME ([33]8)). These pathway gene sets are then
   commonly used in pathway enrichment analysis, where a new list of genes
   is compared with previously established pathway gene sets (via
   statistical tests to measure whether their genes overlap at higher than
   random rate) in order to identify which pathways are most related to
   the gene set of interest.

   Since many pathway gene set databases offer unique advantages, it has
   become common practice to perform pathway enrichment analysis across
   multiple databases. Unfortunately, the difficulty of interpreting
   pathway-level results is increasing as the number of available pathways
   and databases grows. Figure [34]1A shows that pathways across different
   databases and even within a database often have highly overlapping gene
   sets, causing redundancy in enrichment results. In fact, 50% of all
   pathways in Figure [35]1A have some corresponding pathway in a separate
   database, with at least 70% of genes in common. Therefore, a single
   query of genes can return multiple enriched pathways (e.g. KEGG’s
   oxidative phosphorylation and Parkinson's disease pathways, which have
   71% overlap), and it may be difficult to assess how the pathways are
   related, especially if they are from varying sources. Systematically
   analysing these phenomena can provide context for the simultaneous
   capture of multiple pathways.

Figure 1.

   [36]Figure 1.
   [37]Open in a new tab

   Overview of the pathway community approach. (A) The pathway overlap
   problem in pathway enrichment analyses. For each pathway database, we
   show the distribution of each pathway's maximum fraction of overlap
   with other pathways, both within the same database (top) and with other
   databases (bottom). (B) Our approach for learning pathway communities:
   we first construct a pathway network based on gene overlaps, and we
   then perform community detection to produce pathway communities. (C)
   Schematic highlighting of functionality on our webpage showing detailed
   views of each learned community. (D) Schematic highlighting of our
   proposed method for querying a new gene set against our learned
   communities to identify relevant processes enriched in a query gene
   set.

   Many studies have attempted to solve the problem of pathway redundancy.
   One common approach collapses pathways from multiple sources into a
   condensed set of high-level pathways – such as PathCards ([38]9),
   MSigDB Hallmark ([39]10) or GO slim ([40]11) – to simplify enrichment
   tests; however, by condensing many pathways, these approaches tend to
   remove smaller nuanced pathways describing specific biological
   functions. Another set of approaches identify related pathways in a
   post-hoc manner with respect to specific queries. For example, Donato
   et al. ([41]12) identify crosstalk effects when measuring enrichment of
   pathways for a set of differentially expressed genes, providing a
   revised set of enriched pathways with crosstalk effects removed. A
   similar approach has also been applied to identify broad patterns of
   co-occurring pathways in a collection of transcriptomic datasets
   ([42]13). In these approaches, the relationships identified among
   pathways are with respect to collected data (e.g. a specific
   transcriptomic dataset), whereas a dataset-agnostic method relying only
   on the pathways themselves may provide a complementary approach for
   exploring related pathways in a unified and data-agnostic manner. To
   that end, pathway network visualization tools, such as Enrichment Map,
   have been developed to highlight the relatedness of pathways ([43]14)
   but do not map pathways to functional categories. Although standard
   clustering algorithms have been useful for identifying groups of
   related pathways ([44]15,[45]16), they are limited in that these
   approaches have not been systematically validated with respect to
   expert labels. Furthermore, previous clustering-based approaches have
   relied on manual annotation of clusters and thus do not scale well to
   the growing numbers of pathways and databases ([46]14–16).

   In this paper, we address the problems of heterogeneity and redundancy
   across literature-derived pathways by introducing the Pathway
   Communities (PAC) framework. Our framework uniquely relies on the
   Louvain community detection algorithm to cluster pathways into
   communities based on their gene-level similarities (Figures
   [47]1 and [48]2). Further, we enhance the biological interpretability
   of cross-database pathway analysis by (i) characterizing learned
   communities with respect to pre-defined categories (Figure [49]3), (ii)
   devising a method to algorithmically annotate communities (Figure
   [50]4A), (iii) applying interactive visualization techniques to
   investigate newly revealed connections within and across pathway
   communities (Figure [51]4) and (iv) providing a tool to help
   researchers investigate novel gene sets in the context of our learned
   communities and member pathways, which we demonstrate on a breast
   cancer gene expression example (Figure [52]5).

Figure 2.

   [53]Figure 2.
   [54]Open in a new tab

   Comparison of graph clustering and community detection algorithms we
   evaluated. Pathway graphs and communities were learned separately for
   each database and then compared with curated category labels from each
   source database via normalized mutual information.

Figure 3.

   [55]Figure 3.
   [56]Open in a new tab

   Overview of learned communities and their association with known
   processes. (A) Communities’ number of member pathways (ranging from 7
   to 492 members). (B) Composition of pathway database members in each
   community. (C) Communities’ associations with MSigDB Hallmark pathways.
   Each Hallmark pathway has a list of associated ‘founder’ gene sets,
   some of which are in the REACTOME and KEGG databases. Heatmap
   annotations indicate the number of founders in each community
   associated with each Hallmark pathway. Cells are colored by each
   community’s frequency of founders distributed across Hallmark pathways
   (e.g. the darkest purple indicates that all Hallmark founders contained
   in a community are mapped to a single Hallmark pathway). We note that
   some Hallmark pathways have identical founder gene sets, and we include
   both Hallmark pathways in the same row of the heatmap (separated by
   ‘-&-’ in the label). Finally, we use PAC’s gene set querying method
   (described in Materials and Methods) to assign each Hallmark pathway to
   a community based on its member genes. The assigned communities are
   indicated by squares in the heatmap, and these squares are colored by
   how closely they agree with the top community based on founders.

Figure 4.

   Figure 4.
   [57]Open in a new tab

   (A) The final learned community network. We display all community
   nodes, with sizes proportional to the number of pathway members, and
   edge widths proportional to the average weights among all pairs of
   edges between members of each community. Each community is labeled with
   automatically generated labels as described in Methods. (B) An example
   view of a single community. We display all members of community 28 as
   nodes with sizes proportional to their hubness in the subnetwork
   containing only community 28 pathways, and colored by the database for
   each pathway. Edge widths are proportional to the –log10(p-value) for
   Fisher's exact test measuring gene overlap between pathways.

Figure 5.

   [58]Figure 5.
   [59]Open in a new tab

   (A) Enrichment of top pathways for the 243 most significantly
   differentially expressed genes biopsied samples for ER + vs. ER- breast
   cancers. Each bar shows the –log10(p-value) for enrichment based on
   Fisher's exact test after Bonferroni correction over all pathways. Each
   pathway is labeled with its assigned community and enrichment rank. (B)
   Using the PAC’s gene set querying method (described in Methods), we
   query the 243 most significantly differentially expressed genes against
   the learned communities and display the modularity change associated
   with assigning the query gene set to each community. The top 10
   communities are shown, with the top two automatically generated labels
   (Methods) for each of the top 5 communities, indicating processes that
   are generally related to the query gene set. (C) Network overview for
   the top 108 enriched pathways (P< 0.02) for the 243 genes described
   above. Each node is colored by its community and sized proportionally
   to the enrichment –log10(p-value), and we annotate the top 20 pathways
   with their rank as indicated in (A).

MATERIALS AND METHODS

Pre-processing of pathways and generation of curated categories

   For our analyses, we constructed a pathway graph comprised of pathways
   from KEGG ([60]7), REACTOME ([61]8) and Gene Ontology (GO) ([62]4) gene
   sets, downloaded from MSigDB v7.0 ([63]17,[64]18). Each pathway
   database contains a list of named pathways that each have set of
   associated genes. For each database, we additionally pre-processed a
   provided set of curated categories or hierarchical relationships among
   pathways to identify higher-level categories associated with each
   pathway (see Supplementary Methods). These simple mappings from
   pathways to curated categories were treated as ground-truth labels for
   evaluating our method. Although these curated categories share some
   common themes across databases, it is not possible to combine them, so
   initial validation experiments were performed separately for each
   database. Finally, we downloaded MSigDB Hallmark gene sets ([65]10), a
   collection of refined pathways meant to summarize thousands of founder
   gene sets (some of which are KEGG and REACTOME pathways in our
   analysis), which we use for additional validation.

Generating the pathway graph and learning communities

   Using the PAC method for identifying communities of related pathways
   involves two steps: 1) construction of the pathway network, and 2)
   detection of pathway communities (Figure [66]1B). For the first step,
   we represented pathways from multiple databases as a large graph, where
   each node is a pathway (with an associated set of genes). For each pair
   of pathways, we performed Fisher's exact test which evaluates the
   significance of their gene overlap under a hypergeometric null
   distribution ([67]Supplementary Methods and[68]Supplementary Figure
   S1A) ([69]19–21). We added edges between all pairs of pathways, with
   edge weights corresponding to the –log[10](P-value) from Fisher's exact
   test measuring the significance of gene overlap (P-values were
   Bonferroni corrected, and edges between pathways with P > 0.01 were set
   to a weight of 0). This process generated a sparsely connected graph in
   which each node represents a pathway, and edges indicate similarity of
   pathways with respect to shared genes.

   For the second step, we identified communities of related pathways from
   this network using the Louvain community detection algorithm
   ([70]22,[71]23) using the Community API in Python. To our knowledge,
   ours is the first work to use an approach based on graph modularity to
   cluster pathways. The Louvain algorithm learns well-connected
   communities from a network by greedily optimizing for graph modularity,
   namely, a measure of the density of intra- vs inter-community edges.
   During our evaluation, we first performed the PAC method separately for
   each pathway database (generating four separate pathway graphs and
   performing community detection separately). We ran the Louvain
   algorithm with a resolution parameter of 0.4 and conducted several
   experiments to verify the stability of our approach with different
   initializations, hyperparameters, and alternative methods for graph
   construction and evaluation (Supplementary Methods; [72]Supplementary
   Figures S2–S6). We also compared other graph clustering approaches with
   the Louvain algorithm: agglomerative clustering, spectral clustering,
   and the Clauset-Newman-Moore (CNM) algorithm (another modularity-based
   approach); for all methods, we evaluated several hyperparameter
   options. Figure [73]2 reports the highest NMI achieved between the
   learned clusters and curated categories (i.e. the reduction uncertainty
   for the curated category label when the algorithmically learned
   community label is known).

   Finally, after confirming the consistency of our learned communities
   with curated categories within each database (Figure [74]2), we learned
   an integrative set of communities from the pathway graph that combined
   all four pathway databases, resulting in 35 learned communities across
   4847 pathways. The full list of pathways’ community assignments is
   provided in [75]Supplementary Table S1. Our code for producing the
   pathway graph and learning communities is available at
   [76]https://gitlab.cs.washington.edu/nbbwang/P-COM.

Automatic labelling of pathways

   To automatically generate descriptions for our communities that were
   independent of expert-curated category labels, we developed a method
   based directly on the names of member pathways. For each community, we
   pre-processed member pathway names from MSigDB and identified commonly
   appearing terms. After pre-processing these names to remove database
   identifiers and common stop words (using the Natural Language Toolkit
   Python package), we counted instances of all 3-mers across community
   members and used the most commonly appearing 3-mer (with at least three
   appearances) as a label to describe our community. In the case of a
   tie, we selected the 3-mer whose source pathways had the highest
   average hubness within their community. When no 3-mers made at least
   three appearances, we repeated the process with 2-mers.
   [77]Supplementary Figure S7 shows an example of this procedure, and
   [78]Supplementary Table S2 provides the top ten terms for each
   community.

Querying new gene sets

   As described above, the PAC method creates a graph with edges based on
   gene set overlaps among pathways and learns communities via the Louvain
   algorithm, which greedily optimizes the modularity of graph partitions.
   Therefore, using the following process, it is straightforward to query
   a new gene set in our learned community network. First, we calculate
   Fisher's exact test p-values between the new gene set and each pathway
   already in the graph ([79]Supplementary Figure S1B), and we then
   temporarily add the gene set as a new node in the pathway graph. Next,
   holding the previously learned partition constant and initially
   considering the new gene set as a single-member community, we calculate
   the modularity change associated with moving the gene set to each of
   the other communities. We then rank the candidate communities for the
   new query gene based on their associated increase in modularity.

   To ensure that this approach is useful for any arbitrary list of genes
   (e.g. from a differential gene expression analysis), we first validated
   it by again using the MSigDB Hallmark pathways, because some of their
   founders are pathways in our graph. Thus, we evaluated whether, when
   queried, these Hallmark gene lists tended to be assigned to communities
   containing many of their founder gene sets. Finally, as described
   below, we created an interactive web tool to help users query a list of
   genes and identify candidate communities most closely associated with
   it.

Interactive web tool

   Because the pathway graph contains thousands of pathways and learned
   communities contain up to 492 pathways, our integrative communities are
   not amenable to static visualizations. Therefore, we created an
   interactive web tool, available at
   [80]https://pathwaycommunities.cs.washington.edu/, using the Plotly
   Dash framework. The tool lets users explore the community and pathway
   networks to gain a clearer understanding of how pathways relate to each
   other and the higher level processes learned by each community. It
   offers several views. First, a community-level view shows users the
   community graph (i.e. Figure [81]4A) along with detailed annotations,
   including the top automatically generated labels, pathway members, and
   most commonly appearing genes. Users can query specific biological
   processes (i.e. automatically learned labels) to identify which
   communities contain relevant pathways. Second, a pathway-level view
   reveals sub-graphs containing all pathway members for a selected
   community. Here, users can highlight a specific pathway in the graph to
   visualize how it relates to other pathways in its community. Third, a
   gene-level view helps users query any gene to see which communities
   contain pathways with that gene and whether the gene appears
   disproportionately often in certain communities ([82]Supplementary
   Table S3 provides the full list of genes with significant
   overrepresentation in any community as described in Supplementary
   Methods). Finally, we provide a page that lets users query a new gene
   set against our pathway network and visualize both enriched pathways
   and top related communities (based on the method described above).
   Together, these different pages for examining the learned communities
   of pathways, along with the ability to query a new gene set, may allow
   users to more effectively interpret their biological findings in the
   context of known pathways.

Biological example with breast cancer data

   To demonstrate an application of our method to new gene sets, we used
   an example of a differential gene expression analysis in breast cancer
   samples. Data was provided by the Molecular Taxonomy of Breast Cancer
   International Consortium (METABRIC) database. The database includes
   breast cancer sequencing data from 2,509 patients; gene expression
   measurements from biopsied breast cancer samples, including 24 368
   measured genes; and phenotypic or treatment labels associated with each
   sample (e.g. estrogen receptor status and whether the patient was
   treated with chemotherapy).

   We restricted our analysis to 2469 sampled profiles for which estrogen
   receptor status was reported, of which 74% of samples were positive for
   estrogen receptors. For each of the 24 368 genes measured, we compared
   expression levels for ER + and ER- samples using two-sided independent
   t-tests and identified 8,984 genes significantly differently expressed
   between groups (P < .05 after Bonferroni correction; [83]Supplementary
   Methods describe data processing). Consistent with prior knowledge
   ([84]24), this indicates that expression differences between ER + and
   ER− cancers are widespread across genes. To refine our understanding of
   the top genes, we identified the top 1% (243) of genes with the most
   significant differential expression ([85]Supplementary Figure S8) and
   examined them in the context of our community network, thus enhancing
   the interpretability of a set of differentially expressed genes.

RESULTS

The PAC framework uses a Louvain community detection method that outperforms
alternative approaches

   To evaluate whether we could learn informative and meaningful
   communities from pathway networks, we separately applied the pathway
   network construction step to each pathway database and then applied
   various community detection and clustering algorithms to each
   individual graph. We then computed the normalized mutual information
   (NMI) to compare the resulting communities with the curated categories
   (see Methods for details) from each database. Figure [86]2 shows that
   the Louvain community detection method achieved high NMI scores when
   comparing the automatically learned communities with ground-truth
   curated labels, exceeding NMI scores from all alternative approaches
   across all pathway databases (P < 0.001 for all t-tests; alternative
   evaluation metrics revealed similar results, as shown in
   [87]Supplementary Figure S2). This indicates that the Louvain community
   detection method generates communities that are consistent with
   expert-curated labels and offers a promising approach for automatically
   categorizing communities based on their shared genes.

PAC combines four major pathway databases, resulting in communities that are
consistent with hand-curated categories

   To learn a unified set of communities across all 4847 gene sets from
   the four different sources (KEGG, REACTOME, GO BP and GO MF), we
   constructed a joint network and used the Louvain algorithm with a
   resolution of 0.4 (selected because it provided high NMI scores across
   all four datasets; Supplementary Methods & [88]Supplementary Figure
   S4). This approach generated 35 communities, ranging in size from 7 to
   492 pathways (Figure [89]3A) and effectively integrated pathways from
   different sources (Figure [90]3B), which indicates that the communities
   are likely driven by function rather than database-specific signals.

   Like the separately generated ones, communities found from our combined
   pathway graph tend to be very consistent with their own curated
   categories (NMIs of 0.62, 0.43, 0.32, and 0.41 for KEGG, REACTOME, GO
   BP and GO MF, respectively; NMI of 0.30 when all curated categories
   were concatenated; [91]Supplementary Figures S9-1S3 show comparisons
   between these communities with curated categories). However, differing
   curated categories from each database are not easily reconcilable into
   common biological themes. Thus, to explore whether the learned
   communities capture meaningful common processes, we analysed our
   results with respect to the MSigDB’s Hallmark pathways (v7.1),
   developed to summarize pathways across MSigDB’s various sources
   ([92]10).

   All Hallmark pathways have associated sets of founder pathways, sets of
   pathways from which the Hallmark pathways are derived, including 745
   pathways in our combined graph. We initially observe that our
   automatically learned communities are highly consistent with grouping
   pathways by Hallmark pathways for which they are founders (NMI = 0.60;
   pathways that were not Hallmark founders were not considered in this
   calculation). Furthermore, purple cells in Figure [93]3C illustrates
   the distribution of Hallmark pathway founders within each community;
   most of our communities are associated with few Hallmark pathways,
   signifying that Hallmark pathways provide insight into coherent
   biological processes within communities. For example, there is a nearly
   one-to-one mapping between community 23 and the Hallmark ‘apical
   junction’ pathway (i.e. genes involved in adherens and tight junctions
   between cells ([94]10)), suggesting that this Hallmark pathway may be
   an appropriate annotation for community 23. This is further supported
   by the fact that most pathways in the community relate to cell-to-cell
   adhesion ([95]Supplementary Table S1), consistent with the biological
   function of the Hallmark pathway. Similarly, all Hallmark apoptosis
   founders are in community 16, consistent with the fact that many
   central members of the community relate to apoptotic signalling
   ([96]Supplementary Table S1).

   Although the examples illustrated above and in Figure [97]3C are
   promising, because Hallmark pathways are based on only a small subset
   (15.4%) of our original pathways, they cannot be used to annotate
   smaller communities with no Hallmark founders. Furthermore, this
   interpretation may fail to reveal highly specific processes occurring
   within the communities. For example, the PI3K/AKT/mTOR Signalling
   Hallmark Pathway, broadly important in regulating the cell cycle, is
   split across several of our communities, showing that these communities
   may each represent more specific sub-processes. In the next section, we
   address this problem by automatically generating descriptive labels for
   each community.

   Finally, we also use Hallmark pathways to validate our method for
   querying new gene sets in the PAC framework. Because each Hallmark
   pathway has a set of associated founder gene sets (some of which are in
   our pathway network), we use our querying method to assign each
   hallmark pathway to a community in our network; we can then determine
   if there is agreement between the community assigned via our method and
   communities containing its founders. For 74% of Hallmark pathways,
   Figure [98]3C shows that the assigned community based on our querying
   method is the same community that contains the most founders (see pink
   cell outlines in Figure [99]3C); for all Hallmark pathways, the
   assigned community is in at least the top three communities based on
   founder membership.

PAC’s Automatically generated labels and visualizations provide high-level
overviews of learned communities

   To highlight high-level relationships among our pathway communities, we
   visualize our pathway community graph with their automatically learned
   labels (Figure [100]4A). First, we observe that many of our
   automatically generated label terms are consistent with our Hallmark
   pathway analyses (Figure [101]3C), e.g. Community 16, whose top label
   was ‘Apoptotic signalling pathway,’ is consistent with its most closely
   related Hallmark pathway, ‘apoptosis.’ Importantly, our labelling
   approach also provides insights unavailable from Hallmark pathway
   analysis alone. For example, Communities 9, 18, and 19 are all strongly
   associated with the Protein Secretion Hallmark pathway (Figure [102]3),
   but our automatically generated labels reveal separate phenomena (i.e.
   lipoproteins, synaptic processes, and viral activity for Communities 9,
   18 and 19, respectively). Overall, Figure [103]4A demonstrates that our
   communities’ labels cover a broad range of cellular activities while
   remaining sufficiently specific to describe well-defined biological
   functions, while network edges highlight the interrelatedness of
   biological processes across these communities.

   Finally, to demonstrate that our approach produces informative pathway
   communities, we explored individual networks within communities. As an
   example, Figure [104]4B visualizes Community 28, one of multiple
   communities associated with the Hallmark hypoxia and glycolysis
   pathways. Pathway members are consistent with the top label
   ‘carbohydrate metabolic process,’ and the community integrates pathways
   related to this biological function from multiple databases. These
   promising results highlight that our process can overcome database bias
   and capture functional similarities.

   [105]Supplementary Table S1 shows community pathway membership, and
   [106]Supplementary Table S2 shows the top labels for each community.
   Our web tool provides detailed interactive visualizations for each
   community; see [107]https://pathwaycommunities.cs.washington.edu/.

A biological example: genes differentially expressed between estrogen
receptor positive versus negative breast cancers relate to relevant pathway
communities

   Breast cancers are commonly classified by their estrogen receptor (ER)
   status—i.e. whether estrogen receptors are expressed in cancer cells
   (ER+) or not (ER−). These cancer subtypes manifest in markedly
   different ways and require different treatment regimens because
   ER+ cancers rely heavily on estrogen to grow and reproduce, whereas ER-
   cancers do not ([108]25,[109]26). Thus, ER status is associated with a
   wide range of transcriptional differences between ER + and ER− cancers
   ([110]27). In this example, we conduct a simple differential expression
   analysis to identify the most differentially expressed genes and then
   highlight how our method clarifies these findings. In particular, as
   described in Methods, we identify the top 1% of the most significantly
   differentially expressed genes between 1,825 ER + and 655 ER- samples
   from the METABRIC dataset ([111]Supplementary Figure S8) and use these
   genes as a query gene set to explore relevant pathways in the context
   of our learned communities.

   Figure [112]5A shows the top enriched pathways (as determined by
   Fisher's exact test of overlap for the top genes and all pathways),
   which would be the final outcome of a standard pathway analysis. By
   additionally colouring each bar with the community to which the pathway
   was assigned, we see that some communities appear frequently in the top
   20 enriched pathways.

   Next, we query the top genes against our communities (see Materials
   and Methods). This approach considers not only the top enriched
   pathways, but the overall association of all pathways in each community
   with the query gene set. This analysis reveals that the top communities
   related to differentially expressed genes between ER+ and ER− broadly
   relate to growth factors and cell differentiation (community 20) and
   metabolic activity (community 1) (Figure [113]5B). Interestingly,
   although community 27 contains multiple estrogen-related pathways that
   are highly enriched in the top genes (e.g. #3 and 4 in Figure [114]5A),
   most pathways in community 27 relate to broader cell-cycle activity
   ([115]Supplementary Tables S1, S2); therefore, that community is not
   favoured in the community-level analysis. Thus, although our approach
   identifies specific pathways relevant to our gene sets, use of the
   community network query reveals some broader patterns of processes
   captured in the top genes related to ER-status. Finally, Figure [116]5C
   shows a network visualization that highlights 108 pathways enriched at
   the P < 0.02 level in our gene set query. This helps the user visualize
   not only which pathways are enriched but also how they relate to each
   other (note that interactive versions of Figure [117]5B, C, which
   provide more detail, are available at
   [118]https://pathwaycommunities.cs.washington.edu/).

DISCUSSION

   Our approach is not without limitations. In particular, because our
   graph is constructed using Fisher's exact test-based edges, our method
   for querying new gene sets also relies on the use of Fisher's exact
   test for pathway enrichment to be consistent with the rest of the
   graph's structure (since querying a new gene set involves simulating
   its addition as a node to the pathway network). Thus, our tool does not
   currently support the use of an alternative approach to compute pathway
   enrichment (e.g. GSEA); we may implement this functionality in the
   future. Additionally, although we empirically found that our automatic
   community annotation approach was consistent and interpretable, it
   relies primarily on the quality of pathway names. If the analysis were
   repeated with a new set of pathway databases that used uninformative
   pathway names, our approach for automatically labelling communities
   would not be applicable.

   In summary, we contribute the PAC framework, an automated approach for
   identifying communities of closely related pathways across several
   databases that successfully recapitulates expert-curated categories. By
   conducting separate analyses on pathway databases, we verified that the
   learned communities were consistent with curated ground truth labels.
   When scaled up to an integrative analysis across four databases, we
   verified that learned communities were consistent with Hallmark
   pathways. Unlike previous methods focused solely on visualization of
   pathway enrichment results, we leverage an automatic labelling approach
   to yield additional insights about biological pathway relationships. We
   also found that maintaining a pathway-level understanding of our
   communities provides additional nuance and context that is lost by
   consolidated gene sets (e.g. using Hallmark pathways alone for pathway
   enrichment analysis). Further, our approach may also complement
   existing methods of post-hoc disambiguation (e.g. by Donato et al.
   ([119]12)), as our learned communities highlight closely-related
   pathways which would likely be combined and revised with respect to a
   specific dataset. Finally, we believe that our tool to query new gene
   sets against our analyses (demonstrated with a breast cancer gene
   expression data example analysis) and our interactive webpage for
   examining relationships among pathways and communities will help
   computational biologists contextualize meaningful new findings for a
   wide variety of biological processes.

DATA AVAILABILITY

   Molecular Signatures Database (MSigDB) is a collection of annotated
   gene sets, which include KEGG, REACTOME, GO, and Hallmark gene sets
   ([120]http://www.gsea-msigdb.org/gsea/downloads.jsp#msigdb). Curated
   categories for KEGG pathways are available at
   [121]https://www.genome.jp/kegg/pathway.html. Curated categories for
   REACTOME pathways are available from their interactive web browser
   ([122]https://reactome.org/PathwayBrowser/). GO relations are provided
   at [123]http://geneontology.org/docs/download-ontology/ (we used the
   go-basic version).

   Data from the METABRIC (Molecular Taxonomy of Breast Cancer
   International Consortium) cohort are available from the cBioPortal for
   cancer genomics. The specific dataset used in our study was downloaded
   here:
   [124]https://cbioportal-datahub.s3.amazonaws.com/brca_metabric.tar.gz,
   and an interactive view of clinical features is available here:
   [125]https://www.cbioportal.org/study/summary?id=brca_metabric.

   Python packages used in this study include the Community API
   (python-louvain.readthedocs.io/), NetworkX
   ([126]https://networkx.org/), the Natural Language Tool Kit
   ([127]https://www.nltk.org/) and Scikit-learn
   ([128]https://scikit-learn.org/stable/).

   Our code for reproducing our results is available at
   [129]https://gitlab.cs.washington.edu/nbbwang/P-COM. Our interactive
   webpage accompanying this paper is available at
   [130]https://pathwaycommunities.cs.washington.edu/.

Supplementary Material

   lqac044_Supplemental_Files
   [131]Click here for additional data file.^ (191.7MB, zip)

ACKNOWLEDGEMENTS