Abstract

   The 534 protein kinases encoded in the human genome constitute a large
   druggable class of proteins that include both well-studied and
   understudied “dark” members. Accurate prediction of dark kinase
   functions is a major bioinformatics challenge. Here, we employ a graph
   mining approach that uses the evolutionary and functional context
   encoded in knowledge graphs (KGs) to predict protein and pathway
   associations for understudied kinases. We propose a new scalable graph
   embedding approach, RegPattern2Vec, which employs regular pattern
   constrained random walks to sample diverse aspects of node context
   within a KG flexibly. RegPattern2Vec learns functional representations
   of kinases, interacting partners, post-translational modifications,
   pathways, cellular localization, and chemical interactions from a
   kinase-centric KG that integrates and conceptualizes data from curated
   heterogeneous data resources. By contextualizing information relevant
   to prediction, RegPattern2Vec improves accuracy and efficiency in
   comparison to other random walk-based graph embedding approaches. We
   show that the predictions produced by our model overlap with pathway
   enrichment data produced using experimentally validated Protein-Protein
   Interaction (PPI) data from both publicly available databases and
   experimental datasets not used in training. Our model also has the
   advantage of using the collected random walks as biological context to
   interpret the predicted protein-pathway associations. We provide
   high-confidence pathway predictions for 34 dark kinases and present
   three case studies in which analysis of meta-paths associated with the
   prediction enables biological interpretation. Overall, RegPattern2Vec
   efficiently samples multiple node types for link prediction on
   biological knowledge graphs and the predicted associations between
   understudied kinases, pseudokinases, and known pathways serve as a
   conceptual starting point for hypothesis generation and testing.

   Keywords: Random walk, Illuminating Druggable Genome (IDG), Pathway
   prediction, Data integration, Ontologies, Link prediction,
   Classification, Evolution, Signaling networks, Drug discovery

Introduction

   Protein kinases play a fundamental role in cell signaling by
   phosphorylating peptides and proteins on serine, threonine, and
   tyrosine residues. The 534 protein kinases encoded in the human genome
   regulate protein activity in diverse pathways including those involved
   in DNA repair, cell cycle control, and metabolism ([34]Hunter, 2009;
   [35]Johnson et al., 2023; [36]Lahiry et al., 2010; [37]Ochoa et al.,
   2020). Because dysregulation of protein kinase signaling is causally
   associated with the pathogenesis of many diseases ([38]Ardito et al.,
   2017; [39]Oprea et al., 2018), much effort has gone into the
   characterization of their physiological functions ([40]Ayala-Aguilera
   et al., 2022; [41]Xie et al., 2021) and the development of selective
   protein kinase inhibitors ([42]Ardito et al., 2017; [43]Attwood et al.,
   2021; [44]Xie et al., 2021). Despite these efforts, nearly 164 out of
   the 534 known kinases remain relatively understudied and are referred
   to as “dark” kinases by the NIH Illuminating the Druggable Genome
   consortium (IDG) ([45]Berginski et al., 2021; [46]Gillespie et al.,
   2022; [47]Kelleher et al., 2023). Characterizing the functions of these
   dark kinases is crucial because they work in conjunction with other
   well-studied kinases in signaling pathways and are also frequently
   mutated, or abnormally expressed, in human diseases such as cancers
   ([48]Berginski et al., 2021; [49]Brognard & Hunter, 2011; [50]Collins
   et al., 2018; [51]Ravanmehr et al., 2021; [52]Soleymani et al., 2022).
   While considerable progress has been made in illuminating the functions
   of several dark kinases, placing these kinases in a pathway or a cell
   signaling network context remains a major bioinformatics challenge.

   The compendium of genomic, proteomic and interactome data now available
   through the efforts of the IDG consortium and numerous
   investigator-initiated efforts allows for the possibility of inferring
   the functions and pathways for dark kinases through integrative mining
   of the known patterns and relationships in existing data. In
   particular, the development of network-based approaches, such as
   knowledge graphs (KGs), that relate and link diverse types of protein
   kinase data in the forms of networks (composed of nodes and edges)
   enables the prediction of dark kinase functions or pathways through
   network context ([53]Soleymani et al., 2022). Indeed, KG mining
   approaches and machine learning (ML) methods applied to KGs have been
   successfully employed in the identification of kinase substrates
   ([54]Gavali et al., 2022; [55]Nováček et al., 2020), prioritization of
   understudied kinases ([56]Huang et al., 2018), and identification of
   disease associations ([57]Bachman, Gyori & Sorger, 2022; [58]Gyori et
   al., 2017; [59]Moret et al., 2021; [60]Soleymani et al., 2022).

   The working premise of these network-based approaches is that proteins
   (kinases in this case) that share similar network neighbors (such as
   interacting proteins, downstream substrates, cellular localization, or
   molecular functions) with well-studied kinases are likely to be
   involved in common pathways. Several predictive ML methods utilize this
   premise, however many of these methods are designed for use with
   homogenous networks ([61]Grover & Leskovec, 2016; [62]Perozzi, Al-Rfou
   & Skiena, 2014). Homogenous networks have just a single node and edge
   type. Whereas heterogenous networks, like KGs, are defined by multiple
   node and relationship types such as Drug–targets–Protein and
   Protein–isAssociatedWith–DiseaseState, where Drug and Protein are node
   types and targets and isAssociatedWith are edge types (or
   relationships). Because heterogenous networks offer a broader range of
   knowledge representation over homogenous networks, they are suitable
   for biological representations that require data integration and
   conceptualization from multiple data sources. However, models that can
   fully leverage heterogenous graphs often suffer from scalability
   issues, limiting their predictive power and making the construction of
   the KG difficult ([63]Alshahrani, Thafar & Essack, 2021; [64]Bonner et
   al., 2022; [65]Gavali et al., 2022). Due to these limitations, KGs have
   not been previously employed to predict pathway associations for dark
   kinases because they are sparsely connected in KGs (due to low
   information content), making it difficult to predict functional
   associations through local context alone.

   Embedding-based approaches have been proposed to overcome the
   limitations of sparse data within KGs, as they have the capability to
   represent the graph in dense and low dimensional feature space ([66]Dai
   et al., 2020). However, many seminal methods developed for graph
   embedding are not optimized for heterogenous graphs as is the case with
   many message-passing GNN-based models ([67]Hamilton, Ying & Leskovec,
   2017; [68]Pei et al., 2020; [69]Tang, Li & Yu, 2019; [70]Velickovic et
   al., 2017; [71]Zhang et al., 2018; [72]Zhang et al., 2019). Other
   methods, such as proximity-preserving methods, have had success in
   large-scale knowledge graph embedding, however using these methods to
   fully capture complex relations is still an active area of research as
   many of these models lack full expressivity (accounting for both local
   and global dependencies) ([73]Bordes et al., 2013; [74]Gao et al.,
   2020; [75]Ge et al., 2022; [76]Huang et al., 2021; [77]Sadeghian et
   al., 2021; [78]Sun et al., 2019; [79]Wang et al., 2021; [80]Yang & Liu,
   2021; [81]Zhou, Yi & Jia, 2021). However, a recent advancement, the
   spatio-translational model BoxE, is fully expressive, but due to the
   model’s use of the “rule injection” method, yields high false
   positivity rates with sparse graphs ([82]Abboud et al., 2020).

   Relation-preserving methods attempt to capture complex relationships by
   matching semantic similarities of entities and relations and learn
   “context” from nodes within the KG. Many base-line relation-preserving
   methods are still widely used for link prediction tasks with biological
   data ([83]Chen et al., 2020; Gan et al. 2023; [84]Ha & Park, 2022;
   [85]Li et al., 2020; [86]Long & Luo, 2020; [87]Peng, Guan & Shang,
   2019; [88]Samizadeh & Minaei-Bidgoli, 2020; [89]Wang et al., 2021;
   [90]Wong et al., 2020). This deep learning approach is based on a
   family of models from Natural Language Processing (NLP) called word2vec
   and can be combined with constrained random walks to sample the graph
   and allow for use with heterogenous networks ([91]Dong, Chawla & Swami,
   2017). By sampling subsections of the KG, random walk-based methods can
   scale with larger KGs that would otherwise be too computationally
   expensive to use in full. While many techniques are used for data
   reduction in KGs ([92]Wang et al., 2017), constrained random walks have
   distinct advantages. Using a constrained “random walk” method on the
   KG, the graph is stochastically “sampled” such that multiple data types
   present within the network can be appropriately leveraged for
   predictions. The sampled section of the graph can then be converted
   into an embedding of the KG such that the biological context relevant
   for the prediction is accurately captured enabling nontrivial
   associations between dark kinase and associated pathways ([93]Dong,
   Chawla & Swami, 2017; [94]Nickel et al., 2015).

   One of the most well-known relation-preserving methods that utilizes
   random walks is metpath2vec. metapath2vec preserves the relationships
   in the graph during the sampling process by utilizing schemas (ordered
   sets of specific node or edge types). By specifying what node/edge
   types should be sampled, integration of different data types can be
   fully leveraged for predictions, thus accurately capturing the
   structure of heterogenous graphs. However, creating such meta-paths for
   complex heterogeneous KGs, often with no well-defined schema, is
   challenging and time-consuming with the performance of the model being
   highly dependent on the chosen meta-path (series of nodes).

   Several previous attempts to automatically generate meta-paths have
   been reported in the literature ([95]Meng et al., 2015; [96]Wan et al.,
   2020; [97]Yang et al., 2018) but they all rely on fixed-length
   meta-paths, which prevents accurate representation of latent and
   hierarchical structure encoded in large KGs. Other works have expanded
   upon the random walk framework proposed in metapath2vec, including
   W-MetaPath2Vec ([98]Pham & Do, 2019), HIN2Vec ([99]Fu, Lee & Lei,
   2017), and RW-k ([100]Anil et al., 2019). Additional works have
   combined the use of meta-path-based sampling with graph neural networks
   (GNNs) to make the aggregated message passing method more appropriate
   with heterogenous networks, as is the case with MAGNN ([101]Fu et al.,
   2020). Finally, attempts to exploit semantics explicit in structural
   relations found within the graph have been leveraged for graph
   sampling, as is the case with Relation Structure-Aware Heterogenous
   Information Network Embedding (RHINE) ([102]Shi et al., 2020), which
   defines relations that link nodes according to their similar properties
   or ability to bridge compatible nodes. We explore several of the
   methods listed above (MAGNN, RHINE, BoxE) for use with our graph (See
   Results ‘Benchmarking RegPattern2Vec’s predictive performance with
   metapath2vec and other graph embedding approaches’). However due to the
   focus of our question relating to the functionality of dark kinases
   there is inherent data imbalance within the graph making these methods
   not appropriate for use.

   To address this, we employ RegPattern2Vec ([103]Keshavarzi, Kannan &
   Kochut, 2021), a novel approach for learning on semantic data to
   predict new pathway associations for dark kinases. RegPattern2Vec
   directly addresses the limitations of previous models and provides more
   accurate predictions by precisely capturing relevant node/edge context
   in areas of graph sparsity using regular patterns and hyperparameters
   (See Methods & Materials ‘Regular pattern selection and its usage in
   random walks’–Biased random walk constrained by a regular pattern).
   Unlike metapath2vec, RegPattern2Vec can make node-association
   predictions without specifying the entire schema. This allows us to
   meaningfully sample the KG without establishing multiple meta-paths
   ([104]Keshavarzi, Kannan & Kochut, 2021). Here we show the application
   of RegPattern2Vec for link prediction on a large heterogenous KG.

   Overall, we successfully employed RegPattern2Vec to capture the
   node/edge context more accurately within our KG. After sampling, the ML
   technique (link prediction) is used to predict new associations between
   dark kinases and characterized pathways. We place 34 dark kinases for
   which we have consistent, high-confidence predictions produced through
   replicate runs (with differing hyperparameters) in a pathway context
   and based on an analysis of the meta-paths navigated for three selected
   dark kinases, we provide biological interpretations of the predicted
   pathway associations. The RegPattern2Vec pipeline is available at the
   GitHub repository
   ([105]https://github.com/gravelCompBio/RegPattern2Vec).

Materials & Methods

Knowledge graph architecture

   KGs are very similar to Heterogeneous Information Networks. An
   Information Network is a directed graph
   [MATH: <mi>G</mi><mo>=</mo><mfenced separators="" open="("
   close=")"><mrow><mi>V</mi><mo>,</mo><mi>E</mi></mrow></mfenced> :MATH]
   , composed of vertices (also called nodes) and edges, with an
   associated node type mapping function ϕ:V → A and an edge type mapping
   function ψ:E → R ([106]Shi et al., 2016). Each node v ∈ V belongs to
   one particular node type in the node type set
   [MATH: <mi>A</mi><mo>:</mo><mi>ϕ</mi><mfenced separators="" open="("
   close=")"><mrow><mi>v</mi></mrow></mfenced><mo>∈</mo><mi>A</mi> :MATH]
   , and each edge e ∈ E belongs to a particular edge type in the edge
   type set
   [MATH: <mi>R</mi><mo>:</mo><mi>ψ</mi><mfenced separators="" open="("
   close=")"><mrow><mi>e</mi></mrow></mfenced><mo>∈</mo><mi>R</mi> :MATH]
   . An Information Network is called a Heterogeneous Information Network
   if the sets A and R both contain more than one element, that is, there
   are multiple labels (types) for graph nodes and multiple labels (types)
   for edges. If sets A and R are singletons, the Information Network is
   called a Homogeneous Information Network (all nodes in the network are
   of the same type, and all edges are of the same type). While
   Heterogeneous Information Networks require that if two edges belong to
   the same edge type, the two edges share the same starting node type and
   ending node type, KGs do not. That is, the same edge (relation) type
   can be applied to different starting node types and different ending
   node types. This is also the case in the Resource Description Framework
   (RDF) ([107]Cyganiak et al., 2014; [108]W3C, 2014), a notation often
   used to represent KGs. RDF is based on the notion of triples of the
   form Subject-predicate-Object ([109]Cyganiak et al., 2014) representing
   edges connecting nodes in the graph.

   For example, in [110]Fig. 1A, for a KG representing information about
   protein kinases, a node (entity) representing a dark protein kinase CDK
   13 (cyclin dependent kinase 13) can be connected by an edge
   (relationship) labeled hasPathway to a node representing a pathway
   NeutrophilDegranulation, which represents the knowledge that CDK 13
   participates in the pathway NeutrophilDegranulation. In such a KG, CDK
   13 may be connected to other nodes using different labels (hasPathway),
   such as CDK13–hasMolelcularFunction—CyclinBinding,
   CDK13–hasBiologicalProcess—GranulocyteActivation and
   CDK13–hasCellularComponent—GolgiApparatus. Other protein kinases can
   additionally be included in the neighborhood context through shared
   nodes, such as the shared molecular function of CyclinBinding between
   dark kinase CDK13 and light kinase CDK6 (cyclin-dependent kinase 6).
   Here, edges have multiple labels, and destination nodes are of
   different types, which indicates that it is a heterogeneous KG.

Figure 1. Overview of the KG.

   [111]Figure 1
   [112]Open in a new tab

   (A) Visual representation of the node types present within the KG
   generated from the listed databases above. The circles represent nodes
   (entities), and the lines represent edges (relationships). The circles
   are colored according to node type. The labels inside the circle are
   specific examples of data found in our KG related to the dark kinase
   CDK13. The labels outside of the circles correspond to the node type in
   bold text and the edge type in italics. (B) A graphical representation
   of all databases utilized in the construction of the KG with the node
   type label listed inside the representation of each corresponding
   database. (C) Abundance of associations with corresponding edge and
   node counts that exist in the KG.

   Given a Heterogeneous Information Network
   [MATH: <mi>G</mi><mo>=</mo><mfenced separators="" open="("
   close=")"><mrow><mi>V</mi><mo>,</mo><mi>E</mi></mrow></mfenced> :MATH]
   (as defined above), the network’s schema is a directed graph,
   [MATH: <mi>S</mi><mo>=</mo><mfenced separators="" open="("
   close=")"><mrow><mi>A</mi><mo>,</mo><mi>R</mi></mrow></mfenced> :MATH]
   , based on G’s node type mapping ϕ:V → A and its edge (relation) type
   mapping ψ:E → R.S is a directed graph defined over node types A, with
   edges as relations from R. Similarly, the schema of a KG represented in
   RDF is represented in RDFS (RDF Schema ([113]W3C, 2014)). For example,
   given the CDK13 examples above, the schema would contain the edges
   Protein —participatesIn—Pathway,
   Protein–hasBiologicalProcess—BiologicalProcess, and
   Protein—isLocatedIn—CellularLocation. The schema, sometimes referred to
   as a meta-knowledge graph, or meta-graph, specifies constraints on
   using the edge labels to certain types of starting and ending nodes
   (subjects and objects in RDF). Also, given a KG schema, we can create a
   KG conforming to the schema containing many individuals (nodes).

Data sources and curation of datasets for knowledge graph generation

   To construct our human kinase-centric KG, we gathered data from
   multiple publicly available curated resources ([114]Fig. 1B). Nodes and
   edges within the graph were further integrated under more general node
   and relationship labels (also referred to as node types and edge types
   respectively) such as Disease ([115]Figs. 1B and [116]2B). We further
   filtered the information included within our KG to avoid redundancy
   (described below). [117]Fig. 1C shows basic statistics of different
   edges and their source and target nodes. The final KG contained
   1,064,097 nodes of 11 types and 6, 279, 374 associations of 13 types.

Figure 2. RegPattern2Vec Overview.

   [118]Figure 2
   [119]Open in a new tab

   (A) The hypothetical sequence of nodes produced by random walks guided
   by the regular pattern, shown in (B) (see legend for panel B for more
   details) and with the limitation of collecting five nodes
   (hyperparameter Length of Walk). This can result in several walks shown
   in (A). Walk1 shows a full walk that ends on a pathway node, Walk2
   shows a full walk that ends on a node other than pathway, Walk3 shows a
   walk that ends early, and finally Walk4 shows a reverse walk. (B) The
   regular pattern designed for random walk sampling (schema). This
   pattern is defined as Protein [^∧Pathway ] + Protein Pathway. Each
   random walk instance is constrained based on this pattern. All
   meta-paths must follow the schema in which they start with either a
   DarkKinase, LightKinase or Protein node and must collect a pathway node
   during the random walk. Once a pathway node is collected, if there are
   no other appropriate nodes according to the limitations put into place
   by both the schema and hyperparameters then reverse walks are allowed
   in which the pathway node will be collected again, and the walk resumes
   in a random direction until hyperparameters are satisfied. This feature
   is added to capture local relevant nodes in areas of data sparsity. (C)
   A hypothetical subgraph of the KG. The dashed lines show the node
   sequences highlighted in (A). Nodes are labeled and colored according
   to node types as shown in panel B with numbering corresponding to the
   order in which it is sampled (IE: P3 is the third Protein node
   sampled). This figure serves as a graphical depiction of how various
   data types exist within a neighborhood for sampling. (D) The meta-paths
   produced through our modified random walk approach are turned into
   embeddings for representation learning. Once the node context has been
   vectorized we then use logistic regression to perform link prediction
   as a binary classification task.

   The node types Protein and FunctionalDomain were populated with
   information on human kinase classification, functional domains,
   structure, and Protein–Protein Interaction (PPI). Information on these
   data types was extracted from the following databases: ProKinO
   ([120]McSkimming et al., 2015), Dark Kinase Knowledgebase
   (darkkinome.org) ([121]Berginski et al., 2021), UniProt ([122]UniProt,
   2021), PDBe-KB ([123]Velankar et al., 2010), Pfam v33.1 ([124]Mistry et
   al., 2021), and STRING v11 ([125]Szklarczyk et al., 2019). Of note,
   included within our graph are numerous ortholog (non-human) Protein
   nodes collected from PPI databases. Protein Post-Translational
   Modifications (PTMs) associations were also included within our KG
   under node type PTM and were retrieved from iPTMnet v5 ([126]Ross et
   al., 2017). The Chemical and Disease node types were populated from the
   Comparative Toxicogenomics Database ([127]Davis et al., 2021)
   (protein-chemical, protein-disease, chemical-disease, and chemical-GO
   term associations) as well as cancer mutation data from the Catalogue
   of Somatic Mutations in Cancer (COSMIC) ([128]Tate et al., 2019). The
   Genomics of Drug Sensitivity in Cancer (GDSC) ([129]Yang et al., 2013)
   database was used for drug activity data. The Chemical node type
   additionally includes information on kinase-associated and ligand
   interaction motifs and ligand activity retrieved from Kinase-Ligand
   Interaction Fingerprints and Structures (KLIFS) ([130]Kanev et al.,
   2021) and Pharos ([131]Sheils et al., 2021), respectively. Gene
   ontology terms and pathway information regarding human kinases were
   retrieved from the Gene Ontology v2019_11 ([132]Gene Ontology, 2021)
   (GOTerm node type) and Reactome v76 ([133]Jassal et al., 2020) (Pathway
   node type), respectively.

Further filtering of data sources to reduce redundancy

   The retrieved datasets were further curated prior to KG generation and
   population. All protein-pathway associations obtained from Reactome
   were split into manually curated associations (evidence = TAS) and
   predicted associations (evidence = IEA), only manually curated
   associations were included in our KG. Additionally, due to the
   hierarchical nature of the data taken from Reactome (i.e., Pathways are
   often organized into parent–child relationships), we eliminated high
   level pathways as they were deemed too general for our experiments. For
   example, we filtered out the pathways beneath “Disease” (R-HSA-1643685)
   but not beneath “Infectious disease” (R-HSA-5663205). This was done not
   only to reduce redundancy in the data but also to ensure the vector
   embeddings created from our KG for link prediction were not overfitted
   to highly connected nodes (often referred to as hubs). Similar
   filtering was done with Protein-GO term associations retrieved from
   Gene Ontology-GO terms with more than 5000 associations were removed.
   Additionally, Protein-GO terms were assigned separate node and edge
   types in the KG according to the MolecularFunction, CellularComponent,
   and BiologicalProcess terms.

   Additional filtering to increase confidence in the data within our KG
   was performed using various strategies specific to the data source.
   Only PPIs (STRING) directly involving a kinase and with an experimental
   data score of more than 700 were included in our KG, and only PTMs
   (iPTMnet) with a confidence score of more than 1.0 were included in the
   graph. Additionally, Pfam entries with the protein kinase domains
   (“Pkinase” and “Pkinase_Tyr”) were removed.

Creation of curated kinase knowledge graph for kinase-pathway link prediction

   The KG used in our link prediction experiments described here contains
   several node types such as Protein, GO Terms (BiologicalProcess,
   MolecularFunction, and CellularComponent), Disease, and other types.
   The nodes collected during a hypothetical random walk are shown in
   [134]Fig. 2A. The hypothetical random walk is constrained by a schema
   organization that is shown in [135]Fig. 2B the same schema is applied
   throughout the rest of the paper which indicates [Protein, any node not
   Protein and not Pathway, Protein, Pathway] ([136]Jassal et al., 2020).
   In addition, the Protein type includes the understudied kinases
   (referred to as DarkKinases), well-studied kinases (referred to as
   LightKinases), and other proteins. This distinction is essential, as it
   gives us the ability to make predictions for a subset of proteins. The
   edge types (edges in the schema) are not labeled here, as we do not
   consider them in our method described here. Instead, we only rely on
   the types of source and target nodes. The loop edge returning to the
   Protein type indicates the Protein–Protein Interaction relation, which
   is included in our knowledge graph. Finally, [137]Fig. 2C depicts the
   hypothetical subgraph in which the random walks are performed relating
   back to the nodes displayed in [138]Fig. 2A.

Link prediction workflow

   In the experiments and results presented in this paper, we used
   RegPattern2Vec. It is used as the first step in our link prediction
   process to produce vector representations for the nodes in the graph
   and formulate the link prediction as a classification problem. A ML
   model was trained using the combined vectors of existing pairs of nodes
   connected by an edge (link) of interest. The outline of the method is
   illustrated in [139]Fig. 2D. We discuss each step of our link
   prediction method in the subsequent sections.

Regular pattern selection and its usage in random walks

   Given a Heterogenous Information Network (as defined above), we have
   used regular patterns of the form H[^∧ T]+ H T for link prediction
   tasks, i.e., edges (links) between nodes of type H and T. In the
   pattern, H is a source node type and T is a target node type of the
   link of interest, respectively. Additionally, [^∧T] denotes any node
   type different from T (in regular expressions terminology it is known
   as the complement operator), and [^∧T]+ denotes a repetition (sequence)
   of one or more node types that are different from T. The overall
   pattern requires that any random walk must match the specified sequence
   of node types, in that a walk must begin with a node of type H then it
   must be followed by one or more nodes of type different than T, then a
   node of type H, which must then be immediately followed by a node of
   type T.

   Simply put, the regular pattern is defined as a schema subgraph that
   should include only the relevant node types for a specific problem. An
   example regular pattern subgraph is shown in [140]Fig. 2C. Unlike in
   the metapath2vec method ([141]Dong, Chawla & Swami, 2017), the schema
   graph pattern represents multiple paths (walks) that can be used to
   predict missing links. Here, we aim to predict protein-pathway links
   and the random walks of interest are constrained by a regular pattern
   connecting node types and edge types most relevant to the prediction
   links. Here, when generating protein-pathway predictions, we have
   defined the regular pattern as Protein [^∧Pathway] + Protein Pathway,
   and each random walk instance must match it. Each walk starts from a
   Protein node. Selecting the next nodes on a walk is based on the
   existing graph nodes and matching the neighbor’s type in the regular
   pattern. When a random walk reaches a Pathway node, it follows the
   pattern in reverse order until a certain number of steps (nodes) in the
   walk are reached, or when it reaches a termination node, i.e., when
   there is no neighbor to match the pattern. We designed this specific
   aspect of the model due to the challenge of data sparsity for dark
   kinases. By allowing the reverse pattern to be followed we allow for
   additional relevant neighborhood nodes to be collected- when walking
   backwards once the pathway node is reached, the walk travels in a
   random direction until the hyperparameters are satisfied, capturing
   potentially relevant neighbors for our prediction task.

Biased random walk constrained by a regular pattern

   As this work considered undirected graphs, enumerating a given graph’s
   path is computationally infeasible. The solution is to sample a subset
   of paths from all possible paths from a random distribution. The random
   walk constrained by a regular pattern is selected to generate walks of
   arbitrary length controlled by the “walk length” hyperparameter. Having
   the undirected heterogeneous network and a selected regular pattern,
   the random walk can be started from each instance of a starting node
   type in the pattern. As we want all the nodes to appear in our walks,
   iterating over them would be desirable. It is obvious that if we repeat
   the walk from each node (not all KG nodes may begin a walk, as
   constrained by the regular pattern), we will discover more walks as the
   node might link to multiple nodes of the same type. We call this
   hyperparameter the “number of walks” (referred to as NW in Results
   ‘Benchmarking RegPattern2Vec’s predictive performance with metapath2vec
   and other Graph Embedding Approaches’–Investigating the impact of
   hyperparameters on regpattern2vec’s predictive variability among
   replicate runs’). We will discuss how to choose the hyperparameter and
   the analysis of their impact in the Results ‘Benchmarking
   RegPattern2Vec’s predictive performance with metapath2vec and other
   Graph Embedding Approaches’–Investigating the impact of hyperparameters
   on regpattern2vec’s predictive variability among replicate runs’. The
   next step for each node is to select a node from the adjacent nodes
   based on the established regular pattern. This might result in multiple
   choices, and this is where randomization comes to play. RegPattern2Vec
   can utilize random distribution created by a user-defined function to
   generate the same probability for all the nodes or use an arbitrary
   distribution. To implement the random walk collection process, a
   regular pattern is converted to a Deterministic Finite Automaton (DFA),
   denoted by M. Each possible walk step is mapped to the DFA. The DFA M
   is used to check if transitions are allowed (an edge between two
   nodes); hence, a disallowed change gets a zero probability and is not
   used in the random walks.

   On the other hand, in scale-free networks where the degree distribution
   follows the power law, some nodes may have a high degree of
   incoming/outgoing edges (known as hubs). Because such high degree nodes
   can dominate random walks and, consequently, representation learning,
   one popular way to address this issue is to bias the walks by the
   inverse of degrees of nodes ([142]Grover & Leskovec, 2016), where the
   probability of choosing a node υ^i+1from υ^i is calculated by
   normalizing the inverse of degrees of all neighbors of υ^i. Although
   this approach lowers the probability of selecting high-degree nodes, it
   biases the random walk toward low-degree nodes, thereby capturing
   pertinent information related to low-degree “dark kinase” nodes. We
   used the formula below for node selection:
   [MATH: <mstyle displaystyle="true"><mi>P</mi><mfenced separators=""
   open="("
   close=")"><mrow><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi><mo>+</mo>
   <mn>1</mn></mrow></msup><mo>|</mo><msup><mrow><mi>v</mi></mrow><mrow><m
   i>i</mi></mrow></msup><mo>,</mo><mi>M</mi></mrow></mfenced><mo>=</mo><m
   fenced separators="" open="{" close=""><mrow><mtable><mtr><mtd
   columnalign="left"><mi mathvariant="normal">g</mi><mfenced
   separators="" open="("
   close=")"><mrow><msup><mrow><mi>r</mi></mrow><mrow><mi>i</mi></mrow></m
   sup></mrow></mfenced><mfrac><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><
   mfenced separators="" open="|"
   close="|"><mrow><msub><mrow><mi>N</mi></mrow><mrow><msup><mrow><mi>v</m
   i></mrow><mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msup></mrow></msu
   b></mrow></mfenced></mrow></mfrac></mrow><mrow><munder><mrow><mo
   mathsize="big" movablelimits="false">
   ∑</mo></mrow><mrow><mi>v</mi><mo>∈</mo><msub><mrow><mi>N</mi></mrow><mr
   ow><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></msup></mrow></
   msub></mrow></munder><mfrac><mrow><mn>1</mn></mrow><mrow><mfenced
   separators="" open="|"
   close="|"><mrow><msub><mrow><mi>N</mi></mrow><mrow><mi>v</mi></mrow></m
   sub></mrow></mfenced></mrow></mfrac></mrow></mfrac><mspace
   width="10.00002pt"></mspace><mfenced separators="" open="("
   close=")"><mrow><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></m
   sup><mo>,</mo><msup><mrow><mi>r</mi></mrow><mrow><mi>i</mi></mrow></msu
   p><mo>,</mo><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi><mo>+</mo><mn>
   1</mn></mrow></msup></mrow></mfenced><mtext>is an edge in the graph and
   the</mtext></mtd></mtr><mtr><mtd columnalign="left"><mspace
   width="20.00003pt"></mspace><mtext>transition from v to v is allowed in
   M</mtext><mspace width="10.00002pt"></mspace></mtd></mtr><mtr><mtd
   columnalign="left"><mn>0</mn><mspace
   width="10.00002pt"></mspace><mfenced separators="" open="("
   close=")"><mrow><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></m
   sup><mo>,</mo><msup><mrow><mi>r</mi></mrow><mrow><mi>i</mi></mrow></msu
   p><mo>,</mo><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi><mo>+</mo><mn>
   1</mn></mrow></msup></mrow></mfenced><mtext>is an edge, but the
   transition is undefined</mtext><mspace
   width="10.00002pt"></mspace></mtd></mtr><mtr><mtd
   columnalign="left"><mn>0</mn><mspace
   width="10.00002pt"></mspace><mfenced separators="" open="("
   close=")"><mrow><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></m
   sup><mo>,</mo><msup><mrow><mi>r</mi></mrow><mrow><mi>i</mi></mrow></msu
   p><mo>,</mo><msup><mrow><mi>v</mi></mrow><mrow><mi>i</mi><mo>+</mo><mn>
   1</mn></mrow></msup></mrow></mfenced><mtext>does not exist in the
   graph</mtext><mspace
   width="10.00002pt"></mspace></mtd></mtr></mtable></mrow></mfenced></mst
   yle> :MATH]

   In the formula, υ^i denotes the current node and the candidate node for
   next step is υ^i+1 ∈ N[υ^i].∣N[υ]∣ denotes the degree of node υ,  and
   N[υ^i] is the set of all the neighbors of node
   [MATH:
   <msup><mrow><mi>υ</mi></mrow><mrow><mi>i</mi></mrow></msup><mo>.</mo><m
   i>g</mi><mfenced separators="" open="("
   close=")"><mrow><msup><mrow><mi>r</mi></mrow><mrow><mi>i</mi></mrow></m
   sup></mrow></mfenced> :MATH]
   is the proportion of the edges of type r ^i among all edges of node
   υ^i. Therefore, we randomly choose one edge (relation) type and then
   use the probability distribution by the inverse of node degrees to
   select the next node in the walk, but only among the nodes connected by
   the edge type chosen.

Vectorization of nodes and deep learning of semantic relationships between
proteins and pathways

   RegPattern2Vec uses a modified skip-gram model presented in ([143]Dong,
   Chawla & Swami, 2017) to generate vector representations for the nodes
   of the KG. The random walks generate sequences of nodes, which resemble
   natural language sentences. The ML model simultaneously captures the
   local structure of the graph and the types of the nodes and encodes
   them as vector representations.

Link prediction as a classification problem

   For each pair of protein-pathway associations, we combined their vector
   embeddings utilizing a widely used Hadamard product ([144]Grover &
   Leskovec, 2016; [145]Lin et al., 2015; [146]Minervini et al., 2015).
   The resulting vector is used as features to train a Logistic Regression
   model. Training the model additionally requires generating negative
   examples for protein-pathway association pairs which are difficult as
   resources with true biological negatives, such as Negatome ([147]Blohm
   et al., 2014), do not exist for our Pathway data type. To generate
   negative examples, we must first work under the closed-world
   assumption. This assumes that all information needed is provided, so
   for a statement to be true it must be explicitly stated to be so. Thus,
   a lack of a statement denotes that it is false (this contrasts with the
   open-world assumption in which statements can be true without them
   being explicitly known to be so). For our relation of interest
   (predicted link), we randomly select a head and tail node not connected
   by an edge in the graph. The embeddings of such nodes are then combined
   to produce negative examples. This process is repeated until the number
   of negative examples matches the positive examples.

   Our logistic regression method utilizes the one-versus-rest strategy,
   in which a multi-class classification is split into one binary
   classification problem per class, meaning a classifier will be trained
   for each task, and therefore all the other data points in the split
   data are used as negative examples with the methods discussed above
   (under the closed-world assumption). Thus, the number of negative
   examples is dependent on the number of positive examples for each
   classification task. Although our dataset is large, this strategy is
   employed due to the inherent data imbalance between well-studied and
   dark kinases.

Results

   In this work, we propose a new guided random walk approach for KGs,
   known as RegPattern2Vec, used to predict pathway associations for dark
   kinases. To generate protein-pathway predictions, we first constructed
   a protein kinase KG by integrating curated data from various resources
   populating the graph with nodes on molecular function, disease
   association, protein–protein associations and more (see Materials &
   Methods ‘Data sources and curation of datasets for knowledge graph
   generation’). The resulting KG consisted of 1,064,097 nodes (entities)
   and 6, 279, 373 edges (relationships) ([148]Table S1, [149]Fig. 1C).
   RegPattern2Vec was then used to sample the large KG into sequences of
   nodes. Regular patterns from RegPattern2Vec guided sampling. Using such
   patterns, we generate several walk paths (or a list of sequential nodes
   in the network) for representation learning. This allows the sampling
   to capture different sub-structures in the graph (covering different
   meta-paths) without explicitly designing walks for them.

   After sampling, the model learned vector embeddings for each node in
   the collected paths, utilizing a Natural Language Processing (NLP)
   model similar to metapath2vec ([150]Dong, Chawla & Swami, 2017). Link
   predictions utilizing the embeddings were then formulated as binary
   classification tasks and predictions were made for protein-pathway
   associations using a logistic regression model. For more details, refer
   to the Materials & Methods ‘Link prediction as a classification
   problem’.

Capturing semantic relationships between proteins and pathways in deep
learning

   As previously mentioned in more detail, (see Materials & Methods
   ‘Vectorization of nodes and deep learning of semantic relationships
   between proteins and pathways’) RegPattern2Vec uses a modified
   skip-gram model presented in ([151]Dong, Chawla & Swami, 2017) to
   generate vector representations for the nodes of the KG. [152]Figure 3
   shows the learned vector representation of nodes in the vector space
   using principal component analysis (PCA), a standard dimensionality
   reduction technique. For the protein-pathway predictions, we just
   consider the nodes to be of three types: “Protein”, “Pathway”, and
   “Others” when learning representation for the nodes. The separation of
   nodes in PCA shows our vector representation captures the node types
   and their network context. Several key differences exist between
   protein nodes and other nodes in our graph that may explain why they
   cluster separately from all other nodes. Protein nodes (including those
   annotated as dark/light kinases) are the default starting node in the
   schema, and many edge types describe proteins and connect them to other
   nodes in our graph (for example, a node describing a functional domain
   will be connected to the protein node it is describing and we
   additionally illustrate protein–protein interactions in our graph).
   This may allow protein nodes to occupy a unique space within the
   embedding reflected in the PCA.

Figure 3. PCA of vector embeddings.

   [153]Figure 3
   [154]Open in a new tab

   Principle component analysis of the vector embedding produced for all
   nodes generated by ML model. Given that the defined schema of
   RegPattern2Vec is defined as Protein [^∧Pathway ] + Protein Pathway
   when clustering the vector embeddings, we considered all nodes to be of
   type: Protein, anything that is not a Pathway or Protein (Others), and
   Pathways.

Benchmarking RegPattern2Vec’s predictive performance with metapath2vec and
other graph embedding approaches

   To evaluate the accuracy of our method and note any improvements from
   metapath2vec, a test set was first generated by excluding 50% of the
   known protein-pathway associations from the training set. The training
   examples were generated using the process detailed in the Materials &
   Methods (Link prediction workflow–Link prediction as a classification
   problem’). To compare our method to metapath2vec, we produced AUC ROC
   curves for both models. Multiple curves were generated for the
   metapath2vec model as multiple meta-paths were used resulting in
   variability in the model’s performance ([155]Fig. 4). Metapath 1
   (‘Protein’, ‘FunctionalDomain’, ‘Protein’, ‘Pathway’, ‘Protein’) showed
   the best performance, based on 10-fold cross-validation, with an
   f1-score of 0.87 and AUC ROC of 0.94 for protein-pathway prediction. In
   contrast, RegPattern2Vec achieved an f1-score of 0.90 and AUC ROC of
   0.96 on the same training and testing datasets ([156]Fig. 4). We also
   compared RegPattern2Vec to other recently proposed embedding approaches
   for heterogenous graphs, namely BoxE ([157]Abboud et al., 2020), MAGNN
   ([158]Fu et al., 2020), and RHINE ([159]Shi et al., 2020). These models
   were chosen mostly due to their ability to scale with larger knowledge
   graphs. Surprisingly, these models displayed poor performance in the
   task of predicting dark kinase pathway associations ([160]Fig. S1),
   presumably because of the sparsity of knowledge related to dark kinases
   in the KG. RegPattern2Vec overcomes some of the challenges imposed by
   data sparsity by resampling a node as a starting point, allowing for
   constraint backward walking for a greater sampling of neighbor nodes,
   and use of inverse node degree as a consideration to select the next
   node in the walk.

Figure 4. Comparison of AUC ROC curves generated with RegPattern2Vec and
metapath2vec.

   [161]Figure 4
   [162]Open in a new tab

   The comparison of AUC ROC curves of RegPattern2Vec and differing
   schemas utilized with the metapath2vec model. AUC ROC curves were
   generated by excluding 50% of the known associations from the training
   set, link prediction was then carried out using a logistic regression
   algorithm type as the binary classification model. metapath 1 consists
   of the schema [’Protein’, ’FunctionalDomain’, ’Protein’, ’Pathway’,
   ’Protein’]. metapath 2 consists of the schema [’Protein’, ’DarkKinase’,
   ’Protein’, ’Pathway’, ’Protein’]. metapath 3 consists of the schema
   [’LightKinase’, ’Protein’, ’DarkKinase’, ’Protein’, ’Pathway’,
   ’LightKinase’]. metapath 1 was the best performing schema tested,
   metapath 2 was still among our top performing schemas tested, and
   metapath 3 was among the worst performing schemas tested.

Investigating the impact of hyperparameters on regpattern2vec’s predictive
variability among replicate runs

   Although RegPattern2Vec is better suited for the task of learning on
   sparse biological KGs, due to the nature of the random walks, there is
   an intrinsic level of variability with the predictions made by our
   model. Although the exact same sequence of nodes will not be sampled
   each time, there should be a larger context within the embedding space
   that identifies patterns useful for predictions for the dark kinases
   sampled. Therefore, overlaps in the Pathway-Protein predictions made by
   our model using different hyperparameters are expected. To demonstrate
   our model’s robustness, we measured the predictive consistency while
   modifying hyperparameters across three replicates (runs that all have
   the same hyperparameters) and compared these results. The percentage of
   protein-pathway predictions that overlap (within all replicates) was
   used as a metric for our model’s robustness. This analysis also
   revealed the variability in the overlap of the protein-pathway
   predictions on a per-kinase basis (for both light and dark kinases).
   This variability may be due to imbalances in the data available for
   each kinase. Notably, greater variability was seen amongst dark
   kinases, with the highest percentage overlap being 87.5% and the lowest
   percentage overlap being 0% (for the hyperparameter 40 NW with 95%
   confidence cutoff). A supplementary table ([163]Table S2) provides
   further information indicating the overlap seen for each kinase.
   Additionally, we provided a supplementary figure ([164]Fig. S2),
   displaying the confidence (0.6 to 1.0) in overlap for all
   hyperparameters tested.

   One source of variability within our model is directly controlled by
   the hyperparameter “a number of walks (NW)”, which specifies the number
   of times a node should be sampled as the starting node during the
   random walk process (refer to Materials & Methods ‘Vectorization of
   nodes and deep learning of semantic relationships between proteins and
   pathways’). Repeating the walk from a specified node allows for a more
   complete characterization of neighbor nodes, often allowing for more
   node/edge types to be captured and considered in downstream analysis
   which is especially important when working with sparse data.

   To find the optimal NW, the same starting node was used for generating
   paths in the random walk anywhere from 10 to 80 times. This process was
   repeated, resulting in a total of three replicates for each variation
   of the NW hyperparameter. Overlap between the protein-pathway
   predictions obtained through replicate hyperparameter conditions was
   then compared. From this analysis, we sought to identify the smallest
   value for the hyperparameter NW that can accurately capture local
   structure and node context within the graph to make consistent
   predictions (as determined by replicate overlap). This analysis
   revealed that changing the hyperparameter NW did not drastically change
   the number of overlapping protein-pathway predictions between replicate
   datasets ([165]Fig. S3). Of note, the value of 40 for the
   hyperparameter NW (abbreviated 40 NW) generated the most consistent
   predictions among the NW parameters tested, with a mean of ∼22%
   overlapping pathways across three 0.7 cutoff replicates when averaging
   overlap over all kinases even when utilizing a range of confidence
   cutoffs from 0.7−0.95 ([166]Fig. S3). As a result of this benchmark, we
   determined that 40 NW is the optimal hyperparameter for predicting
   protein-pathway relationships in the current KG, as it performs
   slightly better (as measured by replicate overlap) than the more
   computationally intensive hyperparameter values tested.

Hyperparameter optimization results in consistent replicate pathway
predictions for 34 dark kinases across different replicate runs

   After finding the optimal value for the hyperparameter NW ([167]Fig.
   S3), we decided to investigate the overlap of all protein-pathway
   predictions produced by our previous analysis in which we test the
   hyperparameter NW (from 10–80 with a total of three replicate runs for
   each condition). The overlapping predictions present amongst highly
   variable walks suggest that these predictions are made based on nodes
   more often sampled within our network. After filtering for predictions
   that have over 0.95 confidence, only 95 dark kinase protein-pathway
   predictions emerged that were present in all replicates of all
   variations of NW tested (a total of 24 datasets) ([168]Fig. 5A),
   resulting in representative protein-pathway predictions for 34 of the
   unique dark kinases in our dataset ([169]Table S3). The top ten kinases
   (ranked according to the number of predictions that overlap) are shown
   in [170]Fig. 5B. Of note, many of the protein-pathway predictions
   generated by our model were highly consistent (either in pathways with
   similar function or protein family involvement) on a per kinases basis
   ([171]Table S3). For example, for the dark kinase VRK serine/threonine
   kinase 3 (VRK3), seven out of ten predictions were related to Toll Like
   Receptor (TLR) cascades, and five out of five protein-pathway
   predictions for the dark kinase protein kinase, membrane-associated
   tyrosine/threonine 1 (PKMYT1) were related to cell cycle control
   ([172]Table S3). We further focused on a subset of dark kinases for
   path analysis described below.

Figure 5. Investigating the overlap among all replicates and NW
hyperparameter values.

   [173]Figure 5
   [174]Open in a new tab

   (A) Total number of overlapping DarkKinase-Pathway predictions for all
   variations of the NW hyperparameter tested (10–80) with replicates
   (each hyperparameter tested three times) for a total of 24 datasets
   compared. Replicates were averaged and then graphed. Finally, overlap
   between all 24 datasets was compared in the “All Overlap” bar (B) Graph
   depicting the number of overlapping pathways for the top ten (out of
   34) kinases ranked by number (abundance) of overlapping predictions (C)
   We highlight a portion of the subgraph for the nodes collected through
   our modified random walk process for the protein-pathway prediction
   generated through link prediction between PRKACB and DAG/IP3 signaling.
   The nodes shown are common between five paths in the KG, obtained
   through the variation in NW datasets described above. PRKACB was chosen
   for this analysis based on panel B. The key indicates the color-coded
   node types, and the magnified panel highlights the highly connected
   hubs and their edge types.

Path analysis provides context for prediction associations between
understudied PRKACB and DAG/IP3 signaling

   The dark protein kinase cAMP-activated catalytic subunit beta (PRKACB)
   was amongst the kinases identified with the highest overlap of
   protein-pathway predictions ([175]Fig. 5B). PRKACB, the beta-catalytic
   subunit of Protein Kinase A (PKA), appears to be involved in calcium
   regulation and modification, a known role for PKA ([176]Barodia,
   Sophronea & Luthra, 2022; [177]Kania et al., 2017; [178]Ould Amer &
   Hebert-Chatelain, 2018; [179]Palencia-Campos et al., 2020). Eight of
   the ten pathways identified were directly linked to calcium regulation
   ([180]Table S3). To contextualize the predictions made by our model for
   PRKACB, we created a subgraph displaying common nodes traversed during
   the random walk process for the node representing the Reactome
   protein-pathway prediction R-HSA-1489509 (DAG/IP3Signaling). A portion
   of the subgraph highlighting the common nodes sampled between different
   paths are shown in [181]Fig. 5C. The nodes found in all sampled paths
   are more likely to represent nodes that are important for providing the
   context that results in the shared protein-pathway prediction. The
   entirety of the subgraph ([182]Data S1) displays all nodes with the
   addition of unconnected nodes that only appear in one path generated
   during the random walk process.

   The highly sampled nodes of the subgraph shown in [183]Fig. 5C, reveal
   that PRKACG (protein kinase cAMP-activated catalytic subunit gamma),
   PRKAR2A (protein kinase cAMP-dependent type II regulatory subunit
   alpha) ([184]Manchev et al., 2014; [185]Wei et al., 2021) and
   (calcium/calmodulin dependent protein kinase kinase 2) CAMKK2
   ([186]Najar et al., 2021) are nodes often sampled for the
   DAG/IP3Signaling prediction. The nodes being sampled multiple times
   demonstrate our model’s ability to leverage PPIs to infer pathway
   involvement. Additional nodes of interest that are often sampled for
   this prediction include PPIs between PRKACB (protein of interest) and
   both A-kinase anchor protein 28 (AKA28), and cGMP-specific 3′,
   5′-cyclic phosphodiesterase (PDE5A). Both proteins have biologically
   relevant functions supporting the protein-pathway prediction for
   PRKACB’s involvement with DAG/IP3 signaling. AKA28 has been shown to
   bind to type II regulatory subunits of PKA (related to our dark kinase)
   and anchors/targets PKA to discrete locations within the cell
   ([187]Kennedy & Scott, 2015; [188]Kultgen et al., 2002; [189]Omar &
   Scott, 2020; [190]Sarma et al., 2010). PDE5A is a phosphodiesterase
   that regulates intracellular levels of cAMP and AMP ([191]Peng et al.,
   2020). These observations support the predicted link/association
   between understudied PRKCB kinase and DAG/IP3 signaling pathway.

Predicted association of understudied CDK19 in TGF-beta receptor signaling:
interpretability with path analysis and validation with PPI datasets

   To further explore the biological relevance of our predictions for dark
   kinases, we compared our protein-pathway predictions to Reactome
   datasets generated using Protein Interaction and Proximity (PPI) data
   for dark kinases not included in our training data. PPI data was
   obtained from the Dark Kinase Knowledgebase ([192]Berginski et al.,
   2021). Of note, many kinases in this dataset did not have sufficient
   protein-interactors to perform Reactome pathway enrichment analysis.
   For the 50 dark kinases with an adequate number of identified
   interactors, the overlap between the protein-pathway predictions
   generated by our link prediction and the Reactome enrichment generated
   by PPI data were compared. [193]Figure 6A shows the top ten dark
   kinases ranked according to protein-pathway prediction overlap
   abundance and [194]Table S4 provides the pathway information for all
   dark kinases with overlap seen. Among the dark kinases with the highest
   overlap from this comparison was cyclin-dependent kinase 19 (CDK19), a
   dark kinase belonging to the cyclin-dependent kinase superfamily
   ([195]Malumbres, 2014).

Figure 6. Investigating protein-pathway predictions and dark kinase
knowledgebase experimentally validated predictions.

   [196]Figure 6
   [197]Open in a new tab

   Protein Interaction and Proximity (PPI) data for dark kinases obtained
   through the Dark Kinase Knowledgebase were used as input for Reactome
   enrichment. These predictions were then compared to those produced by
   our model using data produced during the hyperparameters exercise (24
   datasets). (A) Graph depicting the number of overlapping pathways
   between the two datasets for the top ten kinases ranked by number
   (abundance) of overlapping predictions. (B) We highlight a portion of
   the subgraph for the nodes collected through our modified random walk
   process for the protein-pathway prediction generated through link
   prediction between CDK19 and TGF-beta. The nodes shown are common
   between five paths in the KG, obtained through the 40 NW dataset. CDK19
   was chosen for this analysis based on (A). The key indicates the
   color-coded node types, and the magnified panel highlights the highly
   connected hubs and their edge types.

   After performing similar subgraph analysis to that previously described
   for PRKACB (Results,, ‘Hyperparameter optimization results in
   consistent replicate pathway predictions for 34 dark kinases across
   different replicate runs’), we were able to isolate nodes commonly
   traversed for the protein node of interest (CDK19) and the pathway node
   of interest TGF-betaReceptorSignaling (R-HSA-170834) ([198]Fig. 6B).
   Again, only a subset of the paths is shown for ease of viewing. The
   entirety of the subgraph ([199]Data S2) displays all nodes with the
   addition of unconnected nodes that only appear in one path generated
   during the random walk process. For the dark protein kinase CDK19 with
   the link prediction of TGF-beta receptor signaling, two “key” node hubs
   were identified including CCNC (cyclin-c) and CDK8 (cyclin dependent
   kinase 8). Recent total mRNA sequencing has demonstrated that loss of
   CCNC could activate the transforming growth factor (TGF)-beta signaling
   pathway ([200]Tang et al., 2021). CDK8 has also been implicated in
   TGF-beta signaling through previous studies showing they drive Smad
   transcriptional action and turnover in TGF-beta pathways ([201]Alarcón
   et al., 2009). Through interactions with several members of the
   mediator of RNA polymerase II transcription (MED) protein family, CDK19
   can also indirectly exploit the relationship between CDK8 and TGF-beta
   receptor signaling. It appears that CDK19, and CDK8 have been shown to
   directly interact with not just each other but MED12, as well as CCNC,
   forming a complex collectively termed the “Mediator Complex” ([202]Fant
   & Taatjes, 2019). This complex interacts with DNA-bound transcription
   factors and RNA polymerase II (Pol II) to activate and repress gene
   expression, with mediator subunit MED12 promoting TGF-beta signaling
   through both canonical regulation of transcription and non-genomic
   activity ([203]Weber & Garabedian, 2018). Taken together this suggests
   that while direct connections contribute to the predictive power of the
   model the surrounding node context can also reinforce these connections
   and can contribute to predictions made.

Predicted association between understudied PSKH2 and cilium assembly

   Protein-pathway predictions were additionally produced by our model for
   an understudied protein kinase PSKH2 ([204]Byrne et al., 2022) which is
   a paralog of the Golgi-associated protein serine kinase H1 (PSKH1)
   ([205]Brede et al., 2000). Currently, the Dark Kinase Knowledgebase
   ([206]Berginski et al., 2021) lists only ten known protein interactors
   for PSKH2 and no pathway annotation for this kinase exists in the
   literature. We utilized this and another recently published dataset
   ([207]Byrne et al., 2022) not included in training, for the Reactome
   enrichment analysis and independent validation of RegPattern2Vec
   predictions on PSKH2.

   Comparing the overlap in protein-pathway predictions (40 NW dataset)
   with the Reactome enrichment, 18 pathways were shown to overlap
   ([208]Table S5). The previously characterized subgraph visualization
   (Results, ‘Hyperparameter optimization results in consistent replicate
   pathway predictions for 34 dark kinases across different replicate
   runs’) was then performed, with five random paths chosen for
   visualization of the nodes commonly traversed for the protein node of
   interest (PSKH2) and the prediction node of interest (CiliumAssembly
   R-HSA- 5617833) ([209]Fig. 7 and [210]Data S3). Through this analysis
   one major “key” node hub was identified, UNC119B. unc-119 lipid binding
   chaperone B (UNC119B) is a protein that plays a key role in the
   localization of ciliary proteins by binding to N-myristoylated proteins
   and masking the hydrophobic lipid from hydrophilic cytosol, therefore
   facilitating trafficking to the primary cilium or the immunological
   synapse ([211]Stroukov et al., 2019; [212]Yelland et al., 2021). Many
   of the additional commonly traversed nodes belong to larger families of
   proteins known to play important roles in cilium assembly such as
   tubulins and dyneins. Thus, further characterization of these proteins
   can shed light on the role of PSKH2 in cilium assembly.

Figure 7. A subgraph of the network context of the KG surrounding PSKH2.

   [213]Figure 7
   [214]Open in a new tab

   As demonstrated in the previous [215]Fig. 5B our model can generate
   consistent predictions for dark kinase PSKH2. We compared the
   predictions produced by our model (40 NW) and Reactome enrichment using
   unpublished mass-spectrometry data produced by our collaborators. We
   highlight a portion of the subgraph for the nodes collected through our
   modified random walk process for the protein-pathway prediction
   generated through link prediction between PSKH2 and CiliumAssembly. The
   nodes shown are common between five paths in the KG, obtained through
   the 40 NW dataset. The key indicates the color-coded node types, and
   the magnified panel highlights the highly connected hubs and their edge
   types.

   Overall, through our analysis of the paths utilized by the model as
   context for the link prediction process, we demonstrated that our
   modified random walk approach, RegPattern2Vec, can successfully sample
   multiple types of nodes and edges. These findings support our notion
   that our model preserves both local and latent structure through
   appropriate neighbor sampling and therefore provides the node context
   needed to make predictions. The predictions made by our model regarding
   dark kinases allow us to explore dark kinases of interest in a pathway
   context, giving us an idea of how these kinases may function and relate
   to other proteins.

Discussion

   In this work, we present a network approach to predict pathway
   associations for understudied dark kinases by leveraging the network
   context of well-studied kinases. We predict pathway associations for
   dark kinases and propose a new graph embedding approach,
   RegPattern2Vec, which has significant advantages over the previously
   proposed metapath2vec approach for KG embedding. First, unlike
   metapath2vec, RegPattern2Vec does not require pre-defined knowledge of
   KG schema to define the paths. Second, because RegPattern2Vec does not
   use fixed length meta-paths, it captures the structure of the graph and
   the characteristics of both local and latent representations more
   effectively, as indicated by improved performance measures ([216]Fig.
   4). Third, our hyperparameter optimization strategies, combined with
   analysis of meta-paths, enable biological interpretation of predicted
   links.

   While the KG embedding and ML approaches described here have distinct
   advantages over metapath2vec for link prediction, our model can be
   further improved through the generation of curated training and testing
   datasets and additional data that capture pathway context. Currently,
   to generate negative datasets, any missing link between nodes is used
   as an indication that a relationship does not exist. In short, the
   negative sample is simply non-connected nodes. In certain knowledge
   domains turned into KGs this is appropriate, however just because a
   link does not exist within our graph does not mean it does not occur in
   a cellular context (i.e., two proteins interacting or a protein having
   a certain association). To address this limitation, we are currently
   working on a KG that includes subcellular localization information of
   proteins, and tissue/cell specificity to create a more informed
   negative set that better reflects the natural boundaries placed by
   biology rather than the boundaries of our current knowledge. Additional
   future directions include fully incorporating edge information into the
   embeddings used for our model’s predictions. The use of edge labels
   could allow for the use of additional Natural Language Processing (NLP)
   based models such as Transformers ([217]Bi et al., 2022;
   [218]Koncel-Kedziorski et al., 2019; [219]Xie et al., 2022; [220]Yao,
   Mao & Luo, 2019).

   Overall, we have demonstrated that our model can make valid biological
   predictions even with sparse input through the use of node context
   extracted from the KG. While this current work was focused on placing
   dark kinases in a pathway context, within the KG are nodes and edge
   types relevant for PPI and substrate predictions as well ([221]Fig. 1C,
   [222]Table S1). Therefore, our KG can be further utilized for other
   predictive tasks and the functionality of the KG can be further
   expanded by incorporating other forms of data. For example, including
   information such as cell-type specific expression of kinases could
   potentially improve the overall model performance while enabling
   predictions for cell-type specific dark kinase functions and
   interactions. However, the addition of new data types can significantly
   alter the topology of the KG. As data is added, new node/edge types and
   schemas will need to be incorporated into the graph and hyperparameters
   will again need to be optimized. The protocol described in Results
   (‘Benchmarking RegPattern2Vec’s predictive performance with
   metapath2vec and other graph embedding approaches’) for testing
   hyperparameters serves as a framework for continued usage of the KG,
   and exploration of variability within the random walk process. The code
   for RegPattern2Vec has been published on GitHub and these methodologies
   can be applied to additional KGs for other understudied proteins and
   protein families.

Conclusions

   We sought to characterize understudied dark kinases utilizing a
   network-based approach that combines multi-domain knowledge on both
   well-studied kinases and dark kinases within a single heterogeneous KG.
   Our method focuses on improvements within the random walk component of
   the graph embedding process. To accomplish this, we utilized
   RegPattern2Vec, allowing for the selection of the relevant part(s) of
   the KG using labeled semantics (regular patterns) to not only learn the
   local similarity between nodes and edges but also more complex and
   non-trivial associations which lead to novel protein-pathway
   predictions for dark kinases. Our method performs better in terms of
   scalability for large graphs compared to other network-based methods by
   sampling the graph more accurately while preserving node/edge context
   and structure.

Supplemental Information

   Supplemental Information 1. Comparing AUC ROC Curves of RegPattern2Vec
   and differing models for link prediction utilizing embeddings.

   AUC ROC curves were generated by excluding 50% of the known
   associations from the training set, link prediction was then carried
   out using a logistic regression algorithm type as the binary
   classification model. Both RegPattern2Vec and metapath2vec (with the
   best performing schema) perform well despite the imbalanced nature of
   the data within the graph which causes both data sparsity and scarcity.
   BoxE (fully expressive proximity-preserving method), RHINE (method that
   uses two major meta-labels for preserving graph structure), and MAGNN
   (a GNN based method that uses one-skip and two-skip metapaths- a global
   sampling technique and node aggregation) appear to have greater
   difficulty learning on our knowledge graph. All models were used with
   default settings, the only parameters changed were embedding size
   (MAGNN, RHINE. BoxE, metpath2vec) to allow for use of the embedding
   with our link prediction method. The configuration files which show the
   parameters each model was run with are additionally included in our
   github ( [223]https://github.com/gravelCompBio/RegPattern2Vec).
   [224]Click here for additional data file.^ (161.7KB, png)
   DOI: 10.7717/peerj.15815/supp-1
   Supplemental Information 2. Confidence Values for Replicates with
   Varying Hyperparameters.

   Confidence ranges from 0.60 to 1.00 for all hyperparemeters tested (NW
   10-80 with each hyperparameter tested three times) are displayed. The
   y-axis depicts the number of overlapping predictions averaged between
   replicates while the x-axis depicts and each hyperparameter number of
   walk is displayed as a separate graph.
   [225]Click here for additional data file.^ (239.3KB, png)
   DOI: 10.7717/peerj.15815/supp-2
   Supplemental Information 3. Measuring Variability Among Hyperparameter
   Replicates.

   Comparison of the overlap between protein-pathway predictions produced
   by the model when changing the hyperparameter for Number of Walks
   (NWs). The NW hyperparameter, indicating how many times the starting
   node should be sampled, was tested from 10 to 80 with each variation of
   the changed NW hyperparameter for the random walk process replicated
   three times. (A–B) Box plot of the average percentage overlap for
   protein-pathway predictions between the three replicates produced for
   all variations of the hyperparameter used (10–80). Overlap of the
   predictions produced by the model was compared on a per dark kinase
   basis and then averaged overall for all dark kinases (A) filtered for
   0.7 confidence or (B) filtered for 0.9 confidence.
   [226]Click here for additional data file.^ (215.2KB, png)
   DOI: 10.7717/peerj.15815/supp-3
   Supplemental Information 4. Node Statistics.

   Table depicting all node type abundance and average degree (number of
   edges) for node types within our KG.
   [227]Click here for additional data file.^ (9.1KB, xlsx)
   DOI: 10.7717/peerj.15815/supp-4
   Supplemental Information 5. Protein-Pathway Predictions Overlap for all
   Dark Kinases.

   Table depicting the average overlap between three replicates for all
   hyper-parameters tested (NW from 10–80). Columns include “Number of
   predictions that overlap across 3 replicates” and “Total number of
   unique pathway predictions”. The “Percent of pathway predictions that
   overlap” column is made by dividing the values of the other columns.
   [228]Click here for additional data file.^ (22KB, xlsx)
   DOI: 10.7717/peerj.15815/supp-5
   Supplemental Information 6. Protein-Pathway Predictions that Overlap
   with All NW Replicates.

   This file can be used to generate the subgraph highlighted in [229]Fig.
   6 through use with Cytoscape program. Columns are labeled according to
   their input in Cytoscape. The file includes more nodes that were
   displayed in the figure. More information on how this data was
   generated can be found in manuscript (Results ‘Predicted association of
   understudied CDK19 in TGF-beta receptor signaling: interpretability
   with path analysis and validation with PPI datasets)
   [230]Click here for additional data file.^ (12.4KB, xlsx)
   DOI: 10.7717/peerj.15815/supp-6
   Supplemental Information 7. Protein-pathway predictions that overlap
   with protein–protein interactions provided by the dark kinase
   knowledgebase.

   Table depicting all protein-pathway predictions produced through our
   model that overlapped with the Reactome enrichment results produced
   through protein–protein interactions from the Dark Kinase
   Knowledgebase. Pathway associations for dark kinases.
   [231]Click here for additional data file.^ (19.3KB, xlsx)
   DOI: 10.7717/peerj.15815/supp-7
   Supplemental Information 8. Protein-pathway predictions of PSKH2 that
   overlap with protein–protein interactions generated using mass
   spectrometry.

   Table depicting all protein pathway predictions that overlapped with
   the protein–protein interactions from experimentally validated
   mass-spectrometry data not used in training. Pathway associations that
   were connected to dark kinases from RegPattern2Vec’s predictions were
   linked to their associated pathway according to Reactome.
   [232]Click here for additional data file.^ (5.5KB, xlsx)
   DOI: 10.7717/peerj.15815/supp-8
   Supplemental Information 9. Subgraph of PRKACB generated from random
   walks sampled that include PRKACB.

   This file can be used to generate the subgraph highlighted in [233]Fig.
   5 through use with Cytoscape program. Columns are labeled according to
   their input in Cytoscape. The file includes more nodes that were
   displayed in the figure. More information on how this data was
   generated can be found in manuscript (Results ‘Path analysis provides
   context for prediction associations between understudied PRKACB and
   DAG/IP3 signaling’).
   [234]Click here for additional data file.^ (6.6KB, csv)
   DOI: 10.7717/peerj.15815/supp-9
   Supplemental Information 10. Subgraph of CDK19 Generated from Random
   Walks Sampled that Include CDK19.

   This file can be used to generate the subgraph highlighted in [235]Fig.
   6 through use with Cytoscape program. Columns are labeled according to
   their input in Cytoscape. The file includes more nodes that were
   displayed in the figure. More information on how this data was
   generated can be found in manuscript (Results ‘Predicted association of
   understudied CDK19 in TGF-beta receptor signaling: interpretability
   with path analysis and validation with PPI datasets’).
   [236]Click here for additional data file.^ (5.9KB, csv)
   DOI: 10.7717/peerj.15815/supp-10
   Supplemental Information 11. Subgraph of PSKH2 generated from random
   walks sampled that include PSKH2.

   This file can be used to generate the subgraph highlighted in [237]Fig.
   7 through use Cytoscape program. Columns are labeled according to their
   input in Cytoscape. The file includes more nodes that were displayed in
   the figure. More information on how this data was generated can be
   found in the manuscript (Results ‘Predicted association between
   understudied PSKH2 and cilium assembly’).
   [238]Click here for additional data file.^ (4.9KB, csv)
   DOI: 10.7717/peerj.15815/supp-11

Acknowledgments