Abstract

   The urinary proteome is a promising pool of biomarkers of kidney
   disease. However, the protein changes observed in urine only partially
   reflect the deregulated mechanisms within kidney tissue. In order to
   improve on the mechanistic insight based on the urinary protein
   changes, we developed a new prioritization strategy called PRYNT
   (PRioritization bY protein NeTwork) that employs a combination of two
   closeness-based algorithms, shortest-path and random walk, and a
   contextualized protein–protein interaction (PPI) network, mainly based
   on clique consolidation of the STRING network. To assess the performance of
   our approach, we evaluated both precision and specificity of PRYNT in
   prioritizing kidney disease candidates. Using four urinary proteome
   datasets, PRYNT prioritization performed better than other
   prioritization methods and tools available in the literature. Moreover,
   PRYNT performed to a similar, but complementary, extent compared to the
   upstream regulator analysis from the commercial Ingenuity Pathway
   Analysis software. In conclusion, PRYNT appears to be a valuable freely
   accessible tool to predict key proteins indirectly from urinary
   proteome data. In the future, the PRYNT approach could be applied to
   other biofluids, molecular traits and diseases. The source code is
   freely available on GitHub at [34]https://github.com/Boizard/PRYNT and
   has been integrated into an interactive web app to improve
   accessibility
   ([35]https://github.com/Boizard/PRYNT/tree/master/AppPRYNT).

   Subject terms: Computational biology and bioinformatics, Systems
   biology, Biomarkers, Molecular medicine, Kidney diseases

Introduction

   Kidney diseases can be defined as any chronic or acute disorder that
   affects renal structure and function^[36]1. In their most severe form,
   they are associated with a variety of complications, such as anemia,
   mineral and bone disorder or cardiovascular disease, leading to overall
   increased mortality^[37]2. Causes of renal failure are highly variable
   and sometimes unknown^[38]3. Some kidney diseases are monogenic,
   resulting from modifications in a single gene. Others are more complex
   and can result from a multifactorial combination of genetic,
   environmental and additional modifiers such as age, diabetes, smoking
   or hypertension. The use of high-resolution analytical omics
   technologies has resulted in major advances in the elucidation of
   diverse molecular pathophysiological mechanisms associated with kidney
   disease. While genomics is frequently used to unravel specific
   mutations in the genome that can increase the risk of developing
   certain diseases, disease activity is best captured by transcriptome or
   proteome analysis, as these traits are closer to the phenotype^[39]4.
   Moreover, whilst urine has long been known as a highly informative
   and non-invasive source of potential candidates in the context of
   kidney disease^[40]5–[41]9, the molecular changes observed in urine
   only partially reflect the deregulated mechanisms within kidney
   tissue. Urinary proteins predominantly originate (~ 70%) from the
   kidney and urinary tract by mechanisms of secretion and cellular
   shedding^[42]10–[43]12. The remaining challenge associated with such
   analyses is that time-consuming validation experiments are required
   to pinpoint the most probable disease candidate from a list of
   hundreds of potential candidates. Most urinary proteome studies have
   considered the proteins showing the most prominent changes, based on
   either fold change or p-value, as new promising disease-related
   candidates. However, not all renal proteins can be
   found in urine and not all urinary proteins originate from the kidney.
   Hence, ranking disease proteins solely based on observed urinary
   changes might give an incomplete view of the disease and limit
   insight into its pathophysiology.

   To help decipher the picture of the deregulated molecular networks and
   prioritize disease candidates, computational methods and tools have
   been proposed^[44]13. Some approaches prioritize candidates based on
   their similarity to the list of disease-modified genes^[45]14. These
   methods use databases (e.g. OMIM), ontologies (e.g. Gene Ontology) or
   text-mining from literature to assess similarity of sequence (e.g.
   POCUS^[46]15), functional annotation (e.g. PANDA^[47]16,
   Endeavour^[48]17, ToppGene^[49]18) or locus proximity (e.g.
   OPEN^[50]19, PhenoRank^[51]20). Other approaches use biological
   networks in order to prioritize candidates (e.g. MaxLink^[52]21,
   ToppNet^[53]18). One of the network-based software suites most
   commonly used by biologists to interpret high-throughput expression
   data is Ingenuity Pathway Analysis (IPA)^[54]22. This suite is based
   on a PPI network containing millions of structured, manually curated
   experimental observations. In IPA, the “Upstream Regulator Analysis”
   (URA) algorithm prioritizes disease candidates using an in-house
   causal network approach to elucidate upstream biological causes that
   can explain the observed molecular changes^[55]23,[56]24. One of the main
   limitations hampering the use of IPA is that the software is
   proprietary and therefore its use cannot be broadly generalized to the
   biology community. Many other computational prioritization methods
   already exist^[57]13. Some look for candidates that directly
   interact with known disease genes, following the principle of
   “guilt-by-association”^[58]14,[59]25. Others, such as
   shortest-path^[60]26 or random walk^[61]27 algorithms, additionally
   consider the closeness between candidates and known disease genes in
   a network, taking into account both direct and indirect
   relationships. Previous studies
   have shown that closeness-based approaches outperformed direct
   neighbour-based methods and that combining closeness-based approaches
   further improved disease candidate prioritization^[62]14,[63]28.
   However, most of these strategies have been used to identify disease
   candidates at the transcriptome level and not at the proteome level.
   Moreover, to date, none have been tested in the context of biological
   fluids.

   In order to move from this status quo, we developed an approach, named
   PRYNT (PRioritization bY protein NeTwork), that could help expand and
   fill the gaps of the molecular view, and predict the significance of
   proteins that were undetectable in the urine. PRYNT is based on the
   integration of the Search Tool for the Retrieval of Interacting
   Genes/Proteins (STRING, version 10.5)^[64]29 PPI network and a
   combination of shortest-path and random walk, two closeness-based
   algorithms, as this combination has previously been shown to
   outperform other computational methods^[65]14,[66]26–[67]28. We used
   PRYNT in the
   context of two prototypic human kidney diseases: autosomal dominant
   polycystic kidney disease (ADPKD)^[68]5,[69]9 and ureteropelvic
   junction obstruction (UPJ)^[70]6,[71]7. ADPKD is a well-characterized
   monogenic kidney disease induced by a mutation of the PKD1 or PKD2
   gene. UPJ is a congenital kidney disease resulting from a complex
   multifactorial combination of genetic and environmental factors. In
   order to assess the performance of our approach, we first evaluated the
   precision of PRYNT in prioritizing ADPKD and UPJ disease candidates and
   compared it with other methods from recent literature. We also
   performed an in-depth comparison of the results obtained with PRYNT to
   two main reference prioritization methods currently used by biologists:
   prioritization based on experimental results and prioritization based
   on IPA’s URA algorithm.

Results

Contextualization of PRYNT PPI network

   In order to test the PRYNT approach, four urinary proteome datasets
   were used: two associated with ADPKD (ADPKD1 and ADPKD2) and two
   associated with UPJ (UPJ1 and UPJ2) (Table [72]1 and Supplementary
   Tables [73]S1–[74]S4). We constructed a PPI network based on the
   STRING database. Approximately 50–60% of the deregulated urinary
   proteins from the ADPKD and UPJ proteomic datasets were present in the
   raw PPI network (Fig. [75]1). This rather low percentage could be
   explained either because part of the deregulated proteins were absent
   from the STRING v10.5 database altogether, or because they did not
   match the selected STRING settings, i.e. sharing a protein.actions
   interaction with other proteins in the network, having a directional
   interaction, and having an interaction reaching the highest
   confidence level (Fig. [76]1). Moreover, 56% (3569
   proteins) of the 6391 proteins present in the network were grouped in
   265 cliques, which are sets of proteins that all interact with each
   other and often share similar biological functions. In order to assess
   the impact of the missing biological input and of the presence of
   clique sub-graphs in the network, we modified the raw PPI network into
   three additional contextualized PPI networks (Fig. [77]2). The first
   contextualization consisted in generating a PPI network where the
   deregulated urinary proteins were added regardless of their confidence
   level (Fig. [78]2, +DP). The second contextualization consisted in
   generating a PPI network where cliques were taken into account
   (Fig. [79]2, +C). The last network combined both contextualization
   strategies (Fig. [80]2, +DP +C). We applied the prioritization strategy
   combining shortest path and random walk to each of the four PPI
   networks for each of the four proteomics datasets (Fig. [81]2). We compared the
   ranked lists to a list of 500 reference disease candidates of ADPKD for
   ADPKD1 and ADPKD2, and of UPJ for UPJ1 and UPJ2. The precision was
   plotted (Fig. [82]3a) and the areas under the precision curves (AUC)
   were compared (Fig. [83]3b). Compared to the raw PPI, the use of the
   contextualized PPI + DP and PPI + C networks slightly increased the AUC
   of the precision. However, in the four datasets, the combined
   PPI + DP + C showed much better performance in terms of prioritizing
   disease candidates. Based on these results, we generated a
   contextualized PRYNT PPI network combining both the addition of the
   deregulated proteins and the management of the cliques (Fig. [84]2).
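
   The two contextualization steps described above can be illustrated
   with a minimal sketch. It is not the PRYNT implementation: the
   function names (build_network, maximal_cliques, group_cliques), the
   input tuple layout and the choice to merge each maximal clique of at
   least three proteins into a single node are assumptions made for
   illustration only, since the text does not specify how cliques are
   consolidated. Edge direction is ignored when detecting cliques, which
   is also an assumption.

```python
def build_network(interactions, deregulated, min_score=0.9):
    """Keep directional edges with score >= min_score, plus (assumed +DP rule)
    any directional edge that touches a deregulated protein.

    interactions: iterable of (source, target, score, is_directional).
    """
    edges = set()
    for src, dst, score, directional in interactions:
        if not directional:
            continue
        if score >= min_score or src in deregulated or dst in deregulated:
            edges.add((src, dst))
    return edges


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration (no pivoting) on an undirected adjacency map."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def group_cliques(edges, min_size=3):
    """Merge each maximal clique (largest first, no protein reused) into one node."""
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    label = {}
    ordered = sorted(maximal_cliques(adj), key=lambda c: (-len(c), sorted(c)))
    for clique in ordered:
        if len(clique) >= min_size and not any(n in label for n in clique):
            name = "clique:" + "+".join(sorted(clique))
            for n in clique:
                label[n] = name
    grouped = set()
    for a, b in edges:
        a2, b2 = label.get(a, a), label.get(b, b)
        if a2 != b2:  # drop edges that became internal to a clique node
            grouped.add((a2, b2))
    return grouped
```

   In this sketch, a low-confidence edge is retained only when one of its
   endpoints is a deregulated protein, and edges internal to a clique
   disappear once the clique is collapsed.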

Table 1.

   Dataset description.
   Dataset  Reference              Type of kidney disease  Controls  Cases  Deregulated proteins
   ADPKD1   Bakun et al.^[85]5     Monogenic               30        30     155
   ADPKD2   Rauniyar et al.^[86]9  Monogenic               18        14     69
   UPJ1     Lacroix et al.^[87]7   Complex                 10        8      174
   UPJ2     Chen et al.^[88]6      Complex                 23        23     175

Figure 1.

   Number of deregulated urinary proteins from proteomic datasets present
   in the raw PPI network. Part of the deregulated proteins (DP) present
   in the proteomics datasets could not be included as they were absent
   from the STRING v10.5 database (Homo sapiens). Moreover, a number of DP
   were excluded as they did not share any interaction with other
   proteins (absent from protein.actions PPI) or did not have a
   directional interaction with highest confidence (≥ 0.9). PPI:
   protein–protein interaction network; DP: deregulated protein.

Figure 2.

   Description of the PRYNT algorithm. The PRYNT PPI network was based on
   STRING 10.5 protein.actions restricted to Homo sapiens
   (9606.protein.actions), and only directional interactions with
   confidence ≥ 0.9 were selected. The raw PPI network (Raw) was further
   contextualized by adding the deregulated proteins (+DP) regardless of
   their confidence level and by grouping the proteins within cliques
   (+C). The PRYNT prioritization approach was based on the combination
   of shortest-path (SP) and random walk (RW) algorithms and was achieved
   by multiplying the rank of the protein with the shortest-path ranking
   strategy (rank_SP) and the rank of the protein with the random walk
   strategy (rank_RW). PPI: protein–protein interaction network; DP:
   deregulated
   proteins; C: clique.
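
   The rank product described in this caption can be sketched as
   follows. The helper names (to_ranks, combine_ranks) and the
   alphabetical tie-breaking rule are hypothetical and chosen for
   illustration; only the multiplication of the two ranks comes from the
   text.

```python
def to_ranks(scores, higher_is_better):
    """Convert a {protein: score} map into {protein: rank}, rank 1 being best.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    sign = -1.0 if higher_is_better else 1.0
    ordered = sorted(scores, key=lambda p: (sign * scores[p], p))
    return {p: i + 1 for i, p in enumerate(ordered)}


def combine_ranks(rank_sp, rank_rw):
    """Order proteins by the product rank_SP * rank_RW (smallest product first)."""
    shared = rank_sp.keys() & rank_rw.keys()
    product = {p: rank_sp[p] * rank_rw[p] for p in shared}
    return sorted(product, key=lambda p: (product[p], p))


# Shortest-path distances: smaller is better.
rank_sp = to_ranks({"P1": 1, "P2": 2, "P3": 3}, higher_is_better=False)
# Random walk probabilities: larger is better.
rank_rw = to_ranks({"P1": 0.2, "P2": 0.5, "P3": 0.3}, higher_is_better=True)
final_order = combine_ranks(rank_sp, rank_rw)
```

   With these toy values the products are 3, 2 and 6 for P1, P2 and P3,
   so P2 is ranked first even though P1 has the shortest distance.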

Figure 3.

   Performance of PRYNT depending on PPI network contextualization. (a)
   The precision was calculated based on the percentage of reference ADPKD
   or UPJ disease candidates that were prioritized in the top 100
   candidates ranked by PRYNT in the four datasets using either the raw
   PPI network (Raw) or the PPI networks contextualized by the addition of
   deregulated urinary proteins regardless of their confidence level
   (+DP), by the management of clique sub-graphs (+C) or by the
   combination of both (+DP +C). (b) The corresponding area under the
   precision curve (AUC) was calculated in the four datasets. Graphs were
   designed using GraphPad Prism version 5.0 for Mac, GraphPad Software,
   San Diego, California USA, [96]http://www.graphpad.com. DP: deregulated
   protein; C: clique.
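
   As an illustration of the evaluation described in this caption, the
   sketch below computes a precision curve over the top-ranked
   candidates and its area. The exact formula is an assumption:
   precision at rank k is taken here as the fraction of the top k
   candidates that belong to the reference list, and the area is
   obtained with the trapezoidal rule using unit spacing between ranks.
   The function names are hypothetical.

```python
def precision_curve(ranked, reference, top_n=100):
    """Precision at each rank k = 1..top_n (assumed definition: hits / k)."""
    hits = 0
    curve = []
    for k, protein in enumerate(ranked[:top_n], start=1):
        if protein in reference:
            hits += 1
        curve.append(hits / k)
    return curve


def area_under_curve(curve):
    """Trapezoidal area under the curve with unit spacing between ranks."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


curve = precision_curve(["A", "B", "C", "D"], {"A", "C"})
auc = area_under_curve(curve)
```

   A ranking that places reference candidates early keeps the curve
   high over the first ranks and therefore yields a larger area.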

Precision of PRYNT compared to other approaches

   We first compared PRYNT performance to shortest-path (SP) and random
   walk (RW) prioritization methods (Fig. [97]4a and Supplementary Figure
   [98]S1). The shortest path between a disease candidate and a
   differentially abundant urinary protein is defined by the distance
   between any protein in the network and the differentially abundant
   proteins, taking into account the direction of interactions. Random
   walk with restarts simulates a random walker starting on the
   differentially abundant urinary proteins and moving randomly to an
   immediate neighbor at each step. Each protein in the graph is
   prioritized by the probability of the random walker reaching it.
   Overall, the PRYNT approach, combining both
   algorithms, showed better performance compared to the two strategies
   taken separately. Next, we compared PRYNT to seven additional
   state-of-the-art prioritization algorithms and tools (Fig. [99]4b and
   Supplementary Figure [100]S2): direct ranking (Direct),
   interconnectedness combined with random walk (ICN + RW),
   Phenolyzer^[101]30, Endeavour^[102]17, MaxLink^[103]21,
   ToppGene^[104]18 and ToppNet^[105]18. Direct ranking and
   interconnectedness combined with random walk were applied to the raw
   STRING PPI network. Direct ranking was performed by applying out-degree
   centrality as described in the study of Oti et al.^[106]25. Disease
   candidates were prioritized based on the number of directly interacting
   differentially abundant urinary proteins. The interconnectedness-based
   approach combined with random walk was implemented following the study
   of Hsu et al.^[107]28. Phenolyzer, Endeavour and ToppGene are
   similarity-based prioritization approaches, extracting knowledge from
   diverse databases such as OMIM, Disease Ontology, or Gene Ontology.
   ToppNet and MaxLink are network-based prioritization approaches, using
   k-step Markov and neighbor-based algorithms, respectively. Overall,
   PRYNT showed better precision compared with these methods (Fig. [108]4b
   and Supplementary Figure [109]S2). Of note, the number of
   candidates predicted by MaxLink was < 100, so we could not assess the
   AUC for the precision in the top 100 predicted candidates in ADPKD2,
   UPJ1 and UPJ2. PRYNT performance was then compared to two reference
   approaches commonly used by biologists (Fig. [110]4c and Supplementary
   Figure [111]S3): URA from IPA (URA), and prioritization based on
   experimental results (Exp). Except for Exp, all approaches tested so
   far mine a network to find and rank new disease candidates that are
   linked to the deregulated proteins, without being in the initial set of
   deregulated proteins. In Exp however, the deregulated proteins are the
   disease candidates and their prioritization is based on a p-value
   ranking, the most significant proteins being the highest ranked
   candidates. In the four datasets, PRYNT showed higher performance in
   prioritizing reference disease candidates compared to URA and Exp, with
   better precision and superior AUC (Fig. [112]4c and Supplementary
   Figure [113]S3). We next analyzed the overlap of reference disease
   candidates ranked in the top 100 by PRYNT, URA and Exp in the four
   datasets (Fig. [114]5). We observed that only a minority of reference
   disease candidates prioritized by PRYNT and URA were commonly
   prioritized by both approaches (59–70% uniquely prioritized by PRYNT
   and 48–64% uniquely prioritized by URA). For Exp, not only was the
   number of prioritized reference disease candidates very low, but it
   also showed very poor overlap with URA and no overlap with PRYNT.
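
   The two closeness-based algorithms described at the start of this
   section can be sketched as follows. This is an illustrative sketch
   under stated assumptions, not the PRYNT code: distances are measured
   to the nearest seed by following edge direction outwards from the
   seeds, the restart probability of 0.3 and the fixed number of
   iterations are arbitrary, and a walker reaching a protein without
   outgoing edges is sent back to the seeds. The function names are
   hypothetical.

```python
from collections import deque


def shortest_path_from_seeds(edges, seeds):
    """Breadth-first search: directed distance from the nearest seed to each protein."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, set()).add(dst)
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in out.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist  # unreachable proteins are absent from the result


def random_walk_with_restart(edges, seeds, restart=0.3, n_iter=100):
    """Iterate p <- restart * p0 + (1 - restart) * (one random step from p)."""
    out = {}
    nodes = set(seeds)
    for src, dst in edges:
        out.setdefault(src, set()).add(dst)
        nodes.update((src, dst))
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(n_iter):
        nxt = {n: restart * start[n] for n in nodes}
        for n, p in prob.items():
            targets = out.get(n)
            if targets:
                share = (1.0 - restart) * p / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling protein (assumption): the walker returns to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - restart) * p / len(seeds)
        prob = nxt
    return prob
```

   Traversing the edges in the opposite direction only requires swapping
   source and target when building the adjacency map; which orientation
   is appropriate depends on how the directed interactions are
   interpreted.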

Figure 4.

   Performance of PRYNT compared to other approaches. PRYNT performance
   was compared to prioritization using shortest-path or random walk
   algorithms alone (a), to prioritization by other common, state of the
   art prioritization strategies (b), or to prioritization by reference
   approaches (c). The precision was calculated based on the percentage of
   reference ADPKD or UPJ disease candidates that were prioritized in the
   top 100 candidates ranked by the different strategies in the four
   datasets. The corresponding area under the precision curve (AUC) was
   then calculated. Graphs were designed using GraphPad Prism version 5.0
   for Mac, GraphPad Software, San Diego, California USA,
   [117]http://www.graphpad.com. SP: shortest-path; RW: random walk; D:
   direct; ICN + RW: interconnectedness combined with random walk; Exp:
   experimental; URA: upstream regulator analysis.

Figure 5.

   Overlap of reference disease candidates prioritized in the top 100 by
   PRYNT, URA or Exp. Prioritization by PRYNT, URA or from the
   experimental urinary proteomic candidates (Exp) was applied and
   reference ADPKD and UPJ disease candidates ranked in the top 100 were
   compared in the four datasets. Exp: experimental; URA: upstream
   regulator analysis.
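
   The overlap comparison in this caption reduces to set operations on
   the reference candidates found in each top 100. The sketch below uses
   hypothetical function names and toy protein identifiers; it only
   illustrates how a "uniquely prioritized" percentage can be derived.

```python
def reference_hits(ranked, reference, top_n=100):
    """Reference disease candidates present among the top_n ranked proteins."""
    return set(ranked[:top_n]) & set(reference)


def unique_fraction(hits_a, hits_b):
    """Fraction of method A's reference hits that method B did not prioritize."""
    if not hits_a:
        return 0.0
    return len(hits_a - hits_b) / len(hits_a)


reference = {"P1", "P2", "P3", "P5"}
hits_a = reference_hits(["P1", "P2", "P3", "P4"], reference, top_n=3)
hits_b = reference_hits(["P3", "P5", "P1", "P6"], reference, top_n=3)
only_a = unique_fraction(hits_a, hits_b)
```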

Specificity of PRYNT compared to reference approaches

   We next assessed how the different prioritization strategies ranked
   candidates that were specific to the disease under study. First, we
   studied cross-specificity by analyzing whether prioritization in ADPKD
   datasets was better for specific ADPKD reference disease candidates
   compared to non-specific UPJ reference disease candidates, and
   conversely for UPJ datasets (Fig. [120]6a). For ADPKD1 and ADPKD2, all
   prioritization strategies showed similar cross-specificity, with the
   AUC for specific ADPKD candidates (AUC[ADPKD]) being superior to the
   AUC for non-specific UPJ candidates (AUC[UPJ]). However, for UPJ1 and
   UPJ2, only PRYNT displayed adequate cross-specificity in both datasets.
   We next compared overall specificity of the approaches by comparing the
   AUC of the specific disease to the AUC of 80 non-specific diseases
   (list in Supplementary Table [121]S5) (Fig. [122]6b). For ADPKD
   datasets, overall specificity was similar for all strategies in ADPKD1
   with the AUC of the specific disease (AUC[ADPKD]) being in the top 15
   out of 80 non-specific diseases. In ADPKD2, PRYNT showed better
   performance compared to URA and Exp (rank of specific AUC[ADPKD] of
   14/81, 34/81 and 21/81 for PRYNT, URA and Exp respectively). For UPJ
   datasets, overall specificity was lower compared to ADPKD datasets and
   in both datasets, PRYNT prioritization showed best specificity, with a
   rank of specific AUC[UPJ] of 21/81 and 27/81 for UPJ1 and UPJ2
   respectively. In UPJ2, Exp showed the lowest specificity with the
   specific AUC[UPJ] being ranked 65/81.

Figure 6.

   Specificity of PRYNT compared to reference approaches. (a)
   Cross-specificity of the prioritization strategies was assessed for the
   four datasets by calculating the difference between the AUC of the
   precision curve for specific disease candidates (AUC[ADPKD] for ADPKD
   datasets and AUC[UPJ] for UPJ datasets) and the AUC of non-specific
   disease candidates (AUC[UPJ] for ADPKD datasets and AUC[ADPKD] for UPJ
   datasets). (b) Overall specificity of the prioritization strategies was
   assessed for the four datasets by assessing the rank of the AUC of the
   precision curve for specific reference disease candidates (AUC[ADPKD]
   for ADPKD datasets and AUC[UPJ] for UPJ datasets) compared to 80
   additional AUCs of reference candidates from non-specific diseases,
   including 40 diseases associated with the urogenital tract and 40
   diseases of other origin. Graphs were designed using GraphPad Prism version
   5.0 for Mac, GraphPad Software, San Diego, California USA,
   [125]http://www.graphpad.com. Exp: experimental, URA: upstream
   regulator analysis.
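
   Both specificity measures defined in this caption are simple
   functions of the AUC values. The following sketch, with hypothetical
   function names and toy AUC values, shows the difference used for
   cross-specificity and the rank used for overall specificity (for
   example, a rank out of 81 when one specific disease is compared with
   80 non-specific diseases).

```python
def cross_specificity(auc_specific, auc_nonspecific):
    """Positive when the specific disease is prioritized better than the other one."""
    return auc_specific - auc_nonspecific


def specificity_rank(auc_by_disease, disease):
    """Rank (1 = highest AUC) of one disease among all diseases evaluated."""
    ordered = sorted(auc_by_disease, key=lambda d: (-auc_by_disease[d], d))
    return ordered.index(disease) + 1, len(ordered)


toy_aucs = {"UPJ": 30.0, "ADPKD": 12.0, "other disease": 40.0}
delta = cross_specificity(toy_aucs["UPJ"], toy_aucs["ADPKD"])
rank, total = specificity_rank(toy_aucs, "UPJ")
```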

Pathway annotation

   We used KEGG pathway enrichment analysis^[126]31 to assess the
   biological relevance of the disease candidates prioritized by PRYNT
   (Fig. [127]7). For ADPKD, the 500 reference disease candidates were
   associated with 166 pathways. Approximately 85% of these pathways were
   also enriched with the top 100 ranked candidates prioritized by PRYNT
   (141/166 and 139/166 for ADPKD1 and ADPKD2 respectively) whereas
   enrichment was 67–72% for URA top 100 (112/166 and 119/166 for ADPKD1
   and ADPKD2 respectively) and dropped to approximately 5–6% for Exp
   (9/166 and 10/166 for ADPKD1 and ADPKD2 respectively). Similarly for
   UPJ, PRYNT results showed a higher number of enriched pathways and
   more overlapping pathways associated with the reference UPJ candidates
   compared to URA and Exp.
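
   The pathway comparison above can be illustrated with a short sketch.
   The statistical test behind the enrichment analysis is not stated
   here, so the one-sided hypergeometric test below is an assumption
   commonly used for over-representation analysis, and the function
   names are hypothetical. The overlap function reproduces the
   percentages quoted in the text from the reported counts.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, K of which are in the pathway."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def pathway_overlap(reference_pathways, method_pathways):
    """Fraction of reference pathways that are also enriched for a given method."""
    reference_pathways = set(reference_pathways)
    return len(reference_pathways & set(method_pathways)) / len(reference_pathways)


# Counts reported for ADPKD1 (PRYNT, URA, Exp) out of 166 reference pathways.
percentages = [round(100 * found / 166) for found in (141, 112, 9)]
```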

Figure 7.

   Pathway annotation. KEGG pathway enrichment analysis was applied to the
   500 reference ADPKD and UPJ disease candidates, and compared to the
   pathways enriched from top 100 ranked candidates by PRYNT or URA or
   from the experimental urinary proteomic candidates (Exp) in the four
   datasets.

Links of proteins with pathology of interest

   Next we assessed the involvement of the top 10 protein candidates in
   the disease of interest by a systematic search of the scientific
   literature (Supplementary Tables [130]S6 and [131]S7). For the top
   ranked ADPKD proteins, all but two were previously linked to ADPKD,
   confirming the potential of PRYNT in ranking disease candidates
   (Supplementary Table [132]S6). The two proteins (F2 and HSPA8) not
   previously linked to ADPKD thus constitute potential candidates for
   future experiments. All but three of the top 10 proteins were
   previously linked to UPJ (Supplementary Table [133]S7).

Discussion

   In this study we developed and assessed the performance of PRYNT, a new
   network-based approach using urinary proteomic profiles to prioritize
   disease candidates in the context of kidney disease. While many tools
   and methods are available to predict disease candidates, we developed
   PRYNT to address the specific requirements of our research question.
   Indeed, most of these methods, such as Phenolyzer, Endeavour, MaxLink,
   ToppGene and ToppNet, have been developed on genomic data to seek new
   disease genes, and we showed that they were less suitable than PRYNT
   for predicting new disease candidates from proteomic data. Combining
   shortest-path and random walk showed better results than using either
   alone, as previously shown by Hsu et al.^[134]28, and also better
   results than using direct ranking. This latter result suggests that
   closeness-based algorithms are more effective at mining the PPI
   network in the context of research on biological fluids, because they
   are able to select key proteins of kidney disease in the network even
   though the links between proteins excreted in the urine and modified
   gene expression at the tissue level (i.e. in the kidney) are not
   necessarily straightforward. Another specific feature of PRYNT is the
   improvement of the PPI network. To build the PRYNT PPI network, we
   chose to work with the STRING
   database, as it is a well-known, recognized comprehensive database of
   PPI based on experimental evidence as well as interactions predicted by
   comparative genomics and text mining. To limit the risk of false
   prediction, we decided to select only PPI with the highest confidence.
   Two major drawbacks that we identified in this network with such
   settings were that much of the input information was missing and that
   the network was massively structured into clique sub-graphs. Instead
   of using the raw PPI network, we hence decided to contextualize the
   network by adding the deregulated proteins from the input data,
   regardless of their confidence, and by grouping the cliques. Cliques
   are important structures in PPI networks^[135]32,[136]33. Taking
   cliques into account by grouping the proteins simplified the network
   and helped to find the most important disease candidates. As a
   result, our specific PRYNT contextualized network showed better
   performance compared to the raw STRING network. We also compared PRYNT
   to IPA’s URA and to ranking based on experimental results, two
   reference approaches commonly used by biologists. We showed that PRYNT