Abstract

The urinary proteome is a promising pool of biomarkers of kidney disease. However, the protein changes observed in urine only partially reflect the deregulated mechanisms within kidney tissue. To improve the mechanistic insight gained from urinary protein changes, we developed a new prioritization strategy called PRYNT (PRioritization bY protein NeTwork) that employs a combination of two closeness-based algorithms, shortest-path and random walk, and a contextualized protein–protein interaction (PPI) network, mainly based on clique consolidation of the STRING network. To assess the performance of our approach, we evaluated both the precision and the specificity of PRYNT in prioritizing kidney disease candidates. Using four urinary proteome datasets, PRYNT prioritization performed better than other prioritization methods and tools available in the literature. Moreover, PRYNT performed to a similar, but complementary, extent compared to the upstream regulator analysis from the commercial Ingenuity Pathway Analysis software. In conclusion, PRYNT appears to be a valuable, freely accessible tool to predict key proteins indirectly from urinary proteome data. In the future, the PRYNT approach could be applied to other biofluids, molecular traits and diseases. The source code is freely available on GitHub at https://github.com/Boizard/PRYNT and has been integrated as an interactive web app to improve accessibility (https://github.com/Boizard/PRYNT/tree/master/AppPRYNT).

Subject terms: Computational biology and bioinformatics, Systems biology, Biomarkers, Molecular medicine, Kidney diseases

Introduction

Kidney diseases can be defined as any chronic or acute disorder that affects renal structure and function^1. In their most severe form, they are associated with a variety of complications, such as anemia, mineral and bone disorder or cardiovascular disease, leading to overall increased mortality^2.
Causes of renal failure are highly variable and sometimes unknown^3. Some kidney diseases are monogenic, resulting from modifications in a single gene. Others are more complex and can result from a multifactorial combination of genetic, environmental and additional modifiers such as age, diabetes, smoking or hypertension. The use of high-resolution analytical omics technologies has resulted in major advances in the elucidation of diverse molecular pathophysiological mechanisms associated with kidney disease. While genomics is frequently used to unravel specific mutations in the genome that can increase the risk of developing certain diseases, disease activity is best captured by transcriptome or proteome analysis, as these traits are closer to the phenotype^4. Moreover, whilst urine has long been known as a very informative and non-invasive source of potential candidates in the context of kidney disease^5–9, the molecular changes observed in urine only partially reflect the deregulated mechanisms within kidney tissue. Urinary proteins predominantly originate (~70%) from the kidney and urinary tract by mechanisms of secretion and cellular shedding^10–12. The remaining challenge associated with such analyses is that these techniques require time-consuming validation experiments to pinpoint precisely the most probable disease candidate from a list of hundreds of potential candidates. Most of these studies considered the urinary proteins showing the most prominent changes, based on either fold change or p-value, as new promising disease-related candidates. However, not all renal proteins can be found in urine and not all urinary proteins originate from the kidney. Hence, ranking disease proteins solely on the basis of observed urinary changes might limit the view of the disease and the insight into its pathophysiology.
To help decipher the picture of the deregulated molecular networks and prioritize disease candidates, computational methods and tools have been proposed^13. Some approaches prioritize candidates based on their similarity to the list of disease-modified genes^14. These methods use databases (e.g. OMIM), ontologies (e.g. Gene Ontology) or text-mining from literature to assess similarity of sequence (e.g. POCUS^15), functional annotation (e.g. PANDA^16, Endeavour^17, ToppGene^18) or locus proximity (e.g. OPEN^19, PhenoRank^20). Other approaches use biological networks in order to prioritize candidates (e.g. MaxLink^21, ToppNet^18). One of the network-based software packages most commonly used by biologists to interpret high-throughput expression data is Ingenuity Pathway Analysis (IPA)^22. This suite is based on a PPI network containing millions of structured, manually curated experimental observations. In IPA, the “Upstream Regulator Analysis” (URA) algorithm prioritizes disease candidates using an in-house causal network approach to elucidate upstream biological causes that can explain the observed molecular changes^23,24. One of the main limitations hampering the use of IPA is that the software is proprietary and therefore its use cannot be broadly generalized to the biology community.

Many other computational prioritization methods already exist^13. Some look for candidates that directly interact with known disease genes, following the principle of “guilt-by-association”^14,25. Others, such as shortest-path^26 or random walk^27 algorithms, further consider the closeness between candidates and known disease genes in a network, taking both direct and indirect relationships into account. Previous studies have shown that closeness-based approaches outperformed direct neighbour-based methods and that combining closeness-based approaches further improved disease candidate prioritization^14,28.
However, most of these strategies have been used to identify disease candidates at the transcriptome level and not at the proteome level. Moreover, to date, none have been tested in the context of biological fluids. To move beyond this status quo, we developed an approach, named PRYNT (PRioritization bY protein NeTwork), that could help expand and fill the gaps of the molecular view, and predict the significance of proteins that were undetectable in the urine. PRYNT is based on the integration of the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, version 10.5)^29 PPI network and a combination of shortest-path and random walk, two closeness-based algorithms, as it has previously been shown in the literature that this combination outperformed other computational methods^14,26–28. We used PRYNT in the context of two prototypic human kidney diseases: autosomal dominant polycystic kidney disease (ADPKD)^5,9 and ureteropelvic junction obstruction (UPJ)^6,7. ADPKD is a well-characterized monogenic kidney disease induced by a mutation of the PKD1 or PKD2 gene. UPJ is a congenital kidney disease resulting from a complex multifactorial combination of genetic and environmental factors. To assess the performance of our approach, we first evaluated the precision of PRYNT in prioritizing ADPKD and UPJ disease candidates and compared it with other methods from recent literature. We also performed an in-depth comparison of the results obtained with PRYNT to two main reference prioritization methods currently used by biologists: prioritization based on experimental results and prioritization based on IPA’s URA algorithm.

Results

Contextualization of PRYNT PPI network

To test the PRYNT approach, four urinary proteome datasets were used: two associated with ADPKD (ADPKD1 and ADPKD2) and two associated with UPJ (UPJ1 and UPJ2) (Table 1 and Supplementary Tables S1–S4). We constructed a PPI network based on the STRING database.
Approximately 50–60% of the deregulated urinary proteins from the ADPKD and UPJ proteomic datasets were present in the raw PPI network (Fig. 1). This rather low percentage could be explained either by the absence of some of the deregulated proteins from the STRING v10.5 database altogether, or by their failure to match the selected STRING settings, i.e. sharing a protein.actions interaction with other proteins in the network, having a directional interaction, and having an interaction that reaches the highest confidence level (Fig. 1). Moreover, 56% (3569 proteins) of the 6391 proteins present in the network were grouped in 265 cliques, which are sets of proteins that all interact with each other and often share similar biological functions. To assess the impact of the missing biological input and of the presence of clique sub-graphs in the network, we modified the raw PPI network into three additional contextualized PPI networks (Fig. 2). The first contextualization consisted of generating a PPI network where the deregulated urinary proteins were added regardless of their confidence level (Fig. 2, +DP). The second contextualization consisted of generating a PPI network where cliques were taken into account (Fig. 2, +C). The last network combined both contextualization strategies (Fig. 2, +DP +C). We applied the prioritization strategy combining shortest path and random walk to the four proteomics datasets on each of the four PPI networks (Fig. 2). We compared the ranked lists to a list of 500 reference disease candidates of ADPKD for ADPKD1 and ADPKD2, and of UPJ for UPJ1 and UPJ2. The precision was plotted (Fig. 3a) and the areas under the precision curves (AUC) were compared (Fig. 3b). Compared to the raw PPI, the use of the contextualized PPI + DP and PPI + C networks slightly increased the AUC of the precision.
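The clique contextualization (+C) described above can be illustrated with a small sketch. The text does not specify how cliques were detected or merged, so everything here is an assumption made for illustration: maximal cliques are found with the Bron–Kerbosch algorithm on the undirected view of the network, only cliques of at least `min_size` proteins are grouped, overlapping cliques are resolved greedily (largest first), and the function names are hypothetical rather than taken from the PRYNT source code.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; adj maps node -> set of neighbours."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        # Pivot on the node with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def consolidate_cliques(edges, min_size=3):
    """Group clique members into one node (assumed, simplified +C step).

    edges: iterable of directed (source, target) pairs without self-loops.
    Returns (group, merged_edges), where group maps protein -> node label.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    group = {}
    # Largest cliques first; a clique is grouped only if none of its
    # members already belongs to a previously grouped clique (assumption).
    ordered = sorted(maximal_cliques(adj), key=lambda c: (-len(c), sorted(c)))
    for clique in ordered:
        if len(clique) >= min_size and all(n not in group for n in clique):
            label = "clique:" + "|".join(sorted(clique))
            for n in clique:
                group[n] = label
    for n in adj:
        group.setdefault(n, n)

    # Re-wire edges between groups, dropping edges internal to a clique.
    merged = {(group[a], group[b]) for a, b in edges if group[a] != group[b]}
    return group, merged


edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
group, merged = consolidate_cliques(edges)
print(group["A"], sorted(merged))
```

In this toy network the triangle A, B, C collapses into a single node while D and E stay separate, which is the kind of simplification the +C network is meant to provide.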
However, in the four datasets, the combined PPI + DP + C showed much better performance in terms of prioritizing disease candidates. Based on these results, we generated a contextualized PRYNT PPI network combining both the addition of the deregulated proteins and the management of the cliques (Fig. 2).

Table 1. Dataset description.

Dataset   Reference             Type of kidney disease   Controls   Cases   Deregulated proteins
ADPKD1    Bakun et al.^5        Monogenic                30         30      155
ADPKD2    Rauniyar et al.^9     Monogenic                18         14      69
UPJ1      Lacroix et al.^7      Complex                  10         8       174
UPJ2      Chen et al.^6         Complex                  23         23      175

Figure 1. Number of deregulated urinary proteins from proteomic datasets present in the raw PPI network. Some of the deregulated proteins (DP) present in the proteomics datasets could not be included as they were absent from the STRING v10.5 database (Homo sapiens). Moreover, a number of DP were excluded as they did not share any interaction with other proteins (absent from protein.actions PPI) or did not have a directional interaction with the highest confidence (≥ 0.9). PPI: protein–protein interaction network; DP: deregulated protein.

Figure 2. Description of the PRYNT algorithm. The PRYNT PPI network was based on STRING 10.5 protein.actions restricted to Homo sapiens (9606.protein.actions), and only directional interactions with confidence ≥ 0.9 were selected. The raw PPI network (Raw) was further contextualized by adding the deregulated proteins (+DP) regardless of their confidence level and by grouping the proteins within cliques (+C). The PRYNT prioritization approach was based on the combination of shortest-path (SP) and random walk (RW) algorithms and was achieved by multiplying the rank of the protein with the shortest-path ranking strategy (rank_sp) by the rank of the protein with the random walk strategy (rank_rw). PPI: protein–protein interaction network; DP: deregulated proteins; C: clique.

Figure 3.
Performance of PRYNT depending on PPI network contextualization. (a) The precision was calculated based on the percentage of reference ADPKD or UPJ disease candidates that were prioritized in the top 100 candidates ranked by PRYNT in the four datasets using either the raw PPI network (Raw) or the PPI networks contextualized by the addition of deregulated urinary proteins regardless of their confidence level (+DP), by the management of clique sub-graphs (+C) or by the combination of both (+DP +C). (b) The corresponding area under the precision curve (AUC) was calculated in the four datasets. Graphs were designed using GraphPad Prism version 5.0 for Mac, GraphPad Software, San Diego, California USA, http://www.graphpad.com. DP: deregulated protein; C: clique.

Precision of PRYNT compared to other approaches

We first compared PRYNT performance to shortest-path (SP) and random walk (RW) prioritization methods (Fig. 4a and Supplementary Figure S1). The shortest-path score of a disease candidate is defined by the distance between that protein in the network and the differentially abundant urinary proteins, taking into account the direction of interactions. Random walk with restarts simulates a random walker that starts on the differentially abundant urinary proteins and moves randomly to an immediate neighbor at each step. Each protein in the graph is prioritized by the probability of the random walker reaching it. Overall, the PRYNT approach, combining both algorithms, showed better performance than either of the two strategies taken separately. Next, we compared PRYNT to seven additional state-of-the-art prioritization algorithms and tools (Fig. 4b and Supplementary Figure S2): direct ranking (Direct), interconnectedness combined with random walk (ICN + RW), Phenolyzer^30, Endeavour^17, MaxLink^21, ToppGene^18 and ToppNet^18.
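As an illustration of how the shortest-path and random-walk rankings described above can be combined by rank product, here is a minimal, self-contained sketch. Only the rank multiplication is taken from the text (Fig. 2); the rest is assumed for the example: the shortest-path score is a harmonic closeness (sum of 1/distance from a candidate to each reachable deregulated protein along directed edges), the random walk treats edges as undirected with a restart probability of 0.3, ties are broken alphabetically, and all function names are hypothetical.

```python
from collections import deque


def closeness_to_seeds(edges, seeds):
    """Sum of 1/d over the seeds each node can reach via directed edges."""
    reverse, nodes = {}, set()
    for a, b in edges:
        reverse.setdefault(b, []).append(a)
        nodes.update((a, b))
    score = dict.fromkeys(nodes, 0.0)
    for seed in seeds:
        dist = {seed: 0}
        queue = deque([seed])
        while queue:  # breadth-first search on the reversed graph
            u = queue.popleft()
            for v in reverse.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    score[v] += 1.0 / dist[v]
                    queue.append(v)
    return score


def random_walk_with_restart(edges, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker restarting on the seeds."""
    neigh = {}
    for a, b in edges:  # undirected view: an assumption of this sketch
        neigh.setdefault(a, set()).add(b)
        neigh.setdefault(b, set()).add(a)
    start = [n for n in neigh if n in seeds]
    p0 = {n: (1.0 / len(start) if n in seeds else 0.0) for n in neigh}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in neigh}
        for u, targets in neigh.items():
            share = (1.0 - restart) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in neigh)
        p = nxt
        if delta < tol:
            break
    return p


def to_ranks(score, exclude):
    """Rank 1 = highest score; seeds are excluded from the candidate list."""
    ordered = sorted((n for n in score if n not in exclude),
                     key=lambda n: (-score[n], n))
    return {n: i + 1 for i, n in enumerate(ordered)}


def prioritize(edges, seeds):
    seeds = set(seeds)
    rank_sp = to_ranks(closeness_to_seeds(edges, seeds), seeds)
    rank_rw = to_ranks(random_walk_with_restart(edges, seeds), seeds)
    product = {n: rank_sp[n] * rank_rw[n] for n in rank_sp}
    return sorted(product, key=lambda n: (product[n], n))


# Toy network: Y -> X -> R -> {S1, S2}, where S1 and S2 are deregulated.
edges = [("R", "S1"), ("R", "S2"), ("X", "R"), ("Y", "X")]
print(prioritize(edges, {"S1", "S2"}))
```

In the toy network the direct regulator R of both deregulated proteins is ranked first by both strategies, so it also has the smallest rank product.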
Direct ranking and interconnectedness combined with random walk were applied to the raw STRING PPI network. Direct ranking was performed by applying out-degree centrality as described in the study of Oti et al.^25. Disease candidates were prioritized based on the number of directly interacting differentially abundant urinary proteins. The interconnectedness-based approach combined with random walk was implemented following the study of Hsu et al.^28. Phenolyzer, Endeavour and ToppGene are similarity-based prioritization approaches, extracting knowledge from diverse databases such as OMIM, Disease Ontology, or Gene Ontology. ToppNet and MaxLink are network-based prioritization approaches, using k-step Markov and neighbor-based algorithms respectively. Overall, PRYNT showed better precision compared with these methods (Fig. 4b and Supplementary Figure S2). Of note, the number of candidates predicted by MaxLink was < 100, so we could not assess the AUC for the precision in the top 100 predicted candidates in ADPKD2, UPJ1 and UPJ2.

PRYNT performance was then compared to two reference approaches commonly used by biologists (Fig. 4c and Supplementary Figure S3): URA from IPA (URA), and prioritization based on experimental results (Exp). Except for Exp, all approaches tested so far mine a network to find and rank new disease candidates that are linked to the deregulated proteins, without being in the initial set of deregulated proteins. In Exp, however, the deregulated proteins are the disease candidates and their prioritization is based on a p-value ranking, the most significant proteins being the highest ranked candidates. In the four datasets, PRYNT showed higher performance in prioritizing reference disease candidates compared to URA and Exp, with better precision and superior AUC (Fig. 4c and Supplementary Figure S3).
We next analyzed the overlap of reference disease candidates ranked in the top 100 by PRYNT, URA and Exp in the four datasets (Fig. 5). We observed that only a minority of the reference disease candidates prioritized by PRYNT and URA were prioritized by both approaches (59–70% uniquely prioritized by PRYNT and 48–64% uniquely prioritized by URA). For Exp, not only was the number of prioritized reference disease candidates very low, but it also showed very poor overlap with URA and no overlap with PRYNT.

Figure 4. Performance of PRYNT compared to other approaches. PRYNT performance was compared to prioritization using shortest-path or random walk algorithms alone (a), to prioritization by other common, state-of-the-art prioritization strategies (b), or to prioritization by reference approaches (c). The precision was calculated based on the percentage of reference ADPKD or UPJ disease candidates that were prioritized in the top 100 candidates ranked by the different strategies in the four datasets. The corresponding area under the precision curve (AUC) was then calculated. Graphs were designed using GraphPad Prism version 5.0 for Mac, GraphPad Software, San Diego, California USA, http://www.graphpad.com. SP: shortest-path; RW: random walk; D: direct; ICN + RW: interconnectedness combined with random walk; Exp: experimental; URA: upstream regulator analysis.

Figure 5. Overlap of reference disease candidates prioritized in the top 100 by PRYNT, URA or Exp. Prioritization by PRYNT, URA or from the experimental urinary proteomic candidates (Exp) was applied and reference ADPKD and UPJ disease candidates ranked in the top 100 were compared in the four datasets. Exp: experimental; URA: upstream regulator analysis.
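The precision curves and AUC values used throughout this section can be sketched as follows. The exact definition used for the figures is not spelled out in the text, so this example assumes that precision at rank k is the fraction of the top k ranked candidates found in the reference list, and that the AUC is the trapezoidal area under that curve over k = 1 to 100. The function names and the toy data are hypothetical.

```python
def precision_curve(ranked, reference, top=100):
    """Precision at each rank k = 1..top (assumed definition: hits / k)."""
    reference = set(reference)
    hits = 0
    curve = []
    for k, candidate in enumerate(ranked[:top], start=1):
        hits += candidate in reference
        curve.append(hits / k)
    return curve


def area_under_curve(curve):
    """Trapezoidal area under the curve with unit spacing between ranks."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


# Toy example: four ranked candidates, two of which are reference candidates.
ranked = ["A", "B", "C", "D"]
reference = {"A", "C"}
curve = precision_curve(ranked, reference)
print(curve, area_under_curve(curve))
```

With this definition, a method that places reference candidates near the top of its list obtains a larger AUC than one that finds the same number of reference candidates further down, which is why the AUC is a convenient single number for comparing ranking strategies.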
Specificity of PRYNT compared to reference approaches

We next assessed how the different prioritization strategies ranked candidates that were specific to the disease under study. First, we studied cross-specificity by analyzing whether prioritization in ADPKD datasets was better for specific ADPKD reference disease candidates compared to non-specific UPJ reference disease candidates, and conversely for UPJ datasets (Fig. 6a). For ADPKD1 and ADPKD2, all prioritization strategies showed similar cross-specificity, with the AUC for specific ADPKD candidates (AUC[ADPKD]) being superior to the AUC for non-specific UPJ candidates (AUC[UPJ]). However, for UPJ1 and UPJ2, only PRYNT displayed adequate cross-specificity in both datasets. We next compared the overall specificity of the approaches by comparing the AUC of the specific disease to the AUC of 80 non-specific diseases (list in Supplementary Table S5) (Fig. 6b). For ADPKD datasets, overall specificity was similar for all strategies in ADPKD1, with the AUC of the specific disease (AUC[ADPKD]) being in the top 15 out of 80 non-specific diseases. In ADPKD2, PRYNT showed better performance compared to URA and Exp (rank of specific AUC[ADPKD] of 14/81, 34/81 and 21/81 for PRYNT, URA and Exp respectively). For UPJ datasets, overall specificity was lower compared to ADPKD datasets, and in both datasets PRYNT prioritization showed the best specificity, with a rank of specific AUC[UPJ] of 21/81 and 27/81 for UPJ1 and UPJ2 respectively. In UPJ2, Exp showed the lowest specificity, with the specific AUC[UPJ] being ranked 65/81.

Figure 6. Specificity of PRYNT compared to reference approaches.
(a) Cross-specificity of the prioritization strategies was assessed for the four datasets by calculating the difference between the AUC of the precision curve for specific disease candidates (AUC[ADPKD] for ADPKD datasets and AUC[UPJ] for UPJ datasets) and the AUC of non-specific disease candidates (AUC[UPJ] for ADPKD datasets and AUC[ADPKD] for UPJ datasets). (b) Overall specificity of the prioritization strategies was assessed for the four datasets by assessing the rank of the AUC of the precision curve for specific reference disease candidates (AUC[ADPKD] for ADPKD datasets and AUC[UPJ] for UPJ datasets) compared to 80 additional AUCs of reference candidates from non-specific diseases, including 40 diseases associated with the urogenital tract and 40 diseases of other origin. Graphs were designed using GraphPad Prism version 5.0 for Mac, GraphPad Software, San Diego, California USA, http://www.graphpad.com. Exp: experimental; URA: upstream regulator analysis.

Pathway annotation

We used KEGG pathway enrichment analysis^31 to assess the biological relevance of the disease candidates prioritized by PRYNT (Fig. 7). For ADPKD, the 500 reference disease candidates were associated with 166 pathways. Approximately 85% of these pathways were also enriched with the top 100 ranked candidates prioritized by PRYNT (141/166 and 139/166 for ADPKD1 and ADPKD2 respectively), whereas enrichment was 67–72% for the URA top 100 (112/166 and 119/166 for ADPKD1 and ADPKD2 respectively) and dropped to approximately 5% for Exp (9/166 and 10/166 for ADPKD1 and ADPKD2 respectively). Similarly for UPJ, PRYNT results showed a higher number of enriched pathways and more overlapping pathways associated with the reference UPJ candidates compared to URA and Exp.

Figure 7. Pathway annotation.
KEGG pathway enrichment analysis was applied to the 500 reference ADPKD and UPJ disease candidates, and compared to the pathways enriched from the top 100 candidates ranked by PRYNT or URA or from the experimental urinary proteomic candidates (Exp) in the four datasets.

Links of proteins with pathology of interest

Next we assessed the involvement of the top 10 protein candidates in the disease of interest by a systematic search of the scientific literature (Supplementary Tables S6 and S7). For the top ranked ADPKD proteins, all but two were previously linked to ADPKD, confirming the potential of PRYNT in ranking disease candidates (Supplementary Table S6). The two proteins (F2 and HSPA8) not previously linked to ADPKD thus constitute potential candidates for future experiments. All but three of the top 10 proteins were previously linked to UPJ (Supplementary Table S7).

Discussion

In this study we developed and assessed the performance of PRYNT, a new network-based approach using urinary proteomic profiles to prioritize disease candidates in the context of kidney disease. While many tools and methods are available to predict disease candidates, we developed PRYNT to address the specificities of our research question. Indeed, most of these methods, such as Phenolyzer, Endeavour, MaxLink, ToppGene and ToppNet, were developed on genomic data to seek new disease genes, and we showed that they were less suitable than PRYNT for predicting new disease candidates from proteomic data. Combining shortest-path and random walk gave better results than using either alone, as previously shown by Hsu et al.^28, and also better results than using direct ranking.
This latter result indicates that closeness-based algorithms are more efficient at mining the PPI network in the context of research on biological fluids, because they are able to select key proteins of kidney disease in the network even though the links between proteins excreted in the urine and modified gene expression at the tissue level (i.e. in the kidney) are not necessarily straightforward.

Another specificity of PRYNT has been the work on improving the PPI network. To build the PRYNT PPI network, we chose to work with the STRING database, as it is a well-known, recognized and comprehensive database of PPI based on experimental evidence as well as on interactions predicted by comparative genomics and text mining. To limit the risk of false prediction, we decided to select only PPI with the highest confidence. Two major drawbacks that we identified in this network with such settings were that a lot of the input information was missing and that the network was massively structured into clique sub-graphs. Instead of using the raw PPI network, we hence decided to contextualize the network by adding the deregulated proteins from the input data, regardless of their confidence, and by grouping the cliques. Cliques are important structures in PPI networks^32,33. Taking cliques into account by grouping the proteins simplified the network and helped find the most important disease candidates. As a result, our specific PRYNT contextualized network showed better performance compared to the raw STRING network. We also compared PRYNT to IPA’s URA and to ranking based on experimental results, two reference approaches commonly used by biologists. We showed that PRYNT