Abstract

   Clear cell renal cell carcinoma (ccRCC) is the most prevalent type of
   renal cell carcinoma. However, our understanding of ccRCC risk genes
   remains limited. This gap in knowledge poses challenges to the
   effective diagnosis and treatment of ccRCC. To address this problem, we
   propose a deep reinforcement learning-based computational approach
   named RL-GenRisk to identify ccRCC risk genes. Distinct from
   traditional supervised models, RL-GenRisk frames the identification of
   ccRCC risk genes as a Markov Decision Process, combining the graph
   convolutional network and Deep Q-Network for risk gene identification.
   Moreover, a well-designed data-driven reward is proposed for mitigating
   the limitation of scant known risk genes. The evaluation demonstrates
   that RL-GenRisk outperforms existing methods in ccRCC risk gene
   identification. Additionally, RL-GenRisk identifies eight potential
   ccRCC risk genes. We successfully validated epidermal growth factor
   receptor (EGFR) and piccolo presynaptic cytomatrix protein (PCLO),
   corroborated through independent datasets and biological
   experimentation. This approach may also be used for other diseases in
   the future.

   Subject terms: Machine learning, Tumour biomarkers, Renal cell
   carcinoma
     __________________________________________________________________

   The genetic risk of clear cell renal cell carcinoma is complex and the
   genes related are unclear. Here, the authors propose a deep
   reinforcement based model to identify risk genes.

Introduction

   Renal cell carcinoma (RCC), one of the most common cancers worldwide,
   is a type of kidney cancer that initiates in the lining of the proximal
   convoluted tubule^[44]1,[45]2. Clear cell renal cell carcinoma (ccRCC)
   constitutes 80% of all RCC cases and is particularly aggressive due to
   its high immune infiltration^[46]3–[47]5. In addition, over 30% of
   ccRCC patients suffer from metastasis, which is a significant factor
   leading to death in ccRCC patients^[48]6–[49]8. Although several drugs
   have been utilized for the treatment of ccRCC, the efficacy is still
   limited due to the heterogeneity of ccRCC^[50]9–[51]11. Therefore, it
   is necessary to understand the pathogenesis and identify risk genes of
   ccRCC, which may be beneficial for early diagnosis and treatment of
   ccRCC^[52]12–[53]14.

   Cancer is a complex genetic disorder. Its occurrence and progression
   are associated with the accumulation of driver genetic mutations that
   provide a selective growth advantage to cells^[54]15,[55]16.
   Consequently, one class of methods to identify cancer risk genes is
   based on mutation data. In the past years, several cancer sequencing
   projects have generated mutation data from thousands of cancer
   patients, enhancing the identification of cancer risk
   genes^[56]15,[57]17. Traditional statistical approaches focus on genes
   with a higher mutation frequency in the patient cohort than the control
   cohort^[58]18. Youn et al.^[59]19 employed the functional impact of
   mutations on proteins, variations in background mutation rates among
   tumors, and the redundancy of the genetic code in tumor genome
   sequencing data to identify key genes in non-small cell lung cancer.
   Methods like MuSiC^[60]20, OncodriveCLUST^[61]21, and MutSigCV^[62]22
   identified cancer risk genes by comparing the observed gene mutations
   with the predefined background mutation frequency. By now,
   frequency-based methods have identified many cancer risk genes and
   enhanced cancer diagnosis and therapy^[63]23. However, the genetic
   foundations of cancer are highly diverse. Except for genes mutated
   across a large number of patients, some key genes in tumor initiation
   and progression are observed to be mutated in only a few
   patients^[64]24. For example, PIK3CA, which has been validated to be a
   ccRCC risk gene by previous studies^[65]25–[66]27, is mutated in no
   more than 5% of ccRCC patients^[67]27. It is difficult for purely
   frequency-based approaches to identify these genes with low mutation
   frequency but high risk.

   To address the drawback of frequency-based methods, the interactions
   among proteins are introduced for cancer risk gene identification,
   since genes involved in the same signaling and regulatory pathways as
   well as protein complexes may interact to exert their effects together.
   Muffinn^[68]28 identified cancer risk genes through network
   propagation, taking into account mutations not only in individual genes
   but also in their neighbors within the protein-protein interaction
   (PPI) network. DiSCaGe^[69]29 calculated a gene mutation score using an
   asymmetric spreading strength based on the type of mutations and the
   PPI network, then produced a ranking of prioritized cancer risk genes.
   HotNet2^[70]30 used an insulated heat diffusion process to identify
   cancer risk genes by propagating heat through the PPI network.
   nCOP^[71]31 employed a heuristic search method to select connected
   subnetworks from the PPI network based on the mutation data of cancer
   patients, and then ranked cancer risk genes based on the frequencies of
   genes appearing in these subnetworks. Both aforementioned methods are
   unsupervised, which may suffer from the highly diverse genetic
   foundations of cancer or the noise in the PPI network^[72]32. Recently,
   several supervised methods have emerged as potentially valuable tools
   for predicting cancer risk genes^[73]23,[74]33–[75]36. For example,
   Agajanian Steve et al.^[76]34 trained a random forest classifier to
   identify cancer risk genes based on the known cancer-driver mutations.
   DeepDriver^[77]35 used gene mutation types as features, constructed a
   K-nearest neighbor graph based on the Pearson correlation coefficient,
   and trained a convolutional neural network to identify cancer risk
   genes. Nevertheless, different from unsupervised
   methods^[78]28,[79]29,[80]31,[81]37, supervised methods require a
   substantial amount of known high-confidence risk genes as labeled data
   for model training^[82]23. Unfortunately, the number of known
   high-confidence ccRCC risk genes is currently limited^[83]38,[84]39.
   For example, there are only 44 ccRCC risk genes in the IntOGen
   database^[85]40. Owing to the reliance on labels, predicting ccRCC risk
   genes using supervised methods is challenging.

   To overcome the limitations of existing methods, we propose a deep
   reinforcement learning-based approach for ccRCC risk gene
   identification, named RL-GenRisk (Reinforcement Learning-based GENe
   RISK). The reinforcement learning-based model leverages environmental
   interactions for optimization^[86]41, tackling the challenge of scant
   known risk genes. Specifically, RL-GenRisk models the PPI network as
   the environment and utilizes a graph convolutional network^[87]42 to
   learn state representations. It also incorporates the Deep Q-Network
   (DQN)^[88]43 to combine reinforcement learning with deep neural
   networks for ccRCC risk gene identification. Moreover, a data-driven
   reward is designed to facilitate a straightforward method for
   identifying ccRCC risk genes. By focusing on a sampled subgraph with
   node features, the data-driven reward effectively leverages information
   from both the PPI network and gene mutation data. This not only ensures
   the accurate identification of genes with high mutation frequencies but
   also enables RL-GenRisk to identify potential risk genes with low
   mutation frequencies that functionally interact with genes having high
   mutation frequencies. Extensive experiments demonstrate that RL-GenRisk
   outperforms the existing methods in the identification of ccRCC risk
   genes. Furthermore, several potential risk genes are revealed and
   validated in independent datasets. Specifically, we validated two
   top-rank genes EGFR and PCLO through statistical and biological
   experiments. Statistical analyses show significant upregulation of EGFR
   at both bulk and single-cell levels among ccRCC patients, with a
   significant association between overexpression of the protein encoded
   by EGFR and poor survival in ccRCC patients. The in vitro experimental
   results show that decreased EGFR expression promotes ccRCC cell
   apoptosis as well as suppresses colony formation and migration, and the
   use of the EGFR inhibitor erlotinib effectively augments apoptosis and
   inhibits migration. Moreover, knocking down PCLO expression in vitro
   significantly inhibited ccRCC progression. Additionally, the in vivo
   experimental results show that both the erlotinib and EGFR
   downregulation can significantly repress the growth of ccRCC tumors in
   mice.

Results

RL-GenRisk framework

   We propose a deep reinforcement learning-based approach to identify
   ccRCC risk genes (Fig. [89]1), named RL-GenRisk. Fundamentally
   different from existing supervised deep learning-based methods,
   RL-GenRisk incorporates the reinforcement learning paradigm.
   Specifically, RL-GenRisk frames the ccRCC risk gene identification as a
   sequence decision-making process, formulated as a Markov Decision
   Process^[90]44. This enables RL-GenRisk to integrate reinforcement
   learning algorithms seamlessly, thereby effectively addressing the
   inherent challenge of scant known risk genes. RL-GenRisk takes PPI
   network and gene mutation data as input (Fig. [91]1A). The PPI network
   is represented as an undirected graph, with nodes representing genes
   and edges representing interactions between genes. Gene mutation data
   includes details on the presence of mutations in ccRCC patients for
   each gene. RL-GenRisk treats the PPI network with gene mutation
   information as the environment. The state includes a sampled subgraph
   with node features. The action is selecting a node directly connected
   to the sampled subgraph and adding it to this sampled subgraph. Thus,
   the risk gene identification is framed as a Markov Decision Process of
   node selection within the PPI network.

Fig. 1. Overview of workflow.

   [92]Fig. 1
   [93]Open in a new tab

   A Input data of RL-GenRisk, including a Protein-Protein Interaction
   (PPI) network and gene mutation data of ccRCC patients from The Cancer
   Genome Atlas (TCGA). B Overview of the training and identification
   process. During the training process, RL-GenRisk starts by selecting a
   node randomly and appending this node to a sampled subgraph (indicated
   by the red dashed circle). Nodes colored orange are included in the
   sampled subgraph. Nodes interacting with the sampled subgraph are
   colored blue, indicating their inclusion in the action space. The
   policy comprises two graph convolutional layers and a node evaluation
   network. It takes the current state as input and selects an action
   based on Q values of nodes in the action space. Subsequently, the state
   is updated, and the reward is obtained. This process repeats until the
   sampled subgraph reaches its maximum size. Following training,
   RL-GenRisk computes Q values for all genes based on an empty subgraph,
   resulting in a ranking list of risk genes ordered by Q values. C
   Experiment validation on external data for identified ccRCC risk genes.

   Central to RL-GenRisk is a policy that is represented by a neural
   network and interacts with the environment. The policy takes the
   current state as input and predicts Q values, representing the
   probability distribution of all possible actions. The policy of
   RL-GenRisk consists of two main components: a Graph Convolutional
   Network (GCN) for learning state representation and a node evaluation
   network for computing action probability (Fig. [94]1B top). To enhance
   state representation, RL-GenRisk employs the GCN^[95]42, an inductive
   graph representation learning method, to capture node representations
   in the PPI network. Node initial features are derived from the PPI
   network’s topology information and ccRCC patients’ mutation
   information. The policy of RL-GenRisk is trained to select the optimal
   actions by maximizing the reward. In this study, we designed a
   data-driven reward that focused on the sampled subgraph, considering
   both information from the PPI network and gene mutation data. The DQN
   algorithm is employed to update the policy’s parameters. Throughout the
   training, we employed the ϵ-greedy strategy to choose actions based on
   Q values, thus boosting RL-GenRisk’s exploratory potential. In the
   identification phase (Fig. [96]1B bottom), RL-GenRisk starts with an
   empty subgraph, incorporates all nodes into the action space, and
   utilizes the trained policy to calculate Q values for each node.
   Subsequently, ccRCC risk genes are ranked by Q values, with higher
   values indicating greater risk. Further details about the RL-GenRisk
   can be found in the “Methods” section.

RL-GenRisk shows superior performance for ccRCC risk gene prioritization over
existing methods

   To assess the performance of RL-GenRisk, we utilized RL-GenRisk and
   eight other existing methods to identify ccRCC genes with gene mutation
   data of ccRCC patients from The Cancer Genome Atlas (TCGA)^[97]45 and
   five different PPI networks. Six of these eight existing methods are
   specifically designed for cancer risk gene identification, including
   nCOP^[98]31, DiSCaGe^[99]29, Hierarchical HotNet^[100]46,
   HotNet2^[101]30, Muffinn^[102]28, and MutSigCV^[103]22. The other two
   are SVM^[104]47 and Random Forest^[105]48, which are widely used
   supervised machine learning methods. The implementation details of SVM
   and Random Forest are provided in the [106]Supplementary Materials. The
   PPI networks used in our study included HPRD^[107]49, STRING^[108]50,
   Multinet^[109]51, IRefIndex^[110]52, and HumanNet^[111]53. Known ccRCC
   risk genes were retrieved from the IntOGen^[112]40 cancer-specific
   database, the Network of Cancer Genes (NCG) database^[113]54, the
   Cancer Gene Census (CGC) database^[114]55, and a cancer risk gene set
   extracted from the recent and extensive study conducted by Bailey et
   al.^[115]56. In particular, the ccRCC risk genes included in CGC and
   Baylei et al. datasets are limited, containing only nine and twelve
   genes respectively. Consequently, we evaluated different methods using
   three datasets: IntOGen, NCG, and a “Merged” dataset that combined data
   from IntOGen, NCG, Baylei et al. and CGC. The genes included in these
   datasets are provided in Supplementary Data [116]1. To assess the
   performance of the different methods, we used discounted cumulative
   gain (DCG) as one of the primary evaluation metrics, consistent with
   previous studies^[117]29,[118]57. Specifically, the DCG scores were
   calculated based on the top 100 genes identified by each method.
   Moreover, the DCG curves of different methods on the IntOGen, NCG, and
   the Merged ccRCC risk gene dataset are shown in Fig. [119]2 and
   Supplementary Fig. [120]2. Following this, we calculated the normalized
   DCG (N-DCG), the area under the DCG curve (DCG-AUC), and the average
   precision (AP) to further evaluate these methods’ performance (as shown
   in Fig. [121]2D–F, and Supplementary Fig. [122]3). Evaluation results
   indicated that RL-GenRisk outperformed the eight other established
   methods in identifying ccRCC risk genes, achieving the highest DCG,
   N-DCG, DCG-AUC, and AP on all used datasets (as shown in Supplementary
   Fig. [123]1). As anticipated, MutsigCV showed the lowest performance
   among the six existing methods specifically designed for cancer risk
   gene identification due to its reliance solely on mutation data,
   lacking integration with biological network insights. Muffinn performed
   well with STRING but showed inferior performance with other PPI
   networks, highlighting its instability across different PPI networks
   (Supplementary Fig. [124]3). While nCOP and Hierarchical HotNet
   performed well across different datasets and showed more stability
   across various PPI networks, their performance was inferior to
   RL-GenRisk (Supplementary Fig. [125]1 and Supplementary Fig. [126]3).
   Additionally, SVM showed the lowest performance among the compared
   methods. Random Forest performed better than SVM and MutSigCV. However,
   its performance was inferior to RL-GenRisk. RL-GenRisk achieved the
   best performance on HPRD (as shown in Supplementary Fig. [127]1).
   Therefore, the subsequent analyses were grounded on the high-confidence
   risk genes (HRGs) identified by the best-performing method, RL-GenRisk
   with the HPRD network.

Fig. 2. Performance comparison and perturbation analysis.

   [128]Fig. 2
   [129]Open in a new tab

   A–C Discounted Cumulative Gain (DCG) curves of the top 100 identified
   genes with RL-GenRisk and other compared methods on the HPRD network,
   evaluated on the IntOGen, NGC, and the Merged datasets. D–F Boxplot of
   Average Precision (AP) on different PPI networks across different
   methods, evaluated on the IntOGen, NGC, and the Merged datasets.
   Boxplots display the 25th, 50th (median), and 75th percentiles as box
   bounds, with whiskers extending to minima and maxima within
   1.5 × interquartile range. Each box includes 5 data points,
   corresponding to the results on 5 different PPI networks. G, H Mutation
   frequency of the top 20 genes identified by RL-GenRisk (G. Before
   perturbation; H. After perturbation.). Each dot represents a gene. The
   dots in red indicate that the gene is included in the IntOGen database,
   and the blue indicates that the gene is not included in the IntOGen
   database. And the mutation frequency of the genes enclosed by the
   orange box is less than 10%. Source data are provided as a Source Data
   file.

Biological network facilitates the identification of low-frequency mutated
ccRCC risk genes

   Although gene mutation frequency is crucial for assessing cancer
   associations, not all cancer-related genes have high mutation
   frequencies^[130]58. Therefore, relying solely on mutation data might
   overlook low mutation frequency, high-risk cancer genes. To evaluate
   the capability of RL-GenRisk in identifying ccRCC risk genes with low
   mutation frequency but high risk, we utilized the “maftools”
   package^[131]59 to analyze the mutation frequencies of the top 20 HRGs
   (Supplementary Table [132]1) identified by RL-GenRisk in ccRCC patients
   from TCGA (Supplementary Fig. [133]5A). Generally, 317 (77.13%) of the
   411 ccRCC patients from TCGA exhibited somatic mutations in the top 20
   HRGs. Missense mutations were predominant, followed by frameshift
   mutations, among the identified somatic mutation types. Notably, the
   mutation frequencies of four genes, including VHL, PBRM1, SETD2, and
   BAP, exceeded 10%. These genes are already recognized as ccRCC risk
   genes and included in the IntOGen database. Genes with high mutation
   frequencies are more likely to be detected by methods that solely rely
   on mutation data. However, it’s important to note that not all cancer
   genes have high mutation frequencies^[134]58. Among the top 20 HRGs,
   seven recognized ccRCC risk genes in the IntOGen database had mutation
   frequencies below 5%. For example, only 5 of 411 ccRCC patients carry
   mutations in PIK3CA, a known ccRCC risk gene in the IntOGen database,
   indicating variable mutation frequencies among cancer genes. To assess
   the contribution of the biological network in RL-GenRisk for
   identifying low mutation frequency, high-risk ccRCC genes, we
   progressively increased the percentage of edges randomly swapped
   between node pairs from 25% up to 50%, 75%, and 100%. Perturbing the
   PPI network disrupts its internal information, with more extensive
   perturbation causing greater information loss. As the perturbation
   ratio increased, the rankings of known ccRCC genes that initially had
   high ranks and mutation frequencies below 5% gradually dropped (as
   shown in Supplementary Fig. [135]5B). We then compared the mutation
   frequencies of the top 20 HRGs identified by RL-GenRisk before and
   after PPI network perturbation and created dot plots for visualization
   (Fig. [136]2G, H). We found that after network perturbation, three
   known ccRCC risk genes, TP53, PIK3CA, and SPEN, previously identified
   in the top 20 with mutation frequencies below 5%, dropped out of the
   top 20. This indicates that incorporating PPI network knowledge
   improves the detection of low mutation frequency ccRCC genes with
   significant risk implications.

Biological function analysis of high-confidence risk genes

   To delve into the biological function of HRGs, we conducted pathway
   enrichment analysis using the top 20 HRGs and the WikiPathways
   database^[137]60, a comprehensive resource for pathway-based data
   analysis. Enrichment analysis was performed using g:Profiler^[138]61.
   Figure [139]3. A illustrated the top 10 significantly enriched pathways
   (FDR p-value  < = 1.21e-4). The results demonstrated significant
   enrichment in various cancer-related pathways, with “Clear cell renal
   cell carcinoma pathways” being the most prominently enriched (FDR
   p-value = 1.48e-8). The Human Phenotype Ontology (HPO)
   database^[140]62, which provides a standardized vocabulary of
   phenotypic abnormalities associated with human diseases, highlighted
   “Renal neoplasm” and “Renal cell carcinoma” as the most significantly
   enriched phenotypes (FDR p-value  < 4.05e-5, Supplementary
   Fig. [141]6). Furthermore, Gene Ontology (GO) enrichment analysis
   indicated that top 20 HRGs were notably enriched in various
   cancer-related biological processes, including “cell adhesion” (FDR
   p-value = 2.69e-4), “regulation of cell population proliferation” (FDR
   p-value = 3.64e-4), “cell population proliferation” (FDR p-value =
   6.85e-4), and “cell migration” (FDR p-value = 2.39e-3). These
   observations showed a comprehensive molecular landscape linked to HRGs
   within the scope of ccRCC, emphasizing the critical roles of these
   genes in ccRCC-related mechanisms.

Fig. 3. Independent datasets analysis.

   [142]Fig. 3
   [143]Open in a new tab

   A Top 10 enriched pathways of the Wikipathways based on the top 20
   high-confidence genes (HRGs). B The expression levels of EGFR in tumor
   (n = 523) and normal tissues (n = 72) of ccRCC patients from TCGA,
   quantified using Fragments Per Kilobase Million (FPKM), “***” indicates
   p-value  < 0.001. The boxplot displays the 25th, 50th (median), and
   75th percentiles as box bounds, with whiskers extending to minima and
   maxima within 1.5 × interquartile range. p-value = 2.07e-38, using the
   Wilcoxon signed-rank test. C, D Survival curves of EGFR. ccRCC patients
   from TCGA were divided into two groups based on the top 25% (n = 108)
   and others (n = 324) of expression levels of protein encoded by EGFR,
   using Progression-Free Survival (PFS) and Disease-Specific Survival
   (DSS) respectively. The red curve corresponds to the high expression
   group, while the blue curve corresponds to the low expression group. E
   UMAP plot of clustering of single-cell RNA-seq data from kidneys of
   ccRCC patients. DC dendritic cell, NK natural killer cell, NKT natural
   killer T cell, TAM tumor-associated macrophage, T-Reg regulatory T
   cell. F Gene expression of EGFR in different cells of ccRCC patients,
   with colors corresponding to the expression values. G Violin plot
   displaying the normalized gene expression of EGFR in different cells of
   ccRCC patients in the single-cell RNA-seq data. Source data are
   provided as a Source Data file.

   In addition, to further explore whether other top-performing methods
   identified similar pathways albeit with different genes, we performed
   an analysis based on the pathways enriched by the top 20 genes
   identified by top-performing methods, including RL-GenRisk, nCOP,
   Hierarchical HotNet, HotNet2, DiSCaGe, Muffinn, and Random Forest. We
   first calculated the Jaccard similarity coefficients^[144]63 between
   the top 20 gene lists identified by these methods (pairwise comparisons
   among these 7 methods, totaling 21 pairs). The results showed that 76%
   (16 out of 21 pairs) of method pairs have a Jaccard similarity
   coefficient below 0.5 for their identified top 20 genes, indicating
   that the top 20 gene lists identified by these methods do not exhibit a
   high degree of overlap overall (Supplementary Fig. [145]22). After
   that, we analyzed the intersection of pathways enriched by the top 20
   genes identified by top-performing methods. The result revealed that
   two pathways were significantly enriched (FDR p-value < 0.05) across
   these methods, including “Clear cell renal cell carcinoma pathway” and
   “Type 2 papillary renal cell carcinoma”. Notably, the top 20 genes
   identified by RL-GenRisk exhibited the most significant enrichment in
   the “Clear cell renal cell carcinoma pathway” among all these methods
   (as illustrated in Supplementary Fig. [146]7). The detailed results of
   enrichment analysis for different methods are provided in Supplementary
   Data [147]3.

Differential expression analysis revealed significant differential gene
expression of EGFR at both bulk and single-cell levels

   Among the top 20 HRGs identified by RL-GenRisk, 8 HRGs are not included
   in these ccRCC risk gene databases. We then performed differential
   expression analysis on these 8 HRGs using RNA-seq data from ccRCC
   patients in TCGA. Differential expression analysis was performed using
   Limma^[148]64, and the significance threshold was FDR p-value < 0.05.
   Notably, EGFR and PCLO showed significant differential expression
   between tumor tissues and normal tissues in TCGA among these 8 HRGs
   (FDR p-value = 2.07e-38 for EGFR and FDR p-value = 1.04e-14 for PCLO,
   Fig. [149]3B and Supplementary Fig. [150]8). The differential
   expression results for these 8 HRGs are shown in Supplementary
   Fig. [151]8. EGFR was notably upregulated in ccRCC tumor tissues
   compared to normal tissues. EGFR is activated as a homodimer or
   heterodimer, thereby regulating multiple signaling pathways, including
   the RAS/RAF/MAPK, AKT, and JAK/STAT pathways, which play essential
   roles in cell migration, proliferation, and survival^[152]65–[153]67.
   These pathways are intricately involved in driving cell proliferation
   and conferring resistance to apoptosis^[154]67. The overexpression of
   EGFR leads to an excess of receptors on the cell surface, fostering
   uncontrolled cell growth and division. This dysregulation can drive the
   transformation of normal cells into tumor cells, creating a favorable
   environment for sustained tumor cell survival^[155]68. Additionally,
   the protein encoded by PCLO is a component of the presynaptic
   cytoskeletal matrix, which is involved in establishing active synaptic
   zones and in synaptic vesicle trafficking^[156]69. Recent research has
   identified the expression level of PCLO as a prognostic biomarker for
   esophageal squamous cell carcinoma^[157]70, and a notable high mutation
   frequency (47.9%) of PCLO has been observed in large Central European
   cohorts with gastric cancer^[158]71. In addition, to further explore
   the transcriptional change of these 8 HRGs across tumor and normal
   tissues, we conducted differential expression analysis using ccRCC
   patient data from the Gene Expression Omnibus (GEO). We used two mRNA
   expression profiles of ccRCC patients from [159]GSE46699^[160]72 and
   [161]GSE36895^[162]73 for analysis. The [163]GSE46699 contains data
   from 67 tumor tissues and 63 normal tissues. The [164]GSE36895 contains
   data from 29 tumor tissues and 23 normal tissues. We found that among
   these 8 HRGs, 5 genes showed significantly differential expression in
   both [165]GSE46699 and [166]GSE36895 datasets (FDR p-value < 0.05, as
   shown in Supplementary Fig. [167]9). In particular, MUC4 shows high
   expression in tumor tissues in the [168]GSE36895 dataset, but low
   expression in tumor tissues in the [169]GSE46699 dataset. Existing
   studies reported that MUC4 expression is an independent prognostic
   factor for overall survival in ccRCC patients^[170]74, and MUC4
   mutation is associated with an exophytic growth pattern of
   ccRCC^[171]75. Then, we compared the differential expression levels of
   the top 20 ccRCC risk genes identified by RL-GenRisk and other methods
   using the [172]GSE46699 and [173]GSE36895 datasets. The results
   indicated that the top 20 genes identified by RL-GenRisk included a
   higher number (14) of genes that were significantly differentially
   expressed in both [174]GSE46699 and [175]GSE36895 (FDR p-value < 0.05).
   Further details can be found in Supplementary Data [176]4.

   To further investigate the expression patterns of the top 20 HRGs in
   tumor cells and normal cells of ccRCC patients, we performed an
   analysis utilizing single-cell RNA-seq data from ccRCC patients
   reported in a recent work^[177]76 (Supplementary Table [178]2). Uniform
   manifold approximation and projection (UMAP) was used to visualize the
   distribution of 31,856 cells from kidneys in these patients in a
   two-dimensional plane (Fig. [179]3E). We used UMAP to visualize the
   expression of the top 20 HRGs in different cells (Supplementary
   Fig. [180]10) and plotted the distribution of expression of these genes
   in different cell types (Supplementary Fig. [181]11). Subsequently, we
   assessed the expression levels of EGFR across different cell types
   within these patients (Fig. [182]3F and Supplementary Fig. [183]11).
   Our investigation showed a significant difference in EGFR expression
   between tumor cells and other cells, (FDR p-value  < = 1.21e-4,
   Wilcoxon signed-rank test, Supplementary Fig. [184]12), with displaying
   higher expression levels in tumor cells (Fig. [185]3G). Therefore, the
   significant upregulation of EGFR at both bulk and single-cell levels in
   ccRCC patients suggested its potential as a biomarker for ccRCC.

Expression level of protein encoded by the EGFR is significantly correlated
with the prognosis of ccRCC patients

   To reveal whether EGFR affects the prognosis of ccRCC patients, the
   survival analysis is conducted utilizing clinical data and expression
   of EGFR-encoded protein obtained from the TCGA dataset. We obtained
   clinical data and reverse-phase protein array (RPPA) data of ccRCC
   patients from TCGA. Progression-Free Survival (PFS)^[186]77 and
   Disease-Specific Survival (DSS)^[187]78 were utilized to assess the
   relationship between the expression levels of protein encoded by EGFR
   and the survival of ccRCC patients. The ccRCC patients from TCGA were
   categorized into two groups (top 25% and others) based on the
   expression levels of protein encoded by EGFR. We analyzed the ten-year
   survival rate following cancer diagnosis and plotted Kaplan-Meier
   survival curves to illustrate the impact of EGFR encoded protein
   expression on the prognosis of patients (Fig. [188]3C, D). The results
   of the survival analysis revealed a significant association between the
   expression levels of protein encoded by EGFR and the survival time of
   ccRCC patients, with higher EGFR encoded protein expression being
   correlated with poorer survival outcomes (KM log-rank p-value = 0.0018,
   PFS; KM log-rank p-value = 0.021, DSS). Our observations revealed that
   the overexpression of protein encoded by EGFR may play a critical role
   in cancer progression and could potentially serve as a prognostic
   biomarker for ccRCC patients.

EGFR effectively promotes ccRCC progression in vitro and in vivo

   To verify the effect of EGFR expression on ccRCC progression, the
   stable cells harboring EGFR knockdown were obtained using ACHN
   (Fig. [189]4A, B) and 786-O cells (Fig. [190]4G, H). The protein and
   mRNA expression of EGFR was quantitated by western blotting and qPCR,
   respectively, and short hairpin RNA (shRNA)-2/3 showed promising
   knockdown epointing out this issuefficacies for EGFR silencing
   (Fig. [191]4A, B, G, H). The downregulation of EGFR significantly
   inhibits the cell viability (CCK8 assay, Fig. [192]4C-ACHN,
   Fig. [193]4I-786-O) and migration (transwell assay, Fig. [194]4E-ACHN,
   Fig. [195]4K-786-O). Furthermore, decreased EGFR expression markedly
   promotes the ccRCC cell apoptosis (flow cytometry assay,
   Fig. [196]4D-ACHN, Fig. [197]4J-786-O) and represses cell colony
   formation in vitro (colony assay, Fig. [198]4F-ACHN,
   Fig. [199]4L-786-O). Also, the EGFR overexpression was detected using
   qPCR (Fig. [200]4M) and western blotting (Fig. [201]4N), and it
   promotes the migration of 786-O cells significantly (Fig. [202]4O).
   Moreover, the EGFR inhibitor, erlotinib, was further used to inhibit
   its activity, and the data indicated that erlotinib can effectively
   inhibit ccRCC cell migration (Fig. [203]5A middle panel) and growth
   (Fig. [204]5A lower panel) as well as promotes the apoptosis
   (Fig. [205]5A upper panel) in vitro. In vivo, both erlotinib and EGFR
   downregulation can markedly repress the tumor growth (Fig. [206]5B, C).
   Taken together, these results suggested that EGFR can significantly
   promote ccRCC cell progression as a risk factor.

Fig. 4. EGFR dysregulation significantly affect ccRCC cell progression in
vitro.

   [207]Fig. 4
   [208]Open in a new tab

   A, B The protein and mRNA expression of EGFR in stable cells of
   ACHN-shEGFR. n = 4 independent samples. C. The cell viability of
   ACHN-shEGFR cells. n = 10 from three biological replicates. D–F The
   assays of apoptosis (n = 4 independent experiments), migration (n = 5
   independent experiments), and colony formation (n = 3 independent
   experiments) in ACHN-shEGFR cells. G, H The protein and mRNA expression
   of EGFR in stable cells of 786-O-shEGFR. n = 4 independent samples. I
   The cell viability of 786-O-shEGFR cells. n = 10 from three biological
   replicates. J–L The assays of apoptosis (n = 3 independent
   experiments), migration (n = 5 independent experiments), and colony
   formation (n = 3 independent experiments) in 786-O-shEGFR cells. M, N
   Overexpression of EGFR in 786-O cells at mRNA and protein levels. n = 4
   independent samples. O The migration assay of 786-O-EGFR cells. n = 5
   independent experiments. NC negative control, ns no significant. The
   data were shown as the mean  ± SEM. One-way ANOVA (B–F, H–K).
   Two-tailed unpaired Student’s t-test (L–O). Source data are provided as
   a Source Data file.

Fig. 5. EGFR inhibition represses ccRCC cell progression in vitro and in
vivo.

   [209]Fig. 5
   [210]Open in a new tab

   A Erlotinib markedly promoted apoptosis and inhibited migration and
   colony formation of ccRCC cells. n = 3 independent experiments. B, C
   Both erlotinib (B) (n = 3 mice per group) and EGFR knockdown (C) (n = 3
   mice per group) significantly reduced tumor growth in vivo. The data
   were shown as the mean  ± SEM. Two-tailed unpaired Student’s t-test
   (A). Two-way ANOVA (B, C). Source data are provided as a Source Data
   file.

The PCLO knockdown significantly impaired the ccRCC progression

   To investigate the effect of PCLO on the progression of ccRCC in vitro,
   we utilized four distinct shRNAs to knock down PCLO expression in 293T
   cells, identifying shPCLO-1 as the most effective (Fig. [211]6A, B).
   Subsequently, we obtained the PCLO knockdown 786-O cells using shPCLO-1
   (Fig. [212]6C). The downregulation of PCLO significantly inhibited cell
   viability (Fig. [213]6D) and promoted apoptosis (Fig. [214]6E).
   Additionally, the reduction of PCLO expression significantly suppressed
   cell colony formation (Fig. [215]6F), migration (Fig. [216]6G), and
   invasion (Fig. [217]6H). To further investigate the effects of PCLO
   knockdown on the morphology and motility of ccRCC cells, a high-content
   imaging analysis system was employed. The results revealed that PCLO
   knockdown significantly reduced both the length and width of the cells
   at multiple time points (Fig. [218]6I), as well as their area and
   perimeter (Fig. [219]6J). Notably, a reduction in cell movement speed
   was observed in 786-O-shPCLO cells (Fig. [220]6K), leading to a
   shortened accumulated distance (Fig. [221]6L). In addition, we
   performed the same intervention on PCLO in another renal cancer cell
   line, ACHN, and conducted related assays to assess proliferation and
   metastatic potential. The results showed that stable knockdown of PCLO
   in ACHN cells (Supplementary Fig. [222]24A) significantly reduced their
   proliferation (Supplementary Fig. [223]24B), colony formation
   (Supplementary Fig. [224]24D), migration (Supplementary Fig. [225]24E),
   and invasion (Supplementary Fig. [226]24F) abilities, while increasing
   the apoptosis index (Supplementary Fig. [227]24C). High-content imaging
   analysis system also indicated that knockdown of PCLO in ACHN not only
   decreased cell perimeter but also impaired the cells’ motility
   (Supplementary Fig. [228]24G–I). Overall, the PCLO knockdown markedly
   inhibited the progression of ccRCC.

Fig. 6. Loss of PCLO significantly impaired the ccRCC progression.

   [229]Fig. 6
   [230]Open in a new tab

   A, B The protein and mRNA expression of PCLO after transient
   transfection of shPCLO lentivirus in 293T cells. n = 3 independent
   samples. C The protein expression of PCLO in cells of 786-O-shPCLO. D
   The cell viability of 786-O-shPCLO cells. n = 10 from three biological
   replicates. E–H The assays of apoptosis, colony formation, migration,
   and invasion in 786-OshPCLO cells. n = 3 independent experiments. I, J
   Average morphology of 786-O-shPCLO cells per well, including cell
   width, length, area, and perimeter, was evaluated at multiple time
   points using a high-content cell imaging analysis system. n = 10
   independent samples. K, L Average movement speed and accumulated
   distance of 786-O-shPCLO cells per well. n = 10 independent samples. NC
   negative control, ns no significant. The data were shown as the mean
   ± SEM. One-way ANOVA (B). Two-tailed unpaired Student’s t-test (D–L).
   Source data are provided as a Source Data file.

Discussion

   In this study, we developed RL-GenRisk, an approach utilizing deep
   reinforcement learning to enhance ccRCC risk gene identification by
   integrating network knowledge with gene mutation data. By considering
   the risk gene identification as a node selection problem, we model the
   ccRCC risk gene identification as a Markov Decision Process, reducing
   the dependency on labeled data. Furthermore, we designed a data-driven
   reward and employed the DQN algorithm to optimize RL-GenRisk.
   RL-GenRisk exhibits a substantial improvement in the task of ccRCC risk
   gene identification and reveals several potential risk genes.

   RL-GenRisk successfully identified known ccRCC risk genes. Among the
   top 20 ccRCC risk genes identified by RL-GenRisk, 12 genes are listed
   in the ccRC known risk gene datasets (IntOGen, NCG, and the Merged
   dataset) and recognized as ccRCC risk genes in prior research,
   including VHL^[231]79,[232]80, PBRM1^[233]81,[234]82,
   SETD2^[235]83,[236]84, BAP1^[237]85,[238]86, MTOR^[239]87,[240]88,
   ATM^[241]89,[242]90, SPEN^[243]91, TP53^[244]92, KMT2D (also known as
   MLL2)^[245]93, SMARCA4^[246]94,[247]95, and PTEN^[248]96,[249]97. This
   demonstrates that RL-GenRisk can effectively identify ccRCC risk genes.
   Interestingly, 8 of the top 20 identified ccRCC risk genes are not
   listed in the known ccRCC risk gene datasets, including MUC4, DST,
   PABPC1, PCLO, PDE4DIP, USH2A, EGFR, and FLG. We visualized the top 20
   genes identified by RL-GenRisk in the PPI network using the STRING-db.
   These genes were significantly interconnected within the PPI network
   (one-sided t-test, FDR p-value = 2.72e-07). To illustrate the
   interactions between known ccRCC risk genes and the potential risk
   genes, we used a bipartite graph (see Supplementary Fig. [250]16).
   Among the potential risk genes, EGFR exhibited the most interactions
   with known ccRCC risk genes, followed by MUC4 and PCLO. Notably, both
   EGFR and PCLO were validated in our biological experiments. In
   addition, some recent studies have found that PDE4DIP, FLG, and USH2A
   are associated with ccRCC. For example, methylation levels of PDE4DIP
   were found to be associated with reduced overall survival in ccRCC
   patients^[251]98. FLG was found to be specifically mutated in specific
   subtypes of ccRCC^[252]99. USH2A was found to have a significant
   co-mutation with well-known high-confidence ccRCC risk genes VHL and
   PBRM1 in ccRCC patients^[253]100,[254]101. While the genes within these
   identified candidates have been shown a correlation with ccRCC, their
   molecular mechanism within ccRCC needs to be further investigated.

   Furthermore, we validated two top ccRCC risk genes, EGFR and PCLO,
   identified by RL-GenRisk through independent datasets and biological
   experiments. The results indicated that EGFR exhibits significant
   upregulation at both bulk and single-cell levels in ccRCC patients. The
   overexpression of the protein encoded by EGFR is significantly
   associated with poor survival of ccRCC patients. To explore whether the
   few EGFR mutations detected in ccRCC patients are functionally
   relevant, we analyzed EGFR protein expression levels between ccRCC
   patients with and without EGFR mutations, as proteins are the primary
   functional molecules in cellular processes. We use the ccRCC patient
   data from TCGA to perform analysis, dividing them into groups with and
   without EGFR mutations to assess whether the protein expression levels
   significantly differ between these two groups. The results showed that
   EGFR protein expression levels were higher in ccRCC patients carrying
   EGFR mutations (p-value = 0.017, Wilcoxon signed-rank test,
   Supplementary Fig. [255]17). Moreover, we visualized the interacting
   genes of EGFR using STRING-db (Supplementary Fig. [256]20). It is shown
   that EGFR interacts with many genes. To further explore the effect of
   these genes, we conducted pathway enrichment analysis on these genes.
   The results showed that one ccRCC-related pathway and three
   cancer-related pathways are enriched, including the ErbB signaling
   pathway (FDR p-value = 4.99e-2), the PI3K Akt signaling pathway (FDR
   p-value = 3.41e-7), MAPK signaling pathway (FDR p-value = 2.99e-8), and
   the CKAP4 signaling pathway map (FDR p-value = 3.54e-7). The ErbB
   signaling pathway has been found to play a critical role in the
   initiation and progression of ccRCC^[257]102. The PI3K Akt signaling
   pathway plays a crucial role in various cellular processes and is
   aberrantly activated in cancers, contributing to the occurrence and
   progression of tumors^[258]103. The MAPK signaling pathway is one such
   complex interconnected signaling cascade with frequent involvement in
   oncogenesis, tumor progression, and drug resistance^[259]104. The CKAP4
   signaling pathway map has been found to be involved in regulating the
   progression of various cancers^[260]105. The enrichment analysis
   results suggested that the genes interacting with EGFR in the PPI
   network were significantly enriched in pathways related to the
   initiation and progression of ccRCC, as well as tumor initiation and
   progression. Furthermore, through comprehensive biological experiments,
   we validated the impact of EGFR and PCLO on ccRCC progression. The
   results demonstrated that decreased EGFR expression promotes ccRCC cell
   apoptosis and suppresses colony formation. To confirm the status of
   EGFR expression in ccRCC cells, we performed additional experiments. We
   found that the mRNA and protein expression levels of EGFR in the 786-O
   and ACHN cells are significantly higher than that in the human
   embryonic kidney (293T) cells using qPCR and western blotting analyses
   (as shown in Supplementary Fig. [261]23). Accordingly, these results
   showed that EGFR is overexpressed in both 786-O and ACHN cell lines.
   Recent studies have indicated that the downregulation of certain
   receptor tyrosine kinases (RTKs) members can prevent the progression of
   RCC. For instance, silencing AXL and MET using shRNA may overcome the
   resistance to long-term sunitinib treatment in metastatic RCC^[262]106.
   Additionally, inhibition of EphA2 can suppress tumor growth both in
   vitro and in vivo, and restore the sensitivity of sunitinib-resistant
   tumor cells to sunitinib^[263]107. Therefore, we proposed that other
   RTKs could produce the same effect, even these RTKs would play a
   synergistic role with EGFR in ccRCC. However, their roles in ccRCC
   progression need to be further validated in the future. Additionally,
   the use of the EGFR inhibitor erlotinib effectively inhibits ccRCC cell
   migration and growth in mice. While previous studies have associated
   EGFR expression with RCC, functional studies directly examining EGFR’s
   role in RCC cell lines have been limited. Lee et al. utilized shRNA to
   knock down EGFR on 786-O cells, suggesting that EGFR knockdown inhibits
   the invasiveness of RCC cells in vitro and tumorigenicity in
   vivo^[264]108. However, their conclusions were primarily based on cell
   proliferation and branching morphogenesis assays, as well as tumor
   xenograft experiments. Similarly, Wen et al. found that EGFR knockdown
   reversed ADAMTS1-induced prometastatic characteristics of RCC, this was
   only verified by CCK8 and Matrigel invasion assay^[265]109. Therefore,
   comprehensive experiments are essential to further elucidate the role
   of EGFR in ccRCC cell lines. In our study, we stably knocked down EGFR
   expression in both 786-O and ACHN cells using shRNA. We conducted a
   series of experiments, including CCK8, apoptosis assays, transwell
   assays, colony formation assays, and in vivo experiments, which
   comprehensively explored the function of EGFR in ccRCC. Our results
   demonstrate that knocking down EGFR significantly inhibits both the in
   vitro proliferation and metastasis of ccRCC cells, as well as tumor
   growth in vivo. These results not only demonstrate RL-GenRisk’s
   capability in identifying ccRCC risk genes but also provide further
   insights into the role of EGFR in ccRCC cell lines. Moreover, the
   biological experimental results of PCLO showed that knocking down PCLO
   expression in vitro significantly inhibited ccRCC progression. These
   biological experimental results highlighted the potential therapeutic
   significance of our findings.

   In addition, we explored incorporating mutation type information and
   pathogenicity scores into the feature vectors of RL-GenRisk. For
   mutation type, we utilized the TCGA data to calculate the proportion of
   each mutation type relative to the total number of mutations for each
   gene. For the pathogenicity score, we computed the CADD score^[266]110
   for each mutation, and the average CADD score across all mutations in a
   gene was used as the pathogenicity score feature. By adding different
   types of features, we evaluate three variations of RL-GenRisk,
   including mutation type only, pathogenicity score only, and both. The
   results show that after adding mutation type information separately to
   the feature vector, RL-GenRisk showed improvement in AP on the NCG
   dataset but showed decreases in DCG, N-DCG, and DCG-AUC on the NCG
   dataset, as well as decreases in AP, DCG, N-DCG, and DCG-AUC on the
   IntOGen and the Merged dataset (as shown in Supplementary
   Fig. [267]13). After adding pathogenicity scores separately to the
   feature vector, RL-GenRisk showed improvements in AP, DCG, and N-DCG on
   the NCG and decreases in DCG-AUC on the NCG and the Merged dataset, as
   well as decreases in AP and DCG-AUC on the IntOGen dataset (as shown in
   Supplementary Fig. [268]13). Moreover, after adding both mutation type
   information and pathogenicity scores to the feature vector, RL-GenRisk
   showed a decrease in AP, DCG, N-DCG, and DCG-AUC across the IntOGen,
   NCG, and the Merged dataset (as shown in Supplementary Fig. [269]13).
   These results indicated that adding more clinically informative
   features might improve the performance of the method, but further
   exploration is needed.

   Furthermore, we analyzed the overlap between the results of different
   methods by calculating the Jaccard similarity coefficient for the top
   100 genes identified by any two methods (Supplementary Fig. [270]18).
   The result indicated that the top 100 genes identified by Hierarchical
   HotNet and HotNet2 have the highest degree of overlap (with Jaccard
   similarity coefficient = 0.39), as they are similar methods. In
   general, there was a low overlap between the top 100 genes identified
   by different methods (Supplementary Fig. [271]18). Only one gene VHL is
   included in all the top 100 genes from different methods. VHL is a
   known risk gene for ccRCC and with the highest mutation frequency in
   TCGA ccRCC patients. In addition, seven genes were identified in the
   top 100 genes by at least six out of the nine methods, including VHL,
   PBRM1, SETD2, BAP1, MTOR, SPEN, and ATM. These six genes are all
   well-established risk genes for ccRCC. These results suggested that
   while different methods for ccRCC risk gene identification exhibit
   different performances, few genes like VHL, consistently emerge across
   multiple methods. Then, we conducted an analysis to determine the
   minimum number of genes required to encompass all known ccRCC risk
   genes for each method. Since nCOP only outputs a ranked list of genes
   and does not allow users to adjust the list length by setting different
   thresholds, we are unable to assess the minimum number of genes
   required to encompass all known ccRCC risk genes. For other methods,
   the results show that all tested methods require at least more than
   8000 genes to encompass all known ccRCC risk genes (Supplementary
   Table [272]7). This leads to a very low proportion of known risk genes
   in the identified gene list ( <1%). Including all known risk genes
   introduces numerous false positives, which significantly decreases the
   precision of the results. A potential future direction could involve
   developing methods that can identify all (or most) known risk genes
   while maintaining high precision.

   Since the interpretability of most reinforcement learning systems
   remains a significant challenge, to enhance understanding, we have
   incorporated several cues in the output to explain why certain genes
   are more or less likely to be associated with ccRCC. Our hypothesis is
   that genes with high ranks are likely to be in close proximity to known
   ccRCC-related genes. We have provided two types of supplementary
   information to measure the closeness between the predicted genes and
   known high-risk ccRCC genes. First, other than gene rank, we also
   output the average shortest path length between each gene and all known
   risk genes in the PPI network, as well as the FDR p-values obtained by
   the one-sided t-test that measures whether the average shortest path
   length between each gene and the known risk genes was significantly
   shorter. Second, we output the average cosine similarity between each
   gene’s feature embeddings and the feature embeddings of known risk
   genes, as well as the FDR p-values obtained by the one-sided t-test
   that measures whether the average cosine similarity of feature
   embeddings between each gene and the known risk genes was significantly
   higher. Supplementary Table [273]6 shows an example of the output of
   the top genes identified by RL-GenRisk. To further strengthen these
   analyses and facilitate intuitive interpretation, RL-GenRisk can output
   the comparison of the top K genes with K randomly selected genes
   (randomly sampled 100 times) in terms of their average shortest path
   lengths to known risk genes in the PPI network, as well as their
   average cosine similarities of feature embeddings with known risk
   genes. Users can evaluate the reasonableness of the top K predicted
   genes. For illustration purposes, we set K = 20 in this example. The
   results show that the average shortest path lengths between the top 20
   genes identified by RL-GenRisk and known risk genes within the PPI
   network are significantly shorter (FDR p-value = 2.73e-4, one-sided
   t-test, Supplementary Fig. [274]21A), and the average cosine
   similarities of the feature representations between the top 20 genes
   and known risk genes are significantly higher (FDR p-value = 4.61e-8,
   one-sided t-test, Supplementary Fig. [275]21B). These results indicate
   that the top 20 genes identified by RL-GenRisk are closely connected to
   known risk genes in both the network structure and feature space.

   To illustrate whether the top 20 genes identified by each method are
   included in the known ccRCC risk gene datasets, we provided heatmaps in
   Supplementary Fig. [276]4. The results indicated that RL-GenRisk
   identified the highest number of known risk genes (12 genes) among the
   top 20 genes, followed by nCOP (11 genes), highlighting RL-GenRisk’s
   capability in identifying ccRCC risk genes. Additionally, we provided
   the runtime of various methods in Supplementary Table [277]5. The
   results showed that HotNet2 and Hierarchical HotNet had the longest
   runtimes among the methods tested. In comparison, RL-GenRisk was
   completed in approximately 8 hours. Although some methods had shorter
   runtimes than RL-GenRisk, their overall performance was inferior to
   that of RL-GenRisk.

   RL-GenRisk can also be applied to identify risk genes for other types
   of cancer. We evaluated RL-GenRisk and other methods using additional
   tumor datasets, specifically focusing on Bladder Urothelial Carcinoma
   (BLCA) and Glioblastoma Multiforme (GBM) data from the TCGA dataset.
   The known risk genes for BLCA and GBM were extracted from the same
   source as ccRCC in our study. The results indicated that RL-GenRisk
   outperformed other methods on both BLCA and GBM (as shown in
   Supplementary Figs. [278]14 and [279]15), highlighting the potential of
   RL-GenRisk for application in other cancer types.

Methods

   Our research complies with all relevant ethical regulations and
   guidelines. Animal handling and experimental procedures were approved
   by the Ethical Review Committees of West China Hospital, Sichuan
   University. The animal studies were authorized by the Animal Ethics
   Review Committees of the West China Hospital, China (No. 20230214007).
   All animal experiments were strictly implemented in compliance with the
   NIH Guide for the Care and Use of Laboratory Animals.

Data preparation

   Consistent with previous studies^[280]29,[281]31, we used the
   information from the PPI network together with gene mutation data from
   patients. The feature matrix H for nodes is constructed based on the
   topological information of genes in the PPI network, the mutation
   frequency of genes in ccRCC patients, and the genes’ length.

PPI networks

   We collected protein-protein interactions from HPRD^[282]49,
   STRING-db^[283]50, Multinet^[284]51, IRefIndex^[285]52, and
   HumanNet^[286]53. Then, following the previous study^[287]31, we
   performed a two-step preprocess on these PPI networks. First, we
   excluded the nine longest genes (TTN, MUC16, SYNE1, NEB, MUC19,
   CCDC168, FSIP2, OBSCN, GPR98) as they tend to acquire numerous
   mutations by chance and cover many patients^[288]31. Second, to
   mitigate the noise due to the dense connectivity in PPI networks, we
   applied the diffusion state distance (DSD)^[289]111 metric on these PPI
   networks.

Gene mutation data

   We collected somatic mutation of genes in ccRCC patients from The
   Cancer Genome Atlas. Each gene corresponds to a patient list containing
   patients who carry mutations in that gene.

Feature representation

   The initial feature of gene v is represented as
   [MATH: <msub><mrow><mi
   mathvariant="bold">h</mi></mrow><mrow><mi>v</mi></mrow></msub><mo>∈</mo
   ><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mn>1</mn><mo>×</mo><mi>
   k</mi></mrow></msup> :MATH]
   , where k is the dimension. In RL-GenRisk, the node feature is flexible
   and can have different dimensions. We set k = 3 in RL-GenRisk.
   Therefore, the initial feature of gene v can be represented as
   [MATH: <msub><mrow><mi
   mathvariant="bold">h</mi></mrow><mrow><mi>v</mi></mrow></msub><mo>=</mo
   ><mrow><mo>[</mo><mrow><msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</mi>
   </mrow><mrow><mrow><mo>(</mo><mrow><mn>0</mn></mrow><mo>)</mo></mrow></
   mrow></msubsup><mo>,</mo><msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</m
   i></mrow><mrow><mrow><mo>(</mo><mrow><mn>1</mn></mrow><mo>)</mo></mrow>
   </mrow></msubsup><mo>,</mo><msubsup><mrow><mi>h</mi></mrow><mrow><mi>v<
   /mi></mrow><mrow><mrow><mo>(</mo><mrow><mn>2</mn></mrow><mo>)</mo></mro
   w></mrow></msubsup></mrow><mo>]</mo></mrow> :MATH]
   , which respectively measures the topological importance of v in the
   PPI network, the mutation frequency in patients, and the length
   information of the gene. We first use the degree of v as the first
   dimension of h[v] since it can represent the topological importance of
   v in the PPI network:
   [MATH:
   <msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</mi></mrow><mrow><mrow><mo>
   (</mo><mrow><mn>0</mn></mrow><mo>)</mo></mrow></mrow></msubsup><mo>=</m
   o><mfrac><mrow><mi
   mathvariant="normal">Degree</mi><mrow><mo>(</mo><mrow><mi>v</mi></mrow>
   <mo>)</mo></mrow></mrow><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>d<
   /mi><mi>e</mi><mi>g</mi><mi>r</mi><mi>e</mi><mi>e</mi></mrow></msub></m
   row></mfrac><mo>,</mo> :MATH]
   1

   where Degree(v) represents the node degree of gene v in the PPI
   network. μ[degree] is a hyperparameter for normalization. We set
   μ[degree] equal to the maximum degree of nodes in the PPI network.
   Then, the second dimension of h[v] considers the mutation frequency of
   gene v in ccRCC patients:
   [MATH:
   <msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</mi></mrow><mrow><mrow><mo>
   (</mo><mrow><mn>1</mn></mrow><mo>)</mo></mrow></mrow></msubsup><mo>=</m
   o><mfrac><mrow><msubsup><mrow><mi>N</mi></mrow><mrow><mi>p</mi></mrow><
   mrow><mi>v</mi></mrow></msubsup></mrow><mrow><msub><mrow><mi>N</mi></mr
   ow><mrow><mi>p</mi></mrow></msub></mrow></mfrac><mo>,</mo> :MATH]
   2

   where
   [MATH:
   <msubsup><mrow><mi>N</mi></mrow><mrow><mi>p</mi></mrow><mrow><mi>v</mi>
   </mrow></msubsup> :MATH]
   represents the number of patients carrying mutation on v and N[p]
   represents the total number of patients. Moreover, the initial feature
   should consider the length information of the gene, since the longer
   genes tend to include more mutations by change. To dilute this type of
   effect, the third dimension of h[v] is designed as follows:
   [MATH:
   <msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</mi></mrow><mrow><mrow><mo>
   (</mo><mrow><mn>2</mn></mrow><mo>)</mo></mrow></mrow></msubsup><mo>=</m
   o><mfrac><mrow><mi
   mathvariant="normal">Length</mi><mrow><mo>(</mo><mrow><mi>v</mi></mrow>
   <mo>)</mo></mrow></mrow><mrow><mi>α</mi><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>N</mi></mrow><mrow><mi>p</mi
   ></mrow><mrow><mi>v</mi></mrow></msubsup></mrow></mfrac><mo>,</mo>
   :MATH]
   3

   where Length(v) represents the gene length of v, and α is a
   normalization parameter. We set α equal to the node number of a
   connected subgraph in the PPI network that each patient carried
   mutations on at least one gene in this subgraph. Finally, the initial
   feature of gene was defined as
   [MATH: <msub><mrow><mi
   mathvariant="bold">h</mi></mrow><mrow><mi>v</mi></mrow></msub><mo>=</mo
   ><mrow><mo>[</mo><mrow><msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</mi>
   </mrow><mrow><mrow><mo>(</mo><mrow><mn>0</mn></mrow><mo>)</mo></mrow></
   mrow></msubsup><mo>,</mo><msubsup><mrow><mi>h</mi></mrow><mrow><mi>v</m
   i></mrow><mrow><mrow><mo>(</mo><mrow><mn>1</mn></mrow><mo>)</mo></mrow>
   </mrow></msubsup><mo>,</mo><msubsup><mrow><mi>h</mi></mrow><mrow><mi>v<
   /mi></mrow><mrow><mrow><mo>(</mo><mrow><mn>2</mn></mrow><mo>)</mo></mro
   w></mrow></msubsup></mrow><mo>]</mo></mrow> :MATH]
   , and the initial feature matrix for all genes was represented as
   [MATH: <mi mathvariant="bold">H</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>n</mi><mo>×</mo><mi>
   k</mi></mrow></msup> :MATH]
   , where n is the number of genes.

Key elements in RL-GenRisk

   In RL-GenRisk, the identification of ccRCC risk genes is framed as a
   Markov Decision Process. At each step, the policy of RL-GenRisk
   receives the current state as input and selects an action. The action
   is selecting a node that connects with the sampled subgraph and
   appending this node to the sampled subgraph. Thus, this sampled
   subgraph adds a new node at each step. A reward is obtained after
   taking an action. Therefore, there are three key elements in
   RL-GenRisk, including state, action, and reward. These three key
   elements are defined as follows:

State

   The state s at step t is represented as s[t], which consists of the
   feature matrix H and the current subgraph G[t] sampled from the PPI
   network. In the first step, RL-GenRisk creates an empty subgraph and
   randomly adds a node to it. The design of incorporating the sampled
   subgraph into the state allows RL-GenRisk to delve into localized
   network structures. This enables RL-GenRisk to focus on the interaction
   relationships relevant to the current sampled subgraph at each step.
   Therefore, the subgraph information in the state enables RL-GenRisk to
   accurately estimate its current environment and adjust its actions
   accordingly.

Action

   At step t, the action a[t] is selecting a node that connects with the
   sampled subgraph G[t] and appending this node to G[t]. The action space
   at step t is represented as A[t], which contains nodes that connect
   with G[t]. In detail, after getting Q values for all possible actions,
   RL-GenRisk uses an ϵ-greedy strategy to select an action a[t]:
   [MATH:
   <msub><mrow><mi>a</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>=</mo><m
   fenced open="{"><mrow><mtable><mtr><mtd columnalign="left"><mi
   mathvariant="normal">random</mi><mspace
   width="0.25em"></mspace><mi>a</mi><mo>∈</mo><msub><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>t</mi></mrow></msub></mtd><mt
   d columnalign="left"><mi mathvariant="normal">if</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mspace
   width="0.25em"></mspace><mo><</mo><mspace
   width="0.25em"></mspace><mi>ϵ</mi></mtd></mtr><mtr><mtd
   columnalign="left"><mi>arg</mi><msub><mrow><mi>max</mi></mrow><mrow><mi
   >a</mi><mo>∈</mo><msub><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow></
   msub><mi>Q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>;</mo
   ><mi>a</mi><mo>;</mo><mi>θ</mi></mrow><mo>)</mo></mrow></mtd><mtd
   columnalign="left"><mi
   mathvariant="normal">otherwise</mi><mo>,</mo></mtd></mtr></mtable></mro
   w></mfenced> :MATH]
   4

   where Q(s[t]; a; θ) represents the Q value of action a calculated by
   the policy based on the current state s[t], θ stands for the parameters
   of the policy. A[t] represents the action space at step t. With a
   probability p not exceeding ϵ, we select an action in action space
   randomly. Otherwise, the action with the highest Q value is selected.
   The ϵ-greedy strategy can enhance the exploratory nature of the policy,
   effectively preventing the model from getting stuck in local optimal
   during training. Same with the previous study^[290]112, we set ϵ equal
   to 0.95. To balance exploration and exploitation, ϵ is decreased
   gradually during the training process.

Reward

   In reinforcement learning algorithms, the design of the reward is
   crucial to the performance of the algorithm. In this study, we designed
   a specific data-driven reward based on the current sampled subgraph.
   The sampled subgraph G[t] at step t is expected to cover more patients
   so that risk genes are more likely to appear in it. Instead of focusing
   on an individual gene, RL-GenRisk focuses on a sampled subgraph G[t]
   with node feature matrix H. This makes RL-GenRisk effectively leverage
   information from both the PPI network and gene mutation data. More
   importantly, this design of the reward ensures the accurate
   identification of genes with high mutation frequencies, and also
   identifies potential risk genes with low mutation frequencies but
   functionally interacting with genes that have high mutation
   frequencies. However, only considering the number of patients covered
   by the sampled subgraph G[t] is not enough. Longer genes are more
   likely to mutate by chance in patients. Therefore, shorter genes with
   high mutation frequencies are more likely to be risk genes^[291]31. To
   take gene length into account, we integrate gene length information
   when designing the reward. Thus, the single-step reward is higher when
   the sampled subgraph covers more patients and the genes in the subgraph
   have shorter lengths. The single-step reward r[t] at step t is designed
   as:
   [MATH:
   <msub><mrow><mi>r</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>=</mo><m
   sub><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>−</mo><msu
   b><mrow><mi>R</mi></mrow><mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow></msu
   b><mo>=</mo><mi>δ</mi><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mfrac><mrow><msubsup><mrow><mi>N</m
   i></mrow><mrow><mi>p</mi></mrow><mrow><msub><mrow><mi>G</mi></mrow><mro
   w><mi>t</mi></mrow></msub></mrow></msubsup></mrow><mrow><msub><mrow><mi
   >N</mi></mrow><mrow><mi>p</mi></mrow></msub></mrow></mfrac></mrow></mfe
   nced><mo>+</mo><mrow><mo>(</mo><mrow><mn>1</mn><mo>−</mo><mi>δ</mi></mr
   ow><mo>)</mo></mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>v</mi><mo>∈</mo><msub><mrow><mi>G
   </mi></mrow><mrow><mi>t</mi></mrow></msub></mrow></msub><mfrac><mrow><m
   i
   mathvariant="normal">Length</mi><mrow><mo>(</mo><mrow><mi>v</mi></mrow>
   <mo>)</mo></mrow></mrow><mrow><msubsup><mrow><mi>N</mi></mrow><mrow><mi
   >p</mi></mrow><mrow><mi>v</mi></mrow></msubsup></mrow></mfrac><mo>,</mo
   > :MATH]
   5

   where r[t] represents the single-step reward at step t. δ is a weight
   hyperparameter. R[t] provides an evaluation score of the current state,
   with
   [MATH:
   <msubsup><mrow><mi>N</mi></mrow><mrow><mi>p</mi></mrow><mrow><msub><mro
   w><mi>G</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow></msubsup>
   :MATH]
   representing the number of patients that carry at least one mutation on
   the genes in the sampled subgraph G[t],
   [MATH:
   <msubsup><mrow><mi>N</mi></mrow><mrow><mi>p</mi></mrow><mrow><mi>v</mi>
   </mrow></msubsup> :MATH]
   represents the number of patients carrying mutations on gene v, and
   N[p]represents the total number of patients. Length(v) represents the
   length of gene v. The cumulative reward is defined as the sum of all
   single-step rewards.

Policy network

   In RL-GenRisk, the policy takes the current state as input and outputs
   Q values for all possible actions. The policy of RL-GenRisk is
   represented by a neural network which is usually referenced as the
   policy network. Specifically, the policy of RL-GenRisk comprises two
   main components: a graph convolutional network (GCN) and a node
   evaluation network. We also use three multi-layer perceptrons (MLP) to
   perform dimensionality transformation. The GCN aggregates neighborhood
   information to get the representation for each node by multiplying the
   graph laplacian^[292]42 with the node feature matrix. Given the feature
   matrix
   [MATH: <mi mathvariant="bold">H</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>n</mi><mo>×</mo><mi>
   k</mi></mrow></msup> :MATH]
   , n is the number of nodes and k is the dimension, the hidden
   representation of the l-th graph convolutional layer is calculated as:
   [MATH: <msup><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mi>l</mi></mrow></msup><mo>=</mo
   ><mi>σ</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">L</mi><msup><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mi>l</mi><mo>−</mo><mn>1</mn></m
   row></msup><msup><mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mi>l</mi></mrow></msup></mrow><m
   o>)</mo></mrow><mo>,</mo> :MATH]
   6

   where H^l represents the hidden representation of the l-th graph
   convolutional layer.
   [MATH: <mi mathvariant="bold">L</mi><mo>=</mo><msup><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">D</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mo>−</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mn>2</mn></mrow></m
   frac></mrow></msup><mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mo>^</mo></mrow></mover><msup><m
   row><mover accent="true"><mrow><mi
   mathvariant="bold">D</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mo>−</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mn>2</mn></mrow></m
   frac></mrow></msup> :MATH]
   represents the normalized graph laplacian, which is used to aggregate
   neighborhood information. GCN preserves the original node signal by
   adding self-connections:
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mo>^</mo></mrow></mover><mo>=</m
   o><mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mo>~</mo></mrow></mover><mo>+</m
   o><mi mathvariant="bold">I</mi> :MATH]
   ,
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mo>~</mo></mrow></mover> :MATH]
   represents the adjacency matrix and I represents the identity matrix.
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold">D</mi></mrow><mrow><mo>^</mo></mrow></mover> :MATH]
   represents the degree matrix for
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mo>^</mo></mrow></mover> :MATH]
   and
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">D</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mi>i</mi><mi>i</mi></mrow></msub><mo>=</mo><msub><mrow><mo>∑</mo>
   </mrow><mrow><mi>j</mi></mrow></msub><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mi>i</mi><mi>j</mi></mrow></msub> :MATH]
   , W^l represents the trainable matrix of the l-th graph convolutional
   layer. σ represents the non-linear activation function.

   In RL-GenRisk, we use two graph convolutional layers. Before the first
   graph convolutional layer receives data, we use an MLP to perform
   dimensionality transformation on the initial features:
   [MATH: <msup><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mn>0</mn></mrow></msup><mo>=</mo
   ><msub><mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mn>1</mn></mrow></msub><mi
   mathvariant="bold">H</mi><mo>,</mo> :MATH]
   7

   where H represents the initial feature matrix, W[1] represents a
   trainable projection matrix. H^0 represents the input feature matrix of
   the first graph convolutional layer.

   Inspired by residual networks^[293]113, we concatenate the hidden
   representation matrices of two graph convolutional layers with H^0, and
   use another MLP for dimension transformation to obtain the final
   representations matrix
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover> :MATH]
   :
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover><mo>=</m
   o><msub><mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mn>2</mn></mrow></msub><mrow><mo
   >(</mo><mrow><mspace
   width="0.25em"></mspace><mstyle><mtext>Concat</mtext></mstyle><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><msup><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mn>0</mn></mrow></msup><mo>,</mo
   ><msup><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mn>1</mn></mrow></msup><mo>,</mo
   ><msup><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mn>2</mn></mrow></msup></mrow><m
   o>)</mo></mrow></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
   8

   where H^1, H^2 represent the hidden representation matrices of the two
   graph convolutional layers, W[2] represents a trainable projection
   matrix, and Concat( ⋅ ) represents the concatenation operator.

   At step t, after getting the representations matrix by GCN, RL-GenRisk
   calculates the representation of the sampled subgraph through the
   averaging pooling operation and uses the MLP for dimension
   transformation:
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><msub><mrow><mi>G</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow>
   </msub><mo>=</mo><msub><mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mn>3</mn></mrow></msub><mfenced
   close=")" open="("><mrow><mfrac><mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>v</mi><mo>∈</mo><msub><mrow><mi>G
   </mi></mrow><mrow><mi>t</mi></mrow></msub></mrow></msub><msub><mrow><mo
   ver accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mi>v</mi></mrow></msub></mrow><mrow><mi
   mathvariant="normal">Size</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>G</
   mi></mrow><mrow><mi>t</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow>
   </mfrac></mrow></mfenced><mo>,</mo> :MATH]
   9

   where
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><msub><mrow><mi>G</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow>
   </msub> :MATH]
   represents the representation of sampled subgraph G[t],
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mi>v</mi></mrow></msub> :MATH]
   represents the representation of gene v, Size( ⋅ ) represents the
   number of nodes in the sampled subgraph, and W[3] represents a
   trainable projection matrix.

   Then, we use a two-layer MLP as the node evaluation network to
   calculate the Q values for each action in action space:
   [MATH: <mi>Q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>;</mo
   ><mi>a</mi><mo>;</mo><mi>θ</mi></mrow><mo>)</mo></mrow><mo>=</mo><msub>
   <mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mn>5</mn></mrow></msub><mrow><mo
   >(</mo><mrow><mi>σ</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mn>4</mn></mrow></msub><mrow><mo
   >(</mo><mrow><mi
   mathvariant="normal">Concat</mi><mrow><mo>(</mo><mrow><msub><mrow><move
   r accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><msub><mrow><mi>G</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow>
   </msub><mo>,</mo><msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mi>v</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mo>)</mo><
   /mrow></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
   10

   where Q(s[t]; a; θ) represents the Q value for action a based on the
   current state s[t], θ stands for the parameters of the policy.
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><msub><mrow><mi>G</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow>
   </msub> :MATH]
   represents the representation of sampled subgraph G[t], v represents
   the gene that selected by action a, and
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">H</mi></mrow><mrow><mo>^</mo></mrow></mover></mrow><
   mrow><mi>v</mi></mrow></msub> :MATH]
   represents the representation of gene v. W[4] and W[5] represent two
   trainable projection matrices. σ represents the non-linear activation
   function. By using a two-layer MLP that can continuously update
   parameters during the training process as the node evaluation network,
   RL-GenRisk can better predict Q values for actions. More details about
   the policy network can be found in Supplementary Table [294]3.

Policy training for high-confidence risk gene identification

   Consistent with the previous study^[295]31, we sampled the training
   data, randomly collecting 85% samples from 379 ccRCC patients before
   the training process started. In the training process, since we model
   the ccRCC risk gene identification as a Markov Decision Process, the
   policy of RL-GenRisk iteratively receives the current state as input,
   selects an action, and obtains a reward. The DQN algorithm^[296]43,
   which is widely used in reinforcement learning methods, is employed to
   train the policy of RL-GenRisk. DQN uses two sets of Q values
   calculated by an online network and a target network, respectively. The
   online network in our study is the neural network of RL-GenRisk. The
   target network is designed to prevent the online network from
   overestimating Q values. The architecture of the target network is the
   same as that of the online network, but their parameters are different.
   In RL-GenRisk, the loss function is defined as follows:
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">L</mi><mo>=</mo><mi
   mathvariant="double-struck">E</mi><mrow><mo>[</mo><mrow><msup><mrow><mr
   ow><mo>(</mo><mrow><msub><mrow><mi>r</mi></mrow><mrow><mi>t</mi></mrow>
   </msub><mo>+</mo><mi>γ</mi><msub><mrow><mi>max</mi></mrow><mrow><msub><
   mrow><mi>a</mi></mrow><mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub
   ><mo>∈</mo><msub><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>t</mi><mo>+</mo><mn>1</mn></m
   row></msub></mrow></msub><mi>Q</mi><mrow><mo>(</mo><mrow><msub><mrow><m
   i
   mathvariant="bold">s</mi></mrow><mrow><mi>t</mi><mo>+</mo><mn>1</mn></m
   row></msub><mo>;</mo><msub><mrow><mi>a</mi></mrow><mrow><mi>t</mi><mo>+
   </mo><mn>1</mn></mrow></msub><mo>;</mo><msup><mrow><mi>θ</mi></mrow><mr
   ow><mo>′</mo></mrow></msup></mrow><mo>)</mo></mrow><mo>−</mo><mi>Q</mi>
   <mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>;</mo
   ><msub><mrow><mi>a</mi></mrow><mrow><mi>t</mi></mrow></msub><mo>;</mo><
   mi>θ</mi></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mrow><
   mn>2</mn></mrow></msup></mrow><mo>]</mo></mrow><mo>,</mo> :MATH]
   11

   where
   [MATH: <mi class="MJX-tex-caligraphic" mathvariant="script">L</mi>
   :MATH]
   represents the loss to be minimized. r[t] stands for the reward
   received at the step t, and γ is the discount factor.
   [MATH:
   <msub><mrow><mi>max</mi></mrow><mrow><msub><mrow><mi>a</mi></mrow><mrow
   ><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub><mo>∈</mo><msub><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>t</mi><mo>+</mo><mn>1</mn></m
   row></msub></mrow></msub><mi>Q</mi><mrow><mo>(</mo><mrow><msub><mrow><m
   i
   mathvariant="bold">s</mi></mrow><mrow><mi>t</mi><mo>+</mo><mn>1</mn></m
   row></msub><mo>;</mo><msub><mrow><mi>a</mi></mrow><mrow><mi>t</mi><mo>+
   </mo><mn>1</mn></mrow></msub><mo>;</mo><msup><mrow><mi>θ</mi></mrow><mr
   ow><mo>′</mo></mrow></msup></mrow><mo>)</mo></mrow> :MATH]
   represents the target network estimate of the maximum expected Q value
   for the next state s[t+1] and possible action a[t+1]. A[t+1] represents
   the action space. Q(s[t]; a[t]; θ) represents the Q value calculated by
   the online network based on the current state s[t] and action a[t]. θ
   and
   [MATH: <msup><mrow><mi>θ</mi></mrow><mrow><mo>′</mo></mrow></msup>
   :MATH]
   represent the parameters of the online network and target network,
   respectively. θ is updated through gradient backward based on the loss
   function. Then
   [MATH: <msup><mrow><mi>θ</mi></mrow><mrow><mo>′</mo></mrow></msup>
   :MATH]
   is updated through soft updates based on the parameters of the online
   network as follows:
   [MATH:
   <msup><mrow><mi>θ</mi></mrow><mrow><mo>′</mo></mrow></msup><mo>=</mo><m
   i>τ</mi><mi>θ</mi><mo>+</mo><mrow><mo>(</mo><mrow><mn>1</mn><mo>−</mo><
   mi>τ</mi></mrow><mo>)</mo></mrow><msup><mrow><mi>θ</mi></mrow><mrow><mo
   >′</mo></mrow></msup><mo>,</mo> :MATH]
   12

   where τ represents a hyperparameter that controls the proportion of
   each update. More details about the hyperparameters in RL-GenRisk are
   shown in Supplementary Table [297]4. Moreover, we provided recommended
   ranges for each hyperparameter in Supplementary Table [298]4. Users can
   select hyperparameters within the recommended ranges. Additionally,
   users who wish to try a wider range of hyperparameter values can also
   use the grid search^[299]114 to select hyperparameter values.

   In the identification process, RL-GenRisk differs from existing methods
   that predict outcomes following the same procedure as the training
   process^[300]23,[301]29,[302]31. Instead, capitalizing on the advantage
   of RL-GenRisk combined with reinforcement learning, we have designed a
   concise and effective identification process. Specifically, the model
   with the highest cumulative reward during the training process is
   selected as the best model and loaded. The sampled subgraph is
   initialized as empty, and all nodes in the PPI network are included in
   the action space at the beginning. The policy calculates Q values for
   all genes, and subsequently, the final ranking list of ccRCC risk genes
   is ordered based on these Q values. The higher Q value indicates higher
   risk. We analyzed the top 20 high-confidence risk genes in our study:
   [MATH:
   <msub><mrow><mi>H</mi><mi>R</mi><mi>G</mi></mrow><mrow><mi>t</mi><mi>o<
   /mi><mi>p</mi><mn>20</mn></mrow></msub><mo>=</mo><mrow><mo>{</mo><mrow>
   <mi>g</mi><mo>∈</mo><mi>G</mi><mo>∣</mo><mspace
   width="0.25em"></mspace><mi>Q</mi><mrow><mo>(</mo><mrow><mi>g</mi></mro
   w><mo>)</mo></mrow><mo>∈</mo><msub><mrow><mi>Q</mi></mrow><mrow><mi>t</
   mi><mi>o</mi><mi>p</mi><mn>20</mn></mrow></msub></mrow><mo>}</mo></mrow
   ><mo>,</mo> :MATH]
   13

   where HRG[top20] represents the top 20 high-confidence risk genes. g
   represents a gene and G represents the PPI network. Q(g) represents the
   Q value of gene g and Q[top20] represents the top 20 highest Q values
   in the set of all Q values calculated at the identification process.
   Researchers can select the number of top genes for verification based
   on actual verification costs.

Performance evaluation

   Known ccRCC risk genes were sourced from the IntOGen cancer-specific
   database^[303]40, the Network of Cancer Genes (NCG) database^[304]54,
   the Cancer Gene Census (CGC) database^[305]55, and a cancer risk gene
   set derived from a recent comprehensive study by Bailey et al.^[306]56.
   The known ccRCC risk genes used in our study are provided in
   Supplementary Data [307]1. Due to the limited number of genes in the
   CGC (nine genes) and Bailey et al. (twelve genes) lists, we ultimately
   used three datasets to evaluate different methods: IntOGen, NCG, and a
   “Merged” dataset that combined IntOGen, NCG, Bailey et al., and CGC.
   First, the performance was evaluated using the Discounted Cumulative
   Gain (DCG), which has been used for evaluating cancer risk gene
   identification in the previous study^[308]29. The DCG is calculated as:
   [MATH: <mi mathvariant="normal">DCG</mi><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>N</mi></mrow></msubsup><mfrac><mrow><mi>r</mi><mi>e</mi><msub
   ><mrow><mi>l</mi></mrow><mrow><msub><mrow><mi>g</mi></mrow><mrow><mi>i<
   /mi></mrow></msub></mrow></msub></mrow><mrow><mi>l</mi><mi>o</mi><msub>
   <mrow><mi>g</mi></mrow><mrow><mn>2</mn></mrow></msub><mrow><mo>(</mo><m
   row><mi>i</mi><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow></mfra
   c><mo>,</mo> :MATH]
   14

   where N represents the number of the identified ccRCC risk genes.
   [MATH:
   <mi>r</mi><mi>e</mi><msub><mrow><mi>l</mi></mrow><mrow><msub><mrow><mi>
   g</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   is equal to 1 if the i-th gene g[i] is contained in the known risk gene
   database, and 0 otherwise. Therefore, the higher the ranking of known
   ccRCC risk genes in the prediction results, the higher the DCG score.
   Following the previous study^[309]31, the top 100 identified ccRCC risk
   genes were considered in the performance evaluation. The top 100 genes
   identified by different methods are listed in Supplementary
   Data [310]2. Additionally, to provide a more comprehensive evaluation
   of different methods, we presented DCG curves and calculated the area
   under the DCG curve. We also used the normalized DCG (N-DCG) and the
   average precision (AP) as evaluation metrics. The N-DCG is calculated
   as:
   [MATH:
   <mstyle><mtext>N-DCG</mtext></mstyle><mo>=</mo><mfrac><mrow><mspace
   width="0.25em"></mspace><mstyle><mtext>DCG</mtext></mstyle></mrow><mrow
   ><mstyle><mtext>I-DCG</mtext></mstyle></mrow></mfrac><mo>,</mo><mspace
   width="0.25em"></mspace><mstyle><mtext>I-DCG</mtext></mstyle><mspace
   width="0.25em"></mspace><mo>=</mo><munderover accent="false"
   accentunder="false"><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi
   ></mrow></munderover><mfrac><mrow><mi>r</mi><mi>e</mi><msubsup><mrow><m
   i>l</mi></mrow><mrow><msub><mrow><mi>g</mi></mrow><mrow><mi>i</mi></mro
   w></msub></mrow><mrow><mi>i</mi><mi>d</mi><mi>e</mi><mi>a</mi><mi>l</mi
   ></mrow></msubsup></mrow><mrow><msub><mrow><mi>log</mi></mrow><mrow><mn
   >2</mn></mrow></msub><mrow><mo>(</mo><mrow><mi>i</mi><mo>+</mo><mn>1</m
   n></mrow><mo>)</mo></mrow></mrow></mfrac> :MATH]
   15

   where
   [MATH:
   <mi>r</mi><mi>e</mi><msubsup><mrow><mi>l</mi></mrow><mrow><msub><mrow><
   mi>g</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow><mi>i</mi><m
   i>d</mi><mi>e</mi><mi>a</mi><mi>l</mi></mrow></msubsup> :MATH]
   is the relevance score of the item at g[i] in the ideal ranking. The AP
   is calculated as:
   [MATH: <mstyle><mtext>AP</mtext></mstyle><mspace
   width="0.25em"></mspace><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><
   mi>N</mi></mrow></mfrac><munderover accent="false"
   accentunder="false"><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mn>100</
   mn></mrow></munderover><mi>P</mi><mrow><mo>(</mo><mrow><mi>i</mi></mrow
   ><mo>)</mo></mrow><mo>⋅</mo><mi>r</mi><mi>e</mi><msub><mrow><mi>l</mi><
   /mrow><mrow><msub><mrow><mi>g</mi></mrow><mrow><mi>i</mi></mrow></msub>
   </mrow></msub><mo>,</mo> :MATH]
   16

   where N represents the number of the ccRCC risk genes. P(i) represents
   the precision for the top i genes, which is calculated by dividing the
   number of known risk genes in identified genes by the total number of
   identified genes.
   [MATH:
   <mi>r</mi><mi>e</mi><msub><mrow><mi>l</mi></mrow><mrow><msub><mrow><mi>
   g</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   is equal to 1 if the i-th gene g[i] is contained in the known risk gene
   database, and 0 otherwise.

Threshold selection for top K genes

   To determine the optimal number of top K genes for downstream analysis,
   we evaluated the proportion of known ccRCC risk genes across various
   values of K (where K = 20, 30, 40, …, 100). We observed that when K was
   set to 20, the average proportion of known ccRCC risk genes exceeded
   50% (Supplementary Fig. [311]19). As K increased, the proportion of
   known ccRCC risk genes decreased (Supplementary Fig. [312]19). Based on
   this observation, we suggested that selecting K = 20 generates a gene
   set with a higher likelihood of including unknown ccRCC risk genes.
   Additionally, considering the potential downstream analysis, selecting
   K = 20 kept the analysis cost within a manageable range in our study.

Statistical analysis

Gene set enrichment analysis

   We used the g:Profiler^[313]61 for running functional enrichment
   analysis of the top 20 high-confidence risk genes identified by
   RL-GenRisk. g:Profiler maps genes to known functional information
   sources and detects statistically significantly enriched terms. We
   performed enrichment analysis on Gene Ontology, Human Phenotype
   Ontology, and WikiPathways. We used FDR p-value  < 0.05 as the
   significance threshold.

Survival analysis

   Clinical data of ccRCC patients and protein expression data of EGFR
   were obtained from TCGA. We used cSurvival^[314]115 to perform
   progression-free survival and disease-specific survival analysis on
   these data. Progression-free survival utilizes the time from
   randomization or initiation of treatment to the occurrence of disease
   progression or death^[315]77. Disease-specific survival refers to
   deaths caused specifically by a particular disease^[316]78. Patients
   were categorized into quartiles based on the expression levels of
   protein encoded by EGFR. The Kaplan-Meier estimator was used to
   generate survival curves, and the difference was assessed using the
   log-rank test^[317]116.

In vitro and in vivo experiments

Cell lines

   ACHN and 786-O cells were purchased from Shanghai Zhong Qiao Xin Zhou
   Biotechnology Co., Ltd. (Shanghai, China), and were cultured in DMEM
   and RPMI-1640 medium (HyClone, Utah, USA), respectively, supplemented
   with 10% fetal bovine serum (FBS, Gibco, Australia) and 1% antibiotics
   (penicillin and streptomycin, HyClone) in a humidified incubator
   containing 5% CO[2] at 37 °C. The stable cell lines of ACHN-shEGFR,
   786-O-shEGFR, 786-O-EGFR, ACHN-shPCLO, and 786-O-shPCLO were obtained
   as described previously^[318]117.

High-content screening analysis

   Cells were inoculated into a 96-well plate at an optimal density to
   ensure ideal growth and interaction. A high-content live-cell imaging
   system was utilized to monitor the cells at multiple time points,
   assessing various parameters. This imaging system, equipped with
   fluorescence microscopy capabilities, allows for real-time
   visualization of cell morphology and dynamics. At each time point,
   images were automatically captured, focusing on specific areas of
   interest within each well. The analysis included measuring cell
   perimeter, calculating movement speed and accumulated distance. The
   above analysis was done using Opera Phenix Plus and Harmony 5.2.

Western blot analysis

   Total protein was extracted using TPER solution from Thermo Fisher
   Scientific. The protein concentration was measured using a BCA kit from
   Pierce. The proteins were separated using 10% SDS-PAGE. The proteins
   were transferred onto a PVDF membrane, and the membrane was blocked
   with 5% skimmed milk for 1 h at room temperature (RT). Then, the
   membrane was incubated with primary rabbit anti-human EGFR antibodies
   ([319]R22778, ZenBio, WB: 1:1000), rabbit anti-PCLO antibody
   (HPA015858, Sigma-Aldrich, WB: 1:1000) and mouse anti-GAPDH antibody
   (250133, ZenBio, WB: 1:5000) at 4 °C overnight. After three washes with
   TBST for 10 min, the membrane was incubated with horseradish peroxidase
   (HRP)-conjugated goat antirabbit secondary antibodies (458, MBL) for 1
   h at RT. Finally, the protein bands were detected using an HRP
   substrate. All experiments were performed in triplicate.

Quantitative polymerase chain reaction (qPCR)

   Total RNA was extracted from the cells using a QIAGEN kit (74104), and
   the RNA concentration was determined using a NanoPhotometer from Implen
   (München). cDNA was synthesized using a RevertAid first-strand cDNA
   synthesis kit from Thermo Fisher Scientific (Waltham). PCR was
   performed using the following protocol: 98 °C for 2 min and 40 cycles
   at 98 °C for 5 s and 60 °C for 10 s. The relative gene expression was
   calculated using 2^−ΔΔCt. β-Actin was used as an internal control. All
   experiments were performed more than triplicate.

Cell counting kit-8 assay

   The cell counting kit-8 (CCK-8) assay was performed using a kit from
   Dojindo. A total of 3000 cells/well were cultured in a 96-well plate
   for 24 h, and 10 μL of CCK-8 solution was added to the wells. Then, the
   plate was incubated for 2 h at 37 °C, and the optical density (OD)
   value was detected at 450 nm using a microplate reader. All experiments
   were performed in triplicate.

Cell apoptosis

   An Annexin V-Alexa Fluor 647/PI Kit was purchased from 4A Biotech
   (Suzhou, China). The cells were digested and washed twice with cold
   phosphate-buffered saline (PBS). Next, the 1 × binding buffer was used
   to suspend cells to a concentration of 1−5 × 10^6 cells/mL. Then 5 μL
   of Annexin V/Alexa Fluor 647 was added to 100 μL of cells, and the
   mixture was incubated at RT for 5 min in the dark. Finally, the flow
   cytometry assay was performed after adding 10 μL of 20 μg/mL propidium
   iodide (PI) and 400 μL of PBS. Figures illustrating the gating strategy
   are provided in the [320]Supplementary Materials.

Transwell assay

   Transwell migration assays were conducted using a Transwell chamber
   from Corning (REF 3422, Arizona, USA). Briefly, Transwell chambers were
   placed on a 24-well plate. Fresh medium containing 10% FBS in 600 μL
   was added to the lower chambers, and (2–5) × 10^4 cells in 200 μL of
   medium without FBS were added to the upper chamber. The 24-well plate
   was incubated at 37 °C for 48 h. Cells that invaded through the chamber
   were washed, fixed (20 min with 4% paraformaldehyde), and stained
   (30 min with crystal violet). Then, the upper chambers were washed,
   photographed, and preserved under an inverted fluorescence OBSERVER
   D1/AX10 cam HRC microscope (Zeiss). Transferred cells were analyzed
   using ImageJ software.

Colony formation assay

   The cells (500/well) were seeded into a 6-well plate and cultured at
   37 °C for 7–10 days. Then, the clones were imaged using a Celigo
   imaging cytometer from Nexcelom (Lawrence), and the clones were counted
   using ImageJ software. All experiments were performed in triplicate.

Animal study

   Four-week-old male BALB/C-nu/nu mice were purchased from GemPharmatech.
   All these mice were maintained in pathogen-free conditions at 24 °C/50%
   humidity, with a light/dark cycle for 12 h, and given a free supply of
   food (reproductive diet, catalog number: F010201) and water. A total of
   5 × 10^6 cells were inoculated into the right flank of mice, and the
   tumor volume was recorded every 3 days starting from day 17 after
   injection of the tumor cells. Mice were administered with erlotinib
   (S1023, Selleck, Shanghai, China) at a dose of 50 mg/kg. The tumor
   volume was calculated using the following equation: L × W^2 × 0.5236,
   where L is tumor length and W is tumor width^[321]117. The maximum
   tumor diameter allowed by the Ethics Committee is 2 cm. We ensured that
   each time mice were sacrificed the maximal body weight loss did not
   exceed the limit of 20%. The animal procedures were approved by the
   ethics committee of West China Hospital, Sichuan University.

Statistical analysis

   All data are presented as the mean  ± standard error of the mean (SEM)
   or standard deviation (SD). Statistical significance for the comparison
   of multiple groups (>3) and between the groups was determined using
   analysis of variance (ANOVA) and Student’s paired t-test, respectively,
   in GraphPad Prism 9.0. p-value  < 0.05 was considered statistically
   significant.

Reporting summary

   Further information on research design is available in the [322]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [323]Supplementary Information^ (29.4MB, pdf)
   [324]Transparent Peer Review file^ (4.8MB, pdf)
   [325]Reporting Summary^ (2.4MB, pdf)
   [326]41467_2025_58439_MOESM4_ESM.pdf^ (170.6KB, pdf)

   Description of Additional Supplementary Files
   [327]Supplementary Data 1^ (10.4KB, xlsx)
   [328]Supplementary Data 2^ (46.3KB, xlsx)
   [329]Supplementary Data 3^ (442.2KB, xlsx)
   [330]Supplementary Data 4^ (16.5KB, xlsx)

Source data

   [331]Source Data^ (6.2MB, xlsx)

Acknowledgements