Abstract Current gold standard diagnostic strategies are unable to accurately differentiate malignant from benign small renal masses preoperatively; consequently, 20% of patients undergo unnecessary surgery. Devising a more confident presurgical diagnosis is key to improving treatment decision-making. We therefore developed MethylBoostER, a machine learning model leveraging DNA methylation data from 1228 tissue samples, to classify pathological subtypes of renal tumors (benign oncocytoma, clear cell, papillary, and chromophobe RCC) and normal kidney. The prediction accuracy in the testing set was 0.960, with class-wise ROC AUCs >0.988 for all classes. External validation was performed on >500 samples from four independent datasets, achieving AUCs >0.89 for all classes and average accuracies of 0.824, 0.703, 0.875, and 0.894 for the four datasets. Furthermore, consistent classification of multiregion samples (N = 185) from the same patient demonstrates that methylation heterogeneity does not limit model applicability. Following further clinical studies, MethylBoostER could facilitate a more confident presurgical diagnosis to guide treatment decision-making in the future. __________________________________________________________________ The MethylBoostER machine learning model improves the diagnosis of kidney tumor subtypes by utilizing DNA methylation patterns. INTRODUCTION Renal cell carcinoma (RCC) is among the top 10 most common cancers worldwide. Incidence rates have increased by 47% in the past 10 years, making it one of the fastest accelerating cancers, with rates projected to rise even further in the future ([54]1). This is attributed to the increasing prevalence of known risk factors and increased incidental detection, secondary to widespread use of imaging for unrelated symptoms ([55]2). Small renal masses (SRMs), defined as <4 cm in size, are therefore increasingly detected and encompass a mixture of potential diagnoses that comprise dozens of histological and molecular tumor subtypes ([56]3, [57]4). The major subtypes include clear cell RCC (ccRCC), papillary RCC (pRCC), chromophobe RCC (chRCC), or benign oncocytomas. Differentiating malignant from benign solid SRM represents a challenge to both imaging and renal biopsy. Renal biopsy may be nondiagnostic due to small tumor size, challenging anatomy, low biopsy tumor content, or tumor heterogeneity ([58]5–[59]7). As a result of no biopsy being undertaken or the inability to conclude a histological diagnosis based on biopsy samples, approximately 20% of SRMs removed at surgery are found to be benign postoperatively ([60]8). Consequently, a substantial number of patients undergo unnecessary surgery for a benign condition, with associated postoperative risks of morbidity and mortality and long-term impact on renal function. The rates of postoperative complications following minimally invasive surgery are blood transfusion (5%), reoperation (2 to 5%), respiratory complications (1 to 7%), and death (4%) ([61]9). Conversely, a meta-analysis demonstrated that one in four renal biopsies reported as oncocytoma is found to be malignant RCC following surgical excision ([62]5), risking false negatives, late diagnosis, and deleterious outcomes. Improved diagnosis and differentiation of SRM has been identified as a key research focus in RCC by an international research priority setting initiative ([63]10). 
Each of these pathological subtypes has characteristic genetic and molecular features such that it is argued that RCC is not a single disease ([64]11). ccRCC and pRCC are derived from the proximal convoluted tubule in the renal cortex, whereas chRCC and oncocytoma are derived from the distal nephron ([65]12, [66]13). DNA methylation changes are abundant, genome-wide early events in renal tumorigenesis, and are specific to the cell of origin, and so represent an ideal diagnostic target in this setting ([67]14–[68]17). Analysis of DNA methylation markers therefore has the potential to support and improve the current diagnostic pathway for renal cancer, providing a much more confident presurgical diagnosis to guide treatment decision-making. In this study, we propose MethylBoostER (Methylation and XGBoost for Evaluation of Renal tumors). MethylBoostER is a machine learning model based on the XGBoost (eXtreme Gradient Boosting) library that differentiates pathological subtypes of renal tumors, using DNA methylation markers identified in large tissue datasets. As seen in [69]Fig. 1, which shows the overview of our approach, we externally validate MethylBoostER on four independent datasets and evaluate multiregion samples from the same patient to assess the role of intratumoral heterogeneity (ITH) on MethylBoostER’s performance. In addition, we externally validate our results on ex vivo core biopsy tissue samples (which simulate in vivo patient biopsies) and fine needle aspirates (FNAs) and low-purity samples representing real-world applications and evaluate the role of tumor purity on MethylBoostER’s performance to assess clinical utility. MethylBoostER is trained on the largest DNA methylation cohort of renal tumor samples to date and is extensively validated, demonstrating accurate predictions across international cohorts and potential future clinical utility. Fig. 1. Overview of MethylBoostER. [70]Fig. 1. [71]Open in a new tab Three DNA methylation datasets are used to train and test the XGBoost classification model. The model is then validated on four external datasets. The high- and moderate-confidence predictions from the model output are used for improving diagnostic decisions. Model performance on both multiregion samples and sample purity was assessed. RESULTS DNA methylation sequencing in a cohort of patients with renal tumors DNA methylation data were obtained for 1228 samples from three data sources, namely, Cambridge samples (N = 319), The Cancer Genome Atlas (TCGA) (N = 872), and the Chopra training set (N = 37) ([72]Fig. 2, A and B) ([73]18). The former dataset contained information on 3 million CpGs measured using TruSeq EPIC, while the latter two were obtained via the 450K Illumina array. The merged datasets were obtained by overlapping CpGs within ±50 base pairs (bp) of Illumina probes and overall contained data on 158,670 probes. This dataset (N = 1228) was split into training and testing sets (via fourfold cross-validation; see Materials and Methods), with subsequent validation being performed on four additional independent datasets [Chopra validation ([74]18), Wei ([75]19), Brennan ([76]20), and Evelönn ([77]21)], and will be referred to as such hereafter. Fig. 2. Data characteristics and testing set performance. [78]Fig. 2. [79]Open in a new tab (A) Number of samples in each class used for the training/testing sets. (B) Uniform Manifold Approximation and Projection (UMAP) representation of the training/test dataset, using all input features. 
(C) Confusion matrix displaying the testing set performance, with precision and recall bars. (D) UMAP representation of the training/test dataset, using the input features learnt by the XGBoost model. (E) ROC curves over the testing set, split by class. MethylBoostER accurately distinguishes pathological subtypes of renal tumors We trained multiclass XGBoost machine learning models to classify four pathological subtypes of renal tumors (ccRCC, chRCC, oncocytoma, and pRCC) and normal tissue. Fourfold nested cross-validation was used [a method to randomly split the dataset into training (75%) and testing (25%) sets over four iterations] so that the testing set performance could be evaluated over the whole training/testing set. Consequently, this results in four trained models. We display summary results for the four models (individual results displayed in figs. S1, S2, S4, S5, and S7). The performance over the training/testing set was high, with an average accuracy of 0.960. Table S1 lists additional performance metrics, including metrics for each renal tumor subtype (forthwith called “class”), which shows that the models could distinguish the normal class perfectly, and the chRCC class was the most difficult to classify correctly. [80]Figure 2C shows the confusion matrix, which depicts how many samples, of a given true class, were predicted as each class. Most samples (1179 of 1228; 96%) are on the diagonal, indicating that they were correctly predicted. Of the 1228 samples evaluated, common misclassifications were in predicting ccRCC as pRCC (13 of 1228 samples; 1.1%) and predicting pRCC as ccRCC (9 of 1228; 0.73%) or chRCC (8 of 1228; 0.65%) samples. Although three oncocytoma samples were predicted as malignant RCC, no malignant samples were classed as benign oncocytoma or normal tissue. A large motivating factor for using XGBoost models is their interpretability—we can examine which input features were learnt by the models when classifying the five classes. We selected only these learned features and used them to visualize the training/testing set, as shown in [81]Fig. 2D (and fig. S2). This shows that although there are dataset-specific clusters within each class, the samples now form class-wise clusters. This suggests that the features that the models learn are not dataset specific but are features that distinguish each class in all three datasets. [82]Figure 2E (and fig. S1) shows the receiver operating characteristic (ROC) curves [and precision-recall (PR) curves] for each class, again indicating that the models achieved a high testing set performance. High- and moderate-confidence predictions make model outputs clinically more informative Given a sample and a trained XGBoost model, we can obtain the model’s predicted probabilities for that sample belonging to each class (renal tumor subtype). This reveals the confidence of the model’s prediction. [83]Figure 3A shows the predicted probabilities of the predicted class over the testing sets, which indicates that while most predictions are certain, some are less certain (13.2% of samples were predicted with a probability of <0.95), with probabilities as low as 0.256. Fig. 3. High- and moderate-confidence predictions. [84]Fig. 3. [85]Open in a new tab (A) Histogram of the model’s probabilities of the predicted class for the testing sets. (B) Line plot showing how the testing set accuracy scores and fraction of high-confidence predictions vary as the threshold changes. The vertical dotted line indicates the chosen threshold, 0.85. 
(C) Graphical overview of the prediction process with high- and moderate-confidence predictions. We used this information to separate the models’ predictions into two categories: high-confidence and moderate-confidence predictions. High-confidence predictions are where the model’s confidence is greater than a threshold, and we output a single answer—the predicted class. Moderate-confidence predictions are where the model’s confidence is lower than the threshold, and we output two answers—the predicted class (the first most likely prediction) and the second most likely prediction. We refer to these two outputs as first prediction and second prediction, respectively. The threshold was set to 0.85 (see [86]Fig. 3B), which maximizes the accuracy of both high- and moderate-confidence predictions and the fraction of high-confidence predictions over the testing set. In clinical practice in the case of moderate predictions, the clinician could consider the two predicted classes and integrate this information with clinical, histopathology, and imaging features to conclude the most likely diagnosis (see [87]Fig. 3C). The average accuracy of the high-confidence predictions over the testing sets was 0.982. The average accuracy of the moderate-confidence predictions over the testing sets was 0.668 (considering only the first prediction) but was 0.871 when the prediction was treated as correct if the first or second prediction was correct. See fig. S3 for the testing set confusion matrix split by high- and moderate-confidence predictions. External validation on four independent datasets verifies the generalizability of MethylBoostER to new data We externally validated MethylBoostER on four independent datasets, namely, Chopra validation (N = 245), Brennan (N = 37), Wei (N = 92), and Evelönn (N = 144) ([88]18–[89]21). They each contain different numbers of samples and a different class distribution, as shown in [90]Fig. 4A. The normal and ccRCC classes are the most frequent, which reflects clinical prevalence. We evaluated our models on these datasets (without the high- and moderate-confidence predictions method, so just taking the first prediction for all samples), and the average accuracy for Chopra validation, Brennan, Wei, and Evelönn was 0.824, 0.703, 0.875, and 0.894, respectively (see table S2 for additional metrics and fig. S4A for confusion matrices). The models had lower performance for chRCC and oncocytoma; oncocytoma samples were frequently predicted as normal and chRCC. In Chopra validation, the oncocytomas predicted incorrectly on both the first and second prediction were predicted as normal (11 samples), chRCC (4 samples), and pRCC (2 samples). In Brennan, the incorrect oncocytomas (on both first and second predictions) were predicted as normal (one sample) and chRCC (five samples). Combining all external validation sets, 7.89% (3 of 38) of oncocytomas were correctly classified as oncocytoma, 34.2% (13 of 38) were predicted as normal tissue, and 57.9% (22 of 38) were classified as malignant. Of the malignant samples, 94.4% (289 of 306) were classified correctly as malignant, 5.56% (17 of 306) were predicted as normal tissue, and 0% (0 of 306) was classified as oncocytoma (full details in confusion matrices in fig. S4). Although some oncocytoma samples were classified as malignant disease, no malignant samples were classified as benign oncocytoma ([91]Fig. 4). 
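In code, the high-/moderate-confidence scheme described above reduces to a thin post-processing layer over the model's class probabilities. The following is a minimal illustrative sketch (our own naming, not the authors' released code), assuming a probability vector over the five classes and the 0.85 threshold reported in the text:

```python
import numpy as np

CLASSES = ["normal", "ccRCC", "pRCC", "chRCC", "oncocytoma"]  # illustrative class order

def confidence_call(probs, threshold=0.85, classes=CLASSES):
    """Turn one sample's class probabilities into a high-confidence call (top-1)
    or a moderate-confidence call (top-1 plus second most likely class)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # classes sorted by descending probability
    first, second = classes[order[0]], classes[order[1]]
    if probs[order[0]] >= threshold:
        return {"confidence": "high", "prediction": first}
    return {"confidence": "moderate", "prediction": first, "second_prediction": second}

# Example: an uncertain sample returns its two most likely subtypes.
print(confidence_call([0.05, 0.55, 0.30, 0.07, 0.03]))
# {'confidence': 'moderate', 'prediction': 'ccRCC', 'second_prediction': 'pRCC'}
```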
Given current clinical practice (which errs on the side of caution and removes SRMs, which are later found to be benign), it would be acceptable to remove an oncocytoma in case this might be malignant, but it would not be acceptable to confuse a malignant mass as benign, as this would risk the cancer spreading. Malignant tumor samples predicted as normal tissue may be due to low tumor purity (see the next section). The relatively lower performance in oncocytoma and chRCC may be due to the low proportion of these classes in the training set, reflecting a real-world challenge, as these two are the least common pathological subtypes. In addition, previous reports suggest that methylation in these two classes is most similar, in keeping with their common cellular origin ([92]13). We found that incorporating high- and moderate-confidence predictions leads to improved results. [93]Figure 4B and table S3 show the accuracy and Matthews correlation coefficient (MCC) scores for the high- and moderate-confidence predictions. As seen in [94]Fig. 4B, the moderate-confidence first predictions have an accuracy of <0.65 for all four datasets, and the second predictions have an accuracy of <0.45 for all four datasets. When we combine them, we see that the first or second predictions reach an accuracy of >0.7. As expected, the accuracy of the high-confidence predictions is higher and is >0.9 for all datasets. Fig. 4. External validation on four independent datasets. [95]Fig. 4. [96]Open in a new tab (A) Number of samples in each class for each dataset. (B) Accuracy for high- and moderate-confidence predictions for each external dataset. “First or second prediction” indicates that a prediction is treated as correct if its first or second prediction was correct. (C to F) Confusion matrices for both high- and moderate-confidence predictions and ROC curves, split by class, for each external dataset. For the moderate-confidence confusion matrices, the x axis is split into first prediction was correct, the second prediction was correct, and both first and second predictions were incorrect. Many moderate-confidence predictions are correct on the second prediction, as shown in [97]Fig. 4 (C to F) (and fig. S4B for the other models). For example, 16 samples in the Chopra validation dataset had an incorrect first prediction but correct second prediction (see [98]Fig. 4C). These samples would have simply been incorrectly predicted without the use of this high- and moderate-confidence prediction method. There is also a higher fraction of moderate-confidence predictions for the chRCC and oncocytoma classes (fig. S4C), reflecting the difficulty of predicting these classes. We also show that all four datasets achieved high class-wise ROC areas under the curve (AUCs), as shown in [99]Fig. 4 (C to F) (see fig. S5 for the other models’ ROC curves and all PR curves). We benchmarked MethylBoostER to the models introduced by Chopra et al. ([100]18) and Brennan et al. ([101]20) by comparing classification results over the Chopra validation dataset. We include MethylBoostER’s results both with and without high- and moderate-confidence prediction results. We also include Chopra’s results when excluding their benign class, as MethylBoostER or Brennan does not have results for this class. The overall accuracies were 0.846 (Chopra), 0.890 (Chopra excl. benign), 0.872 (Brennan), 0.824 (MethylBoostER), and 0.878 (MethylBoostER high/mod). 
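The "first or second prediction" accuracy used above is a top-k-style metric and is straightforward to compute from the predicted probability matrix; a small sketch with toy numbers and our own helper name (not part of the published pipeline):

```python
import numpy as np

def top_k_accuracy(prob_matrix, y_true, k=2):
    """Fraction of samples whose true class is among the k most probable classes.
    prob_matrix: (n_samples, n_classes); y_true: integer class indices."""
    prob_matrix = np.asarray(prob_matrix)
    top_k = np.argsort(prob_matrix, axis=1)[:, ::-1][:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))

# Toy example with three samples and five classes:
probs = np.array([[0.70, 0.20, 0.05, 0.03, 0.02],
                  [0.10, 0.45, 0.40, 0.03, 0.02],
                  [0.05, 0.15, 0.20, 0.50, 0.10]])
y = [0, 2, 3]
print(top_k_accuracy(probs, y, k=1))  # ~0.67: first prediction only
print(top_k_accuracy(probs, y, k=2))  # 1.0: correct if first or second prediction matches
```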
The number of correct predictions per class is shown in table S4, which shows that Chopra and MethylBoostER high/mod both achieve the highest performance for the normal and chRCC classes. Chopra achieves the highest performance for oncocytoma, and MethylBoostER high/mod achieves the highest performance for the malignant (and most clinically important) ccRCC and pRCC classes. This shows that MethylBoostER high/mod is competitive among similar classification models, showing the joint best performance for chRCC and the best performance for malignant ccRCC and pRCC subtypes. Because of the molecular similarity of the chRCC and oncocytoma subtypes and more frequent misclassification of these two classes in the combined multiclass model, we explored the utility of a binary classifier by training XGBoost models with only chRCC and oncocytoma classes (using the same method as described in Materials and Methods). This approach was hampered by limited sample numbers, with a total of 93 chRCC and 58 oncocytoma samples in the testing set (fig. S6A) and 14 chRCC and 38 oncocytoma samples in the external validation sets (fig. S6B). The models achieved high performance on the testing set (see fig. S6C), with an average accuracy of 0.968, MCC of 0.932, and ROC AUC of 0.9959. In the limited cohort available for external validation datasets, all models predicted every sample as chRCC (fig. S6, D and E), demonstrating that the binary models for these two molecularly similar RCC subtypes could not be generalized and added little value over the multiclass MethylBoostER model. However, the ROC and PR curves highlight clear evidence of subtype-specific signals with these DNA methylation models (fig. S6, F to I), reaching an ROC AUC of 0.948 for the Brennan dataset. This gives a similar picture to ROC and PR curves for the multiclass models (fig. S5) and indicates that the predicted probabilities can differentiate the two classes, but that the model’s classification threshold does not generalize in the validation datasets, possibly because of the limited datasets available for these RCC subtypes. The underlying prediction scores are visualized in fig. S6 (J and K), which show that the probabilities for oncocytoma samples are consistently lower than chRCC samples; however, they are not lower than 0.5 (the current model’s classification threshold). Overall, we found that a separate binary classifier did not add value over the multiclass MethylBoostER model, possibly because of the limited available data for these subtypes (see Discussion for suggested future studies that could overcome these limitations). MethylBoostER provides consistent classification of multiregion samples from the same patient ccRCC is characterized by a high degree of genetic ITH, with over 60% of somatic mutations not detectable across all multiregion samples ([102]22). On average, seven multiregion samples are needed to detect over 75% of genetic variants ([103]23). Very little is known about methylation variation in ccRCC, although a handful of reports suggests relative homogeneity ([104]21, [105]24–[106]26), therefore highlighting the potential relevance of diagnostic methylation markers. In the Cambridge dataset, multiregion samples were available (N = 168) for 25 patients (18 ccRCC, 4 chRCC, 2 oncocytoma, and 1 pRCC). In 92% (23 of 25) of patients, all multiregion samples were predicted consistently as being from the same pathological subtype, with 88% (22 of 25) achieving correct classification for all samples (see [107]Fig. 5A). 
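Checking multiregion consistency amounts to grouping per-sample predictions by patient; a small pandas sketch (the table layout and column names are hypothetical, not taken from the study's data release):

```python
import pandas as pd

# Hypothetical per-sample prediction table (patient ID, true subtype, predicted subtype).
preds = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2", "P3", "P3"],
    "true_subtype": ["ccRCC"] * 3 + ["chRCC"] * 2 + ["oncocytoma"] * 2,
    "predicted_subtype": ["ccRCC", "ccRCC", "ccRCC", "chRCC", "chRCC", "oncocytoma", "chRCC"],
})

per_patient = preds.groupby("patient_id").agg(
    n_regions=("predicted_subtype", "size"),
    consistent=("predicted_subtype", lambda s: s.nunique() == 1),   # same call in every region
    all_correct=("predicted_subtype",
                 lambda s: (s == preds.loc[s.index, "true_subtype"]).all()),
)
print(per_patient)
print("Fraction of patients with consistent multiregion calls:",
      per_patient["consistent"].mean())
```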
Multiregion samples (N = 17) were also available for six patients with ccRCC from the Evelönn external validation dataset ([108]21). As shown in [109]Fig. 5B, 83% (five of six) of patients had a concordant prediction for all multiregion samples derived from the same patient, although the model struggles to differentiate chRCC from ccRCC. We noted that incorrectly predicted samples had a very low tumor purity. When the Evelönn dataset is visualized using dimensionality reduction (see fig. S7), almost all the incorrectly predicted samples (including all the incorrectly predicted multiregion samples) can be seen to cluster away from the rest of the ccRCC samples, indicating that these samples have a different methylation pattern (potentially related to their lower tumor purity, as shown in fig. S7B). Fig. 5. Classification of multiregion samples. [110]Fig. 5. [111]Open in a new tab Diagram visualizing the model’s predictions of multiregion samples for each patient in the Cambridge and Evelönn datasets. Impact of tumor purity on MethylBoostER Tumor purity was calculated using methylation beta values and InfiniumPurify, as described ([112]20, [113]27). This method determines tumor purity in the context of contamination with a normal (nontumor) kidney sample. As expected, median sample purity was lower for the Chopra and Brennan datasets compared to TCGA and Cambridge, as the former include ex vivo core biopsy samples (i.e., simulated biopsies using nephrectomy specimens) and FNAs, respectively. Of all the datasets, the Evelönn data appear to have the lowest tumor purity (median purity: 0.88 Cambridge, 0.84 TCGA, 0.80 Chopra training, 0.58 Chopra validation, 0.73 Brennan, 0.81 Wei, and 0.48 Evelönn). [114]Figure 6A summarizes purity for samples that were predicted correctly in the first prediction, second prediction, and incorrectly predicted samples (results for individual datasets are shown in fig. S8). Median purity in samples that were incorrectly predicted was significantly lower than samples correctly predicted on the first or second prediction (0.27 versus 0.82 versus 0.44, P < 0.01 for all comparisons; [115]Fig. 6A). We noted that three of the samples that were misclassified by our model were also incorrectly classified by Brennan et al. ([116]20) and that the study authors postulated that this was due to low sample purity. [117]Figure 6 (B and C) summarizes the purity and prediction probability (for the first prediction), highlighting incorrectly predicted samples. We found a correlation between purity and the probability of first prediction in Wei and Evelönn [Pearson’s correlation coefficients of 0.58 and 0.51, adjusted P < 0.01], but not the other datasets (correlation < 0.30 and/or adjusted P > 0.01). There are a number of samples that are incorrectly predicted despite having a high-confidence prediction, and these tend to have lower purity. Considering all datasets combined, table S5 summarizes the model accuracy, as well as the median probability of the first prediction, by tumor purity. This further demonstrates that samples with higher tumor purity are more accurately predicted, for example, samples with purity ≥ 0.9 obtain an accuracy of 0.99. There is a sharp drop-off in accuracy in samples with a purity of <0.20 compared to samples in which purity was >0.2, suggesting that potentially a biopsy sample may have to be repeated if purity is below this level. 
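The purity-confidence association reported here uses Pearson correlation with multiple-testing adjustment; the calculation can be sketched as below (simulated toy values, and Benjamini-Hochberg adjustment is our assumption for the correction step rather than a detail stated in the paper):

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Toy purity values for two hypothetical validation datasets.
datasets = {"dataset_A": rng.uniform(0.2, 1.0, 50),
            "dataset_B": rng.uniform(0.2, 1.0, 50)}

names, r_values, p_values = [], [], []
for name, purity in datasets.items():
    # Simulated first-prediction probabilities that loosely track purity.
    prob_first = np.clip(0.5 + 0.4 * purity + rng.normal(0, 0.1, purity.size), 0, 1)
    r, p = pearsonr(purity, prob_first)
    names.append(name); r_values.append(r); p_values.append(p)

# Benjamini-Hochberg adjustment of the per-dataset P values.
_, p_adjusted, _, _ = multipletests(p_values, method="fdr_bh")
for name, r, q in zip(names, r_values, p_adjusted):
    print(f"{name}: Pearson r = {r:.2f}, adjusted P = {q:.3g}")
```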
This analysis was performed post hoc on both training/testing and validation datasets, and there were a limited number of low-purity samples (5% of the cohort; table S5). Therefore, these results (and a purity threshold that warrants repeat biopsy) remain to be externally validated. Fig. 6. Sample purity and MethylBoostER output. (A) Sample purity for samples that are predicted correctly on the first prediction (1st correct) and second prediction (2nd correct) and incorrectly predicted samples (incorrect) on both predictions. Data are shown for all datasets combined, with pathological subtypes shown in different colors. Adjusted P values are shown (*P < 0.05 and ***P < 0.0009). (B and C) Sample purity and the probability of the first prediction are demonstrated for all datasets combined (B) and each dataset individually (C). The threshold t = 0.85 indicates a high-confidence prediction. Samples that are incorrectly predicted (in both first and second prediction) are indicated with a cross. Features selected by MethylBoostER are associated with carcinogenic processes The features that the MethylBoostER models used during classification were obtained. Each of the four XGBoost models selected 1490, 1697, 1476, and 1331 features, respectively. An analysis of the genomic location of the features revealed that the selected features do not follow the same genomic location distribution as the background, as shown in [120]Fig. 7A. In the promoter regions close to the transcription start site (TSS) (≤1 kb), there is a lower percentage of features compared to the background set (68% compared to 78%). However, in the promoter regions further away from the TSS (1 to 2 kb and 2 to 3 kb away), there is a higher percentage of features compared to the background set (8.7% compared to 6.3% and 4.3% compared to 2.3%). There is also a higher percentage of features in introns (5.1% compared to 3.1% for first introns and 8.7% compared to 5.5% for other introns). This distribution of the features' genomic locations is significantly different from the background (apart from the intergenic annotation), as shown in table S6. Fig. 7. The genomic location and functional annotation of the features selected by MethylBoostER. (A) Distribution of genomic locations (relative to genes) for the selected features compared to the background (the total set of input features). (B) Enriched GO terms from the Biological Process category represented as a network, where each branch represents a different functional category. Results were obtained from the gene-wise GO analysis. (C) Enriched GO terms from the Biological Process category represented as a bar plot. Results were obtained from the localized region GO analysis. The features from MethylBoostER were mapped to proximal genes and analyzed at the gene ontology (GO) and pathway level. The GO terms are visualized as a network in [123]Fig. 7B. The GO analysis detected a neuronal and muscle developmental signature, cancer signaling [Wnt, mitogen-activated protein kinase (MAPK), and Ras signaling pathways], and both positive and negative regulation of transcription, among others. The significant enrichment (see table S7 for the adjusted P values of all gene list comparisons) of transcription factors [TF Checkpoint ([124]28)] and epigenetic regulators [EpiFactors ([125]29)] indicates the perturbation of transcriptional regulators and chromatin remodelers in renal cancer.
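Collecting the features "used" by each model follows directly from XGBoost's gain-based importance scores; a brief illustrative sketch (toy data, not the authors' pipeline) of gathering the nonzero-gain features from each fold's model and intersecting them to find consistently selected CpGs:

```python
import numpy as np
import xgboost as xgb

# Toy stand-in for the methylation matrix (samples x CpG features) and subtype labels.
rng = np.random.default_rng(0)
X = rng.random((120, 200))
y = rng.integers(0, 5, size=120)                     # five classes, as in MethylBoostER
feature_names = [f"cg{i:05d}" for i in range(X.shape[1])]

selected_per_model = []
for seed in range(4):                                # stand-in for the four cross-validation models
    model = xgb.XGBClassifier(n_estimators=50, max_depth=3, random_state=seed)
    model.fit(X, y)
    gain = model.get_booster().get_score(importance_type="gain")   # keys like 'f12'
    used = {feature_names[int(k[1:])] for k in gain}               # features with gain > 0
    selected_per_model.append(used)

consistent = set.intersection(*selected_per_model)
print(f"Features selected by all four models: {len(consistent)}")
```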
While an RCC-related gene set [RCC Harmonizome/Diseases databases ([126]30)] and a ccRCC putative driver gene set ([127]23) were enriched, a similar set of pRCC genes was not enriched [pRCC Harmonizome/Diseases ([128]30)]. The feature set also indicated a strong metastatic signature [Human Cancer Metastasis Database ([129]31)], with enriched GO terms related to Motility, Axon guidance, and Cell adhesion. Enrichment of epithelial-mesenchymal transition (EMT)–related genes [dbEMT2 ([130]32)] provides further evidence for this metastatic signature. In addition, the gene list was enriched for tumor suppressors, oncogenes, and fusion genes [Catalogue Of Somatic Mutations In Cancer (COSMIC) Cancer Gene Census ([131]33)]. We also checked whether the results of the gene-wise GO analysis were consistent with the results from the Genomic Regions Enrichment of Annotations Tool (GREAT) ([132]34), which is designed for localized regions that are not necessarily within genes. This also enabled the features that were not mapped to genes to be included. [133]Figure 7C shows the significant GO terms resulting from this analysis, which demonstrates that they are similar to the results of the gene-wise GO analysis. The consistent themes were development, cell adhesion, cell motility, cell signaling, and neurogenesis. A few functions were unique to this GREAT analysis, such as immune response and extracellular matrix organization, and a few were unique to the gene-wise analysis, such as regulation of transcription and response to cortisol stimulus. Pathway analysis revealed enriched pathways covering similar functions as found during the GO analysis. The most enriched pathways were the following: Wnt signaling pathway (adjusted P = 2.79 × 10^−15), Cadherin signaling pathway (adjusted P = 2.92 × 10^−5), and Netrin-1 signaling (adjusted P = 4.62 × 10^−03). Other enriched pathways included axon guidance, cell junction organization, and signaling by receptor tyrosine kinases. Next, we focused our analysis on the consistently salient features—features that were selected by all four XGBoost models. There were 38 such features, which mapped to 45 genes. Of these 45 genes common to all four models, 9 genes have not been previously associated with kidney function, renal disease, or cancer (ANKMY1, EHMT1, H2AW, JDP2, H2BU1, KIAA1143, KIF15, LINC01960, and SF3A2). As detailed in table S8, an analysis of the literature shows that the others have been associated with ccRCC, chRCC, and RCC. These genes have also been linked to kidney development, gene expression in kidney, adhesion, motility, invasion, metastasis, and poor survival in renal cancer and other cancer types. In addition, some of them contribute to increased survival, act as tumor suppressors, promote cancer stemness, contribute to carcinogenesis, proliferation, and tumor growth, and have been identified in other cancer types. These associations demonstrate that the methylation features identified by MethylBoostER are biologically relevant. The methylation distribution of all 38 features and their genes can be seen in fig. S9. We confirmed these findings by carrying out an entity relationship–driven natural language processing (NLP) analysis of approximately 30 million PubMed abstracts, using the feature mapped gene set and a dictionary of relationship terms based on enriched biological processes identified above. 
Modeling this information from 28,579 interactions (each supported by multiple publications) as an entity relationship network, and subsequent network topological analysis where relationship strength was visualized as degree (number of genes directly linked to a relationship) and relationship importance as betweenness centrality, highlighted the association of these genes with carcinogenesis, development, and tumor microenvironment interactions. This corroborates the functional bioinformatics analyses and confirms that these genes have been previously associated with carcinogenic processes, tumor microenvironment, metabolism, metastasis, and immune and inflammatory responses and are known to influence patient survival and prognosis in the biomedical literature (fig. S10 and data file S1 containing NLP-derived gene relationship pairs and their frequency). We also extracted the class that each feature contributes toward to identify subtype-specific features. Figure S11A shows that the genomic location of the features was distributed slightly differently for each class, with the normal, pRCC, and oncocytoma classes using fewer features in proximal promoter regions near the TSS (≤1 kb). The normal and pRCC classes also used more features in introns. We then calculated enriched GO terms for each class using GREAT and found enriched terms for normal, ccRCC, chRCC, and pRCC. The normal class was enriched in terms such as membrane organization and cell activation (fig. S11B). The ccRCC class was enriched in terms such as regulation of signaling and regulation of cell communication (fig. S11C). The chRCC class was enriched in multiple terms relating to negative regulation of DNA repair and regulation of multicellular organismal process (fig. S11D). The pRCC class was enriched in many terms, such as single-organism developmental process, movement of cell or subcellular component, localization, defense response, and positive regulation of biological process (fig. S11E). After mapping these subtype-specific features to genes, we also show that many genes were often found in multiple subtypes. Figure S11E demonstrates that 205 genes were found in both ccRCC and pRCC, 159 genes were found in both chRCC and ccRCC, and 137 genes were found in both chRCC and pRCC. DISCUSSION The rising incidence of SRMs drives the need to improve the diagnostic pathway to avoid overtreatment of patients with benign disease and optimize health resource use. The challenge consists of differentiating benign oncocytoma from malignant subtypes, the most common being ccRCC, pRCC, and chRCC. Previously published molecular classifiers are often limited by focusing solely on distinguishing oncocytoma from chRCC, excluding the other, more common subtypes and therefore reducing applicability in the real world ([134]20, [135]35–[136]38). In addition, a number of these studies had small sample sizes (<200 samples) ([137]36, [138]38, [139]39). Existing models often rely on a limited number of markers (e.g., <100 markers), making them less robust when applied to heterogeneous clinical samples and limiting potential future applicability ([140]20, [141]36–[142]38, [143]40). Machine learning models based on DNA methylation have demonstrated excellent accuracy in other cancer types, including lung, brain, and breast malignancies and sarcomas ([144]41–[145]43). 
The ability of machine learning models trained on DNA methylation data to classify cancer of unknown primary with excellent accuracy, leading to benefits in overall survival, serves as a testament to DNA methylation as a unique marker of cell identity ([146]44). We have therefore developed an XGBoost machine learning model (MethylBoostER), leveraging methylation data from over 1200 patient samples. We demonstrate that the model accurately predicts pathological subtypes of renal tumors in the training and testing set, with ROC AUCs of over 0.95 for all subtypes evaluated. The extensive external validation in four independent methylation datasets, totaling over 500 patient samples, is a measure of the robustness of our method. We demonstrate the high accuracy of the model with average accuracies of 0.824, 0.703, 0.875, and 0.894 and AUCs of over 0.89 for all subtypes in all independent datasets. We also show that MethylBoostER is competitive among similar existing models (when using high- and moderate-confidence predictions), achieving the highest performance for malignant RCC classes (chRCC, ccRCC, and pRCC). A recognized clinical challenge is that biopsy accuracy may be hampered by genetic ITH or reduced tumor content, especially in smaller or more difficult to access tumors. We therefore address these potential drawbacks. We demonstrate that, unlike the known extensive genetic ITH ([147]22), analyzing methylation patterns in multiregion samples from the same patient led to consistent diagnoses in the vast majority of cases (92% in the Cambridge and 83% in the Evelönn datasets). In addition, we assessed low-purity samples (including ex vivo biopsies and FNAs) to represent a real-world scenario. We show that samples that are incorrectly predicted by our model have significantly lower purity than correctly predicted samples. The association between purity and model accuracy was also previously noted by Brennan et al. ([148]20). It is customary to take more than one sample at the time of biopsy, which may help overcome this challenge; alternatively, low-purity samples may require repeat biopsy. In the testing set, the most common misclassifications involved ccRCC, pRCC, and chRCC, while in the external validation sets, chRCC and oncocytoma were the classes with the lowest AUCs. The tumors’ shared cells of origin (proximal versus distal nephron) may explain some of these results. Problems in accurately differentiating between ccRCC and chRCC on histopathology are another well-known challenge. In TCGA, 15 cases were initially classified as ccRCC on histopathological slide review; however, these were later re-reviewed by specialist urohistopathologists and reclassified as chRCC ([149]11, [150]45). We noted that eight of these samples were included in our testing and training dataset, and MethylBoostER classifies five as chRCC and three as ccRCC. The former suggests that our model can correctly classify chRCC samples better than a standard pathologist, and more akin to a specialist urohistopathologist. We explored the methylation and gene expression profiles of TCGA samples (fig. S12, A and B) and demonstrated that the three samples that our model classifies as ccRCC (TCGA participant IDs 4821, 4688, and 4696) cluster more closely with ccRCC than chRCC based on both methylation and gene expression data. We hypothesize that these samples may be ccRCC (i.e., the first classification was correct rather than the reclassification). 
This highlights existing challenges in diagnosing subtypes using current histopathology methods and emphasizes the need to produce accurate predictive models. Last, another challenge is that predictive models are trained on datasets in which the true diagnosis is based on histopathology, and if this is incorrect, it may bias the model. To ensure that the predictions of MethylBoostER were not heavily biased by these eight mislabeled samples, we retrained the models with the reclassified labels (eight ccRCCs were reclassified as chRCC, and a number of non-RCC samples were removed). The results over the testing sets were similar, as shown in fig. S12 (C and D), demonstrating that the small number of incorrect labels does not substantially affect the results. Carcinogenesis is invariably accompanied by perturbations in gene and epigenome regulation. Aberrant DNA methylation changes on promoters, enhancers, and gene bodies contribute to alteration of gene expression and consequently affect signaling, regulatory, and metabolic pathways. Renal carcinoma has been associated with MET ([151]46), Hippo ([152]47), Wnt ([153]48), MAPK ([154]49), NRF2 (nuclear factor erythroid 2-related factor 2)-ARE (antioxidant response element), phosphatidylinositol 3-kinase (PI3K)/AKT/mechanistic target of rapamycin (mTOR) ([155]50), metabolic, angiogenic, and immune checkpoint–associated pathways ([156]51). GO and pathway enrichment analysis of the MethylBoostER features mapped to genes shows a strong association with genes, processes, and pathways involved in carcinogenesis. In particular, oncogenes and tumor suppressors (COSMIC Cancer Gene Census), cancer-associated pathways [Kyoto Encyclopedia of Genes and Genomes (KEGG)], putative ccRCC driver genes, and RCC-associated genes were enriched. In addition, biological processes associated with metastasis (motility, cell adhesion, and axonal guidance) and EMT were also strongly enriched. Comparison with curated gene sets in metastasis [Human Cancer Metastasis Database (HCMDB)] and EMT (dbEMT2) databases further confirmed this. Other signatures such as neuronal and muscle development; Wnt, MAPK, and Ras signaling pathways; nucleoside metabolism; and T cell activity are also enriched. As expected, enrichment of both positive and negative regulation of transcription coincides with known consequences of DNA methylation. Furthermore, enrichment of TFs and epigenetic regulators further supported this strong transcriptional regulation signature. Genes associated with response to cortisol stimulus are enriched in our feature set. Serum cortisol levels have been found to be significantly higher in RCC and are associated with higher tumor grade ([157]52). The enrichment in our feature set suggests a possible epigenetic regulatory mechanism for this process. Last, we show that the CpG features selected by the models are enriched for cancer-related genes, and the features selected by all four models are enriched for well-known RCC genes. We use literature analysis to demonstrate that most classification features are associated with genes involved in kidney pathology, cancer, or carcinogenic processes, demonstrating the clear discernment in model-selected methylation patterns. Associations between methylation and gene expression are outside the scope of the present analysis, which aimed to demonstrate that the model selects salient features that are biologically relevant and easily interpretable. Several limitations of the current study must be addressed.
One of the main challenges in developing predictive methods pertains to lack of sample and clinical data, particularly for rare disease subtypes. We acknowledge that although the MethylBoostER model accurately predicts the most prevalent pathological subtypes of renal tumors, due to constraints in sample availability, the diagnosis of less prevalent, uncommon renal masses (such as angiomyolipoma, translocation RCC, collecting duct carcinoma, Fumarate Hydratase (FH)-deficient RCC, renal medullary carcinoma, and others) is not accounted for in our current analysis. Although the model achieves a high accuracy across all sample types, chRCC and oncocytoma remain the most difficult classes to distinguish. Future studies including larger sample sizes from these two classes may enable improved prediction accuracies. While the main aim of the analysis is to distinguish pathological subtypes of SRMs, because of insufficient sample sizes, we pooled data from both SRMs and larger tumors for the training/testing set and some of the validation cohorts. However, we validated the model on the Chopra dataset that consists exclusively of SRMs (245 samples) and demonstrated excellent accuracy and generalizability of results in smaller tumors and in low-purity samples (e.g., Evelönn and FNAs in Brennan). Future larger studies could include rarer subtypes (exploring the impact on the model output) and prioritize samples from patients with SRMs. In addition, there were some limitations to our benchmarking method. The models we compared to were trained on a different number of classes, so the accuracies were not directly comparable, and they were also trained on different training datasets. Thus, further benchmarks that solve these limitations would need to be carried out. Excluding TCGA, the other datasets included in our analysis provided methylation data alone. In the future, obtaining both methylation and transcriptomic data may enable head-to-head comparisons between MethylBoostER and existing models predicting pathological subtypes using transcriptomic data ([158]35, [159]36). Ultimately, an ideal model would predict patient outcome in addition to identifying the pathological subtype of the renal tumor and would take into account comorbidities and competing risks of death. Unfortunately, this was not possible in the present analysis due to lack of survival data for the available cohorts. Future work should focus on collecting comprehensive multimodal datasets including patient clinical parameters, imaging, pathology, and multiple omic data types (including both methylation and gene expression). Unique features of our model are its interpretability, ease of use in a clinical setting, and an approach that reduces the clinical importance of ITH. We envision that, in the future, following further model refinement (possibly through the use of larger, more heterogeneous, and balanced renal cancer datasets), validation, and prospective testing (using a well-designed clinical trial to test real-world performance), MethylBoostER could be integrated in clinical practice to improve the patient diagnostic pathway ([160]Fig. 8). Individuals with SRMs would undergo imaging-guided tumor biopsy, which would be processed for targeted DNA methylation analysis, and the MethylBoostER model would be used to predict pathological subtypes, serving as an adjunct to treatment decision-making. 
For high-confidence predictions, the output is limited to the most likely diagnosis, whereas for moderate-confidence predictions, the model will supply two class predictions and the clinician would be encouraged to integrate imaging and histological data. Such moderate-confidence predictions will provide useful additional information for clinicians to combine with other clinicopathological features, for instance, moderate prediction of “chRCC or ccRCC” in a fit patient with a larger tumor would be much more likely to result in active treatment than a moderate confidence prediction of “chRCC or oncocytoma” in a patient with a relatively small tumor and limited life expectancy. A strength of our work is the output of moderate- and high-confidence predictions, which overall increases model accuracy and empowers clinicians to make patient-centered decisions that take into account both clinical and molecular (DNA methylation) data. We acknowledge that, in a small number of individuals with SRMs, renal biopsy may be high risk or contraindicated due to tumor anatomy or patient factors; therefore, MethylBoostER may not aid diagnosis in this setting. However, there is an increasing use of diagnostic biopsies to minimize morbidity and mortality resulting from overtreatment of patients with benign lesions, supported by strong recommendations in recent clinical guidelines (European Association of Urology Renal Cell Carcinoma Guideline Panel 2022). Following further refinement, MethylBoostER could have a tangible future clinical application, although future studies will be needed to demonstrate the model’s clinical benefit above the existing standard of care diagnostic pathway (imaging with or without pathology) before adoption in the clinic. Fig. 8. Proposed future integration of MethylBoostER model into the existing diagnostic pathway for patients with SRMs. [161]Fig. 8. [162]Open in a new tab Following future model refinements and clinical trials, MethylBoostER could play a role in the diagnostic pathway. Here, we describe the potential clinical utility. Patients would have an image-guided renal biopsy, and biopsy samples would undergo DNA methylation analysis. MethylBoostER results would be interpreted in the context of integration with clinical and imaging data. For high-confidence predictions, MethylBoostER would predict one class, where benign oncocytoma and malignant RCC would likely be managed with active surveillance and active treatment, respectively. In moderate-confidence predictions, the two classes with the highest probabilities would be predicted. Samples with low purity or cases in which MethylBoostER predicts normal kidney (suggesting that the target lesion was missed) would prompt repeat biopsy. In summary, we develop MethylBoostER, a machine learning model that predicts pathological subtypes of benign and malignant renal tumors. The relatively modest number of patients diagnosed with certain tumor subtypes (especially chRCC and oncocytoma) and limited access to patient samples can hamper the application of machine learning–based approaches. Future studies should aim to obtain larger sample sizes and focus on multimodal integration with imaging, patient characteristics, epigenetic data, and clinical outcomes. MATERIALS AND METHODS Cambridge samples Patients with benign and malignant renal tumors undergoing curative or cytoreductive nephrectomy at Addenbrooke’s Hospital were recruited to the DIAMOND study. 
Ethics approval and patient consent were obtained (Research Ethics Committee, REC ID 03/018). Fresh-frozen tissue was stored at −80°C. Where available, multiregion tumor samples were taken along with adjacent normal kidney tissue. DNA was extracted from a small section of frozen tissue, using the commercially available AllPrep DNA/RNA Mini Kit (QIAGEN) according to the manufacturer’s protocol. DNA samples (10 ng/μl, 500 ng total) were sheared using the S220 Focused-ultrasonicator (Covaris) to generate double-stranded DNA (dsDNA) fragments. The D1000 ScreenTape System (Agilent) was used to ensure that >60% of DNA fragments were between 100 and 300 bp long, with a mean fragment size of 180 to 200 bp. Tissue methylation analysis was performed using the TruSeq Methyl Capture EPIC Library Preparation Kit (Illumina), using the manufacturer’s protocol (a.k.a, or also known as TruSeq EPIC). This consists of a capture-based method targeting approximately 3 million CpGs. Four samples are multiplexed in each capture reaction using sample indexing adaptors. The protocol involves hybridization of biotin-tagged probes to genomic DNA followed by capture using streptavidin beads (two hybridization-capture steps) and bisulfite conversion at 54°C for 2 hours. Following polymerase chain reaction (PCR) amplification, uracils are copied as thymines, with resulting libraries consisting of two families of dsDNA molecules (originating from Watson and Crick strands), with a high thymine-to-cytosine ratio. Twelve samples were pooled for sequencing on the HiSeq4000 Illumina Sequencing platform (single-end 150-bp read) using two lanes per library pool. Technical replicates were performed for cell line data to assess assay reproducibility (correlation = 0.97). Sequenced data were trimmed (TrimGalore v0.4.4) and aligned to the bisulfite-converted human reference genome (GRCh38/hg38). Methylation calling was performed using the Bismark suite of tools (v0.22.1). Trimming and alignment reports were compiled using MultiQC (v1.7). Data were included in downstream analysis if a depth of greater than 10× coverage was achieved to reduce the risk of false-positive calling. In addition, samples were removed if non-CpG methylation was greater than 1%. Publicly available data TCGA data for ccRCC (KIRC), pRCC (KIRP), and chRCC (KICH) were obtained via the TCGAbiolinks package in R ([163]53). Only CpG probes with less than 5% missing data were kept. The Chopra datasets, one for training and one for testing, were downloaded from the Open Science Framework, repository OSF.IO/Y8BH2 ([164]18), and the Acute Myeloid Leukemia (AML) samples were excluded. The Wei dataset ([165]19) and Evelönn dataset ([166]21) were obtained from the Gene Expression Omnibus (GEO), with GEO accession numbers [167]GSE61441 and [168]GSE113501, respectively. The Brennan dataset was obtained directly from the study authors ([169]20). In all cases, methylation data were evaluated using the Illumina Infinium Human DNA Methylation 450 platform (also known as 450k array). Preprocessing Training/testing dataset The training/testing dataset consisted of the TCGA data, the Chopra training data, and the Cambridge samples. The following preprocessing steps were performed on this training/testing set. CpG probes found in two blacklists ([170]54, [171]55) for the 450k array were removed, as well as probes at the site of C/T and G/A single-nucleotide polymorphisms (SNPs), probes that map to multiple regions, and probes at repeat regions. 
In addition, CpG probes located on the sex chromosomes were omitted to remove gender bias. The 450k array (Chopra training samples and TCGA samples) and TruSeq EPIC (Cambridge data) cover different CpG sites. To combine data from the two methods, TruSeq EPIC beta values within 50 bp of the 450K probes were averaged, as adjacent CpGs tend to be co-methylated ([172]56). Subsequently, probes that were missing for the whole of one dataset (TCGA, Chopra training data, or Cambridge) were removed to avoid dataset-specific bias. This resulted in a dataset with 158,670 CpG probes. Last, the beta values were converted to M values, as M values have been shown to be more homoscedastic ([173]57). They are computed using the following equation: $M = \log_2\left(\frac{\beta}{1 - \beta}\right)$. Where M values were calculated to be infinity (due to β = 1), they were set to the maximum finite value of the training/testing data, and where M values were calculated to be minus infinity (due to β = 0), they were set to the minimum finite value of the training/test data. External datasets The external datasets used were the Chopra validation data (which is independent of the Chopra training data), the Brennan dataset, the Wei dataset, and the Evelönn dataset. The same 158,670 probes in the preprocessed training/testing dataset were selected for all four external datasets. Data visualization Dimensionality reduction was carried out using umap-learn Python package (version 0.3.10) ([174]58). Not a Number (NaN) values were converted to the value 0 before the transformation. MethylBoostER: The XGBoost classification model A multiclass XGBoost classifier model was constructed using the xgboost Python package (version 0.90) ([175]59). Fourfold nested cross-validation was implemented on the training/testing set, with integrated hyperparameter optimization. Fourfold nested cross-validation The training/testing set was split up into four stratified outer folds so that each fold keeps 25% of the dataset for testing. This was split so that multiple samples from the same patient were all in the same fold to avoid data leakage between folds. These folds were iterated through, treating each one as the testing set in each iteration. Within these iterations, the remaining dataset was treated as a training/validation set. Inner rounds of fourfold cross-validation were performed on this training/validation set to find the optimal hyperparameters. For each inner fold iteration, a model was trained on the training set (75% of the training/validation set) and validated on the validation set (25% of the training/validation set). These inner rounds were repeated for various sets of hyperparameters [the hyperparameters tested were dictated by the Tree of Parzen Estimators algorithm in the hyperopt package ([176]60)]. The hyperparameters that resulted in the best average MCC score on the validation sets (averaged across all four inner folds) were selected, and a model was trained with these parameters on the whole training/validation set. This resulted in a trained model with optimal hyperparameters for each outer fold, and the performance of these models on their fold's testing set was reported. This method of cross-validation ensured that the models had the optimal hyperparameters, they were evaluated on completely unseen test data (specifically data that were not used to select the hyperparameters to avoid inflated scores), and the performance on the whole training/testing dataset was reported.
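The nested, patient-grouped cross-validation can be sketched with standard tooling. In the sketch below, scikit-learn's StratifiedGroupKFold stands in for the authors' custom stratified, patient-grouped splitting, and the hyperopt search is reduced to a two-value grid over tree depth for brevity; it is a schematic of the procedure under those assumptions, not the actual implementation:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import matthews_corrcoef

# Toy stand-ins for the methylation matrix, subtype labels, and patient IDs.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 5, size=200)                 # five classes
patients = rng.integers(0, 60, size=200)         # several samples can share a patient

outer = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=0)
outer_scores = []
for train_idx, test_idx in outer.split(X, y, groups=patients):
    # Inner loop (hyperopt TPE search in the paper) reduced to a tiny grid here.
    best_depth, best_mcc = None, -1.0
    inner = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=1)
    for depth in (2, 3):
        mccs = []
        for tr, va in inner.split(X[train_idx], y[train_idx], groups=patients[train_idx]):
            model = xgb.XGBClassifier(n_estimators=50, max_depth=depth)
            model.fit(X[train_idx][tr], y[train_idx][tr])
            mccs.append(matthews_corrcoef(y[train_idx][va],
                                          model.predict(X[train_idx][va])))
        if np.mean(mccs) > best_mcc:
            best_depth, best_mcc = depth, float(np.mean(mccs))
    # Refit on the whole outer-training split with the selected depth,
    # then evaluate on the completely held-out outer test fold.
    final = xgb.XGBClassifier(n_estimators=50, max_depth=best_depth)
    final.fit(X[train_idx], y[train_idx])
    outer_scores.append(matthews_corrcoef(y[test_idx], final.predict(X[test_idx])))

print("Outer-fold MCC scores:", np.round(outer_scores, 3))
```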
Hyperparameters and training details Two parameters were manually set, namely, subsample = 0.5 and colsample_bytree = 0.5, to help reduce overfitting and run time ([177]59). Many of the remaining parameters were found using hyperparameter optimization implemented in the hyperopt Python package (version 0.2.5) ([178]60). The maximum number of evaluations was set to 10 and parallelism was set to 2 to reduce the run time. The search spaces for each parameter were as follows: number of trees: 50 to 500, learning rate: 0.05 to 0.5 (sampled from a log uniform distribution), max tree depth: 2 or 3 (intentionally set very low to help reduce overfitting; see fig. S13 for justification of this decision), L1 regularization term: 0 to 1, and L2 regularization term: 0 to 1. During model training, Gaussian noise with mean 0 and variance 0.2 was added into the training data to help reduce overfitting. Early stopping was also applied to reduce overfitting, with the number of early stopping rounds set to 5 and multiclass log loss as the evaluation metric used. Samples were weighted both to avoid patient bias and to mitigate class bias. Balanced class weights were calculated using the compute_class_weight function in the sklearn package ([179]61), which ensures that samples in less common classes are weighted higher. The same function was used to generate balanced patient weights so that samples from patients with many samples were weighted lower. The two weights were multiplied together to get a weight for each sample to be used during training. Evaluation metrics Accuracy, precision, recall, F1, and the MCC were used to evaluate the models' performance. F1 is the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. The MCC was also reported as it is a performance measure that is not affected by large class imbalances, unlike accuracy. ROC curves and PR curves were also plotted, with each class plotted separately (i.e., using the one-versus-all strategy). ROC curves show the false-positive rate and the true-positive rate over all possible values of the classification threshold. PR curves show the precision and recall over all possible values of the classification threshold. The AUC was also reported for both curves, with 1 as the best possible AUC score. The ROC AUC score is calculated from the model's class-wise output probabilities, and it is a one-versus-all metric, so it is an easier task than multiclass classification. The ROC AUC score is affected by the class imbalance and can overestimate performance. We therefore report a wide range of metrics and performance indicators ([180]Figs. 2 and [181]4; figs. S1, S4, and S5; and tables S1 to S3). Hence, while interpreting the ROC AUC score, it is important to consider the classification performance demonstrated within the confusion matrices, as well as other metrics (such as PR AUC, accuracy, MCC, precision, recall, and F1). High- and moderate-confidence predictions The model's output probability was used as a confidence measure, and the predictions were split into two categories: high confidence, where the output probability was larger than a specified threshold, t, and moderate confidence, where the output probability was less than t. For moderate-confidence predictions, we outputted the model's top two predictions (i.e., the two classes with the two highest output probabilities). The parameter t was chosen on the basis of the testing set.
Three metrics were plotted for values of t between 0.5 and 1 (in increments of 0.025): the fraction of high-confidence predictions, the accuracy of the high-confidence predictions, and the accuracy of the moderate-confidence predictions, where a prediction was treated as correct if the correct class was among the model's top two predictions. To remove noise in these accuracy scores, simple linear models were fitted (i.e., approximated accuracy = a·t + b), and the approximated accuracy scores were used to estimate t. For each value of t, these three metrics were averaged, and the value of t with the largest average was selected. This gave t = 0.85, which was then validated on the external validation datasets.

Benchmarking

To benchmark MethylBoostER against other similar models, we compared the results reported by Chopra et al. ([182]18) and Brennan et al. ([183]20) on the Chopra validation dataset (this was possible because all models were validated on this dataset). The MethylBoostER high/mod model used our high- and moderate-confidence prediction method; when this model made a moderate-confidence prediction, we counted the prediction as correct if either the first or the second prediction was correct. Note that in table S4, the value for the Chopra model for the normal class is 100/99 (i.e., either 100 or 99 correct predictions). This is because Chopra et al. report 100 correct predictions and 2 incorrect predictions for the normal class ([184]18); however, the Chopra validation dataset contains only 101 normal samples.

Tumor purity analysis

Tumor purity was estimated for each sample from methylation beta values using the InfiniumPurify package in R ([185]27), as previously described ([186]20). Purity in correctly and incorrectly predicted samples was compared using the Wilcoxon rank sum test with correction for multiple testing.

Feature analysis

MethylBoostER feature importance and annotation

Feature importance values were obtained from the trained XGBoost models using the gain measure of feature importance. All features with an importance greater than zero were selected. Each selected CpG feature was mapped to a gene if it was located within a gene or within 1500 bp upstream of a gene.

Genomic location of features

The R package ChIPseeker (version 1.30.2) ([187]62) was used for genomic location annotation and visualization, with the R package EnsDb.Hsapiens.v86 (version 2.99.0) ([188]63) as the annotation source. The background set used was the total set of input features. For significance testing, the GAT tool ([189]64) was used together with annotations extracted from EnsDb.Hsapiens.v86 by ChIPseeker (specifically, the detailGenomicAnnotation field).

Overrepresentation analysis

Gene Ontology (GO) enrichment analysis was carried out using the Biological Networks Gene Ontology (BINGO v3.0.3) package ([190]65), and network visualization was carried out using Cytoscape (v3.9.0) ([191]66). Pathway enrichment analysis was carried out using the Panther analysis tool (release 16) ([192]67) with the Panther and Reactome pathway collections. Statistical overrepresentation was assessed using Fisher's exact test with false discovery rate (FDR) correction.
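A minimal sketch of a Fisher-based overrepresentation test of the kind described above, which also applies to the gene-list tests described next (not the authors' code; the helper name and toy inputs are illustrative, and Benjamini-Hochberg correction via statsmodels is used here as one common FDR option rather than the authors' exact implementation):

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

def overrepresentation(selected_genes, gene_lists, background_genes):
    """One-sided Fisher's exact test per gene list, with Benjamini-Hochberg (FDR) correction.

    selected_genes: genes mapped from the model's important features
    gene_lists: dict of list name -> set of genes (e.g., COSMIC, dbEMT2, ...)
    background_genes: all input features mapped to genes (the background set)
    """
    background = set(background_genes)
    selected = set(selected_genes) & background
    names, pvals = [], []
    for name, genes in gene_lists.items():
        genes = set(genes) & background
        a = len(selected & genes)               # selected and in the list
        b = len(selected - genes)               # selected, not in the list
        c = len(genes - selected)               # in the list, not selected
        d = len(background - selected - genes)  # neither
        _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        names.append(name)
        pvals.append(p)
    _, qvals, _, _ = multipletests(pvals, method="fdr_bh")
    return dict(zip(names, qvals))

Each 2×2 table counts, relative to the background of all feature-mapped genes, how the model-derived gene set overlaps a curated gene list.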
Overrepresentation analysis was also performed on eight gene lists: dbEMT2 ([193]32), the COSMIC Cancer Gene Census ([194]33), a list of putative driver genes for ccRCC ([195]23), pRCC and RCC genes from the Diseases database ([196]30), TF Checkpoint ([197]28) for transcription factors, EpiFactors ([198]29) for epigenetic regulators, and HCMDB ([199]31) for metastasis-associated genes. Enrichment of gene sets was determined using Fisher's exact test with FDR correction, implemented with the Python package SciPy ([200]68). The background set used in overrepresentation analysis was the total set of input features mapped to genes; Ensembl gene IDs were used where specified (otherwise, gene symbols were used). For the localized region GO enrichment analysis, the R package rGREAT (version 1.26.0) was used, with the total set of input features as the background. The significant (adjusted P ≤ 0.05) biological process terms that best summarized the results (with redundant terms manually excluded) were visualized.

Literature mining entity relationships

The Chilibot text mining web application ([201]69) was manually searched with the 45 genes common to all four models together with the terms "kidney," "renal," "RCC," and "cancer." Thirty million PubMed abstracts were also searched for entity relationships using the Pangaea NLP Python package ([202]70) with a renal carcinoma–associated relation term list and a collection of HUGO Gene Nomenclature Committee gene symbols (4217 genes) derived from MethylBoostER's features mapped to genes (5012 genes). Entity relationship terms with frequencies ≥3 were used to build a gene relationship network using Cytoscape (v3.9.0) ([203]66).

Extracting subtype-specific features

In the XGBoost implementation that we used ([204]59), the ith tree focuses on the (i mod c)th class, where c is the total number of classes (5 in our case). Thus, for each class, we took all trees focusing on that class and extracted the features from them. The UpSet plot was created using the upsetplot Python package (version 0.6.0) ([205]71).

Acknowledgments