Abstract Current gold standard diagnostic strategies are unable to accurately differentiate malignant from benign small renal masses preoperatively; consequently, 20% of patients undergo unnecessary surgery. Devising a more confident presurgical diagnosis is key to improving treatment decision-making. We therefore developed MethylBoostER, a machine learning model leveraging DNA methylation data from 1228 tissue samples, to classify pathological subtypes of renal tumors (benign oncocytoma, clear cell, papillary, and chromophobe RCC) and normal kidney. The prediction accuracy in the testing set was 0.960, with class-wise ROC AUCs >0.988 for all classes. External validation was performed on >500 samples from four independent datasets, achieving AUCs >0.89 for all classes and average accuracies of 0.824, 0.703, 0.875, and 0.894 for the four datasets. Furthermore, consistent classification of multiregion samples (N = 185) from the same patient demonstrates that methylation heterogeneity does not limit model applicability. Following further clinical studies, MethylBoostER could facilitate a more confident presurgical diagnosis to guide treatment decision-making in the future. __________________________________________________________________ The MethylBoostER machine learning model improves the diagnosis of kidney tumor subtypes by utilizing DNA methylation patterns. INTRODUCTION Renal cell carcinoma (RCC) is among the top 10 most common cancers worldwide. Incidence rates have increased by 47% in the past 10 years, making it one of the fastest accelerating cancers, with rates projected to rise even further in the future ([54]1). This is attributed to the increasing prevalence of known risk factors and increased incidental detection, secondary to widespread use of imaging for unrelated symptoms ([55]2). Small renal masses (SRMs), defined as <4 cm in size, are therefore increasingly detected and encompass a mixture of potential diagnoses that comprise dozens of histological and molecular tumor subtypes ([56]3, [57]4). The major subtypes include clear cell RCC (ccRCC), papillary RCC (pRCC), chromophobe RCC (chRCC), or benign oncocytomas. Differentiating malignant from benign solid SRM represents a challenge to both imaging and renal biopsy. Renal biopsy may be nondiagnostic due to small tumor size, challenging anatomy, low biopsy tumor content, or tumor heterogeneity ([58]5–[59]7). As a result of no biopsy being undertaken or the inability to conclude a histological diagnosis based on biopsy samples, approximately 20% of SRMs removed at surgery are found to be benign postoperatively ([60]8). Consequently, a substantial number of patients undergo unnecessary surgery for a benign condition, with associated postoperative risks of morbidity and mortality and long-term impact on renal function. The rates of postoperative complications following minimally invasive surgery are blood transfusion (5%), reoperation (2 to 5%), respiratory complications (1 to 7%), and death (4%) ([61]9). Conversely, a meta-analysis demonstrated that one in four renal biopsies reported as oncocytoma is found to be malignant RCC following surgical excision ([62]5), risking false negatives, late diagnosis, and deleterious outcomes. Improved diagnosis and differentiation of SRM has been identified as a key research focus in RCC by an international research priority setting initiative ([63]10). 
Each of these pathological subtypes has characteristic genetic and molecular features such that it is argued that RCC is not a single disease ([64]11). ccRCC and pRCC are derived from the proximal convoluted tubule in the renal cortex, whereas chRCC and oncocytoma are derived from the distal nephron ([65]12, [66]13). DNA methylation changes are abundant, genome-wide early events in renal tumorigenesis, and are specific to the cell of origin, and so represent an ideal diagnostic target in this setting ([67]14–[68]17). Analysis of DNA methylation markers therefore has the potential to support and improve the current diagnostic pathway for renal cancer, providing a much more confident presurgical diagnosis to guide treatment decision-making. In this study, we propose MethylBoostER (Methylation and XGBoost for Evaluation of Renal tumors). MethylBoostER is a machine learning model based on the XGBoost (eXtreme Gradient Boosting) library that differentiates pathological subtypes of renal tumors, using DNA methylation markers identified in large tissue datasets. As seen in [69]Fig. 1, which shows the overview of our approach, we externally validate MethylBoostER on four independent datasets and evaluate multiregion samples from the same patient to assess the role of intratumoral heterogeneity (ITH) on MethylBoostER’s performance. In addition, we externally validate our results on ex vivo core biopsy tissue samples (which simulate in vivo patient biopsies) and fine needle aspirates (FNAs) and low-purity samples representing real-world applications and evaluate the role of tumor purity on MethylBoostER’s performance to assess clinical utility. MethylBoostER is trained on the largest DNA methylation cohort of renal tumor samples to date and is extensively validated, demonstrating accurate predictions across international cohorts and potential future clinical utility. Fig. 1. Overview of MethylBoostER. [70]Fig. 1. [71]Open in a new tab Three DNA methylation datasets are used to train and test the XGBoost classification model. The model is then validated on four external datasets. The high- and moderate-confidence predictions from the model output are used for improving diagnostic decisions. Model performance on both multiregion samples and sample purity was assessed. RESULTS DNA methylation sequencing in a cohort of patients with renal tumors DNA methylation data were obtained for 1228 samples from three data sources, namely, Cambridge samples (N = 319), The Cancer Genome Atlas (TCGA) (N = 872), and the Chopra training set (N = 37) ([72]Fig. 2, A and B) ([73]18). The former dataset contained information on 3 million CpGs measured using TruSeq EPIC, while the latter two were obtained via the 450K Illumina array. The merged datasets were obtained by overlapping CpGs within ±50 base pairs (bp) of Illumina probes and overall contained data on 158,670 probes. This dataset (N = 1228) was split into training and testing sets (via fourfold cross-validation; see Materials and Methods), with subsequent validation being performed on four additional independent datasets [Chopra validation ([74]18), Wei ([75]19), Brennan ([76]20), and Evelönn ([77]21)], and will be referred to as such hereafter. Fig. 2. Data characteristics and testing set performance. [78]Fig. 2. [79]Open in a new tab (A) Number of samples in each class used for the training/testing sets. (B) Uniform Manifold Approximation and Projection (UMAP) representation of the training/test dataset, using all input features. 
(C) Confusion matrix displaying the testing set performance, with precision and recall bars. (D) UMAP representation of the training/test dataset, using the input features learnt by the XGBoost model. (E) ROC curves over the testing set, split by class. MethylBoostER accurately distinguishes pathological subtypes of renal tumors We trained multiclass XGBoost machine learning models to classify four pathological subtypes of renal tumors (ccRCC, chRCC, oncocytoma, and pRCC) and normal tissue. Fourfold nested cross-validation was used [a method to randomly split the dataset into training (75%) and testing (25%) sets over four iterations] so that the testing set performance could be evaluated over the whole training/testing set. Consequently, this results in four trained models. We display summary results for the four models (individual results displayed in figs. S1, S2, S4, S5, and S7). The performance over the training/testing set was high, with an average accuracy of 0.960. Table S1 lists additional performance metrics, including metrics for each renal tumor subtype (forthwith called “class”), which shows that the models could distinguish the normal class perfectly, and the chRCC class was the most difficult to classify correctly. [80]Figure 2C shows the confusion matrix, which depicts how many samples, of a given true class, were predicted as each class. Most samples (1179 of 1228; 96%) are on the diagonal, indicating that they were correctly predicted. Of the 1228 samples evaluated, common misclassifications were in predicting ccRCC as pRCC (13 of 1228 samples; 1.1%) and predicting pRCC as ccRCC (9 of 1228; 0.73%) or chRCC (8 of 1228; 0.65%) samples. Although three oncocytoma samples were predicted as malignant RCC, no malignant samples were classed as benign oncocytoma or normal tissue. A large motivating factor for using XGBoost models is their interpretability—we can examine which input features were learnt by the models when classifying the five classes. We selected only these learned features and used them to visualize the training/testing set, as shown in [81]Fig. 2D (and fig. S2). This shows that although there are dataset-specific clusters within each class, the samples now form class-wise clusters. This suggests that the features that the models learn are not dataset specific but are features that distinguish each class in all three datasets. [82]Figure 2E (and fig. S1) shows the receiver operating characteristic (ROC) curves [and precision-recall (PR) curves] for each class, again indicating that the models achieved a high testing set performance. High- and moderate-confidence predictions make model outputs clinically more informative Given a sample and a trained XGBoost model, we can obtain the model’s predicted probabilities for that sample belonging to each class (renal tumor subtype). This reveals the confidence of the model’s prediction. [83]Figure 3A shows the predicted probabilities of the predicted class over the testing sets, which indicates that while most predictions are certain, some are less certain (13.2% of samples were predicted with a probability of <0.95), with probabilities as low as 0.256. Fig. 3. High- and moderate-confidence predictions. [84]Fig. 3. [85]Open in a new tab (A) Histogram of the model’s probabilities of the predicted class for the testing sets. (B) Line plot showing how the testing set accuracy scores and fraction of high-confidence predictions vary as the threshold changes. The vertical dotted line indicates the chosen threshold, 0.85. 
(C) Graphical overview of the prediction process with high- and moderate-confidence predictions. We used this information to separate the models’ predictions into two categories: high-confidence and moderate-confidence predictions. High-confidence predictions are where the model’s confidence is greater than a threshold, and we output a single answer—the predicted class. Moderate-confidence predictions are where the model’s confidence is lower than the threshold, and we output two answers—the predicted class (the first most likely prediction) and the second most likely prediction. We refer to these two outputs as first prediction and second prediction, respectively. The threshold was set to 0.85 (see [86]Fig. 3B), which maximizes the accuracy of both high- and moderate-confidence predictions and the fraction of high-confidence predictions over the testing set. In clinical practice in the case of moderate predictions, the clinician could consider the two predicted classes and integrate this information with clinical, histopathology, and imaging features to conclude the most likely diagnosis (see [87]Fig. 3C). The average accuracy of the high-confidence predictions over the testing sets was 0.982. The average accuracy of the moderate-confidence predictions over the testing sets was 0.668 (considering only the first prediction) but was 0.871 when the prediction was treated as correct if the first or second prediction was correct. See fig. S3 for the testing set confusion matrix split by high- and moderate-confidence predictions. External validation on four independent datasets verifies the generalizability of MethylBoostER to new data We externally validated MethylBoostER on four independent datasets, namely, Chopra validation (N = 245), Brennan (N = 37), Wei (N = 92), and Evelönn (N = 144) ([88]18–[89]21). They each contain different numbers of samples and a different class distribution, as shown in [90]Fig. 4A. The normal and ccRCC classes are the most frequent, which reflects clinical prevalence. We evaluated our models on these datasets (without the high- and moderate-confidence predictions method, so just taking the first prediction for all samples), and the average accuracy for Chopra validation, Brennan, Wei, and Evelönn was 0.824, 0.703, 0.875, and 0.894, respectively (see table S2 for additional metrics and fig. S4A for confusion matrices). The models had lower performance for chRCC and oncocytoma; oncocytoma samples were frequently predicted as normal and chRCC. In Chopra validation, the oncocytomas predicted incorrectly on both the first and second prediction were predicted as normal (11 samples), chRCC (4 samples), and pRCC (2 samples). In Brennan, the incorrect oncocytomas (on both first and second predictions) were predicted as normal (one sample) and chRCC (five samples). Combining all external validation sets, 7.89% (3 of 38) of oncocytomas were correctly classified as oncocytoma, 34.2% (13 of 38) were predicted as normal tissue, and 57.9% (22 of 38) were classified as malignant. Of the malignant samples, 94.4% (289 of 306) were classified correctly as malignant, 5.56% (17 of 306) were predicted as normal tissue, and 0% (0 of 306) was classified as oncocytoma (full details in confusion matrices in fig. S4). Although some oncocytoma samples were classified as malignant disease, no malignant samples were classified as benign oncocytoma ([91]Fig. 4). 
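In code, the high-/moderate-confidence scheme described above reduces to a thin post-processing layer over the model's class probabilities. The following is a minimal illustrative sketch (our own naming, not the authors' released code), assuming a probability vector over the five classes and the 0.85 threshold reported in the text:

```python
import numpy as np

CLASSES = ["normal", "ccRCC", "pRCC", "chRCC", "oncocytoma"]  # illustrative class order

def confidence_call(probs, threshold=0.85, classes=CLASSES):
    """Turn one sample's class probabilities into a high-confidence call (top-1)
    or a moderate-confidence call (top-1 plus second most likely class)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # classes sorted by descending probability
    first, second = classes[order[0]], classes[order[1]]
    if probs[order[0]] >= threshold:
        return {"confidence": "high", "prediction": first}
    return {"confidence": "moderate", "prediction": first, "second_prediction": second}

# Example: an uncertain sample returns its two most likely subtypes.
print(confidence_call([0.05, 0.55, 0.30, 0.07, 0.03]))
# {'confidence': 'moderate', 'prediction': 'ccRCC', 'second_prediction': 'pRCC'}
```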
Given current clinical practice (which errs on the side of caution and removes SRMs, which are later found to be benign), it would be acceptable to remove an oncocytoma in case this might be malignant, but it would not be acceptable to confuse a malignant mass as benign, as this would risk the cancer spreading. Malignant tumor samples predicted as normal tissue may be due to low tumor purity (see the next section). The relatively lower performance in oncocytoma and chRCC may be due to the low proportion of these classes in the training set, reflecting a real-world challenge, as these two are the least common pathological subtypes. In addition, previous reports suggest that methylation in these two classes is most similar, in keeping with their common cellular origin ([92]13). We found that incorporating high- and moderate-confidence predictions leads to improved results. [93]Figure 4B and table S3 show the accuracy and Matthews correlation coefficient (MCC) scores for the high- and moderate-confidence predictions. As seen in [94]Fig. 4B, the moderate-confidence first predictions have an accuracy of <0.65 for all four datasets, and the second predictions have an accuracy of <0.45 for all four datasets. When we combine them, we see that the first or second predictions reach an accuracy of >0.7. As expected, the accuracy of the high-confidence predictions is higher and is >0.9 for all datasets. Fig. 4. External validation on four independent datasets. [95]Fig. 4. [96]Open in a new tab (A) Number of samples in each class for each dataset. (B) Accuracy for high- and moderate-confidence predictions for each external dataset. “First or second prediction” indicates that a prediction is treated as correct if its first or second prediction was correct. (C to F) Confusion matrices for both high- and moderate-confidence predictions and ROC curves, split by class, for each external dataset. For the moderate-confidence confusion matrices, the x axis is split into first prediction was correct, the second prediction was correct, and both first and second predictions were incorrect. Many moderate-confidence predictions are correct on the second prediction, as shown in [97]Fig. 4 (C to F) (and fig. S4B for the other models). For example, 16 samples in the Chopra validation dataset had an incorrect first prediction but correct second prediction (see [98]Fig. 4C). These samples would have simply been incorrectly predicted without the use of this high- and moderate-confidence prediction method. There is also a higher fraction of moderate-confidence predictions for the chRCC and oncocytoma classes (fig. S4C), reflecting the difficulty of predicting these classes. We also show that all four datasets achieved high class-wise ROC areas under the curve (AUCs), as shown in [99]Fig. 4 (C to F) (see fig. S5 for the other models’ ROC curves and all PR curves). We benchmarked MethylBoostER to the models introduced by Chopra et al. ([100]18) and Brennan et al. ([101]20) by comparing classification results over the Chopra validation dataset. We include MethylBoostER’s results both with and without high- and moderate-confidence prediction results. We also include Chopra’s results when excluding their benign class, as MethylBoostER or Brennan does not have results for this class. The overall accuracies were 0.846 (Chopra), 0.890 (Chopra excl. benign), 0.872 (Brennan), 0.824 (MethylBoostER), and 0.878 (MethylBoostER high/mod). 
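The "first or second prediction" accuracy used above is a top-k-style metric and is straightforward to compute from the predicted probability matrix; a small sketch with toy numbers and our own helper name (not part of the published pipeline):

```python
import numpy as np

def top_k_accuracy(prob_matrix, y_true, k=2):
    """Fraction of samples whose true class is among the k most probable classes.
    prob_matrix: (n_samples, n_classes); y_true: integer class indices."""
    prob_matrix = np.asarray(prob_matrix)
    top_k = np.argsort(prob_matrix, axis=1)[:, ::-1][:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))

# Toy example with three samples and five classes:
probs = np.array([[0.70, 0.20, 0.05, 0.03, 0.02],
                  [0.10, 0.45, 0.40, 0.03, 0.02],
                  [0.05, 0.15, 0.20, 0.50, 0.10]])
y = [0, 2, 3]
print(top_k_accuracy(probs, y, k=1))  # ~0.67: first prediction only
print(top_k_accuracy(probs, y, k=2))  # 1.0: correct if first or second prediction matches
```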
The number of correct predictions per class is shown in table S4, which shows that Chopra and MethylBoostER high/mod both achieve the highest performance for the normal and chRCC classes. Chopra achieves the highest performance for oncocytoma, and MethylBoostER high/mod achieves the highest performance for the malignant (and most clinically important) ccRCC and pRCC classes. This shows that MethylBoostER high/mod is competitive among similar classification models, showing the joint best performance for chRCC and the best performance for malignant ccRCC and pRCC subtypes. Because of the molecular similarity of the chRCC and oncocytoma subtypes and more frequent misclassification of these two classes in the combined multiclass model, we explored the utility of a binary classifier by training XGBoost models with only chRCC and oncocytoma classes (using the same method as described in Materials and Methods). This approach was hampered by limited sample numbers, with a total of 93 chRCC and 58 oncocytoma samples in the testing set (fig. S6A) and 14 chRCC and 38 oncocytoma samples in the external validation sets (fig. S6B). The models achieved high performance on the testing set (see fig. S6C), with an average accuracy of 0.968, MCC of 0.932, and ROC AUC of 0.9959. In the limited cohort available for external validation datasets, all models predicted every sample as chRCC (fig. S6, D and E), demonstrating that the binary models for these two molecularly similar RCC subtypes could not be generalized and added little value over the multiclass MethylBoostER model. However, the ROC and PR curves highlight clear evidence of subtype-specific signals with these DNA methylation models (fig. S6, F to I), reaching an ROC AUC of 0.948 for the Brennan dataset. This gives a similar picture to ROC and PR curves for the multiclass models (fig. S5) and indicates that the predicted probabilities can differentiate the two classes, but that the model’s classification threshold does not generalize in the validation datasets, possibly because of the limited datasets available for these RCC subtypes. The underlying prediction scores are visualized in fig. S6 (J and K), which show that the probabilities for oncocytoma samples are consistently lower than chRCC samples; however, they are not lower than 0.5 (the current model’s classification threshold). Overall, we found that a separate binary classifier did not add value over the multiclass MethylBoostER model, possibly because of the limited available data for these subtypes (see Discussion for suggested future studies that could overcome these limitations). MethylBoostER provides consistent classification of multiregion samples from the same patient ccRCC is characterized by a high degree of genetic ITH, with over 60% of somatic mutations not detectable across all multiregion samples ([102]22). On average, seven multiregion samples are needed to detect over 75% of genetic variants ([103]23). Very little is known about methylation variation in ccRCC, although a handful of reports suggests relative homogeneity ([104]21, [105]24–[106]26), therefore highlighting the potential relevance of diagnostic methylation markers. In the Cambridge dataset, multiregion samples were available (N = 168) for 25 patients (18 ccRCC, 4 chRCC, 2 oncocytoma, and 1 pRCC). In 92% (23 of 25) of patients, all multiregion samples were predicted consistently as being from the same pathological subtype, with 88% (22 of 25) achieving correct classification for all samples (see [107]Fig. 5A). 
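Checking multiregion consistency amounts to grouping per-sample predictions by patient; a small pandas sketch (the table layout and column names are hypothetical, not taken from the study's data release):

```python
import pandas as pd

# Hypothetical per-sample prediction table (patient ID, true subtype, predicted subtype).
preds = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2", "P3", "P3"],
    "true_subtype": ["ccRCC"] * 3 + ["chRCC"] * 2 + ["oncocytoma"] * 2,
    "predicted_subtype": ["ccRCC", "ccRCC", "ccRCC", "chRCC", "chRCC", "oncocytoma", "chRCC"],
})

per_patient = preds.groupby("patient_id").agg(
    n_regions=("predicted_subtype", "size"),
    consistent=("predicted_subtype", lambda s: s.nunique() == 1),   # same call in every region
    all_correct=("predicted_subtype",
                 lambda s: (s == preds.loc[s.index, "true_subtype"]).all()),
)
print(per_patient)
print("Fraction of patients with consistent multiregion calls:",
      per_patient["consistent"].mean())
```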
Multiregion samples (N = 17) were also available for six patients with ccRCC from the Evelönn external validation dataset ([108]21). As shown in [109]Fig. 5B, 83% (five of six) of patients had a concordant prediction for all multiregion samples derived from the same patient, although the model struggles to differentiate chRCC from ccRCC. We noted that incorrectly predicted samples had a very low tumor purity. When the Evelönn dataset is visualized using dimensionality reduction (see fig. S7), almost all the incorrectly predicted samples (including all the incorrectly predicted multiregion samples) can be seen to cluster away from the rest of the ccRCC samples, indicating that these samples have a different methylation pattern (potentially related to their lower tumor purity, as shown in fig. S7B). Fig. 5. Classification of multiregion samples. [110]Fig. 5. [111]Open in a new tab Diagram visualizing the model’s predictions of multiregion samples for each patient in the Cambridge and Evelönn datasets. Impact of tumor purity on MethylBoostER Tumor purity was calculated using methylation beta values and InfiniumPurify, as described ([112]20, [113]27). This method determines tumor purity in the context of contamination with a normal (nontumor) kidney sample. As expected, median sample purity was lower for the Chopra and Brennan datasets compared to TCGA and Cambridge, as the former include ex vivo core biopsy samples (i.e., simulated biopsies using nephrectomy specimens) and FNAs, respectively. Of all the datasets, the Evelönn data appear to have the lowest tumor purity (median purity: 0.88 Cambridge, 0.84 TCGA, 0.80 Chopra training, 0.58 Chopra validation, 0.73 Brennan, 0.81 Wei, and 0.48 Evelönn). [114]Figure 6A summarizes purity for samples that were predicted correctly in the first prediction, second prediction, and incorrectly predicted samples (results for individual datasets are shown in fig. S8). Median purity in samples that were incorrectly predicted was significantly lower than samples correctly predicted on the first or second prediction (0.27 versus 0.82 versus 0.44, P < 0.01 for all comparisons; [115]Fig. 6A). We noted that three of the samples that were misclassified by our model were also incorrectly classified by Brennan et al. ([116]20) and that the study authors postulated that this was due to low sample purity. [117]Figure 6 (B and C) summarizes the purity and prediction probability (for the first prediction), highlighting incorrectly predicted samples. We found a correlation between purity and the probability of first prediction in Wei and Evelönn [Pearson’s correlation coefficients of 0.58 and 0.51, adjusted P < 0.01], but not the other datasets (correlation < 0.30 and/or adjusted P > 0.01). There are a number of samples that are incorrectly predicted despite having a high-confidence prediction, and these tend to have lower purity. Considering all datasets combined, table S5 summarizes the model accuracy, as well as the median probability of the first prediction, by tumor purity. This further demonstrates that samples with higher tumor purity are more accurately predicted, for example, samples with purity ≥ 0.9 obtain an accuracy of 0.99. There is a sharp drop-off in accuracy in samples with a purity of <0.20 compared to samples in which purity was >0.2, suggesting that potentially a biopsy sample may have to be repeated if purity is below this level. 
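The purity-confidence association reported here uses Pearson correlation with multiple-testing adjustment; the calculation can be sketched as below (simulated toy values, and Benjamini-Hochberg adjustment is our assumption for the correction step rather than a detail stated in the paper):

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Toy purity values for two hypothetical validation datasets.
datasets = {"dataset_A": rng.uniform(0.2, 1.0, 50),
            "dataset_B": rng.uniform(0.2, 1.0, 50)}

names, r_values, p_values = [], [], []
for name, purity in datasets.items():
    # Simulated first-prediction probabilities that loosely track purity.
    prob_first = np.clip(0.5 + 0.4 * purity + rng.normal(0, 0.1, purity.size), 0, 1)
    r, p = pearsonr(purity, prob_first)
    names.append(name); r_values.append(r); p_values.append(p)

# Benjamini-Hochberg adjustment of the per-dataset P values.
_, p_adjusted, _, _ = multipletests(p_values, method="fdr_bh")
for name, r, q in zip(names, r_values, p_adjusted):
    print(f"{name}: Pearson r = {r:.2f}, adjusted P = {q:.3g}")
```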
This analysis was performed post hoc on both training/testing and validation datasets, and there were a limited number of low-purity samples (5% of the cohort; table S5). Therefore, these results (and a purity threshold that warrants repeat biopsy) remain to be externally validated. Fig. 6. Sample purity and MethylBoostER output. (A) Sample purity for samples that are predicted correctly on the first prediction (1st correct) and second prediction (2nd correct) and incorrectly predicted samples (incorrect) on both predictions. Data are shown for all datasets combined, with pathological subtypes shown in different colors. Adjusted P values are shown (*P < 0.05 and ***P < 0.0009). (B and C) Sample purity and the probability of the first prediction are demonstrated for all datasets combined (B) and each dataset individually (C). The threshold t = 0.85 indicates a high-confidence prediction. Samples that are incorrectly predicted (in both first and second prediction) are indicated with a cross. Features selected by MethylBoostER are associated with carcinogenic processes The features that the MethylBoostER models used during classification were obtained. Each of the four XGBoost models selected 1490, 1697, 1476, and 1331 features, respectively. An analysis of the genomic location of the features revealed that the selected features do not follow the same genomic location distribution as the background, as shown in [120]Fig. 7A. In the promoter regions close to the transcription start site (TSS) (≤1 kb), there is a lower percentage of features compared to the background set (68% compared to 78%). However, in the promoter regions further away from the TSS (1 to 2 kb and 2 to 3 kb away), there is a higher percentage of features compared to the background set (8.7% compared to 6.3% and 4.3% compared to 2.3%). There is also a higher percentage of features in introns (5.1% compared to 3.1% for first introns and 8.7% compared to 5.5% for other introns). This distribution of the features' genomic locations is significantly different from the background (apart from the intergenic annotation), as shown in table S6. Fig. 7. The genomic location and functional annotation of the features selected by MethylBoostER. (A) Distribution of genomic locations (relative to genes) for the selected features compared to the background (the total set of input features). (B) Enriched GO terms from the Biological Process category represented as a network, where each branch represents a different functional category. Results were obtained from the gene-wise GO analysis. (C) Enriched GO terms from the Biological Process category represented as a bar plot. Results were obtained from the localized region GO analysis. The features from MethylBoostER were mapped to proximal genes and analyzed at the gene ontology (GO) and pathway level. The GO terms are visualized as a network in [123]Fig. 7B. The GO analysis detected a neuronal and muscle developmental signature, cancer signaling [Wnt, mitogen-activated protein kinase (MAPK), and Ras signaling pathways], and both positive and negative regulation of transcription, among others. The significant enrichment (see table S7 for the adjusted P values of all gene list comparisons) of transcription factors [TF Checkpoint ([124]28)] and epigenetic regulators [EpiFactors ([125]29)] indicates the perturbation of transcriptional regulators and chromatin remodelers in renal cancer.
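Collecting the features "used" by each model follows directly from XGBoost's gain-based importance scores; a brief illustrative sketch (toy data, not the authors' pipeline) of gathering the nonzero-gain features from each fold's model and intersecting them to find consistently selected CpGs:

```python
import numpy as np
import xgboost as xgb

# Toy stand-in for the methylation matrix (samples x CpG features) and subtype labels.
rng = np.random.default_rng(0)
X = rng.random((120, 200))
y = rng.integers(0, 5, size=120)                     # five classes, as in MethylBoostER
feature_names = [f"cg{i:05d}" for i in range(X.shape[1])]

selected_per_model = []
for seed in range(4):                                # stand-in for the four cross-validation models
    model = xgb.XGBClassifier(n_estimators=50, max_depth=3, random_state=seed)
    model.fit(X, y)
    gain = model.get_booster().get_score(importance_type="gain")   # keys like 'f12'
    used = {feature_names[int(k[1:])] for k in gain}               # features with gain > 0
    selected_per_model.append(used)

consistent = set.intersection(*selected_per_model)
print(f"Features selected by all four models: {len(consistent)}")
```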
While an RCC-related gene set [RCC Harmonizome/Diseases databases ([126]30)] and a ccRCC putative driver gene set ([127]23) were enriched, a similar set of pRCC genes was not enriched [pRCC Harmonizome/Diseases ([128]30)]. The feature set also indicated a strong metastatic signature [Human Cancer Metastasis Database ([129]31)], with enriched GO terms related to Motility, Axon guidance, and Cell adhesion. Enrichment of epithelial-mesenchymal transition (EMT)–related genes [dbEMT2 ([130]32)] provides further evidence for this metastatic signature. In addition, the gene list was enriched for tumor suppressors, oncogenes, and fusion genes [Catalogue Of Somatic Mutations In Cancer (COSMIC) Cancer Gene Census ([131]33)]. We also checked whether the results of the gene-wise GO analysis were consistent with the results from the Genomic Regions Enrichment of Annotations Tool (GREAT) ([132]34), which is designed for localized regions that are not necessarily within genes. This also enabled the features that were not mapped to genes to be included. [133]Figure 7C shows the significant GO terms resulting from this analysis, which demonstrates that they are similar to the results of the gene-wise GO analysis. The consistent themes were development, cell adhesion, cell motility, cell signaling, and neurogenesis. A few functions were unique to this GREAT analysis, such as immune response and extracellular matrix organization, and a few were unique to the gene-wise analysis, such as regulation of transcription and response to cortisol stimulus. Pathway analysis revealed enriched pathways covering similar functions as found during the GO analysis. The most enriched pathways were the following: Wnt signaling pathway (adjusted P = 2.79 × 10^−15), Cadherin signaling pathway (adjusted P = 2.92 × 10^−5), and Netrin-1 signaling (adjusted P = 4.62 × 10^−03). Other enriched pathways included axon guidance, cell junction organization, and signaling by receptor tyrosine kinases. Next, we focused our analysis on the consistently salient features—features that were selected by all four XGBoost models. There were 38 such features, which mapped to 45 genes. Of these 45 genes common to all four models, 9 genes have not been previously associated with kidney function, renal disease, or cancer (ANKMY1, EHMT1, H2AW, JDP2, H2BU1, KIAA1143, KIF15, LINC01960, and SF3A2). As detailed in table S8, an analysis of the literature shows that the others have been associated with ccRCC, chRCC, and RCC. These genes have also been linked to kidney development, gene expression in kidney, adhesion, motility, invasion, metastasis, and poor survival in renal cancer and other cancer types. In addition, some of them contribute to increased survival, act as tumor suppressors, promote cancer stemness, contribute to carcinogenesis, proliferation, and tumor growth, and have been identified in other cancer types. These associations demonstrate that the methylation features identified by MethylBoostER are biologically relevant. The methylation distribution of all 38 features and their genes can be seen in fig. S9. We confirmed these findings by carrying out an entity relationship–driven natural language processing (NLP) analysis of approximately 30 million PubMed abstracts, using the feature mapped gene set and a dictionary of relationship terms based on enriched biological processes identified above. 
Modeling this information from 28,579 interactions (each supported by multiple publications) as an entity relationship network, and subsequent network topological analysis where relationship strength was visualized as degree (number of genes directly linked to a relationship) and relationship importance as betweenness centrality, highlighted the association of these genes with carcinogenesis, development, and tumor microenvironment interactions. This corroborates the functional bioinformatics analyses and confirms that these genes have been previously associated with carcinogenic processes, tumor microenvironment, metabolism, metastasis, and immune and inflammatory responses and are known to influence patient survival and prognosis in the biomedical literature (fig. S10 and data file S1 containing NLP-derived gene relationship pairs and their frequency). We also extracted the class that each feature contributes toward to identify subtype-specific features. Figure S11A shows that the genomic location of the features was distributed slightly differently for each class, with the normal, pRCC, and oncocytoma classes using fewer features in proximal promoter regions near the TSS (≤1 kb). The normal and pRCC classes also used more features in introns. We then calculated enriched GO terms for each class using GREAT and found enriched terms for normal, ccRCC, chRCC, and pRCC. The normal class was enriched in terms such as membrane organization and cell activation (fig. S11B). The ccRCC class was enriched in terms such as regulation of signaling and regulation of cell communication (fig. S11C). The chRCC class was enriched in multiple terms relating to negative regulation of DNA repair and regulation of multicellular organismal process (fig. S11D). The pRCC class was enriched in many terms, such as single-organism developmental process, movement of cell or subcellular component, localization, defense response, and positive regulation of biological process (fig. S11E). After mapping these subtype-specific features to genes, we also show that many genes were often found in multiple subtypes. Figure S11E demonstrates that 205 genes were found in both ccRCC and pRCC, 159 genes were found in both chRCC and ccRCC, and 137 genes were found in both chRCC and pRCC. DISCUSSION The rising incidence of SRMs drives the need to improve the diagnostic pathway to avoid overtreatment of patients with benign disease and optimize health resource use. The challenge consists of differentiating benign oncocytoma from malignant subtypes, the most common being ccRCC, pRCC, and chRCC. Previously published molecular classifiers are often limited by focusing solely on distinguishing oncocytoma from chRCC, excluding the other, more common subtypes and therefore reducing applicability in the real world ([134]20, [135]35–[136]38). In addition, a number of these studies had small sample sizes (<200 samples) ([137]36, [138]38, [139]39). Existing models often rely on a limited number of markers (e.g., <100 markers), making them less robust when applied to heterogeneous clinical samples and limiting potential future applicability ([140]20, [141]36–[142]38, [143]40). Machine learning models based on DNA methylation have demonstrated excellent accuracy in other cancer types, including lung, brain, and breast malignancies and sarcomas ([144]41–[145]43). 
The ability of machine learning models trained on DNA methylation data to classify cancer of unknown primary with excellent accuracy, leading to benefits in overall survival, serves as a testament to DNA methylation as a unique marker of cell identity ([146]44). We have therefore developed an XGBoost machine learning model (MethylBoostER), leveraging methylation data from over 1200 patient samples. We demonstrate that the model accurately predicts pathological subtypes of renal tumors in the training and testing set, with ROC AUCs of over 0.95 for all subtypes evaluated. The extensive external validation in four independent methylation datasets, totaling over 500 patient samples, is a measure of the robustness of our method. We demonstrate the high accuracy of the model with average accuracies of 0.824, 0.703, 0.875, and 0.894 and AUCs of over 0.89 for all subtypes in all independent datasets. We also show that MethylBoostER is competitive among similar existing models (when using high- and moderate-confidence predictions), achieving the highest performance for malignant RCC classes (chRCC, ccRCC, and pRCC). A recognized clinical challenge is that biopsy accuracy may be hampered by genetic ITH or reduced tumor content, especially in smaller or more difficult to access tumors. We therefore address these potential drawbacks. We demonstrate that, unlike the known extensive genetic ITH ([147]22), analyzing methylation patterns in multiregion samples from the same patient led to consistent diagnoses in the vast majority of cases (92% in the Cambridge and 83% in the Evelönn datasets). In addition, we assessed low-purity samples (including ex vivo biopsies and FNAs) to represent a real-world scenario. We show that samples that are incorrectly predicted by our model have significantly lower purity than correctly predicted samples. The association between purity and model accuracy was also previously noted by Brennan et al. ([148]20). It is customary to take more than one sample at the time of biopsy, which may help overcome this challenge; alternatively, low-purity samples may require repeat biopsy. In the testing set, the most common misclassifications involved ccRCC, pRCC, and chRCC, while in the external validation sets, chRCC and oncocytoma were the classes with the lowest AUCs. The tumors’ shared cells of origin (proximal versus distal nephron) may explain some of these results. Problems in accurately differentiating between ccRCC and chRCC on histopathology are another well-known challenge. In TCGA, 15 cases were initially classified as ccRCC on histopathological slide review; however, these were later re-reviewed by specialist urohistopathologists and reclassified as chRCC ([149]11, [150]45). We noted that eight of these samples were included in our testing and training dataset, and MethylBoostER classifies five as chRCC and three as ccRCC. The former suggests that our model can correctly classify chRCC samples better than a standard pathologist, and more akin to a specialist urohistopathologist. We explored the methylation and gene expression profiles of TCGA samples (fig. S12, A and B) and demonstrated that the three samples that our model classifies as ccRCC (TCGA participant IDs 4821, 4688, and 4696) cluster more closely with ccRCC than chRCC based on both methylation and gene expression data. We hypothesize that these samples may be ccRCC (i.e., the first classification was correct rather than the reclassification). 
This highlights existing challenges in diagnosing subtypes using current histopathology methods and emphasizes the need to produce accurate predictive models. Last, another challenge is that predictive models are trained on datasets in which the true diagnosis is based on histopathology, and if this is incorrect, it may bias the model. To ensure that the predictions of MethylBoostER were not heavily biased by these eight mislabeled samples, we retrained the models with the reclassified labels (eight ccRCCs were reclassified as chRCC, and a number of non-RCC samples were removed). The results over the testing sets were similar, as shown in fig. S12 (C and D), demonstrating that the small number of incorrect labels does not substantially affect the results. Carcinogenesis is invariably accompanied by perturbations in gene and epigenome regulation. Aberrant DNA methylation changes on promoters, enhancers, and gene bodies contribute to alteration of gene expression and consequently affect signaling, regulatory, and metabolic pathways. Renal carcinoma has been associated with MET ([151]46), Hippo ([152]47), Wnt ([153]48), MAPK ([154]49), NRF2 (nuclear factor erythroid 2-related factor 2)-ARE (antioxidant response element), phosphatidylinositol 3-kinase (PI3K)/AKT/mechanistic target of rapamycin (mTOR) ([155]50), metabolic, angiogenic, and immune checkpoint–associated pathways ([156]51). GO and pathway enrichment analysis of the MethylBoostER features mapped to genes shows a strong association with genes, processes, and pathways involved in carcinogenesis. In particular, oncogenes and tumor suppressors (COSMIC Cancer Gene Census), cancer-associated pathways [Kyoto Encyclopedia of Genes and Genomes (KEGG)], putative ccRCC driver genes, and RCC-associated genes were enriched. In addition, biological processes associated with metastasis (motility, cell adhesion, and axonal guidance) and EMT were also strongly enriched. Comparison with curated gene sets in metastasis [Human Cancer Metastasis Database (HCMDB)] and EMT (dbEMT2) databases further confirmed this. Other signatures such as neuronal and muscle development; Wnt, MAPK, and Ras signaling pathways; nucleoside metabolism; and T cell activity are also enriched. As expected, enrichment of both positive and negative regulation of transcription coincides with known consequences of DNA methylation. Furthermore, enrichment of TFs and epigenetic regulators further supported this strong transcriptional regulation signature. Genes associated with response to cortisol stimulus are enriched in our feature set. Serum cortisol levels have been found to be significantly higher in RCC and are associated with higher tumor grade ([157]52). The enrichment in our feature set suggests a possible epigenetic regulatory mechanism for this process. Last, we show that the CpG features selected by the models are enriched for cancer-related genes, and the features selected by all four models are enriched for well-known RCC genes. We use literature analysis to demonstrate that most classification features are associated with genes involved in kidney pathology, cancer, or carcinogenic processes, demonstrating the clear discernment in model-selected methylation patterns. Associations between methylation and gene expression are outside the scope of the present analysis, which aimed to demonstrate that the model selects salient features that are biologically relevant and easily interpretable. Several limitations of the current study must be addressed.
One of the main challenges in developing predictive methods pertains to lack of sample and clinical data, particularly for rare disease subtypes. We acknowledge that although the MethylBoostER model accurately predicts the most prevalent pathological subtypes of renal tumors, due to constraints in sample availability, the diagnosis of less prevalent, uncommon renal masses (such as angiomyolipoma, translocation RCC, collecting duct carcinoma, Fumarate Hydratase (FH)-deficient RCC, renal medullary carcinoma, and others) is not accounted for in our current analysis. Although the model achieves a high accuracy across all sample types, chRCC and oncocytoma remain the most difficult classes to distinguish. Future studies including larger sample sizes from these two classes may enable improved prediction accuracies. While the main aim of the analysis is to distinguish pathological subtypes of SRMs, because of insufficient sample sizes, we pooled data from both SRMs and larger tumors for the training/testing set and some of the validation cohorts. However, we validated the model on the Chopra dataset that consists exclusively of SRMs (245 samples) and demonstrated excellent accuracy and generalizability of results in smaller tumors and in low-purity samples (e.g., Evelönn and FNAs in Brennan). Future larger studies could include rarer subtypes (exploring the impact on the model output) and prioritize samples from patients with SRMs. In addition, there were some limitations to our benchmarking method. The models we compared to were trained on a different number of classes, so the accuracies were not directly comparable, and they were also trained on different training datasets. Thus, further benchmarks that solve these limitations would need to be carried out. Excluding TCGA, the other datasets included in our analysis provided methylation data alone. In the future, obtaining both methylation and transcriptomic data may enable head-to-head comparisons between MethylBoostER and existing models predicting pathological subtypes using transcriptomic data ([158]35, [159]36). Ultimately, an ideal model would predict patient outcome in addition to identifying the pathological subtype of the renal tumor and would take into account comorbidities and competing risks of death. Unfortunately, this was not possible in the present analysis due to lack of survival data for the available cohorts. Future work should focus on collecting comprehensive multimodal datasets including patient clinical parameters, imaging, pathology, and multiple omic data types (including both methylation and gene expression). Unique features of our model are its interpretability, ease of use in a clinical setting, and an approach that reduces the clinical importance of ITH. We envision that, in the future, following further model refinement (possibly through the use of larger, more heterogeneous, and balanced renal cancer datasets), validation, and prospective testing (using a well-designed clinical trial to test real-world performance), MethylBoostER could be integrated in clinical practice to improve the patient diagnostic pathway ([160]Fig. 8). Individuals with SRMs would undergo imaging-guided tumor biopsy, which would be processed for targeted DNA methylation analysis, and the MethylBoostER model would be used to predict pathological subtypes, serving as an adjunct to treatment decision-making. 
For high-confidence predictions, the output is limited to the most likely diagnosis, whereas for moderate-confidence predictions, the model will supply two class predictions and the clinician would be encouraged to integrate imaging and histological data. Such moderate-confidence predictions will provide useful additional information for clinicians to combine with other clinicopathological features, for instance, moderate prediction of “chRCC or ccRCC” in a fit patient with a larger tumor would be much more likely to result in active treatment than a moderate confidence prediction of “chRCC or oncocytoma” in a patient with a relatively small tumor and limited life expectancy. A strength of our work is the output of moderate- and high-confidence predictions, which overall increases model accuracy and empowers clinicians to make patient-centered decisions that take into account both clinical and molecular (DNA methylation) data. We acknowledge that, in a small number of individuals with SRMs, renal biopsy may be high risk or contraindicated due to tumor anatomy or patient factors; therefore, MethylBoostER may not aid diagnosis in this setting. However, there is an increasing use of diagnostic biopsies to minimize morbidity and mortality resulting from overtreatment of patients with benign lesions, supported by strong recommendations in recent clinical guidelines (European Association of Urology Renal Cell Carcinoma Guideline Panel 2022). Following further refinement, MethylBoostER could have a tangible future clinical application, although future studies will be needed to demonstrate the model’s clinical benefit above the existing standard of care diagnostic pathway (imaging with or without pathology) before adoption in the clinic. Fig. 8. Proposed future integration of MethylBoostER model into the existing diagnostic pathway for patients with SRMs. [161]Fig. 8. [162]Open in a new tab Following future model refinements and clinical trials, MethylBoostER could play a role in the diagnostic pathway. Here, we describe the potential clinical utility. Patients would have an image-guided renal biopsy, and biopsy samples would undergo DNA methylation analysis. MethylBoostER results would be interpreted in the context of integration with clinical and imaging data. For high-confidence predictions, MethylBoostER would predict one class, where benign oncocytoma and malignant RCC would likely be managed with active surveillance and active treatment, respectively. In moderate-confidence predictions, the two classes with the highest probabilities would be predicted. Samples with low purity or cases in which MethylBoostER predicts normal kidney (suggesting that the target lesion was missed) would prompt repeat biopsy. In summary, we develop MethylBoostER, a machine learning model that predicts pathological subtypes of benign and malignant renal tumors. The relatively modest number of patients diagnosed with certain tumor subtypes (especially chRCC and oncocytoma) and limited access to patient samples can hamper the application of machine learning–based approaches. Future studies should aim to obtain larger sample sizes and focus on multimodal integration with imaging, patient characteristics, epigenetic data, and clinical outcomes. MATERIALS AND METHODS Cambridge samples Patients with benign and malignant renal tumors undergoing curative or cytoreductive nephrectomy at Addenbrooke’s Hospital were recruited to the DIAMOND study. 
Ethics approval and patient consent were obtained (Research Ethics Committee, REC ID 03/018). Fresh-frozen tissue was stored at −80°C. Where available, multiregion tumor samples were taken along with adjacent normal kidney tissue. DNA was extracted from a small section of frozen tissue, using the commercially available AllPrep DNA/RNA Mini Kit (QIAGEN) according to the manufacturer’s protocol. DNA samples (10 ng/μl, 500 ng total) were sheared using the S220 Focused-ultrasonicator (Covaris) to generate double-stranded DNA (dsDNA) fragments. The D1000 ScreenTape System (Agilent) was used to ensure that >60% of DNA fragments were between 100 and 300 bp long, with a mean fragment size of 180 to 200 bp. Tissue methylation analysis was performed using the TruSeq Methyl Capture EPIC Library Preparation Kit (Illumina), using the manufacturer’s protocol (a.k.a, or also known as TruSeq EPIC). This consists of a capture-based method targeting approximately 3 million CpGs. Four samples are multiplexed in each capture reaction using sample indexing adaptors. The protocol involves hybridization of biotin-tagged probes to genomic DNA followed by capture using streptavidin beads (two hybridization-capture steps) and bisulfite conversion at 54°C for 2 hours. Following polymerase chain reaction (PCR) amplification, uracils are copied as thymines, with resulting libraries consisting of two families of dsDNA molecules (originating from Watson and Crick strands), with a high thymine-to-cytosine ratio. Twelve samples were pooled for sequencing on the HiSeq4000 Illumina Sequencing platform (single-end 150-bp read) using two lanes per library pool. Technical replicates were performed for cell line data to assess assay reproducibility (correlation = 0.97). Sequenced data were trimmed (TrimGalore v0.4.4) and aligned to the bisulfite-converted human reference genome (GRCh38/hg38). Methylation calling was performed using the Bismark suite of tools (v0.22.1). Trimming and alignment reports were compiled using MultiQC (v1.7). Data were included in downstream analysis if a depth of greater than 10× coverage was achieved to reduce the risk of false-positive calling. In addition, samples were removed if non-CpG methylation was greater than 1%. Publicly available data TCGA data for ccRCC (KIRC), pRCC (KIRP), and chRCC (KICH) were obtained via the TCGAbiolinks package in R ([163]53). Only CpG probes with less than 5% missing data were kept. The Chopra datasets, one for training and one for testing, were downloaded from the Open Science Framework, repository OSF.IO/Y8BH2 ([164]18), and the Acute Myeloid Leukemia (AML) samples were excluded. The Wei dataset ([165]19) and Evelönn dataset ([166]21) were obtained from the Gene Expression Omnibus (GEO), with GEO accession numbers [167]GSE61441 and [168]GSE113501, respectively. The Brennan dataset was obtained directly from the study authors ([169]20). In all cases, methylation data were evaluated using the Illumina Infinium Human DNA Methylation 450 platform (also known as 450k array). Preprocessing Training/testing dataset The training/testing dataset consisted of the TCGA data, the Chopra training data, and the Cambridge samples. The following preprocessing steps were performed on this training/testing set. CpG probes found in two blacklists ([170]54, [171]55) for the 450k array were removed, as well as probes at the site of C/T and G/A single-nucleotide polymorphisms (SNPs), probes that map to multiple regions, and probes at repeat regions. 
In addition, CpG probes located on the sex chromosomes were omitted to remove gender bias. The 450k array (Chopra training samples and TCGA samples) and TruSeq EPIC (Cambridge data) cover different CpG sites. To combine data from the two methods, TruSeq EPIC beta values within 50 bp of the 450K probes were averaged, as adjacent CpGs tend to be co-methylated ([172]56). Subsequently, probes that were missing for the whole of one dataset (TCGA, Chopra training data, or Cambridge) were removed to avoid dataset-specific bias. This resulted in a dataset with 158,670 CpG probes. Last, the beta values were converted to M values, as M values have been shown to be more homoscedastic ([173]57). They are computed using the following equation: $M = \log_2\left(\frac{\beta}{1 - \beta}\right)$. Where M values were calculated to be infinity (due to β = 1), they were set to the maximum finite value of the training/testing data, and where M values were calculated to be minus infinity (due to β = 0), they were set to the minimum finite value of the training/test data. External datasets The external datasets used were the Chopra validation data (which is independent of the Chopra training data), the Brennan dataset, the Wei dataset, and the Evelönn dataset. The same 158,670 probes in the preprocessed training/testing dataset were selected for all four external datasets. Data visualization Dimensionality reduction was carried out using umap-learn Python package (version 0.3.10) ([174]58). Not a Number (NaN) values were converted to the value 0 before the transformation. MethylBoostER: The XGBoost classification model A multiclass XGBoost classifier model was constructed using the xgboost Python package (version 0.90) ([175]59). Fourfold nested cross-validation was implemented on the training/testing set, with integrated hyperparameter optimization. Fourfold nested cross-validation The training/testing set was split up into four stratified outer folds so that each fold keeps 25% of the dataset for testing. This was split so that multiple samples from the same patient were all in the same fold to avoid data leakage between folds. These folds were iterated through, treating each one as the testing set in each iteration. Within these iterations, the remaining dataset was treated as a training/validation set. Inner rounds of fourfold cross-validation were performed on this training/validation set to find the optimal hyperparameters. For each inner fold iteration, a model was trained on the training set (75% of the training/validation set) and validated on the validation set (25% of the training/validation set). These inner rounds were repeated for various sets of hyperparameters [the hyperparameters tested were dictated by the Tree of Parzen Estimators algorithm in the hyperopt package ([176]60)]. The hyperparameters that resulted in the best average MCC score on the validation sets (averaged across all four inner folds) were selected, and a model was trained with these parameters on the whole training/validation set. This resulted in a trained model with optimal hyperparameters for each outer fold, and the performance of these models on their fold's testing set was reported. This method of cross-validation ensured that the models had the optimal hyperparameters, they were evaluated on completely unseen test data (specifically data that were not used to select the hyperparameters to avoid inflated scores), and the performance on the whole training/testing dataset was reported.
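The nested, patient-grouped cross-validation can be sketched with standard tooling. In the sketch below, scikit-learn's StratifiedGroupKFold stands in for the authors' custom stratified, patient-grouped splitting, and the hyperopt search is reduced to a two-value grid over tree depth for brevity; it is a schematic of the procedure under those assumptions, not the actual implementation:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import matthews_corrcoef

# Toy stand-ins for the methylation matrix, subtype labels, and patient IDs.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 5, size=200)                 # five classes
patients = rng.integers(0, 60, size=200)         # several samples can share a patient

outer = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=0)
outer_scores = []
for train_idx, test_idx in outer.split(X, y, groups=patients):
    # Inner loop (hyperopt TPE search in the paper) reduced to a tiny grid here.
    best_depth, best_mcc = None, -1.0
    inner = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=1)
    for depth in (2, 3):
        mccs = []
        for tr, va in inner.split(X[train_idx], y[train_idx], groups=patients[train_idx]):
            model = xgb.XGBClassifier(n_estimators=50, max_depth=depth)
            model.fit(X[train_idx][tr], y[train_idx][tr])
            mccs.append(matthews_corrcoef(y[train_idx][va],
                                          model.predict(X[train_idx][va])))
        if np.mean(mccs) > best_mcc:
            best_depth, best_mcc = depth, float(np.mean(mccs))
    # Refit on the whole outer-training split with the selected depth,
    # then evaluate on the completely held-out outer test fold.
    final = xgb.XGBClassifier(n_estimators=50, max_depth=best_depth)
    final.fit(X[train_idx], y[train_idx])
    outer_scores.append(matthews_corrcoef(y[test_idx], final.predict(X[test_idx])))

print("Outer-fold MCC scores:", np.round(outer_scores, 3))
```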
Hyperparameters and training details Two parameters were manually set, namely, subsample = 0.5 and colsample_bytree = 0.5, to help reduce overfitting and run time ([177]59). Many of the remaining parameters were found using hyperparameter optimization implemented in the hyperopt Python package (version 0.2.5) ([178]60). The maximum number of evaluations was set to 10 and parallelism was set to 2 to reduce the run time. The search spaces for each parameter were as follows: number of trees: 50 to 500, learning rate: 0.05 to 0.5 (sampled from a log uniform distribution), max tree depth: 2 or 3 (intentionally set very low to help reduce overfitting; see fig. S13 for justification of this decision), L1 regularization term: 0 to 1, and L2 regularization term: 0 to 1. During model training, Gaussian noise with mean 0 and variance 0.2 was added into the training data to help reduce overfitting. Early stopping was also applied to reduce overfitting, with the number of early stopping rounds set to 5 and multiclass log loss as the evaluation metric used. Samples were weighted both to avoid patient bias and to mitigate class bias. Balanced class weights were calculated using the compute_class_weight function in the sklearn package ([179]61), which ensures that samples in less common classes are weighted higher. The same function was used to generate balanced patient weights so that samples from patients with many samples were weighted lower. The two weights were multiplied together to get a weight for each sample to be used during training. Evaluation metrics Accuracy, precision, recall, F1, and the MCC were used to evaluate the models' performance. F1 is the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. The MCC was also reported as it is a performance measure that is not affected by large class imbalances, unlike accuracy. ROC curves and PR curves were also plotted, with each class plotted separately (i.e., using the one-versus-all strategy). ROC curves show the false-positive rate and the true-positive rate over all possible values of the classification threshold. PR curves show the precision and recall over all possible values of the classification threshold. The AUC was also reported for both curves, with 1 as the best possible AUC score. The ROC AUC score is calculated from the model's class-wise output probabilities, and it is a one-versus-all metric, so it is an easier task than multiclass classification. The ROC AUC score is affected by the class imbalance and can overestimate performance. We therefore report a wide range of metrics and performance indicators ([180]Figs. 2 and [181]4; figs. S1, S4, and S5; and tables S1 to S3). Hence, while interpreting the ROC AUC score, it is important to consider the classification performance demonstrated within the confusion matrices, as well as other metrics (such as PR AUC, accuracy, MCC, precision, recall, and F1). High- and moderate-confidence predictions The model's output probability was used as a confidence measure, and the predictions were split into two categories: high confidence, where the output probability was larger than a specified threshold, t, and moderate confidence, where the output probability was less than t. For moderate-confidence predictions, we outputted the model's top two predictions (i.e., the two classes with the two highest output probabilities). The parameter t was chosen on the basis of the testing set.
Three metrics were plotted for values of t between 0.5 and 1 (in increments of 0.025): the fraction of high-confidence predictions, the accuracy of the high-confidence predictions, and the accuracy of the moderate-confidence predictions, where a prediction was treated as correct if the correct class was among the model's top two predictions. To remove noise in these accuracy scores, simple linear models were fitted (i.e., approximated accuracy = a·t + b), and the approximated accuracy scores were used to estimate t. For each value of t, these three metrics were averaged, and the value of t with the largest average was selected. This gave t = 0.85, which was then validated on the external validation datasets.

Benchmarking

To benchmark MethylBoostER against other similar models, we compared the results reported by Chopra et al. ([182]18) and Brennan et al. ([183]20) on the Chopra validation dataset (this was possible because all models were validated on this dataset). The MethylBoostER high/mod model used our high- and moderate-confidence prediction method; when this model made a moderate-confidence prediction, we counted the prediction as correct if either the first or the second prediction was correct. Note that in table S4, the value for the Chopra model for the normal class is 100/99 (i.e., either 100 or 99 correct predictions). This is because Chopra et al. report 100 correct predictions and 2 incorrect predictions for the normal class ([184]18); however, the Chopra validation dataset contains only 101 normal samples.

Tumor purity analysis

Tumor purity was estimated for each sample from methylation beta values using the InfiniumPurify package in R ([185]27), as previously described ([186]20). Purity in correctly and incorrectly predicted samples was compared using the Wilcoxon rank sum test with correction for multiple testing.

Feature analysis

MethylBoostER feature importance and annotation

Feature importance values were obtained from the trained XGBoost models using the gain measure of feature importance. All features with an importance greater than zero were selected. Each selected CpG feature was mapped to a gene if it was located within a gene or within 1500 bp upstream of a gene.

Genomic location of features

The R package ChIPseeker (version 1.30.2) ([187]62) was used for genomic location annotation and visualization, with the R package EnsDb.Hsapiens.v86 (version 2.99.0) ([188]63) as the annotation source. The background set used was the total set of input features. For significance testing, the GAT tool ([189]64) was used together with annotations extracted from EnsDb.Hsapiens.v86 by ChIPseeker (specifically, the detailGenomicAnnotation field).

Overrepresentation analysis

Gene Ontology (GO) enrichment analysis was carried out using the Biological Networks Gene Ontology (BINGO v3.0.3) package ([190]65), and network visualization was carried out using Cytoscape (v3.9.0) ([191]66). Pathway enrichment analysis was carried out using the Panther analysis tool (release 16) ([192]67) with the Panther and Reactome pathway collections. Statistical overrepresentation was assessed using Fisher's exact test with false discovery rate (FDR) correction.
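A minimal sketch of a Fisher-based overrepresentation test of the kind described above, which also applies to the gene-list tests described next (not the authors' code; the helper name and toy inputs are illustrative, and Benjamini-Hochberg correction via statsmodels is used here as one common FDR option rather than the authors' exact implementation):

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

def overrepresentation(selected_genes, gene_lists, background_genes):
    """One-sided Fisher's exact test per gene list, with Benjamini-Hochberg (FDR) correction.

    selected_genes: genes mapped from the model's important features
    gene_lists: dict of list name -> set of genes (e.g., COSMIC, dbEMT2, ...)
    background_genes: all input features mapped to genes (the background set)
    """
    background = set(background_genes)
    selected = set(selected_genes) & background
    names, pvals = [], []
    for name, genes in gene_lists.items():
        genes = set(genes) & background
        a = len(selected & genes)               # selected and in the list
        b = len(selected - genes)               # selected, not in the list
        c = len(genes - selected)               # in the list, not selected
        d = len(background - selected - genes)  # neither
        _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        names.append(name)
        pvals.append(p)
    _, qvals, _, _ = multipletests(pvals, method="fdr_bh")
    return dict(zip(names, qvals))

Each 2×2 table counts, relative to the background of all feature-mapped genes, how the model-derived gene set overlaps a curated gene list.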
Overrepresentation analysis was also performed on eight gene lists: dbEMT2 ([193]32), the COSMIC Cancer Gene Census ([194]33), a list of putative driver genes for ccRCC ([195]23), pRCC and RCC genes from the Diseases database ([196]30), TF Checkpoint ([197]28) for transcription factors, EpiFactors ([198]29) for epigenetic regulators, and HCMDB ([199]31) for metastasis-associated genes. Enrichment of gene sets was determined using Fisher's exact test with FDR correction, implemented with the Python package SciPy ([200]68). The background set used in overrepresentation analysis was the total set of input features mapped to genes; Ensembl gene IDs were used where specified (otherwise, gene symbols were used). For the localized region GO enrichment analysis, the R package rGREAT (version 1.26.0) was used, with the total set of input features as the background. The significant (adjusted P ≤ 0.05) biological process terms that best summarized the results (with redundant terms manually excluded) were visualized.

Literature mining entity relationships

The Chilibot text mining web application ([201]69) was manually searched with the 45 genes common to all four models together with the terms "kidney," "renal," "RCC," and "cancer." Thirty million PubMed abstracts were also searched for entity relationships using the Pangaea NLP Python package ([202]70) with a renal carcinoma–associated relation term list and a collection of HUGO Gene Nomenclature Committee gene symbols (4217 genes) derived from MethylBoostER's features mapped to genes (5012 genes). Entity relationship terms with frequencies ≥3 were used to build a gene relationship network using Cytoscape (v3.9.0) ([203]66).

Extracting subtype-specific features

In the XGBoost implementation that we used ([204]59), the ith tree focuses on the (i mod c)th class, where c is the total number of classes (5 in our case). Thus, for each class, we took all trees focusing on that class and extracted the features from them. The UpSet plot was created using the upsetplot Python package (version 0.6.0) ([205]71).

Acknowledgments