Abstract
Current gold standard diagnostic strategies are unable to accurately
differentiate malignant from benign small renal masses preoperatively;
consequently, 20% of patients undergo unnecessary surgery. Devising a
more confident presurgical diagnosis is key to improving treatment
decision-making. We therefore developed MethylBoostER, a machine
learning model leveraging DNA methylation data from 1228 tissue
samples, to classify pathological subtypes of renal tumors (benign
oncocytoma, clear cell, papillary, and chromophobe RCC) and normal
kidney. The prediction accuracy in the testing set was 0.960, with
class-wise ROC AUCs >0.988 for all classes. External validation was
performed on >500 samples from four independent datasets, achieving
AUCs >0.89 for all classes and average accuracies of 0.824, 0.703,
0.875, and 0.894 for the four datasets. Furthermore, consistent
classification of multiregion samples (N = 185) from the same patient
demonstrates that methylation heterogeneity does not limit model
applicability. Following further clinical studies, MethylBoostER could
facilitate a more confident presurgical diagnosis to guide treatment
decision-making in the future.
__________________________________________________________________
The MethylBoostER machine learning model improves the diagnosis of
kidney tumor subtypes by utilizing DNA methylation patterns.
INTRODUCTION
Renal cell carcinoma (RCC) is among the top 10 most common cancers
worldwide. Incidence rates have increased by 47% in the past 10 years,
making it one of the fastest accelerating cancers, with rates projected
to rise even further in the future ([54]1). This is attributed to the
increasing prevalence of known risk factors and increased incidental
detection, secondary to widespread use of imaging for unrelated
symptoms ([55]2). Small renal masses (SRMs), defined as <4 cm in size,
are therefore increasingly detected and encompass a mixture of
potential diagnoses that comprise dozens of histological and molecular
tumor subtypes (3, 4). The major subtypes include clear cell RCC (ccRCC),
papillary RCC (pRCC), chromophobe RCC (chRCC), and benign oncocytoma.
Differentiating malignant from benign solid SRMs represents
a challenge to both imaging and renal biopsy. Renal biopsy may be
nondiagnostic due to small tumor size, challenging anatomy, low biopsy
tumor content, or tumor heterogeneity ([58]5–[59]7). As a result of no
biopsy being undertaken or the inability to conclude a histological
diagnosis based on biopsy samples, approximately 20% of SRMs removed at
surgery are found to be benign postoperatively ([60]8). Consequently, a
substantial number of patients undergo unnecessary surgery for a benign
condition, with associated postoperative risks of morbidity and
mortality and long-term impact on renal function. The rates of
postoperative complications following minimally invasive surgery are
blood transfusion (5%), reoperation (2 to 5%), respiratory
complications (1 to 7%), and death (4%) ([61]9). Conversely, a
meta-analysis demonstrated that one in four renal biopsies reported as
oncocytoma is found to be malignant RCC following surgical excision
([62]5), risking false negatives, late diagnosis, and deleterious
outcomes. Improved diagnosis and differentiation of SRM has been
identified as a key research focus in RCC by an international research
priority setting initiative ([63]10). Each of these pathological
subtypes has characteristic genetic and molecular features such that it
is argued that RCC is not a single disease ([64]11). ccRCC and pRCC are
derived from the proximal convoluted tubule in the renal cortex,
whereas chRCC and oncocytoma are derived from the distal nephron
(12, 13). DNA methylation changes are abundant, genome-wide early events
in renal tumorigenesis and are specific to the cell of origin, making them
an ideal diagnostic target in this setting (14–17). Analysis of DNA
methylation markers therefore has the
potential to support and improve the current diagnostic pathway for
renal cancer, providing a much more confident presurgical diagnosis to
guide treatment decision-making.
In this study, we propose MethylBoostER (Methylation and XGBoost for
Evaluation of Renal tumors). MethylBoostER is a machine learning model
based on the XGBoost (eXtreme Gradient Boosting) library that
differentiates pathological subtypes of renal tumors, using DNA
methylation markers identified in large tissue datasets. As seen in
[69]Fig. 1, which shows the overview of our approach, we externally
validate MethylBoostER on four independent datasets and evaluate
multiregion samples from the same patient to assess the role of
intratumoral heterogeneity (ITH) on MethylBoostER’s performance. In
addition, we externally validate our results on ex vivo core biopsy
tissue samples (which simulate in vivo patient biopsies), fine needle
aspirates (FNAs), and low-purity samples representing real-world
applications, and we evaluate the effect of tumor purity on MethylBoostER’s
performance to assess clinical utility. MethylBoostER is trained on the
largest DNA methylation cohort of renal tumor samples to date and is
extensively validated, demonstrating accurate predictions across
international cohorts and potential future clinical utility.
Fig. 1. Overview of MethylBoostER.
Three DNA methylation datasets are used to train and test the XGBoost
classification model. The model is then validated on four external
datasets. The high- and moderate-confidence predictions from the model
output are used for improving diagnostic decisions. Model performance
on both multiregion samples and sample purity was assessed.
RESULTS
DNA methylation sequencing in a cohort of patients with renal tumors
DNA methylation data were obtained for 1228 samples from three data
sources, namely, Cambridge samples (N = 319), The Cancer Genome Atlas
(TCGA) (N = 872), and the Chopra training set (N = 37) ([72]Fig. 2, A
and B) ([73]18). The former dataset contained information on 3 million
CpGs measured using TruSeq EPIC, while the latter two were obtained via
the 450K Illumina array. The merged datasets were obtained by
overlapping CpGs within ±50 base pairs (bp) of Illumina probes and
overall contained data on 158,670 probes. This dataset (N = 1228) was
split into training and testing sets (via fourfold cross-validation;
see Materials and Methods), with subsequent validation being performed
on four additional independent datasets [Chopra validation ([74]18),
Wei ([75]19), Brennan ([76]20), and Evelönn ([77]21)], and will be
referred to as such hereafter.
Fig. 2. Data characteristics and testing set performance.
(A) Number of samples in each class used for the training/testing sets.
(B) Uniform Manifold Approximation and Projection (UMAP) representation
of the training/test dataset, using all input features. (C) Confusion
matrix displaying the testing set performance, with precision and
recall bars. (D) UMAP representation of the training/test dataset,
using the input features learnt by the XGBoost model. (E) ROC curves
over the testing set, split by class.
MethylBoostER accurately distinguishes pathological subtypes of renal tumors
We trained multiclass XGBoost machine learning models to classify four
pathological subtypes of renal tumors (ccRCC, chRCC, oncocytoma, and
pRCC) and normal tissue. Fourfold nested cross-validation was used [a
method to randomly split the dataset into training (75%) and testing
(25%) sets over four iterations] so that the testing set performance
could be evaluated over the whole training/testing set. Consequently,
this results in four trained models. We display summary results for the
four models (individual results displayed in figs. S1, S2, S4, S5, and
S7).
The performance over the training/testing set was high, with an average
accuracy of 0.960. Table S1 lists additional performance metrics,
including metrics for each renal tumor subtype (henceforth called
“class”), which shows that the models could distinguish the normal
class perfectly, and the chRCC class was the most difficult to classify
correctly. [80]Figure 2C shows the confusion matrix, which depicts how
many samples, of a given true class, were predicted as each class. Most
samples (1179 of 1228; 96%) are on the diagonal, indicating that they
were correctly predicted. Of the 1228 samples evaluated, common
misclassifications were in predicting ccRCC as pRCC (13 of 1228
samples; 1.1%) and predicting pRCC as ccRCC (9 of 1228; 0.73%) or chRCC
(8 of 1228; 0.65%) samples. Although three oncocytoma samples were
predicted as malignant RCC, no malignant samples were classed as benign
oncocytoma or normal tissue.
A large motivating factor for using XGBoost models is their
interpretability—we can examine which input features were learnt by the
models when classifying the five classes. We selected only these
learned features and used them to visualize the training/testing set,
as shown in [81]Fig. 2D (and fig. S2). This shows that although there
are dataset-specific clusters within each class, the samples now form
class-wise clusters. This suggests that the features that the models
learn are not dataset specific but are features that distinguish each
class in all three datasets. [82]Figure 2E (and fig. S1) shows the
receiver operating characteristic (ROC) curves [and precision-recall
(PR) curves] for each class, again indicating that the models achieved
a high testing set performance.
High- and moderate-confidence predictions make model outputs clinically more
informative
Given a sample and a trained XGBoost model, we can obtain the model’s
predicted probabilities for that sample belonging to each class (renal
tumor subtype). This reveals the confidence of the model’s prediction.
[83]Figure 3A shows the predicted probabilities of the predicted class
over the testing sets, which indicates that while most predictions are
certain, some are less certain (13.2% of samples were predicted with a
probability of <0.95), with probabilities as low as 0.256.
Fig. 3. High- and moderate-confidence predictions.
(A) Histogram of the model’s probabilities of the predicted class for
the testing sets. (B) Line plot showing how the testing set accuracy
scores and fraction of high-confidence predictions vary as the
threshold changes. The vertical dotted line indicates the chosen
threshold, 0.85. (C) Graphical overview of the prediction process with
high- and moderate-confidence predictions.
We used this information to separate the models’ predictions into two
categories: high-confidence and moderate-confidence predictions.
High-confidence predictions are where the model’s confidence is greater
than a threshold, and we output a single answer—the predicted class.
Moderate-confidence predictions are where the model’s confidence is
lower than the threshold, and we output two answers—the predicted class
(the first most likely prediction) and the second most likely
prediction. We refer to these two outputs as first prediction and
second prediction, respectively. The threshold was set to 0.85 (see
[86]Fig. 3B), which maximizes the accuracy of both high- and
moderate-confidence predictions and the fraction of high-confidence
predictions over the testing set. In clinical practice in the case of
moderate predictions, the clinician could consider the two predicted
classes and integrate this information with clinical, histopathology,
and imaging features to conclude the most likely diagnosis (see
[87]Fig. 3C). The average accuracy of the high-confidence predictions
over the testing sets was 0.982. The average accuracy of the
moderate-confidence predictions over the testing sets was 0.668
(considering only the first prediction) but was 0.871 when the
prediction was treated as correct if the first or second prediction was
correct. See fig. S3 for the testing set confusion matrix split by
high- and moderate-confidence predictions.
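As a minimal sketch of this scheme (not the original implementation; the function and variable names below are illustrative only), the class probabilities returned by a trained XGBoost classifier can be split into high- and moderate-confidence outputs as follows:

```python
import numpy as np

THRESHOLD = 0.85  # value chosen on the testing set (see Materials and Methods)

def confident_predictions(probs, class_names):
    """Split per-sample class probabilities into high- and moderate-confidence calls.

    probs: array of shape (n_samples, n_classes), e.g. from model.predict_proba(X).
    Returns a list of (confidence, predictions) tuples, where predictions holds the
    single predicted class (high confidence) or the top two classes (moderate confidence).
    """
    results = []
    for p in probs:
        order = np.argsort(p)[::-1]              # classes sorted by decreasing probability
        if p[order[0]] >= THRESHOLD:             # high confidence: report one class
            results.append(("high", [class_names[order[0]]]))
        else:                                    # moderate confidence: report the top two classes
            results.append(("moderate", [class_names[order[0]], class_names[order[1]]]))
    return results
```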
External validation on four independent datasets verifies the
generalizability of MethylBoostER to new data
We externally validated MethylBoostER on four independent datasets,
namely, Chopra validation (N = 245), Brennan (N = 37), Wei (N = 92),
and Evelönn (N = 144) ([88]18–[89]21). They each contain different
numbers of samples and a different class distribution, as shown in
[90]Fig. 4A. The normal and ccRCC classes are the most frequent, which
reflects clinical prevalence. We evaluated our models on these datasets
(without the high- and moderate-confidence predictions method, so just
taking the first prediction for all samples), and the average accuracy
for Chopra validation, Brennan, Wei, and Evelönn was 0.824, 0.703,
0.875, and 0.894, respectively (see table S2 for additional metrics and
fig. S4A for confusion matrices). The models had lower performance for
chRCC and oncocytoma; oncocytoma samples were frequently predicted as
normal and chRCC. In Chopra validation, the oncocytomas predicted
incorrectly on both the first and second prediction were predicted as
normal (11 samples), chRCC (4 samples), and pRCC (2 samples). In
Brennan, the incorrect oncocytomas (on both first and second
predictions) were predicted as normal (one sample) and chRCC (five
samples). Combining all external validation sets, 7.89% (3 of 38) of
oncocytomas were correctly classified as oncocytoma, 34.2% (13 of 38)
were predicted as normal tissue, and 57.9% (22 of 38) were classified
as malignant. Of the malignant samples, 94.4% (289 of 306) were
classified correctly as malignant, 5.56% (17 of 306) were predicted as
normal tissue, and 0% (0 of 306) was classified as oncocytoma (full
details in confusion matrices in fig. S4). Although some oncocytoma
samples were classified as malignant disease, no malignant samples were
classified as benign oncocytoma (Fig. 4). Given current clinical practice
(which errs on the side of caution and removes SRMs, some of which are
later found to be benign), it would be acceptable to remove an oncocytoma
in case it might be malignant, but it would not be acceptable to
misclassify a malignant mass as benign, as this would risk the cancer
spreading. Malignant tumor samples predicted as normal
tissue may be due to low tumor purity (see the next section). The
relatively lower performance in oncocytoma and chRCC may be due to the
low proportion of these classes in the training set, reflecting a
real-world challenge, as these two are the least common pathological
subtypes. In addition, previous reports suggest that methylation in
these two classes is most similar, in keeping with their common
cellular origin ([92]13). We found that incorporating high- and
moderate-confidence predictions leads to improved results. [93]Figure
4B and table S3 show the accuracy and Matthews correlation coefficient
(MCC) scores for the high- and moderate-confidence predictions. As seen
in [94]Fig. 4B, the moderate-confidence first predictions have an
accuracy of <0.65 for all four datasets, and the second predictions
have an accuracy of <0.45 for all four datasets. When we combine them,
we see that the first or second predictions reach an accuracy of >0.7.
As expected, the accuracy of the high-confidence predictions is higher
and is >0.9 for all datasets.
Fig. 4. External validation on four independent datasets.
(A) Number of samples in each class for each dataset. (B) Accuracy for
high- and moderate-confidence predictions for each external dataset.
“First or second prediction” indicates that a prediction is treated as
correct if its first or second prediction was correct. (C to F)
Confusion matrices for both high- and moderate-confidence predictions
and ROC curves, split by class, for each external dataset. For the
moderate-confidence confusion matrices, the x axis is split into first
prediction was correct, the second prediction was correct, and both
first and second predictions were incorrect.
Many moderate-confidence predictions are correct on the second
prediction, as shown in [97]Fig. 4 (C to F) (and fig. S4B for the other
models). For example, 16 samples in the Chopra validation dataset had
an incorrect first prediction but correct second prediction (see
[98]Fig. 4C). These samples would have simply been incorrectly
predicted without the use of this high- and moderate-confidence
prediction method. There is also a higher fraction of
moderate-confidence predictions for the chRCC and oncocytoma classes
(fig. S4C), reflecting the difficulty of predicting these classes. We
also show that all four datasets achieved high class-wise ROC areas
under the curve (AUCs), as shown in [99]Fig. 4 (C to F) (see fig. S5
for the other models’ ROC curves and all PR curves).
We benchmarked MethylBoostER to the models introduced by Chopra et al.
([100]18) and Brennan et al. ([101]20) by comparing classification
results over the Chopra validation dataset. We include MethylBoostER’s
results both with and without high- and moderate-confidence prediction
results. We also include Chopra’s results when excluding their benign
class, as MethylBoostER or Brennan does not have results for this
class. The overall accuracies were 0.846 (Chopra), 0.890 (Chopra excl.
benign), 0.872 (Brennan), 0.824 (MethylBoostER), and 0.878
(MethylBoostER high/mod). The number of correct predictions per class
is shown in table S4, which shows that Chopra and MethylBoostER
high/mod both achieve the highest performance for the normal and chRCC
classes. Chopra achieves the highest performance for oncocytoma, and
MethylBoostER high/mod achieves the highest performance for the
malignant (and most clinically important) ccRCC and pRCC classes. This
shows that MethylBoostER high/mod is competitive among similar
classification models, showing the joint best performance for chRCC and
the best performance for malignant ccRCC and pRCC subtypes. Because of
the molecular similarity of the chRCC and oncocytoma subtypes and more
frequent misclassification of these two classes in the combined
multiclass model, we explored the utility of a binary classifier by
training XGBoost models with only chRCC and oncocytoma classes (using
the same method as described in Materials and Methods). This approach
was hampered by limited sample numbers, with a total of 93 chRCC and 58
oncocytoma samples in the testing set (fig. S6A) and 14 chRCC and 38
oncocytoma samples in the external validation sets (fig. S6B). The
models achieved high performance on the testing set (see fig. S6C),
with an average accuracy of 0.968, MCC of 0.932, and ROC AUC of 0.9959.
In the limited cohort available for external validation datasets, all
models predicted every sample as chRCC (fig. S6, D and E),
demonstrating that the binary models for these two molecularly similar
RCC subtypes could not be generalized and added little value over the
multiclass MethylBoostER model. However, the ROC and PR curves
highlight clear evidence of subtype-specific signals with these DNA
methylation models (fig. S6, F to I), reaching an ROC AUC of 0.948 for
the Brennan dataset. This gives a similar picture to ROC and PR curves
for the multiclass models (fig. S5) and indicates that the predicted
probabilities can differentiate the two classes, but that the model’s
classification threshold does not generalize in the validation
datasets, possibly because of the limited datasets available for these
RCC subtypes. The underlying prediction scores are visualized in fig.
S6 (J and K), which show that the probabilities for oncocytoma samples
are consistently lower than chRCC samples; however, they are not lower
than 0.5 (the current model’s classification threshold). Overall, we
found that a separate binary classifier did not add value over the
multiclass MethylBoostER model, possibly because of the limited
available data for these subtypes (see Discussion for suggested future
studies that could overcome these limitations).
MethylBoostER provides consistent classification of multiregion samples from
the same patient
ccRCC is characterized by a high degree of genetic ITH, with over 60%
of somatic mutations not detectable across all multiregion samples
([102]22). On average, seven multiregion samples are needed to detect
over 75% of genetic variants ([103]23). Very little is known about
methylation variation in ccRCC, although a handful of reports suggests
relative homogeneity ([104]21, [105]24–[106]26), therefore highlighting
the potential relevance of diagnostic methylation markers. In the
Cambridge dataset, multiregion samples were available (N = 168) for 25
patients (18 ccRCC, 4 chRCC, 2 oncocytoma, and 1 pRCC). In 92% (23 of
25) of patients, all multiregion samples were predicted consistently as
being from the same pathological subtype, with 88% (22 of 25) achieving
correct classification for all samples (see [107]Fig. 5A). Multiregion
samples (N = 17) were also available for six patients with ccRCC from
the Evelönn external validation dataset ([108]21). As shown in
[109]Fig. 5B, 83% (five of six) of patients had a concordant prediction
for all multiregion samples derived from the same patient, although the
model struggles to differentiate chRCC from ccRCC. We noted that
incorrectly predicted samples had a very low tumor purity. When the
Evelönn dataset is visualized using dimensionality reduction (see fig.
S7), almost all the incorrectly predicted samples (including all the
incorrectly predicted multiregion samples) can be seen to cluster away
from the rest of the ccRCC samples, indicating that these samples have
a different methylation pattern (potentially related to their lower
tumor purity, as shown in fig. S7B).
Fig. 5. Classification of multiregion samples.
Diagram visualizing the model’s predictions of multiregion samples for
each patient in the Cambridge and Evelönn datasets.
Impact of tumor purity on MethylBoostER
Tumor purity was calculated using methylation beta values and
InfiniumPurify, as described ([112]20, [113]27). This method determines
tumor purity in the context of contamination with a normal (nontumor)
kidney sample. As expected, median sample purity was lower for the
Chopra and Brennan datasets compared to TCGA and Cambridge, as the
former include ex vivo core biopsy samples (i.e., simulated biopsies
using nephrectomy specimens) and FNAs, respectively. Of all the
datasets, the Evelönn data appear to have the lowest tumor purity
(median purity: 0.88 Cambridge, 0.84 TCGA, 0.80 Chopra training, 0.58
Chopra validation, 0.73 Brennan, 0.81 Wei, and 0.48 Evelönn).
[114]Figure 6A summarizes purity for samples that were predicted
correctly in the first prediction, second prediction, and incorrectly
predicted samples (results for individual datasets are shown in fig.
S8). Median purity in samples that were incorrectly predicted was
significantly lower than samples correctly predicted on the first or
second prediction (0.27 versus 0.82 versus 0.44, P < 0.01 for all
comparisons; [115]Fig. 6A). We noted that three of the samples that
were misclassified by our model were also incorrectly classified by
Brennan et al. ([116]20) and that the study authors postulated that
this was due to low sample purity. [117]Figure 6 (B and C) summarizes
the purity and prediction probability (for the first prediction),
highlighting incorrectly predicted samples. We found a correlation
between purity and the probability of first prediction in Wei and
Evelönn [Pearson’s correlation coefficients of 0.58 and 0.51, adjusted
P < 0.01], but not the other datasets (correlation < 0.30 and/or
adjusted P > 0.01). There are a number of samples that are incorrectly
predicted despite having a high-confidence prediction, and these tend
to have lower purity. Considering all datasets combined, table S5
summarizes the model accuracy, as well as the median probability of the
first prediction, by tumor purity. This further demonstrates that
samples with higher tumor purity are more accurately predicted, for
example, samples with purity ≥ 0.9 obtain an accuracy of 0.99. There is
a sharp drop-off in accuracy in samples with a purity of <0.20 compared
to samples in which purity was >0.2, suggesting that potentially a
biopsy sample may have to be repeated if purity is below this level.
This analysis was performed post hoc on both training/testing and
validation datasets, and there were a limited number of low-purity
samples (5% of the cohort; table S5). Therefore, these results (and a
purity threshold that warrants repeat biopsy) remain to be externally
validated.
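As an illustrative sketch of how such purity–confidence correlations can be computed (the dictionaries of per-dataset purities and first-prediction probabilities are assumed inputs, and Benjamini-Hochberg correction is shown only as one possible adjustment method):

```python
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

def purity_confidence_correlation(purity_by_dataset, prob_by_dataset):
    """Pearson correlation between tumor purity and first-prediction probability per dataset.

    purity_by_dataset / prob_by_dataset: dicts mapping dataset name -> 1D arrays of
    InfiniumPurify purities and first-prediction probabilities (assumed inputs).
    P values are adjusted across datasets (Benjamini-Hochberg shown as an example).
    """
    names, r_values, raw_p = [], [], []
    for name in purity_by_dataset:
        r, p = pearsonr(purity_by_dataset[name], prob_by_dataset[name])
        names.append(name)
        r_values.append(r)
        raw_p.append(p)
    _, adj_p, _, _ = multipletests(raw_p, method="fdr_bh")
    return {n: {"r": r, "adj_p": q} for n, r, q in zip(names, r_values, adj_p)}
```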
Fig. 6. Sample purity and MethylBoostER output.
(A) Sample purity for samples that are predicted correctly on the first
prediction (1st correct) and second prediction (2nd correct) and
incorrectly predicted samples (incorrect) on both predictions. Data are
shown for all datasets combined, with pathological subtypes shown in
different colors. Adjusted P values are shown (*P < 0.05 and ***P <
0.0009). (B and C) Sample purity and the probability of the first
prediction are demonstrated for all datasets combined (B) and each
dataset individually (C). The threshold t = 0.85 indicates a
high-confidence prediction. Samples that are incorrectly predicted (in
both first and second prediction) are indicated with a cross.
Features selected by MethylBoostER are associated with carcinogenic processes
The features that the MethylBoostER models used during classification
were obtained. Each of the four XGBoost models selected 1490, 1697,
1476, and 1331 features, respectively. An analysis of the genomic
location of the features revealed that the selected features do not
follow the same genomic location distribution as the background, as
shown in [120]Fig. 7A. In the promoter regions close to the
transcription start site (TSS) (≤1 kb), there is a lower percentage of
features compared to the background set (68% compared to 78%). However,
in the promoter regions further away from the TSS (1 to 2 kb and 2 to 3
kb away), there is a higher percentage of features compared to the
background set (8.7% compared to 6.3% and 4.3% compared to 2.3%). There
is also a higher percentage of features in introns (5.1% compared to
3.1% for first introns and 8.7% compared to 5.5% for other introns).
This distribution of the features’ genomic locations is significantly
different from the background (apart from the intergenic annotation),
as shown in table S6.
Fig. 7. The genomic location and functional annotation of the features
selected by MethylBoostER.
(A) Distribution of genomic locations (relative to genes) for the
selected features compared to the background (the total set of input
features). (B) Enriched GO terms from the Biological Process category
represented as a network, where each branch represents a different
functional category. Results were obtained from the gene-wise GO
analysis. (C) Enriched GO terms from the Biological Process category
represented as a bar plot. Results were obtained from the localized
region GO analysis.
The features from MethylBoostER were mapped to proximal genes and
analyzed at the gene ontology (GO) and pathway level. The GO terms are
visualized as a network in [123]Fig. 7B. The GO analysis detected a
neuronal and muscle developmental signature, cancer signaling [Wnt,
mitogen-activated protein kinase (MAPK), and Ras signaling pathways],
and both positive and negative regulation of transcription, among
others. The significant enrichment (see table S7 for the adjusted P
values of all gene list comparisons) of transcription factors [TF
Checkpoint ([124]28)] and epigenetic regulators [EpiFactors ([125]29)]
indicates the perturbation of transcriptional regulators and chromatin
remodelers in renal cancer. While an RCC-related gene set [RCC
Harmonizome/Diseases databases ([126]30)] and a ccRCC putative driver
gene set ([127]23) were enriched, a similar set of pRCC genes was not
enriched [pRCC Harmonizome/Diseases ([128]30)]. The feature set also
indicated a strong metastatic signature [Human Cancer Metastasis
Database ([129]31)], with enriched GO terms related to Motility, Axon
guidance, and Cell adhesion. Enrichment of epithelial-mesenchymal
transition (EMT)–related genes [dbEMT2 ([130]32)] provides further
evidence for this metastatic signature. In addition, the gene list was
enriched for tumor suppressors, oncogenes, and fusion genes [Catalogue
Of Somatic Mutations In Cancer (COSMIC) Cancer Gene Census ([131]33)].
We also checked whether the results of the gene-wise GO analysis were
consistent with the results from the Genomic Regions Enrichment of
Annotations Tool (GREAT) ([132]34), which is designed for localized
regions that are not necessarily within genes. This also enabled the
features that were not mapped to genes to be included. [133]Figure 7C
shows the significant GO terms resulting from this analysis, which
demonstrates that they are similar to the results of the gene-wise GO
analysis. The consistent themes were development, cell adhesion, cell
motility, cell signaling, and neurogenesis. A few functions were unique
to this GREAT analysis, such as immune response and extracellular
matrix organization, and a few were unique to the gene-wise analysis,
such as regulation of transcription and response to cortisol stimulus.
Pathway analysis revealed enriched pathways covering similar functions
as found during the GO analysis. The most enriched pathways were the
following: Wnt signaling pathway (adjusted P = 2.79 × 10^−15), Cadherin
signaling pathway (adjusted P = 2.92 × 10^−5), and Netrin-1 signaling
(adjusted P = 4.62 × 10^−3). Other enriched pathways included axon
guidance, cell junction organization, and signaling by receptor
tyrosine kinases.
Next, we focused our analysis on the consistently salient
features—features that were selected by all four XGBoost models. There
were 38 such features, which mapped to 45 genes. Of these 45 genes
common to all four models, 9 genes have not been previously associated
with kidney function, renal disease, or cancer (ANKMY1, EHMT1, H2AW,
JDP2, H2BU1, KIAA1143, KIF15, LINC01960, and SF3A2). As detailed in
table S8, an analysis of the literature shows that the others have been
associated with ccRCC, chRCC, and RCC. These genes have also been
linked to kidney development, gene expression in kidney, adhesion,
motility, invasion, metastasis, and poor survival in renal cancer and
other cancer types. In addition, some of them contribute to increased
survival, act as tumor suppressors, promote cancer stemness, contribute
to carcinogenesis, proliferation, and tumor growth, and have been
identified in other cancer types. These associations demonstrate that
the methylation features identified by MethylBoostER are biologically
relevant. The methylation distribution of all 38 features and their
genes can be seen in fig. S9. We confirmed these findings by carrying
out an entity relationship–driven natural language processing (NLP)
analysis of approximately 30 million PubMed abstracts, using the
feature mapped gene set and a dictionary of relationship terms based on
enriched biological processes identified above. Modeling this
information from 28,579 interactions (each supported by multiple
publications) as an entity relationship network, and subsequent network
topological analysis where relationship strength was visualized as
degree (number of genes directly linked to a relationship) and
relationship importance as betweenness centrality, highlighted the
association of these genes with carcinogenesis, development, and tumor
microenvironment interactions. This corroborates the functional
bioinformatics analyses and confirms that these genes have been
previously associated with carcinogenic processes, tumor
microenvironment, metabolism, metastasis, and immune and inflammatory
responses and are known to influence patient survival and prognosis in
the biomedical literature (fig. S10 and data file S1 containing
NLP-derived gene relationship pairs and their frequency).
We also extracted the class that each feature contributes toward to
identify subtype-specific features. Figure S11A shows that the genomic
location of the features was distributed slightly differently for each
class, with the normal, pRCC, and oncocytoma classes using fewer
features in proximal promoter regions near the TSS (≤1 kb). The normal
and pRCC classes also used more features in introns. We then calculated
enriched GO terms for each class using GREAT and found enriched terms
for normal, ccRCC, chRCC, and pRCC. The normal class was enriched in
terms such as membrane organization and cell activation (fig. S11B).
The ccRCC class was enriched in terms such as regulation of signaling
and regulation of cell communication (fig. S11C). The chRCC class was
enriched in multiple terms relating to negative regulation of DNA
repair and regulation of multicellular organismal process (fig. S11D).
The pRCC class was enriched in many terms, such as single-organism
developmental process, movement of cell or subcellular component,
localization, defense response, and positive regulation of biological
process (fig. S11E). After mapping these subtype-specific features to
genes, we also show that many genes were often found in multiple
subtypes. Figure S11E demonstrates that 205 genes were found in both
ccRCC and pRCC, 159 genes were found in both chRCC and ccRCC, and 137
genes were found in both chRCC and pRCC.
DISCUSSION
The rising incidence of SRMs drives the need to improve the diagnostic
pathway to avoid overtreatment of patients with benign disease and
optimize health resource use. The challenge consists of differentiating
benign oncocytoma from malignant subtypes, the most common being ccRCC,
pRCC, and chRCC. Previously published molecular classifiers are often
limited by focusing solely on distinguishing oncocytoma from chRCC,
excluding the other, more common subtypes and therefore reducing
applicability in the real world ([134]20, [135]35–[136]38). In
addition, a number of these studies had small sample sizes (<200
samples) ([137]36, [138]38, [139]39). Existing models often rely on a
limited number of markers (e.g., <100 markers), making them less robust
when applied to heterogeneous clinical samples and limiting potential
future applicability ([140]20, [141]36–[142]38, [143]40).
Machine learning models based on DNA methylation have demonstrated
excellent accuracy in other cancer types, including lung, brain, and
breast malignancies and sarcomas ([144]41–[145]43). The ability of
machine learning models trained on DNA methylation data to classify
cancer of unknown primary with excellent accuracy, leading to benefits
in overall survival, serves as a testament to DNA methylation as a
unique marker of cell identity ([146]44). We have therefore developed
an XGBoost machine learning model (MethylBoostER), leveraging
methylation data from over 1200 patient samples. We demonstrate that
the model accurately predicts pathological subtypes of renal tumors in
the training and testing set, with ROC AUCs of over 0.95 for all
subtypes evaluated. The extensive external validation in four
independent methylation datasets, totaling over 500 patient samples, is
a measure of the robustness of our method. We demonstrate the high
accuracy of the model with average accuracies of 0.824, 0.703, 0.875,
and 0.894 and AUCs of over 0.89 for all subtypes in all independent
datasets. We also show that MethylBoostER is competitive among similar
existing models (when using high- and moderate-confidence predictions),
achieving the highest performance for malignant RCC classes (chRCC,
ccRCC, and pRCC). A recognized clinical challenge is that biopsy
accuracy may be hampered by genetic ITH or reduced tumor content,
especially in smaller or more difficult to access tumors. We therefore
address these potential drawbacks. We demonstrate that, unlike the
known extensive genetic ITH ([147]22), analyzing methylation patterns
in multiregion samples from the same patient led to consistent
diagnoses in the vast majority of cases (92% in the Cambridge and 83%
in the Evelönn datasets). In addition, we assessed low-purity samples
(including ex vivo biopsies and FNAs) to represent a real-world
scenario. We show that samples that are incorrectly predicted by our
model have significantly lower purity than correctly predicted samples.
The association between purity and model accuracy was also previously
noted by Brennan et al. ([148]20). It is customary to take more than
one sample at the time of biopsy, which may help overcome this
challenge; alternatively, low-purity samples may require repeat biopsy.
In the testing set, the most common misclassifications involved ccRCC,
pRCC, and chRCC, while in the external validation sets, chRCC and
oncocytoma were the classes with the lowest AUCs. The tumors’ shared
cells of origin (proximal versus distal nephron) may explain some of
these results. Problems in accurately differentiating between ccRCC and
chRCC on histopathology are another well-known challenge. In TCGA, 15
cases were initially classified as ccRCC on histopathological slide
review; however, these were later re-reviewed by specialist
urohistopathologists and reclassified as chRCC ([149]11, [150]45). We
noted that eight of these samples were included in our testing and
training dataset, and MethylBoostER classifies five as chRCC and three
as ccRCC. The former suggests that our model classifies chRCC samples
more accurately than a standard pathologist, performing more like a
specialist urohistopathologist. We explored the methylation and gene
expression profiles of TCGA samples (fig. S12, A and B) and
demonstrated that the three samples that our model classifies as ccRCC
(TCGA participant IDs 4821, 4688, and 4696) cluster more closely with
ccRCC than chRCC based on both methylation and gene expression data. We
hypothesize that these samples may be ccRCC (i.e., the first
classification was correct rather than the reclassification). This
highlights existing challenges in diagnosing subtypes using existing
histopathology methods and emphasizes the need to produce accurate
predictive models. Last, another challenge is that predictive models
are trained on datasets in which the true diagnosis is based on
histopathology, and if this is incorrect, it may bias the model. To
ensure that the predictions of MethylBoostER were not heavily biased by
these eight mislabeled samples, we retrained the models with the
reclassified labels (eight ccRCCs were reclassified as chRCC, and a
number of non-RCC samples were removed). The results over the testing
sets were similar, as shown in fig. S12 (C and D), demonstrating that
the small number of incorrect labels does not largely affect the
results.
Carcinogenesis is invariably accompanied by perturbations in gene and
epigenome regulation. Aberrant DNA methylation changes on promoters,
enhancers, and gene bodies contribute to alteration of gene expression
and consequently affect signaling, regulatory, and metabolic pathways.
Renal carcinoma has been associated with MET (46), Hippo (47), Wnt (48),
MAPK (49), NRF2 (nuclear factor erythroid 2-related factor 2)-ARE
(antioxidant response element), phosphatidylinositol 3-kinase
(PI3K)/AKT/mechanistic target of rapamycin (mTOR) (50), and metabolic,
angiogenic, and immune checkpoint–associated pathways (51). GO and
pathway enrichment analysis of the MethylBoostER features mapped to
genes shows a strong association with genes, processes, and pathways involved in
carcinogenesis. In particular, oncogenes and tumor suppressors (COSMIC
Cancer Gene Census), cancer-associated pathways [Kyoto Encyclopedia of
Genes and Genomes (KEGG)], putative ccRCC driver genes, and
RCC-associated genes were enriched. In addition, biological processes
associated with metastasis (motility, cell adhesion, and axonal
guidance) and EMT were also strongly enriched. Comparison with curated
gene sets in metastasis [Human Cancer Metastatic Database (HCMDB)] and
EMT (dbEMT2) databases further confirmed this. Other signatures such as
neuronal and muscle development; Wnt, MAPK, and Ras signaling pathways;
nucleoside metabolism; and T cell activity are also enriched.
As expected, enrichment of both positive and negative regulation of
transcription coincides with the known consequences of DNA methylation.
Furthermore, enrichment of TFs and epigenetic regulators further
supported this strong transcriptional regulation signature. Genes
associated with response to cortisol stimulus are enriched in our
feature set. Serum cortisol levels have been found to be significantly
higher in RCC and are associated with higher tumor grade ([157]52). The
enrichment in our feature set suggests a possible epigenetic regulatory
mechanism for this process. Last, we show that the CpG features
selected by the models are enriched for cancer-related genes, and the
features selected by all four models are enriched for well-known RCC
genes. We use literature analysis to demonstrate that most
classification features are associated with genes involved in kidney
pathology, cancer, or carcinogenic processes, demonstrating the clear
discernment in model-selected methylation patterns. Associations
between methylation and gene expression are outside the scope of the
present analysis, which aimed to demonstrate that the model selects
salient features that are biologically relevant and easily
interpretable.
Several limitations of the current study must be addressed. One of the
main challenges in developing predictive methods pertains to lack of
sample and clinical data, particularly for rare disease subtypes. We
acknowledge that although the MethylBoostER model accurately predicts
the most prevalent pathological subtypes of renal tumors, due to
constraints in sample availability, the diagnosis of less prevalent,
uncommon renal masses (such as angiomyolipoma, translocation RCC,
collecting duct carcinoma, Fumarate Hydratase (FH)-deficient RCC, renal
medullary carcinoma, and others) is not accounted for in our current
analysis. Although the model achieves a high accuracy across all sample
types, chRCC and oncocytoma remain the most difficult classes to
distinguish. Future studies including larger sample sizes from these
two classes may enable improved prediction accuracies.
While the main aim of the analysis is to distinguish pathological
subtypes of SRMs, because of insufficient sample sizes, we pooled data
from both SRMs and larger tumors for the training/testing set and some
of the validation cohorts. However, we validated the model on the
Chopra dataset that consists exclusively of SRMs (245 samples) and
demonstrated excellent accuracy and generalizability of results in
smaller tumors and in low-purity samples (e.g., Evelönn and FNAs in
Brennan). Future larger studies could include rarer subtypes (exploring
the impact on the model output) and prioritize samples from patients
with SRMs. In addition, there were some limitations to our benchmarking
method. The models we compared to were trained on a different number of
classes, so the accuracies were not directly comparable, and they were
also trained on different training datasets. Thus, further benchmarks
that solve these limitations would need to be carried out. Excluding
TCGA, the other datasets included in our analysis provided methylation
data alone. In the future, obtaining both methylation and
transcriptomic data may enable head-to-head comparisons between
MethylBoostER and existing models predicting pathological subtypes
using transcriptomic data ([158]35, [159]36). Ultimately, an ideal
model would predict patient outcome in addition to identifying the
pathological subtype of the renal tumor and would take into account
comorbidities and competing risks of death. Unfortunately, this was not
possible in the present analysis due to lack of survival data for the
available cohorts. Future work should focus on collecting comprehensive
multimodal datasets including patient clinical parameters, imaging,
pathology, and multiple omic data types (including both methylation and
gene expression).
Unique features of our model are its interpretability, ease of use in a
clinical setting, and an approach that reduces the clinical importance
of ITH. We envision that, in the future, following further model
refinement (possibly through the use of larger, more heterogeneous, and
balanced renal cancer datasets), validation, and prospective testing
(using a well-designed clinical trial to test real-world performance),
MethylBoostER could be integrated in clinical practice to improve the
patient diagnostic pathway ([160]Fig. 8). Individuals with SRMs would
undergo imaging-guided tumor biopsy, which would be processed for
targeted DNA methylation analysis, and the MethylBoostER model would be
used to predict pathological subtypes, serving as an adjunct to
treatment decision-making. For high-confidence predictions, the output
is limited to the most likely diagnosis, whereas for
moderate-confidence predictions, the model will supply two class
predictions and the clinician would be encouraged to integrate imaging
and histological data. Such moderate-confidence predictions will
provide useful additional information for clinicians to combine with
other clinicopathological features; for instance, a moderate prediction
of “chRCC or ccRCC” in a fit patient with a larger tumor would be much
more likely to result in active treatment than a moderate-confidence
prediction of “chRCC or oncocytoma” in a patient with a relatively
small tumor and limited life expectancy. A strength of our work is the
output of moderate- and high-confidence predictions, which overall
increases model accuracy and empowers clinicians to make
patient-centered decisions that take into account both clinical and
molecular (DNA methylation) data. We acknowledge that, in a small
number of individuals with SRMs, renal biopsy may be high risk or
contraindicated due to tumor anatomy or patient factors; therefore,
MethylBoostER may not aid diagnosis in this setting. However, there is
an increasing use of diagnostic biopsies to minimize morbidity and
mortality resulting from overtreatment of patients with benign lesions,
supported by strong recommendations in recent clinical guidelines
(European Association of Urology Renal Cell Carcinoma Guideline Panel
2022). Following further refinement, MethylBoostER could have a
tangible future clinical application, although future studies will be
needed to demonstrate the model’s clinical benefit above the existing
standard of care diagnostic pathway (imaging with or without pathology)
before adoption in the clinic.
Fig. 8. Proposed future integration of MethylBoostER model into the existing
diagnostic pathway for patients with SRMs.
Following future model refinements and clinical trials, MethylBoostER
could play a role in the diagnostic pathway. Here, we describe the
potential clinical utility. Patients would have an image-guided renal
biopsy, and biopsy samples would undergo DNA methylation analysis.
MethylBoostER results would be interpreted in the context of
integration with clinical and imaging data. For high-confidence
predictions, MethylBoostER would predict one class, where benign
oncocytoma and malignant RCC would likely be managed with active
surveillance and active treatment, respectively. In moderate-confidence
predictions, the two classes with the highest probabilities would be
predicted. Samples with low purity or cases in which MethylBoostER
predicts normal kidney (suggesting that the target lesion was missed)
would prompt repeat biopsy.
In summary, we develop MethylBoostER, a machine learning model that
predicts pathological subtypes of benign and malignant renal tumors.
The relatively modest number of patients diagnosed with certain tumor
subtypes (especially chRCC and oncocytoma) and limited access to
patient samples can hamper the application of machine learning–based
approaches. Future studies should aim to obtain larger sample sizes and
focus on multimodal integration with imaging, patient characteristics,
epigenetic data, and clinical outcomes.
MATERIALS AND METHODS
Cambridge samples
Patients with benign and malignant renal tumors undergoing curative or
cytoreductive nephrectomy at Addenbrooke’s Hospital were recruited to
the DIAMOND study. Ethics approval and patient consent were obtained
(Research Ethics Committee, REC ID 03/018). Fresh-frozen tissue was
stored at −80°C. Where available, multiregion tumor samples were taken
along with adjacent normal kidney tissue. DNA was extracted from a
small section of frozen tissue, using the commercially available
AllPrep DNA/RNA Mini Kit (QIAGEN) according to the manufacturer’s
protocol.
DNA samples (10 ng/μl, 500 ng total) were sheared using the S220
Focused-ultrasonicator (Covaris) to generate double-stranded DNA
(dsDNA) fragments. The D1000 ScreenTape System (Agilent) was used to
ensure that >60% of DNA fragments were between 100 and 300 bp long,
with a mean fragment size of 180 to 200 bp. Tissue methylation analysis
was performed using the TruSeq Methyl Capture EPIC Library Preparation
Kit (Illumina), using the manufacturer’s protocol (also known as TruSeq
EPIC). This is a capture-based method targeting
approximately 3 million CpGs. Four samples are multiplexed in each
capture reaction using sample indexing adaptors. The protocol involves
hybridization of biotin-tagged probes to genomic DNA followed by
capture using streptavidin beads (two hybridization-capture steps) and
bisulfite conversion at 54°C for 2 hours. Following polymerase chain
reaction (PCR) amplification, uracils are copied as thymines, with
resulting libraries consisting of two families of dsDNA molecules
(originating from Watson and Crick strands), with a high
thymine-to-cytosine ratio. Twelve samples were pooled for sequencing on
the HiSeq4000 Illumina Sequencing platform (single-end 150-bp read)
using two lanes per library pool. Technical replicates were performed
for cell line data to assess assay reproducibility (correlation =
0.97). Sequenced data were trimmed (TrimGalore v0.4.4) and aligned to
the bisulfite-converted human reference genome (GRCh38/hg38).
Methylation calling was performed using the Bismark suite of tools
(v0.22.1). Trimming and alignment reports were compiled using MultiQC
(v1.7). Data were included in downstream analysis if a depth of greater
than 10× coverage was achieved to reduce the risk of false-positive
calling. In addition, samples were removed if non-CpG methylation was
greater than 1%.
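The coverage filter can be illustrated with a short sketch (assuming the standard Bismark .cov output layout; the file name and exact code below are placeholders rather than the original pipeline):

```python
import pandas as pd

# Standard Bismark coverage columns: chrom, start, end, %methylation, count methylated, count unmethylated
cols = ["chrom", "start", "end", "pct_meth", "n_meth", "n_unmeth"]
cov = pd.read_csv("sample.bismark.cov.gz", sep="\t", names=cols)  # placeholder file name

cov["depth"] = cov["n_meth"] + cov["n_unmeth"]
cov = cov[cov["depth"] > 10]                 # keep CpGs covered at >10x to limit false-positive calls
cov["beta"] = cov["n_meth"] / cov["depth"]   # per-CpG methylation fraction
```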
Publicly available data
TCGA data for ccRCC (KIRC), pRCC (KIRP), and chRCC (KICH) were obtained
via the TCGAbiolinks package in R ([163]53). Only CpG probes with less
than 5% missing data were kept. The Chopra datasets, one for training
and one for testing, were downloaded from the Open Science Framework,
repository OSF.IO/Y8BH2 ([164]18), and the Acute Myeloid Leukemia (AML)
samples were excluded. The Wei dataset ([165]19) and Evelönn dataset
([166]21) were obtained from the Gene Expression Omnibus (GEO), with
GEO accession numbers [167]GSE61441 and [168]GSE113501, respectively.
The Brennan dataset was obtained directly from the study authors
([169]20). In all cases, methylation data were evaluated using the
Illumina Infinium Human DNA Methylation 450 platform (also known as
450k array).
Preprocessing
Training/testing dataset
The training/testing dataset consisted of the TCGA data, the Chopra
training data, and the Cambridge samples. The following preprocessing
steps were performed on this training/testing set. CpG probes found in
two blacklists ([170]54, [171]55) for the 450k array were removed, as
well as probes at the site of C/T and G/A single-nucleotide
polymorphisms (SNPs), probes that map to multiple regions, and probes
at repeat regions. In addition, CpG probes located on the sex
chromosomes were omitted to remove gender bias. The 450k array (Chopra
training samples and TCGA samples) and TruSeq EPIC (Cambridge data)
cover different CpG sites. To combine data from the two methods, TruSeq
EPIC beta values within 50 bp of the 450K probes were averaged, as
adjacent CpGs tend to be co-methylated ([172]56). Subsequently, probes
that were missing for the whole of one dataset (TCGA, Chopra training
data, or Cambridge) were removed to avoid dataset-specific bias. This
resulted in a dataset with 158,670 CpG probes.
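A sketch of the ±50-bp merging step is shown below (column names and data layout are assumptions; the original preprocessing code may differ):

```python
import pandas as pd

def average_near_probe(truseq, probes_450k, window=50):
    """Average TruSeq EPIC beta values within +/- window bp of each 450K probe position.

    truseq: DataFrame with columns chrom, pos, beta (per-CpG TruSeq EPIC calls).
    probes_450k: DataFrame with columns probe_id, chrom, pos (450K manifest positions).
    Returns a Series of averaged betas indexed by probe_id (NaN if no CpG falls in the window).
    """
    out = {}
    by_chrom = {c: g.sort_values("pos") for c, g in truseq.groupby("chrom")}
    for probe in probes_450k.itertuples(index=False):
        cpgs = by_chrom.get(probe.chrom)
        if cpgs is None:
            out[probe.probe_id] = float("nan")
            continue
        near = cpgs[(cpgs["pos"] >= probe.pos - window) & (cpgs["pos"] <= probe.pos + window)]
        out[probe.probe_id] = near["beta"].mean()  # NaN when the window contains no CpG
    return pd.Series(out)
```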
Last, the beta values were converted to M values, as M values have been
shown to be more homoscedastic ([173]57). They are computed using the
following equation
M = log2(β / (1 − β))
Where M values were calculated to be infinity (due to β = 1), they were
set to the maximum finite value of the training/testing data, and where
M values were calculated to be minus infinity (due to β = 0), they were
set to the minimum finite value of the training/test data.
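A minimal sketch of this conversion, including the handling of infinite values (here clipped to the extremes of the converted array itself, a simplification of the training/testing-wide clipping described above):

```python
import numpy as np

def beta_to_m(beta):
    """Convert methylation beta values to M values, M = log2(beta / (1 - beta)).

    Infinite values (from beta equal to 0 or 1) are clipped to the finite minimum/
    maximum of the converted array; the original pipeline clips to the extremes of
    the training/testing data, so this is a simplification.
    """
    beta = np.asarray(beta, dtype=float)
    with np.errstate(divide="ignore"):
        m = np.log2(beta / (1.0 - beta))
    finite = m[np.isfinite(m)]
    m = np.where(m == np.inf, finite.max(), m)
    m = np.where(m == -np.inf, finite.min(), m)
    return m
```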
External datasets
The external datasets used were the Chopra validation data (which is
independent of the Chopra training data), the Brennan dataset, the Wei
dataset, and the Evelönn dataset. The same 158,670 probes in the
preprocessed training/testing dataset were selected for all four
external datasets.
Data visualization
Dimensionality reduction was carried out using umap-learn Python
package (version 0.3.10) ([174]58). Not a Number (NaN) values were
converted to the value 0 before the transformation.
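A minimal sketch of this step with umap-learn (the random_state is an assumed setting, not taken from the original analysis):

```python
import numpy as np
import umap  # umap-learn

def umap_embedding(m_values, random_state=0):
    """2D UMAP embedding of a samples-by-probes matrix of M values.

    NaN values are set to 0 before the transformation, as described above.
    """
    x = np.nan_to_num(np.asarray(m_values, dtype=float), nan=0.0)
    return umap.UMAP(n_components=2, random_state=random_state).fit_transform(x)
```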
MethylBoostER: The XGBoost classification model
A multiclass XGBoost classifier model was constructed using the xgboost
Python package (version 0.90) ([175]59). Fourfold nested
cross-validation was implemented on the training/testing set, with
integrated hyperparameter optimization.
Fourfold nested cross-validation
The training/testing set was split up into four stratified outer folds
so that each fold keeps 25% of the dataset for testing. This was split
so that multiple samples from the same patient were all in the same
fold to avoid data leakage between folds. These folds were iterated
through, treating each one as the testing set in each iteration. Within
these iterations, the remaining dataset was treated as a
training/validation set. Inner rounds of fourfold cross-validation were
performed on this training/validation set to find the optimal
hyperparameters. For each inner fold iteration, a model was trained on
the training set (75% of the training/validation set) and validated on
the validation set (25% of the training/validation set). These inner
rounds were repeated for various sets of hyperparameters [the
hyperparameters tested were dictated by the Tree of Parzen Estimators
algorithm in the hyperopt package ([176]60)].
The hyperparameters that resulted in the best average MCC score on the
validation sets (averaged across all four inner folds) were selected,
and a model was trained with these parameters on the whole
training/validation set. This resulted in a trained model with optimal
hyperparameters for each outer fold, and the performance of these
models on their fold’s testing set was reported. This method of
cross-validation ensured that the models had the optimal
hyperparameters, they were evaluated on completely unseen test data
(specifically data that were not used to select the hyperparameters to
avoid inflated scores), and the performance on the whole
training/testing dataset was reported.
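The outer loop of this procedure can be sketched as follows (StratifiedGroupKFold is used here to keep all samples from one patient in a single stratified fold; the original work may have implemented the grouping differently, and tune_hyperparameters stands in for the inner fourfold hyperopt search):

```python
import xgboost as xgb
from sklearn.model_selection import StratifiedGroupKFold

def nested_cv(X, y, patient_ids, tune_hyperparameters):
    """Outer loop of the fourfold nested cross-validation (sketch only).

    All samples from one patient are kept in the same fold. tune_hyperparameters
    is assumed to run the inner fourfold search on the training/validation portion
    and return an xgboost parameter dict.
    """
    outer = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=0)
    models, test_accuracies = [], []
    for train_idx, test_idx in outer.split(X, y, groups=patient_ids):
        params = tune_hyperparameters(X[train_idx], y[train_idx], patient_ids[train_idx])
        model = xgb.XGBClassifier(**params)
        model.fit(X[train_idx], y[train_idx])
        models.append(model)
        test_accuracies.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the unseen outer fold
    return models, test_accuracies
```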
Hyperparameters and training details
Two parameters were manually set, namely, subsample = 0.5 and
colsample_bytree = 0.5, to help reduce overfitting and run time (59). Many
of the remaining parameters were found using hyperparameter
optimization implemented in the hyperopt Python package (version 0.2.5)
([178]60). The maximum number of evaluations was set to 10 and
parallelism was set to 2 to reduce the run time. The search spaces for
each parameter were as follows: number of trees: 50 to 500, learning
rate: 0.05 to 0.5 (sampled from a log uniform distribution), max tree
depth: 2 or 3 (intentionally set very low to help reduce overfitting;
see fig. S13 for justification of this decision), L1 regularization
term: 0 to 1, and L2 regularization term: 0 to 1.
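A sketch of the corresponding hyperopt search space (parameter names follow the xgboost scikit-learn API; the objective passed to tune is assumed to run the inner fourfold cross-validation, and the exact distributions may differ from the original code):

```python
import numpy as np
from hyperopt import hp, fmin, tpe

# Search space mirroring the ranges described above (xgboost scikit-learn parameter names)
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 1),   # number of trees; cast to int before use
    "learning_rate": hp.loguniform("learning_rate", np.log(0.05), np.log(0.5)),
    "max_depth": hp.choice("max_depth", [2, 3]),                # kept very shallow to limit overfitting
    "reg_alpha": hp.uniform("reg_alpha", 0.0, 1.0),             # L1 regularization term
    "reg_lambda": hp.uniform("reg_lambda", 0.0, 1.0),           # L2 regularization term
}

def tune(objective):
    """Run the TPE search; objective(params) is assumed to run the inner fourfold
    cross-validation and return a loss such as 1 - mean validation MCC."""
    return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10)
```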
During model training, Gaussian noise with mean 0 and variance 0.2 was
added into the training data to help reduce overfitting. Early stopping
was also applied to reduce overfitting, with the number of early
stopping rounds set to 5 and multiclass log loss as the evaluation
metric used. Samples were weighted both to avoid patient bias and to
mitigate class bias. Balanced class weights were calculated using
the compute_class_weight function in the sklearn package (61),
which ensures that samples in less common classes are weighted higher.
The same function was used to generate balanced patient weights so that
samples from patients with many samples were weighted lower. The two
weights were multiplied together to get a weight for each sample to be
used during training.
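A minimal sketch of this combined weighting, assuming 1D arrays `y` (class labels) and `patient_ids`, is given below; it is illustrative rather than the exact code used.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Balanced class weights: rarer classes receive larger weights
    classes = np.unique(y)
    class_w = dict(zip(classes, compute_class_weight("balanced", classes=classes, y=y)))

    # Balanced patient weights: patients contributing many samples receive smaller weights
    patients = np.unique(patient_ids)
    patient_w = dict(zip(patients, compute_class_weight("balanced", classes=patients, y=patient_ids)))

    # Per-sample weight = class weight x patient weight, passed to XGBoost during training
    sample_weight = np.array([class_w[c] * patient_w[p] for c, p in zip(y, patient_ids)])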
Evaluation metrics
Accuracy, precision, recall, F1, and the MCC were used to evaluate
the models’ performance. F1 is the harmonic mean of precision and
recall:

F1 = 2 · (precision · recall) / (precision + recall)
The MCC was also reported as it is a performance measure that is not
affected by large class imbalances, unlike accuracy. ROC curves and PR
curves were also plotted, with each class plotted separately (i.e.,
using the one-versus-all strategy). ROC curves show the false-positive
rate and the true-positive rate over all possible values of the
classification threshold. PR curves show the precision and recall over
all possible values of the classification threshold. The AUC was also
reported for both curves, with 1 as the best possible AUC score.
The ROC AUC score is calculated from the model’s class-wise output
probabilities and is a one-versus-all metric, so each class-wise AUC
reflects a simpler binary task than the full multiclass
classification. The ROC AUC score is also affected by class imbalance
and can overestimate performance. We therefore
report a wide range of metrics and performance indicators ([180]Figs. 2
and [181]4; figs. S1, S4, and S5; and tables S1 to S3). Hence, while
interpreting the ROC AUC score, it is important to consider the
classification performance demonstrated within the confusion matrices,
as well as other metrics (such as PR AUC, accuracy, MCC, precision,
recall, and F1).
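These metrics can be computed with scikit-learn as sketched below; `y_true`, `y_pred`, and the per-class probability matrix `y_prob` are assumed inputs, and the averaging choice is illustrative.

    from sklearn.metrics import (accuracy_score, confusion_matrix, matthews_corrcoef,
                                 precision_recall_fscore_support, roc_auc_score)

    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    mcc = matthews_corrcoef(y_true, y_pred)                  # robust to class imbalance
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr")   # one-versus-all ROC AUC
    cm = confusion_matrix(y_true, y_pred)                    # basis of the confusion matrices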
High- and moderate-confidence predictions
The model’s output probability was used as a confidence measure, and
the predictions were split into two categories: high confidence, where
the output probability was larger than a specified threshold, t, and
moderate confidence, where the output probability was less than t. For
moderate-confidence predictions, we outputted the model’s top two
predictions (i.e., the two classes with the two highest output
probabilities).
The parameter t was chosen on the basis of the testing set. Three
metrics were plotted for values of t between 0.5 and 1 (in increments
of 0.025). These metrics were the fraction of high-confidence
predictions, the accuracy of the high-confidence predictions, and the
accuracy of the moderate-confidence predictions where a prediction was
treated as correct if the correct class was in the model’s top two
predictions. To remove noise in these accuracy scores, simple linear
models were fitted (i.e., approximated accuracy = a*t + b), and the
approximated accuracy scores were used to estimate t. For each value of
t, these three metrics were averaged and the value of t where the
average was the largest was taken. This gave t = 0.85, which was then
validated on the external validation datasets.
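The selection of t can be sketched as follows, assuming per-sample arrays `max_prob` (top output probability), `correct` (top-1 correctness), and `top2_correct` (top-2 correctness) from the testing set; the linear smoothing mirrors the accuracy = a*t + b fit described above.

    import numpy as np

    thresholds = np.arange(0.5, 1.0001, 0.025)
    frac_high, acc_high, acc_mod = [], [], []
    for t in thresholds:
        high = max_prob > t
        frac_high.append(high.mean())
        acc_high.append(correct[high].mean() if high.any() else np.nan)
        acc_mod.append(top2_correct[~high].mean() if (~high).any() else np.nan)

    def smooth(vals):
        # Fit accuracy ~ a*t + b and return the fitted values to remove noise
        vals = np.asarray(vals, dtype=float)
        ok = ~np.isnan(vals)
        a, b = np.polyfit(thresholds[ok], vals[ok], 1)
        return a * thresholds + b

    # Average the three metrics and take the threshold where the average is largest
    score = (np.asarray(frac_high) + smooth(acc_high) + smooth(acc_mod)) / 3
    t_best = thresholds[np.argmax(score)]   # t = 0.85 in this study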
Benchmarking
To benchmark MethylBoostER against other similar models, we compared
with the results reported by Chopra et al. ([182]18) and Brennan et al.
([183]20) on the Chopra validation dataset (this was possible as all
models were validated on this dataset). The MethylBoostER high/mod
model used our high- and moderate-confidence prediction method. When
this model predicted moderate confidence, we reported a correct
prediction if the first or second prediction was correct. Note that in
table S4, the value for the Chopra model for the normal class is 100/99
(i.e., either 100 or 99 correct predictions). This is because Chopra et
al. report 100 correct predictions and 2 incorrect predictions for the
normal class ([184]18); however, the Chopra validation dataset only
contains 101 normal samples.
Tumor purity analysis
Tumor purity was estimated for each sample from methylation beta values
using the InfiniumPurify package in R ([185]27), as previously described
([186]20). Purity in correctly predicted and incorrectly predicted
samples was compared using the Wilcoxon rank sum test with correction
for multiple testing.
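The statistical comparison (not the purity estimation itself, which used the R package) could be expressed in Python as below; `purity`, `correct`, and `y_true` are assumed per-sample arrays, the per-class grouping is an assumption, and statsmodels is used here only to illustrate the FDR step.

    import numpy as np
    from scipy.stats import ranksums
    from statsmodels.stats.multitest import multipletests

    # Wilcoxon rank sum test of purity in correctly versus incorrectly predicted samples,
    # run separately for each class (grouping is an assumption for the example)
    pvals = []
    for cls in np.unique(y_true):
        mask = y_true == cls
        _, p = ranksums(purity[mask & correct], purity[mask & ~correct])
        pvals.append(p)

    reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")  # multiple testing correction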
Feature analysis
MethylBoostER feature importance and annotation
Feature importance values were obtained from the trained XGBoost
models, using the gain measure as the importance metric. All
features with an importance greater than zero were selected. For each
selected feature, the CpG was mapped to a gene if it was within a gene
or within 1500 bp upstream of a gene.
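A minimal sketch of the importance extraction follows; the trained `booster` is assumed, and the CpG-to-gene step is indicated only by a hypothetical helper.

    # Gain-based importances from the trained booster: {feature_name: gain}
    gain = booster.get_score(importance_type="gain")

    # Features with nonzero importance, mapped to genes (within a gene or <=1500 bp upstream)
    selected = [f for f, g in gain.items() if g > 0]
    genes = {f: map_cpg_to_gene(f) for f in selected}  # map_cpg_to_gene is hypothetical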
Genomic location of features
The R package ChIPseeker (version 1.30.2) ([187]62) was used for
genomic location annotation and visualization. The R package
EnsDb.Hsapiens.v86 (version 2.99.0) ([188]63) was used for the
annotation source. The background set used was the total set of input
features. For significance testing, the GAT tool was used ([189]64)
together with annotations extracted from EnsDb.Hsapiens.v86 by
ChIPseeker (specifically using the detailGenomicAnnotation field).
Overrepresentation analysis
GO enrichment analysis was carried out using the Biological Networks
Gene Ontology (BINGO v3.0.3) package ([190]65), and network
visualization was carried out using Cytoscape (v3.9.0) ([191]66).
Pathway enrichment analysis was carried out using the Panther analysis
tool (release 16) ([192]67) with Panther and Reactome pathway
collections. Statistical overrepresentation was carried out using
Fisher’s exact test with false discovery rate (FDR) correction.
Overrepresentation analysis was also performed on eight gene lists:
dbEMT2 ([193]32), COSMIC Cancer Gene Census ([194]33), a list of
putative driver genes for ccRCC ([195]23), pRCC and RCC genes from the
Diseases database ([196]30), TF Checkpoint ([197]28) for transcription
factors, EpiFactors for epigenetic regulators ([198]29), and HCMDB for
metastatic genes ([199]31). Enrichment of gene sets was determined
using Fisher’s exact test with FDR correction, using the Python package
SciPy ([200]68). The background set used in overrepresentation analysis
was the total set of input features mapped to genes, and Ensembl gene
IDs were used where specified (if not, gene symbols were used).
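For a single gene list, the SciPy-based test can be sketched as below, assuming Python sets `model_genes` (model features mapped to genes), `gene_set` (one of the lists above), and `background` (all input features mapped to genes); FDR correction is then applied across the resulting P values.

    from scipy.stats import fisher_exact

    # 2x2 contingency table over the background universe:
    # rows = in gene set / not in gene set, columns = in model genes / not in model genes
    a = len(model_genes & gene_set)                       # in set, in model
    b = len(gene_set & background) - a                    # in set, not in model
    c = len(model_genes) - a                              # not in set, in model
    d = len(background) - len(gene_set & background) - c  # not in set, not in model

    odds_ratio, p = fisher_exact([[a, b], [c, d]], alternative="greater")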
For the localized region GO enrichment analysis, the R package rGREAT
(version 1.26.0) was used, and the background set was the total set of
input features. The significant (adjusted P ≤ 0.05) biological process
terms that best summarized the results (with redundant terms manually
excluded) were visualized.
Literature mining entity relationships
The Chilibot text mining web application ([201]69) was manually
searched with the 45 genes common to all four models together with the
terms “kidney,” “renal,” “RCC,” and “cancer.” Thirty million PubMed
abstracts were also searched for entity relationships using the Pangaea
NLP Python package ([202]70) with a renal carcinoma–associated relation
term list and a collection of HUGO Gene Nomenclature Committee gene
symbols (4217 genes) derived from MethylBoostER’s features mapped to
genes (5012 genes). Entity relationship terms with frequencies ≥3 were
used to build a gene relationship network using Cytoscape (v3.9.0)
([203]66).
Extracting subtype-specific features
In the XGBoost implementation that we used ([204]59), the ith tree
focuses on the (i mod c)th class, where c is the total number of
classes (5 in our case). Thus, for each class, we took all trees
focusing on that class and extracted the features from them. The upset
plot was created using the upsetplot Python package (version 0.6.0)
([205]71).
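This extraction can be sketched from the booster's tree dump, as below; `trees_to_dataframe` comes from xgboost releases newer than the version cited, so this is an illustrative reimplementation rather than the exact code used.

    # One row per tree node, with 'Tree' (tree index) and 'Feature' columns
    df = booster.trees_to_dataframe()
    df = df[df["Feature"] != "Leaf"]          # keep split nodes only
    n_classes = 5
    df["class"] = df["Tree"] % n_classes      # the ith tree focuses on class (i mod c)
    # Set of features used by the trees of each class
    class_features = df.groupby("class")["Feature"].apply(set).to_dict()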
Acknowledgments