Abstract

   Cancer, a collection of more than two hundred different diseases,
   remains a leading cause of morbidity and mortality worldwide. Usually
   detected at the advanced stages of disease, metastatic cancer accounts
   for 90% of cancer-associated deaths. Therefore, the early detection of
   cancer, combined with current therapies, would have a significant
   impact on survival and treatment of various cancer types. Epigenetic
   changes such as DNA methylation are some of the early events underlying
   carcinogenesis. Here, we report on an interpretable machine learning
   model that can classify 13 cancer types as well as non-cancer tissue
   samples using only DNA methylome data, with 98.2% accuracy. We utilize
   the features identified by this model to develop EMethylNET, a robust
   model consisting of an XGBoost model that provides information to a
   deep neural network that can generalize to independent data sets. We
   also demonstrate that the methylation-associated genomic loci detected
   by the classifier are associated with genes, pathways and networks
   involved in cancer, providing insights into the epigenomic regulation
   of carcinogenesis.

Introduction

   Cancer remains one of the most challenging human diseases, with over 19
   million cases and 10 million deaths reported annually [[30]1]. The
   increase of an ageing population worldwide, together with exposure to
   environmental carcinogens, and lifestyle choices such as poor diets,
   smoking and lack of physical activity contribute to the worldwide
   increase in cancer incidences. The evolutionary nature of cancer,
   complex interactions with the tissue micro-environment and host immune
   system, engender heterogeneity and make the pursuit and development of
   interventions difficult. Therefore, early detection and diagnosis of
   cancer, leading to better interventions and increased survival, remain
   one of the more effective avenues in combating cancer.

   Each of our somatic cells contains a single identical genome,
   incorporating the information necessary to specify and maintain our
   characteristics. In contrast, each cell will exhibit multiple
   epigenomes that change during different cellular states and over the
   passage of time. These epigenomes consist of a collection of reversible
   chromatin structures, interactions and modifications that do not change
   the DNA sequence and may be heritable across progeny cells. Histone
   variations, post translational modifications of the amino terminal
   tails of histone proteins, and covalent modification of DNA are some of
   the factors that contribute to epigenomic change. Notably, covalent
   methylation of DNA is one such reversible chemical modification with
   many functional consequences, and evidence for its role in embryonic
   development, cell differentiation, genomic imprinting, X chromosome
   inactivation, repression of regulatory elements, genome maintenance and
   the regulation of gene expression has accumulated in the last few
   decades [[31]2].

   Aberrant DNA methylation is observed in many cancers. CpG island
   promoter hypermethylation of tumour suppressor genes is an early
   neoplastic event in many tumours [[32]3–6]. In addition, global DNA
   hypomethylation can lead to chromosomal instability, activation of
   oncogenes and latent retrotransposons that promote carcinogenesis
   [[33]7]. Hypomethylation is seen in many cancer types, including
   cervical, prostate, hepatocellular, breast, brain, and leukaemia
   [[34]8–11]. These hyper- and hypo-methylation patterns can serve as
   cancer-associated signals and prognostic biomarkers. They are of
   particular use for early detection of cancer, as epigenetic
   modifications are some of the earliest neoplastic events associated
   with carcinogenesis [[35]12, [36]13]. Computational methods that detect
   these complex neoplastic methylation patterns can thus assist in cancer
   early detection, diagnosis, and screening. Here, we developed both
   binary and multiclass machine learning models to identify multiple
   cancer types from non-cancerous tissue samples. An expanding corpus of
   literature supports the use of classification methods trained on DNA
   methylation changes to identify carcinogenic signatures. Some of the
   more relevant works are reviewed in [37]Table 1. Moreover, we
   previously demonstrated that machine learning models, leveraging DNA
   methylation data from 1228 tissue samples can accurately classify
   pathological subtypes of renal tumours [[38]14]. In this study, we
   introduce a multiclass deep neural network, EMethylNET: Explainable
   Methylome Neural network for Evaluation of Tumours. EMethylNET is
   robust, generalizable, and interpretable and demonstrates high
   predictive accuracy.

Table 1.

   Summary of related studies, including EMethylNET, detailing the model
   type, number of train/test data sources (and total sample number),
   number of external validation data sources (and total sample number),
   number of CpGs input into the model and number of CpGs used by the
   model.
   Work Model type Train/test data sources (n) External validation data
   sources (n) CpGs input to the model CpGs used by the model
   Hao 2017 [18] LASSO 1 (2676) 1 (718)
   Tang 2017 [19] Random forest 1 (5379) 7 (504) 9-998
   Capper 2018 [20] Random forest 1 (3905) 5 (401) 10000
   Peng 2018 [21] LASSO 1 (1478) 3 (267) 128
   Ding 2019 [22] Logistic regression 1 (7605) 6 (742) 12 12
   Zheng 2020 [23] DNN 1 (7339) 12 (972) 10360 10360
   Koelsche 2021 [24] Random forest 1 (1077) 4 (428) 10000
   Liu 2021 [25] XGBoost 1 (7224) 0 ≤ 294
   Modhurkur 2021 [26] Random forest 9 (9303) 0 2978
   Ibrahim 2022 [27] PLSDA 1 (6502) 10 (1595) 20 20
   Kuschel 2022 [28] Random forest 3 (369) 0 50000
   Zhang 2023 [29] Linear support vector classifier 1 (781) 1 (4702) 1588
   EMethylNET XGBoost and DNN 1 (6224) 9 (940) 276016 3388
   [39]Open in a new tab

Materials and methods

Microarray-based methylation analysis

   Methylome microarray data were obtained from The Cancer Genome Atlas
   (TCGA) GDC data portal
   ([40]https://www.cancer.gov/ccg/research/genome-sequencing/tcga, RRID:
   SCR 003193). The data sets utilized were from the Illumina Infinium
   Human DNA Methylation 450 platform, and 13 human cancer types with at
   least fifteen normal samples were analysed. Metadata was also obtained
   from the TCGA data portal. [41]Supplementary Table S12 shows the number
   of cancer and normal samples for each cancer type.

   In addition to the TCGA data, a number of data sets from independent
   studies were also used in model evaluation. Eight of the independent
   data sets were from the Illumina Infinium Human DNA Methylation 450
   platform, and one (ESCA 2) was from the Illumina Infinium Methylation
   EPIC array platform. The number of cancer and normal samples for each
   independent data set is shown in [42]Supplementary Table S13. The
   sources of each data set are: breast cancer (BRCA): [43]GSE52865, colon
   adenocarcinoma (COAD): [44]GSE77955 (only samples from sites colon,
   left colon, right colon, and sigmoid are taken), esophageal carcinoma
   (ESCA): [45]GSE72874, ESCA 2: EGAD00010001822 and EGAD00010001834, head
   and neck squamous cell carcinoma (HNSC): [46]GSE38266 (note that half
   of these samples are HPV+), kidney renal clear cell carcinoma (KIRC):
   [47]GSE61441, liver hepatocellular carcinoma (LIHC): [48]GSE75041,
   prostate adenocarcinoma (PRAD): project PRAD-CA from ICGC, thyroid
   carcinoma (THCA): [49]GSE97466. For the COAD and THCA independent data
   sets, details regarding the adenoma samples were obtained from their
   metadata.

Data pre-processing

   TCGA data were downloaded using TCGAbiolinks package
   ([50]https://bioconductor.org/packages/TCGAbiolinks/, RRID: SCR 017683)
   [[51]15] in R (version 3.6.2) [[52]16], with the following
   pre-processing steps applied separately for each cancer type. Probes
   listed as potentially noisy by Naeem et al. [[53]17] were discarded,
   and only probes mapping to autosomal and sex chromosomes were kept. In
   addition, we discarded probes with > 5% of missing values. The
   remaining missing values were imputed using a k-nearest neighbours
   approach with the impute package in R (k = 10, rowmax = 0.25) [[54]18].
   This resulted in around 277,000 features (probes) per sample (this
   number varied between cancer types as processing was applied separately
   to each one). As M-values are more homoscedastic than beta-values
   [[55]19], we transformed beta-values to M-values using the function
   shown in [56]Equation (1).
   [MATH: <mi
   mathvariant="bold-italic">M</mi><mo>=</mo><mrow><msub><mrow><mi>log</mi
   ></mrow><mn>2</mn></msub></mrow><mo
   stretchy="true">(</mo><mfrac><mrow><mi
   mathvariant="bold-italic">β</mi></mrow><mrow><mn>1</mn><mo>-</mo><mi
   mathvariant="bold-italic">β</mi></mrow></mfrac><mo
   stretchy="true">)</mo> :MATH]
   (1)

   For the multiclass data, features were obtained by taking the
   intersection of features of every cancer type, and the normal class was
   obtained by pooling all the normal samples from different tissue types
   together. When pre-processing non-TCGA data sets, where the
   non-normalized data were available, we computed the beta values from
   methylated and unmethylated counts (to avoid issues with different
   normalization methods) and selected the same features as the previously
   processed TCGA data. Many data sets contained NA values, and unlike
   before, where we relied on the whole data set to impute the values, we
   set them to a constant beta of 0.5. This simulated a real-world testing
   scenario, where you might test one sample at a time and not have access
   to the whole data set of samples. Lastly, these beta values were
   transformed to M-values.

Classification models and metrics

   Throughout this study, we used both binary and multiclass models. Each
   binary model compared one tissue type, distinguishing cancer from
   normal, and the multiclass models utilized all 13 tissue types and
   normal samples. Note that in the binary models, the normal class was
   only normal samples for that tissue, whereas in the multiclass models,
   the normal class was normal samples from all tissue types pooled
   together. For each model, the input data were split into training and
   test sets, with 25% of samples in the test sets.

   To begin with, we tested two simple classification models: logistic
   regression and an SVM. Both models were created and tuned using the
   package sklearn [[57]20] in Python (version 3.7.5). Hyperparameter
   tuning on the training set using 5-fold cross-validation selected the
   default values in most cases, except for the binary logistic regression
   using the Newton solver and the multiclass SVM using gamma = ‘auto’.

   An XGBoost model based on gradient boosted decision trees was created,
   using the XGBoost ([58]https://xgboost.ai/, RRID: SCR 021361) Python
   package [[59]21]. Hyperparameter tuning for the binary models resulted
   in 450 estimators with a maximum depth of ten and a learning rate of
   0.189. The multiclass model had eight hundred estimators with a max
   depth of three and the same learning rate. In this model, 50% of
   features were randomly sampled when constructing each tree and 50% of
   samples were taken in each iteration, which helped to prevent
   over-fitting.

   Finally, a multiclass feed forward neural network, namely EMethylNET,
   which is based on the multiclass XGBoost model was produced. XGBoost
   models assign an importance to each of their input features, and an
   importance above zero indicates that this feature is helpful for
   classification. All input features with an importance greater than zero
   (3,388 features) were used as input for the neural network. Similar to
   all other models, we trained it on our TCGA training set. We conducted
   a hyperparameter search using the Python package Talos [[60]22], using
   30% of the training data as a validation set for each hyperparameter
   test. For training, we used the Adam optimizer [[61]23] with
   cross-entropy loss. A variant of early stopping was used, where the
   model was trained for a full five hundred epochs, and the model at the
   epoch with the highest validation set accuracy was taken as the final
   model. For evaluation, we used standard accuracy, precision and recall
   metrics. We also report the F[1] score, which is the harmonic mean of
   precision and recall, as shown in [62]Equation (2).
   [MATH: <mi
   mathvariant="bold-italic">F</mi><mn>1</mn><mo>=</mo><mn>2</mn><mo>⋅</mo
   ><mi mathvariant="normal"> </mi><mfrac><mrow><mfenced open="("
   close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">precision</mi><mi
   mathvariant="normal"> </mi><mo>⋅</mo><mi mathvariant="normal"> </mi><mi
   mathvariant="bold-italic">recall</mi></mrow></mfenced></mrow><mrow><mfe
   nced open="(" close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">precision</mi><mi
   mathvariant="normal"> </mi><mo>+</mo><mi mathvariant="normal"> </mi><mi
   mathvariant="bold-italic">recall</mi></mrow></mfenced></mrow></mfrac>
   :MATH]
   (2)

   In addition, we report the Matthews correlation coefficient (MCC)
   measure (see [63]Equation (3)), as it portrays a more comprehensive
   measure of performance, especially with imbalanced classes in the
   binary case [[64]24].
   [MATH: <mi
   mathvariant="bold-italic">MCC</mi><mo>=</mo><mfrac><mrow><mfenced
   open="(" close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">TP</mi><mo>⋅</mo><mi
   mathvariant="bold-italic">TN</mi><mo>-</mo><mi
   mathvariant="bold-italic">FP</mi><mo>⋅</mo><mi
   mathvariant="bold-italic">FN</mi></mrow></mfenced></mrow><mrow><mfenced
   open="(" close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">TP</mi><mo>+</mo><mi
   mathvariant="bold-italic">FP</mi></mrow></mfenced><mfenced open="("
   close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">TP</mi><mo>+</mo><mi
   mathvariant="bold-italic">FN</mi></mrow></mfenced><mfenced open="("
   close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">TN</mi><mo>+</mo><mi
   mathvariant="bold-italic">FP</mi></mrow></mfenced><mfenced open="("
   close=")" separators="|"><mrow><mi
   mathvariant="bold-italic">TN</mi><mo>+</mo><mi
   mathvariant="bold-italic">FN</mi></mrow></mfenced></mrow></mfrac>
   :MATH]
   (3)

   We also report the area under the curve (AUC) for both receiver
   operating characteristic (ROC) curves and precision–recall curves. For
   both metrics, one is a perfect score and 0.5 is the score from a random
   model (assuming classes are balanced). For the multiclass AUCs, an AUC
   was generated for each class using the one-vs-all strategy, which
   reflects the model’s ability to distinguish each class from the rest of
   the classes.

Biological feature interpretability

Multiclass PCC importance analysis

   Importances of the multiclass probes contributing to classification
   (PCCs) were obtained from the trained XGBoost model, which used the
   gain measure as the feature importance.

SHAP values

   The shap package in Python was used for analysing SHAP values of the
   multiclass DNN [[65]25]. A stratified sample of 10% of the training set
   was used as the background set and a stratified sample of 10% of the
   whole data set (training and test) was used to calculate the SHAP
   values.

Probe annotation and mapping

   Probes with an XGBoost importance score > 0 were mapped to genes that
   were overlapping or that had overlapping promoter regions (taken as the
   1500 base pair window upstream of the transcription start site). Each
   probe was mapped to all genes that fulfilled this property. Then, we
   went through the multiclass probe list manually and refined probes that
   mapped to multiple genes, removing mapped genes where it was obvious
   that the gene was not being affected by the probe. This process removed
   161 genes from the multiclass gene list. The gene annotation data were
   obtained from Ensembl (version 101) using the R package biomaRt
   ([66]https://bioconductor.org/packages/biomaRt/, RRID: SCR 019214)
   [[67]26, [68]27], and the mapping functionality was implemented using
   the R package, ChIPpeakAnno
   ([69]http://www.bioconductor.org/packages/release/bioc/html/ChIPpeakAnn
   o.html, RRID: SCR 012828) [[70]28].

Differential methylation analysis

   Differential methylation analysis was performed using the R package
   TCGAbiolinks [[71]15], and the input data were M-values of the probes
   after filtering (see Data pre-processing). Differentially methylated
   probes were found by the Wilcoxon test using the Benjamini-Hochberg
   false discovery rate adjustment method. The probes with an adjusted
   P-value < .01 and an absolute mean difference of above 2 were selected.

Enrichment analysis

Gene ontology over-representation analysis

   Functional enrichment analysis was carried out using the R package
   gprofiler2 ([72]https://biit.cs.ut.ee/gprofiler/page/r, RRID: SCR
   018190) [[73]29] with the Bonferroni correction method. The background
   set were the XGBoost input probes (ie, the microarray probe list after
   filtering) mapped to genes. This result was then visualized by REVIGO
   ([74]http://revigo.irb.hr/, RRID: SCR 005825) [[75]30] using the
   settings: small, Homo sapiens GO terms, SimRel similarity. The scatter
   plot in [76]Fig. 5a was based on the R script provided by REVIGO, and
   the visible labels are the twenty most significant terms with four or
   more parents in the GO Biological Process hierarchy (to avoid very
   general terms).

Figure 5.

   [77]Figure 5
   [78]Open in a new tab

   Cancer processes, genes, and pathways in the multiclass gene list. a A
   REVIGO visualization showing the significant Gene Ontology terms,
   restricted to the biological process domain. Only a small selection of
   terms is labelled. b The 20 multiclass genes found most often in
   abstracts about cancer. Colour indicates the number of abstracts also
   specifying a tissue. c A visualization of the significant KEGG
   pathways, where the size of the node (pathway) is the amount of overlap
   between the multiclass gene list and the pathway, and the width of the
   edge indicates the amount of overlap between the two pathways. d The
   Pathways in cancer KEGG pathway, showing only multiclass genes. Each
   multiclass gene is coloured by the difference in average methylation
   between cancer and normal for two cancer types: BLCA and PRAD

Gene set over-representation analysis

   Fisher’s exact tests were performed on two cancer gene sets: COSMIC
   Cancer Gene Census [[79]31] ([80]https://cancer.sanger.ac.uk/census,
   RRID: SCR 002260), OncoKB ([81]https://www.oncokb.org/, RRID: SCR
   014782) Cancer Gene List [[82]32], and the TF Checkpoint 2.0 resource
   ([83]https://www.tfcheckpoint.org,RRID: SCR 023880) to determine
   overlap with translational regulators to assess the overlap with the
   multiclass gene list. In these analyses, we only included genes present
   in our background gene set, i.e. the microarray probe list (after
   filtering) mapped to genes.

Text mining

   The Pangaea package [[84]33] was used for text mining of over four
   million cancer-related PubMed ([85]https://pubmed.ncbi.nlm.nih.gov/,
   RRID: SCR 004846) abstracts (downloaded in 2020) that were associated
   with cancer. We analysed the abstracts that referred to at least one of
   our multiclass genes, which was a total of 183,909 abstracts. The
   output of Pangaea is available as an excel spreadsheet in
   [86]supplementary data ([87]Supplementary File 1).

Pathway enrichment analysis and visualization

   KEGGprofile [[88]34] was used for gene set enrichment of KEGG pathways
   for the multiclass gene list. Transformation of the Ensembl IDs to
   Entrez gene IDs was required (losing some unmappable genes in the
   process), and the background set was the microarray probe list (after
   filtering) mapped to genes. For visualization, KEGG pathways were
   retrieved using the R package KEGGgraph
   ([89]https://bioconductor.org/packages/KEGGgraph/, RRID: SCR 023788)
   [[90]35] and KEGG IDs were converted to Ensembl IDs using the R package
   biomaRt [[91]26, [92]27]. Pathways were visualized with the NetworkX
   Python package ([93]https://networkx.org/, RRID: SCR 016864) [[94]36],
   and only multiclass genes were shown. For each multiclass gene, the
   difference in average methylation between cancer and normal is
   displayed as the node colour. More specifically, for each cancer type
   and each PCC, the mean M-value of the cancer samples minus the mean
   M-value of the normal samples for that cancer type was taken. Where
   multiple PCCs mapped to the same gene, the PCC with the maximum
   absolute difference was taken.

   For the visualization of the pathway network, sixty cancer-related KEGG
   pathways were collected. Only pathways with more than three multiclass
   genes were kept (resulting in fifty-six pathways). Interaction data
   were collected for all multiclass genes, from STRING
   ([95]http://string.embl.de/, RRID: SCR 005223) [[96]37] (using all
   interactions from the default 0.4 confidence), GeneMania
   ([97]http://genemania.org/, RRID: SCR 005709) [[98]38] (with all data
   sources selected) and GeneWalk
   ([99]https://github.com/churchmanlab/genewalk, RRID: SCR 023787)
   [[100]39]. These pathways and interaction data were visualized as a
   network with Cytoscape software ([101]http://cytoscape.org, RRID: SCR
   003032). Each node represented a pathway, and the multiclass genes in
   that pathway were visualized as smaller shapes around the nodes. The
   interaction data was summarized into pathway interactions—if a gene in
   one pathway interacted with another gene in a different pathway, an
   edge was drawn between those two pathways. In addition, data from the
   COSMIC Cancer Gene Census, version 93 [[102]31], were integrated.

Pan-cancer methylome network model

   A model of the pan-cancer methylome network incorporating Molecular
   Mechanisms of Cancer pathway from the Ingenuity Pathway Analysis (IPA)
   resource ([103]http://www.ingenuity.com/, RRID: SCR 008653) [[104]40]
   and the Pathways in Cancer (Human) pathway from the KEGG pathway
   database ([105]https://www.kegg.jp/kegg/pathway.html, RRID: SCR 012773)
   [[106]41, [107]42] was produced using PathVisio software
   ([108]https://pathvisio.org/, RRID: SCR 023789)[[109]43]. Multiclass
   methylation features mapped to genes were displayed as blue nodes
   (non-coding genes highlighted in yellow), or purple if they were also
   known cancer genes from Cosmic Cancer Gene Census or OncoKB.
   Interaction between nodes is derived from the literature, pathway
   databases (including IPA and KEGG) and protein-protein interaction data
   sets (STRING). The model was produced as a gpml object, adhering to the
   Systems Biology Graphical Notation (SBGN) standard. Direct interactions
   are shown as complete black lines and indirect interactions as broken
   black lines, respectively. Catalytic interactions are shown as red
   edges, inhibitory interactions as blue edges, and protein–protein
   interactions as orange edges between nodes.

Long non-coding RNA analysis

   The gene type annotation data were obtained from Ensembl (version 101).
   Literature evidence was obtained using the Pangaea tool [[110]33],
   where cancer hallmark keywords were extracted from the abstracts that
   mentioned at least one of the multiclass lncRNAs. Additionally, we used
   two cancer lncRNA databases, Lnc2Cancer 3.0
   ([111]http://bio-bigdata.hrbmu.edu.cn/lnc2cancer/, RRID: SCR 023781)
   [[112]44] and CRlncRNA [[113]45]. For Lnc2Cancer, we searched for
   cancer hallmark keywords in the description column, and for CRlncRNA,
   these cancer hallmark keywords were included explicitly. LncRNAs found
   in one or more of these two sources were plotted in a heatmap showing
   the average methylation (beta value) for all BRCA samples. The
   differential expression (log[2] fold change) was obtained using the
   DESeq2 package
   ([114]https://bioconductor.org/packages/release/bioc/html/DESeq2.html,
   RRID: SCR 015687) [[115]46].

Comparison with cancer lncRNAs

   We compared our lncRNAs to the Cancer LncRNA Census (CLC) [[116]47],
   using a Fisher’s exact test. We then carried out a pared-down version
   of their CLC features analysis, following a method as similar as
   possible. In each of these tests, a Fisher’s exact test was used when
   not otherwise specified. Gene location and length data were obtained
   from Ensembl (version 101) using the R package biomaRt [[117]26,
   [118]27].

   Close to cancer-associated and non-cancer-associated germline SNPs.
   Data were obtained from the GWAS Catalog (NHGRI-EBI’s catalog of
   published genome-wide association studies
   ([119]http://www.ebi.ac.uk/gwas, RRID: SCR 012745) [[120]48]. Cancer
   SNPs were found using keywords ‘cancer,’ ‘tumor,’ ‘tumour,’ and
   non-cancer SNPs were all other SNPs. We tested whether the closest
   cancer/non-cancer SNPs to the lncRNAs were within a distance threshold
   (1 kb, 10 kb, and 100 kb were tested).

   Within 1 kb of the COSMIC cancer gene census genes. For each background
   and multiclass lncRNA, the distance to the closest COSMIC cancer gene
   [[121]31] was computed, and we tested whether that distance was under
   1 kb more (or less) frequently for the multiclass lncRNAs.

   Epigenetically silenced in tumours. The multiclass lncRNAs were tested
   against a list of cancer-associated epigenetically silenced lncRNA
   genes (CAESLGs) [[122]49].

   Differentially expressed. The multiclass lncRNAs were tested against a
   list of dysregulated lncRNAs in a range of cancer types (BRCA, COAD,
   HNSC, KIRC, lung adenocarcinoma (LUAD), lung squamous cell carcinoma
   (LUSC), and PRAD) [[123]49].

   Gene and exon lengths. To test the difference in lengths, a Wilcoxon
   rank sum test was performed over the logged lengths (to ensure equal
   variance). For the exon length, the longest transcript for each gene
   was taken.

   Higher expression levels. TCGA expression data were used, normalized by
   the TMM method [[124]50] using the edgeR package
   ([125]http://bioconductor.org/packages/edgeR/, RRID: SCR 012802)
   [[126]51], and averaged across samples. A Wilcoxon rank sum test over
   the logged expression values was used to test expression differences.

   Conservation. Phast 100-way conservation scores were downloaded
   [[127]52, [128]53] and all conservation scores that overlap with
   background, or multiclass lncRNA gene bodies were taken. The mean
   conservation score per gene was computed, and then the difference in
   conservation between multiclass and background lncRNAs was tested using
   a Wilcoxon rank sum test.

Survival analysis

   To determine whether the gene sets could differentiate survival, the
   following was executed for each gene list from the binary XGBoost
   models. TCGA expression data were normalized (using the
   variance-stabilizing transformation in the DESeq2 package [[129]46]),
   and matched TCGA survival data were obtained. A Cox proportional
   hazards regression model (using the R package survival
   ([130]https://CRAN.R-project.org/package=survival, RRID: SCR 021137)
   [[131]54], version 3.1.8) was fitted on each gene separately along with
   the age, stage, and gender as covariates (excluding gender for BRCA,
   PRAD and UCEC). The genes that had a significant effect on survival
   (using the Wald statistic P-value with a .05 cutoff) were selected and
   a Cox proportional hazards regression model was fitted over all these
   selected genes. Kaplan–Meier curves were computed (using the survival
   package) by splitting the samples into a high and low hazard, split by
   the median hazard.

   Then a similar analysis that splits the samples up into a train and
   test set was run, to show whether the gene lists could predict
   survival. This was repeated thirty times to get a distribution over the
   test set performance. After gene normalization, samples were split into
   stratified train (75%) and test (25%) sets. Using only the train set, a
   Cox proportional hazards regression model was fitted on each gene
   separately and selected genes as before. Then three Cox proportional
   hazards regression models were fitted to the train set—one using just
   the selected genes, one using just the covariates, and one using the
   selected genes and covariates. We then used these models to predict the
   hazard on the test set, and plotted time dependent ROC curves (using
   the R package timeROC [[132]55], version 0.4) on these predicted
   hazards.

Results

Overview

   We utilized machine learning approaches to identify cancer-specific
   changes from normal tissue-specific methylation. DNA methylation
   microarray data from 13 cancer types and corresponding normal tissues
   were utilized. Illumina Infinium array-based methylome data were used
   in this study and data were extracted, cleaned, and processed as
   described in the Methods. Analysis of this methylation microarray data
   identifies the ratio of the methylated probe intensity over the overall
   intensity, known as the beta value, at given CpG locations using a pair
   of methylated and unmethylated probes.

   In this study, we trained and evaluated four different model types:
   logistic regression, support vector machines (SVM), gradient boosted
   decision trees (XGBoost), and a deep neural network (DNN). See
   [133]Fig. 1 for a visual overview. For the first three model types,
   both binary and multiclass classification models were created.

Figure 1.

   [134]Figure 1.
   [135]Open in a new tab

   Overview of method. DNA methylation microarray data from 13 cancer
   types and corresponding normal tissues were collected from TCGA and
   preprocessed. For binary and multiclass classification tasks, three
   types of models were trained: Simple models (logistic regression and
   support vector machines), XGBoost, and EMethylNET, a model consisting
   of XGBoost combined with a Deep Neural Network. Then the models were
   evaluated on independent data sets and an analysis of their features
   used in classification was performed

Cancer types can be accurately classified using both binary and multiclass
methods

   Here, we present the results from the XGBoost and DNN models. We also
   trained and tested SVM and logistic regression models. In addition to
   measures of accuracy such as ROC AUC and F[1] score, we also provide
   the MCC measure. MCC is used to portray performance of the binary
   classification models and is especially useful when class imbalances
   are present. The SVM models did not outperform the XGBoost or DNN
   models: the SVM binary models reached an average MCC of 0.894 and the
   SVM multiclass model reached an MCC of 0.956. The binary logistic
   regression models had an average MCC of 0.960, outperforming the binary
   XGBoost models on average (average MCC of 0.919); however, their
   performance varies across cancer types: logistic regression models
   perform better for 5 cancer types, XGBoost models perform better for 4
   cancer types, and both achieve the same MCC for four cancer types. The
   multiclass logistic regression model (MCC score of 0.973) did not
   outperform the multiclass XGBoost or DNN (MCC scores of 0.980 and
   0.976). Since the binary logistic regression models did not
   substantially outperform the binary XGBoost models and the multiclass
   logistic regression achieved a lower MCC score than the multiclass
   XGBoost and DNN, we focus our analysis on the XGBoost and DNN. For
   detailed performance metrics of the SVM and logistic regression models,
   see [136]Supplementary Tables S1 and S2.

Detection of the cancer states through binary classification of DNA
methylation from individual tumour and normal tissues

   XGBoost, a type of gradient boosted tree model, is an iterative
   ensemble machine learning approach [[137]21]. We trained 13 binary
   XGBoost models, one for each cancer type. DNA methylome data from TCGA
   were used in training and testing the models, with a total of 6224
   samples. Each model learns to classify between cancer and normal
   samples (adjacent matched normal tissue) for its tissue type. Overall,
   there was good performance on the test set, with five out of 13 models
   achieving a perfect test set performance (COAD, KIRC, LUAD, LUSC, and
   uterine corpus endometrial carcinoma [UCEC]). Across all models, the
   average accuracy was 0.987 and the average MCC (a performance measure
   unaffected by severe class imbalance) was 0.919, demonstrating that the
   models can accurately classify cancer and normal samples.
   [138]Figure 2a–[139]e shows the confusion matrices for the best and
   worst performing models, AUC of ROC curves and precision-recall curves,
   and the MCC scores for all models. Performance metrics for all binary
   models can be found in [140]Supplementary Tables S3.1. A key issue with
   these binary models is the major class imbalance. The average fraction
   of normal samples is 0.135 (see [141]Supplementary Table S12 for the
   numbers of normal and cancer samples for each tissue type), which
   reveals why the average MCC is considerably lower than the average
   accuracy. In addition, the lowest performing model, ESCA, with an
   accuracy of 0.961 and MCC of 0.693, is the tissue type with the lowest
   number of normal samples, of which there were only sixteen. This
   sparsity of data contributed to its worse performance.

Figure 2.

   [142]Figure 2.
   [143]Open in a new tab

   Performance of the binary and multiclass XGBoost models on the TCGA
   test set. a and b Confusion matrices of the best (KIRC) and worst
   (ESCA) performing binary XGBoost models. c AUC of the ROC curves for
   all binary XGBoost models. d AUC of the Precision Recall (PR) curves
   for both cancer and normal classes of all binary XGBoost models. Note
   that the scales of c and d start from 0.7. e MCC scores for all binary
   XGBoost models. f shows the confusion matrix, g shows the AUC of the
   ROC curves for each class, and h shows the AUC of the Precision Recall
   (PR) curves for each class of the multiclass XGBoost model. Note that
   the scales of g and h start from 0.9

The multiclass classification of 13 cancer and normal tissues is more robust

   Here, we trained a single multiclass XGBoost model on the whole of the
   training data. There were classes for each of the 13 cancer types and a
   single normal class, which contained normal samples from every cancer
   type. The model was now required to learn the differences between 13
   tissue types in addition to the differences between cancer and normal
   tissue samples, making it a more challenging task than the previous
   binary classification. However, there was no longer a large class
   imbalance due to pooling of the normal samples together. As shown in
   [144]Fig. 2f–h, the performance of the test set was very good for all
   classes. The model can discriminate each of the 13 cancer types and
   normal samples with a high degree of accuracy. The overall accuracy was
   0.982 and the overall MCC was 0.980, see [145]Supplementary Table S3.2
   for the detailed metrics.

Models achieve high accuracy on independent heterogeneous data sets

   To determine the robustness of our models, we evaluated our XGBoost
   models on several independent data sets representing different cancer
   types, amounting to a total of 940 samples. These were more
   heterogeneous than the TCGA data used for training. Two data sets
   included some adenoma samples (COAD and THCA), one data set consisted
   of samples from early-stage tumours, some of which were later shown to
   recur (LIHC), and one data set included some Human papillomavirus (HPV)
   positive samples (HNSC). The data sets also came from seven different
   countries, viz., Iceland (BRCA), USA (COAD), Australia (ESCA), UK (HNSC
   and ESCA), China (LIHC and KIRC), Canada (PRAD), and Brazil (THCA).

Binary models show good performance on independent data sets

   When these independent data sets were tested, most of the binary
   XGBoost models (trained on TCGA data) performed well, illustrated by
   [146]Fig. 3. Confusion matrices of the best and worst performing binary
   XGBoost models are shown in [147]Fig. 3a and b. In terms of ROC AUC
   ([148]Fig. 3d), the highest performing model was the BRCA model, with a
   perfect ROC AUC of 1.0, and the lowest performing was the COAD model,
   with a ROC AUC of 0.758. The precision–recall AUC results show similar
   trends ([149]Fig. 3e). In terms of MCC ([150]Fig. 3f), the lowest
   performing model was the ESCA model, which is expected given the major
   class imbalance in the ESCA training data set.

Figure 3.

   [151]Figure 3.
   [152]Open in a new tab

   Performance of the binary XGBoost models on independent data sets. a
   and b Confusion matrices of the best (BRCA) and worst (COAD) performing
   binary XGBoost models (according to the ROC AUC scores) on the
   independent data sets. c Detailed confusion matrix for COAD showing the
   predictions of Normal (N), Adenoma (A), and Cancer (C) samples. d AUC
   of the ROC curves for binary XGBoost models where the independent data
   set included normal samples. e AUC of the Precision Recall (PR) curves
   for both cancer and normal (where available) classes of binary XGBoost
   models on the independent data sets. f MCC scores for binary XGBoost
   models where the independent data set included normal samples. For d, e
   and f, ESCA is the average of the two ESCA independent data sets

   Regarding the COAD model, its confusion matrix in [153]Fig. 3b shows
   that it predicted 12 normal samples as cancer. Nine out of these twelve
   samples are in fact adenomas; benign tumours of glandular origin (see
   [154]Supplementary Table S4 for the number of normal, adenoma and
   cancer samples in the COAD independent data set). A confusion matrix
   that also shows whether the samples are Normal (N), Adenomas (A), or
   Carcinomas (C) is shown in [155]Fig. 3c, illustrating that all adenomas
   are classified as cancer. This was unexpected, as there were no
   adenomas in the training data set, and instead of randomly classifying
   them, the model found some cancer-associated signal in adenoma samples
   in the independent data set.

   A similar trend was identified in the other independent data set with
   adenomas. In the THCA model, eleven out of 17 adenomas were predicted
   to be cancer (see [156]Supplementary Table S4 for the number of normal,
   adenoma and cancer samples in the THCA independent data set). In
   detail, all occurrences of ‘follicular adenoma’, and ‘follicular
   adenoma/Hürthle cell’ were classified as cancer (n = 8), all
   ‘lymphocytic thyroiditis’ were classified as normal (n = 3), and
   ‘nodular goitre’ was split evenly between the two classes (n = 6).

EMethylNET, a model consisting of a DNN model trained on features learnt from
multiclass XGBoost, improves performance

   The results for the multiclass XGBoost model on the independent data,
   which had an accuracy of 0.68 and MCC of 0.661, can be found in
   [157]Supplementary Table S6. With the aim of creating a more robust
   model and improving these results, we designed EMethylNET, a
   feed-forward neural network based on our XGBoost model, as shown in
   [158]Fig. 4a. The input features of EMethylNET were the features the
   multiclass XGBoost model learnt to utilize for classification, referred
   to as the probes contributing to classification, see below. See
   [159]Supplementary Table S5 for EMethylNET’s results on the TCGA test
   set. The results on the independent data sets, which had an accuracy of
   0.867 and MCC of 0.844, are shown as a confusion matrix ([160]Figure
   4b), AUC of ROC ([161]Figure 4c) and AUC of PR ([162]Figure 4d) and in
   [163]Supplementary Table S7. The only data set that did not reach a
   F[1]score of at least 0.8 (excluding COAD, as it contains adenomas, see
   above) was HNSC.

Figure 4.

   [164]Figure 4.
   [165]Open in a new tab

   Architecture of the feed forward neural network (a) and its performance
   on all independent data sets. b shows the confusion matrix, c shows the
   AUC of the ROC curves for each class, and d shows the AUC of the
   Precision Recall (PR) curves for each class. The two ESCA data sets are
   combined into one ESCA class. The colour orange denotes normal and
   purple denotes cancer. Note that we do not have independent data sets
   for every cancer type (the independent data sets used lacked BLCA,
   KIRP, LUAD, LUSC and UCEC samples). Nevertheless, for the confusion
   matrix in b all 14 classes are retained in the rows to maintain a
   square configuration, enhancing readability

Comparison of EMethylNET with related cancer classification studies

   The detection and classification of cancer using methylation-based
   approaches is a large and growing body of literature. A diverse range
   of approaches and objectives have been investigated, from binary
   classification of cancer using tissue data [[166]56], to multiclass
   classification using data from liquid biopsies [[167]57]. Here, we
   conduct a comparative analysis of EMethylNET with other related works
   that utilize machine learning for pan-cancer multiclass classification
   of DNA methylation data from tissue samples. These related works
   [[168]58–69] are listed in [169]Table 1. Various machine learning
   approaches have been used, from logistic regression to DNNs, with
   tree-based methods (random forest and XGBoost) being a popular approach
   (6/12 works).

   First, we provide a performance comparison of EMethylNET to these
   related works. We only compare works that provide test set scores on
   TCGA (we do not attempt to run their models). This comparison is not
   exact, and it is important to note the following shortcomings: the test
   sets contain different samples and have different sizes, the related
   works have slightly different classification tasks (for example, some
   only consider cancer samples, some define a normal class for each
   tissue type and some have separate classes for metastatic samples) and
   some classes in related works are not comparable with our classes (for
   example, some works combine cancer types found in the same tissue
   type). In addition, we can only compare with the metrics reported in
   the publication, and so different metrics are compared for different
   works.

   First, we compare with Hao 2017 [[170]58]. They classify four cancer
   types and four normal tissues, and so we can only compare with the four
   cancer types (as our normal samples are pooled). [171]Supplementary
   Table S8 shows the precision and recall metrics, indicating that
   EMethylNET achieves comparable performance for these cancer types
   (higher precision for COAD, LIHC and LUAD, and higher recall for LUAD).
   Next, we compared with Ibrahim 2022 [[172]67]. They perform a slightly
   different task, as they do not include a normal class, and they combine
   colon and rectal tumour data sets, so we cannot compare performance on
   COAD. [173]Supplementary Table S9 shows the ROC AUC scores, showing
   that EMethylNET achieves the same ROC AUC or higher in all classes
   (when rounding to three decimal places). Ibrahim 2022 also externally
   validate their model on the independent BRCA ([174]GSE52865) and THCA
   ([175]GSE97466) data sets. Again, this is not a direct comparison
   because the rest of their independent external validation data set
   differs from ours (which affects the one-vs-all approach to calculating
   AUC scores). For the BRCA data set, they report a ROC AUC of 0.928, and
   we achieve a ROC AUC of 0.99997. For the THCA data set, they report a
   ROC AUC of 0.990, and we achieve 0.99463. Next, we compare with Zheng
   2020 [[176]63]. They do not include normal samples, and they classify
   the cancer origin site, which again is a slightly different task to
   ours. We cannot compare KIRC, kidney renal papillary cell carcinoma
   (KIRP), LUAD and LUSC classes as they combine them into generic kidney
   and lung classes. The ROC AUC, precision and recall metrics are shown
   in [177]Supplementary Table S10, indicating that we achieve comparable
   performance. EMethylNET’s ROC AUCs are the same or higher in all
   classes (when rounding to two decimal places), precision is higher in
   4/8 classes and recall is the same or higher in 5/8 classes. Lastly, we
   compared with Modhurkur 2021 [[178]66]. As they have distinct classes
   for each metastatic cancer and each normal tissue type, they address a
   more challenging task. [179]Supplementary Table S11 shows the
   precision, recall and F1 metrics, which shows that we achieve
   comparable performance. EMethylNET’s precision is the same or higher in
   6/13 classes, recall is the same or higher in 10/13 classes, and F1 is
   higher in 8/13 classes. In summary, we have shown that EMethylNET
   achieves competitive test set performance amongst comparable works.

Biological feature interpretability

   A key advantage of using an interpretable method such as XGBoost is
   that the features utilized for classification can be identified. In our
   case, these were the CpG probes with a feature importance of above
   zero, which we refer to as PCCs. Surprisingly, most PCCs from the
   binary models were found not to be differentially methylated in each
   respective cancer type. Only 65/221, 56/318 and 29/179 PCCs from the
   BRCA, PRAD, and THCA binary models, respectively, were found to be
   differentially methylated, as shown in [180]Supplementary Fig. S1.

   We explored the PCCs from the multiclass XGBoost model (exactly the
   input features to EMethylNET). The importance scores of the most
   important PCCs are shown in [181]Supplementary Fig. S2a, which shows
   that most of the importance is captured by the top one hundred PCCs.
   The most important PCC, cg16508600, is at position chr1:204562255 and
   does not map to any gene. The closest gene is RNA5SP74, which is ∼200bp
   away, and the location of this PCC coincides with a C > T SNV
   (rs567580996). The second most important, cg14789818, is ∼200bp
   upstream of RNA5SP77 on chromosome 1. The third most important,
   cg03988778, is near the promoter of SVIP and [182]AC006299.1.
   [183]Supplementary Fig. S2b shows the distribution of methylation of
   the top ten important probes, indicating that they commonly
   differentiate one class from the rest. For example, the most important
   PCC differentiates BRCA from all other classes, and the third most
   important differentiates a sizable proportion of HNSC class.

   An interpretation of the multiclass DNN model can be achieved by
   analysing its SHAP (SHapley Additive exPlanation) values [[184]25]. The
   feature with the highest average impact on model output (the highest
   average absolute SHAP value) is cg15267232, which is within GATA3, and
   the feature with the second-highest average impact is cg22455450, which
   is within ZNF808. The feature with the third-highest average impact is
   cg22541735, which is within HOXD9 and HOXD-AS2. Interestingly, the
   feature with the 10th highest average impact (cg14789818) is also the
   second most important feature for the multiclass XGBoost model.
   [185]Supplementary Fig. S3 visualizes the features with the highest
   average absolute SHAP values.

The proximal genes of the multiclass model’s features are enriched in genes
contributing to hallmarks of cancer, carcinogenesis, and transcriptional
regulation

   The PCCs can be mapped to the proximal genes—genes where the gene body
   or promoter region (taken as the 1500 base pair window upstream of the
   transcription start site) overlap the PCCs. We will refer to the genes
   obtained by mapping the multiclass PCCs to proximal genes as
   ‘multiclass genes’.

   We performed functional enrichment analysis on the multiclass genes. A
   visualization of the significant Gene Ontology terms, restricted to the
   Biological Process ontology, is shown in [186]Fig. 5a. This shows that
   our multiclass gene list is enriched in development, regulation of
   signaling, processes involved in gene expression changes, and the
   regulation of a wide variety of metabolic processes.
   Over-representation analysis revealed that there is significant overlap
   between the multiclass genes and the COSMIC Cancer Gene Census
   [[187]31], with an overlap of 140 genes (19.0% of COSMIC genes)
   (Fisher’s exact test, p = 8.7e − 17). We also found significant overlap
   between the multiclass genes and the OncoKB Cancer Gene List [[188]32],
   namely 217 genes (19.7% of OncoKB cancer genes) (Fisher’s exact test,
   p = 4.5e − 27). Furthermore, analysis of multiclass features using the
   TF checkpoint 2.0 database indicated that 17.2% (546 genes) (Fisher’s
   exact test, p = 2.4e − 39) are also transcriptional regulators.

   We also looked at the overlap with established DNA methylation
   biomarkers in cancer, by comparing with the genes used by commercially
   available DNA methylation-based biomarker assays [[189]70]. Out of the
   13 genes measured by these assays, four overlap with our multiclass
   genes (RASSF1, SEPTIN9, SHOX2, MGMT). During normal expression, RASSF1A
   represses cell cycle proteins cyclin A2 and cyclin D1, leading to cell
   cycle arrest and plays a significant role in microtubule stability and
   modulates apoptosis. Furthermore, RASSF1 inactivation is one of the
   most common epigenetic changes in cancer [[190]71]. Similarly, SEPTIN9
   participates in cytokinesis during the cell cycle [[191]72], while
   SHOX2 is a transcription factor involved in proliferation, migration
   and colony formation [[192]73] and MGMT inhibits tumour formation
   [[193]74]. All these genes are well-known prognostic biomarkers in
   cancer. There are also several multiclass genes in the same family as
   these 13 genes (such as NDRG3, BMP8A, OTX2, ONECUT1).

   Text mining a corpus of 183,909 PubMed cancer-related abstracts that
   mention at least one multiclass gene revealed that the cancer
   literature provides evidence for the multiclass genes. 65.6% (2083) of
   our multiclass genes are found in at least one cancer-related article
   abstract from PubMed. See [194]Fig. 5b for the genes most supported by
   the literature. These include well-studied oncogenes such as STAT3,
   BRCA1, AR, MYC, CXCR4, NOTCH1, SMAD4, TERT, ZEB1, JUN, amongst others.
   This analyses also demonstrated that just under 40% of these abstracts
   are additionally associated with at least one of the 13 tissue types
   included in the multiclass model. BRCA, PRAD and COAD are most commonly
   found, due to their high prevalence. [195]Supplementary File 1 details
   the evidence for the multiclass genes in these PubMed cancer-related
   abstracts.

The multiclass genes are enriched in cancer-related pathways and networks

   Pathway enrichment analysis using the KEGG pathway database revealed
   enrichment of pathways related to general cancer hallmarks, such as
   Pathways in cancer (adjusted P-value = 4.3e − 4), Metabolic pathways
   (adjusted P-value = .0214), and signal transduction pathways such as
   the Wnt signalling pathway (adjusted P-value = 8.2e − 3), TGF beta
   signalling, Hippo signalling, Axonal guidance pathway involved in
   invasion and metastasis, and many metabolic pathways. See [196]Fig. 5c
   for a visualization of these enriched pathways.

   The multiclass genes in these pathways displayed different methylation
   patterns for different cancer types. A visualization of the Pathways in
   cancer network from KEGG is shown in [197]Fig. 5d for both bladder
   urothelial carcinoma (BLCA) and PRAD, and in [198]Supplementary Fig. S4
   for all other cancer types. This shows that BLCA, KIRC, KIRP, LIHC,
   THCA and UCEC are mostly hypomethylated whilst BRCA, COAD, LUAD, LUSC
   and PRAD are mostly hypermethylated. [199]Supplementary Fig. S5a shows
   a heatmap of PCCs at least 2-fold differentially methylated and their
   recurrent mutation status (from COSMIC cancer gene census and TCGA
   significantly mutated list) is indicated. Similarly, the differential
   methylation of all PCCs is shown in [200]Supplementary Fig. S5b. In
   addition to mutations and copy number aberrations, the PCC features
   identified by our analysis contribute to carcinogenesis in multiple
   cancers via methylation changes of regulatory elements.

   Furthermore, multiclass genes were found to be present in a broad range
   of cancer-related pathways, as shown in [201]Fig. 6. These pathways
   covered a wide range of categories: Individual cancer types, Cell Death
   and Survival, Tissue Microenvironment, Signalling, Metabolism, and
   Immune System. This pathway model also shows that many multiclass genes
   in these pathways are present in the COSMIC Cancer Gene Census
   [[202]31]. To visualize the multiclass genes in one unified cancer
   network, we curated a general cancer network, based on two general
   cancer pathways: KEGG’s Pathways in cancer and Ingenuity Pathway
   Analysis (IPA) Molecular Mechanisms of Cancer (both of which our
   multiclass genes are enriched in, Fisher’s exact test respective
   adjusted P-values = 4.206e − 10 and 1.065e − 05). This network model is
   shown in [203]Supplementary Fig. S6 and demonstrates that the
   multiclass genes span all areas of cellular networks underlying
   carcinogenesis.

Figure 6.

   [204]Figure 6
   [205]Open in a new tab

   A network of cancer pathways and the multiclass genes. Each circle of
   nodes is a cancer pathway, and each node represents a multiclass gene.
   The node colour represents the number of times each multiclass gene is
   displayed (as they can be in multiple pathways), the edge thickness
   represents the number of interactions between pathways, and a black
   outline indicates that the multiclass gene is found in the Cancer Gene
   Census. The colour of the pathway name represents the pathway category

Multiclass long non-coding RNAs are associated with oncogenic properties

   Additionally, we investigated the proportion of protein-coding versus
   non-coding genes in our gene lists. This is visualized in [206]Fig. 7a
   for all individual cancer gene lists and the multiclass genes. It shows
   that, as expected, most genes are protein-coding (around 65% to 85%).
   However, the proportion of long non-coding RNA (lncRNA) is surprisingly
   high for all gene lists (around 14% to 26%), which motivated further
   analysis. We validated some of the multiclass lncRNAs with literature
   evidence using Pangaea [[207]33] and two cancer lncRNA databases
   (Lnc2Cancer 3.0) [[208]44] and CRlncRNA [[209]45]. We found evidence
   for 142 multiclass lncRNAs (out of a total 596 multiclass lncRNAs). See
   the heatmap in [210]Fig. 7b (and [211]Supplementary Fig. S7a for a
   larger version), which shows that there is a wide range of methylation
   values for these lncRNAs, and a range of cancer hallmarks associated
   with them. The most common hallmarks are proliferation, invasion, and
   migration. The lncRNAs with the most evidence include HOTAIR, NEAT1,
   and HOTTIP, as seen in [212]Fig. 7c.

Figure 7.

   [213]Figure 7
   [214]Open in a new tab

   Analysis of the lncRNAs found in the gene lists. a The fractions of
   different gene types in all cancer gene lists, including the multiclass
   gene list. b A heatmap of BRCA data showing the average beta value of
   the multiclass lncRNAs with literature evidence, and the cancer
   hallmarks they are associated with. The row annotation indicates the
   log fold change from differential expression analysis, where
   non-significant fold change (adjusted P-value > 0.05) is in grey. c The
   top 10 multiclass lncRNAs that had the most literature evidence. d The
   significance levels resulting from testing the multiclass lncRNAs for
   previously observed cancer lncRNA features [[215]41]. The dashed red
   line indicates the P-value = .05 level of significance. e Boxplot of
   the log[e] gene length of non-multiclass lncRNAs and multiclass
   lncRNAs. ‘***’ indicates P-value < .001

   We also compared the multiclass lncRNAs to a set of validated cancer
   lncRNAs and found that they share some of the same properties.
   Carlevaro-Fita et al. introduced [[216]47] and Vancura et al. updated
   the Cancer LncRNA Census 2 (CLC2), which is a list of 492 lncRNAs that
   have been causally associated with cancer [[217]75]. Our multiclass
   lncRNAs do have a significant overlap with the CLC2 (Fisher’s exact
   test, P-value = 3.0e − 18); however, this is only 74 overlapping
   lncRNAs. Carlevaro-Fita et al. also uncovered the properties of genes
   in the CLC, such as smaller distances to cancer SNPs, higher
   conservation, and longer gene lengths. By carrying out the same tests
   on our multiclass lncRNAs, we found that the multiclass lncRNAs share
   some of these CLC properties. We tested the distances to
   cancer-associated and non-cancer SNPs, distances to cancer associated
   genes, epigenetic silencing in tumours, differential expression, gene
   and transcript lengths, gene expression levels, and conservation. The
   P-values for each of these tests are shown in [218]Fig. 7d. We found
   that our multiclass lncRNAs did not share any of the same proximity
   properties (distances to SNPs and cancer genes) but did share all five
   remaining properties. See [219]Fig. 7e for a boxplot showing that the
   multiclass lncRNAs have longer gene lengths, and [220]Supplementary
   Fig. S7b–g for boxplots of the other relevant tests.

Models for some of the cancer types can predict 5-year survival

   We used the gene lists from the binary XGBoost models to determine
   whether they could firstly differentiate, and then predict, survival.
   Survival was computed for each cancer type, using just the expression
   of the genes from the binary model as input. For every cancer type, the
   cox proportional hazard model significantly differentiated survival.
   [221]Figure 8a shows the Kaplan–Meier curves for the most significantly
   differentiated cancer types, HNSC (P-value = 3.15e − 16) and KIRC
   (P-value = 3.06e − 15).

Figure 8.

   [222]Figure 8
   [223]Open in a new tab

   Survival analysis using the gene lists from the binary models. a The
   two most significant Kaplan-Meier curves that differentiate survival:
   HNSC (P-value: 3.15x10-16) and KIRC (P-value: 3.06x10-15). b The
   distribution of ROC AUCs when predicting 5-year survival for cancer
   types with sufficient survival data. Colour represents the three
   different variations of input variables to the survival models. c The
   best ROC curves for predicting 5-year survival of the cancer types with
   the highest average ROC AUC: KIRC and COAD

   Next, we explored whether the gene lists could predict survival on a
   held-out test-set. The performance differed between cancer types, as
   shown in [224]Fig. 8b (here we use age, stage, and gender as
   covariates). This shows that there is a broad distribution for some
   cancer types (such as KIRP), which could be due to low sample numbers
   (KIRP has the second-lowest number of samples). Three cancer types did
   not have enough data to converge—PRAD and THCA both had less than 15
   positive samples (events), and ESCA had the least number of samples.
   However, models for some cancer types could predict 5-year survival
   consistently well using only genes as input, such as KIRC, COAD, BLCA,
   HNSC, and UCEC. [225]Figure 8c shows the best ROC curves from the two
   cancer types with the highest average ROC AUC, KIRC and COAD. This
   shows their best ROC AUCs are 0.817 and 0.895, respectively.

Discussion

   The early detection of cancer is vital for enabling treatment options
   that lead to better prognosis. A fundamental requirement for this is to
   distinguish cancerous from non-cancerous tissue samples accurately.
   Here, we have utilized epigenetic changes in the DNA methylome and
   present binary and multiclass machine learning models to classify 13
   cancer types and corresponding normal tissues. Our approach achieved
   good test set performance for all XGBoost models, namely an average
   accuracy of 0.987 and 0.982 for the binary and multiclass models,
   respectively. We were then able to show that the PCCs selected by
   XGBoost can robustly classify cancer when fed into a multiclass deep
   neural network, namely EMethylNET (accuracy 0.976).

   The performance on most independent (non-TCGA) data sets was above a
   F[1] score of 0.8, and half of the independent data sets achieved an
   F[1] score of over 0.9. These independent data sets were more
   heterogeneous and reflected more realistic situations. Lastly, we
   demonstrated that multiclass PCCs do have biologically meaningful
   significance in cancer. Over-representation analysis revealed that the
   multiclass genes were enriched in processes which are linked to cancer
   hallmarks, and other cancer and methylation studies report similar Gene
   Ontology enrichment results [[226]76, [227]77]. Furthermore, a
   comprehensive text mining analysis of the literature demonstrates that
   cancer-associated methylation changes in 892 of our multiclass genes
   are supported by 7831 publications. We also showed that the multiclass
   genes set consists of 229 known tumour suppressors and oncogenes, 546
   transcriptional regulators and are involved in a wide range of
   cancer-related pathways and processes. Additionally, we showed that our
   gene lists contain many non-coding RNA genes, primarily consisting of
   lncRNAs. This is consistent with a growing body of research showing
   that lncRNAs and other non-coding RNAs play a key role in
   carcinogenesis [[228]78–80].

   There were two exceptions to the performance of our models, one of them
   being the independent COAD data set. As indicated in the Results, this
   low performance can be explained by all adenomas, labelled as normal,
   being predicted as cancer. Adenomas are dysplastic polyps which can
   progress via the adenoma–carcinoma sequence to invasive cancer.
   Therefore, it is common to remove colon adenomas when they are found to
   stop the possible progression into carcinomas [[229]81, [230]82] and so
   this behaviour was inadvertently useful. However, a larger sample size
   of adenomas would be needed to validate this. The other exception is
   the HNSC independent data set, which has the lowest performance. HNSC
   is very heterogeneous, in that it can arise from multiple different
   tissue sites, and the TCGA HNSC data reflects this. However, the
   independent HNSC data set only stems from one tissue of origin, the
   oropharynx, and only 1.55% of the TCGA data stems from the oropharynx.
   In addition, half of the independent HNSC data set is Human
   papillomavirus positive (HPV+), which is known to display different
   methylation patterns [[231]83, [232]84]. Thus, we were testing on HNSC
   cancer types with very little TCGA training data, which could explain
   the poor performance. In addition, the independent HNSC data were often
   misclassified as LUAD. The uniform manifold approximation and
   projection visualization in [233]Supplementary Fig. S8 illustrates that
   out of all TCGA classes, the independent HNSC data were the closest to
   LUAD. This could be due to a biological reason, such as the independent
   HNSC data are in fact metastases which originated in the lung, or this
   could be due to specific data generation or processing artefacts.

   We compared EMethylNET with related cancer classification studies and
   demonstrated similar or better performance against test set data. We
   also compare these related works with respect to the features selected
   by the models. The related works all utilized feature selection
   methods, such as the moderated t-statistic or differential methylation
   analysis, with multiple works using redundancy filters, for example the
   Maximum Relevance–Maximum Distance technique [[234]59], and many
   utilizing multiple feature selection methods in parallel [[235]59,
   [236]63, [237]67, [238]69, [239]85]. Thus, most of these approaches
   start from a highly filtered probe list, and some only use tens of
   probes in the final classification model (as detailed in [240]Table 1),
   consequently the models could potentially be biased by the feature
   selection methods used. In our approach, we did not perform a prior
   feature selection, but instead let the XGBoost classification model
   perform the feature selection itself, from an input set of around
   277,000 features. For the multiclass case, this resulted in a large set
   of PCCs, of size 3388, that provided us with an interpretable model and
   an explainable list of genomic loci for further analysis. Only a
   handful of the related works have performed feature analysis of the
   CpGs selected by the model. Ding et al. [[241]62] performed functional
   analysis of its 7 CpGs and Liu et al. [[242]85] found cancer-related
   genes near three out of its 12 CpGs. We provide an extensive analysis
   of our PCCs, encompassing over-representation analyses, extensive
   literature mining, and pathway enrichment visualizations. Exploring the
   pan-cancer methylome as a network ([243]Fig. 6 and [244]supplementary
   Fig. S6) enabled the identification of genes associated with several
   well-studied cancer-associated pathways, including well-known tumour
   suppressor and oncogenes present in the collection of our PCCs. These
   include those genes involved in cancer-associated pathways such as
   TP53, WNT, Notch, TGF beta/BMP, RAS, MAPK, PI3K-AKT and Hedgehog
   signalling as well as pathways impacting proliferation, survival and
   cell death including cell cycle regulators, mitotic checkpoint genes,
   mitochondrial metabolism, DNA damage responses and apoptosis. In
   addition, pathways involved in invasion and metastasis-associated
   processes such as the epithelial mesenchymal transition (EMT)-related
   genes, axonal guidance pathway, and those involved in adherence
   junctions and extracellular matrix interactions, Integrin signalling
   and angiogenesis were present. Furthermore, immune response regulators
   such as cytokine (IFN, interleukin, chemokine), TLR signalling, and
   interferon stimulated genes were also present. Finally, genes and
   pathways affecting global gene expression such as developmental
   regulators, chromatin remodellers, epigenetic regulators and
   transcription factors were detected. Investigating these genes in a
   cancer network context enabled their interactions and relationships to
   be identified. The pan-cancer methylome also demonstrates that in
   addition to mutations and genetic aberrations, epigenetic changes have
   wide-ranging impacts on carcinogenesis. To summarize, in comparison
   with related studies, we are the first to provide an in-depth feature
   analysis where the CpGs were selected freely by the model, with no
   prior feature selection adding potential bias to the feature analysis
   results.

   In conclusion, we demonstrated that XGBoost models are suitable for
   classifying a multitude of cancer types using only DNA methylation data
   as input. We additionally designed EMethylNET, a robust deep neural
   network that was able to generalize to most independent data sets. In
   addition, we find that mapping the PCCs to genes identifies genes that
   are enriched in functional properties and pathways linked to
   carcinogenesis. Depending on the availability of training data, this
   method can be extended to detect hundreds of cancer types. Future
   applications include extending this approach to DNA methylation data of
   cell-free DNA, with the eventual aim being early detection of multiple
   types of cancer from liquid biopsy approaches. Furthermore, a clear
   clinical application of this method is screening for specific cancer
   types or cancers of unknown origin, although the current models are not
   optimized for this purpose.

Supplementary Material

   bpae028_Supplementary_Data
   [245]bpae028_supplementary_data.zip^ (27.9MB, zip)

Acknowledgements