Abstract

Background

   Tumorigenesis is highly heterogeneous, and using clinicopathological
   signatures only is not enough to effectively distinguish clear cell
   renal cell carcinoma (ccRCC) and improve risk stratification of
   patients. DNA methylation (DNAm) with the stability and reversibility
   often occurs in the early stage of tumorigenesis. Disorders of
   transcription and metabolism are also an important molecular mechanisms
   of tumorigenesis. Therefore, it is necessary to identify effective
   biomarkers involved in tumorigenesis through multi-omics analysis, and
   these biomarkers also provide new potential therapeutic targets.

Method

   The discovery stage involved 160 pairs of ccRCC and matched normal
   tissues for investigation of DNAm and biomarkers as well as 318 cases
   of ccRCC including clinical signatures. Correlation analysis of
   epigenetic, transcriptomic and metabolomic data revealed the connection
   and discordance among multi-omics and the deregulated functional
   modules. Diagnostic or prognostic biomarkers were obtained by the
   correlation analysis, the Least Absolute Shrinkage and Selection
   Operator (LASSO) and the LASSO-Cox methods. Two classifiers were
   established based on random forest (RF) and LASSO-Cox algorithms in
   training datasets. Seven independent datasets were used to evaluate
   robustness and universality. The molecular biological function of
   biomarkers were investigated using DAVID and GeneMANIA.

Results

   Based on multi-omics analysis, the epigenetic measurements uniquely
   identified DNAm dysregulation of cellular mechanisms resulting in
   transcriptomic alterations, including cell proliferation, immune
   response and inflammation. Combination of the gene co-expression
   network and metabolic network identified 134 CpG sites (CpGs) as
   potential biomarkers. Based on the LASSO and RF algorithms, five CpGs
   were obtained to build a diagnostic classifierwith better
   classification performance (AUC > 99%). A eight-CpG-based prognostic
   classifier was obtained to improve risk stratification (hazard ratio
   (HR) > 4; log-rank test, p-value < 0.01). Based on independent datasets
   and seven additional cancers, the diagnostic and prognostic classifiers
   also had better robustness and stability. The molecular biological
   function of genes with abnormal methylation were significantly
   associated with glycolysis/gluconeogenesis and signal transduction.

Conclusion

   The present study provides a comprehensive analysis of ccRCC using
   multi-omics data. These findings indicated that multi-omics analysis
   could identify some novel epigenetic factors, which were the most
   important causes of advanced cancer and poor clinical prognosis.
   Diagnostic and prognostic biomarkers were identified, which provided a
   promising avenue to develop effective therapies for ccRCC.

   Keywords: Multi-omics, DNA methylation (DNAm), Clear cell renal cell
   carcinoma (ccRCC), Diagnostic biomarkers, Prognostic biomarkers

Introduction

   According to the latest global cancer statistics in 2018, renal cell
   carcinoma (RCC) is among the top ten most common malignancies in which
   the incidence and mortality accounts for more than 3% of all human
   malignancies ([26]Bray et al., 2018). Among all types of RCC, the major
   histological subtype is clear cell renal cell carcinoma (ccRCC), which
   has the worst prognosis and more complex heterogeneity, accounting for
   80–90% of all RCC cases ([27]Hsieh et al., 2017). To understand the
   potential molecular alterations that drive ccRCC oncogenesis, The
   Cancer Genome Atlas (TCGA) project based on single-omics analysis from
   different working groups has emphasized the importance of molecular
   characterization and histological assessment to stratify ccRCC
   ([28]Linehan & Ricketts, 2019; [29]Prakoura & Chatziantoniou, 2017;
   [30]Sigdel et al., 2019). These results indicate the complexity of
   ccRCC tumorigenesis and suggest that single-omics insufficient to fully
   investigate this cancer type to identify effective diagnostic or
   prognostic biomarkers. Furthermore, it is difficult for pathologists to
   distinguish ccRCC based on morphology and immunohistochemistry
   ([31]Bucur & Zhao, 2018). Therefore, it is necessary to identify
   potential diagnostic or prognostic molecular biomarkers for ccRCC
   ([32]Clark et al., 2020).

   In the early stage, the diagnosis and prognosis of ccRCC was mainly
   based on abdominal imaging and clinicopathologic signatures, and those
   strategies have some contingency with an approximate accuracy of 80%
   ([33]Lightfoot et al., 2000). The diagnostic and prognostic biomarkers
   were also identified at different omics levels, including microRNA
   ([34]Guan et al., 2018), long noncoding RNA ([35]Qu et al., 2018), mRNA
   ([36]Chen et al., 2019; [37]Wang et al., 2017) and proteins ([38]Lopez
   et al., 2016). However, those biomarkers have some obvious drawbacks,
   including unstable molecular structure, the fact that they are easily
   affected by external factors, lower abundance and difficulty in being
   detected, thus limiting their clinical application. The epigenetic
   changes of cancer cells have been proved to be one of the important
   mechanisms of tumorigenesis, in which the abnormal methylation of gene
   promoter regions, especially CpG island regions, is also known to be an
   important factors that caused gene suppression ([39]Bibikova et al.,
   2011). The TCGA project and other studies have shown that DNAm can be
   used as the cancer-specific characteristic for the development of
   diagnostic or prognostic biomarkers and targeted therapy ([40]Linehan &
   Ricketts, 2019). For example, [41]Skrypkina et al. (2016) reported that
   the sensitivity of a model based on an individual gene ranges from
   51.9% to 62.9%. At the same time, the use of a combination of two or
   three genes led to a significant increase in sensitivity (77.9–92.3%)
   and specificity (86.7–93.3%). However, the researches were only based
   on the epigenetic level, and it was not sufficient to assess the effect
   of aberrant methylation status on cancer progression. Based on
   high-throughput technology, the previous studies shown that many
   significantly different epigenetic factors did not always cause
   abnormality in gene expression. The abnormal gene expression is also
   one of the important factors that cause cancer or disease ([42]Cui et
   al., 2016; [43]Linehan & Ricketts, 2019; [44]Zhao et al., 2020).
   Meanwhile, [45]Xu et al. (2020) developed a seven-CpG-based classifier
   for ccRCC prognosis by integrating DNA methylation (DNAm) and gene
   expression. However, owing to significant reprograming of metabolic
   pathways, the integration analysis based on DNAm and gene expression
   may be not detailed enough to identify aberrant patterns of expression
   changes in many biological pathways ([46]Furuta et al., 2010). As is
   known to all, the metabolic reprograming of cancer cells significantly
   influences some important physiological and biochemical reactions,
   including cell invasion, adhesion and immune response. Although not the
   case for all cancers, RCC has been regarded as a cancer driven by
   metabolic disorders due to the abnormal expression of genes that
   regulate various metabolic pathways, such as FH and SDHB in the Krebs
   cycle ([47]Linehan, Srinivasan & Schmidt, 2010). Hence, the analysis of
   the expression patterns of enzyme-encoding genes in metabolic pathways
   was added to our research, which will complement our understanding of
   the pathogenesis and aid in providing new potential biomarkers or
   targets involved in tumorigenesis.

   In this study, the multi-omics analysis of epigenetic measurements
   uniquely identified DNAm dysregulation of cellular mechanisms resulting
   in transcriptomic alterations, including cell proliferation, immune
   response and inflammation. Combination of the gene co-expression
   network and metabolic network, a five-CpG-based classifier with better
   classification performance was obtained using Least Absolute Shrinkage
   and Selection Operator (LASSO) and RF algorithm. An eight-CpG-based
   prognostic classifier was developed to identify the high-risk groups
   more accurately (hazard ratio (HR) > 4; log-rank test, p-value < 0.01).
   Based on independent datasets and seven additional cancers, the
   diagnostic and prognostic classifiers also had better robustness and
   stability. The molecular function analysis of biomarkers with abnormal
   methylation indicated that they were significantly related to
   glycolysis/gluconeogenesis and signal transduction. Thus, our studies
   not only identified biomarkers involved in important metabolic pathways
   in tumorigenesis, but also provided novel therapeutic targets for
   treating advanced RCC in a general manner.

Materials and Methods

Workflow chart and samples preparation

   [48]Figure 1 shows the study workflow chart. In this study, DNAm (290
   tumor and 160 normal), mRNA (526 tumor and 72 normal) and clinical data
   (318 cases of ccRCC) were collected from TCGA
   ([49]https://portal.gdc.cancer.gov/). Of note, 160 cases of ccRCC had
   both ccRCC tissues and matched adjacent normal tissues in terms of DNAm
   profiles. In addition, 24 cases of ccRCC, which had both ccRCC tissues
   and matched adjacent normal tissues in terms of mRNA profiles, had both
   DNAm and mRNA expression profiles. The human KEGG Markup Language
   (KGML) files were obtained from the KEGG database
   ([50]http://www.genome.jp/kegg/), and these files were used to predict
   key enzyme-coding genes. In addition, the DNAm status for the following
   five additional cancers were derived from TCGA, including both DNAm and
   gene expression profiles: BRCA (774 tumor and 82 normal), COAD (292
   tumor and 38 normal), LIHC (380 tumor and 50 normal), LUAD (455 tumor
   and 32 normal) and PRAD (502 tumor, 50 normal). Moreover, the following
   two cancers were obtained from TCGA, including only DNAm or gene
   expression profiles: KIRP (274 tumor and 45 normal) and UCEC (425 tumor
   and 34 normal).

Figure 1. Study flowchart of data generation and analysis.

   [51]Figure 1
   [52]Open in a new tab

   DMC, differentially methylated CpG sites; DNAm, DNA methylation.
   Integrated methylation signatures on ccRCC and non-tumor tissues were
   used to identify 134 candidate biomarkers. Diagnostic biomarker
   selection: LASSO was applied to a training cohort to identify a final
   selection of five biomarkers. These five markers were applied to a
   validation cohort. Prognostic biomarker selection: univariant-cox and
   LASSO-Cox were applied to a training cohort with survival data to
   identify a final selection of eight biomarkers. These eight biomarkers
   were applied to a validation cohort with survival data.

Differential analysis of mRNA expression and DNAm

   Based on DNAm profiles, probes with more than 50% missing values were
   deleted. The “impute.knn” function (R version 3.5.0) was used to impute
   the remaining missing values. CpG probes in chromosomes X and Y were
   excluded, and CpG probes identified as non-unique in the genome as well
   as CpG probes known as single-nucleotide polymorphisms were also
   excluded ([53]Noushmehr et al., 2010; [54]Selamat et al., 2012).
   Differentially methylated CpG sites (CpGs) were detected using the
   non-parametric testing method ([55]Selamat et al., 2012), and p-values
   were adjusted using the false discovery rate (FDR) method in R (R
   version 3.5.0; p.adjust (method = “BH”)). CpG probes were filtered as
   described below, and these filtered CpG probes were considered to be
   statistically significantly differentially methylated. FDR-values were
   less than 0.05, and the absolute values of the mean β-value difference
   between ccRCC and non-ccRCC tissues were greater than 0.2 ([56]Selamat
   et al., 2012). In addition, the differentially expressed genes were
   detected based on “limma” function in Bioconductor 3.10. The
   significantly differentially expressed genes were identified using the
   following cutoffs: FDR-value < 0.05 and |log2 fold change| > 1.

   Finally, we investigated the correlation between DNAm and mRNA
   expression. Pearson correlation coefficients for each gene or CpG were
   calculated between mRNA expression and DNAm using the Spearman’s rank
   correlation coefficient. The correlation was considered statistically
   significant when the absolute value of Spearman’s Rho was greater than
   0.2 and the FDR-values were less than 0.05 ([57]Cancer Genome Atlas
   Research, 2011; [58]Cheng et al., 2018).

Screening key enzyme-coding genes by met-express

   In previous study, the gene co-expression and genome-scale metabolic
   network were integrated to identify the key enzyme-coding genes in
   human diseases. Here, we briefly describe the algorithm ([59]Chen et
   al., 2013). (1) Based on transcriptome data, the Pearson correlation
   coefficients (PCC) were calculated for all pairs of enzyme-coding
   genes. The top three associated genes with the highest PCC for each
   gene were used to construct gene co-expression network. Qcut ([60]Ruan
   & Zhang, 2008) was used to divide the network into gene co-expression
   modules, and we only remain some modules including more than 10 genes.
   (2) We carried out the reconfiguration of metabolic network based on
   KGML files ([61]http://www.genome.jp/kegg/) containing the information
   of critical metabolic reactions, key enzyme-coding genes and
   corresponding metabolites. (3) The importance score was calculated for
   each gene by integrating with gene cox-expression network and
   reconstructed metabolic network. Then, the median importance score was
   used for a threshold to select the candidate key enzyme-coding genes.

Identification of candidate tissue-specific diagnostic biomarkers

   For 24 patients who have both mRNA expression and DNAm profiles,
   first-level feature selection was performed though the following steps.
   First, DNAm profiles were obtained by Wilcoxon rank-sum test to select
   differentially methylated CpGs (FDR < 0.05, |Δβ[mean]| > 0.2); mRNA
   expression profiles were obtained by “limma” function (R version 3.5.0)
   to select differentially expressed genes (FDR < 0.05, |log2 fold
   change| > 1). Second, the Rho was calculated between DNAm and mRNA
   expression by Spearman’s rank correlation coefficient to select genes
   with higher correlation coefficient (FDR < 0.05, |Rho| > 0.2). For
   genes with multiple CpGs measuring Rho, we selected the CpG with the
   highest absolute value of Rho. Third, the key enzyme-coding genes in
   cancers were selected by Met-Express ([62]Chen et al., 2013). Finally,
   CpGs obtained from the above three steps were intersected to obtain the
   candidate CpGs, namely the candidate tissue-specific diagnostic
   biomarkers for first-level feature selection.

   To build the effective diagnostic model, it was essential to identify a
   small set of biomarkers. Thus, some redundant biomarkers were excluded
   to avoid over-fitting and achieve the best prediction performance
   ([63]Tang et al., 2018). After the first-level feature selection, the
   number of those CpGs was still large and redundant. Therefore, we
   performed a second-level feature selection by the LASSO method
   ([64]Tibshirani, 1996) to further remove redundant CpGs and identify
   tissue-specific diagnostic biomarkers.

Evaluation and modeling of candidate tissue-specific diagnostic biomarkers

   In this study, the random forest (RF) classifier was built based on the
   tissue-specific diagnostic biomarkers to evaluate performance of those
   biomarkers. Apart from 24 patients for the variable selection, 136
   patients who have both tumor and adjacent normal tissue were randomly
   allocated to the training and testing sets. The training set containing
   90 patients was used to optimize the parameters of classifiers and
   build classifiers. The testing set containing 46 patients was used to
   evaluate performance of the classifiers. Ten-fold cross-validation was
   used during the training and testing of classifiers, and the
   performance of classifier was evaluated based on area under the ROC
   Curve (AUC). Meanwhile, we also introduced the sensitivity and
   specificity to assist in evaluating the performance of the diagnostic
   classifier.

Validation based on independent data sets and literatures

   In this study, the Illumina HumanMethylation450 BeadChip data was
   comprehensive queried by using keywords focused on “renal clear cell
   carcinoma”, “ccRCC” and “methylation”. In the process of data
   retrieval, the following conditions should be met as much as possible
   which included the initial experimental research providing a comparison
   of ccRCC and non-ccRCC tissues, the patients without receive
   neoadjuvant therapy. We obtained two independent datasets, including
   [65]GSE70303 ([66]Becket et al., 2016) (46 tumor and non-tumor tissues
   from GEO database) and [67]E-MTAB-2007 ([68]Sato et al., 2013) (106
   tumor and non-tumor tissues from ArrayExpress database), to perform
   verification of candidate tissue-specific biomarkers.

Construction of prognostic model

   Apart from 24 patients for the variable selection, we split the rest of
   ccRCC patients in a training dataset and a validation dataset and
   explored to build a predictive model for prognostic and survival
   analysis. We used 192 cases for training and 98 cases for validation.
   We applied a sequential model-based variable selection to screen
   markers for predicting survival outcome ([69]Ding, Chen & Shi, 2019;
   [70]Xu et al., 2017). Based on the candidate biomarkers, we first
   fitted a univariate Cox proportional hazards model by using each marker
   as the covariate. A marker with p-value < 0.05 from the Wald statistic
   was retained in the dataset. Second, we used a similar strategy in a
   prognosis marker selecting process based on the LASSO-Cox method to
   decrease the marker numbers to a reasonable range (less than events).
   The above analysis generated final markers with non-zero coefficient to
   construct a prognostic signature using R package “glmnet”. Based on the
   penalized Cox regression model, we obtained a combined prognostic score
   (designated as risk score) for each individual. The risk-score for
   patients is calculated as follows:
   [MATH: <mrow><mi mathvariant="normal">R</mi><mi
   mathvariant="normal">i</mi><mi mathvariant="normal">s</mi><mi
   mathvariant="normal">k</mi><mspace width="thickmathspace"></mspace><mi
   mathvariant="normal">S</mi><mi mathvariant="normal">c</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi
   mathvariant="normal">e</mi><mspace width="thickmathspace"></mspace><mi
   mathvariant="normal">f</mi><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">r</mi><mspace width="thickmathspace"></mspace><mi
   mathvariant="normal">p</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">t</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">n</mi><mi
   mathvariant="normal">t</mi><mi
   mathvariant="normal">s</mi><mo>=</mo></mrow><msubsup><mrow><mo
   movablelimits="false">∑</mo></mrow><mrow><mrow><mi
   mathvariant="normal">i</mi><mo>=</mo><mn>1</mn></mrow></mrow><mrow><mi
   mathvariant="normal">N</mi></mrow></msubsup><mo>⁡</mo><mrow><mo
   stretchy="false">(</mo><mi mathvariant="normal">c</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">f</mi><mi
   mathvariant="normal">f</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">c</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">n</mi><mi
   mathvariant="normal">t</mi><mi mathvariant="normal">s</mi><mspace
   width="thickmathspace"></mspace><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">f</mi><mspace width="thickmathspace"></mspace><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">c</mi><mi mathvariant="normal">h</mi><mspace
   width="thickmathspace"></mspace><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">p</mi><mi mathvariant="normal">G</mi><mi
   mathvariant="normal">s</mi><mspace
   width="thickmathspace"></mspace><mo>×</mo><mspace
   width="thickmathspace"></mspace><mi>β</mi><mo>−</mo><mi
   mathvariant="normal">v</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">l</mi><mi mathvariant="normal">u</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">s</mi><mspace
   width="thickmathspace"></mspace><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">f</mi><mspace width="thickmathspace"></mspace><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">c</mi><mi mathvariant="normal">h</mi><mspace
   width="thickmathspace"></mspace><mspace
   width="thickmathspace"></mspace><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">p</mi><mi mathvariant="normal">G</mi><mi
   mathvariant="normal">s</mi><mo stretchy="false">)</mo></mrow> :MATH]

   To validate the predictive classifier, we calculated a risk-score for
   each patient in the validation dataset using the coefficient estimated
   from the training dataset. By grouping the risk-score based on its
   median, we formed high and low risk-score groups with roughly equal
   number of observations. The Kaplan–Meier estimator and log-rank test
   were performed to test whether the median survival time for two groups
   was significantly different. All analyses were conducted in R version
   3.5.0 with the “glmnet” and “survival” packages.

Functional enrichment analysis based on gene ontology

   The CpG and gene annotations were downloaded from TCGA
   ([71]http://tcga-data.nci.nih.gov/tcga/tcgaPlatformDesign.jsp) and
   GECODE
   ([72]https://www.encodeproject.org/files/gencode.v22.annotation/).
   Genes with abnormal methylation were analyzed for Gene Ontology (GO)
   analysis or Kyoto Encyclopedia of Genes (KEGG) pathway enrichment
   analysis by using the web version of DAVID ([73]Da Huang, Sherman &
   Lempicki, 2009). GO/KEGG-terms have an adjusted p-values which were
   less than 0.05. The genes were uploaded to the online database
   GeneMANIA ([74]http://genemania.org; version 3.6) ([75]Warde-Farley et
   al., 2010) to explore their interactions at the protein level with an
   interaction score > 0.4 as the cutoff value. Afterward, the network was
   visualized in software Cytoscape (version 3.6.1).

Statistical analysis

   Based on Euclidean distance matrix and complete-linkage method,
   unsupervised hierarchical clustering was carried out to verify the
   reliability of the analysis method about CpGs in R version 3.5.0 with
   the “heatmap” and “dist” functions. To discuss the widespread nature of
   cancer-specific methylation alterations and the impact of aberrant
   methylation on gene expression, the Chi-square test and Wilcoxon
   rank-sum test were used in R version 3.5.0 with the “chisq.test” and
   “wilcox.text” functions. When p-values were less than 0.05, it was
   deemed to be statistically significant. The diagnostic models were
   built in scikit-learn framework (version 0.20.3; Python 3.6..8) with
   the “sklearn.ensemble.RandomForestClassifier” algorithm. The parameters
   of models was tuned in scikit-learn framework (version 0.20.3; Python
   3.6.8) with the “sklearn.grid_search.GridSearchCV” algorithm. The
   hyper-parameters included the number of trees (10 to 100), the
   criterion (“gini” and “entropy”), “oob_score” in the RF. The default
   values were used when other parameters were not provided in functions
   or classifier.

Results

   In our study, 160 ccRCC and matched normal tissues were obtained
   through the TCGA projects. Among 160 patients, the mean age was 63
   years, and the ratios for the sex (female to male) and cancer staging
   (stage I–II to stage III–IV) were 1:2 and 7:9, respectively. Although
   smoking, obesity and hypertension could affect the risk of ccRCC, TCGA
   projects did not provide the clinical signatures (NULL). The study
   flowchart is shown in [76]Fig. 1. The patients’ characteristics are
   presented in [77]Table S1.

Identification and landscape of differentially methylation in ccRCC

   Based on the differentially methylated CpGs, the exploratory
   hierarchical clustering was first performed ([78]Fig. S1). The
   methylation profiles of ccRCC and non-ccRCC resulted in separate
   groups, suggesting a potential difference. In our study, we obtained
   4367 CpGs (corresponding to 2,490 genes) that were significantly
   hypermethylated in ccRCC and 5,031 CpGs (corresponding to 3,551 genes)
   that were significantly hypomethylated ([79]Fig. 2A). Some of the most
   hypermethylated loci contained HOXA5, involving 31 CpGs in the CGI
   (29/31) or N_Shore (2/31) and coding for a DNA-binding transcription
   factor, which regulated gene expressions, cell differentiation and
   morphogenesis ([80]Shenoy et al., 2015). Some of the most
   hypomethylated loci contained CXCL8, which promoted angiogenesis and
   metastasis ([81]Shenoy et al., 2015). To investigate the biological
   functions of genes with abnormal methylation, the target genes was
   conducted by DAVID ([82]Da Huang, Sherman & Lempicki, 2009). Most genes
   were significantly enriched in the important biological processes,
   including cell adhesion, signal transduction and positive regulation of
   transcription (Benjamini–Hochberg (BH) adjusted p-value < 0.05)
   ([83]Table 1), indicating that these genes played an important role in
   tumorigenesis.

Figure 2. Identification of DNA methylation difference between ccRCC and
matched adjacent non-ccRCC samples.

   [84]Figure 2
   [85]Open in a new tab

   (A) Volcano plot of the differential DNA methylation analysis (X-axis).
   Mean β-value difference (mean ccRCC–mean non-ccRCC); (Y-axis) Q-values
   for each CpG sites (−1 × log[10] scale). Blue points represent
   hypomethylated CpG sites; red points represent hypermethylated. (B)
   Proportions of CpG sites on CGIs and non-CGIs. Red bar represents
   hypermethylated CpG sites; blue bar represents hypomethylated CpG
   sites; gray bar represents unmethylated CpG sites. p Values were
   computed with χ^2 test. (C) Normalized histogram of CpG sites with
   respect to TSSs distance. Red line represents hypermethylated CpG
   sites; blue line represents hypomethylated CpG sites; black line
   represents background distribution of all promoter CpG sites. Red and
   blue arrows represent significantly difference between the
   characteristics of hyper-and hypomethylated CpG sites in promoters.
   Up-and downstream distances from TSSs are represented by positive and
   negative values, respectively. p Values were computed with the Wilcoxon
   rank sum test. (D) Box plot of TSS absolute distance for CpG sites. Red
   box represents the hypermethylated promoter CpG sites were located in
   CGI regions; blue box represents the hypomethylated promoter CpG sites
   were located in CGI regions.

Table 1. Top 10 Gene Ontology (GO) biological process (BP) based on genes
with aberrant methylation.

   GO Term Count FDR
   [86]GO:0007155 Cell adhesion 198 6.87E−16
   [87]GO:0007165 Signal transduction 391 1.89E−10
   [88]GO:0045944 Positive regulation of transcription from RNA polymerase
   II promoter 339 2.41E−10
   [89]GO:0043547 Positive regulation of GTPase activity 210 1.12E−08
   [90]GO:0007156 Homophilic cell adhesion via plasma membrane adhesion
   molecules 79 1.36E−08
   [91]GO:0001525 Angiogenesis 96 2.43E−06
   [92]GO:0007399 Nervous system development 116 3.93E−06
   [93]GO:0035556 Intracellular signal transduction 149 2.50E−05
   [94]GO:0000122 Negative regulation of transcription from RNA polymerase
   II promoter 241 2.99E−05
   [95]GO:0060333 Interferon-gamma-mediated signaling pathway 38 7.91E−04
   [96]Open in a new tab

   For the significantly differentially methylated CpGs, we also analyzed
   whether they are related with CGIs and whether they are located in the
   promoter or gene body ([97]Fig. S2). We identified 9398 CpGs defined as
   being differentially hyper- or hypomethylated. Moreover, 33.33% and
   31.96% of CpGs located in promoter regions, which was defined as 1.5 kb
   up-or downstream from the transcription start site (TSS), were
   significantly hyper-or hypomethylated, respectively. We found no
   statistically significant difference between the degree of methylation
   and whether the CpGs were located in the gene promoter regions (p-value
   = 0.17, Fisher’s exact test). However, 31.87% and 3.79% of CpGs located
   in CGIs were significantly hyper-or hypomethylated (p-value < 2.2 ×
   10^−16, Fisher’s exact test), respectively. These results showed a
   significant difference between the methylation characteristics of CGIs
   and non-CGIs for differentially methylated CpGs. Based on CGI regions,
   the significantly hypermethylated CpGs were highly enriched in CGIs
   (p-value < 2.2 × 10^−16, χ^2 test), whereas hypomethylated CpGs were
   enriched in non-CGIs (p-value < 2.2 × 10^−16, χ^2 test) ([98]Fig. 2B).
   Compared to hypomethylated CpGs, hypermethylated CpGs were located
   closer to TSSs (p-value = 2.37 × 10^−9, Wilcoxon rank sum test)
   ([99]Fig. 2C). Compared to CGIs containing hypomethylated CpGs, CGIs
   containing hypermethylated CpGs were located closer to TSSs (p-value =
   2.6 × 10^−6, Wilcoxon rank sum test) ([100]Fig. 2D).

   In conclusion, the hypermethylated CpGs were preferentially enriched in
   CGIs, but CpGs exhibiting abnormal hypomethylation in tumors were
   frequently occurred in non-CGI regions. Furthermore, the CGI and
   promoter regions containing hypermethylated CpGs were located closer to
   TSSs.

Impact of the cancer-associated differentially methylation no gene expression

   To identify the tumor-associated methylation alterations with
   concomitant changes at the transcriptional levels, the correlation
   analysis of the methylation and transcriptome data were performed. By
   focusing on CpGs and their nearest neighbor genes (called
   “cis-interactions”) that fall within 1,500 bp upstream of TSS to the
   end of gene body, genes with abnormal methylation were studied
   ([101]Fig. S3). By comparing 1,154 significantly differentially
   methylated CpGs located in 316 associated genes, 998 (86.48%, 998 of
   1,154 CpGs) CpGs with the significant correlations between DNAm and
   gene expression were identified.

   In accordance with DNAm mechanism to silence or enhance local
   transcription, 68% of CpGs identified as significantly correlated were
   negative (681of 998 CpGs) ([102]Figs. 3A and [103]3B). A portion of the
   CpGs were primarily in gene promoter regions (68.14%, 464 of 681 CpGs).
   Of these, 336 CpGs (72.41%, 336 of 464 CpGs) were statistically
   significantly hypermethylated and inhibited gene expression, while 128
   CpGs (27.59%, 128 of 464 CpGs) were hypomethylated and caused
   over-expression of genes, indicating that abnormal DNAm might have
   functional consequence in approximately 40% (464 of 1,154 CpGs) of CpGs
   associated with genes. The hypermethylated and under-expressed genes
   included Secreted Fzd-related protein 1 (SFRP1), which was consistent
   with previous ccRCC studies ([104]Awakura et al., 2008; [105]Gumz et
   al., 2007), and PIC3K2G, which was a known cancer susceptibility gene
   ([106]Brown et al., 2015) ([107]Table S2). The WNT antagonist, SFRP1,
   was also a tumor suppressor that regulates cell proliferation and blood
   vessel formation ([108]Atschekzei et al., 2012). CGI methylation
   located in SFRP1 has also been shown to be indirectly associated with
   tumor progression and tumor stage ([109]Urakami et al., 2006) as well
   as to directly lead to the occurrence of poor prognosis in RCC patients
   ([110]Morris et al., 2010; [111]Saini et al., 2009). PIK3C2G encoded a
   phosphoinositide three kinase (PI3K) subtype, which was involved in the
   regulation of signaling pathways, such as cell proliferation, oncogenic
   transformation, cell survival and cell migration ([112]Brown et al.,
   2015; [113]Freitag et al., 2015). Conversely, the hypomethylated and
   over-expressed genes included single transduction-related genes
   (GUCA2B) as well as functional genes involved in cell differentiation
   (PPDPFL), PPAR signaling pathway (FABP6), HIF-1-alpha transcription
   factor network (CA9), and receptor-associated protein activity (SLC6A3)
   ([114]Table S2). The molecular biological functions uncovered that
   hypermethylated/under-expressed genes were associated with the immune
   response, inflammatory response and cell adhesion ([115]Table S3).
   However, hypomethylated/over-expressed genes were associated with cell
   or tissue development, intracellular transport and cellular secretion
   ([116]Table S3).

Figure 3. Impact of abnormal methylation levels on gene expression in ccRCC.

   [117]Figure 3
   [118]Open in a new tab

   (A) Example of a gene (SFRP1) showing a negative gene expression-DNA
   methylation relationship. Blue, ccRCC tumors; red, matched adjacent
   normal tissues. (B) Starburst plot of gene expression and DNA
   methylation differences in ccRCC and matched adjacent normal tissues.
   Only CpG sites (points) demonstrating significant DNA methylation-gene
   expression correlations are shown. X axis, differential DNA methylation
   levels between ccRCC and matched adjacent normal tissues. Y axis,
   differential gene expression levels between ccRCC and matched adjacent
   normal tissues. (C, E) Relationships of positively correlated CpG sites
   to TSSs. (C) Normalized histogram of positively correlated CpG sites
   hypermethylated and over-expressed (blue) compared to conventional
   negatively correlated CpG sites (red). (E) Histogram of positively
   correlated CpG sites hypomethylated and under-expressed (purple)
   compared to conventional negatively correlated CpG sites (green). (D)
   Bar graphs exhibiting ratios of gene body (black) and promoter (gray)
   CpG frequencies within negatively and positively correlated CpG sites
   to all gene body and promoter CpG site frequencies. p Values were
   computed with Chi-square test. (F, I) (Left) Schematic representation
   of genes including significantly positively correlated gene body CpG
   sites for FBXO2 (F) and RUNX3 (I). (G, J) Box plots comparing gene
   expression levels associated with positively correlated gene body CpG
   sites in ccRCC and matched adjacent normal tissue. (H, K) Box plots
   comparing gene body methylation levels of positively correlated gene
   body CpG sites in ccRCC and matched adjacent normal tissue.

   In our investigation, a certain proportion (31.76%, 317 of 998 CpGs) of
   positive correlation was observed ([119]Fig. 3B). Of which a lower
   percentage (10.73%, 34 of 317 CpGs) of the positively correlated CpGs
   was located in promoter regions, but our analysis indicated that these
   CpGs might be related to atypical TSS distances compared to negatively
   correlated CpGs. Specifically, the hypermethylated and over-expressed
   CpGs tended to lie closer to TSSs (p-value = 1.4 × 10^−5, Wilcoxon rank
   sum test), whereas the hypomethylated and under-expressed CpGs were
   located further upstream of TSSs compared to their cognate negatively
   correlated (hypomethylated and over-expressed) CpGs (p-value = 1.4 ×
   10^−2, Wilcoxon rank sum test) ([120]Figs. 3C and [121]3E). The
   molecular biological functions revealed that the hypermethylated CpGs
   with positive correlation were mainly enriched in immune processes and
   cell motility, whereas the hypomethylated CpGs with positive
   correlation were enriched in cell proliferation and development.
   Furthermore, CpGs that appeared to be positively correlated were
   inclined to be located in gene bodies (p-value < 2.2 × 10^−16, χ^2
   test) ([122]Fig. 3D). In the gene bodies, the hypermethylated and
   over-expressed genes included RUNX3, and hypomethylated and
   under-expressed genes included FBXO2 ([123]Figs. 3F–[124]3K).
   Intriguingly, several genes showing positive correlation in gene bodies
   have been previously linked to cancer, including MUC15 ([125]Huang et
   al., 2009), HEPACAM2 ([126]Klopfleisch et al., 2010), CA10 ([127]Romeo
   et al., 2009), NRG1 ([128]Huang et al., 2004) and RAB25 ([129]Mitra,
   Cheng & Mills, 2012). In addition, ~26% of genes that exhibited
   positive correlation and were located in gene bodies also exhibited
   negative correlations in promoters ([130]Table S4), including RUNX3 and
   TMEM30B ([131]Fig. S4). Based on four independent datasets including
   two methylation datasets ([132]GSE70303 and [133]GSE61441) and two gene
   expression profiles ([134]GSE40435 and [135]GSE76351), we also found
   the same events ([136]Table S5). These results suggested that
   epigenetic alterations had a dual effect on gene expression, that is,
   genes with hypermethylation in the promoter and hypomethylation in the
   gene body were suppressed or genes with hypomethylation in the promoter
   and hypermethylation in the gene body were activated. The occurrence of
   the “tandem control” mechanisms indicated that methylation alterations
   located in the promoter might interact with methylation status in the
   gene body, thereby regulating the expression of cancer-associated
   genes.

Identification of tissue-specific biomarkers based on multi-omics integration

   To obtain cancer tissue-specific CpGs that would distinguish tumor from
   non-tumor tissues, the correlation analysis was used for the selection
   of highly tissue-specific biomarkers using 24 tumor samples with both
   DNAm and matched gene expression profiles ([137]Fig. 1). Based on
   correlation analysis, 134 CpGs associated with 54 genes were obtained
   by integrating gene expression, DNAm, enzyme gene networks and gene
   co-expression networks. For genes with multiple CpGs, the CpGs with the
   highest correlation between DNAm and gene expression were selected
   ([138]Table S6). The optimal number of features was determined by the
   automated searching algorithm of LASSO. Classifiers with higher
   complexity would contain redundant features with no corresponding
   improvement in performance, thereby resulting in poor generalization
   ability. Eventually, the five ccRCC-specific CpGs ([139]Table 2) were
   obtained using the LASSO algorithm, and they were significantly
   differentially methylated between the tumor and matched adjacent normal
   tissues (p-value < 0.001; Wilcoxon rank sum test) ([140]Fig. 4A).
   Unsupervised hierarchical clustering was further performed in the
   independent datasets, and the similarity matrix was constructed using
   Pearson’s correlation coefficient ([141]Fig. 4B). Tumor tissues were
   well discriminated from the matched adjacent normal tissues in
   independent datasets, and an accuracy of 100% was obtained. The results
   showed that five CpGs had better robustness and the potential to
   distinguish cancer.

Table 2. Characteristics of five DNA methylation-based diagnostic biomarkers
in ccRCC diagnosis.

   Gene symbol CpGs Differential expression[142]^a
   (log2FC) Differential methylation[143]^b level Correlation[144]^c
   Distance from
   TSS (bp) CpG island
   NAT8 cg19565262 2.73 −0.49 −0.46 −445 .
   ALOX5 cg07355189 1.64 0.56 0.74 3,841 Island
   PDE1A cg00470341 −3.43 −0.53 0.74 53,972 .
   PDE8B cg12559197 −1.10 −0.55 0.70 1,48,031 .
   PFKP cg15087907 1.62 0.47 0.51 13,608 .
   [145]Open in a new tab

   Notes:
   ^a

   The log2-transformed fold changes between the tumor and benign-adjacent
   tissues (>0 means over-expression in tumor tissues) in ccRCCs.
   ^b

   The differential methylation levels between the tumor and
   benign-adjacent tissues (>0 means hypermethylation in tumor tissues) in
   ccRCCs.
   ^c

   The Spearman’s rank correlations between gene expression and probe
   methylation levels in ccRCCs.

Figure 4. Identification of the DNA methylation-based diagnostic biomarkers.

   [146]Figure 4
   [147]Open in a new tab

   (A) The methylation values and standard deviations of five diagnostic
   biomarkers were from the benign-adjacent and patient cancer tissues.
   The p-value was calculated through Wilcoxon rank sum test, and “****”
   means p-value was less than 0.001. (B) Unsupervised hierarchical
   clustering of five methylation biomarkers selected for use in the
   diagnostic prediction model in the independent dataset, in which the
   metric of similarity was Pearson’s correlation based on the methylation
   levels.

Functional enrichment and biological network analysis

   To further elucidate the function of the above five CpGs significantly
   associated with gene expression, the molecular biological function
   analysis were performed by DAVID ([148]Da Huang, Sherman & Lempicki,
   2009) and GeneMANIA 3.6 ([149]Warde-Farley et al., 2010). As shown in
   [150]Fig. 5A ([151]Table S7), the genes associated with five CpGs were
   mainly enriched in Morphine addiction, Purine metabolism, Pentose
   phosphate pathway, Galactose metabolism and Fructose and mannose
   metabolism. In general, the reprograming of cellular metabolism is a
   major hallmark of cancer cells, which is a direct and indirect
   consequence of carcinogenic factors. The common characteristic of
   cancer cell metabolism is the ability to access adequate nutrients from
   its surrounding environment and make good use of these nutrients to
   maintain viability or proliferation ([152]Fouad & Aanei, 2017;
   [153]Pavlova & Thompson, 2016). Two major nutrients utilized for
   survival and biosynthesis in cancer cells are glucose and glutamine.
   Interestingly, PFKP and NAT8 were significantly associated with
   Glycolysis/Gluconeogenesis, Fructose and mannose metabolism, Galactose
   metabolism and Glutathione metabolism (p-value < 0.05, Fisher’s exact
   test). In addition, PDE1A and PDE8B were significantly enriched in
   Purine metabolism and G protein-coupled receptor signaling pathway
   (p-value < 0.05, Fisher’s exact test). Apart from playing an
   indispensable role in synthesizing DNA and RNA, purine metabolites
   supply some cancer cells with necessary nutrients and cofactors for
   survival or proliferation ([154]Pedley & Benkovic, 2017; [155]Yin et
   al., 2018) and they are also being targeted for the treatment of
   cancers ([156]Fridley et al., 2011). Moreover, the majority of
   biological responses in both normal and cancer cells are regulated via
   multiple signaling pathways by G protein-coupled receptors (GPCRs)
   ([157]Ristic, Bhutia & Ganapathy, 2017). Some research has suggested
   that GPCRs played an important role in purinosome assembly/disassembly
   and regulated metabolic flux through de novo purine biosynthesis in
   cells ([158]Fang et al., 2013; [159]Verrier et al., 2011). Furthermore,
   the ALOX5 gene was associated with several physiological and
   pathological processes, including inflammation, oxidative response and
   carcinogenesis ([160]Tomita et al., 2019). Protein–protein interaction
   networks from GeneMANIA 3.6 suggested a more complex interaction for
   PFKP, PDE1A and ALOX5 ([161]Fig. 5B). Most of these genes, including
   SREBF1 ([162]Audet-Walsh et al., 2018), CALM1 ([163]Fortney et al.,
   2015), PCBP1 ([164]Zhang et al., 2015), COTL1 ([165]Guo et al., 2017)
   and BTK ([166]Yue et al., 2017), are tumor-related genes, which have
   been reported to be associated with tumorigenesis, progression and
   therapy.

Figure 5. Functional enrichment analysis and protein–protein interaction
network by DNA methylation-based diagnostic biomarkers.

   [167]Figure 5
   [168]Open in a new tab

   (A) Functional enrichment analysis was performed through GO and KEGG.
   p-Value was adjusted by Benjamini–Hochberg method. (B) The networks
   were each assigned a weight by the GeneMANIA algorithm. The weight of
   each edge was multiplied by weight of the containing network. The size
   of the circle was defined as the score attribute, which indicated the
   relevance of each gene to the original list based on the selected
   networks. Higher scores suggested that genes that were more likely to
   be functionally related. The shaded circles represented the DNA
   methylation-based diagnostic biomarkers.

Evaluation and validation of diagnostic accuracy with 10-fold
cross-validation

   Based on the five CpGs, a RF algorithm was implemented to construct an
   effectively diagnostic classifier. We obtained a sensitivity of 100%
   and specificity of 100% for ccRCC in the training dataset of 90 ccRCC
   and control samples ([169]Fig. 6A) as well as a sensitivity of 97.83%
   and a specificity of 100% in the testing dataset of 46 ccRCC and
   control samples ([170]Fig. 6B). Furthermore, the diagnostic classifier
   was demonstrated to effectively discriminate ccRCC from normal samples
   both in the training dataset (AUC = 99.99%) and the testing dataset
   (AUC = 99.95%) ([171]Figs. 6C and [172]6D). To evaluate the performance
   of the five-CpG-based classifier, comparison of different independent
   datasets was required. In our research, two independent datasets,
   namely [173]GSE70303 ([174]Becket et al., 2016) and [175]E-MTAB-2007
   ([176]Sato et al., 2013), were used with AUCs for 100% and 99.68%
   ([177]Figs. 6E and [178]6F), respectively. To explore and verify
   universal applicability of the feature selection method and
   classifiers, the five most common cancers from TCGA were used to
   construct diagnostic classifiers, and the independent datasets from GEO
   were used to evaluate performance ([179]GSE69914, [180]GSE48684,
   [181]GSE66836, [182]GSE76938 and [183]GSE54503). Remarkably, the AUCs
   based on the selected cancers were 97.31% (BRCA), 92.68% (COAD), 98.03%
   (LUAD), 94.90% (PRAD) and 98.32% (LIHC) ([184]Fig. S5). The results
   showed that screening and identification of biomarkers based on
   multi-omics analysis could be extended to the application for any other
   cancers.

Figure 6. Construction and validation of the CpGs-based diagnostic model.

   [185]Figure 6
   [186]Open in a new tab

   (A and B) Confusion tables of binary results of diagnostic prediction
   model in the training (A) and validation (B). ROC of the diagnostic
   prediction model with methylation biomarkers in the training (C) and
   validation data sets (D). (E and F) ROC of the diagnostic prediction
   model with methylation biomarkers in two independent data sets
   ([187]GSE70303 (E) and [188]E-MTAB-2007 (F)).

   To further demonstrate the diagnostic capacity of the five-CpG-based
   classifier in the early stages, the diagnostic classifier was
   implemented and evaluated at different cancer stages (stage I–IV).
   Firstly, the unsupervised hierarchical clustering was implemented in
   the independent datasets, including four different cancer stages.
   Strikingly, these patients were divided into two different groups,
   which separately contained tumor and non-tumor tissues ([189]Fig. 7A).
   Subsequently, the five-CpG-based classifier was used to distinguish
   ccRCC from normal tissues at different stages. Remarkably, the
   five-CpG-based classifier achieved ROC curves with an AUCs of 99.60%
   (stage I) and 99.95% (stage II–IV) ([190]Fig. 7B). The results
   indicated that the DNAm levels of the five CpGs were potentially
   effective biomarkers for distinguishing tumor tissues from normal
   tissues in the early stages (stage I–II).

Figure 7. DNA methylation analysis of ccRCC diagnosis at different stages of
tumorigenesis.

   [191]Figure 7
   [192]Open in a new tab

   (A) Unsupervised hierarchical clustering and heatmap for the
   methylation profile of the selected five CpGs across 160 samples at
   different stages of tumorigenesis. (B) ROC curve for the validation
   data sets of stages I–IV from TCGA.

Constructing and validating the prognostic model

   To establish an effectively prognostic classifier for ccRCC, a series
   of rigorous criteria were used to filtrate biomarkers (see “Materials
   and Methods”). [193]Table S8 showed that 21 CpGs were significantly
   correlated with overall survival based on univariate Cox regression
   analysis (p-value = 0.035 – 3.8 × 10^−6). Subsequently, the
   multivariate LASSO Cox regression analysis was performed to screen
   prognostic biomarkers, and 8 of 21 CpGs were obtained ([194]Figs.
   8A–[195]8C). Based on the β-values of the eight CpGs, the weighting
   coefficients were used to calculate risk scores for patients as follow:
   risk-score = −1.77 × cg19516340 + 0.43 × cg22884714 + 0.72 × cg09203199
   + 1.42 × cg16836311 + 0.14 × cg15357821 −0.18 × cg03415545 + 0.86 ×
   cg15200711 + 0.39 × cg09799983. The risk scores was used to evaluate
   survival status, patients with higher risk scores had poor survival
   status comparing with those with lower risk scores in two datasets
   ([196]Figs. 8D and [197]8F). The median risk score (−0.13) was used as
   the cut-off, and patients were significantly grouped into high-risk and
   low-risk groups in the training datasets. The results showed that the
   risk-score was significantly correlated with risk of death (HR: 5.84;
   95% confidence interval (CI) [3.16–10.8]; log-rank test p-value <
   0.001; [198]Fig. 8E). To assess the validity and reproducibility of
   prognostic classifier, patients in the validation datasets were also
   divided into high-risk or low-risk groups (HR: 4.63; CI [1.39–15.5];
   log-rank test p-value = 2.9 × 10^−3; [199]Fig. 8G). The genes
   associated with the eight CpGs were identified, including ADH1C, CES2,
   CYP1B1, LPCAT1, HOOK3, RRM2, CHEK2 and MAN1C1. The prognostic value of
   RRM2, CES2, HOOK3, ADH1C and LPCAT1 has been reported and validated in
   many cancers, such as breast cancer, pancreatic cancer, colorectal
   cancer, bladder cancer, ovarian cancer, non-small cell lung cancer,
   head and neck cancer ([200]Liu et al., 2007; [201]Morikawa et al.,
   2010; [202]Rahman et al., 2013; [203]Xia et al., 2017; [204]Yoshida et
   al., 2011). MAN1C1 and CHEK2 have been implicated in tumor immune
   response, cancer cell proliferation and cell-to-cell adhesion
   ([205]Zhang et al., 2010), and provided potential therapeutic targets
   for RCC ([206]Li et al., 2018). However, the molecular biological
   function of CYP1B1 was still unknown. The results showed that those
   genes with abnormal methylation may paly crucial roles in tumorgenesis,
   and the DNAm sites might also be potential targets for new therapies.

Figure 8. Construction and validation of the eight-CpG-based classifier.

   [207]Figure 8
   [208]Open in a new tab

   (A and C) Eight CpG sites selected by LASSO Cox regression analysis.
   (A) The two dotted vertical lines are drawn at the optimal values by
   minimum criteria and 1-s.e. criteria. Details are provided in Methods.
   (B) LASSO coefficient profiles of the 21 CpG sites. A vertical line is
   drawn at the optimal value by 1-s.e. criteria and results in eight
   non-zero coefficients. (C) A histogram of the absolute values of the
   coefficients for eight CpG sites, and eight CpG sites was selected in
   the LASSO Cox regression model. (D–G) Risk score was calculated by the
   eight-CpG-based classifier and Kaplan–Meier survival in the training
   data sets (D and E) and validation data set (F and G). Risk-score
   distribution of the eight-CpG-based classifier and patient survival
   status. Heatmap showing methylation of the eight CpG sites in the
   patients. Kaplan–Merier survival analysis for the patients. The
   patients were divided into low-risk and high-risk groups using the
   median cutoff value of the classifier risk score (−0.131). p-Values
   were calculated using the log-rank test. HR, hazard ratio.

   The clinical prognostic signatures, including age, gender and TNM
   stage, were used as covariates to adjust prognostic classifier, and the
   results showed that the risk score calculated by the combination of
   eight CpGs was an independent prognostic factor with the better
   performance in the TCGA datasets ([209]Table 3). Based on the subsets
   of ccRCC patients with different clinical signatures, the
   stratification analysis of the prognostic classifier was further
   implemented. When ccRCC patients were grouped according to the clinical
   signatures (age, gender, TNM stage), the prognostic classifier
   ([210]Fig. S6) still had the better performance (log-rank test p-value
   < 0.05). The [211]Fig. S7 shown that the ccRCC patients with the same
   clinical characteristics were also significantly divided into high-risk
   and low-risk groups by the prognostic classifier (log rank test p-value
   < 0.05). To further evaluate the universality of the construction
   methodology, we used five different cancers with the clinical
   information as the independent validation datasets (TCGA-BRCA,
   TCGA-KIRP, TCGA-LIHC, TCGA-LUAD and TCGA-UCEC). Compared to patients in
   the low-risk group, patients in the high-risk group had significantly
   shorter overall survival in all five different cancers (log-rank test,
   all p-value < 0.05; [212]Fig. S8). In summary, these results clearly
   demonstrated that investigation of the DNAm mechanism with the
   multi-omics analysis may identify potential pathogenic factors and
   provide potential biomarkers for prognosis analysis of patients.

Table 3. Multivariate Cox regression analysis of the eight-CpG-based
prognostic model with overall survival in the TCGA datasets.

   Parameters HR (CI 95%)
   HR Lower (CI 95%)
   HR Upper SE Z value p Value
   Gender Male vs. Female 0.65 0.31 1.4 0.38 −1.10 P = 0.27
   Age Younger than 62 years versus 62 years or old 1.0 0.92 1.1 0.016
   2.61 P = 0.011
   pT T1/2 versus T3/4 1.9 1.4 2.8 0.19 3.62 P < 0.001
   pN N0 versus N1 1.1 0.53 2.3 0.37 0.26 P = 0.8
   M M0 versus M1 1.4 0.78 2.5 0.31 1.10 P = 0.26
   Stage T1/2 versus T3/4 2.1 1.5 2.8 0.17 4.32 P < 0.001
   Eight-CpG-Based model Low vs. High risk groups 2.9 1.8 4.9 0.25 4.21 P
   < 0.001
   [213]Open in a new tab

   Note:

   HR, hazard ratio; CI, 95% confidence interval; SE, standard errors of
   coefficient; Z value, Wald z-statistic value.

Discussion

   Early diagnosis and intervention therapy of RCC were the most effective
   ways to reduce the death rate and prolong the life. As an epigenetic
   mechanism, the clinical value of DNAm changes was recognized widely
   ([214]Linehan & Ricketts, 2019). In previous studies, researchers have
   mainly concentrated on the abnormal methylation of CGIs and promoter
   regions, but most of CpGs with abnormal methylation were located in the
   gene bodies, non-CGIs and intergenic regions ([215]Jones, 2012). CpGs
   not in promoter regions may also have great effects on tumorigenesis
   and may offer patients help in terms of diagnosis or prognosis. The
   global human methylome screening performed in ccRCC patients (discovery
   cohort) analyzed using the Infinium HumanMethylation450 BeadChip array
   identified 4367 significantly hypermethylated sites (including 2,490
   genes) and 5,031 significantly hypomethylated sites (including 3,551
   genes) able to perfectly separate the ccRCC and non-ccRCC patients.
   These CpGs provided the basis for identifying diagnostic or prognostic
   biomarkers of ccRCC. In cancerous tissues, the hypermethylated CpGs
   were often located in the CGIs and closer to the TSS, while the
   hypomethylated CpGs were predominantly present in the non-CGIs. The
   molecular biological function analysis showed that these genes
   associated with abnormal methylation were associated with important
   biological processes, including cell adhesion, signal transduction, and
   transcriptional regulation. Findings also suggested that the abnormal
   DNAm was implicated in tumor initiation and progression

   To further distinguish between potentially functional DNAm events
   (“driver events”) and cancer-free events (“passenger events”)
   ([216]Selamat et al., 2012), DNAm and gene expression were integrated
   for comprehensively analysis. 68% of CpGs showed a negative correlation
   with gene expression. These CpGs were mainly found in promoter regions,
   which were associated with various cancerous mechanisms, such as cell
   migration and oncogenic transformation ([217]Freitag et al., 2015). In
   the gene sets in which the methylation levels were positively
   correlated with gene expression profiles, approximately 10% of CpGs
   were also present in the promoter regions, suggesting that their
   mechanism of transcriptional regulation may be dominated by epigenetic
   modification other than DNAm. They were also closely related to
   tumorigenesis, such as MUC15 ([218]Huang et al., 2009), HEPACAM2
   ([219]Klopfleisch et al., 2010), CA10 ([220]Romeo et al., 2009), NRG1
   ([221]Huang et al., 2004) and RAB25 ([222]Mitra, Cheng & Mills, 2012),
   providing a novel information layer to our understanding of ccRCC.
   Another interesting finding was the dual activity of epigenetic
   mechanisms in ccRCC, such as RUNX3 and TMEM30B ([223]Fig. S4;
   [224]Table S4). The existence of the “dual control” mechanism indicated
   that the change of DNAm levels located in the promoter may interact
   with those located in the gene body to regulate oncogene expression and
   tumorigenesis. Although the levels of DNAm and gene expression showed a
   “violation” of classical mechanisms of methylation-regulated
   expression, this phenomenon indicated that the regulation mechanisms of
   DNAm in ccRCC (even all cancers) were complexity. On the other hand,
   the abnormal methylation located in the gene bodies also played an
   important role in tumorigenesis, for example, MUC15 ([225]Huang et al.,
   2009), HEPACAM2 ([226]Klopfleisch et al., 2010), CA10 ([227]Romeo et
   al., 2009), NRG1 ([228]Huang et al., 2004) and RAB25 ([229]Mitra, Cheng
   & Mills, 2012), which was consistent with previous reports ([230]Yang
   et al., 2014). The results indicated that gene body methylation not
   only offered novel insights for comprehensive understanding of DNAm
   mechanisms, but also provided new targets for cancer diagnosis and
   treatment. The Illumina HumanMethylation450 platform detected
   methylation sites based on probe hybridization and single nucleotide
   extension methods. This method of detecting DNAm levels using specific
   probes was highly dependent on the location of probe hybridization,
   such as CpGs located in the inner or outer of CGIs, and whether it was
   in close proximity to the TSS ([231]Brenet et al., 2011; [232]Van
   Vlodrop et al., 2011). In addition, the following properties also
   increased the complexity of regulation between DNAm and gene
   expression: methylation of distal regulatory elements; the distribution
   of DNAm sites ([233]Clark, 2007); DNAm sites located at the edge of
   CGIs (“shores” and “shelves”) ([234]Irizarry et al., 2009); DNAm in the
   regulation of alternatively spliced transcripts of the same gene
   ([235]Maunakea et al., 2010); miRNA-based regulatory mechanism
   ([236]Lopez-Serra & Esteller, 2012) and gene silencing due to the
   methylation of non-CGIs ([237]Han et al., 2011). Therefore, the
   comprehensive detection of DNAm sites based on different techniques
   (such as whole-genome sulfite sequencing) will provide opportunities
   for the discovery of new methylation sites and regulatory mechanisms.

   In addition to reinforcing or complementing the molecular mechanisms of
   gene expression, epigenetic integration also led to unexpected
   discoveries and therapeutic opportunities. During building the
   diagnostic or prognostic classifier, tumor biomarkers determined the
   performance in clinical practice. The redundant biomarkers tended to
   cause over-fitting of the classifier and resulted in poor
   generalization performance, while fewer biomarkers including incomplete
   or damaged information often resulted in under-fitting. Although DNAm
   was an important biological phenomenon in the development of many
   different types of cancer, there was a lack of greater insight into the
   physiological mechanisms of disease based solely on DNAm ([238]Vrba et
   al., 2020; [239]Zhou et al., 2018). Therefore, our multi-omics-based
   analysis provided a unified view for understanding the
   interrelationship between different molecular mechanisms and the
   combined effects on disease processes as well as for screening
   biomarkers with high sensitivity and specificity ([240]Gao et al.,
   2019). At the end of this phase, 134 CpGs involving 54 genes were
   identified and might play an important role in carcinoma progression
   ([241]Ristic, Bhutia & Ganapathy, 2017; [242]Tomita et al., 2019).
   Subsequently, based on LASSO penalized regression or LASSO Cox
   regression, redundant features affecting model performance were further
   removed. Five CpGs for the diagnostic classifier and eight CpGs for the
   prognostic classifier were obtained, with the high sensitivity and
   specificity. More importantly, these specific biomarkers have been
   extensively studied in a variety of cancers, and involved in cell
   development, cell apoptosis, cancer metastasis or therapeutic response
   ([243]Figs. 5A and [244]5B) ([245]Clark, 2007; [246]Lang et al., 2019;
   [247]Van Vlodrop et al., 2011).

   Unlike the small samples, the single-population samples or
   single-omics, this study utilized tumor samples from large databases
   (TCGA), which has the characteristic of owning large-size samples.
   Based on DNAm profiles, gene expression profiles, co-expression network
   and metabolic networks, the best features were selected to build the
   diagnostic or prognostic classifier. In clinical applications, the
   integration analysis of large samples, multi-population samples and
   multi-omics data greatly improved the reliability and universal
   application of diagnostic or prognostic classifiers. Based on the
   diagnostic classifier, an accuracy of 99.17% was obtained in the
   independent datasets. However, Wang et al. constructed a diagnostic
   classifier based on 44 gene expression profiles with an accuracy of
   93.4% ([248]Wang et al., 2017). Other researchers have established
   diagnostic classifiers based on mRNA or miRNA biomarkers with
   accuracies ranging from 90% to 96% ([249]Spector et al., 2013;
   [250]Youssef et al., 2011). However, there were some deficiencies in
   mRNA or miRNA, such as unstable expression, low sequence specificity
   and sensitivity, large amount and difficult to be found ([251]Kwon et
   al., 2014). Our study also constructed a prognostic classifier, which
   effectively divided the patients into high-risk and low-risk groups.
   The prognostic classifier was a practical and powerful prognostic tool
   that provided prognostic value, complemented the current ccRCC staging
   system, and provided a theoretical basis for patient consultation and
   individual follow-up protocols.

Conclusions

   In summary, our study obtained novel biological insights when combining
   complementary epigenetic and transcriptomic analysis in the
   whole-genome perspective. Abnormal methylation events located in
   different genomic regions might have synergistic effects, which
   together caused abnormal expression patterns in metabolic pathways. The
   comprehensive analysis of genome-wide methylomic, transcriptomic and
   metabolic network may provide a promising avenue for facilitating the
   understanding of the mechanisms of ccRCC tumorigenesis. Based on the
   multi-level “omics” analysis, we also identified cancer-specific
   epigenetic signatures to diagnose and stratify patients. In addition,
   due to the reversibility of DNAm, they will also become the most
   promising therapeutic targets.

Supplemental Information

   Supplemental Information 1. Characteristics of subjects and tumors.
   [252]Click here for additional data file.^ (10.5KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-1
   Supplemental Information 2. Gene exhibiting abnormal methylation and
   expression in ccRCC.
   [253]Click here for additional data file.^ (12KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-2
   Supplemental Information 3. Top 10 Gene Ontology (GO) biological
   process (BP) based on the genes with abnormal methylation.
   [254]Click here for additional data file.^ (12.5KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-3
   Supplemental Information 4. Genes exhibiting tandem promoter and gene
   body methylation from TCGA-KIRC datasets.
   [255]Click here for additional data file.^ (14.1KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-4
   Supplemental Information 5. Genes exhibiting tandem promoter and gene
   body methylation from GEO (KIRC) datasets.
   [256]Click here for additional data file.^ (52.3KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-5
   Supplemental Information 6. Distribution of differentially methylated
   CpG sites and genes in the feature selection stage.
   [257]Click here for additional data file.^ (14KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-6
   Supplemental Information 7. Function enrichment and biological network
   analysis using five CpG sites.
   [258]Click here for additional data file.^ (11.7KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-7
   Supplemental Information 8. Characters of 21 methylation markers and
   their coefficients in ccRCC.
   [259]Click here for additional data file.^ (11.9KB, xlsx)
   DOI: 10.7717/peerj.9654/supp-8
   Supplemental Information 9. Unsupervised clustering of ccRCC and
   matched adjacent normal using DNA methylation patterns.

   The heat-map represents 160 ccRCC and 160 matched adjacent normal
   samples clustered using 28271 significantly differentially methylated
   CpG sites demonstrating the reliability of our method across the entire
   cohort of 320 primary samples. The majority of ccRCC (red, 99.38%)
   cluster together, as do the matched adjacent normal samples (blue,
   100%).
   [260]Click here for additional data file.^ (729.9KB, pdf)
   DOI: 10.7717/peerj.9654/supp-9
   Supplemental Information 10. Identification of significantly
   differentially methylated CpG sites in the human genomes.

   G01, whether or not CpG sites were located in promoters; G02, whether
   or not CpG sites were located in CpG islands (CGIs). The promoter
   regions were defined as ±1.5kb from TSS, and the others were
   non-promoter regions.
   [261]Click here for additional data file.^ (34KB, pdf)
   DOI: 10.7717/peerj.9654/supp-10
   Supplemental Information 11. Intersection of genes with DNA methylation
   and gene expression profiles.

   A total of 6,041 (4,625 + 1,100 + 316) genes were profiled on the
   methylation array and a total of 2,478 (1,062 + 1,100 + 316) genes
   profiled on the expression array fulfilled the expression intensity
   filtering criteria, of which 1,416 genes overlap between the two
   platforms. The central set of 316 genes possessing a) Methylation and
   gene expression profiles, and b) exhibiting differential methylation
   between ccRCC tumors and matched adjacent normal tissues was used in
   the global methylation-expression correlation analysis. CpG sites
   targeting the X and Y chromosomes were removed prior to performing the
   differential methylation analysis, so all 316 genes applied to the
   global correlation analysis are located on autosomes.
   [262]Click here for additional data file.^ (26.8KB, pdf)
   DOI: 10.7717/peerj.9654/supp-11
   Supplemental Information 12. RUNX3 (left) and TMEM30B (right) are shown
   as examples.

   For each gene, data are drawn from the individual gene body and
   promoter CpG sites exhibiting the strongest absolute correlations with
   expression. (A) Relationship between expression intensity and gene body
   DNA methylation level. (B) Relationship between expression intensity
   and promoter DNA methylation level. (C) Relationship between gene body
   and promoter DNA methylation levels, confirming occurrence in the same
   samples. (D) Schematic representation of promoter and gene body
   methylation relationships to gene expression alterations in tumor
   samples relative to nonmalignant samples.
   [263]Click here for additional data file.^ (55.7KB, pdf)
   DOI: 10.7717/peerj.9654/supp-12
   Supplemental Information 13. Construction and validation of the
   probes-based diagnostic model.

   ROC curve for the validation datasets of GSE (BRCA), GSE (COAD), GSE
   (LIHC), GSE (LUAD) and GSE (PRAD).
   [264]Click here for additional data file.^ (162.1KB, pdf)
   DOI: 10.7717/peerj.9654/supp-13
   Supplemental Information 14. The analysis of Hazard ratio (HR) for TCGA
   datasets with ccRCC using the eight-CpG-based model in different
   subgroups stratified by clinical parameters.
   [265]Click here for additional data file.^ (5.3KB, pdf)
   DOI: 10.7717/peerj.9654/supp-14
   Supplemental Information 15. Kaplan–Meier survival analysis based on
   the eight-CpG-based model in subsets of different clinical stage
   patients with ccRCC (log-rank test).
   [266]Click here for additional data file.^ (224.2KB, pdf)
   DOI: 10.7717/peerj.9654/supp-15
   Supplemental Information 16. Construction of the probes-based
   prognostic classifier.

   Kaplan–Meier survival analysis of the patients in each of the five
   cancers. The patients were divided into low-risk and high-risk groups
   using the median cutoff value of the partial hazard. p-value were
   calculated by the log-rank test
   [267]Click here for additional data file.^ (197.6KB, pdf)
   DOI: 10.7717/peerj.9654/supp-16

Acknowledgments