Abstract

   Metastatic prostate cancer remains a major clinical challenge and
   metastatic lesions are highly heterogeneous and difficult to biopsy.
   Liquid biopsy provides opportunities to gain insights into the
   underlying biology. Here, using the highly sensitive enrichment-based
   sequencing technology, we provide analysis of 60 and 175 plasma DNA
   methylomes from patients with localized and metastatic prostate cancer,
   respectively. We show that the cell-free DNA methylome can capture
   variations beyond the tumor. A global hypermethylation in metastatic
   samples is observed, coupled with hypomethylation in the
   pericentromeric regions. Hypermethylation at the promoter of a
   glucocorticoid receptor gene NR3C1 is associated with a decreased
   immune signature. The cell-free DNA methylome is reflective of clinical
   outcomes and can distinguish different disease types with 0.989
   prediction accuracy. Finally, we show the ability of predicting copy
   number alterations from the data, providing opportunities for joint
   genetic and epigenetic analysis on limited biological samples.

   Subject terms: Cancer genomics, Prostate cancer
     __________________________________________________________________

   Metastatic prostate cancer can be difficult to biopsy and characterise.
   Here, the authors use cell-free DNA methylation analysis to illustrate
   changes in hypermethylation in metastatic disease.

Introduction

   Prostate cancer (PCa) poses a significant clinical burden as the second
   most common malignancy in men and the third most common cause of
   cancer-related death worldwide^[68]1. While most localized PCa can be
   cured, the 5-year survival rate for patients presenting with metastatic
   disease is as low as 30%^[69]2. In recent years, there has been an
   increased incidence rate for metastatic cases^[70]3. Androgen
   deprivation therapy (ADT) treatment is the standard of care for
   patients with advanced or metastatic disease. However, despite initial
   effectiveness, most patients progress to metastatic
   castration-resistant prostate cancer (mCRPC) shortly, and eventually,
   almost all will die from it. CRPC cells grow independently of
   testosterone stimulation by developing mechanisms to constitutively
   activate the androgen signaling pathway. Development and application of
   the more potent, second-generation androgen signaling inhibitors (ASI)
   like enzalutamide and abiraterone acetate were able to provide
   additional survival benefits for CRPC patients and are increasingly
   applied in the earlier lines of treatment for advanced
   disease^[71]4–[72]8. However, drug resistance will ultimately develop,
   and these agents fail to suppress tumor progression. There is thus an
   urgent need to improve our understanding and treatment of mCRPC.

   Biopsy for mCRPC lesions is challenging, even more so when trying to
   obtain sufficient materials for molecular analysis^[73]9. Analysis of
   circulating tumor DNA (ctDNA) in liquid biopsies has shown potential as
   a minimally invasive and accurate disease monitoring
   tool^[74]10,[75]11. ctDNA refers to the component of total cell-free
   DNA (cfDNA) that is derived from tumor cells circulating in the
   bloodstream. It has been demonstrated that quantification of ctDNA can
   provide information on tumor burden, metastasis, and treatment
   response^[76]12. Of interest, in mCRPC, ctDNA has been shown to reflect
   the genomic profiles of tumor or metastatic lesions, and epigenetic
   characteristics such as methylation status can reflect tumor burden and
   subtypes^[77]13–[78]15. Despite these advantages, genome-wide cfDNA
   methylome profiles of mCRPC in large-scale clinical cohorts are still
   lacking. Recently, the development of cell-free methylated DNA
   immunoprecipitation coupled with next-generation sequencing
   (cfMeDIP-seq) presents an effective approach for the analysis of cfDNA
   methylomes^[79]16. This approach allows for sensitive detection of
   ctDNA from minute quantities of cfDNA and is more cost-effective
   compared to genome-wide bisulfite-based methods^[80]17–[81]19.

   Here, we analyze the cell-free methylome of 60 localized and 175
   metastatic samples with the cfMeDIP-seq technology. The global
   methylome captures variations reflective of the heterogeneous disease
   biology. We further show that the cell-free methylome can distinguish
   different disease status with high accuracy, highlighting its potential
   as a minimally invasive strategy for disease monitoring and
   prognostication.

Results

A genome-wide analysis of plasma DNA methylome in localized and metastatic
PCa

   To gain a deeper understanding of the cfDNA methylation profile changes
   during PCa progression, we curated a total of 133 plasma samples,
   including 30 and 103 plasma samples from patients with primary and
   mCRPC, respectively (Fig. [82]1A, Supplementary Data [83]1). The
   localized tumor plasma samples were collected as part of the Canadian
   Prostate Cancer Genome Network (CPC-GENE, CPC for short)
   project^[84]20, while the mCRPC cases were sampled from three
   well-curated cohorts of metastatic PCa (Fig. [85]1A): (1) 67 plasma
   samples from a randomized phase II clinical trial ([86]NCT02125357)
   comparing the sequential use of abiraterone and enzalutamide on
   first-line mCRPC^[87]11 at Vancouver Prostate Centre (VPC). (2) 14
   plasma samples were collected from three patients over the course of an
   enzalutamide treatment trial (Barrier). (3) 22 plasma samples from
   patients enrolled in the West Coast Prostate Cancer Dream Team (WCDT)
   study^[88]21. For the VPC and WCDT cohorts, we profiled 47 and 11
   samples collected at the time of enrollment (baseline); 30 and 11
   samples upon PSA progression after targeted AR inhibition treatment
   (progression), respectively (Fig. [89]1A). Barrier cohort consisted of
   samples collected at baseline and along the treatment course
   (Fig. [90]1A, Supplementary Data [91]1). Together, the datasets here
   form a comprehensive representation of PCa, particularly for metastatic
   castration-resistant lesions. Furthermore, mCRPC samples from the VPC
   cohort has cfDNA sequencing profiled, and the CPC (localized) and WCDT
   (mCRPC) cohorts have multi-omics sequencing data available for matched
   tissues in previous studies (Supplementary
   Data [92]1)^[93]11,[94]21–[95]25, providing unique opportunities for
   integrative analysis.

Fig. 1. cfMeDIP data capture cell-free DNA methylation and fragmentation
changes in localized vs. mCRPC patient plasma samples.

   [96]Fig. 1
   [97]Open in a new tab

   A Sampling schematics for the three mCRPC cohorts. Numbers in
   parentheses indicate the total number of samples in given cohorts. For
   the Vancouver Prostate Cancer (VPC) and West Coast Dream Team (WCDT)
   cohorts, each column represents a sampling strategy, and the number
   above shows the number of patients subjected to the indicated sampling
   strategy. For the Barrier cohort, each column represents a different
   time point and each row represents a patient. B baseline, C1/C2
   treatment cycle 1/2; M2/3/4/5, 2/3/4/5 months post C2; EoT, end of
   the trial. B Pairwise sample correlation between tissue WGBS and
   cfMeDIP data in the WCDT cohort, matched samples collected from the
   same patient were compared with the others. P value = 1.1 × 10^−6
   (two-sided Mann–Whitney U test). Box plots represent median values and
   0.25 and 0.75 quantiles. Whiskers represent 1.5× interquartile range
   (IQR). X = 272 and 17 independent observations for the “No (matched)”
   and “Yes (unmatched)” groups, respectively. Scatterplot showing the top
   10,000 most variable bins in tissue WGBS and cfMeDIP data for patient
   DBT-222 C) and DBT-149 (D). E Enrichment in different genomic locations
   for bins that are well explained by %ctDNA, LDH, and LDH independent of
   %ctDNA. (Delta) R squared of 0.2 is used as a cutoff to filter for
   regions that are well explained by the indicated variable. Fisher’s
   exact test was used to calculate p value and odds ratio. F
   Three-dimensional representation of samples according to the principal
   component analysis (PCA) for the four cohorts using the top 10,000 most
   variable bins. G Ratio of longer cfDNA fragments across the cohorts.
   Two-sided Mann–Whitney U test was used to calculate pairwise p values
   between localized samples from the CPC and mCRPC samples from the
   Barrier, VPC, WCDT cohorts, respectively. Color code is the same as
   used in F. Box plots represent median values and 0.25 and 0.75
   quantiles. Whiskers represent 1.5× interquartile range (IQR). X = 30,
   14, 67, and 22 independent experiments for the CPC, Barrier, VPC, and
   WCDT cohorts, respectively. H Pearson correlation of longer fragment
   ratio and ctDNA fraction (%ctDNA) in mCRPC samples from the VPC
   cohorts. P value was calculated using a two-sided t test. Purple line
   represents a fitted linear model of the data and shading around the
   fitted line represents 0.95 confidence interval (CI). Association of
   samples with shorter or longer fragment sizes with overall survival (I)
   and progression-free survival (J). Median value of the longer fragment
   ratio (0.3933) was used to dichotomize samples into two groups. K
   Distribution of the standard deviation of fragment ratio within a
   sample. Two-sided Mann–Whitney U test was used to calculate pairwise P
   values between localized samples from the CPC and mCRPC samples from
   the Barrier, VPC, WCDT cohorts, respectively. Color code same as used
   in F. Box plots represent median values and 0.25 and 0.75 quantiles.
   Whiskers represent 1.5× IQR. X = 30, 14, 67, and 22 independent
   experiments for the CPC, Barrier, VPC, and WCDT cohorts, respectively.
   Source data for 1B, 1F–H, and 1K are provided as a Source Data file.

   Methylated DNA fragments from all plasma samples were precipitated and
   subjected to paired-end DNA sequencing as described before^[98]26
   (Methods). We obtained a median of 63 million reads, covering over 60%
   of the genome, and showed enrichment of CpGs (Methods) between 1.59 and
   2.19 fold (Supplementary Data [99]1). We used the previously described
   binning strategy^[100]16 to quantify the data in 300 bp non-overlapping
   bins and reduced our analysis to the 337,420 non-low bins (Methods,
   Supplementary Data [101]2). In general, there is a higher correlation
   between samples collected from the same patients at different
   timepoints (baseline or progression) (Supplementary Fig. [102]1A),
   suggesting higher inter-patient variations. We further compared the
   cell-free methylome with previous tissue whole-genome bisulfite
   sequencing (WGBS) profiles^[103]21 from the WCDT cohort using the top
   10,000 most variable bins. Despite differences in sequencing
   technologies, a significantly higher correlation was observed between
   the matched tissue and cell-free methylomes, with the median Pearson’s
   r for matched samples being 0.37 compared to −0.02 for unmatched
   samples (Fig. [104]1B–D).

cfMeDIP data captures variations from tumor and TME

   To understand the source of data variations, we examined major clinical
   covariates by fitting linear regression models for individual bins
   using the 67 mCRPC samples from the VPC cohort (Supplementary
   Fig. [105]1B–E). Adjusted R squared was used to measure the proportion
   of variation explained by different covariates for each non-low bin.
   For less than one-fifth (19.63%) of the bins, variations in methylation
   signal showed a weak to moderate correlation (adjusted R squared >0.2)
   with ctDNA fraction (%ctDNA)^[106]11, and for only 3.45% of the regions
   %ctDNA has a moderate to high correlation (adjusted R squared >0.5)
   (Supplementary Fig. [107]1B). This is different from previous targeted
   methylation analysis on plasma DNA for mCRPC samples, where tumor
   fraction was identified as the major determinant of variation^[108]15.
   The previous analysis focused on the CpG sites covered by the Roche
   probes and left a large proportion of the methylome uncharacterized
   (Supplementary Fig. [109]1F): 75.4% of the non-low bins in our study
   were not covered. We reasoned that the unbiased cfMeDIP strategy was
   able to provide genome-wide methylome analysis reflective of a more
   comprehensive source of variation.

   We next investigated other available clinical factors, including age,
   lactate dehydrogenase (LDH) and alkaline phosphatase (ALP) levels.
   Serum LDH and ALP levels measure groups of enzymes that catalyze
   glycolysis and phosphate esters, respectively. LDH levels are
   associated with the outcomes in mCRPC and are related to visceral
   diseases such as liver metastasis^[110]27–[111]29. Bone-specific ALP is
   expressed on the surface of osteoblasts and is commonly upregulated in
   cancers originating or spreading to the bone^[112]30. Indeed, while age
   barely contributes, ALP level explains the variations to a weak to
   moderate degree for 2.41% of the bins and is a moderate to high
   contributor for a smaller fraction (0.3%) (Supplementary Fig. [113]1C,
   D). Interestingly, more bins (23.77%) are explained to a weak to
   moderate degree by LDH level than %ctDNA (19.63%) (Supplementary
   Fig. [114]1B, E), with 17.56% of the bins explained by LDH independent
   of %ctDNA variation (Supplementary Fig. [115]1G). These bins
   distributed differently across the genome: bins explained to a weak to
   moderate degree by %ctDNA are enriched in CpG islands and promoter
   regions while depleted in shelves (cpg_inter) and intergenic regions
   (Fig. [116]1E). Meanwhile a sharp contrast was observed for bins
   explained to a weak to moderate degree by LDH independently
   (Fig. [117]1E). Together, the amount of variation explained by ALP and
   LDH, but not %ctDNA, is potentially reflective of variations related to
   changes in the tumor microenvironment (TME).

cfMeDIP data capture fragmentation profile changes in the cfDNA

   We next extended the analysis to include all four cohorts. The global
   methylation pattern was able to provide a general separation between
   the localized and metastatic data (Fig. [118]1F), with several mCRPC
   samples having low to undetectable %ctDNA clustered closer to the
   primary cancer cohort (Supplementary Fig. [119]1H, I). Previous studies
   reported that tumor-derived cfDNA has a shorter length compared to
   healthy controls^[120]31,[121]32, we thus analyzed the cfDNA fragment
   size in the four cohorts. Significant shorter fragment size was
   observed in mCRPC samples compared to that of localized samples
   (Fig. [122]1G). In addition, the fragment size is significantly
   negatively associated (Pearson’s r = −0.43, p value = 2.8 × 10^−4) with
   %ctDNA in mCRPC samples from the VPC cohort (Fig. [123]1H). Together,
   these suggest a higher %ctDNA in the metastasis compared to localized
   patient blood, corroborating previous observations^[124]31. While
   previous studies of total cfDNA showed significant enrichment of
   fragments below 150 bp in patient samples^[125]33–[126]35, size
   distribution difference in our data mainly occurred longer than 150 bp
   (Supplementary Fig. [127]1J). We reasoned that methylated ctDNA could
   have different fragment lengths, and differences in experimental
   approaches may also contribute to such disparity. Nevertheless, our
   analysis showed that cfMeDIP-seq can capture fragment length difference
   qualitatively. As expected, the estimated fragment size distribution is
   significantly associated with both overall and progression-free
   survival (PFS, Fig. [128]1I, J). Besides the length difference, cancers
   are shown to have more variation in their cfDNA length^[129]36. Indeed,
   when we examine the fragmentation profile in 5-Mb bins across the
   genome (Methods), a higher standard deviation was observed in the
   metastatic cohorts (Fig. [130]1K). Differences in fragment length and
   fragmentation can further distinguish our localized samples from the
   healthy controls (HC) cfMeDIP-seq profiles reported previously^[131]32
   (Supplementary Fig. [132]1K, L).

Cell-free 5mC profiling reveals widespread hyper-methylation in metastatic
samples

   We next compared the cell-free methylation profiles between the CPC
   (localized) and VPC (metastatic) samples with age control (Methods).
   Widespread hypermethylation in metastatic tumors was observed
   (Fig. [133]2A) and we detected 7.6 times (19,048 hyper vs. 2493 hypo)
   more differentially methylated regions (DMRs) with increased
   methylation than decreased. Meanwhile, previously reported
   hypomethylation sites in patient tumors showed consistent lower
   methylation in the metastatic samples^[134]37, corroborating the
   validity of our analysis (Supplementary Fig. [135]2A). Global
   methylation on promoters of tumor suppressor genes showed significantly
   higher methylation in mCRPCs (Supplementary Fig. [136]2B), whereas only
   moderate differences were observed for oncogenes^[137]38,[138]39
   (Supplementary Fig. [139]2C). Indeed, DNA methylation is observed as a
   common mechanism for transcriptional regulation for tumor suppressors
   but not protooncogenes^[140]40. The detected DMRs also showed
   consistent deregulation in the Barrier and WCDT cohorts (Supplementary
   Fig. [141]2D), highlighting the robustness of the detected methylation
   changes. These robust DMRs can stratify patients into different risk
   groups: A ratio score dividing hyper-DMRs by hypo-DMRs is significantly
   associated with both overall and PFS (Supplementary Fig. [142]2E, F).

Fig. 2. cfMeDIP reveals widespread hypermethylation and preferential repeat
hypomethylation in metastatic samples.

   [143]Fig. 2
   [144]Open in a new tab

   A Volcano plot of differentially methylated regions (DMRs) was
   identified comparing metastatic and localized prostate cancer. A total
   of 2493 and 19,048 hypo and hyper DMRs were identified, respectively.
   Pearson correlation between the mean methylation levels of hyper (B)
   and hypo (C) DMRs and %ctDNA. P value was calculated using t test. Blue
   line represents a fitted linear model of the data and shading around
   the fitted line represents 0.95 CI. Genomic distribution of hyper- (D)
   and hypo- (E) DMRs. F Contingency table showing the distribution of
   repeat and the hypo-DMRs. Two-sided fisher’s exact test was used to
   calculate p value and odds ratio (OR). G Frequency of repeat types
   overlapped with downregulated peaks. H Differentially methylated peaks
   located within the 1 Mb regions flanking the centromere. CPC Canadian
   Prostate Cancer Genome Network cohort. I Example showing signal
   distribution around the pericentromeric region in chromosome 2. The Y
   axis showed the normalized signal per million reads (SPMR) from MACS
   (v2.2.5). Mean methylation levels for differential peaks shown in H for
   cfMeDIP-seq (J) and WGBS (K) data. Box plots represent median values
   and 0.25 and 0.75 quantiles. Whiskers represent 1.5× IQR. X = 30, and
   67 independent experiments for the CPC and VPC cohorts, respectively in
   J. X = 194, 194, and 100 independent experiments for the Benign,
   Primary, and Mets groups, respectively, in K. Source data for F2B-C and
   2J-K are provided as a Source Data file.

   In contrast to the significant positive correlation between hyper-DMRs
   and %ctDNA in the VPC cohorts (Fig. [145]2B), hypo-DMRs show no
   correlation (Fig. [146]2C). Both hyper- and hypo- DMRs are enriched in
   CpG islands and shores but depleted in shelf and open sea regions
   (Fig. [147]2D, E), with the former, also enriched in gene promoters.
   Genome occupancy of important transcription factors (TF) shows
   differential enrichment patterns in hyper- and hypo- DMRs
   (Supplementary Fig. [148]2G, H). Specifically, top hits enriched in
   hyper-DMRs are transcriptional suppressive factors like SUZ12 and EZH2,
   while for hypo-DMRs, activating factors like TRIM24 and CREB1 are
   the most enriched^[149]41,[150]42. AR binding is enriched in both sets
   of DMRs, consistent with the recognition of its dual roles in
   transcriptional regulation^[151]43,[152]44.

Pericentromeric regions are preferentially hypo-methylated in metastatic
samples

   While hyper-DMRs are enriched in regulatory regions like the promoters,
   hypo-DMRs are not (Fig. [153]2D, E). Upon further investigation, we
   found hypo-DMRs to be specifically enriched in the repeat regions
   (Fig. [154]2F). For better characterization of the repeat signal, we
   utilized a peak calling strategy to quantify the methylation signal
   (Methods, Supplementary Data [155]2). Among the most enriched repeat
   types in hypomethylated peaks is GAATGn, the classic satellite DNA
   frequently found in pericentromeric regions^[156]45 (Fig. [157]2G).
   Indeed, differentially methylated peaks within the pericentromeric
   regions (1 Mb around the annotated centromere gaps) show considerably
   reduced signals in metastatic samples (Fig. [158]2H–J). We compared
   tissue WGBS data from two previous studies and observed a similar
   pattern^[159]21,[160]46 (Fig. [161]2K, Supplementary Fig. [162]2I). In
   addition, a more noticeable reduction between the benign tissue and the
   primary tumor samples was observed, suggesting progressive loss of
   methylation for pericentromeric regions along the PCa development
   trajectory.

Methylation level at NR3C1 promoter associates with differential disease
outcome

   An outlier DMR showing the highest fold change (Fig. [163]2A) is
   located in the promoter region of the gene NR3C1 (Fig. [164]3A), which
   encodes glucocorticoid receptor (GR). The methylation level at this
   site (referred to as GR-DMR hereafter) shows a borderline positive
   correlation with %ctDNA (Fig. [165]3B), and the high fold change is
   likely driven by the few samples with exceptionally high methylation in
   the VPC cohort. We examined cfMeDIP-seq data of the isolated peripheral
   blood leukocytes from 20 healthy donors in a recent study and no signal
   was detected in this site^[166]32, suggesting cancer-specific
   methylation on GR-DMR. Higher methylation level at GR-DMR in localized
   tumors from the Cancer Genome Atlas (TCGA), the Chinese Prostate Cancer
   Genome and Epigenome Atlas (CPGEA) and the CPC
   cohorts^[167]22,[168]46,[169]47 are associated with worse outcomes
   (Fig. [170]3C, D, Supplementary Fig. [171]3A), while a reverse
   association was observed in the metastatic cohort (Fig. [172]3E).
   Associations with survival are not significant in the VPC (mCRPC)
   cfMeDIP-seq data (Supplementary Fig. [173]3B, C), likely due to small
   sample size. Expression levels of the GR gene only showed a moderate
   negative correlation with methylation levels at GR-DMR (Supplementary
   Fig. [174]3D–G). Moreover, direct GR RNA abundance showed no
   significant survival association except in the CPGEA cohort
   (Supplementary Fig. [175]3H–K), suggesting the existence of different
   regulatory mechanisms of GR expression.

Fig. 3. Differential methylation in GR gene associates its altered role in
mCRPC.

   [176]Fig. 3
   [177]Open in a new tab

   A Schematics illustrating GR gene (NR2C1) and its different isoforms.
   Blue bar indicates the identified outlier DMR. B Pearson correlation
   between GR site methylation level and %ctDNA in VPC cohort. P value was
   calculated using two-sided t test. Blue line represents a fitted linear
   model of the data and shading around the fitted line represents 0.95
   CI. Association between GR site methylation and disease outcome in the
   Cancer Genome Atlas (TCGA) (C), the Chinese Prostate Cancer Genome and
   Epigenome Atlas (CPGEA) (D) and WCDT (E) cohorts. Logrank test was used
   to calculate p values. X = 216 and 221, 22 and 24, 49, and 48 for high
   and low-risk groups for C–E, respectively. Gene ontology (GO) analysis
   shows the enrichment of Biological Process (BP) for genes downregulated
   in high GR-DMR methylation groups in TCGA (F), CPGEA (G), and WCDT (H)
   cohorts. Source data for F3B are provided as a Source Data file.

   To understand how the methylation changes at GR-DMR affect disease
   outcome, we performed differential gene expression analysis comparing
   samples with high and low GR-DMR methylation levels. Considering the
   relatively low number of samples with high GR-DMR methylation in the
   localized CPC cohort (Supplementary Fig. F[178]3E), we focused on the
   remaining three datasets (Supplementary Fig. [179]3L–N). Enrichment
   analysis showed that downregulated genes in the hypermethylation group
   are significantly associated with immune-related terms, including
   “antigen processing and presentation” (Fig. [180]3F–H). Such enrichment
   is observed in both the localized and mCRPC cohorts, suggesting
   potential immune regulatory roles related to GR-DMR methylation. Top
   terms enriched in the upregulated genes in the CPGEA cohort are all
   cell cycle-related (Supplementary Fig. [181]3O). Although not
   significant, these cell cycle-related genes show the trend of
   upregulation in the TCGA cohort (Supplementary Fig. [182]3P),
   consistent with the worse outcome associated with high GR-DMR
   methylation observed in primary tumors (Supplementary Fig. [183]3C, D).
   In contrast, such upregulation is not observed in the WCDT cohort
   (Supplementary Fig. [184]3P), suggesting a trend of switching from cell
   cycle regulatory roles in primary tumors to a
   dominant immune regulation in metastatic cases for the GR-DMR
   methylation.

Cell-free DNA methylome distinguish metastatic from localized samples with
high accuracy

   We next sought to create a machine-learning predictor distinguishing
   localized and metastatic tumors using the methylation profiles. To
   increase the sample size, we sequenced an additional 72 samples from
   the VPC cohort (VPC-V), and 30 samples from patients with localized
   tumor samples from the Ontario Health Study (OHS) cohort (Fig. [185]4A,
   Supplementary Fig. [186]4A). To avoid potential bias caused by the
   imbalanced classification in our samples, we randomly select equal
   numbers (21 for each) of localized and metastatic samples from the
   respective cohorts and use them as a training set (Fig. [187]4A).
   Feature selection was then performed on the training set: differential
   methylation analysis was performed and the top 150 hyper- and hypo-
   DMRs were selected. A random forest classifier was then built using the
   selected features and evaluated on the remaining dataset (testing set).
   This process was repeated 50 times and the performances were summarized
   (Fig. [188]4A–C, Supplementary Fig. [189]4B).

Fig. 4. Methylation profiles distinguish localized from metastatic samples
with high accuracy.

   [190]Fig. 4
   [191]Open in a new tab

   A Workflow of building a random forest predictor using methylation
   profiles. B Prediction accuracy and area under the receiver operating
   characteristics (AUROC) for methylation profile-based predictor on
   testing datasets. VPC-V validation samples from the VPC cohort, OHS the
   Ontario Health Study cohort, DMR differentially methylated region. C
   %ctDNA distribution in samples that are correctly or misclassified for
   the metastatic samples. Results from the 50 repeats were pooled for
   visualization. Two-sided Mann–Whitney U test was used to calculate
   pairwise P values. Box plots represent median values and 0.25 and 0.75
   quantiles. Whiskers represent 1.5× IQR. X = 69 and 2656 test results
   pooled from 50 times of repetition for FALSE and TRUE groups,
   respectively. Source data for F4B-C are provided as a Source Data file.

   The predictor yielded a median area under the curve value of 1, with a
   very high median accuracy of 0.989 on the testing datasets
   (Fig. [192]4B, Supplementary Fig. [193]4B, C). All the localized
   samples were correctly classified, with only 10 out of the 175
   metastatic samples ever misclassified across the 50 repeats
   (Fig. [194]4C). It is worth noting most samples with very low %ctDNA
   (<2%) were correctly classified, highlighting the sensitivity of DNA
   methylation in detecting samples with lower load of genetic alterations
   (Fig. [195]4C). We further applied our predictors to the 20 healthy
   controls from the previous study^[196]32. Significant probability
   distribution difference from the localized samples was observed
   (Supplementary Fig. [197]4C), suggesting the potential of
   discriminating between early-stage disease and healthy controls using
   cfMeDIP-seq data.

Cell-free DNA methylome can predict large-scale genetic variations

   Sequencing information obtained from cfMeDIP-seq data can reflect
   genetic changes and enable copy number alteration (CNA) analysis.
   Similar to RNA-seq data, enrichment-based cfMeDIP data also has uneven
   coverage caused by different methylation levels when used for CNA
   analysis. We thus adapted the CNA inference tool for RNA-seq to the
   analysis of cfMeDIP-seq data^[198]48 (Methods). Overall, CNA coverage
   is significantly higher in VPC (Fig. [199]5A), consistent with the
   notion of higher %ctDNA associated with metastatic samples. Indeed, CNA
   coverage showed a significant positive correlation with %ctDNA in the
   VPC cohorts (Fig. [200]5B, Supplementary Fig. [201]5A). We next
   examined the CNA changes in detail for individual genes assayed by
   panel sequencing from a previous study^[202]11 and high concordance was
   observed (Fig. [203]5C, Supplementary Fig. [204]5B). In regions that
   were misclassified as CNA neutral, a significantly lower degree of CNA
   was observed (Supplementary Fig. [205]5C). Very few depletions (9) were
   misclassified as amplifications and even fewer amplifications (3) were
   misclassified as depletions. Majority of the CNA neutral regions were
   correctly predicted, with 4% misclassification (154/3835)
   (Supplementary Fig. [206]5C). Together, these resulted in an overall
   accuracy of 0.86 for CNA prediction (Table [207]1). When considering
   only regions with CNA change, a 0.975 accuracy was achieved
   (Supplementary Fig. [208]5D). We then extended the analysis to the
   VPC-V. Only a moderate correlation between the predicted CNA coverage
   and %ctDNA (Supplementary Fig. [209]5E) was observed, likely due to the
   reduced sensitivity caused by the lower range of %ctDNA (0-44%, mean
   ~7.33%) compared to the VPC cohort (0–88.90%, mean ~38.46%). Still,
   high overall accuracy of 0.92 was achieved in these samples of lower
   genetic alteration load (Supplementary Fig. [210]5F, G). Similarly,
   misclassified regions have significantly lower CNA degree
   (Supplementary Fig. [211]5F) and the high proportion of correctly
   predicted neutral regions contributed to the overall high prediction
   accuracy (Supplementary Fig. [212]5G). Taken together, we showed that
   cfMeDIP-seq data can be used to reliably predict sample CNA.

Fig. 5. Prediction of sample CNA using cfDNA methylome.

   [213]Fig. 5
   [214]Open in a new tab

   A Predicted CNA coverage in localized samples from the CPC and mCRPC
   samples from the VPC cohorts. Two-sided Mann-Whitney U test was used to
   calculate pairwise P values. Box plots represent median values and 0.25
   and 0.75 quantiles. Whiskers represent 1.5× IQR. X = 30, and 67
   independent experiments for the CPC and VPC cohorts, respectively. B
   Correlation between the predicted CNA coverage and %ctDNA in mCRPC
   samples from the VPC cohorts. C Comparison of gene CNA between standard
   status obtained from previous panel sequencing^[215]11 and predicted
   results from cfMeDIP-seq data for the top 10 samples with the highest
   CNA coverage. Source data for F5A-B are provided as a Source Data file.

Table 1.

   The performance and accuracy of CNA prediction in mCRPC samples from
   the VPC cohort
            Precision Specificity Sensitivity  F1
   Gain     0.66      0.98        0.53        0.59
   Neutral  0.88      0.49        0.96        0.92
   Loss     0.82      0.99        0.44        0.58
   Mean     0.79      0.82        0.65        0.70
   Accuracy 0.86
   [216]Open in a new tab

Discussion

   In this work, we provide a genome-wide methylation analysis using the
   cell-free DNA isolated from the plasma of patients with localized and
   metastatic PCa. While the cfDNA methylation profile has been
   extensively studied using targeted analysis for early-stage,
   non-metastatic PCa^[217]10,[218]49, evaluation in the advanced,
   metastatic disease at the genome scale is still sparse^[219]10,[220]49.
   The mCRPC samples were collected from three independent studies. Using
   a subset of the data, we built a predictor to distinguish localized
   from metastatic samples that can be generalized to the remaining
   dataset. We achieved consistent high prediction accuracy on the three
   independent cohorts, suggesting that the methylation profiles are
   indeed capturing common features shared among mCRPCs. Considering the
   highly heterogeneous nature of mCRPCs, such consistent and highly
   sensitive performance is of particular interest and suggests potential
   clinical utility, such as early detection of oligometastasis.
   Oligometastatic disease is increasingly recognized as an independent
   state with low-volume metastasis^[221]50. Here, we show the ability of
   the cfDNA methylome in detecting metastasis with very low to
   undetectable levels of genetic alterations, a characteristic likely
   shared with oligometastasis. Specialized effort is needed to explore
   and directly validate its utility in such cases.

   We observe pervasive hypermethylation in metastasis compared to
   localized PCa. The preponderance of hypermethylation in mCRPC compared
   with localized samples may be disease-specific, as a proportional
   amount of hypo and hyper DMRs has been observed in pancreatic cancer,
   while preponderant hyper DMRs were observed in head and neck squamous
   carcinoma, with the same cfMeDIP-seq method^[222]16,[223]32. These
   sites were enriched in gene promoters, CpG islands, and shores, sharing
   the same feature of frequent hypermethylation observed in the early
   stage of PCa^[224]37,[225]51, suggesting progressive maintenance of
   dysregulated DNA methylation. Such noticeably higher number of
   hypermethylation sites seemingly contradict previous observations of
   global hypomethylation in metastatic PCa^[226]37,[227]51. Several
   factors might contribute to these observed differences: (1) Most early
   studies assess the global methylation changes by analyzing the total
   methylated DNA and the results might not reflect changes at individual
   sites. (2) Different analytical methods might also result in
   paradoxical observations: while a meta-analysis showed that
   hypomethylation is associated with PCa^[228]51, a more recent
   pan-cancer analysis identified more hypermethylated sites^[229]52. (3)
   Complexity of the cfDNA origin in plasma can further contribute to the
   observed disparities. Compared to metastasis, localized cancers are
   known to have lower ctDNA fraction, thus comparisons of plasma samples
   are likely to simultaneously capture differences between normal and
   tumorous tissues. Despite the overall consistency observed between
   tissue and cell-free methylomes, such disparities highlighted the need
   to directly analyze the cell-free methylomes.

   In contrast to hyper-DMR, we observed no correlation between the
   overall hypo-DMR methylation and %ctDNA, suggesting non-tumor source,
   such as TME, of such variations. Tumor grows in a specialized
   microenvironment and induces extensive remodeling, including changes in
   their DNA methylation profiles, in the surrounding non-cancerous cells.
   These changes in the TME may in turn alter the cell-free
   methylome^[230]53. Indeed, we identified sites that are correlated with
   ALP and DHL, indicators associated with diseases residing in different
   environments, potentially reflective of TME-derived changes. Future
   effort is needed to dissect the cell sources of tumor-associated
   cell-free methylome changes.

   The enrichment of MHC-I immune gene signature in GR-DMR high patients
   and the inverse association with survival in localized and metastatic
   PCa are intriguing. While glucocorticoids are frequently prescribed as
   co-medication for the treatment of solid tumors including PCa, recent
   studies showed that high GR expression was associated with worse
   outcome in mCRPC^[231]54,[232]55. It is worth noting that when
   investigating the association between GR-DMR methylation and %ctDNA,
   the signals seem to be derived from different distributions, suggesting
   the existence of multiple subtypes. Such mixture can also be the
   combined result of the tumor and non-tumor TME components, as tumor
   cells are able to alter the DNA methylation profile of the TME, which
   can further change the cfDNA methylome^[233]56,[234]57. Dissecting the
   cells of origin of GR-DMR methylation can help shed light upon its
   underlying functional mechanisms. Moreover, we only observed moderate
   negative correlation in GR-DMR methylation and GR gene expression,
   whether the GR-DMR function through mediating GR transcription warrants
   further investigation.

   Hypomethylated sites identified in this study are enriched in repeat
   and particularly pericentromeric regions. While global hypomethylation
   of repeat classes like LINE1 elements in cancer is commonly
   acknowledged, pericentromeric DNA sequences are less investigated.
   Pericentromeric DNA hypomethylation in other solid tumors like breast
   cancer and melanoma has been associated with chromosome
   instability^[235]58,[236]59, in line with the increased genomic
   instability observed in metastatic PCa. Analysis of the DNA methylation
   in pericentromeric regions had been focused on sequences in chromosome
   1, while we showed here such hypomethylation can encompass most
   chromosomes and is more prevalent than previously realized. Additional
   effort is needed to understand the functional implication of these
   epigenetic dysregulations.

   In summary, we identified consistent methylation changes and created
   classifiers that can distinguish between localized and metastatic
   samples with high accuracy. The results presented multiple insights
   into the disease biology and demonstrated the biomarker potential for
   detecting metastatic lesions with a minimally invasive, highly
   sensitive, and cost-effective strategy.

Methods

   This work complies with all relevant ethical regulations. All patients
   provided informed written consent and all samples were obtained upon
   approval of the institutional ethics committee and Research Ethics
   Board at the University Health Network (UHN) and the University of
   British Columbia (UBC), with compliance with all relevant ethical
   regulations. CPC and Barrier samples were retrieved from the UHN GU
   Biobank (REB file numbers: 11-0024(CPC) and 13-7122(Barrier)). VPC
   cfDNA from plasma was retrieved from the Vancouver Prostate Centre
   (VPC), UBC (REB file number: H18-00944). The WCDT cfDNA samples were
   retrieved from the University of California San Francisco (UCSF). The
   OHS samples were retrieved from the Ontario Institute for Cancer
   Research (OICR).

Contact for reagent and resource sharing

   Further information and requests for resources should be directed to
   and will be fulfilled by Lead Contact, Housheng Hansen He
   (hansenhe@uhnresearch.ca).

Experimental models and subject details

Method details

Cell-free DNA isolation from plasma

   Peripheral blood was collected from cancer patients using EDTA
   anticoagulant tubes. Plasma samples were isolated from whole blood
   using the UHN Biobank centrifugation protocol and stored at the UHN
   Biobank. 500 µl–1 ml plasma aliquots retrieved from the Biobank were
   immediately stored at −80 °C for short-term use. The cfDNA was isolated
   from plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen)
   according to the manufacturer’s protocol and quantified by Qubit
   (Thermo Fisher Scientific) before use. Within each experimental batch,
   samples were randomized by disease status and performed blinded during
   cfMeDIP-seq wet-lab processing.

Cell-free methylated DNA immunoprecipitation and sequencing

   To prepare cfMeDIP libraries for sequencing, the original cfMeDIP-seq
   protocol was used^[237]16 on 5 ng of input cfDNA per sample. First, the
   samples underwent library preparation using Kapa HyperPrep Kit (Kapa
   Biosystems) for end-repair and A-tailing, following the manufacturer’s
   instructions. Samples were then ligated to 0.181uM of NEBNext adaptor
   (NEBNext Multiplex Oligos for Illumina kit, New England Biolabs) by
   incubating at 20 °C for 20 mins. The DNA was then purified with AMPure
   XP beads (Beckman Coulter). The library was then digested using USER
   enzyme (New England Biolabs) and then purified with Qiagen MinElute PCR
   purification kit (MinElute columns).

   The prepared libraries were then combined with 95 ng of filler DNA (λ
   phage), and then MeDIP was performed using the Diagenode MagMeDIP kit
   (C02010021) using a previously published protocol^[238]26. The filler
   DNA consists of a mixture of unmethylated and in vitro methylated λ
   amplicons of different CpG densities: 1 CpG site, 5 CpG sites, 10 CpG
   sites, 15 CpG sites, and 20 CpG sites; all similar in size to cfDNA.
   This filler DNA ensures a constant ratio of antibody to input DNA and
   minimizes non-specific binding by the antibody and prevents cfDNA loss
   due to binding to plasticware. Once the prepared library and filler DNA
   were combined, 0.3 ng of control methylated and 0.3 ng of control
   unmethylated Arabidopsis thaliana DNA and the buffers from the MagMeDIP
   kit were added, as per the manufacturer’s instructions. The mixture was
   heated to 95 °C for 10 min, then immediately placed on ice for 10 mins.
   Each sample was partitioned into two 0.2 ml PCR tubes: one for 10%
   input control (7.9 µl) and the other for the sample to be subjected to
   immunoprecipitation (79 µl). The included 5mC monoclonal antibody
   (C15200081) from the MagMeDIP kit was diluted to 1:15 before adding it
   to the immunoprecipitation sample. MagMeDIP magnetic beads were then
   washed 2× with prepared buffers from the kit and added to the sample
   before incubation at 4 °C for 17 h with rotation. The samples were
   purified using the Diagenode iPure Kit v2 (C03010015) and eluted in
   50 µl of buffer C.

   Quality control 1 (QC1) was performed by qPCR to detect recovery of the
   spiked-in methylated and unmethylated A. thaliana DNA. The recovery of
   methylated A. thaliana DNA should be >20%, unmethylated A. thaliana DNA
   should be <1% (relative to the input control and adjusted to input
   control being 10% of the overall sample), and the specificity of the
   reaction should be >99% (1−[recovery of spike-in unmethylated DNA/
   recovery of spike-in methylated DNA] × 100) to proceed. The PCR cycle
   number for library amplification was determined by qPCR (QC2) and
   should be <15 cycles to proceed, and the samples were amplified using
   Kapa HiFi Hotstart Mastermix and NEBNext multiplex oligos, added to a
   final concentration of 0.3uM. The final libraries were amplified as
   follows: activation at 95 °C for 3 min, # cycles: 98 °C for 20 s, 65 °C
   for 15 s, and 72 °C for 30 s, and a final extension of 72 °C for 1 min.
   The amplified libraries were purified using MinElute columns, then size
   selected to remove adaptor dimers by either using 3% Nusieve GTG
   agarose gel and subsequent get cutting, or Pippin Prep (Sage Science)
   following the manufacturer’s instructions. All the final libraries were
   then checked at TapeStation (Agilent) for library concentration,
   correct sizing, then pooled with six other cfMeDIP samples with
   different NEBNext barcodes. The pool of seven samples (per lane) was
   sequenced at 150 bp paired-end on Illumina HiSeq X ten.

Publicly available data

   The TCGA prostate adenocarcinoma (PRAD) 450 K methylation data (hg19
   based) were downloaded from the TCGA Data Portal
   ([239]https://tcga-data.nci.nih.gov/tcga/), including 50 normal tissue
   and 489 primary tumor samples. Associated clinical data and normalized
   gene expression were also obtained. The CPC-GENE (Canadian Prostate
   Cancer Genome Network) 450 K methylation data from 286 patients with
   localized prostate adenocarcinoma, matching normalized gene expression
   and clinic information (hg19 based) were obtained from the previous
   publication^[240]22. Processed whole-genome bisulfite sequencing (WGBS,
   hg38-based), RNA-seq, and clinical data for 194 Asian patients with
   localized tumors and matched healthy tissue were obtained from
   CPGEA^[241]46, the Chinese Prostate Cancer Genome and Epigenome Atlas
   ([242]http://www.cpgea.com). Processed WGBS and matched RNA-seq data
   (hg38-based) for 100 WCDT mCRPC (West Coast Dream Team, metastatic
   castration-resistant PCa) were obtained from the previous
   publication^[243]21. The WGBS data from WCDT and CPGEA cohorts were
   converted to methylation values for the same 300 bins used in
   cfMeDIP-seq data analysis. Methylation value was defined as the total
   number of methylated counts divided by the total number of (methylated
   and unmethylated) counts in this 300 bp region. The hg38 genome
   coordinates were converted to hg19 using liftOver (v1.10.0) R package
   with a chain file retrieved from the USCS genome browser
   ([244]https://genome.ucsc.edu/).

   Genes (CTAG1B, TSPY1, MAGEA3, and PAGE1) with promoter hypomethylation
   in PCa were obtained from the previous publication^[245]37. A total of
   136 PCa driver genes were obtained from previous publications and
   DriverDBv3, consisting of 57 oncogenes and 79 tumor suppressor
   genes^[246]38,[247]39. Bins within the promoter region (1 kb upstream
   of TSS) of these genes were compared in the cfMeDIP-seq data.

Quantification and statistical analysis

   Statistical analyzes were performed using R statistical environment
   (v3.6.1) (R Core Team, 2019). All tests were two-sided unless otherwise
   specified. The type of test method used for statistical analysis was
   specified in the text where the results were described and details for
   the test were explained in the relevant figure legend and method
   section.

Sequencing data preprocessing

   Human genome (hg19/ GRCh37) was downloaded from the University of
   California Santa Cruz (UCSC) genome browser
   ([248]https://genome.ucsc.edu/). The quality of raw reads was assessed
   using FastQC^[249]60 (v0.11.5) and MultiQC^[250]61 (v0.8). Trim Galore
   (v0.5.0, [251]https://github.com/FelixKrueger/TrimGalore) (“--phred33
   --stringency 3 --length 20 -e 0.1”) was used to remove adapters and
   trim poor-quality sequencing reads. After trimming, the reads were
   aligned to the human reference genome using BWA^[252]62 (v0.7.15) with
   default parameters. SAMtools^[253]63 (v1.3.1) with default settings was
   used to convert Sam to Bam format, filter out duplicates, sort and
   index the files and provide mapping statistics for the output. For
   paired-end data, we filtered for properly paired alignments using
   SAMtools^[254]63 (“-h -f 2 -F 512”).

Fragment size analysis

   The fragment sizes for each sample were calculated using the
   CollecInsertSizeMetrics function from Picard (v2.6.0)
   ([255]https://github.com/broadinstitute/picard) on the sorted bam
   files, setting the minimum percentage option to 0.5. The longer
   fragment ratio was defined as the proportion of the number of reads
   from 170 bp to 210 bp to the number of reads from100bp to 210 bp.

   For fragmentation profiles, customized scripts from previous
   study^[256]36 (http:github.com/Cancer-Genomics/delfi_scripts) were
   applied on sorted bam files to calculate fragment ratios in 5-Mb bins.

Calculation of DMRs

   DMRs between metastatic (67 VPC cohort samples) and localized (30 CPC
   cohort samples) PCa were identified using DESeq2^[257]64 (v1.24.0)
   while controlling for age differences. Before detecting DMRs, the count
   generated by MEDIPS^[258]65 (v1.34.0) was first converted into reads
   per kilobase per million mapped reads (RPKM) using the total number of
   reads as the library size. Only bins with higher than 5 RPKM in at
   least one sample across all PCa samples were retained (non-low coverage
   bins). The raw counts of these non-low coverage bins were used as input
   for DESeq2^[259]64. Bins with Benjamini-Hochberg adjusted p-value <0.05
   and absolute fold change greater than 2 were nominated as DMRs.

DMRs annotation and enrichment analysis

   The genomic annotations of DMRs were obtained using the R packages
   annotatr^[260]66 (v1.12.1), TxDb.Hsapiens.UCSC.hg19.knownGene (v3.2.2)
   and org.Hs.eg.db (v3.10.0) from Bioconductor^[261]67,[262]68. ChIP-seq
   data for important TFs were collected from the gene expression omnibus
   (GEO) (Supplementary Data [263]4). Customized annotation using the TF
   ChIP-seq data was performed using GenomicRanges^[264]69 (v1.38.0). To
   assess whether DMRs are enriched or depleted in the annotated regions,
   adjacent regions of DMRs were first merged using the ‘reduce’ function
   of GenomicRanges^[265]69. Association analysis was then performed by
   regioneR^[266]70 (v1.16.2) with a permutation test (1000 iterations).
   The 33,740 300 bp non-low bins were used as background regions. P value
   of 0.05 was used as a cutoff for significance.

Differential gene expression analysis between high and low GR-DMR methylation
groups

   The most noticeable hyper-DMR site (chr5:142782301-142782600, referred
   to as “GR-DMR”) in metastasis compared to localized plasma samples is
   in the promoter of GR (also known as NR3C1) gene. To investigate the
   transcriptional effect of this site, we used public datasets from the
   TCGA, CPGEA, CPC, and WCDT cohorts^[267]20,[268]21,[269]46,[270]47.
   Considering that this site has low coverage in most of the samples from
   these cohorts, the top 10 samples with the highest and lowest
   methylation values were selected for comparison. Samples were grouped
   into GR-high and GR-low groups according to methylation levels on the
   GR site, and differentially expressed genes (DEGs) were determined
   using matched RNA-seq data. The DEGs were identified using DESeq2
   (v1.24.0)^[271]64 with --FDR = 0.05, --log2FC = 1”.

   For the 450 K DNA methylation array from TCGA PRAD and CPC
   cohorts^[272]22, the beta value of the CpG site overlapping with this
   DMR region was regarded as the methylation signal of this region. The
   schematics of the GR gene were plotted using the R package Gviz
   (v1.30.3).

Gene enrichment analysis

   Gene enrichment analysis was performed using TCGAbiolinks^[273]71
   (v2.14.0) with an FDR of 0.01 as cutoff. For DEGs from tissue RNA-seq
   data, the upregulated and downregulated genes were analyzed separately
   in the functions or pathways enrichment analysis. For DMRs from plasma
   cfMeDIP-seq data, genes with DMRs located within the 5Kb upstream of
   transcription start sites were used.

Predicting CNAs using plasma cfMeDIP data

   To assess the ability of cfMeDIP-seq data on determining CNA events of
   the data, the CaSpER^[274]48 (v0.1.0) R package was used for analysis.
   Briefly, CaSpER first preforms data smooth on three different length
   scale. CNA detection was then performed by taking into account both the
   three-scale smoothed DNA methylation signals and whole-genome allelic
   shift profiles inferred from plasma cfMeDIP-seq data. On each scale: 1
   CNA states (gain, loss, or neutral) were assigned using Hidden Markov
   Model (HMM). 2 The genome-wide allelic shift profiles were estimated
   using Gaussian mixture model to correct CNA predictions with relatively
   low evidence of methylation signals. In our analysis, localized samples
   from the CPC cohort were used as reference. The final consistent CNA
   calls were defined as CNA identified by at least six times from all
   pairwise scale comparisons. For the VPC cohort, inferred CNA states
   were compared to the gold standard calls from panel sequencing obtained
   previously^[275]11.

Machine learning for diagnostic classification

   To evaluate the performance for diagnostic tumor classification based
   on plasma cfMeDIP-seq, we randomly selected an equal number of unique
   samples from the localized and mCRPC cohorts as training sets and use
   the remaining unique patients different from training sets as testing
   sets. For mCRPC samples from the Barrier and WCDT cohorts, 3 out of 14
   and 19 out of 22 unique samples were used, respectively. Feature
   selection and model construction were performed in training sets, and
   the model performance was evaluated in testing sets. Briefly, to reduce
   the impact of technical factors, first, the methylated values were
   corrected and normalized using sva^[276]72 (v3.32.1) and DESeq2^[277]64
   (v1.24.0), considering sequencing batch as a confounder. Second, DMRs
   between localized and metastatic patients from the training set were
   identified using DESeq2^[278]64 (v1.24.0) as described above. Third,
   the top 150 hyper-DMRs and hypo-DMRs were then selected by measuring
   information gain and used to build a classification model using
   randomForest (v4.6.14)
   ([279]https://www.stat.berkeley.edu/~breiman/RandomForests/). Finally,
   the performance of the randomForest classifier was evaluated on the
   testing set. The AUROC (area under the receiver operating
   characteristics) curves were estimated using the probability from the
   random forest model and used for visualization. This procedure was
   repeated 50 times.

Performance assessment

   We used several evaluation metrics to assess the classification
   performance of localized and metastatic PCa, including sensitivity (1),
   specificity (2), precision (3), accuracy (4), and F1 score (5). These
   indicators were also employed in the evaluation of CNA prediction by
   cfMeDIP-seq profiles.
   [MATH: <mi mathvariant="normal">Sensitivity</mi><mo>=</mo><mi
   mathvariant="normal">TP</mi><mo>/</mo><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">FN</mi></mrow><mo>)</mo></mrow> :MATH]
   1
   [MATH: <mi mathvariant="normal">Specificity</mi><mo>=</mo><mi
   mathvariant="normal">TN</mi><mo>/</mo><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">TN</mi><mo>+</mo><mi
   mathvariant="normal">FP</mi></mrow><mo>)</mo></mrow> :MATH]
   2
   [MATH: <mi mathvariant="normal">Precision</mi><mo>=</mo><mi
   mathvariant="normal">TP</mi><mo>/</mo><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">FP</mi></mrow><mo>)</mo></mrow> :MATH]
   3
   [MATH: <mi
   mathvariant="normal">Accuracy</mi><mo>=</mo><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">TN</mi></mrow><mo>)</mo></mrow><mo>/</mo><mrow><mo
   >(</mo><mrow><mi mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">TN</mi><mo>+</mo><mi
   mathvariant="normal">FP</mi><mo>+</mo><mi
   mathvariant="normal">FN</mi></mrow><mo>)</mo></mrow> :MATH]
   4
   [MATH: <mi mathvariant="normal">F1</mi><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">score</mi><mo>=</mo><mn>2</mn><mi
   mathvariant="normal">TP</mi><mo>/</mo><mrow><mo>(</mo><mrow><mn>2</mn><
   mi mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">FP</mi><mo>+</mo><mi
   mathvariant="normal">FN</mi></mrow><mo>)</mo></mrow> :MATH]
   5

   TP stands for true positive, TN for true negative, FP for false
   positive, and FN for false negative. Sensitivity (also known as Recall)
   indicates the fraction of positive patients that are correctly
   predicted. Specificity (also known as Selectivity) indicates the
   fraction of negative patients that are correctly predicted. Precision
   indicates the fraction of correctly identified positive patients to the
   total identified positive patients. Accuracy indicated the fraction of
   correctly identified patients to the total observed patients. F1 score
   is a comprehensive indicator calculated by combining precision and
   sensitivity, with a higher score representing better performance. AUROC
   curve was calculated using the R package ROCR (v1.07)^[280]73.

Survival analysis

   Kaplan–Meier plots were created using survival (v3.1.8) and survminer
   (v0.4.6), in which p value of survival between two groups was
   calculated using a log-rank test (cutoff p value = 0.05). For overall
   DMR analysis, we first calculated a hyper:hypo DMR ratio by dividing
   the mean methylated values of all hyper-DMRs by the mean methylated
   value of all hypo-DMRs. mCRPC samples from the VPC cohort were then
   split into high and low according to the median ratio (0.9786). For the
   fragment analysis, the fragment value was defined as the fragment ratio
   of the number of reads from 170 bp to 210 bp to the number reads from
   100 bp to 210 bp. Patients were classified into shorter or longer
   fragment groups based on the median value of the fragment ratio. For
   GR-DMR-related analysis, we divided the patients into high
   GR-methylated and low GR-methylated groups according to the median
   value of the GR-DMR methylation. A similar analysis was performed for
   mRNA expression between patients with high and low GR gene expression.

   The overall survival (OS) and PFS survival were used for the metastatic
   samples from the VPC cohort, and the OS also was used for the
   metastatic samples from the WCDT cohort. The biochemical
   recurrence-free survival was used for the localized samples from CPC
   and CPGEA cohorts. The clinic endpoints of recurrence-free survival
   were used for the 498 TCGA samples.

Repeat region analysis

   To analyze repeat regions, we used a peak strategy to summarize
   cfMeDIP-seq data signal. We first filtered bam files using samtools
   (v1.3.1) to obtain high-quality primary alignments (-F 1804). Next, we
   used the MACS^[281]74 (v2.2.5) “callpeak” function to generate
   narrowPeak on all samples with --SPMR parameter to generate a
   normalized pileup file. Pileup files from the peak calling step were
   converted to bigWig files using the ucsctools (v378). Peak files from
   all samples were merged to create a peak catalog. Mean signal
   intensities were summarized for each of the intervals in the peak
   catalog using bwtool (v1.0) from sample bigWig files.

Reporting summary

   Further information on research design is available in the [282]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [283]Supplementary Information^ (1.2MB, pdf)
   [284]41467_2022_34012_MOESM2_ESM.pdf^ (391.3KB, pdf)

   Description of Additional Supplementary Files
   [285]Supplementary Data 1^ (161.6KB, xlsx)
   [286]Supplementary Data 2^ (200.7MB, csv)
   [287]Supplementary Data 3^ (394.9MB, csv)
   [288]Supplementary Data 4^ (9.6KB, xlsx)
   [289]Reporting Summary^ (3.1MB, pdf)

Acknowledgements