Graphical abstract

   graphic file with name fx1.jpg
   [59]Open in a new tab

Highlights

     * •
       MSIMEP, first DNA methylation-only based microsatellite instability
       predictor
     * •
       Optimized and reduced versions with 400 and 25 predictors,
       respectively
     * •
       Consistent performance across many different colorectal cancer
       cohorts
     * •
       Better performance than an MLH1 promoter methylation-based model
     __________________________________________________________________

   Medical informatics; Genomic analysis; Cancer systems biology.

Introduction

   Microsatellites are short tandem repeat DNA sequences spread throughout
   the human genome. Because of their highly repetitive nature, these
   sequences have a higher propensity for acquiring mutations. Deficiency
   in DNA mismatch repair (MMR) activity results in a hypermutator
   phenotype, termed microsatellite instability (MSI), characterized by
   the presence of single nucleotide substitutions or insertion-deletion
   mutations within these microsatellites.[60]^1 The MMR deficiency
   resulting from germline mutations or epigenetic alterations in any of
   the MMR genes (MLH1, MSH2, MSH6, and PMS2), as well as deletions in the
   EPCAM gene, is the cause of Lynch syndrome (LS) and its variants.[61]^2
   One of the most aggressive, highly penetrant childhood cancer
   predisposition syndromes, the constitutional MMR deficiency syndrome,
   is caused by homozygous germline mutations in any of the four MMR
   genes. Furthermore, LS can result from mosaic germline MLH1
   epimutations. In contrast, biallelic MLH1 promoter methylation is
   primarily the key somatic event responsible for the loss of MLH1
   expression in ∼75% of sporadic cancers with MSI.[62]^1^,[63]^2 Although
   MMR deficiency/MSI determination has been classically the first step in
   LS screening algorithms, today MSI has gained importance as predictive
   biomarker for various anti-PD-1 therapies across many different tumor
   types (tumor-agnostic indication) and particularly in colorectal and
   endometrial cancers.[64]^3 Therefore, evaluation of MSI, through either
   PCR-based assays or immunohistochemistry, has become a routine clinical
   practice for various cancer types. With the emergence of
   next-generation sequencing-based technologies, alternate computational
   methods to infer MSI using DNA-targeted, whole exome or whole genome
   sequencing data (e.g., MSIsensor, MSIsensor-pro, mSINGS, MOSAIC, and
   MANTIS) have been developed.[65]^4^,[66]^5^,[67]^6^,[68]^7^,[69]^8
   Other computational approaches, instead of observing microsatellites
   directly to evaluate MSI, use orthogonal prediction methods based on
   gene expression[70]^9^,[71]^10^,[72]^11^,[73]^12^,[74]^13 or a
   combination of DNA methylation levels and mutations in MMR pathway
   genes (MIRMMR).[75]^14 Based on this and taking into consideration the
   fact that MSI-high tumors frequently exhibit a hypermethylated
   phenotype,[76]^15 herein we propose an MSI prediction method based on
   microarray DNA methylation tumor tissue profiling. We explore the
   underlying features that predict MSI and develop a reduced version with
   potential applicability to liquid biopsy samples.

Results

Model development and optimization

   First, to find differentially methylated CpG probes between MSI and MSS
   tumors, we interrogated Infinium Human Methylation 450K array data
   obtained from 388 colorectal primary tumors from the pooled The Cancer
   Genome Atlas (TCGA) Colon Adenocarcinoma (COAD)/Rectum Adenocarcinoma
   (READ) cohort. After carrying out methylation probes filtering, data
   imputation, and quality control (see [77]STAR Methods), we identified
   780 probes with an absolute Δβ>0.3 and a B-H false discovery rate
   (FDR)-adjusted p value<0.05. Of these probes 758 (97.2%) were
   hypermethylated and 22 (2.8%) hypomethylated ([78]Figure S1). Then,
   these 780 differentially methylated CpG probes were used to train 7
   supervised machine learning models on the same dataset. The generalized
   linear model (GLM) implemented in GLMNET yielded the best performance
   after five repeats of 10-fold cross-validation (accuracy = 0.98,
   kappa = 0.92). This performance was reached when setting alpha = 0.1
   and lamda = 0.0182 as tuned parameters. GLM performance was
   significantly better when compared against the second (Wilcoxon test
   Holm family-wise error rate (FWER)-adjusted p value = 0.041) and third
   (Wilcoxon test Holm FWER-adjusted p value = 0.017) best alternative
   models ([79]Figure 1A).

Figure 1.

   [80]Figure 1
   [81]Open in a new tab

   Development of the MSIMEP classifier

   (A) Evaluation of alternative algorithms for MSI/MSS phenotype
   prediction from array methylation tumor profiles. Model performance was
   assessed through 5 rounds of 10-fold cross-validation on the TCGA
   COAD/READ dataset, and an optimal algorithm (GLM) was selected based on
   significant improvement in classification accuracy. Alternative
   algorithms (Neural Network, Random Forest, SVM, Decision Tree, KNN, and
   Naive Bayes) are displayed sorted by average accuracy, with Holm
   FWER-adjusted p values showing statistical differences against GLM.

   (B) Heatmap for the methylation tumor profile (β-values) of TCGA
   COAD/READ patients for the CpG probes incorporated into the MSIMEP
   optimized model. CpG probes (x axis) are displayed following
   hierarchical clustering criteria, and TCGA COAD/READ patients (y axis)
   are grouped by MSI/MSS phenotype (green = MSI, red = MSS).

   Once the best model was selected, we carried out recursive feature
   elimination (RFE) for model optimization ([82]Figure S2). Optimized
   model (alpha = 0.1, lamda = 0.0182, accuracy = 0.98, kappa = 0.92, area
   under the receiver operating characteristic curve (AUROC) = 0.98,
   sensitivity = 0.88, and specificity = 1.00), hereafter MSIMEP, is
   composed of 400 CpG probes ([83]Figure 1B and [84]Table 1).
   Importantly, those cases incorrectly classified by our model presented
   a MANTIS score statistically significantly closer to 0.4,[85]^8 the
   average distance threshold established to differentiate MSI from MSS
   tumors by that computational method, than those properly classified
   (MSS vs failed, Wilcoxon test p value<0.001; failed vs MSI, Wilcoxon
   test p value = 0.009) ([86]Figure S3). Importantly, MSIMEP performance
   was consistent across different clinical and molecular subgroups
   ([87]Figures S4 and [88]S5).

Table 1.

   MSIMEP optimized model performance in different cohorts
   Cohort Cancer type Phase Sample size (n) MSI (%) MSS (%) Accuracy Kappa
   AUROC Sensitivity Specificity Precision
   TCGA COAD/READ CRC Training 387 15 85 0.98 0.92 0.98 0.88 1.00 0.99
   External I CRC Validation 79 14 86 0.97 0.90 0.99 1.00 0.97 0.85
   External II CRC Validation 262 15 85 0.95 0.83 0.95 0.88 0.97 0.83
   External III CRC Validation 81 11 89 0.86 0.52 0.86 0.89 0.86 0.44
   External IV CRC Validation 95 100 NA NA NA NA 0.86 NA 1.00
   External Pooled CRC Validation 517 30 70 0.93 0.83 0.93 0.88 0.95 0.88
   TCGA-STAD STAD Exploratory 394 18 82 0.84 0.56 0.94 0.90 0.82 0.52
   TCGA-UCEC UCEC Exploratory 420 31 69 0.88 0.70 0.92 0.72 0.95 0.86
   [89]Open in a new tab

   Additionally, considering a potential applicability to liquid biopsy
   samples and assuming a limited sacrifice of accuracy (see [90]STAR
   Methods), we developed a reduced version of MSIMEP composed of 25 CpG
   probes ([91]Figures S1, [92]S4, and [93]S6).

   Even though MSlMEP is intended for colorectal cancer (CRC), we were
   interested to test its performance in other tumor types with high
   prevalence of MSI such as gastric and endometrial cancers. For this
   purpose, we interrogated Infinium Human Methylation 450K array data
   from the TCGA-Stomach Adenocarcinoma (STAD) and TCGA-Uterine Corpus
   Endometrial Carcinoma (UCEC) cohorts. The optimized model yielded
   accuracy = 0.84, kappa = 0.56, AUROC = 0.94, sensitivity = 0.9,
   specificity = 0.82, and precision = 0.52, and accuracy = 0.88, kappa =
   0.70, AUROC = 0.92, sensitivity = 0.72, specificity = 0.95, and
   precision = 0.86 in STAD and UCEC, respectively ([94]Figure S7).
   Although far from perfect, reduced MSIMEP yielded accuracy = 0.75,
   kappa = 0.43, AUROC = 0.93, sensitivity = 0.94, specificity = 0.71, and
   precision = 0.41, and accuracy = 0.89, kappa = 0.73, AUROC = 0.93,
   sensitivity = 0.82, specificity = 0.91, and precision = 0.81 in STAD
   and UCEC, respectively ([95]Figure S7).

Underlying features of the MSIMEP model

   The composition of the MSIMEP is based on a 400 CpG probe set spread
   through the 22 human autosomes ([96]Figure 2A). Chromosomes 3, 1, and 6
   have the largest number of MSIMEP probes (41, 34, and 28 CpGs,
   respectively), and the highest density is present on chromosome 19 (21
   CpGs, 0.37 CpGs/Mb). Overall, MSIMEP probes tend to be located on high
   GC content regions (88%), which reflects not only the composition of
   the Illumina 450K Infinium Array but also the intrinsic CpG island
   methylator phenotype of tumors with MSI.

Figure 2.

   [97]Figure 2
   [98]Open in a new tab

   Features of the MSIMEP model

   (A) Genomic distribution of MSIMEP CpG probes. The x axis shows the CpG
   coordinates across the human autosomes (chromosomes 1–22). The y axis
   shows relative density of MSIMEP CpG probes (top), and the relative
   importance of each CpG for the GLM classification model (bottom).
   Probes are classified as genic (green), intergenic (gray), or
   cancer-related (red). Cancer-related probes are further classified
   according to the role/s in cancer (tumor suppressor gene, oncogene,
   fusion gene, or dual role) of the associated gene. Probes associated
   with regulatory elements are highlighted with a central dot. Location
   of the MLH1 gene is shown.

   (B) Top 20 genes profiled by MSIMEP, ranked by the cumulative
   importance for the GLM model of the associated CpG probes. Dot size
   reflects the number of probes associated with each gene. Cancer genes
   are highlighted with an asterisk.

   (C) Comparison of the distribution of hypermethylated and
   hypomethylated MSIMEP CpG probes (MSI vs MSS tumors from TCGA COAD/READ
   cohort) according to gene and CpG island context regions.

   (D) Pathway enrichment analysis showing overrepresented KEGG terms in
   MSIMEP at promoter level (n = 134 CpG probes), ranked by B-H
   FDR-adjusted p values.

   From the genomic perspective, 63 MSIMEP probes (15.8%) are located at
   intergenic regions, whereas 337 probes (84.3%) are associated with
   coding genes. Remarkably, 34 MSIMEP probes (8.5%) are associated with
   human cancer genes (COSMIC), including 19 probes associated with genes
   involved in cancer syndromes such as MLH1, EXT1, WRN, and AXIN2.
   Twenty-four MSIMEP probes (6%) are associated with tumor suppressor
   genes, 13 probes (3.3%) with fusion genes, and six probes (1.5%) with
   proto-oncogenes ([99]Figure 2A). The top-five relevant genes for the
   MSIMEP predictive model are MLH1, TRIP10, C1orf95, EXT1, and GPR160,
   respectively ([100]Figure 2B). In addition, when evaluating the
   composition of MSIMEP from a functional perspective, 250 probes (62%)
   are located at regulatory elements. Compared with the Infinium Human
   Methylation 450K array, MSIMEP presents a higher proportion of
   regulatory regions at both genic (χ^2 test p value = 0.003) and
   intergenic regions (χ^2 test p value = 0.012).

   Considering their relevance for gene structure, 219 probes (54.8%) are
   located within a proximal promoter region (between 1.5 kb upstream of
   transcription start site and the first exon). One hundred and nine
   probes (26.3%) are located within the gene body, and 13 probes (3.3%)
   are located within the 3′UTR. When comparing MSI vs MSS tumors from
   TCGA COAD/READ cohort, hypermethylated probes are predominantly located
   on promoter regions (Fisher exact test p value<0.001, hypermethylated
   vs hypomethylated), whereas hypomethylated probes are mainly located on
   gene bodies (Fisher exact test p value<0.001, hypermethylated vs
   hypomethylated) ([101]Figure 2C).

   According to the CpG landscape, 231 MSIMEP probes are located at CpG
   islands (57.8%), and 107 probes are located on their surroundings: 89
   on CpG shores (22.3%) and 18 on CpG shelves (4.5%). Sixty-two MSIMEP
   probes are located on open sea genomic regions (15.5%). Again, when
   comparing MSI vs MSS tumors from TCGA COAD/READ cohort, hypermethylated
   probes are predominant on CpG islands (Fisher exact test p value<0.001,
   hypermethylated vs hypomethylated), whereas hypomethylated probes
   typically are on open sea regions (Fisher exact test p value<0.001,
   hypermethylated vs hypomethylated) ([102]Figure 2C).

   The composition of the MSIMEP reduced model is based on a 25 CpG probe
   set ([103]Table 2, [104]Figure S1), conformed by 4 probes located at
   intergenic regions, and 21 CpG probes associated with genes. Among
   those probes associated with genes, 9 CpGs are attributed to 3 relevant
   cancer genes (MLH1, EXT1, and WNK2, all well-established [Tier I] tumor
   suppressor genes).

Table 2.

   Description of the 25 CpG probes of the MSIMEP reduced model
   CpG probe Chromosome Position Strand Gene Gene region Methylation
   status[105]^a (MSI vs MSS)
   cg15592945 1 226736711 F C1orf95 1st Exon Hypermethylated
   cg03745431 1 226736713 F C1orf95 1st Exon Hypermethylated
   cg17621259 3 37035168 F MLH1 TSS200 Hypermethylated
   cg14671526 3 37035200 R MLH1 TSS200 Hypermethylated
   cg27331401 3 37035207 F MLH1 TSS200 Hypermethylated
   cg06590608 3 37035228 F MLH1 TSS200 Hypermethylated
   cg11224603 3 37035282 R MLH1 1st Exon Hypermethylated
   cg14598950 3 37035355 F MLH1 1st Exon Hypermethylated
   cg04563996 3 57093519 R ARHGEF3 5′UTR Hypomethylated
   cg15048832 3 150130775 R TSC22D2 Body Hypermethylated
   cg12350863 3 169758289 F GPR160 5′UTR Hypermethylated
   cg20288341 6 31683131 F LY6G6E TSS1500 Hypermethylated
   cg06226516 7 13072314 R NA IGR Hypermethylated
   cg02200207 7 29605237 R PRR15 5′UTR Hypermethylated
   cg05313153 8 119122430 F EXT1 1st Exon Hypermethylated
   cg21602557 8 119122878 R EXT1 1st Exon Hypermethylated
   cg13563298 9 95948059 R WNK2 Body Hypermethylated
   cg27240158 11 12698019 R TEAD1 5′UTR Hypermethylated
   cg19524009 13 52735075 R NEK3 TSS1500 Hypermethylated
   cg06223834 16 4103161 R ADCY9 Body Hypermethylated
   cg09294739 16 55218782 R NA IGR Hypomethylated
   cg11582717 17 63452460 R NA IGR Hypermethylated
   cg13860006 19 6741178 R TRIP10 Body Hypermethylated
   cg06410591 19 6741181 R TRIP10 Body Hypermethylated
   cg05232694 20 48809539 R NA IGR Hypermethylated
   [106]Open in a new tab

   F, forward; IGR, intergenic region; NA, not applicable; R, reverse.
   ^a

   Methylation status is described based on the comparison of MSI vs MSS
   tumors from TCGA COAD/READ cohort.

   Finally, to fully characterize MSIMEP from a biological viewpoint, we
   carry out a gene set enrichment analysis. As expected, when evaluated
   at promoter level, this analysis revealed an enrichment of important
   pathways involved in mismatch repair (B-H FDR-adjusted p value<0.001),
   colorectal and endometrial cancers (B-H FDR-adjusted p value<0.001),
   and MAPK signaling (B-H FDR-adjusted p value = 0.015) ([107]Figure 2D).
   When evaluated either all CpG probes or those located at body gene
   level, mismatch repair (B-H FDR-adjusted p value<0.001), and pathways
   in cancer (B-H FDR-adjusted p value<0.001) appear among the most
   significant pathways, confirming again the biological correlation of
   MSIMEP with the MSI phenotype in cancer. Other significantly enriched
   pathways are described in [108]Figure S8. Moreover, through
   computational immune cell deconvolution techniques we confirmed that
   MSIMEP-predicted MSI CRCs preserve the same immune cell context than
   those MSI cases originally classified through standard PCR-based assays
   in the TCGA COAD/READ cohort ([109]Figure S9).

External validation of MSIMEP models

   To further confirm the accuracy of MSIMEP models in predicting MSI
   status in CRC samples, we evaluated its performance in four independent
   CRC cohorts.

   First, we tried MSIMEP in the external cohort I. In this cohort, MSIMEP
   model yielded promising results when evaluated in either invasive front
   (accuracy = 0.99, kappa = 0.95, AUROC = 1.00, sensitivity = 1.00,
   specificity = 0.99, and precision = 0.92), or in luminal (accuracy =
   0.99, kappa = 0.95, AUROC = 0.99, sensitivity = 1.00, specificity =
   0.99, and precision = 0.92) or center (accuracy = 0.97, kappa = 0.90,
   AUROC = 0.99, sensitivity = 1.00, specificity = 0.97, and precision =
   0.85) regions ([110]Table 1, [111]Figure 3A). Importantly, the
   concordance of predictions among the three different tumor regions per
   case was extremely high (Cochran’s Q test p value = 0.607, no
   statistically significant differences found in predictions by tumor
   region) ([112]Figure 3B).

Figure 3.

   [113]Figure 3
   [114]Open in a new tab

   MSIMEP evaluation in CRC external cohort I

   (A) Barplots showing complementary metrics (accuracy, kappa, AUROC,
   sensitivity, specificity, and precision) for the evaluation of MSIMEP
   classification capacity of MSI/MSS patients across different CRC
   regions (invasive front, luminal, and center). Models: MSIMEP optimized
   (red) and MSIMEP reduced (blue).

   (B) Venn diagram showing the concordance in MSI/MSS prediction across
   different CRC regions from external cohort I patients. Cohort: external
   I (n = 79, 14% MSI).

   Next, we evaluated MSIMEP in cohorts II and III, showing again
   consistent results (cohort II: accuracy = 0.95, kappa = 0.83, AUROC =
   0.95, sensitivity = 0.88, specificity = 0.97, and precision = 0.83;
   cohort III: accuracy = 0.86, kappa = 0.52, AUROC = 0.86, sensitivity =
   0.89, specificity = 0.86, and precision = 0.44) ([115]Table 1,
   [116]Figures 4A and 4B). Moreover, we assessed the predictive potential
   of MSIMEP in cohort IV, which is only composed of MSI-H cases. Despite
   that this represents a challenging cohort due to the lack of MSS cases,
   MSIMEP yielded a sensitivity of 0.86 ([117]Table 1, [118]Figure 4C).

Figure 4.

   [119]Figure 4
   [120]Open in a new tab

   MSIMEP evaluation in CRC external cohorts II–IV

   (A–C) Barplots showing complementary metrics (accuracy, kappa, AUROC,
   sensitivity, specificity, and precision) for the evaluation of MSIMEP
   classification capacity of MSI vs MSS patients. ROC curves showing
   MSIMEP classification performance at alternative predictive thresholds.
   Models: MSIMEP optimized (red) and MSIMEP reduced (blue). Circle chart
   showing the fraction of detected cases in a cohort of MSI patients
   (ID-001 to ID-095). (A) Cohort External II (n = 262, 15% MSI), (B)
   External cohort III (n = 81, 11% MSI), and (C) external cohort IV (n =
   95, 100% MSI).

   Additionally, we evaluated MSIMEP performance across different
   important clinical and molecular subgroups in a pooled CRC cohort
   (external cohorts I to IV). As expected, MSIMEP yielded consistent
   results, even when stratified by MLH1 promoter methylation status, a
   challenging predictive scenario considering the high relative
   importance of MLH1 CpG probes in MSIMEP model ([121]Figure S10).

   MSIMEP reduced model performance was also robust in all the external
   cohorts (I to IV) ([122]Table 3, [123]Figures 3A and [124]4) and across
   all subgroups in the pooled cohort ([125]Figure S11).

Table 3.

   MSIMEP reduced model performance in different cohorts
   Cohort Cancer type Phase Sample size (n) MSI (%) MSS (%) Accuracy Kappa
   AUROC Sensitivity Specificity Precision
   TCGA COAD/READ CRC Training 387 15 85 0.98 0.92 0.99 0.86 1.00 0.95
   External I CRC Validation 79 14 86 0.97 0.90 0.99 1.00 0.97 0.85
   External II CRC Validation 262 15 85 0.93 0.75 0.96 0.88 0.94 0.73
   External III CRC Validation 81 11 89 0.80 0.40 0.86 0.89 0.79 0.35
   External IV CRC Validation 95 100 NA 0.86 NA NA 0.86 NA 1.00
   External pooled CRC Validation 517 30 70 0.91 0.78 0.93 0.88 0.92 0.82
   TCGA-STAD STAD Exploratory 394 18 82 0.75 0.43 0.93 0.94 0.71 0.41
   TCGA-UCEC UCEC Exploratory 420 31 69 0.89 0.73 0.93 0.82 0.91 0.81
   [126]Open in a new tab

   NA, not applicable.

   Altogether, these results suggest that MSIMEP models can be used to
   predict MSI status from microarray DNA methylation profiles from both
   fresh-frozen FF and formalin-fixed paraffin-embedded (FFPE) tumor CRC
   samples.

Comparison of MSIMEP models with a MLH1 promoter methylation-based model

   Taking into consideration the role of biallelic MLH1 promoter
   methylation in most of the sporadic CRCs with MSI, we compared the
   performance of our MSIMEP models with a MLH1 promoter methylation-based
   model ([127]Table S1) in a pooled CRC cohort (external cohorts I to
   IV). As expected, MSIMEP optimized and reduced models yielded a better
   sensitivity (0.88 for both MSIMEP models vs 0.71 for MLH1 promoter
   methylation-based model) at the expense of a slightly lower specificity
   than the MLH1 promoter methylation-based model (0.95 and 0.92 for
   MSIMEP optimized and reduced models, respectively, vs 0.96 for MLH1
   promoter methylation-based model) ([128]Figure 5). Congruent results
   were obtained when stratified by MLH1 promoter methylation status,
   confirming the increased sensitivity of MSIMEP models to detect those
   non-MLH1 dependent cases ([129]Figure S12).

Figure 5.

   [130]Figure 5
   [131]Open in a new tab

   Comparison of the performance of three GLM models in a pooled CRC
   cohort (external cohorts I to IV)

   Barplots showing complementary metrics (accuracy, kappa, AUROC,
   sensitivity, specificity, and precision) for the evaluation of MSIMEP
   classification capacity of MSI vs MSS patients. ROC curves showing
   MSIMEP classification performance at alternative predictive thresholds.
   Models: MSIMEP optimized (red), MSIMEP reduced (blue), and promoter
   MLH1 (green). Cohort: external pool (n = 517, 30% MSI).

Discussion

   We developed MSIMEP, a computational tool for predicting MSI status
   from microarray DNA methylation profiles of CRC samples. We
   demonstrated the accuracy and robustness of MSIMEP by testing it in
   multiple CRC cohorts with varying tumor sample types (either FF or
   FFPE) from two different Illumina platforms (either Infinium Human
   Methylation 450K or Infinium MethylationEPIC arrays). Additionally, we
   tested its performance in other tumor types with high prevalence of MSI
   such as gastric and endometrial cancers, obtaining consistent results.

   As described by others,[132]^16 DNA methylation profiling represents a
   solid approach for blood-based liquid biopsy due to, among other
   aspects, the better limit of detection of methylation-based approaches
   and the advantage of avoiding somatic mutations derived from normal
   tissues, benign diseases, and clonal hematopoiesis of indeterminate
   potential.[133]^16^,[134]^17 Acknowledging this fact, we developed and
   validated the MSIMEP reduced model, which could be adapted, and
   facilitate its translation to the clinic as a cost-effective liquid
   biopsy test.

   To explore the underlying features of our MSI predictive model, we
   assessed the relative importance of the different predictors of MSIMEP
   in their genomic context. Although probes located at intergenic regions
   numerically represented only 15.8% of the total of predictors included
   in MSIMEP, taken as a whole, they accumulated a major relative
   importance in the model. Although the biological interpretation of this
   finding is complex and out of the scope of the present study,
   considering that intergenic regions may contain important functional
   sequences such as promoters and other regulatory elements, their role
   in the MSI phenotype and its regulation by DNA methylation deserve
   further investigation. Regarding probes located in genic regions, their
   highest proportion was in MLH1, reflecting the role of the methylation
   of its promoter as the primary somatic event responsible for the
   majority of sporadic MSI-H CRCs.[135]^18

   Additionally, confirming our expectations, (1) GSEA revealed an
   enrichment of pathways involved not only in mismatch repair but also in
   colorectal and endometrial cancers and (2) computational immune cell
   deconvolution techniques confirmed a consistent immune cell context
   between MSIMEP predicted and PCR-based MSI CRCs.

   Last, and after confirming MSIMEP performance across different
   important clinical and molecular subgroups, we carried out a
   face-to-face comparison of our MSIMEP models with a MLH1 promoter
   methylation-based model in a pooled CRC cohort composed of external
   cohorts I to IV, which confirmed a higher sensitivity of MSIMEP to
   detect MSI samples, mainly due to an increased capacity to detect those
   non-MLH1-dependent cases.

   In summary, MSIMEP could have useful applications in both basic and
   translational research by providing a cost-effective technique to
   characterize MSI status alongside methylation profiles in CRC. Future
   studies are needed to open MSIMEP to a wider range of tumor types,
   platforms, and biospecimens.

Limitations of the study

   Our tool MSIMEP has several limitations. First, because the training
   data used for its development comes from Infinium Human Methylation
   450K arrays, it should be adapted before moving to other types of
   platforms. Second, because MSIMEP is focused on the prediction of MSI
   in CRC, its performance in other tumor types such as gastric cancer or
   those neoplasms with low prevalence of MSI deserves further
   investigation. Third, the evaluation of MSI status of tumors by using
   distinct PCR-based assays introduced bias that only could be solved
   with a centralized reevaluation. All these aspects, together with the
   assessment of the MSIMEP reduced model in circulating tumor DNA
   samples, represent areas of improvement to be prioritized to realize
   the translation of MSIMEP to the daily clinical practice. Furthermore,
   it would be of interest to evaluate if MSIMEP offers an increased value
   in detecting potential candidates for anti-PD-1 therapies beyond
   PCR-based MSI status.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   TCGA Colon Adenocarcinoma GDC Data Portal TCGA-COAD
   TCGA Rectum Adenocarcinoma GDC Data Portal TCGA-READ
   TCGA Colon and Rectal Cancer UCSC Xena Portal TCGA-COADREAD
   TCGA Stomach Carcinoma GDC Data Portal TCGA-STAD
   TCGA Uterine Corpus Endometrial Carcinoma GDC Data Portal TCGA-UCEC
   External cohort I GEO [136]GSE69550
   External cohort II EGA EGAS00001004293
   External cohort III ArrayExpress E-GEOD-68060
   External cohort IV JGA JGAS00000000113
   UCSC Genome Browser GRCh37 Kent et al., 2002[137]^19
   [138]https://genome.ucsc.edu
   COSMIC v96 - Cancer Gene Census Sondka et al., 2018[139]^20
   [140]https://cancer.sanger.ac.uk/cosmic
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   R R Core Team, 2021[141]^21 [142]https://www.R-project.org/
   ChAMP v2.24.0 Morris et al., 2014[143]^22
   [144]https://github.com/YuanTian1991/ChAMP
   caret v6.0-91 Kuhn, 2008[145]^23 [146]https://github.com/topepo/caret
   methylGSA v1.12.0 Ren and Kuan, 2019[147]^24
   [148]https://github.com/reese3928/methylGSA
   MLeval v0.3 John, 2020[149]^25
   [150]https://CRAN.R-project.org/package=MLeval
   VennDiagram v1.7.1 Chen and Boutros, 2011[151]^26
   [152]https://github.com/cran/VennDiagram
   CIBERSORTx Newman et al., 2019[153]^27
   [154]https://cibersortx.stanford.edu/
   MSIMEP This paper [155]https://gitlab.com/mobilegenomesgroup/msimep
   [156]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Juan Ruiz-Bañobre
   ([157]juan.ruiz.banobre@sergas.es).

Materials availability

   This study did not generate new unique reagents or material.

Experimental model and subject details

TCGA cohorts

   Methylation datasets were downloaded from the GDC Data Portal (Project
   IDs TCGA-COAD, TCGA-READ, TCGA-STAD and TCGA-UCEC). Experimental
   Plat-form was Infinium Human Methylation 450K from Illumina. Downloaded
   data corresponded to 353 TCGA-COAD samples, 106 TCGA-READ samples,
   TCGA-STAD 397 samples and TCGA-UCEC 485 samples ([158]Data S1). Only
   450K methylation profiles from fresh-frozen (FF) primary solid tumors
   samples were included.

   Clinical information was downloaded from the GDC Data Portal [Project
   IDs: TCGA-COAD (458 cases), TCGA-READ (172 cases), TCGA-STAD (443
   cases), TCGA-UCEC (560 cases)]. Gene expression and single nucleotide
   variant information was downloaded from GDC Data Portal (TCGA-COAD)
   ([159]Data S2 and [160]Data S3). Additionally, CIMP status information
   of samples from TCGA COAD/READ cohort was downloaded from the UCSC Xena
   Portal (dataset ID: TCGA.COADREAD.sampleMap/COADREAD_clinicalMatrix).

   MSI status was evaluated through PCR-based assays as previously
   described.[161]^28^,[162]^29^,[163]^30 For each patient MSI status was
   encoded as a binary phenotype after considering MSS and MSI-Low
   profiles as MSS, and MSI-High profiles as MSI, in order to maximize the
   model predictive capacity toward relevant cases, as recommended by
   previous studies.[164]^7

   Among downloaded cases, only those from primary solid tumors with
   available DNA methylation beta (β)-values and MSI status information
   were used. From the pooled TCGA-COAD/READ cohort, 387 out of 459 cases
   fulfilled these criteria ([165]Table S2A). From TCGA-STAD and TCGA-UCEC
   cohorts, 394 out of 397 cases and 420 out of 485 cases fulfilled these
   criteria, respectively.

External cohorts

External cohort I

   Methylation dataset was downloaded from the Gene Expression Omnibus
   (GEO) data repository (GEO accession number: [166]GSE69550).
   Experimental Plat-form was Infinium Human Methylation 450K from
   Illumina. Downloaded data correspond to 79 formalin-fixed
   paraffin-embedded (FFPE) colorectal primary tumors. Three regions per
   primary tumor were selected: the region nearest the digestive tract
   surface (luminal), the central bulk (center), and the invasive front
   (front). MSI status was provided by the corresponding author upon
   request ([167]Table S2B).[168]^31

External cohort II

   Methylation dataset was downloaded from the European Genome-phenome
   Archive (EGA accession number: EGAS00001004293). Experimental Plat-form
   was Infinium Human Methylation 450K from Illumina. Downloaded data
   correspond to 262 FF colorectal primary tumors. MSI status and clinical
   characteristics were provided by the corresponding author upon request
   ([169]Table S2C).[170]^32

External cohort III

   Methylation dataset and MSI status information were downloaded from the
   ArrayExpress Archive of Functional Genomics Data repository (accession
   number: E-GEOD-68060). Experimental Plat-form was Infinium Human
   Methylation 450K from Illumina. Downloaded data correspond to 81 FF
   colorectal primary tumors. Clinical characteristics were provided by
   the corresponding author upon request ([171]Table S2D).

External cohort IV

   Methylation dataset was downloaded from the Japanese
   Geno-type-phenotype Archive (JGA) (JGA accession number:
   JGAS00000000113). Experimental Plat-form was Infinium MethylationEPIC
   array from Illumina. Downloaded data correspond to 95 FF MSI-H
   colorectal primary tumors. MSI status and clinical characteristics were
   provided by the corresponding author upon request
   ([172]Table S2E).[173]^33

Method details

TCGA datasets annotation

   Annotation for methylation datasets was obtained with the ChAMP
   v2.24.0 R package[174]^31 retrieving relevant information: probe type,
   strand, genomic coordinates, CpG island context (island ID and
   associated features) and gene context (gene symbol and associated
   features). Gene promoter was defined including probes located up to
   1.5 kb upstream from the transcription start site (TSS1500 and TSS200
   features). Annotation for cancer genes was obtained from the Cancer
   Gene Census database (COSMIC v96 release), retrieving relevant
   information (such as tier, role in cancer, and cancer syndrome) for
   each cancer gene. Genome-wide annotation for regulatory elements
   (curated regulatory regions, TFBS and regulatory polymorphisms) was
   obtained from the Open Regulatory Annotation track available at the
   UCSC Genome Browser database.

Analysis of DNA methylation arrays

   Analysis of DNA methylation arrays was performed with the ChAMP
   v2.24.0 R package.[175]^22 Methylation probe filtering was performed
   with champ.filter, setting the exclusion of (I) non-CpG probes, (II)
   Multi-Hit probes, (III) probes matching SNPs, and (IV) probes located
   in chromosomes X and Y. Methylation data imputation was performed with
   champ.impute under combined method (partial removal followed by KNN
   imputation) setting ProbeCutoff = 0.2, SampleCutoff = 0.2 and k = 5
   parameters. Quality control steps were performed with champ.QC
   generating mdsplots, densityPlots and dendrograms for methylation
   distribution. Normalization of Type-I and Type-II probes was performed
   with champ.norm under peak-based correction (PBC) method.
   Identification of differential methylation positions was performed with
   champ.DMP by comparing MSI vs MSS samples, applying Benjamini-Hochberg
   (B-H) False Discovery Rate (FDR) p value adjustment after limma
   analysis. MLH1 promoter methylation status was defined as previously
   described.[176]^34 Briefly, if β-value for all four MLH1 CpG sites
   cg23658326, cg11600697, cg21490561, and cg00893636 were greater than or
   equal to 0.18, 0.27, 0.11, and 0.10, respectively, the sample was
   considered hypermethylated for MLH1 promoter. Otherwise, the sample was
   considered non-hypermethylated.

Machine-learning predictive models

   The selection of predictors was done by retaining those probes from
   pooled TCGA COAD/READ dataset with an absolute delta β-value (Δβ)>0.3
   between MSI and MSS, and B-H FDR adjusted p value<0.05. Both
   hypermethylated and hypomethylated relevant probes were included as
   predictors, and patient binary MSI/MSS status was considered as
   response variable.

   Model development was performed with caret v6.0-91 R package,[177]^23
   training the following machine-learning classifiers: GLM (glmnet), KNN
   (knn), SVM (svmLinear), NaiveBayes (naive_bayes), NeuralNetwork (nnet),
   DecissionTree (rpart), and RandomForest (cforest).

   Classifier performance was evaluated through 5 repeats of 10-fold
   cross-validation, and hyperparameter tuning was done through grid
   search. Average model accuracies were compared through Friedman
   Rank-Sum Test and Pair-wise Wilcoxon Rank-Sum Test (paired,
   p.adjust.method = "holm"), and the less complex model among those with
   significantly better performance was selected.

   RFE was used to decrease model complexity, selecting an optimized model
   and a reduced version with suitable performance (less than 1% accuracy
   loss). Outer resampling method with 3 repeats of 2-fold
   cross-validation was applied. Predictors retained during the backward
   selection steps corresponded to 700, 600, 500, 400, 300, 200, 100, 50,
   25, 10, 5 and 2 CpG probes.

Model description

   Relative probe importance for the classification model was measured
   with caret varImp, considering the absolute value of the coefficients
   corresponding to the optimized GLM model. Relative gene importance was
   assessed by adding the absolute importance of individual CpG probes
   associated with each gene, normalized by the total number of genes
   represented in the model.

   Gene set enrichment analysis was performed with the methylGSA v1.12.0 R
   package,[178]^24 which implements logistic regression adjusting for the
   number of methylation probes. Enrichment in KEGG categories was
   evaluated with methylglm function. CpGs present in alternative gene
   context regions of the Infinium Human Methylation 450K array were
   considered (“group = promoter”, “group = body” and “group = all”). Gene
   sets composed by 20–500 members were displayed and ranked according to
   their B-H FDR-adjusted p value.

MSIMEP exploratory analysis in alternative TCGA cohorts

   Prediction of MSI/MSS phenotype in TCGA-UCEC and TCGA-STAD cohorts was
   performed using the caret predict function, by assessing the
   methylation profiles for the same 400 predictors (MSIMEP optimized
   model), or 25 predictors (MSIMEP reduced model), and considering
   patient binary MSI/MSS status as response variable. Methylation
   β-values were processed under a common framework (see [179]Analysis of
   DNA methylation arrays). Accuracy, Kappa, ROC, Sensitivity,
   Specificity, and Precision metrics were calculated from the predicted
   class probabilities (predict “type = prob”) and associated confusion
   matrix comparing predicted and expected MSI/MSS status. ROC curves were
   generated using the MLeval v0.3 R package.[180]^25

MSIMEP validation in external colorectal cancer cohorts

   Prediction of MSI/MSS phenotype in External cohort I, External cohort
   II, External cohort III, and External cohort IV was performed using the
   caret predict function, by assessing the methylation profiles for the
   same 400 predictors (MSIMEP optimized model), or 25 predictors (MSIMEP
   reduced model), and considering patient binary MSI/MSS status as
   response variable. Methylation β-values were processed under a common
   framework (see [181]Analysis of DNA methylation arrays). Accuracy,
   Kappa, ROC, Sensitivity, Specificity, and Precision metrics were
   calculated from the predicted class probabilities (caret:predict
   “type = prob”) and associated confusion matrix comparing predicted and
   expected MSI/MSS status. ROC curves were generated using the MLeval
   v0.3 R package.[182]^25 Venn diagram was generated using the
   VennDiagram v1.7.1 R package.[183]^26

Deconvolution of immune cell populations

   The CIBERSORTx deconvolution algorithm[184]^27 was used to infer immune
   cell infiltration from TCGA-COAD and TCGA-READ bulk RNA-seq data.
   CIBERSORTx job type: “Impute cell fractions” was launched in absolute
   mode on the LM22 signature, with active B mode batch correction and
   quantile normalization disabled (recommended conditions for bulk
   RNA-seq mixtures), executed with 500 permutations. Low-quality samples
   with p value for deconvolution >0.05 were discarded from downstream
   analyses. Relative cell proportions were obtained by normalizing the
   CIBERSORTx output to the sample-level sum of cell scores rendering
   percentages of immune infiltration. Relative cell proportions were
   compared between predicted MSIMEP MSI CRCs and MSI cases diagnosed
   through standard PCR-based assays. Profiled immune cell types were B
   cells (naive and memory), plasmatic cells, T cells (CD8, CD4 naive, CD4
   memory resting, CD4 memory activated, follicular helper, regulatory and
   gamma-delta), NK cells (resting and activated), monocytes, macrophages
   (M0, M1 and M2), dendritic cells (resting and activated), mast cells
   (resting and activated), eosinophils, and neutrophils.

Quantification and statistical analysis

   Comparisons between patient and disease characteristics were carried
   out using X^2 or Fisher exact test (categorical variables), and
   T-Student or Wilcoxon tests (continuous variables). All p values were
   2-sided, and those less than 0.05 were considered statistically
   significant. Statistical details can be found in table and figure
   legends, [185]results and [186]method details. All statistical analyses
   were performed using R.

Acknowledgments