Abstract

Simple Summary

   Abnormal DNA methylation is known to regulate gene expression, and its
   features have been frequently observed in colorectal cancer (CRC)
   patients. In addition, alterations in DNA methylation can be proposed
   as biomarkers for cancer prognosis, as they occur in the early stage of
   carcinogenesis. Although numerous studies have attempted to shed light
   on the impacts of DNA methylation on gene expression, it is still
   unclear which specific regions regulate gene expression and how they
   are associated with patient survival. In this study, we elucidated the
   intricate relationship between DNA methylation and gene expression.
   Furthermore, we found genes that were influenced by DNA methylation and
   were associated with survival; these genes were mainly enriched in
   immune-related pathways.

Abstract

   The aberrant expression of cancer-related genes can lead to colorectal
   cancer (CRC) carcinogenesis, and DNA methylation is one of the causes
   of abnormal expression. Although many studies have been conducted to
   reveal how DNA methylation affects transcription regulation, the ways
   in which it modulates gene expression and the regions that
   significantly affect DNA methylation-mediated gene regulation remain
   unclear. In this study, we investigated how DNA methylation in specific
   genomic areas can influence gene expression. Several regression models
   were constructed for gene expression prediction based on DNA
   methylation. Among these models, ElasticNet, which had the best
   performance, was chosen for further analysis. DNA methylation near
   transcription start sites (TSS), especially from 2 kb upstream to 7 kb
   downstream of TSS, had an essential regulatory role in gene expression.
   Moreover, methylation-affected and survival-associated genes were
   compiled and found to be mainly enriched in immune-related pathways.
   This study investigated genomic regions in which methylation changes
   can affect gene expression. In addition, this study proposed that
   aberrantly expressed genes due to DNA methylation can lead to CRC
   pathogenesis by the immune system.

   Keywords: colorectal cancer, DNA methylation, gene expression,
   regression, ElasticNet, survival analysis, immune system

1. Introduction

   Colorectal cancer (CRC) is one of the most frequently diagnosed cancers
   worldwide. According to GLOBOCAN statistics on global cancer, CRC was
   the third most common cancer and the second leading cause of
   cancer-related death in 2020 [[26]1]. CRC pathogenesis is caused by
   three major pathways: chromosomal instability, microsatellite
   instability, and the CpG island (Cytosine-phosphate-Guanine island;
   CGI) methylator phenotype (CIMP) [[27]2,[28]3,[29]4]. The CIMP
   phenotype is found in 30–35% of colorectal adenoma cases [[30]5]. It is
   characterized by the hypermethylation of CGIs at promoter regions and
   leads to the inactivation of multiple cancer-related genes, including
   tumor suppressors [[31]6,[32]7,[33]8,[34]9].

   DNA methylation, which is catalyzed by a family of DNA
   methyltransferases that transfer a methyl group to the 5′-carbon of
   cytosines in CpG sites [[35]10], plays a critical role in
   carcinogenesis to control gene expression [[36]11]. In particular,
   hypomethylation and hypermethylation are related to transcriptional
   activation and gene silencing, respectively [[37]12,[38]13]. The
   alteration of DNA methylation has been proposed as a potential
   biomarker for cancer diagnosis, treatment response prediction, and
   prognosis, as it occurs in early carcinogenesis [[39]14,[40]15].
   Furthermore, some studies have reported that differently methylated and
   differently expressed genes correlated with prognosis in CRC patients
   [[41]16,[42]17]. However, the exact role of DNA methylation in gene
   expression remains unclear [[43]18]. Specifically, the impact of CpG
   sites in various regions of the gene has not been precisely determined
   yet [[44]19,[45]20], although many studies have investigated
   differently methylated CpG sites in CGIs and CGI-surrounding regions,
   such as the CGI shore (0–2 kb from CGIs), CGI shelf (2–4 kb from CGIs),
   and open sea (a region with any specific designation) [[46]21,[47]22].
   Several studies have also reported that DNA methylation levels in the
   gene body, particularly in the first exon or first intron, are related
   to gene expression [[48]23,[49]24,[50]25]. Another recent study
   detected differentially methylated regions (DMR) between CRC patients
   and normal groups and annotated their genomic regions, including
   CGI-related features [[51]26]. In addition, one study found that the
   effect of DNA methylation on gene expression varies based on the
   distance between a CpG site and the gene and that the DNA methylation
   of distant CpG sites can also affect gene expression [[52]27].

   This study examined the potential effect of DNA methylation on various
   genomic areas, including gene components and CGI-related regions, based
   on their distance from the genes in CRC. To determine the CpG site in
   which methylation can modulate expression in CRC patients, this study
   obtained the DNA methylation and transcription profiles from The Cancer
   Genome Atlas (TCGA) [[53]28] and trained several statistical machine
   learning models, such as Lasso, Ridge, ElasticNet, and a Bayesian
   sparse linear mixed model (Bslmm) [[54]29], using DNA methylation
   values of CpG sites within ±1 Mb of the transcription start sites
   (TSS). The models have been proposed to solve the problem wherein there
   are significantly more features than the number of samples, and they
   have been utilized in various biological fields, e.g., genome-wide
   association studies [[55]30,[56]31,[57]32,[58]33], transcriptomic
   profiling [[59]34,[60]35,[61]36], and multi-omics approaches
   [[62]37,[63]38,[64]39]. The best model with the highest predictive
   performance was selected, and the effect of CpG sites was analyzed
   using the coefficients of the best-performing model, taking into
   account the distance from TSS and the region relative to the gene. In
   addition, this study validated the effect of CpG sites according to
   distance from TSS using independent external datasets. Moreover, this
   study uncovered genes whose expression is affected by DNA methylation
   and which have a potential association with survival outcomes.
   Furthermore, this study identified the biological roles of genes
   aberrantly expressed due to DNA methylation and associated with
   survival.

2. Materials and Methods

2.1. TCGA Data Collection and Preprocessing

   Gene expression and DNA methylation data of 393 CRC patients were
   acquired from the TCGA-COAD and TCGA-READ projects using the R package
   TCGA biolinks version 1.12.0 [[65]40]. Clinical datasets were also
   collected via the same process, with the exception of four patients who
   did not have data on survival times. Consequently, the clinical
   information of 389 CRC patients was used for survival analysis.

   TCGA provides mRNA expression values that have already been processed
   through several steps [[66]41]. FASTQ files were aligned to the
   GRCh38/hg38 reference genome of GENECODE v22 using STAR [[67]42], and
   the read counts mapped per gene were quantified using HTSeq
   [[68]41,[69]43]. We normalized the HTSeq-based raw counts of genes to
   transcripts per million (TPM) and transformed them into log[2](TPM +
   1). The protein-coding genes whose log[2](TPM + 1) values were one or
   more in at least half of the samples were selected in this study. Genes
   on sex chromosomes or with no probes within ±1 Mb of TSS were excluded
   from the analysis.

   TCGA methylation datasets were based on the Illumina Infinium
   HumanMethylation450 BeadChip (450k) [[70]44]. DNA methylation levels
   represent the ratio of methylated probes to the total array intensity,
   so it has the form of a β value that ranges from 0 to 1. DNA
   methylation probes containing missing values in >10% of the samples
   were excluded. Otherwise, missing values were imputed as the median of
   the remaining available values. Consequently, 12,822 genes and 394,994
   DNA methylation probes were selected for this research.

2.2. Collection of DNA Methylation Probes for Each Gene

   The probe lists and genomic coordinates of 450k were obtained using the
   IlluminaHumanMethylation450kanno.ilmn12.hg19 0.6.0 package in R.
   Because the annotation was provided based on hg19, it was converted to
   hg38 using UCSC liftOver version 1.16.0 in R. To investigate the effect
   of the DNA methylation level on gene expression regulation, the probes
   within ±1 Mb of TSS for each gene were chosen as the features of the
   candidate models. TSS were determined as the start positions of
   transcripts with the longest exon length for each gene. By collecting
   probes within ±1 Mb of the TSS of genes, 380,087 DNA methylation probes
   were finally obtained for the analysis. On one gene, an average of 575
   probes was used, with a maximum of 5577 probes. [71]Table S1 shows the
   total number of probes per gene.

2.3. Statistical Machine Learning Model

   Multivariate penalized regression methods (Lasso, Ridge, ElasticNet,
   and Bslmm) were utilized to determine the impact of the DNA methylation
   of CpG sites on gene expression. Lasso and ElasticNet are sparse models
   that give more weight to essential features by reducing the
   coefficients of less effective features to zero. Ridge is a relatively
   modest shrinkage approach that converges the coefficients very closely
   to zero. Ridge has the advantage of considering the minor influence of
   many features compared with Lasso or ElasticNet. These models apply
   regularization strategies, minimizing the prediction error while
   avoiding overfitting. These can achieve an appropriate trade-off
   between bias and variance by adding a penalty term to the cost
   function. The following equations describe the regularization
   approaches:
   [MATH:
   <mrow><mrow><mi>C</mi><mi>o</mi><mi>s</mi><mi>t</mi><mo>=</mo><mfrac><m
   n>1</mn><mrow><mn>2</mn><mi>N</mi></mrow></mfrac><mo> </mo><munderover>
   <mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mrow><mi>i</mi><mo>=</mo><mn>1<
   /mn></mrow><mi>n</mi></munderover><msup><mrow><mrow><mo>(</mo><mrow><ms
   ub><mi>y</mi><mi>i</mi></msub><mo>−</mo><msub><mi>β</mi><mn>0</mn></msu
   b><mo>−</mo><munderover><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mrow><mi>j</mi><mo>=</mo><mn>1<
   /mn></mrow><mi>p</mi></munderover><msub><mi>β</mi><mi>j</mi></msub><msu
   b><mi>x</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow><mo>)</mo></
   mrow></mrow><mn>2</mn></msup></mrow></mrow> :MATH]
   (1)
   [MATH:
   <mrow><mrow><mrow><mi>L</mi><mn>1</mn><mo> </mo><mi>p</mi><mi>e</mi><mi
   >n</mi><mi>a</mi><mi>l</mi><mi>t</mi><mi>y</mi><mo>=</mo><msubsup><msty
   le mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mrow><mi>j</mi><mo>=</mo><mn>1<
   /mn></mrow><mi>p</mi></msubsup><mrow><mi
   mathvariant="sans-serif">λ</mi><mi
   mathvariant="sans-serif">α</mi></mrow><mo>|</mo><msub><mi>β</mi><mi>j</
   mi></msub><mo>|</mo></mrow><mrow><mo>,</mo><mo> </mo><mi>L</mi><mn>2</m
   n><mo> </mo><mi>p</mi><mi>e</mi><mi>n</mi><mi>a</mi><mi>l</mi><mi>t</mi
   ><mi>y</mi><mo>=</mo><msubsup><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mrow><mi>j</mi><mo>=</mo><mn>1<
   /mn></mrow><mi>p</mi></msubsup><mi
   mathvariant="sans-serif">λ</mi><mfrac><mrow><mrow><mo>(</mo><mrow><mn>1
   </mn><mo>−</mo><mi
   mathvariant="sans-serif">α</mi></mrow><mo>)</mo></mrow></mrow><mn>2</mn
   ></mfrac><msub><mi>β</mi><mi>j</mi></msub><msup><mrow></mrow><mn>2</mn>
   </msup></mrow></mrow></mrow> :MATH]
   (2)
   [MATH: <mrow><mrow><mi>L</mi><mfenced open="(" close=")"
   separators="|"><mrow><mi>β</mi><mo>,</mo><msub><mi>β</mi><mn>0</mn></ms
   ub></mrow></mfenced><mo>=</mo><mi
   mathvariant="italic">argmin</mi><mfenced open="(" close=")"
   separators="|"><mrow><mi>β</mi><mo>,</mo><msub><mi>β</mi><mn>0</mn></ms
   ub></mrow></mfenced><mo stretchy="false">[</mo><mo> </mo><mi
   mathvariant="italic">Cost</mi><mo> </mo><mo>+</mo><mi>L</mi><mn>1</mn><
   mo> </mo><mi
   mathvariant="italic">penalty</mi><mo> </mo><mo>+</mo><mi>L</mi><mn>2</m
   n><mo> </mo><mi mathvariant="italic">penalty</mi><mo> </mo><mo
   stretchy="false">]</mo></mrow></mrow> :MATH]
   (3)

   In Equations (1)–(3),
   [MATH: <mrow><mi>n</mi></mrow> :MATH]
   is the number of patients and
   [MATH: <mrow><mi>p</mi></mrow> :MATH]
   is the number of DNA methylation probes in each gene. When
   [MATH: <mrow><mi>y</mi></mrow> :MATH]
   is the gene expression value and
   [MATH: <mrow><mi>x</mi></mrow> :MATH]
   is the DNA methylation level, the shrunken coefficients
   [MATH: <mrow><mi>β</mi></mrow> :MATH]
   for each probe can be obtained by training the model with an adequate
   shrinkage penalty
   [MATH: <mrow><mi mathvariant="sans-serif">λ</mi></mrow> :MATH]
   . In Equation (2),
   [MATH: <mrow><mrow><mrow><mo> </mo><mi
   mathvariant="sans-serif">α</mi></mrow><mo>=</mo><mn>1</mn></mrow></mrow
   > :MATH]
   denotes the Lasso model with the L1 penalty, and
   [MATH: <mrow><mrow><mi
   mathvariant="sans-serif">α</mi><mo>=</mo><mn>0</mn></mrow></mrow>
   :MATH]
   denotes ridge regression with the L2 penalty. If
   [MATH: <mrow><mrow><mn>0</mn><mo><</mo><mrow><mi
   mathvariant="sans-serif">α</mi><mo> </mo></mrow><mo><</mo><mn>1</mn></m
   row></mrow> :MATH]
   , ElasticNet, which combines the L1 and L2 penalties, is used [[72]45].
   Collectively, the relative influence of the two penalty terms can be
   controlled by regulating
   [MATH: <mrow><mi mathvariant="sans-serif">α</mi></mrow> :MATH]
   .

   Bslmm combines the benefits of standard linear mixed models with sparse
   regression modeling [[73]29,[74]46]. It assumes prior distributions for
   hyperparameter speciation and infers posterior distributions by
   efficiently fitting a Markov chain Monte Carlo algorithm. Similar to
   regularization techniques, Bslmm can bring the coefficients close to
   zero, excluding a few informative features. More details of Bslmm are
   described in [[75]29]. Lasso, Ridge, and ElasticNet were implemented
   via the glmnet package version 4.1.3 in R. The appropriate value of
   [MATH: <mrow><mi mathvariant="sans-serif">α</mi></mrow> :MATH]
   was searched by altering the
   [MATH: <mrow><mi mathvariant="sans-serif">α</mi></mrow> :MATH]
   value between 0 and 1 by 0.1. Bslmm was implemented via GEMMA software
   version 0.98.4.

2.4. Performance Evaluation and Model Selection

   Five-fold cross-validation was conducted using the cv.glmnet function
   provided in the glmnet package. The optimal value of
   [MATH: <mrow><mi mathvariant="sans-serif">λ</mi></mrow> :MATH]
   in Equation (2) was searched automatically through glmnet after
   cross-validation. The performance of the models was calculated using
   the metrics in Equations (4)–(7). The mean squared error (MSE), root
   MSE (RMSE), and mean absolute error (MAE) measure the error of the
   models, and the R2 score assesses the model’s explanatory power
   [[76]47]. To find genes whose expression level was affected by DNA
   methylation, the R2 score was used, because MSE, RMSE, and MAE were
   scale-dependent measures [[77]48]. When the R2 score is closer to 1, it
   indicates that the regression model is fitted to the data well.
   [MATH:
   <mrow><mrow><mrow><mi>MSE</mi><mo>=</mo><mfrac><mn>1</mn><mi>n</mi></mf
   rac></mrow><mrow><msup><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mo> </mo></msup><msup><mrow><mr
   ow><mo>(</mo><mrow><msub><mi>y</mi><mrow><mi>r</mi><mi>e</mi><mi>a</mi>
   <mi>l</mi></mrow></msub><mo>−</mo><msub><mi>y</mi><mrow><mi>p</mi><mi>r
   </mi><mi>e</mi><mi>d</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mn>2</mn></msup></mrow></mrow></mrow> :MATH]
   (4)
   [MATH:
   <mrow><mrow><mi>RMSE</mi><mo>=</mo><msqrt><mrow><mfrac><mn>1</mn><mi>n<
   /mi></mfrac><mo> </mo><msup><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mo> </mo></msup><msup><mrow><mr
   ow><mo>(</mo><mrow><msub><mi>y</mi><mrow><mi>r</mi><mi>e</mi><mi>a</mi>
   <mi>l</mi></mrow></msub><mo>−</mo><msub><mi>y</mi><mrow><mi>p</mi><mi>r
   </mi><mi>e</mi><mi>d</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mn>2</mn></msup></mrow></msqrt></mrow></mrow> :MATH]
   (5)
   [MATH:
   <mrow><mrow><mrow><mi>MAE</mi><mo>=</mo><mfrac><mn>1</mn><mi>n</mi></mf
   rac></mrow><mrow><msup><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mo> </mo></msup><mrow><mo>|</mo
   ><mrow><msub><mi>y</mi><mrow><mi>r</mi><mi>e</mi><mi>a</mi><mi>l</mi></
   mrow></msub><mo>−</mo><msub><mi>y</mi><mrow><mi>p</mi><mi>r</mi><mi>e</
   mi><mi>d</mi></mrow></msub></mrow><mo>|</mo></mrow></mrow></mrow></mrow
   > :MATH]
   (6)
   [MATH: <mrow><mrow><mi
   mathvariant="normal">R</mi><mn>2</mn><mo>=</mo><mn>1</mn><mo>−</mo><mfr
   ac><mrow><msup><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mo> </mo></msup><msup><mrow><mr
   ow><mo>(</mo><mrow><msub><mi>y</mi><mrow><mi>r</mi><mi>e</mi><mi>a</mi>
   <mi>l</mi></mrow></msub><mo>−</mo><msub><mi>y</mi><mrow><mi>p</mi><mi>r
   </mi><mi>e</mi><mi>d</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mn>2</mn></msup></mrow><mrow><msup><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mo> </mo></msup><msup><mrow><mr
   ow><mo>(</mo><mrow><msub><mi>y</mi><mrow><mi>r</mi><mi>e</mi><mi>a</mi>
   <mi>l</mi></mrow></msub><mo>−</mo><mi>M</mi><mi>e</mi><mi>a</mi><mi>n</
   mi><mrow><mo>(</mo><mrow><msub><mi>y</mi><mrow><mi>r</mi><mi>e</mi><mi>
   a</mi><mi>l</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mo>)</mo>
   </mrow></mrow><mn>2</mn></msup></mrow></mfrac></mrow></mrow> :MATH]
   (7)

   In all candidate models, the accuracy metrics were calculated in the
   same way, and the model with a smaller error and higher correlation
   levels in more genes was selected.

2.5. Investigation of the Significance of CpG Sites

   The finally selected model, ElasticNet, was retrained using the entire
   set of 393 patients’ data with the selected α and λ in Equation (2).
   The coefficient for each feature in regression techniques represents
   the impact of that feature on the target. Because the range of
   expression values varies by gene, the absolute values of the
   coefficients in each gene were scaled via min–max normalization. If the
   coefficient was equal to zero, it was regarded as a noneffective probe
   and excluded after the normalization process. Normalized coefficients
   were used after rounding to one decimal place.

2.6. Genomic Annotation for Probes

   Using the GENECODE v22 Gene Transfer Format file, the genomic
   coordinate of each component of the gene body was extracted. The probes
   located in the gene body were annotated with the component in which
   they were found. In addition, in order to check the relation to CGI,
   information about probes located in the target gene’s CGI and the
   CGI-surrounding regions was retrieved using the
   IlluminaHumanMethylation450kanno.ilmn12.hg19 0.6.0 package in R. The
   probes in the CGI-related regions of the target gene were tagged with
   the corresponding region.

2.7. Validation Data Analysis

   The CRC cell line gene expression datasets were obtained from the
   Cancer Cell Line Encyclopedia (CCLE) [[78]49]. DNA methylation datasets
   for the identical cell lines were downloaded from [79]GSE68379 in the
   form of β values, which had already been processed by Iorio F. et al.
   [[80]50]. The cell lines from two different datasets were intersected
   by name, and, as a result, the gene expression and the DNA methylation
   data were matched for 31 CRC cell lines ([81]Table S2). The validation
   data were preprocessed in the same way as in our experiments with TCGA
   datasets. Consequently, 15,760 genes and 484,843 DNA methylation probes
   were selected. After collecting probes within ±1 Mb of the TSS of
   genes, 469,522 DNA methylation probes were used for the validation
   experiments. The total number of probes per gene can be found in
   [82]Table S3.

2.8. Survival Analysis

   For each target gene, the samples were separated into low- and
   high-expression groups according to the optimal cutoff values
   determined via the Maxstat R package version 0.7.25, similar to other
   previous studies that divided groups for survival analysis
   [[83]51,[84]52,[85]53,[86]54]. A log-rank test was used to examine
   whether there was a relationship between the expression levels of each
   gene and survival.

2.9. Identification of Signature Genes and Functional Annotation

   Methylation-affected and survival-associated (MASA) genes were defined
   as genes whose expression was affected by DNA methylation and were
   potentially associated with survival outcomes. In detail, it was
   considered that gene expression levels were affected by the methylation
   pattern if the R2 score was >0.3. Additionally, if the p value obtained
   by the log-rank test was ≤0.05 in the survival analysis, the gene
   expression change had the potential to affect survival. MASA genes were
   retrieved by intersecting the two groups of genes. To distinguish the
   potential biological impact of MASA, Gene Ontology (GO) analysis for
   biological processes (BP) and the Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway enrichment analysis were conducted by DAVID
   [[87]55]. GO terms and KEGG pathways with a false discovery rate (FDR)
   of ≤0.05 were extracted as enriched sets.

3. Results

3.1. Best-Performing Model: ElasticNet

   To predict gene expression using the DNA methylation levels at CpG
   sites, four multivariate regression models, namely Lasso, Ridge,
   ElasticNet, and Bslmm, were applied. The performance of the models was
   evaluated by comparing the original expression of each gene to the
   predicted value using five-fold cross-validation. To identify the best
   model that could accurately predict gene expression for as many genes
   as possible, the median values in each performance metric, namely MSE,
   RMSE, MAE, and R2 score, were examined ([88]Figure 1a–d). The R2 score
   is a value that indicates the explanatory power of the model, and a
   higher R2 score indicates that the model has better performance. By
   contrast, MSE, RMSE, and MAE represent the model error; the smaller the
   value is, the better the model’s performance. ElasticNet, with α = 0.1,
   had the highest median R2 score ([89]Figure 1d). When α = 0.1,
   ElasticNet also showed the lowest median error values compared with
   other methods ([90]Figure 1a–c), although the values were very slightly
   different among ElasticNet models with different α values.
   Consequently, ElasticNet with α = 0.1 was selected as the best model,
   and this model was used in all subsequent analyses.

Figure 1.

   [91]Figure 1
   [92]Open in a new tab

   Performance of the candidate models. The numbers in parentheses are the
   α values of ElasticNet. (a) Median MSE for target genes per model; (b)
   median RMSE for target genes per model; (c) median MAE for target genes
   per model; (d) median R2 score for target genes per model. The best
   model is marked with a red diamond in each metric.

   The number of genes whose R2 scores exceeded the specific values for
   each model was counted ([93]Table S4). In all models, >12,000 genes had
   an R2 score of >0. However, as the R2 score increased, the differences
   between the models became more noticeable. With a higher R2 score, the
   number of genes decreased rapidly in the case of Bslmm and Ridge.
   Conversely, in ElasticNet and Lasso, the number of genes decreased
   slowly, even as the R2 score increased. ElasticNet and Lasso contained
   >3000 genes with R2 scores of >0.3. Specifically, in ElasticNet with α
   = 0.1, the R2 scores of 3307 genes were >0.3. This was the largest
   number of genes among the models, and these genes were regarded as
   genes whose expression can be influenced by DNA methylation.

3.2. Importance of DNA Methylation in a Specific Genomic Region

   This study evaluated which CpG sites’ methylation levels affect gene
   expression using the coefficients of the finally selected ElasticNet
   model. The normalized coefficients obtained after min–max normalization
   of the absolute coefficient values of CpG sites within ±1 Mb of TSS
   were utilized. It was hypothesized that if the coefficient was
   negative, DNA methylation would negatively affect gene expression. The
   negative effect means that high DNA methylation levels can cause genes
   to be underexpressed, whereas low DNA methylation can lead to genes
   being overexpressed.

   This study investigated in which direction effective CpG sites are
   located from TSS and how the effects are changed according to the
   distance from TSS. Moreover, we checked whether the DNA methylation of
   CpG sites was negatively or positively related to gene expression
   levels. Most effective CpG sites were located within ±25 kb of TSS
   ([94]Figure 2a). Therefore, the region of interest was narrowed down
   within ±25 kb, and the effect of methylation in this area was further
   analyzed. CpG sites with an absolute normalization coefficient of 1 can
   be considered important sites for gene expression. To analyze how the
   percentage of important CpG sites varied with distance, the number of
   important probes was divided by the number of probes with absolute
   coefficients more than zero within that distance. [95]Figure 2b shows
   the percentage of important probes. The percentage was high, especially
   from 2 kb upstream to 7 kb downstream, and most important CpG sites
   near TSS had a negative effect rather than a positive one. Moreover,
   when we restricted our analysis of the regulatory effects only to the
   genes with high R2 scores (>0.3), we still confirmed that the
   percentage of the effective probes within ±1 Mb of TSS and the negative
   effect of the CpG sites from 2 kb upstream to 7 kb downstream of TSS
   were fairly similar ([96]Figure S1a,b). Because the 2 kb upstream
   region of TSS was generally contained in a promoter region in previous
   studies [[97]27,[98]56], the negative effect of DNA methylation at the
   promoter region on gene expression could be verified again.

Figure 2.

   [99]Figure 2
   [100]Open in a new tab

   Regulatory effects represented by coefficients of CpG sites. Green,
   negative effect probes; orange, positive effect probes; black, zero,
   noneffective. (a) Effects of probes according to the distance from TSS.
   Probes within ±1 Mb of TSS were used. (b–f) The percentage of probes
   was represented by dividing the number of probes with absolute
   normalized coefficients of 1 by the number of probes with absolute
   coefficients more than zero. (b) Effects according to the distance from
   TSS. Only probes within ±25 kb were plotted. (c) Effects according to
   the relation to CGI within ±1 Mb of TSS. (d) Effects according to the
   relation to CGI within 2 kb upstream to 7 kb downstream of TSS. (e,f)
   Comparison of importance among CGI-related regions by logistic
   regression. The positive coefficient indicates a high probability of
   important probes in that region (red bar). On the other hand, a
   negative coefficient means that the probes in that region have less
   importance (grey bar). The open sea was not represented because the
   coefficient was not estimated from the logistic regression model. The
   region where its coefficient was estimated with a considerably low p
   value (p < 2.2 × 10^−10) was marked as ***, and that estimated with p <
   0.05 was marked as * above the bar. (e) Probes within ±1 Mb of TSS were
   used. (f) Probes within 2 kb upstream to 7 kb downstream of TSS were
   used.

   Furthermore, we analyzed whether CGIs and their surrounding regions are
   significant for transcription regulation by DNA methylation. The number
   of effective probes located in CGI-related regions within ±1 Mb of the
   TSS of the target gene was investigated. When the percentage of
   important probes among the number of probes with effects was measured,
   the percentage of important probes with negative effects was clearly
   higher than those with positive effects ([101]Figure 2c). The high
   percentage of important probes with negative effects was also similarly
   detected when only employing the genes with high R2 scores (>0.3)
   ([102]Figure S1c).

   To identify which region is significantly more important than others,
   we applied a logistic regression model using the allocation information
   of the important probes among the CGI-related regions. As a result, the
   CGI shore and CGI shelf had positive coefficients and low p values,
   which means that there existed significantly more important probes in
   these regions (p < 2.2 × 10^−10; [103]Figure 2e; [104]Table S7). By
   restricting the analysis to the genes with high R2 scores, the CGI
   shore revealed a positive coefficient with a low p value, but the
   estimated coefficient was relatively smaller and the p value was
   considerably higher than the results obtained by all genes (p < 2.2 ×
   10^−5; [105]Figure S1e; Table S7).

   Furthermore, to identify the effect of CGI-related regions within 2 kb
   upstream to 7 kb downstream of TSS, the percentage of important probes
   among the number of effective probes was reinvestigated ([106]Figure
   2d). All CGI-related regions had a high percentage of important probes
   with a negative effect. This tendency was also clearly discovered in
   the results of TCGA datasets with genes with high R2 scores
   ([107]Figure S1d). When utilizing a logistic regression model, the CGI
   shelf was more important than other CGI-related regions with a positive
   coefficient and a low p value (p < 0.05; [108]Figure 2f; [109]Table
   S8). However, the coefficient and p value were not so significant
   compared to the results from CGI-related regions within ±1 Mb of the
   TSS of the target gene. Moreover, when analyzing genes with high R2
   scores, important CGI-related regions with a positive coefficient were
   not detected ([110]Figure S1f; Table S8).

   Interestingly, regardless of the R2 scores of genes, more than 80% of
   the important probes with a negative impact within ±1 Mb were retained
   in the CGI, shore, and shelf within 2 kb upstream to 7 kb downstream of
   TSS ([111]Tables S5 and S6). Therefore, this study validated that DNA
   methylation within 2 kb upstream to 7 kb downstream of TSS is important
   for expression regulation, and the DNA methylation-medicated gene
   regulation has a relatively weak relationship with CGIs in this area.

   For further validation, we carried out the same experiments using CCLE
   CRC cell line datasets. The percentage of effective probes was
   increased near TSS, similarly to TCGA datasets ([112]Figure S2a).
   Moreover, the CpG sites with negative effects were mainly observed from
   2 kb upstream to 7 kb downstream ([113]Figure S2b). In the CGI-related
   regions, excluding the open sea, there were more probes with negative
   effects than positive ones ([114]Figure S2c). When we restricted the
   analyzed region to 2 kb upstream to 7 kb downstream, it was confirmed
   that the probes with negative effects were more frequently located in
   all CGI-related areas, including the open sea ([115]Figure S2d). It was
   noted that more than 90% of the important probes with a negative impact
   were retained in the CGI, shore, and shelf, within 2 kb upstream to 7
   kb downstream of TSS ([116]Tables S9 and S10). There were no important
   CGI-related regions obtained by the logistic regression analysis both
   within ±1 Mb ([117]Figure S2e; Table S11) and within 2 kb upstream to 7
   kb downstream of TSS ([118]Figure S2f; Table S12).

   This study also investigated genomic regions at ~7 kb downstream and
   the regional effects on gene expression. The probes with normalized
   coefficient values of −1 were more likely to be found in the first
   intron region and first exon ([119]Figure 3a). Conversely, probes with
   highly positive effects, denoted as normalized coefficients of 1, did
   not specify any representative regions ([120]Figure 3b). Interestingly,
   a considerable percentage of the probes with low normalized
   coefficients were found in intergenic regions, regardless of whether
   the effects were negative or positive. Moreover, the percentage of
   probes with low coefficients at intergenic regions moderately increased
   with the distance from TSS ([121]Figure 3a,b). Furthermore, in both the
   experiments only using genes with high R2 scores and the validation
   experiments using CCLE datasets, the first intron and the first exon
   were also found to be important regions, where DNA methylation
   occurring in these regions negatively affects gene expression
   ([122]Figures S3 and S4).

Figure 3.

   [123]Figure 3
   [124]Open in a new tab

   Regulatory effects represented by coefficients of CpG sites within
   genes. Effects of the probes within genes with (a) negative
   coefficients and (b) positive coefficients according to the distance
   from the TSS. The x-axis is the relative distance from TSS, and the
   leftmost point of the x-axis is TSS. The y-axis is the percentage of
   probes represented by dividing the number of probes with absolute
   normalized coefficients of 1 by the number of probes with absolute
   coefficients more than zero. The colors of the lines denote the genomic
   regions. Orange, first exon; red, first intron; black, intergenic;
   green, other exon; blue, other intron. Other exon/intron refers to all
   exons/introns in the transcript except the first exon/intron. Only
   downstream regions are plotted.

3.3. Prediction with Probes between 2 kb Upstream and 7 kb Downstream of TSS

   The DNA methylation of CpG sites in the region from 2 kb upstream to 7
   kb downstream of TSS has an important role in gene regulation. This
   study assessed the model’s prediction performance using only the probes
   within the corresponding regions.

   The prediction results using only probes within 2 kb upstream to 7 kb
   downstream of TSS showed a strong correlation with the results using
   all probes within ±1 Mb of TSS ([125]Figure 4a–d). Additionally, some
   genes showed slightly better performance when only the probes of the
   regions (2 kb upstream to 7 kb downstream) were used. These findings
   suggested that the DNA methylation levels in 2 kb upstream to 7 kb
   downstream regions are crucial for regulating gene expression. However,
   the effect of DNA methylation on a wide range of regions around TSS
   cannot be ignored.

Figure 4.

   [126]Figure 4
   [127]Open in a new tab

   Performance comparison of ElasticNet depending on probe distance. Gene
   expression prediction was performed with ElasticNet using probes within
   ±1 Mb of TSS and ElasticNet using probes located in the region of 2 kb
   upstream to 7 kb downstream. Each model’s performance was measured and
   compared in terms of (a) MSE, (b) RMSE, (c) MAE, and (d) R2 score.
   Genes with better performance using probes within 2 kb upstream to 7 kb
   downstream are highlighted in yellow. The letter r denotes Pearson’s
   correlation coefficient, and p is the p value for Pearson’s
   correlation.

3.4. MASA Genes

   To conduct survival analysis, patients were divided into two groups
   according to gene expression levels (see Methods). Signature genes
   whose expression is affected by the DNA methylation level and has the
   potential to affect survival rates were defined as MASA genes (R2 score
   > 0.3 and p ≤ 0.05, log-rank test; [128]Figure 5a). Whether DNA
   methylation-affected genes are related to the survival rate was
   assessed using Pearson’s χ^2 test. The p value was very low (p < 2.2 ×
   10^−16; [129]Figure 5b). Genes whose expression is affected by DNA
   methylation were statistically linked to patient survival.
   Additionally, the odds ratio (OR) was 1.57, indicating that DNA
   methylation-affected genes were 1.57 times more relevant to survival
   than non-methylation-affected genes. The 95% confidence interval of the
   OR was 1.45–1.70.

Figure 5.

   [130]Figure 5
   [131]Open in a new tab

   Analysis of MASA genes. (a) Comprehensive view of each gene. Red dots
   are MASA genes with R2 score > 0.3 and p ≤ 0.05. (b) Number of genes
   classified corresponding to each specified characteristic. (c) KEGG
   pathway enrichment analysis using MASA. (d) GO BP enrichment analysis
   using MASA.

   Finally, the KEGG pathway and GO enrichment analysis for BP were
   conducted to investigate the biological roles of MASA genes and explore
   how they are related to survival. MASA genes were mainly enriched in
   immune-related pathways ([132]Figure 5c). The two most significantly
   enriched pathways were cell adhesion molecules and hematopoietic cell
   lineage pathways, and both were also related to the immune system. Cell
   adhesion molecules play a crucial role in all aspects of inflammation,
   such as mediating the migration of immune cells near malignant tumor
   cells or regulating the interaction between immune cells or between
   immune cells and target cells [[133]57]. Moreover, hematopoietic stem
   cells (HSC) are a raw form of blood cells that can differentiate into
   any type of immune cell lineage: either lymphoid lineages, such as T
   and B cells, or myeloid lineages, such as macrophages or granulocytes
   [[134]58]. Therefore, hematopoietic cell lineage pathways are involved
   in all differentiation processes of HSC to immune cell lineages.
   Likewise, the most enriched GO BP terms were also related to the immune
   response ([135]Figure 5d). Notably, MASA genes were enriched in the
   angiogenesis process ([136]Figure 5d), which plays an important role in
   cancer cell growth and proliferation [[137]59,[138]60].

4. Discussion

   CRC is one of the cancers most commonly diagnosed worldwide and has a
   high mortality rate [[139]61]. CRC pathogenesis can be influenced by
   various abnormal genetic or epigenetic modifications
   [[140]62,[141]63,[142]64]. DNA methylation is one of the most important
   epigenetic features, in which abnormal changes can lead to CRC
   development by disrupting transcription regulation [[143]65]. Using
   multivariate regression models, this study investigated how DNA
   methylation affects gene expression. To determine the model that can
   explain the relationship between DNA methylation and gene expression,
   the models’ performance was evaluated. Consequently, ElasticNet
   outperformed all others in terms of the lowest median error and the
   highest median R2 score for target genes. The weights of most sites
   were offset by zero in ElasticNet. Therefore, it was concluded that DNA
   methylation changes at all CpG sites do not affect gene expression
   levels uniformly and that specific, important CpG sites can more
   strongly regulate transcriptional events.

   To identify CpG sites with a strong influence, the coefficients for
   each probe were analyzed. Therefore, DNA methylation within 2 kb
   upstream to 7 kb downstream of TSS was proven more essential in
   regulating gene expression than other CpG sites. Specifically, DNA
   methylation in these regions can affect gene expression negatively.
   Because the promoter region is generally located upstream of the TSS
   [[144]66,[145]67], it was reconfirmed that DNA methylation at the
   promoter regions inhibits gene expression. Furthermore, by
   investigating the distribution of probes in CGI-related regions, we
   investigated the regulatory effects exerted by the CGI-related regions,
   because some previous studies reported that DNA methylation in the CGI
   shore and CGI shelf was important
   [[146]68,[147]69,[148]70,[149]71,[150]72]. Our study also suggested
   that the CGI was less effective in gene regulation and the CGI shore
   and CGI shelf regions were relatively more important. However, based on
   the results obtained using genes with high R2 scores, the regulatory
   effect would be limited, especially at the regions near the TSS.

   Interestingly, the relevant CGI-related regions with clearly low p
   values were not found in the 2 kb upstream to 7 kb downstream region.
   The p values of the regions with positive effects were also more than
   0.01 in this area. This result would imply that the relation to CGI is
   less significant because important probes with negative effects are
   mainly concentrated in the 2 kb upstream to 7 kb downstream region.
   Moreover, in the CCLE datasets utilized for the validation experiments,
   no important CGI-related regions were detected. These results may be
   due to the characteristics of cell lines being different from tissue
   samples from cancer patients, or the limitation of the low number of
   validation samples (n = 31). However, given that most of the important
   probes with negative effects in CGI-related regions, in the CCLE
   datasets, were located in the region 2 kb upstream to 7 kb downstream
   of TSS, the result might make sense. To better evaluate the importance
   of the CGI-related region, further computational and biological
   experimental studies using more datasets would need to be carried out.

   Additional analyses were also conducted to explore which gene
   components can support the negative effect of the 7 kb downstream
   region. Consequently, DNA methylation occurring in the first exon and
   the first intron located in the region 7 kb downstream of TSS could be
   the key component that can negatively affect transcription regulation.
   This means that hypermethylation or hypomethylation in the first exon
   and first intron can cause gene silencing or overexpression,
   respectively, and may eventually lead to CRC pathogenesis.

   To verify which DNA methylation pattern within 2 kb upstream to 7 kb
   downstream of TSS can sufficiently explain the expression level,
   ElasticNet was reconstructed only using the probes in this region. When
   their performance was compared, the Pearson’s correlation coefficients
   were very high, reflecting that DNA methylation in the region within 2
   kb upstream to 7 kb downstream of TSS can effectively explain gene
   expression levels. However, many genes showed better performance when
   utilizing probes within ±1 Mb. It was suggested that the DNA
   methylation pattern of many CpG sites, albeit at low levels, has some
   effects on the regulation of gene expression, although the DNA
   methylation status of specific CpG sites plays a critical role in gene
   regulation.

   Furthermore, this study searched for genes whose expression changes
   were associated with survival rates. Genes statistically associated
   with survival were intersected with DNA methylation-affected genes.
   Finally, 1988 genes whose transcriptional changes are affected by DNA
   methylation levels and whose expression can be related to survival were
   identified. The identified signature genes were named MASA, and
   biological functions enriched in MASA were explored. Obviously, genes
   were mainly related to immunity. Many studies have already recognized
   immune systems as a hallmark of cancer [[151]73,[152]74]. The complex
   interaction between cancer and the immune system can enhance or
   suppress cancer development and progression [[153]75]. When immune
   systems function normally, tumor-infiltrating immune cells can
   eliminate malignant cells and are associated with prognosis [[154]76].
   GO analysis also revealed immune-related responses, such as cell
   adhesion, signal transduction, antigen processing and presentation, and
   the inflammatory response [[155]57,[156]77,[157]78]. Additionally, MASA
   is enriched in the angiogenesis process, which is one of the most
   essential characteristics of cancer [[158]59]. Because cancer cells
   require nutrients and oxygen for survival, malignant cells are located
   near blood vessels to obtain essential elements [[159]79].

   Due to the importance of DNA methylation in CRC pathogenesis, it is
   crucial to thoroughly analyze the connection between DNA methylation
   and gene expression. In addition, one study, by Kerachian et al.,
   identified CRC detection biomarkers based on the DNA methylation status
   [[160]26]. Our study not only analyzed the relationships between DNA
   methylation and gene expression patterns occurring in the CGI-related
   regions, gene body, and promoter, but also revealed that the first exon
   and first intron were the most important regions in the gene body.
   Furthermore, the negative effects on gene expression regulation by DNA
   methylation within 2 kb upstream to 7 kb downstream of TSS was
   confirmed. Finally, our study identified GO and KEGG pathways in which
   methylation-affected and survival-associated genes are enriched. Thus,
   these results could help in the identification of novel biomarkers for
   the diagnosis and prognosis of CRC patients.

5. Conclusions

   In summary, this study discovered that DNA methylation within 2 kb
   upstream to 7 kb downstream regions of TSS can induce aberrant
   expression, and that genes affected by DNA methylation can be further
   associated with survival. Because these genes are related to the immune
   or angiogenesis process, the misfunction of these genes might
   negatively affect the prognosis of CRC patients.

Acknowledgments