Abstract
Background and objective
The classification of glioma subtypes is essential for precision
therapy. Due to the heterogeneity of gliomas, the subtype-specific
molecular pattern can be captured by integrating and analyzing
high-throughput omics data from different genomic layers. The
development of a deep-learning framework enables the integration of
multi-omics data to classify the glioma subtypes to support the
clinical diagnosis.
Results
Transcriptome and methylome data of glioma patients were preprocessed,
and differentially expressed features from both datasets were
identified. Subsequently, a Cox regression analysis determined genes
and CpGs associated with survival. Gene set enrichment analysis was
carried out to examine the biological significance of the features.
Further, we identified CpG and gene pairs by mapping them in the
promoter region of corresponding genes. The methylation and gene
expression levels of these CpGs and genes were embedded in a
lower-dimensional space with an autoencoder. Next, ANN and CNN were
used to classify subtypes using the latent features from embedding
space. CNN performs better than ANN for subtyping lower-grade gliomas
(LGG) and glioblastoma multiforme (GBM). The subtyping accuracy of CNN
was 98.03% (± 0.06) and 94.07% (± 0.01) in LGG and GBM, respectively.
The precision of the models was 97.67% in LGG and 90.40% in GBM. The
model sensitivity was 96.96% in LGG and 91.18% in GBM. Additionally, we
observed the superior performance of CNN with external datasets. The
genes and CpGs pairs used to develop the model showed better
performance than the random CpGs-gene pairs, preprocessed data, and
single omics data.
Conclusions
The current study showed that a novel feature selection and data
integration strategy led to the development of DeepAutoGlioma, an
effective framework for diagnosing glioma subtypes.
Supplementary Information
The online version contains supplementary material available at
10.1186/s13040-023-00349-7.
Keywords: Multi-omics, Autoencoder, Lower-grade glioma (LGG),
Glioblastoma multiforme (GBM), Convolutional neural network (CNN)
Introduction
The accurate classification of brain tumors is crucial in modern
clinical practice. For clinical decision-making, it is vital to
distinguish various types of brain tumors, such as subtypes and primary
gliomas, from metastases. Presently, the characterization of brain
tumors is mainly performed based on imaging and histopathology
[[27]1–[28]3]. Due to the heterogeneity of neoplastic brain tissue and
the mixed characteristics of gliomas, it is often difficult to separate
the subtype by imaging. Furthermore, most histopathology-based
procedures suffer from intraobserver and interobserver variability,
resulting in suboptimal clinical outcomes [[29]4, [30]5]. As a result,
new clinical variables are required for the stratification of gliomas
for precision therapy. Gliomas are the most prevalent type of brain
tumor, accounting for approximately 33% of all cases. Gliomas are
classified according to how rapidly or slowly the cells divide.
Slower-growing gliomas are known as lower-grade gliomas (LGG), whereas
more aggressive gliomas are named glioblastoma multiforme (GBM), a
Grade IV cancer. LGG is more common in younger people; instead, GBM is
more frequently diagnosed in older patients. The LGG is a grade II and
III tumor classified into three subtypes: astrocytoma,
oligodendroglioma, and oligoastrocytoma. Some of these LGGs develop
into GBM, while others remain in the same stage for an extended period
[[31]6, [32]7]. Similarly, there are three subtypes of GBM, i.e.,
classical, proneural, and mesenchymal [[33]8]. Due to the diverse
subtypes of lower and higher-grade gliomas, an effective stratification
methodology is required for improved diagnosis. Perturbations at
various molecular layers (such as gene expression, methylation, etc.)
result in the emergence of all types of human cancer. Moreover, in
gliomas, the quantum of molecular heterogeneity is too high, posing a
further challenge to early detection and understanding of disease
etiology. In this aspect, analysis of high-throughput omics data from
different molecular layers can decipher the link between molecular
signatures and cancer phenotypes. Indeed, multi-omics data integration
can elucidate how the molecular alterations at different layers
contribute to disease formation and provide a global view of the
molecular signature of disease. Previous studies have primarily relied
on mono-omics data, specifically gene expression data, to develop a
model for classifying glioma subtypes. Consequently, these models fail
to account for other molecular changes in glioma, such as epigenetic
modifications, which play a significant role in the development of
cancer [[34]9–[35]11]. The epigenetic modifications (or methylations)
directly regulate the transcriptomic landscape of the cell.
Hypermethylation of CpG sites on promoter regions generally reduces
gene expression, whereas hypomethylation elevates gene expression
[[36]12, [37]13]. Therefore, the biologically relevant diagnostic model
can be developed by integrating these two interlinked biological
phenomena. The integration of multi-omics data is a great challenge,
and powerful integration methods can provide an efficient diagnostic
tool to support the clinician [[38]14]. Recently, deep learning (DL)
models have been successfully applied to integrate high-dimensional
genomics and epigenomics data. When analyzing such data, a common
approach is to find data embedded in a lower-dimensional space. The
compressed features from embedding space can be good predictors in
predictive models. Autoencoder is a deep learning-based nonlinear
embedding approach recently implemented to integrate multi-omics data
to develop diagnostic and predictive models [[39]15–[40]17]. In the
present study, we developed an autoencoder and deep-neural
network-based, biologically relevant novel approach for integrating the
transcriptome and methylome and subsequently classified the subtypes of
LGG and GBM. To this end, we first identified the differentially
expressed genes (DEGs) and differentially methylated regions (DMRs).
Then we performed the univariate Cox regression analysis to determine
the DEGs and DMRs associated with patient survival. Next, we mapped the
DMR to the DEG through the promoter regions to find the altered
methylation affecting gene expression. Finally, we implemented an
autoencoder with concatenated inputs to integrate the survival-linked
genes and CpGs into promoters. The autoencoder reduces the
dimensionality of the corresponding gene expression and methylation
matrices. Then, the new feature from the autoencoder was used to make
DL-based models for subtyping. The current framework achieves a
superior accuracy of > 94% for subtyping LGG and GBM. The framework is
called DeepAutoGlioma. The present work introduces a new way for
subtyping brain cancer, and we think this research will shed light on
the DL-based clinical support system for accurate disease prediction
using multi-omic data.
Results
Identification of biologically relevant features for classification of LGG
and GBM subtypes
Deregulated gene expression and aberrant methylation are the hallmarks
of human cancer [[41]18]. Methylation status in the promoter region
determines the level of gene expression. Therefore, linking the
methylome and transcriptome is crucial in finding the genetic and
epigenetic features that cause cancer, which is also important for
making biologically relevant models. To connect the methylome and
transcriptome, patients with transcriptome and methylome profiles were
chosen to identify the upregulated and downregulated genes (DEGs); and
hypomethylated and hypermethylated CpGs (DMRs). We used a z-score to
screen the DEGs and DMRs (see “[42]Materials and methods” section). A
z-score greater than 1 or less than − 1 indicates the gene expression
and methylation are greater or less than the population mean,
respectively. We identified the DEGs and DMRs for each subtype of LGG
and GBM. In LGG, we found a total of 3972, 4024, and 4088 DEGs
(Fig. [43]1A) and 177,458, 181,957, and 181,163 DMRs (Fig. [44]1B) in
astrocytoma, oligoastrocytoma, and oligodendroglioma, respectively. In
subtypes of GBM, we found a total of 3910, 3767, and 3745 DEGs
(Fig. [45]1C), and 211,764, 208,111, and 190,743 DMRs (Fig. [46]1D) in
classical, mesenchymal, and proneural, respectively. We also found that
differences in average expression and methylation level between z > 1
and z <- 1 are statistically significant (p-value < 0.001) in all
subtypes (Fig. [47]1A-D). Next, we performed a univariate Cox
regression analysis to find the correlation between patient prognosis
with DEGs and DMRs. We separately generated the univariate prognostic
models for each DEG and DMR. Next, we screened the survival-associated
genes and CpG sites based on the p-value < 0.05. Our results showed
that, in LGG, a total of 2295 DEGs and 18,068 DMRs, and in GBM, a total
of 1055 DEGs and 5033 DMRs were linked to the patient’s survival. We
found that a total of 50.83% of DEGs and 20.35% of DMR in LGG; and
23.30% of DEGs and 5.41% of DMR in GBM were linked with patient
survival. This indicates that a higher percentage of genes, or CpGs,
are not linked with LGG or GBM. Therefore, univariate Cox analysis
facilitates identifying the biologically important and
cancer-associated features, which can lead to the development of a
clinically relevant DL model while reducing the dimension of the data
to build better-fit prediction models. Subsequently, we mapped the
survival-associated CpGs in promoters (TSS1500, TSS200, the first exon,
and the 5′ UTR) and their corresponding survival-associated genes. The
linking of these two layers of genomic data identifies the CpG-gene
pairs, which are involved in cancer progression. We found that in LGG,
a total of 1110 genes (DEGs) and 3204 CpGs (DMRs) in the promoter, and
in GBM, 268 genes (DEGs) and their 447 CpGs (DMRs) in the promoter are
linked to patient survival (Supplementary Table [48]1). If a gene is
involved in patient survival and if its methylation level in the
promoter, which regulates its expression, is also linked to survival,
this indicates an additive impact of methylation and gene expression on
patient prognosis. We believe that integrating methylation levels with
gene expression data will be more biologically valid for diagnostic
model development. We also found that these genes (prognostic genes)
are involved in biological processes and pathways that are linked to
cancer (Fig. [49]1E and F), such as signaling by ALK in cancer
[[50]19], cell-cell adhesion [[51]20], signaling by receptor tyrosine
kinase [[52]21], PID INTEGRIN A4B1 pathway [[53]22], gliogenesis
[[54]23], positive regulation of cell adhesion [[55]24] and
VEGFA-VEGFR2 signaling pathways [[56]25, [57]26]. Therefore, these
prognostic genes and CpGs were used for autoencoder-based data
integration and model building.
Fig. 1.
[58]Fig. 1
[59]Open in a new tab
Boxplots show the difference in gene expression and methylation level
between Z > 1 and Z < − 1. A DEGs and B DMRs in each LGG subtype;
C DEGs and D DMRs in each GBM subtype; E and F Bar plots represent
significantly enriched Biological processes and pathways of genes used
as input in the autoencoder (***p < 0.001). DEGs: differentially
expressed genes, DMRs: differentially methylated CpGs
Integration of gene expression and its promoter methylation level by
autoencoders shows superior accuracy in subtyping
In the previous section, we derived the list of genes and their CpG
sites in promoters linked to patient survival using univariate Cox
regression analysis. Next, we extracted the gene expression and
methylation matrix. We divided these datasets into training (70%) and
validation (30%) sets. 70% of the data was utilized to optimize the
model’s parameters and evaluate the performance of each model, and the
remaining 30% of the data was employed as independent predictors. The
gene expression and methylation matrices were fed into the autoencoder
with concatenated inputs (CNC-AE). The methylation and gene expression
levels are combined and compressed in the latent space or bottleneck
layer learned by the autoencoder [[60]14, [61]17, [62]27–[63]29]. All
the dimensions and parameters of the different layers in the
autoencoder were optimized. The autoencoder consists of two parts: an
encoder and a decoder network. In the encoder network, gene expression
and DNA methylation profiles of LGG and GBM are first encoded into two
4314 and 715-dimensional vectors separately through hidden layers,
respectively. Next, we set the dimensions of the bottleneck layers at
400 and 100 for LGG and GBM. In the decoder network, the latent
variables were again used to decode the original input data, and this
was used to measure the reconstruction loss, which indicates the
performance of the autoencoder. The network structure of the decoder is
similar to the mirror image of the encoder network (Fig. [64]2). If a
latent variable captures the actual data pattern, i.e., intrinsic
relationships between the variables, then the difference between the
encoded and decoded vectors will be less. We measured the
reconstruction loss using the mean squared error (MSE). We found that
MSE was significantly lower, i.e., 0.04 in LGG and 0.04 in GBM. This
shows that the autoencoder efficiently learned the pattern in gene
expression and methylation and encoded it in the latent space. Then
these latent variables were used to develop the DL models for the
classification of LGG and GBM subtypes.
Fig. 2.
[65]Fig. 2
[66]Open in a new tab
Architecture of autoencoder: The autoencoder used in DeepAutoGlioma
consists of an encoder and a decoder made from 2 hidden layers and one
bottleneck layer. The autoencoder has two input layers for DNA
methylation and gene expression; in the first hidden layer, data is
concatenated, and is passed to another hidden layer and finally
compressed in the bottleneck layer. In the decoder part, the latent
variables from the bottleneck layer are reconstructed to the initial
ones
We implemented two DL algorithms, i.e., artificial neural networks
(ANN) and convolutional neural networks (CNN), and compared their
performance for subtype classification. During the model training step,
we used the grid search method to find the best combination of
hyperparameters (see material and methods). Then, using these optimal
hyperparameters, we performed stratified k-fold cross-validation
(k = 10) on the latent variables and computed the average performance
measures for each DL model (Table [67]1). Average accuracy, recall,
precision, F1-score, false positive rate (FPR), geometric mean (GM),
and Matthew’s correlation coefficient (MCC) were used to assess the
model’s performance (see materials and methods). We found that CNN
models had higher prediction accuracy in subtyping, i.e., 98.03% (95%
CI, 98.02–98.038) and 94.07% (95% CI, 94.04–94.10) for LGG and GBM,
respectively, than the ANN models. We found that FPR, 0.01 (95% CI,
0.005–0.015), and 0.02 (95% CI, 0.01–0.03) were minimal, and the MCC
scores were high,i.e., 0.96 (95% CI, 0.95–0.97) and 0.93 (95% CI,
0.89–0.97) in the case of CNN (Table [68]1). The higher MCC score
represents a good correlation between the observed and predicted
classes. Next, we performed the classification using validation
datasets to check the reproducibility of the DL framework. We found the
accuracy of subtype classification [for LGG 95.23% (95% CI,
95.22–95.24) and GBM 90.26% (95% CI, 90.20–90.32)] of CNN was superior,
and the MCC score was 0.90 (95% CI, 0.86–0.94) and 0.92 (95% CI,
0.84–1) (Table [69]2). The accuracy of the current framework for
subtyping LGG and GBM outperforms that of earlier machine learning (ML)
and deep learning (DL) models [[70]11, [71]30]. We named this framework
DeepAutoGlioma (Fig. [72]3). We also observed the superior performance
of DeepAutoGlioma using external GEO datasets (Table [73]3). The
combination of feature genes and CpG sites in the model construction
likely accounts for the impressive performance of DeepAutoGlioma. In
most cases, feature selection approaches that rely on ML or DL ignore
the biological relevance of features [[74]31–[75]33]. However, here we
screened the DEGs and DMRs in each subtype, which were associated with
LGG and GBM patients’ survival. Also, the genes and methylation sites
used as inputs into the autoencoder are linked through their genomic
locations. Together, these approaches reduce the dimension of the data,
which significantly influences the model’s performance. There are very
limited studies available on glioma subtype classification using
multi-omics data. The study conducted by Xu et al. [[76]34]
demonstrates that the integration of multi-omics data leads to enhanced
accuracy (88.50% accuracy) in the classification of glioblastoma
subtypes as compared to the mono-omics data alone. However,
DeepAutoGlioma exhibits a higher level of accuracy in subtype
classification for both LGG and GBM. In our opinion, biologically
relevant inputs to the autoencoder provided superior accuracy (95–98%)
in the subtype classification achieved with CNN.
Table 1.
Performance evaluation of LGG and GBM subtypes classification
Methods Performance measures (Average of 10 fold cross-validation)
Accuracy [95% CI] Precision [95% CI] Recall [95% CI] F1-score [95% CI]
FPR [95% CI] Gmean [95% CI] MCC [95% CI]
LGG ANN 95.40% 92.50% 92.73% 92.45% 0.03 95.28% 0.89
[95.39–95.41] [92.48–92.52] [92.72–92.74] [92.33–92.57] [0.02–0.038]
[95.27–95.29] [0.87–0.91]
CNN 98.03% 97.67% 96.96% 96.97% 0.01 97.99% 0.96
[98.02–98.038] [97.66–97.679] [96.95–96.97] [96.96–96.98] [0.005–0.015]
[97.98–97.998] [0.95–0.97]
GBM ANN 92.19% 88.05% 89.77% 87.75% 0.03 94.76% 0.9
[92.16–92.22] [88.00- 88.10] [89.73–89.81] [87.70–87.80] [0.02–0.04]
[94.74–94.78] [0.86–0.94]
CNN 94.07% 90.40% 91.18% 90.25% 0.02 96.51% 0.93
[94.04–94.10] [90.35–90.45] [91.13–91.23] [90.20–90.30] [0.01–0.03]
[96.49–96.53] [0.89–0.97]
[77]Open in a new tab
Table 3.
Classification performance of DeepAutoGlioma on external datasets
Methods Performance measures (Average of 10 fold cross-validation)
Accuracy [95% CI] Precision [95% CI] Recall [95% CI] F1-score [95% CI]
FPR [95% CI] Gmean [95% CI] MCC [95% CI]
LGG ANN 91.89% 91.20% 88.00% 86.90% 0.06 92.13% 0.83
[91.85–91.93] [91.15–91.25] [87.94–88.06] [86.83–86.97] [0.02–0.10]
[92.09–92.17] [0.74–0.92]
CNN 91.38% 91.38% 91.38% 91.38% 0 1 1
[91.34–91.40] [91.34–91.40] [91.34–91.40] [91.34–91.40] [0–0] [1–1]
[1–1]
GBM ANN 84.10% 74.48% 79.48% 76.15% 0.06 90.55% 0.72
[84.01–84.18] [74.35–74.60] [79.37–79.58] [76.03–76.26] [0.02–0.10]
[90.49–90.60] [0.58–0.86]
CNN 86.41% 79.87% 83.33% 81.02% 0.05 92.92% 0.76
[86.33–86.49] [79.75–79.99] [83.23–83.43] [80.90–81.13] [0.01–0.09]
[92.87–92.97] [0.62–0.90]
[78]Open in a new tab
Fig. 3.
[79]Fig. 3
[80]Open in a new tab
Subtype classification framework of the DeepAutoGlioma. Methylome and
transcriptome data are preprocessed, differentially expressed genes
(DEGs) and differentially methylated regions (DMRs) are identified, and
clinically significant features are extracted. Further, these features
are mapped according to the genomic region to integrate the CpG-gene
pair. Then, clinically relevant methylation (CpGs) and gene expression
data are fed into the autoencoder, and latent variables are extracted
to build deep learning models for subtyping brain cancer
Table 2.
Classification performance of deep learning algorithms on LGG and GBM
subtypes for validation set
Methods Performance measures (Average of 10 fold cross-validation on
test datset)
Accuracy [95% CI] Precision [95% CI] Recall [95% CI] F1-score [95% CI]
FPR [95% CI] Gmean [95% CI] MCC [95% CI]
LGG ANN 90.18% 82.40% 84.23% 82.42% 0.07 90.06% 0.8
[98.16–98.20] [82.35–82.45] [84.19–84.27] [82.37–82.47] [0.05–0.09]
[90.04–90.08] [0.75–0.85]
CNN 95.23% 92.08% 92.63% 91.84% 0.03 95.30% 0.9
[95.22–95.24] [92.05–92.11] [92.60- 92.66] [91.83–91.87] [0.02–0.04]
[95.29–95.31] [0.86–0.94]
GBM ANN 93.85% 93.85% 93.85% 93.85% 0 1 1
[93.79–93.91] [93.79–93.91] [93.79–93.91] [93.79–93.91] [0–0] [1–1]
[1–1]
CNN 90.26% 85.38% 87.69% 86.15% 0.02 95.26% 0.92
[90.20–90.32] [85.29–85.47] [87.62–87.76] [86.06–86.24] [0–0.04]
[95.21–95.31] [0.84–1]
[81]Open in a new tab
DL models with a random feature set, preprocessed data and single omics data
To validate our findings and better understand the role of feature
selection in model performance, we extended the DL-based model by
feeding different sets of inputs (features) to the autoencoder and
compared their performance to that of DeepAutoGlioma. First, we
compared the performance of mapped CpGs and gene expression with
randomized CpG-gene pairs as input into the autoencoder. We randomly
selected the CpGs (n = 3204 for LGG and n = 447 for GBM) and genes
(n = 1110 for LGG and n = 268 for GBM) from preprocessed data. Then
this unmapped, randomly selected methylation and gene expression data
were fed into the autoencoder. Then, ANN and CNN were used to classify
the subtypes using the latent features from random datasets. We
repeated the process ten times, and the accuracy varied from 60.68 to
71.43% in LGG and 62.42 to 72.14% in GBM in all iterations
(Supplementary Tables [82]2 and [83]3). And the average accuracy of all
iterations in CNN was 66.12 and 66.59% in LGG and GBM, respectively.
When compared to DeepAutoGlioma, the average accuracy of all ten
iterations in CNN is significantly less (p-value < 0.001, Fig. [84]4).
Not only the accuracy, but other parameters such as precision, MCC, and
FPR are very low compared to DeepAutoGlioma. This finding confirms that
mapping the promoter methylation region to the gene has aided in
predicting LGG and GBM subtypes with greater accuracy and precision.
Fig. 4.
[85]Fig. 4
[86]Open in a new tab
Comparison of model performance using different sets of features to
that of DeepAutoGlioma (***p < 0.001)
To better understand the significance of biologically relevant
features, such as DEGs and DMRs, as well as univariate Cox regression
analysis for feature selection, we run the autoencoder on preprocessed
data and then classify using ANN and CNN. LGG and GBM gene expression
and methylation data matrices contain 14,517 and 14,125 genes,
respectively, as well as 139,403 and 141,672 CpGs. The autoencoder was
then run on these preprocessed datasets, and the accuracy of
prediction, as well as other model evaluation parameters, were measured
(Supplementary Table [87]4). When compared to DeepAutoGlioma, the
prediction accuracy is significantly (p-value < 0.001) lower
(Fig. [88]4). The subtype classification accuracy of LGG was 83.73%
(95% CI, 83.71–83.74) in CNN and 69.86% (95% CI, 69.852–69.868) in ANN.
Whereas in GBM classification, accuracy was 61.54% (95% CI,
61.53–61.55) in CNN and 67.57% (95% CI, 67.53–67.61) in ANN.
Furthermore, the results of other evaluation parameters were too low to
be considered. This unequivocally demonstrates that cancer-associated
features or features that are biologically relevant played a crucial
role in achieving higher classification accuracy.
Furthermore, we compared classification accuracy between di-omics and
mono-omics data. The mono-omics data, i.e., methylation or gene
expression matrix, was used as input to the autoencoder. As previously
stated, we extracted compressed features from latent space, used DL
algorithms, and calculated average performance metrics for each DL
model. We observed that in the case of LGG, the single omics data
showed good accuracy of prediction, i.e., 96.26% (95% CI, 96.25–96.27)
and 96.54% (95% CI, 96.53–96.55) using gene expression and methylation
data, respectively (Supplementary Table [89]5). However, these
accuracies are lower in comparison to the DeepAutoGlioma (98.03%).
However, the accuracy of prediction using test and external gene
expression was 66.07% (95% CI, 66.06–66.08) and 63.81% (95% CI,
63.77–63.85), and methylation was 66.51% (95% CI, 66.50–66.52) and
74.79% (95% CI, 74.77–74.81), which is considerably less. Whereas in
the case of GBM subtype prediction accuracy using gene expression and
methylation data was 91.54% (95% CI, 91.53–91.57) and 43.89% (95% CI,
43.87–43.91), respectively (Supplementary Table [90]5). Although gene
expression data showed higher accuracy, however, in the test and
external datasets, the accuracy was 85.48% (95% CI, 85.37–85.59) and
72.68% (95% CI, 72.63–72.73). We saw good prediction accuracy in LGG
and GBM utilizing mono-omics data, particularly gene expression, but
models were unable to accurately predict subtypes using test and
external datasets. This demonstrated that the individual omics data
were inadequate for cancer subtype classification with superior
accuracy. The models trained on multi-omics data outperformed those
trained on single-omics data, owing to the fact that multi-omics data
contains a wealth of information not found in a single type of omics
data alone.
Discussion
It is well established that molecular perturbations in different
genomic layers cause cancer occurrence and progression. Therefore, it
is crucial to perform integrative approaches that combine multi-omics
data to comprehend the disease mechanism and develop novel diagnostic
tools for brain cancer detection. The integration of high-throughput
omics data from distinct genome layers can capture the
interrelationships of biomolecules and facilitate interpreting their
function in disease onset. Transcriptomics and epigenomics data are
unpaired because they are usually measured in separate experiments,
which demands effective and efficient in-silico multi-omics integration
[[91]28]. In the present study, we designed the deep autoencoder and
deep learning (ANN and CNN)-based clinically relevant framework for
integrating the methylome and transcriptome to classify the glioma
subtype with superior accuracy. To strengthen the biological relevance,
we screened patient samples with transcriptome and methylome profiles
and measured the DEGs and DMRs in each subtype of LGG and GBM cancer.
Further, we performed a univariate Cox regression analysis to identify
the DEGs and DMRs associated with the patient’s survival. The
univariate Cox regression approach helps to determine clinically
relevant feature genes and CpG sites based on the patient’s overall
survival information; further, it also decreases the data dimension.
Next, we map the CpGs and genes based on the promoter regions. The
linked CpGs and genes were used as input in the autoencoder. As a
result, the input features in the autoencoder were biologically and
clinically relevant in three ways: first, they are differentially
regulated; second, they are linked to the patient’s survival; and
third, methylation in the promoter is linked to gene expression. We
found that using latent variables learned by the autoencoder as an
input in deep learning models (ANN and CNN), we were able to predict
the subtypes of LGG and GBM with an accuracy of 98.03% (95% CI,
98.02–98.038) and 94.07% (95% CI, 94.04–94.10), respectively, using
CNN. Furthermore, the current framework classifies the GBM and LGG
subtypes using the external datasets with 86.41% (95% CI, 86.33–86.49)
and 91.89% (95% CI, 91.85–91.93) accuracy, respectively. On the other
hand, autoencoder-based deep learning with omics data, randomized
CpG-gene pair, and preprocessed dataset did not perform well compared
to DeepAutoGlioma. We believe that feature screening using various
statistical methods and integration of di-omics data using autoencoders
play an essential role in achieving higher subtyping accuracy. The
current study demonstrated how data integration could lead to the
discovery of novel patterns in transcriptomics and epigenomics data and
aid in developing efficient diagnostic tools. It is anticipated that
multimodal learning approaches for multi-omics data analysis will
become more prevalent in cancer diagnosis, allowing physicians to more
accurately determine the most effective line of therapy. We hope the
current DL framework will assist clinicians in personalizing treatment
for brain cancer patients, which could lead to better treatment
outcomes.
Conclusions
The accurate subtyping of gliomas is crucial for precision therapy. A
DL-based model can improve the overall precision and efficacy of
diagnostic processes using large-scale omics data. However, clinical
diagnosis still raises questions about the validity and
interpretability of DL- or AI-based diagnostic models. Therefore, it is
essential to design a biologically and clinically relevant AI-based
diagnostic model to increase the reliability of diagnosis. Here we
design the AI-based diagnostic tool, i.e., DeepAutoGlioma, for
subtyping the glioma. The transcriptome and methylome data of glioma
patients were used to extract biologically and clinically relevant
features for model development. The features from two levels of genomic
layers were integrated to capture cancer-specific patterns for accurate
subtyping. Integration of omics data enables us to achieve greater
model performance because it provides a wealth of information from
different genomic layers. The model developed based on multi-omics data
can greatly support the clinician in personalizing treatment.
Materials and methods
Data collection and preprocessing
We retrieved methylome (Illumina Infinium HumanMethylation450 platform)
and transcriptome (RNA-seq) data of TCGA from UCSC Xena
([92]https://xena.ucsc.edu) [[93]35] log2 (RSEM + 1) values for gene
expression and beta-values for methylation levels were considered for
analysis. Here, RSEM stands for RNA-Seq by Expectation Maximization.
Next, low-expressed genes were filtered out of the transcriptome data
[log2 (RSEM + 1) < 0.1 in the 90% sample]. Patients with both
transcriptome and methylome profiles were considered for analysis. GBM
patients (n = 52) were divided into three groups based on their
clinical information: classical (n = 16), mesenchymal (n = 22), and
proneural (n = 14). Similarly, the LGG patients (n = 281) were divided
into three groups based on cancer subtype, i.e., astrocytoma (n = 96),
oligoastrocytoma (n = 75), and oligodendroglioma (n = 110). We obtained
the external data set from the Gene Expression Omnibus (GEO)
repository. The subtyping of LGG was validated using [94]GSE74462,
[95]GSE43378 (gene expression data), and [96]GSE129477 (DNA methylation
data). The subtyping of GBM was validated using the gene expression
data from [97]GSE145645 and the DNA methylation data from
[98]GSE128654.
Identification of differentially expressed genes and differentially
methylated regions
DEGs and DMRs were identified by z-score. We classified high- and
low-expressed genes as well as hyper- and hypo-methylated CpG sites
using the Z-score because healthy patient data for the transcriptome
and methylome were unavailable. The following formula was used to
determine the Z-score for each gene or CpG site in a certain subtype:
[MATH:
Z-score
=x--μσ :MATH]
Here,
[MATH: x_ :MATH]
denotes the subtype-specific average gene expression or methylation
level, whereas
[MATH: μ :MATH]
and
[MATH: σ :MATH]
stand for the population mean and population standard deviation,
respectively. For each subtype of LGG and GBM, we used Z-score > 1 for
higher expression and hypermethylation and Z-score < -1 for lower
expression and hypomethylation. Then, considering that differential
methylation in the promoter regions may affect the related gene’s
expression, we screened the higher- and lower-expressed genes whose
promoter regions were differentially methylated. Finally, genes with
differential expression and methylated promoter regions were used for
further analysis.
Construction of univariate cox regression models and survival analysis
Univariate Cox regression analysis was implemented to build the
prognostic risk-score model for a particular gene and CpG site
[[99]36]. Univariate Cox regression analysis was performed using the
survminer and survival package in R. The p-value < 0.05 was considered
the significant association of a gene or CpG site with patients’
overall survival (OS).
[MATH: h(t)=h0(t)×exp<
mfenced close="}"
open="{">b1x1+b2x2+........+bp
xp :MATH]
Where t is survival time, h(t) is the hazard function determined by a
set of covariates (
[MATH: x1 :MATH]
,
[MATH: x2 :MATH]
, …….,
[MATH: xp :MATH]
) for genes or methylation sites,
[MATH: b1 :MATH]
,
[MATH: b2 :MATH]
, …….,
[MATH: bp :MATH]
are the coefficients of regression,
[MATH: h1 :MATH]
is a baseline hazard.
Mapping and integration of methylation and gene expression data
CpG IDs and genes were mapped through the promoter region. The TSS1500,
TSS200, the first exon, and the 5′ UTR were considered promoters of a
gene. If both gene expression and methylation levels at the promoter
alter (i.e., DEGs and DMRs), then we screened those CpG-gene pairs.
Next, we constructed methylation and gene expression matrices using
these CpGs and genes fed in an autoencoder using two separate layers.
Biological processes and pathway enrichment analysis
We analyzed biological processes and pathway enrichment using the
Metascape tool [[100]37]. Enrichment analysis was performed using the
following ontology sources: Gene Ontology (GO) Biological Processes,
KEGG Pathway and Reactome Gene Sets, and the Kyoto Encyclopedia of
Genes and Genomes (KEGG). If the adjusted p-value < 0.05, the
biological process or pathway was considered significantly enriched.
Autoencoder implementation
Autoencoders are feed-forward neural networks that aim to copy the
input variable to the output variable with the minimum loss of
information. It compresses the inputs into latent variables in the
bottleneck layer’s embedding space and then reconstructs the output
from the embedding space. The autoencoder is composed of two parts: the
encoder and the decoder. The encoder maps the high dimensional input
data into latent variables in embedding space, and the decoder
reconstructs the input data from the embedding. Here, we used one
concatenated layer, one hidden layer, and a bottleneck layer in the
encoding part. We used a concatenated autoencoder to integrate the gene
expression and methylation data. We used the Keras library with
TensorFlow [[101]38] to implement the concatenated autoencoder. To
integrate the gene expression and methylation level of LGG in the
hidden layer of the autoencoder, a rectified linear activation function
(ReLU) was used. In the bottleneck layer, a uniform kernel initializer
and a linear activation function were implemented. Similarly, in the
decoding layer, one hidden layer and a concatenated layer were also
used. ReLU activation function was applied to the decoder layer. The
same architecture we have followed in the GBM dataset for integrating
the gene expression and methylation data. In the GBM dataset, the
Exponential Linear Unit (ELU) activation function was used in the
hidden layers. Linear activation function and a uniform kernel
initializer were employed at the bottleneck layer. Further, the ELU
activation function is applied to the decoder layer. Epoch size and
batch size were 1500 and 16, respectively, in each dataset. The
architecture of the autoencoder is shown in Fig. [102]2. We selected
1110 features from gene expression and 3204 features from DNA
methylation in the LGG dataset, while in the GBM dataset, 268 features
from gene expression, 447 features from DNA methylation were selected
for the input layer. For the autoencoder, we set a concatenate layer,
hidden layers, and a bottleneck layer, respectively. We obtained the
400 and 100 features from the bottleneck in the LGG and GBM datasets,
respectively.
Deep learning classifier
ANNs, which imitate the human brain, are feed-forward neural networks.
ANNs are represented by a weighted, directed graph connecting inputs to
a series of interconnected “hidden” layers that are composed of
multiple nodes called “neurons” that are in turn connected to an output
layer [[103]39]. ANNs are trained to recognize and categorize complex
patterns. There is one input layer, one output layer, and one hidden
layer in the network. The hidden layers lie between the input and
output layers. The number of output neurons varies depending on the
specific application, while the number of input neurons is equal to the
number of attributes. Here, latent variables obtained from the
bottleneck layer of the autoencoder were used as input. The parameters
were optimized on training datasets with the grid search method using
the GridSearchCV package in Python used for classifying the LGG and GBM
by implementing ANN method: for LGG the parameters are,
activation = relu, batch size = 32, epochs = 100, and optimizer = adam;
for GBM, the parameters are activation = linear, batch size = 30,
epochs = 50, and optimizer = RMSprop. We have used Keras library to
build ANN deep-learning classifiers on the Python platform.
CNN is a type of deep learning method that directly learns from the
data. CNN consists of three layers: convolutional, pooling and fully
connected (FC) layers [[104]40]. The convolutional layer is the first
layer, while the FC layer is the last. In the first layer, i.e., the
convolutional layer, where filters are applied to raw data or feature
maps in deep CNN, convolution is one linear operation utilized in place
of generic matrix multiplication. The convolution operation (denoted by
an asterisk) is defined by:
[MATH: ft=x∗Kt :MATH]
where the function
[MATH: xt :MATH]
is referred to as input,
[MATH: Kt :MATH]
is referred to as kernel, and the
[MATH: ft :MATH]
is referred to as output. After the convolutional layer, the input data
is downscaled by the pooling layer to save computation, and the final
prediction is made by the fully connected layer. Since every node in a
single layer is fully connected to every node in the subsequent layer,
it represents a network that is fully connected. Keras library was used
to construct the model architecture for CNN. Eight convolutional layers
were used to obtain the best result. Furthermore, parameters were
optimized with the grid search method using the GridSearchCV package in
Python. The parameters were used for classifying the LGG and GBM by
implementing CNN are: for LGG the parameter are, activation = relu,
batch size = 64, dropout rate = 0.2, epochs = 2000, filters = 1, kernel
size = 3, optimizer = RMSprop; for GBM the parameters are,
activation = elu, batch size = 64, dropout rate = 0.2, epochs = 2000,
filters = 1, kernel size = 3, optimizer = RMSprop. Using a stratified
k-fold CV, the 70% training dataset was used to calculate the average
performance of the model using the optimal parameters. In a stratified
k-fold CV, the dataset is split into k different folds, of which k-1
was utilized to train the network, and the final fold was set aside for
testing. This procedure is then repeated until all folds are used once
as a test set. The final output is then computed by averaging the
performance parameters obtained from each test set.
Performance evaluation
The performance of the DL model was evaluated based on the eight
criteria: accuracy, sensitivity, specificity, precision, F1-score, FPR,
geometric mean, and MCC. A true positive (TP) would indicate that the
cancerous cell is identified correctly, while a false positive (FP)
indicates that the cancerous cell is identified as healthy. Conversely,
true negatives (TN) and false negatives (FN) are calculated for healthy
cells. The following equations define the metrics:
[MATH:
Accurac
y=TP+TNTP+T<
mi>N+FP+FN
mrow> :MATH]
[MATH:
Sensiti
vity=TPTP+
FN :MATH]
[MATH:
Specifi
city=TNTN+
FP :MATH]
[MATH:
Precisi
on=TPTP+
FP :MATH]
[MATH:
F1-scor
e=2×TP2×TP+<
mi>FP+FN
:MATH]
[MATH: FPR=FPTN+
FP :MATH]
[MATH:
Geometr
icmean=(x1.x2⋯⋯.x
mi>n)n :MATH]
[MATH:
MCC=TP×TN-(FP×FN)TP+FPTP+FNTN+FP(TN+FN)
:MATH]
We used the sklearn.metrics library in Python to calculate the above
score by importing functions such as confusion_matrix and
classification_performance.
Statistical analysis
Pairwise comparison was done using the Mann-Whitney U test using Sigma
Plot 11.0.
Supplementary Information
[105]13040_2023_349_MOESM1_ESM.xlsx^ (121.3KB, xlsx)
Additional file 1: Supplementary Table 1. List of input features in
autoencoder.
[106]13040_2023_349_MOESM2_ESM.docx^ (17.8KB, docx)
Additional file 2: Supplementary Table 2. Model performance in LGG
subtype classification using random features.
[107]13040_2023_349_MOESM3_ESM.docx^ (18.1KB, docx)
Additional file 3: Supplementary Table 3. Model performance in GBM
subtype classification using random features.
[108]13040_2023_349_MOESM4_ESM.docx^ (16KB, docx)
Additional file 4: Supplementary Table 4. Model performance in LGG and
GBM subtyping using preprocessed data as a feature.
[109]13040_2023_349_MOESM5_ESM.docx^ (18.4KB, docx)
Additional file 5: Supplementary Table 5. Classification performance of
deep learning algorithms on LGG and GBM subtyping using mono-omics
data.
Acknowledgements