Abstract

   Background/Objectives: Glioblastoma multiforme (GBM) is an aggressive
   and heterogeneous brain tumor with poor prognosis, emphasizing the need
   for reliable molecular biomarkers to improve patient stratification and
   treatment planning. This study aimed to identify key genes associated
   with overall survival in GBM by employing and comparing machine
   learning (ML) and deep learning (DL) approaches using RNA-Seq gene
   expression data. Methods: RNA-Seq expression and clinical data for
   primary GBM tumors were obtained from The Cancer Genome Atlas (TCGA). A
   univariate Cox proportional hazards regression was used to identify
   survival-associated genes. For survival prediction, ML-based feature
   selection techniques—RF, GB, SVM-RFE, RF-RFE, and PCA—were used to
   construct multivariate Cox models. Separately, DeepSurv, a DL-based
   survival model, was trained using the significant genes from the
   univariate analysis. Gradient-based importance scoring was applied to
   determine key genes from the DeepSurv model. Results: Univariate
   analysis yielded 694 survival-associated genes. The best ML-based Cox
   model (RF-RFE with 90% training data) achieved a c-index of 0.725. In
   comparison, DeepSurv demonstrated superior performance with a c-index
   of 0.822. The top 10 genes were identified from the DeepSurv analysis,
   including CMTR1, GMPR, and PPY. Kaplan–Meier survival curves confirmed
   their prognostic significance, and network analysis highlighted their
   roles in processes such as purine metabolism, RNA processing, and
   neuroendocrine signaling. Conclusions: This study demonstrates the
   effectiveness of combining ML and DL models to identify prognostic gene
   expression biomarkers in GBM, with DeepSurv providing higher predictive
   accuracy. The findings offer valuable insights into GBM biology and
   highlight candidate biomarkers for further validation and therapeutic
   development.

   Keywords: glioblastoma multiforme, RNA-Seq, survival analysis, machine
   learning, deep learning, biomarkers, Cox regression, gene network
   analysis

1. Introduction

   Glioblastoma multiforme (GBM) is the most aggressive and fatal form of
   primary brain tumor, classified as grade IV astrocytoma by the World
   Health Organization [[30]1,[31]2,[32]3,[33]4]. GBM is characterized by
   rapid cellular proliferation, high intertumoral heterogeneity, and
   extensive infiltration into surrounding brain tissue, which severely
   limits the effectiveness of conventional therapies [[34]5,[35]6,[36]7].
   Despite advances in surgical resection, radiotherapy, and chemotherapy,
   the median overall survival of GBM patients remains approximately 14 to
   15 months, with a five-year survival rate of less than 5%
   [[37]8,[38]9]. This highlights the urgent need to identify robust
   molecular biomarkers to improve prognostic accuracy and guide
   personalized therapeutic approaches [[39]10,[40]11,[41]12]. It should
   be noted that the current study uses data collected prior to the 2021
   WHO reclassification, which now defines GBM strictly as IDH-wildtype.
   Therefore, the term GBM in this manuscript may include cases that would
   now be reclassified as IDH-mutant astrocytoma.

   Tumorigenesis in GBM involves complex molecular alterations, including
   gene mutations, aberrant gene expression, and dysregulated signaling
   pathways [[42]3,[43]13]. Gene expression profiling, particularly
   through RNA sequencing (RNA-Seq), provides insights into
   transcriptional changes that reflect tumor behavior and can reveal
   potential prognostic markers [[44]14,[45]15]. In particular,
   dysregulated expression of tumor suppressors and oncogenes identified
   through RNA-Seq has been associated with GBM progression and patient
   survival, supporting its utility in biomarker discovery
   [[46]16,[47]17].

   The Cancer Genome Atlas (TCGA) provides a comprehensive and publicly
   accessible resource containing multidimensional molecular data across
   various cancer types, including GBM [[48]18,[49]19]. The TCGA GBM
   dataset includes RNA-Seq gene expression profiles along with detailed
   clinical information, such as patient survival time and vital status
   [[50]20,[51]21]. This data offers a valuable opportunity to explore the
   association between gene expression signatures and patient prognosis.

   Recent developments in machine learning (ML) and deep learning (DL)
   have revolutionized cancer research by enabling the extraction of
   complex patterns from high-dimensional omics data [[52]22]. These
   computational approaches can be used to identify potential biomarkers,
   classify tumor subtypes, and build accurate prognostic models. ML
   methods such as random forest, support vector machine, and gradient
   boosting machine are widely used for feature selection and survival
   prediction [[53]23,[54]24,[55]25], while DL models such as autoencoders
   or deep neural networks can capture intricate non-linear relationships
   in large datasets [[56]20,[57]26].

   In the context of GBM, integrating RNA-Seq expression data with
   survival analysis using ML and DL offers a promising strategy to
   uncover novel genes that are strongly associated with overall survival.
   However, due to the high dimensionality and variability inherent in
   gene expression data, careful feature selection and model validation
   are essential for building reliable and interpretable prognostic
   models.

   In this study, we analyzed RNA-Seq-based gene expression profiles and
   corresponding survival data of GBM patients from the TCGA dataset. We
   applied a combination of statistical methods, machine learning
   algorithms, and deep learning approaches to identify key genes
   associated with patient prognosis. Our findings aim to provide insights
   into the transcriptional landscape of GBM and support the development
   of effective survival prediction models for clinical applications.

2. Materials and Methods

2.1. Data Collection

   RNA-Seq expression data and corresponding clinical information for GBM
   were obtained from The Cancer Genome Atlas (TCGA) project through the
   Genomic Data Commons (GDC) data portal
   ([58]https://portal.gdc.cancer.gov/) accessed on 24 June 2024. This
   study utilized data from Project ID of TCGA-GBM with accession number
   phs000178, specifically focusing on RNA-Seq-based gene expression
   quantification data (workflow type: STAR-Counts). A total of 162
   samples were included in the analysis, comprising 157 tumor samples and
   5 healthy samples. The dataset contained raw read counts across 29,128
   genes, which were normalized using the TMM and voom transformation
   pipeline.

   Since the dataset was collected prior to the release of the 2021 WHO
   classification of central nervous system tumors, IDH mutation status
   was not available for most samples and was therefore not used for
   stratification. As such, the term “GBM” in this study may include both
   IDH-wildtype and IDH-mutant cases based on earlier classification
   criteria. Patient demographic and clinical characteristics are
   summarized in [59]Table 1.

Table 1.

   Patient cohort summary (n = 157).
          Category                Subcategory         Value
    Overall Survival Time         Mean (days)         415.54
                           Standard Deviation (days)  384.81
             Age                  Mean (years)        59.57
                           Standard Deviation (years) 13.51
             Sex                      Male             102
                                     Female             55
        Vital Status                Deceased           126
                                     Alive              31
          Ethnicity            Hispanic or Latino       4
                             Not Hispanic or Latino    129
                                  Not Reported          24
            Race                     White             141
                           Black or African American    10
                                     Asian              5
                                  Not Reported          1
   Classification of Tumor          Primary            108
                                   Recurrence           12
                                  Progression           32
                                    Unknown             5
   [60]Open in a new tab

2.2. Identification of Survival-Associated Genes

   To determine genes significantly associated with overall survival in
   GBM patients, a univariate Cox proportional hazards regression analysis
   was performed on each gene individually using the gene expression data
   [[61]27]. This model estimates the hazard ratio (HR) of each gene,
   quantifying its effect on the risk of death over time. Genes with
   statistically significant associations with patient survival (p-value <
   0.05) were selected for downstream modeling. This approach enables the
   identification of potential prognostic biomarkers that may contribute
   to differences in survival outcomes among GBM patients. The Cox
   regression model is defined as
   [MATH: <mrow><mrow><mi>h</mi><mfenced
   separators="|"><mrow><mi>t</mi></mrow><mrow><mi>X</mi></mrow></mfenced>
   <mo>=</mo><msub><mrow><mi>h</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   fenced
   separators="|"><mrow><mi>t</mi></mrow></mfenced><mi>e</mi><mi>x</mi><mi
   >p</mi><mfenced
   separators="|"><mrow><mi>β</mi><mi>X</mi></mrow></mfenced></mrow></mrow
   > :MATH]
   (1)

   where
   [MATH: <mrow><mrow><mi>h</mi><mfenced
   separators="|"><mrow><mi>t</mi></mrow><mrow><mi>X</mi></mrow></mfenced>
   </mrow></mrow> :MATH]
   is the hazard function at time
   [MATH: <mrow><mrow><mi>t</mi></mrow></mrow> :MATH]
   given covariates
   [MATH: <mrow><mrow><mi>X</mi></mrow></mrow> :MATH]
   ,
   [MATH:
   <mrow><mrow><msub><mrow><mi>h</mi></mrow><mrow><mn>0</mn></mrow></msub>
   <mfenced separators="|"><mrow><mi>t</mi></mrow></mfenced></mrow></mrow>
   :MATH]
   is the baseline hazard, and
   [MATH: <mrow><mrow><mi>β</mi></mrow></mrow> :MATH]
   represents the estimated coefficients for each gene [[62]28].

   The univariate Cox regression analyses were implemented using R version
   4.2.1 in RStudio. Survival analysis was performed using the survfit
   function from the survival package, and the log-rank p-values were
   extracted using the surv_pvalue function from the survminer package.
   Genes with p-values < 0.05 were considered significantly associated
   with overall survival and were retained for downstream analysis.

2.3. Integration of Machine Learning with Survival Analysis

2.3.1. Feature Selection

   To reduce the high dimensionality of the gene expression dataset and
   select the most informative features, multiple machine learning
   algorithms were employed for feature importance estimation. This step
   is essential to ensure that only relevant genes with predictive power
   are included in the survival models [[63]29]. The random forest (RF)
   algorithm, an ensemble-based method, was used to estimate the
   importance of each gene based on the mean decrease in node impurity
   [[64]24,[65]25,[66]30]. RF handles high-dimensional data well and is
   robust to overfit. Similarly, gradient boosting machine (GB) builds an
   ensemble of weak learners sequentially and focuses on reducing
   prediction errors by emphasizing difficult cases [[67]25,[68]31]. It
   provides feature importance based on how frequently and effectively a
   feature is used in boosting iterations. Support vector machine with
   recursive feature elimination (SVM-RFE) was applied as a wrapper
   method, where features are recursively removed based on the model’s
   weight coefficients until the optimal subset is reached [[69]25]. This
   method is particularly effective in eliminating irrelevant genes that
   contribute little to classification performance. Additionally, random
   forest recursive feature elimination (RF-RFE), a variant of RFE using
   RF as the base model, was employed to iteratively remove the least
   important features based on RF’s internal ranking [[70]25,[71]29].
   Lastly, principal component analysis (PCA), an unsupervised
   dimensionality reduction technique, was used to transform the original
   gene space into a smaller set of uncorrelated principal components.
   Though PCA does not directly rank individual genes, it helps identify
   key variance-explaining gene combinations which were further analyzed
   [[72]29]. The top-ranked genes from each method were consolidated and
   used for survival modeling.

   To implement these feature selection methods, all analyses were
   conducted using R version 4.2.1. [73]Table 2 summarizes the methods,
   corresponding R packages, functions, and key parameters used for
   reproducibility.

Table 2.

   Summary of feature selection methods, packages, functions, and
   parameters.
   Method R Package(s) Function(s) Key Parameters/Notes
   RF randomForest randomForest() n.trees = 100–1000; best ntree selected
   by accuracy
   GB gbm gbm() n.trees = 100–1000, interaction.depth = 3, shrinkage =
   0.01, cv.folds = 5
   SVM-RFE caret, e1071 rfe() with svmFuncs 10-fold cross-validation;
   default svmLinear kernel
   RF-RFE caret, randomForest rfe() with rfFuncs 10-fold cross-validation;
   sizes = c(10, 20, 50, 100); final model with ntree = 500
   PCA stats prcomp() Data centered and scaled; top 20 loadings from PC1
   selected
   [74]Open in a new tab

2.3.2. Survival Modeling Using Cox Proportional Hazards

   The most informative genes identified through feature selection were
   used to develop multivariate Cox proportional hazards models to predict
   patient survival [[75]32]. To assess the robustness and
   generalizability of the model, the dataset was split into training and
   testing subsets with varying proportions (60%, 70%, 80%, and 90%
   training data), and a 5-fold cross-validation strategy was implemented
   within the training set. This modeling approach allows for the
   estimation of the joint effects of multiple genes on survival
   [[76]21,[77]29]. Models were trained and optimized using only the
   training subset, while model performance was evaluated on the hold-out
   testing set [[78]24]. The genes included in the best-performing models
   across different splits were considered robust survival biomarkers.

2.4. Integration of Deep Learning with Survival Analysis

2.4.1. DeepSurv Model Development

   In addition to traditional survival models, a deep learning-based
   survival analysis was conducted using the DeepSurv framework
   [[79]33,[80]34]. DeepSurv is a deep feedforward neural network that
   extends the Cox proportional hazards model by incorporating the ability
   to model complex and non-linear relationships among input features
   [[81]20,[82]26]. In this study, the input features for DeepSurv
   consisted of all genes that were identified as significant through
   univariate survival analysis.

   The DeepSurv model was implemented in R (version 4.2.1) using the Keras
   and TensorFlow libraries. Multiple training–testing data splits were
   evaluated (60%, 70%, 80%, and 90% training), with 5-fold
   cross-validation applied to ensure the robustness and generalizability
   of the model [[83]26]. To achieve optimal predictive performance, a
   comprehensive hyperparameter tuning process was employed. The
   hyperparameter space included number of hidden units (16, 32, 64),
   activation functions (“relu”, “tanh”), dropout rates (0.0, 0.1, 0.2),
   learning rates (10^−5, 10^−6, 10^−7), number of epochs (30, 40, 50),
   and batch sizes (8, 12, 16). As a regularization scheme, dropout was
   used, and L1 or L2 regularization was not applied. For each training
   split, 486 model configurations were evaluated. By iteratively
   evaluating different combinations of these parameters through grid
   search and cross-validation, the configuration that yielded the best
   predictive performance on the validation data was selected as the final
   model. This approach enabled the model to effectively learn intricate
   patterns within the gene expression data that are potentially
   associated with patient survival, thereby enhancing its prognostic
   capabilities.

2.4.2. Feature Importance Extraction from DeepSurv

   After training the DeepSurv model, a gradient-based approach was
   employed to identify the most influential genes for survival
   prediction. This technique calculates the gradient of the model’s
   output (predicted risk) with respect to each input gene, quantifying
   how much a slight change in a gene’s value affects the prediction.
   Using TensorFlow’s GradientTape, gradients were computed across the
   test dataset. The absolute values of these gradients were averaged
   across all test samples to produce an important score for each gene
   [[84]35,[85]36]. The top genes with the highest average importance were
   selected as key contributors and exported for further analysis. This
   method provides an interpretable mechanism to extract biologically
   relevant features from deep learning models, which are often perceived
   as black boxes.

2.5. Model Performance Evaluation

   To evaluate the performance of both traditional Cox and DeepSurv
   models, the concordance index (c-index) was employed. The c-index is a
   widely used metric in survival analysis to measure the discriminatory
   power of a predictive model. Formally, for a pair of individuals
   [MATH: <mrow><mrow><mi>i</mi></mrow></mrow> :MATH]
   and
   [MATH: <mrow><mrow><mi>j</mi></mrow></mrow> :MATH]
   , let
   [MATH:
   <mrow><mrow><msub><mrow><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msub>
   </mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><mrow><msub><mrow><mi>y</mi></mrow><mrow><mi>j</mi></mrow></msub>
   </mrow></mrow> :MATH]
   represent the observed survival times and
   [MATH: <mrow><mrow><msub><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mo
   stretchy="false">^</mo></mover></mrow><mrow><mi>i</mi></mrow></msub></m
   row></mrow> :MATH]
   and
   [MATH: <mrow><mrow><msub><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mo
   stretchy="false">^</mo></mover></mrow><mrow><mi>j</mi></mrow></msub></m
   row></mrow> :MATH]
   the corresponding predicted risks. The c-index estimates the
   probability that the model assigns a higher risk to the patient with
   the shorter survival time and is defined as
   [MATH: <mrow><mrow><mi>C</mi><mo>=</mo><mi>P</mi><mfenced
   separators="|"><mrow><msub><mrow><mover
   accent="true"><mrow><mi>y</mi></mrow><mo
   stretchy="false">^</mo></mover></mrow><mrow><mi>i</mi></mrow></msub><mo
   >></mo><msub><mrow><mover accent="true"><mrow><mi>y</mi></mrow><mo
   stretchy="false">^</mo></mover></mrow><mrow><mi>j</mi></mrow></msub></m
   row><mrow><msub><mrow><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o><</mo><msub><mrow><mi>y</mi></mrow><mrow><mi>j</mi></mrow></msub></mr
   ow></mfenced></mrow></mrow> :MATH]
   (2)

   The c-index value ranges from 0 to 1, where a higher value indicates
   better model performance. In this study, the c-index was computed on
   the test dataset for each model and used as the primary metric to
   select the best-performing model configurations and genes [[86]20].

2.6. Kaplan–Meier Survival Analysis

   To validate the clinical utility of the selected prognostic genes,
   Kaplan–Meier (KM) survival curves were plotted to visualize survival
   differences between patient subgroups [[87]37]. Patients were
   stratified into high-risk and low-risk groups based on the median
   predicted risk score or gene expression level [[88]38]. The log-rank
   test was applied to determine whether the observed survival differences
   between the groups were statistically significant [[89]39]. A
   significant p-value supports the prognostic relevance of the gene under
   investigation. KM analysis provides both a statistical test and an
   intuitive visual representation of how gene expression profiles or risk
   scores influence survival probabilities over time [[90]40], where a
   greater separation between curves indicates stronger predictive power.

2.7. Pathway and Network Analysis

   To understand the biological functions and interactions of the top
   genes identified through machine learning and deep learning, pathway
   enrichment and gene interaction network analysis was conducted using
   GeneMANIA ([91]https://genemania.org/) accessed on 16 May 2025.
   GeneMANIA integrates data from multiple sources, including
   co-expression, co-localization, protein–protein interactions, and
   shared pathways, to predict gene functions and visualize biological
   networks. This integrative approach enables the identification of
   functional modules and potential regulatory relationships among genes
   [[92]41]. The resulting networks and enriched pathways provide valuable
   insight into the possible mechanisms by which these genes influence GBM
   prognosis, including their roles in metabolic regulation, signal
   transduction, and cellular structural dynamics [[93]42,[94]43].

3. Results

3.1. Clinical Data Summary

   This clinical dataset includes information on survival status, overall
   survival duration, gender, and patient follow-up time. The majority of
   patients in this dataset were deceased at the end of the observation
   period, accounting for 80%, while 20% of patients were still alive.
   Gender distribution shows that approximately 65% were male and 35% were
   female, indicating a male-dominated patient population in this study.
   Among the deceased patients, overall survival duration varied widely,
   ranging from 5 to 1537 days, with a median value of approximately 360
   days, indicating significant variability in patient survival after
   diagnosis or the start of observation. For the surviving patients, the
   follow-up duration ranged from 13 to 958 days, with a median of
   approximately 268 days. Overall, the variation in survival status and
   observation period within this patient cohort provides a strong basis
   for conducting further survival analyses, such as the Cox proportional
   hazards model, to evaluate the relationship between gene expression and
   patient prognosis.

3.2. Exploratory Analysis of Gene Expression Data

   To ensure the appropriateness of the RNA-Seq data for linear modeling,
   a voom transformation was applied to stabilize the variance across the
   range of gene expression levels. This transformation is crucial for
   meeting the assumptions of linear modeling, particularly in handling
   the heteroscedasticity typical of RNA-Seq data. The transformation’s
   effect is illustrated in [95]Figure 1, where each dot represents a
   gene. The x-axis shows the average log2 counts per million, while the
   y-axis displays the square root of the standard deviation. A clear
   downward trend is observed along the red trend line, indicating that
   genes with lower expression levels tend to exhibit higher variability,
   whereas highly expressed genes show more stable variance. This trend
   enables the calculation of observation-level precision weights, thereby
   improving the reliability of downstream differential expression
   analysis.

Figure 1.

   [96]Figure 1
   [97]Open in a new tab

   Mean variance trend plot of voom-transformed data.

   To further explore the overall structure and variability within the
   dataset, PCA was conducted. The PCA results are shown in [98]Figure 2,
   which visualizes the projection of samples onto the first two principal
   components (PC1 and PC2). In this plot, primary solid tumor samples are
   represented by black dots and normal solid tissue samples by pink dots.
   A distinct separation between the two groups is evident, particularly
   along PC1, which captures the largest proportion of variance and serves
   as the main axis distinguishing tumor from normal samples. PC2 accounts
   for additional, though smaller, variation. This clear clustering
   pattern indicates substantial differences in gene expression profiles
   between tumor and normal tissues, highlighting the biological relevance
   of the data and supporting its suitability for identifying
   differentially expressed genes and building prognostic models.

Figure 2.

   [99]Figure 2
   [100]Open in a new tab

   Principal component analysis of tumor and normal tissue samples.

3.3. Univariate Survival Analysis Using Gene Expression Data

   To assess the relationship between gene expression and overall
   survival, a univariate survival analysis was conducted using KM curves
   and log-rank tests. The gene expression matrix, consisting of 29,128
   genes, was first pre-processed to match with clinical survival data.
   Genes with incomplete expression profiles were excluded from further
   analysis to ensure reliable statistical inference. For each gene,
   patients were stratified into two groups, high expression and low
   expression, based on the mean expression value of the respective gene
   across all samples. A KM survival model was then fitted to compare the
   survival distributions between these two groups, and a log-rank test
   was used to compute the p-value indicating statistical significance.
   This procedure was iteratively applied to all genes. Only genes with a
   p-value < 0.05 were retained as significantly associated with overall
   survival. As a result, 694 genes were identified as statistically
   significant, suggesting their potential relevance as prognostic
   biomarkers.

3.4. Machine Learning-Based Feature Selection and Survival Prediction
Performance

   To assess the impact of different feature selection techniques on
   survival prediction, multivariate Cox proportional hazards models were
   developed using gene sets selected by five machine learning algorithms.
   These models were trained and evaluated under varying training–test
   splits (60%, 70%, 80%, and 90% training data), and their predictive
   performance was measured using the c-index. The results are summarized
   in [101]Table 3.

Table 3.

   Average c-index values on test data for Cox proportional hazards models
   based on different machine learning feature selection methods.
   Ratio of Training Data (%)  RF    GB   SVM-RFE RF-RFE  PCA
               60             0.585 0.605  0.598  0.606  0.619
               70             0.586 0.610  0.580  0.574  0.610
               80             0.571 0.590  0.613  0.585  0.596
               90             0.598 0.618  0.627  0.725  0.608
   [102]Open in a new tab

   The highest predictive performance was achieved using RF-RFE at 90%
   training data, with a c-index of 0.725, suggesting that this method was
   most effective in selecting gene features relevant to survival
   outcomes. The substantial increase in performance at 90% training also
   highlights the benefit of a larger training set when dealing with
   high-dimensional biological data. Across all training splits, PCA
   consistently yielded moderate to good performance, with a notable
   c-index of 0.619 at the 60% split. Despite being an unsupervised
   method, PCA’s ability to retain variance from the original dataset may
   have captured essential biological signals related to survival.
   Meanwhile, SVM-RFE and GB also demonstrated stable performance,
   especially at higher training proportions. GB maintained C-index values
   above 0.59 across all data splits, while SVM-RFE achieved a C-index of
   0.627 at the 90% split. These results affirm the potential of combining
   ensemble and wrapper-based methods with survival analysis for robust
   biomarker selection. In contrast, RF without recursive elimination
   showed lower and relatively flat performance across different splits,
   indicating that direct importance ranking alone may not sufficiently
   refine feature selection in this context. Overall, the findings
   emphasize the importance of both the choice of machine learning-based
   feature selection method and the size of the training dataset in
   determining survival prediction performance. Recursive elimination
   techniques, particularly those based on ensemble models, appear most
   suitable for high-dimensional gene expression data and should be
   prioritized in similar analytical pipelines.

3.5. Application of DeepSurv for Survival Prediction and Exploration of
Significant Genes

   In this study, the DeepSurv approach was employed as a deep learning
   model for survival analysis based on gene expression in glioblastoma
   patients. This model is designed to capture complex non-linear
   relationships between gene expression and patient mortality risk—an
   aspect that classical statistical methods like the Cox proportional
   hazards model often fail to optimally model. The primary goal of
   applying DeepSurv is to improve survival prediction accuracy and
   identify the most influential genes associated with patient prognosis.
   The performance of DeepSurv was evaluated by testing various model
   training configurations, including variations in training data
   proportion, number of hidden units, activation functions, dropout
   rates, learning rates, number of epochs, and batch sizes. The
   evaluation results are summarized in [103]Table 4, which presents the
   c-index as the main metric for assessing survival prediction accuracy.
   The model trained with 90% of the data achieved the highest c-index
   value of 0.822, indicating excellent predictive performance. This value
   is significantly higher than that of the best ML-based Cox model
   (RF-RFE), which achieved a maximum c-index of around 0.725 using RF-RFE
   for feature selection and Cox proportional hazards models for survival
   prediction.

Table 4.

   Average c-index values on test data for DeepSurv models based on
   different training.
   Ratio of Training Data (%) C-Index Units Activation Dropout Learning
   Rate Epochs Batch Size
   60 0.677 16 tanh 0.1 10^−6 40 8
   70 0.737 64 tanh 0.2 10^−7 30 12
   80 0.733 64 relu 0 10^−7 40 16
   90 0.822 16 tanh 0.2 10^−5 50 16
   [104]Open in a new tab

   To further evaluate the robustness of our proposed DeepSurv model, we
   compared its performance with a previous study [[105]20], which used
   the same TCGA GBM dataset for survival prediction. As summarized in
   [106]Table 5, across different training data ratios (60%, 70%, 80%, and
   90%), our model consistently achieved higher average c-index values
   than the reference model. For instance, at 90% training data, our model
   reached a c-index of 0.822 compared to 0.639 reported by Feng et al.
   This improvement can be attributed to the comprehensive hyperparameter
   tuning and model architecture optimization conducted in our study,
   including activation function selection, dropout regularization, and
   learning rate adjustment. Moreover, the exploration of multiple
   configurations helped mitigate overfitting and enhanced
   generalizability. These results suggest that with appropriate tuning
   and feature selection, deep learning models such as DeepSurv can
   significantly outperform classical or previously published models even
   on the same dataset, highlighting their potential utility in precision
   oncology. As seen, larger training datasets tend to produce relatively
   better c-index values on the test data. Overall, the proposed model
   outperformed the reference model across all configurations, confirming
   its robustness and superior capacity to model survival outcomes from
   gene expression data in glioblastoma.

Table 5.

   Average c-index values of the proposed model and reference model using
   different amounts of training data.
   Ratio of Training Data (%) Proposed Model Reference Model [[107]20]
               60                 0.677                0.603
               70                 0.737                0.603
               80                 0.733                0.609
               90                 0.822                0.639
   [108]Open in a new tab

   Following model training, gene contributions to risk prediction was
   evaluated using a gradient-based method. From this, 10 key genes with
   the highest average scores were identified, indicating their major
   influence on glioblastoma patient survival predictions. These top genes
   are CMTR1, RPL23AP42, TSPYL1, [109]AC011287.1, RPL7L1P8, CCDC107,
   [110]AL354743.2, GMPR, PPY, and MT-TL1. To better summarize their
   biological functions and potential prognostic roles, the top genes
   identified by DeepSurv are presented in [111]Table 6.

Table 6.

   Summary of top prognostic genes identified by DeepSurv with associated
   biological functions.
   Gene Symbol Gene Type Known/Predicted Function Biological Relevance to
   GBM KM
   p-Value
   CMTR1 Protein-coding mRNA cap methylation, immune response Upregulated
   in cancer; promotes ribosomal gene expression and growth [[112]44]
   0.040
   RPL23AP42 and RPL7L1P8 Pseudogene Putative ceRNA, miRNA sponge,
   ribosomal pseudogene Function unclear; may act as ceRNAs affecting gene
   regulation, as seen in others pseudogene [[113]45] 0.037 and 0.027
   TSPYL1 Protein-coding Chromatin remodeling, transcription regulation
   Associated with neural development and tumor progression [[114]46]
   0.023
   [115]AC011287.1 and [116]AL354743.2 lncRNA Uncharacterized lncRNAs,
   potential epigenetic regulators While direct evidence in GBM is
   lacking, other lncRNAs have been shown to influence glioma development
   and gene regulation [[117]47] 0.029 and 0.020
   CCDC107 Protein-coding Coiled-coil domain Related CCDC family member
   implicated in glioma progression and cytoskeletal regulation [[118]48]
   0.029
   GMPR Enzyme Purine metabolism (GMP to IMP conversion) Key enzyme in
   purine metabolism [[119]49]; purine metabolism regulates DNA repair and
   therapy resistance in GBM [[120]50] 0.009
   PPY Peptide hormone Neuropeptide signaling NPY receptors and
   intratumoral neuropeptides active in GBM [[121]51] 0.006
   MT-TL1 Mitochondrial tRNA Mitochondrial protein synthesis Mitochondrial
   tRNA involved in energy metabolism; mtDNA alterations in GBM [[122]52]
   0.034
   [123]Open in a new tab

   Among the top prognostic genes identified, CMTR1 encodes a
   methyltransferase involved in mRNA cap methylation and immune
   regulation. Recent studies revealed that CMTR1 is overregulated in
   various cancers and promotes ribosomal protein gene expression, thereby
   enhancing tumor growth. RPL23AP42 and RPL7L1P8 are ribosomal
   pseudogenes with limited direct evidence in GBM. However, other
   ribosomal pseudogenes, such as RPL4P4, have been reported as prognostic
   markers and may act as competitive endogenous RNAs (ceRNAs),
   influencing gene expression regulation in gliomas. TSPYL1, a chromatin
   remodeling factor, is associated with neural development and may
   contribute to glioma progression, as identified in IDH1-associated
   tumor evolution studies. For [124]AC011287.1 and [125]AL354743.2,
   functional annotation is scarce, but they belong to the long non-coding
   RNA (lncRNA) class. Other lncRNAs, such as MEG3, have been shown to
   suppress glioma cell proliferation by modulating gene expression and
   chromatin states. These uncharacterized lncRNAs may play similar
   regulatory roles. CCDC107, though not yet directly studied in GBM,
   belongs to the coiled-coil domain-containing family. CCDC103, a related
   protein, has been implicated in glioma progression and cytoskeletal
   organization, suggesting that CCDC107 may also impact tumor cell
   motility or structure. GMPR, a key enzyme in purine metabolism,
   catalyzes the conversion of GMP to IMP. While its direct involvement in
   GBM is underexplored, purine metabolism has been shown to regulate DNA
   repair and drive therapy resistance in glioblastoma. PPY, a
   neuropeptide and member of the NPY family, is part of the broader
   neuroendocrine signaling system. NPY receptors and intertumoral
   neuropeptides are active in GBM, potentially influencing tumor
   behavior. MT-TL1, a mitochondrial tRNA for leucine, plays a critical
   role in mitochondrial protein synthesis. Alterations in mitochondrial
   DNA, including tRNA genes like MT-TL1, have been linked to metabolic
   reprogramming and mitochondrial dysfunction in GBM.

   These findings indicate that DeepSurv not only enhances survival
   prediction accuracy but also uncovers biologically relevant genes with
   potential as prognostic biomarkers or therapeutic targets.
   Nevertheless, experimental validation is essential to confirm their
   mechanistic roles in glioblastoma. Furthermore, challenges such as
   overfitting and interpretability must be addressed to enable clinical
   implementation of deep learning-based survival models.

3.6. Kaplan–Meier Survival Analysis of Key Genes

   To evaluate the clinical relevance of the top 10 prognostic genes
   identified through deep learning analysis, KM survival curves were
   generated for each gene. Patients were stratified into high- and
   low-expression groups (labeled UP and DOWN) based on the median
   expression threshold of each gene. Survival differences between the two
   groups were assessed using the log-rank test. As illustrated in
   [126]Figure 3, the KM plots reveal a statistically significant survival
   difference for all 10 genes, with log-rank test p-values ranging from
   0.0061 to 0.04. Notably, GMPR (p = 0.0094) and PPY (p = 0.0061) exhibit
   the most pronounced separation between high- and low-expression groups,
   suggesting strong prognostic potential.

Figure 3.

   [127]Figure 3
   [128]Open in a new tab

   Kaplan–Meier survival curves comparing overall survival between patient
   groups stratified by gene expression level (UP = high expression, DOWN
   = low expression) for the top prognostic genes: (a) CMTR1, (b)
   RPL23AP42, (c) TSPYL1, (d) [129]AC011287.1, (e) RPL7L1P8, (f) CCDC107,
   (g) [130]AL354743.2, (h) GMPR, (i) PPY, (j) MT-TL1.

   The number at risk at various time points is also shown in the plots,
   providing additional context for interpreting survival trends over
   time. These results indicate that patients with high expression (or low
   expression, depending on the gene) consistently have significantly
   different survival outcomes compared to those with opposite expression
   level. For example, elevated expressions of CMTR1, TSPYL1, or RPL23AP42
   are associated with poorer survival, aligning with their hypothesized
   roles in promoting tumor progression or interfering with gene
   regulation. The statistical significance of separation supports the
   relevance of these genes as individual prognostic biomarkers. From a
   clinical perspective, these genes could potentially be used to stratify
   glioblastoma patients into distinct risk categories, aiding
   personalized treatment planning. However, further investigation,
   including experimental validation and clinical trials, is essential
   before clinical implementation.

   While all top 10 genes showed statistically significant survival
   differences, the nature of the curve separation varied. For instance,
   although PPY demonstrated the lowest p-value (p = 0.0061), its survival
   curves remained nearly overlapping until after the median survival
   time, suggesting a delayed prognostic effect. In contrast, GMPR (p =
   0.0094) showed early and consistent divergence between high- and
   low-expression groups, indicating a more robust and clinically
   meaningful impact on patient prognosis. These distinctions highlight
   the importance of interpreting survival curves beyond p-values alone,
   considering both the magnitude and timing of risk stratification, which
   may influence their utility in future clinical applications.

3.7. Functional Network and Pathway Analysis of Key Genes

   To further understand the biological functions and molecular
   interactions of the top 10 genes identified through machine learning
   and deep learning models, a network-based functional analysis was
   conducted. This analysis integrated diverse biological data sources,
   including co-expression, physical and genetic interactions, shared
   pathways, co-localization, and shared protein domains. The network was
   constructed for Homo sapiens, with the weighting based on biological
   process relevance to ensure a functionally meaningful arrangement. The
   resulting protein–protein interaction (PPI) network, shown in
   [131]Figure 4, illustrates how the selected genes are embedded within a
   broader biological framework. In this network, the central 10 genes are
   enriched with additional interacting partners predicted.

Figure 4.

   [132]Figure 4
   [133]Open in a new tab

   Protein–protein interaction (PPI) network highlighting hub genes among
   the top-ranked prognostic candidates and their interacting partners.

   The lines connecting the genes are color-coded to indicate the nature
   of their interactions; co-expression is depicted in light purple,
   physical interactions in pink, predicted interactions in orange,
   co-localization in green, shared pathways in blue, and shared protein
   domains in gray. These edge types reflect distinct modes of functional
   relationships; co-expression suggests genes are transcriptionally
   coordinated, physical interactions imply direct protein binding,
   predicted interactions are inferred from orthologous data, and
   co-localization indicates shared subcellular compartments. Shared
   pathways and shared domains reflect common biological roles or
   structural motifs. These visual indicators provide a comprehensive
   representation of the complex relationships among the genes.

   The analysis reveals that many of the selected genes share strong
   co-expression and pathway-based associations, suggesting that they may
   participate in related biological processes relevant to glioblastoma
   progression. Notably, several genes such as GMPR emerge as key hubs
   within the network, connecting multiple genes primarily involved in
   purine metabolism. GMPR acts as a central node linking metabolic and
   nucleotide regulatory elements crucial for maintaining the
   proliferative state of cancer cells. CMTR1, known for its involvement
   in RNA processing and immune response modulation, shows limited but
   specific interactions, notably with HPRT1, suggesting a specialized
   regulatory role rather than widespread connectivity. Notably, CMTR1 is
   not connected with TSPYL1 in the network. TSPYL1 exhibits limited but
   distinct associations, particularly with HPRT1 and NPY5R, reflecting
   its involvement in regulatory pathways such as chromatin remodeling and
   cell cycle control. In this analysis, TSPYL1 is not connected with PPY
   or CMTR1, suggesting distinct functional roles in the network. The
   neuropeptide-related genes, including PPY and members of the NPY family
   (e.g., NPY, NPY1R, NPY2R, NPY4R, NPY5R, PYY), form a distinct cluster
   characterized by strong co-expression and shared pathway interactions,
   indicating a potential role in neuroendocrine signaling relevant to
   glioblastoma. The network also includes CCDC107, which appears isolated
   with no direct interactions, implying that its role in glioblastoma
   remains unclear but may still be worth investigating given its
   potential involvement in cytoskeletal regulation. Overall, this
   functional network analysis emphasizes the interconnectedness of the
   selected prognostic genes and highlights their collective involvement
   in pathways that could be critical to glioblastoma survival outcomes.
   These findings provide not only a biological rationale for their
   selection as predictive biomarkers but also lay the groundwork for
   future investigations into targeted therapeutic strategies.

4. Discussion

   This study demonstrates the effectiveness of both machine learning and
   deep learning methods in identifying prognostic biomarkers for GBM
   using RNA-Seq gene expression data. GBM remains one of the most lethal
   brain tumors, with limited improvement in patient outcomes over the
   last decades. Identifying molecular features that can reliably predict
   survival is crucial for improving risk stratification and guiding
   treatment decisions. The integration of computational approaches
   enables high-throughput and unbiased screening of potential biomarkers
   across the genome.

   The univariate Cox regression analysis identified 694
   survival-associated genes, which served as a foundation for building
   multivariate models. Among the ML-based approaches, the Cox model using
   random forest–recursive feature elimination (RF-RFE) with 90% training
   data achieved a concordance index (C-index) of 0.725, suggesting
   moderate predictive performance. These ML models offer interpretability
   and are less computationally intensive, making them attractive for
   practical applications. However, they may not fully capture complex
   interactions in high-dimensional data such as gene expression profiles.

   In contrast, the DeepSurv model, a deep learning extension of the Cox
   proportional hazards model, demonstrated superior predictive power,
   achieving a C-index of 0.822. This underscores the strength of DL
   approaches in modeling non-linear and intricate relationships that are
   often missed by traditional or ML-based survival models. While the
   higher performance of DeepSurv is promising, it also introduces
   challenges in interpretability and requires more computational
   resources and expertise.

   The top 10 genes identified by DeepSurv—CMTR1, RPL23AP42, TSPYL1,
   [134]AC011287.1, RPL7L1P8, CCDC107, [135]AL354743.2, GMPR, PPY, and
   MT-TL1—were further analyzed using Kaplan–Meier survival curves, which
   showed significant separation in survival outcomes between high- and
   low-risk groups. Several of these genes, such as CMTR1 and GMPR, have
   known roles in RNA modification and purine metabolism, respectively,
   processes that are commonly dysregulated in cancer, particularly in
   aggressive subtypes like GBM. The MT-TL1 gene is part of the
   mitochondrial genome and has been linked to metabolic regulation and
   energy production in cancer cells.

   To further explore the biological relevance of these genes, GeneMANIA
   was employed to construct gene interaction networks and perform pathway
   enrichment analysis. The analysis revealed functional associations with
   RNA processing, neuroendocrine signaling, cell cycle regulation, and
   chromatin modification, all of which are critical in GBM pathogenesis.
   These findings are consistent with previous reports highlighting the
   role of transcriptional dysregulation in GBM progression, particularly
   as revealed through RNA-Seq–based gene expression profiling. The
   inclusion of lncRNAs and pseudogenes among the top-ranked features also
   suggests the involvement of non-coding regulatory mechanisms in
   gliomagenesis, an area that warrants further exploration.

   To further substantiate the relevance of the top-ranked genes, we
   investigated existing literature to determine whether these genes have
   been previously associated with prognosis in other types of cancer.
   Several of them have demonstrated potential prognostic or functional
   roles across various tumor types. A summary of the prognostic relevance
   of the top 10 genes based on published studies is provided in
   [136]Table 7, which strengthens the biological plausibility of their
   involvement in GBM outcomes and supports their consideration as
   candidate biomarkers for future validation.

Table 7.

   Prognostic relevance of the top 10 genes in other cancer types based on
   the existing literature.
   Gene Symbol Full Name Cancer Type Reported Prognostic Association
   References