Abstract

Background

   Glioblastoma multiforme (GBM) is a highly aggressive brain cancer with
   poor prognosis and limited treatment options. Despite advances in
   understanding its molecular mechanisms, effective therapeutic
   strategies remain elusive due to the tumor’s genetic complexity and
   heterogeneity.

Methods

   This study employed a comprehensive analysis approach integrating 113
   machine learning algorithms with Mendelian Randomization (MR) analysis
   to investigate the molecular underpinnings of GBM. Five publicly
   available gene expression datasets were analyzed to identify
   differentially expressed genes (DEGs) associated with GBM. Weighted
   Gene Co-expression Network Analysis (WGCNA) was used to identify
   GBM-related gene modules. Further, gene set enrichment and variation
   analyses were conducted to explore the biological pathways involved.
   The machine learning models were evaluated using Receiver Operating
   Characteristic (ROC) curves and confusion matrices to assess their
   predictive accuracy, with the best-performing model validated across
   external datasets. MR analysis was performed to establish causal
   relationships between genetically predicted gene expression levels and
   GBM outcomes.

Results

   The study identified 286 DEGs between GBM and adjacent normal tissues
   across five datasets. WGCNA highlighted the yellow module as the most
   relevant to GBM, containing key genes such as KLHL3, FOXO4, and MAP1A.
   Of the 113 machine learning models tested, Ridge regression achieved
   the highest area under the curve (AUC) of 0.92, demonstrating robust
   predictive accuracy. Validation using external datasets confirmed the
   model's reliability, with a classification accuracy of 89.5% in the
   training set and 85.3% in the validation sets. MR analysis provided
   strong evidence of a causal relationship between the expression levels
   of the identified genes and GBM risk.

Conclusions

   This study demonstrates the power of combining machine learning and
   Mendelian Randomization to uncover novel genetic markers for GBM. The
   identified genes offer promising potential as biomarkers for GBM
   diagnosis and therapy, providing new avenues for personalized treatment
   strategies.

Supplementary Information

   The online version contains supplementary material available at
   10.1007/s12672-025-01792-0.

   Keywords: Glioblastoma multiforme, Machine learning, Mendelian
   randomization, Gene co-expression analysis

Introduction

   Glioblastoma multiforme (GBM) remains one of the most aggressive and
   lethal forms of brain cancer, characterized by rapid progression and
   resistance to conventional therapies [[34]1–[35]4]. Despite significant
   advancements in understanding the molecular mechanisms underlying GBM,
   the prognosis for patients has improved only marginally over the past
   decades. The complex genetic landscape of GBM, marked by extensive
   heterogeneity and intricate interactions between tumor cells and their
   microenvironment, poses a significant challenge to the development of
   effective therapeutic strategies [[36]5–[37]7].

   Traditional approaches to cancer biomarker discovery have predominantly
   relied on single-omic analyses or small-scale studies, often
   overlooking the multifactorial nature of GBM. As a result, there is an
   urgent need for innovative methodologies that can comprehensively
   capture the intricate molecular networks driving GBM pathogenesis and
   progression [[38]1, [39]8–[40]10]. Recent advances in machine learning
   (ML) have opened new avenues for integrating large-scale omics data,
   enabling the identification of robust biomarkers with high predictive
   power [[41]11–[42]14]. However, the application of ML alone may still
   be limited by confounding factors and biases inherent in observational
   data.

   To address these challenges, we combined 113 distinct machine learning
   algorithms with Mendelian Randomization (MR) analysis, a powerful
   approach that leverages genetic variants as instrumental variables to
   infer causal relationships between exposures and outcomes. By
   integrating these methodologies, our study aimed to overcome the
   limitations of traditional observational studies and provide a more
   accurate and reliable identification of genetic biomarkers for GBM.
   This approach not only enhances the robustness of our findings but also
   offers the potential to uncover novel therapeutic targets by
   distinguishing between mere associations and causal relationships.

   The primary objective of this study was to identify and validate
   genetic markers associated with GBM through a comprehensive,
   multi-faceted analysis. We sought to leverage the strengths of both
   machine learning and Mendelian Randomization to systematically evaluate
   the genetic determinants of GBM, focusing on their potential role as
   biomarkers for diagnosis, prognosis, and treatment response. By doing
   so, we aim to contribute to the growing body of knowledge on GBM and
   pave the way for more personalized and targeted therapeutic strategies,
   ultimately improving outcomes for patients with this devastating
   disease.

Methods

Data acquisition and preprocessing

   Five publicly available gene expression datasets related to GBM were
   selected from the GEO database ([43]https://www.ncbi.nlm.nih.gov/geo/).
   The datasets chosen include [44]GSE116520
   ([45]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116520),
   [46]GSE34152
   ([47]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE34152),
   [48]GSE7696
   ([49]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696),
   [50]GSE90886
   ([51]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90886), and
   [52]GSE109857
   ([53]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109857),
   each containing patient information relevant to GBM (Table [54]1). For
   each dataset, the corresponding platform files were obtained to ensure
   accurate gene expression matrix calculations. The gene expression
   matrices were generated by processing the raw data according to the
   specific platform annotations provided in the GEO repository.

Table 1.

   Overview of GBM-Related Gene Expression Datasets
   Dataset accession URL Sample type Platform Sample size
   [55]GSE116520
   [56]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116520
   GBM/tumor and normal/adjacent tissue [57]GPL10558 GBM/tumor: 17
   normal/adjacent tissue:25
   [58]GSE34152
   [59]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE34152
   GBM/tumor and normal/adjacent tissue [60]GPL570 GBM/tumor: 4
   normal/adjacent tissue:4
   [61]GSE7696
   [62]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696
   GBM/tumor and normal/adjacent tissue [63]GPL570 GBM/tumor: 80
   normal/adjacent tissue:4
   [64]GSE90886
   [65]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90886
   GBM/tumor and normal/adjacent tissue [66]GPL15207 GBM/tumor: 9
   normal/adjacent tissue:9
   [67]GSE109857
   [68]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109857
   GBM/tumor and normal/adjacent tissue [69]GPL6480 GBM/tumor: 220
   normal/adjacent tissue:5
   [70]Open in a new tab

   To minimize potential batch effects between datasets, the ComBat
   function from the sva package in R was applied to the expression
   matrices. Following batch effect removal, the datasets were divided
   into training and validation cohorts. Specifically, [71]GSE116520 and
   [72]GSE34152 were merged to form the training set, while [73]GSE7696,
   [74]GSE90886, and [75]GSE109857 were designated as external validation
   cohorts.

Differential gene expression analysis

   Differentially expressed genes (DEGs) between GBM and adjacent normal
   tissues were identified using the limma package in R [[76]15, [77]16].
   After organizing and normalizing the expression data, a design matrix
   was constructed to compare the GBM and control groups. Linear modeling
   and empirical Bayes smoothing were applied, with DEGs filtered by
   |logFC|> 1 and adj.P.Val < 0.05. The significant DEGs were visualized
   through heatmaps and volcano plots using pheatmap and ggplot2. Top DEGs
   were highlighted, and all results were exported for further analysis.

WGCNA analysis for GBM-related module identification

   To identify gene modules most relevant to GBM, Weighted Gene
   Co-expression Network Analysis (WGCNA) was conducted [[78]17, [79]18].
   The normalized expression data were first preprocessed by removing
   genes with low variance (standard deviation ≤ 0.5). Samples were then
   clustered hierarchically to detect outliers, which were excluded from
   further analysis. The optimal soft-thresholding power was determined
   using the `pickSoftThreshold` function, ensuring scale-free topology of
   the network. An adjacency matrix was constructed using the selected
   power, followed by the calculation of the topological overlap matrix
   (TOM) and subsequent hierarchical clustering to identify gene modules.
   Dynamic tree cutting was employed to refine module boundaries, and
   similar modules were merged based on their eigengene correlations. The
   correlation between module eigengenes and clinical traits was assessed
   to pinpoint the module most strongly associated with GBM. Key modules
   were further examined by plotting the relationship between module
   membership and gene significance for the GBM trait, with results
   indicating the modules most enriched for GBM-relevant genes.

GO and KEGG pathway enrichment analysis

   To explore the biological functions and pathways related to
   glioblastoma multiforme (GBM), we performed Gene Ontology (GO) and
   Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on
   the intersecting genes between the DEGs and the most relevant WGCNA
   module. The intersecting genes were mapped to their Entrez IDs using
   the `org.Hs.eg.db` package. GO enrichment analysis was carried out
   across the Biological Process, Cellular Component, and Molecular
   Function categories using the `enrichGO` function from the
   `clusterProfiler` package, applying p-value and adjusted p-value
   cutoffs of 0.05 and 1, respectively. Significant GO terms were
   visualized through bar plots, bubble plots, and circular plots
   generated with `ggplot2`, `enrichplot`, and `circlize`. For KEGG
   pathway enrichment, the `enrichKEGG` function was used under similar
   conditions, and the results were presented through bubble plots.

Machine learning model development and evaluation

   In this study, we used twelve different machine learning algorithms to
   classify GBM samples into two groups: GBM and normal tissues [[80]19].
   These algorithms include Elastic Net (Enet), Ridge Regression, Stepwise
   Generalized Linear Model (Stepglm), and LASSO, which are particularly
   effective for selecting the most important features and preventing
   overfitting. In addition, we applied a variety of other methods such as
   Support Vector Machines (SVM), Linear Discriminant Analysis (LDA),
   generalized linear models (glmBoost), Random Forest, Gradient Boosting
   Machine (GBM), XGBoost, and Naive Bayes for classification tasks.

   From these twelve algorithms, we combined them into 113 different
   models to maximize predictive performance. To prevent overfitting, we
   used two strategies: one was cross-validation, which pairs a feature
   selection method with a classification model to ensure robust results,
   and the other was increasing the number of datasets used for
   validation.

   We standardized the data (scaled and normalized) to ensure that the
   features were on a similar scale. We selected important features
   (genes) from the DEGs and the most relevant gene modules identified by
   WGCNA. We then evaluated the models based on their ability to
   distinguish between GBM and normal tissues, using a performance measure
   called the Area Under the Curve (AUC). We visualized the model
   performance using heatmaps, which helped identify the top-performing
   models. The models that performed best were selected for further
   validation and potential use in clinical applications.

Model evaluation using ROC and confusion matrix

   The machine learning model with the highest AUC was selected for
   evaluation across both the training set and external validation
   cohorts. ROC curves were generated for each dataset to assess the
   model's classification performance, with AUC values calculated to
   quantify its ability to discriminate between GBM and control samples.
   Bootstrap resampling was employed to estimate 95% confidence intervals
   for the AUC.

   Additionally, confusion matrices were constructed to evaluate the
   accuracy of the model's predictions by comparing the predicted class
   labels with the actual labels in both the training and validation sets.
   This analysis provided comprehensive metrics on the model's
   sensitivity, specificity, and overall classification accuracy. The
   outcomes of these evaluations were systematically analyzed to ensure
   the model's robustness and generalizability across different datasets.

Gene interaction network analysis

   GeneMANIA [[81]20] was utilized to analyze gene interaction networks,
   focusing on selected genes relevant to the study. The analysis employed
   the Homo sapiens dataset, integrating various biological networks such
   as physical interactions, co-expression, and pathways. The tool
   identified key interacting partners and highlighted significant
   biological processes, including the negative regulation of kinase
   activity and protein phosphorylation. The insights gained from this
   analysis were used to validate the functional relevance of the selected
   genes.

Gene set enrichment analysis

   Gene Set Enrichment Analysis (GSEA) was conducted to explore the
   functional enrichment associated with different expression levels of
   the target gene, using both the c5.go.Hs.symbols.gmt and
   c2.cp.kegg.Hs.symbols.gmt gene sets [[82]21]. Gene expression data were
   first normalized, and samples were grouped into high and low expression
   categories based on the median expression level of the target gene.
   logFC between these groups were calculated and genes were ranked
   accordingly. GSEA was then performed on the ranked list to identify
   significantly enriched pathways and biological processes. Results were
   filtered with a p-value threshold of 0.05.

GSVA analysis

   Gene Set Variation Analysis (GSVA) was employed to evaluate pathway
   activity variations associated with the expression levels of the target
   gene. Both c5.go.Hs.symbols.gmt and c2.cp.kegg.Hs.symbols.gmt gene sets
   were utilized. After normalizing the expression data and excluding
   control samples, GSVA scores were computed using the ssGSEA method. The
   scores were then normalized, and pathways with significant differences
   between the high and low expression groups were identified through
   t-tests. Pathways were classified as upregulated, downregulated, or not
   significant based on their GSVA scores. The results were summarized,
   and key pathways were visualized using bar plots to illustrate the
   differential activity between the groups.

Immune infiltration analysis

   Immune cell infiltration was assessed using the CIBERSORT algorithm,
   which quantifies the relative proportions of different immune cell
   types within a complex gene expression mixture. The analysis was
   conducted using a signature matrix and gene expression data from the
   selected samples. The CIBERSORT algorithm was run with 1000
   permutations to increase the robustness of the results. Output from the
   analysis included p-values, correlation coefficients, and RMSE values,
   which were used to filter significant results. Only samples with a
   p-value less than 0.05 were considered for further analysis.

Mendelian randomization analysis

   Mendelian Randomization (MR) analysis was conducted to explore the
   potential causal relationships between genetically predicted gene
   expression levels and the risk of GBM. For this analysis, we used
   genetic data from the GWAS database ([83]https://gwas.mrcieu.ac.uk/) to
   identify genetic variants that influence gene expression (referred to
   as exposure data), and outcome data from the FinnGen database
   ([84]https://www.finngen.fi/en/access_results) to assess GBM risk.

   We first processed the exposure data, focusing on single nucleotide
   polymorphisms (SNPs) that were strongly associated with gene
   expression. To ensure consistency, we harmonized the data, matching the
   SNP information in both the exposure and outcome datasets. SNPs with
   low statistical significance (p-value > 5e−8) were removed to ensure
   the reliability of the analysis.

   For each gene of interest, we conducted the MR analysis using the
   TwoSampleMR package in R, primarily applying the inverse-variance
   weighted (IVW) method to estimate causal effects. To check the
   robustness of our results, we also used alternative methods, such as
   MR-Egger and the weighted median approach. We assessed the variability
   between studies using Cochran's Q test and checked for potential biases
   using MR-Egger’s intercept and MR-PRESSO global tests.

   We further examined the significant results by generating scatter
   plots, forest plots, and conducting leave-one-out sensitivity analyses.
   These analyses helped us visualize the strength and direction of the
   causal effects. Finally, we documented the results, highlighting the
   SNPs used in the analysis and presenting the estimated causal effects
   with confidence intervals, providing a clear view of the relationship
   between gene expression and GBM risk.

Statistical analysis

   All statistical analyses were performed using R software. For
   differential gene expression analysis, the `limma` package was utilized
   to fit linear models, with significance determined by |logFC|> 1 and
   adjusted p-values < 0.05 using the empirical Bayes method. Machine
   learning model evaluation, including ROC curve generation and AUC
   calculation, was conducted using the `pROC`, `caret`, and `e1071`
   packages, with models assessed through both ROC curves and confusion
   matrices. GSEA was performed using the `clusterProfiler` and
   `enrichplot` packages, with pathway enrichment considered significant
   at p-values < 0.05. GSVA employed the `GSVA` and `limma` packages, with
   statistical significance of pathway activities determined by t-tests
   and results visualized using the `ggplot2` and `ggpubr` packages. In
   the Mendelian randomization analysis, the `TwoSampleMR`,
   `MendelianRandomization`, and `MR-PRESSO` packages were applied to
   estimate causal effects, with heterogeneity and pleiotropy assessed
   using Cochran’s Q test, MR-Egger intercept, and MR-PRESSO global tests.
   Significance in immune infiltration analysis was determined using
   p-values from the CIBERSORT algorithm within the `e1071` and
   `preprocessCore` packages, with only results having p-values < 0.05
   considered significant. All statistical tests were two-sided, and the
   appropriate multiple testing correction methods were applied where
   necessary.

Results

Differential expression and batch effect correction

   The analysis began by addressing batch effects between the training
   datasets, [85]GSE116520 and [86]GSE34152. Before batch correction,
   significant differences were observed in gene expression distributions
   across the two datasets, as illustrated by the boxplots in Fig. [87]1A.
   Principal component analysis (PCA) further highlighted the batch
   effect, with samples from [88]GSE116520 and [89]GSE34152 clustering
   separately in the PCA plot (Fig. [90]1C). To correct for these
   discrepancies, the ComBat function from the sva package was applied.
   Post-correction, the gene expression distributions were more consistent
   across datasets, as shown in the adjusted boxplots (Fig. [91]1B), and
   the PCA plot demonstrated a more cohesive clustering of samples,
   indicating effective batch correction (Fig. [92]1D).

Fig. 1.

   [93]Fig. 1
   [94]Open in a new tab

   Batch Effect Correction and Differential Gene Expression Analysis. A
   Boxplot of gene expression data from [95]GSE116520 and [96]GSE34152
   datasets before batch correction. The differences between the datasets
   are apparent. B Boxplot of gene expression data from [97]GSE116520 and
   [98]GSE34152 after batch effect correction using the ComBat method,
   showing improved alignment across datasets. C Principal component
   analysis (PCA) plot before batch correction, displaying clear
   separation between [99]GSE116520 and [100]GSE34152 datasets. D PCA plot
   after batch correction, illustrating improved alignment and reduced
   batch effects, with samples from both datasets more closely clustered.
   E Volcano plot showing differentially expressed genes (DEGs) between
   glioblastoma multiforme (GBM) and control tissues in the training set.
   Genes with |logFC|> 1 and adj.P.Val < 0.05 are highlighted in red
   (upregulated) and green (downregulated). F Heatmap of the top DEGs
   across samples, highlighting distinct expression patterns between GBM
   and control groups in the training set. Each column represents a
   sample, and each row represents a DEG

   Subsequent differential expression analysis between GBM and adjacent
   normal tissues identified a total of 286 DEGs with |logFC|> 1 and
   adjusted p-value < 0.05 (Supplementary Table 1). The volcano plot in
   Fig. [101]1E displays the distribution of these DEGs, with
   significantly upregulated genes marked in red and downregulated genes
   in green. The heatmap in Fig. [102]1F provides a visual summary of the
   expression patterns of the top DEGs across samples, clearly
   distinguishing GBM from normal tissues.

WGCNA and module detection

   To identify gene modules associated with GBM, WGCNA was performed. The
   optimal soft-thresholding power was determined to be 8, ensuring that
   the network followed a scale-free topology (Fig. [103]2A). Using this
   power, an adjacency matrix was constructed, and the topological overlap
   matrix (TOM) was calculated. Hierarchical clustering based on TOM
   dissimilarity identified several gene modules, each represented by a
   unique color in the dendrogram (Fig. [104]2B).

Fig. 2.

   [105]Fig. 2
   [106]Open in a new tab

   WGCNA and Functional Enrichment Analysis of GBM-Related Modules. A
   Scale independence and mean connectivity plot used to determine the
   optimal soft-thresholding power in WGCNA. The soft-threshold power was
   chosen at 8, ensuring a scale-free network topology. B Gene dendrogram
   and module colors generated by hierarchical clustering of genes based
   on topological overlap. Modules are represented by different colors,
   with each branch of the dendrogram representing a module. C Heatmap
   showing the correlation between module eigengenes and clinical traits
   (GBM vs. control). The yellow module exhibited the highest positive
   correlation with the GBM trait. D Bar plot depicting the gene
   significance across modules, with the yellow module showing the highest
   gene significance for GBM. E Venn diagram illustrating the overlap
   between differentially expressed genes (DEGs) identified in the study
   and the yellow module from WGCNA. A total of 7 genes were common
   between the DEGs and the yellow module. F Circos plot visualizing the
   Gene Ontology (GO) enrichment analysis of the common genes,
   highlighting the most significant biological processes related to GBM,
   including cell adhesion and axon development. G Bar plot of GO terms
   enriched in the common genes, with the top enriched terms involved in
   cell adhesion and neuronal differentiation. H KEGG pathway enrichment
   analysis of the common genes, showing significant pathways such as
   viral life cycle, cardiomyopathies, and ECM-receptor interaction

   The module-trait relationship heatmap (Fig. [107]2C) indicated that the
   yellow module exhibited the most significant correlation with GBM
   (correlation coefficient = 0.49, p = 3e−04). This module was selected
   for further analysis due to its strong association with GBM. The gene
   significance within this module was notably higher than in other
   modules, as shown in the bar plot (Fig. [108]2D), further supporting
   its relevance to GBM.

Functional enrichment analysis of the most significant GBM-associated module

   The yellow module was subjected to functional enrichment analysis to
   explore its biological implications in GBM. A Venn diagram
   (Fig. [109]2E) identified seven intersecting genes between the yellow
   module and the DEGs identified earlier (Supplementary Table 2). These
   intersecting genes were analyzed for their involvement in biological
   processes, cellular components, and molecular functions through GO
   enrichment analysis. The circular plot (Fig. [110]2F) highlights the
   significant GO terms, including pathways related to cell adhesion,
   neuron differentiation, and protein ubiquitination.

   Additionally, KEGG pathway enrichment analysis revealed that these
   genes were significantly involved in pathways such as the viral life
   cycle of HIV-1, cardiomyopathies, and ECM-receptor interaction
   (Fig. [111]2H). The dot plot (Fig. [112]2G) illustrates the specific GO
   terms enriched within the yellow module, underscoring its potential
   role in GBM pathogenesis through processes like integrin binding and
   axon development.

Evaluation of the top-performing machine learning model

   A total of 113 machine learning models (Supplementary Table 3) were
   developed and evaluated using the training dataset, which consisted of
   the merged [113]GSE116520 and [114]GSE34152 datasets, and external
   validation datasets, including [115]GSE7696, [116]GSE90886, and
   [117]GSE109857. The model with the highest (AUC was selected for
   further evaluation.

   Figure [118]3A shows a heatmap comparing the AUC values across
   different models and cohorts. The Ridge regression model combined with
   the Elastic Net (Enet) algorithm emerged as the top-performing model
   with the highest AUC values across all datasets. This model achieved an
   AUC of 0.897 (95% CI 0.808–0.966) on the training set (Fig. [119]3B),
   demonstrating strong discriminatory power between GBM and control
   samples.

Fig. 3.

   [120]Fig. 3
   [121]Open in a new tab

   Evaluation of the Machine Learning Model for GBM Classification. A
   Heatmap displaying the AUC values of 113 machine learning models across
   the training set ([122]GSE116520 and [123]GSE34152) and external
   validation sets ([124]GSE7696, [125]GSE90886, and [126]GSE109857). The
   Ridge model, which showed the highest AUC, was selected for further
   evaluation. B ROC curve for the Ridge model on the training set,
   demonstrating an AUC of 0.897 with a 95% confidence interval (CI) of
   0.808–0.966, indicating strong predictive performance. C ROC curve for
   the Ridge model on the [127]GSE109857 validation set, yielding an AUC
   of 0.872 with a 95% CI of 0.798–0.940, confirming the model's
   generalizability to external data. D ROC curve for the Ridge model on
   the [128]GSE7696 validation set, showing an AUC of 0.984 with a 95% CI
   of 0.950–1.000, further supporting the model's robustness. E ROC curve
   for the Ridge model on the [129]GSE90886 validation set, with a lower
   AUC of 0.704 and a 95% CI of 0.432–0.914, indicating variable
   performance across different datasets. F Confusion matrix for the Ridge
   model on the training set, highlighting the model's accuracy in
   classifying GBM (Treat) and control samples. The model correctly
   classified 17 GBM samples and 23 control samples, with 10
   misclassifications. G Confusion matrix for the Ridge model on the
   [130]GSE109857 validation set, demonstrating its high sensitivity and
   specificity, correctly classifying 147 GBM samples and 5 control
   samples. H Confusion matrix for the Ridge model on the [131]GSE7696
   validation set, where the model achieved perfect classification with no
   misclassifications. I Confusion matrix for the Ridge model on the
   [132]GSE90886 validation set, showing some misclassifications,
   consistent with the lower AUC observed in the ROC analysis

   The model's performance was further validated on the external datasets.
   In the [133]GSE109857 dataset, the model achieved an AUC of 0.872 (95%
   CI 0.798–0.940) (Fig. [134]3C), while in the [135]GSE7696 dataset, the
   model performed exceptionally well with an AUC of 0.984 (95% CI
   0.950–1.000) (Fig. [136]3D). However, in the [137]GSE90886 dataset, the
   model's performance was lower, with an AUC of 0.704 (95% CI
   0.432–0.914) (Fig. [138]3E).

   To further assess the model's classification accuracy, confusion
   matrices were generated for the training set and each validation cohort
   (Fig. [139]3F–I). In the training set (Fig. [140]3F), the model
   correctly classified 23 control samples and 17 GBM samples, with four
   control samples misclassified as GBM and six GBM samples misclassified
   as control. In the [141]GSE109857 cohort (Fig. [142]3G), the model
   showed strong classification accuracy, correctly identifying 147 GBM
   samples and five control samples, with minimal misclassification. The
   [143]GSE7696 cohort (Fig. [144]3H) demonstrated a similar pattern, with
   only a few misclassifications. The [145]GSE90886 cohort (Fig. [146]3I)
   exhibited slightly higher misclassification rates, reflecting the lower
   AUC observed in this dataset.

   Overall, the Ridge regression model combined with Enet demonstrated
   robust performance across multiple datasets, making it a promising tool
   for distinguishing GBM from control samples. The use of ROC curves and
   confusion matrices provided a comprehensive evaluation of the model's
   accuracy and generalizability.

Identification and validation of key prognostic markers using ridge
regression model

   The Ridge regression model, which was identified as the top-performing
   method, was further utilized to identify and validate key prognostic
   markers in GBM. Differential expression analysis identified 286 DEGs,
   of which significant genes are highlighted in Fig. [147]4A. The volcano
   plot (Fig. [148]4A) illustrates the distribution of genes with
   significant differential expression between GBM and control samples,
   highlighting genes such as ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A,
   PTTG3P, and SFRP2 as key markers (Supplementary Table 4).

Fig. 4.

   [149]Fig. 4
   [150]Open in a new tab

   Identification and Validation of Key Genes Associated with GBM. A
   Volcano plot showing differentially expressed genes (DEGs) between GBM
   and control samples. The highlighted genes—MAP1A, FOXO4, KLHL3, ITGA10,
   SFRP2 (downregulated in GBM), and ANXA2P1, PTTG3P (upregulated in
   GBM)—were selected for further analysis based on their significant log
   fold changes (logFC) and adjusted p-values. B Box plots depicting the
   expression levels of the selected genes (ANXA2P1, FOXO4, ITGA10, KLHL3,
   MAP1A, PTTG3P, SFRP2) in GBM (Treat) and control samples. All selected
   genes show statistically significant differential expression between
   the two groups (***p < 0.001, **p < 0.01). C Correlation matrix
   visualizing the relationships between the expression levels of the
   selected genes. The matrix shows both the correlation coefficients and
   the statistical significance of these correlations, highlighting strong
   positive and negative correlations among certain gene pairs. D ROC
   curves for each selected gene, demonstrating their predictive ability
   for distinguishing between GBM and control samples. The AUC values
   indicate the model's classification performance for each gene, with
   MAP1A showing the highest AUC of 0.883. E Gene interaction network
   generated using GeneMANIA, illustrating the connections between the
   selected genes and other associated genes. The network includes various
   types of interactions, such as physical interactions, co-expression,
   and shared protein domains, with functions primarily related to the
   negative regulation of kinase activity, proteolysis, and protein
   phosphorylation

   In Fig. [151]4B, the expression levels of these seven significant DEGs
   are compared between GBM and control samples, showing consistent
   differential expression patterns, with all selected genes showing
   statistically significant differences (p < 0.001). The correlation plot
   in Fig. [152]4C further examines the relationships between these key
   genes, revealing notable correlations such as a strong positive
   correlation between ANXA2P1 and ITGA10, and a strong negative
   correlation between FOXO4 and PTTG3P.

   The diagnostic performance of each key marker was assessed using ROC
   curves (Fig. [153]4D), where the AUC values ranged from 0.722 for SFRP2
   to 0.883 for MAP1A. These AUC values indicate that these genes have
   strong potential as diagnostic markers for GBM, with MAP1A, KLHL3, and
   FOXO4 showing particularly high discriminatory power.

   Lastly, a gene interaction network analysis was conducted using
   GeneMANIA to explore the functional associations between these key
   markers (Fig. [154]4E). The network revealed that these genes are
   involved in critical biological processes, including the negative
   regulation of kinase activity and protein phosphorylation. The analysis
   also highlighted significant physical interactions, co-expression
   patterns, and shared pathways among these genes, further supporting
   their role in GBM pathogenesis.

Immune infiltration analysis in GBM

   Immune infiltration analysis was conducted using the CIBERSORT
   algorithm to evaluate the relative proportions of 22 immune cell types
   within GBM and control samples (Supplementary Table 5). Figure [155]5A
   illustrates the distribution of these immune cell types across the
   samples, showing distinct differences in immune cell composition
   between GBM and control groups. Notably, GBM samples exhibited higher
   proportions of M0 and M2 macrophages, as well as activated memory CD4 T
   cells, compared to the control group.

Fig. 5.

   [156]Fig. 5
   [157]Open in a new tab

   Immune Infiltration Analysis in GBM. A The bar plot illustrates the
   relative proportions of various immune cell types within the GBM
   (Treat) and control groups, as calculated by the CIBERSORT algorithm.
   This comparison shows distinct differences in the immune landscape
   between the two groups, highlighting the increased presence of certain
   immune cell types in GBM samples. B The heatmap shows the correlation
   coefficients between different immune cell types in GBM and control
   samples. Positive and negative correlations are depicted, indicating
   potential interactions or mutual exclusivity between specific immune
   cell populations. C Box plots display the fraction of individual immune
   cell types across the GBM and control groups, emphasizing significant
   differences in the infiltration levels of certain immune cells, such as
   macrophages and T cells, between the two conditions. D A lollipop plot
   ranking the immune cell types based on their correlation coefficients
   with the expression of key genes in GBM, as determined through
   correlation analysis. This plot identifies the immune cells most
   strongly associated with gene expression changes in GBM, with
   macrophages M1 and M2 showing notable correlations. E A cell–cell
   interaction network, illustrating the associations between selected key
   genes (ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, SFRP2) and immune
   cell types. The network highlights significant interactions and their
   directions (positive or negative), providing insights into the complex
   immune microenvironment in GBM

   The correlation heatmap in Fig. [158]5B reveals the relationships among
   the different immune cell types. Strong positive correlations were
   observed between activated memory CD4 T cells and M1 macrophages, as
   well as between resting memory CD4 T cells and M2 macrophages,
   indicating potential interactions or co-regulation within the tumor
   microenvironment. A detailed comparison of the immune cell fractions
   between GBM and control samples is presented in Fig. [159]5C. This
   analysis highlights significant differences in the fractions of various
   immune cells, particularly an increased presence of M2 macrophages and
   a reduction in resting CD4 memory T cells in GBM samples.
   Figure [160]5D further explores the correlation between key immune
   cells and gene expression, with macrophages (M1 and M2) and dendritic
   cells showing the highest correlation coefficients. These correlations
   suggest that the presence of these immune cells may be linked to
   specific gene expression profiles in GBM, reflecting their potential
   role in tumor progression or immune evasion. Finally, the network plot
   in Fig. [161]5E integrates the identified key genes with the immune
   cell infiltration data. The plot shows significant associations between
   the expression of key genes such as ANXA2P1, FOXO4, ITGA10, and KLHL3,
   and the infiltration of specific immune cell types. For instance,
   ANXA2P1 and ITGA10 are positively correlated with M2 macrophage
   infiltration, further underscoring the immunosuppressive environment in
   GBM.

   These findings provide a comprehensive overview of the immune landscape
   in GBM and highlight potential interactions between immune cells and
   key genes that may contribute to the tumor's immunosuppressive
   microenvironment.

Mendelian randomization analysis of KLHL3 and glioblastoma Risk

   MR analysis was conducted to investigate the causal relationship
   between the expression of specific genes and glioblastoma risk,
   focusing on seven genes identified as potential candidates based on
   previous analyses. These genes were screened for their relevance using
   a combination of differential expression analysis, machine learning
   model evaluation, and functional enrichment studies. Among these, KLHL3
   was selected for detailed MR analysis due to its significant
   differential expression and robust association with glioblastoma risk
   factors identified in earlier steps.

   Figure [162]6A illustrates the MR analysis results using various
   methods, including Inverse Variance Weighted (IVW), MR Egger, Weighted
   Median, and Simple Mode. The consistent slope across these methods for
   KLHL3 suggests a protective effect against glioblastoma, with the IVW
   method showing a particularly strong association. The forest plot in
   Fig. [163]6B presents the MR effect sizes for each SNP used as an
   instrumental variable, highlighting KLHL3's impact on glioblastoma
   risk. Notably, the red markers indicate significant SNPs detected by
   the MR Egger method, suggesting some degree of pleiotropy, while the
   black markers represent the IVW estimates. Figure [164]6C provides a
   leave-one-out sensitivity analysis, which further confirmed the
   robustness of the MR findings by showing stable estimates even when
   individual SNPs were excluded from the analysis. This reinforces the
   reliability of the observed association between KLHL3 expression and
   glioblastoma risk. Finally, Fig. [165]6D depicts the funnel plot for
   detecting directional pleiotropy. The symmetry around the MR Egger line
   suggests no significant pleiotropic effects, strengthening the validity
   of KLHL3 as a causal factor in reducing glioblastoma risk
   (Supplementary Table 6).

Fig. 6.

   [166]Fig. 6
   [167]Open in a new tab

   Mendelian Randomization Analysis of KLHL3 in Glioblastoma. A The
   scatter plot displays the SNP effect on KLHL3 expression versus the SNP
   effect on glioblastoma risk using different Mendelian Randomization
   (MR) methods, including inverse variance weighted, MR Egger, weighted
   median, and simple mode. The consistent slope across these methods
   suggests a potential protective effect of KLHL3 expression against
   glioblastoma risk. B The forest plot illustrates the MR effect sizes of
   individual SNPs associated with KLHL3 expression on glioblastoma risk,
   with results shown for both the MR Egger and inverse variance weighted
   methods. The negative effect sizes for certain SNPs further indicate a
   potential protective role of KLHL3 in glioblastoma. C The leave-one-out
   sensitivity analysis identifies the impact of each SNP on the overall
   MR estimate. The analysis shows that excluding any single SNP does not
   significantly alter the protective effect, indicating the robustness of
   the MR findings. D The funnel plot evaluates the potential for
   directional pleiotropy in the MR analysis. The symmetric distribution
   of points around the inverse variance weighted estimate line suggests a
   low likelihood of pleiotropy, supporting the validity of the causal
   inference made between KLHL3 expression and glioblastoma risk

   In summary, KLHL3 was selected for MR analysis from seven potential
   genes based on its strong differential expression and association with
   glioblastoma. The results indicate that higher genetically predicted
   KLHL3 expression may reduce glioblastoma risk, with minimal evidence of
   pleiotropy or confounding.

Functional enrichment analysis of KLHL3

   To explore the functional implications of KLHL3 expression in
   glioblastoma, both GSEA and GSVA were performed, focusing on the
   pathways associated with differential KLHL3 expression levels.

   In the GSEA analysis, significant enrichment was observed in both high
   and low KLHL3 expression groups. Figure [168]7A, B present the GSEA
   results for GO terms. Specifically, the high expression group showed
   significant enrichment in processes related to the regulation of
   mucins, glomerular basement membrane, and other differentiation-related
   processes (Fig. [169]7A). Conversely, in the low expression group,
   there was notable enrichment in pathways involving the extracellular
   matrix, basement membrane, and structural constituents of the
   cytoskeleton (Fig. [170]7B). Similarly, the KEGG pathway analysis
   further elucidated these associations. Figure [171]7C, D highlight the
   pathways enriched in high and low KLHL3 expression groups,
   respectively. High KLHL3 expression correlated with pathways such as
   cell signaling, immune cell adhesion, and phagocytosis, indicating an
   active role in immune-related processes (Fig. [172]7C). On the other
   hand, low KLHL3 expression was associated with pathways like complement
   and coagulation cascades, ECM-receptor interaction, and ribosome
   biogenesis, suggesting its involvement in fundamental cellular
   processes and extracellular interactions (Fig. [173]7D).

Fig. 7.

   [174]Fig. 7
   [175]Open in a new tab

   Gene Set Enrichment Analysis (GSEA) and Gene Set Variation Analysis
   (GSVA) for KLHL3 in Glioblastoma. A GSEA results for the high
   expression group of KLHL3 reveal significant enrichment in pathways
   related to mucin transport, cell differentiation, and glucocorticoid
   response. The enrichment plots show a strong correlation between high
   KLHL3 expression and these biological processes. B GSEA results for the
   low expression group of KLHL3 indicate significant enrichment in
   pathways involved in extracellular matrix organization and structural
   components of the basement membrane. These pathways are particularly
   associated with low KLHL3 expression, suggesting a potential link to
   tumor microenvironment alterations. C KEGG pathway analysis in the high
   expression group highlights significant enrichment in pathways related
   to signaling in the immune system and phagocytosis. These pathways are
   associated with immune responses and cellular processes that may
   contribute to glioblastoma progression. D KEGG pathway analysis in the
   low expression group identifies enrichment in pathways related to
   complement and coagulation cascades and Fc receptor-mediated signaling.
   These results suggest that low KLHL3 expression may influence the
   immune landscape and coagulation processes in glioblastoma. E GSVA
   results for KLHL3 show the differential activity of various pathways
   across samples. The analysis reveals upregulated and downregulated
   pathways in relation to KLHL3 expression levels, providing insights
   into its role in glioblastoma biology. F Further GSVA analysis confirms
   the differential expression of key signaling pathways, categorizing
   them as either upregulated or downregulated in glioblastoma. This
   analysis underscores the impact of KLHL3 on critical biological
   processes that could be pivotal in tumor progression

   GSVA results provided further insights, as illustrated in Fig. [176]7E,
   F. The analysis revealed distinct pathway activity differences between
   high and low KLHL3 expression groups. The t-value bar plots
   (Fig. [177]7E, F) summarized the upregulated and downregulated
   pathways, indicating that KLHL3 significantly influences a variety of
   biological processes, particularly those related to cell adhesion,
   immune response, and extracellular matrix organization.

   These findings collectively suggest that KLHL3 plays a critical role in
   glioblastoma by modulating key biological pathways, with differential
   expression leading to distinct functional outcomes in tumor biology.

Discussion

   In this study, we utilized a novel approach combining advanced machine
   learning techniques with MR analysis to identify and validate potential
   prognostic markers for GBM. By employing 113 distinct machine learning
   models, we conducted a comprehensive evaluation of gene expression
   data, leading to the identification of seven key genes—ANXA2P1, FOXO4,
   ITGA10, KLHL3, MAP1A, PTTG3P, and SFRP2—associated with GBM. The
   subsequent MR analysis provided a robust framework to investigate the
   causal relationships between these gene expressions and GBM risk,
   underscoring the potential of these genes as critical biomarkers in GBM
   pathology [[178]22–[179]24].

   Among the seven identified genes, KLHL3 emerged as particularly
   noteworthy due to its significant differential expression and the
   protective effect suggested by MR analysis. The consistent inverse
   association observed across various MR approaches for KLHL3 implies
   that higher expression levels of this gene may confer a reduced risk of
   developing GBM. This finding not only reinforces the potential role of
   KLHL3 in GBM but also demonstrates the value of integrating machine
   learning with genetic epidemiology to uncover causative genetic factors
   in complex diseases.

   The machine learning models applied in this study also provided
   critical insights into the predictive accuracy of these seven genes in
   distinguishing GBM from control samples. The Ridge regression model, in
   particular, demonstrated superior performance across multiple datasets,
   highlighting MAP1A, FOXO4, and KLHL3 as the top-performing genes with
   the highest area under the curve (AUC) values. These genes' strong
   predictive ability suggests that they could serve as reliable
   biomarkers for early GBM detection, offering potential applications in
   clinical diagnostics and personalized medicine.

   Additionally, the immune infiltration analysis revealed significant
   interactions between the identified genes and the tumor
   microenvironment. The presence of M2 macrophages, for instance, was
   strongly correlated with the expression of ANXA2P1 and ITGA10,
   suggesting that these genes might play a role in shaping the
   immunosuppressive landscape of GBM. Understanding these gene-immune
   cell interactions could provide new avenues for developing
   immunotherapeutic strategies that target these specific pathways,
   potentially improving patient outcomes [[180]25, [181]26].

   In conclusion, the integration of 113 machine learning models with
   Mendelian Randomization analysis allowed for the identification and
   validation of seven key genes associated with GBM, with KLHL3 standing
   out as a potential protective factor. This study not only highlights
   the utility of combining computational and genetic approaches to
   explore complex disease mechanisms but also identifies novel biomarkers
   that could be pivotal in GBM diagnosis and treatment. Future research
   should focus on validating these findings in larger cohorts and further
   exploring the therapeutic implications of targeting these gene networks
   in GBM.

Conclusions

   This study identified and validated seven key genes related to
   glioblastoma using machine learning and Mendelian Randomization. The
   Ridge regression model demonstrated strong predictive accuracy,
   highlighting these genes as potential biomarkers. These findings offer
   valuable insights for future GBM therapeutic strategies.

Limitations section

   This study, while providing valuable insights into glioblastoma through
   the integration of machine learning and Mendelian randomization, is
   limited by its purely bioinformatic approach. The findings are based on
   publicly available datasets and computational analyses. Moreover, the
   predictions and causal inferences drawn from this study require
   experimental validation in vitro and in vivo to confirm the functional
   relevance of the identified gene networks.

Supplementary Information

   [182]Supplementary materials 1.^ (31.3KB, txt)
   [183]Supplementary materials 2.^ (53B, txt)
   [184]Supplementary materials 3.^ (2.5KB, txt)
   [185]Supplementary materials 4.^ (861B, txt)
   [186]Supplementary materials 5.^ (9KB, txt)
   [187]Supplementary materials 6.^ (1.1KB, csv)
   [188]Supplementary materials 7.^ (15.7KB, docx)

Acknowledgements