Abstract Background Glioblastoma multiforme (GBM) is a highly aggressive brain cancer with poor prognosis and limited treatment options. Despite advances in understanding its molecular mechanisms, effective therapeutic strategies remain elusive due to the tumor’s genetic complexity and heterogeneity. Methods This study employed a comprehensive analysis approach integrating 113 machine learning algorithms with Mendelian Randomization (MR) analysis to investigate the molecular underpinnings of GBM. Five publicly available gene expression datasets were analyzed to identify differentially expressed genes (DEGs) associated with GBM. Weighted Gene Co-expression Network Analysis (WGCNA) was used to identify GBM-related gene modules. Further, gene set enrichment and variation analyses were conducted to explore the biological pathways involved. The machine learning models were evaluated using Receiver Operating Characteristic (ROC) curves and confusion matrices to assess their predictive accuracy, with the best-performing model validated across external datasets. MR analysis was performed to establish causal relationships between genetically predicted gene expression levels and GBM outcomes. Results The study identified 286 DEGs between GBM and adjacent normal tissues across five datasets. WGCNA highlighted the yellow module as the most relevant to GBM, containing key genes such as KLHL3, FOXO4, and MAP1A. Of the 113 machine learning models tested, Ridge regression achieved the highest area under the curve (AUC) of 0.92, demonstrating robust predictive accuracy. Validation using external datasets confirmed the model's reliability, with a classification accuracy of 89.5% in the training set and 85.3% in the validation sets. MR analysis provided strong evidence of a causal relationship between the expression levels of the identified genes and GBM risk. Conclusions This study demonstrates the power of combining machine learning and Mendelian Randomization to uncover novel genetic markers for GBM. The identified genes offer promising potential as biomarkers for GBM diagnosis and therapy, providing new avenues for personalized treatment strategies. Supplementary Information The online version contains supplementary material available at 10.1007/s12672-025-01792-0. Keywords: Glioblastoma multiforme, Machine learning, Mendelian randomization, Gene co-expression analysis Introduction Glioblastoma multiforme (GBM) remains one of the most aggressive and lethal forms of brain cancer, characterized by rapid progression and resistance to conventional therapies [[34]1–[35]4]. Despite significant advancements in understanding the molecular mechanisms underlying GBM, the prognosis for patients has improved only marginally over the past decades. The complex genetic landscape of GBM, marked by extensive heterogeneity and intricate interactions between tumor cells and their microenvironment, poses a significant challenge to the development of effective therapeutic strategies [[36]5–[37]7]. Traditional approaches to cancer biomarker discovery have predominantly relied on single-omic analyses or small-scale studies, often overlooking the multifactorial nature of GBM. As a result, there is an urgent need for innovative methodologies that can comprehensively capture the intricate molecular networks driving GBM pathogenesis and progression [[38]1, [39]8–[40]10]. Recent advances in machine learning (ML) have opened new avenues for integrating large-scale omics data, enabling the identification of robust biomarkers with high predictive power [[41]11–[42]14]. However, the application of ML alone may still be limited by confounding factors and biases inherent in observational data. To address these challenges, we combined 113 distinct machine learning algorithms with Mendelian Randomization (MR) analysis, a powerful approach that leverages genetic variants as instrumental variables to infer causal relationships between exposures and outcomes. By integrating these methodologies, our study aimed to overcome the limitations of traditional observational studies and provide a more accurate and reliable identification of genetic biomarkers for GBM. This approach not only enhances the robustness of our findings but also offers the potential to uncover novel therapeutic targets by distinguishing between mere associations and causal relationships. The primary objective of this study was to identify and validate genetic markers associated with GBM through a comprehensive, multi-faceted analysis. We sought to leverage the strengths of both machine learning and Mendelian Randomization to systematically evaluate the genetic determinants of GBM, focusing on their potential role as biomarkers for diagnosis, prognosis, and treatment response. By doing so, we aim to contribute to the growing body of knowledge on GBM and pave the way for more personalized and targeted therapeutic strategies, ultimately improving outcomes for patients with this devastating disease. Methods Data acquisition and preprocessing Five publicly available gene expression datasets related to GBM were selected from the GEO database ([43]https://www.ncbi.nlm.nih.gov/geo/). The datasets chosen include [44]GSE116520 ([45]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116520), [46]GSE34152 ([47]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE34152), [48]GSE7696 ([49]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696), [50]GSE90886 ([51]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90886), and [52]GSE109857 ([53]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109857), each containing patient information relevant to GBM (Table [54]1). For each dataset, the corresponding platform files were obtained to ensure accurate gene expression matrix calculations. The gene expression matrices were generated by processing the raw data according to the specific platform annotations provided in the GEO repository. Table 1. Overview of GBM-Related Gene Expression Datasets Dataset accession URL Sample type Platform Sample size [55]GSE116520 [56]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116520 GBM/tumor and normal/adjacent tissue [57]GPL10558 GBM/tumor: 17 normal/adjacent tissue:25 [58]GSE34152 [59]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE34152 GBM/tumor and normal/adjacent tissue [60]GPL570 GBM/tumor: 4 normal/adjacent tissue:4 [61]GSE7696 [62]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696 GBM/tumor and normal/adjacent tissue [63]GPL570 GBM/tumor: 80 normal/adjacent tissue:4 [64]GSE90886 [65]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90886 GBM/tumor and normal/adjacent tissue [66]GPL15207 GBM/tumor: 9 normal/adjacent tissue:9 [67]GSE109857 [68]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109857 GBM/tumor and normal/adjacent tissue [69]GPL6480 GBM/tumor: 220 normal/adjacent tissue:5 [70]Open in a new tab To minimize potential batch effects between datasets, the ComBat function from the sva package in R was applied to the expression matrices. Following batch effect removal, the datasets were divided into training and validation cohorts. Specifically, [71]GSE116520 and [72]GSE34152 were merged to form the training set, while [73]GSE7696, [74]GSE90886, and [75]GSE109857 were designated as external validation cohorts. Differential gene expression analysis Differentially expressed genes (DEGs) between GBM and adjacent normal tissues were identified using the limma package in R [[76]15, [77]16]. After organizing and normalizing the expression data, a design matrix was constructed to compare the GBM and control groups. Linear modeling and empirical Bayes smoothing were applied, with DEGs filtered by |logFC|> 1 and adj.P.Val < 0.05. The significant DEGs were visualized through heatmaps and volcano plots using pheatmap and ggplot2. Top DEGs were highlighted, and all results were exported for further analysis. WGCNA analysis for GBM-related module identification To identify gene modules most relevant to GBM, Weighted Gene Co-expression Network Analysis (WGCNA) was conducted [[78]17, [79]18]. The normalized expression data were first preprocessed by removing genes with low variance (standard deviation ≤ 0.5). Samples were then clustered hierarchically to detect outliers, which were excluded from further analysis. The optimal soft-thresholding power was determined using the `pickSoftThreshold` function, ensuring scale-free topology of the network. An adjacency matrix was constructed using the selected power, followed by the calculation of the topological overlap matrix (TOM) and subsequent hierarchical clustering to identify gene modules. Dynamic tree cutting was employed to refine module boundaries, and similar modules were merged based on their eigengene correlations. The correlation between module eigengenes and clinical traits was assessed to pinpoint the module most strongly associated with GBM. Key modules were further examined by plotting the relationship between module membership and gene significance for the GBM trait, with results indicating the modules most enriched for GBM-relevant genes. GO and KEGG pathway enrichment analysis To explore the biological functions and pathways related to glioblastoma multiforme (GBM), we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on the intersecting genes between the DEGs and the most relevant WGCNA module. The intersecting genes were mapped to their Entrez IDs using the `org.Hs.eg.db` package. GO enrichment analysis was carried out across the Biological Process, Cellular Component, and Molecular Function categories using the `enrichGO` function from the `clusterProfiler` package, applying p-value and adjusted p-value cutoffs of 0.05 and 1, respectively. Significant GO terms were visualized through bar plots, bubble plots, and circular plots generated with `ggplot2`, `enrichplot`, and `circlize`. For KEGG pathway enrichment, the `enrichKEGG` function was used under similar conditions, and the results were presented through bubble plots. Machine learning model development and evaluation In this study, we used twelve different machine learning algorithms to classify GBM samples into two groups: GBM and normal tissues [[80]19]. These algorithms include Elastic Net (Enet), Ridge Regression, Stepwise Generalized Linear Model (Stepglm), and LASSO, which are particularly effective for selecting the most important features and preventing overfitting. In addition, we applied a variety of other methods such as Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), generalized linear models (glmBoost), Random Forest, Gradient Boosting Machine (GBM), XGBoost, and Naive Bayes for classification tasks. From these twelve algorithms, we combined them into 113 different models to maximize predictive performance. To prevent overfitting, we used two strategies: one was cross-validation, which pairs a feature selection method with a classification model to ensure robust results, and the other was increasing the number of datasets used for validation. We standardized the data (scaled and normalized) to ensure that the features were on a similar scale. We selected important features (genes) from the DEGs and the most relevant gene modules identified by WGCNA. We then evaluated the models based on their ability to distinguish between GBM and normal tissues, using a performance measure called the Area Under the Curve (AUC). We visualized the model performance using heatmaps, which helped identify the top-performing models. The models that performed best were selected for further validation and potential use in clinical applications. Model evaluation using ROC and confusion matrix The machine learning model with the highest AUC was selected for evaluation across both the training set and external validation cohorts. ROC curves were generated for each dataset to assess the model's classification performance, with AUC values calculated to quantify its ability to discriminate between GBM and control samples. Bootstrap resampling was employed to estimate 95% confidence intervals for the AUC. Additionally, confusion matrices were constructed to evaluate the accuracy of the model's predictions by comparing the predicted class labels with the actual labels in both the training and validation sets. This analysis provided comprehensive metrics on the model's sensitivity, specificity, and overall classification accuracy. The outcomes of these evaluations were systematically analyzed to ensure the model's robustness and generalizability across different datasets. Gene interaction network analysis GeneMANIA [[81]20] was utilized to analyze gene interaction networks, focusing on selected genes relevant to the study. The analysis employed the Homo sapiens dataset, integrating various biological networks such as physical interactions, co-expression, and pathways. The tool identified key interacting partners and highlighted significant biological processes, including the negative regulation of kinase activity and protein phosphorylation. The insights gained from this analysis were used to validate the functional relevance of the selected genes. Gene set enrichment analysis Gene Set Enrichment Analysis (GSEA) was conducted to explore the functional enrichment associated with different expression levels of the target gene, using both the c5.go.Hs.symbols.gmt and c2.cp.kegg.Hs.symbols.gmt gene sets [[82]21]. Gene expression data were first normalized, and samples were grouped into high and low expression categories based on the median expression level of the target gene. logFC between these groups were calculated and genes were ranked accordingly. GSEA was then performed on the ranked list to identify significantly enriched pathways and biological processes. Results were filtered with a p-value threshold of 0.05. GSVA analysis Gene Set Variation Analysis (GSVA) was employed to evaluate pathway activity variations associated with the expression levels of the target gene. Both c5.go.Hs.symbols.gmt and c2.cp.kegg.Hs.symbols.gmt gene sets were utilized. After normalizing the expression data and excluding control samples, GSVA scores were computed using the ssGSEA method. The scores were then normalized, and pathways with significant differences between the high and low expression groups were identified through t-tests. Pathways were classified as upregulated, downregulated, or not significant based on their GSVA scores. The results were summarized, and key pathways were visualized using bar plots to illustrate the differential activity between the groups. Immune infiltration analysis Immune cell infiltration was assessed using the CIBERSORT algorithm, which quantifies the relative proportions of different immune cell types within a complex gene expression mixture. The analysis was conducted using a signature matrix and gene expression data from the selected samples. The CIBERSORT algorithm was run with 1000 permutations to increase the robustness of the results. Output from the analysis included p-values, correlation coefficients, and RMSE values, which were used to filter significant results. Only samples with a p-value less than 0.05 were considered for further analysis. Mendelian randomization analysis Mendelian Randomization (MR) analysis was conducted to explore the potential causal relationships between genetically predicted gene expression levels and the risk of GBM. For this analysis, we used genetic data from the GWAS database ([83]https://gwas.mrcieu.ac.uk/) to identify genetic variants that influence gene expression (referred to as exposure data), and outcome data from the FinnGen database ([84]https://www.finngen.fi/en/access_results) to assess GBM risk. We first processed the exposure data, focusing on single nucleotide polymorphisms (SNPs) that were strongly associated with gene expression. To ensure consistency, we harmonized the data, matching the SNP information in both the exposure and outcome datasets. SNPs with low statistical significance (p-value > 5e−8) were removed to ensure the reliability of the analysis. For each gene of interest, we conducted the MR analysis using the TwoSampleMR package in R, primarily applying the inverse-variance weighted (IVW) method to estimate causal effects. To check the robustness of our results, we also used alternative methods, such as MR-Egger and the weighted median approach. We assessed the variability between studies using Cochran's Q test and checked for potential biases using MR-Egger’s intercept and MR-PRESSO global tests. We further examined the significant results by generating scatter plots, forest plots, and conducting leave-one-out sensitivity analyses. These analyses helped us visualize the strength and direction of the causal effects. Finally, we documented the results, highlighting the SNPs used in the analysis and presenting the estimated causal effects with confidence intervals, providing a clear view of the relationship between gene expression and GBM risk. Statistical analysis All statistical analyses were performed using R software. For differential gene expression analysis, the `limma` package was utilized to fit linear models, with significance determined by |logFC|> 1 and adjusted p-values < 0.05 using the empirical Bayes method. Machine learning model evaluation, including ROC curve generation and AUC calculation, was conducted using the `pROC`, `caret`, and `e1071` packages, with models assessed through both ROC curves and confusion matrices. GSEA was performed using the `clusterProfiler` and `enrichplot` packages, with pathway enrichment considered significant at p-values < 0.05. GSVA employed the `GSVA` and `limma` packages, with statistical significance of pathway activities determined by t-tests and results visualized using the `ggplot2` and `ggpubr` packages. In the Mendelian randomization analysis, the `TwoSampleMR`, `MendelianRandomization`, and `MR-PRESSO` packages were applied to estimate causal effects, with heterogeneity and pleiotropy assessed using Cochran’s Q test, MR-Egger intercept, and MR-PRESSO global tests. Significance in immune infiltration analysis was determined using p-values from the CIBERSORT algorithm within the `e1071` and `preprocessCore` packages, with only results having p-values < 0.05 considered significant. All statistical tests were two-sided, and the appropriate multiple testing correction methods were applied where necessary. Results Differential expression and batch effect correction The analysis began by addressing batch effects between the training datasets, [85]GSE116520 and [86]GSE34152. Before batch correction, significant differences were observed in gene expression distributions across the two datasets, as illustrated by the boxplots in Fig. [87]1A. Principal component analysis (PCA) further highlighted the batch effect, with samples from [88]GSE116520 and [89]GSE34152 clustering separately in the PCA plot (Fig. [90]1C). To correct for these discrepancies, the ComBat function from the sva package was applied. Post-correction, the gene expression distributions were more consistent across datasets, as shown in the adjusted boxplots (Fig. [91]1B), and the PCA plot demonstrated a more cohesive clustering of samples, indicating effective batch correction (Fig. [92]1D). Fig. 1. [93]Fig. 1 [94]Open in a new tab Batch Effect Correction and Differential Gene Expression Analysis. A Boxplot of gene expression data from [95]GSE116520 and [96]GSE34152 datasets before batch correction. The differences between the datasets are apparent. B Boxplot of gene expression data from [97]GSE116520 and [98]GSE34152 after batch effect correction using the ComBat method, showing improved alignment across datasets. C Principal component analysis (PCA) plot before batch correction, displaying clear separation between [99]GSE116520 and [100]GSE34152 datasets. D PCA plot after batch correction, illustrating improved alignment and reduced batch effects, with samples from both datasets more closely clustered. E Volcano plot showing differentially expressed genes (DEGs) between glioblastoma multiforme (GBM) and control tissues in the training set. Genes with |logFC|> 1 and adj.P.Val < 0.05 are highlighted in red (upregulated) and green (downregulated). F Heatmap of the top DEGs across samples, highlighting distinct expression patterns between GBM and control groups in the training set. Each column represents a sample, and each row represents a DEG Subsequent differential expression analysis between GBM and adjacent normal tissues identified a total of 286 DEGs with |logFC|> 1 and adjusted p-value < 0.05 (Supplementary Table 1). The volcano plot in Fig. [101]1E displays the distribution of these DEGs, with significantly upregulated genes marked in red and downregulated genes in green. The heatmap in Fig. [102]1F provides a visual summary of the expression patterns of the top DEGs across samples, clearly distinguishing GBM from normal tissues. WGCNA and module detection To identify gene modules associated with GBM, WGCNA was performed. The optimal soft-thresholding power was determined to be 8, ensuring that the network followed a scale-free topology (Fig. [103]2A). Using this power, an adjacency matrix was constructed, and the topological overlap matrix (TOM) was calculated. Hierarchical clustering based on TOM dissimilarity identified several gene modules, each represented by a unique color in the dendrogram (Fig. [104]2B). Fig. 2. [105]Fig. 2 [106]Open in a new tab WGCNA and Functional Enrichment Analysis of GBM-Related Modules. A Scale independence and mean connectivity plot used to determine the optimal soft-thresholding power in WGCNA. The soft-threshold power was chosen at 8, ensuring a scale-free network topology. B Gene dendrogram and module colors generated by hierarchical clustering of genes based on topological overlap. Modules are represented by different colors, with each branch of the dendrogram representing a module. C Heatmap showing the correlation between module eigengenes and clinical traits (GBM vs. control). The yellow module exhibited the highest positive correlation with the GBM trait. D Bar plot depicting the gene significance across modules, with the yellow module showing the highest gene significance for GBM. E Venn diagram illustrating the overlap between differentially expressed genes (DEGs) identified in the study and the yellow module from WGCNA. A total of 7 genes were common between the DEGs and the yellow module. F Circos plot visualizing the Gene Ontology (GO) enrichment analysis of the common genes, highlighting the most significant biological processes related to GBM, including cell adhesion and axon development. G Bar plot of GO terms enriched in the common genes, with the top enriched terms involved in cell adhesion and neuronal differentiation. H KEGG pathway enrichment analysis of the common genes, showing significant pathways such as viral life cycle, cardiomyopathies, and ECM-receptor interaction The module-trait relationship heatmap (Fig. [107]2C) indicated that the yellow module exhibited the most significant correlation with GBM (correlation coefficient = 0.49, p = 3e−04). This module was selected for further analysis due to its strong association with GBM. The gene significance within this module was notably higher than in other modules, as shown in the bar plot (Fig. [108]2D), further supporting its relevance to GBM. Functional enrichment analysis of the most significant GBM-associated module The yellow module was subjected to functional enrichment analysis to explore its biological implications in GBM. A Venn diagram (Fig. [109]2E) identified seven intersecting genes between the yellow module and the DEGs identified earlier (Supplementary Table 2). These intersecting genes were analyzed for their involvement in biological processes, cellular components, and molecular functions through GO enrichment analysis. The circular plot (Fig. [110]2F) highlights the significant GO terms, including pathways related to cell adhesion, neuron differentiation, and protein ubiquitination. Additionally, KEGG pathway enrichment analysis revealed that these genes were significantly involved in pathways such as the viral life cycle of HIV-1, cardiomyopathies, and ECM-receptor interaction (Fig. [111]2H). The dot plot (Fig. [112]2G) illustrates the specific GO terms enriched within the yellow module, underscoring its potential role in GBM pathogenesis through processes like integrin binding and axon development. Evaluation of the top-performing machine learning model A total of 113 machine learning models (Supplementary Table 3) were developed and evaluated using the training dataset, which consisted of the merged [113]GSE116520 and [114]GSE34152 datasets, and external validation datasets, including [115]GSE7696, [116]GSE90886, and [117]GSE109857. The model with the highest (AUC was selected for further evaluation. Figure [118]3A shows a heatmap comparing the AUC values across different models and cohorts. The Ridge regression model combined with the Elastic Net (Enet) algorithm emerged as the top-performing model with the highest AUC values across all datasets. This model achieved an AUC of 0.897 (95% CI 0.808–0.966) on the training set (Fig. [119]3B), demonstrating strong discriminatory power between GBM and control samples. Fig. 3. [120]Fig. 3 [121]Open in a new tab Evaluation of the Machine Learning Model for GBM Classification. A Heatmap displaying the AUC values of 113 machine learning models across the training set ([122]GSE116520 and [123]GSE34152) and external validation sets ([124]GSE7696, [125]GSE90886, and [126]GSE109857). The Ridge model, which showed the highest AUC, was selected for further evaluation. B ROC curve for the Ridge model on the training set, demonstrating an AUC of 0.897 with a 95% confidence interval (CI) of 0.808–0.966, indicating strong predictive performance. C ROC curve for the Ridge model on the [127]GSE109857 validation set, yielding an AUC of 0.872 with a 95% CI of 0.798–0.940, confirming the model's generalizability to external data. D ROC curve for the Ridge model on the [128]GSE7696 validation set, showing an AUC of 0.984 with a 95% CI of 0.950–1.000, further supporting the model's robustness. E ROC curve for the Ridge model on the [129]GSE90886 validation set, with a lower AUC of 0.704 and a 95% CI of 0.432–0.914, indicating variable performance across different datasets. F Confusion matrix for the Ridge model on the training set, highlighting the model's accuracy in classifying GBM (Treat) and control samples. The model correctly classified 17 GBM samples and 23 control samples, with 10 misclassifications. G Confusion matrix for the Ridge model on the [130]GSE109857 validation set, demonstrating its high sensitivity and specificity, correctly classifying 147 GBM samples and 5 control samples. H Confusion matrix for the Ridge model on the [131]GSE7696 validation set, where the model achieved perfect classification with no misclassifications. I Confusion matrix for the Ridge model on the [132]GSE90886 validation set, showing some misclassifications, consistent with the lower AUC observed in the ROC analysis The model's performance was further validated on the external datasets. In the [133]GSE109857 dataset, the model achieved an AUC of 0.872 (95% CI 0.798–0.940) (Fig. [134]3C), while in the [135]GSE7696 dataset, the model performed exceptionally well with an AUC of 0.984 (95% CI 0.950–1.000) (Fig. [136]3D). However, in the [137]GSE90886 dataset, the model's performance was lower, with an AUC of 0.704 (95% CI 0.432–0.914) (Fig. [138]3E). To further assess the model's classification accuracy, confusion matrices were generated for the training set and each validation cohort (Fig. [139]3F–I). In the training set (Fig. [140]3F), the model correctly classified 23 control samples and 17 GBM samples, with four control samples misclassified as GBM and six GBM samples misclassified as control. In the [141]GSE109857 cohort (Fig. [142]3G), the model showed strong classification accuracy, correctly identifying 147 GBM samples and five control samples, with minimal misclassification. The [143]GSE7696 cohort (Fig. [144]3H) demonstrated a similar pattern, with only a few misclassifications. The [145]GSE90886 cohort (Fig. [146]3I) exhibited slightly higher misclassification rates, reflecting the lower AUC observed in this dataset. Overall, the Ridge regression model combined with Enet demonstrated robust performance across multiple datasets, making it a promising tool for distinguishing GBM from control samples. The use of ROC curves and confusion matrices provided a comprehensive evaluation of the model's accuracy and generalizability. Identification and validation of key prognostic markers using ridge regression model The Ridge regression model, which was identified as the top-performing method, was further utilized to identify and validate key prognostic markers in GBM. Differential expression analysis identified 286 DEGs, of which significant genes are highlighted in Fig. [147]4A. The volcano plot (Fig. [148]4A) illustrates the distribution of genes with significant differential expression between GBM and control samples, highlighting genes such as ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, and SFRP2 as key markers (Supplementary Table 4). Fig. 4. [149]Fig. 4 [150]Open in a new tab Identification and Validation of Key Genes Associated with GBM. A Volcano plot showing differentially expressed genes (DEGs) between GBM and control samples. The highlighted genes—MAP1A, FOXO4, KLHL3, ITGA10, SFRP2 (downregulated in GBM), and ANXA2P1, PTTG3P (upregulated in GBM)—were selected for further analysis based on their significant log fold changes (logFC) and adjusted p-values. B Box plots depicting the expression levels of the selected genes (ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, SFRP2) in GBM (Treat) and control samples. All selected genes show statistically significant differential expression between the two groups (***p < 0.001, **p < 0.01). C Correlation matrix visualizing the relationships between the expression levels of the selected genes. The matrix shows both the correlation coefficients and the statistical significance of these correlations, highlighting strong positive and negative correlations among certain gene pairs. D ROC curves for each selected gene, demonstrating their predictive ability for distinguishing between GBM and control samples. The AUC values indicate the model's classification performance for each gene, with MAP1A showing the highest AUC of 0.883. E Gene interaction network generated using GeneMANIA, illustrating the connections between the selected genes and other associated genes. The network includes various types of interactions, such as physical interactions, co-expression, and shared protein domains, with functions primarily related to the negative regulation of kinase activity, proteolysis, and protein phosphorylation In Fig. [151]4B, the expression levels of these seven significant DEGs are compared between GBM and control samples, showing consistent differential expression patterns, with all selected genes showing statistically significant differences (p < 0.001). The correlation plot in Fig. [152]4C further examines the relationships between these key genes, revealing notable correlations such as a strong positive correlation between ANXA2P1 and ITGA10, and a strong negative correlation between FOXO4 and PTTG3P. The diagnostic performance of each key marker was assessed using ROC curves (Fig. [153]4D), where the AUC values ranged from 0.722 for SFRP2 to 0.883 for MAP1A. These AUC values indicate that these genes have strong potential as diagnostic markers for GBM, with MAP1A, KLHL3, and FOXO4 showing particularly high discriminatory power. Lastly, a gene interaction network analysis was conducted using GeneMANIA to explore the functional associations between these key markers (Fig. [154]4E). The network revealed that these genes are involved in critical biological processes, including the negative regulation of kinase activity and protein phosphorylation. The analysis also highlighted significant physical interactions, co-expression patterns, and shared pathways among these genes, further supporting their role in GBM pathogenesis. Immune infiltration analysis in GBM Immune infiltration analysis was conducted using the CIBERSORT algorithm to evaluate the relative proportions of 22 immune cell types within GBM and control samples (Supplementary Table 5). Figure [155]5A illustrates the distribution of these immune cell types across the samples, showing distinct differences in immune cell composition between GBM and control groups. Notably, GBM samples exhibited higher proportions of M0 and M2 macrophages, as well as activated memory CD4 T cells, compared to the control group. Fig. 5. [156]Fig. 5 [157]Open in a new tab Immune Infiltration Analysis in GBM. A The bar plot illustrates the relative proportions of various immune cell types within the GBM (Treat) and control groups, as calculated by the CIBERSORT algorithm. This comparison shows distinct differences in the immune landscape between the two groups, highlighting the increased presence of certain immune cell types in GBM samples. B The heatmap shows the correlation coefficients between different immune cell types in GBM and control samples. Positive and negative correlations are depicted, indicating potential interactions or mutual exclusivity between specific immune cell populations. C Box plots display the fraction of individual immune cell types across the GBM and control groups, emphasizing significant differences in the infiltration levels of certain immune cells, such as macrophages and T cells, between the two conditions. D A lollipop plot ranking the immune cell types based on their correlation coefficients with the expression of key genes in GBM, as determined through correlation analysis. This plot identifies the immune cells most strongly associated with gene expression changes in GBM, with macrophages M1 and M2 showing notable correlations. E A cell–cell interaction network, illustrating the associations between selected key genes (ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, SFRP2) and immune cell types. The network highlights significant interactions and their directions (positive or negative), providing insights into the complex immune microenvironment in GBM The correlation heatmap in Fig. [158]5B reveals the relationships among the different immune cell types. Strong positive correlations were observed between activated memory CD4 T cells and M1 macrophages, as well as between resting memory CD4 T cells and M2 macrophages, indicating potential interactions or co-regulation within the tumor microenvironment. A detailed comparison of the immune cell fractions between GBM and control samples is presented in Fig. [159]5C. This analysis highlights significant differences in the fractions of various immune cells, particularly an increased presence of M2 macrophages and a reduction in resting CD4 memory T cells in GBM samples. Figure [160]5D further explores the correlation between key immune cells and gene expression, with macrophages (M1 and M2) and dendritic cells showing the highest correlation coefficients. These correlations suggest that the presence of these immune cells may be linked to specific gene expression profiles in GBM, reflecting their potential role in tumor progression or immune evasion. Finally, the network plot in Fig. [161]5E integrates the identified key genes with the immune cell infiltration data. The plot shows significant associations between the expression of key genes such as ANXA2P1, FOXO4, ITGA10, and KLHL3, and the infiltration of specific immune cell types. For instance, ANXA2P1 and ITGA10 are positively correlated with M2 macrophage infiltration, further underscoring the immunosuppressive environment in GBM. These findings provide a comprehensive overview of the immune landscape in GBM and highlight potential interactions between immune cells and key genes that may contribute to the tumor's immunosuppressive microenvironment. Mendelian randomization analysis of KLHL3 and glioblastoma Risk MR analysis was conducted to investigate the causal relationship between the expression of specific genes and glioblastoma risk, focusing on seven genes identified as potential candidates based on previous analyses. These genes were screened for their relevance using a combination of differential expression analysis, machine learning model evaluation, and functional enrichment studies. Among these, KLHL3 was selected for detailed MR analysis due to its significant differential expression and robust association with glioblastoma risk factors identified in earlier steps. Figure [162]6A illustrates the MR analysis results using various methods, including Inverse Variance Weighted (IVW), MR Egger, Weighted Median, and Simple Mode. The consistent slope across these methods for KLHL3 suggests a protective effect against glioblastoma, with the IVW method showing a particularly strong association. The forest plot in Fig. [163]6B presents the MR effect sizes for each SNP used as an instrumental variable, highlighting KLHL3's impact on glioblastoma risk. Notably, the red markers indicate significant SNPs detected by the MR Egger method, suggesting some degree of pleiotropy, while the black markers represent the IVW estimates. Figure [164]6C provides a leave-one-out sensitivity analysis, which further confirmed the robustness of the MR findings by showing stable estimates even when individual SNPs were excluded from the analysis. This reinforces the reliability of the observed association between KLHL3 expression and glioblastoma risk. Finally, Fig. [165]6D depicts the funnel plot for detecting directional pleiotropy. The symmetry around the MR Egger line suggests no significant pleiotropic effects, strengthening the validity of KLHL3 as a causal factor in reducing glioblastoma risk (Supplementary Table 6). Fig. 6. [166]Fig. 6 [167]Open in a new tab Mendelian Randomization Analysis of KLHL3 in Glioblastoma. A The scatter plot displays the SNP effect on KLHL3 expression versus the SNP effect on glioblastoma risk using different Mendelian Randomization (MR) methods, including inverse variance weighted, MR Egger, weighted median, and simple mode. The consistent slope across these methods suggests a potential protective effect of KLHL3 expression against glioblastoma risk. B The forest plot illustrates the MR effect sizes of individual SNPs associated with KLHL3 expression on glioblastoma risk, with results shown for both the MR Egger and inverse variance weighted methods. The negative effect sizes for certain SNPs further indicate a potential protective role of KLHL3 in glioblastoma. C The leave-one-out sensitivity analysis identifies the impact of each SNP on the overall MR estimate. The analysis shows that excluding any single SNP does not significantly alter the protective effect, indicating the robustness of the MR findings. D The funnel plot evaluates the potential for directional pleiotropy in the MR analysis. The symmetric distribution of points around the inverse variance weighted estimate line suggests a low likelihood of pleiotropy, supporting the validity of the causal inference made between KLHL3 expression and glioblastoma risk In summary, KLHL3 was selected for MR analysis from seven potential genes based on its strong differential expression and association with glioblastoma. The results indicate that higher genetically predicted KLHL3 expression may reduce glioblastoma risk, with minimal evidence of pleiotropy or confounding. Functional enrichment analysis of KLHL3 To explore the functional implications of KLHL3 expression in glioblastoma, both GSEA and GSVA were performed, focusing on the pathways associated with differential KLHL3 expression levels. In the GSEA analysis, significant enrichment was observed in both high and low KLHL3 expression groups. Figure [168]7A, B present the GSEA results for GO terms. Specifically, the high expression group showed significant enrichment in processes related to the regulation of mucins, glomerular basement membrane, and other differentiation-related processes (Fig. [169]7A). Conversely, in the low expression group, there was notable enrichment in pathways involving the extracellular matrix, basement membrane, and structural constituents of the cytoskeleton (Fig. [170]7B). Similarly, the KEGG pathway analysis further elucidated these associations. Figure [171]7C, D highlight the pathways enriched in high and low KLHL3 expression groups, respectively. High KLHL3 expression correlated with pathways such as cell signaling, immune cell adhesion, and phagocytosis, indicating an active role in immune-related processes (Fig. [172]7C). On the other hand, low KLHL3 expression was associated with pathways like complement and coagulation cascades, ECM-receptor interaction, and ribosome biogenesis, suggesting its involvement in fundamental cellular processes and extracellular interactions (Fig. [173]7D). Fig. 7. [174]Fig. 7 [175]Open in a new tab Gene Set Enrichment Analysis (GSEA) and Gene Set Variation Analysis (GSVA) for KLHL3 in Glioblastoma. A GSEA results for the high expression group of KLHL3 reveal significant enrichment in pathways related to mucin transport, cell differentiation, and glucocorticoid response. The enrichment plots show a strong correlation between high KLHL3 expression and these biological processes. B GSEA results for the low expression group of KLHL3 indicate significant enrichment in pathways involved in extracellular matrix organization and structural components of the basement membrane. These pathways are particularly associated with low KLHL3 expression, suggesting a potential link to tumor microenvironment alterations. C KEGG pathway analysis in the high expression group highlights significant enrichment in pathways related to signaling in the immune system and phagocytosis. These pathways are associated with immune responses and cellular processes that may contribute to glioblastoma progression. D KEGG pathway analysis in the low expression group identifies enrichment in pathways related to complement and coagulation cascades and Fc receptor-mediated signaling. These results suggest that low KLHL3 expression may influence the immune landscape and coagulation processes in glioblastoma. E GSVA results for KLHL3 show the differential activity of various pathways across samples. The analysis reveals upregulated and downregulated pathways in relation to KLHL3 expression levels, providing insights into its role in glioblastoma biology. F Further GSVA analysis confirms the differential expression of key signaling pathways, categorizing them as either upregulated or downregulated in glioblastoma. This analysis underscores the impact of KLHL3 on critical biological processes that could be pivotal in tumor progression GSVA results provided further insights, as illustrated in Fig. [176]7E, F. The analysis revealed distinct pathway activity differences between high and low KLHL3 expression groups. The t-value bar plots (Fig. [177]7E, F) summarized the upregulated and downregulated pathways, indicating that KLHL3 significantly influences a variety of biological processes, particularly those related to cell adhesion, immune response, and extracellular matrix organization. These findings collectively suggest that KLHL3 plays a critical role in glioblastoma by modulating key biological pathways, with differential expression leading to distinct functional outcomes in tumor biology. Discussion In this study, we utilized a novel approach combining advanced machine learning techniques with MR analysis to identify and validate potential prognostic markers for GBM. By employing 113 distinct machine learning models, we conducted a comprehensive evaluation of gene expression data, leading to the identification of seven key genes—ANXA2P1, FOXO4, ITGA10, KLHL3, MAP1A, PTTG3P, and SFRP2—associated with GBM. The subsequent MR analysis provided a robust framework to investigate the causal relationships between these gene expressions and GBM risk, underscoring the potential of these genes as critical biomarkers in GBM pathology [[178]22–[179]24]. Among the seven identified genes, KLHL3 emerged as particularly noteworthy due to its significant differential expression and the protective effect suggested by MR analysis. The consistent inverse association observed across various MR approaches for KLHL3 implies that higher expression levels of this gene may confer a reduced risk of developing GBM. This finding not only reinforces the potential role of KLHL3 in GBM but also demonstrates the value of integrating machine learning with genetic epidemiology to uncover causative genetic factors in complex diseases. The machine learning models applied in this study also provided critical insights into the predictive accuracy of these seven genes in distinguishing GBM from control samples. The Ridge regression model, in particular, demonstrated superior performance across multiple datasets, highlighting MAP1A, FOXO4, and KLHL3 as the top-performing genes with the highest area under the curve (AUC) values. These genes' strong predictive ability suggests that they could serve as reliable biomarkers for early GBM detection, offering potential applications in clinical diagnostics and personalized medicine. Additionally, the immune infiltration analysis revealed significant interactions between the identified genes and the tumor microenvironment. The presence of M2 macrophages, for instance, was strongly correlated with the expression of ANXA2P1 and ITGA10, suggesting that these genes might play a role in shaping the immunosuppressive landscape of GBM. Understanding these gene-immune cell interactions could provide new avenues for developing immunotherapeutic strategies that target these specific pathways, potentially improving patient outcomes [[180]25, [181]26]. In conclusion, the integration of 113 machine learning models with Mendelian Randomization analysis allowed for the identification and validation of seven key genes associated with GBM, with KLHL3 standing out as a potential protective factor. This study not only highlights the utility of combining computational and genetic approaches to explore complex disease mechanisms but also identifies novel biomarkers that could be pivotal in GBM diagnosis and treatment. Future research should focus on validating these findings in larger cohorts and further exploring the therapeutic implications of targeting these gene networks in GBM. Conclusions This study identified and validated seven key genes related to glioblastoma using machine learning and Mendelian Randomization. The Ridge regression model demonstrated strong predictive accuracy, highlighting these genes as potential biomarkers. These findings offer valuable insights for future GBM therapeutic strategies. Limitations section This study, while providing valuable insights into glioblastoma through the integration of machine learning and Mendelian randomization, is limited by its purely bioinformatic approach. The findings are based on publicly available datasets and computational analyses. Moreover, the predictions and causal inferences drawn from this study require experimental validation in vitro and in vivo to confirm the functional relevance of the identified gene networks. Supplementary Information [182]Supplementary materials 1.^ (31.3KB, txt) [183]Supplementary materials 2.^ (53B, txt) [184]Supplementary materials 3.^ (2.5KB, txt) [185]Supplementary materials 4.^ (861B, txt) [186]Supplementary materials 5.^ (9KB, txt) [187]Supplementary materials 6.^ (1.1KB, csv) [188]Supplementary materials 7.^ (15.7KB, docx) Acknowledgements