Abstract Introduction Osteosarcoma, the prevailing primary bone malignancy among children and adolescents, is frequently associated with treatment failure primarily due to its pronounced metastatic nature. Methods This study aimed to establish potential associations between hub genes and subtypes for the treatment of metastatic osteosarcoma. Differentially expressed genes were extracted from patients diagnosed with metastatic osteosarcoma and a control group of non-metastatic patients, using the publicly available gene expression profile ([33]GSE21257). The intersection of these gene sets was determined by focusing on endoplasmic reticulum (ER) stress-related genes sourced from the GeneCards database. We conducted various analytical techniques, including functional and pathway enrichment analysis, WGCNA analysis, protein-protein interaction (PPI) network construction, and assessment of immune cell infiltration, using the intersecting genes. Through this analysis, we identified potential hub genes. Results Osteosarcoma subtype models were developed using molecular consensus clustering analysis, followed by an examination of the associations between each subtype and hub genes. A total of 138 potential differentially expressed genes related to endoplasmic reticulum (ER) stress were identified. These genes were further investigated using Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Set Enrichment Analysis (GSEA) pathways. Additionally, the PPI interaction network revealed 38 interaction relationships among the top ten hub genes. The findings of the analysis revealed a strong correlation between the extent of immune cell infiltration and both osteosarcoma metastasis and the expression of hub genes. Notably, the differential expression of the top ten hub genes was observed in osteosarcoma clusters 1 and 4, signifying their significant association with the disease. Conclusion The identification of ten key genes linked to osteosarcoma metastasis and endoplasmic reticulum stress bears potential clinical significance. Additionally, exploring the molecular subtype of osteosarcoma has the capacity to guide clinical treatment decisions, necessitating further investigations and subsequent clinical validations. Keywords: Osteosarcoma, Metastasis, Endoplasmic reticulum stress, Subtype analysis 1. Introduction Osteosarcoma, a primary bone tumor prevalent among children and adolescents, exhibits an annual incidence of approximately 8–11/million [[34]1]. This malignancy is characterized by its highly aggressive nature, heterogeneity, and propensity for metastasis [[35]2], leading to a significant risk of disability and mortality following metastatic spread. Osteosarcoma commonly manifests in the vicinity of the metaphysis of long bones, including the distal femur, proximal tibia, and humerus [[36]3]. The precise etiology, progression, and metastasis of osteosarcoma have not been thoroughly investigated [[37]4,[38]5]. However, the elevated prevalence among adolescents and its predilection for specific anatomical sites suggest a correlation with accelerated bone tissue growth [[39]6]. Presently, the primary therapeutic approach for osteosarcoma entails preoperative chemotherapy, surgical excision, and postoperative chemotherapy. Despite extensive removal of macroscopic tumor tissue during surgical intervention, recurrence frequently occurs [[40]7]. The implementation of neoadjuvant chemotherapy has led to a substantial enhancement in the 5-year survival rate among individuals diagnosed with osteosarcoma [[41]8]. When neoadjuvant chemotherapy is combined with surgical resection, the 5-year survival rate for patients with non-metastatic osteosarcoma is approximately 60% [[42]9]. However, the survival rate drops significantly to only 20–30% for patients with metastatic osteosarcoma [[43][9], [44][10], [45][11]]. The high incidence of metastasis in osteosarcoma is the primary determinant of the unfavorable prognosis [[46]12]. Presently, there is no treatment available that significantly improves the 5-year survival rate for patients with metastatic osteosarcoma [[47]13]. The endoplasmic reticulum (ER) plays a crucial role in the processes of protein synthesis, folding, and structural maturation. ER stress, which occurs as a result of genetic or environmental factors, disrupts the ER's ability to properly modify, fold, and secrete proteins, leading to the accumulation of misfolded proteins within organelles [[48]14]. This phenomenon of ER stress is closely linked to the onset and progression of various diseases, including tumors, diabetes, and neurodegenerative disorders [[49][15], [50][16], [51][17]]. Unfavorable conditions within the tumor microenvironment, including nutritional deficiency, hypoxia, hypermetabolism, and oxidative stress, have been shown to disrupt protein folding by the endoplasmic reticulum (ER), leading to the persistent activation of “ER stress.” This phenomenon enhances the tumorigenic, metastatic, and drug-resistant properties of malignant cells [[52]14,[53]16,[54]18]. Osteosarcoma is characterized by extensive genetic variation, dysregulation of multiple signaling pathways, and genome instability. These factors are closely associated with the regulation of bone development, tumor microenvironment, genome homeostasis, cell cycle control, and cell signal transduction pathways [[55]16]. The correlation between ER stress-related genes and osteosarcoma metastasis has been established [[56]19], and the utilization of molecular subtype models is crucial for predicting tumor risk and evaluating prognosis [[57][20], [58][21], [59][22]]. Nevertheless, there is currently a lack of research on molecular subtype models specifically pertaining to ER stress and osteosarcoma metastasis. In this study, the author developed a molecular subtype model for osteosarcoma AMI. By analyzing the gene data from 14 metastatic osteosarcoma patients and 19 non-metastatic osteosarcoma patients, the study identified co-expressed differential genes associated with osteosarcoma metastasis and ER stress. Subsequently, a validation set comprising 10 key genes was established, confirming the robustness of the molecular subtype model for osteosarcoma. These key genes hold promise as potential targets for targeted therapeutic drugs and molecular markers for prognostic evaluation. 2. Materials and methods * 1 GEO data difference analysisC We performed an analysis on the osteosarcoma dataset. [60]GSE21257 comes from the Gene Expression Omnibus (GEO) database [[61]23]. [62]GSE21257 has 53 samples, including 14 samples of osteosarcoma metastases (metastases present at diagnosis) and 19 samples of non-metastases (no metastases). The chip platform is a [63]GPL10295 Illumina human-6 v2. 0 expression beadchip (using nuIDs as identifiers). We used the standardized data expression matrix downloaded from GEO to match, select, and delete gene data based on the samples. The selected samples were grouped according to metastasis and non-metastasis for subsequent analysis. For the preprocessed data set, we drew a boxplot using the R language's boxplot function to observe the data distribution. Expression difference P-values and expression fold change values were calculated using the R package limma (Version 3.42.2) [[64]24]. We selected genes with a P-value <0.05 and |log2FC| > 0.263 (1.2-fold relationship) as significantly differentially expressed mRNAs. The R package heatmap function (Version 1.0.12) [[65]25] was used to make differential heat maps and volcano maps for visualization. * 2 Molecular subtype construction We calculated the AMI molecular subtypes with the ConsensusClusterPlus package [[66]26] and the Rtsne package [[67]27]. The ggplot2 package was used for molecular subtype visualization. Correlations between hub genes and molecular subtypes were plotted with the gspubr package. * 3 Enrichment analysis of intersection genes The GeneCards database ([68]https://www.genecards.org/) [[69]28] provides annotated and predicted human genetic information. The database automatically integrated genetic data from approximately 150 network sources, including genomics, transcriptomics, proteomics, and genetics, as well as clinical and functional information. In this analysis, ER stress-related genes were downloaded from the GeneCards database using “endoplasmic reticulum stress” as the search key. We took the intersection of differential genes and ER stress-related human genes to create a Venn diagram, which was constructed using the R-package Venn diagram function (Version 1.6.20) [[70]29]. Gene Ontology (GO) [[71]30] describes our understanding of the field of biology based on three aspects, molecular function (MF), cellular components (CC), and biological processes (BP). GO enrichment analysis is typically used to explore the enrichment degree of GO terms associated with differentially expressed genes. Kyoto Encyclopedia of Genes and Genomes (KEGG) [[72]31] is a utility database resource for understanding advanced functions and biological systems (such as cells, organisms, and ecosystems) based on molecular-level information, especially genome sequencing and other high-throughput experimental techniques generated from large molecular datasets. KEGG was established in 1995 by Kanehisa's laboratory at the Center for Bioinformatics at Kyoto University, Japan. The analysis used in this study was based on the use of the R package clusterProfiler package (version 3.14.3) [[73]32] to perform GO function/pathway enrichment analysis on the intersection genes and the significance threshold was set to P ≤ 0.05. Bubble plots were created for visualization using the R package ggplot2 (Version 3.3.3) [[74]33]. The pathway network diagram was drawn to visualize the pathway relationship using the cytoscape plug-in ClueGo [[75]34]. We selected and downloaded the c2.cp.v7.2.symbols.gmt gene set data from the GSEA ([76]http://www.gsea-msigdb.org/gsea/index.jsp) [[77]35] database as the reference gene set and performed GSEA enrichment analysis on the two sets of data with the R clusterProfiler package. The GSEA statistical process was used to calculate the enrichment score, estimate the significance of the enrichment score, correct for multiple hypothesis testing, and select the enrichment results with P < 0.05 to draw the GSEA enrichment map. * 4 WGCNA analysis The gene co-expression network was constructed using the WGCNA R package [[78]36]. By calculating the Pearson correlation coefficient between two genes, we used the expression data to create a similarity matrix and chose an appropriate soft threshold β to make the constructed network more compatible with the standard of a scale-free network. We transformed the adjacency matrix into a topological overlap matrix TOM and used hierarchical clustering to generate a hierarchical clustering tree of genes. The correlations between genes and clinical information were calculated, and the significant associations of modules with ER-related DEGs were analyzed. * 5 PPI Interaction Network Construction The STRING (version 11.0, [79]http://www.string-db.org/) database [[80]37] was used to perform protein-protein interactions (PPI) on critical module genes from the WGCNA results and the ER stress-related differentially expressed genes. We selected Required Confidence (combined score) > 0.4 as the threshold value of the PPI relationship. Cytoscape software [[81]38] was used to construct the PPI network of interaction genes, and the top ten hub genes were extracted using the cytohbhs plug-in Ref. [[82]39]. * 6 Analysis of Immune Cell Infiltration CIBERSORT is based on the principle of linear support vector regression to deconvolve the transcriptome expression matrix and estimate the composition and abundance of immune cells in mixed cell populations [[83]40]. We used the gene expression matrix data to analyze the infiltration of 22 immune genes in the samples using the CIBERSORT algorithm. The samples with P < 0.05 were filtered to obtain the immune cell infiltration matrix. The corplot package [[84]41] was used to draw correlation heatmaps to visualize the correlations of the 22 immune cell infiltrations. The gepubr package (https:/CRAN.R-pnojiect.org/package-ggpuby) was used to draw violin plots to visualize the differences in the infiltration of the 22 immune cells. We plotted the correlations between the hub genes and immune cells. * 7 Statistical analysis All statistical analyses were performed using means ± standard deviation. The data were analyzed using R software (version 3.6.3). A P-value <0.05 was considered statistically significant, and all statistical tests were two-sided. The overall analysis protocol used in this study is shown in [85]Fig. 1. Fig. 1. [86]Fig. 1 [87]Open in a new tab Flow chart. 3. Results * 1 GEO data variance analysis The downloaded standardized data was preprocessed to obtain a boxplot of the data before and after preprocessing ([88]Fig. 2A and B). The results of the PCA analysis revealed specific differences between the metastatic and non-metastatic groups ([89]Fig. 2C). The [90]GSE21257 data set screened the expression data for 24,994 genes with |logFC|>0.263, p.adj<0.05 as the threshold. Finally, 1960 differentially expressed genes were identified, of which 1145 were up-regulated and 815 were down-regulated ([91]Fig. 2D and E). The ER stress-related genes were downloaded from the GeneCard database. Then, 138 potential ER stress-related differentially expressed genes were identified by their intersection with the previously detected differentially expressed genes from the [92]GSE21257 data set ([93]Fig. 2F). Fig. 2. [94]Fig. 2 [95]Open in a new tab Difference analysis (Pink is the metastatic group and green is the non-metastatic group): (A) Boxplot before [96]GSE21257 data processing. (B) Boxplots after [97]GSE21257 data processing. (C) PCA graph. (D)[98]GSE21257 dataset differential genes volcano plot. Red is up-regulated differential genes, blue is down-regulated differential genes, and grey is non-significant genes. (E) [99]GSE21257 datasetdifferential genes heat map. Blue is the metastatic group, pink is the non-metastatic group, green is low expression, and red is high expression. (F) Venn diagram of the intersection of endoplasmic reticulum stress-related genes and differential genes. 138 potential ER stress-related differentially expressed genes. (For interpretation of the references to color in this figure legend, the reader is referred