Abstract Background Osteosarcoma (OS) is a common malignant bone tumor originating in the interstitial tissues and occurring mostly in adolescents and young adults. Energy metabolism is a prerequisite for cancer cell growth, proliferation, invasion, and metastasis. However, the gene signatures associated with energy metabolism and their underlying molecular mechanisms that drive them are unknown. Methods Energy metabolism-related genes were obtained from the TARGET database. We applied the “NFM” algorithm to classify putative signature gene into subtypes based on energy metabolism. Key genes related to progression were identified by weighted co-expression network analysis (WGCNA). Based on least absolute shrinkage and selection operator (LASSO) Cox proportional regression hazards model analyses, a gene signature for the predication of OS progression and prognosis was established. Robustness and estimation evaluations and comparison against other models were used to evaluate the prognostic performance of our model. Results Two subtypes associated with energy metabolism was determined using the “NFM” algorithm, and significant modules related to energy metabolism were identified by WGCNA. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) suggested that the genes in the significant modules were enriched in kinase, immune metabolism processes, and metabolism-related pathways. We constructed a seven-gene signature consisting of SLC18B1, RBMXL1, DOK3, HS3ST2, ATP6V0D1, CCAR1, and C1QTNF1 to be used for OS progression and prognosis. Upregulation of CCAR1, and C1QTNF1 was associated with augmented OS risk, whereas, increases in the expression SCL18B1, RBMXL1, DOK3, HS3ST2, and ATP6VOD1 was correlated with a diminished risk of OS. We confirmed that the seven-gene signature was robust, and was superior to the earlier models evaluated; therefore, it may be used for timely OS diagnosis, treatment, and prognosis. Conclusions The seven-gene signature related to OS energy metabolism developed here could be used in the early diagnosis, treatment, and prognosis of OS. Keywords: Gene signature, Energy metabolism, Least absolute shrinkage and selection operator, Osteosarcoma, Prognosis biomarker, Weighted co-expressed network analysis Introduction Osteosarcoma (OS) is one of the most common malignant bone tumors. It appears mainly in adolescents and young adults (overall incidence: 0.3–0.4/100,000 individuals/year) [[35]1, [36]2]. As with other sarcomas, OS originates from the interstitial tissues. It is characterized by the formation of osteoid tissue and osteoid- and spindle-shaped stromal cells in immature bones [[37]3, [38]4]. The main clinical treatments for patients with OS include local surgery, chemotherapy, and radiotherapy. Substantial improvements have been made in the clinical response and survival rate of OS [[39]5]. About 70% of all patients with OS can be cured [[40]6]. However, OS prognosis remains poor especially among patients with metastatic disease or tumor recurrence. In these cases, the overall survival rate is only ~ 20% [[41]7]. Despite advances in surgical technique and targeted therapy, optimal treatment outcomes in OS are still negatively impacted by tumor immunity, infection, complications, and low survival rates. Thus, new predictive and prognostic approaches are urgently needed to improve survival in patients with OS. High-throughput technology, gene microarray chips, and large-scale RNA-seq transcriptome data have been widely used to identify prognostic genes for various cancers, elucidate oncogenic mechanisms, and improve cancer treatment [[42]8, [43]9]. Energy metabolism is an important marker of cancer cell metastasis and invasion [[44]10]. It enables tumor cells to generate ATP, maintain the redox balance, and sustain the macromolecular biosynthetic processed required for cell growth, proliferation, and migration [[45]11]. Recent empirical evidence has demonstrated that there is a metabolic symbiosis between glycolysis and oxidative phosphorylation (OXPHOS) in OS [[46]12]. Lactate and pyruvate formed in glycolysis may be transferred to the TCA cycle and used as intermediate substrates for ATP generation [[47]13, [48]14]. OS cells may also utilize the ketones released by adjacent cells during free fatty acid catabolism to liberate metabolic energy [[49]15]. Thus, a new energy metabolism-related gene signature could be invaluable in the prediction of OS metastasis and invasion. Weighted gene co-expression network analysis (WGCNA) is a systematic biological method that delineates correlations between genes and clinical traits [[50]16]. It identifies highly correlated genes to investigate potential biological functions [[51]17, [52]18]. Here, a new seven-gene signature associated with OS energy metabolism was established via a WGCNA algorithm and a least absolute shrinkage and selection operator (LASSO) Cox regression. This seven-gene signature was validated and confirmed to be superior to other predictive models of the same type. It served as a biomarker that was significantly associated with OS metastasis. Our findings may advance the study of OS prognosis, as they revealed that energy metabolism is a potential therapeutic control point in this disease. The research path taken in the present study is outlined in Fig. [53]1. Fig. 1. Fig. 1 [54]Open in a new tab Flow diagram of data preparation, processing, analysis, and validation in the present study Results Identification of the energy metabolism molecular subtypes in OS The “NFM” algorithm [[55]19] was used to identify the molecular subtypes of energy metabolism in OS samples with 587 identified energy metabolism-related genes [[56]20]. As shown in Fig. [57]2a, the energy metabolism molecular subtypes of OS were detected in the TARGET database using cophenetic, dispersion and silhouette algorithm indicators. The expression levels of the energy metabolism-related genes in the two molecular subtypes are shown in Fig. [58]2b. Patient mortality was significantly higher in the C1 group than the C2 group. We also determined that the prognosis for C1 was worse than that for C2. Figure [59]2c shows significant differences between these molecular subtypes (log-rank P = 0.032). Using the same parametric analysis as for the TARGET data, we ran the “NMF” algorithm on the GEO dataset [60]GSE21257 and found similarities between them. Both were partitioned into two groups. As shown in Fig. [61]3, there were significant differences between subtypes in terms of prognosis (log-rank P = 0.033). Fig. 2. [62]Fig. 2 [63]Open in a new tab Molecular subtype classification according to energy metabolism in TARGET Database. a Consensus map of NMF clustering. b Heatmap of energy metabolism-related gene expression by molecular subtype. c Prognosis survival curve by molecular subtype Fig. 3. [64]Fig. 3 [65]Open in a new tab Molecular subtype classification according to energy metabolism in [66]GSE21257. a Consensus map of NMF clustering. b Heatmap of energy metabolism-related gene expression by molecular subtype. c Prognosis survival curve of molecular subtype Detection of significant modules in the molecular subtypes WGCNA was performed on the genes in the TARGET database to screen for modules that were significantly associated with the energy metabolism molecular subtypes in OS [[67]21] (Fig. [68]4). There were no outliers in the sample clustering (Fig. [69]4a). A value of five was the lowest power for the 0.9 scale-free topology network index. It was screened out in order to plot a hierarchical clustering tree (Fig. [70]4b, c). Similar clusters were merged into new modules using the following settings: height = 0.5, deepSplit = 2, and minModuleSize = 80. Twenty-three modules with similar connected gene patterns were obtained (Fig. [71]4d). As shown in Fig. [72]4e, correlations of each module with patient gender, age, ethnicity, and Clusters1 and 2 were analyzed. The strongest association was found between the lightyellow module and Cluster 1, and between the pink module and Cluster 2. Therefore, the lightyellow (140 genes) and pink (352 genes) modules were selected for the subsequent analyses (Additional file [73]1: Table S1 and Additional file [74]2: Table S2). Fig. 4. [75]Fig. 4 [76]Open in a new tab Co-expression network analysis. a Sample clustering analysis. b, c Analysis of network topology showing that it met the scale-free topology threshold of 0.9 when β = 5. d Clustering dendrogram of genes based on topological overlap. e Heatmap displaying correlations and significant differences between gene modules and clinical phenotypes Gene Ontology (GO) and pathway enrichment analysis GO and KEGG functional enrichment analyses of the genes in the lightyellow and pink modules were performed using the “ClusterProfiler” package in R (Additional file [77]3: Table S3, Additional file [78]4: Table S4, Additional file [79]5: Table S5, Additional file [80]6: Table S6). As shown in Fig. [81]5a, the genes in the lightyellow module were enriched in 293 GO terms closely related to kinase activity. They were also enriched in the Ras and Rap1 signaling pathways and ECM-receptor interaction (Fig. [82]5b). The genes in the pink module were enriched in 567 GO terms and 263 KEGG terms. The top 20 GO terms (Fig. [83]5c) were related mainly to immune metabolism processes such as neutrophil-mediated immunity, and ATP hydrolysis-coupled proton transport. The KEGG analysis suggested that the top 20 significantly enriched pathways were related to tumorigenesis, including the mTOR and TNF signaling pathways (Fig. [84]5d). Fig. 5. [85]Fig. 5 [86]Open in a new tab Functional enrichment analysis of lightyellow and pink modules. a Top 20 GO terms in lightyellow module. b Top 20 KEGG terms in lightyellow module. c Top 20 GO terms in pink module. d Top 20 KEGG terms in pink module LASSO Cox regression and energy metabolism signature A univariate Cox proportional hazard regression model was applied to screen for significant differences in the final prognoses. Sixty-five significant genes were obtained of which the top 20 are listed in Table [87]1. LASSO Cox regression was performed on all 65 significant OS-related genes. As shown in Fig. [88]6a, the trajectory of each independent variable was analyzed. As λ increased, the number of independent coefficients tended to decline towards zero. A threefold cross-validation was run to build the model and the confidence interval under each λ was analyzed. The model was optimal when λ = 0.101608. Thirteen genes with λ = 0.101608 were chosen as the final targets (Fig. [89]6b) and subjected to multivariate Cox survival analysis. Seven genes with the lowest AUC (156.28) were retained and integrated into the final model (Fig. [90]6c; Table [91]2). The RiskScore was calculated for the seven-gene signature as follows: [MATH: RiskScore7=-1.2514×< mtext>expSLC18B1-1.476×expRBM XL1+1.5145×expC1QTN F1-1.3606×expDOK3+1.1638×expCCAR1 -0.7127×expHS3ST2 -2.186×expATP6V0D 1. :MATH] Table 1. Top 20 significant lncRNA screened by univariate Cox regression analysis Symbol P-value HR Low 95% CI High 95% CI NUBP1 1.32E−05 0.256 0.138 0.472 PDZD11 0.00023 0.274 0.138 0.546 ATP6V0D1 0.00037 0.380 0.223 0.648 NTAN1 0.00064 0.391 0.228 0.670 TMEM8A 0.00136 0.372 0.203 0.681 FOLR2 0.00176 0.664 0.513 0.858 GRN 0.00247 0.565 0.390 0.818 BLVRA 0.00347 0.429 0.243 0.757 NPC2 0.00387 0.582 0.403 0.840 RBMXL1 0.00398 0.372 0.190 0.729 DOK3 0.00414 0.532 0.346 0.819 APBB1IP 0.00427 0.615 0.440 0.858 GPX1 0.00497 0.547 0.359 0.833 FCER1G 0.00673 0.684 0.519 0.900 TWF2 0.00675 0.459 0.261 0.806 PLEKHO2 0.00729 0.464 0.265 0.813 CHCHD10 0.00817 0.596 0.406 0.875 HS3ST2 0.00819 0.541 0.343 0.853 TAGLN 0.00884 0.650 0.470 0.897 SELPLG 0.00933 0.635 0.451 0.894 [92]Open in a new tab Fig. 6. Fig. 6 [93]Open in a new tab Constructing seven-gene-based classifier by LASSO Cox regression model. a Trajectory of each independent variable. Horizontal axis represents log of independent variable λ. Vertical axis represents coefficient of independent variable. b Three fold cross-validation of tuning parameter in LASSO model Table 2. Information for seven-gene signature screened by LASSO Cox regression Symbol Coef P value HR Low 95% CI High 95% CI SLC18B1 − 1.2514 0.13982 0.2861 0.05433 1.5065 RBMXL1 − 1.476 0.09257 0.2286 0.04093 1.2762 C1QTNF1 1.5145 4.08E−05 4.5473 2.20564 9.3749 DOK3 − 1.3606 0.03348 0.2565 0.07318 0.8991 CCAR1 1.1638 0.1236 3.2021 0.72794 14.0852 HS3ST2 − 0.7127 0.13179 0.4903 0.19404 1.2389 ATP6V0D1 − 2.186 0.00587 0.1124 0.02373 0.5322 [94]Open in a new tab The scoring formula for each sample is the sum of the aforementioned gene expression value multiplied by the ordinal. We set the optimal threshold calculated from the 5-year AUC as the classification effect in the TARGET training set. As shown in Fig. [95]7a, the death sample survival time significantly decreased with increasing patient RiskScore. Most of the individuals who died were in the high-risk group. The expression levels of CCAR1 and C1QTNF1 increased with risk value. Thus, upregulation of either of these genes is associated with high risk of death. In contrast, the expression levels of SCL18B1, RBMXL1, DOK3, HS3ST2, and ATP6VOD1 decreased with increasing risk value. Therefore, upregulation of these genes is correlated with low risk and all five of them are protective. Figure [96]7b shows the ROC curve for all seven genes. The AUC was > 0.74. As shown in Fig. [97]7c, 36 patients were classified into the low-risk group and 40 were assigned to the high-risk group. There was a significant difference between these two groups (log-rank P < 0.001; HR = 19.87). Fig. 7. [98]Fig. 7 [99]Open in a new tab Evaluation of robustness of seven-gene signature in TARGET training datasets. a RiskScore, survival time and status, and expression of seven-gene signature. b ROC curve of seven-gene signature. c K–M prognosis curve of seven-gene signature Validation of gene signature robustness To validate the robustness of the gene signature, we calculated the RiskScore for the expression level of each sample in the validation set. The RiskScore distribution is shown in Fig. [100]8a. The expression level of the seven-gene signature increased with risk value. These findings are consistent with the training set in which the expression levels of CCAR1 and C1QTNF1 also increased with risk value. Therefore, high expression and high risk were associated with these two genes and both were risk factors. In contrast, the expression levels of SLC18B1, RBMXL1, DOK3, HS3ST2, and ATP6V0D1 decreased with increasing risk value. There was a correlation between high expression and low risk for these five genes and all of them were protective factors. ROC analysis of the prognostic RiskScore classification was performed using the “timeROC” package in R. As shown in Fig. [101]8b, the model had a high AUC value (0.87). Figure [102]8c suggested that according to classification of all TARGET sets, 38 patients were scored as low-risk and 46 patients were rated as high-risk. There were significant differences between these risk groups (log-rank P < 0.0001; HR = 17.92). The external GEO dataset [103]GSE21257 was analyzed by the same method as above to verify the model robustness. It generated the same results as the TARGET validation and training sets, namely, CCAR1 and C1QTNF1 were risk factors, whereas SLC18B1, RBMXL1, DOK3, HS3ST2, and ATP6V0D1 were protective factors (Fig. [104]9a). Figure [105]9b depicts that the model had a high AUC value (0.73). Based on the [106]GSE21257 data, 37 patients were classified as low-risk and 16 patients were classified as high-risk, and again the difference between the groups was significant (log-rank P = 0.011; HR = 2.84) (Fig. [107]9c). The foregoing results indicate that the seven-gene signature was highly robust. In addition, the external dataset [108]GSE16091 validated results suggested this seven-gene signature had a high AUC value (0.7); 10 patients and 24 patients were classified into low-risk and high-risk groups, respectively (log-rank P = 0.02664; HR = 3.218) (Fig. [109]10). Fig. 8. [110]Fig. 8 [111]Open in a new tab Evaluation of robustness of seven-gene signature in TARGET validation datasets. a RiskScore, survival time and status, and the expression of seven-gene signature. b ROC curve of seven-gene signature. c K–M prognosis curve of seven-gene signature Fig. 9. [112]Fig. 9 [113]Open in a new tab Evaluation of robustness of seven-gene signature in GEO validation. a RiskScore, survival time and status, and the expression of seven-gene signature. b ROC curve of seven-gene signature. c K–M prognosis curve of seven-gene signature Fig. 10. Fig. 10 [114]Open in a new tab Evaluation of robustness of seven-gene signature in [115]GSE16091. a ROC curve of seven-gene signature. b K–M prognosis curve of seven-gene signature Gene signature model evaluation To identify the relationship between the RiskScore for the seven-gene signature and the immune and matrix scores, the ImmuneScore, StromalScore, and tumor purity were calculated separately for each sample [[116]22]. There were significant differences between the high-risk and low-risk samples in the TARGET training set in terms of ImmuneScore, StromalScore, and tumor purity. Similar results were obtained for the [117]GSE21257 dataset (Fig. [118]11). To confirm the independence of the seven-gene signature model in a clinical setting, we systematically analyzed the clinical information in the TARGET, TARGET training, and [119]GSE21257 validation datasets including age, gender, metastasis, and seven-gene signature model grouping information (Table [120]3). One-way Cox regression analysis of the TARGET training set revealed that the high-risk group and metastasis were significantly associated with survival. The corresponding multivariate Cox regression analysis revealed that only the high-risk group (HR = 28.89; 95% CI 6.25–133.4; P = 1.63E^−05), age, and metastasis were significantly associated with survival. A one-way Cox regression analysis of the TARGET set demonstrated that the high-risk group and metastasis were significantly associated with survival. Similarly, the multivariate Cox regression analysis showed that only the high-risk group (HR = 21.07; 95% CI 4.87–91.21; P = 4.5E^−05) and metastasis were significantly associated with survival. One-way and multivariate Cox regression analysis of the [121]GSE21257 validation set indicated that the high-risk group was significantly associated with survival. The aforementioned analyses confirmed that the seven-gene signature is an independent standalone prognostic indicator with good predictive performance and clinical application utility. Fig. 11. [122]Fig. 11 [123]Open in a new tab Estimation of seven-gene signature. a Distributions of StromalScore in high and low-risk groups in TARGET training dataset. b Distributions of ImmuneScore in high and low-risk groups in TARGET training dataset. c Distributions of ESTIMATEScore in high and low-risk groups in TARGET training dataset. d Distributions of StromalScore in high and low-risk groups in [124]GSE21257 training dataset. e Distributions of ImmuneScore in high and low-risk groups in [125]GSE21257 training dataset. f Distributions of the ESTIMATEScore in high and low-risk groups in [126]GSE21257 training dataset Table 3. Univariate and multivariate Cox regression analyses, prognosis-related clinical factors, and clinical independence Variables Univariate analysis Multivariable analysis HR 95% CI of HR P value HR 95% CI of HR P value TARGET training datasets 7-gene risk score Risk score (high/low) 19.87 4.67–84.42 3.11E−04 28.89 6.25–133.4 1.63E−05 Age 0.99 0.90–1.09 0.83 1.12 1.00–1.24 0.048 Gender (male/female) 0.57 0.25–1.27 0.17 0.57 0.24–1.33 0.19 Metastatic vs non-metastatic 4.18 1.92–9.10 0.0003 5.34 2.21–12.95 0.0002 TARGET datasets 7-gene risk score Risk score (high/low) 17.29 4.10–72.96 1.04E−04 21.07 4.87–91.21 4.54E−05 Age 0.99 0.91–1.07 0.81 1.08 0.98––1.19 0.098 Gender (male/female) 0.76 0.36–1.60 0.47 0.76 0.35–1.64 0.48 Metastatic vs Non-metastatic 4.74 2.27–9.89 3.42E−05 5.69 2.54–12.76 2.36E−05 [127]GSE21257 validation datasets 7-gene risk score Risk score (high/low) 2.84 1.23–6.55 0.014 2.76 1.18–6.46 0.0186 Age 1.0007 0.99–1.003 0.602 1 0.99–1.003 0.787 Gender (male/female) 1.4 0.58–3.34 0.44 1.39 0.56–3.41 0.468 [128]Open in a new tab Gene set enrichment analysis (GSEA) GSEA was performed on the significantly enriched pathways in the high- and low-risk groups in the TARGET set. The enriched pathway selection threshold was P < 0.05, and significantly enriched pathways are listed in Table [129]4. The high-risk group was mainly associated with metabolic pathways including NITROGEN METABOLISM and LINOLEIC ACID METABOLISM, whereas the low-risk group was enriched mainly in receptor-related pathways, such as B CELL RECEPTOR SIGNALING, CHEMOKINE SIGNALING, and TOLL LIKE RECEPTOR SIGNALING. Table 4. KEGG pathways significantly enriched in high and low-risk groups Name Size ES NES NOM FDR FWER P-val P-val P-val KEGG_NITROGEN_METABOLISM 23 0.497 1.539 0.037 0.470 0.8 KEGG_LINOLEIC_ACID_METABOL ISM 29 0.573 1.572 0.044 0.745 0.753 KEGG_PEROXISOME 78 − 0.483 − 1.871 0.002 0.151 0.176 KEGG_CHEMOKINE_SIGNALING_ PATHWAY 185 − 0.443 − 1.785 0.006 0.144 0.32 KEGG_FC_GAMMA_R_MEDIATED_GOCYTOSIS 95 − 0.479 − 1.821 0.006 0.119 0.244 KEGG_LEUKOCYTE_TRANSENDO THELIAL_MIGRATION 115 − 0.475 − 1.850 0.008 0.109 0.195 KEGG_PATHOGENIC_ESCHERICHI A_COLI_INFECTION 55 − 0.521 − 1.883 0.010 0.200 0.162 KEGG_LYSOSOME 119 − 0.568 − 1.956 0.010 0.167 0.081 KEGG_LEISHMANIA_INFECTION 69 − 0.608 − 1.866 0.012 0.118 0.178 KEGG_VIRAL_MYOCARDITIS 68 − 0.557 − 1.757 0.012 0.166 0.384 KEGG_ENDOCYTOSIS 178 − 0.378 − 1.581 0.017 0.317 0.729 KEGG_SNARE_INTERACTIONS_INICULAR_TRANSPORT 38 − 0.496 − 1.630 0.022 0.295 0.631 KEGG_GLUTATHIONE_METABOLI SM 48 − 0.455 − 1.648 0.025 0.283 0.605 KEGG_TOLL_LIKE_RECEPTOR_SI GNALING_PATHWAY 101 − 0.411 − 1.567 0.032 0.329 0.752 KEGG_ANTIGEN_PROCESSING_A ND_PRESENTATION 79 − 0.518 − 1.661 0.033 0.282 0.575 KEGG_COMPLEMENT_AND_COAG ULATION_CASCADES 69 − 0.457 − 1.590 0.037 0.318 0.715 KEGG_B_CELL_RECEPTOR_SIGNA LING_PATHWAY 74 − 0.427 − 1.591 0.037 0.339 0.714 KEGG_NOD_LIKE_RECEPTOR_SIG NALING_PATHWAY 61 − 0.454 − 1.598 0.037 0.351 0.703 G_ACUTE_MYELOID_LEUKEMA 57 − 0.445 − 1.548 0.041 0.316 0.782 [130]Open in a new tab Comparative study of other risk models We compared published OS-related models with that used in the present study. We determined the differences between the high- and low-risk groups in terms of OS prognosis. As shown in Fig. [131]12a, b, the ROC and K–M curves (PMID: 31333788) indicated that the 3-year AUC was 0.75 (P = 0.00024). Moreover, the OS prognosis for the four-pseudogene signature (PMID: 31146489) demonstrated that the 3-year AUC value was 0.93 (P < 0.0001) (Fig. [132]11c, d). In addition, the ROC and K–M curves in PMID: 31090103 and PMID: 30604867 suggested the 3-year AUC were 0.81(P = 0.00088) and 0.81 (P = 0.00594), respectively. Comprehensive comparative studies disclosed that the seven-gene signature in the present study was superior to the eight-gene signature (PMID: 31333788) and had predictive power similar to that of the four-pseudogene signature (PMID: 31146489). Thus, the model constructed in the present study has superior performance to the previously published models against which it was compared. Fig. 12. [133]Fig. 12 [134]Open in a new tab Comparative analysis of other models. a AUC curve of eight-gene signature (PMID: 31333788) in TARGET training dataset. b K–M curve of eight-gene signature (PMID: 31333788) in TARGET training dataset. c AUC curve of four-pseudogene signature (PMID: 31146489) in TARGET training dataset. d K–M curve of four-pseudogene signature (PMID: 31146489) in TARGET training dataset. e AUC curve of ten-gene signature (PMID: 31090103) in TARGET training dataset. f K–M curve of ten-gene signature (PMID: 31090103) in TARGET training dataset. g AUC curve of nineteen-pseudogene signature (PMID: 30604867) in TARGET training dataset. h K–M curve of 19-pseudogene signature (PMID: 30604867) in TARGET training dataset Discussion OS is a bone tumor that occurs mainly in teenagers and young adults. It originates from primitive transformed mesenchymal cells. Energy metabolism is an important indicator of osteosarcoma cell proliferation, metastasis, and invasion. In the present study, a comprehensive bioinformatics analysis identified a seven-gene signature associated with OS energy metabolism. This gene signature was significantly associated with OS progression and prognosis. The “NMF” algorithm was run to partition the samples into glycolysis and OXPHOS energy metabolism categories. Recent investigations found that the relationship between glycolysis and OXPHOS is cooperative and competitive [[135]23]. As OXPHOS is attenuated, glycolysis may be enhanced to increases energy generation. When OXPHOS function is in equilibrium, it regulates glycolysis and maintains the energy balance [[136]24]. Fantin et al. [[137]25] reported that when LDH-A was inhibited in cancer cells, OXPHOS was enhanced to compensate for glycolysis suppression and ATP reduction. Pacheco-Velazquez et al. [[138]26] proposed that MCF-7 cells are equally dependent on OXPHOS and glycolysis for ATP generation. WGCNA is a systematic biological algorithm revealing the associations between genes and clinical phenotypes. It has been widely used to screen for diagnostic and prognostic biomarkers of Alzheimer’s disease, breast cancer, osteoarthritis, and Dupuytren’s contracture. In the present study, WGCNA identified 23 modules of which the lightyellow and pink were highly associated with molecular subtypes related to OS energy metabolism. A GO analysis disclosed that the genes in these two modules were enriched mainly for protein kinase A, DNA metabolism, and Wnt protein-binding pathways and the transcription coactivator, kinase regulator, ATPase, and proton transmembrane transport activity pathways. Kinases catalyze the transfer of phosphate groups from high-energy, phosphate-donating molecules to specific substrates. Takahashi et al. [[139]27] used siRNA and the small molecule inhibitor CX-4945 to show that upregulation of casein kinase 2 (CK2) was important for the growth of human osteosarcoma cells. Zhu et al. [[140]28] reported that the checkpoint kinase inhibitor AZD7762 promoted apoptosis and mitotic catastrophe in osteosarcoma cells. ATPase decomposes ATP into ADP and free phosphate ion and releases energy. Meszaros et al. [[141]29] found that Ca^2+-ATPase inhibitors suppress ATP- and thrombin-Ca^2+ in OS cells. KEGG enrichment analysis showed that the genes in the lightyellow and pink modules participated in osteoclast differentiation, the mTOR, TNF, and Ras signaling pathways, and the focal adhesion-related metabolic pathways. Osteoclast differentiation is involved in bone formation and originates in mesenchymal stem cells. Gobin et al. [[142]30] suggested that the PIDK inhibitor BYL719 inhibited and promoted osteoclast differentiation. The mTOR pathway is the entry point for OS treatment. It regulates cell growth, increases cell proliferation, and suppresses autophagy [[143]31]. Perry et al. [[144]32] used complementary genomics to identify OS-related genomic events. They found that inhibition of the mTOR pathway could be exploited for OS therapy. Focal adhesion is a key regulator of multi-cellular signaling pathways in cell proliferation, cycle regression, migration, apoptosis, and survival [[145]33]. Hu et al. [[146]34] assessed the effects of the small-molecule focal adhesion inhibitor PF562271 on OS cells and showed that it diminished OS cell volume, weight, and angiogenesis and concluded that inhibition of focal adhesion could therefore be a target for OS treatment. As the lightyellow and pink modules were screened by WGCNA, a systematic LASSO Cox regression estimation was performed, which compresses the coefficients and conserves the original data. The LASSO Cox regression screened seven genes that were then used to construct a seven-gene signature with the lowest AUC value. This seven-gene signature was validated by robustness and estimation evaluations, and by comparing it to other models from independent external datasets with the TARGET training dataset. These analyses revealed the clinical significance of the signature in CCAR1, and C1QTNF1 (risk factor), and in SLC18B1, RBMXL1, DOK3, HS3ST2, and ATP6V0D1 (protective factors). Metastasis and invasion information for clinical biopsies or surgical samples may be obtained by quantifying the expression level of each gene in this seven-gene signature. Genes within this signature may participate in cancer development and progression. CCAR1 (cell division cycle and apoptosis regulator protein 1) is an intermediate in regulatory transduction. It is involved in transcriptional regulation, apoptosis, autophagy, and cell progression and/or proliferation. CCAR1 is a biomarker of hepatocellular carcinoma [[147]35]. RBMXL1 (RNA binding motif protein) encodes a splicing protein that suppresses tumor proliferation, and promotes apoptosis in gastric and breast cancer [[148]36, [149]37]. DOK3 (downstream of tyrosine kinase 3) is an adaptor protein that plays a vital role in the negative feedback regulation of PTK-mediated signaling loops and suppresses the cellular proliferation [[150]38]. HS3ST2 (heparan sulfate-glucosamine 3-sulfotransferase 2) encodes an enzyme that [[151]39] participates in cell proliferation, apoptosis, autophagy, and other processes associated with cancer [[152]40]. ATP6V0D1 (V-type proton ATPase subunit d1) acidifies various intracellular compartments in cancer cells and generates transport process energy [[153]41]. Han et al. [[154]42] suggested that C1QTNF1 (complement C1q tumor necrosis factor-related protein 1) acts on the miR-221-3p/SOCS3 axis to modulate the JAK/STAT signaling pathway and alter HCC cell behavior and tumor proliferation. SLC18B1 (MFS-type transporter SLC18B1) has not yet been functionally linked to cancer, but we suggest it could be a viable target for OS treatment. In conclusion, we used comprehensive bioinformatics techniques to define a novel and effective biomarker for the prediction of the clinical progression, development, invasion, and metastasis of OS. WGCNA and LASSO Cox regression analyses were performed to screen for a seven-gene signature comprised of SLC18B1, RBMXL1, DOK3, HS3ST2, ATP6V0D1, CCAR1, and C1QTNF1 and associated with OS energy metabolism. Validations were performed on independent external datasets and revealed that the seven-gene signature was superior to the other models evaluated. Here, a seven-gene signature related to OS energy metabolism was constructed to forecast the outcome of this cancer. Moreover, this signature contains genes that provide potential targets for innovative and effective OS therapeutic strategies. Methods Data collection and processing Human metabolic pathways were downloaded from the Reactome Database ([155]http://reactome.org/) [[156]43]. A total of 587 genes were obtained from 11 metabolic-related pathways (Table [157]5). RNA-seq expression and clinical follow-up data for osteosarcoma were obtained from the public database TARGET ([158]https://ocg.cancer.gov/) [[159]44]. There were 274 patients with clinical information and ~ 101 patients with RNA-seq data. TPM data gene length and sequencing depth (M scale RNA-seq data), were applied in this study. The following steps were performed on the 101 patients with RNA-seq data. (1) Samples lacking clinical information and/or DFS < 30 were discarded. (2) Normal tissue sample data were removed. (3) Genes with FPKM values of 0 in 50% or more of the samples were eliminated. (4) The expression profile of the genes involved in energy metabolism were retained. The gene expression profile of [160]GSE21257 and its related clinical data were downloaded from the Gene Expression Omnibus (GEO) database ([161]http://www.ncbi.nlm.nih.gov/geo/), and we also included the data from pre-chemotherapy biopsies of 53 patients with OS. The following steps were performed on the [162]GSE21257 data [[163]45]. (1) The normal tissue sample data were removed. (2) The chip probe was mapped to the human genome using the Bioconductor package in R. (3) The expression profiles of the energy metabolism-related genes were retained (Table [164]6). Table 5. Energy metabolism-related pathways in the Reactome Database Metabolic pathways from Reactome PathwayID Gene count Biological oxidations R-HSA-211859 216 Metabolism of carbohydrates R-HSA-71387 290 Mitochondrial fatty acid beta-oxidation R-HSA-77289 37 Glycogen synthesis R-HSA-3322077 16 Glycogen metabolism R-HSA-8982491 27 Glucose metabolism R-HSA-70326 90 Glycogen breakdown (glycogenolysis) R-HSA-70221 15 Glycolysis R-HSA-70171 71 Pyruvate metabolism R-HSA-70268 31 Pyruvate metabolism and Citric Acid (TCA) cycle R-HSA-71406 55 Citric acid cycle (TCA cycle) R-HSA-71403 22 Sum 11 871 [165]Open in a new tab Table 6. Clinical information for pre-processed dataset Characteristic TARGET training datasets (n = 76) TARGET all datasets (n = 84) [166]GSE21257 (n = 53) Age (years) ≤ 18 59 66 35 > 18 17 18 18 Survival status Living 50 55 30 Dead 26 29 23 Gender Female 33 37 19 Male 43 47 34 Metastatic Metastatic 19 21 34 Non-metastatic 47 53 19 [167]Open in a new tab Identification of energy metabolism molecular subtypes in OS The OS samples were clustered with 587 energy metabolism-related genes using a non-negative matrix clustering algorithm (NMF) [[168]46]. The “NMF” package in R was applied using the standard “burnet” for 50 iterations, setting the k cluster range from 2 to 10, determining the average contour with the common member matrix, and setting the minimum number of subclass members to ten. Based on the cophenetic correlation, dispersion, silhouette, and other indicators, the optimal cluster number was established at two [[169]47]. Constructing a dynamic weighted gene co-expression network In the present study, a WGCNA co-expression algorithm was run to mine co-expressed coding genes and co-expression modules. The expression profiles for the coding genes were extracted from the TPM data in the TARGET database. The samples were clustered hierarchically to eliminate outliers. The “WGCNA” package in R was used to construct a weighted co-expression network [[170]48], which was aligned with the scale-free network. The log(k) of the node with connection degree k was negatively correlated with the log(P(k)) of the probability of the occurrence of the node. The correlation coefficient setting was > 0.9. The co-expression modules were screened out. The gene modules were detected by average-linkage hierarchical clustering based on a topological overlap matrix (TOM)-based dissimilarity measure (1-TOM) [[171]49]. Here, module size 80, height 0.5, deepSplit 2, and certain highly similar modules were merged. After associating the modules with clinical traits, those with the highest Pearson’s correlation coefficients were regarded as significant for the subsequent analyses. Gene Ontology (GO) and pathway enrichment analyses The Gene Ontology (GO) database ([172]http://geneontology.org/) [including biological process (BP), cell component (CC), and molecular function (MF) terms] was used to identify biological mechanisms based on high-throughput genome or transcriptome data [[173]50]. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database ([174]http://www.kegg.jp/) [[175]51] served to identify the systematic functions and biological relevance of candidate targets. In the present study, the “clusterProfiler” package in R ([176]http://bioconductor.org/packages/clusterProfiler) [[177]52] was run to conduct GO and KEGG pathway enrichment analyses of the genes in the significant modules and identify the underlying biological mechanisms. LASSO Cox regression and energy metabolism signature An energy metabolism signature was constructed based on gene expression levels and associations with energy metabolism molecular subtypes. About 90% of the 84 samples from the TARGET Database were randomly selected as the training set for the signature model. The information for all samples in training set is shown in Table [178]6. The “survfit coxph function” package [[179]53] in R was run to generate a univariate Cox proportional hazard regression model and expanded for the genes in significant modules and for the survival data. P < 0.05 was the threshold. Least absolute shrinkage and selection operator (LASSO) estimates compression [[180]54]. It shrinks the subset by compressing the original coefficients. It has been broadly applied to the survival analysis of high-dimensional data. In the current study, a LASSO Cox regression model was built to target the genes significantly associated with energy metabolism. A threefold cross-validation was performed for tuning parameter selection. The calculated partial deviance met the minimum criteria. A multivariate Cox survival analysis was run on significant genes. Those retained had the lowest the area under the curve (AUC) value and comprised the final gene signature for energy metabolism. To plot receiver operating characteristic (ROC) curves and compute the AUC, we ran the “pROC” package in R. Validation of gene signature robustness To validate gene signature model robustness, the training set model and coefficient were applied to all genes in the TARGET Database and the external GEO dataset [181]GSE21257 and [182]GSE16091 [[183]55]. The “RiskScore” was calculated for each sample based on the gene expression levels in the validation set samples (TARGET database, [184]GSE21257, and [185]GSE16091). The “timeROC” package in R ([186]http://cran.r-project.org/web/packages/timeROC) [[187]56] was run to analyze the ROC of the “RiskScore” for prognostic classification. Survival analyses of the data from the TARGET database and external GEO dataset [188]GSE21257 and [189]GSE16091 were performed with the “survival” package in R ([190]http://bioconductor.org/packages/) [[191]57]. Univariate survival was estimated by the Kaplan–Meier univariate survival method. P < 0.001 was considered significant. Gene signature model evaluation To identify the relationship between the RiskScore of the gene signature model and the immune and matrix scores, the “estimate” package in R calculated the immune and stromal scores and the tumor purity for each sample. To identify the independence of the gene signature model in clinical applications, the relevant HR and 95% CI were evaluated via the signal factor and multivariate Cox regression in the TARGET training and validation sets and [192]GSE21257. Gene set enrichment analysis (GSEA) GSEA ([193]http://software.broadinstitute.org/gsea/index.jsp) [[194]58] explored the biological functions of the gene signature based on the energy metabolism status (high-risk vs. low-risk groups). The annotated gene set c2.cp.kegg.v6.0.symbols was selected as a reference. Gene size ≥ 10, P < 0.05, and |Enrichment Score (ES)| > 0.40 were set as the cut-off criteria. Comparative study of other risk models Pertinent references were searched, and two related published risk