Abstract Breast cancer (BC) is currently one of the deadliest tumors worldwide. Cancer stem cells (CSCs) are a small group of tumor cells with self-renewal and differentiation abilities and high treatment resistance. One of the reasons for treatment failures is the inability to completely eliminate tumor stem cells. By using the edgeR package, we identified stemness-related differentially expressed genes in [38]GSE69280. Via Lasso-penalized Cox regression analysis and univariate Cox regression analysis, survival genes were screened out to construct a prognostic model. Via nomograms and ROC curves, we verified the accuracy of the prognostic model. We selected 4 genes (PSMB9, CXCL13, NPR3, and CDKN2C) to establish a prognostic model from TCGA data and a validation model from [39]GSE24450 data. We found that the low-risk score group had better OS than the high-risk score group, whether using TCGA or [40]GSE24450 data. A prognostic model including four stemness-related genes was constructed in our study to determine targets of breast cancer stem cells (BCSCs) and improve the treatment effect. Subject terms: Cancer, Computational biology and bioinformatics, Stem cells, Oncology Introduction Breast cancer (BC) is one of the most common tumors in females. In China, the number of new cases is 272,400, and the death toll is 69,500 every year^[41]1. According to immunohistochemical analysis, BC can be divided into luminal-type, Her2-positive and triple-negative breast cancer (TNBC), of which TNBC has the worst prognosis^[42]2,[43]3. With improved treatment, the mortality rate of BC is decreasing year by year^[44]4,[45]5, but 70% of patients have recurrence and metastasis within 5 years^[46]6. There is a small group of stem-like cells in tumors called cancer stem cells (CSCs). CSCs have the characteristics of self-renewal and differentiation abilities and high drug resistance^[47]7–[48]11. Previous studies have indicated that this portion of breast cancer stem cells (BCSCs) is identified by cell surface markers, such as CD44, CD24, CD133 and ALDH^[49]12,[50]13. With changes in the tumor microenvironment, BC cells can differentiate into tumor stem-like cells^[51]14,[52]15. In BC-resistant cell lines and tissues, the CSC population is significantly increased by chemotherapy^[53]16. Compared with other types of BC, TNBC has the highest expression of stem cell markers, which may be one of the reasons for TNBC having the worst prognosis^[54]15,[55]17. Previous studies have shown that stemness-related-gene expression can be used as a predictive biomarker for breast cancer patients. Akbar et al. identified a novel gene list (CNCL) that can discern the stemness and EMT phenotypic statuses of breast cancer, thereby tracking tumor cells and altering the response to tumor treatment^[56]18. Treatment for BCSCs has already emerged but is still immature. In our article, we hope to identify multiple stemness-related genes for determining BC prognosis by establishing a prognostic model. These genes may be potential targets for treating breast cancer, which may improve patient survival. Result Selected stemness-related differentially expressed genes (DEGs) Via edgeR, we identified 599 stemness-related DEGs in [57]GSE69280; among them, 255 genes were upregulated, and 344 genes were downregulated, with thresholds of |log[2] FC|> 1.0 and P < 0.05 (Fig .[58]S2). Identification of stemness-related DEGs in the TCGA BC database Coexpressed genes were obtained by intersection of TCGA and [59]GSE24450 data. We obtained 566 genes by intersection the stemness-related DEGs list and TCGA data. By Using edgeR, we identified 106 stemness-related DEGs in TCGA BC patients; among them, 54 genes were downregulated, and 52 genes were upregulated, with thresholds of |log[2] FC|> 1.0 and an adjusted P < 0.05 (Fig. [60]1A,B). Figure 1. [61]Figure 1 [62]Open in a new tab The stemness-related differentially expressed genes of breast cancer patients. (A) Heatmap and (B) Volcano plot. Construction of the stemness-related-gene prognostic model By using univariate Cox regression analysis, we obtained the survival-associated genes shown in Fig. [63]2. Lasso-penalized Cox regression was performed to identify the genes in the prognostic model. We constructed a prognostic model and used [64]GSE24450 to build a validation model. In TCGA prognostic model the expression of genes for each patient is shown in Fig. [65]3A, the distribution of different risk scores is shown in Fig. [66]3B, the distribution of different survival statuses (years) of TCGA patients is shown in Fig. [67]3C. In [68]GSE24450 validation model the expression of genes for each patient is shown in Fig. [69]3D, the distribution of different risk scores is shown in Fig. [70]3E, the distribution of different survival statuses (years) of [71]GSE24450 patients is shown in Fig. [72]3F. The risk score for the prognostic gene signature was calculated as follows: risk score = (expression level of PSMB9 × − 0.01623) + (expression level of CXCL13 × − 0.00335) + (expression level of NPR3 × 0.05481) + (expression level of CDKN2C × − 0.04691). Figure 2. Figure 2 [73]Open in a new tab The survival-associated stemness-related differentially expressed genes of breast cancer patients. Figure 3. [74]Figure 3 [75]Open in a new tab Establishment of the stemness-related prognostic model. (A) Heatmap of four genes in the TCGA model. (B) Rank of risk score and distribution of groups in the TCGA data. (C) Survival status of TCGA BC patients in different groups. (D) Heatmap of four genes in the [76]GSE24450 model. (E) Rank of risk score and distribution of groups in the [77]GSE24450 data. (F) Survival status of [78]GSE24450 BC patients in different groups. We classified patients into low- and high-risk score groups based on the median risk score as the cut-off. Survival was analyzed by a Kaplan–Meier (KM) curve, and the low-risk-score group had better overall survival (OS) than the high-risk-score group (P < 0.001) (Fig. [79]4A). In the validation model, the low-risk-score group had better OS than the high-risk-score group (P = 0.0115) (Fig. [80]4B). Figure 4. [81]Figure 4 [82]Open in a new tab Survival analysis of the prognostic models. (A) The KM curve of the TCGA model. (B) The KM curve of the [83]GSE24450 model. The clinical utility of the prognostic model In the TCGA prognostic model, univariate Cox regression analyses (Fig. [84]5A) showed that older age (> 65) (hazard ratio [HR 1.532; 95% confidence interval [CI] = 1.117–2.047; P < 0.001), high American Joint Committee on Cancer (AJCC) stage (III-IV) (HR = 2.048; 95% CI = 1.603–2.616; P < 0.001), high tumor (T) stage (3–4) (HR = 1.379; 95% CI = 1.101–1.729; P = 0.005), lymph node metastasis (positive) (HR = 1.572; 95% CI = 1.300–1.900; P < 0.001), and high risk score (HR = 3.108; 95% CI = 2.049–4.715; P < 0.001) were significant risk factors for poor prognosis. In the multivariate Cox regression analysis (Fig. [85]5B), older age (> 65) (HR = 1.634; 95% CI = 1.319–2.048; P < 0.001), high AJCC stage (III–IV) (HR = 2.101; 95% CI = 1.244–3.549; P = 0.005) and high risk score (HR = 3.324; 95% CI = 2.010–5.497; P < 0.001) were found to be independently associated with poor OS. The risk scores were significantly higher for patients with higher AJCC stage (III-IV) (Fig. [86]6C) and older age (> 65) (Fig. [87]6D). The risk score was significantly higher in TNBC patients than in luminal-type patients (Fig. [88]6E). The risk scores for different T stages (Fig. [89]6A) and different lymph node statuses (Fig. [90]6B) were not statistically significantly different. The risk scores in luminal-type patients and HER2-positive patients were not statistically significantly different (Fig. [91]6E). The risk scores in HER2-positive patients and TNBC patients were not statistically significantly different (Fig. [92]6E). Figure 5. Figure 5 [93]Open in a new tab Cox regression analyses of the prognostic model and clinical features. (A) Univariate Cox analyses of the TCGA model. (B) Multivariate Cox regression analysis of the TCGA model. Figure 6. [94]Figure 6 [95]Open in a new tab The relationship between risk score and clinical features. (A) The risk score in different T stage groups. (B) The risk score in different lymph node metastasis groups. (C) The risk score in different AJCC stage groups. (D) The risk score in different age groups. (E) The risk score in different molecular phenotype groups. Verification of the accuracy of the prognostic model To further verify the accuracy of the prognostic model, we constructed a nomogram and ROC curve. The ROC curve analysis of the TCGA prognostic model is shown in Fig. [96]7A, and the area under the curve (AUC) was 0.752. The nomogram is shown in Fig. [97]7B, and the C-index was 0.758. Figure 7. [98]Figure 7 [99]Open in a new tab Verification of the accuracy of prognostic models. (A) The ROC curve of the TCGA prognostic model. (B) The nomogram of the TCGA prognostic model. Functional enrichment analysis of stemness-related genes Through GSEA, we found that the high-risk-score group had enrichment in KEGG pathways related to metabolism (Fig. [100]8): the hedgehog signaling pathway, the TGF-β signaling pathway and a pathway related to arrhythmogenic right ventricular cardiomyopathy (ARVC). The low-risk-score group had enrichment in the following KEGG pathways (Fig. [101]8): the cell cycle, apoptosis, chemokine, cytokine and JAK-STAT pathways. Figure 8. [102]Figure 8 [103]Open in a new tab KEGG pathway enrichment analysis. Discussion In this research, we identified DEGs with potential stemness characteristics by analyzing stem-like and non-stem-like cells in [104]GSE69280. Then, the DEGs were compared with TCGA and [105]GSE24450 data to select coexpressed genes in the two databases. Next, by using univariate Cox regression analysis and Lasso-penalized Cox regression analysis, we obtained four prognostic-related genes (PSMB9, CXCL13, NPR3, and CDKN2C) and established a prognostic model. The model was validated with [106]GSE24450 data. We divided patients into low-risk-score and high-risk-score groups and found that the low-risk-score group had better OS than the high-risk-score group for both TCGA and [107]GSE24450 data. BCSCs are a small group of tumor cells that have self-renewal capacity and play an important role in tumor formation, recurrence and metastasis^[108]19. Furthermore, resistance to traditional chemoradiotherapy is a remarkable feature of BCSCs, as well as one of the culprits for treatment failure^[109]20. Recent studies have demonstrated that breast non-stem cells undergo dedifferentiation and transform into CSCs in response to treatment^[110]21. In addition, traditional treatments cannot thoroughly eliminate BCSCs, which contributes to a significant increase in the proportion of CSCs^[111]22. The main reasons for the resistance of CSCs are as follows. First, CSCs inhibit the expression of membrane-bound APC transporters, which act as efflux drug pumps to decrease intracellular drug accumulation^[112]20. In addition, CSCs also have DNA repair and antiapoptotic effects^[113]23, which are responsible for resistance to treatment. What’s more, different BC molecular subtypes, such as TNBC cells and HER-2 positive cells, has the similar stemness, but they are tow unique diseases that require different treatment strategies^[114]24. In BCSCs of different molecular subtypes, the expression and regulation of HER-2 are both different, so therapeutic repercussion and prognosis of patients will be different^[115]25. Thus, elimination of BSCSs is a potential new strategy for patients with refractory breast cancer. Our prognostic model was constructed with a series of survival-associated DEGs, including PSMB9, CXCL13, NPR3, and CDKN2C. CDKN2C, also known as p18 or INK4C, is a member of the INKCK family and regulates the G1 phase of the cell cycle by inhibiting CDK4 or CDK6^[116]26. Previous studies have reported that CDKN2C is involved in the regulation of normal stem cells and CSCs^[117]27. Yuan et al. pointed out that liver CSC counts significantly increased in the absence of CDKN2C expression, suggesting that CDKN2C strongly inhibited the self-renewal of liver CSCs^[118]28. Gain of the CCND1 and CDK4 and loss of the CDKN2A (p16) and CDKN2C (p18) genes are present in patients with luminal B breast cancer and poor prognosis of and negatively regulated by the cell cycle pathway^[119]29. Currently, inhibitors targeting CDK4/6 have been clinically approved for breast cancer patients who have failed hormone receptor-targeted treatment. CXCL13 is a member of the chemokine family and is an important component of the tumor microenvironment. In vivo, IL-30 overexpression in primary tumors facilitates the recruitment of prostate cancer stem-like cells (PCSLCs) to CXCL13, creating a microenvironment convenient for lymph node and blood metastasis^[120]30,[121]31. Zhang found that mesenchymal stem cells (MSCs) could secrete a large amount of CXCL13 in the bone marrow microenvironment of multiple myeloma and promote the proliferation, metastasis and drug resistance of myeloma cells through a CXCL13-mediated signaling pathway^[122]32. PSMB9 is one of the genes encoding proteasome subunits in human embryonic stem cells (hESCs) and plays a key role in maintaining the pluripotency of hESCs and regulating the cell cycle^[123]33. NPR3 is enriched in bone marrow mesenchymal stem cells (BM-MSCs) and has important regulatory effects on BM-MSCs^[124]34. Therefore, considering the regulatory role of these four stemness-related genes, our prognostic signature might be a potential biomarker in breast cancer outcome prediction. GSEA revealed that the high-risk-score group was enriched in the Hedgehog, TGF-β and cardiovascular KEGG pathways, while the low-risk-score group was enriched in the cell cycle, apoptosis, chemokine, cytokine and JAK-STAT KEGG pathways. The Hedgehog signaling pathway is essential for maintenance of BCSCs^[125]35, and inhibition of the components of the Hedgehog signaling pathway, such as Gli1, Gil2 and SHH, can reduce CSCs in breast cancer cell lines^[126]35,[127]36. The components of the tumor microenvironment (cytokines, chemokines, and exosomes)^[128]37,[129]38 and multiple signaling pathways, such as the apoptotic pathway^[130]39 and the cell cycle pathway^[131]27, both play an important role in maintaining the phenotype and function of CSCs. The JAK2-STAT pathway mediates BCSC resistance^[132]40, while JAK1-STAT may participate in non-CSC transformation into BCSCs^[133]41. Thus, the KEGG pathways involved in both groups are closely related to maintaining stemness, which may provide strategies for BC treatment. There are already some clinical trials that act directly on the hedgehog, Notch and Wnt signaling pathways and have some effects on CSC suppression^[134]42–[135]44. However, unfortunately, although there are treatment strategies for CSCs, the translational of these treatments into the clinic for BC patients has been unsatisfactory. There are some limitations in our present study. For example, the selected genes have been demonstrated to play an important role in maintaining CSCs or other SCs (BM-MSCs and hESCs), some of which have different roles in breast cancer, but few studies have involved the relationship of these genes with BCSCs. This requires further research in the future. A prognostic model consisting of stem cell-associated genes was constructed in our study. In data from both TCGA and [136]GSE24450, the low-risk-score group had worse outcomes than the high-risk-score group. Although BCSCs account for only a small proportion of all breast cancer cells, these cells play an important role in the recurrence and metastasis of the disease, and traditional treatment cannot thoroughly eliminate them. To the best of our knowledge, this is the first study to build a stemness-related prognostic signature in BC. It is hoped that our present study can provide potential biomarkers for BC outcome prediction and targets for therapies. Material and methods Selected stemness-related DEGs Via the edgeR package (v3.53) ([137]https://bioconductor.org/packages/edgeR/) (R Development Core Team, Vienna, Austria), we analyzed the [138]GSE69280 data in cells with stemness characteristics and cells without stemness characteristics and identified the stemness-related DEGs (with thresholds of |log[2] fold change [FC]|> 1.0 and false discovery rate [FDR] adjusted to P < 0.05). Data collection Patient clinical information and mRNA sequencing data were obtained from The Cancer Genome Atlas (TCGA) and [139]GSE24450. The TCGA database contains 1066 BC tissues and 112 adjacent normal tissues, the clinical features of patients were showed in Table [140]1. [141]GSE24450 included 183 breast cancer patients. All patients had complete survival information, and the follow-up time was more than 10 years; the validation data set had similar characteristics. The DEGs were identified as follows: (A) First, the coexpressed genes were obtained by intersecting TCGA and [142]GSE24450 genes. (B) Second, a stemness-related gene list was obtained from [143]GSE69280. (C) Next, the DEGs in BC samples from TCGA were identified. (D) Finally, we compared the stemness-related gene list and TCGA DEGs to obtain eligible stemness-related DEGs. The flow chart is shown in Fig [144]S1. Table 1. The clinical features of TCGA breast cancer patients. Clinical features Age (years) Media 58 Rage 26–89 Numbers of patients (n = 1066) Numbers of patients (%) Gender Female 1055 (98.97) Male 11 (1.03) T stage T1 280 (26.27) T2 617 (57.88) T3 131 (12.29) T4 36 (3.38) Unknown 2 (0.18) N stage 0 501 (47.00) 1 358 (33.58) 2 118 (11.69) 3 74 (6.94) Unknown 15 (0.79) M stage 0 888 (83.80) 1 19 (1.78) Unknown 159 (14.42) AJCC stage I 183 (17.17) II 602 (56.47) III 240 (22.51) IV 19 (1.78) Unknown 20 (2.07) HER-2 status Positive 156 (14.63) Negative 560 (52.53) Unknown 350 (32.84) Estrogen receptor status Positive 787 (73.83) Negative 233 (21.86) Unknown 46 (4.31) [145]Open in a new tab AJCC, American Joint Committee on Cancer; HER-2, human epidermal growth factor receptor-2; M, metastasis, N, node; T, tumor; TCGA, The Cancer Genome Atlas. Identification of stemness-related differentially expressed genes (DEGs) Through the R limma package^[146]45, we identified stemness-related DEGs for BC in the TCGA data (with thresholds of |log[2] fold change (FC)|> 1.0 and false discovery rate [FDR] adjusted to P < 0.05). Establishment of a prognostic model and validation model Prognostic risk scores were obtained for all patients by univariate Cox regression analysis and Lasso-penalized Cox regression^[147]46. The risk score calculation formula for all patients is as follows. [MATH: Surviva lRiskScoreSRS=i=1 nCi×Vi :MATH] In the formula, n represents the number of mRNAs, C[i] represents the coefficient of the mRNA in multivariate Cox regression analysis, and V[i] represents the expression level of the mRNA. Patients were classified into a high-risk-score group and a low-risk-score group by median risk score. To further verify the feasibility of the prognostic model, we also divided [148]GSE24450 patients into two groups according to the median risk score. The survival of the two groups of patients was analyzed by KM curves. Construction of a prognosis-related nomogram and receiver operating characteristic (ROC) curves To further verify the accuracy of the prognostic model, a nomogram and ROC curves were established by the edgeR package^[149]47,[150]48. The C-index was used to evaluate the accuracy of the nomogram by a bootstrap method with 1000 resamples. Functional enrichment analysis To better understand the underlying biological mechanisms of these genes, KEGG pathway analyses were performed (gene set enrichment analysis [GSEA])^[151]49. KEGG pathway analyses were based on a threshold of P < 0.05. Statistical analysis Statistical analyses were performed by using GraphPad Prism (version 8.0, San Diego, USA). Independent prognostic factors were determined by using a multivariate Cox regression model. Patient survival time was analyzed using the KM curve, and the log-rank test was used for statistical analysis. P < 0.05 was considered to indicate a statistically significant difference. Ethics declarations Our research is in compliance with the Declaration of Helsinki. Supplementary information [152]Supplementary Figures.^ (245.1KB, pdf) Acknowledgements