Abstract

   Background: Telomeres and cellular senescence are critical biological
   processes implicated in cancer development and progression, including
   breast cancer, through their influence on genomic stability and
   modulation of the tumor microenvironment. Methods: This study
   integrated bulk RNA sequencing and single-cell RNA sequencing
   (scRNA-seq) data to establish a gene signature associated with telomere
   maintenance and cellular senescence for prognostic prediction in breast
   cancer. Telomere-related genes (TEGs) and senescence-associated genes
   were curated from The Cancer Genome Atlas (TCGA) and Gene Expression
   Omnibus (GEO) databases. A comprehensive machine learning framework
   incorporating 101 algorithmic combinations across 10 survival modeling
   approaches, including random survival forests and ridge regression, was
   employed to develop a robust prognostic model. Results: A set of 19 key
   telomere- and senescence-related genes was identified as the optimal
   prognostic signature. The model demonstrated strong predictive accuracy
   and was successfully validated in multiple independent cohorts.
   Functional enrichment analyses indicated significant associations with
   immune responses and aging-related pathways. Single-cell transcriptomic
   analysis revealed marked cellular heterogeneity, identifying distinct
   subpopulations (fibroblasts and immune cells) with divergent risk
   scores and biological pathway activity. Additionally, pseudo-time
   trajectory analysis and intercellular communication mapping provided
   insights into the dynamic evolution of the tumor microenvironment.
   Immunohistochemical (IHC) validation using data from the Human Protein
   Atlas confirmed differential protein expression between normal and
   tumor tissues for several of the selected genes, reinforcing their
   biological relevance and clinical utility. Conclusions: This study
   presents a novel 19-gene telomere- and senescence-associated signature
   with strong prognostic value in breast cancer. These findings enhance
   our understanding of tumor heterogeneity and may inform precision
   oncology approaches and future therapeutic strategies.

   Keywords: telomeres, cellular senescence, breast cancer, machine
   learning, single-cell RNA sequencing, prognostic signature, tumor
   microenvironment

1. Introduction

   Breast cancer remains the most commonly diagnosed malignancy and the
   leading cause of cancer-related deaths among women worldwide,
   accounting for approximately 2.3 million new cases and 685,000 deaths
   in 2020 alone [[26]1]. Despite considerable advancements in screening,
   diagnosis, and therapeutic approaches, including endocrine therapy,
   chemotherapy, and targeted agents, the overall prognosis of breast
   cancer patients still varies significantly due to tumor heterogeneity
   and complex molecular mechanisms [[27]2,[28]3]. There is an urgent need
   to identify reliable prognostic biomarkers and therapeutic targets that
   can guide individualized treatment and improve clinical outcomes.

   Aging is one of the most prominent risk factors for breast cancer, with
   the majority of diagnoses occurring in postmenopausal women aged 50
   years and older [[29]4]. Aging-related processes, particularly telomere
   attrition and cellular senescence, are increasingly recognized as key
   contributors to cancer biology [[30]5]. Telomeres, the repetitive
   TTAGGG sequences at the ends of chromosomes, play a critical role in
   maintaining chromosomal stability and preventing end-to-end fusion
   [[31]6,[32]7]. However, due to the end-replication problem, telomeres
   progressively shorten with each cell division. When telomeres reach a
   critically short length, they trigger DNA damage responses that lead to
   cellular senescence or apoptosis [[33]8]. While this mechanism serves
   as a protective barrier against oncogenic transformation, aberrant
   telomere dysfunction can paradoxically promote tumor progression by
   increasing genomic instability, chromosomal rearrangements, and
   mutational burden [[34]9]. Telomere dysfunction is frequently observed
   in breast cancer and has been associated with aggressive
   clinicopathological features and a poor prognosis [[35]10]. Moreover,
   the reactivation of telomerase, a ribonucleoprotein that elongates
   telomeres, is a hallmark of many cancers, including breast cancer,
   facilitating unlimited proliferative potential [[36]11].

   Cellular senescence, another hallmark of aging, refers to a stable form
   of cell cycle arrest that is triggered not only by telomere shortening
   but also by various stressors such as oncogene activation, oxidative
   damage, and chemotherapy [[37]12]. Apparently, in the absence of
   telomerase, critically short telomeres induce cellular senescence,
   whereas the reintroduction of telomerase rescues senescence, suggesting
   a causal relationship between critically short telomeres and cellular
   senescence [[38]13]. During cancer development, senescence can be
   triggered by various factors, such as DNA damage, oncogene activation,
   therapeutic interventions, or elevated levels of reactive oxygen
   species. As a mechanism, cellular senescence inhibits the proliferation
   of damaged cells and plays a dual role in cancer and aging processes
   [[39]14,[40]15]. Although senescence initially acts as a
   tumor-suppressive mechanism by halting the proliferation of damaged
   cells, senescent cells often acquire a pro-inflammatory secretory
   profile known as the senescence-associated secretory phenotype (SASP)
   [[41]16]. Therefore, senescence represents a double-edged sword in
   cancer biology, and its role in breast cancer remains to be fully
   elucidated.

   Recent advancements in computational biology and high-throughput
   sequencing technologies have enabled researchers to leverage
   multi-omics data for biomarker discovery and prognosis prediction. In
   particular, machine learning (ML) algorithms have shown great potential
   in modeling complex biological systems and identifying robust
   prognostic gene signatures from large-scale transcriptomic datasets
   [[42]17]. ML approaches can integrate multiple data modalities, uncover
   nonlinear patterns, and construct predictive models that outperform
   conventional statistical methods [[43]18,[44]19,[45]20,[46]21]. For
   example, integrative ML-based frameworks have successfully been applied
   to identify immune-related signatures, hypoxia-related scores, and DNA
   damage repair profiles across various cancers, including breast cancer
   [[47]22]. In a recent study, researchers developed a novel machine
   learning-assisted telomerase signature (MLTS) by integrating
   multi-omics data from nine independent breast cancer datasets. The
   findings revealed that patients with high MLTS scores exhibited
   increased tumor mutational burden, chromosomal instability, and poor
   survival outcomes, suggesting a potential role in cancer progression
   [[48]23].

   Despite growing interest in telomere biology and cellular senescence,
   few studies have systematically combined these two aging-related
   processes to develop a comprehensive prognostic signature in breast
   cancer. Moreover, the role of aging-related genes in shaping the tumor
   microenvironment, influencing immune infiltration, and dictating
   therapeutic responsiveness remains largely unexplored.

   In this study, we aim to address these knowledge gaps by constructing a
   novel prognostic model based on telomere-related and cellular
   senescence-associated genes using a machine learning-integrated
   approach. By incorporating bulk RNA-seq data from TCGA and GEO
   datasets, we identified key genes associated with patient survival
   through rigorous ML algorithms, including random survival forests and
   ridge regression. We validated the model’s prognostic utility in
   external cohorts and explored underlying biological functions through
   enrichment analyses. Furthermore, we employed single-cell RNA
   sequencing (scRNA-seq) analysis to characterize the tumor
   microenvironment, assess cell-type-specific risk patterns, and
   investigate intercellular communication networks. Finally, we validated
   the protein-level expression of the key genes using
   immunohistochemistry (IHC) data from the Human Protein Atlas (HPA),
   enhancing the clinical relevance and translational potential of our
   findings.

2. Materials and Methods

2.1. Data Collection and Preprocessing

   Transcriptomic data for breast cancer patients were obtained from The
   Cancer Genome Atlas (TCGA-BRCA) via the official TCGA portal
   ([49]https://www.cancer.gov/ccg/research/genome-sequencing/tcga/,
   accessed on 12 September 2024). Corresponding somatic mutation
   profiles, clinical characteristics, and survival data were retrieved
   using the TCGAbiolinks R package (v2.31.3) [[50]24]. Samples labeled as
   “-11A” were defined as adjacent normal tissues, while the remaining
   were considered tumor tissues. A total of 1179 samples were included,
   comprising 113 normal and 1066 tumor tissues. For external validation,
   four GEO datasets ([51]GSE58812 [[52]25], [53]GSE21653 [[54]26],
   [55]GSE103091 [[56]27], and [57]GSE19615 [[58]28]) were downloaded from
   the NCBI-GEO database ([59]https://www.ncbi.nlm.nih.gov/geo/, National
   Center for Biotechnology Information, Bethesda, MD, USA; accessed on 13
   September 2024) [[60]29]. Cancer tissue samples with prognostic
   information were extracted from these datasets as subsequent model
   validation datasets. Inclusion criteria for GEO datasets were as
   follows: (1) availability of overall survival data, (2) sufficient
   sample size (n > 50), and (3) use of standardized microarray platforms.
   Datasets lacking raw expression data or platform annotation files were
   excluded. Additionally, [61]GSE161529 was obtained for single-cell RNA
   sequencing analysis. TCGA raw count data were normalized using
   variance-stabilizing transformation (VST) in the DESeq2 package. GEO
   microarray data were log2-transformed, and probe IDs were converted to
   gene symbols using corresponding platform annotations. For genes with
   multiple probes, the median expression value was used.

   Telomere-related genes (TEGs) were obtained from the GSEA database
   ([62]https://www.gsea-msigdb.org; Broad Institute, Cambridge, MA, USA;
   University of California San Diego, San Diego, CA, USA; accessed on 15
   September 2024) and the relevant published literature [[63]30].
   Aging-related genes (AGs) were integrated from the MSigDB database,
   including gene sets such as GOBP cell aging, TP53-associated senescence
   targets, senescence–autophagy pathways, and replicative senescence
   [[64]31]. Additionally, 307 AGs were downloaded from Human Aging
   Genomic Resources (HAGR, CellAge database:
   [65]http://genomics.senescence.info/genes/, accessed on 15 September
   2024). All gene sets were merged and deduplicated prior to downstream
   analysis.

2.2. Identification and Functional Annotation of Telomere-Related Cellular
Senescence Genes in Breast Cancer

   To identify differentially expressed telomere-related cellular
   senescence genes in breast cancer, transcriptomic data from the
   TCGA-BRCA cohort were analyzed. Differential expression analysis
   between tumor tissues and adjacent normal tissues was performed using
   the DESeq2 package (v1.34.0) [[66]32]. Genes with an absolute log[2]
   fold change (|log[2]FC|) > 0.2 and a p-value < 0.05 (adjusted using the
   Benjamini–Hochberg method) were considered significantly differentially
   expressed. Telomere-related genes (TEGs) and aging-related genes (AGs)
   were retrieved from previously curated databases. The intersection
   between differentially expressed genes (DEGs) and TEGs, as well as DEGs
   and AGs, was identified. The overlapping genes between these two sets
   were defined as telomere-related cellular senescence genes. To further
   refine this gene set, Pearson correlation analysis was conducted,
   retaining genes with an absolute correlation coefficient (|r|) ≥ 0.6
   and p-value ≤ 0.05. Heatmaps were generated to visualize expression
   patterns and the distribution of upregulated and downregulated genes.
   Gene ontology (GO) enrichment analysis, covering biological processes
   (BP), cellular components (CC), and molecular functions (MF), as well
   as Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, was
   conducted using the clusterProfiler package (v3.19.0). Enrichment
   results with a p-value < 0.05 were considered statistically
   significant.

2.3. Integration of Machine Learning Algorithms to Identify Prognostic Gene
Signatures

   Based on the above-mentioned cross-linked genes, univariate Cox
   regression analysis was performed using the survival package (version
   3.2.13, [67]https://github.com/therneau/survival, accessed on 20
   September 2024) to identify candidate genes associated with prognosis
   [[68]33]. The corresponding p-value for each gene was obtained for
   evaluation, and the threshold for significance was set at a p-value <
   0.05. The prognostic model was constructed by integrating 101
   combinations of 10 different machine learning algorithms, including
   CoxBoost, elastic network (Enet), survival support vector machine
   (survival-SVM), lasso, partial least squares regression for Cox
   (plsRcox), ridge, random survival forest (RSF), stepwise Cox,
   supervised principal components (SuperPC), and generalized boosted
   regression modeling (GBM) [[69]34,[70]35]. The RiskScore for each
   patient was calculated using a linear combination of the expression
   levels of selected genes weighted by their corresponding coefficients
   derived from the machine learning model, according to the formula:
   [MATH: <mrow><mrow><mi mathvariant="normal">R</mi><mi
   mathvariant="normal">i</mi><mi mathvariant="normal">s</mi><mi
   mathvariant="normal">k</mi><mi mathvariant="normal">s</mi><mi
   mathvariant="normal">c</mi><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">r</mi><mi
   mathvariant="normal">e</mi><mo>=</mo><mrow><munderover><mo
   stretchy="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow
   ><mi>n</mi></mrow></munderover><mrow><mi>c</mi><mi>o</mi><mi>e</mi><mi>
   f</mi><mo> </mo><mo>×</mo><mo> </mo><msub><mrow><mi>e</mi><mi>x</mi><mi
   >p</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mrow></mrow></mrow
   > :MATH]

   where coef represents the coefficient for gene i and exp[i] denotes its
   normalized expression level. The coefficients were obtained from the
   best-performing model [[71]36].

   To ensure consistency, gene expression data were normalized prior to
   analysis. Patients were stratified into high- and low-risk groups based
   on an optimal cut-off value determined from the training cohort using
   the survminer R package. All tumor samples in the TCGA-BRCA dataset
   were thus classified into high-risk (risk score ≥ cut-off) and low-risk
   (risk score < cut-off) groups. Kaplan–Meier survival curves and the
   log-rank test were applied to evaluate overall survival differences
   between the two groups. Time-dependent receiver operating
   characteristic (ROC) analysis was conducted to assess the predictive
   performance of the prognostic model, with area under the curve (AUC)
   values calculated at 1-, 3-, and 5-year intervals. The established
   prognostic signature was further validated in four external GEO
   datasets ([72]GSE58812, [73]GSE21653, [74]GSE103091, and [75]GSE19615),
   using the same gene coefficients and cut-off threshold derived from the
   training set.

2.4. Clinical Correlation Assessment, Nomogram Modeling, and Comparative
Performance Analysis

   Clinical factors including age, gender, tumor grade, and pathological
   stage were collected from both the training cohort (TCGA-BRCA) and the
   external validation cohorts (GEO datasets) and are presented in
   [76]Supplementary Table S1. For each clinical variable, the concordance
   index (C-index) was calculated to assess its prognostic discriminatory
   power. A bar chart was generated to compare the C-index values across
   datasets. To further evaluate the independence of the proposed
   prognostic signature, univariate and multivariate Cox regression
   analyses were performed in the TCGA-BRCA cohort using the survival R
   package. The risk score group (high-risk group [HRG] vs. low-risk group
   [LRG]) and clinical covariates were included in the models to identify
   independent predictors of overall survival. Based on the risk score and
   key clinical features, a prognostic nomogram was constructed using the
   rms package (v6.2-0). The nomogram was designed to estimate 1-, 3-, and
   5-year survival probabilities. Calibration curves were plotted to
   evaluate the agreement between predicted and observed outcomes, thereby
   assessing the predictive accuracy of the nomogram. To further
   investigate the association between the risk score and clinical
   characteristics, chi-square tests were performed to assess the
   distribution differences of clinical features, including gender, T
   stage, N stage, M stage, pathological stage, and calculated molecular
   subtypes, between the HRG and LRG. The sample distributions across risk
   groups were visualized using bar plots, providing an intuitive overview
   of their associations. Lastly, to evaluate the robustness and
   superiority of our model, its performance was compared with 18
   published prognostic models developed within the last three years.
   Signature genes and their coefficients were extracted from the
   literature, and corresponding risk scores were calculated. The C-index
   for each model was computed and visualized to compare prognostic
   performance, offering a comprehensive benchmark for the proposed
   signature.

2.5. Assessment of the Risk Score Signature for Immunotherapy Response and
Prognostic Evaluation

   To investigate the potential of the risk score signature as a
   predictive biomarker for immunotherapy response and its correlation
   with survival prognosis, we analyzed its relationship with several
   established immunotherapy predictors, including microsatellite
   instability (MSI), cytotoxic activity (CYT), interferon-gamma (IFN-γ),
   T cell inflammatory gene expression profile (GEP), T cell receptor
   (TCR) richness, TCR Shannon entropy value, immune phenotype score
   (IPS), and tumor immune dysfunction and rejection score (TIDE). The
   Wilcoxon test was applied to assess the significance of the
   distribution of these immunotherapy predictors between the high- and
   low-risk groups. Box plots were used to visualize their distributions,
   and the Wilcoxon test provided the corresponding p-values. Since no
   immunotherapy data were available in the TCGA-BRCA cohort, the
   IMvigor210 immunotherapy dataset was used to further evaluate the
   predictive performance of the risk score. The same set of 101 machine
   learning algorithms was applied to calculate the risk score for each
   sample. Kaplan–Meier survival curves were generated using the survival
   package to assess the correlation between the HRG and LRG and survival
   prognosis. The significance of survival differences between the groups
   was tested using the log-rank test. The distribution of
   immunotherapy-related factors in the high- and low-risk groups was
   visualized using box plots, and p-values were calculated using the
   Wilcoxon test.

2.6. Correlation Analysis Between Signature Genes and Stromal Scores

   To investigate the association between the identified gene signature
   and stromal components of the tumor microenvironment, stromal scores
   were calculated for each TCGA-BRCA sample using the ESTIMATE algorithm
   (implemented via the estimate R package). Pearson correlation analysis
   was performed between the expression levels of each of the model genes
   and the corresponding stromal scores. Genes with |correlation
   coefficient| > 0.3 and p-value < 0.05 were considered significantly
   correlated with stromal content. Correlation coefficients were
   visualized using heatmaps and scatter plots to depict gene–stroma
   relationships. All analyses were conducted using the ggplot2, corrplot,
   and ggpubr packages in R.

2.7. Biological Mechanisms and Mutational Analysis of Risk Score Features

   To elucidate the biological mechanisms underlying risk score
   characteristics and compare mutational profiles between high- and
   low-risk groups, we performed a series of analyses. First, we explored
   potential biological mechanisms by calculating the cancer immune cycle
   using the TIP database ([77]http://biocc.hrbmu.edu.cn/TIP/, accessed on
   20 September 2024) and conducting gene set enrichment analysis (ssGSEA)
   with GO and KEGG gene sets from MSigDB
   ([78]http://www.gsea-msigdb.org/gsea/index.jsp, Broad Institute,
   Cambridge, MA, USA; University of California San Diego, San Diego, CA,
   USA; accessed on 21 September 2024) using the GSVA package (V1.40.1)
   ([79]https://www.bioconductor.org/packages/release/bioc/html/GSVA.html,
   Fred Hutchinson Cancer Center, Seattle, WA, USA; accessed on 21
   September 2024). We also performed hallmark gene set analysis with
   GSEA. For mutational comparison, we used the maftools package (V2.8.05)
   [[80]37] to analyze TCGA-BRCA mutation data, identifying the top 20
   genes with the highest mutation rates in the HRG and LRG. We assessed
   the tumor mutational burden (TMB) and its impact on survival using
   Kaplan–Meier analysis and the Wilcoxon rank-sum test. Additionally, we
   detected somatic copy number variations using GISTIC 2.0
   ([81]https://github.com/broadinstitute/gistic2, Broad Institute,
   Cambridge, MA, USA; accessed on 22 September 2024) and obtained
   aneuploidy scores and tumor neo-antigen data from Thorsson et al.’s
   study [[82]38].

2.8. Chemotherapeutic Drug Sensitivity

   Risk scores and the expression of 2249 druggable targets were analyzed
   in the TCGA-BRCA cohort (2249 druggable targets were obtained from the
   literature [[83]39]). Positively correlated genes were treated as risk
   score-related targets (correlation coefficient > 0.3 and FDR < 0.05).
   Differential expression analysis was performed for high-risk vs.
   low-risk groups in the TCGA-BRCA cohort (using limma package-V3.48.3
   [[84]40]), and the corresponding p-value and logFC of the gene were
   obtained. The CMap database [[85]41] was used to predict the CMap score
   of each compound based on the RiskScore gene. Compounds with a CMap
   score < −90 or a CMap score > 90 [[86]42] are regarded as potentially
   effective compounds. Then, the half-maximal inhibitory concentration
   (IC50) values of these drugs per sample were evaluated using the
   pRRophetic (v 0.5) package [[87]43] to assess the sensitivity of the
   samples to drugs. After, discrepancies in the IC50 values of each drug
   between the HRG and LRG were compared using the wilcox.test function in
   the rstatix (v 0.7.2) package (p < 0.05). Subsequently, the connection
   between IC50 values and risk scores was calculated by employing the
   cor.test function in the stats (v 4.2.2) package (|cor| > 0.3, p <
   0.05). Finally, box plots were drawn using the ggplot2 (v 3.4.4)
   package for visualization.

2.9. Consensus Clustering Identifies Subgroups of Breast Cancer

   According to the above-mentioned characteristic genes, BRCA patients
   were clustered without supervision using the “ConensusClustosPlus”
   software package (V-1.56.0,
   [88]https://www.bioconductor.org/packages/release/bioc/html/ConsensusCl
   usterPlus.html, Fred Hutchinson Cancer Center, Seattle, WA, USA;
   accessed on 28 September 2024). Molecular subtypes were identified
   based on the optimal cluster K. The survival package (V3.2.13,
   [89]https://github.com/therneau/survival, Mayo Clinic, Rochester, MN,
   USA; accessed on 29 September 2024) was used to draw survival
   prognostic curves to analyze the prognostic differences between
   subtypes, and the ComplexHeatmap package (V-2.8.0,
   [90]https://www.bioconductor.org/packages/release/bioc/html/ComplexHeat
   map.html, Fred Hutchinson Cancer Center, Seattle, WA, USA; accessed on
   30 September 2024) was used to display the heatmap of the clinical
   characteristics of characteristic genes between subtypes.

2.10. ScRNA-Seq Data Processing and Cell Subpopulation Analysis

   The 10× scRNA-seq data of [91]GSE161529 were processed via the Seurat
   (v 5.0.1) package [[92]44] to investigate differences between 24 normal
   control samples and 45 breast cancer samples. Cells with fewer than 200
   genes, more than 7000 genes, or more than 20% mitochondrial gene
   expression were filtered out. The top 2000 variable genes were
   identified using the FindVariableFeatures function, and cell types were
   annotated using marker genes from the CellMarker 2.0 database [[93]45].
   Risk scores were calculated at the cell level using the AddModuleScore
   algorithm, and differences in these scores between cancer and normal
   groups were compared using box plots and the Wilcoxon test. For cell
   subpopulation analysis, marker genes from PanglaoDB [[94]46] and
   CellMarker 2.0 were used to identify subclusters, and differential
   expression analysis was performed to obtain DEGs, which were visualized
   with heatmaps and bubble charts. KEGG enrichment analysis was conducted
   on DEGs to identify the signaling pathways involved in these
   subclusters.

2.11. Cell Communication and Pseudo-Time Analysis Immunohistochemistry
Validation via the HPA Database

   Cell communication analysis was performed on annotated cell clusters
   using the CellChat package (v1.6.1) [[95]47]. Network diagrams and
   heatmaps were generated with the netVisual_heatmap function in the
   CellChat (v 1.6.1) package to visualize the results. Additionally, to
   investigate the developmental trajectory of key cells and changes in
   the expression of prognostic genes within these cells, pseudo-time
   trajectory analysis was conducted using the monocle (v 2.26.0) package
   [[96]48]. The results were visualized with the plot_cell_trajectory
   function.

2.12. Immunohistochemical Validation of Key Genes Using the Human Protein
Atlas

   To validate transcriptomic findings at the protein level, the IHC
   staining data of the identified key genes were retrieved from the Human
   Protein Atlas (HPA) database ([97]https://www.proteinatlas.org,
   accessed on 1 March 2025). IHC images of normal breast tissue and
   breast cancer tissue were examined. Protein expression levels were
   evaluated based on staining intensity, cellular localization, and
   distribution patterns. Differences between tumor and normal tissues
   were visually compared. To ensure specificity and reliability, only
   high-confidence antibodies with enhanced validation, as defined by the
   HPA, were selected for analysis.

2.13. Statistical Methods

   All statistical analyses and data visualizations were conducted in R
   (version 4.3.1). Group comparisons were assessed using chi-square or
   Wilcoxon rank-sum tests as appropriate. Survival analyses, including
   Kaplan–Meier curves and Cox proportional hazards regression, were
   performed using the survival package, with log-rank tests used for
   significance assessment. Time-dependent ROC curves and concordance
   index (C-index) comparisons were generated using the timeROC and
   CompareC packages. Differential gene expression was analyzed with
   DESeq2, and enrichment analyses were carried out using clusterProfiler.
   Additional tools used include maftools for mutation analysis,
   pRRophetic for drug sensitivity prediction, ConsensusClusterPlus for
   molecular subtyping, and Seurat for single-cell data processing. Unless
   otherwise noted, all tests were two-sided, and a p-value < 0.05 was
   considered statistically significant. Detailed statistical thresholds
   and methods are summarized in [98]Supplementary Table S2.

3. Results

3.1. Identification of Telomere-Related Cellular Senescence Genes with
Significant Differences in Breast Cancer and Enrichment Analysis of
Intersection Genes

   A total of 2100 telomere-related genes (TEGs) and 1153 cellular
   senescence genes (CAGs) were obtained. Differential expression analysis
   was performed using the DESeq2 package on 113 adjacent normal tissues
   and 1066 tumor tissues from the TCGA-BRCA dataset. A total of 13,595
   differentially expressed genes (DEGs) were identified between tumor and
   adjacent normal tissues, with criteria of |log2FC| > 0.2 and p-value <
   0.05 ([99]Figure 1A, [100]Supplementary Table S3). Among these, 1490
   genes were found to overlap between TEGs and DEGs, and 842 genes
   overlapped between CAGs and DEGs ([101]Figure 1B, [102]Supplementary
   Table S4). Further correlation analysis of these overlapping genes
   revealed 1124 genes with |r| ≥ 0.6 and p-value ≤ 0.05
   ([103]Supplementary Table S5), with approximately half of the genes
   showing an upregulated trend ([104]Figure 1C,D). A total of 1161 GO
   terms (p < 0.05) were enriched by the candidate genes. Among these,
   1006 BPs were identified, including fibroblast proliferation, cellular
   response to alkaloids, regulation of mitotic nuclear division, response
   to peptide hormones, and positive regulation of DNA replication, among
   others. In addition, 58 CCs were enriched, such as the
   chromatin-silencing complex, chromosome, telomeric repeat region, and
   DNA repair complex. Furthermore, 97 MFs were enriched, including
   single-stranded DNA binding, histone kinase activity, and
   promoter-specific chromatin binding ([105]Figure 2A–C,
   [106]Supplementary Table S6). Additionally, 35 KEGG pathways (p < 0.05)
   were significantly enriched, including cellular senescence, cell cycle,
   and p53 signaling pathways ([107]Figure 2D, [108]Supplementary Table
   S7).

Figure 1.

   [109]Figure 1
   [110]Open in a new tab

   Identification of telomere-related cellular senescence genes with
   significant differences in breast cancer. (A) Volcano plot showing
   differentially expressed genes (DEGs) between tumor and normal tissues
   in the TCGA-BRCA cohort. (B) Venn diagram illustrating the overlap
   among DEGs, telomere-related genes (TEGs), and aging-related genes
   (AGs). (C) Heatmap displaying expression profiles of 1124 intersected
   telomere- and senescence-related genes across tumor and normal samples.
   (D) Bar plot representing the proportion of upregulated and
   downregulated genes within the intersected gene set.

Figure 2.

   [111]Figure 2
   [112]Open in a new tab

   Functional enrichment analysis of telomere- and senescence-associated
   genes. (A–C) Gene ontology (GO) enrichment analysis of the intersected
   genes in terms of the biological process (BP), cellular component (CC),
   and molecular function (MF) categories. (D) KEGG pathway enrichment
   analysis highlighting key signaling pathways associated with the
   intersected genes. The color intensity corresponds to the statistical
   significance level (darker red indicates stronger significance).

3.2. Machine Learning Development and Validation of Breast Cancer Consensus
Gene Signature

   A total of 1124 genes were screened through single-factor Cox
   regression to identify 157 significant prognostic genes among the above
   intersection genes (p < 0.05, [113]Supplementary Figure S1 and
   Supplementary Table S8). The model’s predictive performance was
   assessed using the concordance index (C-index), a standard metric in
   survival analysis that reflects the model’s ability to correctly rank
   predicted survival times. As classification metrics such as accuracy or
   F1 score are not appropriate for censored data, the C-index was chosen
   as the primary evaluation criterion [[114]35,[115]49]. Based on the
   expression data of the above 157 prognostic genes in the training set
   TCGA and four validation sets, 10 machine learning methods were used to
   construct a risk model using a combination of 101 machine learning
   algorithms, and the model with the best C-index was selected
   ([116]Figure 3A). Among all algorithm combinations, the integration of
   RSF and ridge achieved the highest average C-index (0.65), indicating
   the best predictive performance. This model identified 19
   characteristic genes and generated risk scores for each dataset
   ([117]Supplementary Tables S9 and S10). The results show that the
   survival curves of the TCGA-BRCA training set are statistically
   significant, and the AUC values are all >0.7, indicating better
   performance prediction effects ([118]Figure 3B,C). In addition, the
   results of the four validation sets are consistent with the results of
   the training set, that is, the survival rate of the high-risk group is
   significantly lower than that of the low-risk group ([119]Figure 3D–K).

Figure 3.

   [120]Figure 3
   [121]Open in a new tab

   Machine learning-based development and validation of a breast cancer
   prognostic gene signature. (A) Concordance index (C-index) distribution
   of the integrated machine learning models across five datasets. (B)
   Time-dependent receiver operating characteristic (ROC) curves for 1-,
   3-, and 5-year overall survival (OS) prediction in the TCGA-BRCA
   training cohort. The dotted diagonal line indicates the performance of
   a random classifier (AUC = 0.5) and serves as a reference. (C)
   Kaplan–Meier (KM) survival curves for high-risk and low-risk groups in
   the training set; red and blue lines represent high- and low-risk
   patients, respectively. Dotted lines indicate the time point and
   corresponding survival probability for comparison between high- and
   low-risk groups. (D–K) Time-dependent ROC curves and corresponding KM
   survival curves for external validation cohorts ([122]GSE58812,
   [123]GSE21653, [124]GSE103091, and [125]GSE19615), evaluating the
   performance of the prognostic model at 1, 3, and 5 years.

3.3. Association Between Risk Score and Clinical Characteristics

   To explore the clinical relevance of the risk score, we examined the
   distribution of key clinicopathological features, including gender,
   molecular subtypes (derived from unsupervised clustering), TNM stage
   (T, N, or M), and overall pathological stage, between the HRG and LRG.
   The sample distributions across risk groups were visualized using bar
   plots. Chi-square tests were conducted to assess statistical
   differences in clinical feature distributions between the HRG and LRG.
   Significant differences (p-value < 0.05) were observed for gender,
   molecular subtype, N stage, M stage, and overall pathological stage.
   Interestingly, T stage was the only variable that did not show a
   significant difference (p-value > 0.05), suggesting that the risk score
   may capture tumor aggressiveness beyond size-related metrics. These
   results suggest that the risk score correlates strongly with several
   key clinical features and may reflect underlying biological
   heterogeneity. Detailed statistical comparisons and visualizations of
   these distributions are provided in [126]Supplementary Figure S2.

3.4. Prognostic Performance of the Risk Signature and Nomogram Construction

   We calculated the clinical factors (age, gender, grade, stage, etc.)
   and risk score C-index for breast cancer patients and compared the
   differences in the C-index across each dataset. The results showed that
   TCGA data had up to six clinical characteristics, with an average
   C-index of approximately 0.7, which was higher than the results
   observed in most other validation sets ([127]Figure 4A). To investigate
   the source of this difference, univariate and multivariate Cox
   regression analyses were performed on the clinical factors and risk
   scores from the training set. Among the factors analyzed, three were
   found to be significant: risk score (RiskScore), M stage, and tumor
   stage ([128]Figure 4B,C). Clinical data from tumor patients, including
   the significant factors of M stage, tumor stage, and RiskScore, were
   extracted as prognostic variables. The resulting nomogram, based on
   these independent prognostic factors, demonstrated that higher total
   scores were associated with lower survival probabilities for BRCA
   patients ([129]Figure 4D and [130]Figure 5A). The calibration curve
   showed that the predicted survival probabilities at 1, 2, and 3 years
   closely matched the reference line ([131]Figure 5B), indicating that
   the nomogram had relatively high predictive accuracy. Additionally, the
   decision curve analysis (DCA) showed that the net benefit of the
   nomogram model was superior to that of both the “all” and “none”
   strategies ([132]Figure 5C), suggesting that the nomogram model has
   good predictive performance and practical clinical application value
   for BRCA decision-making. Furthermore, the C-index values of the
   prognostic models from 18 other studies, including those focused on
   inflammation, were generally lower than those of the model we developed
   ([133]Figure 5D).

Figure 4.

   [134]Figure 4
   [135]Open in a new tab

   Prognostic value of the risk score signature and construction of a
   nomogram. (A) Comparison of the concordance index (C-index) for the
   risk score and various clinical features across the TCGA-BRCA cohort
   and four external validation datasets. (B,C) Univariate and
   multivariate Cox regression analyses evaluating the prognostic
   significance of the risk score and clinical variables in the TCGA-BRCA
   cohort. (D) A nomogram constructed based on independent prognostic
   factors, including RiskScore, M stage, and pathological stage, for
   predicting 1-, 3-, and 5-year overall survival in breast cancer
   patients. Statistical significance is indicated as follows: ** p <
   0.01; *** p < 0.001.

Figure 5.

   [136]Figure 5
   [137]Open in a new tab

   Evaluation of the nomogram’s performance and comparative analysis with
   existing models. (A) Kaplan–Meier survival curves for high-risk and
   low-risk groups as stratified by the nomogram. (B) Decision curve
   analysis (DCA) comparing the clinical net benefit of the RiskScore
   alone versus the integrated nomogram. (C) Calibration plots assessing
   the accuracy of the nomogram in predicting 1-, 3-, and 5-year overall
   survival. (D) Comparative analysis of C-index values between the
   proposed model and 18 published prognostic models, including those
   based on inflammation-related gene signatures.

3.5. Immune Correlation of Risk Score Signatures as Predictive Biomarkers for
Immunotherapy Response and Associated Biological Mechanisms

   As shown in [138]Figure 6A, the distribution of stromal scores, immune
   scores, and ESTIMATE scores differed significantly between the HRG and
   LRG, suggesting that distinct tumor microenvironment characteristics
   are associated with the risk stratification. Additionally, multiple
   immune infiltration algorithms revealed significant differences in the
   proportions of T cell subsets (CD8+ T cells and naïve CD4+ T cells)
   between the two groups ([139]Figure 6B,C). To further explore
   immune-related functions, correlation analyses showed that RiskScore
   was significantly positively associated with genes involved in antigen
   presentation, cell adhesion, immune co-inhibition, co-stimulation,
   ligands, and receptors ([140]Figure 7A,B). Moreover, scores related to
   immune response, including CYT, IPS, GEP, IFNG expression, TCR
   richness, and TCR Shannon entropy, also showed significant differences
   between the HRG and LRG ([141]Supplementary Figure S3A), reinforcing
   the immunological relevance of the risk signature. To assess the
   predictive power of RiskScore in immunotherapy contexts, survival and
   response analyses were conducted in two immune checkpoint therapy
   datasets: IMvigor210 and [142]GSE78220. Stratification by RiskScore
   revealed significant prognostic and therapeutic response differences
   between the HRG and LRG ([143]Supplementary Figure S3B–E). Furthermore,
   analysis of cancer immune cycle activity from the TIP database
   demonstrated differential activation states across groups
   ([144]Supplementary Figure S3F). Pathway enrichment analyses further
   supported the biological relevance of the risk model. The top 50
   pathways from GO and KEGG analyses were positively correlated with
   RiskScore ([145]Supplementary Figure S4A,B), while GSEA of hallmark
   gene sets identified distinct pathway signatures that were enriched in
   the HRG and LRG, including immune regulation and senescence-associated
   processes ([146]Supplementary Figures S3G and S4C).

Figure 6.

   [147]Figure 6
   [148]Open in a new tab

   Immune landscape and prognostic differences between high- and low-risk
   groups. (A) Comparisons of stromal score, immune score, and ESTIMATE
   score between the high-risk group (HRG) and low-risk group (LRG),
   calculated using the ESTIMATE algorithm. (B) Heatmap showing the
   relative infiltration levels of 22 immune cell types estimated via the
   CIBERSORT algorithm across HRG and LRG samples. (C) Immune infiltration
   landscape estimated by the TIMER algorithm. Blue and red colors
   indicate the LRG and HRG, respectively. Statistical significance is
   indicated as follows: * p < 0.05; ** p < 0.01; *** p < 0.001; **** p <
   0.0001.

Figure 7.

   [149]Figure 7
   [150]Open in a new tab

   Association between RiskScore and immune functional features. (A)
   Correlation heatmap showing the relationship between RiskScore and the
   relative abundance of immune cells estimated using the ssGSEA
   algorithm. A redder color indicates a stronger positive correlation.
   (B) Heatmap illustrates the expression levels of immune-related genes,
   including those involved in antigen presentation, cell adhesion,
   co-inhibitory and co-stimulatory molecules, ligands, and receptors,
   between the high-risk group (HRG) and low-risk group (LRG).

3.6. Stratification Functional Characterization of Risk Score and Immune
Landscape

   To further investigate the relationship between the prognostic gene
   signature and the tumor stromal microenvironment, we performed a
   correlation analysis between the expression levels of each signature
   gene and stromal scores derived using the ESTIMATE algorithm. As shown
   in [151]Supplementary Figure S5, several genes, including COL17A1,
   ZMAT3, IL33, and AMPD1, demonstrated a significant positive correlation
   with stromal scores, indicating potential involvement in stromal
   activation or extracellular matrix remodeling. In contrast, genes such
   as PTGES3 and UBTF exhibited significant negative correlations,
   suggesting tumor-cell-intrinsic functions. These findings suggest that
   the proposed gene signature reflects both stromal and epithelial tumor
   biology.

3.7. Genomic Alterations and Drug Sensitivity Prediction Based on Risk
Stratification

   Somatic mutation and CNV data from the TCGA-BRCA cohort were analyzed
   to investigate genomic differences between the HRG and LRG. While no
   statistically significant difference in TMB was observed between the
   two groups (p-value = 1), the tumor neo-antigen burden (TNB) was
   significantly higher in the HRG (p-value < 0.05) ([152]Figure 8A,B,
   [153]Supplementary Figure S6A–D). Notably, TP53 was the most frequently
   mutated gene in the LRG, whereas PIK3CA had the highest mutation
   frequency in the HRG ([154]Figure 8C,D). To evaluate drug sensitivity,
   we applied the pRRophetic algorithm to predict responses to 138
   chemotherapeutic agents. Additionally, Spearman correlation analysis
   between RiskScore and 2243 druggable targets identified 111 targets
   that are significantly associated with RiskScore (correlation
   coefficient > 0.3, FDR < 0.05) ([155]Figure 9A, [156]Supplementary
   Table S11). Differential gene expression analysis using the limma
   package revealed 2596 DEGs between the HRG and LRG ([157]Supplementary
   Table S12), of which the top 150 upregulated and 150 downregulated
   genes were selected for further analysis. CMap analysis identified 40
   candidate compounds with potential therapeutic relevance, defined by
   CMap scores < −90 or >90 ([158]Supplementary Table S13). Among these,
   two compound classes were represented by more than two agents,
   suggesting potential drug categories for high-risk patients
   ([159]Supplementary Figure S6E,F).

Figure 8.

   [160]Figure 8
   [161]Open in a new tab

   Genomic alterations in high- and low-risk groups based on risk score
   signature. (A,B) Comparison of the TMB and TNB between the HRG and LRG.
   While the TMB did not differ significantly between groups (p-value =
   1), the TNB was significantly elevated in the HRG (p-value < 0.05).
   (C,D) Waterfall plots depicting the top 20 most frequently mutated
   genes in the LRG and HRG. TP53 exhibited the highest mutation frequency
   in the LRG, whereas PIK3CA was the most commonly mutated gene in the
   HRG.

Figure 9.

   [162]Figure 9
   [163]Open in a new tab

   Drug sensitivity analysis and molecular subtype identification based on
   risk signature. (A) Predicted IC50 values for selected chemotherapeutic
   agents between the high-risk group (HRG) and low-risk group (LRG),
   indicating differences in drug sensitivity. (B) Consensus matrix
   heatmap for clustering analysis with k = 2, where rows and columns
   represent individual samples. (C) Cumulative distribution function
   (CDF) plot showing clustering stability across different values of k.
   (D) Kaplan–Meier survival curves comparing overall survival between the
   identified molecular subtypes. (E) Heatmap displaying the expression
   levels of 19 signature genes across the molecular subtypes. (F)
   Principal component analysis (PCA) showing separation of subtypes based
   on gene expression profiles. (G) Box plot illustrating the differential
   expression of the 19 signature genes between subtypes. Statistical
   significance is indicated as follows: ns, p > 0.05; * p < 0.05; ** p <
   0.01; *** p < 0.001; **** p < 0.0001.

3.8. Consensus Clustering Identifies Molecular Subtypes of Breast Cancer

   To further explore the molecular heterogeneity of breast cancer,
   consensus clustering was performed based on the expression profiles of
   19 key genes identified via machine learning. Gene expression data from
   1066 tumor samples in the TCGA-BRCA cohort were extracted for analysis.
   Using the ConsensusClusterPlus algorithm, the optimal number of
   clusters was determined to be k = 2 based on the cumulative
   distribution function (CDF) and delta area plot, resulting in two
   distinct subtypes: Cluster 1 (n = 575) and Cluster 2 (n = 491)
   ([164]Figure 9B,C). Kaplan–Meier survival analysis revealed a
   significant difference in overall survival between the two subgroups
   (p-value < 0.05), with Cluster 2 showing worse prognoses ([165]Figure
   9D). Heatmaps and PCA plots demonstrated that the expression profiles
   of the 19 signature genes effectively distinguished the two clusters
   ([166]Figure 9E,F). Furthermore, the majority of these genes exhibited
   significantly different expression levels between the two clusters
   ([167]Figure 9G). Consistent clustering patterns were also observed in
   the four external validation cohorts ([168]GSE58812, [169]GSE21653,
   [170]GSE103091, and [171]GSE19615), where consensus clustering (k = 2)
   similarly identified two robust subtypes. The expression levels of most
   signature genes were significantly different between the subgroups in
   each dataset, reinforcing the reproducibility and robustness of the
   clustering results ([172]Supplementary Figure S7).

3.9. Cellular Heterogeneity and Identification of Fibroblasts as a Key Cell
Type in Breast Cancer

   After stringent quality control of scRNA-seq data, a total of 428,024
   high-quality cells were retained for downstream analysis
   ([173]Supplementary Figure S8A,B). Cell type annotation was performed
   based on canonical marker gene expression, resulting in the
   identification of seven major cell types, including fibroblasts,
   epithelial cells, and T cells ([174]Figure 10A). Comparative analysis
   between tumor and normal breast tissues revealed distinct cellular
   compositions. Fibroblasts and T cells were significantly enriched in
   tumor samples, while epithelial cells were more prevalent in normal
   tissues ([175]Figure 10B,C). Given the substantial enrichment and
   potential biological relevance of fibroblasts in the tumor
   microenvironment, this cell population was selected for further
   investigation. Subclustering analysis of fibroblasts identified four
   transcriptionally distinct subtypes: antigen-presenting
   cancer-associated fibroblasts (antigen-presenting CAFs),
   myofibroblastic CAFs (MyCAFs), lipofibroblasts, and conventional
   cancer-associated fibroblasts (CAFs). Pathway enrichment analysis of
   differentially expressed genes among these subtypes revealed distinct
   functional profiles. Notably, pathways such as IL-17 signaling,
   ECM–receptor interaction, oxidative phosphorylation, and TNF signaling
   were differentially activated across the fibroblast subpopulations
   ([176]Figure 10D,E, [177]Supplementary Figure S8C,D), suggesting
   functional heterogeneity within the fibroblast compartment that may
   contribute to tumor progression and immune modulation.

Figure 10.

   [178]Figure 10
   [179]Open in a new tab

   Cellular heterogeneity and functional profiling of fibroblasts in
   breast cancer. (A) UMAP plot showing cell type annotations across all
   retained single cells, including fibroblasts, epithelial cells, and T
   cells. (B) Bar plot depicting the proportional distribution of each
   cell type between tumor and normal samples. (C) UMAP visualization of
   cell type distribution stratified by tissue origin (tumor vs. normal).
   (D) UMAP plot showing subcluster identification of four fibroblast
   subtypes. (E) KEGG enrichment analysis of differentially expressed
   genes across fibroblast subtypes, highlighting key pathways such as
   IL-17 signaling and ECM–receptor interaction.

3.10. Fibroblast Differentiation Trajectory and Intercellular Communication

   To investigate the dynamic differentiation states of fibroblasts in
   breast cancer, pseudo-time trajectory analysis was conducted. The
   developmental progression of fibroblasts was inferred from early to
   late stages ([180]Figure 11A). Antigen-presenting CAFs were identified
   as the initial state, while classical CAFs represented the terminal
   differentiated state. Fate-determining genes associated with this
   transition were also identified ([181]Figure 11B). Among the 19 key
   prognostic genes, most maintained relatively stable expression across
   pseudo-time. However, HSP90AA1 exhibited an up–down expression pattern
   along the trajectory, while TAGLN2 remained stable during early
   differentiation and markedly increased at the later stages
   ([182]Supplementary Figure S9), indicating potential roles in
   fibroblast maturation and tumor microenvironment remodeling. Cell–cell
   communication analysis revealed frequent signaling interactions between
   fibroblasts and other cell populations, particularly dendritic cells
   ([183]Figure 12A,C). Ligand–receptor analysis identified the MIF (CD74
   + CXCR4) axis as a dominant interaction pathway in both B cell–T cell
   and epithelial cell–T cell interactions ([184]Supplementary Figure
   S10). Notably, within the collagen signaling pathway, fibroblasts
   played central roles as signal senders, receivers, mediators, and
   influencers, suggesting their importance in modulating the breast
   cancer microenvironment ([185]Figure 12B,D).

Figure 11.

   [186]Figure 11
   [187]Open in a new tab

   Pseudo-time differentiation trajectory and gene expression dynamics of
   fibroblasts. (A) Pseudo-temporal trajectory plot illustrating the
   differentiation states of fibroblasts, progressing from
   antigen-presenting CAFs to terminal-stage CAFs. Cells are arranged
   based on transcriptomic similarity, reflecting their progression along
   the developmental continuum. The circled numbers (1–4) indicate major
   branch points, representing key developmental transitions or cell fate
   decision events along the trajectory. (B) Heatmap of the top 50 genes
   dynamically expressed during pseudo-time. Columns represent cells
   ordered by pseudo-time, and rows represent genes. Expression levels are
   color-coded: blue (low), white (moderate), and red (high).

Figure 12.

   [188]Figure 12
   [189]Open in a new tab

   Intercellular communication landscape of the tumor microenvironment.
   (A,C) Left panels: number of signaling interactions among cell
   subpopulations. Right panels: total communication intensity. (B,D)
   Heatmap depicting the functional roles of cell types in the signaling
   pathway network. Roles include senders: cells primarily emitting
   signals; receivers: cells predominantly receiving signals; mediators:
   cells facilitating signal relay; and influencers: cells with a major
   regulatory impact on the pathway dynamics. Columns denote specific cell
   types.

3.11. IHC-Based Validation of Candidate Prognostic Genes in Normal and Tumor
Breast Tissues

   To validate the transcriptomic findings at the protein level, IHC
   staining data of 19 key prognostic genes were retrieved from the HPA
   database. These genes included AMPD1, COL17A1, DGAT1, G6PD, GLI1,
   HSP90AA1, HSPH1, IL33, JAK2, NUDT14, PCMT1, PSMB8, PTGES3, RELB,
   TAGLN2, UBTF, WT1, ZMAT3, and ZMIZ1. Representative IHC images from
   both normal breast tissue and breast cancer specimens were examined to
   assess protein expression levels, staining intensity, and spatial
   localization. Specifically, AMPD1 showed low cytoplasmic staining in
   normal breast epithelium, but markedly stronger cytoplasmic and
   membranous staining in tumor tissues ([190]Figure 13A,B). Similarly,
   G6PD ([191]Figure 13C,D) and HSP90AA1 ([192]Figure 13E,F) exhibited
   notably increased protein expression in tumor tissues compared to
   normal controls. These patterns suggest their possible roles in tumor
   progression and validate their prognostic relevance at the protein
   level. Conversely, several genes, such as ZMAT3 and DGAT1, demonstrated
   reduced or absent staining in tumor tissues relative to normal tissues,
   indicating potential tumor-suppressive functions ([193]Supplementary
   Figure S11). These protein-level observations agree with transcriptomic
   findings and provide additional evidence supporting the clinical
   relevance of these genes. Collectively, the IHC validation enhances the
   biological plausibility of the proposed prognostic markers and
   underscores their potential as therapeutic targets in breast cancer.

Figure 13.

   [194]Figure 13
   [195]Open in a new tab

   Immunohistochemical validation of selected key prognostic genes in
   breast cancer tissues. (A,B) IHC staining of AMPD1 in normal and breast
   cancer (BC) tissue, respectively. (C,D) IHC staining of G6PD in normal
   and BC tissues. (E,F) IHC staining of HSP90AA1 in normal and BC
   tissues. All three genes exhibited markedly elevated protein expression
   in breast cancer tissues compared to normal controls, consistent with
   transcriptomic findings. Scale bar = 200 μm. Each tissue core has a
   diameter of approximately 1 mm.

4. Discussion

   The treatment of early-stage breast cancer traditionally relies on
   adjuvant chemotherapy guided by clinicopathological features, such as
   tumor size, lymph node status, and histologic grade [[196]50]. However,
   recent advances in genomic testing now enable more precise treatment
   decisions and prognostic risk assessment in breast cancer patients
   [[197]51,[198]52]. The 21-gene test (Oncotype DX) is the only test
   proven to have both prognostic and efficacy-predicting functions,
   identifying which populations benefit from adjuvant chemotherapy
   [[199]53]. Since 2011, the 21-gene test has been widely adopted across
   various countries, included in decision-making models by the NCCN
   clinical guidelines, and recommended by ASCO and ESMO as a
   complementary diagnostic tool to pathological evaluation [[200]54].
   Although this approach has been widely recognized and applied, it is
   predominantly utilized for estrogen receptor-positive/HER2-negative
   (ER+/HER2-) breast cancer.

   In this study, we developed a robust prognostic model based on 19 key
   genes associated with telomere maintenance and cellular senescence. By
   integrating 101 algorithmic combinations from 10 machine learning
   approaches, including CoxBoost, elastic net, survival-SVM, lasso,
   partial least squares regression (plsRcox), ridge regression, random
   survival forests (RSF), stepwise Cox, supervised principal components
   (SuperPC), and GBM, we comprehensively evaluated performance across
   diverse modeling frameworks. The final model, built using the RSF +
   ridge combination, achieved the highest average C-index (0.65) and
   demonstrated consistently high time-dependent AUCs (>0.70), indicating
   its predictive accuracy and generalizability. Compared with previous
   prognostic models based on either single-method machine learning or
   genomics-only signatures, our integrative multi-algorithm approach
   offers several advantages. First, many prior studies focused solely on
   genomic alterations, immune-related features, or inflammation pathways
   [[201]55,[202]56]. While informative, such models often lack
   consideration of senescence- and telomere-associated biology, key
   hallmarks of aging and tumor progression. Secondly, some models rely on
   fewer genes, potentially compromising robustness, or include large
   multi-gene panels that may reduce clinical feasibility
   [[203]57,[204]58]. In contrast, our model balances interpretability and
   complexity by leveraging a biologically meaningful 19-gene panel
   derived from intersecting telomere- and aging-related signatures.
   Notably, our model outperformed 18 recently published models, including
   immune- and metabolism-based signatures, in terms of C-index across
   multiple cohorts, suggesting superior predictive capacity.

   However, despite its strengths, several limitations warrant discussion.
   The use of retrospective, publicly available datasets may introduce
   batch effects, clinical heterogeneity, and sample bias. Additionally,
   we relied on adjacent non-tumor tissue as a control, which may not
   fully reflect true normal biology due to stromal interaction or field
   effects. While our model was externally validated in multiple GEO
   cohorts, prospective clinical validation is necessary to confirm its
   translational utility. Moreover, functional interpretation remains
   limited by the lack of in vitro and in vivo experimental validation. To
   partially address this, we supplemented transcriptomic analysis with
   IHC validation from the Human Protein Atlas, which supported the
   differential expression of key genes at the protein level. We also
   conducted correlation analysis between gene expression and stromal
   scores, providing indirect support for the involvement of several genes
   in tumor–stroma interactions. These findings suggest that our model has
   potential value in breast cancer research and may be helpful in
   clinical decision-making, although further validation is needed to
   fully assess its clinical impact.

   Pathway enrichment analysis revealed several pathways and associated
   mechanisms potentially involved in breast cancer progression. Notably,
   significant enrichment was observed in the p53 signaling, MAPK
   signaling, cellular senescence, cell cycle, and DNA repair pathways.
   The p53 signaling pathway, known for its role in regulating cell cycle
   arrest, DNA repair, and apoptosis, plays a crucial role in maintaining
   genomic stability and suppressing tumorigenesis [[205]59]. Mutations in
   p53 remain the most common genetic alteration in human neoplasms and
   are strongly associated with more aggressive breast cancer and worse
   overall survival [[206]60]. Similarly, the MAPK signaling pathway,
   which regulates cell growth, differentiation, and survival, is
   frequently dysregulated in cancer, contributing to tumor progression
   and metastasis [[207]61,[208]62]. Recent studies have suggested that
   p53 can act as an upstream regulator, activating MAPK signaling via
   transcriptional activation of members of the dual-specificity
   phosphatase family [[209]63]. In addition to these, key biological
   processes such as fibroblast proliferation, cellular response to
   alkaloids, regulation of mitotic nuclear division, response to peptide
   hormones, and positive regulation of DNA replication were significantly
   enriched. Furthermore, chromatin silencing complexes, chromosomes,
   telomeric repeat regions, and DNA repair complexes also showed
   significant enrichment. Molecular functions such as single-stranded DNA
   binding, histone kinase activity, and promoter-specific chromatin
   binding were highlighted as well. These findings underscore the
   complexity of the molecular landscape driving breast cancer
   progression. The enriched pathways and key processes identified in this
   study not only provide insights into the mechanisms underlying
   tumorigenesis but also highlight potential targets for therapeutic
   intervention.

   TMB is widely recognized as a predictive biomarker for the efficacy of
   immune checkpoint inhibitors (ICIs), being associated with high
   neo-antigen burdens and T cell infiltration, as well as a higher
   response rate across various tumor types. While TMB has emerged as a
   useful biomarker for evaluating the effectiveness of immunotherapy in
   several cancer types, its role in breast cancer remains not fully
   understood [[210]64]. In this study, we analyzed mutation and CNV data
   from TCGA-BRCA to explore the genomic differences between the HRG and
   LRG in breast cancer. We observed that there was no significant
   difference in TMB between the two groups, suggesting that TMB may not
   be a distinguishing factor between the HRG and LRG in this cohort.
   However, a significant difference was found in TNB between the groups,
   indicating that TNB could serve as a potential biomarker for
   differentiating these groups. Moreover, TP53 was found to be the most
   frequently mutated gene in the LRG, which is consistent with previous
   studies highlighting its pivotal role in various cancer types. TP53 is
   a central gene involved in the pathogenesis of human cancers and is
   frequently mutated in almost all human malignancies [[211]65]. There is
   substantial evidence linking TP53 mutations with poorer overall
   survival and disease-free survival in breast cancer patients
   [[212]66,[213]67]. On the other hand, PIK3CA exhibited the highest
   mutation rate in the HRG, which aligns with its well-established role
   in breast cancer pathogenesis, particularly in
   estrogen-receptor-positive tumors. The alteration of the
   phosphoinositide-3-kinase (PI3K) pathway, often resulting from PIK3CA
   mutations, has been described as a key driver of tumorigenesis and
   hyperactivity of the PI3K pathway in several cancer types
   [[214]68,[215]69].

   In single-cell analysis, fibroblasts were found to be more abundant in
   breast cancer tumor tissues than in normal tissues, while epithelial
   cells were more prevalent in normal tissues. Further annotations
   revealed four fibroblast subtypes: antigen-presenting cancer-associated
   fibroblasts (antigen-presenting CAFs), cancer-associated fibroblasts
   (CAFs), myofibroblastic cancer-associated fibroblasts, and
   lipofibroblasts. Pathway analysis of differentially expressed genes
   revealed distinct expression patterns across these subtypes, with
   significant differences in pathways such as IL-17 signaling,
   ECM–receptor interactions, oxidative phosphorylation, and TNF
   signaling. Breast cancer is a heterogeneous disease, classified into
   three main subtypes: luminal (Lum), HER2-positive, and triple-negative
   (TN), based on histopathological and gene expression profiling
   [[216]70,[217]71]. Tumors are complex ecosystems influenced by various
   stromal factors that either enhance or inhibit the effects of genetic
   alterations in epithelial cells. While normal fibroblasts suppress
   tumor formation [[218]72], cancer-associated fibroblasts (CAFs)
   contribute to tumor progression by promoting cancer cell proliferation,
   invasion, neo-angiogenesis, inflammation, and ECM remodeling
   [[219]73,[220]74,[221]75]. In breast cancer, the abundance of stromal
   myofibroblasts (α-SMA-positive fibroblasts) correlates with aggressive
   adenocarcinomas and poor prognosis [[222]76,[223]77,[224]78].
   Furthermore, CAFs have been implicated in drug resistance
   [[225]79,[226]80] and in suppressing anti-tumor immunity
   [[227]81,[228]82]. Therefore, the exact functions and mechanisms of
   CAFs in cancer progression and immunotherapy remain to be fully
   elucidated.

   Pseudo-temporal analysis of fibroblasts revealed that HSP90AA1 (heat
   shock protein 90 alpha family class A member 1) and TAGLN2 (transgelin
   2) play significant roles in fibroblast function and behavior,
   particularly during differentiation. HSP90AA1, a molecular chaperone,
   is involved in the stabilization and proper folding of numerous client
   proteins, including those related to cell signaling, apoptosis, and
   stress responses [[229]83]. In fibroblasts, it is essential for
   maintaining cellular homeostasis, particularly under stress conditions
   [[230]84]. Researchers revealed that pretreatment plasma HSP90AA1
   combined with other markers could conveniently predict the risk of
   breast cancer onset and metastasis [[231]85]. In our study, HSP90AA1
   exhibited a dynamic expression pattern, increasing in the early stages
   of fibroblast differentiation and then decreasing at later stages,
   which may reflect its role in stabilizing key proteins for
   differentiation and cytoskeletal reorganization. The decrease in its
   expression during later stages may indicate a reduced need for
   chaperoning as fibroblasts mature.

   TAGLN2, an actin-binding protein, plays a critical role in regulating
   cell shape, motility, and contractility, particularly in smooth muscle
   cells and fibroblasts [[232]86]. It has also been identified as a tumor
   suppressor in breast cancer metastasis, where its effects may be
   mediated by redox signaling, positioning TAGLN2 as a critical regulator
   of the ROS/NF-κB pathway in breast cancer [[233]87]. In our analysis,
   TAGLN2 showed stable expression during the early stages of fibroblast
   differentiation, with a significant increase in the final stages. This
   upregulation suggests its involvement in the transition from a
   proliferative state to a more differentiated, contractile phenotype.
   The increased expression of TAGLN2 in the later stages likely reflects
   its role in fibroblast maturation, facilitating actin cytoskeleton
   reorganization and the formation of stress fibers. These fibers are
   essential for fibroblast contractility and ECM remodeling during tissue
   repair and fibrosis. Taken together, the dynamic expression patterns of
   HSP90AA1 and TAGLN2 in fibroblasts suggest that they play distinct
   roles in fibroblast differentiation and function. HSP90AA1 may support
   early differentiation by stabilizing proteins, while TAGLN2 likely
   contributes to the terminal stages through cytoskeletal remodeling and
   enhanced contractility. Our findings reveal new insights into the
   regulation of fibroblast differentiation and highlight TAGLN2 as a
   potential target for therapeutic strategies aimed at controlling
   fibroblast behavior in breast cancer.

   Although this study established a promising prognostic model based on
   19 key genes related to telomerase and cellular senescence in breast
   cancer, utilizing advanced machine learning algorithms, and provided
   valuable insights into the dynamic expression patterns of HSP90AA1 and
   TAGLN2 during fibroblast differentiation, several limitations must be
   considered. Firstly, while the model demonstrated strong predictive
   ability across different breast cancer subtypes, it was primarily
   developed using public genomic datasets, which may introduce sample
   selection bias. To enhance the generalizability and clinical relevance
   of these findings, future studies should aim to validate the model
   using large-scale, multicenter clinical cohorts. Secondly, despite the
   robust performance of the machine learning-based prognostic model, its
   clinical applicability needs further validation through prospective
   clinical trials to assess its accuracy and stability across diverse
   patient populations and clinical settings. Moreover, while the
   differential expression of HSP90AA1 and TAGLN2 was closely linked to
   fibroblast differentiation, the precise molecular mechanisms underlying
   these changes, particularly in the context of different breast cancer
   subtypes, remain unclear.

   It is important to acknowledge that the use of adjacent non-tumor
   breast tissues as a reference for identifying cancer-specific gene
   signatures has certain limitations. Although these tissues are
   histologically normal, they are inevitably influenced by
   tumor-associated stromal interactions and potential treatment effects,
   which may confound the identification of truly cancer-specific
   molecular changes. Ideally, comparisons should be made between tumor
   tissues and entirely normal breast tissues obtained from healthy
   individuals. However, due to ethical and practical constraints in
   sample collection, especially the limited availability of normal breast
   tissue datasets, such comparisons are often not feasible in large-scale
   studies. As a result, many previous studies utilizing TCGA and similar
   cohorts have also adopted tumor versus adjacent tissue comparisons to
   explore cancer-associated gene expression profiles
   [[234]88,[235]89,[236]90]. In line with these established
   methodologies, our study also employs this approach, while
   acknowledging the limitations and potential implications for the
   interpretation of our findings.

   To evaluate the predictive performance of our model, we employed the
   C-index, a widely accepted metric in survival analysis. Unlike
   traditional classification metrics such as accuracy, precision, recall,
   or F1 score, which require binary outcomes and predefined thresholds,
   the C-index is threshold-independent and well-suited for censored
   time-to-event data [[237]91]. Its application enabled a direct
   comparison of prognostic accuracy across heterogeneous cohorts in our
   study. However, while the C-index provides a robust global assessment
   of a model’s discriminative ability, it does not account for other
   important aspects of performance, such as calibration, time-specific
   predictive accuracy, or clinical utility in decision-making. To address
   these limitations, we supplemented our evaluation with additional
   metrics, including time-dependent area under the curve (AUC) and
   decision curve analysis (DCA), which offer complementary insights into
   model performance and potential clinical applicability, particularly
   when applied to prospective or real-world datasets.

   Future research should prioritize the integration of spatial
   transcriptomics and single-cell proteomics, which would allow for
   compartment-specific gene expression mapping. Additionally, functional
   assays (CRISPR/Cas9 or RNAi knockdown) and patient-derived xenograft
   models are needed to validate the mechanistic roles of key genes.
   Implementation of the model in prospective clinical trials could
   further assess its utility in guiding personalized treatment decisions.
   Another critical consideration is the molecular heterogeneity of breast
   cancer, which remains a significant challenge. While we identified
   several enriched pathways involved in breast cancer progression, the
   tumor’s complexity requires further investigation. Integrating
   single-cell transcriptomics and spatial omics approaches will be
   essential for understanding the roles of fibroblast subtypes and their
   interactions with cancer cells within the tumor microenvironment. In
   conclusion, this study paves the way for further exploration of
   telomere-related and fibroblast-related biomarkers and therapeutic
   targets in breast cancer. By expanding upon these findings through
   further mechanistic studies and clinical validation, we hope to
   contribute to the development of more personalized, effective
   diagnostic and therapeutic strategies for breast cancer patients,
   ultimately advancing precision medicine in oncology.

5. Conclusions

   This study integrated bulk and single-cell RNA sequencing data to
   investigate the prognostic relevance of telomere-related and cellular
   senescence-associated gene signatures in breast cancer. Through a
   robust machine learning framework, we identified 19 key genes
   implicated in telomere maintenance and cellular senescence. These genes
   were consistently validated across multiple independent cohorts,
   demonstrating strong predictive accuracy and model reliability.
   Functional analyses indicated that the gene signature is closely
   associated with immune regulation and senescence-related pathways,
   highlighting its potential role in shaping the tumor microenvironment.
   A prognostic model constructed from these genes showed excellent
   performance in both training and validation datasets, offering a
   promising tool for stratifying patient risk and guiding clinical
   decision-making. Furthermore, immunohistochemical validation using the
   Human Protein Atlas supported the protein-level expression of several
   key genes, reinforcing their biological and clinical significance.
   Overall, our findings underscore the dual functional roles of telomere
   dynamics and cellular senescence in breast cancer progression and
   provide a foundation for developing personalized therapeutic
   strategies. While further experimental validation is warranted, this
   study offers a valuable resource for precision oncology in breast
   cancer management.

Acknowledgments