Abstract

Background

   Endometriosis and breast cancer are significant global health burdens
   affecting women worldwide. Both conditions share notable
   characteristics including estrogen dependence, progressive growth
   patterns, recurrence tendencies, and metastatic potential. Despite
   these biological parallels, the molecular mechanisms connecting these
   conditions remain incompletely characterized. This study aimed to
   identify shared gene signatures and underlying molecular processes in
   breast cancer and endometriosis.

Methods

   Expression matrices for both conditions were obtained from the Gene
   Expression Omnibus (GEO), UCSC Xena, and the Molecular Taxonomy of
   Breast Cancer International Consortium. Common differentially expressed
   genes (DEGs) were identified using the limma package. Comprehensive
   analyses included Gene Ontology (GO) and Kyoto Encyclopedia of Genes
   and Genomes (KEGG) pathway enrichment, machine learning-based
   diagnostic and prognostic model development, potential therapeutic
   compound screening, tumor immune microenvironment (TIME)
   characterization, and hub gene identification with subsequent
   validation.

Results

   The analysis identified 47 common DEGs between breast cancer and
   endometriosis. Functional assessment of these genes revealed their
   involvement in critical biological processes including cell cycle
   regulation, oxidative stress response, and secretory granule and
   recycling endosome dynamics. Integration of comprehensive genomic and
   clinical data led to the development of a prognostic model for breast
   cancer and a diagnostic model for endometriosis.

Conclusion

   This study provides molecular insights into shared pathogenic
   mechanisms underlying breast cancer and endometriosis, highlighting
   common physiological pathways and key regulatory genes. These findings
   offer novel perspectives for understanding disease pathogenesis and
   potential therapeutic interventions for both conditions.

Supplementary Information

   The online version contains supplementary material available at
   10.1007/s12672-025-02887-4.

   Keywords: Breast cancer, Endometriosis, Multi-omics analysis, Machine
   learning, Hub genes

Introduction

   Endometriosis and breast cancer are major global health challenges for
   women. Endometriosis affects 5–10% of reproductive-age women, with over
   176 million cases worldwide [1, 2]. Characterized by ectopic
   endometrial tissue growth, this condition manifests as pelvic pain,
   dysmenorrhea, and infertility [3]. It is found in 50–80% of women with
   pelvic pain and in up to 50% of those experiencing fertility
   difficulties [2, 4]. The pathogenesis primarily involves retrograde
   menstruation, wherein endometrial fragments flow into the peritoneal
   cavity, where they implant and infiltrate pelvic structures [5, 6].
   Additional contributing factors include obstructed menstrual flow,
   extended estrogen exposure (from early menarche or late menopause),
   genetic predisposition, immune dysfunction, and lifestyle factors.
   Because endometriosis is an estrogen-dependent chronic inflammatory
   disorder, molecular alterations in estrogen signaling and inflammatory
   pathways facilitate both implantation and proliferation of abnormal
   endometrial tissue [7].

   Diagnosis of endometriosis typically involves pelvic examination and
   ultrasound imaging, though laparoscopy with histopathological
   confirmation remains the gold standard despite risks including trauma,
   adhesion formation, and potential impacts on fertility [[36]8]. The
   biomarker CA125, while elevated in advanced disease, lacks sensitivity
   for early detection. The absence of reliable peripheral blood or
   endometrial tissue biomarkers, coupled with the requirement for
   invasive surgical procedures, often delays diagnosis by 7–11 years,
   hampering timely intervention [[37]9]. Addressing these limitations is
   essential for developing non-invasive diagnostic approaches and
   elucidating the fundamental mechanisms of endometriosis.

   Breast cancer accounted for 11.7% of all global cancer cases in 2020,
   with approximately 2.3 million new diagnoses, representing a leading
   cause of mortality among women [[38]10]. Risk factors include advancing
   age, genetic predisposition, history of benign breast disease,
   endogenous hormone exposure, fertility issues, obesity, and radiation
   exposure [[39]11]. Diagnostic evaluation comprises comprehensive
   clinical assessment and detailed imaging (mammography, breast
   ultrasound), typically confirmed by core biopsy before treatment
   planning [[40]12]. Research has classified breast cancer into four
   major molecular subtypes through gene clustering analysis [[41]13]:
   luminal, human epidermal growth factor receptor 2 (HER2)-enriched,
   basal-like, and normal breast-like. At the RNA level, subtype
   differentiation primarily depends on estrogen receptor (ER) activity,
   ER-associated genes, proliferation drivers, and to a lesser extent,
   HER2 and genes within the HER2 amplicon on chromosome 17 [[42]14].
   Treatment strategies based on diagnostic findings typically include
   surgery, radiotherapy, chemotherapy, targeted therapy, and endocrine
   treatment [[43]12, [44]15]. The heterogeneity of breast cancer is
   reflected in its multiple clinically relevant mutations, with molecular
   characterization of metastatic disease and subsequent targeted therapy
   assessed through next-generation sequencing and mutation analysis,
   potentially improving prognosis and survival.

   Endometriosis and breast cancer share several significant
   characteristics and risk factors, including estrogen dependence,
   progressive growth patterns, invasiveness, recurrence, and metastatic
   potential [[45]16]. Elevated estrogen levels in ectopic lesions of
   endometriosis patients [[46]17] and endogenous hormone exposure both
   contribute to increased breast cancer risk. The infertility associated
   with endometriosis often results in nulliparity or delayed
   childbearing, established risk factors for breast cancer [[47]18].
   Moreover, common treatments for endometriosis, such as progestins and
   oral contraceptives, may influence breast health [[48]19]. While
   research has established a significant association between
   endometriosis and increased risk of epithelial ovarian cancer [[49]5],
   evidence linking endometriosis to breast cancer progression remains
   inconclusive. Further investigation is needed to elucidate the
   underlying pathological connections and identify shared genetic markers
   between these conditions, potentially revealing common drug targets and
   improving treatment strategies for both diseases.

   Biomarkers for endometriosis and breast cancer that combine high
   sensitivity with high specificity are still lacking.
   Understanding the biological pathways and molecular networks underlying
   these diseases is essential for effective screening, prevention,
   diagnosis, and treatment. In this study, we analyzed datasets from the
   Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), and the
   Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)
   to investigate the relationship between shared differentially expressed
   genes in both diseases and their impact on endometriosis diagnosis and
   breast cancer prognosis. Using machine learning algorithms, we
   identified 11 signature genes predictive of endometriosis and
   constructed a three-gene model for breast cancer prognosis. This model
   was validated with both internal and external datasets, confirming its
   stability and reliability in predicting outcomes for breast cancer
   patients. Our findings suggest potential novel biomarkers for
   endometriosis diagnosis and breast cancer prognostication, while also
   highlighting possible therapeutic targets.

Materials and methods

Data acquisition

   Datasets for endometriosis and breast cancer were obtained from
   multiple platforms. Two endometriosis datasets, [50]GSE51981 [[51]20]
   and [52]GSE35287 [[53]21], were acquired from the NCBI GEO. The
   [54]GSE51981 dataset, generated using the Affymetrix Human Genome U133
   Plus 2.0 array ([55]GPL570), contained 77 samples from endometriosis
   patients and 71 samples from healthy controls. The [56]GSE35287
   dataset, used for external validation, was produced with the Affymetrix
   Human Gene 1.0 ST Array ([57]GPL6244) and included 40 endometriosis and
   40 normal samples.

   Breast cancer datasets from TCGA and METABRIC were obtained from
   cBioPortal [22] and UCSC Xena [23]. Both datasets were generated on
   Illumina platforms. The TCGA cohort comprised 1050 tumor and 98 normal
   samples, and the METABRIC cohort, which served as the external
   validation cohort, contained 1980 tumor samples.

Data preprocessing

   Data from GEO were processed according to previously described methods
   using the “GEOquery” R package [[60]24]. Gene probes were annotated
   with gene symbols, and probes lacking symbols or matching multiple
   symbols were excluded. For duplicate gene symbols, the maximum
   expression value was retained.
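
   As a rough illustration of the probe-filtering step above, the
   following Python sketch collapses probe-level data to gene level. It
   is not the R code used in this study: the function name
   collapse_probes, the dictionary-based inputs, and the toy values are
   hypothetical, and the sketch assumes that "maximum expression value"
   means keeping the probe with the highest mean expression across
   samples.

```python
def collapse_probes(expr, annotation):
    """Collapse probe-level expression to one row per gene symbol.

    expr:       dict probe_id -> list of expression values (one per sample)
    annotation: dict probe_id -> list of gene symbols mapped to that probe
    """
    best = {}  # gene symbol -> (mean expression, values of the kept probe)
    for probe, values in expr.items():
        symbols = annotation.get(probe, [])
        # Exclude probes with no symbol or with more than one symbol.
        if len(symbols) != 1:
            continue
        gene = symbols[0]
        mean = sum(values) / len(values)
        # Assumption: among duplicate symbols, keep the probe with the
        # highest mean expression.
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}


# Toy input with hypothetical probe IDs and gene names.
expr = {
    "p1": [5.0, 6.0],
    "p2": [7.0, 8.0],   # second probe for GENE_A, higher mean
    "p3": [1.0, 2.0],   # maps to two symbols, so it is dropped
    "p4": [3.0, 3.0],   # no symbol, so it is dropped
}
annotation = {"p1": ["GENE_A"], "p2": ["GENE_A"], "p3": ["GENE_B", "GENE_C"], "p4": []}
gene_expr = collapse_probes(expr, annotation)
```

   In this toy input only GENE_A survives, represented by probe p2.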

DEG screening and functional analysis

   DEGs were identified in the TCGA-breast cancer and GSE51981 datasets
   using the “limma” package [25]. Genes with an absolute log fold change
   (|logFC|) greater than 1 and an adjusted P-value below 0.05 were
   considered differentially expressed. DEGs common to both datasets were
   visualized with a Venn diagram, and their expression patterns were
   displayed in a heatmap generated using R. Functional enrichment of
   these genes was analyzed through Gene Ontology (GO) and Kyoto
   Encyclopedia of Genes and Genomes (KEGG) pathways using Metascape
   [26], with a minimum overlap of 3 and a minimum enrichment factor of
   1.5. Enrichment results with a P-value below 0.01 were considered
   statistically significant.
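
   The thresholding and intersection logic described above can be
   sketched in a few lines of Python. This is only an illustration under
   stated assumptions: the function select_degs and the two small result
   tables are hypothetical stand-ins for limma output, and the gene names
   and numbers are invented placeholders rather than results from this
   study.

```python
def select_degs(table, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return genes passing |logFC| > lfc_cutoff and adjusted P < padj_cutoff.

    table: dict gene -> (logFC, adjusted P-value)
    """
    return {
        gene
        for gene, (lfc, padj) in table.items()
        if abs(lfc) > lfc_cutoff and padj < padj_cutoff
    }


# Hypothetical differential-expression tables (placeholder values).
breast_cancer = {"G1": (2.1, 1e-8), "G2": (-3.0, 1e-5), "G3": (0.4, 1e-3), "G4": (-1.8, 0.2)}
endometriosis = {"G1": (1.3, 0.01), "G2": (-1.5, 0.03), "G5": (1.9, 0.001), "G4": (-2.2, 1e-4)}

# Common DEGs are the intersection of the two filtered sets (the Venn overlap).
common_degs = select_degs(breast_cancer) & select_degs(endometriosis)
```

   Here G3 fails the fold-change cutoff and G4 fails the adjusted P-value
   cutoff in the first table, so only G1 and G2 are shared.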

Characteristic genes in endometriosis

   To identify distinctive genes associated with endometriosis, three
   complementary machine learning techniques were employed: Random Forests
   (RF), Least Absolute Shrinkage and Selection Operator (LASSO) logistic
   regression, and Support Vector Machine-Recursive Feature Elimination
   (SVM-RFE). These methods were selected for their distinctive strengths:
   LASSO for feature selection and regularization to prevent overfitting,
   SVM-RFE for effective ranking of gene features, and RF for robust
   handling of complex interactions. The RF technique was implemented
   using the “randomForest” package [27]. LASSO logistic regression was
   conducted with the “glmnet” package [28]; the penalty parameter
   (lambda) was tuned by tenfold cross-validation, and the value that
   minimized the partial likelihood deviance was taken as optimal. Genes
   identified by all three methods were selected for further analysis. A
   diagnostic nomogram predicting endometriosis occurrence was generated
   using the “rms” package. The GSE35287 dataset served as the validation
   set, with model effectiveness evaluated through receiver operating
   characteristic (ROC) curves and the area under the curve (AUC). The
   predictive power and clinical utility of the model were further
   assessed using the concordance index (C-index), calibration curves,
   and decision curve analysis (DCA).
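
   To make the consensus-selection and AUC steps concrete, the Python
   sketch below intersects three hypothetical gene lists and computes an
   AUC from predicted scores. It is a minimal illustration, not the
   pipeline used here: the gene names, scores, and labels are invented,
   and roc_auc is a hypothetical helper that uses the rank-based
   definition of AUC (the probability that a randomly chosen case scores
   higher than a randomly chosen control, with ties counted as one half).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: predicted risk for each sample
    labels: 1 for case, 0 for control
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties contribute one half
    return wins / (len(cases) * len(controls))


# Hypothetical feature sets returned by the three selection methods.
rf_genes = {"G1", "G2", "G3"}
lasso_genes = {"G2", "G3", "G4"}
svm_rfe_genes = {"G2", "G3", "G5"}
signature = rf_genes & lasso_genes & svm_rfe_genes  # genes chosen by all three

# Hypothetical model outputs on a validation set.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

   With these toy values, eight of the nine case-control pairs are
   ranked correctly, so the AUC is 8/9.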

Establishing prognostic markers in breast cancer

   The prognostic relevance of common DEGs was initially assessed through
   univariate Cox regression analysis, with significance defined at p <
   0.05. The prognostic gene set was refined using the stepwise Akaike
   information criterion (stepAIC) method implemented in the “MASS”
   package. Individual patient risk scores were derived using the
   following equation:

   $$\text{Risk score} = \sum_{i=1}^{n} \text{Exp}_i \times \text{Coef}_i$$

   where $\text{Exp}_i$ and $\text{Coef}_i$ are the normalized expression
   level and corresponding regression coefficient of candidate gene $i$,
   respectively, and $n$ is the number of genes in the signature. Patients
   were stratified into high- and low-risk categories based on the median
   risk score as the cutoff value. The efficacy of the gene signature was
   evaluated through Kaplan–Meier survival plots and ROC curve analyses
   using the “survminer”, “survival”, and “survivalROC” packages. The
   prognostic independence of
   the risk score from other clinical variables in breast cancer patients
   was determined through both univariate and multivariate Cox regression
   analyses.
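
   The risk score and median split described above amount to a
   coefficient-weighted sum followed by a threshold. The Python sketch
   below illustrates this using the three coefficients reported in the
   Results; the patient identifiers and expression values are invented,
   the helper names are hypothetical, and assigning scores equal to the
   median to the low-risk group is an assumption, since the handling of
   ties is not specified.

```python
from statistics import median

# Coefficients of the three-gene signature as reported in the Results.
COEFS = {"SHCBP1": 0.2857, "PMAIP1": -0.1610, "LTF": -0.0534}


def risk_score(expression, coefs=COEFS):
    """Sum of (normalized expression x regression coefficient) over signature genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def stratify(scores):
    """Split patients at the median risk score (ties go to 'low': an assumption)."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical normalized expression values for four patients.
patients = {
    "P1": {"SHCBP1": 2.0, "PMAIP1": 1.0, "LTF": 0.0},
    "P2": {"SHCBP1": 0.0, "PMAIP1": 1.0, "LTF": 2.0},
    "P3": {"SHCBP1": 1.0, "PMAIP1": 0.0, "LTF": 1.0},
    "P4": {"SHCBP1": 3.0, "PMAIP1": 0.0, "LTF": 0.0},
}
scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = stratify(scores)
```

   Because SHCBP1 carries the only positive coefficient, the two
   patients with the highest SHCBP1 expression (P1 and P4) fall into the
   high-risk group in this toy example.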

Prognostic characteristics of the tumor microenvironment

   This study compared genomic alterations, gene expression patterns,
   immune microenvironment composition, hypoxia status, tumor stemness
   scores, and biological functions between risk groups. We used
   “maftools” and cBioPortal to analyze gene mutations. The abundance of
   immune cells in each patient sample was determined by single-sample
   gene set enrichment analysis (ssGSEA), using marker genes for 28
   distinct immune cell types as reference [[67]29]. Hypoxia scores were
   obtained from cBioPortal, and drug sensitivities were predicted using
   “oncoPredict” [[68]30].

   Tumor stemness was evaluated using 26 gene sets from StemChecker
   [31], using ssGSEA, run via GSVA, to derive stemness enrichment
   scores. Gene set variation analysis (GSVA) was performed using
   hallmark gene sets from MSigDB v7.5, and the resulting enrichment
   scores, reflecting pathway activity in individual samples, were
   compared between risk groups using the Wilcoxon rank-sum test.
   Differential gene expression analysis was also performed between the
   high- and low-risk groups; DEGs were identified using thresholds of
   |logFC| > 1 and FDR < 0.05 and then underwent GO/KEGG pathway analysis
   using Metascape.

Statistical evaluation methods

   All statistical analyses were performed using R (version 4.3.1).
   Prognostic outcomes and survival rates across patient subgroups were
   analyzed using Kaplan-Meier survival plots and the log-rank test.
   Normality of data distribution was evaluated using the Shapiro-Wilk
   test. Due to significant deviation from normal distribution in most
   variables, non-parametric statistical methods were selected for
   between-group comparisons. The Wilcoxon rank-sum test was used for
   two-group comparisons, while the Kruskal-Wallis test was applied for
   analyses involving multiple groups. The prognostic significance of
   clinical characteristics within high- and low-risk groups was
   determined using both univariate and multivariate Cox regression
   analyses, conducted via the “survival” package in R.
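
   For readers less familiar with the survival analysis mentioned above,
   the following Python sketch shows how a Kaplan–Meier curve and its
   median survival time are obtained from follow-up times and event
   indicators. It is a simplified illustration with invented data and
   hypothetical function names, not a substitute for the “survival”
   package, and it omits confidence intervals and the log-rank test.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


def median_survival(curve):
    """First time at which the survival probability drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


# Hypothetical follow-up data in months.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
median_months = median_survival(curve)
```

   In this toy cohort the estimated survival falls from 5/6 to 2/3 to
   4/9, so the median survival time is reached at 12 months.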

qRT-PCR methodology

   Total RNA was isolated from cellular samples using TRIpure reagent
   (ELK Biotechnology). Reverse transcription was performed using
   EntiLink™ 1st Strand cDNA Synthesis Super Mix with the following
   temperature profile: 5 min at 25 °C, 30 min at 42 °C, and 5 min at
   85 °C. Quantitative real-time PCR (qRT-PCR) was conducted using a
   real-time PCR system (Applied Life Technologies, USA), with relative
   expression levels calculated using the 2^-ΔΔCT method. The following
   primers were used for targeted gene amplification:

   H-ACTIN
     * Forward:
       GTCCACCGCAAATGCTTCTA
     * Reverse:
       TGCTGTCACCTTCACCGTTC

   H-SHCBP1
     * Forward:
       GGTGCTGGTATAGAAATCTACCCT
     * Reverse:
       GTTTCACCAAGACAACACCATAAC

   H-PMAIP1
     * Forward:
       GTGCTACTCAACTCAGGAGATTTG
     * Reverse:
       TCTTTCTTCAAATTGATGAAACGT

   H-LTF
     * Forward:
       TGCAAATTTGATGAATATTTCAGTC
     * Reverse:
       CATTGTTATTTCCATCAGTGTTCTG
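
   The 2^-ΔΔCT calculation used for relative quantification can be
   written out as a short Python function. The sketch below is
   illustrative only: the function name and the CT values are
   hypothetical, and it assumes that ACTIN serves as the reference gene
   and that amplification efficiency is close to 100%, which is the
   standard assumption behind this method.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^-ddCT method.

    dCT  = CT(target) - CT(reference), computed separately for sample and control
    ddCT = dCT(sample) - dCT(control)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return 2 ** (-delta_delta)


# Hypothetical CT values: target gene vs. reference gene (assumed ACTIN).
fold_change = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,    # tumor cells
    ct_target_control=26.0, ct_ref_control=18.0,  # control cells
)
```

   With these invented values the target gene reaches threshold two
   cycles earlier in the sample than in the control after normalization,
   which corresponds to a fourfold higher relative expression.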

Western blotting

   Cells were lysed using Aspen buffer for total protein extraction.
   Proteins were separated by SDS-PAGE and transferred to PVDF membranes.
   Membranes were blocked with 5% skim milk and incubated with primary
   antibodies: SHCBP1 (No: 12672-1-AP, 1:1000, Proteintech), PMAIP1 (No:
   PA5-19977, 1:500, Thermo Fisher), LTF (No: 10933-1-AP, 1:1000,
   Proteintech), and GAPDH (No: ab181602, 1:10000, Abcam). After
   washing, the membranes were incubated with secondary antibodies
   (1:10000, Aspen). Protein bands were visualized, scanned, and
   documented. For both PCR and western blotting (WB) experiments, each
   gene was analyzed in duplicate and all experiments were performed in
   triplicate. Neither PCR nor WB procedures were conducted under blinded
   conditions. Statistical analysis was performed using SPSS. Differences
   between groups were assessed using one-way ANOVA and Student’s t-test,
   with P < 0.05 considered statistically significant.

Results

Identification of common genes associated with endometriosis and breast
cancer

   Differential expression analysis identified 1,600 DEGs between breast
   cancer and normal tissue samples in the TCGA-breast cancer cohort, and
   179 DEGs between endometriosis and normal tissues in the GSE51981
   cohort (Fig. 1A, B). Of these, 47 genes were common to both
   endometriosis and breast cancer (Fig. 1C). Expression profiles of
   these 47 genes were characterized for both cohorts (Fig. 1D, E).
   GO/KEGG pathway analysis demonstrated enrichment in biological
   processes including chromosome segregation, cell cycle regulation,
   positive regulation of cell cycle phase transition, oxidative stress
   response, muscle cell development, and secretory granule and recycling
   endosome dynamics (Fig. 1F, G).

Fig. 1.

   Differential expression analysis. A Volcano plot from the differential
   analysis of the breast cancer and normal groups. B Volcano plot from
   the differential analysis of the endometriosis and normal groups. C
   Venn diagram of the DEGs shared by breast cancer and endometriosis. D
   Heat map from the differential analysis of the breast cancer and
   normal groups. E Heat map from the differential analysis of the
   endometriosis and normal groups. F, G GO term and KEGG pathway
   enrichment analysis of the common DEGs. GO, Gene Ontology; KEGG, Kyoto
   Encyclopedia of Genes and Genomes

Selection of endometriosis signature genes using machine learning algorithms

   Endometriosis biomarkers were identified using three machine learning
   algorithms: RF, SVM-RFE, and LASSO regression. The RF model identified
   22 genes (Fig. 2A), SVM-RFE identified 43 genes (Fig. 2B), and LASSO
   analysis yielded 18 genes (Fig. 2C, D). Intersection of these results
   revealed 11 core biomarkers (OLFM4, APOBEC3B, BPIFB1, CPM, MSRB3,
   EZH2, SCGB3A1, F13A1, PTGER3, FOS, and RCAN1) (Fig. 2E). Using the
   “rms” package, we constructed a diagnostic nomogram for endometriosis
   (Fig. 2F). Calibration curves showed minimal deviation between
   predicted and actual risk, supporting the model’s accuracy (Fig. 3A,
   B). DCA demonstrated that this model provided a net benefit compared
   to alternative strategies (Fig. 3C, D). The model achieved AUC values
   of 0.896 in the training set (GSE51981) and 0.988 in the external
   validation set (GSE35287) (Fig. 3E, F). These findings support the
   predictive performance of the diagnostic model.

Fig. 2.

   Detection of diagnostic markers using machine-learning algorithms in
   endometriosis. A Biomarker screening with the RF algorithm. B Biomarker
   screening with SVM-RFE. C, D Diagnostic marker screening with LASSO
   logistic regression. E Venn diagram of the diagnostic markers
   identified by all three algorithms. F Nomogram for predicting the
   occurrence of endometriosis

Fig. 3.

   Verification of nomogram model for endometriosis. A, B Construction of
   the calibration curve for assessing the predictive efficiency of the
   nomogram model in both A [91]GSE51981 and B [92]GSE35287. C, D Decision
   curve analysis of risk prediction nomogram for endometriosis in both C
   [93]GSE51981 and D [94]GSE35287. E, F ROC curve validation of risk
   prediction nomogram for endometriosis in both E [95]GSE51981 and F
   [96]GSE35287

Development and evaluation of breast cancer prognostic models

   We developed a prognostic model for breast cancer using univariate Cox
   regression analysis, which initially identified five genes with
   significant prognostic impact (p < 0.05). Further refinement through
   stepAIC analysis yielded three key prognostic genes. The risk score was
   calculated as: Risk score = (0.2857) × SHCBP1 + (−0.1610) × PMAIP1 +
   (−0.0534) × LTF (Supplementary Fig. [97]1). The median risk score
   served as the cutoff point to stratify patients into high- and low-risk
   groups, and was applied consistently in the external validation cohort
   (METABRIC) to assess model generalizability. Based on median signature
   values, 525 patients were categorized into high- or low-risk groups. In
   the TCGA cohort, the low-risk (LR) group demonstrated significantly
   longer overall survival (OS) than the high-risk (HR) group (median
   duration 215.0 months vs. 115.0 months, p < 0.0001, Fig. [98]4A). Lower
   risk scores consistently correlated with improved survival
   (Fig. [99]4C). The model’s robustness was confirmed in the independent
   METABRIC cohort, where LR patients also exhibited superior OS (median
   time = 167.0 months vs. 145.0 months, P = 0.02, Fig. [100]4B). These
   validation findings confirmed the efficacy of the model across multiple
   datasets. The distribution of risk scores and survival status in the
   METABRIC cohort is shown in Fig. [101]4D. Both univariate and
   multivariate Cox regression analyses confirmed that the prognostic risk
   score was independent of other clinical characteristics including age,
   stage, TNM classification, and radiation therapy in the TCGA-breast
   cancer cohort (Fig. [102]4E, F).

Fig. 4.

   Construction and validation of a prognosis signature for breast cancer.
   A, B Overall survival of patients in the low- and high-risk score
   groups in A TCGA-breast cancer and B METABRIC. C, D Distribution of
   risk score according to survival status and time in C TCGA-breast
   cancer and D METABRIC. E Univariate analysis of the clinicopathologic
   characteristics and risk score in TCGA-breast cancer. F Multivariate
   analysis of the clinicopathologic characteristics and risk score in
   TCGA-breast cancer. StepAIC: stepwise Akaike information criterion

Association between cancer hallmarks and risk groups

   We examined correlations between risk scores and immune responses by
   measuring enrichment scores for immune cell subsets and their
   associated activities through ssGSEA. The LR group showed greater
   infiltration by eosinophils, mast cells, natural killer (NK) cells,
   neutrophils, and plasmacytoid dendritic cells (Fig. 5A). In contrast,
   the HR group displayed elevated levels of activated CD4 and CD8 T
   cells, effector memory CD4 T cells, γδ T cells, and regulatory T cells
   (Fig. 5A). Expression of immune checkpoint genes also varied
   significantly with risk score. Patients in the LR group exhibited
   increased expression of NRP1, CD200, and CD44, while those in the HR
   group showed elevated levels of CD276, IDO1, PDCD1LG2, and TNFRSF9
   (Fig. 5B). Cancer stem cell assessment using 26 stemness gene sets
   revealed higher enrichment scores in the HR group (Fig. 5C).
   Additionally, HR patients demonstrated elevated hypoxia scores
   (Fig. 5D) and higher non-synonymous tumor mutation burden (TMB)
   (Fig. 5E). Analysis of the 15 most frequently mutated genes revealed
   distinct mutation patterns between risk groups (Fig. 5F), with
   significant differences observed for PIK3CA (22% in HR vs. 44% in LR)
   and TP53 (50% in HR vs. 17% in LR) (Fig. 5G). Further genomic analyses
   showed that the HR group had a significantly higher fraction of genome
   altered (FGA) and distinctive copy number variation (CNV) patterns
   compared to the LR group (Fig. 5H, I).

Fig. 5.

   Dissection of tumor microenvironment based on prognosis signature. A
   Box plot of ssGSEA enrichment scores for 28 infiltrating immune cell
   types. B Box plot of expression levels of immune
   checkpoint-associated genes. C Box plot of the differences in 26
   ssGSEA stemness scores between the low- and high-risk groups. D Violin
   plot showing significantly increased hypoxia scores in high-risk
   patients. E Comparison of tumor mutation burden (TMB). F Oncoplot of
   mutation, deletion, insertion, and frameshift. G Comparison of
   mutation sites of TP53 and PIK3CA. H Fraction of genome altered (FGA)
   scores in the risk groups. I Copy number variation (CNV) patterns in
   the risk groups. * p < 0.05; ** p < 0.01; *** p < 0.001; ****
   p < 0.0001

Efficacy of prognostic signature in predicting drug sensitivity

   We evaluated associations between our prognostic model and drug
   responsiveness by estimating IC50 values for various therapeutic
   agents in breast cancer samples. Differences in IC50 values indicated
   that drug sensitivity varied with risk group (Fig. 6A). In the HR
   group, higher IC50 values for Lapatinib, Temsirolimus, and Vinorelbine
   indicated resistance to these agents, whereas lower IC50 values for
   Cisplatin, Paclitaxel, and Rapamycin suggested sensitivity
   (Fig. 6B-G). These findings highlighted the potential utility of
   Cisplatin, Paclitaxel, and Rapamycin in treating chemotherapy-resistant
   breast cancer.

Fig. 6.

   Efficacy of prognosis signature in predicting drug sensitivity. A
   Bubble plot of the relationship between drugs and model genes. B–G
   Box plots comparing drug IC50 between high- and low-risk groups, and
   correlation between IC50 and risk score in the TCGA-breast cancer
   cohort: B Lapatinib; C Temsirolimus; D Vinorelbine; E Cisplatin; F
   Paclitaxel; G Rapamycin

Biological characteristics between risk groups

   Analysis of the prognostic gene model revealed distinct biological
   characteristics between risk groups. Differential expression analysis
   identified 91 genes, visualized in a volcano plot (Fig. [120]7A). A
   protein-protein interaction (PPI) network constructed using the
   Metascape database with the MCODE plug-in (minimum interaction score of
   0.7) identified two critical functional modules (Fig. [121]7B). GO/KEGG
   pathway analysis linked these genes to diverse biological processes
   including cell cycle phase transition, mitotic cell cycle regulation,
   immune response, epithelial cell differentiation, inflammatory
   response, neuronal apoptotic regulation, supramolecular fiber
   organization, and cortical actin cytoskeleton dynamics
   (Fig. [122]7C-D). GSVA demonstrated significant associations between
   the HR group and DNA damage repair and cell cycle-related functions
   (Fig. [123]7E).

Fig. 7.

   Biologic functions underlying the breast cancer prognostic model. A
   Volcano plot of DEGs (FDR < 0.05 and |log2FC| > 1) between the high-
   and low-risk groups. B PPI network of DEGs between the high- and
   low-risk groups, based on the Metascape website. C, D GO term and KEGG
   pathway enrichment analysis of the DEGs. E Heatmap of GSVA results
   showing different biological functions between the high- and low-risk
   groups. GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and
   Genomes

Validation of breast cancer prognostic gene expression through qRT-PCR and WB

   Expression levels of key prognostic genes were validated using both
   qRT-PCR and WB in breast cancer and control samples. Results confirmed
   significantly higher expression of SHCBP1 and PMAIP1 in breast cancer
   samples, while LTF expression was markedly decreased (Fig. [126]8A-E).
   These findings reinforced the potential utility of these genes as
   biomarkers for predicting breast cancer outcomes.

Fig. 8.

   Gene expression was verified by qRT-PCR and western blotting. A
   Expression of SHCBP1 in the breast cancer and control groups. B
   Expression of PMAIP1 in the breast cancer and control groups. C
   Expression of LTF in the breast cancer and control groups. D Protein
   expression levels of SHCBP1, PMAIP1, and LTF in breast cancer group 1
   and the control group. E Protein expression levels of SHCBP1, PMAIP1,
   and LTF in breast cancer group 2 and the control group. * p < 0.05;
   ** p < 0.01; *** p < 0.001; **** p < 0.0001

Discussion

   Endometriosis, a chronic gynecological disorder dependent on estrogen,
   exhibits traits similar to malignant cells despite its benign
   classification, including local and distant metastasis with resultant
   tissue damage [[129]2]. This condition shares several risk factors with
   breast cancer, including endogenous estrogen exposure, reproductive
   characteristics, obesity, and hormone replacement therapy. Our study
   explored these associations, suggesting that identification of common
   differential genes and construction of prognostic risk models for
   breast cancer could elucidate shared underlying mechanisms and
   potentially reveal novel biomarkers for breast cancer prognosis.

   Accurate diagnosis of endometriosis remains challenging, often
   resulting in delays and misdiagnoses [[130]32], highlighting the need
   for precise clinical diagnostic tools to initiate timely treatment.
   This investigation employed three machine learning algorithms—RF, LASSO
   logistic regression, and SVM-RFE—to identify eleven robust core
   biomarkers: OLFM4, APOBEC3B, BPIFB1, CPM, MSRB3, EZH2, SCGB3A1, F13A1,
   PTGER3, FOS, and RCAN1. Combined in a diagnostic nomogram, these
   biomarkers demonstrated high diagnostic accuracy for endometriosis,
   outperforming other strategies and indicating significant clinical
   utility.

   OLFM4, an extracellular matrix protein highly expressed in human
   endometrium [33], is downregulated in endometriosis compared to
   controls [9]. This protein may stabilize the endometrium and modulate
   inflammation through negative regulation of M2 macrophages [34].
   APOBEC3B, a member of the cytidine deaminase superfamily [35],
   contributes to DNA mutation by converting cytosine to uracil,
   potentially increasing the mutational burden in endometriosis [36,
   37]; its elevated expression is also associated with poorer outcomes
   in ER-positive breast cancer [38–40]. EZH2, a component of the
   polycomb repressive complex 2 (PRC2), mediates transcriptional
   silencing through histone H3 methylation [41, 42]. Hypoxic conditions
   enhance EZH2 expression, amplifying activity in pathways such as
   Wnt/β-catenin that are critical in the epithelial-to-mesenchymal
   transition observed in both breast cancer [43] and endometriosis [44].

   BPIFB1 expression is stimulated by estrogen, and elevated levels
   correlate with negative prognosis in luminal A breast cancer [[143]45,
   [144]46]. MSRB3, a protein repair enzyme, is associated with apoptotic
   cell death in various cancers, including breast cancer [[145]47]. FOS,
   an immediate response gene, plays a crucial role in estrogen-driven
   proliferation of endometrial cells [[146]48]. PTGER3, a receptor with
   high affinity for prostaglandin E2 (PGE2), is upregulated in
   endometriosis and implicated in tumor-associated angiogenesis,
   influencing clinical outcomes in various cancers [[147]8, [148]49].
   RCAN1 functions as a tumor suppressor, inhibiting cellular growth and
   angiogenesis in breast cancer [[149]50]. Secretoglobin family 3A
   member 1 (SCGB3A1) enhances stem cell characteristics and
   aggressiveness in breast cancer cells [[150]51]. Carboxypeptidase M
   (CPM), found on tumor-associated macrophages, may serve as a cancer
   biomarker [[151]52]. Factor XIII A chain (F13A1) participates in fibrin
   network stabilization and potentially facilitates tumor matrix
   formation and progression [[152]53]. These genes may play key roles in
   the development of both diseases and could serve as targets for future
   therapies.

   Through univariate Cox regression analysis combined with stepAIC, we
   constructed a prognostic model incorporating three key genes: SHCBP1,
   PMAIP1, and LTF. This model stratified breast cancer patients into
   high-risk (HR) and low-risk (LR) groups. The HR group showed
   significantly shorter OS than the LR group in the TCGA-breast cancer
   cohort, and this result was reproduced in the METABRIC cohort,
   supporting the model’s reliability. Within the TCGA-breast cancer
   cohort, the model-derived risk score emerged as an independent
   prognostic factor, remaining significant regardless of age, stage, TNM
   classification, or radiation treatment status.
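
   The stratification step can be sketched as follows: a minimal
   example, assuming the risk score is the usual Cox-style linear
   combination of gene expression values and that patients are split at
   the median score. Neither the fitted coefficients nor the cut-off
   are given in this section, so the coefficient values (and their
   signs), the median split, and the patient data below are all
   placeholders for illustration.

```python
from statistics import median
from typing import Dict, List

# Hypothetical coefficients: NOT the fitted values from the study.
ASSUMED_COEFS = {"SHCBP1": 0.30, "PMAIP1": -0.20, "LTF": -0.10}


def risk_score(expression: Dict[str, float],
               coefs: Dict[str, float] = ASSUMED_COEFS) -> float:
    """Linear predictor: sum of coefficient * expression over model genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def stratify(scores: List[float]) -> List[str]:
    """Label each score HR (above the median) or LR (at or below it)."""
    cutoff = median(scores)  # assumed cut-off; the study's rule may differ
    return ["HR" if s > cutoff else "LR" for s in scores]


# Toy expression values for four invented patients.
patients = [
    {"SHCBP1": 5.0, "PMAIP1": 2.0, "LTF": 1.0},
    {"SHCBP1": 2.0, "PMAIP1": 4.0, "LTF": 6.0},
    {"SHCBP1": 4.0, "PMAIP1": 3.0, "LTF": 2.0},
    {"SHCBP1": 1.0, "PMAIP1": 5.0, "LTF": 3.0},
]

scores = [risk_score(p) for p in patients]
groups = stratify(scores)
print(list(zip(scores, groups)))
```

   With these placeholder values, patients with high SHCBP1 and low
   PMAIP1 and LTF expression fall into the HR group.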

   SHCBP1, a member of the SHC protein family, plays vital roles in cell
   proliferation, migration, adhesion, and cell cycle regulation,
   contributing significantly to carcinogenesis [[153]54, [154]55]. In
   breast cancer, elevated SHCBP1 expression correlates with advanced
   clinical stages and shorter survival times [[155]54, [156]56–[157]58].
   PMAIP1, a pro-apoptotic member of the BCL-2 protein family, interacts
   with the p53 pathway to enhance apoptosis [[158]59–[159]62]. It
   functions as a tumor suppressor and shows elevated expression in breast
   cancer samples [[160]63], with critical importance in paclitaxel
   response in triple-negative breast cancer [[161]64]. High PMAIP1 mRNA
   expression represents a positive prognostic marker for relapse-free
   survival and OS across diverse breast cancer molecular subtypes
   [[162]64]. LTF, a multifunctional glycoprotein belonging to the
   transferrin family, exhibits significant anti-tumor properties through
   mechanisms including inhibition of tumor cell proliferation and
   promotion of apoptosis or necrosis [[163]65–[164]68]. Pan-cancer
   analysis shows that LTF expression is low in tumors, supporting its
   classification as a tumor suppressor gene [[165]69].

   Further analysis revealed distinct patterns in immune cell infiltration
   and immune checkpoint expression between risk groups. The LR group
   exhibited increased infiltration by eosinophils, mast cells, and NK
   cells. Conversely, the HR group showed greater presence of activated
   CD4 and CD8 T cells, alongside elevated stemness enrichment scores and
   hypoxia scores, suggesting more aggressive tumor characteristics.
   Despite this activation pattern, the LR group maintained higher total
   CD8+ T cell levels with reduced immunosuppressive M2 macrophage
   presence, potentially explaining enhanced immunotherapy responsiveness.
   Pharmacogenomic analyses revealed higher predicted lapatinib IC50
   values in the HR group, indicating potential HER2-targeted therapy
   resistance. The HR group also demonstrated increased non-synonymous
   mutation burden and aneuploidy, reflecting underlying genomic
   instability.

   Several limitations exist regarding sample size and clinical annotation
   depth. Future investigations require larger cohorts with comprehensive
   clinical and longitudinal data to enhance model generalizability and
   better account for clinical heterogeneity. Collaborations are being
   established to access well-annotated prospective datasets. Subsequent
   studies will implement network-based analyses with experimental
   validation to elucidate shared gene functions between pathologies.
   Advanced statistical approaches, including causal inference and
   propensity score matching, will address potential confounders. While
   METABRIC provided valuable validation, cohort heterogeneity, processing
   variations, and treatment history differences necessitate further
   validation through prospective multi-center studies.

   This study identified common genes between endometriosis and breast
   cancer, facilitating the development of diagnostic and prognostic
   models. Our diagnostic model, based on 11 core biomarkers, diagnosed
   endometriosis with high accuracy. The prognostic model, utilizing three
   genes, effectively stratified breast cancer patients into distinct risk
   categories that correlated with specific clinical outcomes and
   biological behaviors. These risk groups exhibited unique immune cell
   profiles and genomic features, enhancing our understanding of the
   molecular dynamics underlying both conditions. These insights are
   essential for advancing personalized diagnostic and treatment
   approaches.

Electronic supplementary material

   [166]Supplementary Material 1 (7MB, docx)

Acknowledgements