Abstract Polycystic Ovary Syndrome (PCOS) is a complex endocrine disorder affecting women of childbearing age, and we aimed to reveal its underlying molecular mechanisms. Gene expression profiles from [40]GSE138518 and [41]GSE155489, and single-cell RNA sequencing (scRNA-seq) data from PRJNA600740 were collected and subjected to bioinformatics analysis to identify the complex molecular mechanisms of PCOS. The expression of genes was detected by RT-qPCR. Through differential analysis, we identified 230 common differentially expressed genes (DEGs) in [42]GSE138518 and [43]GSE155489. GSEA results showed significant enrichment of purine metabolism and oocyte meiosis in the control group, while GSVA results indicated significant activation of ECM receptor interaction, and antigen processing and presentation in PCOS. Weighted gene co-expression network analysis revealed 7 co-expression modules, with the bisque4 module showing the highest positive correlation with PCOS. Enrichment analysis revealed that genes in the bisque4 module were mainly involved in the PI3K-Akt signaling pathway, calcium signaling pathway, and Ras signaling pathway. Pseudotime trajectory analysis of cell subpopulations revealed the potential developmental trajectory of PCOS. The gene expression consistent with the potential developmental trajectory was validated by RT-qPCR. Our study, by analyzing multiple datasets, has revealed the complex molecular network of PCOS, offering new insights into understanding its pathophysiological basis. Supplementary Information The online version contains supplementary material available at 10.1038/s41598-024-81110-w. Keywords: Polycystic ovary syndrome, Pseudotime trajectory, WGCNA, Bioinformatics Subject terms: Cell biology, Pathogenesis Introduction Polycystic Ovary Syndrome (PCOS) is a prevalent endocrine disorder affecting up to 10% of women of reproductive age, characterized by hyperandrogenism, ovulatory dysfunction, and polycystic ovaries^[44]1,[45]2. The World Health Organization estimates that more than 116 million women worldwide are affected by this condition, underscoring its significance as a public health issue^[46]3. Despite recent advances advancements, significant gaps remain in our understanding of PCOS, particularly regarding its molecular mechanisms and the long-term health risks it poses, such as metabolic syndrome and cardiovascular disease^[47]4,[48]5. While the primary focus has been on reproductive and metabolic health, recent studies suggest that PCOS may have systemic implications beyond the reproductive system^[49]6. In addition, differentially expressed genes and signaling pathways may play a crucial role in the syndrome’s complex molecular mechanisms, further emphasizing the need for a broader understanding of the condition’s impact on women’s overall health^[50]7. PCOS is understood to have a multifactorial etiology, including genetic predisposition, environmental influences, and lifestyle factors^[51]8,[52]9. However, gaps in understanding its molecular mechanisms limit the effectiveness of current treatments and early diagnosis strategies. This highlights the need for further research to elucidate the complex molecular pathways involved in PCOS, aiming for targeted therapeutic interventions^[53]10–[54]12. The molecular basis of PCOS is complex, involving alterations in multiple pathways, but the precise details of these alterations remain elusive. To address these gaps, this study integrates gene expression profiles from the datasets [55]GSE138518 and [56]GSE155489 datasets with single-cell RNA sequencing (scRNA-seq) data from PRJNA600740, to identify the intricate molecular mechanisms and cellular changes associated with PCOS. Specifically, we seek to identify gene networks and signaling pathways that could be targeted for therapeutic interventions, providing new insights into the pathophysiology of PCOS. Materials and methods Data collection Microarray data sets [57]GSE138518^[58]13, and [59]GSE155489^[60]14 were downloaded from the Gene Expression Omnibus (GEO) database ([61]https://www.ncbi.nlm.nih.gov/gds), and scRNA-seq data PRJNA600740^[62]15 was downloaded from the BioProject database ([63]https://www.ncbi.nlm.nih.gov/bioproject). The [64]GSE138518 dataset contains gene expression profiles of ovarian granulosa cells from 3 PCOS patients and 3 normal individuals. The [65]GSE155489 dataset includes gene expression profiles of ovarian granulosa cells from 2 PCOS patients and 2 normal individuals. The microarray data were background corrected using the affy package in R, and the raw probe intensities were normalized using the Robust Multi-array Average (RMA) method. The PRJNA600740 dataset contained scRNA-seq data from 9 oocytes of PCOS patients and 11 oocytes from healthy individuals. Quality control of the scRNA-seq data was performed using the Seurat package, removing cells with less than 500 detected genes and excluding cells with more than 10% mitochondrial gene expression. The filtered data were then normalized using the NormalizeData function from the Seurat package. High variance genes (> 2000) were identified for subsequent analysis using the FindVariableFeatures function in Seurat. Differential analysis and co-expression network analysis The limma R package was utilized for identifying differentially expressed genes (DEGs) in the [66]GSE138518 and [67]GSE155489 datasets, with |log2FoldChange| > 1 and P value < 0.05 set as the thresholds for significance. Weighted gene co-expression network analysis (WGCNA) was used to construct a co-expression network for the union of the DEGs. Initially, the pickSoftThreshold function within the WGCNA package was used to calculate the scale-free topology fit index for various soft-thresholding powers, selecting the optimal beta value to ensure a scale-free network. Pearson correlation coefficients for all pairs of genes within the gene expression matrix were computed to build a matrix of gene expression similarity, which was then transformed into an adjacency matrix. The adjacency matrix was converted into a Topological Overlap Matrix (TOM), followed by hierarchical clustering to group genes with similar expression patterns together. Modules within the clustering tree were identified using the dynamic tree cut method. Modules most correlated with PCOS were identified by calculating the correlation between the module eigengenes and PCOS. Enrichment analysis Gene set enrichment analysis (GSEA) was performed using the clusterProfiler R package based on hallmark gene sets for PCOS and controls. Gene Set Variation Analysis (GSVA) was carried out using the GSVA R package by evaluating the expression pattern of genes within the same gene set across samples, generating a score reflecting the activity of that gene set within the sample. The resulting gene set activity scores were used to compare differences between PCOS patients and controls, identifying signaling pathways that are activated or suppressed in the disease state. Additionally, the clusterProfiler R package was used to identify GO and KEGG signaling pathways^[68]16 involved with the module genes. ScRNA-seq cell clustering and pseudotime trajectory analysis The FindClusters function in the Seurat R package was used to identify major clusters. Visualization and further analysis were carried out using the Uniform Manifold Approximation and Projection (UMAP) method. Pseudotime trajectory analysis was performed using the monocle2 R package. The plot_cell_trajectory function was utilized to visualize the pseudotime trajectory of cells, displaying the continuous changes and branching structures of cell states. Sample collection Women aged between 28 and 45 were recruited from The First Affiliated Hospital of Heilongjiang University of Traditional Chinese Medicine, including 10 patients diagnosed with PCOS based on the Rotterdam criteria and 10 healthy non-PCOS women as a control group. Individuals with genetic diseases, other endocrine disorders, or those currently undergoing related medication treatments were excluded. This study was conducted following the ethical guidelines outlined in the Declaration of Helsinki and was approved by the Ethics Committee of The First Affiliated Hospital of Heilongjiang University of Traditional Chinese Medicine (No. HZYLLBA2024004), all methods were performed in accordance with the relevant guidelines and regulations. Informed consent was obtained from all participants before their inclusion in the study. Participants were fully informed about the study’s purpose, procedures, potential risks, and benefits, and signed a written consent form prior to sample collection. A 5 mL sample of peripheral blood was collected using vacuum blood collection tubes containing anticoagulant for subsequent analysis. RT-qPCR analysis Total RNA was extracted from the blood samples using the TRIzol reagent (Invitrogen, USA) according to the manufacturer’s protocol. The purity and concentration of RNA were determined using a NanoDrop spectrophotometer. Complementary DNA (cDNA) was synthesized from 1 µg of total RNA using the PrimeScript RT Reagent Kit (Takara, Japan) following the manufacturer’s instructions. Real-time quantitative PCR (RT-qPCR) was performed to quantify the expression levels of DVL1, HES4, ISG15, SDF4, AGRN, INTS11, and UBE2J2. The specific primers for each gene are listed in Table [69]S1. The RT-qPCR reactions were carried out in a 20 µL reaction volume containing 10 µL of SYBR Premix Ex Taq (Takara Bio, Japan), 0.4 µL of each primer (10 µM), 2 µL of cDNA, and 7.2 µL of RNase-free water. The reactions were performed in triplicate using a ABI7500 Real-Time PCR System. The relative expression levels of the target genes were calculated using the 2^−ΔΔCt method, with GAPDH serving as the internal control gene. Statistical analysis Statistical analysis was performed using GraphPad Prism 9 software (GraphPad Software, USA). Data were presented as mean ± standard deviation (SD). To compare gene expression levels between the PCOS and control groups were analyzed using an unpaired t-test. Prior to performing the t-test, we assessed the normality of the data using the Shapiro-Wilk test, which confirmed that the gene expression data were normally distributed in both groups. The statistical significance was set at P < 0.05. Result Altered gene expression and signaling pathways in PCOS In the [70]GSE138518 and [71]GSE155489 datasets, differential analysis between PCOS and control groups identified 1732 DEGs (Fig. [72]1A) and 3208 DEGs (Fig. [73]1B), respectively. Through intersection analysis, we identified 230 common DEGs (Fig. [74]1C, D). Fig. 1. [75]Fig. 1 [76]Open in a new tab Identification of differentially expressed genes in PCOS. (A) Volcano plot of differentially expressed genes in [77]GSE138518 dataset. It displaying significant DEGs based on |log2FoldChange| > 1 and P < 0.05. Upregulated genes are shown in red, and downregulated genes are in blue. (B) Volcano plot of differentially expressed genes in [78]GSE155489 dataset. It displaying significant DEGs based on |log2FoldChange| > 1 and P < 0.05. Upregulated genes are shown in red, and downregulated genes are in blue. (C) Venn diagram showing the intersection of DEGs identified in the [79]GSE138518 and [80]GSE155489 datasets. A total of 230 genes were found to be commonly differentially expressed between the two datasets. (D) Heatmap of the expression of common DEGs identified in panel C within the [81]GSE138518 dataset. (E) Gene Set Enrichment Analysis (GSEA) enrichment analysis for PCOS and controls. (F) Gene Set Variation Analysis (GSVA) evaluated the changes of signaling pathways in PCOS. GSEA results indicated that in the healthy control group, pathways related to purine metabolism and oocyte meiosis were significantly enriched, suggesting that these biological processes may play an important role in healthy ovarian function (Fig. [82]1E). Meanwhile, GSVA results revealed significant activation of ECM receptor interaction, and antigen processing and presentation pathways in PCOS patients, pointing to possible immune and inflammatory responses involved in PCOS (Fig. [83]1F). Identifying key modules through WGCNA In WGCNA, the optimal β value of 24 (scale-free R^2 = 0.85) was selected, identifying 7 co-expressed modules (Fig. [84]2A, B). Correlation analysis revealed that the bisque4 module had the highest positive correlation with PCOS (Fig. [85]2C), suggesting it may play a key role in the onset and progression of PCOS. Further enrichment analysis indicated that genes within the bisque4 module are mainly involved in biological processes such as cell adhesion and biological adhesion; cellular components like component of plasma membrane and plasma membrane part; and molecular functions including calcium ion binding and identical protein binding (Fig. [86]3A). The results of KEGG signaling pathways involved with genes in the bisque4 module revealed pathways such as the PI3K-Akt signaling pathway, Calcium signaling pathway, and Ras signaling pathway (Fig. [87]3B). Fig. 2. [88]Fig. 2 [89]Open in a new tab Construction of co-expression network for all differentially expressed genes. (A) Determination of soft threshold power for constructing a scale-free network using WGCNA. The plot shows the scale-free topology fit index (y-axis) across different soft threshold powers (x-axis). (B) Clustering dendrogram of gene modules built by WGCNA. Genes are grouped into modules based on their expression patterns. The branches of the dendrogram represent different modules, and each color indicates a distinct module. (C) Heatmap showing the correlation between module–trait of PCOS/control groups. Each row represents a module, and each column represents a trait (PCOS or control). Fig. 3. [90]Fig. 3 [91]Open in a new tab Enrichment analysis of bisque4 module genes. (A) Gene Ontology (GO) enrichment analysis for biological processes (BP), cellular components (CC), and molecular functions (MF) of genes in the bisque4 module. (B) KEGG pathway enrichment analysis for genes in the bisque4 module. Pseudotime trajectory analysis We applied UMAP for scRNA-seq data in PRJNA600740 to identify 13 cell clusters (Fig. [92]4A) and show the distribution of cell clusters in PCOS and controls (Fig. [93]4B). Pseudotime trajectory analysis identified potential developmental sequences of cells during the progression of PCOS (Fig. [94]4C). Figure [95]4D showed the distribution of cell clusters along the pseudotime trajectory, while Fig. [96]4E displayed the distribution of PCOS patients and normal controls along these paths. Clusters 2 and 13, aggregating along the trajectory, might be involved in the progression of PCOS. Moreover, AGRN, DVL1, HES4, INTS11, ISG15, SDF4, and UBE2J2 were genes whose expression changes over pseudotime trajectory (Fig. [97]5). Fig. 4. [98]Fig. 4 [99]Open in a new tab Characterization of cell clusters in PCOS of scRNA-seq. (A) UMAP plot of single-cell RNA sequencing data from PCOS and control samples, showing 13 distinct cell clusters. Each dot represents a single cell, and different colors represent different cell clusters. (B) UMAP plot showing the distribution of cells from PCOS patients (orange) and control individuals (blue). (C) Pseudotime trajectory analysis plot displaying the developmental progression of cells. (D) The distribution of cell clusters on pseudotime trajectories. (E) The distribution of cells on pseudotime trajectories according to PCOS group (orange) or control group (blue). Fig. 5. [100]Fig. 5 [101]Open in a new tab Gene expression level on the pseudotime trajectory. The RT-qPCR results shown in Fig. [102]6 validated the expression levels of selected genes identified through bioinformatics analysis. It was found that the expression of DVL1, HES4, ISG15, and SDF4 in PCOS was higher than that in the control group, while the expression of AGRN, INTS11, and UBE2J2 was lower than that in the control group. Fig. 6. [103]Fig. 6 [104]Open in a new tab Gene expression level in PCOS and normal controls detected by RT-qPCR. The bar plots show relative gene expression levels (mean ± SD) in the two groups. *P < 0.05. Discussion The aim of this study is to utilize bioinformatics approaches to unveil the potential molecular mechanisms of PCOS. By integrating analysis of the [105]GSE138518 and [106]GSE155489 datasets along with PRJNA600740 scRNA-seq data, we conducted an in-depth exploration of PCOS at the molecular level. Through GSEA, we noted significant enrichment the biological processes of purine metabolism^[107]17 and oocyte meiosis^[108]18 in the healthy control group. This may be associated with the maturation of oocytes and energy metabolism in the ovaries, both of which are crucial for normal fertility^[109]19–[110]23. In PCOS patients, GSVA revealed significant activation of ECM receptor interaction and antigen processing and presentation pathways, indicating that the pathological process of PCOS might be closely related to the immune response and inflammatory state of the patients. The activation of ECM receptor interaction suggests that PCOS may be accompanied by ECM remodeling, potentially affecting the ovarian microenvironment and its function^[111]24. Moreover, altered ECM remodeling has been linked to increased fibrosis in ovarian tissue, which may impair normal follicular development and function, exacerbating the ovulatory dysfunction seen in PCOS patients^[112]25. Pathways related to antigen processing and presentation may reflect a state of chronic low-grade inflammation that is commonly associated with PCOS and may contribute to its metabolic complications, including insulin resistance and obesity^[113]26,[114]27. These metabolic disturbances not only exacerbate the hormonal imbalances seen in PCOS but may also worsen the overall clinical presentation, including increased risks of type 2 diabetes and cardiovascular disease^[115]28. Moreover, obesity, a well-established risk factor for PCOS, has been shown to further exacerbate these pathological processes. Tokmak et al. demonstrated that obesity not only worsens metabolic and inflammatory parameters but also significantly reduces pregnancy rates in women undergoing ovulation induction^[116]29. This highlights the necessity for weight management as part of therapeutic strategies in PCOS treatment, especially for patients seeking fertility interventions. Further supporting this, recent evidence by Camili et al.^[117]30 demonstrated that inflammatory cytokine is significantly elevated in PCOS patients and may be linked to the syndrome’s immune dysregulation. This underscores the growing recognition of immune modulation as a potential therapeutic target in managing PCOS, particularly in patients with pronounced inflammatory profiles. These molecular insights could help guide future therapeutic interventions, targeting ECM remodeling or modulating inflammatory responses to improve ovarian function and reduce the metabolic complications associated with PCOS. Through WGCNA, we identified the bisque4 module as having the highest positive correlation with PCOS. Enrichment analysis revealed that genes within this module are primarily involved in biological processes and molecular functions such as cell adhesion, biological adhesion, components of the plasma membrane, calcium ion binding, and protein binding, highlighting their potential roles in the development of PCOS. Cell adhesion molecules can be used to evaluate inflammation related disorders in PCOS patients^[118]31. The downregulation of extracellular matrix and cell adhesion molecules seems to be related to PCOS^[119]32. The expression of calcium ions and related proteins has been consistently low in female patients with PCOS, leading to impaired muscle mass^[120]33. Furthermore, signaling pathways associated with the bisque4 module’s genes, such as the PI3K-Akt^[121]34–[122]36, calcium signaling^[123]37,[124]38, and Ras signaling pathways^[125]39,[126]40, regulate key biological processes like cell proliferation, survival, and apoptosis, indicating they may play central roles in the pathogenesis of PCOS. Additionally, recent studies have identified genetic polymorphisms in signaling molecules, such as ERK-1 and ERK-2, as key contributors to the pathogenesis of PCOS. Guney et al. reported that variations in these genes are associated with altered cell signaling, which may influence the development of PCOS by disrupting ovarian function and hormonal balance^[127]41. These findings provide further evidence of the complex genetic underpinnings of PCOS and highlight the importance of molecular studies in identifying potential therapeutic targets. Signaling pathways identified in this study may offer promising avenues for developing targeted treatments that address both the reproductive and metabolic abnormalities of PCOS. Single-cell data analysis provided a more detailed perspective, revealing cellular heterogeneity within PCOS and normal ovarian tissues. UMAP analysis identified 13 cell clusters, while pseudotime trajectory analysis unveiled potential developmental sequences during the progression of PCOS. Notably, the aggregation of clusters 2 and 13 along the trajectory suggests these cell groups may play unique roles in the progression of PCOS. This offers us an in-depth understanding of cellular differentiation and development under the pathological state of PCOS and may reveal key molecules and pathways involved in the disease process. We also identified genes such as AGRN, DVL1, HES4, INTS11, ISG15, SDF4, and UBE2J2, whose expression changes over pseudotime. The RT-qPCR results demonstrated that the expression of DVL1, HES4, ISG15, and SDF4 was significantly higher in PCOS patients than in controls, which is consistent with the differential expression predicted by our bioinformatics analysis. Other studies have reported a certain association between DVL1^[128]42, HES4^[129]43, ISG15^[130]44, and SDF4^[131]45 with ovarian. Although there is currently a lack of sufficient literature to support its specific mechanism of action, the expression changes of AGRN, INTS11, and UBE2J2 in the analysis of this study suggest that these genes may be involved in pathological processes such as ovarian dysfunction, metabolic disorders, and inflammatory response in PCOS patients. The consistency between the computational and experimental data strengthens the reliability of the bioinformatics findings and underscores the potential biological relevance of these genes in the pathogenesis of PCOS. Moreover, the validated genes may contribute to the clinical presentation of PCOS and provide potential targets for further research and therapeutic interventions. The functions of these genes and their roles in PCOS require further biological validation and mechanistic studies, with the hope of providing new directions for the treatment of PCOS. This study has some limitations. Firstly, the public databases with limitation of sample size relied upon in this study may increase the likelihood of potential bias in the type of PCOS patients, and may also weaken the broad applicability and explanatory power of the results. Future research should focus on collecting larger and more diverse datasets that include PCOS patients from different ethnic backgrounds, clinical phenotypes, and disease severities. A prospective longitudinal study could track the progression of PCOS from early onset through various stages of the disorder, which allow for an examination of how the molecular mechanisms identified in this study. Cell lines and animal models of PCOS were also could be used to validate the findings from this study in a whole-organism context. Secondly, the current understanding of the potential long-term health effects of PCOS, such as cardiovascular disease and cancer risk, is still limited. In addition, further research should be conducted on the genetic background and environmental triggering factors of PCOS, especially its application in early diagnosis and treatment strategy development. Conclusion In summary, our study, by integrating and analyzing multiple independent datasets, has uncovered the complex molecular networks and cellular state changes underlying PCOS. By delving into these findings, future research can deepen our understanding of the complex pathology of PCOS and drive the development of new diagnostic and therapeutic methods. Electronic supplementary material Below is the link to the electronic supplementary material. [132]Supplementary Material 1^ (16.2KB, docx) Acknowledgements