Abstract Premature ovarian insufficiency (POI) is a reproductive endocrine disorder characterized by infertility and the perimenopausal syndrome. Its genetic etiology is highly heterogeneous and not yet fully understood. Limited by short-read sequencing, the profile and structural variation of the full-length transcript for POI have remained elusive. Therefore, this study included peripheral blood samples from 5 POI patients and 5 controls, characterizing full-length transcripts of POI using Oxford Nanopore sequencing firstly. Ultimately, we identified 26,122 transcripts, including 7,724 novel gene loci and 13,593 novel transcripts. A total of 382 differentially expressed transcripts were identified, including 366 down-regulated and 16 up-regulated transcripts. Based on transcript structure variant analysis, 8,834 alternative splicing events, 65,254 alternative polyadenylation sites and 32 motifs were further identified, revealing the diversity sources of transcript isoforms, proteins and genetic complexity. Enrichment analysis of differentially AS genes suggested that the ferroptosis pathway may play an important role in the pathogenesis of POI.Additionally, 494 high-confidence lncRNAs, 1,768 transcription factors, and novel gene-coding regions were predicted based on full-length transcript sequence. Analysis of immune cell subtypes revealed the expression of CD8 + T cells and monocytes were down-regulated in POI, which was significantly positively correlated with AMH, suggesting that CD8 + T cells and monocytes could serve as potential diagnostic markers and immunotherapy targets for POI. Conclusively, this study provides new perspectives on the pathogenesis, post-transcriptional regulation mechanisms, and immune targets of POI. Supplementary Information The online version contains supplementary material available at 10.1038/s41598-025-89391-5. Keywords: Premature ovarian insufficiency, Full-length transcriptome, Alternative splicing, Alternative polyadenylation, Immune cell subtypes Subject terms: Biological techniques, Biotechnology, Computational biology and bioinformatics, Genetics, Molecular biology Introduction Premature ovarian insufficiency (POI) refers to the development of reproductive endocrine dysfunction and follicular depletion in women before the age of 40. The main clinical manifestations are infertility and menstrual disorders, characterized by hypergonadotropic and hypoestrogenic amenorrhea^[44]1. It is estimated that the prevalence of POI is 1% before the age of 40, while it is approximately 0.01% before the age of 20^[45]2. Due to reduced estrogen production, women with POI have a higher risk for long-term complications, including cardiovascular disease, osteoporosis, genitourinary symptoms, and neurodegeneration^[46]3. The etiology of POI is highly heterogeneous. Chromosomal abnormalities, autoimmune diseases, and medical factors (surgery, radiotherapy, chemotherapy) have been reported, but the etiology is still unknown in 90% of cases^[47]4. Genetic factors are considered the most important factor in promoting the development of POI^[48]5. Therefore, exploring the characteristics of the genetic profile is essential for understanding the pathogenesis and therapeutic targets of POI. The advent of high-throughput sequencing technology has brought about revolutionary breakthroughs in genetic research and biomedicine. Oxford nanopore technology (ONT) is the third generation of high-throughput sequencing based on the real-time detection of single molecule electrical signals. The principle is to combine DNA/RNA duplets with nanopore proteins and unhelix them, use voltage to drive nucleic acid molecules through the pore one by one, and identify the sequence of nucleic acid base by detecting the changes in electrical signals^[49]6,[50]7. Compared to the next-generation sequencing technologies represented by the Illumina platform, ONT sequencing offers distinct advantages including ultra- long read lengths, shorter sequencing cycle, and real-time results acquisition^[51]8,[52]9. Moreover, full-length sequences facilitates more accurate refinement of genome annotation information and explore intricate transcript structural variations. In this study, the state-of-the-art PromethION platform from ONT was used to characterize transcriptional expression profiling of POI. The aim was to refine the functional annotation of POI transcripts and explore post-transcriptional regulatory signature, including alternative splicing (AS) events, alternative polyadenylation (APA) sites, fusion genes, predicted long non-coding RNAs (lncRNAs), coding sequence (CDS), and transcription factor (TFs) families based on full-length sequences. And the proportion of different immune cell subtypes was analyzed based on the transcriptional profile data. These findings will provide new insights into the pathogenesis and genetic diversity of POI (Fig. [53]1). Fig. 1. [54]Fig. 1 [55]Open in a new tab Bioinformatics analysis of ONT sequencing. ONT: oxford nanopore technology, AS: alternative splicing, APA: alternative polyadenylation, CPM: counts per million, DET: differentially expressed transcript, CDS: coding sequence, SSR: simple sequence repeat. Materials and methods Participants and samples collection The diagnostic criteria for POI were as follows^[56]1: (i) age younger than 40 years; (ii) experiencing oligo/amenorrhea for at least 4 months; and (iii) serum follicle-stimulating hormone (FSH) concentrations exceeding 25 IU/L, detected on two occasions with an interval of more than 4 weeks. The inclusion criteria for the control group including (i) age and body mass index (BMI) matched controls; (ii) diagnosis of infertility due to tubal factors; (iii) with normal menstrual cycles and ovarian function. The exclusion criteria for both the case and control groups were consistent: (i) presence of other endocrine disorders; (ii) history of pelvic or ovarian surgery; (iii) use of hormones or medications affecting the endocrine system within 3 months prior to blood sampling. The peripheral blood samples (2.5 ml) were collected at 9 a.m. using PAXgene Blood RNA Tubes (BD, United States) following a 12-hour fasting period and subsequently stored at -80 °C for preservation. The clinical data of the participants were collected, including age, BMI, antral follicle (AFC), anti-Mullerian hormone (AMH), FSH, Luteinizing hormone (LH), estradiol (E2), progesterone (P) and prolactin (PRL). The present study was granted approval by the Ethics Committee of the First Affiliated Hospital of Guangxi Medical University (NO. 2021KY-E-249), and written informed consent was obtained from all participants. We confirmed that all methods were conducted in accordance with relevant guidelines and regulations. The construction of cDNA library The total RNA of the samples was extracted by PAXgene Blood miRNA Kit (manufacturer: QIAGEN, modle: 763134) and construct a cDNA library by a matching kit (Thermo Scientific™Maxima H Minus Reverse Transcriptase, modle: EP0752)). The PromethION platform (Oxford Nanopore Technologies, Oxford, UK) was employed for conducting full-length transcriptome sequencing. The full-length sequence was refined to generate a consensus sequence, which was subsequently aligned with the human reference genome (version: Homo_sapiens GRCh38_release95) using Minimap2 software (2.16 version). Sequences with an identity of less than 0.9 and coverage below 0.85 were excluded, resulting in a final de-redundant sequence that was utilized for further analysis^[57]10. Identification and functional annotation of novel transcripts The gff3 file obtained from the non-redundant sequence was compared to the gff3 of reference genome using the gffcompare software, and transcripts identified in this sequencing result but not annotated by the reference genome were classified as novel transcripts. The annotation information for novel transcripts was obtained by comparing them with the Non-Redundant Protein Database (NR, [58]https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/), Swissprot ([59]https://www.uniprot.org/), Gene Ontology (GO, [60]http://geneontology.org/), Clusters of Orthologous Groups (COG,[61]https://www.ncbi.nlm.nih.gov/research/cog/), euKaryotic Ortholog Groups (KOG, [62]https://www.ncbi.nlm.nih.gov/COG/), Protein family (Pfam, [63]http://pfam.xfam.org/), and Kyoto Encyclopedia of Genes and Genomes (KEGG, [64]https://www.genome.jp/kegg/) databases^[65]11. Identification of differentially transcripts and enrichment analysis The transcript expression levels were quantified as counts per million (CPM). The specific calculation method was referred to our previous study^[66]10. The DESeq2 R package was used to perform differential expression analysis on the transcripts from all samples. The transcripts with | log2 fold change(FC)| >1.5 and a false discovery rate (FDR) < 0.05 were considered as differentially expressed transcripts (DETs). The false discovery rate (FDR) is calculated by applying the Benjamini-Hochberg correction method to adjust the original significance P-values obtained from hypothesis testing.Then the enrichment analysis of KEGG and GO was performed on the DETs. Analysis of alternative splicing and alternative polyadenylation We used AStalavista software to identify AS types in all samples, including intron retention (IR), exon skipping (ES), mutually exclusive exon (MEE), alternative 3 ‘splice sites (A3S), and alternative 5’ splice sites (A5S). The genes with alternative splicing only in the POI group were identified and intersected with the DETs, and then the intersecting genes were subjected to KEGG pathway enrichment analysis. APA sites were verified using the TAPIS pipeline, and the number and location of APA loci were analyzed. The sequence features 50 bp upstream of the transcript poly(A) site were analyzed using DREME software to determine the motif. Prediction of lncRNAs, CDS and SSR Transcripts with minimum length > 200nt and exon number ≥ 2 were defined as candidate lncRNAs. The intersection of the results predicted by the four tools were identified as lncRNAs with high confidence. Protein family (Pfam): identifies coding transcripts with protein domain matches; unmatched transcripts are potential lncRNAs. Coding potential calculator (CPC): classifies transcripts with a score < 0 as noncoding. Coding potential assessment tool (CPAT): uses a logistic regression model to evaluate coding potential based on ORF length, ORF coverage, fickett score, and hexamer score. Coding-noncoding index (CNCI): transcripts with a score < 0 are identified as noncoding. Coding sequences (CDS) were identified from transcripts using TransDecoder software. The transcripts with more than 500 bp were screened from the deredundant transcripts, and simple sequence repeat (SSR) analysis was conducted using MISA software. Enrichment indicators for each SSR classification are shown in Supplementary Table [67]S1. Transcription factor annotation and binding site prediction The transcription factors (TFs) are a class of proteins that specifically bind to nucleotide sequences located upstream of genes and play a crucial role in the regulation of gene expression. We identified the TFs types through comparison with the AnimalTFDB 3.0 database (AnimalTFDB3) and assessed their activities in a given sample using the CoRegNet software package, presenting the results visually through heat maps. The R package TFBStools was used to predict the TFs binding sites (TFBS) in the promoter regions of the differentially expressed genes. Network analysis of DETs-TFs-pathways and protein-protein interaction Transcription factors targeting DETs were predicted using a transcription factor database. Subsequently, pathways associated with these DETs were identified through a gene function annotation database. The targeting relationships among DETs, TFs, and pathways were then imported into Cytoscape software to construct and visualize a network map.In addition, the DETs were uploaded to the STRING database (STRING: functional protein association networks) for protein-protein interaction (PPI) analysis. The PPI results were subsequently imported into Cytoscape software for network analysis. The degree value of each node was calculated, and a concentric circle layout was used for visualization. The innermost circle represented the most critical hub genes based on their degree values. Immune cell subtypes and clinical correlation analysis CIBERSORT is a cell type recognition algorithm based on gene expression data. In this study, the CIBERSORT package was used to perform immune cell subtypes analysis of transcriptional expression profiles to infer the relative proportions of distinct immune cell subsets in samples. Analyze the differential expression of immune cell subsets between two groups using T-test, further analyze the correlation between differential cell subsets and clinical indicators AMH and FSH by calculating Pearson coefficient. P < 0.05 was considered statistically significant. Results Overview of full-length transcriptome sequencing Compared to the control group, the POI group exhibited significantly elevated levels of FSH and LH, while AMH levels were significantly lower (P < 0.05). However, there were no significant differences in age, BMI, E2, PRL, and T between the two groups (P > 0.05)( Supplementary Table [68]S2). The Nanopore PromethION platform was used to construct a full-length transcriptome library of peripheral blood samples. The raw data of each sample were filtered for short fragments and low-quality reads to obtain the total clean data, and the output of clean data of each sample was more than 3.77 GB. After the removal of low-quality sequences and ribosomal RNA sequences, the proportion of full-length sequences ranged from 90.2 to 92.84%, with full-length sequences ranging from 2,781,865 to 4,582,222^[69]10. Identification and functional annotation of novel transcripts and DETs Using Minimap2 to align full-length sequences against the human reference genome, we identified a total of 26,122 transcript sequences, which included 7,724 novel gene loci and 13,593 novel transcripts(Supplementary Table [70]S3). Additionally, a total of 382 DETs were identified through the differential expression analysis, including 366 down-regulated and 16 up-regulated transcripts^[71]10 (Supplementary Table [72]S4 and Fig.[73]S1). Annotation information of DETs in seven databases was shown in Supplementary Table [74]S5. GO functional annotation analysis showed that DETs were significantly enriched in the viral life cycle in biological processes. The molecular functions significantly enriched included superoxide-generating NADPH oxidase activity, virus receptor activity, guanylate kinase activity. Additionally, ficolin-1-rich granules and ficolin-1-rich granule membranes were mainly enriched in cellular components (Fig. [75]2a-c). KEGG enrichment analysis showed that the top 20 most significantly enriched pathways, including protein processing in endoplasmic reticulum, fluid shear stress and atherosclerosis, phagosome, Fc gamma R-mediated phagocytosis, leishmaniasis, osteoclast differentiation (Fig. [76]2d). The COG database annotated 20 entries, of which translation, ribosomal structure and biogenesis accounted for the largest proportion [15-17.24%]. There were 21 items annotated in KOG, of which signal transduction mechanisms accounted for the largest proportion [42-14.19%] (Supplementary Figure [77]S2). Fig. 2. [78]Fig. 2 [79]Open in a new tab Enrichment analysis of DETs. (a–c) GO enrichment analysis.(d) KEGG enrichment analysis. Characteristics of AS, APA and motif Based on the ONT sequencing data, we identified 8,834 AS events involving five types (Fig. [80]3a). Of these, 4,231 AS events were identified in the POI group, including 2,419 ES, 633 A3S, 528 A5S, 477 IR, and 174 MEE events. Additionally, 4,603 AS events were found in the control group, including 2,591 ES, 713 A3S, 560 A5S, 546 IR, and 193 MEE events. Overall, the proportions of various AS types were similar in both groups, with the largest proportion of ES and the smallest proportion of MEE (Fig. [81]3b). Meanwhile, we identified 688 genes that only experienced AS events in the POI group. By taking the intersection of these AS genes and DET, 17 intersection genes were obtained. Further KEGG enrichment analysis indicated that the intersection genes were significantly enriched in the ferroptosis pathway (Supplementary Table [82]S6, Fig. [83]S3). Fig. 3. [84]Fig. 3 [85]Open in a new tab Characteristics and quantities of AS, APA, and motifs. (a) Alternative splicing type. (b) Number of alternative splicing events in POI and controls. (c) Base distribution plot of upstream and downstream of poly(A) site. (d) Number of alternative polyadenylation loci in two groups. (e) motifs upstream of the polyA site (the abscissa is the relative position of the base in the motif, the ordinate is the sequence conservation of the base, and the base height represents the relative frequency at that position). Next, 65,254 APA loci were identified in the peripheral blood samples. The specific APA loci types and numbers in each sample were shown in supplementary Figure [86]S4. Overall, the proportion of APA loci in various types was similar between the two groups, with the largest proportion of genes having 1 polyA locus and the smallest proportion of genes having 5 polyA loci (Fig. [87]3d). As shown in Fig. [88]3e, we identified 10 motifs by analyzing the 50 bp sequence upstream of the poly(A) site of all transcripts, including conserved sequences such as AAUAAA. We found that the upstream and downstream regions of the 3′-untranslated regions (UTR) showed significant base propensity. Specifically, adenine (A) was enriched in the upstream and downstream regions of the 3′-UTR (Fig. [89]3c). Characteristic of CDS, lncRNAs and SSR In the transcriptome data, 7,737 open reading frames (ORFs) were detected. Of these, 4,171 were complete ORFs with a CDS length distribution between 0–1,100 aa. A complete CDS means that both the 5’ UTR and 3’ UTR are present, while a partial CDS indicates the presence of either the 5’ UTR or the 3’ UTR (Fig. [90]4a, Supplementary Table [91]S7). Fig. 4. [92]Fig. 4 [93]Open in a new tab Characteristics and quantities of CDS-encoded proteins, transcription factors (TFs), and long noncoding RNAs (lncRNAs). (a) Length distribution of CDS-encoded proteins. (b) TFs type distribution map. (c) Predicted lncRNAs Venn diagram by four methods. (d) Classification map of lncRNAs locations. (e) The heatmap of significant TFs. The CPC, CNCI, CPAT, and Pfam protein domain tools identified 698, 547, 629, and 494 lncRNAs, respectively (Supplementary Table [94]S8). By intersecting the results of these four methods, a total of 494 lncRNAs were predicted with high confidence. These included 31 sense-lncRNAs, 100 anti-sense lncRNAs, 285 intergenic lncRNAs, and 78 intronic lncRNAs. Among them, 452 lncRNAs were identified as novel lncRNAs, and two differentially expressed lncRNAs were ONT.13874.1 and ONT.2222.1, respectively (Supplementary Table [95]S9 and Fig. [96]4c and d). By analyzing the sequence of transcripts over 500 bp, seven types of SSR were identified, including Mono-nucleotide, Di-nucleotide, tris-nucleotide, Tetra-nucleotide, Penta-nucleotide, Hexa-nucleotide and compound SSR (mixed microsatellite, two SSRS are less than 100 bp apart), of which Mono-nucleotide is the largest number(Table [97]1). Table 1. The type and number of simple sequence repeat. Searching item Numbers Total number of sequences examined 102,735 Total size of examined sequences (bp) 221,917,398 Total number of identified SSRs 44,655 Number of SSR containing sequences 24,779 Number of sequences containing more than 1 SSR 9,611 Number of SSRs present in compound formation 4,667 Mono nucleotide 25,637 Di nucleotide 8,088 Tri nucleotide 9,656 Tetra nucleotide 974 Penta nucleotide 225 Hexa nucleotide 75 [98]Open in a new tab Characteristics and binding sites of TFs A total of 1,768 TFs were predicted from the novel transcripts in this study (Supplementary Table [99]S10, Fig. [100]4b). zf-C2H2, Homeobox, and bHLH were the top three TF families with the largest number. TFs activity refers to the functional state of transcription factors in cells that regulate gene expression. We assessed the activity of TFs in each sample based on activity scores and identified coregulated TFs pairs. The heat map of TFs activity showed that the BCLB-ZBTB family was a positively regulated TF in the POI group, while the remaining TF families were negatively regulated, with darker colors indicating higher activity (Fig. [101]4e). Additionally, the promoter regions of differential genes (the region about 1 kb upstream of the gene) were analyzed to predict the binding sites for transcription factors and obtain the sequence characteristics of TF motifs (Supplementary Fig. [102]S5). These data will make a valuable contribution to the further investigation of the regulatory mechanisms underlying downstream target genes of TFs. Network interaction analysis of DET-TF-pathway and PPI The interaction of DETs, TFs and pathways was used to construct a network interaction diagram (Fig. [103]5), in which nodes of different colors represented different types of molecules. Network diagram is helpful for understanding biological processes and disease mechanisms, and provides important references for disease diagnosis, treatment and