Abstract

   Premature ovarian insufficiency (POI) is a reproductive endocrine
   disorder characterized by infertility and the perimenopausal syndrome.
   Its genetic etiology is highly heterogeneous and not yet fully
   understood. Limited by short-read sequencing, the profile and
   structural variation of the full-length transcript for POI have
   remained elusive. Therefore, this study included peripheral blood
   samples from 5 POI patients and 5 controls, characterizing full-length
   transcripts of POI using Oxford Nanopore sequencing firstly.
   Ultimately, we identified 26,122 transcripts, including 7,724 novel
   gene loci and 13,593 novel transcripts. A total of 382 differentially
   expressed transcripts were identified, including 366 down-regulated and
   16 up-regulated transcripts. Based on transcript structure variant
   analysis, 8,834 alternative splicing events, 65,254 alternative
   polyadenylation sites and 32 motifs were further identified, revealing
   the diversity sources of transcript isoforms, proteins and genetic
   complexity. Enrichment analysis of differentially AS genes suggested
   that the ferroptosis pathway may play an important role in the
   pathogenesis of POI.Additionally, 494 high-confidence lncRNAs, 1,768
   transcription factors, and novel gene-coding regions were predicted
   based on full-length transcript sequence. Analysis of immune cell
   subtypes revealed the expression of CD8 + T cells and monocytes were
   down-regulated in POI, which was significantly positively correlated
   with AMH, suggesting that CD8 + T cells and monocytes could serve as
   potential diagnostic markers and immunotherapy targets for POI.
   Conclusively, this study provides new perspectives on the pathogenesis,
   post-transcriptional regulation mechanisms, and immune targets of POI.

Supplementary Information

   The online version contains supplementary material available at
   10.1038/s41598-025-89391-5.

   Keywords: Premature ovarian insufficiency, Full-length transcriptome,
   Alternative splicing, Alternative polyadenylation, Immune cell subtypes

   Subject terms: Biological techniques, Biotechnology, Computational
   biology and bioinformatics, Genetics, Molecular biology

Introduction

   Premature ovarian insufficiency (POI) refers to the development of
   reproductive endocrine dysfunction and follicular depletion in women
   before the age of 40. The main clinical manifestations are infertility
   and menstrual disorders, characterized by hypergonadotropic and
   hypoestrogenic amenorrhea^[44]1. It is estimated that the prevalence of
   POI is 1% before the age of 40, while it is approximately 0.01% before
   the age of 20^[45]2. Due to reduced estrogen production, women with POI
   have a higher risk for long-term complications, including
   cardiovascular disease, osteoporosis, genitourinary symptoms, and
   neurodegeneration^[46]3. The etiology of POI is highly heterogeneous.
   Chromosomal abnormalities, autoimmune diseases, and medical factors
   (surgery, radiotherapy, chemotherapy) have been reported, but the
   etiology is still unknown in 90% of cases^[47]4. Genetic factors are
   considered the most important factor in promoting the development of
   POI^[48]5. Therefore, exploring the characteristics of the genetic
   profile is essential for understanding the pathogenesis and therapeutic
   targets of POI.

   The advent of high-throughput sequencing technology has brought about
   revolutionary breakthroughs in genetic research and biomedicine. Oxford
   nanopore technology (ONT) is the third generation of high-throughput
   sequencing based on the real-time detection of single molecule
   electrical signals. The principle is to combine DNA/RNA duplets with
   nanopore proteins and unhelix them, use voltage to drive nucleic acid
   molecules through the pore one by one, and identify the sequence of
   nucleic acid base by detecting the changes in electrical
   signals^[49]6,[50]7. Compared to the next-generation sequencing
   technologies represented by the Illumina platform, ONT sequencing
   offers distinct advantages including ultra- long read lengths, shorter
   sequencing cycle, and real-time results acquisition^[51]8,[52]9.
   Moreover, full-length sequences facilitates more accurate refinement of
   genome annotation information and explore intricate transcript
   structural variations.

   In this study, the state-of-the-art PromethION platform from ONT was
   used to characterize transcriptional expression profiling of POI. The
   aim was to refine the functional annotation of POI transcripts and
   explore post-transcriptional regulatory signature, including
   alternative splicing (AS) events, alternative polyadenylation (APA)
   sites, fusion genes, predicted long non-coding RNAs (lncRNAs), coding
   sequence (CDS), and transcription factor (TFs) families based on
   full-length sequences. And the proportion of different immune cell
   subtypes was analyzed based on the transcriptional profile data. These
   findings will provide new insights into the pathogenesis and genetic
   diversity of POI (Fig. [53]1).

Fig. 1.

   [54]Fig. 1
   [55]Open in a new tab

   Bioinformatics analysis of ONT sequencing. ONT: oxford nanopore
   technology, AS: alternative splicing, APA: alternative polyadenylation,
   CPM: counts per million, DET: differentially expressed transcript, CDS:
   coding sequence, SSR: simple sequence repeat.

Materials and methods

Participants and samples collection

   The diagnostic criteria for POI were as follows^[56]1: (i) age younger
   than 40 years; (ii) experiencing oligo/amenorrhea for at least 4
   months; and (iii) serum follicle-stimulating hormone (FSH)
   concentrations exceeding 25 IU/L, detected on two occasions with an
   interval of more than 4 weeks. The inclusion criteria for the control
   group including (i) age and body mass index (BMI) matched controls;
   (ii) diagnosis of infertility due to tubal factors; (iii) with normal
   menstrual cycles and ovarian function. The exclusion criteria for both
   the case and control groups were consistent: (i) presence of other
   endocrine disorders; (ii) history of pelvic or ovarian surgery; (iii)
   use of hormones or medications affecting the endocrine system within 3
   months prior to blood sampling.

   The peripheral blood samples (2.5 ml) were collected at 9 a.m. using
   PAXgene Blood RNA Tubes (BD, United States) following a 12-hour fasting
   period and subsequently stored at -80 °C for preservation. The clinical
   data of the participants were collected, including age, BMI, antral
   follicle (AFC), anti-Mullerian hormone (AMH), FSH, Luteinizing hormone
   (LH), estradiol (E2), progesterone (P) and prolactin (PRL). The present
   study was granted approval by the Ethics Committee of the First
   Affiliated Hospital of Guangxi Medical University (NO. 2021KY-E-249),
   and written informed consent was obtained from all participants. We
   confirmed that all methods were conducted in accordance with relevant
   guidelines and regulations.

The construction of cDNA library

   The total RNA of the samples was extracted by PAXgene Blood miRNA Kit
   (manufacturer: QIAGEN, modle: 763134) and construct a cDNA library by a
   matching kit (Thermo Scientific™Maxima H Minus Reverse Transcriptase,
   modle: EP0752)). The PromethION platform (Oxford Nanopore Technologies,
   Oxford, UK) was employed for conducting full-length transcriptome
   sequencing. The full-length sequence was refined to generate a
   consensus sequence, which was subsequently aligned with the human
   reference genome (version: Homo_sapiens GRCh38_release95) using
   Minimap2 software (2.16 version). Sequences with an identity of less
   than 0.9 and coverage below 0.85 were excluded, resulting in a final
   de-redundant sequence that was utilized for further analysis^[57]10.

Identification and functional annotation of novel transcripts

   The gff3 file obtained from the non-redundant sequence was compared to
   the gff3 of reference genome using the gffcompare software, and
   transcripts identified in this sequencing result but not annotated by
   the reference genome were classified as novel transcripts. The
   annotation information for novel transcripts was obtained by comparing
   them with the Non-Redundant Protein Database (NR,
   [58]https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/),
   Swissprot ([59]https://www.uniprot.org/), Gene Ontology (GO,
   [60]http://geneontology.org/), Clusters of Orthologous Groups
   (COG,[61]https://www.ncbi.nlm.nih.gov/research/cog/), euKaryotic
   Ortholog Groups (KOG, [62]https://www.ncbi.nlm.nih.gov/COG/), Protein
   family (Pfam, [63]http://pfam.xfam.org/), and Kyoto Encyclopedia of
   Genes and Genomes (KEGG, [64]https://www.genome.jp/kegg/)
   databases^[65]11.

Identification of differentially transcripts and enrichment analysis

   The transcript expression levels were quantified as counts per million
   (CPM). The specific calculation method was referred to our previous
   study^[66]10. The DESeq2 R package was used to perform differential
   expression analysis on the transcripts from all samples. The
   transcripts with | log2 fold change(FC)| >1.5 and a false discovery
   rate (FDR) < 0.05 were considered as differentially expressed
   transcripts (DETs). The false discovery rate (FDR) is calculated by
   applying the Benjamini-Hochberg correction method to adjust the
   original significance P-values obtained from hypothesis testing.Then
   the enrichment analysis of KEGG and GO was performed on the DETs.

Analysis of alternative splicing and alternative polyadenylation

   We used AStalavista software to identify AS types in all samples,
   including intron retention (IR), exon skipping (ES), mutually exclusive
   exon (MEE), alternative 3 ‘splice sites (A3S), and alternative 5’
   splice sites (A5S). The genes with alternative splicing only in the POI
   group were identified and intersected with the DETs, and then the
   intersecting genes were subjected to KEGG pathway enrichment analysis.
   APA sites were verified using the TAPIS pipeline, and the number and
   location of APA loci were analyzed. The sequence features 50 bp
   upstream of the transcript poly(A) site were analyzed using DREME
   software to determine the motif.

Prediction of lncRNAs, CDS and SSR

   Transcripts with minimum length > 200nt and exon number ≥ 2 were
   defined as candidate lncRNAs. The intersection of the results predicted
   by the four tools were identified as lncRNAs with high confidence.
   Protein family (Pfam): identifies coding transcripts with protein
   domain matches; unmatched transcripts are potential lncRNAs. Coding
   potential calculator (CPC): classifies transcripts with a score < 0 as
   noncoding. Coding potential assessment tool (CPAT): uses a logistic
   regression model to evaluate coding potential based on ORF length, ORF
   coverage, fickett score, and hexamer score. Coding-noncoding index
   (CNCI): transcripts with a score < 0 are identified as noncoding.
   Coding sequences (CDS) were identified from transcripts using
   TransDecoder software. The transcripts with more than 500 bp were
   screened from the deredundant transcripts, and simple sequence repeat
   (SSR) analysis was conducted using MISA software. Enrichment indicators
   for each SSR classification are shown in Supplementary Table [67]S1.

Transcription factor annotation and binding site prediction

   The transcription factors (TFs) are a class of proteins that
   specifically bind to nucleotide sequences located upstream of genes and
   play a crucial role in the regulation of gene expression. We identified
   the TFs types through comparison with the AnimalTFDB 3.0 database
   (AnimalTFDB3) and assessed their activities in a given sample using the
   CoRegNet software package, presenting the results visually through heat
   maps. The R package TFBStools was used to predict the TFs binding sites
   (TFBS) in the promoter regions of the differentially expressed genes.

Network analysis of DETs-TFs-pathways and protein-protein interaction

   Transcription factors targeting DETs were predicted using a
   transcription factor database. Subsequently, pathways associated with
   these DETs were identified through a gene function annotation database.
   The targeting relationships among DETs, TFs, and pathways were then
   imported into Cytoscape software to construct and visualize a network
   map.In addition, the DETs were uploaded to the STRING database (STRING:
   functional protein association networks) for protein-protein
   interaction (PPI) analysis. The PPI results were subsequently imported
   into Cytoscape software for network analysis. The degree value of each
   node was calculated, and a concentric circle layout was used for
   visualization. The innermost circle represented the most critical hub
   genes based on their degree values.

Immune cell subtypes and clinical correlation analysis

   CIBERSORT is a cell type recognition algorithm based on gene expression
   data. In this study, the CIBERSORT package was used to perform immune
   cell subtypes analysis of transcriptional expression profiles to infer
   the relative proportions of distinct immune cell subsets in samples.
   Analyze the differential expression of immune cell subsets between two
   groups using T-test, further analyze the correlation between
   differential cell subsets and clinical indicators AMH and FSH by
   calculating Pearson coefficient. P < 0.05 was considered statistically
   significant.

Results

Overview of full-length transcriptome sequencing

   Compared to the control group, the POI group exhibited significantly
   elevated levels of FSH and LH, while AMH levels were significantly
   lower (P < 0.05). However, there were no significant differences in
   age, BMI, E2, PRL, and T between the two groups (P > 0.05)(
   Supplementary Table [68]S2). The Nanopore PromethION platform was used
   to construct a full-length transcriptome library of peripheral blood
   samples. The raw data of each sample were filtered for short fragments
   and low-quality reads to obtain the total clean data, and the output of
   clean data of each sample was more than 3.77 GB. After the removal of
   low-quality sequences and ribosomal RNA sequences, the proportion of
   full-length sequences ranged from 90.2 to 92.84%, with full-length
   sequences ranging from 2,781,865 to 4,582,222^[69]10.

Identification and functional annotation of novel transcripts and DETs

   Using Minimap2 to align full-length sequences against the human
   reference genome, we identified a total of 26,122 transcript sequences,
   which included 7,724 novel gene loci and 13,593 novel
   transcripts(Supplementary Table [70]S3). Additionally, a total of 382
   DETs were identified through the differential expression analysis,
   including 366 down-regulated and 16 up-regulated transcripts^[71]10
   (Supplementary Table [72]S4 and Fig.[73]S1). Annotation information of
   DETs in seven databases was shown in Supplementary Table [74]S5.

   GO functional annotation analysis showed that DETs were significantly
   enriched in the viral life cycle in biological processes. The molecular
   functions significantly enriched included superoxide-generating NADPH
   oxidase activity, virus receptor activity, guanylate kinase activity.
   Additionally, ficolin-1-rich granules and ficolin-1-rich granule
   membranes were mainly enriched in cellular components (Fig. [75]2a-c).
   KEGG enrichment analysis showed that the top 20 most significantly
   enriched pathways, including protein processing in endoplasmic
   reticulum, fluid shear stress and atherosclerosis, phagosome, Fc gamma
   R-mediated phagocytosis, leishmaniasis, osteoclast differentiation
   (Fig. [76]2d). The COG database annotated 20 entries, of which
   translation, ribosomal structure and biogenesis accounted for the
   largest proportion [15-17.24%]. There were 21 items annotated in KOG,
   of which signal transduction mechanisms accounted for the largest
   proportion [42-14.19%] (Supplementary Figure [77]S2).

Fig. 2.

   [78]Fig. 2
   [79]Open in a new tab

   Enrichment analysis of DETs. (a–c) GO enrichment analysis.(d) KEGG
   enrichment analysis.

Characteristics of AS, APA and motif

   Based on the ONT sequencing data, we identified 8,834 AS events
   involving five types (Fig. [80]3a). Of these, 4,231 AS events were
   identified in the POI group, including 2,419 ES, 633 A3S, 528 A5S, 477
   IR, and 174 MEE events. Additionally, 4,603 AS events were found in the
   control group, including 2,591 ES, 713 A3S, 560 A5S, 546 IR, and 193
   MEE events. Overall, the proportions of various AS types were similar
   in both groups, with the largest proportion of ES and the smallest
   proportion of MEE (Fig. [81]3b). Meanwhile, we identified 688 genes
   that only experienced AS events in the POI group. By taking the
   intersection of these AS genes and DET, 17 intersection genes were
   obtained. Further KEGG enrichment analysis indicated that the
   intersection genes were significantly enriched in the ferroptosis
   pathway (Supplementary Table [82]S6, Fig. [83]S3).

Fig. 3.

   [84]Fig. 3
   [85]Open in a new tab

   Characteristics and quantities of AS, APA, and motifs. (a) Alternative
   splicing type. (b) Number of alternative splicing events in POI and
   controls. (c) Base distribution plot of upstream and downstream of
   poly(A) site. (d) Number of alternative polyadenylation loci in two
   groups. (e) motifs upstream of the polyA site (the abscissa is the
   relative position of the base in the motif, the ordinate is the
   sequence conservation of the base, and the base height represents the
   relative frequency at that position).

   Next, 65,254 APA loci were identified in the peripheral blood samples.
   The specific APA loci types and numbers in each sample were shown in
   supplementary Figure [86]S4. Overall, the proportion of APA loci in
   various types was similar between the two groups, with the largest
   proportion of genes having 1 polyA locus and the smallest proportion of
   genes having 5 polyA loci (Fig. [87]3d).

   As shown in Fig. [88]3e, we identified 10 motifs by analyzing the 50 bp
   sequence upstream of the poly(A) site of all transcripts, including
   conserved sequences such as AAUAAA. We found that the upstream and
   downstream regions of the 3′-untranslated regions (UTR) showed
   significant base propensity. Specifically, adenine (A) was enriched in
   the upstream and downstream regions of the 3′-UTR (Fig. [89]3c).

Characteristic of CDS, lncRNAs and SSR

   In the transcriptome data, 7,737 open reading frames (ORFs) were
   detected. Of these, 4,171 were complete ORFs with a CDS length
   distribution between 0–1,100 aa. A complete CDS means that both the 5’
   UTR and 3’ UTR are present, while a partial CDS indicates the presence
   of either the 5’ UTR or the 3’ UTR (Fig. [90]4a, Supplementary Table
   [91]S7).

Fig. 4.

   [92]Fig. 4
   [93]Open in a new tab

   Characteristics and quantities of CDS-encoded proteins, transcription
   factors (TFs), and long noncoding RNAs (lncRNAs). (a) Length
   distribution of CDS-encoded proteins. (b) TFs type distribution map.
   (c) Predicted lncRNAs Venn diagram by four methods. (d) Classification
   map of lncRNAs locations. (e) The heatmap of significant TFs.

   The CPC, CNCI, CPAT, and Pfam protein domain tools identified 698, 547,
   629, and 494 lncRNAs, respectively (Supplementary Table [94]S8). By
   intersecting the results of these four methods, a total of 494 lncRNAs
   were predicted with high confidence. These included 31 sense-lncRNAs,
   100 anti-sense lncRNAs, 285 intergenic lncRNAs, and 78 intronic
   lncRNAs. Among them, 452 lncRNAs were identified as novel lncRNAs, and
   two differentially expressed lncRNAs were ONT.13874.1 and ONT.2222.1,
   respectively (Supplementary Table [95]S9 and Fig. [96]4c and d).

   By analyzing the sequence of transcripts over 500 bp, seven types of
   SSR were identified, including Mono-nucleotide, Di-nucleotide,
   tris-nucleotide, Tetra-nucleotide, Penta-nucleotide, Hexa-nucleotide
   and compound SSR (mixed microsatellite, two SSRS are less than 100 bp
   apart), of which Mono-nucleotide is the largest number(Table [97]1).

Table 1.

   The type and number of simple sequence repeat.
   Searching item                                 Numbers
   Total number of sequences examined             102,735
   Total size of examined sequences (bp)          221,917,398
   Total number of identified SSRs                44,655
   Number of SSR containing sequences             24,779
   Number of sequences containing more than 1 SSR 9,611
   Number of SSRs present in compound formation   4,667
   Mono nucleotide                                25,637
   Di nucleotide                                  8,088
   Tri nucleotide                                 9,656
   Tetra nucleotide                               974
   Penta nucleotide                               225
   Hexa nucleotide                                75
   [98]Open in a new tab

Characteristics and binding sites of TFs

   A total of 1,768 TFs were predicted from the novel transcripts in this
   study (Supplementary Table [99]S10, Fig. [100]4b). zf-C2H2, Homeobox,
   and bHLH were the top three TF families with the largest number. TFs
   activity refers to the functional state of transcription factors in
   cells that regulate gene expression. We assessed the activity of TFs in
   each sample based on activity scores and identified coregulated TFs
   pairs. The heat map of TFs activity showed that the BCLB-ZBTB family
   was a positively regulated TF in the POI group, while the remaining TF
   families were negatively regulated, with darker colors indicating
   higher activity (Fig. [101]4e). Additionally, the promoter regions of
   differential genes (the region about 1 kb upstream of the gene) were
   analyzed to predict the binding sites for transcription factors and
   obtain the sequence characteristics of TF motifs (Supplementary Fig.
   [102]S5). These data will make a valuable contribution to the further
   investigation of the regulatory mechanisms underlying downstream target
   genes of TFs.

Network interaction analysis of DET-TF-pathway and PPI

   The interaction of DETs, TFs and pathways was used to construct a
   network interaction diagram (Fig. [103]5), in which nodes of different
   colors represented different types of molecules. Network diagram is
   helpful for understanding biological processes and disease mechanisms,
   and provides important references for disease diagnosis, treatment and