Abstract Haematococcus pluvialis is a microalga known for producing the red carotenoid, astaxanthin. Key research areas to improve productivity include optimizing vegetative biomass, enhancing astaxanthin content, and controlling secondary cell wall formation. In this study, RNA-seq was conducted on 96 H. pluvialis samples under various treatments and time points, generating 96.7 GB of high-quality data (2,080,511,173 clean reads, quality score > 30). Gene expression was quantified as transcripts per million (TPM), and genes with similar expression patterns were clustered. A co-expression network, constructed with a soft threshold of β = 7 (R² > 0.9), identified 39 co-expression modules, each averaging 732 genes. Additionally, the re-annotation process provided functional descriptions for 8,598 previously uncharacterized genes. Gene functional classification analysis revealed their involvement in metabolic pathways related to genetic information processing and primary/secondary metabolite metabolism. This dataset provides a valuable resource for exploring the molecular mechanisms involved in critical processes such as growth regulation, astaxanthin accumulation, and secondary cell wall synthesis in H. pluvialis. Subject terms: RNA sequencing, Bioinformatics Background & Summary Haematococcus pluvialis is a freshwater unicellular green microalga belonging to the class Chlorophyceae, order Volvocales, and family Haematococcaceae. It has been cultivated on an industrial scale^[50]1. This species is wildly distributed across the world, often dominating temporary freshwater habitats such as shallow pools, rocky depressions, and birdbaths^[51]2. H. pluvialis garners significant interest due to its status as the richest natural source of astaxanthin, a potent antioxidant with nutraceutical and pharmacological applications^[52]3. Astaxanthin can be obtained from both natural sources and chemical synthesis. While chemical synthesis method allows for large-scale production at a lower cost, the resulting product often contains a mixture of isomers and by-products. These impurities can reduce the antioxidant capacity and bioavailability of synthetic astaxanthin compared to its natural counterpart. Additionally, the use of petrochemical sources in its production raises concerns about food safety, potential toxicity in the final product. Hence, synthetic astaxanthin has not been proved for direct human consumption, and is primarily used in industrial applications such as aquaculture feed for coloration purposes^[53]4. In contrast, natural astaxanthin is free from harmful by-products and contaminants, and has been approved for direct human consumption^[54]5. In the aquaculture industry, H. pluvialis powder, rich in natural astaxanthin, can also been used as feed to enhance coloration and improve the growth and immune capacity of crabs, shrimp, and fish^[55]6–[56]9. In humans, natural astaxanthin-based dietary supplements have been shown to boost immune response and help prevent various human diseases, including cardiovascular disease, ocular diseases, cancer, diabetes, and UV-induced skin damage^[57]10. As a result, natural astaxanthin is indispensable and is expected to experience substantial demand and growth in the coming years^[58]11. Additionally, the significantly higher price of natural astaxanthin compared to its synthetic counterpart has drawn increased attention to its production^[59]5. As a unicellular microalga, H. pluvialis is highly sensitive to environmental changes. Under optimal conditions, H. pluvialis cells are green, motile, and rapidly increase in biomass through cell division. Sodium acetate is one of the most effective organic carbon sources for enhancing the growth rate and biomass of H. pluvialis^[60]12. Under stress conditions, the cells undergo significant physiological and morphological transformations, including cessation of cell division, loss of flagella, accumulation of astaxanthin, and formation of a thick secondary cell wall (SCW). This process ultimately results in the development of astaxanthin-rich aplanospores^[61]13,[62]14. The production of astaxanthin from H. pluvialis typically follows a two-stage process to optimize yield, as the conditions favourable for vegetative growth differ from those required for astaxanthin accumulation^[63]15. In the first stage, i.e., the vegetative growth phase, biomass production is maximized by cultivating algal cells in nutrient-rich media under low-light conditions. Optimal biomass productivity has been achieved using mixotrophic culture systems, which include sodium acetate as an organic carbon source and carbon dioxide as an inorganic carbon source^[64]16. The second stage, i.e., the astaxanthin accumulation phase, is triggered by subjecting the cells to stress conditions, such as high light intensity, nutrient deprivation, UV-B irradiation, drought, or salinity stress^[65]17,[66]18. Among these, high light exposure is one of the most effective and wildly used methods for inducing astaxanthin accumulation in H. pluvialis^[67]19. Combining high light stress with chemical modulators such as sucrose^[68]20, potassium iodide^[69]21, trisodium citrate^[70]22, and sodium acetate^[71]11, or with other stress conditions such as nitrogen deprivation^[72]23, and salinity stress^[73]24, can further enhance astaxanthin accumulation in H. pluvialis. Elucidating the mechanisms by which inducers promote the growth and astaxanthin accumulation in H. pluvialis holds significant biological importance. The SCW of H. pluvialis exhibits a tri-layered structure consisting of the trilaminar sheath, the secondary wall, and the tertiary wall^[74]25,[75]26. The trilaminar sheath is primarily composed of algaenan, a biopolymer resistant to acetylation. The secondary wall is mainly constituted of mannan, while the tertiary wall comprises heterogenous polysaccharides. This complex and durable SCW provides H. pluvialis cells with considerable resistance to chemical and physical damage, allowing them to endure various abiotic and biotic stresses. However, this structural robustness significantly hinders the digestion and absorption of astaxanthin within aplanospores for both humans and animals^[76]27. Furthermore, it complicates the extraction of astaxanthin through physical and chemical methods. Therefore, understanding the mechanisms by which stress conditions influence SCW biosynthesis, as well as the relationship between SCW formation and astaxanthin accumulation, is crucial. Transcriptome analysis using rapid and cost-effective next-generation sequencing (NGS) enables precise, high-throughput profiling of gene expression. It has become a standard tool in life sciences research, enhancing our understanding of molecular mechanisms, particularly in plant responses to environmental factors, by providing an accurate depiction of cellular processes^[77]28,[78]29. In this study, the transcriptome dataset of H. pluvialis using 96 RNA samples collected from two stages—vegetative growth and astaxanthin accumulation—under multiple treatments and at various time points were analysed. This dataset offers a comprehensive overview of gene expression in H. pluvialis under different conditions. Furthermore, it serves as a valuable tool for exploring the regulatory mechanisms behind cell growth, astaxanthin accumulation, and SCW biosynthesis. This can be achieved by examining the differential growth rates, variations in astaxanthin and SCW accumulation across multiple samples, and by gaining insights from weighted gene co-expression network analysis (WGCNA). Methods Design and sample processing H. pluvialis (strain number: NBU489) was obtained from the Algae Collection Lab at Ningbo University, which was originally collected from Dongqian Lake in Ningbo, Zhejiang, China. The algal cultivation was carried out in two-stage method^[79]30. Algal cells were initially cultivated under low light conditions to promote biomass growth, and subsequently transferred to high light conditions to induce astaxanthin accumulation. In the low light stage, i.e., the vegetative growth stage, cells were grown in 250 mL Erlenmeyer flasks containing 150 mL of basal medium, under a light intensity of 25–30 μmol photon m^−2 s^−1 with a 12-hour light/dark cycle at 24 °C. The initial cell density was about 2.5 × 10^4 cells/mL. The basal medium used was NBU3^#, containing the following components: NaNO[3] (100 mg/L); Na[2]EDTA (20 mg/L); K[2]HPO[4] (10 mg/L); FeSO[4]⋅7H[2]O (2.5 mg/L); MnSO[4] (0.25 mg/L); vitamin B[12] (5 × 10^−4 mg/L); and vitamin B[1] (1.2 × 10^−3 mg/L). In the high light stage, i.e., astaxanthin accumulation stage, flasks containing H. pluvialis cells at the plateau stage from first stage, with a cell density of approximately 2.5 × 10^5 cells/mL, were exposed to continuous illumination at a higher light intensity of 140 μmol photon m^−2 s^−1 ^[80]31. Based on this two-stage cultivation strategy, samples for the sequencing project were collected at different time points from both the low and high light stages under various treatment conditions. At the low-light stage, algal cells were collected from both the logarithmic phase (sample IDs: 1–3) and plateau phase (sample IDs: 4–6) of the basal medium group, as well as from the logarithmic phase of the 1.8 g/L sodium acetate treatment group (sample IDs: 7–9). As described in the cultivation method above, the three samples collected during the low-light plateau phase (sample IDs: 4–6) can be regarded as the day 0 samples for high light exposure. During the high-light stage, samples were collected from different groups at various time points. On day 1 of high light exposure, samples were collected from the control group (sample IDs: 10–15), the 1 mM potassium iodide treatment group (sample IDs: 43–45), the nitrogen deprivation group (1/4 NBU3# without NaNO[3], sample IDs: 52–54), the nitrogen repletion group (1/4 NBU3# plus 25 mg/L NaNO[3], sample IDs: 58–60), the 1 mM fructose diphosphate treatment group (sample IDs: 64–66), and the 5 mM fructose diphosphate treatment group (sample IDs: 70–72). On day 2 of high light exposure, samples were collected from the control group (sample IDs: 16–18) and the 0.5 g/L NaCl plus 0.5 mM cytidine 5′-monophosphate (CMP) treatment group (sample IDs: 76–78). At the 60-hour of high light exposure, samples were collected from the control group (sample IDs: 19–21) and a mutant strain (RMS 10, sample IDs: 79–81). On day 3 of high light exposure, samples were collected from the control group (sample IDs: 22–30), the 1 mM xanthosine treatment group (sample IDs: 82–87), and the 1 mM taurine treatment group (sample IDs: 88–90). On day 5 of high light exposure, samples were collected from the control group (sample IDs: 31–39), the 1 mM potassium iodide treatment group (sample IDs: 46–48), the nitrogen deprivation group (1/4 NBU^3# without NaNO[3], sample IDs: 55–57), the nitrogen repletion group (1/4 NBU^3# plus 25 mg/L NaNO[3], sample IDs: 61–63), the 1 mM fructose diphosphate treatment group (sample IDs: 67–69), the 5 mM fructose diphosphate treatment group (sample IDs: 73–75), the 1 mM uridine treatment group (sample IDs: 91–93), and the 1 mM guanine treatment group (sample IDs: 94–96). Finally, on day 10 of high light exposure, samples were collected from the control group (sample IDs: 40–42) and the 1 mM potassium iodide treatment group (sample IDs: 49–51). The control group had the same culture conditions (basal medium, light intensity, photoperiod, etc.) as the experimental group in the same batch, except for the absence of exogenous inducer addition prior to high light stress. These samples were collected from ten batches of experiment. These experiments were designed to achieve three primary objectives: (1) enhancing biomass production (batches II, VII, IX and X), (2) increasing astaxanthin accumulation (batches I, III, IV, V, VI and IX), and (3) suppressing SCW synthesis (batches I, III, IV, V, VI, VII, VIII, IX and X). The sample information arranged by experimental batches was listed in Supplementary Table [81]S1. The morphological and physiological traits of these samples were also described in Supplementary Table [82]S1. Algal cells were harvested by centrifugation, immediately frozen in liquid nitrogen, and stored at −80 °C for subsequent transcriptomic profiling. An overview of the experimental workflows is provided in Fig. [83]1. Fig. 1. [84]Fig. 1 [85]Open in a new tab Overview of the experimental design and data analysis pipeline. This study comprised several key stages: Experimental design, RNA sequencing, Data processing, annotation, and analysis, each marked with distinct colours. Briefly, RNA from H. pluvialis under 26 different treatments was collected for Illumina RNA-seq. Then the sequencing data were subjected to quality control, comparison, quantification, and differential analysis. Finally, all genes underwent comprehensive re-annotation, co-expression networks were constructed, and transcription factors were predicted. RNA extraction, library preparation, and sequencing Total RNA of each sample was extracted using Trizol (Sangon Biotech, China) in accordance with the manufacturer’s instructions. RNA concentration and quality were assessed using a NanoDrop ND1000 spectrophotometer (Thermo Scientific, Waltham, MA, USA) and an Agilent Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA). RNA samples meeting the following criteria were used for library preparation and sequencing: concentration > 200 ng/μL, OD[260]/OD[280] > 1.8, OD[260]/OD[230] > 1.8, RIN > 6.5, and 28S/18S ratio > 1.0. After passing quality control, RNA sequencing libraries were constructed using the MGIEasy™ mRNA Library Preparation Kit (MGI, Shenzhen, China). mRNA was enriched using oligo(dT) magnetic beads, which bind to the polyA tail of mRNA through A-T base pairing. The enriched mRNA was fragmented and reverse transcribed into second-strand cDNA using N6 random primers. The cDNA was then purified using DNA magnetic beads and subjected to PCR amplification. After amplification, the library was purified again using DNA magnetic beads. The quality and size of the libraries were evaluated using a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and an Agilent 2100 system (Agilent Technologies, Santa Clara, CA, USA). Sequencing was performed by Wuhan BGI (Wuhan, China) on the MGISEQ-2000 platform, generating approximately 20 million clean reads per sample using single-end sequencing. Quality control and processing of RNA-seq data The quality of RNA-seq data was assessed using Fastp software^[86]32, which filtered out low-quality reads based on several criteria: bases with a quality score below Q20, reads shorter than 15 bp, reads containing more than 5% N bases, and other default settings. Following filtering, FastQC was employed to generate quality control reports, which were subsequently consolidated using MultiQC v1.22.3 with default settings^[87]33,[88]34. Clean reads from each sample were then aligned to the H. pluvialis genome (GenBank accession: GCA_011766145.1)^[89]35 using HISAT2 v2.2.1^[90]36 in end-to-end and sensitive mode. The alignment results were sorted and converted to BAM format using SAMtools^[91]37. Raw read counts were obtained and normalized to quantify transcript expression levels using transcripts per million (TPM) values, with featureCounts v2.0.1 of the Subread package used for quantification^[92]38. TPM normalization is based on raw counts and gene length information by using Trimmed Means of M-values method (TMM). For each sample, gene raw counts are divided by the gene length, and the resulting values are divided by the sum of gene-length normalized values for all genes and multiplied by one million^[93]39. Differential expression analysis was carried out with DESeq2 v1.34.0^[94]40, and the p-values were adjusted for multiple testing using the Benjamini-Hochberg (BH) procedure within DESeq2 to control the false discovery rate (FDR). Functional enrichment analysis for the genes, including both unknown genes and module genes, was conducted using the R package clusterProfiler^[95]41. Differentially expressed genes (DEGs) were selected based on two criteria, namely, an absolute value of log[2](fold change) >1 and a corresponding FDR value < 0.05. Pearson correlation analysis was conducted using the built-in cor function in R v4.21 software, with the method parameter set to “pearson”. Construction of weighted gene co-expression network Weighted Gene Co-expression Network Analysis (WGCNA) was performed to identify modules of highly correlated genes and to summarize their interactions, with the aim of uncovering regulatory mechanisms involved in cell growth, astaxanthin accumulation, and secondary cell wall biosynthesis^[96]42. The “WGCNA” package (version 1.72–5) in R software was used to construct and visualize the network. All samples were included, and the TPM values of all genes were utilized to generate gene co-expression networks. Firstly, the hclust function from the WGCNA package was employed for sample clustering to detect any significant outliers. Next, the pickSoftThreshold function was applied to select candidate power values ranging from 1 to 20. The appropriate soft-thresholding power (β) was chosen based on the scatter plot of the fit index versus the power value. A regulatory network of all genes was constructed to conform to a scale-free distribution based on connectivity, and an adjacency matrix was generated. This matrix was then transformed into a topological overlap matrix (TOM). Subsequently, the TOM-based dissimilarity metric was used to group genes with similar expression patterns into gene modules via average linkage hierarchical clustering with the following parameters: deepSplit = 2, minModuleSize = 30, and mergeCutHeight = 0.25. The resulting gene co-expression network was visualized as a heatmap based on the TOM dissimilarity, accompanied by a hierarchical clustering dendrogram using the TOMplot function. Module correlations were analyzed by extracting module eigengenes (MEs) from the network and calculating inter-module correlations using the cor function in R software. Additionally, the number of genes in each module was counted and visualized using a bar plot, generated with ggplot2 (version 4.3.3), and the module correlation heatmap was generated with corrplot (version 0.95). Gene function annotation and Transcription factors identification The reference genome of H. pluvialis (GenBank accession: GCA_011766145.1) contains many genes annotated as hypothetical or uncharacterized. To address this, a more comprehensive annotation of gene functions was conducted. Gene functions were assigned based on the best hits to proteins annotated in evolutionary genealogy of genes: Non-supervised Orthologous Groups database (the eggNOG^[97]43 v5.0, updated on 2022-07), Non-Redundant Protein database (NR^[98]44, updated on 2024-02), Swiss Protein database (Swiss-Prot^[99]45, updated on 2024-04), and Translated European Molecular Biology Laboratory database (TrEMBL^[100]46, updated on 2024-04), using Diamond Blast^[101]47 to search these databases with the target proteins as query sequences (cut-off E-value ≤ 1e-5), identifying homologous matches between the hypothetical proteins and known proteins, with the best match selected as the final annotation. Motifs and domains were annotated using InterProScan^[102]48 (version 5.67) by searching against publicly available InterPro databases^[103]49, including Pfam^[104]50, PRINTS^[105]51, PROSITE^[106]52, ProDom^[107]53, and SMART^[108]54. Gene Ontology (GO) information was retrieved from InterPro. GhostKOALA^[109]55 was used to map the genes of H. pluvialis to KEGG pathways by querying the Kyoto Encyclopedia of Genes and Genomes database (KEGG, updated on 2024-04) and identify the best hit for each node. Transcription factors (TFs) were annotated using the iTAK^[110]56 and PlantTFDB^[111]57 websites. Data Records The RNA-seq clean data and gene expression levels for the 96 samples analysed in this study have been deposited in the NCBI Sequence Read Archive (SRA) under the project accession number PRJNA1138290^[112]58 and the Gene Expression Omnibus (GEO) repository under the accession number of [113]GSE289633^[114]59, respectively. Furthermore, supplementary materials, including the sample information table (Table S1 Sample information), gene expression level, the gene co-expression network, and gene functional annotations (Table S2 Gene expression and function), differential gene expression analysis (Table S3 DEGs), KEGG pathway enrichment analyses of genes in each modules (Table S4 module enrichment), correlation coefficient and p-value in module correlation analysis (Table S5 module correlation), KEGG pathway classification of the 8,600 newly annotated genes (Table S6 New KEGG classification), GO annotation classification of the 8,600 newly annotated genes (Table S7 New GO classification), predicted transcription factors (Table S8 TFs), and R code for constructing the WGCNA co-expression network (R for WGCNA), are accessible via the Figshare data management platform^[115]60. Technical Validation Illumina RNA-seq quality validation Quality assessment of the raw reads from Illumina RNA-seq data for each sample was conducted using FastQC, which encompassed evaluation of mean quality scores (Fig. [116]2a), per sequence quality scores (Fig. [117]2b), per sequence GC content (Fig. [118]2c), and per base N content (Fig. [119]2d), followed by integration the results of 96 samples with MultiQC. Clean reads meet the following criteria to be retained for subsequent analysis: both the base quality score and the sequence quality scores exceed 30, and there should be no ‘N’ base in the sequence. Post the quality assessment procedures, a total of 96.7 gigabases (GB) of clean reads were yielded from these samples. The GC content of the samples exhibited a normal distribution ranging from 57% to 60% (Fig. [120]2c). The read quality of the samples indicated that the RNA-seq reads in this study were of high quality and devoid of contamination. Clean reads from the 96 samples were aligned to the reference genome using HISAT2, with an average mapping rate of 63.31% (Table [121]1). Gene expression levels were quantified based on the reads mapped to the reference genome. To mitigate the effects of gene length and data size on the calculated gene expression levels, the Transcripts Per Million (TPM) method was employed as a standardized approach for estimating gene expression levels. The distribution of normalized TPM values for all genes in each sample is depicted in a boxplot (Fig. [122]3a). The rectangles in the boxplot represent the interquartile range (IQR), capturing the central 50% of gene expression values, which primarily lie between 1.78 to 21.4 TPM. The median, marked by a bold line within each box, is positioned towards the lower end of the box, indicating a general trend of low gene expression. All samples exhibit a similar distribution, with the except of HL-Day3-Xan-2, which appears as an outlier. This consistency in low expression range, along with the alignment of IQRs across samples, suggests that gene expression distributions are comparable across different samples and that sequencing depth is consistent. The reliability of the RNA-seq data across the 96 samples was also assessed using Pearson correlation analysis (Fig. [123]3b). The correlation analysis shown that the correlation coefficients between different biological replicates of the same group approximately 0.950, indicating excellent sample reproducibility. The average correlation coefficient between samples under high light conditions was 0.758 and that under low light conditions was 0.841, while the average correlation coefficients between high-light and low-light samples was only 0.558. This result indicated that although the addition of different chemical reagents had a significant impact on the transcriptome of H. pluvialis, the effect of light intensity on the transcriptome of H. pluvialis was the most pronounced. The results from the above mentioned analytical methods demonstrated that the biological replicates in this study are highly consistent, indicating that the data obtained in this study could be used for subsequent research. Fig. 2. [124]Fig. 2 [125]Open in a new tab Sequence quality assessment metrics of filtered and trimmed RNA-Seq data using MultiQC. (a) Mean quality scores. (b) Per sequence quality scores. (c) Per sequence GC content. (d) Per base N content. Table 1. Summary of RNA-sequencing. Group Clean reads Total mapped reads Mapping rate (%) Uniquely mapped ratio (%) Multi mapped ratio (%) LL-Plateau 21,712,009 14,107,131 64.97% 50.07% 14.90% LL-Exponential 21,274,334 14,582,793 68.55% 51.41% 17.14% LL-Exponential-NaAc 21,199,530 14,295,510 67.43% 50.95% 16.48% HL-Day1 21,741,5 46 14,219,016 65.41% 50.62% 14.80% HL-Day2 21,571,323 13,677,234 63.40% 48.77% 14.64% HL-60h 22,816,936 13,412,442 58.67% 45.30% 13.37% HL-Day3 21,722,016 13,642,382 62.86% 48.29% 14.52% HL-Day5 21,695,676 13,842,286 63.80% 49.48% 14.33% HL-Day10 22,220,335 14,228,627 64.04% 49.84% 14.20% HL-Day1-1mM-KI 21,797,321 14,058,327 64.50% 50.85% 13.65% HL-Day5-1mM-KI 21,790,772 14,101,501 64.71% 50.76% 13.95% HL-Day10-1mM-KI 21,798,272 14,179,204 65.05% 50.76% 14.29% HL-Day1-ND 21,553,344 13,808,073 64.07% 49.68% 14.39% HL-Day5-ND 21,555,210 13,885,533 64.42% 49.80% 14.62% HL-Day1-NR 21,552,977 14,114,746 65.49% 50.51% 14.97% HL-Day5-NR 21,066,896 12,989,057 61.67% 47.99% 13.67% HL-Day1-1mM-FDP 21,622,648 14,804,140 68.47% 51.74% 16.73% HL-Day5-1mM-FDP 21,619,961 14,689,178 67.94% 52.05% 15.89% HL-Day1-5mM-FDP 21,594,611 14,747,192 68.29% 51.20% 17.10% HL-Day5-5mM-FDP 21,618,925 14,368,473 66.46% 51.19% 15.27% HL-Day2-NaCl-CMP 21,574,148 14,080,241 65.27% 50.07 15.20% HL-60h-Mutant 22,304,121 13,065,487 58.36% 45.22% 13.15% HL-Day3-Xan 21,370,474 14,101,627 66.60% 50.24% 16.36% HL-Day3-Tau 21,630,082 13,529,582 62.55% 47.9 14.65% HL-Day5-Uri 21,600,247 13,470,299 62.36% 48.26% 14.10% HL-Day5-Gua 21,552,594 13,562,964 62.93% 48.73% 14.20% [126]Open in a new tab Fig. 3. [127]Fig. 3 [128]Open in a new tab Summary of the sample clustering. (a) Gene expression levels of the 96 samples. (b) The person correlation between the 96 samples. The Roman numerals at the end of the sample name are the batch of the sample. Transcriptome data analysis Pearson’s correlation coefficient was employed to cluster the samples and generate a sample clustering tree. By setting the height parameter to 1,500,000, one outlier (HL-Day3-Xan-2) was identified and subsequently removed, while the remaining samples were retained for further analysis (Fig. [129]4a). Using the pickSoftThreshold function, an optimal soft threshold of 7 was determined to ensure that the scale-free topology index (R²) exceeded 0.9 and the mean connectivity was less than 100 (Fig. [130]4b), indicating that the network exhibits a high degree of randomness. To verify whether the network constructed at this soft threshold met the scale-free topology, the softConnectivity function was used for validation. The log[10](k) of the number of connections for genes and the log[10](p(k)) of the occurrence probability of such nodes showed a negative correlation, with R² = 0.88 (Fig. [131]4c). The larger the R² value, the closer the network is to the characteristics of a scale-free network^[132]61. These results indicated that the selected soft threshold of 7 can be used to construct a scale-free network. Next, the dynamic tree-cutting algorithm was applied for module identification, grouping genes with similar expression patterns into distinct clusters (Fig. [133]4d). This analysis identified 39 modules, each represented by a different colour (Fig. [134]4e). The module with the fewest genes, firebrick3, contained 31 genes, while the largest module, bisque4, comprised 5,312 genes, and the average number of genes per module was 732. To gain insights into the main functions of genes in each module, KEGG pathway enrichment analysis on these genes were performed, and the results were presented in Supplementary Table [135]S4. The interactive relationships among all modules and genes were visualized using a heatmap based on the Topological Overlap Matrix (TOM), which confirmed the robustness of the module delineation. The analysis revealed strong correlations within modules, demonstrating that the modules are interrelated rather than independent (Fig. [136]4f). Furthermore, the gene expression patterns within each module were visualized (Fig. [137]5a). Subsequently, the correlations between modules were analysed to reveal the connections among them. The dynamic expression patterns varied significantly across different modules. For instance, the β-carotene ketolase gene (BKT3), a rate limiting gene for astaxanthin biosynthesis^[138]62, was assigned to the salmon4 module, and its expression pattern (red line in Fig. [139]5a) closely mirrored the overall expression trend of the salmon4 module (salmon4 line in Fig. [140]5a). The genes in salmon4 module were significantly enriched in 24 KEGG pathways (Supplementary Table [141]S4, salmon4 sheet), including “pyruvate metabolism”, “fatty acid biosynthesis”, “carotenoid biosynthesis”, “terpenoid backbone biosynthesis”, which were involved in the biosynthesis of astaxanthin and astaxanthin fatty acid esters in H. pluvialis^[142]20,[143]63. The genes in modules that are significantly and positively correlated with salmon4 module, such as coral module (r = 0.61, p-value < 1.0e-10), and ivory module (r = 0.44, p-value < 8.9e-6), may also be associated with astaxanthin biosynthesis in H. pluvialis. Fig. 4. [144]Fig. 4 [145]Open in a new tab Results of weighted gene co-expression network analysis (WGCNA). (a) Clustering dendrogram of 96 samples. (b) Analysis of the scale-free index and mean connectivity for various soft-threshold powers (β). (c) Checking the scale free topology when β = 7; On this plot, the distribution approximately follows an approximately straight line, termed approximately scale-free topology. (d) Hierarchical cluster analysis was conducted to detect co-expression clusters with corresponding colour assignments. Each colour represents a module in the constructed gene co-expression network by WGCNA. (e) The number of genes in each module. (f) Heatmap depicts the Topological Overlap Matrix (TOM) of genes selected for weighted co-expression network analysis. Light colour represents low overlap and dark represents high overlap. Fig. 5. [146]Fig. 5 [147]Open in a new tab Gene module expression and correlation analysis. (a) Visual module expression profile. The red line shows the expression level of β-carotene ketolase, a gene within the salmon4 module, under different treatments. The salmon4 line represents the overall expression trend of all genes in the salmon4 module across these treatments. Additionally, the palevioletred2 line represents a module with an expression trend opposite to that of the salmon4 module. (b) Heatmap of correlation coefficients and p values between modules. Blue colour indicates positive correlation, while red colour indicates negative correlation. The numbers represent the correlation values. Asterisks (*) denote statistically significant correlations (p < 0.05), with the number of asterisks indicating the level of significance. Mannose-1-phosphate guanylyltransferase gene (GMPP), mannosyltransferase genes (OCH1 and MNNs), and mannan synthase genes were involved in SCW biosynthesis in H. pluvialis, most of which were found in bisque4 module. The genes in bisque4 module were significantly enriched in nine pathways including “galactose metabolism”, “mannose type O-glycan biosynthesis”, “fructose and mannose metabolism”, “fatty acid elongation”, “other glycan degradation”, and “glycosylphosphatidylinositol (GPI)-anchor biosynthesis”, which were involved in SCW biosynthesis in H. pluvialis^[148]13,[149]23. The genes in modules that significantly and positively correlated with bisque4 module, such as cyan module (r = 0.85, p-value < 1.0e-10), darkturquoise module (r = 0.81, p-value < 1.0e-10), and lightsteelblue module (r = 0.71, p-value < 1.0e-10), which may also be associated with SCW biosynthesis in H. pluvialis. Under favourable environmental conditions, H. pluvialis exhibits rapid vegetative growth without astaxanthin accumulation or SCW synthesis. Conversely, environmental stress triggers growth cessation accompanied by simultaneous astaxanthin biosynthesis and SCW formation. However, our experimental data revealed a decoupling phenomenon between astaxanthin accumulation and SCW synthesis, as evidenced by multiple samples demonstrating astaxanthin production without concomitant SCW formation (Supplementary Table [150]S1). This observation suggests that modules showing negative correlations with either the bisque4 or salmon4 modules may be directly associated with growth regulation. Specifically, the bisque4 module demonstrated significant negative correlations with seven modules (antiquewhite2, brown, coral2, greenyellow, mediumpurple1, mediumpurple3, and palevioletred2), while the salmon4 module showed negative correlations with four modules (coral2, grey60, mediumpurple1, and palevioletred2) (Supplementary Table [151]S5 and Fig. [152]5b). Notably, the palevioletred2 and coral2 modules displayed expression patterns inversely correlated with both salmon4 and bisque4 modules. Functional enrichment analysis revealed that these two modules shared significant association with the “purine metabolism” pathway. This finding aligns with previous research demonstrating that enhanced purine metabolism inhibits both SCW biosynthesis and astaxanthin accumulation while promoting cellular proliferation in H. pluvialis under high-light stress conditions^[153]63. These results highlight the functional specialization of distinct gene modules in regulating critical biological processes. The identified module-pathway relationships provide mechanistic insights into the coordinated regulation of growth dynamics, astaxanthin biosynthesis, and SCW formation in H. pluvialis, establishing a foundation for further investigation of stress-responsive metabolic regulation in microalgae. Gene function annotation and Transcription factors identification In the annotation file of the reference genome utilized in this study (GenBank accession: GCA_011766145.1), 8,903 genes (31.48%) were annotated with specific functions, while 19,376 genes (68.52%) were not characterized. An extensive annotation analysis was conducted on previously uncharacterized genes by utilizing comparisons against public databases. As a result, a significant number of these genes were successfully annotated: 4,012 by TrEMBL, 4,306 by eggNOG, 2,322 by SwissProt, 5,314 by GhostKOALA, and 4,426 by InterProScan, totalling 8,598 genes (Fig. [154]6a). This comprehensive annotation has shed light on a substantial portion of the previously unknown gene functions. To gain a deeper insight into the roles of these newly annotated genes, they were categorized into KEGG pathways (Supplementary Table [155]S6) and GO annotation classification (Supplementary Table [156]S7). The KEGG pathway classification analysis revealed that “Metabolism” was the largest category (1,354 genes), followed by “Genetic Information Processing” (1,106 genes), “Environmental Information Processing” (169 genes), “Cellular Processes” (164 genes), and “Organic Systems” (92 genes) (Fig. [157]6b). In the second-level of classification, 18 subclass pathways were identified, including translation process, folding, sorting and degradation process, and carbohydrate metabolism (Fig. [158]6b). The GO classification analysis revealed these newly annotated genes were distributed in 101 GO terms of molecular function, 42 GO terms of cellular components, and 155 GO terms of biological process. The top 5 GO terms of each class with the highest number of genes were presented in Fig. [159]6c. In addition, 245 genes were annotated as transcription factors belonging to 45 transcription factor families. The main transcription factor families identified were GNAT (33 genes), SNF2 (28 genes), MYB-related (14 genes), SET (20 genes), C2H2 (13 genes), etc. and the details are provided in Supplementary Table [160]S8. Gene functional annotation and the prediction of transcription factor are pivotal in elucidating the molecular mechanisms that govern cell growth, astaxanthin accumulation, and the biosynthesis of SCW in H. pluvialis. Fig. 6. [161]Fig. 6 [162]Open in a new tab Functional annotation of the 8,598 newly annotated genes. (a) Venn diagram of gene annotation overlaps across for five major databases. (b) KEGG classification. (c) GO classification. Supplementary information [163]Table S1^ (63.6KB, docx) [164]Table S2^ (24.9MB, xlsx) [165]Table S3^ (3.7MB, xlsx) [166]Table S4^ (221.9KB, xlsx) [167]Table S5^ (42.5KB, xlsx) [168]Table S6^ (34.6KB, xlsx) [169]Table S7^ (85.1KB, xlsx) [170]Table S8^ (13.6KB, xlsx) [171]R for WGCNA.R^ (5.6KB, txt) Acknowledgements