Abstract Despite responses to salinity stress in Dunaliella salina, a unicellular halotolerant green alga, being subject to extensive study, but the underlying molecular mechanism remains unknown. Here, Empirical Bayes method was applied to identify the common differentially expressed genes (DEGs) between hypersaline and normal conditions. Then, using weighted gene co-expression network analysis (WGCNA), which takes advantage of a graph theoretical approach, highly correlated genes were clustered as a module. Subsequently, connectivity patterns of the identified modules in two conditions were surveyed to define preserved and non-preserved modules by combining the Zsummary and medianRank measures. Finally, common and specific hub genes in non-preserved modules were determined using Eigengene-based module connectivity or module membership (k[ME]) measures and validation was performed by using leave-one-out cross-validation (LOOCV). In this study, the power of beta = 12 (scale-free R2 = 0.8) was selected as the soft-thresholding to ensure a scale-free network, which led to the identification of 15 co-expression modules. Results also indicate that green, blue, brown, and yellow modules are non-preserved in salinity stress conditions. Examples of enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways in non-preserved modules are Sulfur metabolism, Oxidative phosphorylation, Porphyrin and chlorophyll metabolism, Vitamin B6 metabolism. Moreover, the systems biology approach was applied here, proposed some salinity specific hub genes, such as radical-induced cell death1 protein (RCD1), mitogen-activated protein kinase kinase kinase 13 (MAP3K13), long-chain acyl-CoA synthetase (ACSL), acetyl-CoA carboxylase, biotin carboxylase subunit (AccC), and fructose-bisphosphate aldolase (ALDO), for the development of metabolites accumulating strains in D. salina. Subject terms: Systems biology, Regulatory networks, Computational biology and bioinformatics, Gene regulatory networks Introduction Dunaliella salina is a unicellular halotolerant green alga that can survive in saturated brine (up to 5.5 M NaCl)^[26]1. This feature makes it an interesting model organism for studying salt tolerance. On the other hand, as the main producer of carotenoids and lipids, D. salina is widely used for food and drug industries^[27]2. Modifying the environmental circumstances, especially salinity concentration, are among the most effective approach put forth to enhance different metabolites accumulation in these microalgae. However, the biomass productivity of D. salina is retarded in a hypersaline conditions^[28]3, making it difficult to improve metabolite production in large scales. Exploration of salt stress-responding mechanisms is an inevitable step in resolving these problems. The contribution of ROS and calcium signaling pathway in response to salinity condition has been reported previously^[29]4. Expressed sequence tag (EST) profiling of D. salina in hypersaline condition also identified 1401 unique responsive transcripts were contributed in protein synthesis, energy, primary metabolism, and protein fate^[30]5. Moreover, a recent global transcriptome sequencing showed that enhancements of photosynthesis and biosynthesis of porphyrins, as well as degradation of starch, synthesis of glycerol, membrane lipid desaturation, are taken part in D.salina responses to hypersaline conditions^[31]6. Hovewer, due to the complexity of salt stresses responding processes in microalgae, underlying molecular mechanisms for salt stress response in microalgae remain a daunting challenge^[32]7. There is a lack of integrative investigation on transcriptomes data of D.salina under various salinity stress pressures over ranges of treatment periods. Moreover, all of the previous researches have solely focused on differentially expressed genes identification, wherease connectivity analysis has not yet been considered. In contrast to focusing on differentially expressed genes, co-expression module-based network analysis provides new insight into the role of different genes associated with a specific condition, which cannot be detected by standard transcriptome and network analysis^[33]4. Meanwhile, in these networks, hubs can represent essential genes that may indeed contribute to a specific phenotype^[34]8. This powerful approach has been widely used on a range of systems mainly in plant species, including Barley^[35]9, Wheat^[36]10, Rice^[37]11, elucidating the comprehensive picture of stress responses. More recently, the efficiency of this approach in identifying groups of expressed genes and highly connected hubs which contributed in secondary metabolites accumulation in microalgae Auxenochlorella protothecoides has been confirmed^[38]12. The current study focused on co-expression network construction using weighted gene co-expression network analysis (WGCNA) in combination with the identification of hub genes in respective co-expressed modules of salt responsive genes in D. salina. Materials and methods Eligible RNA seq data collection A search for RNA seq datasets was performed on NCBI Sequence Read Archive (SRA) database using the following keywords: microalgae, Dunaliella, salina, salt, saline, osmotic, stress. Finally, four independent studies that surveyed global transcriptome profiling in D. salina at salt stress conditions were selected and raw data were retrieved in fastq format. The first dataset (SRP134914) contains twelve biological samples that were grown in sterile ATCC-1174 DA medium supplemented with NaCl, sorbitol, and H[2]O[2]. We only included the control and NaCl treated samples of this dataset. This dataset was generated by deep sequencing, in triplicate, using Illumina NextSeq500 platform. The second dataset (SRP149387) contains nine samples were generated in triplicate, using Illumina HiSeq 4000 platform. The third dataset (SRP184449) contains twelve samples which were generated using Illumina HiSeq 2000 platform (Table [39]1). Table 1. Details of datasets and treatment conditions with hypersalinity which were used in this study. Data set ID Salinity condition Sampling time points SRP134914 NaCl (2 M) 24 h after treatment SRP149387 NaCl (2.5 M) 6, 12, and 24 h after treatment SRP184449 NaCl (2.5 M) 0.5, 1 and 2 h after treatment [40]Open in a new tab Pre-processing and differential expression analysis Quality control of raw data sets was performed with FastQC (v 0.11.5) software^[41]13. Adaptor sequence and low-quality reads with Phred score < 30 were trimmed by using Trimmomatic (v0.32) software^[42]14. The processed reads were subjected to de novo assembly using Trinity (v2.4.0) software^[43]15 using default parameters. Protein orthology was determined using Blastx (cutoff value of 6) against C. reinhardtii and D. salina proteins ([44]https://phytozome.jgi.doe.gov/) as described by Dums et al.^[45]16. The best hits were extracted with in-house python scripts (Supplementary File [46]S1). Filtered reads were aligned to the de novo assembled transcripts using align_and_estimate_abundance Perl script implemented in RSEM (v1.3.1) software^[47]17. Gene counts were then subjected to Bioconductor DESeq2 package version 1.10.1^[48]18 to identify differentially expressed genes (DEGs). Comparisons were done using Wald’s test to determine the log2-fold change. To overcome the inconsistency of results of different studies and stabilize the genes’ expression ratios, batch effect correction by using the Empirical Bayes method was performed^[49]19. This correction enables direct comparisons of expression profiles between biological groups from independent experiments^[50]20–[51]22. Moreover, genes with low CV less than 10% were filtered out. Finally, common DEGs between four datasets with a threshold |> 1.0| and adjusted p-value < 0.05 were selected to further analysis. Weighted gene co-expression network analysis (WGCNA) Co-expression networks were constructed using the WGCNA algorithm implemented in R WGCNA package^[52]23. To import selected DEGs in WGCNA package, raw expression values of selected DEGs were normalized with variance StabilizingTransformation (vst) function in R software. Then, a similarity co-expression matrix was calculated with Pearson's correlation [MATH: Cori,j :MATH] for all common DEGs. The similarity matrix was transformed into an adjacency matrix (AM) by using the following equation [MATH: aij=0.51+cori,jβ :MATH] where [MATH: aij :MATH] denotes the adjacencies between DEGs as a connection strengths index. The soft-thresholding power beta of the co-expression network was chosen by the criterion of scale-free topology with R2 cutoff (0.8). Finally, the adjacency was transformed into a topological overlap matrix (TOM) and corresponding dissimilarity matrix [MATH: (1-TOM) :MATH] using the following formula [MATH: TOMi,j=u aiuauj+aijminki,kj+1-a< mrow>ij,Ki=uaiu :MATH] where, row index u [MATH: (u=1,,m) :MATH] represents sample measurements. To obtain co-expressed modules, the parameters were adjusted to minModuleSize = 20 and minimum height = 0.2 to cut the tree. Network preservation analysis ModulePreservation function implemented in WGCNA Bioconductor R package was applied to survey preservation levels of control network modules in the salinity coexpressed modules based on the combination of two preservation statistics including medianRank and Zsummary. Zsummary combines multiple statistics into a single overall measure of preservation that considers density and connectivity aspects of preservation using the following formula [MATH: Zsummary=< msub>Zdensity+Zconnectivity2 :MATH] The higher value of a Zsummary indicates the strong preservation in control and treatment conditions. However, the dependency of Zsummary to module size is a challenge, especially when modules with different sizes must be compared. medianRank as a module size independent index is another statistic to test the preservation level. The lower value of a medianRank indicates the strong preservation in control and treatment conditions^[53]23. Statistical significance of both indexes was tested using permutation testing (here we applied 200 permutations). As prescribed in original reports Zsummary and medianRank were combined and Zsummary < 5 or medianRank < 8 were considered as criteria for considering a module as a non-preserved module^[54]8,[55]23. Identification and validation of hub genes Hub genes in each co-expressed module were defined according to Eigengene-based module connectivity or module membership (k[ME]) index in non-preserved modules. To determine the k[ME], the correlation of expression value of a gene and eigengene of the module were estimated. This index measures the closeness of a gene in a given module. Genes with |k[ME]| ≥ 0.7 were considered as hub genes in the respective module^[56]23. Validation of hubs was performed by using leave-one-out cross-validation (LOOCV) as prescribed in^[57]12. Statistical analysis and functional enrichment Kyoto encyclopedia of genes and genome (KEGG) Pathway enrichment was performed using Algal Functional Annotation Tool (available [58]http://pathways.mcdb.ucla.edu/algal/index.html)^[59]24 by setting P-value < 0.05 as a cut-off criterion with Dunaliella salina and Chlamydomonas reinhardtii genomes as references.