Abstract

   Despite responses to salinity stress in Dunaliella salina, a
   unicellular halotolerant green alga, being subject to extensive study,
   but the underlying molecular mechanism remains unknown. Here, Empirical
   Bayes method was applied to identify the common differentially
   expressed genes (DEGs) between hypersaline and normal conditions. Then,
   using weighted gene co-expression network analysis (WGCNA), which takes
   advantage of a graph theoretical approach, highly correlated genes were
   clustered as a module. Subsequently, connectivity patterns of the
   identified modules in two conditions were surveyed to define preserved
   and non-preserved modules by combining the Zsummary and medianRank
   measures. Finally, common and specific hub genes in non-preserved
   modules were determined using Eigengene-based module connectivity or
   module membership (k[ME]) measures and validation was performed by
   using leave-one-out cross-validation (LOOCV). In this study, the power
   of beta = 12 (scale-free R2 = 0.8) was selected as the
   soft-thresholding to ensure a scale-free network, which led to the
   identification of 15 co-expression modules. Results also indicate that
   green, blue, brown, and yellow modules are non-preserved in salinity
   stress conditions. Examples of enriched Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathways in non-preserved modules are Sulfur metabolism,
   Oxidative phosphorylation, Porphyrin and chlorophyll metabolism,
   Vitamin B6 metabolism. Moreover, the systems biology approach was
   applied here, proposed some salinity specific hub genes, such as
   radical-induced cell death1 protein (RCD1), mitogen-activated protein
   kinase kinase kinase 13 (MAP3K13), long-chain acyl-CoA synthetase
   (ACSL), acetyl-CoA carboxylase, biotin carboxylase subunit (AccC), and
   fructose-bisphosphate aldolase (ALDO), for the development of
   metabolites accumulating strains in D. salina.

   Subject terms: Systems biology, Regulatory networks, Computational
   biology and bioinformatics, Gene regulatory networks

Introduction

   Dunaliella salina is a unicellular halotolerant green alga that can
   survive in saturated brine (up to 5.5 M NaCl)^[26]1. This feature makes
   it an interesting model organism for studying salt tolerance. On the
   other hand, as the main producer of carotenoids and lipids, D. salina
   is widely used for food and drug industries^[27]2.

   Modifying the environmental circumstances, especially salinity
   concentration, are among the most effective approach put forth to
   enhance different metabolites accumulation in these microalgae.
   However, the biomass productivity of D. salina is retarded in a
   hypersaline conditions^[28]3, making it difficult to improve metabolite
   production in large scales.

   Exploration of salt stress-responding mechanisms is an inevitable step
   in resolving these problems. The contribution of ROS and calcium
   signaling pathway in response to salinity condition has been reported
   previously^[29]4. Expressed sequence tag (EST) profiling of D. salina
   in hypersaline condition also identified 1401 unique responsive
   transcripts were contributed in protein synthesis, energy, primary
   metabolism, and protein fate^[30]5. Moreover, a recent global
   transcriptome sequencing showed that enhancements of photosynthesis and
   biosynthesis of porphyrins, as well as degradation of starch, synthesis
   of glycerol, membrane lipid desaturation, are taken part in D.salina
   responses to hypersaline conditions^[31]6. Hovewer, due to the
   complexity of salt stresses responding processes in microalgae,
   underlying molecular mechanisms for salt stress response in microalgae
   remain a daunting challenge^[32]7.

   There is a lack of integrative investigation on transcriptomes data of
   D.salina under various salinity stress pressures over ranges of
   treatment periods. Moreover, all of the previous researches have solely
   focused on differentially expressed genes identification, wherease
   connectivity analysis has not yet been considered. In contrast to
   focusing on differentially expressed genes, co-expression module-based
   network analysis provides new insight into the role of different genes
   associated with a specific condition, which cannot be detected by
   standard transcriptome and network analysis^[33]4. Meanwhile, in these
   networks, hubs can represent essential genes that may indeed contribute
   to a specific phenotype^[34]8. This powerful approach has been widely
   used on a range of systems mainly in plant species, including
   Barley^[35]9, Wheat^[36]10, Rice^[37]11, elucidating the comprehensive
   picture of stress responses. More recently, the efficiency of this
   approach in identifying groups of expressed genes and highly connected
   hubs which contributed in secondary metabolites accumulation in
   microalgae Auxenochlorella protothecoides has been confirmed^[38]12.

   The current study focused on co-expression network construction using
   weighted gene co-expression network analysis (WGCNA) in combination
   with the identification of hub genes in respective co-expressed modules
   of salt responsive genes in D. salina.

Materials and methods

Eligible RNA seq data collection

   A search for RNA seq datasets was performed on NCBI Sequence Read
   Archive (SRA) database using the following keywords: microalgae,
   Dunaliella, salina, salt, saline, osmotic, stress. Finally, four
   independent studies that surveyed global transcriptome profiling in D.
   salina at salt stress conditions were selected and raw data were
   retrieved in fastq format. The first dataset (SRP134914) contains
   twelve biological samples that were grown in sterile ATCC-1174 DA
   medium supplemented with NaCl, sorbitol, and H[2]O[2]. We only included
   the control and NaCl treated samples of this dataset. This dataset was
   generated by deep sequencing, in triplicate, using Illumina NextSeq500
   platform. The second dataset (SRP149387) contains nine samples were
   generated in triplicate, using Illumina HiSeq 4000 platform. The third
   dataset (SRP184449) contains twelve samples which were generated using
   Illumina HiSeq 2000 platform (Table [39]1).

Table 1.

   Details of datasets and treatment conditions with hypersalinity which
   were used in this study.
   Data set ID Salinity condition Sampling time points
   SRP134914   NaCl (2 M)         24 h after treatment
   SRP149387   NaCl (2.5 M)       6, 12, and 24 h after treatment
   SRP184449   NaCl (2.5 M)       0.5, 1 and 2 h after treatment
   [40]Open in a new tab

Pre-processing and differential expression analysis

   Quality control of raw data sets was performed with FastQC (v 0.11.5)
   software^[41]13. Adaptor sequence and low-quality reads with Phred
   score < 30 were trimmed by using Trimmomatic (v0.32) software^[42]14.
   The processed reads were subjected to de novo assembly using Trinity
   (v2.4.0) software^[43]15 using default parameters. Protein orthology
   was determined using Blastx (cutoff value of 6) against C. reinhardtii
   and D. salina proteins ([44]https://phytozome.jgi.doe.gov/) as
   described by Dums et al.^[45]16. The best hits were extracted with
   in-house python scripts (Supplementary File [46]S1). Filtered reads
   were aligned to the de novo assembled transcripts using
   align_and_estimate_abundance Perl script implemented in RSEM (v1.3.1)
   software^[47]17. Gene counts were then subjected to Bioconductor DESeq2
   package version 1.10.1^[48]18 to identify differentially expressed
   genes (DEGs). Comparisons were done using Wald’s test to determine the
   log2-fold change. To overcome the inconsistency of results of different
   studies and stabilize the genes’ expression ratios, batch effect
   correction by using the Empirical Bayes method was performed^[49]19.
   This correction enables direct comparisons of expression profiles
   between biological groups from independent experiments^[50]20–[51]22.
   Moreover, genes with low CV less than 10% were filtered out. Finally,
   common DEGs between four datasets with a threshold |> 1.0| and adjusted
   p-value < 0.05 were selected to further analysis.

Weighted gene co-expression network analysis (WGCNA)

   Co-expression networks were constructed using the WGCNA algorithm
   implemented in R WGCNA package^[52]23. To import selected DEGs in WGCNA
   package, raw expression values of selected DEGs were normalized with
   variance StabilizingTransformation (vst) function in R software. Then,
   a similarity co-expression matrix was calculated with Pearson's
   correlation
   [MATH: <mrow><mi mathvariant="normal">Cor</mi><mfenced close=")"
   open="("><mi mathvariant="normal">i</mi><mo>,</mo><mi
   mathvariant="normal">j</mi></mfenced></mrow> :MATH]
   for all common DEGs. The similarity matrix was transformed into an
   adjacency matrix (AM) by using the following equation
   [MATH: <mrow><msub><mi>a</mi><mrow><mi
   mathvariant="italic">ij</mi></mrow></msub><mo>=</mo><msup><mfenced
   close=")"
   open="("><mrow><mn>0.5</mn><mrow><mrow></mrow><mo>∗</mo></mrow><mfenced
   close=")" open="("><mrow><mn>1</mn><mo>+</mo><mtext>cor</mtext><mfenced
   close=")"
   open="("><mrow><mtext>i</mtext><mo>,</mo><mtext>j</mtext></mrow></mfenc
   ed></mrow></mfenced></mrow></mfenced><mi>β</mi></msup></mrow> :MATH]

   where
   [MATH: <msub><mi>a</mi><mrow><mi
   mathvariant="italic">ij</mi></mrow></msub> :MATH]
   denotes the adjacencies between DEGs as a connection strengths index.

   The soft-thresholding power beta of the co-expression network was
   chosen by the criterion of scale-free topology with R2 cutoff (0.8).
   Finally, the adjacency was transformed into a topological overlap
   matrix (TOM) and corresponding dissimilarity matrix
   [MATH: <mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><mi>T</mi><mi>O</mi><mi>M</m
   i><mo stretchy="false">)</mo></mrow> :MATH]
   using the following formula
   [MATH: <mrow><msub><mrow><mi
   mathvariant="italic">TOM</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi
   ></mrow></msub><mo>=</mo><mfrac><mrow><msub><mo>∑</mo><mi>u</mi></msub>
   <msub><mi>a</mi><mrow><mi
   mathvariant="italic">iu</mi></mrow></msub><msub><mi>a</mi><mrow><mi
   mathvariant="italic">uj</mi></mrow></msub><mo>+</mo><msub><mi>a</mi><mr
   ow><mi mathvariant="italic">ij</mi></mrow></msub></mrow><mrow><mi
   mathvariant="normal">min</mi><mfenced close=")"
   open="("><msub><mi>k</mi><mi>i</mi></msub><mo>,</mo><msub><mi>k</mi><mi
   >j</mi></msub></mfenced><mo>+</mo><mn>1</mn><mo>-</mo><msub><mi>a</mi><
   mrow><mi
   mathvariant="italic">ij</mi></mrow></msub></mrow></mfrac><mo>,</mo><msu
   b><mi>K</mi><mi>i</mi></msub><mo>=</mo><msub><mo>∑</mo><mi>u</mi></msub
   ><msub><mi>a</mi><mrow><mi
   mathvariant="italic">iu</mi></mrow></msub></mrow> :MATH]

   where, row index u
   [MATH: <mrow><mo
   stretchy="false">(</mo><mi>u</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>⋯</m
   o><mo>,</mo><mi>m</mi><mo stretchy="false">)</mo></mrow> :MATH]
   represents sample measurements.

   To obtain co-expressed modules, the parameters were adjusted to
   minModuleSize = 20 and minimum height = 0.2 to cut the tree.

Network preservation analysis

   ModulePreservation function implemented in WGCNA Bioconductor R package
   was applied to survey preservation levels of control network modules in
   the salinity coexpressed modules based on the combination of two
   preservation statistics including medianRank and Zsummary. Zsummary
   combines multiple statistics into a single overall measure of
   preservation that considers density and connectivity aspects of
   preservation using the following formula
   [MATH: <mrow><msub><mi>Z</mi><mrow><mi
   mathvariant="italic">summary</mi></mrow></msub><mo>=</mo><mfrac><mrow><
   msub><mi>Z</mi><mrow><mi
   mathvariant="italic">density</mi></mrow></msub><mo>+</mo><msub><mi>Z</m
   i><mrow><mi
   mathvariant="italic">connectivity</mi></mrow></msub></mrow><mn>2</mn></
   mfrac></mrow> :MATH]

   The higher value of a Zsummary indicates the strong preservation in
   control and treatment conditions. However, the dependency of Zsummary
   to module size is a challenge, especially when modules with different
   sizes must be compared. medianRank as a module size independent index
   is another statistic to test the preservation level. The lower value of
   a medianRank indicates the strong preservation in control and treatment
   conditions^[53]23. Statistical significance of both indexes was tested
   using permutation testing (here we applied 200 permutations). As
   prescribed in original reports Zsummary and medianRank were combined
   and Zsummary < 5 or medianRank < 8 were considered as criteria for
   considering a module as a non-preserved module^[54]8,[55]23.

Identification and validation of hub genes

   Hub genes in each co-expressed module were defined according to
   Eigengene-based module connectivity or module membership (k[ME]) index
   in non-preserved modules. To determine the k[ME], the correlation of
   expression value of a gene and eigengene of the module were estimated.
   This index measures the closeness of a gene in a given module. Genes
   with |k[ME]| ≥ 0.7 were considered as hub genes in the respective
   module^[56]23. Validation of hubs was performed by using leave-one-out
   cross-validation (LOOCV) as prescribed in^[57]12.

Statistical analysis and functional enrichment

   Kyoto encyclopedia of genes and genome (KEGG) Pathway enrichment was
   performed using Algal Functional Annotation Tool (available
   [58]http://pathways.mcdb.ucla.edu/algal/index.html)^[59]24 by setting
   P-value < 0.05 as a cut-off criterion with Dunaliella salina and
   Chlamydomonas reinhardtii genomes as references.