Abstract

   Although recognized as a promising microbial cell factory for producing
   biofuels, current productivity in cyanobacterial systems is low. To
   make the processes economically feasible, one of the hurdles, which
   need to be overcome is the low tolerance of hosts to toxic biofuels.
   Meanwhile, little information is available regarding the cellular
   responses to biofuels stress in cyanobacteria, which makes it
   challenging for tolerance engineering. Using large proteomic datasets
   of Synechocystis under various biofuels stress and environmental
   perturbation, a protein co-expression network was first constructed and
   then combined with the experimentally determined protein–protein
   interaction network. Proteins with statistically higher topological
   overlap in the integrated network were identified as common responsive
   proteins to both biofuels stress and environmental perturbations. In
   addition, a weighted gene co-expression network analysis was performed
   to distinguish unique responses to biofuels from those to environmental
   perturbations and to uncover metabolic modules and proteins uniquely
   associated with biofuels stress. The results showed that
   biofuel-specific proteins and modules were enriched in several
   functional categories, including photosynthesis, carbon fixation, and
   amino acid metabolism, which may represent potential key signatures for
   biofuels stress responses in Synechocystis. Network-based analysis
   allowed determination of the responses specifically related to biofuels
   stress, and the results constituted an important knowledge foundation
   for tolerance engineering against biofuels in Synechocystis.

   Keywords: biofuels, network, WGCNA, tolerance, Synechocystis

Introduction

   Human society has been dependent on fossil fuels for centuries.
   However, fossil fuels are not an infinite resource, and the possibility
   of their running out in the future and the increasing concerns over
   energy security and global climate change pose an urgent call for
   developing renewable ways to produce fuels. Among all alternatives,
   photosynthetic cyanobacteria have recently attracted significant
   attention as a promising “microbial cell factory” to produce renewable
   biofuels due to their capability to utilize solar energy and CO[2] as
   the sole energy and carbon sources, respectively (Ducat et al.,
   [31]2011; Quintana et al., [32]2011; Robertson et al., [33]2011).
   Cyanobacteria contain considerable amounts of lipids in the thylakoid
   membranes and possess higher photosynthetic efficiency and faster
   growth rate compared to eukaryotic green algae and higher plants
   (Quintana et al., [34]2011). In addition, cyanobacteria have a
   relatively simple genetic background and are amenable to modification
   by metabolic engineering and synthetic biology (Wang et al.,
   [35]2012a). Recent efforts have led to successful production of various
   biofuels in engineered cyanobacterial cells, such as ethanol (Deng and
   Coleman, [36]1999), butanol and isobutanol (Atsumi et al., [37]2009),
   alkanes (Choi and Lee, [38]2013), and biodiesel (Da Ros et al.,
   [39]2013). However, the current biofuel productivity in the
   cyanobacterial systems is several orders of magnitude lower than their
   native producing microbes (Jin et al., [40]2014). In addition to
   ongoing efforts to optimize the existing pathways and to discover and
   construct novel pathways, one option to achieve high productivity is to
   improve cellular tolerance to toxic biofuel products synthesized by the
   cyanobacterial hosts (Dunlop, [41]2011; Zingaro and Papoutsakis,
   [42]2012).

   Although response mechanisms against biofuels have been extensively
   studied in many native biofuel-producing microbes (Couto et al.,
   [43]1997; Dunlop, [44]2011), it remains unclear for cyanobacteria. As
   part of our long-term goal to construct more robust and
   product-tolerant photosynthetic “chassis” for synthesizing various
   renewable biofuels, our laboratory has applied integrated
   transcriptomic, proteomic, and metabolomic approaches to determine the
   metabolic profiles of a model cyanobacterium Synechocystis sp. PCC 6803
   (hereafter Synechocystis) stressed under various biofuels (Liu et al.,
   [45]2012; Qiao et al., [46]2012; Wang et al., [47]2012b; Tian et al.,
   [48]2013; Zhu et al., [49]2013). Consistent with early genome-level
   studies in other microbes (Nicolaou et al., [50]2010; Dunlop,
   [51]2011), our previous results showed that Synechocystis cells
   employed a combination of multiple resistance mechanisms in dealing
   with biofuels stress (Wang et al., [52]2012b). In addition, the
   comparative proteomic analysis provided strong evidence that proteins
   involved in multiple aspects of photosynthesis (i.e., photosystems I
   and II, cytochrome, and ferredoxin) were up-regulated in
   ethanol-treated Synechocystis (Qiao et al., [53]2012), suggesting there
   could be unique response mechanisms employed by cyanobacteria to combat
   biofuel toxicity.

   Although initial efforts using a conventional approach of analyzing
   individual genes/proteins according to fold change and statistical
   significance has led to determination of the responses associated with
   each of the biofuels in Synechocystis (Liu et al., [54]2012; Qiao et
   al., [55]2012; Wang et al., [56]2012b; Tian et al., [57]2013; Zhu et
   al., [58]2013), it becomes clear that network-focused rather than
   individual gene/protein-focused methodologies would be more appropriate
   to obtain a complete picture of cellular response (Lehtinen et al.,
   [59]2013). In addition, the network analysis defines modules and their
   possible biological roles based on connectivity of proteins or genes
   rather than using any artificial cutoff, which may avoid information
   loss related to genes/proteins of low abundance or small fold changes,
   such as signal transduction genes. In recent years, network analysis
   has been applied to cyanobacterial studies. For example, Singh et al.
   ([60]2010) constructed a Bayesian network of Synechocystis using
   transcriptomic data and defined a set of genes as the core
   transcriptional response (CTR) that are commonly regulated under most
   of environmental perturbations (Singh et al., [61]2010). McDermott et
   al. ([62]2011) developed a predictive in silico model of diurnal and
   circadian behavior of Cyanothece 51142 using transcriptomic data, and
   the results showed that incorporation of network topology into the
   model could improve the ability to explain the behavior (McDermott et
   al., [63]2011). Recently, Wang et al. ([64]2013b) utilized a weighted
   gene co-expression network analysis (WGCNA) approach to establish
   transcriptional networks for four cyanobacterial species under metal
   stresses, and a further cross-species network comparison led to the
   discovery of several core response modules and genes that may be
   essential to all metal stresses, as well as species-specific hub genes
   for metal stresses (Wang et al., [65]2013b). The studies demonstrated
   that network-based analysis could be a powerful tool in deciphering
   cellular responses.

   In this study, to further identify responses specifically related to
   biofuels stress that could be used as potential targets for rational
   tolerance engineering, a topological analysis of global proteins
   co-expression network combined with protein–protein interaction (PPI)
   network was first performed to uncover a core set of proteins commonly
   responsive to both biofuels stress and environmental perturbations.
   Then, a WGCNA was applied to identify responses specifically related to
   biofuels stress. The combination of both analyses allowed the
   identification of the protein network signatures associated with
   exogenous biofuels treatments, and provided new insights into the
   molecular mechanisms against biofuels stress in Synechocystis.

Materials and Methods

Proteomic data sources

   A total of five iTRAQ LC-MS/MS datasets of Synechocystis sp. PCC 6803
   from our previous study were re-analyzed at a peptide level. Growth of
   Synechocystis under ethanol, butanol, hexane, salt stress conditions
   with dosages of 1.5% (v/v), 0.2% (v/v), 0.8% (v/v), 4% (w/v), and
   nitrogen starvation, which led to ~50% growth reduction were then
   determined. For each condition, cells were harvested at two time points
   (24 and 48 h) that were corresponding to middle-exponential and
   exponential-stationary transition phases in the growth time courses for
   proteomics analysis. Each biological replicates sample has two
   technical replicates. Due to the page limitation, for details about the
   environmental perturbation and biofuel stress experiments and original
   proteomic datasets please find from several previous publications (Liu
   et al., [66]2012; Qiao et al., [67]2012, [68]2013; Huang et al.,
   [69]2013; Tian et al., [70]2013).

Proteomic data analysis

   The mass spectroscopy analysis was performed using a AB SCIEX
   TripleTOF™ 5600 mass spectrometer (AB SCIEX, Framingham, MA, USA),
   coupled with online micro flow HPLC system (Shimadzu Co, Kyoto, Japan)
   as described previously. Genome sequence and annotation information of
   Synechocystis sp. PCC 6803 were downloaded from NCBI
   ([71]ftp://ftp.ncbi.nlm.nih.gov/genomes). The details for the
   experimental design, execution, and proteomic data analysis can be
   found in the original publications (Liu et al., [72]2012; Qiao et al.,
   [73]2012, [74]2013; Huang et al., [75]2013; Tian et al., [76]2013).

Protein co-expression network construction

   To construct the association network from proteomic data, we used a
   multi-step procedure for network construction: first, we performed a
   procedure for data normalization identical with Principal component
   analysis (PCA) (See below); second, correlation values were calculated
   between present values for all pairs of peptides. In this study, we
   used peptides rather than proteins to construct the protein
   co-expression network. One reason is lots of related peptides from the
   same protein are always observed in discordance, which may be due to
   different post-translational modifications or isoforms. Correlation is
   calculated as the Pearson correlation coefficient for all pairwise
   peptides. Third, in order to generate a reliable protein co-expression
   network, high correlation coefficients (r > 0.9) was used, where only
   gene pairs with a correlation coefficient higher than 0.9 were
   considered connected. Finally, we combined protein co-expression
   network with experimentally determined PPI network (Sato et al.,
   [77]2007). In this process, known PPI between observed proteins already
   in the co-expression network were added as new edges to the network.

Topological analysis

   Topological analysis of networks was performed using Cytoscape
   software. Bottleneck and hub proteins were defined as the top 20% of
   proteins ranked by the values of betweenness and degree centrality,
   respectively (McDermott et al., [78]2011, [79]2012). The degree and
   betweenness centrality metrics were defined according to the methods
   described by McDermott et al. ([80]2012). Briefly, degree centrality is
   a metric of the connectedness of a node, and betweenness centrality is
   a metric that measures how often paths between nodes must traverse a
   given node. Generally, degree centrality is the fraction of edges for a
   particular protein out of all possible interactions for that protein in
   the network, and betweenness is the number of shortest paths between
   all pairs of proteins in the network that pass through a specific node.

Principal components analysis

   The proteomics data were converted to a ratio versus control
   conditions. The data were then log2 scaled, and each unit reflects a
   twofold change in abundance. In order to avoid influence caused by
   missing data, peptides with any missing data in any condition were
   removed. Remaining core peptides identified in all conditions were
   subjected to PCA and partial least square-discriminant analysis
   (PLS-DA) by SIMCA-P 11.5 software. Averaging was taken for all
   technical replicates of samples, as in general good reproducibility was
   observed between replicates (Liu et al., [81]2012; Qiao et al.,
   [82]2012; Tian et al., [83]2013).

Weighted gene co-expression network analysis

   Weighted gene co-expression network analysis approach was used to
   establish a co-expression network from the LC-MS/MS proteomic data
   (Langfelder and Horvath, [84]2008). The co-expression network was
   created first by calculating weighted Pearson correlation matrices
   corresponding to peptide abundance expression, and then by following
   the standard procedure of WGCNA to create the networks. Briefly,
   weighted correlation matrices were transformed into matrices of
   connection strengths using a power function. These connection strengths
   were then used to calculate topological overlap (TO) (Langfelder and
   Horvath, [85]2008). The topological overlap matrix (TOM) is computed as
   TOM[ij] = (l[ij] + a[ij])/[min (k[i],k[j]) + 1 − a[ij]] where l[ij] is
   defined as the dot product on row i and column j in adjacency matrix
   [a] and k[i] (the connectivity) is the summation of row i in adjacency
   matrix [a] (Gibbs et al., [86]2013). Hierarchical clustering based on
   TO was used to group proteins with highly similar co-expression
   relationships into modules. Protein dendrograms were obtained by
   average linkage hierarchical clustering, while the color row underneath
   the dendrogram showed the module assignment determined by the Dynamic
   Tree Cut method. The network for each module was generated with the
   minimum spanning tree with a dissimilarity matrix from WGCNA. The
   modules with r > 0.55 and a p-value <0.1 were extracted (Wang et al.,
   [87]2013a).

Functional enrichment analysis

   Metabolic pathway enrichment analysis was conducted according to Kyoto
   Encyclopedia of Genes and Genomes (KEGG) and Cluster of Orthologous
   Groups of proteins (COG) database using the following formula:
   [MATH: <mi>p</mi><mo class="MathClass-rel">=</mo><mn>1</mn><mo
   class="MathClass-bin">−</mo><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn
   >0</mn></mrow><mrow><mi>m</mi><mo>−</mo><mn>1</mn></mrow></munderover><
   /mstyle><mfrac><mrow><mfenced separators="" open="("
   close=")"><mrow><mtable equalrows="false" columnlines="none none none
   none none none none none none" equalcolumns="false"
   class="array"><mtr><mtd class="array"
   columnalign="center"><mi>M</mi></mtd></mtr><mtr><mtd class="array"
   columnalign="center"><mi>i</mi></mtd></mtr><mtr><mtd class="array"
   columnalign="center"></mtd></mtr></mtable></mrow></mfenced><mfenced
   separators="" open="(" close=")"><mrow><mtable equalrows="false"
   columnlines="none none none none none none none none none"
   equalcolumns="false" class="array"><mtr><mtd class="array"
   columnalign="center"><mi>N</mi><mo
   class="MathClass-bin">−</mo><mi>M</mi></mtd></mtr><mtr><mtd
   class="array" columnalign="center"><mi>n</mi><mo
   class="MathClass-bin">−</mo><mi>i</mi></mtd></mtr><mtr><mtd
   class="array"
   columnalign="center"></mtd></mtr></mtable></mrow></mfenced></mrow><mrow
   ><mfenced separators="" open="(" close=")"><mrow><mtable
   equalrows="false" columnlines="none none none none none none none none
   none" equalcolumns="false" class="array"><mtr><mtd class="array"
   columnalign="center"><mi>N</mi></mtd></mtr><mtr><mtd class="array"
   columnalign="center"><mi>n</mi></mtd></mtr><mtr><mtd class="array"
   columnalign="center"></mtd></mtr></mtable></mrow></mfenced></mrow></mfr
   ac> :MATH]

   N is the number of all proteins with KEGG pathway annotation
   information, M is the number of proteins with a given KEGG pathway
   annotation, n is the number of the associated proteins with KEGG
   pathway annotation information, and m is the number of the associated
   proteins with a given KEGG pathway annotation. All pathway mapping was
   manually checked for each of the proteins. We also calculated
   functional enrichment by considering each group or module of interest
   versus all proteins in the network as a background, as the ratio of m/n
   versus M/N.

Results and Discussion

Overview of proteomics analysis

   The proteomic datasets used in this study are listed in Table [88]1.
   Briefly, the datasets contain four sets of quantitative iTRAQ–LC-MS
   analyses of Synechocystis grown under five stress conditions, i.e.,
   biofuel stresses of ethanol, butanol and hexane, and environmental
   perturbations of high salt and nitrogen starvation. For each condition,
   treated and corresponding wild-type control cells were harvested at two
   time points (i.e., 24 and 48 h). Each biofuel-stressed dataset has two
   technical replicates. For overall data quality, reproducibility and
   full description of the proteomic datasets, please refer to several
   previous publications (Liu et al., [89]2012; Qiao et al., [90]2012,
   [91]2013; Huang et al., [92]2013; Tian et al., [93]2013).

Table 1.

   The proteomic datasets used in this study.
   Condition[94]^a Unique spectra Peptides Unique peptides Proteins
   Ethanol (Qiao et al., [95]2012) 21,066 7,337 7,192 1,523
   Butanol (Tian et al., [96]2013) 18,745 6,355 6,252 1,300
   Hexane (Liu et al., [97]2012) 19,217 6,995 6,875 1,389
   Salt (Qiao et al., [98]2013) 23,822 8,379 8,257 1,702
   N-starvation (Huang et al., [99]2013) 23,674 8,404 8,282 1,703
   [100]Open in a new tab

   ^aReferences for each dataset are provided.