Abstract
Catharanthus roseus is a medicinal plant, which can produce monoterpene
indole alkaloid (MIA) metabolites with biological activity and is rich
in vinblastine and vincristine. With release of the scaffolded genome
sequence of C. roseus, it is necessary to annotate gene functions on
the whole-genome level. Recently, 53 RNA-seq datasets are available in
public with different tissues (flower, root, leaf, seedling, and shoot)
and different treatments (MeJA, PnWB infection and yeast elicitor). We
used in-house data process pipeline with the combination of PCC and MR
algorithms to construct a co-expression network exploring
multi-dimensional gene expression (global, tissue preferential, and
treat response) through multi-layered approaches. In the meanwhile, we
added miRNA-target pairs, predicted PPI pairs into the network and
provided several tools such as gene set enrichment analysis, functional
module enrichment analysis, and motif analysis for functional
prediction of the co-expression genes. Finally, we have constructed an
online croFGD database ([32]http://bioinformatics.cau.edu.cn/croFGD/).
We hope croFGD can help the communities to study the C. roseus
functional genomics and make novel discoveries about key genes involved
in some important biological processes.
Keywords: Catharanthus roseus, co-expression network, functional
module, gene function, monoterpene indole alkaloid
Introduction
Catharanthus roseus, a model plant of the Apocynaceae family, is best
known for production of the bis-indole monoterpene indole alkaloids
(MIAs). There are four important MIAs, vinblastine and vincristine used
in the clinic as anti-cancer agents ([33]Aslam et al., 2010),
catharanthine which can reduce blood sugar content ([34]Pan et al.,
2012), and vindoline. MIAs belong to a class of terpenoid indole
alkaloids (TIAs). Some TIAs exhibit strong pharmacological activities,
whose production has beneficial effects on human health ([35]Almagro et
al., 2015). The biosynthesis of TIAs is regulated by several key
transcription factors (TFs), such as ORCA3, ORCA2, WRKY, MYC, ZCT1, and
BIS, which can enhance alkaloid production ([36]Van Der Fits and
Memelink, 2000; [37]Suttipanta et al., 2011; [38]Zhang et al., 2011;
[39]Li et al., 2013; [40]Van Moerkercke et al., 2015; [41]Rizvi et al.,
2016). In addition to these key TFs, some hormones and transporters are
essential for the regulation of TIA biosynthesis in C. roseus ([42]Liu
et al., 2017). Some external signals such as elicitor and jasmonate
(JA) can regulate the activities of several TFs involved in TIA
biosynthesis ([43]Memelink and Gantet, 2007). Although much progress
has been made in the field of TIAs, functions of some key genes and
enzymes associated with the regulation of TIA biosynthesis are still
unknown, which makes it difficult to understand the whole process.
Notably, the release of the scaffolded genome sequence of C. roseus
([44]Kellner et al., 2015), makes it possible to refine functional
annotations of genes by integrating multidimensional data and existing
methods.
The integration of biological information through gene expression
profiling analysis can benefit to elucidating gene function
([45]Noordewier and Warren, 2001). Transcriptomic datasets can be used
to establish the gene expression profiles, which can provide some
useful information for inferring gene regulatory relationship
([46]Newton and Wernisch, 2014). Transcriptome analysis reveals that
some genes involved in TIA biosynthesis are differentially expressed in
leaf and root tissues, which can help understand specialized metabolic
pathways in C. roseus ([47]Verma et al., 2014). Integrated
transcriptome and metabolome analysis can establish connections between
genes and specialized metabolites, which can identify many genes
involved in TIA synthesis and elucidate particular biological pathways
([48]Rischer et al., 2006). Basing on transcriptomic datasets, the
network construction can provide important biological knowledge,
especially for digging out possible gene functions ([49]Rhee and
Mutwil, 2014).
Currently, there has been a plenty of transcriptomic datasets available
on the public platform, which lay the foundation for the research in C.
roseus. By considering all collected transcriptomic samples available
together, co-expression network is applied to predicting gene functions
on a large scale ([50]Ma et al., 2014). Co-expression network analysis
can mimic some important regulatory mechanism in vivo and thus discover
key regulatory genes or functional modules. [51]van Dam et al. (2017)
excavated disease-related functional modules and annotated core genes
based on co-expression network analysis. Considering that genes within
a specialized metabolite pathway may form tight associations with each
other in co-expression network, the method for connecting genes to
specialized metabolic pathways in plant is effective, which can
identify novel genes associated with specialized metabolic pathways
([52]Wisecaver et al., 2017). Co-expression network analysis identified
two missing enzymes, PAS and DPAS, necessary for vinblastine
biosynthesis in C. roseus, which is important for understanding many
other bioactive alkaloids ([53]Caputi et al., 2018).
A growing number of studies have supported the utility of co-expression
network analysis for inferring and annotating gene function, and
excavating core genes involved in specific biological process. PlaNet
used Heuristic Cluster Chiseling Algorithm (HCCA) to construct
whole-genome co-expression networks for Arabidopsis and six important
plant crop species ([54]Mutwil et al., 2011). AraNet presented
co-functional gene network for Arabidopsis and generated functional
predictions for 27 non-model plant species using an orthologous-based
projection ([55]Lee et al., 2015). ATTED-II provided 16 co-expression
platforms for nine plant species through combining the Pearson
correlation coefficient (PCC) and mutual rank (MR) algorithm ([56]Aoki
et al., 2016). Our lab have published several functional genomics
databases with co-expression network for plant species ([57]Yu et al.,
2014; [58]You et al., 2015, [59]2016; [60]Zhang et al., 2015; [61]Tian
et al., 2016; [62]Ma et al., 2018). Besides, ccNET provided comparative
gene functional analyses at a multi-dimensional network and epigenome
level across diploid and polyploid Gossypium species based on the
co-expression network ([63]You et al., 2017). With the combination of
transcriptomic and epigenomic data, MCENet provided global and
conditional networks to help identify maize functional genes or modules
associated with agronomic traits ([64]Tian et al., 2018).
Here, we constructed a functional genomics database for C. roseus
(croFGD). It provided three types of co-expression network, which
allowed user to perform network search and analysis from a
multi-dimensional perspective. Functional annotation information and
several analysis tools were provided for functional prediction of the
co-expression genes. Basing on co-expression network, we identified
some functional modules which could be applied to the discovery of
vital genes associated with agronomic traits. The integration of
co-expression network analysis and functional module identification can
be used to improve C. roseus gene function annotation and helpful for
the functional genomics research. Besides, it can promote the research
for the synthesis, metabolism of active substances and drug
development.
Materials and Methods
Transcriptomic Data Source
There were 53 samples in Catharanthus roseus collected from the NCBI
Sequence Read Archive (SRA), which covered different tissues (root,
hairy root, shoot, stem, leaf, flower, seedling, and callus) and
different treatments, such as methyl jasmonate (MeJA), peanut witches’
broom (PnWB) infection and yeast elicitor ([65]Supplementary Table S1).
Data Processing and Gene Expression Profiling Analysis
The C. roseus genome had a size of ∼500 Mb, and 33,829 protein-coding
genes. All transcriptomic datasets were subjected to quality control
using FastQC software (v0.10.1) ([66]Brown et al., 2017). Those
datasets with mapping rate <50% were filtered out. The sequence reads
were mapped to the C. roseus reference genome (ASM94934v1) ([67]Kellner
et al., 2015) using Tophat (v2.0.10) software ([68]Trapnell et al.,
2009) with default parameters. Cufflinks (v2.2.1) ([69]Trapnell et al.,
2010) was used to calculate the FPKM (fragments per kilobase of
transcript per million mapped reads) values with default parameters.
And differentially expressed genes was calculated by Cuffdiff (v2.2.1)
([70]Trapnell et al., 2013).
Co-expression Network Construction
Pearson correlation coefficient is used to calculate correlation
coefficient between two genes. MR represents high credible
co-expression gene pairs after ranking the PCC. PCC is calculated based
on the formula below. The more similar the expression pattern in
samples between genes is, the higher the PCC score might be. MR is an
algorithm basing on PCC, which takes a geometric average of the PCC
rank from gene A to gene B and from gene B to gene A.
[MATH:
PCC=<
mfrac>∑i=1n
mo>(xi−x¯)(yi−y¯)∑i=1n
mo>(xi−x¯)2·∑i=1n
mo>(yi−y¯)2<
/msqrt>M
R(AB)=(Rank(A→B)×Rank(B→A)) :MATH]
X or Y represents the FPKM value, and n represents the number of
samples. MR ensures those co-expression gene pairs with low credibility
will be filtered out, so the PCC and MR are combined to construct
co-expression network. Here, all samples were used for the construction
of global co-expression network. Among all samples, 44 samples without
treatment were used to construct tissue-preferential network, and 32
samples with treatment and corresponding control were used to construct
the treat-response network.
Functional Module Identification and Parameter Selection
The Clique Percolation Method (CPM) ([71]Adamcsek et al., 2006) was
used to identify modules with nodes densely connected to each other in
three types of co-expression networks, including global network,
tissue-preferential network and treat-response network. Parameter
selection was based on module number, module overlap rate and gene
coverage rate. Here, we selected the k = 5 clique size for global
co-expression network, which meant each module had at least five nodes
and each node had co-expression relationship with each other
([72]Supplementary Figure S2). In fact, one functional module could be
regarded as a small network. Similarly, we selected the k = 6 clique
size for tissue-preferential network and treat-response network. The
functions of the modules were annotated through gene set enrichment
analysis (GSEA) ([73]Yi et al., 2013), including GO terms, gene
families, plantCyc and KEGG pathways.
The Identification of Orthologous Genes in Arabidopsis
Bidirectional blast alignments were conducted for the analysis of
protein sequences between C. roseus and Arabidopsis. Our criteria for
the identification of orthologous gene pairs were as follows: the top
three hits in each bidirectional blast alignment were selected as the
best orthologous pairs; in addition, orthologous pairs with an e-value
less than 1E-25 were regarded as the second level.
The Classification of Gene Family
Five main gene families, including TFs and regulator factors (TRs),
carbohydrate-active enzymes, kinase, ubiquitin and cytochrome P450,
were classified to improve limited functional annotation. TF/TRs and
kinase family were identified mainly by iTAK tool ([74]Zheng et al.,
2016) based on the rule in PlnTFDB ([75]Pérez-Rodríguez et al., 2009)
and PlantsP Kinase Classification ([76]Tchieu et al., 2003),
respectively. The carbohydrate-active enzymes (CAZy) family
([77]Lombard et al., 2014) was predicted through the method of
orthologous search based on Arabidopsis thaliana. The enzymes were
classified into six groups: glycoside hydrolases (GH),
glycosyltransferase (GT), polysaccharide lyases (PL), carbohydrate
esterase (CE), auxiliary activities (AA) and carbohydrate-binding
modules (CBM). Ubiquitin family was identified through Hidden Markov
Model (HMM) search based on models from UUCD ([78]Gao et al., 2013).
And cytochrome P450 family was predicted by orthologous relationship
with Arabidopsis and the candidates were confirmed with ID of PF00067
by Pfam ([79]Finn et al., 2014) search.
Z-Score for Motif Analysis
Motif (cis-element) analysis tool is developed to identify significant
motifs in one sequence or in the promoter region of interested gene
list and thus predict possible functions. Z-score is a statistical
measurement of the distance in standard deviations of a sample, which
can act as a normalization method to eliminate the difference caused by
background for different samples. So far, it is widely applied to
calculating the cis-element significance ([80]Endo et al., 2014).
The Z-score is calculated as:
[MATH:
Z=X¯−μσ/n
:MATH]
[MATH: X¯ :MATH]
represents sum value of a motif in the promoter of one gene list. μ
represents mean value of the same motif in 1,000 random gene lists with
same scale. σ represents standard deviation of the 1,000 mean value
based on random selection.
Plant Materials and Growth Conditions
C. roseus seeds were planted in small pots and kept moistened until the
seeds had germinated, and allowed to grow until they had three to five
leaves, then transferred to a greenhouse (16 h light/8 h darkness,
28/25°C). For MeJA treatment, 100 μM MeJA was sprayed evenly on leaves
and stem of well-growth plants. In order to prevent MeJA decomposition,
leaves and stem with treatment and corresponding control were under
darkness. After treatment for 6 and 24 h, the leaves and stem were
harvested, immediately frozen in liquid nitrogen, and then stored at
-80°C for use. Control samples were also harvested. Three biologically
repeated samples were harvested.
RNA Isolation and Quantitative Real Time RT-PCR
About 100 mg of tissue was ground in liquid nitrogen before isolation
of the RNA. Total RNA was isolated using TRIZOL^® reagent (Invitrogen,
Carlsbad, CA, United States) and purified using Qiagen RNeasy columns
(Qiagen, Hilden, Germany). Reverse transcription was performed using
Moloney murine leukemia virus (M-MLV; Invitrogen). We heated 10 μL
samples containing 2 μg of total RNA, and 20 pmol of random hexamers
(Invitrogen) at 70°C for 2 min to denature the RNA and then chilled the
samples on ice for 2 min. We added reaction buffer and M-MLV to a total
volume of 20 μL containing 500 μM dNTPs, 50 mM Tris-HCl (PH 8.3), 75 mM
KCl, 3 mM MgCl[2], 5 mM dithiothreitol, 200 units of M-MLV and 20 pmol
random hexamers. The samples were then heated at 42°C for 1.5 h. The
cDNA samples were diluted to 2 ng/μL for real time RT-PCR analysis.
For quantitative real-time RT-PCR, triplicate quantitative assays were
performed on 1 μL of each cDNA dilution using the SYBR Green Master Mix
with an ABI 7900 sequence detection system according to the
manufacture’s protocol (Applied Biosystems). The gene-specific primers
were designed using PRIMER3^[81]1. The amplification of 18S rRNA was
used as an internal control to normalize all data (forward primer,
5′-CGGCTACCACATCCAAGGAA-3′; reverse primer, 5′-TGTCACTACCT
CCCCGTGTCA-3′). Gene-specific primers were listed in [82]Supplementary
Table S2. The relative quantification method (ΔΔCT) was used to
evaluate quantitative variation between replicates examined.
Construction and Content
Database Construction
The database was constructed under the LAMP (Linux + Apache + Mysql +
PHP) environment. It mainly contains three parts: (I) functional
annotation, which includes gene family, KEGG pathway and miRNA detailed
information, etc.; (II) network and module, including co-expression
network search and analysis, network comparison and module search;
(III) some analysis tools, mainly including cis-element enrichment
analysis, GSEA, functional module enrichment analysis and UCSC Genome
Browser visualization ([83]Figure 1).
FIGURE 1.
FIGURE 1
[84]Open in a new tab
Database architecture. The croFGD database is divided into three main
pieces: network & module, functional annotation, and analysis tools.
The line with different colors indicates different pieces.
Functional Annotation
We obtained the functional annotation information in C. roseus from the
Dryad Digital Repository ([85]Kellner et al., 2015). Among 33,829
protein-coding genes, 14,527 genes were annotated with 4,734 GO terms
by blast2GO ([86]Conesa and Gotz, 2008). 5,571 enzymes involved in 213
metabolism pathways were annotated by GhostKOALA ([87]Kanehisa et al.,
2016) from KEGG database. We mapped C. roseus protein sequences against
CathaCyc ([88]Van Moerkercke et al., 2013) using the BLASTP program and
2,421 enzymes involved in 513 metabolism pathways were annotated. Then
we predicted 36,882 orthologous pairs between C. roseus and Arabidopsis
through bidirectional blast alignment. There were a total of 1,035
plant motifs collected from the Plant Cis-acting Regulatory DNA
Elements (PLACE) database ([89]Higo et al., 1999), PlantCARE database
([90]Rombauts et al., 1999), AthaMap database ([91]Steffens, 2004) and
literatures. Furthermore, we adopted the inparanoid algorithm
([92]Sonnhammer and Östlund, 2015) and predicted 9,377 protein–protein
interaction (PPI) pairs in C. roseus from over 18,000 experimentally
validated PPI pairs in Arabidopsis integrated from several databases,
such as BIOGRID ([93]Chatr-Aryamontri et al., 2017), IntAct
([94]Orchard et al., 2014) and related literature ([95]Lumba et al.,
2014). We also collected 227 miRNA sequence information derived from a
literature ([96]Shen et al., 2017), and then mapped these miRNA
sequences against the whole-genome sequence using the GMAP program
([97]Wu and Watanabe, 2005). Furthermore, 143 miRNA targets were
identified by psRNATarget ([98]Dai and Zhao, 2011). The miRNA detailed
information mainly included location, sequence and structure, miRNA
target and expression profiles in seedling after MeJA treatment
([99]Supplementary Figure S3). Furthermore, we conducted the gene
family classification and finally predicted 88 TFs/TRs families with
1,702 genes, 21 ubiquitin families with 1,192 genes, 98 cytochrome P450
families with 191 genes, 85 kinase families with 778 genes and 96 CAZy
families with 1,505 genes ([100]Table 1).
Table 1.
Data collection and statistics in croFGD.
Database content Number Source Reference
GO terms (genes) 55,505 (14,527) Blast2GO tool [101]Conesa and Gotz,
2008
KEGG pathway (genes) 213 (5,571) GhostKOALA tool [102]Kanehisa et al.,
2016
PlantCyc (genes) 513 (2,421) Blastp prediction –
Cis-elements (motifs) 1,035 Database and literature collection –
Orthologous pairs in Arabidopsis (genes) 36,882 (14,719) Blast
alignment –
Transcription factor and regulators (members) 88 (1,702) iTAK
prediction [103]Zheng et al., 2016
Kinases (members) 85 (778)
Carbohydrate-active enzymes (members) 96 (1,505) Blast alignment
[104]Lombard et al., 2014
Ubiquitin (members) 21 (1,192) Blast alignment [105]Zhou et al., 2018
Cytochrome P450 (members) 98 (191) the cytochrome p450 homepage
[106]Nelson, 2009
Co-expression network nodes (%) 30,096 (88.9%) PCC and MR [107]Aoki et
al., 2016
Tissue-preferential network nodes (%) 29,808 (88.1%)
Treat-response network nodes (%) 30,541 (90.3%)
Protein–protein interaction pairs 9,377 InParanoid algorithm
[108]Sonnhammer and Östlund, 2015
miRNA target modules 143 psRNAtarget prediction [109]Dai and Zhao, 2011
Function modules from global network (nodes) 2,310 (10,757) CFinder
tool [110]Adamcsek et al., 2006
Function modules from tissue-preferential network (nodes) 1,849
(12,090)
Function modules from treat-response network (nodes) 2,177 (12,073)
[111]Open in a new tab
Co-expression Network and Functional Module
A well-developed strategy with the integration of PCC and MR algorithm
was widely applied to the construction of co-expression network
([112]You et al., 2016, [113]2017; [114]Obayashi et al., 2018;
[115]Tian et al., 2018). We used the 240 BP terms of GO associated with
>4 and <20 genes to evaluate the networks. To get optimal gene pairs
and evaluate the credibility of co-expression network, we selected
different PCC thresholds of PCC > 0.7, PCC > 0.8, PCC > 0.9 and
different MR thresholds of MR top3 + MR ≤ 30, MR top3 + MR ≤ 50, MR
top3 + MR ≤ 100 to predict gene functions basing on selected GO terms
and generated receiver operating characteristic (ROC) curves
([116]Supplementary Figure S1). The larger the area under the curve
(AUC) value of co-expression network is, the higher the credibility of
the network will be. Finally, we selected the thresholds of PCC > 0.7
and MR top3 + MR ≤ 30 to filter out those co-expression gene pairs with
low credibility to construct co-expression network. In total, there
were 30,096, 29,808 and 30,541 nodes in global network,
tissue-preferential network and treat-response network with gene
expression view, which covered 88.9%, 88.1%, and 90.3% of genes in C.
roseus, respectively ([117]Table 1). All networks were visualized by
Cytoscape 2.8 ([118]Smoot et al., 2011).
Then we overlaid the gene expression value onto the co-expression
network to identify whether genes in the network were expressed or not
based on the minimum threshold FPKM value. To determine the minimum
threshold of the gene expression value (FPKM) among all C. roseus
samples (detailed mapping results are shown in [119]Supplementary Table
S3), the lowest 5% of all gene FPKM values in each sample and the
standard deviation (SD) of each experimental group were computed. The
mathematical formula “threshold = average (5% value) + 3 ^∗ SD”
([120]You et al., 2016, [121]2017) was used to calculate the minimum
expression value of each experimental group. The minimum threshold of
FPKM was 0.094. We identified differential expressed genes between
treatment and control samples by the cutoff: |log[2]FC|≥ 1 and p-value
≤ 0.05. Tissue-preferential analysis in different tissues (root, hairy
root, shoot, stem, leaf, flower, seedling, and callus) and
treat-response analysis under three types of treatments (MeJA, PnWB
infection and yeast elicitor) among five tissues (root, shoot, flower,
callus, and hairy root) were supplied for the co-expression network
analysis. Meanwhile, predicted miRNA target and PPI pairs were
integrated into the network, and further analysis was provided for all
members in the network, such as gene expression profiling analysis,
GSEA, and cis-element analysis.
Furthermore, co-expression network could be used to perform modularized
analysis and excavation for the discovery of agronomic trait-related
vital gene and functional module. The CPM proposed to detect the
overlapping communities in the complex network ([122]Palla et al.,
2005; [123]Li et al., 2014), provided certain practicability for the
discovery of key gene and module. Finally, we applied the algorithm and
predicted 2,310, 1,849, and 2,177 functional modules in global network,
tissue-preferential network and treat-response network in C. roseus,
respectively ([124]Table 1). The functions of these modules were
annotated through GSEA ([125]Yi et al., 2013). The entries which were
not significant were filtered out by Fisher’s tests and multiple test
correction method (FDR ≤ 0.05). These functional modules covered
diverse functions such as vindoline and vinblastine biosynthesis,
jasmonic acid biosynthesis, pathogen resistance and hormone response,
etc.
Analysis Tools
Gene Set Enrichment Analysis
Gene set enrichment analysis ([126]Yi et al., 2013) is a powerful
method for the functional annotation of interested gene list by
computing the overlaps with well-defined background gene sets. Some
categories of gene sets, such as GO terms, gene families, plantCyc and
KEGG pathways, miRNA targets and functional modules identified from
three types of network, were used as background gene sets. The
significantly enriched gene set with FDRs ≤ 0.05 would be displayed on
the GSEA result page.
Functional Module Enrichment Analysis
The tool was used to identify some functional modules from interested
gene list especially in the network. The previously annotated miRNA
target modules and functional modules identified from three types of
network were used as background functional modules. The modules with
FDRs ≤ 0.05 would be regarded as significantly enriched and the
enrichment analysis result page included module annotation, module
source, overlap gene number, and FDR value.
Cis-Element Enrichment Analysis
Cis-element (motif), a short conserved sequence, can be recognized by
some TFs to regulate the expression levels of downstream genes. The
tool was developed to identify motifs in a set of gene promoters and
thus predict the function of gene set. The cis-element significance
test is an algorithm using statistical method based on Z-score and
p-value filtering ([127]Yu et al., 2014) that can identify significant
cis-regulatory elements in the promoter region of one gene. The
promoter region was set as 3 kb in C. roseus. When scanned in the 3 kb
promoter region of C. roseus genes, motifs with p-value ≤ 0.05 were
significantly enriched on account of the frequency of motif occurrence.
Other Tools Supported in croFGD
A quick search, UCSC Genome Browser ([128]Speir et al., 2016)
visualization and a manual were provided for users. The search page
mainly included gene detail search, gene function search, functional
module search and orthologous search. The orthologous search allowed
user to input one gene list in Arabidopsis to search for corresponding
C. roseus genes.
Function Application
Comprehensive Exploration for the Function of 16OMT Gene
CRO_T004356 (16OMT), o-methyltransferase family member, which was
reported to be involved in the biosynthesis of TIAs ([129]Pandey et
al., 2016; [130]Yamamoto et al., 2016). Taking 16OMT gene as an
example, we explored possible function of the gene through the
database. By gene detail search, we found that the gene: (I) was
annotated with alkaloid biosynthetic process (GO: 0009821) and
myricetin 3′-O-methyltransferase activity (GO: 0033799), etc.; (II) had
two pfam domains: “Dimerisation (PF08100)” and “Methyltransf_2
(PF00891)” domains; (III) was mainly involved in vindoline and
vinblastine biosynthesis; (IV) was relatively high in expression in
leaf tissue ([131]Figure 2A). We conducted network analysis for three
types of co-expression network of 16OMT gene including
tissue-preferential network ([132]Figure 2B), global network
([133]Figure 2C) and treat-response network ([134]Figure 2D). GSEA
results for global network genes indicated that these genes might be
involved in phenylpropanoid biosynthesis, vindoline and vinblastine
biosynthesis. Network comparison results suggested that it was
relatively conservative between global network and tissue-preferential
network ([135]Figure 2E), and there were great differences between
global network and treat-response network ([136]Figure 2F). Through
module search, the gene in the module ([137]Figure 2G) might be
involved in vindoline and vinblastine biosynthesis, alkaloid
biosynthetic process, and protein phosphorylation, etc. Therefore,
16OMT gene might have diverse function in several biological processes
like hos1 gene ([138]MacGregor and Penfield, 2015). The expression
heatmaps of all genes in the module were included ([139]Figure 2H).
UCSC genome browser visualization ([140]Figure 2I) indicated that most
RNA-seq peaks were enriched in the genic region. Furthermore,
stilbenoid, diarylheptanoid, and gingerol biosynthesis pathway was
shown ([141]Figure 2J).
FIGURE 2.
FIGURE 2
[142]Open in a new tab
Comprehensive explorations for the function of 16OMT (CRO_T004356)
gene. (A) The detailed information of 16OMT gene in C. roseus. Three
types of co-expression network, including tissue-preferential network
(B), global network (C) and treat-response network (D). In these
networks, the node with yellow color represents the gene submitted
initially, and the nodes with green color represent co-expressed genes;
the edge with pink color links two genes with positive co-expression
relationship; the edge with blue color links two genes with negative
co-expression relationship. (E) Network comparison between global
network and tissue-preferential network. The nodes with yellow color
represent overlap genes between two networks, and the nodes with green
and dark green color stand for unique genes in two networks,
respectively. (F) Network comparison between global network and
treat-response network. (G) The “CFinderADM000741” module. (H)
Expression heatmaps of genes in “CFinderADM000741” module. (I) UCSC
genome browser visualization. (J) Stilbenoid, diarylheptanoid, and
gingerol biosynthesis pathway.
Co-expression Network Analysis for CPR Gene
CPR, NADPH–cytochrome P450 reductase, which is essential for the
activation of cytochrome P450 enzymes, is critical for the biosynthesis
of MIAs ([143]Parage et al., 2016). The detailed information of all
genes in the global network of CPR gene ([144]Figure 3A) was listed in
[145]Supplementary Table S4. In the CPR network, some genes (GES, 7DLH,
GOR, HDS, G8H, ISY, MCS, HDR, 7DLGT and IO) were involved in MIA
biosynthesis pathway ([146]Chebbi et al., 2014; [147]Kumar et al.,
2015). These genes were labeled with bold in the MIA biosynthesis
pathway ([148]Figure 3C). Through GO enrichment analysis ([149]Tian et
al., 2017) for all genes in the CPR network, the significantly enriched
GO terms were associated with terpene biosynthetic process, and
isoprenoid biosynthetic process ([150]Figure 3B), which were related to
MIA biosynthesis ([151]Geu-Flores et al., 2012; [152]Dugé de
Bernonville et al., 2015). Through module enrichment analysis for all
genes in CPR network, three genes (CYP76C, CRO_T015823, and
CRO_T014922) in significantly enriched functional modules might be
involved in brassinosteroid (BR) biosynthesis, gibberellic acid (GA)
response and indole alkaloid biosynthesis, respectively ([153]Figure
3D). Therefore, in addition to MIA biosynthesis, CYP76C and CRO_T015823
also played important role in plant growth and development. Besides,
CRO_T014922 might also be involved in MIA biosynthesis together with
other genes (CRO_T019924, CRO_T030883, CRO_T015465, and CRO_T025273) in
the module ([154]Figure 3D). Thus, in addition to the function of
network, co-expressed genes might be involved in some other functions.
Furthermore, co-expression analysis can be combined with module
enrichment analysis to predict gene function effectively.
FIGURE 3.
FIGURE 3
[155]Open in a new tab
The global network of CPR (CRO_T031702) gene involved in MIA pathway.
(A) Global network of CPR gene. The query gene CPR is highlighted by
yellow, the blue line represents negative co-expression relationship
between two genes, while the pink line represents positive
co-expression relationship. The dark purple diamond represents several
genes involved in the MIA biosynthesis pathway, such as GES, 7DLH, GOR,
HDS, G8H, ISY, MCS, HDR, 7DLGT, and IO, which are co-expressed with CPR
gene in the network, and the light purple circular represents other
genes co-expressed with query gene. (B) Scatter plot of GO enrichment
analysis results for all genes in CPR network. (C) The simplified MIA
pathway. The bold represents the gene in CPR co-expression network. (D)
Several functional modules related to genes in CPR network. The red
node represents genes in CPR network.
Network Comparison Between Global Network and Tissue-Preferential Network of
JAZ1 Gene
JAZ1, a jasmonate-zim-domain protein, was discovered as repressors of
jasmonate signaling, which was involved in TIA biosynthesis ([156]Pan
et al., 2018). We conducted network comparison between global network
and tissue-preferential network of JAZ1 ([157]Figure 4A). The
information of co-expressed genes in global network and
tissue-preferential network was shown in [158]Supplementary Table S5.
We found that the two networks displayed different network structure.
There were nine overlapped genes including JAZ1 gene between two
networks. Fifteen unique genes (including TIFY, CYP94C, and JAZ3)
appeared in global network, while sixteen unique genes including MYB15
appeared in tissue-preferential network. GSEA results for the genes in
global network of JAZ1 indicated that some gene sets were significantly
enriched, such as jasmonic acid biosynthesis, alpha-linolenic acid
metabolism, steroid biosynthesis and plant hormone signal transduction
([159]Menke et al., 1999; [160]Koo et al., 2014; [161]Patra et al.,
2018). GSEA results for the genes in tissue-preferential network of
JAZ1 illustrated that some gene sets were significantly enriched, such
as jasmonic acid biosynthetic process, 12-oxophytodienoate reductase
activity, NADPH dehydrogenase activity, triglyceride lipase activity
and oxylipin biosynthetic process ([162]Figure 4B) ([163]Tani et al.,
2008; [164]Wallström et al., 2014; [165]Wang et al., 2018). Based on
the structure and function of the two networks of JAZ1 gene, there were
some conservation and differences between two networks. In Arabidopsis,
cytochrome p450 family member CYP94C1 and CYP94B3 played important role
in the regulation of jasmonate response ([166]Niu et al., 2011;
[167]Heitz et al., 2012; [168]Koo et al., 2014). In Gossypium hirsutum,
GhJAZ2 regulated the jasmonic acid signaling pathway by interacting
with the R2R3-MYB transcription factor GhMYB25 ([169]Hu et al., 2016).
It needed further study whether the two genes CYP94C and MYB15
coexpressed with JAZ1 in two networks had similar function in C. roseus
as in Arabidopsis and Gossypium hirsutum, respectively. These results
indicated that network comparison is an effective approach to analyze
gene function from the perspective of different networks.
FIGURE 4.
FIGURE 4
[170]Open in a new tab
The network comparison between the global network and
tissue-preferential network. (A) The comparison between the global
network and tissue-preferential network of JAZ1 (CRO_T006982). In the
network comparison, the nodes with yellow color represents overlap
genes between two networks, and the nodes with dark blue color
represents genes only in tissue specific network, while the nodes with
sky blue color represents genes only in global network. (B) GSEA
results for two networks of JAZ1. The “G” represents global network,
and the “T” represents tissue-preferential network.
Treat-Response Network With Expression View After MeJA Treatment
In JAZ1 network with expression view after MeJA treatment in different
tissues (shoot, root, hairy root, and seedling) ([171]Figure 5), most
genes had significant change in expression, such as JAZ1, JAZ3, CYP94C,
MYB, MYB15, and TIFY. Detailed information for up and down-regulated
genes in these networks was shown ([172]Supplementary Table S6). In C.
roseus, JAZ proteins could repress MYC2 and BIS1 to respond to JA
signaling and then modulate MIA biosynthesis ([173]Patra et al., 2018).
In rice, enhanced expression of cytochrome p450 family member CYP94C2b
could alleviate the jasmonate response and enhanced salt tolerance
([174]Kurotani et al., 2015). In Arabidopsis, AtMYB44 could repress
JA-mediated defense by activating the expression of WRKY70 at
transcriptional level ([175]Shim et al., 2013). PvTIFY10C and GsTIFY10
gene acted as a repressor in the JA signaling pathway in Phaseolus
vulgaris and Glycine soja ([176]Zhu et al., 2011; [177]Aparicio-Fabre
et al., 2013), respectively. We conferred that CYP94C, MYB, MYB15, and
TIFY co-expressed with JAZ1 might act as JA-response candidate genes in
C. roseus. Furthermore, CRO_T012104 (anthranilate synthase beta
subunit), CRO_T013473 (protein of unknown function), CRO_T010878
(alpha/beta-hydrolases superfamily protein), CRO_T002729 (allene oxide
cyclase), and CRO_T002624 (tryptophan biosynthesis) almost up-regulated
under those five conditions, might also act as JA-response candidate
genes. Taking treat-response network of JAZ1 gene as an example, we
selected six genes (JAZ1, TIFY, MYB, CRO_T012104, CRO_T024124, and
CRO_T002729) for the real time RT-PCR validation ([178]Supplementary
Figure S4). These genes were up-regulated after MeJA treatment in shoot
tissues and might act as JA-response genes. The qRT-PCR results
indicated that these genes acted as JA-response genes in shoot tissues.
This not only validated the accuracy of the predicted results, but also
demonstrated the reliability of the network. Thus, treat-response
network with expression view can clear display the dynamic change of
gene expression in a network. Therefore, the co-expression network with
multi-dimensional analysis can benefit to analyzing regulatory
mechanisms in C. roseus development and stress response.
FIGURE 5.
FIGURE 5
[179]Open in a new tab
The expression view of JAZ1 network after MeJA treatment in different
tissues (shoot, root, hairy root, and seedling). The JAZ1 network after
MeJA treatment 6 h in shoot (A), MeJA treatment 24 h in shoot (B), 250
μM MeJA treatment 24 h in hairy root (C), MeJA treatment 24 h in root
(D), 6 μm MeJA treatment 12 days in sterile seedling (E). The red
hexagon represents up-regulated genes, the blue hexagon represents
down-regulated genes, and the green hexagon represents genes with no
significant change in expression.
Discussion
Our croFGD database aims to provide an online database server for the
annotation and prediction of gene function. We constructed global
network, tissue-preferential network and treat-response network with
expression view, which covered almost 90% of gene in C. roseus and
identified more than 6,000 functional modules. The annotation of these
functional modules covered vindoline and vinblastine biosynthesis,
jasmonic acid biosynthesis, hormone response and pathogen resistance,
etc. The network analysis strategy, functional module annotation and
integrated method could improve and refine gene function annotation
from diverse perspectives to some extent. For some crops, it could be
applied to excavate important functional module related to agronomic
traits, which would be beneficial for genetic breeding.
Through some analysis tools supported in croFGD, we can excavate key
genes involved in some important biological processes and predict gene
function. In comprehensive exploration for the function of 16OMT
([180]Figure 2), we found that the gene might have complex function,
like hos1 gene ([181]MacGregor and Penfield, 2015). In global network
of CPR, some genes were involved in MIA biosynthesis, such as GES,
7DLH, GOR, G8H, ISY, and 7DLGT ([182]Figure 3A). The integration of
co-expression network analysis and module enrichment analysis can be
benefit to predicting gene function effectively and refining gene
annotation. Basing on network comparison between two networks of JAZ1,
there were certain similarities and differences whether in the
structure or in the function of two networks ([183]Figure 4). In
addition, function of two genes CYP94C and MYB15 needed further
research. In treat-response network of JAZ1 gene with expression view
after MeJA treatment in different tissues, we identified several
possible JA-response candidate genes ([184]Figure 5 and
[185]Supplementary Table S6), which was experimentally validated by
real time RT-PCR ([186]Supplementary Figure S4). These results would be
beneficial to understanding some molecular regulatory mechanisms in C.
roseus, such as MIA biosynthesis and jasmonic acid biosynthesis, etc.
Comparative co-expression network analysis between species is an
effective approach to predict gene function and improve functional
annotation ([187]Pathania et al., 2016). We conducted network
comparison for gene list with PCC ranks in the top 300 between C.
roseus and Arabidopsis (obtained from ATTED-II) ([188]Supplementary
Figure S5). High similarity between co-expression network of JAZ1 in C.
roseus and AT1G19180 (JAZ1) in Arabidopsis not only demonstrated the
reliability of co-expression network, but also illustrated the
conservation of JAZ1 gene function between these two species.
Based on co-expression network with multi-dimensional level, predicting
functional module and refining gene function is an effective strategy,
which can be used to identify more key genes and regulatory modules
when we focus on a detailed biological process. Interestingly,
co-expression network is highly associated with the regulation of
epigenetic modification, such as DNA methylation ([189]El-Sharkawy et
al., 2015) and H3K4me3 ([190]Farris et al., 2015), which can be
integrated to understand detailed molecular mechanism, such as the
biosynthesis of specific metabolites. There is a certain correlation
between co-expression network and metabolic network, the integration of
which can be used to predict key enzyme-coding genes and metabolites
([191]Chen et al., 2013), and contribute to better understanding of the
molecular mechanisms related to plant metabolic pathway ([192]Rischer
et al., 2006; [193]Coneva et al., 2014).
Notably, there are additional limitations and possible improvements for
croFGD database. Firstly, the release of the chromosome-level genome of
C. roseus in the future, will greatly promote the research on
functional genomics. Secondly, more RNA-seq samples of other tissues
and treatments could be integrated into the co-expression network
construction on the transcriptomic level, which will be beneficial to
excavate gene function and improve the whole genome annotation in C.
roseus. Thirdly, epigenomic data, such as ChIP-seq and DNase-seq data,
can be integrated to improve the annotation of cis-elements and predict
gene function. Furthermore, more accurate data, such as gene families,
new type of non-coding RNAs, KEGG pathway and GO terms, needs to be
integrated, too. Our croFGD database will be updated regularly, and we
hope the database can help the community study the functional genomics
and yield novel insights into the molecular regulatory mechanisms.
Data Availability
Publicly available datasets were analyzed in this study. This data can
be found here: [194]https://www.ncbi.nlm.nih.gov/sra.
Author Contributions
JS performed gene functional annotation, functional module
identification, and database construction. HY performed data collection
and the co-expression network construction. JY gave advice a lot about
the web server. WX gave advice for the application of the co-expression
network and some key functional module identification. ZS and WX
supervised the project. All authors read and approved the final
manuscript.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Funding. This work was supported by grants from the National Natural
Science Foundation of China (Grant Nos. 31771467 and 31571360).
^1
[195]http://frodo.wi.mit.edu/primer3/input.htm
Supplementary Material
The Supplementary Material for this article can be found online at:
[196]https://www.frontiersin.org/articles/10.3389/fgene.2019.00238/full
#supplementary-material
[197]Click here for additional data file.^ (1MB, doc)
[198]Click here for additional data file.^ (58KB, xls)
References