Graphical abstract
graphic file with name ga1.jpg
[27]Open in a new tab
Keywords: Camellia sinensis, Circadian rhythm, PPI network, Hidden
Markov model, Expression analysis
Abstract
Tea (Camellia sinensis) is among the most valuable commercial crops
being a non-alcoholic beverage having antioxidant properties. Like in
other plants, circadian oscillator in tea modulates several biological
processes according to earth’s revolution dependent variations in
environmental cues like light and temperature. In the present study, we
report genome wide identification and characterization of circadian
oscillator (CO) proteins in tea. We first mined the genes (24, in
total) involved in circadian rhythm pathway in the 56 plant species
having available genomic information and then built their hidden Markov
models (HMMs). Using these HMMs, 24 proteins were identified in tea and
were further assessed for their functional annotation. Expression
analysis of all these 24 CO proteins was then performed in 3 abiotic
(A) and 3 biotic conditions (B) stress conditions and co-expressed as
well as differentially expressed genes in the selected 6 stress
conditions were elaborated. A methodology to identify the
differentially expressed genes in specific types of stresses (A or B)
is proposed and novel markers among CO proteins are presented. By
mapping the identified CO proteins against the recently reported genome
wide interologous protein–protein interaction network of tea (TeaGPIN),
an interaction sub-network of tea CO proteins (TeaCO-PIN) is developed
and analysed. Out of 24 CO proteins, structures of 4 proteins could be
successfully predicted and validated using consensus of three structure
prediction algorithms and their stability was further assessed using
molecular dynamic simulations at 100 ns. Phylogenetic analysis of these
proteins is performed to examine their molecular evolution.
1. Introduction
Circadian oscillators (CO) play important role in the internal time
estimation in an organism by synchronizing the biological events in
co-ordination with day and night cycles [28][1]. Circadian oscillators
are crucial for several processes that include growth, development and
other physiological events under various environmental conditions
[29][2]. Circadian rhythm is an internal timekeeping mechanism by which
all organisms including plants anticipate the environmental changes
like daily light–dark cycles, temperature fluctuations to synchronise
internal biological events with these external changes. These rhythms
are ubiquitous and are helpful in expressing several temporally
separated biological traits through the cycles spanning 24 h’ time
period [30][3]. Eukaryotic circadian oscillators rely on
transcriptional and translational feedback loops [31][4], [32][5]
regulated by more than 20 transcription factors that constitute core
plant clock network [33][6]. Early in the morning before sunrise,
CIRCADIAN CLOCK ASSOCIATED 1 (CCA1) and LONG ELONGATED HYPOCOTYL (LHY)
genes downregulate the expression of PSEUDO-RESPONSE REGULATOR 5
(PRR5), PRR7, PRR9, TIMING OF CAB EXPRESSION 1 (TOC1/PRR1), GIGANTEA
(GI) and genes belonging to evening complex (EC), namely, LUX ARRHYTHMO
(LUX), EARLY FLOWERING 3 (ELF3), ELF4 [34][6]. PRR (morning and midday
phase), CCA1 HIKING EXPEDITION (CHE) genes subsequently express and
restrict the CCA1 and LHY expression by repressing them during day
time. At afternoon, REVEILLE 8 (RVE8), RVE4, NIGHT LIGHT-INDUCIBLE AND
CLOCK-REGULATED GENE 1 (LNK1) and LNK2 factors induce expression of
PRR5, TOC1 and ELF4 genes [35][6]. Moreover, at evening TOC1 negatively
regulate GI, LUX, ELF4 and PRR5 [36][2]. EC maintains negative
regulation of GI, PRR7 and PRR9′s expression that, in turn, alleviate
CCA1 and LHY repression during the night.
The growth and development of a plant is directly affected by these
rhythms that are helpful in modulating various behavioural activities
[37][7]. There are several studies in which plant COs are reported to
regulate important physiological processes like photosynthesis,
flowering time, root growth, sugar metabolism, hormonal signalling,
nutrient biogenesis and plant immunity [38][8]. Under free-running
(lacking any external cues) conditions, behavioural activities of
plants are well-studied and it is being understood that multiple clocks
in different tissues and even within the cells operate in unison,
thereby increasing the circadian system complexity [39][9]. To
understand the inner working of oscillations in the model plant
Arabidopsis thaliana, several approaches including mathematical
modelling and phenotypic expression were employed in last decade.
Tea is among the most popular non-alcoholic beverages that is obtained
from newborn leaves of the plant Camellia sinensis. Due to its flavour
and health promoting functions, tea is the most consumed drink after
water across the world [40][10]. To meet the consumer demands, studies
based on the combination of various genetic and environmental factors
are being performed to increase the tea production. Using different tea
processing methods, a variety of tea products are also being developed
having several palatabilities. Various transcriptomic studies have been
conducted to examine the stress specific expressions, and recently a
high quality draft genome of tea has also been sequenced and annotated
for understanding its specific traits [41][11].
In the present study, we have made in-silico efforts for the
systems-level mining and characterization of genes inducing circadian
rhythms in the tea plant. To explore and examine the systems level
architecture of protein–protein subnetwork of circadian rhythm clock
associated genes (TeaCR-PIN), genome-wide interologous interactome map
(TeaGPIN) of tea was considered [42][12]. Further, all the interacting
proteins of developed subnetwork (nodes and edges) were selected for
co-expression and differential-expression analysis by examining their
expression in 3 abiotic and 3 biotic stress conditions. Furthermore,
pathway enrichment analysis was also performed for the co-expressed
modules of TeaCR-PIN [43][13]. Moreover, structural prediction and
analysis of these proteins was performed and stability of proteins was
accessed by molecular dynamics and simulations. Additionally,
phylogenetic analysis was also performed in order to find the most
accurate evolutionary estimates of these proteins in closely associated
plant species.
2. Materials and methods
2.1. 2.1. Data extraction
Proteome information of 56 plants with available genomes was downloaded
from UniProt database ([44]https://www.uniprot.org/). Based on the
available information in literature about proteins related to circadian
rhythms in two widely studied plant species Arabidopsis thaliana and
Oryza sativa, 24 proteins were enlisted [45][2], [46][6]. Sequences of
all the selected proteins were retrieved and subjected to Blast-P with
all the downloaded plants having genome information. For each protein,
top hits having at least 40% sequence identity, 50% query coverage and
[MATH: E-value≤10-5 :MATH]
were selected and aligned using Clustal Omega [47][14]. Hidden Markov
models for each selected protein were built using HMMER
([48]http://hmmer.janelia.org/). Furthermore, tea proteins were
downloaded from TPIA database [49][15] and all the protein sequences
were scanned at domain level with in-house made HMM profiles to
identify the core circadian oscillator proteins in Camellia sinensis.
Each CO protein was then queried to InerProScan v5.48–83
([50]https://www.ebi.ac.uk/interpro/) to identify their respective
domains.
2.2. Expression analysis of core clock genes
The expression analysis of identified CO proteins was carried out by
leveraging the RNA Seq data of C. sinensis transcriptomes
(Supplementary Table S1) obtained from 3 abiotic (SRP047312 [51][16];
SRP055910 [52][17]; ERP012919 [53][18]) and 3 biotic stress conditions
(SRP060335 [54][19]; SRP063593 [55][20]; SRP067826 [56][21]). All
pair-end filtered reads of 44 samples were mapped separately to CO
proteins using BOWTIE2 tool [57][22]. Further, for calculating the
relative expression of mapped CO genes, normalized transcripts per
kilobase million (TPM) values per sample were obtained for each
selected protein [58][23]. Furthermore, expression analysis of obtained
CO genes was conducted by calculating Pearson’s correlation coefficient
(PCC) for the abiotic and biotic conditions, separately. Pairwise
co-expression for all the 24 genes was computed using Pearson
correlation coefficient (PCC)
[MATH: rxy=∑(x-x¯)(y-y¯)∑i=1nx-x¯2∑i
=1ny-y¯2 :MATH]
where n is the number of samples (21 for abiotic and 23 for biotic
conditions),
[MATH: x :MATH]
and
[MATH: y :MATH]
represent vectors of RPKM values corresponding to genes x and y.
In order to check if there exists some characteristic expression
profile specific to abiotic (A) and biotic (B) stress types, we
computed Pearson’s correlation coefficient (PCC) for each gene-pair to
establish a stress type specific expression relationship, where all the
stress conditions belonging abiotic stress are categorized as stress
type A while all the conditions belonging to biotic stress are
categorized as stress type B. PCC was computed for every gene pair in
stress types A and B, separately. A pair of genes found to share
similar expression patterns in both stress types (i.e. same pair found
to have
[MATH: r≥0.5 :MATH]
in both stress types A and B or
[MATH: r≤-0.5 :MATH]
in both stress types A and B representing that both genes are either
simultaneously upregulated or downregulated) was considered to be
stress type specific co-expressed (STSCE) in the expression matrix. A
gene pair was considered to be stress type specific differential
expressed (STSDE), if
* (i)
the
[MATH: r :MATH]
value for that pair is found to be high positive (
[MATH: r≥0.5 :MATH]
) in one type of stress (either A or B) and high negative (
[MATH: r≤0.5 :MATH]
) in the other stress type.
* (ii)
the
[MATH: r :MATH]
value for that pair is found to be high (
[MATH: |r|≥0.5 :MATH]
) in one type of stress (either A or B) and low (
[MATH: r≤0.2
mrow> :MATH]
) in the other stress type.
All the gene pairs were termed as neutrally expressed if they have weak
correlation (
[MATH: r≤0.2 :MATH]
in both the types A and B) and considered to have stress type
independent expression (STIE).
2.3. Construction and analysis of PPI sub-network of CO proteins (TeaCO-PIN)
Additionally, to find the interacting partners of CO proteins, genome
wide interologous protein–protein interaction network (TeaGPIN) was
selected. By mapping all the CO proteins to TeaGPIN [59][12], a
protein–protein interaction subnetwork of these proteins (TeaCO-PIN)
was developed that was visualized and analyzed using cytoscape
[60][24]. Moreover, TeaCO-PIN was also subjected to weighted gene
co-expression network analysis (WGCNA) tool [61][25] for identifying
the co-expressing modules. DAVID Bioinformatics Resources 6.8 [62][26]
was used to perform pathway enrichment analysis of these modules.
2.4. Functional annotation of clock associated proteins
For functional categorization of the identified proteins associated
with core clock in Camellia sinensis, each of protein was subjected to
Blast-P with non-redundant (NR) database at NCBI [63][27], UniProt
database [64][28] and TAIR database [65][29] with E-value of 10^-5.
Further, gene ontology (GO) terms for CO proteins were predicted using
AgriGo database [66][30] and WEGO tool [67][31]. For gaining the
information of particular pathways in which these proteins are
associated with, all the obtained sequences were mapped to KEGG
database [68][32].
2.5. In-silico structural modeling of CO proteins
To find the three-dimensional structural patterns of CO proteins,
homology modeling method was considered. All the proteins were
subjected to Blast-P with protein data bank (PDB) to identify the
templates for structure identification. Three structure prediction
softwares namely, modeler 9.21 [69][33], CPH models [70][34] and Phyre2
tools [71][35] were used to build structures of all these proteins. All
the modeled proteins were subjected to validation by structure analysis
and verification server (SAVES) to find the structural errors like
overall stereochemical quality, macromolecular volume, parameters of
residues, non-bonded interactions and compatibility of models [72][36],
[73][37], [74][38]. To find the residues in most favored regions,
PROCHECK analysis was performed to analyze the Ramachandran plot
[75][39].
2.6. Molecular dynamics simulation
Proteins with validated structures after homology modeling were
selected for molecular dynamics and simulation by GROMACS 5.0
(GROningen MAchine for Chemical Simulation) package using the GROMOS96
53a6 force field [76][40]. Command pdb2gmx was used in order to
generate protein topology files. Solvation of proteins was performed
and solvated proteins were fixed in cubic box with a distance of 1.0 nm
between the protein surface and edges of the box. The particle-mesh
ewald (PME) electrostatic and periodic boundary conditions were then
applied in all the directions [77][40]. As per the requirements in the
protein under consideration, the entire system was neutralized by
adding Na^+ counter ions. To avoid the steric clashes and high energy
interactions, energy minimization was performed for the whole system at
50,000 steps of steepest descent. During MD simulations, production and
equilibration phases were configured. To equilibrate the system, it was
administered to the simulations at constant number, volume and
temperature [NVT] and constant number, pressure and temperature [NPT]
at 300 K for 100 ps. Finally, every system was subjected to MD
production run at 1 bar pressure and 300 K temperature for 100 ps. The
atom coordinates were recorded at every 100 ns throughout the MD
simulation.
2.7. Phylogenetic analysis of CO proteins
Top hits for all the four structurally validated CO proteins after
Blast-P results with downloaded plant genomes were considered in order
to extract their sequences. We obtained a multiple sequence alignment
(MSA) for each protein by aligning its homologs in the template
organisms using ClustalW software with default parameters [78][41].
Phylogenetic trees from each alignment were then constructed, using the
neighbor-joining (NJ) method that uses the Jones-Thornton-Taylor (JTT)
model, 1,000 bootstraps and Poisson model with uniform rates for
substitution as well as pairwise deletion.
3. Results and discussions
While the demand of Camellia sinensis (tea) is continuously increasing,
this crop is also under continuous threat due to various abiotic and
biotic stresses [79][12]. The substantial role of circadian rhythms in
determining the results of host-pathogen interactions in plants, and
hence contributing towards increase in the fitness of plants is being
widely studied [80][42]. Due to irregular environmental conditions,
plants are facing continuous stress to survive. In response to
particular stress (abiotic or biotic), circadian rhythm induces the
expression in particular genes of interest as demonstrated in
Arabidopsis thaliana, where genes are expressed rhythmically [81][43],
[82][44], [83][45]. Thus, we attempt to identify the proteins related
to circadian rhythm and their expression through an exhaustive
computational framework. The overall workflow chart is given in
[84]Fig. 1.
Fig. 1.
[85]Fig. 1
[86]Open in a new tab
Overall methodology for genome wide identification, characterization
(functional and network level), co-expression analysis and structural
modelling of circadian rhythm proteins in Camellia sinensis.
3.1. Data mining and analysis
By the extensive search of literature, 24 proteins related to core CO
were selected and an attempt has been made to identify these proteins
in Camellia sinensis. We have selected 56 plants with available genome
information and searched all the identified genes in the proteomes of
these plants through Blast-P. For the selected 24 proteins, a total of
2,543 hits were found in the 56 plants under study at an
[MATH: E-value≤10-5 :MATH]
(see Supplementary Table S2). By imposing a criterion of at least 40%
sequence identity and 50% query coverage we selected 963 top-hits
corresponding to all the CO proteins, separately. HMM profiles for all
the 24 CO proteins were built by extracting their sequences from all
the plant species under consideration in which hits were found to meet
the above criteria. The Camellia sinensis proteome was scanned against
these HMMs, and on the basis of returned top hits, 24 proteins related
to circadian rhythm were identified in tea for the further analysis
(details are given in Supplementary Table S3). Since, gene LWD1 is
partially redundant with LWD2, CCA1 with LHY, RVE8 with RVE4 and RVE6,
different members of PRR family and LUX with NOX and they express
during different times of the day or form complexes [87][6] so we
selected the second best hit for LWD2, REV4, PRR5, REV6, NOX, PRR3 and
LHY proteins. These results are further confirmed by NCBI-Nr, GO
([88]Supplementary Fig. S1) and KEGG annotations (Supplementary Table
S4). Pfam, InterProScan databases also confirmed the presence of
respective domains in the predicted CO proteins (Supplementary Table
S4).
3.2. Expression analysis of core clock genes
Expression patterns emerging from expression profiles of all the CO
genes were analysed separately (3 abiotic and 3 biotic stress
conditions). After mapping of pair end reads, we obtained TPM values
for every gene in order to normalize the total number of mapped reads
and reads mapped to a gene [89][23]. The expression profile of each CO
gene in the form of heat map in all the 44 samples is given in [90]Fig.
2A. As can be seen from [91]Fig. 2A, expression patterns of considered
proteins are not consistent throughout the various conditions selected
in this work as the genes are having either high-expression or
low-expression with respect to time at different conditions of abiotic
and biotic stresses. So, we were interested to figure out if there are
some gene pairs which are having distinctive expression patterns in
abiotic v/s biotic conditions. For that, expression profiles of CO
genes were analysed by computing pairwise Pearson’s correlation
coefficient for abiotic and biotic conditions separately. Entire data
is given in the Supplementary Table S5. It was interesting to observe
that there exist clearly distinctive expression patterns among various
gene-pairs in abiotic v/s biotic stress conditions. In [92]Fig. 2B, we
give representative pairs of genes belonging to various classes. In the
following, we discuss distinctive expression profiles of some of those
gene pairs.
Fig. 2.
[93]Fig. 2
[94]Open in a new tab
(a) Expression analysis (TPM values) of 24 tea CO genes in RNA-Seq
datasets of three abiotic (23 samples) and three biotic (21 samples)
stress conditions. (b) Representative gene pairs belonging to various
stress type specific expressions. STSCE pairs are in green colored
cells, STSDE pairs are in yellow colored cells and STIE pair is in grey
colored cell. In each cell, PCC values of the proteins in A as well as
B types of stresses are mentioned in parenthesis. (For interpretation
of the references to colour in this figure legend, the reader is