Graphical abstract
   graphic file with name ga1.jpg
   [27]Open in a new tab
   Keywords: Camellia sinensis, Circadian rhythm, PPI network, Hidden
   Markov model, Expression analysis
Abstract
   Tea (Camellia sinensis) is among the most valuable commercial crops
   being a non-alcoholic beverage having antioxidant properties. Like in
   other plants, circadian oscillator in tea modulates several biological
   processes according to earth’s revolution dependent variations in
   environmental cues like light and temperature. In the present study, we
   report genome wide identification and characterization of circadian
   oscillator (CO) proteins in tea. We first mined the genes (24, in
   total) involved in circadian rhythm pathway in the 56 plant species
   having available genomic information and then built their hidden Markov
   models (HMMs). Using these HMMs, 24 proteins were identified in tea and
   were further assessed for their functional annotation. Expression
   analysis of all these 24 CO proteins was then performed in 3 abiotic
   (A) and 3 biotic conditions (B) stress conditions and co-expressed as
   well as differentially expressed genes in the selected 6 stress
   conditions were elaborated. A methodology to identify the
   differentially expressed genes in specific types of stresses (A or B)
   is proposed and novel markers among CO proteins are presented. By
   mapping the identified CO proteins against the recently reported genome
   wide interologous protein–protein interaction network of tea (TeaGPIN),
   an interaction sub-network of tea CO proteins (TeaCO-PIN) is developed
   and analysed. Out of 24 CO proteins, structures of 4 proteins could be
   successfully predicted and validated using consensus of three structure
   prediction algorithms and their stability was further assessed using
   molecular dynamic simulations at 100 ns. Phylogenetic analysis of these
   proteins is performed to examine their molecular evolution.
1. Introduction
   Circadian oscillators (CO) play important role in the internal time
   estimation in an organism by synchronizing the biological events in
   co-ordination with day and night cycles [28][1]. Circadian oscillators
   are crucial for several processes that include growth, development and
   other physiological events under various environmental conditions
   [29][2]. Circadian rhythm is an internal timekeeping mechanism by which
   all organisms including plants anticipate the environmental changes
   like daily light–dark cycles, temperature fluctuations to synchronise
   internal biological events with these external changes. These rhythms
   are ubiquitous and are helpful in expressing several temporally
   separated biological traits through the cycles spanning 24 h’ time
   period [30][3]. Eukaryotic circadian oscillators rely on
   transcriptional and translational feedback loops [31][4], [32][5]
   regulated by more than 20 transcription factors that constitute core
   plant clock network [33][6]. Early in the morning before sunrise,
   CIRCADIAN CLOCK ASSOCIATED 1 (CCA1) and LONG ELONGATED HYPOCOTYL (LHY)
   genes downregulate the expression of PSEUDO-RESPONSE REGULATOR 5
   (PRR5), PRR7, PRR9, TIMING OF CAB EXPRESSION 1 (TOC1/PRR1), GIGANTEA
   (GI) and genes belonging to evening complex (EC), namely, LUX ARRHYTHMO
   (LUX), EARLY FLOWERING 3 (ELF3), ELF4 [34][6]. PRR (morning and midday
   phase), CCA1 HIKING EXPEDITION (CHE) genes subsequently express and
   restrict the CCA1 and LHY expression by repressing them during day
   time. At afternoon, REVEILLE 8 (RVE8), RVE4, NIGHT LIGHT-INDUCIBLE AND
   CLOCK-REGULATED GENE 1 (LNK1) and LNK2 factors induce expression of
   PRR5, TOC1 and ELF4 genes [35][6]. Moreover, at evening TOC1 negatively
   regulate GI, LUX, ELF4 and PRR5 [36][2]. EC maintains negative
   regulation of GI, PRR7 and PRR9′s expression that, in turn, alleviate
   CCA1 and LHY repression during the night.
   The growth and development of a plant is directly affected by these
   rhythms that are helpful in modulating various behavioural activities
   [37][7]. There are several studies in which plant COs are reported to
   regulate important physiological processes like photosynthesis,
   flowering time, root growth, sugar metabolism, hormonal signalling,
   nutrient biogenesis and plant immunity [38][8]. Under free-running
   (lacking any external cues) conditions, behavioural activities of
   plants are well-studied and it is being understood that multiple clocks
   in different tissues and even within the cells operate in unison,
   thereby increasing the circadian system complexity [39][9]. To
   understand the inner working of oscillations in the model plant
   Arabidopsis thaliana, several approaches including mathematical
   modelling and phenotypic expression were employed in last decade.
   Tea is among the most popular non-alcoholic beverages that is obtained
   from newborn leaves of the plant Camellia sinensis. Due to its flavour
   and health promoting functions, tea is the most consumed drink after
   water across the world [40][10]. To meet the consumer demands, studies
   based on the combination of various genetic and environmental factors
   are being performed to increase the tea production. Using different tea
   processing methods, a variety of tea products are also being developed
   having several palatabilities. Various transcriptomic studies have been
   conducted to examine the stress specific expressions, and recently a
   high quality draft genome of tea has also been sequenced and annotated
   for understanding its specific traits [41][11].
   In the present study, we have made in-silico efforts for the
   systems-level mining and characterization of genes inducing circadian
   rhythms in the tea plant. To explore and examine the systems level
   architecture of protein–protein subnetwork of circadian rhythm clock
   associated genes (TeaCR-PIN), genome-wide interologous interactome map
   (TeaGPIN) of tea was considered [42][12]. Further, all the interacting
   proteins of developed subnetwork (nodes and edges) were selected for
   co-expression and differential-expression analysis by examining their
   expression in 3 abiotic and 3 biotic stress conditions. Furthermore,
   pathway enrichment analysis was also performed for the co-expressed
   modules of TeaCR-PIN [43][13]. Moreover, structural prediction and
   analysis of these proteins was performed and stability of proteins was
   accessed by molecular dynamics and simulations. Additionally,
   phylogenetic analysis was also performed in order to find the most
   accurate evolutionary estimates of these proteins in closely associated
   plant species.
2. Materials and methods
2.1. 2.1. Data extraction
   Proteome information of 56 plants with available genomes was downloaded
   from UniProt database ([44]https://www.uniprot.org/). Based on the
   available information in literature about proteins related to circadian
   rhythms in two widely studied plant species Arabidopsis thaliana and
   Oryza sativa, 24 proteins were enlisted [45][2], [46][6]. Sequences of
   all the selected proteins were retrieved and subjected to Blast-P with
   all the downloaded plants having genome information. For each protein,
   top hits having at least 40% sequence identity, 50% query coverage and
   [MATH: E-value≤10-5 :MATH]
   were selected and aligned using Clustal Omega [47][14]. Hidden Markov
   models for each selected protein were built using HMMER
   ([48]http://hmmer.janelia.org/). Furthermore, tea proteins were
   downloaded from TPIA database [49][15] and all the protein sequences
   were scanned at domain level with in-house made HMM profiles to
   identify the core circadian oscillator proteins in Camellia sinensis.
   Each CO protein was then queried to InerProScan v5.48–83
   ([50]https://www.ebi.ac.uk/interpro/) to identify their respective
   domains.
2.2. Expression analysis of core clock genes
   The expression analysis of identified CO proteins was carried out by
   leveraging the RNA Seq data of C. sinensis transcriptomes
   (Supplementary Table S1) obtained from 3 abiotic (SRP047312 [51][16];
   SRP055910 [52][17]; ERP012919 [53][18]) and 3 biotic stress conditions
   (SRP060335 [54][19]; SRP063593 [55][20]; SRP067826 [56][21]). All
   pair-end filtered reads of 44 samples were mapped separately to CO
   proteins using BOWTIE2 tool [57][22]. Further, for calculating the
   relative expression of mapped CO genes, normalized transcripts per
   kilobase million (TPM) values per sample were obtained for each
   selected protein [58][23]. Furthermore, expression analysis of obtained
   CO genes was conducted by calculating Pearson’s correlation coefficient
   (PCC) for the abiotic and biotic conditions, separately. Pairwise
   co-expression for all the 24 genes was computed using Pearson
   correlation coefficient (PCC)
   [MATH: rxy=∑(x-x¯)(y-y¯)∑i=1nx-x¯2∑i
   =1ny-y¯2 :MATH]
   where n is the number of samples (21 for abiotic and 23 for biotic
   conditions),
   [MATH: x :MATH]
   and
   [MATH: y :MATH]
   represent vectors of RPKM values corresponding to genes x and y.
   In order to check if there exists some characteristic expression
   profile specific to abiotic (A) and biotic (B) stress types, we
   computed Pearson’s correlation coefficient (PCC) for each gene-pair to
   establish a stress type specific expression relationship, where all the
   stress conditions belonging abiotic stress are categorized as stress
   type A while all the conditions belonging to biotic stress are
   categorized as stress type B. PCC was computed for every gene pair in
   stress types A and B, separately. A pair of genes found to share
   similar expression patterns in both stress types (i.e. same pair found
   to have
   [MATH: r≥0.5 :MATH]
   in both stress types A and B or
   [MATH: r≤-0.5 :MATH]
   in both stress types A and B representing that both genes are either
   simultaneously upregulated or downregulated) was considered to be
   stress type specific co-expressed (STSCE) in the expression matrix. A
   gene pair was considered to be stress type specific differential
   expressed (STSDE), if
     * (i)
       the
       [MATH: r :MATH]
       value for that pair is found to be high positive (
       [MATH: r≥0.5 :MATH]
       ) in one type of stress (either A or B) and high negative (
       [MATH: r≤0.5 :MATH]
       ) in the other stress type.
     * (ii)
       the
       [MATH: r :MATH]
       value for that pair is found to be high (
       [MATH: |r|≥0.5 :MATH]
       ) in one type of stress (either A or B) and low (
       [MATH: r≤0.2
       mrow> :MATH]
       ) in the other stress type.
   All the gene pairs were termed as neutrally expressed if they have weak
   correlation (
   [MATH: r≤0.2 :MATH]
   in both the types A and B) and considered to have stress type
   independent expression (STIE).
2.3. Construction and analysis of PPI sub-network of CO proteins (TeaCO-PIN)
   Additionally, to find the interacting partners of CO proteins, genome
   wide interologous protein–protein interaction network (TeaGPIN) was
   selected. By mapping all the CO proteins to TeaGPIN [59][12], a
   protein–protein interaction subnetwork of these proteins (TeaCO-PIN)
   was developed that was visualized and analyzed using cytoscape
   [60][24]. Moreover, TeaCO-PIN was also subjected to weighted gene
   co-expression network analysis (WGCNA) tool [61][25] for identifying
   the co-expressing modules. DAVID Bioinformatics Resources 6.8 [62][26]
   was used to perform pathway enrichment analysis of these modules.
2.4. Functional annotation of clock associated proteins
   For functional categorization of the identified proteins associated
   with core clock in Camellia sinensis, each of protein was subjected to
   Blast-P with non-redundant (NR) database at NCBI [63][27], UniProt
   database [64][28] and TAIR database [65][29] with E-value of 10^-5.
   Further, gene ontology (GO) terms for CO proteins were predicted using
   AgriGo database [66][30] and WEGO tool [67][31]. For gaining the
   information of particular pathways in which these proteins are
   associated with, all the obtained sequences were mapped to KEGG
   database [68][32].
2.5. In-silico structural modeling of CO proteins
   To find the three-dimensional structural patterns of CO proteins,
   homology modeling method was considered. All the proteins were
   subjected to Blast-P with protein data bank (PDB) to identify the
   templates for structure identification. Three structure prediction
   softwares namely, modeler 9.21 [69][33], CPH models [70][34] and Phyre2
   tools [71][35] were used to build structures of all these proteins. All
   the modeled proteins were subjected to validation by structure analysis
   and verification server (SAVES) to find the structural errors like
   overall stereochemical quality, macromolecular volume, parameters of
   residues, non-bonded interactions and compatibility of models [72][36],
   [73][37], [74][38]. To find the residues in most favored regions,
   PROCHECK analysis was performed to analyze the Ramachandran plot
   [75][39].
2.6. Molecular dynamics simulation
   Proteins with validated structures after homology modeling were
   selected for molecular dynamics and simulation by GROMACS 5.0
   (GROningen MAchine for Chemical Simulation) package using the GROMOS96
   53a6 force field [76][40]. Command pdb2gmx was used in order to
   generate protein topology files. Solvation of proteins was performed
   and solvated proteins were fixed in cubic box with a distance of 1.0 nm
   between the protein surface and edges of the box. The particle-mesh
   ewald (PME) electrostatic and periodic boundary conditions were then
   applied in all the directions [77][40]. As per the requirements in the
   protein under consideration, the entire system was neutralized by
   adding Na^+ counter ions. To avoid the steric clashes and high energy
   interactions, energy minimization was performed for the whole system at
   50,000 steps of steepest descent. During MD simulations, production and
   equilibration phases were configured. To equilibrate the system, it was
   administered to the simulations at constant number, volume and
   temperature [NVT] and constant number, pressure and temperature [NPT]
   at 300 K for 100 ps. Finally, every system was subjected to MD
   production run at 1 bar pressure and 300 K temperature for 100 ps. The
   atom coordinates were recorded at every 100 ns throughout the MD
   simulation.
2.7. Phylogenetic analysis of CO proteins
   Top hits for all the four structurally validated CO proteins after
   Blast-P results with downloaded plant genomes were considered in order
   to extract their sequences. We obtained a multiple sequence alignment
   (MSA) for each protein by aligning its homologs in the template
   organisms using ClustalW software with default parameters [78][41].
   Phylogenetic trees from each alignment were then constructed, using the
   neighbor-joining (NJ) method that uses the Jones-Thornton-Taylor (JTT)
   model, 1,000 bootstraps and Poisson model with uniform rates for
   substitution as well as pairwise deletion.
3. Results and discussions
   While the demand of Camellia sinensis (tea) is continuously increasing,
   this crop is also under continuous threat due to various abiotic and
   biotic stresses [79][12]. The substantial role of circadian rhythms in
   determining the results of host-pathogen interactions in plants, and
   hence contributing towards increase in the fitness of plants is being
   widely studied [80][42]. Due to irregular environmental conditions,
   plants are facing continuous stress to survive. In response to
   particular stress (abiotic or biotic), circadian rhythm induces the
   expression in particular genes of interest as demonstrated in
   Arabidopsis thaliana, where genes are expressed rhythmically [81][43],
   [82][44], [83][45]. Thus, we attempt to identify the proteins related
   to circadian rhythm and their expression through an exhaustive
   computational framework. The overall workflow chart is given in
   [84]Fig. 1.
Fig. 1.
   [85]Fig. 1
   [86]Open in a new tab
   Overall methodology for genome wide identification, characterization
   (functional and network level), co-expression analysis and structural
   modelling of circadian rhythm proteins in Camellia sinensis.
3.1. Data mining and analysis
   By the extensive search of literature, 24 proteins related to core CO
   were selected and an attempt has been made to identify these proteins
   in Camellia sinensis. We have selected 56 plants with available genome
   information and searched all the identified genes in the proteomes of
   these plants through Blast-P. For the selected 24 proteins, a total of
   2,543 hits were found in the 56 plants under study at an
   [MATH: E-value≤10-5 :MATH]
   (see Supplementary Table S2). By imposing a criterion of at least 40%
   sequence identity and 50% query coverage we selected 963 top-hits
   corresponding to all the CO proteins, separately. HMM profiles for all
   the 24 CO proteins were built by extracting their sequences from all
   the plant species under consideration in which hits were found to meet
   the above criteria. The Camellia sinensis proteome was scanned against
   these HMMs, and on the basis of returned top hits, 24 proteins related
   to circadian rhythm were identified in tea for the further analysis
   (details are given in Supplementary Table S3). Since, gene LWD1 is
   partially redundant with LWD2, CCA1 with LHY, RVE8 with RVE4 and RVE6,
   different members of PRR family and LUX with NOX and they express
   during different times of the day or form complexes [87][6] so we
   selected the second best hit for LWD2, REV4, PRR5, REV6, NOX, PRR3 and
   LHY proteins. These results are further confirmed by NCBI-Nr, GO
   ([88]Supplementary Fig. S1) and KEGG annotations (Supplementary Table
   S4). Pfam, InterProScan databases also confirmed the presence of
   respective domains in the predicted CO proteins (Supplementary Table
   S4).
3.2. Expression analysis of core clock genes
   Expression patterns emerging from expression profiles of all the CO
   genes were analysed separately (3 abiotic and 3 biotic stress
   conditions). After mapping of pair end reads, we obtained TPM values
   for every gene in order to normalize the total number of mapped reads
   and reads mapped to a gene [89][23]. The expression profile of each CO
   gene in the form of heat map in all the 44 samples is given in [90]Fig.
   2A. As can be seen from [91]Fig. 2A, expression patterns of considered
   proteins are not consistent throughout the various conditions selected
   in this work as the genes are having either high-expression or
   low-expression with respect to time at different conditions of abiotic
   and biotic stresses. So, we were interested to figure out if there are
   some gene pairs which are having distinctive expression patterns in
   abiotic v/s biotic conditions. For that, expression profiles of CO
   genes were analysed by computing pairwise Pearson’s correlation
   coefficient for abiotic and biotic conditions separately. Entire data
   is given in the Supplementary Table S5. It was interesting to observe
   that there exist clearly distinctive expression patterns among various
   gene-pairs in abiotic v/s biotic stress conditions. In [92]Fig. 2B, we
   give representative pairs of genes belonging to various classes. In the
   following, we discuss distinctive expression profiles of some of those
   gene pairs.
Fig. 2.
   [93]Fig. 2
   [94]Open in a new tab
   (a) Expression analysis (TPM values) of 24 tea CO genes in RNA-Seq
   datasets of three abiotic (23 samples) and three biotic (21 samples)
   stress conditions. (b) Representative gene pairs belonging to various
   stress type specific expressions. STSCE pairs are in green colored
   cells, STSDE pairs are in yellow colored cells and STIE pair is in grey
   colored cell. In each cell, PCC values of the proteins in A as well as
   B types of stresses are mentioned in parenthesis. (For interpretation
   of the references to colour in this figure legend, the reader is