Graphical abstract graphic file with name ga1.jpg [27]Open in a new tab Keywords: Camellia sinensis, Circadian rhythm, PPI network, Hidden Markov model, Expression analysis Abstract Tea (Camellia sinensis) is among the most valuable commercial crops being a non-alcoholic beverage having antioxidant properties. Like in other plants, circadian oscillator in tea modulates several biological processes according to earth’s revolution dependent variations in environmental cues like light and temperature. In the present study, we report genome wide identification and characterization of circadian oscillator (CO) proteins in tea. We first mined the genes (24, in total) involved in circadian rhythm pathway in the 56 plant species having available genomic information and then built their hidden Markov models (HMMs). Using these HMMs, 24 proteins were identified in tea and were further assessed for their functional annotation. Expression analysis of all these 24 CO proteins was then performed in 3 abiotic (A) and 3 biotic conditions (B) stress conditions and co-expressed as well as differentially expressed genes in the selected 6 stress conditions were elaborated. A methodology to identify the differentially expressed genes in specific types of stresses (A or B) is proposed and novel markers among CO proteins are presented. By mapping the identified CO proteins against the recently reported genome wide interologous protein–protein interaction network of tea (TeaGPIN), an interaction sub-network of tea CO proteins (TeaCO-PIN) is developed and analysed. Out of 24 CO proteins, structures of 4 proteins could be successfully predicted and validated using consensus of three structure prediction algorithms and their stability was further assessed using molecular dynamic simulations at 100 ns. Phylogenetic analysis of these proteins is performed to examine their molecular evolution. 1. Introduction Circadian oscillators (CO) play important role in the internal time estimation in an organism by synchronizing the biological events in co-ordination with day and night cycles [28][1]. Circadian oscillators are crucial for several processes that include growth, development and other physiological events under various environmental conditions [29][2]. Circadian rhythm is an internal timekeeping mechanism by which all organisms including plants anticipate the environmental changes like daily light–dark cycles, temperature fluctuations to synchronise internal biological events with these external changes. These rhythms are ubiquitous and are helpful in expressing several temporally separated biological traits through the cycles spanning 24 h’ time period [30][3]. Eukaryotic circadian oscillators rely on transcriptional and translational feedback loops [31][4], [32][5] regulated by more than 20 transcription factors that constitute core plant clock network [33][6]. Early in the morning before sunrise, CIRCADIAN CLOCK ASSOCIATED 1 (CCA1) and LONG ELONGATED HYPOCOTYL (LHY) genes downregulate the expression of PSEUDO-RESPONSE REGULATOR 5 (PRR5), PRR7, PRR9, TIMING OF CAB EXPRESSION 1 (TOC1/PRR1), GIGANTEA (GI) and genes belonging to evening complex (EC), namely, LUX ARRHYTHMO (LUX), EARLY FLOWERING 3 (ELF3), ELF4 [34][6]. PRR (morning and midday phase), CCA1 HIKING EXPEDITION (CHE) genes subsequently express and restrict the CCA1 and LHY expression by repressing them during day time. At afternoon, REVEILLE 8 (RVE8), RVE4, NIGHT LIGHT-INDUCIBLE AND CLOCK-REGULATED GENE 1 (LNK1) and LNK2 factors induce expression of PRR5, TOC1 and ELF4 genes [35][6]. Moreover, at evening TOC1 negatively regulate GI, LUX, ELF4 and PRR5 [36][2]. EC maintains negative regulation of GI, PRR7 and PRR9′s expression that, in turn, alleviate CCA1 and LHY repression during the night. The growth and development of a plant is directly affected by these rhythms that are helpful in modulating various behavioural activities [37][7]. There are several studies in which plant COs are reported to regulate important physiological processes like photosynthesis, flowering time, root growth, sugar metabolism, hormonal signalling, nutrient biogenesis and plant immunity [38][8]. Under free-running (lacking any external cues) conditions, behavioural activities of plants are well-studied and it is being understood that multiple clocks in different tissues and even within the cells operate in unison, thereby increasing the circadian system complexity [39][9]. To understand the inner working of oscillations in the model plant Arabidopsis thaliana, several approaches including mathematical modelling and phenotypic expression were employed in last decade. Tea is among the most popular non-alcoholic beverages that is obtained from newborn leaves of the plant Camellia sinensis. Due to its flavour and health promoting functions, tea is the most consumed drink after water across the world [40][10]. To meet the consumer demands, studies based on the combination of various genetic and environmental factors are being performed to increase the tea production. Using different tea processing methods, a variety of tea products are also being developed having several palatabilities. Various transcriptomic studies have been conducted to examine the stress specific expressions, and recently a high quality draft genome of tea has also been sequenced and annotated for understanding its specific traits [41][11]. In the present study, we have made in-silico efforts for the systems-level mining and characterization of genes inducing circadian rhythms in the tea plant. To explore and examine the systems level architecture of protein–protein subnetwork of circadian rhythm clock associated genes (TeaCR-PIN), genome-wide interologous interactome map (TeaGPIN) of tea was considered [42][12]. Further, all the interacting proteins of developed subnetwork (nodes and edges) were selected for co-expression and differential-expression analysis by examining their expression in 3 abiotic and 3 biotic stress conditions. Furthermore, pathway enrichment analysis was also performed for the co-expressed modules of TeaCR-PIN [43][13]. Moreover, structural prediction and analysis of these proteins was performed and stability of proteins was accessed by molecular dynamics and simulations. Additionally, phylogenetic analysis was also performed in order to find the most accurate evolutionary estimates of these proteins in closely associated plant species. 2. Materials and methods 2.1. 2.1. Data extraction Proteome information of 56 plants with available genomes was downloaded from UniProt database ([44]https://www.uniprot.org/). Based on the available information in literature about proteins related to circadian rhythms in two widely studied plant species Arabidopsis thaliana and Oryza sativa, 24 proteins were enlisted [45][2], [46][6]. Sequences of all the selected proteins were retrieved and subjected to Blast-P with all the downloaded plants having genome information. For each protein, top hits having at least 40% sequence identity, 50% query coverage and [MATH: E-value10-5 :MATH] were selected and aligned using Clustal Omega [47][14]. Hidden Markov models for each selected protein were built using HMMER ([48]http://hmmer.janelia.org/). Furthermore, tea proteins were downloaded from TPIA database [49][15] and all the protein sequences were scanned at domain level with in-house made HMM profiles to identify the core circadian oscillator proteins in Camellia sinensis. Each CO protein was then queried to InerProScan v5.48–83 ([50]https://www.ebi.ac.uk/interpro/) to identify their respective domains. 2.2. Expression analysis of core clock genes The expression analysis of identified CO proteins was carried out by leveraging the RNA Seq data of C. sinensis transcriptomes (Supplementary Table S1) obtained from 3 abiotic (SRP047312 [51][16]; SRP055910 [52][17]; ERP012919 [53][18]) and 3 biotic stress conditions (SRP060335 [54][19]; SRP063593 [55][20]; SRP067826 [56][21]). All pair-end filtered reads of 44 samples were mapped separately to CO proteins using BOWTIE2 tool [57][22]. Further, for calculating the relative expression of mapped CO genes, normalized transcripts per kilobase million (TPM) values per sample were obtained for each selected protein [58][23]. Furthermore, expression analysis of obtained CO genes was conducted by calculating Pearson’s correlation coefficient (PCC) for the abiotic and biotic conditions, separately. Pairwise co-expression for all the 24 genes was computed using Pearson correlation coefficient (PCC) [MATH: rxy=(x-x¯)(y-y¯)i=1nx-x¯2i =1ny-y¯2 :MATH] where n is the number of samples (21 for abiotic and 23 for biotic conditions), [MATH: x :MATH] and [MATH: y :MATH] represent vectors of RPKM values corresponding to genes x and y. In order to check if there exists some characteristic expression profile specific to abiotic (A) and biotic (B) stress types, we computed Pearson’s correlation coefficient (PCC) for each gene-pair to establish a stress type specific expression relationship, where all the stress conditions belonging abiotic stress are categorized as stress type A while all the conditions belonging to biotic stress are categorized as stress type B. PCC was computed for every gene pair in stress types A and B, separately. A pair of genes found to share similar expression patterns in both stress types (i.e. same pair found to have [MATH: r0.5 :MATH] in both stress types A and B or [MATH: r-0.5 :MATH] in both stress types A and B representing that both genes are either simultaneously upregulated or downregulated) was considered to be stress type specific co-expressed (STSCE) in the expression matrix. A gene pair was considered to be stress type specific differential expressed (STSDE), if * (i) the [MATH: r :MATH] value for that pair is found to be high positive ( [MATH: r0.5 :MATH] ) in one type of stress (either A or B) and high negative ( [MATH: r0.5 :MATH] ) in the other stress type. * (ii) the [MATH: r :MATH] value for that pair is found to be high ( [MATH: |r|0.5 :MATH] ) in one type of stress (either A or B) and low ( [MATH: r0.2 :MATH] ) in the other stress type. All the gene pairs were termed as neutrally expressed if they have weak correlation ( [MATH: r0.2 :MATH] in both the types A and B) and considered to have stress type independent expression (STIE). 2.3. Construction and analysis of PPI sub-network of CO proteins (TeaCO-PIN) Additionally, to find the interacting partners of CO proteins, genome wide interologous protein–protein interaction network (TeaGPIN) was selected. By mapping all the CO proteins to TeaGPIN [59][12], a protein–protein interaction subnetwork of these proteins (TeaCO-PIN) was developed that was visualized and analyzed using cytoscape [60][24]. Moreover, TeaCO-PIN was also subjected to weighted gene co-expression network analysis (WGCNA) tool [61][25] for identifying the co-expressing modules. DAVID Bioinformatics Resources 6.8 [62][26] was used to perform pathway enrichment analysis of these modules. 2.4. Functional annotation of clock associated proteins For functional categorization of the identified proteins associated with core clock in Camellia sinensis, each of protein was subjected to Blast-P with non-redundant (NR) database at NCBI [63][27], UniProt database [64][28] and TAIR database [65][29] with E-value of 10^-5. Further, gene ontology (GO) terms for CO proteins were predicted using AgriGo database [66][30] and WEGO tool [67][31]. For gaining the information of particular pathways in which these proteins are associated with, all the obtained sequences were mapped to KEGG database [68][32]. 2.5. In-silico structural modeling of CO proteins To find the three-dimensional structural patterns of CO proteins, homology modeling method was considered. All the proteins were subjected to Blast-P with protein data bank (PDB) to identify the templates for structure identification. Three structure prediction softwares namely, modeler 9.21 [69][33], CPH models [70][34] and Phyre2 tools [71][35] were used to build structures of all these proteins. All the modeled proteins were subjected to validation by structure analysis and verification server (SAVES) to find the structural errors like overall stereochemical quality, macromolecular volume, parameters of residues, non-bonded interactions and compatibility of models [72][36], [73][37], [74][38]. To find the residues in most favored regions, PROCHECK analysis was performed to analyze the Ramachandran plot [75][39]. 2.6. Molecular dynamics simulation Proteins with validated structures after homology modeling were selected for molecular dynamics and simulation by GROMACS 5.0 (GROningen MAchine for Chemical Simulation) package using the GROMOS96 53a6 force field [76][40]. Command pdb2gmx was used in order to generate protein topology files. Solvation of proteins was performed and solvated proteins were fixed in cubic box with a distance of 1.0 nm between the protein surface and edges of the box. The particle-mesh ewald (PME) electrostatic and periodic boundary conditions were then applied in all the directions [77][40]. As per the requirements in the protein under consideration, the entire system was neutralized by adding Na^+ counter ions. To avoid the steric clashes and high energy interactions, energy minimization was performed for the whole system at 50,000 steps of steepest descent. During MD simulations, production and equilibration phases were configured. To equilibrate the system, it was administered to the simulations at constant number, volume and temperature [NVT] and constant number, pressure and temperature [NPT] at 300 K for 100 ps. Finally, every system was subjected to MD production run at 1 bar pressure and 300 K temperature for 100 ps. The atom coordinates were recorded at every 100 ns throughout the MD simulation. 2.7. Phylogenetic analysis of CO proteins Top hits for all the four structurally validated CO proteins after Blast-P results with downloaded plant genomes were considered in order to extract their sequences. We obtained a multiple sequence alignment (MSA) for each protein by aligning its homologs in the template organisms using ClustalW software with default parameters [78][41]. Phylogenetic trees from each alignment were then constructed, using the neighbor-joining (NJ) method that uses the Jones-Thornton-Taylor (JTT) model, 1,000 bootstraps and Poisson model with uniform rates for substitution as well as pairwise deletion. 3. Results and discussions While the demand of Camellia sinensis (tea) is continuously increasing, this crop is also under continuous threat due to various abiotic and biotic stresses [79][12]. The substantial role of circadian rhythms in determining the results of host-pathogen interactions in plants, and hence contributing towards increase in the fitness of plants is being widely studied [80][42]. Due to irregular environmental conditions, plants are facing continuous stress to survive. In response to particular stress (abiotic or biotic), circadian rhythm induces the expression in particular genes of interest as demonstrated in Arabidopsis thaliana, where genes are expressed rhythmically [81][43], [82][44], [83][45]. Thus, we attempt to identify the proteins related to circadian rhythm and their expression through an exhaustive computational framework. The overall workflow chart is given in [84]Fig. 1. Fig. 1. [85]Fig. 1 [86]Open in a new tab Overall methodology for genome wide identification, characterization (functional and network level), co-expression analysis and structural modelling of circadian rhythm proteins in Camellia sinensis. 3.1. Data mining and analysis By the extensive search of literature, 24 proteins related to core CO were selected and an attempt has been made to identify these proteins in Camellia sinensis. We have selected 56 plants with available genome information and searched all the identified genes in the proteomes of these plants through Blast-P. For the selected 24 proteins, a total of 2,543 hits were found in the 56 plants under study at an [MATH: E-value10-5 :MATH] (see Supplementary Table S2). By imposing a criterion of at least 40% sequence identity and 50% query coverage we selected 963 top-hits corresponding to all the CO proteins, separately. HMM profiles for all the 24 CO proteins were built by extracting their sequences from all the plant species under consideration in which hits were found to meet the above criteria. The Camellia sinensis proteome was scanned against these HMMs, and on the basis of returned top hits, 24 proteins related to circadian rhythm were identified in tea for the further analysis (details are given in Supplementary Table S3). Since, gene LWD1 is partially redundant with LWD2, CCA1 with LHY, RVE8 with RVE4 and RVE6, different members of PRR family and LUX with NOX and they express during different times of the day or form complexes [87][6] so we selected the second best hit for LWD2, REV4, PRR5, REV6, NOX, PRR3 and LHY proteins. These results are further confirmed by NCBI-Nr, GO ([88]Supplementary Fig. S1) and KEGG annotations (Supplementary Table S4). Pfam, InterProScan databases also confirmed the presence of respective domains in the predicted CO proteins (Supplementary Table S4). 3.2. Expression analysis of core clock genes Expression patterns emerging from expression profiles of all the CO genes were analysed separately (3 abiotic and 3 biotic stress conditions). After mapping of pair end reads, we obtained TPM values for every gene in order to normalize the total number of mapped reads and reads mapped to a gene [89][23]. The expression profile of each CO gene in the form of heat map in all the 44 samples is given in [90]Fig. 2A. As can be seen from [91]Fig. 2A, expression patterns of considered proteins are not consistent throughout the various conditions selected in this work as the genes are having either high-expression or low-expression with respect to time at different conditions of abiotic and biotic stresses. So, we were interested to figure out if there are some gene pairs which are having distinctive expression patterns in abiotic v/s biotic conditions. For that, expression profiles of CO genes were analysed by computing pairwise Pearson’s correlation coefficient for abiotic and biotic conditions separately. Entire data is given in the Supplementary Table S5. It was interesting to observe that there exist clearly distinctive expression patterns among various gene-pairs in abiotic v/s biotic stress conditions. In [92]Fig. 2B, we give representative pairs of genes belonging to various classes. In the following, we discuss distinctive expression profiles of some of those gene pairs. Fig. 2. [93]Fig. 2 [94]Open in a new tab (a) Expression analysis (TPM values) of 24 tea CO genes in RNA-Seq datasets of three abiotic (23 samples) and three biotic (21 samples) stress conditions. (b) Representative gene pairs belonging to various stress type specific expressions. STSCE pairs are in green colored cells, STSDE pairs are in yellow colored cells and STIE pair is in grey colored cell. In each cell, PCC values of the proteins in A as well as B types of stresses are mentioned in parenthesis. (For interpretation of the references to colour in this figure legend, the reader is