Abstract

Background

   Advances in forward and reverse genetic techniques have enabled the
   discovery and identification of several plant defence genes based on
   quantifiable disease phenotypes in mutant populations. Existing models
   for testing the effect of gene inactivation or genes causing these
   phenotypes do not take into account eventual uncertainty of these
   datasets and potential noise inherent in the biological experiment
   used, which may mask downstream analysis and limit the use of these
   datasets. Moreover, elucidating biological mechanisms driving the
   induced disease resistance and influencing these observable disease
   phenotypes has never been systematically tackled, eliciting the need
   for an efficient model to characterize completely the gene target under
   consideration.

Results

   We developed a post-gene silencing bioinformatics (post-GSB) protocol
   which accounts for potential biases related to the disease phenotype
   datasets in assessing the contribution of the gene target to the plant
   defence response. The post-GSB protocol uses Gene Ontology semantic
   similarity and pathway dataset to generate enriched process regulatory
   network based on the functional degeneracy of the plant proteome to
   help understand the induced plant defence response. We applied this
   protocol to investigate the effect of the NPR1 gene silencing to
   changes in Arabidopsis thaliana plants following Pseudomonas syringae
   pathovar tomato strain DC3000 infection. Results indicated that the
   presence of a functionally active NPR1 reduced the plant’s
   susceptibility to the infection, with about 99% of variability in
   Pseudomonas spore growth between npr1 mutant and wild-type samples.
   Moreover, the post-GSB protocol has revealed the coordinate action of
   target-associated genes and pathways through an enriched process
   regulatory network, summarizing the potential target-based induced
   disease resistance mechanism.

Conclusions

   This protocol can improve the characterization of the gene target and,
   potentially, elucidate induced defence response by more effectively
   utilizing available phenotype information and plant proteome functional
   knowledge.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12870-017-1151-y)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Gene silencing, Plant defence gene discovery, Semantic
   similarity, Arabidopsis thaliana, Pseudomonas syringae

Background

   Plants are sessile organisms often subjected to several attacks and
   invasions from herbivorous and pathogenic organisms. Pathogenic
   organisms have been repeatedly reported to cause outbreaks on bean,
   cucumber, stone fruit, kiwi and olive tree, as well as on other crop
   and non-crop plants [[31]1]. Plants respond to these attacks by
   switching on an array of defence pathways whose end products serve to
   limit the progression of invading pathogens. The innate response is the
   first stratum and earliest form of response reported. It involves
   interaction between pathogen associated molecular patterns (PAMP/MAMP)
   from the invading pathogen, and the plant’s membrane-localized pattern
   recognition receptors [[32]2–[33]4]. Downstream of this is the
   hypersensitive response (HR) which is highly specific and requires the
   recognition of specific avirulent genes from the pathogen by specific
   resistance (R) genes in the plant [[34]5, [35]6]. Further downstream is
   a broad spectrum-type of response called systemic acquired resistance
   (SAR), a long-lasting defence response which primes the plant against
   future pathogenic attack [[36]2, [37]7]. In 1999, Pieterse and Van Loon
   [[38]8] highlighted the existence of yet another form of systemic
   response called induced systemic response (ISR), which can occur
   independently of the HR and SAR.

   These strata of defence response pathways are generally governed by
   perturbations in the cells redox state, fluctuations in intracellular
   ionic concentration, activation/repression of kinases, synthesis of
   diverse signalling molecules, including salicylic acid (SA), jasmonic
   acid (JA), ethylene (Et), activation/repression of transcriptional
   co-regulators, such as the non-expressor of pathogenesis-related 1
   (NPR1), activation/repression of transcription factors, such as TGA and
   WRKY, and the ultimate expression/suppression of defence end products,
   such as pathogenesis-related (PR) proteins, which influences the growth
   and progression of the invading pathogens [[39]9–[40]14]. Understanding
   the biology of the plant protection and defence response may help in
   controlling the performance and survival of plants, which are very
   often faced with several types of stresses. The induced resistance
   response requires a coordinate action of many defence-related genes
   interacting within processes and signalling pathways to control
   down-stream responses during stress conditions. Computational modelling
   is therefore playing an additional role [[41]15] in improving our
   understanding of plant defence and disease resistance mechanisms in a
   manner that is less costly, reduces redundancies and increases
   robustness. Such models offer attractive alternatives for the
   identification of gaps and opportunities in already characterized
   biological systems, and could eventually inform the design of more
   targeted molecular biology experiments [[42]16].

   Pseudomonas syringae (P. syringae) is a rod shaped Gram-negative
   bacterium, a major bacterial leaf pathogen that causes diseases in a
   wide range of plant species with visible symptoms [[43]17, [44]18]. A
   specific variety of P. syringae pathovar tomato strain DC3000
   (Pst-DC3000), is known to infect DC3000 tomato and Arabidopsis thaliana
   plants [[45]19], using the plant leaf surface to negotiate its entry
   into the plant. Both Arabidopsis [[46]20] and Pst-DC3000 [[47]19] have
   been sequenced and are genetically tractable, economically and
   environmentally convenient, and constitute a model system in research
   for studying molecular pathogenesis and dynamics of plant-pathogens
   interactions [[48]21]. Interestingly, Arabidopsis thaliana shares high
   similarity with high agronomic value crops, such as rice with
   similarity of 0.71 [[49]22], suggesting that the findings related to
   Arabidopsis thaliana can be assessed to determine whether they could be
   applied to these different plants.

   In this study, we used NPR1 (or SAI1 for salicylic acid-insensitivity
   or NIM1 for non-inducible immunity) as a reference due to the available
   experimental data and its central role in SAR signalling pathway, a
   broad-spectrum resistance protecting against several pathogenic attacks
   [[50]23], which activates antimicrobial genes products in both the
   model plant Arabidopsis thaliana and a diverse number of food crops
   [[51]4, [52]24–[53]32], including canola, cabbage, broccoli, tobacco,
   tomato, potato, corn. NPR1 is also involved in the regulation of
   signalling pathways like abscisic acid (ABA) [[54]33], which, as JA-
   and Et-dependent activation of defense responses resulting from ISR
   [[55]17], offers resistance to necrotrophic pathogens. We propose a
   post-gene silencing bioinformatics (post-GSB) protocol, a two-step
   approach (see Fig. [56]1), which can be used to efficiently analyze
   plant differential responses based on leaf spore count or expression
   level experimental data sets. This post-GSB protocol introduces the
   closeness score concept between plants to extract differentially
   infected plants in a population of Arabidopsis thaliana wild type and
   npr1 mutant plants infected with Pst-DC3000 used in further analyses.
   It uses a Gene Ontology (GO) based semantic similarity model to
   identify proteins collaborating with NPR1 as well as enriched
   biological processes involved in the defence response and elucidate
   these process occurrence sequences. This new model manages eventual
   uncertainty of data, potential noise inherent in the experiment and
   redundancy from the semantic-based GO structure to effectively assess
   contributions of and validate plant-defence genes, and predict
   plant-defence mechanisms.

Fig. 1.

   Fig. 1
   [57]Open in a new tab

   The post-gene silencing bioinformatics (post-GSB) scheme. This
   describes the two steps of the model as described in the scheme. The
   first step consists of producing experimental data sets and using the
   closeness scores between plants to extract the set of differentially
   infected plants to be considered for further analyses. The second step
   initially focuses on statistical analyses checking whether there is any
   difference between the two data sets: wild type and npr1 gene mutant
   plants. If this difference is significant, then the Gene Ontology
   process annotation based functional analysis is performed to identify
   proteins, enriched processes and pathways contributing to the
   NPR1-based plant defence mechanisms

Methods

   The study was divided into two parts, the first was to check whether
   NPR1 plays an important role in plant protection and secondly to
   perform functional analyses to predict the biological mechanisms of the
   NPR1-based defence response.

Experimental data from Arabidopsis wild type and npr1 mutant plants

   Data analyzed in this study was derived from a study conducted at the
   John Innes Centre (England) in 2010 [[58]32] and consisted of data from
   wild type (Wt) and npr1 mutant Arabidopsis thaliana Columbia (Col – 0)
   plants 48 hours post infection with Pst-DC3000 [[59]34]. These plants
   were grown and inoculated under controlled greenhouse conditions (23^°
   C, 10 h day/14 h dark regime and a relative humidity of 65±5%) in the
   John Innes Centre in England [[60]32]. Pst-DC3000 bacteria strain was
   cultured as described in [[61]34] using King B’s media [ 20 gL ^−1
   proteose peptone (w/v), 1.5 gL ^−1 di-potassium hydrogen phosphate
   (w/v), 1.5 gL ^−1 magnesium sulphate (w/v), 1.5% glycerol (v/v) and
   1.2% bacterio agar (w/v)] supplemented with 50 mgL ^−1 kanamycin. The
   bacteria cells were re-suspended in a 10 mM MgCl [2] solution and the
   concentration adjusted to 5×105 colony forming unit cfu.mL ^−1. For
   each plant to be inoculated, three leaves were hand-infiltrated using a
   1 mL needless syringe containing either a 10 mM MgCl [2] solution or an
   inoculum of Pst-DC3000 (5×105 cfu.mL ^−1). Bacteria growth was measured
   as spore counts 48 h post treatment using 8 mm leaf discs excised from
   the treated plants. Measurements were conducted using a FB12
   luminometer with a single photon counter [[62]32, [63]34]. For adequate
   representation and statistics purposes, the experiment was conducted
   three times using a total of 60 plants (three infected leaves per
   plant) per genotype (see raw data in Additional file [64]1).

Sample extraction and analysis

   Data from a total of 31 Arabidopsis Wt plants (controls) and 29 npr1
   mutant plants (cases) inoculated with Pst-DC3000 were used for further
   analysis. Figure [65]2 is a graphical representation of the spore count
   data to visually assess the distribution of these spore counts.
   Noticing that there was a distribution pattern, samples were further
   analysed in three main steps: (1) computing the closeness score of each
   individual in a sample and ranking these individuals based on these
   scores; (2) extracting the list of differentially infected plants using
   Pearson χ ^2 scores based on aggregated closeness scores; and (3)
   statistically measuring the significant difference among the two
   phenotypes.

Fig. 2.

   Fig. 2
   [66]Open in a new tab

   Plotted heat-maps of spore counts for the case and control raw
   datasets. These plots (cases and control datasets on left and right
   sides, respectively) show variations of leaf spore counts from
   relatively lower (dark orange) to higher (dark green) counts as an
   indication of low and high levels of infection. An asterix on a given
   plant identifier indicates that the plant is among differentially
   infected plants within a dataset

Computing individual closeness scores

   Generally, a set of plants within each group consists of a mix of
   phenotype tendency associated with different levels (moderate or high)
   of infection observed through plant spore counts. Since at a higher
   level of infection, it may not be evident to distinguish between the
   two groups under consideration, there is a need for a scoring scheme
   which can appropriately classify plants with moderate level of
   infection, referred to as differentially infected plants, based on
   plant spore counts. Here we set up a closeness score approach to
   measure the length of the line-segment joining the plant phenotypes
   using spore count vectors, enabling the selection of differentially
   infected plants. The individual closeness score of an individual
   (plant) ı, denoted ICS(ı), in a sample is computed as follows:
   [MATH:
   <mtext>ICS</mtext><mo>(</mo><mi>ı</mi><mo>)</mo><mo>=</mo><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>ȷ</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>n</mi></mrow></munderover><msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>ıȷ</mi></mrow></msub>
   :MATH]
   1

   where n is the size of the sample (number of individuals in the sample)
   and
   [MATH: <msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>ıȷ</mi></mrow></msub>
   :MATH]
   is the similarity score between individuals ı and ȷ calculated using
   the Manhattan metric defined on a vector of spore counts for three
   different readings and converted to similarity measure following the
   Resnik-edge-based approach [[67]35, [68]36]. This similarity score
   [MATH: <msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>ıȷ</mi></mrow></msub>
   :MATH]
   is given by:
   [MATH: <msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>ıȷ</mi></mrow></msub><mo>=<
   /mo><mn>2</mn><mo>×</mo><mi>δ</mi><mfenced close=")" open="("
   separators=""><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>ı</mi></mrow
   ></msub></mrow></mfenced><mo>−</mo><mi>d</mi><mfenced close=")"
   open="("
   separators=""><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>ı</mi></mrow
   ></msub><mo>,</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>ȷ</mi></mrow><
   /msub></mrow></mfenced> :MATH]
   2

   with x [k] a spore count vector of individual k, δ(x [ı])= max{d(x
   [ı],x [ȷ]):ȷ=1,…,n} and d(x [ı],x [ȷ]) the distance between spore count
   vectors of the two individuals based on the Manhattan metric defined on
   a vector of spore counts, computed as follows:
   [MATH: <mi>d</mi><mfenced close=")" open="("
   separators=""><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>ı</mi></mrow
   ></msub><mo>,</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>ȷ</mi></mrow><
   /msub></mrow></mfenced><mo>=</mo><mo>min</mo><mfenced close="}"
   open="{"
   separators=""><mrow><munderover><mrow><mo>∑</mo></mrow><mrow><mi>k</mi>
   <mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></munderover><mfenced
   close="|" open="|"
   separators=""><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>ık</mi></mro
   w></msub><mo>−</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>ȷp</mi><mo>(<
   /mo><mi>k</mi><mo>)</mo></mrow></msub></mrow></mfenced><mo>:</mo><mi>p<
   /mi><mo>∈</mo><mi mathvariant="script">P</mi></mrow></mfenced> :MATH]
   3

   where
   [MATH: <mi mathvariant="script">P</mi> :MATH]
   is the set of all permutations of the ordered set S={1,2,…,n} of n
   readings in the sample. The individual or plant with the highest ICS is
   referred to as reference individual or plant, whose similarity scores
   to other plants are used to extract the set or list of plants of
   interest for further statistical analyses.

Extracting sets of differentially infected plants in different samples

   In order to minimize the impact of potential biases which may emanate
   from highly infected plants, we design a model for selecting sets of
   differentially infected plants in which spore counts based phenotypes
   are significantly associated with moderate level of infection in
   different samples based on closeness or similarity score described in
   the subsection above. Using similarity scores between reference plant,
   denoted d, and all plants in a sample, the list of these plants were
   ranked from the highest to the lowest scores. The ranked plant list is
   then used to compute a Pearson χ ^2 score for each plant subset,
   reflecting the tendency of plants in a particular set to occur towards
   the extremes of the list. The Pearson χ ^2 score of a subset containing
   plants re-indexed from ℓ to m, denoted
   [MATH:
   <msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mfenced close=")" open="("
   separators=""><mrow><mi>ℓ</mi><mo>,</mo><mi>m</mi></mrow></mfenced>
   :MATH]
   , is given by:
   [MATH:
   <msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mfenced close=")" open="("
   separators=""><mrow><mi>ℓ</mi><mo>,</mo><mi>m</mi></mrow></mfenced><mo>
   =</mo><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>ȷ</mi><mo>=</mo><m
   i>ℓ</mi></mrow><mrow><mi>m</mi></mrow></munderover><mfrac><mrow><msup><
   mrow><mfenced close=")" open="(" separators=""><mrow><msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>dȷ</mi></mrow></msub><mo>−<
   /mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>ℓm</mi></mrow></msub></mrow>
   </mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><msub><mrow
   ><mi>μ</mi></mrow><mrow><mi>ℓm</mi></mrow></msub></mrow></mfrac> :MATH]
   4

   which is known to approximate a χ ^2-distribution with (m−ℓ) degree of
   freedom (dof) and where their expected value μ [ℓm] represents
   aggregated score (ES) of the extracted subset and estimated as follows:
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>ℓm</mi></mrow></msub><mo>=</mo><
   mtext>ES</mtext><mfenced close=")" open="("
   separators=""><mrow><mi>m</mi><mo>−</mo><mi>ℓ</mi><mo>+</mo><mn>1</mn><
   /mrow></mfenced><mo>=</mo><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>ȷ</mi><mo>=</mo><m
   i>ℓ</mi></mrow><mrow><mi>m</mi></mrow></munderover><msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>dȷ</mi></mrow></msub>
   :MATH]
   5

   We identified s top individuals as differentially infected with s the
   index fold-change
   [MATH:
   <msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   . Assuming that the ranked list contains n individuals, s (s≤n) is the
   smallest index satisfying the following inequality:
   [MATH:
   <mfrac><mrow><msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</mi></mrow><mr
   ow><mn>2</mn></mrow></msubsup><mfenced close=")" open="("
   separators=""><mrow><mn>1</mn><mo>,</mo><mi>s</mi></mrow></mfenced><mo>
   −</mo><mi>r</mi><mo>∗</mo><msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</
   mi></mrow><mrow><mn>2</mn></mrow></msubsup><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>+</mo><mn>1</mn><mo>,</mo><mi>n</mi><
   /mrow></mfenced></mrow><mrow><msqrt><mrow><mn>1</mn><mo>+</mo><msup><mr
   ow><mi>r</mi></mrow><mrow><mn>2</mn></mrow></msup></mrow></msqrt></mrow
   ></mfrac><mo>></mo><mn>0</mn> :MATH]
   6

   where
   [MATH:
   <mi>r</mi><mo>=</mo><msqrt><mrow><mfrac><mrow><mi>s</mi><mo>−</mo><mn>1
   </mn></mrow><mrow><mi>n</mi><mo>−</mo><mi>s</mi><mo>−</mo><mn>1</mn></m
   row></mfrac></mrow></msqrt> :MATH]
   7

   which is the ratio of standard deviations of
   [MATH:
   <msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mfenced close=")" open="("
   separators=""><mrow><mn>1</mn><mo>,</mo><mi>s</mi></mrow></mfenced>
   :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>χ</mi></mrow><mrow><mi>P</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>+</mo><mn>1</mn><mo>,</mo><mi>n</mi><
   /mrow></mfenced> :MATH]
   , and set to 1 if s=1.

   The significance of the aggregated score, ES(s), of the subset of
   differentially infected plants was assessed using sample randomization.
   So, we randomly selected 1000 independent subsets (same size with the
   set of differentially infected) of plants and compute ES of each subset
   and then perform the Shapiro-Wilk test under the null hypothesis that
   the generated sample is drawn from a normal distribution. Following the
   rejection or the acceptance of the null hypothesis, we perform the
   Wilcoxon- or the T-test to check whether the identified set of
   differentially infected individuals is more than expected by chance.

Statistical analyis of npr1 mutant and wild-type plant based disease symptoms

   The contribution of NPR1 protein to the plant protection during
   infection as the defence co-regulator is statistically assessed on sets
   of differentially infected plants extracted from the two population
   samples (mutant/case and wild/control). We test for the differences
   between the mutant and the wild type plants using parametric or non
   parametric statistics given that agreements with respects to the
   violations of the assumptions are met. Thus, the normality and variance
   homogeneity assumptions of the two samples are tested using the
   Shapiro-Wilk and Bartlett tests, respectively. In order to satisfy
   these assumptions for datasets which are characterized by extreme
   non-normality or heterogeneity of variance, we apply a data
   transformation [[69]37, [70]38] based on the Fisher-Pearson coefficient
   of skewness [[71]39] to bring the data into greater agreement with the
   normality and variance homogeneity assumptions. Depending on whether
   normality and variance homogeneity assumptions are met or not, we
   performed the analysis of variance (ANOVA) method or the non-parametric
   Kruskal-Wallis rank sum test.

Bioinformatics analysis of potential NPR1-based plant defence mechanisms

   We used the Gene Ontology (GO) [[72]40, [73]41] and the protein GO
   Annotation (GOA) mapping provided by the UniProtKB-GOA project
   [[74]42–[75]44] to reveal the potential NPR1-based regulatory network
   using enriched biological processes and pathways in which a set of
   protein targets are involved. The complete set of GO data and
   protein-GO term associations were extracted from the GO and GOA
   databases, respectively, accessed on the 11th of October, 2016. For the
   pathway enrichment analysis, we use the Kyoto Encyclopeadia of Genes
   and Genomes (KEGG) datasets. The whole Arabidopsis thaliana pathway
   dataset was extracted from the KEGG database at
   [76]http://www.genome.jp/kegg/.

Identification of other putative proteins participating to the plant defence
response

   Proteins perform an astonishing range of biological functions in an
   organism by collaborating in pathways and processes, and interacting
   with the cellular environment in some way to promote the cell’s growth
   and function [[77]45, [78]46]. This argues that the induced resistance
   response requires concerted biological action of many genes involved in
   diverse processes or defence signalling pathways. There are various
   computational approaches for identifying proteins co-working or
   involved in a similar process, which, in general, rely on a priori
   knowledge of protein functional features. This functional knowledge can
   be either a protein-protein interaction (PPI) network, in which case,
   these proteins interact and influence each other in the same
   sub-network or protein functional annotations, in which case, these
   proteins are involved in semantically similar processes or in the same
   pathway. In this study, since we are looking at the NPR1 biological
   role and functional annotations, we use protein biological process
   dataset to identify putative proteins which are functionally similar to
   Arabidopsis NPR1 by quantifying functional similarity between proteins
   based on sets of strict non-redundant processes annotating proteins
   under consideration [[79]47].

   Given two proteins p and q, the functional similarity between p and q,
   BMA(p,q), is computed using the Best Match Average model
   [[80]48–[81]51] as follows:
   [MATH: <mtext>BMA</mtext><mfenced close=")" open="("
   separators=""><mrow><mi>p</mi><mo>,</mo><mi>q</mi></mrow></mfenced><mo>
   =</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mn>2</mn></mrow></mfrac><mfe
   nced close=")" open="("
   separators=""><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mi>n</mi></mro
   w></mfrac><munder><mrow><mo>∑</mo></mrow><mrow><mi>s</mi><mo>∈</mo><msu
   b><mrow><mi>T</mi></mrow><mrow><mi>p</mi></mrow></msub></mrow></munder>
   <mi mathvariant="script">S</mi><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>,</mo><msub><mrow><mi>T</mi></mrow><m
   row><mi>q</mi></mrow></msub></mrow></mfenced><mo>+</mo><mfrac><mrow><mn
   >1</mn></mrow><mrow><mi>m</mi></mrow></mfrac><munder><mrow><mo>∑</mo></
   mrow><mrow><mi>s</mi><mo>∈</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>q
   </mi></mrow></msub></mrow></munder><mi
   mathvariant="script">S</mi><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>,</mo><msub><mrow><mi>T</mi></mrow><m
   row><mi>p</mi></mrow></msub></mrow></mfenced></mrow></mfenced> :MATH]
   8

   where T [r] is a set of process terms annotating a given protein r and
   n=|T [p]| and m=|T [q]| are the number of processes terms in these
   sets.
   [MATH: <mi mathvariant="script">S</mi><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>,</mo><msub><mrow><mi>T</mi></mrow><m
   row><mi>r</mi></mrow></msub></mrow></mfenced><mo>=</mo><mo>max</mo><mfe
   nced close="}" open="{" separators=""><mrow><mi
   mathvariant="script">S</mi><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>,</mo><mi>t</mi></mrow></mfenced><mo>
   :</mo><mi>t</mi><mo>∈</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>r</mi>
   </mrow></msub></mrow></mfenced> :MATH]
   with
   [MATH: <mi mathvariant="script">S</mi><mfenced close=")" open="("
   separators=""><mrow><mi>s</mi><mo>,</mo><mi>t</mi></mrow></mfenced><mo>
   ,</mo> :MATH]
   the semantic similarity score between process terms s and t computed
   using the GO-universal metric [[82]48]. It is worth noting that
   semantic similarity measure has proved its effectiveness to solving
   biological issues related to the Gene Ontology annotation-based
   knowledge discovery, including predicting and validating
   protein-protein interaction (PPI) [[83]52]. With curated PPIs, which
   are still relatively scarce, high-throughput biology experiments and
   computational methods producing high rate of false positive PPIs
   [[84]53], and the lack of appropriate techniques to address these
   shortcomings [[85]45], semantic similarity-based approach is more
   effective, providing a classification indicator of PPIs in the absence
   of reliable information.

Identifying enriched Arabidopsis defence response processes and pathways

   For process or pathway enrichment analysis, we used hyper-geometric
   test adjusted using the Bonferroni multiple testing correction. For
   each process or pathway, the p-value was calculated using its
   frequencies of occurrence in the reference dataset (Arabidopsis
   thaliana proteome) and target set, which composed of proteins
   identified to be functionally similar to NPR1. Thus, this p-value,
   which is the probability of observing at least s proteins from a target
   gene set of size n by chance, knowing that the reference dataset
   contains m annotated genes out of N genes, is given by the following
   formula [[86]54, [87]55]:
   [MATH: <mi>P</mi><mfenced close="]" open="["
   separators=""><mrow><mi>X</mi><mo>≥</mo><mi>s</mi></mrow></mfenced><mo>
   =</mo><mn>1</mn><mo>−</mo><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><m
   n>0</mn></mrow><mrow><mi>s</mi><mo>−</mo><mn>1</mn></mrow></munderover>
   <mfrac><mrow><mfenced close=")" open="(" separators=""><mfrac
   linethickness="0.0pt"><mrow><mi>m</mi></mrow><mrow><mi>k</mi></mrow></m
   frac></mfenced><mfenced close=")" open="(" separators=""><mfrac
   linethickness="0.0pt"><mrow><mi>N</mi><mo>−</mo><mi>m</mi></mrow><mrow>
   <mi>n</mi><mo>−</mo><mi>k</mi></mrow></mfrac></mfenced></mrow><mrow><mf
   enced close=")" open="(" separators=""><mfrac
   linethickness="0.0pt"><mrow><mi>N</mi></mrow><mrow><mi>n</mi></mrow></m
   frac></mfenced></mrow></mfrac> :MATH]
   9

   where the random variable X is the number of genes involved in the GO
   process or participating to the pathway under consideration within a
   given target gene set contributing to plant defence response. To
   account for relationships between GO terms in the GO structure, we used
   the concept of the GO term semantic similarity score [[88]55] and the
   frequency of occurrence f(t) of the target-associated process t in a
   set of proteins P is given by:
   [MATH:
   <mi>f</mi><mo>(</mo><mi>t</mi><mo>)</mo><mo>=</mo><munder><mrow><mo>∑</
   mo></mrow><mrow><mi>q</mi><mo>∈</mo><mi>P</mi></mrow></munder><msub><mr
   ow><mi>δ</mi></mrow><mrow><mi>q</mi></mrow></msub><mo>(</mo><mi>t</mi><
   mo>)</mo> :MATH]
   10

   where δ [q] is the q-function indicator given by
   [MATH:
   <msub><mrow><mi>δ</mi></mrow><mrow><mi>q</mi></mrow></msub><mo>(</mo><m
   i>t</mi><mo>)</mo><mo>=</mo><mfenced close="" open="{"
   separators=""><mrow><mtable><mtr><mtd><mn>1</mn></mtd><mtd><mtext>if</m
   text><mspace width="1em"></mspace><msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>q</mi></mrow></msub><mo>(</
   mo><mi>t</mi><mo>)</mo><mo>></mo><mi>ε</mi></mtd></mtr><mtr><mtd><mn>0<
   /mn></mtd><mtd><mtext>otherwise</mtext></mtd></mtr></mtable></mrow></mf
   enced> :MATH]
   11

   where ε≥0 is the agreement level or customized agreement at which a GO
   process is considered to be a possible annotation of the protein q and
   [MATH: <msub><mrow><mi
   mathvariant="script">S</mi></mrow><mrow><mi>q</mi></mrow></msub><mo>(</
   mo><mi>t</mi><mo>)</mo><mo>=</mo><mi
   mathvariant="script">S</mi><mfenced close=")" open="("
   separators=""><mrow><mi>t</mi><mo>,</mo><msub><mrow><mi>T</mi></mrow><m
   row><mi>q</mi></mrow></msub></mrow></mfenced> :MATH]
   , representing the semantic similarity degree at which a related term
   is considered to semantically reflect in the specification of the term
   t [[89]47]. Here, ε was set to 0 for t∈T [q] or
   [MATH: <mi>t</mi><mo>∈</mo><msub><mrow><mi
   mathvariant="script">A</mi></mrow><mrow><mi>q</mi></mrow></msub> :MATH]
   , and to 0.7 for
   [MATH: <mi>t</mi><mo>∈</mo><msub><mrow><mi
   mathvariant="script">C</mi></mrow><mrow><mi>q</mi></mrow></msub> :MATH]
   where
   [MATH: <msub><mrow><mi
   mathvariant="script">A</mi></mrow><mrow><mi>q</mi></mrow></msub> :MATH]
   and
   [MATH: <msub><mrow><mi
   mathvariant="script">C</mi></mrow><mrow><mi>q</mi></mrow></msub> :MATH]
   are sets of ancestors and descendants of processes in T [q],
   respectively. The p-value of each process term was adjusted using the
   Bonferroni multiple testing correction.

Results and discussion

   In the first part of the post-GSB framework developed in this study, we
   have designed a spore count closeness scoring approach. This enables
   the identification of differentially infected plants used to ascertain
   the role of NPR1 in Arabidopsis plant defence response with bacteria
   spore count data collected from Arabidopsis Wt and npr1 mutant leaves
   48 h post Pst-DC3000 infection as described in the “[90]Methods”
   section. We then predicted the potential NPR1-based regulatory network
   based on the identified set of putative proteins, enriched biological
   processes and pathways which participate in plant defence response.

NPR1 protein is required for Arabidopsis defence response

   The heat map representation of the raw experimental spore count data
   from both data sets in Fig. [91]2 revealed that spore distribution
   patterns in the Arabidopsis Wt infected leaves are more closely similar
   (see Fig. [92]2 [93]b) in comparison to those of the mutant (Fig. [94]2
   [95]a). This is in agreement with previous results which showed a
   significant difference in spore numbers between the two phenotypes
   (data not shown) [[96]32]. In a study of Wt and npr1 mutant plants by
   Cao et al. [[97]56], it was observed that infection spread event to
   uninfected parts of the leaf is quite rapid due to its high level of
   susceptibility. To eliminate the possibility that the presence of
   highly infected plants in the mutant population is contributing
   significantly to the observed differential response found in this
   study, we statistically tested the difference between the two groups of
   plants on specific sets of plants from these two groups. These sets
   contains only differentially or moderately infected plants in order to
   keep out the highly infected plants within the two phenotype groups
   using the closeness scoring approach. Outputs generated 13 moderately
   infected npr1 mutant plants, and 14 moderately infected Arabidopsis Wt
   plants which were used for further downstream analysis (Fig. [98]2,
   plants marked with an asterisk). Note that these identified sets of
   differentially infected plants for both npr1 mutant and wild-type
   plants are statistically significant with p-values < 2e−16.

   The graphical representation (Q-Q plots) of the above Wt plant sub-data
   indicated that this sub-data was not normally distributed (see
   Additional file [99]2: Figure S1) with the Shapiro-Wilk test p–value of
   0.000152≤0.05, thereby leading to a two-step statistical analysis
   process. Firstly, we tested for statistical differences using the
   non-parametric Kruskal-Wallis rank sum test. Results from this test
   indicated that within phenotypes, there was no significance difference
   (p–value =0.70>0.05), but, between the two phenotypes, there was a
   significant difference in the number of bacteria spores (p–values
   =4.551e−12≤0.05). Noting that non-parametric tests have weaker
   statistical power in comparison to parametric tests, and to further
   eliminate the possibility of introducing a type II error [[100]57], we
   applied a parametric approach to increase the robustness of our
   analysis.

   In order to bring different data subsets into agreement with the
   normality and homogeneity of variance (equal population variance)
   assumptions, these data subsets were log transformed (see Additional
   file [101]2: Figure S2) and scaled using their standard deviations.
   Note that the choice of log10 transformation was guided by the
   Fisher-Pearson skewness coefficient [[102]37, [103]38] of the Wt
   dataset with a value of 1.46524, suggesting that its distribution was
   highly skewed (substantially positive skewness) [[104]39]. The
   Shapiro-Wilk normality test and the Bartlett variance homogeneity
   p–values were greater than 0.1169 and 0.98 for both Wt and npr1 mutant
   data subsets. Results from our ANOVA using this log transformed data
   subsets still confirmed that there was no significance difference
   within data subsets (p–value =0.481), but there existed a significant
   difference in spore numbers between both phenotypes (p–value < 2e−16)
   with the measure of association ω=0.98893. This indicates that 99% of
   the variability between Wt and npr1 mutant samples may be attributed to
   whether the NPR1 protein is functionally active or not, providing
   evidence that NPR1 plays a critical role in plant defence response.

   In summary, we applied the Pearson χ ^2-based plant closeness scoring
   scheme in order to extract specific sets of plants in populations under
   consideration to be used for further statistical inferences. These
   experimental data subsets showed evidence that a functional NPR1 is
   required to limit bacteria growth in Arabidopsis plants. This finding
   correlates with results from previous studies that have equally
   demonstrated that the presence of a functional Arabidopsis NPR1 is
   important in limiting pathogen growth in infected plants [[105]56,
   [106]58, [107]59], playing a key role in the SAR signalling pathway
   [[108]23, [109]58, [110]60], the broad-spectrum type of response
   providing defence against secondary pathogens after a primary attack.
   This Pearson χ ^2-based plant closeness scoring approach enables the
   use of more relevant data subsets in further analyses, thus limiting
   the effect of the eventual uncertainty of these datasets and potential
   noise inherent in the biological experiment used. This provides an
   increased robustness of statistical analyses and stronger validation of
   inferences from experimental datasets.

Elucidating potential NPR1-based enriched process regulatory network

   Similar to other organisms, plants combat the adverse effects of
   stressors by activating or deactivating various processes and pathways.
   We therefore applied a semantic similarity based computational model to
   gene products that are functionally similar to NPR1 and identify
   enriched plant defence biological processes using protein GO annotation
   mapping from the GOA-UniProtKB dataset [[111]42, [112]44] with the
   reviewed proteins in the Arabidopsis proteome retrieved from the
   UniProt database [[113]61]. Furthermore, we used the Arabidopsis KEGG
   pathway dataset to retrieve enriched pathways participating to the
   NPR1-based regulation of Arabidopsis defence response and elucidated
   the GO-based regulatory network triggered by NPR1.

Identification of NPR1-associated plant defence proteins

   In general, defence mechanism is driven by several gene products that
   act dependently or independently. In order to identify NPR1-associated
   plant defence proteins, we computed the functional similarity scores
   between NPR1 and other 20 545 annotated proteins [[114]62] out of 51874
   found in the complete list of proteins in the Arabidopsis proteome. In
   order to avoid over-estimating functional similarity scores between
   proteins [[115]52], which is induced by the redundant processes from
   parent-child relations from the GO structure [[116]47], we used the set
   of filtered non-redundant processes in which proteins are involved.
   Results indicated that 30 proteins shared high functional similarity to
   the Arabidopsis NPR1 (Table [117]1) of which, 25 were reviewed or
   manually curated (in Table [118]1; protein with status ✓). A
   hierarchical clustering representation of these 25 annotated proteins
   together with the Arabidopsis NPR1 protein, which shows functional
   relatedness between these 26 reviewed proteins, produced two main
   clusters (Fig. [119]3) with Cluster 1 comprising of two sub clusters
   (1A and 1B). Sub-Cluster 1A contains proteins closest to NPR1 (at the
   maximum distance of approximately 0.245) and distant at 0.275 from
   Sub-Cluster 1B, while proteins in Cluster 2 were the farthest subgroup
   being distant at 0.375 maximum from Cluster 1. Proteins in the
   sub-cluster 1A comprised of the NPR1/NIM-1 interacting protein
   (NIMIN-1), transcription factor proteins (TGA3, WRKY70), Resistance to
   P. syringae 2 (RPS2), two Enhanced Diseases Susceptibility proteins
   (EDS1, PAD4) and Mitogen-activated protein kinase kinase 4 (MKK4).
   Those in the sub-cluster 1B comprised of WRKY40, AT1G74360, XBAT34, the
   putative calmodulin-like protein 47 (CML47), AT2G23680, the Yellow
   Leaf-Specific protein 9 (YLS9), Constitutively Activated Cell Death 1
   (CAD1), PUB23, CML40, and ATL2. Proteins in cluster 2 comprised of
   Mitogen-activated Protein Kinase 11 (MPK11), Phosphoinositide 4-Kinase
   Gamma 4 (PI4KG4), PUB2, two basic Leu zipper (BZIP) transcription
   factors (BZIP60 and bZIP28), Molybdopterin biosynthesis protein (CNX1),
   Calcium-dependent Protein Kinase 9 (CPK9), and Oxysterol-binding
   Protein-Related Protein 1C (ORP1C2).

Table 1.

   Proteins highly similar to NPR1 or Non-inducible immunity protein 1
   (NIM1)
   Entry name Gene names Status Protein name
   [120]Q9LY00 WRKY70 or At3g56400 ✓ Probable WRKY transcription factor 70
   (WRKY DNA-binding protein 70)
   [121]Q9LMM5 MPK11 or At1g01560 ✓ Mitogen-activated protein kinase 11
   (AtMPK11) (MAP kinase 11)
   [122]O64834 At2g23680 or F26B6.33 ✓ Cold-regulated 413 plasma membrane
   protein 3 (AtCOR413-PM3)
   [123]Q9ZPY9 PI4KG4 or PI4KGamma4 ✓ Phosphatidylinositol 4-kinase gamma
   4 or At2g46500 (AtPI4Kgamma4) (PI4K gamma 4)
   [124]Q39054 CNX1 or At5g20990 or F22D1.6 ✓ Molybdopterin biosynthesis
   protein CNX1
   [125]Q5XEZ8 PUB2 or At5g67340 ✓ U-box domain-containing protein 2
   [126]Q9SJ52 YLS9 or NHL10 or At2g35980 ✓ Protein YLS9 (Protein
   NDR1/HIN1-LIKE 10)
   [127]Q9SGI8 CML40 or At3g01830 ✓ Probable calcium-binding protein CML40
   [128]Q9FNZ5 NIMIN-1 or At1g02450 ✓ Protein NIM1-Interacting 1 (NIMIN-1)
   [129]Q38868 CPK9 or At3g20410 ✓ Calcium-dependent protein kinase 9
   Q9C7SO BZIP60 or At1g42990 ✓ bZIP transcription factor 60 (AtbZIP60)
   [130]O80397 MKK4 or At1g51660 ✓ Mitogen-activated protein kinase kinase
   4 (AtMKK4)
   [131]Q9L9T5 ATL2 or At3g16720 ✓ RING-H2 finger protein ATL2 (Protein
   Arabidopsis Toxicos En Levadura 2)
   [132]Q9S745 PAD4 or EDS9 or At3g52430 ✓ Lipase-like PAD4 (Protein
   Enhanced Disease Susceptibility 9)
   [133]Q42484 RPS2 or At4g26090 ✓ Disease resistance protein RPS2
   (Resistance to P-syringae protein 2)
   [134]Q9SN89 CML47 or At3g47480 ✓ Probable calcium-binding protein CML47
   [135]Q9SU72 EDS1 or EDS1A or At3g48090 ✓ Protein EDS1 (Enhanced Disease
   Susceptibility 1)
   [136]Q8L751 ORP1C or At4g08180 ✓ Oxysterol-binding protein-related
   protein 1C
   [137]Q39234 TGA3 or BZIP22 or At1g22070 ✓ Transcription factor TGA3
   (AtbZIP22)
   [138]Q9SAH7 WRKY40 or At1g80840 ✓ Probable WRKY transcription factor 40
   (WRKY DNA-binding protein 40)
   COLGJ1 At1g74360 or F1M20.4 ✓ Threonine-protein kinase
   [139]Q84TG3 PUB23 or At2g35930 ✓ Plant U-box protein 23
   Q9FPHO XBAT34 or At4g14365 ✓ Putative E3 ubiquitin-protein ligase
   XBAT34
   [140]Q9SG86 BZIP28 or At3g10800 ✓ bZIP transcription factor 28
   (AtbZIP28)
   [141]Q9C7N2 CAD1 or At1g29690 ✓ MACPF domain-containing protein CAD1
   (Protein Constitutively Activated Cell Death 1)
   [142]Q9C9H6 RLP11 or At1g71390 ✗ Putative disease resistance protein
   (Receptor like protein 11)
   [143]Q9SJQ8 At2g36470 ✗ Expressed protein (Uncharacterized protein)
   [144]Q8RXN8 At5g42050 ✗ Development and Cell Death (DCD)
   [145]Q8GYH5 ANK or At5g54610 ✗ Ankyrin repeat family protein
   [146]Q9LN03 At1g08050 ✗ C3HC4-type RING finger-containing protein
   [147]Open in a new tab

   The gene names are given in the second column whilst the full protein
   name as downloaded from the UniProt database [[148]61] is given in the
   fourth column. A gene can have more than one name, but throughout this
   study we use the first names of each gene. The third column indicates
   whether the protein has been reviewed (i.e., manually curated: marked
   ✓) or not (✗)

Fig. 3.

   Fig. 3
   [149]Open in a new tab

   Functional similarity between NPR1 and other associated proteins. A
   dendrogram showing the relatedness of 26 reviewed proteins in
   Arabidopsis thaliana proteome. This was obtained by applying
   hierarchical agglomerative clustering algorithm on the set of annotated
   proteins related to NPR1 using a distance retrieved from functional
   similarity scores and represented on the horizontal axis

   An in-depth literature review demonstrated that these functionally
   related proteins could contribute in the same pathway as NPR1 or
   independently to either positively or negatively drive defence response
   to pathogens and elicitors. For instance, calcium signalling is known
   to be associated to the innate response, HR and SAR, acting upstream of
   the pathogen-induced SA-signalling pathway to favor the activation of
   NPR1 and associated downstream activities, such as TGA binding, to
   inhibit pathogen growth [[150]63–[151]65]. The calcium signalling
   triggers the activation of a set of Mitogen protein kinase (MPK)
   contributing to the plant innate immunity related cell death [[152]66]
   and the MAMP defence [[153]67], which is also an innate-form of defence
   response, and to other processes that stimulate ROS production during
   plant pathogen attack [[154]68]. In this study, two MPK proteins,
   namely MPK11 (Cluster 2) and MKK4 (Cluster 1A) which is an important
   mediator of plant response to osmotic stress [[155]69], were found to
   be closely related to NPR1. Proteins belonging to the EF hand super
   family have been shown to harbour motifs that are important for calcium
   sensing [[156]70, [157]71]. Typical examples of proteins within this
   family are the calmodulin (CaM)-like proteins (CML) and Calcium protein
   kinase (CPK) [[158]70, [159]72, [160]73] whose members: CML47, CML40,
   CPK9 identified in this study belonged to Clusters 1A and 2. Moreover,
   two other proteins identified, NDR1 (YLS9) in Cluster 1B and EDS1 in
   Cluster 1A, are known to mediate the Effector Triggered Immunity
   through specific R genes [[161]74, [162]75]. As illustration, EDS1 was
   found to be a regulatory gene controlling down-stream defence gene
   expression in Arabidopsis [[163]76], targeting R genes with TIR-NB-LRR
   motifs required for the recognition of Pst-DC3000 avrRps4 proteins
   [[164]75, [165]77, [166]78]. R genes, such as RPS2 (Cluster 1A), which
   recognizes Pst-DC3000 avrRpt2, are targeted by NDR1 and harbour
   LZ-NBS-LRR motifs [[167]74, [168]75, [169]77, [170]78]. YLS9, an
   NDR1/HIN1-like 10 (NHL-10) identified in our study has been implicated
   in HR and is inducible by several CPKs [[171]79], thereby suggesting
   its action downstream of the calcium signalling pathway.

   As pointed out previously, proteins functionally related to NPR1 may
   contribute to the plant defence mechanism degradation. For example,
   CAD1 (Cluster 1A) has been implicated in the repression of the
   programmed cell death (PCD) pathway mediated by SA and this negative
   function of CAD1 is inhibited under stress conditions to favor the PCD
   pathway [[172]80]. However, even under undisturbed conditions, plant
   proteins undergo proteolysis as a strategy to effectively turnover its
   protein reservoir and to ensure for efficient functioning of biological
   processes through an ubiquitination process [[173]81, [174]82]. The
   NPR1 protein, for instance, which can switch between two conformational
   states (oligomeric and monomeric) or two cell compartments (cytosol and
   nucleus), is a typical example of a protein whose nuclei-localized
   monomers undergo proteolysis through a CUL-3-based E3 ligase
   ubiquitination process to ensure for the effective expression of its
   target genes [[175]59, [176]83, [177]84]. Our analysis identified five
   putative ubiquitin-like proteins - XBA34, PUB23, PUB2, ATL2, PI4KG4
   [[178]85–[179]87]. Although their direct role in NPR1 ubiquitination
   has not been reported, their transcript levels, especially those of
   ATL2 and PI4KG4, have been shown to increase following exposure of
   plants to chitinases [[180]88] and treatment with SA [[181]89],
   respectively. The role of these proteins during drought stress has
   equally been demonstrated [[182]85]; and proteins, such as XBAT34 have
   been shown to be induced by 6-BA (6-benzylaminopurine) and SA
   [[183]90]. Additionally, XBAT34 contains two Ankyrin repeats, which are
   also core motifs found in NPR1 protein and are important in mediating
   interactions with other proteins [[184]58, [185]90–[186]92]. The
   importance of the ankyrin repeats within the NPR1 sequence is elegantly
   demonstrated in the inability of npr1 mutants to activate PR genes
   following pathogen attack due to a distortion in this region [[187]58,
   [188]93]. A distorted ankyrin repeat within the NPR1 protein hampers
   its ability to bind to transcription factors belonging to the basic
   leucine zipper (BZIP) family such as TGA3 (Cluster 1A) for a positive
   defence outcome [[189]92, [190]94–[191]97].

   Two other BZIPs (BZIP60 and BZIP28) were identified in this study and
   have been implicated in Unfolded Protein Response (UPR) during stress
   response [[192]98–[193]100]. The ability of the BZIP domain to
   translocate to the nucleus during stress could be a key contribution to
   its role during stress responses. In fact, although we did not find a
   direct relation between BZIP28 and BZIP60 to NPR1 in literature,
   existing evidence shows that the inositol-requiring protein-1a
   (IRE1a)/BZIP60 branch of the UPR pathway have a role in PR protein
   secretion following treatment with SA [[194]99]. In addition to the
   BZIP transcription factor (TGA3) identified here, proteins belonging to
   the WRKY family of transcriptional factors [[195]101] – WRKY70
   (Fig. [196]3; Cluster 1A) and WRKY40 (Fig. [197]3; Cluster 2) were
   equally identified. Unlike TGA3, no direct interaction between NPR1 and
   WRKY40 was found during our literature search. However, similar to NPR1
   monomers, evidence for the nuclei localization of WRKY40 does exist
   [[198]102]. Its role in stress response has also been demonstrated and,
   its ability to interact with other WRKY proteins has been shown
   [[199]102, [200]103]. For instance, co-expression of WRKY18 and WRKY40
   led to increased susceptible of plants to P. syringae and Botrytis
   cinerea correlating with reduced PR gene expression [[201]102]. WRKY70
   on the other hand is known to positively mediate pathogen defence
   response in an SA-dependent fashion which is independent of the NPR1
   pathway [[202]104]. PAD4, which is another protein identified (Cluster
   1A), can act dependently (e.g., for SA-dependent defence response) or
   independently (e.g., for basal immunity [[203]105]) of NPR1 for PR-1
   gene expression [[204]106, [205]107]. NPR1 also has the ability to
   interact with non-transcription factors, such as NIMIN-1, identified in
   this study (Cluster 1A) and experimentally verified to bind to NPR1
   [[206]108, [207]109] to modulate PR-1 gene expressions during SAR.

Retrieving enriched processes and pathways mediating the Arabidopsis thaliana
plant defence

   A total of 114 biological processes were found to be involved in the
   Arabidopsis thaliana protection and defence mechanisms, among which 74
   were found to be non-redundant considering the feature of the GO
   structure [[208]47]. 21 processes were identified to be statistically
   significant in plant defence mechanisms and Table [209]2 provides a
   summary of the type of response and biological activities in which
   these enriched processes are involved. These processes are related to
   both biotic and abiotic triggered responses, leading to the initiation
   of major pathways involved in the innate response (GO:0010200), HR
   (GO:0009626) and SAR (GO:0009862) in plants [[210]2–[211]7]. In
   general, these processes are triggered on the basis of the (pathogen)
   attacks classified using the mode of plant-pathogen interactions. For
   example, a pathogen may survive by using the living plant cells (for
   biotrophic pathogens) or by killing plant cells and feeding on these
   dead cells (for necrotrophic pathogens) [[212]110]. Interestingly, the
   enrichment pathway analysis reveals two enriched KEGG pathways
   participating to NPR1-associated plant defence response, namely
   Plant-pathogen interaction
   ([213]http://www.genome.jp/kegg-bin/show_pathway?ath04626) and Plant
   hormone signal transduction
   ([214]http://www.genome.jp/kegg-bin/show_pathway?ath04075) with
   p-values of 0.013085 and 0.02070, respectively. These pathways show
   evidence of regulating a wide variety of enriched biological processes,
   including programmed cell death (PCD) and defence, HR, ABA and SA
   dependent responses.

Table 2.

   Statistically significant or enriched biological processes protecting
   the Arabidopsis thaliana plant
   GO ID Process name L p-value ap-value Process functions Stressor Tps RF
   GO:0010200 Response to chitin 5 1.40e−11 1.04e−09 Defence against
   chitin bearing pathogens such as fungi, exoskeletons of animals and
   nematodes. Biotic Biotrophics ✓
   GO:0000165 MAPK cascade 9 7.69e−12 5.69e−10 The process involves
   programmed cell death as a way of responding to biotic and abiotic
   factors. Biotic Biotrophics ✓
   GO:0006612 Protein targeting to membrane 8 0.0 0.0 Directs the proteins
   to the cell membrane for further transportation and communication of
   chemicals. Biotic & abiotic - ✗
   GO:0001666 Response to hypoxia 5 8.12e−09 6.01e−07 The response is
   triggered when the oxygen goes below, 20.8−20.95%. Abiotic - ✗
   GO:0009611 Response to wounding 3 8.61e−06 6.37e−04 A biological
   mechanism that takes place indicating the damage of an organism. Biotic
   & abiotic Biotrophics ✓
   GO:0009738 Abscisic acid (ABA)-activated signalling pathway 7 0.0 0.0
   Controls environmental stress for example, drought and plant pathogens
   [[215]115]. Abiotic & biotic Necrotrophics ✓
   GO:0009409 Response to cold 4 0.0 0.0 Reacts to changes in low
   temperatures. Abiotic - ✗
   GO:0061025 Membrane fusion 4 0.0 0.0 Ensures life continuity in
   organisms. Biotic & abiotic Necrotrophic & biotrophics ✓
   GO:0009626 Plant-type hyper-sensitive response 6 5.29e−12 3.91e−10
   Killing of cells surrounding the infected area (programmed cell death)
   to starve pathogens like biotrophics Biotic Biotrophics ✓
   GO:0010310 Regulation of hydrogen peroxide metabolic process 6 3.51e−11
   2.60e−09 Controls plant stress and programmed cell death. It is also a
   go-between the wounding responses and biotic interactions. Biotic &
   abiotic Necrotrophics ✓
   GO:0009697 Salicylic Acid biosynthetic process 9 2.72e−11 2.01e−09
   Forms salicylic acid which is a fungicide present in some plants.
   Biotic Biotrophics ✓
   GO:0002679 Respiratory burst involved in defense response 4 2.86e−11
   2.12e−09 Ensures life continuity in plants when the defence response
   consumption rate increases for instance, the presence of phagocytic
   leukocytes. Biotic Necrotrophics & biotrophics ✓
   GO:0035304 Regulation of protein dephosphorylation 8 0.0 0.0 Controls
   the removal rate of phosphorous, suppresses pathogenesis related
   expression and promotes programmed cell death. Biotic Biotrophics ✓
   GO:0031348 Negative regulation of defence response 6 0.0 0.0 Controls
   the frequency of the defence response. Biotic Necrotrophics ✓
   GO:0010363 Regulation of plant-type hyper-sensitive response 7 1.01e−11
   7.51e−10 Balances the hyper-sensitive response. Biotic Biotrophs ✓
   GO:0009862 Systemic acquired resistance, salicylic acid mediated
   signalling pathway 8 4.99e−12 3.70e−10 Salicylic acid plays a vital
   role in the expression of pathogenesis related proteins. Biotic
   Biotrophics ✓
   GO:0043069 Negative regulation of programmed cell death 6 0.0 0.0 Stops
   the occurrence of programmed cell death. Biotic Necrotrophics ✓
   GO:0050832 Defence response to fungus 6 2.41e−11 1.78e−09 Protects the
   cell plants from fungi. Biotic Necrotrophics & biotrophics ✓
   GO:0009410 Response to xenobiotic stimulus 3 4.06e−06 3.00e−04 Takes
   place when the system identifies foreign compounds in the organism.
   Biotic Necrotrophics & biotrophics ✓
   GO:0010112 Regulation of systemic acquired resistance 7 1.70e−04 0.0126
   Balancing of systemic acquired resistance. Biotic Biotrophics ✓
   GO:0016045 Detection of bacterium 6 4.11e−11 3.04e−09 Converts the
   signal of presence of bacteria into a molecular signal. Biotic
   Necrotrophics & biotrophic ✓
   [216]Open in a new tab

   For each process identified, following information is provided: Level
   (L) in the GO directed acyclic graph structure, p-value and adjusted
   p-value (ap-value), functional activities triggered, nature of stress
   thwarted, type of pathogens suppressed (Tps) and whether it contributes
   to response to fungi (RF) or not

   NPR1-associated plant defence proteins share six processes, namely MAPK
   cascade (GO:0000165), SAR, SA mediated signalling pathway (GO:0009862),
   Regulation of plant-type hyper-sensitive response (GO:0010363), Protein
   targeting to membrane(GO:0006612), Defence response to fungus
   (GO:0050832) and Negative regulation of defence response (GO:0031348).
   The number of processes in which each protein is contributing is shown
   in Fig. [217]4 [218]b and [219]a, highlighting different
   protein-process associations. Interestingly, NPR1 is involved in all 21
   enriched processes identified, suggesting that NPR1 is a potential
   regulator of the plant ubiquitin proteolytic system during infection,
   controlling associated protein expression levels and functioning at the
   cross-roads of several signalling pathways, as well as modulating
   antagonistic cross-talk between different signalling pathways [[220]83,
   [221]97] critical to the plant defence-mediated response. This suggests
   that the success of the plant defence mechanism is a result of these
   different concerted and controlled signal transduction events that
   sequentially occur in the plant system, leading to an increased
   resistance to diverse attacks. As a consequence, the failure of these
   processes to function properly may yield increased susceptibility to
   the pathogen, which possibly negotiates its entry into the cell by
   activating processes to thwart plant defence mechanisms. Thus, we built
   a potential enriched process-based regulatory network using the
   following binary relation rule: Given two processes s and t, the
   process t is triggered after the process s, which is indicated in the
   regulatory network by an arrow from s to t, if A [t] ⊂ A [s] and
   [MATH:
   <msub><mrow><mi>A</mi></mrow><mrow><mi>s</mi></mrow></msub><mo>⊇</mo><m
   i mathvariant="script">P.</mi> :MATH]
   Under the assumption that a process is triggered after another as a
   result of specific degradation of a single or a subset of proteins, and
   where A [x] is the set of proteins participating to the process x and
   [MATH: <mi mathvariant="script">P</mi> :MATH]
   is the set of proteins in at least one of the enriched pathways
   identified.

Fig. 4.

   Fig. 4
   [222]Open in a new tab

   Summary results of the biological mechanism of the NPR1-based defence
   response. a plotted using the PINV tool [[223]113] is a protein-process
   map showing NPR1-associated plant defence proteins and enriched
   processes in which they are involved, and the number of enriched
   processes in which each protein is involved is shown in (b). Note that
   processes in which all proteins are involved are not displayed in (a)
   to avoid that high number of links limits visibility. c plotted using
   the Cytoscape tool [[224]114] displays NPR1-based process regulatory
   network, showing possible process occurrence sequences and (d) shows
   the number of proteins involved in or frequency of each enriched
   process. The suffix +, ∗, ×, or − were added to enriched process
   according to the fact all NPR1-associated plant defence proteins are
   involved in the process (+), or set of proteins involved in the process
   contains all NPR1-associated plant defence proteins are implicated in
   the Plant-pathogen interaction (×) or Plant hormone signal transduction
   (−) or both (∗) pathways

   Figures [225]4 [226]c and [227]d show a potential enriched
   process-based regulatory network constructed and process frequencies
   (number of proteins annotated by each process), respectively. This
   regulatory network reveals, for example, that there are only five
   possible processes that may be triggered following the occurrence of
   the negative regulation of PCD (GO:0043069), namely: Response to
   xenobiotic stimulus (GO:0009410), Membrane fusion (GO:0061025),
   Plant-type hyper-sensitive response (HR) (GO:0009626), Abscisic acid
   (ABA)-activated signalling pathway (GO:0009738) and Response to cold
   (GO:0009409) processes. Most of these processes are expected to occur
   after the occurrence of the negative regulation of PCD, which could be
   triggered after the occurrence of processes, such as Negative
   regulation of defence response (GO:0031348). This is consistent as the
   system may trigger mechanisms susceptible to promote or normalize the
   plant defence response by coordinating different actions in order to
   offer appropriate response to and to contain the infection. For
   example, HR and ABA-activated signalling pathway may relaunch new PCD
   by killing the infected cells at the point of invasion [[228]2, [229]6,
   [230]7, [231]111]. This regulatory network also reveals that the
   occurrence of the negative regulation of PCD can result from the
   dysfunction of critical processes, such as MAPK cascade (GO:0000165),
   Protein targeting to membrane (GO:0006612) and regulation to hydrogen
   peroxide (GO:0010310), which are vital for the regulation of a wide
   variety cellular activities, including transportation and communication
   of chemical materials, proliferation, differentiation, apoptosis and
   stress response [[232]112].

Conclusions

   This study proposes a novel bioinformatics protocol that can be readily
   applied to plant phenotype based gene knock-out or -down experimental
   dataset to assess the contribution of the gene under consideration to
   plant-defence mechanisms and to build a potential enriched biological
   process-based regulatory network using publicly available biological
   data. This protocol enables the extraction of the specific
   (differentially infected) samples of the experimental dataset to
   eliminate potential artefacts, such as uncertainty of the data set and
   potential noise inherent in the experiment, which could mask downstream
   analysis. We applied the protocol to the phenotype dataset expressed in
   terms of leaf spore counts of 60 npr1 gene mutant and wild-type
   Arabidopsis thaliana plants following Pst-DC3000 infection. Results
   obtained still showed that NPR1 plays an important role during plant
   pathogen response by suppressing the growth of the pathogen. More
   specifically, we found that differences in the plant susceptibility to
   the Pst-DC3000 infection depend on whether the NPR1 protein is
   functionally active or not with about 99% of variability in Pseudomonas
   spore growth between npr1 mutant and Wt plant samples. The increased
   disease susceptibility phenotypes observed in npr1 mutant plants
   following to Pst-DC3000 infection may be due to the fact that the SAR
   signalling pathway, which is essential for plant defence response, is
   blocked in these plants as a result of knocking out or down the protein
   NPR1, leading to the attenuation of PR gene expressions. This SAR
   signal requires a functionally active NPR1, which is a key regulatory
   protein of SAR, acting as a controller and modulator of PR gene
   expressions.

   With the increasing number of publicly available biological data which
   already demonstrates the complexity of the plant defence network, new
   computational approaches are essential to provide an easy-to-analyse
   picture which can lead to the identification of gaps and opportunities
   for the development of future ‘wet or dry lab’ experiments. The
   proposed protocol uses the GO-universal metric based Gene Ontology
   semantic similarity model to identify putative proteins collaborating
   with NPR1 in regulating plant defence mediated response and retrieve
   enriched biological processes involved. We identified 26
   NPR1-associated plant defence proteins and 21 highly specific
   processes, which related to the major forms of defence processes
   reported (innate response, HR, SAR, etc.), and construct the potential
   process-based regulatory network, predicting occurrence sequences of
   different processes. This NPR1-based enriched process regulatory
   network (Fig. [233]4 [234]c) can effectively reveal eventual sequences
   of biological processes occurring during the plant defence response. As
   illustration, it suggests that the suppression of the SAR signalling
   pathway (GO:0009862) may thwart processes that enable the detection of
   the pathogen (GO:0016045), control plant stress, promote PCD and
   attenuate PR gene expressions (e.g., GO:0010310, GO:0035304) to favor
   processes leading to the negative regulation of PCD (GO:0043069) in
   order to trigger processes that possibly ensure pathogen life
   continuity within plant cells. Different results obtained demonstrate
   that this novel protocol is effective for assessing plant-defence genes
   and may help better understand the plant defence mechanisms, an
   important aspect in the biology of plants.

Additional files

   [235]Additional file 1^ (10KB, xls)

   Experimental data from Arabidopsis wild type and npr1 mutant plants.
   Leaf bacteria spore count (phenotype) dataset of 31 Arabidopsis wild
   type (control) and 29 npr1 mutant (case) plants 48 h post Pst-DC3000
   infection. (XLS 10 kb)
   [236]Additional file 2^ (536.2KB, pdf)

   Checking normality assumption using Q-Q plots. Checking normality
   assumption for initial and transformed differentially infected wild and
   npr1 mutant plant spore count datasets using Q-Q plots. (PDF 536 kb)

Acknowledgements