Abstract

   Time-course experiments using parallel sequencers have the potential to
   uncover gradual changes in cells over time that cannot be observed in a
   two-point comparison. An essential step in time-series data analysis is
   the identification of temporal differentially expressed genes (TEGs)
   under two conditions (e.g. control versus case). Model-based
   approaches, which are typical TEG detection methods, often set one
   parameter (e.g. degree or degree of freedom) for one dataset. This
   approach risks modeling of linearly increasing genes with higher-order
   functions, or fitting of cyclic gene expression with linear functions,
   thereby leading to false positives/negatives. Here, we present a
   Jonckheere–Terpstra–Kendall (JTK)-based non-parametric algorithm for
   TEG detection. Benchmarks, using simulation data, show that the
   JTK-based approach outperforms existing methods, especially in long
   time-series experiments. Additionally, application of JTK in the
   analysis of time-series RNA-seq data from seven tissue types, across
   developmental stages in mouse and rat, suggested that the wave pattern
   contributes to the TEG identification of JTK, not the difference in
   expression levels. This result suggests that JTK is a suitable
   algorithm when focusing on expression patterns over time rather than
   expression levels, such as comparisons between different species. These
   results show that JTK is an excellent candidate for TEG detection.

INTRODUCTION

   Time-course experiments using a parallel sequencer or mass spectrometry
   capture dynamic changes during the development or perturbation of a
   cellular system over time ([26]1,[27]2). Although specific issues in
   biological data, such as low sampling frequency, exist ([28]3), recent
   advancements in modern high-throughput techniques have enabled
   description of the regulatory molecular circuits that drive
   differentiation processes and adaptation to the environment in greater
   detail ([29]4,[30]5).

   One of the major steps in analyzing time-series omics data is the
   identification of genes that are differentially expressed between two
   groups (e.g. wild-type versus knockout strain) on a time axis ([31]6).
   We defined differentially expressed genes over time as temporal
   differentially expressed genes (TEGs). While algorithms for general
   differentially expressed gene analysis ([32]7,[33]8) and algorithms for
   interpretation of time-series data in the field of circadian rhythms
   have been relatively well studied ([34]9,[35]10), the golden standard
   for TEG analysis has not been established.

   Many tools have been implemented in the detection of TEGs in
   time-course experiments ([36]11). MaSigPro ([37]12,[38]13) performs
   polynomial regressions to model time-course expression values, and a
   log likelihood ratio test to detect TEGs. MaSigPro consists of two
   steps; in the first step, dynamic (non-flat) genes are selected, and in
   the second step, the best model is sought, and the P-value is
   calculated using user-specified parameters. SplineTimeR (splineTC)
   ([39]14) was originally developed for the construction of gene networks
   and for provision of pathway-enrichment analysis and visualization
   functions. SplineTC fits natural cubic spline curves to time-course
   data and applies empirical Bayes moderate F-statics between two groups.
   ImpulseDE2 ([40]15) fits the impulse model ([41]16,[42]17) to
   time-course data, and the null model is represented by a common impulse
   model; therefore, the alternative model is represented by a different
   impulse model. LimoRhyde is designed to detect differential rhythmicity
   and differential expression using cosinor regression ([43]18). These
   methods have shown high accuracy in comprehensive comparative studies
   of Spies et al., except for the recently published LimpRhyde ([44]11).
   These studies are considered as good benchmarks for TEG detection
   algorithms.

   Previous approaches for TEG analysis have mainly been performed by
   fitting regression models to two groups and by assessing whether the
   models are statistically consistent by using a hypothesis test.
   However, experimental data contain genes and proteins that differ in
   complexity, which poses the risk of modeling linearly increasing genes
   with higher-order functions or fitting cyclically varying genes with
   linear functions, thereby leading to false positives or false
   negatives. To tackle this issue, we need to explore the parameters to
   model each gene or develop a model-free and non-parametric approach.

   Here, we propose a Jonckheere–Terpstra–Kendall (JTK)-based
   non-parametric TEG detection algorithm to characterize time-course
   experiments. The Jonckheere–Terpstra test ([45]19,[46]20) is a
   non-parametric test for the detection of ordering patterns between two
   measured quantities, with the correlation coefficient, Kendall’s τ,
   measuring the ordinal association between two groups. In circadian
   rhythm studies, the JTK algorithm, which combines these two statistical
   methods, has been widely used to detect oscillating molecules in omics
   datasets ([47]9,[48]21). We expanded this application of the JTK
   algorithm to TEG detection. To the best of our knowledge, JTK is the
   only non-parametric TEG detection method. Our study demonstrates the
   novelty of a powerful non-parametric approach in the exploration of
   differentially expressed genes in time-series omics datasets.

MATERIALS AND METHODS

Design

   The schematic diagram of JTK and the definition of TEG in each method
   are shown in Figure [49]1. JTK calculates Kendall’s τ between two
   groups (e.g. control and case). For two time-series with lengths n, the
   replicates were averaged, x = (x[1], x[2], ..., x[n]) and y = (y[1],
   y[2], ..., y[n]), we define:
   graphic file with name M8.gif (1)
   graphic file with name M9.gif (2)

   The numerator in Equation ([50]1) indicates the number of pairs with
   consistent behavior between the two time-series, minus the number with
   inconsistent behavior. All possible pairs, and not just neighboring
   points, are included in the calculation of τ. The denominator indicates
   the total number of pairs in the datasets, and τ must be in the range
   from −1 to 1. Two time-series are perfectly correlated if τ = 1, and
   are perfectly anti-correlated if τ = −1, or uncorrelated if τ = 0. Null
   distributions for each gene were predicted by permutation. The number
   of Kendall’s τ obtained by permutation, which is smaller than the τ
   obtained from the actual order, was counted and divided by the number
   of permutations (1000 times) to be considered as the P-value.

Figure 1.

   Figure 1.
   [51]Open in a new tab

   (A) An illustration of the JTK algorithm. Solid arrows pointing upward,
   downward and sideways indicate an increase, decrease and constant,
   respectively, between two points. JTK compares the increase/decrease
   pattern of two sequences (e.g. control versus case). The relationship
   of all combinations between the points is summed up, and normalized by
   the number of combinations that are within the range of −1 to 1. The
   two time-series are perfectly correlated if τ = 1, are perfectly
   anti-correlated if τ = −1 and are uncorrelated if τ = 0. (B) The
   definition of the TEG in each method. (*1) MaSigPro performs a TEG test
   for non-flat genes only. (*2) SplineTC offers a choice to include or
   exclude the intercept in the test.

Data simulation

   To assess the performance of each method, we generated synthetic data
   that mimicked read count data of RNA-seq data. The time-course of each
   gene was simulated using the following four functions based on the work
   of Wang et al. ([52]22):
   graphic file with name M10.gif (3)
   graphic file with name M11.gif (4)
   graphic file with name M12.gif (5)
   graphic file with name M13.gif (6)

   where, t is the time point, and α and β set the amplitude and
   intercept, respectively. The α and β followed a power-law distribution,
   and were generated by the rplcon function (n = 1, min = 30, α = 2.5) of
   the powerRaw package in R. The values obtained by these functions
   (average expression values) were converted to read counts by a negative
   binomial (NB) distribution.
   graphic file with name M14.gif (7)

   where ϕ indicates the magnitude of the noise, and in this study, the
   noise level refers to ϕ. TEGs, which had different time-series
   expression patterns between controls and cases, were generated from
   different functions for controls and cases, while non-TEGs were
   generated from the same function. The synthetic dataset for Figures
   [53]2 and [54]3 contains 200 time-series gene expression data conducted
   for each condition. The data in Figure [55]4 contains 100 time-series
   gene expression data for each condition. For the gene-labeled non-TEG,
   the time-series data for the control and cases were generated using the
   same function, and for the gene-labeled TEG, were generated using the
   two specified functions. The combination of functions in the labeled
   genes is constant and is shown in the figure.

Figure 2.

   [56]Figure 2.
   [57]Open in a new tab

   Example plots and ROC curves of simulated data. (A) Simulated data for
   different sequence lengths and (B) number of points. In the case of
   sine curves, the number of peaks increases as the sequence length
   increases. As the number of time points increases, the number of peaks
   remains the same, but the sampling intervals become shorter. The box on
   the left indicates typical examples of non-TEG genes, and the box on
   the right indicates typical examples of TEG genes. The noise level is
   0.05, and the error bars represent the standard deviation (n = 3). The
   top boxes represent examples of sequence length or time points of 8,
   while bottom boxes represent examples of 9. Comparison of method
   performance with ROC curves for different (C) sequence lengths and (D)
   number of time points. Datasets in A were used to generate the ROC
   curves in (C), and datasets in (B) corresponded to (D). The numbers at
   the top indicate the length of the sequence or the number of time
   points, and the numbers on the right indicate the noise level. The
   degree and degree of freedom are shown using the color legend. As the
   degree and degree of freedom should be less than a time point, only
   limited conditions are illustrated in the boxes with sequence length or
   time points of 4.

Figure 3.

   Figure 3.
   [58]Open in a new tab

   Comparison of method performance with ROC curves for different number
   of replicates with 16 time points. Numbers at the top indicate the
   number of replicates, while numbers on the right indicate the noise
   level. Since ImpulseDE2 requires multiple replicates to predict the NB
   distribution, the condition of n = 1 is not shown.

Figure 4.

   [59]Figure 4.
   [60]Open in a new tab

   Preferences of each method and function. Functions 1–4 correspond to